Using Asserts in Embedded Systems | Interrupt

The use of asserts is one of the best ways to find bugs, unintended behavior, programmatic errors, and to catch when systems are no longer 100% functional and need to be reset to recover. If instrumented correctly, an assert can give a developer context about when and where in the code an issue took place. Despite the numerous benefits, the practice of using asserts in firmware is not common or agreed upon.


This is a companion discussion topic for the original entry at https://interrupt.memfault.com/blog/asserts-in-embedded-systems

I made some similar observations some years ago (especially related to C++) and came to the conclusion that using assert() can come with very little overhead if done right (as you do, by just using the PC): http://robitzki.de/blog/Assert_Hash

Yes! I loved your article. I stumbled across it while researching for this blog post, and I especially loved the code size table (I hope you don’t mind I was inspired by it).

I did come across another post, https://barrgroup.com/Embedded-Systems/How-To/Define-Assert-Macro, which goes to great lengths to make an even smaller assert, encoding the file, task, line number, and version all into a single 32-bit address. I felt it was overkill and the tradeoffs weren’t worth it, especially if a system already had a bit of logging or coredumps set up.

With the strategy of hashing the file and line number, did you ever run into collisions?

With the strategy of hashing the file and line number, did you ever run into collisions?

I’ve actually never used it. Meanwhile, I came to the conclusion that just using the program counter is sufficient to identify the code location. In case of a hardware exception, the PC is all you have anyway, so using only the PC unifies the crash handling code. Of course, you have to have the binaries and you have to know what version of your firmware failed, but I think those are requirements that are not that hard to fulfill.

I’ve actually never used it.

Cool, got it. Awesome that the optimal solution worked for you!

that just using the program counter is sufficient to identify the code location

It’s enough to know where it crashed, but having another frame before or multiple with coredumps is even better.

you have to know, what version of your firmware failed

We’ve used GNU Build ID to mark our builds in the past, stored the builds in a simple key-value style blob storage, and were able to retrieve them on demand when we needed to. Quite simple, and far better than trying to mark builds by semantic versioning and build flags.
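As a rough sketch of how firmware can read its own build ID at runtime: the linker (when invoked with `--build-id`) emits an ELF note containing the ID, and the firmware can parse that note. The function name `build_id_from_note` is made up for this example, and how the note's address is obtained (typically a linker-script symbol) is left out because it is project specific.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NT_GNU_BUILD_ID 3u

/* Header of an ELF note, as found at the start of .note.gnu.build-id. */
struct elf_note_header {
    uint32_t namesz; /* length of the owner name, including the NUL */
    uint32_t descsz; /* length of the descriptor (the build ID bytes) */
    uint32_t type;   /* NT_GNU_BUILD_ID for a build ID note */
};

/*
 * Returns a pointer to the build ID bytes inside `note` and stores their
 * length in *len_out. Returns NULL if the buffer does not hold a valid
 * GNU build ID note. (Hypothetical helper, not part of any library.)
 */
const uint8_t *build_id_from_note(const uint8_t *note, size_t note_len,
                                  size_t *len_out)
{
    struct elf_note_header hdr;
    const size_t name_len = 4; /* "GNU\0", already 4-byte aligned */

    if (note == NULL || len_out == NULL ||
        note_len < sizeof(hdr) + name_len) {
        return NULL;
    }
    memcpy(&hdr, note, sizeof(hdr)); /* avoids unaligned loads */

    if (hdr.type != NT_GNU_BUILD_ID || hdr.namesz != name_len) {
        return NULL;
    }
    if (memcmp(note + sizeof(hdr), "GNU", name_len) != 0) {
        return NULL;
    }
    if (hdr.descsz > note_len - sizeof(hdr) - name_len) {
        return NULL; /* descriptor would run past the buffer */
    }

    *len_out = hdr.descsz;
    return note + sizeof(hdr) + name_len;
}
```

The returned bytes can be stored alongside a crash record, so the matching ELF file can be looked up later.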

Making the Most of Asserts

I record the backtrace and the precise program counter. That’s it. Nothing more. OK. Also a time stamp can be useful and maybe a couple of uintptr_t words the programmer can add to help debug.

Care needs to be taken, as the optimizer will sweep all common code into one call to the assert utility, and then you don’t know which of several asserts in a function fired! We ended up going with a gcc asm one-liner to get the precise PC.
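A minimal sketch of what such an assert could look like, assuming GCC or Clang. The PC is captured at the macro expansion site, together with the caller's return address and two optional debug words. All names here (`ASSERT_WITH_WORDS`, `assert_record_failure`, `g_last_assert`) are invented for the example, and a real handler would halt or reset instead of returning. It is still worth checking the disassembly to confirm the optimizer has kept the call sites apart.

```c
#include <stdint.h>

/* One record per failed assert: where it fired plus two debug words. */
struct assert_record {
    uintptr_t pc;       /* address of the failing assert */
    uintptr_t lr;       /* return address of the enclosing function */
    uintptr_t extra[2]; /* programmer-supplied context */
};

struct assert_record g_last_assert; /* would live in retained RAM on a device */

#if defined(__arm__)
/* On 32-bit ARM, read the program counter right where the macro expands. */
#define GET_PC(_a) __asm volatile("mov %0, pc" : "=r"(_a))
#else
/* Elsewhere, the return address of a non-inlined helper is the call site. */
__attribute__((noinline)) uintptr_t current_pc(void)
{
    return (uintptr_t)__builtin_return_address(0);
}
#define GET_PC(_a) ((_a) = current_pc())
#endif

void assert_record_failure(uintptr_t pc, uintptr_t lr,
                           uintptr_t a, uintptr_t b)
{
    g_last_assert.pc = pc;
    g_last_assert.lr = lr;
    g_last_assert.extra[0] = a;
    g_last_assert.extra[1] = b;
    /* A real handler would now halt or reset; this sketch only records. */
}

#define ASSERT_WITH_WORDS(expr, a, b)                                  \
    do {                                                               \
        if (!(expr)) {                                                 \
            uintptr_t pc_;                                             \
            GET_PC(pc_);                                               \
            assert_record_failure(                                     \
                pc_, (uintptr_t)__builtin_return_address(0),           \
                (uintptr_t)(a), (uintptr_t)(b));                       \
        }                                                              \
    } while (0)

/* Two asserts in one function: each records its own PC. */
int add_non_negative(int x, int y)
{
    ASSERT_WITH_WORDS(x >= 0, x, 0);
    ASSERT_WITH_WORDS(y >= 0, y, 1);
    return x + y;
}
```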

https://www.gnu.org/software/libc/manual/html_node/Backtraces.html

My biggest problem with code that checks for malloc returning null and attempting to handle it…

…usually it is untested, buggy, and somewhere along the line uses malloc to do its job! (Guess what lurks in the depths of a printf?)

The next problem is on a system with swap… these days your system is effectively dead/totally dysfunctional long before malloc returns NULL!

The lightweight IP stack uses pool allocators with quite small pools for resources that may have (potentially malicious) spikes in usage. But then you will find all over it the attitude “this is an IP packet, something’s wrong / I can’t handle it / I don’t know enough / I don’t have enough resources / … I’ll just drop the packet. If it matters, the higher layers will retry.”

Another good pattern is to malloc everything you need for this configuration at initialization time … at least then you know then and there that that configuration will work… if you can’t, you reboot to a safe configuration.
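A small sketch of that pattern. To keep the example deterministic, a fixed arena with a bump allocator stands in for malloc, and falling back to `SAFE_CONFIG` stands in for the reboot into a safe configuration. Every name and size here is made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE 1024u

static _Alignas(8) uint8_t s_arena[ARENA_SIZE];
static size_t s_arena_used;

/* Stand-in for malloc: bump allocation from a fixed arena, NULL when full. */
static void *arena_alloc(size_t size)
{
    size_t aligned = (size + 7u) & ~(size_t)7u;
    if (aligned < size || aligned > ARENA_SIZE - s_arena_used) {
        return NULL;
    }
    void *p = &s_arena[s_arena_used];
    s_arena_used += aligned;
    return p;
}

struct session { uint32_t id; uint32_t state; };

struct config {
    size_t rx_buf_size;
    size_t tx_buf_size;
    size_t num_sessions;
};

struct resources {
    uint8_t *rx;
    uint8_t *tx;
    struct session *sessions;
};

/* Allocates everything the configuration needs, all at once. */
static bool resources_init(const struct config *cfg, struct resources *res)
{
    if (cfg->num_sessions > ARENA_SIZE / sizeof(struct session)) {
        return false; /* also guards the multiplication below */
    }
    s_arena_used = 0; /* start from a clean arena for each attempt */
    res->rx = arena_alloc(cfg->rx_buf_size);
    res->tx = arena_alloc(cfg->tx_buf_size);
    res->sessions = arena_alloc(cfg->num_sessions * sizeof(struct session));
    return res->rx != NULL && res->tx != NULL && res->sessions != NULL;
}

static const struct config SAFE_CONFIG = { 64, 64, 1 };

/*
 * Returns true if the requested configuration was applied. Otherwise the
 * safe configuration is applied instead and false is returned.
 */
bool apply_config(const struct config *requested, struct resources *res)
{
    if (resources_init(requested, res)) {
        return true;
    }
    (void)resources_init(&SAFE_CONFIG, res); /* known to fit in the arena */
    return false;
}
```

After `apply_config` returns, no further allocation happens, so there is no out-of-memory path left to handle at runtime.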

When Not to Assert

Never assert on invalid input from users or external untrusted systems. If you do, you open yourself to denial of service attacks (and pissed off users).
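For example, a parser for untrusted packets can return an error code for anything malformed and keep assert for its own internal invariants. The packet layout (`[type][length][payload...]`) and all names below are assumptions made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PAYLOAD 32u

enum parse_status {
    PARSE_OK = 0,
    PARSE_ERR_TOO_SHORT,
    PARSE_ERR_BAD_LENGTH
};

struct message {
    uint8_t type;
    uint8_t length;
    const uint8_t *payload;
};

enum parse_status parse_message(const uint8_t *buf, size_t buf_len,
                                struct message *out)
{
    /* Programmer error: our own code passed a bad pointer. Assert. */
    assert(buf != NULL && out != NULL);

    /* Untrusted data: never assert, report an error instead. */
    if (buf_len < 2) {
        return PARSE_ERR_TOO_SHORT;
    }
    if (buf[1] > MAX_PAYLOAD || (size_t)buf[1] > buf_len - 2) {
        return PARSE_ERR_BAD_LENGTH;
    }

    out->type = buf[0];
    out->length = buf[1];
    out->payload = &buf[2];
    return PARSE_OK;
}
```

A malformed packet then costs the sender a dropped message rather than costing every user a reboot.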

Design by Contract

Please read and understand https://en.wikipedia.org/wiki/Design_by_contract

I regard DbC as one of the most important concepts in producing correct software, and it has a lot to say about asserts.
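In C, a contract usually ends up as asserts at the top and bottom of a function. Here is a minimal sketch with hypothetical `REQUIRE`/`ENSURE` macros layered on the standard assert; the ring buffer is only there to have something to put a contract on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REQUIRE(cond) assert(cond) /* precondition: the caller's obligation */
#define ENSURE(cond)  assert(cond) /* postcondition: this function's promise */

#define RING_CAPACITY 8u

struct ring {
    uint8_t data[RING_CAPACITY];
    size_t head;  /* index of the oldest element */
    size_t count; /* number of stored elements */
};

/* Contract: the caller must check for free space before pushing. */
void ring_push(struct ring *r, uint8_t value)
{
    REQUIRE(r != NULL);
    REQUIRE(r->count < RING_CAPACITY);

    size_t old_count = r->count;
    r->data[(r->head + r->count) % RING_CAPACITY] = value;
    r->count++;

    ENSURE(r->count == old_count + 1);
    (void)old_count; /* silences the unused warning when NDEBUG is set */
}
```

A failed `REQUIRE` points at a bug in the caller, a failed `ENSURE` at a bug in the function itself, which already narrows down where to look.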

Hey,
Nice article Tyler! Always clear and useful :)

Regarding the asserts during the boot up sequence, I added a flag that is cleared when the system is fully running. If an assert fails while booting, I reboot into Bootloader mode.
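A sketch of how such a flag might be wired into the assert handler. The names are invented here, and the actual jump into the bootloader is hardware specific, so the example only decides where the reset should go.

```c
#include <stdbool.h>

enum reset_target {
    RESET_TO_APPLICATION,
    RESET_TO_BOOTLOADER
};

/* Set at reset, cleared once the system is fully up and running. */
static bool s_boot_in_progress = true;

/* Call at the end of the boot sequence. */
void boot_complete(void)
{
    s_boot_in_progress = false;
}

/*
 * Called from the assert handler: an assert that fires during boot would
 * most likely fire again on the next boot, so go to the bootloader instead.
 */
enum reset_target assert_reset_target(void)
{
    return s_boot_in_progress ? RESET_TO_BOOTLOADER : RESET_TO_APPLICATION;
}
```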

Thanks for the note Cyril, it’s a smart way to make sure you don’t shoot yourself in the foot. We’ll add a note about it on the post (cc @tyler).

Awesome article! I implemented something similar, but instead of bkpt I used a while (1) loop. Is there any functional difference between the two approaches? Also, how does bkpt work when you don’t have a debugger connected (i.e. the device is being used by the client)? Would bkpt get ignored when there’s no debugger connected?

Welcome @Thekenu! I’m glad you enjoyed the article.

The use of bkpt in the article was mainly so that we can easily pause and then step through using the debugger when going through examples. If you want devices to pause execution and wait for a hardware watchdog or further user action, I’d continue using while (1). A while (1) trap is actually the default behavior of the fault or assert handlers on many systems, for better or worse. Just make sure to have the hardware watchdog set up (at least on customer devices) if this is the case.

To answer your other questions:

Is there any functional difference between the two approaches?

Yes, there is. bkpt will pause execution if a debugger is attached, and otherwise trigger a HardFault or go into LOCKUP, depending on how the system is configured.

Would bkpt get ignored when there’s no debugger connected?

It will not be ignored. It will trigger the HardFault or go into LOCKUP. You’ll have to explicitly check whether a debugger is attached if you’d like to ignore the behavior of bkpt.

Reference:

bkpt - https://developer.arm.com/docs/dui0553/b/the-cortex-m4-instruction-set/miscellaneous-instructions/bkpt
Lockup - https://developer.arm.com/docs/dui0553/a/the-cortex-m4-processor/fault-handling/lockup
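A sketch of the "explicitly check whether a debugger is attached" idea, assuming a Cortex-M3/M4/M7, where bit 0 (C_DEBUGEN) of the DHCSR register at 0xE000EDF0 indicates that halting debug is enabled. The function names are made up, and on any other target the check simply reports "not attached" so the file still builds on a host.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__)
/* Debug Halting Control and Status Register; bit 0 is C_DEBUGEN. */
#define DHCSR (*(volatile uint32_t *)0xE000EDF0u)
static bool debugger_attached(void) { return (DHCSR & 1u) != 0; }
#define BREAKPOINT() __asm volatile("bkpt #0")
#else
/* Host or other target: nothing to read, so report "not attached". */
static bool debugger_attached(void) { return false; }
#define BREAKPOINT() ((void)0)
#endif

enum halt_action { HALT_BREAKPOINT, HALT_SPIN };

/* Break into the debugger only if one is actually there. */
enum halt_action choose_halt_action(void)
{
    return debugger_attached() ? HALT_BREAKPOINT : HALT_SPIN;
}

/* Never returns: stops in the debugger, or spins until the watchdog resets. */
void assert_halt(void)
{
    if (choose_halt_action() == HALT_BREAKPOINT) {
        BREAKPOINT();
    }
    for (;;) {
        /* wait for the hardware watchdog */
    }
}
```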

Great article!
I don’t fully understand the section “When Not to Assert”. For example, why shouldn’t asserts be used in these cases:

  • Don’t assert on operations that depend on the hardware behaving appropriately. If a sensor says it will return a value between 0-100, it’s probably best not to assert when it’s above 100, because you can never trust today’s cheap hardware.
  • Don’t assert on the contents of data read from persistent storage, unless it’s guaranteed to be valid. The data read from flash or a filesystem could be corrupted.

Can you explain a little more here?

I can definitely explain a little more

Don’t assert on operations that depend on the hardware behaving appropriately. If a sensor says it will return a value between 0-100, it’s probably best not to assert when it’s above 100, because you can never trust today’s cheap hardware.

Asserts are used primarily to validate that the code you or others on your team write is behaving correctly. Asserts should generally not be used when you are trying to validate code from other developers. You should check return values, validate them, and if they are invalid, raise errors or error codes.

This also applies to hardware. I’ve worked with vendor chips where the documentation says “THIS WILL NEVER HAPPEN”, and of course it happens every now and then. Just because a chip is misbehaving or returning invalid values doesn’t mean one should bring down the whole system, especially if the system isn’t reliant on it behaving 100% correctly (and the bug is likely in software anyway, not hardware).

If something comes back from hardware that is invalid, a soft reset of the chip or the vendor stack is generally enough, and shouldn’t require asserting.
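As a sketch, using the 0-100 sensor from the quote: an out-of-range reading is counted, the sensor is soft reset, and the caller gets an error instead of an assert. The `read_raw` and `soft_reset` hooks are hypothetical stand-ins for a vendor driver, and the fakes exist only so the example runs on a host.

```c
#include <stdbool.h>
#include <stdint.h>

#define SENSOR_MAX 100u

struct sensor {
    uint16_t (*read_raw)(void); /* hypothetical vendor read function */
    void (*soft_reset)(void);   /* hypothetical vendor reset function */
    uint32_t invalid_count;     /* worth reporting back for diagnostics */
};

/*
 * Returns true and stores the reading if it is in range. Otherwise resets
 * the sensor, leaves *value_out untouched, and returns false.
 */
bool sensor_read(struct sensor *s, uint16_t *value_out)
{
    uint16_t raw = s->read_raw();
    if (raw > SENSOR_MAX) {
        s->invalid_count++;
        s->soft_reset();
        return false;
    }
    *value_out = raw;
    return true;
}

/* Fakes standing in for a real driver. */
static uint16_t s_fake_value;
static unsigned s_fake_resets;
static uint16_t fake_read(void) { return s_fake_value; }
static void fake_reset(void) { s_fake_resets++; }
```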

Don’t assert on the contents of data read from persistent storage, unless it’s guaranteed to be valid. The data read from flash or a filesystem could be corrupted.

This is a fun one. It’s again mostly related to vendor code and hardware. When you read data from a flash chip, there is always the chance that the read subtly failed, whether it’s due to previous corruption of the flash chip, a bad filesystem, incorrect timings used, a previously failed erase, etc. You should always be able to stomach and recover from an invalid flash read or invalid flash contents.

Most systems handle this by validating the contents (not asserting!), and if invalid, erasing the flash chip or bad sectors and starting anew.
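A minimal sketch of "validate, don't assert" for a settings record read from flash. The record layout, the magic value, and the simple multiplicative checksum (standing in for a real CRC) are all assumptions made for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SETTINGS_MAGIC 0x53455431u /* arbitrary marker chosen for this sketch */

struct settings {
    uint32_t magic;
    char serial[16];
    uint32_t checksum; /* covers every byte before this field */
};

/* Simple stand-in for a CRC. */
static uint32_t checksum32(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum = sum * 31u + data[i];
    }
    return sum;
}

static uint32_t settings_checksum(const struct settings *s)
{
    return checksum32((const uint8_t *)s, offsetof(struct settings, checksum));
}

bool settings_valid(const struct settings *s)
{
    return s->magic == SETTINGS_MAGIC &&
           memchr(s->serial, '\0', sizeof(s->serial)) != NULL &&
           s->checksum == settings_checksum(s);
}

void settings_set_defaults(struct settings *s)
{
    memset(s, 0, sizeof(*s));
    s->magic = SETTINGS_MAGIC;
    strcpy(s->serial, "UNKNOWN");
    s->checksum = settings_checksum(s);
}

/*
 * `s` holds the bytes just read from flash. Returns true if they were
 * usable, false if they were replaced with defaults.
 */
bool settings_load(struct settings *s)
{
    if (settings_valid(s)) {
        return true;
    }
    settings_set_defaults(s); /* on a device: also erase and rewrite the sector */
    return false;
}
```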

A common boot loop issue occurs when a developer chooses to assert on the contents of the flash chip on boot (maybe provisioning data). When the device boots, let’s say it reads the device serial number to print to a screen. It reads that value from flash, asserts on the length, the assert fails due to corruption, and the device reboots. When the device wakes up again, it will do the exact same thing.

A common way to mitigate this is to add a boot counter to detect boot loops, as mentioned above by @cyril, but the point still stands.
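A boot counter along those lines might look like the sketch below. The state would live somewhere that survives a reset (retained RAM or a backup register), which is hardware specific, so the example just takes a pointer to it. The names and the threshold of three are made up.

```c
#include <stdbool.h>
#include <stdint.h>

#define BOOT_STATE_MAGIC 0xB007C0DEu
#define MAX_FAILED_BOOTS 3u

/* Must be stored somewhere that survives a reset. */
struct boot_state {
    uint32_t magic;        /* tells valid state apart from power-on garbage */
    uint32_t failed_boots; /* boots that never reached boot_mark_stable() */
};

/* Call early in boot. Returns true if the device should enter recovery. */
bool boot_check(struct boot_state *st)
{
    if (st->magic != BOOT_STATE_MAGIC) {
        st->magic = BOOT_STATE_MAGIC;
        st->failed_boots = 0;
    }
    st->failed_boots++;
    return st->failed_boots > MAX_FAILED_BOOTS;
}

/* Call once the system has been running normally for a while. */
void boot_mark_stable(struct boot_state *st)
{
    st->failed_boots = 0;
}
```

With this in place, the corrupted-serial-number scenario above ends in recovery mode after a few attempts instead of rebooting forever.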

Don’t assert things that aren’t 100% reliable and consistent, and if that doesn’t hold true, make sure you really want to reboot the system if the assert fails.

Thanks a lot, @tyler, for the great explanation. It’s clearer now, but I still have some questions ;)
So maybe the best idea is to split the diagnostics in the project into:

  • errors which require a reset of the device
  • errors which can be recovered from at runtime
  • errors from hardware and external chips, which require either resetting the affected chip or no longer using the faulty hardware on the PCB.
    What do you think?
    My next question: does an assert have to mean resetting the whole system? Especially in the development phase, when I want, for example, all problems printed over the UART to check the device’s stability. Is it a good idea to use asserts this way for all of the errors mentioned above?

This is equivalent to having no dynamic allocation (e.g. no heap, no malloc, etc.), and therefore only static allocation. That way you’ll know whether or not your configuration will work at compile time.

Certainly compile time allocation is better if you can do it…

I do embedded systems where the customer configures the device as desired for that customer, and then sends it out (maybe 1000 km) into the field.

The order of decreasing niceness is…

  1. Everything is guaranteed to work at compile time.
  2. Everything is guaranteed to work at link time. (We build tens of product variants from the same source).
  3. Everything is guaranteed to work when the customer presses “Go” to reconfigure (or the configuration app has, in a user-friendly manner, prevented them from creating an unworkable configuration).
  4. Everything works if it boots.
  5. You find out in the field that it doesn’t work.
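Level 1 of the list above can be made explicit with a compile-time check. This is a small sketch using C11 `static_assert` on statically allocated buffers; the RAM budget and buffer dimensions are invented numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RAM_BUDGET_BYTES    4096u /* hypothetical budget for this module */
#define MAX_CHANNELS        8u
#define SAMPLES_PER_CHANNEL 64u

struct channel {
    uint16_t samples[SAMPLES_PER_CHANNEL];
    uint32_t flags;
};

/* Everything is allocated statically, so the size is known to the compiler. */
struct channel g_channels[MAX_CHANNELS];

/* The build fails if a configuration change no longer fits the budget. */
static_assert(sizeof(g_channels) <= RAM_BUDGET_BYTES,
              "channel buffers exceed the RAM budget");

size_t channels_memory_used(void)
{
    return sizeof(g_channels);
}
```

Raising `MAX_CHANNELS` far enough turns into a compiler error rather than something discovered on the test racks.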

We try hard never to get down to level 4, but if you’re testing new versions of the software, 5. becomes…

  5. You find out after hours of testing on the racks, or worse, when you release thousands of hard-to-upgrade devices to quality-sensitive customers.

  • errors which require a reset of the device
  • errors which can be recovered from at runtime
  • errors from hardware and external chips, which require either resetting the affected chip or no longer using the faulty hardware on the PCB.
    What do you think?

That sounds ideal to me. The important bit is to make sure that enough data is collected from the system and reported back so that the issue can be fixed in firmware or an RMA can be issued (maybe even proactively!). Everyone will have a different use case, so I wouldn’t say one approach fits all use cases.

My next question: does an assert have to mean resetting the whole system? Especially in the development phase, when I want, for example, all problems printed over the UART to check the device’s stability. Is it a good idea to use asserts this way for all of the errors mentioned above?

I’ve worked at places with different philosophies here. The two approaches I’ve used are:

  1. Assert only when absolutely necessary. In cases where one would likely want to assert, log out the issues to the UART and report an analytic event back to a server or on flash. Don’t force a reset.
  2. Assert as often as possible, especially when it is a result of developer error. Capture a core dump or a stack trace, keep a running log of these resets, and report them back to a central server.
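The two approaches can be sketched as a pair of macros. The names are invented here, `printf` stands in for whatever UART logging the system has, and `abort()` stands in for capturing crash data and resetting.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint32_t s_soft_assert_count;

/* Approach 1: log the problem, count it, and keep running. */
#define SOFT_ASSERT(expr)                                              \
    do {                                                               \
        if (!(expr)) {                                                 \
            s_soft_assert_count++;                                     \
            printf("soft assert failed: %s:%d\n", __FILE__, __LINE__); \
        }                                                              \
    } while (0)

/* Approach 2: treat the failure as fatal. */
void hard_assert_failed(const char *file, int line)
{
    /* A real handler would save a core dump or stack trace first. */
    printf("assert failed: %s:%d\n", file, line);
    abort(); /* stand-in for a system reset */
}

#define HARD_ASSERT(expr)                                              \
    do {                                                               \
        if (!(expr)) {                                                 \
            hard_assert_failed(__FILE__, __LINE__);                    \
        }                                                              \
    } while (0)
```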

The first one is the easiest, but it will allow bugs to slip through and one is banking on the hope that customers won’t notice or that the system won’t get into a bad state. Using this approach, I’ve seen bugs go unreported for months and only when a customer was mad enough did we hear about the issue.

The second one requires a few things for it to work properly, but it, in my opinion, is the right way to do things. It requires:

  • Adequate time dogfooding the latest releases.
  • The ability to push updates quickly and reliably. Think nightly releases.
  • The infrastructure to collect and surface critical issues that are reported by devices, and the ability to fix bugs quickly.

We definitely had early branch cuts where the device was rebooting every 30 minutes with a particular firmware release, but we would quickly fix all the asserts, crashes, hardfaults, etc in a few days and we’d be left with UX issues to fix. The important bit about resetting on every assert is that information is recorded and reported about the crash. There is no use taking down the system if no crash data is kept around.

The primary reason for resetting the system on an assert is that the developer has declared that if the assert is hit, the system is in an undefined state. Maybe the device’s RAM was corrupted. Maybe there was an overflow. In these cases, it isn’t easy to restore the system to a working state without wiping all RAM, and in that case, you might as well reboot. Embedded systems can reboot in a matter of seconds.
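To make sure crash data survives the reboot, one option is a small record that the assert handler fills in before resetting and that the firmware picks up on the next boot. A minimal sketch with invented names: on a device `g_crash` would be placed in a RAM region that the startup code does not clear, which is toolchain specific and not shown here.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CRASH_MAGIC 0xC0FFEE01u

struct crash_record {
    uint32_t magic; /* marks the record as valid */
    uintptr_t pc;   /* where the assert fired */
    uintptr_t lr;   /* who called the failing function */
};

/* On a device: placed in retained (non-initialized) RAM. */
struct crash_record g_crash;

/* Called from the assert handler, right before resetting. */
void crash_save(uintptr_t pc, uintptr_t lr)
{
    g_crash.pc = pc;
    g_crash.lr = lr;
    g_crash.magic = CRASH_MAGIC; /* written last, once the data is in place */
}

/*
 * Called on boot. Copies out a pending record and clears it, so the same
 * crash is reported only once. Returns false if there is nothing to report.
 */
bool crash_take(struct crash_record *out)
{
    if (g_crash.magic != CRASH_MAGIC) {
        return false;
    }
    *out = g_crash;
    memset(&g_crash, 0, sizeof(g_crash));
    return true;
}
```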