How to debug a HardFault on an ARM Cortex-M MCU | Interrupt

Faults happen on embedded devices all the time for a variety of reasons – ranging from something as simple as a NULL pointer dereference to something more unexpected like running a faulty code path only when in a zero-g environment on the Tower of Terror in Disneyland. It’s important for any embedded engineer to understand how to debug and resolve this class of issue quickly.


This is a companion discussion topic for the original entry at https://interrupt.memfault.com/blog/cortex-m-fault-debug

There seem to be pictures in the article with URLs that do not work (https://interrupt.memfault.comimg/cortex-m-fault/ufsr.png for example)

Thanks for the report! I do see there’s an issue in the link you posted (missing / between com & img) but I can’t find a reference to that link anywhere in the article. The url looks correct in the section it is referenced.

That’s how the article looks to me:

The issue should now be fixed. Thanks for the report!

__attribute__((optimize("O0")))
void my_fault_handler_c(sContextStateFrame *frame) {
  // If and only if a debugger is attached, execute a breakpoint
  // instruction so we can take a look at what triggered the fault
  HALT_IF_DEBUGGING();

In this code section I would like to understand how the frame parameter would get optimized out if we didn’t use optimization level O0.

Hi @Kongu,

Good observation! You are right that in this exact scenario it should not be necessary and could be removed.

If the function was marked static or link time optimization (LTO) was being used during compilation, the compiler could optimize away passing the argument in the first place since it’s not actually referenced in the function. If the HALT_IF_DEBUGGING() macro was commented out, the compiler may optimize out the function call altogether since nothing was happening in it.

You can use compiler explorer to see examples of these types of optimizations here.
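As a minimal sketch of the kind of snippet one could paste into Compiler Explorer to see this (the names `frame_t`, `handler_unused_arg`, and `caller` are hypothetical and not taken from the article): the function below is `static` and never reads its argument, so with optimization enabled the compiler may skip passing the argument or inline the call away entirely. Whether it actually does so depends on the compiler and flags.

```c
#include <stdint.h>

// Hypothetical stand-in for the article's sContextStateFrame.
typedef struct {
  uint32_t r0, r1, r2, r3, r12, lr, return_address, xpsr;
} frame_t;

// volatile so the increment below is an observable side effect.
static volatile int g_handler_calls = 0;

// 'frame' is never read. Because the function is static, the compiler can see
// every call site and is free to drop the argument or inline the body when
// optimizations are enabled.
static void handler_unused_arg(frame_t *frame) {
  (void)frame;
  g_handler_calls++;
}

// Returns the number of times the handler has run so far.
int caller(void) {
  frame_t frame = {0};
  handler_unused_arg(&frame);
  return g_handler_calls;
}
```

Comparing the generated assembly at `-O0` and `-O2` should show whether the address of `frame` is still loaded into the first argument register before the call.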

Thank you Chris.
With the help of the tool now I understand how the optimization takes place.
Nice article BTW.

Hi! Thanks for this amazing high quality resource. I have struggled with debugging an imprecise fault and this was very helpful. I have a few questions:

  • The tower of terror bug is interesting although not uncommon, what I thought was cool is all those crash logs you got. How I wish I had such logs to go through on a failed device from the field. Can you give us some insight into how such logging can be provisioned?
  • How many times have you found yourself needing/using ETM tracing, since those analyzers tend to be very expensive. Any interesting use cases?

Hey @rookie, welcome to Interrupt!

This probably isn’t the answer you’re looking for, but in-the-field crash logging is what @chrisc, @tyler, and I offer at Memfault. Shoot us a note if you’d like to learn more: hello@memfault.com

Note that you do not need to shell out for a proprietary analyzer. There are several open source tools that support ETM. Check out Sigrok if you need a tool, and OpenCSD if you need a library.

Hang tight, we’ll be writing more about ETM/ITM/DWT. There are some great use cases for it!


Hi @chrisc,
if I’m not mistaken there is a typo in the example with examining BusFault sub-register
print/x *(uint8_t *)0xE002ED29

shouldn’t it be

print/x *(uint8_t *)0xE000ED29 ?

Also, couldn’t help but wanted to mention, that your blog is a great source of knowledge. Thanks!
Deni

Hi @dnsglk, that is correct, great catch! Thanks for pointing it out, I’ll fix that up in the article!
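For context on why 0xE000ED29 is the right address: the Configurable Fault Status Register (CFSR) lives at 0xE000ED28, and the BusFault Status Register (BFSR) is its second byte. Here is a small sketch of decoding it in C; the helper names `bfsr_from_cfsr` and `is_imprecise_bus_fault` are hypothetical, not from the article.

```c
#include <stdbool.h>
#include <stdint.h>

// Architecturally defined address of the Configurable Fault Status Register.
#define CFSR_ADDR 0xE000ED28u
// BFSR occupies bits [15:8] of CFSR, i.e. the byte at CFSR_ADDR + 1.
#define BFSR_ADDR (CFSR_ADDR + 1u)

// BFSR bit positions (ARMv7-M).
#define BFSR_IBUSERR     (1u << 0)
#define BFSR_PRECISERR   (1u << 1)
#define BFSR_IMPRECISERR (1u << 2)
#define BFSR_UNSTKERR    (1u << 3)
#define BFSR_STKERR      (1u << 4)
#define BFSR_BFARVALID   (1u << 7)

// Hypothetical helper: extract the BFSR byte from a full 32-bit CFSR value.
static inline uint8_t bfsr_from_cfsr(uint32_t cfsr) {
  return (uint8_t)((cfsr >> 8) & 0xFFu);
}

// Hypothetical helper: true if the CFSR value reports an imprecise bus fault.
static inline bool is_imprecise_bus_fault(uint32_t cfsr) {
  return (bfsr_from_cfsr(cfsr) & BFSR_IMPRECISERR) != 0;
}
```

On target, the CFSR value would come from reading `*(volatile uint32_t *)CFSR_ADDR`; the helpers only operate on a value that has already been read, so they can also be used on a register dump captured from a crash log.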

Also, couldn’t help but wanted to mention, that your blog is a great source of knowledge. Thanks!

Thanks, glad to hear you are enjoying!

@chrisc First, thanks for the great blog. I am having trouble understanding the assembly code in HARDFAULT_HANDLING_ASM. I get that you first check which stack pointer was active by using the ‘tst’ instruction to AND ‘lr’ with 0b100 (#4), which isolates bit 2. After that, ‘ite’ is used for a comparison, but I am not sure against which value; maybe the condition flags? I also didn’t understand the next two instructions, mrseq and mrsne. Can you please explain them?

Also, I was thinking: if we are working on bare metal, what should the return address be during recovery? If we return to any line of main(), there is a possibility that the ARM registers (R0-R12) are holding incorrect values, which could lead to unexpected behavior of the system. Correct?
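On the first question, a rough C model of what that instruction sequence does may help. This is only a sketch: `select_frame_pointer` is a hypothetical name, and it assumes the shim is `tst lr, #4` followed by `ite eq`, `mrseq r0, msp`, `mrsne r0, psp`. The `tst` sets the Z flag when bit 2 of LR (the EXC_RETURN value) is zero, `ite eq` makes the next instruction run only if Z is set and the one after it only if Z is clear, and the two `mrs` variants copy MSP or PSP into r0 accordingly.

```c
#include <stdint.h>

// Bit 2 of EXC_RETURN (the value in LR on exception entry):
//   0 -> the exception frame was pushed on the main stack (MSP)
//   1 -> the exception frame was pushed on the process stack (PSP)
#define EXC_RETURN_SPSEL_MASK 0x4u

// Hypothetical C model of:
//   tst   lr, #4    ; Z flag = ((lr & 4) == 0)
//   ite   eq        ; "if-then-else" on the EQ condition (Z set)
//   mrseq r0, msp   ; runs only if Z is set:   r0 = MSP
//   mrsne r0, psp   ; runs only if Z is clear: r0 = PSP
static uint32_t select_frame_pointer(uint32_t exc_return, uint32_t msp,
                                     uint32_t psp) {
  return ((exc_return & EXC_RETURN_SPSEL_MASK) == 0) ? msp : psp;
}
```

For example, an EXC_RETURN of 0xFFFFFFF9 selects the MSP value and 0xFFFFFFFD selects the PSP value. Either way, r0 ends up pointing at the stacked register frame, which is why it can be passed straight to the C handler as its first argument.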

Great rundown, as always, fellas. I have a question about jumping to the C fault handler after identifying which SP was in use at the time of the HardFault, based on some things Joseph Yiu says in “The Definitive Guide to the M0/M0+”. Isn’t it true that if the HardFault was the result of a stack overflow, then the SP (and/or the memory it points to) might not be valid? In that case, wouldn’t it be problematic to jump to a C function, since that function might use the stack? Or maybe a better question is: why don’t I see this sort of SP checking in the assembly language portion of the HardFault handler? None of the ones I’ve seen (yours, Segger’s, mbed’s) check for that, but Joseph Yiu makes a point of it in his book. Thanks for any help you can provide!

hi @nathancharlesjones, glad you enjoyed the article and great question!

You are right that it is possible one can fault from the C handler if the stack is already outside of RAM bounds.

Note that typically a separate stack (process / “psp” stack) is used for thread mode vs. when running from an exception in handler mode (main / “msp” stack). The “msp” stack is typically placed at the end of RAM, which makes the chances of it running off RAM unlikely but not impossible, of course. It’s certainly possible that another area of RAM (e.g. data, bss, or a heap) gets trampled if the handler uses too much stack, but usually the system will be rebooted at the end of the handler, so it will get back to a good state anyway if that type of corruption does occur.

By using separate stacks, if the process stack overflows in normal operation, the HardFault handler will still be able to run since it’s on a different stack. If you are using the MPU as a stack guard, you can also leave the HFNMIENA bit clear to prevent faulting in that way as well.

Worst case, if you do stack overflow or fault from the HardFault_Handler, the Cortex-M MCU will enter the “Lockup” state. The behavior in this state is technically left up to the MCU implementer so it’s worth looking it up in the datasheet for the MCU but generally if the “Lockup” condition is encountered, the MCU will automatically reset itself when not attached to a debugger.

All that being said, if you have extra RAM available and want to guarantee the fault handler can always execute, a strategy I’ve always been quite fond of is having a dedicated pre-allocated area of SRAM to set the stack pointer to before invoking the fault handler. You can find an example of that strategy in this handler implementation.
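To make the idea concrete, here is a minimal sketch of reserving such an area. It is not the linked implementation: `s_fault_stack`, `fault_stack_top`, and the 512-byte size are assumptions for illustration, and `__attribute__((aligned(8)))` is a GCC/Clang extension.

```c
#include <stdint.h>

// Size is an assumption; choose it from the fault handler's measured stack use.
#define FAULT_STACK_SIZE_BYTES 512u

// Region reserved for the fault handler only. The AAPCS expects the stack
// pointer to be 8-byte aligned at function call boundaries.
static uint8_t s_fault_stack[FAULT_STACK_SIZE_BYTES]
    __attribute__((aligned(8)));

// Cortex-M stacks are full-descending, so the initial stack pointer is the
// address one past the end of the reserved region.
static uintptr_t fault_stack_top(void) {
  return (uintptr_t)(s_fault_stack + FAULT_STACK_SIZE_BYTES);
}

// On target, the assembly shim would first capture the faulting frame pointer
// in r0 and only then switch stacks before branching to C, roughly:
//   ldr r1, =<address of the top of s_fault_stack>
//   msr msp, r1
//   b   my_fault_handler_c
```

The ordering matters: the original MSP or PSP value has to be saved before the stack pointer is overwritten, otherwise the handler loses the pointer to the stacked registers it is meant to inspect.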

Hi,

Where are you placing that ASM shim? If I use extern linkage to that function immediately in the C fault handler then I can make the handler void, but I need to account for an offset in the stack since the compiler will try and stack the LR,R7 when branching.

Maybe there is something I’m missing, I guess it’s possible if using __asm() but sometimes that’s not possible and the offset just comes with that limitation.

Hi,

Thanks for the code!
It was helpful for my STM32H7 project. The designers made the pretty nasty decision that every ECC flash fault causes a HardFault exception. This is very annoying because the processor has two separate flash banks, and even if you use the second one just for data (running code from the first one), you cannot safely read the second bank without the possibility of triggering an ECC-related HardFault if a previous write or erase was interrupted by a power failure.

Simply excellent resource! Saved in my personal list of best articles <3

I have done some work in this area too, and pushed it to github: