Stuck Debugging a HardFault on a Cortex-M7

I am currently trying to debug a HardFault on a Cortex-M7 running FreeRTOS. I have been using your great blog post and webinar on this subject and have learnt a lot from them, so thank you very much for that. Although they have been really helpful and allowed me to improve my processes, I have not been able to make much progress in finding the cause of my HardFault.

Context:
I have a monitoring system that uses asynchronous I2C comms. It was running stably until I changed the optimization level from none to ‘optimize for debugging’, and now it triggers a HardFault. There are a couple of things I can do to stop the HardFault from occurring, such as carrying out the I2C transactions sequentially or putting a delay between the read_start and read_end calls. Also, if I simplify the logic for sensor selection, the HardFault takes a lot longer to occur.

When I was using the debugger, I found that putting breakpoints in certain places or stepping through the code in certain ways would stop the HardFault from being triggered at the expected point and push it back to occur later, once the program was left to run freely.

When using the methods described in the blog post and webinar, I found that the state of the fault registers and the backtrace are not consistent each time I trigger the fault, and the backtrace contains functions that should not be present.

Questions:
Because of these inconsistencies, and because I’m not very experienced with debugging programs at such a low level, I am not sure how to proceed from here. Is there anything that can be deduced from the behaviour I have seen so far that may help me make progress, or are there any other methods I can try?

Thanks for providing this platform and the great content.

It sounds like you have one of the following:

  1. A race condition
  2. Memory corruption

These are tricky issues to debug because the bug is triggered silently, and the HardFault isn’t raised until later on in unrelated code.

I would check the following:

  • Has the stack overflowed? Check that your stack pointer is in bounds, and enable stack overflow protection if you have not already done so.
  • Are non-IRQ-safe APIs being called in an interrupt context? FreeRTOS has a config flag you can toggle to catch those instances (if using FreeRTOS).
  • Are pointers being used after they’ve been freed? Some heap implementations have debug toggles you can enable to catch this.
  • Are you forgetting to lock a mutex somewhere? Every function that expects a mutex to be held should assert that it is.
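
For the first two checks, a sketch of the relevant FreeRTOSConfig.h settings might look like the fragment below. The option names are standard FreeRTOS ones, but the values and the body of configASSERT() are assumptions about a typical setup, not something taken from this project. It also assumes that configASSERT() is the flag meant above: as far as I know, defining it on the Cortex-M ports enables the kernel's internal checks, including some that catch API calls made from an interrupt context where they are not allowed.

```c
/* FreeRTOSConfig.h (fragment, target build only) */

/* 2 = when a task is switched out, check that the known fill pattern at
 * the end of its stack is still intact. With any non-zero value the
 * application must provide vApplicationStackOverflowHook(). */
#define configCHECK_FOR_STACK_OVERFLOW        2

/* Makes uxTaskGetStackHighWaterMark() available. */
#define INCLUDE_uxTaskGetStackHighWaterMark   1

/* Halt on a failed kernel assertion so the debugger stops near the
 * offending call. The body is an assumption; many projects record the
 * file and line instead of spinning. */
#define configASSERT(x)                 \
    do {                                \
        if ((x) == 0) {                 \
            taskDISABLE_INTERRUPTS();   \
            for (;;) { }                \
        }                               \
    } while (0)
```

With this in place, a breakpoint inside the assert loop gives a backtrace at the moment the misuse is detected, which is usually much closer to the cause than the eventual HardFault.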

These are just a few ideas.
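
The last check in the list above (asserting that a mutex is held) can be sketched as follows. This is a minimal single-threaded model that runs on a host PC, not FreeRTOS code, and every name in it (OwnedLock, lock_acquire, lock_release, lock_held_by, i2c_bus_write_locked) is hypothetical. On FreeRTOS the same assertion could probably be written with xSemaphoreGetMutexHolder() and xTaskGetCurrentTaskHandle(), assuming INCLUDE_xSemaphoreGetMutexHolder is enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock that remembers which task currently owns it. */
typedef struct {
    const void *owner;   /* NULL while the lock is free */
} OwnedLock;

static void lock_acquire(OwnedLock *lock, const void *task)
{
    assert(lock->owner == NULL);   /* model only: no blocking, no recursion */
    lock->owner = task;
}

static void lock_release(OwnedLock *lock, const void *task)
{
    assert(lock->owner == task);   /* only the owner may release */
    lock->owner = NULL;
}

static bool lock_held_by(const OwnedLock *lock, const void *task)
{
    return lock->owner == task;
}

/* Hypothetical driver function that must only run with the bus lock held.
 * The assertion turns a silent race into an immediate, local failure. */
static int i2c_bus_write_locked(const OwnedLock *bus_lock, const void *caller,
                                int value)
{
    assert(lock_held_by(bus_lock, caller));
    return value;   /* stand-in for the real register access */
}
```

The point of the pattern is that a caller who forgets the lock fails at the function entry, every time, instead of corrupting shared state only when the timing happens to line up.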

Hi Francois,

Thanks for the reply.

I have used the overflow hooks and high water mark tools that FreeRTOS provides and they have not shown any issues. I may double check the hooks again though and prove they are working.
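
One reason to double check is that fill-pattern checks and high water marks only see bytes that were actually written. The sketch below is a host-side model of that idea, not FreeRTOS code: it assumes a stack pre-filled with 0xA5 (the fill value FreeRTOS uses) and a neighbouring variable placed just below it, and all names (mem_init, stack_unused_bytes, push_frame) are hypothetical. A function that reserves a large frame but only writes to its far end can step over the untouched pattern, so the measured free space stays the same while memory below the stack is corrupted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FILL_BYTE 0xA5u   /* pattern the stack is pre-filled with */

/* Model memory layout: [neighbour variable][task stack], low to high. */
enum { NEIGHBOUR_SIZE = 16, STACK_SIZE = 64,
       MEM_SIZE = NEIGHBOUR_SIZE + STACK_SIZE };

static void mem_init(uint8_t *mem)
{
    memset(mem, 0, NEIGHBOUR_SIZE);                       /* neighbour = 0 */
    memset(mem + NEIGHBOUR_SIZE, FILL_BYTE, STACK_SIZE);  /* fresh stack   */
}

/* Count untouched fill bytes from the low end of a descending stack.
 * This is the idea behind a stack high water mark. */
static size_t stack_unused_bytes(const uint8_t *stack_low, size_t size)
{
    size_t n = 0;
    while (n < size && stack_low[n] == FILL_BYTE) {
        n++;
    }
    return n;
}

/* Simulate a function call: move the stack pointer down by `frame` bytes
 * and write only the lowest `touched` bytes of that frame. */
static size_t push_frame(uint8_t *mem, size_t sp, size_t frame, size_t touched)
{
    assert(frame <= sp && touched <= frame);
    size_t new_sp = sp - frame;
    memset(mem + new_sp, 0x11, touched);
    return new_sp;
}
```

In this model, pushing a fully written 40-byte frame leaves 24 unused bytes. Pushing a further 32-byte frame that writes only its lowest 8 bytes moves the stack pointer below the stack and overwrites the neighbour, yet the unused count still reads 24. On a real target, a check made at context switch time can miss this in the same way if the deep call has already returned.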

I will double check the other suggestions also but I am pretty certain that these things are not occurring.

Another thing I thought of trying is turning optimization off for certain functions until the HardFault stops, seeing as the program seemed stable with no optimizations.
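
A minimal sketch of that bisection, assuming the toolchain is GCC (which the ‘optimize for debugging’ setting, -Og, suggests but does not prove). The optimize attribute and the push_options/pop_options pragmas are real GCC features that GCC itself documents as intended for debugging only. The NO_OPT macro and the two functions are hypothetical stand-ins, not code from the project.

```c
#include <stdint.h>

/* GCC only: compile a single function at -O0. Other compilers get an
 * empty macro, so the code still builds there with normal optimization. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_OPT __attribute__((optimize("O0")))
#else
#define NO_OPT
#endif

/* Hypothetical stand-in for the sensor selection logic (round-robin).
 * `count` must be non-zero. */
NO_OPT uint8_t select_next_sensor(uint8_t current, uint8_t count)
{
    return (uint8_t)((current + 1u) % count);
}

/* Alternative: switch optimization off for a whole region of a file. */
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC push_options
#pragma GCC optimize ("O0")
#endif

/* Hypothetical helper; built at -O0 on GCC because of the pragma above. */
uint32_t sum_readings(const uint16_t *readings, uint32_t n)
{
    uint32_t total = 0;
    for (uint32_t i = 0; i < n; i++) {
        total += readings[i];
    }
    return total;
}

#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC pop_options
#endif
```

Bear in mind that de-optimizing a function also changes its timing and stack usage, so a fault that disappears narrows down where to look but does not prove that the function itself is at fault.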

P.S. Is this the correct section of the forum to post this in?

  • Do you have the chance to try it on a second device to rule out a hardware error?
  • Could it be an alignment problem in dynamically allocated memory?
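
The alignment question can be checked with a small helper such as the sketch below. The helper name is_aligned is hypothetical, and which alignment to test for is an assumption that depends on the access: on a Cortex-M7, multi-word instructions such as LDRD and LDM need word-aligned addresses, and a different optimization level can change which instructions the compiler emits for the same source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if `p` is a multiple of `alignment` bytes.
 * `alignment` must be non-zero. */
static bool is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0u;
}
```

A typical use would be an assertion such as assert(is_aligned(buffer, 4)) right after each allocation, and again before a buffer is cast to a wider type or handed to a peripheral driver.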