Reproducible Firmware Builds | Interrupt

If you have ever worked on a large-scale embedded project, you’ve probably run into situations where the build system or binary behaves differently depending on where it was compiled. For example, the firmware may fail to compile at all on one computer, while on another it compiles but crashes on boot!


This is a companion discussion topic for the original entry at https://interrupt.memfault.com/blog/reproducible-firmware-builds

This article is amazing. As an embedded firmware developer, I also face the problem of reproducibility every day. I avoid using macros like __TIME__, use relative paths on the gcc command line, etc.
But I can see that you’re emphasizing certain aspects that I hadn’t considered and that I don’t fully understand.
What matters to me is the binary, not so much the ELF it comes from. It’s the binary that’s written to the microcontroller flash. So I’m wondering why you want ELF reproducibility, too. If compiling from two different directories produces different ELFs but identical binaries, then what’s the problem?

About the second patch: I tried replacing the ROOT_DIR := $(abspath .) assignment with ROOT_DIR := . without applying patch 03. I expected that to work, but it doesn’t, and I don’t understand why.

Finally, how risky is it to embed the GNU build ID into the binary, given that it also depends on the debug sections?
Even changing just one gcc debugging parameter (like -g3 or -g2) would change the SHA1 without changing the binary functionally.
In my opinion, it is risky to introduce a dependence on information (debugging information) that is not present in the binary itself.

best regards
Max

Hi Max,

Glad to hear you found it useful! Great questions, I’ve added some thoughts inline below!

About the second patch: I tried replacing the ROOT_DIR := $(abspath .) assignment with ROOT_DIR := . without applying patch 03. I expected that to work, but it doesn’t, and I don’t understand why.

By default, the GNU DWARF writer will emit the DW_AT_comp_dir attribute (the directory the file was compiled in) as an absolute path. You need to provide the -fdebug-prefix-map compiler option (patch 03) to change this.

You can examine particular attributes by using the --debug-dump option of readelf:

arm-none-eabi-readelf --debug-dump build/nrf52.elf | grep -i DW_AT_comp_dir
    <15>   DW_AT_comp_dir    : (indirect string, offset: 0x456f): /private/tmp/dev/interrupt/example/reproducible-build
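
A check like this can be automated in CI. The following is only a minimal sketch in Python, assuming the readelf output has already been captured as text and that each relevant line looks like the one above; the function name `absolute_comp_dirs` is hypothetical and only POSIX-style absolute paths are detected.

```python
import re

# Captures the path at the end of a DW_AT_comp_dir line as printed by readelf.
# The optional "(indirect string, offset: ...):" prefix is skipped when present.
_COMP_DIR_RE = re.compile(r"DW_AT_comp_dir\s*:\s*(?:\([^)]*\):\s*)?(.+)$")


def absolute_comp_dirs(readelf_output):
    """Return every absolute DW_AT_comp_dir path found in readelf output.

    Assumption: paths are POSIX-style, so "absolute" means "starts with /".
    """
    paths = []
    for line in readelf_output.splitlines():
        match = _COMP_DIR_RE.search(line)
        if match is None:
            continue
        path = match.group(1).strip()
        if path.startswith("/"):
            paths.append(path)
    return paths
```

An empty result suggests that no compilation unit recorded an absolute build directory; a non-empty one lists the paths that still leak into the ELF.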

If compiling from two different directories produces different ELFs but identical binaries, then what’s the problem?
[…]
Even changing just one gcc debugging parameter (like -g3 or -g2) would change the SHA1 without changing the binary functionally.

Nothing would be the matter per se. You would need to set up your own tooling to compute an MD5 over the final binary and use that to verify that the same build has been generated, but that wouldn’t be hard to do.
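
As a rough illustration of that tooling, here is a minimal Python sketch that hashes two binary files and compares the digests. The function names are hypothetical, and SHA-256 is used as the default in place of the MD5 mentioned above; passing `algorithm="md5"` gives an MD5 digest instead.

```python
import hashlib


def binary_digest(path, algorithm="sha256"):
    """Return the hex digest of the file at `path`, read in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read fixed-size chunks so large images need not fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_binary(path_a, path_b, algorithm="sha256"):
    """True when both files hash to the same digest."""
    return binary_digest(path_a, algorithm) == binary_digest(path_b, algorithm)
```

Running `same_binary()` on the .bin files produced in two different checkout directories is one way to confirm that the flashed image is identical even if the ELFs differ.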

While it’s true just changing a compiler flag related to debugging won’t change the binary, I do think it’s nice to know that the exact same debug information is being generated for a few reasons:

  • If the debug (DWARF) information emitted changes, different developers may get different results when debugging locally. For example, depending on what settings were changed, one build may be able to display backtraces correctly and another may not.
  • If you are collecting cores / automating analysis of crashes (e.g. with gdb-python scripts), you’ll want a way to get the same debug info so you get the same analysis results. For example, changing from -g3 to -g2 removes macro definitions from the ELF; if an analysis script was looking up the value of a #define, it will no longer work.

Finally, how risky is it to embed the GNU build ID into the binary, given that it also depends on the debug sections?

I think it’s quite safe to store this information. A lot of effort has been put into the GNU project to make sure the debug information emitted is reproducible. The build ID itself was originally added to the GNU project to aid in being able to reliably look up the symbols for a Linux core dump.

Also, just in case you didn’t see it, we talk about using the build ID in embedded applications in a bit more detail here.

best!
-Chris

If you are collecting cores / automating analysis of crashes (e.g. with gdb-python scripts), you’ll want a way to get the same debug info so you get the same analysis results.
[…]
The build ID itself was originally added to the GNU project to aid in being able to reliably look up the symbols for a Linux core dump.

Your arguments are persuasive, and I agree with you. Every developer on the team should be able to get the same results when debugging.
I can see that you also put emphasis on core analysis. In a microcontroller environment, however, that problem doesn’t exist (unless there is a way to make firmware create a core when it crashes).

best regards
Max

I can see that you also put emphasis on core analysis. In a microcontroller environment, however, that problem doesn’t exist (unless there is a way to make firmware create a core when it crashes).

There actually is! (and full disclosure, it is one of the tools we work on at Memfault)

In general, all that’s needed on the firmware side is a way to collect some of the RAM and register state from the microcontroller and get it off the device. At a minimum this could just be something like the pc & lr registers, but often it can be all of RAM, since microcontrollers don’t usually have that much :)

The posts on Interrupt about asserting and debugging Cortex-M hard faults touch upon some of the things you can start to collect for postmortem analysis.

This is really good. I usually try to reproduce the crashes at my desk with the JTAG probe connected; at that point I can dump the RAM and analyze what happened. I think it’s similar to what you describe.

However, I want to explain my point better.
We produce devices that can be updated in the field. The upgrade is a critical phase: if a device loses power during the upgrade, it is bricked and has to be returned for service.
What matters to me and my company is that no unnecessary updates take place. This is why the update only happens if the binary differs.
And that’s why I feel it is unsafe to integrate the GNU build ID into the binary. In fact, it is sensitive not only to the build parameters (like -g2 or -g3, as we said in a previous answer), but also to the variable names, or the order in which functions are defined or variables are declared.
You will see that SHA1 changes if you change the name of a variable or parameter. For example in the main.c file try replacing static void prvQueueuePongTask(void *pvParameters) with static void prvQueueuePongTask(void *pvParameters_), and you will see that the SHA1 will be different.
Or try swapping the order of two functions, for example:

void recover_from_task_fault(void) {
  while (1) {
    vTaskDelay(1);
  }
}

void vAssertCalled(const char *file, int line) {
  __asm("bkpt 3");
}

Here I swapped the definition order of vAssertCalled() and recover_from_task_fault() and, again, the SHA1 changes.

Therefore, even a code cleanup (e.g. for the sake of clarity) would risk causing an unwanted update.

My thought is that the GNU build ID hash identifies, indeed, the build and not the binary. For this reason I believe that we (and all those who have similar needs) cannot (or should not) integrate the GNU build ID into the binary.

How do you think these problems can be solved (excluding removing the hash from the binary)?

best regards
Max

Hi Max!

Good questions! Some thoughts below:

This is really good. I usually try to reproduce the crashes at my desk with the JTAG probe connected; at that point I can dump the RAM and analyze what happened. I think it’s similar to what you describe.

That’s definitely a great first step and what I try to do too. What I’ve found, though, is that once devices have shipped to end customers, there are always some bugs that cannot be reproduced in the office and that frustrate customers so much they wind up returning the product or cancelling the subscription that comes with it. These issues can arise for a variety of reasons: a customer connecting to a different phone/router/computer/etc. than was tested in the office and hitting an interoperability issue, a problem caused by aging hardware, a customer using the product in a different way than envisioned, or some regional climate factor impacting the hardware. One of my personal favorites, which took me a while to debug, was a digital compass implementation failing to calibrate correctly in parts of Africa and South America because that’s where Earth’s magnetic field strength is the weakest. Without collecting some sort of data in the field, it can be nearly impossible to solve these types of issues.

We produce devices that can be updated in the field. The upgrade is a critical phase: if a device loses power during the upgrade, it is bricked and has to be returned for service. What matters to me and my company is that no unnecessary updates take place. This is why the update only happens if the binary differs.

Makes sense, a couple thoughts here:

  1. You could place the GNU build ID at a fixed offset in your binary, and the logic that decides whether an update is needed could skip over the build ID when comparing the images. I suspect, though, that if several changes have been added to the repository leading up to a release, some parts of the binary will typically have changed anyway, so you may be optimizing for a situation that doesn’t happen in practice.
  2. I’d consider reworking the firmware update strategy for future products so that a reboot in the middle of an update will not brick the device. This is usually done by reserving some space to stage an update before it is installed, plus a bootloader that performs the update. If the device reboots in the middle of the update, the bootloader can detect that there is no valid main image and retry the update.
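
The comparison from the first point could look roughly like the Python sketch below. It is only a model of the idea: the function name and the `build_id_offset` parameter are hypothetical, and the 20-byte default assumes the build ID is a SHA1 placed at a known offset in the image.

```python
def images_match_ignoring_build_id(old, new, build_id_offset, build_id_len=20):
    """Compare two firmware images while skipping the build ID bytes.

    Assumptions: both images are bytes objects, and the build ID occupies
    `build_id_len` bytes starting at `build_id_offset` in each image.
    """
    if len(old) != len(new):
        return False
    end = build_id_offset + build_id_len
    if build_id_offset < 0 or end > len(old):
        raise ValueError("build ID region lies outside the image")
    # Everything before and after the build ID must be byte-for-byte equal.
    return old[:build_id_offset] == new[:build_id_offset] and old[end:] == new[end:]
```

An update would then be skipped when this returns True, even though the two build IDs differ.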

Wouldn’t that be like not having a build ID at all?
You may have different builds (with different build IDs) but identical binaries. How exactly do you use the build ID information?

In fact, we also embed the git commit description in the binary, generated like this:

git describe --always --dirty --broken --exclude=*

In this way we identify the commit from which the build comes.

Our workflow is exactly as you described it. The problem only arises when the bootloader itself needs to be updated.
Our test department was able to brick a device by removing power during the very short time frame of the bootloader update. This has worried my boss and the company a lot (very paranoid). That’s why we want to avoid unwanted updates as much as possible.
I think the probability of that happening in the field is low enough. Our test department, even while deliberately trying, has only succeeded once!

best regards
Max

IMO if you want to be really safe, but want to retain the ability to update the bootloader, you need to implement A/B updates for the bootloader (i.e. have two “slots” reserved for bootloader). There are some more complicated schemes you can use as well (with multi-stage bootloaders) to save some code space.
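
To make the A/B idea concrete, here is a small Python model of the slot selection a first-stage loader might perform. It is a sketch under stated assumptions rather than a real bootloader: the `Slot` fields, the CRC32 integrity check, and the "highest valid version wins" policy are all hypothetical choices for illustration.

```python
import zlib
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Slot:
    name: str          # e.g. "A" or "B"
    version: int       # monotonically increasing bootloader version
    image: bytes       # contents currently stored in the slot
    expected_crc: int  # CRC32 recorded when the image was written


def slot_is_valid(slot: Slot) -> bool:
    """A slot is valid when its stored CRC matches the image contents."""
    return (zlib.crc32(slot.image) & 0xFFFFFFFF) == slot.expected_crc


def select_boot_slot(slots: Sequence[Slot]) -> Optional[Slot]:
    """Pick the newest valid slot, or None if every slot is corrupt."""
    valid = [s for s in slots if slot_is_valid(s)]
    if not valid:
        return None
    return max(valid, key=lambda s: s.version)
```

With this scheme, a power loss while writing slot B leaves B with a CRC mismatch, so the device keeps booting the older image in slot A.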

Relying only on the gcc version can be a bit risky, because several toolchain releases ship the same gcc version.
I keep a small collection of toolchains:

max@resfw04:/opt$ for cc in `find . -name "arm-none-eabi-gcc" | sort`; do printf "%s -- %s\n" $cc "$($cc --version | grep -i 'GNU Tools for Arm Embedded')"; done 
./gcc-arm-none-eabi-5_3-2016q1/bin/arm-none-eabi-gcc -- arm-none-eabi-gcc (GNU Tools for ARM Embedded Processors) 5.3.1 20160307 (release) [ARM/embedded-5-branch revision 234589]
./gcc-arm-none-eabi-5_4-2016q2/bin/arm-none-eabi-gcc -- arm-none-eabi-gcc (GNU Tools for ARM Embedded Processors) 5.4.1 20160609 (release) [ARM/embedded-5-branch revision 237715]
./gcc-arm-none-eabi-5_4-2016q3/bin/arm-none-eabi-gcc -- arm-none-eabi-gcc (GNU Tools for ARM Embedded Processors) 5.4.1 20160919 (release) [ARM/embedded-5-branch revision 240496]
./gcc-arm-none-eabi-6_2-2016q4/bin/arm-none-eabi-gcc -- arm-none-eabi-gcc (GNU Tools for ARM Embedded Processors) 6.2.1 20161205 (release) [ARM/embedded-6-branch revision 243739]
./gcc-arm-none-eabi-6-2017-q1-update/bin/arm-none-eabi-gcc -- arm-none-eabi-gcc (GNU Tools for ARM Embedded Processors 6-2017-q1-update) 6.3.1 20170215 (release) [ARM/embedded-6-branch revision 245512]
./gcc-arm-none-eabi-6-2017-q2-update/bin/arm-none-eabi-gcc -- arm-none-eabi-gcc (GNU Tools for ARM Embedded Processors 6-2017-q2-update) 6.3.1 20170620 (release) [ARM/embedded-6-branch revision 249437]
./gcc-arm-none-eabi-7-2017-q4-major/bin/arm-none-eabi-gcc --  arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 7-2017-q4-major)  7.2.1 20170904 (release) [ARM/embedded-7-branch revision 255204]
./gcc-arm-none-eabi-7-2018-q2-update/bin/arm-none-eabi-gcc -- arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 7-2018-q2-update) 7.3.1 20180622 (release) [ARM/embedded-7-branch revision 261907]
./gcc-arm-none-eabi-8-2018-q4-major/bin/arm-none-eabi-gcc --  arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 8-2018-q4-major)  8.2.1 20181213 (release) [gcc-8-branch revision 267074]
./gcc-arm-none-eabi-8-2019-q3-update/bin/arm-none-eabi-gcc -- arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 8-2019-q3-update) 8.3.1 20190703 (release) [gcc-8-branch revision 273027]
./gcc-arm-none-eabi-9-2019-q4-major/bin/arm-none-eabi-gcc --  arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 9-2019-q4-major)  9.2.1 20191025 (release) [ARM/arm-9-branch revision 277599]

You can see that both 5_4-2016q2 and 5_4-2016q3 embed version 5.4.1 of gcc.
Likewise, 6-2017-q1 and 6-2017-q2 both embed version 6.3.1 of gcc.

I think the toolchain release should be taken into account as well as (or instead of) the gcc version.
Or at least the build date (although this no longer applies if a project requires you to recompile the toolchain yourself):
5.4.1 20160609 or 5.4.1 20160919
6.3.1 20170215 or 6.3.1 20170620
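
One way to act on this is to pin the version, date, and revision together. The Python sketch below parses a `--version` line of the form shown above; the `ToolchainId` type and `parse_toolchain_id` function are hypothetical names, and the regular expression assumes the line layout matches these Arm Embedded releases.

```python
import re
from typing import NamedTuple, Optional


class ToolchainId(NamedTuple):
    version: str   # gcc version, e.g. "5.4.1"
    date: str      # build date, e.g. "20160609"
    revision: str  # branch revision, e.g. "237715"


_VERSION_RE = re.compile(
    r"\([^)]*\)\s+(?P<version>\d+\.\d+\.\d+)\s+(?P<date>\d{8})\s+"
    r"\(release\)\s+\[.+? revision (?P<revision>\d+)\]"
)


def parse_toolchain_id(version_line: str) -> Optional[ToolchainId]:
    """Extract (version, date, revision) from a gcc --version line."""
    match = _VERSION_RE.search(version_line)
    if match is None:
        return None
    return ToolchainId(match["version"], match["date"], match["revision"])
```

A build script could compare the parsed tuple against the expected one and refuse to build on a mismatch, which distinguishes the two 5.4.1 releases above even though their gcc versions are equal.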

What do you think?

best regards
Max

Thanks for yet another great post!

I had to sort my source files as well, as for some reason the CI would compile and link the files in a slightly different order and thus the sections would end up in different order in the final image. I didn’t realize until I looked at the disassembly and saw that the sections were identical, but just shuffled around.

I simply changed:

C_OBJS := $(addsuffix .o,$(addprefix $(BUILD_DIR)/,$(basename $(filter %.c,$(SOURCE)))))
CPP_OBJS := $(addsuffix .o,$(addprefix $(BUILD_DIR)/,$(basename $(filter %.cpp,$(SOURCE)))))
S_OBJS := $(addsuffix .o,$(addprefix $(BUILD_DIR)/,$(basename $(filter %.S,$(SOURCE)))))

To:

C_OBJS := $(sort $(addsuffix .o,$(addprefix $(BUILD_DIR)/,$(basename $(filter %.c,$(SOURCE))))))
CPP_OBJS := $(sort $(addsuffix .o,$(addprefix $(BUILD_DIR)/,$(basename $(filter %.cpp,$(SOURCE))))))
S_OBJS := $(sort $(addsuffix .o,$(addprefix $(BUILD_DIR)/,$(basename $(filter %.S,$(SOURCE))))))

Where “SOURCE” is a list of all my source files.


Very interesting @Lauszus! Thanks for sharing and glad to hear you enjoyed the article!

Usually when something like this happens, it’s a good indicator that one of the build tools (in this case make) differs between CI and the local environment, which leads to non-deterministic results. For example, OSX ships with an extremely old version of make (3.81 from 2006!) whereas Ubuntu 16.04 includes something far more recent (4.1 from 2014). (In case it’s interesting, we like to use conda to make sure the build tools in our local environments match CI and have a post about setting it up here)

That being said, if you happen to be using the wildcard function to collect source files, GNU make has a somewhat checkered history when it comes to keeping the order of the results consistent. In that case, I’d also recommend wrapping wildcard calls in $(sort) to get stable results.

Both the CI and my machine run Ubuntu 18.04, and I confirmed that make was the same version. I do use wildcards, so that explains why the sort was needed.