I worked on LLVM professionally back when there was just llvm-gcc and no Clang, and for several years after Clang was released. LLVM has been capable of lots of things people don’t even attempt for years; LLVM IR was originally meant to be cross-processor, and you can generally build a project with processor features disabled so no intrinsics end up in the IR, then lower that IR to machine code for any supported architecture with the same data sizes (the target may have additional or different ones, as long as it has the maximum-sized integer and float you built against), given a platform with a compatible C library. This isn’t recommended or necessarily a good idea, but I’ve taken C code compiled to IR against armv5tej-linux-eabi and lowered it to all of windows-x86_64, darwin-ppc, android-arm, linux-x86, and ios-arm, and it remained functional and ran without errors despite being forcibly lowered to the wrong platform / architecture… Not super useful, but kinda neat. Apple requires App Store Mach-O submissions to include bitcode as another slice in the multi-arch binary so they can verify code more easily, and presumably so they can do new optimized builds for minor architecture changes and compiler updates.
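The plumbing for that is roughly the two commands below (a sketch: the file names are placeholders, and which triples you can actually lower to depends on the targets your LLVM build has enabled):

clang --target=armv5te-unknown-linux-gnueabi -S -emit-llvm example.c -o example.ll
llc -mtriple=x86_64-pc-windows-msvc example.ll -o example.s

llc emits code for whatever -mtriple you hand it without verifying that the IR’s ABI and type-size assumptions still hold on the new target, which is both why this works at all and why it isn’t recommended.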
A lot of the time, because compiler flags differ in behavior between compilers (as seen in previous posts here), people give up or assume LLVM is producing bad code… there were at least 3 bugs I saw on the dev list today complaining about bad optimization compared to GCC where the entire problem could be fixed by using -Os or -Oz instead of -O3, because LLVM’s -O3 is like telling it, “the target machine has effectively unlimited RAM and icache, spew all the assembly you want if it has any benefit”, which often creates large, bizarre-looking constructs that execute quickly because of non-obvious pipelining and processor rules. This is one of those situations where assembly is kind of all or nothing: unless you’ve read the thousands of pages of architecture guides and timing tables like me and the similarly pedantic asses who have worked on or currently work on the backends, you’re better off not second-guessing the compiler without at least profiling the code.
One strange example is section 3.7.2 of the Intel optimization manual, where instead of:
mov eax, [ebx+4]    ; load from the list
mov ecx, 60         ; loop count
…to set up a loop scanning a list,
mov eax, [ebx+4]    ; same load issued three times; the redundant
mov eax, [ebx+4]    ;   loads trigger the L1 hardware prefetcher
mov eax, [ebx+4]    ;   on the rest of the data
mov ecx, 60         ; loop count
is preferred in most cases on Core microarchitectures, because it triggers hardware L1 prefetch of the rest of the data. At first glance this code would look like a compiler error to most sane people, including me if I hadn’t read that manual before, but the 3x-duplicated instruction forces insanely fast background prefetch and is safe to do 90% of the time. On constrained hardware (a different processor architecture with a similar optimization but wider variance between the high-end and embedded parts, say), it might matter more to drop the extra instructions with -Oz, because you’re working within 128k of flash and the extra speed during your hardware init sequence doesn’t matter. I saw a bug filed today about redundant loads that made me think of this; it was very likely an instance of this optimization or a related one.
Another example was a bug filed about REP MOVSB being selected instead of the wider versions (REP MOVSD/Q), with GCC output given as the example of doing it right… The problem is that GCC is wrong on Ivy Bridge and up: there’s a processor fast path (“enhanced REP MOVSB/STOSB”, ERMSB) that makes the byte forms as fast as or faster than the dword/qword forms, and as fast or faster than AVX / AVX2 on copies over 128 bytes. Intel now recommends against the old memcpy pattern of REP MOVSD plus a MOVSB tail for the mod-4 leftovers, and recommends AVX only for very short copies. AVX wins on those because enhanced REP MOVSB has startup overhead that isn’t amortized, relative to AVX register copies, until around the 128-byte range. Technically both compilers in the “bug” should have conditionally used AVX based on data size, but AVX has its own problems on Haswell-E / Broadwell-E / Skylake, like the dynamic Turbo Boost upper limit dropping by 200 MHz by default on the core or the whole chip when AVX code executes, which can make it more costly than expected. GCC either defaults to some ancient target processor or doesn’t know about this at all. Now, if you didn’t read all that, you’d think copying data as bytes would be the slowest way to do things (and it would be, using mov reg, [mem]; mov [mem], reg loops, for a couple of reasons), but processor architectures can be weird and you’ll have things like that.
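For scale, the entire fast-path copy on those parts is the sketch below; the register assignments follow the standard string-op convention and are assumed to be set up already:

cld              ; clear the direction flag so the copy runs forward
                 ; rsi = source, rdi = destination, rcx = length in bytes
rep movsb        ; ERMSB fast path on Ivy Bridge+; no dword/qword widening
                 ;   and no mod-4 tail fixup needed

Real memcpy implementations gate this on the ERMSB CPUID feature bit rather than assuming it.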
There’s another case where macro-op fusion happens with TEST / Jcc but not with CMP / Jcc pairs, etc.
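A sketch of the shape of the rule (exactly which pairs fuse varies by generation, so treat the split as illustrative rather than universal):

test eax, eax    ; TEST + Jcc can macro-fuse into a single macro-op
jz   is_zero     ;   in the decoders

cmp  eax, 0      ; the equivalent CMP + Jcc pair may not fuse on cores
jz   is_zero     ;   where only TEST qualifies, costing a decode slot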
The whole point there is that even though I liked reading Intel / ARM / PPC manuals at one point in time and would have contributed to an open source compiler if I’d had something to contribute, it never would have been to a GPLv3 project. Getting paid wasn’t enough to make me submit an instruction-encoding fix to GAS for ARM; I just flat-out refused (I could get away with a lot as the only person there who could read and write assembly and the compiler backends fluently for every architecture we supported), converted our arm-linux-eabi toolchain to Clang in a few days instead, and got the company to ditch GCC across every supported platform over the next several months after proving it worked on that one.
That was my initiative, not some corporate doesn’t-want-to-release-source thing. They’d have gladly submitted patches to GCC/GAS forever instead, but I never wanted to look at GAS again after doing so once, and I already disliked the GPL, so I made absolutely sure I wouldn’t be asked to, and neither would anyone else there. We required and integrated with LLVM anyway, so switching wasn’t a hard sell; it’s just that nobody had thought it was possible at that point in time. We still relied on the libstdc++ headers from the toolchain, but the LGPL doesn’t morally offend me as much, and my idea of forcing customers to switch to a more permissively licensed embedded OS wasn’t going to fly (mostly because such a thing may or may not exist).
So, I was happy to see this post, because firmware is one of the last holdouts where not-really-free tools must be dealt with, now that Clang can more or less build Linux itself.
Lastly, and probably of more interest here than my uarch / anti-GPL rants: anyone interested in breaking that last bit of reliance on GCC in what you’re building should look at llvm-libc (the “llvm-libc” C Standard Library page in the LLVM 12 documentation; source in progress at llvm-project/libc on GitHub) and see if you can contribute. They will accept pull requests from anyone after a well-defined review process. Somewhat ironically, llvm-libc is being written in C++ (avoiding the STL so it can be built standalone), but that’s standard for the project overall aside from language bindings.