Counting Crashes to Improve Device Reliability | Interrupt

The first step to making reliable IoT devices is understanding that they are inherently unreliable. They will never work 100% of the time. This is partially because we firmware engineers will never write perfect code. Even if we did, our devices need to operate through various networks and gateways, such as cellular modems, mobile phone Bluetooth applications, Wi-Fi routers, cloud backends, and more, and each of these may introduce unreliability. The devices may work with today’s Wi-Fi routers and mobile phones, but whether they function with tomorrow’s is a mystery.


This is a companion discussion topic for the original entry at https://interrupt.memfault.com/blog/device-reliability-metrics

Great article @tyler! Really appreciate the context around each metric and how to interpret it.

One thing I have noticed is that there is sometimes an inclination (for better or for worse) to express these sorts of metrics in a heavily statistical manner. I’ve found it useful to approach such reports with a critical eye and probe for the meaning behind the numbers (and in the process, refresh my rusty knowledge of statistical analysis). For example, a new firmware release can show a better fleet-wide MTBF while certain types of devices are actually worse off. That regression can get lost in the aggregate (or not be reported at all) and lead to a bad time for a subset of your users.
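To make that concrete, here is a minimal sketch with made-up numbers. It assumes MTBF is computed as total operating hours divided by crash count; the release names, device types, and the `FLEET` table are all hypothetical and only there to illustrate how an aggregate can hide a per-device regression.

```python
from collections import defaultdict

# Hypothetical fleet data (not real measurements):
# (release, device_type, operating_hours, crashes)
FLEET = [
    ("1.0", "model_a", 9000, 9),
    ("1.0", "model_b", 1000, 2),
    ("1.1", "model_a", 9000, 5),
    ("1.1", "model_b", 1000, 4),
]


def mtbf(hours, crashes):
    """Assumed definition: operating hours per crash (infinite if no crashes)."""
    return hours / crashes if crashes else float("inf")


def mtbf_by(rows, key):
    """Sum hours and crashes per group, then compute MTBF for each group."""
    totals = defaultdict(lambda: [0, 0])
    for release, device_type, hours, crashes in rows:
        group = key(release, device_type)
        totals[group][0] += hours
        totals[group][1] += crashes
    return {group: mtbf(h, c) for group, (h, c) in totals.items()}


def regressions(per_type, old, new):
    """Device types whose MTBF dropped going from release `old` to `new`."""
    return sorted(
        device
        for (release, device) in per_type
        if release == new
        and (old, device) in per_type
        and per_type[(new, device)] < per_type[(old, device)]
    )


overall = mtbf_by(FLEET, lambda release, device: release)
per_type = mtbf_by(FLEET, lambda release, device: (release, device))

for release in ("1.0", "1.1"):
    print(f"release {release}: fleet-wide MTBF = {overall[release]:.0f} h")
print("device types that got worse in 1.1:", regressions(per_type, "1.0", "1.1"))
```

With these numbers the fleet-wide MTBF goes up in 1.1, yet `model_b` crashes twice as often as before. Because `model_b` contributes only a small share of the total hours, its regression barely moves the aggregate, which is why breaking the metric down by device type (or any other cohort you care about) is worth the extra step.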

Long way of saying, +100 to this post. Make sure you know what you’re comparing and what it actually means for your code base and your users.