Counting Crashes to Improve Device Reliability | Interrupt

discobot · November 8, 2023, 9:02pm

The first step to making reliable IoT devices is understanding that they are inherently unreliable. They will never work 100% of the time. This is partially because we firmware engineers will never write perfect code. Even if we did, our devices need to operate through various networks and gateways, such as cellular modems, mobile phone Bluetooth applications, Wi-Fi routers, cloud backends, and more, and each of these may introduce unreliability. The devices may work with today’s Wi-Fi routers and mobile phones, but whether they function with tomorrow’s is a mystery.

This is a companion discussion topic for the original entry at https://interrupt.memfault.com/blog/device-reliability-metrics

shiva · November 9, 2023, 4:50am

Great article @tyler! Really appreciate the context around each metric and how to interpret it.

One thing I have noticed is that there is sometimes an inclination (for better or for worse) to express these sorts of metrics in a very statistics-oriented manner. I’ve found it useful to approach such reports with a critical eye and probe for the meaning behind the numbers (and in the process, refresh my rusty knowledge of statistical analysis). For example, a new firmware release can show a better fleet-wide MTBF, but if certain types of devices are worse off, that can get lost in the aggregate (or not be reported at all) and lead to a bad time for a subset of your users.

Long way of saying, +100 to this post. Make sure you know what you’re comparing and what it actually means for your code base and your users.
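
To make the subgroup point concrete, here is a minimal sketch of how a fleet-wide MTBF can improve while one device type regresses. Everything in it is an assumption for illustration: the release names, the device types, and the hours and crash counts are invented, and MTBF is taken simply as total operating hours divided by crash count, which may not match how the article or any particular tool computes it.

```python
from collections import defaultdict

def mtbf(hours, failures):
    """Mean time between failures: operating hours per failure."""
    return float("inf") if failures == 0 else hours / failures

# Hypothetical fleet data (invented numbers, not from the article):
# (release, device_type, operating_hours, crash_count)
FLEET = [
    ("v1", "type_a", 10_000, 10),
    ("v1", "type_b", 10_000, 20),
    ("v2", "type_a", 10_000, 5),
    ("v2", "type_b", 10_000, 22),
]

def mtbf_by(rows, key):
    """Sum hours and crashes per group, then compute MTBF for each group."""
    totals = defaultdict(lambda: [0, 0])
    for release, device_type, hours, crashes in rows:
        group = key(release, device_type)
        totals[group][0] += hours
        totals[group][1] += crashes
    return {group: mtbf(h, c) for group, (h, c) in totals.items()}

# Fleet-wide view: one number per release.
overall = mtbf_by(FLEET, lambda release, device_type: release)

# Segmented view: one number per (release, device type).
per_type = mtbf_by(FLEET, lambda release, device_type: (release, device_type))

# Device types whose MTBF got worse from v1 to v2.
regressions = [
    device_type
    for (release, device_type) in per_type
    if release == "v2"
    and per_type[("v2", device_type)] < per_type[("v1", device_type)]
]

print("overall:", overall)
print("per type:", per_type)
print("regressed in v2:", regressions)
```

With these made-up numbers the overall MTBF goes up in v2, yet `type_b` ends up in the regression list, which is exactly the case a single fleet-wide figure would hide.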

tester · March 7, 2025, 4:22pm

If you don’t use dynamic memory and don’t share code space with black-box software, you CAN go crashless.
Think of your car’s braking system…
