Tracking Fleet Health with Heartbeat Metrics | Interrupt

Releasing a connected device in today’s world without some form of monitoring in place is a recipe for trouble. How would you know whether, or how often, devices are experiencing faults or crashing? How can the release lead be confident that no connectivity, performance, or battery-life regressions have occurred between the previous and current firmware updates?


This is a companion discussion topic for the original entry at https://interrupt.memfault.com/blog/device-heartbeat-metrics

Interesting read. Though I wonder about the heartbeat reset vs. continuous argument and if reset really is always better.
First, for analysis: with good analytics tooling (e.g. Splunk) you can easily convert the data from one form to the other, so it doesn’t really make a difference for analysis.
And with an unreliable connection and potential message loss, continuous data has the advantage that the total sum is still correct, whereas with reset values every lost intermediate message means the total sum will be off.
Message drops with continuous values lead to resolution loss, whereas message drops with reset values lead to data loss.
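
As a minimal sketch of that last point (the numbers and helper names here are made up for illustration, and counter resets on reboot are ignored), this is what a single lost message does to each scheme:

```python
from itertools import accumulate

def total_from_cumulative(readings):
    """Continuous counter: the latest reading already is the total."""
    return readings[-1] if readings else 0

def total_from_deltas(readings):
    """Reset-per-heartbeat counter: the total is the sum of everything received."""
    return sum(readings)

# Events counted in four consecutive heartbeat windows (hypothetical data).
deltas = [3, 5, 2, 4]
cumulative = list(accumulate(deltas))  # [3, 8, 10, 14]

# Simulate losing the second message in transit.
received = [0, 2, 3]
recv_deltas = [deltas[i] for i in received]
recv_cumulative = [cumulative[i] for i in received]

print(total_from_cumulative(recv_cumulative))  # 14, still the true total
print(total_from_deltas(recv_deltas))          # 9, the lost window's 5 events are gone
```

With the continuous counter the only thing lost is the knowledge of how the 7 events between the first and third message were split across two windows.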

Unreliable communication is also another reason to include some kind of timestamp. Is message ordering guaranteed? Can messages be duplicated? Timestamps help with all of these issues.

And in my experience, some kind of correlation ID (like the boot ID mentioned in the article) is extremely helpful. This way static information (software version, configuration, …) needs to be sent only once, and the heartbeat messages can then be easily correlated via the boot ID alone.
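
Roughly what I mean on the receiving side, as a sketch only: the message layout and the field names (`boot_id`, `fw_version`, `crashes`) are assumptions for illustration, not any particular product's format.

```python
# Static info is sent once per boot and stored keyed by boot ID.
boot_records = {}

def on_boot_record(msg):
    """Remember the static fields that came with this boot ID."""
    boot_records[msg["boot_id"]] = {k: v for k, v in msg.items() if k != "boot_id"}

def enrich(heartbeat):
    """Attach the static fields for the heartbeat's boot ID, if they are known."""
    static = boot_records.get(heartbeat["boot_id"], {})
    return {**static, **heartbeat}

on_boot_record({"boot_id": "b-17", "fw_version": "1.4.2", "hw_rev": "C"})

# The heartbeat itself only carries the boot ID plus its metrics.
hb = enrich({"boot_id": "b-17", "crashes": 0, "battery_pct_drop": 2})
print(hb["fw_version"])  # 1.4.2
```

If the boot record itself gets lost, the heartbeat simply stays un-enriched, so in practice you would probably want the device to resend it occasionally.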

Regarding the heartbeat duration: shouldn’t the timestamp be enough to derive the heartbeat duration? Why do you need to send it explicitly with every heartbeat?

I can tell you’ve thought a lot about this problem! Thanks for the comment. Happy to provide my thoughts.

First, for analysis: with good analytics tooling (e.g. Splunk) you can easily convert the data from one form to the other, so it doesn’t really make a difference for analysis.

Those tools are expensive though :wink:. For many hardware companies, budgets are tight. Also, I genuinely don’t know: how well would Splunk handle this sort of transformation and analysis for 1k or 1M devices?

And with an unreliable connection and potential message loss, continuous data has the advantage that the total sum is still correct, whereas with reset values every lost intermediate message means the total sum will be off.

Constant monitoring, especially around debugging and visibility, is less about getting 100% accurate summations, and it’s more about quick and dirty estimations and finding changes in the rates or frequency of issues. I don’t think “monitoring” is a good way to keep track of critical values. That should, as you say, be stored internally on the device and ensure the counts are never accidentally lost due to loss of power or crashes, then sent up periodically.

Re timestamps: if devices have a stable Internet connection and are pinging the time servers directly, then much of this article isn’t applicable. With some of the hardware products I’ve worked on in the past, we would go days or weeks without syncing to a time server (through a mobile app over BLE), so our time drift would be substantial. And when we did sync, the timestamps would jump forward or backward depending on the drift. It’s nice to send a best-effort timestamp as well, but at previous organizations we rarely used it in our calculations, or treated it as a rough estimate rather than a source of truth.

And in my experience, some kind of correlation ID (like the boot ID mentioned in the article) is extremely helpful. This way static information (software version, configuration, …) needs to be sent only once, and the heartbeat messages can then be easily correlated via the boot ID alone.

YES! I’m a huge proponent of boot IDs. I did mention them in the article, but not as a way to de-duplicate data or save space.

Regarding the heartbeat duration: shouldn’t the timestamp be enough to derive the heartbeat duration? Why do you need to send it explicitly with every heartbeat?

If your device has guaranteed time, of course! Many, many do not, so that is where this take on heartbeat metrics comes into play. By recording a constant window of time, measured by a stable crystal on the device, a developer can ensure they captured a stable amount of time for all heartbeats and across all devices.
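
To make that concrete, here is a small sketch with made-up numbers. The field names (`timestamp`, `duration_s`, `crashes`) are hypothetical, and `duration_s` is assumed to come from the device's own monotonic tick counter rather than from wall-clock time.

```python
def crashes_per_hour(heartbeats):
    """Rate normalized by the device-measured duration of each heartbeat."""
    total_crashes = sum(hb["crashes"] for hb in heartbeats)
    total_seconds = sum(hb["duration_s"] for hb in heartbeats)
    return total_crashes * 3600 / total_seconds if total_seconds else 0.0

heartbeats = [
    {"timestamp": 1000, "duration_s": 3600, "crashes": 1},
    {"timestamp": 4600, "duration_s": 3600, "crashes": 0},
    # The wall clock was re-synced backward before this heartbeat was stamped.
    {"timestamp": 2200, "duration_s": 3600, "crashes": 1},
]

# Durations derived from wall-clock timestamps break as soon as the clock jumps.
timestamp_deltas = [
    later["timestamp"] - earlier["timestamp"]
    for earlier, later in zip(heartbeats, heartbeats[1:])
]
print(timestamp_deltas)  # [3600, -2400]

# Durations reported by the device stay usable regardless of the jump.
rate = crashes_per_hour(heartbeats)  # 2 crashes over 3 device-measured hours
```

The explicit duration costs a few bytes per heartbeat, but it keeps every rate calculation independent of whatever the wall clock happened to be doing.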

I’d be curious to hear more about how you use Splunk and what kinds of queries and transformations you are performing on the data!

Hello Tyler, thanks a lot for supporting and helping the FW community. This request is a little off-topic. Could you please make a tutorial on ring buffers and how to pass data between an interrupt handler and the main thread in bare-metal FW? Let’s say an interrupt handler is receiving sensor data over a serial protocol and we want to process that data in the main thread or task. How can we achieve this without losing data from the sensor? And if we use a ring buffer for this problem, what should we do when the buffer overflows? Sorry for the long question.

Thanks a lot.

Chitrang