Tracking Fleet Health with Heartbeat Metrics | Interrupt

valoh · December 2, 2020, 11:09pm

Interesting read, though I wonder about the reset vs. continuous heartbeat argument and whether reset really is always better.
First, for analysis: with good analytics tooling (e.g. Splunk) you can easily transform the data from one form to the other, so it doesn’t really make a difference there.
Second, with an unreliable connection and potential message loss, continuous data has the advantage that the total sum is still correct, whereas with reset every packet lost in between means the total sum will be off.
Message drops with continuous values lead to resolution loss, whereas message drops with reset values lead to data loss.
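
To make the difference concrete, here is a minimal sketch in Python. The numbers and function names are made up for illustration: `deltas` holds the true per-interval counts on the device and `delivered` marks which heartbeats reached the backend.

```python
def reset_total(deltas, delivered):
    """Reset-style counter: the backend sums the per-interval values it received."""
    return sum(d for d, ok in zip(deltas, delivered) if ok)


def continuous_total(deltas, delivered):
    """Continuous counter: the backend keeps the last running total it received."""
    running = 0
    last_seen = 0
    for d, ok in zip(deltas, delivered):
        running += d          # the device never resets the counter
        if ok:
            last_seen = running
    return last_seen


def recovered_deltas(deltas, delivered):
    """Differences between consecutive received running totals."""
    running = 0
    previous = 0
    out = []
    for d, ok in zip(deltas, delivered):
        running += d
        if ok:
            out.append(running - previous)
            previous = running
    return out


deltas = [5, 3, 7, 2]                      # true counts, total is 17
delivered = [True, False, True, True]      # second heartbeat is lost

print(reset_total(deltas, delivered))       # 14: the lost interval is gone
print(continuous_total(deltas, delivered))  # 17: the total is still correct
print(recovered_deltas(deltas, delivered))  # [5, 10, 2]: two intervals merged
```

With continuous values the lost heartbeat only merges two intervals into one (resolution loss). With reset values the 3 events from the lost interval cannot be recovered. The continuous total is of course only correct as of the last heartbeat that actually arrived.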

Unreliable communication is also another reason to include some kind of timestamp. Is message ordering guaranteed? Can messages be duplicated? Timestamps help against all of these issues.
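
As a rough sketch of what I mean, assuming each message carries a device identifier and a device-side timestamp (the field names `device_id` and `ts` are hypothetical), the backend can drop duplicates and restore ordering like this:

```python
def normalize(messages):
    """Drop duplicate messages and return them ordered by device and timestamp.

    Two messages count as duplicates when they share the same
    (device_id, ts) pair; the first one seen is kept.
    """
    seen = set()
    unique = []
    for msg in messages:
        key = (msg["device_id"], msg["ts"])
        if key in seen:
            continue  # duplicate delivery, ignore it
        seen.add(key)
        unique.append(msg)
    return sorted(unique, key=lambda m: (m["device_id"], m["ts"]))


received = [
    {"device_id": "dev-1", "ts": 30, "crashes": 2},
    {"device_id": "dev-1", "ts": 10, "crashes": 0},  # arrived out of order
    {"device_id": "dev-1", "ts": 30, "crashes": 2},  # delivered twice
    {"device_id": "dev-1", "ts": 20, "crashes": 1},
]

for msg in normalize(received):
    print(msg["ts"], msg["crashes"])
```

This assumes the device clock is monotonic within a boot. If it is not, a per-boot sequence number would serve the same purpose.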

And in my experience some kind of correlation ID (like the boot ID mentioned in the article) is extremely helpful. This way static information (software version, configuration, …) needs to be sent only once, and the heartbeat messages can then be correlated via the boot ID alone.
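
A minimal sketch of that correlation on the backend side, with hypothetical field names (`boot_id`, `sw_version`, `config`) and made-up values: the static record is sent once per boot, and every heartbeat carries only the boot ID.

```python
# Static information, sent once per boot and keyed by boot ID.
boot_info = {
    "boot-01": {"sw_version": "1.2.0", "config": "default"},
}

# Heartbeats carry only the boot ID plus the metrics themselves.
heartbeats = [
    {"boot_id": "boot-01", "ts": 100, "crashes": 0},
    {"boot_id": "boot-01", "ts": 200, "crashes": 1},
    {"boot_id": "boot-02", "ts": 50, "crashes": 0},  # boot record never arrived
]


def enrich(heartbeats, boot_info):
    """Attach the static per-boot fields to each heartbeat via its boot ID.

    Heartbeats whose boot record is unknown are passed through unchanged.
    """
    return [{**boot_info.get(hb["boot_id"], {}), **hb} for hb in heartbeats]


for row in enrich(heartbeats, boot_info):
    print(row)
```

The join happens at analysis time, so the per-heartbeat payload stays small.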

Regarding the heartbeat duration: Shouldn’t the timestamp be enough to derive the heartbeat duration? Why do you need to send it explicitly with every heartbeat?
