Tracking Fleet Health with Heartbeat Metrics | Interrupt

Interesting read. Though I wonder about the heartbeat reset vs. continuous argument and whether reset really is always better.
First, for analysis: with good analytics tooling (e.g. Splunk) you can easily convert the data from one form to the other, so it doesn’t really make a difference for analysis.
Second, on an unreliable connection with potential message loss, continuous data has the advantage that the total is still correct, whereas with reset values every lost in-between message means the total will be off.
Message drops with continuous values lead to resolution loss, whereas message drops with reset values lead to data loss.
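
To make the difference concrete, here is a minimal Python sketch with made-up numbers. The metric values, the list of lost messages, and the helper names (`to_totals`, `to_deltas`, `drop`) are all assumptions for illustration, not anything from the article.

```python
from itertools import accumulate

# Hypothetical per-interval counts (e.g. bytes sent) for five heartbeats.
deltas = [10, 20, 30, 40, 50]


def to_totals(interval_values):
    """Reset-style values -> continuous (running total) values."""
    return list(accumulate(interval_values))


def to_deltas(totals):
    """Continuous values -> reset-style values (difference to the previous total)."""
    return [cur - prev for prev, cur in zip([0] + totals, totals)]


def drop(msgs, lost_indices):
    """Simulate message loss by removing the heartbeats at the given positions."""
    return [m for i, m in enumerate(msgs) if i not in lost_indices]


reset_msgs = list(deltas)            # each heartbeat carries one interval's value
continuous_msgs = to_totals(deltas)  # each heartbeat carries the total since boot

lost = {1, 2}  # the second and third heartbeat never arrive

true_total = sum(deltas)
reset_total = sum(drop(reset_msgs, lost))           # the lost intervals are gone for good
continuous_total = drop(continuous_msgs, lost)[-1]  # the last received total is still right

print(true_total, reset_total, continuous_total)  # 150 100 150
```

With continuous values I only lose the information about how the 90 units between the first and fourth heartbeat were distributed. The caveat is that the running total starts over on reboot, so it has to be interpreted per boot.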

Unreliable communication is also a reason to include some kind of timestamp. Is message ordering guaranteed? Can messages be duplicated? Timestamps help with all of these issues.
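
As a rough sketch of what I mean: on the receiving side, a device timestamp is enough to drop duplicates and restore ordering. The field names (`device_id`, `ts`, `value`) are hypothetical, and this assumes each device never emits two different heartbeats with the same timestamp.

```python
def normalize(messages):
    """Drop duplicate heartbeats and return the rest ordered by device timestamp."""
    seen = set()
    unique = []
    for msg in messages:
        key = (msg["device_id"], msg["ts"])  # assumed to identify one heartbeat
        if key in seen:
            continue  # same heartbeat delivered twice
        seen.add(key)
        unique.append(msg)
    return sorted(unique, key=lambda m: (m["device_id"], m["ts"]))


# Heartbeats as they might arrive: out of order, with one duplicate.
received = [
    {"device_id": "A", "ts": 200, "value": 7},
    {"device_id": "A", "ts": 100, "value": 3},
    {"device_id": "A", "ts": 200, "value": 7},
    {"device_id": "A", "ts": 300, "value": 9},
]

cleaned = normalize(received)
print([m["ts"] for m in cleaned])  # [100, 200, 300]
```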

In my experience, some kind of correlation ID (like the mentioned boot ID) is extremely helpful. This way static information (software version, configuration, …) needs to be sent only once, and the heartbeat messages can then be easily correlated via the boot ID alone.
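
A minimal sketch of that correlation on the backend, assuming a hypothetical message layout in which a one-time boot message carries the static fields and every heartbeat carries only `boot_id` plus its metrics:

```python
# Static information per boot, filled from the one-time boot message.
boot_records = {}


def record_boot(msg):
    """Store the static fields of a boot message under its boot ID."""
    boot_records[msg["boot_id"]] = {k: v for k, v in msg.items() if k != "boot_id"}


def enrich(heartbeat):
    """Attach the static fields of the matching boot to a heartbeat, if known."""
    static = boot_records.get(heartbeat["boot_id"], {})
    return {**static, **heartbeat}


# Sent once after boot (all field names and values are made up).
record_boot({"boot_id": "b-42", "sw_version": "1.2.3", "hw_rev": "C"})

# Sent with every heartbeat: small, no static data repeated.
enriched = enrich({"boot_id": "b-42", "ts": 3600, "battery_mv": 3700})
print(enriched["sw_version"], enriched["battery_mv"])  # 1.2.3 3700
```

If the boot message itself gets lost, the heartbeats stay uncorrelated in this sketch, so in practice it would probably need to be retried or re-sent occasionally.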

Regarding the heartbeat duration: shouldn’t the timestamp be enough to derive the heartbeat duration? Why do you need to send it explicitly with every heartbeat?
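
What I have in mind is roughly the following sketch, which takes the duration of a heartbeat as the gap to the previous timestamp. The timestamps are made-up seconds since boot and the function name is hypothetical.

```python
def derive_durations(timestamps):
    """Duration of each heartbeat as the gap to the previous heartbeat's timestamp."""
    ordered = sorted(timestamps)
    return [cur - prev for prev, cur in zip(ordered, ordered[1:])]


# Hourly heartbeats, with the one at 10800 s missing.
timestamps = [0, 3600, 7200, 14400]

print(derive_durations(timestamps))  # [3600, 3600, 7200]
```

One thing this does show: the first heartbeat has no predecessor, and a gap of twice the usual interval may just mean a heartbeat was lost, in which case a reset-style value would cover only part of the derived gap.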