I can tell you’ve thought a lot about this problem! Thanks for the comment. Happy to provide my thoughts.
First, for analysis: when using good analytics tooling (e.g. Splunk), you can easily transform the data from one form to the other. So it doesn’t really make a difference for analysis.
Those tools are expensive, though. For many hardware companies, budgets are tight. Also, I genuinely don’t know, but do you know how well Splunk would work doing this sort of transformation + analysis for 1k or 1M devices?
And with an unreliable connection and potential message loss, continuous data has the advantage that the total sum is still correct, whereas with resets, every packet lost in between means the total sum will be off.
Constant monitoring, especially around debugging and visibility, is less about getting 100% accurate summations and more about quick-and-dirty estimates and spotting changes in the rate or frequency of issues. I don’t think “monitoring” is a good way to keep track of critical values. Those should, as you say, be stored on the device in a way that ensures the counts are never accidentally lost to power loss or crashes, then sent up periodically.
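To make the trade-off concrete, here is a minimal Python sketch. The numbers are made up, and the helper names (`to_cumulative`, `to_deltas`) are hypothetical. It assumes a device that counts 5 events per heartbeat and loses two of six heartbeats in transit.

```python
# Hypothetical data: 5 events counted in each of 6 heartbeat windows.
per_interval = [5, 5, 5, 5, 5, 5]


def to_cumulative(deltas):
    """Turn reset-every-heartbeat counts into a running total."""
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out


def to_deltas(cumulative):
    """Turn a running total back into per-heartbeat counts."""
    prev, out = 0, []
    for c in cumulative:
        out.append(c - prev)
        prev = c
    return out


cumulative = to_cumulative(per_interval)

# Assume heartbeats 2 and 4 (zero-indexed) never reached the backend.
received = [0, 1, 3, 5]

# With resets, the lost heartbeats are simply missing from the sum.
delta_total = sum(per_interval[i] for i in received)

# With a running total, the latest received value already includes them.
cumulative_total = max(cumulative[i] for i in received)

# The per-heartbeat rate from the reset-style data is unaffected by the loss.
delta_rate = delta_total / len(received)

print(delta_total, cumulative_total, delta_rate)
```

The reset-style total comes up short, while the running total and the per-heartbeat rate both survive the loss. That is why I care more about rates for monitoring and would keep the must-not-lose totals in persistent storage on the device.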
Re timestamps, if devices have a stable Internet connection and are pinging the time servers directly, then much of this article isn’t applicable. With some of the hardware products I’ve worked on in the past, we would go days or weeks without syncing to a time server (through a mobile app over BLE), so our time drift would be substantial. And when we did sync, the timestamps would jump forward or backward depending on the drift. It’s nice to send a best-effort timestamp as well, but at previous organizations we rarely used it in our calculations, and when we did, it was a rough estimate rather than a source of truth.
And in my experience some kind of correlation id (like the mentioned boot ID) is extremely helpful. This way static information (software version, configuration, …) need only be sent once and the heartbeat messages then can be easily correlated only via the boot ID.
YES! I’m a huge proponent of boot IDs. It’s actually something I mentioned in the article, but I did not mention it as a way to de-duplicate or save space on data.
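As a rough sketch of that idea in Python: the static information is sent once per boot, and each heartbeat carries only the boot ID. All field names and values here (`boot_id`, `software_version`, `hw_revision`, `crashes`) are hypothetical, not taken from any particular product.

```python
# Sent once per boot: static information keyed by boot ID (hypothetical fields).
boot_records = {
    "boot-a1": {"software_version": "1.2.0", "hw_revision": "B"},
    "boot-b7": {"software_version": "1.3.0", "hw_revision": "B"},
}

# Sent every heartbeat: only the boot ID plus the metrics themselves.
heartbeats = [
    {"boot_id": "boot-a1", "crashes": 0},
    {"boot_id": "boot-a1", "crashes": 1},
    {"boot_id": "boot-b7", "crashes": 0},
    {"boot_id": "boot-zz", "crashes": 2},  # boot record was lost in transit
]


def enrich(heartbeats, boot_records):
    """Attach the static per-boot fields to each heartbeat via its boot ID."""
    enriched = []
    for hb in heartbeats:
        # A missing boot record leaves the heartbeat usable, just less detailed.
        static = boot_records.get(hb["boot_id"], {})
        enriched.append({**static, **hb})
    return enriched


for row in enrich(heartbeats, boot_records):
    print(row)
```

The join happens on the backend, so the device never repeats the version or configuration in every message. A heartbeat whose boot record was lost still carries its metrics.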
Regarding the heartbeat duration: Shouldn’t the timestamp be enough to derive the heartbeat duration from it? Why do you need to send it explicitly with every heartbeat?
If your device has guaranteed time, of course! Many, many do not, so that is where this take on heartbeat metrics comes into play. By recording a constant window of time, measured by a stable crystal on the device, a developer can ensure they captured a stable amount of time for all heartbeats and across all devices.
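Here is a small Python sketch of why the explicit duration helps. It assumes each heartbeat reports a hypothetical `duration_ms` measured by the on-device clock, a best-effort `wall_clock_s`, and a count such as `resets`. The values are invented for illustration.

```python
MS_PER_HOUR = 3_600_000


def rate_per_hour(count, duration_ms):
    """Normalize a count to events per hour using the device-measured window."""
    return count * MS_PER_HOUR / duration_ms


# Hypothetical heartbeats from one device, in the order they were recorded.
heartbeats = [
    {"duration_ms": 3_600_000, "wall_clock_s": 1_000, "resets": 2},
    {"duration_ms": 3_600_000, "wall_clock_s": 4_600, "resets": 1},
    # Cut short by a reboot, and the wall clock jumped backward after a sync.
    {"duration_ms": 1_800_000, "wall_clock_s": 4_000, "resets": 1},
]

rates = [rate_per_hour(hb["resets"], hb["duration_ms"]) for hb in heartbeats]

# Deriving the window from timestamps instead gives nonsense after the jump.
wall_clock_gaps = [
    b["wall_clock_s"] - a["wall_clock_s"]
    for a, b in zip(heartbeats, heartbeats[1:])
]

print(rates)
print(wall_clock_gaps)
```

The last timestamp gap comes out negative, so it can’t serve as a duration. The device-measured window still gives a sensible rate even for the heartbeat that was cut short.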
I’d be curious to hear more about how you use Splunk and what kinds of queries and transformations you are performing on the data!