Old heartbeats are detected by including a “serial number” in each one. This number starts at a low value and increments with each new heartbeat. When a heartbeat is received, its serial number is compared with the highest number already received from that host. If it is a little lower, the heartbeat is discarded as stale. If it is a lot lower, it is accepted, to allow for rsfmon having been restarted on the remote node (a restart sets the number back to a low value).
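The serial number check described above can be sketched roughly as follows. This is only an illustration, not rsfmon's actual code: the class name, the `restart_gap` threshold and its value of 1000 are all assumptions, since the text does not say how much lower counts as "a lot". Accepting a serial equal to the current highest is also an assumption; the text only says that a lower one is discarded.

```python
class SerialFilter:
    """Hypothetical per-host stale heartbeat filter (illustrative only)."""

    def __init__(self, restart_gap=1000):
        # restart_gap is an assumed threshold: a serial at least this far
        # below the highest seen is treated as a remote restart.
        self.restart_gap = restart_gap
        self.highest = {}  # host name -> highest serial accepted so far

    def accept(self, host, serial):
        """Return True if the heartbeat should be used, False if stale."""
        highest = self.highest.get(host)
        if highest is None or serial >= highest:
            # First heartbeat from this host, or not older than the newest.
            self.highest[host] = serial
            return True
        if highest - serial >= self.restart_gap:
            # A lot lower: assume the remote side restarted and begin
            # tracking from the new, low serial number.
            self.highest[host] = serial
            return True
        # A little lower: an old heartbeat, so discard it.
        return False
```

With these assumed values, a heartbeat numbered 4999 arriving after 5001 would be discarded, while one numbered 3 would be accepted as a restart and would become the new baseline for that host.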
Disc heartbeats have an extra check: the first disc heartbeat read on every disc heartbeat channel is always discarded, and only when the heartbeat starts to change is it accepted for further checking. (There is a disc_check heartbeat state, which may occasionally be seen very briefly on startup, which means that the first change is being awaited.)

All heartbeat information is then used to update the UP/DOWN state of the heartbeats. One of those heartbeats, with the highest available serial number, is then examined to extract the state of services on the sending node. Thus if disc heartbeats are being delayed by one heartbeat poll time, they will not be used to check the state of a remote machine as long as a more recent heartbeat is available on another channel. If all other heartbeats are lost, there will be a small delay until the incoming disc heartbeats “catch up” with the latest known remote state before the new state information is used. A newly written heartbeat whose serial number is low enough to look like a remote rsfmon restart can defeat this.
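As a rough sketch of the two steps above (the first-read discard on disc channels, then choosing the heartbeat with the highest serial number), consider the following. It is a simplified model and not rsfmon's implementation: `DiscChannel`, `freshest` and the "up" state label are hypothetical names, and only the disc_check state name comes from the description above.

```python
class DiscChannel:
    """Hypothetical model of one disc heartbeat channel (illustrative only)."""

    def __init__(self):
        self.first_serial = None
        self.state = "disc_check"  # waiting for the first change

    def read(self, serial):
        """Return the serial if it may be checked further, else None."""
        if self.first_serial is None:
            # The first heartbeat read is always discarded.
            self.first_serial = serial
            return None
        if self.state == "disc_check":
            if serial == self.first_serial:
                # Still the same block on disc: nothing has changed yet.
                return None
            self.state = "up"  # assumed label for "change seen"
        return serial


def freshest(serials_by_channel):
    """Pick the channel holding the highest usable serial, or None."""
    usable = {ch: s for ch, s in serials_by_channel.items() if s is not None}
    if not usable:
        return None
    return max(usable, key=usable.get)
```

In this model a disc channel that lags one poll behind a network channel is never chosen while the network channel is delivering, which matches the behaviour described: `freshest({"net": 43, "disc": 42})` picks `"net"`, and the disc channel is only used once the others have nothing to offer.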