Why do serial heartbeats stop working?

There is a known problem with serial heartbeats in some OpenSolaris-derived operating systems. After serial heartbeats have been in use for some time, the serial interface stops working and the heartbeats go down. When this happens, the only way to restore the heartbeats is to reboot the server; restarting RSF-1 to close and reopen the ports does not work.

So far this has been seen in NCP (Nexenta version 3) and OmniOS. The problem appears to have been fixed in the Illumos kernel.

There is no RSF-1 fix for this problem, since it is a kernel issue. If you see this problem, we recommend using a different heartbeat mechanism. We always recommend using at least two types of heartbeat, so in this situation network and disk heartbeats should be used.

Last update: 2013-09-20 13:36
Author: Paul Griffiths
