ID #1069

What happens to the Disk Heartbeats when the Network Interface is removed?

This page describes an issue in which the reporting of disk heartbeats suffers when a network interface is taken down: the heartbeats on the remote node can no longer be read by rsfcli.

Background

RSF-1 uses a number of different heartbeat channels to communicate between cluster members, including network, serial and shared disk. Heartbeats are small packets of information exchanged between two or more RSF-1 servers. They are used to determine inter-cluster communication, to share and compare status information and service configuration, and to provide cluster-wide administration capability.

The RSF-1 command line interface, rsfcli, communicates with other cluster members over the network only (default TCP port 1195). If a network interface or network fabric fails on any particular cluster member, and other heartbeat mechanisms are in use (e.g. shared disk), then heartbeat communication is maintained and the cluster remains healthy. If there are network issues, however, rsfcli may fail to communicate directly with some or all cluster nodes, and some administrative functions will be restricted.

The Problem

Consider the following scenario. Two nodes (NodeA and NodeB) comprise a cluster. The heartbeat configuration includes four disk heartbeats and one network heartbeat, giving ten heartbeats in total: five on NodeA and five on NodeB.

On NodeA you can issue an rsfcli status command which prints the status of all nodes, services and heartbeats in the cluster. It does this in the following way:

  •     Contacts NodeA via a network connection and gets the status of the heartbeats on that node.
  •     Contacts NodeB via a network connection and gets the status of the heartbeats as seen by NodeB. We do not depend on NodeA's interpretation of the heartbeats on NodeB because, although NodeA writes them, they may not actually be being read by NodeB. Therefore we need to communicate with NodeB to receive its interpretation of its own heartbeats.

Now consider the scenario where the network interface is removed. This automatically results in the two network heartbeats being removed. However, it also leaves rsfcli unable to communicate with the remote node (in this case NodeB), as this communication takes place via the network interface. As there is no possibility of communication, rsfcli will report that there are no heartbeats active on the remote node, despite that not necessarily being the case.

So, in our example, running rsfcli on NodeA after removing the network interface will return the four disk heartbeats present on NodeA. It will not return the two network heartbeats, which have been removed along with the network interface. Nor will it return the four disk heartbeats present on the remote node, which cannot be accessed because the communication path for that access is the now removed network interface.
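
The counting in this scenario can be illustrated with a small toy model. This is only a sketch of the reasoning above, not RSF-1 code: the function names, the heartbeat labels and the idea of a single network_up flag are all hypothetical, and it assumes a node can always query itself.

```python
# Toy model of the scenario; not RSF-1 code, every name here is hypothetical.

def heartbeats_on(node, network_up):
    """Heartbeats a node reports for itself: 4 disk, plus 1 network if the interface is up."""
    beats = [f"{node}-disk{i}" for i in range(1, 5)]
    if network_up:
        beats.append(f"{node}-net")
    return beats


def status_seen_from(local, nodes, network_up):
    """Collect each node's own heartbeat report, as a status query run on `local` would."""
    report = {}
    for node in nodes:
        if node == local or network_up:
            # The node can be contacted and gives its own view of its heartbeats.
            report[node] = heartbeats_on(node, network_up)
        else:
            # Remote node cannot be contacted, so nothing is known about its heartbeats.
            report[node] = None
    return report


nodes = ["NodeA", "NodeB"]
before = status_seen_from("NodeA", nodes, network_up=True)
after = status_seen_from("NodeA", nodes, network_up=False)

print(sum(len(beats) for beats in before.values()))  # 10
print(len(after["NodeA"]), after["NodeB"])           # 4 None
```

The None entry for NodeB stands for "no report available", which is different from "all heartbeats are down": the disk heartbeats on NodeB may still be working.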

If you were to log on to the remote node (NodeB) and run rsfcli there, you would again see four disk heartbeats, but this time they would be the four disk heartbeats present on NodeB. The following is an example of a line returned by rsfcli in this scenario:

3 disc zx4000-1-rack5-6fl -> zx4000-2-rack5-6fl (via /dev/rdsk/c10t6018d33s0:518,/dev/rdsk/c10t6018d33s0:512): Down (no response from zx4000-2-rack5-6fl)

The crucial part is "no response from zx4000-2-rack5-6fl", which records the fact that rsfcli could not obtain any response from the remote node and therefore cannot provide any report on its heartbeats.
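
When post-processing saved status output, it can be useful to separate lines that mean "remote node unreachable" from lines that report a genuinely failed heartbeat. The following is a minimal sketch, assuming the output always follows the shape of the single example line above; the pattern and the helper name unreachable_node are hypothetical and are not part of RSF-1.

```python
import re

# Pattern inferred from the one example line above; the real output format may differ.
NO_RESPONSE = re.compile(r"Down \(no response from (?P<node>[^)]+)\)\s*$")


def unreachable_node(line):
    """Return the node name if the line reports 'no response', else None."""
    match = NO_RESPONSE.search(line)
    return match.group("node") if match else None


line = ("3 disc zx4000-1-rack5-6fl -> zx4000-2-rack5-6fl "
        "(via /dev/rdsk/c10t6018d33s0:518,/dev/rdsk/c10t6018d33s0:512): "
        "Down (no response from zx4000-2-rack5-6fl)")

print(unreachable_node(line))  # zx4000-2-rack5-6fl
```

A non-None result indicates that the heartbeat state on that node is unknown from here, so the heartbeats should be checked by running rsfcli on that node directly.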

Last update: 2014-02-04 22:20
Author: Paul Foster
Revision: 1.11