RSF-1 can be configured to create a cluster of any number of nodes. HA services can be allowed to run on any node, or a subset of nodes, with simple changes to the configuration file. The most important consideration when creating a multi-node cluster is the heartbeat configuration.
Two node cluster example
The simplest and most common configuration is a two node cluster. An example config file is shown below:
# Optional global defaults & definitions
#
CLUSTER_DESC RSF-1 HA-Cluster
CLUSTER_NAME HA-Cluster
DISC_HB_BACKOFF 20,3,3600
POLL_TIME 1
REALTIME 1
IPDEVICE_MONITOR 3,2

# Machines section
MACHINE nodea
  NET nodeb
  DISC nodeb /dev/rdsk/c0t0d0s0:512:518
  DISC nodeb /dev/rdsk/c0t1d0s0:512:518
MACHINE nodeb
  NET nodea
  DISC nodea /dev/rdsk/c0t0d0s0:518:512
  DISC nodea /dev/rdsk/c0t1d0s0:518:512

# Services section
SERVICE tank vip01 "ZFS service"
  OPTION "sdir=appliance ip_up_after=1 ip_down_before=1 zfs_export_fail_reboot=n zfs_mhdc_disable=n"
  INITIMEOUT 20
  RUNTIMEOUT 8
  MOUNT_POINT "/tank"
  SETENV RSF_RESERVATION_DRIVE_1 c0t2d0
  SERVER nodea
    IPDEVICE "net0"
  SERVER nodeb
    IPDEVICE "net0"
Here, two nodes are configured (nodea and nodeb), with three heartbeats between them – one network heartbeat and two disk heartbeats. Each heartbeat appears twice in the Machines section, once under each machine, because every heartbeat is bi-directional.
There is one service (tank) configured, and it is allowed to run on both nodea and nodeb.
Adding a third node
Adding a third node adds some extra complexity to the config file, but only to the Machines section. It is best to start by planning the heartbeat configuration that will work best for your cluster. For this example, we will mirror the two node cluster and configure two disk heartbeats and one network heartbeat between each pair of nodes. Each node needs to be able to detect the other two nodes, so each node must have heartbeats to both of them. The diagram below illustrates this:
In this configuration, there are the same number of heartbeats between each pair of nodes (two disk and one network) as in the two node cluster. Note that this requires six disks for heartbeats. The heartbeat disks should ideally be spread across different pools, to minimise the risk of any one service being left without enough disks for SCSI reservations.
The above diagram has been used to construct an updated config file, which is shown below. In this config file, the various heartbeat definitions have been commented to show how they correspond to the heartbeats in the diagram:
# Optional global defaults & definitions
#
CLUSTER_DESC RSF-1 HA-Cluster
CLUSTER_NAME HA-Cluster
DISC_HB_BACKOFF 20,3,3600
POLL_TIME 1
REALTIME 1
IPDEVICE_MONITOR 3,2

# Machines section
MACHINE nodea
  NET nodeb                                # NET 1: A -> B
  NET nodec                                # NET 2: A -> C
  DISC nodeb /dev/rdsk/c0t0d0s0:512:518    # DISK 1: A -> B
  DISC nodeb /dev/rdsk/c0t1d0s0:512:518    # DISK 2: A -> B
  DISC nodec /dev/rdsk/c0t2d0s0:512:518    # DISK 3: A -> C
  DISC nodec /dev/rdsk/c0t3d0s0:512:518    # DISK 4: A -> C
MACHINE nodeb
  NET nodea                                # NET 1: B -> A
  NET nodec                                # NET 3: B -> C
  DISC nodea /dev/rdsk/c0t0d0s0:518:512    # DISK 1: B -> A
  DISC nodea /dev/rdsk/c0t1d0s0:518:512    # DISK 2: B -> A
  DISC nodec /dev/rdsk/c0t4d0s0:512:518    # DISK 5: B -> C
  DISC nodec /dev/rdsk/c0t5d0s0:512:518    # DISK 6: B -> C
MACHINE nodec
  NET nodea                                # NET 2: C -> A
  NET nodeb                                # NET 3: C -> B
  DISC nodea /dev/rdsk/c0t2d0s0:518:512    # DISK 3: C -> A
  DISC nodea /dev/rdsk/c0t3d0s0:518:512    # DISK 4: C -> A
  DISC nodeb /dev/rdsk/c0t4d0s0:518:512    # DISK 5: C -> B
  DISC nodeb /dev/rdsk/c0t5d0s0:518:512    # DISK 6: C -> B

# Services section
SERVICE tank vip01 "ZFS service"
  OPTION "sdir=appliance ip_up_after=1 ip_down_before=1 zfs_export_fail_reboot=n zfs_mhdc_disable=n"
  INITIMEOUT 20
  RUNTIMEOUT 8
  MOUNT_POINT "/tank"
  SETENV RSF_RESERVATION_DRIVE_1 c0t6d0
  SERVER nodea
    IPDEVICE "net0"
  SERVER nodeb
    IPDEVICE "net0"
  SERVER nodec
    IPDEVICE "net0"
It can be seen that only a minimal change had to be made to the service definition in order to allow it to run on a third server. The main change is in the Machines section: a new machine (nodec) has been defined, and all three machines now have six heartbeats configured – one network and two disk heartbeats to each of the other servers.
The block offsets in the disk heartbeat lines tell the machine where on the disk to write heartbeat information, and where to read it from. For the heartbeats to work correctly, one machine must read from block offset 512 and write to 518, and the other machine must do the opposite. This means that after the disk name in the ‘DISC’ line, one server should have 512:518, and the other should have 518:512.
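For example, taking DISK 1 from the config above, the paired lines look like this – the offsets are mirrored, so each node writes where its partner reads:

MACHINE nodea
  DISC nodeb /dev/rdsk/c0t0d0s0:512:518
MACHINE nodeb
  DISC nodea /dev/rdsk/c0t0d0s0:518:512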
It is easy to see how more nodes could be added to this cluster. The main consideration is that each node must heartbeat to every other node in the cluster. Of course, if more nodes are added, more disk heartbeats will be needed, and so fewer disks will be available for SCSI reservations. If there are a large number of disks in the system, this should not be a problem.
Number of disks required for heartbeats
Where each node has two disk heartbeats to every other node, the number of disks required for heartbeats is n x (n-1), where n is the number of nodes in the cluster.
This means a 3 node cluster needs 6 heartbeat disks as above, a 4 node cluster needs 12, a 5 node cluster needs 20, etc.
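The arithmetic can be checked with a short sketch (the function name is illustrative, not part of RSF-1):

```python
def heartbeat_disks(nodes, disks_per_pair=2):
    """Disks needed when every pair of nodes shares `disks_per_pair`
    dedicated heartbeat disks: disks_per_pair * n*(n-1)/2 in total."""
    return disks_per_pair * nodes * (nodes - 1) // 2

# With two disk heartbeats per pair, this reproduces n x (n-1):
for n in (2, 3, 4, 5):
    print(f"{n} nodes -> {heartbeat_disks(n)} heartbeat disks")
```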
This may not be necessary in some cases. Consider for example a three node cluster with two services. Service 1 is allowed to run on node A and node B, but not node C. Service 2 is allowed to run on node A and node C, but not node B. In this case there is no need for heartbeats between nodes B and C, because there is no common service shared by these two nodes.
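As a sketch of that scenario (device paths are hypothetical), the Machines section could omit the heartbeats between nodeb and nodec entirely, reducing the disk heartbeat count from six to four:

MACHINE nodea
  NET nodeb
  NET nodec
  DISC nodeb /dev/rdsk/c0t0d0s0:512:518
  DISC nodeb /dev/rdsk/c0t1d0s0:512:518
  DISC nodec /dev/rdsk/c0t2d0s0:512:518
  DISC nodec /dev/rdsk/c0t3d0s0:512:518
MACHINE nodeb
  NET nodea
  DISC nodea /dev/rdsk/c0t0d0s0:518:512
  DISC nodea /dev/rdsk/c0t1d0s0:518:512
MACHINE nodec
  NET nodea
  DISC nodea /dev/rdsk/c0t2d0s0:518:512
  DISC nodea /dev/rdsk/c0t3d0s0:518:512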
Note that these numbers of heartbeats are for the entire cluster, not just per service.
Choosing which disks to use for heartbeats
RSF-1 uses special disk block offsets to allow it to heartbeat through data disks without corrupting data. This means it is not necessary to have separate heartbeat disks.
When selecting disks, if possible, single points of failure should be avoided. For example, if there are multiple disk arrays being used by services in the cluster, then disk heartbeats should be spread across all of the arrays, to avoid one array failure bringing down all disk heartbeats.
It should also be noted that a disk cannot be used simultaneously for SCSI reservations and heartbeats, so some disks must remain available in each service for SCSI reservations. We recommend reserving enough disks that the pool would be marked as faulted should those disks become unavailable. For example, if a pool is made up of a number of RAIDZ vdevs, you should reserve at least two disks from a single RAIDZ vdev. That way, if those two disks became unavailable (causing the reservations to stop working), there would still be no chance of a split brain, because the pool would be faulted.
The same consideration applies to serial heartbeats as to disk heartbeats. If serial heartbeats are configured in the cluster, and you want each node to heartbeat to every other node, then more serial links will be required. Because serial links are point-to-point, each node in a three node cluster needs two serial ports, one for each of the other nodes.
Network heartbeat configuration is not restricted by hardware in the way that disk and serial heartbeats are. Only one network interface is required per node (assuming that they are attached to an ethernet switch), and only access to one subnet is required, to allow heartbeats to any number of other nodes. In the above example, net 1, net 2 and net 3 are all on the same subnet, and are using a single interface on each node (net0).
If each node has a second interface, that could be used for additional heartbeats by adding more NET lines to the config file – this time with the IP address of the second interface of the machine to heartbeat to (i.e. NET nodea <IP addr>). These extra network interfaces should be connected to a second switch to avoid having the switch as a single point of failure.
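As a hypothetical sketch (the IP address is invented for illustration), nodea's heartbeats to nodeb over both interfaces might look like:

MACHINE nodea
  NET nodeb               # heartbeat via the default interface and first switch
  NET nodeb 192.168.2.12  # extra heartbeat to nodeb's second interface, via the second switch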