Change in the use of SCSI reservations

When a clustered service is created, a number of drives are selected as heartbeat drives. In versions of RSF-1 prior to 3.6.5, these drives were also used as SCSI-2 reservation drives. Before a node imports a pool, it forcibly takes ownership of the pool by issuing SCSI-2 reservations to the heartbeat drives. If the pool is already imported by another node, that other node’s failfast driver causes a kernel panic, which ensures data integrity. After waiting for 100 ms, the first node releases the SCSI-2 reservations, starts failfast and imports the pool. The SCSI-2 reservations are released before the pool import so that the disk heartbeats can function correctly: a reserved heartbeat drive would not be visible to a node that does not have ownership, so that node's heartbeat reads and writes would fail.

An edge case has been identified that could cause a pool to be imported on more than one node. If the second node is temporarily frozen while the first node issues and releases the SCSI-2 reservations, its failfast mechanism in effect misses the reservation window and does not trigger a panic. The pool can therefore end up imported on both nodes when the frozen node comes back to life.

[Figure: Old method]

Note that this is an unusual scenario where a node freezes for a finite time, and then resumes without a panic. The node has to freeze in such a way that it even stops responding to pings, so it behaves as if it were down.

To enhance data security, the import sequence has been altered. In this new version of RSF-1, both heartbeat drives and reservation drives are specified when a service is created. The reservation drives must now be different from the heartbeat drives. Before a node imports a pool, it takes ownership of the pool by issuing a SCSI-2 reservation to each of the reservation drives. If this fails, it forcibly takes ownership. As before, this causes any other node that has the pool imported to panic. The first node then starts failfast and imports the pool. The reservations no longer need to be released, since the heartbeats use different drives.

Using this method, the reservations are persistent, so if the second node temporarily freezes, the first node will take ownership of the pool, leaving SCSI-2 reservations in place. When the frozen node comes back to life, its failfast will immediately see the reservations and panic the kernel, thereby avoiding a possible split brain.

[Figure: New method]
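
The difference between the two methods can be illustrated with a toy timeline model. This is only a sketch and not RSF-1 code: the function name failfast_outcome and its arguments are hypothetical, all times are in milliseconds, and it assumes that a running failfast notices a reservation the instant one is in place (a real driver checks periodically).

```shell
#!/bin/sh
# Toy model only: not part of RSF-1.
# failfast_outcome METHOD RESERVE_AT FREEZE_FROM FREEZE_TO
#   METHOD       "old" (reservation released after 100 ms) or "new" (kept)
#   RESERVE_AT   time at which node A places its SCSI-2 reservation
#   FREEZE_FROM  time at which node B freezes
#   FREEZE_TO    time at which node B resumes
# Prints "panic" if node B's failfast sees the reservation, or
# "split-brain" if node B resumes without ever seeing it.
failfast_outcome() {
    method=$1 reserve_at=$2 freeze_from=$3 freeze_to=$4

    case $method in
        new)
            # Persistent reservation: still in place whenever B resumes.
            echo panic
            return 0
            ;;
        old)
            release_at=$((reserve_at + 100))
            ;;
        *)
            echo "unknown method: $method" >&2
            return 2
            ;;
    esac

    # Transient reservation: B only sees it if B is running at some point
    # between reserve_at and release_at.
    if [ "$freeze_from" -le "$reserve_at" ] && [ "$freeze_to" -ge "$release_at" ]; then
        echo split-brain    # B was frozen for the whole reservation window
    else
        echo panic
    fi
}

failfast_outcome old 1000 900 5000    # frozen across the window
failfast_outcome new 1000 900 5000    # same freeze, persistent reservation
failfast_outcome old 1000 1050 5000   # froze only after the reservation was placed
```

The first call models the edge case described above, and the second shows the same freeze with the reservation left in place.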

Note that both the heartbeat drives and the reservation drives can still be used for data storage within a volume – no dedicated quorum drives are required. The change here is that a reservation drive cannot also be used as a heartbeat drive.

Enabling persistent reservations

To enable persistent reservations, first identify the drives in each pool using zpool status <pool>. For example, for a pool named vol01:

root@nodea:~# zpool status vol01
  pool: vol01
 state: ONLINE
 scan: none requested

        NAME                       STATE     READ WRITE CKSUM
        vol01                      ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            c0t20000004CFF30A65d0  ONLINE       0     0     0
            c0t2000000C500DF02Ed0  ONLINE       0     0     0
          mirror-1                 ONLINE       0     0     0
            c0t2000000C5098DE7Ad0  ONLINE       0     0     0
            c0t2000000C50A2AFC4d0  ONLINE       0     0     0
          mirror-2                 ONLINE       0     0     0
            c0t2000000C50A2F010d0  ONLINE       0     0     0
            c0t2000000C50AD31CCd0  ONLINE       0     0     0

errors: No known data errors

In the above disc list, it is first necessary to identify the cluster heartbeat drives. Do this using the following grep:

root@nodea:~# grep "^ *\<DISC\>" /opt/HAC/RSF-1/etc/config
 DISC nodeb /dev/rdsk/c0t20000004CFF30A65d0s0:518:512 TAG vol01
 DISC nodeb /dev/rdsk/c0t2000000C50A2AFC4d0s0:518:512 TAG vol01
 DISC nodeb /dev/rdsk/c0t2000000C50AD3243d0s0:518:512 TAG vol02
 DISC nodeb /dev/rdsk/c0t2000000C50DAC9F5d0s0:518:512 TAG vol02
 DISC nodea /dev/rdsk/c0t20000004CFF30A65d0s0:512:518 TAG vol01
 DISC nodea /dev/rdsk/c0t2000000C50A2AFC4d0s0:512:518 TAG vol01
 DISC nodea /dev/rdsk/c0t2000000C50AD3243d0s0:512:518 TAG vol02
 DISC nodea /dev/rdsk/c0t2000000C50DAC9F5d0s0:512:518 TAG vol02

In the above output, two drives are in use for disc heartbeating for vol01 (paths are listed in the configuration file for each node because controller numbers can vary between nodes, so node-specific paths are used). Removing these drives from our list leaves the remaining drives c0t2000000C500DF02Ed0, c0t2000000C5098DE7Ad0, c0t2000000C50A2F010d0 and c0t2000000C50AD31CCd0. Select at least one of these drives and declare it in the file /opt/HAC/RSF-1/etc/.res_drives.<pool name> (this needs to be done for each clustered pool). In our example the file is /opt/HAC/RSF-1/etc/.res_drives.vol01 and it contains:


After the .res_drives.<pool name> files have been created and distributed to the remote nodes, the new functionality is enabled the next time the service is started. Note that if the service is already running, it should be stopped before adding the .res_drives.<pool name> file. To minimise downtime, create the .res_drives file on the passive node (nodeb), fail over the service to nodeb, create the file on nodea, and then fail back to nodea.
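
The drive selection described earlier (pool members minus heartbeat drives) can be sketched as a small shell function. This is a minimal sketch, not the script supplied with RSF-1: the function name candidate_res_drives is hypothetical, it assumes the zpool status and DISC line layouts shown in the example above, and it assumes device names are the same on every node, as they are in that example. It only lists candidates and does not write the .res_drives file.

```shell
#!/bin/sh
# Sketch only: not the script shipped with RSF-1.
# candidate_res_drives POOL ZPOOL_STATUS_FILE RSF_CONFIG_FILE
# Prints the pool member drives that are NOT heartbeat drives for POOL.
candidate_res_drives() {
    pool=$1 status_file=$2 config_file=$3

    # Heartbeat drives: DISC lines tagged with this pool. Strip the
    # directory part and the sN:block:block suffix to get the bare name.
    hb=$(awk -v pool="$pool" '
        $1 == "DISC" && $(NF-1) == "TAG" && $NF == pool {
            dev = $3
            sub(/^.*\//, "", dev)
            sub(/s[0-9]+:.*$/, "", dev)
            print dev
        }' "$config_file" | LC_ALL=C sort -u)

    # Pool members: first column of zpool status lines that look like
    # cXtYdZ device names; then drop every name that is a heartbeat drive.
    awk '$1 ~ /^c[0-9]+t[0-9A-Fa-f]+d[0-9]+$/ { print $1 }' "$status_file" |
        LC_ALL=C sort -u |
        grep -v -x -F -e "$hb"
}

# Example usage on a cluster node (commented out because it needs a live pool):
#   zpool status vol01 > /tmp/vol01.status
#   candidate_res_drives vol01 /tmp/vol01.status /opt/HAC/RSF-1/etc/config
```

Taking saved files as input rather than calling zpool directly keeps the function easy to check against captured output before it is run on a live cluster.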

To avoid errors in creating these text files, and to eliminate the need for failovers, a shell script has been created that handles the above steps by letting the user select a number of reservation drives per clustered volume. After using this script, ensure that the .res_drives.<pool name> files have been distributed to the other node(s) in the cluster. The script does not require any failovers because it disables failfast in the same way that RSF-1 does on service shutdown, creates the .res_drives file and then enables the reservations. A full description is also attached and can be found here.
