ID #1059

Change in the use of SCSI reservations

When a clustered service is created, a number of drives are selected as heartbeat drives. In versions of RSF-1 prior to 3.6.5, these drives were also used as SCSI-2 reservation drives. Before a pool is imported by a node, that node forcibly takes ownership of the pool by issuing SCSI-2 reservations to the heartbeat drives; if the pool is already imported by another node, that other node's failfast driver will cause a kernel panic (thus ensuring data integrity). After waiting 100ms, the first node releases the SCSI-2 reservations, starts failfast and imports the pool. The SCSI-2 reservations are released before pool import so that the disk heartbeats can function correctly (a reserved heartbeat drive would not be visible to a node which did not have ownership, so heartbeat reads and writes would fail).

An edge case has been identified that could cause a pool to be imported on more than one node. If the second node temporarily freezes while the first node issues and releases the SCSI-2 reservations, its failfast mechanism will in effect miss the reservation window and not trigger a panic. The pool can therefore end up imported on both nodes when the frozen node comes back to life.

Old method

Note that this is an unusual scenario where a node freezes for a finite time, and then resumes without a panic. The node has to freeze in such a way that it even stops responding to pings, so it behaves as if it were down.

In order to enhance data security, the import sequence has been altered. In this new version of RSF-1, when a service is created, both heartbeat drives and reservation drives are specified. The reservation drives must now be different from the heartbeat drives. Before a pool is imported by a node, the node takes ownership of the pool by issuing a SCSI-2 reservation to each of the reservation drives; if this fails, it forcibly takes ownership. As before, this will cause any other node that has the pool imported to panic. The first node then starts failfast and imports the pool. The reservations no longer need to be released, since the heartbeats use different drives.

Using this method, the reservations are persistent, so if the second node temporarily freezes, the first node will take ownership of the pool, leaving the SCSI-2 reservations in place. When the frozen node comes back to life, its failfast will immediately see the reservations and panic the kernel, thereby avoiding a possible split brain.


New method

Note that both the heartbeat drives and the reservation drives can still be used for data storage within a volume - no dedicated quorum drives are required. The change here is that a reservation drive cannot also be used as a heartbeat drive.
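
To make the new import sequence concrete, the outline below sketches it in pseudo-shell. Only zpool import is a real command; reserve_scsi2, force_reserve_scsi2 and start_failfast are stubbed placeholders for logic that lives inside RSF-1 itself, and the .res_drives file holding the reservation drive list is described in the following section.

#!/bin/sh
# Pseudo-shell outline of the new import sequence. The reserve and
# failfast steps are placeholder stubs, NOT real commands - the real work
# is performed inside RSF-1.
POOL=vol01

reserve_scsi2()       { echo "place SCSI-2 reservation on $1"; }
force_reserve_scsi2() { echo "forcibly take SCSI-2 reservation on $1"; }
start_failfast()      { echo "start the failfast driver for $1"; }

# Reserve each drive listed for the pool; force ownership if the normal
# reservation fails (this panics any other node that has the pool imported).
for drive in `cat /opt/HAC/RSF-1/etc/.res_drives.$POOL`; do
    reserve_scsi2 "$drive" || force_reserve_scsi2 "$drive"
done

start_failfast "$POOL"
zpool import $POOL

# Unlike the old method, the reservations are left in place: a node that
# was frozen and later resumes will see them via failfast and panic,
# avoiding a split brain.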

Enabling persistent reservations

To enable persistent reservations, first identify the drives in each pool using zpool status <pool>; for example, for a pool named vol01:

root@nodea:~# zpool status vol01
  pool: vol01
 state: ONLINE
  scan: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        vol01                    ONLINE       0     0     0
          c0t2000000C50BA2D83d0  ONLINE       0     0     0
          c0t2000000C50BA2EE4d0  ONLINE       0     0     0
          c0t2000000C50DAC2FDd0  ONLINE       0     0     0
          c0t20000011C6BA5E4Bd0  ONLINE       0     0     0
          c0t20000011C6CBBFF4d0  ONLINE       0     0     0
          c0t20000011C6CBCAD2d0  ONLINE       0     0     0

errors: No known data errors
root@nodea:~#

From the above disc list, it is first necessary to identify the cluster heartbeat drives; do this using the following grep:

root@nodea:~# grep 'TAG vol01' /opt/HAC/RSF-1/etc/config
 DISC nodeb /dev/rdsk/c0t2000000C50BA2D83d0s0:518:512 TAG vol01
 DISC nodeb /dev/rdsk/c0t2000000C50BA2EE4d0s0:518:512 TAG vol01
 DISC nodea /dev/rdsk/c0t2000000C50BA2D83d0s0:512:518 TAG vol01
 DISC nodea /dev/rdsk/c0t2000000C50BA2EE4d0s0:512:518 TAG vol01
root@nodea:~#

In the above output, there are two drives in use for disc heart beating (paths are identified in the configuration file for each node, as controller numbers can vary between nodes, so node-specific paths are used). Removing those drives from our list leaves the remaining drives c0t2000000C50DAC2FDd0, c0t20000011C6BA5E4Bd0, c0t20000011C6CBBFF4d0 and c0t20000011C6CBCAD2d0. Select at least one of these drives and declare it in the file /opt/HAC/RSF-1/etc/.res_drives.<pool name> (this needs to be done for each clustered pool); in our example the file is /opt/HAC/RSF-1/etc/.res_drives.vol01 and it contains:

c0t2000000C50DAC2FDd0
c0t20000011C6BA5E4Bd0
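
If you would rather derive this list from the command line than by hand, the following sketch (an illustration only, not the attached script) pulls the pool's drives from zpool status, strips out the heartbeat drives found in the RSF-1 config, prints the remaining candidates and then records a chosen pair; the temporary file names and the final drive selection are just examples:

POOL=vol01
CONF=/opt/HAC/RSF-1/etc/config

# Drives belonging to the pool (device names only)
zpool status $POOL | awk '/c[0-9]+t.+d[0-9]+/ {print $1}' | sort -u > /tmp/pool_drives

# Heartbeat drives for this pool, reduced to bare device names
grep "TAG $POOL" $CONF | awk '{print $3}' | sed -e 's|.*/||' -e 's|s[0-9]*:.*||' | sort -u > /tmp/hb_drives

# Candidates for .res_drives.<pool name> = pool drives that are not heartbeating
comm -23 /tmp/pool_drives /tmp/hb_drives

# Record the chosen drive(s) - the two below are the ones picked in this example
printf '%s\n' c0t2000000C50DAC2FDd0 c0t20000011C6BA5E4Bd0 > /opt/HAC/RSF-1/etc/.res_drives.$POOL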

To avoid any errors in creating these text files, a shell script has been attached to this FAQ which deals with the above steps by allowing the user to select a single reservation drive per clustered volume. After using this script, ensure that the .res_drives.<pool name> files have been distributed to the other node(s) in the cluster.
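
For example, assuming ssh access between the nodes and using the node names from this example, the file can be copied to the second node with scp:

root@nodea:~# scp /opt/HAC/RSF-1/etc/.res_drives.vol01 nodeb:/opt/HAC/RSF-1/etc/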

After the .res_drives.<pool name> files have been created, the new functionality is enabled the next time each affected pool is imported. This can be done either by failing over each service between nodes, or by executing 'rsfcli restart' (which exports all services and imports them again).
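
For example, to export and re-import all clustered services in one step:

root@nodea:~# rsfcli restart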

attached files: reservation_script

Last update: 2013-01-11 11:49
Author: Matt
Revision: 1.9
