
Creating a ZFS HA Cluster using shared or shared-nothing storage

This guide goes through a basic setup of an RSF-1 ZFS HA cluster. Upon completion the following will be configured:

  • A working Active-Active cluster with either shared or shared-nothing storage
  • A clustered service sharing a ZFS pool (further services can be added as required)
  • A virtual hostname by which clients are able to access the service

Introduction

RSF-1 supports both shared and shared-nothing storage clusters.

Shared Storage

A shared storage cluster utilises a common set of storage devices that are accessible to both nodes in the cluster (housed in a shared JBOD for example). A ZFS pool is created using these devices and access to that pool is controlled by RSF-1.

Pool integrity is maintained by the cluster software using a combination of redundant heartbeating and PGR3 disk reservations, ensuring any pool in a shared storage cluster can only be accessed by a single node at any one time.

Shared-Nothing

A shared-nothing cluster consists of two nodes, each with their own locally accessible ZFS storage pool residing on non-shared storage:

Data is replicated between nodes by an HA synchronisation process. Replication is always done from the active to the passive node, where the active node is the one serving out the pool to clients:

Should a failover occur then synchronisation is effectively reversed:

Before creating pools for shared-nothing clusters

  1. To be eligible for clustering, the storage pools must have the same name on each node in the cluster

  2. It is strongly recommended that the pools are of equal size, otherwise the smaller of the two runs the risk of depleting all available space during synchronisation (an example of creating matching pools is shown below)
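
For illustration, matching pools could be created manually on each node along the following lines; the pool name pool1 and the device names are examples only and should be replaced with values appropriate to your environment:

# On node-a
zpool create pool1 mirror /dev/sdb /dev/sdc

# On node-b - same pool name, ideally the same capacity
zpool create pool1 mirror /dev/sdb /dev/sdc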


Download cluster software

If you have not already done so, download and install the RSF-1 cluster software onto each cluster node. More information can be found here.

Initial connection and user creation

Prerequisites

Shared-Nothing

If setting up a shared-nothing cluster, both nodes require passwordless ssh access to each other as the root user; this is required for ZFS replication and must use the root account. For further information please see the documentation on configuring an ssh connection here.
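
As a rough sketch (exact steps vary by distribution, and sshd must permit key-based root logins), passwordless root access could be set up as follows, then repeated in the opposite direction:

# On node-a, as root: generate a key pair if one does not already exist
ssh-keygen -t ed25519
# Copy the public key to node-b
ssh-copy-id root@node-b
# Verify the connection works without a password prompt
ssh root@node-b hostname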

Firewall configuration

Ensure any firewalls in the cluster environment have the following ports open before attempting configuration:

- 1195 (TCP & UDP)
- 4330 (TCP)
- 4331 (TCP)
- 8330 (TCP)

On systems running firewalld (RHEL-based systems, for example), issue the following commands on each node in the cluster to open the required RSF-1 ports in the active zone (in this example the public zone is used):

firewall-cmd --permanent --zone=public --add-port={1195/tcp,1195/udp,4330/tcp,4331/tcp,8330/tcp}
firewall-cmd --reload
To list current active zones use the command:
firewall-cmd --get-active-zones
And to change an interface's zone:
firewall-cmd --change-interface=eth0 --zone=work --permanent
Please see the firewalld documentation for further information.
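
As a quick sanity check once the RSF-1 services are running, the listening ports can be listed on each node (output will vary):

ss -tuln | grep -E '1195|4330|4331|8330'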

Once the cluster software is installed on all cluster nodes, navigate to the RSF-1 GUI on any one of the nodes on port 8330:

https://<hostname>:8330

You will then be presented with the welcome screen. Click BEGIN SETUP to start configuring the cluster:

welcome-screen

Create an admin user account for the GUI. Enter the information in the provided fields and click the REGISTER button when ready:

create-admin-user

Once you click the REGISTER button, the admin user account will be created and you will be redirected to the login screen. Login with the username and password just created:

login-page

Once logged in, the cluster uninitialized page is displayed:

dashboard-cluster-uninitialized


Configuration and Licensing

To begin configuration, click the Create option on the side-menu (or the shortcut shown on the uninitialized page).

Editing your /etc/hosts file

Before continuing, ensure the /etc/hosts file is configured correctly on both nodes. The node hostnames must not resolve to 127.0.0.1, and each node must be resolvable from the other. Here is a correctly configured hosts file for two example nodes, node-a and node-b:

127.0.0.1 localhost
10.6.18.1 node-a
10.6.18.2 node-b

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
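
To confirm the entries resolve as expected on each node (hostnames here match the example above):

# Should return the 10.6.18.x addresses, not 127.0.0.1
getent hosts node-a node-b
# Confirm the peer node is reachable
ping -c 3 node-b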

The Cluster Create page scans for clusterable nodes (those running RSF-1 that are not yet part of a cluster) and presents them for selection. If any nodes are unlicensed then an additional panel will be included to request 45-day trial licenses:

cluster-create-node-selection

Now enter the cluster name and description, and then select the type of cluster being created (either shared-storage or shared-nothing).

Shared-Nothing: Adding non-local nodes for shared nothing clusters

For a shared-nothing cluster an additional panel is included to allow manual entry of a cluster node. This option is provided because RSF-1 only detects unclustered nodes on the local network, and for shared-nothing clusters the nodes are often located on remote networks and cannot be detected automatically¹.

In this case fill in the host name and IP address of the remote node and click the + ADD NODE button:

sn-additional-node

The node will then be available to select as the secondary node in the cluster:

selectable-additional-node

Obtain trial licenses

Once the cluster nodes have been selected, click the LICENCE button to obtain 45-day trial licenses directly from the High-Availability license server. The RSF-1 End User License Agreement (EULA) will then be displayed. Click ACCEPT to proceed:

EULA

Once the licensing process has completed click the refresh button on the node selection panel to update the license status:

nodes-licensed

Shared-Nothing: Configure SSH tunnel between nodes

For shared-nothing clusters an additional step is required to configure a bi-directional ssh tunnel between the two nodes. This tunnel is used to transfer snapshots between the two cluster nodes. A panel is presented with values pre-filled with the cluster node names:

sn-ssh-tunnel

Using the default values (the cluster node names), snapshot data will be transferred over the hosts' primary network. However, this can be changed by specifying a different endpoint for the target hostnames. For example, if an additional network is configured between the two nodes using the addresses node-a-priv and node-b-priv respectively, then this would be configured thus:

ssh-tunnel-private

By default ssh is used as the transport protocol as it encrypts/decrypts data during snapshot transfer.

If data encryption is not required then mbuffer is offered as an alternative as it makes much better utilisation of the available bandwidth:

ssh-sync-protocol
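
For context only, the difference between the two transports can be sketched with a manual snapshot transfer; RSF-1 drives the actual replication itself, and the pool, snapshot and port names below are purely hypothetical:

# Encrypted transfer over ssh
zfs send pool1@snap1 | ssh node-b zfs receive -F pool1

# Unencrypted transfer via mbuffer: start a listener on the receiving node,
# then stream to it from the sending node (port 9090 is arbitrary)
mbuffer -I 9090 | zfs receive -F pool1          # on node-b
zfs send pool1@snap1 | mbuffer -O node-b:9090   # on node-a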

Next click the TEST SSH CONNECTION button to confirm the ssh tunnel is operating correctly:

ssh-test-success

Note

Once the cluster is created the SSH connection settings can be changed from the Settings->Shared Nothing page.

Create and initialise the cluster

Click the CREATE CLUSTER button to cluster the nodes together and create an initial network heartbeat between them over the hosts' primary network:

cluster-created


Creating a Pool in the WebApp

Now that the cluster is created, the next step is to cluster a pool. This can either be an existing pool or one created using the webapp; if you already have pools configured skip ahead to the next section. In this example a mirrored pool will be created.

To create a pool, navigate to ZFS -> Pools and then click the + CREATE POOL button on the main pools page to bring up the pool configuration page:

zfs-pools-create-main-window

Fill out the Pool name field and select the desired structure of the pool from the Pool Mode list. The cluster supports the following layouts when creating vdevs for a pool:

mirror: Each drive in the vdev will be mirrored to another drive in the same vdev. Vdevs can then be striped together to build up a mirrored pool.
raidz1: One drive is used for parity, meaning one drive can be lost in the vdev without impacting the pool. When striping raidz1 vdevs together each vdev can survive the loss of one of its members.
raidz2: Two of the drives are used for parity, meaning up to two drives can be lost in the vdev without impacting the pool. When striping raidz2 vdevs together each vdev can survive the loss of two of its members.
raidz3: Three of the drives are used for parity, meaning up to three drives can be lost in the vdev without impacting the pool. When striping raidz3 vdevs together each vdev can survive the loss of three of its members.
jbod: Creates a simple pool of striped disks with no redundancy.
draid1: One drive is used for parity, meaning one drive can be lost in the vdev without impacting the pool. When striping draid1 vdevs together each vdev can survive the loss of one of its members.
draid2: Two of the drives are used for parity, meaning up to two drives can be lost in the vdev without impacting the pool. When striping draid2 vdevs together each vdev can survive the loss of two of its members.
draid3: Three of the drives are used for parity, meaning up to three drives can be lost in the vdev without impacting the pool. When striping draid3 vdevs together each vdev can survive the loss of three of its members.

dRAID (distributed RAID) and RAIDZ are two different vdev layouts in ZFS, each offering a distinct approach to data redundancy and fault tolerance. RAIDZ is a traditional RAID-like structure that distributes data and parity information across multiple disks, while dRAID distributes hot spare space throughout the vdev, enabling faster rebuild times after drive failures.
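
As a rough command-line illustration of the two layouts (the webapp generates the equivalent configuration for you; the pool name tank and the device names are placeholders):

# A six-disk RAIDZ2 vdev: two drives' worth of parity per vdev
zpool create tank raidz2 sda sdb sdc sdd sde sdf

# A six-disk dRAID2 vdev using the default layout; distributed spares can be
# requested with a fuller specification (see the zpoolconcepts man page)
zpool create tank draid2 sda sdb sdc sdd sde sdf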

Configure options according to your requirements - for a more in-depth discussion on options please see the HAC ZFS Tuning Guide:

Compression: Compresses data before it is written out to disk; choose either no compression, lz4 or zstd (on is an alias for lz4).
Record Size: The recordsize property gives the maximum size of a logical block in a ZFS dataset. Unlike many other file systems, ZFS has a variable record size, meaning files are stored either as a single block of varying size or as multiple blocks of recordsize bytes.
Access Time: Updates the access time of a file every time it is read or written. The recommended setting is off for better performance.
Linux Access Time: A hybrid setting where the access time is only updated if the mtime or ctime value changes, or if the access time has not been updated for 24 hours (on the next file access).
Alignment Shift: Set this to the sector size of the underlying disks - typically the value 12 for 4K drives (note some drives report a 512 byte sector size for backwards compatibility but are in reality 4K; if unsure, check the manufacturer's specifications).
Extended Attributes: This property defines how ZFS handles Linux extended attributes in a file system. The recommended setting is sa, meaning the attributes are stored directly in the inodes, resulting in fewer I/O requests when extended attributes are in use. For a file system with many small files this can give a significant performance improvement.
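
For reference, a roughly equivalent set of options applied from the command line at pool creation time might look like the following (pool and device names are placeholders; ashift is a pool property, the rest are dataset properties):

zpool create -o ashift=12 \
    -O compression=lz4 \
    -O atime=off \
    -O xattr=sa \
    -O recordsize=128K \
    tank mirror sdm sdn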

Mirrored Pool

Mirrored pools are created by striping together individual mirrored vdevs.

Start by creating an individual mirrored DATA vdev (in this example a two-way mirror is created, but these could be three-way, four-way mirrors, etc.). Select drives for the vdev from the Available Disks list and click DATA to add them as data vdevs - in this example sdm and sdn are used to create the first mirror:

zfs-pools-mirror-step1

To configure multiple striped mirrors, select the next set of drives using the same number of drives as the existing data vdev, click DATA, then from the popup menu select + New vdev (note that selecting vdev-0 would extend the existing vdev rather than creating a new one):

zfs-pools-mirror-step2

This action will add a further pair of mirrored drives to the pool layout, creating a mirrored stripe; additional vdevs are added in the same manner:

zfs-pools-mirror-step3

Add further vdevs as required, here a log and a cache have been added:

zfs-pools-mirror-step4

Once configuration is complete, click SUBMIT; the pool will be created and displayed on the main pools page ready for clustering. The configuration of the pool can be checked by clicking the expand/collapse arrow on the left-hand side of the pool entry:

zfs-pools-mirror-step5
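
The same layout can also be inspected from the command line on the node where the pool is imported (substitute your pool name):

zpool status <poolname>
zpool list -v <poolname>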


Preparing Pools to Cluster

Pools must be imported on one of the nodes before they can be clustered. Check pool import/export status by navigating to ZFS -> Pools from the side menu:

all-exported-pools

In the above example pool1 and pool2 are both exported. To import pool1, open the pool's ACTIONS menu and select Import Pool (in a shared-nothing cluster both pools will be imported on their respective nodes simultaneously):

show-import-action

The status of the pool should change to Imported and CLUSTERABLE, indicating the pool is now ready for clustering:

pool-imported
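
If preferred, the import state can also be confirmed from the command line on each node:

# Pools that are exported but available for import
zpool import
# Pools currently imported on this node
zpool list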

Unclusterable Pools

Should any issues be encountered when importing the pool, it will be marked as UNCLUSTERABLE. Check the RestAPI log (/opt/HAC/RSF-1/log/rest-operations.log) for details on why the import failed. With a shared-nothing cluster this may happen if the pools are not imported on both nodes.
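
To inspect recent entries in that log, for example:

tail -n 50 /opt/HAC/RSF-1/log/rest-operations.log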


Clustering a Pool

Highlight the desired pool to be clustered (choose only pools marked CLUSTERABLE), then select Actions followed by Cluster this pool:

cluster-this-pool

Fill out the description and select the preferred node for the service:

What is a preferred node?

When a service is started, RSF-1 will initially attempt to run it on its preferred node. Should that node be unavailable (the node is down, the service is in manual mode, etc.) then the service will be started on the next available node.

For shared-nothing clusters the system will synchronise data from the preferred node to the remote node(s), overwriting any data on the destination pools. If clustering a pool with existing data, set the preferred node to the node where that pool is imported to prevent its data from being overwritten.

cluster-pool-details

With a shared-nothing pool the GUIDs for each pool will be shown:

cluster-pool-sn-details

To add a virtual hostname to the service, click Add in the Virtual Hostname panel. Enter the IP address and, optionally, a hostname. For nodes with multiple network interfaces, use the drop-down lists to select which interface the virtual hostname should be assigned to.

Finally, click the Create button:

add-vip

The pool will now show as CLUSTERED:

clustered
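
Once the service is running, the virtual hostname can be verified; the interface and hostname below are placeholders:

# On the active node the virtual IP should appear on the selected interface
ip addr show <interface>
# From a client the virtual hostname should resolve and respond
ping -c 3 <virtual-hostname>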


View Cluster Status

To view the cluster status, click on the Dashboard option on the side-menu:

dashboard-main-window

The dashboard shows the location of each service and the respective pool states and failover modes (manual or automatic). The dashboard also allows the operator to stop, start and move services in the cluster. Select a pool, then click the ACTIONS button on the right-hand side to see the available options:

dashboard-service-menu


Cluster Heartbeats

To view cluster heartbeat information navigate to HA-Cluster -> Heartbeats on the side-menu:

initial-heartbeats

To add an additional network heartbeat to the cluster, select ADD NETWORK HEARTBEAT PAIR. In this example an additional connection exists between the two nodes with the hostnames node-a-priv and node-b-priv respectively. These hostnames are then used when configuring the additional heartbeat:

add-new
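
Before submitting, it can be worth confirming the additional link is reachable in both directions (hostnames match the example above):

ping -c 3 node-b-priv    # from node-a
ping -c 3 node-a-priv    # from node-b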

Click SUBMIT to add the heartbeat. The new heartbeat will be added and shown in the heartbeat panel:

updated-heartbeats


This completes basic cluster configuration.

For more advanced configuration and operational procedures please see the online user guide.


  1. RSF-1 uses broadcast packets to detect cluster nodes on the local network. Broadcast packets are usually blocked from traversing other networks and therefore cluster node discovery is usually limited to the local network only.