Frequently Asked Questions
I've received my license keys, how do I use them?
If you have set up your cluster using the Web App, your licenses will be automatically added to your machines. In some instances you may be required to manually add these to your system. You should have received an email when the licenses were generated, which will be of the form:
{
"Info": {
"Expiry": "2233-03-22"
},
"License": {
"Error": "false",
"license.b4e36c20-2ce9-d149-ffee-123aa2355d1f": "5ZHb4UZXRvlL2VnB/7huB1XX",
"license.e259127f-6da2-6b42-bbdf-ba4002389c08": "ha45gP8lakgGG1o1MUAmb1XX"
}
}
In the above example the licenses are 5ZHb4UZXRvlL2VnB/7huB1XX and ha45gP8lakgGG1o1MUAmb1XX respectively.
The two license keys need to be placed in the file /opt/HAC/RSF-1/etc/licenses.txt on each node - the cluster will automatically pick up the relevant license from this file, thereby making administration simpler as a single file can be used to license all cluster nodes (i.e. multiple clusters can use a single license file). Comments can be introduced to the license file using a '#' symbol in column 0; blank lines are ignored.
# Licenses renewed 13 Sept 2020.
# license.b4e36c20-2ce9-d149-ffee-123aa2355d1f
5ZHb4UZXRvlL2VnB/7huB1XX
# license.e259127f-6da2-6b42-bbdf-ba4002389c08
ha45gP8lakgGG1o1MUAmb1XX
Finally, restart RSF-1 to reload the licenses (note - the license file is checked every hour, on the hour by the cluster as a matter of course; any changes detected will cause the file to be reloaded automatically).
Do all client systems that access the cluster have to be in the same network?
Clients access services in the cluster using a virtual IP (VIP) address for each service.
The VIP address is a normal, routable, IP address, and acts like any other such address. If a service is accessible when run as a non clustered service, then it will also be accessible when run as a clustered service.
How do I investigate what triggered a failover?
The main RSF-1 log file is /opt/HAC/RSF-1/log/rsfmon.log. Each time RSF-1 is restarted, or a set log file size limit is reached, this log file is rotated, so after a time the log directory will contain the current log file rsfmon.log and a historical set numbered .0 to .9, with .9 representing the oldest log file. Each entry in the log file is timestamped along with the PID of the process that updated the log.
What are the prop_zpool_fail_mode properties for?
The failmode property of a zpool controls how the pool handles I/O after it has gone into a 'faulted' state. There are 3 options:
- wait - all I/O from clients will hang
- continue - clients will get I/O errors for all I/O operations to the pool
- panic - as soon as the pool goes faulted, ZFS triggers a kernel panic
For RSF clusters, panic should be used (which is not the default failmode property on newly created pools). A failmode setting of panic means that if a pool goes faulted due to a faulty controller card, broken fibre cable, etc., the active node will panic, and the service can automatically fail over to the other node.
RSF-1 is configured by default to change the failmode of all zpools to panic each time a service starts. If this behaviour is not wanted for any reason, it can be changed by altering an RSF database property.
The property prop_zpool_fail_mode controls the failmode on a cluster-wide basis. If it is necessary to have a pool use a different failmode, then a new property can be created with the format prop_zpool_fail_mode_<pool> (note that this is a pool name, not a service name; if a service contains more than one pool, then a separate property can be declared for each pool).
To modify the cluster wide failmode property to wait run:
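The following is a hedged sketch, assuming the rsfcdb update <property> <value> form used later in this FAQ (the exact subcommand may differ between RSF-1 versions):

# /opt/HAC/RSF-1/bin/rsfcdb update prop_zpool_fail_mode wait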
To add a new property (in this case continue) specifically for the pool tank, run:
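Again as a hedged sketch, assuming the same rsfcdb syntax (the subcommand for creating a brand new property may differ on your version; prop_zpool_fail_mode_tank is the per-pool property name described above):

# /opt/HAC/RSF-1/bin/rsfcdb update prop_zpool_fail_mode_tank continue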
Possible values for the global and individual pool setting are wait, continue, panic and none. A value of none means RSF will not set the failmode of pools on import, so they will retain the failmode setting they already had.
Possible values for the pool specific settings are wait, continue, panic, none and default. A value of default effectively disables the setting and causes that pool to use the global value prop_zpool_fail_mode. A value of none causes RSF not to set the failmode of this pool at all.
For example, if there are 5 pools in a cluster, pool1, pool2, pool3, pool4 and pool5, the properties:
prop_zpool_fail_mode : panic
prop_zpool_fail_mode_pool1 : wait
prop_zpool_fail_mode_pool2 : default
prop_zpool_fail_mode_pool3 : none
prop_zpool_fail_mode_pool4 : continue
mean that the following failmodes are applied:
pool1 - wait
pool2 - panic
pool3 - no failmode setting used (keeps its original setting)
pool4 - continue
pool5 - panic (as there is no specific declaration for pool5, the default is used)
Does RSF-1 support one node cluster configurations?
Yes it does. We can provide a single node license giving access to the WebApp without the HA features.
NFSv4 Failover
NFS Version 4 has removed support for UDP as an underlying transport protocol (as opposed to V3 which supported both UDP and TCP), therefore all NFSV4 connections are TCP based. This exclusive use of TCP in NFSv4 has implications for failover recovery time in certain scenarios due to TCP's TIME_WAIT (sometimes referred to as 2MSL) state that can be entered into during multiple failover operations.
The reason for a TCP socket to enter a TIME_WAIT state is to prevent delayed data packets from one connection being misinterpreted as part of a subsequent newly established connection on the same machine (applying stale data from a previous connection to the current one could have potentially disastrous effects on data integrity).
The implication of TIME_WAIT for failover is observed when HA services are moved from one node to another and then back again in a short period of time. Once the initial move is complete, the originating server enters the TIME_WAIT state as part of the normal TCP protocol. If, during this 'wait period', services are moved back to the originating server, clients will be unable to re-connect until the TIME_WAIT period has expired (and in some cases the client connections will themselves time out). Manual moves back and forth in quick succession (circa 2-4 minutes) between machines that provide data over NFSv4 should therefore be avoided. This type of quick failover/failback scenario is normally only seen during system testing exercises and is not representative of production environments.
For machines where the failover was instigated as a result of a system crash, TIME_WAIT is irrelevant as the TCP connections will have no knowledge of the previous connection.
What does broken_safe and broken_unsafe mean and how do I fix it?
Broken_safe and broken_unsafe refer to a state of an RSF-1 service that has either failed to start up or shut down correctly.
As a service is started or stopped RSF-1 executes the scripts in the directory /opt/HAC/RSF-1/etc/rc.<service>.d/* where <service> is the service name itself; for example a service named web would have the service directory /opt/HAC/RSF-1/etc/rc.web.d/. The service directory contains three types of scripts:
- start - prefixed by an S<num>
- stop - prefixed by a K<num>
- panic - prefixed by a P<num>
The order in which the scripts are run is dictated by the <num> portion of the prefix, going from low to high. The scripts perform actions to either start or stop a service. Each script should run successfully and complete with a 0 exit code. However, if during the running of one of these scripts something goes wrong, then the script will exit with a non zero exit code (exit code definitions can be found in /opt/HAC/bin/rsf.sh).
If an error occurs when running the start or stop scripts, a script can indicate this in its exit code. If the failure occurred when starting a service, then the shutdown scripts are run to release any shared resources that the failed startup attempt may have reserved/started etc.
If the start scripts failed, and the following stop scripts succeeded, the service is marked broken_safe. Broken indicates that something is wrong - the service could not be started, and this should be investigated and remedied before trying to start the service on this server again. The safe part indicates that the stop scripts completed successfully, so no shared resources are allocated and it is safe to try to start the service on a different server.
However, if an error occurs when running the stop scripts, (e.g. failure to unmount a shared file system, even with a forcible unmount), then the service is marked broken_unsafe. As before broken indicates that some investigation is required, but this time the unsafe suffix means that shared resources may still be allocated, and therefore it is NOT safe to try to start the service elsewhere in the cluster (for example, if you were to try to mount and use the file system on another host data corruption could occur).
It is also possible for the startup scripts to indicate that the service should be marked broken_unsafe immediately, without running the stop scripts. This allows for situations in which a severe error has been detected by the start scripts, and running the stop scripts, or allowing another server to try to start the service, may further exacerbate the situation.
In either case, the underlying issue causing the broken state needs to be resolved. Check the log file /opt/HAC/RSF-1/log/rsfmon.log to discover where the error occurred and what needs to be done. Once the problem has been resolved, RSF-1 needs to be told that the issue is now resolved; do this by first issuing the command (as root):
Problems accessing the RSF-1 WebApp
If the WebApp becomes inaccessible and you cannot connect via a browser, a restart of the webapp services may be required.
Linux
BSD
TrueNAS
OmniOS/Solaris
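As a hedged illustration for Linux nodes using systemd (the per-platform restart commands are not reproduced here): the REST API unit rsf-rest.service is shown elsewhere in this FAQ, and any other RSF-1/WebApp units can be found and restarted in the same way; the exact unit names may differ between releases.

# systemctl list-units 'rsf*' --all
# systemctl restart rsf-rest.service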
Installing custom SSL certificates
By default the cluster WebApp uses self-signed certificates for HTTPS authentication.
What are the different cluster timeouts used for?
RSF-1 uses a number of timeouts to manage service startup. These timeouts are:
- Initial timeout - used when a cluster node first boots up.
- Heartbeat timeout - used to trigger heartbeat failure detection.
- Run timeout - used on service startup.
The following sections describe each of these timeouts in detail.
Initial Timeout
One of the first operations a cluster node performs during its start up process (i.e. when it is first powered on) is to gather information about the services configured in the cluster and, upon finding any eligible services1 not running, start them.
This approach however introduces a slight race condition,
typically triggered when more than one cluster node starts
up at the same time (for example in the case of a power outage
and subsequent restoration).
To understand how this condition manifests itself, consider the following scenario in a two node cluster, Node-A and Node-B, with a single service Service-1, whose preferred node is Node-A:
- Both nodes are powered up at the same time.
- Node-B completes booting ahead of Node-A and starts gathering service information.
- Because Node-A has not yet completed booting, Node-B sees Service-1 as down with no other nodes available to run the service.
- Node-B starts the service.
As a result Service-1 is started, but not on its preferred node.
This is what the initial timeout value addresses - it is a time in seconds
that each node should wait once booted, but before attempting to
start services. This allows some leeway for other cluster nodes to
complete their boot process before service startup decisions are made.
In the above example, had Node-A completed booting within the initial timeout period then, when Node-B came to make any service start decisions, it would see that Node-A was available and of a higher priority for the service, and therefore allow the service to start on that node.
The default value for the initial timeout is 20 seconds. This value will always be a balance between what is a reasonable amount of time to wait for all cluster nodes to start vs. what is an acceptable time to defer service startup while waiting for other cluster nodes.
Heartbeat Timeout
Nodes in a cluster communicate with each other using heartbeats. A cluster node sends heartbeats to all other nodes in the cluster at a regular beat rate,2 and should in turn receive heartbeats from all peers in the cluster at the same rate. Heartbeats serve two important functions:
- They let other nodes in the cluster know the sending node is up and running.
- They communicate important state information about cluster services.
The information contained within a heartbeat is used to update the internal state held for the sending node, which, most importantly, includes the last time the remote node was seen (and therefore considered up). The heartbeat timeout is then used in conjunction with the last time a node's state was updated to decide when that node is considered down.3
Once all heartbeats are down for a node, any services that node was running become eligible for starting elsewhere in the cluster - however, before any services are started, a secondary timeout has to expire; this is known as the run timeout.
Run timeout
The run timeout setting is used to assist cluster node coordination during a service startup. Its usage is slightly different depending on if it is a two node or greater than two node cluster.
The minimum value for the run timeout is two seconds. This limit is imposed to provide sufficient time for state information to percolate to all nodes in the cluster after taking into account possible synchronicity delays in disk heartbeats (as they have to be polled every second, unlike network heartbeats which are event driven).
Two node cluster
With a two node cluster there are four possible scenarios in which a service startup on a node occurs:
- The service is stopped and in manual mode on all cluster nodes. The service is then transitioned to automatic in the cluster. The highest priority machine for the service should then immediately attempt to start the service. The secondary node waits for the run timeout to expire before considering if it should start the service. Under normal circumstances the service will start on the primary node and the secondary will do nothing. If however the service fails to start on the primary node, then, once the run timeout expires on the secondary node, it will detect the service has not started and attempt to start it itself.
- The node running the service fails. In this case the run timeout is ignored and the remaining node (if eligible) will attempt to start the service once the heartbeat timeout has expired.
- A move service command is issued. In this case the run timeout is ignored.
- One node halts the service due to a resource failure (for example the network monitor has detected a failure on one of the VIP interfaces). In this case the run timeout is ignored and the other node starts the service if eligible.
Two+ node cluster
When there are more than two nodes in a cluster then the rules governing the use of the run timeout apply to only the nodes eligible to run a specific service. Furthermore, where a service is bound to two nodes only, then the two node cluster rules above apply.
When a service can potentially run on more than two nodes then the above rules apply along with an additional rule:
- In the case where more than one node is eligible to run a service
then the highest priority node for that service will take over (once the
run timeout is honoured). Any
other, less eligible, servers will defer to that server (i.e. will not attempt
to start the service).
There is however one corner case where the run timeout is used to arbitrate between eligible servers: where the most eligible is in manual mode and the next in line is in automatic. In this scenario the node in automatic will wait for the run timeout to expire and, if the other higher priority node is still in manual, attempt to start the service. In the meantime, if the other server is set to automatic, then it will also honour the run timeout, giving it enough time to see that the service is now being started on another, lower priority, node, thus avoiding any potential race condition.
What is the difference between a machine name and a host name?
Every machine in an RSF cluster needs a unique and unchanging machine name for RSF-1 to associate with it. This is normally the same as the host name, but must be different if the host name changes as part of a service fail over (or the host name doesn't resolve to a valid IP address).
The machine names used by RSF-1 are the names which appear in MACHINE lines in the config file. These names are normally associated with real machines by checking that they match the host name of that machine. However if a host ID appears on a MACHINE line, then the host name check is not done, and the association is made by checking the host ID (as returned by hac_hostid) instead. Note that in this case the IP address of the machine MUST be specified on the end of the line, as it is assumed that the machine name is not the same as the host name, and thus can't be used to look up the IP address of the host.
This flexible naming scheme also allows multiple hosts in a cluster to have the same host name (a requirement seen in a few applications).
To specify the optional host ID of the machine on the MACHINE line, precede it with "0x" (to indicate it is a hexadecimal value). RSF-1 will then identify that machine by the name on the MACHINE line, even if it does not match the real host name. Here is an example MACHINE entry with corresponding host ID:
MACHINE slug 0x2ae5747
RSF also sets the environment variable RSF_MACHINE_NAME in any service startup/shutdown scripts to the machine name in use. This allows scripts to create log messages using the same machine name as rsfmon itself.
Machine names are also used on heartbeat (NET, DISC and SERIAL) lines to indicate which machine the heartbeat is being sent to, and on SERVER lines to indicate which machines may act as a server for a service.
Escape syntax when the machine name / vip starts with a number
In the case where the machine name starts with a number or contains special characters, a specific syntax is required when using the name in an rsfcdb command or when updating the config file.
For example, if 0-node1 and 0-node2 are the machine names, then, when using rsfcdb to build the database, the machine name must be preceded by the percent symbol, i.e. %0-node1. Therefore, to establish a network heartbeat between the two nodes 0-node1 and 0-node2, the command to use is:
rsfcdb ha_net %0-node1#0-node2 0-node2#0-node1
Escape syntax for VIP's
VIP's with special characters are handled slightly differently and should be enclosed in double quotes.
For example, if the vip name we would like to use is 0-testvip1, then to add that VIP to a service named zpool1 the command is:
rsfcdb sa_desc zpool1#"0-testvip1" "RSF-1 zpool1 ZFS service"
Which will result in a configuration file entry similar to:
# Machines section
MACHINE %0-node1
NET %0-node2
DISC %0-node2 /dev/rdsk/c15t2d0s0:512:518
SERVICE zpool1 "0-testvip1" "RSF-1 zpool1 ZFS service"
Do I need to use dedicated disks for heartbeats or reservations?
An RSF-1 cluster uses shared disks for both heartbeats and disk fencing.
Disk Heartbeats
When a disk is used for heartbeats, RSF-1 on each cluster node will use a small portion of the disk to regularly write information about the state of the cluster according to that node. That information will then be read by the remote node and used to build up a picture of the whole cluster.
For a ZFS cluster, disk heartbeats are not required to be dedicated disks. RSF-1 understands the layout of ZFS pool drives and is able to place heartbeat information on any pool disk without disrupting either user data or ZFS metadata.
Disk fencing
When a disk is used by RSF-1 for the purposes of fencing, a SCSI reservation is placed on that disk during the startup sequence of an HA service. The reservations are placed before the ZFS pool is imported in order to prevent other cluster nodes from writing to the pool once it is locally imported. This is important for any situation where a cluster node appears to go offline - prompting a service failover to the remaining node - but in reality that node is still running and accessing the pool. In that case, the SCSI reservations will block access to the pool from the failing node, allowing the remaining node to safely take over.
Reservation drive selection is handled automatically by recent versions of the RSF-1 cluster but for older, manually configured versions or for any situations where the cluster configuration must be changed from the default settings, there are three important requirements for the selection of reservation drives:
- Because the reservations are used to fence the ZFS pool's disks, it must be the pool's disks that are reserved - so dedicated disks should not be used for reservations. Additionally, it should be regular data disks that are reserved. SCSI reservations on disks marked as "cache" (L2ARC) or "spare" will not have any effect on the ability of another node to access the pool and will therefore not contribute towards adequate data protection. By default, the cluster will also avoid using disks marked as "log" (SLOG). This is less important from a data protection perspective, but it has been found that since the purpose of a separate log device is to provide a performance improvement, the type of disk devices used for log tend to be more "cutting edge" than regular data disks and are more likely to exhibit unexpected behaviour in response to SCSI reservations.
- Reservation disks should be selected in such a way that the reservations would prevent pool access from the remote node. For example, if a pool is made up of several 4-way mirrors, then as a minimum, reservations should be placed on all 4 devices in any one mirror vdev. This would mean the entire vdev will be inaccessible to the remote node and therefore, the whole pool will be inaccessible.
- Reservations cannot be placed on the same disks as heartbeats. Depending on the type of reservations used by the cluster, the reservation will block either reads and writes, or just writes, from one of the cluster nodes. Because each disk heartbeat requires both nodes to be able to read and write to the disk, reservations and disk heartbeats will conflict and result in the disk heartbeat being marked down while the service is running on either node.
What disk layout should I use for my ZFS HA pool and how does this impact reservations and heartbeat drives?
For ZFS file systems there are essentially two main approaches in use, RAID Z2 and a mirrored stripe. To give a brief overview of these two schemes, let's see how they lay out when we have six drives of 1TB each (note that within any pool, any drives used for reservations or to heartbeat through are still usable for data, i.e. NO dedicated drives are required; the cluster software happily co-exists with ZFS pools).
RAID Z2
RAID Z2 uses two parity drives and at least two data drives, so the minimum number of drives is four. With six 1TB drives this equates to the following layout with roughly 4TB of usable space (P1 and P2 being the parity drives):
| D1 | D2 | D3 | D4 | P1 | P2 |
With this configuration up to two drives (parity or data) can be lost and pool integrity still maintained; any more drive losses though will result in the pool becoming faulted (essentially unreadable/unimportable).
In order to place reservations on this drive layout it is necessary to reserve three drives (say P1, P2, D1) - in this way no other node will be able to successfully import the pool as there are not enough unreserved drives to read valid data from.
With reservations in place on drives P1, P2 and D1, this leaves drives D2, D3 and D4 free to use for disk heartbeats. The RSF-1 cluster software is aware of the on-disk ZFS structure and is able to heartbeat through the drives without affecting pool integrity.
RAID 10
RAID 10 is a combination of mirroring and striping; firstly mirrored vdevs are created (RAID 1) and then striped together (RAID 0). With six drives we have a choice of mirror layout depending on the amount of redundancy desired. These two schemes can be visualised as follows, firstly two three-way mirrors striped together:
| vdev1 | vdev2 |
|-------|-------|
| D0    | D3    |
| D1    | D4    |
| D2    | D5    |
In this example two mirrors have been created (D0/D1/D2 and D3/D4/D5) giving a total capacity of 2TB. This layout allows a maximum of two drives to fail in any single vdev (for example D0 and D2 in vdev1, or D0 in vdev1 and D3 in vdev2, etc.); the pool could survive four drive failures as long as a single drive is left in each of vdev1 and vdev2, but if all drives on one side of the stripe fail (for example D3, D4 and D5) then the pool would fault.
The reservations for this layout would be placed on all drives in either vdev1 or vdev2, leaving three drives free for heartbeats.
Alternatively the drives could be laid out as three two-way mirrors striped together:
| vdev1 | vdev2 | vdev3 |
|-------|-------|-------|
| D0    | D2    | D4    |
| D1    | D3    | D5    |
In this example three mirrors have been created (D0/D1, D2/D3 and D4/D5) giving a total capacity of 3TB, with a maximum of one drive failure in any single vdev. Reservations will be placed on either vdev1, vdev2 or vdev3 leaving four drives available for heartbeating.
In all of the above scenarios it is NOT necessary to manually configure reservations or heartbeats; when a pool is added to a cluster, the cluster software will interrogate the pool structure and automatically work out the number of drives it needs to reserve, with any remaining drives utilised for heartbeats. Note that for each clustered pool a maximum of only two heartbeat drives are configured (any more is overkill).
RSF-1 network ports and firewalls
The following network ports are used by the cluster software:
- 1195 (TCP/UDP) used for low level interprocess communications and to handle requests from the REST API
- 4330 (TCP) used by the REST API process to service REST calls (including those from the WebApp) and to communicate with RSF-1 on port 1195
- 8330 (TCP) the port the WebApp listens on
Firewall considerations:
- Access to ports 1195 and 4330 is only required between cluster nodes.
- Port 8330 needs to be accessible if you wish to access the WebApp from non-cluster machines.
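For example, on Linux nodes running firewalld, the cluster ports could be opened with something along these lines (a hedged sketch; adapt to whichever firewall is actually in use):

# firewall-cmd --permanent --add-port=1195/tcp --add-port=1195/udp --add-port=4330/tcp --add-port=8330/tcp
# firewall-cmd --reload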
Configuring additional network heartbeats
The standard process for creating a network heartbeat is:
- Plumb in the addresses on the interfaces for the additional heartbeat.
- Add the addresses to /etc/hosts with a hostname - the new addresses need to be added to both nodes.
- Create the heartbeats for both directions via the WebApp using the new hostnames.
In this example there is a new IP on each of the nodes: mgomni1 has mgomnihb1 with address 192.168.16.1, and mgomni2 has mgomnihb2 with address 192.168.16.2. Here's what /etc/hosts looks like on both machines, along with their interfaces:
# Host table
::1 localhost
127.0.0.1 localhost loghost
10.6.16.1 mgomni1
10.6.16.2 mgomni2
192.168.16.1 mgomnihb1
192.168.16.2 mgomnihb2
root@mgomni1:~# ipadm show-addr
ADDROBJ TYPE STATE ADDR
lo0/v4 static ok 127.0.0.1/8
vmxnet3s0/dhcp dhcp ok 10.6.16.1/8
vmxnet3s1/hb static ok 192.168.16.1/24
lo0/v6 static ok ::1/128
root@mgomni2:~# ipadm show-addr
ADDROBJ TYPE STATE ADDR
lo0/v4 static ok 127.0.0.1/8
vmxnet3s0/dhcp dhcp ok 10.6.16.2/8
vmxnet3s1/hb static ok 192.168.16.2/24
lo0/v6 static ok ::1/128
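For reference, the heartbeat addresses shown above could have been plumbed with commands along these lines (a hedged sketch using OmniOS/illumos ipadm; the interface and address-object names are taken from the output above):

root@mgomni1:~# ipadm create-addr -T static -a 192.168.16.1/24 vmxnet3s1/hb
root@mgomni2:~# ipadm create-addr -T static -a 192.168.16.2/24 vmxnet3s1/hb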
Then, in the webapp, create the additional heartbeat as below:

From mgomni2 to mgomni1, the target ip/hostname would be mgomnihb1, and for mgomni1 to mgomni2 the target would be mgomnihb2.
iSCSI Target Failover using TGT on Linux
Available from RSF-1 version 1.11 onwards
RSF-1 has built in support for iSCSI target failover using the TGT iSCSI framework. Each pool/service can have its own TGT configuration which is handled seamlessly by RSF-1 on service start and failover.
Individual TGT configuration files are located within the pool so they migrate with the pool during failover.
Prerequisites
- Install TGT on each cluster node.
- Create a zvols directory in which to place individual zvols (not strictly necessary, but HAC recommended best practice).
- For each target data source create a zvol block device to be used as backing store for the iSCSI targets.
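A hedged sketch of these three steps on a Debian-style node (the package name tgt, the pool name pool1, the dataset name zvols and the 200M zvol size are illustrative assumptions):

# apt install tgt
# zfs create pool1/zvols
# zfs create -V 200M pool1/zvols/disk01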
Note
It is possible to use a regular file as backing store. However this is not an approach we would recommend as, from a performance point of view, it introduces the unnecessary overhead of a filesystem layer for I/O operations and somewhat negates the performance advantage of using block level storage.
Target configuration
Once zvols have been created the next step is to configure iSCSI targets that utilize those zvols. A target configuration file is held in each pool. That configuration is specific to the zvols in that pool and should not include zvols from any other pool - each pool needs to have its own configuration.
- Create a tgt config file named .tgt-ha.conf in a pool's root directory, i.e. /<pool>/.tgt-ha.conf (a hedged example configuration is shown after this list).
- Stop and start the service via the webapp to initially load the configuration into TGT.
- Check the target is now configured by running the following command:
# tgtadm --mode target --op show
Target 1: iqn.2023-09.com.hac:deb.target01
    System information:
        Driver: iscsi
        State: ready
    I_T nexus information:
    LUN information:
        LUN: 0
            Type: controller
            SCSI ID: IET 00010000
            SCSI SN: beaf10
            Size: 0 MB, Block size: 1
            Online: Yes
            Removable media: No
            Prevent removal: No
            Readonly: No
            SWP: No
            Thin-provisioning: No
            Backing store type: null
            Backing store path: None
            Backing store flags:
        LUN: 1
            Type: disk
            SCSI ID: IET 00010001
            SCSI SN: beaf11
            Size: 210 MB, Block size: 512
            Online: Yes
            Removable media: No
            Prevent removal: No
            Readonly: No
            SWP: No
            Thin-provisioning: No
            Backing store type: rdwr
            Backing store path: /dev/zvol/pool1/zvols/disk01
            Backing store flags:
    Account information:
    ACL information:
        ALL
- The target should now be discoverable by the client via the service VIP (here the VIP is configured as 10.6.19.21 and TGT is listening on the default port of 3260):
# iscsiadm --mode discovery --type sendtargets --portal 10.6.19.21:3260
10.6.19.21:3260,1 iqn.2023-09.com.hac:deb.target01
# iscsiadm -m node -T iqn.2023-09.com.hac:deb.target01 -l
Logging in to [iface: default, target: iqn.2023-09.com.hac:deb.target01, portal: 10.6.19.21,3260]
Login to [iface: default, target: iqn.2023-09.com.hac:deb.target01, portal: 10.6.19.21,3260] successful.
# lsscsi
[4:0:0:0]    storage IET      Controller       0001  -
[4:0:0:1]    disk    IET      VIRTUAL-DISK     0001  /dev/sdc
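For the first step above, a hedged example of a minimal /pool1/.tgt-ha.conf (the target IQN and zvol path are taken from the tgtadm output shown above):

<target iqn.2023-09.com.hac:deb.target01>
    <backing-store /dev/zvol/pool1/zvols/disk01>
    </backing-store>
</target>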
Additional Pools
When RSF-1 has additional pools configured into a service there are two approaches for the TGT configuration:
- Create separate configuration files in each additional pool and include them in the main pool's configuration. This is our recommended approach as it facilitates easier pool management. In this example pool1 is the main pool (a hedged sketch of this layout is shown after this list).
- Declare all targets in the main pool configuration. For example:
# cat /pool1/.tgt-ha.conf
<target iqn.2023-09.com.hac:deb.target01>
    <backing-store /dev/zvol/pool1/vol/disk01>
    </backing-store>
</target>
<target iqn.2023-09.com.hac:dlp.target03>
    <backing-store /dev/zvol/pool2/vol/v1>
    </backing-store>
</target>
<target iqn.2023-09.com.hac:dlp.target04>
    <backing-store /dev/zvol/pool2/vol/v2>
    </backing-store>
</target>
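For the first (recommended) approach, a hedged sketch of how the two files might be laid out, assuming pool1 is the main pool and pool2 the additional pool; the include directive shown is the standard tgt-admin configuration form, and the exact way the include is expressed may depend on your RSF-1 version:

# cat /pool1/.tgt-ha.conf
<target iqn.2023-09.com.hac:deb.target01>
    <backing-store /dev/zvol/pool1/vol/disk01>
    </backing-store>
</target>
include /pool2/.tgt-ha.conf

# cat /pool2/.tgt-ha.conf
<target iqn.2023-09.com.hac:dlp.target03>
    <backing-store /dev/zvol/pool2/vol/v1>
    </backing-store>
</target>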
iSCSI Qualified Name (IQN)
iSCSI uses the iSCSI Qualified Name (IQN) scheme for addressing.
IQN's are made up of four distinct fields separated by a period (.) and can be created/used by any organization that owns a domain name (known as the naming authority).
The four fields are:
- The string iqn. - this distinguishes the name from an eui. formatted name.
- A date in the format YYYY-MM, for example 1995-10 - this must be a valid year/month combination, corresponding to when the naming authority first owned the domain (used in the next field).
- The DNS domain of the naming authority in reverse, for example com.high-availability.
- An optional string prefixed by a : that the naming authority deems appropriate. Effectively this string provides detail to the IQN and can include product types, serial numbers etc. It can also include a colon : as a boundary separator. For example storage:jbod1, storage:jbod2 or 02:b11f6a06-c9bd-cfeb-ea26-885a25d080c4.
Here are some examples of valid IQN addresses:
iqn.1995-10.com.high-availability:02:b11f6a06-c9bd-cfeb-ea26-885a25d080c4
iqn.2001-04.com.high-availability:storage.clusterA.jbod7
iqn.2001-04.com.high-availability
Extended Unique Identifier (EUI)
iSCSI also allows another form of addressing managed by the IEEE
Registration Authority. This type of addressing represents a globally
unique identifier (EUI) and is assigned by the registration authority.
The format of the EUI consists of the string eui. followed by an EUI-64 identifier (16 ASCII-encoded hexadecimal digits).
Here are some examples of valid EUI addresses:
eui.0567AB88952981FF
eui.ACC5412369875DAB
Typically EUI's are used by manufacturers who are registered with the IEEE Registration Authority and use the EUI-64 scheme for its worldwide unique names (the EUI-64 is also used in other network protocols, such as Fibre Channel, which defines a method of encoding the EUI-64 into the World Wide Name used by FC as a unique identifier).
User Defined Startup/Shutdown Scripts
During service start and stop, RSF-1 will run the scripts located in /opt/HAC/RSF-1/etc/rc.appliance.c/.
Here is an example directory (note some scripts may be missing depending on OS):
root@node-a:/opt/HAC/RSF-1/etc/rc.appliance.c # ls -l
total 130
-rwxr-xr-x 1 root wheel 10623 Aug 21 09:58 C14res_drives
-rwxr-xr-x 1 root wheel 846 Aug 21 09:58 K01announce.pyc
-rwxr-xr-x 1 root wheel 1033 Aug 21 09:58 K02ApplianceStopping
-r-x------ 1 root wheel 3856 Aug 21 09:58 K03snap.pyc
-rwxr-xr-x 1 root wheel 3040 Aug 21 09:58 K32tn.pyc
-rwxr-xr-x 1 root wheel 706 Aug 21 09:58 K70samba
-rwxr-xr-x 1 root wheel 52436 Aug 21 09:58 K80zfs
-rwxr-xr-x 1 root wheel 6069 Aug 21 09:58 K85zfs_mhdc
-rwxr-xr-x 1 root wheel 417 Aug 21 09:58 K98ApplianceStopped
-rwxr-xr-x 1 root wheel 846 Aug 21 09:58 K99announce.pyc
-rwxr-xr-x 1 root wheel 846 Aug 21 09:58 S01announce.pyc
-rwxr-xr-x 1 root wheel 1033 Aug 21 09:58 S02ApplianceStarting
-rwxr-xr-x 1 root wheel 10623 Aug 21 09:58 S14res_drives
-rwxr-xr-x 1 root wheel 6069 Aug 21 09:58 S15zfs_mhdc
-rwxr-xr-x 1 root wheel 52436 Aug 21 09:58 S20zfs
-rwxr-xr-x 1 root wheel 10623 Aug 21 09:58 S21res_drives
-rwxr-xr-x 1 root wheel 3040 Aug 21 09:58 S68tn.pyc
-rwxr-xr-x 1 root wheel 417 Aug 21 09:58 S98ApplianceStarted
-rwxr-xr-x 1 root wheel 846 Aug 21 09:58 S99announce.pyc
Sxx scripts are run during service start, Kxx scripts during service stop.
It is recommended that user start scripts are run after the RSF-1 scripts have been run, and user stop scripts before.
To achieve this the start scripts should be numbered S69-S97, and the stop scripts K04-K31.
Custom scripts should be created using the following template:
#!/bin/sh
#
. /opt/HAC/bin/rsf.sh
service=${RSF_SERVICE:-"service_name"}
script="`basename $0`"
##########################################################
# For service specific scripts, un-comment the following #
# test and replace "my-service" with the service name. #
# This will exit the script immediately when the service #
# name does not match. #
##########################################################
#
#if [ "${service}" != "my-service" ] ; then
# rc_exit ${service} ${RSF_OK}
#fi
case "${1}" in
'start')
#######################################
# commands to be run on service start #
# placed in this section #
#######################################
rc_exit ${service} ${RSF_OK}
;;
'stop')
#######################################
# commands to be run on service stop #
# placed in this section #
#######################################
rc_exit ${service} ${RSF_OK}
;;
'check')
exit ${RSF_CHECK_NORESRC}
;;
*)
rc_exit ${service} ${RSF_WARN} "usage: $0 <start|stop|check>"
;;
esac
Using this format means that the script can contain both start and stop commands. Furthermore, the script can be symbolically linked so that the Sxx and Kxx scripts refer to the same file.
For example:
lrwxr-xr-x 1 root wheel 9 Nov 14 16:46 K10custom -> S70custom
-rwxr-xr-x 1 root wheel 0 Nov 14 16:46 S70custom
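For example, the symbolic link shown above could be created as follows (paths taken from the listing earlier in this section):

# cd /opt/HAC/RSF-1/etc/rc.appliance.c
# ln -s S70custom K10custom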
Multiple kernel partition messages appearing in syslog from the Udev sub-system
By default, the Udev daemon (systemd-udevd) communicates with the kernel and receives device uevents directly from it each time a device is removed or added, or a device changes its state.
Because of the way RSF-1 writes its heartbeats using the ZFS label, the udev sub-system sees this as a state change and erroneously updates syslog each time a heartbeat is transmitted. This can result in multiple messages appearing in syslog of the form:
Aug 10 17:22:24 nodea kernel: [2422456.906302] sdf: sdf1 sdf9
Aug 10 17:22:24 nodea kernel: [2422456.013538] sdg: sdg1 sdg9
Aug 10 17:22:25 nodea kernel: [2422458.418906] sdf: sdf1 sdf9
Aug 10 17:22:25 nodea kernel: [2422458.473936] sdg: sdg1 sdg9
Aug 10 17:22:25 nodea kernel: [2422459.427251] sdf: sdf1 sdf9
Aug 10 17:22:25 nodea kernel: [2422459.487747] sdg: sdg1 sdg9
The underlying reason for this is because Udev watches block devices by binding to the IN_CLOSE_WRITE event from inotify and each time it receives this event a rescan of the device is triggered.
Furthermore, newer versions of the ZFS Event Daemon listen to udev events (to manage disk insertion/removal etc.); they catch the udev events generated by the disk heartbeats and then attempt to find which pool (if any) the disk belongs to, resulting in unnecessary I/O.
The solution to this is to add a udev rule that overrides this default behaviour and disables monitoring of the sd* block devices. Add the following to the udev rules file /etc/udev/rules.d/50-rsf.rules1:
ACTION!="remove", KERNEL=="sd*", OPTIONS:="nowatch"
Finally, reload the udev rules to activate the fix.
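On most Linux distributions this can be done without a reboot using udevadm:

# udevadm control --reload-rules
# udevadm trigger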
Thanks to Hervé BRY of Geneanet for this submission.
REST service fails to start due to port conflict
The RSF-1 REST service (rsf-rest) uses port 4330 by default.
If this port is in use by another service (for example pmlogger
sometimes attempts to bind to port 4330) then the RSF-1 REST service will
fail to start.
To check the service status run the command
systemctl status rsf-rest.service and check the
resulting output for any errors; here is an example
where port 4330 is already in use:
# systemctl status rsf-rest.service
● rsf-rest.service - RSF-1 REST API Service
Loaded: loaded (/usr/lib/systemd/system/rsf-rest.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2022-07-14 08:23:00 EDT; 5s ago
Process: 4271 ExecStart=/opt/HAC/RSF-1/bin/python /opt/HAC/RSF-1/lib/python/rest_api_app.pyc >/dev/null (code=exited, status=1/FAILURE)
Main PID: 4271 (code=exited, status=1/FAILURE)
Jul 14 08:23:00 mgc81 python[4271]: return future.result()
Jul 14 08:23:00 mgc81 python[4271]: File "/opt/HAC/Python/lib/python3.9/site-packages/aiohttp/web.py", line 413, in _run_app
Jul 14 08:23:00 mgc81 python[4271]: await site.start()
Jul 14 08:23:00 mgc81 python[4271]: File "/opt/HAC/Python/lib/python3.9/site-packages/aiohttp/web_runner.py", line 121, in start
Jul 14 08:23:00 mgc81 python[4271]: self._server = await loop.create_server(
Jul 14 08:23:00 mgc81 python[4271]: File "/opt/HAC/Python/lib/python3.9/asyncio/base_events.py", line 1506, in create_server
Jul 14 08:23:00 mgc81 python[4271]: raise OSError(err.errno, 'error while attempting '
Jul 14 08:23:00 mgc81 python[4271]: OSError: [Errno 98] error while attempting to bind on address ('0.0.0.0', 4330): address already in use
Jul 14 08:23:00 mgc81 systemd[1]: rsf-rest.service: Main process exited, code=exited, status=1/FAILURE
Jul 14 08:23:00 mgc81 systemd[1]: rsf-rest.service: Failed with result 'exit-code'.
The simplest way to resolve this is to change the port the RSF-1 REST service listens on. To do this run the following commands on each node in the cluster (in this example the port is changed to 4335):
# /opt/HAC/RSF-1/bin/rsfcdb update privPort 4335
# systemctl restart rsf-rest.service
The RSF-1 REST service will now restart and listen on the new port. A status check should now show the service as active and running:
# systemctl status rsf-rest.service
● rsf-rest.service - RSF-1 REST API Service
Loaded: loaded (/usr/lib/systemd/system/rsf-rest.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2022-07-14 09:22:57 EDT; 2s ago
Main PID: 52579 (python)
Tasks: 1 (limit: 49446)
Memory: 31.8M
CGroup: /system.slice/rsf-rest.service
└─52579 /opt/HAC/RSF-1/bin/python /opt/HAC/RSF-1/lib/python/rest_api_app.pyc >/dev/null
Jul 14 09:22:57 mgc81 systemd[1]: Started RSF-1 REST API Service.
This can be confirmed by navigating to the Webapp via the new port https://<ip of node>:4335
Mounting ZVOL's with filesystems
RSF-1 can be configured to mount and unmount ZVOL's with a filesystem on service startup/shutdown.
To enable this feature, ZVOL's are declared in the file /opt/HAC/RSF-1/etc/mounts/<pool name>.<filesystem type>
which should be present on each node in the cluster. The format of this file is:
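A hedged sketch of the expected layout, with one entry per line (field names inferred from the note below and the worked example later in this section):

<zvol path> <mount point> [<mount options>]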
Note
The <mount options> field specifies options to be
passed to the mount command using the -o parameter. This field in
itself is optional and ignored if not present.
For example, a pool named pool1 has two ZVOL's:
NAME USED AVAIL REFER MOUNTPOINT
pool1 2.16G 115M 307K /pool1
pool1/zvol1 1.21G 1.25G 77.6M -
pool1/zvol2 968M 1006M 77.6M -
Each of these volumes has an xfs filesystem created within it. To mount these
on service startup the file pool1.xfs has been created in the
/opt/HAC/RSF-1/etc/mounts/ directory containing two entries:
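A hedged reconstruction of those two entries, based on the mount points and options described later in this section (the /dev/zvol/... form is used as it is the stable path referred to in the note below):

/dev/zvol/pool1/zvol1 /zvol1
/dev/zvol/pool1/zvol2 /zvol2 defaults,_netdev,relatime,nosuid,uquota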
Note
It is important to use the ZVOL path rather than the device it points to as device numbering can change on reboot, whereas the ZVOL path remains static.
The suffix xfs in the filename tells RSF-1 to pass the filesystem type xfs to the mount
command. RSF-1 will now mount these filesystems on service startup, and unmount
them on service shutdown.
The mount operation takes place before any VIP's are plumbed in, and the umount operation is performed after the service VIP's are unplumbed; it is therefore safe to share these filesystems out (NFS/SMB etc.) using the service VIP.
No options will be passed to the mount of /zvol1, whereas the mount of /zvol2 will have
options defaults,_netdev,relatime,nosuid,uquota passed using the -o parameter.
There is no limit placed on the number of file systems to mount, or their type (xfs, ext4, vfat etc). In the above example, if two additional ext4 file systems are created as zvol3 and zvol4, then the file pool1.ext4 would be added to the mounts directory with the contents:
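Again as a hedged sketch (the mount points are assumptions):

/dev/zvol/pool1/zvol3 /zvol3
/dev/zvol/pool1/zvol4 /zvol4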
The mounts directory will now contain the files pool1.xfs and pool1.ext4 with each one
being processed on service start/stop. Further pools are added to the configuration by
creating additional configuration files in the mounts directory:
total 12
-rw-r--r-- 1 root root 60 Mar 26 15:54 pool1.ext4
-rw-r--r-- 1 root root 60 Mar 25 17:55 pool1.xfs
-rw-r--r-- 1 root root 60 Mar 25 19:02 pool2.vfat
Cluster wide configuration
The filesystem configuration files must be manually created or copied over to each node in the cluster. It is done this way to allow for granularity in which filesystems are mounted on which node during failover.
For example, one node in the cluster may mount zvol1 and zvol2 on service startup, but only
mount zvol1 on the other node should a failover occur.
Allowing secondary IP's on interfaces to be promoted when primary is removed
Synopsis
If service failover/halting is causing VIPs from another service to be removed, then the likely cause is the non-promotion of secondary IP addresses, controlled by the promote_secondaries IPv4 system setting.
Note
This FAQ entry is only relevant to interfaces used in a cluster that have no permanent static IP address assigned.
When configuring IP's on Linux using the
ip command, the first IP added to an interface
(in a specific subnet) is assigned as the
primary address; any additional addresses
added in the same subnet will be flagged as secondary,
for example:
# ip a l ens19
3: ens19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP group default qlen 1000
link/ether 5e:36:85:51:89:ef brd ff:ff:ff:ff:ff:ff
altname enp0s19
inet 172.16.20.10/24 scope global ens19
valid_lft forever preferred_lft forever
inet 172.16.20.11/24 scope global secondary ens19
valid_lft forever preferred_lft forever
inet 172.16.20.12/24 scope global secondary ens19
valid_lft forever preferred_lft forever
inet6 fe80::5c36:85ff:fe51:89ef/64 scope link
valid_lft forever preferred_lft forever
Should the primary IP address (172.16.20.10 in this case) be removed, any secondary IP's in the same subnet are impacted by the system setting net.ipv4.conf.<selector>.promote_secondaries4. A value of 0 results in those addresses being removed, i.e.:
# ip -f inet address del 172.16.20.10/24 dev ens19
# ip a l ens19
3: ens19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP group default qlen 1000
link/ether 5e:36:85:51:89:ef brd ff:ff:ff:ff:ff:ff
altname enp0s19
inet6 fe80::5c36:85ff:fe51:89ef/64 scope link
valid_lft forever preferred_lft forever
Whereas with promote_secondaries set to 1, deleting the primary address causes the next secondary address in the same subnet to be promoted to primary:
# ip -f inet address del 172.16.20.10/24 dev ens19
# ip a l ens19
3: ens19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP group default qlen 1000
link/ether 5e:36:85:51:89:ef brd ff:ff:ff:ff:ff:ff
altname enp0s19
inet 172.16.20.11/24 scope global ens19
valid_lft forever preferred_lft forever
inet 172.16.20.12/24 scope global secondary ens19
valid_lft forever preferred_lft forever
inet6 fe80::5c36:85ff:fe51:89ef/64 scope link
valid_lft forever preferred_lft forever
Example
Consider a cluster with two services, each with their own VIP. When those two services are running on a single node, one of the VIPs will be primary and the other secondary. Should the service with the primary VIP be moved to another server, then the removal of its VIP as part of the failover will cause the secondary VIP to also be removed, and thus impact the accessibility of that service to clients.
The promote_secondaries setting can be enabled on ALL interfaces, made the default action for newly created interfaces, or enabled on individual interfaces, as shown below.
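A hedged example using sysctl, run as root (the interface name ens19 matches the example output above):

# sysctl -w net.ipv4.conf.all.promote_secondaries=1
# sysctl -w net.ipv4.conf.default.promote_secondaries=1
# sysctl -w net.ipv4.conf.ens19.promote_secondaries=1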
To make a permanent change to the system, update /etc/sysctl.conf with:
# avoid deleting secondary IPs on deleting the primary IP
net.ipv4.conf.default.promote_secondaries = 1
net.ipv4.conf.all.promote_secondaries = 1
and reload with:
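On Linux the updated /etc/sysctl.conf can be re-read with:

# sysctl -p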
VHCI: devices not recognised as multi-path candidates for Solaris/OmniOS and derivatives
With the Solaris family of OS's, the virtual host controller interconnect (VHCI) driver enables a device with multiple paths to be represented as single device instance rather than as an instance per physical path. Devices under VHCI control appear in format listings with the device path starting /scsi_vhci, as in the following example:
# format
Searching for drives...done
AVAILABLE DISK SELECTIONS:
0. c0t5000C500237B2E53d0 <SEAGATE-ST3300657SS-ES62-279.40GB>
/scsi_vhci/disk@g5000c500237b2e53
1. c0t5000C5002385CE4Fd0 <SEAGATE-ST3300657SS-ES62-279.40GB>
/scsi_vhci/disk@g5000c5002385ce4f
2. c0t5000C500238478ABd0 <SEAGATE-ST3300657SS-ES62-279.40GB>
/scsi_vhci/disk@g5000c500238478ab
3. c1t5000C50013047C55d0 <HP-DG0300BALVP-HPD3-279.40GB>
/pci@71,0/pci8086,2f04@2/pci1028,1f4f@0/iport@1/disk@w5000c50013047c55,0
4. c2t5000C5000F81EDB1d0 <HP-DG0300BALVP-HPD4-279.40GB>
/pci@71,0/pci8086,2f08@3/pci1028,1f4f@0/iport@20/disk@w5000c5000f81edb1,0
However, in the above example two devices are not under the control of the VHCI driver, as can be seen by the device /pci path rather than the /scsi_vhci one. In order to resolve this the VHCI driver needs to be made aware that these drives can be multipathed. This is accomplished by adding specific entries to the VHCI configuration file /kernel/drv/scsi_vhci.conf; in essence, for each differing (vendor/model combination) candidate SCSI target device, the scsi_vhci code must identify a failover module to support the device, which is done by adding the device to the scsi-vhci-failover-override property in the VHCI configuration file.
By using the format command we can identify the device vendor/model from the resulting output. Taking the entry <HP-DG0300BALVP-HPD4-279.40GB> from the above example, the first two characters identify the manufacturer, HP, with the next block identifying the model number, DG0300BALVP. These identifiers can then be added to the VHCI configuration file /kernel/drv/scsi_vhci.conf thus (syntax for more than one entry shown here for reference):
scsi-vhci-failover-override =
"HP DG0300BALVP", "f_sym",
"HP DG0300FARVV", "f_sym";
#END: FAILOVER_MODULE_BLOCK (DO NOT MOVE OR DELETE)
Please note that the spacing is important in the vendor declaration - it must be padded out to eight characters, immediately followed by the model number (which does not require any padding). Once the entries have been added, the host machine must be rebooted in order for them to take effect. In the example above, once the configuration has been updated and the host rebooted, the output of format now returns:
AVAILABLE DISK SELECTIONS:
0. c0t5000C500237B2E53d0 <SEAGATE-ST3300657SS-ES62-279.40GB>
/scsi_vhci/disk@g5000c500237b2e53
1. c0t5000C5002385CE4Fd0 <SEAGATE-ST3300657SS-ES62-279.40GB>
/scsi_vhci/disk@g5000c5002385ce4f
2. c0t5000C500238478ABd0 <SEAGATE-ST3300657SS-ES62-279.40GB>
/scsi_vhci/disk@g5000c500238478ab
3. c1t5000C50013047C55d0 <HP-DG0300BALVP-HPD3-279.40GB>
/scsi_vhci/disk@g5000c50013047c55
4. c2t5000C5000F81EDB1d0 <HP-DG0300BALVP-HPD4-279.40GB>
/scsi_vhci/disk@g5000c5000f81edb1
The drives have now been successfully configured for multi-pathing via the VHCI driver.
Reservation drives are getting 'Failed to power up' errors
When a ZFS service is running on a node in the cluster, that node will hold SCSI reservations on some of the zpool disks to prevent the other node from being able to access those disks. With some disk models, when the passive node reboots, it will no longer be able to access those reservation disks and will get the message:
Device <path-to-device> failed to power up
Because of the failure to power up, that node will then always encounter I/O errors from those disks.
To resolve this issue, add an entry to /kernel/drv/sd.conf to disable the bootup power check for a specific disk model. The entry should be similar to:
sd-config-list= "SEAGATE ST2000NM0001","power-condition:false";
or if there are multiple disk models showing this behaviour:
sd-config-list= "SEAGATE ST2000NM0001","power-condition:false",
"SEAGATE ST32000644NS","power-condition:false";
After sd.conf has been modified on both nodes, there should be no 'failed to power up' error on the next bootup and the passive node should be able to access the disks as expected (although it will still get 'reservation conflict' because the disks are still reserved).
RSF-1 Services not starting due to missing libc.so.1
When installing from scratch (clean OS install), the following issue may occur with RSF-1 services starting:
Starting RSF-1 REST Service...
ld.so.1: python3.9: fatal: libc.so.1: version 'ILLUMOS_0.39' not found (required by file /opt/HAC/Python/bin/python3.9)
ld.so.1: python3.9: fatal: libc.so.1: open failed: No such file or directory
[ Jun 14 08:40:50 Method "start" exited with status 0. ]
[ Jun 14 08:40:50 Stopping because all processes in service exited. ]
[ Jun 14 08:40:50 Executing stop method ("/lib/svc/method/svc-rsf-rest stop"). ]
[ Jun 14 08:40:50 Method "stop" exited with status 0. ]
[ Jun 14 08:40:50 Restarting too quickly, changing state to maintenance. ]
(END)
This can occur due to libc.so.1 being out of date and can be resolved by running pkg update to get the up-to-date libraries and rebooting.
- Udev rules are defined in files with the .rules extension. There are two main locations in which those files can be placed: /usr/lib/udev/rules.d is used for system-installed rules, whereas /etc/udev/rules.d/ is reserved for custom made rules. In this example we've used the name 50-rsf.rules, but any suitable file name can be used. ↩↩
- Currently every second. ↩
- Note that any heartbeats received cause state information to be updated, so in order for a node to be considered down all heartbeats from that node must be lost. ↩
- The <selector> can be all, default or a specific interface, e.g. ens19 ↩