RSF-1 allows for manual and automatic failovers. An automatic failover happens when RSF-1 detects a system failure and automatically moves the service from the failed active node to an alternative node. A manual failover is an operator-initiated command to move the service from the active node to an alternate server, usually to maintain service availability whilst performing maintenance or upgrades.
The key difference between a manually controlled RSF-1 failover and an automatic (active system down) one is that a controlled failover ensures the retreating active node cleanly shuts down the service (ZFS pools and associated services) before the startup is performed on the standby server. The time taken to perform both the shutdown and startup will depend on the number and complexity of the ZFS pools and services being moved.
The startup process that generally takes longest is the ZFS recovery time, which largely depends on the number of transactions in transit at the time of the failure.
When analysing the RSF-1 log file (/opt/HAC/RSF-1/log/rsfmon.log), look for lines like:
“[vol01 S20zfs] Zpool import completed status 0, in 2 seconds SUCCESS”
These lines indicate the time the ZFS import / export processes take relative to the overall failover time.
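To compare import times across pools, the quoted log line can be parsed with a simple script. The sketch below is a minimal example, assuming import-completion lines match the format shown above; real rsfmon.log entries may carry additional fields such as timestamps, so the pattern may need adjusting.

```python
import re

# Sample lines modelled on the rsfmon.log entry quoted above; the pool
# names and timings here are illustrative, not taken from a real log.
sample_log = """\
[vol01 S20zfs] Zpool import completed status 0, in 2 seconds SUCCESS
[vol02 S20zfs] Zpool import completed status 0, in 5 seconds SUCCESS
"""

# Capture the pool name and the import duration in seconds.
pattern = re.compile(
    r"\[(\S+) S20zfs\] Zpool import completed status \d+, in (\d+) seconds SUCCESS"
)

timings = {m.group(1): int(m.group(2)) for m in pattern.finditer(sample_log)}
print(timings)                 # per-pool import time in seconds
print(sum(timings.values()))   # total ZFS import time
```

Summing the per-pool times gives a rough figure for how much of the overall failover window is spent in ZFS imports.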
Recovery time from a failure caused by removing power, pulling network cables, or a forced shutdown is generally quicker, as the clean shutdown time on the active node is avoided; however, startup can take longer if the ZFS pools need recovery. RSF-1 speeds this process up using relocatable ZFS cache files. Pulling SAS cables, on the other hand, causes a failover because the pool becomes *FAULTED* (only if any of the disconnected discs lack equivalent mirrors etc.), at which point the next write deliberately causes the node to panic (as the ZFS fail mode is set to panic, as is the case with NexentaStor), thus triggering a failover. Note that a panic is necessary in this situation because a faulted pool cannot be successfully exported.
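The panic-on-write behaviour described above is governed by the standard ZFS `failmode` pool property. The commands below are a sketch using a hypothetical pool name `vol01`; check your platform's `zpool` documentation before changing this on a production system, as RSF-1/NexentaStor may manage the setting for you.

```shell
# Inspect the current fail mode of the pool (wait, continue, or panic).
zpool get failmode vol01

# Set the pool to panic the node on an unrecoverable I/O failure,
# which is what allows RSF-1 to detect the fault and fail over.
zpool set failmode=panic vol01
```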