How Was The RSF-1 Cluster Software Pioneered & Designed

Following on from the previous section, which describes High Availability Cluster design concepts and principles, this section describes how those best-practice principles were pioneered and built into the RSF-1 High Availability Cluster product.

Illustration showing RSF-1 conducting servers, storage and applications

The initial design principles for the RSF-1 cluster software called for the technology to use very lightweight processes that would place minimal additional load on server nodes, to remove as many single points of failure as possible, and to be self-contained with no external influences or dependencies. It was also a design requirement that the technology be as Unix-like as possible, offering system administration and command-line capabilities familiar to Unix administrators.

The very first commercial requirement was for the deployment of a two-node Active/Active topology to support a mission-critical, proprietary production insurance application with an Ingres database back-end. The specific requirement was to configure both the live production system and a secondary, lower-priority development/test version in an active/active high availability framework. The hardware comprised Sun SPARCCentre 2000 and 1000 servers running Solaris 2, with dual-ported storage managed by Sun’s Solstice DiskSuite volume management software. This system went live in 1995.

RSF-1 Server Configuration

On system startup, each server node runs a lightweight, real-time RSF-1 Server process that is responsible for establishing cluster heartbeats, communicating with the other cluster nodes, and managing the highly available application services. It also provides a comprehensive management framework for cluster administrators.
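As an illustration of the heartbeat side of this process, the sketch below shows a minimal Python sender and listener pair. The port number, peer addresses, interval and message format are assumptions made for the example; they are not RSF-1's actual heartbeat transports or wire protocol.

```python
import socket
import threading
import time

# Illustrative values only -- not RSF-1's real port, interval or message format.
HEARTBEAT_PORT = 1195
HEARTBEAT_INTERVAL = 2.0              # seconds between transmissions
PEERS = ["10.0.0.1", "10.0.0.2"]      # hypothetical cluster node addresses

def send_heartbeats(node_name: str) -> None:
    """Periodically announce this node's presence to every configured peer."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        for peer in PEERS:
            message = f"HB {node_name} {time.time()}".encode()
            sock.sendto(message, (peer, HEARTBEAT_PORT))
        time.sleep(HEARTBEAT_INTERVAL)

def receive_heartbeats(last_seen: dict) -> None:
    """Record the arrival time of each peer heartbeat for later liveness checks."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", HEARTBEAT_PORT))
    while True:
        data, _addr = sock.recvfrom(1024)
        _tag, sender, _sent_at = data.decode().split()
        last_seen[sender] = time.time()

if __name__ == "__main__":
    last_seen: dict[str, float] = {}
    threading.Thread(target=receive_heartbeats, args=(last_seen,), daemon=True).start()
    send_heartbeats("node-a")         # blocks; peer liveness is tracked in last_seen
```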

RSF-1 uses the concept of master and standby nodes on a per-service basis, where the master node is the preferred node on which that service ordinarily runs. This means that for each service there is a preferred master node and one or more standby nodes. Each service is also configured with separate timeout parameters for the master and standby nodes, which are used by the RSF-1 Server framework to negotiate service execution priorities.
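A per-service policy of this kind can be pictured with the short Python sketch below. The field names, node names and timeout values are illustrative assumptions for the example, not RSF-1's actual configuration syntax.

```python
from dataclasses import dataclass

@dataclass
class ServiceConfig:
    """Per-service placement policy (field names are illustrative)."""
    name: str
    master: str                  # preferred node for this service
    standbys: list[str]          # nodes that may take over the service
    master_timeout: int = 10     # countdown (seconds) used on the master node
    standby_timeout: int = 30    # longer countdown used on standby nodes

    def timeout_for(self, node: str) -> int:
        """Return the countdown this node applies before claiming the service."""
        return self.master_timeout if node == self.master else self.standby_timeout

# Example mirroring the original deployment: production prefers one node,
# the lower-priority development/test service prefers the other.
services = [
    ServiceConfig("production-db", master="node-a", standbys=["node-b"]),
    ServiceConfig("dev-test-db", master="node-b", standbys=["node-a"],
                  master_timeout=20, standby_timeout=60),
]
```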

RSF-1 Cluster Control

Using the above concepts, RSF-1 cluster control has been designed to be self-contained; it does not utilise external arbiters or quorum devices to establish master and standby status. Instead, cluster control status is determined as follows.

The first task each RSF-1 Server undertakes is to establish and agree the cluster configuration and cluster status with the other cluster nodes. If heartbeat communication cannot be established with any other cluster node, this server node assumes it is the first to start and is currently the only member of the cluster. For each highly available service that is runnable, i.e. that the cluster knows to be in a clean state, that is enabled to run on this server, and that is set for automatic startup, a service-specific timeout countdown begins: the master countdown is used if this node is the preferred node for the service, otherwise the standby countdown is used. If heartbeat communication has still not been established with other cluster nodes when a countdown expires, the corresponding service is started.
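Continuing the sketch, that countdown could look something like the following, reusing the hypothetical ServiceConfig from the earlier example. The polling interval and the heartbeat_seen predicate are assumptions of the sketch, not RSF-1 internals.

```python
import time
from typing import Callable

def startup_countdown(service: ServiceConfig, this_node: str,
                      heartbeat_seen: Callable[[], bool]) -> bool:
    """Run the master or standby countdown for one service.

    Returns True if the countdown expired with no peer heard from, meaning
    this node should start the service; False if a heartbeat arrived first
    and the decision must be handed to the cluster-wide negotiation.
    """
    deadline = time.monotonic() + service.timeout_for(this_node)
    while time.monotonic() < deadline:
        if heartbeat_seen():
            return False          # another node is alive: abandon the lone-node path
        time.sleep(0.5)           # illustrative polling interval
    return True                   # still alone in the cluster: safe to start
```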

If, however, cluster communication is established, either at system startup or during a service countdown, the other cluster nodes are consulted to determine the real-time state of the cluster. If a service is already running elsewhere, its countdown and startup sequence are aborted. If, on the other hand, the service is not running anywhere in the cluster but a countdown for it has been initiated elsewhere, the preferred (master) server for that service takes priority and begins the service startup sequence.
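That arbitration step might be sketched as follows; the state inputs and the returned action labels are inventions of this example, standing in for whatever RSF-1 actually exchanges between nodes.

```python
def resolve_contention(service: ServiceConfig, this_node: str,
                       running_on: str | None,
                       counting_down: set[str]) -> str:
    """Decide this node's next action once peer state becomes visible.

    running_on     -- node reported to be running the service, or None
    counting_down  -- nodes reported to have a countdown in progress
    """
    if running_on is not None:
        return "abort"            # service already live elsewhere: cancel our countdown
    if counting_down:
        if this_node == service.master:
            return "start"        # preferred node wins a simultaneous countdown
        if service.master in counting_down:
            return "stand_by"     # defer to the preferred node's countdown
    return "start"                # nobody else is claiming the service
```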