High Availability Architecture Design

This section describes the key components and operations that an effective High Availability Cluster software framework needs.

High Availability Server Architecture

The first task that members of a High Availability Cluster must complete at cluster initiation is to reach cluster-wide agreement on configuration, roles and responsibilities. In other words, do all cluster nodes share and agree on a single source of truth for the configuration and rules of the cluster? This is important to establish because cluster members that have been out of service may hold cached configuration files and status that cannot be assumed to be up to date, leaving inconsistent configuration files among cluster members. Such a situation would lead to confusion, conflict and ultimately system downtime. In the event of such a conflict, administrator intervention should be required to resolve the inconsistencies.
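One simple way to picture the agreement check is for each node to report a digest of its configuration, with the cluster starting only when every digest matches. The sketch below is illustrative only and assumes a hypothetical setup in which each node's configuration is available as text; the function names are invented for this example and do not describe any particular product's mechanism.

```python
import hashlib


def config_digest(config_text: str) -> str:
    """Return a SHA-256 hex digest of a node's configuration text."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def check_agreement(node_configs: dict) -> dict:
    """Group node names by configuration digest.

    One group means every node agrees. More than one group means the
    nodes hold conflicting configurations.
    """
    groups = {}
    for node, text in sorted(node_configs.items()):
        groups.setdefault(config_digest(text), []).append(node)
    return groups


def cluster_may_start(node_configs: dict) -> bool:
    """Hypothetical policy: start only if all nodes hold one identical config."""
    return len(check_agreement(node_configs)) == 1
```

When `cluster_may_start` returns `False`, the groups returned by `check_agreement` show which nodes disagree, which is the information an administrator would need to resolve the conflict.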

Once cluster-wide configuration is agreed, the next task is to establish individual server node responsibilities and service running order. In other words, which node is the master for which service, and which services need to be started where and when?
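As a rough illustration of deciding "which service starts where and when", the sketch below assumes a hypothetical model in which each service lists its nodes in order of preference and services are started in the order given. The names `choose_master` and `plan_startup` are invented for this example.

```python
from typing import List, Optional, Set, Tuple


def choose_master(preferred_nodes: List[str], available: Set[str]) -> Optional[str]:
    """Return the first preferred node that is currently available, if any."""
    for node in preferred_nodes:
        if node in available:
            return node
    return None


def plan_startup(
    services: List[Tuple[str, List[str]]], available: Set[str]
) -> List[Tuple[str, Optional[str]]]:
    """Return (service, master) pairs, preserving the configured start order.

    A master of None means no preferred node is available for that service.
    """
    return [(name, choose_master(prefs, available)) for name, prefs in services]
```

For example, with services `[("storage", ["node-a", "node-b"]), ("web", ["node-b", "node-a"])]` and only `node-a` available, both services would be planned onto `node-a`, with `storage` started first.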

What is High Availability Quorum Voting?

Some High Availability Cluster products utilise voting systems in conjunction with a number of quorum devices to establish cluster node roles and responsibilities. This is to avoid potential deadlock situations, particularly for cases where there may be only two cluster members, or only two responsive cluster members available.
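The two-node deadlock can be shown with a generic majority-vote rule. The sketch below is a simplified model, not any specific product's algorithm: it assumes each cluster member and each quorum device holds a configurable number of votes, and that a partition may act only when it holds a strict majority.

```python
def has_quorum(votes_held: int, total_votes: int) -> bool:
    """A partition has quorum when it holds a strict majority of all votes."""
    if total_votes <= 0:
        raise ValueError("total_votes must be positive")
    return 2 * votes_held > total_votes


def partition_votes(members: set, votes: dict) -> int:
    """Sum the votes of the members (nodes or quorum devices) in a partition."""
    return sum(votes[m] for m in members)


# Two nodes only: if they lose contact, each holds 1 of 2 votes, so
# neither side has a majority and the cluster is deadlocked.
two_node = {"node-a": 1, "node-b": 1}

# Adding a hypothetical quorum device gives 3 votes in total, so the
# node that can still reach the device holds 2 of 3 and may proceed.
with_device = {"node-a": 1, "node-b": 1, "quorum-device": 1}
```

The quorum device acts as a tie-breaker here, which is why such designs depend on it being reachable.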

High-Availability.com's RSF-1 product, however, is completely self-sufficient. It utilises a sophisticated configuration, negotiation and fencing process that eradicates the need for any external voting intervention, thus removing a potential single point of failure and additional equipment cost.

What are High Availability Cluster Heartbeats?

Cluster heartbeats are the means by which individual cluster nodes communicate with each other to check node health and responsiveness. For many High Availability Cluster solutions, a heartbeat takes the form of a network echo response, or ping, which is a low-level "are you there?" call with an expected immediate answer of "yes". If a node fails to respond within a short time window, it is assumed to have failed and an appropriate course of cluster action follows.
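The time-window rule can be sketched as a small monitor that records when each peer was last heard from. This is a minimal illustration; the class name, the `timeout` parameter and the use of caller-supplied timestamps are assumptions made for the example, and a real implementation would typically read a monotonic clock.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Track the last heartbeat time per node and flag nodes that go silent."""

    timeout: float  # seconds a node may stay silent before it is presumed failed
    last_seen: dict = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        """Note that a heartbeat reply from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def is_alive(self, node: str, now: float) -> bool:
        """True if `node` has replied within the timeout window ending at `now`."""
        seen = self.last_seen.get(node)
        return seen is not None and (now - seen) <= self.timeout
```

Passing timestamps in explicitly keeps the logic easy to test; in production the caller might supply `time.monotonic()`.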

To avoid false alarms caused by overloaded networks or individual network failures, more than one heartbeat channel is deployed, generally across independent and dedicated network links. All heartbeat channels between individual cluster nodes must therefore fail before a cluster node failure is assumed.
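The "all channels must fail" rule reduces to a simple check over the per-channel status. The sketch below assumes each channel's liveness has already been determined separately; the channel names used in the usage note are hypothetical.

```python
def node_failed(channel_alive: dict) -> bool:
    """Declare node failure only if every heartbeat channel has gone silent.

    `channel_alive` maps a channel name to True if a heartbeat was received
    on that channel within its time window.
    """
    if not channel_alive:
        raise ValueError("at least one heartbeat channel is required")
    return not any(channel_alive.values())
```

With `{"lan-a": False, "lan-b": True}` the node is still considered healthy, because one link continues to carry heartbeats; only when both are `False` is a failure assumed.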

As well as network-based heartbeats, RSF-1 utilises several independent heartbeat mechanisms so that it does not rely on network heartbeats alone, and can therefore maintain cluster integrity in the event of total network failure. RSF-1 also shares stateful cluster configuration information in its heartbeat messages, giving cluster nodes a greater level of intelligence and information about each other than a simple echo response. This also provides a better test of overall system health, because the heartbeat interacts with a running system process rather than relying solely on a low-level network echo response.
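To illustrate the difference between a bare echo and a heartbeat that carries state, the sketch below defines a hypothetical message format. It is not RSF-1's actual wire format, which is not described here; the field names and the JSON encoding are assumptions chosen only to make the idea concrete.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Heartbeat:
    """Hypothetical stateful heartbeat payload (illustrative fields only)."""

    node: str           # sender's node name
    sequence: int       # increases with each heartbeat sent
    config_digest: str  # digest of the sender's cluster configuration
    services: dict      # service name -> state as seen by the sender


def encode(hb: Heartbeat) -> bytes:
    """Serialise a heartbeat to bytes for sending over any channel."""
    return json.dumps(asdict(hb), sort_keys=True).encode("utf-8")


def decode(raw: bytes) -> Heartbeat:
    """Rebuild a heartbeat from received bytes."""
    return Heartbeat(**json.loads(raw.decode("utf-8")))


def config_matches(local_digest: str, hb: Heartbeat) -> bool:
    """True if the sender reports the same configuration digest as ours."""
    return hb.config_digest == local_digest
```

Because such a message has to be built by a running process on the sender, receiving it says more about the peer's health than a reply generated low in the network stack.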