What are High Availability Clusters?

A High Availability Cluster is a group of computing resources that operate together as a single system to deliver mission-critical applications and services with higher uptime than a single computing resource could provide alone.

This is achieved through redundancy built into the cluster, which provides greater fault tolerance, and through a High Availability software framework that automatically moves critical applications and services to other members of the cluster when a failure occurs, providing near-continuous service availability.

[Figure: Basic diagram showing a simple high availability cluster]

The High Availability software framework is responsible for monitoring the health of all cluster nodes and for managing where and when applications and services run across the cluster, based on the health and availability of the individual members, or nodes, within it. In the remainder of this document, a service refers to an individual mission-critical application together with its associated software, environment and dependencies required to run in a cluster.

High Availability Cluster Heartbeats and Monitoring

There are two essential aspects to cluster health monitoring:

  • Cluster node health monitoring uses a continuous heartbeat to determine whether cluster nodes are alive and responsive, so that the cluster can manage service availability. The heartbeat acts as the central nervous system of the physical cluster. Heartbeat mechanisms usually take the form of multiple dedicated, independent physical connections between cluster nodes.
  • Service health monitoring ensures that each cluster service is alive, responsive and behaving as expected on the cluster node on which it is running. The monitoring mechanisms may be specific to the service being deployed.
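
The node heartbeat idea can be illustrated with a minimal Python sketch. It is not taken from any particular cluster framework: the class name HeartbeatMonitor, the link names and the 3-second timeout are all hypothetical, and timestamps are passed in explicitly so the example stays deterministic. It assumes a node counts as alive while at least one of its independent heartbeat links has delivered a heartbeat within the timeout.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the most recent heartbeat seen from each node on each link."""

    # Seconds of silence after which a link is treated as dead (assumed value).
    timeout: float
    # Maps (node, link) to the timestamp of the last heartbeat received.
    last_seen: dict = field(default_factory=dict)

    def record(self, node: str, link: str, now: float) -> None:
        """Store the arrival time of a heartbeat from `node` over `link`."""
        self.last_seen[(node, link)] = now

    def is_alive(self, node: str, now: float) -> bool:
        """A node is alive if any one of its links is still fresh."""
        return any(
            now - seen <= self.timeout
            for (name, _link), seen in self.last_seen.items()
            if name == node
        )


# Hypothetical node and link names, for illustration only.
monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("node-b", "link-1", now=10.0)
monitor.record("node-b", "link-2", now=12.0)

alive_at_14 = monitor.is_alive("node-b", now=14.0)  # link-2 is still fresh
alive_at_20 = monitor.is_alive("node-b", now=20.0)  # every link has gone silent
```

Requiring every link to go silent before declaring a node dead is one reason for using multiple independent connections: a single failed cable or interface should not be mistaken for a failed node.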

A detected node failure triggers a failover of all services running on the failed node to an alternate cluster node. A service failure, by contrast, may trigger a restart attempt on the same node, followed by a failover to an alternate cluster node if the local restart fails.
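
The recovery rule described above can be sketched as a small decision function. This is an illustrative assumption rather than the behaviour of a specific product: the names Action and decide, and the default of one local restart attempt, are hypothetical.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action"
    RESTART_LOCAL = "restart on the same node"
    FAILOVER = "fail over to an alternate node"


def decide(node_alive: bool, service_healthy: bool,
           restart_attempts: int, max_restarts: int = 1) -> Action:
    """Choose a recovery action for one service on one node.

    max_restarts is an assumed policy setting: the number of local
    restart attempts allowed before the service is failed over.
    """
    if not node_alive:
        # Node failure: every service on the node must move elsewhere.
        return Action.FAILOVER
    if service_healthy:
        return Action.NONE
    if restart_attempts < max_restarts:
        # Service failure on a healthy node: try a local restart first.
        return Action.RESTART_LOCAL
    # Local restarts are exhausted, so move the service.
    return Action.FAILOVER
```

For example, decide(node_alive=True, service_healthy=False, restart_attempts=0) yields Action.RESTART_LOCAL, while the same call with restart_attempts=1 yields Action.FAILOVER.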

For a critical service to be highly available and able to run on any available cluster node, each node in the cluster must be able to access all the hardware, software and network resources, environment and system dependencies that the service requires at run time, to the same extent as if the service were running on a dedicated standalone server. Such dependencies include the underlying file systems, associated hardware, software licensing and configuration, system data, and user authentication and access.
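
One way to picture this requirement is as a set comparison: a node can host a service only if the resources it can reach include everything the service depends on. The sketch below makes that assumption explicit. The function name eligible_nodes and all resource and node names are hypothetical.

```python
def eligible_nodes(service_needs: set[str],
                   node_resources: dict[str, set[str]]) -> list[str]:
    """Return the nodes that can reach every dependency of the service."""
    return sorted(
        node
        for node, resources in node_resources.items()
        if service_needs <= resources  # subset test: nothing is missing
    )


# Hypothetical dependency and resource labels, for illustration only.
needs = {"data-filesystem", "app-license", "user-directory"}
nodes = {
    "node-a": {"data-filesystem", "app-license", "user-directory"},
    "node-b": {"data-filesystem", "app-license", "user-directory", "tape-drive"},
    "node-c": {"data-filesystem", "user-directory"},  # no licence installed
}

candidates = eligible_nodes(needs, nodes)
```

Here node-c is excluded because it lacks the licence, so a failover to it would leave the service unable to start.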

Typically, clusters require shared storage to provide the underlying software and data that highly available services need to run. However, a shared-nothing cluster configuration is also possible, in which the underlying storage is not physically attached but is provided virtually or externally. The topologies that follow can be deployed in both shared and shared-nothing configurations.

Whilst High Availability clusters can include a large number of nodes, the traditional configuration is a 2-node cluster in an Active/Active or Active/Passive arrangement, usually providing High Availability for a specific function or application stack.