What is High Availability Monitoring?

An application service monitoring subsystem should be implemented to verify that the application is healthy and running as expected. In much the same way as cluster heartbeats monitor server health and responsiveness, an application health check regularly checks the health and responsiveness of a specific application and informs the cluster to restart or fail over the service in the event of failure.
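
As a rough illustration of the idea, the following Python sketch polls a health check and triggers a recovery action once the check has failed several times in a row. The names `monitor`, `check`, `on_failure`, `max_failures`, `rounds` and `interval` are hypothetical and chosen only for this example; they are not taken from any particular cluster product.

```python
import time
from typing import Callable


def monitor(check: Callable[[], bool],
            on_failure: Callable[[], None],
            max_failures: int = 3,
            rounds: int = 10,
            interval: float = 0.0) -> bool:
    """Poll `check` up to `rounds` times.

    Calls `on_failure` once `max_failures` consecutive checks have failed
    and returns True. Returns False if that threshold was never reached.
    """
    consecutive = 0
    for _ in range(rounds):
        try:
            healthy = check()
        except Exception:
            # A check that raises is treated the same as an unhealthy result.
            healthy = False
        # A single healthy result resets the failure counter, so transient
        # glitches do not trigger a restart or failover.
        consecutive = 0 if healthy else consecutive + 1
        if consecutive >= max_failures:
            on_failure()
            return True
        time.sleep(interval)
    return False
```

Requiring several consecutive failures before acting is one possible design choice: it trades slightly slower detection for fewer spurious restarts.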


High Availability Service User Access

It is crucial that user access to highly available services running in the cluster is continuous and seamless, and that no user reconfiguration is required when a failover moves the target service to an alternate cluster node. This is commonly achieved through network configuration using Virtual IP Addressing or similar network virtualisation capabilities, whereby the IP address or network name associated with service access is assigned to the cluster node currently running that service.


In this way, the user does not know, and should not care, where the service is physically located or running.
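
The effect of a virtual IP address can be modelled with a small in-memory sketch. This is only a conceptual model: the `Cluster` class, its methods, the node names and the address are all hypothetical, and a real cluster would attach the address to a network interface on the owning node rather than update a dictionary.

```python
class Cluster:
    """Toy model of virtual IP ownership within a cluster."""

    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.vip_owner = {}  # virtual IP -> node currently holding it

    def assign(self, vip, node):
        """Give `vip` to `node`, replacing any previous owner."""
        if node not in self.nodes:
            raise ValueError(f"unknown node: {node}")
        self.vip_owner[vip] = node

    def resolve(self, vip):
        """Return the node that currently answers for `vip`."""
        return self.vip_owner[vip]


cluster = Cluster(["node-a", "node-b"])
cluster.assign("192.0.2.10", "node-a")   # service starts on node-a
before = cluster.resolve("192.0.2.10")
cluster.assign("192.0.2.10", "node-b")   # failover: the address moves with the service
after = cluster.resolve("192.0.2.10")
```

The client keeps using the same address throughout; only the node behind it changes.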

High Availability Application Service Failover

If a cluster server node running a highly available service fails, a failover event occurs and the services that were running on the failed server are restarted elsewhere in the cluster. The actual time to recovery varies depending on a number of influences but is typically within 1-3 minutes. During this time, users should expect their sessions to the application service to hang until the service has been successfully restarted on an alternate node in the cluster. At this point, the user's session should resume transparently.
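
From the client side, riding out a failover can be sketched as a retry loop that keeps trying until the service answers again or a deadline passes. The function `call_with_retry` and its parameters are hypothetical; the 180-second default simply mirrors the upper end of the 1-3 minute range mentioned above, and many real client protocols handle this retry internally.

```python
import time


def call_with_retry(operation, deadline_s=180.0, delay_s=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call `operation` until it succeeds or `deadline_s` seconds elapse.

    Only ConnectionError is retried; once the deadline has passed, the
    last ConnectionError is re-raised to the caller.
    """
    start = clock()
    while True:
        try:
            return operation()
        except ConnectionError:
            if clock() - start >= deadline_s:
                raise
            # Wait briefly before the next attempt while the service restarts.
            sleep(delay_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.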

Service failover time is typically the total time for that service to be restarted from scratch on an alternate cluster server node, with the longest step typically being the acquisition and securing of the underlying storage subsystem and the subsequent filesystem mounting. There may, however, be storage design and optimisation tasks that can be performed to speed up this part of the startup process, and hence reduce the overall elapsed failover time.
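
To make the arithmetic concrete, the sketch below sums a set of per-step durations and reports the dominant step. The step names and the durations in seconds are invented purely for illustration and are not measurements from any real system; `failover_breakdown` is likewise a hypothetical helper.

```python
def failover_breakdown(steps):
    """Return (total_seconds, name_of_longest_step) for a dict of step durations."""
    total = sum(steps.values())
    longest = max(steps, key=steps.get)
    return total, longest


# Illustrative, made-up durations in seconds for one failover.
example_steps = {
    "detect_failure": 15,
    "acquire_storage": 60,
    "mount_filesystems": 20,
    "start_application": 25,
}

total_s, longest_step = failover_breakdown(example_steps)
```

With these assumed figures the total is 120 seconds and the storage step dominates, which is why storage optimisation is the natural place to look for savings.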

High Availability Cluster Server and Services

This document outlines the key Cluster Server and Cluster Services components and frameworks essential for a robust High Availability Cluster Design. It addresses how a High Availability Cluster is initiated, monitored and secured and how the roles, rules and responsibilities of the server node members are negotiated.

It also discusses how mission-critical application services are managed within a High Availability Cluster with regard to data protection, service startup, shutdown, user access, health checking and failover.

The framework and components described here are fundamental high-level concepts and should be present in any High Availability Cluster product. The next chapter will explore how these concepts have been designed and built into High-Availability.com's RSF-1 High Availability Cluster technology.