Implementing Hardware Redundancy

As a system is only as strong as its weakest component, the availability of most hardware components should be increased by using proven, enterprise-grade products together with system redundancy. Redundancy means adding multiple components or additional capacity so that a localised fault in one component can be tolerated without causing system downtime. Higher levels of availability are achieved by reducing or removing single points of failure in the system.
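
The effect of redundancy on availability can be estimated with a simple probability model. The sketch below is illustrative only: it assumes component failures are independent (which real hardware only approximates), the availability figures are made-up examples rather than vendor data, and the function names are hypothetical.

```python
from math import prod


def series_availability(availabilities):
    """Availability of components that must ALL work (each is a single point of failure)."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of identical components where ONE working copy is enough.

    Assumes independent failures: the group is down only if every copy is down.
    """
    return 1 - (1 - availability) ** copies


# Illustrative figures only (assumptions, not measured values).
psu = 0.99
nic = 0.995

# One power supply and one network card: both must work.
single = series_availability([psu, nic])

# Two of each: either copy in a pair is sufficient.
dual = series_availability([
    redundant_availability(psu, 2),
    redundant_availability(nic, 2),
])

print(f"single PSU + single NIC: {single:.6f}")
print(f"dual PSU + dual NIC:     {dual:.6f}")
```

Under these assumptions the chain of single components is less available than its weakest member, while duplicating each component raises the combined figure above either individual value.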

Examples of hardware redundancy include:

  • Dual power supplies
  • Multiple network cards
  • RAID storage
  • Redundant cooling fans
  • Multiple storage (multipath) connections

[Figure: Diagram showing hardware redundancy for a highly available system]

Note, however, that no matter how much High Availability is built into a single system, that system is itself a single point of failure. In addition, for in-system single points of failure where availability cannot easily be increased with redundancy, such as a system motherboard or memory card, redundancy can only be achieved by introducing duplicate systems.

Individual component High Availability is normally localised and self-recovering, without any external High Availability service management. For example, a redundant power supply or cooling fan takes over automatically when its partner fails. Similarly, a failed disk drive in a mirrored RAID configuration, or a broken storage channel with multipath enabled, does not necessarily require external intervention. In these cases, the important factor for continued High Availability operation is that such failures raise an alert so that the failed units can be replaced promptly at the next convenient maintenance window. Until the failed part has been replaced and integrated, the component is a potential single point of failure and the entire system is vulnerable.
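
The alerting requirement described above can be sketched as a check that flags any redundancy group running without a spare. This is a minimal illustration, not a monitoring product: the `RedundancyGroup` type, the function names, and the inventory values are all hypothetical, and a real deployment would read unit health from its hardware management interface.

```python
from dataclasses import dataclass


@dataclass
class RedundancyGroup:
    name: str      # e.g. "power supplies"
    healthy: int   # units currently working
    required: int  # minimum units needed to keep the service running


def classify(group):
    """Return 'failed', 'degraded' (no spare left), or 'ok'."""
    if group.healthy < group.required:
        return "failed"
    if group.healthy == group.required:
        return "degraded"  # still running, but now a single point of failure
    return "ok"


def alerts(groups):
    """Build an alert message for every group that is not fully redundant."""
    return [
        f"{g.name}: {classify(g)} ({g.healthy} healthy, {g.required} required)"
        for g in groups
        if classify(g) != "ok"
    ]


# Hypothetical inventory: one of two power supplies has already failed.
inventory = [
    RedundancyGroup("power supplies", healthy=1, required=1),
    RedundancyGroup("cooling fans", healthy=4, required=3),
    RedundancyGroup("RAID 1 disks", healthy=2, required=1),
]

for message in alerts(inventory):
    print(message)
```

The "degraded" state is the one that matters here: the service is still up, so nothing looks wrong to users, yet the next failure in that group would cause downtime.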

Server Redundancy

Adding a duplicate server increases overall system High Availability because it removes the risk of a dedicated server being a significant single point of failure. It is vital, however, that the redundant server can reliably and predictably maintain service continuity in the event of server failure. The redundant server pair can be configured in an Active/Passive or Active/Active topology and requires a High Availability Cluster to manage server failover.
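
As a rough illustration of the Active/Passive case, the sketch below decides which server should hold the active role based on heartbeat freshness. It is a simplified assumption of how such a decision might look, not the behaviour of any particular cluster product: the `Node` type, `choose_active` function, server names, and the 5-second timeout are hypothetical, and real High Availability clusters also deal with concerns such as fencing and split-brain that are omitted here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time (in seconds) when the node last reported in


def choose_active(active, passive, now, timeout=5.0):
    """Return the node that should serve traffic at time `now`.

    The active node keeps its role while its heartbeat is fresh. If it is
    stale and the passive node is fresh, the passive node takes over.
    If both are stale, None is returned to signal a full outage.
    """
    def alive(node):
        return now - node.last_heartbeat <= timeout

    if alive(active):
        return active
    if alive(passive):
        return passive
    return None


# Hypothetical pair: the primary last reported 10 seconds ago, the standby 1 second ago.
primary = Node("server-a", last_heartbeat=100.0)
standby = Node("server-b", last_heartbeat=109.0)

serving = choose_active(primary, standby, now=110.0)
print(serving.name if serving else "no healthy node")
```

Keeping the decision in one pure function makes the failover behaviour predictable and easy to test, which is the property the redundant pair needs in order to maintain service continuity.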

Systemic Redundancy

Potential single points of failure beyond the system's physical servers and hardware include everything else that the overall system depends on to operate at High Availability levels.

The following table highlights potential risks and single points of failure, together with examples of how redundancy can be used to mitigate them and provide a higher level of system availability:

| Aspect | Risk / Single Point of Failure | Redundancy Mitigation |
|---|---|---|
| Utilities | Power | UPS, back-up generators, multiple power sources |
| | Cooling | Multiple coolers |
| Location | Single location/site of physical location of systems | Place redundant equipment in different racks or datacentres where possible |
| Connectivity | External user connectivity | Multiple network providers, back-up network availability |
| Operating conditions | Operating temperature, Humidity & Air quality | Multiple air-conditioning units, dehumidifiers, and air filters |
| Environment | Extreme weather, earthquake and flood risk, proximity to other risky locations (e.g., airport, warzones), sabotage (physical and remote access) | Alternative geographically separated fully redundant facility |