Hardware Redundancy | High Availability

Implementing Hardware Redundancy

As a system is only as strong as its weakest component, the availability of most hardware components should be increased by using high-quality, proven enterprise-grade products together with system redundancy. Redundancy means introducing multiple components or additional capacity so that a localised fault in one component can be tolerated without system downtime. Higher levels of availability are achieved by reducing or removing Single Points of Failure in the system.
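The effect of redundancy on availability can be illustrated with a short calculation. The sketch below assumes that component failures are independent, which real systems only approximate, and the availability figures are illustrative placeholders rather than measured values. The function names are hypothetical.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a chain where every component must work."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """Availability of a redundant group where any one working unit suffices."""
    return 1 - prod(1 - a for a in availabilities)


# Illustrative figures only (assumptions, not vendor data).
psu = 0.99          # a single power supply
rest = 0.999        # everything else in the server, treated as one component

# One power supply: the chain is limited by its weakest link (roughly 98.9%).
single = series_availability([psu, rest])

# Dual power supplies: the pair behaves as one far more available component,
# lifting the whole chain to roughly 99.89%.
redundant = series_availability([parallel_availability([psu, psu]), rest])
```

Note that the non-redundant remainder (`rest`) now caps the result, which is why removing one single point of failure shifts attention to the next one.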

Examples of hardware redundancy include:

Dual power supplies
Multiple network cards
RAID storage
Cooling fans
Multiple storage (multipath) connections

Diagram showing hardware redundancy for a highly available system

It should be noted, however, that no matter how much High Availability is built into a single system, that system is itself a single point of failure. In addition, for in-system single points of failure where availability cannot easily be increased with redundancy, such as a system motherboard or memory card, redundancy can only be realised by introducing duplicate systems.

Individual component High Availability is normally localised and self-recoverable without any external High Availability service management, for example when a redundant power supply or cooling fan takes over from a failed unit. Similarly, a failed disk drive in a mirrored RAID configuration, or a broken storage channel with multipath enabled, does not necessarily require external intervention. In these cases, the important factor for continued High Availability operation is that such failures raise an alert so that the failed units can be replaced quickly at a convenient maintenance window. Until the failed part has been replaced and integrated, a potential single point of failure exists and the entire system is vulnerable.
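The alerting requirement can be sketched as a simple classification of each redundant group by how many spare units remain. This is a minimal illustration only: the `RedundantGroup` type, its fields, and the state names are hypothetical, and in practice the unit counts would come from whatever hardware monitoring the platform provides.

```python
from dataclasses import dataclass


@dataclass
class RedundantGroup:
    name: str
    total: int          # units installed
    healthy: int        # units currently working
    required: int = 1   # minimum working units needed to keep service running


def assess(group):
    """Classify a redundant group by its remaining spare capacity."""
    if group.healthy < group.required:
        return "FAILED"      # service is affected
    if group.healthy < group.total and group.healthy == group.required:
        return "AT_RISK"     # still running, but no spare left
    if group.healthy < group.total:
        return "DEGRADED"    # a unit has failed, spares remain
    return "OK"


def alerts(groups):
    """Return one message per group that needs attention."""
    return [
        f"{g.name}: {assess(g)} ({g.healthy}/{g.total} healthy)"
        for g in groups
        if assess(g) != "OK"
    ]


# Hypothetical inventory: one power supply has failed, the RAID mirror is intact.
inventory = [
    RedundantGroup("power-supply", total=2, healthy=1),
    RedundantGroup("raid-mirror", total=2, healthy=2),
]
messages = alerts(inventory)
```

The `AT_RISK` state corresponds to the situation described above: the service is still running, but the group has become a single point of failure until the failed unit is replaced.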

Server Redundancy

The addition of a duplicate server increases overall system High Availability, as it removes the risk of a dedicated server being a significant single point of failure. It is vital, however, that the redundant server can reliably and predictably maintain High Availability service continuity in the event of server failure. This redundant server pairing can be configured in an Active/Passive or Active/Active topology and requires a High Availability Cluster to manage server failover.
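As a rough illustration of Active/Passive failover, the sketch below models a passive node that takes over when heartbeats from the active node stop arriving. It is a deliberately simplified model, not how any particular cluster product works: the `PassiveNode` class and its timeout are hypothetical, and real High Availability Clusters add safeguards such as fencing and quorum to avoid both nodes becoming active at once.

```python
from dataclasses import dataclass


@dataclass
class PassiveNode:
    timeout: float                # seconds without a heartbeat before takeover
    last_heartbeat: float = 0.0   # time the last heartbeat was received
    active: bool = False          # True once this node has taken over

    def on_heartbeat(self, now):
        """Record a heartbeat received from the active node."""
        self.last_heartbeat = now

    def tick(self, now):
        """Periodic check: take over if the active node has gone silent."""
        if not self.active and now - self.last_heartbeat > self.timeout:
            self.active = True
        return self.active


# Times are plain numbers here so the behaviour is easy to follow.
standby = PassiveNode(timeout=3.0)
standby.on_heartbeat(10.0)
still_passive = standby.tick(12.0)   # 2 s of silence: within the timeout
took_over = standby.tick(13.5)       # 3.5 s of silence: failover triggered
```

The timeout is the main design trade-off: a short value shortens the outage after a real failure, while a long value reduces the chance of an unnecessary failover caused by a brief network delay.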

Systemic Redundancy

Potential single points of failure beyond the system's physical servers and hardware include every aspect that the overall system depends on for High Availability levels of operation.

The following table highlights potential risks and single points of failure, together with examples of how the associated redundancy can be used to mitigate failure and provide a greater level of system High Availability:

Aspect	Risk / Single Point of Failure	Redundancy Mitigation
Utilities	Power	UPS, back-up generators, multiple power sources
	Cooling	Multiple coolers
Location	Single physical location/site for all systems	Place redundant equipment in different racks or datacentres where possible
Connectivity	External user connectivity	Multiple network providers, back-up network availability
Operating conditions	Operating temperature, humidity and air quality	Multiple air-conditioning units, dehumidifiers, and air filters
Environment	Extreme weather, earthquake and flood risk, proximity to other risky locations (e.g., airport, warzones), sabotage (physical and remote access)	Alternative geographically separated fully redundant facility

System