Single Point of Failure | High Availability

How to Avoid a Single Point of Failure

Here we start to explore single points of failure, fault detection and recovery strategies. The key to understanding the risks and potential failures within the system is to identify all the Single Points of Failure (SPOFs) and to implement redundancy, fault detection and recovery strategies to cope with failure.

At a simple level, using only high-quality components and adding redundancy throughout the system will increase the MTBF of the overall system, albeit at a cost.
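
The effect of redundancy on availability can be illustrated with a little arithmetic. The following is a minimal sketch using the standard steady-state formula, availability = MTBF / (MTBF + MTTR), and it assumes that redundant components are identical and fail independently. The function names and the MTBF and MTTR figures are illustrative assumptions, not measurements of any real component.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` identical components in parallel,
    assuming independent failures and that one working copy is enough."""
    return 1.0 - (1.0 - single) ** copies


def series_availability(parts: list[float]) -> float:
    """Availability of a chain in which every part must work."""
    result = 1.0
    for part in parts:
        result *= part
    return result


# Assumed figures: a power supply with a 50,000 hour MTBF
# that takes 8 hours to replace.
psu = availability(50_000, 8)
print(f"single PSU:    {psu:.6f}")
print(f"redundant PSU: {redundant_availability(psu, 2):.9f}")
```

Note that a chain of components in series is always less available than its weakest member, which is why a single non-redundant part can dominate the availability of the whole system.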

It is critical to consider SPOFs and redundancy strategies across every aspect of the overall operation, including external dependencies, utilities, user access points and its operating environment, rather than just the system itself.

In addition to identifying Single Points of Failure and implementing redundancy strategies to minimise and recover from failure, it is also important to design and implement procedures for failure detection and recovery to minimise unplanned system downtime.
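
One common way to detect failure is a heartbeat timeout: each node reports in periodically, and a node that stays silent for too long is treated as failed. The sketch below is a simplified illustration of that idea only. The class name, node names and timeout value are hypothetical, and a production detector would also need to handle network partitions and false positives.

```python
import time


class HeartbeatMonitor:
    """Flags a node as failed when no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout: float, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, node: str) -> None:
        """Record that `node` has just reported in."""
        self.last_seen[node] = self.clock()

    def failed_nodes(self) -> list[str]:
        """Return the nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


# Hypothetical usage: two nodes report in, then the monitor is polled.
monitor = HeartbeatMonitor(timeout=5.0)
monitor.heartbeat("node-a")
monitor.heartbeat("node-b")
print(monitor.failed_nodes())
```

The timeout is a trade-off: a short value detects failures quickly but risks treating a briefly delayed node as dead, while a long value extends unplanned downtime.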

Illustration of a broken cable as a single point of failure

Identifying Single Points of Failure

The following table outlines aspects of the overall system that should be considered together with associated risks and mitigation, fault detection and recovery strategies to minimise system downtime:

Aspect		Considerations
Physical system components	• Risks	CPU, memory, I/O cards, power supplies, cooling fans, storage drives
	• Mitigation	Use high quality enterprise-grade components with high MTBF ratings and redundant components throughout
	• Detection	System monitoring and alerts
	• Recovery considerations	Secure fast access to spare components, documentation and engineering capability
Storage	• Risks	Connectivity, storage configuration, data integrity, performance
	• Mitigation	Implement optimal redundant storage design and configuration, e.g. mirroring
	• Detection	System monitoring and alerts
	• Recovery considerations	In-house capability for monitoring, storage configuration and backup and restore management
Software	• Risks	Operating System, middleware and applications, licensing, patch management
	• Mitigation	Pre-production testing, controlled updates and upgrades, robust patch management
	• Detection	System software alerts, application monitoring agents, bug reports
	• Recovery considerations	Readily available and up-to-date backups and recovery procedures
User Access	• Risks	Network components, firewalls, routers, cabling, user authentication
	• Mitigation	Secure physical cabling, optimal network and firewall design, robust user authentication
	• Detection	Access monitoring, network alerts
	• Recovery considerations	On-site network engineering monitoring, schematics and testing capability
External dependencies	• Risks	3rd party systems, online services, network configuration, other external system requirements
	• Mitigation	Ensure all external dependencies are known and identified with service agreements and support
	• Detection	Access monitoring and alerts
	• Recovery considerations	On-site testing, monitoring and engineering capability
Environmental	• Risks	Air quality and moisture, temperature, hygiene
	• Mitigation	Identify risks and implement air-conditioning and purifying, dehumidifying, dust collection, regular cleaning and preventative maintenance
	• Detection	Environmental monitoring and alerts
	• Recovery considerations	Consider relocation of system to cleaner and cooler environment
Physical Security	• Risks	Nearby risks, system console access, secure power and cabling
	• Mitigation	Secure systems away from physical hazards such as water sources and fire risks. Secure physical console access and cabling
	• Detection	Site inspections and reviews
	• Recovery considerations	Consider relocation of system to more secure location
Building Security	• Risks	Risk of damage from fire, earthquake, weather, terrorism
	• Mitigation	Locate systems away from flood and fire risks and areas prone to natural or other external potential disasters
	• Detection	Environmental reviews, news reports, weather forecasts
	• Recovery considerations	Consider relocation of system to safer location
Utilities	• Risks	External power, network, cooling
	• Mitigation	Implement UPS, backup generators, multiple physical networks and providers, robust SLAs with utility providers. Contract with Disaster Recovery (DR) providers
	• Detection	System monitoring and alerts
	• Recovery considerations	Use backup utility providers, implement proven and tested DR procedures
Human error	• Risks	Root user negligence
	• Mitigation	Manage and lock down system access, restricting it to essential users and procedures. Provide robust and comprehensive training for administrators and security staff
	• Detection	System monitoring and alerts, logging, use of tripwire technologies
	• Recovery considerations	Review incident, system password management, use of robust recovery procedures
Sabotage	• Risks	Security, system access, port management, encryption, network manipulation, Denial of Service attacks
	• Mitigation	Review system access needs, lock down all network access that is not required, use encryption technologies, reinforce security with service provider SLAs. Run regular security penetration tests
	• Detection	System and firewall monitoring and alerts
	• Recovery considerations	Identify vulnerabilities leading to incident, lock down as appropriate, review security service providers and procedures
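
A review like the one in the table can be backed by a simple inventory check. The sketch below assumes a hypothetical inventory that maps each function the system depends on to the components that provide it, and reports every function served by fewer than two distinct components. The function name, inventory keys and component names are invented for illustration.

```python
def find_spofs(inventory: dict[str, list[str]]) -> list[str]:
    """Return the functions served by fewer than two distinct components."""
    return sorted(
        function for function, providers in inventory.items()
        if len(set(providers)) < 2
    )


# Hypothetical inventory for a small deployment.
inventory = {
    "power": ["psu-1", "psu-2"],
    "storage": ["mirror-a", "mirror-b"],
    "network uplink": ["isp-1"],
    "cooling": ["fan-1", "fan-1"],  # the same fan listed twice is not redundancy
}

for function in find_spofs(inventory):
    print(f"single point of failure: {function}")
```

A check like this only covers what has been written down, so it complements rather than replaces the site inspections and dependency reviews listed above.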

A High Availability Strategy

Identify and analyse all Single Points of Failure
A High Availability strategy should begin by identifying and analysing every Single Point of Failure that the overall system uses and depends on. Each one should then be paired with strategies to mitigate its risks, including redundancy, fault detection and recovery, to minimise server downtime.
Identify components that can be bolstered with System Redundancy
As we have discussed, achieving higher levels of availability through increased redundancy usually results in an exponential increase in costs. A risk-reward and cost-benefit analysis should be undertaken to determine an acceptable trade-off between the minimum expected service availability and cost.
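
The trade-off can be made concrete by comparing the yearly cost of each option with the downtime it is expected to leave behind. The following is a rough sketch under stated assumptions: the option names, availability figures, solution costs and hourly downtime costs are all invented for illustration, and a real analysis would use measured figures and account for risk beyond a simple expected value.

```python
HOURS_PER_YEAR = 8760


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def total_annual_cost(availability: float, solution_cost: float,
                      downtime_cost_per_hour: float) -> float:
    """Yearly cost of the solution plus the expected cost of its downtime."""
    return solution_cost + annual_downtime_hours(availability) * downtime_cost_per_hour


def cheapest_option(options: dict[str, tuple[float, float]],
                    downtime_cost_per_hour: float) -> str:
    """Pick the option with the lowest total annual cost.

    `options` maps a name to (availability, annual solution cost).
    """
    return min(
        options,
        key=lambda name: total_annual_cost(*options[name], downtime_cost_per_hour),
    )


# Assumed figures for two hypothetical designs.
options = {
    "single server": (0.99, 0.0),
    "clustered pair": (0.9999, 20_000.0),
}
print(cheapest_option(options, downtime_cost_per_hour=1_000.0))
```

With these assumed numbers the outcome depends heavily on the hourly cost of downtime, which is usually the figure that most needs careful estimation.
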
Minimise server downtime
Implementing system redundancy and procedures to tolerate and recover from failures is key to minimising both planned and unplanned server downtime. It is equally critical to deploy a framework that not only manages all the failure, detection and recovery scenarios outlined in this document, but is also designed to recover automatically so that server downtime is kept to a minimum.
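
Automatic recovery often comes down to a failover decision: if the active node is unhealthy, hand the service to a healthy standby. The sketch below shows one simple priority-based policy. It is a generic illustration with hypothetical node names and is not a description of how any particular cluster product makes this decision.

```python
from typing import Optional


def choose_active(nodes: list[str], healthy: set[str],
                  current: Optional[str]) -> Optional[str]:
    """Keep the current active node if it is healthy; otherwise fail over
    to the first healthy node in priority order.

    Returns None when no node is healthy.
    """
    if current in healthy:
        return current
    for node in nodes:
        if node in healthy:
            return node
    return None


# Hypothetical two-node cluster in which the active node has just failed.
nodes = ["node-a", "node-b"]
print(choose_active(nodes, healthy={"node-b"}, current="node-a"))
```

Keeping the current node when it is still healthy avoids unnecessary switchovers, each of which is itself a short interruption to service.
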
Implement a High Availability Cluster framework
This is what High-Availability.com do, and our flagship product, RSF-1, has been providing cost-effective enterprise-grade High Availability technology to thousands of mission critical system deployments across all industries and around the globe for over 25 years.
