How to Avoid a Single Point of Failure

Here we start to explore single points of failure, fault detection and recovery strategies. The key to understanding the risks and potential failures within the system is to identify all the Single Points of Failure (SPOFs) and to implement redundancy, fault detection and recovery strategies to cope with failure.

At a simple level, using only high-quality components and adding redundancy throughout the system will increase the MTBF of the overall system, albeit at a cost.
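
As a rough illustration of why redundancy helps, the sketch below uses the standard steady-state relationship availability = MTBF / (MTBF + MTTR). It assumes that redundant components fail independently and that any one copy can carry the full load. The MTBF and MTTR figures are invented for the example, and the function names are illustrative only.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` identical components in parallel.

    Assumes independent failures and that one working copy is enough.
    """
    return 1.0 - (1.0 - single) ** copies


def series_availability(parts: list[float]) -> float:
    """Availability of a chain in which every part is a single point of failure."""
    result = 1.0
    for part in parts:
        result *= part
    return result


# Hypothetical figures: a power supply with a 50,000 hour MTBF
# that takes 8 hours to replace.
psu = availability(50_000, 8)
dual_psu = redundant_availability(psu, 2)
print(f"single PSU: {psu:.6f}, dual PSU: {dual_psu:.9f}")
```

The series function shows the other side of the argument: every non-redundant component in the chain multiplies the overall availability down, which is why each SPOF matters.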

It is critical to consider SPOFs and redundancy strategies across every aspect of the overall operation, including external dependencies, utilities, user access points and its operating environment, rather than just the system itself.

In addition to identifying Single Points of Failure and implementing redundancy strategies to minimise and recover from failure, it is also important to design and implement procedures for failure detection and recovery to minimise unplanned system downtime.
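
One common way to detect failure is a heartbeat: each node reports in periodically, and a node that stays silent for too long is treated as failed so that a standby can take over. The following is a minimal sketch of that idea, not a description of any particular product. The class, the function, the node names and the timeout value are all hypothetical.

```python
import time
from typing import Optional


class HeartbeatMonitor:
    """Marks a node as failed when no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout: float, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so the logic can be exercised without sleeping
        self.last_seen = {}

    def beat(self, node: str) -> None:
        """Record a heartbeat from `node`."""
        self.last_seen[node] = self.clock()

    def failed_nodes(self) -> list:
        """Return the nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


def choose_active(preference: list, failed: list) -> Optional[str]:
    """Pick the first node, in preference order, that has not been marked failed."""
    for node in preference:
        if node not in failed:
            return node
    return None


# Demonstration with a fake clock (assumed values, in seconds).
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout=3.0, clock=lambda: fake_now[0])
monitor.beat("node-a")
monitor.beat("node-b")
fake_now[0] = 2.0
monitor.beat("node-b")   # node-b keeps reporting in
fake_now[0] = 4.0        # node-a has now been silent for longer than the timeout
failed = monitor.failed_nodes()
active = choose_active(["node-a", "node-b"], failed)
print(failed, active)
```

A production framework would also need to guard against false positives and split-brain conditions, which this sketch deliberately ignores.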

Illustration of a broken cable as a single point of failure

Identifying Single Points of Failure

The following table outlines the aspects of the overall system that should be considered, together with the associated risks and the mitigation, fault detection and recovery strategies that minimise system downtime:

Physical system components
  • Risks: CPU, memory, I/O cards, power supplies, cooling fans, storage drives
  • Mitigation: Use high quality enterprise-grade components with high MTBF ratings and redundant components throughout
  • Detection: System monitoring and alerts
  • Recovery considerations: Secure fast access to spare components, documentation and engineering capability

Storage
  • Risks: Connectivity, storage configuration, data integrity, performance
  • Mitigation: Implement optimal redundant storage design and configuration, e.g. mirroring
  • Detection: System monitoring and alerts
  • Recovery considerations: In-house capability for monitoring, storage configuration and backup and restore management

Software
  • Risks: Operating System, middleware and applications, licensing, patch management
  • Mitigation: Pre-production testing, controlled updates and upgrades, robust patch management
  • Detection: System software alerts, application monitoring agents, bug reports
  • Recovery considerations: Readily available and up-to-date backups and recovery procedures

User Access
  • Risks: Network components, firewalls, routers, cabling, user authentication
  • Mitigation: Secure physical cabling, optimal network and firewall design, robust user authentication
  • Detection: Access monitoring, network alerts
  • Recovery considerations: On-site network engineering monitoring, schematics and testing capability

External dependencies
  • Risks: 3rd party systems, online services, network configuration, other external system requirements
  • Mitigation: Ensure all external dependencies are known and identified with service agreements and support
  • Detection: Access monitoring and alerts
  • Recovery considerations: On-site testing, monitoring and engineering capability

Environmental
  • Risks: Air quality and moisture, temperature, hygiene
  • Mitigation: Identify risks and implement air-conditioning and purifying, dehumidifying, dust collection, regular cleaning and preventative maintenance
  • Detection: Environmental monitoring and alerts
  • Recovery considerations: Consider relocation of system to cleaner and cooler environment

Physical Security
  • Risks: Nearby risks, system console access, secure power and cabling
  • Mitigation: Secure systems away from physical hazards such as water sources and fire risks. Secure physical console access and cabling
  • Detection: Site inspections and reviews
  • Recovery considerations: Consider relocation of system to more secure location

Building Security
  • Risks: Risk of damage from fire, earthquake, weather, terrorism
  • Mitigation: Locate systems away from flood and fire risks and areas prone to natural or other external potential disasters
  • Detection: Environmental reviews, news reports, weather forecasts
  • Recovery considerations: Consider relocation of system to safer location

Utilities
  • Risks: External power, network, cooling
  • Mitigation: Implement UPS, backup generators, multiple physical networks and providers, robust SLAs with utility providers. Contract with Disaster Recovery (DR) providers
  • Detection: System monitoring and alerts
  • Recovery considerations: Use backup utility providers, implement proven and tested DR procedures

Human error
  • Risks: Root user negligence
  • Mitigation: Manage and lockdown system access for only essential users and procedures. Robust and comprehensive training for administrators and security
  • Detection: System monitoring and alerts, logging, use of tripwire technologies
  • Recovery considerations: Review incident, system password management, use of robust recovery procedures

Sabotage
  • Risks: Security, system access, port management, encryption, network manipulation, Denial of Service attacks
  • Mitigation: Review system access needs, lockdown all unrequired network access, use of encryption technologies, lockdown security with service provider SLAs. Regular security penetration tests
  • Detection: System and firewall monitoring and alerts
  • Recovery considerations: Identify vulnerabilities leading to incident, lock down as appropriate, review security service providers and procedures
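
For the parts of the system that can be drawn as a dependency diagram, identifying SPOFs can be partly automated. The sketch below is one possible approach, not a prescribed method: it models the path from users to storage as a directed graph and reports every component whose individual failure cuts all routes. The topology and component names are hypothetical, and the two endpoints themselves are excluded from the check.

```python
from collections import deque


def reachable(graph: dict, start: str, goal: str, removed: str) -> bool:
    """Breadth-first search from start to goal, pretending `removed` has failed."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, set()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph: dict, start: str, goal: str) -> list:
    """Components whose individual failure cuts every path from start to goal."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    candidates = nodes - {start, goal}
    return sorted(n for n in candidates if not reachable(graph, start, goal, n))


# Hypothetical topology: switches and servers are duplicated, the firewall is not.
topology = {
    "users": {"firewall"},
    "firewall": {"switch-1", "switch-2"},
    "switch-1": {"server-a", "server-b"},
    "switch-2": {"server-a", "server-b"},
    "server-a": {"storage"},
    "server-b": {"storage"},
}
spofs = single_points_of_failure(topology, "users", "storage")
print(spofs)
```

In this made-up topology only the firewall is reported, because every other tier has a second route. Aspects such as human error or environmental risk do not fit a graph and still need the manual review outlined above.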

A High Availability Strategy

  • Identify and analyse all Single Points of Failure

    A High Availability strategy should begin with the identification and analysis of all the Single Points of Failure that the overall system uses and depends on, followed by strategies for risk mitigation, redundancy, fault detection and recovery to minimise server downtime.

  • Identify components that can be bolstered with System Redundancy

    As we have discussed, increased redundancy usually results in an exponential increase in costs to achieve higher levels of availability. A risk-reward and cost-benefit analysis should be undertaken to determine an acceptable trade-off between the minimum expected service availability and cost.

  • Minimise server downtime

    Implementing system redundancy and procedures to tolerate and recover from failures is key to minimising both planned and unplanned server downtime. It is critical to deploy a framework that not only manages all the failure, detection and recovery scenarios outlined in this document, but is also designed to recover automatically to minimise server downtime.

  • Implement a High Availability Cluster framework

    This is what we do. Our flagship product, RSF-1, has been providing cost-effective enterprise-grade High Availability technology to thousands of mission-critical system deployments across all industries and around the globe for over 25 years.
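
The cost-benefit exercise described under System Redundancy above can be sketched as a simple calculation: convert each option's availability into expected downtime per year, price that downtime, and add the cost of the solution itself. All figures below (availability levels, solution costs and the hourly cost of downtime) are invented for illustration and are not taken from any real deployment.

```python
HOURS_PER_YEAR = 24 * 365


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def annual_cost(availability: float, solution_cost: float,
                downtime_cost_per_hour: float) -> float:
    """Cost of the solution plus the expected cost of the downtime it still allows."""
    return solution_cost + expected_downtime_hours(availability) * downtime_cost_per_hour


# Hypothetical options: (name, availability, annual cost of the solution)
options = [
    ("single server", 0.99, 10_000),
    ("two-node cluster", 0.999, 25_000),
    ("multi-site cluster", 0.9999, 120_000),
]
downtime_cost_per_hour = 1_000  # assumed business cost of one hour of outage

for name, avail, cost in options:
    total = annual_cost(avail, cost, downtime_cost_per_hour)
    print(f"{name}: {expected_downtime_hours(avail):.2f} h/year down, total {total:,.0f}")

best = min(options, key=lambda o: annual_cost(o[1], o[2], downtime_cost_per_hour))
print("lowest total cost:", best[0])
```

With these assumed numbers the middle option has the lowest total cost. A higher hourly cost of downtime would shift the balance towards the more redundant design, which is the trade-off the analysis is meant to expose.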