High Availability Architecture

A Highly Available system is one that keeps its services continuously available and accessible, and that can both tolerate and recover from anticipated and unexpected failures. Such systems are generally designed to run mission-critical applications and services that must remain operational whenever they are required.

What makes a system Highly Available is the use of high-quality, reliable components, the elimination of Single Points of Failure through component redundancy, and the ability to tolerate and quickly recover from failure so that system uptime is maximised.
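As a rough illustration of why redundancy removes a Single Point of Failure, the sketch below computes the combined availability of components in series (every one must work) and in parallel (any one is enough). It assumes component failures are independent, which real systems only approximate, and the 99% figures are hypothetical rather than measured.

```python
from math import prod

def series_availability(availabilities):
    """Every component must be up: multiply the individual availabilities."""
    return prod(availabilities)

def parallel_availability(availabilities):
    """At least one redundant component must be up.

    Assumes failures are independent (an idealisation).
    """
    return 1 - prod(1 - a for a in availabilities)

# Hypothetical figures: one 99% power supply versus a redundant pair.
single = parallel_availability([0.99])
redundant_pair = parallel_availability([0.99, 0.99])

print(f"single supply:  {single:.4%}")
print(f"redundant pair: {redundant_pair:.4%}")
```

Under these assumptions the redundant pair is unavailable only when both units are down at once, which is why duplicating a component improves availability far more than a modest increase in the quality of a single unit.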

Operating a Highly Available system also means planning for preventative maintenance and service windows, so that minimum service availability expectations can still be met.
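One way to check whether planned maintenance fits within an availability expectation is to convert the target into a yearly downtime budget. The sketch below is illustrative only: the 99.9% target and the hour figures are hypothetical, and it assumes planned maintenance counts against the same budget as unplanned outages, which depends on how a given service agreement is written.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def downtime_budget_hours(availability_target):
    """Hours per year the service may be down while still meeting the target."""
    return (1 - availability_target) * HOURS_PER_YEAR

def fits_budget(availability_target, planned_hours, expected_unplanned_hours):
    """True if planned plus expected unplanned downtime stays within budget."""
    total = planned_hours + expected_unplanned_hours
    return total <= downtime_budget_hours(availability_target)

# Hypothetical example: a 99.9% target with 4 hours of planned maintenance
# and an allowance of 4 hours for unplanned outages.
print(f"budget: {downtime_budget_hours(0.999):.2f} hours/year")
print("fits:", fits_budget(0.999, planned_hours=4, expected_unplanned_hours=4))
```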

Threats to High Availability

Anticipated failures are those that can be expected to occur at some point, for example the failure of a component with a finite life expectancy. All components will eventually fail, but those with moving parts, such as hard disk drives, and those exposed to regular power cycles, such as power supplies, are more prone to failure, and their failure can therefore be anticipated.
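A common way to quantify an anticipated failure is to combine how often a component fails with how long it takes to restore. The sketch below uses the widely quoted steady-state relation availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. The figures are hypothetical and chosen only for illustration.

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Long-run fraction of time a component is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical disk drive: fails on average every 100,000 hours.
slow_repair = steady_state_availability(mtbf_hours=100_000, mttr_hours=48)
fast_repair = steady_state_availability(mtbf_hours=100_000, mttr_hours=4)

print(f"48-hour repair: {slow_repair:.5%}")
print(f" 4-hour repair: {fast_repair:.5%}")
```

The comparison shows that, for a component whose failure rate cannot be changed, shortening recovery time is the remaining lever for improving availability.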

Because every component in a computer system will fail at some point, failure detection and recovery planning are required for every anticipated failure scenario in order to minimise overall system and service downtime.
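Failure detection is often built on heartbeats: each component reports in periodically, and one that stays silent for too long is treated as failed. The following is a minimal sketch of that idea, not a production monitor. The class name `HeartbeatMonitor`, the component names, and the timeout value are all hypothetical.

```python
import time

class HeartbeatMonitor:
    """Flags a component as failed when its heartbeats stop arriving."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.last_seen = {}

    def record_heartbeat(self, component):
        """Note that the component reported in at the current time."""
        self.last_seen[component] = self.clock()

    def failed_components(self):
        """Return the names of components silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )

# Hypothetical usage: components call record_heartbeat() periodically and a
# supervisor polls failed_components() to trigger the recovery plan.
monitor = HeartbeatMonitor(timeout_seconds=5)
monitor.record_heartbeat("database")
print(monitor.failed_components())
```

The timeout is a trade-off: a short value detects failures sooner but risks false alarms when a healthy component is merely slow to report.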

Software failures can also be reasonably anticipated: at some point a system will crash and require a reboot because of an operating system or application software failure. Planned maintenance windows will also be required to install software upgrades and patches that mitigate potential future failures.

Security is another key area of failure anticipation that should not be overlooked. Nobody expects their home to be broken into, but if windows and doors are visibly left open, a break-in becomes more likely. The same applies to computer services, where the risk of sabotage and failure can be significantly reduced by closing all the doors: disabling unneeded network ports, securing system console access, using firewalls, and enforcing robust access and password policies.

Unexpected and Unlikely Failures

Anticipated failures are largely in-system risks that can be identified, predicted, and reasonably mitigated. Unexpected failure risks, on the other hand, may be harder to envision, mitigate, and recover from: for example, a data centre flood, a building fire, an earthquake, the outbreak of war, or sabotage. The list of potential risks is endless, as is the potential cost of mitigating them. It is therefore important to apply cost versus benefit analysis when undertaking High Availability planning, concentrating on the risks whose mitigation is reasonably affordable and justifiable.

Planning and costing High Availability and recovery strategies to mitigate anticipated failures is, however, a reasonably straightforward exercise that can be effectively implemented and tested.
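The cost versus benefit reasoning described above can be sketched as a comparison between the expected yearly loss a mitigation removes and what that mitigation costs per year. This is a simplified model under stated assumptions: the probabilities and costs below are hypothetical, and real planning would also weigh factors that are hard to price, such as reputation or safety.

```python
def expected_annual_loss(annual_probability, cost_per_incident):
    """Expected yearly loss: chance of the incident times its cost."""
    return annual_probability * cost_per_incident

def mitigation_is_justified(annual_probability, cost_per_incident,
                            residual_probability, annual_mitigation_cost):
    """True if the reduction in expected loss exceeds the mitigation's cost."""
    loss_before = expected_annual_loss(annual_probability, cost_per_incident)
    loss_after = expected_annual_loss(residual_probability, cost_per_incident)
    return (loss_before - loss_after) > annual_mitigation_cost

# Hypothetical scenarios, for illustration only.
# Disk failure: likely and cheap to mitigate with redundancy.
print(mitigation_is_justified(0.2, 500_000, 0.02, 30_000))
# Rare site-wide disaster: costly to mitigate relative to the expected loss.
print(mitigation_is_justified(0.01, 2_000_000, 0.001, 50_000))
```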