Gone are the days when even people who don’t understand technology knew what “Server down” meant. We have become used to uninterrupted service. Even a small lag raises eyebrows now.
As more and more mission-critical workloads move into the cloud, High Availability (HA) has become a crucial aspect of system design. HA refers to a guarantee that a system or component is continuously operational and available for service. It is usually measured relative to 100% availability. You will have seen uptime guarantees like 99.999% available (“five nines”). This means roughly 5 minutes of downtime per year.
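To make the arithmetic behind the "nines" concrete, here is a minimal sketch that converts an availability percentage into allowed downtime per year. The function name and the choice of an average 365.25-day year are assumptions made for illustration.

```python
# Average year length in seconds (365.25 days), an assumption for this sketch.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the yearly downtime allowed by an availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


# Print the downtime budget for three, four and five nines.
for pct in (99.9, 99.99, 99.999):
    minutes = downtime_seconds_per_year(pct) / 60
    print(f"{pct}% -> {minutes:.1f} min/year")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why going from three nines to five nines is far harder than it sounds.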
You will find cloud service providers advertising guarantees of 11, 12 and even 15 nines. This is good, but even that can be too much downtime for a mission-critical service.
We normally build an HA system with redundant hardware and software fault tolerance to minimize human intervention. Tolerance is discussed in terms of two things: faults and failures.
A failure is a state in which the system fails to meet its specification. A fault is the failure of a subsystem. It can cause other subsystems to fault and, possibly, the overall system to fail.
Faults can be transient, intermittent or permanent. The following are effective ways of dealing with faults:
- Forecast faults: you need mathematical models that identify the presence of a fault and its consequences. These models are often built by injecting faults and studying the resulting faults and failures.
- Avoid and remove faults: the system goes through different verification techniques that make it more robust.
- Tolerate faults: also called graceful degradation or fail-safe, these are methods through which the system can be brought to a safe state.
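The difference between a fault and a failure, and the idea of studying a system through fault injection, can be sketched in a few lines. This is only an illustration under stated assumptions: `flaky_subsystem`, `call_with_retries` and `failure_rate` are hypothetical names, faults are injected with a fixed probability, and simple retries stand in for whatever tolerance mechanism a real system would use.

```python
import random


class TransientFault(Exception):
    """Raised by the hypothetical subsystem when an injected fault occurs."""


def flaky_subsystem(rng: random.Random, fault_probability: float) -> str:
    # Fault injection: fault with the given probability, otherwise succeed.
    if rng.random() < fault_probability:
        raise TransientFault("injected fault")
    return "ok"


def call_with_retries(rng: random.Random, fault_probability: float, attempts: int) -> bool:
    """Return True if the system met its spec, False if it failed."""
    for _ in range(attempts):
        try:
            flaky_subsystem(rng, fault_probability)
            return True
        except TransientFault:
            continue  # tolerate the fault and try again
    return False  # every attempt faulted, so the fault became a failure


def failure_rate(fault_probability: float, attempts: int,
                 trials: int = 10_000, seed: int = 42) -> float:
    """Estimate how often injected faults turn into system failures."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    failures = sum(
        not call_with_retries(rng, fault_probability, attempts)
        for _ in range(trials)
    )
    return failures / trials


print("no retries:", failure_rate(0.3, attempts=1))
print("3 attempts:", failure_rate(0.3, attempts=3))
```

With a single attempt every fault is a failure. With retries, most transient faults are absorbed and only a small fraction surface as failures. Retries would not help against a permanent fault.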
At the core of fault tolerance is redundancy: hardware, software, time and information redundancy. This enables a system to have high tolerance and, thus, provide high availability.
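A quick way to see why redundancy raises availability is to compute the combined availability of several replicas. The sketch below assumes that replicas fail independently and that one healthy replica is enough to serve requests. Real deployments often violate the independence assumption (shared power, network or software bugs), so treat the result as an upper bound. The function name is hypothetical.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` independent copies where one healthy copy suffices."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The system is down only when every replica is down at the same time.
    return 1.0 - (1.0 - single) ** replicas


for n in (1, 2, 3):
    print(n, "replica(s):", combined_availability(0.99, n))
```

Under these assumptions, two replicas at 99% each give about 99.99% combined, since both must be down at once for the system to be unavailable.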
In Azure, there are three design patterns that help design systems for maximum availability:
- Throttling – Control resource consumption by an application, tenant or service
- Queue-based load leveling – Use a queue as a buffer between a service and its tasks to smooth out load
- Health monitoring – Expose functional checks that external tools can access
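The three patterns above can be sketched in plain Python to show the idea behind each. This is not Azure code and uses no Azure SDK: `TokenBucket` and `LoadLeveler` are hypothetical classes, a token bucket is just one possible way to throttle, and an in-process `queue.Queue` stands in for a real message queue.

```python
import queue
import time
from typing import Callable, Dict, List


class TokenBucket:
    """Throttling: allow about `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill tokens for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller should reject or delay the request


class LoadLeveler:
    """Queue-based load leveling: producers enqueue, a worker drains at its own pace."""

    def __init__(self, max_pending: int):
        self.pending = queue.Queue(maxsize=max_pending)
        self.processed: List[str] = []

    def submit(self, task: str) -> bool:
        try:
            self.pending.put_nowait(task)
            return True
        except queue.Full:
            return False  # buffer is full, so shed load instead of overloading the service

    def drain(self, batch_size: int) -> int:
        """Process up to `batch_size` tasks and return how many were handled."""
        handled = 0
        while handled < batch_size:
            try:
                task = self.pending.get_nowait()
            except queue.Empty:
                break
            self.processed.append(task)  # stand-in for real work
            handled += 1
        return handled

    def health(self) -> Dict[str, object]:
        """Health monitoring: a functional check an external tool could poll."""
        depth = self.pending.qsize()
        status = "healthy" if depth < self.pending.maxsize else "degraded"
        return {"status": status, "queue_depth": depth}


bucket = TokenBucket(rate=5, capacity=2)
leveler = LoadLeveler(max_pending=3)
for i in range(4):
    if bucket.allow():  # throttle the burst before it reaches the queue
        leveler.submit(f"task-{i}")
print(leveler.health())
```

The clock is injectable so the throttle can be exercised deterministically. In a real service the `health()` result would typically be exposed over an HTTP endpoint for a monitoring tool to poll.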