A high availability cluster is a group of hosts that act like a single system and provide continuous uptime.

High availability clusters are often used for load balancing, backup, and failover purposes. To configure a high-availability (HA) cluster properly, all hosts in the cluster must have access to the same shared storage. This allows virtual machines (VMs) on a given host to fail over to another host without any downtime in the event of a failure.

  • Data Backups, Recovery and Replication
  • Clustering
  • Failover Solutions
  • Geographic Redundancy
  • Plan for Failure

When Is High Availability Important?

When setting up robust production systems, minimizing downtime and service interruptions is often a high priority. Regardless of how reliable your systems and software are, problems can occur that can bring down your applications or your servers.

Implementing high availability for your infrastructure is a useful strategy to reduce the impact of these types of events. Highly available systems can recover from server or component failure automatically.

If all your servers are located in the same geographical area, an environmental condition such as an earthquake or flooding could take your whole system down. Having redundant servers in different datacenters and geographical areas will increase reliability.


Highly available servers should be resilient to power outages and hardware failures, including failures of hard disks and network interfaces. The whole software stack, including the operating system and the application itself, must be prepared to handle unexpected failures that could, for instance, require a system restart.


Data loss and inconsistency can be caused by several factors and are not restricted to hard disk failures. Highly available systems must account for data safety in the event of a failure. Unplanned network outages represent another possible point of failure for highly available systems. It is important to have a redundant network strategy in place for such failures.

Data Backups, Recovery and Replication

A sound backup and recovery strategy is the hallmark of a data protection plan that guards against system failure. Valuable data should never be stored without proper backups, replication, or the ability to recreate the data. Every data center should plan for data loss or corruption in advance. Data errors can cause customer authentication issues, damage financial accounts, and in turn hurt the credibility of the business. The recommended strategy for maintaining data integrity is to create a full backup of the primary database and then incrementally test the source server for data corruption. Full backups are the foundation of recovery from catastrophic system failure.
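
One way to make "testing for corruption" concrete is to record a checksum when a backup is taken and re-check it later. The sketch below is only an illustration under stated assumptions: it treats a backup as a plain file copy, and the function names (`sha256_of`, `make_verified_backup`, `backup_is_intact`) are hypothetical, not part of any particular backup tool.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_verified_backup(source: Path, backup_dir: Path) -> str:
    """Copy `source` into `backup_dir` and return the checksum of the copy."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)
    checksum = sha256_of(target)
    # Fail loudly if the copy does not match the source.
    if checksum != sha256_of(source):
        raise IOError(f"backup of {source} does not match the original")
    return checksum


def backup_is_intact(backup: Path, expected_checksum: str) -> bool:
    """Re-check a stored backup against the checksum recorded at backup time."""
    return backup.exists() and sha256_of(backup) == expected_checksum
```

Storing the returned checksum alongside the backup lets a scheduled job call `backup_is_intact` and raise an alert before the backup is actually needed for a restore.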


Load balancing is an effective way to increase the availability of critical web-based applications. When a server failure is detected, traffic is automatically redistributed to the servers that are still running, so the failed instance is replaced seamlessly. Load balancing not only improves availability, it also facilitates incremental scalability and higher levels of fault tolerance within service applications. Network load balancing can be accomplished via either a ‘pull’ or a ‘push’ model.
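
The redistribution described above can be sketched as a round-robin balancer that skips servers marked as down. This is a simplified illustration, not how any specific load balancer is implemented; the class name `RoundRobinBalancer` and the server names are hypothetical, and health detection itself is assumed to happen elsewhere.

```python
class RoundRobinBalancer:
    """Hand out servers in rotation, skipping any that are marked down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0

    def mark_down(self, server):
        """Called by a health check (assumed, not shown) when a server fails."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Called when a known server passes its health check again."""
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        """Return the next healthy server in rotation."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
balancer.mark_down("web-2")
print([balancer.pick() for _ in range(4)])
# prints: ['web-1', 'web-3', 'web-1', 'web-3']
```

Adding a server to the list is all it takes to absorb more traffic, which is the incremental scalability mentioned above.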


Geo-redundancy is the only line of defense when it comes to preventing service failure in the face of catastrophic events, such as natural disasters, that cause system outages. As with geo-replication, multiple servers are deployed at geographically distinct sites. The locations should be globally distributed rather than concentrated in one area. It is crucial to run an independent application stack in each location, so that if one location fails, the others can continue running. Ideally, these locations should be completely independent of each other.


Even with the highest quality of software engineering, all application services are bound to fail at some point. High availability is all about delivering application services regardless of failures. Clustering can provide instant failover of application services in the event of a fault. An application service that is ‘cluster aware’ is capable of calling resources from multiple servers; it falls back to a secondary server if the main server goes offline. A high availability cluster includes multiple nodes that share information via shared data memory grids. This means that any node can be disconnected from the network or shut down, and the rest of the cluster will continue to operate normally as long as at least one node is fully functional. Each node can be upgraded individually and rejoined while the cluster operates. The high cost of purchasing additional hardware to implement a cluster can be mitigated by setting up a virtualized cluster that utilizes the available hardware resources.
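
The ‘cluster aware’ fallback behaviour can be illustrated with a small client-side sketch. Everything here is an assumption made for illustration: nodes are modelled as plain Python callables, and `NodeUnavailable` and `call_with_fallback` are hypothetical names rather than part of any clustering product.

```python
class NodeUnavailable(Exception):
    """Raised by a node handler when the node cannot serve the request."""


def call_with_fallback(nodes, request):
    """Try each (name, handler) pair in order; return the first success."""
    errors = []
    for name, handler in nodes:
        try:
            return name, handler(request)
        except NodeUnavailable as exc:
            # Remember why this node failed, then fall back to the next one.
            errors.append(f"{name}: {exc}")
    raise NodeUnavailable("all nodes failed: " + "; ".join(errors))


def primary(request):
    # Simulates the main server being offline.
    raise NodeUnavailable("connection refused")


def secondary(request):
    return f"handled {request}"


node, response = call_with_fallback(
    [("node-a", primary), ("node-b", secondary)], "GET /status"
)
print(node, response)
# prints: node-b handled GET /status
```

The request only fails when every node in the list has failed, which mirrors the point that the cluster keeps operating as long as at least one node is functional.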


High availability architecture traditionally consists of a set of loosely coupled servers with failover capabilities. Failover is a backup operational mode in which the functions of a system component are assumed by a secondary system when the primary one goes offline, either due to failure or planned downtime. A ‘cold failover’ occurs when the secondary server is started only after the primary one has been completely shut down. A ‘hot failover’ occurs when all the servers are running simultaneously and the load is directed entirely towards a single server at any given time. In both scenarios, tasks are automatically offloaded to a standby system component so that the process remains as seamless as possible to the end user. In a well-controlled environment, failover can be managed via DNS.
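
A hot failover can be sketched as a monitor that promotes the standby when the active server stops sending heartbeats. This is a minimal sketch under assumptions: the heartbeat mechanism, the five-second timeout, and the `FailoverMonitor` class are all hypothetical, and a real deployment would also have to redirect traffic (for example via DNS) after the promotion.

```python
import time


class FailoverMonitor:
    """Track one active server and one running standby (hot failover)."""

    def __init__(self, primary, standby, timeout_seconds=5.0, clock=time.monotonic):
        self.active = primary
        self.standby = standby
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._last_heartbeat = clock()

    def heartbeat(self):
        """Record that the active server reported in."""
        self._last_heartbeat = self._clock()

    def check(self):
        """Promote the standby if the active server has been silent too long."""
        silent_for = self._clock() - self._last_heartbeat
        if silent_for > self.timeout_seconds and self.standby is not None:
            self.active, self.standby = self.standby, None
            # Give the newly promoted server a fresh heartbeat window.
            self._last_heartbeat = self._clock()
        return self.active
```

Calling `check()` on a schedule returns whichever server should currently receive the load. The `clock` parameter exists only so the timing can be simulated in tests instead of waiting in real time.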


Applying the best practices for high availability is essentially planning for failure, but there are other actions an organization can take to increase its preparedness for a system failure that leads to downtime. Organizations should keep failure and resource consumption data that can be used to isolate problems and analyze trends. This data can only be gathered through continuous monitoring of the operational workload. A recovery help desk can be put in place to gather problem information, establish problem history, and begin immediate problem resolution. A recovery plan should not only be well documented but also tested regularly to ensure it is practical when dealing with unplanned interruptions. Staff training on availability engineering will improve skills in designing, deploying, and maintaining high availability architectures. Security policies should also be put in place to curb system outages caused by security breaches.
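
As an illustration of how recorded failure data can feed trend analysis, the sketch below computes availability for a reporting period from a list of outage windows. It assumes outages are stored as non-overlapping (start, end) timestamp pairs; the `availability` function and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta


def availability(period_start, period_end, outages):
    """Return the fraction of the period during which the service was up.

    `outages` is an iterable of (start, end) datetimes, assumed not to overlap.
    """
    total = (period_end - period_start).total_seconds()
    if total <= 0:
        raise ValueError("period_end must be after period_start")
    down = 0.0
    for start, end in outages:
        # Count only the part of each outage that falls inside the period.
        start = max(start, period_start)
        end = min(end, period_end)
        if end > start:
            down += (end - start).total_seconds()
    return 1.0 - down / total


start = datetime(2024, 1, 1)
end = start + timedelta(hours=100)
outage = (start + timedelta(hours=10), start + timedelta(hours=11))
print(f"{availability(start, end, [outage]):.2%}")
# prints: 99.00%
```

Computing this figure for each week or month makes it easier to see whether availability is improving or degrading over time.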