With the introduction of Stretched Clusters for VMware Cloud on AWS, customers can protect applications against failures of entire AWS Availability Zones (AZs). Because of the increased availability of Stretched Clusters, customers have asked us: Should you use Stretched Clusters as a disaster recovery (DR) solution?
The answer is no. VMware Cloud on AWS Stretched Cluster is a high availability feature, not a DR solution.
Let’s take a closer look.
One of the first steps in a DR planning process is to decide what types of failures or disasters you want to guard against. From security attacks to disk drive failures and software bugs, sources of failures and outages are all around us. For our purposes here, we focus on infrastructure vulnerabilities. The scope of an infrastructure failure falls into one of three categories:
- Local failures
- Data center failures
- Region failures
Individual component or system failures, like disk errors, NIC failures, or other general server instance failures, are not uncommon. Hardware is not infallible and things sometimes break. We define a Local failure here as one that does not take out the entire AZ, for example a failure that affects a single host. VMware Cloud on AWS handles these failures through a combination of the Storage Policy-Based Management (SPBM) Failures To Tolerate (FTT) setting and the VMware Cloud on AWS service. More details can be found in Glenn Sizemore’s excellent blog that explains how vSAN works with VMware Cloud on AWS. Since the VMware Cloud on AWS platform in large part handles fault tolerance for Local failures for you, your applications don’t need to be rearchitected to manage this complexity.
The bottom line for errors contained within a single host is that a single-AZ SDDC is all you need. A multi-AZ Stretched Cluster SDDC handles the next level of failure, when an entire data center fails.
Data Center Failures
An AWS AZ is a logical data center within an AWS Region. While AWS has carefully designed AZs with redundancy built in, failures at the AZ level can still happen. As a result, Amazon recommends launching applications across AZs to minimize the chances of downtime. This is why VMware Cloud on AWS introduced Stretched Clusters. This feature synchronously replicates data between AZs, enabling your applications to survive even AZ-level outages. Again, Glenn’s blog explains the details of how Stretched Clusters work under the hood.
What is amazing about Stretched Clusters is that your applications running on VMware Cloud on AWS can survive AZ failures without having to change any code; your apps don’t need to know anything about AZs. Mission-critical services that require the highest possible availability and protection against the rare event of a full AZ failure should be running in a Stretched Cluster SDDC.
This is where the DR question comes in. If Stretched Clusters protect the application even from the very rare occurrence of an AZ-level failure, is a DR solution still necessary? The answer is yes, because of the even rarer possibility of an entire AWS region outage.
As you undoubtedly already know, AWS deploys data centers in geographically separate regions all over the world. These regions are completely independent: they are physically isolated in different locations. They don’t share power, cooling, water supply, networks, or anything else.
AWS regions are each composed of two or more AZs. AZs connect to each other over low-latency links, but are otherwise also separate from each other: separate data centers, separate power, separate servers, separate networks, etc. What is NOT separate is the location; AWS AZs in the same region are only physically separated by a relatively short distance. For example, the AWS Oregon region currently has three AZs: us-west-2a, us-west-2b, us-west-2c. Each of these AZs is its own logically separate data center, but they are all in Oregon. So, while AWS has carefully engineered each AZ to be isolated from failures in other AZs, it cannot protect against a disaster that impacts an entire region.
Since a Stretched Cluster spans AZs in the same AWS region, it can guard against AZ failures, but it cannot help if a disaster impacts an entire region. Think earthquakes, floods, or hurricanes that can take out all data centers in an AWS region. To protect against these types of disasters, you need a DR solution that saves your data in another geographically separate location. This is where DR solutions like the VMware Site Recovery add-on come in.
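The reasoning above can be sketched as a small check. This is an illustration only, not a VMware or AWS API: the helper names `region_of` and `survives_region_outage` are hypothetical, and the sketch assumes the common AZ naming pattern of a region name followed by a single letter (as in us-west-2a).

```python
def region_of(az):
    """Derive the region from an AZ name, e.g. 'us-west-2a' -> 'us-west-2'.

    Assumption: the AZ name is the region name plus one trailing letter.
    """
    return az[:-1]


def survives_region_outage(cluster_azs, dr_region=None):
    """Return True only if a DR copy exists in a region the cluster does not use.

    cluster_azs: AZ names the (hypothetical) Stretched Cluster spans.
    dr_region:   region of the DR site, or None if there is no DR solution.
    """
    cluster_regions = {region_of(az) for az in cluster_azs}
    return dr_region is not None and dr_region not in cluster_regions


stretched = ["us-west-2a", "us-west-2b"]
print(survives_region_outage(stretched))               # no DR site at all
print(survives_region_outage(stretched, "us-west-2"))  # DR site in the same region
print(survives_region_outage(stretched, "us-east-1"))  # DR site in another region
```

Only the last call returns True: spanning more AZs never changes the answer, because every AZ of a Stretched Cluster maps to the same region.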
Infrastructure failures can be categorized in terms of their scope: failures that are local to the AZ, failures impacting an entire AZ, and failures impacting an entire AWS Region. In each case, the mitigation is different.
The VMware Cloud on AWS Stretched Cluster feature provides resiliency to applications that need to survive AZ failures. It is not intended to cope with disasters that impact multiple AZs or entire AWS regions. Stretched Clusters are an HA feature, not a DR solution. To maximize application availability and provide DR in case of large-scale disasters that might take out entire regions, Stretched Clusters should be implemented alongside a DR solution, like VMware Site Recovery.
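For readers who prefer a lookup table, the three failure scopes and the mitigation this article pairs with each can be written down as a short sketch. The names `FailureScope`, `MITIGATION`, and `unmitigated_scopes` are made up for illustration and do not correspond to any VMware tooling.

```python
from enum import Enum


class FailureScope(Enum):
    LOCAL = "local"    # failure contained within an AZ, e.g. a single host
    AZ = "az"          # failure of an entire Availability Zone
    REGION = "region"  # failure of an entire AWS region


# Mitigation paired with each scope, as described in the article.
MITIGATION = {
    FailureScope.LOCAL: "single-AZ SDDC (FTT setting plus the VMware Cloud on AWS service)",
    FailureScope.AZ: "multi-AZ Stretched Cluster SDDC",
    FailureScope.REGION: "DR solution in another location, such as VMware Site Recovery",
}


def unmitigated_scopes(stretched_cluster, dr_solution):
    """List the failure scopes that lack the mitigation the article pairs with them."""
    gaps = []
    if not stretched_cluster:
        gaps.append(FailureScope.AZ)
    if not dr_solution:
        gaps.append(FailureScope.REGION)
    return gaps


# A Stretched Cluster without a DR solution still leaves region failures open.
for scope in unmitigated_scopes(stretched_cluster=True, dr_solution=False):
    print(scope.value, "->", MITIGATION[scope])
```

The function treats local failures as always handled, since the article notes that the platform covers them in any SDDC.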