There are three vital components to supporting business continuity in VMware Cloud on AWS. You need Data Protection to ensure data integrity, High Availability to minimize disruptions, and Disaster Recovery to recover from unplanned outages. Learn about this three-pronged approach in the latest blog and create a proactive plan to mitigate risk.
As Jeremiah Megie and Glenn Sizemore discussed in the VMware Cloud on AWS: SDDC Availability Deep Dive (HBI1924BU) session at VMworld 2019, there are three components of Business Continuity: Data Protection, High Availability, and Disaster Recovery.
Figure 1. Business Continuity Components
These three components are not interchangeable; together, they are essential to an overall business-continuity strategy.
As a best practice, you need all three: Data Protection to ensure data integrity and minimize data loss, High Availability to minimize disruptions through redundant or fault-tolerant components, and Disaster Recovery to recover from unplanned outages or natural disasters.
Michael Kolos discussed the shared responsibility model for VMware Cloud on AWS in his Business Continuity Planning Basics blog. In this shared model, VMware is responsible for the infrastructure and the underlying software and configuration of the Software-Defined Data Center (SDDC). This includes management components such as ESXi hosts and vCenter, to name a few. You are responsible for the workloads running in your SDDC. VMware does not maintain backup or redundant copies of customer data in the SDDC beyond what is configured in the customer-managed vSAN storage policy.
Figure 2. Data Protection Partner Strategy
VMware Cloud on AWS relies on our partners for data protection solutions. You can check the VMware Partner Solutions page to find certified backup partners for VMware Cloud on AWS. Some of these certified partners protect VMs at the hypervisor level, while other solutions work by installing agents inside each individual VM.
vSphere High Availability is turned on by default on all clusters in an SDDC on VMware Cloud on AWS. For High Availability in VMware Cloud on AWS, there are two production cluster offerings to choose from (see Figure 3). Both offerings are built into the infrastructure layer and eliminate the need to architect high availability into the application layer.
Figure 3. Production Cluster Offerings
A single (or standard) cluster in the SDDC is protected by vSphere High Availability and host auto-remediation; however, it does not protect against an AWS Availability Zone failure. The cluster can tolerate multiple host failures depending on its vSAN Failures to Tolerate (FTT) setting: VMware requires FTT to be set to 1 for clusters of up to 5 hosts, and to 2 for 6 or more hosts. The minimum host deployment for a standard SDDC is 3 hosts.
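The FTT sizing rule above is simple enough to capture in a few lines. Here is a minimal sketch (the function name is ours, not part of any VMware tooling) that returns the minimum required FTT value for a given standard-cluster size:

```python
def required_ftt(num_hosts: int) -> int:
    """Minimum vSAN Failures to Tolerate (FTT) value VMware requires
    for a standard cluster: FTT=1 for up to 5 hosts, FTT=2 for 6+."""
    if num_hosts < 3:
        raise ValueError("a standard SDDC cluster requires at least 3 hosts")
    return 1 if num_hosts <= 5 else 2
```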
The stretched cluster feature deploys a single SDDC across two Availability Zones with a zero Recovery Point Objective (RPO), thanks to the synchronous replication built into the cloud SDDC. You also get a significantly better Recovery Time Objective (RTO) compared to a standard cluster. A stretched cluster protects against an Availability Zone failure, but it doesn't protect against an AWS Region failure. The minimum requirement for a stretched cluster configuration in an SDDC is 6 hosts; additional hosts can be added, but must be added in pairs across the Availability Zones. For more information about stretched clusters in VMware Cloud on AWS, check out this blog by Glenn Sizemore.
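The stretched-cluster sizing rules described above (a minimum of 6 hosts, growth in cross-AZ pairs) can be sketched as a quick validation helper. This is illustrative only, not a VMware API:

```python
def is_valid_stretched_cluster(total_hosts: int) -> bool:
    """A stretched cluster needs at least 6 hosts, and because hosts
    are added in pairs across the two Availability Zones, the total
    host count must stay even."""
    return total_hosts >= 6 and total_hosts % 2 == 0
```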
Even though vSphere High Availability has been around for a while, the approach to hardware failure is unique to VMware Cloud on AWS, where VMware must maintain a Service Level Agreement (SLA). To meet the SLA, VMware closely monitors the health of the system at all times and is immediately alerted whenever a failure occurs. Once a failure has been identified, VMware can auto-remediate the hardware in the SDDC within minutes, at no additional cost.
Another feature that further enhances the availability and resiliency of an SDDC cluster is Elastic DRS. Elastic DRS allows you to scale your cluster in response to demand, or lack of demand, by adding or removing hosts automatically based on the policies you configure. You can refer to a blog by Jeremiah Megie for more information.
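Conceptually, an Elastic DRS policy boils down to high- and low-water marks on resource utilization. The sketch below illustrates just the scale-out side of such a decision; the threshold values are assumptions for illustration, not the actual Elastic DRS policy defaults:

```python
def edrs_scale_out_recommended(cpu_util: float, memory_util: float,
                               storage_util: float,
                               cpu_max: float = 0.90,
                               memory_max: float = 0.80,
                               storage_max: float = 0.75) -> bool:
    """Toy model of an Elastic DRS scale-out check: recommend adding a
    host when any resource crosses its high-water mark. All threshold
    defaults here are assumed values for illustration only."""
    return (cpu_util > cpu_max
            or memory_util > memory_max
            or storage_util > storage_max)
```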
Depending on your Recovery Point Objective (RPO) and Recovery Time Objective (RTO) requirements, there are various Disaster Recovery (DR) strategies you can implement. We will discuss some simple scenarios that are not reference architectures; the actual design and topology for Disaster Recovery will vary based on your requirements.
Example 1: Provision SDDC on-demand and restore from backup
For non-critical workloads where you don’t need to recover quickly, you can choose not to stand up an SDDC until a disaster occurs. In this example, you have implemented a backup solution based on your RPO requirements. When a disaster occurs, you will provision a VMware Cloud on AWS SDDC on demand with the number of hosts you need to run your workloads.
Once the SDDC has been provisioned, you can recover your workloads from an accessible backup repository (for example, an Amazon S3 bucket) into VMware Cloud on AWS using your backup solution. Clearly, this process can be time consuming. Even though it takes less than a couple of hours to provision an SDDC and configure connectivity, and less than 15 minutes to add hosts, you also need to consider the time it takes to restore your workloads from backup. This can vary depending on how much data you have and the type of connectivity (VPN or Direct Connect to the AWS Region).
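As a back-of-the-envelope check on that restore window, you can divide the backup data volume by the effective throughput of the link. The sketch below is a rough estimator only; the efficiency factor, which accounts for protocol and restore overhead, is an assumption, not a VMware figure:

```python
def estimated_restore_hours(data_tb: float, link_gbps: float,
                            efficiency: float = 0.7) -> float:
    """Rough restore-time estimate: convert the data volume to gigabits
    (decimal units: 1 TB = 8,000 Gb) and divide by the effective link
    throughput. `efficiency` models protocol/restore overhead (assumed)."""
    effective_gbps = link_gbps * efficiency
    seconds = (data_tb * 8000) / effective_gbps
    return seconds / 3600
```

Under these assumptions, restoring 10 TB over a 1 Gbps VPN works out to roughly 32 hours, while the same restore over a 10 Gbps Direct Connect link drops to about 3 hours, which is why the connectivity type matters so much for this DR strategy.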
Example 2: SDDC with minimum number of hosts and scale up
In this example, we are protecting an on-premises production environment, with VMware Cloud on AWS as the recovery site.
Faction, one of our Managed Service Providers (MSPs), offers multiple architectures for a Disaster Recovery solution. This example fits their Pilot Light DR architecture. Faction enables you to begin with a 3-node Pilot Light environment, which provides an always-on way to run Tier-0 applications as well as critical utility VMs such as secondary domain controllers. You can then scale up on demand in the event of a disaster. That way, you only pay for 3 nodes until you need to perform a physical DR failover and scale up to more nodes. You can refer to Faction's website for more information on this architecture.
Another option is to manage your own VMware Cloud on AWS SDDC with a minimum 3-node configuration and leverage a supported backup solution to recover your workloads as you scale your SDDC on demand to meet capacity in the event of a disaster.
Example 3: SDDC with all hosts provisioned
In this example, you will stand up an SDDC with all the hosts required to run a production environment and leverage VMware Site Recovery in the event of a disaster.
Figure 4. Disaster Recovery with VMware Site Recovery
Compared to Example 2, this example provides a better RTO because you don’t need to wait for the scale up process and restore to complete. VMware Site Recovery orchestrates the failover and failback process with a single-click initiation. It also provides the ability to non-disruptively test your recovery plan and ensure the predictability of your RTO. In addition, VMware Site Recovery enables you to generate detailed reporting for auditing and accountability.
High Availability + Disaster Recovery
If you are looking for enhanced availability, where you ensure High Availability as well as the ability to recover from a disaster, you can combine the Stretched Cluster and VMware Site Recovery services. Here's an example of a stretched cluster environment in a VMware Cloud on AWS SDDC spanning two Availability Zones, combined with VMware Site Recovery.
Depending on your requirements, you might set up your recovery site using one or more VMware Cloud on AWS SDDCs in a different AWS Region.
Figure 5. Combining Stretched Cluster with Site Recovery Manager