VMware Cloud on AWS is a scalable, managed cloud solution based on VMware’s SDDC stack. It runs on AWS hardware and provides a flexible, self-repairing availability model. However, as with all business systems, it’s important to understand the potential failure scenarios of the platform and identify the corresponding risks to business applications and operations so that they can be appropriately mitigated. Let’s review some of the considerations when developing a successful business continuity plan that includes VMware Cloud on AWS.
Business continuity plan
To ensure that business operations can continue during a major disruption, organizations need a plan that identifies which services the business requires to run, defines the RPO and RTO for each service, and then implements the technical solutions and operational procedures needed to meet those availability and recovery targets. A key component of business continuity is testing the plan to ensure that the systems and procedures work as expected.
Shared responsibility model
VMware Cloud on AWS works on a shared responsibility model, much like other cloud-based solutions. This means that VMware and the customer both assume certain responsibilities for protecting, securing, and recovering various components of the SDDC.
VMware is responsible for the infrastructure and underlying software and configuration of the SDDC. This means that VMware will provide, maintain, recover and secure the infrastructure components – ESXi hosts, vCenter, and any other management components that are part of the SDDC. This is outlined in the Service Description document.
The customer is responsible for all workloads run in the SDDC – this covers providing, maintaining, recovering and securing any VM that they migrate or create.
When it comes to business continuity, one of the most critical components of the customer’s responsibility is protecting the data they place in the SDDC. While VMware does provide the underlying storage as an infrastructure, and this storage is highly available, there are still scenarios that can result in partial or total data loss. It is the customer’s responsibility to ensure they mitigate these scenarios to a level that meets their business requirements. Note that VMware does NOT maintain backup or redundant copies of customer data in the SDDC beyond what is configured in the customer-managed vSAN Storage Policy.
Availability and disaster recovery planning
While the SDDC is designed to be redundant and provide a high level of availability, some scenarios may cause some or all of the services to become unavailable. In these circumstances, VMware will do its best to recover the SDDC and management components as quickly as possible. The contractual time allowed before financial penalties are incurred is defined in the SLA.
Every customer has different business requirements, and in fact may have multiple sets of different requirements for different applications or components. There are options in the SDDC to adjust policies to provide tolerance for these different failure scenarios. Since additional protection comes at the cost of higher resource consumption, these options allow for customizing the cost/availability ratio.
Planning for the right level of availability isn’t always simple. It can be helpful to map out different failure scenarios and look at what options are available for mitigating each one. Mitigations are generally classified by two properties: Recovery Point Objective, or RPO, which defines how much data loss can be tolerated, and Recovery Time Objective, or RTO, which defines how long the system can be unavailable. The scenarios can also be classified by two factors: Likelihood, which is how likely the event is to occur, and Impact, which is how severe a problem the scenario would cause.
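One way to turn the Likelihood and Impact classification into a prioritized list is a simple risk matrix. The sketch below is illustrative only: the numeric weights and the `Scenario` and `rank_by_risk` names are assumptions, not part of any VMware tooling, and the scenario ratings mirror the sample table in this document.

```python
from dataclasses import dataclass

# Hypothetical ordinal weights for each rating; adjust to your own risk model.
LIKELIHOOD = {"Very Low": 1, "Low": 2, "Likely": 3}
IMPACT = {"Low": 1, "Moderate": 2, "High": 3, "Very High": 4}


@dataclass(frozen=True)
class Scenario:
    name: str
    likelihood: str
    impact: str

    @property
    def risk_score(self) -> int:
        # Classic risk-matrix product: likelihood weight times impact weight.
        return LIKELIHOOD[self.likelihood] * IMPACT[self.impact]


def rank_by_risk(scenarios):
    """Return scenarios ordered from highest to lowest risk score."""
    return sorted(scenarios, key=lambda s: s.risk_score, reverse=True)


scenarios = [
    Scenario("Application data corruption", "Low", "Very High"),
    Scenario("Single node failure", "Likely", "Low"),
    Scenario("Multiple node failure", "Low", "Moderate"),
    Scenario("Total site failure", "Very Low", "High"),
]

for s in rank_by_risk(scenarios):
    print(f"{s.risk_score:>2}  {s.name}")
```

A ranking like this only suggests where to look first. A low-scoring scenario such as total site failure can still justify mitigation if the business cannot absorb its impact.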
Sample scenario planning and mitigation
|Failure Scenario||Likelihood / Impact||Mitigation(s) / Recovery||RPO / RTO|
|Application data corruption||Low / Very High||Point-in-time backups / Data restore & application roll-back||Application-dependent|
|Single storage device failure||Likely / Low||vSAN FTT > 0 maintains multiple copies of data.||0 / 0|
|Single node failure||Likely / Low||vSAN FTT > 0 maintains multiple copies of data.||0 / 0|
|Multiple node failure||Low / Moderate||vSAN FTT ≥ number of failed nodes tolerates the failures without impact.||0 / 0|
|Multiple node failure||Low / Moderate||VM Backup / Recovery||As per backup frequency, typically 24 hours / Typically within a few hours, depending on VM size|
|Multiple node failure||Low / Moderate||VM Replication / SRMaaS||As low as 5 minutes / Within a few minutes|
|Total site failure||Very Low / High||VM Backup / Recovery||As per backup frequency, typically 24 hours / Typically within a few hours, depending on VM size|
|Total site failure||Very Low / High||VM Replication / SRMaaS||As low as 5 minutes / Within a few minutes|
These are examples of some of the most common failure scenarios, and mitigations used for each. However, there are other scenarios that can result in a failure, as well as solutions to mitigate those scenarios, and they are dependent upon individual business needs. When planning requirements, it’s usually best to start with a Business Impact Analysis, or BIA, to determine the internal requirements for each service or application. This will allow for the appropriate protection levels and mitigations to be put in place.
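Once a BIA has produced RPO and RTO targets for a service, the candidate mitigations can be filtered against them. The following is a minimal sketch, assuming rough figures derived from the sample table: the 4-hour and 15-minute RTO values are placeholders for "a few hours" and "a few minutes", and `Mitigation` and `candidates` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Mitigation:
    name: str
    rpo: timedelta  # worst-case data loss the mitigation allows
    rto: timedelta  # worst-case time to restore service


# Assumed, illustrative figures; measure real values in your own environment.
MITIGATIONS = [
    Mitigation("VM Backup / Recovery", rpo=timedelta(hours=24), rto=timedelta(hours=4)),
    Mitigation("VM Replication / SRMaaS", rpo=timedelta(minutes=5), rto=timedelta(minutes=15)),
]


def candidates(required_rpo, required_rto, options=MITIGATIONS):
    """Return the mitigations whose RPO and RTO both meet the requirement."""
    return [m for m in options if m.rpo <= required_rpo and m.rto <= required_rto]


# A service that tolerates at most 1 hour of data loss and 1 hour of downtime.
for m in candidates(timedelta(hours=1), timedelta(hours=1)):
    print(m.name)
```

With these assumed figures, a one-hour RPO and RTO rules out daily backups as the only mitigation, while a service that tolerates a day of data loss could rely on backups alone.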
Note that Backup & Restore – where backups are stored off-site – is often considered a bare minimum. Although it has a higher RPO and RTO than many other solutions, its primary advantages are its reliability and the wide range of scenarios it can protect against. Once a backup is taken and shipped off-site (or replicated to different sites with cloud-based backups), the data can be restored back to the exact state it was in at the time of the backup. In addition, it’s possible to keep multiple backups, allowing for restoration of data back to different points in time, which is important in the case of data corruption or accidental deletion that may not be noticed immediately.
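To illustrate why keeping multiple backups matters, the sketch below picks the most recent backup taken before corruption occurred, which may be well before the corruption was noticed. The function name and the daily schedule are assumptions made for the example.

```python
from datetime import datetime
from typing import Optional, Sequence


def latest_clean_backup(
    backup_times: Sequence[datetime], corrupted_at: datetime
) -> Optional[datetime]:
    """Return the newest backup taken strictly before the corruption time.

    Returns None when every retained backup is at or after the corruption,
    which means retention was too short to recover clean data.
    """
    clean = [t for t in backup_times if t < corrupted_at]
    return max(clean, default=None)


# Assumed schedule: one backup per day at 01:00, seven days retained.
backups = [datetime(2024, 3, day, 1, 0) for day in range(1, 8)]
corrupted_at = datetime(2024, 3, 5, 14, 30)

restore_point = latest_clean_backup(backups, corrupted_at)
if restore_point is not None:
    print("Restore from:", restore_point)
    print("Data written since then is lost:", corrupted_at - restore_point)
```

The retention period therefore needs to be longer than the time it could plausibly take to notice corruption or an accidental deletion.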
Storage availability options
Some of the mitigations described above address the environment’s resiliency to infrastructure failures, which are among the most common failure scenarios. The SDDC provides the option to set the number of failures to tolerate (FTT) in the storage policy. Storage policies can be applied to a group of VMs that share the same requirements. As the FTT level increases, the resiliency goes up, but the available capacity decreases. While it’s possible to change these values to suit your business requirements, to support the SLA in the SDDC, VMware requires the FTT to be set to 1 for clusters of 4 or 5 hosts, and 2 for clusters of 6 or more hosts. Note that the storage policy is customer-managed, and therefore the customer is responsible for ensuring the policy is adjusted whenever the number of hosts changes.
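The FTT rule and the capacity trade-off can be sketched in a few lines. This is a simplified model, not a sizing tool: it assumes mirroring, where FTT=n keeps n+1 full copies of the data, and it ignores slack space, deduplication, and erasure coding. The function names are hypothetical.

```python
def required_ftt(host_count: int) -> int:
    """FTT needed to support the SLA, per the rule stated in this article."""
    if host_count >= 6:
        return 2
    if host_count in (4, 5):
        return 1
    # The article gives no rule for other cluster sizes, so refuse to guess.
    raise ValueError(f"no FTT rule stated for {host_count} hosts")


def mirrored_usable_tib(raw_tib: float, ftt: int) -> float:
    """Approximate usable capacity when every object is stored ftt + 1 times."""
    if ftt < 0:
        raise ValueError("ftt must be non-negative")
    return raw_tib / (ftt + 1)


# Growing a cluster from 5 to 6 hosts changes the required FTT from 1 to 2.
for hosts in (4, 5, 6, 8):
    ftt = required_ftt(hosts)
    raw = hosts * 10.0  # assumed 10 TiB raw per host, for illustration only
    print(f"{hosts} hosts: FTT={ftt}, about {mirrored_usable_tib(raw, ftt):.1f} TiB usable")
```

Because the policy is customer-managed, a check like `required_ftt` is a natural thing to run whenever hosts are added to or removed from a cluster.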
General best practices
In summary, to ensure basic recoverability and protection of workloads in the SDDC, the following steps should be observed:
- Ensure a supported backup solution is installed on the SDDC, and all VMs are being backed up at least once per day.
- Ensure the vSAN storage policy is configured appropriately for the number of hosts in the cluster (FTT=1 for 4 or 5 hosts, and FTT=2 for 6 or more hosts).
- Document a recovery plan and regularly perform a test recovery of the environment to ensure the procedure in place will work in the event it is required, and that staff are familiar with the processes to avoid extending the recovery time.
- Monitor the backup and recovery processes that are in place to ensure a failure to protect the data is caught and addressed immediately.
- If the time it takes to recover the environment based on testing does not meet business requirements, consider additional availability solutions to supplement backups:
  - VMware Cloud on AWS Site Recovery to replicate the entire site to a remote AWS AZ or Region.
  - (coming soon) VMware Cloud on AWS Multi-AZ SDDC with stretched vSAN clusters to extend your SDDC to multiple AZs within the same AWS region.
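The monitoring and testing steps above lend themselves to simple automated checks. The sketch below assumes you can export the time of each VM's most recent successful backup from your backup product. The data shape and function names are hypothetical and do not correspond to any specific vendor API.

```python
from datetime import datetime, timedelta

MAX_BACKUP_AGE = timedelta(hours=24)  # "at least once per day"


def stale_backups(last_backup, now, max_age=MAX_BACKUP_AGE):
    """Return names of VMs whose newest backup is missing or older than max_age.

    last_backup maps a VM name to the datetime of its last successful backup,
    or to None if the VM has never been backed up.
    """
    return sorted(
        vm for vm, taken in last_backup.items()
        if taken is None or now - taken > max_age
    )


def recovery_test_passes(measured_recovery, required_rto):
    """True if the recovery time measured in a test meets the RTO target."""
    return measured_recovery <= required_rto


now = datetime(2024, 1, 2, 12, 0)
last_backup = {
    "app01": datetime(2024, 1, 2, 1, 0),  # 11 hours old
    "db01": datetime(2024, 1, 1, 0, 0),   # 36 hours old
    "new-vm": None,                       # never backed up
}

print("Needs attention:", stale_backups(last_backup, now))
print("RTO met:", recovery_test_passes(timedelta(hours=5), timedelta(hours=4)))
```

A failed recovery-time check is the signal to consider the supplementary availability options listed above rather than relying on backups alone.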
Business continuity planning is a critical component of IT operations, whether workloads are running in the cloud or on-premises. VMware Cloud on AWS supports a number of configurations to help protect your workloads and applications, and we’ve provided a few recommendations in this document to help ensure you are prepared for the unexpected.