Talks Tech #36: Build Resilient Environment Through Cloud Disaster Recovery

Written by Ramya Victor, Shwetha Lakshman Rao, Ranjani Swaminathan


Women Who Code Talks Tech 36

Ramya Victor, Staff Engineer at VMware; Shwetha Lakshman Rao, R&D Manager at VMware; and Ranjani Swaminathan, Manager R&D at VMware Software India Private Limited, sit down for a conversation, “Build Resilient Environment Through Cloud Disaster Recovery.” They discuss the importance of a disaster recovery plan and how to pick the right one based on a company’s needs, budget, and flexibility.

Shwetha: What comes to your mind when I say disaster? From a technology perspective, disasters can come in any form: equipment failures, power outages, cyber-attacks, and so on. A disaster is a nightmare for any IT organization, as the average cost of downtime is estimated to be nearly $5,600 per minute. It’s essential to have a disaster recovery plan in place. Disaster recovery can be considered a subset of business continuity. Consider a scenario where all your IT infrastructure, including the production and backup servers, is built on your local premises. The data and business applications are therefore hosted on local servers. What will happen if the entire area is hit by an earthquake or a ransomware attack? What is your plan to protect critical business data then?
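To make the per-minute figure concrete, here is a minimal arithmetic sketch. The $5,600 rate is the estimate quoted above; the function name and the outage durations are illustrative assumptions, and real costs vary widely from one business to another.

```python
COST_PER_MINUTE_USD = 5600  # estimate quoted in the talk, not a measured value


def downtime_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Return the estimated cost in USD of an outage lasting `minutes`."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


# Hypothetical outage lengths, chosen only for illustration.
for label, minutes in [("15-minute outage", 15), ("1-hour outage", 60), ("8-hour outage", 480)]:
    print(f"{label}: ${downtime_cost(minutes):,.0f}")
# 15-minute outage: $84,000
# 1-hour outage: $336,000
# 8-hour outage: $2,688,000
```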

Disaster recovery keeps the business running even when a disaster hits your local premises. The workload from local IT is replicated, not just backed up, to an offsite server location, which acts as a secondary IT infrastructure for your applications. The entire business process, data, and applications are mirrored to the redundant server in real time. Switching operations over to that offsite location is called failover, and the location is called the failover site. You can keep your work going amidst the disaster from the failover site with minimal downtime, although access may be a little slower. When the disaster’s aftereffects have been dealt with and your local IT returns to a normal working state, access can be restored through a failback process.
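The failover and failback transitions described above can be sketched as a tiny state machine. This is a simplified illustration, not how any particular product implements it: the `DrController` class and its method names are hypothetical, and real systems also have to handle replication lag, health-check flapping, and data resynchronization before failback.

```python
from enum import Enum


class Site(Enum):
    PRIMARY = "primary"
    FAILOVER = "failover"


class DrController:
    """Tracks which site is serving traffic (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.active = Site.PRIMARY
        self.primary_healthy = True

    def report_primary_health(self, healthy: bool) -> None:
        self.primary_healthy = healthy

    def failover(self) -> None:
        # Only leave the primary site when it is actually down.
        if self.active is Site.PRIMARY and not self.primary_healthy:
            self.active = Site.FAILOVER

    def failback(self) -> None:
        # Only return once local IT is back in a normal working state.
        if self.active is Site.FAILOVER and self.primary_healthy:
            self.active = Site.PRIMARY


controller = DrController()
controller.report_primary_health(False)  # disaster hits the local premises
controller.failover()
print(controller.active)                 # Site.FAILOVER
controller.report_primary_health(True)   # local IT has recovered
controller.failback()
print(controller.active)                 # Site.PRIMARY
```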

Disaster recovery as a service (DRaaS) is a cloud computing model that allows an organization to back up its data and IT infrastructure in a third-party cloud computing environment, with the provider supplying all the DR orchestration. It is also a SaaS solution for regaining access to, and the functionality of, the IT infrastructure after a disaster. With DRaaS, an organization’s physical servers or virtual machines are replicated to the cloud. The service provider hosts the infrastructure using a public or private cloud. Depending on the specific type of DRaaS chosen by the customer, the provider will also be able to manage the failover process, transitioning users from the primary environment to its hosted service environment, and then migrate the workload back to the customer environment once local IT has come back.

With on-prem DR, onsite servers give you more control over your servers. Company data stays private because no third party is involved, and the data is accessible without the internet. The disadvantages of on-prem DR are increased capital investment and limited scalability. You need to have all your infrastructure in place, and as the organization grows, you have to keep adding hardware to keep up. You need space to build and house your hardware, and you take on additional maintenance, management, and IT support costs. Data loss is more likely during a disaster if everything sits in one location, and running two entirely separate locations raises the cost again. With the cloud, there is no onsite hardware or building cost, and it scales with the growth of the business: you pay only for what you use. It’s easy to connect to the cloud from anywhere on any device, and backing up data to the cloud can happen as often as every 15 minutes.

Ranjani: The first step in disaster recovery planning is understanding the infrastructure, applications, and software constructs involved. What are the compute, storage, and network requirements of the workload that you are planning disaster recovery for? The next step is to conduct a business impact analysis. When a disaster hits your workload, what is the impact on your business? That should be made clear by the team requesting the disaster recovery service. The next step is to create a disaster recovery plan based on the recovery time objective and the state the workload should be in after recovery. After doing all these things, we approach the cloud or disaster recovery providers.
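The inventory and business impact analysis steps above can be captured in a small data structure. The sketch below is one possible shape, not a standard format: the `Workload` fields, the sample workloads, and the rule of recovering the highest-impact workload first are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    name: str
    vcpus: int                   # compute requirement
    storage_gb: int              # storage requirement
    cost_per_minute_usd: float   # business impact of downtime
    rto_minutes: float           # recovery time objective


def recovery_order(workloads: List[Workload]) -> List[Workload]:
    """Order workloads by business impact, then by tighter RTO (assumed policy)."""
    return sorted(workloads, key=lambda w: (-w.cost_per_minute_usd, w.rto_minutes))


# Hypothetical inventory, for illustration only.
inventory = [
    Workload("internal-wiki", vcpus=2, storage_gb=50, cost_per_minute_usd=10, rto_minutes=480),
    Workload("billing-db", vcpus=16, storage_gb=2000, cost_per_minute_usd=5600, rto_minutes=30),
    Workload("web-frontend", vcpus=8, storage_gb=100, cost_per_minute_usd=5600, rto_minutes=15),
]

for w in recovery_order(inventory):
    print(w.name, w.rto_minutes)
# web-frontend 15
# billing-db 30
# internal-wiki 480
```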

After reaching out to the disaster recovery provider, build the disaster recovery infrastructure and document the disaster recovery plan. Then we must simulate a disaster and test whether the plan works properly. Several factors influence which type of disaster recovery you end up with: the budget, the time taken for recovery, the resources or expertise available within the organization or at the provider site, and the testing and validation effort. Three broad types of disaster recovery are currently available. The first is managed disaster recovery, the second is assisted disaster recovery, and the third is self-service disaster recovery. In managed disaster recovery, once you have reached out to the provider, the provider takes all the other steps. The provider does everything. This is an expensive form of disaster recovery, but it gives guaranteed results. It doesn’t depend on the platform or infrastructure on which you have hosted your applications.

The second type of disaster recovery is assisted disaster recovery. As for when you would use it, think of a highly customized application with unique needs that may not be feasible to explain to the service provider. In such cases, we may need expertise within the customer organization, while some tasks can be offloaded to the provider. The provider may offer consultation on optimizing the disaster recovery procedures. This is a bit less expensive than managed disaster recovery. The customer has some control over what testing is performed and how integration and validation are carried out, and the provider can supply additional resources during testing.

The next type is self-service disaster recovery, where the expertise exists within the organization and disaster recovery is performed from within the organization. There is in-house expertise and program management capability to plan the disaster recovery and to deploy and execute the entire recovery. Self-service disaster recovery is the least expensive. The results depend not on the provider but on the internal expertise.
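The three types described above can be summarized in a small lookup table. The sketch below encodes only what the talk states (relative cost and who holds the expertise); the `DrModel` fields and the single-question filter are simplifying assumptions, and a real choice would also weigh budget, recovery time, and testing effort.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DrModel:
    name: str
    relative_cost: int              # 1 = least expensive, 3 = most expensive
    needs_in_house_expertise: bool  # does the customer need its own DR expertise?
    provider_involved: bool         # does the provider perform part of the work?


MODELS = [
    DrModel("managed", 3, needs_in_house_expertise=False, provider_involved=True),
    DrModel("assisted", 2, needs_in_house_expertise=True, provider_involved=True),
    DrModel("self-service", 1, needs_in_house_expertise=True, provider_involved=False),
]


def candidate_models(has_in_house_expertise: bool) -> List[DrModel]:
    """Return the feasible models, cheapest first (illustrative rule only)."""
    feasible = [
        m for m in MODELS
        if has_in_house_expertise or not m.needs_in_house_expertise
    ]
    return sorted(feasible, key=lambda m: m.relative_cost)


print([m.name for m in candidate_models(has_in_house_expertise=False)])
# ['managed']
print([m.name for m in candidate_models(has_in_house_expertise=True)])
# ['self-service', 'assisted', 'managed']
```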

Before selecting a disaster recovery provider, there are other considerations and questions that should be raised. The range of disaster recovery services varies. It could be simple data recovery for administrators, with selective recovery from within the service provider’s data center. It may be virtual machine mounting, where physical-to-VM conversion happens within the service provider’s data center for temporary application availability. Or it may be a full recovery performed at the provider’s data center, including services such as VM mounting, VPN rerouting, domain name resolution updates, virtual desktop infrastructure access for the customers of the organization seeking disaster recovery, and some physical access for administrators too.

Do not assume that a service provider offering DRaaS will cover all disasters and deliver all possible services. It depends on the requirements of the organization. Does it require zero downtime and zero loss of data, or is a defined period of downtime and some loss of data acceptable? How does this requirement vary by application, workload, and so on? What type of disaster are we protecting against? Is it only hardware failures? Does it cover human errors, malware, and natural disasters? What is the budget? How many concurrent failures can the provider support, and what happens if the provider cannot deliver the disaster recovery service during a disaster? The next consideration is what access will be available to the customers using these workloads or applications. How will the customers and the internal administrators have access? Will the VPNs be managed or rerouted? Will there be access for virtual desktop users, and how will access be provided for outward-facing applications? Will the DNS be updated, and how will the necessary access be provided to the administrators? These are some of the key considerations for end-user access.
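The downtime and data-loss questions above can be turned into a simple per-workload check. In the sketch below, the `Requirement` and `Offer` types and the provider names are hypothetical, and it assumes that the worst-case data loss equals the replication interval, which ignores replication lag and failed replication jobs.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    workload: str
    max_downtime_minutes: float    # acceptable downtime (0 = zero downtime)
    max_data_loss_minutes: float   # acceptable window of lost data (0 = zero loss)


@dataclass
class Offer:
    provider: str
    recovery_minutes: float              # time needed to bring the workload up
    replication_interval_minutes: float  # 0 means continuous replication


def meets(req: Requirement, offer: Offer) -> bool:
    """True if the offer satisfies both the downtime and the data-loss limits."""
    return (
        offer.recovery_minutes <= req.max_downtime_minutes
        and offer.replication_interval_minutes <= req.max_data_loss_minutes
    )


# Hypothetical numbers, for illustration only.
payments = Requirement("payments", max_downtime_minutes=10, max_data_loss_minutes=0)
reporting = Requirement("reporting", max_downtime_minutes=240, max_data_loss_minutes=60)
snapshot_offer = Offer("provider-a", recovery_minutes=30, replication_interval_minutes=15)

print(meets(payments, snapshot_offer))   # False: 15-minute snapshots can lose data
print(meets(reporting, snapshot_offer))  # True
```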

Ramya: We have seen a lot of enterprises start evaluating disaster recovery options on the public cloud, such as AWS, Microsoft Azure, and Google Cloud Platform. The sudden adoption of public cloud resources is due to the migration of employees to remote work. The most significant benefit of using DRaaS in the public cloud is the public cloud’s infrastructure as a service. It cuts the cost of standby hardware by replacing it with on-demand cloud resources. Amazon Web Services offers cloud-based disaster recovery and workload migration through CloudEndure DRaaS. CloudEndure was acquired some time back, and its integration with AWS provides the DRaaS services in AWS. CloudEndure DRaaS can replicate workloads from on-premises or any other infrastructure-as-a-service environment to AWS. It can even replicate between AWS regions.

There are a couple of advantages to consider. One of them is continuous data replication. It keeps cost low because it uses only enough AWS compute and storage resources to support the replication itself, with everything else provisioned on demand. The second important point is automated machine conversion from the source format to supported AWS instances and images. The conversion is completely automatic. It supports popular enterprise software, operating systems, and cloud environments, including Azure, GCP, IBM Cloud, Oracle, and VMware.

Azure Site Recovery offers the DRaaS service on Azure. Microsoft has provided this offering through an acquisition it made back in 2014. The service protects physical and virtual Windows or Linux workloads outside of their primary data center. Unlike a traditional on-premises backup service, the emphasis here is on protection for these workloads. During the Azure Site Recovery setup process, you can choose Azure or another data center as the replication target. You’re given that flexibility. Your apps can continue operating in the Azure cloud with minimal downtime. Azure Site Recovery also supports cloud failover of both VMware and Hyper-V virtual infrastructure. Additional key factors are that it creates VM instances on demand during recovery incidents. Nothing has to be pre-provisioned the way it would be in a hardware-based, on-prem DR environment. It also provides an environment for non-disruptive DR testing. Targets can be customized. When you have a recovery environment set up, you can customize the recovery targets, including the recovery time objectives. Along with that, the customer can customize the recovery plans in the form of runbooks. These runbooks can be driven through Azure Automation or PowerShell scripts: you put a set of commands into the runbook using either one. This makes complex scenarios straightforward to carry out in an automated way.
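The idea of a recovery plan expressed as a runbook can be sketched generically as an ordered list of steps that stops at the first failure. This is not the Azure Automation or Azure Site Recovery API, and real Azure runbooks are typically PowerShell; the `run_plan` helper and the step names below are hypothetical stand-ins.

```python
from typing import Callable, List, Tuple

# A step is a label plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_plan(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure; return a log."""
    log = []
    for name, action in steps:
        ok = action()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            break  # later steps depend on earlier ones, so do not continue
    return log


# Placeholder actions; a real runbook would call provider tooling here.
plan = [
    ("start replicated VMs", lambda: True),
    ("reroute VPN", lambda: True),
    ("update DNS", lambda: False),   # simulated failure
    ("notify administrators", lambda: True),
]

for line in run_plan(plan):
    print(line)
# start replicated VMs: ok
# reroute VPN: ok
# update DNS: FAILED
```

Stopping at the first failed step is a design choice for this sketch; a production runbook might instead retry, roll back, or continue with independent steps.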

One of the key points to note about GCP is that Google Cloud does not offer a packaged DRaaS like Azure or AWS. However, it provides documentation on cloud-based disaster recovery planning and on how to use existing GCP services as a DRaaS platform. It also offers many products that can be used as building blocks when creating a secure and reliable DR plan. This is more of a do-it-yourself DR plan. Another important point about GCP is its strong partnerships with companies like Veeam, Active Full, and Xenon. These companies offer disaster recovery capabilities on GCP. Why would you go for the GCP kind of DRaaS environment? GCP services are beneficial if you want a do-it-yourself DRaaS process. When you do that, you get Cloud Monitoring and the Cloud Status Dashboard, which help you with application monitoring, metrics evaluation, and event assessment. Another is Cloud Deployment Manager, which automatically creates the GCP environment from the predefined templates referenced in your runbooks. The third is third-party infrastructure templating: you can use third-party infrastructure templating and configuration management software that supports GCP.

VMware provides DRaaS through the acquisition of Datrium. It provides an easy-to-use cloud-based solution that combines efficient cloud storage with simple SaaS-based management for efficient IT resilience at scale. The benefits of VMware DRaaS are consistency and familiarity with VMware products across production and DRaaS sites. VMware has been a pioneer in virtualization platforms for quite some time, so people are very familiar with those products. There is a pay-as-you-need failover capability model. There are instant power-on capabilities for faster recovery.