Any business can be subject to disruption. Business continuity depends on the efficient, uninterrupted flow of data across the organization, and even a brief interruption in that flow can cost thousands of opportunities. The cause may be human error or mechanical failure. To protect the business from such disasters and disturbances, we need a proactive disaster recovery strategy.
AWS offers a wide range of services, including database storage, compute power, and content delivery. AWS supports Disaster Recovery architectures for workloads from small to large and enables rapid failover at scale.
Before we dig into the details of disaster recovery in AWS, there are a few things we need to learn.
Understanding RPO and RTO
Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are two of the most important parameters of a disaster recovery or data protection plan.
RPO: Recovery Point Objective
Recovery Point Objective (RPO) describes the interval of time that might pass during a disruption before the quantity of data lost during that period exceeds the Business Continuity Plan's maximum allowable threshold or “tolerance.” In other words, it is the longest span of time for which data loss is tolerable in an outage. For example, with an RPO of one hour, backups or replication must run at least once an hour.
RTO: Recovery Time Objective
Recovery Time Objective (RTO) is the duration of time and the service level within which a business process must be restored after a disaster in order to avoid unacceptable consequences associated with a break in continuity. In other words, the RTO is the answer to the question: “How much time can it take to recover after notification of business process disruption?”
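To make the two definitions concrete, here is a minimal sketch that checks a single incident against both targets. The function names and the sample timestamps are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup: datetime, failure: datetime, rpo: timedelta) -> bool:
    """Data written after the last backup is lost, so that window must fit in the RPO."""
    return failure - last_backup <= rpo


def meets_rto(failure: datetime, restored: datetime, rto: timedelta) -> bool:
    """The time from failure to restored service must fit in the RTO."""
    return restored - failure <= rto


# Hypothetical incident: last backup at 09:00, failure at 12:00, service back at 13:30.
last_backup = datetime(2024, 1, 1, 9, 0)
failure = datetime(2024, 1, 1, 12, 0)
restored = datetime(2024, 1, 1, 13, 30)

print(meets_rpo(last_backup, failure, timedelta(hours=4)))  # True: 3 hours of data lost
print(meets_rto(failure, restored, timedelta(hours=1)))     # False: recovery took 1.5 hours
```

In this example the backup schedule satisfies a four-hour RPO, but the recovery process misses a one-hour RTO, so only the recovery side of the plan needs a faster approach.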
Options for Disaster Recovery in AWS
- Backup and Restore
- Cheapest setup, with high RTO and RPO.
- In this approach, we must periodically back up data from the system to AWS.
- Amazon S3 is an ideal destination for backing up data that might be needed quickly to perform a restore.
- We can use AWS Snowball to transfer very large data sets by shipping storage devices directly to AWS.
- For longer-term data storage where retrieval times of several hours are adequate, we can leverage Amazon Glacier.
- Also, Amazon S3 and Amazon Glacier can be used in conjunction to produce a tiered backup solution.
- Key Steps for backup and restore:
- Select an appropriate tool or method to back up data into AWS.
- Ensure an appropriate data retention policy and security measures, including encryption and access policies.
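The tiered backup idea above (Amazon S3 for recent backups, Amazon Glacier for older ones) can be sketched as an S3 lifecycle configuration. This is a minimal sketch, not a definitive setup: the function name, the rule ID, the `backups/` prefix, and the day counts are all assumptions. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument, but nothing here calls AWS.

```python
import json


def tiered_backup_lifecycle(prefix: str, days_to_glacier: int, days_to_expire: int) -> dict:
    """Build a lifecycle configuration that moves backups to Glacier, then expires them."""
    if days_to_glacier >= days_to_expire:
        raise ValueError("backups must move to Glacier before they expire")
    return {
        "Rules": [
            {
                "ID": "tiered-backup",            # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},     # only objects under this prefix
                "Transitions": [
                    {"Days": days_to_glacier, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": days_to_expire},  # retention policy
            }
        ]
    }


# Hypothetical policy: keep 30 days in S3, then Glacier, delete after one year.
config = tiered_backup_lifecycle("backups/", days_to_glacier=30, days_to_expire=365)
print(json.dumps(config, indent=2))
```

Keeping the retention period in the same rule as the transition puts the data retention policy from the key steps in one place that can be reviewed.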
- Pilot Light
- Slightly more expensive than the Backup and Restore approach, but RTO and RPO are shorter.
- A minimal version of an environment is always running in the cloud.
- In AWS, we maintain a pilot light by configuring and running the most critical core elements of the system. When the time comes for recovery, we can rapidly provision a full-scale production environment around that critical core.
- To provision the remaining infrastructure and restore business-critical servers, we can typically keep some pre-configured servers bundled as Amazon Machine Images (AMIs).
- Key Steps for pilot light preparation:
- Set up Amazon EC2 instances to replicate or mirror data.
- Ensure all supporting custom software packages are available in AWS.
- Create and maintain AMIs of key servers where fast recovery is required.
- Regularly run these servers, test them, and apply any software updates and configuration changes.
- Consider automating the provisioning of AWS resources with AWS CloudFormation templates.
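As an illustration of the last step, the sketch below generates a very small CloudFormation template that launches one server from a pre-built AMI. It is only a starting point under stated assumptions: the function name, the logical names `RecoveryAmiId` and `AppServer`, and the default instance type are hypothetical, and a real recovery template would also describe networking, security groups, and databases.

```python
import json


def recovery_template(instance_type: str = "t3.medium") -> dict:
    """Build a minimal CloudFormation template for one server restored from an AMI."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Pilot light recovery: launch a key server from its AMI",
        "Parameters": {
            # The AMI ID is passed in at recovery time so the template stays reusable.
            "RecoveryAmiId": {
                "Type": "AWS::EC2::Image::Id",
                "Description": "AMI of the key server to restore",
            }
        },
        "Resources": {
            "AppServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": {"Ref": "RecoveryAmiId"},
                    "InstanceType": instance_type,
                },
            }
        },
    }


# The JSON output can be saved to a file and used as a CloudFormation template.
print(json.dumps(recovery_template(), indent=2))
```

Passing the AMI ID as a parameter means the same template can be reused each time the AMIs of the key servers are refreshed.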
- Warm Standby
- Costs more than the Pilot Light approach but further reduces RTO and RPO.
- A scaled-down version of a fully functional environment is always running in the AWS Cloud.
- A warm standby solution extends the pilot light elements and preparation.
- Servers can run on a minimum-sized fleet of Amazon EC2 instances of the smallest possible sizes. The solution is not scaled to take a full production load, but it is fully functional.
- The setup can also be used for non-production work such as testing, QA, and internal use.
- In a disaster, the system is scaled up quickly to handle the production load.
- Key steps for warm standby preparation:
- Set up Amazon EC2 instances to replicate or mirror data.
- Create and maintain AMIs.
- Run the application using a minimal footprint of Amazon EC2 instances or AWS infrastructure.
- Patch and update software and configuration files in line with the primary environment.
- Key steps for warm standby recovery:
- Increase the size of the EC2 fleets in service with a load balancer (Horizontal Scaling).
- Change to larger instance types to handle the production load as needed (Vertical Scaling).
- Change DNS records manually or use Amazon Route 53 automated health checks so that all traffic is routed to the AWS environment in any disaster scenario.
- We can consider using Auto Scaling to right-size the fleet or accommodate the increased load.
- Add resilience or scale up databases.
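The horizontal scaling step in the warm standby recovery can be sketched as a small capacity calculation. This is a hedged illustration: the function name, the requests-per-second figures, the 20% headroom, and the choice of doubling for `MaxSize` are assumptions. The keys of the returned dictionary match the `MinSize`, `DesiredCapacity`, and `MaxSize` parameters that an Auto Scaling group accepts, but the code does not call AWS.

```python
import math


def scale_up_plan(standby_instances: int, peak_rps: float,
                  rps_per_instance: float, headroom: float = 0.2) -> dict:
    """Estimate the fleet size needed to take over the full production load."""
    needed = math.ceil(peak_rps * (1 + headroom) / rps_per_instance)
    desired = max(standby_instances, needed)  # never shrink below the standby fleet
    return {
        "MinSize": desired,
        "DesiredCapacity": desired,
        "MaxSize": desired * 2,  # assumed ceiling that leaves room for further scale-out
    }


# Hypothetical numbers: 2 standby instances, 900 requests/s at peak, 100 requests/s per instance.
print(scale_up_plan(standby_instances=2, peak_rps=900, rps_per_instance=100))
# {'MinSize': 11, 'DesiredCapacity': 11, 'MaxSize': 22}
```

Working out these numbers ahead of time, rather than during the incident, is one way to keep the scale-up step from dominating the RTO.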
- Multi-Site
- This is the active-active setup. The RTO and RPO are nearly zero, but it is the most expensive approach to Disaster Recovery.
- Both the primary and the secondary environment run a full production setup in an active-active configuration.
- We can use Route 53 to route traffic to both sites either symmetrically or asymmetrically.
- In an on-site disaster situation, we can adjust the DNS weighting and send all traffic to the secondary AWS environment.
- We might need additional application logic to detect the failure of primary database services and cut over to the parallel database running in AWS.
- Key steps for multi-site DR preparation:
- Set up the AWS environment to duplicate the on-site production environment.
- Set up DNS weighting, or similar traffic routing technology, to distribute incoming requests to both sites. Configure automated failover to re-route traffic away from the affected site.
- Key steps for multi-site recovery:
- Change the DNS weighting either manually or by using the Amazon Route 53 failover routing policy so that all requests are sent to the secondary site in AWS.
- Consider having application logic for failover to use the AWS database servers for all queries.
- Consider using Auto Scaling to automatically right-size the server fleet.
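The DNS weighting used in the multi-site steps above can be sketched as a Route 53 change batch with one weighted record per site. This is a minimal sketch under assumptions: the function name, the record name, the `SetIdentifier` values, and the IP addresses (taken from documentation-only ranges) are all hypothetical. The dictionary follows the shape that boto3's `change_resource_record_sets` accepts as its `ChangeBatch` argument, but nothing here calls AWS.

```python
def weighted_change_batch(record_name: str, primary_ip: str, secondary_ip: str,
                          primary_weight: int, secondary_weight: int,
                          ttl: int = 60) -> dict:
    """Build a change batch that splits traffic between two sites by weight."""
    for weight in (primary_weight, secondary_weight):
        if not 0 <= weight <= 255:
            raise ValueError("Route 53 weights must be between 0 and 255")

    def record(set_id: str, ip: str, weight: int) -> dict:
        return {
            "Action": "UPSERT",  # create the record or overwrite the existing one
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                "SetIdentifier": set_id,  # distinguishes records that share a name
                "Weight": weight,
                "TTL": ttl,               # short TTL so weight changes take effect quickly
                "ResourceRecords": [{"Value": ip}],
            },
        }

    return {
        "Comment": "Multi-site DR traffic weighting",
        "Changes": [
            record("primary-onsite", primary_ip, primary_weight),
            record("secondary-aws", secondary_ip, secondary_weight),
        ],
    }


# Normal operation: symmetric routing to both sites.
normal = weighted_change_batch("app.example.com.", "192.0.2.10", "198.51.100.20", 50, 50)
# On-site disaster: send all traffic to the secondary site in AWS.
failover = weighted_change_batch("app.example.com.", "192.0.2.10", "198.51.100.20", 0, 100)
print([c["ResourceRecordSet"]["Weight"] for c in failover["Changes"]])  # [0, 100]
```

Asymmetric routing in normal operation is the same call with unequal weights, for example 80 and 20.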
CloudEndure is an AWS Disaster Recovery service that makes it quick and easy to shift a disaster recovery strategy to the AWS cloud from existing physical or virtual data centers, private clouds, or other public clouds. It supports automated cloud orchestration and machine conversion along with continuous data replication, automated failback, and no disk size limitations. Failback had been one of the major concerns for the Nepalese client regarding their architecture, and the introduction of CloudEndure by AWS has now resolved it.
Recently, we had a very good experience with a Disaster Recovery migration for Sipradi Trading Pvt. Ltd. They migrated their on-premises server to AWS with the help of the CloudEndure service, and during a one-month POC they successfully migrated and tested their Disaster Recovery setup. Their feedback was very positive: cost was reduced by almost 60%, and AWS services such as EC2, RDS, and S3 showed very good response times.