I’ve been reading that the AWS US-East-1 outage, which happened earlier today or last night (at the time of writing), has affected thousands of online services, including SaaS platforms that probably do not follow a reliable DR strategy. Some people have been complaining that they can’t connect to voice-enabled and smart devices that rely on the affected AWS infrastructure. I was personally on a website built purely on a serverless platform when it happened. I noticed that pages were taking unusually long to load, and at some point they simply stopped loading. Services that use Amazon Kinesis were hit badly. At the time of writing, AWS has announced that, as of 10:30 PM PST, they had restored all traffic to Kinesis Data Streams from Internet-facing endpoints and were continuing to incrementally restore requests to Kinesis Data Streams through VPC endpoints, which is great news. I really love AWS’s transparency, determination, and effort in restoring the services and keeping all of us updated on the status!
So, yeah, this kind of incident does happen. Whether it affects a single commonly used service (e.g. S3) or a number of services, a region-level outage can always happen. If you are a single-regioner (i.e. you deploy cloud-native apps in a single region), you probably have a multi-AZ architecture where your workloads are spread across different availability zones within the same region. However, don’t assume a region outage will never happen to your region. A region outage can completely knock out your services and critically affect your business application’s availability for a period of time, especially if your application is built around a single-region architecture and doesn’t follow a multi-region DR strategy.
As a security practitioner and cloud architect, I’d like to share my two cents on what we can learn from this incident.
#1 Move out of a single-region architecture, and have a multi-region DR plan. Your workloads do not need to be deployed in an active-active fashion across multiple regions to be considered “multi-region”. Depending on your Recovery Time Objective (RTO, how long you can afford to be down) and Recovery Point Objective (RPO, how much data you can afford to lose), you can build a tailor-made DR plan based on the following strategies from the AWS Well-Architected Framework:
- Backup and Restore: Purely scheduled backups that are restored into another region when needed. Do consider how long the restoration process can take.
- Pilot Light: Keep critical data and a minimal set of core services in a different region so that services can be restored easily and data loss is minimized.
- Warm Standby: Core services run at all times in another region, typically at reduced capacity, so that they can be “promoted” to production in the event of a region-level failure. This is a quick bounce-back strategy.
- Active-Active: The most expensive strategy, which requires keeping an exact replica of the entire architecture running in another region. (It is a close relative of “hot standby”, where the second region also runs at full scale but takes no traffic until failover.) You can split the traffic between regions using Amazon Route 53, and in the event of a region-level disaster, you’ll still have another region that’s running.
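The traffic split mentioned in the last bullet could look roughly like the sketch below. It only builds the request body for two weighted Route 53 records and does not call AWS. The domain names, health check IDs, and the 50/50 split are placeholders I made up for illustration.

```python
def weighted_record(name, region, target, weight, health_check_id, ttl=60):
    """Build one weighted CNAME change for a Route 53 change batch."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": region,  # must be unique among records sharing a name
            "Weight": weight,         # relative share of traffic (0-255)
            "TTL": ttl,               # short TTL so clients re-resolve quickly
            "ResourceRecords": [{"Value": target}],
            "HealthCheckId": health_check_id,  # ties the record to a health check
        },
    }


def build_change_batch(name, endpoints):
    """endpoints: list of (region, target, weight, health_check_id) tuples."""
    return {
        "Comment": "Split traffic across regions",
        "Changes": [weighted_record(name, *endpoint) for endpoint in endpoints],
    }


# Hypothetical endpoints and health check IDs, for illustration only.
batch = build_change_batch(
    "app.example.com.",
    [
        ("us-east-1", "app-use1.example.com", 50, "hc-primary-placeholder"),
        ("us-west-2", "app-usw2.example.com", 50, "hc-secondary-placeholder"),
    ],
)

# With boto3 installed and credentials configured, the batch could be applied with:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your hosted zone id>", ChangeBatch=batch)
```

Attaching a health check to each record is what lets Route 53 stop answering with a region that has gone unhealthy, so the split doubles as a failover mechanism.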
You and your team will have to set your organization’s RTO and RPO, and balance cost, the criticality of your services, and your organization’s reputation in order to come up with the best tailor-made strategy for your organization. A multi-region strategy doesn’t have to be expensive. For example, instead of going active-active, you can choose to maintain a small-scale environment of critical services running in a different region.
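To make the RTO/RPO trade-off a bit more concrete, here is a minimal sketch that picks the cheapest of the four strategies that still meets a given target. The RTO/RPO figures attached to each strategy are rough assumptions of mine, not numbers published by AWS, so replace them with what your own drills measure.

```python
from datetime import timedelta

# Ordered from cheapest to most expensive.
# (strategy, assumed achievable RTO, assumed achievable RPO) -- illustrative only.
STRATEGIES = [
    ("Backup and Restore", timedelta(hours=24), timedelta(hours=24)),
    ("Pilot Light", timedelta(hours=1), timedelta(minutes=15)),
    ("Warm Standby", timedelta(minutes=10), timedelta(minutes=1)),
    ("Active-Active", timedelta(minutes=1), timedelta(seconds=5)),
]


def cheapest_strategy(rto_target, rpo_target):
    """Return the first (cheapest) strategy that meets both targets."""
    for name, rto, rpo in STRATEGIES:
        if rto <= rto_target and rpo <= rpo_target:
            return name
    # Nothing meets the targets; fall back to the most resilient option.
    return STRATEGIES[-1][0]


print(cheapest_strategy(timedelta(hours=4), timedelta(hours=1)))  # Pilot Light
```

The point of ordering the list by cost is that you only pay for as much resilience as your targets actually demand.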
#2 Cross-region backup. If you want to avoid having your entire system knocked out by a region-level outage, you’ll need to include cross-region backups in your DR plan. Your critical data needs to be backed up in multiple regions, following security best practices for data. Being able to restore your data securely from a different region will minimize data loss in a region-level failure. For example, consider cross-region replication for data stored in S3 buckets, provided the regulations your organization must comply with do not require that data to reside within a specific region. For database servers, consider taking regular snapshots of your RDS instances and copying them to another region, since a snapshot that lives only in the affected region won’t help you during the outage. Better yet, create cross-region read replicas of your RDS instances. For maximum resiliency of the database tier, you can even have multi-AZ deployments in each region. If you have a serverless architecture that uses a NoSQL database such as DynamoDB, you can opt for global tables where applicable. Global tables can provide a low-latency application experience as well as facilitate disaster recovery.
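As a rough sketch of the S3 part, the function below builds a cross-region replication configuration in the shape the S3 API expects. It does not call AWS. The bucket names, role ARN, and rule ID are hypothetical, and versioning has to be enabled on both the source and destination buckets before replication can be turned on.

```python
def build_replication_config(role_arn, dest_bucket, prefix=""):
    """Build an S3 replication configuration that copies objects to dest_bucket."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "dr-cross-region",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix matches every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{dest_bucket}"},
            }
        ],
    }


# Hypothetical role and bucket names, for illustration only.
config = build_replication_config(
    "arn:aws:iam::123456789012:role/s3-replication-role",
    "my-app-data-dr",
)

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_replication(
#       Bucket="my-app-data", ReplicationConfiguration=config)
```

Passing a non-empty `prefix` is one way to replicate only the subset of data that is allowed to leave the region.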
#3 Test your DR plan. This is very important. Testing your DR plan prepares your organization for incidents like a region-level outage. Testing validates that the procedures, processes, configuration management and automation tools, and IaC (such as Terraform) used in the DR process really work as planned. It ensures that your configurations and services are restored as planned, and that your services can be up and running within your RTO and RPO targets. If you are using Route 53, make sure that the DNS health checks are working and that the failover routing policy is in place (and working). If you are using microservices, make use of X-Ray to gain visibility into the applications in the other region. How about your security tools? Will they still work in your DR (now production) environment? If you maintain, for example, a CloudFormation template that is supposed to spin up EC2 instances in a DR region, you’ll need to make sure that the template launches instances from the correct AMIs in the new region (AMI IDs are region-specific) and configures them correctly through the user data. The only way to know whether the orchestration will succeed is to test the template.
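A cheap pre-check you can run before a drill is to confirm that the template even knows an AMI for the DR region. The sketch below assumes the template is JSON already loaded into a dict and that it keeps its AMIs in a `Mappings` section named `RegionMap` with an `AMI` key per region. That layout is a common convention, but yours may differ, and the AMI ID here is a placeholder.

```python
def missing_ami_regions(template, regions, mapping_name="RegionMap", key="AMI"):
    """Return the regions that have no AMI entry in the template's mapping."""
    mapping = template.get("Mappings", {}).get(mapping_name, {})
    return [region for region in regions if not mapping.get(region, {}).get(key)]


# Hypothetical template fragment; the AMI ID is a placeholder.
template = {
    "Mappings": {
        "RegionMap": {
            "us-east-1": {"AMI": "ami-0aaaaaaaaaaaaaaaa"},
        }
    }
}

missing = missing_ami_regions(template, ["us-east-1", "us-west-2"])
print(missing)  # ['us-west-2']
```

This is only a static check. It catches a missing entry, not a wrong or deregistered AMI, so it complements an actual test launch in the DR region rather than replacing it.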
#4 Automate it! The AWS Well-Architected Framework recommends automating the DR failover, using AWS tools or third-party tools. Whichever you use, make sure that most, if not all, of the processes and steps involved can run automatically once the DR failover has been triggered. Automation will greatly help your services bounce back quickly. You can use AWS tools such as CloudEndure, as well as third-party tools from AWS Marketplace. One important thing to note is that the automation doesn’t need to come from a single technology or solution. You can definitely use a mixture of IaC, serverless functions, and DR automation tools to orchestrate and automate the entire process. You’ll just need a way to gain visibility into the state of the DR failover.
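Here is a minimal sketch of that last point: a tiny runner that executes failover steps in order, records the state of each one, and stops at the first failure. The step names and functions are stand-ins I invented. In a real setup each one would wrap your IaC run, serverless function, or DR tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class FailoverRun:
    """Runs failover steps in order and keeps a log of what happened."""
    log: List[Tuple[str, str]] = field(default_factory=list)

    def execute(self, steps: List[Tuple[str, Callable[[], None]]]) -> bool:
        for name, action in steps:
            try:
                action()
            except Exception as exc:
                # Record the failure and stop; later steps depend on earlier ones.
                self.log.append((name, f"FAILED: {exc}"))
                return False
            self.log.append((name, "OK"))
        return True


# Placeholder steps, for illustration only.
def promote_replica():
    pass


def deploy_stack():
    raise RuntimeError("AMI not found in DR region")


def switch_dns():
    pass


run = FailoverRun()
succeeded = run.execute([
    ("promote database replica", promote_replica),
    ("deploy app stack", deploy_stack),
    ("switch DNS", switch_dns),
])

for step, status in run.log:
    print(f"{step}: {status}")
```

Because the log is plain data, it can be pushed to whatever dashboard or chat channel your team watches during a failover.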
#5 Ensure that your security measures are restored after a DR failover. Some people may be inclined to treat security as the lowest priority during a disaster. They probably think the worst is over once all services are restored after a DR failover. But what good is your restored application if it gets hacked right after a disaster recovery? When you move your production workloads to another region, all the security services and tools you previously used need to move along with them (if they’re not already there)! Be it a DR drill or an actual DR failover, are your security gateways still protecting your newly relaunched production apps in the new region? Are your serverless functions still protected by workload firewalls? How about your control plane: do you still have the same level of visibility into it? If not, you’ll need to find a way to ensure that all the security measures you had before are back in effect in the new region. (Maybe a security playbook for a post-DR cloud environment will come in handy.)
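One simple way to start such a playbook is to keep an inventory of the security controls active in each region and diff them after a failover or drill. The sketch below assumes you can produce those inventories as plain sets of labels. The control names are made up for illustration.

```python
def missing_controls(primary, dr):
    """Return controls present in the primary region but absent in the DR region."""
    return sorted(set(primary) - set(dr))


# Hypothetical inventories, for illustration only.
primary_controls = {
    "security-gateway",
    "workload-firewall",
    "control-plane-logging",
    "dns-health-alerts",
}
dr_controls = {"security-gateway", "dns-health-alerts"}

gaps = missing_controls(primary_controls, dr_controls)
print(gaps)  # ['control-plane-logging', 'workload-firewall']
```

An empty result does not prove the controls are configured identically, only that none is missing outright, so treat it as the first item on the checklist rather than the whole checklist.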
Finally, does a region outage mean you’ll need to start considering a multi-cloud approach (perhaps in the same region)?
Not necessarily. Unless you have specific requirements for being multi-cloud (e.g. your data must strictly reside within your region due to regulations), you don’t really need to go multi-cloud just yet, at least not because of a region-level service outage. You’ll just need to make sure that, if an outage ever happens to your region, you have a robust and fault-tolerant architecture that can keep your business-critical applications online and operational. In short, consider a multi-region approach whenever possible.
Leave your comments if you have any challenges with a multi-region approach!