Recently, Amazon Web Services (AWS) suffered another highly publicized outage. AWS’s explanation is well worth reading, but at a high level, DynamoDB (AWS’s NoSQL database service) experienced timeouts due to problems with how the database handles metadata. This had a cascading effect on other widely used AWS services that depend on DynamoDB, such as EC2 Auto Scaling, CloudWatch and Simple Queue Service (SQS). Many popular internet-facing websites and applications were affected, as were countless enterprises running critical workloads on AWS. However, some sites, such as Netflix, perhaps AWS’s largest and most noteworthy tenant, weathered the outage with no noticeable issues. Netflix, like all savvy AWS users, understands how to build resilient, fault-tolerant systems on AWS. In fact, it designs for failure in everything it does. Enterprises of all sizes can learn valuable lessons from Netflix.
While you don’t necessarily have to build your own Netflix Chaos Monkey into your Cloud applications, you can certainly apply a few common-sense best practices for building the right high availability (HA) systems in the Cloud. In general, Clearpath Solutions Group weighs the following factors when making Cloud architecture decisions:
- Time: How fast can you bring this solution to your target internal or external audience/market?
- Risk: What is your tolerance for downtime, data loss, and security/compliance risk?
- Money: How much will it cost you in Cloud/AWS utilization and for your internal resources to support this on an ongoing basis?
- Scale: How fast must your application be, and how many users must it support?
Based on those criteria and how important each factor is to your business, the degree to which you follow HA best practices will vary. Many of these best practices are familiar to traditional data center engineers and architects, while some are unique to public Cloud platforms such as AWS. Clearly, every application has a unique architecture and there are always the aforementioned tradeoffs between cost and availability, but these general best practices are a good place to start:
Understand the tiers and layers of AWS. Amazon Web Services IaaS offerings are hosted in multiple locations around the world. These locations are made up of regions and Availability Zones (AZs). Each AWS region is located in a separate geographic area and generally comprises multiple data centers. Each region has multiple physically and virtually isolated locations called Availability Zones. AWS allows you to place Cloud resources, such as virtual machines (instances) and data, in multiple locations. Note that these resources are not replicated across Availability Zones or regions by default. You must build your application in a way that takes advantage of AWS’s highly available, geographically distributed architecture.
Build in redundancy at each layer of your application and avoid single points of failure. Examples include:
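Because nothing is spread across Availability Zones unless you place it there, it can help to audit where each tier of an application actually runs. The following is a minimal Python sketch of such a check, not an AWS tool: the inventory is hard-coded, and the instance names, tier labels and function names are hypothetical. In a real environment the inventory would have to come from your own tooling or the EC2 API.

```python
from collections import Counter

# Hypothetical inventory of (instance name, tier, Availability Zone).
# Hard-coded for illustration; real data would come from your own tooling.
INVENTORY = [
    ("web-1", "web", "us-east-1a"),
    ("web-2", "web", "us-east-1b"),
    ("app-1", "app", "us-east-1a"),
    ("app-2", "app", "us-east-1a"),
    ("db-1", "db", "us-east-1c"),
]


def zones_per_tier(inventory):
    """Map each tier to a Counter of Availability Zone -> instance count."""
    spread = {}
    for _name, tier, zone in inventory:
        spread.setdefault(tier, Counter())[zone] += 1
    return spread


def single_zone_tiers(inventory):
    """Return, sorted, the tiers whose instances all sit in one Availability Zone."""
    spread = zones_per_tier(inventory)
    return sorted(tier for tier, zones in spread.items() if len(zones) < 2)


if __name__ == "__main__":
    for tier in single_zone_tiers(INVENTORY):
        print(f"tier '{tier}' runs in only one Availability Zone")
```

In this made-up inventory the app tier has two instances, but both sit in the same Availability Zone, so it is flagged alongside the single-instance database tier. Instance count alone does not show whether a tier would survive the loss of a zone.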
- Instance/Virtual Machine Level – Deploy more than one instance for each tier such as a web front end, NAT, application or database tier.
- Networking – Create backup VPN connections and take advantage of Elastic Load Balancing (ELB). With ELB you can balance traffic across multiple EC2 instances in multiple Availability Zones. You can also ensure that traffic is directed only to EC2 instances that are healthy and ready to accept it.
- Storage – Create regular backups/snapshots to highly durable storage such as Amazon S3, with Glacier for long-term storage. Specifically, EBS volumes for databases should have a snapshot within the region so that recovery is quick.
- AWS Region/AZ – Use multiple AZs and, potentially, multiple regions, depending on criticality/cost factors.
- Use monitoring and health checks to auto-recover/self-heal. Services such as Amazon Route 53 (AWS’s DNS offering), Elastic Load Balancing (ELB), and EC2 instance health and status checks can help. These health checks continually ping, connect to and request data from EC2 instances. If an instance fails its health checks, you can use Auto Scaling to automatically spin up a replacement. Additionally, AWS has built in an EC2 automatic recovery feature to mitigate the risk of underlying host issues.
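To make the self-healing idea concrete, here is a small local simulation in Python of the loop described above: check every instance, retire any that fail repeatedly, and launch replacements until the desired capacity is restored. It is a sketch for illustration only and does not call AWS. The `Fleet` and `Instance` classes, the `probe` callback and the threshold of two consecutive failures are all assumptions; in practice this behavior is configured in ELB and Auto Scaling rather than hand-written.

```python
from dataclasses import dataclass, field
from itertools import count

UNHEALTHY_THRESHOLD = 2  # assumed: consecutive failed checks before replacement


@dataclass
class Instance:
    name: str
    failures: int = 0  # consecutive failed health checks


@dataclass
class Fleet:
    desired: int  # number of instances the fleet should always run
    instances: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1))

    def launch(self):
        """Add a new instance with the next sequential (hypothetical) name."""
        inst = Instance(f"instance-{next(self._ids)}")
        self.instances.append(inst)
        return inst

    def reconcile(self, probe):
        """Run one round of health checks, then restore desired capacity.

        `probe` is any callable that returns True for a healthy instance.
        Returns the names of the instances retired in this round.
        """
        for inst in self.instances:
            inst.failures = 0 if probe(inst) else inst.failures + 1
        retired = [i.name for i in self.instances if i.failures >= UNHEALTHY_THRESHOLD]
        self.instances = [i for i in self.instances if i.failures < UNHEALTHY_THRESHOLD]
        while len(self.instances) < self.desired:
            self.launch()
        return retired


if __name__ == "__main__":
    fleet = Fleet(desired=2)
    fleet.reconcile(lambda inst: True)  # first round launches two instances
    broken = {"instance-1"}
    for _ in range(2):  # instance-1 fails two checks in a row
        print(fleet.reconcile(lambda inst: inst.name not in broken))
    print([inst.name for inst in fleet.instances])
```

Requiring more than one consecutive failure before replacing an instance is a common design choice: it avoids churn when a single check times out, at the cost of a slightly slower reaction to a real failure.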
These are the very basics of building highly resilient AWS solutions. In future blog articles, we’ll dig deeper into specific use cases for High Availability, related security concerns and the various Disaster Recovery scenarios.
Clearpath Solutions Group can help you design high availability solutions in the AWS Cloud or conduct a risk and security assessment on your current Cloud implementation. Schedule a discussion with one of our Cloud specialists today to protect your organization from an outage.