How Amazon Web Services crashed and rose again

If Amazon Web Services (AWS) had gone down on the morning of Monday, September 21, instead of Sunday, September 20, people would still be screaming about it. Instead, it went down at 3 AM Pacific Daylight Time (PDT) and barely anyone noticed.

Unless, of course, you were a system administrator for a popular service or website, such as Amazon Video or Reddit. If you were one of those people, you noticed. Boy, did you notice.

Only one major AWS customer, Netflix, seems to have been ready for a major AWS data-center failure. No one else seems to have been.

You see, this wasn't a "simple" data-center problem like a backhoe taking out AWS US-East's Internet backbone. No, it was much more complicated.

It all began with the Amazon DynamoDB service in Virginia having problems. DynamoDB is a fast, flexible NoSQL database service designed to support applications that require consistent, single-digit-millisecond latency at scale. That, as you would guess, means it's used by many, if not all, time-sensitive AWS cloud services.
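To make that concrete, here's a minimal sketch of what talking to DynamoDB looks like from the boto3 Python SDK. The table name and key below are hypothetical stand-ins for illustration, not anything from Amazon's post-mortem.

```python
# Minimal DynamoDB usage sketch with the boto3 Python SDK.
# The table name ("Sessions") and its key schema are hypothetical;
# substitute a table you have actually created.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("Sessions")

# Single-item write: DynamoDB is built to answer calls like this in
# single-digit milliseconds, which is why so many latency-sensitive
# services sit on top of it.
table.put_item(Item={"session_id": "abc123", "user": "alice"})

# Single-item read by primary key.
response = table.get_item(Key={"session_id": "abc123"})
print(response.get("Item"))
```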

Officially, an AWS spokesperson said, "Between 2:13 AM and 7:10 AM PDT on September 20, 2015, AWS experienced significant error rates with read and write operations for the Amazon DynamoDB service in the US-East Region, which impacted some other AWS services in that region, and caused some AWS customers to experience elevated error rates."

When DynamoDB started having read/write issues, its performance started collapsing. That hit some other AWS services in US-East, whose application programming interfaces (APIs) started timing out. From there, services built on top of AWS started failing.
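That's the classic cascade: one slow dependency, plus default timeouts and open-ended retries, ties up everything upstream. One defensive move is to bound how long any single call can hold up your own service. Here's a hedged sketch using botocore's client configuration; the timeout and retry numbers are illustrative assumptions, not AWS recommendations, and the table is the same hypothetical one as above.

```python
# Sketch: bound how long any one DynamoDB call can hold up the caller.
# Timeout and retry values are illustrative assumptions; tune them to
# your own latency budget.
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, ConnectTimeoutError, ReadTimeoutError

bounded = Config(
    connect_timeout=1,            # seconds to establish a connection
    read_timeout=2,               # seconds to wait for a response
    retries={"max_attempts": 2},  # don't retry forever against a sick region
)

dynamodb = boto3.resource("dynamodb", region_name="us-east-1", config=bounded)
table = dynamodb.Table("Sessions")  # hypothetical table


def fetch_session(session_id):
    """Return the item, or None if DynamoDB is erroring or too slow."""
    try:
        return table.get_item(Key={"session_id": session_id}).get("Item")
    except (ConnectTimeoutError, ReadTimeoutError, ClientError):
        # Degrade gracefully (serve cached data, a reduced page, etc.)
        # instead of letting the timeout propagate up the stack.
        return None
```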

Some customers were affected more than others. In most cases, customers saw an increase in errors, which prevented some of them from accessing their sites and services. Many of these sites did not "go down," but their performance fell to unacceptable levels.

According to the AWS Service Health Dashboard entries for DynamoDB that Sunday, here's how the problem unfolded:

3:00 AM PDT We are investigating increased error rates for API requests in the US-EAST-1 Region.

3:26 AM PDT We are continuing to see increased error rates for all API calls in DynamoDB in US-East-1. We are actively working on resolving the issue.

4:05 AM PDT We have identified the source of the issue. We are working on the recovery.

4:41 AM PDT We continue to work towards recovery of the issue causing increased error rates for the DynamoDB APIs in the US-EAST-1 Region.

4:52 AM PDT We want to give you more information about what is happening. The root cause began with a portion of our metadata service within DynamoDB. This is an internal sub-service which manages table and partition information. Our recovery efforts are now focused on restoring metadata operations. We will be throttling APIs as we work on recovery.

So, Amazon took roughly two hours to nail down the root cause. They then throttled the DynamoDB APIs so their system administrators could work on the problem. (A sketch of how client code can cope with that kind of throttling appears after the timeline.)

5:22 AM PDT We can confirm that we have now throttled APIs as we continue to work on recovery.

5:42 AM PDT We are seeing increasing stability in the metadata service and continue to work towards a point where we can begin removing throttles.

6:19 AM PDT The metadata service is now stable and we are actively working on removing throttles.

7:12 AM PDT We continue to work on removing throttles and restoring API availability but are proceeding cautiously.

7:22 AM PDT We are continuing to remove throttles and enable traffic progressively.

7:40 AM PDT We continue to remove throttles and are starting to see recovery.

7:50 AM PDT We continue to see recovery of read and write operations and continue to work on restoring all other operations.

8:16 AM PDT We are seeing significant recovery of read and write operations and continue to work on restoring all other operations.

So, from start to finish, it took AWS just over five hours to get back to full speed.
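Note that during recovery Amazon deliberately throttled the DynamoDB APIs, which means well-behaved clients were told to slow down rather than simply failing. The standard client-side answer is exponential backoff with jitter. Here's a hedged sketch; the error codes are DynamoDB's usual throttling codes, while the table name and backoff numbers are illustrative assumptions.

```python
# Sketch: retry a throttled DynamoDB call with exponential backoff and jitter.
# The error codes are DynamoDB's standard throttling codes; the backoff
# parameters and table name are illustrative assumptions.
import random
import time

import boto3
from botocore.exceptions import ClientError

THROTTLE_CODES = {"ProvisionedThroughputExceededException", "ThrottlingException"}

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("Sessions")  # hypothetical table


def get_with_backoff(key, max_attempts=5, base_delay=0.1):
    for attempt in range(max_attempts):
        try:
            return table.get_item(Key=key).get("Item")
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code not in THROTTLE_CODES or attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

Backing off like this also keeps clients from piling retry traffic onto a service that's already struggling to recover.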

In theory, the July 16 Amazon DynamoDB release could have helped customers mitigate this problem, because that release included DynamoDB cross-region replication. With this client-side solution, AWS customers can maintain identical copies of DynamoDB tables across different AWS regions in near real time. With it, you can, for additional fees of course, use cross-region replication to back up DynamoDB tables or to provide low-latency access to geographically distributed data.
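The payoff of cross-region replication is having somewhere else to read from when your primary region is misbehaving. Here's a hedged sketch of that failover pattern, assuming a replica of the same hypothetical table is already being kept in sync in another region (by the cross-region replication library or a pipeline of your own); the region and table names are assumptions.

```python
# Sketch: fall back to a cross-region replica when the primary region errors out.
# Assumes a replica table already exists in us-west-2 and is kept in sync;
# region and table names are hypothetical.
import boto3
from botocore.exceptions import BotoCoreError, ClientError

primary = boto3.resource("dynamodb", region_name="us-east-1").Table("Sessions")
replica = boto3.resource("dynamodb", region_name="us-west-2").Table("Sessions")


def read_session(session_id):
    key = {"session_id": session_id}
    try:
        return primary.get_item(Key=key).get("Item")
    except (BotoCoreError, ClientError):
        # Replication is asynchronous, so this read may be slightly stale,
        # but it keeps the service answering while us-east-1 recovers.
        return replica.get_item(Key=key).get("Item")
```

Replication is asynchronous, so a replica read can lag a little behind, but for most read traffic slightly stale beats unavailable.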

Still, as this episode showed, even the largest cloud provider in the world can have major failures. If your business depends on always being available, investing in DynamoDB cross-region replication would be a smart move.
