You have probably heard about the AWS outage, right?
But most of you might not know exactly what caused it or how Amazon fixed the issue.
This blog answers both questions.
In this article, we are going to discuss what Amazon S3 is, what the AWS outage was, how it happened, and how Amazon resolved the issue.
What is Amazon S3?
Amazon Simple Storage Service (S3) is an object storage service that can store and retrieve any amount of data from sources such as mobile applications, websites, and other devices. With Amazon S3, developers get storage that is durable, secure, and highly scalable.
What is the AWS outage?
The AWS outage was a service interruption on the AWS cloud platform during which Amazon S3 in one region stopped responding for several hours, disrupting the many websites and applications that depend on it.
Now, let's look at how the interruption occurred.
On February 28th, 2017, the AWS S3 team was debugging an issue with the S3 billing system in the Northern Virginia (US-EAST-1) region. While doing so, one of the team members ran a command meant to remove a small number of servers, but entered one of its inputs incorrectly. As a result, a much larger set of servers was removed than intended. The outage hit the Northern Virginia region, and because so many websites and applications keep their data there, its effects were felt in many other parts of the world.
You may think that it is not a big deal, since it is just a command and a typo can be fixed with the backspace key, right?
In fact, it was a very big deal: that single wrong command removed a large set of servers supporting two S3 subsystems. With so much capacity gone, both subsystems needed a full restart.
S3 is designed to tolerate the removal or failure of a significant amount of capacity, but this time too much was removed at once.
One of the affected subsystems was the index subsystem, which manages the metadata and location information of every S3 object in the region. The other was the placement subsystem, which manages the allocation of new storage and depends on the index subsystem to work.
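To make the role of the two subsystems concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not how S3 is actually implemented: the class `ToyStore`, its fields, and the `ServiceUnavailable` message are all hypothetical. It models an index that maps keys to locations and a subsystem that allocates new storage (called `placement` here), and shows which requests fail when each one is down.

```python
class ToyStore:
    """Toy model of an object store with two subsystems (hypothetical, not S3's design)."""

    def __init__(self):
        self.index = {}           # index subsystem: key -> storage location
        self.blobs = {}           # the stored bytes, keyed by location
        self.index_up = True      # is the index subsystem serving requests?
        self.placement_up = True  # is the placement subsystem serving requests?
        self._next_location = 0

    def put(self, key, data):
        # A write needs placement (to allocate storage) and the index
        # (to record where the object went).
        if not (self.index_up and self.placement_up):
            raise RuntimeError("ServiceUnavailable")
        location = self._next_location
        self._next_location += 1
        self.blobs[location] = data
        self.index[key] = location

    def get(self, key):
        # A read only needs the index to find the object's location.
        if not self.index_up:
            raise RuntimeError("ServiceUnavailable")
        return self.blobs[self.index[key]]


store = ToyStore()
store.put("report.csv", b"a,b,c")

store.index_up = False        # index is restarting: reads and writes both fail
try:
    store.get("report.csv")
except RuntimeError as err:
    print("GET failed:", err)

store.index_up = True         # index is back: the object is reachable again
print(store.get("report.csv"))
```

In this toy model the bytes stay in `blobs` the whole time; while the index is down they are simply unreachable, and they become readable again once it is back.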
The worst part is that other services in the region rely on S3. While the two subsystems were restarting, S3 could not respond to requests, so none of the S3 APIs in the region were available.
Recovery required a full restart of both S3 subsystems, and this took a long time.
The outage affected not just Amazon S3 customers but other AWS services as well, such as CloudWatch, WorkSpaces, Simple Email Service, Cognito, and DynamoDB. Some of them suffered complete disruption and returned errors to their users.
How did Amazon solve it?
Amazon said that it designs S3 to keep working even when a large part of the system fails. It also acknowledged that the restart took longer than expected because the affected subsystems had not been completely restarted for many years, and S3 had grown enormously in that time.
So Amazon modified its capacity-removal tool so that its engineers cannot make the same mistake again: the tool now removes capacity more slowly, and it includes safety checks that block any removal that would take a subsystem below its minimum required capacity.
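The kind of safety check described above can be sketched in a few lines of Python. This is a minimal illustration, not Amazon's actual tool: the function `plan_removal`, its parameters, and the numbers in the usage example are all hypothetical. It rejects any request that would leave too few servers running, and it splits an accepted request into small batches so that capacity is removed gradually.

```python
def plan_removal(active, requested, min_required, max_per_step):
    """Return the batch sizes in which `requested` servers may be removed.

    Hypothetical guard, not Amazon's tool. Raises ValueError if the removal
    would leave fewer than `min_required` servers active.
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")
    if max_per_step < 1:
        raise ValueError("max_per_step must be at least 1")
    if active - requested < min_required:
        raise ValueError(
            f"refusing to remove {requested} of {active} servers: "
            f"at least {min_required} must stay active"
        )

    # Split the request into batches of at most `max_per_step` servers,
    # so an operator (or health check) can stop between steps.
    batches = []
    remaining = requested
    while remaining > 0:
        step = min(remaining, max_per_step)
        batches.append(step)
        remaining -= step
    return batches


print(plan_removal(active=100, requested=25, min_required=60, max_per_step=10))

try:
    # A mistyped input that asks for far too many servers is rejected outright.
    plan_removal(active=100, requested=50, min_required=60, max_per_step=10)
except ValueError as err:
    print("blocked:", err)
```

The point of the design is that the check lives in the tool itself, so a mistyped input fails loudly instead of being carried out.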
After more than four hours of work, Amazon brought S3 fully back online, and it apologized to its customers for the trouble caused by the outage.
We hope this has answered all your questions. The issue behind this outage has been resolved, and if a similar problem arises, the safeguards Amazon added are meant to contain it before it affects customers' storage.