Outage in AWS
Elevated API Error Rates
Status: Resolved
Severity: Minor
November 09, 2021
- Lasted about 6 hours (1:58 PM to 7:51 PM PST)
Outage Details
1:58 PM PST We are investigating increased 5xx error rates and latencies for requests to the Amazon S3 APIs in the US-EAST-1 Region. Where possible, we recommend that requests that fail with a 5xx error be retried.
2:48 PM PST We have identified the S3 subsystem responsible for increased 5xx error rates for the S3 PUT APIs, and are working to isolate the root cause within this subsystem. Customers may also be experiencing increased latency when performing PUT operations. During this time, we recommend customers retry any failed requests.
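The retry advice in these updates is straightforward to apply in application code. Below is a minimal sketch in Python using boto3, assuming a hypothetical bucket and key; note that boto3 already retries some 5xx responses internally, so an explicit loop like this mainly illustrates the backoff-and-retry pattern the updates recommend.

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", region_name="us-east-1")


def put_with_retries(bucket, key, body, max_attempts=5):
    """Retry S3 PUTs that fail with a 5xx error, with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return s3.put_object(Bucket=bucket, Key=key, Body=body)
        except ClientError as err:
            status = err.response["ResponseMetadata"]["HTTPStatusCode"]
            if status < 500 or attempt == max_attempts:
                raise  # not a 5xx, or out of attempts: surface the error
            # Back off 0.2s, 0.4s, 0.8s, ... capped at 5s, plus jitter.
            time.sleep(min(0.2 * 2 ** (attempt - 1), 5.0) + random.uniform(0, 0.1))


# Hypothetical names, purely for illustration.
put_with_retries("example-bucket", "example-key", b"payload")
```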
4:09 PM PST We are continuing to see increased 5xx error rates and latencies for S3 API requests, in particular S3 PUT API calls. We have narrowed down the root cause to a specific subsystem within S3 and continue to make progress in mitigating the impact to this service, but have not yet seen significant improvement. S3 API error rates and latencies have stayed at a consistently low level, with the vast majority of request retries succeeding. While the vast majority of requests are being processed within normal latency levels, request tail latencies are exceeding 1 second in some cases. In some applications, increasing client timeouts may also help to mitigate the issue.
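For the client-timeout suggestion in the update above, here is a sketch of how a boto3 client might be configured. The timeout values are illustrative, not recommendations from AWS; the point is to keep them comfortably above the greater-than-1-second tail latencies the update describes, while letting boto3's built-in retry machinery handle backoff.

```python
import boto3
from botocore.config import Config

# Illustrative values: keep client timeouts well above the >1 second
# tail latencies described in the update, and use boto3's "adaptive"
# retry mode for backoff plus client-side rate limiting.
config = Config(
    connect_timeout=5,   # seconds to establish a connection
    read_timeout=15,     # seconds to wait for a response
    retries={"max_attempts": 10, "mode": "adaptive"},
)

s3 = boto3.client("s3", region_name="us-east-1", config=config)
```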
4:52 PM PST We are starting to see some improvement in the 5xx error rates and latencies for S3 API requests, in particular S3 PUT API calls. The issue affected a subsystem that stores routing metadata used by Amazon S3 to map API requests to storage nodes. A recent update caused increased load within this subsystem, which led to increased error rates and latencies for the S3 APIs. We have now mitigated the increased load and are seeing early signs of recovery. As the subsystem processes the backlog of requests, S3 API error rates and latencies will continue to improve.
5:53 PM PST We continue to see a gradual improvement in error rates as we process the backlog of mappings between request metadata and data storage in the subsystem affected by the increased load. We are currently working on mitigations to speed up the processing of the backlog during this event. Once the backlog is cleared, we expect that the error rate will fully recover. The vast majority of requests to S3 APIs continue to operate normally.
6:58 PM PST We continue to process the backlog of mappings between request metadata and data storage in the subsystem affected by the increased load. We have implemented two parallel mitigations to improve the speed of processing, and both are now being deployed. Once the backlog is cleared, we expect that the error rate will fully recover. The vast majority of requests to S3 APIs continue to operate normally.
7:51 PM PST We have completed the mitigation to accelerate processing of the mappings between S3 API request metadata and storage. The backlog has been fully processed and S3 API errors and latencies have returned to normal levels. The issue has been resolved and the service is operating normally.