Outage in Datadog US1

Backfilling historical data for March 8, 2023 incident

Resolved Minor
March 08, 2023 - Started almost 2 years ago - Lasted 2 days
Official incident page

Outage Details

We are investigating loading issues on our web application. As a result, some users might be getting errors or increased latency when loading the web application.
Latest Updates (most recent first)
RESOLVED almost 2 years ago - at 03/10/2023 05:25AM

We have finished backfilling data across all products: all data received during the incident that had been successfully buffered but unprocessed is now fully accessible on the platform. Due to the nature of this outage, you may see some residual gaps in the data we received within the first few hours after the start of the incident.

We truly appreciate your patience and understanding during this incident.

MONITORING almost 2 years ago - at 03/10/2023 02:10AM

We have completed backfill of data for the following products:

* Real User Monitoring
* Database Monitoring
* Network Performance Monitoring
* Network Device Monitoring

We are now in the process of validating and verifying data across all customers in those products.

For other products, we are actively working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.

MONITORING almost 2 years ago - at 03/09/2023 11:04PM

We have also completed backfilling data for the following products:

* Log Management

We are now in the process of validating and verifying data across all customers in those products.

For other products, we are actively working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.

MONITORING almost 2 years ago - at 03/09/2023 08:15PM

We have completed backfill of data for APM traces and services and for CI Visibility, and are now in the process of validating and verifying data across all customers in those products.

For other products, we are actively working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.

MONITORING almost 2 years ago - at 03/09/2023 05:11PM

We've finished monitoring the recovery. All Datadog systems continue to receive, query, and evaluate monitors on live data as normal, while backfill operations continue for historical data.

MONITORING almost 2 years ago - at 03/09/2023 03:55PM

During the recovery operations, users might have experienced temporary elevated latency and error rates on the web application between 15:30 and 15:46 UTC, specifically for metric queries and APM. We are now monitoring the recovery and continuing with the backfilling operations.

MONITORING almost 2 years ago - at 03/09/2023 01:00PM

All Datadog services are now available and able to receive, query, and report on live data. Monitors continue to be evaluated correctly since live data has been restored. Some customers may still observe gaps in historical data for parts of the last 24 hours.

We are now working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.

MONITORING almost 2 years ago - at 03/09/2023 12:26PM

Network Performance Management is operational again.
Logs Management is generally working, but some users may see transient errors when querying recent data.
Monitors continue to be evaluated correctly since live data has been restored.

Unless noted otherwise, all Datadog services are now available and able to receive and query live data. Some customers may still observe gaps in historical data for certain products for parts of the last 24 hours. We are now working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.

MONITORING almost 2 years ago - at 03/09/2023 12:08PM

Logs Management and Network Performance Management are generally working, but some users may see transient errors when querying recent data.
Monitors continue to be evaluated correctly since live data has been restored.

Unless noted otherwise, all Datadog services are now available and able to receive and query live data. Some customers may still observe gaps in historical data for certain products for parts of the last 24 hours. We are now working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.

MONITORING almost 2 years ago - at 03/09/2023 10:09AM

We are continuing to monitor for any further issues.

MONITORING almost 2 years ago - at 03/09/2023 09:48AM

Logs Management and Network Performance Management are generally working, but some users may see transient errors when querying recent data.
Monitors continue to be evaluated correctly since live data has been restored.

Unless noted otherwise, all Datadog services are now available and able to receive and query live data. Some customers may still observe gaps in historical data for certain products for parts of the last 24 hours. We are now working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.

MONITORING almost 2 years ago - at 03/09/2023 08:58AM

Unless noted otherwise, all Datadog services are now available and able to receive and query live data. Some customers may still observe gaps in historical data for certain products for parts of the last 24 hours. We are now working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.

IDENTIFIED almost 2 years ago - at 03/09/2023 08:27AM

A subset of customers might experience transient errors while loading Network Performance Monitoring and Logs Management. The underlying data is still being processed and will be available to query once queries are fully operational again. We will continue to monitor progress towards recovering the remaining services.

IDENTIFIED almost 2 years ago - at 03/09/2023 07:04AM

SLOs are operational. CI Visibility is operational. Recent Profiling data is available for queries. We will continue to monitor progress towards recovering the remaining services.

IDENTIFIED almost 2 years ago - at 03/09/2023 06:31AM

Logs Management is operational; live data and alerting are back to normal. External Archives and Log Forwarding are still delayed. Serverless monitoring is operational. We will continue to monitor progress towards recovering the remaining services.

IDENTIFIED almost 2 years ago - at 03/09/2023 06:26AM

We are continuing to work on a fix for this issue.

IDENTIFIED almost 2 years ago - at 03/09/2023 06:03AM

APM Traces is fully operational. RUM is fully operational. Security Monitoring is fully operational. Database Monitoring is fully operational. We will continue to monitor progress towards recovering the remaining services.

IDENTIFIED almost 2 years ago - at 03/09/2023 05:19AM

Serverless monitoring is fully operational. Synthetic Monitoring is fully operational. Network Device Monitoring is fully operational. Database Monitoring is fully operational. APM Services is fully operational. Metrics from our cloud provider integrations are available, and Metrics generated from Logs are available. We will continue to monitor progress towards recovering the remaining services.

IDENTIFIED almost 2 years ago - at 03/09/2023 04:28AM

Live data for Metrics is now available for all customers. We're in the process of enabling metric alerts for some customers for time windows less than 1 hour.

We're seeing partial recovery for Network Performance Monitoring. Error Tracking is seeing partial availability, and we're investigating. We will continue to monitor progress towards recovering the remaining services.

IDENTIFIED almost 2 years ago - at 03/09/2023 03:17AM

We're seeing partial recovery for the Profiling product, as well as metrics from our cloud provider integrations. We will continue to monitor progress towards recovering the remaining services.

IDENTIFIED almost 2 years ago - at 03/09/2023 02:15AM

Live data for Metrics is now available for most customers. Historical search for APM Traces is operational. Monitors for Logs and Service Checks are operational. We will continue to monitor progress towards recovering the remaining services.

IDENTIFIED almost 2 years ago - at 03/09/2023 12:54AM

Live data is now available for Logs. We're seeing partial recovery for Security Monitoring. We will continue to monitor progress towards recovering the remaining services. Data ingestion and monitor notifications remain delayed across non-metric data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 11:49PM

Live Search on the last 15 minutes for APM Traces, and Live Processes, have recovered. We will continue to monitor progress towards recovering the remaining services. Data ingestion and monitor notifications remain delayed across non-metric data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 11:21PM

Incident Management is fully operational. We're seeing partial recovery across several products including Serverless and Network Performance Monitoring. These products may have gaps in data and partial limitations based on data available to monitors. We will continue to monitor progress towards recovering the remaining services. Data ingestion and monitor notifications remain delayed across non-metric data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 11:04PM

Logs have live data available on US1 for about 33% of customers. We will continue to monitor progress towards recovering the remaining services. Data ingestion and monitor notifications remain delayed across non-metric data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 10:49PM

We're seeing partial recovery across several products including SLOs, Profiling, Watchdog, and Logs. These products may have gaps in data and partial limitations based on data available to monitors. We will continue to monitor progress towards recovering the remaining services. Data ingestion and monitor notifications remain delayed across non-metric data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 10:06PM

Database Monitoring is operational in US1. There may be gaps in historical data. We continue to make progress towards recovering the remaining services. Data ingestion and monitor notifications remain delayed across all data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 09:21PM

Processes Monitors are operational in US1. There may be gaps in historical metric data. We continue to make progress towards recovering the remaining services. Data ingestion and monitor notifications remain delayed across non-metric data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 08:39PM

At 06:00 UTC on March 8th, 2023, the Datadog platform started experiencing widespread issues across multiple products and regions. The web application was unavailable or intermittently loading, and data ingestion and monitor evaluation were delayed.

We will share a more detailed analysis post-recovery, but at a very high level:

* A system update on a number of hosts controlling our compute clusters caused a subset of these hosts to lose network connectivity.
* As a result, a number of the corresponding clusters entered unhealthy states and caused failures in a number of the internal services, datastores and applications hosted on these clusters.

Our current status is:

* We identified and mitigated the initial issue, and rebuilt our clusters.
* We have also recovered a number of our applications and services, including our web portals.
* We are now working on recovering and catching up the rest of our data systems for metrics, traces and logs across the regions that are still affected (see region-specific status pages). The recovery work is currently constrained by the number and large scale of the systems involved.

What to expect next:

* We are focusing on bringing back live data for all customers and all products before catching up on any historical data we may have stored during the outage.
* We expect live data recovery in a matter of hours (not minutes, and not days).
* We will continue to issue regular updates as the situation unfolds.

We understand how critical Datadog is to your business. We sincerely apologize for the inconvenience, and we are working hard to resolve this issue.

IDENTIFIED almost 2 years ago - at 03/08/2023 08:12PM

We are continuing to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 07:26PM

We are continuing to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 06:42PM

We are continuing to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 06:14PM

We are continuing to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 05:29PM

We continue to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 04:46PM

We continue to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 04:04PM

We continue to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 03:34PM

We are still progressing towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 02:49PM

We are still progressing towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 02:07PM

Some products are recovering and we are still progressing towards a complete recovery. Data ingestion and monitor notifications remain delayed across all data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 01:30PM

Some products are recovering and we are still progressing towards a complete recovery. Data ingestion and monitor notifications remain delayed across all data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 12:45PM

We are still working on the identified issue and are making continued progress towards recovery. Data ingestion and monitor notifications remain delayed across all data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 12:08PM

We have identified the issue, and are making continued progress towards recovery. Data ingestion and monitor notifications remain delayed across all data types.

IDENTIFIED almost 2 years ago - at 03/08/2023 11:20AM

We are seeing reduced error rates for the web application. We are continuing to work on mitigating and investigating the issue causing delayed data ingestion across all data types. Monitor notifications are delayed, and you may observe delayed data throughout the app.

INVESTIGATING almost 2 years ago - at 03/08/2023 10:31AM

We are seeing reduced error rates for the web application. We are continuing to work on mitigating and investigating the issue causing delayed data ingestion across all data types. Monitor notifications are delayed, and you may observe delayed data throughout the app.

INVESTIGATING almost 2 years ago - at 03/08/2023 09:39AM

We are continuing to work on mitigating and investigating the issue causing delayed data ingestion across all data types. Monitor notifications are delayed, and you may observe delayed data throughout the app. Additionally, the web application continues to have elevated error rates.

INVESTIGATING almost 2 years ago - at 03/08/2023 08:52AM

We are continuing to investigate this issue.

INVESTIGATING almost 2 years ago - at 03/08/2023 08:40AM

We are still investigating issues causing delayed data ingestion across all data types. Monitor notifications may be delayed, and you may observe delayed data throughout the web app.

INVESTIGATING almost 2 years ago - at 03/08/2023 08:05AM

We are still investigating issues causing delayed data ingestion across all data types. Monitor notifications may be delayed, and you may observe delayed data throughout the web app.

INVESTIGATING almost 2 years ago - at 03/08/2023 07:23AM

We are investigating issues causing delayed data ingestion across all data types. As a result monitor notifications may be delayed, and you may observe delayed data throughout the web app.

INVESTIGATING almost 2 years ago - at 03/08/2023 06:31AM

We are investigating loading issues on our web application. As a result, some users might be getting errors or increased latency when loading the web application.
