Outage in Grafana

Metrics Disruption - cortex-prod-10 cluster

Resolved Minor
March 10, 2023 - Started about 2 years ago - Lasted about 2 hours

Outage Details

As of 18:45 UTC, we observed metric disruption limited to the cortex-prod-10 cluster. Some customers in this cluster may have experienced failed metric queries or failed remote write actions, likely manifesting as 500 errors. Engineering is aware and actively investigating. We will provide updates as more information becomes available.
Components affected
Grafana US-CENTRAL: Querying
Latest Updates (most recent first)
RESOLVED about 2 years ago - at 03/10/2023 08:59PM

Engineering has identified and fully remediated the issue. We have observed complete recovery in the cortex-prod-10 cluster and backfilling of metrics is complete. At this time, we are considering this incident resolved.

No further updates.

MONITORING about 2 years ago - at 03/10/2023 08:16PM

We continue to observe a healthy cluster state for the cortex-prod-10 cluster and metrics backfill is proceeding as expected. Users should no longer experience failed metric queries or failed remote write actions.

Engineering is continuing action and investigation. We will continue to monitor the state of the cluster closely and continue to provide updates.

MONITORING about 2 years ago - at 03/10/2023 07:36PM

As of 19:29 UTC, Engineering has applied mitigation efforts to the cortex-prod-10 cluster and we are seeing improvements with metric queries and remote write actions. As the cluster health improves, metrics from the affected time period will begin backfilling.

Engineering is continuing action and investigation. We will monitor the state of the cluster closely and continue to provide updates.

INVESTIGATING about 2 years ago - at 03/10/2023 06:55PM

As of 18:45 UTC, we observed metric disruption limited to the cortex-prod-10 cluster. Some customers in this cluster may have experienced failed metric queries or failed remote write actions, likely manifesting as 500 errors.

Engineering is aware and actively investigating. We will provide updates as more information becomes available.
