Outage in Grafana

Metrics Disruption - cortex-prod-10 cluster

Status: Resolved (Minor)
March 10, 2023 · Lasted about 2 hours

Outage Details

Components affected
Grafana US-CENTRAL: Querying
Latest Updates (most recent first)
RESOLVED - 03/10/2023 08:59 PM

Engineering has identified and fully remediated the issue. We have observed complete recovery in the cortex-prod-10 cluster and backfilling of metrics is complete. At this time, we are considering this incident resolved.

No further updates.

MONITORING - 03/10/2023 08:16 PM

We continue to observe a healthy cluster state for the cortex-prod-10 cluster and metrics backfill is proceeding as expected. Users should no longer experience failed metric queries or failed remote write actions.

Engineering is continuing to take action and investigate. We will keep monitoring the state of the cluster closely and will continue to provide updates.

MONITORING - 03/10/2023 07:36 PM

As of 19:29 UTC, Engineering has applied mitigations to the cortex-prod-10 cluster, and we are seeing improvements in metric queries and remote write actions. As cluster health improves, metrics from the affected time period will begin backfilling.

Engineering is continuing to take action and investigate. We will monitor the state of the cluster closely and continue to provide updates.

INVESTIGATING - 03/10/2023 06:55 PM

As of 18:45 UTC, we observed a metrics disruption affecting only the cortex-prod-10 cluster. Some customers in this cluster may have experienced failed metric queries or failed remote write actions, most likely surfacing as 500 errors.

Engineering is aware and actively investigating. We will provide updates as more information becomes available.
