Outage in Cognite

Elevated error rates and reduced capacity in Timeseries and sequences API in EUR-W3

Resolved Major
March 15, 2023 - Started over 2 years ago - Lasted 5 days

Outage Details

Cognite Engineering is working on an incident in which the backend datastores for timeseries and sequences have performance problems. As a result, incoming load has to be throttled, and the API returns 5xx responses from time to time due to system overload. The engineering team is working on improving the storage system's performance. A new update will be posted when the end-user experience is expected to change.
Components affected
Cognite Data Fusion API
Latest Updates (most recent first)
RESOLVED over 2 years ago - at 03/20/2023 09:44AM

This incident is resolved and the time series and sequences services are now operating at normal performance levels and with the resiliency that the services were designed for.

IDENTIFIED over 2 years ago - at 03/18/2023 07:37AM

The engineering team is still not able to lift the rate limits that have been configured to protect the backend storage during the ongoing recovery and optimization work. Users will see 429 response codes from the API when rate limits kick in. Users will see higher rates of 429s between 4:00 am and 10:00 am UTC due to ongoing work to improve redundancy.

IDENTIFIED over 2 years ago - at 03/17/2023 12:51PM

The engineering team is still working on stabilizing the performance of the timeseries and sequences service. The rate limiting in place has been adjusted to reduce the load on the system. End users will get a 429 response code to their requests if their request rate exceeds the rate limits. We are considering further relaxing rate limits, and a new update will be made here if and when this happens.
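
For clients affected by these 429 (and occasional 5xx) responses, a common mitigation is to retry with exponential backoff and jitter. The following is a minimal sketch, not Cognite's recommended client behavior. The names `call_with_backoff`, `backoff_delay`, and `send` are hypothetical, the retryable status set and delay values are assumptions, and it is also an assumption that the API may return a `Retry-After` header (only its integer-seconds form is handled here).

```python
import random
import time
from typing import Callable, Mapping, Tuple

# (status code, response headers, body) as returned by the caller's HTTP layer.
Response = Tuple[int, Mapping[str, str], bytes]

# Assumption: treat rate limiting (429) and common overload errors as retryable.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def backoff_delay(
    attempt: int,
    headers: Mapping[str, str],
    base: float = 1.0,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Return the number of seconds to wait after failed attempt `attempt` (0-based)."""
    retry_after = headers.get("Retry-After", "")
    if retry_after.isdigit():
        # Honor a server-provided delay given in whole seconds, bounded by the cap.
        return min(float(retry_after), cap)
    # "Full jitter": a random delay in [0, min(cap, base * 2**attempt)).
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 6,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> Response:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    response = send()
    for attempt in range(max_attempts - 1):
        if response[0] not in RETRYABLE_STATUSES:
            break
        sleep(backoff_delay(attempt, response[1], rng=rng))
        response = send()
    # Either a success, a non-retryable error, or the last retryable response.
    return response
```

Here `send` is any zero-argument callable that performs one request and returns the status, headers, and body, so the sketch is independent of the HTTP library in use. Injecting `sleep` and `rng` keeps the retry logic testable without real waiting.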

IDENTIFIED over 2 years ago - at 03/17/2023 08:14AM

The engineering team is still working on stabilizing the performance of the timeseries and sequences service. There is rate limiting in place to reduce the load on the system. End users will get a 429 response code to their requests if their request rate exceeds the rate limits. We are adding more resources to the backend systems, but we are not able to lift the rate limits before the processing of the backlog is complete. It will still be a few hours.

IDENTIFIED over 2 years ago - at 03/16/2023 08:51AM

The storage backend has now recovered completely and is running with the desired number of replicas. The risk of data loss is no longer a concern in this incident. There is a processing backlog that now needs to be addressed. Cognite engineering is working on an issue with query performance degradation related to high load.

IDENTIFIED over 2 years ago - at 03/16/2023 06:03AM

The engineering team has fixed the problems related to replication in the backend database. We are currently running with a normal level of resiliency. However, we have not yet lifted the rate limiting, as we want to observe the system for a while longer before opening it up to full load.

IDENTIFIED over 2 years ago - at 03/15/2023 05:16PM

The engineering team is still working on resolving this incident. We have had two low-level storage failures in the storage backend. There is redundancy in the system, but not all replicas are fully operational. Cognite is now bringing up a restore cluster to reduce the risk of data loss. Incoming traffic is still being rate limited to protect the service and the storage backend from excessive load while the incident is being contained and eradicated.

IDENTIFIED over 2 years ago - at 03/15/2023 02:10PM

The engineering team is continuing to investigate how to improve the performance of the backend for time series and sequences. To prevent data loss, the team has configured rate limiting for the services. Users will see HTTP 429 responses if these new rate limits are exceeded.

IDENTIFIED over 2 years ago - at 03/15/2023 12:56PM

Cognite Engineering is working on an incident in which the backend datastores for timeseries and sequences have performance problems. As a result, incoming load has to be throttled, and the API returns 5xx responses from time to time due to system overload. The engineering team is working on improving the storage system's performance. A new update will be posted when the end-user experience is expected to change.
