PostHog experienced a 15.1-hour event ingestion delay caused by a shard that degraded during routine maintenance and by elevated part counts on ClickHouse, which led to insert rejections and Kafka consumer lag. Events took longer than usual to appear in PostHog apps and queries, but no data was lost. Once the root cause was identified and resolved, ingestion resumed and the backlog was processed in roughly 2 hours.
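Consumer lag of the kind described above is the gap between each partition's latest offset and the offset the consumer group has committed. A minimal sketch of how that gap can be measured with the kafka-python library follows; the broker address, topic, and group names are illustrative assumptions, not PostHog's actual configuration.

```python
# Minimal consumer-lag check with kafka-python.
# All names below (broker, group, topic) are assumed for illustration.
from kafka import KafkaConsumer, TopicPartition

GROUP = "events-ingestion"          # hypothetical consumer group
TOPIC = "events_plugin_ingestion"   # hypothetical topic

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id=GROUP,
    enable_auto_commit=False,
)

partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
end_offsets = consumer.end_offsets(partitions)  # latest offset per partition

total_lag = 0
for tp in partitions:
    committed = consumer.committed(tp) or 0     # last offset the group committed
    lag = end_offsets[tp] - committed
    total_lag += lag
    print(f"partition {tp.partition}: lag={lag}")

print(f"total lag across {len(partitions)} partitions: {total_lag}")
consumer.close()
```

A steadily growing total is the signal reported during this incident: producers kept writing events while inserts into ClickHouse were being rejected, so committed offsets fell further behind the log head.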
We are still processing the ingestion queue; we should be fully caught up in about 2 hours.
We have identified the root cause of the ingestion lag and cluster overload, and have resolved the issue.
We have now resumed ingestion and are working through the event ingestion backlog.
During routine maintenance, a shard entered a degraded performance state, causing us to fall behind on ingesting data. We are working to remedy the issue and will report back as soon as a fix is in place.
EU event ingestion experienced delays due to elevated part counts on ClickHouse. The high part count caused some insert rejections, leading to Kafka consumer lag on event processing. Replication queues have been restarted and merge backlogs are draining. Part counts are returning to normal.
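The insert rejections mentioned here match standard MergeTree back-pressure in ClickHouse: when a partition accumulates more active parts than the parts_to_throw_insert setting allows, inserts fail with a "Too many parts" error until background merges drain the backlog. A small sketch for spotting partitions with elevated part counts, assuming the clickhouse-driver Python client and an illustrative host:

```python
# Sketch: find partitions with the most active parts in ClickHouse.
# Host and connection details are assumptions for illustration.
from clickhouse_driver import Client

client = Client(host="localhost")

rows = client.execute(
    """
    SELECT database, table, partition, count() AS active_parts
    FROM system.parts
    WHERE active
    GROUP BY database, table, partition
    ORDER BY active_parts DESC
    LIMIT 10
    """
)
for database, table, partition, active_parts in rows:
    print(f"{database}.{table} partition {partition}: {active_parts} active parts")
```

Watching this count fall back toward normal is what the update above describes as merge backlogs draining.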
We’ve identified processing delays in the event ingestion pipeline. Events may take longer than usual to appear in the product. Data is not lost but may not show in PostHog apps and queries until the processing delay is resolved.