Outage in Braze

Network issues in US clusters

Resolved Major
September 30, 2024 - Started 3 months ago - Lasted about 7 hours

Outage Details

We have become aware of a networking issue at our database layer that is impacting services across our US clusters. Our engineers and DBAs are currently engaged and investigating.
Latest Updates (most recent first)
RESOLVED 3 months ago - at 09/30/2024 09:24PM

We have been monitoring since our last update and are declaring this incident as resolved.

IDENTIFIED 3 months ago - at 09/30/2024 08:50PM

Both Data Processing and Outbound Messaging backlogs in US01 have been fully processed.
Both Data Processing and Outbound Messaging backlogs in US06 have been fully processed.

For services still identified as Degraded Performance, we are still working through the job backlogs for a small subset of customers.

At this point, less than 1% of customers have messaging latency or data processing queues greater than 10 minutes.

We will update in 1 hour or sooner when we have a material update.

IDENTIFIED 3 months ago - at 09/30/2024 08:06PM

All job processing backlogs in US04 have been fully processed.
The data processing backlogs in US06 and US01 are fully processed. We are still working through the messaging backlogs for a small subset of customers.

We will continue to update as clusters fully clear their backlogs.

At this point, less than 5% of customers have messaging latency or data processing queues greater than 10 minutes.

We will update in 1 hour, or when we have a material update, whichever is sooner.

IDENTIFIED 3 months ago - at 09/30/2024 07:26PM

At this point, approximately 90% of customers have messaging latency under 10 minutes. As of this update, backlog processing on other US clusters continues, and we are adding more resources to maximize throughput.

US04 remains on track to shortly have the backlog fully processed.

In the last 45 minutes, we have processed through more than a third of the remaining backlog across customers still impacted.

IDENTIFIED 3 months ago - at 09/30/2024 06:48PM

We have completed processing through the backlog on US02 and US08.

US04 is on track to be resolved in the next 30 minutes. As of this update, backlog processing on other US clusters continues and we are continuing to add more resources to maximize throughput. At this point, approximately 85% of customers have messaging latency under 10 minutes.

REST API errors have recovered in all clusters.

SDK API errors have recovered in all clusters.

Dashboard in US01, US03, and US05 is now performant for all customers.

IDENTIFIED 3 months ago - at 09/30/2024 06:07PM

We have completed processing through the backlog on US07, confirming our resolution steps are effective.

As a minor correction to our last post, Dashboard logins had a less than 1% error rate, but our US01, US03, and US05 clusters continue to have some sporadic timeouts for some customers.

We are still scaling up processing to ensure we rapidly process through the backlogs in other clusters. Our next update should provide insight into latency and backlog processing completion targets.

IDENTIFIED 3 months ago - at 09/30/2024 05:36PM

We are continuing to work on a fix for this issue.

IDENTIFIED 3 months ago - at 09/30/2024 05:34PM

Dashboard login is now fully operational in all clusters.

The error rates for the REST and SDK APIs are now in normal ranges. Customers should retry their REST API calls at this point.
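For customers acting on the advice to retry, the sketch below shows one common way to retry failed calls without flooding a service that is still recovering: capped exponential backoff with jitter. It is a generic illustration, not Braze's documented guidance. The names `retry_with_backoff` and `TransientError` are hypothetical, and the default attempt count and delays are assumptions to tune for your own integration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. HTTP 5xx or timeouts)."""


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,      # assumed default, not a vendor recommendation
    base_delay: float = 1.0,    # seconds; assumed default
    max_delay: float = 60.0,    # seconds; assumed default
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on TransientError with capped exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay ceiling doubles each attempt: base, 2*base, 4*base, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads retries out so clients do not all retry at once.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

To use it, wrap the request in a zero-argument function that raises `TransientError` for responses you consider retryable and pass that function as `call`. The `sleep` parameter is injectable so the behavior can be exercised in tests without real waiting.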

We are actively processing the backlog and making progress.

Message sending and data processing remain latent.

From now on, we will update the Statuspage every 30 minutes. Our next update will be at 2 p.m. ET.

IDENTIFIED 3 months ago - at 09/30/2024 04:56PM

While network connectivity to the database layer has been restored, we continue to experience errors in services while cycling through restarts at the database layer. We expect no SDK data loss and all messages to enqueue and send.

We continue to process the backlog; however, recovery is taking longer than originally anticipated.

From now on, we will update the Statuspage every 30 minutes. Our next update will be at 1:30 p.m. ET.

IDENTIFIED 3 months ago - at 09/30/2024 03:18PM

We have recovered most network connections to our databases and are in the process of scaling up backlog processing, prioritizing message sending. We have begun to process the message-sending backlog on impacted clusters.

Our REST and SDK API error rates have dropped below 10% and are close to fully recovered.

IDENTIFIED 3 months ago - at 09/30/2024 02:13PM

A fix to the network issue has been implemented, and we are now restoring connections to the database layer.

IDENTIFIED 3 months ago - at 09/30/2024 02:04PM

We have become aware of a networking issue at our database layer that is impacting services across our US clusters. Our engineers and DBAs are currently engaged and investigating.
