Outage in Braze

Issue impacting US clusters

Resolved Major
April 29, 2024 - Started 8 months ago - Lasted about 18 hours


Outage Details

Engineers are investigating an issue impacting multiple services on all US clusters.
Latest Updates (most recent first)
RESOLVED 8 months ago - at 04/30/2024 03:56AM

The overwhelming majority of customers across US 01 and US 03 have had their backlogs processed and are back to real-time data processing & message sending. All services are functioning as expected. We are considering this incident resolved.

We apologize for this incident and will provide a detailed Root Cause Analysis (RCA) report soon.

IDENTIFIED 8 months ago - at 04/30/2024 03:29AM

US01 Data Processing, Outbound Messages, and SDK Data Collection are fully operational.

US03 Data Processing and SDK Data Collection are fully operational.
We are still actively processing a backlog of Outbound Messages for a small subset of customers in US03.

IDENTIFIED 8 months ago - at 04/30/2024 01:50AM

US01 Data Processing and SDK Data Collection are fully operational.
We are still actively processing a backlog of Outbound Messages for a small subset of customers in US01.

US03 SDK Data Collection is fully operational.
We are still actively processing a backlog of Outbound Messages for a small subset of customers in US03.
We are still actively processing a backlog of Data Processing jobs in US03.

IDENTIFIED 8 months ago - at 04/29/2024 11:57PM

US08 has been marked as operational. The messaging and data processing backlogs on that cluster have been fully processed, and all other services are operational. We can consider that cluster in a "monitoring" status.

IDENTIFIED 8 months ago - at 04/29/2024 11:25PM

Providing a number of meaningful updates for US01 and US03:

Dashboards and REST API processing are fully operational in both US01 and US03.
SDK Data Collection is fully operational in US03, and we are scaling up in US01.

Data Processing and Message Sending are still experiencing sporadic latency as we work through the backlogs, but all health measures are improving rapidly.

IDENTIFIED 8 months ago - at 04/29/2024 10:32PM

US06 has been marked as operational. The messaging and data processing backlogs on that cluster have been fully processed, and all other services are operational. We can consider that cluster in a "monitoring" status.

IDENTIFIED 8 months ago - at 04/29/2024 10:29PM

We are continuing to work on a fix for this issue.

IDENTIFIED 8 months ago - at 04/29/2024 10:13PM

US04 and US05 have been marked as operational. The messaging and data processing backlogs on those clusters have been fully processed, and all other services are operational. We can consider those clusters in a "monitoring" status.

IDENTIFIED 8 months ago - at 04/29/2024 09:16PM

We are actively processing backlogs of both messaging and data across all clusters. Our Database, SRE, and Networking teams are continuing to increase overall throughput as the recovery continues and individual clusters catch back up to real-time.

Currents is operational across all clusters, and has been processing all events as they are cleared from the backlogs.

At this point we have completed both backlogs in US02 and US07. We have also completed the full message sending backlog in US04, and are more than 75% through backlogs in US05 and US06. US01 and US03 are continuing to ramp their pace of recovery. The next update will provide continued status updates on backlog processing and recovery.

IDENTIFIED 8 months ago - at 04/29/2024 08:00PM

At this point, Dashboard access is available for all clusters.

We are processing through the backlog of messages to send and data to process across all clusters.

We'll continue to provide hourly updates.

IDENTIFIED 8 months ago - at 04/29/2024 06:20PM

US02 and US07 have been marked as operational. The messaging and data processing backlogs on those clusters have been fully processed.

On our larger clusters, this will take longer, and we don't yet have a cluster-by-cluster ETA, but we are tracking toward resolution.

IDENTIFIED 8 months ago - at 04/29/2024 05:59PM

We continue to see service restoration across several clusters:

Data Processing and Messaging have resumed in US05 and US07.

IDENTIFIED 8 months ago - at 04/29/2024 05:37PM

We continue to see service restoration across several clusters:

Dashboard services have resumed on US04, US05, US06, and US07.
Data Processing and Messaging have resumed in US04.

IDENTIFIED 8 months ago - at 04/29/2024 05:16PM

We are seeing Dashboard access, Data Processing, and Messaging resuming in US02. There is a backlog of work to process, and once it is fully caught up, we will update the status to operational.

We are working through the rest of the US clusters and will provide updates in real-time as we have them.

IDENTIFIED 8 months ago - at 04/29/2024 05:01PM

We continue working to resolve a network issue in our US data centers.

We continue to work through checkout, and our remediation steps are showing success across various services.

Our next update will be in 30 minutes or once we have more detailed information about the resolution.

IDENTIFIED 8 months ago - at 04/29/2024 04:28PM

We continue working to resolve a network issue in our US data centers.

Senior leaders in our Engineering organization have implemented code designed to ensure that Quiet Hours are respected where required, to the extent this feature was properly configured by customers in Campaigns and Canvases, before this incident.

We have completed the restoration of services to a pilot customer successfully, and are now working through restoration across all US Clusters.

Our next update will be in 30 minutes or less.

IDENTIFIED 8 months ago - at 04/29/2024 03:55PM

We continue working to resolve a network issue in our US data centers.

We have no material update since our last post. We continue to work through restoring connectivity to those databases.

Our next update will be in 30 minutes or less.

IDENTIFIED 8 months ago - at 04/29/2024 03:27PM

We are continuing to work to resolve a network issue in our US data centers. As mentioned, the rolling restart of our database containers with Rackspace, our database hosting provider, was completed. We are now working through restoring connectivity to those databases. Senior leaders in our engineering organization are working to ensure that Quiet Hours will be respected in the countries where they are required and as configured in campaigns.

We will provide a full RCA and postmortem once this is resolved.

Our next update will be in 30 minutes or less.

IDENTIFIED 8 months ago - at 04/29/2024 02:55PM

We are continuing to work to resolve a network issue in our US data centers. The rolling restart of our database containers with Rackspace, our database hosting provider, is complete. Services are gradually returning online, and we are currently processing the backlog of data and messages accumulated during the incident.

We will provide a full RCA and postmortem once this is resolved.

Our next update will be in 30 minutes or less.

IDENTIFIED 8 months ago - at 04/29/2024 02:25PM

We are continuing to resolve a network issue in our US data centers. The rolling restart of database containers with Rackspace, our database hosting provider, is progressing and we are approximately 75% complete. Once these restarts are complete, we will begin returning services and processing data and messaging backlogs. Our next update will be in 30 minutes or less.

IDENTIFIED 8 months ago - at 04/29/2024 01:53PM

We have identified the root cause and are working to resolve a network issue in our US data centers. We are actively performing a rolling restart of database containers with Rackspace, our database hosting provider. We do not expect data loss, and further expect that all messages will be sent once the services are up and running. Our next update will be in 30 minutes or less.

IDENTIFIED 8 months ago - at 04/29/2024 12:59PM

We are continuing to work on a fix for this issue.

IDENTIFIED 8 months ago - at 04/29/2024 12:05PM

Work is ongoing by Engineers and our database provider to restore service.

IDENTIFIED 8 months ago - at 04/29/2024 10:53AM

Engineers are continuing to work alongside our Database provider to restore service.

IDENTIFIED 8 months ago - at 04/29/2024 10:18AM

Engineers are actively working with our Database provider to restore service.

IDENTIFIED 8 months ago - at 04/29/2024 09:48AM

We have identified a third-party networking issue.

INVESTIGATING 8 months ago - at 04/29/2024 09:41AM

Engineers are investigating an issue impacting multiple services on all US clusters.
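
The "about 18 hours" duration in the header can be checked against the first (INVESTIGATING) and last (RESOLVED) update timestamps above. This is a minimal sketch: it assumes both timestamps are in the same time zone, which the page does not state, and the function name `incident_duration_hours` is made up for illustration.

```python
from datetime import datetime

# Timestamp format used in the update log, e.g. "04/29/2024 09:41AM".
FMT = "%m/%d/%Y %I:%M%p"


def incident_duration_hours(first: str, last: str) -> float:
    """Return the elapsed hours between two update timestamps.

    Assumes both strings use the same (unstated) time zone.
    """
    start = datetime.strptime(first, FMT)
    end = datetime.strptime(last, FMT)
    return (end - start).total_seconds() / 3600


# First INVESTIGATING update and final RESOLVED update from the log.
hours = incident_duration_hours("04/29/2024 09:41AM", "04/30/2024 03:56AM")
print(f"{hours:.2f} hours")  # prints "18.25 hours"
```

Under that assumption, the log spans 18 hours and 15 minutes, which matches the stated duration.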
