Outage in VTEX

Elevated Errors in the Platform

Resolved Major
July 12, 2023 - Started about 2 years ago - Lasted about 4 hours

Outage Details

We are currently investigating this issue.
Latest Updates (most recent first)
RESOLVED about 2 years ago - at 07/13/2023 12:36AM

On July 12, 2023, at 17:29 GMT-03:00, one of our replicated and highly available database systems experienced severe service degradation and didn't automatically recover. Our incident response team was alerted. This issue caused intermittent errors and high latency in the overall user experience, leading to a partial outage across the VTEX Platform. Sales flow, Product Indexing, and the Administrative Environment were severely impacted.

At 18:15 GMT-03:00, the initial issue was manually remediated. Unfortunately, the circuit breakers of several of our services had opened and failed to close automatically once the initial issue had been resolved. This extended the incident for some customers longer than we'd like. The circuit breakers are in place to avoid complete unavailability, and as a side effect, a subset of our platform continued to operate.
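For readers unfamiliar with the pattern, the following is a minimal, generic sketch of a circuit breaker in Python. It is not VTEX's implementation, which the report does not describe; the class name, thresholds, and the injectable `clock` parameter are all illustrative assumptions. It shows the step that reportedly failed during the incident: after a timeout the breaker should move to a half-open state, let one trial call through, and close again if that call succeeds.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # calls pass through normally
    OPEN = "open"            # calls are rejected immediately
    HALF_OPEN = "half_open"  # one trial call is allowed through


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    # Hypothetical defaults, chosen only for illustration.
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self.state = State.CLOSED

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency marked unavailable")
            # Timeout elapsed: probe the dependency with a trial call.
            # Without this transition the breaker would stay open forever.
            self.state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self._failures += 1
        # A failed trial call, or too many failures, (re)opens the breaker.
        if (self.state is State.HALF_OPEN
                or self._failures >= self.failure_threshold):
            self.state = State.OPEN
            self._opened_at = self._clock()

    def _record_success(self):
        # Any successful call closes the breaker and resets the counter.
        self._failures = 0
        self._opened_at = None
        self.state = State.CLOSED
```

In this sketch, recovery depends entirely on the `OPEN` to `HALF_OPEN` transition being reached. If that probe never runs, or its success is never recorded, callers keep receiving `CircuitOpenError` even though the dependency is healthy again, which is the general shape of the behavior described above.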

By 19:00 GMT-03:00, our incident response team was still investigating the failure's cascading effects and manually implementing remediations to address the issue. At 19:32 GMT-03:00, we confirmed that remediation efforts were underway to recover the platform. However, some intermittent errors and higher latency persisted for several customers.

At 19:56 GMT-03:00, the team's additional remediation actions were proving successful, with sessions and orders gradually increasing towards expected levels.

At 20:03 GMT-03:00, we validated that the remediation actions had been effective, resulting in a steady recovery.

Moving forward, the following actions will be taken to prevent similar incidents:
- Re-evaluating the size of the failure domain associated with the replicated and highly available database system.
- Investigating and fixing the cause of the severe service degradation in the database system with one of VTEX's cloud providers.
- Investigating and fixing the bug related to the circuit breakers not properly closing once the database issue had been resolved.

Efforts are underway to enhance the infrastructure and implement additional safeguards to bolster the platform's resiliency. Improvements will also be made to the timing, frequency, and quality of communications via status.vtex.com.

We appreciate your understanding and patience during the incident. We remain committed to providing the highest level of service and will continue working diligently to ensure uninterrupted service. We apologize for any inconvenience caused and are grateful for your ongoing support.

MONITORING about 2 years ago - at 07/12/2023 11:26PM

Our remediation efforts have been effective, leading to a steady recovery of sessions and orders towards nominal levels. As we continue to monitor the situation closely, we remain committed to maintaining the stability of our platform. Our team is actively monitoring the platform and will address any remaining issues to guarantee a seamless user experience.

IDENTIFIED about 2 years ago - at 07/12/2023 11:25PM

Our remediation efforts have been effective, leading to a steady recovery of sessions and orders towards nominal levels. As we continue to monitor the situation closely, we remain committed to maintaining the stability of our platform. Our team is actively monitoring the platform and will address any remaining issues to guarantee a seamless user experience.

IDENTIFIED about 2 years ago - at 07/12/2023 10:57PM

Our remediation actions are proving successful, and we are observing a gradual increase in sessions and orders, bringing us closer to nominal levels. However, we remain vigilant and are prepared to take additional actions to ensure the long-term stability of our platform.

IDENTIFIED about 2 years ago - at 07/12/2023 10:32PM

We continue to see intermittent errors and high latency in the IO Platform. The impact is visible in Sales flow, Product Indexing, and the Administrative Environment.

Currently, we are applying remediation to recover the IO Platform gradually.

IDENTIFIED about 2 years ago - at 07/12/2023 10:00PM

At 17:28 BRT, one internal infrastructure service became overloaded, generating intermittent errors and high latency in the IO Platform, which caused a partial outage across the entire VTEX Platform. At 18:12 BRT, we completed the first remediation, recovering the internal service, which partially recovered the platform.

Sales flow, Product Indexing, and the Administrative Environment are being severely impacted.

We are still investigating this failure's cascading effects and applying further remediations.

IDENTIFIED about 2 years ago - at 07/12/2023 09:31PM

We are continuing to work on a fix for this issue. The first fix did not completely solve the problem.

IDENTIFIED about 2 years ago - at 07/12/2023 09:18PM

The issue has been identified and a fix is being implemented.

INVESTIGATING about 2 years ago - at 07/12/2023 08:58PM

We are currently investigating this issue.
