Outage in freistilbox

Database failure

Status: Resolved · Severity: Major
December 04, 2023 · Lasted about 21 hours


Outage Details

One of our database clusters suffered a server failure. We're switching operation to the standby node and will be back with an update.
Components affected
freistilbox Database clusters
Latest Updates (sorted newest to oldest)
RESOLVED at 12/05/2023 01:38PM

This incident has been resolved.

MONITORING at 12/05/2023 03:10AM

We are happy to report that we have successfully restored redundancy and thus standard operation for the database cluster db16.

We will keep monitoring the situation to catch any regression as quickly as possible.

We detected no data loss except for a very short period during which traffic was not cleanly routed to a single database node, and even that appears to have mostly affected ephemeral data such as cache contents.

After getting some much-needed rest, we will start a series of follow-up tasks. The most important is a thorough incident review in which we analyze the root causes and the mitigation timeline, and determine necessary improvements to our hosting infrastructure and operations processes. We will publish this review by the end of next week.

We sincerely apologize for the downtime this incident caused for the affected customers, and we will take every possible measure to prevent an outage like this in the future.

IDENTIFIED at 12/05/2023 01:31AM

We were able to restore a consistent data set on the previously broken cluster node, which was then working reliably again.

Unfortunately, when we executed the final recovery step (switching network traffic back to this node), the routing change at the data centre network level did not go through cleanly and created a "split-brain" situation that destroyed the newly restored data synchronization between the two nodes within seconds. The active node is still fully operational and website operation is not impacted, but the standby node has been rendered unusable.

This forces us to immediately tackle a task we would rather have scheduled for later: setting up data replication to a completely new cluster node. The silver lining is that we can use our standard operating procedure for this process, but it will require a few more hours of work.
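The "split-brain" failure mentioned above occurs when both nodes of a cluster believe they are the active one and accept writes independently, so their data sets diverge. A common safeguard (a generic illustration, not freistilbox's actual setup) is a quorum rule: a node only serves writes while it can see a strict majority of the cluster.

```python
# Hypothetical sketch of a quorum gate: a node may only accept writes
# while it, plus the peers it can still reach, form a strict majority.
# This prevents two isolated nodes from both acting as the active one.
def may_serve_writes(reachable_peers: int, cluster_size: int) -> bool:
    visible = reachable_peers + 1  # count ourselves plus visible peers
    # Strict majority required. Note that in a 2-node cluster neither
    # side has a majority on its own after a partition, which is why
    # 2-node setups often add a third "witness" node as a tiebreaker.
    return visible > cluster_size // 2

# Two-node cluster, network partition: each node sees 0 peers.
print(may_serve_writes(0, 2))  # False: neither side may write
# Three-node cluster, one node lost: the two survivors keep quorum.
print(may_serve_writes(1, 3))  # True
```

With such a gate, the unclean routing change described above would have left at most one node writing instead of two.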

IDENTIFIED at 12/05/2023 12:14AM

We have successfully transferred the majority of the active data set and are about to launch the final transfer phase, which requires a database lock to ensure data consistency. We expect a database downtime of 10 to 15 minutes.

IDENTIFIED at 12/04/2023 11:22PM

Since all our attempts at cloning the active database server with our regular backup software have been unsuccessful, we are now taking a new approach. This alternative process does not depend on database stability because it operates at the filesystem level. Its downside is that the final phase requires the database server to be offline, during which website operation will not be possible. To keep this final offline phase as short as technically possible, we are shifting as much of the data transfer as we can into the initial phase, during which the database server can operate normally.
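The two-phase, filesystem-level copy described above can be sketched as follows. This is a hypothetical illustration, not freistilbox's actual tooling; in practice a tool such as rsync would do the delta detection, run once while the database is live and once more during the brief lock.

```python
# Hypothetical two-phase copy: bulk-copy the data directory while the
# database keeps running (phase 1), then, with the source frozen,
# re-copy only the files that changed in the meantime (phase 2).
import filecmp
import shutil
from pathlib import Path

def bulk_copy(src: Path, dst: Path) -> None:
    """Phase 1: copy everything while the source may still change."""
    shutil.copytree(src, dst, dirs_exist_ok=True)

def delta_copy(src: Path, dst: Path) -> list[str]:
    """Phase 2 (source frozen): re-copy only files that differ."""
    changed = []
    for f in src.rglob("*"):
        if f.is_file():
            target = dst / f.relative_to(src)
            # Copy files that are new or whose contents changed since phase 1.
            if not target.exists() or not filecmp.cmp(f, target, shallow=False):
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(f, target)
                changed.append(str(f.relative_to(src)))
    return changed
```

Because phase 1 moves the bulk of the data without any downtime, the offline window only has to cover phase 2's small delta, which is what keeps the announced lock period down to minutes.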

IDENTIFIED at 12/04/2023 09:38PM

We were able to get the database server back online and are relieved to see it serving data to websites again. We are resuming our attempts to restore the full active data set on the broken node. In parallel, we are preparing last night's database backup so we can restore it as a last resort once all other avenues are exhausted.

IDENTIFIED at 12/04/2023 08:51PM

After multiple failed attempts at completing the necessary backup, the server has become unresponsive. All hands are on deck, and we are working with data centre staff to get it back online.

IDENTIFIED at 12/04/2023 06:47PM

An operator error left the active node of our database cluster db16 in a broken state. As per our standard operating procedures, we performed a failover to the cluster's standby node, which successfully took over serving data to its associated websites.

Unfortunately, an instability in this newly active node keeps causing failures of the full backup we need in order to restore the broken node and, with it, redundancy in the cluster. We are investigating the cause of this instability as well as possible ways to complete a successful backup of the whole data set.

This restoration work might require us to restart the database server, which will cause a short service downtime. We apologize for the service interruptions this will inevitably cause; we are doing our best to keep them to an absolute minimum.

INVESTIGATING at 12/04/2023 04:38PM

One of our database clusters suffered a server failure. We're switching operation to the standby node and will be back with an update.

