Outage in TrekkSoft

Downtime post-mortem report, Feb 26th 2020

Resolved Minor
February 27, 2020 - Started over 4 years ago

Outage Details

Latest Updates ( sorted recent to last )
RESOLVED over 4 years ago - at 02/27/2020 03:49PM

Summary:
On the morning of February 26th, we migrated the TrekkSoft servers from our Cloudscale hosting provider in Zurich to Amazon Web Services (AWS) in Ireland.
After the migration was complete, usage of the system increased as merchants began to take bookings.
We then spotted a major drop in performance. The root cause was a single database host that was struggling under the number of requests per second. This database slowdown significantly degraded the performance of our applications (merchant landing pages (CMS), Backoffice, public and private APIs, and mobile apps), in some cases rendering them inoperable.


What Happened
6:45am - 8:29am CET - We completed the AWS migration.
We tested all the main cases and monitored all hosts; the preliminary results were satisfactory.
9:00am CET - Our applications began handling a growing number of requests as the system came back online and usage scaled up.
One of the main database hosts (MySQL) began struggling with the number of requests. This degraded the performance of our applications and prevented normal functionality.
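
The pattern described above, where light test traffic looks healthy and the host falls behind once real load arrives, can be illustrated with a small queue model. This is only a sketch: the capacity and traffic figures are hypothetical and were not taken from the incident, and `simulate_backlog` is an illustrative helper, not part of any TrekkSoft or AWS tooling.

```python
def simulate_backlog(arrivals_per_sec, capacity_per_sec):
    """Return the number of queued requests after each one-second step.

    Requests the host cannot serve within a step carry over to the next one.
    """
    backlog = 0
    history = []
    for arrivals in arrivals_per_sec:
        backlog = max(0, backlog + arrivals - capacity_per_sec)
        history.append(backlog)
    return history


# Hypothetical numbers: assume the database host can serve 100 requests/s.
light_test_traffic = [20, 30, 40, 30]
morning_ramp = [60, 90, 120, 150, 150]

print(simulate_backlog(light_test_traffic, 100))  # [0, 0, 0, 0]
print(simulate_backlog(morning_ramp, 100))        # [0, 0, 20, 70, 120]
```

Under these assumptions the backlog stays at zero during light testing and grows without bound once arrivals exceed capacity, which is why post-migration smoke tests alone may not reveal an undersized host.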
Contributing Factors
Uncertainty regarding the performance of the new AWS infrastructure vs Cloudscale.
We compared all hosts in Cloudscale vs AWS to ensure they met the same hardware requirements.
However, the two infrastructures are different, so matching hardware specifications did not guarantee matching performance.
Steps Taken
Phase 1:
We increased the size of the database in AWS to improve performance (no downtime was required at this point).
We contacted AWS support for more information about the resizing time.
The database resize would have taken too long to deploy, so we decided to apply another workaround, described below.


Phase 2:
We put all the web apps in maintenance mode (downtime).
We created a new, larger database (downtime was required to avoid data loss).
We copied all data from the old database to the new one, using a migration system in AWS.
The newly created database failed.
This required a new approach, described below.

Phase 3:
We created a new empty database (again, downtime was required to avoid data loss).
We proceeded with a manual dump of the data from the old database to the new one. The process took 4 hours and was successful.
3:20pm CET (approx.) - The new infrastructure was ready to be released.
We have been monitoring and tweaking the system over the last 24 hours to improve performance.
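
A manual dump and restore like the one in Phase 3 is usually followed by a check that the target holds the same data as the source. The report does not say how this was verified, so the following is only a sketch of one common check, comparing per-table row counts. It uses Python's built-in `sqlite3` module as a stand-in so the example is self-contained; the real system used MySQL, and the `bookings` table and helper names are hypothetical.

```python
import sqlite3


def table_row_counts(conn):
    """Map each user table name to its row count."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    )]
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in tables
    }


def copy_and_verify(source, target):
    """Dump `source` as SQL, replay it into `target`, then compare row counts."""
    target.executescript("\n".join(source.iterdump()))
    return table_row_counts(source) == table_row_counts(target)


# Stand-in for the old database, with a hypothetical table and rows.
old_db = sqlite3.connect(":memory:")
old_db.execute("CREATE TABLE bookings (id INTEGER PRIMARY KEY, merchant TEXT)")
old_db.executemany(
    "INSERT INTO bookings (merchant) VALUES (?)", [("a",), ("b",), ("c",)]
)
old_db.commit()

# Stand-in for the new, empty database.
new_db = sqlite3.connect(":memory:")
print(copy_and_verify(old_db, new_db))  # True
```

Matching row counts are a cheap first check, not proof of identical content; checksums over table data would be a stricter comparison.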

Impact
Low number of bookings from 7:00am to 4:30pm CET (about 9.5 hours). Some merchants were unable to process any bookings, while others still managed to take some. The impact is financial loss to all parties.

Benefits
The objectives behind the migration that caused the issue:
Overall long-term increase in performance.
Up-to-date, industry-standard infrastructure.
More direct control over our infrastructure.
Infrastructure ready to apply autoscaling in case of a peak in requests per second.

Lessons learned
We will strategically time operations of this scale to avoid peak booking hours and give ourselves more time to react.
Triple-check hardware and settings specifications.
Build a bulletproof checklist.
Replicate the system and run stress tests.
Build our infrastructure with extra capacity and resources, or keep a larger infrastructure as a backup.
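
As an illustration of the first lesson, timing a large operation away from peak booking hours can be reduced to finding the quietest contiguous window in historical hourly booking counts. The sketch below assumes such counts are available; the figures are invented for the example and are not TrekkSoft data, and `quietest_window` is a hypothetical helper.

```python
def quietest_window(bookings_per_hour, window_hours):
    """Return (start_hour, total_bookings) for the quietest contiguous window.

    `bookings_per_hour` is indexed by hour of day; the window may wrap past
    midnight. Ties are resolved in favour of the earliest start hour.
    """
    hours = len(bookings_per_hour)
    best_start, best_total = 0, None
    for start in range(hours):
        total = sum(
            bookings_per_hour[(start + offset) % hours]
            for offset in range(window_hours)
        )
        if best_total is None or total < best_total:
            best_start, best_total = start, total
    return best_start, best_total


# Hypothetical bookings per hour of day (index 0 = midnight to 1am).
hourly = [2, 1, 1, 0, 1, 3, 8, 15, 30, 45, 50, 48,
          40, 38, 35, 30, 28, 25, 20, 15, 10, 6, 4, 3]

# A nine-hour window, roughly the length of the disruption reported above.
start, total = quietest_window(hourly, window_hours=9)
print(start, total)  # 21 21
```

With these made-up numbers, the quietest nine-hour window starts at 21:00 and covers 21 bookings, whereas a window starting at 07:00 would overlap the busiest part of the day.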

We apologize deeply for this incident.
