We had a great day yesterday! During our peak time (10am-2pm EDT), we averaged about 350 transactions per second and 2,800+ reads per second. Average CPU usage remained steady at 20-40%. In addition to increasing the RAM on Friday, we rebuilt indexes over the weekend, which really seems to have helped. We will continue to monitor the numbers this week, and if we find anything else that needs to be changed to keep up with the load, we will do so. Thank you for your continued patience - we really appreciate you!
We have finally identified the root cause of the outage.
In January we made a change to turn on Deliver alerts for everyone by default. Because of this change, we have now had our third outage so far this year, which is more than we've experienced in past years. After the two previous outages we put some new indexes in place and thought the issue was resolved. Unfortunately that was not the case. Today the DB server pegged at 100% CPU, and when that happens the application server starts dropping connections and then struggles to reconnect to the database. Luckily the connections still went through, albeit slowly, so theoretically the portal should still show the stops as delivered, skipped, etc. And as I said previously, all signatures, photos, and comments were successfully updated to Eclipse.
To fix this today, we increased the RAM on our DB server based on the recommendations from Mongo. On Sunday we will also be creating additional indexes to mitigate any further slowness. In addition, we will be working with Mongo next week to implement further performance recommendations.
Again, I am really sorry this happened today and we do really appreciate your patience. We will continue to keep you updated as we learn more.
Unfortunately we are still experiencing issues and connections are intermittent. We will keep you updated.
We have identified the issue to be with MongoDB (our database provider) and are working with their engineers on implementing auto scaling of our cluster to mitigate the issue. This seems to be working for the moment. We will continue to monitor throughout the day.
Thank you all for your patience.
All of your signatures, photos, comments, etc. were being transmitted to Eclipse during the outage just as normal. The outage was only on the portal side.
The DNS change worked for a bit and then we went down again. We have been on the phone with our database provider and AWS all morning to try to get to the bottom of things. We will keep you posted as we learn more.
We've pointed our DNS to our backup server so Route and Deliver Web should be coming back online. We are still investigating what caused the outage and will keep you posted.
We are continuing to investigate this issue.
It appears we are currently experiencing an outage. We will investigate and let you know as soon as we know more.