What happened on April 5th, 2023?
TL;DR: Scroll to the end of this post for a quick FAQ.
In the early morning of April 5th, at 2 am, we were alerted to unusual behavior on one of our database servers. After an initial diagnosis, we decided to take the server offline for a thorough inspection. As we have other database servers deployed (each with a copy of all the data), this was a routine audit that we had done multiple times before.
Only after the audit did we see that some of our other servers had started to slow down because their data wasn't fully synchronized. Hence, we brought the server back online to sync up with its peers. Again, standard procedure.
However, this is where things quickly went sideways: the server started to pull complete copies from each of the other servers, causing reindexing and buffer issues on the server and the network. As we store millions of messages, we are talking terabytes of data here.
At this point, the whole Helpmonks team was already trying to devise a solution. As it was around 6 am, we agreed to take the server offline again so that it wouldn't impact you.
Little did we know that this made things worse, as the other servers were also looking for the data. Taking the server offline caused a slowdown in traffic on all our servers.
Up to this point, we still thought that nothing we did internally had impacted you, our customers. However, around 8 am, one of our support people mentioned needing help with our web interface. By then, we had started seeing massive issues with our web and application servers, which caused time-outs for our visitors.
At this point, it was already 10 am, and we were hoping to have the issue resolved without further impact on you. Unfortunately, we hadn't considered all the emails that had piled up, which numbered in the tens of thousands.
So, while our databases were "one man down," they also had to digest an immense amount of new emails and traffic from all our other servers, which were eager to deliver the messages to you as fast as possible. While we have invested a lot to handle this kind of traffic, it simply caused further slowdowns on your end, that is, in our web interface.
While we were able to re-establish the platform on April 5th, April 6th, and April 7th, our databases still needed to be fully optimized. During this time, you could sign in to Helpmonks, but some subsequent actions timed out.
I want to apologize to you personally for this downtime. My whole team and I know how vital Helpmonks is for your business. We understand that not being able to get to your emails is frustrating.
However, I want to apologize even more for not being able to inform you sooner of the issues. We were all surprised when we saw that it impacted you directly. We were unaware of the impact because we received your messages with the same delay as you did.
We understand that it was upsetting not to be able to reach us. To remedy this situation, we have now established a public channel on Telegram. Additionally, we've created a public group for discussions (something we had planned anyhow).
We encourage you to join both of these channels. The channel is available at https://t.me/helpmonks and the group at https://t.me/helpmonks_group.
Please know that no emails or other data were lost during the outage.
One of our core database servers synchronized a faulty state, which caused all database servers to adopt that defective state. After investigating the server, we added it back, which inadvertently caused backlog and buffer issues.
Why was the platform unstable for a prolonged time?
We had to turn off some services to keep all the data synced. Among them was email parsing. We also had to reduce our web servers to a single one. Otherwise, we would have had unsynchronized data or, worse, missing data, which is something we wanted to avoid by all means.
Why did I not get a response from your team?
Simply put, we use Helpmonks in the same way as you do. We receive and reply to all emails in Helpmonks. Because email parsing was turned off, we did not receive your emails. The same applies to emails that you sent to team members directly.
What will you do to prevent this from happening again?
We were already running a large set of database servers and have now extended it. Furthermore, we have set up real-time database copies that can replace a server that becomes corrupted.
Great, but what about communication?
We established a public channel on Telegram for announcements. Please join it at https://t.me/helpmonks.
In addition, we created a public group for discussions of all kinds (e.g., asking us questions or asking other users for advice) at https://t.me/helpmonks_group.
Telegram is an independent platform with over 700 million users. You can use Telegram on every platform.