
Incident Report: Analysis of the October 7 server downtime

Lukas Hermann

What happened?

  • On October 6, 2022, at 23:00 UTC, the server began failing to respond to isolated requests, resulting in occasional timeouts.
  • On October 7, 2022, at 00:46 UTC, the server began timing out on almost all requests.
  • On October 7, 2022, at 06:22 UTC, normal operation of the server resumed.
  • During the incident, the timer software could not send or receive commands. Existing data was not affected.

The timeline of events can be followed on the public status page: https://stats.uptimerobot.com/jJ140cjvEB

What was the problem?

Stagetimer runs inside a Docker container on a server shared with other processes. One process (most likely unrelated to Stagetimer) malfunctioned and, at 00:46 UTC, consumed all 32 GB of available memory. The server kept running, but with memory exhausted it could not handle requests in time, resulting in timeouts.
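To illustrate how this kind of memory exhaustion can be caught early, here is a minimal sketch of a memory-pressure check. It is not our actual monitoring code. It assumes a Linux host that exposes `MemTotal` and `MemAvailable` in `/proc/meminfo`, and the function names and the 90% threshold are hypothetical choices made for the example.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into a mapping of field name to kiB."""
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_used_fraction(text: str) -> float:
    """Return the fraction of memory in use, based on MemAvailable."""
    fields = parse_meminfo(text)
    return 1.0 - fields["MemAvailable"] / fields["MemTotal"]


def is_under_memory_pressure(text: str, threshold: float = 0.9) -> bool:
    """True when the used fraction reaches the (hypothetical) threshold."""
    return memory_used_fraction(text) >= threshold
```

On a Linux host, the text passed to these functions would be the contents of `/proc/meminfo`. A check like this only helps if the resulting alert actually reaches someone, which is the second half of the problem described below.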

Our server monitoring detected the outage but could not notify the team because the SMS alert credits had run out. Unfortunately, this made the downtime longer than usual.

At 06:22 UTC, the server was rebooted and the services were carefully restarted. Since then, the server has been running well and is being closely monitored.

What are we doing to avoid the problem in the future?

  1. We’ll move Stagetimer to its own dedicated cloud server.
  2. We’ll set up an exact replica of the database and server environment on a backup server with automatic failover.
  3. We’ll revisit our existing monitoring solution and make sure the team is properly alerted in case of server downtime.
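As a rough illustration of the third point, the sketch below shows one way to keep a single exhausted alert channel (such as SMS without credits) from silencing an alert: try several channels in order and stop at the first one that succeeds. This is an assumption about a possible design, not a description of our final setup. The channel functions `sms_out_of_credits` and `email_to_outbox` are hypothetical stand-ins for real providers.

```python
from typing import Callable, Iterable, List, Tuple

# A channel is a (name, send function) pair; send raises on failure.
Channel = Tuple[str, Callable[[str], None]]


def send_alert(message: str, channels: Iterable[Channel]) -> str:
    """Try each channel in order and return the name of the first that succeeds."""
    errors: List[str] = []
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:  # any delivery failure falls through to the next channel
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all alert channels failed: " + "; ".join(errors))


# Hypothetical channels for demonstration only.
outbox: List[str] = []


def sms_out_of_credits(message: str) -> None:
    """Simulates an SMS provider that rejects messages once credits run out."""
    raise RuntimeError("SMS credits exhausted")


def email_to_outbox(message: str) -> None:
    """Simulates an email channel by recording the message in memory."""
    outbox.append(message)
```

Calling `send_alert("server down", [("sms", sms_out_of_credits), ("email", email_to_outbox)])` falls through to the email channel here. If every channel fails, the function raises instead of failing silently, so the failure itself can be logged or escalated.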

Closing thoughts

We are very sorry for any inconvenience caused by the server outage. We are aware that stagetimer.io is used in time-critical settings and that downtime can seriously disrupt live events.

Please feel free to get in touch: https://stagetimer.io/contact/