Overview
Beginning on the evening of September 18, 2019, the stream of leaderboard statistic update events that feeds the leaderboard database became delayed due to an overload in one of the processing components. This led to delayed database updates, which caused the leaderboard APIs to return stale results. At the incident's peak, delays reached 4 hours 35 minutes.
The cause of the issue was an increase in the volume of incoming events, which were being generated faster than we could batch process them. While this mismatch had been present for some time, it went undetected because volumes had not previously been large enough to produce a noticeable delay.
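The failure mode above can be sketched as a simple rate model: whenever the arrival rate exceeds the processing rate, the backlog (and therefore the reported delay) grows linearly with time. The rates below are hypothetical, chosen only to illustrate the mechanism; they are not the actual production numbers.

```python
# Illustrative model of how a sustained rate mismatch produces growing delay.
# All rates are hypothetical; they are not the actual production figures.

def backlog_after(hours, arrival_rate, processing_rate):
    """Events queued after `hours` when arrivals outpace processing."""
    surplus_per_hour = arrival_rate - processing_rate
    return max(0, surplus_per_hour * hours)

def delay_minutes(backlog, processing_rate):
    """Minutes needed to work through the current backlog at the processing rate."""
    return backlog / processing_rate * 60

# e.g. 120k events/hour arriving vs 100k events/hour processed:
backlog = backlog_after(6, arrival_rate=120_000, processing_rate=100_000)
print(backlog)                           # 120000 events queued after 6 hours
print(delay_minutes(backlog, 100_000))   # 72.0 minutes of staleness
```

A small sustained surplus is enough: even a 20% rate gap turns into more than an hour of staleness within a single night, which matches how the delay accumulated unnoticed overnight.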
Because our alarm was misconfigured, the issue was not detected until a customer reported it, many hours after the incident began. Once the issue was resolved, the service gradually returned to normal operation, with no data loss or longer-term impact.
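The report does not describe the exact alarm misconfiguration, but the timeline's distinction between maximum and average delay suggests one plausible class of bug: paging on the average delay while only the maximum breached the threshold. The sketch below is a hypothetical illustration of that class, not the actual alerting configuration.

```python
# Hypothetical sketch of a delay alarm misconfiguration. The actual alerting
# system and its bug are not described in this report; this illustrates one
# common class: evaluating average delay when the threshold was meant for
# the maximum.

THRESHOLD_MINUTES = 5

def should_page(delays_minutes, use_max=True):
    """Return True if sampled per-shard delays warrant paging the on-call."""
    if use_max:
        observed = max(delays_minutes)
    else:
        observed = sum(delays_minutes) / len(delays_minutes)
    return observed > THRESHOLD_MINUTES

samples = [1, 2, 2, 12]   # one shard badly delayed, average still low
print(should_page(samples, use_max=True))    # True  -> pages the on-call
print(should_page(samples, use_max=False))   # False -> stays silent (the bug)
```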
Timeline
| Time (PDT) | Event |
|---|---|
| 2019-09-18 23:30 | Statistic updates to leaderboards begin to fall behind, but remain below the alerting threshold. |
| 2019-09-19 03:00 | Maximum statistic update delay exceeds 5 minutes and continues to degrade. This does not page the on-call engineer due to a misconfiguration of the alarm. |
| 2019-09-19 14:52 | A customer report of leaderboards showing stale data is escalated to the on-call engineer. |
| 2019-09-19 15:20 | Maximum delay of 4 hours, 35 minutes is reached. |
| 2019-09-19 15:30 | Unable to isolate the delay to a single cause, the on-call engineer temporarily scales up the machines to allow the system to begin recovering. |
| 2019-09-19 16:50 | Half of all processing is recovered. |
| 2019-09-19 19:50 | The last of the delay is fully recovered. |
| 2019-09-20 07:24 | Statistic updates begin to fall behind again, and the on-call engineer is paged automatically thanks to the alarm corrected the prior day. |
| 2019-09-20 10:00 | Having attributed the delay to event volume multiplied by per-event processing latency, we double capacity, doubling the available processing throughput. The doubled capacity does not immediately begin reducing the delay because it must first work through the backlog. |
| 2019-09-20 10:48 | Maximum delay of 30 minutes is reached; average delay remains under 5 minutes. |
| 2019-09-20 12:22 | The capacity doubling begins to take effect. |
| 2019-09-20 12:36 | The capacity doubling is in full effect. |
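The recovery dynamics in the timeline, where capacity was doubled at 10:00 but the delay did not start shrinking until around 12:22, can be sketched as a drain-time calculation: the extra capacity first has to consume the accumulated backlog before the reported delay falls. The rates below are illustrative assumptions, not the actual production figures.

```python
# Hypothetical sketch: why doubled capacity takes a while to show effect.
# The added capacity must first work through the accumulated backlog before
# the reported delay starts shrinking. All rates are illustrative only.

def drain_hours(backlog, arrival_rate, processing_rate):
    """Hours to fully drain the backlog, or None if it keeps growing."""
    net = processing_rate - arrival_rate   # net drain rate per hour
    if net <= 0:
        return None                        # still falling behind
    return backlog / net

# Before doubling: 120k/h arriving vs 100k/h processed -> backlog only grows.
print(drain_hours(60_000, 120_000, 100_000))   # None
# After doubling to 200k/h, a 60k-event backlog drains in 0.75 h (~45 min).
print(drain_hours(60_000, 120_000, 200_000))   # 0.75
```

This also explains the shape of the recovery: the delay plateaus immediately after scaling, then drops quickly once the backlog is consumed, matching the gap between the 12:22 and 12:36 entries.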
Mitigation steps