Microsoft Azure PlayFab

System Status

Status of PlayFab services and history of incidents

Operational
Partial Outage
Major Outage
Leaderboards Delayed
Incident Report for PlayFab
Postmortem

Overview

Beginning on the evening of 9/18/19, the stream of leaderboard stat value update events which feeds into the leaderboard database became delayed due to an overload in one of the processing components. This led to delayed database updates, which caused the leaderboard APIs to return stale results. At the incident’s peak, we experienced delays of 4 hours 35 minutes.

The cause of the issue was an increase in the volume of incoming events, which were being generated faster than we were batch processing them. While this mismatch had been present for some time, it went undetected because the volumes had not previously been large enough to trigger a problem.

Because our alarm was misconfigured, the issues was not detected until it was reported by a customer many hours after the incident was active. Once the issue was resolved, the service returned gradually returned to normal operation, with no data loss or longer-term impact.

Timeline

Time (PDT) Event
2019-9-18 23:30 Statistic updates to leaderboards begin to fall behind, but staying less then the threshold to alert.
2019-9-19 03:00 Statistic updates become delayed by more than 5 minutes at maximum and continues to degrade. However this did not page the on call due to a misconfiguration of the alarm.
2019-9-19 14:52 Customer report of leaderboards showing stale data was escalated to on call engineer.
2019-9-19 15:20 Maximum delay of 4 hours, 35 minutes is reached.
2019-9-19 15:30 Unable to isolate the delay to a single cause, the machines are scaled temporarily to allow the system to begin recovery.
2019-9-19 16:50 Half of all processing is recovered.
2019-9-19 19:50 The last delay is fully recovered.
2019-9-20 07:24 Statistic updates begin to be delayed and on call engineer is engaged automatically thanks to the corrected alarm from the prior day.
2019-9-20 10:00 Having attributed the cause of the delay to event volume multiplied by the latency in processing each event, we doubled capacity allowing for twice the time for processing each event. The doubled capacity does not immediate begin reducing delay because it first must process the backlog.
2019-9-20 10:48 Maximum delay is reached of 30 minutes, average delay was under 5 minutes.
2019-9-20 12:22 Capacity doubling begins to take effect.
2019-9-20 12:36 Capacity doubling in full effect.

Mitigation steps:

  1. Complete: We scaled the overloaded leaderboard stat update stream processing component, and database update delays have returned to normal and stayed within the normal range since.
  2. In progress: We are implementing and testing an improvement in the stream processing component which we expect to better handle load variations.
  3. Complete: We have fixed the delay monitor and tested it to verify that it will properly alert us if a similar delay occurs in the future.
  4. Complete: We added additional metrics to Statistic services to better monitor latency of dependent systems
Posted Sep 24, 2019 - 12:10 PDT

Resolved
We've fully recovered.
Posted Sep 20, 2019 - 12:58 PDT
Update
We're activating the change now and will continue monitoring the impact.
Posted Sep 20, 2019 - 11:51 PDT
Update
The delay has stopped growing. We have a change prepped that we expect will improve the underlying issue.
Posted Sep 20, 2019 - 11:12 PDT
Update
We are continuing to investigate the issue.
Posted Sep 20, 2019 - 10:43 PDT
Investigating
It seems our fix last night did not resolve the underlying issue. Leaderboards are again delayed this morning, currently by 25 minutes. We're investigating the root cause and will keep this updated.
Posted Sep 20, 2019 - 10:06 PDT
This incident affected: API (Statistics and Leaderboards).