On the afternoon of March 27, 2019, beginning at 19:33 UTC, we suffered roughly three hours of service unavailability in our main cloud. During this period, most PlayFab APIs, including all forms of player login, returned a high rate of "DatabaseThroughputExceeded" errors.
The cause of the service unavailability was an unexpected down-scaling of provisioned read throughput on a critical database table in AWS DynamoDB. A large portion of our APIs read this table to check the calling player's ban status, so failures to access it resulted in API errors. The down-scaling was initiated by the DynamoDB auto scaling service, which, due to an earlier incident involving the AWS CloudWatch service, erroneously detected a sudden drop in the table's read traffic and scaled provisioned throughput down to match. The issue persisted long after our on-call engineers detected and diagnosed it, because a mitigation measure within the DynamoDB service prevented our repeated requests to increase the table's provisioned throughput from completing.
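To make the dependency concrete, here is a simplified sketch of how a throttled read on the ban table turns into the customer-visible API error; the class and function names and the error mapping are illustrative, not PlayFab's actual code:

```python
class ProvisionedThroughputExceededError(Exception):
    """Stands in for DynamoDB's throttling error, raised when consumed
    capacity exceeds the table's provisioned throughput."""


class ApiError(Exception):
    """Customer-visible API failure carrying an error code."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def check_ban(get_ban_record, player_id):
    """Look up the player's ban record before executing an API call.

    `get_ban_record` is any callable performing the table read
    (illustrative stand-in for the real data-access layer). A throttled
    read is mapped to the "DatabaseThroughputExceeded" error that
    callers saw during the incident.
    """
    try:
        record = get_ban_record(player_id)
    except ProvisionedThroughputExceededError:
        # The table is under-provisioned: the read fails, and with it
        # the entire API request.
        raise ApiError("DatabaseThroughputExceeded")
    return bool(record and record.get("banned"))
```

Because this check sits on the hot path of most APIs, any sustained throttling on the one table fails a correspondingly large fraction of all requests.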
Once the underlying issue with the DynamoDB service was resolved, PlayFab APIs quickly returned to normal operation, with no data loss or other longer-term impact.
| Time (UTC) | Event |
|---|---|
| 2019-03-27 19:32 | DynamoDB "Ban" table provisioned read throughput dropped to 1.7% of its previous level. |
| 2019-03-27 19:33 | 95% of API requests began returning "DatabaseThroughputExceeded" results. |
| 2019-03-27 19:38 | Engineers began investigating a high volume of logged service errors. |
| 2019-03-27 19:45 | Automated alerting system opened a high severity incident, paging on-call engineers. |
| 2019-03-27 19:45 | The source of errors was diagnosed as an unexpected down-scaling of the DynamoDB "Ban" table provisioned read capacity. |
| 2019-03-27 19:46 | Engineers made several attempts to manually increase read throughput provisioning on the "Ban" table, but these requests failed after several minutes. |
| 2019-03-27 20:02 | High severity support ticket was opened with AWS support. |
| 2019-03-27 21:36 | AWS support representative acknowledged the DynamoDB issue and escalated the case to the DynamoDB service team. |
| 2019-03-27 22:28 | DynamoDB "Ban" table provisioned read throughput returned to pre-incident levels. |
| 2019-03-27 22:35 | API error rates began dropping to normal levels. |
| 2019-03-28 00:01 | All systems confirmed to be operating normally. Incident updated as resolved. |
Following a thorough review of this incident, we are taking the following actions to reduce the likelihood, mitigate the impact, and decrease the repair time of similar incidents in the future.
We have set minimum provisioned read/write throughput levels on all high-load DynamoDB tables to match their trailing seven-day average usage. If the auto scaling service malfunctions again, this floor should prevent it from setting provisioned throughput far below actual usage, limiting the impact on API availability. We have also added a weekly procedure to update these minimum levels.
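A minimal sketch of the floor computation, assuming per-day average consumed-capacity samples (e.g. pulled from CloudWatch); the helper name is ours, not an AWS or PlayFab API:

```python
import math


def trailing_average_floor(daily_consumed_units, days=7):
    """Return a provisioned-capacity floor: the ceiling of the average
    consumed capacity units over the trailing `days` of samples.

    `daily_consumed_units` is a list of per-day average consumed read
    (or write) capacity units, oldest first. Hypothetical helper for
    illustration; the actual tooling is not public.
    """
    if not daily_consumed_units:
        raise ValueError("need at least one day of usage data")
    window = daily_consumed_units[-days:]
    return math.ceil(sum(window) / len(window))
```

The resulting value would then be applied as the `MinCapacity` of the table's Application Auto Scaling scalable target (via the `RegisterScalableTarget` API), so that even a malfunctioning scaler cannot drop provisioned throughput below recent real usage.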
We have established more direct communication channels with the AWS DynamoDB team in order to fully review this incident and to reduce the time-to-repair for future incidents. The DynamoDB team is conducting an internal review of this incident and will share their findings and recommendations with us once completed.
We are revising our agreement with AWS support to ensure a shorter response time on high priority issues.
We have scheduled work to update the player ban system to better handle temporary loss of availability of the database table.
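That hardening could take the shape of a last-known-good cache in front of the ban lookup. The sketch below assumes that serving a briefly stale ban status is preferable to failing the API call outright; all names are illustrative, not PlayFab's actual design:

```python
import time


class BanCheckCache:
    """Last-known-good cache for ban lookups (illustrative sketch).

    Each successful read refreshes the cache. If a later read fails and
    the cached value is still fresh enough, the cached value is served
    instead of surfacing a database error to the API caller.
    """

    def __init__(self, fetch, max_stale_seconds=300, clock=time.monotonic):
        self._fetch = fetch              # callable: player_id -> bool (may raise)
        self._max_stale = max_stale_seconds
        self._clock = clock              # injectable for testing
        self._cache = {}                 # player_id -> (banned, timestamp)

    def is_banned(self, player_id):
        try:
            banned = self._fetch(player_id)
        except Exception:
            entry = self._cache.get(player_id)
            if entry and self._clock() - entry[1] <= self._max_stale:
                # Serve the stale-but-recent value rather than failing
                # the whole API call during a database outage.
                return entry[0]
            raise                        # no usable fallback: propagate
        self._cache[player_id] = (banned, self._clock())
        return banned
```

The trade-off is explicit: a freshly issued ban may take up to `max_stale_seconds` to become effective during a database outage, in exchange for keeping logins and other APIs available.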
We apologize for the impact that this incident caused for our customers who depend on PlayFab as a critical component of their games. We take pride in the availability and performance of our APIs, and we will learn from this incident and take the opportunity to further improve our service and operations.