Microsoft Azure PlayFab

System Status

Status of PlayFab services and history of incidents

API Server Outage
Incident Report for PlayFab
Postmortem

Summary

On the afternoon of March 27, 2019, beginning at 19:33 UTC, we suffered a three-hour period of service unavailability in our main cloud. During this period, most PlayFab APIs, including all forms of player login, returned a high rate of "DatabaseThroughputExceeded" errors.

The cause of the service unavailability was an unexpected down-scaling of provisioned read throughput on a critical database table in AWS DynamoDB. This table is accessed as part of the execution of a large portion of APIs in order to check the player's ban status, with failures to access the table resulting in API errors. The down-scaling of provisioned throughput was initiated by the DynamoDB auto scaling service, which erroneously detected a sudden drop in table read throughput due to an earlier incident involving the AWS CloudWatch service. The issue persisted long after it was detected and diagnosed by our on-call engineers, because a mitigation measure within the DynamoDB service prevented our repeated requests to increase the provisioned throughput of the table from completing.
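The failure mode can be illustrated with a simplified sketch of target-tracking auto scaling. The formula, numbers, and `target_utilization` value below are illustrative only, not AWS's actual implementation: when the consumed-capacity metric reported by CloudWatch collapses to near zero, the computed desired capacity collapses with it.

```python
def desired_capacity(consumed_units, target_utilization, min_cap=1):
    """Simplified target tracking: provision enough capacity so that
    consumed / provisioned is approximately target_utilization."""
    return max(min_cap, round(consumed_units / target_utilization))

# Normal operation: ~7000 consumed units/sec at 70% target utilization.
print(desired_capacity(7000, 0.70))  # 10000 provisioned units

# CloudWatch outage: the metric erroneously reports near-zero consumption,
# so auto scaling drives provisioned throughput down to a tiny fraction
# of the real need.
print(desired_capacity(10, 0.70))    # 14 provisioned units
```

With a sane minimum in place (a larger `min_cap`), the same erroneous metric would bottom out at the floor instead of starving the table.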

Once the underlying issue with the DynamoDB service was resolved, PlayFab APIs quickly returned to normal operation, with no data loss or other longer-term impact.

Timeline

Time (UTC) Event
2019-03-27 19:32 DynamoDB "Ban" table provisioned read throughput dropped to 1.7% of its previous level.
2019-03-27 19:33 95% of API requests began returning "DatabaseThroughputExceeded" results.
2019-03-27 19:38 Engineers began investigating a high volume of logged service errors.
2019-03-27 19:45 Automated alerting system opened a high severity incident, paging on-call engineers.
2019-03-27 19:45 The source of errors was diagnosed as an unexpected down-scaling of the DynamoDB "Ban" table provisioned read capacity.
2019-03-27 19:46 Engineers made several attempts to manually increase read throughput provisioning on the "Ban" table, but these requests failed after several minutes.
2019-03-27 20:02 High severity support ticket was opened with AWS support.
2019-03-27 21:36 AWS support representative acknowledged the DynamoDB issue and escalated the case to the DynamoDB service team.
2019-03-27 22:28 DynamoDB "Ban" table provisioned read throughput returned to pre-incident levels.
2019-03-27 22:35 API error rates began dropping to normal levels.
2019-03-28 00:01 All systems confirmed to be operating normally. Incident updated as resolved.

Response

Following a thorough review of this incident, we are taking the following actions to reduce the likelihood, mitigate the impact, and decrease the repair time of similar incidents in the future.

We have set minimum provisioned read/write throughput levels on all high-load DynamoDB tables to match their trailing seven-day average usage. This should prevent the auto scaling service, if it malfunctions again, from setting provisioned throughput far below actual usage, which should limit the impact on API availability. We have also added a weekly procedure to update these minimum levels.
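A sketch of the kind of calculation involved, computing a floor from a trailing seven-day window of hourly consumption samples. The function name, sample data, and headroom parameter are our illustration, not the exact procedure:

```python
from statistics import mean

def weekly_min_capacity(hourly_consumed_units, headroom=1.0):
    """Compute a provisioned-capacity floor from the trailing seven days
    of hourly consumed-capacity samples (168 datapoints)."""
    window = hourly_consumed_units[-7 * 24:]
    return int(mean(window) * headroom)

# Illustrative week of consumption hovering around 8000 units/hour.
samples = [8000 + (i % 24) * 10 for i in range(7 * 24)]
floor = weekly_min_capacity(samples)

# In practice the computed floor would be applied as the auto scaling
# minimum, e.g. via the Application Auto Scaling API (boto3's
# register_scalable_target with ServiceNamespace="dynamodb" and
# ScalableDimension "dynamodb:table:ReadCapacityUnits"); only the
# computation is shown here.
print(floor)
```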

We have established more direct communication channels with the AWS DynamoDB team in order to fully review this incident and to reduce the time-to-repair for future incidents. The DynamoDB team is conducting an internal review of this incident and will share their findings and recommendations with us once completed.

We are revising our agreement with AWS support to ensure a shorter response time on high priority issues.

We have scheduled work to update the player ban system to better handle temporary loss of availability of the database table.
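A minimal sketch of one such approach, assuming the ban check can serve a recent cached result when the backing table is unavailable. The class, names, staleness bound, and fail-open policy are hypothetical, not the actual PlayFab design:

```python
import time

class BanCheckCache:
    """Serve a recent cached ban status when the backing table is
    unavailable, rather than failing the entire API call."""

    def __init__(self, fetch_fn, max_stale_seconds=300):
        self.fetch_fn = fetch_fn           # reads the ban table; may raise
        self.max_stale = max_stale_seconds
        self._cache = {}                   # player_id -> (banned, fetched_at)

    def is_banned(self, player_id):
        try:
            banned = self.fetch_fn(player_id)
            self._cache[player_id] = (banned, time.monotonic())
            return banned
        except Exception:
            entry = self._cache.get(player_id)
            if entry and time.monotonic() - entry[1] < self.max_stale:
                return entry[0]            # serve slightly stale data
            return False                   # fail open when status is unknown

# Simulated outage: the first read succeeds, later reads raise.
calls = {"n": 0}
def flaky_fetch(pid):
    calls["n"] += 1
    if calls["n"] > 1:
        raise RuntimeError("DatabaseThroughputExceeded")
    return False

cache = BanCheckCache(flaky_fetch)
print(cache.is_banned("p1"))  # False (live read)
print(cache.is_banned("p1"))  # False (served from cache during the outage)
```

Whether to fail open or fail closed when no cached status exists is a product decision; failing open trades a window of unenforced bans for keeping logins available.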

We apologize for the impact that this incident caused for our customers who depend on PlayFab as a critical component of their games. We take pride in the availability and performance of our APIs, and we will learn from this incident and take the opportunity to further improve our service and operations.

Apr 01, 2019 - 11:07 PDT

Resolved
All services appear to be fully operational; we will continue monitoring.
Mar 27, 2019 - 17:01 PDT
Update
All services appear to be fully operational; we will continue monitoring.
Mar 27, 2019 - 16:42 PDT
Monitoring
We are seeing API performance returning to normal and monitoring closely for delays and/or errors.
Mar 27, 2019 - 15:58 PDT
Update
We have more clarity on what we believe happened. The AWS CloudWatch outage caused CloudWatch metrics to drop to near zero. DynamoDB, which powers most of our key tables in PlayFab, has an auto scaling feature that depends on CloudWatch to set service capacity. When the CloudWatch metrics dropped to zero, DynamoDB improperly scaled down to match. We believe many customers are now all trying to scale back up at the same time, and this is causing the outage for us and, we believe, many others. We are working with Amazon to get capacity back up but still have no ETA at this point.
Mar 27, 2019 - 15:49 PDT
Update
More details: as we work with AWS on regaining availability, we want to assure customers that we have seen zero data loss.
Mar 27, 2019 - 15:20 PDT
Update
More details: a previous AWS outage in US-West-2 resulted in DynamoDB capacity being dialed down to a minimal level. We believe DynamoDB is now trying to scale back up and failing, causing many of our read/write operations to fail. We are working with them now to resolve.
Mar 27, 2019 - 14:40 PDT
Update
The issue appears to be with AWS US-West-2 (Oregon): many of our key services depend on DynamoDB, and DynamoDB appears to be experiencing scaling issues. Amazon appears to be aware of the issue and working to solve it, but at this point we have no ETA on the fix because we have no visibility into when AWS will be back up. We are looking into what other mitigations we can apply to work around it.
Mar 27, 2019 - 14:29 PDT
Identified
The issue has been identified and we're working on a resolution.
Mar 27, 2019 - 13:43 PDT
Update
We are continuing to investigate this issue.
Mar 27, 2019 - 13:26 PDT
Update
We are continuing to investigate this issue.
Mar 27, 2019 - 13:24 PDT
Update
We are continuing to investigate this issue.
Mar 27, 2019 - 13:22 PDT
Update
We are continuing to investigate this issue.
Mar 27, 2019 - 13:11 PDT
Investigating
We're experiencing an incident that is causing our service to have problems scaling. We are investigating.
Mar 27, 2019 - 13:10 PDT
This incident affected: Multiplayer Game Servers 2.0 (Request Multiplayer Server API, Build Management - Game Manager and APIs, Legacy Multiplayer Servers API (Thunderhead)), API (Authentication, Data, Inventory, Statistics and Leaderboards, Matchmaking, Content, Events, Cloud Script), Add-ons (Appuri, Segment, Kochava, GitHub, Innervate, PayPal, Photon, New Relic, Kongregate, Apple, Google Play, Facebook, Steam, PlayStation Network, Xbox Live), PlayStream (Event Processing, Webhook Deliveries), Analytics (Reports, Event History & Search, PlayStream Debugger, Event Archiving), Game Manager (Console, Player Search, API Graphs), Support and Documentation, and Player Password Reset Portal.