On the afternoon of March 27, 2019, beginning at 19:33 UTC, we suffered roughly three hours of service unavailability in our main cloud. During this period, most PlayFab APIs, including all forms of player login, returned a high rate of "DatabaseThroughputExceeded" errors.
The cause of the service unavailability was an unexpected down-scaling of provisioned read throughput on a critical database table in AWS DynamoDB. A large portion of our APIs read this table to check the calling player's ban status, so failures to access it resulted in API errors. The down-scaling was initiated by the DynamoDB auto scaling service, which, due to an earlier incident involving the AWS CloudWatch service, erroneously detected a sudden drop in the table's read traffic and scaled provisioned throughput down to match. The issue persisted long after our on-call engineers detected and diagnosed it, because a mitigation measure within the DynamoDB service prevented our repeated requests to increase the table's provisioned throughput from completing.
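To make the dependency concrete, here is a simplified sketch of how a throttled read on the ban table turns into the customer-visible API error; the class and function names and the error mapping are illustrative, not PlayFab's actual code:

```python
class ProvisionedThroughputExceededError(Exception):
    """Stands in for DynamoDB's throttling error, raised when consumed
    capacity exceeds the table's provisioned throughput."""


class ApiError(Exception):
    """Customer-visible API failure carrying an error code."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def check_ban(get_ban_record, player_id):
    """Look up the player's ban record before executing an API call.

    `get_ban_record` is any callable performing the table read
    (illustrative stand-in for the real data-access layer). A throttled
    read is mapped to the "DatabaseThroughputExceeded" error that
    callers saw during the incident.
    """
    try:
        record = get_ban_record(player_id)
    except ProvisionedThroughputExceededError:
        # The table is under-provisioned: the read fails, and with it
        # the entire API request.
        raise ApiError("DatabaseThroughputExceeded")
    return bool(record and record.get("banned"))
```

Because this check sits on the hot path of most APIs, any sustained throttling on the one table fails a correspondingly large fraction of all requests.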
Once the underlying issue with the DynamoDB service was resolved, PlayFab APIs quickly returned to normal operation, with no data loss or other longer-term impact.
| Time (UTC) | Event |
|---|---|
| 2019-03-27 19:32 | DynamoDB "Ban" table provisioned read throughput dropped to 1.7% of its previous level. |
| 2019-03-27 19:33 | 95% of API requests began returning "DatabaseThroughputExceeded" results. |
| 2019-03-27 19:38 | Engineers began investigating a high volume of logged service errors. |
| 2019-03-27 19:45 | Automated alerting system opened a high severity incident, paging on-call engineers. |
| 2019-03-27 19:45 | The source of errors was diagnosed as an unexpected down-scaling of the DynamoDB "Ban" table provisioned read capacity. |
| 2019-03-27 19:46 | Engineers made several attempts to manually increase read throughput provisioning on the "Ban" table, but these requests failed after several minutes. |
| 2019-03-27 20:02 | High severity support ticket was opened with AWS support. |
| 2019-03-27 21:36 | AWS support representative acknowledged the DynamoDB issue and escalated the case to the DynamoDB service team. |
| 2019-03-27 22:28 | DynamoDB "Ban" table provisioned read throughput returned to pre-incident levels. |
| 2019-03-27 22:35 | API error rates began dropping to normal levels. |
| 2019-03-28 00:01 | All systems confirmed to be operating normally. Incident updated as resolved. |
Following a thorough review of this incident, we are taking the following actions to reduce the likelihood, mitigate the impact, and decrease the repair time of similar incidents in the future.
We have set minimum provisioned read/write throughput levels on all high-load DynamoDB tables to match their trailing seven-day average usage. If the auto scaling service malfunctions again, this floor should prevent it from setting provisioned throughput far below actual usage, limiting the impact on API availability. We have also added a weekly procedure to update these minimum levels.
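A minimal sketch of the floor computation, assuming per-day average consumed-capacity samples (e.g. pulled from CloudWatch); the helper name is ours, not an AWS or PlayFab API:

```python
import math


def trailing_average_floor(daily_consumed_units, days=7):
    """Return a provisioned-capacity floor: the ceiling of the average
    consumed capacity units over the trailing `days` of samples.

    `daily_consumed_units` is a list of per-day average consumed read
    (or write) capacity units, oldest first. Hypothetical helper for
    illustration; the actual tooling is not public.
    """
    if not daily_consumed_units:
        raise ValueError("need at least one day of usage data")
    window = daily_consumed_units[-days:]
    return math.ceil(sum(window) / len(window))
```

The resulting value would then be applied as the `MinCapacity` of the table's Application Auto Scaling scalable target (via the `RegisterScalableTarget` API), so that even a malfunctioning scaler cannot drop provisioned throughput below recent real usage.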
We have established more direct communication channels with the AWS DynamoDB team in order to fully review this incident and to reduce the time-to-repair for future incidents. The DynamoDB team is conducting an internal review of this incident and will share their findings and recommendations with us once completed.
We are revising our agreement with AWS support to ensure a shorter response time on high priority issues.
We have scheduled work to update the player ban system to better handle temporary loss of availability of the database table.
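That hardening could take the shape of a last-known-good cache in front of the ban lookup. The sketch below assumes that serving a briefly stale ban status is preferable to failing the API call outright; all names are illustrative, not PlayFab's actual design:

```python
import time


class BanCheckCache:
    """Last-known-good cache for ban lookups (illustrative sketch).

    Each successful read refreshes the cache. If a later read fails and
    the cached value is still fresh enough, the cached value is served
    instead of surfacing a database error to the API caller.
    """

    def __init__(self, fetch, max_stale_seconds=300, clock=time.monotonic):
        self._fetch = fetch              # callable: player_id -> bool (may raise)
        self._max_stale = max_stale_seconds
        self._clock = clock              # injectable for testing
        self._cache = {}                 # player_id -> (banned, timestamp)

    def is_banned(self, player_id):
        try:
            banned = self._fetch(player_id)
        except Exception:
            entry = self._cache.get(player_id)
            if entry and self._clock() - entry[1] <= self._max_stale:
                # Serve the stale-but-recent value rather than failing
                # the whole API call during a database outage.
                return entry[0]
            raise                        # no usable fallback: propagate
        self._cache[player_id] = (banned, self._clock())
        return banned
```

The trade-off is explicit: a freshly issued ban may take up to `max_stale_seconds` to become effective during a database outage, in exchange for keeping logins and other APIs available.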
We apologize for the impact that this incident caused for our customers who depend on PlayFab as a critical component of their games. We take pride in the availability and performance of our APIs, and we will learn from this incident and take the opportunity to further improve our service and operations.