System Status

Status of PlayFab services and history of incidents

Partial Outage of PlayFab Services
Incident Report for PlayFab
Postmortem

PlayFab Incident Summary 10/7/2021

 

Incident Impact

Titles Affected: 65 titles running on our Azure API servers, which together comprise 66% of our production traffic.

All APIs for these titles returned 404 errors for 100% of requests for the duration of the incident.

Impact Start @ 10:50 PDT (17:50 UTC) on 10/7/2021

Mitigation Start @ 12:10 PDT (19:10 UTC): New cluster online and accepting requests.

Resolved @ 12:35 PDT (19:35 UTC): Old cluster decommissioned to force remaining clients to reconnect to the new cluster.
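
For reference, the durations implied by these timestamps can be worked out directly. The following is a minimal Python sketch; the variable and function names are illustrative and are not part of any PlayFab tooling.

```python
from datetime import datetime, timezone

# UTC timestamps taken from the incident report (10/7/2021).
impact_start = datetime(2021, 10, 7, 17, 50, tzinfo=timezone.utc)
mitigation_start = datetime(2021, 10, 7, 19, 10, tzinfo=timezone.utc)
resolved = datetime(2021, 10, 7, 19, 35, tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


time_to_mitigation = minutes_between(impact_start, mitigation_start)
total_impact = minutes_between(impact_start, resolved)

print(f"Time to mitigation start: {time_to_mitigation} min")  # 80 min
print(f"Total impact duration: {total_impact} min")           # 105 min
```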

 

Timeline

PlayFab was in the process of migrating API traffic to a single large Azure Kubernetes cluster. At 17:48 UTC, an infrastructure change to a Helm chart, which had been tested successfully in a development environment, was rolled out to our single production cluster. The update caused our ingress pods to begin returning 404s for all API calls. We attempted to revert the change, but the revert had no effect. Because the primary cluster remained in a failed state, we deployed a second cluster to replace it. By 19:10 UTC, traffic had been migrated to the second cluster, except for a small portion of requests from clients that still held connections to the old cluster. The old cluster was destroyed to force those clients to reconnect to the second cluster. At 19:35 UTC, the incident was fully mitigated.

 

Root Cause

The root cause of this incident was that a single Kubernetes cluster serving all traffic was updated in place. This made the cluster a single point of failure where a failed configuration update could cause an outage for all titles.

 

Mitigations

  • Active x3 - We have switched from running a single cluster to running three independent clusters behind a load balancer. All three clusters receive traffic, and if one fails for any reason, its traffic automatically shifts to the other two clusters, which are over-provisioned to handle the extra load.
  • Immutable Clusters - We no longer make in-place infrastructure updates to an active cluster. Instead, we create a new cluster with the new infrastructure or configuration and then add it to the load balancer rotation.
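
As a rough illustration of the over-provisioning mentioned above: if traffic is split evenly across the clusters and the design must survive the loss of one of them, each surviving cluster has to absorb a larger share of the total. The sketch below assumes an even split; the function name is hypothetical, and the report does not state PlayFab's actual capacity figures.

```python
from fractions import Fraction


def required_headroom(clusters: int, tolerated_failures: int = 1) -> Fraction:
    """Extra capacity each cluster needs, as a fraction of its normal load,
    so that the survivors can carry all traffic after `tolerated_failures`
    clusters fail. Assumes traffic is split evenly across healthy clusters."""
    survivors = clusters - tolerated_failures
    if survivors <= 0:
        raise ValueError("at least one cluster must survive")
    normal_share = Fraction(1, clusters)     # share of traffic in normal operation
    failover_share = Fraction(1, survivors)  # share of traffic after the failure
    return failover_share / normal_share - 1


headroom = required_headroom(3)
print(f"Each of 3 clusters needs {float(headroom):.0%} headroom")  # 50%
```

Under these assumptions, three clusters each need 50% spare capacity to tolerate one failure, compared with 100% for a two-cluster setup.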

 

Both mitigations were in place by the day after the incident, and we believe they will prevent this class of problem in the future.
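
The immutable-cluster approach can be modeled in a few lines. This is a toy Python sketch, not PlayFab's deployment code: the `Cluster` and `LoadBalancer` classes, the `healthy` flag standing in for a real health probe, and the step of retiring an old cluster after the new one joins are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)  # frozen: a cluster's configuration is never edited in place
class Cluster:
    name: str
    config_version: str
    healthy: bool = True  # stand-in for a real health check


@dataclass
class LoadBalancer:
    pool: List[Cluster] = field(default_factory=list)

    def add(self, cluster: Cluster) -> None:
        # Refuse to route traffic to a cluster that fails its health check.
        if not cluster.healthy:
            raise RuntimeError(f"{cluster.name} failed its health check")
        self.pool.append(cluster)

    def remove(self, name: str) -> None:
        self.pool = [c for c in self.pool if c.name != name]


def replace_cluster(lb: LoadBalancer, old_name: str, new_cluster: Cluster) -> None:
    """Bring a new cluster into rotation, then retire the old one.
    If the new cluster is unhealthy, the existing pool is left untouched."""
    lb.add(new_cluster)
    lb.remove(old_name)


lb = LoadBalancer([Cluster(f"cluster-{i}", "v1") for i in range(3)])

# A bad configuration never reaches the clusters that are serving traffic.
try:
    replace_cluster(lb, "cluster-0", Cluster("cluster-3", "v2", healthy=False))
except RuntimeError as err:
    print(f"Rollout aborted: {err}")
versions_after_failed_rollout = [c.config_version for c in lb.pool]

# A healthy replacement joins the rotation and the old cluster is retired.
replace_cluster(lb, "cluster-0", Cluster("cluster-4", "v2"))
names_after_good_rollout = [c.name for c in lb.pool]
print(names_after_good_rollout)
```

The point of the model is that a failed update only affects a cluster that is not yet in rotation, so the clusters already serving traffic keep running their known-good configuration.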

Posted Oct 26, 2021 - 11:36 PDT

Resolved
This issue is resolved and all Azure PlayFab functionality is fully operational.
Posted Oct 07, 2021 - 12:38 PDT
Monitoring
The changes to address this partial outage have been rolled out and we are currently monitoring the services. Functionality is gradually being restored.
Posted Oct 07, 2021 - 12:32 PDT
Identified
The issue has been identified and we are currently validating changes needed to rectify the underlying issue.
Posted Oct 07, 2021 - 12:18 PDT
Update
We are continuing to investigate this issue.
Posted Oct 07, 2021 - 12:14 PDT
Update
We are continuing to investigate this issue.
Posted Oct 07, 2021 - 11:57 PDT
Investigating
We are currently experiencing a partial outage of a number of our services. The issue is under active investigation.
Posted Oct 07, 2021 - 11:35 PDT
This incident affected: Add-ons (Appuri, Segment, Kochava, GitHub, Innervate, PayPal, Photon, New Relic, Kongregate, Facebook, Apple, Google Play, PlayStation Network, Steam, Xbox Live), API (Authentication, Cloud Script, Content, Data, Events, Inventory, Matchmaking, Statistics and Leaderboards), Custom Game Servers (US, EU, Singapore, Japan, Australia, Brazil), PlayStream (Event Processing, Webhook Deliveries), Analytics (Trends, Reports, Dashboard, Event History & Search, PlayStream Debugger, Event Archiving), Services (Job Service), Game Manager (Console, Player Search, API Graphs), Party, and Player Password Reset Portal.