System Status

Status of PlayFab services and history of incidents

Increased latency and decreased availability of GetUserReadOnlyData APIs
Incident Report for PlayFab
Postmortem

On July 25th, 2024, between 12:10 PM and 10:20 PM PDT, some customers experienced intermittent errors and delays when calling PlayFab APIs. The incident was caused by network routing issues between Azure Kubernetes Service (AKS) and CosmosDB.

We mitigated the issue by moving all pods to different availability zones (AZs) while the underlying Azure network routing issue was being fixed.

Impact

Customers experienced timeouts on calls to 77 different APIs for some players. This degraded the overall quality of service and affected service level agreements (SLAs).

Root Cause Analysis

The root cause of the incident was a misconfiguration of private endpoints in AZ2, which arose because an update to the endpoints' configuration was delayed. The misconfiguration caused connectivity issues between AKS and CosmosDB, and queries timed out as a result.

Action Items

To prevent similar incidents from happening again, we will take the following actions:

  • Enhance monitoring and alerting to detect and report anomalies in networking behavior and performance within a particular AZ.

  • The Azure networking team will roll out configuration updates with improved monitoring, to avoid a repeat of such incidents.
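
As an illustration of the per-AZ anomaly detection described in the first action item above, here is a minimal sketch in Python. It is not PlayFab's monitoring implementation. The names `RequestSample`, `timeout_rate_by_zone`, and `anomalous_zones`, the zone labels, and the `ratio` and `min_rate` thresholds are all hypothetical, chosen only to show the idea of comparing one zone's timeout rate against the others.

```python
import statistics
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestSample:
    zone: str        # hypothetical AZ label, e.g. "AZ2"
    timed_out: bool  # True if the backend query timed out


def timeout_rate_by_zone(samples):
    """Return {zone: fraction of sampled requests that timed out}."""
    totals = defaultdict(int)
    timeouts = defaultdict(int)
    for sample in samples:
        totals[sample.zone] += 1
        if sample.timed_out:
            timeouts[sample.zone] += 1
    return {zone: timeouts[zone] / totals[zone] for zone in totals}


def anomalous_zones(samples, ratio=3.0, min_rate=0.01):
    """Flag zones whose timeout rate stands out from the other zones.

    A zone is flagged when its timeout rate is at least `min_rate` and
    more than `ratio` times the median rate of all other zones.
    Both thresholds are illustrative assumptions, not tuned values.
    """
    rates = timeout_rate_by_zone(samples)
    flagged = []
    for zone, rate in rates.items():
        others = [r for z, r in rates.items() if z != zone]
        if not others:
            continue  # nothing to compare against with a single zone
        baseline = statistics.median(others)
        if rate >= min_rate and rate > ratio * baseline:
            flagged.append(zone)
    return sorted(flagged)
```

Comparing each zone against the median of the others, instead of against a fixed global threshold, keeps a fault confined to a single AZ visible even when the fleet-wide error rate still looks acceptable.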

Posted Aug 28, 2024 - 17:11 PDT

Resolved
This incident has been resolved.
Posted Aug 16, 2024 - 09:51 PDT
Update
The fix has been confirmed since 8/3 and API latency is in the green again. (Sorry, we forgot to resolve this at that time.)
Posted Aug 16, 2024 - 09:49 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 02, 2024 - 21:38 PDT
Identified
The issue has been identified and a mitigation has been implemented. Customers should be seeing an immediate improvement. Latency and error rate are expected to be back to normal within 2 hours. Multiple engineering teams across the company are collaborating to identify the root cause.
Posted Aug 02, 2024 - 21:36 PDT
Update
We are still investigating the issue. The list of affected APIs:
Admin/GetUserReadOnlyData, Admin/UpdateUserReadOnlyData, Client/GetPlayerCombinedInfo, Client/GetUserCombinedInfo, Client/GetUserReadOnlyData, Client/LoginWithAndroidDeviceID, Client/LoginWithApple, Client/LoginWithCustomID, Client/LoginWithEmailAddress, Client/LoginWithFacebook, Client/LoginWithGameCenter, Client/LoginWithGoogleAccount, Client/LoginWithGooglePlayGamesServices, Client/LoginWithIOSDeviceID, Client/LoginWithKongregate, Client/LoginWithNintendoServiceAccount, Client/LoginWithNintendoSwitchDeviceId, Client/LoginWithOpenIdConnect, Client/LoginWithPlayFab, Client/LoginWithPSN, Client/LoginWithSteam, Client/LoginWithTwitch, Client/LoginWithXbox, Server/GetPlayerCombinedInfo, Server/GetUserReadOnlyData, Server/LoginWithServerCustomId, Server/LoginWithXboxId, Server/UpdateUserReadOnlyData
Posted Jul 30, 2024 - 22:26 PDT
Investigating
Our engineering teams are investigating the degraded performance of GetUserReadOnlyData APIs since 00:00 UTC.
Posted Jul 30, 2024 - 20:43 PDT
This incident affected: API (Data).