System Status

Status of PlayFab services and history of incidents

MPS Build and Allocation Failures

Incident Report for PlayFab

Postmortem

On March 24th, 2025, between 9:00 PM and 12:00 AM PDT (4:00 AM to 7:00 AM UTC on March 25th), customers intermittently encountered failures with PlayFab Multiplayer Server (MPS) APIs, such as build or allocation calls. One cluster entered an unhealthy state after a high-load customer enabled an experimental feature; the resulting high CPU usage caused pod restarts. We resolved the issue by deploying a hotfix for the bug.

Impact

During the incident, customers experienced intermittent failures when using MPS APIs. The issue was isolated to titles leased to a specific cluster. Titles on other clusters were not affected.

Root Cause Analysis

The root cause of the incident was pod restarts triggered by unnecessary recurring calls from a new experimental feature enabled by a high-load customer. The feature caused grains to initialize with leases in stamps not associated with the title, which delayed the processing of heartbeat requests. The delayed requests filled the message queue on the grain, resulting in excessive CPU usage and pod restart events.
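The backlog mechanism described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the MPS implementation: the function name `simulate_backlog`, the tick-based timing, and all rates and capacities are hypothetical values chosen for illustration. It shows that when requests arrive faster than they can be processed, queue depth grows steadily instead of draining.

```python
from collections import deque


def simulate_backlog(arrivals_per_tick: int, processed_per_tick: int,
                     ticks: int, capacity: int) -> list[int]:
    """Return the queue depth after each tick for a single-consumer queue.

    Hypothetical model: every tick, `arrivals_per_tick` heartbeat requests
    arrive and at most `processed_per_tick` are handled. Requests that
    arrive while the queue is at `capacity` are dropped.
    """
    queue: deque[tuple[int, int]] = deque()
    depths = []
    for tick in range(ticks):
        # Enqueue this tick's arrivals, dropping any beyond capacity.
        for i in range(arrivals_per_tick):
            if len(queue) < capacity:
                queue.append((tick, i))
        # Process as many queued requests as the consumer can handle.
        for _ in range(min(processed_per_tick, len(queue))):
            queue.popleft()
        depths.append(len(queue))
    return depths


# Assumed numbers: 5 heartbeats per tick in both cases.
healthy = simulate_backlog(arrivals_per_tick=5, processed_per_tick=5,
                           ticks=10, capacity=50)
delayed = simulate_backlog(arrivals_per_tick=5, processed_per_tick=2,
                           ticks=10, capacity=50)
print("healthy:", healthy)
print("delayed:", delayed)
```

In the healthy case the queue empties every tick. In the delayed case it grows by three requests per tick, which is the kind of sustained backlog that keeps a worker busy and can push CPU usage up.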

Action Items

To prevent similar incidents in the future, we have taken the following actions:

Verified the functionality of the experimental feature.
Checked flags in Cosmos DB and API usage for predictive standby.
Initiated a re-evaluation of the design and test coverage for the feature.

Posted Apr 02, 2025 - 13:03 PDT

Resolved

One of our clusters experienced issues processing calls between 4 AM and 7 AM UTC on March 25th, causing some calls to fail. During that time, customers may have intermittently encountered failures with MPS APIs, such as build or allocation calls. The issue was resolved after we deployed a fix.

Posted Mar 24, 2025 - 20:58 PDT
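
For readers comparing the UTC window in the update above with the Pacific-time timestamps on this page, the conversion can be checked with a short sketch. It assumes Pacific Daylight Time is a fixed UTC-7 offset on these dates; the names `PDT` and `to_pdt` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Pacific Daylight Time modeled as a fixed UTC-7 offset.
PDT = timezone(timedelta(hours=-7), "PDT")


def to_pdt(utc_dt: datetime) -> datetime:
    """Convert a timezone-aware UTC datetime to PDT."""
    return utc_dt.astimezone(PDT)


# Incident window as stated in the update: 4 AM to 7 AM UTC on March 25th.
start_utc = datetime(2025, 3, 25, 4, 0, tzinfo=timezone.utc)
end_utc = datetime(2025, 3, 25, 7, 0, tzinfo=timezone.utc)

start_pdt = to_pdt(start_utc)
end_pdt = to_pdt(end_utc)
print(start_pdt.isoformat())  # 2025-03-24T21:00:00-07:00
print(end_pdt.isoformat())    # 2025-03-25T00:00:00-07:00
```

The window therefore runs from 9:00 PM on March 24th to midnight in PDT.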