On March 24th, 2025, between 9:00 PM and 12:00 AM PST, customers intermittently encountered failures with PlayFab Multiplayer Server (MPS) APIs, such as build or allocation calls. The failures were caused by a cluster entering an unhealthy state: an experimental feature enabled by a high-load customer drove CPU usage high enough to trigger pod restarts. We resolved the issue by deploying a hotfix for the underlying bug.
During the incident, customers experienced intermittent failures when using MPS APIs. The issue was isolated to titles leased to a specific cluster. Titles on other clusters were not affected.
The root cause of the incident was pod restarts triggered by unnecessary recurring calls from a new experimental feature enabled by a high-load customer. The feature caused grains to initialize with leases in stamps not associated with the title, which delayed the processing of heartbeat requests. The delayed requests filled the message queue on the grain, resulting in excessive CPU usage and pod restarts.
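The queue buildup described above can be illustrated with a toy model. The sketch below is not PlayFab or grain runtime code; the function name, the per-tick rates, and the capacity are all hypothetical values chosen only to show how a queue stays empty while requests are served at least as fast as they arrive, and fills to its limit once processing slows down.

```python
def queue_depth_over_time(arrivals_per_tick, served_per_tick, ticks, capacity):
    """Return the queue depth after each tick of a simple deterministic model.

    All parameters are illustrative assumptions, not measured values.
    """
    depth = 0
    history = []
    for _ in range(ticks):
        depth += arrivals_per_tick              # new heartbeat requests arrive
        depth -= min(depth, served_per_tick)    # process as many as capacity allows
        depth = min(depth, capacity)            # a bounded queue cannot exceed its limit
        history.append(depth)
    return history


# Healthy case: requests are served faster than they arrive, so nothing queues.
healthy = queue_depth_over_time(arrivals_per_tick=10, served_per_tick=12, ticks=5, capacity=20)

# Degraded case: slower processing lets the backlog grow until the queue is full.
degraded = queue_depth_over_time(arrivals_per_tick=10, served_per_tick=4, ticks=5, capacity=20)

print(healthy)   # [0, 0, 0, 0, 0]
print(degraded)  # [6, 12, 18, 20, 20]
```

In this model the backlog grows by the difference between the arrival rate and the service rate on every tick, which is why even a modest slowdown in heartbeat processing can saturate a queue quickly under high load.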
To prevent similar incidents in the future, we have taken the following actions:
Verified the functionality of the experimental feature.
Checked flags in Cosmos DB and API usage for predictive standby.
Initiated a re-evaluation of the design and test coverage for the feature.