System Status

Status of PlayFab services and history of incidents

High API error rate across multiple titles

Incident Report for PlayFab

Postmortem

On March 21, 2024, between 11:45 AM and 1:55 PM PDT, some customers experienced intermittent errors and delays when calling PlayFab's API. The incident was caused by a network configuration change that overloaded our NAT gateways, which then severely throttled outgoing traffic. We resolved the issue by rolling back the configuration change and applying a patch to the load balancer.

Impact

The incident affected almost all titles that use PlayFab's API, especially those that rely on multiplayer services. Some customers received ServiceUnavailable errors when trying to matchmake or join game sessions. The incident lasted for about two hours and ten minutes.

Root Cause Analysis

The root cause of the incident was a network configuration change that enabled a feature flag to warm up CosmosDB connections. The flag was intended to improve the performance and reliability of our database operations, but it had the unintended side effect of opening thousands of outgoing HTTP connections to CosmosDB endpoints. These connections went through our NAT gateways, which can hold only a limited number of SNAT connections at a time. Due to a gap in our monitoring, we were not aware that the NAT gateways were already operating near that limit. The surge of traffic overwhelmed the gateways, which entered a failure state, dropping SYN packets and causing connection timeouts. As a result, our API servers could not process incoming requests, which produced errors and delays for our customers.
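To illustrate the arithmetic behind this failure mode, here is a minimal sketch of a headroom check that could run before a connection warm-up. It is not our production tooling. The names `SnatBudget`, `warmup_fits`, and `max_servers_per_wave` are hypothetical, and every number is a made-up example rather than a measured value from this incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnatBudget:
    """Hypothetical view of NAT gateway capacity (all figures are assumptions)."""
    capacity: int  # total SNAT connections the gateways can hold
    in_use: int    # SNAT connections already open before the change

    @property
    def headroom(self) -> int:
        # Connections that can still be opened before hitting the limit.
        return max(self.capacity - self.in_use, 0)


def warmup_fits(budget: SnatBudget, servers: int, conns_per_server: int) -> bool:
    """Return True if warming up every server at once stays within headroom."""
    return servers * conns_per_server <= budget.headroom


def max_servers_per_wave(budget: SnatBudget, conns_per_server: int,
                         safety_fraction: float = 0.5) -> int:
    """How many servers could warm up together while using only a
    fraction of the remaining headroom (a staged rollout)."""
    if conns_per_server <= 0:
        raise ValueError("conns_per_server must be positive")
    usable = int(budget.headroom * safety_fraction)
    return usable // conns_per_server


# Illustrative numbers only: gateways already near capacity.
budget = SnatBudget(capacity=64_000, in_use=58_000)
print(budget.headroom)                   # remaining connections
print(warmup_fits(budget, 200, 50))      # can 200 servers open 50 each?
print(max_servers_per_wave(budget, 50))  # servers per staged wave
```

With these example figures the warm-up would need 10,000 connections against 6,000 of headroom, so enabling it everywhere at once would not fit, while waves of up to 60 servers would stay within half of the remaining capacity.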

Action Items

We improved our testing and validation procedures for network configuration changes to catch such bugs before they reach production.
We enhanced our monitoring and alerting systems to detect and report any anomalies in the load balancer's behavior and performance.
We followed up with the Azure NAT Gateway team on the cause of the abrupt failure and requested more visibility into, and control over, NAT gateway capacity and health.
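As a rough sketch of the kind of alert rule the monitoring item describes, the following flags a sustained run of high utilization rather than a single spike. The function name `first_sustained_breach`, the 80% threshold, and the sample values are all assumptions chosen for illustration. They do not describe our actual alerting configuration.

```python
from typing import Iterable, Optional


def first_sustained_breach(samples: Iterable[int], capacity: int,
                           threshold: float = 0.8,
                           min_consecutive: int = 3) -> Optional[int]:
    """Return the index of the sample at which utilization has been at or
    above `threshold` for `min_consecutive` samples in a row, or None."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    run = 0
    for i, used in enumerate(samples):
        if used / capacity >= threshold:
            run += 1
            if run >= min_consecutive:
                return i
        else:
            run = 0  # a dip below the threshold resets the streak
    return None


# Hypothetical per-minute connection counts against a capacity of 1000.
usage = [700, 820, 790, 810, 850, 900]
print(first_sustained_breach(usage, capacity=1000))
```

Requiring several consecutive samples trades a little detection latency for fewer false alarms. A real rule would tune both values against observed traffic.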

Posted Apr 23, 2024 - 17:12 PDT

Resolved

This incident has been resolved.

Posted Mar 21, 2024 - 14:04 PDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Mar 21, 2024 - 13:52 PDT

Update

We are continuing to investigate this issue.

Posted Mar 21, 2024 - 12:13 PDT

Investigating

We are currently investigating this issue.

Posted Mar 21, 2024 - 12:12 PDT

This incident affected: API (Authentication, Cloud Script, Content, Data, Economy (V2), Events, Inventory, Lobby, Matchmaking, Statistics and Leaderboards).