On March 21, 2024, between 11:45 AM and 1:55 PM UTC, some customers experienced intermittent errors and delays when accessing PlayFab's API. The incident was caused by a network configuration change that overloaded our NAT gateways, which in turn severely throttled outgoing traffic. We resolved the issue by rolling back the configuration change and applying a patch to the load balancer.
The incident affected almost all titles that use PlayFab's API, especially those that rely on multiplayer services. Some customers received ServiceUnavailable errors when trying to matchmake or join game sessions. The incident lasted for about two hours and ten minutes.
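Errors of this kind are transient, so clients that retried with backoff generally recovered once capacity returned. As a minimal sketch (not PlayFab SDK code; `request_fn` and `ServiceUnavailableError` are hypothetical stand-ins for an SDK call and its transient error type), a jittered exponential backoff loop looks like this:

```python
import random
import time


class ServiceUnavailableError(Exception):
    """Stand-in for an SDK-specific transient (503-style) error type."""


def call_with_backoff(request_fn, max_attempts=5, base_delay=0.5, max_delay=8.0):
    """Retry a zero-argument API call on transient ServiceUnavailable errors."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except ServiceUnavailableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Exponential backoff with full jitter, so a fleet of clients
            # does not retry in lockstep and worsen the overload.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

Full jitter (a uniformly random sleep up to the backoff ceiling) spreads retries out in time, which matters most during exactly this kind of incident, when many clients fail at once.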
The root cause of the incident was a network configuration change that enabled a feature flag to warm up CosmosDB connections. This feature flag was intended to improve the performance and reliability of our database operations, but it had an unintended side effect: it created thousands of new outgoing HTTPS connections to CosmosDB endpoints. These connections went through our NAT gateways, which support only a fixed number of concurrent SNAT (source network address translation) connections. Due to a gap in our monitoring, we were not aware that the NAT gateways were already operating near that limit. The gateways were overwhelmed by the surge of traffic and entered a failure state, dropping SYN packets and causing connection timeouts. As a result, our API servers could not complete outbound calls and failed to process incoming requests, producing the errors and delays our customers observed.
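The failure mode is easiest to see with back-of-envelope arithmetic. The sketch below is illustrative only: the port budgets, gateway count, and fleet sizes are assumptions, not PlayFab's real numbers. It shows how a warm-up burst on top of near-capacity baseline traffic pushes total flows past the SNAT budget, at which point new SYNs are dropped.

```python
# Illustrative SNAT exhaustion model (all constants are assumptions,
# not PlayFab's actual fleet sizes or gateway limits).

SNAT_PORTS_PER_GATEWAY = 64_000   # assumed usable SNAT port budget per gateway
GATEWAYS = 4                      # assumed NAT gateways fronting the fleet
BASELINE_CONNECTIONS = 180_000    # assumed steady-state outbound flows


def snat_headroom(new_connections_per_host: int, hosts: int) -> int:
    """SNAT ports left after a connection warm-up burst (negative = exhausted)."""
    capacity = SNAT_PORTS_PER_GATEWAY * GATEWAYS
    used = BASELINE_CONNECTIONS + new_connections_per_host * hosts
    return capacity - used


# Baseline alone fits: 256,000 ports vs 180,000 flows in use.
print(snat_headroom(new_connections_per_host=0, hosts=100))      # positive
# A warm-up opening 2,000 fresh connections on each of 100 hosts adds
# 200,000 flows, overshooting the budget: new connections start failing.
print(snat_headroom(new_connections_per_host=2_000, hosts=100))  # negative
```

This is also why the monitoring gap mattered: with no visibility into baseline SNAT utilization, the remaining headroom before the rollout was unknown, so the warm-up's extra connections looked safe in isolation.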