On July 25th, 2024, between 12:10 PM and 10:20 PM PDT, some customers experienced intermittent errors and delays when accessing PlayFab's API. The incident was caused by network routing issues between Azure Kubernetes Service (AKS) and CosmosDB.
We mitigated the impact by moving all pods to different availability zones (AZs) while the underlying Azure network routing issue was being resolved.
Customers experienced timeouts when calling 77 different APIs for some players. This affected the overall quality of service and our service level agreements (SLAs).
The root cause of the incident was a misconfiguration of private endpoints in AZ2, which occurred because the update to their configuration was delayed. This led to connectivity issues between AKS and CosmosDB, causing queries to time out.
To prevent similar incidents from happening again, we will take the following actions:
Enhance monitoring and alerting systems to detect and report anomalies in network behavior and performance within a particular AZ.
The Azure networking team will roll out configuration updates with improved monitoring, to avoid a repeat of such incidents in the future.
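
As an illustration of the per-AZ monitoring action above, the following is a minimal sketch of how a single degraded zone could be detected by comparing its timeout rate against all other zones combined. It is not PlayFab's actual monitoring. The names RequestSample and find_anomalous_azs, the thresholds, and the zone labels are all hypothetical and chosen only for this example.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple


class RequestSample(NamedTuple):
    """One observed request (hypothetical record shape for this sketch)."""
    az: str          # availability zone that served the request, e.g. "AZ2"
    timed_out: bool  # True if the downstream query timed out


def find_anomalous_azs(
    samples: Iterable[RequestSample],
    ratio_threshold: float = 3.0,  # assumed: flag at 3x the rate of other AZs
    min_requests: int = 50,        # assumed: ignore AZs with too little traffic
    min_baseline: float = 0.01,    # assumed: floor so a near-zero baseline is not noisy
) -> List[str]:
    """Return AZs whose timeout rate far exceeds that of all other AZs combined."""
    totals = defaultdict(int)
    timeouts = defaultdict(int)
    for sample in samples:
        totals[sample.az] += 1
        if sample.timed_out:
            timeouts[sample.az] += 1

    all_total = sum(totals.values())
    all_timeouts = sum(timeouts.values())

    flagged = []
    for az in sorted(totals):
        if totals[az] < min_requests:
            continue  # not enough data to judge this zone
        rest_total = all_total - totals[az]
        rest_timeouts = all_timeouts - timeouts[az]
        az_rate = timeouts[az] / totals[az]
        rest_rate = rest_timeouts / rest_total if rest_total else 0.0
        baseline = max(rest_rate, min_baseline)
        if az_rate >= ratio_threshold * baseline:
            flagged.append(az)
    return flagged


if __name__ == "__main__":
    # Synthetic traffic: AZ2 times out far more often than AZ1 and AZ3.
    demo = (
        [RequestSample("AZ1", i < 1) for i in range(100)]
        + [RequestSample("AZ2", i < 30) for i in range(100)]
        + [RequestSample("AZ3", i < 2) for i in range(100)]
    )
    print(find_anomalous_azs(demo))
```

Comparing each zone against the rest of the fleet, rather than against a fixed threshold, is one way to make a zone-local fault stand out even when overall error rates look acceptable.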