On June 7th, 2024, between 11:26 AM and 12:17 PM UTC, some customers experienced intermittent errors and delays when accessing PlayFab's API. The incident was caused by a rapid scaling down of the cloud script infrastructure after a network configuration change, resulting in resource starvation and overload of the available compute instances. We resolved the issue by increasing the minimum number of replicas and decreasing the maximum number of script engines per title for some heavy cloud script users.
During the incident, 7% of all cloud script calls failed and returned InternalServerError.
To prevent similar incidents from happening again, we have taken the following actions:
We enhanced our monitoring and alerting systems to detect and report any anomalies in the cloud script server's behavior and performance.
We updated the scaling policies for the cloud script server deployment to ensure a sufficient number of replicas and a balanced distribution of traffic.
We decreased the maximum number of script engines per title for some heavy cloud script users to reduce resource contention and improve service quality.
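
For illustration only, the two guardrails described above (a minimum replica floor and a per-title cap on script engines) can be sketched as below. This is a minimal sketch under stated assumptions, not PlayFab's actual implementation: the function names, parameters, and the idea of limiting how many replicas a single scaling decision may remove are all hypothetical and chosen only to show the general technique.

```python
import math


def desired_replicas(current, load, capacity_per_replica,
                     min_replicas, max_scale_down_step):
    """Return the replica count for one scaling decision (hypothetical policy)."""
    # Replicas needed to serve the observed load.
    needed = math.ceil(load / capacity_per_replica)
    # Never drop below the configured floor, even when load looks low.
    target = max(needed, min_replicas)
    # Scale up immediately, but limit how many replicas one decision may remove,
    # so a sudden drop in measured load cannot empty the pool at once.
    if target < current:
        target = max(target, current - max_scale_down_step)
    return target


def engines_allowed(requested, per_title_cap):
    """Cap the number of script engines a single title may hold (hypothetical)."""
    return min(requested, per_title_cap)
```

With these assumed values, a pool of 20 replicas that suddenly measures very little load shrinks by at most 2 replicas per decision and never goes below the floor of 6, while a title requesting 10 engines under a cap of 4 receives 4.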