On April 4th, 2024, between 9:45 AM and 11:06 AM UTC, some customers experienced intermittent errors and delays when accessing PlayFab's CloudScript API. The incident was caused by a misconfigured TLS certificate that disrupted communication between PlayFab compute instances, causing some CloudScript calls to general PlayFab APIs to fail. We resolved the issue by rolling back traffic to undo the changes introduced by our production deployment and by updating the certificate policy.
To prevent similar incidents from happening again, we have taken the following actions:
We fixed the bug in the certificate policy creation module that prevented the certificate policy from being updated when the DNS names changed.
We improved our testing and validation procedures for certificate policies to ensure that they match the expected DNS names for each cluster version.
We enhanced our monitoring and alerting systems to detect and report anomalies in CloudScript API performance and availability.
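
As an illustration of the certificate policy validation described above, the following is a minimal sketch of a pre-deployment check that compares the DNS names in a certificate policy against the names expected for a cluster version. It is not PlayFab's actual tooling: the `CertificatePolicy` type, the `matches` and `missing_dns_names` helpers, and the `example.test` hostnames are all hypothetical and exist only for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CertificatePolicy:
    """Hypothetical stand-in for a certificate policy (not a PlayFab type)."""
    subject: str
    dns_names: frozenset[str]


def matches(pattern: str, hostname: str) -> bool:
    """Return True if one DNS name entry covers the given hostname.

    Assumption: a leading "*." wildcard covers exactly one leftmost label.
    """
    pattern, hostname = pattern.lower(), hostname.lower()
    if pattern.startswith("*."):
        first_label, _, rest = hostname.partition(".")
        return first_label != "" and rest == pattern[2:]
    return pattern == hostname


def missing_dns_names(policy: CertificatePolicy, expected: list[str]) -> list[str]:
    """List the expected hostnames that no entry in the policy covers."""
    return sorted(
        host for host in expected
        if not any(matches(entry, host) for entry in policy.dns_names)
    )


if __name__ == "__main__":
    # A policy that still lists the previous cluster version's DNS name.
    stale_policy = CertificatePolicy(
        subject="CN=compute-v1.example.test",
        dns_names=frozenset({"compute-v1.example.test"}),
    )
    # Hypothetical names the new cluster version is expected to serve.
    expected = ["compute-v2.example.test"]

    missing = missing_dns_names(stale_policy, expected)
    if missing:
        print("Certificate policy is missing DNS names:", ", ".join(missing))
```

A check along these lines could run as a deployment gate, failing the rollout whenever the returned list is non-empty, so that a policy left unchanged after a DNS name change is caught before traffic is shifted.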