On February 12, 2025, between 11:09 AM and 12:14 PM PST, some customers experienced delays in the updating of Leadership dashboards due to an issue with the PlayStream processor. The incident was caused by authentication failures following a network configuration change in which role assignments were not correctly applied to the managed identity. We resolved the issue by deleting the stats processor pods in the partially created cluster and confirming that the monitor reported a healthy status.
The delay in updating Leadership dashboards lasted approximately 1 hour and 5 minutes. During this window, the PlayStream processor was unable to update its processing status.
The root cause of this incident was a human error in configuration. A new rollout was initiated earlier in the day, but it targeted the wrong version (an earlier version should have been deployed) and the cluster was never fully created. This incomplete state left role assignments and managed identities missing, which caused authentication errors in the stats processor.
To prevent similar incidents from happening again, we have taken the following actions:
· We have improved our testing and validation procedures for network configuration changes to catch such errors before they reach production.
· We have enhanced our monitoring and alerting systems to detect and report any anomalies in the load balancer's behavior and performance.
· We have investigated and fixed the health probe for the PlayStream processors so that it verifies managed identities are properly assigned.
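
As an illustration of the last action, the following is a minimal sketch of a health probe that reports unhealthy when a pod cannot authenticate with its managed identity. It is not the actual PlayStream probe. The names `ProbeResult`, `probe_identity`, and `acquire_token` are hypothetical, and `acquire_token` stands in for whatever credential call the service really uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    healthy: bool
    detail: str


def probe_identity(acquire_token: Callable[[], str]) -> ProbeResult:
    """Report unhealthy if the pod cannot authenticate with its identity.

    `acquire_token` is a hypothetical callable supplied by the caller. It is
    assumed to return a non-empty token on success and to raise on failure.
    """
    try:
        token = acquire_token()
    except Exception as exc:
        # Any authentication failure marks the pod unhealthy.
        return ProbeResult(False, f"token acquisition failed: {exc}")
    if not token:
        return ProbeResult(False, "token acquisition returned an empty token")
    return ProbeResult(True, "identity authenticated")
```

A probe along these lines would have flagged the partially created cluster, because pods with missing role assignments would fail the token request instead of appearing healthy. Wiring the result into the cluster's readiness or liveness checks is deployment-specific and is not shown here.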