System Status

Status of PlayFab services and history of incidents


PlayStream Processing Delay

Incident Report for PlayFab

Postmortem

On February 12th, 2025, between 11:09 AM and 12:14 PM PST, some customers experienced delays in the updating of Leaderboard dashboards due to an issue with the PlayStream processor. The incident was caused by authentication failures in the stats processor after a network configuration change left the required role assignments missing from its managed identity. We resolved the issue by deleting the stats processor pods in the partially created cluster and confirming that the monitor reported a healthy status.
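
For illustration only, the sketch below shows one way the pod-deletion step could be carried out with the Kubernetes Python client so that the Deployment recreates the pods with a correctly configured identity. The namespace and label selector are assumed placeholders, not our actual configuration or tooling.

    # Minimal remediation sketch: delete the stats processor pods so their
    # controller recreates them. Namespace and label selector are assumptions.
    from kubernetes import client, config

    def restart_stats_processor_pods(namespace="playstream",
                                     label_selector="app=stats-processor"):
        config.load_kube_config()  # use config.load_incluster_config() inside the cluster
        core = client.CoreV1Api()
        pods = core.list_namespaced_pod(namespace, label_selector=label_selector)
        for pod in pods.items:
            print(f"Deleting {pod.metadata.name} ...")
            core.delete_namespaced_pod(pod.metadata.name, namespace)

    if __name__ == "__main__":
        restart_stats_processor_pods()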

Impact

The delay in updating Leaderboard dashboards lasted 1 hour and 5 minutes, during which the PlayStream processor was unable to update its processing status.

Root Cause Analysis

The root cause of this incident was a human error in configuration. A new cluster rollout was initiated earlier in the day, but the cluster was not fully created, and the deployment should have targeted an earlier version. This incomplete state led to missing role assignments and managed identities, resulting in authentication errors in the stats processor.
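
To illustrate the failure mode (this is not our internal diagnostic code), the sketch below uses the azure-identity library to check whether a pod's managed identity can obtain a token at all; the token scope shown is an assumed example.

    # Hypothetical smoke test: can the pod's managed identity acquire a token?
    # A missing or misconfigured identity fails here as an authentication error
    # (azure.identity.CredentialUnavailableError subclasses ClientAuthenticationError);
    # a missing role assignment typically shows up later as an authorization error
    # on the target resource. The scope below is an assumed example.
    from azure.identity import ManagedIdentityCredential
    from azure.core.exceptions import ClientAuthenticationError

    def managed_identity_is_healthy(scope="https://management.azure.com/.default"):
        try:
            ManagedIdentityCredential().get_token(scope)
            return True
        except ClientAuthenticationError as err:
            print(f"Managed identity check failed: {err}")
            return False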

Action Items

To prevent similar incidents from happening again, we have taken the following actions:

· We have improved our testing and validation procedures for network configuration changes to catch such errors before they reach production.

· We have enhanced our monitoring and alerting systems to detect and report anomalies in the load balancer's behavior and performance.

· We investigated and fixed the health probe for the PlayStream processors to ensure proper assignment of managed identities; a sketch of this kind of probe follows below.
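
As a hedged illustration of that last item, the sketch below wires an identity check into a minimal HTTP readiness endpoint so an unhealthy pod is kept out of rotation. The port, path, and check function are illustrative assumptions, not our actual probe.

    # Minimal readiness endpoint sketch: return 503 until the identity check
    # passes, so the orchestrator withholds traffic from a pod whose identity
    # is broken. Port, path, and the check are illustrative assumptions.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def identity_check_passes():
        # Placeholder for a real check, e.g. the managed-identity token test above.
        return True

    class ReadinessHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/healthz" and identity_check_passes():
                self.send_response(200)
            else:
                self.send_response(503)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), ReadinessHandler).serve_forever()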

Posted Feb 18, 2025 - 14:20 PST

Resolved

There was a PlayStream incident between 10:50 AM and noon PST, which caused a delay in updating the Leaderboard dashboards. The issue was resolved at noon, and processing has returned to normal.
Posted Feb 12, 2025 - 11:00 PST