System Status

Status of PlayFab services and history of incidents

Action Processing Delay

Incident Report for PlayFab

Postmortem

On July 31, 2025, between 10:30 AM and 7:29 PM PDT, some customers experienced significant delays when using PlayStream actions for rules and segments. At peak, action executions were delayed by more than 500 minutes. The incident was caused by high load combined with the configured maximum number of records each processor could read at once; together with pod health logic, this led to excessive memory usage and processor failures. We resolved the issue by reducing the configured value, which restored healthy processing across all partitions.

Impact

All PlayFab titles using PlayStream actions for rules and segments were impacted. Action executions were delayed but not dropped; however, because of the prolonged delay, some actions may no longer have been useful by the time they were processed.

Root Cause Analysis

The incident was caused by a misconfiguration of the maximum number of records each processor attempted to read at once, combined with a change in the logic for partition allocation per processor. As processor pods failed due to memory exhaustion, their work shifted to the remaining healthy pods, which became overloaded in turn, leading to a cascading failure and increasing delays in action processing.
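
The cascade described above can be illustrated with a minimal sketch. It is not PlayFab's implementation: the function name, the even redistribution of partitions, the one-buffered-read-per-partition memory model, and every number are assumptions chosen only to show how a large per-read record limit can leave too little headroom to absorb a single pod failure.

```python
import math


def surviving_pods(total_partitions, healthy_pods, max_records_per_read,
                   bytes_per_record, pod_memory_limit_bytes):
    """Return how many pods remain healthy once the system settles.

    Hypothetical model: each pod buffers one full read per partition it
    owns, and partitions of failed pods are spread evenly over survivors.
    """
    while healthy_pods > 0:
        partitions_per_pod = math.ceil(total_partitions / healthy_pods)
        memory_needed = (partitions_per_pod * max_records_per_read
                         * bytes_per_record)
        if memory_needed <= pod_memory_limit_bytes:
            return healthy_pods  # every survivor fits in memory: stable
        # A pod exhausts its memory and fails; survivors absorb its partitions.
        healthy_pods -= 1
    return 0


LIMIT = 24 * 1024 * 1024  # assumed 24 MiB buffer budget per pod

# All 8 pods healthy with a large read limit: stable, but with little headroom.
print(surviving_pods(32, 8, 5000, 1024, LIMIT))  # 8
# One pod lost for any reason: the extra partitions push every survivor over.
print(surviving_pods(32, 7, 5000, 1024, LIMIT))  # 0
# Same failure with a smaller read limit: the remaining pods absorb the load.
print(surviving_pods(32, 7, 1000, 1024, LIMIT))  # 7
```

In this toy model, lowering the per-read limit is what turns a single pod failure from a full collapse into a tolerable loss of capacity, which matches the direction of the fix described in this report.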

Action Items

To prevent similar incidents from happening again, we have taken the following actions:

  • We reduced the maximum number of records each processor can read at once, improving processor reliability and preventing memory exhaustion.
  • We improved our monitoring and alerting to detect abnormal processor delays and memory usage earlier.
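
As a rough illustration of the second action item, the sketch below flags processors whose delay or memory usage crosses a threshold. The `ProcessorSample` type, the `find_alerts` function, and the threshold values are hypothetical and are not taken from PlayFab's monitoring system.

```python
from dataclasses import dataclass


@dataclass
class ProcessorSample:
    partition: int
    delay_minutes: float    # age of the oldest unprocessed action
    memory_fraction: float  # pod memory in use, from 0.0 to 1.0


def find_alerts(samples, max_delay_minutes=5.0, max_memory_fraction=0.8):
    """Return (partition, reason) pairs for samples that exceed a threshold.

    Both default thresholds are assumed values for illustration only.
    """
    alerts = []
    for sample in samples:
        if sample.delay_minutes > max_delay_minutes:
            alerts.append((sample.partition, "delay"))
        if sample.memory_fraction > max_memory_fraction:
            alerts.append((sample.partition, "memory"))
    return alerts


samples = [
    ProcessorSample(partition=0, delay_minutes=0.5, memory_fraction=0.40),
    ProcessorSample(partition=1, delay_minutes=12.0, memory_fraction=0.92),
]
print(find_alerts(samples))  # [(1, 'delay'), (1, 'memory')]
```

Alerting on memory as well as delay matters here because, in an incident like this one, memory pressure on a pod would be expected to show up before the processing delay becomes visible to customers.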
Posted Aug 12, 2025 - 16:47 PDT

Resolved

This incident has been resolved.
Posted Jul 31, 2025 - 21:55 PDT

Monitoring

A fix has been deployed and we are continuing to monitor as processing catches up.
Posted Jul 31, 2025 - 19:01 PDT

Update

We are continuing to investigate and are testing a potential mitigation to improve action processing throughput.
Posted Jul 31, 2025 - 17:31 PDT

Investigating

We are currently experiencing delayed processing of actions for rule and segment automation. Engineers are working to resolve this as soon as possible.
Posted Jul 31, 2025 - 15:17 PDT
This incident affected: PlayStream (Event Processing).