System Status

Status of PlayFab services and history of incidents

API Connection Failures and Timeouts
Incident Report for PlayFab
Postmortem

At 2018-07-11T00:54 UTC, an engineer ran an automated job to re-route a title’s traffic from one cluster to another by updating DNS records. Moving a title between clusters is a routine job that requires entering the ID of the title being migrated into a form. The engineer pasted an incorrect value into the form, which caused the job to update the DNS records for the primary public API cluster to point to an incorrect destination.

As a result, the majority of API calls failed because they were routed to the wrong cluster. Within about two minutes, the engineer noticed the drop in request volume and started to investigate. Shortly after that, a service health check detected the failure and paged the on-call engineer, who started to investigate. The engineer identified the issue and deployed a fix in 7 minutes.

With the service working normally again, we are focusing on making sure this issue cannot happen again and on looking for similar issues so that we can prevent them in the first place. We have already put the following fixes in place:

  • Added validation to the specific form so that only a valid title ID can be entered, and ensured that the main API service can’t be migrated using this automated job.
  • Reviewed our other automated deployment jobs to ensure that we have proper validation in place before the job runs.
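
As an illustration only, the kind of pre-flight validation described in the fixes above might look roughly like the following sketch. It is not PlayFab's actual implementation: the function name, the exception type, the assumed title ID format (a short hexadecimal string), and the idea of passing in sets of known and protected IDs are all hypothetical.

```python
import re

# Assumption: title IDs are short hexadecimal strings. The real format may differ.
TITLE_ID_PATTERN = re.compile(r"[0-9A-F]{1,8}")


class InvalidMigrationTarget(ValueError):
    """Raised when a form value must not be passed on to the migration job."""


def validate_migration_target(raw_value, known_title_ids, protected_ids):
    """Return the normalized title ID, or raise before any DNS record is touched.

    known_title_ids and protected_ids are hypothetical sets of upper-case IDs
    supplied by the caller, for example from a title registry.
    """
    value = raw_value.strip().upper()

    # Reject anything that does not look like a title ID (e.g. a pasted hostname).
    if not TITLE_ID_PATTERN.fullmatch(value):
        raise InvalidMigrationTarget(f"not a well-formed title ID: {raw_value!r}")

    # Never let this job re-route a protected target such as the main API service.
    if value in protected_ids:
        raise InvalidMigrationTarget(f"{value} is protected and cannot be migrated")

    # Only allow IDs that correspond to a title that actually exists.
    if value not in known_title_ids:
        raise InvalidMigrationTarget(f"unknown title ID: {value}")

    return value
```

In this sketch the job would call `validate_migration_target` before making any change and would abort on `InvalidMigrationTarget`, so a mistyped or pasted value fails early instead of reaching the DNS update step.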

Regards,

The PlayFab Team

Posted Sep 07, 2018 - 11:02 PDT

Resolved
This incident has been resolved.
Posted Jul 10, 2018 - 19:12 PDT
Update
We are continuing to monitor for any further issues.
Posted Jul 10, 2018 - 18:55 PDT
Monitoring
From approximately 17:54 to 18:05 PDT, there was a high rate of failures and timeouts connecting to the public API endpoint, due to a DNS routing error. The issue has been fixed, and connection error rates are returning to normal levels.
Posted Jul 10, 2018 - 18:54 PDT
This incident affected: API (Authentication, Content, Data, Events, Inventory, Matchmaking, Statistics and Leaderboards).