Microsoft Azure PlayFab

System Status

Status of PlayFab services and history of incidents

Allocation Failures - Multiplayer Servers 2.0 Beta (Thunderhead)
Incident Report for PlayFab

At 2018-07-13T00:58 UTC, the certificate used during the SSL handshake to verify the certificate chain between the Multiplayer Server agent running in each VM and the Multiplayer Server control plane could no longer be found. The underlying cause was that the servers hosting these certificates had been decommissioned several days earlier, and the cached version on Akamai expired at this time.

As a result, the Multiplayer Servers 2.0 beta service was unable to allocate new servers to titles using the service. Initially, the on-call engineer suspected storage account throttling, since this had been an issue in the past. Once we verified that wasn’t the case, we checked our replication system and found what we thought was the issue, but it turned out to be a symptom rather than the cause. With these two potential causes ruled out, the missing certificate was identified as the issue at approximately 2018-07-13T06:05 UTC.

As an immediate mitigation, the on-call engineer manually copied and installed the CA cert onto each of the servers where the problem was occurring. This mitigated the issue. As a second mitigation, the on-call engineer hosted the AIA (Authority Information Access) cert in a temporary storage account and set up Akamai routing to retain the AIA URL stamped on the Azure VM certificate.

With the service working normally again, we are focusing on making sure this issue won’t happen again and on looking for similar issues so that we can prevent them in the first place. We have already put the following fixes in place:

  • Investigation and full internal report around why the servers hosting the certificates were decommissioned
  • Long term ownership and accountability for the internal certificate authority (there was a recent team transition).
  • Moving the CA certificate from temporary storage account back to the main accounts backing the URLs.
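
A failure of this kind (a certificate URL that silently stops serving content once a cache expires) can in principle be caught by a periodic probe. The following is only a minimal sketch under stated assumptions, not a description of PlayFab's tooling: the function names `fetch_url` and `check_aia_urls` and the example URLs are hypothetical, and the check only confirms that each URL returns a non-empty body, not that the body is a valid certificate.

```python
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_url(url: str, timeout: float = 10.0) -> bytes:
    """Default fetcher: return the response body, raising on HTTP or network errors."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def check_aia_urls(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_url,
) -> Dict[str, str]:
    """Probe each certificate URL and map it to "ok" or a short failure description.

    The fetcher is injectable so the probe can be exercised without network access.
    """
    results: Dict[str, str] = {}
    for url in urls:
        try:
            body = fetch(url)
        except Exception as exc:  # any fetch failure counts as a failed probe
            results[url] = f"failed: {exc}"
            continue
        results[url] = "ok" if body else "failed: empty response"
    return results


if __name__ == "__main__":
    # Hypothetical URL for illustration only; substitute the AIA URL
    # stamped on the certificates being monitored.
    for url, status in check_aia_urls(["http://example.invalid/ca.crt"]).items():
        print(url, status)
```

Run on a schedule from outside the CDN's cache path, a probe like this would report a failure as soon as the origin stops serving the certificate, rather than when the cached copy expires.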


The PlayFab Team

Posted 12 months ago. Sep 07, 2018 - 11:02 PDT

This incident has been resolved.
Posted about 1 year ago. Jul 12, 2018 - 23:39 PDT
Problem is understood and mitigation applied. Standing by levels are recovering but still below target and causing partial allocation failures.
Posted about 1 year ago. Jul 12, 2018 - 23:37 PDT
Allocation failures in multiple regions.
Posted about 1 year ago. Jul 12, 2018 - 22:17 PDT
This incident affected: Multiplayer Game Servers 2.0 (Request Multiplayer Server API).