Allocation Failures - Multiplayer Servers 2.0 Beta (Thunderhead)
Incident Report for PlayFab
Postmortem

At 2018-07-13T00:58 UTC, the certificate used during the SSL handshake to verify the certificate chain between the Multiplayer Server agent running in each VM and the Multiplayer Server control plane could no longer be found. The underlying cause was that the servers hosting these certificates had been decommissioned several days earlier, and the cached version on Akamai expired at this time.

As a result, the Multiplayer Server 2.0 beta service was unable to allocate new servers to titles using the service. Initially, the on-call engineer suspected storage account throttling, since this had been an issue in the past. Once we verified that wasn’t the case, we checked our replication system and found what we thought was the cause, but it turned out to be a symptom rather than the cause. With these two possibilities ruled out, the missing certificate was identified as the cause at approximately 2018-07-13T06:05 UTC.
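Because the postmortem quotes times in UTC while the status updates further down the page are stamped in PDT, a small conversion makes the timeline easier to compare. The sketch below uses only the timestamps that appear in this report; the fixed UTC-7 offset for PDT and the helper name `to_utc` are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: a fixed UTC-7 offset is enough, since every date here is in July (PDT).
PDT = timezone(timedelta(hours=-7), "PDT")


def to_utc(text: str, tz: timezone) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM' as local time in tz and convert it to UTC."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M").replace(tzinfo=tz).astimezone(UTC)


# Times quoted in the postmortem (already UTC).
cache_expired = to_utc("2018-07-13 00:58", UTC)
cause_identified = to_utc("2018-07-13 06:05", UTC)
time_to_identify = cause_identified - cache_expired  # 5 hours 7 minutes

# Times from the status updates on this page (PDT).
status_updates_utc = {
    "Identified": to_utc("2018-07-12 22:17", PDT),
    "Monitoring": to_utc("2018-07-12 23:37", PDT),
    "Resolved": to_utc("2018-07-12 23:39", PDT),
}

if __name__ == "__main__":
    print("time to identify the missing certificate:", time_to_identify)
    for label, when in status_updates_utc.items():
        print(f"{label}: {when:%Y-%m-%d %H:%M} UTC")
```

Run as a script, this prints the gap between the cache expiry and the identification of the missing certificate, followed by the three status updates expressed in UTC.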

As an immediate mitigation, the on-call engineer manually copied and installed the CA cert onto each of the servers where the problem was occurring. This mitigated the issue. As a second mitigation, the on-call engineer hosted the AIA (Authority Information Access) cert in a temporary storage account and set up Akamai routing so that the AIA URL stamped on the Azure VM certificate was retained.
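One way to catch this class of failure earlier is to probe the AIA URLs directly instead of relying on a CDN cache to keep answering. The following is a minimal sketch, not the check PlayFab runs: the function names are hypothetical, the URLs to probe are supplied by the caller, and the content check is a cheap heuristic rather than full certificate parsing.

```python
import urllib.request
from typing import Callable, Dict, Iterable

PEM_HEADER = b"-----BEGIN CERTIFICATE-----"


def fetch_bytes(url: str, timeout: float = 10.0) -> bytes:
    """Download the body at url; raises on HTTP or network errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def looks_like_certificate(body: bytes) -> bool:
    """Heuristic only: a PEM header, or the SEQUENCE tag (0x30) that starts DER."""
    return body.lstrip().startswith(PEM_HEADER) or body[:1] == b"\x30"


def probe_aia_urls(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_bytes,
) -> Dict[str, str]:
    """Return a status string per URL: 'ok', 'not a certificate', or 'unreachable: ...'."""
    results: Dict[str, str] = {}
    for url in urls:
        try:
            body = fetch(url)
        except Exception as exc:  # network error, HTTP error, timeout, ...
            results[url] = f"unreachable: {exc}"
            continue
        results[url] = "ok" if looks_like_certificate(body) else "not a certificate"
    return results
```

Passing `fetch` as a parameter keeps the probe testable without network access. For a probe like this to see the origin rather than a cached copy, it would need to bypass or sit behind the CDN, which this sketch does not attempt.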

With the service working normally again, we are focusing on making sure this issue won’t happen again and on looking for similar issues so that we can prevent them in the first place. We have already put the following fixes in place:

  • Investigation and full internal report into why the servers hosting the certificates were decommissioned.
  • Long-term ownership and accountability for the internal certificate authority (there was a recent team transition).
  • Moving the CA certificate from the temporary storage account back to the main accounts backing the URLs.

Regards,

The PlayFab Team

Posted Sep 07, 2018 - 11:02 PDT

Resolved
This incident has been resolved.
Posted Jul 12, 2018 - 23:39 PDT
Monitoring
The problem is understood and mitigation has been applied. Standby levels are recovering but are still below target, causing partial allocation failures.
Posted Jul 12, 2018 - 23:37 PDT
Identified
Allocation failures in multiple regions.
Posted Jul 12, 2018 - 22:17 PDT
This incident affected: Multiplayer Game Servers (private preview - beta).