Reference
Background
The PlayFab Multiplayer Server orchestrator (aka Control Plane) uses Azure CosmosDB as its backing persistence store and performs a variety of queries on fields within each document stored there. We take advantage of the built-in secondary indexes on each document and of the guaranteed high-availability and geo-replication SLA (https://azure.microsoft.com/en-in/support/legal/sla/cosmos-db/v1_3/). Changing the indexing policy on an existing database is more nuanced: when an indexing policy is changed, the existing indexes are transformed to the new ones online and in place, and queries may return partial results while the transformation is in progress.
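For context, a Cosmos DB indexing policy is a JSON document attached to a container. The sketch below shows the general shape of a non-default policy that indexes only the queried fields; the specific paths (`buildId`, `region`) are illustrative assumptions, not the actual policy our service uses.

```python
# Illustrative Cosmos DB indexing policy, expressed as the JSON document the
# service would send to the database. Paths are hypothetical examples.
indexing_policy = {
    "indexingMode": "consistent",  # index updates applied synchronously with writes
    "automatic": True,
    "includedPaths": [
        # Index only the fields we actually query on.
        {"path": "/buildId/?"},
        {"path": "/region/?"},
    ],
    "excludedPaths": [
        # Exclude everything else to keep writes cheap.
        {"path": "/*"},
    ],
}
```

Replacing this policy with a different one is what triggers an online index transformation on the existing data.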
Incident Details (all times are PST)
Impact start: 10:00 AM 10/21/2019
Partial mitigation: 4:00 PM 10/21/2019
Full mitigation: 8:00 PM 10/21/2019
As part of our on-call follow-up, we do log analysis and performance tuning to ensure that our service is running optimally. During one of these exercises, we found that because of our non-default indexing policy, some of our queries were very expensive (the right kind of index had not been built). To mitigate this, we changed the indexing policy manually on 10/17/2019. Note that this was done while our service was in steady state, during which all writes go through a write-through cache and reads are served from the cache (so we do not depend heavily on the database). The indexing policy update went through smoothly. Our code also sets the indexing policy of the database as part of service startup (so that a deployment to a new environment would configure the database correctly). We updated this code as well, so that the next deployment of the service would not reset the indexing policy we had updated manually.
9:30 AM: We were updating our cluster certificates, which triggered a restart of each node in the cluster. The nodes were still running the old code, so on restart the startup code reset the indexing policy. Because the nodes were restarting, the service was not in steady state and relied heavily on the database for correct results. At this point the index transformation had started and the database was returning partial results, though the effect was not yet drastic.
10:15 AM: We then did our regular production deployment with the new code. This reset the indexing policy again, triggering another transformation. The certificate configuration on some of the nodes hadn't taken effect, so the deployment was rolled back (queuing another policy change). We corrected the certificate configuration and deployed again (changing the policy once more). Note that, at this point, we had queued multiple index transformations, each of which would rebuild the index and cause the database to return partial results. The deployment completed around 11:20 AM.
11:30 AM: Our automated alerts and customers began signaling that the service was returning errors and that some multiplayer server Builds were missing from API responses. We also noticed that some internal system documents appeared duplicated (logs showed exceptions saying multiple documents were found where only one was expected), and we realized something must be wrong with the database results. However, it was not immediately obvious why the indexes were being rebuilt, since we had made the code change that should have prevented it.
12:00 PM: To reduce server allocation failures, we made a code change that allowed us to make progress in VM provisioning even in the face of partial database results. This meant another production deployment, which queued up yet another index transformation. The deployment completed around 12:45 PM. We then started gathering more data on the impact and data corruption.
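One way to make progress despite partial query results, sketched below as an illustration of the idea rather than our actual change, is to union fresh query results with the last known-good snapshot, so documents that momentarily vanish from query results are not treated as deleted:

```python
def merge_with_snapshot(query_results, snapshot):
    """Union possibly-partial query results with the last known-good snapshot.

    Both arguments map document id -> document. Ids missing from the query
    are retained from the snapshot instead of being treated as deleted;
    where both sides have an id, the fresher query copy wins.
    """
    merged = dict(snapshot)
    merged.update(query_results)
    return merged
```

The trade-off is that genuinely deleted documents linger until the snapshot is refreshed, which is acceptable as a temporary mitigation.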
1:30 PM: We engaged the CosmosDB on-call team, since it wasn't clear why the transformation was happening or why it hadn't completed yet. After analyzing multiple logs, we found that transformations were queued and each one was taking around 90 minutes to complete. With their help, we cleared the queue and manually reset the policy again.
3:30 PM: Having identified the underlying problem, we completely removed the code that set the indexing policy at startup. This production deployment resulted in partial mitigation: server allocations started to succeed around 4:00-4:15 PM.
4:45 PM: We noticed that VMs were not being provisioned and telemetry was still stale. Our VM provisioning is driven by periodic triggers/reminders. By design, when a customer deletes a Multiplayer Server Build, we unregister all of its triggers as part of cleanup. Unfortunately, during the index transformation the database did not return documents for Builds that still existed, so the service incorrectly concluded that those Builds had been deleted and unregistered their triggers. As a result, no new VMs were being provisioned or assigned to those Builds. Existing VMs were working fine, and all allocations were served from existing game servers and their recycling.
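The cleanup logic that bit us effectively treated "not returned by a query" as "deleted". A safer variant, sketched below with a hypothetical `point_read` helper, demands a positive deletion signal before unregistering: a point read by id does not go through the secondary index, so it is unaffected by an in-progress index rebuild.

```python
def should_unregister(build_id, point_read):
    """Decide whether a Build's triggers should be unregistered.

    `point_read` (hypothetical) fetches the Build document directly by its id,
    returning the document or None. Unlike a secondary-index query, a point
    read is unaffected by an in-progress index transformation, so None here
    is a trustworthy deletion signal.
    """
    return point_read(build_id) is None
```

Gating destructive cleanup on a point read rather than on query membership would have left the triggers registered throughout the transformation.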
6:00 PM: We started to "kick" some of these triggers manually and wrote a script to do so for all titles. We saw the VM provisioning logic kicking in and presumed the issue was resolved.
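The remediation script boils down to re-registering the provisioning trigger for every known Build. A minimal sketch, assuming a hypothetical idempotent `register_trigger` call (so re-registering an already-active trigger is harmless):

```python
def kick_all_triggers(build_ids, register_trigger):
    """Re-register the provisioning trigger for every known Build.

    `register_trigger` (hypothetical) is assumed idempotent, so calling it
    for a trigger that is still active is a harmless no-op.
    """
    kicked = []
    for build_id in build_ids:
        register_trigger(build_id)
        kicked.append(build_id)
    return kicked
```

Idempotency matters here: the script can safely run over all titles without first working out which Builds actually lost their triggers.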
6:30 PM: Another part of our system uses event streaming to hydrate data about all game servers running for each Build. Again by design, when a customer deletes a Multiplayer Server Build, we unsubscribe from its event stream as part of cleanup. This part of the system was therefore not receiving any data (heartbeats) from the VMs and couldn't recover. As a conservative measure, when we haven't fully recovered state we don't provision more VMs (since we cannot make assertions about the current state of the system), to keep the system from spiraling out of control. However, this also meant that VM provisioning remained stuck.
7:30 PM: We deployed a code change to force re-subscription to the event streaming. This finally did the trick and the service was fully restored: VM provisioning resumed immediately and we quickly scaled up to meet demand.
Learnings and improvements