Reference
Background
The PlayFab Multiplayer Server orchestrator (aka Control Plane) uses Azure CosmosDB as its backing persistence store and performs a variety of queries on fields within each document stored there. We take advantage of the built-in secondary indexes on each document and of the guaranteed high-availability and geo-replication SLA (https://azure.microsoft.com/en-in/support/legal/sla/cosmos-db/v1_3/). Changing the indexing policy on an existing database is more nuanced: when an indexing policy is changed, the existing indexes are transformed to the new ones online and in place, and queries may return partial results while the transformation is in progress.
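For context, a Cosmos DB indexing policy is a JSON document attached to a container. The sketch below shows the general shape of a non-default policy that indexes only the queried fields; the specific paths (`buildId`, `region`) are illustrative assumptions, not the actual policy our service uses.

```python
# Illustrative Cosmos DB indexing policy, expressed as the JSON document the
# service would send to the database. Paths are hypothetical examples.
indexing_policy = {
    "indexingMode": "consistent",  # index updates applied synchronously with writes
    "automatic": True,
    "includedPaths": [
        # Index only the fields we actually query on.
        {"path": "/buildId/?"},
        {"path": "/region/?"},
    ],
    "excludedPaths": [
        # Exclude everything else to keep writes cheap.
        {"path": "/*"},
    ],
}
```

Replacing this policy with a different one is what triggers an online index transformation on the existing data.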
Incident Details (all times are PST)
Impact start: 10:00 AM 10/21/2019
Partial mitigation: 4:00 PM 10/21/2019
Full mitigation: 8:00 PM 10/21/2019
As part of our on-call follow-up, we do log analysis and performance tuning to ensure that our service is running optimally. During one of these exercises, we found that because of our non-default indexing policy, some of our queries were very expensive (the right kind of index had not been built). To mitigate this, we changed the indexing policy manually on 10/17/2019. Note that this was done while our service was in steady state, during which all writes go through a write-through cache and reads are served from the cache (so we do not depend heavily on the database). The indexing policy update went through smoothly. Our code also sets the indexing policy of the database as part of service startup (so that a deployment to a new environment would configure the database correctly). We updated this code as well, so that the next deployment of the service would not reset the indexing policy we had updated manually.
9:30 AM: We were updating our cluster certificates, which triggered a restart of each node in the cluster. The nodes were still running the old code, so on restart the startup code reset the indexing policy. Because the nodes were restarting, the service was not in steady state and relied heavily on the database for correct results. At this point the index transformation had started and the database was returning partial results, though the effect was not yet drastic.
10:15 AM: We then did our regular production deployment with the new code. This reset the indexing policy again, triggering another transformation. The certificate configuration on some of the nodes hadn't taken effect, so the deployment was rolled back (queuing another policy change). We corrected the certificate configuration and deployed again (changing the policy once more). Note that, at this point, we had queued multiple index transformations, each of which would rebuild the index and cause the database to return partial results. The deployment completed around 11:20 AM.
11:30 AM: Our automated alerts and customers began signaling that the service was returning errors and that some multiplayer server Builds were missing from API responses. We also noticed that some internal system documents appeared duplicated (logs showed exceptions saying multiple documents were found where only one was expected), and we realized something must be wrong with the database results. However, it was not immediately obvious why the indexes were being rebuilt, since we had made the code change that should have prevented it.
12:00 PM: To reduce server allocation failures, we made a code change that allowed us to make progress in VM provisioning even in the face of partial database results. This meant another production deployment, which queued up yet another index transformation. The deployment completed around 12:45 PM. We then started gathering more data on the impact and data corruption.
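One way to make progress despite partial query results, sketched below as an illustration of the idea rather than our actual change, is to union fresh query results with the last known-good snapshot, so documents that momentarily vanish from query results are not treated as deleted:

```python
def merge_with_snapshot(query_results, snapshot):
    """Union possibly-partial query results with the last known-good snapshot.

    Both arguments map document id -> document. Ids missing from the query
    are retained from the snapshot instead of being treated as deleted;
    where both sides have an id, the fresher query copy wins.
    """
    merged = dict(snapshot)
    merged.update(query_results)
    return merged
```

The trade-off is that genuinely deleted documents linger until the snapshot is refreshed, which is acceptable as a temporary mitigation.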
1:30 PM: We engaged the CosmosDB on-call team, since it wasn't clear why the transformation was happening or why it hadn't completed yet. After analyzing multiple logs, we found that transformations were queued and each one was taking around 90 minutes to complete. With their help, we cleared the queue and manually reset the policy again.
3:30 PM: Having identified the underlying problem, we completely removed the code that set the indexing policy at startup. This production deployment resulted in partial mitigation: server allocations started to succeed around 4:00-4:15 PM.
4:45 PM: We noticed that VMs were not being provisioned and telemetry was still stale. Our VM provisioning is driven by periodic triggers/reminders. By design, when a customer deletes a Multiplayer Server Build, we unregister all of its triggers as part of cleanup. Unfortunately, during the index transformation the database did not return documents for Builds that still existed, so the service incorrectly concluded that those Builds had been deleted and unregistered their triggers. As a result, no new VMs were being provisioned or assigned to those Builds. Existing VMs were working fine, and all allocations were served from existing game servers and their recycling.
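The cleanup logic that bit us effectively treated "not returned by a query" as "deleted". A safer variant, sketched below with a hypothetical `point_read` helper, demands a positive deletion signal before unregistering: a point read by id does not go through the secondary index, so it is unaffected by an in-progress index rebuild.

```python
def should_unregister(build_id, point_read):
    """Decide whether a Build's triggers should be unregistered.

    `point_read` (hypothetical) fetches the Build document directly by its id,
    returning the document or None. Unlike a secondary-index query, a point
    read is unaffected by an in-progress index transformation, so None here
    is a trustworthy deletion signal.
    """
    return point_read(build_id) is None
```

Gating destructive cleanup on a point read rather than on query membership would have left the triggers registered throughout the transformation.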
6:00 PM: We started to "kick" some of these triggers manually and wrote a script to do so for all titles. We saw the VM provisioning logic kicking in and presumed the issue was resolved.
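The remediation script boils down to re-registering the provisioning trigger for every known Build. A minimal sketch, assuming a hypothetical idempotent `register_trigger` call (so re-registering an already-active trigger is harmless):

```python
def kick_all_triggers(build_ids, register_trigger):
    """Re-register the provisioning trigger for every known Build.

    `register_trigger` (hypothetical) is assumed idempotent, so calling it
    for a trigger that is still active is a harmless no-op.
    """
    kicked = []
    for build_id in build_ids:
        register_trigger(build_id)
        kicked.append(build_id)
    return kicked
```

Idempotency matters here: the script can safely run over all titles without first working out which Builds actually lost their triggers.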
6:30 PM: Another part of our system uses event streaming to hydrate data about all game servers running for each Build. Again by design, when a customer deletes a Multiplayer Server Build, we unsubscribe from its event stream as part of cleanup. This part of the system was therefore not receiving any data (heartbeats) from the VMs and couldn't recover. As a conservative measure, when we haven't fully recovered state we don't provision more VMs (since we cannot make assertions about the current state of the system), to keep the system from spiraling out of control. However, this also meant that VM provisioning remained stuck.
7:30 PM: We deployed a code change to force re-subscription to the event streaming. This finally did the trick and the service was fully restored: VM provisioning resumed immediately and we quickly scaled up to meet demand.
Learnings and improvements