This postmortem covers three MPS incidents that occurred on April 10th.
Multiplayer Services spins up Batch VMs to host game servers. The VM agent deployed in each VM sends heartbeats to Multiplayer Services to report the health of the VM and its sessions. If a VM fails to send heartbeats for an extended period, Multiplayer Services marks it unhealthy and deletes it.
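The heartbeat-based health check described above can be sketched as follows. This is a minimal illustration in Python (the actual service is .NET); all names and the 10-minute threshold used here are assumptions for illustration.

```python
import time

# VM deemed unhealthy after 10 minutes without a heartbeat (assumed threshold,
# matching the timeline below).
UNHEALTHY_AFTER_SECONDS = 10 * 60

class HealthTracker:
    """Hypothetical tracker of the last heartbeat seen from each VM."""

    def __init__(self):
        self.last_heartbeat = {}  # vm_id -> timestamp of last heartbeat

    def record_heartbeat(self, vm_id, now=None):
        self.last_heartbeat[vm_id] = now if now is not None else time.time()

    def unhealthy_vms(self, now=None):
        """Return VMs that have been silent longer than the threshold."""
        now = now if now is not None else time.time()
        return [vm for vm, ts in self.last_heartbeat.items()
                if now - ts > UNHEALTHY_AFTER_SECONDS]

tracker = HealthTracker()
tracker.record_heartbeat("vm-1", now=0)
tracker.record_heartbeat("vm-2", now=500)
# At t=700s, vm-1 has been silent for 700s > 600s, so it is flagged for deletion.
print(tracker.unhealthy_vms(now=700))  # ['vm-1']
```

Note that a check like this deletes VMs whenever heartbeats stop flowing, regardless of whether the VMs themselves are healthy, which is why the MPS-side outages below caused healthy VMs to be deleted.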
Impact start time: 04/10/2020 10:37 PDT
Mitigation time: 04/10/2020 23:00 PDT
10:35- An Ev2 rollout for Multiplayer Services (MPS) begins executing to update an expiring internal certificate.
10:37- MPS nodes restart because of the rollout.
10:43-11:22- VMs encounter exceptions when communicating with MPS. Consequently, heartbeats received by the MPS primary cluster drop to 0, while heartbeats received by the secondary cluster drop by 47%.
10:52- VMs that have not successfully sent heartbeats to MPS in the last 10 minutes are marked as unhealthy and begin to be deleted.
11:25- Networking exceptions in VMs subside.
11:40-13:15- MPS nodes show high CPU usage. The team's standard procedure in this situation is to restart the affected nodes so that traffic redistributes among the remaining ones. Several nodes in the primary cluster are restarted during this window with no complications.
14:40- An additional node in primary cluster is manually restarted due to its high CPU usage.
14:40-15:08- VM network exceptions spike again. Heartbeats received by the MPS primary cluster decrease by 83%, and the secondary cluster by 35%.
15:19- The system recovers. However, the team determines that the system is still at risk; scaling out MPS appears to be a way to prevent the incident from recurring.
18:25- Secondary MPS cluster is scaled out.
18:28-18:54- VM network exceptions spike again, causing heartbeats received by the MPS primary cluster to decrease by 44% and the secondary cluster by 93%.
18:38- Some VMs are marked as unhealthy due to lack of heartbeats.
19:00- Monitoring tools are deployed to some MPS primary cluster nodes to track network metrics.
19:29- Primary cluster is scaled out.
19:34- VM network exceptions spike again. Heartbeats in the primary cluster drop by 99% and the secondary by 38%. TCP resets and the number of client ports in use on the nodes also spike during this time.
19:45- VMs are marked as unhealthy and deleted due to extended time without heartbeats.
20:00-23:00- A memory dump taken from one of the MPS nodes reveals that a blocking call executed when new connections are established has caused thousands of threads to be created on the nodes. Further investigation shows that this call slowed down during the incident. For immediate mitigation, the blocking call is removed from the call path.
23:49- The new code is deployed to the primary and secondary clusters.
23:59- Incident is mitigated.
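The failure mode found in the memory dump can be sketched as follows. This is a minimal Python illustration (the actual service is .NET, and the specific call is not named in this report): a blocking call runs on every new connection, so while the call is slow, each in-flight connection pins its own thread and the thread count grows with the connection rate.

```python
import threading
import time

def slow_blocking_call():
    # Stands in for the unnamed call that slowed down during the incident.
    time.sleep(2)

def handle_new_connection(conn_id):
    slow_blocking_call()  # every connection blocks here before proceeding
    # ... heartbeat handling would continue here ...

# 100 connections arrive while the call is slow.
threads = [threading.Thread(target=handle_new_connection, args=(i,))
           for i in range(100)]
for t in threads:
    t.start()

# While the blocking call is in flight, every connection holds its own thread.
pinned = threading.active_count() - 1  # subtract the main thread
print(pinned)  # 100 threads pinned by 100 concurrent connections

for t in threads:
    t.join()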
Learnings and improvements
1. Monitoring of the network health of MPS will be improved to better diagnose problems in the future.
2. Synchronous calls in the hot path exacerbated the problem by overloading MPS. These will be made asynchronous by migrating to .NET Core.
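The synchronous-to-asynchronous change in item 2 can be sketched as follows, again in Python for illustration (the actual migration is to .NET Core): the slow dependency is awaited rather than blocked on, so many in-flight connections share one thread instead of pinning one thread each.

```python
import asyncio
import time

async def slow_call():
    await asyncio.sleep(0.1)  # same slow dependency, but now it yields

async def handle_connection(conn_id):
    await slow_call()  # the event loop is free to serve other connections
    return conn_id

async def main():
    # 100 concurrent connections complete in roughly 0.1 s total, on one thread.
    return await asyncio.gather(*(handle_connection(i) for i in range(100)))

start = time.monotonic()
results = asyncio.run(main())
elapsed = time.monotonic() - start
print(len(results))  # 100 connections handled, in well under a second
```

Under the blocking version, the same 100 connections would each hold a thread for the full duration of the slow call.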