Starting 05/30 01:35 PDT, new virtual machines to fulfill Multiplayer Server demand were not being created in Japan East. This eventually caused allocation issues because standby pools did not refill.
The Multiplayer Server control plane uses an array of Orleans grains to orchestrate server builds. A typically benign restart of a grain silo uncovered an Orleans bug in which other silos did not recognize the restarted node. The issue was mitigated at 11:00 PDT by restarting all nodes serving Japan East.
Short-term (June) repair actions include:
1. Fixing the Orleans issue so that the control plane is more resilient to node restarts
2. Increasing the sensitivity of our alerts so that single-region issues are escalated to engineering more rapidly
Long-term repair actions (this quarter) include:
3. Providing monitoring tools for Multiplayer Server regions in Game Manager