On-demand Game Server Architecture On AWS

A cheaper alternative to running your high-scale multiplayer games

Bálint Biró
Towards AWS

--

Recently I’ve been working on a room-based multiplayer game built for WebGL, and I kept bumping into issues with running my server architecture on top of AWS GameLift. I’ve used GameLift before for a different game, but I’ve always found the documentation grueling to sift through and the service, all in all, lacking: deployments were slow and brittle, Terraform support is not great, and for the problems it solves, I found it a bit too pricey for an indie budget.

On top of that, the game will run on an https website (not much to see there just yet, it’s only set up as a test page), and since I’m using Mirror as my networking framework, I needed more control over my TLS certificates and domain mapping as well.

Lastly, the AWS .NET SDK is not designed to run in IL2CPP builds (believe me, I’ve tried to fix that, but I just ended up digging further down the rabbit hole), which meant I needed an alternative approach: an API that communicates with the rest of the services. My game client then just makes HTTPS calls to this API.

About The Game

Lunar League is a fast-paced, space-based multiplayer battle game. It combines elements from the well-known battle royale and .io genres. A game room can host up to 8 players. Each player has a base planet and controls a ship. By collecting scrap across the map, players gain levels that enhance both their ships and their bases. They can also find weapons and utilities to equip, further enhancing their battling prowess. Players respawn as long as their base stands; once the base is destroyed, they’re on their last life.

The last ship standing wins the game.

Game Design and Architecture

The way the game is designed dictates what the game server architecture should look like.

In a typical game server architecture, you have a bunch of machines running one or more processes that act as the game servers. Between the clients and the game servers there’s usually a middleman that takes requests from the clients and acts on them, say by reserving a game server and instructing it to launch a new game. This middleman can also be used to poll the status of the game server, or even update it if needed. It also acts as a relay between the client and the server, feeding the server instance’s address back to the client once it’s ready. This would also be the place to introduce matchmaking, if you want any.

Generally, in a setup like this, you should understand the seasonality of your game and autoscale capacity based on it (amongst other things). The downside is that you’ll always have idle instances running that are just waiting for new game requests, incurring costs. On the upside, serving a new game request is practically instantaneous, as the server process is already running by the time a new game is requested. Deployments can also get slightly tricky: to safely roll out a new server version, you’d likely need two game server pools and slowly switch traffic over from one to the other. You could also do canary deployments, but depending on the number of server processes you’re running, they can be tricky to get right.

Depending on your game design, this type of architecture might be necessary. For example, in games like Diablo 3 or Path Of Exile, it would be awkward if you had to wait 30–60 seconds for a rift/map to open up.

On the other hand, if you look at games like Brawlhalla or Dauntless, even when you request a new game (i.e. start matchmaking in Brawlhalla, or start a hunt in Dauntless), there’s quite a bit of time between you pressing start and the game actually starting. In Brawlhalla, you get into an offline bot game, while in Dauntless you can just walk around town until your game is ready. These kinds of setups can benefit heavily from an on-demand game server architecture, where instead of having multiple instances idling, you launch a new game server instance when the player actually requests it. This can cut hosting costs dramatically, at the cost of the player having to wait 30–60 seconds for their game to start.

Of course, there are many techniques in both cases that can mitigate the downsides; I’m just stating them here as baselines, because we need to work off of something.

The beauty of going with an on-demand game server architecture is that your whole game infra can essentially be pay-as-you-go. On AWS, you can use API Gateway + Lambda + DynamoDB + Cognito for your middleman layer, so you only pay for the backend API when players are actually playing, and you can launch spot instances at a fraction of the price of regular EC2 instances, on demand, in less than a minute. Of course, whether you can really use spot instances depends on your failure tolerance, as these instances can be reclaimed with only a few minutes’ notice at any time. In my personal experience working with spot instances across multiple high-volume backends, I’ve rarely had issues with them. That’s not to say issues can’t happen, so do your own research before committing to them.
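To make the on-demand part concrete, here is a minimal sketch of how a middleman layer might ask EC2 for a one-off spot instance with boto3. The AMI id, instance type, and the shape of the user data script are placeholder assumptions, not the values my setup actually uses.

```python
def build_launch_request(room_id: str, ami_id: str = "ami-0123456789abcdef0") -> dict:
    """Build the kwargs for boto3's ec2_client.run_instances() that request
    a one-off spot instance and pass the game room id in via user data.
    All concrete values here are illustrative placeholders."""
    user_data = "\n".join([
        "#!/bin/bash",
        f"echo 'GAME_ROOM_ID={room_id}' >> /etc/environment",
        # A real script would also pull and launch the latest server build.
    ])
    return {
        "ImageId": ami_id,
        "InstanceType": "t3.small",
        "MinCount": 1,
        "MaxCount": 1,
        # Ask for a spot instance that simply terminates on interruption,
        # instead of a full-price on-demand instance.
        "InstanceMarketOptions": {
            "MarketType": "spot",
            "SpotOptions": {
                "SpotInstanceType": "one-time",
                "InstanceInterruptionBehavior": "terminate",
            },
        },
        # boto3 base64-encodes UserData for run_instances automatically.
        "UserData": user_data,
    }

# The middleman Lambda would then call:
#   boto3.client("ec2").run_instances(**build_launch_request(room_id))
```

Separating the request construction from the actual API call also makes the launch logic trivially unit-testable without AWS credentials.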

Fun fact: in Lunar League, I’m considering adding a feature where, if an instance is notified that it’s about to be terminated, I’d start shrinking the map sooner than usual (think battle royales), forcing the game to end before the instance goes away.
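EC2 exposes pending spot interruptions through the instance metadata endpoint `/latest/meta-data/spot/instance-action`, which returns a small JSON body (and a 404 while no interruption is scheduled). A game server could poll it and trigger the early map shrink based on the time remaining; here is a rough sketch of the parsing half, with the polling loop itself left out:

```python
import json
from datetime import datetime, timezone
from typing import Optional

def seconds_until_interruption(instance_action_json: str, now: datetime) -> Optional[float]:
    """Parse the body returned by the EC2 spot instance-action metadata
    endpoint and return how many seconds remain before the instance is
    taken away, or None if the action isn't a stop/terminate."""
    payload = json.loads(instance_action_json)
    if payload.get("action") not in ("terminate", "stop"):
        return None
    deadline = datetime.strptime(payload["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()

# The server process would poll the metadata endpoint every few seconds;
# once this returns a value, start shrinking the map so the match ends
# before the instance disappears.
```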

The flow

The diagram above should give you a good idea of the overarching system architecture, but it doesn’t really show the exact flow of how a player gets from creating a game to joining one.

  1. Player visits the website (it’s a WebGL game) and launches the game.
  2. Inside the game client the player signs up/logs in (this can be pulled out if necessary and the session cookie can be injected in — this way you could avoid having to work with forms in your game client).
  3. The login/signup process is handled through HTTPS calls to the Middleman. For Lunar League the login API is a thin wrapper around AWS Cognito.
  4. Once logged in, the player creates a new game room. For now, the game uses game rooms for rapid testing; later on this might change and move completely to matchmaking. Because I’m using Mirror and its built-in Room system, creating the room actually means launching the EC2 instance. This means creating the room takes up the majority of the time, which is less than ideal. I plan to detach the two, though I’m not treating it as a high priority right now.
  5. To launch the room, the game client sends an HTTPS request to the Middleman, which requests a new spot instance. The spot instance is launched with a small user data script, through which I pass in the game room id and some other arbitrary info to bootstrap the game server. The Middleman saves some metadata about the newly created game room to DynamoDB for state keeping. The game room also comes with a TTL field, which will be important for the cleanup process.
  6. The EC2 instance launches, using a custom AMI with some very basic setup. The user data script injected in the previous step runs: it pulls the latest available production build, extracts it and launches it.
  7. The server process comes online and notifies the Middleman that it’s ready and awaiting connections. It tells the Middleman its IP address and the port of the game server process. Middleman saves this info to DynamoDB.
  8. In the meantime the game client is polling the Middleman to know when it can initiate the connection. Once it receives a READY status, it takes the IP and port from the response.
  9. Because I’m running WebGL on an https website, I can only establish the WebSocket connection over Secure WebSockets (wss). For this, I have a registered domain with a TLS cert (servers.lunarleague.io) that points to the name servers of sslip.io. Later I might run my own name server, but for now it’s good enough. This lets me look up arbitrary addresses, such as 172.168.1.59.servers.lunarleague.io, which simply resolves to the IP address 172.168.1.59. This way I don’t need to rely on Route53 to register subdomains/routes for my game servers; Route53 is pricey and slow.
  10. Connection is established to wss://172.168.1.59.servers.lunarleague.io, player joins the game room.
  11. Player who created the room launches the game. Players play and finish the game.
  12. When the game ends naturally, the game server notifies the Middleman that it’s finished. The Middleman removes the game state from DynamoDB, which triggers a DynamoDB Streams event. A Lambda function listening to this event cleans up the game server. If the game ends without a proper finish (e.g. all the players leave instead of finishing), DynamoDB cleans up the game state itself once the item’s TTL is hit, which then triggers the same cleanup process.
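Steps 9–10 boil down to a tiny bit of string building on the client side. A sketch, assuming the server also reports a port (the article’s example URL omits one) and using the article’s domain purely as an example value:

```python
def websocket_url(ip: str, port: int, domain: str = "servers.lunarleague.io") -> str:
    """Build the secure WebSocket URL for a freshly launched game server.

    An sslip.io-style wildcard DNS setup resolves <ip>.<domain> back to
    <ip> itself, so no per-server Route53 record is ever created.
    """
    # e.g. 172.168.1.59 + servers.lunarleague.io
    #   -> wss://172.168.1.59.servers.lunarleague.io:7777
    return f"wss://{ip}.{domain}:{port}"
```

Because the TLS cert covers the wildcard subdomains, the browser accepts the wss connection even though the hostname is really just an encoded IP address.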

That’s pretty much it. The whole lifecycle is self-contained, to ensure there are no dangling instances in the pipeline.
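The nice property of the cleanup path is that both a normal finish (the Middleman deletes the item) and a TTL expiry produce the same DynamoDB Streams REMOVE record, so a single Lambda covers both. A sketch of the event-parsing half, where the `instanceId` attribute name is an assumption about the table schema:

```python
def instances_to_terminate(event: dict) -> list:
    """Extract EC2 instance ids from a DynamoDB Streams event.

    Only REMOVE records matter here: they fire both when the Middleman
    deletes a finished game and when the item's TTL expires, so one
    handler covers both cleanup paths.
    """
    ids = []
    for record in event.get("Records", []):
        if record.get("eventName") != "REMOVE":
            continue
        # OldImage holds the deleted item in DynamoDB's typed format,
        # e.g. {"instanceId": {"S": "i-0abc..."}}.
        old_image = record.get("dynamodb", {}).get("OldImage", {})
        instance_id = old_image.get("instanceId", {}).get("S")
        if instance_id:
            ids.append(instance_id)
    return ids

# The Lambda handler would then call:
#   boto3.client("ec2").terminate_instances(InstanceIds=ids)
```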

There are still many ways to improve this. For example, I’m certain I can shave off some of the bootstrapping time by optimizing the AMI and the launch script. The lobby will need to be detached from the game server process. I’m exploring using WebSockets through API Gateway for this. I’ll post an update on the progress if I get there.

The infra that’s not managed by the Middleman is managed through Terraform. This way I have multiple identical environments set up. All of this allows me to test against very real scenarios for practically no cost. Running a game server literally costs a few cents.

Hope you found some value in this. I’m looking to write more about the different components in depth, though I’m uncertain if I’ll have the time.

Do get in touch if you think it would be interesting for you. I’m also curious to hear your thoughts on this matter.

Credits

Thanks to my friend and colleague, Conor Maher, for pointing me in the right direction with xip.io, which led me to finding sslip.io.
