How We Process Replays
We’re finally releasing HSReplay.net into public beta! With that, I wanted to give a short overview of how replays are ingested and processed behind the scenes.
The flow, in a nutshell:
- A game is played with logging enabled.
- The `Power.log` file for the game is uploaded, along with a bunch of metadata about the game.
- An “upload request” is created, and its URL is returned to the original client.
- The upload request is processed into an HSReplay file and a `GameReplay` DB object.
- The replay’s URL is created and the upload request now redirects to it.
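To make the shape of that flow concrete, here is a rough, hypothetical sketch of what the upload could look like from a client’s point of view. The endpoint, field names and token handling are illustrative assumptions, not our actual API.

```python
# Hypothetical client-side sketch of the upload flow; the endpoint and field
# names are placeholders, not the real HSReplay.net API.
import requests

def upload_game(log_path: str, metadata: dict, api_token: str) -> str:
    with open(log_path, "rb") as f:
        resp = requests.post(
            "https://hsreplay.example/api/upload",  # placeholder endpoint
            headers={"Authorization": f"Token {api_token}"},
            data=metadata,              # the game's metadata
            files={"file": f},          # the Power.log file itself
        )
    resp.raise_for_status()
    # The server answers immediately with the upload request's URL.
    return resp.json()["url"]
```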
Considerations
Our two main goals when architecting this were speed and reliability.
The simplest version of the flow would look like `Upload game with metadata -> Return processed game URL`.
However, replay processing is an operation which can take some time - parsing the log, creating the database
object, etc. So to prevent the server from going down under any kind of load, we decided to use
Amazon Lambda.
Using Lambda, we offload replay processing to separate machines which are spun up on demand - each machine
gets a single replay to process.
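As a rough illustration of that model, a per-replay Lambda handler could look something like the sketch below. This is not our actual processing code; the event shape, bucket layout and the processing step itself are placeholders.

```python
# Hypothetical sketch of a per-replay Lambda handler: each invocation gets
# exactly one uploaded Power.log to process. Event fields and the processing
# step are placeholders, not the real HSReplay.net code.
import boto3

s3 = boto3.client("s3")

def process_replay(raw_log: bytes) -> str:
    """Placeholder for the real work: parse the log, build the HSReplay file,
    create the GameReplay object and return its URL."""
    raise NotImplementedError

def handler(event, context):
    bucket = event["bucket"]    # assumed event shape
    key = event["log_key"]

    raw_log = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    replay_url = process_replay(raw_log)
    return {"replay_url": replay_url}
```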
Now, we had to separate the concept of an “upload request” (the input) from the “game replay” (the output).
In the first version of this flow, when the game is uploaded, we store the file and the metadata in the `UploadEvent`
object. This returns immediately with a unique upload URL. A new Lambda is then spun up to process that replay;
once that completes, the original upload is then linked up to the resulting game, so that the URL now redirects
to the correct game.
If processing fails for whatever reason, it is instead updated to show the relevant error.
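In Django terms, the split looks roughly like the sketch below. The models and the view are heavily simplified guesses at the shape of the real code, not the actual schema.

```python
# Simplified, hypothetical Django sketch of the UploadEvent/GameReplay split.
# Field names and the URL scheme are illustrative, not the real schema.
from django.db import models
from django.http import HttpResponse
from django.shortcuts import get_object_or_404, redirect

class GameReplay(models.Model):
    shortid = models.CharField(max_length=22, unique=True)

class UploadEvent(models.Model):
    shortid = models.CharField(max_length=22, unique=True)
    log_file = models.FileField()
    metadata = models.TextField()
    # Filled in once processing completes (or fails).
    game = models.ForeignKey(GameReplay, null=True, on_delete=models.SET_NULL)
    error = models.TextField(blank=True)

def upload_detail(request, shortid):
    upload = get_object_or_404(UploadEvent, shortid=shortid)
    if upload.game:
        # Processing is done: the upload URL now redirects to the replay.
        return redirect(f"/replay/{upload.game.shortid}")
    if upload.error:
        # Processing failed: surface the relevant error instead.
        return HttpResponse(upload.error, status=500)
    # Still being processed.
    return HttpResponse("This upload is still being processed.", status=202)
```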
This design allowed us to easily re-process replays directly from their original upload. This is useful if,
for example, there is a bug in the parser which generated corrupt files.
On top of that, it allowed us to return URLs without having to wait for the whole game to be generated. We did
not want to share URLs between `UploadEvent` and `GameReplay`, seeing as uploads can fail or, in the case of
de-duplicated uploads, link to existing games.
Database-less design
When we ran our initial load tests against the site, we found that at peak database capacity, uploads could not be saved to the database and were therefore rejected. Upgrading the database is easy, but we wanted to do better: database migrations, maintenance and the like can interfere with uploads, and that means losing games while the service is down. So we started working on a new upload flow.
We came up with a way to offload the initial upload request entirely to S3, which we were already using to store our uploads and replays:
- The client initially sends an upload request to Amazon’s API Gateway. This upload request contains only the authentication credentials and the game’s metadata.
- The Gateway triggers a minimal Lambda which losslessly saves the upload request to a descriptor. It then generates an S3 PUT URL, which the client can use to upload the game’s log directly to the S3 bucket. It also generates an initial `shortuuid`, which is used as the upload’s unique ID from this point on. (A rough sketch of this first Lambda follows the list.)
- The S3 upload triggers a notification, which spins up a second Lambda. This Lambda validates the metadata against our API and creates the `UploadEvent` in the database accordingly. If this succeeds, the initial S3 metadata is deleted. If it fails, it is instead moved to a “failed” prefix, allowing us to inspect the issue.
- Another notification is triggered to spin up a third Lambda, which takes the UploadEvent’s log file and parses it, creating the final HSReplay file and the `GameReplay` object. The initial UploadEvent is updated with that game’s URL, as before.
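As referenced in the second step above, a hedged sketch of that first, minimal Lambda might look like this. The bucket name, key layout and URL scheme are assumptions for illustration; the real implementation differs.

```python
# Hypothetical sketch of the first, minimal Lambda: persist the raw upload
# request to S3, mint a short unique ID, and hand back a pre-signed PUT URL
# for the Power.log. Bucket name, key layout and URL scheme are placeholders.
import json

import boto3
import shortuuid

s3 = boto3.client("s3")
BUCKET = "replay-uploads"  # placeholder bucket name

def handler(event, context):
    shortid = shortuuid.uuid()

    # Losslessly save the upload request (credentials + metadata) as a descriptor.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"raw/{shortid}/descriptor.json",
        Body=json.dumps(event),
    )

    # Pre-signed URL the client will use to PUT the log straight into S3.
    put_url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": BUCKET, "Key": f"raw/{shortid}/power.log"},
        ExpiresIn=600,
    )

    return {
        "shortid": shortid,
        "put_url": put_url,
        # The "promise" URL returned to the client right away.
        "url": f"https://hsreplay.example/uploads/{shortid}/",
    }
```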
Suddenly, the site was a lot more reliable. Every step of this process fails gracefully, without incurring any data loss. Every failed step can be reprocessed. Processing can even be paused should we need to go into maintenance mode: replays will simply accumulate in the S3 bucket and can be processed later on.
Lastly, creating the upload’s ID early on means we decide on a unique URL extremely early in the process. We return that URL to the client before the log file has even been uploaded, in the form of a “promise”. This allows uploads to feel extremely fast from the client’s perspective, as they can start invisibly early on (e.g. during the end of the game) and only be surfaced to the end user once all of the game’s animations have completed. This is what we do in HDT; by the time we show the notification, the game has generally already been processed, making uploads feel completely instantaneous from the player’s perspective.
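For illustration, a client could treat that promised URL roughly like the sketch below; the polling interval, timeout and status handling are assumptions, not what HDT (a C# application) actually does.

```python
# Hypothetical sketch of consuming the "promise" URL: poll the upload URL
# until it starts redirecting, which means the replay has been processed.
import time

import requests

def wait_for_replay(upload_url: str, timeout: float = 30.0) -> str:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Don't follow redirects; a redirect means processing is finished.
        resp = requests.get(upload_url, allow_redirects=False)
        if resp.is_redirect:
            return resp.headers["Location"]  # the final replay URL
        time.sleep(1)
    raise TimeoutError("Replay was not processed before the timeout")
```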
You can read about how we improved our Power.log parsing speed to achieve that: Fast Hearthstone Log Parsing
And if you want to see the code, check us out on GitHub!
Jerome