At LMAX Exchange, we generate a huge amount of market data. At the moment, we’re putting together some new applications that need to be able to access this data. This post will briefly explain one facet of this work, and, in particular, a couple of interesting bugs we had to fix along the way.
Two important requirements conspired to make this trickier than usual:
these applications must not be able to communicate back pressure to the trading exchange
these applications should, in normal operation, have access to data that is within a few seconds of real time
The existing market data store
We already have a reliable on-disk journal of all of our market data that is written to in a timely fashion. We’ve also put in some equipment to allow us to join the stream of events the journal experiences in real time.
To cope with the "no back pressure" requirement, we may be disconnected from that stream at any point (if the stream sender detects any sign of not keeping up it will terminate our session).
Unfortunately, this market data stream is stateful, so if we get disconnected, we need to rejoin at the point we last saw a message. The source application can only keep a small buffer of the stream in memory, so we’re going to have to replay from disk until we’re back in range of the live buffer, and then rejoin it.
Journal following peril
The journal writer serializes market data events using the same equipment we use to stream them over the network – it just pops them in a file instead. Once a file reaches a certain size, we move to a new file and start a job in the background to compress the newly completed file.
Our replay solution looks a bit like this:
10 receive replay request for timestamp
20 locate files after timestamp
30 replay those files
40 attempt to rejoin live
50 if unsuccessful, goto 20, but with timestamp = last replayed timestamp
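That loop could be sketched in Java roughly as follows (the interface names `JournalLocator`, `Replayer` and `LiveStream` are hypothetical stand-ins for illustration, not the real LMAX classes):

```java
import java.util.List;

// Hypothetical sketch of the replay loop above.
interface JournalLocator {
    List<String> filesAfter(long timestamp);          // step 20
}

interface Replayer {
    long replay(List<String> files);                  // step 30; returns last replayed timestamp
}

interface LiveStream {
    boolean tryRejoin(long lastReplayedTimestamp);    // step 40
}

final class ReplayLoop {
    static void run(long requestedTimestamp,
                    JournalLocator locator, Replayer replayer, LiveStream live) {
        long timestamp = requestedTimestamp;          // step 10
        while (true) {
            List<String> files = locator.filesAfter(timestamp);
            long lastReplayed = replayer.replay(files);
            if (live.tryRejoin(lastReplayed)) {
                return;                               // back in range of the live buffer
            }
            timestamp = lastReplayed;                 // step 50: go round again
        }
    }
}
```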
We put this together. All our tests pass. We try it in a real environment and quickly encounter an issue.
We had, in our tests, conveniently made the gzipping of journal files synchronous. Unfortunately this was not true in reality, so it was perfectly possible to see a bundle of files like the following:
journal-1.raw.gz
journal-2.raw.gz
journal-3.raw.gz   <-- gzip in progress, partial file
journal-3.raw      <-- left in place until gzip complete
journal-4.raw      <-- currently live file
For a while we considered trying to be clever and working out which gzips were in progress. We played with some code that did that, but it grew unwieldy, so we tried something else. How about gzipping to a file extension that replay would not consider, and performing a more atomic-looking rename only once the file was complete? Simple, and, we thought, effective.
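The rename trick could be sketched like this (the `.inprogress` extension and file layout are illustrative assumptions, not the real naming scheme):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPOutputStream;

final class SafeGzip {
    // Compress 'raw' under a temporary extension that replay ignores,
    // then rename into place only once the gzip is complete.
    static Path gzip(Path raw) throws IOException {
        Path inProgress = raw.resolveSibling(raw.getFileName() + ".gz.inprogress");
        Path finished = raw.resolveSibling(raw.getFileName() + ".gz");
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(inProgress))) {
            Files.copy(raw, out);
        }
        // A same-directory move on a POSIX filesystem is atomic: replay
        // either sees no .gz file, or a complete one, never a partial one.
        Files.move(inProgress, finished, StandardCopyOption.ATOMIC_MOVE);
        return finished;
    }
}
```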
We triumphantly deployed this fix, and then immediately ran into a slightly knottier issue – one of our pieces of safety equipment detected a gap in sequence numbers during replay.
We stuck in some extra debugging information to help us out, and it only confused us further. The sequence gaps were within a file, not between files, and when we examined that file afterwards we found no sequence gaps in it. Usefully though, we did discover that we only had this issue when we were looking at the currently active file.
We had once again run into an issue that our single threaded tests could tell us nothing about. We looked deeper into the workings of the journaller and found our answer.
The inner workings of the journaller
Like most code on the hot path at work, the journaller has been put together thoughtfully. It deals with market data events in small batches (usually between one and twenty messages), and at the end of each batch, those events are guaranteed to be on disk.
Internally, it manages this by maintaining a 4KB buffer and writing the entirety of that buffer to a memory-mapped file at the end of every batch. It also calls ‘sync’ at the end of each batch to ensure the bytes have reached the disk. When the block is full (or an event arrives that will not fit in the remaining space), the buffer is emptied and a new block is begun.
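A toy version of that block-at-a-time scheme might look like this (a sketch only: the real journaller writes to a memory-mapped file and syncs, which we elide here):

```java
import java.nio.ByteBuffer;

// Illustrative sketch of the journalling scheme described above: a fixed
// 4KB block with length-prefixed messages, rewritten in full at the end
// of each batch. Unused space stays zeroed.
final class BlockWriter {
    static final int BLOCK_SIZE = 4096;
    private final ByteBuffer block = ByteBuffer.allocate(BLOCK_SIZE);

    // Returns false if the message will not fit in the remaining space,
    // signalling the caller to sync this block and begin a new one.
    boolean write(byte[] message) {
        if (block.remaining() < Integer.BYTES + message.length) {
            return false;
        }
        block.putInt(message.length);   // length prefix
        block.put(message);
        return true;
    }

    // Stand-in for "write the whole 4KB block to the mapped file and sync":
    // here we just hand back a copy of the block's bytes.
    byte[] snapshot() {
        byte[] copy = new byte[BLOCK_SIZE];
        System.arraycopy(block.array(), 0, copy, 0, BLOCK_SIZE);
        return copy;
    }
}
```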
Now, the journal reader was originally built to cope with completed journal files, and so as soon as it sees any part of a block that looks empty (messages are written length prefixed, so any 0 length message is treated as end of block), it skips to the next block.
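The reader’s end-of-block rule can be sketched as follows (again an illustrative stand-in, not the real reader):

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Sketch of the rule described above: messages are length-prefixed, and a
// zero length prefix means "the rest of this block looks empty", so the
// reader stops and skips to the next block.
final class BlockReader {
    static List<byte[]> readBlock(byte[] block) {
        ByteBuffer buf = ByteBuffer.wrap(block);
        List<byte[]> messages = new ArrayList<>();
        while (buf.remaining() >= Integer.BYTES) {
            int length = buf.getInt();
            if (length == 0) {
                break;   // treated as end of block
            }
            byte[] message = new byte[length];
            buf.get(message);
            messages.add(message);
        }
        return messages;
    }
}
```

Note that this rule is only sound for completed blocks, which is exactly where the trouble comes from below.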
So, what’s our bug then? It’s the following sequence of events:
Journal writer thread          | Journal reader thread
writes message to block 1      |
writes message to block 1      |
writes message to block 1      |
syncs block 1 to filesystem    |
writes message to block 1 (*)  | starts reading block 1
writes message to block 1 (*)  | reads all of block 1 into its own buffer
syncs block 1 to filesystem    | processing block 1
block 1 ends                   | ...
writes message to block 2      | ...
syncs block 2 to filesystem    | ...
...                            | reads all of block 2 into its own buffer
In that sequence, the reader misses the events annotated with (*) – it reads the block before those messages have been sync’d to the disk, sees what looks like an empty block remainder there, and by the time it is ready to read the next block, some messages are there for it to read, so it continues, and immediately discovers it has a sequence gap.
A few quick and unsafe fixes suggest themselves – can we detect if we’re reading the last block of a file? Can we read the block again until some condition is met? We thought about these and eventually discarded them. Instead, we introduced an extra piece of information to the journal reader – that of the maximal sequence that is currently available for read, defined as the last sequence written to the last complete block.
As soon as the journal reader reaches this sequence it assumes it is ‘near-live’ and attempts to rejoin the in-memory stream. Given the single threaded nature of the journal writer, informing the reader what is ‘safe’ is just a matter of updating a volatile long – nice and cheap.
We shipped this rather less triumphantly and discovered, to our immense relief, that we were now free of obvious bugs. Now we can get back to actually working with this data!
All of these bugs fell into the gaps where we had explicitly decided "that case is too tricky to put tests around, because it involves interactions across multiple racing threads, and tests like that invariably end up being intermittent and of low value".
Now, that’s not a bad plan, assuming you already trust the equipment that’s managing the inter-thread communication. We didn’t think too much about this in our case because we had picked a hidden mechanism for communicating – the on-disk filesystem. While it can be sensible to let a database handle matters of concurrency for you, allowing a custom binary reader/writer pairing to do so via a bundle of files is perhaps less of a good idea.
With hindsight we’d probably have refactored a tiny working example of the reading/writing code out of the application and run it with the same threading model. We could then have tested this exploratively in situ, and those tests might, if they were sound enough, have made it into the build somehow (they’d become ‘checks’ if you believe James Marcus Bach’s creed).
At some point, we must choose to stop testing. Once that decision is made, any ‘untested’ behaviour has to be protected and documented in some other way, or it is very easily broken, or forgotten; not least by the original developers, who may be unpleasantly surprised when that untested behaviour turns out to harbour a bunch of gnarly bugs.