Author’s note: This will be the first of a series of posts about my adventures in building a “large”, in-memory hash table. This first post will focus on a few philosophical notes that inspired this adventure. Research summaries, benchmarks, engineering notes, and so on will follow in future posts.
A few years ago, I recall being flabbergasted when I was told that Google had deployed a Perforce server with 256GB of RAM. Production machines at my job at the time had 16GB of RAM, and I had certainly heard of 32- and 64-GB boxes, but 256GB struck me as an unthinkable amount. Our whole production database in RAM twice over! Wham! Pow! Smack!
Fast-forward to a month ago when I was told that we had two “leftover” boxes with a dozen cores and 256GB of RAM each. Impressive, yes, but a pleasant surprise at best. How the times have changed!
The availability of the hardware got Rob and me thinking about novel things we could do in RAM. After some brainstorming, we came up with some basic tenets to guide our exploration of the space.
We’re not in the business of saving lives.
We track ads online. Lots of them. Not all components of our system require perfect uptime and not all of our data has to be perfectly accurate. I think perhaps this scenario is more common than many are willing to admit or embrace, especially in the analytics community. My main beef with MapReduce is the a priori necessity of examining every last piece of data. Throw out what doesn’t matter! Live a little!
That said, “in-memory” does not mean unstable or lossy.
If your data fits in memory and you can easily reconstruct your data store by replaying the input stream, there’s really no reason to dismiss a volatile design. Hell, the extra speed you’re likely to pick up with an in-memory design can actually make your recoveries quicker than with a persistent solution. By the same token, writing a persistence layer for a data store is arguably the most complicated part. (See these two posts, for instance.) Mitigate the volatility of an in-memory solution by going back to the simplicity, transparency, and composability espoused by the Unix philosophy.
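To make the replay argument concrete, here is a minimal sketch (illustrative only; the function names are hypothetical, not our actual system): a volatile store is just the fold of its input stream, so recovery is nothing more than replaying that stream from the beginning.

```python
# Illustrative sketch: a volatile key-value store whose entire state
# can be reconstructed by replaying the input stream. All names here
# are hypothetical, not taken from our actual system.

def apply_event(store, event):
    """Apply one (key, value) update to the in-memory store."""
    key, value = event
    store[key] = value

def rebuild(event_stream):
    """Recover from a crash by replaying the stream from scratch."""
    store = {}
    for event in event_stream:
        apply_event(store, event)
    return store

# Simulate a crash: the live dict is gone, but the input stream
# (e.g., a log on disk or an upstream queue) can be replayed.
events = [(1, "a"), (2, "b"), (1, "c")]
recovered = rebuild(events)
print(recovered)  # later writes win: {1: 'c', 2: 'b'}
```

Since replay is a pure in-memory loop, recovery speed is bounded by how fast you can re-read the stream, which is exactly the "quicker recoveries" point above.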
One thread does all the reading and writing, all in-memory, with only one I/O format: protobuf messages over 0MQ. See how fast that baby can go and how big she can grow before you get any fancier. Sure, I could wave my hands about all kinds of fancy things like context switching, but that's not the justification here. We're really trying to test the limits of a relatively simple computational model, without working around it: how much can you do with a fast processor and gobs of RAM?
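The single-threaded model can be sketched in a few lines. This is an assumption-laden toy, not our implementation: a stdlib queue stands in for the 0MQ socket and plain tuples stand in for protobuf messages, but the shape is the same — one loop owns the hash table and handles every request in arrival order, so no locks are needed.

```python
import queue

# Toy sketch of the single-threaded model. A stdlib queue stands in
# for the 0MQ socket, and (op, key, value) tuples stand in for
# protobuf messages; names are hypothetical.

def serve(requests, table):
    """Drain the request queue. This is the ONLY code touching `table`,
    so reads and writes need no synchronization."""
    replies = []
    while True:
        try:
            op, key, value = requests.get_nowait()
        except queue.Empty:
            break
        if op == "put":
            table[key] = value
            replies.append(("ok", key, None))
        elif op == "get":
            replies.append(("ok", key, table.get(key)))
    return replies

requests = queue.Queue()
requests.put(("put", 42, b"payload"))
requests.put(("get", 42, None))
# The put is acknowledged, then the get returns the stored value.
print(serve(requests, {}))
```

Everything else — batching, smarter table layouts, multiple I/O threads — is deliberately deferred until this simple loop is measured to its breaking point.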
Benchmark with the future in mind.
Test at capacities that significantly exceed current needs. Push the envelope well past what roughly similar projects are doing. Stress it until it genuinely breaks.
Since Rob has already started tackling our aggregation bottlenecks with the Summarizer (which he will surely write more about soon, nudge, nudge), I decided to try my hand at our custom attribution problem. We need a way to store user interaction streams and run attribution models over them faster than Hadoop allows, but not quite in "real time".
Practically, the problem amounts to storing a billion or so randomly and uniformly distributed 64-bit integer keys with a Zipfian distribution of values ranging between 16 bytes and 16 kilobytes of structured data. The combination of an extremely heavy write workload, zero key locality, and no durability requirements points to an in-memory hash table.
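A quick back-of-envelope check suggests this workload does fit the hardware. The per-entry overhead and average value size below are assumptions for illustration (with a Zipfian size distribution most values sit near the 16-byte floor, so a 100-byte mean is a guess, not a measurement):

```python
# Back-of-envelope sizing under assumed parameters, not measured data.

KEYS = 1_000_000_000   # "a billion or so" entries
KEY_BYTES = 8          # 64-bit integer key
SLOT_OVERHEAD = 8      # assumed per-entry pointer/metadata cost
AVG_VALUE_BYTES = 100  # assumed mean of the Zipfian value sizes

total = KEYS * (KEY_BYTES + SLOT_OVERHEAD + AVG_VALUE_BYTES)
print(f"{total / 2**30:.0f} GiB")  # roughly 108 GiB
```

Under those assumptions the table lands around 108 GiB — comfortably inside a 256GB box, with headroom left for a fatter value distribution or a higher load factor.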
Next post, I’ll cover the research I’ve done, and am still doing, to familiarize myself with the problem of large, in-memory hash tables.