Big Memory, Part 2

Author’s Note: This is part 2 of a series of posts about my adventures in building a “large”, in-memory hash table. Part 1 introduced our goals and our approach to the task at hand. This post is a summary of some of the research I’ve done to familiarize myself with the problem.


Our current approach to custom attribution models is very simple: pay Amazon thousands of dollars a month and “do it in the cloud” with Elastic MapReduce. In Hadoop, we partition the data by user, sort by time, identify their conversion events, and run an attribution model on these conversion-terminated “chains” of events. This is both costly and more cumbersome than we’d like.

A faster, cheaper, and arguably more transparent approach might be to pipe the live events to a service that could buffer and assemble these chains in memory and output “completed” chains (when a conversion event arrived) to a separate service to do the model computation.

We’ve come to the conclusion that a large in-memory hash table could be suitable to the task. Our specifications for said hash table are:

  • 1.5 billion 64-bit keys, uniformly and randomly distributed
  • values between 16 bytes and 16 kilobytes
  • deployed to one machine, all in-memory
  • sustained 200,000 updates per second over the course of a 14-hour “internet day”

Before jumping into building one of these, I thought I’d learn a bit about hash tables themselves.

The Research

Naturally, my research began with Wikipedia. The article on hash tables is a fairly comprehensive overview. From there, I read handful of papers and articles to dig a little deeper. Below are a selection that helped me immensely.

Dynamic Hashing

  • Fagin’s Extendible Hashing – A Fast Access Method for Dynamic Files

    One of those seminal IBM papers that everyone seems to reference. It provides some interesting historical context for the introduction of dynamic hashing. The central thesis is an insight as impressive now as it was then: by separating the hash space from the storage addressing space, a hash table can be made incrementally extendible.

  • Seltzer and Yigit’s A New Hashing Package for UNIX

    An older overview of hashing algorithms for use in and out of main memory. Includes exquisite insight into the implementation concerns the authors took into account while building a general hashing library for UNIX.

  • Rathi, Lu, and Hedrick’s Performance Comparison of Extendible Hashing And Linear Hashing Techniques

    An old but very useful comparison of linear and extendible hashing that demonstrates certain periodic performance characteristics that may make one or the other unsuitable for your application.

  • Baeza-Yates and Soza-Pollman’s Analysis of Linear Hashing Revisited

    A theoretical analysis of different overflow control functions in linear hashing. Lots of math, but very clearly demonstrates the differences between global versus local overflow resolution functions and the impact of page sizes.

Hash Functions

  • Jenkins Hash

    A solid general-purpose hash whose source and documentation are a masterwork of explication and thoroughness.

  • (Minimal) Perfect Hashing: some theory, some practice

    For when you have all of your keys ahead of time and want 100% occupancy.

Collision Resolution

  • Pagh and Rodler‘s Cuckoo Hashing

    The original cuckoo hashing paper that compares cuckoo hashing against chaining methods and linear probing. Includes a nice section at the end recapping earlier hashing schemes and their historical context.

  • Erlingsson, Manasse, and McSherry’s A Cool and Practical Alternative to Hash Tables

    They present an empirical analysis of parametrized cuckoo hashing (in number of hash functions and bucket size). There’s an interesting bit at the end discussing dynamic expansion by adding bins and/or new hash functions.

  • Lehman and Panigrahy’s 3.5-Way Cuckoo Hashing for the Price of 2-and-a-Bit

    Describes a parametrized cuckoo hashing scheme with overlapping bins. Improves likely utilization by several percent.

  • Herhily, Shavit, and Tzafrir’s Hopscotch Hashing

    Incorporates ideas from linear probing, cuckoo, and chaining techniques to avoid any of their individual pitfalls.

  • Panigrahy’s Efficient Hashing with Lookups in two Memory Accesses

    Provides a lucid graph-theoretic description of the problem of collision resolution. The solution proposed is all-theory, so don’t bother looking for a practical result therein.

  • Dietzfelbinger and Schellbach’s On Risks of Using Cuckoo Hashing with Simple Universal Hash Classes

    Discusses the unsuitability of certain linear and multiplicative hash functions for use with cuckoo hashing, using a graph theoretic argument.


After my academic explorations, I started to look for candidate data stores. In doing this, I began digging into the history of key-value stores and hash table implementations. A few things jumped out immediately:

  • Engineering effort seems to have been diverted from hash table development to distributed hash table development, in the past 5 years.
  • Dynamic hashing innovation seems to have stopped at linear and extendible hashing.
  • No benchmarks I ran into exceeded 100M insertions. In fact, this benchmark is the only one that I found that exceeded 10M insertions.

The first seems obvious given the meteoric rise in data captured from the web and the relatively fixed decrease in RAM price and increase in density. With dozens or hundreds of terabytes of “online” data, one can hardly be blamed for steering toward mid-range commodity servers en masse. However, this approach comes at a cost: coordinating and maintaining a cluster of servers is no mean feat. In fact, I consider the consensus and commitment protocols that make said coordination possible significantly more challenging to understand, let alone implement, than any of the hashing subjects mentioned above. (Just look at the Wikipedia entry for Paxos!) Similarly, hot-node issues and debugging distributed systems strike me as being an order of magnitude harder to solve than the problem of building a “better” hash table. To be clear, I’m not arguing that these two things solve the same problems. Rather, given the choice of implementing a “huge”, performant hash table in memory or the algorithms to support a clustered solution, I would choose the former.

Despite the fact that the progress of Dynamo, Cassandra, Riak, and Voldemort took most of the headlines from 2005 to 2010, work still progressed on in-memory and disk-based non-distributed hash tables like Tokyo Cabinet and Kyoto Cabinet, Redis, and even the venerable Berekely DB. (If you’re at all interested in the history of “NoSQL” data stores, you should check out this handy timeline.) That said, little in terms of novel hash table technology came from these efforts. As far as I know, BDB still uses a variant of linear hashing, Redis uses standard chaining, and Kyoto Cabinet falls back on std::unordered_map for its in-memory hash table.

This brings us to the other two points: indeed, how could traditional hash table development cease (practically) in light of the advances of DHTs? With “web-scale” data sets even a single node’s data storage needs should easily exceed anything seen 5 years prior, right? In fairness, some work has been done in the last few years to add concurrency to linear hashing as well as some work on optimizing hash table algorithms for modern cache hierarchies, but this doesn’t feel like the same kind of fundamental, basic result as, say, the introduction of extendible hashing.

I suppose the fact that there has been little visible engineering progress on this front is a testament to the quality of the existing algorithms and code. Either that or existing workloads have not yet exceeded that high watermark of 100M entries and we’re just waiting for the next jump to inspire new work in the field.

Next post: a roundup of existing candidates, benchmarks, and observations about their ease-of-use.

Big Memory, Part 1

Author’s note: This will be the first of a series of posts about my adventures in building a “large”, in-memory hash table. This first post will focus on a few philosophical notes that inspired this adventure. Research summaries, benchmarks, engineering notes, and so on will follow in future posts.


A few years ago, I recall being flabbergasted when I was told that Google had deployed a Perforce server with 256GB of RAM. Production machines at my then job had 16GB of RAM and I had certainly heard of 32- and 64-GB boxes, but 256GB struck me as an unthinkable amount. Our whole production database in RAM twice over! Wham! Pow! Smack!

Fast-forward to a month ago when I was told that we had two “leftover” boxes with a dozen cores and 256GB of RAM each. Impressive, yes, but a pleasant surprise at best. How the times have changed!


The availability of the hardware got Rob and I thinking about novel things we could do in RAM. After some brainstorming, we came up with some basic tenets that should guide our exploration of the space.

We’re not in the business of saving lives.

We track ads online. Lots of them. Not all components of our system require perfect uptime and not all of our data has to be perfectly accurate. I think perhaps this scenario is more common than many are willing to admit or embrace, especially in the analytics community. My main beef with MapReduce is the a priori necessity of examining every last piece of data. Throw out what doesn’t matter! Live a little!

That said, “in-memory” does not mean unstable or lossy.

If your data fits in memory and you can easily reconstruct your data store by replaying the input stream, there’s really no reason to dismiss a volatile design. Hell, the extra speed you’re likely to pick up with an in-memory design can actually make your recoveries quicker than with a persistent solution. By the same token, writing a persistence layer for a data store is arguably the most complicated part. (See these two posts, for instance.) Mitigate the volatility of an in-memory solution by going back to the simplicity, transparency, and composability espoused by the Unix philosophy.


One thread does all the reading and writing, all in-memory, with only one I/O format: protobuf messages over 0MQ. See how fast that baby can go and how big she can grow before you get any fancier. Sure I could wave my hands about all kinds of fancy things like context switching, but that’s not the justification here. We’re really trying to test the limits of a relatively simple computation model, without working around it: how much can you do with a fast processor and gobs of RAM?

Benchmark with the future in mind.

Test at capacities that significantly exceed current needs. Push the envelope well past what roughly similar projects are doing. Stress it until it genuinely breaks.


Since Rob has already started tackling our aggregation bottlenecks with the Summarizer (which he will surely write more about soon, nudge,nudge), I decided to try my hand at our custom attribution problem. We need a way to store user interaction streams and run attribution models over them quicker than in Hadoop, but not quite “in real-time”.

Practically, the problem amounts to storing a billion or so randomly- and uniformly- distributed 64-bit integer keys with a Zipfian distribution of values ranging between 16 bytes and 16 kilobytes of structured data. The combination of an extremely heavy write workload, zero key locality, and no durability requirements points to an in-memory hash table.

Next post, I’ll cover the research I did and am doing to familiarize myself with the problem of large, in-memory hash tables.