*Author’s Note: This is part 2 of a series of posts about my adventures in building a “large”, in-memory hash table. Part 1 introduced our goals and our approach to the task at hand. This post is a summary of some of the research I’ve done to familiarize myself with the problem.*

### Background

Our current approach to custom attribution models is very simple: pay Amazon thousands of dollars a month and “do it in the cloud” with Elastic MapReduce. In Hadoop, we partition the data by user, sort by time, identify their conversion events, and run an attribution model on these conversion-terminated “chains” of events. This is both costly and more cumbersome than we’d like.

A faster, cheaper, and arguably more transparent approach might be to pipe the live events to a service that could buffer and assemble these chains in memory and output “completed” chains (when a conversion event arrived) to a separate service to do the model computation.

We’ve come to the conclusion that a large in-memory hash table could be suitable to the task. Our specifications for said hash table are:

- 1.5 billion 64-bit keys, uniformly and randomly distributed
- values between 16 bytes and 16 kilobytes
- deployed to one machine, all in-memory
- sustained 200,000 updates per second over the course of a 14-hour “internet day”

Before jumping into building one of these, I thought I’d learn a bit about hash tables themselves.

### The Research

Naturally, my research began with Wikipedia. The article on hash tables is a fairly comprehensive overview. From there, I read handful of papers and articles to dig a little deeper. Below are a selection that helped me immensely.

#### Dynamic Hashing

*Fagin’*s Extendible Hashing – A Fast Access Method for Dynamic FilesOne of those seminal IBM papers that everyone seems to reference. It provides some interesting historical context for the introduction of dynamic hashing. The central thesis is an insight as impressive now as it was then: by separating the hash space from the storage addressing space, a hash table can be made incrementally extendible.

*Seltzer*and*Yigit’*s A New Hashing Package for UNIXAn older overview of hashing algorithms for use in and out of main memory. Includes exquisite insight into the implementation concerns the authors took into account while building a general hashing library for UNIX.

*Rathi*,*Lu*, and*Hedrick’*s Performance Comparison of Extendible Hashing And Linear Hashing TechniquesAn old but very useful comparison of linear and extendible hashing that demonstrates certain periodic performance characteristics that may make one or the other unsuitable for your application.

*Baeza-Yates*and*Soza-Pollman’*s Analysis of Linear Hashing RevisitedA theoretical analysis of different overflow control functions in linear hashing. Lots of math, but very clearly demonstrates the differences between global versus local overflow resolution functions and the impact of page sizes.

#### Hash Functions

- Jenkins Hash
A solid general-purpose hash whose source and documentation are a masterwork of explication and thoroughness.

- (Minimal) Perfect Hashing: some theory, some practice
For when you have all of your keys ahead of time and want 100% occupancy.

#### Collision Resolution

*Pagh*and*Rodler*‘s Cuckoo HashingThe original cuckoo hashing paper that compares cuckoo hashing against chaining methods and linear probing. Includes a nice section at the end recapping earlier hashing schemes and their historical context.

*Erlingsson*,*Manasse*, and*McSherry’*s A Cool and Practical Alternative to Hash TablesThey present an empirical analysis of parametrized cuckoo hashing (in number of hash functions and bucket size). There’s an interesting bit at the end discussing dynamic expansion by adding bins and/or new hash functions.

*Lehman*and*Panigrahy’*s 3.5-Way Cuckoo Hashing for the Price of 2-and-a-BitDescribes a parametrized cuckoo hashing scheme with overlapping bins. Improves likely utilization by several percent.

*Herhily*,*Shavit*, and*Tzafrir’*s Hopscotch HashingIncorporates ideas from linear probing, cuckoo, and chaining techniques to avoid any of their individual pitfalls.

*Panigrahy’*s Efficient Hashing with Lookups in two Memory AccessesProvides a lucid graph-theoretic description of the problem of collision resolution. The solution proposed is all-theory, so don’t bother looking for a practical result therein.

*Dietzfelbinger*and*Schellbach’*s On Risks of Using Cuckoo Hashing with Simple Universal Hash ClassesDiscusses the unsuitability of certain linear and multiplicative hash functions for use with cuckoo hashing, using a graph theoretic argument.

### Observations

After my academic explorations, I started to look for candidate data stores. In doing this, I began digging into the history of key-value stores and hash table implementations. A few things jumped out immediately:

- Engineering effort seems to have been diverted from hash table development to
*distributed*hash table development, in the past 5 years. - Dynamic hashing innovation seems to have stopped at linear and extendible hashing.
- No benchmarks I ran into exceeded 100M insertions. In fact, this benchmark is the only one that I found that exceeded 10M insertions.

The first seems obvious given the meteoric rise in data captured from the web and the relatively fixed decrease in RAM price and increase in density. With dozens or hundreds of terabytes of “online” data, one can hardly be blamed for steering toward mid-range commodity servers *en masse*. However, this approach comes at a cost: coordinating and maintaining a cluster of servers is no mean feat. In fact, I consider the consensus and commitment protocols that make said coordination possible significantly more challenging to understand, let alone implement, than any of the hashing subjects mentioned above. (Just look at the Wikipedia entry for Paxos!) Similarly, hot-node issues and debugging distributed systems strike me as being an order of magnitude harder to solve than the problem of building a “better” hash table. To be clear, I’m not arguing that these two things solve the same problems. Rather, given the choice of implementing a “huge”, performant hash table in memory or the algorithms to support a clustered solution, I would choose the former.

Despite the fact that the progress of Dynamo, Cassandra, Riak, and Voldemort took most of the headlines from 2005 to 2010, work still progressed on in-memory and disk-based non-distributed hash tables like Tokyo Cabinet and Kyoto Cabinet, Redis, and even the venerable Berekely DB. (If you’re at all interested in the history of “NoSQL” data stores, you should check out this handy timeline.) That said, little in terms of novel hash table technology came from these efforts. As far as I know, BDB still uses a variant of linear hashing, Redis uses standard chaining, and Kyoto Cabinet falls back on `std::unordered_map`

for its in-memory hash table.

This brings us to the other two points: indeed, how could traditional hash table development cease (practically) in light of the advances of DHTs? With “web-scale” data sets even a single node’s data storage needs should easily exceed anything seen 5 years prior, right? In fairness, some work has been done in the last few years to add concurrency to linear hashing as well as some work on optimizing hash table algorithms for modern cache hierarchies, but this doesn’t feel like the same kind of fundamental, basic result as, say, the introduction of extendible hashing.

I suppose the fact that there has been little visible engineering progress on this front is a testament to the quality of the existing algorithms and code. Either that or existing workloads have not yet exceeded that high watermark of 100M entries and we’re just waiting for the next jump to inspire new work in the field.

Next post: a roundup of existing candidates, benchmarks, and observations about their ease-of-use.