Author’s Note: This is part 2 of a series of posts about my adventures in building a “large”, in-memory hash table. Part 1 introduced our goals and our approach to the task at hand. This post is a summary of some of the research I’ve done to familiarize myself with the problem.
Our current approach to custom attribution models is very simple: pay Amazon thousands of dollars a month and “do it in the cloud” with Elastic MapReduce. In Hadoop, we partition the data by user, sort by time, identify their conversion events, and run an attribution model on these conversion-terminated “chains” of events. This is both costly and more cumbersome than we’d like.
A faster, cheaper, and arguably more transparent approach might be to pipe the live events to a service that could buffer and assemble these chains in memory and output “completed” chains (when a conversion event arrived) to a separate service to do the model computation.
We’ve come to the conclusion that a large in-memory hash table could be suitable to the task. Our specifications for said hash table are:
- 1.5 billion 64-bit keys, uniformly and randomly distributed
- values between 16 bytes and 16 kilobytes
- deployed to one machine, all in-memory
- sustained 200,000 updates per second over the course of a 14-hour “internet day”
Before jumping into building one of these, I thought I’d learn a bit about hash tables themselves.
Naturally, my research began with Wikipedia. The article on hash tables is a fairly comprehensive overview. From there, I read handful of papers and articles to dig a little deeper. Below are a selection that helped me immensely.
- Fagin’s Extendible Hashing – A Fast Access Method for Dynamic Files
One of those seminal IBM papers that everyone seems to reference. It provides some interesting historical context for the introduction of dynamic hashing. The central thesis is an insight as impressive now as it was then: by separating the hash space from the storage addressing space, a hash table can be made incrementally extendible.
- Seltzer and Yigit’s A New Hashing Package for UNIX
An older overview of hashing algorithms for use in and out of main memory. Includes exquisite insight into the implementation concerns the authors took into account while building a general hashing library for UNIX.
- Rathi, Lu, and Hedrick’s Performance Comparison of Extendible Hashing And Linear Hashing Techniques
An old but very useful comparison of linear and extendible hashing that demonstrates certain periodic performance characteristics that may make one or the other unsuitable for your application.
- Baeza-Yates and Soza-Pollman’s Analysis of Linear Hashing Revisited
A theoretical analysis of different overflow control functions in linear hashing. Lots of math, but very clearly demonstrates the differences between global versus local overflow resolution functions and the impact of page sizes.
- Jenkins Hash
A solid general-purpose hash whose source and documentation are a masterwork of explication and thoroughness.
- (Minimal) Perfect Hashing: some theory, some practice
For when you have all of your keys ahead of time and want 100% occupancy.
- Pagh and Rodler‘s Cuckoo Hashing
The original cuckoo hashing paper that compares cuckoo hashing against chaining methods and linear probing. Includes a nice section at the end recapping earlier hashing schemes and their historical context.
- Erlingsson, Manasse, and McSherry’s A Cool and Practical Alternative to Hash Tables
They present an empirical analysis of parametrized cuckoo hashing (in number of hash functions and bucket size). There’s an interesting bit at the end discussing dynamic expansion by adding bins and/or new hash functions.
- Lehman and Panigrahy’s 3.5-Way Cuckoo Hashing for the Price of 2-and-a-Bit
Describes a parametrized cuckoo hashing scheme with overlapping bins. Improves likely utilization by several percent.
- Herhily, Shavit, and Tzafrir’s Hopscotch Hashing
Incorporates ideas from linear probing, cuckoo, and chaining techniques to avoid any of their individual pitfalls.
- Panigrahy’s Efficient Hashing with Lookups in two Memory Accesses
Provides a lucid graph-theoretic description of the problem of collision resolution. The solution proposed is all-theory, so don’t bother looking for a practical result therein.
- Dietzfelbinger and Schellbach’s On Risks of Using Cuckoo Hashing with Simple Universal Hash Classes
Discusses the unsuitability of certain linear and multiplicative hash functions for use with cuckoo hashing, using a graph theoretic argument.
After my academic explorations, I started to look for candidate data stores. In doing this, I began digging into the history of key-value stores and hash table implementations. A few things jumped out immediately:
- Engineering effort seems to have been diverted from hash table development to distributed hash table development, in the past 5 years.
- Dynamic hashing innovation seems to have stopped at linear and extendible hashing.
- No benchmarks I ran into exceeded 100M insertions. In fact, this benchmark is the only one that I found that exceeded 10M insertions.
The first seems obvious given the meteoric rise in data captured from the web and the relatively fixed decrease in RAM price and increase in density. With dozens or hundreds of terabytes of “online” data, one can hardly be blamed for steering toward mid-range commodity servers en masse. However, this approach comes at a cost: coordinating and maintaining a cluster of servers is no mean feat. In fact, I consider the consensus and commitment protocols that make said coordination possible significantly more challenging to understand, let alone implement, than any of the hashing subjects mentioned above. (Just look at the Wikipedia entry for Paxos!) Similarly, hot-node issues and debugging distributed systems strike me as being an order of magnitude harder to solve than the problem of building a “better” hash table. To be clear, I’m not arguing that these two things solve the same problems. Rather, given the choice of implementing a “huge”, performant hash table in memory or the algorithms to support a clustered solution, I would choose the former.
Despite the fact that the progress of Dynamo, Cassandra, Riak, and Voldemort took most of the headlines from 2005 to 2010, work still progressed on in-memory and disk-based non-distributed hash tables like Tokyo Cabinet and Kyoto Cabinet, Redis, and even the venerable Berekely DB. (If you’re at all interested in the history of “NoSQL” data stores, you should check out this handy timeline.) That said, little in terms of novel hash table technology came from these efforts. As far as I know, BDB still uses a variant of linear hashing, Redis uses standard chaining, and Kyoto Cabinet falls back on
std::unordered_map for its in-memory hash table.
This brings us to the other two points: indeed, how could traditional hash table development cease (practically) in light of the advances of DHTs? With “web-scale” data sets even a single node’s data storage needs should easily exceed anything seen 5 years prior, right? In fairness, some work has been done in the last few years to add concurrency to linear hashing as well as some work on optimizing hash table algorithms for modern cache hierarchies, but this doesn’t feel like the same kind of fundamental, basic result as, say, the introduction of extendible hashing.
I suppose the fact that there has been little visible engineering progress on this front is a testament to the quality of the existing algorithms and code. Either that or existing workloads have not yet exceeded that high watermark of 100M entries and we’re just waiting for the next jump to inspire new work in the field.
Next post: a roundup of existing candidates, benchmarks, and observations about their ease-of-use.