Big Memory, Part 3

Author’s Note: This is part 3 of a series of posts about my adventures in building a “large”, in-memory hash table. Part 1 introduced our goals and our approach to the task at hand. This post is a summary of some candidate hash table “services”.

Goals

To recap, I need a hash table that can support the following:

  • 1.5 billion 64-bit keys, uniformly and randomly distributed
  • values between 16 bytes and 16 kilobytes, with sizes in a Zipfian distribution
  • deployed to one machine, all in-memory
  • sustained 200,000 writes per second over the course of many hours

The API should support a non-bulk, mutable, key-value interface with an append command.
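
For concreteness, the interface boils down to something like this (a hypothetical sketch; the type and method names are illustrative, not any candidate’s actual API):

```java
// Hypothetical sketch of the required interface; the names are illustrative only.
public interface AppendableKeyValueStore {

    // Replace whatever is stored under the key.
    void put(long key, byte[] value);

    // Append the bytes to the value stored under the key,
    // creating the entry if it does not exist yet.
    void append(long key, byte[] value);

    // Return the current value, or null if the key is absent.
    byte[] get(long key);
}
```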

The final requirement is that the source be obtainable. After all, this is just as much about finding a viable candidate as understanding how the results are achieved.

My approach to testing the initial viability of candidates was to replicate a subset of the required production load using some of our production logs. The test amounted to writing 212 million records to a bit over 78 million unique keys. Each record’s key is 8 bytes and its value 16 bytes. The value bytes are simply appended to the existing value corresponding to the key. This closely mimics our real write workload for the project.
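
In outline, the test driver is little more than the following loop (a simplified sketch rather than the actual benchmark code; Record and the store interface are illustrative stand-ins):

```java
// Simplified sketch of the write workload, not the actual benchmark code.
// Record and AppendableKeyValueStore are illustrative stand-ins.
static void runWriteTest(Iterable<Record> records, AppendableKeyValueStore store) {
    long written = 0;
    long start = System.nanoTime();
    for (Record r : records) {                  // ~212 million records in total
        store.append(r.key(), r.value());       // 8-byte key, 16-byte value appended
        written++;
        if (written % 1000000 == 0) {           // report throughput periodically
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%d records, %.0f records/sec%n", written, written / seconds);
        }
    }
}
```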

Note that throughput and latency are the primary concerns here: we seek a consistently high write rate. Memory overhead, at this stage, is not under scrutiny. (This may strike some as odd, given the hard bounds on a single machine’s memory, but honestly the raw data set we’re seeking to store is easily within the bounds of the servers I described in my previous post. As long as nothing absurd is going on, we can afford to trade some memory for speed.)

Candidates

Given the API requirements, the candidates that immediately came to mind were:

  • Berkeley DB
  • Kyoto Cabinet
  • Redis
  • Memcached

Note that the scope here is restricted to hash table “services”, not hash table libraries. Specifically, I don’t want to manage memory, rehashing, growing, or shrinking. I’ll be covering libraries in the next post.

Under the hood these all use slightly different hashing and collision resolution schemes.

Berkeley DB uses an implementation of Litwin’s Extended Linear Hashing. In particular, it implements linear hashing using a hybrid split control: bucket overflow and load factor independently trigger splits. (Look for ffactor and do_expand in hash_page.c’s __ham_add_el().) Notably, BDB chains memory pages, not object pointers. This is a sensible optimization in a world where main memory is small and disk seeks are costly. However, the cost in code complexity is immense. For an idea of just how much attention to detail is required, download BDB’s source and check out hash_page.c’s __ham_replpair() and __ham_add_el(). It is fascinating to see how much work goes into managing the differences between small and large values. [1]
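
The hybrid split control amounts to something like the following (an illustrative sketch of the rule just described, not BDB’s actual code):

```java
// Illustrative sketch of a hybrid split trigger for linear hashing, per the
// description above; this is not BDB's actual __ham_add_el() logic.
// A split is scheduled when the target bucket's page overflows OR when the
// table-wide fill factor is exceeded.
static boolean shouldSplit(boolean pageOverflowed, long totalElements,
                           long bucketCount, int fillFactor) {
    boolean loadExceeded = totalElements >= (long) fillFactor * bucketCount;
    return pageOverflowed || loadExceeded;
}
```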

Kyoto Cabinet “boringly” uses the C++ stdlib’s std::unordered_map. I had trouble finding implementations other than GCC’s, so I can’t really speak to anything but that. The tr1/hashtable implementation uses chaining, with a prime bucket count and a max-load-factor-based rehashing policy. When the ratio of elements to buckets passes a certain threshold (1, I believe), a full stop-the-world rehash is performed. (_M_rehash() on line 1146) The resizing policy finds the smallest prime greater than twice the current number of buckets, and the table is resized. (_M_need_rehash() on line 455) The prime policy default can be seen here.
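
The growth policy, as described, boils down to roughly this (an illustrative sketch; the actual implementation, I believe, looks up the next size in a precomputed list of primes rather than testing primality on the fly):

```java
// Illustrative sketch of the tr1/hashtable growth policy described above: when
// elements/buckets exceeds the max load factor, grow to the smallest prime
// greater than twice the current bucket count and rehash everything in one
// stop-the-world pass. Not the actual libstdc++ code.
static long nextBucketCount(long buckets, long elements, double maxLoadFactor) {
    if (elements <= buckets * maxLoadFactor) {
        return buckets;                          // no rehash needed yet
    }
    long candidate = 2 * buckets + 1;            // first odd number past 2x
    while (!isPrime(candidate)) {
        candidate += 2;
    }
    return candidate;                            // caller rehashes all entries into this many buckets
}

static boolean isPrime(long n) {
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;
    for (long d = 3; d * d <= n; d += 2) {
        if (n % d == 0) return false;
    }
    return true;
}
```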

Redis implements its own hash table that uses chaining as well with a target load factor of 1. Interestingly, it rehashes the keys incrementally in the background, pushing updates to a new table while checking both the old and new tables for reads. The incremental work is spread over all subsequent reads and writes issued to the table. This is perhaps the easiest of the four implementations to fully understand on the first read.
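
The bookkeeping looks roughly like this (an illustrative Java sketch of the scheme just described, not Redis’s actual dict.c):

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Illustrative sketch of the incremental rehashing scheme described above; it
// is not Redis's actual dict.c. During a rehash each operation migrates a
// small batch of entries, writes go to the new table, and reads consult both.
class IncrementallyRehashedMap {
    private Map<Long, byte[]> oldTable = new HashMap<>();
    private Map<Long, byte[]> newTable = null;   // non-null only while rehashing
    private long capacity = 16;                  // stand-in for the bucket count

    byte[] get(long key) {
        if (newTable != null) {
            migrateBatch();
        }
        byte[] v = oldTable.get(key);            // entries not yet migrated
        if (v == null && newTable != null) {
            v = newTable.get(key);               // new writes and migrated entries
        }
        return v;
    }

    void put(long key, byte[] value) {
        if (newTable == null && oldTable.size() >= capacity) {
            capacity *= 2;                       // target load factor of 1 exceeded:
            newTable = new HashMap<>();          // begin rehashing into a bigger table
        }
        if (newTable != null) {
            migrateBatch();
        }
        if (newTable != null) {                  // still rehashing after this batch
            oldTable.remove(key);                // the key now lives only in the new table
            newTable.put(key, value);
        } else {
            oldTable.put(key, value);
        }
    }

    private void migrateBatch() {
        // Move a handful of entries per operation; once the old table is
        // drained, the new table becomes the primary one.
        Iterator<Map.Entry<Long, byte[]>> it = oldTable.entrySet().iterator();
        for (int i = 0; i < 16 && it.hasNext(); i++) {
            Map.Entry<Long, byte[]> e = it.next();
            newTable.put(e.getKey(), e.getValue());
            it.remove();
        }
        if (oldTable.isEmpty()) {
            oldTable = newTable;
            newTable = null;
        }
    }
}
```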

Similar to BDB, memcached implements linear hashing, but it chains object pointers, not memory pages. It uses what the paper calls “load control[led]” splits, meaning that incremental rehashing occurs when the load factor exceeds a certain value. (In this case, 3/2.) Unlike Redis, it does the rehashing in another thread in the background, not as a part of the read or write operations. assoc.c very nicely illustrates the gist of linear hashing with controlled splits; check out assoc_find() and assoc_insert(). Beware, assoc_expand() just sets up some state to signal incremental rehashing. The real guts of the rehashing are in assoc_maintenance_thread(). It is notable how much simpler the code for object-pointer chaining is than the page chaining used in BDB.
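
The bucket selection during an in-flight expansion is the crux, and it looks roughly like this (an illustrative sketch in the spirit of assoc_find(), not memcached’s actual code):

```java
// Illustrative sketch of bucket selection during an in-flight linear-hash
// expansion, in the spirit of assoc_find()/assoc_insert(); this is not
// memcached's actual code. Buckets below expandBucket have already been moved
// into the larger primary table by the background maintenance thread.
class LinearHashAddressing {
    static class Bucket { /* head of an object-pointer chain */ }

    Bucket[] oldTable;         // 2^oldPower buckets
    Bucket[] primaryTable;     // 2^(oldPower + 1) buckets
    int oldPower;
    long expandBucket;         // next old bucket due to be migrated
    boolean expanding;

    Bucket bucketFor(long hash) {
        long oldIndex = hash & ((1L << oldPower) - 1);
        if (expanding && oldIndex >= expandBucket) {
            return oldTable[(int) oldIndex];                              // not migrated yet
        }
        return primaryTable[(int) (hash & ((1L << (oldPower + 1)) - 1))]; // already migrated
    }
}
```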

Ease of Use

Note: I’m talking specifically about ease of use from a developer perspective. I’m not qualified or interested in commenting on their operational merits here.

Without a doubt, the easiest candidate to set up, use, and analyze was Redis. Between the trivial build from source, the simplicity of the Jedis API, and the visibility provided by the INFO command, using Redis was a walk in the park. The redis.conf file has a lot of knobs, but most of them can safely be ignored, and the inline documentation is ample.

Kyoto Cabinet came in a close second. I forgot to set $JAVA_HOME before installing the Java bindings, which caused me some grief, but once I figured that bit out everything was right as rain. Instantiating and using it were painless if somewhat sparsely documented.

Memcached was actually a pain to use, not because of the daemon itself, but because of the client libraries available in Java. The fact that the append command required a CAS value in some clients and not in others was the main culprit. One qualm with the daemon itself is that an append command only succeeds if used on an existing key.
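
In practice that meant guarding every append with an add, along these lines (a sketch against a stand-in client interface, since the real clients disagree on the append signature):

```java
// Sketch of the client-side workaround for append failing on missing keys.
// MemcacheClient is a stand-in interface, not a real client API, since the
// Java clients I tried disagree on whether append() takes a CAS value.
interface MemcacheClient {
    boolean append(String key, byte[] value);   // fails if the key does not exist
    boolean add(String key, byte[] value);      // fails if the key already exists
}

class AppendHelper {
    static void appendOrCreate(MemcacheClient client, String key, byte[] value) {
        if (!client.append(key, value)) {       // no such key yet
            if (!client.add(key, value)) {      // another writer may have created it first...
                client.append(key, value);      // ...so append once more
            }
        }
    }
}
```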

Finally, BDB was by far the most frustrating candidate. The configuration is arcane and poorly documented. The errors are undescriptive and often cryptic. Setting the proper combination of permissions for a client is exceedingly difficult unless you peruse the documentation with a very keen eye. The distinction between what configuration belongs on the EnvironmentConfig versus the DatabaseConfig is unclear and poorly documented. Despite specifying an in-memory hash database, a home directory for BDB is still required, even though it never touches it. One has to manually initialize the memory subsystem. Blah! Maybe I’m just uninitiated, but I don’t think I’ve ever been more frustrated with a piece of software. To boot, only the Heap, Queue, and Recno access methods support append puts, leaving me to manually do a get/append/put in the client (sketched below). Even if BDB is fast enough, there’s absolutely no chance I’ll use it in production due to these limitations.
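
The manual read-modify-write looks roughly like this with the BDB Java bindings (a simplified sketch; error handling and transactions are omitted):

```java
import com.sleepycat.db.Database;
import com.sleepycat.db.DatabaseEntry;
import com.sleepycat.db.DatabaseException;
import com.sleepycat.db.LockMode;
import com.sleepycat.db.OperationStatus;

// Sketch of the client-side get/append/put forced on me by the lack of append
// support in the Hash access method. Assumes a non-transactional handle; the
// byte shuffling and error handling are simplified.
void appendValue(Database db, byte[] keyBytes, byte[] valueBytes) throws DatabaseException {
    DatabaseEntry key = new DatabaseEntry(keyBytes);
    DatabaseEntry existing = new DatabaseEntry();

    byte[] newValue;
    if (db.get(null, key, existing, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
        byte[] old = existing.getData();
        newValue = new byte[old.length + valueBytes.length];
        System.arraycopy(old, 0, newValue, 0, old.length);
        System.arraycopy(valueBytes, 0, newValue, old.length, valueBytes.length);
    } else {
        newValue = valueBytes;                   // first write for this key
    }
    db.put(null, key, new DatabaseEntry(newValue));
}
```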

Results

I’ll briefly note that memcached was so slow that it didn’t complete the test suite in the two days I left it running. As such, I’ve removed it entirely from this comparison. I was probably doing something wrong vis-a-vis configuration of the client and server. Similarly, a simple un-pipelined Redis connection proved to be incredibly slow, at least an order of magnitude slower than BDB. As such, I reran the original Redis test with a pipelined connection, flushing every 10,000 records. Both versions of the test are included in the source for posterity.
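
The pipelined test is essentially the following (a sketch using Jedis; the record source is a stand-in for the benchmark’s log reader):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

// Sketch of the pipelined Redis test: batch APPEND commands and flush the
// pipeline every 10,000 records. Record and the record source are stand-ins
// for the benchmark's log reader.
static void writePipelined(Iterable<Record> records) {
    Jedis jedis = new Jedis("localhost", 6379);
    Pipeline pipeline = jedis.pipelined();
    int pending = 0;
    for (Record r : records) {
        pipeline.append(r.keyBytes(), r.valueBytes());  // APPEND creates the key if absent
        if (++pending == 10000) {
            pipeline.sync();                            // flush commands and read the replies
            pending = 0;
        }
    }
    pipeline.sync();                                    // flush the final partial batch
    jedis.disconnect();
}
```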

These plots come from 30 runs over the data set, preceded by 10 warmup runs. The hash marks are the average records-per-second at each record count, and the points are the individual observations, plotted at 10% alpha.

The first plot includes a baseline processing rate (‘xfer’ in the legend) which indicates how quickly the records can be read and prepared. The second simply excludes the baseline, for a clearer view.

[Plots: records processed per second versus records written, for BDB, Kyoto Cabinet, and pipelined Redis, with and without the ‘xfer’ baseline.]

You can find the source code used to run these comparisons on GitHub.

Notable bits

  • Despite the drastically different algorithms used by BDB and KC, their results were roughly equivalent. KC’s performance proved to be slightly smoother, and it seems to have reached a stable point at around 170 million records while BDB continued to degrade. A concern is that they were the only two packages used through JNI. This may have limited performance, but I am disinclined to investigate further since we use the JVM in production, which necessitates this cost when interacting with these services. That said, tr1/hashtable’s underlying algorithm is still quite attractive. It performed more smoothly despite not having a hint about the number of unique keys, while BDB did.
  • Though Redis’ throughput proved to be about 50% greater than that of KC and BDB, the precipitous drops during (what I assume is) resizing are extremely worrisome. (I’m guessing it’s resizing since the distance between drop-offs roughly doubles each time.) The performance drop-off just doesn’t jibe with the goal of continuously high throughput. Equally worrying is the cost of at least doubling memory use during rehashing. Even though I mentioned this is a secondary concern in this comparison, it is an important operational problem.
  • The performance difference between tr1/hashtable (KC) and Redis is marked, given they both use chaining. I suspect this is either a result of pipelining or JNI overhead. The purpose of adding the pipelined version of the Redis test was to emulate a scenario where issuing commands did not carry network and serialization overhead. Perhaps it optimized Redis access unfairly by instead emulating a bulk command API as opposed to just mitigating protocol overhead.

Conclusions

It seems that the overhead of a general-purpose hash table “service” makes these options unsuitable for my needs. I’m honestly not sure whether it’s the broad feature sets, the JNI bridge, or the serialization/deserialization overhead of having a network server on top of the hash table, and as a practical matter I won’t pursue the question any further. All three problems can be circumvented by using a Java-resident hash table, so that is where we’ll go in the next post. -Xmx128g here we come!

Footnotes

[1] Per the Berkeley DB license I am including a link to Oracle’s site where one can find the full source of the database.

Big Memory, Part 2

Author’s Note: This is part 2 of a series of posts about my adventures in building a “large”, in-memory hash table. Part 1 introduced our goals and our approach to the task at hand. This post is a summary of some of the research I’ve done to familiarize myself with the problem.

Background

Our current approach to custom attribution models is very simple: pay Amazon thousands of dollars a month and “do it in the cloud” with Elastic MapReduce. In Hadoop, we partition the data by user, sort by time, identify their conversion events, and run an attribution model on these conversion-terminated “chains” of events. This is both costly and more cumbersome than we’d like.

A faster, cheaper, and arguably more transparent approach might be to pipe the live events to a service that could buffer and assemble these chains in memory and output “completed” chains (when a conversion event arrived) to a separate service to do the model computation.
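
Concretely, the buffering service amounts to appending each event to its user’s chain and shipping the chain downstream when a conversion arrives; a minimal sketch (the Event type and model-service hook are illustrative stand-ins):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Minimal sketch of the chain-assembly service described above; Event and the
// downstream consumer are illustrative stand-ins. A real version would sit on
// the large in-memory hash table this series is about, not java.util.HashMap.
class ChainAssembler {
    private final Map<Long, List<Event>> chains = new HashMap<>();
    private final Consumer<List<Event>> modelService;   // receives completed chains

    ChainAssembler(Consumer<List<Event>> modelService) {
        this.modelService = modelService;
    }

    void onEvent(Event e) {
        List<Event> chain = chains.computeIfAbsent(e.userId(), id -> new ArrayList<>());
        chain.add(e);                           // buffer the event in the user's chain
        if (e.isConversion()) {
            modelService.accept(chain);         // ship the conversion-terminated chain downstream
            chains.remove(e.userId());          // start fresh for this user
        }
    }
}
```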

We’ve come to the conclusion that a large in-memory hash table could be suitable to the task. Our specifications for said hash table are:

  • 1.5 billion 64-bit keys, uniformly and randomly distributed
  • values between 16 bytes and 16 kilobytes
  • deployed to one machine, all in-memory
  • sustained 200,000 updates per second over the course of a 14-hour “internet day”

Before jumping into building one of these, I thought I’d learn a bit about hash tables themselves.

The Research

Naturally, my research began with Wikipedia. The article on hash tables is a fairly comprehensive overview. From there, I read a handful of papers and articles to dig a little deeper. Below is a selection that helped me immensely.

Dynamic Hashing

  • Fagin’s Extendible Hashing – A Fast Access Method for Dynamic Files

    One of those seminal IBM papers that everyone seems to reference. It provides some interesting historical context for the introduction of dynamic hashing. The central thesis is an insight as impressive now as it was then: by separating the hash space from the storage addressing space, a hash table can be made incrementally extendible.

  • Seltzer and Yigit’s A New Hashing Package for UNIX

    An older overview of hashing algorithms for use in and out of main memory. Includes exquisite insight into the implementation concerns the authors took into account while building a general hashing library for UNIX.

  • Rathi, Lu, and Hedrick’s Performance Comparison of Extendible Hashing And Linear Hashing Techniques

    An old but very useful comparison of linear and extendible hashing that demonstrates certain periodic performance characteristics that may make one or the other unsuitable for your application.

  • Baeza-Yates and Soza-Pollman’s Analysis of Linear Hashing Revisited

    A theoretical analysis of different overflow control functions in linear hashing. Lots of math, but very clearly demonstrates the differences between global versus local overflow resolution functions and the impact of page sizes.

Hash Functions

  • Jenkins Hash

    A solid general-purpose hash whose source and documentation are a masterwork of explication and thoroughness.

  • (Minimal) Perfect Hashing: some theory, some practice

    For when you have all of your keys ahead of time and want 100% occupancy.

Collision Resolution

  • Pagh and Rodler‘s Cuckoo Hashing

    The original cuckoo hashing paper that compares cuckoo hashing against chaining methods and linear probing. Includes a nice section at the end recapping earlier hashing schemes and their historical context.

  • Erlingsson, Manasse, and McSherry’s A Cool and Practical Alternative to Hash Tables

    They present an empirical analysis of parametrized cuckoo hashing (in number of hash functions and bucket size). There’s an interesting bit at the end discussing dynamic expansion by adding bins and/or new hash functions.

  • Lehman and Panigrahy’s 3.5-Way Cuckoo Hashing for the Price of 2-and-a-Bit

    Describes a parametrized cuckoo hashing scheme with overlapping bins. Improves likely utilization by several percent.

  • Herlihy, Shavit, and Tzafrir’s Hopscotch Hashing

    Incorporates ideas from linear probing, cuckoo, and chaining techniques to avoid any of their individual pitfalls.

  • Panigrahy’s Efficient Hashing with Lookups in two Memory Accesses

    Provides a lucid graph-theoretic description of the problem of collision resolution. The solution proposed is all-theory, so don’t bother looking for a practical result therein.

  • Dietzfelbinger and Schellbach’s On Risks of Using Cuckoo Hashing with Simple Universal Hash Classes

    Discusses the unsuitability of certain linear and multiplicative hash functions for use with cuckoo hashing, using a graph theoretic argument.

Observations

After my academic explorations, I started to look for candidate data stores. In doing this, I began digging into the history of key-value stores and hash table implementations. A few things jumped out immediately:

  • Engineering effort seems to have been diverted from hash table development to distributed hash table development in the past five years.
  • Dynamic hashing innovation seems to have stopped at linear and extendible hashing.
  • No benchmarks I ran into exceeded 100M insertions. In fact, this benchmark is the only one that I found that exceeded 10M insertions.

The first seems obvious given the meteoric rise in data captured from the web and the relatively fixed decrease in RAM price and increase in density. With dozens or hundreds of terabytes of “online” data, one can hardly be blamed for steering toward mid-range commodity servers en masse. However, this approach comes at a cost: coordinating and maintaining a cluster of servers is no mean feat. In fact, I consider the consensus and commitment protocols that make said coordination possible significantly more challenging to understand, let alone implement, than any of the hashing subjects mentioned above. (Just look at the Wikipedia entry for Paxos!) Similarly, hot-node issues and debugging distributed systems strike me as being an order of magnitude harder to solve than the problem of building a “better” hash table. To be clear, I’m not arguing that these two things solve the same problems. Rather, given the choice of implementing a “huge”, performant hash table in memory or the algorithms to support a clustered solution, I would choose the former.

Despite the fact that the progress of Dynamo, Cassandra, Riak, and Voldemort took most of the headlines from 2005 to 2010, work still progressed on in-memory and disk-based non-distributed hash tables like Tokyo Cabinet and Kyoto Cabinet, Redis, and even the venerable Berkeley DB. (If you’re at all interested in the history of “NoSQL” data stores, you should check out this handy timeline.) That said, little in terms of novel hash table technology came from these efforts. As far as I know, BDB still uses a variant of linear hashing, Redis uses standard chaining, and Kyoto Cabinet falls back on std::unordered_map for its in-memory hash table.

This brings us to the other two points: indeed, how could traditional hash table development cease (practically) in light of the advances of DHTs? With “web-scale” data sets, even a single node’s data storage needs should easily exceed anything seen 5 years prior, right? In fairness, some work has been done in the last few years to add concurrency to linear hashing, as well as some work on optimizing hash table algorithms for modern cache hierarchies, but this doesn’t feel like the same kind of fundamental, basic result as, say, the introduction of extendible hashing.

I suppose the fact that there has been little visible engineering progress on this front is a testament to the quality of the existing algorithms and code. Either that or existing workloads have not yet exceeded that high watermark of 100M entries and we’re just waiting for the next jump to inspire new work in the field.

Next post: a roundup of existing candidates, benchmarks, and observations about their ease-of-use.
