Big Memory, Part 3.5: Google sparsehash!

Author’s Note: This is part 3.5 of a series of posts about my adventures in building a “large”, in-memory hash table. This post is a handful of observations I made while running someone else’s benchmark of C/C++ hash table implementations.

I ran across Nick Welch’s Hash Table Benchmarks during my research and decided to rerun a subset of his benchmark for much larger key counts. (You can find my fork of his code on GitHub.)

The differences between the benchmarks he ran and the ones I ran are:

  • I’m using sparsehash 1.11 (vs. 1.5), Boost 1.41 (vs. 1.38), and Qt 4.6 (vs. 4.5).
  • I removed Ruby’s hash implementation because of its abysmal performance in his benchmark.
  • I increased the insertion count to 1.5 billion, and the step size to 100 million from 40 million/2 million.
  • I only provided the Random Inserts: Execution Time (integers) and the Memory Usage (integers) tests.

I did a best of three, using the recommended ionice/nice run command on a m2.4xlarge EC2 instance. (I added a dummy entry for Ruby’s hash so that the colors of my plots would match the originals.)

As Nick noted, Google’s sparsehash maps are the real deal. They sustain their performance across a broad range of key counts, offering strong tradeoffs between memory and speed. When memory usage must be low and allocation smooth, sparse_hash_map is the clear choice. Interestingly, it’s simply quadratic probing built on top of a dynamic array. Dynamic arrays only store addressed entries, so no matter how large the array address space may be, you only pay for the entries you’re using, plus some bookkeeping overhead. The dynamic array mitigates the memory cost of maintaining the low load factor required to operate a fast, large hash table with quadratic probing. At sparse_hash_map‘s default load factor of 0.8, a conventional array-backed quadratic probing scheme ‘wastes’ 20% of allocated memory. This benchmark maps to and from 64-bit integers, so at the claimed 1-2 bits of overhead per entry the dynamic array implementation only ‘wastes’ about 1% of allocated memory. It also allocates memory one entry at a time, so each insert is slower but requires constant time. Rehashes are the only allocation hiccup in this scheme.

A good review of the algorithm of the sparsetable used to back the sparse_hash_map is provided by the Google folks here, under insert() and in the source in sparsetable.cppsparsehashtable.h, and sparse_hash_map. You’ll also find an interesting claim about their particular choice of quadratic probing with triangular numbers, which I’ve been unable to find evidence for elsewhere. Regardless of the theoretical details, these benchmark numbers are enough to convince me of the efficacy of that choice.

In the opposite case where throughput is of paramount importance, dense_hash_map shines. It maintains a 0.5 load factor and stores entries in a regular C array, which means it has to pay for all the buckets it addresses (not just the ones it uses) and also leaves half of them empty. This avoids the costly bitmap population counts and the constant reallocation needed by the sparse_hash_map at the cost of infrequent, lumpy allocation and a large per-key overhead. So lumpy, in fact, that it couldn’t quite make it to the 1.5 billion key mark on the provided 68GB of memory.

The algorithm behind sparse_hash_map seemed to suit my needs perfectly: smooth allocation, compact footprint, and still plenty fast despite being the slowest of the group. I dug in a little further to get a feel for the throughput profile and ended up modifying the benchmark to report the amount of time between every million inserts up to 1.5 billion. I ran 4 warmup runs and 16 observation runs. This gave a familiar view, but with a twist. The throughput costs of the different allocation strategies were exposed:

(Note: Both plots share the bottom plot’s legend.)

The top plot shows the number of inserts per second, measured over 1-million-record intervals. Overall, the results are as expected: the sparse_hash_map and dense_hash_map sandwiching the other three, which may as well have been the same library. (They practically are: the versions of Boost, Qt, and GCC that I used are all prime-sized, chained hash tables.) The throughputs are pretty stable overall: dense_hash_map at around 4.5M inserts/sec, the unordered_maps and QHash at about 3M, and sparse_hash_map at about 1M.

The bottom plot shows the slowest million-record intervals, plotting the number of seconds needed to insert 1 million records. These correspond to the dips in throughput of the top plot. I can only assume these are the observations that spanned a resizing of the table. sparse_hash_map was the worst offender, taking more than three minutes to resize and rehash the table in the worst case, while dense_hash_map had the best worst-case at 43 seconds. The other three predictably fell somewhere in between. (See the five rightmost points on the bottom plot.)

A brief aside: it’s curious that GCC’s std::unordered_map performed roughly 20% faster after its final reallocation (near the 700M record mark) than both Boost’s unordered_map and Qt’s QHash. I’ll have to dig in later to figure out what’s actually happening there, but I suspect it has to do with the simple fact that it reallocates to a much larger memory footprint than the other two, leaving it with a much lower load factor through that section of the test.

In any case, the valuable lesson for me was that unless you properly allocate the initial table size, you will be forced to suffer catastrophically slow resizes, even in C/C++. But what recourse do you have if your table crosses the rehash point between pruning? This also raises important questions about most open-addressing schemes’ need to periodically rehash to clear tombstone entries. Is it even possible to initially allocate a table under these schemes that will be able to balance insertions and deletions without either being incredibly wasteful or forcing a rehash? I suppose it depends deeply on your workload. There are some promising proposals such as hopscotch hashing that are meant to accomodate very high load factors, but it remains to be seen if the algorithm is resilient to long periods of constant use and my particular workload. Another issue I foresee with open addressing is the concurrence of a pathological probing sequence with a hot key (my workload is append-heavy and highly variable from key to key). When combined, you could see meaningful but hard-to-diagnose dips in throughput, which is a worst-nightmare of sorts. The extra pointers needed for a chained hash table are looking more and more palatable if it provides better worst-case scenarios.

All this means that I’ll have to spend a great deal of time analyzing our traffic to solve for the correct size such that the deletions and insertions tally up correctly. I’ll also have to take into account maintaining an acceptable load factor if I use open addressing so that “standard” operational throughput doesn’t suffer either. This has wide-ranging implications for the garbage-collection/trimming algorithm I choose as well as the backing storage, but I’ll get into that in the next few posts.