Big Memory, Part 4

Author’s Note: This is part 4 of a series of posts about my adventures in building a “large”, in-memory hash table. This post is a summary of some pure Java hash table libraries.

Background

In my last post, I discussed the results of rerunning Nick Welch’s benchmark of C/C++ hash tables. However, since we use the JVM in production, those results were purely academic. I need to either verify or strike down my suspicions in the Java world.

For this test, I’m only concerned with primitive long-to-long mappings. For those following along, this is a departure from the methodology used in part 3; the “Mapping types” section below explains why.

Goals

I was hoping to replicate some fraction of the amazing throughput seen in the C/C++ benchmark. (A million inserts per second wouldn’t be a bad start!) More importantly, I sought to learn something about the relative performance of different schemes on the JVM. Would the bit-masking used in power-of-2-sized tables prove faster than the modulus operator used by prime-sized tables? Would the extra pointer hops of an externally chained implementation be more or less costly than the probe chains of open addressing schemes? Would Java even be able to handle a 128GiB hash table? What effect would GC have? What implementation tricks could be learned?
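To make the first of those questions concrete, here is the difference in bucket-index computation being asked about. This is a generic sketch of the two styles, not code taken from any of the libraries below.

    // Generic sketch of the two index-reduction styles in question
    // (illustrative only; not any benchmarked library's code).
    final class BucketIndex {
        // Power-of-2-sized table: reduce the hash with a bit mask.
        static int maskIndex(long hash, int powerOfTwoCapacity) {
            return (int) (hash & (powerOfTwoCapacity - 1));
        }

        // Prime-sized table: reduce the hash with the modulus operator.
        static int modIndex(long hash, int primeCapacity) {
            return (int) ((hash & Long.MAX_VALUE) % primeCapacity); // clear the sign bit first
        }
    }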

I’ll save you some time right now if you’re actually looking for the exact answers to all those questions: I don’t have most of them. I’ll hopefully get to them in later posts. The available, open-source primitive hash maps simply did not cover enough of the algorithm space to reliably demonstrate the isolated effect of a particular design choice such as power-of-2- vs prime-sized bucket arrays.

Instead, this post’s goals are more modest: I wish to answer the following questions:

  • What is the effect of load factor on the performance of these maps?
  • Given the out-of-the-box implementations available, which designs are the most appealing?
  • What optimizations can be applied to their implementation, given my requirements?
  • Are any of these suitable for my needs right out of the box?

Candidates

Lib.                Ver.      Rel. Date  Class                      Coll. Res.         No. Buckets  Hash Fun.    Bytes/Entry
mahout-collections  1.0.0     2010       OpenLongLongHashMap        double hashing     prime        [1]          17
Trove               3.0.1     2011       TLongLongHashMap           double hashing     prime        [2]          17
PCJ                 1.2       2003       LongKeyLongOpenHashMap     double hashing     prime        [2]          17
fastutil            6.4.1     2011       Long2LongOpenHashMap       linear probing     2^n          MurmurHash3  17
hppc                0.4.1     2011       LongLongOpenHashMap        linear probing     2^n          MurmurHash3  17
PCJ                 1.2       2003       LongKeyLongChainedHashMap  external chaining  prime        [2]          48
JDK                 1.6.0_27  2011       HashMap<Long, Long>        external chaining  2^n          [3]          100 (see the aside below)

[1] value ^ (value >> 32)
[2] value ^ (value >>> 32)
[3] h ^= (h >>> 20) ^ (h >>> 12); h ^ (h >>> 7) ^ (h >>> 4);
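Written out as Java, those footnoted functions look roughly like the following; the int casts are my addition, and [3] is the JDK’s supplemental hash applied on top of the key’s hashCode().

    // The footnoted hash functions rendered as Java for comparison.
    final class KeyHashes {
        // [1] signed right shift
        static int hash1(long value) {
            return (int) (value ^ (value >> 32));
        }

        // [2] unsigned right shift (identical to Long.hashCode())
        static int hash2(long value) {
            return (int) (value ^ (value >>> 32));
        }

        // [3] the JDK 1.6 HashMap supplemental hash, applied to the key's hashCode()
        static int hash3(int h) {
            h ^= (h >>> 20) ^ (h >>> 12);
            return h ^ (h >>> 7) ^ (h >>> 4);
        }
    }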

Setup

  • 2 warmup runs, 2 observation runs for each combination of (Class, Load Factor, Size Hint)
  • CentOS 5.4, dual Xeon X7542 (2.67 GHz, 6 cores each), 256 GB of RAM
  • 4.72 billion 64-bit integer keys (976 million unique) inserted with random long values, read from disk
  • baseline read/parse rate was roughly 10 million records per second
  • JVM flags: -server -Xmx128g -XX:+PrintGCTimeStamps
  • Code is on GitHub, added to the benchmark used in part 3. See the section of the README that discusses the “libraries” comparison.

The runs completed are listed below, as are some odd failures that occurred in testing.

The test pushed several hash tables to their breaking point, since they either imposed artificial limits on the backing array length (usually 2^30 ≈ 1.074B) or they ran into Java’s built-in limit of Integer.MAX_VALUE (2^31 − 1 ≈ 2.147B). Extremely high load factors were required to even get some tables to finish the benchmark, which I understand is suboptimal.

In general, I tried to run three different maximum load factors where possible: 0.50, 0.75, and 0.91. The first two are common default load factors for the hash tables in question, and the last one is the highest load factor such that 976,000,000/loadFactor < 2^30. What’s important to note here is that I actively avoided resizes during the runs by setting an initial capacity corresponding to the load factor. The number of “usable” buckets was held constant (at 976M) while the number of empty buckets varied in order to test the effect of load factor on performance.

Results

The broad strokes are that all of these libraries live in the 1M to 2M inserts-per-second range, with performance degrading as they approach their maximum load factors, some more than others. The drastically different bottom quartiles are what stand out to me.

Worst-cases for the chained maps are close to full stops whereas the open addressing maps’ throughputs fall by about 25%. HashMap<Long, Long> and LongKeyLongChainedHashMap clearly fall apart in a serious way on a regular basis, roughly every 50 million new keys. It’s not clear to me that the code would allow resizes/rehashes given the initial capacity provided, so my guess was that these were stop-the-world GCs.

I reran the test with -XX:+PrintGCTimeStamps -verbose:gc to verify, and 78% of the real time of the two observation runs was consumed by GC, with regular pauses up to 100 seconds. For contrast, Long2LongOpenHashMap‘s GCs amounted to less than 1% of real time with no pause greater than 0.2 seconds (with the exception of the Full GC that occurred between observation runs). I computed those percentages by dividing the sum of the reported GC durations by the ‘real’ value from time(1).

It makes sense that GC pauses would uniquely affect the externally-chained implementations, since they produce so many objects. (All the open addressing implementations only create a few handfuls of objects per map instance.) In any case, that problem alone disqualifies them for my uses: such significant dips in throughput are unacceptable. Indeed, any externally-chained map will be suspect at these sizes because of the need to create a new object per entry. Perhaps different collectors can mitigate this problem, but given the extremely sound performance of the open addressing implementations, I’m inclined to table the externally chained maps for a while. The operational advantage of simply not worrying about tuning GC is huge for me.

The two most interesting maps proved to be HPPC’s LongLongOpenHashMap and fastutil’s Long2LongOpenHashMap. Both were blazing fast and quite stable, with minor throughput degradation only after 700 million keys. LongKeyLongOpenHashMap, TLongLongHashMap, and OpenLongLongHashMap began their degradation right from the get-go, dropping by over 50% over the course of the test. fastutil’s implementation’s variance was also extremely low, which is a very admirable quality from an operational perspective.

It’s fascinating that a design as straightforward as power-of-two bucket counts and linear probing could be so amazingly effective. It’s the classic example of a ‘flawed’ open addressing scheme: it’s susceptible to primary clustering, the number of probes shoots up as you go past 90% occupancy, and it’s unforgiving of mediocre hash functions.

After thinking about it, though, it makes great sense that it performed so well here. Both favorites use a byte[] to store the state of each bucket, which means that each 64-byte cache line has on average the next 32 bucket statuses pre-cached. This means that walking the linear probe chain is effectively free for the next 32 buckets (compared to going to L1 or, heaven forbid, main memory). At 90% occupancy, the average successful lookup requires 5.5 probes while unsuccessful lookups require 50 probes. This means that on average we’re fetching at most two cache lines for the probing. Compare this to double hashing, which by design probes far from the original hash and so requires a line fetch per probe. At 2.5/10 average probes for successful/unsuccessful lookups, you’ll be doing far more fetches with double hashing than with linear probing. If you roll your own status array on top of a long[] and some masking tricks, you could potentially fit 256 2-bit statuses on each cache line, further improving linear probing’s efficiency. ($ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size to get your cache line size in bytes.)
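As a concrete illustration of that last idea, here is a minimal sketch of a 2-bit-per-bucket status array packed into a long[]. It is not code from any of the libraries above, just the masking trick described.

    // Sketch of packing 2-bit bucket statuses into a long[]: 32 statuses per
    // long, so 256 per 64-byte cache line. Illustrative only.
    final class PackedStatus {
        static final int FREE = 0, FULL = 1, REMOVED = 2; // 2 bits cover three states

        private final long[] words;

        PackedStatus(int buckets) {
            words = new long[(buckets + 31) >>> 5]; // 32 two-bit slots per long
        }

        int get(int bucket) {
            int shift = (bucket & 31) << 1;          // bit offset within the word
            return (int) ((words[bucket >>> 5] >>> shift) & 3L);
        }

        void set(int bucket, int status) {
            int word = bucket >>> 5;
            int shift = (bucket & 31) << 1;
            words[word] = (words[word] & ~(3L << shift)) | ((long) status << shift);
        }
    }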

If you combine linear probing with a good hash function, like Murmur3, you can avoid primary clustering in practice. What I find curious is that so many of the double-hashing libraries use such a weak hash function. Sure, it’s only 2 ops, but is the hash function really the bottleneck in their code? Can they really not afford the extra 6 ops? I doubt it. I’m going to add some instrumentation to a few of these implementations to build probe-length histograms and verify what I’ve speculated. Stay tuned.
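For reference, MurmurHash3’s 64-bit finalization step (“fmix64”) is the kind of mixer I have in mind: a handful of shifts and multiplies rather than two ops. Whether fastutil and HPPC apply exactly this form is an assumption on my part.

    // MurmurHash3's standard 64-bit finalizer (fmix64), shown as an example of a
    // cheap, high-quality mixer; the exact form each library uses may differ.
    final class Mix {
        static long fmix64(long k) {
            k ^= k >>> 33;
            k *= 0xff51afd7ed558ccdL;
            k ^= k >>> 33;
            k *= 0xc4ceb9fe1a85ec53L;
            k ^= k >>> 33;
            return k;
        }
    }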

Conclusions

In order to meet the requirements I’ve enumerated in my previous posts, I’m going to need more than 976 million occupied buckets. I’m going to need 1.5 billion, and a fraction of that again for empty buckets. Unfortunately, 2^30 is just a bit over 1.073 billion, and well short of 1.5 billion. This means that I can’t use one of the two favorites from my testing (fastutil’s Long2LongOpenHashMap) out of the box. Luckily, HPPC’s implementation does not have an artificial cap on the maximum number of buckets, so moving forward that will be a viable out-of-the-box option. (Edit: It has the same 2^30 cap as fastutil.) In order to complete my testing, I’ll likely dig up another month’s worth of data and run HPPC and modified variants of the other open addressing hash tables up to about 2 billion keys.


Mapping types

In part 3 of this series of posts, I tested out several hash table “services” to see if any of them would give me the end-to-end storage functionality I needed for this project. In that test, I used an append semantic for the performance comparison because I was hoping that one of those “services” would solve all of my storage needs in one shot. Ideally, such a service would provide a fast algorithm, efficient memory management, and solid monitoring facilities. Sadly, all of them failed on the first count and I was forced to move onto a more hands-on approach.

This means worrying about the actual storage implementation and its interaction with the JVM’s GC. If I take the naive route and simply allocate a new long[] for every user’s activity, the memory allocation time alone could cripple the application. It has become clear that storing 1.5 billion long-lived long[]s (about 80GB’s worth) on a high-occupancy heap will cause unacceptable GC pauses. As such, I’ll be storing the user activity chains in a slab allocator (of my own construction) and using the hash table to translate between user IDs and their storage reference. Thus, I’m transitioning from benchmarking appends to puts, and I’m now mapping from long to long instead of long to long[]. Since the slab allocator will return roughly random longs as storage references, I’ve simulated the values of the hash table with a call to Random#nextLong().
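Conceptually, the benchmark’s inner loop now looks something like the sketch below; LongLongMap is a hypothetical stand-in for whichever candidate is under test, not an interface any of these libraries actually expose.

    import java.util.Random;

    // Conceptual sketch of the put workload, not the actual benchmark code.
    // "LongLongMap" is a hypothetical stand-in for any of the candidate maps.
    interface LongLongMap {
        void put(long key, long value);
    }

    final class PutWorkload {
        static void load(LongLongMap map, long[] userIds) {
            Random rng = new Random(42L); // values simulate slab storage references
            for (long id : userIds) {
                map.put(id, rng.nextLong()); // long -> long, no boxing, no long[] per user
            }
        }
    }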

An Aside on HashMap<Long, Long> and External Chaining

Unlike the other candidates, HashMap<Long, Long> stores boxed keys and values, wrapping each pair in an Entry object.

    public final class Long extends Number implements Comparable<Long> {
        ...
        private final long value;
        ...
    }

    static class Entry<K,V> implements Map.Entry<K,V> {
        final K key;
        V value;
        Entry<K,V> next;
        final int hash;
        ...
    }

On a 64-bit JVM with a starting heap greater than 32GB, each of our 976 million unique users (wrapped in an Entry instance) requires:

  • 8 bytes for the pointer to the Entry from the backing Entry[],
  • 16 bytes for the object header of the Entry,
  • 8 bytes for the pointer to final K key,
  • 8 bytes for the pointer to V value,
  • 8 byte pointer to Entry next,
  • 4 bytes for int hash,
  • 16 bytes for the object header of final K key,
  • 8 bytes for the underlying long of final K key,
  • 16 bytes for the object header of V value, and
  • 8 bytes for the underlying long of V value

for a grand total of 100 bytes per unique key. This means that at the default 0.75 load factor, you’re looking at 97,600,000,000 bytes for the used buckets and 2,600,000,000 bytes for the empty buckets. 93 GiB later and I haven’t even stored the events themselves. Since boxing has its own CPU costs, I’m not expecting much in terms of performance or memory footprint from HashMap. It’s included here mostly as a baseline, but also to demonstrate its unsuitability for this scale of work. The same goes for LongKeyLongChainedHashMap, minus the boxing. Even at a more modest 48 bytes per entry, it’s still using 55 GiB plus the cost of maintaining a billion items on the heap. External chaining is simply unviable at these sizes.
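A quick back-of-envelope check of those figures (100 bytes per occupied bucket, 8 bytes per empty slot in the Entry[], 0.75 load factor):

    // Back-of-envelope check of the HashMap<Long, Long> numbers quoted above.
    public final class HashMapFootprint {
        public static void main(String[] args) {
            long uniques = 976000000L;
            double loadFactor = 0.75;
            long buckets = (long) Math.ceil(uniques / loadFactor); // ~1.30B slots in the Entry[]
            long usedBytes = uniques * 100L;                       // 100 bytes per occupied bucket
            long emptyBytes = (buckets - uniques) * 8L;            // 8-byte null slots
            System.out.printf("used=%,d empty=%,d total=%.1f GiB%n",
                    usedBytes, emptyBytes, (usedBytes + emptyBytes) / Math.pow(2, 30));
        }
    }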

Runs

For each load factor, the initial capacity of the table was hinted to be at least 976,000,000/loadFactor in order to avoid resizes, where possible. However, Trove and fastutil automatically do this computation by requesting the number of unique elements expected, instead of a lower bound on the initial bucket count. They were given 976,000,000 as their starting size hint regardless of load factor. As a last resort, 976,000,000 was used as the size hint when a map could or would not finish a test run.

The “Constructor” column indicates whether the constructor expects the initial capacity (“IC”) or the expected elements count (“EE”).

Lib.                Class                      Constructor  LF    Size Hint
mahout-collections  OpenLongLongHashMap        IC           0.91  1,073M
                                                            0.75  1,302M
                                                            0.50  1,952M
Trove               TLongLongHashMap           EE           0.91  976M
                                                            0.75  976M
                                                            0.50  976M
PCJ                 LongKeyLongOpenHashMap     IC           0.91  1,073M
                                                            0.75  1,302M
                                                            0.50  1,952M
fastutil            Long2LongOpenHashMap       EE           0.91  976M
hppc                LongLongOpenHashMap        IC           0.91  1,073M
PCJ                 LongKeyLongChainedHashMap  IC           0.91  976M
JDK                 HashMap<Long, Long>        IC           0.91  976M
                                                            0.75  1,302M
                                                            0.50  1,952M
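For what it’s worth, the “Size Hint” column above is just 976,000,000 scaled by the load factor for the IC-style constructors (and passed through unchanged, rounded to millions in the table, for the EE-style ones):

    // How the IC-style size hints above were derived (EE-style maps simply
    // receive the 976M expected-element count).
    public final class SizeHints {
        public static void main(String[] args) {
            long uniques = 976000000L;
            for (double lf : new double[] {0.91, 0.75, 0.50}) {
                long initialCapacity = (long) Math.ceil(uniques / lf);
                System.out.printf("LF %.2f -> initial capacity hint of %,d buckets%n",
                        lf, initialCapacity);
            }
        }
    }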

Oddities

When running initial tests, the configuration (TLongLongHashMap, 0.50, 976M) forced a catastrophic six-hour-long rehash at roughly 1.78 billion records in. At first I suspected a pathological probe chain followed by a rehash; however, I could not reproduce this particular case despite using identical data and identical runtime parameters. This is utterly horrifying from an operational perspective, but I’m willing to write this off as some artifact of the machine or perhaps the screen session. In another initial test, the configuration (LongKeyLongChainedHashMap, 0.50, 976M) crashed the Parallel Mark-Sweep garbage collector thread. Again, I couldn’t reproduce the problem despite my best efforts, so I’m writing it off as an artifact. It could have been the 128GB heap, it could have been the hundreds of millions of objects, it could have been anything. Sadly, I don’t even have a reproducible test case to give the JVM team, so I don’t foresee much progress on the bug.

Big Memory, Part 3.5: Google sparsehash!

Author’s Note: This is part 3.5 of a series of posts about my adventures in building a “large”, in-memory hash table. This post is a handful of observations I made while running someone else’s benchmark of C/C++ hash table implementations.

I ran across Nick Welch’s Hash Table Benchmarks during my research and decided to rerun a subset of his benchmark for much larger key counts. (You can find my fork of his code on GitHub.)

The differences between the benchmarks he ran and the ones I ran are:

  • I’m using sparsehash 1.11 (vs. 1.5), Boost 1.41 (vs. 1.38), and Qt 4.6 (vs. 4.5).
  • I removed Ruby’s hash implementation because of its abysmal performance in his benchmark.
  • I increased the insertion count to 1.5 billion, and the step size to 100 million from 40 million/2 million.
  • I only provided the Random Inserts: Execution Time (integers) and the Memory Usage (integers) tests.

I did a best of three, using the recommended ionice/nice run command on a m2.4xlarge EC2 instance. (I added a dummy entry for Ruby’s hash so that the colors of my plots would match the originals.)

As Nick noted, Google’s sparsehash maps are the real deal. They sustain their performance across a broad range of key counts, offering strong tradeoffs between memory and speed. When memory usage must be low and allocation smooth, sparse_hash_map is the clear choice. Interestingly, it’s simply quadratic probing built on top of a dynamic array. Dynamic arrays only store addressed entries, so no matter how large the array address space may be, you only pay for the entries you’re using, plus some bookkeeping overhead. The dynamic array mitigates the memory cost of maintaining the low load factor required to operate a fast, large hash table with quadratic probing. At sparse_hash_map‘s default load factor of 0.8, a conventional array-backed quadratic probing scheme ‘wastes’ 20% of allocated memory. This benchmark maps to and from 64-bit integers, so at the claimed 1-2 bits of overhead per entry the dynamic array implementation only ‘wastes’ about 1% of allocated memory. It also allocates memory one entry at a time, so each insert is slower but requires constant time. Rehashes are the only allocation hiccup in this scheme.
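The “only pay for the entries you’re using” behavior is easiest to see in miniature. The sketch below is not sparsehash’s code (and is Java rather than C++), but it shows the bitmap-plus-popcount layout that lets a sparse group charge roughly one bit of bookkeeping per addressable slot while allocating storage one entry at a time.

    // Minimal illustration of a bitmap-backed sparse group: 64 logical slots,
    // but storage only for the slots actually in use, plus one long of
    // bookkeeping. Not sparsehash's implementation, just the underlying idea.
    final class SparseGroup {
        private long occupied;                  // bit i set => logical slot i is present
        private long[] entries = new long[0];   // only the present entries, in slot order

        boolean contains(int slot) {
            return (occupied & (1L << slot)) != 0;
        }

        long get(int slot) {
            // rank of 'slot' = population count of the occupied bits below it
            int rank = Long.bitCount(occupied & ((1L << slot) - 1));
            return entries[rank];
        }

        void put(int slot, long value) {
            int rank = Long.bitCount(occupied & ((1L << slot) - 1));
            if (!contains(slot)) {              // grow by exactly one entry
                long[] grown = new long[entries.length + 1];
                System.arraycopy(entries, 0, grown, 0, rank);
                System.arraycopy(entries, rank, grown, rank + 1, entries.length - rank);
                entries = grown;
                occupied |= 1L << slot;
            }
            entries[rank] = value;
        }
    }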

A good review of the algorithm of the sparsetable used to back the sparse_hash_map is provided by the Google folks here, under insert() and in the source in sparsetable.cpp, sparsehashtable.h, and sparse_hash_map. You’ll also find an interesting claim about their particular choice of quadratic probing with triangular numbers, which I’ve been unable to find evidence for elsewhere. Regardless of the theoretical details, these benchmark numbers are enough to convince me of the efficacy of that choice.
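The triangular-number trick itself is easy to state: over a power-of-two table, adding 1, 2, 3, … to the bucket index on successive probes (cumulative offsets 0, 1, 3, 6, 10, …) visits every bucket exactly once before repeating. A quick sketch of that probe sequence, again in Java and not sparsehash’s code:

    // Quadratic probing with triangular-number offsets over a power-of-two table:
    // the offsets 0, 1, 3, 6, 10, ... hit every bucket exactly once per cycle.
    // Sketch only, not sparsehash's actual implementation.
    final class TriangularProbe {
        static void printProbeSequence(int hash, int powerOfTwoBuckets) {
            int mask = powerOfTwoBuckets - 1;
            int index = hash & mask;
            for (int i = 1; i <= powerOfTwoBuckets; i++) {
                System.out.print(index + " ");
                index = (index + i) & mask; // cumulative +1, +2, +3, ... = triangular offsets
            }
            System.out.println();
        }

        public static void main(String[] args) {
            printProbeSequence(5, 8); // prints all 8 bucket indices, each exactly once
        }
    }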

In the opposite case where throughput is of paramount importance, dense_hash_map shines. It maintains a 0.5 load factor and stores entries in a regular C array, which means it has to pay for all the buckets it addresses (not just the ones it uses) and also leaves half of them empty. This avoids the costly bitmap population counts and the constant reallocation needed by the sparse_hash_map at the cost of infrequent, lumpy allocation and a large per-key overhead. So lumpy, in fact, that it couldn’t quite make it to the 1.5 billion key mark on the provided 68GB of memory.
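As a rough sanity check of why 1.5 billion keys would not fit (my arithmetic and assumptions, not benchmark output): if dense_hash_map keeps power-of-two bucket counts and each bucket holds a 16-byte 64-bit key/value pair, then holding 1.5 billion keys at a 0.5 load factor requires growing to 2^32 buckets, and that bucket array alone already exceeds the 68 GB available.

    // Rough arithmetic only; the power-of-two bucket count is my assumption.
    public final class DenseEnvelope {
        public static void main(String[] args) {
            long targetKeys = 1500000000L;
            long minBuckets = 2L * targetKeys;                       // load factor capped at 0.5
            long buckets = Long.highestOneBit(minBuckets - 1) << 1;  // next power of two: 2^32
            double gb = buckets * 16.0 / 1e9;                        // 8-byte key + 8-byte value
            System.out.printf("%d buckets -> %.1f GB for the bucket array alone%n", buckets, gb);
        }
    }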

The algorithm behind sparse_hash_map seemed to suit my needs perfectly: smooth allocation, compact footprint, and still plenty fast despite being the slowest of the group. I dug in a little further to get a feel for the throughput profile and ended up modifying the benchmark to report the amount of time between every million inserts up to 1.5 billion. I ran 4 warmup runs and 16 observation runs. This gave a familiar view, but with a twist. The throughput costs of the different allocation strategies were exposed:

(Note: Both plots share the bottom plot’s legend.)

The top plot shows the number of inserts per second, measured over 1-million-record intervals. Overall, the results are as expected: sparse_hash_map and dense_hash_map sandwich the other three, which may as well have been the same library. (They practically are: the implementations shipped in the versions of Boost, Qt, and GCC that I used are all prime-sized, chained hash tables.) The throughputs are pretty stable overall: dense_hash_map at around 4.5M inserts/sec, the unordered_maps and QHash at about 3M, and sparse_hash_map at about 1M.

The bottom plot shows the slowest million-record intervals, plotting the number of seconds needed to insert 1 million records. These correspond to the dips in throughput of the top plot. I can only assume these are the observations that spanned a resizing of the table. sparse_hash_map was the worst offender, taking more than three minutes to resize and rehash the table in the worst case, while dense_hash_map had the best worst-case at 43 seconds. The other three predictably fell somewhere in between. (See the five rightmost points on the bottom plot.)

A brief aside: it’s curious that GCC’s std::unordered_map performed roughly 20% faster after its final reallocation (near the 700M record mark) than both Boost’s unordered_map and Qt’s QHash. I’ll have to dig in later to figure out what’s actually happening there, but I suspect it has to do with the simple fact that it reallocates to a much larger memory footprint than the other two, leaving it with a much lower load factor through that section of the test.

In any case, the valuable lesson for me was that unless you properly allocate the initial table size, you will be forced to suffer catastrophically slow resizes, even in C/C++. But what recourse do you have if your table crosses the rehash point between prunings? This also raises important questions about most open-addressing schemes’ need to periodically rehash to clear tombstone entries. Is it even possible to initially allocate a table under these schemes that will be able to balance insertions and deletions without either being incredibly wasteful or forcing a rehash? I suppose it depends deeply on your workload. There are some promising proposals such as hopscotch hashing that are meant to accommodate very high load factors, but it remains to be seen whether the algorithm is resilient to long periods of constant use under my particular workload. Another issue I foresee with open addressing is the concurrence of a pathological probing sequence with a hot key (my workload is append-heavy and highly variable from key to key). Combined, the two could produce meaningful but hard-to-diagnose dips in throughput, which is a worst nightmare of sorts. The extra pointers needed for a chained hash table are looking more and more palatable if they buy better worst-case behavior.

All this means that I’ll have to spend a great deal of time analyzing our traffic to solve for the correct size such that the deletions and insertions tally up correctly. I’ll also have to take into account maintaining an acceptable load factor if I use open addressing so that “standard” operational throughput doesn’t suffer either. This has wide-ranging implications for the garbage-collection/trimming algorithm I choose as well as the backing storage, but I’ll get into that in the next few posts.