Choosing a Good Hash Function, Part 3

Author’s note: Part three of a series studying hash functions. My last post identified a few candidate algorithms, which are subjected to further scrutiny today.

The Story So Far

The simplest attribute on which one could imagine differentiating candidate hash functions is the number of collisions produced when hashing a fixed pool of keys. By that standard, my last post identified Murmur3, Jenkins, City, Spooky, FNV1/1a, SDBM, AP, and RS as possible contenders. Today we’re going to see how they compare to each other on some more rigorous tests.

Random Uniformity

A hash function ought to distribute its keys uniformly across its output range. To see how these functions stack up, we’ll put our 42 million unique keys through each hash function, bin the output, and compare the bin counts with expectation:

For bins of equal size: E[bin_i] = (number of items hashed) / (number of bins)

Now, uniformity is different from random uniformity. The latter is not strictly necessary for building a good hash table, but the analysis of some schemes assumes it. For our purposes, we’re going to want our hashes to look like they are drawn from a random uniform distribution; simple uniformity won’t cut it for our applications. This means that when we look at our bin counts, we want them to be neither too smooth nor too lumpy. To quantify this concept, we’ll use a chi-squared test.

In Volume 2 of TAOCP, Donald Knuth provides a somewhat ad hoc but easy-to-understand method for interpreting the p-values calculated by a chi-squared test of randomness. If your p-value is less than 0.01 or greater than 0.99, the process that generated those results is almost certainly non-random. Anything less than 0.05 or greater than 0.95 should be considered suspect. Finally, he designates a p-value of less than 0.10 or greater than 0.90 as “almost suspect”.
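To make those two steps concrete, here is a minimal sketch of the binning, the chi-squared test, and the Knuth-style interpretation, written in Python with numpy and scipy. The hash values are assumed to already be computed as 64-bit integers, and the helper names are my own:

    import numpy as np
    from scipy.stats import chisquare

    def uniformity_pvalue(hashes, num_bins):
        # Map each 64-bit hash to one of num_bins equal-width bins.
        divisor = np.uint64(2**64 // num_bins)
        bins = np.asarray(hashes, dtype=np.uint64) // divisor
        # Clamp the tiny sliver at the top of the range into the last bin.
        bins = np.minimum(bins, np.uint64(num_bins - 1))
        counts = np.bincount(bins.astype(np.int64), minlength=num_bins)
        # chisquare defaults to a uniform expectation:
        # E[bin_i] = len(hashes) / num_bins for every bin.
        _, p = chisquare(counts)
        return p

    def knuth_verdict(p):
        # Knuth's rough interpretation from TAOCP Volume 2.
        if p < 0.01 or p > 0.99:
            return "almost certainly non-random"
        if p < 0.05 or p > 0.95:
            return "suspect"
        if p < 0.10 or p > 0.90:
            return "almost suspect"
        return "pass"

For the bottom-20-bits test described next, the binning step simply becomes hashes & (2**20 - 1) with num_bins = 2**20.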

Here I’ve cut the whole 64-bit output space into 100 bins, and again into 1,000,000 bins. For a final test, I reduced each hash modulo 2^20 to check the distribution of the bottom 20 bits in isolation.

Hash Function | 1 Million bins* | Bottom 20 bits* | 100 bins
--------------|-----------------|-----------------|---------
AP            | 0.70            | 0.50            | <0.01
City          | 0.07            | 0.29            | 0.46
FNV64-1       | <0.01           | >0.99           | 0.97
FNV64-1a      | >0.99           | >0.99           | 0.87
Jenkins       | 0.17            | 0.46            | 0.72
Murmur3       | 0.14            | 0.31            | 0.08
RS            | >0.99           | >0.99           | 0.23
SDBM          | >0.99           | >0.99           | >0.99
Spooky        | 0.84            | 0.27            | 0.98

*p-values estimated from a standard normal distribution

Jenkins passes all three of these nicely. City and Murmur each come up “almost suspect” once, and Spooky shows some suspicious behavior in the 100-bin test. I put the heaviest weight on the bottom-20-bits test, and can pretty comfortably give these four functions a pass here. AP does dramatically better at higher bin counts, which is interesting. We can pretty solidly eliminate RS, SDBM, AP, and both FNV variants based on this analysis alone.

As a final note, hash functions are not meant to be RNGs! This test holds them to a very rigid standard that is not generally necessary to build a good hash table. It’s just that in our specific application, we’re going to want our hash values to be somewhat random looking.

Using Keyspace Structure

Before I continue, let me explain a little more about the structure of the data I am working with. I have 251 namespaces, each of which has a variable number of 192- and 256-bit keys associated with it. All told, I have in the neighborhood of 66 million data points of the form (namespace, key). Only the key portion of these tuples actually gets hashed, however. Up until this point, we have been ignoring the namespace attribute of these data points, and thus have been restricted to looking at the 42 million unique (key, hash(key)) pairs. Let’s see if we can exploit this larger set of data by including the namespaces!

In the chi-squared analysis above, we did our binning over the union of all namespaces. Now let’s individually bin the hash values of each namespace. All said and done, we have 251 namespaces ranging in size from a tiny handful to several million elements. This gives us 251 vectors of size 100, with

V_{n,i} = number of items of namespace n hashed to the i-th bin

For each namespace, we can compute the mean and variance of its count vector. I’ll leave the details as an exercise to the reader, but the sketch is simple: if you sample from a random uniform distribution, each bin count is Binomial(N, 1/100), which for large N and a small bin probability is approximately Poisson, and a Poisson distribution’s variance equals its mean. So the variance of such a bin-count vector should equal its mean. If the variance is lower than the mean, it implies that the distribution is flatter than expected. Conversely, if the variance is higher, it implies the existence of hot spots on the range that are getting more than their fair share of data points hashed there.
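As a concrete sketch of this per-namespace computation (Python again; the input is assumed to be an iterable of (namespace, 64-bit hash) pairs, and the helper name is my own):

    import numpy as np

    def namespace_bin_stats(points, num_bins=100):
        # Bin each namespace's hashes separately: V[ns][i] is the count of
        # namespace ns's items landing in bin i.
        divisor = np.uint64(2**64 // num_bins)
        V = {}
        for ns, h in points:
            i = min(int(np.uint64(h) // divisor), num_bins - 1)
            V.setdefault(ns, np.zeros(num_bins, dtype=np.int64))[i] += 1
        # Under random uniform hashing each count is Binomial(N, 1/num_bins),
        # approximately Poisson(N / num_bins), so variance should track mean.
        return {ns: (counts.mean(), counts.var()) for ns, counts in V.items()}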

Enough with the words, let’s look at the graphs! To generate these, I took the subset of namespaces that had at least 100,000 elements, of which there are 83. Each point is a namespace, and the green line shows the theoretical variance = mean relationship we’d expect from binning a random uniform distribution. Finally, I ran a Bonferroni-corrected chi-squared test within each namespace. Those that come out “almost suspect” or worse are highlighted in red.

You can think of these namespaces as small experiments. Together, they help give us a picture of what the chi-squared test done on the whole dataset tells us.

A few observations:

  • Under the 100-bin chi-squared test, SDBM was flagged as being far too uniformly distributed. We can see that quite clearly here: the variance of its bin counts is generally quite a bit lower than the mean bin count.
  • On the other hand, AP has a comparatively high variance. This translates, again, into some bins being overly “favored” by the hash function.
  • These pictures also give us some idea of how noisy the functions are on a namespace-by-namespace basis. Compare Spooky and Murmur3: for Spooky, the residuals for all of the namespaces are quite low and basically equal, whereas Murmur3’s residuals show a lot more variability.

So far we’ve been taking our input sets as a given, and examining the statistical properties of the outputs. While powerful, we need not limit ourselves to these techniques. Onward to avalanche!

Avalanche Analysis

A common test of hash function performance is whether or not it achieves “avalanche.” This refers to the desirable characteristic that

P(Output bit i changes | Input bit j changes) = 0.5 for all i, j

Basically, if we keep all of the input bits the same, save for exactly one which we flip, we’d hope that each of our hash function’s output bits changes with probability 1/2.
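Here is a brute-force sketch of how that flip probability can be estimated (Python; hash_fn stands in for any candidate function taking and returning integers, and the naive triple loop is illustrative rather than fast):

    import numpy as np

    def avalanche_bias(hash_fn, key_bits=192, hash_bits=64, num_keys=1000, seed=0):
        # flip_counts[j, i] counts how often output bit i changes when
        # input bit j is flipped, over num_keys random keys.
        rng = np.random.default_rng(seed)
        flip_counts = np.zeros((key_bits, hash_bits))
        for _ in range(num_keys):
            key = int.from_bytes(rng.bytes(key_bits // 8), "big")
            h = hash_fn(key)
            for j in range(key_bits):
                diff = h ^ hash_fn(key ^ (1 << j))  # output bits that changed
                for i in range(hash_bits):
                    flip_counts[j, i] += (diff >> i) & 1
        # 0.0 is perfect avalanche; 0.5 means the output bit is stuck
        # with respect to that input bit.
        return np.abs(flip_counts / num_keys - 0.5)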

I generated the following avalanche diagrams using a random sample of 4,000 keys (2,000 each of the 192-bit and 256-bit types). The x-axis is the input key bit, the y-axis is the output hash bit, and the color of the (x, y) tile is a measure of the bias that I/O pair has. Black indicates the desired 50% flip probability; bright green indicates that the output bit is “stuck” and, ceteris paribus, never varies as a result of flipping just that input bit.

Avalanche Diagram

This test absolutely wrecks AP, SDBM, both FNV twins, and RS. Jenkins has some poor mixing in its upper bits, but that is mentioned in the implementation. It’s very small, but a slight bias can be observed in City’s lowest bits on the Creative keys. Murmur3 and Spooky are the only two functions left unscathed by this test. Given some of our algorithmic needs, this is a very slight knock against both Jenkins and City.

Conclusion

After all of this, Murmur3, Jenkins, City, and Spooky are the only functions that I’m really pleased with for our work. I’ll give a slight edge to Murmur3 and City over Jenkins due to the avalanche results, and City’s incredible speed. Spooky’s performance here is notable, but I’m a little uneasy putting it forward as a candidate for production use while it is still in beta. Based on these results it shows a lot of promise, so I’ll be keeping my eye on it.

The next logical step is to plug some of these in to Timon’s work, and see how they serve as the keystone of our hash table!

Comments

  1. Salvor Hardin says:

    Spooky hash was beta when you posted, but it looks like the latest is marked production ready:

    http://burtleburtle.net/bob/c/spooky.h

  2. You guys have one of the best tech blogs I’ve ever seen. Thanks–it’s invaluable. Here’s a hash function question you didn’t cover directly, but the answer seems to be implied. By definition, if a function has good avalanche behavior, a minute change almost always gives a grossly different result. If one needs K different hash functions, and starts with a good algorithm, say, Spooky, is it sufficient, in your view, to implement this by simply concatenating a different byte onto each of K instances of the input? Or would some similar expedient work better, such as flipping a different set of input bits for each of the K functions? I’m not sure whether or not this is implied. Any thoughts on this? Needing K functions must be pretty common…

    • Thanks for the kind words! You’re correct that needing K functions is fairly common, with a Bloom filter being a classic example.

      Many higher quality hash functions will allow you to pass in a seed value when calling the function. For instance, here is the function signature from one of the Murmur3 variants:

      void MurmurHash3_x86_32 ( const void * key, int len, uint32_t seed, void * out )

      By changing the value of “seed” in K successive calls, you can generate K different hashes, using the same underlying algorithm. This seems like the most natural way to address the problem.
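      As a minimal sketch, assuming the third-party mmh3 Python package (a wrapper around the MurmurHash3 reference implementation that exposes this seed parameter):

          import mmh3

          def k_hashes(key, k):
              # K different 32-bit hashes of the same key, one per seed value.
              return [mmh3.hash(key, seed) for seed in range(k)]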

      Kirsch and Mitzenmacher develop a very nice idea in their paper “Less Hashing, Same Performance: Building a Better Bloom Filter” which, in essence, allows you to generate an arbitrary number of “hashes” by combining the results of two hash functions according to the rule g_i(x) = h1(x) + i * h2(x) http://www.eecs.harvard.edu/~michaelm/CS223/lesshash.pdf
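      A sketch of that rule, where h1 and h2 are the two base hash values for a given key and m is the number of slots being addressed:

          def kirsch_mitzenmacher(h1, h2, k, m):
              # g_i(x) = h1(x) + i * h2(x), reduced into the m slots.
              return [(h1 + i * h2) % m for i in range(k)]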

      Finally, a word about the limits of avalanche diagrams. The ones typically generated (and the ones shown in this article) only look at what happens when a single bit is flipped. They do not tell you anything about the behaviors that can occur when pairs of bits, or triplets, or whatever number you like, are manipulated together. In short, avalanche does nothing to tell you about bit independence, which is a much stronger criterion.

  3. For a Bloom filter, you usually don’t need many bits for each hash, so just slicing up a single 64-bit or 128-bit hash is probably sufficient and faster. The top three hashes in this article have 128-bit versions, and they seem to be safe to slice. City also has a 256-bit version.
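    As a sketch of the slicing idea (assuming the third-party mmh3 package for the 128-bit hash; the helper is mine):

        import mmh3

        def sliced_hashes(key, k, bits_per_slice):
            # Cut one 128-bit hash into k non-overlapping slices
            # (requires k * bits_per_slice <= 128).
            h = mmh3.hash128(key)
            mask = (1 << bits_per_slice) - 1
            return [(h >> (i * bits_per_slice)) & mask for i in range(k)]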

    And a very good review of hash functions! I was searching for some analysis of the distribution of hash outputs, not just whether or not they collide, and this was nice.

  4. SpookyHash V2 has been released: http://burtleburtle.net/bob/hash/spooky.html

  5. It’d be great if you included SipHash in a possible future installment of your series. SipHash was designed to counter hash-flooding DoS attacks in combination with a strong random seed. So FYI: https://131002.net/siphash/

  6. Arnaud Pilon says:

    What about xxHash?

