### Introduction

*“All known efficient cardinality estimators rely on randomization, which is ensured by the use of hash functions.”
–Flajolet, et al*

Recalling the KMV algorithm Matt presented in his last post, one will note that every stream element is passed to a hash function as part of the processing step. This is meant to transform the data being operated on from its native distribution into something uniformly distributed. Unfortunately, we don’t live in a perfect world, and since all of the algorithm’s analysis assumes that this hash function does its job well, we wanted to get some sense of how it behaves under less friendly conditions. The first half of this post will investigate the algorithm’s performance when we artificially introduce bias, and the second half will look at its behavior with a handful of real hash functions.

### A Simple Error Model

The first hash function error model that came to mind is admittedly unrealistic and ham-fisted, but hopefully illustrative. Suppose you have a stream of fixed sized, an ideal hash function, and from these you produce a distinct value estimate using the KMV algorithm. Now suppose that for some unlucky reason, one bit from your hash function is stuck; it’s always a zero or a one, but the other 31 bits are free of this curse.

There’s nothing to stop you from computing a distinct value estimate using this janky hash with KMV, but your intuition suggests that it shouldn’t be very good. We went through this exact process with various choices of *k*, using a random number generator to simulate a perfect hash function.

Before we look at the data, let’s think about what we should expect. From the perspective of KMV, it shouldn’t make a whole lot of difference if your *k*th smallest element is odd or even (for instance, in a case where the lowest order bit always/never set, respectively). It does, however, make a difference if you’re actually incapable of seeing values smaller than 2^{31}, which is what happens when the highest order bit is always set. Thus, in both the 0-biasing and 1-biasing cases, we should expect that higher order bits have a much more dramatic effect on error than lower order bits.

Notice how the performance degradation follows two different patterns. When we are fixing bits as ones, the observed error increases fairly smoothly, and tends to result in under estimates. In contrast, setting bits to zeros results in no change until the error increases producing catastrophic over estimates. Additionally, larger values of *k* have protective effects against these biases.

### A Somewhat Less Simple Error Model

Now that we have some intuition for the problem, let’s get a little more subtle. Instead of always setting the *n*th bit as a 0 or 1, let’s add a probabilistic element. We’ll do the same experiment as before, except we will now fix the *n*th bit with probability *p*. Thus, when *p *= 0 we have a perfectly well behaved KMV, and when *p* = 1, we have the experiment we just finished discussing. In the following diagram, each tile represents the average error across several experiments in which a stream of 1,000,000 unique elements was fed to a KMV sketch (*k *= 1024) which was rigged to modify the *n*th bit with probability *p*.

Many of the same lessons can be seen here — high order bits matter more, downward biasing degrades performance sharply, upward biasing degrades more smoothly. Additionally, as we’d expect, within a given bit, more bias means more error.

### Send in the Hash Functions

All of the experiments to this point have involved using a random number generator instead of hashing real data. I think it’s time that we took a look at what happens when we drop in a few real hash functions with real data. For the following experiments, I’m using four 32-bit hashes — Murmur3, SDBM, Arash Partow’s hash, and one of the old Donald Knuth hashes. You may recall these from our series on choosing a good hash function (although 64-bit versions were used there). I chose four text corpuses:

*Romeo and Juliet*, stripped of all punctuation and converted to lower case (3794 words)`/usr/share/dict/words`(99171 words)- 1,000,000 random 12 character long strings, each sharing the same suffix: “123456″
- 1,000,000 random 12 character long strings, each sharing the same prefix: “123456″

Using formulas from this paper, we can compute the relative error that 99% of KMV estimates should theoretically fall within. This turns out to depend on *k* and the stream size.

To make these pictures, I chose random values of *k* within each hash/document pair at which I evaluate the cardinality estimate and compute the relative error. The lines are linear interpolations between sampled points and are shown solely for clarity. The y-axis scale is adjusted on a per-picture basis to best display the theoretical envelope within which we expect our errors to lie.

Now that we’ve gotten through all the necessary preamble, let’s take a look at the results!

One picture in and we’ve already learned a lesson: choice of hash function seriously matters! SDBM and DEK cause the algorithm to perform well below its capabilities. DEK’s error is actually off the charts for most of this graph, which is why it does not appear until *k* > 3,000.

On a bigger corpus with tighter theoretical error bounds, Murmur3 and AP are still doing quite well. Do note, however, that AP dips outside the envelope for a while at *k* = 70,000 or so.

With the random strings, SDBM performs much better than it did on English words. DEK, however, is still hopeless. It’s a little tough to see on these pictures, but at high *k*, AP starts to fall off the wagon, and even Murmur3 dips outside the envelope, though not beyond what we’d expect, statistically speaking. Honestly, I was hoping for some fireworks here, but they didn’t materialize. I was wondering if we might see some hashes break on one version of these strings, and do fine on the other due to the location of the varying key bits (high order/low order). Sadly, that didn’t happen, but a negative result is a result none the less.

To summarize these, I made the following table, which shows us the percentage of time that an one of the samples falls outside the theoretical envelope. In this view, Murmur3′s superiority is clear.

AP | DEK | Murmur3 | SDBM | |
---|---|---|---|---|

Romeo and Juliet |
0.00% | 100.00% | 0.00% | 61.54% |

/usr/share/dict/words |
10.76% | 100.00% | 0.00% | 68.46% |

Common Suffix | 7.22% | 99.11% | 1.10% | 0.27% |

Common Prefix | 3.33% | 100.00% | 0.22% | 0.0001% |

### Fin

KMV is a very nice little algorithm that is incredibly simple to understand, implement, and use. That said, if you’re going to make use of it, you really do need to practice some due diligence when choosing your hash function. With packages like smhasher available, trying out multiple hash functions is a cinch, and a little legwork at the start of a project can save you from confusion and despair later on!

[...] quick note on why we included MurmurHash3 in the extension: we’ve done a good bit of research on the importance of a good hash function when using sketching algorithms like [...]