## Set Operations On HLLs of Different Sizes

### Introduction

Here at AK, we’re in the business of storing huge amounts of information in the form of 64 bit keys. As shown in other blog posts and in the HLL post by Matt, one efficient way of getting an estimate of the size of the set of these keys is by using the HyperLogLog (HLL) algorithm.  There are two important decisions one has to make when implementing this algorithm.  The first is how many bins one will use and the second is the maximum value one allows in each bin.  As a result, the amount of space this will take up is going to be the number of bins times the log of the maximum value you allow in each bin.  For this post we’ll ignore this second consideration and focus instead on the number of bins one uses.  The accuracy for an estimate is given approximately by 1.04/√b, where b is the number of bins.  Hence there is a tradeoff between the accuracy of the estimate and the amount of space you wish to dedicate to this estimate. Certainly, projects will have various requirements that call for different choices of number of bins.
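To make the tradeoff concrete, here is a small sketch that tabulates the approximate error against the space used. The function names and the choice of a maximum bin value of 31 are assumptions made for illustration, not anything prescribed by the algorithm.

```python
import math

def hll_relative_error(num_bins: int) -> float:
    """Approximate relative error of an HLL estimate: 1.04 / sqrt(b)."""
    return 1.04 / math.sqrt(num_bins)

def hll_size_bits(num_bins: int, max_bin_value: int) -> int:
    """Space used: number of bins times the bits needed to hold 0..max_bin_value."""
    bits_per_bin = max(1, math.ceil(math.log2(max_bin_value + 1)))
    return num_bins * bits_per_bin

# max_bin_value=31 is an arbitrary illustrative choice (5 bits per bin).
for log2_bins in (4, 10, 14):
    b = 2 ** log2_bins
    print(f"b = 2^{log2_bins}: error ~ {hll_relative_error(b):.4f}, "
          f"size = {hll_size_bits(b, max_bin_value=31)} bits")
```

Every quadrupling of the number of bins halves the error and quadruples the space.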

The HLL algorithm natively supports the union operation.  However, one requirement for this operation is that the HLLs involved are of the same size, i.e. have the same number of bins.  In practice, there’s no guarantee that HLLs will satisfy this requirement.  In this post, I’ll outline the method by which we transform an HLL with a certain number of bins into one with fewer bins, allowing us to perform set operations on any two HLLs, regardless of size.

### Key Processing

As discussed in the HyperLogLog paper, to get a cardinality estimate with an HLL with 2^n bins on a data set we pass over each key, using the placement of the rightmost “1” to determine the value of the key and the next n digits to the left to determine in which bin to place that value.  In each bin, we only store the maximum value that that bin has “seen.”

Below I’ve shown how two HLLs (one of size 2^3 and one of size 2^4) process two different keys.  Here, the keys have the same value, because the purpose of this example is to illustrate how the location in which we place the key changes when the HLL has twice the number of bins.

Above, the keys which are attributed to the fifth and thirteenth bins in the larger HLL would both have been attributed to the fifth bin in the smaller HLL.  Hence, unraveling the algorithm a bit, we see that the values which are seen by the fifth and thirteenth bins in the larger HLL would have been seen by the fifth bin in the smaller HLL had they run on the same dataset.  Because of this, in the case where the two algorithms estimate the same dataset, the value stored in the fifth bin in the smaller HLL is the maximum of the values stored in the fifth and thirteenth bins in the larger HLL.

### Folding HLLs

What happened above is not an isolated phenomenon.  In general, if one uses the HLL algorithm twice on a dataset, once with 2^(n+1) bins and once with 2^n bins, the value in the kth bin in the smaller HLL will be the maximum of the values in the kth and (k + 2^n)th bins of the larger HLL.  As a result, if given an HLL of size 2^(n+1) that one wishes to transform to an HLL of size 2^n, one can simply fold the HLL by letting the value of the kth bin in the folded HLL be given by the maximum of the values in the kth and (k + 2^n)th bins of the original HLL.

In fact, we can fold any HLL an arbitrary number of times.  Repeating this process, we can take an HLL of size 2^n to an HLL of size 2^m for any m which is less than or equal to n.  Hence if we wish to perform a set operation on two HLLs of different sizes, we can simply fold the larger HLL repeatedly until it is the same size as the smaller HLL.  After this, we can take unions and intersections as we wish.
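A minimal sketch of folding, assuming an HLL is represented as a plain Python list of bin values whose length is a power of two. The function names `fold_once`, `fold_to`, and `union` are made up for this illustration and are not taken from any particular HLL library.

```python
def fold_once(bins):
    """Fold an HLL with 2^(n+1) bins into one with 2^n bins.

    Bin k of the result is the max of bins k and k + 2^n of the input.
    """
    if len(bins) < 2 or len(bins) % 2 != 0:
        raise ValueError("need an even number of bins (at least 2)")
    half = len(bins) // 2
    return [max(bins[k], bins[k + half]) for k in range(half)]

def fold_to(bins, target_size):
    """Fold repeatedly until the HLL has target_size bins."""
    for size in (len(bins), target_size):
        if size < 1 or size & (size - 1) != 0:
            raise ValueError("sizes must be powers of two")
    if target_size > len(bins):
        raise ValueError("cannot fold to a larger size")
    while len(bins) > target_size:
        bins = fold_once(bins)
    return bins

def union(a, b):
    """Union of two HLLs of possibly different sizes: fold, then bin-wise max."""
    target = min(len(a), len(b))
    a, b = fold_to(a, target), fold_to(b, target)
    return [max(x, y) for x, y in zip(a, b)]

big = [1, 5, 2, 0, 3, 1, 7, 0]   # 2^3 bins
small = [4, 0, 0, 9]             # 2^2 bins
print(fold_once(big))            # the larger HLL folded to 2^2 bins
print(union(big, small))         # union at the smaller size
```

The union step is the usual bin-wise maximum; folding only has to make the two lists the same length first.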

### Folding – An Example

Below, we show a simple example of how folding works.  Here we have an HLL with 2^3 bins which we fold to be an HLL with 2^2 bins.  In the diagram, I’ve torn an HLL of size 2^3 in half and placed the strips side by side to emphasize how we line up bins and take maximums in the folding process.  Notice that the values in the bins of the folded HLL are the maximum of the relevant bins in the larger HLL.

This technique gives us the flexibility to perform set operations on any two HLLs regardless of the number of bins used in the algorithms.  Its usefulness in this regard is a bit offset by the fact that the accuracy of the estimate on these is limited by the accuracy of the least accurate HLL.  For example, an HLL of size 2^10 will have accuracy roughly 23 times better than an HLL of size 2 (to see where I’m getting these numbers from, you’ll have to read the paper!).  Unfortunately, if we combine these with a set operation, our resulting estimate will have the same accuracy as the smaller HLL.

### Summary

The HyperLogLog algorithm supports set operations in a nice way only if the number of bins used is fixed.  Using folding, one can correct for this by reducing the size of the larger HLL to the size of the smaller.  The cost of this convenience is in the accuracy of the estimates after the folding process.  In my next post, I’ll explore some methods of performing the set operations without this loss of accuracy.

## Sketch of the Day: K-Minimum Values

### Intro

We’ve been talking about probabilistic distinct value counting with sketches (DV sketches) for a while now and have had some fun experiences implementing them in our production environment. In this post I want to talk about a DV sketch that is very intuitive and easy to implement, the K-minimum Values sketch (KMV). While KMV sketches are relatively lightweight and accurate, they are not the best of breed when it comes to DV counting. They are useful to me in two ways, though: for exposition and for multiset operations.

### History

KMV seems to have been first introduced in 2002 by Ziv Bar-Yossef et al. in the great paper Counting distinct elements in a data stream. In this paper they talk about improving on the basic intuition of the seminal DV sketch papers of Flajolet and Martin and of Alon, Matias, and Szegedy (AMS), who put some formality around the frequency moment problems, bounds of algorithms, etc. Flajolet and Martin’s paper is in turn based upon work from Morris 1978 (looking for streaks of right-most zeroes, i.e. the predecessor to LogLog and HyperLogLog). These are fun to read (although they admittedly get pretty mathy) and it’s cool to see the progression of knowledge, accuracy, and efficiency as these guys do their work. You can almost imagine the fist fights that happen during their meet-ups! The final detailed work on KMV is by Beyer et al. in On Synopses for Distinct-Value Estimation Under Multiset Operations.

### How it works

The intuition behind KMV is straightforward. Supposing you have a good hash function, i.e. hash values are evenly distributed over the hash space (I will normalize the hash space output to [0, 1] for the rest of this), then you could estimate the number of distinct values you have seen by knowing the average spacing between values in the hash space. If I see 10 distinct values, I would expect them on average to be spaced about 1/10th apart from each other. You could do this cheaply by keeping track of, say, the smallest value you have ever seen. If the values are indeed uniformly distributed and provided you’ve thrown a decent amount of data through it, you could guess that the smallest value you have seen is a decent estimate of the average spacing of hash values in your space.

Of course, this doesn’t have a lot of “nice” properties. Taking only one value opens you up to a ton of variance and you are fairly dependent on the “goodness” of your hash. In order to improve upon this Bar-Yossef suggests keeping the k smallest values you have ever seen around. The algorithm becomes:

```
Initialize KMV with first k values
for all h(n):
    if h(n) < max(KMV):
        insert h(n) into KMV set
        remove largest value from KMV

Cardinality(KMV):
    return: (k-1)/max(KMV)
```
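The pseudocode above can be sketched in Python roughly as follows. This is only one way to do it: the `KMV` class, the `hash01` helper, and the use of truncated SHA-1 as the hash are all assumptions made for illustration, and the exact count returned while fewer than k values have been seen is a convenience of this sketch.

```python
import hashlib
import heapq

def hash01(item: str) -> float:
    """Map a string to (0, 1] using the top 64 bits of SHA-1 (a stand-in hash)."""
    h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
    return (h + 1) / 2.0 ** 64

class KMV:
    def __init__(self, k: int):
        self.k = k
        self._heap = []        # negated values, so _heap[0] is the largest kept value
        self._members = set()  # the same values, for duplicate detection

    def add(self, item: str) -> None:
        h = hash01(item)
        if h in self._members:
            return                                  # already tracked
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)          # still filling the first k slots
            self._members.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heappushpop(self._heap, -h)  # drop the current largest
            self._members.discard(evicted)
            self._members.add(h)

    def cardinality(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))           # fewer than k distinct values seen
        return (self.k - 1) / -self._heap[0]

sketch = KMV(k=256)
for i in range(10000):
    sketch.add(f"item-{i}")
print(sketch.cardinality())   # an estimate of the 10000 distinct items
```

The heap keeps insertion at O(log k); a sorted list would work just as well for small k.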

For a KMV sketch of size k=3, graphically you have:

A very straightforward approach. Note that the “-1” in the numerator comes from a bias correction in the estimate. You’re going to have to read the paper for that. So, the size of the sketch is basically k 64-bit values.

### Set Operations

Performing set operations with KMVs is also incredibly straightforward. The intuition around unions is that there is no difference between combining 2 KMV sketches and keeping the k minimum values in both versus just keeping one to start with, so unions are “lossless”. To perform union, you merely take 2 sketches and combine their values and keep the k smallest ones (if the 2 sketches are of different sizes, k and k’, then you keep the min(k,k’) values in order to keep the lowest resolution).

```
Union(A,B):
    k = min( |A|, |B|)
    return: min_k( A U B )
```
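In Python, a union along these lines might look like the sketch below. It assumes each sketch is simply the collection of hash values it currently holds, already normalized to (0, 1]; the function names are illustrative.

```python
def kmv_union(a, b):
    """Union of two KMV sketches, given as collections of normalized hash values."""
    k = min(len(a), len(b))              # keep the lower resolution
    return sorted(set(a) | set(b))[:k]   # the k smallest values overall

def kmv_cardinality(values):
    """(k - 1) / max(KMV), where k is the number of values held."""
    return (len(values) - 1) / max(values)

a = [0.1, 0.2, 0.4]
b = [0.05, 0.2, 0.3, 0.5]
merged = kmv_union(a, b)
print(merged, kmv_cardinality(merged))
```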

For intersections you use the KMV to estimate the Jaccard coefficient for the 2 (or n) sets. Basically, you treat the 2 KMV sketches for each set as a random uniform sample and intersect these to estimate Jaccard. So, you assemble the k minimum values of the two sets (as you did in union above), and intersect this result with the original sketches to obtain an estimate of the overlap of the 2 sets. The steps are:

```
IntersectionCard(A,B):
    L = UnionSet(A,B)  # the set this time, not just the cardinality
    k = min( |A|, |B|)
    K = | L ∩ A ∩ B |
    return: K/k * Cardinality(L)
```
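The same steps as a Python sketch, again assuming each sketch is just the collection of normalized hash values it holds. The function name is made up for this example.

```python
def kmv_intersection_cardinality(a, b):
    """Estimate |A ∩ B| from two KMV sketches (collections of hash values in (0, 1])."""
    k = min(len(a), len(b))
    union_values = sorted(set(a) | set(b))[:k]             # L: the union sketch itself
    overlap = len(set(union_values) & set(a) & set(b))     # K = |L ∩ A ∩ B|
    union_estimate = (k - 1) / union_values[-1]            # Cardinality(L)
    return overlap / k * union_estimate                    # Jaccard estimate * union size

a = [0.1, 0.2, 0.4]
b = [0.05, 0.2, 0.3, 0.5]
print(kmv_intersection_cardinality(a, b))
```

With only three values per sketch the estimate is very noisy; the example is sized to be checked by hand, not to be accurate.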

One of the nice features of KMV which is different than say HyperLogLog, is that you can take n-way intersections by extending the algorithm above. To do this with HyperLogLog you actually need to compute the n-way algebra for set intersection i.e.

```
|A ∩ B| = |A| + |B| - |A U B|
```

However, in our experience of using KMV for set operations on Zipfian data, KMVs still don’t perform as well as HyperLogLog sketches for computing n-way intersections using the same amount of memory.

### Expansion to Multisets

One of the nice features of KMV sketches is their expansion to supporting multiset operations, dubbed the AKMV sketch. This is great if you are using them for document representations and want to support document similarity operations like tf-idf (or any other multiset operation). In order to expand the basic KMV structure to support multisets (described here) you just add a counter on top of the values you are storing. In this way you get a decent sample of the counts of things in the stream/document to use for multiset operations. Most other DV sketches, HyperLogLog in particular, don’t support these types of queries.
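As a rough illustration of the “counter on top of the values” idea, here is a minimal sketch. The `AKMV` class and the `hash01` helper (truncated SHA-1) are assumptions for this example and are not taken from the paper.

```python
import hashlib

def hash01(item: str) -> float:
    """Map a string to (0, 1] using the top 64 bits of SHA-1 (a stand-in hash)."""
    h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
    return (h + 1) / 2.0 ** 64

class AKMV:
    def __init__(self, k: int):
        self.k = k
        self.counts = {}   # hash value -> number of times it has been seen

    def add(self, item: str) -> None:
        h = hash01(item)
        if h in self.counts:
            self.counts[h] += 1            # already tracked: bump its counter
        elif len(self.counts) < self.k:
            self.counts[h] = 1             # still filling the first k slots
        else:
            largest = max(self.counts)     # O(k) scan, fine for a sketch of a sketch
            if h < largest:
                del self.counts[largest]   # evict the largest value and its counter
                self.counts[h] = 1

doc = "the cat sat on the mat with the other cat".split()
sketch = AKMV(k=3)
for word in doc:
    sketch.add(word)
print(sketch.counts)
```

Because the largest retained value only ever decreases, an evicted value can never re-enter, so every value still in the sketch carries its exact count.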

To see how well this might work in practice, I took a look at some simple tf-idf similarity against the 20 news groups data set. This data set contains about 1000 news group emails on various topics such as atheism and motorcycles (woo!). For each article I constructed an AKMV sketch of the words in it and used this representation as the basis for tf-idf.  I cleaned up the data marginally by limiting my analysis to the 5000 most common words in the corpus (as seems to be the norm) and only considered alphanumeric “words”.   Additionally, I cherry picked only a few newsgroups from the set that showed “nice” separation in the SVD.  You can think of the documents looking a bit like this where the red dots are the entries in the AKMV and the green dots are not (as above):

Once I created the tf-idf matrix, I SVD-ed it and plotted each newsgroup against the second and third singular vectors (the first vector in this case contained mostly information about the mean of the document vectors and contained little real information for classification).  The intermediate singular vectors for differing k were projected onto the actual singular vectors from the complete matrix (k = Inf).  Running through increasing k, the newsgroups look like this:

You can see the structure start to appear relatively quickly for small k and you can also see how some of the articles “stick” to their final spots due to them having fewer than k words.  Clearly you would have to do more work and testing if you wanted to implement something like this in a real classifier or search engine but it seems to be a promising approach.

Here is the same thing for a corpus composed of 23 articles about the Tom Cruise/Katie Holmes divorce and 20 articles about the Higgs boson.

Using document sketches as a basis for a recommender system/search engine or any other application that requires similarity metrics seems like a promising avenue.  It would be very interesting indeed to run some real tests of precision/recall and memory footprint for sketch based recommenders/classifiers against other more standard approaches.

###### Disclaimer:

I make no claims about having built a classifier of any sort here. A lot of work and decisions would be necessary to move from these ideas to a useful classification scheme in a real environment. I was interested in how much of the flavor of a document would be retained in an AKMV sketch. Based on the above results, I think that the answer is “quite a bit,” even for modest values of k. I don’t think it would be out of the question to try to build a system that allowed you to compute similarities or apply classification tools after the sampling process inherent in the construction of these sketches.

### Compression

An interesting thing to notice is that as your DV count gets larger, your max value of the k items is getting smaller. What this means is a simple compression algorithm that works is to just throw away the higher order unused bits of all the k values. Oddly, as the DV count gets larger your KMV will get smaller without losing accuracy.
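Here is a small sketch of that idea, assuming the k values are kept as unsigned 64-bit integer hashes rather than normalized floats. The `pack` and `unpack` helpers are hypothetical names, and packing into a single Python integer is just one convenient way to show the bit savings.

```python
def pack(values):
    """Pack integer hashes using only as many bits as the largest one needs."""
    width = max(1, max(values).bit_length())   # higher-order bits are all zero
    packed = 0
    for v in values:
        packed = (packed << width) | v
    return width, len(values), packed

def unpack(width, count, packed):
    """Recover the original values, in their original order."""
    mask = (1 << width) - 1
    # Values were shifted in first-to-last, so the last one sits in the lowest bits.
    return [(packed >> (width * (count - 1 - i))) & mask for i in range(count)]

kmv_values = [5, 1000, 70000]      # pretend these are the k smallest 64-bit hashes
width, count, packed = pack(kmv_values)
print(width * count, "bits instead of", 64 * count)
print(unpack(width, count, packed))
```

The only extra bookkeeping is the bit width, which has to be stored alongside the packed values.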

### Summary

There are many DV sketches in the world and KMV is one of the most interesting due to how easy it is to comprehend and implement. I particularly enjoy using KMV as a pedagogical tool and a solid jumping off point for DV sketching. The fact that KMV is so straightforward makes it stand out in a world of more confusing math and complicated sketching algorithms. In the right context it very well could be the right solution for your sketching needs, especially given the multiset support.

## Sketching the last year

Sketching is an area of big-data science that has been getting a lot of attention lately. I personally am very excited about this.  Sketching analytics has been a primary focus of our platform and one of my personal interests for quite a while now. Sketching as an area of big-data science has been slow to unfold (thanks, Strata, for declining our last two proposals on sketching talks!), but clearly the tide is turning. In fact, our summarizer technology, which relies heavily on our implementation of Distinct Value (DV) sketches, has been in the wild for almost a year now (and, obviously, we were working on it for many months before that).

#### Fast, But Fickle

The R&D of the summarizer was fun but, as with most technical implementations, it’s never as easy as reading the papers and writing some code. The majority of the work we have done to make our DV sketches perform in production has nothing to do with the actual implementation.  We spend a lot of time focused on how we tune them, how we feed them, and how we make them play well with the rest of our stack.

Likewise, setting proper bounds on our sketches is an ongoing area of work for us and has led down some very interesting paths.  The insights we have gained concern not just high-level business problems but also very low-level, watchmaker-type stuff.  Hash function behaviors and stream entropy alongside the skewness of data-sets themselves are areas we are constantly looking into to improve our implementations. This work has helped us refine and find optimizations around storage that aren’t limited to sketches themselves, but extend to the architecture of the system as a whole.

#### Human Time Analytics

Leveraging DV sketches as more than just counters has proven unbelievably useful for us. The DV sketches we use provide arbitrary set operations. This comes in amazingly handy when our customers ask “How many users did we see on Facebook and on AOL this month that purchased something?” You can imagine how far these types of questions go in a real analytics platform. We have found that DV counts alongside set operation queries satisfy a large portion of our analytics platform’s needs.

Using sketches for internal analytics has been a blast as well. Writing implementations and libraries in scripting languages enables our data-science team to perform very cool ad-hoc analyses faster and in “human-time”. Integrating DV sketches as custom data-types into existing databases has proven to be a boon for analysts and engineers alike.

#### Reap The Rewards

Over the course of the year that we’ve been using DV sketches to power analytics, the key takeaways we’ve found are: be VERY careful when choosing and implementing sketches; and leverage as many of their properties as possible.  When you get the formula right, these are powerful little structures. Enabling in-memory DV counting and set operations is pretty amazing when you think of the amount of data and analysis we support. Sketching as an area of big-data science seems to have (finally!) arrived and I, for one, welcome our new sketching overlords.

## Statistical Toolbox: The Kolmogorov-Smirnov Test

Author’s Note: The Kolmogorov-Smirnov test is a handy tool that is conceptually clean, and can be useful in a variety of data analysis situations. I’ll introduce it in the context of a problem that I came across, and give a feel for what it does, and how it might be useful.

### A Question and A Tool

I’ve been doing a lot of work with hash functions, and as part of that work I was posed with a question. If I take the same data, encode it two different ways, and feed the two encodings to the same hash function, is there any difference in the statistical properties of the hashed output data sets?

The model I used to explore this question was to take a great number of SHA1 checksums, and MurmurHash3 these numbers, first encoded as 16 byte integers, and then again as Java Strings. There are a lot of things that one could do at this stage, but the first thing I thought to apply was the Kolmogorov-Smirnov (KS) test.

### The Whatnow?

First, some background. The cumulative distribution function (CDF) is a common and natural way of characterizing a probability distribution. The KS test gives us a tool for taking two CDFs and speaking intelligently about how “different” they are. A typical use case is as follows:

• You collect data that you suspect follows some theoretical distribution (uniform, Poisson, whatever)
• From the raw data you construct an empirical cumulative distribution function (ECDF)
• You use the KS test to answer the question, “Assuming my data were sampled from this theoretical distribution, what is the probability of seeing an ECDF that is at least this different from what one would predict?”

A more interesting use case is to compare two empirical distributions for equality. The test is conceptually exactly the same, except instead of comparing a CDF generated from data to one generated by theory, the comparison is between two empirical CDFs. A minor consequence of comparing two empirical data sets is that there is some additional uncertainty that must be dealt with, but this can be addressed by simply using larger samples (see the scaling factors discussed below).

### What Does It Look Like?

The figure on the right is very helpful in understanding what is going on in this test.

Given two CDFs, the first thing the KS test does is find their maximum positive and negative differences, D+ and D-, respectively. These differences are scaled to produce so-called “K statistics.” In the case where one is comparing an empirical to a theoretical CDF (shown in the figure), all one needs to do is scale the differences by sqrt(n) where there are n observations. For the comparison of two empirical distributions of size n and m, D+ and D- are scaled by sqrt(nm/(n+m)).

This scaling takes care of the idea that the same magnitude of difference is more troubling if you have more data. A chance large jump or long lag in your ECDF curve is increasingly unlikely as your samples grow.
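To make D+, D-, and the scaling concrete, here is a plain-Python sketch of both cases. The function names are made up for this example, and the one-sample version assumes the theoretical distribution is Uniform(0, 1), whose CDF is F(x) = x.

```python
import math

def ks_one_sample_uniform(sample):
    """Return (D+, D-, K+, K-) for a sample against the Uniform(0, 1) CDF."""
    xs = sorted(sample)
    n = len(xs)
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))   # ECDF above the CDF
    d_minus = max(x - i / n for i, x in enumerate(xs))        # ECDF below the CDF
    scale = math.sqrt(n)
    return d_plus, d_minus, scale * d_plus, scale * d_minus

def ks_two_sample(a, b):
    """Return (D+, D-, K+, K-) comparing the ECDFs of two samples."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d_plus = d_minus = 0.0
    for x in sorted(set(a) | set(b)):
        while i < n and a[i] <= x:      # i / n is the ECDF of a at x
            i += 1
        while j < m and b[j] <= x:      # j / m is the ECDF of b at x
            j += 1
        diff = i / n - j / m
        d_plus = max(d_plus, diff)
        d_minus = max(d_minus, -diff)
    scale = math.sqrt(n * m / (n + m))
    return d_plus, d_minus, scale * d_plus, scale * d_minus

print(ks_one_sample_uniform([0.25, 0.5, 0.75]))
print(ks_two_sample([1, 2, 3], [4, 5, 6]))
```

Turning the K statistics into a p-value requires the Kolmogorov distribution, which this sketch deliberately leaves out.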

For a vanilla KS test, the larger of K+ and K- is compared against the Kolmogorov distribution. This allows you to compute a p-value telling you the probability of seeing a K statistic as large as you did under the assumptions of the null hypothesis that the sample is drawn from the theoretical distribution you are testing it against.

The KS test doesn’t need a lot of data to start detecting fairly small differences. If you have a lot of data, and you want to get fancy, you can break your data set up into many disjoint subsets and run the KS test on each of the subsets, keeping the K+ and K- statistics for each subset. You can then pool all of the K+ statistics into one collection, all of the K- statistics into another, and individually compare them to their theoretical distribution, which is well approximated by 1 - e^(-2x^2). In this way you can make good use of all of your data, and better balance the competing goals of detecting both global and local divergence from the ideal CDF. See TAOCP Vol. II for a more thorough discussion of this technique.

### So What Happened?

A simple call to scipy.stats.ks_2samp and some waiting returned a p-value of 0.9977065. The size difference between the two data samples’ ECDFs was well within what one would expect, were they drawn from the same underlying distribution. This result is nice. A good hash function should be as insensitive to the statistical nuances of the input data as possible, always producing a nice, uniform, output. Note that this statistic says nothing about the quality of MurmurHash3’s output distribution, only that its ability to grind up the same numbers doesn’t appear to suffer dramatically when they are encoded as strings vs. bytes. As it so happens we’ve seen that Murmur is pretty darn good!

### Closing Thoughts

As with all test statistics, you shouldn’t blindly accept or reject a result on the basis of some arbitrary cutoff. The KS test can’t tell you whether or not any “statistically significant” difference is practically significant. It is a very sensitive test, and given a large enough sample size can detect differences that are meaningless to your application. It’s certainly worth looking at plots of your ECDFs, repeating your analysis on different subsets of your data, and even judging the results of the test in light of other statistical measures or related data. This test wasn’t the end of my analysis of this problem, but it was certainly a useful tool along the way. I hope that it may one day be similarly useful for you!

#### Implementations

• R’s ks.test and ks.boot functions implement the standard and bootstrapped KS test for single and two-sample cases
• SciPy implements a lot of KS tools in the scipy.stats module
• Matlab’s versions live in the statistics toolbox
• Octave has these tests as builtins

#### Books

• TAOCP Vol II. Seminumerical Algorithms by Knuth has a very nice writeup, but is focused on 1 sample tests.
• The KS test is discussed in John Cook’s chapter on testing a random number generator in Beautiful Testing. It is freely readable here.

## Choosing a Good Hash Function, Part 3

Author’s note: Part three of a series studying hash functions. My last post identified a few candidate algorithms that are subjected to further scrutiny here today.

### The Story So Far

The simplest attribute on which one could imagine differentiating candidate hash functions is the number of collisions produced when hashing a fixed pool of keys. By that standard, my last post identified Murmur3, Jenkins, City, Spooky, FNV1/1a, SDBM, AP, and RS as possible contenders. Today we’re going to see how they compare to each other on some more rigorous tests.

### Random Uniformity

A hash function ought to distribute its keys uniformly across its output range. To see how these functions stack up, we’ll put our 42 million unique keys through each hash function, bin the output, and compare the bin counts with expectation:

For bins of equal size, E[bin_i] = Number of items hashed / Number of bins

Now, uniformity is different from random uniformity. In general the latter is not always necessary for building a good hash table, but the analysis of some schemes assumes it. For our purposes, we’re going to want our hashes to look like they are drawn from a random uniform distribution; simple uniformity won’t cut it for our applications. This means that when we look at our bin counts, we want them to be neither too smooth nor too lumpy. To quantify this concept, we’ll use a chi-squared test.

In volume II of TAOCP Donald Knuth provides a somewhat ad-hoc, but easy to understand method for interpreting the p-values calculated by a chi-squared test of randomness. If your p-value is less than 0.01 or greater than 0.99 the process that generated those results is almost certainly non-random. Something less than 0.05 or greater than 0.95 should be considered suspect. Finally, he designates a p-value of less than 0.1 or greater than 0.90 as “almost suspect”.
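For the curious, a test of this shape can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than the code behind the results in this post: the function names are made up, the p-value is the upper-tail probability (so values near 1 mean “too smooth” and values near 0 mean “too lumpy”), and it uses a normal approximation to the chi-squared distribution, which is only reasonable when there are many bins.

```python
import math

def chi_squared_statistic(bin_counts):
    """Sum over bins of (observed - expected)^2 / expected, for equal-size bins."""
    expected = sum(bin_counts) / len(bin_counts)
    return sum((c - expected) ** 2 / expected for c in bin_counts)

def approx_p_value(chi2, num_bins):
    """Upper-tail probability via a normal approximation to chi-squared."""
    dof = num_bins - 1
    z = (chi2 - dof) / math.sqrt(2 * dof)    # chi-squared has mean dof, variance 2*dof
    return 0.5 * math.erfc(z / math.sqrt(2))

def knuth_verdict(p):
    """Knuth's rule-of-thumb thresholds as described above."""
    if p < 0.01 or p > 0.99:
        return "non-random"
    if p < 0.05 or p > 0.95:
        return "suspect"
    if p < 0.10 or p > 0.90:
        return "almost suspect"
    return "ok"

flat = [1000] * 100                 # perfectly even counts: too smooth to be random
p = approx_p_value(chi_squared_statistic(flat), len(flat))
print(p, knuth_verdict(p))
```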

Here I’ve cut the whole 64 bit output space into 100 bins, and again in 1,000,000 bins. For a final test I modded out the bottom 20 bits, to check their distributions in isolation.

| Hash Function | 1 Million bins* | Bottom 20 bits* | 100 bins |
|---|---|---|---|
| AP | 0.70 | 0.50 | <0.01 |
| City | 0.07 | 0.29 | 0.46 |
| FNV64-1 | <0.01 | >0.99 | 0.97 |
| FNV64-1a | >0.99 | >0.99 | 0.87 |
| Jenkins | 0.17 | 0.46 | 0.72 |
| Murmur3 | 0.14 | 0.31 | 0.08 |
| RS | >0.99 | >0.99 | 0.23 |
| SDBM | >0.99 | >0.99 | >0.99 |
| Spooky | 0.84 | 0.27 | 0.98 |

*p-values estimated from a standard normal distribution

Jenkins passes all three of these nicely. City and Murmur each come up “almost suspect” once, and Spooky shows some suspicious behavior in the 100 bin test. I put the heaviest weight on the bottom 20 bit test, and can pretty comfortably give these four functions a pass here. AP does dramatically better at higher bin counts, which is interesting. We can pretty solidly eliminate RS, SDBM, AP, and both FNV variants based on this analysis alone.

As a final note, hash functions are not meant to be RNGs! This test holds them to a very rigid standard that is not generally necessary to build a good hash table. It’s just that in our specific application, we’re going to want our hash values to be somewhat random looking.

### Using Keyspace Structure

Before I continue, let me explain a little bit more of the structure of the data I am working with. I have 251 namespaces, each of which has a variable number of 192 and 256 bit keys associated with it. All told I have in the neighborhood of 66 million datapoints of the form (namespace, key). Only the key portion of these tuples actually gets hashed, however. Up until this point, we have been ignoring the namespace attribute of these data points, and thus have been restricted to looking at the 42 million unique (key, hash(key)) pairs. Let’s see if we can exploit a larger set of data by including the namespaces!

In the chi-squared analysis above, we did our binning over the union of all namespaces. Now let’s individually bin the hash values of each namespace. All said and done, we have 251 namespaces ranging in size from a tiny handful to several million elements. This gives us 251 vectors of size 100, with

V_{n,i} = Number of items of namespace n hashed to the i-th bin

For each namespace, we can compute the mean and variance of its count vector. I’ll leave it as an exercise to the reader, but it’s a pretty simple calculation to show that if you sample from a random uniform distribution, the variance of such a bin-count vector should equal its mean. If the variance is lower than the mean, it implies that the distribution is flatter than expected. On the contrary, if the variance is higher, it implies the existence of hot-spots on the range that are getting more than their fair share of data points hashed there.
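If you would rather see it than prove it, a quick simulation makes the same point. The reasoning is that each bin count is Binomial(N, 1/b), with mean N/b and variance (N/b)(1 - 1/b), which is nearly N/b when there are many bins. The sizes and seed below are arbitrary choices for illustration.

```python
import random
import statistics

def bin_counts(num_items, num_bins, rng):
    """Throw num_items uniformly at random into num_bins bins and count them."""
    counts = [0] * num_bins
    for _ in range(num_items):
        counts[rng.randrange(num_bins)] += 1
    return counts

rng = random.Random(12345)                     # fixed seed so runs are repeatable
counts = bin_counts(num_items=200_000, num_bins=100, rng=rng)
mean = statistics.mean(counts)
variance = statistics.variance(counts)         # sample variance of the count vector
print(f"mean = {mean}, variance = {variance:.1f}, ratio = {variance / mean:.2f}")
```

The ratio will wobble from run to run, but for a random uniform process it should hover around 1.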

Enough with the words, let’s look at the graphs! To generate these, I took the subset of namespaces that had at least 100,000 elements, of which there are 83. Each point is a namespace, and the green line shows the theoretical variance = mean relationship we’d expect from binning a random uniform distribution. Finally, I ran a Bonferroni corrected chi-squared test within each namespace. Those that come out “almost suspect” or worse are highlighted in red.

You can think of these namespaces as small experiments. Together, they help give us a picture of what the chi-squared test done on the whole dataset tells us.

A few observations:

• Under the 100 bin chi-squared test, SDBM was flagged as being way too uniformly distributed. We can see that quite clearly here. Generally, the variance of the bin counts is quite a bit lower than the mean bin count.
• On the other hand, AP has a comparatively high variance. This translates, again, to some bins being overly “favored” by the hash function.
• These pictures also give us some idea of how noisy the functions are on a namespace by namespace basis. Compare Spooky and Murmur3. The residuals for all of the namespaces are quite low, and basically equal for Spooky, whereas Murmur3’s residuals show a lot more variability.

So far we’ve been taking our input sets as a given, and examining the statistical properties of the outputs. While powerful, we need not limit ourselves to these techniques. Onward to avalanche!

### Avalanche Analysis

A common test of hash function performance is whether or not it achieves “avalanche.” This refers to the desirable characteristic that

P(Output bit i changes | Input bit j changes) = 0.5 for all i, j

Basically, if we keep all of the input bits the same, save for exactly 1 which we flip, we’d hope that each of our hash function’s output bits changes with probability 1/2.
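A measurement like this can be sketched as follows. This is only an illustration of the idea: `avalanche_matrix` is a made-up name, truncated SHA-1 stands in for the hash functions under test, and the keys are random 8-byte strings rather than real data.

```python
import hashlib
import random

def hash64(key: bytes) -> int:
    """Stand-in 64-bit hash: the top 8 bytes of SHA-1."""
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "big")

def avalanche_matrix(keys, key_bits, hash_fn=hash64, out_bits=64):
    """matrix[j][i] = fraction of keys whose output bit i flips when input bit j flips."""
    flips = [[0] * out_bits for _ in range(key_bits)]
    for key in keys:
        base = hash_fn(key)
        as_int = int.from_bytes(key, "big")
        for j in range(key_bits):
            flipped = (as_int ^ (1 << j)).to_bytes(len(key), "big")  # flip one input bit
            diff = base ^ hash_fn(flipped)                           # changed output bits
            for i in range(out_bits):
                flips[j][i] += (diff >> i) & 1
    return [[c / len(keys) for c in row] for row in flips]

rng = random.Random(0)
keys = [rng.getrandbits(64).to_bytes(8, "big") for _ in range(200)]
matrix = avalanche_matrix(keys, key_bits=64)
worst = max(abs(p - 0.5) for row in matrix for p in row)
print(f"largest deviation from 0.5: {worst:.3f}")
```

A function with good avalanche behavior keeps every entry close to 0.5; a “stuck” output bit shows up as an entry at or near 0.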

I generated the following avalanche diagrams by using a random sample of 4000 keys (2000 of each type). The x-axis is the input key bit, the y-axis is the output hash bit, and the color of the (x,y) tile is a measure of the bias that I/O pair has. Black indicates the desired 50% flip-probability, bright green indicates that the output bit is “stuck” and, ceteris paribus, it doesn’t ever vary as a result of flipping just that input bit.

This test absolutely wrecks AP, SDBM, both FNV twins, and RS. Jenkins has some poor mixing in its upper bits, but that is mentioned in the implementation. It’s very small, but a slight bias can be observed in City’s lowest bits on the Creative keys. Murmur3 and Spooky are the only two functions left unscathed by this test. Given some of our algorithmic needs, this is a very slight knock against both Jenkins and City.

### Conclusion

After all of this, Murmur3, Jenkins, City, and Spooky are the only functions that I’m really pleased with for our work. I’ll give a slight edge to Murmur3 and City over Jenkins due to the avalanche results, and City’s incredible speed. Spooky’s performance here is notable, but I’m a little uneasy putting it forward as a candidate for use in production, as it is still in beta. I’ll be keeping my eye on it. Based on these results it shows a lot of promise!

The next logical step is to plug some of these in to Timon’s work, and see how they serve as the keystone of our hash table!

## Big Data Ain’t Fat Data: A Case Study

We’ve always had a hunch that our users stick to the same geographic region. Sure, there’s the occasional jet-setter that takes their laptop from New York to Los Angeles (or like Rob, goes Chicago to San Francisco) on a daily or weekly basis, but they’re the exception and not the rule. Knowing how true this is can simplify the way we work with user-centric data across multiple data centers.

When Rob asked me to find this out for sure, my first instinct was to groan and fire up Hive on an Elastic MapReduce cluster, but after a second, I heard Matt’s voice in my head saying, “Big Data isn’t Fat Data”. Why bother with Hadoop?

#### The Setup

If I were solving this problem on a small data-set, it’d be pretty straight-forward. I could write a Python script in about 10 minutes that would take care of the problem. It would probably look something like:

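```python
import sys

users = {}

for line in sys.stdin:
    # parse() is assumed to return a (user, data_center) pair for a log line
    user, data_center = parse(line)
    # Keep a set per user so repeat visits to one data center count only once
    users.setdefault(user, set()).add(data_center)

total_users = len(users)
multiple_dc_users = len([u for u in users if len(users[u]) > 1])
```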

Easy peasy. However, explicitly storing such a large hash-table gets a little problematic once you start approaching medium-sized data (1GB+). Your memory needs grow pretty rapidly (with M users and N data centers, storage is O(MN)), and things start to get a little slow in Python. At this point there are two options. You can brute force the problem by throwing hardware at it, either with a bigger machine or with something like Hadoop. Or you can put on your Computer Science and Statistics hats and get a little bit clever.

What if we turn the problem sideways? Above, we’re keeping a hash table that holds a set of data centers for each user. Instead, let’s keep a set of users per data center, splitting the problem up into multiple hash tables. This lets us keep a small, fixed number of tables – since I’d hope any company knows exactly how many data centers they have – and spread the load across them, hopefully making the load on each table more tolerable. We can then check how many sets each user falls into, and call it a day.

```python
import sys

data_centers = dict([(dc, set()) for dc in AK_DATA_CENTERS])

for line in sys.stdin:
    user, data_center = parse(line)
    data_centers[data_center].add(user)

# Get the total users by taking the union of all of the data center sets
...

# Get all users who are in exactly one set (with two data centers, this is the
# symmetric difference (XOR) of the two sets) and count the size of that set.
...
```
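One way the two elided steps could be filled in is sketched below. The data center names and the in-memory `events` list are made up for the example, and counting memberships with a `Counter` is a substitution of mine: a chained XOR across three or more sets keeps users that appear in an odd number of sets, not only those in exactly one.

```python
from collections import Counter

# Hypothetical (user, data_center) pairs standing in for parsed log lines.
events = [
    ("alice", "east"), ("bob", "west"), ("alice", "west"),
    ("carol", "east"), ("bob", "west"), ("dave", "central"),
]
DATA_CENTERS = ["east", "west", "central"]  # assumed names

# One set of users per data center.
data_centers = {dc: set() for dc in DATA_CENTERS}
for user, dc in events:
    data_centers[dc].add(user)

# Total users: the union of all of the per-data-center sets.
total_users = len(set().union(*data_centers.values()))

# Count how many sets each user falls into.
membership = Counter(u for users in data_centers.values() for u in users)
single_dc_users = sum(1 for n in membership.values() if n == 1)
multiple_dc_users = sum(1 for n in membership.values() if n > 1)

print(total_users, single_dc_users, multiple_dc_users)
```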

While this approach theoretically has better performance with the same O(MN) space requirements, with big enough data the space requirements of the problem totally dominate whatever improvement this approach would provide. In other words, it doesn’t matter how small each hash table is, you can’t fit 80GB of user IDs into the 8GB of RAM on your laptop.

It’s looking pretty bleak for the Clever Way of doing things, since what we really want is a magic hash table that can store our 80GB of user IDs in the memory on our laptops.

#### Bloom Filters

Enter Bloom Filters. A bloom filter is a fixed-size set data structure with two minor features/drawbacks:

1. You can never ask a Bloom Filter for the set of elements it contains.
2. Membership queries have a small, controllable, false-positive probability. Bloom filters will never return false negatives.

With a little bit of work, it’s pretty easy to substitute Bloom Filters for plain old hash tables in our sideways approach above. There’s a slight tweak we have to make to our algorithm to accommodate the fact that we can’t ever query a bloom filter for the elements it contains, but the idea remains the same.
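For readers who haven’t met one before, here is a minimal Bloom filter sketch. The bit-array size, the number of hash positions, and the use of SHA-256 with double hashing are all assumptions chosen to keep the example short; they are not the parameters or hash functions we ran with.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array; each item sets n_hashes bit positions."""

    def __init__(self, n_bits=1 << 16, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive every position from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))


seen = BloomFilter()
for i in range(1000):
    seen.add(f"user-{i}")
print("user-42" in seen)  # an added item is always reported as present
```

Note that the filter’s memory footprint is set in the constructor and never grows, and that there is no method for listing what has been added, which matches the two drawbacks above.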

#### The Payoff

Suppose now we’re keeping a Bloom filter of users per data center. The only thing we have to work around is the fact that we’ll never be able to recover the list of users we’ve added to each set. So, we’ll just deal with users each time we see them instead of deferring our counting to the end.

With that idea in the bag, there are really only a few things to worry about when a request comes in for a given data center.

• Check the bloom filter for that data center to see if the user has been to that one before
• Check the other bloom filters to see how many other data-centers that user has been to before
• Count the total number of data centers that user has seen before. If the user is new to this data center, and the user has seen exactly one other data center before, increment the multiple data center user counter
• If the user has never seen any of your data centers before, that user is a completely new user. Increment the total number of users seen.
• If the user has already seen this data-center, this user is a repeat. Do nothing!
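The steps above can be sketched roughly as follows. This is not the code we ran: the `count_users` function, the data center names, and the tiny `BloomFilter` class (SHA-256 with double hashing) are all assumptions made for illustration. Any object that supports `add` and `in` can be plugged in, so a plain `set` works as an exact reference.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: fixed-size bit array, n_hashes positions per item."""

    def __init__(self, n_bits=1 << 16, n_hashes=4):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))


def count_users(events, data_centers, make_filter=BloomFilter):
    """Return (total_users, multiple_dc_users) in one pass over the events."""
    filters = {dc: make_filter() for dc in data_centers}
    total_users = 0
    multiple_dc_users = 0
    for user, dc in events:
        if user in filters[dc]:
            continue  # repeat visit to this data center: do nothing
        others = sum(1 for name, f in filters.items() if name != dc and user in f)
        if others == 0:
            total_users += 1        # never seen anywhere: a brand new user
        elif others == 1:
            multiple_dc_users += 1  # second data center: count the user once
        filters[dc].add(user)
    return total_users, multiple_dc_users


events = [
    ("alice", "east"), ("bob", "west"), ("alice", "west"), ("carol", "east"),
    ("bob", "west"), ("alice", "central"), ("dave", "central"),
]
print(count_users(events, ["east", "west", "central"]))
```

A user’s third and later data centers change neither counter, which is why the multiple data center check looks for exactly one other data center.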

We ran our version of this overnight. It took us one core, 8GB of RAM, and just under 4 hours to count the number of users who hit multiple data centers in a full week’s worth of logs.

## Choosing a Good Hash Function, Part 2

Author’s note: Part two of a series in which I investigate the performance of a menagerie of hash functions on our data. In today’s episode the analysis begins in earnest with an investigation of collision rates.

Hash function designers have many tools at their disposal, but at their heart, most algorithms follow the same pattern: bytewise iteration over a key during which some internal state is mixed up with the key bits via some combination of ANDs, ORs, XORs, ADDs, shifts, magic numbers, modular arithmetic, and similar tools. As an example, consider the famous FNV hash function, which is astoundingly simple in its construction:

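```c
#include <stdint.h>

uint64_t fnv1_hash(const void *key, int n_bytes)
{
    const unsigned char *p = key;
    uint64_t h = 14695981039346656037ULL; /* 64-bit FNV offset basis */
    int i;
    for (i = 0; i < n_bytes; i++) {
        /* FNV-1: multiply by the 64-bit FNV prime, then XOR in the next byte */
        h = (h * 1099511628211ULL) ^ p[i];
    }
    return h;
}
```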

With all hash functions, the hope is that one may sufficiently mix up the input bits such that, on average, the output is uniformly distributed across its available range. If you think that designing such an algorithm sounds tricky, you’re right!

Over the years many hash functions have been developed that vary widely in quality and complexity. There are many that, despite some demonstrable theoretical flaws, have worked well in enough practical applications to have gained popularity. Other algorithms have been designed from the ground up to achieve a variety of theoretical benchmarks. To get started with this project, I spent some time looking around and came up with a list of 16 reasonably well-known functions that run a pretty wide breadth of quality from negative control to veteran. I started with the simplest test imaginable: I have ~42 million keys available, each of which is either 192 or 256 bits long. Given my entire available set of keys, what fraction can be hashed without collision?

• All hashes are 64 bits.
• Hashing is hard. Many of these functions do quite poorly compared to sampling from a random uniform distribution. The theoretical expectation here is that 0 keys should collide.
• It looks like there is a significant hurdle at ~85% of the keys.
• Although hard to see on this chart, OAT (Bob Jenkins’ less popular one-at-a-time hash) came in just under 100%. While this is a standout performance in comparison to most of the functions tested, it is still below what is expected by theory.
• Unsurprisingly, Murmur3 and Jenkins eat this data set for lunch. They are carefully designed to work well on a broad variety of inputs, thoroughly tested, and I would have been shocked to see them fail here. They are matched by Google’s City Hash, Spooky Hash (Jenkins’ most recent project, which is still under development), FNV-1/1a, SDBM hash (also known as x65599), RS (Arash Partow‘s version of a hash function designed by Robert Sedgewick), and another function of Partow’s own creation.
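To see why theory expects zero collisions, a quick birthday-bound calculation helps. The sketch below treats the hash as a uniform random function and rounds the key count to exactly 42,000,000; both are simplifying assumptions, and the 32-bit figure is included only for contrast.

```python
import math


def expected_colliding_pairs(n_keys, hash_bits):
    """Expected number of key pairs sharing a hash under a uniform random hash."""
    return n_keys * (n_keys - 1) / 2 / 2 ** hash_bits


def prob_any_collision(n_keys, hash_bits):
    """Poisson approximation to the probability of at least one collision."""
    return -math.expm1(-expected_colliding_pairs(n_keys, hash_bits))


n_keys = 42_000_000
print(expected_colliding_pairs(n_keys, 64))  # roughly 4.8e-05 pairs
print(prob_any_collision(n_keys, 64))        # about the same, as a probability
print(expected_colliding_pairs(n_keys, 32))  # roughly 205,000 pairs
```

With 64 bits of output, a good hash should see a collision on a key set this size only about once in twenty thousand attempts, so any function that loses even a handful of keys is falling short of the ideal.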

We’re by no means done here — we’ve simply thinned our list to a few algorithms that merit deeper exploration. The challenge now becomes distinguishing our high performers, and for that we’ll need tools a little bit more sophisticated than simple collision counts. Bring your statistics thinking cap to part 3!

• Unsurprisingly, Donald Knuth’s chapter from The Art of Computer Programming, Volume III: Sorting and Searching is an excellent piece.
• Bob Jenkins wrote a great article in Dr. Dobb’s back in 1997 that is also a great starting place.
• More generally, Jenkins’ own website is a treasure trove of material on the subject of hashing.
• There’s a lot of material about FNV to be had here.
• And let’s not leave out Murmur Hash and City Hash.

## Choosing a Good Hash Function, Part 1

Author’s note: Hello, reader! I’m Colin, a new data scientist on the team. This is the first in a series of posts in which I will be describing my efforts to characterize various hash functions for use here at AK. Future posts will discuss the statistical and computational properties exhibited by these algorithms on our data. Additionally, I will be tackling the problem of trying to use the data that we have available to uncover potentially pathological input sets.

At AK, every event that we track is encoded as an n-tuple of 64-bit integers:

key component #1, key component #2, … , key component #n

This is a convenient form for summary and analysis, but obviously not optimal from a storage perspective. Internet advertising is no stranger to large numbers, but $2^{64n}$ is enormous. The set of keys that we will draw from this theoretical universe of keys is comparatively quite small. We find ourselves posed with a problem that looks very much like a natural fit for hashing!

A well-chosen hash function, operating at the heart of a solidly designed hash table, could allow us a big win on both the internal storage/representation front, as well as in the wild, freeing up space in client cookies, etc.

Paraphrasing Knuth, one should not choose a random hash function to generate a good hash table. As with any hashing task, there are three classical issues to consider:

• The size of the hash in terms of the number of bits of output needed to hit your collision (two distinct keys hashing to the same value) goals and remain within your storage constraints
• The distributions of hashes on your input data, and the related problem of collisions
• Computation time

Over the next several posts, I will be putting a number of hash functions through the wringer in an effort to identify a handful that perform well on our data.

## On Accuracy and Precision

A joint post from Matt and Ben

Believe it or not, we’ve been getting inspired by MP3s lately, and not by turning on music in the office. Instead, we drew a little bit of inspiration from the way MP3 encoding works. From Wikipedia:

“The compression works by reducing accuracy of certain parts of sound that are considered to be beyond the auditory resolution ability of most people. This method is commonly referred to as perceptual coding. It uses psychoacoustic models to discard or reduce precision of components less audible to human hearing, and then records the remaining information in an efficient manner.”

Very similarly, in online advertising there are signals that go “beyond the resolution of advertisers to action”. Rather than tackling the problem of clickstream analysis in the standard way, we’ve employed an MP3-like philosophy to storage. Instead of storing absolutely everything and counting it, we’ve employed a probabilistic, streaming approach to measurement. This lets us give clients real-time measurements of how many users and impressions a campaign has seen at excruciating levels of detail. The downside is that our reports tend to include numbers like “301M unique users last month” as opposed to “301,123,098 unique users last month”, but we believe that the benefits of this approach far outweigh the cost of limiting precision.

### Give a little, get a lot

The precision of our approach does not depend on the size of the thing we’re counting. When we set our precision to +/-1%, we can tell the difference between 1000 and 990 as easily as we can tell the difference between 30 billion and 29.7 billion users. For example, when we count the number of users a campaign reached in Wernersville, PA (Matt’s hometown) we can guarantee that we saw 1000 +/- 10 unique cookies, as well as saying the campaign reached 1 billion +/- 10M unique cookies overall.

Our storage size is fixed once we choose our level of precision. This means that we can accurately predict the amount of storage needed and our system has no problem coping with increases in data volume and scales preposterously well. Just to reiterate, it takes exactly as much space to count the number of users you reach in Wernersville as it does to count the total number of users you reach in North America. Contrast this with sampling, where to maintain a fixed precision when capturing long-tail features (things that don’t show up a lot relative to the rest of the data-set, like Wernersville) you need to drastically increase the size of your storage.
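To put a rough number on “fixed”, suppose the counter is a HyperLogLog-style sketch whose standard error is approximately 1.04/√b for b bins. The arithmetic below is only a sketch under that assumption; the 5-bit register width is a placeholder of mine, and the standard error is a typical deviation rather than a hard bound.

```python
import math


def bins_for_standard_error(target_error):
    """Smallest power-of-two bin count whose 1.04/sqrt(b) error meets the target."""
    needed = (1.04 / target_error) ** 2
    return 2 ** math.ceil(math.log2(needed))


def storage_bytes(n_bins, bits_per_bin=5):
    """Sketch size in bytes; bits_per_bin is an assumed register width."""
    return n_bins * bits_per_bin / 8


bins = bins_for_standard_error(0.01)
print(bins)                     # 16384 bins
print(1.04 / math.sqrt(bins))   # 0.008125, i.e. about 0.8% standard error
print(storage_bytes(bins))      # 10240.0 bytes, whatever the count being estimated
```

The number of users never appears in the calculation, which is the point: Wernersville and North America cost the same.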

The benefits of not having unexpected storage spikes, and scaling well are pretty obvious – fewer technical limits, fewer surprises, and lower costs for us, which directly translates to better value for our users and a more reliable product. A little bit of precision seems like a fair trade here.

The technique we chose supports set-operations. This lets us ask questions like, “how many unique users did I see from small towns in Pennsylvania today” and get an answer instantaneously by composing multiple data structures. Traditionally, the answers to questions like this have to be pre-computed, leaving you waiting for a long job to run every time you ask a question you haven’t prepared for. Fortunately, we can do these computations nearly instantaneously, so you can focus on digging into your data. You can try that small-town PA query again, but this time including Newton, MA (Ben’s hometown), and not worry that no one has prepared an answer.

Unfortunately, not all of these operations are subject to the same “nice” error bounds. However, we’ve put the time in to detect these errors, and make sure that the functionality our clients see degrades gracefully. And since our precision is tunable, we can always dial the precision up as necessary.

### Getting insight from data

Combined with our awesome streaming architecture this allows us to stop thinking about storage infrastructure as the limiting factor in analytics, similar to the way MP3 compression allows you to fit more and more music on your phone or MP3-player. When you throw the ability to have ad-hoc queries execute nearly instantly into the mix, we have no regrets about getting a little bit lossy. We’ve already had our fair share of internal revelations, and enabled clients to have quite a few of their own, just because it’s now just so easy to work with our data.

## Streaming Algorithms and Sketches

Here at Aggregate Knowledge we spend a lot of time thinking about how to do analytics on a massive amount of data. Rob recently posted about building our streaming datastore and the architecture that helps us deal with “big data”. Given a streaming architecture, the obvious question for the data scientist is “How do we fit in?”. Clearly we need to look towards streaming algorithms to match the speed and performance of our datastore.

A streaming algorithm is generally defined as one that has finite memory – significantly smaller than the data presented to it – and must process the input in one pass. Streaming algorithms start pretty simple, for instance counting the number of elements in the stream:

```python
counter = 0
for event in stream:
    counter += 1
```

While eventually `counter` will overflow (and you can be somewhat clever about avoiding that) this is way better than the non-streaming alternative.

```python
elements = list(stream)
counter = len(elements)
```

Pretty simple stuff. Even a novice programmer can tell you why the second method is way worse than the first. You can get more complicated and keep the same basic approach – computing the mean of a floating point number stream is almost as simple: keep around `counter` as above, and add a new variable, `total_sum += value_new`. Now that we’re feeling smart, what about the quantiles of the stream? Ah! Now that is harder.
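The streaming mean described above might look like the following minimal sketch. The function name `streaming_mean` and the choice to return NaN for an empty stream are mine.

```python
def streaming_mean(stream):
    """One pass, constant memory: keep only a count and a running sum."""
    counter = 0
    total_sum = 0.0
    for value_new in stream:
        counter += 1
        total_sum += value_new
    return total_sum / counter if counter else float("nan")


# Works on any iterable, including a generator that is never held in memory.
print(streaming_mean(x / 2 for x in range(1, 9)))  # 2.25
```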

While it may not be immediately obvious, you can prove (as Munro and Paterson did in 1980) that computing exact quantiles of a stream requires memory that is at least linear with respect to the size of the stream. So, we’re left approximating a solution to the quantiles problem. A first stab might be sampling where you keep every 1000th element. While this isn’t horrible, it has its downsides – if your stream is infinite, you’ll still run out of space. It’s a good thing there are much better solutions. One of the first and most elegant was proposed by Cormode and Muthukrishnan in 2003 where they introduce the Count-Min sketch data structure. (A nice reference for sketching data structures can be found here.)

Count-Min sketch works much like a bloom filter. You compose `k` empty tables and `k` hash functions. For each incoming element we simply hash it through each function and increment the appropriate element in the corresponding table. To find out how many times we have historically seen a particular element we simply hash our query and take the MINIMUM value that we find in the tables. In this way we limit the effects of hash collision, and clearly we balance the size of the Count-Min sketch with the accuracy we require for the final answer. Here’s how it works:
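In code, a bare-bones version of the structure might look like this. It is a sketch only: the table width, the value of `k`, and the trick of deriving the `k` hash functions by prefixing the row number to a SHA-256 input are assumptions for the example, not the construction from the paper.

```python
import hashlib


class CountMinSketch:
    def __init__(self, k=4, width=1024):
        self.k = k
        self.width = width
        self.tables = [[0] * width for _ in range(k)]  # k empty tables

    def _index(self, row, item):
        # One hash function per table, derived by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.k):
            self.tables[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the best guess.
        return min(self.tables[row][self._index(row, item)]
                   for row in range(self.k))


sketch = CountMinSketch()
for event in ["a", "b", "a", "a"]:
    sketch.add(event)
print(sketch.estimate("a"), sketch.estimate("b"))
```

Estimates never fall below the true count; widening the tables or adding more of them trades memory for fewer and smaller overestimates.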

The Count-Min sketch is an approximation to the histogram of the incoming data; in fact, it’s really only probabilistic when hashes collide. In order to compute quantiles we want to find the “mass” of the histogram above/below a certain point. Luckily, Count-Min sketches support range queries of the type “`select count(*) where val between 1 and x;`”. Now it is just a matter of finding the quantile of choice.

To actually find the quantiles is slightly tricky, but not that hard. You basically have to perform a binary search with the range queries. So to find the first decile value, and supposing you kept around the number of elements you have seen in the stream, you would binary search through values of x until the return count of the range query is 1/10 of the total count.
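The binary search can be sketched as follows. To keep the example self-contained, the range query is answered from an exact list of integer values; in practice `range_count` would be backed by the sketch’s range queries. The `quantile` signature, which takes the quantile as an integer fraction to avoid floating-point edge cases, is an assumption of mine.

```python
def quantile(range_count, total, numerator, denominator, lo, hi):
    """Smallest x in [lo, hi] with range_count(x) >= (numerator/denominator) * total.

    range_count(x) must return the number of stream values between lo and x.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if range_count(mid) * denominator >= numerator * total:
            hi = mid        # enough mass at or below mid: look lower
        else:
            lo = mid + 1    # not enough mass yet: look higher
    return lo


# Exact stand-in for "select count(*) where val between 1 and x".
values = list(range(1, 101))


def exact_range_count(x):
    return sum(1 for v in values if 1 <= v <= x)


first_decile = quantile(exact_range_count, len(values), 1, 10, lo=1, hi=100)
median = quantile(exact_range_count, len(values), 1, 2, lo=1, hi=100)
print(first_decile, median)  # 10 50
```

Each quantile costs only a logarithmic number of range queries in the size of the value domain, regardless of how long the stream has been running.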

Pretty neat, huh?