The intersection of two streams (of user ids) is a particularly important business need in the advertising industry. For instance, if you want to reach suburban moms but the cost of targeting those women on a particular inventory provider is too high, you might want to know about a cheaper inventory provider whose audience overlaps with the first provider. You may also want to know how heavily your car-purchaser audience overlaps with a certain metro area or a particular income range. These types of operations are critical for understanding where and how well you’re spending your advertising budget.
As we’ve seen before, HyperLogLog provides a time- and memory-efficient algorithm for estimating the number of distinct values in a stream. For the past two years, we’ve been using HLL at AK to do just that: count the number of unique users in a stream of ad impressions. Conveniently, HLL also supports the union operator (∪), allowing us to trivially estimate the distinct value count of any composition of streams without sacrificing precision or accuracy. This piqued our interest, because if we can “losslessly” compute the union of two streams and produce low-error cardinality estimates, then there’s a chance we can use that estimate along with the inclusion-exclusion principle to produce “directionally correct” cardinality estimates of the intersection of two streams. (To be clear, when I say “directionally correct” my criterion is “can an advertiser make a decision off of this number?”, or “can it point them in the right direction for further research?”. This often means that we can tolerate relative errors of up to 50%.)
The goals were:
- Get a grasp on the theoretical error bounds of intersections done with HLLs, and
- Come up with heuristic bounds around overlap, cardinality ratio, and the set cardinalities that could inform our usage of HLL intersections in the AK product.
Quick terminology review:
- If I have a set of integers A, I’m going to call the HLL representing it H_A.
- If I have HLLs H_A, H_B and their union H_A ∪ H_B, then I’m going to call the intersection cardinality estimate produced |H_A ∩ H_B|.
- Define the overlap between two sets as overlap(A, B) := |A ∩ B| / min(|A|, |B|).
- Define the cardinality ratio max(|A|, |B|) / min(|A|, |B|) as a shorthand for the relative cardinality of the two sets.
- We’ll represent the absolute error of an observation |H_A ∩ H_B| as Δ|H_A ∩ H_B|.
That should be enough for those following our recent posts, but for those just jumping in, check out Appendices A and B at the bottom of the post for a more thorough review of the set terminology and error terminology.
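For concreteness, the set-level quantities above can be written as a few one-line helpers. This is just a sketch in Python over plain `set`s (the HLL-specific operators come later in the post); the function names are mine:

```python
def overlap(a: set, b: set) -> float:
    """overlap(A, B) = |A ∩ B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b))

def cardinality_ratio(a: set, b: set) -> float:
    """max(|A|, |B|) / min(|A|, |B|): the relative cardinality of the two sets."""
    return max(len(a), len(b)) / min(len(a), len(b))

def relative_error(observed: float, exact: float) -> float:
    """|observed - exact| / exact, for a non-zero exact value."""
    return abs(observed - exact) / exact
```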
We fixed 16 overlap values (0.0001, 0.001, 0.01, 0.02, 0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0) and 12 set cardinalities (100M, 50M, 10M, 5M, 1M, 500K, 100K, 50K, 10K, 5K, 1K, 300) and did 100 runs of each permutation of (overlap, |A|, |B|). A random stream of 64-bit integers hashed with Murmur3 was used to create the two sets such that they shared exactly ⌊overlap × min(|A|, |B|)⌋ elements. We then built the corresponding HLLs H_A and H_B for those sets, calculated the intersection cardinality estimate |H_A ∩ H_B|, and computed its relative error.
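The set construction can be sketched as follows. This is a simplified stand-in for the actual harness: plain random 64-bit integers replace the Murmur3-hashed stream, and `make_sets` is a name I’m inventing here:

```python
import random

def make_sets(card_a: int, card_b: int, ov: float):
    """Build two sets of 64-bit ints sharing exactly
    floor(ov * min(card_a, card_b)) elements."""
    shared_n = int(ov * min(card_a, card_b))
    # Draw enough distinct 64-bit values for both sets minus the shared part.
    pool = set()
    while len(pool) < card_a + card_b - shared_n:
        pool.add(random.getrandbits(64))
    pool = list(pool)
    a = set(pool[:card_a])                    # includes the shared prefix
    b = set(pool[:shared_n] + pool[card_a:])  # shared prefix + the remainder
    return a, b
```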
Given that we could only generate and insert about 2M elements/second per core, doing runs with set cardinalities greater than 100M was quickly ruled out for this blog post. However, I can assure you the results hold for much larger sets (up to the multiple billion-element range) as long as you heed the advice below.
This first group of plots has a lot going on, so I’ll preface it by saying that it’s just here to give you a general feeling for what’s going on. First note that within each group of boxplots the register count increases from left to right (orange to green to pink), and within each plot the cardinality ratio increases from left to right. Also note that the y-axis (the relative error of the observation) is log-base-10 scale. You can clearly see that as the set sizes diverge, the error skyrockets for all but the most similar (in both cardinality and composition) sets. I’ve drawn in a horizontal line at 50% relative error to make it easier to see what falls under the “directionally correct” criterion. You can (and should) click for a full-sized image.
Note: the error bars narrow as we progress further to the right because there are fewer observations with very large cardinality ratios. This is an artifact of the experimental design.
A few things jump out immediately:
- For cardinality ratio > 500, the majority of observations have many thousands of percent error.
- When the cardinality ratio is smaller than that and the overlap is high, register count has little effect on error, which stays very low.
- When the overlap is very small, register count has little effect on error, which stays very high.
Just eyeballing this, the lesson I get is that computing intersection cardinality with small error (relative to the true value) is difficult and only works within certain constraints. Specifically,
- the overlap must not be too small, and
- the cardinality ratio must not be too large.
The intuition behind this is very simple: if the error of any one of the terms in your calculation is roughly as large as the true value of the result, then you’re not going to estimate that result well. Let’s look back at the intersection cardinality formula, |H_A ∩ H_B| = |H_A| + |H_B| − |H_A ∪ H_B|. The left-hand side (what we are trying to estimate) is usually a “small” value. The three terms on the right tend to be “large” (or at least “larger”) values. If any of the “large” terms has error as large as the left-hand side, we’re out of luck.
So, let’s say you can compute the cardinality of an HLL with relative error of a few percent. If |A| is two orders of magnitude smaller than |B|, then the error alone of |H_B| is roughly as large as |A|.
|A ∩ B| ≤ |A| by definition, so the error of |H_B| is roughly as large as |A ∩ B|.
In the best scenario, where A ∩ B = A, the errors of |H_B| and |H_A ∪ H_B| are both roughly the same size as what you’re trying to measure. Furthermore, even if |A| ≈ |B| but the overlap is very small, then |A ∩ B| will be roughly as large as the error of all three right-hand terms.
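To put hypothetical numbers on that argument (the 2% relative error and the set sizes here are illustrative, not measured):

```python
sigma = 0.02                 # assumed per-HLL relative error: "a few percent"
card_a = 1_000_000           # |A|
card_b = 100_000_000         # |B|, two orders of magnitude larger
true_intersection = card_a   # best case: A is a subset of B, so |A ∩ B| = |A|

error_b = sigma * card_b     # absolute error of |H_B| alone: about 2,000,000
# The error of this single right-hand term is already about twice the value
# we are trying to estimate, so the inclusion-exclusion result is mostly noise.
```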
On the bubble
Let’s throw out the permutations whose error bounds clearly don’t support “directionally correct” answers (very small overlap and very large cardinality ratio) and those that trivially do (large overlap with similar cardinalities) so we can focus more closely on the observations that are “on the bubble”. Sadly, these plots exhibit a good deal of variance inherent in their smaller sample size. Ideally we’d have tens of thousands of runs of each combination, but for now this rough sketch will hopefully be useful. (Apologies for the inconsistent colors between the two plots. It’s a real bear to coordinate these things in R.) Again, please click through for a larger, clearer image.
Doubling the number of registers cuts the variance of the relative error by about a quarter and moves the median relative error down (closer to zero) by 10-20 points.
In general, we’ve seen that the following cutoffs perform pretty well, practically. Note that some of these aren’t reflected too clearly in the plots because of the smaller sample sizes.
|Register Count|Data Structure Size|Overlap Cutoff|Cardinality Ratio Cutoff|
|---|---|---|---|
To get a theoretical formulation of the error envelope for intersection, in terms of the two set sizes and their overlap, I tried the first and simplest error propagation technique I learned. For independent variables X, Y, Z and a linear combination F = aX + bY + cZ of those variables, we have
σ²(F) = a²σ²(X) + b²σ²(Y) + c²σ²(Z)
Applied to the inclusion-exclusion formula:
σ²(|H_A ∩ H_B|) = σ²(|H_A|) + σ²(|H_B|) + σ²(|H_A ∪ H_B|) = (σ|A|)² + (σ|B|)² + (σ|A ∪ B|)²
where σ = 1.04/√m, as in section 4 (“Discussion”) of the HLL paper.
Aside: Clearly |H_A ∪ H_B| is not independent of |H_A| or |H_B|, though |H_A| is likely independent of |H_B|. However, I do not know how to a priori calculate the covariance in order to use an error propagation model for dependent variables. If you do, please pipe up in the comments!
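The propagated envelope can be computed directly; in this sketch, the function name and the width multiplier `k` are mine, and σ = 1.04/√m is the standard HLL relative-error constant:

```python
import math

def intersection_error_envelope(card_a, card_b, card_union, m, k=2.0):
    """k standard deviations of the propagated error of the inclusion-exclusion
    estimate, treating the three cardinality estimates as independent."""
    sigma = 1.04 / math.sqrt(m)  # per-HLL relative error for m registers
    return k * sigma * math.sqrt(card_a ** 2 + card_b ** 2 + card_union ** 2)
```

Note that the envelope tightens as 1/√m, so more registers buy you a proportionally narrower band.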
I’ve plotted this error envelope against the relative error of the observations from HLLs with 8192 registers (approximately 5kB data structure).
Despite the crudeness of the method, it provided a 95% error envelope for the data without significant differences across cardinality ratio or overlap. Specifically, at least 95% of observations satisfied
Δ|H_A ∩ H_B| ≤ 2 √(σ²(|H_A|) + σ²(|H_B|) + σ²(|H_A ∪ H_B|))
However, it’s really only useful in the ranges shown in the table above. Past those cutoffs the bound becomes too loose and isn’t very useful.
This is particularly convenient because you can tune the number of registers you need to allocate based on the most common intersection sizes/overlaps you see in your application. Obviously, I’d recommend everyone run these tests and do this analysis on their business data, and not on some contrived setup for a blog post. We’ve definitely seen that we can get away with far less memory usage than expected to successfully power our features, simply because we tuned and experimented with our most common use cases.
We hope to improve the intersection cardinality result by finding alternatives to the inclusion-exclusion formula. We’ve tried a few different approaches, mostly centered around the idea of treating the register collections themselves as sets, and in my next post we’ll dive into those and other failed attempts!
Appendix A: A Review Of Sets
Let’s say we have two streams of user ids, S_A and S_B. Take a snapshot of the unique elements in those streams as sets and call them A and B. In the standard notation, we’ll represent the cardinality, or number of elements, of each set as |A| and |B|.
Example: If A = {1, 2, 3} then |A| = 3.
If I wanted to represent the unique elements in both of those sets combined as another set I would be performing the union, which is represented by A ∪ B.
Example: If A = {1, 2, 3} and B = {3, 4, 5} then A ∪ B = {1, 2, 3, 4, 5}.
If I wanted to represent the unique elements that appear in both A and B I would be performing the intersection, which is represented by A ∩ B.
Example: With A, B as above, A ∩ B = {3}.
The relationship between the union’s cardinality and the intersection’s cardinality is given by the inclusion-exclusion principle. (We’ll only be looking at the two-set version in this post.) For reference, the two-way inclusion-exclusion formula is |A ∪ B| = |A| + |B| − |A ∩ B|.
Example: With A, B as above, we see that |A ∩ B| = 1 and |A ∪ B| = |A| + |B| − |A ∩ B| = 3 + 3 − 1 = 5.
For convenience we’ll define the overlap between two sets as overlap(A, B) := |A ∩ B| / min(|A|, |B|).
Example: With A, B as above, overlap(A, B) = |A ∩ B| / min(|A|, |B|) = 1 / min(3, 3) = 1/3.
Similarly, for convenience, we’ll define the cardinality ratio max(|A|, |B|) / min(|A|, |B|) as a shorthand for the relative cardinality of the two sets.
The examples and operators shown above are all relevant for exact, true values. However, HLLs do not provide exact answers to the set cardinality question. They offer estimates of the cardinality along with certain error guarantees about those estimates. In order to differentiate between the two, we introduce HLL-specific operators.
Consider a set A. Call the HLL constructed from this set’s elements H_A. The cardinality estimate given by the HLL algorithm for H_A is |H_A|.
Define the union of two HLLs H_A ∪ H_B := H_{A ∪ B}, which is also the same as the HLL created by taking the pairwise max of H_A‘s and H_B‘s registers.
Finally, define the intersection cardinality of two HLLs in the obvious way: |H_A ∩ H_B| := |H_A| + |H_B| − |H_A ∪ H_B|. (This is simply the inclusion-exclusion formula for two sets, with the cardinality estimates instead of the true values.)
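Putting the operators above together, here is a minimal, illustrative HyperLogLog in Python. It is a sketch of the algorithm rather than our production code: sha1 stands in for Murmur3, and there is no bias tuning beyond the standard small-range correction.

```python
import hashlib
import math

class HLL:
    """Minimal HyperLogLog sketch, for illustration only."""

    def __init__(self, p=12):
        self.p = p                      # log2 of the register count
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, m >= 128

    def _hash64(self, value):
        # Any good 64-bit hash works; the post used Murmur3, sha1 is a stand-in.
        return int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")

    def add(self, value):
        h = self._hash64(value)
        idx = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        rho = (64 - self.p) - rest.bit_length() + 1  # position of leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rho)

    def cardinality(self):
        """|H_A|: the raw HLL estimate with the small-range correction."""
        est = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)  # linear counting for small sets
        return est

    def union(self, other):
        """H_A ∪ H_B: the pairwise max of the two register arrays."""
        result = HLL(self.p)
        result.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return result

def intersection_estimate(ha, hb):
    """|H_A ∩ H_B| := |H_A| + |H_B| - |H_A ∪ H_B| (inclusion-exclusion)."""
    return ha.cardinality() + hb.cardinality() - ha.union(hb).cardinality()
```

Note that `union` is lossless with respect to the sketches themselves, which is exactly what makes the inclusion-exclusion approach possible at all.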
Appendix B: A (Very Brief) Review of Error
The simplest way of understanding the error of an estimate is to ask “how far is it from the truth?” That is, the difference in value between the estimate and the exact value, also known as the absolute error.
However, that’s only useful if you’re measuring a single thing over and over again. The primary criterion for judging the utility of HLL intersections is relative error, because we are trying to measure intersections of many different sizes. In order to get an apples-to-apples comparison of the efficacy of our method, we normalize the absolute error by the true size of the intersection. So, for some observation o whose exact value is a non-zero e, we say that the relative error of the observation is |o − e| / e. That is, “by what percentage is the observation off from the true value?”
Example: If |A| = 100 and |H_A| = 90, then the absolute error is 10 and the relative error is 10/100 = 0.1 (or 10%).