## HLL talk at SFPUG

I had the pleasure of speaking at the SF PostgreSQL User Group’s meetup tonight about sketching, the history of HLL, and our implementation of HLL as a PG extension. My slides are embedded below and you can get a PDF copy here. Be sure to click the gear below to show speaker’s notes for context!

If video is made available, I’ll post an update with a link!

## Hitting the Books: EADS Summer School on Hashing

Rob, Matt, and I just wrapped up our trip to Copenhagen for the EADS Summer School on Hashing at the University of Copenhagen and it was a blast! The lineup of speakers was, simply put, unbeatable: Rasmus Pagh, Graham Cormode, Michael Mitzenmacher, Mikkel Thorup, Alex Andoni, Haim Kaplan, John Langford, and Suresh Venkatasubramanian. There’s a good chance that any paper you’ve read on hashing, sketching, or streaming has one of them as a co-author or is heavily influenced by their work. The format was three hour-and-a-half lectures for four days, with exercises presented at the end of each lecture. (Slides can be found here. They might also post videos. UPDATE: They’ve posted videos!)

Despite the depth of the material, almost all of it was approachable with some experience in undergraduate math and computer science. We would strongly recommend both of Michael Mitzenmacher’s talks (1, 2) for an excellent overview of Bloom Filters and Cuckoo hashing that are, in my opinion, significantly better and more in depth than any other out there. Specifically, the Bloom Filter talk presents very elegantly the continuum of Bloom Filter to Counting Bloom Filter to Count-Min Sketch (with “conservative update”) to the Stragglers Problem and Invertible Bloom Filters to, finally, extremely recent work called Odd Sketches.

Similarly, Mikkel Thorup’s two talks on hashing (1, 2) do a very thorough job of examining the hows and whys of integer hashing, all the way from the simplest multiply-mod-prime schemes all the way to modern work on tabulation hashing. And if you haven’t heard of tabulation hashing, and specifically twisted tabulation hashing, get on that because (1) it’s amazing that it doesn’t produce trash given how simple it is, (2) it’s unbelievably fast, and (3) it has been proven to provide the guarantees required for almost all of the interesting topics we’ve discussed on the blog in the past: Bloom Filters, Count-Min sketch, HyperLogLog, chaining/linear-probing/cuckoo hash tables, and so on. We really appreciated how much attention Mikkel devoted to practicality of implementation and to empirical performance when discussing hashing algorithms. It warms our heart to hear a leading academic in this field tout the number of nanoseconds it takes to hash an item as vocally as the elegance of the proof behind it!

We love this “Summer School” format because it delivers the accumulated didactic insight of the field’s top researchers and educators to both old techniques and brand new ones. (And we hope by now that everyone reading our blog appreciates how obsessed we are with teaching and clarifying interesting algorithms and data structures!) Usually most of this insight (into origins, creative process, stumbling blocks, intermediate results, inspiration, etc.) only comes out in conversation or lectures, and even worse is withheld or elided at publishing time for the sake of “clarity” or “elegance”, which is a baffling rationale given how useful these “notes in the margin” have been to us. The longer format of the lectures really allowed for useful “digressions” into the history or inspiration for topics or proofs, which is a huge contrast to the 10-minute presentations given at a conference like SODA. (Yes, obviously the objective of SODA is to show a much greater breadth of work, but it really makes it hard to explain or digest the context of new work.)

In much the same way, the length of the program really gave us the opportunity to have great conversations with the speakers and attendees between sessions and over dinner. We can’t emphasize this enough: if your ambition to is implement and understand cutting edge algorithms and data structures then the best bang for your buck is to get out there and meet the researchers in person. We’re incredibly lucky to call most of the speakers our friends and to regularly trade notes and get pointers to new research. They have helped us time and again when we’ve been baffled by inconsistent terminology or had a hunch that two pieces of research were “basically saying the same thing”. Unsurprisingly, they are also the best group of people to talk to when it comes to understanding how to foster a culture of successful research. For instance, Mikkel has a great article on how to systematically encourage and reward research article that appears in the March 2013 issue of CACM (pay-wall’d). Also worthwhile is his guest post on Bertrand Meyer’s blog.

If Mikkel decides to host another one of these, we cannot recommend attending enough. (Did we mention it was free?!) Thanks again Mikkel, Rasmus, Graham, Alex, Michael, Haim, and John for organizing such a great program and lecturing so eloquently!

## Sketch of the Day: Frugal Streaming

We are always worried about the size of our sketches. At AK we like to count stuff, and if you can count stuff with smaller sketches then you can count more stuff! We constantly have conversations about how we might be able to make our sketches, and subsequently our datastores, smaller. During our science summit, Muthu pointed us at some of the new work in Frugal Streaming. The concept of Frugal Streaming is to process your data as it comes, O(N), but severely limit the amount of memory you are using for your sketch, and by “severely” they mean using perhaps one or two pieces of information. Needless to say, we were interested!

### History

The concept of Frugal Streaming reminds me of an algorithm for finding the dominant item in a stream, MJRTY written by Boyer and Moore in 1980. (This paper has a very interesting history). The MJRTY algorithm sets out to solve the problem of finding the majority element in a stream (an element comprising more than 50% of the stream). Moore proposed to solve this by using only 2 pieces of information and a single scan of the data.

Imagine you have a stream of names (“matt”, “timon”, “matt”, “matt”, “rob”, “ben”, …) and you wanted to know if any name appeared in more than half the stream. Boyer and Moore proposed the following:

count = 0
majority = ""

for val in stream:
if count == 0:
majority = val
count = 1
elif val == majority:
count += 1
else:
count -= 1

print majority if count > 0 else "no majority!"

If you’re anything like me you will go through a few phases: “That can’t work!”, “Wait, that works?!”, “Nah, this data will break that”, “Holy sh*t that works!!”. If you think about it for a minute you realize that it HAS to work. If any element in the stream comprises more than half the stream values there is no way to get to the end and have a counter of zero. To convince yourself suppose the majority element only comprises the first half + 1 of your N-length stream. The counter would count up to $N/2+1$ and then start decrementing until all N values have been seen, which would leave the counter at $2 = (N/2+1) - (N/2-1)$*. This will hold regardless of the ordering of the elements in the stream. A simple simulation is provided by the authors. Philippe Flajolet apparently “liked this algorithm very much and called it the ‘gang war’, because in his mind, every element is a gang member, and members of opposite gangs are paired in a standoff, and shoot each other. The gang members remaining are the more numerous”**.

The astute among you will have noticed that this algorithm only works if there is, in fact, a majority element. If there is not then it will fail. A stream such as {“matt”,”matt”,”timon”,”timon”,”rob”} would result in the algorithm returning “rob”. In practice you need ways of ensuring that your stream does indeed have a majority element or have a guarantee ahead of time.

* Note, that formula is for an even length stream. For a stream of odd length the counter will indeed be at 1. Proof is left to the reader.

** Thanks to Jeremie Lumbroso for his insightful feedback on the history of MJRTY and his memory of Philippe’s explanation.

### One “bit” – Frugal-1U

In their Frugal Streaming paper, Qiang and Muthu decided to see if they could find a frugal solution to the streaming quantile problem. The streaming quantiles problem is one I enjoy quite a bit and I have used it as an interview question for potential candidates for some time. Simply stated it is: “How would you find the quantiles of a stream of numbers in O(N) with limited memory?” There are a few different approaches to solve this problem, with the most widely known probably being Count-Min Sketch. However, with Count-Min you need to keep your keys around in order to query the sketch. There are other approaches to this question as well.

Instead of focusing on quantiles generally, Qiang and Muthu’s first solution is a frugal way to find the median of a stream. As with MJRTY above, I’ll just write it down and see how you react:

median_est = 0
for val in stream:
if val > median_est:
median_est += 1
elif val < median_est:
median_est -= 1

Granted, the above is just for the median, where the stream is much longer than the value of the median, but it is so simple that I don’t think I would have ever considered this approach to the problem. The extension to quantiles is equally as shocking. If you are trying to find the 75th percentile of the data stream you do the same as above but increment up randomly 75% of the time and decrement down randomly 25% of the time:

quantile_75 = 0
for val in stream:
r = random()
if val > quantile_75 and r > 1 - 0.75:
quantile_75 += 1
elif val < quantile_75 and r > 0.75:
quantile_75 -= 1

As the paper states:

The trick to generalize median estimation to any $\frac{h}{k}$ -quantile estimation is that not every stream item seen will cause an update. If the current stream item is larger than estimation, an increment update will be triggered only with probability $\frac{h}{k}$. The rationale behind it is that if we are estimating $\frac{h}{k}$ -quantile, and if the current estimate is at stream’s true $\frac{h}{k}$ -quantile, we will expect to see stream items larger than the current estimate with probability $1-\frac{h}{k}$ .

### Finding Quantiles With Two “bits”- Frugal-2U

There are a few obvious drawbacks to the above algorithm. Since we are only incrementing by 1 each time, the time to converge is linear and our initial guess of zero could be very bad. Secondly, and by design, the algorithm has no memory, can fluctuate wildly and, as they show in the paper, the estimates can drift very far away. (Note: while it is extremely unlikely that the estimates will drift far from the correct values the paper has some very loose bounds on how bad it can drift. See Lemma 3 in the paper.) They suggest a few improvements over Frugal-1U where you essentially include a varying step (instead of just incrementing by 1 every time) and 1 “bit” so you know which way you incremented in the last update. The intuition here is that if you have been consistently incrementing up or down for the last few elements of a stream then you are probably “far” away from the quantile in question. If this is the case we can speed up convergence time by incrementing a larger amount. The Frugal-2U algorithm:

def frugal_2u(stream, m = 0, q = 0.5, f = constantly_one):
step, sign = 1, 1

for item in stream:
if item > m and random() > 1 - q:
# Increment the step size if and only if the estimate keeps moving in
# the same direction. Step size is incremented by the result of applying
# the specified step function to the previous step size.
step += f(step) if sign > 0 else -1 * f(step)
# Increment the estimate by step size if step is positive. Otherwise,
# increment the step size by one.
m += step if step > 0 else 1
# Mark that the estimate increased this step
sign = 1
# If the estimate overshot the item in the stream, pull the estimate back
# and re-adjust the step size.
if m > item:
step += (item - m)
m = item
# If the item is less than the stream, follow all of the same steps as
# above, with signs reversed.
elif item < m and random() > q:
step += f(step) if sign < 0 else -1 * f(step)
m -= step if step > 0 else 1
sign = -1
if m < item:
step += (m - item)
m = item
# Damp down the step size to avoid oscillation.
if (m - item) * sign < 0 and step > 1:
step = 1

You can play around with the 1U and 2U algorithms in the simulation below.

Click above to run the Frugal Streaming simulation

As usual, there are a few interesting tidbits as well. If you read the paper you will see that they define the updates to step as a function. This means they are allowing many different types of increments to step. For example, instead of increasing the size of step by 1 each time we could increase it by 10 or even have it increase multiplicatively. They do talk about some uses of different updates to step but there is no analysis around this (yet) and they restrict all of the work in the paper to step increments of 1. We offer a few different step update functions in the simulation and they indeed do interesting things. Further exploration is definitely needed to get some insights here.

A non-obvious thing about the step variable is how it behaves under decrements. My initial thought was that step would get large if your current estimate was far below the actual value (thus allowing you to approach it faster from below) and that step would get to be a large negative number if your current estimate was very far above the actual value. This turns out to just be flat wrong. The negative updates to step have the effect of stabilizing the estimate (notice when step is negative that the updates to your estimates are always ± 1 ). If you read the algorithm closely you will see that step decrements when you consistently alternate up and down updates. This behavior occurs when the estimate is close to the actual value which causes step to become a very large negative number. And I mean VERY large. In practice we have seen this number as small as $-10^{102}$ for some simulations.

### Monitoring

One of the first things I thought of when I saw this sketch was to use it as a monitoring tool. Specifically, perhaps it could be used to replace the monitoring we use on our application server response times. It turns out that using this sketch for monitoring introduces some very interesting issues. So far, we have mostly talked about what I will call “static streams”. These are streams that have data in them which is pulled consistently from some static underlying distribution (as in the examples above). However, what if the underlying distribution changes? For example, what if one of our servers all of the sudden starts responding with slower response times? Does this frugal sketch enable you to quickly figure out that something has broken and fire off an alarm with high confidence? Well, in some ways this is an identical problem to cold start: how long does it take for an initial guess to reach a stable quantile? Unfortunately, there is no easy way to know when you have reached “equilibrium” and this makes identifying when an underlying distribution change has occurred difficult as well. The paper ends with an open challenge:

… as our experiments and insights indicate, frugal streaming algorithms work with so little memory of the past that they are adaptable to changes in the stream characteristics. It will be of great interest to understand this phenomenon better.

The paper shows some interesting experiments with changing data, but I agree with Qiang: we need to understand these algorithms better before I swap out all of our server monitoring tools (and our ops team would kill me). However, since these are so simple to implement and so small, we can easily deploy a few tests and see how the results compare “in the field” (you can play around with this by changing the underlying stream distribution in our simulation above.)

### Conclusion

The frugal quantile algorithms proposed in the paper are fascinating. It is a very interesting turn in the sketching literature and Qiang and Muthu’s creativity really comes across. I am very interested in getting some real world uses out of this sketch and am excited to see what other applications we (and Qiang!) can think of.  Many thanks to MuthuQiang Ma and Jeremie Lumbroso for all their help!

### Appendix

• Variability: While the bounds on the accuracy of these algorithms seem wide to me, clearly in real world experiments we see much better performance than the guaranteed bounds in the paper. In fact, the error bounds in the paper are dependent on the size of the stream and not fixed.
• Size of step: A potential gotcha is the size of the step variable. Depending on your update function it indeed seems possible for this value to get below -MAXINT. Clearly a real implementation would need some error checking.
• Cold Start: One more gotcha is that you have no real way of knowing when you are near the quantile in question. For instance, starting your estimate at zero and measuring a true median which is 100,000,000,000 will take a long time to “catch up” to this value. There are a few ways to limit this, some of which are discussed in the paper. One way is to try and make a sane guess up front (especially if you know something about the distribution) and another is to start your guess with the value from the last counter. For instance, suppose you are in a monitoring situation and you are computing means over the course of a day. It would make sense to start tomorrow’s estimate with yesterdays final value.
• Accuracy:  And, lastly, there is some interesting dependence on “atomicity”. Meaning, the estimates in some sense depend on how “large” the actual values are. In both, your minimum step size is 1. If my median in the stream is, say, 6 then this “atomic” update of 1 causes high relative fluctuation. It seems in real world examples you would like to balance the size of the thing you are estimating with the speed of approach of the algorithm. This could lead you to estimate latencies in microseconds rather than milliseconds for instance. All of this points to the fact that there are a bunch of real world engineering questions that need to be answered and thought about before you actually go and implement a million frugal sketches throughout your organization.

## Doubling the Size of an HLL Dynamically – Extra Bits…

Author’s Note: This post is related to a few previous posts on the HyperLogLog algorithm.  See Matt’s overview of the algorithm, and see this for an overview of “folding” or shrinking HLLs in order to perform set operations. It is also the final post in a series on doubling the number of bins in HLLs. The first post dealt with the recovery time after doubling, and the second dealt with doubling’s accuracy when taking unions of two HLLs.

### Introduction

The main draw to the HyperLogLog algorithm is its ability to make accurate cardinality estimates using small, fixed memory.  In practice, there are two choices a user makes which determine how much memory the algorithm will use: the number of registers (bins) and the size of each register (how high they can count).  As Timon discussed previously, increasing the size of each register will only increase the accuracy if the true cardinality of the stream is HUGE.

Recall that HyperLogLog (and most other streaming algorithms) is designed to work with a fixed number of registers, $m$, which is chosen as a function of the expected cardinality to approximate. We track a great number of different cardinality streams and in this context it is useful for us to not have one fixed value of $m$, but to have this evolve with the needs of a given estimation.

We are thus confronted with many engineering problems, some of which we have already discussed. In particular, one problem is that the neat feature of sketches, namely that they allow for an estimate of the cardinality of the union of multiple streams at no cost, depends on having sketches of the same size.

We’ve discussed how to get around this by folding HLLs, though with some increase in error. We’ve also explored a few options on how to effectively perform a doubling procedure. However, we started to wonder if any improvements could be made by using just a small amount of extra memory, say an extra bit for each register. In this post we will discuss one such idea and its use in doubling. Note: we don’t talk about quadrupling or more. We limit ourselves to the situation where HLL sketches only differ in $m$‘s by 1.

### The Setup

One of the downfalls in doubling is that it there is no way to know, after doubling, whether a value belongs in its bin or its partner bin. Recall that a “partner bin” is the register that could have been used had our “prefix” (the portion of the hashed value which is used to decide which register to update) been one bit longer. If the binary representation of the bin index used only two bits of the hashed value, e.g. $01$, then in an HLL that used a three-bit index, the same hashed value could have been placed in the bin whose index is either $101$ or $001$. Since $001$ and $01$ are the same number, we call $101$ the “partner bin”. (See the “Key Processing” section in Set Operations On HLLs of Different Sizes).

Consider an example where we have an HLL with $2^{10}$ bins.The $k^{th}$ bin has the value 7 in it, and after doubling we guess that its partner bin, at index $(2^{10} + k)^{th}$, should have a 5 in it. It is equally likely that the $k^{th}$ bin should have the 5 in it and the $(2^{10}+k)^{th}$ bin should have the 7 in it (since the “missing” prefix bit could have been a 1 or a 0)! Certainly the arrangement doesn’t change the basic cardinality estimate, but once we start getting involved with unions, the arrangement can make a very large difference.

To see how drastic the consequences can be, let’s look at a simple example. Suppose we start with an HLL with 2 bins and get the value 6 in each of its bins. Then we run the doubling procedure and decide that the partner bins should both have 1’s in them. With this information, it is equally likely that both of the arrangements below, “A” and “B”, could be the “true” larger HLL.

Further suppose we have some other data with which we wish to estimate the union. Below, I’ve diagrammed what happens when we take the union.

Arrangement A leads to a cardinality estimate (of the union) of about 12 and Arrangement B leads to a cardinality estimate (of the union) of about 122. This is an order of magnitude different! Obviously not all cases are this bad, but this example is instructive. It tells us that knowing the true location of each value is very important. We’ve attempted to improve our doubling estimate by keeping an extra bit of information as we will describe below.

### The Algorithm

Suppose we have an HLL with $m$ bins. Let’s keep another array of data which holds $m$ total bits, one for each bin — we will call these the “Cached Values.” For each bin, we keep a 0 if the value truly belongs in the bin in which it was placed (i.e. if, had we run an HLL with $2m$ bins, the value would have been placed in the first $m$ bins in the HLL), and we keep a 1 if the value truly belongs in the partner bin of the one in which it was placed (i.e. if, had we run an HLL with $2m$ bins, the value would have been placed in the last $m$ bins). See the image below for an example. Here we see two HLLs which have processed the same data. The one on the left is half the size and collects the cached values as it runs on the data. The one on the right is simply the usual HLL algorithm run on the same data.

Looking at the first row of the small HLL (with $m$ bins), the $0$ cache value means that the 2 “belongs” in the top half of the large HLL, i.e. if we had processed the stream using a larger HLL the 2 would be in the same register. Essentially this cached bit allows you to know exactly where the largest value in a bin was located in the larger HLL (if the $i^{th}$ bin has value $V$ and cached value $S$, we place the value $V$ in the $S * 2^{\log{m}} + i$ = $(S\cdot m + i)^{th}$ bin).

In practice, when we double, we populate the doubled HLL first with the (now correct location) bin values from the original HLL then we fill the remaining bins by using our “Proportion Doubling” algorithm.

Before we begin looking at the algorithm’s performance, let’s think about how much extra space this requires. In our new algorithm, notice that for each bin, we keep around either a zero or a one as its cached value. Hence, we require only one extra bit per bin to accommodate the cached values. Our implementation of HLL requires 5 bits per bin, since we want to be able to include values up to $2^5 -1= 31$ in our bins. Thus, a standard HLL with $m$ bins, requires $5m$ bits. Hence, this algorithm requires $5m + m = 6m$ bits (with the extra $m$ bins representing the cached values). This implies that this sketch requires 20% more space.

### The Data

Recall in the last post in this series, we explored doubling with two main strategies: Random Estimate (RE) and Proportion Doubling (PD). We did the same here, though using the additional information from this cached bit. We want to know a few things:

• Does doubling using a cache bit work? i.e. is it better to fold the bigger one or double the smaller one when comparing HLL’s of different sizes?
• Does adding in a cache bit change which doubling strategy is preferred (RE or PD)?
• Does the error in union estimate depend on intersection size as we have seen in the past?

Is it better to double or fold?

For each experiment we took 2 sets of data (each generated from 200k random keys) and estimated the intersection size between them using varying methods.

• “Folded”: estimate by filling up an HLL with $log_{2}(m) = 10$ and  comparing it to a folded HLL starting from $log_{2}(m) = 11$ and folded down $log_{2}(m) = 10$
• “Large”: estimate by using two HLL’s of a larger HLL of $log_{2}(m) = 11$.  This is effectively a lower bound for our doubling approaches.
• “Doubled – PD”: estimate by taking an HLL of $log_{2}(m) = 10$ and double it up to $log_{2}(m) = 11$ using the Proportion Doubling strategy.  Once this larger HLL is approximated we estimate the intersection with another HLL of native size $log_{2}(m) = 11$
• “Doubled – RE”: estimate by taking an HLL of $log_{2}(m) = 10$ and doubling up to $log_{2}(m) = 11$ using Random Estimate strategy.

We performed an experiment 300 times at varying intersection sizes from 0 up to 200k (100%) overlapping elements between sets (in steps of 10k). The plots below show our results (and extrapolate between points).

The graph of the mean error looks pretty bad for Random Estimate doubling. Again we see that the error depends heavily on the intersection size and becomes more biased as the set’s overlap more. On the other hand, Proportion Doubling was much more successful  (recall that this strategy forces the proportion of bins in the to-be-doubled HLL and the HLL with which we will union it to be equal before and after doubling.)  It’s possible there is some error bias with small intersections but we would need to run more trials to know for sure. As expected, the “Folded” and the “Large” are centered around zero. But what about the spread of the error?

The Proportion Doubling strategy looks great! In my last post on this subject, we found that this doubling strategy (without the cached part) really only worked well in the large intersection size regime, but here, with the extra cache bits, we seem to avoid that. Certainly the large intersection regime is where the standard deviation is lowest, but for every intersection size, it is significantly lower than that of the smaller HLL. This suggests that one of our largest sources of error when we use doubling in conjunction with unions is related to our lack of knowledge of the arrangement of the bins (i.e. when doubling, we do not know which of the two partner bins gets the larger, observed value). So it appears that the strategy of keeping cache bits around does indeed work, provided you use a decent doubling scheme.

Interestingly, it is always much better to double a smaller cache HLL than to fold a larger HLL when comparing sketches of different sizes. This is represented above by the lower error of the doubled HLL than the small HLL. The error bounds do seem to depend on the size of the intersection between the two sets but this will require more work to really understand how, especially in the case of Proportion Doubling.

Notes:  In this work we focus solely on doubling a HLL sketch and then immediately using this new structure to compute set operations. It would be interesting to see if set operation accuracy changes as a doubled HLL goes through its “recovery” period under varying doubling methods. It is our assumption that nothing out of the ordinary would come of this, but we definitely could be wrong. We will leave this as an exercise for the reader.

### Summary

We’ve found an interesting way of trading space for accuracy with this cached bit method, but there are certainly other ways of using an extra bit or two (per bucket). For instance, we could keep more information about the distribution of each bin by keeping a bit indicating whether or not the bin’s value minus one has been seen. (If the value is $k$, keep track of whether $k-1$ has shown up.)

We should be able to use any extra piece of information about the distribution or position of the data to help us obtain a more accurate estimate. Certainly, there are a myriad of other ideas ways of storing a bit or two of extra information per bin in order to gain a little leverage — it’s just a matter of figuring out what works best. We’ll be messing around more with this in the coming weeks, so if you have any ideas of what would work best, let us know in the comments!

(P.S. A lot of our recent work has been inspired by Flajolet et al.’s paper on PCSA – check out our post on this here!)

Thanks to Jeremie Lumbroso for his kind input on this post. We are much indebted to him and hopefully you will see more from our collaboration.

## Sketch of the Day: Probabilistic Counting with Stochastic Averaging (PCSA)

Before there was LogLog, SuperLogLog or HyperLogLog there was Probabilistic Counting with Stochastic Averaging (PCSA) from the seminal work “Probabilistic Counting Algorithms for Data Base Applications” (also known as the “FM Sketches” due to its two authors, Flajolet and Martin). The basis of PCSA matches that of the other Flajolet distinct value (DV) counters: hash values from a collection into binary strings, use patterns in those strings as indicators for the number of distinct values in that collection (bit-pattern observables), then use stochastic averaging to combine m trials into a better estimate. Our HyperLogLog post has more details on these estimators as well as stochastic averaging.

### Observables

The choice of observable pattern in PCSA comes from the knowledge that in a collection of randomly generated binary strings, the following probabilities occur:

\begin{aligned} &P( ..... 1) &= 2^{-1} \\ &P( .... 10) &= 2^{-2} \\ &P( ... 100) &= 2^{-3} \end{aligned}

$\vdots$

$P(...10^{k-1}) = 2^{-k}$

For each value added to the DV counter, a suitable hash is created and the position of the least-significant (right-most) 1 is determined. The corresponding position in a bitmap is updated and stored. I’ve created the simulation below so that you can get a feel for how this plays out.

Click above to run the bit-pattern simulation

(All bit representations in this post are numbered from 0 (the least-significant bit) on the right. This is the opposite of the direction in which they’re represented in the paper.)

Run the simulation a few times and notice how the bitmap is filled in. In particular, notice that it doesn’t necessarily fill in from the right side to the left — there are gaps that exist for a time that eventually get filled in. As the cardinality increases there will be a block of 1s on the right (the high probability slots), a block of 0s on the left (the low probability slots) and a “fringe” (as Flajolet et al. called it) of 1s and 0s in the middle.

I added a small pointer below the bitmap in the simulation to show how the cardinality corresponds to the expected bit position (based on the above probabilities). Notice what Flajolet et al. saw when they ran this same experiment: the least-significant (right-most) 0 is a pretty good estimator for the cardinality! In fact when you run multiple trials you see that this least-significant 0 for a given cardinality has a narrow distribution. When you combine the results with stochastic averaging it leads to a small relative error of $0.78 / \sqrt{m}$ and estimates the cardinality of the set quite well. You might have also observed that the most-significant (left-most) 1 can also be used for an estimator for the cardinality but it isn’t as clear-cut. This value is exactly the observable used in LogLog, SuperLogLog and HyperLogLog and does in fact lead to the larger relative error of $1.04 / \sqrt{m}$ (in the case of HLL).

### Algorithm

The PCSA algorithm is elegant in its simplicity:

m = 2^b # with b in [4...16]

bitmaps = [[0]*32]*m # initialize m 32bit wide bitmaps to 0s

##############################################################################################
# Construct the PCSA bitmaps
for h in hashed(data):
bitmap_index = 1 + get_bitmap_index( h,b ) # binary address of the rightmost b bits
run_length = run_of_zeros( h,b ) # length of the run of zeroes starting at bit b+1
bitmaps[ bitmap_index ][ run_length ] = 1 # set the bitmap bit based on the run length observed

##############################################################################################
# Determine the cardinality
phi = 0.77351
DV = m / phi * 2 ^ (sum( least_sig_bit( bitmap ) ) / m) # the DV estimate

Stochastic averaging is accomplished via the arithmetic mean.

You can see PCSA in action by clicking on the image below.

Click above to run the PCSA simulation

There is one point to note about PCSA and small cardinalities: Flajolet et al. mention that there are “initial nonlinearities” in the algorithm which result in poor estimation at small cardinalities ($n/m \approx 10 \, \text{to} \, 20$) which can be dealt with by introducing corrections but they leave it as an exercise for the reader to determine what those corrections are. Scheuermann et al. did the leg work in “Near-Optimal Compression of Probabilistic Counting Sketches for Networking Applications” and came up with a small correction term (see equation 6). Another approach is to simply use the linear (ball-bin) counting introduced in the HLL paper.

### Set Operations

Just like HLL and KMV, unions are trivial to compute and lossless. The PCSA sketch is essentially a “marker” for runs of zeroes, so to perform a union you merely bit-wise OR the two sets of bitmaps together. Folding a PCSA down to a smaller m works the same way as HLL but instead of HLL’s max you bit-wise OR the bitmaps together. Unfortunately for intersections you have the same issue as HLL, you must perform them using the inclusion/exclusion principle. We haven’t done the plots on intersection errors for PCSA but you can imagine they are similar to HLL (and have the benefit of the better relative error $0.78 / \sqrt{m}$).

### PCSA vs. HLL

That fact that PCSA has a better relative error than HyperLogLog with the same number of registers ($1.04 / 0.78 \approx 1.33$) is slightly deceiving in that $m$ (the number of stored observations) are different sizes. A better way to look at it is to fix the accuracy of the sketches and see how they compare. If we would like to have the same relative error from both sketches we can see that the relationship between registers is:

$\text{PCSA}_{RE} = \text{HLL}_{RE}$

$\dfrac{0.78}{\sqrt{m_{\scriptscriptstyle PCSA}}} = \dfrac{1.04}{\sqrt{m_{\scriptscriptstyle HLL}}}$

$m_{\scriptscriptstyle PCSA} = \left( \dfrac{0.78}{1.04} \right)^2 m_{\scriptscriptstyle HLL} \approx 0.563 \ m_{\scriptscriptstyle HLL}$

Interestingly, PCSA only needs a little more than half the registers of an HLL to reach the same relative error. But this is also deceiving. What we should be asking is what is the size of each sketch if they provide the same relative error? HLL commonly uses a register width of 5 bits to count to billions whereas PCSA requires 32 bits. That means a PCSA sketch with the same accuracy as an HLL would be:

\begin{aligned} \text{Size of PCSA} &= 32 \text{bits} \ m_{\scriptscriptstyle PCSA} = 32 \text{bits} \, ( 0.563 \ m_{\scriptscriptstyle HLL} ) \\ \\ \text{Size of HLL} &= 5 \text{bits} \ m_{\scriptscriptstyle HLL} \end{aligned}

Therefore,

$\dfrac{\text{Size of PCSA}} {\text{Size of HLL}} = \dfrac{32 \text{bits} \, ( 0.563 \ m_{\scriptscriptstyle HLL} )}{5 \text{bits} \ m_{\scriptscriptstyle HLL}} \approx 3.6$

A PCSA sketch with the same accuracy is 3.6 times larger than HLL!

### Optimizations

But what if you could make PCSA smaller by reducing the size of the bitmaps? Near the end of the paper in the Scrolling section, Flajolet et al. bring up the point that you can make the bitmaps take up less space. From the simulation you can observe that with a high probability there is a block of consecutive 1s on the right side of the bitmap and a block of consecutive 0s on the left side of the bitmap with a fringe in between. If one found the “global fringe” — that is the region defined by the left-most 1 and right-most 0 across all bitmaps — then only those bits need to be stored (along with an offset value). The authors theorized that a fringe width of 8 bits would be sufficient (though they fail to mention if there are any dependencies on the number of distinct values counted). We put this to the test.

In our simulations it appears that a fringe width of 12 bits is necessary to provide an unbiased estimator comparable to full-fringe PCSA (32-bit) for the range of distinct values we analyzed. (Notice the consistent bias of smaller fringe sizes.) There are many interesting reasons that this “fringe” concept can fail. Look at the notes to this post for more. If we take the above math and update 32 to 12 bits per register (and include the 32 bit offset value) we get:

\begin{aligned} \dfrac{\text{Size of PCSA}} {\text{Size of HLL}} &= \dfrac{12 \text{bits} \, ( 0.563 \ m_{\scriptscriptstyle HLL} ) + 32\text{bits}} {5 \text{bits} \cdot m_{\scriptscriptstyle HLL}} \\ \\ &= \dfrac{12 \text{bits} \, ( 0.563 \ m_{\scriptscriptstyle HLL} )} {5 \text{bits} \cdot m_{\scriptscriptstyle HLL}} + \dfrac{32\text{bits}} {5 \text{bits} \cdot m_{\scriptscriptstyle HLL}} \\ \\ &\approx 1.35 \text{ (for }m\gg64 \text{)} \end{aligned}

This is getting much closer to HLL! The combination of tighter bounds on the estimate and the fact that the fringe isn’t really that wide in practice result in PCSA being very close to the size of the much lauded HLL. This got us thinking about further compression techniques for PCSA. After all, we only need to get the sketch about 1/3 smaller to be comparable in size to HLL. In a future post we will talk about what happens if you Huffman code the PCSA bitmaps and the tradeoffs you make when you do this.

### Summary

PCSA provides for all of the goodness of HLL: very fast updates making it suitable for real-time use, small footprint compared to the information that it provides, tunable accuracy and unions. The fact that it has a much better relative error per register than HLL indicates that it should get more credit than it does. Unfortunately, each bitmap in PCSA requires more space than HLL and you still get less accuracy per bit. Look for a future post on how it is possible to use compression (e.g. Huffman encoding) to reduce the number of bits per bitmap, thus reducing the error per bit to match that of HLL, resulting in an approach that matches HLL in size but exceeds its precision!

### Notes on the Fringe

While we were putting this post together we discovered many interesting things to look at with respect to fringe optimization. One of the questions we wanted to answer was “How often does the limited size of the fringe muck up a bitmap?” Below is a plot that shows how often any given sketch had a truncation event (that affected the DV estimate) in the fringe of any one of its bitmaps for a given fringe width (i.e. some value could not be stored in the space available).

Note that this is an upper bound on the error that could be generated by truncation. If you compare the number of runs that had a truncation event (almost all of the runs) with the error plot in the post it is quite shocking that the errors are as small as they are.

Since we might not get around to all of the interesting research here, we are calling out to the community to help! Some ideas:

1. There are likely a few ways to improve the fringe truncation. Since PCSA is so sensitive to the least-significant 1 in each bitmap, it would be very interesting to see how different approaches affect the algorithm. For example, in our algorithm we “left” truncated meaning that all bitmaps had to have a one in the least-significant position of the bitmap in order to move up the offset. It would be interesting to look at “right” truncation. If one bitmap is causing many of the others to not record incoming values perhaps it should be bumped up. Is there some math to back up this intuition?
2. It is interesting to us that the fringe width truncation events are DV dependent. We struggled with the math on this for a bit before we just stopped. Essentially we want to know what is the width of the theoretical fringe? It obviously appears to be DV dependent and some sort of coupon collector problem with unequal probabilities. Someone with better math skills than us needs to help here.

### Closing thoughts

We uncovered PCSA again as a way to go back to first principles and see if there are lessons to be learned that could be applied to HLL to make it even better. For instance, can all of this work on the fringe be applied to HLL to reduce the number of bits per register while still maintaining the same precision? HLL effectively records the “strandline” (what we call the left-most 1s). More research into how this strandline behaves and if it is possible to improve the storage of it through truncation could reduce the standard HLL register width from 5 bits to 4, a huge savings! Obviously, we uncovered a lot of open questions with this research and we feel there are algorithmic improvements to HLL right around the corner. We have done some preliminary tests and the results so far are intriguing. Stay tuned!

## Doubling the Size of an HLL Dynamically – Unions

Author’s Note: This post is related to a few previous posts dealing with the HyperLogLog algorithm.  See Matt’s overview of the algorithm, and see this post for an overview of “folding” or shrinking HLLs in order to perform set operations. It is also the second in a series of three posts on doubling the number of bins of HLLs. The first post dealt with the recovery time after doubling and the next post will deal with ways to utilize an extra bit or two per bin.

### Overview

Let’s say we have two streams of data which we’re monitoring with the HLL algorithm, and we’d like to get an estimate on the cardinality of these two streams combined, i.e. thought of as one large stream.  In this case, we have to take advantage of the algorithm’s built-in “unionfeature.  Done naively, the accuracy of the estimate will depend entirely on the the number of bins, $m$, of the smaller of the two HLLs.  In this case, to make our estimate more accurate, we would need to increase this $m$ of one (or both) of our HLLs.  This post will investigate the feasibility of doing this; we will apply our idea of “doubling” to see if we can gain any accuracy.  We will not focus on intersections, since the only support the HyperLogLog algorithm has for intersections is via the inclusion/exclusion principle. Hence the error can be kind of funky for this – for a better overview of this, check out Timon’s post here. For this reason, we only focus on how the union works with doubling.

### The Strategy: A Quick Reminder

In my last post we discussed the benefits and drawbacks of many different doubling strategies in the context of recovery time of the HLL after doubling. Eventually we saw that two of our doubling strategies worked significantly better than the others. In this post, instead of testing many different strategies, we’ll focus instead on one strategy, “proportion doubling” (PD), and how to manipulate it to work best in the context of unions. The idea behind PD is to guess the approximate intersection cardinality of the two datasets and to force that estimate to remain after doubling. To be more specific, suppose we have an HLL $A$ and an HLL $B$ with$n$ bins and $2n$ bins, respectively. Then we check what proportion of bins in $A$, call it $p$, agree with the bins in $B$. When we doubled $A$, we fill in the bins by randomly selecting $p\cdot n$ bins, and filling them in with the value in the corresponding bins in $B$. To fill in the rest of the bins, we fill them in randomly according to the distribution.

### The Naive Approach

To get some idea of how well this would work, I put the most naive strategy to the test. The idea was to run 100 trials where I took two HLLs (one of size $2^5 = 32$ and one of size $2^6 = 64$), ran 200K keys through them, doubled the smaller one (according to Random Estimate), and took a union. I had a hunch that the accuracy of our estimate after doubling would depend on how large the true intersection cardinality of the two datasets would be, so I ran this experiment for overlaps of size 0, 10K, 20K, etc. The graphs below are organized by the true intersection cardinality, and each graph shows the boxplot of the error for the trials.

This graph is a little overwhelming and a bit of a strange way to display the data, but is useful for getting a feel for how the three estimates work in the different regimes.  The graph below is from the same data and just compares the “Small” and “Doubled” HLLs.  The shaded region represents the middle 50% of the data, and the blue dots represent the data points.

The first thing to notice about these graphs is the accuracy of the estimate in the small intersection regime. However, outside of this, the estimates are not very accurate – it is clearly a better choice to just use the estimate from the smaller HLL.

Let’s try a second approach. Above we noticed that the algorithm’s accuracy depended on the cardinality of the intersection. Let’s try to take that into consideration. Let’s use the “Proportion Doubling” (PD) strategy we discussed in our first post. That post goes more in depth into the algorithm, but the take away is that this doubling strategy preserves the proportion of bins in the two HLLs which agree. I ran some trials like I did above to get some data on this. The graphs below represent this.

Here we again, show the data in a second graph comparing just the “Doubled” and “Small” HLL estimates.  Notice how much tighter the middle 50% region is on the top graph (for the “Doubled” HLL).  Hence in the large intersection regime, we get very accurate estimates.

One thing to notice about the second set of graphs is how narrow the error bars are.  Even when the estimate is biased, it still has much smaller error.  Also, notice that this works well in the large intersection regime but horribly in the small intersection regime.  This suggests that we may be able to interpolate our strategies. The next set of graphs is for an attempt at this. The algorithm gets an estimate of the intersection cardinality, then decides to either double using PD, double using RE, or not double depending on whether the intersection is large, small, or medium.

Here, the algorithm works well in the large intersection regime and doesn’t totally crap out outside of this regime (like the second algorithm), but doesn’t sustain the accuracy of the first algorithm in the small intersection regime. This is most likely because the algorithm cannot “know” which regime it is in and thus, must make a guess.  Eventually, it will guess wrong will severely underestimate the union cardinality. This will introduce a lot of error, and hence, our boxplot looks silly in this regime. The graph below shows the inefficacy of this new strategy. Notice that there are virtually no gains in accuracy in the top graph.

### Conclusion

With some trickery, it is indeed possible to gain some some accuracy when estimating the cardinality of the union of two HLLs by doubling one.  However, in order for this to be feasible, we need to apply the correct algorithm in the correct regime. This isn’t a major disappointment since for many practical cases, it would be easy to guess which regime the HLLs should fall under and we could build in the necessary safeguards if we guess incorrectly.  In any case, our gains were modest but certainly encouraging!

## HLL Intersections

### Why?

The intersection of two streams (of user ids) is a particularly important business need in the advertising industry. For instance, if you want to reach suburban moms but the cost of targeting those women on a particular inventory provider is too high, you might want to know about a cheaper inventory provider whose audience overlaps with the first provider. You may also want to know how heavily your car-purchaser audience overlaps with a certain metro area or a particular income range. These types of operations are critical for understanding where and how well you’re spending your advertising budget.

As we’ve seen before, HyperLogLog provides a time- and memory-efficient algorithm for estimating the number of distinct values in a stream. For the past two years, we’ve been using HLL at AK to do just that: count the number of unique users in a stream of ad impressions. Conveniently, HLL also supports the union operator ( $\cup$ ) allowing us to trivially estimate the distinct value count of any composition of streams without sacrificing precision or accuracy. This piqued our interest because if we can “losslessly” compute the union of two streams and produce low-error cardinality estimates, then there’s a chance we can use that estimate along with the inclusion-exclusion principle to produce “directionally correct” cardinality estimates of the intersection of two streams. (To be clear, when I say “directionally correct” my criteria is “can an advertiser make a decision off of this number?”, or “can it point them in the right direction for further research?”. This often means that we can tolerate relative errors of up to 50%.)

The goals were:

1. Get a grasp on the theoretical error bounds of intersections done with HLLs, and
2. Come up with heuristic bounds around $m$, $overlap$, and the set cardinalities that could inform our usage of HLL intersections in the AK product.

Quick terminology review:

• If I have set of integers $A$, I’m going to call the HLL representing it $H_{A}$.
• If I have HLLs $H_{A}, H_{B}$ and their union $H_{A \cup B}$, then I’m going to call the intersection cardinality estimate produced $|H_{A \cap B}|$.
• Define the $overlap$ between two sets as $overlap(A, B) := \frac{|A \cap B|}{min(|A|, |B|)}$.
• Define the cardinality ratio $\frac{max(|A|, |B|)}{min(|A|, |B|)}$ as a shorthand for the relative cardinality of the two sets.
• We’ll represent the absolute error of an observation $|H_{A}|$ as $\Delta |H_{A}|$.

That should be enough for those following our recent posts, but for those just jumping in, check out Appendices A and B at the bottom of the post for a more thorough review of the set terminology and error terminology.

### Experiment

We fixed 16 $overlap$ values (0.0001, 0.001, 0.01, 0.02, 0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0) and 12 set cardinalities (100M, 50M, 10M, 5M, 1M, 500K, 100K, 50K, 10K, 5K, 1K, 300) and did 100 runs of each permutation of $(overlap, |A|, |B|)$. A random stream of 64-bit integers hashed with Murmur3 was used to create the two sets such that they shared exactly $min(|A|,|B|) \cdot overlap = |A \cap B|$  elements. We then built the corresponding HLLs $H_{A}$ and $H_{B}$ for those sets and calculated the intersection cardinality estimate $|H_{A} \cap H_{B}|$ and computed its relative error.

Given that we could only generate and insert about 2M elements/second per core, doing runs with set cardinalities greater than 100M was quickly ruled out for this blog post. However, I can assure you the results hold for much larger sets (up to the multiple billion-element range) as long as you heed the advice below.

### Results

This first group of plots has a lot going on, so I’ll preface it by saying that it’s just here to give you a general feeling for what’s going on. First note that within each group of boxplots $overlap$ increases from left to right (orange to green to pink), and within each plot cardinality ratio increases from left to right. Also note that the y-axis (the relative error of the observation) is log-base-10 scale. You can clearly see that as the set sizes diverge, the error skyrockets for all but the most similar (in both cardinality and composition) sets. I’ve drawn in a horizontal line at 50% relative error to make it easier to see what falls under the “directionally correct” criteria. You can (and should) click for a full-sized image.

Note: the error bars narrow as we progress further to the right because there are fewer observations with very large cardinality ratios. This is an artifact of the experimental design.

A few things jump out immediately:

• For cardinality ratio > 500, the majority of observations have many thousands of percent error.
• When cardinality ratio is smaller than that and $overlap > 0.4$, register count has little effect on error, which stays very low.
• When $overlap \le 0.01$, register count has little effect on error, which stays very high.

Just eyeballing this, the lesson I get is that computing intersection cardinality with small error (relative to the true value) is difficult and only works within certain constraints. Specifically,

1. $\frac{|A|}{|B|} < 100$, and
2. $overlap(A, B) = \frac{|A \cap B|}{min(|A|, |B|)} \ge 0.05$.

The intuition behind this is very simple: if the error of any one of the terms in your calculation is roughly as large as the true value of the result then you’re not going to estimate that result well. Let’s look back at the intersection cardinality formula. The left-hand side (that we are trying to estimate) is a “small” value, usually. The three terms on the right tend to be “large” (or at least “larger”) values. If any of the “large” terms has error as large as the left-hand side we’re out of luck.

So, let’s say you can compute the cardinality of an HLL with relative error of a few percent. If $|H_{A}|$ is two orders of magnitude smaller than $|H_{B}|$, then the error alone of $|H_{B}|$ is roughly as large as $|A|$.

$|A \cap B| \le |A|$ by definition, so

$|A \cap B| \le |A| \approx |H_{A}| \approx \Delta |H_{B}|$.

In the best scenario, where $A \cap B = A$, the errors of $|H_{B}|$ and $|H_{A \cup B}| \approx |H_{B}|$ are both roughly the same size as what you’re trying to measure. Furthermore, even if $|A| \approx |B|$ but the overlap is very small, then $|A \cap B|$  will be roughly as large as the error of all three right-hand terms.

### On the bubble

Let’s throw out the permutations whose error bounds clearly don’t support “directionally correct” answers ($overlap < 0.01$ and $\frac{|A|}{|B|} > 500$) and those that trivially do ($overlap > 0.4$) so we can focus more closely on the observations that are “on the bubble”. Sadly, these plots exhibit a good deal of variance inherent in their smaller sample size. Ideally we’d have tens of thousands of runs of each combination, but for now this rough sketch will hopefully be useful. (Apologies for the inconsistent colors between the two plots. It’s a real bear to coordinate these things in R.) Again, please click through for a larger, clearer image.

By doubling the number of registers, the variance of the relative error falls by about a quarter and moves the median relative error down (closer to zero) by 10-20 points.

In general, we’ve seen that the following cutoffs perform pretty well, practically. Note that some of these aren’t reflected too clearly in the plots because of the smaller sample sizes.

Register Count Data Structure Size Overlap Cutoff Cardinality Ratio Cutoff
8192 5kB 0.05 10
16384 10kB 0.05 20
32768 20kB 0.05 30
65536 41kB 0.05 100

### Error Estimation

To get a theoretical formulation of the error envelope for intersection, in terms of the two set sizes and their overlap, I tried the first and simplest error propagation technique I learned. For variables $Y, Z, ...$, and $X$ a linear combination of those (independent) variables, we have

$\Delta X = \sqrt{ (\Delta Y)^2 + (\Delta Z)^2 + ...}$

Applied to the inclusion-exclusion formula:

$\begin{array}{ll} \displaystyle \Delta |H_{A \cap B}| &= \sqrt{ (\Delta |H_{A}|)^2 + (\Delta |H_{B}|)^2 + (\Delta |H_{A \cup B}|)^2} \\ &= \sqrt{ (\sigma\cdot |A|)^2 + (\sigma\cdot |B|)^2 + (\sigma\cdot |A \cup B|)^2} \end{array}$

where

$\sigma = \frac{1.04}{\sqrt{m}}$ as in section 4 (“Discussion”) of the HLL paper.

Aside: Clearly $|H_{A \cup B}|$ is not independent of $|H_{A}| + |H_{B}|$, though $|H_{A}|$ is likely independent of $|H_{B}|$. However, I do not know how to a priori calculate the covariance in order to use an error propagation model for dependent variables. If you do, please pipe up in the comments!

I’ve plotted this error envelope against the relative error of the observations from HLLs with 8192 registers (approximately 5kB data structure).

Despite the crudeness of the method, it provided a 95% error envelope for the data without significant differences across cardinality ratio or $overlap$. Specifically, at least 95% of observations satisfied

$(|H_{A \cap B}| - |A \cap B|) < \Delta |H_{A \cap B}|$

However, it’s really only useful in the ranges shown in the table above. Past those cutoffs the bound becomes too loose and isn’t very useful.

This is particularly convenient because you can tune the number of registers you need to allocate based on the most common intersection sizes/overlaps you see in your application. Obviously, I’d recommend everyone run these tests and do this analysis on their business data, and not on some contrived setup for a blog post. We’ve definitely seen that we can get away with far less memory usage than expected to successfully power our features, simply because we tuned and experimented with our most common use cases.

### Next Steps

We hope to improve the intersection cardinality result by finding alternatives to the inclusion-exclusion formula. We’ve tried a few different approaches, mostly centered around the idea of treating the register collections themselves as sets, and in my next post we’ll dive into those and other failed attempts!

### Appendix A: A Review Of Sets

Let’s say we have two streams of user ids, $S_{A}$ and $S_{B}$. Take a snapshot of the unique elements in those streams as sets and call them $A$ and $B$. In the standard notation, we’ll represent the cardinality, or number of elements, of each set as $|A|$ and $|B|$.

Example: If $A = \{1,2,10\}$ then $|A| = 3$.

If I wanted to represent the unique elements in both of those sets combined as another set I would be performing the union, which is represented by $A \cup B$.

Example: If $A = \{1,2,3\}, B=\{2,3,4\}$ then $A \cup B = \{1,2,3,4\}$.

If I wanted to represent the unique elements that appear in both $A$ and $B$ I would be performing the intersection, which is represented by $A \cap B$.

Example: With $A, B$ as above, $A \cap B = \{2,3\}$.

The relationship between the union’s cardinality and the intersection’s cardinality is given by the inclusion-exclusion principle. (We’ll only be looking at the two-set version in this post.) For reference, the two-way inclusion-exclusion formula is $|A \cap B| = |A| + |B| - |A \cup B|$.

Example: With $A, B$ as above, we see that $|A \cap B| = 2$ and $|A| + |B| - |A \cup B| = 3 + 3 - 4 = 2$.

For convenience we’ll define the $overlap$ between two sets as $overlap(A, B) := \frac{|A \cap B|}{min(|A|, |B|)}$.

Example: With $A, B$ as above, $overlap(A,B) = \frac{|A \cap B|}{min(|A|, |B|)} = \frac{2}{min(3,3)} = \frac{2}{3}$.

Similarly, for convenience, we’ll define the cardinality ratio $\frac{max(|A|, |B|)}{min(|A|, |B|)}$ as a shorthand for the relative cardinality of the two sets.

The examples and operators shown above are all relevant for exact, true values. However, HLLs do not provide exact answers to the set cardinality question. They offer estimates of the cardinality along with certain error guarantees about those estimates. In order to differentiate between the two, we introduce HLL-specific operators.

Consider a set $A$. Call the HLL constructed from this set’s elements $H_{A}$. The cardinality estimate given by the HLL algorithm for $H_{A}$ is $|H_{A}|$.

Define the union of two HLLs $H_{A} \cup H_{B} := H_{A \cup B}$, which is also the same as the HLL created by taking the pairwise max of $H_{A}$‘s and $H_{B}$‘s registers.

Finally, define the intersection cardinality of two HLLs in the obvious way: $|H_{A} \cap H_{B}| := |H_{A}| + |H_{B}| - |H_{A \cup B}|$. (This is simply the inclusion-exclusion formula for two sets with the cardinality estimates instead of the true values.)

### Appendix B: A (Very Brief) Review of Error

The simplest way of understanding the error of an estimate is simply “how far is it from the truth?”. That is, what is the difference in value between the estimate and the exact value, also known as the absolute error.

However, that’s only useful if you’re only measuring a single thing over and over again. The primary criteria for judging the utility of HLL intersections is relative error because we are trying to measure intersections of many different sizes. In order to get an apples-to-apples comparison of the efficacy of our method, we normalize the absolute error by the true size of the intersection. So, for some observation $\hat{x}$ whose exact value is non-zero $x$, we say that the relative error of the observation is $\frac{x-\hat{x}}{x}$. That is, “by what percentage off the true value is the observation off?”

Example: If $|A| = 100, |H_{A}| = 90$ then the relative error is $\frac{100 - 90}{100} = \frac{10}{100} = 10\%$.

## HLLs and Polluted Registers

### Introduction

It’s worth thinking about how things can go wrong, and what the implications of such occurrences might be. In this post, I’ll be taking a look at the HyperLogLog (HLL) algorithm for cardinality estimation, which we’ve discussed before.

### The Setup

HLLs have the property that their register values increase monotonically as they run. The basic update rule is:

for item in stream:
index, proposed_value = process_hashed_item(hash(item))
hll.registers[index] = max(hll.registers[index], proposed_value)

There’s an obvious vulnerability here: what happens to your counts if you get pathological data that blows up a register value to some really large number? These values are never allowed to decrease according to the vanilla algorithm. How much of a beating can these sketches take from such pathological data before their estimates are wholly unreliable?

### Experiment The First

To get some sense of this, I took a 1024 bucket HLL, ran a stream through it, and then computed the error in the estimate. I then proceeded to randomly choose a register, max it out, and compute the error again. I repeated this process until I had maxed out 10% of the registers. In pseudo-python:

print("n_registers_touched,relative_error")
print(0, relative_error(hll.cardinality(), stream_size), sep = ",")
for index, reg in random.sample(range(1024), num_to_edit):
hll.registers[reg] = 32
print(index + 1, relative_error(hll.cardinality(), stream_size), sep = ",")

In practice, HLL registers are fixed to be a certain bit width. In our case, registers are 5 bits wide, as this allows us to count runs of 0s up to length 32. This allows us to count astronomically high in a 1024 register HLL.

Repeating this for many trials, and stream sizes of 100k, 1M, and 10M, we have the following picture. The green line is the best fit line.

What we see is actually pretty reassuring. Roughly speaking, totally poisoning x% of registers results in about an x% error in your cardinality estimate. For example, here are the error means and variances across all the trials for the 1M element stream:

Number of Registers Modified Percentage of Registers Modified Error Mean Error Variance
0 0 -0.0005806094 0.001271119
10 0.97% 0.0094607324 0.001300780
20 1.9% 0.0194860481 0.001356282
30 2.9% 0.0297495396 0.001381753
40 3.9% 0.0395013208 0.001436418
50 4.9% 0.0494727527 0.001460182
60 5.9% 0.0600436774 0.001525749
70 6.8% 0.0706375356 0.001522320
80 7.8% 0.0826034639 0.001599104
90 8.8% 0.0937465662 0.001587156
100 9.8% 0.1060810958 0.001600348

### Initial Reactions

I was actually not too surprised to see that the induced error was modest when only a small fraction of the registers were poisoned. Along with some other machinery, the HLL algorithm uses the harmonic mean of the individual register estimates when computing its guess for the number of distinct values in the data stream. The harmonic mean does a very nice job of downweighting values that are significantly larger than others in the set under consideration:

In [1]: from scipy.stats import hmean

In [2]: from numpy import mean

In [3]: f = [1] * 100000 + [1000000000]

In [4]: mean(f)
Out[4]: 10000.899991000089

In [5]: hmean(f)
Out[5]: 1.0000099999999899

It is this property that provides protection against totally wrecking the sketch’s estimate when we blow up a fairly small fraction of the registers.

### Experiment The Second

Of course, the algorithm can only hold out so long. While I was not surprised by the modesty of the error, I was very surprised by how linear the error growth was in the first figure. I ran the same experiment, but instead of stopping at 10% of the registers, I went all the way to the end. This time, I have plotted the results with a log-scaled y-axis:

Note that some experiments appear to start after others. This is due to missing data from taking the logarithm of negative errors.

Without getting overly formal in our analysis, there are roughly three phases in error growth here. At first, it’s sublinear on the log-scale, then linear, then superlinear. This roughly corresponds to “slow”, “exponential”, and “really, really, fast”. As our mathemagician-in-residence points out, the error will grow roughly as p/(1-p) where p is the fraction of polluted registers. The derivation of this isn’t too hard to work out, if you want to give it a shot! The implication of this little formula matches exactly what we see above. When p is small, the denominator does not change much, and the error grows roughly linearly. As p approaches 1, the error begins to grow super-exponentially. Isn’t it nice when experiment matches theory?

### Final Thoughts

It’s certainly nice to see that the estimates produced by HLLs are not overly vulnerable to a few errant register hits. As is often the case with this sort of analysis, the academic point must be put in balance with the practical. The chance of maxing out even a single register under normal operation is vanishingly small, assuming you chose a sane hash function for your keys. If I was running an HLL in the wild, and saw that 10% of my registers were pegged, my first thought would be “What is going wrong with my system!?” and not “Oh, well, at least I know my estimate to within 10%!” I would be disinclined to trust the whole data set until I got a better sense of what caused the blowups, and why I should give any credence at all to the supposedly unpolluted registers.

## Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure

### Intro

In the Zipfian world of AK, the HyperLogLog distinct value (DV) sketch reigns supreme. This DV sketch is the workhorse behind the majority of our DV counters (and we’re not alone) and enables us to have a real time, in memory data store with incredibly high throughput. HLL was conceived of by Flajolet et. al. in the phenomenal paper HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. This sketch extends upon the earlier Loglog Counting of Large Cardinalities (Durand et. al.) which in turn is based on the seminal AMS work FM-85, Flajolet and Martin’s original work on probabilistic counting. (Many thanks to Jérémie Lumbroso for the correction of the history here. I am very much looking forward to his upcoming introduction to probabilistic counting in Flajolet’s complete works.) UPDATE – Rob has recently published a blog about PCSA, a direct precursor to LogLog counting which is filled with interesting thoughts. There have been a few posts on HLL recently so I thought I would dive into the intuition behind the sketch and into some of the details.

Just like all the other DV sketches, HyperLogLog looks for interesting things in the hashed values of your incoming data.  However, unlike other DV sketches HLL is based on bit pattern observables as opposed to KMV (and others) which are based on order statistics of a stream.  As Flajolet himself states:

Bit-pattern observables: these are based on certain patterns of bits occurring at the beginning of the (binary) S-values. For instance, observing in the stream S at the beginning of a string a bit- pattern $O^{\rho-1}1$ is more or less a likely indication that the cardinality n of S is at least $2^\rho$.

Order statistics observables: these are based on order statistics, like the smallest (real) values, that appear in S. For instance, if X = min(S), we may legitimately hope that n is roughly of the order of 1/X…

In my mind HyperLogLog is really composed of two insights: Lots of crappy things are sometimes better than one really good thing; and bit pattern observables tell you a lot about a stream. We’re going to look at each component in turn.

Even though the literature refers to the HyperLogLog sketch as a different family of estimator than KMV I think they are very similar. It’s useful to understand the approach of HLL by reviewing the KMV sketch. Recall that KMV stores the smallest $k$ values that you have seen in a stream. From these $k$ values you get an estimate of the number of distinct elements you have seen so far. HLL also stores something similar to the smallest values ever seen. To see how this works it’s useful to ask “How could we make the KMV sketch smaller?” KMV stores the actual value of the incoming numbers. So you have to store $k$ 64 bit values which is tiny, but not that tiny. What if we just stored the “rank” of the numbers?  Let’s say the number 94103 comes through (I’ll use base 10 here to make things easier). That number is basically $9*10^4$ plus some stuff. So, let’s just store the exponent, i.e. 4. In this way I get an approximation of the size of numbers I have seen so far. That turns the original KMV algorithm into only having to store the numbers 1-19 (since $2^{64} \approx 10^{19}$) which is a whole lot less than $2^{64}$ numbers. Of course, this estimate will be much worse than storing the actual values.

### Bit Pattern Observables

In actuality HLL, just like all the other DV sketches, uses hashes of the incoming data in base 2. And instead of storing the “rank” of the incoming numbers HLL uses a nice trick of looking for runs of zeroes in the hash values. These runs of zeroes are an example of “bit pattern observables”. This concept is similar to recording the longest run of heads in a series of coin flips and using that to guess the number of times the coin was flipped. For instance, if you told me that you spent some time this afternoon flipping a coin and the longest run of heads you saw was 2 I could guess you didn’t flip the coin very many times. However, if you told me you saw a run of 100 heads in a row I would gather you were flipping the coin for quite a while. This “bit pattern observable”, the run of heads, gives me information about the stream of data it was pulled from. An interesting thing to note is just how probable long runs of heads are. As Mark Shilling points out, you can almost always tell the difference between a human generated set of coin flips and an actual one, due to humans not generating long runs. (The world of coin flipping seems to be a deep and crazy pit.) Disclaimer: The only thing I am trying to motivate here is that by keeping a very small piece of information (the longest run of heads) I can get some understanding of what has happened in a stream. Of course, you could probably guess that even though we have now reduced the storage of our sketch the DV estimate is pretty crummy. But what if we kept more than one of them?

### Stochastic Averaging

In order to improve the estimate, the HLL algorithm stores many estimators instead of one and averages the results. However, in order to do this you would have to hash the incoming data through a bunch of independent hash functions. This approach isn’t a very good idea since hashing each value a bunch of times is expensive and finding good independent hash families is quite difficult in practice. The work around for this is to just use one hash function and “split up” the input into $m$ buckets while maintaining the observable (longest run of zeroes) for each bucket. This procedure is called stochastic averaging. You could do this split in KMV as well and it’s easier to visualize. For an $m$ of 3 it would look like:

To break the input into the $m$ buckets, Durand suggests using the first few ($k$) bits of the hash value as an index into a bucket and compute the longest run of zeroes ($R$) on what is left over. For example, if your incoming datum looks like 010100000110 and k = 3 you could use the 3 rightmost bits, 110, to tell you which register to update ($110_2 = 6$) and from the remaining bits, 010100000, you could take the longest run of zeroes (up to some max), which in this case is 5. In order to compute the number of distinct values in the stream you would just take the average of all of the $m$ buckets:

$\displaystyle DV_{LL} = \displaystyle\text{constant} * m*2^{\overline{R}}$

Here $\overline{R}$ is the average of the values $R$ in all the buckets. The formula above is actually the estimator for the LogLog algorithm, not HyperLogLog. To get HLL, you need one more piece…

### Harmonic Mean

A fundamental insight that Flajolet had to improve LogLog into HyperLogLog was that he noticed the distribution of the values in the $m$ registers is skewed to the right, and there can be some dramatic outliers that really mess up the average used in LogLog (see Fig. 1 below). He and Durand knew this when they wrote LogLog and did a bunch of hand-wavey stuff (like cut off the top 30% of the register values) to create what he called the “SuperLogLog”, but in retrospect this seems kind of dumb. He fixed this in HLL by tossing out the odd rules in SuperLogLog and deciding to take the harmonic mean of the DV estimates. The harmonic mean tends to throw out extreme values and behave well in this type of environment. This seems like an obvious thing to do. I’m a bit surprised they didn’t try this in the LogLog paper, but perhaps the math is harder to deal with when using the harmonic mean vs the geometric mean.

Fig. 1:  The theoretical distribution of register values after $v$ distinct values have been run through an HLL.

Throw all these pieces together and you get the HyperLogLog DV estimator:

$\displaystyle DV_{HLL} = \displaystyle\text{constant} * m^2 *\left (\sum_{j=1}^m 2^{-R_j} \right )^{-1}$

Here $R_j$ is the longest run of zeroes in the $i^{th}$ bucket.

### Putting it All Together

Even with the harmonic mean Flajolet still has to introduce a few “corrections” to the algorithm. When the HLL begins counting, most of the registers are empty and it takes a while to fill them up. In this range he introduces a “small range correction”. The other correction is when the HLL gets full. If a lot of distinct values have been run through an HLL the odds of collisions in your hash space increases. To correct for hash collisions Flajolet introduces the “large range collection”. The final algorithm looks like (it might be easier for some of you to just look at the source in the JavaScript HLL simulation):

m = 2^b #with b in [4...16]

if m == 16:
alpha = 0.673
elif m == 32:
alpha = 0.697
elif m == 64:
alpha = 0.709
else:
alpha = 0.7213/(1 + 1.079/m)

registers = [0]*m # initialize m registers to 0

##############################################################################################
# Construct the HLL structure
for h in hashed(data):
register_index = 1 + get_register_index( h,b ) # binary address of the rightmost b bits
run_length = run_of_zeros( h,b ) # length of the run of zeroes starting at bit b+1
registers[ register_index ] = max( registers[ register_index ], run_length )

##############################################################################################
# Determine the cardinality
DV_est = alpha * m^2 * 1/sum( 2^ -register )  # the DV estimate

if DV_est < 5/2 * m: # small range correction
V = count_of_zero_registers( registers ) # the number of registers equal to zero
if V == 0:  # if none of the registers are empty, use the HLL estimate
DV = DV_est
else:
DV = m * log(m/V)  # i.e. balls and bins correction

if DV_est <= ( 1/30 * 2^32 ):  # intermediate range, no correction
DV = DV_est
if DV_est > ( 1/30 * 2^32 ):  # large range correction
DV = -2^32 * log( 1 - DV_est/2^32)

Rob wrote up an awesome HLL simulation for this post. You can get a real sense of how this thing works by playing around with different values and just watching how it grows over time. Click below to see how this all fits together.

Click above to run the HyperLogLog simulation

### Unions

Unions are very straightforward to compute in HLL and, like KMV, are lossless. All you need to do to combine the register values of the 2 (or $n$) HLL sketches is take the max of the 2 (or $n$) register values and assign that to the union HLL. With a little thought you should realize that this is the same thing as if you had fed in the union stream to begin with. A nice side effect about lossless unions is that HLL sketches are trivially parallelizable. This is great if, like us, you are trying to digest a firehose of data and need multiple boxes to do summarization. So, you have:

for i in range(0, len(R_1)):
R_new[i] = max( R_1[i], R_2[i] )

To combine HLL sketches that have differing sizes read Chris’s blog post about it.

### Wrapping Up

In our research, and as the literature says, the HyperLogLog algorithm gives you the biggest bang for the buck for DV counting. It has the best accuracy per storage of all the DV counters to date. The biggest drawbacks we have seen are around intersections. Unlike KMV, there is no explicit intersection logic, you have to use the inclusion/exclusion principle and this gets really annoying for anything more than 3 sets. Aside from that, we’ve been tickled pink using HLL for our production reporting. We have even written a PostgreSQL HLL data type that supports cardinality, union, and intersection. This has enabled all kinds of efficiencies for our analytics teams as the round trips to Hadoop are less and most of the analysis can be done in SQL. We have seen a massive increase in the types of analytics that go on at AK since we have adopted a sketching infrastructure and I don’t think I’m crazy saying that many big data platforms will be built this way in the future.

P.S.  Sadly, Philippe Flajolet passed away in March 2011. It was actually a very sad day for us at Aggregate Knowledge because we were so deep in our HLL research at the time and would have loved to reach out to him, he seems like he would have been happy to see his theory put to practice. Based on all I’ve read about him I’m very sorry to have not met him in person. I’m sure his work will live on but we have definitely lost a great mind both in industry and academia. Keep counting Philippe!

Photo courtesy of http://www.ae-info.org/

## Doubling the Size of an HLL Dynamically

### Introduction

In my last post, I explained how to halve the number of bins used in an HLL as a way to allow set operations between that HLL and smaller HLLs.  Unfortunately, the accuracy of an HLL is tied to the number of bins used, so one major drawback with this “folding” method is that each time you have the number of bins, you reduce that accuracy by a factor of $\sqrt{2}$.

In this series of posts I’ll focus on the opposite line of thinking: given an HLL, can one double the number of bins, assigning the new bins values according to some strategy, and recover some of the accuracy that a larger HLL would have had?  Certainly, one shouldn’t be able to do this (short of creating a new algorithm for counting distinct values) since once we use the HLL on a dataset the extra information that a larger HLL would have gleaned is gone.  We can’t recover it and so we can’t expect to magically pull a better estimate out of thin air (assuming Flajolet et al. have done their homework properly and the algorithm makes the best possible guess with the given information – which is a pretty good bet!).  Instead, in this series of posts, I’ll focus on how doubling plays with recovery time and set operations.  By this, I mean the following:  Suppose we have an HLL of size 2n and while its running, we double it to be an HLL of size 2n+1. Initially, this may have huge error, but if we allow it to continue running, how long will it take for its error to be relatively small?  I’ll also discuss some ways of modifying the algorithm to carry slightly more information.

### The Candidates

Before we begin, a quick piece of terminology.  Suppose we have an HLL of size 2n and we double it to be an HLL of size 2 n+1.  We consider two bins to be partners if their bin numbers differ by 2n.  To see why this is important – check the post on HLL folding.

Colin and I did some thinking and came up with a few naive strategies to fill in the newly created bins after the doubling. I’ve provided a basic outline of the strategies below.

• Zeroes – Fill in with zeroes.
• Concatenate – Fill in each bin with the value of its partner.
• MinusTwo – Fill in each bin with the value of its partner minus two. Two may seem like an arbitrary amount, but quick look at the formulas involved in the algorithm show that this leaves the cardinality estimate approximately unchanged.
• RandomEstimate (RE) – Fill in each bin according to its probability distribution. I’ll describe more about this later.
• ProportionDouble (PD) – This strategy is only for use with set operations. We estimate the number of bins in the two HLLs which should have the same value, filling in the second half so that that proportion holds and the rest are filled in according to RE.

#### Nitty Gritty of RE

The first three strategies given above are pretty self-explanatory, but the last two are a bit more complicated. To understand these, one needs to understand the distribution of values in a given bin.  In the original paper, Flajolet et al. calculate the probability that a given bin takes the value $k$ to be given by $(1 - 0.5^k)^v - (1 - 0.5^{k-1})^v$ where $v$ is the number of keys that the bin has seen so far. Of course, we don’t know this value ($v$) exactly, but we can easily estimate it by dividing the cardinality estimate by the number of bins. However, we have even more information than this. When choosing a value for our doubled HLL, we know that that value cannot exceed its partner’s value. To understand why this is so, look back at my post on folding, and notice how the value in the partner bins in a larger HLL correspond to the value in the related bin in the smaller HLL.

Hence, to get the distribution for the value in a given bin, we take the original distribution, chop it off at the relevant value, and rescale it to have total area 1. This may seem kind of hokey but let’s quickly look at a toy example. Suppose you ask me to guess a number between 1 and 10, and you will try to guess which number I picked. At this moment, assuming I’m a reasonable random number generator, there is a $1/10$ chance that I chose the number one, a $1/10$ chance that I chose the number two, etc. However, if I tell you that my guess is no larger than two, you can now say there there is a $1/2$ chance that my guess is a one, a $1/2$ chance that my guess is a two, and there is no chance that my guess is larger. So what happened here? We took the original probability distribution, used our knowledge to cut off and ignore the values above the maximum possible value, and then rescaled them so that the sum of the possible probabilities is equal to zero.

RE consists simply of finding this distribution, picking a value according to it, and placing that value in the relevant bin.

#### Nitty Gritty of PD

Recall that we only use PD for set operations. One thing we found was that the accuracy of doubling with set operations according to RE is highly dependent on the the intersection size of the two HLLs. To account for this, we examine the fraction of bins in the two HLLs which contain the same value, and then we force the doubled HLL to preserve this fraction

So how do we do this? Let’s say we have two HLLs: $H$ and $G$. We wish to double $H$ before taking its union with $G$. To estimate the proportion of their intersection, make a copy of $G$ and fold it to be the same size as $H$. Then count the number of bins where $G$ and $H$ agree, call this number $a$. Then if $m$ is the number of bins in $H$, we can estimate that $H$ and $G$ should overlap in about $a/m$ bins. Then for each bin, with probability $a/m$ we fill in the bin with the the minimum of the relevant bin from $G$ and that bin’s partner in $G$. With probability $1 - a/m$ we fill in the bin according to the rules of RE.