## HLL talk at SFPUG

I had the pleasure of speaking at the SF PostgreSQL User Group’s meetup tonight about sketching, the history of HLL, and our implementation of HLL as a PG extension. My slides are embedded below and you can get a PDF copy here. Be sure to click the gear below to show speaker’s notes for context!

If video is made available, I’ll post an update with a link!

## Hitting the Books: EADS Summer School on Hashing

Rob, Matt, and I just wrapped up our trip to Copenhagen for the EADS Summer School on Hashing at the University of Copenhagen and it was a blast! The lineup of speakers was, simply put, unbeatable: Rasmus Pagh, Graham Cormode, Michael Mitzenmacher, Mikkel Thorup, Alex Andoni, Haim Kaplan, John Langford, and Suresh Venkatasubramanian. There’s a good chance that any paper you’ve read on hashing, sketching, or streaming has one of them as a co-author or is heavily influenced by their work. The format was three hour-and-a-half lectures for four days, with exercises presented at the end of each lecture. (Slides can be found here. They might also post videos. UPDATE: They’ve posted videos!)

Despite the depth of the material, almost all of it was approachable with some experience in undergraduate math and computer science. We would strongly recommend both of Michael Mitzenmacher’s talks (1, 2) for an excellent overview of Bloom Filters and Cuckoo hashing that are, in my opinion, significantly better and more in depth than any other out there. Specifically, the Bloom Filter talk presents very elegantly the continuum of Bloom Filter to Counting Bloom Filter to Count-Min Sketch (with “conservative update”) to the Stragglers Problem and Invertible Bloom Filters to, finally, extremely recent work called Odd Sketches.

Similarly, Mikkel Thorup’s two talks on hashing (1, 2) do a very thorough job of examining the hows and whys of integer hashing, all the way from the simplest multiply-mod-prime schemes all the way to modern work on tabulation hashing. And if you haven’t heard of tabulation hashing, and specifically twisted tabulation hashing, get on that because (1) it’s amazing that it doesn’t produce trash given how simple it is, (2) it’s unbelievably fast, and (3) it has been proven to provide the guarantees required for almost all of the interesting topics we’ve discussed on the blog in the past: Bloom Filters, Count-Min sketch, HyperLogLog, chaining/linear-probing/cuckoo hash tables, and so on. We really appreciated how much attention Mikkel devoted to practicality of implementation and to empirical performance when discussing hashing algorithms. It warms our heart to hear a leading academic in this field tout the number of nanoseconds it takes to hash an item as vocally as the elegance of the proof behind it!

We love this “Summer School” format because it delivers the accumulated didactic insight of the field’s top researchers and educators to both old techniques and brand new ones. (And we hope by now that everyone reading our blog appreciates how obsessed we are with teaching and clarifying interesting algorithms and data structures!) Usually most of this insight (into origins, creative process, stumbling blocks, intermediate results, inspiration, etc.) only comes out in conversation or lectures, and even worse is withheld or elided at publishing time for the sake of “clarity” or “elegance”, which is a baffling rationale given how useful these “notes in the margin” have been to us. The longer format of the lectures really allowed for useful “digressions” into the history or inspiration for topics or proofs, which is a huge contrast to the 10-minute presentations given at a conference like SODA. (Yes, obviously the objective of SODA is to show a much greater breadth of work, but it really makes it hard to explain or digest the context of new work.)

In much the same way, the length of the program really gave us the opportunity to have great conversations with the speakers and attendees between sessions and over dinner. We can’t emphasize this enough: if your ambition to is implement and understand cutting edge algorithms and data structures then the best bang for your buck is to get out there and meet the researchers in person. We’re incredibly lucky to call most of the speakers our friends and to regularly trade notes and get pointers to new research. They have helped us time and again when we’ve been baffled by inconsistent terminology or had a hunch that two pieces of research were “basically saying the same thing”. Unsurprisingly, they are also the best group of people to talk to when it comes to understanding how to foster a culture of successful research. For instance, Mikkel has a great article on how to systematically encourage and reward research article that appears in the March 2013 issue of CACM (pay-wall’d). Also worthwhile is his guest post on Bertrand Meyer’s blog.

If Mikkel decides to host another one of these, we cannot recommend attending enough. (Did we mention it was free?!) Thanks again Mikkel, Rasmus, Graham, Alex, Michael, Haim, and John for organizing such a great program and lecturing so eloquently!

## Sketch of the Day: Frugal Streaming

We are always worried about the size of our sketches. At AK we like to count stuff, and if you can count stuff with smaller sketches then you can count more stuff! We constantly have conversations about how we might be able to make our sketches, and subsequently our datastores, smaller. During our science summit, Muthu pointed us at some of the new work in Frugal Streaming. The concept of Frugal Streaming is to process your data as it comes, O(N), but severely limit the amount of memory you are using for your sketch, and by “severely” they mean using perhaps one or two pieces of information. Needless to say, we were interested!

### History

The concept of Frugal Streaming reminds me of an algorithm for finding the dominant item in a stream, MJRTY written by Boyer and Moore in 1980. (This paper has a very interesting history). The MJRTY algorithm sets out to solve the problem of finding the majority element in a stream (an element comprising more than 50% of the stream). Moore proposed to solve this by using only 2 pieces of information and a single scan of the data.

Imagine you have a stream of names (“matt”, “timon”, “matt”, “matt”, “rob”, “ben”, …) and you wanted to know if any name appeared in more than half the stream. Boyer and Moore proposed the following:

count = 0
majority = ""

for val in stream:
if count == 0:
majority = val
count = 1
elif val == majority:
count += 1
else:
count -= 1

print majority if count > 0 else "no majority!"


If you’re anything like me you will go through a few phases: “That can’t work!”, “Wait, that works?!”, “Nah, this data will break that”, “Holy sh*t that works!!”. If you think about it for a minute you realize that it HAS to work. If any element in the stream comprises more than half the stream values there is no way to get to the end and have a counter of zero. To convince yourself suppose the majority element only comprises the first half + 1 of your N-length stream. The counter would count up to $N/2+1$ and then start decrementing until all N values have been seen, which would leave the counter at $2 = (N/2+1) - (N/2-1)$*. This will hold regardless of the ordering of the elements in the stream. A simple simulation is provided by the authors. Philippe Flajolet apparently “liked this algorithm very much and called it the ‘gang war’, because in his mind, every element is a gang member, and members of opposite gangs are paired in a standoff, and shoot each other. The gang members remaining are the more numerous”**.

The astute among you will have noticed that this algorithm only works if there is, in fact, a majority element. If there is not then it will fail. A stream such as {“matt”,”matt”,”timon”,”timon”,”rob”} would result in the algorithm returning “rob”. In practice you need ways of ensuring that your stream does indeed have a majority element or have a guarantee ahead of time.

* Note, that formula is for an even length stream. For a stream of odd length the counter will indeed be at 1. Proof is left to the reader.

** Thanks to Jeremie Lumbroso for his insightful feedback on the history of MJRTY and his memory of Philippe’s explanation.

### One “bit” – Frugal-1U

In their Frugal Streaming paper, Qiang and Muthu decided to see if they could find a frugal solution to the streaming quantile problem. The streaming quantiles problem is one I enjoy quite a bit and I have used it as an interview question for potential candidates for some time. Simply stated it is: “How would you find the quantiles of a stream of numbers in O(N) with limited memory?” There are a few different approaches to solve this problem, with the most widely known probably being Count-Min Sketch. However, with Count-Min you need to keep your keys around in order to query the sketch. There are other approaches to this question as well.

Instead of focusing on quantiles generally, Qiang and Muthu’s first solution is a frugal way to find the median of a stream. As with MJRTY above, I’ll just write it down and see how you react:

median_est = 0
for val in stream:
if val > median_est:
median_est += 1
elif val < median_est:
median_est -= 1


Granted, the above is just for the median, where the stream is much longer than the value of the median, but it is so simple that I don’t think I would have ever considered this approach to the problem. The extension to quantiles is equally as shocking. If you are trying to find the 75th percentile of the data stream you do the same as above but increment up randomly 75% of the time and decrement down randomly 25% of the time:


quantile_75 = 0
for val in stream:
r = random()
if val > quantile_75 and r > 1 - 0.75:
quantile_75 += 1
elif val < quantile_75 and r > 0.75:
quantile_75 -= 1


As the paper states:

The trick to generalize median estimation to any $\frac{h}{k}$ -quantile estimation is that not every stream item seen will cause an update. If the current stream item is larger than estimation, an increment update will be triggered only with probability $\frac{h}{k}$. The rationale behind it is that if we are estimating $\frac{h}{k}$ -quantile, and if the current estimate is at stream’s true $\frac{h}{k}$ -quantile, we will expect to see stream items larger than the current estimate with probability $1-\frac{h}{k}$ .

### Finding Quantiles With Two “bits”- Frugal-2U

There are a few obvious drawbacks to the above algorithm. Since we are only incrementing by 1 each time, the time to converge is linear and our initial guess of zero could be very bad. Secondly, and by design, the algorithm has no memory, can fluctuate wildly and, as they show in the paper, the estimates can drift very far away. (Note: while it is extremely unlikely that the estimates will drift far from the correct values the paper has some very loose bounds on how bad it can drift. See Lemma 3 in the paper.) They suggest a few improvements over Frugal-1U where you essentially include a varying step (instead of just incrementing by 1 every time) and 1 “bit” so you know which way you incremented in the last update. The intuition here is that if you have been consistently incrementing up or down for the last few elements of a stream then you are probably “far” away from the quantile in question. If this is the case we can speed up convergence time by incrementing a larger amount. The Frugal-2U algorithm:

def frugal_2u(stream, m = 0, q = 0.5, f = constantly_one):
step, sign = 1, 1

for item in stream:
if item > m and random() > 1 - q:
# Increment the step size if and only if the estimate keeps moving in
# the same direction. Step size is incremented by the result of applying
# the specified step function to the previous step size.
step += f(step) if sign > 0 else -1 * f(step)
# Increment the estimate by step size if step is positive. Otherwise,
# increment the step size by one.
m += step if step > 0 else 1
# Mark that the estimate increased this step
sign = 1
# If the estimate overshot the item in the stream, pull the estimate back
# and re-adjust the step size.
if m > item:
step += (item - m)
m = item
# If the item is less than the stream, follow all of the same steps as
# above, with signs reversed.
elif item < m and random() > q:
step += f(step) if sign < 0 else -1 * f(step)
m -= step if step > 0 else 1
sign = -1
if m < item:
step += (m - item)
m = item
# Damp down the step size to avoid oscillation.
if (m - item) * sign < 0 and step > 1:
step = 1



You can play around with the 1U and 2U algorithms in the simulation below.

Click above to run the Frugal Streaming simulation

As usual, there are a few interesting tidbits as well. If you read the paper you will see that they define the updates to step as a function. This means they are allowing many different types of increments to step. For example, instead of increasing the size of step by 1 each time we could increase it by 10 or even have it increase multiplicatively. They do talk about some uses of different updates to step but there is no analysis around this (yet) and they restrict all of the work in the paper to step increments of 1. We offer a few different step update functions in the simulation and they indeed do interesting things. Further exploration is definitely needed to get some insights here.

A non-obvious thing about the step variable is how it behaves under decrements. My initial thought was that step would get large if your current estimate was far below the actual value (thus allowing you to approach it faster from below) and that step would get to be a large negative number if your current estimate was very far above the actual value. This turns out to just be flat wrong. The negative updates to step have the effect of stabilizing the estimate (notice when step is negative that the updates to your estimates are always ± 1 ). If you read the algorithm closely you will see that step decrements when you consistently alternate up and down updates. This behavior occurs when the estimate is close to the actual value which causes step to become a very large negative number. And I mean VERY large. In practice we have seen this number as small as $-10^{102}$ for some simulations.

### Monitoring

One of the first things I thought of when I saw this sketch was to use it as a monitoring tool. Specifically, perhaps it could be used to replace the monitoring we use on our application server response times. It turns out that using this sketch for monitoring introduces some very interesting issues. So far, we have mostly talked about what I will call “static streams”. These are streams that have data in them which is pulled consistently from some static underlying distribution (as in the examples above). However, what if the underlying distribution changes? For example, what if one of our servers all of the sudden starts responding with slower response times? Does this frugal sketch enable you to quickly figure out that something has broken and fire off an alarm with high confidence? Well, in some ways this is an identical problem to cold start: how long does it take for an initial guess to reach a stable quantile? Unfortunately, there is no easy way to know when you have reached “equilibrium” and this makes identifying when an underlying distribution change has occurred difficult as well. The paper ends with an open challenge:

… as our experiments and insights indicate, frugal streaming algorithms work with so little memory of the past that they are adaptable to changes in the stream characteristics. It will be of great interest to understand this phenomenon better.

The paper shows some interesting experiments with changing data, but I agree with Qiang: we need to understand these algorithms better before I swap out all of our server monitoring tools (and our ops team would kill me). However, since these are so simple to implement and so small, we can easily deploy a few tests and see how the results compare “in the field” (you can play around with this by changing the underlying stream distribution in our simulation above.)

### Conclusion

The frugal quantile algorithms proposed in the paper are fascinating. It is a very interesting turn in the sketching literature and Qiang and Muthu’s creativity really comes across. I am very interested in getting some real world uses out of this sketch and am excited to see what other applications we (and Qiang!) can think of.  Many thanks to MuthuQiang Ma and Jeremie Lumbroso for all their help!

### Appendix

• Variability: While the bounds on the accuracy of these algorithms seem wide to me, clearly in real world experiments we see much better performance than the guaranteed bounds in the paper. In fact, the error bounds in the paper are dependent on the size of the stream and not fixed.
• Size of step: A potential gotcha is the size of the step variable. Depending on your update function it indeed seems possible for this value to get below -MAXINT. Clearly a real implementation would need some error checking.
• Cold Start: One more gotcha is that you have no real way of knowing when you are near the quantile in question. For instance, starting your estimate at zero and measuring a true median which is 100,000,000,000 will take a long time to “catch up” to this value. There are a few ways to limit this, some of which are discussed in the paper. One way is to try and make a sane guess up front (especially if you know something about the distribution) and another is to start your guess with the value from the last counter. For instance, suppose you are in a monitoring situation and you are computing means over the course of a day. It would make sense to start tomorrow’s estimate with yesterdays final value.
• Accuracy:  And, lastly, there is some interesting dependence on “atomicity”. Meaning, the estimates in some sense depend on how “large” the actual values are. In both, your minimum step size is 1. If my median in the stream is, say, 6 then this “atomic” update of 1 causes high relative fluctuation. It seems in real world examples you would like to balance the size of the thing you are estimating with the speed of approach of the algorithm. This could lead you to estimate latencies in microseconds rather than milliseconds for instance. All of this points to the fact that there are a bunch of real world engineering questions that need to be answered and thought about before you actually go and implement a million frugal sketches throughout your organization.

## Data Science Summit – Update

I don’t think I’m going out on a limb saying that our conference last week was a success. Thanks to everyone who attended and thanks again to all the speakers. Muthu actually beat us to it and wrote up a nice summary. Very kind words from a great guy. For those of you that couldn’t make it (or those that want to relive the fun) we posted the videos and slides. Thanks again to everyone for making it such a success!

Muthu being Muthu during David Woodruff’s fantastic talk on streaming linear algebra

## Foundation Capital and Aggregate Knowledge Sponsor Streaming/Sketching Conference

We, along with our friends at Foundation Capital, are pleased to announce a 1 day mini-conference on streaming and sketching algorithms in Big Data.  We have gathered an amazing group of speakers from academia and industry to give talks.  If you are a reader of this blog we would love to have you come!  The conference will be on 6/20 (Thursday) from 10 AM to 5:30 PM at the 111 Minna Gallery in San Francisco and attendance is limited. Breakfast and lunch included!

There will also be a happy hour afterwards if you cannot make the conference or just want a beer.

The speaker list includes:

##### Muthu Muthukrishnan

The Count-Min Sketch, 10 Years Later

The Count-Min Sketch is a data structure for indexing data streams in very small space. In a decade since its introduction, it has found many uses in theory and practice, with data streaming systems and beyond. This talk will survey the developments.

Muthu Muthukrishnan is a Professor at Rutgers University and a Research Scientist at Microsoft, India. His research interest is in development of data stream theory and systems, as well as online advertising systems.

##### David P. Woodruff

Sketching as a Tool for Numerical Linear Algebra

The talk will focus on how sketching techniques from the data stream literature can be used to speed up well-studied algorithms for problems occurring in numerical linear algebra, such as least squares regression and approximate singular value decomposition. It will also discuss how they can be used to achieve very efficient algorithms for variants of these problems, such as robust regression.

David Woodruff joined the algorithms and complexity group at IBM Almaden in 2007 after completing his Ph.D. at MIT in theoretical computer science. His interests are in compressed sensing, communication, numerical linear algebra, sketching, and streaming.

##### Sam Ritchie

Summingbird is a platform for streaming map/reduce used at Twitter to build aggregations in real-time or on hadoop. When the programmer describes her job, that job can be run without change on Storm or Hadoop. Additionally, summingbird can manage merging realtime/online computations with offline batches so that small errors in real-time do not accumulate. Put another way, summingbird gives eventual consistency in a manner that is easy for the programmer to reason about.

Sam Ritchie works on data analysis and infrastructure problems in Twitter’s Revenue engineering team. He is co-author of a number of open-source Scala and Clojure libraries, including Bijection, Algebird, Cascalog 2 and ElephantDB. He holds a bachelor’s degree in mechanical and aerospace engineering.

##### Alexandr Andoni

Similarity Search Algorithms

Nearest Neighbor Search is an ubiquitous problem in analyzing massive datasets: its goal is to process a set of objects (such as images), so that later, one can find the object most similar to a given query object. I will survey the state-of-the-art for this problem, starting from the (Kanellakis-award winning) technique of Locality Sensitive Hashing, to its more modern relatives, and touch upon connection to the theory of sketching.

Alexandr Andoni is a researcher in the Microsoft Research at Silicon Valley since 2010, after finishing his PhD in MIT’s theory group and year-long postdoctoral position at Princeton University. His research interests revolve around algorithms for massive datasets, including similarity search and streaming/sublinear algorithms, as well as theoretical machine learning.

There will be a panel discussion on the topic of harboring research in startups. Speakers include:
Pete Skomoroch from the LinkedIn data science team.
Rob Grzywinski of Aggregate Knowledge.
Joseph Turian of Metaoptimize

##### Lightning talks
• Armon Dadgar (Kiip) – Sketching Data Structures at Kiip
• Blake Mizerany (Heroku) – An Engineer Reads a Data Sketch
• Timon Karnezos (AK) – TBD
• Jérémie Lumbroso (INRIA) – Philippe Flajolet’s Contribution to Streaming Algorithms

Update! We posted the videos and slides for those of you that couldn’t make it (or those that want to relive the fun). Enjoy!

## Call for Summer Interns

AK is looking for a summer intern in our R&D group. If any of our blog posts have interested you, then you’ll fit right in!

We’re looking for someone who has a good handle on a few programming languages (pick any two from R/Mathematica/Python/Javascript/Java) and has some math in their background — college-level calculus or algebra is plenty. Ideally, you’re interested in learning about:

• building and tuning high-performance data structures,
• streaming algorithms,
• interesting data visualizations, and

It’s OK if you’ve never seen the stuff we write about on the blog before! We didn’t either until we started researching them!

I can’t emphasize this enough: we don’t expect you to know how to do the things above yet. We simply expect you to have a passion for learning about them and the diligence to work through what (at the time) seem like impossible problems. Work experience is nice, but not necessary. As long as you can write clean code and can work hard, you’re well-qualified for this job.

If you’re interested, please send a brief note about why you’re interested, along with a CV and/or GitHub username to timon at aggregateknowledge dot com. For extra credit, please submit one (or more!) of the following:

• An implementation of HLL, Count-Min Sketch, K-Min Values, or Distinct Sampling in a language of your choice.
• An extension to Colin’s blog post about a good hash function that adds CityHash and SipHash to the shoot-out.
• An explanation of the tradeoffs between using a hash map and Count-Min Sketch for counting item frequency.

(I feel like I shouldn’t have to say this, but yes, these are all answered somewhere on the internet. Don’t plagiarize. What we want is for you to go learn from them and try your own hand at implementing/experimenting. Also, don’t freak out, these are extra credit!)

## The Sketching Press

Yesterday we had a visitor at the office, Bahman Bahmani. He was nice enough to give us a preview of his talk for Strata this week. As we are sketching cheerleaders, it was really cool of him to let us see his talk and to trade some war stories. If you are at Strata this week, definitely go and check it out. He has some really cool examples of sketching applications and a detailed description of his work at Twitter for their streaming PageRank sketch.

## HLL Intersections

### Why?

The intersection of two streams (of user ids) is a particularly important business need in the advertising industry. For instance, if you want to reach suburban moms but the cost of targeting those women on a particular inventory provider is too high, you might want to know about a cheaper inventory provider whose audience overlaps with the first provider. You may also want to know how heavily your car-purchaser audience overlaps with a certain metro area or a particular income range. These types of operations are critical for understanding where and how well you’re spending your advertising budget.

As we’ve seen before, HyperLogLog provides a time- and memory-efficient algorithm for estimating the number of distinct values in a stream. For the past two years, we’ve been using HLL at AK to do just that: count the number of unique users in a stream of ad impressions. Conveniently, HLL also supports the union operator ( $\cup$ ) allowing us to trivially estimate the distinct value count of any composition of streams without sacrificing precision or accuracy. This piqued our interest because if we can “losslessly” compute the union of two streams and produce low-error cardinality estimates, then there’s a chance we can use that estimate along with the inclusion-exclusion principle to produce “directionally correct” cardinality estimates of the intersection of two streams. (To be clear, when I say “directionally correct” my criteria is “can an advertiser make a decision off of this number?”, or “can it point them in the right direction for further research?”. This often means that we can tolerate relative errors of up to 50%.)

The goals were:

1. Get a grasp on the theoretical error bounds of intersections done with HLLs, and
2. Come up with heuristic bounds around $m$, $overlap$, and the set cardinalities that could inform our usage of HLL intersections in the AK product.

Quick terminology review:

• If I have set of integers $A$, I’m going to call the HLL representing it $H_{A}$.
• If I have HLLs $H_{A}, H_{B}$ and their union $H_{A \cup B}$, then I’m going to call the intersection cardinality estimate produced $|H_{A \cap B}|$.
• Define the $overlap$ between two sets as $overlap(A, B) := \frac{|A \cap B|}{min(|A|, |B|)}$.
• Define the cardinality ratio $\frac{max(|A|, |B|)}{min(|A|, |B|)}$ as a shorthand for the relative cardinality of the two sets.
• We’ll represent the absolute error of an observation $|H_{A}|$ as $\Delta |H_{A}|$.

That should be enough for those following our recent posts, but for those just jumping in, check out Appendices A and B at the bottom of the post for a more thorough review of the set terminology and error terminology.

### Experiment

We fixed 16 $overlap$ values (0.0001, 0.001, 0.01, 0.02, 0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0) and 12 set cardinalities (100M, 50M, 10M, 5M, 1M, 500K, 100K, 50K, 10K, 5K, 1K, 300) and did 100 runs of each permutation of $(overlap, |A|, |B|)$. A random stream of 64-bit integers hashed with Murmur3 was used to create the two sets such that they shared exactly $min(|A|,|B|) \cdot overlap = |A \cap B|$  elements. We then built the corresponding HLLs $H_{A}$ and $H_{B}$ for those sets and calculated the intersection cardinality estimate $|H_{A} \cap H_{B}|$ and computed its relative error.

Given that we could only generate and insert about 2M elements/second per core, doing runs with set cardinalities greater than 100M was quickly ruled out for this blog post. However, I can assure you the results hold for much larger sets (up to the multiple billion-element range) as long as you heed the advice below.

### Results

This first group of plots has a lot going on, so I’ll preface it by saying that it’s just here to give you a general feeling for what’s going on. First note that within each group of boxplots $overlap$ increases from left to right (orange to green to pink), and within each plot cardinality ratio increases from left to right. Also note that the y-axis (the relative error of the observation) is log-base-10 scale. You can clearly see that as the set sizes diverge, the error skyrockets for all but the most similar (in both cardinality and composition) sets. I’ve drawn in a horizontal line at 50% relative error to make it easier to see what falls under the “directionally correct” criteria. You can (and should) click for a full-sized image.

Note: the error bars narrow as we progress further to the right because there are fewer observations with very large cardinality ratios. This is an artifact of the experimental design.

A few things jump out immediately:

• For cardinality ratio > 500, the majority of observations have many thousands of percent error.
• When cardinality ratio is smaller than that and $overlap > 0.4$, register count has little effect on error, which stays very low.
• When $overlap \le 0.01$, register count has little effect on error, which stays very high.

Just eyeballing this, the lesson I get is that computing intersection cardinality with small error (relative to the true value) is difficult and only works within certain constraints. Specifically,

1. $\frac{|A|}{|B|} < 100$, and
2. $overlap(A, B) = \frac{|A \cap B|}{min(|A|, |B|)} \ge 0.05$.

The intuition behind this is very simple: if the error of any one of the terms in your calculation is roughly as large as the true value of the result then you’re not going to estimate that result well. Let’s look back at the intersection cardinality formula. The left-hand side (that we are trying to estimate) is a “small” value, usually. The three terms on the right tend to be “large” (or at least “larger”) values. If any of the “large” terms has error as large as the left-hand side we’re out of luck.

So, let’s say you can compute the cardinality of an HLL with relative error of a few percent. If $|H_{A}|$ is two orders of magnitude smaller than $|H_{B}|$, then the error alone of $|H_{B}|$ is roughly as large as $|A|$.

$|A \cap B| \le |A|$ by definition, so

$|A \cap B| \le |A| \approx |H_{A}| \approx \Delta |H_{B}|$.

In the best scenario, where $A \cap B = A$, the errors of $|H_{B}|$ and $|H_{A \cup B}| \approx |H_{B}|$ are both roughly the same size as what you’re trying to measure. Furthermore, even if $|A| \approx |B|$ but the overlap is very small, then $|A \cap B|$  will be roughly as large as the error of all three right-hand terms.

### On the bubble

Let’s throw out the permutations whose error bounds clearly don’t support “directionally correct” answers ($overlap < 0.01$ and $\frac{|A|}{|B|} > 500$) and those that trivially do ($overlap > 0.4$) so we can focus more closely on the observations that are “on the bubble”. Sadly, these plots exhibit a good deal of variance inherent in their smaller sample size. Ideally we’d have tens of thousands of runs of each combination, but for now this rough sketch will hopefully be useful. (Apologies for the inconsistent colors between the two plots. It’s a real bear to coordinate these things in R.) Again, please click through for a larger, clearer image.

By doubling the number of registers, the variance of the relative error falls by about a quarter and moves the median relative error down (closer to zero) by 10-20 points.

In general, we’ve seen that the following cutoffs perform pretty well, practically. Note that some of these aren’t reflected too clearly in the plots because of the smaller sample sizes.

Register Count Data Structure Size Overlap Cutoff Cardinality Ratio Cutoff
8192 5kB 0.05 10
16384 10kB 0.05 20
32768 20kB 0.05 30
65536 41kB 0.05 100

### Error Estimation

To get a theoretical formulation of the error envelope for intersection, in terms of the two set sizes and their overlap, I tried the first and simplest error propagation technique I learned. For variables $Y, Z, ...$, and $X$ a linear combination of those (independent) variables, we have

$\Delta X = \sqrt{ (\Delta Y)^2 + (\Delta Z)^2 + ...}$

Applied to the inclusion-exclusion formula:

$\begin{array}{ll} \displaystyle \Delta |H_{A \cap B}| &= \sqrt{ (\Delta |H_{A}|)^2 + (\Delta |H_{B}|)^2 + (\Delta |H_{A \cup B}|)^2} \\ &= \sqrt{ (\sigma\cdot |A|)^2 + (\sigma\cdot |B|)^2 + (\sigma\cdot |A \cup B|)^2} \end{array}$

where

$\sigma = \frac{1.04}{\sqrt{m}}$ as in section 4 (“Discussion”) of the HLL paper.

Aside: Clearly $|H_{A \cup B}|$ is not independent of $|H_{A}| + |H_{B}|$, though $|H_{A}|$ is likely independent of $|H_{B}|$. However, I do not know how to a priori calculate the covariance in order to use an error propagation model for dependent variables. If you do, please pipe up in the comments!

I’ve plotted this error envelope against the relative error of the observations from HLLs with 8192 registers (approximately 5kB data structure).

Despite the crudeness of the method, it provided a 95% error envelope for the data without significant differences across cardinality ratio or $overlap$. Specifically, at least 95% of observations satisfied

$(|H_{A \cap B}| - |A \cap B|) < \Delta |H_{A \cap B}|$

However, it’s really only useful in the ranges shown in the table above. Past those cutoffs the bound becomes too loose and isn’t very useful.

This is particularly convenient because you can tune the number of registers you need to allocate based on the most common intersection sizes/overlaps you see in your application. Obviously, I’d recommend everyone run these tests and do this analysis on their business data, and not on some contrived setup for a blog post. We’ve definitely seen that we can get away with far less memory usage than expected to successfully power our features, simply because we tuned and experimented with our most common use cases.

### Next Steps

We hope to improve the intersection cardinality result by finding alternatives to the inclusion-exclusion formula. We’ve tried a few different approaches, mostly centered around the idea of treating the register collections themselves as sets, and in my next post we’ll dive into those and other failed attempts!

### Appendix A: A Review Of Sets

Let’s say we have two streams of user ids, $S_{A}$ and $S_{B}$. Take a snapshot of the unique elements in those streams as sets and call them $A$ and $B$. In the standard notation, we’ll represent the cardinality, or number of elements, of each set as $|A|$ and $|B|$.

Example: If $A = \{1,2,10\}$ then $|A| = 3$.

If I wanted to represent the unique elements in both of those sets combined as another set I would be performing the union, which is represented by $A \cup B$.

Example: If $A = \{1,2,3\}, B=\{2,3,4\}$ then $A \cup B = \{1,2,3,4\}$.

If I wanted to represent the unique elements that appear in both $A$ and $B$ I would be performing the intersection, which is represented by $A \cap B$.

Example: With $A, B$ as above, $A \cap B = \{2,3\}$.

The relationship between the union’s cardinality and the intersection’s cardinality is given by the inclusion-exclusion principle. (We’ll only be looking at the two-set version in this post.) For reference, the two-way inclusion-exclusion formula is $|A \cap B| = |A| + |B| - |A \cup B|$.

Example: With $A, B$ as above, we see that $|A \cap B| = 2$ and $|A| + |B| - |A \cup B| = 3 + 3 - 4 = 2$.

For convenience we’ll define the $overlap$ between two sets as $overlap(A, B) := \frac{|A \cap B|}{min(|A|, |B|)}$.

Example: With $A, B$ as above, $overlap(A,B) = \frac{|A \cap B|}{min(|A|, |B|)} = \frac{2}{min(3,3)} = \frac{2}{3}$.

Similarly, for convenience, we’ll define the cardinality ratio $\frac{max(|A|, |B|)}{min(|A|, |B|)}$ as a shorthand for the relative cardinality of the two sets.

The examples and operators shown above are all relevant for exact, true values. However, HLLs do not provide exact answers to the set cardinality question. They offer estimates of the cardinality along with certain error guarantees about those estimates. In order to differentiate between the two, we introduce HLL-specific operators.

Consider a set $A$. Call the HLL constructed from this set’s elements $H_{A}$. The cardinality estimate given by the HLL algorithm for $H_{A}$ is $|H_{A}|$.

Define the union of two HLLs $H_{A} \cup H_{B} := H_{A \cup B}$, which is also the same as the HLL created by taking the pairwise max of $H_{A}$‘s and $H_{B}$‘s registers.

Finally, define the intersection cardinality of two HLLs in the obvious way: $|H_{A} \cap H_{B}| := |H_{A}| + |H_{B}| - |H_{A \cup B}|$. (This is simply the inclusion-exclusion formula for two sets with the cardinality estimates instead of the true values.)

### Appendix B: A (Very Brief) Review of Error

The simplest way of understanding the error of an estimate is simply “how far is it from the truth?”. That is, what is the difference in value between the estimate and the exact value, also known as the absolute error.

However, that’s only useful if you’re only measuring a single thing over and over again. The primary criteria for judging the utility of HLL intersections is relative error because we are trying to measure intersections of many different sizes. In order to get an apples-to-apples comparison of the efficacy of our method, we normalize the absolute error by the true size of the intersection. So, for some observation $\hat{x}$ whose exact value is non-zero $x$, we say that the relative error of the observation is $\frac{x-\hat{x}}{x}$. That is, “by what percentage off the true value is the observation off?”

Example: If $|A| = 100, |H_{A}| = 90$ then the relative error is $\frac{100 - 90}{100} = \frac{10}{100} = 10\%$.

## K-Minimum Values: Sketching Error, Hash Functions, and You

### Introduction

“All known efficient cardinality estimators rely on randomization, which is ensured by the use of hash functions.”
–Flajolet, et al

Recalling the KMV algorithm Matt presented in his last post, one will note that every stream element is passed to a hash function as part of the processing step. This is meant to transform the data being operated on from its native distribution into something uniformly distributed. Unfortunately, we don’t live in a perfect world, and since all of the algorithm’s analysis assumes that this hash function does its job well, we wanted to get some sense of how it behaves under less friendly conditions. The first half of this post will investigate the algorithm’s performance when we artificially introduce bias, and the second half will look at its behavior with a handful of real hash functions.

### A Simple Error Model

The first hash function error model that came to mind is admittedly unrealistic and ham-fisted, but hopefully illustrative. Suppose you have a stream of fixed sized, an ideal hash function, and from these you produce a distinct value estimate using the KMV algorithm. Now suppose that for some unlucky reason, one bit from your hash function is stuck; it’s always a zero or a one, but the other 31 bits are free of this curse.

There’s nothing to stop you from computing a distinct value estimate using this janky hash with KMV, but your intuition suggests that it shouldn’t be very good. We went through this exact process with various choices of k, using a random number generator to simulate a perfect hash function.

Before we look at the data, let’s think about what we should expect. From the perspective of KMV, it shouldn’t make a whole lot of difference if your kth smallest element is odd or even (for instance, in a case where the lowest order bit always/never set, respectively). It does, however, make a difference if you’re actually incapable of seeing values smaller than 231, which is what happens when the highest order bit is always set. Thus, in both the 0-biasing and 1-biasing cases, we should expect that higher order bits have a much more dramatic effect on error than lower order bits.

Notice how the performance degradation follows two different patterns. When we are fixing bits as ones, the observed error increases fairly smoothly, and tends to result in under estimates. In contrast, setting bits to zeros results in no change until the error increases producing catastrophic over estimates. Additionally, larger values of k have protective effects against these biases.

### A Somewhat Less Simple Error Model

Now that we have some intuition for the problem, let’s get a little more subtle. Instead of always setting the nth bit as a 0 or 1, let’s add a probabilistic element. We’ll do the same experiment as before, except we will now fix the nth bit with probability p. Thus, when p = 0 we have a perfectly well behaved KMV, and when p = 1, we have the experiment we just finished discussing. In the following diagram, each tile represents the average error across several experiments in which a stream of 1,000,000 unique elements was fed to a KMV sketch (k = 1024) which was rigged to modify the nth bit with probability p.

Many of the same lessons can be seen here — high order bits matter more, downward biasing degrades performance sharply, upward biasing degrades more smoothly. Additionally, as we’d expect, within a given bit, more bias means more error.

### Send in the Hash Functions

All of the experiments to this point have involved using a random number generator instead of hashing real data. I think it’s time that we took a look at what happens when we drop in a few real hash functions with real data. For the following experiments, I’m using four 32-bit hashes — Murmur3, SDBM, Arash Partow’s hash, and one of the old Donald Knuth hashes. You may recall these from our series on choosing a good hash function (although 64-bit versions were used there). I chose four text corpuses:

• Romeo and Juliet, stripped of all punctuation and converted to lower case (3794 words)
• /usr/share/dict/words (99171 words)
• 1,000,000 random 12 character long strings, each sharing the same suffix: “123456”
• 1,000,000 random 12 character long strings, each sharing the same prefix: “123456”

Using formulas from this paper, we can compute the relative error that 99% of KMV estimates should theoretically fall within. This turns out to depend on k and the stream size.

To make these pictures, I chose random values of k within each hash/document pair at which I evaluate the cardinality estimate and compute the relative error. The lines are linear interpolations between sampled points and are shown solely for clarity. The y-axis scale is adjusted on a per-picture basis to best display the theoretical envelope within which we expect our errors to lie.

Now that we’ve gotten through all the necessary preamble, let’s take a look at the results!

One picture in and we’ve already learned a lesson: choice of hash function seriously matters! SDBM and DEK cause the algorithm to perform well below its capabilities. DEK’s error is actually off the charts for most of this graph, which is why it does not appear until k > 3,000.

On a bigger corpus with tighter theoretical error bounds, Murmur3 and AP are still doing quite well. Do note, however, that AP dips outside the envelope for a while at k = 70,000 or so.

With the random strings, SDBM performs much better than it did on English words. DEK, however, is still hopeless. It’s a little tough to see on these pictures, but at high k, AP starts to fall off the wagon, and even Murmur3 dips outside the envelope, though not beyond what we’d expect, statistically speaking. Honestly, I was hoping for some fireworks here, but they didn’t materialize. I was wondering if we might see some hashes break on one version of these strings, and do fine on the other due to the location of the varying key bits (high order/low order). Sadly, that didn’t happen, but a negative result is a result none the less.

To summarize these, I made the following table, which shows us the percentage of time that an one of the samples falls outside the theoretical envelope. In this view, Murmur3’s superiority is clear.

AP DEK Murmur3 SDBM
Romeo and Juliet 0.00% 100.00% 0.00% 61.54%
/usr/share/dict/words
10.76% 100.00% 0.00% 68.46%
Common Suffix 7.22% 99.11% 1.10% 0.27%
Common Prefix 3.33% 100.00% 0.22% 0.0001%

### Fin

KMV is a very nice little algorithm that is incredibly simple to understand, implement, and use. That said, if you’re going to make use of it, you really do need to practice some due diligence when choosing your hash function. With packages like smhasher available, trying out multiple hash functions is a cinch, and a little legwork at the start of a project can save you from confusion and despair later on!

## Big Data Ain’t Fat Data: A Case Study

We’ve always had a hunch that our users stick to the same geographic region. Sure, there’s the occasional jet-setter that takes their laptop from New York to Los Angeles (or like Rob, goes Chicago to San Francisco) on a daily or weekly basis, but they’re the exception and not the rule. Knowing how true this is can simplify the way we work with user-centric data across multiple data centers.

When Rob asked me to find this out for sure, my first instinct was to groan and fire up Hive on an Elastic MapReduce cluster, but after a second, I heard Matt’s voice in my head saying, “Big Data isn’t Fat Data”. Why bother with Hadoop?

#### The Setup

If I was solving this problem on a small data-set, it’d be pretty straight-forward. I could write a Python script in about 10 minutes that would take care of the problem. It would probably look something like:

users = {}

for line in sys.stdin:
user, data_center = parse(line)
try:
users[user].append(data_center)
except KeyError:
users[user] = [data_center]

total_users = len(users)
multiple_dc_users = len([u for u in users if len(users[u]) > 1])


Easy peasy. However, explicitly storing such a large hash-table gets a little problematic once you start approaching medium-sized data (1GB+). Your memory needs grow pretty rapidly – with M users and N data centers, storage is O(MN) – , and things start to get a little slow in Python. At this point there are two options. You can brute force the problem by throwing hardware at it, either with a bigger machine or with something like Hadoop. Or, we can put on our Computer Science and Statistics hats and get a little bit clever.

What if we turn the problem sideways? Above, we’re keeping a hash table that holds a set of data-center for each user. Instead, let’s keep a set of users per data-center, splitting the problem up into multiple hash tables. This lets us keep a small, fixed number of tables – since I’d hope any company knows exactly how many data centers they have – and spread the load across them, hopefully making the load on each table more tolerable. We can then check how many sets each user falls into, and call it a day.

data_centers = dict([(dc, set()) for dc in AK_DATA_CENTERS])

for line in sys.stdin:
user, data_center = parse(line)

# Get the total users by intersecting all of the data center sets
...

# Get all users who are in exactly one set by taking symmetric differences (XOR) of data-center sets
# and count the size of that set.
...


While this approach theoretically has better performance with the same O(MN) space requirements, with big enough data the space requirements of the problem totally dominate whatever improvement this approach would provide. In other words, it doesn’t matter how small each hash table is, you can’t fit 80GB of user IDs into the 8GB of RAM on your laptop.

It’s looking pretty bleak for the Clever Way of doing things, since what we really want is a magic hash table that can store our 80GB of user IDs in the memory on our laptops.

#### Bloom Filters

Enter Bloom Filters. A bloom filter is a fixed-size set data structure with two minor features/drawbacks:

1. You can never ask a Bloom Filter for the set of elements it contains.
2. Membership queries have a small, controllable, false-positive probability. Bloom filters will never return false negatives.

With a little bit of work, it’s pretty easy to substitute Bloom Filters for plain old hash tables in our sideways approach above. There’s a slight tweak we have to make to our algorithm to accommodate the fact that we can’t ever query a bloom filter for the elements it contains, but the idea remains the same.

#### The Payoff

Suppose now we’re keeping a bloom-filter of users per data center. The only thing we have to work around is the fact that we’ll never be able to recover the list of users we’ve added to each set. So, we’ll just deal with users each time we see them instead of deferring our counting to the end.

With that idea in the bag, there are really only a few things to worry about when a request comes in for a given data center.

• Check the bloom filter for that data center to see if the user has been to that one before
• Check the other bloom filters to see how many other data-centers that user has been to before
• Count the number of total data-centers that user has seen before. If the user is new to this data center, and the user has seen exactly one other data center before, increment the multiple data center user counter
• If the user has never seen any of your data centers before, that user is a completely new user. Increment the total number of users seen.
• If the user has already seen this data-center, this user is a repeat. Do nothing!

We ran our version of this overnight. It took us one core, 8GB of RAM, and just under than 4 hours to count the number of users who hit multiple data centers in a full week worth of logs.