## Sketch of the Day: Frugal Streaming

We are always worried about the size of our sketches. At AK we like to count stuff, and if you can count stuff with smaller sketches then you can count more stuff! We constantly have conversations about how we might be able to make our sketches, and subsequently our datastores, smaller. During our science summit, Muthu pointed us at some of the new work in Frugal Streaming. The concept of Frugal Streaming is to process your data as it comes, O(N), but severely limit the amount of memory you are using for your sketch, and by “severely” they mean using perhaps one or two pieces of information. Needless to say, we were interested!

### History

The concept of Frugal Streaming reminds me of an algorithm for finding the dominant item in a stream, MJRTY written by Boyer and Moore in 1980. (This paper has a very interesting history). The MJRTY algorithm sets out to solve the problem of finding the majority element in a stream (an element comprising more than 50% of the stream). Moore proposed to solve this by using only 2 pieces of information and a single scan of the data.

Imagine you have a stream of names (“matt”, “timon”, “matt”, “matt”, “rob”, “ben”, …) and you wanted to know if any name appeared in more than half the stream. Boyer and Moore proposed the following:

count = 0
majority = ""

for val in stream:
if count == 0:
majority = val
count = 1
elif val == majority:
count += 1
else:
count -= 1

print majority if count > 0 else "no majority!"


If you’re anything like me you will go through a few phases: “That can’t work!”, “Wait, that works?!”, “Nah, this data will break that”, “Holy sh*t that works!!”. If you think about it for a minute you realize that it HAS to work. If any element in the stream comprises more than half the stream values there is no way to get to the end and have a counter of zero. To convince yourself suppose the majority element only comprises the first half + 1 of your N-length stream. The counter would count up to $N/2+1$ and then start decrementing until all N values have been seen, which would leave the counter at $2 = (N/2+1) - (N/2-1)$*. This will hold regardless of the ordering of the elements in the stream. A simple simulation is provided by the authors. Philippe Flajolet apparently “liked this algorithm very much and called it the ‘gang war’, because in his mind, every element is a gang member, and members of opposite gangs are paired in a standoff, and shoot each other. The gang members remaining are the more numerous”**.

The astute among you will have noticed that this algorithm only works if there is, in fact, a majority element. If there is not then it will fail. A stream such as {“matt”,”matt”,”timon”,”timon”,”rob”} would result in the algorithm returning “rob”. In practice you need ways of ensuring that your stream does indeed have a majority element or have a guarantee ahead of time.

* Note, that formula is for an even length stream. For a stream of odd length the counter will indeed be at 1. Proof is left to the reader.

** Thanks to Jeremie Lumbroso for his insightful feedback on the history of MJRTY and his memory of Philippe’s explanation.

### One “bit” – Frugal-1U

In their Frugal Streaming paper, Qiang and Muthu decided to see if they could find a frugal solution to the streaming quantile problem. The streaming quantiles problem is one I enjoy quite a bit and I have used it as an interview question for potential candidates for some time. Simply stated it is: “How would you find the quantiles of a stream of numbers in O(N) with limited memory?” There are a few different approaches to solve this problem, with the most widely known probably being Count-Min Sketch. However, with Count-Min you need to keep your keys around in order to query the sketch. There are other approaches to this question as well.

Instead of focusing on quantiles generally, Qiang and Muthu’s first solution is a frugal way to find the median of a stream. As with MJRTY above, I’ll just write it down and see how you react:

median_est = 0
for val in stream:
if val > median_est:
median_est += 1
elif val < median_est:
median_est -= 1


Granted, the above is just for the median, where the stream is much longer than the value of the median, but it is so simple that I don’t think I would have ever considered this approach to the problem. The extension to quantiles is equally as shocking. If you are trying to find the 75th percentile of the data stream you do the same as above but increment up randomly 75% of the time and decrement down randomly 25% of the time:


quantile_75 = 0
for val in stream:
r = random()
if val > quantile_75 and r > 1 - 0.75:
quantile_75 += 1
elif val < quantile_75 and r > 0.75:
quantile_75 -= 1


As the paper states:

The trick to generalize median estimation to any $\frac{h}{k}$ -quantile estimation is that not every stream item seen will cause an update. If the current stream item is larger than estimation, an increment update will be triggered only with probability $\frac{h}{k}$. The rationale behind it is that if we are estimating $\frac{h}{k}$ -quantile, and if the current estimate is at stream’s true $\frac{h}{k}$ -quantile, we will expect to see stream items larger than the current estimate with probability $1-\frac{h}{k}$ .

### Finding Quantiles With Two “bits”- Frugal-2U

There are a few obvious drawbacks to the above algorithm. Since we are only incrementing by 1 each time, the time to converge is linear and our initial guess of zero could be very bad. Secondly, and by design, the algorithm has no memory, can fluctuate wildly and, as they show in the paper, the estimates can drift very far away. (Note: while it is extremely unlikely that the estimates will drift far from the correct values the paper has some very loose bounds on how bad it can drift. See Lemma 3 in the paper.) They suggest a few improvements over Frugal-1U where you essentially include a varying step (instead of just incrementing by 1 every time) and 1 “bit” so you know which way you incremented in the last update. The intuition here is that if you have been consistently incrementing up or down for the last few elements of a stream then you are probably “far” away from the quantile in question. If this is the case we can speed up convergence time by incrementing a larger amount. The Frugal-2U algorithm:

def frugal_2u(stream, m = 0, q = 0.5, f = constantly_one):
step, sign = 1, 1

for item in stream:
if item > m and random() > 1 - q:
# Increment the step size if and only if the estimate keeps moving in
# the same direction. Step size is incremented by the result of applying
# the specified step function to the previous step size.
step += f(step) if sign > 0 else -1 * f(step)
# Increment the estimate by step size if step is positive. Otherwise,
# increment the step size by one.
m += step if step > 0 else 1
# Mark that the estimate increased this step
sign = 1
# If the estimate overshot the item in the stream, pull the estimate back
# and re-adjust the step size.
if m > item:
step += (item - m)
m = item
# If the item is less than the stream, follow all of the same steps as
# above, with signs reversed.
elif item < m and random() > q:
step += f(step) if sign < 0 else -1 * f(step)
m -= step if step > 0 else 1
sign = -1
if m < item:
step += (m - item)
m = item
# Damp down the step size to avoid oscillation.
if (m - item) * sign < 0 and step > 1:
step = 1



You can play around with the 1U and 2U algorithms in the simulation below.

Click above to run the Frugal Streaming simulation

As usual, there are a few interesting tidbits as well. If you read the paper you will see that they define the updates to step as a function. This means they are allowing many different types of increments to step. For example, instead of increasing the size of step by 1 each time we could increase it by 10 or even have it increase multiplicatively. They do talk about some uses of different updates to step but there is no analysis around this (yet) and they restrict all of the work in the paper to step increments of 1. We offer a few different step update functions in the simulation and they indeed do interesting things. Further exploration is definitely needed to get some insights here.

A non-obvious thing about the step variable is how it behaves under decrements. My initial thought was that step would get large if your current estimate was far below the actual value (thus allowing you to approach it faster from below) and that step would get to be a large negative number if your current estimate was very far above the actual value. This turns out to just be flat wrong. The negative updates to step have the effect of stabilizing the estimate (notice when step is negative that the updates to your estimates are always ± 1 ). If you read the algorithm closely you will see that step decrements when you consistently alternate up and down updates. This behavior occurs when the estimate is close to the actual value which causes step to become a very large negative number. And I mean VERY large. In practice we have seen this number as small as $-10^{102}$ for some simulations.

### Monitoring

One of the first things I thought of when I saw this sketch was to use it as a monitoring tool. Specifically, perhaps it could be used to replace the monitoring we use on our application server response times. It turns out that using this sketch for monitoring introduces some very interesting issues. So far, we have mostly talked about what I will call “static streams”. These are streams that have data in them which is pulled consistently from some static underlying distribution (as in the examples above). However, what if the underlying distribution changes? For example, what if one of our servers all of the sudden starts responding with slower response times? Does this frugal sketch enable you to quickly figure out that something has broken and fire off an alarm with high confidence? Well, in some ways this is an identical problem to cold start: how long does it take for an initial guess to reach a stable quantile? Unfortunately, there is no easy way to know when you have reached “equilibrium” and this makes identifying when an underlying distribution change has occurred difficult as well. The paper ends with an open challenge:

… as our experiments and insights indicate, frugal streaming algorithms work with so little memory of the past that they are adaptable to changes in the stream characteristics. It will be of great interest to understand this phenomenon better.

The paper shows some interesting experiments with changing data, but I agree with Qiang: we need to understand these algorithms better before I swap out all of our server monitoring tools (and our ops team would kill me). However, since these are so simple to implement and so small, we can easily deploy a few tests and see how the results compare “in the field” (you can play around with this by changing the underlying stream distribution in our simulation above.)

### Conclusion

The frugal quantile algorithms proposed in the paper are fascinating. It is a very interesting turn in the sketching literature and Qiang and Muthu’s creativity really comes across. I am very interested in getting some real world uses out of this sketch and am excited to see what other applications we (and Qiang!) can think of.  Many thanks to MuthuQiang Ma and Jeremie Lumbroso for all their help!

### Appendix

• Variability: While the bounds on the accuracy of these algorithms seem wide to me, clearly in real world experiments we see much better performance than the guaranteed bounds in the paper. In fact, the error bounds in the paper are dependent on the size of the stream and not fixed.
• Size of step: A potential gotcha is the size of the step variable. Depending on your update function it indeed seems possible for this value to get below -MAXINT. Clearly a real implementation would need some error checking.
• Cold Start: One more gotcha is that you have no real way of knowing when you are near the quantile in question. For instance, starting your estimate at zero and measuring a true median which is 100,000,000,000 will take a long time to “catch up” to this value. There are a few ways to limit this, some of which are discussed in the paper. One way is to try and make a sane guess up front (especially if you know something about the distribution) and another is to start your guess with the value from the last counter. For instance, suppose you are in a monitoring situation and you are computing means over the course of a day. It would make sense to start tomorrow’s estimate with yesterdays final value.
• Accuracy:  And, lastly, there is some interesting dependence on “atomicity”. Meaning, the estimates in some sense depend on how “large” the actual values are. In both, your minimum step size is 1. If my median in the stream is, say, 6 then this “atomic” update of 1 causes high relative fluctuation. It seems in real world examples you would like to balance the size of the thing you are estimating with the speed of approach of the algorithm. This could lead you to estimate latencies in microseconds rather than milliseconds for instance. All of this points to the fact that there are a bunch of real world engineering questions that need to be answered and thought about before you actually go and implement a million frugal sketches throughout your organization.

## HyperLogLog++: Google’s Take On Engineering HLL

Matt Abrams recently pointed me to Google’s excellent paper “HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm” [UPDATE: changed the link to the paper version without typos] and I thought I’d share my take on it and explain a few points that I had trouble getting through the first time. The paper offers a few interesting improvements that are worth noting:

1. Move to a 64-bit hash function
2. A new small-cardinality estimation regime
3. A sparse representation

I’ll take a look at these one at a time and share our experience with similar optimizations we’ve developed for a streaming (low latency, high throughput) environment.

### 32-bit vs. 64-bit hash function

I’ll motivate the move to a 64-bit hash function in the context of the original paper a bit more since the Google paper doesn’t really cover it except to note that they wanted to count billions of distinct values.

#### Some math

In the original HLL paper, a 32-bit hash function is required with the caveat that measuring cardinalities in the hundreds of millions or billions would become problematic because of hash collisions. Specifically, Flajolet et al. propose a “large range correction” for when the estimate $E$ is greater than $2^{32}/30$.  In this regime, they replace the usual HLL estimate by the estimate

$\displaystyle E^* := -2^{32} \mbox{log}\Big(1 - \frac{E}{2^{32}}\Big)$.

This reduces to a simple probabilistic argument that can be modeled with balls being dropped into bins. Say we have an $L$-bit hash. Each distinct value is a ball and each bin is designated by a value in the hash space.  Hence, you “randomly” drop a ball into a bin if the hashed value of the ball corresponds to the hash value attached to the bin.  Then, if we get an estimate $E$ for the cardinality, that means that (approximately) $E$ of our bins have values in them, and so there are $2^L - E$ empty bins.  The number of empty bins should be about $2^L e^{ - n/2^{L} }$, where $n$ is the number of balls.  Hence $2^{L} - E = 2^{L} e^{-n/2^{L}}$.  Solving this gives us the formula he recommends using: $-2^L \log(1 - \frac{E}{2^{L}})$.

Aside:  The empty bins expected value comes from the fact that

$E(\# \text{ of empty bins}) = m(1 - \frac{1}{m})^{n}$,

where $m$ is the number of bins and $n$ the number of balls.  This is pretty quick to show by induction.  Hence,

$\displaystyle E(\#\text{ of empty bins}) \sim m e^{-n/m}$ as $n \rightarrow \infty$.

Again, the general idea is that the $E$ ends up being some number smaller than $n$ because some of the balls are getting hashed to the same value.  The correction essentially doesn’t do anything in the case when $E$ is small compared to $2^L$ as you can see here. (Plotted is $-\log(1 - x)$, where $x$ represents $E / 2^L$, against the line $y = x$. The difference between the two graphs represents the difference between $E$ and $n$.)

#### A solution and a rebuttal

The natural move to start estimating cardinalities in the billions is to simply move to a larger hash space where the hash collision probability becomes negligibly small. This is fairly straightforward since most good hash functions give you at least 64-bits of entropy these days and it’s also the size of a machine word. There’s really no downside to moving to the larger hash space, from an engineering perspective. However, the Google researchers go one step further: they increase the register width by one bit (to 6), as well, ostensibly to be able to support the higher possible register values in the 64-bit setting. I contend this is unnecessary. If we look at the distribution of register values for a given cardinality, we see that it takes about a trillion elements before a 5-bit register overflows (at the black line):

The distributions above come from the LogLog paper, on page 611, right before formula 2. Look for $\mathbb{P}_{\nu}(M = k)$.

Consider the setting in the paper where $p = \log_2(m) = 14$. Let’s says we wanted to safely count into the 100 billion range. If we have $L = p + (2^5 - 1) = 14 + 31 = 45$ then our new “large range correction” boundary is roughly one trillion, per the adapted formula above. As you can see from the graph below, even at $p = 10, L = 41$ the large range correction only kicks in at a little under 100 billion!

The black line is the cutoff for a 5-bit register, and the points are plotted when the total number of hash bits required reaches 40, 50, and 60.

The real question though is all this practically useful? I would argue no: there are no internet phenomena that I know of that are producing more than tens of billions of distinct values, and there’s not even a practical way of empirically testing the accuracy of HLL at cardinalities above 100 billion. (Assuming you could insert 50 million unique, random hashed values per second, it would take half an hour to fill an HLL to 100 billion elements, and then you’d have to repeat that 5000 times as they do in the paper for a grand total of 4 months of compute time per cardinality in the range.)

[UPDATE: After talking with Marc Nunkesser (one of the authors) it seems that Google may have a legitimate need for both the 100 billion to trillion range right now, and even more later, so I retract my statement! For the rest of us mere mortals, maybe this section will be useful for picking whether or not you need five or six bits.]

At AK we’ve run a few hundred test runs at 1, 1.5, and 2 billion distinct values under the $p = 10-14, L = 41-45$ configuration range and found the relative error to be identical to that of lower cardinalities (in the millions of DVs). There’s really no reason to inflate the storage requirements for large cardinality HLLs by 20% simply because the hash space has expanded. Furthermore, you can do all kinds of simple tricks like storing an offset as metadata (which would only require at most 5 bits) for a whole HLL and storing the register values as the difference from that base offset, in order to make use of a larger hash space.

### Small Cardinality Estimation

Simply put, the paper introduces a second small range correction between the existing one (linear counting) and the “normal” harmonic mean estimator ($E$ in the original paper) in order to eliminate the “large” jump in relative error around the boundary between the two estimators.

They empirically calculate the bias of $E$ and create a lookup table for various $p$, for 200 values less than $5 \cdot 2^p$ with a correction to the overestimate of $E$. They interpolate between the 200 reference points to determine the correction to apply for any given raw $E$ value. Their plots give compelling evidence that this bias correction makes a difference in the $m$ to $5m$ cardinality range (cuts 95th percentile relative error from about 2% to 1.2%).

I’ve been a bit terse about this improvement since sadly it doesn’t help us at AK much because most of our data is Zipfian. Few of our reporting keys live in the narrow cardinality range they are optimizing: they either wallow in the linear counting range or shoot straight up into the normal estimator range.

Nonetheless, if you find you’re doing a lot of DV counting in this range, these corrections are pretty cheap to implement (as they’ve provided numerical values for all the cutoffs and bias corrections in the appendix.)

### Sparse representation

The general theme of this optimization isn’t particularly new (our friends at MetaMarkets mentioned it in this post): for smaller cardinality HLLs there’s no need to materialize all the registers. You can get away with just materializing the set registers in a map. The paper proposes a sorted map (by register index) with a hash map off the side to allow for fast insertions. Once the hash map reaches a certain size, its entries are batch-merged into the sorted list, and once the sorted list reaches the size of the materialized HLL, the whole thing is converted to the fully-materialized representation.

Aside: Though the hash map is a clever optimization, you could skip it if you didn’t want the added complexity. We’ve found that the simple sorted list version is extremely fast (hundreds of thousands of inserts per second). Also beware the variability of the the batched sort-and-merge cost every time the hash map repeatedly outgrows its limits and has to be merged into the sorted list. This can add significant latency spikes to a streaming system, whereas a one-by-one insertion sort to a sorted list will be slower but less variable.

The next bit is very clever: they increase $p$ when the HLL is in the sparse representation because of the saved space. Since they’re already storing entries in 32-bit integers, they increase $p$ to $p^{\prime} = 31 - \mbox{regWidth} = 31 - 6 = 25$. (I’ll get to the leftover bit in a second!) This gives them increased precision which they can simply “fold” down when converting from the sparse to fully materialized representation. Even more clever is their trick of not having to always store the full register value as the value of an entry in the map. Instead, if the longer register index encodes enough bits to determine the value, they use the leftover bit I mentioned before to signal as much.

In the diagram, $p$ and $p^{\prime}$ are as in the Google paper, and $q$ and $q^{\prime}$ are the number of bits that need to be examined to determine $\rho$ for either the $p$ or $p^{\prime}$ regime. I encourage you to read section 5.3.3 as well as EncodeHash and DecodeHash in Figure 8 to see the whole thing. [UPDATE: removed the typo section as it has been fixed in the most recent version of the paper (linked at the top)]

The paper also tacks on a difference encoding (which works very well because it’s a sorted list) and a variable length encoding to the sparse representation to further shrink the storage needed, so that the HLL can use the increased register count, $p^{\prime}$, for longer before reverting to the fully materialized representation at $p$. There’s not much to say about it because it seems to work very well, based on their plots, but I’ll note that in no way is that type of encoding suitable for streaming or “real-time” applications. The encode/decode overhead simply takes an already slow (relative to the fully materialized representation) sparse format and adds more CPU overhead.

### Conclusion

The researchers at Google have done a great job with this paper, meaningfully tackling some hard engineering problems and showing some real cleverness. Most of the optimizations proposed work very well in a database context, where the HLLs are either being used as temporary aggregators or are being stored as read-only data, however some of the optimizations aren’t suitable for streaming/”real-time” work. On a more personal note, it’s very refreshing to see real algorithmic engineering work being published from industry. Rob, Matt, and I just got back from New Orleans and SODA / ALENEX / ANALCO and were hoping to see more work in this area, and Google sure did deliver!

### Appendix

Sebastiano Vigna brought up the point that 6-bit registers are necessary for counting above 4 billion in the comments. I addressed it in the original post (see “A solution and a rebuttal“) but I’ll lay out the math in a bit more detail here to show that you can easily count above 4 billion with only 5-bit registers.

If you examine the original LogLog paper (the same as mentioned above) you’ll see that the register distribution for LogLog (and HyperLogLog consequently) registers is

$\displaystyle \mathbb{P}_{\nu}(M > k) = 1 - \mathbb{P}_{\nu}(M \le k) = 1 - \Big(1 - \frac{1}{2^k}\Big)^{\nu}$

where $k$ is the register value and $\nu$ is the number of (distinct) elements a register has seen.

So, I assert that 5 bits for a register (which allows the maximum value to be 31) is enough to count to ten billion without any special tricks. Let’s fix $p=14$ and say we insert $10^{10}$ distinct elements. That means, any given register will see about $\frac{10^{10}}{2^p} = \frac{10^{10}}{2^{14}} = \approx 6.1 \times 10^5$ elements assuming we have a decent hash function. So, the probability that a given register will have a value greater than 31 is (per the LogLog formula given above)

$\displaystyle \mathbb{P}_{\nu}(M > 31) = 1 - \mathbb{P}_{\nu}(M \le 31) = 1 - \Big(1 - \frac{1}{2^{31}}\Big)^{6.1 \times 10^5} \approx 0.00028$

and hence the expected number of registers that would overflow is approximately $2^{14} \times 0.00028 \approx 4.5$. So five registers out of sixteen thousand would overflow. I am skeptical that this would meaningfully affect the cardinality computation. In fact, I ran a few tests to verify this and found that the average number of registers with values greater than 31 was 4.5 and the relative error had the same standard deviation as that predicted by the paper, $1.04/\sqrt{m}$.

For argument, let’s assume that you find those five overflowed registers to be unacceptable. I argue that you could maintain an offset in 5 bits for the whole HLL that would allow you to still use 5 bit registers but exactly store the value of every register with extremely high probability. I claim that with overwhelmingly high probability, every register the HLL used above is greater than 15 and less than or equal to 40. Again, we can use the same distribution as above and we find that the probability of a given register being outside those bounds is

$\mathbb{P}_{\nu}(M < 15) \approx 10^{-162}$ and

$\mathbb{P}_{\nu}(M > 40) \approx 10^{-7}$.

Effectively, there are no register values outside of $[15,40]$. Now I know that I can just store 15 in my offset and the true value minus the offset (which now fits in 5 bits) in the actual registers.

## HLL Intersections

### Why?

The intersection of two streams (of user ids) is a particularly important business need in the advertising industry. For instance, if you want to reach suburban moms but the cost of targeting those women on a particular inventory provider is too high, you might want to know about a cheaper inventory provider whose audience overlaps with the first provider. You may also want to know how heavily your car-purchaser audience overlaps with a certain metro area or a particular income range. These types of operations are critical for understanding where and how well you’re spending your advertising budget.

As we’ve seen before, HyperLogLog provides a time- and memory-efficient algorithm for estimating the number of distinct values in a stream. For the past two years, we’ve been using HLL at AK to do just that: count the number of unique users in a stream of ad impressions. Conveniently, HLL also supports the union operator ( $\cup$ ) allowing us to trivially estimate the distinct value count of any composition of streams without sacrificing precision or accuracy. This piqued our interest because if we can “losslessly” compute the union of two streams and produce low-error cardinality estimates, then there’s a chance we can use that estimate along with the inclusion-exclusion principle to produce “directionally correct” cardinality estimates of the intersection of two streams. (To be clear, when I say “directionally correct” my criteria is “can an advertiser make a decision off of this number?”, or “can it point them in the right direction for further research?”. This often means that we can tolerate relative errors of up to 50%.)

The goals were:

1. Get a grasp on the theoretical error bounds of intersections done with HLLs, and
2. Come up with heuristic bounds around $m$, $overlap$, and the set cardinalities that could inform our usage of HLL intersections in the AK product.

Quick terminology review:

• If I have set of integers $A$, I’m going to call the HLL representing it $H_{A}$.
• If I have HLLs $H_{A}, H_{B}$ and their union $H_{A \cup B}$, then I’m going to call the intersection cardinality estimate produced $|H_{A \cap B}|$.
• Define the $overlap$ between two sets as $overlap(A, B) := \frac{|A \cap B|}{min(|A|, |B|)}$.
• Define the cardinality ratio $\frac{max(|A|, |B|)}{min(|A|, |B|)}$ as a shorthand for the relative cardinality of the two sets.
• We’ll represent the absolute error of an observation $|H_{A}|$ as $\Delta |H_{A}|$.

That should be enough for those following our recent posts, but for those just jumping in, check out Appendices A and B at the bottom of the post for a more thorough review of the set terminology and error terminology.

### Experiment

We fixed 16 $overlap$ values (0.0001, 0.001, 0.01, 0.02, 0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0) and 12 set cardinalities (100M, 50M, 10M, 5M, 1M, 500K, 100K, 50K, 10K, 5K, 1K, 300) and did 100 runs of each permutation of $(overlap, |A|, |B|)$. A random stream of 64-bit integers hashed with Murmur3 was used to create the two sets such that they shared exactly $min(|A|,|B|) \cdot overlap = |A \cap B|$  elements. We then built the corresponding HLLs $H_{A}$ and $H_{B}$ for those sets and calculated the intersection cardinality estimate $|H_{A} \cap H_{B}|$ and computed its relative error.

Given that we could only generate and insert about 2M elements/second per core, doing runs with set cardinalities greater than 100M was quickly ruled out for this blog post. However, I can assure you the results hold for much larger sets (up to the multiple billion-element range) as long as you heed the advice below.

### Results

This first group of plots has a lot going on, so I’ll preface it by saying that it’s just here to give you a general feeling for what’s going on. First note that within each group of boxplots $overlap$ increases from left to right (orange to green to pink), and within each plot cardinality ratio increases from left to right. Also note that the y-axis (the relative error of the observation) is log-base-10 scale. You can clearly see that as the set sizes diverge, the error skyrockets for all but the most similar (in both cardinality and composition) sets. I’ve drawn in a horizontal line at 50% relative error to make it easier to see what falls under the “directionally correct” criteria. You can (and should) click for a full-sized image.

Note: the error bars narrow as we progress further to the right because there are fewer observations with very large cardinality ratios. This is an artifact of the experimental design.

A few things jump out immediately:

• For cardinality ratio > 500, the majority of observations have many thousands of percent error.
• When cardinality ratio is smaller than that and $overlap > 0.4$, register count has little effect on error, which stays very low.
• When $overlap \le 0.01$, register count has little effect on error, which stays very high.

Just eyeballing this, the lesson I get is that computing intersection cardinality with small error (relative to the true value) is difficult and only works within certain constraints. Specifically,

1. $\frac{|A|}{|B|} < 100$, and
2. $overlap(A, B) = \frac{|A \cap B|}{min(|A|, |B|)} \ge 0.05$.

The intuition behind this is very simple: if the error of any one of the terms in your calculation is roughly as large as the true value of the result then you’re not going to estimate that result well. Let’s look back at the intersection cardinality formula. The left-hand side (that we are trying to estimate) is a “small” value, usually. The three terms on the right tend to be “large” (or at least “larger”) values. If any of the “large” terms has error as large as the left-hand side we’re out of luck.

So, let’s say you can compute the cardinality of an HLL with relative error of a few percent. If $|H_{A}|$ is two orders of magnitude smaller than $|H_{B}|$, then the error alone of $|H_{B}|$ is roughly as large as $|A|$.

$|A \cap B| \le |A|$ by definition, so

$|A \cap B| \le |A| \approx |H_{A}| \approx \Delta |H_{B}|$.

In the best scenario, where $A \cap B = A$, the errors of $|H_{B}|$ and $|H_{A \cup B}| \approx |H_{B}|$ are both roughly the same size as what you’re trying to measure. Furthermore, even if $|A| \approx |B|$ but the overlap is very small, then $|A \cap B|$  will be roughly as large as the error of all three right-hand terms.

### On the bubble

Let’s throw out the permutations whose error bounds clearly don’t support “directionally correct” answers ($overlap < 0.01$ and $\frac{|A|}{|B|} > 500$) and those that trivially do ($overlap > 0.4$) so we can focus more closely on the observations that are “on the bubble”. Sadly, these plots exhibit a good deal of variance inherent in their smaller sample size. Ideally we’d have tens of thousands of runs of each combination, but for now this rough sketch will hopefully be useful. (Apologies for the inconsistent colors between the two plots. It’s a real bear to coordinate these things in R.) Again, please click through for a larger, clearer image.

By doubling the number of registers, the variance of the relative error falls by about a quarter and moves the median relative error down (closer to zero) by 10-20 points.

In general, we’ve seen that the following cutoffs perform pretty well, practically. Note that some of these aren’t reflected too clearly in the plots because of the smaller sample sizes.

Register Count Data Structure Size Overlap Cutoff Cardinality Ratio Cutoff
8192 5kB 0.05 10
16384 10kB 0.05 20
32768 20kB 0.05 30
65536 41kB 0.05 100

### Error Estimation

To get a theoretical formulation of the error envelope for intersection, in terms of the two set sizes and their overlap, I tried the first and simplest error propagation technique I learned. For variables $Y, Z, ...$, and $X$ a linear combination of those (independent) variables, we have

$\Delta X = \sqrt{ (\Delta Y)^2 + (\Delta Z)^2 + ...}$

Applied to the inclusion-exclusion formula:

$\begin{array}{ll} \displaystyle \Delta |H_{A \cap B}| &= \sqrt{ (\Delta |H_{A}|)^2 + (\Delta |H_{B}|)^2 + (\Delta |H_{A \cup B}|)^2} \\ &= \sqrt{ (\sigma\cdot |A|)^2 + (\sigma\cdot |B|)^2 + (\sigma\cdot |A \cup B|)^2} \end{array}$

where

$\sigma = \frac{1.04}{\sqrt{m}}$ as in section 4 (“Discussion”) of the HLL paper.

Aside: Clearly $|H_{A \cup B}|$ is not independent of $|H_{A}| + |H_{B}|$, though $|H_{A}|$ is likely independent of $|H_{B}|$. However, I do not know how to a priori calculate the covariance in order to use an error propagation model for dependent variables. If you do, please pipe up in the comments!

I’ve plotted this error envelope against the relative error of the observations from HLLs with 8192 registers (approximately 5kB data structure).

Despite the crudeness of the method, it provided a 95% error envelope for the data without significant differences across cardinality ratio or $overlap$. Specifically, at least 95% of observations satisfied

$(|H_{A \cap B}| - |A \cap B|) < \Delta |H_{A \cap B}|$

However, it’s really only useful in the ranges shown in the table above. Past those cutoffs the bound becomes too loose and isn’t very useful.

This is particularly convenient because you can tune the number of registers you need to allocate based on the most common intersection sizes/overlaps you see in your application. Obviously, I’d recommend everyone run these tests and do this analysis on their business data, and not on some contrived setup for a blog post. We’ve definitely seen that we can get away with far less memory usage than expected to successfully power our features, simply because we tuned and experimented with our most common use cases.

### Next Steps

We hope to improve the intersection cardinality result by finding alternatives to the inclusion-exclusion formula. We’ve tried a few different approaches, mostly centered around the idea of treating the register collections themselves as sets, and in my next post we’ll dive into those and other failed attempts!

### Appendix A: A Review Of Sets

Let’s say we have two streams of user ids, $S_{A}$ and $S_{B}$. Take a snapshot of the unique elements in those streams as sets and call them $A$ and $B$. In the standard notation, we’ll represent the cardinality, or number of elements, of each set as $|A|$ and $|B|$.

Example: If $A = \{1,2,10\}$ then $|A| = 3$.

If I wanted to represent the unique elements in both of those sets combined as another set I would be performing the union, which is represented by $A \cup B$.

Example: If $A = \{1,2,3\}, B=\{2,3,4\}$ then $A \cup B = \{1,2,3,4\}$.

If I wanted to represent the unique elements that appear in both $A$ and $B$ I would be performing the intersection, which is represented by $A \cap B$.

Example: With $A, B$ as above, $A \cap B = \{2,3\}$.

The relationship between the union’s cardinality and the intersection’s cardinality is given by the inclusion-exclusion principle. (We’ll only be looking at the two-set version in this post.) For reference, the two-way inclusion-exclusion formula is $|A \cap B| = |A| + |B| - |A \cup B|$.

Example: With $A, B$ as above, we see that $|A \cap B| = 2$ and $|A| + |B| - |A \cup B| = 3 + 3 - 4 = 2$.

For convenience we’ll define the $overlap$ between two sets as $overlap(A, B) := \frac{|A \cap B|}{min(|A|, |B|)}$.

Example: With $A, B$ as above, $overlap(A,B) = \frac{|A \cap B|}{min(|A|, |B|)} = \frac{2}{min(3,3)} = \frac{2}{3}$.

Similarly, for convenience, we’ll define the cardinality ratio $\frac{max(|A|, |B|)}{min(|A|, |B|)}$ as a shorthand for the relative cardinality of the two sets.

The examples and operators shown above are all relevant for exact, true values. However, HLLs do not provide exact answers to the set cardinality question. They offer estimates of the cardinality along with certain error guarantees about those estimates. In order to differentiate between the two, we introduce HLL-specific operators.

Consider a set $A$. Call the HLL constructed from this set’s elements $H_{A}$. The cardinality estimate given by the HLL algorithm for $H_{A}$ is $|H_{A}|$.

Define the union of two HLLs $H_{A} \cup H_{B} := H_{A \cup B}$, which is also the same as the HLL created by taking the pairwise max of $H_{A}$‘s and $H_{B}$‘s registers.

Finally, define the intersection cardinality of two HLLs in the obvious way: $|H_{A} \cap H_{B}| := |H_{A}| + |H_{B}| - |H_{A \cup B}|$. (This is simply the inclusion-exclusion formula for two sets with the cardinality estimates instead of the true values.)

### Appendix B: A (Very Brief) Review of Error

The simplest way of understanding the error of an estimate is simply “how far is it from the truth?”. That is, what is the difference in value between the estimate and the exact value, also known as the absolute error.

However, that’s only useful if you’re only measuring a single thing over and over again. The primary criteria for judging the utility of HLL intersections is relative error because we are trying to measure intersections of many different sizes. In order to get an apples-to-apples comparison of the efficacy of our method, we normalize the absolute error by the true size of the intersection. So, for some observation $\hat{x}$ whose exact value is non-zero $x$, we say that the relative error of the observation is $\frac{x-\hat{x}}{x}$. That is, “by what percentage off the true value is the observation off?”

Example: If $|A| = 100, |H_{A}| = 90$ then the relative error is $\frac{100 - 90}{100} = \frac{10}{100} = 10\%$.

## Real-Time Streaming with Rsyslog and ZeroMQ

How do we go about streaming data in real time? At AK, we use Rsyslog in conjunction with ZeroMQ and a little AK secret sauce. Why Rsyslog? We looked for technology that existed in the world today that solved >90% of the problem. Since the beginning of modern UNIX operating systems, system logging has existed in the computer world and has evolved into real-time log routers and aggregators.

• Rsyslogd allows for multiple inputs and outputs.
• Rsyslogd allows for multiple routes based on stream type, origination (location and/or application), destinations.

As such, AK has written a ZeroMQ Rsyslog module

• ZeroMQ input/output interface (connect/bind, push/pull)
• pub/sub type coming soon

Simply put, we at AK have moved to a real-time data streaming process by integrating the Rsyslog service with the ZeroMQ library. This has allowed us to move from a brittle system of large scheduled data migrations and deferred processing to a lighter weight model of real time streaming data and processing. There are many benefits to this, including high scalability, durability and verification, real-time streaming among multiple data centers, and efficiency. We believe that others who have the same issues of counting and providing insights to massive data sets will follow in moving to a real time data analytics platform.

The pub-sub ZeroMQ integration. This is beyond cool since it basically allows us to expose a tap into the event stream. You want to simply connect to the event stream and try out some new algorithm? It’s trivial. Put ZeroMQ on the front and start listening. You want to grab a few minutes worth of events as they come in? Just connect and take what you need. No more going off to the log server, finding the logs, parsing them, breaking them up, etc, etc, etc. Just tap and go.

## Real-Time Streaming for Data Analytics

There are many reasons why real-time streaming is a necessity in today’s data analytics infrastructure. Just having data isn’t good enough. It must be timely and it must be actionable. Here is an analogy:

It’s 11:45 PM on a Friday night and you’re out at Streaming Seven — a swanky new club — after a long week of working. The Fire Marshall comes and asks the door man “How many people are in the establishment?” The doorman, who has been keeping a streaming count, looks at his hand counter and says “300, our capacity is 500”. The Fire Marshall nods and moves on.

The next night you join your friends in their favorite hang-out. Again the Fire Marshall comes and asks the same question “How many people are in the establishment?” In this bar the manager looks around frantically and replies “I don’t know!” The Fire Marshall orders the lights on, the doors closed, stop serving drinks, and does a headcount. The manager is losing money by the minute. You’re left standing there with an empty drink thinking about Streaming Seven.

Given that data sets continue to increase in size while the window for analyzing those data sets invariably decreases in this fast-paced industry, it becomes more and more critical to filter and stream data to multiple process end points in real time. It is very costly to defer the analytics until later, not only in dollars but also processing times, complex unreliable data stores, and waiting customers.

Here at AK we use Rsyslog in conjunction with ZeroMQ to provide our real-time streaming infrastructure. My next few posts are going to dive into these technologies and outline why we’ve chosen them. Stay tuned!