It’s worth thinking about how things can go wrong, and what the implications of such occurrences might be. In this post, I’ll be taking a look at the HyperLogLog (HLL) algorithm for cardinality estimation, which we’ve discussed before.
HLLs have the property that their register values increase monotonically as they run. The basic update rule is:
for item in stream: index, proposed_value = process_hashed_item(hash(item)) hll.registers[index] = max(hll.registers[index], proposed_value)
There’s an obvious vulnerability here: what happens to your counts if you get pathological data that blows up a register value to some really large number? These values are never allowed to decrease according to the vanilla algorithm. How much of a beating can these sketches take from such pathological data before their estimates are wholly unreliable?
Experiment The First
To get some sense of this, I took a 1024 bucket HLL, ran a stream through it, and then computed the error in the estimate. I then proceeded to randomly choose a register, max it out, and compute the error again. I repeated this process until I had maxed out 10% of the registers. In pseudo-python:
print("n_registers_touched,relative_error") print(0, relative_error(hll.cardinality(), stream_size), sep = ",") for index, reg in random.sample(range(1024), num_to_edit): hll.registers[reg] = 32 print(index + 1, relative_error(hll.cardinality(), stream_size), sep = ",")
In practice, HLL registers are fixed to be a certain bit width. In our case, registers are 5 bits wide, as this allows us to count runs of 0s up to length 32. This allows us to count astronomically high in a 1024 register HLL.
Repeating this for many trials, and stream sizes of 100k, 1M, and 10M, we have the following picture. The green line is the best fit line.
What we see is actually pretty reassuring. Roughly speaking, totally poisoning x% of registers results in about an x% error in your cardinality estimate. For example, here are the error means and variances across all the trials for the 1M element stream:
|Number of Registers Modified||Percentage of Registers Modified||Error Mean||Error Variance|
I was actually not too surprised to see that the induced error was modest when only a small fraction of the registers were poisoned. Along with some other machinery, the HLL algorithm uses the harmonic mean of the individual register estimates when computing its guess for the number of distinct values in the data stream. The harmonic mean does a very nice job of downweighting values that are significantly larger than others in the set under consideration:
In : from scipy.stats import hmean In : from numpy import mean In : f =  * 100000 +  In : mean(f) Out: 10000.899991000089 In : hmean(f) Out: 1.0000099999999899
It is this property that provides protection against totally wrecking the sketch’s estimate when we blow up a fairly small fraction of the registers.
Experiment The Second
Of course, the algorithm can only hold out so long. While I was not surprised by the modesty of the error, I was very surprised by how linear the error growth was in the first figure. I ran the same experiment, but instead of stopping at 10% of the registers, I went all the way to the end. This time, I have plotted the results with a log-scaled y-axis:
Without getting overly formal in our analysis, there are roughly three phases in error growth here. At first, it’s sublinear on the log-scale, then linear, then superlinear. This roughly corresponds to “slow”, “exponential”, and “really, really, fast”. As our mathemagician-in-residence points out, the error will grow roughly as p/(1-p) where p is the fraction of polluted registers. The derivation of this isn’t too hard to work out, if you want to give it a shot! The implication of this little formula matches exactly what we see above. When p is small, the denominator does not change much, and the error grows roughly linearly. As p approaches 1, the error begins to grow super-exponentially. Isn’t it nice when experiment matches theory?
It’s certainly nice to see that the estimates produced by HLLs are not overly vulnerable to a few errant register hits. As is often the case with this sort of analysis, the academic point must be put in balance with the practical. The chance of maxing out even a single register under normal operation is vanishingly small, assuming you chose a sane hash function for your keys. If I was running an HLL in the wild, and saw that 10% of my registers were pegged, my first thought would be “What is going wrong with my system!?” and not “Oh, well, at least I know my estimate to within 10%!” I would be disinclined to trust the whole data set until I got a better sense of what caused the blowups, and why I should give any credence at all to the supposedly unpolluted registers.