### Intro

We’ve been talking about probabilistic distinct value counting with sketches (DV sketches) for a while now and have had some fun experiences implementing them into our production environment. In this post I want to talk about a DV sketch that is very intuitive and easy to implement, the K-minimum Values sketch (KMV). While KMV sketches are relatively lightweight and accurate, they are not the best of breed when it comes to DV counting. They are useful in two ways to me though, for exposition and multi-set operations.

### History

KMV seems to have been first introduced in 2002 by Ziv Bar-Yossef et. al. in the great paper Counting distinct elements in a data stream. In this paper they talk about improving on the basic intuition by the seminal DV sketch papers of Flajolet and Martin and Alon, Matias, and Szegedy (AMS) (AMS put some formality around the frequency moment problems, bounds of algorithms etc.) Flajolet and Martin’s paper is in turn based upon work from Morris 1978 (looking for streaks of right-most zeroes i.e. the predecessor to LogLog and HyperLogLog). These are fun to read (although they admittedly get pretty mathy) and it’s cool to see the progression of knowledge, accuracy, and efficiency as these guys do their work. You can almost imagine the fist fights that happen during their meet-ups! The final detailed work on KMV is by Beyer et. al. in On Synopses for Distinct-Value Estimation Under Multiset Operations.

### How it works

The intuition behind KMV is straightforward. Supposing you have a good hash function, i.e. hash values are evenly distributed over the hash space (I will normalize the hash space output to [0-1] for the rest of this), then you could estimate the number of distinct values you have seen by knowing the average spacing between values in the hash space. If I see 10 distinct values, I would expect them on average to be spaced about 1/10th apart from each other. We could do this cheaply by keeping track of, say, the smallest value you have ever seen. If the values are indeed uniformly distributed and provided you’ve thrown a decent amount of data through it, you could guess that the smallest value you have seen is a decent estimate of the average spacing of hash values in your space.

Of course, this doesn’t have a lot of “nice” properties. Taking only one value opens you up to a ton of variance and you are fairly dependent on the “goodness” of your hash. In order to improve upon this Bar-Yossef suggests keeping the `k` smallest values you have ever seen around. The algorithm becomes:

Initialize KMV with first k values for all h(n): if h(n) < max(KMV): insert h(n) into KMV set remove largest value from KMV Cardinality(KMV): return: (k-1)/max(KMV)

For a KMV sketch of size `k=3`, graphically you have:

A very straightforward approach. Note that the “-1″ in the numerator comes from a bias correction in the estimate. You’re going to have to read the paper for that. So, the size of the sketch is basically `k` 64bit values large. Click below to run a KMV simulation:

### Set Operations

Performing set operations with KMV’s is also incredibly straightforward. The intuition around unions is that there is no difference between combining 2 KMV sketches and keeping the `k` minimum values in both versus just keeping one to start with, so unions are “lossless”. To perform union, you merely take 2 sketches and combine their values and keep the `k` smallest ones (if the 2 sketches are of different sizes, `k` and `k’`, then you keep the `min(k,k’)` values in order to keep the lowest resolution).

Union(A,B): k = min( |A|, |B|) return: min_k( A U B )

For intersections you use the KMV to estimate the Jaccard coefficient for the 2 (or n) sets. Basically, you treat the 2 KMV sketches for each set as a random uniform sample and intersect these to estimate Jaccard. So, you assemble the `k` minimum values of the two sets (as you did in union above), and intersect this result with the original sketches to obtain an estimate of the overlap of the 2 sets. The steps are:

IntersectionCard(A,B): L = UnionSet(A,B) # the set this time, not just the cardinality k = min( |A|, |B|) K = | L ∩ A ∩ B | return: K/k * Cardinality(L)

One of the nice features of KMV which is different than say HyperLogLog, is that you can take n-way intersections by extending the algorithm above. To do this with HyperLogLog you actually need to compute the n-way algebra for set intersection i.e.

|A ∩ B| = |A| + |B| - |A U B|

However, in our experience of using KMV for set operations on Zipfian data, KMV’s still don’t perform as well HyperLogLog sketches for computing n-way intersections using the same amount of memory.

### Expansion to Multisets

One of the nice features of KMV sketches is their expansion to supporting multiset operations, dubbed the AKMV sketch. This is great if you are using them for document representations and want to support document similarity operations like tf-idf (or any other multiset operation). In order to expand the basic KMV structure to support multisets (described here) you just add a counter on top of the `k `values you are storing. In this way you get a decent sample of the counts of things in the stream/document to use for multiset operations. Most other DV sketches, HyperLogLog in particular, don’t support these types of queries.

To see how well this might work in practice, I took a look at some simple tf-idf similarity against the 20 news groups data set. This data set contains about 1000 news group emails on various topics such as atheism and motorcycles (woo!). For each article I constructed an AKMV sketch of the words in it and used this representation as the basis for tf-idf. I cleaned up the data marginally by limiting my analysis to the 5000 most common words in the corpus (as seems to be the norm) and only considered alpahnumeric “words”. Additionally, I cherry picked only a few newsgroups from the set that showed “nice” separation in the SVD. You can think of the documents looking a bit like this where the red dots are the entries in the AKMV and the green dots are not (as above):

Once I created the tf-idf matrix, I SVD-ed it and plotted each newsgroup against the second and third singular vectors (the first vector in this case contained mostly information about the mean of the document vectors and contained little real information for classification). The intermediate singular vectors for differing `k` were projected onto the actual singular vectors from the complete matrix (`k = Inf`). Running through increasing `k`, the newsgroups look like this (click on the graphic to restart the animation):

You can see the structure start to appear relatively quickly for small `k` and you can also see how some of the articles “stick” to their final spots due to them having less than `k` words. Clearly you would have to do more work and testing if you wanted to implement something like this in a real classifier or search engine but it seems to be a promising approach.

Here is the same thing for a corpus composed of 23 articles about the Tom Cruise/Katie Holmes divorce and 20 articles about the Higgs boson.

Using document sketches as a basis for a recommender system/search engine or any other application that requires similarity metrics seems like a promising avenue. It would be very interesting indeed to run some real tests of precision/recall and memory footprint for sketch based recommenders/classifiers against other more standard approaches.

###### Disclaimer:

I make no claims about having built a classifier of any sort here. A lot of work and decisions would be necessary to move from these ideas to a useful classification scheme in a real environment. I was interested in how much of the flavor of a document would be retained in an AKMV sketch. Based on the above results, I think that the answer is “quite a bit,” even for modest values of *k*. I don’t think it would be out of the question to try to build a system that allowed you to compute similarities or apply classification tools after the sampling process inherent in the construction of these sketches.

### Compression

An interesting thing to notice is that as your DV count gets larger, your max value of the `k` items is getting smaller. What this means is a simple compression algorithm that works is to just throw away the higher order unused bits of all the `k` values. Oddly, as the DV count gets larger your KMV will get smaller without losing accuracy.

### Summary

There are many DV sketches in the world and KMV is one of the most interesting due to how easy it is to comprehend and implement. I particularly enjoy using KMV as a pedagogical tool and a solid jumping off point for DV sketching. The fact that KMV is so straightforward makes it stand out in a world of more confusing math and complicated sketching algorithms. In the right context it very well could be the right solution for your sketching needs, especially given the multiset support.

Can you please explain the KMV intersection operation in more detail? Also, how is it possible to do the intersection for more than 2 sets with HLL? Thanx in advance.

Bren – the KMV intersection is poorly worded in the papers (in my opinion). The basic way it works is you take the 2 sketches, intersect them, and only keep the ones that are in the KMV union set (truncated to k values). This gives you an estimate of how much the two sketches overlap. By way of example (for non-multiset KMV) lets say you have sketch A(k=5) = {0.1, 0.15, 0.24, 0.3, 0.33} and B = {0.1, 0.11, 0.15, 0.33, 0.4}. You first compute the intersection set, A ∩ B = {0.1, 0.15, 0.24, 0.3, 0.33} ∩ {0.1, 0.11, 0.15, 0.33, 0.4} = {0.1, 0.15, 0.33}. But, you only want to keep around values that are in the union set L = KMV_Union( A, B ) = min_k( {0.1, 0.15, 0.24, 0.3, 0.33} U {0.1, 0.11, 0.15, 0.33, 0.4} ) = min_k( {0.1, 0.11, 0.15, 0.24, 0.3, 0.33, 0.4} ) = {0.1, 0.11, 0.15, 0.24, 0.3}. If we look at our intersection set from above we see that only 0.1 and 0.15 are in L ∩ A ∩ B. So the intersection estimation is K/k=2/5 or a 40% overlap between these two sets.

As for HLL intersection with more than 2 sets you have to think about the inclusion-exclusion principle (http://en.wikipedia.org/wiki/Inclusion-exclusion_principle), You basically rewrite intersections as a series of union cardinalities and set cardinalities which can get messy pretty quickly but, as mentioned, works well in practice.

As I was reading that wikipedia page I noticed something interesting. The inclusion-exclusion principle is attributed to Abraham de Moivre (http://en.wikipedia.org/wiki/Abraham_de_Moivre), about whom wikipedia says: “As he grew older, he became increasingly lethargic and needed longer sleeping hours. He noted that he was sleeping an extra 15 minutes each night and correctly calculated the date of his death on the day when the additional sleep time accumulated to 24 hours, November 27, 1754″. Cool!

Matt, there are actually two ways to estimate the intersection; one is the Jaccard way as described in paper, the other way can be directly using the intersection sketch, in your example, A ∩ B = {0.1, 0.15, 0.24, 0.3, 0.33} ∩ {0.1, 0.11, 0.15, 0.33, 0.4} = {0.1, 0.15, 0.33} and based on the intersection vector and new K where K=3 for the A ∩ B to estimate the cardinality by treating the intersection vector as a KMV sketch for the intersection set; In my experiment, actually I find that later one actually gives better estimation compared to the Jaccard one described in the paper;

One experiment shows,

The estimated intersection cardinality is: 1325462, true cardinality: 1176471, register size: 46

The estimated intersection error is: 12.664230567519303%

The estimated union cardinality is: 82281685, true cardinality :82352944, register size: 2048

The estimated union error is: 0.08652878274758459%

The jaccard cardinality is: 1848124, true cardinality :1176471

The estimated jaccard error is: 57.09048501833024%

Matt, have you ever evidenced the such as well?

Ryan,

Very insightful comment. I believe the answer lies in the error bounds of your estimator. For instance, suppose you had two sets of data that only had 1 intersecting value, and further suppose the hash of this value is in your KMV sketches for both. In your approach you would end up with a sketch of k = 1 to approximate the intersection size and you have no way of improving this estimate. In the approach of the paper you can continually increase k for your original sketches and tighten the error bound of the intersection estimate. So, there are controllable error bounds in the algorithm of the paper while I believe your estimator is not controllable.

We haven’t tried your approach for the above reason. It is curious that you have such a large discrepancy in your example. I suppose I would suggest double checking your code and running the same experiments with increasing k to confirm the error decreases.

Matt,

Have double checked the code, the code is fine; Actually, I think that both the Ke = cardinality (A ∩ B) and K cardinality (A union B) will give individual error bounds for the two estimators and Ke = E ( Jaccard * K ); So basically, I would argue that your example is not correct, since increasing K will also indicate the increase in Ke; and in extreme case where (Ke = 1), both estimators will give quite poor performance;

For low Jaccard value, in my case ( J = (1176471/82352944) ~= 0.014 ), to achieve a good estimate (low ARE for A ∩ B); it means that the Jaccard error should be very small as it will be amplified by (A union B); However, Ke will be in such cases a reasonable value to achieve reasonable ARE; my point is that it might exist scenarios to balance off the estimator error by choosing one of the estimator based on the Ke / K value and their bounds respectively.

Ryan,

I’m having a very difficult time understanding your notation so I’ll try and make my point again. In the extreme example above (one overlapping element out of two very large sets) you will have an intersection vector of length 1 (i.e. a capital K of 1) regardless of the size of the registers (k’s) you start with (assuming this overlapping value is in both KMV’s to begin with). If you interpret this intersection vector as a KMV itself, you can never improve the error by changing the original register size. You are effectively using a new KMV with a register size of 1 and there is nothing you can do with the original registers to change this.

In the paper, the estimate of the Jaccard (K/k) is constantly improving with larger k. i.e. 1/1024, 1/2048, 1/4096, etc. so you can hone in on the correct answer by increasing k. Your statement that “both estimators will have poor performance” is not quite true, you can tune the paper’s approach to have increasingly smaller error.

However, it doesn’t seem out of the question that for certain values of k, |A U B |, and |A ∩ B | that your estimator would be more accurate. There are 2 issues I have with this approach though, I don’t think you have nice probabilistic error bounds and I’m not sure you can know when to use which estimator. We’ll have to look into it a bit more. Have you plotted any comparisons between the two?

Matt,

Seems that we have some confusion here;

Quote:

In the paper, the estimate of the Jaccard (K/k) is constantly improving with larger k. i.e. 1/1024, 1/2048, 1/4096, etc. so you can hone in on the correct answer by increasing k. Your statement that “both estimators will have poor performance” is not quite true, you can tune the paper’s approach to have increasingly smaller error.

[Ryan] In this case, increase in k will also lead to increase in K ( K = E(Jaccard * k), since Jaccard is fixed in such cases); so for extreme case, if K=1 with fixed Jaccard index, increase k will make K increasing as well to have ( K > 1);

Quote:

However, it doesn’t seem out of the question that for certain values of k, |A U B |, and |A ∩ B | that your estimator would be more accurate. There are 2 issues I have with this approach though, I don’t think you have nice probabilistic error bounds and I’m not sure you can know when to use which estimator. We’ll have to look into it a bit more. Have you plotted any comparisons between the two?

[Ryan] for the probabilistic error bound for K to estimate |A ∩ B|, we can borrow the error bound in original paper where the ARE is computed as sqrt (2 / pi * (K – 2));

Will get chance to collect data and plot one such comparison then when get chance.

We have updated the post to include an interactive KMV simulation. I hope that you find it useful!

Thanks for the excellent blog on KMV. I tested the KMV intersection. The error rate is very large.

When one set is over 10 times bigger than the other, the error rate can be anywhere between 30% and 200%. Even for two sets with similar size, the error rate is greater than 15%. Is this normal? I am using Jaccard way to estimate the intersection.

Jie,

It is not surprising to me that your intersection errors could be so high. The accuracy of your estimates will obviously very much depend upon the size of your sketch (K), the relative sizes of the sets your are comparing as well as the amount of overlap the two sets have. Take a look at some of our posts on HLL intersections and the ways we explored those. I would guess you could follow a similar route of exploring intersections as we did for HLL. I would love to see some output!