We’ve been talking about probabilistic distinct value counting with sketches (DV sketches) for a while now and have had some fun experiences putting them into our production environment. In this post I want to talk about a DV sketch that is very intuitive and easy to implement, the K-minimum Values sketch (KMV). While KMV sketches are relatively lightweight and accurate, they are not the best of breed when it comes to DV counting. They are useful to me in two ways though: for exposition and for multiset operations.
KMV seems to have been first introduced in 2002 by Ziv Bar-Yossef et al. in the great paper “Counting distinct elements in a data stream”. In this paper they talk about improving on the basic intuition of the seminal DV sketch papers of Flajolet and Martin and of Alon, Matias, and Szegedy (AMS). (AMS put some formality around the frequency moment problems, bounds of algorithms, etc.) Flajolet and Martin’s paper is in turn based upon work from Morris 1978 (looking for streaks of right-most zeroes, i.e. the predecessor to LogLog and HyperLogLog). These are fun to read (although they admittedly get pretty mathy) and it’s cool to see the progression of knowledge, accuracy, and efficiency as these guys do their work. You can almost imagine the fist fights that happen during their meet-ups! The final detailed work on KMV is by Beyer et al. in “On Synopses for Distinct-Value Estimation Under Multiset Operations”.
How it works
The intuition behind KMV is straightforward. Supposing you have a good hash function, i.e. one whose hash values are evenly distributed over the hash space (I will normalize the hash output to [0, 1] for the rest of this post), you could estimate the number of distinct values you have seen by knowing the average spacing between values in the hash space. If you see 10 distinct values, you would expect them on average to be spaced about 1/10th apart from each other. You could track this cheaply by keeping, say, only the smallest value you have ever seen. If the values are indeed uniformly distributed, and provided you’ve thrown a decent amount of data through it, you could guess that the smallest value you have seen is a decent estimate of the average spacing of hash values in your space, and so 1 over that value is a rough estimate of the distinct count.
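To make the single-minimum idea concrete, here is a minimal Python sketch. The helper names `unit_hash` and `estimate_from_minimum` are made up for illustration, and using the first 8 bytes of a SHA-1 digest as the “good hash” is an assumption, not something this post prescribes.

```python
import hashlib

def unit_hash(value: str) -> float:
    """Map a string to a float in [0, 1] (assumption: SHA-1 is 'uniform enough')."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def estimate_from_minimum(values) -> float:
    """Guess the distinct count as 1 / (smallest hash value seen).

    Duplicates hash to the same point, so they never change the minimum.
    A hash of exactly 0 would divide by zero; that is ignored in this sketch.
    """
    smallest = min(unit_hash(v) for v in values)
    return 1.0 / smallest
```

Feeding the same stream twice, or with repeats, gives the same answer, which is the property that makes this a distinct-value estimator. How far a single minimum can land from the truth is exactly the variance problem discussed next.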
Of course, this doesn’t have a lot of “nice” properties. Taking only one value opens you up to a ton of variance and you are fairly dependent on the “goodness” of your hash. In order to improve upon this Bar-Yossef suggests keeping the k smallest values you have ever seen around. The algorithm becomes:
    Initialize KMV with the first k distinct hash values
    for all h(n):
        if h(n) < max(KMV) and h(n) not in KMV:
            insert h(n) into KMV set
            remove largest value from KMV

    Cardinality(KMV):
        return: (k-1)/max(KMV)
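The pseudocode above could be written in Python roughly as follows. This is a sketch under assumptions: the class name `KMV`, the `unit_hash` helper, and the SHA-1-based hash are all hypothetical choices, and returning the exact count while the sketch holds fewer than k values is my own addition rather than part of the algorithm as stated.

```python
import hashlib
import heapq

def unit_hash(value: str) -> float:
    """Map a string to [0, 1] (assumption: first 8 bytes of SHA-1 are uniform)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class KMV:
    def __init__(self, k: int):
        self.k = k
        self._heap = []        # negated hashes, so _heap[0] is -(largest kept value)
        self._members = set()  # the kept hash values, for duplicate checks

    def add(self, value: str) -> None:
        h = unit_hash(value)
        if h in self._members:
            return                                   # already tracked
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)           # still filling the first k slots
            self._members.add(h)
        elif h < -self._heap[0]:
            # Insert h and evict the current largest value in one step.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def cardinality(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))            # fewer than k distinct values seen
        return (self.k - 1) / -self._heap[0]         # (k-1) / max(KMV)
```

A sketch built with `KMV(k=256)` and fed a stream through `add` answers `cardinality()` at any point. The max-heap makes each update O(log k); a sorted list would work just as well for small k.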
For a KMV sketch of size k=3, graphically you have:
A very straightforward approach. Note that the “-1” in the numerator comes from a bias correction in the estimate. You’re going to have to read the paper for that. So, the size of the sketch is basically k 64-bit values.
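For a rough idea of where that correction comes from without opening the paper, here is a short worked derivation. It assumes an idealized hash whose n distinct outputs are i.i.d. uniform on (0, 1), and it uses standard order-statistic facts rather than the paper’s own argument; the symbol U_(k) for max(KMV) is my notation.

```latex
% Assumption: n distinct hash values, i.i.d. Uniform(0,1).
% U_{(k)} is the k-th smallest of them, i.e. max(KMV).
\begin{align*}
U_{(k)} &\sim \mathrm{Beta}(k,\; n-k+1) \\
\mathbb{E}\!\left[\frac{1}{U_{(k)}}\right] &= \frac{n}{k-1} \qquad (k > 1) \\
\mathbb{E}\!\left[\frac{k}{U_{(k)}}\right] &= \frac{k}{k-1}\, n \quad \text{(the naive estimate runs high)} \\
\mathbb{E}\!\left[\frac{k-1}{U_{(k)}}\right] &= n \quad \text{(unbiased)}
\end{align*}
```

Under the same assumption, the k = 1 case has no finite expectation for 1/U, which is one way to see why a single minimum is such a noisy estimator.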
Performing set operations with KMVs is also incredibly straightforward. The intuition around unions is that combining 2 KMV sketches and keeping the k minimum values of the combination gives the same result as building a single sketch over both streams in the first place, so unions are “lossless”. To perform a union, you merely take 2 sketches, combine their values, and keep the k smallest ones (if the 2 sketches are of different sizes, k and k’, then you keep the min(k,k’) smallest values in order to keep the lowest resolution).
    Union(A,B):
        k = min( |A|, |B| )
        return: min_k( A ∪ B )
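As a small Python illustration of the union step, assuming each sketch is simply a list of its retained hash values in [0, 1] (the function names `kmv_union` and `kmv_cardinality` are hypothetical):

```python
def kmv_union(a, b):
    """Merge two KMV sketches, keeping the min(k, k') smallest distinct values."""
    k = min(len(a), len(b))
    return sorted(set(a) | set(b))[:k]

def kmv_cardinality(sketch):
    """The (k-1)/max(KMV) estimate for a sketch given as a list of hash values."""
    return (len(sketch) - 1) / max(sketch)

a = [0.1, 0.2, 0.4]
b = [0.05, 0.2, 0.3]
merged = kmv_union(a, b)   # [0.05, 0.1, 0.2]
```

Values shared by both sketches (0.2 here) are kept once, which is why the merge goes through a set before truncating to k.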
For intersections you use the KMV to estimate the Jaccard coefficient for the 2 (or n) sets. Basically, you treat the 2 KMV sketches for each set as a random uniform sample and intersect these to estimate Jaccard. So, you assemble the k minimum values of the two sets (as you did in union above), and intersect this result with the original sketches to obtain an estimate of the overlap of the 2 sets. The steps are:
    IntersectionCard(A,B):
        L = UnionSet(A,B)   # the set this time, not just the cardinality
        k = min( |A|, |B| )
        K = | L ∩ A ∩ B |
        return: K/k * Cardinality(L)
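The same steps in Python might look like the following. Again this assumes sketches are plain lists of hash values in [0, 1], and the function name is made up for illustration.

```python
def kmv_intersection_cardinality(a, b):
    """Estimate |A ∩ B| from two KMV sketches given as lists of hash values."""
    k = min(len(a), len(b))
    union = sorted(set(a) | set(b))[:k]            # L: the k smallest values overall
    overlap = len(set(union) & set(a) & set(b))    # K: members of L present in both
    jaccard = overlap / k                          # K/k estimates the Jaccard coefficient
    union_estimate = (k - 1) / union[-1]           # Cardinality(L)
    return jaccard * union_estimate

a = [0.1, 0.2, 0.4]
b = [0.05, 0.2, 0.3]
estimate = kmv_intersection_cardinality(a, b)      # one shared value out of k = 3
```

With sketches this tiny the answer is mostly noise; the point is only to show which values get counted toward K.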
One of the nice features of KMV, unlike, say, HyperLogLog, is that you can take n-way intersections by extending the algorithm above. To do this with HyperLogLog you actually need to compute the n-way inclusion-exclusion algebra for set intersection, which for two sets is:
|A ∩ B| = |A| + |B| - |A ∪ B|
However, in our experience of using KMV for set operations on Zipfian data, KMVs still don’t perform as well as HyperLogLog sketches for computing n-way intersections using the same amount of memory.
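To show what the n-way algebra involves for a union-only sketch such as HyperLogLog, here is a small Python sketch of inclusion-exclusion over unions. The function name and the `union_cardinality` callback are hypothetical; exact Python sets stand in for real sketches so the arithmetic is easy to check.

```python
from itertools import combinations

def intersection_via_unions(sketches, union_cardinality):
    """Compute |A1 ∩ ... ∩ An| using only union cardinalities.

    Sums (-1)^(size+1) * |union of subset| over every non-empty subset,
    so it needs 2^n - 1 union estimates.
    """
    total = 0.0
    for size in range(1, len(sketches) + 1):
        sign = 1 if size % 2 == 1 else -1
        for subset in combinations(sketches, size):
            total += sign * union_cardinality(subset)
    return total

# Stand-in for a real sketch library: exact sets and an exact union count.
exact_union = lambda subset: len(set().union(*subset))
A, B, C = {1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6}
result = intersection_via_unions([A, B, C], exact_union)   # 1.0, since only 4 is shared
```

With real sketches every term carries its own estimation error, and the number of terms grows exponentially in n.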
Expansion to Multisets
One of the nice features of KMV sketches is their expansion to supporting multiset operations, dubbed the AKMV sketch. This is great if you are using them for document representations and want to support document similarity operations like tf-idf (or any other multiset operation). In order to expand the basic KMV structure to support multisets (described here) you just add a counter on top of the k values you are storing. In this way you get a decent sample of the counts of things in the stream/document to use for multiset operations. Most other DV sketches, HyperLogLog in particular, don’t support these types of queries.
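A minimal Python sketch of the counter idea, under assumptions: the class name `AKMV`, the `unit_hash` helper, and the SHA-1-based hash are hypothetical, and a plain dict with a linear scan for the maximum is used for brevity rather than speed.

```python
import hashlib

def unit_hash(value: str) -> float:
    """Map a string to [0, 1] (assumption: first 8 bytes of SHA-1 are uniform)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class AKMV:
    def __init__(self, k: int):
        self.k = k
        self.counts = {}   # retained hash value -> number of occurrences seen

    def add(self, value: str) -> None:
        h = unit_hash(value)
        if h in self.counts:
            self.counts[h] += 1              # repeat of a retained value
        elif len(self.counts) < self.k:
            self.counts[h] = 1               # still filling the first k slots
        else:
            largest = max(self.counts)       # O(k) scan, fine for a sketch
            if h < largest:
                del self.counts[largest]     # evict the largest value and its counter
                self.counts[h] = 1
```

Because max(KMV) only ever decreases, an evicted value can never re-enter, so every value still in the sketch has been counted since its first occurrence.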
To see how well this might work in practice, I took a look at some simple tf-idf similarity against the 20 Newsgroups data set. This data set contains about 1000 newsgroup posts for each of 20 topics, such as atheism and motorcycles (woo!). For each article I constructed an AKMV sketch of the words in it and used this representation as the basis for tf-idf. I cleaned up the data marginally by limiting my analysis to the 5000 most common words in the corpus (as seems to be the norm) and only considered alphanumeric “words”. Additionally, I cherry-picked only a few newsgroups from the set that showed “nice” separation in the SVD. You can think of the documents looking a bit like this, where the red dots are the entries in the AKMV and the green dots are not (as above):
Once I created the tf-idf matrix, I SVD-ed it and plotted each newsgroup against the second and third singular vectors (the first vector in this case contained mostly information about the mean of the document vectors and contained little real information for classification). The intermediate singular vectors for differing k were projected onto the actual singular vectors from the complete matrix (k = Inf). Running through increasing k, the newsgroups look like this:
You can see the structure start to appear relatively quickly for small k, and you can also see how some of the articles “stick” to their final spots because they have fewer than k distinct words. Clearly you would have to do more work and testing if you wanted to implement something like this in a real classifier or search engine, but it seems to be a promising approach.
Here is the same thing for a corpus composed of 23 articles about the Tom Cruise/Katie Holmes divorce and 20 articles about the Higgs boson.
Using document sketches as a basis for a recommender system/search engine or any other application that requires similarity metrics seems like a promising avenue. It would be very interesting indeed to run some real tests of precision/recall and memory footprint for sketch based recommenders/classifiers against other more standard approaches.
I make no claims about having built a classifier of any sort here. A lot of work and decisions would be necessary to move from these ideas to a useful classification scheme in a real environment. I was interested in how much of the flavor of a document would be retained in an AKMV sketch. Based on the above results, I think that the answer is “quite a bit,” even for modest values of k. I don’t think it would be out of the question to try to build a system that allowed you to compute similarities or apply classification tools after the sampling process inherent in the construction of these sketches.
An interesting thing to notice is that as your DV count gets larger, the max value of your k items gets smaller. That means the high-order bits of all k values become leading zeroes, so a simple compression scheme that works is to just throw away those unused high-order bits. Oddly, as the DV count gets larger your KMV will get smaller without losing accuracy.
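One way to picture this compression, assuming the sketch stores its hashes as non-negative integers (for example raw 64-bit hash values) rather than normalized floats; the `pack` and `unpack` helpers are hypothetical names for illustration:

```python
def pack(values):
    """Pack the k values using only as many bits as the largest one needs."""
    width = max(values).bit_length()        # shared leading zero bits are dropped
    packed = 0
    for v in sorted(values):
        packed = (packed << width) | v      # append v in a fixed-width slot
    return width, len(values), packed

def unpack(width, count, packed):
    """Recover the original values from the packed integer."""
    mask = (1 << width) - 1
    return sorted((packed >> (i * width)) & mask for i in range(count))

width, count, packed = pack([3, 17, 1000])  # width is 10 bits instead of 64
restored = unpack(width, count, packed)     # [3, 17, 1000]
```

Since every retained value is at most max(KMV), nothing is lost by dropping the bits above its highest set bit, and the width shrinks as more distinct values push that maximum down.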
There are many DV sketches in the world and KMV is one of the most interesting due to how easy it is to comprehend and implement. I particularly enjoy using KMV as a pedagogical tool and a solid jumping off point for DV sketching. The fact that KMV is so straightforward makes it stand out in a world of more confusing math and complicated sketching algorithms. In the right context it very well could be the right solution for your sketching needs, especially given the multiset support.