Author’s Note: The Kolmogorov-Smirnov test is a handy tool that is conceptually clean, and can be useful in a variety of data analysis situations. I’ll introduce it in the context of a problem that I came across, and give a feel for what it does, and how it might be useful.
A Question and A Tool
I’ve been doing a lot of work with hash functions, and as part of that work I was posed with a question. If I take the same data, encode it two different ways, and feed the two encodings to the same hash function, is there any difference in the statistical properties of the hashed output data sets?
The model I used to explore this question was to take a great number of SHA1 checksums, and MurmurHash3 these numbers, first encoded as 16 byte integers, and then again as Java Strings. There are a lot of things that one could do at this stage, but the first thing I thought to apply was the Kolmogorov-Smirnov (KS) test.
First, some background. The cumulative distribution function (CDF) is a common and natural way of characterizing a probability distribution. The KS test gives us a tool for taking two CDFs and speaking intelligently about how “different” they are. A typical use case is as follows:
- You collect data that you suspect follows some theoretical distribution (uniform, Poisson, whatever)
- From the raw data you construct an empirical cumulative distribution function (ECDF)
- You use the KS test to answer the question, “Assuming my data were sampled from this theoretical distribution, what is the probability of seeing an ECDF that is at least this different from what one would predict?”
A more interesting use case is to compare two empirical distributions for equality. The test is conceptually exactly the same, except instead of comparing a CDF generated from data to one generated by theory, the comparison is between two empirical CDFs. A minor consequence of comparing two empirical data sets is that there is some additional uncertainty that must be dealt with, but this can be addressed by simply using larger samples (see the scaling factors discussed below).
The figure on the right is very helpful in understanding what is going on in this test.
Given two CDFs, the first thing the KS test does is find their maximum positive and negative differences, D+ and D-, respectively. These differences are scaled to produce so-called “K statistics.” In the case where one is comparing an empirical to a theoretical CDF (shown in the figure), all one needs to do is scale the differences by sqrt(n) where there are n observations. For the comparison of two empirical distributions of size n and m, D+ and D- are scaled by sqrt(nm/(n+m))
This scaling takes care of the idea the same magnitude of difference is more troubling if you have more data. A chance large jump or long lag in your ECDF curve is increasingly unlikely as your samples grow.
For a vanilla KS test, the larger of K+ and K- is compared against the Kolmogorov distribution. This allows you to compute a p-value telling you the probability of seeing a K statistic as large as you did under the assumptions of the null hypothesis that the sample is drawn from the theoretical distribution you are testing it against.
The KS test doesn’t need a lot of data to start detecting fairly small differences. If you have a lot of data, and you want to get fancy, you can break your data set up into many disjoint subsets and run KS test on each of the subsets, keeping the K+ and K- statistics for each subset. You can then pool all of K+ statistics into one collection, all of the K- statistics into another and individually compare them to their theoretical distribution, which is well approximated by 1-e-2x2. In this way you can make good use of all of your data, and better balancing the competing goals of detecting both global and local divergence from the ideal CDF. See TAOCP Vol. II for a more thorough discussion of this technique.
So What Happened?
A simple call to scipy.stats.ks_2samp and some waiting returned a p-value of 0.9977065. The size difference between the two data samples’ ECDFs was well within what one would expect, were they drawn from the same underlying distribution. This result is nice. A good hash function should be as insensitive to the statistical nuances of the input data as possible, always producing a nice, uniform, output. Note that this statistic says nothing about the quality of MurmurHash3‘s output distribution, only that its ability to grind up the name numbers doesn’t appear to suffer dramatically when they are encoded as strings vs. bytes. As it so happens we’ve seen that Murmur is pretty darn good!
As with all test statistics, you shouldn’t blindly accept or reject a result on the basis of some arbitrary cutoff. The KS test can’t tell you whether or not any “statistically significant” difference is practically significant. It is a very sensitive test, and given a large enough sample size can detect differences that are meaningless to your application. It’s certainly worth looking at plots of your ECDFs, repeating your analysis on different subsets of your data, and even judging the results of the test in light of other statistical measures or related data. This test wasn’t end of my analysis of this problem, but it was certainly a useful tool along the way. I hope that it may one day be similarly useful for you!
- R’s ks.test and ks.boot functions implement the standard and bootstrapped KS test for single and two-sample cases
- SciPy implements a lot of KS tools in the scipy.stats module
- Matlab’s versions live in the statistics toolbox
- Octave has these tests as builtins
- TAOCP Vol II. Seminumerical Algorithms by Knuth has a very nice writeup, but is focused on 1 sample tests.
- The KS test is discussed in John Cook’s chapter on testing a random number generator in Beautiful Testing. It is freely readable here.
- Approximating it in O(n) time