We are always worried about the size of our sketches. At AK we like to count stuff, and if you can count stuff with smaller sketches then you can count more stuff! We constantly have conversations about how we might be able to make our sketches, and subsequently our datastores, smaller. During our science summit, Muthu pointed us at some of the new work in Frugal Streaming. The concept of Frugal Streaming is to process your data as it comes, *O(N)*, but severely limit the amount of memory you are using for your sketch, and by “severely” they mean using perhaps one or two pieces of information. Needless to say, we were interested!

### History

The concept of Frugal Streaming reminds me of an algorithm for finding the dominant item in a stream, MJRTY written by Boyer and Moore in 1980. (This paper has a very interesting history). The MJRTY algorithm sets out to solve the problem of finding the majority element in a stream (an element comprising more than 50% of the stream). Moore proposed to solve this by using only 2 pieces of information and a single scan of the data.

Imagine you have a stream of names (“matt”, “timon”, “matt”, “matt”, “rob”, “ben”, …) and you wanted to know if any name appeared in more than half the stream. Boyer and Moore proposed the following:

count = 0 majority = "" for val in stream: if count == 0: majority = val count = 1 elif val == majority: count += 1 else: count -= 1 print majority if count > 0 else "no majority!"

If you're anything like me you will go through a few phases: "That can't work!", "Wait, that works?!", "Nah, this data will break that", "Holy sh*t that works!!". If you think about it for a minute you realize that it HAS to work. If any element in the stream comprises more than half the stream values there is no way to get to the end and have a counter of zero. To convince yourself suppose the majority element only comprises the first half + 1 of your N-length stream. The counter would count up to and then start decrementing until all N values have been seen, which would leave the counter at *. This will hold regardless of the ordering of the elements in the stream. A simple simulation is provided by the authors. Philippe Flajolet apparently "liked this algorithm very much and called it the 'gang war', because in his mind, every element is a gang member, and members of opposite gangs are paired in a standoff, and shoot each other. The gang members remaining are the more numerous"**.

The astute among you will have noticed that this algorithm only works if there is, in fact, a majority element. If there is not then it will fail. A stream such as {"matt","matt","timon","timon","rob"} would result in the algorithm returning "rob". In practice you need ways of ensuring that your stream does indeed have a majority element or have a guarantee ahead of time.

* Note, that formula is for an even length stream. For a stream of odd length the counter will indeed be at 1. Proof is left to the reader.

** Thanks to Jeremie Lumbroso for his insightful feedback on the history of MJRTY and his memory of Philippe's explanation.

### One "bit" - Frugal-1U

In their Frugal Streaming paper, Qiang and Muthu decided to see if they could find a frugal solution to the streaming quantile problem. The streaming quantiles problem is one I enjoy quite a bit and I have used it as an interview question for potential candidates for some time. Simply stated it is: "How would you find the quantiles of a stream of numbers in O(N) with limited memory?" There are a few different approaches to solve this problem, with the most widely known probably being Count-Min Sketch. However, with Count-Min you need to keep your keys around in order to query the sketch. There are other approaches to this question as well.

Instead of focusing on quantiles generally, Qiang and Muthu's first solution is a frugal way to find the median of a stream. As with MJRTY above, I'll just write it down and see how you react:

median_est = 0 for val in stream: if val > median_est: median_est += 1 elif val < median_est: median_est -= 1

Granted, the above is just for the median, where the stream is much longer than the value of the median, but it is so simple that I don't think I would have ever considered this approach to the problem. The extension to quantiles is equally as shocking. If you are trying to find the 75th percentile of the data stream you do the same as above but increment up randomly 75% of the time and decrement down randomly 25% of the time:

quantile_75 = 0 for val in stream: r = random() if val > quantile_75 and r > 1 - 0.75: quantile_75 += 1 elif val < quantile_75 and r > 0.75: quantile_75 -= 1

As the paper states:

The trick to generalize median estimation to any -quantile estimation is that not every stream item seen will cause an update. If the current stream item is larger than estimation, an increment update will be triggered only with probability . The rationale behind it is that if we are estimating -quantile, and if the current estimate is at stream’s true -quantile, we will expect to see stream items larger than the current estimate with probability .

### Finding Quantiles With Two "bits"- Frugal-2U

There are a few obvious drawbacks to the above algorithm. Since we are only incrementing by 1 each time, the time to converge is linear and our initial guess of zero could be very bad. Secondly, and by design, the algorithm has no memory, can fluctuate wildly and, as they show in the paper, the estimates can drift very far away. (Note: while it is extremely unlikely that the estimates will drift far from the correct values the paper has some very loose bounds on how bad it can drift. See Lemma 3 in the paper.) They suggest a few improvements over Frugal-1U where you essentially include a varying step (instead of just incrementing by 1 every time) and 1 "bit" so you know which way you incremented in the last update. The intuition here is that if you have been consistently incrementing up or down for the last few elements of a stream then you are probably "far" away from the quantile in question. If this is the case we can speed up convergence time by incrementing a larger amount. The Frugal-2U algorithm:

def frugal_2u(stream, m = 0, q = 0.5, f = constantly_one): step, sign = 1, 1 for item in stream: if item > m and random() > 1 - q: # Increment the step size if and only if the estimate keeps moving in # the same direction. Step size is incremented by the result of applying # the specified step function to the previous step size. step += f(step) if sign > 0 else -1 * f(step) # Increment the estimate by step size if step is positive. Otherwise, # increment the step size by one. m += step if step > 0 else 1 # Mark that the estimate increased this step sign = 1 # If the estimate overshot the item in the stream, pull the estimate back # and re-adjust the step size. if m > item: step += (item - m) m = item # If the item is less than the stream, follow all of the same steps as # above, with signs reversed. elif item < m and random() > q: step += f(step) if sign < 0 else -1 * f(step) m -= step if step > 0 else 1 sign = -1 if m < item: step += (m - item) m = item # Damp down the step size to avoid oscillation. if (m - item) * sign < 0 and step > 1: step = 1

You can play around with the 1U and 2U algorithms in the simulation below.

As usual, there are a few interesting tidbits as well. If you read the paper you will see that they define the updates to *step* as a function. This means they are allowing many different types of increments to *step*. For example, instead of increasing the size of *step* by 1 each time we could increase it by 10 or even have it increase multiplicatively. They do talk about some uses of different updates to *step* but there is no analysis around this (yet) and they restrict all of the work in the paper to *step* increments of 1. We offer a few different *step* update functions in the simulation and they indeed do interesting things. Further exploration is definitely needed to get some insights here.

A non-obvious thing about the *step* variable is how it behaves under decrements. My initial thought was that *step* would get large if your current estimate was far below the actual value (thus allowing you to approach it faster from below) and that *step* would get to be a large negative number if your current estimate was very far above the actual value. This turns out to just be flat wrong. The negative updates to *step* have the effect of *stabilizing* the estimate (notice when step is negative that the updates to your estimates are always ± 1 ). If you read the algorithm closely you will see that *step* decrements when you consistently alternate up and down updates. This behavior occurs when the estimate is close to the actual value which causes *step* to become a very large negative number. And I mean VERY large. In practice we have seen this number as small as for some simulations.

### Monitoring

One of the first things I thought of when I saw this sketch was to use it as a monitoring tool. Specifically, perhaps it could be used to replace the monitoring we use on our application server response times. It turns out that using this sketch for monitoring introduces some very interesting issues. So far, we have mostly talked about what I will call "static streams". These are streams that have data in them which is pulled consistently from some static underlying distribution (as in the examples above). However, what if the underlying distribution changes? For example, what if one of our servers all of the sudden starts responding with slower response times? Does this frugal sketch enable you to *quickly* figure out that something has broken and fire off an alarm with *high confidence*? Well, in some ways this is an identical problem to cold start: how long does it take for an initial guess to reach a stable quantile? Unfortunately, there is no easy way to know when you have reached "equilibrium" and this makes identifying when an underlying distribution change has occurred difficult as well. The paper ends with an open challenge:

... as our experiments and insights indicate, frugal streaming algorithms work with so little memory of the past that they are adaptable to changes in the stream characteristics. It will be of great interest to understand this phenomenon better.

The paper shows some interesting experiments with changing data, but I agree with Qiang: we need to understand these algorithms better before I swap out all of our server monitoring tools (and our ops team would kill me). However, since these are so simple to implement and so small, we can easily deploy a few tests and see how the results compare "in the field" (you can play around with this by changing the underlying stream distribution in our simulation above.)

### Conclusion

The frugal quantile algorithms proposed in the paper are fascinating. It is a very interesting turn in the sketching literature and Qiang and Muthu's creativity really comes across. I am very interested in getting some real world uses out of this sketch and am excited to see what other applications we (and Qiang!) can think of. Many thanks to Muthu, Qiang Ma and Jeremie Lumbroso for all their help!

### Appendix

**Variability**: While the bounds on the accuracy of these algorithms seem wide to me, clearly in real world experiments we see much better performance than the guaranteed bounds in the paper. In fact, the error bounds in the paper are dependent on the size of the stream and not fixed.- Size of step: A potential gotcha is the size of the step variable. Depending on your update function it indeed seems possible for this value to get below -MAXINT. Clearly a real implementation would need some error checking.
**Cold Start**: One more gotcha is that you have no real way of knowing when you are near the quantile in question. For instance, starting your estimate at zero and measuring a true median which is 100,000,000,000 will take a long time to "catch up" to this value. There are a few ways to limit this, some of which are discussed in the paper. One way is to try and make a sane guess up front (especially if you know something about the distribution) and another is to start your guess with the value from the last counter. For instance, suppose you are in a monitoring situation and you are computing means over the course of a day. It would make sense to start tomorrow's estimate with yesterdays final value.**Accuracy**: And, lastly, there is some interesting dependence on "atomicity". Meaning, the estimates in some sense depend on how "large" the actual values are. In both, your minimum step size is 1. If my median in the stream is, say, 6 then this "atomic" update of 1 causes high relative fluctuation. It seems in real world examples you would like to balance the size of the thing you are estimating with the speed of approach of the algorithm. This could lead you to estimate latencies in microseconds rather than milliseconds for instance. All of this points to the fact that there are a bunch of real world engineering questions that need to be answered and thought about before you actually go and implement a million frugal sketches throughout your organization.