Sketch of the Day: Frugal Streaming

We are always worried about the size of our sketches. At AK we like to count stuff, and if you can count stuff with smaller sketches then you can count more stuff! We constantly have conversations about how we might be able to make our sketches, and subsequently our datastores, smaller. During our science summit, Muthu pointed us at some of the new work in Frugal Streaming. The concept of Frugal Streaming is to process your data as it comes, O(N), but severely limit the amount of memory you are using for your sketch, and by “severely” they mean using perhaps one or two pieces of information. Needless to say, we were interested!

History

The concept of Frugal Streaming reminds me of an algorithm for finding the dominant item in a stream, MJRTY written by Boyer and Moore in 1980. (This paper has a very interesting history). The MJRTY algorithm sets out to solve the problem of finding the majority element in a stream (an element comprising more than 50% of the stream). Moore proposed to solve this by using only 2 pieces of information and a single scan of the data.

Imagine you have a stream of names (“matt”, “timon”, “matt”, “matt”, “rob”, “ben”, …) and you wanted to know if any name appeared in more than half the stream. Boyer and Moore proposed the following:

count = 0
majority = ""

for val in stream:
if count == 0:
majority = val
count = 1
elif val == majority:
count += 1
else:
count -= 1

print majority if count > 0 else "no majority!"


If you’re anything like me you will go through a few phases: “That can’t work!”, “Wait, that works?!”, “Nah, this data will break that”, “Holy sh*t that works!!”. If you think about it for a minute you realize that it HAS to work. If any element in the stream comprises more than half the stream values there is no way to get to the end and have a counter of zero. To convince yourself suppose the majority element only comprises the first half + 1 of your N-length stream. The counter would count up to $N/2+1$ and then start decrementing until all N values have been seen, which would leave the counter at $2 = (N/2+1) - (N/2-1)$*. This will hold regardless of the ordering of the elements in the stream. A simple simulation is provided by the authors. Philippe Flajolet apparently “liked this algorithm very much and called it the ‘gang war’, because in his mind, every element is a gang member, and members of opposite gangs are paired in a standoff, and shoot each other. The gang members remaining are the more numerous”**.

The astute among you will have noticed that this algorithm only works if there is, in fact, a majority element. If there is not then it will fail. A stream such as {“matt”,”matt”,”timon”,”timon”,”rob”} would result in the algorithm returning “rob”. In practice you need ways of ensuring that your stream does indeed have a majority element or have a guarantee ahead of time.

* Note, that formula is for an even length stream. For a stream of odd length the counter will indeed be at 1. Proof is left to the reader.

** Thanks to Jeremie Lumbroso for his insightful feedback on the history of MJRTY and his memory of Philippe’s explanation.

One “bit” – Frugal-1U

In their Frugal Streaming paper, Qiang and Muthu decided to see if they could find a frugal solution to the streaming quantile problem. The streaming quantiles problem is one I enjoy quite a bit and I have used it as an interview question for potential candidates for some time. Simply stated it is: “How would you find the quantiles of a stream of numbers in O(N) with limited memory?” There are a few different approaches to solve this problem, with the most widely known probably being Count-Min Sketch. However, with Count-Min you need to keep your keys around in order to query the sketch. There are other approaches to this question as well.

Instead of focusing on quantiles generally, Qiang and Muthu’s first solution is a frugal way to find the median of a stream. As with MJRTY above, I’ll just write it down and see how you react:

median_est = 0
for val in stream:
if val > median_est:
median_est += 1
elif val < median_est:
median_est -= 1


Granted, the above is just for the median, where the stream is much longer than the value of the median, but it is so simple that I don’t think I would have ever considered this approach to the problem. The extension to quantiles is equally as shocking. If you are trying to find the 75th percentile of the data stream you do the same as above but increment up randomly 75% of the time and decrement down randomly 25% of the time:


quantile_75 = 0
for val in stream:
r = random()
if val > quantile_75 and r > 1 - 0.75:
quantile_75 += 1
elif val < quantile_75 and r > 0.75:
quantile_75 -= 1


As the paper states:

The trick to generalize median estimation to any $\frac{h}{k}$ -quantile estimation is that not every stream item seen will cause an update. If the current stream item is larger than estimation, an increment update will be triggered only with probability $\frac{h}{k}$. The rationale behind it is that if we are estimating $\frac{h}{k}$ -quantile, and if the current estimate is at stream’s true $\frac{h}{k}$ -quantile, we will expect to see stream items larger than the current estimate with probability $1-\frac{h}{k}$ .

Finding Quantiles With Two “bits”- Frugal-2U

There are a few obvious drawbacks to the above algorithm. Since we are only incrementing by 1 each time, the time to converge is linear and our initial guess of zero could be very bad. Secondly, and by design, the algorithm has no memory, can fluctuate wildly and, as they show in the paper, the estimates can drift very far away. (Note: while it is extremely unlikely that the estimates will drift far from the correct values the paper has some very loose bounds on how bad it can drift. See Lemma 3 in the paper.) They suggest a few improvements over Frugal-1U where you essentially include a varying step (instead of just incrementing by 1 every time) and 1 “bit” so you know which way you incremented in the last update. The intuition here is that if you have been consistently incrementing up or down for the last few elements of a stream then you are probably “far” away from the quantile in question. If this is the case we can speed up convergence time by incrementing a larger amount. The Frugal-2U algorithm:

def frugal_2u(stream, m = 0, q = 0.5, f = constantly_one):
step, sign = 1, 1

for item in stream:
if item > m and random() > 1 - q:
# Increment the step size if and only if the estimate keeps moving in
# the same direction. Step size is incremented by the result of applying
# the specified step function to the previous step size.
step += f(step) if sign > 0 else -1 * f(step)
# Increment the estimate by step size if step is positive. Otherwise,
# increment the step size by one.
m += step if step > 0 else 1
# Mark that the estimate increased this step
sign = 1
# If the estimate overshot the item in the stream, pull the estimate back
# and re-adjust the step size.
if m > item:
step += (item - m)
m = item
# If the item is less than the stream, follow all of the same steps as
# above, with signs reversed.
elif item < m and random() > q:
step += f(step) if sign < 0 else -1 * f(step)
m -= step if step > 0 else 1
sign = -1
if m < item:
step += (m - item)
m = item
# Damp down the step size to avoid oscillation.
if (m - item) * sign < 0 and step > 1:
step = 1



You can play around with the 1U and 2U algorithms in the simulation below.

Click above to run the Frugal Streaming simulation

As usual, there are a few interesting tidbits as well. If you read the paper you will see that they define the updates to step as a function. This means they are allowing many different types of increments to step. For example, instead of increasing the size of step by 1 each time we could increase it by 10 or even have it increase multiplicatively. They do talk about some uses of different updates to step but there is no analysis around this (yet) and they restrict all of the work in the paper to step increments of 1. We offer a few different step update functions in the simulation and they indeed do interesting things. Further exploration is definitely needed to get some insights here.

A non-obvious thing about the step variable is how it behaves under decrements. My initial thought was that step would get large if your current estimate was far below the actual value (thus allowing you to approach it faster from below) and that step would get to be a large negative number if your current estimate was very far above the actual value. This turns out to just be flat wrong. The negative updates to step have the effect of stabilizing the estimate (notice when step is negative that the updates to your estimates are always ± 1 ). If you read the algorithm closely you will see that step decrements when you consistently alternate up and down updates. This behavior occurs when the estimate is close to the actual value which causes step to become a very large negative number. And I mean VERY large. In practice we have seen this number as small as $-10^{102}$ for some simulations.

Monitoring

One of the first things I thought of when I saw this sketch was to use it as a monitoring tool. Specifically, perhaps it could be used to replace the monitoring we use on our application server response times. It turns out that using this sketch for monitoring introduces some very interesting issues. So far, we have mostly talked about what I will call “static streams”. These are streams that have data in them which is pulled consistently from some static underlying distribution (as in the examples above). However, what if the underlying distribution changes? For example, what if one of our servers all of the sudden starts responding with slower response times? Does this frugal sketch enable you to quickly figure out that something has broken and fire off an alarm with high confidence? Well, in some ways this is an identical problem to cold start: how long does it take for an initial guess to reach a stable quantile? Unfortunately, there is no easy way to know when you have reached “equilibrium” and this makes identifying when an underlying distribution change has occurred difficult as well. The paper ends with an open challenge:

… as our experiments and insights indicate, frugal streaming algorithms work with so little memory of the past that they are adaptable to changes in the stream characteristics. It will be of great interest to understand this phenomenon better.

The paper shows some interesting experiments with changing data, but I agree with Qiang: we need to understand these algorithms better before I swap out all of our server monitoring tools (and our ops team would kill me). However, since these are so simple to implement and so small, we can easily deploy a few tests and see how the results compare “in the field” (you can play around with this by changing the underlying stream distribution in our simulation above.)

Conclusion

The frugal quantile algorithms proposed in the paper are fascinating. It is a very interesting turn in the sketching literature and Qiang and Muthu’s creativity really comes across. I am very interested in getting some real world uses out of this sketch and am excited to see what other applications we (and Qiang!) can think of.  Many thanks to MuthuQiang Ma and Jeremie Lumbroso for all their help!

Appendix

• Variability: While the bounds on the accuracy of these algorithms seem wide to me, clearly in real world experiments we see much better performance than the guaranteed bounds in the paper. In fact, the error bounds in the paper are dependent on the size of the stream and not fixed.
• Size of step: A potential gotcha is the size of the step variable. Depending on your update function it indeed seems possible for this value to get below -MAXINT. Clearly a real implementation would need some error checking.
• Cold Start: One more gotcha is that you have no real way of knowing when you are near the quantile in question. For instance, starting your estimate at zero and measuring a true median which is 100,000,000,000 will take a long time to “catch up” to this value. There are a few ways to limit this, some of which are discussed in the paper. One way is to try and make a sane guess up front (especially if you know something about the distribution) and another is to start your guess with the value from the last counter. For instance, suppose you are in a monitoring situation and you are computing means over the course of a day. It would make sense to start tomorrow’s estimate with yesterdays final value.
• Accuracy:  And, lastly, there is some interesting dependence on “atomicity”. Meaning, the estimates in some sense depend on how “large” the actual values are. In both, your minimum step size is 1. If my median in the stream is, say, 6 then this “atomic” update of 1 causes high relative fluctuation. It seems in real world examples you would like to balance the size of the thing you are estimating with the speed of approach of the algorithm. This could lead you to estimate latencies in microseconds rather than milliseconds for instance. All of this points to the fact that there are a bunch of real world engineering questions that need to be answered and thought about before you actually go and implement a million frugal sketches throughout your organization.

1. I’m having some proxy server trouble accessing the paper, so responding solely to what’s written in this blog post…

It’s an interesting thought experiment, but are they proposing it’s useful in practice? I’d think if you can afford a scan over the data, you can also afford O(1/∈ log(∈N)) memory to get a reasonable error bound. And if you have a lot of data, it’s probably more important to have a technique to merge two summaries (to achieve parallelism) than it is to save those last few bits of memory. So the papers linked from “other approaches” seem more practical.

2. mutu says:

Hi Scott,

Thanks for the thoughts.

If you have memory to run one of the classic streaming algorithms for quantiles, that is great, and yes merging summary approach is valuable for parallel implementations. Our work addresses a different data point where say one wants to estimate quantiles for a large number of group-bys. Then, it may be unrealistic to ask for 1/\eps log\epsN space per group. Our work might apply to that scenario.

Going beyond, it seem slike an open problem to design a mergable summary approach to quantiles with sub-sub-sub 1/\eps log\epsN memory per group.
— Q+M

3. Anon says:

I love this kind of thing. Great explanation of a truly interesting algorithm. I wish I were closer to the source of these kinds of papers.

4. I’ve implemented the paper ( github.com/dgryski/go-frugal ) and was wondering if some sort of stochastic averaging would be a useful addition to my library: implement n=5 (say) frugal estimators and return the median as the ‘true’ estimate. Still less memory that a full quantile summary, but is there enough accuracy improvement to make it worthwhile?

• mattcurcio says:

Hey Damian,
It’s great how accessible this algorithm is and thanks for posting your implementation. I agree, there are lots of things you can think of to improve the accuracy of this approach. In talking with Qiang and Muthu about this I believe there are two reasons for not adding in extra stuff, such as averaging:

1. This algorithm is very much at the intersection of theoretical computer science and the actual implementations of such things. In the CS world the thought is only complete after a careful and thorough mathematical analysis (I believe we can thank Flajolet and Sedgewick for this). I have noticed some “obvious” extensions to algorithms go undocumented as the math is just too difficult to prove. Of course, that is good for us implementers as well, since we can have some real guarantees around our production software.

2. But, I think the real lack of any extensions here is Muthu’s problem statement for frugal streaming. He is only interested in solutions that are EXTREMELY limited in space. So, even storing n=5 would break that in his mind.

m

5. “This will hold regardless of the ordering of the elements in the stream.” is not true. As the stream is processed at most one gang will have members standing, and count will be the numbers of members it has standing (if you catch my abuse of the Flajolet image). If other gangs help kill members of other gangs (other gangs = not the majority gang) the number will come out higher.

A, A, B, B, C, C, C, C => (C,4)

• mattcurcio says:

Bart,
You are correct. I didn’t mean to state that the counter will always be at 1 or 2, but that is the worst case scenario (most adversarial). Recall MJRTY only provides an answer if there is, in fact, a majority element in the stream (an element comprising more than floor(N/2) + 1 occurrences). In the example stream you provide there is no majority element so the algorithm does not work. However, if you add in another C then there is a majority and the algorithm will work under any permutation.

Thanks,
m

6. There is an error in the damping part in both your code and in the paper. The damping occurs every time there is no sign switch and thus step is never ever greater than 1. The condition should be something like “if old-sign * sign 1”

• Q says:

Hi Jozef,
Thank you for your careful thought and comment.

You were correct, this is a typo for the damping statement in the pseudo code. They should be moved inside the if-else segments (right before updating the value of ‘sign’), as edited below:

for item in stream:
if item > m and random() > 1 – q:
step += f(step) if sign > 0 else -1 * f(step)
m += step if step > 0 else 1
if m > item:
step += (item – m)
m = item
if sign 1:
step = 1
sign = 1
elif item q:
step += f(step) if sign 0 else 1
if m 0 and step > 1:
step = 1
sign = -1

-Q

• Q says:

Sorry for the format issue in the reply above. The edit for ‘item’ larger than ‘m’ case, for instance, should be checking if ‘sign’ is less than zero and resetting step to 1 before assigning new value to ‘sign’ (line 14).
– Q

7. For the Frugal-1U version, is it any different to use:

quantile_75 = 0
for val in stream:
if val > quantile_75:
quantile_75 += 0.75
elif val < quantile_75:
quantile_75 -= 0.25

For a large stream, it should be the same, no? And you avoid generating a random number. If you insist on integer math, simply add 75 and subtract 25 and then divide by 100 at the end…. Does this not work for some reason?

8. There is now an updated release of the paper with above fixes incorporated. It is available on arxiv.org: http://arxiv.org/abs/1407.1121

9. Paul Chernoch says:

Wonderful post! I just read a bunch of papers on Count-Min Sketch and for the life of me could not figure out how you could do the things they said you could without also storing the keys. Thank you for pointing out that you in fact do have to store the keys.