## Open Source Release: java-hll

We’re happy to announce our newest open-source project, java-hll, a HyperLogLog implementation in Java that is storage-compatible with the previously released postgresql-hll and js-hll implementations. And as the rule of three dictates, we’ve also extracted the storage specification that makes them interoperable into it’s own repository. Currently, all three implementations support reading storage specification v1.0.0, while only the PostgreSQL and Java implementations fully support writing v1.0.0. We hope to bring the JS implementation up to speed, with respect to serialization, shortly.

## AK at re:Invent 2013

I was given the opportunity to speak at AWS re:Invent 2013, on Wednesday, Nov. 13th, about how we extract maximum performance, across our organization, from Redshift. You can find the slides here, and my voice-over, though not identical, is mostly captured by the notes in those slides.

The talk, in brief, was about the technical features of Redshift that allow us to write, run, and maintain many thousands of lines of SQL that run over hundreds of billions of rows at a time. I stayed away from nitty-gritty details in the talk, since I only had 15 minutes, and tried to focus on high-level take-aways:

• SQL is code, so treat it like the rest of your code. It has to be clean, factored, and documented. Use CTEs to break queries into logical chunks. Use window functions to express complex ideas succinctly.
• Redshift is an MPP system with fast IO, fast sorting, and lots of storage. Use this to your advantage by storing multiple different sort orders of your data if you have different access patterns. Materialize shared intermediates so many queries can take advantage of them.
• Redshift has excellent integration with S3. Use the fat pipes to cheaply materialize query intermediates for debugging. Use the one-click snapshot feature to open up experimentation with schema, data layout, and column compression. If it doesn’t work out, you revert to your old snapshot.
• Use the operational simplicity of Redshift to be frugal. Turn over the responsibility of managing cluster lifecycles to the people that use them. For instance, devs and QA rarely need their clusters when the workday is done. The dashboards are such a no-brainer that it’s barely a burden to have them turn off and start up their own clusters. Allow users to take responsibility for their cluster, and they will become more responsible about using their cluster.
• Use the operational simplicity of Redshift to be more aware. With just a few clicks, you can launch differently sized clusters and evaluate your reports and queries against all of them. Quantify the cost of your queries in time and money.
• It’s a managed service: stop worrying about nodes and rows and partitions and compression! Get back to business value:
• How long does the customer have to wait?
• How much does this report cost?
• How do I make my staff more productive?
• How do I minimize my technical debt?

## Slides and Videos are up for AK Data Science Summit

We’re happy to announce that all the talks (slides and video) from our Data Science Summit are available for free, right here on the blog! You can find a full list of the conference’s content here.

Enjoy!

## Data Science Summit – Update

I don’t think I’m going out on a limb saying that our conference last week was a success. Thanks to everyone who attended and thanks again to all the speakers. Muthu actually beat us to it and wrote up a nice summary. Very kind words from a great guy. For those of you that couldn’t make it (or those that want to relive the fun) we posted the videos and slides. Thanks again to everyone for making it such a success!

Muthu being Muthu during David Woodruff’s fantastic talk on streaming linear algebra

## Open Source Release: js-hll

One of the first things that we wanted to do with HyperLogLog when we first started playing with it was to support and expose it natively in the browser. The thought of allowing users to directly interact with these structures — perform arbitrary unions and intersections on effectively unbounded sets all on the client — was exhilarating to us. We knew it could be done but we simply didn’t have the time.

Fast forward a few years to today. We had finally enough in the meager science/research budget to pick up an intern for a few months and as a side project I tasked him with turning our dream into a reality. Without further ado, we are pleased to announce the open-source release of AK’s HyperLogLog implementation for JavaScript, js-hll. We are releasing this code under the Apache License, Version 2.0 matching our other open source offerings.

We knew that we couldn’t just release a bunch of JavaScript code without allowing you to see it in action — that would be a crime. We passed a few ideas around and the one that kept bubbling to the top was a way to kill two birds with one stone. We wanted something that would showcase what you can do with HLL in the browser and give us a tool for explaining HLLs. It is typical for us to explain how HLL intersections work using a Venn diagram. You draw some overlapping circles with a broder that represents the error and you talk about how if that border is close to or larger than the intersection then you can’t say much about the size of that intersection. This works just ok on a whiteboard but what you really want is to just build a visualization that allows you to select from some sets and see the overlap. Maybe even play with the precision a little bit to see how that changes the result. Well, we did just that!

Click above to interact with the visualization

Note: There’s more interesting math in the error bounds that we haven’t explored. Presenting error bounds on a measurement that cannot mathematically be less than zero is problematic. For instance, if you have a ruler that can only measure to 1/2″ and you measure an object that truly is 1/8″ long you can say “all I know is this object measures under 0.25 inches”. Your object cannot measure less than 0 inches, so you would never say 0 minus some error bound. That is, you DO NOT say 0.0 ± 0.25 inches.  Similarly with set intersections there is no meaning to a negative intersection. We did some digging and just threw our hands up and tossed in what we feel are best practices. In the js-hll code we a) never show negative values and b) we call “spurious” any calculation that results in an answer within 20% of the error bound. If you have a better answer, we would love to hear it!

## Conference Agenda – 9:15 am until 5:30 pm at 111 Minna Gallery

We’ve finalized the agenda for the “Data Science Summit” at 111 Minna. See here for details and the schedule of events. The first talks will start at 10am and we will try to wrap up around 5:30pm. We ask that you start showing up around 9:15 or so to ensure we are ready to start on time. Thanks again to our co-sponsors, Foundation Capital.

Update! We posted the videos and slides for those of you that couldn’t make it (or those that want to relive the fun). Enjoy!

## Open Source Release: js-murmur3-128

As you can imagine from of all of our blog posts about hashing that we hash a lot of things. While the various hashing algorithms may be well-defined, the devil is always in the details especially when working with multiple languages that have different ways of representing numbers. We’re happy to announce the open-source release of AK’s 128bit Murmur3 implementation for JavaScript, js-murmur3-128. We are releasing this code under the Apache License, Version 2.0 matching our other open source offerings.

### Details

The goal of the implementation is to produce a hash value that is equivalent to the C++ and Java (Guava) versions for the same input and it must be usable in the browser. (Full disclosure: we’re still working through some signed/unsigned issues between the C++ and Java/JavaScript versions. The Java and JavaScript versions match exactly.)

### Usage

Java (Guava):

final int seed = 0;
final byte[] bytes = { (byte)0xDE, (byte)0xAD, (byte)0xBE, (byte)0xEF,
(byte)0xFE, (byte)0xED, (byte)0xFA, (byte)0xCE };
.putBytes(bytes)
.hash();
System.err.println(hashCode.asLong());


JavaScript:

var seed = 0;
var rawKey = new ArrayBuffer(8);
var byteView = new Int8Array(rawKey);
byteView[0] = 0xDE; byteView[1] = 0xAD; byteView[2] = 0xBE; byteView[3] = 0xEF;
byteView[4] = 0xFE; byteView[5] = 0xED; byteView[6] = 0xFA; byteView[7] = 0xCE;
console.log(murmur3.hash128(rawKey, seed));


## Foundation Capital and Aggregate Knowledge Sponsor Streaming/Sketching Conference

We, along with our friends at Foundation Capital, are pleased to announce a 1 day mini-conference on streaming and sketching algorithms in Big Data.  We have gathered an amazing group of speakers from academia and industry to give talks.  If you are a reader of this blog we would love to have you come!  The conference will be on 6/20 (Thursday) from 10 AM to 5:30 PM at the 111 Minna Gallery in San Francisco and attendance is limited. Breakfast and lunch included!

There will also be a happy hour afterwards if you cannot make the conference or just want a beer.

The speaker list includes:

##### Muthu Muthukrishnan

The Count-Min Sketch, 10 Years Later

The Count-Min Sketch is a data structure for indexing data streams in very small space. In a decade since its introduction, it has found many uses in theory and practice, with data streaming systems and beyond. This talk will survey the developments.

Muthu Muthukrishnan is a Professor at Rutgers University and a Research Scientist at Microsoft, India. His research interest is in development of data stream theory and systems, as well as online advertising systems.

##### David P. Woodruff

Sketching as a Tool for Numerical Linear Algebra

The talk will focus on how sketching techniques from the data stream literature can be used to speed up well-studied algorithms for problems occurring in numerical linear algebra, such as least squares regression and approximate singular value decomposition. It will also discuss how they can be used to achieve very efficient algorithms for variants of these problems, such as robust regression.

David Woodruff joined the algorithms and complexity group at IBM Almaden in 2007 after completing his Ph.D. at MIT in theoretical computer science. His interests are in compressed sensing, communication, numerical linear algebra, sketching, and streaming.

##### Sam Ritchie

Summingbird is a platform for streaming map/reduce used at Twitter to build aggregations in real-time or on hadoop. When the programmer describes her job, that job can be run without change on Storm or Hadoop. Additionally, summingbird can manage merging realtime/online computations with offline batches so that small errors in real-time do not accumulate. Put another way, summingbird gives eventual consistency in a manner that is easy for the programmer to reason about.

Sam Ritchie works on data analysis and infrastructure problems in Twitter’s Revenue engineering team. He is co-author of a number of open-source Scala and Clojure libraries, including Bijection, Algebird, Cascalog 2 and ElephantDB. He holds a bachelor’s degree in mechanical and aerospace engineering.

##### Alexandr Andoni

Similarity Search Algorithms

Nearest Neighbor Search is an ubiquitous problem in analyzing massive datasets: its goal is to process a set of objects (such as images), so that later, one can find the object most similar to a given query object. I will survey the state-of-the-art for this problem, starting from the (Kanellakis-award winning) technique of Locality Sensitive Hashing, to its more modern relatives, and touch upon connection to the theory of sketching.

Alexandr Andoni is a researcher in the Microsoft Research at Silicon Valley since 2010, after finishing his PhD in MIT’s theory group and year-long postdoctoral position at Princeton University. His research interests revolve around algorithms for massive datasets, including similarity search and streaming/sublinear algorithms, as well as theoretical machine learning.

There will be a panel discussion on the topic of harboring research in startups. Speakers include:
Pete Skomoroch from the LinkedIn data science team.
Rob Grzywinski of Aggregate Knowledge.
Joseph Turian of Metaoptimize

##### Lightning talks
• Armon Dadgar (Kiip) - Sketching Data Structures at Kiip
• Blake Mizerany (Heroku) - An Engineer Reads a Data Sketch
• Timon Karnezos (AK) - TBD
• Jérémie Lumbroso (INRIA) - Philippe Flajolet’s Contribution to Streaming Algorithms

Update! We posted the videos and slides for those of you that couldn’t make it (or those that want to relive the fun). Enjoy!

## Call for Summer Interns

AK is looking for a summer intern in our R&D group. If any of our blog posts have interested you, then you’ll fit right in!

We’re looking for someone who has a good handle on a few programming languages (pick any two from R/Mathematica/Python/Javascript/Java) and has some math in their background — college-level calculus or algebra is plenty. Ideally, you’re interested in learning about:

• building and tuning high-performance data structures,
• streaming algorithms,
• interesting data visualizations, and

It’s OK if you’ve never seen the stuff we write about on the blog before! We didn’t either until we started researching them!

I can’t emphasize this enough: we don’t expect you to know how to do the things above yet. We simply expect you to have a passion for learning about them and the diligence to work through what (at the time) seem like impossible problems. Work experience is nice, but not necessary. As long as you can write clean code and can work hard, you’re well-qualified for this job.

If you’re interested, please send a brief note about why you’re interested, along with a CV and/or GitHub username to timon at aggregateknowledge dot com. For extra credit, please submit one (or more!) of the following:

• An implementation of HLL, Count-Min Sketch, K-Min Values, or Distinct Sampling in a language of your choice.
• An extension to Colin’s blog post about a good hash function that adds CityHash and SipHash to the shoot-out.
• An explanation of the tradeoffs between using a hash map and Count-Min Sketch for counting item frequency.

(I feel like I shouldn’t have to say this, but yes, these are all answered somewhere on the internet. Don’t plagiarize. What we want is for you to go learn from them and try your own hand at implementing/experimenting. Also, don’t freak out, these are extra credit!)

## Doubling the Size of an HLL Dynamically – Extra Bits…

Author’s Note: This post is related to a few previous posts on the HyperLogLog algorithm.  See Matt’s overview of the algorithm, and see this for an overview of “folding” or shrinking HLLs in order to perform set operations. It is also the final post in a series on doubling the number of bins in HLLs. The first post dealt with the recovery time after doubling, and the second dealt with doubling’s accuracy when taking unions of two HLLs.

### Introduction

The main draw to the HyperLogLog algorithm is its ability to make accurate cardinality estimates using small, fixed memory.  In practice, there are two choices a user makes which determine how much memory the algorithm will use: the number of registers (bins) and the size of each register (how high they can count).  As Timon discussed previously, increasing the size of each register will only increase the accuracy if the true cardinality of the stream is HUGE.

Recall that HyperLogLog (and most other streaming algorithms) is designed to work with a fixed number of registers, $m$, which is chosen as a function of the expected cardinality to approximate. We track a great number of different cardinality streams and in this context it is useful for us to not have one fixed value of $m$, but to have this evolve with the needs of a given estimation.

We are thus confronted with many engineering problems, some of which we have already discussed. In particular, one problem is that the neat feature of sketches, namely that they allow for an estimate of the cardinality of the union of multiple streams at no cost, depends on having sketches of the same size.

We’ve discussed how to get around this by folding HLLs, though with some increase in error. We’ve also explored a few options on how to effectively perform a doubling procedure. However, we started to wonder if any improvements could be made by using just a small amount of extra memory, say an extra bit for each register. In this post we will discuss one such idea and its use in doubling. Note: we don’t talk about quadrupling or more. We limit ourselves to the situation where HLL sketches only differ in $m$‘s by 1.

### The Setup

One of the downfalls in doubling is that it there is no way to know, after doubling, whether a value belongs in its bin or its partner bin. Recall that a “partner bin” is the register that could have been used had our “prefix” (the portion of the hashed value which is used to decide which register to update) been one bit longer. If the binary representation of the bin index used only two bits of the hashed value, e.g. $01$, then in an HLL that used a three-bit index, the same hashed value could have been placed in the bin whose index is either $101$ or $001$. Since $001$ and $01$ are the same number, we call $101$ the “partner bin”. (See the “Key Processing” section in Set Operations On HLLs of Different Sizes).

Consider an example where we have an HLL with $2^{10}$ bins.The $k^{th}$ bin has the value 7 in it, and after doubling we guess that its partner bin, at index $(2^{10} + k)^{th}$, should have a 5 in it. It is equally likely that the $k^{th}$ bin should have the 5 in it and the $(2^{10}+k)^{th}$ bin should have the 7 in it (since the “missing” prefix bit could have been a 1 or a 0)! Certainly the arrangement doesn’t change the basic cardinality estimate, but once we start getting involved with unions, the arrangement can make a very large difference.

To see how drastic the consequences can be, let’s look at a simple example. Suppose we start with an HLL with 2 bins and get the value 6 in each of its bins. Then we run the doubling procedure and decide that the partner bins should both have 1′s in them. With this information, it is equally likely that both of the arrangements below, “A” and “B”, could be the “true” larger HLL.

Further suppose we have some other data with which we wish to estimate the union. Below, I’ve diagrammed what happens when we take the union.

Arrangement A leads to a cardinality estimate (of the union) of about 12 and Arrangement B leads to a cardinality estimate (of the union) of about 122. This is an order of magnitude different! Obviously not all cases are this bad, but this example is instructive. It tells us that knowing the true location of each value is very important. We’ve attempted to improve our doubling estimate by keeping an extra bit of information as we will describe below.

### The Algorithm

Suppose we have an HLL with $m$ bins. Let’s keep another array of data which holds $m$ total bits, one for each bin — we will call these the “Cached Values.” For each bin, we keep a 0 if the value truly belongs in the bin in which it was placed (i.e. if, had we run an HLL with $2m$ bins, the value would have been placed in the first $m$ bins in the HLL), and we keep a 1 if the value truly belongs in the partner bin of the one in which it was placed (i.e. if, had we run an HLL with $2m$ bins, the value would have been placed in the last $m$ bins). See the image below for an example. Here we see two HLLs which have processed the same data. The one on the left is half the size and collects the cached values as it runs on the data. The one on the right is simply the usual HLL algorithm run on the same data.

Looking at the first row of the small HLL (with $m$ bins), the $0$ cache value means that the 2 “belongs” in the top half of the large HLL, i.e. if we had processed the stream using a larger HLL the 2 would be in the same register. Essentially this cached bit allows you to know exactly where the largest value in a bin was located in the larger HLL (if the $i^{th}$ bin has value $V$ and cached value $S$, we place the value $V$ in the $S * 2^{\log{m}} + i$ = $(S\cdot m + i)^{th}$ bin).

In practice, when we double, we populate the doubled HLL first with the (now correct location) bin values from the original HLL then we fill the remaining bins by using our “Proportion Doubling” algorithm.

Before we begin looking at the algorithm’s performance, let’s think about how much extra space this requires. In our new algorithm, notice that for each bin, we keep around either a zero or a one as its cached value. Hence, we require only one extra bit per bin to accommodate the cached values. Our implementation of HLL requires 5 bits per bin, since we want to be able to include values up to $2^5 -1= 31$ in our bins. Thus, a standard HLL with $m$ bins, requires $5m$ bits. Hence, this algorithm requires $5m + m = 6m$ bits (with the extra $m$ bins representing the cached values). This implies that this sketch requires 20% more space.

### The Data

Recall in the last post in this series, we explored doubling with two main strategies: Random Estimate (RE) and Proportion Doubling (PD). We did the same here, though using the additional information from this cached bit. We want to know a few things:

• Does doubling using a cache bit work? i.e. is it better to fold the bigger one or double the smaller one when comparing HLL’s of different sizes?
• Does adding in a cache bit change which doubling strategy is preferred (RE or PD)?
• Does the error in union estimate depend on intersection size as we have seen in the past?

Is it better to double or fold?

For each experiment we took 2 sets of data (each generated from 200k random keys) and estimated the intersection size between them using varying methods.

• “Folded”: estimate by filling up an HLL with $log_{2}(m) = 10$ and  comparing it to a folded HLL starting from $log_{2}(m) = 11$ and folded down $log_{2}(m) = 10$
• “Large”: estimate by using two HLL’s of a larger HLL of $log_{2}(m) = 11$.  This is effectively a lower bound for our doubling approaches.
• “Doubled – PD”: estimate by taking an HLL of $log_{2}(m) = 10$ and double it up to $log_{2}(m) = 11$ using the Proportion Doubling strategy.  Once this larger HLL is approximated we estimate the intersection with another HLL of native size $log_{2}(m) = 11$
• “Doubled – RE”: estimate by taking an HLL of $log_{2}(m) = 10$ and doubling up to $log_{2}(m) = 11$ using Random Estimate strategy.

We performed an experiment 300 times at varying intersection sizes from 0 up to 200k (100%) overlapping elements between sets (in steps of 10k). The plots below show our results (and extrapolate between points).

The graph of the mean error looks pretty bad for Random Estimate doubling. Again we see that the error depends heavily on the intersection size and becomes more biased as the set’s overlap more. On the other hand, Proportion Doubling was much more successful  (recall that this strategy forces the proportion of bins in the to-be-doubled HLL and the HLL with which we will union it to be equal before and after doubling.)  It’s possible there is some error bias with small intersections but we would need to run more trials to know for sure. As expected, the “Folded” and the “Large” are centered around zero. But what about the spread of the error?

The Proportion Doubling strategy looks great! In my last post on this subject, we found that this doubling strategy (without the cached part) really only worked well in the large intersection size regime, but here, with the extra cache bits, we seem to avoid that. Certainly the large intersection regime is where the standard deviation is lowest, but for every intersection size, it is significantly lower than that of the smaller HLL. This suggests that one of our largest sources of error when we use doubling in conjunction with unions is related to our lack of knowledge of the arrangement of the bins (i.e. when doubling, we do not know which of the two partner bins gets the larger, observed value). So it appears that the strategy of keeping cache bits around does indeed work, provided you use a decent doubling scheme.

Interestingly, it is always much better to double a smaller cache HLL than to fold a larger HLL when comparing sketches of different sizes. This is represented above by the lower error of the doubled HLL than the small HLL. The error bounds do seem to depend on the size of the intersection between the two sets but this will require more work to really understand how, especially in the case of Proportion Doubling.

Notes:  In this work we focus solely on doubling a HLL sketch and then immediately using this new structure to compute set operations. It would be interesting to see if set operation accuracy changes as a doubled HLL goes through its “recovery” period under varying doubling methods. It is our assumption that nothing out of the ordinary would come of this, but we definitely could be wrong. We will leave this as an exercise for the reader.

### Summary

We’ve found an interesting way of trading space for accuracy with this cached bit method, but there are certainly other ways of using an extra bit or two (per bucket). For instance, we could keep more information about the distribution of each bin by keeping a bit indicating whether or not the bin’s value minus one has been seen. (If the value is $k$, keep track of whether $k-1$ has shown up.)

We should be able to use any extra piece of information about the distribution or position of the data to help us obtain a more accurate estimate. Certainly, there are a myriad of other ideas ways of storing a bit or two of extra information per bin in order to gain a little leverage — it’s just a matter of figuring out what works best. We’ll be messing around more with this in the coming weeks, so if you have any ideas of what would work best, let us know in the comments!

(P.S. A lot of our recent work has been inspired by Flajolet et al.’s paper on PCSA – check out our post on this here!)

Thanks to Jeremie Lumbroso for his kind input on this post. We are much indebted to him and hopefully you will see more from our collaboration.