HLL talk at SFPUG

I had the pleasure of speaking at the SF PostgreSQL User Group’s meetup tonight about sketching, the history of HLL, and our implementation of HLL as a PG extension. My slides are embedded below and you can get a PDF copy here. Be sure to click the gear below to show speaker’s notes for context!

If video is made available, I’ll post an update with a link!

Riding with the Stars: Passenger Privacy in the NYC Taxicab Dataset

In my previous post, Differential Privacy: The Basics, I provided an introduction to differential privacy by exploring its definition and discussing its relevance in the broader context of public data release. In this post, I shall demonstrate how easily privacy can be breached and then counter this by showing how differential privacy can protect against this attack. I will also present a few other examples of differentially private queries.

The Data

There has been a lot of online commentary recently about a dataset released by the New York City Taxi and Limousine Commission. It contains details of every yellow-cab ride in New York in 2013, including the pickup and drop-off times and locations, the fare and tip amounts, and anonymized (hashed) versions of the taxi’s license and medallion numbers. It was obtained via a FOIL (Freedom of Information Law) request earlier this year and has been making waves in the hacker community ever since.

The release of this data in such an unalloyed format raises several privacy concerns. The most well-documented of these deals with the hash function used to “anonymize” the license and medallion numbers. A bit of lateral thinking from one civic hacker and the data was completely de-anonymized. This data can now be used to calculate, for example, any driver’s annual income. More disquieting, though, in my opinion, is the privacy risk to passengers. With only a small amount of auxiliary knowledge, an attacker could use this dataset to identify where an individual went, how much they paid, their weekly habits, and more. I will demonstrate how easy this is to do in the following section.

Violating Privacy

Let’s consider some of the different ways in which this dataset can be exploited. If I knew an acquaintance or colleague had been in New York last year, I could combine this dataset with known information about their whereabouts to track their movements for my own personal advantage. Maybe they filed a false expense report? How much did they tip? Did they go somewhere naughty? This can be extended to people I don’t know – a savvy paparazzo could track celebrities in this way, for example.

There are other ways to go about this too. Simply focusing the search on an embarrassing night spot, for example, opens the door to all kinds of information about its customers, such as name, address, marital status, etc. Don’t believe me? Keep reading…

Stalking celebrities

First things first. How might I track a person? Well, to zero in on a particular trip, I can use any combination of known characteristics that appear in the dataset, such as the pickup or drop-off coordinates or datetime, the medallion or license number, or even the fare amount from a receipt. Being the avid fanboy that I am (note: sarcasm), I thought it might be interesting to find out something new about some of the celebrities who had been seen in New York in 2013. In particular, where did they go to / come from, and how much did they tip?

In order to do this, I spent some of the most riveting hours of my professional career searching through images of “celebrities in taxis in Manhattan in 2013” to find enough information to identify the correct record in the database. I had some success – combining the photos below of Bradley Cooper and Jessica Alba with some information from celebrity gossip blogs allowed me to find their trips, which are shown in the accompanying maps.

Bradley Cooper (Click to Explore)

Jessica Alba (Click to Explore)

In Brad Cooper’s case, we now know that his cab took him to Greenwich Village, possibly to have dinner at Melibea, and that he paid $10.50, with no recorded tip. Ironically, he got in the cab to escape the photographers! We also know that Jessica Alba got into her taxi outside her hotel, the Trump SoHo, and somewhat surprisingly also did not add a tip to her $9 fare. Now while this information is relatively benign, particularly a year down the line, I have revealed information that was not previously in the public domain. Considering the speculative drivel that usually accompanies these photos (trust me, I know!), a celebrity journalist would be thrilled to learn this additional information.

A few innocent nights at the gentlemen’s club

But OK, perhaps you’re not convinced. After all, this dataset is (thankfully) not real-time. How about we leave the poor celebrities alone and consider something a little more provocative? Larry Flynt’s Hustler Club is in a fairly isolated location in Hell’s Kitchen, and no doubt experiences significant cab traffic in the early hours of the morning. I ran a query to pull out all pickups that occurred outside the club after midnight and before 6am, and mapped the drop-off coordinates to see if I could pinpoint individuals who frequented the establishment. The map below shows my results – the yellow points correspond to drop-offs that are closely clustered, implying a frequent customer.

Click to Explore

The potential consequences of this analysis cannot be overstated. Go ahead, zoom in. You will see that the GPS coordinates are terrifyingly precise. Using this freely-obtainable, easily-created map, one can find out where many of Hustler’s customers live, as there are only a handful of locations possible for each point. Add a little local knowledge, and, well, it’s not rocket science. “I was working late at the office” no longer cuts it: Big Brother is watching.

Even without suspicions or knowledge of the neighborhood, I was able to pinpoint certain individuals with high probability. Somewhat shockingly, just googling an address reveals all kinds of information about its inhabitants. Take the following example:

Examining one of the clusters in the map above revealed that only one of the 5 likely drop-off addresses was inhabited; a search for that address revealed its resident’s name. In addition, by examining other drop-offs at this address, I found that this gentleman also frequented such establishments as “Rick’s Cabaret” and “Flashdancers”. Using websites like Spokeo and Facebook, I was also able to find out his property value, ethnicity, relationship status, court records and even a profile picture!

Of course, I will not publish any of this information here, but this was by no means a unique situation. While the online availability of all this potentially private information is part of a wider discussion, it’s fair to say that this guy has a right to keep his nighttime activities a secret.

To reiterate: the findings in this section were not hard to uncover. Equipped with this dataset, and just a little auxiliary information about you, it would be quite trivial for someone to follow your movements, collecting data on your whereabouts and habits, while you remain blissfully unaware. A stalker could find out where you live and work. Your partner may spy on you. A thief could work out when you’re away from home, based on your habits. There are obvious mitigating factors here, such as the population density of Manhattan and the time delay of the data release, but the point still stands.

Applying Differential Privacy

So, we’re at a point now where we can agree this data should not have been released in its current form. But this data has been collected, and there is a lot of value in it – ask any urban planner. It would be a shame if it was withheld entirely.

Enter differential privacy. Sure, the data could be anonymized, but that obviously didn’t work out too well. More care could be taken, for example hashes could be salted as suggested in this forum. But that still doesn’t really protect against either of the attacks above. If you read my first post, you will know that only differential privacy guarantees the protection of all individuals in the data.

So how should we go about applying differential privacy? Remember that differential privacy works by adding noise to the query. Our three maps above are point queries: they filter the data rather than aggregating it. This means we have to be extra careful. We cannot simply run the query and add noise to the result (as we might do with an aggregation query) since this will release private information! To understand why, suppose we framed our question like this: “How many taxis picked up a passenger outside The Greenwich Hotel at 19:35 on July 8, 2013?” By running our original query, even after adding noise to the output coordinates, we have indirectly answered this question accurately. This constitutes a privacy breach, as information about specific individuals can be learned in this way.

Instead, we have to turn our query into an aggregation. If we place a grid over our map, and count the number of output coordinates that fall into each cell, we end up with a set of counts that are generally independent of each other. Differential privacy is now straightforward to apply, as we are working with aggregate quantities. The privatized versions of these three queries are displayed below. Please refer to the appendix for a more technical discussion of how this was done.
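
To make the grid-and-count idea concrete, here is a minimal sketch in Python (assuming numpy; the function name, the choice of 50 bins, and the column handling are mine for illustration, not the actual code behind the maps below): it buckets the drop-off coordinates into a 2D histogram and adds Laplace noise with scale 1/ε to each cell, since each count has sensitivity 1.

import numpy as np

def privatized_dropoff_grid(latitudes, longitudes, epsilon, bins=50):
    # Turn the point query's output into an aggregation: count drop-offs per grid cell.
    counts, lat_edges, lon_edges = np.histogram2d(latitudes, longitudes, bins=bins)
    # Each rider contributes at most 1 to a single cell, so the sensitivity is 1 and
    # Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
    noisy_counts = counts + np.random.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
    return noisy_counts, lat_edges, lon_edges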

These privatized maps highlight the importance of ε, the privacy parameter, which I introduced in Differential Privacy: The Basics. When ε is low we can no longer accurately track either celebrity, nor learn how much they spent; in fact, it takes unreasonably high levels of ε to reliably locate them. We could opt for a finer grid (with a fine enough grid we could all but replicate the point query), but differential privacy’s Laplace mechanism is robust enough to ensure that the noise obfuscates any actual difference in the data. The privatization of the Hustler query is more interesting – since it is less targeted, the difference caused by the participation of one individual is less pronounced. As a result, there is more information in the privatized output – for example, the square over Wall Street still stands out, which echoes the actual evidence shown above.

Cooper, Privatized
(Click to Interact)

Alba, Privatized
(Click to Interact)

Hustler Customers, Privatized
(Click to Interact)

What about other queries? After all, did we just turn the data into nonsense? Clearly, our ability to learn what we sought from these queries was severely compromised. However, since they were so targeted, one could argue it is a good thing that the results are hard to interpret. The visualization below shows the results of other queries, and demonstrates that useful insights may still be extracted, particularly when aggregating over all records.

Click to Interact

Concluding Remarks

Now that you’re an expert in differential privacy, I urge you to consider: what should have been released by the NYC Taxi and Limousine Commission? How should the public be able to access it? Is this in line with Freedom of Information Law? If not, how should the law be changed to accommodate private data release?

It is only by considering these questions that we can hope to motivate a change in practices. With data privacy and security mishaps cropping up in the news almost daily, the topic has never before received so much attention. This post is yet another reminder of the fallibility of our systems and practices. As we know, differential privacy, with its strong guarantees – borne out of its sturdy fundamentals – offers a solution to many of these concerns.

As data scientists, it is in our best interests to encourage free data flow in our society, and lead the charge to ensure participants in these datasets are protected. The science is clearly there, and will continue to be developed in the coming years. The onus is now on industry and governments to pay attention and embrace differential privacy to protect their stakeholders and citizens.

Appendices

SQL queries used in this analysis

Tracking Bradley Cooper and Jessica Alba:

-- Trip 1: filter on the pickup window and pickup coordinates.
SELECT D.dropoff_latitude, D.dropoff_longitude, F.total_amount, F.tip_amount
FROM tripData AS D, tripFare AS F
WHERE D.hack_license = F.hack_license AND D.pickup_datetime = F.pickup_datetime
  AND D.pickup_datetime > "2013-07-08 19:33:00" AND D.pickup_datetime < "2013-07-08 19:37:00"
  AND D.pickup_latitude > 40.719 AND D.pickup_latitude < 40.7204
  AND D.pickup_longitude > -74.0106 AND D.pickup_longitude < -74.01;

-- Trip 2: filter on the drop-off window and drop-off coordinates.
SELECT D.dropoff_latitude, D.dropoff_longitude, F.total_amount, F.tip_amount
FROM tripData AS D, tripFare AS F
WHERE D.hack_license = F.hack_license AND D.pickup_datetime = F.pickup_datetime
  AND D.dropoff_datetime > "2013-09-07 12:19:00" AND D.dropoff_datetime < "2013-09-07 12:25:00"
  AND D.dropoff_latitude > 40.727 AND D.dropoff_latitude < 40.728
  AND D.dropoff_longitude > -73.994 AND D.dropoff_longitude < -73.993;

Identifying Hustler customers:

-- Pickups in the block outside the club, between midnight and 6am,
-- keeping only longer trips (trip_distance > 5).
SELECT dropoff_latitude, dropoff_longitude
FROM tripData
WHERE pickup_latitude > 40.767249 AND pickup_latitude < 40.768
  AND pickup_longitude > -73.996782 AND pickup_longitude < -73.995538
  AND HOUR(pickup_datetime) < 6
  AND trip_distance > 5;

Privacy calculations

I have chosen to use the Laplace mechanism to privatize the above 3 queries, as it is relatively easy to apply and explain. However, in general, the Laplace mechanism is not appropriate for geospatial data. For example, it does not consider topographical features – inspection of the privatized maps shows positive counts in the middle of the East River! There are more sophisticated methods for adding noise to spatial data – Graham Cormode’s paper, Differentially Private Spatial Decompositions, offers a more thorough mechanism, while this blog covers the topic more generally. However, I will proceed with Laplace, in the hope that, by moving away from discussions of complicated mechanisms, the core tenets of differential privacy may be grasped more easily.

So how should we go about privatizing the above queries using the Laplace mechanism? As mentioned above, these are point queries. We should not apply noise directly to the output, because our sensitivity is essentially undefined. Why? Let’s consider the identity query, Q(D) = D. This is a point query – there is no aggregation. What happens if we remove a record from the dataset? We have Q(D’) = D’. Fine. But now look at what happens to our sensitivity. Recall the formula:

\Delta \mathit{f}=\underset{D,D'}{max}||\mathit{Q(D)}-\mathit{Q(D')}||_{1}

This formula is only defined when Q() returns a vector of fixed length (independent of D), such as an average or a sum. With Q(D) = D, this is of course not the case, so we cannot calculate the sensitivity. Intuitively, the same problem arises for other point queries: the size and content of the output depend on the data itself. So we need to find another way to privatize point queries.

We know that privatizing aggregation queries is a lot more straightforward. Is there a way to turn our point query into an aggregation query? What if we bucketed the data into similar groups and then counted how many records returned by the query fall into each group? If we define our groups tightly enough, we would fully replicate the point query. We end up with a series of independent counts, which is easily privatized – the change from removing one record, any record, is at most 1. So this is our sensitivity. Adding Laplace noise with a sensitivity of 1 to the count in each bucket guarantees differential privacy.

Of course, if we have too many buckets, we will have lots of counts of 0 and 1. Since Laplace noise essentially hides the difference between a 0 and a 1 (when ε is at a reasonable level, of course), we may struggle to learn anything from this data. By allowing each bucket to incorporate a greater range of values, we can obtain more meaningful privatized counts, albeit by sacrificing a degree of accuracy. The optimal way to group the data therefore depends on which features of the output are of interest.

This method is not confined to one-dimensional responses. As can be seen in the privatized maps above, the histogram is 2-dimensional, reflecting the latitude / longitude output from each query. In fact, any mapping of n-dimensional data into independent buckets is sufficient to ensure differential privacy. For the celebrity queries, I did treat the latitude/longitude coordinates and the fare/tip amounts as two separate histogram queries, as these pairs are broadly uncorrelated with each other.

For additional detail on the privacy calculations used throughout this post, the reader is encouraged to examine the code behind the various visualizations. While it is clear that the Laplace mechanism is applied slightly differently each time, it is not hard to imagine how this might be automated so differential privacy could be applied more generally.

Differential Privacy: The Basics

Hi! I’m Anthony Tockar. I am a master’s student at Northwestern University and have been working with the Science team for the summer. This is the first of two posts I will contribute on the topic of differential privacy.

Anyone who has ever googled themselves, updated their Facebook settings or even received a call from a telemarketer knows about the importance of privacy. In a world where the data being collected on us is growing exponentially, attacks on our privacy are becoming commonplace. A quick search will reveal all kinds of horror stories, such as the identification of individuals’ AOL search queries, people’s movie ratings on Netflix, and the medical records of a former Massachusetts governor. In my next post, I will demonstrate how easy it is to track individuals in New York using data from the 2013 NYC Taxi data release. Clearly, common practices need to change when it comes to protecting the individual. Put another way, is it fair to citizens and consumers to use their data without even being able to guarantee their privacy?

Data custodians are acutely aware of this issue. Here at Neustar, for example, our datasets span many domains, and can often contain sensitive information, such as PII or “harmless” DNS lookups. Our customers, of course, trust us to keep this information private. It is therefore not surprising that we employ a number of privacy experts to maintain the company’s contracts and policies with respect to data. As you might expect, it is also of interest to us in the data science team to know what rigor has been applied to this problem.

So – how is data privatized? It turns out that there are various methods that are commonly found in practice, from simple fixes like removing columns containing PII, to more advanced treatments such as k-anonymity and l-diversity. ALL of these methods have been shown to be vulnerable to attack in one way or another. However, in the last few years a new method has emerged that has survived close scrutiny: differential privacy. Unlike other methods, differential privacy operates off a solid mathematical foundation, making it possible to provide strong theoretical guarantees on the privacy and utility of released data.

The story of differential privacy, however, is not all good news. Despite showing massive potential, it has been used sparingly in practice, and remains largely confined to theoretical settings. In this post I hope to introduce the topic to a wider audience, and have included some interactive visualizations to help readers build their intuition. I will also discuss the relevance of differential privacy in more detail, and in doing so explore whether it really is the solution to the privacy puzzle.

Definition of Differential Privacy

This simple example should help illuminate the concept:

Suppose you have access to a database that allows you to compute the total income of all residents in a certain area. If you knew that Mr. White was going to move to another area, simply querying this database before and after his move would allow you to deduce his income.

What could one do to stop this? Perhaps make sure that every query returns an approximation of the total income? This could accomplish the goal of allowing the dataset to yield accurate information while protecting Mr. White. Formally, differential privacy is defined as follows:

A randomized function K gives ε-differential privacy if for all data sets D and D′ differing on at most one row, and all S ⊆ Range(K), Pr[K(D)\in S]\leq exp(\varepsilon)\times Pr[K(D')\in S]

This can be translated as meaning that the risk to one’s privacy should not substantially (as bounded by ε) increase as a result of participating in a statistical database. Thus an attacker should not be able to learn any information about any participant that they could not learn if the participant had opted out of the database. One could then state with some confidence that there is a low risk of any individual’s privacy being compromised as a result of their participation in the database.

Sounds great! So, how is it applied? Let’s go back to our example. If we can add some noise to the result of a query on our dataset, we should be able to ensure that the formula above holds. The function K() is our mechanism for adding this noise. Differential privacy-preserving mechanism design is a topic on its own. Here it suffices to say that there are different mechanisms available depending on the use case: for example, the Laplace mechanism, which I will introduce in the next paragraph, is unsuitable for categorical data – the exponential mechanism is more appropriate, though a strong case can be made for other methods depending on the outcome.

The Laplace mechanism involves adding random noise that conforms to the Laplace statistical distribution. That’s it! The obvious follow-up question is: well, how much noise should be added – i.e. how should we define our Laplace random variable? The 0-centered Laplace distribution has only one parameter (its scale), and this is directly proportional to its standard deviation, or noisiness. So how should we set the scale? Naturally it should have some dependence on the privacy parameter, ε. It should also depend on the nature of the query itself, and more specifically, the risk to the most different individual of having their private information teased out of the data. This can be defined mathematically, and is known as the sensitivity of the query:

\Delta \mathit{f}=\underset{D,D'}{max}||\mathit{f(D)}-\mathit{f(D')}||_{1}

Simply stated, it is the maximum difference in the values that the query f may take on a pair of databases that differ in only one row. There is a neat proof that shows that by adding a random Laplace(\Delta \mathit{f} / \varepsilon) variable to a query, ε-differential privacy is guaranteed.
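
As a minimal sketch of the Laplace mechanism in Python (assuming numpy; the function name is mine, not from any particular library):

import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon):
    # Return the true query answer plus Laplace noise with scale = sensitivity / epsilon.
    return true_answer + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)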

The more astute among us may be thinking: “Wait! If you just add a symmetrical, 0-centered distribution to the data, I can just run the query multiple times and take the average!” You’d be right if not for the property known as composition. Now while the definition differs depending on the technique used to query the data, in the simple, non-adaptive case, the composition property is as follows:

For a dataset queried q times, with each query having privacy parameter εi, the total privacy budget of the dataset is given by \varepsilon _{total} =\sum_{i=1}^{q}\varepsilon _{i}.

So when we think about ε, we should really be thinking about a privacy budget rather than purely the statistical upper bound of a query. Sure, each query yields \varepsilon _{i}, but it is the total ε we should be concerned about as it reflects the maximum privacy release allowable for the total query session. As each query answer leaks privacy, once the budget is exceeded the user will not be able to make any further queries. It is this feature that allows differential privacy to work in practice.
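
As an illustrative sketch of the bookkeeping this implies (the class, names, and numbers are hypothetical, not a production accounting scheme):

class PrivacyBudget:
    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon_i):
        # Composition: each answered query consumes part of the total budget.
        if epsilon_i > self.remaining:
            raise RuntimeError("Privacy budget exhausted; no further queries allowed.")
        self.remaining -= epsilon_i
        return epsilon_i

budget = PrivacyBudget(total_epsilon=1.0)
for _ in range(5):
    budget.spend(0.2)  # after five queries at 0.2 each, the budget is spent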

How might this look?

Since we now have a basic grasp of the concept, let’s take it for a spin! In our Mr. White example above, let’s assume the total income in his original neighborhood is $50 million. After he leaves, this figure drops to $49 million. Therefore, one can infer that his true income is $1 million. To keep his income private, we have to ensure the query response is noisy enough to ‘hide’ this information. In fact, to ensure we privatize income for all people in our dataset we need to make sure the richest person is protected as well. As it turns out, Mr. White was the richest person in his neighborhood, so the sensitivity is $1 million *. Play around with the tool below to see what this looks like in practice – I have left ε variable to emphasize its effect.
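
To see roughly what the widget below simulates, here is a sketch that reuses the laplace_mechanism function from the snippet above (the dollar figures come from the Mr. White example; the epsilon values are arbitrary):

total_before, total_after = 50000000, 49000000  # neighborhood income before and after the move
sensitivity = 1000000                           # the richest resident, Mr. White, earns $1 million

for epsilon in (0.01, 0.1, 1.0):
    noisy_difference = (laplace_mechanism(total_before, sensitivity, epsilon)
                        - laplace_mechanism(total_after, sensitivity, epsilon))
    # At small epsilon the noise dwarfs the true $1M gap, so a single pair of
    # queries tells an attacker very little about Mr. White's income.
    print(epsilon, noisy_difference)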

Click to Interact

It is plain to see that Mr. White is protected – even at lenient levels of privacy, it is all but impossible to deduce his true income by running the query once. The histogram shows what would happen if we were given free rein to run multiple queries – eventually, we would reach a point where we could easily find his true income. However, we would be violating our \varepsilon _{total} requirement as mentioned above.

Now that we’ve seen what would happen with a query on a single value, let’s look at what happens when we privatize a whole distribution of values. The tool below provides the density plot for several different distributions, both real and randomly-generated, and again the user is free to experiment with different privacy parameters. Hopefully through interacting with this widget, the reader can develop a more intuitive understanding of the effects of differential privacy.

Click to Interact

There is a key difference in how this query is privatized. As it is a density plot, we are only interested in how many observations occur at each value (similar to a histogram). Since this is a series of counts, our sensitivity is 1 (the maximum amount an individual can contribute). It follows that we can add Laplace random variables to the count at each value to guarantee privacy. Click on the graph for a more in-depth description.
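
A minimal sketch of this histogram-style privatization (numpy assumed; the bin count is an arbitrary choice):

import numpy as np

def privatized_density(values, epsilon, bins=100):
    counts, bin_edges = np.histogram(values, bins=bins)
    # Each individual contributes at most 1 to a single bin, so the sensitivity is 1.
    noisy_counts = counts + np.random.laplace(scale=1.0 / epsilon, size=counts.shape)
    return noisy_counts, bin_edges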

Note that this definition of sensitivity depends on how we define the dataset – if we were considering every neighborhood in the US, for example, our sensitivity would be a lot higher. We could adjust for this by setting a more relaxed privacy parameter, given the difficulty of prizing an individual’s private data out of a dataset of this breadth.

Relevance of Differential Privacy

Why should we care? At the very least differential privacy provides an interesting and rigorous framework around publishing data. So far, it is the only approach that has both theoretical bounds on the privacy of users in the dataset and still enables scientists to mine useful insights from it. Other solutions, such as anonymization (hashing), have time and again proven insufficient. Just look at the recent NYC taxi data that was de-anonymized within days of its release! Even with perfect anonymization, it has been posited that 63% of people in the US can be uniquely identified by just their birth date, zip code and gender.*

Unfortunately these attacks tend to stifle the release of data and information sharing. Differential privacy provides some hope. As we have learned, it is inherently flexible, which means it can easily be adapted to environments with differing privacy requirements. This flexibility does come at a cost: as we have seen, having very tiny privacy budgets (ε) can make some queries all but useless. However, as more people understand the concepts and more products get built on top of this paradigm we expect to see more sharing of data into the public domain without privacy concerns.

The more philosophical question is: how private is private enough? Clearly, there is some tunability between how useful a differentially private query is and how ‘private’ it is. This tradeoff between utility and privacy is unfortunately ‘left to the reader’. The literature does provide some rules of thumb for setting ε, with suggestions like 0.01, or ln 2, etc. – however these have scant theoretical support. Perhaps most importantly, there are few, if any, precedents. At the end of the day it is the data curator’s job (or their lawyer’s) to decide what is ‘private enough’. The lack of a clear framework to relate ε to privacy levels, coupled with the difficulty of explaining it to the layperson, has meant that differential privacy has largely remained confined to academia. However, as more people learn about it and more tools begin to emerge (PINQ, Airavat), this is starting to change. Clearly, in this world of massive datasets and smart data scientists and hackers, data privacy needs to keep pace. We are very hopeful that these techniques are the next step.

* Aside: This is particularly interesting considering the birthday paradox, which states that in a room of 23 people, there is a >50% chance at least two of them share a birthday. However, by dissecting further by zip code and gender (and of course birth year), it appears that this no longer holds to the same degree.

Conclusion

Of most interest to us at Neustar is the practical implementation of differential privacy. For example, how can it be applied to the world of Internet advertising? What about sketching? Well, there has actually been some proprietary work in these areas. Our friend Muthu has used sketches to develop pan-private algorithms for dynamic data, which guarantee privacy and security for certain queries. In addition, this paper suggests differentially private measures for preventing attacks that utilize personalized advertising campaigns. Works such as these help pave the way for a new dawn in the way we think about data.

It is worth pointing out again that this post is not supposed to be a complete discussion of differential privacy and its nuances. Differential privacy comes in many different forms and iterations which have not been covered, and it does have several limitations. There have been reams written about it, and the interested reader is encouraged to seek out further information. The concept of differential privacy is still young, and there is still a lot to be fleshed out. What is clear is that it holds much potential.

In my next post, I will make it abundantly clear why we should be taking privacy seriously, and follow this up with a demonstration of how to privatize a real dataset in a way that protects its participants while retaining its ability to bestow useful insights.

Hitting the Books: EADS Summer School on Hashing

Rob, Matt, and I just wrapped up our trip to Copenhagen for the EADS Summer School on Hashing at the University of Copenhagen and it was a blast! The lineup of speakers was, simply put, unbeatable: Rasmus Pagh, Graham Cormode, Michael Mitzenmacher, Mikkel Thorup, Alex Andoni, Haim Kaplan, John Langford, and Suresh Venkatasubramanian. There’s a good chance that any paper you’ve read on hashing, sketching, or streaming has one of them as a co-author or is heavily influenced by their work. The format was three hour-and-a-half lectures a day for four days, with exercises presented at the end of each lecture. (Slides can be found here. They might also post videos. UPDATE: They’ve posted videos!)

Despite the depth of the material, almost all of it was approachable with some experience in undergraduate math and computer science. We would strongly recommend both of Michael Mitzenmacher’s talks (1, 2) for an excellent overview of Bloom Filters and Cuckoo hashing that is, in my opinion, significantly better and more in-depth than any other out there. Specifically, the Bloom Filter talk very elegantly presents the continuum from Bloom Filter to Counting Bloom Filter to Count-Min Sketch (with “conservative update”) to the Stragglers Problem and Invertible Bloom Filters to, finally, extremely recent work called Odd Sketches.

Similarly, Mikkel Thorup’s two talks on hashing (1, 2) do a very thorough job of examining the hows and whys of integer hashing, from the simplest multiply-mod-prime schemes all the way to modern work on tabulation hashing. And if you haven’t heard of tabulation hashing, and specifically twisted tabulation hashing, get on that because (1) it’s amazing that it doesn’t produce trash given how simple it is, (2) it’s unbelievably fast, and (3) it has been proven to provide the guarantees required for almost all of the interesting topics we’ve discussed on the blog in the past: Bloom Filters, Count-Min sketch, HyperLogLog, chaining/linear-probing/cuckoo hash tables, and so on. We really appreciated how much attention Mikkel devoted to practicality of implementation and to empirical performance when discussing hashing algorithms. It warms our heart to hear a leading academic in this field tout the number of nanoseconds it takes to hash an item as vocally as the elegance of the proof behind it!
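
For readers who haven’t seen it, here is a minimal sketch of simple tabulation hashing for 32-bit keys in Python (the table sizes, seeding, and output width are illustrative choices, not taken from the lectures): split the key into bytes, look each byte up in its own table of random words, and XOR the results together.

import random

random.seed(1)  # fix the seed so the hash function is stable across runs

# One table of 256 random 32-bit words per byte position of the key.
TABLES = [[random.getrandbits(32) for _ in range(256)] for _ in range(4)]

def tabulation_hash(key):
    h = 0
    for i in range(4):
        byte = (key >> (8 * i)) & 0xFF  # extract the i-th byte of the key
        h ^= TABLES[i][byte]
    return h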

We love this “Summer School” format because it delivers the accumulated didactic insight of the field’s top researchers and educators to both old techniques and brand new ones. (And we hope by now that everyone reading our blog appreciates how obsessed we are with teaching and clarifying interesting algorithms and data structures!) Usually most of this insight (into origins, creative process, stumbling blocks, intermediate results, inspiration, etc.) only comes out in conversation or lectures, and even worse is withheld or elided at publishing time for the sake of “clarity” or “elegance”, which is a baffling rationale given how useful these “notes in the margin” have been to us. The longer format of the lectures really allowed for useful “digressions” into the history or inspiration for topics or proofs, which is a huge contrast to the 10-minute presentations given at a conference like SODA. (Yes, obviously the objective of SODA is to show a much greater breadth of work, but it really makes it hard to explain or digest the context of new work.)

In much the same way, the length of the program really gave us the opportunity to have great conversations with the speakers and attendees between sessions and over dinner. We can’t emphasize this enough: if your ambition is to implement and understand cutting-edge algorithms and data structures then the best bang for your buck is to get out there and meet the researchers in person. We’re incredibly lucky to call most of the speakers our friends and to regularly trade notes and get pointers to new research. They have helped us time and again when we’ve been baffled by inconsistent terminology or had a hunch that two pieces of research were “basically saying the same thing”. Unsurprisingly, they are also the best group of people to talk to when it comes to understanding how to foster a culture of successful research. For instance, Mikkel has a great article on how to systematically encourage and reward research, which appears in the March 2013 issue of CACM (pay-wall’d). Also worthwhile is his guest post on Bertrand Meyer’s blog.

If Mikkel decides to host another one of these, we cannot recommend attending enough. (Did we mention it was free?!) Thanks again Mikkel, Rasmus, Graham, Alex, Michael, Haim, and John for organizing such a great program and lecturing so eloquently!

Writing Analytics SQL with Common Table Expressions

Author’s Note: Hello readers! I’m Josh O’Brien. I recently joined the Science team as a junior engineer, and this is my first post for the blog.

Introduction

One of my first tasks with the Science team has been learning to write effective analytics SQL. I came in with a basic knowledge of SQL, but writing complex analytics reports required me to learn tools and strategies for managing complexity that aren’t yet part of the standard introductions to SQL. Luckily, I had the Science team to teach me to work with Common Table Expressions (CTEs). I’ve come to love CTEs for the clarity that they’ve helped bring to my thinking and writing in SQL. The CTE syntax encourages me to reason through a problem as a sequence of simple parts and enables me to directly code a solution in terms of those parts, which I can individually document and test for correctness. Working with CTEs has jump-started my productivity, and helped the team as a whole set a higher standard for our SQL.

In the Science team’s experience, much of the common frustration with SQL comes down to a failure to treat SQL queries as declarative programs that demand the same care as imperative programs. SQL is code, and we should treat it as such. We can better manage the complexity of SQL by using the same basic techniques we do in other languages: we can divide work into composable parts, document our intent, and test for correctness. We use CTEs as a foundation for building queries that are factored, documented, and tested, and we’ve enjoyed excellent results writing and maintaining numerous hundred- and thousand-line reports using this approach.

In this post, I’ll share an example of how the Science team uses CTEs to treat SQL as code. I’ll walk through the process of writing an analytics report with CTEs, and show how CTEs help me think through a problem and implement, document, and test a solution.

* If you’re thinking that CTEs are no better than temporary tables or views for these purposes, read on. CTEs, temporary tables, and views all have their place in our SQL toolkit. We use CTEs because they are best suited for this work. For more on the relative merits of CTEs, temporary tables, and views, please see the appendices to this post.

Common Table Expressions

Before we dive into the example report, let’s take a quick look at the CTE syntax we’ll be using. CTEs are defined inside of a WITH clause attached to a primary statement. Within the scope of the larger query, each CTE can be manipulated like a table. This allows us to chain CTEs together and build sequences of operations. In the following diagram, we’re building up a four-part query, part by part. We start with two parts: a foo CTE attached to a main SELECT statement. Next, we add a bar CTE. In the final step, we add a baz CTE to complete the four-part query.

Examples of two-, three-, and four-part queries with Common Table Expressions. The query grows by one CTE at each stage.

Notice what we did here. In the foo, bar, and baz CTEs, we now have three intermediate result sets that we can test individually and “print” with a SELECT *. Once we know each part is correct, we add another, until we’ve solved our problem. We can use CTEs to break queries into as many simple parts as the problem requires.

We use CTEs rather than temporary tables or views to decompose queries in development because they are simpler to use. There is no need to add the complexity of managing CREATE and DROP statements at this stage in the writing process.

Frequency Report

We’ll use a simplified example report to illustrate how we use CTEs in our everyday work: a frequency report. A frequency report is an online advertising analytics report that helps advertisers determine the number of ads to serve users over a specific time period. Advertisers want to reach out to customers enough times to build awareness of and interest in their offerings, but not so many times that customers become jaded or annoyed. A frequency report breaks down return on advertising investment by the number of ads users have been shown, a classification known as a user’s impression frequency class.

This report produces data that can be graphed as:

An example of a report in our UI, showing impressions for an advertiser by frequency class.

Stripped all the way down, the basic query that generates the report above is:

WITH impression_counts AS (
    SELECT user_id,
           SUM(1) AS impression_count
    FROM impressions
    GROUP BY 1
    )
SELECT impression_count      AS frequency_class,
       SUM(impression_count) AS total_impressions
FROM impression_counts
GROUP BY 1
;

The challenge of writing these reports comes from managing all the additional data we need. Actual reporting queries need to correctly handle the complexity of timestamp, ad campaign, conversion attribution, click, and cost data without becoming tangled messes.

For this simplified example, we’ll start with tables recording impression (ad view), click (ad interaction), and conversion (sale) events, and produce a frequency report tracking the total number of users, impressions, clicks, and conversions for each impression frequency class for each ad campaign in the database for the month of March 2014. We can visualize our task like this:

Our task: use the impressions, clicks, impression_attributed_conversions, and click_attributed_conversions tables to produce a frequency report for the month of March 2014.

Thinking with CTEs

Working with CTEs begins with reasoning about the problem in terms of the stages and parts needed to produce the report. From the above starting point, we can already work out four main stages.

We’ll need to:

  • FILTER the four input tables by record_date,
  • GROUP BY user_id and campaign_id, and SUM to get user-level counts for impressions, clicks, and conversions,
  • JOIN those counts together on user_id and campaign_id, and finally,
  • GROUP BY impression_count (= frequency_class) and campaign_id, and SUM to generate the report totals for users, impressions, clicks, and conversions.

We can express the relationships between these operations visually:

A map of the query to produce the frequency report. Each of the conceptual parts connecting the green input tables to the orange frequency report will be written as a simple CTE.

In one form or another, each of these operations would need to be a part of any query that produces this report. With CTEs, we can preserve the logical clarity of our thought process in the code itself. Each of the main parts of this query will be implemented using simple CTEs that serve only one main purpose. For added clarity, we will name and comment the CTEs to communicate our intent at every stage. This technique yields a query that we can read straight through and maintain with ease, just like our other code.

Writing with CTEs

Let’s take a look at a CTE from each stage right now. The full query with documentation comments can be found here, and in the appendices to this post.

First come the three filter CTEs. Here’s the CTE for filtered_impressions. Its only purpose is to filter the impressions table down to March 2014:

filtered_impressions AS (
    SELECT record_date,
           user_id,
           campaign_id
    FROM impressions
    WHERE record_date >= '2014-03-01' AND
          record_date <  '2014-04-01'
)

Next, we calculate user-level counts for impressions, clicks, and conversions. Each of the three “counts” CTEs performs only a simple aggregate function: a GROUP BY and a SUM. Here is the impression_counts CTE:

impression_counts AS (
    SELECT user_id,
           campaign_id,
           SUM(1) AS impression_count
    FROM filtered_impressions
    GROUP BY 1, 2
)

After that, we JOIN the three “counts” CTEs together in a single long table. This collated_counts CTE is the longest in the query, but, like the others, it has only one main purpose:

collated_counts AS (
    SELECT imp.user_id           AS user_id,
           imp.campaign_id       AS campaign_id,
           imp.impression_count  AS impression_count,
           cl.click_count        AS click_count,
           conv.conversion_count AS conversion_count
    FROM impression_counts imp
        LEFT OUTER JOIN click_counts cl ON
            imp.user_id      = cl.user_id AND
            imp.campaign_id  = cl.campaign_id
        LEFT OUTER JOIN conversion_counts conv ON
            imp.user_id      = conv.user_id AND
            imp.campaign_id  = conv.campaign_id
    )

Last comes the main SELECT statement. Its only purpose is to group by impression_count (= frequency_class) and campaign_id, and calculate the four SUMs for the report:

SELECT impression_count                   AS frequency_class,
       campaign_id                        AS campaign_id,
       SUM(1)                             AS total_users,
       SUM(impression_count)              AS total_impressions,
       SUM(COALESCE(click_count, 0))      AS total_clicks,
       SUM(COALESCE(conversion_count, 0)) AS total_conversions
FROM collated_counts
GROUP BY 1, 2

 

Testing with CTEs

As we build up the query with CTEs, we leverage the ability to SELECT from each CTE individually to test for correctness as part of the writing process. This basic testing can be as simple as three files in a text editor, which we execute from psql (or equivalent) in a sequence as we write:

  • setup.sql: CREATE tables and INSERT rows of test data
  • test.sql: the query itself
  • teardown.sql: DROP the tables created in setup.sql

We write and comment one CTE at a time in the test file. Each time we add a CTE, we add test rows to exercise that CTE to the setup file, and include comments to indicate what should happen to those rows when we SELECT * from the relevant CTE. When the output matches our expectations, we move to the next part of the query, and repeat the process.

As an example, initial tests for the filtered_impressions CTE could consist of creating an impressions table and inserting five rows to exercise the date range in the WHERE clause. We indicate our expectations for those rows with brief comments:

CREATE TABLE impressions (
    record_date  date   NOT NULL,
    user_id      bigint NOT NULL,
    campaign_id  bigint NOT NULL
);
INSERT INTO impressions (record_date, user_id, campaign_id) VALUES
    /* The following 2 rows should not appear in filtered_impressions: */
    ('2014-02-28', 707, 7),
    ('2014-04-01', 707, 7),
    /* The following 3 rows should appear in filtered_impressions: */
    ('2014-03-01', 101, 1),
    ('2014-03-15', 101, 1),
    ('2014-03-31', 101, 1)
;

This basic testing at the time of writing is not a substitute for a comprehensive test framework, but it is enough to catch many errors that could otherwise sneak through, and it provides a good return on a modest investment of effort. By the time the full query is complete, this process will have generated tests and documentation for each part of the query.

Conclusion

This method of working with CTEs has helped me by bringing clarity and simplicity to complex analytics queries. Thinking, writing, and testing with CTEs helps me treat SQL as part of software engineering practice by writing SQL that’s factored, documented, and tested more like other code.

The Science team thinks of this method as producing a foundation for further refinements. When appropriate, optimizations for performance can and will be made, but we focus on correctness first. Optimizations tend to add complexity, and before we do that, we want to mitigate the complexity of the query as much as possible.

By starting with CTEs, we can more easily write queries that we can quickly read and reuse six months from now. Analysts can return to their models and analyses with confidence and engineers are better able to add new features to reports without introducing new bugs. We’re building upon a foundation of factored, documented, and tested SQL.

Appendices

Code for the Example Report
On CTEs, Temporary Tables, and Views

We asked Christophe Pettus of PostgreSQL Experts to help illuminate the tradeoffs between CTEs, views, and temporary tables, and received the following helpful response, which we publish here with his permission and our thanks:

[E]ach have characteristics that can make them better or worse in particular situations:

1. CTEs are optimization fences; the query planner will plan CTEs
separately from the rest of the query. This can be good or bad,
depending on the way the CTE is used.

2. Views are *not* optimization fences; you can think of them as being
textually inserted into the query at the appropriate place, so queries
can be rewritten, join clauses moved around, etc.

3. Temporary tables can have indexes; for very large intermediate result
sets, this can be essential for good performance.

We agree that the choice between CTEs, temporary tables, and views is a matter of balancing the different trade-offs of the different stages of software development.

As explained in this post, the Science team finds the balance in favor of CTEs as the foundation for query development. We reach for the CTE syntax first for its clarity and ease of use. When we write and test queries part-by-part, we want to keep the code as clear and simple as possible. Juggling extra CREATE and DROP statements for temporary tables or views works against that goal.

Once we have a correct, clear foundation, we move on to the optimizations I mentioned in the conclusion. At that point, we consider re-writing CTEs as views or materialized tables on a case-by-case basis. Sometimes the balance tips away from CTEs. In our experience, the most common reason for this has been to gain the performance benefits of indexing on intermediate result sets that can contain hundreds of millions to tens of billions of rows.

More posts featuring CTEs

Open Source Release: java-hll

We’re happy to announce our newest open-source project, java-hll, a HyperLogLog implementation in Java that is storage-compatible with the previously released postgresql-hll and js-hll implementations. And as the rule of three dictates, we’ve also extracted the storage specification that makes them interoperable into its own repository. Currently, all three implementations support reading storage specification v1.0.0, while only the PostgreSQL and Java implementations fully support writing v1.0.0. We hope to bring the JS implementation up to speed, with respect to serialization, shortly.

AK at re:Invent 2013

I was given the opportunity to speak at AWS re:Invent 2013, on Wednesday, Nov. 13th, about how we extract maximum performance, across our organization, from Redshift. You can find the slides here, and my voice-over, though not identical, is mostly captured by the notes in those slides.

The talk, in brief, was about the technical features of Redshift that allow us to write, run, and maintain many thousands of lines of SQL that run over hundreds of billions of rows at a time. I stayed away from nitty-gritty details in the talk, since I only had 15 minutes, and tried to focus on high-level take-aways:

  • SQL is code, so treat it like the rest of your code. It has to be clean, factored, and documented. Use CTEs to break queries into logical chunks. Use window functions to express complex ideas succinctly.
  • Redshift is an MPP system with fast IO, fast sorting, and lots of storage. Use this to your advantage by storing multiple different sort orders of your data if you have different access patterns. Materialize shared intermediates so many queries can take advantage of them.
  • Redshift has excellent integration with S3. Use the fat pipes to cheaply materialize query intermediates for debugging. Use the one-click snapshot feature to open up experimentation with schema, data layout, and column compression. If it doesn’t work out, you revert to your old snapshot.
  • Use the operational simplicity of Redshift to be frugal. Turn over the responsibility of managing cluster lifecycles to the people that use them. For instance, devs and QA rarely need their clusters when the workday is done. The dashboards are such a no-brainer that it’s barely a burden to have them turn off and start up their own clusters. Allow users to take responsibility for their cluster, and they will become more responsible about using their cluster.
  • Use the operational simplicity of Redshift to be more aware. With just a few clicks, you can launch differently sized clusters and evaluate your reports and queries against all of them. Quantify the cost of your queries in time and money.
  • It’s a managed service: stop worrying about nodes and rows and partitions and compression! Get back to business value:
    • How long does the customer have to wait?
    • How much does this report cost?
    • How do I make my staff more productive?
    • How do I minimize my technical debt?

Sketch of the Day: Frugal Streaming

We are always worried about the size of our sketches. At AK we like to count stuff, and if you can count stuff with smaller sketches then you can count more stuff! We constantly have conversations about how we might be able to make our sketches, and subsequently our datastores, smaller. During our science summit, Muthu pointed us at some of the new work in Frugal Streaming. The concept of Frugal Streaming is to process your data as it comes, O(N), but severely limit the amount of memory you are using for your sketch, and by “severely” they mean using perhaps one or two pieces of information. Needless to say, we were interested!

History

The concept of Frugal Streaming reminds me of an algorithm for finding the dominant item in a stream, MJRTY written by Boyer and Moore in 1980. (This paper has a very interesting history). The MJRTY algorithm sets out to solve the problem of finding the majority element in a stream (an element comprising more than 50% of the stream). Moore proposed to solve this by using only 2 pieces of information and a single scan of the data.

Imagine you have a stream of names (“matt”, “timon”, “matt”, “matt”, “rob”, “ben”, …) and you wanted to know if any name appeared in more than half the stream. Boyer and Moore proposed the following:

count = 0
majority = ""

for val in stream:
    if count == 0:
        majority = val
        count = 1
    elif val == majority:
        count += 1
    else:
        count -= 1

print majority if count > 0 else "no majority!"

If you’re anything like me you will go through a few phases: “That can’t work!”, “Wait, that works?!”, “Nah, this data will break that”, “Holy sh*t that works!!”. If you think about it for a minute you realize that it HAS to work. If any element in the stream comprises more than half the stream values there is no way to get to the end and have a counter of zero. To convince yourself suppose the majority element only comprises the first half + 1 of your N-length stream. The counter would count up to N/2+1 and then start decrementing until all N values have been seen, which would leave the counter at 2 = (N/2+1) - (N/2-1) *. This will hold regardless of the ordering of the elements in the stream. A simple simulation is provided by the authors. Philippe Flajolet apparently “liked this algorithm very much and called it the ‘gang war’, because in his mind, every element is a gang member, and members of opposite gangs are paired in a standoff, and shoot each other. The gang members remaining are the more numerous”**.

The astute among you will have noticed that this algorithm only works if there is, in fact, a majority element. If there is not, it will fail: a stream such as {"matt", "matt", "timon", "timon", "rob"} would result in the algorithm returning "rob". In practice you need a way of ensuring that your stream does indeed have a majority element, or a guarantee of this ahead of time.

* Note, that formula is for an even length stream. For a stream of odd length the counter will indeed be at 1. Proof is left to the reader.

** Thanks to Jeremie Lumbroso for his insightful feedback on the history of MJRTY and his memory of Philippe’s explanation.

One “bit” – Frugal-1U

In their Frugal Streaming paper, Qiang and Muthu decided to see if they could find a frugal solution to the streaming quantile problem. The streaming quantiles problem is one I enjoy quite a bit and I have used it as an interview question for potential candidates for some time. Simply stated it is: “How would you find the quantiles of a stream of numbers in O(N) with limited memory?” There are a few different approaches to solving this problem, with the most widely known probably being the Count-Min Sketch. However, with Count-Min you need to keep your keys around in order to query the sketch. There are other, less well-known approaches as well.

Instead of focusing on quantiles generally, Qiang and Muthu’s first solution is a frugal way to find the median of a stream. As with MJRTY above, I’ll just write it down and see how you react:

median_est = 0
for val in stream:
    if val > median_est:
        median_est += 1
    elif val < median_est:
        median_est -= 1 

Granted, the above is just for the median, where the stream is much longer than the value of the median, but it is so simple that I don’t think I would have ever considered this approach to the problem. The extension to quantiles is equally shocking. If you are trying to find the 75th percentile of the data stream you do the same as above but increment up randomly 75% of the time and decrement down randomly 25% of the time:

 
from random import random

quantile_75 = 0
for val in stream:
    r = random()
    # Increment with probability 0.75 when the value is above the estimate...
    if val > quantile_75 and r > 1 - 0.75:
        quantile_75 += 1
    # ...and decrement with probability 0.25 when it is below.
    elif val < quantile_75 and r > 0.75:
        quantile_75 -= 1
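To see both frugal estimators in action, here is the kind of quick sanity check I like to run (my own sketch, not from the paper): wrap the update rule above in a small helper, feed it a long stream drawn from a known distribution, and compare against the exact quantiles computed by sorting. The helper name frugal_1u is mine.

from random import random, randint, seed

def frugal_1u(stream, q):
    # Frugal-1U: one-unit updates toward the q-quantile of the stream.
    est = 0
    for val in stream:
        r = random()
        if val > est and r > 1 - q:
            est += 1
        elif val < est and r > q:
            est -= 1
    return est

seed(42)
stream = [randint(0, 1000) for _ in range(100000)]
exact = sorted(stream)

print(frugal_1u(stream, 0.5), exact[len(stream) // 2])          # estimated vs. exact median
print(frugal_1u(stream, 0.75), exact[int(len(stream) * 0.75)])  # estimated vs. exact 75th percentile

On a stream like this the frugal estimates land in the neighborhood of the exact values, with noticeable run-to-run variability – which is exactly the behavior discussed in the appendix below.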

As the paper states:

The trick to generalize median estimation to any \frac{h}{k}-quantile estimation is that not every stream item seen will cause an update. If the current stream item is larger than estimation, an increment update will be triggered only with probability \frac{h}{k}. The rationale behind it is that if we are estimating \frac{h}{k}-quantile, and if the current estimate is at stream’s true \frac{h}{k}-quantile, we will expect to see stream items larger than the current estimate with probability 1-\frac{h}{k}.

Finding Quantiles With Two “bits” – Frugal-2U

There are a few obvious drawbacks to the above algorithm. Since we are only incrementing by 1 each time, the time to converge is linear and our initial guess of zero could be very bad. Secondly, and by design, the algorithm has no memory; it can fluctuate wildly and, as they show in the paper, the estimates can drift very far away. (Note: while it is extremely unlikely that the estimates will drift far from the correct values, the paper only gives some very loose bounds on how badly they can drift; see Lemma 3 in the paper.) They suggest a few improvements over Frugal-1U where you essentially include a varying step size (instead of just incrementing by 1 every time) and one “bit” recording which way you moved in the last update. The intuition here is that if you have been consistently incrementing up or down for the last few elements of a stream, then you are probably “far” away from the quantile in question. If this is the case, we can speed up convergence by taking larger steps. The Frugal-2U algorithm:

from random import random

# Default step function: always adjust the step size by one unit.
constantly_one = lambda step: 1

def frugal_2u(stream, m=0, q=0.5, f=constantly_one):
    step, sign = 1, 1

    for item in stream:
        if item > m and random() > 1 - q:
            # Increase the step size if and only if the estimate keeps moving in
            # the same direction; otherwise decrease it. The adjustment is the
            # result of applying the step function to the previous step size.
            step += f(step) if sign > 0 else -1 * f(step)
            # Increment the estimate by the step size if the step is positive.
            # Otherwise increment the estimate by one.
            m += step if step > 0 else 1
            # Mark that the estimate increased this step.
            sign = 1
            # If the estimate overshot the item in the stream, pull the estimate
            # back and re-adjust the step size.
            if m > item:
                step += (item - m)
                m = item
        # If the item is less than the estimate, follow all of the same steps as
        # above, with signs reversed.
        elif item < m and random() > q:
            step += f(step) if sign < 0 else -1 * f(step)
            m -= step if step > 0 else 1
            sign = -1
            if m < item:
                step += (m - item)
                m = item
        # Damp down the step size to avoid oscillation.
        if (m - item) * sign < 0 and step > 1:
            step = 1

    # Return the final quantile estimate.
    return m

You can play around with the 1U and 2U algorithms in the interactive Frugal Streaming simulation that accompanies this post (click it to run).

As usual, there are a few interesting tidbits as well. If you read the paper you will see that they define the updates to step as a function, which allows many different kinds of step adjustments. For example, instead of increasing step by 1 each time, we could increase it by 10 or even have it grow multiplicatively. They do talk about some uses of different step updates, but there is no analysis around this (yet) and they restrict all of the work in the paper to step increments of 1. We offer a few different step update functions in the simulation and they do indeed do interesting things. Further exploration is definitely needed to get some insights here.
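For concreteness, here is how those alternatives plug into the frugal_2u sketch above. Only the constant step of 1 is analyzed in the paper; the other step functions are experiments, and the latency stream here is simulated purely for illustration.

from random import randint, seed

seed(7)
latencies = [randint(1, 2000) for _ in range(50000)]

# Constant step of 1 -- the only variant analyzed in the paper.
est_step_1 = frugal_2u(latencies, q=0.9, f=lambda step: 1)

# Larger constant step: converges faster from a bad initial guess,
# at the cost of coarser adjustments near the target quantile.
est_step_10 = frugal_2u(latencies, q=0.9, f=lambda step: 10)

# A step function that returns the current step itself, so consecutive
# moves in the same direction grow the step roughly multiplicatively.
est_doubling = frugal_2u(latencies, q=0.9, f=lambda step: step)

print(est_step_1, est_step_10, est_doubling)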

A non-obvious thing about the step variable is how it behaves under decrements. My initial thought was that step would get large if your current estimate was far below the actual value (thus allowing you to approach it faster from below) and that step would become a large negative number if your current estimate was very far above the actual value. This turns out to be flat wrong. The negative updates to step have the effect of stabilizing the estimate (notice that when step is negative, the updates to your estimate are always ±1). If you read the algorithm closely you will see that step decrements when you consistently alternate between up and down updates. This behavior occurs when the estimate is close to the actual value, which causes step to become a very large negative number. And I mean VERY large. In practice we have seen this number go as low as -10^{102} in some simulations.

Monitoring

One of the first things I thought of when I saw this sketch was to use it as a monitoring tool. Specifically, perhaps it could replace the monitoring we use on our application servers’ response times. It turns out that using this sketch for monitoring introduces some very interesting issues. So far, we have mostly talked about what I will call “static streams”: streams whose data is pulled consistently from some fixed underlying distribution (as in the examples above). But what if the underlying distribution changes? For example, what if one of our servers all of a sudden starts responding with slower response times? Does this frugal sketch enable you to quickly figure out that something has broken and fire off an alarm with high confidence? In some ways this is identical to the cold-start problem: how long does it take for an initial guess to reach a stable quantile? Unfortunately, there is no easy way to know when you have reached “equilibrium”, which also makes it difficult to identify when the underlying distribution has changed. The paper ends with an open challenge:

… as our experiments and insights indicate, frugal streaming algorithms work with so little memory of the past that they are adaptable to changes in the stream characteristics. It will be of great interest to understand this phenomenon better.

The paper shows some interesting experiments with changing data, but I agree with Qiang: we need to understand these algorithms better before I swap out all of our server monitoring tools (and our ops team would kill me). However, since these are so simple to implement and so small, we can easily deploy a few tests and see how the results compare “in the field” (you can play around with this by changing the underlying stream distribution in our simulation above).
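As a toy version of that kind of field test (my own sketch, not how we actually monitor anything, and using the simpler 1U update rule for brevity): keep a running quantile estimate per observation and watch how long it takes to follow a shift in the underlying latency distribution. The class name FrugalQuantile and the simulated latencies are purely illustrative.

from random import random, gauss

class FrugalQuantile(object):
    # Running q-quantile estimate using the Frugal-1U update rule.
    def __init__(self, q, estimate=0.0):
        self.q = q
        self.estimate = estimate

    def update(self, value):
        r = random()
        if value > self.estimate and r > 1 - self.q:
            self.estimate += 1
        elif value < self.estimate and r > self.q:
            self.estimate -= 1
        return self.estimate

# Simulated response times (ms): the server "breaks" halfway through.
stream = [gauss(100, 10) for _ in range(20000)] + \
         [gauss(250, 25) for _ in range(20000)]

p99 = FrugalQuantile(q=0.99, estimate=100)  # warm start near the expected value
for i, latency in enumerate(stream):
    estimate = p99.update(latency)
    if i % 5000 == 0:
        print(i, round(estimate, 1))

Printing the estimate as the stream goes by makes the lag visible: after the shift the estimate can climb by at most one unit per observation, which is exactly the cold-start and adaptivity concern above.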

Conclusion

The frugal quantile algorithms proposed in the paper are fascinating. It is a very interesting turn in the sketching literature, and Qiang and Muthu’s creativity really comes across. I am very interested in finding some real-world uses for this sketch and am excited to see what other applications we (and Qiang!) can think of. Many thanks to Muthu, Qiang Ma, and Jeremie Lumbroso for all their help!

Appendix

  • Variability: While the bounds on the accuracy of these algorithms seem wide to me, in real-world experiments we clearly see much better performance than the guarantees in the paper. In fact, the error bounds in the paper depend on the size of the stream and are not fixed.
  • Size of step: A potential gotcha is the size of the step variable. Depending on your update function, it seems entirely possible for this value to drop below -MAXINT. Clearly a real implementation would need some error checking.
  • Cold Start: One more gotcha is that you have no real way of knowing when you are near the quantile in question. For instance, if you start your estimate at zero and the true median is 100,000,000,000, it will take a very long time to “catch up” to that value. There are a few ways to limit this, some of which are discussed in the paper. One is to make a sane guess up front (especially if you know something about the distribution); another is to start your guess with the value from the last run. For instance, suppose you are in a monitoring situation and computing quantile estimates over the course of each day: it would make sense to start tomorrow’s estimate with yesterday’s final value (see the short sketch after this list).
  • Accuracy: And, lastly, there is some interesting dependence on “atomicity”, meaning the estimates depend to some extent on how “large” the actual values are. In both algorithms the minimum step size is 1, so if the stream’s true median is, say, 6, this “atomic” update of 1 causes high relative fluctuation. In real-world use you would want to balance the magnitude of the thing you are estimating against the speed at which the algorithm approaches it; this could lead you to estimate latencies in microseconds rather than milliseconds, for instance. All of this points to the fact that there are a bunch of real-world engineering questions to think through before you go and implement a million frugal sketches throughout your organization.
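To make the warm-start idea in the Cold Start bullet concrete, here is the kind of thing I have in mind, using the frugal_2u function above. Yesterday’s value and today’s latencies are simulated; in practice they would come from whatever store you already persist metrics to.

from random import gauss, seed

seed(1)

# Yesterday's final median estimate, loaded from wherever you persist it.
yesterday_estimate = 123.0

# Today's observations (simulated here so the example runs standalone).
todays_latencies = [gauss(125, 10) for _ in range(10000)]

# Seed today's estimator with yesterday's final value instead of zero,
# so the sketch starts near the quantile instead of walking up from scratch.
today_estimate = frugal_2u(todays_latencies, m=yesterday_estimate, q=0.5)
print(today_estimate)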

Slides and Videos are up for AK Data Science Summit

We’re happy to announce that all the talks (slides and video) from our Data Science Summit are available for free, right here on the blog! You can find a full list of the conference’s content here.

Enjoy!

Data Science Summit – Update

I don’t think I’m going out on a limb saying that our conference last week was a success. Thanks to everyone who attended and thanks again to all the speakers. Muthu actually beat us to it and wrote up a nice summary. Very kind words from a great guy. For those of you that couldn’t make it (or those that want to relive the fun) we posted the videos and slides. Thanks again to everyone for making it such a success!


Muthu being Muthu during David Woodruff’s fantastic talk on streaming linear algebra
