HLL talk at SFPUG

I had the pleasure of speaking at the SF PostgreSQL User Group’s meetup tonight about sketching, the history of HLL, and our implementation of HLL as a PG extension. My slides are embedded below and you can get a PDF copy here. Be sure to click the gear below to show speaker’s notes for context!

If video is made available, I’ll post an update with a link!

Writing Analytics SQL with Common Table Expressions

Author’s Note: Hello readers! I’m Josh O’Brien. I recently joined the Science team as a junior engineer, and this is my first post for the blog.

Introduction

One of my first tasks with the Science team has been learning to write effective analytics SQL. I came in with a basic knowledge of SQL, but writing complex analytics reports required me to learn tools and strategies for managing complexity that aren’t yet part of the standard introductions to SQL. Luckily, I had the Science team to teach me to work with Common Table Expressions (CTEs). I’ve come to love CTEs for the clarity that they’ve helped bring to my thinking and writing in SQL. The CTE syntax encourages me to reason through a problem as a sequence of simple parts and enables me to directly code a solution in terms of those parts, which I can individually document and test for correctness. Working with CTEs has jump-started my productivity, and helped the team as a whole set a higher standard for our SQL.

In the Science team’s experience, much of the common frustration with SQL comes down to a failure to treat SQL queries as declarative programs that demand the same care as imperative programs. SQL is code, and we should treat it as such. We can better manage the complexity of SQL by using the same basic techniques we do in other languages: we can divide work into composable parts, document our intent, and test for correctness. We use CTEs as a foundation for building queries that are factored, documented, and tested, and we’ve enjoyed excellent results writing and maintaining numerous hundred- and thousand-line reports using this approach.

In this post, I’ll share an example of how the Science team uses CTEs to treat SQL as code. I’ll walk through the process of writing an analytics report with CTEs, and show how CTEs help me think through a problem and implement, document, and test a solution.

* If you’re thinking that CTEs are no better than temporary tables or views for these purposes, read on. CTEs, temporary tables, and views all have their place in our SQL toolkit. We use CTEs because they are best suited for this work. For more on the relative merits of CTEs, temporary tables, and views, please see the appendices to this post.

Common Table Expressions

Before we dive into the example report, let’s take a quick look at the CTE syntax we’ll be using. CTEs are defined inside of a WITH clause attached to a primary statement. Within the scope of the larger query, each CTE can be manipulated like a table. This allows us to chain CTEs together and build sequences of operations. In the following diagram, we’re building up a four-part query, part by part. We start with two parts: a foo CTE attached to a main SELECT statement. Next, we add a bar CTE. In the final step, we add a baz CTE to complete the four-part query.


Examples of two-, three-, and four-part queries with Common Table Expressions. The query grows by one CTE at each stage.
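
As a rough sketch of that final, four-part shape (foo, bar, and baz are the placeholder names from the diagram, and some_table is an assumed input table):

WITH foo AS (
    SELECT *
    FROM some_table          -- assumed input table
),
bar AS (
    SELECT *
    FROM foo                 -- builds on foo
),
baz AS (
    SELECT *
    FROM bar                 -- builds on bar
)
SELECT *
FROM baz
;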

Notice what we did here. In the foo, bar, and baz CTEs, we now have three intermediate result sets that we can test individually and “print” with a SELECT *. Once we know each part is correct, we add another, until we’ve solved our problem. We can use CTEs to break queries into as many simple parts as the problem requires.

We use CTEs rather than temporary tables or views to decompose queries in development because they are simpler to use. There is no need to add the complexity of managing CREATE and DROP statements at this stage in the writing process.

Frequency Report

We’ll use a simplified example report to illustrate how we use CTEs in our everyday work: a frequency report. A frequency report is an online advertising analytics report that helps advertisers determine the number of ads to serve users over a specific time period. Advertisers want to reach out to customers enough times to build awareness of and interest in their offerings, but not so many times that customers become jaded or annoyed. A frequency report breaks down return on advertising investment by the number of ads users have been shown, a classification known as a user’s impression frequency class.

This report produces data that can be graphed as:

An example of a report in our UI, showing impressions for an advertiser by frequency class.

Stripped all the way down, the basic query that generates the report above is:

WITH impression_counts AS (
    SELECT user_id,
           SUM(1) AS impression_count
    FROM impressions
    GROUP BY 1
    )
SELECT impression_count      AS frequency_class,
       SUM(impression_count) AS total_impressions
FROM impression_counts
GROUP BY 1
;

The challenge of writing these reports comes from managing all the additional data we need. Actual reporting queries need to correctly handle the complexity of timestamp, ad campaign, conversion attribution, click, and cost data without becoming tangled messes.

For this simplified example, we’ll start with tables recording impression (ad view), click (ad interaction), and conversion (sale) events, and produce a frequency report tracking the total number of users, impressions, clicks, and conversions for each impression frequency class for each ad campaign in the database for the month of March 2014. We can visualize our task like this:

Our task: use the impressions, clicks, impression_attributed_conversions, and click_attributed_conversions tables to produce a frequency report for the month of March 2014.

Thinking with CTEs

Working with CTEs begins with reasoning about the problem in terms of the stages and parts needed to produce the report. From the above starting point, we can already work out four main stages.

We’ll need to:

  • FILTER the four input tables by record_date,
  • GROUP BY user_id and campaign_id, and SUM to get user-level counts for impressions, clicks, and conversions,
  • JOIN those counts together on user_id and campaign_id, and finally,
  • GROUP BY impression_count (= frequency_class) and campaign_id, and SUM to generate the report totals for users, impressions, clicks, and conversions.

We can express the relationships between these operations visually:

A map of the query to produce the frequency report. Each of the conceptual parts (rectangles) connecting the green input tables to the orange frequency report will be written as a simple CTE.

In one form or another, each of these operations would need to be a part of any query that produces this report. With CTEs, we can preserve the logical clarity of our thought process in the code itself. Each of the main parts of this query will be implemented using simple CTEs that serve only one main purpose. For added clarity, we will name and comment the CTEs to communicate our intent at every stage. This technique yields a query that we can read straight through and maintain with ease, just like our other code.

Writing with CTEs

Let’s take a look at a CTE from each stage right now. The full query with documentation comments can be found here, and in the appendices to this post.

First come the three filter CTEs. Here’s the CTE for filtered_impressions. Its only purpose is to filter the impressions table down to March 2014:

filtered_impressions AS (
    SELECT record_date,
           user_id,
           campaign_id
    FROM impressions
    WHERE record_date >= '2014-03-01' AND
          record_date <  '2014-04-01'
)

Next, we calculate user-level counts for impressions, clicks, and conversions. Each of the three “counts” CTEs performs only a simple aggregate function: a GROUP BY and a SUM. Here is the impression_counts CTE:

impression_counts AS (
    SELECT user_id,
           campaign_id,
           SUM(1) AS impression_count
    FROM filtered_impressions
    GROUP BY 1, 2
)

After that, we JOIN the three “counts” CTEs together into a single long table. This collated_counts CTE is the longest in the query, but, like the others, it has only one main purpose:

collated_counts AS (
    SELECT imp.user_id           AS user_id,
           imp.campaign_id       AS campaign_id,
           imp.impression_count  AS impression_count,
           cl.click_count        AS click_count,
           conv.conversion_count AS conversion_count
    FROM impression_counts imp
        LEFT OUTER JOIN click_counts cl ON
            imp.user_id      = cl.user_id AND
            imp.campaign_id  = cl.campaign_id
        LEFT OUTER JOIN conversion_counts conv ON
            imp.user_id      = conv.user_id AND
            imp.campaign_id  = conv.campaign_id
    )

Last comes the main SELECT statement. Its only purpose is to group by impression_count (= frequency_class) and campaign_id, and calculate the four SUMs for the report:

SELECT impression_count                   AS frequency_class,
       campaign_id                        AS campaign_id,
       SUM(1)                             AS total_users,
       SUM(impression_count)              AS total_impressions,
       SUM(COALESCE(click_count, 0))      AS total_clicks,
       SUM(COALESCE(conversion_count, 0)) AS total_conversions
FROM collated_counts
GROUP BY 1, 2

 

Testing with CTEs

As we build up the query with CTEs, we leverage the ability to SELECT from each CTE individually to test for correctness as part of the writing process. This basic testing can be as simple as three files in a text editor, which we execute from psql (or equivalent) in a sequence as we write:

  • setup.sql: CREATE tables and INSERT rows of test data
  • test.sql: the query itself
  • teardown.sql: DROP the tables created in setup.sql

We write and comment one CTE at a time in the test file. Each time we add a CTE, we add test rows to exercise that CTE to the setup file, and include comments to indicate what should happen to those rows when we SELECT * from the relevant CTE. When the output matches our expectations, we move to the next part of the query, and repeat the process.

As an example, initial tests for the filtered_impressions CTE could consist of creating an impressions table and inserting five rows to exercise the date range in the WHERE clause. We indicate our expectations for those rows with brief comments:

CREATE TABLE impressions (
    record_date  date   NOT NULL,
    user_id      bigint NOT NULL,
    campaign_id  bigint NOT NULL
);
INSERT INTO impressions (record_date, user_id, campaign_id) VALUES
    /* The following 2 rows should not appear in filtered_impressions: */
    ('2014-02-28', 707, 7),
    ('2014-04-01', 707, 7),
    /* The following 3 rows should appear in filtered_impressions: */
    ('2014-03-01', 101, 1),
    ('2014-03-15', 101, 1),
    ('2014-03-31', 101, 1)
;
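
At this stage, test.sql might contain just this first CTE, with a temporary SELECT * standing in for the rest of the query so that its output can be checked against the comments in setup.sql (a sketch of the work-in-progress file):

WITH filtered_impressions AS (
    SELECT record_date,
           user_id,
           campaign_id
    FROM impressions
    WHERE record_date >= '2014-03-01' AND
          record_date <  '2014-04-01'
)
SELECT *                    -- temporary "print" statement; expect only the
FROM filtered_impressions   -- three March 2014 rows for user 101
;

When the next CTE is added, the temporary SELECT * simply moves down to point at it.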

This basic testing at the time of writing is not a substitute for a comprehensive test framework, but it is enough to catch many errors that could otherwise sneak through, and it provides a good return on a modest investment of effort. By the time the full query is complete, this process will have generated tests and documentation for each part of the query.

Conclusion

This method of working with CTEs has helped me by bringing clarity and simplicity to complex analytics queries. Thinking, writing, and testing with CTEs helps me treat SQL as part of software engineering practice by writing SQL that’s factored, documented, and tested more like other code.

The Science team thinks of this method as producing a foundation for further refinements. When appropriate, optimizations for performance can and will be made, but we focus on correctness first. Optimizations tend to add complexity, and before we do that, we want to mitigate the complexity of the query as much as possible.

By starting with CTEs, we can more easily write queries that we can quickly read and reuse six months from now. Analysts can return to their models and analyses with confidence and engineers are better able to add new features to reports without introducing new bugs. We’re building upon a foundation of factored, documented, and tested SQL.

Appendices

Code for the Example Report
On CTEs, Temporary Tables, and Views

We asked Christophe Pettus of PostgreSQL Experts to help illuminate the tradeoffs between CTEs, views, and temporary tables, and received the following helpful response, which we publish here with his permission and our thanks:

[E]ach have characteristics that can make them better or worse in particular situations:

1. CTEs are optimization fences; the query planner will plan CTEs separately from the rest of the query. This can be good or bad, depending on the way the CTE is used.

2. Views are *not* optimization fences; you can think of them as being textually inserted into the query at the appropriate place, so queries can be rewritten, join clauses moved around, etc.

3. Temporary tables can have indexes; for very large intermediate result sets, this can be essential for good performance.

We agree that the choice between CTEs, temporary tables, and views is a matter of balancing the different trade-offs of the different stages of software development.

As explained in this post, the Science team finds the balance in favor of CTEs as the foundation for query development. We reach for the CTE syntax first for its clarity and ease of use. When we write and test queries part-by-part, we want to keep the code as clear and simple as possible. Juggling extra CREATE and DROP statements for temporary tables or views works against that goal.

Once we have a correct, clear foundation, we move on to the optimizations I mentioned in the conclusion. At that point, we consider re-writing CTEs as views or materialized tables on a case-by-case basis. Sometimes the balance tips away from CTEs. In our experience, the most common reason for this has been to gain the performance benefits of indexing on intermediate result sets that can contain hundreds of millions to tens of billions of rows.
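
To make that concrete, one of the “counts” CTEs from the frequency report could be rewritten as an indexed temporary table along these lines (a sketch only; it assumes filtered_impressions has itself been materialized as a table):

CREATE TEMPORARY TABLE impression_counts AS
    SELECT user_id,
           campaign_id,
           SUM(1) AS impression_count
    FROM filtered_impressions
    GROUP BY 1, 2;

-- The index on the join keys is what the CTE form cannot provide.
CREATE INDEX impression_counts_user_campaign_idx
    ON impression_counts (user_id, campaign_id);

The trade-off is exactly the one described above: better performance on very large intermediate result sets, in exchange for managing additional CREATE and DROP statements.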


Open Source Release: java-hll

We’re happy to announce our newest open-source project, java-hll, a HyperLogLog implementation in Java that is storage-compatible with the previously released postgresql-hll and js-hll implementations. And as the rule of three dictates, we’ve also extracted the storage specification that makes them interoperable into its own repository. Currently, all three implementations support reading storage specification v1.0.0, while only the PostgreSQL and Java implementations fully support writing v1.0.0. We hope to bring the JS implementation up to speed with respect to serialization shortly.

Open Source Release: postgresql-hll

We’re happy to announce the first open-source release of AK’s PostgreSQL extension for building and manipulating HyperLogLog data structures in SQL, postgresql-hll. We are releasing this code under the Apache License, Version 2.0 which we feel is an excellent balance between permissive usage and liability limitation.

What is it and what can I do with it?

The extension introduces a new data type, hll, which represents a probabilistic distinct-value counter that is a hybrid between a HyperLogLog data structure (for large cardinalities) and a simple set (for small cardinalities). These structures support the basic HLL methods: insert, union, and cardinality, and we’ve also provided aggregate and debugging functions that make using and understanding these things a breeze. We’ve also included a way to do schema versioning of the binary representations of hlls, which should allow a clear path to upgrading the algorithm as new engineering insights come up.
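
As a small taste of what that looks like in practice, here is a minimal usage sketch; it assumes the extension is installed and a raw visits table with a bigint user_id, and the function names follow REFERENCE.markdown:

-- Table of per-day hlls (the same shape as the daily_uniques table shown later in this post).
CREATE TABLE daily_uniques (
    report_date date,
    users       hll
);

-- Roll each day's hashed user ids up into one hll per day.
INSERT INTO daily_uniques (report_date, users)
SELECT visit_date,
       hll_add_agg(hll_hash_bigint(user_id))
FROM visits                  -- assumed raw event table (visit_date, user_id)
GROUP BY 1;

-- Approximate distinct users per day; # is the cardinality operator.
SELECT report_date,
       #users AS daily_unique_users
FROM daily_uniques
ORDER BY report_date;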

A quick overview of what’s included in the release:

  • C-based extension that provides the hll data structure and algorithms
  • Austin Appleby’s MurmurHash3 implementation and SQL-land wrappers for integer numerics, bytes, and text
  • Full storage specification in STORAGE.markdown
  • Full function reference in REFERENCE.markdown
  • .spec file for rpmbuild
  • Full test suite

A quick note on why we included MurmurHash3 in the extension: we’ve done a good bit of research on the importance of a good hash function when using sketching algorithms like HyperLogLog and we came to the conclusion that it wouldn’t be very user-friendly to force the user to figure out how to get a good hash function into SQL-land. Sure, there are plenty of cryptographic hash functions available, but those are (computationally) overkill for what is needed. We did the research and found MurmurHash3 to be an excellent non-cryptographic hash function in both theory and practice. We’ve been using it in production for a while now with excellent results. As mentioned in the README, it’s of crucial importance to reliably hash the inputs to hlls.

Why did you build it?

The short answer is to power these two UIs:

On the left is a simple plot of the number of unique users seen per day and the number of cumulative unique users seen over the days in the month. The SQL behind this is very very straightforward:

SELECT report_date,
       #users as by_day,
       #hll_union_agg(users) OVER (ORDER BY report_date ASC) as cumulative_by_day
FROM daily_uniques
WHERE report_date BETWEEN '2013-01-01' AND '2013-01-31'
ORDER BY report_date ASC;

where daily_uniques is basically:

   Column    | Type | Modifiers 
-------------+------+-----------
 report_date | date | 
 users       | hll  |

Briefly, # is the cardinality operator, operating here on the hll result of the hll_union_agg aggregate, which is used as a window function to union each day’s hll with those of the previous days.

On the right is a heatmap of the percentage of an inventory provider’s users that overlap with another inventory provider. Essentially, we’re doing interactive set-intersection of operands with millions or billions of entries in milliseconds. This is intersection computed using the inclusion-exclusion principle as applied to hlls:

SELECT ip1.id as provider1,
       ip2.id as provider2,
       ((#ip1.users) + (#ip2.users) - (#hll_union(ip1.users, ip2.users))) / (#ip1.users) as overlap
FROM inventory_provider_stats ip1, inventory_provider_stats ip2
WHERE ip1.id <> ip2.id;

where inventory_provider_stats is basically:

   Column    |  Type  | Modifiers 
-------------+--------+-----------
 id          | bigint | 
 users       | hll    |

(Some of you may note that the diagonal is labeled “exclusive reach” and is not represented in the query’s result set. That’s because the SQL above is a simplification of what’s happening. There’s some extra work done that replaces the useless diagonal entries with the percent of the inventory provider’s users that are only seen on that inventory provider.)

We’ve been running this type of code in production for over a year now and are extremely pleased with its performance, ease of use, and expressiveness. Everyone from engineers to researchers to ops people to analysts has been using hlls in their daily reports and queries. We’re seeing product innovation coming from all different directions in the organization as a direct result of having these powerful data structures in an easily accessed and queried format. Dynamic COUNT(DISTINCT ...) queries that would have taken minutes or hours to compute from a fact table or would have been impossible in traditional cube aggregates return in milliseconds. Combine that speed with PostgreSQL’s window and aggregate functions and you have the ability to present interactive, rich distinct-value reporting over huge data sets. I’ll point you to the README and our blog posts on HyperLogLog for more technical details on storage, accuracy, and in-depth use cases.

I believe that this pattern of in-database probabilistic sketching is the future of interactive analytics. As our VP of Engineering Steve Linde said to me, “I can’t emphasize enough how much business value [sketches] deliver day in and day out.”

Our Commitment

Obviously we’re open-sourcing this for both philanthropic and selfish reasons: we’d love for more people to use this technology so that they can tell us all the neat uses for it that we haven’t thought of yet. In exchange for their insight, we’re promising to stay active in terms of stewardship and contribution of our own improvements. Our primary tool for this will be the GitHub Issues/Pull Request mechanism. We’d considered a mailing list but that seems like overkill right now. If people love postgresql-hll, we’ll figure something out as needed.

Please feel free to get in touch with us about the code on GitHub and about the project in general in the comments here. We hope to release additional tools that allow seamless Java application integration with the raw hll data in the future, so stay tuned!


Update

Looks like Dimitri Fontaine wrote up a basic “how-to” post on using postgresql-hll here and another on unions here. (Thanks, Dimitri!) He brings up the issue that hll_add_agg() returns NULL when aggregating over an empty set when it should probably return an empty hll. Hopefully we’ll have a fix for that soon. You can follow the progress of the issue here.
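
For reference, the reported behavior is easy to reproduce (a sketch; an empty generate_series stands in for any empty input set):

-- Aggregating over zero rows currently yields NULL rather than an empty hll.
SELECT hll_add_agg(hll_hash_integer(i)) AS should_be_empty_hll
FROM generate_series(1, 0) AS i;   -- produces no rows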