Author’s Note: Hello readers! I’m Josh O’Brien. I recently joined the Science team as a junior engineer, and this is my first post for the blog.
One of my first tasks with the Science team has been learning to write effective analytics SQL. I came in with a basic knowledge of SQL, but writing complex analytics reports required me to learn tools and strategies for managing complexity that aren’t yet part of the standard introductions to SQL. Luckily, I had the Science team to teach me to work with Common Table Expressions (CTEs). I’ve come to love CTEs for the clarity that they’ve helped bring to my thinking and writing in SQL. The CTE syntax encourages me to reason through a problem as a sequence of simple parts and enables me to directly code a solution in terms of those parts, which I can individually document and test for correctness. Working with CTEs has jump-started my productivity, and helped the team as a whole set a higher standard for our SQL.
In the Science team’s experience, much of the common frustration with SQL comes down to a failure to treat SQL queries as declarative programs that demand the same care as imperative programs. SQL is code, and we should treat it as such. We can better manage the complexity of SQL by using the same basic techniques we do in other languages: we can divide work into composable parts, document our intent, and test for correctness. We use CTEs as a foundation for building queries that are factored, documented, and tested, and we’ve enjoyed excellent results writing and maintaining numerous hundred- and thousand-line reports using this approach.
In this post, I’ll share an example of how the Science team uses CTEs to treat SQL as code. I’ll walk through the process of writing an analytics report with CTEs, and show how CTEs help me think through a problem and implement, document, and test a solution.
* If you’re thinking that CTEs are no better than temporary tables or views for these purposes, read on. CTEs, temporary tables, and views all have their place in our SQL toolkit. We use CTEs because they are best suited for this work. For more on the relative merits of CTEs, temporary tables, and views, please see the appendices to this post.
Common Table Expressions
Before we dive into the example report, let’s take a quick look at the CTE syntax we’ll be using. CTEs are defined inside of a WITH clause attached to a primary statement. Within the scope of the larger query, each CTE can be manipulated like a table. This allows us to chain CTEs together and build sequences of operations. In the following diagram, we're building up a four-part query, part by part. We start with two parts: a foo CTE attached to a main SELECT statement. Next, we add a bar CTE. In the final step, we add a baz CTE to complete the four-part query.
Notice what we did here. In the foo, bar, and baz CTEs, we now have three intermediate result sets that we can test individually and "print" with a SELECT *. Once we know each part is correct, we add another, until we've solved our problem. We can use CTEs to break queries into as many simple parts as the problem requires.
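To make the part-by-part build-up concrete, here is a minimal sketch. It assumes Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for PostgreSQL; the numbers table, the bodies of foo, bar, and baz, and the peek helper are all invented for illustration and do not come from the post.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real database.
# The `numbers` table and the CTE bodies below are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (n INTEGER NOT NULL)")
conn.executemany("INSERT INTO numbers (n) VALUES (?)", [(i,) for i in range(1, 7)])

# Each CTE is kept as its own string so the chain can grow one part at a time.
FOO = "foo AS (SELECT n FROM numbers WHERE n % 2 = 0)"  # keep the even numbers
BAR = "bar AS (SELECT n, n * n AS squared FROM foo)"     # square each of them
BAZ = "baz AS (SELECT SUM(squared) AS total FROM bar)"   # add up the squares


def peek(ctes, name):
    """'Print' one intermediate result set: WITH <ctes> SELECT * FROM <name>."""
    sql = "WITH " + ",\n     ".join(ctes) + "\nSELECT * FROM " + name
    return conn.execute(sql).fetchall()


# Step 1: foo plus a main SELECT. Step 2: add bar. Step 3: add baz.
foo_rows = peek([FOO], "foo")
bar_rows = peek([FOO, BAR], "bar")
baz_rows = peek([FOO, BAR, BAZ], "baz")
```

At each step only the newest CTE is unverified, so a surprising result points at one small piece of SQL rather than at the whole query.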
We use CTEs rather than temporary tables or views to decompose queries in development because they are simpler to use. There is no need to add the complexity of managing DROP statements at this stage in the writing process.
We’ll use a simplified example report to illustrate how we use CTEs in our everyday work: a frequency report. A frequency report is an online advertising analytics report that helps advertisers determine the number of ads to serve users over a specific time period. Advertisers want to reach out to customers enough times to build awareness of and interest in their offerings, but not so many times that customers become jaded or annoyed. A frequency report breaks down return on advertising investment by the number of ads users have been shown, a classification known as a user's impression frequency class.
This report produces data that can be graphed as:
Stripped all the way down, the basic query that generates the report above is:
    WITH impression_counts AS (
        SELECT
            user_id,
            SUM(1) AS impression_count
        FROM impressions
        GROUP BY 1
    )
    SELECT
        impression_count AS frequency_class,
        SUM(impression_count) AS total_impressions
    FROM impression_counts
    GROUP BY 1
    ;
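To see what the stripped-down query returns, the sketch below runs it against a handful of toy rows. It assumes Python's built-in sqlite3 module as a stand-in for PostgreSQL; the one-column impressions table and its five rows are invented for illustration, and an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumption: a toy impressions table with only the column this query needs.
conn.execute("CREATE TABLE impressions (user_id INTEGER NOT NULL)")
# Users 1 and 2 each saw one ad; user 3 saw three ads.
conn.executemany(
    "INSERT INTO impressions (user_id) VALUES (?)",
    [(1,), (2,), (3,), (3,), (3,)],
)

BASIC_QUERY = """
WITH impression_counts AS (
    SELECT user_id, SUM(1) AS impression_count
    FROM impressions
    GROUP BY 1
)
SELECT impression_count AS frequency_class,
       SUM(impression_count) AS total_impressions
FROM impression_counts
GROUP BY 1
ORDER BY 1
"""

# Each row is (frequency_class, total_impressions).
frequency_rows = conn.execute(BASIC_QUERY).fetchall()
```

With these rows, frequency class 1 accounts for two impressions (one each from users 1 and 2) and frequency class 3 accounts for three (all from user 3).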
The challenge of writing these reports comes from managing all the additional data we need. Actual reporting queries need to correctly handle the complexity of timestamp, ad campaign, conversion attribution, click, and cost data without becoming tangled messes.
For this simplified example, we'll start with tables recording impression (ad view), click (ad interaction), and conversion (sale) events, and produce a frequency report tracking the total number of users, impressions, clicks, and conversions for each impression frequency class for each ad campaign in the database for the month of March 2014. We can visualize our task like this:
Thinking with CTEs
Working with CTEs begins with reasoning about the problem in terms of the stages and parts needed to produce the report. From the above starting point, we can already work out four main stages.
We'll need to:
- FILTER the three input tables by record_date,
- SUM to get user-level counts for impressions, clicks, and conversions,
- JOIN those counts together on user_id and campaign_id, and finally,
- SUM to generate the report totals for users, impressions, clicks, and conversions.
We can express the relationships between these operations visually:
In one form or another, each of these operations would need to be a part of any query that produces this report. With CTEs, we can preserve the logical clarity of our thought process in the code itself. Each of the main parts of this query will be implemented using simple CTEs that serve only one main purpose. For added clarity, we will name and comment the CTEs to communicate our intent at every stage. This technique yields a query that we can read straight through and maintain with ease, just like our other code.
Writing with CTEs
Let’s take a look at a CTE from each stage right now. The full query with documentation comments can be found here, and in the appendices to this post.
First come the three filter CTEs. Here’s the CTE for filtered_impressions. Its only purpose is to filter the impressions table down to March 2014:
    filtered_impressions AS (
        SELECT
            record_date,
            user_id,
            campaign_id
        FROM impressions
        WHERE record_date >= '2014-03-01'
          AND record_date < '2014-04-01'
    )
Next, we calculate user-level counts for impressions, clicks, and conversions. Each of the three “counts” CTEs performs only a simple aggregation: a GROUP BY and a SUM. Here is the impression_counts CTE:
    impression_counts AS (
        SELECT
            user_id,
            campaign_id,
            SUM(1) AS impression_count
        FROM filtered_impressions
        GROUP BY 1, 2
    )
After that, we JOIN the three “counts” CTEs together in a single long table. This collated_counts CTE is the longest in the query, but, like the others, it has only one main purpose:
    collated_counts AS (
        SELECT
            imp.user_id AS user_id,
            imp.campaign_id AS campaign_id,
            imp.impression_count AS impression_count,
            cl.click_count AS click_count,
            conv.conversion_count AS conversion_count
        FROM impression_counts imp
        LEFT OUTER JOIN click_counts cl
            ON imp.user_id = cl.user_id
            AND imp.campaign_id = cl.campaign_id
        LEFT OUTER JOIN conversion_counts conv
            ON imp.user_id = conv.user_id
            AND imp.campaign_id = conv.campaign_id
    )
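One consequence of the LEFT OUTER JOIN is worth seeing in isolation: a user with impressions but no clicks gets a NULL click_count. The sketch below shows this on two toy users. It assumes Python's built-in sqlite3 module in place of PostgreSQL, and it fakes the two "counts" CTEs with hard-coded VALUES rows that are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumption: the "counts" CTEs are replaced by literal rows for this demo.
# User 101 has 3 impressions and 2 clicks; user 202 has 1 impression, no clicks.
JOINED = """
WITH impression_counts (user_id, campaign_id, impression_count) AS (
    VALUES (101, 1, 3), (202, 1, 1)
),
click_counts (user_id, campaign_id, click_count) AS (
    VALUES (101, 1, 2)
),
collated AS (
    SELECT imp.user_id AS user_id,
           cl.click_count AS click_count
    FROM impression_counts imp
    LEFT OUTER JOIN click_counts cl
        ON imp.user_id = cl.user_id
        AND imp.campaign_id = cl.campaign_id
)
"""

# Row-level view: the unmatched user carries NULL (None in Python).
per_user = conn.execute(
    JOINED + "SELECT user_id, click_count, COALESCE(click_count, 0) "
    "FROM collated ORDER BY user_id"
).fetchall()

# Aggregate view for a group made up only of unmatched users:
# SUM over all-NULL input is NULL, while SUM(COALESCE(..., 0)) is 0.
no_click_group = conn.execute(
    JOINED + "SELECT SUM(click_count), SUM(COALESCE(click_count, 0)) "
    "FROM collated WHERE user_id = 202"
).fetchone()
```

SUM skips NULL inputs, so mixed groups would still add up correctly, but a group in which nobody clicked would report NULL rather than 0 without the COALESCE.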
Last comes the main SELECT statement. Its only purpose is to group by impression_count (= frequency_class) and campaign_id, and calculate the four SUMs for the report:
    SELECT
        impression_count AS frequency_class,
        campaign_id AS campaign_id,
        SUM(1) AS total_users,
        SUM(impression_count) AS total_impressions,
        SUM(COALESCE(click_count, 0)) AS total_clicks,
        SUM(COALESCE(conversion_count, 0)) AS total_conversions
    FROM collated_counts
    GROUP BY 1, 2
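The four stages can be assembled into one runnable query. The following is a sketch under several assumptions, not the full listing referred to above: it uses Python's built-in sqlite3 module instead of PostgreSQL, it assumes the clicks and conversions tables have the same columns as impressions, and the click and conversion CTEs are written by analogy with the impression CTEs shown in the post. All rows are invented test data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumption: all three event tables share the impressions schema.
for table in ("impressions", "clicks", "conversions"):
    conn.execute(
        f"CREATE TABLE {table} (record_date TEXT NOT NULL, "
        "user_id INTEGER NOT NULL, campaign_id INTEGER NOT NULL)"
    )

rows = {
    "impressions": [
        ("2014-03-01", 101, 1), ("2014-03-15", 101, 1), ("2014-03-31", 101, 1),
        ("2014-03-10", 202, 1), ("2014-03-11", 303, 1),
        ("2014-02-28", 707, 7), ("2014-04-01", 707, 7),  # outside March 2014
    ],
    "clicks": [
        ("2014-03-15", 101, 1), ("2014-03-16", 101, 1), ("2014-03-10", 202, 1),
        ("2014-04-02", 202, 1),  # outside March 2014
    ],
    "conversions": [("2014-03-16", 101, 1)],
}
for table, data in rows.items():
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", data)

MARCH = "record_date >= '2014-03-01' AND record_date < '2014-04-01'"

REPORT = f"""
WITH
-- Stage 1: filter each input table to March 2014.
filtered_impressions AS (
    SELECT record_date, user_id, campaign_id FROM impressions WHERE {MARCH}),
filtered_clicks AS (
    SELECT record_date, user_id, campaign_id FROM clicks WHERE {MARCH}),
filtered_conversions AS (
    SELECT record_date, user_id, campaign_id FROM conversions WHERE {MARCH}),
-- Stage 2: user-level counts per campaign.
impression_counts AS (
    SELECT user_id, campaign_id, SUM(1) AS impression_count
    FROM filtered_impressions GROUP BY 1, 2),
click_counts AS (
    SELECT user_id, campaign_id, SUM(1) AS click_count
    FROM filtered_clicks GROUP BY 1, 2),
conversion_counts AS (
    SELECT user_id, campaign_id, SUM(1) AS conversion_count
    FROM filtered_conversions GROUP BY 1, 2),
-- Stage 3: join the counts on user_id and campaign_id.
collated_counts AS (
    SELECT imp.user_id AS user_id,
           imp.campaign_id AS campaign_id,
           imp.impression_count AS impression_count,
           cl.click_count AS click_count,
           conv.conversion_count AS conversion_count
    FROM impression_counts imp
    LEFT OUTER JOIN click_counts cl
        ON imp.user_id = cl.user_id AND imp.campaign_id = cl.campaign_id
    LEFT OUTER JOIN conversion_counts conv
        ON imp.user_id = conv.user_id AND imp.campaign_id = conv.campaign_id)
-- Stage 4: report totals per frequency class and campaign.
SELECT impression_count AS frequency_class,
       campaign_id AS campaign_id,
       SUM(1) AS total_users,
       SUM(impression_count) AS total_impressions,
       SUM(COALESCE(click_count, 0)) AS total_clicks,
       SUM(COALESCE(conversion_count, 0)) AS total_conversions
FROM collated_counts
GROUP BY 1, 2
ORDER BY 1, 2
"""

report_rows = conn.execute(REPORT).fetchall()
```

Because every stage is a named CTE, swapping the final SELECT for a SELECT * FROM click_counts (or any other stage) is enough to inspect that stage on the same test rows.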
Testing with CTEs
As we build up the query with CTEs, we leverage the ability to SELECT from each CTE individually to test for correctness as part of the writing process. This basic testing can be as simple as three files in a text editor, which we execute from psql (or equivalent) in sequence as we write:
- setup.sql: CREATE the tables and INSERT rows of test data
- test.sql: the query itself
- a teardown file: DROP the tables created in setup.sql
We write and comment one CTE at a time in the test file. Each time we add a CTE, we add test rows that exercise that CTE to the setup file, and include comments to indicate what should happen to those rows when we SELECT * from the relevant CTE. When the output matches our expectations, we move to the next part of the query, and repeat the process.
As an example, initial tests for the filtered_impressions CTE could consist of creating an impressions table and inserting five rows to exercise the date range in the WHERE clause. We indicate our expectations for those rows with brief comments:
    CREATE TABLE impressions (
        record_date date NOT NULL,
        user_id bigint NOT NULL,
        campaign_id bigint NOT NULL
    );

    INSERT INTO impressions (record_date, user_id, campaign_id) VALUES
        /* The following 2 rows should not appear in filtered_impressions: */
        ('2014-02-28', 707, 7),
        ('2014-04-01', 707, 7),
        /* The following 3 rows should appear in filtered_impressions: */
        ('2014-03-01', 101, 1),
        ('2014-03-15', 101, 1),
        ('2014-03-31', 101, 1)
    ;
This basic testing at the time of writing is not a substitute for a comprehensive test framework, but it is enough to catch many errors that could otherwise sneak through, and it provides a good return on a modest investment of effort. By the time the full query is complete, this process will have generated tests and documentation for each part of the query.
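The expectations written in the setup comments can also be checked mechanically. The sketch below is one possible way to do that, assuming Python's built-in sqlite3 module in place of psql and PostgreSQL (so the column types are adapted to SQLite). The three SQL strings play the roles of the three files described above; the Python variable names are illustrative only.

```python
import sqlite3

# Plays the role of setup.sql: create the table and insert the test rows.
SETUP_SQL = """
CREATE TABLE impressions (
    record_date TEXT NOT NULL,   -- assumption: ISO date strings in SQLite
    user_id INTEGER NOT NULL,
    campaign_id INTEGER NOT NULL
);
INSERT INTO impressions (record_date, user_id, campaign_id) VALUES
    /* The following 2 rows should not appear in filtered_impressions: */
    ('2014-02-28', 707, 7),
    ('2014-04-01', 707, 7),
    /* The following 3 rows should appear in filtered_impressions: */
    ('2014-03-01', 101, 1),
    ('2014-03-15', 101, 1),
    ('2014-03-31', 101, 1);
"""

# Plays the role of test.sql: the query so far, "printing" the newest CTE.
TEST_SQL = """
WITH filtered_impressions AS (
    SELECT record_date, user_id, campaign_id
    FROM impressions
    WHERE record_date >= '2014-03-01'
      AND record_date < '2014-04-01'
)
SELECT * FROM filtered_impressions ORDER BY record_date
"""

# Plays the role of the third file: drop what the setup step created.
TEARDOWN_SQL = "DROP TABLE impressions;"

# The comments in the setup rows, restated as data.
EXPECTED = [
    ("2014-03-01", 101, 1),
    ("2014-03-15", 101, 1),
    ("2014-03-31", 101, 1),
]

conn = sqlite3.connect(":memory:")
conn.executescript(SETUP_SQL)
actual = conn.execute(TEST_SQL).fetchall()
conn.executescript(TEARDOWN_SQL)

# After teardown, no user tables should remain in the database.
remaining_tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

test_passed = actual == EXPECTED
```

The boundary rows matter most here: 2014-03-01 and 2014-03-31 must be kept and 2014-04-01 must be dropped, which is exactly what the half-open range in the WHERE clause is meant to guarantee.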
This method of working with CTEs has helped me by bringing clarity and simplicity to complex analytics queries. Thinking, writing, and testing with CTEs helps me treat SQL as part of software engineering practice by writing SQL that’s factored, documented, and tested more like other code.
The Science team thinks of this method as producing a foundation for further refinements. When appropriate, optimizations for performance can and will be made, but we focus on correctness first. Optimizations tend to add complexity, and before we do that, we want to mitigate the complexity of the query as much as possible.
By starting with CTEs, we can more easily write queries that we can quickly read and reuse six months from now. Analysts can return to their models and analyses with confidence and engineers are better able to add new features to reports without introducing new bugs. We're building upon a foundation of factored, documented, and tested SQL.
Code for the Example Report
On CTEs, Temporary Tables, and Views
We asked Christophe Pettus of PostgreSQL Experts to help illuminate the tradeoffs between CTEs, views, and temporary tables, and received the following helpful response, which we publish here with his permission and our thanks:
[E]ach have characteristics that can make them better or worse in particular situations:
1. CTEs are optimization fences; the query planner will plan CTEs separately from the rest of the query. This can be good or bad, depending on the way the CTE is used.
2. Views are *not* optimization fences; you can think of them as being textually inserted into the query at the appropriate place, so queries can be rewritten, join clauses moved around, etc.
3. Temporary tables can have indexes; for very large intermediate result sets, this can be essential for good performance.
We agree that the choice between CTEs, temporary tables, and views is a matter of balancing the different trade-offs of the different stages of software development.
As explained in this post, the Science team finds the balance in favor of CTEs as the foundation for query development. We reach for the CTE syntax first for its clarity and ease of use. When we write and test queries part-by-part, we want to keep the code as clear and simple as possible. Juggling extra DROP statements for temporary tables or views works against that goal.
Once we have a correct, clear foundation, we move on to the optimizations I mentioned in the conclusion. At that point, we consider rewriting CTEs as views or materialized tables on a case-by-case basis. Sometimes the balance tips away from CTEs. In our experience, the most common reason for this has been to gain the performance benefits of indexing on intermediate result sets that can contain hundreds of millions to tens of billions of rows.
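As an illustration of the mechanics of such a rewrite, the sketch below turns the impression_counts CTE into an indexed temporary table. It assumes Python's built-in sqlite3 module rather than PostgreSQL, the index name is hypothetical, and the rows are toy data. On a table this small the index makes no practical difference; the point is only that the same SELECT body moves from a CTE into a table that can carry an index and must later be dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE impressions (record_date TEXT NOT NULL, "
    "user_id INTEGER NOT NULL, campaign_id INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO impressions VALUES (?, ?, ?)",
    [
        ("2014-03-01", 101, 1), ("2014-03-15", 101, 1), ("2014-03-31", 101, 1),
        ("2014-03-10", 202, 1),
    ],
)

LOOKUP = "SELECT impression_count FROM impression_counts WHERE user_id = 101 AND campaign_id = 1"

# Development form: the intermediate result is a CTE, nothing to clean up.
cte_result = conn.execute(
    """
    WITH impression_counts AS (
        SELECT user_id, campaign_id, SUM(1) AS impression_count
        FROM impressions
        GROUP BY 1, 2
    )
    """ + LOOKUP
).fetchall()

# Optimized form: the same SELECT body materialized as a temporary table,
# which (unlike a CTE) can be indexed for later joins and lookups.
conn.executescript(
    """
    CREATE TEMPORARY TABLE impression_counts AS
        SELECT user_id, campaign_id, SUM(1) AS impression_count
        FROM impressions
        GROUP BY 1, 2;
    -- Hypothetical index name, chosen for this example.
    CREATE INDEX impression_counts_user_campaign
        ON impression_counts (user_id, campaign_id);
    """
)
temp_result = conn.execute(LOOKUP).fetchall()

# The extra bookkeeping that CTEs avoid during development.
conn.execute("DROP TABLE impression_counts")
```

Both forms return the same rows, which makes the CTE version a convenient reference when checking that an optimized rewrite has not changed the results.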
More posts featuring CTEs
- "Everyday Postgres: How I write queries using psql: Common Table Expressions," by Selena Deckelmann
- "Postgres Common Table Expression Super Example," by Jeff Dwyer
- "PostgreSQL, Aggregates, and Histograms," by Dimitri Fontaine
- "The best Postgres feature you're not using -- CTEs aka WITH clauses," by Craig Kerstiens
- "How I write SQL," by Craig Kerstiens
- "Testing the output of tuned Postgres queries," by Gary Sieling