Performing joins at scale

At the core of analytics for digital advertising is a massive join that attributes impressions to interactions (e.g. clicks). The goal is to be able to answer questions such as: what was the click-through rate for creative A running in campaign B on inventory C for audience attributes D and E on day F?

Here at AK we are implementing a system that can record up to 30 billion impressions a day. If we assume a generous click-through rate of 1% then we will be recording 300 million clicks a day. Performing the attribution join on this volume of data is a formidable task for even the largest relational database or map-reduce cluster. Trying to get the results in near real-time raises the ante to the point where most techniques fold.

A single event (impression, interaction, or conversion) is far too granular to be meaningful, and this fact can be exploited to architect a system that can do the join efficiently in real-time.

Let’s start with some background information!

Ad Serving 101

When an impression request is received, the ad server will generate a unique identifier for the impression and use it for any click URLs that will be needed in the ad unit. The ad server will record this impression id along with the associated dimensional data (e.g. campaign id, inventory id, creative id, etc) in its system-of-record. If a user clicks on the served ad unit, the URL which contains the impression id is sent back to the ad server where it is recorded. Later, attribution is performed by joining together a click and an impression by their impression ids. The resulting information is rolled up to the desired granularity in various summaries.
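To make that traditional flow concrete, here is a minimal Python sketch of it. This is illustrative only (it is not our ad server, and the names are invented): impressions and clicks are stored individually, and the attribution join happens afterward on the impression id.

import uuid
from collections import defaultdict

impressions = {}   # impression_id -> (campaign, inventory, creative, audience attribute)
clicks = []        # impression ids that came back on click URLs

def serve_impression(campaign_id, inventory_id, creative_id, audience_attribute_id):
    impression_id = str(uuid.uuid4())
    impressions[impression_id] = (campaign_id, inventory_id, creative_id, audience_attribute_id)
    return impression_id   # embedded in the ad unit's click URL

def record_click(impression_id):
    clicks.append(impression_id)

def attribute_and_roll_up():
    # join each click back to its impression, then roll up by the four dimensions
    rollup = defaultdict(lambda: {"impressions": 0, "clicks": 0})
    for dims in impressions.values():
        rollup[dims]["impressions"] += 1
    for impression_id in clicks:
        rollup[impressions[impression_id]]["clicks"] += 1
    return rollup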

Turning the problem on its head

The multi-step process of recording impressions and clicks, later joining each click to its impression, and finally rolling up by impression dimensions feels wrong. All of the work is done at the level of the individual impression and click, yet that granularity is immediately lost in the roll-up process. If the join could be done at the level of the rolled-up data then we could gain great economies of scale.

Getting the roll-up of the impressions is easy: instead of storing the impressions individually and then later rolling them up to the desired granularity, the impressions can be aggregated as they come in. This is better visualized using the example from before: we need to aggregate impressions by campaign, inventory, creative and audience attributes. To do this we simply create a collection of key-value pairs where the key is made up of the desired four dimensions — the campaign id, inventory id, creative id and audience attribute id — and the value is a count of the number of times that the key has been seen. In the language of our summarizer it looks like:

summary Impression {
    key {
        int64   campaign_id = 1;
        int64   inventory_id = 2;
        int64   creative_id = 3;
        int64   audience_attribute_id = 4;
    }
    value {
        sumint32   impression_count = 1;
    }
}

The key is made up of the four desired dimensions and the value has a field that aggregates (counts) each entry that it’s given. (The numbers associated with each field allow us to specify the order and position of the fields when they are serialized into and out of the summarizer.)
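For intuition, the same aggregation can be sketched in a few lines of Python (a toy stand-in for the summarizer, with invented names): the key is the 4-tuple of dimension ids and the value is a running count.

from collections import Counter

# key: (campaign_id, inventory_id, creative_id, audience_attribute_id)
impression_summary = Counter()

def record_impression(campaign_id, inventory_id, creative_id, audience_attribute_id):
    key = (campaign_id, inventory_id, creative_id, audience_attribute_id)
    impression_summary[key] += 1   # plays the role of the sumint32 impression_count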

Now what about clicks? As described earlier, clicks are associated with an impression id, and only by joining with the corresponding impression is all of the dimensional data known. What we really want is to have the dimensional data associated with the click so that it can be aggregated in the same way that the impressions are. This summary looks like:

summary Click {
    key {
        int64   campaign_id = 1;
        int64   inventory_id = 2;
        int64   creative_id = 3;
        int64   audience_attribute_id = 4;
    }
    value {
        sumint32   click_count = 1;
    }
}

These two parallel summaries could then be joined on their key for the final result. This would be much more efficient than joining individual clicks and impressions since there are inherently fewer summarized rows than individual rows (assuming that there is more than one impression and/or click per campaign, inventory, etc.).

The problem comes down to: how does the dimensional data get associated with the click without having to join to the impression? The answer is: instead of using the impression id to create the click URL, all of the dimensional data should be used. (By definition, at impression time, all of the dimensional data associated with the impression has to be available to generate the click URL.) When an ad unit is clicked on, the server can record all of the dimensional data and the above summary can be created.
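As a rough illustration (not our actual URL format; the parameter names are invented, and in practice the payload would likely be encoded more compactly and signed), the click URL can simply carry the dimensional key:

from urllib.parse import urlencode, parse_qs

DIMENSIONS = ("campaign_id", "inventory_id", "creative_id", "audience_attribute_id")

def build_click_url(base_url, campaign_id, inventory_id, creative_id, audience_attribute_id):
    # everything needed for the roll-up is already known at impression time
    params = dict(zip(DIMENSIONS, (campaign_id, inventory_id, creative_id, audience_attribute_id)))
    return base_url + "?" + urlencode(params)

def parse_click_key(query_string):
    # the click server recovers the summary key directly; no impression lookup is needed
    q = parse_qs(query_string)
    return tuple(int(q[name][0]) for name in DIMENSIONS)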

We have simplified a massive join down to a simpler join by effectively “pre-joining” the click to the impression: the dimensional data is available when the click is recorded. Can we do any better? Yes! We can remove the need for a join entirely by simply combining the two summaries into one.

summary ImpressionClick {
    key {
        int64   campaign_id = 1;
        int64   inventory_id = 2;
        int64   creative_id = 3;
        int64   audience_attribute_id = 4;
    }
    value {
        sumint32   impression_count = 1;
        sumint32   click_count = 2;
    }
}

When an impression is recorded, the impression_count for a given key is incremented while the click_count is left unchanged. Likewise, when a click is recorded, the click_count for a given key is incremented while the impression_count is left unchanged. Without doing an explicit join and without storing individual impressions and clicks, the desired summary can be efficiently built by simply updating counts for a given key.
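In toy form (again a Python stand-in for the summarizer, with invented names), the combined summary is just one map with two counters per key:

from collections import defaultdict

# one row per key, two counters per row; no join, and no individual events stored
summary = defaultdict(lambda: {"impression_count": 0, "click_count": 0})

def record_impression(key):
    summary[key]["impression_count"] += 1

def record_click(key):
    summary[key]["click_count"] += 1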

This idea can be extended to other activities such as conversions. For example, one of our summaries combines all of this information:

summary CampaignDetails {
    key {
        int64   campaign_id = 1;
        int64   inventory_id = 2;
        int64   creative_id = 3;
        int64   audience_attribute_id = 4;
    }
    value {
        sumint32       impression_count = 1;
        multiset64     impression_uu = 2;

        sumcurrency    media_cost = 3;
        sumcurrency    data_cost = 4;

        summary Interaction {
            key {
                int64   action_id = 1;
            } 
            value {
                sumint32     count = 1;
                multiset64   uu_count = 2;
                sumcurrency  revenue = 3;
            }
        }
        repeated Interaction interactions = 5;

        summary Conversion {
            key {
                int64   model_id = 1;
                int64   type_id = 2;
            } 
            value {
                sumint32     count = 1;
                multiset64   uu_count = 2;
                sumcurrency  revenue = 3;
            }
        }
        repeated Conversion conversions = 6;
    }
}

(We have complex types to support operations such as counting unique users, and we support nested summaries. As shown in the example above, this allows us to associate multiple interaction or conversion types with a single type of impression.)
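Here is a rough Python sketch of how such a nested summary could be maintained. It is illustrative only: the names are invented, and a plain set stands in for the multiset64 unique-user structure (which in our system is an approximate, sketch-based structure rather than an exact set).

from collections import defaultdict

def new_row():
    return {
        "impression_count": 0,
        "impression_uu": set(),   # stand-in for multiset64; exact but memory-hungry
        "media_cost": 0.0,
        "data_cost": 0.0,
        "interactions": defaultdict(lambda: {"count": 0, "uu": set(), "revenue": 0.0}),
        "conversions": defaultdict(lambda: {"count": 0, "uu": set(), "revenue": 0.0}),
    }

campaign_details = defaultdict(new_row)

def record_impression(key, user_id, media_cost=0.0, data_cost=0.0):
    row = campaign_details[key]
    row["impression_count"] += 1
    row["impression_uu"].add(user_id)
    row["media_cost"] += media_cost
    row["data_cost"] += data_cost

def record_interaction(key, action_id, user_id, revenue=0.0):
    sub = campaign_details[key]["interactions"][action_id]
    sub["count"] += 1
    sub["uu"].add(user_id)
    sub["revenue"] += revenue

def record_conversion(key, model_id, type_id, user_id, revenue=0.0):
    sub = campaign_details[key]["conversions"][(model_id, type_id)]
    sub["count"] += 1
    sub["uu"].add(user_id)
    sub["revenue"] += revenue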

Focus on the result

By requiring that all of the impression information is available when a click is recorded, we have turned the problem of a join and roll-up of billions of activities into a simple matter of doing key lookups and incrementing counters. One unforeseen benefit of this approach is that we have been forced to think about what we want to report on when we construct the response to an ad request. Because our focus is on providing actionable reports to our customers, requiring that we understand what those reports are before we start writing software is a wonderful thing.

Comments

  1. This is interesting. How would you add the additional level of detail that some advertisers require, like cookie-grain info, or geoip data? Wouldn’t that throw off the key/value pairing you describe?

    Thanks.

    • We currently keep summaries down to an attribute level (geo, demo, etc) — they are obviously some of our biggest. This adds up to hundreds of millions of rows (key-value pairs) per day, and there can be dozens of individual counters per row. We spend a lot of time looking at and making decisions about the granularity of data that we keep to ensure that it doesn’t blow up on us. If a customer does require us to go deep into every dimension and we don’t have information at that granularity in our summaries, then we use the closest summaries effectively as an index into the raw data to get us the results as quickly as possible.

  2. Nice approach front loading a first round of aggregation. One drawback you hint at is data granularity, and a big one is usually the timestamp. Time-series analyses are pretty common when looking for trends. Do your customers require you to look at the data across any time dimension (per second, hour, day, week, etc.)? Do you have any in-house analytics that looks at the data across time?

    Thanks. This is a great blog!

    • Thank you for your kind words!

      In our world (digital advertising) the smallest actionable timescale is the day, so we primarily aggregate at that level. We allow for reporting over arbitrary day ranges based on these day-level aggregates. With pre-computed weekly and monthly roll-ups along with the day-level data, we can compute an aggregate over any arbitrary time range fairly efficiently. We also show a day’s worth of hourly aggregates, but this is commonly only used to help in setting up tagging — it helps with the “is it collecting data yet?!?” problem.

