Summarizer

Since I get a lot of questions about our summarizer, I thought I would provide some additional details.

Below is an example summary definition that we use:

summary CampaignDetails {
    key {
        int64   campaign_id = 1;
        int64   inventory_id = 2;
        int64   creative_id = 3;
        int64   audience_attribute_id = 4;
    }
    value {
        sumint32       impression_count = 1;
        multiset64     impression_uu = 2;

        sumcurrency    media_cost = 3;
        sumcurrency    data_cost = 4;

        summary Interaction {
            key {
                int64   action_id = 1;
            } 
            value {
                sumint32     count = 1;
                multiset64   uu_count = 2;
                sumcurrency  revenue = 3;
            }
        }
        repeated Interaction interactions = 5;

        summary Conversion {
            key {
                int64   model_id = 1;
                int64   type_id = 2;
            } 
            value {
                sumint32     count = 1;
                multiset64   uu_count = 2;
                sumcurrency  revenue = 3;
            }
        }
        repeated Conversion conversions = 6;
    }
}

The grammar was inspired by protobuf, which is the IDL we use to define our events. (The events themselves are currently a mix of CSV and JSON.) This schema definition defines both the in-memory storage format and the serialization format. The values are all aggregates of some kind: sumint32 and sumfloat aggregate (by summing) integers and floating-point numbers, while multiset64 maintains a sketch of the set of unique users from which we can compute the cardinality. Streaming algorithms are a natural fit for this type of data store: many O(n^m) algorithms have streaming O(n) counterparts that provide sufficiently accurate results. We’ll be covering streaming algorithms in future posts.
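To make the field semantics concrete, here is a minimal sketch in Python (an illustrative choice, not necessarily our implementation language): sum fields simply add on every update, while multiset64 keeps a sketch of unique users. The sketch is stood in for by a plain set here; a real implementation would use a probabilistic structure so memory stays bounded.

# A minimal sketch of the aggregate semantics for the CampaignDetails value.
# The set below is a stand-in for the multiset64 sketch; with a real sketch,
# unique_users() would return an estimate rather than an exact count.
class CampaignDetailsValue:
    def __init__(self):
        self.impression_count = 0      # sumint32: adds on update
        self.impression_uu = set()     # multiset64: sketch of unique users
        self.media_cost = 0.0          # sumcurrency: adds on update
        self.data_cost = 0.0           # sumcurrency: adds on update

    def update(self, user_id, media_cost, data_cost):
        self.impression_count += 1
        self.impression_uu.add(user_id)
        self.media_cost += media_cost
        self.data_cost += data_cost

    def unique_users(self):
        return len(self.impression_uu)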

Ingested events are mapped to summaries via a mapping definition, and multiple events can be mapped to the same summary (by key). This feature has provided us with unexpected benefits. Diving slightly into the nuances of ad serving: impressions, clicks and conversions occur at three distinct times and are recorded as three separate events, but when reporting on these entities one wants to see them as a single unit joined by their key. In other words, one wants to see clicks and conversions associated with their impressions. In a relational database or map-reduce, one would need to group the impressions together by key, get their count, and then join, by key, against the group-by and count for the clicks and conversions. This is a bear of a join, and when combined with counting the number of unique users it can bring even the largest cluster to its knees. The summarizer simply maps the different events into the same summary by key and aggregates as necessary. This provides us with a succinct (unsampled) summary of all impressions, clicks and conversions by key that can be accessed in O(1).
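A hedged sketch of that idea in Python (the event handlers and counter names are illustrative, not our actual mapping definitions): distinct event types update the same row, found by its key in constant time.

# (campaign_id, inventory_id, creative_id, audience_attribute_id) -> counters
summaries = {}

def summary_for(key):
    return summaries.setdefault(key, {"impressions": 0, "clicks": 0, "conversions": 0})

def key_of(event):
    return (event["campaign_id"], event["inventory_id"],
            event["creative_id"], event["audience_attribute_id"])

def on_impression(event):
    summary_for(key_of(event))["impressions"] += 1

def on_click(event):
    summary_for(key_of(event))["clicks"] += 1

def on_conversion(event):
    summary_for(key_of(event))["conversions"] += 1

# Because all three event types land in the same row, reporting never performs
# the impression/click/conversion join; the row is simply read back in O(1).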

Currently we can aggregate more than 200,000 events per second per thread, and some events are aggregated into more than a dozen summaries. The key look-up and aggregation process parallelizes very well: we can ingest just shy of 1 million events per second per host. Even if we couldn’t reach our desired ingest rate, we could run multiple summarizers in parallel, giving each a fraction of the event stream; a simple post-process (which could be performed by another summarizer) would bring all of the results together. We have around 50 summaries with an average of 10 fields (counters) each. We’re currently tracking about 20M keys, which means we are aggregating more than 200M individual counters.
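A sketch of what that sharding and post-process merge could look like, with hypothetical helper names rather than our actual code:

# Each summarizer instance sees a fraction of the event stream, chosen by key,
# so partial results for a given key all land on the same shard.
def shard_for(key, num_shards):
    return hash(key) % num_shards

# The post-process merges the per-shard summaries back into one.
def merge(partials):
    merged = {}
    for partial in partials:                  # one dict per summarizer instance
        for key, counters in partial.items():
            row = merged.setdefault(key, {})
            for name, value in counters.items():
                # Sum-type counters simply add; a real multiset64 field would
                # be merged sketch-to-sketch instead.
                row[name] = row.get(name, 0) + value
    return merged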

Our summaries are designed such that they rarely contain more than 10M rows, and they are stored in CSV format. Initially we used CSV simply because we already had all of the code written for ingesting 3rd-party CSV data, but we quickly found other uses for them: our analysts gobble them up and use them in Excel, our data scientists use them directly in R, and even our engineers use them for back-of-the-envelope calculations. Having manageable summaries and/or sketches has enabled agile analytics throughout our organization.
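For illustration only (the column names below are assumed, not our actual export format), dumping a summary of that size to CSV is as simple as:

import csv

# Write one summary to disk so it can be opened in Excel or loaded into R.
def write_summary_csv(path, summaries):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["campaign_id", "inventory_id", "creative_id",
                         "audience_attribute_id", "impressions", "clicks", "conversions"])
        for key, counters in summaries.items():
            writer.writerow(list(key) + [counters.get("impressions", 0),
                                         counters.get("clicks", 0),
                                         counters.get("conversions", 0)])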

That’s just a quick overview of the summarizer. Please post comments for any questions that you may have!

