Open Source Release: js-hll

One of the first things that we wanted to do with HyperLogLog when we first started playing with it was to support and expose it natively in the browser. The thought of allowing users to directly interact with these structures — perform arbitrary unions and intersections on effectively unbounded sets all on the client — was exhilarating to us. We knew it could be done but we simply didn’t have the time.

Fast forward a few years to today. We finally had enough in the meager science/research budget to pick up an intern for a few months, and as a side project I tasked him with turning our dream into a reality. Without further ado, we are pleased to announce the open-source release of AK’s HyperLogLog implementation for JavaScript, js-hll. We are releasing this code under the Apache License, Version 2.0, matching our other open source offerings.

We knew that we couldn’t just release a bunch of JavaScript code without letting you see it in action — that would be a crime. We passed a few ideas around, and the one that kept bubbling to the top killed two birds with one stone: something that would showcase what you can do with HLL in the browser and also give us a tool for explaining HLLs. It is typical for us to explain how HLL intersections work using a Venn diagram. You draw some overlapping circles with a border that represents the error, and you talk about how, if that border is close to or larger than the intersection itself, you can’t say much about the size of that intersection. This works just OK on a whiteboard, but what you really want is a visualization that lets you select some sets and see the overlap, and maybe even play with the precision a little to see how that changes the result. Well, we did just that!

Click above to interact with the visualization

Note: There’s more interesting math in the error bounds that we haven’t explored. Presenting error bounds on a measurement that cannot mathematically be less than zero is problematic. For instance, if you have a ruler that can only measure to 1/2″ and you measure an object that truly is 1/8″ long you can say “all I know is this object measures under 0.25 inches”. Your object cannot measure less than 0 inches, so you would never say 0 minus some error bound. That is, you DO NOT say 0.0 ± 0.25 inches.  Similarly with set intersections there is no meaning to a negative intersection. We did some digging and just threw our hands up and tossed in what we feel are best practices. In the js-hll code we a) never show negative values and b) we call “spurious” any calculation that results in an answer within 20% of the error bound. If you have a better answer, we would love to hear it!
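To make that concrete, here is a minimal sketch (in Python, with hypothetical names; js-hll itself is JavaScript, of course) of the convention described above. The estimate and error bound are assumed to come from an inclusion-exclusion calculation elsewhere, and the 20% margin is one reading of “within 20% of the error bound”:

def describe_intersection(estimate, error_bound, spurious_margin=0.2):
    # Hypothetical helper illustrating the conventions above, not the js-hll code:
    # (a) never report a negative intersection, and (b) flag the result as
    # "spurious" when it lands within 20% of the error bound.
    clamped = max(0.0, estimate)  # a negative intersection has no meaning
    spurious = clamped <= (1.0 + spurious_margin) * error_bound
    return {"estimate": clamped, "error_bound": error_bound, "spurious": spurious}

# e.g. a raw estimate of -1200 with an error bound of 5000 is reported as 0 and spurious
print(describe_intersection(-1200.0, 5000.0))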

Open Source Release: js-murmur3-128

As you can imagine from all of our blog posts about hashing, we hash a lot of things. While the various hashing algorithms may be well-defined, the devil is always in the details, especially when working with multiple languages that have different ways of representing numbers. We’re happy to announce the open-source release of AK’s 128-bit Murmur3 implementation for JavaScript, js-murmur3-128. We are releasing this code under the Apache License, Version 2.0, matching our other open source offerings.

Details

The goal of the implementation is to produce a hash value that is equivalent to the C++ and Java (Guava) versions for the same input and it must be usable in the browser. (Full disclosure: we’re still working through some signed/unsigned issues between the C++ and Java/JavaScript versions. The Java and JavaScript versions match exactly.)

Usage

Java (Guava):

final int seed = 0;
final byte[] bytes = { (byte)0xDE, (byte)0xAD, (byte)0xBE, (byte)0xEF,
                       (byte)0xFE, (byte)0xED, (byte)0xFA, (byte)0xCE };
com.google.common.hash.HashFunction hashFunction = com.google.common.hash.Hashing.murmur3_128(seed);
com.google.common.hash.HashCode hashCode = hashFunction.newHasher()
       .putBytes(bytes)
       .hash();
System.err.println(hashCode.asLong());

JavaScript:

var seed = 0;
var rawKey = new ArrayBuffer(8);
var byteView = new Int8Array(rawKey);
byteView[0] = 0xDE; byteView[1] = 0xAD; byteView[2] = 0xBE; byteView[3] = 0xEF;
byteView[4] = 0xFE; byteView[5] = 0xED; byteView[6] = 0xFA; byteView[7] = 0xCE;
console.log(murmur3.hash128(rawKey, seed));

Call for Summer Interns

AK is looking for a summer intern in our R&D group. If any of our blog posts have interested you, then you’ll fit right in!

We’re looking for someone who has a good handle on a few programming languages (pick any two from R/Mathematica/Python/Javascript/Java) and has some math in their background — college-level calculus or algebra is plenty. Ideally, you’re interested in learning about:

  • building and tuning high-performance data structures,
  • streaming algorithms,
  • interesting data visualizations, and
  • how to translate academic research into business value.

It’s OK if you’ve never seen the stuff we write about on the blog before! We hadn’t either until we started researching it!

I can’t emphasize this enough: we don’t expect you to know how to do the things above yet. We simply expect you to have a passion for learning about them and the diligence to work through what (at the time) seem like impossible problems. Work experience is nice, but not necessary. As long as you can write clean code and can work hard, you’re well-qualified for this job.

If you’re interested, please send a brief note about why you’re interested, along with a CV and/or GitHub username to timon at aggregateknowledge dot com. For extra credit, please submit one (or more!) of the following:

  • An implementation of HLL, Count-Min Sketch, K-Min Values, or Distinct Sampling in a language of your choice.
  • An extension to Colin’s blog post about a good hash function that adds CityHash and SipHash to the shoot-out.
  • An explanation of the tradeoffs between using a hash map and Count-Min Sketch for counting item frequency.

(I feel like I shouldn’t have to say this, but yes, these are all answered somewhere on the internet. Don’t plagiarize. What we want is for you to go learn from them and try your own hand at implementing/experimenting. Also, don’t freak out, these are extra credit!)

Sketch of the Day: Probabilistic Counting with Stochastic Averaging (PCSA)

Before there was LogLog, SuperLogLog or HyperLogLog there was Probabilistic Counting with Stochastic Averaging (PCSA) from the seminal work “Probabilistic Counting Algorithms for Data Base Applications” (also known as the “FM Sketches” due to its two authors, Flajolet and Martin). The basis of PCSA matches that of the other Flajolet distinct value (DV) counters: hash values from a collection into binary strings, use patterns in those strings as indicators for the number of distinct values in that collection (bit-pattern observables), then use stochastic averaging to combine m trials into a better estimate. Our HyperLogLog post has more details on these estimators as well as stochastic averaging.

Observables

The choice of observable pattern in PCSA comes from the knowledge that in a collection of randomly generated binary strings, the following probabilities occur:

\begin{aligned}  P(\ldots 1) &= 2^{-1} \\  P(\ldots 10) &= 2^{-2} \\  P(\ldots 100) &= 2^{-3} \\  &\;\;\vdots \\  P(\ldots 1\,0^{k-1}) &= 2^{-k}  \end{aligned}

For each value added to the DV counter, a suitable hash is created and the position of the least-significant (right-most) 1 is determined. The corresponding position in a bitmap is updated and stored. I’ve created the simulation below so that you can get a feel for how this plays out.
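To make the update step concrete, here is a small Python sketch (illustrative only, and not the simulation code) of finding the position of the least-significant 1 in a hashed value and recording it in a bitmap:

def rank_of_lsb_one(x, max_width=32):
    # position (0-indexed from the right) of the least-significant 1 bit of x;
    # a value of 0 has no 1 bit, so cap it at max_width
    if x == 0:
        return max_width
    return (x & -x).bit_length() - 1

bitmap = [0] * 32
for h in (0b1011000, 0b0100, 0b1):   # stand-ins for hashed values
    bitmap[rank_of_lsb_one(h)] = 1   # P(position = k) = 2^-(k+1) for random bits
print(bitmap[:8])                    # -> [1, 0, 1, 1, 0, 0, 0, 0]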

Bit Pattern Simulation

Click above to run the bit-pattern simulation

(All bit representations in this post are numbered from 0 (the least-significant bit) on the right. This is the opposite of the direction in which they’re represented in the paper.)

Run the simulation a few times and notice how the bitmap is filled in. In particular, notice that it doesn’t necessarily fill in from the right side to the left — there are gaps that exist for a time that eventually get filled in. As the cardinality increases there will be a block of 1s on the right (the high probability slots), a block of 0s on the left (the low probability slots) and a “fringe” (as Flajolet et al. called it) of 1s and 0s in the middle.

I added a small pointer below the bitmap in the simulation to show how the cardinality corresponds to the expected bit position (based on the above probabilities). Notice what Flajolet et al. saw when they ran this same experiment: the least-significant (right-most) 0 is a pretty good estimator for the cardinality! In fact, when you run multiple trials you see that the position of this least-significant 0 for a given cardinality has a narrow distribution. When you combine the results with stochastic averaging it leads to a small relative error of 0.78 / \sqrt{m} and estimates the cardinality of the set quite well. You might have also observed that the most-significant (left-most) 1 can also be used as an estimator for the cardinality, but it isn’t as clear-cut. This is exactly the observable used in LogLog, SuperLogLog and HyperLogLog, and it does in fact lead to the larger relative error of 1.04 / \sqrt{m} (in the case of HLL).

Algorithm

The PCSA algorithm is elegant in its simplicity:

m = 2**b # with b in [4...16]

bitmaps = [[0]*32 for _ in range(m)] # initialize m 32-bit-wide bitmaps to 0s

##############################################################################################
# Construct the PCSA bitmaps
for h in hashed(data):
    bitmap_index = get_bitmap_index( h,b )    # binary address formed by the rightmost b bits
    run_length = run_of_zeros( h,b )          # length of the run of zeroes starting at bit b+1
    bitmaps[ bitmap_index ][ run_length ] = 1 # set the bitmap bit based on the run length observed

##############################################################################################
# Determine the cardinality
phi = 0.77351
R = sum( least_sig_zero( bitmap ) for bitmap in bitmaps ) / m # mean position of the least-significant 0
DV = m / phi * 2**R # the DV estimate

Stochastic averaging is accomplished via the arithmetic mean.
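For the curious, here is a small, self-contained Python sketch of the whole thing (an illustration of the algorithm above, not the code behind our simulation; MD5 is used only as a convenient stand-in for a proper hash function):

import hashlib

def pcsa_estimate(values, b=6):
    # m = 2^b bitmaps, 32 bits wide, least-significant-0 estimator
    m = 1 << b
    phi = 0.77351
    bitmaps = [[0] * 32 for _ in range(m)]
    for v in values:
        h = int.from_bytes(hashlib.md5(str(v).encode()).digest()[:8], "little")
        index = h & (m - 1)                                    # rightmost b bits pick the bitmap
        rest = h >> b                                          # remaining bits supply the observable
        run = (rest & -rest).bit_length() - 1 if rest else 31
        bitmaps[index][min(run, 31)] = 1                       # record the least-significant 1 position
    # R_j = position of the least-significant 0 in bitmap j, averaged over the m bitmaps
    R = sum(next(i for i, bit in enumerate(bm + [0]) if bit == 0) for bm in bitmaps)
    return (m / phi) * 2 ** (R / m)

print(round(pcsa_estimate(range(100000))))   # close to 100,000 (expected relative error ~0.78/sqrt(64), about 10%)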

You can see PCSA in action by clicking on the image below.

PCSA Simulation

Click above to run the PCSA simulation

There is one point to note about PCSA and small cardinalities: Flajolet et al. mention that there are “initial nonlinearities” in the algorithm which result in poor estimation at small cardinalities (n/m \approx 10 \, \text{to} \, 20 ). These can be dealt with by introducing corrections, but they leave it as an exercise for the reader to determine what those corrections are. Scheuermann et al. did the legwork in “Near-Optimal Compression of Probabilistic Counting Sketches for Networking Applications” and came up with a small correction term (see equation 6). Another approach is to simply use the linear (ball-bin) counting introduced in the HLL paper.

Set Operations

Just like HLL and KMV, unions are trivial to compute and lossless. The PCSA sketch is essentially a “marker” for runs of zeroes, so to perform a union you merely bit-wise OR the two sets of bitmaps together. Folding a PCSA down to a smaller m works the same way as in HLL, except that instead of HLL’s max you bit-wise OR the bitmaps together. Unfortunately, intersections have the same issue as in HLL: you must perform them using the inclusion-exclusion principle. We haven’t done the plots on intersection errors for PCSA, but you can imagine they are similar to HLL’s (and have the benefit of the better relative error of 0.78 / \sqrt{m}).
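A minimal Python sketch of the union (assuming bitmaps stored as lists of 0/1 ints, as in the pseudocode above):

def pcsa_union(bitmaps_a, bitmaps_b):
    # lossless union of two PCSA sketches built with the same m and hash:
    # simply bit-wise OR the corresponding bitmaps together
    return [[a | b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(bitmaps_a, bitmaps_b)]

# the union sketch estimates |A union B|; the intersection then comes from
# inclusion-exclusion: |A intersect B| ~= DV(A) + DV(B) - DV(A union B)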

PCSA vs. HLL

The fact that PCSA has a better relative error than HyperLogLog with the same number of registers (1.04 / 0.78 \approx 1.33 ) is slightly deceiving, in that the m registers (stored observations) are different sizes in the two sketches. A better way to look at it is to fix the accuracy of the sketches and see how they compare. If we would like the same relative error from both sketches, we can see that the relationship between register counts is:

\text{PCSA}_{RE} = \text{HLL}_{RE}

\dfrac{0.78}{\sqrt{m_{\scriptscriptstyle PCSA}}} = \dfrac{1.04}{\sqrt{m_{\scriptscriptstyle HLL}}}

m_{\scriptscriptstyle PCSA} = \left( \dfrac{0.78}{1.04} \right)^2 m_{\scriptscriptstyle HLL} \approx 0.563 \ m_{\scriptscriptstyle HLL}

Interestingly, PCSA only needs a little more than half the registers of an HLL to reach the same relative error. But this is also deceiving. What we should be asking is what is the size of each sketch if they provide the same relative error? HLL commonly uses a register width of 5 bits to count to billions whereas PCSA requires 32 bits. That means a PCSA sketch with the same accuracy as an HLL would be:

\begin{aligned}  \text{Size of PCSA} &= 32 \text{bits} \ m_{\scriptscriptstyle PCSA} = 32 \text{bits} \, ( 0.563 \ m_{\scriptscriptstyle HLL} )   \\  \\  \text{Size of HLL} &= 5 \text{bits} \ m_{\scriptscriptstyle HLL}  \end{aligned}

Therefore,

\dfrac{\text{Size of PCSA}} {\text{Size of HLL}} = \dfrac{32 \text{bits} \, ( 0.563 \ m_{\scriptscriptstyle HLL} )}{5 \text{bits} \ m_{\scriptscriptstyle HLL}} \approx 3.6

A PCSA sketch with the same accuracy is 3.6 times larger than HLL!
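Spelled out as a quick sanity check (plain Python arithmetic, nothing more):

m_hll = 2 ** 11                                # any register count; it cancels out
m_pcsa = (0.78 / 1.04) ** 2 * m_hll            # ~0.563 * m_hll for equal relative error
print(round((32 * m_pcsa) / (5 * m_hll), 2))   # 32-bit bitmaps vs 5-bit registers -> 3.6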

Optimizations

But what if you could make PCSA smaller by reducing the size of the bitmaps? Near the end of the paper in the Scrolling section, Flajolet et al. bring up the point that you can make the bitmaps take up less space. From the simulation you can observe that with a high probability there is a block of consecutive 1s on the right side of the bitmap and a block of consecutive 0s on the left side of the bitmap with a fringe in between. If one found the “global fringe” — that is the region defined by the left-most 1 and right-most 0 across all bitmaps — then only those bits need to be stored (along with an offset value). The authors theorized that a fringe width of 8 bits would be sufficient (though they fail to mention if there are any dependencies on the number of distinct values counted). We put this to the test.
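Here is a rough Python sketch of that storage scheme under our assumptions (a single global offset shared by all bitmaps plus a fixed-width fringe window per bitmap); it illustrates the idea rather than our exact truncation rules:

def compress_fringe(bitmaps, fringe_width=8):
    # the offset is the largest position below which every bitmap is all 1s
    # (our "left" truncation); any bits beyond offset + fringe_width are lost
    width = len(bitmaps[0])
    least_sig_zeros = [next((i for i, bit in enumerate(bm) if bit == 0), width)
                       for bm in bitmaps]
    offset = min(least_sig_zeros)
    fringes = [bm[offset:offset + fringe_width] for bm in bitmaps]
    return offset, fringes

def decompress_fringe(offset, fringes, width=32):
    # rebuild full bitmaps: 1s below the offset, the stored fringe, 0s above
    return [([1] * offset + fringe + [0] * width)[:width] for fringe in fringes]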

PCSA with Fringe Cardinality Error

For a plot with larger m values, click here.

In our simulations it appears that a fringe width of 12 bits is necessary to provide an unbiased estimator comparable to full-fringe (32-bit) PCSA for the range of distinct values we analyzed. (Notice the consistent bias of the smaller fringe sizes.) There are many interesting reasons that this “fringe” concept can fail; see the notes at the end of this post for more. If we take the above math and update from 32 to 12 bits per register (and include the 32-bit offset value) we get:

\begin{aligned}  \dfrac{\text{Size of PCSA}} {\text{Size of HLL}}  &= \dfrac{12 \text{bits} \, ( 0.563 \ m_{\scriptscriptstyle HLL} ) + 32\text{bits}} {5 \text{bits} \cdot m_{\scriptscriptstyle HLL}}  \\  \\  &= \dfrac{12 \text{bits} \, ( 0.563 \ m_{\scriptscriptstyle HLL} )} {5 \text{bits} \cdot m_{\scriptscriptstyle HLL}} + \dfrac{32\text{bits}} {5 \text{bits} \cdot m_{\scriptscriptstyle HLL}}  \\  \\  &\approx 1.35 \text{ (for }m\gg64 \text{)}    \end{aligned}

This is getting much closer to HLL! The combination of tighter bounds on the estimate and the fact that the fringe isn’t really that wide in practice result in PCSA being very close to the size of the much lauded HLL. This got us thinking about further compression techniques for PCSA. After all, we only need to get the sketch about 1/3 smaller to be comparable in size to HLL. In a future post we will talk about what happens if you Huffman code the PCSA bitmaps and the tradeoffs you make when you do this.

Summary

PCSA provides all of the goodness of HLL: very fast updates making it suitable for real-time use, a small footprint compared to the information that it provides, tunable accuracy and lossless unions. The fact that it has a much better relative error per register than HLL suggests that it should get more credit than it does. Unfortunately, each bitmap in PCSA requires more space than an HLL register, so you still get less accuracy per bit. Look for a future post on how it is possible to use compression (e.g. Huffman encoding) to reduce the number of bits per bitmap, thus reducing the error per bit to match that of HLL, resulting in an approach that matches HLL in size but exceeds its precision!

Notes on the Fringe

While we were putting this post together we discovered many interesting things to look at with respect to fringe optimization. One of the questions we wanted to answer was “How often does the limited size of the fringe muck up a bitmap?” Below is a plot that shows how often any given sketch had a truncation event (that affected the DV estimate) in the fringe of any one of its bitmaps for a given fringe width (i.e. some value could not be stored in the space available).

(Plot: fraction of runs with a fringe truncation event, by fringe width)

Note that this is an upper bound on the error that could be generated by truncation. If you compare the number of runs that had a truncation event (almost all of the runs) with the error plot in the post it is quite shocking that the errors are as small as they are.

Since we might not get around to all of the interesting research here, we are calling out to the community to help! Some ideas:

  1. There are likely a few ways to improve the fringe truncation. Since PCSA is so sensitive to the least-significant 1 in each bitmap, it would be very interesting to see how different approaches affect the algorithm. For example, in our algorithm we “left”-truncated, meaning that all bitmaps had to have a one in the least-significant position of the bitmap in order to move up the offset. It would be interesting to look at “right” truncation: if one bitmap is causing many of the others to not record incoming values, perhaps it should be bumped up. Is there some math to back up this intuition?
  2. It is interesting to us that the fringe-width truncation events are DV dependent. We struggled with the math on this for a bit before we stopped. Essentially, we want to know: what is the width of the theoretical fringe? It appears to be DV dependent and some sort of coupon-collector problem with unequal probabilities. Someone with better math skills than us needs to help here.

Closing thoughts

We uncovered PCSA again as a way to go back to first principles and see if there are lessons to be learned that could be applied to HLL to make it even better. For instance, can all of this work on the fringe be applied to HLL to reduce the number of bits per register while still maintaining the same precision? HLL effectively records the “strandline” (what we call the left-most 1s). More research into how this strandline behaves and if it is possible to improve the storage of it through truncation could reduce the standard HLL register width from 5 bits to 4, a huge savings! Obviously, we uncovered a lot of open questions with this research and we feel there are algorithmic improvements to HLL right around the corner. We have done some preliminary tests and the results so far are intriguing. Stay tuned!

Efficient Field-Striped, Nested, Disk-backed Record Storage

At AK we deal with a torrent of data every day. We can report on the lifetime of a campaign which may encompass more than a year’s worth of data. To be able to efficiently access our data we are constantly looking at different approaches to storage, retrieval and querying. One approach that we have been interested in involves dissecting data into its individual fields (or “columns” if you’re thinking in database terms) so that we only need to access the fields that are pertinent to a query. This is not a new approach to dealing with large volumes of data – it’s the basis of column-oriented databases like HBase.

Much of our data contains nested structures and this causes things to start to get a little more interesting, since this no longer easily fits within the data-model of traditional column-stores. Our Summarizer uses an in-memory approach to nested, field-striped storage but we wanted to investigate this for our on-disk data. Google published the Dremel paper a few years ago covering this exact topic. As with most papers, it only provides a limited overview of the approach without covering many of the “why”s and trade-offs made. So, we felt that we needed to start from the ground up and investigate how nested, field-striped storage works in order to really understand the problem.

Due to time constraints we have only been able to scratch the surface. Since the community is obviously interested in a Dremel-like project, we want to make the work that we have done available. We apologize in advance for the rough edges.

Without further ado: Efficient Field-Striped, Nested, Disk-backed Record Storage (on GitHub).

Netty’s CodecEmbedder

We love Netty. It’s a great, full-featured network framework for Java. One of the features that rounds out the framework is the CodecEmbedder. It allows you to test your encoders and decoders without any fuss using an offer-poll paradigm. For example, to test our Rsyslog decoder, we simply:

ChannelBuffer messageBuffer =
    ChannelBuffers.copiedBuffer("2011-06-30T00:00:03-07:00 some.host.agkn.net EVT_NM column1,column2\n", CharsetUtil.UTF_8);
RsyslogDecoder decoder = new RsyslogDecoder();
DecoderEmbedder<IRsyslogMessage> embedder = new DecoderEmbedder<IRsyslogMessage>(decoder);
embedder.offer(messageBuffer);

IRsyslogMessage message = embedder.poll();
assertNotNull(message, "Decoded message");
assertEquals(message.getTimestamp(), "2011-06-30T00:00:03-07:00", "Timestamp");
assertEquals(message.getHostname(), "some.host.agkn.net", "Host");
assertEquals(message.getProgramname(), "EVT_NM", "Programname");
assertEquals(message.getBody(), "column1,column2", "Body");

One gotcha to watch out for (which always manages to bite me in the butt, and is the impetus for writing this post) is that handlers will only process the type of data that they understand. Data of other types is passed along completely untouched. For example, while the following successfully compiles, it throws a java.lang.ClassCastException: java.lang.String cannot be cast to IRsyslogMessage at embedder.poll():

RsyslogDecoder decoder = new RsyslogDecoder();
DecoderEmbedder<IRsyslogMessage> embedder = new DecoderEmbedder<IRsyslogMessage>(decoder);
    embedder.offer("2011-06-30T00:00:03-07:00 some.host.agkn.net EVT_NM column1,column2\n");

IRsyslogMessage message = embedder.poll();
assertNotNull(message, "Decoded message");

The ChannelPipeline that backs the embedder can handle any type of input object. In the above case the offer’d object is a string, which is simply passed through the RsyslogDecoder untouched and then fails to pop out of the poll as an IRsyslogMessage. As long as you make sure that your offered object is understood by one of your handlers, the embedder will work as you expect.

No BS Data Salon

After being quite disenchanted with the state of the Big Data conferences, I thought that I would reach out to some folks that do work similar to ours and plan a mini conference of our own. The first guy that I reached out to was Mike Driscoll, the CTO of MetaMarkets. I had hit the jackpot on the first pull. Mike had been toying with the idea of having a “No BS Data Salon” where he’d get together folks that have challenging problems and present how they’ve solved them in a use-case style format. He wanted to hit at least three levels of the stack: visualization, analytics and data infrastructure. Timon and I encouraged him to take his ideas and make it real since it was exactly what we were thinking.

Today we had the first in the series. It covered data visualization. Mike put together a fantastic group of presenters from Bret Victor to Nick Bilton. All told, there were 5 presenters, a panel discussion on JS visualization tools, and around 20 attendees. It was an awesome opportunity to just talk shop about data and visualizations.

A big thanks to Mike, Nisha and all of the MetaMarkets folks for all of their work and hospitality. And another big thanks to all of the presenters. I certainly look forward to attending and presenting at future get togethers.

Image provided by Xavier Leaute

Summarizer

Since I get a lot of questions about our summarizer I thought I would provide some additional details.

Below is an example summary definition that we use:

summary CampaignDetails {
    key {
        int64   campaign_id = 1;
        int64   inventory_id = 2;
        int64   creative_id = 3;
        int64   audience_attribute_id = 4;
    }
    value {
        sumint32       impression_count = 1;
        multiset64     impression_uu = 2;

        sumcurrency    media_cost = 3;
        sumcurrency    data_cost = 4;

        summary Interaction {
            key {
                int64   action_id = 1;
            } 
            value {
                sumint32     count = 1;
                multiset64   uu_count = 2;
                sumcurrency  revenue = 3;
            }
        }
        repeated Interaction interactions = 5;

        summary Conversion {
            key {
                int64   model_id = 1;
                int64   type_id = 2;
            } 
            value {
                sumint32     count = 1;
                multiset64   uu_count = 2;
                sumcurrency  revenue = 3;
            }
        }
        repeated Conversion conversions = 6;
    }
}

The grammar was inspired by protobuf which is the IDL that we use to define our events. (The events themselves are currently a mix of CSV and JSON.) This schema definition is used to define both the in-memory storage format and the serialization format. The values are all aggregates of some kind: sumint32 and sumfloat aggregate (by summing) integers and floating-point numbers while multiset64 maintains a sketch of the set of unique users from which we can compute the cardinality. Streaming algorithms are a natural fit in this type of data store. Many O(n^m) algorithms have streaming O(n) counterparts that provide sufficiently accurate results. We’ll be covering streaming algorithms in future posts.

Ingested events are mapped to summaries via a mapping definition. Multiple events can be mapped to the same summary (by key). This feature has provided us with unexpected benefits: (diving slightly into the nuances of ad serving) impressions, clicks and conversions occur at three distinct times and are recorded as three separate events. When reporting on these entities one wants to see them as a single unit joined by their key; in other words, one wants to see clicks and conversions associated with impressions. In a relational database or map-reduce, one would need to group the impressions together by key, get their count, and then join (by key) against the group-by and count for the clicks and conversions. This is a bear of a join, and when combined with counting the number of unique users it can bring even the largest cluster to its knees. The summarizer simply maps different events into the same summary by key and aggregates as necessary. This provides us with a succinct (unsampled) summary of all impressions, clicks and conversions by key that can be accessed in O(1).
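A toy Python sketch of that idea, with a dictionary keyed by the dimensional tuple and a plain set standing in for the multiset64 unique-user sketch (the real summarizer uses the schema above and a cardinality sketch, not an exact set):

from collections import defaultdict

summary = defaultdict(lambda: {"impression_count": 0, "impression_uu": set(), "click_count": 0})

def ingest(event_type, campaign_id, inventory_id, creative_id, attribute_id, user_id=None):
    row = summary[(campaign_id, inventory_id, creative_id, attribute_id)]  # O(1) key look-up
    if event_type == "impression":
        row["impression_count"] += 1
        row["impression_uu"].add(user_id)   # stand-in for the multiset64 sketch
    elif event_type == "click":
        row["click_count"] += 1             # different event, same summary row

ingest("impression", 1, 7, 42, 3, user_id="u1")
ingest("click", 1, 7, 42, 3)
print(summary[(1, 7, 42, 3)])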

Currently we can aggregate more than 200,000 events per second per thread. Some events are aggregated into more than a dozen summaries. The key look-up and aggregation process parallelizes very well: we can ingest just shy of 1 million events per second per host. Even if we couldn’t reach our desired ingest rate, we could run multiple summarizers in parallel, giving each a fraction of the event stream. A simple post-process (which could be performed by another summarizer) would bring all of the results together. We have around 50 summaries with an average of 10 fields (counters) each. We’re currently tracking about 20M keys, which results in aggregating more than 200M individual counters.

Our summaries are designed such that they rarely contain more than 10M rows and are stored in CSV format. Initially we used CSV simply because we already had all of the code written for ingesting 3rd party CSV data. We quickly found other uses for them: our analysts gobbled them up and use them in Excel, our data scientists use them directly in R, and even our engineers use them for back-of-the-envelope calculations. Having manageable summaries and/or sketches enabled agile analytics throughout our organization.

That’s just a quick overview of the summarizer. Please post comments for any questions that you may have!

Performing joins at scale

At the core of analytics for digital advertising is a massive join that attributes impressions to interactions (e.g. clicks). The goal is to be able to answer questions such as: what was the click-through rate for creative A running in campaign B on inventory C for audience attributes D and E on day F?

Here at AK we are implementing a system that can record up to 30 billion impressions a day. If we assume a generous click-through rate of 1% then we will be recording 300 million clicks a day. Performing the attribution join on this volume of data is a formidable task for even the largest relational database or map-reduce cluster. Trying to get the results in near real-time raises the ante to the point where most techniques fold.

A single event (impression, interaction, or conversion) is far too granular to be meaningful, and this fact can be exploited to architect a system that can do the join efficiently in real-time.

Let’s start with some background information!

Ad Serving 101

When an impression request is received, the ad server will generate a unique identifier for the impression and use it for any click URLs that will be needed in the ad unit. The ad server will record this impression id along with the associated dimensional data (e.g. campaign id, inventory id, creative id, etc) in its system-of-record. If a user clicks on the served ad unit, the URL which contains the impression id is sent back to the ad server where it is recorded. Later, attribution is performed by joining together a click and an impression by their impression ids. The resulting information is rolled up to the desired granularity in various summaries.

Turning the problem on its head

The multi-step process of recording impressions and clicks, later joining each click to its impression, and finally rolling up by impression dimensions feels wrong. All of the work is done at the level of the individual impression and click, yet that granularity is immediately lost in the roll-up process. If the join could be done at the level of the rolled-up data then we could gain great economies of scale.

Getting the roll-up of the impressions is easy: instead of storing the impressions individually and then later rolling them up to the desired granularity, the impressions can be aggregated as they come in. This is better visualized using the example from before: we need to aggregate impressions by campaign, inventory, creative and audience attributes. To do this we simply create a collection of key-value pairs where the key is made up of the four desired dimensions — the campaign id, inventory id, creative id and audience attribute id — and the value is a count of the number of times that the key has been seen. In the language of our summarizer it looks like:

summary Impression {
    key {
        int64   campaign_id = 1;
        int64   inventory_id = 2;
        int64   creative_id = 3;
        int64   audience_attribute_id = 4;
    }
    value {
        sumint32   impression_count = 1;
    }
}

The key is made up of the four desired dimensions and the value has a field that aggregates (counts) each entry that it’s given. (The numbers associated with each field allow us to specify the order and position of the fields when they are serialized into and out of the summarizer.)

Now what about clicks? As described earlier, clicks are associated with an impression id, and only by joining with the corresponding impression is all of the dimensional data known. What we really want is to have the dimensional data associated with the click so that it can be aggregated in the same way that the impressions are. This summary looks like:

summary Click {
    key {
        int64   campaign_id = 1;
        int64   inventory_id = 2;
        int64   creative_id = 3;
        int64   audience_attribute_id = 4;
    }
    value {
        sumint32   click_count = 1;
    }
}

These two parallel summaries could then be joined on their key for the final result. This would be much more efficient than joining individual clicks and impressions since there are inherently fewer summarized rows than individual rows (assuming that there is more than one impression and/or click per campaign, inventory, etc.).
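For example, with dictionaries standing in for the two summaries (a toy Python sketch with made-up numbers), the join is just a look-up on the shared key:

impressions = {(1, 7, 42, 3): 120000, (1, 7, 42, 9): 45000}  # key -> impression_count
clicks      = {(1, 7, 42, 3): 1300}                          # key -> click_count

ctr_by_key = {key: clicks.get(key, 0) / impression_count     # join on the shared key
              for key, impression_count in impressions.items()}
print(ctr_by_key[(1, 7, 42, 3)])                             # ~0.0108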

The problem comes down to: how does the dimensional data get associated with the click without having to join to the impression? The answer: instead of using the impression id to create the click URL, all of the dimensional data should be used. (By definition, at impression time, all of the dimensional data associated with the impression has to be available to generate the click URL.) When an ad unit is clicked on, the server can record all of the dimensional data and the above summary can be created.

We have simplified a massive join down to a simpler join by effectively “pre-joining” the click to the impression by making the dimensional data available when the click is recorded. Can we do any better? Yes! We can completely remove the need for a join entirely by simply combining the two summaries into one.

summary ImpressionClick {
    key {
        int64   campaign_id = 1;
        int64   inventory_id = 2;
        int64   creative_id = 3;
        int64   audience_attribute_id = 4;
    }
    value {
        sumint32   impression_count = 1;
        sumint32   click_count = 2;
    }
}

When an impression is recorded, the impression_count for a given key is incremented while the click_count is left unchanged. Likewise, when a click is recorded, the click_count for a given key is incremented while the impression_count is left unchanged. Without doing an explicit join and without storing individual impressions and clicks, the desired summary can be efficiently built by simply updating counts for a given key.
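In toy Python form, the whole attribution pipeline collapses to this (contrast it with the joined version above):

from collections import defaultdict

impression_click = defaultdict(lambda: {"impression_count": 0, "click_count": 0})

def record_impression(key):
    impression_click[key]["impression_count"] += 1   # click_count left unchanged

def record_click(key):
    impression_click[key]["click_count"] += 1        # impression_count left unchanged

key = (1, 7, 42, 3)   # (campaign id, inventory id, creative id, audience attribute id)
record_impression(key); record_impression(key); record_click(key)
row = impression_click[key]
print(row["click_count"] / row["impression_count"])  # click-through rate with no join at all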

This idea can be extended to other activities such as conversions. For example, one of our summaries combines all of this information:

summary CampaignDetails {
    key {
        int64   campaign_id = 1;
        int64   inventory_id = 2;
        int64   creative_id = 3;
        int64   audience_attribute_id = 4;
    }
    value {
        sumint32       impression_count = 1;
        multiset64     impression_uu = 2;

        sumcurrency    media_cost = 3;
        sumcurrency    data_cost = 4;

        summary Interaction {
            key {
                int64   action_id = 1;
            } 
            value {
                sumint32     count = 1;
                multiset64   uu_count = 2;
                sumcurrency  revenue = 3;
            }
        }
        repeated Interaction interactions = 5;

        summary Conversion {
            key {
                int64   model_id = 1;
                int64   type_id = 2;
            } 
            value {
                sumint32     count = 1;
                multiset64   uu_count = 2;
                sumcurrency  revenue = 3;
            }
        }
        repeated Conversion conversions = 6;
    }
}

(We have complex types to support operations such as counting unique users. We also have the ability to have nested summaries. As shown in the example above, this allows us to have multiple interaction or conversion types associated with a single type of impression.)

Focus on the result

By requiring that all of the impression information is available when a click is recorded, we have turned the problem of a join and roll-up of billions of activities into a simple matter of doing key lookups and incrementing counters. One unforeseen benefit of this approach is that we have been forced to think about what we want to report on when we construct the response to an ad request. Because our focus is on providing actionable reports to our customers, requiring that we understand what those reports are before we start writing software is a wonderful thing.

Bring it on!

Recently I did a bit of show and tell on some of the volumes that we’re dealing with on our new infrastructure. It’s less than a month later and we’re handling over 3.5 billion events per day.

Nov 15, 2011 -- Events per second

(The three spikes that you see are due to the fact that we’re still trying to figure out the balance between event-archive compression speed and size. We should have this figured out over the next few weeks. Hopefully I can convince our Ops folks to do a post on what they’ve learned.)
