At Aggregate Knowledge we are constantly concerned with our data space. Since our most basic data key is the cookie (cookie id), we are very interested in how cookies behave. To that end we have done a ton of research into what the cookie space looks like, both in the advertising world and on the web in general. Understanding the basic behavior of cookies (count, ingestion rate, growth rate, etc.) is vital for our architecture planning. Here I'll show you a view of the cookie space as we see it at Aggregate Knowledge, and over the next few posts I'll take you through some of the research we are doing.

To start things off we asked "How *should* cookies behave?" It's pretty easy to model what we expect to see. Let's make the reasonable assumptions that cookies are finite and persistent. As we track advertising around the web we are randomly sampling from this set of numbers (cookie ids). The question is: how many cookies will I see with respect to the number of ads I show? That is, if I draw from a set of uniquely numbered balls with replacement, how many draws do I need to see most or all of the numbers? If you think of this as a collision problem with `n` draws from `k` possible values, you can write the expected number of collisions as:

`E[collisions] = n − k + k(1 − 1/k)^n`

so the expected number of unique values we have seen is just `n` minus this:

`E[uniques] = k(1 − (1 − 1/k)^n)`
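The model is easy to check numerically. Here's a minimal sketch in Python, using the 500M-cookie and 3-billion-impression figures from later in the post as the assumed inputs:

```python
import random

def expected_uniques(k, n):
    """Expected number of distinct values seen after n uniform draws
    (with replacement) from k possible values."""
    return k * (1.0 - (1.0 - 1.0 / k) ** n)

# Small-scale simulation to confirm the formula.
random.seed(42)
k, n = 1000, 5000
seen = {random.randrange(k) for _ in range(n)}
print(len(seen), round(expected_uniques(k, n), 1))  # simulated vs. expected (~993)

# Under the 500M-cookie assumption, after 3 billion impressions we'd
# expect to have seen nearly all of them.
print(expected_uniques(5 * 10**8, 3 * 10**9) / (5 * 10**8))  # ≈ 0.9975
```

The second print is the "keel over" point the plots below are looking for: by 3 billion draws the expected-uniques curve is within a fraction of a percent of its asymptote.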

Let's plug in some reasonable numbers and plot this against our data. Assuming 500M cookies in the US, we would expect:

That seems reasonable. We'll "see" all of the cookies in about 3 billion page views. Let's plot our data on top:

Uh…ok. Well, clearly there are more than 500M cookies. Some of this can be explained by everyone having smartphones and iPads, meaning there are at least a few devices per internet user. All we should really need to do is collect a bit more data on our side and see when the unique-cookies-vs-impressions chart starts to keel over. Then I could fit an asymptotic curve to it and estimate how many cookies there are in the world. Fortunately we have more data available – let's look at all of AK's traffic this summer:

What could this possibly mean? At 40B ad impressions we *must* have seen a significant fraction of the cookies on the internet. So, what's going on? Well, we have some theories (robots, deleters, etc.) and over the next few weeks we'll share some of our adventures in cookie analysis.

Isn't this just the Coupon Collector's Problem? The corresponding calculation (where n = # unique values, uniformly sampled with replacement):

`E[draws] ≈ n * (ln(n) + γ)`

For n = 5*10^8, the expected # draws is about 1.03*10^10, or 20.6 * n. That’s a lot bigger than 3 billion.

Tobin – The coupon collector problem is slightly different in that you are waiting to find out how long it will take until you see EVERY coupon. It's really hard to find the one coupon out of (5×10^8 − 1) that you haven't seen yet, hence the O(n log(n)). In our case it's much looser: we don't really care how long it takes to see ALL of the cookies, just most of them. I wasn't really sure how to formalize "most" so we just plotted it out. Very insightful comment, though. Thanks!
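One standard way to formalize "most": the expected draws needed to see a fraction p of n equally likely values grows like n·ln(1/(1−p)), versus n·(ln(n) + γ) to see all of them. A quick sketch under the same 500M assumption (the 99% threshold is just an illustrative choice, not a figure from the post):

```python
import math

EULER_GAMMA = 0.5772156649  # Euler–Mascheroni constant

def draws_for_fraction(n, p):
    """Approximate draws to see a fraction p of n equally likely values,
    from E[uniques] = n*(1 - (1 - 1/n)^d) ≈ n*(1 - e^(-d/n))."""
    return n * math.log(1.0 / (1.0 - p))

def draws_for_all(n):
    """Coupon collector: expected draws to see every one of n values."""
    return n * (math.log(n) + EULER_GAMMA)

n = 5 * 10**8
print(draws_for_fraction(n, 0.99) / 1e9)  # ≈ 2.3 (billion draws for 99%)
print(draws_for_all(n) / 1e9)             # ≈ 10.3 (billion draws for all)
```

This matches both figures in the thread: seeing "most" (99%) of 500M cookies takes roughly the 3 billion impressions from the post, while seeing every last one is Tobin's 1.03×10^10.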

Awesome work. I’m really looking forward to seeing what comes up next. Will be tracking your research!