No BS Data Salon #3

On Saturday Aggregate Knowledge hosted the third No BS Data Salon on databases and data infrastructure. A handful of people showed up to hear Scott Andreas of Boundary talk about distributed, streaming service architecture, and I also gave a talk about AK’s use of probabilistic data structures.

The smaller group made for some fantastic, honest conversation about the different approaches to streaming architectures, the perils of distributing analytics workloads in a streaming setting, and the challenges of pushing scientific and engineering breakthroughs all the way through to product innovation.

We’re all looking forward to the next event, which will be in San Francisco, in a month or two. If you have topics you’d like to see covered, we’d love to hear from you in the comments below!

As promised, I’ve assembled something of a “References” section to my talk, which you can find below.

(Hyper)LogLog

Random

  • Sean Gourley’s talk on human-scale analytics and decision-making
  • Muthu Muthukrishnan’s home page, where research on streaming in general abounds
  • A collection of C and Java implementations of different probabilistic sketches

Attendees of the third No BS Data Salon

Comments

  1. Any chance of you guys using Webex or Skype to make these available to those of us who live out in the boondocks? The AK blogs are golden. Doubless the meetings would be equally valuable!

    • I have been a bit torn up to this point as to whether or not it makes sense to record the presentations and subsequent discussions and post them for general consumption. We’ve strived to make these an intimate and open discussion forum. I’m worried that recording with the intent to publish will change the dynamic as you’re forced to think about how a particular statement may be taken out of context, etc. Perhaps we can strike a compromise and record and make available just the presentations. I will float this idea past the attendees at the next data salon. Thanks for your interest!

  2. Do you guys use meetup or something similar to organize these? I’d be interested in attending the SF based ones.

  3. Tim – thanks for the link to your presentation on HLL. I’m one of the authors of the Clearspring HLL implementation. Can you share which implementation you are using in production?

    • Hi Matt,

      We use our own implementation in production. When we were looking at the HLL paper almost two years ago, no open-sourced Java implementations existed. We ended up rolling our own on top of primitive arrays and a whole heck of a lot of bit slicing. We’ve considered open-sourcing it in the past, but our use case (and hence our final implementation) was so specific to our needs that we held off for a while. We keep tens of millions if not hundreds of millions of HLL instances in memory at any given time. This forces us to use custom slab allocators to circumvent lengthy GC pauses, which in turn encumber our APIs with manual memory management semantics. That said, we’re starting to see external interest/demand for our common use case so we’re hoping to open-source our implementation in the near future.

      In addition to the Java implementation, we’ve also written a C implementation. We’re hoping to wrap it with the requisite bits to allow for a Python extension. What we’re most excited about, though, is the Postgres module we’ve built that allows the use of HLLs as native data types in the database. We’ve been using a highly specialized version in production for over a year and it’s been incredibly useful, as I mention in the presentation. We’re hoping to release that as well.

      Best,
      Timon

      • Awesome, the Postgres module sounds interesting. I’ve seen some traffic on the Cassandra mailing list indicating that they are looking into using it there as well. I’ll follow the blog to look for the annoucment for when/if you open source your implementation.

        Given your use case of storing many millions of HLL instances you may be interested in the DSIUtils IntHyperLogLogCounterArray implementation where each instance stores an array of counters and is extremely memory efficient when many counters are required, http://dsiutils.dsi.unimi.it/docs/

      • Thanks, Matt!

        Our implementation is similar to the one in DSIUtils, but we’ve included a few space-saving optimizations, like the one proposed by our pals at MetaMarkets. Our data distribution happens to be highly Zipfian (a few reporting keys get most of the events, while a long/heavy tail of keys get the rest) and as such punishes us for keeping a “full-width” counter representation for every key. We’ve also done a good bit of work around serialization and reallocation that seems to be missing from the DSIUtils implementation. I’ll be elaborating more on our implementation in the near future in another blog post.

        Thanks again!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: