On Saturday, Aggregate Knowledge hosted the third No BS Data Salon on databases and data infrastructure. A handful of people showed up to hear Scott Andreas of Boundary talk about distributed streaming service architecture, and I gave a talk about AK’s use of probabilistic data structures.
The smaller group made for some fantastic, honest conversation about the different approaches to streaming architectures, the perils of distributing analytics workloads in a streaming setting, and the challenges of pushing scientific and engineering breakthroughs all the way through to product innovation.
We’re all looking forward to the next event, which will be in San Francisco in a month or two. If you have topics you’d like to see covered, we’d love to hear from you in the comments below!
As promised, I’ve assembled something of a “References” section for my talk, which you can find below.
- Original LogLog (LL) paper by Durand and Flajolet
- Original HyperLogLog (HLL) paper by Flajolet et al.
- Java implementation by ClearSpring
- Python implementation
- A paper on near-optimal compression of HLLs by Scheuermann and Mauve
- A post on LogLog and other similar probabilistic techniques like Count-min Sketch
- A post by our friends at Metamarkets about HLL where they propose a map-based technique for saving on memory
- Sean Gourley’s talk on human-scale analytics and decision-making
- Muthu Muthukrishnan’s home page, where research on streaming in general abounds
- A collection of C and Java implementations of different probabilistic sketches