Rob, Matt, and I just wrapped up our trip to Copenhagen for the EADS Summer School on Hashing at the University of Copenhagen and it was a blast! The lineup of speakers was, simply put, unbeatable: Rasmus Pagh, Graham Cormode, Michael Mitzenmacher, Mikkel Thorup, Alex Andoni, Haim Kaplan, John Langford, and Suresh Venkatasubramanian. There’s a good chance that any paper you’ve read on hashing, sketching, or streaming has one of them as a co-author or is heavily influenced by their work. The format was three ninety-minute lectures a day for four days, with exercises presented at the end of each lecture. (Slides can be found here. ~~They might also post videos.~~ **UPDATE:** They’ve posted videos!)

Despite the depth of the material, almost all of it was approachable with some experience in undergraduate math and computer science. We would strongly recommend both of Michael Mitzenmacher’s talks (1, 2) for excellent overviews of Bloom Filters and Cuckoo hashing that are, in my opinion, significantly better and more in-depth than any others out there. Specifically, the Bloom Filter talk very elegantly presents the continuum from the Bloom Filter to the Counting Bloom Filter to the Count-Min Sketch (with “conservative update”) to the Stragglers Problem and Invertible Bloom Filters and, finally, to extremely recent work called Odd Sketches.
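To make the starting point of that continuum concrete, here is a minimal Bloom Filter sketch of our own (not code from the talk); the filter size `M`, the probe count `K`, and the double-hashing derivation of the probes are all illustrative choices, not tuned parameters:

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

template <size_t M = 1 << 16, size_t K = 4>
struct BloomFilter
{
    std::bitset<M> bits;

    // Derive K probe positions from two base hashes ("double hashing"):
    // probe_i(x) = h1(x) + i * h2(x). The multiplier is just a mixing
    // constant chosen for illustration.
    size_t probe(const std::string & key, size_t i) const
    {
        size_t h1 = std::hash<std::string>{}(key);
        size_t h2 = h1 * 0x9e3779b97f4a7c15ULL + 1;
        return (h1 + i * h2) % M;
    }

    void add(const std::string & key)
    {
        for (size_t i = 0; i < K; ++i)
            bits.set(probe(key, i));
    }

    // May return true for a key that was never added (a false positive),
    // but never returns false for a key that was added.
    bool mightContain(const std::string & key) const
    {
        for (size_t i = 0; i < K; ++i)
            if (!bits.test(probe(key, i)))
                return false;
        return true;
    }
};
```

The next step on the continuum, the Counting Bloom Filter, replaces each bit with a small counter so that deletions become possible.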

Similarly, Mikkel Thorup’s two talks on hashing (1, 2) do a very thorough job of examining the hows and whys of integer hashing, from the simplest multiply-mod-prime schemes all the way to modern work on tabulation hashing. And if you haven’t heard of tabulation hashing, and specifically twisted tabulation hashing, get on that, because (1) it’s amazing that it doesn’t produce trash given how simple it is, (2) it’s unbelievably fast, and (3) it has been *proven* to provide the guarantees required for almost all of the interesting topics we’ve discussed on the blog in the past: Bloom Filters, Count-Min sketch, HyperLogLog, chaining/linear-probing/cuckoo hash tables, and so on. We really appreciated how much attention Mikkel devoted to practicality of implementation and to empirical performance when discussing hashing algorithms. It warms our hearts to hear a leading academic in this field tout the number of nanoseconds it takes to hash an item as vocally as the elegance of the proof behind it!
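As a concrete reference point for the simple end of that spectrum, here is a sketch of a multiply-mod-prime hash, h(x) = ((a·x + b) mod p) mod m (our own illustration, not code from the talks). The constants `a` and `b` stand in for parameters drawn uniformly at random, and the Mersenne prime p = 2^61 − 1 lets the mod be computed with shifts and adds; `unsigned __int128` assumes gcc/clang on a 64-bit target:

```cpp
#include <cstdint>

using u128 = unsigned __int128; // gcc/clang extension on 64-bit targets

static const uint64_t p = (1ULL << 61) - 1; // Mersenne prime 2^61 - 1

// Stand-ins for randomly drawn parameters a in [1, p) and b in [0, p).
static const uint64_t a = 0x12e15e35b500f16eULL % p;
static const uint64_t b = 0x2e714eb2b37916a5ULL % p;

// h(x) = ((a*x + b) mod p) mod m. Keys are first reduced mod p, so strictly
// speaking the guarantee applies to keys in [0, p).
inline uint64_t multModPrime(uint64_t x, uint64_t m)
{
    u128 t = (u128)a * (x % p) + b;
    // Since 2^61 = 1 (mod p), folding the high bits onto the low 61 bits
    // preserves the value mod p; one conditional subtraction finishes it.
    uint64_t r = (uint64_t)(t & p) + (uint64_t)(t >> 61);
    if (r >= p)
        r -= p;
    return r % m; // reduce to the table size m
}
```

Tabulation hashing, at the other end of the spectrum, replaces the multiplications with table lookups, trading arithmetic for memory accesses.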

We love this “Summer School” format because it delivers the accumulated didactic insight of the field’s top researchers and educators on both old techniques and brand-new ones. (And we hope by now that everyone reading our blog appreciates how obsessed we are with teaching and clarifying interesting algorithms and data structures!) Usually most of this insight (into origins, creative process, stumbling blocks, intermediate results, inspiration, etc.) only comes out in conversation or lectures, and, even worse, it is withheld or elided at publishing time for the sake of “clarity” or “elegance”, which is a baffling rationale given how useful these “notes in the margin” have been to us. The longer format of the lectures really allowed for useful “digressions” into the history of or inspiration for topics and proofs, a huge contrast to the ten-minute presentations given at a conference like SODA. (Yes, obviously the objective of SODA is to show a much greater breadth of work, but that really makes it hard to explain or digest the context of new work.)

In much the same way, the length of the program really gave us the opportunity to have great conversations with the speakers and attendees between sessions and over dinner. We can’t emphasize this enough: if your ambition is to implement and understand cutting-edge algorithms and data structures, then the best bang for your buck is to get out there and meet the researchers in person. We’re incredibly lucky to call most of the speakers our friends and to regularly trade notes and get pointers to new research. They have helped us time and again when we’ve been baffled by inconsistent terminology or had a hunch that two pieces of research were “basically saying the same thing”. Unsurprisingly, they are also the best group of people to talk to when it comes to understanding how to foster a culture of successful research. For instance, Mikkel has a great article on how to systematically encourage and reward research that appears in the March 2013 issue of CACM (pay-wall’d). Also worthwhile is his guest post on Bertrand Meyer’s blog.

If Mikkel decides to host another one of these, we cannot recommend attending enough. (Did we mention it was free?!) Thanks again Mikkel, Rasmus, Graham, Alex, Michael, Haim, and John for organizing such a great program and lecturing so eloquently!

Hello, and thanks for the very interesting reading!

I have tested simple tabulation hashing as a UInt64 -> UInt64 hash function on an Intel(R) Core(TM) i5-2520M CPU, but it is slower (in both latency and throughput) than the Murmur finalizer, and it has worse avalanche behavior (tested with SMHasher).

Code for Murmur finalizer:

inline size_t murmurMix(UInt64 x)
{
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

Code for simple tabulation hashing:

inline size_t tabulation(UInt64 x)
{
    static UInt64 random[8][256] = …

    size_t res = 0;
    for (size_t i = 0; i < 8; ++i)
    {
        res ^= random[i][x & 0xFF];
        x >>= 8;
    }
    return res;
}

Compiled with gcc 4.8.2 at -O3; the loop was unrolled.

I am using the Murmur finalizer for a linear-probing hash table, and I am looking for a faster hash function with acceptable quality. Let’s assume that all 64 bits of the hash result are used.

BTW, the table for simple tabulation hashing occupies half of the L1d cache (8 × 256 × 8 bytes = 16 KiB), which may be too much for practical applications.

Have I made a mistake somewhere?

Hi Alexey!

Thanks for the note. Any chance you could post your full source code to a gist so I can take a closer look? I think the comment-filter took out part of it.

Thanks,

Timon

Here is the code:

Alexey – Thanks again for your comment. Mikkel himself has read your comment and his response is:

“We know from Vadhan and Mitzenmacher (http://people.seas.harvard.edu/~salil/research/streamhash-abs.html) that any 2-independent hashing works as well as truly random hashing for almost all input (since almost all inputs have high entropy). And for 2-independent hashing you can use one of the multiplication-shift schemes from my notes, possibly with the cross product, and end up using only a single multiplication for 64-bit keys. This will look good to almost all random tests.

The issue is if you want good probabilistic guarantees for all possible input, and not risk later nasty surprises when using your algorithms in a scenario with a new type of input. For some applications, 2-independence suffices, but for more complicated things, you need something like tabulation hashing to be safe.”

— Mikkel Thorup
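(For readers curious what such a scheme looks like in code, here is a rough sketch of plain 2-independent multiply-shift hashing, h(x) = ((a·x + b) mod 2^128) >> (128 − L). This is not the single-multiplication cross-product variant from Mikkel’s notes; the constants below merely stand in for the randomly drawn 128-bit parameters, and `unsigned __int128` assumes gcc/clang on a 64-bit target.)

```cpp
#include <cstdint>

using u128 = unsigned __int128; // gcc/clang extension on 64-bit targets

// Stand-ins for parameters drawn uniformly at random (as 128-bit values)
// when the hash function is initialized.
static const u128 MS_A = ((u128)0x9e3779b97f4a7c15ULL << 64) | 0xbf58476d1ce4e5b9ULL;
static const u128 MS_B = ((u128)0x94d049bb133111ebULL << 64) | 0x2545f4914f6cdd1dULL;

// 2-independent hash of a 64-bit key x down to L output bits, 1 <= L <= 64:
// h(x) = ((a*x + b) mod 2^128) >> (128 - L). With __int128 this costs a
// couple of 64-bit multiplications; the single-multiplication trick needs
// the cross-product form from Mikkel's notes.
inline uint64_t multShift(uint64_t x, unsigned L)
{
    return (uint64_t)((MS_A * x + MS_B) >> (128 - L));
}
```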

Hope that helps,

m