# Working with large sets

We do a lot of work with unique user counting, and we have developed some techniques for accurate counting in small, bounded-size structures. Periodically I like to make sure that all of our assumptions still hold as the world changes around us. I was recently running a number of experiments on large sets to get our science folks some data to analyze. It involved getting the cardinality, union and intersection of large sets of user ids (which are 64-bit values) the brute-force way.

Since I spend a good deal of my time writing Java, I figured I would just quickly whip something up. For set sizes of 1M or less, the “standard techniques” worked well enough: java.util.HashSet will do the trick. In general, for larger collections of primitives it’s a good idea to use one of the third-party libraries specifically tailored to primitives, such as Trove or Colt, to cut down on the time and memory bloat of autoboxing primitives into objects. (There are a number of posts about the “best” collections for a given circumstance, such as this one on StackOverflow.) You can get to around 10M entries in a traditional set before union and intersection take prohibitively long, because they work element by element.
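As a minimal sketch of the “standard techniques” for small sets, the following uses java.util.HashSet with boxed Long ids. The class and method names (HashSetOps, union, intersection) are made up for illustration and are not from the experiments described here.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, for illustration only.
final class HashSetOps {

    /** Returns a new set holding every id found in either input. */
    static Set<Long> union(Set<Long> a, Set<Long> b) {
        Set<Long> result = new HashSet<>(a);
        result.addAll(b);            // inserts the elements of b one at a time
        return result;
    }

    /** Returns a new set holding only the ids found in both inputs. */
    static Set<Long> intersection(Set<Long> a, Set<Long> b) {
        // Copy the smaller set, then probe the larger one once per copied element.
        Set<Long> small = a.size() <= b.size() ? a : b;
        Set<Long> large = (small == a) ? b : a;
        Set<Long> result = new HashSet<>(small);
        result.retainAll(large);     // drops every element that large does not contain
        return result;
    }

    public static void main(String[] args) {
        Set<Long> a = new HashSet<>();
        Set<Long> b = new HashSet<>();
        for (long id = 0; id < 1_000; id++) a.add(id);       // ids 0..999
        for (long id = 500; id < 1_500; id++) b.add(id);     // ids 500..1499
        System.out.println(union(a, b).size());        // 1500
        System.out.println(intersection(a, b).size()); // 500
    }
}
```

Both operations touch elements individually and every id is a boxed Long, which is where the time and memory cost comes from as the sets grow.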

Working with sets of over 10M entries requires different techniques. The most common approach is to use bit arrays. Bit arrays are not only compact (each element takes only one bit of RAM), but they are also very fast for set operations. The set operations are typically performed by taking chunks of the bit array and performing regular instructions on them. For example, the bit array can be chunked up into longs (64-bit ‘words’), and bitwise OR or bitwise AND operations are then performed pairwise on the words of the two ‘sets’. Java provides java.util.BitSet, which does all of the heavy lifting for you. (There are third-party bit arrays available too.)
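Here is a small sketch of union and intersection counts with java.util.BitSet. The class and method names are hypothetical, and it assumes the elements have already been turned into small non-negative int positions, since BitSet is indexed by int.

```java
import java.util.BitSet;

// Hypothetical helper class, for illustration only.
final class BitSetOps {

    /** Number of bits set in a OR b. */
    static int unionCardinality(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone(); // or() mutates its receiver, so work on a copy
        result.or(b);                       // bitwise OR, one 64-bit word at a time
        return result.cardinality();
    }

    /** Number of bits set in a AND b. */
    static int intersectionCardinality(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone(); // and() also mutates its receiver
        result.and(b);                      // bitwise AND, one 64-bit word at a time
        return result.cardinality();
    }

    public static void main(String[] args) {
        BitSet a = new BitSet();
        BitSet b = new BitSet();
        a.set(0, 1_000);      // sets bits 0..999 (the end index is exclusive)
        b.set(500, 1_500);    // sets bits 500..1499
        System.out.println(unionCardinality(a, b));        // 1500
        System.out.println(intersectionCardinality(a, b)); // 500
    }
}
```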

When using a hash-based set, one simply gives it the element that is to be stored (in our case, that’s a 64-bit user id). With a bit array, the interface is basically setBit(index, value) and getBit(index). The problem comes down to: what is the index? A naive approach would simply use the bit array in the same way as the hash set and pass it the element. Unfortunately, this would require a bit array that is 2^64-1 bits long, which is roughly 2 exbibytes. If you were a little dangerous with your knowledge, you could exploit some of the RLE (run-length encoding) techniques for compressing your bit array, such as Word Aligned Hybrid (WAH; there’s even a Java implementation as well as others found here). Another, call it ‘sane’, approach is to use a map together with the bit array to provide the index. The map translates each element to a sequential index, and that index is used in the bit array.

To insert (pseudo code):

```
index = elementToIndexMap.get(elementId)
if (index does not exist)
    index = sequence++
    elementToIndexMap.put(elementId, index)
endif
bitarray.setBit(index, true)
```

To retrieve (pseudo code):

```
index = elementToIndexMap.get(elementId)
if (index does not exist)
    return false /* assume not present means 'not set' */
endif
return bitarray.getBit(index)
```
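The insert and retrieve steps above could look something like this in Java. This is only a sketch under a few assumptions of mine: the names IdIndex and IndexedBitSet are invented, the map is a plain HashMap with boxed keys (a primitive-specialized map would be leaner), and the index is shared between the sets so that the same user id always lands on the same bit, which union and intersection need. Note that java.util.BitSet is indexed by int, so a single instance tops out at about 2.1 billion bits.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical shared dictionary: 64-bit element id -> sequential bit index.
final class IdIndex {
    private final Map<Long, Integer> elementToIndex = new HashMap<>();
    private int sequence = 0;

    /** Returns the index for id, assigning the next free one on first sight. */
    int indexFor(long id) {
        Integer index = elementToIndex.get(id);
        if (index == null) {
            index = sequence++;
            elementToIndex.put(id, index);
        }
        return index;
    }

    /** Returns the index for id, or null if the id has never been seen. */
    Integer lookup(long id) {
        return elementToIndex.get(id);
    }
}

// Hypothetical set of 64-bit ids backed by a BitSet and a shared IdIndex.
final class IndexedBitSet {
    private final IdIndex index;
    private final BitSet bits = new BitSet();

    IndexedBitSet(IdIndex index) {
        this.index = index;
    }

    void add(long id) {
        bits.set(index.indexFor(id));
    }

    boolean contains(long id) {
        Integer i = index.lookup(id);
        return i != null && bits.get(i);  // an id with no index was never set
    }

    int cardinality() {
        return bits.cardinality();
    }

    int unionCardinality(IndexedBitSet other) {
        requireSameIndex(other);
        BitSet result = (BitSet) bits.clone();  // leave both inputs untouched
        result.or(other.bits);
        return result.cardinality();
    }

    int intersectionCardinality(IndexedBitSet other) {
        requireSameIndex(other);
        BitSet result = (BitSet) bits.clone();
        result.and(other.bits);
        return result.cardinality();
    }

    private void requireSameIndex(IndexedBitSet other) {
        if (other.index != index) {
            throw new IllegalArgumentException("sets must share one IdIndex");
        }
    }
}
```

Usage is then a matter of creating one IdIndex, handing it to every IndexedBitSet, and calling add(userId) as ids stream in.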

With this approach, you can easily get to 100M or even billions of elements in a set, provided you have enough RAM.

Going beyond 1B requires other techniques, usually disk-based ones. Unix join and sort, for example, can get you a long way.
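To give a rough feel for why sorting helps: once both id lists are sorted and de-duplicated, intersection becomes a single linear merge pass, which is essentially what join does on sorted files. The sketch below is my own illustration, not the tooling used for these experiments; it works on in-memory arrays that stand in for ids streamed from disk, and the class and method names are hypothetical.

```java
// Hypothetical helper class, for illustration only.
final class SortedMerge {

    /** Counts ids present in both arrays. Both must be sorted ascending with no duplicates. */
    static long intersectionCount(long[] a, long[] b) {
        int i = 0;
        int j = 0;
        long count = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;                 // a[i] cannot appear later in b
            } else if (a[i] > b[j]) {
                j++;                 // b[j] cannot appear later in a
            } else {
                count++;             // same id on both sides
                i++;
                j++;
            }
        }
        return count;
    }

    /** Counts distinct ids across both arrays, under the same preconditions. */
    static long unionCount(long[] a, long[] b) {
        return (long) a.length + b.length - intersectionCount(a, b);
    }

    public static void main(String[] args) {
        long[] a = {1L, 3L, 5L, 7L};
        long[] b = {3L, 4L, 5L, 9_000_000_000L};
        System.out.println(intersectionCount(a, b)); // 2
        System.out.println(unionCount(a, b));        // 6
    }
}
```

Because each side is read strictly front to back, the same loop works when the ids come from two sorted files instead of two arrays.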

Even though the examples that I gave were Java-based, the techniques presented here are universal. Underlying all of this is the necessity for a little profiling, a bit of understanding of the algorithms involved, and the realization that the default facilities provided are not sufficient for every case.