Billions of anything

In most programming languages an int is 32 bits wide, giving a maximum value of 4294967295 (2^32-1) if unsigned or 2147483647 (2^31-1) if signed. In the case of Java, which we use for a number of components in our infrastructure, many of the fundamental components use ints: array indexes, NIO, IO, most collections (as they are commonly based on arrays), etc. When you’re working with billions of anything, it’s easy to run into these bounds, and the result is subtle bugs that are hard to track down because the exceptions aren’t what they seem. The most common cases that we run into are due to the roll-over that occurs when you add any positive value to 2147483647: the value becomes negative (since Java’s ints are signed). Sometimes this will result in an ArrayIndexOutOfBoundsException, and sometimes it will result in a seemingly impossible call path from deep inside some java.* class.
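The roll-over is easy to reproduce in isolation. The following is a minimal sketch, not code from our infrastructure; the class and method names are made up for illustration, and Math.addExact assumes Java 8 or later.

```java
// Hypothetical demo class; names are illustrative only.
class IntOverflowDemo {

    // Plain int addition silently wraps around when the result exceeds 2^31-1.
    static int wrappingAdd(int a, int b) {
        return a + b;
    }

    // Widening one operand to long before adding keeps the true sum.
    static long wideningAdd(int a, int b) {
        return (long) a + b;
    }

    // Math.addExact (Java 8+) throws ArithmeticException instead of wrapping,
    // so it can be used to detect the overflow.
    static boolean overflows(int a, int b) {
        try {
            Math.addExact(a, b);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(wrappingAdd(Integer.MAX_VALUE, 1)); // -2147483648
        System.out.println(wideningAdd(Integer.MAX_VALUE, 1)); // 2147483648
        System.out.println(overflows(Integer.MAX_VALUE, 1));   // true
    }
}
```

A negative value produced this way and later used as an array index is one way to end up with the ArrayIndexOutOfBoundsException described above.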

I remember working on my first few i18n (internationalization) and l10n (localization) projects, where I learned the do’s and don’ts of how to write code that worked seamlessly (or at least was easy to work with) in multiple locales. Working with “big data” feels exactly like that. You have to slowly build up a set of techniques: instead of a single array, you need to keep around arrays of arrays (since each dimension is limited to 2147483647 elements); you have to know how to shard your collections so that they do not exceed the maximum allowed capacity (e.g. HashMap is limited to 1073741824 (2^30) buckets); if (value > Integer.MAX_VALUE) doesn’t do what you think it does when value is itself an int, since the check can never be true (and most of the time it’s hard to tell that that’s the code that you wrote). The list goes on.
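To make two of these techniques concrete, here is a rough sketch of an array of arrays addressed by a long index, plus the comparison pitfall. It is an illustration under assumptions, not the code we run: ChunkedLongArray, its chunkSize parameter, and exceedsIntRange are hypothetical names, and a real version would pick a large chunk size (for example 1 << 30) rather than the tiny one used in main.

```java
// Hypothetical sketch of a long-indexed array backed by an array of arrays.
class ChunkedLongArray {
    private final long[][] chunks;
    private final int chunkSize;
    private final long length;

    ChunkedLongArray(long length, int chunkSize) {
        if (length < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("length must be >= 0 and chunkSize > 0");
        }
        this.length = length;
        this.chunkSize = chunkSize;
        // Number of chunks, rounded up, computed in long arithmetic.
        long chunkCount = length / chunkSize + (length % chunkSize == 0 ? 0 : 1);
        if (chunkCount > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("too many chunks for the outer array");
        }
        chunks = new long[(int) chunkCount][];
        for (int i = 0; i < chunks.length; i++) {
            // Cast to long before multiplying so the offset itself cannot wrap.
            long remaining = length - (long) i * chunkSize;
            chunks[i] = new long[(int) Math.min(chunkSize, remaining)];
        }
    }

    long length() {
        return length;
    }

    long get(long index) {
        checkIndex(index);
        return chunks[(int) (index / chunkSize)][(int) (index % chunkSize)];
    }

    void set(long index, long value) {
        checkIndex(index);
        chunks[(int) (index / chunkSize)][(int) (index % chunkSize)] = value;
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index + ", length " + length);
        }
    }

    // Always false: an int can never be greater than Integer.MAX_VALUE.
    static boolean exceedsIntRange(int value) {
        return value > Integer.MAX_VALUE;
    }

    // With a long argument the same comparison behaves as intended.
    static boolean exceedsIntRange(long value) {
        return value > Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        ChunkedLongArray a = new ChunkedLongArray(10, 4); // chunks of 4, 4 and 2
        a.set(9, 42L);
        System.out.println(a.get(9));                                       // 42
        System.out.println(exceedsIntRange(Integer.MAX_VALUE + 1));         // false
        System.out.println(exceedsIntRange((long) Integer.MAX_VALUE + 1));  // true
    }
}
```

The same divide-and-modulo split is the basic idea behind sharding a collection: pick the shard from the high part of the key or index, then use an ordinary int-bounded structure within it.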

One interesting development was “announced” at EclipseCon: there’s talk about “big data” support in Java 9 (see, for example, “Reinhold Already Talking About Java 9”). This is something that I will keep my eye on. Unfortunately, it won’t help us for the next few years.