Hacking is Indeed a Data Science Skill

Recently I ran into a task that required me to manipulate a bunch of disparate log-level data. Ahh, the tedium of data mining! There were a few particularly annoying things about this task.


  1. The data was split into three distinct files: impression logs, activity logs, and user-attribute logs.
  2. The user-attribute logs were stored sequentially. It looked like:

    request_id, date, user_id, attribute_id

    where there were a varying number of lines for each request_id, representing how many attributes we may have been handed at that time for that user.

  3. The data was reasonably large (~500GB).

What I wanted after this was a flattened, normalized data set to use for various modeling tasks. The output format needed to be:

request_id, date, user_id, activity_id, { attribute_ids }
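To make the attribute half of that reshaping concrete, here is a minimal Python sketch (not the code that was eventually written for this task). It assumes the user-attribute log is comma-separated and that all lines for a given request_id sit next to each other, which is how I read "stored sequentially". The function name and the sample rows are made up for illustration, and the activity_id column is left out because it comes from a different file.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def flatten_attributes(lines):
    """Yield (request_id, date, user_id, [attribute_ids]) per request.

    Assumes rows for the same request_id are adjacent, so only one
    request's rows are held in memory at a time.
    """
    reader = csv.reader(lines, skipinitialspace=True)
    # groupby only merges consecutive rows that share the same key.
    for request_id, rows in groupby(reader, key=itemgetter(0)):
        rows = list(rows)
        _, date, user_id, _ = rows[0]
        yield request_id, date, user_id, [row[3] for row in rows]


# Made-up sample rows in the layout: request_id, date, user_id, attribute_id
sample = io.StringIO(
    "r1, 2000-01-01, u1, a1\n"
    "r1, 2000-01-01, u1, a7\n"
    "r2, 2000-01-01, u2, a3\n"
)
flattened = list(flatten_attributes(sample))
for record in flattened:
    print(record)
```

Because the rows are streamed, memory use depends on the largest single request rather than on the size of the file. If the lines for a request were not adjacent, the file would need to be sorted by request_id first.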

The first approach I thought of was to get the entire set of unique attributes from the file using something like “cat | cut | sort -u”, then use that to create a database table and generate a bunch of inserts. This was dumb, and it obviously gets annoying very quickly. Not to mention that my final data set would be a few hundred GB and my research instance of Postgres would get real annoyed.

How about Hadoop? While this isn’t a terrible answer, there are a few problems. Mainly, I’m under a deadline and getting 500GB to the cluster would take too long. What I really want is some Unix-foo that I can kick off and forget about. It feels like there is some “cut | join | awk” solution. These are times when I wish I had better Unix skills. Maybe emacs has a function that does this and brings you lunch (c-x-lunch)?
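For the curious, the "join" step of such a pipeline amounts to a streaming merge join. The Python sketch below is only an illustration under loud assumptions: both inputs are already sorted by request_id in plain string order, the attribute records are already grouped into one tuple per request, and the activity log has a hypothetical (request_id, activity_id) layout that the real files may not share. The function name and the sample values are made up.

```python
def merge_join(attribute_records, activity_records):
    """Inner-join two streams that are both sorted by request_id.

    attribute_records: (request_id, date, user_id, attribute_ids) tuples
    activity_records:  (request_id, activity_id) tuples (hypothetical layout)
    """
    activities = iter(activity_records)
    current = next(activities, None)
    for request_id, date, user_id, attribute_ids in attribute_records:
        # Skip activity rows whose key sorts before this request.
        while current is not None and current[0] < request_id:
            current = next(activities, None)
        # Emit one output row per matching activity row.
        while current is not None and current[0] == request_id:
            yield request_id, date, user_id, current[1], attribute_ids
            current = next(activities, None)


# Made-up sample data, sorted by request_id.
attrs = [
    ("r1", "2000-01-01", "u1", ["a1", "a7"]),
    ("r2", "2000-01-01", "u2", ["a3"]),
    ("r3", "2000-01-02", "u1", ["a1"]),
]
acts = [("r1", "act9"), ("r3", "act2"), ("r3", "act4")]
joined = list(merge_join(attrs, acts))
for row in joined:
    print(row)
```

Each input is read once from front to back, so nothing close to 500GB has to fit in memory. The catch is the sort: both files must be ordered the same way the comparison in the code orders them, and requests with no activity are dropped because this is an inner join.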

So, what did I do? Well, many definitions of data science include the technical skill of “hacking” as a necessary ingredient. One of the finer points of “hacking” has to be social engineering. It’s way easier to get the president of the bank drunk and have him tell you the combination to his lock than it is to crack the safe. So, along these lines, I came up with a plan. Most engineers pride themselves on being extremely smart (and most are) and love challenges. This can also get them into trouble, though. Next time you walk into an engineering meeting, ask an engineer what sorting algorithm Java uses and if it’s the right choice. One hour wasted!

Our CTO, Rob G., happens to be a brilliant engineer, so I called him up and casually brought up this annoying formatting problem I was having. He immediately started brainstorming solutions and ended up talking himself into Java as the fastest way (wall clock) that he could get this done. Fortunately, I’m not really a Java guy. So after Rob convinced himself that his solution was best, he also ended up talking himself into writing all the code. Awesome! Now my annoying data task was “executing” and I could go back to work on more important things. This entire conversation took about 10 minutes. Much faster than Googling around for Unix-foo. The next morning, my data set was all organized and sitting on one of our servers. Hacking is indeed a useful data science skill!

I guess the moral here is twofold. 1) Sometimes asking for help (and figuring out ways to get it!) really is the best solution, and 2) distributing workloads across your team makes everybody work faster.

P.S. Obviously wasting the CTO’s time is never a good idea. Luckily, Rob is a champion of scheduling and apparently he had a few extra cycles, so no harm done to the greater good.


  1. big AK fan says:

    great point you guys make – keep up the blogging, great to read this kind of stuff. would be fantastic to get more folks writing/more insight into these kinds of challenges/solutions

  2. Hey! Wait a minute! I see what happened here …. :)
