Using Tools You Already Have

I was reading FlowingData (a great data science blog!) a little while ago and came across a post on why programming is a good skill for data scientists to have. My first reaction was, “well, duh” – while I don’t expect everyone in the data science business to be a machine learning whiz or to rewrite the Linux kernel for fun, I would have thought that most people reading the blog had some kind of computer science background. Judging from the comments on that post, my assumption was quite wrong – Nathan definitely has a fair number of readers who really appreciated that post.

Nathan is good enough to provide aspiring readers with a list of essential tools, and a list of good starting points. Both lists are quite good, covering everything from Python to Processing, but there’s a glaring omission: shell scripting. So, in the spirit of teaching, I thought I’d share a little bit about why every data scientist should know at least a bit about bash scripting.

They’re Everywhere

The tools built in to every flavor of *nix (check out this IEEE standard) cover most of what you need to do to manipulate, manhandle, and munge data-sets. There are tools for selecting columns, sorting, de-duping, counting, pattern matching and text manipulation, joins, and more. In order, that translates into:

  • cut
  • sort
  • uniq
  • wc
  • grep
  • sed and awk
  • join

I use all these nearly every day. The best part is, once you know they exist, these tools are available on every unix machine you will ever use. Nothing else (except maybe perl) is as universal – you don’t have to worry about versions or anything. Being comfortable with these tools means you can get work done anywhere – any EC2 instance you boot up will have them, as will any unix server you ssh into.
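To make the list above concrete, here is a minimal sketch that runs each tool once against a tiny made-up data set. The file names (data.csv, teams.csv), the user names, and the column layout are all assumptions for illustration, not anything from a real log.

```shell
export LC_ALL=C   # byte-order collation so sort and join agree on ordering
tmp=$(mktemp -d)

# Made-up sample data: "user,action" and "user,team"
printf '%s\n' 'alice,login' 'bob,login' 'alice,view' 'carol,login' 'alice,logout' > "$tmp/data.csv"
printf '%s\n' 'alice,eng' 'bob,sales' 'carol,eng' > "$tmp/teams.csv"

# cut + sort + uniq: distinct values of the first column
unique_users=$(cut -d, -f1 "$tmp/data.csv" | sort | uniq)

# wc: number of rows (tr strips the padding some wc versions add)
n_rows=$(wc -l < "$tmp/data.csv" | tr -d ' ')

# grep: count rows whose action is exactly "login"
n_logins=$(grep -c ',login$' "$tmp/data.csv")

# sed: replace the comma delimiter with a space
spaced=$(sed 's/,/ /' "$tmp/data.csv")

# awk: print the users whose action is "view"
viewers=$(awk -F, '$2 == "view" { print $1 }' "$tmp/data.csv")

# join: both inputs must be sorted on the join field first
sort -t, -k1,1 "$tmp/data.csv" > "$tmp/data.sorted"
joined=$(join -t, "$tmp/data.sorted" "$tmp/teams.csv")

echo "users: $(echo $unique_users)"
echo "rows: $n_rows, logins: $n_logins, viewers: $viewers"
echo "$joined"

rm -rf "$tmp"
```

Each step stores its result in a variable only so the pieces are easy to inspect; at the prompt you would normally just run the pipelines directly.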

They’re Efficient

One of the first lessons I learned as a programmer is that there is absolutely no way I can sort data using a script faster than I could do it with sort. With a small data-set, it’ll take you longer to write print statements than it will for sort to finish, and with large data sets, I’m just glad someone else wrote N-way external merge-sort for me.

Similarly, the other commands are highly optimized, and the code has been around for years, touched by many great hands (it’s fun reading a man page and seeing “Written by Richard Stallman” at the bottom), and used by thousands and thousands of people. So, there probably aren’t that many obvious bugs left.

If you want to be a geek about it (and we do), they’re also all, with the obvious exception of sort, one-pass algorithms and O(N) or better with low memory usage. What’s not to love?
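One practical consequence of the one-pass design is worth seeing once: uniq only compares each line with the line before it, which is why it is nearly always paired with sort. A small sketch on made-up input:

```shell
# Made-up input with duplicates that are not all adjacent
input=$(printf '%s\n' a b a a b)

# uniq alone merges only adjacent repeats, leaving: a b a b
adjacent_only=$(printf '%s\n' "$input" | uniq | wc -l | tr -d ' ')

# Sorting first groups equal lines together, so uniq sees each value once
fully_deduped=$(printf '%s\n' "$input" | sort | uniq | wc -l | tr -d ' ')

echo "uniq alone: $adjacent_only lines; sort | uniq: $fully_deduped lines"
```

The sort does the expensive grouping, and uniq stays a cheap streaming pass with constant memory.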

They’re Easy to Work With

Getting used to having things already done for you also makes a data-centric workflow more efficient. The first step of almost any data project is figuring out the idiosyncrasies of a data set. Knowing shell utilities lets you very easily get a first impression of a data set, and often gets you most of the way through the process of cleaning it up.

As an example, I can quickly get a sense of how frequently users appear in our logs – let’s say the top 10 users – by just using a few commands chained together.

cut -d, -f1 data.csv | sort | uniq -c | sort -rn | head

Running the same command again with tail instead of head gives the bottom 10 users by frequency, and with another round of cut I can get the whole frequency distribution of users in the log, all without writing a single line of code.
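The frequency distribution mentioned above could be sketched like this. Note the swap: uniq -c pads its counts with leading spaces, which makes cut awkward for pulling out the count column, so this version uses awk for that step instead. The sample file is made up for illustration.

```shell
tmp=$(mktemp -d)

# Made-up sample log: alice appears 3 times, bob and carol once each
printf '%s\n' 'alice,login' 'bob,login' 'alice,view' 'carol,login' 'alice,logout' > "$tmp/data.csv"

# 1. count rows per user
# 2. keep only the counts
# 3. count how many users share each count
# 4. print as "rows_per_user number_of_users"
distribution=$(cut -d, -f1 "$tmp/data.csv" \
  | sort | uniq -c \
  | awk '{ print $1 }' \
  | sort -n | uniq -c \
  | awk '{ print $2, $1 }')

echo "$distribution"

rm -rf "$tmp"
```

Each output line reads as "this many rows per user, this many users", sorted by rows per user.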

Once you end up doing this more than once, it’s easy to save a nice little script that you can easily re-run.

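#! /bin/bash

# Print usage to stderr and exit non-zero if no input file was given
if [ -z "$1" ]; then
  echo "usage: top_users.sh input_file" >&2
  exit 1
fi

# Count rows per user (first CSV column), most frequent first
cut -d, -f1 "$1" | sort | uniq -c | sort -rn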
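If you would rather keep this in your shell profile than in a separate file, one possible variation is a small function. The name top_users and the optional second argument (how many users to show, defaulting to 10) are hypothetical additions, and the sample data is made up.

```shell
# top_users FILE [N]: hypothetical helper; shows the N most frequent
# values of the first CSV column (N defaults to 10)
top_users() {
  if [ -z "$1" ]; then
    echo "usage: top_users input_file [n]" >&2
    return 1
  fi
  cut -d, -f1 "$1" | sort | uniq -c | sort -rn | head -n "${2:-10}"
}

# Try it on a made-up file
tmp=$(mktemp -d)
printf '%s\n' 'alice,login' 'bob,login' 'alice,view' 'carol,login' 'alice,logout' > "$tmp/data.csv"

# Name of the single most frequent user
top=$(top_users "$tmp/data.csv" 1 | awk '{ print $2 }')
echo "most frequent user: $top"

rm -rf "$tmp"
```

Quoting "$1" keeps file names with spaces intact, and sort -rn compares the counts numerically rather than as text.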

EOF and Disclaimer

I’ve barely scratched the surface here, but you can already get a good sense of how easy and powerful a lot of these tools are. If you’re interested in picking them up, open a terminal and use the man pages, Wikipedia, or your favorite search engine to find out more. Once you get interested in putting all of this together, there are good bash scripting guides scattered around the web. The sky is the limit, really.
