John Trengrove

home posts about

Data Science at the Command Line

07 Mar 2015

A book I've been reading lately has been Data Science at the Command Line by Jeroen Janssens.

The book's main premise is that you can do a lot of interesting stuff with data purely by working with simple command line programs and joining them together using text streams. I like this approach as it allows you to quickly explore datasets without breaking into a full programming environment and also work on data with a vast array of useful utilities.

I thought the book was put together very well. The pace is good and it is fun to read. I created an example below to illustrate some of the concepts.

McDonald's Example

Let's suppose you are interested in McDonald's locations for research (in Australia as that is where I live). You might be interested in this data as a measure of fast-food consumption, or perhaps as a competitor trying to plan your expansion strategy.

The first step would be to acquire this data source. McDonald's Australia have a nice clean JSON file representing their stores in Australia available here https://mcdonalds.com.au/data/store.

Accessing this data can be done as follows:

curl https://mcdonalds.com.au/data/store > stores.txt

This will retreive the data in JSON format for all locations in Australia.

A new utility I learnt about in the book was jq. This command line utility will parse and extract JSON data. I use it below to search for the store state, and store title in the downloaded file.

cat stores.txt | jq '.[] | .store_state+","+.title'

This returns the following in a csv-like format:

"VIC,EASTLINK NORTHBOUND"
"WA,BROOME WA"
"QLD,MCWHIRTERS II QLD"
"QLD,KAWANA SHOPPINGWORLD II QLD"
"VIC,VICTORIA GARDENS VIC"

Another great set of utilities is csvkit. Csvkit makes it really easy covert to and work with csv files. Here I do some cleaning, add a header, and view the data with csvlook.

cat stores.txt | jq '.[] | .store_state+","+.title' | tr -d '"' | header -a "State,Store" | head | csvlook

Produces:

|--------+-------------------------------|
|  State | Store                         |
|--------+-------------------------------|
|  VIC   | EASTLINK NORTHBOUND           |
|  WA    | BROOME WA                     |
|  QLD   | MCWHIRTERS II QLD             |
|  QLD   | KAWANA SHOPPINGWORLD II QLD   |
|  VIC   | VICTORIA GARDENS VIC          |
|  QLD   | CLIFFORD GARDENS QLD          |
|  NSW   | ERINA FAIR II NSW             |
|  NSW   | NEETA CITY II NSW             |
|  NSW   | BONDI JUNCTION WESTFIELD NSW  |
|--------+-------------------------------|

Nothing great to look at but pretty good for a few lines of code. Csvkit even allows you to run SQL queries on the data. Below I am running a query to group the stores by state.

cat stores.txt | jq '.[] | .store_state+","+.title' | tr -d '"' | header -a "State,Store" | csvsql --query 'SELECT State,count(*) AS Count FROM stdin group by State order by Count desc' | csvlook

Result:

|----------+--------|
|  State   | Count  |
|----------+--------|
|  NSW     | 316    |
|  VIC     | 248    |
|  QLD     | 218    |
|  WA      | 78     |
|  SA      | 52     |
|  TAS     | 17     |
|  NT      | 8      |
|  ACT     | 6      |
|  Wadalba | 1      |
|----------+--------|

It seems like the Wadalba McDonald's store is going for independence! I did check the source data and it matches. Data quality is always an issue but props to McDonald's on providing their data in a decent format. It saves time scraping (scraping is always possible) and makes it easier to integrate into other applications.