Software! Math! Data! The blog of R. Sean Bowman
June 22 2015

Recently I saw a blurb for Drake, a tool similar to Make but made for munging data. This seems to be a popular topic these days; I was also reading a post about using Rake for similar things. I’d like to implement an example from Drake using Redo, which I’ve talked about some before. Drake looks like a very cool tool, and my intent is not to denigrate it, but rather to show that Redo can handle similar tasks.

Short Aside on Data Science

I’ll leave my opinions on “data science” and so forth for another day, but let me just say a few things. I love data. I love playing with it, graphing it, converting it to different formats, looking at it from a probability/statistical point of view, applying machine learning techniques to try to understand it better. Most of all, I love gaining insight by careful analysis, and turning data into actionable information whether for my own enjoyment or for others. If that’s what data science is, count me in.

The example

I’d like to take a look at the Drake "human-resources" demo. This is a great demo because it shows off lots of the problems we face when we manipulate data: we have to filter some of it out, collect and join together several sources, add entirely new features, and synthesize new attributes from old ones. These are all things data science folks deal with all the time, so being able to do them reliably, quickly, and easily is important.

The build system Redo can handle all these things quite well. For the impatient, you can check out my github repo containing the code described below. If you’re still here, let’s see what the Drake human-resources demo looks like using Redo. First, we have two data files: everyone is a CSV file with a column of first and last names and a column of phone numbers. The CSV file skills lists names (a subset of the names in everyone) and a column containing skills these folks have.
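
To make the join below concrete, here is roughly the shape of the two input files (these rows are invented purely for illustration; the real data lives in the Drake demo):

everyone:

Alyssa Hacker,217-555-0134
Ben Bitdiddle,310-555-0101

skills:

Ben Bitdiddle,hardware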

Our first goal is to make a CSV file with three columns: name, skills, and phone number, keeping only people whose phone numbers contain the prefix 310. Here is the file that accomplishes that, people.skills.csv.do.

# rebuild if either input file changes
redo-ifchange everyone skills
# keep only people whose phone numbers contain 310, sorted for join
grep 310 everyone | sort > $2.tmp
# join on the name column; the output goes to stdout, which redo saves as the target
join -t, skills $2.tmp
rm $2.tmp
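
If the $2 above looks mysterious: Redo runs a .do script with the target name, the target with its extension stripped (the details vary slightly between redo implementations), and a temporary output file as its three arguments, and whatever the script prints to stdout is saved as the target. A throwaway .do file like this one (not part of the repo, just for poking around) makes it easy to see for yourself:

# args.txt.do -- a scratch target that just records redo's arguments
echo "\$1 (target):    $1"
echo "\$2 (basename):  $2"
echo "\$3 (temp file): $3"

Running redo args.txt leaves those three values in args.txt, since redo captures the script’s stdout.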

After running redo people.skills.csv, we have a file people.skills.csv containing the appropriate information. Now we read this CSV file and transform it into a JSON file with the above fields plus a UUID field. For this task, I used the jq JSON tool, a very cool DSL for transforming JSON (and much more). This is the first time I’ve used it, though, so the code could probably be made a lot better… Anyhow, here’s people.json.do.

redo-ifchange people.skills.csv
while read p; do
  echo "$p,$(uuidgen)"
done < people.skills.csv | \
jq --slurp --raw-input --raw-output \
    'split("\n") | map(split(",")) |
    map({"name": .[0],
          "skills": .[1],
          "tel": .[2],
          "uuid": .[3]})'

Note that we declare a dependency on the file people.skills.csv created previously. Then we read this file line by line, adding a unique ID generated by the uuidgen command. These rows are piped into jq, which constructs a list of dictionaries out of them. Now we have a file people.json. Cool!
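
For readers as new to jq as I am, here is a rough sketch of what each stage of that filter sees (the sample line is invented just to show the shape of the data):

# --raw-input with --slurp hands jq the entire input as one big string, e.g.
#   "Ben Bitdiddle,hardware,310-555-0101,9b2f...\n..."
# split("\n")      -> an array with one string per line
# map(split(","))  -> an array of [name, skills, tel, uuid] field arrays
# the final map    -> an object per person with name, skills, tel, and uuid keys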

The next task is slightly more interesting: we need to create two reports, one containing the names in people.json for which the first name is longer than the last, and one for which the last name is longer than the first. For this, I’ve used Python. (I have an implementation using jq as well, but this one is easier for me to understand. It also illustrates using other languages in .do files.) Building with multiple targets can be slightly trickier than building just one, but in this case it’s not bad at all. Here is name_length_reports.do.

#!/usr/bin/env python
import json, subprocess
subprocess.call("redo-ifchange people.json", shell=True)

with open("people.json", "r") as f:
    data = json.load(f)

# report file names (any names will do; adjust to taste)
f_fname = "first_longer.txt"
l_fname = "last_longer.txt"

with open(f_fname, "w") as f_file, open(l_fname, "w") as l_file:
    for row in data:
        full_name = row["name"]
        fname, lname = full_name.split()
        if len(fname) > len(lname):
            f_file.write(full_name + "\n")
        if len(lname) > len(fname):
            l_file.write(full_name + "\n")

We need the “shebang” on the first line to tell Redo that the file is in Python. Using Python’s shell facilities, we declare our dependence on people.json. Then we open the two report files and write each name from people.json to the appropriate one according to the criteria above.

The last report to generate is a CSV file containing the information above together with a “suggested username” consisting of the person’s first initial and last name. Again, I’ve implemented it in Python; here is for_HR.csv.do.

#!/usr/bin/env python
import sys, subprocess, json, csv
subprocess.call("redo-ifchange people.json", shell=True)
with open("people.json", "r") as f:
    reader = json.load(f)
    writer = csv.writer(sys.stdout)  # redo captures stdout as for_HR.csv
    for row in reader:
        first, last = row["name"].split()
        uname = first[0] + last
        writer.writerow([uname] + list(row.values()))

That’s it! We can add a default.do file so that if we simply run redo in this directory, the appropriate files are generated:

redo-ifchange people.skills.csv people.json name_length_reports \
    for_HR.csv
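
With that in place, a typical session looks something like this (roughly what I’d expect, anyway):

redo             # builds people.skills.csv, people.json, the reports, and for_HR.csv
redo             # nothing has changed, so nothing should be rebuilt
vi everyone      # change a phone number, say...
redo             # ...and everything downstream of everyone is rebuilt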

I also added a clean.do script that removes all the created files and so forth, mainly for testing purposes.
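
That script isn’t shown here, but a minimal version might look something like this (assuming the report file names used in name_length_reports.do above):

# clean.do -- remove everything the other .do files generate
rm -f people.skills.csv people.json for_HR.csv \
    first_longer.txt last_longer.txt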

Check out the code on github.
