Increasingly Functional.
by Joshua Miller | on Twitter | on the web | on github

Derived Columns With Julia DataFrames

December 27th 2013

Tagged: julia

Recently I've been digging into Julia for technical and statistical computing. It's still in its early stages and things break frequently, but I find it a lot more amenable to my style of thinking than the more established technical languages like Matlab/Octave or R.

I wanted to take the full hbg-crime.org dataset and see which neighborhoods experienced crimes at which hours. The full CSV set has timestamps and classification by neighborhood already, so we just need to convert the timestamp into an hour of the day (0 through 23) and then group them by neighborhood.

To read and manipulate the data, we're going to use DataFrames — it gives us the NA type for missing values (not uncommon in the dataset while I'm still tuning the geolocation) and nice functions for dealing with columns directly. For the timestamps, we're going to use Datetime.

using DataFrames
using Datetime

Let's read the data in:

julia> data = readtable("reports.csv")
DataFrame with 868 rows, 7 columns
Columns:

Start             306 non-null values    
End               868 non-null values    
Description       868 non-null values    
Address           868 non-null values    
Lat               841 non-null values    
Lon               841 non-null values    
Neighborhood      802 non-null values    

Then let's add a function that will map a column of timestamps into a column of integer hours:

formatter = "yyyy-MM-ddTHH:mm:ss"

function eachHour(m)
    map(h -> Datetime.hour(Datetime.datetime(formatter, h)), m)
end

To create a new table with a column called "Hour" with the results of that function in it, we're going to use the @transform macro:

withHours = @transform(data, Hour => eachHour(End))

Now to group the results down we're going to use the by function from DataFrames' Split-Apply-Combine:

results = by(withHours, ["Neighborhood", "Hour"], nrow)

We're just passing by a source, a list of columns we want to split on, and a function to apply to their combination (nrow, which just counts them). And the results are just what we wanted:

julia> results
119x3 DataFrame:
            Neighborhood hr x1
[1,]                  NA  0  4
[2,]      "allison-hill"  0 18
[3,]          "downtown"  0  5
[4,]           "midtown"  0 10
[5,]            "uptown"  0 13
[6,]                  NA  1  3
[7,]      "allison-hill"  1 20
[8,]          "downtown"  1  9
[9,]           "midtown"  1  1
[10,]           "uptown"  1  5
[11,]                 NA  2  4
[12,]     "allison-hill"  2 24
[13,]         "downtown"  2  7
[14,]          "midtown"  2  6
[15,]           "uptown"  2  4
[16,]                 NA  3  2
[17,]     "allison-hill"  3 13
[18,]         "downtown"  3  1
[19,]          "midtown"  3  6
[20,]           "uptown"  3  5
  :
[100,]          "uptown" 19  7
[101,]    "allison-hill" 20 22
[102,]        "downtown" 20  1
[103,]         "midtown" 20  1
[104,]          "uptown" 20  7
[105,]                NA 21  3
[106,]    "allison-hill" 21 20
[107,]        "downtown" 21  2
[108,]         "midtown" 21  1
[109,]          "uptown" 21  6
[110,]                NA 22  5
[111,]    "allison-hill" 22 12
[112,]        "downtown" 22  4
[113,]         "midtown" 22  6
[114,]          "uptown" 22  7
[115,]                NA 23  1
[116,]    "allison-hill" 23 17
[117,]        "downtown" 23  2
[118,]         "midtown" 23  5
[119,]          "uptown" 23  8