k-Nearest Neighbor Classification for Harrisburg Crimes

Update:

The general form of this code remains correct, but several things have changed with Julia 0.3+ and DataFrames 0.5+; most notably that DataFrames are now indexed by symbols instead of strings and that the fancy DataFrames metaprogramming functions have been pulled out into their own very-alpha package. Updated code is available on this experiment’s github repo.

One thing I’ve wanted to be able to do with the hbg-crime.org data set is to start emulating those IBM “Smarter Planet” commercials where someone with an English accent tells you they can help police departments predict where crime happens. Two months’ worth of data (give or take) is probably not going to give us the confidence we would want to dispatch squad cars, but it’s a fun experiment.

I’m going to use the k-Nearest Neighbors algorithm to take a given location, time of day, and day of the week, pick which crimes most closely match those criteria, and pick a winner from those crimes. The ‘k’ is the number of neighbors we’re going to pick from the sorted list. If you want to go over how it works, I found that the Data Mining Map had the clearest explanation of k-Nearest Neighbor for me at least.

Let’s start by loading up the libraries we need and importing the data:

using DataFrames
using Datetime

reports = readtable("reports.csv")

We don’t need the start time or neighborhood, and let’s only work with data where we’ve successfully geocoded the report:

delete!(reports, ["Start", "Neighborhood"])
complete_cases!(reports)

Since we are going to be comparing distances between columns with different units (degrees of latitude/longitude; day of the week; time), we need to standardize each column on a 0.0-1.0 scale.

function standardize(x::Float64, max, min)
    (x - min) / (max - min)
end

function standardize(m::DataArray{Float64,1}, max, min)
    (m .- min) ./ (max - min)
end

Note that we have two versions of the standardize function, one that applies to just a scalar value, and one that works on a DataArray, the underlying structure in a DataFrame. When we call standardize on a vector, it’ll use Julia’s vectorized arithmetic and apply to the whole vector. Neat.

So we need the max and min latitudes and longitudes for the standardize function:

max_lat = maximum(reports["Lat"])
min_lat = minimum(reports["Lat"])
max_lon = maximum(reports["Lon"])
min_lon = minimum(reports["Lon"])

If we hadn’t done complete_cases! earlier, we would need to do maximum(removeNA(reports["Lat"])) or we’d have just gotten NA (the null result) back.

Now we’ll create standard latitude and longitude columns:

@transform(reports, StandardLat => standardize(Lat, max_lat, min_lat))
@transform(reports, StandardLon => standardize(Lon, max_lon, min_lon))

Check the results:

julia> reports["StandardLat"]
1130-element DataArray{Float64,1}:
194104
170785
279784
103708
213908
188516
163927
 ⋮       
153612
165426
24092 
159025

That’s what we wanted.

So next we’ll convert the column of strings representing report dates and times into a column of DateTimes:

function parsedate(string::String)
    Datetime.datetime("yyyy-MM-ddTHH:mm:ss", string)
end
@vectorize_1arg String parsedate
@transform(reports, End => parsedate(End))

The @vectorize_1arg macro is pretty cool — it takes any function of one argument and a type to operate on and creates a new function that takes a vector of that type and applies it to each element in the vector.

Next we need to do the same standardization process. Rather than go 0.0 for midnight to 1.0 for the next midnight, making 11:59 pm and 12:01 am as far apart from each other as possible, what I did is standardize them on “distance from noon,” so that noon is 0.0, and 11:59 pm and 12:01 am are both about 1.0.

function standardize_minutes(d::DateTime)
    total_min = (Datetime.hour(d) * 60) + Datetime.minutes(d)
    dist_to_noon = abs(720 - total_min)
    dist_to_noon / 720
end
@vectorize_1arg Any standardize_minutes
@transform(reports, StandardMin => standardize_minutes(End))

Only one more column to standardize — day of the week:

function standardize_day(d::DateTime)
    (Datetime.dayofweek(d) - 1) / 6
end
@vectorize_1arg Any standardize_day
@transform(reports, StandardDay => standardize_day(End))

One thing to notice in this and the previous function is that I had to use Any as the type for @vectorize_1arg. I believe this is because the column in the DataFrame is untyped, despite having nothing but DateTimes in it. I think I should be able to set a type on a column, but no luck so far.

[UPDATE: This will be fixed in a future version of Julia, see this gist]

So the function for computing the Euclidean distance between one point and another is simple enough:

function eucl_dist(lat1, lon1, min1, day1, lat2, lon2, min2, day2)
    (lat2 - lat1)^2 +
    (lon2 - lon2)^2 +
    (min2 - min1)^2 +
    (day2 - day1)^2
end

I’d like to express that generally in terms of two vectors in such a way that it will work in the following function but couldn’t quite wrap my head around it. Speaking of which, the following function is the last funtion — take a current latitude, longitude, time, and k (i.e., how many neighbors to compare), and return an estimate for the likeliest crime to occur:

function nearest(lat, lon, time, k)
    std_lat = standardize(lat, max_lat, min_lat)
    std_lon = standardize(lon, max_lon, min_lon)
    std_min = standardize_minutes(time)
    std_day = standardize_day(time)
    reports["Distances"] = [eucl_dist(std_lat, std_lon, std_min, std_day, lat, lon, min, day) for (lat,lon,min,day)=zip(reports["StandardLat"], reports["StandardLon"], reports["StandardMin"], reports["StandardDay"])]
    sorted = sortby(reports, "Distances")
    last(sortby(by(sorted[1:k, ["Description"]], "Description", nrow), "x1")[:1])
end

First we standardize the location and time arguments on the same scale as the data. Then we create a column “Distances” in the DataFrame where each element is the distance between our arguments and that row. To do that, I’m using a list comprehension where the arguments are a zipped up 4-by-N matrix of the standardized columns. I’d like to get the eucl_dist function to take vectors in such a way that I could @transform the DataFrame like the standardization functions above but haven’t figured that out yet.

Last, we sort the DataFrame by distances, take the closest k rows, group by “Description” and count, then take the winner, all in one run-on line of code.

The results:

julia> println(nearest(40.2821445, -76.8804254, Datetime.now(), 7))
BURGLARY-ENTERS STRUCTURE W/NO PERSON PRESENT

Lock your doors!