Update:
The general form of this code remains correct, but several things have changed with Julia 0.3+ and DataFrames 0.5+. Most notably, DataFrames are now indexed by symbols instead of strings, and the fancy DataFrames metaprogramming functions have been pulled out into their own very-alpha package. Updated code is available on this experiment’s GitHub repo.
One thing I’ve wanted to be able to do with the hbg-crime.org data set is to start emulating those IBM “Smarter Planet” commercials where someone with an English accent tells you they can help police departments predict where crime happens. Two months’ worth of data (give or take) is probably not going to give us the confidence we would want to dispatch squad cars, but it’s a fun experiment.
I’m going to use the k-Nearest Neighbors algorithm to take a given location, time of day, and day of the week, find the crimes that most closely match those criteria, and pick a winner from those crimes. The ‘k’ is the number of neighbors we’re going to pick from the sorted list. If you want to go over how it works, the Data Mining Map had the clearest explanation of k-Nearest Neighbors that I found.
Let’s start by loading up the libraries we need and importing the data:
We don’t need the start time or neighborhood, and let’s only work with data where we’ve successfully geocoded the report:
Since we are going to be comparing distances between columns with different units (degrees of latitude/longitude; day of the week; time), we need to standardize each column on a 0.0-1.0 scale.
Note that we have two versions of the standardize function, one that applies to just a scalar value, and one that works on a DataArray, the underlying structure in a DataFrame. When we call standardize on a vector, it’ll use Julia’s vectorized arithmetic and apply to the whole vector. Neat.
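As a rough illustration of the 0.0-1.0 scaling described above, here is a minimal sketch in current Julia syntax. It is not the post's original code: it assumes plain min-max scaling, uses one scalar method plus dot-broadcasting instead of separate scalar and DataArray methods, and the latitude values are made up.

```julia
# Min-max scaling: map a value from the range [lo, hi] onto [0.0, 1.0].
standardize(x::Real, lo::Real, hi::Real) = (x - lo) / (hi - lo)

# Made-up latitudes, used only for illustration.
lats = [40.25, 40.27, 40.30]
lo, hi = extrema(lats)

# The dot applies the scalar method to every element of the vector.
standard_lats = standardize.(lats, lo, hi)
```

With this scaling the smallest value in a column maps to 0.0 and the largest to 1.0, so every column contributes on the same scale when distances are computed later.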
So we need the max and min latitudes and longitudes for the standardize function:
If we hadn’t done complete_cases! earlier, we would need to do maximum(removeNA(reports["Lat"])) or we’d have just gotten NA (the null result) back.
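The same propagation behavior exists in current Julia releases, where absent values are written as missing rather than NA. A small sketch with made-up values, using only Base functions:

```julia
# Made-up values; `missing` stands in for a report that could not be geocoded.
lats = [40.26, missing, 40.29]

# A plain maximum propagates the missing value...
propagated = maximum(lats)

# ...while skipmissing ignores it, much like removeNA did.
max_lat = maximum(skipmissing(lats))
```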
Now we’ll create standard latitude and longitude columns:
Check the results:
julia> reports["StandardLat"]
1130-element DataArray{Float64,1}:
0.194104
0.170785
0.279784
0.103708
0.213908
0.188516
0.163927
⋮
0.153612
0.165426
0.24092
0.159025
That’s what we wanted.
So next we’ll convert the column of strings representing report dates and times into a column of DateTimes:
The @vectorize_1arg macro is pretty cool: it takes any function of one argument and a type to operate on and creates a new function that takes a vector of that type and applies it to each element in the vector.
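In current Julia releases the @vectorize_1arg macro is gone, and dot-broadcasting does the same job for any one-argument function. The sketch below assumes Julia 1.x with the Dates standard library. The timestamp format and the helper name parse_report_time are assumptions for illustration, since the real layout of the data set's date column is not shown here.

```julia
using Dates  # standard library

# Assumed timestamp layout; the real data set may use a different one.
report_format = dateformat"yyyy-mm-dd HH:MM:SS"

# Hypothetical helper: parse one timestamp string into a DateTime.
parse_report_time(s::AbstractString) = DateTime(s, report_format)

raw = ["2014-01-15 23:59:00", "2014-01-16 00:01:00"]

# The trailing dot broadcasts the one-argument function over the vector.
times = parse_report_time.(raw)
```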
Next we need to do the same standardization process. Rather than go 0.0 for midnight to 1.0 for the next midnight, making 11:59 pm and 12:01 am as far apart from each other as possible, what I did is standardize them on “distance from noon,” so that noon is 0.0, and 11:59 pm and 12:01 am are both about 1.0.
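A minimal sketch of the “distance from noon” idea, assuming Julia 1.x and the Dates standard library. The function name noon_distance is hypothetical and not taken from the original code.

```julia
using Dates  # standard library

# 0.0 at noon, approaching 1.0 at midnight from either side.
function noon_distance(t::DateTime)
    hours = hour(t) + minute(t) / 60 + second(t) / 3600
    return abs(hours - 12) / 12
end

late  = noon_distance(DateTime(2014, 1, 15, 23, 59))  # 11:59 pm
early = noon_distance(DateTime(2014, 1, 16, 0, 1))    # 12:01 am
```

One side effect of this choice is that times equally far from noon collapse together: 6 am and 6 pm both come out as 0.5.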
Only one more column to standardize — day of the week:
One thing to notice in this and the previous function is that I had to use Any as the type for @vectorize_1arg. I believe this is because the column in the DataFrame is untyped, despite having nothing but DateTimes in it. I think I should be able to set a type on a column, but no luck so far.
[UPDATE: This will be fixed in a future version of Julia, see this gist]
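For the day-of-the-week column mentioned above, one possible scaling is sketched here, assuming Julia 1.x and the Dates standard library. The function name standard_day and the choice of Monday as 0.0 are assumptions, and the original function may well have worked differently.

```julia
using Dates  # standard library

# Dates.dayofweek returns 1 for Monday through 7 for Sunday,
# so this maps Monday to 0.0 and Sunday to 1.0.
standard_day(t::DateTime) = (dayofweek(t) - 1) / 6

monday = standard_day(DateTime(2014, 1, 13))
sunday = standard_day(DateTime(2014, 1, 19))
```

Note that a linear scale like this puts Sunday and Monday as far apart as possible, the same wraparound problem that the distance-from-noon trick avoids for times.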
So the function for computing the Euclidean distance between one point and another is simple enough:
I’d like to express that generally in terms of two vectors, in such a way that it will work in the following function, but couldn’t quite wrap my head around it. Speaking of which, the following function is the last one: it takes a current latitude, longitude, time, and k (i.e., how many neighbors to compare), and returns an estimate for the likeliest crime to occur:
First we standardize the location and time arguments on the same scale as the data. Then we create a column “Distances” in the DataFrame where each element is the distance between our arguments and that row. To do that, I’m using a list comprehension where the arguments are a zipped up 4-by-N matrix of the standardized columns. I’d like to get the eucl_dist function to take vectors in such a way that I could @transform the DataFrame like the standardization functions above but haven’t figured that out yet.
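One way the vector form of the distance might look in current Julia is sketched below. It is not the original eucl_dist: the signature is an assumption, and the column values are made up. It uses plain vectors in place of DataFrame columns, so it does not address the @transform question.

```julia
# Euclidean distance between two equal-length vectors.
function eucl_dist(a::AbstractVector, b::AbstractVector)
    length(a) == length(b) || throw(DimensionMismatch("vectors must have equal length"))
    return sqrt(sum(abs2, a .- b))
end

# Hypothetical standardized columns: latitude, longitude, time, day.
lats  = [0.19, 0.17, 0.28]
lons  = [0.40, 0.55, 0.31]
times = [0.90, 0.10, 0.50]
days  = [0.00, 0.50, 1.00]

# The standardized query point, in the same column order.
query = [0.20, 0.45, 0.80, 0.00]

# One distance per row, built with a comprehension over the zipped columns.
distances = [eucl_dist(query, [a, b, c, d]) for (a, b, c, d) in zip(lats, lons, times, days)]
```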
Last, we sort the DataFrame by distances, take the closest k rows, group by “Description” and count, then take the winner, all in one run-on line of code.
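The sort, take-k, count, and pick-the-winner steps can be sketched without a DataFrame, using only Base functions in Julia 1.x. The function name knn_vote and the sample distances and labels are hypothetical.

```julia
# Return the most common label among the k smallest distances.
function knn_vote(distances::AbstractVector{<:Real}, labels::AbstractVector, k::Integer)
    # Indices of the k nearest rows, without sorting the whole vector.
    nearest_idx = partialsortperm(distances, 1:min(k, length(distances)))

    # Tally how often each label appears among those neighbors.
    counts = Dict{eltype(labels),Int}()
    for i in nearest_idx
        counts[labels[i]] = get(counts, labels[i], 0) + 1
    end

    # findmax on a Dict returns (largest value, its key).
    return findmax(counts)[2]
end

# Made-up distances and labels, one per report.
distances = [0.12, 0.40, 0.05, 0.33, 0.08]
labels    = ["THEFT", "ASSAULT", "BURGLARY", "THEFT", "BURGLARY"]

winner = knn_vote(distances, labels, 3)
```

In this sketch ties between labels are broken arbitrarily, so an odd k or an explicit tie-breaking rule may be worth considering.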
The results:
julia> println(nearest(40.2821445, -76.8804254, Datetime.now(), 7))
BURGLARY-ENTERS STRUCTURE W/NO PERSON PRESENT
Lock your doors!