For whatever unknowable reason, the Harrisburg PD publishes its crime blotter in PDF format. It’s simple monospaced text with no images or extra formatting, but it comes as a PDF that has to be stripped in order to parse its contents. Here’s how I did that with Clojure and PDFBox.
First, add PDFBox to your project.clj:
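A dependency entry along these lines should work. The project name and the exact version number are assumptions for illustration; what matters is that it is a 1.8.x release, since that is the line of PDFBox where PDDocument/load accepts a URL directly.

```clojure
;; project.clj
;; "blotter" and the version "1.8.16" are placeholders, not taken from the
;; original project; any PDFBox 1.8.x release should behave the same here.
(defproject blotter "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.10.3"]
                 [org.apache.pdfbox/pdfbox "1.8.16"]])
```

Run lein deps afterwards to pull the jar down.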
Then import the relevant classes:
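A minimal sketch of the namespace declaration, assuming PDFBox 1.8.x, where PDFTextStripper lives in org.apache.pdfbox.util (later major versions moved it to org.apache.pdfbox.text). The namespace name blotter.core is a placeholder.

```clojure
(ns blotter.core
  ;; PDDocument represents the loaded PDF; PDFTextStripper extracts its text.
  (:import [org.apache.pdfbox.pdmodel PDDocument]
           [org.apache.pdfbox.util PDFTextStripper]
           [java.net URL]))
```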
The PDDocument/load method takes a pretty wide variety of inputs, like String filenames or InputStreams. Here, I’m going to pass it a java.net.URL, then use PDFTextStripper to pull the text of the PDF out:
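Roughly, that looks like the sketch below. It assumes the PDFBox 1.8.x API; the function name text-of-pdf and the example URL are made up for illustration, so substitute the real blotter address.

```clojure
;; Imports repeated so this snippet can be pasted into a REPL on its own.
(import '[org.apache.pdfbox.pdmodel PDDocument]
        '[org.apache.pdfbox.util PDFTextStripper]
        '[java.net URL])

(defn text-of-pdf
  "Fetch the PDF at the given URL string and return its text as one string.
  text-of-pdf is a hypothetical helper name, not part of PDFBox."
  [url]
  ;; with-open closes the document even if text extraction throws.
  (with-open [doc (PDDocument/load (URL. url))]
    (.getText (PDFTextStripper.) doc)))

(comment
  ;; Hypothetical URL; point this at the actual PDF.
  (println (text-of-pdf "http://example.com/blotter.pdf")))
```

The result is a plain string, so from here it can be split on newlines and parsed like any other text.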
That’s it.