For whatever unknowable reason, the Harrisburg PD publishes its crime blotter in PDF format. It’s simple monospaced text with no images or extra formatting, but because it comes as a PDF, the text has to be extracted before it can be parsed. Here’s how I did that with Clojure and PDFBox.

First, add PDFBox to your project.clj:
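A minimal sketch of the dependency entry. The project name `blotter` and both version numbers are assumptions, not taken from the original project. I'm assuming a PDFBox 1.8.x release because that is the line where `PDDocument/load` accepts a URL; the 2.x releases dropped that overload.

```clojure
;; project.clj
;; "blotter" and the version numbers are placeholders; use whatever fits your project.
(defproject blotter "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.10.3"]
                 [org.apache.pdfbox/pdfbox "1.8.16"]])
```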

Then import the relevant classes:
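Roughly like this, assuming PDFBox 1.8.x, where `PDFTextStripper` lives in `org.apache.pdfbox.util` (2.x moved it to `org.apache.pdfbox.text`). The namespace name `blotter.core` is made up for the example.

```clojure
(ns blotter.core ; hypothetical namespace name
  (:import (org.apache.pdfbox.pdmodel PDDocument)
           (org.apache.pdfbox.util PDFTextStripper) ; org.apache.pdfbox.text in PDFBox 2.x
           (java.net URL)))
```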

The PDDocument/load method takes a pretty wide variety of inputs, like String filenames or InputStreams. Here, I’m going to pass it a java.net.URL, then use PDFTextStripper to pull the text of the PDF out:
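Here's a sketch of that step, again assuming PDFBox 1.8.x. The function name `pdf-text` is my own invention for this example, and the URL in the usage comment is a placeholder, not the real blotter address.

```clojure
;; Imports included so the snippet can be pasted into a REPL as-is.
(import '(org.apache.pdfbox.pdmodel PDDocument)
        '(org.apache.pdfbox.util PDFTextStripper)
        '(java.net URL))

(defn pdf-text
  "Returns all of the text in the PDF at url-string as a single String.
  pdf-text is a hypothetical helper name, not part of PDFBox."
  [url-string]
  ;; with-open closes the PDDocument even if text extraction throws.
  (with-open [doc (PDDocument/load (URL. url-string))]
    (.getText (PDFTextStripper.) doc)))

(comment
  ;; Placeholder URL; substitute the address of the PDF you want to read.
  (println (pdf-text "http://example.com/blotter.pdf")))
```

Wrapping the document in `with-open` matters more than it looks: a `PDDocument` holds onto resources until it is closed, so it's worth closing it even for a one-off script.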

That’s it.