For whatever unknowable reason, the Harrisburg PD publishes its crime blotter in PDF format. It’s simple monospaced text with no images or extra formatting, but it comes as a PDF that has to be stripped in order to parse its contents. Here’s how I did that with Clojure and PDFBox.
First, add PDFBox to your project's dependencies:
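Assuming a Leiningen project, the PDFBox dependency would be an entry like this in the :dependencies vector of project.clj. Both the build tool and the exact version number are assumptions; the rest of this post relies on the 1.8.x API:

```clojure
;; Goes inside the :dependencies vector in project.clj.
;; The version is an assumed 1.8.x release, not necessarily the one originally used.
[org.apache.pdfbox/pdfbox "1.8.16"]
```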
Then import the relevant classes:
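As a sketch, the imports could look like the following in a namespace declaration. The namespace name blotter.core is made up for illustration, and the package for PDFTextStripper assumes PDFBox 1.8.x (in 2.x it moved to org.apache.pdfbox.text):

```clojure
(ns blotter.core                                   ; hypothetical namespace name
  (:import [org.apache.pdfbox.pdmodel PDDocument]  ; loads and represents the PDF
           [org.apache.pdfbox.util PDFTextStripper])) ; extracts plain text (1.8.x package)
```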
PDFBox's PDDocument/load takes a pretty wide variety of inputs, like String filenames or InputStreams. Here, I'm going to pass it a java.net.URL, then use PDFTextStripper to pull the text of the PDF out:
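A minimal sketch of that step, assuming PDFBox 1.8.x, where PDDocument/load still accepts a java.net.URL. The function name text-of-pdf and the example URL are placeholders rather than anything from the original code:

```clojure
(import '[org.apache.pdfbox.pdmodel PDDocument]
        '[org.apache.pdfbox.util PDFTextStripper])

(defn text-of-pdf
  "Loads the PDF at `url` and returns all of its text as one string.
  (Hypothetical helper name; assumes the PDFBox 1.8.x API.)"
  [^java.net.URL url]
  ;; with-open closes the document even if text extraction throws
  (with-open [doc (PDDocument/load url)]
    (.getText (PDFTextStripper.) doc)))

(comment
  ;; Placeholder URL; substitute the real location of the blotter PDF.
  (text-of-pdf (java.net.URL. "http://example.com/blotter.pdf")))
```

Since the blotter is plain monospaced text, the returned string can then be split into lines with clojure.string/split-lines and parsed from there.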