For whatever unknowable reason, the Harrisburg PD publishes its crime blotter in PDF format. It’s simple monospaced text with no images or extra formatting, but it comes as a PDF that has to be stripped in order to parse its contents. Here’s how I did that with Clojure and PDFBox.
First, add PDFBox to your project.clj:
[org.apache.pdfbox/pdfbox "1.8.2"]
Then import the relevant classes:
(:import [java.net URL]
         [org.apache.pdfbox.pdmodel PDDocument]
         [org.apache.pdfbox.util PDFTextStripper])
The PDDocument/load method takes a pretty wide variety of inputs, like String filenames or InputStreams. Here, I’m going to pass it a java.net.URL (hence the extra java.net import), then use PDFTextStripper to pull the text of the PDF out:
(defn text-of-pdf
  [url]
  (with-open [pd (PDDocument/load (URL. url))]
    (let [stripper (PDFTextStripper.)]
      (.getText stripper pd))))
That’s it.
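From there, parsing usually starts by breaking the extracted text into lines. Here is a minimal sketch of that step, assuming the blotter is line-oriented and that column positions in the monospaced layout matter (so only trailing whitespace is trimmed). The name `pdf-lines` is made up for illustration and is not part of PDFBox, and the URL in the usage comment is a placeholder, not the real blotter address.

```clojure
(require '[clojure.string :as str])

(defn pdf-lines
  "Splits extracted PDF text into non-blank lines.
  Hypothetical helper for illustration; not part of PDFBox."
  [text]
  (->> (str/split-lines text)   ; split on line breaks (LF or CRLF)
       (map str/trimr)          ; trim trailing whitespace only, keeping column offsets
       (remove str/blank?)))    ; drop empty lines

;; Usage, with a placeholder URL (an assumption):
;; (pdf-lines (text-of-pdf "http://example.com/blotter.pdf"))
```

Trimming only on the right keeps the leading indentation intact, which may be useful if fields are later pulled out by fixed column position.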