For whatever unknowable reason, the Harrisburg PD publishes its crime blotter in PDF format. It’s simple monospaced text with no images or extra formatting, but it comes as a PDF that has to be stripped in order to parse its contents. Here’s how I did that with Clojure and PDFBox.
First, add PDFBox to your project.clj:
[org.apache.pdfbox/pdfbox "1.8.2"]
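For context, here is roughly where that dependency vector sits in a Leiningen project file. This is only a sketch: the project name, version string, and Clojure version are placeholders I made up, not details of the original project.

```clojure
;; Hypothetical project.clj. The name "blotter", its version, and the
;; Clojure version are assumptions; only the PDFBox line is from the post.
(defproject blotter "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [org.apache.pdfbox/pdfbox "1.8.2"]])
```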
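Then import the relevant classes in your ns declaration (java.net.URL is needed for the URL constructor used in the function below):
(:import [java.net URL]
         [org.apache.pdfbox.pdmodel PDDocument]
         [org.apache.pdfbox.util PDFTextStripper])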
The PDDocument/load method takes a pretty wide variety of inputs, like String filenames or InputStreams. Here, I’m going to pass it a java.net.URL, then use PDFTextStripper to pull the text of the PDF out:
(defn text-of-pdf
  [url]
  (with-open [pd (PDDocument/load (URL. url))]
    (let [stripper (PDFTextStripper.)]
      (.getText stripper pd))))
That’s it.
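Since the blotter is plain monospaced text, the string that comes back is usually easiest to work with one line at a time. Here is a minimal sketch of that next step; blotter-lines is a hypothetical helper of mine, not part of PDFBox, and it assumes you only care about non-blank lines with the surrounding whitespace trimmed.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper (an assumption, not from PDFBox): turns the string
;; returned by text-of-pdf into a vector of trimmed, non-blank lines.
(defn blotter-lines
  [text]
  (->> (str/split-lines text)   ; split on \n or \r\n
       (map str/trim)           ; drop leading/trailing whitespace
       (remove str/blank?)      ; discard empty lines
       vec))
```

You would call it as (blotter-lines (text-of-pdf some-url)), where some-url is whatever String address the PDF lives at. Trimming throws away the column alignment of the monospaced layout, so skip the str/trim step if you plan to parse fields by position.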