Increasingly Functional.
by Joshua Miller

Parsing PDFs in Clojure

December 5th 2013

Tagged: clojure

For whatever unknowable reason, the Harrisburg PD publishes its crime blotter in PDF format. It's simple monospaced text with no images or extra formatting, but it comes as a PDF that has to be stripped in order to parse its contents. Here's how I did that with Clojure and PDFBox.

First, add PDFBox to the :dependencies vector in your project.clj:

[org.apache.pdfbox/pdfbox "1.8.2"]
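For context, here is a minimal sketch of where that vector sits in a Leiningen project file. The project name, project version, and Clojure version are assumptions for illustration; only the PDFBox coordinates come from this post.

```clojure
;; project.clj
;; "blotter" and "0.1.0-SNAPSHOT" are hypothetical; use your own project's values.
(defproject blotter "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]        ; assumed Clojure version
                 [org.apache.pdfbox/pdfbox "1.8.2"]]) ; the PDFBox dependency from above
```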

Then import the relevant classes:

(:import [java.net URL]
         [org.apache.pdfbox.pdmodel PDDocument]
         [org.apache.pdfbox.util PDFTextStripper])

The PDDocument/load method is overloaded for a pretty wide variety of inputs, like String filenames, Files, or InputStreams. Here, I'm going to wrap a URL string in a java.net.URL and pass that in, then use PDFTextStripper to pull out the text of the PDF:

(defn text-of-pdf
  [url]
  (with-open [pd (PDDocument/load (URL. url))]
    (let [stripper (PDFTextStripper.)]
      (.getText stripper pd))))
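Since load accepts other input types too, the same pattern should carry over to a PDF that's already on disk. This is only a sketch under the PDFBox 1.8.x API used above: the namespace blotter.pdf and the function name text-of-pdf-file are made up for illustration.

```clojure
(ns blotter.pdf ; hypothetical namespace name
  (:import [java.io File]
           [org.apache.pdfbox.pdmodel PDDocument]
           [org.apache.pdfbox.util PDFTextStripper]))

(defn text-of-pdf-file
  "Returns the text of the PDF at the given local path.
  Hypothetical helper, a file-based variant of text-of-pdf."
  [path]
  ;; with-open closes the PDDocument even if text extraction throws
  (with-open [pd (PDDocument/load (File. path))]
    (.getText (PDFTextStripper.) pd)))
```

Calling (text-of-pdf-file "blotter.pdf") on a saved copy (the file name is hypothetical) should return the document's text as a single string.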

That's it.
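From there the result is an ordinary string, so the actual parsing is plain Clojure. As a rough first step, assuming you want one entry per non-blank line, something like this would do. The function name blotter-lines is hypothetical, and the real blotter layout may well need more than this.

```clojure
(require '[clojure.string :as str])

(defn blotter-lines
  "Splits extracted PDF text into a vector of trimmed, non-blank lines.
  Hypothetical helper; it makes no assumptions about the blotter's columns."
  [text]
  (->> (str/split-lines text) ; handles both \n and \r\n line endings
       (map str/trim)         ; drop leading and trailing padding
       (remove str/blank?)    ; skip empty lines
       vec))
```

You would feed it the output of text-of-pdf, e.g. (blotter-lines (text-of-pdf url)), and go from there.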