I've been doing a lot of work with PDFs these days, and for the most part I've been happy using poppler-utils' pdftohtml. That is great if you just care about the content and not about positioning or formatting. But those of you who, like me, need to know text positioning, font size, indentation, coloring, etc., will have to use something more.
I had given just about every other option a chance before I finally found the pdf reader gem. When I found it, though, it didn't have much documentation, and it wasn't entirely clear how to get started or whether it would work. Well, I can tell you that it does work, and after playing with the examples a lot of it became much clearer. I learned a lot that would probably be useful for a few other people out there, hence this post.
OK, to get started, take a look at the GitHub repository for pdf reader. You don't need to spend too much time there, but note a few places like the examples directory and the list of callbacks.
Let's get started:
Step 1: Install the gem
Yeah, this is a pretty easy step, but it is required:
sudo gem install pdf-reader
Step 2: Find a PDF (or PDFs) to use
It would be best to have several PDFs for you to work with since the callbacks could vary depending on the PDF.
NOTE: For these examples, I'm using a really simple PDF. pdf reader could take a while on some PDFs and seem as though it is hanging, but it is not; it is just chugging away right around line 283 of this file, reading each byte of your PDF.
Step 3: List the possible callbacks and their args for one of your PDFs
The point of this is to find out what methods we can write for pdf reader to call when it encounters the various parts of our PDF.
The BIG One that you will most likely use is show_text() or some form of it like show_text_with_positioning().
But, for now, THIS all depends on the PDF file you are using, so we need to find out what your PDF uses and go from there.
The easiest way to do this is to follow this example and just substitute "somefile.pdf" with the path to your pdf file.
Run it and you will see a long list of possible callbacks and their arguments. It is likely all squished together, so you can simply change the line of your code that prints the callbacks to print them one per line and get a MUCH better look at everything.
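Here is a minimal sketch of listing the callbacks one per line. It assumes a pdf-reader release that provides PDF::Reader::RegisterReceiver, PDF::Reader#pages and Page#walk; older releases took the receiver through a different entry point, so adjust to whatever your installed version offers. The helper names format_callbacks and callbacks_for are made up for this sketch.

```ruby
# Turn the callback list into one readable line per callback.
# `callbacks` is expected to be an array of hashes with :name and :args keys,
# which is the shape PDF::Reader::RegisterReceiver#callbacks returns.
def format_callbacks(callbacks)
  callbacks.map { |cb| "#{cb[:name]} #{cb[:args].inspect}" }
end

# Hypothetical helper: record every callback pdf reader fires for a file.
# Assumes PDF::Reader#pages and Page#walk exist in the installed gem.
def callbacks_for(path)
  require 'pdf/reader' # loaded lazily so format_callbacks works without the gem
  receiver = PDF::Reader::RegisterReceiver.new
  PDF::Reader.new(path).pages.each { |page| page.walk(receiver) }
  receiver.callbacks
end

# Usage: ruby list_callbacks.rb somefile.pdf
puts format_callbacks(callbacks_for(ARGV[0])) if ARGV[0]
```

Each line then shows the callback name followed by its arguments, which makes it much easier to spot which callbacks your PDF actually uses.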
We will start with show_text, so filter that output for show_text and see what you get. For my PDF, I have mostly show_text_with_positioning.
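One way to narrow the list down is to filter and count the recorded callbacks in plain Ruby. This is only a sketch: the sample data is made up, and it assumes the callbacks are hashes with :name and :args keys, as the register receiver's callbacks list provides. The helper names are hypothetical.

```ruby
# Made-up sample in the shape the register receiver records:
# an array of hashes with :name and :args.
SAMPLE_CALLBACKS = [
  { name: :begin_text_object, args: [] },
  { name: :show_text_with_positioning, args: [["He", -20, "llo"]] },
  { name: :show_text_with_positioning, args: [["World"]] },
  { name: :show_text, args: ["!"] },
  { name: :end_text_object, args: [] }
].freeze

# Keep only the callbacks whose name is one of `names`.
def select_callbacks(callbacks, *names)
  callbacks.select { |cb| names.include?(cb[:name]) }
end

# Count how many times each callback name occurs.
def callback_counts(callbacks)
  callbacks.each_with_object(Hash.new(0)) { |cb, counts| counts[cb[:name]] += 1 }
end

select_callbacks(SAMPLE_CALLBACKS, :show_text).each { |cb| puts cb.inspect }
callback_counts(SAMPLE_CALLBACKS).each { |name, count| puts "#{name}: #{count}" }
```

The counts give a quick picture of which text callbacks dominate a file, which tells you which ones are worth implementing first.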
Step 4: Do some lookups
What are the args being shown for my callbacks, and how do we find out what they mean?
You can do this two ways: try your luck at searching the PDF file for "show text" or "show text with positioning" and see what you get, or look up the token used to represent show_text or show_text_with_positioning.
The first way is pretty obvious, so on to the second. Look in the list of callbacks I had you familiarize yourself with earlier, starting on line 172. Looking through it, we can find show_text and show_text_with_positioning, which have Tj and TJ as their operators. Alright, now we have something to look up: "TJ". I found it on page 251 of the PDF Specification from earlier. Some of the descriptions for the operators will require rereading, but you will get the hang of it.
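To make the lookup concrete, here is a small hand-copied table of a few text operators and the callback names pdf reader uses for them. It is a sketch for reference, not the gem's own constant, and operator_for is a hypothetical helper.

```ruby
# Hand-copied subset of text operators and their callback names.
# This is an illustration, not the table shipped inside the gem.
TEXT_OPERATORS = {
  "BT" => :begin_text_object,
  "ET" => :end_text_object,
  "Tf" => :set_text_font_and_size,
  "Tm" => :set_text_matrix_and_text_line_matrix,
  "Td" => :move_text_position,
  "Tj" => :show_text,
  "TJ" => :show_text_with_positioning
}.freeze

# Given a callback name, return the operator token to search the spec for.
# Returns nil when the callback is not in the subset above.
def operator_for(callback_name)
  TEXT_OPERATORS.key(callback_name.to_sym)
end

puts operator_for(:show_text_with_positioning) # prints TJ
```

Note that Tj and TJ differ only by case, so search the specification case-sensitively.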
Step 5: Use what we found
Now that we know how show_text_with_positioning works and what args it brings in, we can write our code.
We need an instance of a receiver to pass to the PDF Reader. This is just a class that has methods like show_text() or show_text_with_positioning(), each of which does something with the args pdf reader hands it, such as printing the text.
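A minimal sketch of such a receiver follows. The class name TextReceiver is made up, and it assumes the argument shapes described in the specification: Tj hands over a single string, while TJ hands over an array that mixes strings with numeric spacing adjustments.

```ruby
# Hypothetical receiver that collects and prints the text it is handed.
class TextReceiver
  attr_reader :chunks

  def initialize
    @chunks = []
  end

  # Tj: a single string.
  def show_text(string)
    record(string)
  end

  # TJ: an array of strings and numbers. The numbers adjust spacing between
  # the strings, so only the strings are kept here.
  def show_text_with_positioning(parts)
    record(parts.grep(String).join)
  end

  private

  # Remember the text and print it as it arrives.
  def record(text)
    @chunks << text
    puts text
  end
end

receiver = TextReceiver.new
receiver.show_text_with_positioning(["He", -20, "llo"]) # prints Hello
```

Dropping the numbers loses word spacing in PDFs that use large adjustments instead of space characters, so treat the joined string as a starting point.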
Now we just need to create our receiver instance and pass it, along with our PDF file, to pdf reader.
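As a sketch of that wiring, assuming a pdf-reader release that provides PDF::Reader.new(path), #pages and Page#walk (older releases, as far as I recall, took the receiver through PDF::Reader.file instead): PrintingReceiver, walk_pages and read_pdf are hypothetical names used only here.

```ruby
# Tiny receiver so this sketch stands alone: prints whatever text it is handed.
class PrintingReceiver
  def show_text(string)
    puts string
  end

  def show_text_with_positioning(parts)
    puts parts.grep(String).join
  end
end

# Hand the receiver to every page; each page calls the receiver's methods
# for the operators it encounters.
def walk_pages(pages, receiver)
  pages.each { |page| page.walk(receiver) }
  receiver
end

# Open the file and walk all of its pages.
def read_pdf(path, receiver)
  # Normally this require sits at the top of the script; it is inside the
  # method here so the rest of the sketch loads without the gem installed.
  require 'pdf/reader'
  walk_pages(PDF::Reader.new(path).pages, receiver)
end

# Usage: ruby read_text.rb somefile.pdf
read_pdf(ARGV[0], PrintingReceiver.new) if ARGV[0]
```

Keeping walk_pages separate from the file handling makes it easy to try a receiver against a stand-in page object before pointing it at a slow PDF.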
Don't forget to require pdf reader at the top of your script with require 'pdf/reader'.
Step 6: Check out the results
If we run our script, we will see all the text that uses Tj or TJ print out.
This is just the beginning and you can pick and choose any of the callbacks from that list (list of operators) and implement just about anything.
At the beginning of this post, I mentioned that I was concerned about positioning. This means I had to get very familiar with the text matrix operator (Tm), found on page 250 of the specification. It takes six arguments (a through f), and it is not very well documented. From what I can gather, the first four (a through d) are for things like scale and rotation, and the last two (e and f) are for position on the page, where e is along the x axis and f is along the y axis.
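To see how the six numbers fit together, here is a small sketch that treats them as the matrix the specification describes, where a point (x, y) in text space maps to (a*x + c*y + e, b*x + d*y + f). TextMatrix and MatrixReceiver are hypothetical names, and the sample values are made up; the callback name is the one pdf reader uses for Tm.

```ruby
# The six Tm arguments, in the order they arrive.
TextMatrix = Struct.new(:a, :b, :c, :d, :e, :f) do
  # Map a point in text space through the matrix.
  def apply(x, y)
    [a * x + c * y + e, b * x + d * y + f]
  end

  # Where text space (0, 0) lands: simply e and f.
  def origin
    [e, f]
  end
end

# Hypothetical receiver that remembers every text matrix it is handed.
class MatrixReceiver
  attr_reader :matrices

  def initialize
    @matrices = []
  end

  # Tm: called with the six numbers a, b, c, d, e, f.
  def set_text_matrix_and_text_line_matrix(a, b, c, d, e, f)
    @matrices << TextMatrix.new(a, b, c, d, e, f)
  end
end

matrix = TextMatrix.new(12, 0, 0, 12, 72, 700) # made-up values: scale 12, start at (72, 700)
p matrix.origin       # prints [72, 700]
p matrix.apply(1, -1) # prints [84, 688]
```

With no rotation (b and c both 0), a and d are plain horizontal and vertical scale factors, which is the common case for ordinary body text.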
There is another text positioning operator that I saw quite often, and that is move_text_position (the Td operator, page 249 of the specification), which provides x and y offsets in unscaled text space units, measured from the start of the current line. What one unit means on the page depends on the scaling set by the text matrix. In what I saw, one unit of y worked out to one line: if y is -1, go to the next line; if y is 0, stay on the same line; -2, move down two lines; 2, move up two lines; etc. x is for indentation or horizontal spacing and, in the same way, worked out to the number of characters (spaces) to offset the text position by.
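Putting Tm and Td together, here is a sketch of a receiver that tracks where each line starts. It follows the specification's rule that Td translates the current line matrix by (tx, ty), and it deliberately ignores the advance caused by showing text, so it only tracks line starts. PositionTracker is a hypothetical name and the numbers are made up.

```ruby
# Hypothetical receiver that records the start position of each text line.
class PositionTracker
  attr_reader :positions

  def initialize
    @positions = []
    reset
  end

  # BT: every text object starts from the identity matrix.
  def begin_text_object
    reset
  end

  # Tm: replace the matrix outright.
  def set_text_matrix_and_text_line_matrix(a, b, c, d, e, f)
    @a, @b, @c, @d, @e, @f = a, b, c, d, e, f
    record
  end

  # Td: move by (tx, ty) text space units from the start of the current line.
  # The offsets are scaled by a through d before being added to e and f.
  def move_text_position(tx, ty)
    @e, @f = tx * @a + ty * @c + @e, tx * @b + ty * @d + @f
    record
  end

  private

  def reset
    @a, @b, @c, @d, @e, @f = 1, 0, 0, 1, 0, 0
  end

  def record
    @positions << [@e, @f]
  end
end

tracker = PositionTracker.new
tracker.begin_text_object
tracker.set_text_matrix_and_text_line_matrix(12, 0, 0, 12, 72, 700)
tracker.move_text_position(0, -1) # one unit down: 12 points at this scale
tracker.move_text_position(2, 0)  # two units right on the same line
p tracker.positions # prints [[72, 700], [72, 688], [96, 688]]
```

With a scale of 12, a y of -1 moves down 12 points, which matches the "next line" reading whenever the line height equals the scale.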