{"id":57505,"date":"2010-01-18T17:06:00","date_gmt":"2010-01-18T17:06:00","guid":{"rendered":"http:\/\/pullmonkey.com\/?p=57505"},"modified":"2010-01-18T19:51:07","modified_gmt":"2010-01-18T19:51:07","slug":"ruby-pdf-reader-gem-tutorial","status":"publish","type":"post","link":"http:\/\/pullmonkey.com\/2010\/01\/18\/ruby-pdf-reader-gem-tutorial\/","title":{"rendered":"Ruby PDF Reader Gem Tutorial"},"content":{"rendered":"
I've been doing a lot of work with PDFs these days, and for the most part I've been happy using poppler-utils' pdftohtml. That is great if you don't care about positioning or formatting and just care about the content. But for those of you who, like me, have run across the need to know text positioning, font size, indentation, coloring, etc., we will have to use something more.<\/p>\n
I had given just about every other option a chance before I finally found the pdf reader gem<\/a>. But when I found pdf reader, it didn't have much documentation, and it wasn't entirely clear how to get started with it or whether it would work. Well, I can tell you that it will work, and after playing with the examples a lot of it became much clearer. I learned a lot that would probably be useful for a few other people out there, hence this post. You should probably familiarize yourself with this PDF specification<\/a> too - found here<\/a>. It really came in handy when trying to figure out what arguments are being passed around and what they represent.<\/p>\n Let's get started -<\/p>\n Yah, this is a pretty easy step, but it is required \ud83d\ude42 It would be best to have several PDFs for you to work with since the callbacks could vary depending on the PDF. The point of this is to find out what methods we can write for pdf reader to call when it encounters the various parts of our PDF. The easiest way to do this is to follow this example<\/a> and just replace \"somefile.pdf\" with the path to your pdf file.<\/p>\n Run it and you will see a long list of possible callbacks and their arguments. It is likely all squished together, so you can simply change the line of your code that says We will start with show_text, so What are the args they are showing me for my callbacks and how do we find out? Now that we know how show_text_with_positioning works and what args it brings in, we can write our code.
\nOk, well to get started, take a look at the github repository for pdf reader<\/a>. You don't need to spend too much time, but just note a few places like the examples directory<\/a> and the list of callbacks<\/a>.<\/p>\nStep 1:\u00a0 Install the gem<\/h3>\n
\nsudo gem install pdf-reader<\/p>\nStep 2:\u00a0 Find a PDF (or PDFs) to use<\/h3>\n
\nNOTE: For these examples, I'm using a really simple PDF. pdf reader could take a while on some PDFs and seem as though it is hanging, but it is not; it is just chugging away right around line 283 of this file<\/a>, reading each byte of your PDF.<\/p>\nStep 3: List the possible callbacks and their args for one of your PDFs<\/h3>\n
\nThe BIG One that you will most likely use is show_text() or some form of it like show_text_with_positioning().
\nBut, for now, THIS all depends on the PDF file you are using, so we need to find out what your PDF uses and go from there.<\/p>\nputs cb<\/code> to
puts cb.inspect<\/code> and get a MUCH better look at everything.<\/p>\n
grep<\/code> for show_text and see what you get. For my PDF, I have mostly show_text_with_positioning.<\/p>\n
Step 4: Do some lookups<\/h3>\n
\nYou can do this two ways: try your luck at searching the pdf file for \"show text\" or \"show text with positioning\" and see what you get, or look up the token used to represent show_text or show_text_with_positioning.
\nThe first way is pretty obvious, so on to the second - look in the list of callbacks<\/a> I had you familiarize yourself with earlier, starting on line 172. Looking through, we can find show_text and show_text_with_positioning, which have Tj and TJ as their operators. Alright, now we have something to look up - \"TJ\". Well, I found it on page 251 of the PDF Specification from earlier. Some of the descriptions for the operators will require rereading, but you will get the hang of it.<\/p>\nStep 5: Use what we found<\/h3>\n
\nWe need an instance of a receiver to pass to the PDF Reader. This is just a class that has methods like show_text() or show_text_with_positioning(). Our receiver could look something like this:<\/p>\n