HOWTO convert PDF to HTML on any platform

Date/Time Permalink: 06/23/06 05:00:20 am
Category: HOWTOs and Guides

The catch is, you need a Gmail account! No, I have no Gmail invites to hand out.

I've had this situation half a dozen times: I want to save a document to my offline machine for future reading, but the file is in .pdf. Ugh! Big waste of space, insane formatting unsuitable for the screen, requires a special viewer just to read it, can't read it period in a console. How do they get away with this?

OK, all you do is use Gmail to mail the file as an attachment to yourself. This can take a while, but it's worth it. When you receive it, simply click "open as HTML". It will open in a new browser tab. Use your browser to save it back to your machine as a plain HTML file (you might as well, because this method will nuke any images anyway). Ta-da!

The other way (on the Linux console) is to use "pdf2ps" to convert it to PostScript, then use "ps2ascii" to change that to plain text. The result, however, is a train wreck. In most cases the document will look like it ran through a shredder.
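
If you want to try the console route anyway, here is a rough sketch of the two steps chained together. It assumes Ghostscript is installed (that is where pdf2ps and ps2ascii come from) and that mktemp is available; the function name pdf_to_text and its two arguments are made up for this example.

```shell
# Hypothetical wrapper around the two Ghostscript tools named above.
# Usage: pdf_to_text input.pdf output.txt
pdf_to_text() {
    in_pdf=$1
    out_txt=$2

    # Bail out early if there is nothing to convert.
    if [ ! -f "$in_pdf" ]; then
        echo "pdf_to_text: no such file: $in_pdf" >&2
        return 1
    fi

    # Scratch file for the intermediate PostScript.
    tmp_ps=$(mktemp) || return 1

    # Step 1: PDF -> PostScript. Step 2: PostScript -> plain text.
    pdf2ps "$in_pdf" "$tmp_ps" && ps2ascii "$tmp_ps" "$out_txt"
    status=$?

    rm -f "$tmp_ps"
    return $status
}
```

Call it like "pdf_to_text manual.pdf manual.txt". Expect the same shredded layout described above; the wrapper only saves typing.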

One irritation remains: now you have to scroll through thousands of lines of whitespace in your web browser to read the dumb thing (and the text will be tiny flyspecks). So our next step is to strip the HTML formatting and leave simple text. Help, sed!

sed -e 's/<[^>]*>//g' target.html > new_file.txt
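
The one-liner above leaves two kinds of mess behind: long runs of blank lines, and entities like &amp;amp; that the browser used to decode for you. Here is a slightly longer sketch that mops up both. The function name html_to_text is made up for this example, and it only decodes the four entities listed; anything else passes through untouched. Like the original, it works one line at a time, so a tag that is split across two lines will survive.

```shell
# Hypothetical helper: strip tags, decode a few common entities,
# trim trailing whitespace, and squeeze runs of blank lines into one.
# Usage: html_to_text target.html > new_file.txt
html_to_text() {
    sed -e 's/<[^>]*>//g' \
        -e 's/&nbsp;/ /g' \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&amp;/\&/g' \
        -e 's/[[:space:]]*$//' "$1" | cat -s
}
```

Tags are stripped before the entities are decoded, so a literal &amp;lt; in the text does not get mistaken for the start of a tag. The "cat -s" at the end is what collapses the whitespace desert.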

By now, the file size is down to roughly 10% of the original PDF file! Very handy for my usual purposes, where I don't give a damn about anything but the raw text (a motherboard manual or textbook, for instance). By the way, I hear using Google to convert to text also works for Microsoft Word documents... but since MS Word saves the text sandwiched between two blobs of binary garbage, I usually just open these in Emacs and manually strip everything before and after the text.

Happy hacking!

UPDATE: 10/20/07 A year and a half after I posted this, I got one comment saying it doesn't work. Please excuse his swearing; PDF does that to people. Maybe this doesn't work any more, or maybe that particular document was botched up somehow.

UPDATE AGAIN: For a completely different way to translate PDF to text using only the Linux command line, go here. Note, however, that this method will lose all images, fonts, and other fancy embedded stuff, leaving just plain text, which then has to be formatted on a case-by-case basis to make it human-readable and machine-filterable.

