
Arcane Linux Commands - Crunching PDF Data

Date/Time Permalink: 11/29/07 06:26:01 pm
Category: HOWTOs and Guides

A Slashdot story today is a very interesting read on Google's latest malware crackdown. And there's a bonus buck: a list of the search keywords which the malware spammers were targeting, in the form of a PDF file. Which will be ze subject in today's lab-ooor-atory expeereemint.

But first, let me take a moment to plug this story better. If you've ever wondered how malware and search spamming work, you can get a lot of insight from this story, including a listing at Sunbelt blog of some JavaScript the spammers were using to alter behavior depending on whether you came from Google. Check the code shot. Isn't that a clever, if evil, little hack?

But now to our subject:

We're going to deal with a PDF entirely on the command line. To check out what programs that handle PDF are available on your system, simply type 'pdf' and then hit the TAB key twice. I get:

pdf2dsc     pdfcrop     pdfimages   pdflatex    pdftex      pdftops
pdf2ps      pdfetex     pdfinfo     pdfopt      pdftohtml   pdftotext
pdf2swf     pdffonts    pdfjadetex  pdfroff     pdftoppm    pdfxtex

so I picked pdftotext. So,

 pdftotext searchterms21388.pdf Search_terms.txt

is the first step in getting a command-line friendly print of the data. Note that I gave the text file an uppercase first letter. That way I can just type shift-S-TAB instead of typing the filename over and over, and it won't stop to see if I want the text one or the original.

Now, the text file is still a mess here. We have:

| microsoft excel free download | microsoft office excel accounts | customer opp
ortunities | microsoft excel software | microsoft excel | 2002 visual basic for 
applications step by | microsoft excel file | password recovery | microsoft exce
l programs | microsoft excel macro | | microsoft excel visual basic macro | intr

a blob of phrases separated by a very obvious delimiter - the pipe '|' character. To comb this into more usable data, we can go:

sed 's/|/\n/g' Search_terms.txt > temp && mv temp Search_terms.txt

replacing the pipes with newlines, so now we have each search term on its own line. Remember that many of the command line filters edit data non-destructively; they write to standard output, so we have to specifically redirect that to a temporary file and move it back over the original. Now we have:

 microsoft excel free download 
 microsoft office excel accounts 
 customer opportunities 
 microsoft excel software 
 microsoft excel 
 2002 visual basic for applications step by 
 microsoft excel file 
 password recovery 
 microsoft excel programs 
 microsoft excel macro 
 microsoft excel visual basic macro 
 introduction of microsoft excel 

which is better, but has lots of blank lines and isn't sorted. At this point, I briefly opened it in a text editor and deleted the first two lines which are only the title of the report. Back to the command line:

sort Search_terms.txt | uniq > temp && mv temp Search_terms.txt 

which sorts the data, gets rid of duplicate lines (such as those blanks), and avoids a 'useless use of cat' award. At last, our effort is rewarded with:

 vpn
 2002 visual basic for applications step by 
 2003 by robert grauer ebook download 
 2003 questions 
 2003 workbook 
 2003, sort sum formula 
 2007 for sale 
 2keyword free domain name hosting 
 327w router telnet vpn 
 678 vpn 
 937 + vpn 
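For the record, the cleanup so far can be chained into one pipeline with no temp-file shuffle. This is only a sketch under a few assumptions: 'clean_terms' is a name I made up, the sample input is invented, and unlike the steps above it also trims the padding spaces and throws away empty lines outright. It uses tr for the pipe-to-newline swap because, as far as I know, '\n' in a sed replacement is a GNU sed nicety rather than something every sed promises.

```shell
#!/bin/sh
# clean_terms (hypothetical helper): read pipe-delimited search terms on
# stdin and print one trimmed, unique term per line, sorted.
clean_terms() {
    tr '|' '\n' |                                      # one term per line
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |   # trim the padding
        grep -v '^$' |                                 # drop empty lines
        sort -u                                        # sort + de-duplicate
}

# Invented sample input, shaped like the pdftotext output above.
printf '%s\n' '| microsoft excel | password recovery | | microsoft excel |' |
    clean_terms
```

'sort -u' does the job of 'sort | uniq' in one process. On the real file you would run something like 'clean_terms < Search_terms.txt > temp && mv temp Search_terms.txt'.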

Ah, what clean, neat data! So, this is the keyword list which malware authors were targeting. Now let's get some stats:

wc -l Search_terms.txt 
grep "[Mm]icrosoft" Search_terms.txt | wc -l

wc is the 'word count' program and the -l argument counts lines. In the second line, I use grep to find all lines containing 'microsoft' or 'Microsoft' (the [Mm] matches either letter). So, out of 1,396 search phrases, 318 of them targeted Microsoft. What's that percentage?

 dc -e '2k 318 1396/p'

So, 22% (dc truncates rather than rounds at two decimal places). With another grep, I found out "[Cc]isco" is mentioned 54 times, a paltry 3%. But what word is the most frequent of the report? Here's a little script that I have saved as 'wf.sh' for "word frequency":


#!/bin/sh
# wf, the friend to wc
# word frequency counter
# prints unique word count, sorted by most frequent words first,
# and reverse alphabetical order.
# usage: wf.sh FILE

tr -s ' ' '\012' < "$1" | sort -fd | uniq -c | sort -rn | less

Handy little one-liner. Just so everybody knows I'm not posting a fork bomb (*smiles wanly at commenter Craig McLure*), the ingredients are:

  • tr : The -s "squeezes" runs of a repeated character down to one, so one space, two spaces, etc. all come out as a single break. It is replacing the spaces with octal 12, which is the ASCII 'LF' line feed - in other words, newlines. Each word gets its own line, now. The '< $1' says "take input from the file whose name I just passed to you."
  • sort : The first sort gets argument -f for "fold" (ignore) case, and -d for "dictionary order", which looks only at blanks, letters, and digits, ignoring punctuation.
  • uniq : The -c argument will count how many duplicates of each word there were before squeezing them to one unique word, and prefix the resulting number to the line.
  • sort : The second sort uses the argument -r for "reverse" and -n for "numeric" sort. Ha! We now have the words with the highest number of repetitions on top! And we only had to type a little line noise to do it! "f y cn rd ths y mst hv bn usng unx"!
  • less : since I figured I'd never use this for making a permanent file, I figured to toss the pager right in there. This is normally bad form, but I build laziness into my scripts every chance I get.
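If you'd rather have the top of the list land on the terminal (or in another pipe) than in a pager, here is one possible variant of the ingredients above. Treat it as a sketch: 'wf_top' is a made-up name, the sample lines are invented, and it lowercases everything up front as a different way of handling case, since 'uniq -c' on its own treats 'VPN' and 'vpn' as two separate words.

```shell
#!/bin/sh
# wf_top (hypothetical): pager-free word frequency counter.
# Usage: wf_top FILE [N]   -> the N most frequent words (default 10)
wf_top() {
    tr '[:upper:]' '[:lower:]' < "$1" |  # fold case so "VPN" and "vpn" match
        tr -s ' ' '\n' |                 # one word per line
        grep -v '^$' |                   # drop the empty line a leading space leaves
        sort |                           # put identical words next to each other
        uniq -c |                        # collapse each run, prefixing its count
        sort -rn |                       # highest counts first
        head -n "${2:-10}"               # keep only the top N
}

# Invented sample data, padded with spaces like the real file.
sample=$(mktemp)
printf '%s\n' ' cisco vpn ' ' free VPN client ' ' vpn setup ' > "$sample"
wf_top "$sample" 3
rm -f "$sample"
```

In the sample, 'vpn' comes out on top with a count of 3; the other words tie at 1 each.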

Running 'wf.sh Search_terms.txt' on the file yields the following for the top ten of the list:

    401 vpn
    319 microsoft
    313 excel
    286 fetch
    118 to
     87 a
     79 go
     59 and
     56 how
     54 free

This 'VPN' seems to be Virtual Private Network - at 28%! So, if anybody out there did a lot of searching for VPNs while running an MS box, I'd check my system if I were you!
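If dc's reverse-Polish charm wears thin, the percentages can also be worked out with grep and awk. A rough sketch, assuming a non-empty file with one phrase per line; 'percent_matching' is a hypothetical helper name. Note that it counts matching lines, the way the grep above does, not word occurrences the way wf.sh does, and that awk's printf rounds where dc truncated.

```shell
#!/bin/sh
# percent_matching (hypothetical): print the share of lines in FILE that
# contain PATTERN, ignoring case, to one decimal place.
# Usage: percent_matching PATTERN FILE
percent_matching() {
    pattern=$1
    file=$2
    # grep -c prints the count but exits non-zero when it is 0,
    # hence the "|| true" to keep strict shells happy.
    total=$(grep -c '' "$file" || true)             # every line matches ''
    hits=$(grep -ci -- "$pattern" "$file" || true)  # lines containing PATTERN
    awk -v h="$hits" -v t="$total" 'BEGIN {
        if (t == 0) { print "0.0%"; exit }
        printf "%.1f%%\n", 100 * h / t
    }'
}

# The Microsoft figure from above, done as plain arithmetic.
awk 'BEGIN { printf "%.1f%%\n", 100 * 318 / 1396 }'   # prints 22.8%
```

On the real data the call would be something like 'percent_matching microsoft Search_terms.txt'.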

None of this was ground-breakingly enlightening, but it was just a little example of using quick commands to get some fast answers from a simple data set. If you had a need (like you get PDF reports every week) you could always tuck all of the translation and sorting stuff into a shell script.
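Such a script might look roughly like the following. Everything about it is an assumption for illustration: the function names are invented, it presumes pdftotext is installed, and it presumes next week's report uses the same pipe-delimited layout as this one.

```shell
#!/bin/sh
# Hypothetical weekly wrapper; names and layout are assumptions.

# summarize_text FILE: print the phrase count and the five most frequent
# words for a pipe-delimited text dump of the report.
summarize_text() {
    terms=$(mktemp) || return 1
    # One trimmed, unique phrase per line.
    tr '|' '\n' < "$1" | sed 's/^ *//; s/ *$//' | grep -v '^$' | sort -u > "$terms"
    echo "phrases: $(grep -c '' "$terms")"
    # Word frequencies, most common first.
    tr -s ' ' '\n' < "$terms" | sort | uniq -c | sort -rn | head -n 5
    rm -f "$terms"
}

# summarize_pdf FILE.pdf: convert the PDF to text, then summarize it.
summarize_pdf() {
    [ $# -eq 1 ] || { echo "usage: summarize_pdf FILE.pdf" >&2; return 2; }
    command -v pdftotext >/dev/null 2>&1 ||
        { echo "pdftotext not found" >&2; return 1; }
    text=$(mktemp) || return 1
    pdftotext "$1" "$text" && summarize_text "$text"
    status=$?
    rm -f "$text"    # clean up whether or not the conversion worked
    return "$status"
}
```

Keeping the text crunching in its own function means you can test it on a plain text file without needing a PDF handy. A call would be along the lines of 'summarize_pdf searchterms21388.pdf'.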

Happy hacking out there in Linux-land!

