PDF Character encodings

I often try to quote sections of PDF documents into my own, but occasionally come across documents where this is not possible. For example, when I copy the text “Evaluating” from this PDF into another application, it comes out as “"#$%&$'()*”.

I can look at the fonts used in the document via File->Properties->Fonts.

It seems that several font subsets have been embedded, and some of these have seemingly random names such as “TTE1AE5470t00 (Embedded Subset)”, listed with “Type: TrueType” and “Encoding: Built-in”.

The copied text comes out as garbage because the document gives no way to map this “Built-in” encoding to (for example) Unicode. However, this doesn’t strike me as a particularly difficult problem (though searching online I couldn’t find anyone who’s managed to resolve it). By inspection, we can see that there’s a simple pattern going from the “real” text to the encoded garbage:

Evaluating
"#$%&$'()*

i.e.,
E = "
v = #
a = $
l = %
u = &
a = $
t = '
i = (
n = )
g = *

The PDF format does support a table to convert from this garbage (which is presumably something like an index into a glyph lookup table) to Unicode. This conversion table is called ToUnicode.
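The guess that the garbage is an index can be checked against the pairs listed above. The following is only a sketch: the names garbled, plain and show_codes are my own, and it assumes the copied text is plain ASCII. It prints the byte value of each garbage character next to the letter it stands for.

```bash
#!/bin/bash
# The pair from the example above: garbage as copied, and the real text.
garbled='"#$%&$'\''()*'
plain='Evaluating'

# Print each real letter next to the byte value of its garbage character.
show_codes() {
    i=1
    while [ "$i" -le "${#garbled}" ]; do
        g=$(printf '%s' "$garbled" | cut -c "$i")
        p=$(printf '%s' "$plain" | cut -c "$i")
        # A leading single quote makes printf output the character's numeric code.
        printf '%s 0x%02X\n' "$p" "'$g"
        i=$((i + 1))
    done
}

show_codes
```

The codes run from 0x22 to 0x2A in the order the letters first appear (the repeated “a” reuses 0x24), which is consistent with the characters being numbered as they were added to the font subset.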

See here for some further info:
http://www.foolabs.com/xpdf/download.html
http://www.perlmonks.org/?node_id=793929
http://www.axsl.org/font/encoding.html

http://www.pdflib.com/products/tet/
“Glyphs without appropriate Unicode mappings are identified as such, and are mapped to a configurable replacement character in order to avoid misinterpretation.”

http://itextpdf.com/
http://pdf.editme.com/font-glyph-encoding
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf

It is possible to do this conversion with a bash script, but it involves manually copying and pasting each character in the set, like this:

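#!/bin/bash
# Each *Source string holds the garbage characters as copied from the PDF;
# the matching *Dest string holds the real characters in the same order,
# so each pair must be the same length. The tables are specific to one document.
# uppercaseDest and lowercaseSource are not defined here and must be filled in the same way.
uppercaseSource="UGB<\"RMN)Cbc@_e8d?\`;"
lowercaseDest="abcdefghijklmnopqrstuvwxyz"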

numberSource="IFQSWHYZEG"
numberDest="1234567890"

symbolSource="A=]^TOaLn!"
symbolDest=".,()&\-':/ "

fullSource=$uppercaseSource$lowercaseSource$numberSource$symbolSource
fullDest=$uppercaseDest$lowercaseDest$numberDest$symbolDest

# [GARETH] The tr -s ' ' step squeezes repeated spaces, because my doc sometimes uses ! to represent a space and at other times uses ! followed by a space.
tr "$fullSource" "$fullDest" | tr -s ' '
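Since tr pairs characters purely by position, one missing or extra character shifts every mapping after it. The following is a minimal sketch of a guard against that, not part of the script above: decode is a hypothetical helper, and it assumes both tables hold literal characters only (no tr escapes such as \- and no ranges), so that the shell's length count matches what tr sees.

```bash
#!/bin/bash
# decode SRC DST: translate stdin from SRC characters to DST characters,
# refusing to run if the two tables differ in length.
decode() {
    if [ "${#1}" -ne "${#2}" ]; then
        echo "table mismatch: ${#1} source vs ${#2} destination characters" >&2
        return 1
    fi
    tr "$1" "$2"
}

# Usage with the pair from the "Evaluating" example.
printf '%s\n' '"#$%&$'\''()*' | decode '"#$%&$'\''()*' 'Evaluating'
```

With tables of unequal length the function prints a message to stderr and returns a non-zero status instead of translating.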

One thought on “PDF Character encodings”

  1. I’d guess the embedded font/encoding system is being abused for copy protection (security through obscurity). It would probably be fairly straightforward to dictionary-attack it.
