I often try to quote sections of PDF documents in my own, but occasionally come across documents where this is not possible. For example, when I copy the text “Evaluating” from this PDF into another application, it comes out as the garbage string ‘"#$%&$'()*’.
I can look at the fonts used in the document, and it turns out that several font subsets have been embedded. Some of these have random-looking names like “TTE1AE5470t00 (Embedded Subset)”, with “Type: TrueType” and “Encoding: Built-in”.
The reason the copied text appears as garbage is that there is no way to map this “Built-in” encoding to (for example) Unicode. However, this doesn’t strike me as a particularly difficult problem (though, searching online, I couldn’t find anyone who has managed to resolve it). Simply by inspection, we can see that there’s a simple pattern going from the “real” text to the encoded garbage:
E = "
v = #
a = $
l = %
u = &
a = $
t = '
i = (
n = )
g = *
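In fact the pattern looks even simpler than an arbitrary lookup table: the garbage characters are consecutive ASCII codes starting at 0x22 (`"`), handed out in the order each distinct glyph first appears in the text. A quick shell sketch of that hypothesis (the start code 0x22 is inferred from the sample above, not read from the PDF):

```shell
#!/bin/sh
# Assign consecutive ASCII codes, starting at 0x22 ("), to the distinct
# glyphs of "Evaluating" in order of first appearance.
code=34   # 0x22
map=""
for c in E v a l u t i n g; do
  ch=$(printf "\\$(printf '%03o' "$code")")   # code point -> character
  map="$map$c=$ch "
  code=$((code + 1))
done
printf '%s\n' "$map"   # reproduces the E=" ... g=* table above
```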
It looks like the PDF format should be able to support a table to convert from this garbage (which is presumably actually an index into a glyph lookup table) to Unicode. In the PDF specification this conversion table is called a ToUnicode CMap.
“Glyphs without appropriate Unicode mappings are identified as such, and are mapped to a configurable replacement character in order to avoid misinterpretation.”
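For illustration, a minimal ToUnicode CMap for the subset above might look roughly like this. The structure follows the PDF specification, but the code and Unicode values here are only my inferences from the “Evaluating” sample; a real CMap would need an entry for every glyph in the subset:

```
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapName /Custom-ToUnicode def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
9 beginbfchar
<22> <0045>  % " -> E
<23> <0076>  % # -> v
<24> <0061>  % $ -> a
<25> <006C>  % % -> l
<26> <0075>  % & -> u
<27> <0074>  % ' -> t
<28> <0069>  % ( -> i
<29> <006E>  % ) -> n
<2A> <0067>  % * -> g
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
```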
It is possible to do this conversion with a bash script, but it does involve manually copying and pasting each character in the set, like this:
# [GARETH] The tr -s bit is to remove duplicate spaces, as sometimes my doc uses ! to represent space, but also uses ! followed by a space at other times.
tr "$fullSource" "$fullDest" | tr -s ' '
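Filled in for just the “Evaluating” subset, the whole pipeline might look like the sketch below. `$fullSource` and `$fullDest` keep the variable names from the snippet above, but hold only the characters recovered from this one word; a real document needs the full subset in them:

```shell
#!/bin/sh
# Sketch: undo the subset encoding with tr, using only the characters
# recovered from "Evaluating" (a real mapping needs the whole subset).
fullSource='"#$%&'\''()*'   # garbage codes, in order of first appearance
fullDest='Evaluting'        # matching real characters (distinct, same order)

# The trailing `tr -s ' '` squeezes duplicate spaces, as in the original script.
printf '%s\n' '"#$%&$'\''()*' | tr "$fullSource" "$fullDest" | tr -s ' '
# prints: Evaluating
```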