Deducing the logical layout of a PDF file by following its often convoluted structure can be a very complex task [SF05]. Although commands exist for setting the spacing between characters, words and lines and for specifying line breaks (described in section 4.3.1), they are very rarely used. Instead, absolute positioning commands are used and, because of kerning, single words are often split over several instructions, so the positions of line and column breaks have to be inferred through spatial analysis. Heuristics have been developed [YF04, NMRW98, HB07] to help identify these breaks, but for anything other than simple one-column documents they often fail, as can be seen and heard when using PDF reader functions such as read out loud or save as text.
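The kind of spatial analysis described above can be illustrated with a minimal sketch. It is not the heuristic of any of the cited works: it assumes each text-showing instruction has already been reduced to a fragment with a string, a left edge, a baseline and an advance width, and it uses fixed tolerances. The names Fragment, group_lines and join_words, and the tolerance values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One text-showing instruction: a string drawn at an absolute position."""
    text: str
    x: float      # left edge, in page units
    y: float      # baseline (PDF y grows upwards)
    width: float  # advance width of the whole string


def group_lines(fragments, y_tol=2.0):
    """Group fragments whose baselines lie within y_tol of each other."""
    lines = []
    # Visit fragments from the top of the page downwards.
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        if lines and abs(lines[-1][0].y - frag.y) <= y_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, order fragments from left to right.
    return [sorted(line, key=lambda f: f.x) for line in lines]


def join_words(line, gap_tol=1.0):
    """Merge fragments separated by less than gap_tol into single words."""
    words = [line[0].text]
    prev_end = line[0].x + line[0].width
    for frag in line[1:]:
        gap = frag.x - prev_end  # negative when kerning pulls glyphs together
        if gap < gap_tol:
            words[-1] += frag.text
        else:
            words.append(frag.text)
        prev_end = frag.x + frag.width
    return words


# Hypothetical fragments: "Table" is split in two by a kerning adjustment.
fragments = [
    Fragment("Ta", 10.0, 100.0, 11.0),
    Fragment("ble", 20.6, 100.0, 14.0),
    Fragment("one", 38.0, 100.0, 16.0),
    Fragment("row", 10.0, 88.0, 15.0),
]
words_per_line = [join_words(line) for line in group_lines(fragments)]
```

Fixed tolerances of this kind are exactly what breaks down on multi-column pages and formulae, where gaps and baselines vary with font size and layout.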
Specialised PDF analysis, in conjunction with open source PDF tools, the PDF API and whitespace analysis, has been used for full text extraction and the identification of words, lines, paragraphs, tables and columns [LB95]. Hassan and Baumgartner worked with one of these libraries, PdfBox [Fou10], to perform intelligent text extraction on PDF documents [HB05]. By this they meant using the perfect extraction results from PDF together with a combination of top-down and bottom-up parsing to create a graph representing the textual content of the file. Unfortunately, this approach failed when presented with many scientific documents, especially those containing mathematical formulae and tables. Later work focused on improving the system to also analyse tables [HB07].
Yuan and Liu [YL05] also use a modified version of PdfBox to extract and parse the text contained within a PDF file. The extracted text is processed to generate tags, which are injected back into the file to aid searching. The focus was on identifying the title, author, address, abstract and keywords of each paper. By taking advantage of the additional font information and perfect character recognition, they were able to attain accuracy levels of up to 92.5%. This was achieved by experimenting with recent PDF files, those published within the past two years, which were also compatible with PdfBox. The conclusions of their work noted that more effort was required to modify the PDF parser and improve its compatibility with a wider range of files.
Marinai [Mar09] and Tkaczyk and Bolikowski [TB11] make use of the PdfBox and iText [Bru09] libraries respectively to extract data directly from PDF files in order to perform metadata analysis. Due to the lack of content demarcation within PDF, however, both have to perform significant further analysis to determine such structures as words, lines, columns and tables.
Other attempts based on similar approaches have been made to identify and segment text, mathematics, tables and formulae; however, all suffer from the same problem, that of limited compatibility across PDF files [Anj01, RA03, FGB+11, LGT+11]. This is because many versions of PDF exist, each widely used and available. There are also many authoring tools, including Adobe Acrobat, Ghostscript and pdfTeX [Inc11a, Sof11, Tha09], and in turn many versions of these, which can produce visually identical files with completely different underlying code. The differences may include, but are not limited to:
• Different instructions being used for positioning and displaying text and images
• Alternative fonts being used, for example Type 3 instead of Type 1
• Fonts being embedded and encoded in various ways
• The presence, or lack of, structural tags and optional content groups
• The number of optional attributes included in the file
This variance in files makes it very difficult to produce a comprehensive parser; with the exception of Adobe Reader [Inc11b], the majority of PDF viewers have compatibility problems with certain file versions and features, particularly those containing annotations and optional content.
4.2.1 PDF Bounding Boxes
Figure 4.2 highlights the difficulty of identifying the size, location and spatial relationships of characters by showing two different sets of character bounding boxes obtained from the same area of a PDF file. Note that some symbols, especially large extendable ones such as delimiters, are actually constructed of multiple overlaid characters and thus have multiple associated bounding boxes.
(a) Characters with maximum font bounding boxes
(b) Characters with minimal ascender, descender and width boxes
Figure 4.2: Bounding boxes of characters in PDF files
Figure 4.2(a) shows the bounding box of each character, in green, as given by the FontBBox attribute of the font descriptor; this is analogous to the area highlighted when selecting text. These boxes are sufficient for analysing characters of similar fonts with linear relationships, as in the first two lines of the example. However, when different fonts or certain characters, usually non-alphanumeric, are used, the bounding boxes are too large for their spatial relationships to be deduced. This is why copying and pasting a standard line of text will often return an accurate result, whereas doing the same over, say, a formula will not.
Figure 4.2(b) shows the bounding box of each character, in red, using the character widths as given by the font descriptor, together with the ascender and descender information, also from the font descriptor. Here the bounding boxes usually offer a far more accurate portrayal of the character’s true bounding box and in many cases would be suitable for analysing two-dimensional relationships. However, the bounding boxes of certain characters, particularly large ones such as delimiters, integrals and summations, do not contain the whole character and are thus again unsuitable.
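The two kinds of box compared above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used to produce Figure 4.2: metrics are assumed to be in thousandths of a text-space unit (as for Type 1 fonts, but not Type 3), the text matrix, rise and horizontal scaling are ignored, and every metric value is hypothetical. The names FontMetrics, font_bbox_box and metric_box are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FontMetrics:
    """Subset of font metrics, in 1/1000 of a text-space unit (assumed)."""
    font_bbox: tuple  # (llx, lly, urx, ury): box enclosing every glyph
    ascent: float     # height above the baseline
    descent: float    # depth below the baseline (a negative number)


def to_page_units(value, font_size):
    """Scale a glyph-space metric by the font size."""
    return value * font_size / 1000.0


def font_bbox_box(x, y, font_size, metrics):
    """Maximum box: the font-wide FontBBox placed at the glyph origin."""
    llx, lly, urx, ury = metrics.font_bbox
    return (x + to_page_units(llx, font_size),
            y + to_page_units(lly, font_size),
            x + to_page_units(urx, font_size),
            y + to_page_units(ury, font_size))


def metric_box(x, y, font_size, glyph_width, metrics):
    """Smaller box: glyph width horizontally, ascent/descent vertically."""
    return (x,
            y + to_page_units(metrics.descent, font_size),
            x + to_page_units(glyph_width, font_size),
            y + to_page_units(metrics.ascent, font_size))


# Hypothetical metrics for a 10-unit font with the glyph origin at (100, 50).
metrics = FontMetrics(font_bbox=(-200, -250, 1200, 900),
                      ascent=700, descent=-200)
max_box = font_bbox_box(100.0, 50.0, 10.0, metrics)
tight_box = metric_box(100.0, 50.0, 10.0, 500, metrics)
```

With these values the FontBBox box is 14 units wide and the width-based box only 5, which shows how neighbouring FontBBox boxes overlap. Neither box is derived from the outline of the individual glyph, so a character that extends beyond the font's ascent and descent will not be enclosed by the smaller box.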
Figure 4.1, a screenshot of a highlighted mathematical formula within a PDF file, clearly demonstrates the problems with bounding boxes: the highlighted areas in no way correspond to the positions or sizes of the characters.