Read and write PDFs - Real Python Part 1

PDF files have become a sort of necessary evil these days. Despite their frequent use, PDFs are some of the most difficult files to work with when it comes to making modifications, combining files, and especially extracting text information.

Fortunately, there are a few options in Python for working specifically with PDF files. None of these options are perfect solutions, but often you can use Python to completely automate or at least ease some of the pain of performing certain tasks using PDFs.

The most frequently used package for working with PDF files in Python is named pyPdf and can be found here. You will need to download and install this package before continuing with the chapter.

NOTE (Python 3): I highly recommend that you stick with Python 2.7 for this chapter; there is an unofficial version of pyPdf renamed PyPDF2, available here, that supports Python 3, but it’s still in development. At the time of this writing, it does not have the full functionality of pyPdf and lacks documentation.

Windows: Download and run the automated installer (pyPdf-1.13.win32.exe).

OS X: If you have easy_install installed, you can type the following command into your Terminal to install pyPdf:

    sudo easy_install pypdf

Otherwise, you will need to download and unzip the .tar.gz file and install the module using the setup.py script as explained in the section on installing packages.

Debian/Linux: Just type the command:

    sudo apt-get install python-pypdf

The pyPdf package includes a PdfFileReader and a PdfFileWriter; just like when performing other types of file input/output, reading and writing are two entirely separate processes.

First, let’s get started by reading in some basic information from a sample PDF file, the first couple of chapters of Jane Austen’s Pride and Prejudice via Project Gutenberg:

    import os
    from pyPdf import PdfFileReader

    # folder that holds the practice files
    path = "C:/Real Python/Course materials/Chapter 8/Practice files"
    inputFileName = os.path.join(path, "Pride and Prejudice.pdf")

    # open the PDF in "rb" (read binary) mode and hand it to the reader
    inputFile = PdfFileReader(file(inputFileName, "rb"))

    print "Number of pages:", inputFile.getNumPages()
    print "Title:", inputFile.getDocumentInfo().title

We created a PdfFileReader object named inputFile by passing it a file() object opened in “rb” (read binary) mode, giving the full path of the file. The “binary” part is necessary for reading PDF files because we aren’t just reading basic text data. PDFs include much more complicated information, and saying “rb” here instead of just “r” tells Python that we might encounter and have to interpret characters that can’t be represented as standard readable text.
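To make the role of binary mode a little more concrete, here is a minimal sketch that uses only the standard library rather than pyPdf. The sample bytes and the file name sample.bin are made up for illustration; the only detail borrowed from real PDFs is that they begin with the marker %PDF and contain bytes that are not readable text. The code is written so that it should run under both Python 2.7 and Python 3.

```python
import os
import tempfile

# Made-up stand-in for the start of a PDF: a header line followed by
# bytes that are not valid readable text.
sample_bytes = b"%PDF-1.4\n\xe2\xe3\xcf\xd3\n"

# Hypothetical file name inside a throwaway temporary directory.
temp_dir = tempfile.mkdtemp()
sample_path = os.path.join(temp_dir, "sample.bin")

# "wb" writes the bytes exactly as given.
with open(sample_path, "wb") as sample_file:
    sample_file.write(sample_bytes)

# "rb" reads the raw bytes back without trying to interpret them as text.
with open(sample_path, "rb") as sample_file:
    raw_data = sample_file.read()

print(raw_data == sample_bytes)  # the bytes come back unchanged
print(raw_data[:4] == b"%PDF")   # the header marker is intact

# Clean up the temporary file and directory.
os.remove(sample_path)
os.rmdir(temp_dir)
```

Both print calls should display True, since binary mode hands the data back byte for byte.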

We can then return the number of pages included in the PDF input file. We also have access to certain attributes through the getDocumentInfo() method; in fact, if we display the result of simply calling this method, we will see a dictionary with all of the available document info:

    >>> print inputFile.getDocumentInfo()
    {'/CreationDate': u'D:20110812174208', '/Author': u'Chuck', '/Producer':
    u'Microsoft® Office Word 2007', '/Creator': u'Microsoft® Office Word 2007',
    '/ModDate': u'D:20110812174208', '/Title': u'Pride and Prejudice, by Jane
    Austen'}
    >>>
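The '/CreationDate' and '/ModDate' values in this dictionary are plain strings rather than date objects. As a rough sketch of how such a value could be turned into something easier to work with, the helper below converts it with the standard datetime module. The name parse_pdf_date is hypothetical (it is not part of pyPdf), and the sketch assumes the value has exactly the form shown above, with no time zone suffix.

```python
from datetime import datetime


def parse_pdf_date(raw_value):
    """Convert a string like 'D:20110812174208' into a datetime.

    Hypothetical helper, not part of pyPdf. It assumes the value is
    'D:' followed by YYYYMMDDHHMMSS with no time zone suffix.
    """
    return datetime.strptime(raw_value, "D:%Y%m%d%H%M%S")


created = parse_pdf_date("D:20110812174208")
print(created.isoformat())  # 2011-08-12T17:42:08
```

Dates written by other programs may carry extra time zone information after the seconds, which this sketch does not handle.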

We can also retrieve individual pages from the PDF document using the getPage() method and specifying the index number of the page (as always, starting at 0). However, since PDF pages include much more than simple text, displaying the text data on a PDF page is more involved. Fortunately, pyPdf has made the process of parsing out text somewhat easier, and we can use the extractText() method on each page:

    >>> print inputFile.getPage(0).extractText()
    The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen This eBook is for the use of anyone anywhere at no cost and with almost no
    restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org
    Title: Pride and Prejudice Author: Jane Austen Release
    Date: August 26, 2008 [EBook #1342] [Last updated: August 11, 2011] Language:
    English Character set encoding: ASCII *** START OF THIS PROJECT GUTENBERG
    EBOOK PRIDE AND PREJUDICE *** Produced by Anonymous Volunteers, and David
    Widger PRIDE AND PREJUDICE By Jane Austen Contents
    >>>

Formatting standards in PDFs are inconsistent at best, and it’s usually necessary to take a look at the PDF files you want to use on a case-by-case basis. In this instance, notice how we don’t actually see newline characters in the output; instead, it appears that new lines are being represented as multiple spaces in the text extracted by pyPdf. We can use this knowledge to write out a roughly formatted version of the book to a plain text file (for instance, if we only had the PDF available and wanted to make it readable on an untalented mobile device):

    import os
    from pyPdf import PdfFileReader

    path = "C:/Real Python/Course materials/Chapter 8/Practice files"
    inputFileName = os.path.join(path, "Pride and Prejudice.pdf")
    inputFile = PdfFileReader(file(inputFileName, "rb"))

    outputFileName = os.path.join(path, "Output/Pride and Prejudice.txt")
    outputFile = open(outputFileName, "w")

    title = inputFile.getDocumentInfo().title  # get the file title
    totalPages = inputFile.getNumPages()  # get the total page count

    outputFile.write(title + "\n")
    outputFile.write("Number of pages: {}\n\n".format(totalPages))

    for pageNum in range(0, totalPages):
        text = inputFile.getPage(pageNum).extractText()
        text = text.replace("  ", "\n")  # two spaces mark a line break
        outputFile.write(text)

    outputFile.close()

Since we’re writing out basic text, we chose the plain “w” mode and created a file “Pride and Prejudice.txt” in the “Output” folder. Meanwhile, we still use “rb” mode to read data from the PDF file since, before we can extract the plain text from each page, we are in fact reading much more complicated data. We loop over every page number in the PDF file, extracting the text from that page. Since we know that new lines will show up as additional spaces, we can approximate better formatting by replacing every instance of double spaces ("  ") with a newline character.
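The replacement step can be tried on its own without a PDF. The string below is a made-up sample that imitates extracted text in which each line break has turned into two spaces; it is only an illustration, not actual output from pyPdf.

```python
# Made-up sample that imitates extracted text, where each line break
# in the PDF has turned into two spaces.
extracted_text = "Title: Pride and Prejudice  Author: Jane Austen  Language: English"

# Replace every pair of spaces with a newline character.
formatted_text = extracted_text.replace("  ", "\n")

print(formatted_text)
```

This prints the three fields on separate lines. A run of three spaces would leave one stray space at the start of the next line, which is part of why the result is only roughly formatted.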

You may find that a PDF document includes unusual characters that cannot be written into a plain text file, for instance a trademark symbol. These characters are not in the ASCII character set, meaning that they can’t be represented using any of the 128 characters that ASCII defines. Because of this, your code will not be able to write out a raw text file in “w” mode. Usually the way to get around this is by using the encode() method, like so:

    text = text.encode("utf-8")

If we have a string text that has unusual characters in it, this line will allow us to change how each character is represented (using UTF-8 encoding) so that we can now store these symbols in a raw text file. These unusual characters might not appear the same way as in the original file, since the text file has a much more limited set of characters available, but if you do not change the encoding then you will not be able to output the text at all.

(If you really want to get a handle on text encoding and what’s really happening, take a look at this talk. Note that in Python 3, strings are Unicode by default.)
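As a small illustration of the difference, the sketch below takes a made-up string containing a trademark symbol and tries to encode it two ways. The explicit ASCII attempt fails in roughly the same way that Python 2 fails when it tries to write such a string to a file opened in “w” mode, while the UTF-8 version succeeds. The sample string and variable names are assumptions for the sake of the example.

```python
# Made-up sample containing a trademark symbol (U+2122), which is
# outside the ASCII character set.
text = u"Real Python\u2122"

# Encoding as ASCII fails because the symbol has no ASCII representation.
try:
    text.encode("ascii")
    ascii_worked = True
except UnicodeEncodeError:
    ascii_worked = False

# Encoding as UTF-8 succeeds; the symbol becomes a sequence of three bytes.
encoded_text = text.encode("utf-8")

print(ascii_worked)                   # False
print(len(encoded_text) - len(text))  # 2 extra bytes for the one symbol
```

The encoded result is a sequence of bytes, so it can be written to a file even though the original symbol is not an ASCII character.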

Instead of extracting text, we might want to modify the PDF file itself, saving out a new version of the PDF. We’ll see more examples of why and how this might occur in the next section, but for now we’ll create the simplest “modified” file by saving out only a section of the original file. Here we copy over the first three pages of the PDF (not including the cover page) into a new PDF file:

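    import os
    from pyPdf import PdfFileReader, PdfFileWriter

    path = "C:/Real Python/Course materials/Chapter 8/Practice files"
    inputFileName = os.path.join(path, "Pride and Prejudice.pdf")
    inputFile = PdfFileReader(file(inputFileName, "rb"))

    outputPDF = PdfFileWriter()

    for pageNum in range(1, 4):  # pages 1-3; index 0 is the cover page
        page = inputFile.getPage(pageNum)
        outputPDF.addPage(page)

    # the output file name is an assumption; the original listing is cut off
    outputFileName = os.path.join(path, "Output/portion.pdf")
    outputFile = file(outputFileName, "wb")
    outputPDF.write(outputFile)
    outputFile.close()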

We imported both PdfFileReader and PdfFileWriter from pyPdf so that we can write out a PDF file of our own. PdfFileWriter doesn’t take any arguments, which might be surprising; we can start adding PDF pages to our outputPDF before we’ve specified what file it will become. However, in order to save the output to an actual PDF file, at the end of our code we create an outputFile as usual and then call outputPDF.write(outputFile) in order to write the PDF contents into this file.
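If it is unclear which pages range(1, 4) picks out, a plain list can stand in for the PDF. The sketch below involves no PDF at all; the list contents and variable names are made up, and the second list merely plays the part of the writer object that accumulates pages before anything is saved.

```python
# A plain list stands in for the pages of a PDF; index 0 is the cover.
pages = ["cover", "page 1", "page 2", "page 3", "page 4"]

# Collect the chosen pages first, the way a writer object accumulates
# pages before any output file has been named.
selected_pages = []
for page_num in range(1, 4):
    selected_pages.append(pages[page_num])

print(selected_pages)  # ['page 1', 'page 2', 'page 3']
```

Because range(1, 4) stops before 4, the loop visits indexes 1, 2 and 3, which skips the cover page and leaves out everything from index 4 onward.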

Review exercises:

1. Write a script that opens the file named “The Whistling Gypsy.pdf” from the Chapter 8 practice files, then displays the title, author, and total number of pages in the file

2. Extract the full contents of “The Whistling Gypsy.pdf” into a .TXT file; you will need to encode the text as UTF-8 before you can output it

3. Save a new version of “The Whistling Gypsy.pdf” that does not include the cover page into the Output folder
