Manipulate PDF files - Real Python Part 1

Often the reason we want to modify a PDF file is more complicated than just saving a portion of the file. We might want to rotate some pages, crop pages, or even merge information from different pages together. When manually editing the files in Adobe Acrobat isn’t a practical or feasible solution, we can automate any of these tasks using pyPdf.

Let’s start with a surprisingly common problem: rotated PDF pages. Go ahead and open up the file “ugly.pdf” in the “Chapter 8/Practice files/” folder. You’ll see that it’s a lovely PDF file of Hans Christian Andersen’s The Ugly Duckling, except that every odd-numbered page is rotated counter-clockwise by ninety degrees. This is simple enough to correct by using the rotateClockwise() method on every other PDF page and specifying the number of degrees to rotate:

import os
from pyPdf import PdfFileReader, PdfFileWriter

path = "C:/Real Python/Course materials/Chapter 8/Practice files"
# ...
for pageNum in range(0, inputFile.getNumPages()):
    # ...
outputFileName = os.path.join(path, "Output/The Conformed Duckling.pdf")
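The page-selection logic behind this fix can be sketched in plain Python, without opening a PDF at all. This is only an illustration: the helper names indexes_to_rotate and corrected_rotation are made up for this example and are not part of pyPdf. It assumes that getPage() counts pages from zero, so the odd-numbered pages (first, third, fifth) sit at the even indexes, and that a counter-clockwise turn of 90 degrees is the same as a clockwise turn of 270 degrees.

```python
def indexes_to_rotate(num_pages):
    """Return the zero-based indexes of the odd-numbered pages (1st, 3rd, ...)."""
    return [page_num for page_num in range(num_pages) if page_num % 2 == 0]


def corrected_rotation(current_degrees, clockwise_degrees=90):
    """Apply a clockwise turn and keep the result between 0 and 359 degrees."""
    return (current_degrees + clockwise_degrees) % 360


# A five-page file: which indexes would need rotateClockwise(90)?
print(indexes_to_rotate(5))

# A page turned 90 degrees counter-clockwise sits at 270 degrees clockwise;
# one more clockwise quarter turn brings it back upright.
print(corrected_rotation(270))
```

In a real script, each index returned by the first helper would be the pageNum for which the page’s rotateClockwise() method is called before the page is added to the output.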

Another useful feature of pyPdf is the ability to crop pages, which in turn will allow us to split up PDF pages into multiple parts or save out partial sections of pages. For instance, open up the file “half and half.pdf” from the chapter 8 practice files folder to see an example of where this might be useful.

This time, we have a PDF that’s presented in two “frames” per page, which again is not an ideal layout in many situations. In order to split these pages up, we will have to refer to the MediaBox belonging to each PDF page, which is a rectangle representing the boundaries of the page. Let’s take a look at the MediaBox of a PDF page in the interactive window to get an idea of what it looks like:

>>> from pyPdf import PdfFileReader
>>> inputFile = PdfFileReader(file("C:/Real Python/Course materials/Chapter 8/Practice files/half and half.pdf", "rb"))

A mediaBox is a type of object called a RectangleObject. Consequently, we can get the coordinates of the rectangle’s corners:

These locations are returned to us as tuples that include the x and y coordinate pairs. Notice how we didn’t include parentheses anywhere because mediaBox and its corners are attributes, not methods, of the PDF page.
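To make the corner layout concrete without opening a PDF, here is a minimal sketch that uses a hypothetical stand-in class named Box rather than pyPdf’s own RectangleObject. It assumes the rectangle is stored as [x0, y0, x1, y1] with the origin at the lower-left corner, and the page size is only an example value, not one read from the practice file.

```python
class Box(object):
    """Hypothetical stand-in for a page rectangle stored as [x0, y0, x1, y1]."""

    def __init__(self, coords):
        self.x0, self.y0, self.x1, self.y1 = coords

    @property
    def lowerLeft(self):
        return (self.x0, self.y0)

    @property
    def lowerRight(self):
        return (self.x1, self.y0)

    @property
    def upperLeft(self):
        return (self.x0, self.y1)

    @property
    def upperRight(self):
        return (self.x1, self.y1)


box = Box([0, 0, 792, 612])  # example landscape size in points (an assumption)

# Each corner is read as an attribute (no parentheses) and comes back
# as an (x, y) tuple.
print(box.lowerLeft)
print(box.upperRight)
```

The corner names mirror the ones used on a page’s mediaBox later in this section; everything else about the class is invented for illustration.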

We will have to do a little math in order to crop each of our PDF pages. Basically, we need to set the corners of each half-page so that we crop out the side of the page that we don’t want. To do this, we divide the width of the landscape page into two halves; we set the upper-right corner of the left-side page to be half of the total width, and we set the upper-left corner of the right-side page to start halfway across the width of the page.
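The arithmetic can be tried out on its own before touching a PDF. The sketch below is plain Python; the function name split_corners and the page size are assumptions made for this example. It takes the original upper-right corner and returns the two new corners described above.

```python
def split_corners(upper_right):
    """Return the new corners for the left and right half-pages.

    upper_right is the (x, y) tuple of the original page's upper-right corner.
    """
    width, height = upper_right
    middle = width / 2.0  # float division, so odd widths are not truncated
    left_page_upper_right = (middle, height)
    right_page_upper_left = (middle, height)
    return left_page_upper_right, right_page_upper_left


left_ur, right_ul = split_corners((792, 612))  # example size, an assumption
print(left_ur)
print(right_ul)
```

Both results are the same point, the top of the line down the middle of the page. It becomes the right-hand edge of the left page and the left-hand edge of the right page.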

Since we have to crop the half-pages in order to write them out to our new PDF file, we will also have to create a copy of each page. This is because the PDF pages are mutable objects; if we change something about a page, we also change the same things about any variable that references that object. This is exactly the same problem that we ran into when having to copy an entire list into a new list before making changes. In this case, we import the built-in copy module, which creates and returns a copy of an object by using the copy.copy() function. (In fact, this function works just as well for making copies of entire lists instead of the shorthand list2 = list1[:] notation.)
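The difference between a second name for the same object and a real copy is easiest to see with a list. This is a small illustrative sketch with made-up variable names:

```python
import copy

list1 = [1, 2, 3]
alias = list1             # a second name for the same list object
list2 = copy.copy(list1)  # a new list holding the same items

list1.append(4)

print(alias)  # the alias sees the change
print(list2)  # the copy does not
```

Keep in mind that copy.copy() makes a shallow copy: the new object is separate, but any mutable objects stored inside it are still shared with the original. The copy module also provides copy.deepcopy() for cases where the nested objects need to be copied as well.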

This is tricky code, so take a while to work through it and play with different variations of the copying and cropping to make sure you understand the underlying math:

import os
import copy
from pyPdf import PdfFileReader, PdfFileWriter

path = "C:/Real Python/Course materials/Chapter 8/Practice files"
inputFileName = os.path.join(path, "half and half.pdf")
inputFile = PdfFileReader(file(inputFileName, "rb"))
outputPDF = PdfFileWriter()

for pageNum in range(0, inputFile.getNumPages()):
    pageLeft = inputFile.getPage(pageNum)
    pageRight = copy.copy(pageLeft)
    upperRight = pageLeft.mediaBox.upperRight  # get original page corner

    # crop and add left-side page
    pageLeft.mediaBox.upperRight = (upperRight[0]/2, upperRight[1])
    outputPDF.addPage(pageLeft)

    # crop and add right-side page
    pageRight.mediaBox.upperLeft = (upperRight[0]/2, upperRight[1])
    outputPDF.addPage(pageRight)

# write the cropped pages out to a new PDF file
outputFileName = os.path.join(path, "Output/The Little Mermaid.pdf")
outputFile = file(outputFileName, "wb")
outputPDF.write(outputFile)
outputFile.close()

NOTE: PDF files are a bit unusual in how they save page orientation. Depending on how the PDF was originally created, your axes might be switched. For instance, a standard “portrait” document that’s been converted into a landscape PDF might have the x-axis represented vertically while the y-axis is horizontal. Likewise, the corners would all be rotated by 90 degrees; the upperLeft corner would appear on the upper right or the lower left, depending on the file’s rotation. Especially if you’re working with a landscape PDF file, it’s best to do some initial testing to make sure that you are using the correct corners and axes.
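One simple piece of initial testing is to compare the stored width and height of the page rectangle before deciding which coordinate to halve. The sketch below is a rough heuristic, not a pyPdf feature: the function name split_axis is invented, and it assumes the page is displayed in landscape with its two frames side by side, so the longer stored side is the one to split.

```python
def split_axis(lower_left, upper_right):
    """Guess which stored axis to halve: the longer side of the rectangle."""
    width = upper_right[0] - lower_left[0]
    height = upper_right[1] - lower_left[1]
    return "x" if width >= height else "y"


# Stored as landscape: halve the x coordinate, as in the script above.
print(split_axis((0, 0), (792, 612)))

# Stored as portrait but displayed rotated: the y coordinate is the long side.
print(split_axis((0, 0), (612, 792)))
```

Whatever the heuristic suggests, it is still worth cropping a single page first and opening the result to confirm that the correct half was kept.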

Beyond manipulating an already existing PDF, we can also add our own information by merging one PDF page with another. For instance, perhaps we want to automatically add a header or a watermark to every page in a file. I’ve saved an image with a transparent background into a one-page PDF file for this purpose, which we can use as a watermark, combining this image with every page in a PDF file by using the mergePage() method:

import os
from pyPdf import PdfFileReader, PdfFileWriter

path = "C:/Real Python/Course materials/Chapter 8/Practice files"
inputFileName = os.path.join(path, "The Emperor.pdf")
inputFile = PdfFileReader(file(inputFileName, "rb"))
outputPDF = PdfFileWriter()

watermarkFileName = os.path.join(path, "top secret.pdf")
watermarkFile = PdfFileReader(file(watermarkFileName, "rb"))

for pageNum in range(0, inputFile.getNumPages()):
    page = inputFile.getPage(pageNum)
    page.mergePage(watermarkFile.getPage(0))  # add watermark image
    outputPDF.addPage(page)

outputPDF.encrypt("good2Bking")  # add a password to the PDF file
outputFileName = os.path.join(path, "Output/New Suit.pdf")
outputFile = file(outputFileName, "wb")
outputPDF.write(outputFile)
outputFile.close()

Notice that we also secured the output file with basic encryption by supplying the password “good2Bking” through the PdfFileWriter’s encrypt() method. If you know the password used to protect a PDF file, the PdfFileReader has a matching decrypt() method to decrypt an input file that is password protected; this can be incredibly useful as an automation tool if you have many identically encrypted PDFs and don’t want to have to type out a password each time you open one of the files.

Although pyPdf is one of the best and most frequently relied-upon packages for interacting with PDFs in Python, it does have some weaknesses. For instance, there is no way to generate your own PDF files from scratch; instead, you must start with at least a template document. For PDF generation in particular, I suggest researching the ReportLab toolkit, which is also free and open-source. Another popular choice for manipulation of existing PDF files in Python is PDFMiner, which offers slightly different functionality from pyPdf.

There is also a PyPDF2 in the works that aims to handle more difficult PDF files and make some PDF manipulation tasks even easier than in pyPdf. However, as of this writing (March 2014) there is not yet any documentation available on how to use it.

Review exercises:

1. Write a script that opens the file named “Walrus.pdf” from the Chapter 8 practice files; you will need to decrypt the file using the password IamtheWalrus

2. Rotate every page in this input file counter-clockwise by 90 degrees

3. Split each page in half vertically, such that every column appears on its own separate page, and output the results as a new PDF file in the Output folder

In document Real Python Part 1 (Page 121-126)