Working with PDF and Word Documents
Download this PDF from http://nostarch.com/automatestuff/, and enter
the following into the interactive shell:
>>> import PyPDF2
>>> pdfFileObj = open('meetingminutes.pdf', 'rb')
>>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
❶ >>> pdfReader.numPages
19
❷ >>> pageObj = pdfReader.getPage(0)
❸ >>> pageObj.extractText()
'OOFFFFIICCIIAALL BBOOAARRDD MMIINNUUTTEESS Meeting of March 7, 2015
\n The Board of Elementary and Secondary Education shall provide leadership
and create policies for education that expand opportunities for children,
empower families and communities, and advance Louisiana in an increasingly
competitive global market. BOARD of ELEMENTARY and SECONDARY EDUCATION '
First, import the PyPDF2 module. Then open meetingminutes.pdf in read-binary
mode and store it in pdfFileObj. To get a PdfFileReader object that
represents this PDF, call PyPDF2.PdfFileReader() and pass it pdfFileObj.
Store this PdfFileReader object in pdfReader.

The total number of pages in the document is stored in the numPages
attribute of a PdfFileReader object ❶. The example PDF has 19 pages, but
let’s extract text from only the first page.
To extract text from a page, you need to get a Page object, which represents
a single page of a PDF, from a PdfFileReader object. You can get a Page
object by calling the getPage() method ❷ on a PdfFileReader object and
passing it the page number of the page you’re interested in (in our case, 0).

PyPDF2 uses a zero-based index for getting pages: The first page is page 0,
the second is page 1, and so on. This is always the case, even if pages are
numbered differently within the document. For example, say your PDF is
a three-page excerpt from a longer report, and its pages are numbered 42,
43, and 44. To get the first page of this document, you would want to call
pdfReader.getPage(0), not getPage(42) or getPage(1).
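
The gap between printed page numbers and PyPDF2’s indexes can be bridged with
a small conversion. The sketch below uses a hypothetical helper,
printed_to_index(), which is not part of PyPDF2; it assumes the pages of the
excerpt are numbered consecutively.

```python
def printed_to_index(printed_number, first_printed_number):
    """Convert a page number printed in the document to a zero-based index.

    Hypothetical helper (not part of PyPDF2). Assumes the pages are
    numbered consecutively, starting at first_printed_number.
    """
    index = printed_number - first_printed_number
    if index < 0:
        raise ValueError('page %s comes before the first page' % printed_number)
    return index


# For the three-page excerpt numbered 42, 43, and 44:
print(printed_to_index(42, 42))  # 0
print(printed_to_index(44, 42))  # 2
```

The result is the value you would pass to getPage().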
Once you have your Page object, call its extractText() method to return a
string of the page’s text ❸. The text extraction isn’t perfect: The text
Charles E. “Chas” Roemer, President from the PDF is absent from the string
returned by extractText(), and the spacing is sometimes off. Still, this
approximation of the PDF text content may be good enough for your program.
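
If you need the text of every page rather than just the first, the same three
steps can be repeated in a loop. The following is a minimal sketch, not part
of the original example; extract_all_text() is a hypothetical helper name, and
it assumes the PyPDF2 names used in this chapter (numPages, getPage(), and
extractText()).

```python
def extract_all_text(pdf_reader):
    """Return a list with one string of extracted text per page.

    pdf_reader is expected to behave like the PdfFileReader object from
    this chapter: it needs a numPages attribute and a getPage() method.
    """
    page_texts = []
    for page_num in range(pdf_reader.numPages):
        page_obj = pdf_reader.getPage(page_num)
        page_texts.append(page_obj.extractText())
    return page_texts
```

Called with the pdfReader object from the session above, this would return
a list of 19 strings, one for each page of meetingminutes.pdf.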
Decrypting PDFs
Some PDF documents have an encryption feature that will keep them from
being read until whoever is opening the document provides a password.
Enter the following into the interactive shell with the PDF you downloaded,
which has been encrypted with the password rosebud:
>>> import PyPDF2
>>> pdfReader = PyPDF2.PdfFileReader(open('encrypted.pdf', 'rb'))
❶ >>> pdfReader.isEncrypted
True
>>> pdfReader.getPage(0)
❷ Traceback (most recent call last):
  File "<pyshell#173>", line 1, in <module>
    pdfReader.getPage(0)
  --snip--
  File "C:\Python34\lib\site-packages\PyPDF2\pdf.py", line 1173, in getObject
    raise utils.PdfReadError("file has not been decrypted")
PyPDF2.utils.PdfReadError: file has not been decrypted
❸ >>> pdfReader.decrypt('rosebud')
1
>>> pageObj = pdfReader.getPage(0)
All PdfFileReader objects have an isEncrypted attribute that is True if the
PDF is encrypted and False if it isn’t ❶. Any attempt to call a function that
reads the file before it has been decrypted with the correct password will
result in an error ❷.
To read an encrypted PDF, call the decrypt() method and pass the password
as a string ❸. After you call decrypt() with the correct password, you’ll
see that calling getPage() no longer causes an error. If given the wrong
password, the decrypt() method will return 0 and getPage() will continue to
fail.
Note that the decrypt() method decrypts only the PdfFileReader object, not
the actual PDF file. After your program terminates, the file on your hard
drive remains encrypted. Your program will have to call decrypt() again the
next time it is run.
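
Because decrypt() reports a wrong password through its return value rather
than an exception, a program can check that value before trying to read
pages. The sketch below shows one way to do so; unlock() is a hypothetical
helper name, and the code relies only on the isEncrypted attribute and the
decrypt() return value described above.

```python
def unlock(pdf_reader, password):
    """Decrypt pdf_reader if needed; return True if its pages can be read.

    Relies only on the isEncrypted attribute and the decrypt() method
    described above. decrypt() returns 0 when the password is wrong.
    """
    if not pdf_reader.isEncrypted:
        return True  # Nothing to decrypt.
    return pdf_reader.decrypt(password) != 0
```

A program could call unlock(pdfReader, 'rosebud') and go on to getPage()
only when the result is True.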
Creating PDFs
PyPDF2’s counterpart to PdfFileReader objects is PdfFileWriter objects, which
can create new PDF files. But PyPDF2 cannot write arbitrary text to a PDF
like Python can do with plaintext files. Instead, PyPDF2’s PDF-writing
capabilities are limited to copying pages from other PDFs, rotating pages,
overlaying pages, and encrypting files.
PyPDF2 doesn’t allow you to directly edit a PDF. Instead, you have to
create a new PDF and then copy content over from an existing document.
The examples in this section will follow this general approach:
1. Open one or more existing PDFs (the source PDFs) into PdfFileReader
   objects.
2. Create a new PdfFileWriter object.
3. Copy pages from the PdfFileReader objects into the PdfFileWriter object.
4. Finally, use the PdfFileWriter object to write the output PDF.
Creating a PdfFileWriter object creates only a value that represents a
PDF document in Python. It doesn’t create the actual PDF file. For that, you
must call the PdfFileWriter’s write() method.
The write() method takes a regular File object that has been opened in
write-binary mode. You can get such a File object by calling Python’s open()
function with two arguments: the string of what you want the PDF’s filename
to be and 'wb' to indicate the file should be opened in write-binary mode.
If this sounds a little confusing, don’t worry—you’ll see how this works
in the following code examples.
Copying Pages
You can use PyPDF2 to copy pages from one PDF document to another.
This allows you to combine multiple PDF files, cut unwanted pages, or
reorder pages.
Download meetingminutes.pdf and meetingminutes2.pdf from
http://nostarch.com/automatestuff/ and place the PDFs in the current working
directory.
Enter the following into the interactive shell:
>>> import PyPDF2
>>> pdf1File = open('meetingminutes.pdf', 'rb')
>>> pdf2File = open('meetingminutes2.pdf', 'rb')
❶ >>> pdf1Reader = PyPDF2.PdfFileReader(pdf1File)
❷ >>> pdf2Reader = PyPDF2.PdfFileReader(pdf2File)
❸ >>> pdfWriter = PyPDF2.PdfFileWriter()
>>> for pageNum in range(pdf1Reader.numPages):
❹       pageObj = pdf1Reader.getPage(pageNum)
❺       pdfWriter.addPage(pageObj)
>>> for pageNum in range(pdf2Reader.numPages):
❹       pageObj = pdf2Reader.getPage(pageNum)
❺       pdfWriter.addPage(pageObj)
❻ >>> pdfOutputFile = open('combinedminutes.pdf', 'wb')
>>> pdfWriter.write(pdfOutputFile)
>>> pdfOutputFile.close()
>>> pdf1File.close()
>>> pdf2File.close()
Open both PDF files in read-binary mode and store the two resulting File
objects in pdf1File and pdf2File. Call PyPDF2.PdfFileReader() and pass it
pdf1File to get a PdfFileReader object for meetingminutes.pdf ❶. Call it
again and pass it pdf2File to get a PdfFileReader object for
meetingminutes2.pdf ❷. Then create a new PdfFileWriter object, which
represents a blank PDF document ❸.
Next, copy all the pages from the two source PDFs and add them to the
PdfFileWriter object. Get the Page object by calling getPage() on a
PdfFileReader object ❹. Then pass that Page object to your PdfFileWriter’s
addPage() method ❺. These steps are done first for pdf1Reader and then
again for pdf2Reader. When you’re done copying pages, write a new PDF
called combinedminutes.pdf by passing a File object to the PdfFileWriter’s
write() method ❻.
NOTE
PyPDF2 cannot insert pages in the middle of a PdfFileWriter object; the
addPage() method will only add pages to the end.
You have now created a new PDF file that combines the pages from
meetingminutes.pdf and meetingminutes2.pdf into a single document. Remember
that the File object passed to PyPDF2.PdfFileReader() needs to be opened
in read-binary mode by passing 'rb' as the second argument to open().
Likewise, the File object passed to the PdfFileWriter’s write() method needs
to be opened in write-binary mode with 'wb'.
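
The same getPage() and addPage() calls can also cut or reorder pages: since
addPage() always appends, the order in which you add pages becomes the order
of the output. The sketch below is one way to express that; copy_pages() is
a hypothetical helper name, and the reader and writer arguments are assumed
to be PdfFileReader and PdfFileWriter objects like the ones above.

```python
def copy_pages(pdf_reader, pdf_writer, page_indexes):
    """Add the pages at page_indexes (zero-based) to pdf_writer, in order.

    Hypothetical helper: pdf_reader needs numPages and getPage(), and
    pdf_writer needs addPage(), as in the examples in this chapter.
    """
    for index in page_indexes:
        if not 0 <= index < pdf_reader.numPages:
            raise IndexError('no page with index %s' % index)
        pdf_writer.addPage(pdf_reader.getPage(index))
```

For instance, copy_pages(pdf1Reader, pdfWriter, [2, 0, 1]) would add the first
three pages in a new order, and passing range(1, pdf1Reader.numPages) would
copy everything except the first page.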
Rotating Pages
The pages of a PDF can also be rotated in 90-degree increments with the
rotateClockwise() and rotateCounterClockwise() methods. Pass one of the
integers 90, 180, or 270 to these methods. Enter the following into the
interactive shell, with the meetingminutes.pdf file in the current working
directory:
>>> import PyPDF2
>>> minutesFile = open('meetingminutes.pdf', 'rb')
>>> pdfReader = PyPDF2.PdfFileReader(minutesFile)
❶ >>> page = pdfReader.getPage(0)
❷ >>> page.rotateClockwise(90)
{'/Contents': [IndirectObject(961, 0), IndirectObject(962, 0),
--snip--
}
>>> pdfWriter = PyPDF2.PdfFileWriter()
>>> pdfWriter.addPage(page)
❸ >>> resultPdfFile = open('rotatedPage.pdf', 'wb')
>>> pdfWriter.write(resultPdfFile)
>>> resultPdfFile.close()
>>> minutesFile.close()
Here we use getPage(0) to select the first page of the PDF ❶, and then
we call rotateClockwise(90) on that page ❷. We write a new PDF with the
rotated page and save it as rotatedPage.pdf ❸.
The resulting PDF will have one page, rotated 90 degrees clockwise, as in
Figure 13-2. The return values from rotateClockwise() and
rotateCounterClockwise() contain a lot of information that you can ignore.
Figure 13-2: The rotatedPage.pdf file with the page
rotated 90 degrees clockwise
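
The example rotates a single page, but the same call works inside a loop when
a whole document was scanned sideways. The following is a minimal sketch
under that assumption; rotate_all_pages() is a hypothetical helper name, and
it uses only the rotateClockwise(), getPage(), and addPage() methods shown in
this chapter.

```python
def rotate_all_pages(pdf_reader, pdf_writer, degrees):
    """Rotate every page clockwise by degrees and add it to pdf_writer.

    Hypothetical helper. degrees must be 90, 180, or 270, matching the
    values the chapter says rotateClockwise() accepts.
    """
    if degrees not in (90, 180, 270):
        raise ValueError('degrees must be 90, 180, or 270')
    for page_num in range(pdf_reader.numPages):
        page = pdf_reader.getPage(page_num)
        page.rotateClockwise(degrees)  # The return value can be ignored.
        pdf_writer.addPage(page)
```

After calling it, you would still open an output file in 'wb' mode and pass it
to the writer’s write() method, as in the session above.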
Overlaying Pages
PyPDF2 can also overlay the contents of one page over another, which is
useful for adding a logo, timestamp, or watermark to a page. With Python,
it’s easy to add watermarks to multiple files and only to pages your program
specifies.
Download watermark.pdf from http://nostarch.com/automatestuff/ and place
the PDF in the current working directory along with meetingminutes.pdf. Then
enter the following into the interactive shell:
>>> import PyPDF2
>>> minutesFile = open('meetingminutes.pdf', 'rb')
❶ >>> pdfReader = PyPDF2.PdfFileReader(minutesFile)
❷ >>> minutesFirstPage = pdfReader.getPage(0)
❸ >>> pdfWatermarkReader = PyPDF2.PdfFileReader(open('watermark.pdf', 'rb'))
❹ >>> minutesFirstPage.mergePage(pdfWatermarkReader.getPage(0))
❺ >>> pdfWriter = PyPDF2.PdfFileWriter()
❻ >>> pdfWriter.addPage(minutesFirstPage)
❼ >>> for pageNum in range(1, pdfReader.numPages):
          pageObj = pdfReader.getPage(pageNum)
          pdfWriter.addPage(pageObj)
>>> resultPdfFile = open('watermarkedCover.pdf', 'wb')
>>> pdfWriter.write(resultPdfFile)
>>> minutesFile.close()
>>> resultPdfFile.close()
Here we make a PdfFileReader object of meetingminutes.pdf ❶. We call
getPage(0) to get a Page object for the first page and store this object in
minutesFirstPage ❷. We then make a PdfFileReader object for watermark.pdf ❸
and call mergePage() on minutesFirstPage ❹. The argument we pass to
mergePage() is a Page object for the first page of watermark.pdf.
Now that we’ve called mergePage() on minutesFirstPage, minutesFirstPage
represents the watermarked first page. We make a PdfFileWriter object ❺
and add the watermarked first page ❻. Then we loop through the rest of
the pages in meetingminutes.pdf and add them to the PdfFileWriter object ❼.
Finally, we open a new PDF called watermarkedCover.pdf and write the
contents of the PdfFileWriter to the new PDF.
Figure 13-3 shows the results. Our new PDF, watermarkedCover.pdf, has
all the contents of meetingminutes.pdf, and the first page is watermarked.
Figure 13-3: The original PDF (left), the watermark PDF (center), and the merged PDF (right)
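
To watermark only the pages your program specifies, the merge step can be
made conditional inside the copy loop. The sketch below shows one possible
arrangement; watermark_pages() and its parameter names are hypothetical, and
it assumes mergePage() changes the page it is called on, as it does for
minutesFirstPage in the example above.

```python
def watermark_pages(pdf_reader, watermark_page, pdf_writer, indexes_to_mark):
    """Copy every page to pdf_writer, overlaying watermark_page on some.

    Hypothetical helper. indexes_to_mark holds the zero-based indexes of
    the pages that should receive the watermark.
    """
    indexes_to_mark = set(indexes_to_mark)
    for page_num in range(pdf_reader.numPages):
        page = pdf_reader.getPage(page_num)
        if page_num in indexes_to_mark:
            page.mergePage(watermark_page)  # Modifies page in place.
        pdf_writer.addPage(page)
```

Passing [0] as indexes_to_mark reproduces the example above, where only the
first page is watermarked.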
Encrypting PDFs
A PdfFileWriter object can also add encryption to a PDF document. Enter
the following into the interactive shell:
>>> import PyPDF2
>>> pdfFile = open('meetingminutes.pdf', 'rb')
>>> pdfReader = PyPDF2.PdfFileReader(pdfFile)
>>> pdfWriter = PyPDF2.PdfFileWriter()
>>> for pageNum in range(pdfReader.numPages):
        pdfWriter.addPage(pdfReader.getPage(pageNum))
❶ >>> pdfWriter.encrypt('swordfish')
>>> resultPdf = open('encryptedminutes.pdf', 'wb')
>>> pdfWriter.write(resultPdf)
>>> resultPdf.close()
Before calling the write() method to save to a file, call the encrypt()
method and pass it a password string ❶. PDFs can have a user password
(allowing you to view the PDF) and an owner password (allowing you to set
permissions for printing, commenting, extracting text, and other features).
The user password and owner password are the first and second arguments
to encrypt(), respectively. If only one string argument is passed to
encrypt(), it will be used for both passwords.
In this example, we copied the pages of meetingminutes.pdf to a
PdfFileWriter object. We encrypted the PdfFileWriter with the password
swordfish, opened a new PDF called encryptedminutes.pdf, and wrote the
contents of the PdfFileWriter to the new PDF. Before anyone can view
encryptedminutes.pdf, they’ll have to enter this password. You may want to
delete the original, unencrypted meetingminutes.pdf file after ensuring its
copy was correctly encrypted.
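
The example sets a single password, but encrypt() also accepts a separate
owner password as its second argument. The following sketch wraps the copy
and encrypt steps so either form can be used; encrypt_copy() is a
hypothetical helper name, and the reader and writer are assumed to be
PdfFileReader and PdfFileWriter objects as above.

```python
def encrypt_copy(pdf_reader, pdf_writer, user_password, owner_password=None):
    """Copy all pages to pdf_writer, then encrypt the writer.

    Hypothetical helper. With one password, it is used as both the user
    and owner password; with two, they are passed in that order.
    """
    for page_num in range(pdf_reader.numPages):
        pdf_writer.addPage(pdf_reader.getPage(page_num))
    if owner_password is None:
        pdf_writer.encrypt(user_password)
    else:
        pdf_writer.encrypt(user_password, owner_password)
```

As in the session above, nothing is saved until you pass a File object opened
in 'wb' mode to the writer’s write() method.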
Project: Combining Select Pages from Many PDFs
Say you have the boring job of merging several dozen PDF documents into
a single PDF file. Each of them has a cover sheet as the first page, but you
don’t want the cover sheet repeated in the final result. Even though there
are lots of free programs for combining PDFs, many of them simply merge
entire files together. Let’s write a Python program to customize which pages
you want in the combined PDF.
At a high level, here’s what the program will do:
• Find all PDF files in the current working directory.
• Sort the filenames so the PDFs are added in order.
• Write each page, excluding the first page, of each PDF to the
output file.
In terms of implementation, your code will need to do the following:
• Call os.listdir() to find all the files in the working directory and
  remove any non-PDF files.
• Call Python’s sort() list method to alphabetize the filenames.
• Create a PdfFileWriter object for the output PDF.
• Loop over each PDF file, creating a PdfFileReader object for it.
• Loop over each page (except the first) in each PDF file.
• Add the pages to the output PDF.
• Write the output PDF to a file named allminutes.pdf.
For this project, open a new file editor window and save it as
combinePdfs.py.
Step 1: Find All PDF Files
First, your program needs to get a list of all files with the .pdf extension in
the current working directory and sort them. Make your code look like the
following:
#! python3
# combinePdfs.py - Combines all the PDFs in the current working directory into
# a single PDF.

❶ import PyPDF2, os

# Get all the PDF filenames.
pdfFiles = []
for filename in os.listdir('.'):
    if filename.endswith('.pdf'):
❷       pdfFiles.append(filename)
❸ pdfFiles.sort(key=str.lower)

❹ pdfWriter = PyPDF2.PdfFileWriter()

# TODO: Loop through all the PDF files.

# TODO: Loop through all the pages (except the first) and add them.

# TODO: Save the resulting PDF to a file.
After the shebang line and the descriptive comment about what the program
does, this code imports the os and PyPDF2 modules ❶. The os.listdir('.')
call will return a list of every file in the current working directory. The
code loops over this list and adds only those files with the .pdf extension
to pdfFiles ❷. Afterward, this list is sorted in alphabetical order with the
key=str.lower keyword argument to sort() ❸.

A PdfFileWriter object is created to hold the combined PDF pages ❹.
Finally, a few comments outline the rest of the program.
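
The filename-gathering step can be tried on its own, without PyPDF2. The
sketch below packages it as a function; find_pdfs() is a hypothetical name,
and matching the extension without regard to case (so that REPORT.PDF is
also found) is an addition that goes beyond the program in this project.

```python
import os


def find_pdfs(directory='.'):
    """Return the PDF filenames in directory, sorted without regard to case.

    Hypothetical helper; unlike the project code, it also accepts
    extensions written in uppercase, such as .PDF.
    """
    pdf_files = []
    for filename in os.listdir(directory):
        # lower() lets names such as REPORT.PDF match as well.
        if filename.lower().endswith('.pdf'):
            pdf_files.append(filename)
    # key=str.lower keeps 'Zebra.pdf' from sorting ahead of 'apple.pdf'.
    pdf_files.sort(key=str.lower)
    return pdf_files
```

Without the key argument, sort() would place every filename that starts with
an uppercase letter ahead of those that start with a lowercase letter.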
Step 2: Open Each PDF
Now the program must read each PDF file in pdfFiles. Add the following to
your program:
#! python3
# combinePdfs.py - Combines all the PDFs in the current working directory into
# a single PDF.

import PyPDF2, os

# Get all the PDF filenames.
pdfFiles = []
--snip--

# Loop through all the PDF files.
for filename in pdfFiles:
    pdfFileObj = open(filename, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # TODO: Loop through all the pages (except the first) and add them.

# TODO: Save the resulting PDF to a file.
For each PDF, the loop opens a filename in read-binary mode by calling
open() with 'rb' as the second argument. The open() call returns a File
object, which gets passed to PyPDF2.PdfFileReader() to create a
PdfFileReader object for that PDF file.
Step 3: Add Each Page
For each PDF, you’ll want to loop over every page except the first. Add this
code to your program:
#! python3
# combinePdfs.py - Combines all the PDFs in the current working directory into
# a single PDF.

import PyPDF2, os

--snip--

# Loop through all the PDF files.
for filename in pdfFiles:
--snip--
    # Loop through all the pages (except the first) and add them.
❶   for pageNum in range(1, pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        pdfWriter.addPage(pageObj)

# TODO: Save the resulting PDF to a file.
The code inside the for loop copies each Page object individually to the
PdfFileWriter object. Remember, you want to skip the first page. Since
PyPDF2 considers 0 to be the first page, your loop should start at 1 ❶ and
then go up to, but not include, the integer in pdfReader.numPages.
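
You can check which indexes such a loop visits without opening a PDF at all.
The short sketch below uses a hypothetical helper, pages_after_cover(), that
simply wraps the same range() call.

```python
def pages_after_cover(num_pages):
    """Return the zero-based indexes of every page except the first."""
    return list(range(1, num_pages))


print(pages_after_cover(4))  # [1, 2, 3]
print(pages_after_cover(1))  # [] (a one-page PDF contributes nothing)
```

The second call shows an edge case worth knowing about: a PDF that consists
only of a cover sheet adds no pages to the combined file.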
Step 4: Save the Results
After these nested for loops are done looping, the pdfWriter variable will
contain a PdfFileWriter object with the pages for all the PDFs combined. The
last step is to write this content to a file on the hard drive. Add this code
to your program:
#! python3
# combinePdfs.py - Combines all the PDFs in the current working directory into
# a single PDF.

import PyPDF2, os

--snip--

# Loop through all the PDF files.
for filename in pdfFiles:
--snip--
    # Loop through all the pages (except the first) and add them.
    for pageNum in range(1, pdfReader.numPages):
    --snip--

# Save the resulting PDF to a file.
pdfOutput = open('allminutes.pdf', 'wb')
pdfWriter.write(pdfOutput)
pdfOutput.close()
Passing 'wb' to open() opens the output PDF file, allminutes.pdf, in
write-binary mode. Then, passing the resulting File object to the write()
method creates the actual PDF file. A call to the close() method finishes
the program.
Ideas for Similar Programs
Being able to create PDFs from the pages of other PDFs will let you make
programs that can do the following:
• Cut out specific pages from PDFs.
• Reorder pages in a PDF.
• Create a PDF from only those pages that have some specific text,
  identified by extractText().
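
As a starting point for the last idea, the sketch below collects the indexes
of the pages whose extracted text contains a given string. The helper name
pages_containing() is hypothetical, and because extractText() is only an
approximation of the page’s text, some matches may be missed.

```python
def pages_containing(pdf_reader, search_text):
    """Return zero-based indexes of pages whose text contains search_text.

    Hypothetical helper; pdf_reader needs numPages and getPage(), and
    each page needs extractText(), as described in this chapter.
    """
    matches = []
    for page_num in range(pdf_reader.numPages):
        if search_text in pdf_reader.getPage(page_num).extractText():
            matches.append(page_num)
    return matches
```

Each index in the returned list could then be passed to getPage(), and the
resulting Page objects added to a PdfFileWriter with addPage().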
Word Documents
Python can create and modify Word documents, which have the .docx file
extension, with the python-docx module. You can install the module by
running pip install python-docx. (Appendix A has full details on installing
third-party modules.)
NOTE
When using pip to first install Python-Docx, be sure to install python-docx,
not docx. The installation name docx is for a different module that this
book does not cover. However, when you are going to import the python-docx
module, you’ll need to run import docx, not import python-docx.
If you don’t have Word, LibreOffice Writer and OpenOffice Writer are
both free alternative applications for Windows, OS X, and Linux that can be
used to open .docx files. You can download them from
https://www.libreoffice.org and http://openoffice.org, respectively. The full
documentation for Python-Docx is available at
https://python-docx.readthedocs.org/. Although there is a version of Word
for OS X, this chapter will focus on Word for Windows.