How to Automate PDF Batch Processing with pypdf in Python
Stop manually wrestling with PDF editors. Learn how to use the pure Python `pypdf` library to merge documents, extract text, add encryption, and automate repetitive PDF tasks in under 10 seconds.

Last week, I helped a colleague process a batch of contract files: 30 PDFs needed to be merged into a single book, the watermark on page 5 was misaligned and needed replacement, and finally, an open password had to be added to the finished document. Manually editing with a GUI PDF editor took nearly an hour—and if another batch came in tomorrow, I'd have to start all over again.
Then I pulled out pypdf, a pure Python PDF processing library. The exact same requirements took under 10 seconds to run once the script was written. In this practical guide, I'll walk you step-by-step through the most common PDF automation operations. By the end, you'll be ready to apply these scripts directly to your daily workflow.
Prerequisites
- Python 3.7+ at minimum for older `pypdf` releases; recent releases require a newer interpreter, so check the supported versions listed on PyPI.
- Basic Python knowledge: writing functions, using `for` loops, and handling path strings is more than enough.
- No Java backend or heavy dependencies required. Pure Python scripts only.
Installing pypdf
Open your terminal or command prompt and install with one line:
```bash
pip install pypdf
```
If you need to work with AES-encrypted PDFs (encrypting or decrypting), install the optional crypto extra. The quotes keep shells such as zsh from interpreting the square brackets:
```bash
pip install "pypdf[crypto]"
```
Verify the installation in Python:
```python
import pypdf

print(pypdf.__version__)
```
If a version number is printed, you're all set.
Quick Start: Three Core Operations
1. Read a PDF & Extract Text
This is the most common use case: pulling text out of a PDF for analysis, search, or archiving.
```python
from pypdf import PdfReader

reader = PdfReader("contract_template.pdf")

# Get total page count
print(f"Total pages: {len(reader.pages)}")

# Extract text from page 1
page = reader.pages[0]
text = page.extract_text()
print(text[:500])  # Print first 500 characters
```
Why write it this way? `PdfReader` opens the file and parses its structure, and `reader.pages` behaves like an indexed list, so `reader.pages[0]` is page 1. `extract_text()` parses the page's content stream and returns whatever characters it can recognize. Note: scanned, image-only PDFs have no text layer, so `extract_text()` returns little or nothing for them; for those cases, pair this with an OCR tool such as `pytesseract`.
2. Merge Multiple PDFs
```python
from pypdf import PdfReader, PdfWriter

writer = PdfWriter()

# Merge files in order
for filename in ["cover.pdf", "body.pdf", "appendix.pdf"]:
    reader = PdfReader(filename)
    for page in reader.pages:
        writer.add_page(page)

with open("merged_output.pdf", "wb") as f:
    writer.write(f)
```
Think of `PdfWriter` as a blank canvas. You add each source page to it in order, then write the whole document to disk in one pass. Because the output file is written only once, you avoid rewriting a growing intermediate file after every merge step.
3. Add an Open Password
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("unencrypted.pdf")
writer = PdfWriter()

# Copy all pages to the writer
for page in reader.pages:
    writer.add_page(page)

# Set the user (open) password; AES-256 needs the pypdf[crypto] extra
writer.encrypt("your_secret_password", algorithm="AES-256")

with open("encrypted_output.pdf", "wb") as f:
    writer.write(f)
```
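The `encrypt()` method handles the protection. Once applied, opening the PDF prompts for the password. Important: when no `algorithm` argument is given, `encrypt()` falls back to the older RC4 cipher. Passing `algorithm="AES-256"` selects modern AES encryption, which requires the `pypdf[crypto]` extra to be installed.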
Real-World Scenario: Batch Processing Meeting Notes
Suppose you have 10 separate meeting minutes PDFs named meeting_2024_01.pdf through meeting_2024_10.pdf. You need to:
- Merge them into a single master file
- Apply password protection
Here's a complete script you can adapt:
```python
import os

from pypdf import PdfReader, PdfWriter


def process_meeting_pdfs(input_dir: str, output_path: str, password: str) -> None:
    writer = PdfWriter()

    # 1. Load files in alphabetical order (case-insensitive extension match)
    files = sorted(f for f in os.listdir(input_dir) if f.lower().endswith(".pdf"))
    for fname in files:
        reader = PdfReader(os.path.join(input_dir, fname))
        for page in reader.pages:
            writer.add_page(page)

    # 2. Apply password protection (AES-256 needs the pypdf[crypto] extra)
    writer.encrypt(password, algorithm="AES-256")

    # 3. Write to disk
    with open(output_path, "wb") as f:
        writer.write(f)

    print(f"Done! Merged {len(files)} files, saved to {output_path}")


# Example usage
process_meeting_pdfs(
    input_dir="./meetings",
    output_path="./meetings_combined.pdf",
    password="Meeting2024Secure!",
)
```
Drop all your PDFs into the ./meetings folder, run the script, and you'll get a secure, merged document instantly.
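Under the hood: `sorted()` guarantees a predictable merge order, since `os.listdir()` returns entries in arbitrary order. The zero-padded names (`_01` through `_10`) are what make alphabetical order match chronological order. `encrypt()` must be called before `write()` so the security settings apply to the final output.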
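One caveat on ordering: plain `sorted()` compares strings, so it only gives the order you expect when the numbers in the filenames are zero-padded. If your files are named `meeting_1.pdf`, `meeting_2.pdf`, `meeting_10.pdf`, a numeric-aware sort key helps. This is a small standard-library sketch; `natural_sort_key` is a name I made up for it.

```python
import re


def natural_sort_key(filename):
    """Sort key that compares digit runs as numbers (illustrative helper)."""
    # Splitting on a captured group alternates text, digits, text, ...
    parts = re.split(r"(\d+)", filename.lower())
    return [int(part) if i % 2 else part for i, part in enumerate(parts)]


names = ["meeting_10.pdf", "meeting_2.pdf", "meeting_1.pdf"]
print(sorted(names))
# ['meeting_1.pdf', 'meeting_10.pdf', 'meeting_2.pdf']
print(sorted(names, key=natural_sort_key))
# ['meeting_1.pdf', 'meeting_2.pdf', 'meeting_10.pdf']
```

To use it in the script, pass `key=natural_sort_key` to the `sorted()` call that builds `files`.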
Common Pitfalls & FAQs
- Extracted text is garbled or empty? The PDF is likely a scan (image-based with no text layer). Convert pages to images using `pdf2image`, then run `pytesseract` for OCR.
- Merged file size is unexpectedly large? Check if the source PDFs contain high-resolution, uncompressed images. `pypdf` does not recompress content by default; it performs page-level stitching.
- Password isn't working? Confirm you installed `pypdf[crypto]`. Additionally, some older PDF readers have inconsistent AES support. Test with a different viewer if needed.
- Pages rotated after merging? PDFs store per-page `/Rotate` attributes that are preserved during merges. If orientation is inconsistent, normalize it using `page.rotate(90)` before adding to the writer.
Summary & Next Steps
Today, we covered three core operations with `pypdf`: text extraction → merging files → password encryption. You'll quickly realize that repetitive GUI clicks can be replaced with a single script execution. Next time you face a batch of documents, just update the paths and run one command.
Where to go from here:
- Integrate `pdfplumber` to extract structured table data from PDFs.
- Use `pypdf`'s cropping utilities to automatically strip headers/footers from large datasets.
- Wrap your script in a FastAPI endpoint so your team can upload files via a simple web UI for automated processing.
With over 10k GitHub stars and an active maintainer community, pypdf is highly reliable. For edge cases, the [pypdf] tag on Stack Overflow contains numerous high-quality discussions.
Have questions or run into a unique use case? Drop a comment below, and let's streamline our office automation workflows together.