How to Build a Local OCR Pipeline from Scratch with Surya
A step-by-step guide to deploying Surya locally for document OCR. Covers installation, inference backend configuration (GPU/CPU/Mac), CLI tricks, and a complete Python pipeline for layout analysis, text extraction, and table digitization.

Building a Local Document OCR Pipeline with Surya: From Installation to Table Extraction
1. The Problem You Might Be Facing
As a backend developer, you've probably run into this scenario more than once: stakeholders dump a pile of scanned PDFs, contract images, and invoice photos on you and ask you to extract the text and tables into a database. The traditional options are either calling paid OCR APIs (costs add up quickly, and you have to upload sensitive data) or wrestling with Tesseract (accuracy is questionable, and parameter tweaking is tedious).
Today, I'll show you how to solve this entirely locally using an open-source project called Surya. It is a 650-million-parameter model that supports OCR in 91 languages and also handles document layout analysis, reading-order sorting, and table recognition. Running on an RTX 5090, it hits 5 pages per second. No GPU? It also runs on CPU or Apple Silicon via the llama.cpp backend.
By the end of this guide, you will:
- Deploy a complete local OCR service.
- Use the Python API to recognize text from images/PDFs.
- Extract tables from documents and export them as HTML.
- Troubleshoot common issues independently.
2. Prerequisites
Before we begin, ensure your environment meets these requirements:
| Requirement | Details |
|---|---|
| Python | 3.9+ (3.10 recommended) |
| Inference Backend | NVIDIA GPU: Requires Docker + NVIDIA Container Toolkit. CPU/Apple Silicon: Requires llama.cpp |
| Memory | GPU inference: 8GB+ VRAM recommended. CPU inference: 16GB+ RAM recommended. |
| Disk | ~1.3GB for automatic model weight download on first run. |
Why do you need an inference backend?
Surya's core is a Vision-Language Model (VLM). OCR, layout analysis, and table recognition are all handled by the same model. It doesn't run purely in Python; it requires an inference server. The good news? You don't need to deploy it manually. SuryaInferenceManager automatically spins it up on your first API call.
3. Quick Start: Get It Running in 5 Steps
Step 1: Install surya-ocr
bash
pip install surya-ocr
That's it for the Python side: pip pulls in the package's Python dependencies automatically. The inference backend is set up separately in Step 2.
Step 2: Configure the Inference Backend
For NVIDIA GPU users:
Ensure Docker and the NVIDIA Container Toolkit are installed. Verify with:
bash
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
If you see your GPU info, you're good to go.
For CPU or Apple Silicon (Mac) users:
You need the llama.cpp server binary. On macOS, simply run:
bash
brew install llama.cpp
Linux users should download the appropriate package from the llama.cpp releases and ensure llama-server is in your system's PATH.
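If you'd rather check the prerequisites from Python before the first run, here is a minimal sketch. It only looks for the docker and llama-server executables on your PATH. The helper names are made up for this guide, and the "Docker present means vllm" rule is a rough heuristic of mine, not how Surya itself picks a backend.

```python
import shutil


def detect_backend_tools():
    """Report which backend prerequisites are discoverable on PATH."""
    return {
        "docker": shutil.which("docker") is not None,
        "llama-server": shutil.which("llama-server") is not None,
    }


def suggest_backend(tools):
    """Heuristic only: finding Docker does not prove a GPU is usable."""
    if tools.get("docker"):
        return "vllm"
    if tools.get("llama-server"):
        return "llamacpp"
    return None


if __name__ == "__main__":
    tools = detect_backend_tools()
    print("Found:", tools)
    print("Suggested backend:", suggest_backend(tools))
```

A result of None means neither tool was found, so the inference server has nothing to start from.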
Step 3: Verify the Installation
Grab a test image with text (e.g., test.jpg) and run a quick check:
bash
surya_ocr /path/to/your/test.jpg
On the first run, Surya will:
- Automatically download model weights (~1.3GB, please be patient).
- Start the inference server (vllm or llama.cpp).
- Execute OCR and output results.json.
Pro Tip: Reuse the Inference Server
By default, CLI commands start and stop the server every time, which means reloading the model on each run. Add the --keep_server flag to keep it alive and reuse it for subsequent commands:
bash
surya_ocr test.jpg --keep_server # Start and keep server running
surya_layout test2.jpg # Reuses instantly, results in seconds
surya_table test3.pdf --keep_server # Keeps going
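If you want to script the same trick over many files, a small wrapper around the CLI might look like this. It is a sketch that assumes the surya_ocr command and the --keep_server flag behave as described above; build_commands and run_all are hypothetical helper names, not part of Surya.

```python
import subprocess


def build_commands(paths, tool="surya_ocr", keep_server=True):
    """Build one argv list per input file, optionally keeping the server alive."""
    commands = []
    for path in paths:
        argv = [tool, str(path)]
        if keep_server:
            argv.append("--keep_server")
        commands.append(argv)
    return commands


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    for argv in build_commands(["test.jpg", "test2.jpg"]):
        print(" ".join(argv))
```

Printing the commands first is a cheap dry run before you hand them to run_all.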
4. Real-World: Building a Digitization Pipeline with Python API
CLI commands are great for quick tests, but integrating into production requires the Python API. Below is a complete script that implements: Input Scanned Image → OCR Text Extraction → Layout Recognition → Table Extraction → Structured Output.
python
from PIL import Image
from surya.inference import SuryaInferenceManager
from surya.recognition import RecognitionPredictor
from surya.layout import LayoutPredictor
from surya.table_rec import TableRecPredictor
import json

# 1. Initialize inference manager (auto-selects vllm or llama.cpp)
manager = SuryaInferenceManager()

# 2. Load test image
image_path = "scanned_invoice.jpg"  # Replace with your file path
image = Image.open(image_path)

# 3. Run layout analysis first (identifies tables, body text, etc.)
layout_predictor = LayoutPredictor(manager)
layouts = layout_predictor([image])
print("=== Layout Analysis Results ===")
for page in layouts:
    for block in page["blocks"]:
        print(f"Type: {block['label']}, Reading Order: {block['reading_order']}, Confidence: {block['confidence']:.2f}")

# 4. Fine-grained OCR based on layout (more accurate than full-page OCR)
recognition_predictor = RecognitionPredictor(manager)
# Passing layout_results automatically enables block mode
predictions = recognition_predictor([image], layouts)
print("\n=== OCR Text Extraction ===")
for page in predictions:
    for block in page["blocks"]:
        if not block["skipped"]:
            print(f"[{block['label']}] {block['html'][:100]}...")

# 5. Extract tables (if any)
table_predictor = TableRecPredictor(manager)
table_results = table_predictor.predict_full([image])  # predict_full outputs complete HTML
print("\n=== Table Extraction Results ===")
if table_results and not table_results[0]["error"]:
    for tbl in table_results:
        print(f"Found table: {len(tbl['rows'])} rows x {len(tbl['cols'])} columns")
        # Output standard HTML table
        print(tbl["html"][:200], "...")
    # Save as JSON for downstream processing
    with open("table_output.json", "w", encoding="utf-8") as f:
        json.dump(table_results, f, ensure_ascii=False, indent=2)
    print("Table results saved to table_output.json")
else:
    print("No tables detected")
Code Breakdown
Why layout first, then OCR?
Surya supports two OCR modes:
- Full-page mode: One VLM call processes the entire page. Fast, but might miss fine details.
- Block mode: Runs layout analysis to locate text blocks first, then OCRs each block individually. Much higher accuracy.
When you pass the layout_results parameter, RecognitionPredictor automatically switches to block mode. For structured documents like contracts or invoices, block mode is highly recommended.
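Once you have block-mode results, you usually want the blocks stitched back together in reading order. The sketch below works on plain dictionaries shaped like the ones printed in the script above (label, html, skipped). It assumes each OCR block also carries a reading_order field like the layout blocks do; the sample data and label values are invented for illustration.

```python
def blocks_in_reading_order(page):
    """Drop skipped blocks and sort the rest by reading order."""
    kept = [b for b in page["blocks"] if not b.get("skipped", False)]
    return sorted(kept, key=lambda b: b.get("reading_order", 0))


def page_to_html(page):
    """Join the HTML of every kept block, one block per line."""
    return "\n".join(b["html"] for b in blocks_in_reading_order(page))


# Hypothetical page result, only for demonstration
sample_page = {
    "blocks": [
        {"label": "Text", "reading_order": 1, "html": "<p>Total: 42</p>", "skipped": False},
        {"label": "Title", "reading_order": 0, "html": "<h1>Invoice</h1>", "skipped": False},
        {"label": "Picture", "reading_order": 2, "html": "", "skipped": True},
    ]
}

if __name__ == "__main__":
    print(page_to_html(sample_page))
```

If your results don't include reading_order on OCR blocks, you can copy it over from the matching layout block before sorting.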
predict_full vs Default Call
By default, TableRecPredictor only returns row/column geometry (simple mode). Calling .predict_full() outputs a complete <table> HTML structure, including merged cells and rowspan/colspan headers. If your downstream pipeline expects standard tabular formats, always use predict_full.
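To get from a table's HTML string to rows you can insert into a database, the standard library is enough for simple cases. This is a minimal sketch using html.parser on a made-up sample string. It collects the text of each cell row by row and does not expand rowspan/colspan, so merged cells appear once rather than being repeated.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of <td>/<th> cells, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented sample, not real Surya output
sample = '<table><tr><th colspan="2">Item</th></tr><tr><td>Paper</td><td>3.50</td></tr></table>'

if __name__ == "__main__":
    for row in html_table_to_rows(sample):
        print(row)
```

For tables with heavy cell merging you would need to track the span attributes as well, which this sketch leaves out on purpose.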
5. Common Pitfalls & Troubleshooting
1. Inaccurate OCR Results?
Check image resolution first:
- Text is too small → Increase DPI or upscale the image.
- Image is too large (>2048px width) → Downscale it. The model can get confused by overly high resolutions.
- Old/scanned documents (blurry, skewed) → Apply preprocessing first (binarization, denoising, deskewing).
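The downscaling step is easy to automate. Here is a small sketch that caps the width at 2048px (the threshold mentioned above) while keeping the aspect ratio. The function names are my own, and Pillow is imported lazily so the size calculation can be tried without it.

```python
def target_size(width, height, max_width=2048):
    """Return the (width, height) to resize to, preserving aspect ratio."""
    if width <= max_width:
        return width, height
    scale = max_width / width
    return max_width, max(1, round(height * scale))


def downscale_if_needed(image, max_width=2048):
    """Resize a PIL image only when it is wider than max_width."""
    from PIL import Image  # requires Pillow, already used by the pipeline script

    new_size = target_size(image.width, image.height, max_width)
    if new_size == image.size:
        return image
    return image.resize(new_size, Image.LANCZOS)


if __name__ == "__main__":
    print(target_size(4096, 3000))
    print(target_size(1000, 500))
```

Call downscale_if_needed(image) right after Image.open(...) and pass the result to the predictors instead of the original image.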
2. Inference Server Fails to Start
Common error: Backend connection timeout. Troubleshooting steps:
- GPU users: Verify Docker can access the GPU (nvidia-smi test).
- CPU/Mac users: Ensure llama-server is in your PATH (run llama-server --help in a terminal to verify).
- Manually specify backend:
bash
export SURYA_INFERENCE_BACKEND=vllm # or llamacpp
export SURYA_INFERENCE_URL=http://localhost:8000/v1 # Point to an existing service
3. Out of Memory / VRAM
Lower the concurrency:
bash
export SURYA_INFERENCE_PARALLEL=4 # Default is 8, lowering reduces peak memory usage
GPU users can also lower the DPI (default 192, try 96):
python
import os
os.environ["SURYA_OCR_DPI"] = "96"
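If you tune several of these settings at once, it can be tidier to set them in one place. The sketch below writes the environment variables named in this section into os.environ. apply_surya_env is a hypothetical helper, and I'm assuming the variables need to be set before the inference manager is created so that they are picked up.

```python
import os


def apply_surya_env(parallel=None, dpi=None, backend=None, url=None):
    """Set only the Surya environment variables that were given a value."""
    settings = {
        "SURYA_INFERENCE_PARALLEL": parallel,
        "SURYA_OCR_DPI": dpi,
        "SURYA_INFERENCE_BACKEND": backend,
        "SURYA_INFERENCE_URL": url,
    }
    applied = {}
    for name, value in settings.items():
        if value is not None:
            os.environ[name] = str(value)
            applied[name] = str(value)
    return applied


# Example: the low-memory settings suggested above
applied = apply_surya_env(parallel=4, dpi=96)

if __name__ == "__main__":
    print(applied)
```

Values left as None are not touched, so anything you already exported in your shell stays in effect.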
4. Suboptimal Chinese Recognition?
Surya achieves an 87.2% pass rate across 91 languages, with Chinese specifically at 82.5%. For mixed Chinese-English documents:
- Ensure high image clarity.
- Use block mode instead of full-page mode.
- For highly specialized layouts, consider reaching out to the project maintainers for fine-tuning options.
6. Summary & Next Steps
Today, we walked through a complete local OCR pipeline:
- pip install surya-ocr for one-click installation.
- Configure the inference backend based on your hardware (GPU → vllm/Docker, CPU/Mac → llama.cpp).
- Use --keep_server to reuse the server and avoid repeated model loading.
- Build a full pipeline with the Python API: Layout → OCR → Table Extraction.
- Master optimization tricks like DPI tuning and concurrency control.
What to try next:
- Batch processing: surya_ocr /path/to/folder --page_range 0-5 to specify page ranges.
- Run the official Streamlit UI: pip install streamlit pdftext && surya_gui
- Explore math formula recognition: Surya 2 automatically outputs equations as KaTeX-compatible LaTeX wrapped in <math>...</math>.
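If you want to pull those equations out of a block's HTML for separate handling, a regular expression covers the simple case. This sketch assumes the equations arrive as LaTeX between <math> and </math> tags as described above. The sample string and its display attribute are invented.

```python
import re

# Matches <math ...>LaTeX</math>, non-greedy, across line breaks
MATH_TAG = re.compile(r"<math\b[^>]*>(.*?)</math>", re.DOTALL)


def extract_equations(html):
    """Return the LaTeX source of every <math> element in the HTML string."""
    return [latex.strip() for latex in MATH_TAG.findall(html)]


# Hypothetical block HTML, only for demonstration
sample_html = r'<p>Energy: <math>E = mc^2</math> and <math display="block">\frac{a}{b}</math></p>'

if __name__ == "__main__":
    for equation in extract_equations(sample_html):
        print(equation)
```

A regex is fine for flat tags like these. If the output ever nests <math> elements, switch to a real HTML parser.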
Production Integration Tips:
- Licensing: Code is Apache 2.0. Model weights are free for commercial use by startups (<$5M funding/revenue). Larger enterprises need to contact Datalab for licensing.
- Performance Benchmarking: Run benchmarks with your actual document types. Adjust DPI and concurrency parameters to find the optimal balance for your workload.
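For the benchmarking itself, a tiny timing harness goes a long way. The sketch below measures pages per second for any callable you give it, taking the best of a few repeats to reduce noise. fake_ocr is a stand-in so the example runs on its own; in practice you would pass a function that calls your real predictor on one page.

```python
import time


def pages_per_second(process_page, pages, repeats=3):
    """Time process_page over all pages and return the best throughput seen."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for page in pages:
            process_page(page)
        best = min(best, time.perf_counter() - start)
    return len(pages) / best if best > 0 else float("inf")


def fake_ocr(page):
    """Placeholder workload; replace with a call into your own pipeline."""
    time.sleep(0.001)
    return page


rate = pages_per_second(fake_ocr, list(range(5)))

if __name__ == "__main__":
    print(f"{rate:.1f} pages/sec")
```

Rerun it after each change to DPI or concurrency and compare the numbers on the same set of documents.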
Surya already has 20k+ stars on GitHub and an active community. If you hit a wall, open a GitHub Issue or ask in their Discord channel. Happy OCR-ing!