How to Build a Local OCR Pipeline from Scratch with Surya


A step-by-step guide to deploying Surya locally for document OCR. Covers installation, inference backend configuration (GPU/CPU/Mac), CLI tricks, and a complete Python pipeline for layout analysis, text extraction, and table digitization.

#OCR #Document Digitization #Surya #Python #Layout Analysis #Table Recognition #Open Source

Building a Local Document OCR Pipeline with Surya: From Installation to Table Extraction

1. The Problem You Might Be Facing

As a backend developer, you've probably run into this scenario more than once: stakeholders dump a pile of scanned PDFs, contract images, and invoice photos on you and ask you to extract the text and tables into a database. The traditional approaches are either calling paid OCR APIs (costs add up quickly, and you have to upload sensitive data) or wrestling with Tesseract (accuracy is inconsistent, and parameter tweaking is tedious).

Today, I'll show you how to solve this entirely locally using an open-source project called Surya. With 650 million parameters, it supports OCR in 91 languages and automatically handles document layout analysis, reading-order sorting, and table recognition. On an RTX 5090, it hits 5 pages per second. No GPU? It also runs on CPU or Apple Silicon via the llama.cpp backend.

By the end of this guide, you will:

  1. Deploy a complete local OCR service.
  2. Use the Python API to recognize text from images/PDFs.
  3. Extract tables from documents and export them as HTML.
  4. Troubleshoot common issues independently.

2. Prerequisites

Before we begin, ensure your environment meets these requirements:

Requirement | Details
Python | 3.9+ (3.10 recommended)
Inference backend | NVIDIA GPU: requires Docker + NVIDIA Container Toolkit. CPU/Apple Silicon: requires llama.cpp
Memory | GPU inference: 8GB+ VRAM recommended. CPU inference: 16GB+ RAM recommended
Disk | ~1.3GB for the model weights, downloaded automatically on first run

Why do you need an inference backend?
Surya's core is a Vision-Language Model (VLM). OCR, layout analysis, and table recognition are all handled by the same model. It doesn't run purely in Python; it requires an inference server. The good news? You don't need to deploy it manually. SuryaInferenceManager automatically spins it up on your first API call.


3. Quick Start: Get It Running in 3 Steps

Step 1: Install surya-ocr

bash
pip install surya-ocr

That's it for the Python side. pip pulls in the package's Python dependencies automatically; the inference backend is set up separately in Step 2.

Step 2: Configure the Inference Backend

For NVIDIA GPU users:
Ensure Docker and the NVIDIA Container Toolkit are installed. Verify with:

bash
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

If you see your GPU info, you're good to go.

For CPU or Apple Silicon (Mac) users:
You need the llama.cpp server binary. On macOS, simply run:

bash
brew install llama.cpp

Linux users should download the appropriate package from the llama.cpp releases and ensure llama-server is in your system's PATH.
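
Before moving on, it can help to confirm which backend tools are actually visible from Python. The snippet below is a minimal sketch using only the standard library. The helper name find_backend_tools is made up for this guide and is not part of Surya.

```python
import shutil


def find_backend_tools():
    """Report which backend executables are discoverable on PATH."""
    # shutil.which returns the full path of an executable, or None if missing.
    return {
        "docker": shutil.which("docker"),
        "llama-server": shutil.which("llama-server"),
    }


if __name__ == "__main__":
    for tool, path in find_backend_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

A "not found" result for llama-server on a CPU or Mac machine usually means the binary is installed somewhere that is not on your PATH.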

Step 3: Verify the Installation

Grab a test image with text (e.g., test.jpg) and run a quick check:

bash
surya_ocr /path/to/your/test.jpg

On the first run, Surya will:

  1. Automatically download model weights (~1.3GB, please be patient).
  2. Start the inference server (vllm or llama.cpp).
  3. Execute OCR and output results.json.

Pro Tip: Reuse the Inference Server
By default, CLI commands start and stop the server every time, which means reloading the model on each run. Add the --keep_server flag to keep it alive and reuse it for subsequent commands:

bash
surya_ocr test.jpg --keep_server   # Start and keep server running
surya_layout test2.jpg             # Reuses instantly, results in seconds
surya_table test3.pdf --keep_server # Keeps going
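
If you drive the CLI from a script, the same flag can be applied to a whole batch of files. This is a rough sketch: build_ocr_commands and run_all are hypothetical helper names, and the only CLI pieces used are the surya_ocr command and the --keep_server flag shown above.

```python
import subprocess


def build_ocr_commands(paths, keep_server=True):
    """Build one surya_ocr argv list per input file."""
    commands = []
    for path in paths:
        argv = ["surya_ocr", str(path)]
        if keep_server:
            # Flag taken from the tip above; keeps the server alive between runs.
            argv.append("--keep_server")
        commands.append(argv)
    return commands


def run_all(paths):
    """Run the commands in order, stopping at the first failure."""
    for argv in build_ocr_commands(paths):
        subprocess.run(argv, check=True)
```

Keeping command construction separate from execution makes it easy to print and inspect the commands before running anything.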

4. Real-World: Building a Digitization Pipeline with Python API

CLI commands are great for quick tests, but integrating Surya into a production system calls for the Python API. Below is a complete script that implements: Input Scanned Image → Layout Recognition → OCR Text Extraction → Table Extraction → Structured Output.

python
import json

from PIL import Image
from surya.inference import SuryaInferenceManager
from surya.layout import LayoutPredictor
from surya.recognition import RecognitionPredictor
from surya.table_rec import TableRecPredictor

# 1. Initialize the inference manager (auto-selects vllm or llama.cpp)
manager = SuryaInferenceManager()

# 2. Load the test image
image_path = "scanned_invoice.jpg"  # Replace with your file path
image = Image.open(image_path)

# 3. Run layout analysis first (identifies tables, body text, etc.)
layout_predictor = LayoutPredictor(manager)
layouts = layout_predictor([image])

print("=== Layout Analysis Results ===")
for page in layouts:
    for block in page["blocks"]:
        print(f"Type: {block['label']}, Reading Order: {block['reading_order']}, Confidence: {block['confidence']:.2f}")

# 4. Fine-grained OCR based on layout (more accurate than full-page OCR)
recognition_predictor = RecognitionPredictor(manager)
# Passing the layout results automatically enables block mode
predictions = recognition_predictor([image], layouts)

print("\n=== OCR Text Extraction ===")
for page in predictions:
    for block in page["blocks"]:
        if not block["skipped"]:
            print(f"[{block['label']}] {block['html'][:100]}...")

# 5. Extract tables (if any)
table_predictor = TableRecPredictor(manager)
table_results = table_predictor.predict_full([image])  # predict_full outputs complete HTML

print("\n=== Table Extraction Results ===")
if table_results and not table_results[0]["error"]:
    for tbl in table_results:
        print(f"Found table: {len(tbl['rows'])} rows x {len(tbl['cols'])} columns")
        # Preview the start of the HTML table
        print(tbl["html"][:200], "...")

    # Save all results once, after the loop, for downstream processing
    with open("table_output.json", "w", encoding="utf-8") as f:
        json.dump(table_results, f, ensure_ascii=False, indent=2)
    print("Table results saved to table_output.json")
else:
    print("No tables detected")

Code Breakdown

Why layout first, then OCR?
Surya supports two OCR modes:

  • Full-page mode: One VLM call processes the entire page. Fast, but might miss fine details.
  • Block mode: Runs layout analysis to locate text blocks first, then OCRs each block individually. Much higher accuracy.

When you pass the layout_results parameter (the second argument in the script above), RecognitionPredictor automatically switches to block mode. For structured documents like contracts or invoices, block mode is highly recommended.
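
Once block-mode results are available, a common next step is flattening them into plain text. The sketch below works on plain dictionaries shaped like the blocks printed in the script (label, html, skipped). It assumes each block also carries a numeric reading_order key, which the script only shows on layout blocks, so treat that as an assumption. page_to_text is a hypothetical helper, not a Surya function.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")


def page_to_text(page):
    """Join a page's OCR blocks into plain text, in reading order."""
    blocks = [b for b in page["blocks"] if not b.get("skipped")]
    # Assumption: each block carries a numeric "reading_order" key.
    blocks.sort(key=lambda b: b.get("reading_order", 0))
    lines = []
    for block in blocks:
        # Strip tags first, then decode entities such as &amp;
        text = unescape(TAG_RE.sub("", block["html"])).strip()
        if text:
            lines.append(text)
    return "\n".join(lines)


sample_page = {
    "blocks": [
        {"label": "Text", "reading_order": 1, "skipped": False, "html": "<p>Total: 42 &amp; tax</p>"},
        {"label": "Title", "reading_order": 0, "skipped": False, "html": "<h1>Invoice</h1>"},
        {"label": "Picture", "reading_order": 2, "skipped": True, "html": ""},
    ]
}
print(page_to_text(sample_page))
```

The regex tag stripping is deliberately crude. It is fine for short text blocks but should not be used on table HTML, where cell structure matters.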

predict_full vs Default Call
By default, TableRecPredictor only returns row/column geometry (simple mode). Calling .predict_full() outputs a complete <table> HTML structure, including merged cells and rowspan/colspan headers. If your downstream pipeline expects standard tabular formats, always use predict_full.
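
To move the HTML from predict_full into a database or CSV, you first need rows and cells. Here is a minimal standard-library sketch that turns a simple table into a list of rows. TableRows and html_table_to_rows are names invented for this example, and the parser ignores rowspan/colspan and nested tables, so merged cells need extra handling.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html):
    """Parse table HTML and return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = "<table><tr><th>Item</th><th>Qty</th></tr><tr><td>Widget</td><td>2</td></tr></table>"
print(html_table_to_rows(sample))
```

Each inner list can then be passed to csv.writer or used as the parameters of an INSERT statement.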


5. Common Pitfalls & Troubleshooting

1. Inaccurate OCR Results?

Check image resolution first:

  • Text is too small → Increase DPI or upscale the image.
  • Image is too large (>2048px width) → Downscale it. The model can get confused by overly high resolutions.
  • Old/scanned documents (blurry, skewed) → Apply preprocessing first (binarization, denoising, deskewing).
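
The downscaling advice can be sketched with Pillow, which the pipeline script already uses. The 2048px limit comes from the bullet above; the function names are hypothetical, and LANCZOS resampling is just one reasonable choice.

```python
def fit_within_width(size, max_width=2048):
    """Return (width, height) scaled down so that width <= max_width."""
    width, height = size
    if width <= max_width:
        return width, height
    scale = max_width / width
    return max_width, max(1, round(height * scale))


def downscale_image(src_path, dst_path, max_width=2048):
    """Save a copy of the image that is no wider than max_width."""
    from PIL import Image  # Pillow, already used by the pipeline script

    with Image.open(src_path) as img:
        new_size = fit_within_width(img.size, max_width)
        if new_size != img.size:
            # Aspect ratio is preserved because both sides use the same scale.
            img = img.resize(new_size, Image.LANCZOS)
        img.save(dst_path)
```

For example, fit_within_width((4096, 2000)) yields (2048, 1000), while images already within the limit are left at their original size.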

2. Inference Server Fails to Start

Common error: Backend connection timeout. Troubleshooting steps:

  • GPU users: Verify Docker can access the GPU (nvidia-smi test).
  • CPU/Mac users: Ensure llama-server is in your PATH (run llama-server --help in terminal to verify).
  • Manually specify backend:
    bash
    export SURYA_INFERENCE_BACKEND=vllm  # or llamacpp
    export SURYA_INFERENCE_URL=http://localhost:8000/v1  # Point to an existing service
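
When pointing Surya at an existing service, a quick reachability check can rule out the simplest failure. The sketch below only tests whether something accepts TCP connections at the URL's host and port. It does not verify that the server is healthy or that it speaks the protocol Surya expects. server_is_listening is a made-up helper name.

```python
import socket
from urllib.parse import urlparse


def server_is_listening(url, timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    # Fall back to the scheme's default port when the URL has none.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print(server_is_listening("http://localhost:8000/v1"))
```

If this prints False, the timeout is most likely a server that never started or a wrong port, rather than a problem inside Surya itself.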

3. Out of Memory / VRAM

Lower the concurrency:

bash
export SURYA_INFERENCE_PARALLEL=4  # Default is 8, lowering reduces peak memory usage

GPU users can also lower the DPI (default 192, try 96):

python
import os
os.environ["SURYA_OCR_DPI"] = "96"
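
Both settings can also be applied together from Python. This is a small sketch using the two variable names from this section. configure_surya_env is a hypothetical helper, and the idea that the values must be set before Surya is imported is an assumption based on how environment-driven settings usually behave, not something confirmed here.

```python
import os


def configure_surya_env(parallel=4, dpi=96):
    """Set the tuning variables from this section for the current process."""
    # Environment values must be strings.
    os.environ["SURYA_INFERENCE_PARALLEL"] = str(parallel)
    os.environ["SURYA_OCR_DPI"] = str(dpi)


configure_surya_env(parallel=4, dpi=96)
# Assumption: Surya reads these when it is imported or first used,
# so import the surya modules only after this point.
```

Start from the lower values, confirm the out-of-memory error is gone, then raise one setting at a time to see which has more impact on your documents.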

4. Suboptimal Chinese Recognition?

Surya achieves an 87.2% pass rate across 91 languages, with Chinese specifically at 82.5%. For mixed Chinese-English documents:

  • Ensure high image clarity.
  • Use block mode instead of full-page mode.
  • For highly specialized layouts, consider reaching out to the project maintainers for fine-tuning options.

6. Summary & Next Steps

Today, we walked through a complete local OCR pipeline:

  1. pip install surya-ocr for one-click installation.
  2. Configure the inference backend based on your hardware (GPU → vllm/Docker, CPU/Mac → llama.cpp).
  3. Use --keep_server to reuse the server and avoid repeated model loading.
  4. Build a full pipeline with the Python API: Layout → OCR → Table Extraction.
  5. Master optimization tricks like DPI tuning and concurrency control.

What to try next:

  • Batch processing: run surya_ocr /path/to/folder to process a whole directory, and add --page_range 0-5 to limit which pages are read.
  • Run the official Streamlit UI: pip install streamlit pdftext && surya_gui
  • Explore math formula recognition: Surya 2 automatically outputs equations as KaTeX-compatible LaTeX wrapped in <math>...</math> tags.

Production Integration Tips:

  • Licensing: Code is Apache 2.0. Model weights are free for commercial use by startups (<$5M funding/revenue). Larger enterprises need to contact Datalab for licensing.
  • Performance Benchmarking: Run benchmarks with your actual document types. Adjust DPI and concurrency parameters to find the optimal balance for your workload.

Surya already has 20k+ stars on GitHub and an active community. If you hit a wall, open a GitHub Issue or ask in their Discord channel. Happy OCR-ing!

Last Updated: 2026-06-03 10:05:47
