
feat(ocr): integrate vparse for asynchronous document processing

Kushal Lagichetty requested to merge feat/vparse into develop

Summary

This MR introduces high-performance, asynchronous OCR capabilities to the Corpus Client CLI by integrating the vparse (MinerU-based) engine. The integration uses an isolated lifecycle management strategy, allowing heavy document processing without polluting the global system environment or conflicting with existing dependencies.


Key Features

Isolated Lifecycle (vparse init)

  • Automatically manages a dedicated repository clone and virtual environment in ~/.corpus-cli/vparse.
  • Pins the isolated environment to Python 3.13 to ensure compatibility with heavy ML libraries (PyTorch, vLLM), even when the host system uses 3.14.
  • Features an interactive model download prompt (HuggingFace/ModelScope) during initialization.
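To make the isolation strategy concrete, here is a minimal sketch of the commands such an init step might issue. It is not the code in this MR: the `repo` subdirectory name, the POSIX `bin/python` path, and the editable install of the clone with its `all` extra are assumptions for illustration. The function only builds the command lists, so it can be inspected without touching the system.

```python
from pathlib import Path

# The MR states that the clone and environment live under ~/.corpus-cli/vparse.
VPARSE_HOME = Path.home() / ".corpus-cli" / "vparse"


def build_init_commands(home: Path = VPARSE_HOME, python: str = "3.13") -> list[list[str]]:
    """Return the uv commands that would create and populate the isolated env."""
    env_dir = home / "env"
    repo_dir = home / "repo"  # assumed name for the repository clone
    env_python = env_dir / "bin" / "python"  # POSIX layout; Windows uses Scripts/
    return [
        # Create a virtual environment pinned to the requested interpreter.
        ["uv", "venv", "--python", python, str(env_dir)],
        # Install the clone with its "all" extra into that environment only
        # (assumed install form; the MR only mentions vparse[all]).
        ["uv", "pip", "install", "--python", str(env_python), "-e", f"{repo_dir}[all]"],
    ]


for command in build_init_commands():
    print(" ".join(command))
```

Because every path is derived from `home`, nothing is installed into the host interpreter, which is the point of the isolated lifecycle.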

Asynchronous Bulk Processing (vparse run)

  • Executes parallelized OCR on entire directories of PDFs.
  • Provides real-time feedback using Rich progress bars, including pages-per-second (PPS) and ETA tracking.
  • Built-in support for hardware acceleration via the --device flag (cuda, mps, cpu).
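The PPS and ETA figures shown in the progress bar come down to simple arithmetic. The helper below is a hypothetical illustration of that calculation, not the function used in `vparse_cmd.py`.

```python
from __future__ import annotations


def throughput(pages_done: int, pages_total: int, elapsed_s: float) -> tuple[float, float | None]:
    """Return (pages per second, estimated seconds remaining)."""
    if elapsed_s <= 0 or pages_done <= 0:
        return 0.0, None  # no measurable rate yet, so no ETA
    pps = pages_done / elapsed_s
    remaining = max(pages_total - pages_done, 0)
    return pps, remaining / pps


# 50 of 200 pages after 25 seconds: 2 pages/s, 150 pages left, 75 s to go.
print(throughput(50, 200, 25.0))
```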

Native Bridge (vparse exec)

  • A pass-through command that grants access to the full vparse ecosystem (Gradio UI, FastAPI server, specialized parsers) directly through the corpus-client entry point.
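A pass-through command typically captures everything after the subcommand name and forwards it untouched. The sketch below shows one way to do that with the standard library's `argparse`; the real CLI may well use a different framework, and the forwarded arguments (`some-tool --flag value`) are placeholders rather than actual vparse options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with an `exec` subcommand that forwards its arguments."""
    parser = argparse.ArgumentParser(prog="corpus-client vparse")
    sub = parser.add_subparsers(dest="command", required=True)
    exec_cmd = sub.add_parser("exec", help="forward arguments to the isolated engine")
    # REMAINDER collects every remaining token, including option-like ones.
    exec_cmd.add_argument("passthrough", nargs=argparse.REMAINDER)
    return parser


args = build_parser().parse_args(["exec", "some-tool", "--flag", "value"])
print(args.passthrough)
```

The collected list could then be appended to the isolated interpreter's command line, so the wrapper never needs to know the engine's own options.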

Security & Standards

  • Resolved all Ruff, Mypy, and Bandit violations in new modules.
  • Upgraded idna (v3.11 → v3.16) to resolve the CVE-2024-3651 vulnerability found during audit.

Technical Implementation Details

Concern | Approach
Isolation | All ML-heavy dependencies strictly confined to ~/.corpus-cli/
Inter-Process Communication | Secure JSON-bridge layer transmits real-time processing events from the isolated Python 3.13 environment back to the main 3.14 CLI
Dependency Management | Uses uv for fast environment setup and dependency synchronization (vparse[all])
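One common way to build a JSON bridge between two interpreters is newline-delimited JSON on the child's stdout. The sketch below assumes that shape; the event names and fields are invented for illustration, and the child is run with the current interpreter as a stand-in for the isolated 3.13 one.

```python
import json
import subprocess
import sys
from collections.abc import Iterator

# Stand-in for the worker that would run inside the isolated environment.
# Event names and fields are hypothetical.
CHILD_SCRIPT = r"""
import json
for page in range(1, 4):
    print(json.dumps({"event": "page_done", "page": page}), flush=True)
print(json.dumps({"event": "finished", "pages": 3}), flush=True)
"""


def stream_events(argv: list[str]) -> Iterator[dict]:
    """Run a child process and yield one dict per JSON line it writes to stdout."""
    with subprocess.Popen(argv, stdout=subprocess.PIPE, text=True) as proc:
        if proc.stdout is None:
            raise RuntimeError("child stdout was not captured")
        for line in proc.stdout:
            line = line.strip()
            if line:  # skip blank lines between events
                yield json.loads(line)
    # Leaving the `with` block waits for the child, so returncode is set here.
    if proc.returncode != 0:
        raise RuntimeError(f"worker exited with status {proc.returncode}")


events = list(stream_events([sys.executable, "-c", CHILD_SCRIPT]))
for event in events:
    print(event)
```

Passing the command as a list (no shell) and parsing each line with `json.loads` keeps the parent from evaluating anything the child emits, which is one plausible reading of "secure" here.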

How to Test

# 1. Initialize the engine and models
./scripts/install_vparse_integration.sh
# Follow the interactive prompts to select your model source

# 2. Run OCR on a sample directory
uv run corpus-client vparse run ./test_pdfs/ --device cuda

# 3. Verify isolated state
# Check that ~/.corpus-cli/vparse/env exists with a distinct Python 3.13 installation
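Step 3 can also be checked programmatically. Virtual environments record their interpreter version in a `pyvenv.cfg` file at the environment root; the sketch below reads it, assuming the version is stored under a `version` or `version_info` key (which key is present depends on the tool that created the environment).

```python
from __future__ import annotations

from pathlib import Path


def venv_python_version(cfg_text: str) -> str | None:
    """Extract the interpreter version recorded in pyvenv.cfg contents."""
    for line in cfg_text.splitlines():
        key, sep, value = line.partition("=")
        # Accept either key, since tools differ in which one they write.
        if sep and key.strip() in {"version", "version_info"}:
            return value.strip()
    return None


def is_python_313(env_dir: Path) -> bool:
    """Return True if env_dir looks like a virtual environment built on 3.13."""
    cfg = env_dir / "pyvenv.cfg"
    if not cfg.is_file():
        return False
    version = venv_python_version(cfg.read_text(encoding="utf-8"))
    return version is not None and (version == "3.13" or version.startswith("3.13."))


print(is_python_313(Path.home() / ".corpus-cli" / "vparse" / "env"))
```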

Dependencies Added / Updated

Type | Item
New Script | scripts/install_vparse_integration.sh
New Module | src/corpus_client_cli/vparse_cmd.py
Updated | uv.lock: upgraded idna to v3.16 for CVE-2024-3651 fix
