feat(ocr): integrate vparse for asynchronous document processing
Summary
This MR introduces high-performance, asynchronous OCR capabilities to the Corpus Client CLI by integrating the vparse (MinerU-based) engine. The integration uses an isolated lifecycle management strategy, allowing heavy document processing without polluting the global system environment or conflicting with existing dependencies.
Key Features
Isolated Lifecycle (vparse init)
- Automatically manages a dedicated repository clone and virtual environment in ~/.corpus-cli/vparse.
- Forces Python 3.13 in the isolated environment to ensure compatibility with heavy ML libraries (Torch, vLLM) even when the host system uses 3.14.
- Features an interactive model download prompt (HuggingFace/ModelScope) during initialization.
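The environment pinning described above could be sketched roughly as follows. This is an illustration, not the code in vparse_cmd.py: the helper names `build_venv_command` and `needs_init` are hypothetical, and it assumes the environment is created with `uv venv` under the `env` subdirectory mentioned in the test steps.

```python
from pathlib import Path

# Root directory named in this MR; everything else here is illustrative.
VPARSE_HOME = Path.home() / ".corpus-cli" / "vparse"


def build_venv_command(home: Path = VPARSE_HOME, python_version: str = "3.13") -> list[str]:
    """Return a uv invocation that creates the pinned virtual environment."""
    env_dir = home / "env"
    # `uv venv --python <version> <path>` creates a venv using the requested interpreter.
    return ["uv", "venv", "--python", python_version, str(env_dir)]


def needs_init(home: Path = VPARSE_HOME) -> bool:
    """Treat the engine as uninitialized until the environment directory exists."""
    return not (home / "env").is_dir()
```

Building the command as a list (rather than a shell string) lets it be passed to `subprocess.run` without shell quoting concerns.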
Asynchronous Bulk Processing (vparse run)
- Executes parallelized OCR on entire directories of PDFs.
- Provides real-time feedback using Rich progress bars, including pages-per-second (PPS) and ETA tracking.
- Built-in support for hardware acceleration via the --device flag (cuda, mps, cpu).
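As a rough sketch of the bulk-processing pattern described above, the snippet below bounds concurrency with a semaphore and derives pages-per-second and ETA from a running page count. The names `Throughput` and `process_all`, the `ocr` coroutine, and the `max_parallel` parameter are assumptions for illustration; the actual command renders these numbers through Rich progress bars, which are omitted here.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class Throughput:
    """Tracks pages processed so far to derive pages-per-second and ETA."""
    total_pages: int
    done_pages: int = 0
    started: float = 0.0

    def pps(self, now: float) -> float:
        elapsed = now - self.started
        return self.done_pages / elapsed if elapsed > 0 else 0.0

    def eta_seconds(self, now: float) -> float:
        rate = self.pps(now)
        remaining = self.total_pages - self.done_pages
        return remaining / rate if rate > 0 else float("inf")


async def process_all(jobs: dict[str, int], ocr, max_parallel: int = 2) -> Throughput:
    """Run `ocr(path)` for every PDF with at most `max_parallel` in flight.

    `jobs` maps a PDF path to its page count; `ocr` is a hypothetical coroutine.
    """
    stats = Throughput(total_pages=sum(jobs.values()), started=time.monotonic())
    gate = asyncio.Semaphore(max_parallel)

    async def one(path: str, pages: int) -> None:
        async with gate:
            await ocr(path)
        stats.done_pages += pages  # a progress bar would be refreshed here

    await asyncio.gather(*(one(p, n) for p, n in jobs.items()))
    return stats
```

The semaphore keeps memory use predictable on GPU hosts, where running every PDF at once would likely exhaust device memory.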
Native Bridge (vparse exec)
- A pass-through command that grants access to the full vparse ecosystem (Gradio UI, FastAPI server, specialized parsers) directly through the corpus-client entry point.
Security & Standards
- Resolved all Ruff, Mypy, and Bandit violations in new modules.
- Upgraded idna (v3.11 → v3.16) to resolve the CVE-2024-3651 vulnerability found during audit.
Technical Implementation Details
| Concern | Approach |
|---|---|
| Isolation | All ML-heavy dependencies strictly confined to ~/.corpus-cli/ |
| Inter-Process Communication | Secure JSON-bridge layer transmits real-time processing events from the isolated Python 3.13 environment back to the main 3.14 CLI |
| Dependency Management | Uses uv for fast environment setup and dependency synchronization (vparse[all]) |
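One plausible shape for the JSON bridge in the table above is newline-delimited JSON on the child process's stdout, parsed on the CLI side. The sketch below assumes that wire format; the function name `iter_events` and the event fields are hypothetical and not taken from the implementation.

```python
import json
from typing import Iterable, Iterator


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one event dict per JSON line, skipping anything that is not a JSON object."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. library log output interleaved on stdout
        if isinstance(event, dict):
            yield event
```

In use, `lines` would typically be the `stdout` of a `subprocess.Popen(..., stdout=subprocess.PIPE, text=True)` started with the isolated environment's interpreter. Parsing with `json.loads` rather than evaluating the child's output keeps the parent CLI from executing anything the child emits.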
How to Test
# 1. Initialize the engine and models
./scripts/install_vparse_integration.sh
# Follow the interactive prompts to select your model source
# 2. Run OCR on a sample directory
uv run corpus-client vparse run ./test_pdfs/ --device cuda
# 3. Verify isolated state
# Check that ~/.corpus-cli/vparse/env exists with a distinct Python 3.13 installation
Dependencies Added / Updated
| Type | Item |
|---|---|
| New Script | scripts/install_vparse_integration.sh |
| New Module | src/corpus_client_cli/vparse_cmd.py |
| Updated | uv.lock: upgraded idna to v3.16 for the CVE-2024-3651 fix |