refactor(docker): optimize orchestration, dynamic vparse, and model caching (!10)

Lakshy Yarlagadda requested to merge fix/docker into develop Apr 30, 2026

💡 What does this MR do?

This merge request overhauls the Docker setup to make it more modular, faster to build, and more resilient across Docker environments, including older installations that fall back to the legacy builder.

🛠️ Key Technical Changes

Dynamic VParse Orchestration via Git Context
- Removed: All Dockerfile.vparse* variants have been deleted to reduce repository bloat.
- Added: VParse is now entirely optional and gated behind Docker Compose profiles (pipeline, vlm, hybrid). When a profile is invoked, Docker Compose clones the upstream mineru-dots repository and builds the corresponding backend on the fly.
Legacy Docker Compose Compatibility Fix
- Fixed: model-downloader previously used dockerfile_inline, which caused silent build failures (and subsequent ModuleNotFoundError crashes) on older Docker installations that fall back to the legacy builder. This logic has been extracted into a dedicated Dockerfile.model-downloader.
Optimized Initial Model Fetching
- Fixed: setup_models.py no longer forces a 2 GB download of Paddle OCR models on default startup. It now downloads only the Gemma LLM for BookExtractor. VParse downloads its own OCR models when a user starts one of its profiles.
Persistent HuggingFace Cache Mapping
- Fixed: Added a models:/root/.cache/huggingface volume mapping to every VParse profile. MinerU expects models at this hardcoded internal path, so the mapping lets VParse recognize previously downloaded models instead of spending 15+ minutes re-downloading them on every container restart.
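
For reviewers who want a picture of how these changes fit together, the following is a minimal Compose sketch, not the actual file from this MR. The service names model-downloader and vparse-pipeline, the file Dockerfile.model-downloader, and the models:/root/.cache/huggingface mapping come from the description above. Everything else is an assumption for illustration: the bookextractor service name, the /models mount point, and the repository URL and ref, which are placeholders.

```yaml
# Illustrative sketch only. Names marked "hypothetical" or "placeholder"
# are assumptions and do not come from this merge request.
services:
  bookextractor:                  # hypothetical service name
    build: .
    volumes:
      - models:/models            # hypothetical mount point

  model-downloader:
    build:
      context: .
      # A dedicated Dockerfile instead of dockerfile_inline, so the build
      # does not depend on a builder feature the legacy builder lacks.
      dockerfile: Dockerfile.model-downloader
    volumes:
      - models:/models            # hypothetical mount point

  vparse-pipeline:
    # Started only when the profile is requested:
    #   docker compose --profile pipeline up -d
    profiles: ["pipeline"]
    build:
      # Git URL build context: the repository is cloned at build time.
      # Placeholder URL and ref; substitute the real mineru-dots location.
      context: https://example.com/your-org/mineru-dots.git#main
    volumes:
      # Path where MinerU expects its models, per the description above.
      - models:/root/.cache/huggingface

volumes:
  models:
```

Because the named volume outlives individual containers, anything downloaded into it should still be present after a restart, which is the behavior the cache persistence test checks.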

🧪 How to test this MR

Test Default Startup: Run docker compose up --build -d. Verify that only the BookExtractor and model-downloader start, and only the Gemma model is downloaded.
Test Profile Startup: Run docker compose --profile pipeline up -d. Verify that it successfully builds VParse from the remote repo.
Test Cache Persistence: Restart the VParse container (docker compose restart vparse-pipeline). Check the logs to ensure it instantly uses the cached models instead of re-downloading them.

📸 Documentation

The README.md has been fully updated to reflect the new profile-based startup instructions and the removal of the old VPARSE_DOCKERFILE environment variable overrides.

Edited Apr 30, 2026 by Lakshy Yarlagadda
