
refactor(docker): optimize orchestration, dynamic vparse, and model caching

Lakshy Yarlagadda requested to merge fix/docker into develop

Closes #8 (closed)

💡 What does this MR do?

This Merge Request overhauls the Docker infrastructure to make it more modular, faster to build, and more resilient across different Docker environments.

🛠️ Key Technical Changes

  1. Dynamic VParse Orchestration via Git Context

    • Removed: All Dockerfile.vparse* variants have been deleted to reduce repository bloat.
    • Added: VParse is now entirely optional and hidden behind Docker Compose profiles (pipeline, vlm, hybrid). When a profile is invoked, Docker Compose directly clones the upstream mineru-dots repository and builds the corresponding backend on the fly.
  2. Legacy Docker Compose Compatibility Fix

    • Fixed: model-downloader previously used dockerfile_inline, causing silent build failures (and subsequent ModuleNotFoundError crashes) on older Docker installations that fall back to the legacy builder. This logic has been extracted into a dedicated Dockerfile.model-downloader.
  3. Optimized Initial Model Fetching

    • Fixed: setup_models.py no longer forces a 2 GB download of Paddle OCR models on default startup. It now downloads only the Gemma LLM for BookExtractor. VParse downloads its own OCR models when a user starts a specific profile.
  4. Persistent HuggingFace Cache Mapping

    • Fixed: Added the volume mapping models:/root/.cache/huggingface to all VParse profiles. MinerU expects models at this hardcoded internal path, so the mapping lets VParse recognize previously downloaded models instead of spending 15+ minutes re-downloading them on every container restart.
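
For reviewers who want a picture of how these changes fit together, here is a minimal Compose sketch. It is not the compose file from this MR. The service names vparse-pipeline and model-downloader, the file Dockerfile.model-downloader, the pipeline profile, and the models:/root/.cache/huggingface mapping come from the description above. The repository URL, the branch, and the upstream Dockerfile path are placeholders, since the real values are not stated here.

```yaml
# Illustrative sketch only; the actual docker-compose.yml in this MR may differ.
services:
  model-downloader:
    build:
      context: .
      # A dedicated Dockerfile replaces dockerfile_inline, which this MR
      # reports as failing silently under the legacy builder.
      dockerfile: Dockerfile.model-downloader

  vparse-pipeline:
    # Only started when the profile is requested:
    #   docker compose --profile pipeline up -d
    profiles: ["pipeline"]
    build:
      # Compose accepts a Git URL as build context and clones it at build time.
      # Placeholder URL and branch (assumption); substitute the real upstream.
      context: https://example.com/upstream/mineru-dots.git#main
      # Hypothetical path inside the upstream repository (assumption).
      dockerfile: docker/Dockerfile.pipeline
    volumes:
      # Named volume mounted at the cache path MinerU expects, so downloaded
      # models survive container restarts.
      - models:/root/.cache/huggingface

volumes:
  models:
```

Under this layout the vlm and hybrid profiles would follow the same pattern as vparse-pipeline, each with its own profiles entry and the same models volume.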

🧪 How to test this MR

  1. Test Default Startup: Run docker compose up --build -d. Verify that only the BookExtractor and model-downloader start, and only the Gemma model is downloaded.
  2. Test Profile Startup: Run docker compose --profile pipeline up -d. Verify that it successfully builds VParse from the remote repo.
  3. Test Cache Persistence: Restart the VParse container (docker compose restart vparse-pipeline). Check the logs to ensure it instantly uses the cached models instead of re-downloading them.

📸 Documentation

The README.md has been fully updated to reflect the new profile-based startup instructions and the removal of the old VPARSE_DOCKERFILE environment variable overrides.

Edited by Lakshy Yarlagadda
