Optimize Docker Orchestration: Dynamic VParse, Model Caching, and Legacy Compose Compatibility
🎯 Problem Statement
Currently, our Docker orchestration has several inefficiencies and bugs related to the VParse (MinerU) integration and model downloading:
- Redundant Dockerfiles: We are maintaining multiple `Dockerfile.vparse*` variants locally, which duplicates code already present in the upstream `mineru-dots` repository.
- Legacy Compose Breakage: The `model-downloader` service uses the `dockerfile_inline` feature, which is not supported by legacy Docker Compose builders (e.g., without the buildx plugin). This causes it to silently fall back to building the default BookExtractor image, resulting in a `ModuleNotFoundError` for `huggingface_hub`.
- Aggressive Downloading: The `setup_models.py` script forces the download of 2 GB of Paddle OCR models on default startup, even if the user only wants to run the BookExtractor pipeline without VParse.
- Cache Misses in VParse: VParse uses a hardcoded internal cache path (`/root/.cache/huggingface`) for its models. Because our Docker volume isn't mapped to this internal path, VParse fails to recognize downloaded models and re-downloads them every time the container restarts.
🛠 Proposed Solution
- Dynamic Orchestration: Delete local VParse Dockerfiles and configure `docker-compose.yml` to build directly from the remote `mineru-dots` repository using Docker Compose profiles (`pipeline`, `vlm`, `hybrid`).
- Fix Downloader Build: Move the inline Dockerfile instructions to a physical `Dockerfile.model-downloader` to ensure compatibility across all Docker Compose versions.
- Optimize Model Setup: Remove the pipeline/Paddle model download logic from `setup_models.py` so it only fetches the Gemma LLM by default.
- Map Persistent Cache: Map the persistent `models` volume directly to `/root/.cache/huggingface` in the VParse services so OCR models are properly persisted and never re-downloaded.
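
As a rough illustration of the "Optimize Model Setup" item, a trimmed-down `setup_models.py` might look something like the sketch below. It is not the existing script: the function names, the `GEMMA_REPO_ID` variable, and the placeholder repo ID are all hypothetical, since this issue does not say which Gemma checkpoint is used or how the current script is structured. The only real library call is `huggingface_hub.snapshot_download`, imported lazily so the planning logic can be exercised without the package installed.

```python
"""Hypothetical sketch of a Gemma-only setup_models.py (not the real script)."""
import os
from pathlib import Path

# Placeholder: the issue only says "the Gemma LLM", not which checkpoint.
DEFAULT_GEMMA_REPO = "your-org/your-gemma-model"


def plan_downloads(env=None):
    """Return the Hugging Face repo IDs to fetch on default startup.

    Only the Gemma LLM is listed. Paddle/OCR models are deliberately absent,
    so VParse fetches them itself when one of its profiles is enabled.
    GEMMA_REPO_ID is an assumed override variable, not an existing setting.
    """
    env = os.environ if env is None else env
    return [env.get("GEMMA_REPO_ID", DEFAULT_GEMMA_REPO)]


def download_all(repo_ids, models_dir, fetch=None):
    """Download each repo into its own folder under models_dir.

    `fetch` defaults to huggingface_hub.snapshot_download; passing a stub
    makes the function usable without network access.
    """
    if fetch is None:
        # Imported lazily so the module loads even without huggingface_hub.
        from huggingface_hub import snapshot_download
        fetch = snapshot_download

    models_dir = Path(models_dir)
    fetched = []
    for repo_id in repo_ids:
        # "org/name" -> "org--name" keeps one flat folder per model.
        target = models_dir / repo_id.replace("/", "--")
        target.mkdir(parents=True, exist_ok=True)
        fetch(repo_id=repo_id, local_dir=str(target))
        fetched.append(repo_id)
    return fetched


# Example entry point (assumes the models volume is mounted at /models):
#   download_all(plan_downloads(), os.environ.get("MODELS_DIR", "/models"))
```

Keeping the model list in one small function makes the default behavior easy to review: anything not returned by `plan_downloads` is never downloaded at startup.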
✅ Acceptance Criteria
- Running `docker compose up -d` starts only BookExtractor and downloads only the Gemma LLM.
- Running `docker compose --profile pipeline up -d` dynamically clones and builds VParse from the remote repository.
- Restarting the VParse container uses the cached models in the volume instead of re-downloading them.
- The pipeline functions successfully on older Docker Compose installations.