Draft: Fix: Enable Qwen2.5-VL Multimodal Support and CUDA GPU Acceleration
Description
This MR upgrades the metadata extraction pipeline from a text-only workflow to a Vision-Language Model (VLM) workflow. It introduces GPU acceleration for both OCR and LLM tasks and enables the model to "see" page images directly, significantly improving extraction accuracy for complex document layouts.
Key Changes
- Multimodal Extraction: Migrated from Gemma to Qwen2.5-VL-7B-Instruct. The pipeline now passes Base64-encoded page images to the model alongside raw OCR text.
-
GPU Acceleration:
- Updated
Dockerfileto use a CUDA-enabled PyTorch devel image. - Enabled
GGML_CUDAduring the compilation ofllama-cpp-python. - Configured
docker-compose.ymlto reserve all available NVIDIA GPUs and setN_GPU_LAYERS=-1for full VRAM offloading.
- Updated
-
Pipeline Enhancements:
- Integrated
Qwen25VLChatHandlerfor native multimodal chat support. - Added robust model resolution and auto-download logic for both the GGUF model and the multimodal projector (
mmproj).
- Integrated
-
Dependencies: Bumped
llama-cpp-pythonto0.3.1to support the latest Qwen-VL architecture.
Technical Details
-
Base Image:
pytorch/pytorch:2.2.2-cuda12.1-cudnn8-devel - Model Architecture: Qwen2.5-VL (GGUF)
- OCR Engine: EasyOCR (now accelerated by CUDA via the PyTorch base)
How to Verify
- Ensure the NVIDIA Container Toolkit is installed on the host.
- Build and start the services:
docker-compose up --build - Monitor logs to confirm
cuBLASorCUDAis initialized inllama-cpp-python. - Submit a complex PDF (e.g., a book cover with artistic fonts) and verify that the metadata is extracted correctly using visual context.
Environment Variables Added
-
VLM_MODEL_PATH: Path to the Qwen-VL GGUF file. -
MMPROJ_MODEL_PATH: Path to the multimodal projector file. -
N_GPU_LAYERS: Number of layers to offload to GPU (defaults to -1 for all).
Closes #1 (closed)
Edited by Praneeth Ashish