Draft: Fix: Enable Qwen2.5-VL Multimodal Support and CUDA GPU Acceleration (!2) · Merge requests · VISWAM / admin / Meta Data Indexing

Praneeth Ashish requested to merge feat/vlm-cuda-support into main Apr 26, 2026

Description

This MR upgrades the metadata extraction pipeline from a text-only workflow to a Vision-Language Model (VLM) workflow. It introduces GPU acceleration for both OCR and LLM tasks and enables the model to "see" page images directly, significantly improving extraction accuracy for complex document layouts.

Key Changes

Multimodal Extraction: Migrated from Gemma to Qwen2.5-VL-7B-Instruct. The pipeline now passes Base64-encoded page images to the model alongside raw OCR text.
GPU Acceleration:
- Updated Dockerfile to use a CUDA-enabled PyTorch devel image.
- Enabled GGML_CUDA during the compilation of llama-cpp-python.
- Configured docker-compose.yml to reserve all available NVIDIA GPUs and set N_GPU_LAYERS=-1 for full VRAM offloading.
Pipeline Enhancements:
- Integrated Qwen25VLChatHandler for native multimodal chat support.
- Added robust model resolution and auto-download logic for both the GGUF model and the multimodal projector (mmproj).
Dependencies: Bumped llama-cpp-python to 0.3.1 to support the latest Qwen-VL architecture.

Technical Details

Base Image: pytorch/pytorch:2.2.2-cuda12.1-cudnn8-devel
Model Architecture: Qwen2.5-VL (GGUF)
OCR Engine: EasyOCR (now accelerated by CUDA via the PyTorch base)

How to Verify

Ensure the NVIDIA Container Toolkit is installed on the host.
Build and start the services:
```
docker-compose up --build
```
Monitor logs to confirm cuBLAS or CUDA is initialized in llama-cpp-python.
Submit a complex PDF (e.g., a book cover with artistic fonts) and verify that the metadata is extracted correctly using visual context.

Environment Variables Added

VLM_MODEL_PATH: Path to the Qwen-VL GGUF file.
MMPROJ_MODEL_PATH: Path to the multimodal projector file.
N_GPU_LAYERS: Number of layers to offload to GPU (defaults to -1 for all).

Closes #1 (closed)

Edited Apr 26, 2026 by Praneeth Ashish

Draft: Fix: Enable Qwen2.5-VL Multimodal Support and CUDA GPU Acceleration

Description

Key Changes

Technical Details

How to Verify

Environment Variables Added

Merge request reports