Skip to content

Draft: Fix: Enable Qwen2.5-VL Multimodal Support and CUDA GPU Acceleration

Praneeth Ashish requested to merge feat/vlm-cuda-support into main

Description

This MR upgrades the metadata extraction pipeline from a text-only workflow to a Vision-Language Model (VLM) workflow. It introduces GPU acceleration for both OCR and LLM tasks and enables the model to "see" page images directly, significantly improving extraction accuracy for complex document layouts.

Key Changes

  • Multimodal Extraction: Migrated from Gemma to Qwen2.5-VL-7B-Instruct. The pipeline now passes Base64-encoded page images to the model alongside raw OCR text.
  • GPU Acceleration:
    • Updated Dockerfile to use a CUDA-enabled PyTorch devel image.
    • Enabled GGML_CUDA during the compilation of llama-cpp-python.
    • Configured docker-compose.yml to reserve all available NVIDIA GPUs and set N_GPU_LAYERS=-1 for full VRAM offloading.
  • Pipeline Enhancements:
    • Integrated Qwen25VLChatHandler for native multimodal chat support.
    • Added robust model resolution and auto-download logic for both the GGUF model and the multimodal projector (mmproj).
  • Dependencies: Bumped llama-cpp-python to 0.3.1 to support the latest Qwen-VL architecture.

Technical Details

  • Base Image: pytorch/pytorch:2.2.2-cuda12.1-cudnn8-devel
  • Model Architecture: Qwen2.5-VL (GGUF)
  • OCR Engine: EasyOCR (now accelerated by CUDA via the PyTorch base)

How to Verify

  1. Ensure the NVIDIA Container Toolkit is installed on the host.
  2. Build and start the services:
    docker-compose up --build
  3. Monitor logs to confirm cuBLAS or CUDA is initialized in llama-cpp-python.
  4. Submit a complex PDF (e.g., a book cover with artistic fonts) and verify that the metadata is extracted correctly using visual context.

Environment Variables Added

  • VLM_MODEL_PATH: Path to the Qwen-VL GGUF file.
  • MMPROJ_MODEL_PATH: Path to the multimodal projector file.
  • N_GPU_LAYERS: Number of layers to offload to GPU (defaults to -1 for all).

Closes #1 (closed)

Edited by Praneeth Ashish

Merge request reports

Loading