Fix: Upgrade Metadata Extraction Pipeline to Vision-Language Model (VLM) with GPU Acceleration

Overview

The current metadata extraction pipeline relies on a text-only workflow where raw OCR text is passed to a Gemma model. This approach struggles with complex book layouts, stylized fonts, and non-linear text placements common on book covers and copyright pages. Additionally, the current CPU-only setup leads to slow processing times for multi-page PDFs.

Requirements / Goals

Multimodal Support: Enable the LLM to directly analyze page images to better understand visual context and layout.
Performance Optimization: Implement CUDA acceleration for both the OCR engine (EasyOCR) and the LLM (llama-cpp-python) to reduce extraction latency.
Multilingual Accuracy: Maintain or improve extraction quality for Telugu, Hindi, and English text.
Hardware Utilization: Fully utilize available NVIDIA GPU resources within the Docker environment.

Proposed Solution

Model Upgrade: Replace the text-only Gemma model with the Qwen2.5-VL-7B-Instruct Vision-Language Model.
Architecture Update: Modify the extraction pipeline to convert PDF pages to images and pass them as Base64 encoded strings to the VLM using the Chat Completions API.
Environment Overhaul:
- Migration to a CUDA-enabled base image (pytorch/pytorch).
- Compilation of llama-cpp-python with CUDA backend support.
- Configuration of GPU device passthrough in docker-compose.yml.
Auto-Download Logic: Implement robust handling for the two-part VLM architecture (GGUF model + mmproj projector) to ensure the system is "plug-and-play".

Success Criteria

System successfully loads Qwen2.5-VL into GPU VRAM.
Extraction results improve for documents where OCR text is fragmented or noisy.
Processing time per PDF page is significantly reduced compared to the CPU-only baseline.
System remains containerized and easy to deploy with docker-compose.