Fix: Upgrade Metadata Extraction Pipeline to Vision-Language Model (VLM) with GPU Acceleration
Overview
The current metadata extraction pipeline relies on a text-only workflow where raw OCR text is passed to a Gemma model. This approach struggles with complex book layouts, stylized fonts, and non-linear text placements common on book covers and copyright pages. Additionally, the current CPU-only setup leads to slow processing times for multi-page PDFs.
Requirements / Goals
- Multimodal Support: Enable the LLM to directly analyze page images to better understand visual context and layout.
- Performance Optimization: Implement CUDA acceleration for both the OCR engine (EasyOCR) and the LLM (llama-cpp-python) to reduce extraction latency.
- Multilingual Accuracy: Maintain or improve extraction quality for Telugu, Hindi, and English text.
- Hardware Utilization: Fully utilize available NVIDIA GPU resources within the Docker environment.
Proposed Solution
- Model Upgrade: Replace the text-only Gemma model with the Qwen2.5-VL-7B-Instruct Vision-Language Model.
- Architecture Update: Modify the extraction pipeline to convert PDF pages to images and pass them as Base64 encoded strings to the VLM using the Chat Completions API.
-
Environment Overhaul:
- Migration to a CUDA-enabled base image (
pytorch/pytorch). - Compilation of
llama-cpp-pythonwith CUDA backend support. - Configuration of GPU device passthrough in
docker-compose.yml.
- Migration to a CUDA-enabled base image (
- Auto-Download Logic: Implement robust handling for the two-part VLM architecture (GGUF model + mmproj projector) to ensure the system is "plug-and-play".
Success Criteria
-
System successfully loads Qwen2.5-VLinto GPU VRAM. -
Extraction results improve for documents where OCR text is fragmented or noisy. -
Processing time per PDF page is significantly reduced compared to the CPU-only baseline. -
System remains containerized and easy to deploy with docker-compose.