Praneeth Ashish requested to merge feat/vllm-gemma4-gpu into develop Apr 30, 2026

MR Description

Summary

Replace llama-cpp-python with vllm for GPU-accelerated inference using google/gemma-4-E4B-it. This migration enables proper CUDA/PyTorch GPU support and aligns the codebase with the documented Phase 2 architecture.

Key Changes

Inference Engine: llama-cpp-python → vLLM (GPU-accelerated)
Model Format: GGUF files → HuggingFace safetensors (via vLLM)
Model Cache: Project's models/ volume → Host's ~/.cache/huggingface/hub/
Image VLM: Placeholder added for future Gemma-4 vision support

Changes

File	Change
`pyproject.toml`	Replace `llama-cpp-python` with `vllm>=0.8.0`, add `huggingface_hub`
`bookextractor/pipeline.py`	Use `vllm.LLM.generate()` with `SamplingParams`
`bookextractor/models.py`	Add `ImageVLMMetadata` placeholder
`bookextractor/vlm_client.py`	New placeholder for future image VLM
`.env.example`	Replace GGUF env vars with `VLLM_MODEL`, `HF_TOKEN`
`Dockerfile.bookextractor`	Use `nvidia/cuda:12.4.0-runtime-ubuntu22.04` base
`docker-compose.yml`	Add GPU resources, HF cache bind mount
`tests/test_pipeline.py`	Update mocks for vLLM API
`scripts/setup_models.py`	Remove Gemma GGUF download (vLLM handles it)
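
For reviewers unfamiliar with the vLLM offline API, the `pipeline.py` change can be pictured roughly as below. This is a minimal sketch, not the code in this MR: `build_llm` and `generate_texts` are hypothetical helper names, and the sampling values in the usage comment are illustrative. It relies only on `vllm.LLM`, `vllm.SamplingParams`, and `LLM.generate()`, which returns one `RequestOutput` per prompt.

```python
from typing import Any, List, Sequence


def build_llm(model: str = "google/gemma-4-E4B-it", tensor_parallel_size: int = 1) -> Any:
    """Create a vLLM engine (hypothetical helper, not taken from this MR)."""
    # Imported lazily so unit tests can run on machines without vllm or a GPU.
    from vllm import LLM

    return LLM(model=model, tensor_parallel_size=tensor_parallel_size)


def generate_texts(llm: Any, prompts: Sequence[str], sampling_params: Any) -> List[str]:
    """Run one batched generate() call and return the first completion per prompt."""
    outputs = llm.generate(list(prompts), sampling_params)
    # Each RequestOutput carries a list of completions; keep the first one.
    return [request_output.outputs[0].text for request_output in outputs]


# Usage on a GPU host (illustrative values):
#   from vllm import SamplingParams
#   llm = build_llm()
#   params = SamplingParams(temperature=0.0, max_tokens=512)
#   texts = generate_texts(llm, ["Extract the book title from: ..."], params)
```

Passing the engine in as an argument keeps `generate_texts` easy to exercise with a stub, which is one way the updated mocks in `tests/test_pipeline.py` could be structured.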

Storage Architecture

Host Filesystem:
~/.cache/huggingface/hub/                 ← vLLM model cache (bind mount)
  └── models--google--gemma-4-E4B-it/     ← Downloaded by vLLM on first run

Docker Volume:
/models/                                  ← VParse OCR models only
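
To check where the weights are expected to land, the cache location can be computed without importing `huggingface_hub`. The sketch below assumes the standard hub layout, where each model repository is stored under a folder named `models--<org>--<name>`, and the usual precedence of `HF_HUB_CACHE` over `HF_HOME`. The function names are hypothetical, and `XDG_CACHE_HOME` is ignored for brevity.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def hub_cache_root(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the hub cache root from HF_HUB_CACHE, then HF_HOME, then the default."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    hf_home = env.get("HF_HOME") or str(Path.home() / ".cache" / "huggingface")
    return Path(hf_home) / "hub"


def model_cache_dir(repo_id: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the folder in which the hub cache stores the given model repository."""
    folder = "models--" + repo_id.replace("/", "--")
    return hub_cache_root(env) / folder


if __name__ == "__main__":
    target = model_cache_dir("google/gemma-4-E4B-it")
    print(target, "(present)" if target.is_dir() else "(not downloaded yet)")
```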

Environment Variables

Variable	Description	Required
`VLLM_MODEL`	HuggingFace model ID	Yes (default: `google/gemma-4-E4B-it`)
`HF_TOKEN`	HuggingFace access token	Yes (gated model)
`VLLM_TENSOR_PARALLEL_SIZE`	Number of GPUs used for tensor parallelism	No (default: 1)
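
One way to read these variables at startup is sketched below. It is an illustration under assumptions, not the loader used in this MR: `VLLMSettings` and `load_settings` are hypothetical names, and the choice to fail fast on a missing `HF_TOKEN` is a design suggestion.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_MODEL = "google/gemma-4-E4B-it"


@dataclass(frozen=True)
class VLLMSettings:
    model: str
    hf_token: str
    tensor_parallel_size: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> VLLMSettings:
    """Read the three variables from the table and validate them."""
    env = os.environ if env is None else env

    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is required to download the gated Gemma model")

    # Fall back to the documented default when VLLM_MODEL is unset or empty.
    model = env.get("VLLM_MODEL", "").strip() or DEFAULT_MODEL

    tensor_parallel_size = int(env.get("VLLM_TENSOR_PARALLEL_SIZE", "1"))
    if tensor_parallel_size < 1:
        raise ValueError("VLLM_TENSOR_PARALLEL_SIZE must be at least 1")

    return VLLMSettings(model, token, tensor_parallel_size)
```

Accepting an explicit mapping instead of always reading `os.environ` keeps the loader testable without touching the process environment.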

Breaking Changes

HF_TOKEN is now required for gated Gemma model access
GPU is now required; the CPU fallback that llama-cpp provided has been removed
Removed VLM_MODEL_PATH, MMPROJ_MODEL_PATH, VLM_MODEL_URL env vars
models/ volume no longer stores Gemma GGUF files
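
Deployments that still carry the old GGUF settings will silently ignore them after this change. A small check like the one below could surface that during migration. It is a sketch only: `find_removed_vars` is a hypothetical helper and is not part of this MR.

```python
import os
from typing import List, Mapping, Optional

# Environment variables removed by this MR (see the list above).
REMOVED_VARS = ("VLM_MODEL_PATH", "MMPROJ_MODEL_PATH", "VLM_MODEL_URL")


def find_removed_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the removed variables that are still set in the environment."""
    env = os.environ if env is None else env
    return [name for name in REMOVED_VARS if name in env]


if __name__ == "__main__":
    stale = find_removed_vars()
    if stale:
        print("These variables are no longer read and can be deleted:", ", ".join(stale))
```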

Prerequisites

Accept Gemma model terms: https://huggingface.co/google/gemma-4-E4B-it
Docker with NVIDIA runtime (nvidia-container-toolkit) configured
NVIDIA GPU with a driver that supports CUDA 12.4 or later

Testing

# Build
docker compose build bookextractor

# Run
docker compose up bookextractor

# Test extraction
curl -X POST http://localhost:8000/extract -F "[email protected]"

# Verify HF cache
ls ~/.cache/huggingface/hub/

# Verify GPU usage
nvidia-smi
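
The GPU check can also be scripted, for example in a smoke test that runs after the first extraction request. The sketch below shells out to `nvidia-smi` with its CSV query flags and parses the result. The helper names are hypothetical, and it assumes every GPU reports a numeric `memory.used` value in MiB.

```python
import shutil
import subprocess
from typing import List, Tuple

QUERY = ["nvidia-smi", "--query-gpu=name,memory.used", "--format=csv,noheader,nounits"]


def parse_gpu_rows(csv_text: str) -> List[Tuple[str, int]]:
    """Parse 'name, used_mib' lines into (name, used_mib) tuples."""
    rows = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        # Split on the last comma so a comma in the device name cannot break parsing.
        name, used = line.rsplit(",", 1)
        rows.append((name.strip(), int(used.strip())))
    return rows


def gpu_memory_in_use() -> List[Tuple[str, int]]:
    """Query nvidia-smi and return the memory used per GPU in MiB."""
    if shutil.which("nvidia-smi") is None:
        raise RuntimeError("nvidia-smi not found on PATH")
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    for name, used_mib in gpu_memory_in_use():
        print(f"{name}: {used_mib} MiB in use")
```

Memory use that rises well above idle once the model is loaded is a reasonable sign that inference is running on the GPU.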

### Issues
Closes #7

Fix: Replace llama-cpp with vLLM for GPU-accelerated Gemma-4 inference