Fix: Replace llama-cpp with vLLM for GPU-accelerated Gemma-4 inference

Praneeth Ashish requested to merge feat/vllm-gemma4-gpu into develop

MR Description

Summary

Replace llama-cpp-python with vllm for GPU-accelerated inference using google/gemma-4-E4B-it. This migration enables proper CUDA/PyTorch GPU support and aligns the codebase with the documented Phase 2 architecture.

Key Changes

  • Inference Engine: llama-cpp → vllm (GPU-accelerated)
  • Model Format: GGUF files → HuggingFace safetensors (via vLLM)
  • Model Cache: Project's models/ volume → Host's ~/.cache/huggingface/hub/
  • Image VLM: Placeholder added for future Gemma-4 vision support
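
The engine swap roughly comes down to the call shape below. This is a minimal sketch under stated assumptions, not the code in bookextractor/pipeline.py: the names `first_completions` and `generate_texts` and the sampling values are hypothetical. `vllm` is imported inside the function so the module still imports on a machine without it.

```python
from typing import Any, Iterable, List


def first_completions(request_outputs: Iterable[Any]) -> List[str]:
    """Return the first completion's text for each vLLM RequestOutput."""
    # Each RequestOutput carries its completions in `.outputs`;
    # with the default n=1 there is one completion per prompt.
    return [ro.outputs[0].text for ro in request_outputs]


def generate_texts(prompts: List[str], model: str = "google/gemma-4-E4B-it") -> List[str]:
    """Run a batch of prompts through vLLM (needs a CUDA GPU and vllm installed)."""
    # Imported lazily so that importing this module does not require vllm.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model)  # weights are fetched from the Hub on first use
    params = SamplingParams(temperature=0.0, max_tokens=512)  # illustrative values
    return first_completions(llm.generate(prompts, params))
```

Note that `generate()` is a method of `LLM`; `SamplingParams` only carries the decoding options passed to it.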

Changes

| File | Change |
| --- | --- |
| pyproject.toml | Replace llama-cpp-python with vllm>=0.8.0, add huggingface_hub |
| bookextractor/pipeline.py | Use vllm.LLM.generate() with SamplingParams |
| bookextractor/models.py | Add ImageVLMMetadata placeholder |
| bookextractor/vlm_client.py | New placeholder for future image VLM |
| .env.example | Replace GGUF env vars with VLLM_MODEL, HF_TOKEN |
| Dockerfile.bookextractor | Use nvidia/cuda:12.4.0-runtime-ubuntu22.04 base |
| docker-compose.yml | Add GPU resources, HF cache bind mount |
| tests/test_pipeline.py | Update mocks for vLLM API |
| scripts/setup_models.py | Remove Gemma GGUF download (vLLM handles it) |
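
One way to mock the vLLM API without a GPU is to place a stand-in `vllm` module in `sys.modules` for the duration of a test. The sketch below is an assumption about how such a test could look, not the contents of tests/test_pipeline.py; `run_prompt` is a hypothetical stand-in for the pipeline code under test.

```python
import sys
import types
from unittest import mock


def run_prompt(prompt: str) -> str:
    """Hypothetical stand-in for the pipeline code under test."""
    from vllm import LLM, SamplingParams  # resolved at call time, so it can be faked

    llm = LLM(model="google/gemma-4-E4B-it")
    outputs = llm.generate([prompt], SamplingParams(max_tokens=16))
    return outputs[0].outputs[0].text


def make_fake_vllm(reply: str) -> types.ModuleType:
    """Build a fake `vllm` module whose LLM.generate() returns `reply`."""
    fake = types.ModuleType("vllm")
    completion = types.SimpleNamespace(text=reply)
    request_output = types.SimpleNamespace(outputs=[completion])
    fake.LLM = mock.MagicMock(name="LLM")
    fake.LLM.return_value.generate.return_value = [request_output]
    fake.SamplingParams = mock.MagicMock(name="SamplingParams")
    return fake


def test_run_prompt_uses_vllm() -> None:
    fake = make_fake_vllm("Chapter 1")
    # patch.dict restores sys.modules when the block exits.
    with mock.patch.dict(sys.modules, {"vllm": fake}):
        assert run_prompt("Extract the title") == "Chapter 1"
    fake.LLM.assert_called_once_with(model="google/gemma-4-E4B-it")
    fake.SamplingParams.assert_called_once_with(max_tokens=16)
```

Faking the whole module keeps the test independent of whether vllm is installed, at the cost of not checking the real signatures.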

Storage Architecture

Host Filesystem:
~/.cache/huggingface/hub/           ← vLLM model cache (bind mount)
  └── models--google--gemma-4-E4B-it/ ← Downloaded by vLLM on first run

Docker Volume:
/models/                            ← VParse OCR models only
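
To check whether the model is already present in the bind-mounted cache, note that the Hugging Face Hub cache names each repository folder `models--<org>--<name>`. A small sketch; `hub_repo_dir` is a hypothetical helper, and the default root assumes neither `HF_HOME` nor `HF_HUB_CACHE` has been overridden.

```python
from pathlib import Path
from typing import Optional


def hub_repo_dir(model_id: str, cache_root: Optional[Path] = None) -> Path:
    """Return the folder the Hub cache uses for `model_id`."""
    if cache_root is None:
        # Default location, assuming HF_HOME / HF_HUB_CACHE are unset.
        cache_root = Path.home() / ".cache" / "huggingface" / "hub"
    # The cache flattens "org/name" into "models--org--name".
    return cache_root / ("models--" + model_id.replace("/", "--"))
```

For example, `hub_repo_dir("google/gemma-4-E4B-it").exists()` would report whether a first-run download has already happened on this host.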

Environment Variables

| Variable | Description | Required |
| --- | --- | --- |
| VLLM_MODEL | HuggingFace model ID | Yes (default: google/gemma-4-E4B-it) |
| HF_TOKEN | HuggingFace access token | Yes (gated model) |
| VLLM_TENSOR_PARALLEL_SIZE | GPU count | No (default: 1) |
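
A possible way to read these variables at startup is sketched below. The `VLLMSettings` class and `load_settings` function are hypothetical names, not part of this MR; only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class VLLMSettings:
    model: str
    hf_token: str
    tensor_parallel_size: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> VLLMSettings:
    """Read the vLLM settings from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        # Fail early instead of on the first gated download attempt.
        raise RuntimeError("HF_TOKEN is required to download the gated Gemma model")
    size = int(env.get("VLLM_TENSOR_PARALLEL_SIZE", "1"))
    if size < 1:
        raise ValueError("VLLM_TENSOR_PARALLEL_SIZE must be >= 1")
    return VLLMSettings(
        model=env.get("VLLM_MODEL", "google/gemma-4-E4B-it"),
        hf_token=token,
        tensor_parallel_size=size,
    )
```

The resulting `tensor_parallel_size` would then be passed to `vllm.LLM(..., tensor_parallel_size=...)`.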

Breaking Changes

  • HF_TOKEN is now required for gated Gemma model access
  • GPU is now required; the CPU fallback that llama-cpp provided is gone
  • Removed VLM_MODEL_PATH, MMPROJ_MODEL_PATH, VLM_MODEL_URL env vars
  • models/ volume no longer stores Gemma GGUF files
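
Existing deployments may still carry the removed variables in their `.env` files. A minimal sketch of a startup check that flags them; `stale_vars` is a hypothetical helper, not something this MR adds.

```python
from typing import List, Mapping

# Variables removed by this MR (from the list above).
REMOVED_VARS = ("VLM_MODEL_PATH", "MMPROJ_MODEL_PATH", "VLM_MODEL_URL")


def stale_vars(env: Mapping[str, str]) -> List[str]:
    """Return the removed variables that are still set in `env`."""
    return [name for name in REMOVED_VARS if name in env]
```

Calling `stale_vars(os.environ)` at startup and logging the result is one way to surface leftovers without failing the service.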

Prerequisites

  1. Accept Gemma model terms: https://huggingface.co/google/gemma-4-E4B-it
  2. Docker with NVIDIA runtime (nvidia-container-toolkit) configured
  3. NVIDIA GPU with CUDA 12.4+ support
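
A quick preflight for the GPU prerequisite can be scripted. The sketch below assumes `nvidia-smi` is on the PATH and uses its `-L` flag to list devices; `list_gpus` is a hypothetical helper, and it checks neither the CUDA version nor the Docker NVIDIA runtime.

```python
import shutil
import subprocess
from typing import List


def list_gpus() -> List[str]:
    """Return one line per GPU reported by `nvidia-smi -L`, or [] if unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []  # driver tools not installed or not on PATH
    try:
        result = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    # Device lines start with "GPU <index>:".
    return [line for line in result.stdout.splitlines() if line.startswith("GPU ")]
```

An empty list on the host suggests the container will not see a GPU either.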

Testing

# Build
docker compose build bookextractor

# Run
docker compose up bookextractor

# Test extraction
curl -X POST http://localhost:8000/extract -F "[email protected]"

# Verify HF cache
ls ~/.cache/huggingface/hub/

# Verify GPU usage
nvidia-smi

Issues

Closes #7
