Fix: Replace llama-cpp with vLLM for GPU-accelerated Gemma-4 inference
MR Description
Summary
Replace llama-cpp-python with vllm for GPU-accelerated inference using google/gemma-4-E4B-it. This migration enables proper CUDA/PyTorch GPU support and aligns the codebase with the documented Phase 2 architecture.
Key Changes
- Inference Engine: `llama-cpp` → `vllm` (GPU-accelerated)
- Model Format: GGUF files → HuggingFace safetensors (via vLLM)
- Model Cache: Project's `models/` volume → Host's `~/.cache/huggingface/hub/`
- Image VLM: Placeholder added for future Gemma-4 vision support
Changes
| File | Change |
|---|---|
| `pyproject.toml` | Replace `llama-cpp-python` with `vllm>=0.8.0`, add `huggingface_hub` |
| `bookextractor/pipeline.py` | Use the `vllm.LLM.generate()` API with `SamplingParams` |
| `bookextractor/models.py` | Add `ImageVLMMetadata` placeholder |
| `bookextractor/vlm_client.py` | New placeholder for future image VLM |
| `.env.example` | Replace GGUF env vars with `VLLM_MODEL`, `HF_TOKEN` |
| `Dockerfile.bookextractor` | Use `nvidia/cuda:12.4.0-runtime-ubuntu22.04` base |
| `docker-compose.yml` | Add GPU resources, HF cache bind mount |
| `tests/test_pipeline.py` | Update mocks for vLLM API |
| `scripts/setup_models.py` | Remove Gemma GGUF download (vLLM handles it) |
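
The `bookextractor/pipeline.py` change can be pictured with a minimal sketch like the one below. It assumes vLLM's standard offline API (`LLM`, `SamplingParams`, `LLM.generate`). The helper names `collect_texts` and `generate_texts` and the sampling values are hypothetical and are not taken from this MR.

```python
from typing import Any, Iterable, List


def collect_texts(request_outputs: Iterable[Any]) -> List[str]:
    """Return the first completion's text for each vLLM RequestOutput."""
    return [out.outputs[0].text for out in request_outputs]


def generate_texts(
    prompts: List[str],
    model: str = "google/gemma-4-E4B-it",
    tensor_parallel_size: int = 1,
) -> List[str]:
    """Run prompts through vLLM (hypothetical helper; needs a CUDA GPU)."""
    # Imported lazily so this module still imports on machines without vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model, tensor_parallel_size=tensor_parallel_size)
    # Assumed sampling settings; the values used in pipeline.py may differ.
    params = SamplingParams(temperature=0.0, max_tokens=512)
    return collect_texts(llm.generate(prompts, params))
```

Keeping the output handling in `collect_texts` separate from model loading is one way to let `tests/test_pipeline.py` mock the vLLM results without a GPU.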
Storage Architecture
Host Filesystem:
~/.cache/huggingface/hub/ ← vLLM model cache (bind mount)
└── models--google--gemma-4-E4B-it/ ← Downloaded by vLLM on first run
Docker Volume:
/models/ ← VParse OCR models only
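
To locate the cached model on the host, the following sketch resolves the expected directory. It relies on the Hugging Face Hub cache convention of naming repository folders `models--<org>--<name>` and on the `HF_HUB_CACHE` and `HF_HOME` overrides. The function names are hypothetical and not part of this MR.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def hub_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the Hub cache root, honoring HF_HUB_CACHE and HF_HOME."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


def model_cache_dir(model_id: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the folder the Hub cache uses for a model repository."""
    return hub_cache_dir(env) / ("models--" + model_id.replace("/", "--"))
```

With no overrides set, `model_cache_dir("google/gemma-4-E4B-it")` points inside `~/.cache/huggingface/hub/`, which is the directory bind-mounted into the container.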
Environment Variables
| Variable | Description | Required |
|---|---|---|
| `VLLM_MODEL` | HuggingFace model ID | Yes (default: `google/gemma-4-E4B-it`) |
| `HF_TOKEN` | HuggingFace access token | Yes (gated model) |
| `VLLM_TENSOR_PARALLEL_SIZE` | GPU count | No (default: 1) |
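
As an illustration of how these variables might be read at startup, here is a small sketch using only the standard library. The `VLLMSettings` dataclass and `load_settings` function are hypothetical names, and the validation rules are assumptions rather than the behavior of the actual pipeline.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_MODEL = "google/gemma-4-E4B-it"


@dataclass(frozen=True)
class VLLMSettings:
    model: str
    hf_token: str
    tensor_parallel_size: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> VLLMSettings:
    """Read the vLLM-related variables, applying the documented defaults."""
    env = os.environ if env is None else env
    token = (env.get("HF_TOKEN") or "").strip()
    if not token:
        raise ValueError("HF_TOKEN is required to download the gated Gemma model")
    # An unset or empty value falls back to a single GPU.
    tp_size = int(env.get("VLLM_TENSOR_PARALLEL_SIZE") or "1")
    if tp_size < 1:
        raise ValueError("VLLM_TENSOR_PARALLEL_SIZE must be at least 1")
    return VLLMSettings(
        model=env.get("VLLM_MODEL") or DEFAULT_MODEL,
        hf_token=token,
        tensor_parallel_size=tp_size,
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to exercise in unit tests.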
Breaking Changes
- `HF_TOKEN` is now required for gated Gemma model access
- A GPU is now required; the CPU fallback that llama-cpp provided is gone
- Removed the `VLM_MODEL_PATH`, `MMPROJ_MODEL_PATH`, and `VLM_MODEL_URL` env vars
- The `models/` volume no longer stores Gemma GGUF files
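
When upgrading an existing deployment, a quick check for leftover configuration can help. The sketch below lists which of the removed variables are still set. The helper name `stale_env_vars` is hypothetical and the check is not part of this MR.

```python
import os
from typing import List, Mapping, Optional

# Variables removed by this MR; they have no effect under vLLM.
REMOVED_VARS = ("VLM_MODEL_PATH", "MMPROJ_MODEL_PATH", "VLM_MODEL_URL")


def stale_env_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the removed GGUF-era variables that are still defined."""
    env = os.environ if env is None else env
    return [name for name in REMOVED_VARS if name in env]


if __name__ == "__main__":
    for name in stale_env_vars():
        print(f"warning: {name} is set but no longer used; remove it from .env")
```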
Prerequisites
- Accept Gemma model terms: https://huggingface.co/google/gemma-4-E4B-it
- Docker with NVIDIA runtime (`nvidia-container-toolkit`) configured
- NVIDIA GPU with CUDA 12.4+ support
Testing
# Build
docker compose build bookextractor
# Run
docker compose up bookextractor
# Test extraction
curl -X POST http://localhost:8000/extract -F "[email protected]"
# Verify HF cache
ls ~/.cache/huggingface/hub/
# Verify GPU usage
nvidia-smi
### Issues
Closes #7