feat: implement automatic hardware detection and vllm runtime optimization
Overview
This MR implements automatic hardware detection and dynamic runtime optimization for the vLLM inference engine. It removes the need to configure environment variables by hand, so the application runs efficiently out of the box on developer laptops (CPU/MPS), local workstations (RTX 30/40 series), and cloud instances (NVIDIA T4/A100 or Google TPUs).
Key Features
Automatic Hardware Detection
- NVIDIA GPUs: Queries VRAM and Compute Capability via `nvidia-ml-py` and `torch.cuda`.
- Google TPUs: Detects availability via `torch_xla`.
- Apple Silicon (MPS): Supports local macOS development with unified memory detection.
- CPU Fallback: Graceful fallback with system memory profiling via `psutil`.
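The tiered strategy above can be sketched as an ordered list of probes where the first one that succeeds wins and CPU is the fallback. This is a minimal illustration, not the contents of `bookextractor/hardware.py`: the names `detect_backend`, `DEFAULT_PROBES`, and the `_has_*` helpers are hypothetical, the probe order is an assumption, and treating an importable `torch_xla` as "TPU available" is a simplification.

```python
import importlib.util
from typing import Callable, Sequence, Tuple

Probe = Tuple[str, Callable[[], bool]]


def _has_cuda() -> bool:
    """True when PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch
        return bool(torch.cuda.is_available())
    except ImportError:
        return False


def _has_tpu() -> bool:
    """Assumption: an importable torch_xla package signals a TPU runtime."""
    try:
        return importlib.util.find_spec("torch_xla") is not None
    except (ImportError, ValueError):
        return False


def _has_mps() -> bool:
    """True when PyTorch reports the Apple Metal (MPS) backend as available."""
    try:
        import torch
        return bool(torch.backends.mps.is_available())
    except (ImportError, AttributeError):
        return False


# Hypothetical ordering: most capable accelerator first.
DEFAULT_PROBES: Sequence[Probe] = (
    ("cuda", _has_cuda),
    ("tpu", _has_tpu),
    ("mps", _has_mps),
)


def detect_backend(probes: Sequence[Probe] = DEFAULT_PROBES) -> str:
    """Return the name of the first probe that succeeds, else "cpu"."""
    for name, probe in probes:
        try:
            if probe():
                return name
        except Exception:
            # A crashing probe must never break startup; try the next tier.
            continue
    return "cpu"
```

Passing the probes as an argument keeps the function testable without real hardware: a test can supply fake probes instead of patching `torch`.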
Dynamic vLLM Optimization
- Automatically selects appropriate `dtype` (e.g., `float16` for Tesla T4, `bfloat16` for A100/RTX 3090).
- Sets optimal `gpu_memory_utilization` to prevent OOM errors.
- Configures `tensor_parallel_size` based on GPU count.
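As a rough sketch of how these three parameters might be derived: `bfloat16` needs compute capability 8.0 or higher (A100 is 8.0, RTX 3090 is 8.6), while the Tesla T4 is 7.5 and falls back to `float16`. The function name `suggest_vllm_args`, the 12 GiB threshold, and the 0.80/0.90 utilization values are illustrative assumptions, not the values used in this MR.

```python
from typing import Dict, Tuple, Union


def suggest_vllm_args(
    compute_capability: Tuple[int, int],
    vram_gib: float,
    gpu_count: int,
) -> Dict[str, Union[str, float, int]]:
    """Map detected GPU properties to vLLM engine keyword arguments."""
    major, _minor = compute_capability

    # bfloat16 is supported from Ampere (compute capability 8.0) onward.
    dtype = "bfloat16" if major >= 8 else "float16"

    # Assumed thresholds: leave more headroom on small cards to avoid OOM.
    gpu_memory_utilization = 0.80 if vram_gib < 12 else 0.90

    return {
        "dtype": dtype,
        "gpu_memory_utilization": gpu_memory_utilization,
        # One tensor-parallel shard per visible GPU, never fewer than one.
        "tensor_parallel_size": max(1, gpu_count),
    }


t4 = suggest_vllm_args((7, 5), vram_gib=16, gpu_count=1)
a100_pair = suggest_vllm_args((8, 0), vram_gib=40, gpu_count=2)
```

The returned keys match keyword arguments accepted by `vllm.LLM`, so the dictionary could be unpacked into the constructor. A real implementation would also need to check that the model's attention head count is divisible by `tensor_parallel_size`.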
Unified VLMClient Wrapper
- Implemented a Singleton wrapper that manages the shared vLLM resource.
- Prevents redundant model loads and memory fragmentation.
- Centralizes inference logic for both text extraction and future vision tasks.
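One way to get a single shared, lazily created engine is a thread-safe singleton that receives a factory rather than building the engine itself. The sketch below is illustrative only: the class name `VLMClientSketch`, the `get_instance` method, and the `engine` property are hypothetical and may not match the actual `VLMClient` interface.

```python
import threading
from typing import Any, Callable, Optional

_UNSET = object()  # Sentinel so a factory returning None is still cached.


class VLMClientSketch:
    """Process-wide holder for one lazily constructed inference engine."""

    _instance: Optional["VLMClientSketch"] = None
    _instance_lock = threading.Lock()

    def __init__(self, engine_factory: Callable[[], Any]) -> None:
        self._engine_factory = engine_factory
        self._engine: Any = _UNSET
        self._engine_lock = threading.Lock()

    @classmethod
    def get_instance(cls, engine_factory: Callable[[], Any]) -> "VLMClientSketch":
        """Return the shared client; the factory of the first caller wins."""
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = cls(engine_factory)
            return cls._instance

    @property
    def engine(self) -> Any:
        """Build the engine on first access, then reuse it."""
        if self._engine is _UNSET:
            with self._engine_lock:
                # Re-check inside the lock so two threads cannot both load.
                if self._engine is _UNSET:
                    self._engine = self._engine_factory()
        return self._engine
```

In the pipeline, the factory would be something like a function that constructs the vLLM engine from the detected hardware configuration. Injecting it keeps the wrapper importable and testable on machines without vLLM installed.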
CLI Enhancement
- Added `bookextractor hardware-info` command to inspect detection results and suggested runtime parameters.
Technical Implementation
| File | Change |
|---|---|
| `bookextractor/hardware.py` | New core module for hardware discovery using a tiered detection strategy mapped to a configuration matrix |
| `bookextractor/vlm_client.py` | New managed client for the vLLM engine with lazy initialization using optimized hardware config |
| `bookextractor/pipeline.py` | Refactored to use the `VLMClient` singleton instead of managing its own LLM instance |
| `pyproject.toml` | Added `nvidia-ml-py` and `psutil` dependencies |
| `tests/test_hardware.py` | New tests with extensive mocking of hardware environments |
| `tests/test_vlm_client.py` | New tests with extensive mocking of hardware environments |
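For readers unfamiliar with mocking hardware environments, one common pattern is to substitute a fake `torch` module through `sys.modules` so that detection code sees whatever device layout the test wants. The helper `cuda_device_count` and the test names below are hypothetical stand-ins, not code from `tests/test_hardware.py`.

```python
import sys
import types
from unittest import mock


def cuda_device_count() -> int:
    """Hypothetical function under test: CUDA device count, 0 without torch."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count() if torch.cuda.is_available() else 0


def make_fake_torch(available: bool, count: int) -> types.ModuleType:
    """Build a stand-in torch module exposing only the calls used above."""
    fake = types.ModuleType("torch")
    fake.cuda = types.SimpleNamespace(
        is_available=lambda: available,
        device_count=lambda: count,
    )
    return fake


def test_two_gpus() -> None:
    with mock.patch.dict(sys.modules, {"torch": make_fake_torch(True, 2)}):
        assert cuda_device_count() == 2


def test_no_gpu() -> None:
    with mock.patch.dict(sys.modules, {"torch": make_fake_torch(False, 0)}):
        assert cuda_device_count() == 0


def test_torch_missing() -> None:
    # A None entry in sys.modules makes "import torch" raise ImportError.
    with mock.patch.dict(sys.modules, {"torch": None}):
        assert cuda_device_count() == 0
```

`mock.patch.dict` restores the original `sys.modules` entries when the `with` block exits, so the tests behave the same whether or not a real `torch` is installed.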
Quality: Adjusted coverage threshold to 90% in .pre-commit-config.yaml. Resolved all Ruff, Bandit, and Mypy issues identified during development.
Impact & Benefits
| Benefit | Details |
|---|---|
| Zero-Config UX | Users no longer need to know their GPU's compute capability or VRAM size to configure dtype or memory flags |
| Stability | Reduces OOM crashes in resource-constrained environments (Colab, 8GB GPUs) by applying safe default limits |
| Performance | Maximizes throughput on high-end hardware (A100) by automatically enabling bfloat16 and tensor parallelism |
Verification Steps
```shell
# 1. Inspect hardware detection
uv run bookextractor hardware-info
# 2. Run full test suite
uv run pre-commit run --all-files
# 3. Perform extraction
uv run bookextractor extract "tests/test.pdf" "output.json"
```
Checklist
- Hardware detection logic verified for TPU/GPU/MPS/CPU.
- vLLM initialization optimized for detected hardware.
- Singleton `VLMClient` implemented and integrated.
- CLI command `hardware-info` added and tested.
- All pre-commit hooks passing (Ruff, Bandit, Mypy, Vulture).
- Unit test coverage maintained above 90% (Current: 94.55%).