feat: implement automatic hardware detection and vllm runtime optimization (!15)

Kushal Lagichetty requested to merge chore/wrapper into develop May 07, 2026

Overview

This MR implements automatic hardware detection and dynamic runtime optimization for the vLLM inference engine. It removes the need to configure environment variables by hand, so the application runs efficiently out of the box on developer laptops (CPU/MPS), local workstations (RTX 30/40 series), and cloud instances (NVIDIA T4/A100 or Google TPUs).

Key Features

Automatic Hardware Detection

NVIDIA GPUs: Queries VRAM and Compute Capability via nvidia-ml-py and torch.cuda.
Google TPUs: Detects availability via torch_xla.
Apple Silicon (MPS): Supports local macOS development with unified memory detection.
CPU Fallback: Graceful fallback with system memory profiling via psutil.
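
The tiers above could be sketched roughly as follows. This is a minimal illustration, not the code in `bookextractor/hardware.py`: `HardwareInfo`, `detect_hardware`, and the probe helpers are hypothetical names, the probe order is an assumption, the NVML query via nvidia-ml-py is left out, and the TPU probe only checks that `torch_xla` is installed.

```python
from __future__ import annotations

import importlib.util
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HardwareInfo:
    """Hypothetical result type; field names are illustrative."""
    kind: str  # "cuda", "tpu", "mps", or "cpu"
    device_count: int = 1
    memory_bytes: Optional[int] = None  # VRAM for cuda, system RAM for mps/cpu
    compute_capability: Optional[tuple[int, int]] = None


def _system_memory() -> Optional[int]:
    # psutil is optional here so the sketch still runs without it.
    try:
        import psutil
    except ImportError:
        return None
    return psutil.virtual_memory().total


def _probe_cuda() -> Optional[HardwareInfo]:
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(0)
    return HardwareInfo(
        kind="cuda",
        device_count=torch.cuda.device_count(),
        memory_bytes=props.total_memory,
        compute_capability=(props.major, props.minor),
    )


def _probe_tpu() -> Optional[HardwareInfo]:
    # Simplification: only checks that torch_xla is importable.
    if importlib.util.find_spec("torch_xla") is None:
        return None
    return HardwareInfo(kind="tpu")


def _probe_mps() -> Optional[HardwareInfo]:
    try:
        import torch
    except ImportError:
        return None
    mps = getattr(torch.backends, "mps", None)
    if mps is None or not mps.is_available():
        return None
    # Apple Silicon uses unified memory, so system RAM is reported.
    return HardwareInfo(kind="mps", memory_bytes=_system_memory())


def detect_hardware(probes=(_probe_cuda, _probe_tpu, _probe_mps)) -> HardwareInfo:
    """Return the first tier that reports a device, else fall back to CPU."""
    for probe in probes:
        info = probe()
        if info is not None:
            return info
    return HardwareInfo(kind="cpu", memory_bytes=_system_memory())
```

Passing the probes as a parameter keeps the function easy to test: a test can substitute fake probes instead of patching `torch`.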

Dynamic vLLM Optimization

Automatically selects appropriate dtype (e.g., float16 for Tesla T4, bfloat16 for A100/RTX 3090).
Sets a safe gpu_memory_utilization limit to reduce the risk of OOM errors.
Configures tensor_parallel_size based on GPU count.
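
A rough sketch of how such a mapping might look is shown below. The `VllmSettings` type, the `suggest_vllm_settings` function, the 8 GiB threshold, and the 0.80/0.90 utilization values are illustrative assumptions rather than the values used in this MR. The dtype rule relies on bfloat16 needing compute capability 8.0 or higher, which is why a T4 (7.5) gets float16. Rounding the tensor-parallel size down to a power of two is also an assumption, made because vLLM needs the attention head count to be divisible by it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class VllmSettings:
    """Hypothetical container for the three tuned parameters."""
    dtype: str
    gpu_memory_utilization: float
    tensor_parallel_size: int


def suggest_vllm_settings(
    compute_capability: tuple[int, int],
    vram_gib: float,
    gpu_count: int,
) -> VllmSettings:
    # bfloat16 is supported from compute capability 8.0 (Ampere) onward.
    dtype = "bfloat16" if compute_capability >= (8, 0) else "float16"

    # Assumed thresholds: leave more headroom on small cards.
    utilization = 0.80 if vram_gib <= 8 else 0.90

    # Assumed policy: largest power of two that fits the GPU count.
    count = max(1, gpu_count)
    tensor_parallel = 1 << (count.bit_length() - 1)

    return VllmSettings(dtype, utilization, tensor_parallel)
```

The three field names match keyword arguments accepted by `vllm.LLM`, so a result like this could be unpacked with `dataclasses.asdict` when constructing the engine.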

Unified `VLMClient` Wrapper

Implemented a Singleton wrapper that manages the shared vLLM resource.
Prevents redundant model loads and memory fragmentation.
Centralizes inference logic for both text extraction and future vision tasks.
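
For readers unfamiliar with the pattern, a lazily initialized, thread-safe singleton can be sketched as below. `VLMClientSketch` and `engine_factory` are hypothetical names; the actual `VLMClient` in `bookextractor/vlm_client.py` may be structured differently. The factory stands in for whatever builds the vLLM engine, so the sketch runs without vLLM installed.

```python
from __future__ import annotations

import threading
from typing import Any, Callable, Optional


class VLMClientSketch:
    """Illustrative singleton that builds its engine on first use."""

    _instance: Optional["VLMClientSketch"] = None
    _instance_lock = threading.Lock()

    def __init__(self, engine_factory: Callable[[], Any]) -> None:
        self._engine_factory = engine_factory
        self._engine: Any = None
        self._engine_lock = threading.Lock()

    @classmethod
    def get_instance(
        cls, engine_factory: Optional[Callable[[], Any]] = None
    ) -> "VLMClientSketch":
        # The lock ensures two threads cannot both create an instance.
        with cls._instance_lock:
            if cls._instance is None:
                if engine_factory is None:
                    raise ValueError("engine_factory is required on first use")
                cls._instance = cls(engine_factory)
            return cls._instance

    @property
    def engine(self) -> Any:
        # Double-checked locking: the expensive load happens at most once.
        if self._engine is None:
            with self._engine_lock:
                if self._engine is None:
                    self._engine = self._engine_factory()
        return self._engine
```

Deferring the factory call until `engine` is first read means importing the module or fetching the client never triggers a model load by itself.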

CLI Enhancement

Added a `bookextractor hardware-info` command to inspect detection results and suggested runtime parameters.
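
As an illustration only, a subcommand of this shape could be wired up with `argparse` as shown below. The CLI framework actually used by `bookextractor` is not stated in this MR, and `collect_report` is a hypothetical stand-in that reports standard-library platform facts instead of the real detection results.

```python
from __future__ import annotations

import argparse
import json
import os
import platform
from typing import Optional, Sequence


def collect_report() -> dict:
    # Stand-in for the real hardware detection and suggested parameters.
    return {
        "platform": platform.system(),
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
    }


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="bookextractor")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("hardware-info", help="Print detected hardware as JSON")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    if args.command == "hardware-info":
        print(json.dumps(collect_report(), indent=2))
        return 0
    return 1
```

Returning an exit code from `main` instead of calling `sys.exit` inside it keeps the command easy to exercise from unit tests.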

Technical Implementation

| File | Change |
|---|---|
| `bookextractor/hardware.py` | New core module for hardware discovery using a tiered detection strategy mapped to a configuration matrix |
| `bookextractor/vlm_client.py` | New managed client for the vLLM engine with lazy initialization using optimized hardware config |
| `bookextractor/pipeline.py` | Refactored to use the `VLMClient` singleton instead of managing its own LLM instance |
| `pyproject.toml` | Added `nvidia-ml-py` and `psutil` dependencies |
| `tests/test_hardware.py` | New tests with extensive mocking of hardware environments |
| `tests/test_vlm_client.py` | New tests with extensive mocking of hardware environments |

Quality: Adjusted coverage threshold to 90% in .pre-commit-config.yaml. Resolved all Ruff, Bandit, and Mypy issues identified during development.

Impact & Benefits

| Benefit | Details |
|---|---|
| Zero-Config UX | Users no longer need to know their GPU's compute capability or VRAM size to configure `dtype` or memory flags |
| Stability | Reduces OOM crashes in resource-constrained environments (Colab, 8GB GPUs) by applying safe default limits |
| Performance | Maximizes throughput on high-end hardware (A100) by automatically enabling `bfloat16` and tensor parallelism |

Verification Steps

```bash
# 1. Inspect hardware detection
uv run bookextractor hardware-info

# 2. Run full test suite
uv run pre-commit run --all-files

# 3. Perform extraction
uv run bookextractor extract "tests/test.pdf" "output.json"
```

Checklist

Hardware detection logic verified for TPU/GPU/MPS/CPU.
vLLM initialization optimized for detected hardware.
Singleton VLMClient implemented and integrated.
CLI command hardware-info added and tested.
All pre-commit hooks passing (Ruff, Bandit, Mypy, Vulture).
Unit test coverage maintained above 90% (Current: 94.55%).
