feat: implement automatic hardware detection and vllm runtime optimization

Kushal Lagichetty requested to merge chore/wrapper into develop

Overview

This MR implements automatic hardware detection and dynamic runtime optimization for the vLLM inference engine. It removes the need to configure environment variables by hand, so the application runs efficiently out of the box on developer laptops (CPU/MPS), local workstations (RTX 30/40 series), and cloud instances (NVIDIA T4/A100 or Google TPUs).


Key Features

Automatic Hardware Detection

  • NVIDIA GPUs: Queries VRAM and Compute Capability via nvidia-ml-py and torch.cuda.
  • Google TPUs: Detects availability via torch_xla.
  • Apple Silicon (MPS): Supports local macOS development with unified memory detection.
  • CPU Fallback: Graceful fallback with system memory profiling via psutil.
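
A tiered detection of this kind could be sketched roughly as follows. This is not the code in bookextractor/hardware.py: the `HardwareInfo` dataclass, the `detect_hardware` function, and the order of the tiers are assumptions made for illustration. The sketch also treats an importable torch_xla as "TPU present" and uses only torch.cuda for NVIDIA GPUs, leaving out the nvidia-ml-py queries.

```python
from __future__ import annotations

import importlib.util
from dataclasses import dataclass


@dataclass
class HardwareInfo:
    """Hypothetical result type; field names are illustrative only."""
    kind: str                       # "cuda", "tpu", "mps", or "cpu"
    device_count: int = 0
    memory_gb: float | None = None  # VRAM for CUDA, system RAM otherwise
    compute_capability: tuple[int, int] | None = None


def _has_module(name: str) -> bool:
    """Return True if `name` is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


def _system_memory_gb() -> float | None:
    """Total system RAM in GiB via psutil, or None if psutil is missing."""
    if not _has_module("psutil"):
        return None
    import psutil
    return psutil.virtual_memory().total / 1024**3


def detect_hardware() -> HardwareInfo:
    """Try each accelerator tier in turn and fall back to CPU."""
    if _has_module("torch"):
        import torch

        if torch.cuda.is_available():
            props = torch.cuda.get_device_properties(0)
            return HardwareInfo(
                kind="cuda",
                device_count=torch.cuda.device_count(),
                memory_gb=props.total_memory / 1024**3,
                compute_capability=torch.cuda.get_device_capability(0),
            )
        # Assumption: an importable torch_xla is taken to mean a TPU host.
        if _has_module("torch_xla"):
            return HardwareInfo(kind="tpu", memory_gb=_system_memory_gb())
        # Older torch builds have no torch.backends.mps, hence the getattr.
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            # Apple Silicon shares memory between CPU and GPU.
            return HardwareInfo(kind="mps", device_count=1,
                                memory_gb=_system_memory_gb())
    return HardwareInfo(kind="cpu", memory_gb=_system_memory_gb())
```

Because every optional dependency is probed before it is imported, the function still returns a CPU result on a machine with none of torch, torch_xla, or psutil installed.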

Dynamic vLLM Optimization

  • Automatically selects appropriate dtype (e.g., float16 for Tesla T4, bfloat16 for A100/RTX 3090).
  • Sets a conservative gpu_memory_utilization to reduce the risk of OOM errors.
  • Configures tensor_parallel_size based on GPU count.
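
The mapping from detected hardware to vLLM settings might look like the sketch below. The function name `suggest_vllm_kwargs` and the memory-utilization thresholds are hypothetical and are not taken from this MR. The dtype rule relies on bfloat16 requiring compute capability 8.0 or higher (Ampere and newer), which is why a T4 (7.5) gets float16.

```python
from __future__ import annotations


def suggest_vllm_kwargs(
    compute_capability: tuple[int, int],
    vram_gb: float,
    gpu_count: int,
) -> dict:
    """Return keyword arguments suitable for vLLM's engine constructor."""
    # Tuples compare element-wise, so (7, 5) < (8, 0) <= (8, 6).
    dtype = "bfloat16" if compute_capability >= (8, 0) else "float16"

    # Hypothetical thresholds: leave more headroom on small cards.
    if vram_gb <= 8:
        utilization = 0.80
    elif vram_gb <= 16:
        utilization = 0.85
    else:
        utilization = 0.90

    return {
        "dtype": dtype,
        "gpu_memory_utilization": utilization,
        # One tensor-parallel shard per visible GPU, never fewer than one.
        "tensor_parallel_size": max(1, gpu_count),
    }
```

The resulting dictionary uses vLLM's own parameter names, so it could be passed along as `LLM(model=..., **suggest_vllm_kwargs(...))`.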

Unified VLMClient Wrapper

  • Implements a singleton wrapper that manages the shared vLLM resource.
  • Prevents redundant model loads and memory fragmentation.
  • Centralizes inference logic for both text extraction and future vision tasks.
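
A lazily initialized singleton along these lines is one way to get that behavior. The sketch is an assumption about the general shape, not the contents of bookextractor/vlm_client.py: the `get` class method, the `engine` property, and the injected `engine_factory` callable (standing in for the real vLLM engine construction) are all hypothetical names.

```python
from __future__ import annotations

import threading
from typing import Any, Callable


class VLMClient:
    """Process-wide wrapper that builds its engine once, on first use."""

    _instance: "VLMClient | None" = None
    _lock = threading.Lock()

    def __init__(self, engine_factory: Callable[[], Any]) -> None:
        # engine_factory is a hypothetical hook; in practice it would
        # construct the vLLM engine from the detected hardware config.
        self._engine_factory = engine_factory
        self._engine: Any = None

    @classmethod
    def get(cls, engine_factory: Callable[[], Any]) -> "VLMClient":
        """Return the shared client, creating it on the first call."""
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls(engine_factory)
            return cls._instance

    @property
    def engine(self) -> Any:
        """Build the engine on first access, then reuse it."""
        if self._engine is None:
            with self._lock:
                # Re-check under the lock so only one thread loads the model.
                if self._engine is None:
                    self._engine = self._engine_factory()
        return self._engine
```

Deferring the factory call to the first `engine` access means that importing the module or fetching the client does not load the model.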

CLI Enhancement

  • Adds a bookextractor hardware-info command to inspect detection results and suggested runtime parameters.

Technical Implementation

| File | Change |
| --- | --- |
| bookextractor/hardware.py | New core module for hardware discovery using a tiered detection strategy mapped to a configuration matrix |
| bookextractor/vlm_client.py | New managed client for the vLLM engine with lazy initialization using optimized hardware config |
| bookextractor/pipeline.py | Refactored to use the VLMClient singleton instead of managing its own LLM instance |
| pyproject.toml | Added nvidia-ml-py and psutil dependencies |
| tests/test_hardware.py | New tests with extensive mocking of hardware environments |
| tests/test_vlm_client.py | New tests with extensive mocking of hardware environments |
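
As an illustration of mocking a hardware environment, the sketch below swaps in a fake psutil module so that a memory probe can be tested without depending on the host machine. The helper `system_memory_gb` and the test name are hypothetical and do not come from the test files in this MR.

```python
import sys
import types
from unittest import mock


def system_memory_gb() -> float:
    """Hypothetical helper under test: total RAM in GiB via psutil."""
    import psutil  # imported lazily so a test can substitute a fake module
    return psutil.virtual_memory().total / 1024**3


def test_system_memory_gb_with_fake_psutil() -> None:
    # Build a stand-in module that reports exactly 16 GiB of RAM.
    fake_psutil = types.ModuleType("psutil")
    fake_psutil.virtual_memory = lambda: types.SimpleNamespace(
        total=16 * 1024**3
    )
    # patch.dict restores sys.modules when the block exits.
    with mock.patch.dict(sys.modules, {"psutil": fake_psutil}):
        assert system_memory_gb() == 16.0
```

The same pattern extends to torch or pynvml: a fake module placed in `sys.modules` lets each hardware tier be exercised on a machine that has no such device.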

Quality: Adjusted coverage threshold to 90% in .pre-commit-config.yaml. Resolved all Ruff, Bandit, and Mypy issues identified during development.


Impact & Benefits

| Benefit | Details |
| --- | --- |
| Zero-Config UX | Users no longer need to know their GPU's compute capability or VRAM size to configure dtype or memory flags |
| Stability | Reduces OOM crashes in resource-constrained environments (Colab, 8GB GPUs) by applying safe default limits |
| Performance | Maximizes throughput on high-end hardware (A100) by automatically enabling bfloat16 and tensor parallelism |

Verification Steps

# 1. Inspect hardware detection
uv run bookextractor hardware-info

# 2. Run the full test suite and all pre-commit hooks
uv run pre-commit run --all-files

# 3. Perform extraction
uv run bookextractor extract "tests/test.pdf" "output.json"

Checklist

  • Hardware detection logic verified for TPU/GPU/MPS/CPU.
  • vLLM initialization optimized for detected hardware.
  • Singleton VLMClient implemented and integrated.
  • CLI command hardware-info added and tested.
  • All pre-commit hooks passing (Ruff, Bandit, Mypy, Vulture).
  • Unit test coverage maintained above 90% (Current: 94.55%).
