Praneeth Ashish requested to merge feat/multi-format-pipeline-phase1 into develop Apr 26, 2026

Merge Request: Phase 1 - Multi-Format Metadata Extraction Pipeline

Description

This MR implements Phase 1 of the metadata extraction system, creating a unified extraction endpoint that routes files to specialized pipelines based on their extensions.

Changes

New Features

Unified /extract Endpoint: Single API endpoint and CLI command that handle multiple file formats
Text Pipeline (.json, .md): Direct LLM processing for structured metadata extraction
PDF Pipeline (.pdf): Delegates OCR to the dockerized vParse (mineru-dots) API, then applies LLM processing and OpenLibrary ISBN enrichment
Image Pipeline (.jpg, .png): EXIF/PIL metadata extraction with GPS coordinate normalization
Format Router: Automatic pipeline selection based on file extension
vParse Client: External OCR API integration for PDF processing

New Files

docs/PRD_Phase1.md - Product requirements document
bookextractor/vparse_client.py - vParse OCR API client
bookextractor/pipeline.py - Multi-format extraction pipeline router
tests/test_pipeline.py - Pipeline extraction tests
tests/test_vparse_client.py - vParse client tests
tests/test_image_utils.py - Image utilities and GPS normalization tests
tests/test_main.py - API endpoint tests
tests/test_external_api.py - External API integration tests
.dockerignore, .env.example
Dockerfile.bookextractor, Dockerfile.vparse* - Container definitions

Modified Files

bookextractor/models.py - Added format-specific metadata models (ImageMetadata, TextMetadata, PDFMetadata)
bookextractor/main.py - Format router implementation with extension-based pipeline delegation
bookextractor/image_utils.py - EXIF extraction and GPS normalization utilities
docker-compose.yml - Multi-container orchestration for bookextractor and vParse
README.md - Phase 1 documentation and usage examples
pyproject.toml, uv.lock - Dependency updates
bookextractor/ocr.py - Deprecated (migrated to vparse_client)
bookextractor/pdf_utils.py - Deprecated (migrated to pipeline)

Technical Implementation

Format Router

Inspects file extensions and delegates to specialized pipeline processors
Supported formats: .pdf, .json, .md, .jpg, .png
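
For reviewers, the extension-based dispatch can be pictured roughly as follows. This is a minimal sketch, not the code in bookextractor/main.py: the handler stubs, the PIPELINES table, UnsupportedFormatError, and the case-insensitive matching are all illustrative assumptions.

```python
from pathlib import Path
from typing import Callable, Dict


class UnsupportedFormatError(ValueError):
    """Raised when no pipeline is registered for a file extension (hypothetical)."""


# Hypothetical stand-ins for the real pipeline entry points.
def run_pdf_pipeline(path: Path) -> dict:
    return {"pipeline": "pdf", "file": path.name}


def run_text_pipeline(path: Path) -> dict:
    return {"pipeline": "text", "file": path.name}


def run_image_pipeline(path: Path) -> dict:
    return {"pipeline": "image", "file": path.name}


# One entry per supported extension listed in this MR.
PIPELINES: Dict[str, Callable[[Path], dict]] = {
    ".pdf": run_pdf_pipeline,
    ".json": run_text_pipeline,
    ".md": run_text_pipeline,
    ".jpg": run_image_pipeline,
    ".png": run_image_pipeline,
}


def route(filename: str) -> dict:
    """Pick a pipeline from the file extension and run it."""
    path = Path(filename)
    ext = path.suffix.lower()  # assumption: matching is case-insensitive
    handler = PIPELINES.get(ext)
    if handler is None:
        raise UnsupportedFormatError(f"Unsupported extension: {ext or '(none)'}")
    return handler(path)
```

A table-driven router keeps the supported-format list in one place, so adding a format in a later phase would only mean registering another entry.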

PDF Pipeline

Calls external vParse OCR API via vparse_client.py
vParse JSON output is processed by the LLM to produce structured metadata
ISBN validation and enrichment via OpenLibrary API
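
As an illustration of the ISBN step, the sketch below validates ISBN-10 and ISBN-13 check digits before building an OpenLibrary lookup URL. The function names are hypothetical, and the choice of the openlibrary.org/isbn/<isbn>.json endpoint is an assumption about how enrichment might be done, not a description of the code in this MR.

```python
import re


def normalize_isbn(raw: str) -> str:
    """Strip hyphens and whitespace; uppercase a trailing 'x' check digit."""
    return re.sub(r"[\s-]", "", raw).upper()


def is_valid_isbn(raw: str) -> bool:
    """Check the ISBN-10 (mod 11) or ISBN-13 (mod 10) checksum."""
    isbn = normalize_isbn(raw)
    if re.fullmatch(r"\d{9}[\dX]", isbn):
        # ISBN-10: weights 10 down to 1; 'X' stands for 10 in the last position.
        total = sum(
            (10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(isbn)
        )
        return total % 11 == 0
    if re.fullmatch(r"\d{13}", isbn):
        # ISBN-13: weights alternate 1, 3, 1, 3, ...
        total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(isbn))
        return total % 10 == 0
    return False


def openlibrary_url(raw: str) -> str:
    """Build a lookup URL (assumed endpoint); refuse invalid ISBNs."""
    if not is_valid_isbn(raw):
        raise ValueError(f"Invalid ISBN: {raw!r}")
    return f"https://openlibrary.org/isbn/{normalize_isbn(raw)}.json"
```

Validating the checksum locally first avoids spending an OpenLibrary request on ISBNs that OCR or the LLM has garbled.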

Text Pipeline

Reads .md or .json file contents
Sends raw text to LLM with specialized extraction prompt
Returns structured metadata JSON
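
The three steps above can be sketched as shown here. The prompt wording, the function names, and the idea of passing the LLM in as a plain callable are assumptions made for illustration; the real prompt and client live in the bookextractor package.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical prompt; the actual extraction prompt is not shown in this MR.
PROMPT_TEMPLATE = (
    "Extract the document's metadata (for example title, author, date) "
    "and reply with a single JSON object only.\n\n"
    "Document:\n{content}"
)


def extract_from_text(content: str, llm: Callable[[str], str]) -> dict:
    """Send raw text to the LLM and parse its reply as a JSON object."""
    reply = llm(PROMPT_TEMPLATE.format(content=content))
    try:
        metadata = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError("LLM reply was not valid JSON") from exc
    if not isinstance(metadata, dict):
        raise ValueError("LLM reply must be a JSON object")
    return metadata


def extract_text_file(path: str, llm: Callable[[str], str]) -> dict:
    """Read a .md or .json file and run it through extract_from_text."""
    return extract_from_text(Path(path).read_text(encoding="utf-8"), llm)
```

Injecting the LLM as a callable is one way to let unit tests substitute a canned reply without any network access.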

Image Pipeline

Uses piexif/exif and Pillow for metadata extraction
Extracts: width, height, format, color_space, bit_depth, exif_camera_make, exif_camera_model, exif_date_taken, exif_gps_latitude, exif_gps_longitude, exif_lens, dpi_horizontal, dpi_vertical
GPS normalization: Converts degrees/minutes/seconds to decimal degrees
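
For reference, the degrees/minutes/seconds conversion is decimal = degrees + minutes/60 + seconds/3600, negated for southern and western hemispheres. The sketch below assumes EXIF values may arrive either as (numerator, denominator) pairs or as plain numbers, and that the hemisphere reference may be str or bytes; the function names and the 6-decimal rounding are illustrative, not taken from image_utils.py.

```python
from typing import Sequence, Tuple, Union

Rational = Union[Tuple[int, int], int, float]


def _to_float(value: Rational) -> float:
    """Accept an EXIF-style (numerator, denominator) pair or a plain number."""
    if isinstance(value, tuple):
        num, den = value
        return num / den if den else 0.0  # guard against zero denominators
    return float(value)


def dms_to_decimal(dms: Sequence[Rational], ref: Union[str, bytes]) -> float:
    """Convert (degrees, minutes, seconds) plus a hemisphere ref to decimal degrees."""
    if isinstance(ref, bytes):
        ref = ref.decode("ascii")
    degrees, minutes, seconds = (_to_float(v) for v in dms)
    decimal = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention.
    if ref.strip().upper() in ("S", "W"):
        decimal = -decimal
    return round(decimal, 6)
```

For example, 30 degrees 30 minutes 0 seconds South becomes -30.5.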

Testing

All tests pass. The suite covers:

Mocked external APIs (vParse, OpenLibrary, LLM) for unit and integration testing
GPS normalization across various EXIF coordinate formats
Router logic, confirming every supported extension routes to the correct pipeline
Total: 853 lines of new test code across 5 test files
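
To show the mocking approach in miniature, here is a sketch using unittest.mock from the standard library. The extract_pdf_metadata function, its parameters, and the parse method on the OCR client are hypothetical stand-ins; the actual tests in tests/test_pipeline.py may be structured differently.

```python
from unittest.mock import Mock


def extract_pdf_metadata(pdf_path, ocr_client, llm):
    """Hypothetical pipeline step: OCR the PDF, then hand the OCR JSON to the LLM."""
    ocr_json = ocr_client.parse(pdf_path)
    return llm(ocr_json)


def test_pdf_pipeline_uses_mocked_services():
    # Stand-in for the vParse client; no container or network is needed.
    ocr_client = Mock()
    ocr_client.parse.return_value = {"pages": ["Title page text"]}

    # Stand-in for the LLM call, returning canned structured metadata.
    llm = Mock(return_value={"title": "Example Book"})

    result = extract_pdf_metadata("book.pdf", ocr_client, llm)

    assert result == {"title": "Example Book"}
    ocr_client.parse.assert_called_once_with("book.pdf")
    llm.assert_called_once_with({"pages": ["Title page text"]})
```

Asserting on the calls as well as the result checks that the OCR output is actually what gets passed to the LLM.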

Deployment

Run docker-compose up --build to start bookextractor and vParse containers
Configure environment variables from .env.example
vParse OCR service runs independently for scalable processing

Checklist

Code follows project style guidelines
Self-review completed
Tests added and passing
Documentation updated (README.md, PRD_Phase1.md)
Docker containers configured
GPS normalization tested
Format router verified for all extensions
No breaking changes to existing API

Out of Scope (Phase 2+)

Celery/Redis asynchronous task management
Automatic deployment of mineru-dots container
Audio/video file support
VLM-based image descriptions

Related Issues

Closes #3

Edited Apr 29, 2026 by Praneeth Ashish

feat(pipeline): implement multi-format extraction and vParse integration (Phase 1)