feat(pipeline): implement multi-format extraction and vParse integration (Phase 1)
Merge Request: Phase 1 - Multi-Format Metadata Extraction Pipeline
Description
This MR implements Phase 1 of the metadata extraction system, creating a unified extraction endpoint that routes files to specialized pipelines based on their extensions.
Changes
New Features
- Unified /extract Endpoint: Single API endpoint and CLI command that handles multiple file formats
- Text Pipeline (.json, .md): Direct LLM processing for structured metadata extraction
- PDF Pipeline (.pdf): Delegates OCR to dockerized vParse (mineru-dots) API with LLM processing and OpenLibrary ISBN enrichment
- Image Pipeline (.jpg, .png): EXIF/PIL metadata extraction with GPS coordinate normalization
- Format Router: Automatic pipeline selection based on file extension
- vParse Client: External OCR API integration for PDF processing
New Files
- docs/PRD_Phase1.md - Product requirements document
- bookextractor/vparse_client.py - vParse OCR API client
- bookextractor/pipeline.py - Multi-format extraction pipeline router
- tests/test_pipeline.py - Pipeline extraction tests
- tests/test_vparse_client.py - vParse client tests
- tests/test_image_utils.py - Image utilities and GPS normalization tests
- tests/test_main.py - API endpoint tests
- tests/test_external_api.py - External API integration tests
- .dockerignore, .env.example
- Dockerfile.bookextractor, Dockerfile.vparse* - Container definitions
Modified Files
- bookextractor/models.py - Added format-specific metadata models (ImageMetadata, TextMetadata, PDFMetadata)
- bookextractor/main.py - Format router implementation with extension-based pipeline delegation
- bookextractor/image_utils.py - EXIF extraction and GPS normalization utilities
- docker-compose.yml - Multi-container orchestration for bookextractor and vParse
- README.md - Phase 1 documentation and usage examples
- pyproject.toml, uv.lock - Dependency updates
- bookextractor/ocr.py - Deprecated (migrated to vparse_client)
- bookextractor/pdf_utils.py - Deprecated (migrated to pipeline)
Technical Implementation
Format Router
- Inspects file extensions and delegates to specialized pipeline processors
- Supported formats: .pdf, .json, .md, .jpg, .png
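A minimal sketch of extension-based routing, for reviewers who want a concrete picture. The names PIPELINE_BY_EXTENSION and select_pipeline are hypothetical and not taken from bookextractor/pipeline.py, and the case-insensitive match is an assumption.

```python
from pathlib import Path

# Hypothetical mapping from extension to a pipeline label.
# The labels are illustrative; the real router may dispatch to functions or classes.
PIPELINE_BY_EXTENSION = {
    ".pdf": "pdf",
    ".json": "text",
    ".md": "text",
    ".jpg": "image",
    ".png": "image",
}


def select_pipeline(filename: str) -> str:
    """Return the pipeline label for a file, based only on its extension."""
    # Assumption: extensions are matched case-insensitively.
    ext = Path(filename).suffix.lower()
    try:
        return PIPELINE_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {ext!r}") from None
```

Under these assumptions, select_pipeline("scan.PDF") returns "pdf", and an unsupported file such as "talk.mp3" raises ValueError instead of falling through to a default pipeline.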
PDF Pipeline
- Calls external vParse OCR API via vparse_client.py
- JSON output processed by LLM for structured metadata
- ISBN validation and enrichment via OpenLibrary API
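The MR does not say how ISBNs are validated. One plausible local check before any OpenLibrary lookup is the standard ISBN-10 / ISBN-13 checksum, sketched below. The function name is_valid_isbn is hypothetical.

```python
DIGITS = set("0123456789")


def is_valid_isbn(raw: str) -> bool:
    """Check the ISBN-10 or ISBN-13 check digit; hyphens and spaces are ignored."""
    isbn = raw.replace("-", "").replace(" ", "").upper()

    if len(isbn) == 10:
        body, check = isbn[:9], isbn[9]
        if not set(body) <= DIGITS or check not in DIGITS | {"X"}:
            return False
        # Weights run 10..1; a check character of "X" stands for the value 10.
        values = [int(c) for c in body] + [10 if check == "X" else int(check)]
        return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0

    if len(isbn) == 13:
        if not set(isbn) <= DIGITS:
            return False
        # Weights alternate 1, 3, 1, 3, ... across all thirteen digits.
        total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(isbn))
        return total % 10 == 0

    return False
```

A check like this only confirms that the number is well formed. Whether the book exists, and what its metadata is, would still have to come from the enrichment step.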
Text Pipeline
- Reads .md or .json file contents
- Sends raw text to LLM with specialized extraction prompt
- Returns structured metadata JSON
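The three steps above can be sketched as follows. This is an illustration only: EXTRACTION_PROMPT, extract_text_metadata, and the injected call_llm callable are hypothetical names, and the actual prompt and LLM client used by this MR are not shown here.

```python
import json
from pathlib import Path
from typing import Callable

# Placeholder prompt; the real extraction prompt is not part of this description.
EXTRACTION_PROMPT = (
    "Extract bibliographic metadata from the document below and "
    "reply with a single JSON object.\n\n"
)


def extract_text_metadata(path: str, call_llm: Callable[[str], str]) -> dict:
    """Read a .md or .json file, send its raw text to an LLM, parse the JSON reply."""
    text = Path(path).read_text(encoding="utf-8")
    response = call_llm(EXTRACTION_PROMPT + text)
    metadata = json.loads(response)  # raises json.JSONDecodeError on a non-JSON reply
    if not isinstance(metadata, dict):
        raise ValueError("LLM reply was valid JSON but not an object")
    return metadata
```

Passing the LLM call in as a parameter keeps the sketch independent of any particular provider and makes it easy to substitute a fake in tests.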
Image Pipeline
- Uses piexif/exif and Pillow for metadata extraction
- Extracts: width, height, format, color_space, bit_depth, exif_camera_make, exif_camera_model, exif_date_taken, exif_gps_latitude, exif_gps_longitude, exif_lens, dpi_horizontal, dpi_vertical
- GPS normalization: Converts degrees/minutes/seconds to decimal degrees
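For reference, the degrees/minutes/seconds conversion is decimal = degrees + minutes/60 + seconds/3600, negated for southern latitudes and western longitudes. The sketch below assumes each EXIF component arrives either as a (numerator, denominator) pair or as a plain number. The name dms_to_decimal is hypothetical and not taken from image_utils.py.

```python
def _to_float(value) -> float:
    """Accept a (numerator, denominator) pair or any float-convertible number."""
    if isinstance(value, tuple):
        numerator, denominator = value
        return numerator / denominator
    return float(value)


def dms_to_decimal(dms, ref) -> float:
    """Convert an EXIF (degrees, minutes, seconds) triple to signed decimal degrees."""
    degrees, minutes, seconds = (_to_float(v) for v in dms)
    decimal = degrees + minutes / 60 + seconds / 3600
    # The hemisphere reference may be bytes (b"W") or str ("W").
    if isinstance(ref, bytes):
        ref = ref.decode("ascii")
    if ref.strip().upper() in ("S", "W"):
        decimal = -decimal
    return decimal
```

For example, 40° 26′ 46″ N gives 40 + 26/60 + 46/3600, roughly 40.4461, while the same magnitude with reference "S" gives the negative value. The sketch does not guard against a zero denominator, which a production helper would need to handle.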
Testing
All tests pass. The suite covers:
- Mock external APIs (vParse, OpenLibrary, LLM) for unit/integration testing
- GPS normalization verification from various EXIF formats
- Router logic ensures all supported extensions route to correct pipeline
- Total: 853 lines of new test coverage across 5 test files
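As an illustration of the mocking approach, the sketch below replaces an OpenLibrary lookup with a unittest.mock.Mock. The helper enrich_with_openlibrary and its fetch_json parameter are hypothetical and exist only for this example. The URL pattern is an assumption about which OpenLibrary endpoint is queried.

```python
from unittest.mock import Mock


def enrich_with_openlibrary(isbn: str, fetch_json) -> dict:
    """Hypothetical enrichment step: look up an ISBN and keep the title."""
    # Assumed endpoint pattern; the project may query OpenLibrary differently.
    data = fetch_json(f"https://openlibrary.org/isbn/{isbn}.json")
    return {"isbn": isbn, "title": data.get("title")}


def test_enrich_uses_mocked_openlibrary():
    # The mock stands in for the HTTP call, so the test never touches the network.
    fetch_json = Mock(return_value={"title": "Example Title"})

    result = enrich_with_openlibrary("9780306406157", fetch_json)

    fetch_json.assert_called_once_with(
        "https://openlibrary.org/isbn/9780306406157.json"
    )
    assert result == {"isbn": "9780306406157", "title": "Example Title"}
```

The same pattern would apply to the vParse and LLM clients: inject or patch the client, return a canned response, and assert on both the call and the resulting metadata.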
Deployment
- Run docker-compose up --build to start bookextractor and vParse containers
- Configure environment variables from .env.example
- vParse OCR service runs independently for scalable processing
Checklist
- Code follows project style guidelines
- Self-review completed
- Tests added and passing
- Documentation updated (README.md, PRD_Phase1.md)
- Docker containers configured
- GPS normalization tested
- Format router verified for all extensions
- No breaking changes to existing API
Out of Scope (Phase 2+)
- Celery/Redis asynchronous task management
- Automatic deployment of mineru-dots container
- Audio/video file support
- VLM-based image descriptions
Related Issues
Closes #3
Edited by Praneeth Ashish