
feat(pipeline): implement multi-format extraction and vParse integration (Phase 1)

Praneeth Ashish requested to merge feat/multi-format-pipeline-phase1 into develop

Merge Request: Phase 1 - Multi-Format Metadata Extraction Pipeline

Description

This MR implements Phase 1 of the metadata extraction system, creating a unified extraction endpoint that routes files to specialized pipelines based on their extensions.

Changes

New Features

  • Unified /extract Endpoint: Single API endpoint and CLI command that handles multiple file formats
  • Text Pipeline (.json, .md): Direct LLM processing for structured metadata extraction
  • PDF Pipeline (.pdf): Delegates OCR to the dockerized vParse (mineru-dots) API, then applies LLM processing and OpenLibrary ISBN enrichment
  • Image Pipeline (.jpg, .png): EXIF/PIL metadata extraction with GPS coordinate normalization
  • Format Router: Automatic pipeline selection based on file extension
  • vParse Client: External OCR API integration for PDF processing

New Files

  • docs/PRD_Phase1.md - Product requirements document
  • bookextractor/vparse_client.py - vParse OCR API client
  • bookextractor/pipeline.py - Multi-format extraction pipeline router
  • tests/test_pipeline.py - Pipeline extraction tests
  • tests/test_vparse_client.py - vParse client tests
  • tests/test_image_utils.py - Image utilities and GPS normalization tests
  • tests/test_main.py - API endpoint tests
  • tests/test_external_api.py - External API integration tests
  • .dockerignore, .env.example
  • Dockerfile.bookextractor, Dockerfile.vparse* - Container definitions

Modified Files

  • bookextractor/models.py - Added format-specific metadata models (ImageMetadata, TextMetadata, PDFMetadata)
  • bookextractor/main.py - Format router implementation with extension-based pipeline delegation
  • bookextractor/image_utils.py - EXIF extraction and GPS normalization utilities
  • docker-compose.yml - Multi-container orchestration for bookextractor and vParse
  • README.md - Phase 1 documentation and usage examples
  • pyproject.toml, uv.lock - Dependency updates
  • bookextractor/ocr.py - Deprecated (migrated to vparse_client)
  • bookextractor/pdf_utils.py - Deprecated (migrated to pipeline)

Technical Implementation

Format Router

  • Inspects file extensions and delegates to specialized pipeline processors
  • Supported formats: .pdf, .json, .md, .jpg, .png
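
As a rough illustration of extension-based routing, the sketch below maps each supported suffix to a pipeline name. The names PIPELINES and select_pipeline are hypothetical and are not taken from bookextractor/main.py; the actual router may be structured differently.

```python
from pathlib import Path

# Hypothetical mapping from file extension to pipeline name.
# The extensions come from the MR; the pipeline labels are placeholders.
PIPELINES = {
    ".pdf": "pdf",
    ".json": "text",
    ".md": "text",
    ".jpg": "image",
    ".png": "image",
}


def select_pipeline(filename: str) -> str:
    """Return the pipeline name for a file, based only on its extension."""
    ext = Path(filename).suffix.lower()  # normalize so ".PDF" matches ".pdf"
    try:
        return PIPELINES[ext]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {ext!r}") from None
```

Lowercasing the suffix is an assumption about the desired behavior; it keeps files such as scan.PDF from being rejected.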

PDF Pipeline

  1. Calls external vParse OCR API via vparse_client.py
  2. Passes the vParse JSON output to the LLM to extract structured metadata
  3. Validates and enriches ISBNs via the OpenLibrary API
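
The MR does not say how ISBNs are validated. One plausible local pre-check before any OpenLibrary lookup is the standard ISBN-10 and ISBN-13 checksum, sketched below. The function names are hypothetical, and this is an assumption about one possible approach rather than a description of the code in this MR.

```python
def _clean(isbn: str) -> str:
    """Strip hyphens and spaces and uppercase a trailing 'x'."""
    return isbn.replace("-", "").replace(" ", "").upper()


def is_valid_isbn10(isbn: str) -> bool:
    s = _clean(isbn)
    if len(s) != 10 or not s.isascii():
        return False
    if not s[:9].isdigit() or not (s[9].isdigit() or s[9] == "X"):
        return False
    # Weights run 10 down to 1; 'X' in the last position stands for 10.
    total = sum((10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(isbn: str) -> bool:
    s = _clean(isbn)
    if len(s) != 13 or not s.isascii() or not s.isdigit():
        return False
    # Weights alternate 1, 3, 1, 3, ... across all 13 digits.
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(s))
    return total % 10 == 0
```

A checksum pass only shows that the number is well formed; whether the book exists still depends on the OpenLibrary response.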

Text Pipeline

  • Reads .md or .json file contents
  • Sends raw text to LLM with specialized extraction prompt
  • Returns structured metadata JSON
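
The text pipeline steps above could be sketched roughly as follows. The prompt wording, the function name extract_text_metadata, and the llm callable (prompt in, JSON string out) are all hypothetical stand-ins; the MR does not show the real prompt or LLM client.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical prompt; the actual extraction prompt is not shown in the MR.
PROMPT = (
    "Extract bibliographic metadata from the text below "
    "and reply with a JSON object only.\n\n{content}"
)


def extract_text_metadata(path: str, llm: Callable[[str], str]) -> dict:
    """Read a .md or .json file, send its raw text to the LLM, parse the reply."""
    file_path = Path(path)
    if file_path.suffix.lower() not in {".md", ".json"}:
        raise ValueError(f"Not a text pipeline file: {path}")
    content = file_path.read_text(encoding="utf-8")
    reply = llm(PROMPT.format(content=content))
    # json.loads raises json.JSONDecodeError if the model returns non-JSON text.
    return json.loads(reply)
```

Passing the LLM in as a callable keeps the function easy to test with a stub, which fits the mocking approach described under Testing.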

Image Pipeline

  • Uses piexif/exif and Pillow for metadata extraction
  • Extracts: width, height, format, color_space, bit_depth, exif_camera_make, exif_camera_model, exif_date_taken, exif_gps_latitude, exif_gps_longitude, exif_lens, dpi_horizontal, dpi_vertical
  • GPS normalization: Converts degrees/minutes/seconds to decimal degrees
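
The degrees/minutes/seconds conversion can be illustrated with a minimal sketch. It assumes each component arrives either as a plain number or as a (numerator, denominator) rational tuple, which is how EXIF stores GPS values; dms_to_decimal and _to_float are hypothetical names, not the functions in bookextractor/image_utils.py.

```python
def _to_float(value) -> float:
    """Accept a plain number or an EXIF-style (numerator, denominator) pair."""
    if isinstance(value, tuple):
        numerator, denominator = value
        # Treat a zero denominator as 0.0 rather than raising.
        return numerator / denominator if denominator else 0.0
    return float(value)


def dms_to_decimal(dms, ref: str) -> float:
    """Convert (degrees, minutes, seconds) plus an N/S/E/W ref to decimal degrees."""
    degrees, minutes, seconds = (_to_float(v) for v in dms)
    decimal = degrees + minutes / 60 + seconds / 3600
    # Southern latitudes and western longitudes are negative.
    if ref.upper() in ("S", "W"):
        decimal = -decimal
    return round(decimal, 6)
```

For example, dms_to_decimal((30, 30, 0), "S") evaluates to -30.5. Rounding to six decimal places is an illustrative choice.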

Testing

All tests pass. The suite covers:

  • External APIs (vParse, OpenLibrary, LLM) are mocked for unit and integration tests
  • GPS normalization is verified against various EXIF formats
  • Router tests confirm that every supported extension routes to the correct pipeline
  • Total: 853 lines of new test code across 5 test files
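
To show the mocking approach in general terms, here is a small sketch that uses unittest.mock to stand in for the OCR client and the LLM. The function extract_pdf_metadata, the parse method, and the payload shapes are hypothetical; the real tests in tests/test_pipeline.py and tests/test_vparse_client.py may look quite different.

```python
from unittest.mock import Mock


def extract_pdf_metadata(path, ocr_client, llm):
    """Hypothetical PDF flow: OCR the file, then let the LLM structure the result."""
    ocr_json = ocr_client.parse(path)
    return llm(ocr_json)


# Stand-ins for the external services, so no network call is made.
ocr_client = Mock()
ocr_client.parse.return_value = {"pages": [{"text": "ISBN 9780306406157"}]}
llm = Mock(return_value={"title": "Example", "isbn": "9780306406157"})

result = extract_pdf_metadata("book.pdf", ocr_client, llm)

# Verify both the returned metadata and how the dependencies were called.
assert result == {"title": "Example", "isbn": "9780306406157"}
ocr_client.parse.assert_called_once_with("book.pdf")
llm.assert_called_once_with({"pages": [{"text": "ISBN 9780306406157"}]})
```

Injecting the clients as arguments is one way to make them replaceable in tests; patching module-level names with unittest.mock.patch would be an alternative.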

Deployment

  1. Configure environment variables from .env.example
  2. Run docker-compose up --build to start the bookextractor and vParse containers
  3. The vParse OCR service runs as a separate container, so it can be scaled independently

Checklist

  • Code follows project style guidelines
  • Self-review completed
  • Tests added and passing
  • Documentation updated (README.md, PRD_Phase1.md)
  • Docker containers configured
  • GPS normalization tested
  • Format router verified for all extensions
  • No breaking changes to existing API

Out of Scope (Phase 2+)

  • Celery/Redis asynchronous task management
  • Automatic deployment of mineru-dots container
  • Audio/video file support
  • VLM-based image descriptions

Related Issues

Closes #3 (closed)

Edited by Praneeth Ashish
