Feature: Implement Multi-Format Extraction Pipeline with vParse Integration (Phase 1)
Issue: Phase 1 - Multi-Format Metadata Extraction Pipeline
Problem Statement
The current system only handles PDFs and relies on an internal OCR pipeline. Users need a versatile system that can extract metadata from various file types (.pdf, .json, .md, .png, .jpg) using a unified endpoint and CLI. Additionally, the OCR capability needs to be delegated to an external mineru-dots (vParse) dockerized pipeline for better accuracy, and a new image metadata extraction pipeline must be introduced to capture EXIF and basic file properties.
Proposed Solution
Create a unified extraction endpoint (/extract) and CLI that routes files to specialized pipelines based on their extensions:
- Text Pipeline (.json, .md): Content is passed directly to the multimodal LLM for structured metadata extraction
- PDF Pipeline (.pdf): Delegates OCR to a dockerized vParse (mineru-dots) API. The resulting JSON output is passed to the LLM for structured metadata extraction. If an ISBN is found, it is validated and enriched via OpenLibrary
- Image Pipeline (.jpg, .png, etc.): Extracts metadata (dimensions, camera info, GPS, DPI) using EXIF/PIL, normalizing GPS coordinates to decimal degrees
User Stories
- As an API user, I want to upload a .pdf file to the /extract route, so that I can get its book metadata using the vParse OCR and LLM
- As a CLI user, I want to pass a .md or .json file to the extract command, so that I can get structured metadata directly from the LLM
- As an API user, I want to upload a .jpg or .png image to the /extract route, so that I can receive detailed EXIF and image metadata
- As a system administrator, I want the PDF extraction to delegate to the mineru-dots dockerized service, so that I can independently scale the OCR engine
- As an API user, I want the system to automatically infer the correct extraction pipeline based on my file extension
- As a reader, I want extracted ISBNs to be validated and enriched with OpenLibrary data
Technical Requirements
Format Router
- Central routing module in FastAPI/Typer to inspect file extensions and delegate to specific pipeline processors
- Supported formats: .pdf, .json, .md, .jpg, .png
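The routing requirement above could be sketched roughly as follows. This is a minimal illustration, not the project's actual module: the handler names (run_pdf_pipeline, run_text_pipeline, run_image_pipeline), the PIPELINES table, and UnsupportedFormatError are all hypothetical, and the stub handlers only report which pipeline was chosen.

```python
from pathlib import Path
from typing import Callable, Dict


# Hypothetical pipeline entry points (stubs); the real processors are not
# defined in this issue.
def run_pdf_pipeline(path: Path) -> dict:
    return {"pipeline": "pdf", "file": path.name}


def run_text_pipeline(path: Path) -> dict:
    return {"pipeline": "text", "file": path.name}


def run_image_pipeline(path: Path) -> dict:
    return {"pipeline": "image", "file": path.name}


# Extension -> handler table covering the formats listed for Phase 1.
PIPELINES: Dict[str, Callable[[Path], dict]] = {
    ".pdf": run_pdf_pipeline,
    ".json": run_text_pipeline,
    ".md": run_text_pipeline,
    ".jpg": run_image_pipeline,
    ".png": run_image_pipeline,
}


class UnsupportedFormatError(ValueError):
    """Raised when a file extension has no registered pipeline."""


def route(path: Path) -> dict:
    """Pick a pipeline from the (case-insensitive) file extension and run it."""
    ext = path.suffix.lower()
    handler = PIPELINES.get(ext)
    if handler is None:
        raise UnsupportedFormatError(f"Unsupported file extension: {ext!r}")
    return handler(path)
```

Keeping the mapping in one table means the FastAPI endpoint and the Typer command can both call route(), so the API and CLI cannot drift apart on which formats they accept.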
PDF Pipeline
- Call external vParse OCR API instead of internal logic
- Process results with LLM for structured metadata
- Validate and enrich ISBNs via OpenLibrary
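For the validation half of the ISBN step, one option is a local checksum test before any OpenLibrary request is made. The sketch below uses the standard ISBN-10 (mod 11) and ISBN-13 (mod 10) check-digit rules; the function names are illustrative assumptions, and the enrichment call itself is left out.

```python
import string


def normalize_isbn(raw: str) -> str:
    """Drop hyphens, spaces, and other separators; keep ASCII digits and 'X'."""
    return "".join(ch for ch in raw.upper() if ch in string.digits or ch == "X")


def is_valid_isbn10(isbn: str) -> bool:
    """Check a normalized 10-character ISBN using weights 10..1, modulo 11."""
    if len(isbn) != 10 or not isbn[:9].isdigit():
        return False
    if not (isbn[9].isdigit() or isbn[9] == "X"):
        return False
    total = sum((10 - i) * int(ch) for i, ch in enumerate(isbn[:9]))
    check = 10 if isbn[9] == "X" else int(isbn[9])
    return (total + check) % 11 == 0


def is_valid_isbn13(isbn: str) -> bool:
    """Check a normalized 13-digit ISBN using alternating weights 1 and 3, modulo 10."""
    if len(isbn) != 13 or not isbn.isdigit():
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(isbn))
    return total % 10 == 0


def is_valid_isbn(raw: str) -> bool:
    """Accept either ISBN form, with or without separators."""
    isbn = normalize_isbn(raw)
    return is_valid_isbn10(isbn) or is_valid_isbn13(isbn)
```

Rejecting malformed ISBNs locally avoids spending an OpenLibrary lookup on values the LLM or OCR step misread.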
Text Pipeline
- Read contents of .md or .json files
- Send raw text to LLM with specialized prompt for metadata extraction
Image Pipeline
- Use piexif/exif and Pillow for metadata extraction
- Extract: width, height, format, color_space, bit_depth, exif_camera_make, exif_camera_model, exif_date_taken, exif_gps_latitude, exif_gps_longitude, exif_lens, dpi_horizontal, dpi_vertical
- GPS Normalization: Dedicated utility to convert degrees/minutes/seconds to decimal degrees
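The GPS normalization utility might look something like the sketch below. It assumes EXIF coordinates arrive as three values (degrees, minutes, seconds), each either a (numerator, denominator) pair or a plain number, plus a hemisphere reference of N, S, E, or W. The name dms_to_decimal and the six-decimal rounding are assumptions, not requirements from this issue.

```python
from fractions import Fraction
from typing import Sequence, Tuple, Union

# A single EXIF component: a (numerator, denominator) pair or a plain number.
Rational = Union[int, float, Fraction, Tuple[int, int]]


def _to_float(value: Rational) -> float:
    """Convert one EXIF component to a float."""
    if isinstance(value, tuple):
        num, den = value
        if den == 0:
            raise ValueError("EXIF rational has a zero denominator")
        return num / den
    return float(value)


def dms_to_decimal(dms: Sequence[Rational], ref: Union[str, bytes]) -> float:
    """Convert degrees/minutes/seconds plus a hemisphere ref to signed decimal degrees."""
    if len(dms) != 3:
        raise ValueError("expected (degrees, minutes, seconds)")
    degrees, minutes, seconds = (_to_float(v) for v in dms)
    decimal = degrees + minutes / 60 + seconds / 3600

    # The reference may be bytes or str depending on the EXIF reader.
    if isinstance(ref, bytes):
        ref = ref.decode("ascii")
    ref = ref.strip().upper()
    if ref not in {"N", "S", "E", "W"}:
        raise ValueError(f"unknown hemisphere reference: {ref!r}")

    # South and west are negative in decimal-degree notation.
    if ref in {"S", "W"}:
        decimal = -decimal
    return round(decimal, 6)
```

Accepting both rational pairs and plain numbers keeps the utility independent of which EXIF library produced the values, which is also what the GPS verification tests would need to exercise.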
Testing Requirements
- Mock external APIs (vParse, OpenLibrary, LLM) for unit/integration testing
- GPS verification tests for various EXIF formats
- Router logic tests to ensure all supported extensions route to correct internal pipeline
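As a rough illustration of the mocking requirement, the sketch below injects the three external services into a hypothetical extract_pdf_metadata function and replaces them with unittest.mock.Mock objects. The client method names (parse, extract_metadata, lookup) and all fixture values are invented for the example and do not reflect the real vParse, LLM, or OpenLibrary interfaces.

```python
from unittest.mock import Mock


def extract_pdf_metadata(pdf_bytes, ocr_client, llm_client, isbn_client) -> dict:
    """Hypothetical PDF pipeline: OCR -> LLM -> optional ISBN enrichment."""
    ocr_json = ocr_client.parse(pdf_bytes)
    metadata = dict(llm_client.extract_metadata(ocr_json))
    isbn = metadata.get("isbn")
    if isbn:
        metadata["openlibrary"] = isbn_client.lookup(isbn)
    return metadata


def test_isbn_is_enriched_when_present():
    ocr, llm, openlibrary = Mock(), Mock(), Mock()
    # Made-up fixture values, used only to drive the test.
    ocr.parse.return_value = {"pages": ["..."]}
    llm.extract_metadata.return_value = {"title": "Example", "isbn": "9780306406157"}
    openlibrary.lookup.return_value = {"publisher": "Example Press"}

    result = extract_pdf_metadata(b"%PDF-", ocr, llm, openlibrary)

    ocr.parse.assert_called_once_with(b"%PDF-")
    openlibrary.lookup.assert_called_once_with("9780306406157")
    assert result["openlibrary"] == {"publisher": "Example Press"}


def test_no_lookup_without_isbn():
    ocr, llm, openlibrary = Mock(), Mock(), Mock()
    ocr.parse.return_value = {"pages": []}
    llm.extract_metadata.return_value = {"title": "Untitled"}

    result = extract_pdf_metadata(b"%PDF-", ocr, llm, openlibrary)

    openlibrary.lookup.assert_not_called()
    assert "openlibrary" not in result
```

Passing the clients in as arguments is one way to keep the tests free of network access; patching module-level clients with unittest.mock.patch would be an alternative if the pipeline constructs its own clients.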
Out of Scope
- Celery/Redis asynchronous task management (Phase 2)
- Automatic deployment of the mineru-dots container
- Audio/video file support (Phase 2)
- VLM-based image descriptions (Phase 2)
Acceptance Criteria
- Unified /extract endpoint handles .pdf, .json, .md, .jpg, .png files
- PDF files are processed via vParse OCR API with LLM metadata extraction
- Text files (.md, .json) are processed directly by LLM
- Images return EXIF metadata with GPS coordinates normalized to decimal degrees
- ISBNs are validated and enriched via OpenLibrary when found
- All new functionality covered by unit and integration tests
- Docker deployment includes separate vParse container for OCR scaling
References
- PRD: docs/PRD_Phase1.md
Edited by Praneeth Ashish