Feature: Implement Multi-Format Extraction Pipeline with vParse Integration (Phase 1)

Issue: Phase 1 - Multi-Format Metadata Extraction Pipeline

Problem Statement

The current system only handles PDFs and relies on an internal OCR pipeline. Users need a single endpoint and CLI that can extract metadata from several file types (.pdf, .json, .md, .png, .jpg). In addition, OCR should be delegated to an external, dockerized mineru-dots (vParse) pipeline for better accuracy, and a new image metadata pipeline must be introduced to capture EXIF data and basic file properties.

Proposed Solution

Create a unified extraction endpoint (/extract) and CLI that routes files to specialized pipelines based on their extensions:

Text Pipeline (.json, .md): Content is passed directly to the multimodal LLM for structured metadata extraction
PDF Pipeline (.pdf): Delegates OCR to a dockerized vParse (mineru-dots) API. The resulting JSON output is passed to the LLM for structured metadata extraction. If an ISBN is found, it is validated and enriched via OpenLibrary
Image Pipeline (.jpg, .png): Extracts metadata (dimensions, camera info, GPS, DPI) using EXIF/PIL and normalizes GPS coordinates to decimal degrees

User Stories

As an API user, I want to upload a .pdf file to the /extract route, so that I can get its book metadata using the vParse OCR and LLM
As a CLI user, I want to pass a .md or .json file to the extract command, so that I can get structured metadata directly from the LLM
As an API user, I want to upload a .jpg or .png image to the /extract route, so that I can receive detailed EXIF and image metadata
As a system administrator, I want the PDF extraction to delegate to the mineru-dots dockerized service, so that I can independently scale the OCR engine
As an API user, I want the system to automatically infer the correct extraction pipeline from my file's extension, so that I do not have to select a pipeline myself
As a reader, I want extracted ISBNs to be validated and enriched with OpenLibrary data

Technical Requirements

Format Router

Central routing module, used by both the FastAPI endpoint and the Typer CLI, that inspects file extensions and delegates to the matching pipeline processor
Supported formats: .pdf, .json, .md, .jpg, .png
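
A minimal sketch of the extension-based routing described above, assuming a plain dictionary lookup. The handler names (pdf_pipeline, text_pipeline, image_pipeline), the route function and UnsupportedFormatError are hypothetical placeholders, not names taken from the codebase. The handlers are stubs that only report which pipeline was chosen.

```python
from pathlib import Path
from typing import Callable, Dict, Union


# Hypothetical stub handlers: the real pipelines would call vParse, the LLM, or Pillow.
def pdf_pipeline(path: Path) -> dict:
    return {"pipeline": "pdf", "file": path.name}


def text_pipeline(path: Path) -> dict:
    return {"pipeline": "text", "file": path.name}


def image_pipeline(path: Path) -> dict:
    return {"pipeline": "image", "file": path.name}


# One entry per supported extension from the requirements list.
PIPELINES: Dict[str, Callable[[Path], dict]] = {
    ".pdf": pdf_pipeline,
    ".json": text_pipeline,
    ".md": text_pipeline,
    ".jpg": image_pipeline,
    ".png": image_pipeline,
}


class UnsupportedFormatError(ValueError):
    """Raised when a file extension has no registered pipeline."""


def route(path: Union[str, Path]) -> dict:
    """Pick a pipeline from the (case-insensitive) file extension and run it."""
    path = Path(path)
    suffix = path.suffix.lower()
    handler = PIPELINES.get(suffix)
    if handler is None:
        raise UnsupportedFormatError(f"Unsupported file extension: {suffix!r}")
    return handler(path)
```

Keeping the mapping in one table means the /extract endpoint and the CLI command could both call the same function, and the router tests only need to iterate over the table.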

PDF Pipeline

Call external vParse OCR API instead of internal logic
Process results with LLM for structured metadata
Validate and enrich ISBNs via OpenLibrary
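
The ISBN validation step could be sketched as a local checksum check that runs before any OpenLibrary request. This is an assumption about how validation might work, not something the issue specifies. The function names are hypothetical, and the input is assumed to be the ISBN itself (with optional hyphens or spaces), not a longer label such as "ISBN-13: ...". The OpenLibrary lookup is not shown.

```python
DIGITS = "0123456789"


def normalize_isbn(raw: str) -> str:
    """Drop hyphens, spaces and other separators; keep digits and a check 'X'."""
    return "".join(ch for ch in raw.upper() if ch in DIGITS or ch == "X")


def is_valid_isbn10(isbn: str) -> bool:
    """ISBN-10: weights 10..1, sum divisible by 11; 'X' as last char means 10."""
    if len(isbn) != 10:
        return False
    body, check = isbn[:9], isbn[9]
    if any(ch not in DIGITS for ch in body) or (check not in DIGITS and check != "X"):
        return False
    total = sum((10 - i) * int(ch) for i, ch in enumerate(body))
    total += 10 if check == "X" else int(check)
    return total % 11 == 0


def is_valid_isbn13(isbn: str) -> bool:
    """ISBN-13: alternating weights 1 and 3, sum divisible by 10."""
    if len(isbn) != 13 or any(ch not in DIGITS for ch in isbn):
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(isbn))
    return total % 10 == 0


def is_valid_isbn(raw: str) -> bool:
    """Accept either ISBN-10 or ISBN-13 after normalization."""
    isbn = normalize_isbn(raw)
    return is_valid_isbn10(isbn) or is_valid_isbn13(isbn)
```

Rejecting malformed ISBNs locally would avoid sending OCR misreads to OpenLibrary.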

Text Pipeline

Read contents of .md or .json files
Send raw text to LLM with specialized prompt for metadata extraction
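
One possible shape for the text pipeline, shown as a sketch only. The prompt wording, the function names, and the assumption that the LLM returns a JSON object as a string are all hypothetical. The LLM client is passed in as a plain callable so that the sketch does not depend on any particular provider SDK.

```python
import json
from pathlib import Path
from typing import Callable, Union


def build_prompt(text: str, source_name: str) -> str:
    """Hypothetical prompt; the real specialized prompt is not defined in this issue."""
    return (
        "Extract structured metadata from the document below and reply "
        "with a single JSON object.\n"
        f"Source file: {source_name}\n"
        "--- DOCUMENT START ---\n"
        f"{text}\n"
        "--- DOCUMENT END ---"
    )


def run_text_pipeline(path: Union[str, Path], llm: Callable[[str], str]) -> dict:
    """Read a .md or .json file, send its raw text to the LLM, parse the reply."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    reply = llm(build_prompt(text, path.name))
    # Assumes the model replies with valid JSON; json.loads raises ValueError otherwise.
    return json.loads(reply)
```

Injecting the LLM as a callable also keeps the pipeline easy to exercise in tests with a fake that returns a canned reply.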

Image Pipeline

Use piexif/exif and Pillow for metadata extraction
Extract: width, height, format, color_space, bit_depth, exif_camera_make, exif_camera_model, exif_date_taken, exif_gps_latitude, exif_gps_longitude, exif_lens, dpi_horizontal, dpi_vertical
GPS Normalization: Dedicated utility to convert degrees/minutes/seconds to decimal degrees
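
The GPS normalization utility could look roughly like the sketch below. It assumes coordinates arrive as a (degrees, minutes, seconds) triple in which each component is either a plain number or a (numerator, denominator) pair, the shape piexif uses for EXIF rationals. The hemisphere reference is assumed to be "N", "S", "E" or "W" as str or bytes. The function name and the 6-decimal rounding are illustrative choices, not requirements from this issue.

```python
from fractions import Fraction
from typing import Sequence, Tuple, Union

Component = Union[int, float, Fraction, Tuple[int, int]]


def _to_fraction(value: Component) -> Fraction:
    """Convert a plain number or a (numerator, denominator) pair to a Fraction."""
    if isinstance(value, tuple):
        numerator, denominator = value
        if denominator == 0:
            raise ValueError("EXIF rational has a zero denominator")
        return Fraction(numerator, denominator)
    return Fraction(value)


def dms_to_decimal(dms: Sequence[Component], ref: Union[str, bytes]) -> float:
    """Convert degrees/minutes/seconds plus a hemisphere reference to decimal degrees."""
    if len(dms) != 3:
        raise ValueError("expected (degrees, minutes, seconds)")
    degrees, minutes, seconds = (_to_fraction(part) for part in dms)
    decimal = degrees + minutes / 60 + seconds / 3600

    if isinstance(ref, bytes):
        ref = ref.decode("ascii")
    ref = ref.strip().upper()
    if ref not in {"N", "S", "E", "W"}:
        raise ValueError(f"unknown GPS reference: {ref!r}")

    # South latitudes and west longitudes are negative in decimal degrees.
    if ref in {"S", "W"}:
        decimal = -decimal
    return round(float(decimal), 6)
```

For example, dms_to_decimal(((12, 1), (30, 1), (0, 1)), "S") gives -12.5, since 30 minutes is half a degree and the southern hemisphere is negative.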

Testing Requirements

Mock external APIs (vParse, OpenLibrary, LLM) for unit/integration testing
GPS verification tests for various EXIF formats
Router logic tests to ensure all supported extensions route to correct internal pipeline
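
As an illustration of the mocking requirement, the sketch below replaces all three external services with unittest.mock.Mock objects. The orchestration function and the client method names (parse, extract_metadata, lookup) are hypothetical stand-ins. The real vParse, LLM and OpenLibrary clients may expose different interfaces.

```python
from unittest.mock import Mock


def extract_pdf_metadata(pdf_bytes, ocr_client, llm_client, openlibrary_client):
    """Hypothetical PDF pipeline: OCR, then LLM extraction, then optional enrichment."""
    ocr_json = ocr_client.parse(pdf_bytes)
    metadata = dict(llm_client.extract_metadata(ocr_json))
    isbn = metadata.get("isbn")
    if isbn:
        metadata["openlibrary"] = openlibrary_client.lookup(isbn)
    return metadata


def test_enriches_when_isbn_found():
    ocr, llm, openlibrary = Mock(), Mock(), Mock()
    ocr.parse.return_value = {"pages": ["Dune, ISBN 9780306406157"]}
    llm.extract_metadata.return_value = {"title": "Dune", "isbn": "9780306406157"}
    openlibrary.lookup.return_value = {"publishers": ["Example Press"]}

    result = extract_pdf_metadata(b"%PDF-", ocr, llm, openlibrary)

    ocr.parse.assert_called_once_with(b"%PDF-")
    llm.extract_metadata.assert_called_once_with(ocr.parse.return_value)
    openlibrary.lookup.assert_called_once_with("9780306406157")
    assert result["openlibrary"] == {"publishers": ["Example Press"]}


def test_skips_enrichment_without_isbn():
    ocr, llm, openlibrary = Mock(), Mock(), Mock()
    ocr.parse.return_value = {"pages": ["No identifiers here"]}
    llm.extract_metadata.return_value = {"title": "Untitled"}

    result = extract_pdf_metadata(b"%PDF-", ocr, llm, openlibrary)

    openlibrary.lookup.assert_not_called()
    assert result == {"title": "Untitled"}
```

Both functions follow the test_ naming convention, so a runner such as pytest would collect them without further setup.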

Out of Scope

Celery/Redis asynchronous task management (Phase 2)
Automatic deployment of the mineru-dots container
Audio/video file support (Phase 2)
VLM-based image descriptions (Phase 2)

Acceptance Criteria

Unified /extract endpoint handles .pdf, .json, .md, .jpg, .png files
PDF files are processed via vParse OCR API with LLM metadata extraction
Text files (.md, .json) are processed directly by LLM
Images return EXIF metadata with GPS coordinates normalized to decimal degrees
ISBNs are validated and enriched via OpenLibrary when found
All new functionality covered by unit and integration tests
Docker deployment includes separate vParse container for OCR scaling

References

PRD: docs/PRD_Phase1.md

Edited Apr 29, 2026 by Praneeth Ashish