Feat: Implement Image Metadata Extraction Pipeline
Summary
This MR introduces a dedicated extraction pipeline for image files (.jpg, .jpeg, .png). It reads embedded EXIF data, GPS coordinates, and technical image properties directly from the file, without OCR or LLM processing, which avoids that overhead for image-only metadata tasks.
Key Changes
1. Metadata Extraction Logic
- bookextractor/image_utils.py: Added extract_image_metadata using Pillow and piexif to parse image headers.
- GPS Normalization: Implemented get_decimal_from_dms to convert Degrees/Minutes/Seconds (DMS) rational tuples into standard decimal degrees.
- Field Coverage: Added extraction for camera make/model, lens info, date taken, DPI, bit depth, and color space.
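For reviewers unfamiliar with the EXIF GPS layout, here is a minimal sketch of the DMS-to-decimal conversion. It is not the code in this MR: the name dms_to_decimal, its signature, and the six-decimal rounding are assumptions for illustration. It assumes each coordinate arrives as three (numerator, denominator) pairs plus an N/S/E/W reference, which may be a str or bytes.

```python
from fractions import Fraction
from typing import Sequence, Tuple, Union

Rational = Tuple[int, int]  # (numerator, denominator), as stored in EXIF


def dms_to_decimal(dms: Sequence[Rational], ref: Union[str, bytes]) -> float:
    """Convert a degrees/minutes/seconds triple to signed decimal degrees.

    Hypothetical helper for illustration; the MR's get_decimal_from_dms
    may use a different signature.
    """
    # Fraction keeps the arithmetic exact until the final float conversion.
    degrees, minutes, seconds = (Fraction(num, den) for num, den in dms)
    decimal = degrees + minutes / 60 + seconds / 3600

    # The reference may be bytes (b"N") or str ("N") depending on the reader.
    if isinstance(ref, bytes):
        ref = ref.decode("ascii")

    # South latitudes and west longitudes are negative in decimal degrees.
    if ref.strip().upper() in ("S", "W"):
        decimal = -decimal

    return round(float(decimal), 6)


# 118 deg 14' 45" W
longitude = dms_to_decimal(((118, 1), (14, 1), (45, 1)), "W")
print(longitude)  # -118.245833
```

A zero denominator raises ZeroDivisionError in this sketch, so malformed tags would need to be handled by the caller.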
2. Architectural Updates
- Models: Added the ImageMetadata Pydantic model to bookextractor/models.py.
- Pipeline Router: Updated ExtractionPipeline in bookextractor/pipeline.py with a process_image method.
- Unified Routing: Updated main.py (FastAPI and Typer CLI) to automatically route files to the image pipeline based on extension.
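Extension-based routing can be sketched roughly as follows. The names IMAGE_EXTENSIONS and select_pipeline, and the "document" fallback label, are hypothetical and chosen for illustration; the actual dispatch in main.py may be structured differently.

```python
from pathlib import Path

# Extensions handled by the image pipeline, per the Summary of this MR.
IMAGE_EXTENSIONS = frozenset({".jpg", ".jpeg", ".png"})


def select_pipeline(filename: str) -> str:
    """Return "image" for supported image extensions, else "document".

    Hypothetical helper: the labels stand in for whatever the real
    router dispatches to (e.g. a process_image method).
    """
    # Compare case-insensitively so "PHOTO.JPG" and "photo.jpg" match.
    suffix = Path(filename).suffix.lower()
    return "image" if suffix in IMAGE_EXTENSIONS else "document"


print(select_pipeline("scans/PHOTO.JPG"))  # image
print(select_pipeline("book.pdf"))         # document
```

Matching on the lowercased suffix keeps the CLI and API behavior identical for uploads whose names differ only in case.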
3. Engineering & Tooling
- Dependencies: Added piexif to pyproject.toml and updated uv.lock.
- Validation: Added tests/test_image_utils.py with unit tests for GPS normalization and EXIF extraction.
- Linting: Resolved multiple Ruff, Mypy, and pre-commit hook issues across modified files.
How to Test
CLI
bookextractor path/to/image.jpg output.json
API
# 1. Start the server
bookextractor --api
# 2. Send a POST request to /extract with an image file
curl -X POST -F "[email protected]" http://localhost:8000/extract
Automated Tests
uv run pytest tests/test_image_utils.py
Example Output
{
  "width": 4032,
  "height": 3024,
  "format": "JPEG",
  "dpi_horizontal": 72.0,
  "dpi_vertical": 72.0,
  "color_space": "RGB",
  "bit_depth": 24,
  "exif_camera_make": "Apple",
  "exif_camera_model": "iPhone 13",
  "exif_date_taken": "2023:10:27 10:00:00",
  "exif_gps_latitude": 34.053055,
  "exif_gps_longitude": -118.245833,
  "exif_lens": "iPhone 13 back dual wide camera 5.1mm f/1.6"
}
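Note that exif_date_taken keeps the raw EXIF timestamp layout (colons in the date part) and carries no timezone. Consumers that need a datetime could parse it along these lines; parse_exif_datetime is a hypothetical helper, not part of this MR.

```python
from datetime import datetime
from typing import Optional

# EXIF timestamps use "YYYY:MM:DD HH:MM:SS" with colons in the date part.
EXIF_DATETIME_FORMAT = "%Y:%m:%d %H:%M:%S"


def parse_exif_datetime(value: Optional[str]) -> Optional[datetime]:
    """Parse an EXIF timestamp into a naive datetime, or return None.

    Hypothetical helper: returns None for missing, blank, or
    unparseable values instead of raising.
    """
    if not value or not value.strip():
        return None
    try:
        return datetime.strptime(value.strip(), EXIF_DATETIME_FORMAT)
    except ValueError:
        # Some cameras write placeholders such as "0000:00:00 00:00:00".
        return None


taken = parse_exif_datetime("2023:10:27 10:00:00")
print(taken.isoformat())  # 2023-10-27T10:00:00
```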
Checklist
- New dependency piexif added.
- GPS normalization handles N/S/E/W references.
- API and CLI both support the new extensions.
- Tests pass and coverage is maintained.
- Linting and type checking (Ruff/Mypy) are clean.