Fix: Optimize JSON Processing and Expand Context Window

Overview

This merge request makes the metadata extraction pipeline more efficient on structured JSON inputs and raises the prompt truncation limit so that the LLM sees more of each document.

Changes

Core Pipeline

  • Native JSON Handling: Updated ExtractionPipeline.process_text_file to detect .json files and parse them with json.load().
  • Targeted Extraction: The pipeline now extracts only the value of the transcription key and discards the surrounding structural metadata (e.g., bounding boxes), which can run to thousands of lines. If the key is absent, it falls back to the raw text.
  • Context Expansion: Increased the LLM prompt truncation limit from 3,000 to 15,000 characters to make fuller use of the context capacity of the vLLM engine on A100 GPUs.
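
The three changes above can be sketched roughly as follows. This is a minimal illustration, not the actual body of ExtractionPipeline.process_text_file: the names load_text, truncate_for_prompt, and MAX_PROMPT_CHARS are hypothetical, and treating malformed or non-object JSON as plain text is an assumption.

```python
import json
from pathlib import Path

# Hypothetical constant name; the MR only states the limit moved from 3,000 to 15,000.
MAX_PROMPT_CHARS = 15_000


def load_text(path: Path) -> str:
    """Return the text that should be sent to the LLM for this file (hypothetical helper)."""
    raw = path.read_text(encoding="utf-8")
    if path.suffix.lower() != ".json":
        return raw

    try:
        # The MR uses json.load() on the file; json.loads() on the string is equivalent here.
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Assumption: malformed JSON is handled like any other text file.
        return raw

    # Keep only the transcription; bounding boxes and other structure are dropped.
    transcription = data.get("transcription") if isinstance(data, dict) else None
    if isinstance(transcription, str):
        return transcription

    # Fallback: no usable transcription key, so pass the raw text through.
    return raw


def truncate_for_prompt(text: str, limit: int = MAX_PROMPT_CHARS) -> str:
    """Cut the text to the prompt limit, counted in characters rather than tokens."""
    return text[:limit]
```

Extracting before truncating matters: if the raw JSON were truncated first, the 15,000 characters could be consumed by structural metadata before any transcription text is reached.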

Bug Fixes & Refactoring

  • Fixed a syntax error in the calculate_confidence method.
  • Resolved Mypy type errors related to vLLM initialization.
  • Applied Ruff formatting and import sorting across affected files.

Testing

  • Added tests/test_json_pipeline_update.py with three new test cases:
    1. test_process_json_with_transcription: Verifies targeted extraction.
    2. test_process_json_fallback_if_no_transcription: Verifies fallback to raw text.
    3. test_truncation_limit_increased: Verifies the 15,000-character limit.

Impact

  • Efficiency: Reduces token usage by excluding non-textual JSON metadata from the prompt.
  • Accuracy: Reduces metadata loss in large documents: up to 15,000 characters now reach the LLM, where previously everything past 3,000 characters was cut.
  • Performance: Native parsing is faster than LLM-based extraction for structured wrappers.

Related Issues

Closes #14

Checklist

  • Code follows project style guidelines (Ruff/Mypy).
  • Tests passed locally.
  • Documentation updated.
  • No breaking changes for PDF/Image pipelines.

Reviewer Note: Please verify the 15,000-character limit against the current A100 VRAM availability for high-concurrency scenarios.
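
For the check requested above, a back-of-the-envelope estimate of prompt KV-cache memory may be a useful starting point. This is only a sketch: the model dimensions in the usage note are placeholders rather than this project's model, the 4-characters-per-token ratio is a rule of thumb for English text, and the estimate ignores model weights, generated tokens, prompt template overhead, and vLLM's own block-based memory management.

```python
import math


def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def worst_case_prompt_kv_gib(max_chars: int, chars_per_token: float,
                             concurrent_requests: int, bytes_per_token: int) -> float:
    """Rough upper bound on prompt KV cache (GiB) if every request hits the character limit."""
    tokens_per_request = math.ceil(max_chars / chars_per_token)
    return tokens_per_request * concurrent_requests * bytes_per_token / 2**30


# Placeholder model shape (NOT taken from this MR): 32 layers, 8 KV heads, head_dim 128, fp16.
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
estimate = worst_case_prompt_kv_gib(15_000, 4.0, concurrent_requests=16,
                                    bytes_per_token=per_token)
print(f"~{estimate:.1f} GiB of KV cache for prompts at the 15,000-character limit")
```

Substituting the deployed model's real dimensions and the expected concurrency gives a figure to compare against the VRAM left after the weights are loaded.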

Edited by Praneeth Ashish
