Fix: Optimize JSON Processing and Expand the Context Window
Overview
This Merge Request improves the efficiency of the metadata extraction pipeline when handling structured JSON inputs and expands the LLM's visible context window.
Changes
Core Pipeline
- Native JSON Handling: Updated ExtractionPipeline.process_text_file to detect .json files and use json.load().
- Targeted Extraction: The pipeline now specifically extracts the value from the transcription key, stripping away thousands of lines of unnecessary structural metadata (e.g., bounding boxes).
- Context Expansion: Increased the LLM prompt truncation limit from 3,000 to 15,000 characters to leverage the full capacity of the vLLM engine (A100 optimized).
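For reviewers, the three changes above can be sketched roughly as follows. This is a minimal illustration, not the code in this MR: the helper name load_text_for_prompt and the constant MAX_PROMPT_CHARS are hypothetical, and the sketch parses with json.loads() on the already-read text (rather than json.load() on the file handle) so that the raw text stays available for the fallback case.

```python
import json
from pathlib import Path

# Truncation limit stated in this MR (previously 3,000 characters).
MAX_PROMPT_CHARS = 15_000


def load_text_for_prompt(path: Path, max_chars: int = MAX_PROMPT_CHARS) -> str:
    """Hypothetical helper: return the text that would be placed in the prompt."""
    raw = path.read_text(encoding="utf-8")
    text = raw

    if path.suffix.lower() == ".json":
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            data = None  # malformed JSON: keep the raw text
        # Use only the transcription value when it is present and is a string;
        # structural metadata such as bounding boxes is dropped.
        if isinstance(data, dict) and isinstance(data.get("transcription"), str):
            text = data["transcription"]

    # Truncate to the prompt limit.
    return text[:max_chars]
```

With this shape, a .json file that lacks a transcription key falls through to the raw file contents, which matches the fallback behaviour named in the tests.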
Bug Fixes & Refactoring
- Fixed a syntax error in the calculate_confidence method.
- Resolved Mypy type errors related to vLLM initialization.
- Applied Ruff formatting and import sorting across affected files.
Testing
- Added tests/test_json_pipeline_update.py with three new test cases:
  - test_process_json_with_transcription: Verifies targeted extraction.
  - test_process_json_fallback_if_no_transcription: Verifies fallback to raw text.
  - test_truncation_limit_increased: Verifies the 15,000-character limit.
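As a rough picture of what the three cases check, here is a sketch in pytest style. It is an assumption-laden illustration, not the contents of tests/test_json_pipeline_update.py: extract_text is a hypothetical stand-in for the real pipeline method, and the sample payloads are invented for the example. Only the test names and the 15,000-character limit come from this MR.

```python
import json
from pathlib import Path

LIMIT = 15_000  # truncation limit stated in this MR


def extract_text(path: Path) -> str:
    """Hypothetical stand-in for the pipeline method under test."""
    raw = path.read_text(encoding="utf-8")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return raw[:LIMIT]
    if isinstance(data, dict) and isinstance(data.get("transcription"), str):
        return data["transcription"][:LIMIT]
    return raw[:LIMIT]  # no usable transcription key: fall back to raw text


def test_process_json_with_transcription(tmp_path: Path) -> None:
    f = tmp_path / "doc.json"
    f.write_text(json.dumps({"transcription": "Dear Sir", "bbox": [1, 2, 3, 4]}))
    # Only the transcription value should survive.
    assert extract_text(f) == "Dear Sir"


def test_process_json_fallback_if_no_transcription(tmp_path: Path) -> None:
    f = tmp_path / "doc.json"
    f.write_text(json.dumps({"bbox": [1, 2, 3, 4]}))
    # Without a transcription key the raw JSON text is returned.
    assert "bbox" in extract_text(f)


def test_truncation_limit_increased(tmp_path: Path) -> None:
    f = tmp_path / "doc.json"
    f.write_text(json.dumps({"transcription": "a" * 20_000}))
    # Text longer than the limit is cut to exactly LIMIT characters.
    assert len(extract_text(f)) == LIMIT
```

The tmp_path argument is pytest's built-in temporary directory fixture, so the sketch runs under pytest without extra setup.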
Impact
- Efficiency: Significantly reduces token usage by excluding non-textual JSON metadata.
- Accuracy: Reduces metadata loss in large documents by raising the truncation limit; text beyond 15,000 characters is still truncated.
- Performance: Native parsing is faster than LLM-based extraction for structured wrappers.
Related Issues
Closes #14
Checklist
- Code follows project style guidelines (Ruff/Mypy).
- Tests passed locally.
- Documentation updated.
- No breaking changes for PDF/Image pipelines.
Reviewer Note: Please verify the 15,000-character limit against the current A100 VRAM availability for high-concurrency scenarios.
Edited by Praneeth Ashish