Fix: Optimize JSON Processing and Context Window Expansion (!14)

Overview

This merge request makes the metadata extraction pipeline more efficient on structured JSON inputs and raises the prompt truncation limit so that the LLM sees more of each document.

Changes

Core Pipeline

Native JSON Handling: Updated ExtractionPipeline.process_text_file to detect .json files and use json.load().
Targeted Extraction: For JSON inputs, the pipeline now extracts only the value of the transcription key and discards structural metadata (e.g., bounding boxes) that can run to thousands of lines. If the key is absent, it falls back to the raw text.
Context Expansion: Increased the LLM prompt truncation limit from 3,000 to 15,000 characters to make fuller use of the context capacity of the vLLM engine (A100 optimized).
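
The three pipeline changes above can be sketched roughly as follows. This is an illustration only, not the code in this MR: the helper name load_text_for_prompt and the constant MAX_PROMPT_CHARS are hypothetical, the real logic lives in ExtractionPipeline.process_text_file, and the handling of malformed JSON is an assumption. The sketch uses json.loads on text that has already been read (rather than json.load on a file handle) so the raw text is available for the fallback.

```python
import json
from pathlib import Path

# Truncation limit described in this MR (previously 3,000 characters).
MAX_PROMPT_CHARS = 15_000


def load_text_for_prompt(path: Path, max_chars: int = MAX_PROMPT_CHARS) -> str:
    """Hypothetical helper: return the text to place in the LLM prompt."""
    raw = path.read_text(encoding="utf-8")
    text = raw  # default: fall back to the raw file contents

    if path.suffix.lower() == ".json":
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            data = None  # assumption: malformed JSON is treated as raw text

        # Keep only the transcription; bounding boxes and other
        # structural metadata never reach the prompt.
        if isinstance(data, dict) and isinstance(data.get("transcription"), str):
            text = data["transcription"]

    # Truncate by characters, matching the limit stated above.
    return text[:max_chars]
```

Because truncation happens after extraction, the 15,000-character budget is spent on transcription text rather than on JSON structure.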

Bug Fixes & Refactoring

Fixed a syntax error in the calculate_confidence method.
Resolved Mypy type errors related to vLLM initialization.
Applied Ruff formatting and import sorting across affected files.

Testing

Added tests/test_json_pipeline_update.py with three new test cases:
1. test_process_json_with_transcription: Verifies targeted extraction.
2. test_process_json_fallback_if_no_transcription: Verifies fallback to raw text.
3. test_truncation_limit_increased: Verifies the 15,000-character limit.

Impact

Efficiency: Reduces token usage for JSON inputs by excluding non-textual metadata such as bounding boxes from the prompt.
Accuracy: Reduces metadata loss in large documents by raising the truncation limit; text beyond 15,000 characters is still truncated.
Performance: Native parsing is faster than LLM-based extraction for structured wrappers.

Related Issues

Closes #14

Checklist

Code follows project style guidelines (Ruff/Mypy).
Tests passed locally.
Documentation updated.
No breaking changes for PDF/Image pipelines.

Reviewer Note: Please verify the 15,000-character limit against the current A100 VRAM availability for high-concurrency scenarios.
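
As a starting point for that check, here is a back-of-the-envelope sketch of prompt KV-cache size. Everything in it is an assumption rather than a measurement from this project: the 4-characters-per-token ratio is a rough heuristic for English text, the model dimensions in the usage example are placeholders, and both function names are hypothetical. It covers prompt tokens only and ignores model weights, activations, generated tokens, and any allocator overhead in vLLM.

```python
import math


def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    """Bytes of KV cache per token for a standard transformer decoder."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def estimate_prompt_kv_gib(
    max_chars: int,
    concurrency: int,
    *,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,
    chars_per_token: float = 4.0,  # assumption: rough heuristic, tokenizer-dependent
) -> float:
    """Estimate total prompt KV cache in GiB across concurrent requests."""
    tokens_per_prompt = math.ceil(max_chars / chars_per_token)
    per_token = kv_cache_bytes_per_token(
        num_layers, num_kv_heads, head_dim, dtype_bytes
    )
    return tokens_per_prompt * concurrency * per_token / 2**30


if __name__ == "__main__":
    # Placeholder model shape (not taken from this project): 32 layers,
    # 8 KV heads, head dimension 128, fp16 cache, 32 concurrent prompts.
    gib = estimate_prompt_kv_gib(
        15_000, 32, num_layers=32, num_kv_heads=8, head_dim=128
    )
    print(f"{gib:.2f} GiB")
```

Substituting the deployed model's real dimensions and measuring the tokenizer's actual characters-per-token ratio on representative transcriptions would give a more trustworthy number.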

Edited May 07, 2026 by Praneeth Ashish
