Fix: Optimize JSON Ingestion and Increase LLM Prompt Character Limit

Description

The current metadata extraction pipeline treats .json files as raw text. This causes two major issues:

  1. Structural Noise: The entire JSON object, including large blocks of structural metadata (segments, bounding boxes), is stringified and sent to the LLM, wasting tokens and confusing the model.
  2. Data Truncation: A hardcoded 3,000-character limit in the LLM prompt causes the actual text content (often stored in a transcription field) to be truncated before the model can process it.

Objectives

  • Implement native JSON parsing for the .json input format.
  • Target the transcription field specifically to reduce noise.
  • Increase the LLM prompt character limit from 3,000 to 15,000 characters to support larger documents.

Proposed Solution

  • Update bookextractor/pipeline.py to check for .json extensions.
  • Use json.load() to parse the file, then read the value of the transcription key.
  • Raise the hardcoded prompt limit from 3,000 to 15,000 characters.
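
A minimal sketch of the JSON handling described above, not the actual contents of bookextractor/pipeline.py. The function name load_text_for_llm is hypothetical, and both fallbacks (malformed JSON, missing transcription key) are assumptions, since this description does not say what should happen in those cases.

```python
import json
from pathlib import Path


def load_text_for_llm(path):
    """Return the text the pipeline should pass on to the LLM prompt."""
    path = Path(path)

    # Non-JSON inputs keep the existing raw-text behaviour.
    if path.suffix.lower() != ".json":
        return path.read_text(encoding="utf-8")

    try:
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
    except json.JSONDecodeError:
        # Assumption: malformed JSON falls back to raw text instead of crashing.
        return path.read_text(encoding="utf-8")

    # Prefer the transcription field; skip segments, bounding boxes, etc.
    if isinstance(data, dict):
        transcription = data.get("transcription")
        if isinstance(transcription, str):
            return transcription

    # Assumption: with no usable transcription, fall back to the raw text.
    return path.read_text(encoding="utf-8")
```

Checking that the transcription value is a string keeps a null or nested value from reaching the prompt builder, at the cost of falling back to the noisier raw text in that case.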

Acceptance Criteria

  • .json files are parsed natively without crashing.
  • When a transcription field is present, only its text is sent to the LLM.
  • Files with up to 15,000 characters of text are processed without truncation.
  • Automated tests cover JSON parsing, transcription extraction, and the 15,000-character limit.
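
As a rough illustration of how the truncation criterion could be tested, the sketch below uses unittest against a stand-in helper. Both clip_for_prompt and MAX_PROMPT_CHARS are hypothetical names; the real prompt logic may be structured differently, and clipping beyond 15,000 characters is assumed to keep the existing truncation behaviour.

```python
import unittest

MAX_PROMPT_CHARS = 15_000  # new limit from this change; previously 3,000


def clip_for_prompt(text, limit=MAX_PROMPT_CHARS):
    """Hypothetical stand-in for the pipeline's prompt truncation step."""
    return text[:limit]


class PromptLimitTests(unittest.TestCase):
    def test_text_at_limit_is_not_truncated(self):
        text = "a" * MAX_PROMPT_CHARS
        self.assertEqual(clip_for_prompt(text), text)

    def test_text_over_old_limit_survives(self):
        # 3,001 characters would have been cut under the old limit.
        text = "b" * 3_001
        self.assertEqual(len(clip_for_prompt(text)), 3_001)

    def test_text_over_new_limit_is_clipped(self):
        # Assumption: input beyond the new limit is still truncated.
        text = "c" * (MAX_PROMPT_CHARS + 1)
        self.assertEqual(len(clip_for_prompt(text)), MAX_PROMPT_CHARS)
```

Saved as a test module, these cases can be run with python -m unittest.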