Fix: Optimize JSON Ingestion and Increase LLM Context Window
Description
The current metadata extraction pipeline treats .json files as raw text. This causes two major issues:
- Structural Noise: The entire JSON object, including large blocks of structural metadata (segments, bounding boxes), is stringified and sent to the LLM, wasting tokens and confusing the model.
- Data Truncation: A hardcoded 3,000-character limit in the LLM prompt causes the actual text content (often stored in a transcription field) to be truncated before the model can process it.
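A minimal sketch of how the two issues combine. The payload shape here is hypothetical: only the transcription key comes from this PR, while the segments layout and sample text are assumptions made for illustration. `json.dumps` stands in for the raw file contents the pipeline currently reads.

```python
import json

# Hypothetical payload: everything except the "transcription" key is assumed.
payload = {
    "segments": [{"id": i, "bbox": [0, 0, 100, 20]} for i in range(200)],
    "transcription": "Chapter 1. The actual book text starts here.",
}

OLD_LIMIT = 3_000  # the hardcoded prompt limit described above

# Treating the file as raw text means the whole serialized object is used.
raw_text = json.dumps(payload)
prompt_text = raw_text[:OLD_LIMIT]

# The structural metadata alone exceeds the limit...
print(len(raw_text) > OLD_LIMIT)
# ...so the transcription never reaches the model.
print(payload["transcription"] in prompt_text)
```

With this layout the first check prints `True` and the second prints `False`: the prompt is filled entirely with bounding-box data.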
Objectives
- Implement native JSON parsing for the .json input format.
- Target the transcription field specifically to reduce noise.
- Increase the LLM context window from 3,000 to 15,000 characters to support larger documents.
Proposed Solution
- Update bookextractor/pipeline.py to check for .json extensions.
- Use json.load() to extract the transcription key.
- Update the prompt logic to support 15,000 characters.
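The steps above could look roughly like the sketch below. It is not the actual contents of bookextractor/pipeline.py: the names `load_document_text`, `build_prompt_text`, and `MAX_PROMPT_CHARS` are hypothetical, and the fallback to raw text for malformed JSON or a missing transcription key is an assumed behavior. It also uses `json.loads` on the already-read text rather than `json.load()` on a file handle, so the raw text stays available for that fallback.

```python
import json
import tempfile
from pathlib import Path

MAX_PROMPT_CHARS = 15_000  # raised from the previous 3,000 (hypothetical constant name)


def load_document_text(path):
    """Return the text to send to the LLM for the file at `path`."""
    path = Path(path)
    raw = path.read_text(encoding="utf-8")
    if path.suffix.lower() != ".json":
        return raw
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Assumed fallback: malformed JSON is treated as raw text, not a crash.
        return raw
    if isinstance(data, dict):
        text = data.get("transcription")
        if isinstance(text, str) and text.strip():
            return text
    # Assumed fallback: no usable transcription, keep the old behavior.
    return raw


def build_prompt_text(text, limit=MAX_PROMPT_CHARS):
    """Cut the text to the prompt's character limit."""
    return text[:limit]


# Usage: a small JSON file with structural metadata and a transcription.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "page.json"
    sample.write_text(
        json.dumps({"segments": [{"bbox": [0, 0, 10, 10]}], "transcription": "Hello world"}),
        encoding="utf-8",
    )
    extracted = load_document_text(sample)

print(extracted)  # Hello world
```

Keeping extraction and truncation in separate functions makes each acceptance criterion testable on its own.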
Acceptance Criteria
- .json files are parsed natively without crashing.
- Only the transcription text is sent to the LLM when available.
- Files with up to 15,000 characters of text are processed without truncation.
- Automated tests verify the new logic.