Fix: Optimize JSON Ingestion and Increase LLM Context Window

Description

The current metadata extraction pipeline treats .json files as raw text. This causes two major issues:

Structural Noise: The entire JSON object, including large blocks of structural metadata (segments, bounding boxes), is stringified and sent to the LLM, wasting tokens and confusing the model.
Data Truncation: A hardcoded 3,000-character limit in the LLM prompt causes the actual text content (often stored in a transcription field) to be truncated before the model can process it.
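
The two issues can be reproduced with a small sketch. The document shape (a segments list ahead of the transcription key), the sample text, and the name OLD_LIMIT are assumptions for illustration and are not taken from the pipeline code.

```python
import json

# Hypothetical document: structural metadata first, real text last.
doc = {
    "segments": [{"id": i, "bbox": [0, 0, 100, 20]} for i in range(200)],
    "transcription": "Chapter 1. The text the model actually needs.",
}

OLD_LIMIT = 3_000  # the hardcoded character limit described above

raw = json.dumps(doc)          # raw-text handling: the whole object is stringified
prompt_text = raw[:OLD_LIMIT]  # then cut at the character limit

# The metadata alone exceeds the limit...
print(len(raw) > OLD_LIMIT)
# ...so in this layout the transcription key is cut off before the model sees it.
print("transcription" in prompt_text)
```

In this layout the metadata consumes the entire budget; with a different key order the transcription would instead be cut partway through.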

Objectives

Implement native JSON parsing for the .json input format.
Target the transcription field specifically to reduce noise.
Increase the LLM context window (the character limit applied to text in the prompt) from 3,000 to 15,000 characters to support larger documents.

Proposed Solution

Update bookextractor/pipeline.py to check for .json extensions.
Use json.load() to parse the file, then read the transcription key from the parsed object.
Raise the hardcoded character limit in the prompt logic from 3,000 to 15,000.
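
The steps above could be sketched roughly as follows. The names load_document_text, build_prompt_text, and MAX_PROMPT_CHARS are hypothetical, and the fallback to the raw file contents when the JSON is malformed or has no usable transcription key is an assumption, not the existing behavior of bookextractor/pipeline.py. The sketch calls json.loads() on text that has already been read, which parses the same way as json.load() on a file handle but keeps the raw text available for the fallback.

```python
import json
from pathlib import Path

# Hypothetical constant: the proposed limit, raised from 3,000.
MAX_PROMPT_CHARS = 15_000


def load_document_text(path):
    """Return the text to send to the LLM for the file at `path`."""
    path = Path(path)
    raw = path.read_text(encoding="utf-8")

    if path.suffix.lower() == ".json":
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            # Assumed fallback: treat malformed JSON as plain text.
            return raw
        # Use the transcription field only when it is a non-empty string.
        if isinstance(data, dict):
            transcription = data.get("transcription")
            if isinstance(transcription, str) and transcription.strip():
                return transcription

    # Non-JSON files, or JSON without a usable transcription field.
    return raw


def build_prompt_text(text, limit=MAX_PROMPT_CHARS):
    """Cut the text to the prompt character limit."""
    return text[:limit]
```

Keeping the limit in a single named constant, rather than inline in the prompt template, would make the next adjustment a one-line change.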

Acceptance Criteria

.json files are parsed natively without crashing.
Only the transcription text is sent to the LLM when available.
Files with up to 15,000 characters of text are processed without truncation.
Automated tests verify the new logic.