Skip to content

Fix magazine JSON parsing: handle array inputs, harden prompt, add diagnostic logging

Description

Problem

Magazine metadata extraction fails on structured OCR JSON arrays with the unhelpful error "Magazine extraction prompt did not produce valid JSON".

Related Issue

closes #15

Root Cause

  • process_text_file() didn't handle JSON array inputs — raw JSON string sent to LLM as "OCR text"
  • parse_json_object() failed silently — no logging of LLM response
  • stop=["```"] truncated markdown-fenced JSON mid-response
  • Magazine prompt too permissive — models responded with preambles and verbose article content
  • max_tokens=512 too low — verbose responses truncated mid-JSON

Changes

File Change
pipeline.py Handle JSON array inputs — extract text from first page only (metadata lives on cover)
pipeline.py Populate ParsingError.raw_output for both magazine and book paths
llm_clients.py New _extract_json_text() — handles markdown-fenced JSON (```json...```) before falling back to {...} extraction
llm_clients.py Log raw LLM response at INFO level; log parse failures at WARNING with response content
llm_clients.py Remove stop=["```"] from remote _call_llm() (parser handles fences now)
llm_clients.py Increase max_tokens for magazine extraction: 512 → 1024
model_client.py Same markdown-fence handling + logging for local vLLM path
prompts.py Hardened magazine prompt — explicit STRICT RULES banning preambles, markdown, Chinese/Japanese text, and article content extraction

Verification

uv run metaextractor extract supermama.json output.json -b remote
# Before: WARNING Magazine extraction prompt did not produce valid JSON
# After:  Magazine successfully extracted — చందమామ, August 1948, 0-6-0

All 213 tests pass, ruff clean.

Merge request reports

Loading