Fix magazine JSON parsing: handle array inputs, harden prompt, add diagnostic logging
Description
Problem
Magazine metadata extraction fails on structured OCR JSON arrays with the unhelpful error "Magazine extraction prompt did not produce valid JSON".
Related Issue
closes #15
Root Cause
-
process_text_file()didn't handle JSON array inputs — raw JSON string sent to LLM as "OCR text" -
parse_json_object()failed silently — no logging of LLM response -
stop=["```"]truncated markdown-fenced JSON mid-response - Magazine prompt too permissive — models responded with preambles and verbose article content
-
max_tokens=512too low — verbose responses truncated mid-JSON
Changes
| File | Change |
|---|---|
pipeline.py |
Handle JSON array inputs — extract text from first page only (metadata lives on cover) |
pipeline.py |
Populate ParsingError.raw_output for both magazine and book paths |
llm_clients.py |
New _extract_json_text() — handles markdown-fenced JSON (```json...```) before falling back to {...} extraction |
llm_clients.py |
Log raw LLM response at INFO level; log parse failures at WARNING with response content |
llm_clients.py |
Remove stop=["```"] from remote _call_llm() (parser handles fences now) |
llm_clients.py |
Increase max_tokens for magazine extraction: 512 → 1024 |
model_client.py |
Same markdown-fence handling + logging for local vLLM path |
prompts.py |
Hardened magazine prompt — explicit STRICT RULES banning preambles, markdown, Chinese/Japanese text, and article content extraction |
Verification
uv run metaextractor extract supermama.json output.json -b remote
# Before: WARNING Magazine extraction prompt did not produce valid JSON
# After: Magazine successfully extracted — చందమామ, August 1948, 0-6-0
All 213 tests pass, ruff clean.