Fix magazine JSON parsing: handle array inputs, harden prompt, add diagnostic logging (!19) · Merge requests · VISWAM / admin / Meta Data Indexing

Praneeth Ashish requested to merge fix/magazine-json-parsing-and-input-handling into main Jun 03, 2026

Description

Problem

Magazine metadata extraction fails on structured OCR JSON arrays with the unhelpful error "Magazine extraction prompt did not produce valid JSON".

Related Issue

closes #15

Root Cause

process_text_file() didn't handle JSON array inputs — raw JSON string sent to LLM as "OCR text"
parse_json_object() failed silently — no logging of LLM response
stop=["```"] truncated markdown-fenced JSON mid-response
Magazine prompt too permissive — models responded with preambles and verbose article content
max_tokens=512 too low — verbose responses truncated mid-JSON

Changes

File	Change
`pipeline.py`	Handle JSON array inputs — extract text from first page only (metadata lives on cover)
`pipeline.py`	Populate `ParsingError.raw_output` for both magazine and book paths
`llm_clients.py`	New `_extract_json_text()` — handles markdown-fenced JSON (```json...```) before falling back to `{...}` extraction
`llm_clients.py`	Log raw LLM response at INFO level; log parse failures at WARNING with response content
`llm_clients.py`	Remove stop=["```"] from remote `_call_llm()` (parser handles fences now)
`llm_clients.py`	Increase `max_tokens` for magazine extraction: 512 → 1024
`model_client.py`	Same markdown-fence handling + logging for local vLLM path
`prompts.py`	Hardened magazine prompt — explicit STRICT RULES banning preambles, markdown, Chinese/Japanese text, and article content extraction

Verification

uv run metaextractor extract supermama.json output.json -b remote
# Before: WARNING Magazine extraction prompt did not produce valid JSON
# After:  Magazine successfully extracted — చందమామ, August 1948, 0-6-0

All 213 tests pass, ruff clean.

Fix magazine JSON parsing: handle array inputs, harden prompt, add diagnostic logging

Description

Problem

Related Issue

Root Cause

Changes

Verification

Merge request reports