Magazine extraction fails silently: LLM response JSON not parsed, no diagnostic logging
Description
When running metaextractor extract on structured OCR JSON arrays (e.g., supermama.json), the magazine metadata pipeline fails with:
Magazine extraction prompt did not produce valid JSON
Three root causes:
-
Data preparation bug:
process_text_file()only handleddictJSON inputs. JSON arrays (lists of OCR blocks) fell through to the fallback path, which passed the raw JSON string (including all structural characters{,},[,]) as "OCR text" to the LLM. This garbled input confused the model. -
No diagnostic logging: When
parse_json_object()failed, it silently returned{}with no log of the LLM's actual response. TheParsingErrorhad araw_outputfield that was never populated. Debugging was impossible. -
Prompt + LLM interaction: The magazine prompt was too permissive. Models (especially Qwen2.5) responded with Chinese preambles, markdown fences (
```json), and tried to extract article content as nested JSON — hitting the 512-token limit and producing truncated/invalid JSON.