[Feature]: OCR extracted_text segment validation and ordering
name: Feature Request
about: Suggest a new feature or enhancement for the Corpus Collector Backend.
title: "[Feature]: OCR extracted_text segment validation and ordering"
labels: "feature"
assignees: ''
title: "feat(extracted-text): enforce OCR segment schema and ordering"
🚀 Feature Request
Is your feature request related to a problem? Please describe.
OCR output is not currently validated reliably enough for layout-aware proofreading. The downstream proofreading UX requires per-segment metadata such as bounding boxes and reading order, but incoming payloads can be inconsistent or lack the required structure. As a result, segments can be displayed in the wrong order, and malformed data is accepted at storage time.
Describe the solution you'd like
Add extraction-type-aware validation for extracted_text payloads:
- When extraction_type = "ocr":
  - require per segment: bbox, type, reading_order, start, end, text
  - enforce page-index rule: end == start + 1
  - validate bbox structure and geometry
  - enforce segment ordering via reading_order
  - restrict type to approved OCR segment labels
- When extraction_type = "asr":
  - keep existing ASR behavior
  - do not make OCR-only fields mandatory
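The rules above could be sketched roughly as follows. This is a minimal illustration, not the proposed implementation. Several details are assumptions that this issue does not specify: the function names, the `[x0, y0, x1, y1]` bbox layout, the contents of `ALLOWED_OCR_TYPES`, and the choice to treat `reading_order` as unique per page and return segments sorted by it.

```python
from __future__ import annotations

from typing import Any

# Hypothetical label set: the approved OCR segment labels are not listed in this issue.
ALLOWED_OCR_TYPES = {"paragraph", "heading", "table", "figure", "caption"}
OCR_REQUIRED_FIELDS = ("bbox", "type", "reading_order", "start", "end", "text")


def _is_int(value: Any) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _is_number(value: Any) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_ocr_segment(segment: dict, index: int) -> list[str]:
    """Return a list of error messages for one OCR segment (empty if valid)."""
    missing = [f for f in OCR_REQUIRED_FIELDS if f not in segment]
    if missing:
        return [f"segment {index}: missing fields: {', '.join(missing)}"]

    errors = []
    start, end = segment["start"], segment["end"]
    # Page-index rule from the issue: end == start + 1.
    if not (_is_int(start) and _is_int(end)) or start < 0 or end != start + 1:
        errors.append(f"segment {index}: expected integer pages with end == start + 1")

    # Assumed bbox layout: [x0, y0, x1, y1] with a positive width and height.
    bbox = segment["bbox"]
    if not (isinstance(bbox, (list, tuple)) and len(bbox) == 4
            and all(_is_number(v) for v in bbox)):
        errors.append(f"segment {index}: bbox must be four numbers")
    else:
        x0, y0, x1, y1 = bbox
        if x0 < 0 or y0 < 0 or x1 <= x0 or y1 <= y0:
            errors.append(f"segment {index}: bbox has invalid geometry")

    if segment["type"] not in ALLOWED_OCR_TYPES:
        errors.append(f"segment {index}: unsupported type {segment['type']!r}")
    if not _is_int(segment["reading_order"]) or segment["reading_order"] < 0:
        errors.append(f"segment {index}: reading_order must be a non-negative integer")
    if not isinstance(segment["text"], str):
        errors.append(f"segment {index}: text must be a string")
    return errors


def validate_extracted_text(extraction_type: str, segments: list[dict]) -> list[dict]:
    """Validate OCR segments and return them in reading order.

    Other extraction types (e.g. "asr") are passed through untouched here,
    standing in for whatever validation already exists for them.
    """
    if extraction_type != "ocr":
        return list(segments)

    errors: list[str] = []
    for i, seg in enumerate(segments):
        errors.extend(validate_ocr_segment(seg, i))

    if not errors:
        # Assumption: reading_order must be unique within a page.
        keys = [(s["start"], s["reading_order"]) for s in segments]
        if len(set(keys)) != len(keys):
            errors.append("duplicate reading_order within a page")

    if errors:
        raise ValueError("; ".join(errors))
    return sorted(segments, key=lambda s: (s["start"], s["reading_order"]))
```

In this sketch, an ASR payload such as `[{"start": 0.0, "end": 1.5, "text": "hi"}]` passes through unchanged, while an OCR segment with `end` not equal to `start + 1` raises `ValueError`. Whether out-of-order segments should be rejected or silently sorted is a design choice left open here; the sketch sorts them.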
Also include unit and integration tests covering valid/invalid OCR payloads and ASR compatibility.
Describe alternatives you've considered
- Keep validation loose and let the frontend handle bad OCR payloads. Rejected: pushes data integrity issues to consumers and increases runtime failures.
- Introduce a new DB table for OCR segments. Rejected for now: larger migration and API contract change than needed.
- Apply OCR constraints to all extraction types. Rejected: would break ASR/caption/manual flows.
Additional context
- This is a backend validation/contract change only; no schema migration is required.
- Expected impact: better OCR data consistency, stable paragraph ordering, safer proofreading pipeline.