feat(extracted-text): enforce OCR segment validation with bbox/type/reading_order
Description
This merge request implements OCR-specific extracted text segment validation while keeping ASR behavior compatible.
What was implemented
- Extended OCR segment support in the extracted text schema with:
  - bbox
  - type
  - reading_order
- Added OCR-only validation rules:
  - bbox, type, reading_order, start, and end are required per segment
  - end == start + 1 (page-index rule)
  - bbox must have 4 values and valid geometry (x2 > x1, y2 > y1)
  - type must be one of:
    - Text, Title, Caption, Table, Picture, Formula, Section-header, List-item, Page-header, Page-footer, Footnote
  - segment reading_order must be in proper order
- Preserved ASR compatibility:
  - OCR-only fields are not mandatory for extraction_type="asr"
  - Existing ASR temporal consistency validation remains
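For reviewers, the OCR rules listed above can be sketched in plain Python. This is an illustrative sketch only, not the code in app/schemas/extracted_text.py: the function names `validate_ocr_segment` and `validate_segments`, the error strings, the `[x1, y1, x2, y2]` bbox ordering, and the reading of "proper order" as strictly increasing across segments are all assumptions made for the example.

```python
from typing import Any

# Allowed OCR segment types, as listed in this merge request.
ALLOWED_TYPES = {
    "Text", "Title", "Caption", "Table", "Picture", "Formula",
    "Section-header", "List-item", "Page-header", "Page-footer", "Footnote",
}
REQUIRED_OCR_FIELDS = ("bbox", "type", "reading_order", "start", "end")


def validate_ocr_segment(segment: dict[str, Any]) -> list[str]:
    """Return error messages for one OCR segment (empty list if valid)."""
    errors = [
        f"missing field: {name}"
        for name in REQUIRED_OCR_FIELDS
        if segment.get(name) is None
    ]
    if errors:
        return errors  # remaining checks need every field present

    # Page-index rule: each segment spans exactly one page.
    if segment["end"] != segment["start"] + 1:
        errors.append("end must equal start + 1")

    bbox = segment["bbox"]
    if len(bbox) != 4:
        errors.append("bbox must have 4 values")
    else:
        x1, y1, x2, y2 = bbox  # assumed ordering: [x1, y1, x2, y2]
        if not (x2 > x1 and y2 > y1):
            errors.append("bbox must satisfy x2 > x1 and y2 > y1")

    if segment["type"] not in ALLOWED_TYPES:
        errors.append(f"type not allowed: {segment['type']}")
    return errors


def validate_segments(extraction_type: str, segments: list[dict[str, Any]]) -> list[str]:
    """Apply OCR-only rules; other extraction types (e.g. "asr") are skipped here."""
    if extraction_type != "ocr":
        return []  # ASR temporal checks are out of scope for this sketch

    errors: list[str] = []
    previous_order = None
    for index, segment in enumerate(segments):
        errors.extend(f"segment {index}: {msg}" for msg in validate_ocr_segment(segment))
        order = segment.get("reading_order")
        if order is not None:
            # Assumption: "proper order" means strictly increasing values.
            if previous_order is not None and order <= previous_order:
                errors.append(f"segment {index}: reading_order out of order")
            previous_order = order
    return errors
```

A segment such as `{"bbox": [0, 0, 10, 5], "type": "Text", "reading_order": 0, "start": 0, "end": 1}` passes every check in this sketch, while the same payload submitted with extraction_type="asr" is not subjected to the OCR-only rules at all.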
Files changed
- app/schemas/extracted_text.py
- tests/unit/schemas/test_extracted_text_ocr_segments.py
- tests/integration/api/test_extracted_text_ocr_validation.py
Checklist
- The feature has been fully implemented.
- Tests for the new feature are included and passing.
- User documentation/guides have been updated (if applicable).
- Impact on existing functionality has been considered.
Related Issue(s)
Closes #107 (closed)
Edited by Kushal Lagichetty