feat(extracted-text): enforce OCR segment validation with bbox/type/reading_order

Kushal Lagichetty requested to merge doc-digitization into develop

Description

This merge request adds OCR-specific validation for extracted text segments while keeping existing ASR behavior unchanged.

What was implemented

  • Extended the extracted text schema with OCR segment fields:
    • bbox
    • type
    • reading_order
  • Added OCR-only validation rules:
    • bbox, type, reading_order, start, end are required per segment
    • end == start + 1 (page-index rule)
    • bbox must have 4 values and valid geometry (x2 > x1, y2 > y1)
    • type must be one of:
      • Text, Title, Caption, Table, Picture, Formula, Section-header, List-item, Page-header, Page-footer, Footnote
    • reading_order values must be correctly ordered across segments
  • Preserved ASR compatibility:
    • OCR-only fields are not mandatory for extraction_type="asr"
    • existing ASR temporal consistency validation remains
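
For reviewers, the OCR rules listed above can be sketched roughly as follows. This is a minimal standalone illustration, not the code in app/schemas/extracted_text.py: the function name validate_segments, the dict-based segment shape, and the error messages are hypothetical. It also assumes that bbox is laid out as [x1, y1, x2, y2] and that "proper order" for reading_order means strictly increasing, neither of which is stated in this description.

```python
from typing import Any

# Allowed segment types, as listed in this merge request.
ALLOWED_TYPES = {
    "Text", "Title", "Caption", "Table", "Picture", "Formula",
    "Section-header", "List-item", "Page-header", "Page-footer", "Footnote",
}
REQUIRED_OCR_FIELDS = ("bbox", "type", "reading_order", "start", "end")


def validate_segments(extraction_type: str, segments: list[dict[str, Any]]) -> list[str]:
    """Hypothetical helper: return a list of error strings (empty means valid)."""
    errors: list[str] = []
    if extraction_type != "ocr":
        # OCR-only fields are not mandatory for ASR; the existing ASR
        # temporal checks are out of scope for this sketch.
        return errors

    previous_order = None
    for i, seg in enumerate(segments):
        missing = [f for f in REQUIRED_OCR_FIELDS if seg.get(f) is None]
        if missing:
            errors.append(f"segment {i}: missing {', '.join(missing)}")
            continue

        # Page-index rule: each segment spans exactly one page.
        if seg["end"] != seg["start"] + 1:
            errors.append(f"segment {i}: end must equal start + 1")

        bbox = seg["bbox"]
        if len(bbox) != 4:
            errors.append(f"segment {i}: bbox must have 4 values")
        else:
            x1, y1, x2, y2 = bbox  # assumed layout: [x1, y1, x2, y2]
            if not (x2 > x1 and y2 > y1):
                errors.append(f"segment {i}: bbox geometry is invalid")

        if seg["type"] not in ALLOWED_TYPES:
            errors.append(f"segment {i}: type is not allowed")

        # Assumption: "proper order" means strictly increasing reading_order.
        order = seg["reading_order"]
        if previous_order is not None and order <= previous_order:
            errors.append(f"segment {i}: reading_order is out of order")
        previous_order = order

    return errors
```

Returning a list of errors rather than raising on the first failure is a choice made for this sketch only; the actual schema may report validation failures differently.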

Files changed

  • app/schemas/extracted_text.py
  • tests/unit/schemas/test_extracted_text_ocr_segments.py
  • tests/integration/api/test_extracted_text_ocr_validation.py

Checklist

  • The feature has been fully implemented.
  • Tests for the new feature are included and passing.
  • User documentation/guides have been updated (if applicable).
  • Impact on existing functionality has been considered.

Related Issue(s)

Closes #107 (closed)

Edited by Kushal Lagichetty
