
feat: add word_count metric for machine-extracted text

Summary

  • Added word_count field to ExtractedText model — nullable integer stored persistently in the extractedtext table
  • Auto-computed server-side on POST /{record_id}/extracted_text by summing words across all segment text fields using Python's str.split(), which splits on any Unicode whitespace, so whitespace-delimited Indic-script text is counted the same way as Latin-script text
  • Recalculated on PATCH /{record_id}/extracted_text whenever segments are updated; preserved unchanged if PATCH contains no segments
  • Exposed in all record responses under extracted_text.word_count
  • Client cannot submit word_count directly — always derived from segments
  • Added Celery backfill task backfill_word_count to compute and store word_count for all existing ExtractedText rows with NULL word_count — processes in configurable batches (default 200) to handle large corpora efficiently
  • Added admin-only endpoint POST /tasks/backfill-word-count to trigger the backfill task asynchronously
  • Added alembic migration d4e5f6a7b8c9 to add nullable word_count INTEGER column to extractedtext table
  • Fixed pre-existing CI failures: resolved ruff-lint (1906 violations from unenforced D/E rules added in a previous commit) and bandit (13 medium-severity B108/B113 findings in pre-existing files)
  • Fixed alembic multiple-heads conflict caused by diverged feature branches — added stub migrations for missing revisions and a merge migration to consolidate all heads into a single head
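
The counting and PATCH rules described above can be sketched roughly as follows. This is a minimal illustration, not the code in this merge request: the function names, the dict-shaped segments with a "text" key, and the sample payloads are all assumptions made for the example.

```python
from __future__ import annotations

from typing import Optional


def compute_word_count(segments: list[dict]) -> int:
    """Sum whitespace-delimited tokens across every segment's text."""
    # str.split() with no arguments splits on runs of any Unicode whitespace
    # and drops leading/trailing whitespace, so empty or missing text adds 0.
    return sum(len((seg.get("text") or "").split()) for seg in segments)


def word_count_after_patch(current: Optional[int], patch: dict) -> Optional[int]:
    """Recompute when the PATCH carries segments; otherwise keep the stored value."""
    segments = patch.get("segments")
    if segments is None:
        # PATCH without segments (e.g. notes only): stored count is preserved.
        return current
    # Any client-supplied "word_count" key in the patch is ignored on purpose.
    return compute_word_count(segments)


# Hypothetical payloads, chosen only to illustrate the arithmetic.
created = [{"text": "नमस्ते दुनिया कैसे हो"}, {"text": "hello world"}]
stored = compute_word_count(created)  # 4 + 2 = 6

updated = {"segments": [{"text": "नमस्ते दुनिया कैसे हो आज"}, {"text": "hello world"}]}
stored = word_count_after_patch(stored, updated)  # 5 + 2 = 7

stored = word_count_after_patch(stored, {"notes": "reviewed"})  # still 7
```

Because the count relies on whitespace, text in scripts that do not separate words with spaces would be counted as a single token per unbroken run; that limitation follows from str.split() itself.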

Test plan

  • POST /{record_id}/extracted_text with 2 segments → word_count: 6
  • GET /{record_id} → extracted_text.word_count present and correct
  • PATCH /{record_id}/extracted_text with updated segments → word_count recalculated to 7
  • GET /{record_id} after PATCH → word_count updated to 7
  • PATCH with notes only (no segments) → word_count stays 7
  • Duplicate POST → 409, cannot overwrite existing extracted text
  • POST /tasks/backfill-word-count as admin → 200 PENDING with task_id
  • POST /tasks/backfill-word-count?batch_size=50 → message reflects custom batch size
  • POST /tasks/backfill-word-count as regular user → 403
  • POST on non-existent record → 404
  • PATCH on record with no extracted text → 400
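
For the backfill cases above, the batching behaviour (default 200, overridable via batch_size) might look like the sketch below. It uses an in-memory dict as a stand-in for the extractedtext table; the helper names, the row shape, and the per-batch commit point are assumptions for illustration and do not reflect the actual Celery task.

```python
from __future__ import annotations

from typing import Iterator, Optional


def iter_batches(ids: list[int], batch_size: int = 200) -> Iterator[list[int]]:
    """Yield consecutive slices of at most batch_size ids."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def count_words(segments: list[dict]) -> int:
    """Whitespace-based word count over all segment texts."""
    return sum(len((seg.get("text") or "").split()) for seg in segments)


def backfill_in_memory(rows: dict[int, dict], batch_size: int = 200) -> int:
    """Fill word_count for rows where it is None; return how many rows changed.

    rows maps a hypothetical row id to {"segments": [...], "word_count": int | None}.
    """
    # Only rows with a NULL (None) word_count are selected; others are untouched.
    pending = [rid for rid, row in rows.items() if row["word_count"] is None]
    updated = 0
    for batch in iter_batches(pending, batch_size):
        for rid in batch:
            rows[rid]["word_count"] = count_words(rows[rid]["segments"])
            updated += 1
        # A database-backed task would typically commit once per batch here.
    return updated


# Hypothetical rows: two need backfilling, one already has a value.
rows: dict[int, dict] = {
    1: {"segments": [{"text": "a b"}], "word_count": None},
    2: {"segments": [{"text": "c"}], "word_count": 5},
    3: {"segments": [{"text": "d e f"}], "word_count": None},
}
changed = backfill_in_memory(rows, batch_size=50)
```

Selecting only rows with a missing count keeps the sketch safe to re-run: a second call finds nothing pending and changes no rows.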

Checklist

  • Code follows project API guidelines
  • Documentation is updated (OpenAPI docs reflect new word_count field)
  • Code adheres to project coding standards
  • Backfill task for existing records implemented and tested
  • CI pipeline passing (ruff-lint and bandit resolved)

Closes #123

Edited by Gunaputra Nagendra Pavan Yedida
