feat: add word_count metric for machine-extracted text
Summary
- Added `word_count` field to `ExtractedText` model: a nullable integer stored persistently in the `extractedtext` table
- Auto-computed server-side on `POST /{record_id}/extracted_text` by summing words across all segment `text` fields using Python's `.split()`, which splits on whitespace and so handles whitespace-delimited Indic scripts correctly
- Recalculated on `PATCH /{record_id}/extracted_text` whenever segments are updated; preserved unchanged if the PATCH contains no segments
- Exposed in all record responses under `extracted_text.word_count`
- Client cannot submit `word_count` directly; it is always derived from segments
- Added Celery backfill task `backfill_word_count` to compute and store `word_count` for all existing `ExtractedText` rows with `NULL` `word_count`; processes rows in configurable batches (default 200) to handle large corpora efficiently
- Added admin-only endpoint `POST /tasks/backfill-word-count` to trigger the backfill task asynchronously
- Added Alembic migration `d4e5f6a7b8c9` to add a nullable `word_count INTEGER` column to the `extractedtext` table
- Fixed pre-existing CI failures: resolved `ruff-lint` (1906 violations from unenforced D/E rules added in a previous commit) and `bandit` (13 medium-severity B108/B113 findings in pre-existing files)
- Fixed Alembic multiple-heads conflict caused by diverged feature branches: added stub migrations for missing revisions and a merge migration to consolidate all heads into a single head
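
For reviewers, a minimal sketch of the counting rule described above. The helper names `compute_word_count` and `word_count_after_patch`, and the representation of segments as plain dicts with a `text` key, are assumptions for illustration only; the actual code in this PR may be structured differently.

```python
from typing import Optional, Sequence


def compute_word_count(segments: Sequence[dict]) -> int:
    """Sum whitespace-delimited words over every segment's `text` field."""
    # str.split() with no arguments splits on runs of whitespace and
    # drops empty strings, so blank or missing text contributes 0.
    return sum(len((seg.get("text") or "").split()) for seg in segments)


def word_count_after_patch(
    current: Optional[int], segments: Optional[Sequence[dict]]
) -> Optional[int]:
    """Recalculate when segments are supplied; otherwise keep the stored value."""
    if segments is None:
        # PATCH carried no segments (e.g. notes only): leave word_count as-is.
        return current
    return compute_word_count(segments)
```

Because the count depends only on whitespace boundaries, a Devanagari segment such as `"नमस्ते दुनिया कैसे"` is counted as three words, the same way an English segment would be. Scripts that do not separate words with whitespace would need a different tokenizer.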
Test plan
- `POST /{record_id}/extracted_text` with 2 segments → `word_count: 6`
- `GET /{record_id}` → `extracted_text.word_count` present and correct
- `PATCH /{record_id}/extracted_text` with updated segments → `word_count` recalculated to 7
- `GET /{record_id}` after PATCH → `word_count` updated to 7
- `PATCH` with notes only (no segments) → `word_count` stays 7
- Duplicate `POST` → `409` (cannot overwrite existing extracted text)
- `POST /tasks/backfill-word-count` as admin → `200 PENDING` with `task_id`
- `POST /tasks/backfill-word-count?batch_size=50` → message reflects custom batch size
- `POST /tasks/backfill-word-count` as regular user → `403`
- `POST` on non-existent record → `404`
- `PATCH` on record with no extracted text → `400`
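
To make the `batch_size` behaviour checked above easier to reason about, here is a rough sketch of a batched backfill loop. It uses the standard-library `sqlite3` module instead of Celery and the project's ORM so that it runs standalone, and it assumes a hypothetical `segments` column holding a JSON array; neither reflects the real task or schema.

```python
import json
import sqlite3


def backfill_word_count(conn: sqlite3.Connection, batch_size: int = 200) -> int:
    """Fill word_count for rows where it is NULL, one batch at a time."""
    updated = 0
    while True:
        # Only rows still missing a count are selected, so every pass
        # shrinks the remaining set and the loop terminates.
        rows = conn.execute(
            "SELECT id, segments FROM extractedtext "
            "WHERE word_count IS NULL ORDER BY id LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return updated
        for row_id, segments_json in rows:
            segments = json.loads(segments_json or "[]")
            count = sum(len((s.get("text") or "").split()) for s in segments)
            conn.execute(
                "UPDATE extractedtext SET word_count = ? WHERE id = ?",
                (count, row_id),
            )
        # Commit once per batch so completed batches are kept if a later one fails.
        conn.commit()
        updated += len(rows)
```

Selecting on `word_count IS NULL` also makes the loop safe to re-run: rows that already have a count are never touched.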
Checklist
- Code follows project API guidelines
- Documentation is updated (OpenAPI docs reflect new `word_count` field)
- Code adheres to project coding standards
- Backfill task for existing records implemented and tested
- CI pipeline passing (`ruff-lint` and `bandit` resolved)
Closes #123
Edited by Gunaputra Nagendra Pavan Yedida