feat(records): Implement is_extracted Boolean Filter for Targeted Record Review (!231) · Merge requests · Corpus / Corpus Server Platform

Kushal Lagichetty requested to merge feat/records-is-extracted-filter into develop May 28, 2026

Summary

This MR introduces a highly-requested filtering capability to the POST /api/v1/records/for-review endpoint. It allows reviewers to specifically target records based on their extraction status, bridging the gap between automated AI processing and manual transcription workflows.

Closes #(issue number)

Context

Previously, reviewers had to sift through records without knowing whether they were validating AI-generated results (OCR/ASR) or starting a fresh manual transcription. This filter provides the necessary granularity to prioritize validation vs. creation tasks.

Technical Implementation

Boolean Filter Logic

Value	Strategy	Behavior
`is_extracted: true`	`INNER JOIN` with `extractedtext`	Returns records processed by AI models (OCR, ASR), excluding those marked as `manual`
`is_extracted: false`	`LEFT JOIN` with `extractedtext`	Returns records with no extraction entry, plus `manual` entries where `segments` is not yet populated

Security & Compliance

Raw SQL construction was refactored to use static literals for JOIN types, ensuring compliance with Bandit B608 (SQL injection prevention) by avoiding string interpolation for SQL keywords.

Performance

Integrated into the existing optimized raw SQL path to maintain low latency during random record selection.

Documentation

Updated the endpoint's docstring with parameter definitions and concrete examples for both true and false states.

Checklist

Feature fully implemented.
Tests for the new feature included and passing.
User documentation/guides updated where applicable.
Impact on existing functionality considered.

feat(records): Implement is_extracted Boolean Filter for Targeted Record Review