refactor(record): move extracted_text JSONB to dedicated table

name: Refactoring Request about: Suggest improvements to the codebase structure, readability, or performance without changing external behavior. title: "[Refactor]: move extracted_text JSONB to dedicated table" labels: "refactoring" assignees: ''

title: "refactor(record): move extracted_text JSONB to dedicated table"

♻️ Refactoring Request

Describe the current situation Currently, extracted_text (OCR/ASR data) is stored as a JSONB blob within the Record model. This prevents the use of standard database indexes for primary fields like confidence, language, and extraction_type, and lacks database-level type enforcement.

Describe the suggested refactoring Move extracted_text into a dedicated extractedtext table with a one-to-one relationship to Record.

Map primary JSON keys (transcription, confidence, language, extraction_type, quality_score, notes, summary, model_name, processing_date) to typed database columns.
Retain segments, named_entities, and metadata as JSONB columns within the new table.
Implement an Alembic migration to transfer existing data and drop the old column.

Benefits of the refactoring

Performance: Enables B-tree indexing on core metadata fields for faster filtering.
Data Integrity: Enforces type safety and NOT NULL constraints at the database level.
Scalability: Reduces record table bloat by offloading large text transcriptions.

Potential impact or risks

Migration: Requires a data-safe Alembic script to prevent loss during the transition.
API Compatibility: Pydantic schemas must use validators to ensure the API response remains a nested JSON object for the frontend.
Versioning: RecordHistory snapshots must be updated to include the related table data.

Additional context Maintain the existing versioning logic by serializing the ExtractedText model into RecordHistory snapshots.

Edited Apr 07, 2026 by Kushal Lagichetty