refactor(record): move extracted_text JSONB to dedicated table
name: Refactoring Request about: Suggest improvements to the codebase structure, readability, or performance without changing external behavior. title: "[Refactor]: move extracted_text JSONB to dedicated table" labels: "refactoring" assignees: ''
title: "refactor(record): move extracted_text JSONB to dedicated table"
Describe the current situation Currently, extracted_text (OCR/ASR data) is stored as a JSONB blob within the Record model. This prevents the use of standard database indexes for primary fields like confidence, language, and extraction_type, and lacks database-level type enforcement.
Describe the suggested refactoring Move extracted_text into a dedicated extractedtext table with a one-to-one relationship to Record.
- Map primary JSON keys (transcription, confidence, language, extraction_type, quality_score, notes, summary, model_name, processing_date) to typed database columns.
- Retain segments, named_entities, and metadata as JSONB columns within the new table.
- Implement an Alembic migration to transfer existing data and drop the old column.
Benefits of the refactoring
- Performance: Enables B-tree indexing on core metadata fields for faster filtering.
- Data Integrity: Enforces type safety and NOT NULL constraints at the database level.
- Scalability: Reduces record table bloat by offloading large text transcriptions.
Potential impact or risks
- Migration: Requires a data-safe Alembic script to prevent loss during the transition.
- API Compatibility: Pydantic schemas must use validators to ensure the API response remains a nested JSON object for the frontend.
- Versioning: RecordHistory snapshots must be updated to include the related table data.
Additional context Maintain the existing versioning logic by serializing the ExtractedText model into RecordHistory snapshots.