Feat/rag: RAG Implementation and Hybrid Retrieval (FTS + pg_trgm + context_window)

Feature Request Template

Description

This merge request adds Retrieval-Augmented Generation (RAG) support to the Corpus Server App, built on hybrid retrieval that combines PostgreSQL full-text search with pg_trgm trigram similarity. Users can query a record in natural language and receive the matching text segments together with their surrounding context.

Key Features Implemented

  • Hybrid Retrieval Service: Combines PostgreSQL full-text search (tsvector) with trigram similarity for robust multilingual text matching (Telugu + English)
  • Knowledge Graph Validation: MetadataIndexer service validates and optimizes knowledge graphs for storage in the extracted text table
  • RAG API Endpoints: Four endpoints for semantic querying:
    • GET /{record_id}/knowledge-status - Check indexing status for a record
    • PUT /{record_id}/knowledge - Save/update knowledge graph and metadata
    • GET /{record_id}/knowledge-map - Retrieve knowledge graph with usage stats
    • POST /{record_id}/retrievals - Search within a record using hybrid retrieval
  • Text Normalization Utils: TextNormalizer class handles extraction of text from various formats (dict/model, segments, transcription fields)
  • Record Model Extensions: Added semantic_metadata, knowledge_graph, and indexed_version fields to support semantic indexing
  • RAG Status Tracking: RAGStatus enum (no_text, ready_for_indexing, indexed, index_failed) with version-aware caching
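
To illustrate the status tracking listed above, here is a minimal sketch of how the four RAGStatus values might be derived from a record. Only the enum values and the indexed_version field come from this merge request; the resolve_status function, the record_version parameter, and the last_index_failed flag are hypothetical names introduced for the example, and the actual implementation may order these checks differently.

```python
from enum import Enum
from typing import Optional


class RAGStatus(str, Enum):
    """Status values named in the merge request description."""
    NO_TEXT = "no_text"
    READY_FOR_INDEXING = "ready_for_indexing"
    INDEXED = "indexed"
    INDEX_FAILED = "index_failed"


def resolve_status(
    text: Optional[str],
    record_version: int,             # hypothetical: current version of the record
    indexed_version: Optional[int],  # version that was last indexed, if any
    last_index_failed: bool = False, # hypothetical: set when indexing raised an error
) -> RAGStatus:
    """Derive the status; a stale indexed_version means re-indexing is needed."""
    if not text or not text.strip():
        return RAGStatus.NO_TEXT
    if last_index_failed:
        return RAGStatus.INDEX_FAILED
    if indexed_version is not None and indexed_version == record_version:
        return RAGStatus.INDEXED
    return RAGStatus.READY_FOR_INDEXING
```

Under these assumptions, a record whose text changed after indexing (record_version ahead of indexed_version) falls back to ready_for_indexing, which is one way to read "version-aware".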

Adaptive Query Parameters

Query parameters (top_k, context_window, trigram_threshold) are dynamically computed based on query characteristics:

  • Short queries (≤2 tokens): larger top_k (12) and context_window (2)
  • Queries containing keywords such as "explain" or "describe": increased context_window for more detailed results
  • Queries containing "approx", "similar", or "like": lowered trigram_threshold for fuzzier matching
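
The rules above can be sketched as a small pure function. The values 12 and 2 for short queries come from this description; the defaults (top_k=8, context_window=1, trigram_threshold=0.3), the raised context_window of 3, and the lowered threshold of 0.2 are assumptions chosen for illustration, as are the names QueryParams and compute_query_params.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class QueryParams:
    # Default values are assumptions; the merge request does not state them.
    top_k: int = 8
    context_window: int = 1
    trigram_threshold: float = 0.3


DETAIL_KEYWORDS = {"explain", "describe"}
FUZZY_KEYWORDS = {"approx", "similar", "like"}


def compute_query_params(query: str) -> QueryParams:
    """Adjust retrieval parameters based on simple query characteristics."""
    tokens = query.lower().split()
    token_set = set(tokens)
    params = QueryParams()

    # Short queries carry little signal: fetch more hits and more context.
    if len(tokens) <= 2:
        params = replace(params, top_k=12, context_window=2)

    # Explanatory queries benefit from wider context (3 is an assumed value).
    if DETAIL_KEYWORDS & token_set:
        params = replace(params, context_window=max(params.context_window, 3))

    # Fuzzy-intent keywords lower the trigram threshold (0.2 is an assumed value).
    if FUZZY_KEYWORDS & token_set:
        params = replace(params, trigram_threshold=0.2)

    return params
```

This sketch splits on whitespace only, so a keyword followed by punctuation (for example "explain,") would not match; a real implementation would likely normalize tokens first.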

Fallback Strategy

Multi-tier retrieval with graceful degradation:

  1. Hybrid query (FTS + trigram similarity) using an OR-combined tsquery
  2. FTS-only fallback
  3. Token-wise similarity fallback
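
The tiered fallback can be sketched as a loop over search callables that stops at the first tier returning results. This is an illustration under stated assumptions, not the code in this merge request: retrieve_with_fallback, build_or_tsquery, and the tier names are hypothetical, and each tier is assumed to be a function that takes the query string and returns a list of segment dicts.

```python
from typing import Callable, List, Sequence, Tuple

Segment = dict
Tier = Tuple[str, Callable[[str], List[Segment]]]

# Characters with operator meaning in to_tsquery input; removed before joining.
_TSQUERY_OPERATORS = str.maketrans("", "", "&|!():*<>'\\")


def build_or_tsquery(query: str) -> str:
    """Join whitespace-separated tokens with '|' so any single term can match."""
    tokens = [t.translate(_TSQUERY_OPERATORS) for t in query.split()]
    return " | ".join(t for t in tokens if t)


def retrieve_with_fallback(
    query: str, tiers: Sequence[Tier]
) -> Tuple[str, List[Segment]]:
    """Try each tier in order and return the first non-empty result set."""
    for name, search in tiers:
        results = search(query)
        if results:
            return name, results
    # Every tier came back empty.
    return "none", []
```

In use, the tiers would be passed in the order listed above, for example [("hybrid", hybrid_search), ("fts_only", fts_search), ("token_similarity", token_search)], where the three search functions are placeholders for the service's actual queries. Returning the tier name alongside the results makes it easy to log which tier answered a given query.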

Checklist

  • The feature has been fully implemented.
  • Tests for the new feature are included and passing.
  • User documentation/guides have been updated (if applicable).
  • Impact on existing functionality has been considered.

Related Issue(s)

Closes #

Edited by Mukthanand Reddy M
