Refactor Speaker Diarization into Transcription Endpoint with Flag (Remove Redundant Endpoints & Code Duplication)

Description

Currently, speaker diarization is handled separately from the transcription pipeline, resulting in duplicated logic across multiple endpoints and services. This leads to increased maintenance overhead and inconsistencies in how transcription and diarization are combined.

This issue proposes consolidating speaker diarization into the existing /transcribe endpoint using a configurable flag (enable_diarization). The goal is to produce a unified, speaker-aware transcription output with precise timestamps, while eliminating redundant code paths and legacy endpoints.


Problem

  • Duplicate logic for diarization and transcription across different modules
  • Separate endpoints for diarization create fragmentation
  • Inconsistent handling of timestamps and speaker alignment
  • The pipeline is harder to maintain and extend because changes must be repeated in each code path
  • Increased technical debt due to legacy wrappers and unused code

Proposed Solution

  • Add an opt-in enable_diarization flag to the /transcribe request:

    {
      "enable_diarization": true
    }
  • Integrate diarization into the /transcribe pipeline:

    • Run diarization → get speaker segments (speaker label, start, end)
    • Split the audio along the speaker segment boundaries
    • Run ASR on each segment
    • Align and merge the per-segment results into a conversation-style output
  • Return structured response:

    • Speaker-wise grouped transcription
    • Accurate start/end timestamps
    • Optional segment-level detail (for future extensibility)
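
The pipeline steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Segment` model, the `transcribe` signature, and the `asr` / `diarize` callables are hypothetical stand-ins for the real models, and merging consecutive turns from the same speaker is only one possible way to build the conversation-style output.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Segment:
    start: float                    # seconds from the start of the audio
    end: float
    text: str
    speaker: Optional[str] = None   # None when diarization is disabled


# Hypothetical model interfaces (assumptions, not an existing API):
#   asr(samples, sample_rate) -> transcript text
#   diarize(samples, sample_rate) -> [(speaker, start_sec, end_sec), ...]
AsrFn = Callable[[Sequence[float], int], str]
DiarizeFn = Callable[[Sequence[float], int], List[Tuple[str, float, float]]]


def transcribe(
    samples: Sequence[float],
    sample_rate: int,
    asr: AsrFn,
    diarize: Optional[DiarizeFn] = None,
    enable_diarization: bool = False,
) -> List[Segment]:
    """Single entry point for plain and speaker-aware transcription."""
    if not enable_diarization:
        # Existing behaviour: one ASR pass over the whole recording.
        duration = len(samples) / sample_rate
        return [Segment(0.0, duration, asr(samples, sample_rate))]
    if diarize is None:
        raise ValueError("enable_diarization=True requires a diarize callable")

    merged: List[Segment] = []
    turns = sorted(diarize(samples, sample_rate), key=lambda turn: turn[1])
    for speaker, start, end in turns:
        # Segment the audio along the diarization boundaries.
        lo = max(0, int(start * sample_rate))
        hi = min(len(samples), int(end * sample_rate))
        if hi <= lo:
            continue
        text = asr(samples[lo:hi], sample_rate).strip()
        if not text:
            continue
        # Merge consecutive turns from the same speaker into one entry.
        if merged and merged[-1].speaker == speaker:
            merged[-1].end = end
            merged[-1].text = f"{merged[-1].text} {text}"
        else:
            merged.append(Segment(start, end, text, speaker))
    return merged
```

Because both branches return the same `Segment` list, callers do not need separate handling for diarized and non-diarized output.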

Refactoring Tasks

  • Remove or deprecate standalone diarization endpoints

  • Consolidate diarization logic into the /transcribe pipeline

  • Ensure a single source of truth for:

    • audio segmentation
    • ASR invocation
    • diarization alignment
  • Eliminate duplicate helper functions across modules

  • Clean up legacy wrappers if no longer required

  • Standardize response format using shared models

  • Add proper logging for diarization-enabled flows
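
As one possible shape for the shared response models and the diarization logging, here is a small sketch. All names (`SpeakerSegment`, `TranscriptionResponse`, `build_response`, the `"transcribe"` logger) are hypothetical and chosen only for illustration; the actual field names should follow whatever the existing /transcribe response already uses.

```python
import logging
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional

logger = logging.getLogger("transcribe")  # hypothetical logger name


@dataclass
class SpeakerSegment:
    start: float
    end: float
    text: str
    speaker: Optional[str] = None  # stays None for non-diarized requests


@dataclass
class TranscriptionResponse:
    text: str
    diarization_enabled: bool
    segments: List[SpeakerSegment] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        # asdict() also converts the nested SpeakerSegment objects.
        return asdict(self)


def build_response(
    segments: List[SpeakerSegment], enable_diarization: bool
) -> TranscriptionResponse:
    """Build the one response shape used by both code paths."""
    if enable_diarization:
        speakers = {s.speaker for s in segments if s.speaker}
        logger.info(
            "diarization enabled: %d segments, %d speakers",
            len(segments),
            len(speakers),
        )
    return TranscriptionResponse(
        text=" ".join(s.text for s in segments),
        diarization_enabled=enable_diarization,
        segments=list(segments),
    )
```

Keeping `speaker` optional on a single model, rather than defining a second diarized response type, is a design choice that lets the flag change the content of the response without changing its schema.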


Expected Outcome

  • Cleaner and more maintainable codebase
  • No duplication of diarization/transcription logic
  • Single unified pipeline for all transcription use cases
  • Easier extensibility (e.g., subtitles, analytics, summarization)
  • Improved consistency in API responses

Acceptance Criteria

  • /transcribe supports the enable_diarization flag
  • Output includes speaker-wise transcription with timestamps
  • No duplicate diarization logic exists elsewhere
  • Legacy diarization endpoints are removed or deprecated
  • Existing functionality is unaffected: when enable_diarization is absent or false, /transcribe behaves as it does today
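
To make the first and last criteria concrete, the flag can default to false so that existing clients see no change. The sketch below assumes the options arrive as a JSON object, as in the example under Proposed Solution; `TranscribeOptions` and `parse_options` are hypothetical names, and the real endpoint may carry the flag differently (for example as a form field or query parameter).

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscribeOptions:
    # Off by default, so requests that omit the flag keep today's behaviour.
    enable_diarization: bool = False


def parse_options(body: str) -> TranscribeOptions:
    """Parse the JSON options of a /transcribe request (illustrative only)."""
    payload = json.loads(body) if body.strip() else {}
    if not isinstance(payload, dict):
        raise ValueError("request options must be a JSON object")
    flag = payload.get("enable_diarization", False)
    # Reject values such as "yes" or 1 rather than guessing the intent.
    if not isinstance(flag, bool):
        raise ValueError("enable_diarization must be a boolean")
    return TranscribeOptions(enable_diarization=flag)
```

A regression test for the last criterion could then call the endpoint with and without the flag set to false and compare both responses against a recorded baseline.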

Notes

This change aligns with a pipeline-based architecture and improves the long-term maintainability and extensibility of the ASR + diarization system.