Refactor Speaker Diarization into Transcription Endpoint with Flag (Remove Redundant Endpoints & Code Duplication)
Description
Currently, speaker diarization is handled separately from the transcription pipeline, resulting in duplicated logic across multiple endpoints and services. This leads to increased maintenance overhead and inconsistencies in how transcription and diarization are combined.
This issue proposes consolidating speaker diarization into the existing /transcribe endpoint using a configurable flag (enable_diarization). The goal is to produce a unified, speaker-aware transcription output with precise timestamps, while eliminating redundant code paths and legacy endpoints.
Problem
- Duplicate logic for diarization and transcription across different modules
- Separate endpoints for diarization create fragmentation
- Inconsistent handling of timestamps and speaker alignment
- Pipeline that is harder to maintain and extend
- Increased technical debt due to legacy wrappers and unused code
Proposed Solution
- Introduce a flag-based approach: { "enable_diarization": true }
- Integrate diarization into the /transcribe pipeline:
  - Run diarization → get speaker segments
  - Segment audio
  - Run ASR per segment
  - Align and merge results into a conversation-style output
- Return structured response:
  - Speaker-wise grouped transcription
  - Accurate start/end timestamps
  - Optional segment-level detail (for future extensibility)
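
The pipeline steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names transcribe, Segment and SpeakerTurn are hypothetical, and the asr and diarize callables stand in for whatever models the service really uses. It assumes diarization returns (start_sec, end_sec, speaker) tuples and that audio is a flat sequence of samples.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical shape of one diarization result: (start_sec, end_sec, speaker)
SpeakerTurn = Tuple[float, float, str]


@dataclass
class Segment:
    start: float
    end: float
    text: str
    speaker: Optional[str] = None  # stays None when diarization is off


def transcribe(
    audio: Sequence[float],
    sample_rate: int,
    asr: Callable[[Sequence[float]], str],
    diarize: Optional[Callable[[Sequence[float], int], List[SpeakerTurn]]] = None,
    enable_diarization: bool = False,
) -> List[Segment]:
    """Single entry point for both plain and speaker-aware transcription."""
    if not enable_diarization:
        # Existing behaviour: one ASR pass over the whole recording.
        return [Segment(0.0, len(audio) / sample_rate, asr(audio))]
    if diarize is None:
        raise ValueError("enable_diarization=True requires a diarize callable")

    segments: List[Segment] = []
    # 1. Run diarization and walk the speaker turns in time order.
    for start, end, speaker in sorted(diarize(audio, sample_rate)):
        # 2. Segment the audio for this turn.
        chunk = audio[int(start * sample_rate):int(end * sample_rate)]
        # 3. Run ASR on the segment.
        text = asr(chunk)
        # 4. Merge consecutive turns by the same speaker into one entry.
        if segments and segments[-1].speaker == speaker:
            prev = segments[-1]
            prev.end = end
            prev.text = f"{prev.text} {text}".strip()
        else:
            segments.append(Segment(start, end, text, speaker))
    return segments
```

Passing the models in as callables keeps segmentation, ASR invocation and alignment in one place, which is the single-source-of-truth goal listed under Refactoring Tasks.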
Refactoring Tasks
- Remove or deprecate standalone diarization endpoints
- Consolidate logic into the transcribe pipeline
- Ensure single source of truth for:
  - audio segmentation
  - ASR invocation
  - diarization alignment
- Eliminate duplicate helper functions across modules
- Clean up legacy wrappers if no longer required
- Standardize response format using shared models
- Add proper logging for diarization-enabled flows
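
As one possible reading of the last two tasks, a shared response model with logging for the diarization path might look like the sketch below. SegmentModel, TranscriptionResponse, build_response and the "transcribe" logger name are all hypothetical; the real field names should follow whatever shared models the codebase already has.

```python
import logging
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional

logger = logging.getLogger("transcribe")  # hypothetical logger name


@dataclass
class SegmentModel:
    start: float
    end: float
    text: str
    speaker: Optional[str] = None


@dataclass
class TranscriptionResponse:
    text: str
    diarization_enabled: bool
    segments: List[SegmentModel] = field(default_factory=list)


def build_response(
    segments: List[SegmentModel], diarization_enabled: bool
) -> Dict[str, Any]:
    """Build the one response shape used by both code paths."""
    full_text = " ".join(s.text for s in segments if s.text)
    if diarization_enabled:
        speakers = {s.speaker for s in segments if s.speaker}
        logger.info(
            "diarization enabled: %d segments, %d speakers",
            len(segments),
            len(speakers),
        )
    response = TranscriptionResponse(full_text, diarization_enabled, list(segments))
    # asdict converts nested dataclasses too, so the result is JSON-ready.
    return asdict(response)
```

Because both paths return the same structure, clients that never set the flag see speaker as null rather than a different schema.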
Expected Outcome
- Cleaner and more maintainable codebase
- No duplication of diarization/transcription logic
- Single unified pipeline for all transcription use cases
- Easier extensibility (e.g., subtitles, analytics, summarization)
- Improved consistency in API responses
Acceptance Criteria
- /transcribe supports the enable_diarization flag
- Output includes speaker-wise transcription with timestamps
- No duplicate diarization logic exists elsewhere
- Legacy diarization endpoints are removed or deprecated
- Existing functionality (non-diarization transcription) remains unaffected
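
To keep non-diarization requests unaffected, the flag would presumably default to false when absent. A minimal sketch of that parsing, assuming the request body arrives as a decoded JSON object (parse_enable_diarization is a hypothetical helper, not an existing function):

```python
from typing import Any, Mapping


def parse_enable_diarization(payload: Mapping[str, Any]) -> bool:
    """Read the flag from a request body, defaulting to False when missing."""
    value = payload.get("enable_diarization", False)
    # Reject strings such as "true" so that typos do not silently change behaviour.
    if not isinstance(value, bool):
        raise ValueError("enable_diarization must be a JSON boolean")
    return value
```

Existing clients that send no flag keep getting plain transcription, which covers the last acceptance criterion.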
Notes
This change aligns with a pipeline-based architecture and improves long-term scalability of the ASR + diarization system.