Feat: Speaker Diarization Support (!29)

Vemuri priya requested to merge speaker-diarization into develop Apr 26, 2026

Summary

This MR adds speaker diarization to the voice app backend so that it can detect and differentiate multiple speakers within a single audio recording. Instead of treating the whole recording as coming from one source, the system analyzes voice characteristics and splits the audio into speaker-specific intervals. Each segment carries a speaker label (e.g., SPEAKER_0, SPEAKER_1) together with start and end timestamps.

With diarization integrated into the existing speech-to-text pipeline, the backend can return speaker-aware structured output instead of a plain transcript. Downstream processing can then associate spoken content with the correct speaker, which makes the results more usable for meetings, interviews, and multi-party conversations. The change stays compatible with the current async job-based architecture.

Problem

Currently, the backend processes audio and returns transcriptions without distinguishing between speakers. This limits usability in real-world scenarios such as meetings, interviews, and multi-speaker conversations.

Solution

Implemented a speaker diarization module using pyannote.audio, which:

- Detects distinct speakers in audio recordings
- Assigns speaker labels (SPEAKER_0, SPEAKER_1, etc.)
- Returns structured speaker segments with timestamps
- Integrates cleanly with the existing async job pipeline

Implementation Details

New Module

services/diarization.py

Implements:

def diarize_audio(audio_path: str) -> list

Core Functionality

- Uses a pretrained diarization pipeline
- Extracts:
  - speaker labels
  - start time
  - end time
- Ensures:
  - timestamps are accurate
  - segments are sorted chronologically
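The extract-and-sort step can be sketched roughly as follows. This is only an illustration: it assumes the pretrained pipeline's output has already been collected into `(start, end, label)` tuples, and the helper name `to_segments` and the 3-decimal rounding are hypothetical choices, not taken from this MR.

```python
from typing import Iterable


def to_segments(turns: Iterable[tuple[float, float, str]]) -> list[dict]:
    """Convert raw speaker turns into chronologically sorted segment dicts.

    `turns` is assumed to hold (start_seconds, end_seconds, label) tuples
    gathered from the diarization pipeline's output.
    """
    segments = []
    for start, end, speaker in turns:
        if end <= start:
            # Skip empty or inverted intervals instead of emitting bad timestamps.
            continue
        segments.append(
            {"speaker": speaker, "start": round(start, 3), "end": round(end, 3)}
        )
    # Sort by start time, breaking ties by end time.
    segments.sort(key=lambda s: (s["start"], s["end"]))
    return segments


# Turns arrive out of order and include one zero-length interval.
raw_turns = [(5.2, 9.8, "SPEAKER_1"), (0.0, 5.2, "SPEAKER_0"), (9.8, 9.8, "SPEAKER_0")]
segments = to_segments(raw_turns)
```

Dropping zero-length turns is a design choice of this sketch; the actual module may handle them differently.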

Output Format

[
  {
    "speaker": "SPEAKER_0",
    "start": 0.0,
    "end": 5.2
  }
]
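A contract check for this format could look like the sketch below. The function name `validate_segments` is hypothetical, and the stricter rules it applies (exactly these three keys, non-negative times, `end >= start`) are assumptions layered on top of the format shown above.

```python
def validate_segments(segments: list) -> None:
    """Raise ValueError if `segments` deviates from the documented shape."""
    previous_start = 0.0
    for index, segment in enumerate(segments):
        if not isinstance(segment, dict) or set(segment) != {"speaker", "start", "end"}:
            raise ValueError(f"segment {index}: expected exactly speaker/start/end keys")
        if not isinstance(segment["speaker"], str):
            raise ValueError(f"segment {index}: speaker must be a string")
        start, end = segment["start"], segment["end"]
        for value in (start, end):
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError(f"segment {index}: timestamps must be numbers")
        if start < 0 or end < start:
            raise ValueError(f"segment {index}: need 0 <= start <= end")
        if start < previous_start:
            raise ValueError(f"segment {index}: segments are not in chronological order")
        previous_start = start


# A well-formed payload passes silently.
validate_segments([{"speaker": "SPEAKER_0", "start": 0.0, "end": 5.2}])
```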

Integration

Added an optional flag in the request:

{
  "enable_diarization": true
}

Pipeline behavior:
- If enabled → run diarization
- Else → standard transcription flow
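The branch can be pictured with a small sketch like the one below. Everything except the `enable_diarization` flag is hypothetical: `process_job`, the injected `transcribe` and `diarize` callables, and the `transcript` / `speaker_segments` result keys are placeholders for whatever the async job code actually uses.

```python
from typing import Callable


def process_job(
    audio_path: str,
    request: dict,
    transcribe: Callable[[str], str],
    diarize: Callable[[str], list],
) -> dict:
    """Run the job, adding speaker segments only when the flag is set."""
    segments = None
    # The flag is optional; a missing key falls back to the standard flow.
    if request.get("enable_diarization", False):
        segments = diarize(audio_path)

    result = {"transcript": transcribe(audio_path)}
    if segments is not None:
        result["speaker_segments"] = segments
    return result


# Stub callables stand in for the real ASR and diarization services.
def fake_transcribe(path: str) -> str:
    return "hello there"


def fake_diarize(path: str) -> list:
    return [{"speaker": "SPEAKER_0", "start": 0.0, "end": 5.2}]


with_speakers = process_job("meeting.wav", {"enable_diarization": True}, fake_transcribe, fake_diarize)
plain = process_job("meeting.wav", {}, fake_transcribe, fake_diarize)
```

Passing the two services in as callables keeps the sketch free of heavy dependencies and makes the branch easy to unit test.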

Architecture Update

Audio
  ↓
Diarization Module
  ↓
Speaker Segments
  ↓
(Used by ASR module for speaker-wise transcription)
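One common way to turn speaker segments into speaker-wise transcription is to label each timestamped ASR segment with the speaker whose interval overlaps it most. The sketch below illustrates that idea only; it assumes the ASR module emits segments with `start`, `end`, and `text` keys, which this MR does not state, and `assign_speakers` is a hypothetical name.

```python
def assign_speakers(asr_segments: list[dict], speaker_segments: list[dict]) -> list[dict]:
    """Label each ASR segment with the speaker that overlaps it the most."""
    labeled = []
    for asr in asr_segments:
        best_speaker, best_overlap = None, 0.0
        for spk in speaker_segments:
            # Length of the intersection of the two intervals (negative if disjoint).
            overlap = min(asr["end"], spk["end"]) - max(asr["start"], spk["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = spk["speaker"], overlap
        # The speaker stays None when no diarization segment overlaps.
        labeled.append({**asr, "speaker": best_speaker})
    return labeled


speaker_segments = [
    {"speaker": "SPEAKER_0", "start": 0.0, "end": 5.2},
    {"speaker": "SPEAKER_1", "start": 5.2, "end": 9.8},
]
asr_segments = [
    {"start": 0.5, "end": 4.0, "text": "Shall we start?"},
    {"start": 5.0, "end": 7.0, "text": "Yes, go ahead."},
]
labeled = assign_speakers(asr_segments, speaker_segments)
```

Here the second ASR segment straddles both speakers and is assigned to SPEAKER_1 because most of it falls inside that interval.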

Testing

✅ Verified diarization output format matches contract
✅ Tested with multi-speaker audio samples
✅ Ensured compatibility with async job system
✅ Confirmed chronological ordering of segments

Limitations

- Speaker labels are generic (SPEAKER_0, etc.), not real identities
- Performance may degrade for very long audio files
- Overlapping speech handling depends on model capability

Impact

- Enables speaker-aware transcription workflows
- Improves usability for:
  - meeting recordings
  - interviews
  - multi-speaker conversations
- Lays foundation for advanced features (speaker-wise transcripts, analytics)

Future Improvements

- Speaker name identification (if metadata available)
- Chunk-based diarization for long audio
- Improved overlap handling
- Performance optimizations
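For the chunk-based idea, a first step would be splitting a long recording into overlapping windows. The sketch below shows only that windowing; the function name `chunk_windows` and the default sizes are hypothetical, and reconciling speaker labels across chunks (the harder part) is not covered.

```python
def chunk_windows(
    duration: float, chunk: float = 300.0, overlap: float = 10.0
) -> list[tuple[float, float]]:
    """Split [0, duration] into windows of `chunk` seconds that overlap by `overlap`."""
    if chunk <= overlap:
        raise ValueError("chunk must be longer than overlap")
    windows = []
    start = 0.0
    step = chunk - overlap
    while start < duration:
        end = min(start + chunk, duration)
        windows.append((start, end))
        if end >= duration:
            # The last window already reaches the end of the recording.
            break
        start += step
    return windows


windows = chunk_windows(700.0)
```

Each window could then be diarized independently, with the overlap region used to match speakers between neighbouring chunks.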

Closes #6