
Feat: Speaker Diarization Support

Vemuri priya requested to merge speaker-diarization into develop

Summary

This MR adds speaker diarization to the voice app backend so that it can detect and separate multiple speakers within a single audio recording. Instead of treating the whole recording as one source, the backend analyzes voice characteristics and splits the audio into speaker-specific intervals. Each segment carries a speaker label (e.g., SPEAKER_0, SPEAKER_1) together with start and end timestamps.

With diarization integrated into the existing speech-to-text pipeline, the backend can produce speaker-aware structured output rather than a plain transcript. Downstream processing can attribute spoken content to the correct speaker, which makes the results more useful for meetings, interviews, and multi-party conversations. The change stays compatible with the current async job-based architecture.


Problem

Currently, the backend processes audio and returns transcriptions without distinguishing between speakers. This limits usability in real-world scenarios such as meetings, interviews, and multi-speaker conversations.


Solution

Implemented a speaker diarization module using pyannote.audio, which:

  • Detects distinct speakers in audio recordings
  • Assigns speaker labels (SPEAKER_0, SPEAKER_1, etc.)
  • Returns structured speaker segments with timestamps
  • Integrates cleanly with the existing async job pipeline

Implementation Details

New Module

  • services/diarization.py

    • Implements:

      def diarize_audio(audio_path: str) -> list

Core Functionality

  • Uses a pretrained pyannote.audio diarization pipeline

  • Extracts:

    • speaker labels
    • start time
    • end time
  • Ensures:

    • timestamps are accurate
    • segments are sorted chronologically
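
A minimal sketch of how diarize_audio could be structured, not the code in this MR. It assumes the pyannote.audio 3.x API (Pipeline.from_pretrained and Annotation.itertracks); the checkpoint name and the normalize_segments helper are assumptions for illustration. The relabelling step is also an assumption: it numbers speakers in order of first appearance so that labels take the SPEAKER_0, SPEAKER_1 form used in this description.

```python
from typing import Iterable


def normalize_segments(turns: Iterable[tuple[float, float, str]]) -> list[dict]:
    """Sort raw (start, end, label) turns and relabel speakers as SPEAKER_<n>.

    Hypothetical helper: speakers are numbered in order of first appearance.
    """
    ordered = sorted(turns, key=lambda t: (t[0], t[1]))
    label_map: dict[str, str] = {}
    segments = []
    for start, end, raw_label in ordered:
        # setdefault keeps the first label assigned to each raw speaker id.
        label = label_map.setdefault(raw_label, f"SPEAKER_{len(label_map)}")
        segments.append(
            {"speaker": label, "start": round(start, 3), "end": round(end, 3)}
        )
    return segments


def diarize_audio(audio_path: str) -> list:
    """Run a pretrained pipeline and return chronologically sorted segments."""
    # Imported lazily so normalize_segments works without pyannote installed.
    from pyannote.audio import Pipeline

    # Assumed checkpoint; it is gated on Hugging Face and needs prior login.
    # A real service would load the pipeline once, not on every call.
    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
    annotation = pipeline(audio_path)
    turns = [
        (turn.start, turn.end, speaker)
        for turn, _, speaker in annotation.itertracks(yield_label=True)
    ]
    return normalize_segments(turns)
```

Keeping the sorting and relabelling in a separate pure function makes the ordering guarantee testable without loading a model.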

Output Format

[
  {
    "speaker": "SPEAKER_0",
    "start": 0.0,
    "end": 5.2
  }
]
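
The contract above can be checked mechanically. The following is a hedged sketch of such a check; validate_segments is a hypothetical helper, and the specific rules (exactly three keys, start before end, non-decreasing start times) are inferred from the format and the ordering guarantee described in this MR rather than taken from its code.

```python
def validate_segments(segments: list) -> None:
    """Raise ValueError if segments do not match the assumed output contract."""
    expected_keys = {"speaker", "start", "end"}
    previous_start = float("-inf")
    for index, seg in enumerate(segments):
        if set(seg) != expected_keys:
            raise ValueError(f"segment {index}: unexpected keys {sorted(seg)}")
        if not isinstance(seg["speaker"], str):
            raise ValueError(f"segment {index}: speaker must be a string")
        if not 0 <= seg["start"] < seg["end"]:
            raise ValueError(f"segment {index}: need 0 <= start < end")
        # Chronological order: each segment starts no earlier than the last.
        if seg["start"] < previous_start:
            raise ValueError(f"segment {index}: not in chronological order")
        previous_start = seg["start"]
```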

Integration

  • Added an optional flag to the request:
    {
      "enable_diarization": true
    }
  • Pipeline behavior:

    • If enabled → run diarization
    • Else → standard transcription flow
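
The branching described above might look roughly like the sketch below. The run_job function, the injected transcribe and diarize callables, and the "transcript" and "speaker_segments" result keys are all hypothetical; the MR does not show the job handler or the response shape.

```python
from typing import Callable


def run_job(
    audio_path: str,
    request: dict,
    transcribe: Callable[[str], str],
    diarize: Callable[[str], list],
) -> dict:
    """Run transcription, adding speaker segments only when requested."""
    result = {"transcript": transcribe(audio_path)}
    # The flag is optional, so a missing key falls back to the standard flow.
    if request.get("enable_diarization", False):
        result["speaker_segments"] = diarize(audio_path)
    return result
```

Passing the two steps in as callables keeps the handler independent of the model, so existing clients that omit the flag see no change in behavior.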

Architecture Update

Audio
  ↓
Diarization Module
  ↓
Speaker Segments
  ↓
(Used by ASR module for speaker-wise transcription)
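
How the ASR module consumes the segments is not shown in this MR. One common approach, sketched here under that caveat, is to give each timestamped word the speaker whose segment overlaps it the most. The assign_speakers function and the word dictionary shape ("text", "start", "end") are assumptions for illustration.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals, or 0.0."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: list[dict], segments: list[dict]) -> list[dict]:
    """Label each word with the speaker whose segment overlaps it most."""
    labelled = []
    for word in words:
        best_speaker = None
        best_overlap = 0.0
        for seg in segments:
            shared = overlap(word["start"], word["end"], seg["start"], seg["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = seg["speaker"], shared
        # Words that fall outside every segment keep speaker=None.
        labelled.append({**word, "speaker": best_speaker})
    return labelled
```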

Testing

  • Verified diarization output format matches contract
  • Tested with multi-speaker audio samples
  • Ensured compatibility with async job system
  • Confirmed chronological ordering of segments

Limitations

  • Speaker labels are generic (SPEAKER_0, etc.), not real identities
  • Performance may degrade for very long audio files
  • Overlapping speech handling depends on model capability

Impact

  • Enables speaker-aware transcription workflows

  • Improves usability for:

    • meeting recordings
    • interviews
    • multi-speaker conversations
  • Lays foundation for advanced features (speaker-wise transcripts, analytics)


Future Improvements

  • Speaker name identification (if metadata available)
  • Chunk-based diarization for long audio
  • Improved overlap handling
  • Performance optimizations
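
As a starting point for the chunk-based idea, the sketch below only computes overlapping time windows for a long recording. The chunk_windows function and its default sizes are hypothetical. It does not address the harder part of the problem: speaker labels are assigned independently per chunk, so they would still need to be reconciled across chunk boundaries.

```python
def chunk_windows(
    duration: float, chunk: float = 300.0, overlap: float = 5.0
) -> list[tuple[float, float]]:
    """Split [0, duration] into windows of at most `chunk` seconds.

    Consecutive windows share `overlap` seconds so that a speaker turn
    crossing a boundary appears in both neighbouring chunks.
    """
    if overlap < 0 or chunk <= overlap:
        raise ValueError("chunk must be larger than a non-negative overlap")
    windows = []
    start = 0.0
    while start < duration:
        end = min(start + chunk, duration)
        windows.append((start, end))
        if end >= duration:
            break
        # Step back by the overlap before starting the next window.
        start = end - overlap
    return windows
```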

Closes #6 (closed)
