Speaker Segmentation with Timestamped Label (!30)

Summary

This MR implements the Speaker Diarization Module, responsible for detecting and segmenting different speakers in an audio file. The module outputs structured, timestamped speaker labels that will be used by the ASR module in the next stage of the pipeline.

Objective

To build a standalone, reusable component that:

Accepts an audio file path
Identifies distinct speakers
Returns time-based speaker segments in a predefined format

Implementation

Added diarize_audio(audio_path: str) -> list, which returns a list of segment dictionaries
Used pyannote.audio for speaker diarization
Extracted speaker segments with:
- speaker label (SPEAKER_0, SPEAKER_1, …)
- start time (float, seconds)
- end time (float, seconds)
Ensured output is chronologically sorted
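
The steps above could be sketched roughly as follows. This is not the code from the MR: the helper name to_segments, the checkpoint name, and the rule that speakers are numbered in order of first appearance are all assumptions made for illustration. It also assumes a pyannote.audio 3.x release, where calling the pipeline returns an annotation that can be iterated with itertracks(yield_label=True).

```python
from typing import Iterable


def to_segments(turns: Iterable[tuple[float, float, str]]) -> list[dict]:
    """Convert (start, end, raw_label) turns into sorted SPEAKER_X segments."""
    # Sort chronologically by start time, using end time to break ties.
    ordered = sorted(turns, key=lambda t: (t[0], t[1]))
    label_map: dict[str, str] = {}
    segments = []
    for start, end, raw_label in ordered:
        # Assumption: speakers are numbered in order of first appearance.
        if raw_label not in label_map:
            label_map[raw_label] = f"SPEAKER_{len(label_map)}"
        segments.append(
            {
                "speaker": label_map[raw_label],
                "start": float(start),
                "end": float(end),
            }
        )
    return segments


def diarize_audio(audio_path: str) -> list:
    """Run diarization on audio_path and return timestamped speaker segments."""
    # Imported lazily so to_segments can be used and tested without pyannote.
    from pyannote.audio import Pipeline

    # Assumption: the checkpoint name is illustrative. Gated checkpoints also
    # need a Hugging Face access token, which is omitted here.
    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")

    # Assumption: pyannote.audio 3.x, where the call returns an annotation.
    annotation = pipeline(audio_path)
    turns = (
        (turn.start, turn.end, label)
        for turn, _, label in annotation.itertracks(yield_label=True)
    )
    return to_segments(turns)
```

Keeping the conversion in a separate helper means the sorting and labelling logic can be exercised with plain tuples, without loading a model or an audio file.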

Output Format

[
  {
    "speaker": "SPEAKER_0",
    "start": 0.0,
    "end": 5.2
  }
]
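
For consumers of this format, one way to describe a segment in Python is a TypedDict. The class name SpeakerSegment and the second sample segment are hypothetical and only serve as illustration; the shared schema itself is not reproduced in this MR.

```python
import json
from typing import TypedDict


class SpeakerSegment(TypedDict):
    speaker: str  # label such as "SPEAKER_0"
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio


# Illustrative data only; the second segment is made up for this example.
segments: list[SpeakerSegment] = [
    {"speaker": "SPEAKER_0", "start": 0.0, "end": 5.2},
    {"speaker": "SPEAKER_1", "start": 5.2, "end": 9.8},
]

# The structure is plain JSON, so it round-trips without custom encoders.
payload = json.dumps(segments)
restored = json.loads(payload)
```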

Constraints Followed

No transcription (ASR not included)
No changes to shared schema
Speaker labels strictly follow SPEAKER_X format
Timestamps are accurate and in seconds
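
Constraints like these can be checked mechanically. The sketch below is a hypothetical validator, not part of the MR. It additionally assumes that timestamps are non-negative, that each segment ends after it starts, and that ordering is by start time, none of which the description states explicitly.

```python
import re

# Labels must look like SPEAKER_0, SPEAKER_1, ...
SPEAKER_LABEL = re.compile(r"SPEAKER_\d+")


def validate_segments(segments: list) -> list:
    """Return a list of contract violations; an empty list means valid."""
    errors = []
    previous_start = None
    for i, seg in enumerate(segments):
        # Each segment must carry exactly the three expected keys.
        if set(seg) != {"speaker", "start", "end"}:
            errors.append(f"segment {i}: unexpected keys {sorted(seg)}")
            continue

        speaker = seg["speaker"]
        if not isinstance(speaker, str) or not SPEAKER_LABEL.fullmatch(speaker):
            errors.append(f"segment {i}: bad speaker label {speaker!r}")

        start, end = seg["start"], seg["end"]
        if not isinstance(start, float) or not isinstance(end, float):
            errors.append(f"segment {i}: timestamps must be floats")
            continue

        # Assumption: times are non-negative and segments have positive length.
        if start < 0 or end <= start:
            errors.append(f"segment {i}: invalid time range {start}-{end}")

        # Assumption: chronological order means non-decreasing start times.
        if previous_start is not None and start < previous_start:
            errors.append(f"segment {i}: out of chronological order")
        previous_start = start
    return errors
```

Returning a list of messages instead of raising on the first failure makes it easier to report every violation in a single test run.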

Testing

Verified with multi-speaker audio samples
Ensured correct segmentation and ordering
Validated contract-compliant output

Outcome

This module provides clean speaker segmentation and serves as the foundation for building speaker-aware transcription in the overall pipeline.
