Speaker Segmentation with Timestamped Labels
Summary
This MR implements the Speaker Diarization Module, responsible for detecting and segmenting different speakers in an audio file. The module outputs structured, timestamped speaker labels that will be used by the ASR module in the next stage of the pipeline.
Objective
To build a standalone, reusable component that:
- Accepts an audio file path
- Identifies distinct speakers
- Returns time-based speaker segments in a predefined format
Implementation
- Added diarize_audio(audio_path: str) -> list
- Utilized pyannote.audio for speaker detection
- Extracted speaker segments with:
  - speaker label (SPEAKER_0, SPEAKER_1, …)
  - start time (float, seconds)
  - end time (float, seconds)
- Ensured output is chronologically sorted
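The steps above could look roughly like the sketch below. It is not the code from this MR. It assumes pyannote.audio 3.x, the pretrained "pyannote/speaker-diarization-3.1" pipeline, and a Hugging Face token in an HF_TOKEN environment variable, none of which are stated here. The helper to_segments is a hypothetical name. It sorts the raw turns and renumbers labels to SPEAKER_0, SPEAKER_1, … in order of first appearance, since pyannote's own labels are typically zero-padded (SPEAKER_00).

```python
import os
from typing import Dict, Iterable, List, Tuple


def to_segments(turns: Iterable[Tuple[float, float, str]]) -> List[dict]:
    """Turn raw (start, end, label) tuples into sorted, relabelled segments.

    Hypothetical helper: the name and the relabelling rule are assumptions.
    """
    # Sort chronologically by start time, then by end time.
    ordered = sorted(turns, key=lambda t: (t[0], t[1]))
    label_map: Dict[str, str] = {}
    segments: List[dict] = []
    for start, end, raw_label in ordered:
        # Assign SPEAKER_0, SPEAKER_1, ... in order of first appearance.
        if raw_label not in label_map:
            label_map[raw_label] = f"SPEAKER_{len(label_map)}"
        segments.append(
            {
                "speaker": label_map[raw_label],
                "start": float(start),
                "end": float(end),
            }
        )
    return segments


def diarize_audio(audio_path: str) -> list:
    """Run diarization on an audio file and return timestamped segments."""
    # Imported lazily so to_segments can be used without pyannote installed.
    from pyannote.audio import Pipeline

    # Assumption: model name and token source are illustrative only.
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=os.environ.get("HF_TOKEN"),
    )
    diarization = pipeline(audio_path)
    turns = [
        (turn.start, turn.end, label)
        for turn, _, label in diarization.itertracks(yield_label=True)
    ]
    return to_segments(turns)
```

Keeping the formatting step separate from the model call means ordering and labelling can be checked with plain tuples, without loading a model.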
Output Format
[
  {
    "speaker": "SPEAKER_0",
    "start": 0.0,
    "end": 5.2
  }
]
Constraints Followed
- No transcription (ASR not included)
- No changes to shared schema
- Speaker labels strictly follow SPEAKER_X format
- Timestamps are accurate and in seconds
Testing
- Verified with multi-speaker audio samples
- Ensured correct segmentation and ordering
- Validated contract-compliant output
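A contract check along these lines is one way the output validation could be automated. This is a sketch, not the test code from this MR. The function name validate_segments is hypothetical, and the rules are assumptions drawn from the output format and constraints: exactly three keys, a SPEAKER_X label, float timestamps, end after start, and non-decreasing start times. Overlapping segments are allowed because speakers may talk over each other.

```python
import re

# Assumption: X in SPEAKER_X is a non-negative integer.
LABEL_RE = re.compile(r"^SPEAKER_\d+$")
EXPECTED_KEYS = {"speaker", "start", "end"}


def validate_segments(segments: list) -> list:
    """Return a list of problem descriptions; an empty list means compliant."""
    problems = []
    prev_start = None
    for i, seg in enumerate(segments):
        if set(seg) != EXPECTED_KEYS:
            problems.append(f"segment {i}: unexpected keys {sorted(seg)}")
            continue
        speaker, start, end = seg["speaker"], seg["start"], seg["end"]
        if not isinstance(speaker, str) or not LABEL_RE.match(speaker):
            problems.append(f"segment {i}: bad label {speaker!r}")
        if not (isinstance(start, float) and isinstance(end, float)):
            problems.append(f"segment {i}: timestamps must be floats")
            continue
        if start < 0 or end <= start:
            problems.append(f"segment {i}: invalid interval {start}-{end}")
        # Chronological order is checked on start times only.
        if prev_start is not None and start < prev_start:
            problems.append(f"segment {i}: out of chronological order")
        prev_start = start
    return problems
```

For example, validate_segments([{"speaker": "SPEAKER_0", "start": 0.0, "end": 5.2}]) returns an empty list.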
Outcome
This module provides clean speaker segmentation and serves as the foundation for building speaker-aware transcription in the overall pipeline.