Speaker Segmentation with Timestamped Labels

Summary

This MR implements the Speaker Diarization Module, which detects the distinct speakers in an audio file and segments the audio by speaker. The module outputs structured, timestamped speaker labels that the ASR module will consume in the next stage of the pipeline.


Objective

To build a standalone, reusable component that:

  • Accepts an audio file path
  • Identifies distinct speakers
  • Returns time-based speaker segments in a predefined format

Implementation

  • Added diarize_audio(audio_path: str) -> list

  • Used pyannote.audio for speaker detection

  • Extracted speaker segments with:

    • speaker label (SPEAKER_0, SPEAKER_1, …)
    • start time (float, seconds)
    • end time (float, seconds)
  • Ensured output is chronologically sorted
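The post-processing described in the list above could look roughly like the sketch below. It is not the code from this MR: the helper name build_segments is hypothetical, and numbering speakers in order of first appearance is an assumption. The input is a plain list of (start, end, raw_label) tuples so the example runs without a model; with pyannote.audio 3.x such tuples can typically be collected from the diarization result via itertracks(yield_label=True).

```python
from typing import Dict, Iterable, List, Tuple, Union

Segment = Dict[str, Union[str, float]]


def build_segments(turns: Iterable[Tuple[float, float, str]]) -> List[Segment]:
    """Convert (start, end, raw_label) turns into sorted SPEAKER_X segments.

    Hypothetical helper: the name and the label-numbering rule are
    assumptions made for illustration, not part of this MR.
    """
    # Sort by start time (then end time) so the output is chronological.
    ordered = sorted(turns, key=lambda t: (t[0], t[1]))

    label_map: Dict[str, str] = {}
    segments: List[Segment] = []
    for start, end, raw_label in ordered:
        # Assumption: speakers are numbered in order of first appearance.
        if raw_label not in label_map:
            label_map[raw_label] = f"SPEAKER_{len(label_map)}"
        segments.append(
            {
                "speaker": label_map[raw_label],
                "start": float(start),
                "end": float(end),
            }
        )
    return segments


if __name__ == "__main__":
    # Unsorted turns with arbitrary raw labels.
    raw_turns = [(5.2, 9.8, "B"), (0.0, 5.2, "A"), (9.8, 12.0, "A")]
    for segment in build_segments(raw_turns):
        print(segment)
```

Keeping this step separate from the model call means the sorting and labelling rules can be unit-tested without loading a diarization pipeline.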


Output Format

[
  {
    "speaker": "SPEAKER_0",
    "start": 0.0,
    "end": 5.2
  }
]

Constraints Followed

  • No transcription (ASR not included)
  • No changes to shared schema
  • Speaker labels strictly follow SPEAKER_X format
  • Timestamps are accurate and in seconds

Testing

  • Verified with multi-speaker audio samples
  • Ensured correct segmentation and ordering
  • Validated contract-compliant output
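A contract check along the lines of the last point might be sketched as follows. The function name validate_segments and the exact rules (only the three keys, SPEAKER_ followed by digits, float timestamps, end after start, non-decreasing start times) are assumptions inferred from the Output Format and Constraints sections, not the tests shipped in this MR.

```python
import re
from typing import Any, Dict, List

# Assumption: "SPEAKER_X" means SPEAKER_ followed by one or more digits.
SPEAKER_RE = re.compile(r"SPEAKER_\d+")
EXPECTED_KEYS = {"speaker", "start", "end"}


def validate_segments(segments: List[Dict[str, Any]]) -> List[str]:
    """Return a list of contract violations; an empty list means compliant."""
    errors: List[str] = []
    prev_start = None
    for i, seg in enumerate(segments):
        if set(seg) != EXPECTED_KEYS:
            errors.append(f"segment {i}: unexpected keys {sorted(seg)}")
            continue
        speaker, start, end = seg["speaker"], seg["start"], seg["end"]
        if not isinstance(speaker, str) or not SPEAKER_RE.fullmatch(speaker):
            errors.append(f"segment {i}: bad speaker label {speaker!r}")
        if not (isinstance(start, float) and isinstance(end, float)):
            errors.append(f"segment {i}: start and end must be floats")
            continue
        if start < 0 or end <= start:
            errors.append(f"segment {i}: invalid time range {start}-{end}")
        # Chronological order: start times must not go backwards.
        if prev_start is not None and start < prev_start:
            errors.append(f"segment {i}: out of chronological order")
        prev_start = start
    return errors


if __name__ == "__main__":
    sample = [{"speaker": "SPEAKER_0", "start": 0.0, "end": 5.2}]
    print(validate_segments(sample))
```

Returning a list of messages instead of raising on the first failure makes it easier to report every violation for a given audio sample in one test run.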

Outcome

This module provides clean speaker segmentation and serves as the foundation for building speaker-aware transcription in the overall pipeline.
