feat: Pyannote model
Summary
This MR introduces speaker diarization support to the voice app backend, enabling the system to automatically detect and differentiate multiple speakers within a single audio recording. Instead of treating the entire audio as coming from one source, the system now analyzes voice characteristics to segment the audio into distinct speaker-specific intervals. Each segment is assigned a unique speaker label (e.g., SPEAKER_0, SPEAKER_1) along with precise start and end timestamps.
Integrated into the existing speech-to-text pipeline, this lets the backend return speaker-aware structured output instead of a plain transcript, so downstream processing can attribute spoken content to the correct speaker in meetings, interviews, and other multi-party recordings. The change stays compatible with the current async job-based architecture.
Problem
Currently, the backend processes audio and returns transcriptions without distinguishing between speakers. This limits usability in real-world scenarios such as meetings, interviews, and multi-speaker conversations.
Solution
Implemented a speaker diarization module using pyannote.audio, which:
- Detects distinct speakers in audio recordings
- Assigns speaker labels (SPEAKER_0, SPEAKER_1, etc.)
- Returns structured speaker segments with timestamps
- Integrates cleanly with the existing async job pipeline
Implementation Details
New Module
- services/diarization.py
- Implements:
def diarize_audio(audio_path: str) -> list
Core Functionality
- Uses pretrained diarization pipeline
- Extracts:
  - speaker labels
  - start time
  - end time
- Ensures:
  - timestamps are accurate
  - segments are sorted chronologically
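The extraction and ordering steps above could look roughly like the sketch below. It assumes the raw diarizer output has already been flattened into `(start, end, label)` tuples; with pyannote.audio these would typically come from iterating the pipeline result with `itertracks(yield_label=True)`. The helper name `to_segments`, the rounding precision, and the sample data are illustrative and not taken from this MR.

```python
from typing import Iterable


def to_segments(tracks: Iterable[tuple[float, float, str]]) -> list[dict]:
    """Convert raw (start, end, label) tuples into sorted segment dicts.

    Hypothetical helper: the real diarize_audio() may build its output differently.
    """
    segments = []
    for start, end, label in tracks:
        if end <= start:
            # Skip empty or inverted intervals rather than emit bad timestamps.
            continue
        segments.append(
            {"speaker": label, "start": round(start, 3), "end": round(end, 3)}
        )
    # Chronological order: by start time, then by end time for ties.
    segments.sort(key=lambda s: (s["start"], s["end"]))
    return segments


# Made-up input: out of order, with one zero-length interval that gets dropped.
raw = [(5.2, 9.8, "SPEAKER_1"), (0.0, 5.2, "SPEAKER_0"), (9.8, 9.8, "SPEAKER_0")]
segments = to_segments(raw)
```

Keeping the conversion separate from the model call makes the ordering and timestamp checks testable without loading the pretrained pipeline.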
Output Format
[
{
"speaker": "SPEAKER_0",
"start": 0.0,
"end": 5.2
}
]
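A contract check for this format might be sketched as follows. The function name `validate_segments` is hypothetical, and the rules it enforces (non-negative timestamps, `end` strictly after `start`, segments ordered by `start`) are assumptions inferred from the description above rather than a confirmed specification.

```python
def validate_segments(segments: list) -> None:
    """Raise ValueError if any segment breaks the assumed output contract."""
    previous_start = float("-inf")
    for i, seg in enumerate(segments):
        if set(seg) != {"speaker", "start", "end"}:
            raise ValueError(f"segment {i}: unexpected keys {sorted(seg)}")
        if not isinstance(seg["speaker"], str):
            raise ValueError(f"segment {i}: speaker must be a string")
        start, end = seg["start"], seg["end"]
        if not all(isinstance(t, (int, float)) for t in (start, end)):
            raise ValueError(f"segment {i}: timestamps must be numbers")
        if start < 0 or end <= start:
            raise ValueError(f"segment {i}: invalid interval {start}..{end}")
        if start < previous_start:
            raise ValueError(f"segment {i}: not in chronological order")
        previous_start = start


# Passes silently for the example shown above.
validate_segments([{"speaker": "SPEAKER_0", "start": 0.0, "end": 5.2}])
```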
Integration
- Added optional flag in request:
{
"enable_diarization": true
}
- Pipeline behavior:
  - If enabled → run diarization
  - Else → standard transcription flow
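The branching on the flag can be illustrated with a small sketch. Everything except the `enable_diarization` key is hypothetical here: the `run_job` handler, its injected `transcribe` and `diarize` callables, and the `transcript` and `speaker_segments` response keys are placeholders, not the backend's actual interface.

```python
from typing import Callable


def run_job(
    request: dict,
    audio_path: str,
    transcribe: Callable[[str], str],
    diarize: Callable[[str], list],
) -> dict:
    """Hypothetical job handler: diarization runs only when the flag is set."""
    segments = None
    # A missing flag is treated as False so existing clients keep the standard flow.
    if request.get("enable_diarization", False):
        segments = diarize(audio_path)
    result = {"transcript": transcribe(audio_path)}
    if segments is not None:
        result["speaker_segments"] = segments
    return result


# Stubs standing in for the real ASR and diarization services.
def fake_transcribe(path: str) -> str:
    return "hello world"


def fake_diarize(path: str) -> list:
    return [{"speaker": "SPEAKER_0", "start": 0.0, "end": 5.2}]


with_flag = run_job({"enable_diarization": True}, "a.wav", fake_transcribe, fake_diarize)
without_flag = run_job({}, "a.wav", fake_transcribe, fake_diarize)
```

Defaulting the flag to off keeps the response shape unchanged for callers that never send it.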
Architecture Update
Audio
↓
Diarization Module
↓
Speaker Segments
↓
(Used by ASR module for speaker-wise transcription)
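One way the ASR side could consume the speaker segments is to label each transcribed chunk with the speaker whose segment overlaps it the most. This is only a sketch of that idea: the chunk shape (`text`, `start`, `end`), the `assign_speakers` name, and the sample data are assumptions, and the ASR module in this backend may combine the two outputs differently.

```python
def assign_speakers(asr_chunks: list, speaker_segments: list) -> list:
    """Label each ASR chunk with the speaker whose segment overlaps it most."""
    labeled = []
    for chunk in asr_chunks:
        best_speaker, best_overlap = None, 0.0
        for seg in speaker_segments:
            # Length of the intersection of the two intervals (negative if disjoint).
            overlap = min(chunk["end"], seg["end"]) - max(chunk["start"], seg["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = seg["speaker"], overlap
        # Chunks that overlap no segment keep speaker=None.
        labeled.append({**chunk, "speaker": best_speaker})
    return labeled


# Made-up sample data for illustration only.
chunks = [
    {"text": "Hi, thanks for joining.", "start": 0.4, "end": 4.9},
    {"text": "Happy to be here.", "start": 5.0, "end": 7.1},
]
speakers = [
    {"speaker": "SPEAKER_0", "start": 0.0, "end": 5.2},
    {"speaker": "SPEAKER_1", "start": 5.2, "end": 9.8},
]
labeled = assign_speakers(chunks, speakers)
```

The nested loop is quadratic in the worst case; since both lists are sorted by time, a two-pointer sweep would be a natural optimization for long recordings.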
Testing
- ✅ Verified diarization output format matches contract
- ✅ Tested with multi-speaker audio samples
- ✅ Ensured compatibility with async job system
- ✅ Confirmed chronological ordering of segments
Limitations
- Speaker labels are generic (SPEAKER_0, etc.), not real identities
- Performance may degrade for very long audio files
- Overlapping speech handling depends on model capability
Impact
- Enables speaker-aware transcription workflows
- Improves usability for:
  - meeting recordings
  - interviews
  - multi-speaker conversations
- Lays foundation for advanced features (speaker-wise transcripts, analytics)
Future Improvements
- Speaker name identification (if metadata available)
- Chunk-based diarization for long audio
- Improved overlap handling
- Performance optimizations
Closes #6