Feature Request: Add Speaker Diarization Support
Is your feature request related to a problem? Please describe.
Currently, the voice app backend processes audio but does not support speaker diarization: identifying and separating the different speakers in an audio stream.
Describe the solution you'd like
Add speaker diarization functionality that can:
- Identify distinct speakers in audio recordings
- Label timestamps with speaker IDs (e.g., Speaker 0, Speaker 1)
- Return structured output with speaker segments and their corresponding time ranges
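To make the requested output concrete, here is a minimal sketch of what the structured result could look like. Everything in it is an assumption for illustration: `SpeakerSegment`, `build_diarization_result`, and the field names `num_speakers` and `segments` are hypothetical and not part of the existing backend. It also assumes the diarization model itself already produces raw (speaker, start, end) turns; the sketch only covers shaping them into a response.

```python
from dataclasses import dataclass, asdict


@dataclass
class SpeakerSegment:
    # Hypothetical shape for one diarized turn (not an existing backend type).
    speaker: int   # zero-based speaker ID, e.g. 0 for "Speaker 0"
    start: float   # segment start, in seconds from the beginning of the audio
    end: float     # segment end, in seconds


def build_diarization_result(segments):
    """Sort turns by start time, merge touching or overlapping turns by the
    same speaker, and return a JSON-serializable dict."""
    merged = []
    for seg in sorted(segments, key=lambda s: s.start):
        last = merged[-1] if merged else None
        if last and last.speaker == seg.speaker and seg.start <= last.end:
            # Same speaker continues: extend the previous segment.
            last.end = max(last.end, seg.end)
        else:
            # Copy so the caller's objects are not mutated.
            merged.append(SpeakerSegment(seg.speaker, seg.start, seg.end))
    return {
        "num_speakers": len({s.speaker for s in merged}),
        "segments": [asdict(s) for s in merged],
    }
```

For example, passing two back-to-back turns from Speaker 0 followed by one from Speaker 1 would yield two segments, with the Speaker 0 turns merged into a single time range. Whether adjacent turns should be merged at all is a design choice that is open for discussion.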
Describe alternatives you've considered
I considered manual speaker annotation and external services, but integrated support would be more efficient and user-friendly.
Additional context
Speaker diarization would be valuable for:
- Meeting transcription and analysis
- Interview processing
- Multi-speaker conversations
- Creating speaker-labeled transcripts
This would complement the existing speech-to-text capabilities and make the backend more comprehensive for real-world audio processing use cases.