Feature Request: Add Speaker Diarization Support

Feature Request

Is your feature request related to a problem? Please describe.

The voice app backend currently processes audio but does not support speaker diarization, i.e., identifying and separating the different speakers in an audio stream.

Describe the solution you'd like

Add speaker diarization functionality that can:

Identify distinct speakers in audio recordings
Label each time segment with a speaker ID (e.g., Speaker 0, Speaker 1)
Return structured output with speaker segments and their corresponding time ranges
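
To make the requested output concrete, here is a minimal sketch of what the structured result could look like. The names `SpeakerSegment` and `to_response` and the field layout are assumptions for illustration only, not an existing backend API; times are assumed to be in seconds.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SpeakerSegment:
    """One contiguous stretch of audio attributed to a single speaker (hypothetical shape)."""
    speaker: str   # label such as "Speaker 0"
    start: float   # segment start, in seconds
    end: float     # segment end, in seconds


def to_response(segments):
    """Build a JSON-ready dict with segments ordered by start time."""
    ordered = sorted(segments, key=lambda s: (s.start, s.end))
    # Keep speaker labels in order of first appearance.
    speakers = list(dict.fromkeys(s.speaker for s in ordered))
    return {
        "num_speakers": len(speakers),
        "speakers": speakers,
        "segments": [asdict(s) for s in ordered],
    }


# Made-up example data, not real diarization output.
example = [
    SpeakerSegment("Speaker 1", 4.2, 9.8),
    SpeakerSegment("Speaker 0", 0.0, 4.2),
    SpeakerSegment("Speaker 0", 9.8, 12.5),
]
response = to_response(example)
print(json.dumps(response, indent=2))
```

A flat list of segments keeps the payload easy to consume and leaves room for optional fields (for example a confidence score) later.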

Describe alternatives you've considered

I considered manual speaker annotation and external diarization services, but integrated support would be more efficient and user-friendly.

Additional context

Speaker diarization would be valuable for:

Meeting transcription and analysis
Interview processing
Multi-speaker conversation scenarios
Creating speaker-labeled transcripts

This would complement the existing speech-to-text capabilities and make the backend more comprehensive for real-world audio processing use cases.
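
As a rough illustration of how diarization could combine with the existing speech-to-text output, the sketch below assigns each transcribed word to the speaker segment it overlaps most and groups consecutive words by speaker. It assumes the speech-to-text step returns word-level timestamps; `Word`, `Segment`, and `label_transcript` are hypothetical names, and maximum-overlap assignment is just one simple strategy.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time ranges (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_transcript(words, segments):
    """Return transcript lines of the form 'Speaker N: text'."""
    lines = []  # list of (speaker, [word texts]) pairs
    for w in words:
        best = max(
            segments,
            key=lambda s: overlap(w.start, w.end, s.start, s.end),
            default=None,
        )
        # Words that overlap no segment get a placeholder label.
        if best is None or overlap(w.start, w.end, best.start, best.end) == 0.0:
            speaker = "Unknown"
        else:
            speaker = best.speaker
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(w.text)
        else:
            lines.append((speaker, [w.text]))
    return [f"{speaker}: {' '.join(texts)}" for speaker, texts in lines]


# Made-up example data.
words = [
    Word("Hello", 0.0, 0.4),
    Word("there", 0.5, 0.9),
    Word("Hi", 1.2, 1.5),
    Word("thanks", 1.6, 2.0),
]
segments = [Segment("Speaker 0", 0.0, 1.0), Segment("Speaker 1", 1.0, 2.5)]
transcript = label_transcript(words, segments)
print("\n".join(transcript))
```

Overlapping speech and words that straddle a speaker change would need a more careful policy than this.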