Add Speaker Diarization API Support with Push-to-Talk Integration
Summary
This merge request introduces backend API support for speaker diarization, enabling processing of multi-speaker audio with structured, speaker-aware transcription outputs. The implementation extends the existing ASR pipeline while maintaining consistency with current APIs.
It also integrates push-to-talk (hold-to-talk) input, so audio is captured only when the user initiates it. This improves input reliability and reduces unintended recordings.
Key Changes
- Added dedicated API endpoint(s) for speaker diarization
- Extended ASR pipeline to support multi-speaker processing
- Implemented speaker-labeled and timestamped transcription output
- Defined consistent request/response schema aligned with existing APIs
- Integrated push-to-talk (hold-to-talk) input handling
- Added input validation and standardized error handling
- Updated API documentation for frontend integration
API Contract
Request
- Audio input (file or stream)
- Supports push-to-talk captured audio
Response
- Speaker-labeled transcription
- Timestamped segments per speaker
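For frontend reference, the response described above could look roughly like the following sketch. The field names (`speakers`, `segments`, `speaker`, `start`, `end`, `text`) and the `SPEAKER_00` label format are assumptions for illustration only; the actual schema is defined in the updated API documentation.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class Segment:
    speaker: str  # hypothetical speaker label, e.g. "SPEAKER_00"
    start: float  # segment start, in seconds from the beginning of the audio
    end: float    # segment end, in seconds
    text: str     # transcript text for this segment


def build_response(segments: list[Segment]) -> dict:
    """Assemble a speaker-aware response body (hypothetical field names)."""
    ordered = sorted(segments, key=lambda s: s.start)
    speakers = sorted({s.speaker for s in ordered})
    return {
        "speakers": speakers,
        "segments": [asdict(s) for s in ordered],
    }


response = build_response([
    Segment("SPEAKER_01", 2.4, 4.1, "Fine, thanks."),
    Segment("SPEAKER_00", 0.0, 2.3, "Hi, how are you?"),
])
print(json.dumps(response, indent=2))
```

Returning segments in time order with an explicit speaker label on each one lets the frontend render a conversation view without further processing.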
Implementation Details
- Diarization is processed as part of the transcription pipeline
- Output is structured to include speaker IDs with corresponding time segments
- Push-to-talk ensures audio is captured only during active user interaction
- Backward compatibility with existing transcription APIs is maintained
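This description does not state how speaker IDs are attached to transcript segments. One common approach, shown here purely as an assumption, is to give each ASR segment the speaker whose diarization turn overlaps it most in time. The function names and dictionary keys below are hypothetical.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(asr_segments: list[dict], speaker_turns: list[dict]) -> list[dict]:
    """Attach a speaker label to each ASR segment by maximum time overlap."""
    labeled = []
    for seg in asr_segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn in speaker_turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        # Segments that overlap no turn keep the "UNKNOWN" placeholder.
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


asr_segments = [
    {"start": 0.0, "end": 2.0, "text": "Hi, how are you?"},
    {"start": 2.2, "end": 4.0, "text": "Fine, thanks."},
]
speaker_turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.1},
    {"speaker": "SPEAKER_01", "start": 2.1, "end": 4.5},
]
labeled = label_segments(asr_segments, speaker_turns)
```

Because the speaker label is added as an extra key on each segment, clients that read only `start`, `end`, and `text` are unaffected, which is one way the backward compatibility noted above could be preserved.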
Testing & Validation
- Tested with multi-speaker audio samples
- Verified speaker segmentation accuracy and timestamp alignment
- Validated behavior for:
  - Single-speaker input
  - Low-quality/noisy audio
- Tested push-to-talk flow for correct audio capture and processing
- Confirmed stability under normal usage conditions
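The push-to-talk check can be illustrated with a minimal sketch of the behavior it verifies: audio chunks are kept only between press and release. `PushToTalkBuffer` and its methods are hypothetical names for illustration and are not taken from this merge request's code.

```python
class PushToTalkBuffer:
    """Collects audio chunks only while the talk control is held."""

    def __init__(self) -> None:
        self._active = False
        self._chunks: list[bytes] = []

    def press(self) -> None:
        # Start a fresh capture when the user begins holding the control.
        self._active = True
        self._chunks = []

    def feed(self, chunk: bytes) -> None:
        # Chunks that arrive while the control is not held are dropped.
        if self._active:
            self._chunks.append(chunk)

    def release(self) -> bytes:
        # Stop capturing and hand back everything recorded during the hold.
        self._active = False
        audio = b"".join(self._chunks)
        self._chunks = []
        return audio


ptt = PushToTalkBuffer()
ptt.feed(b"\x00\x01")      # ignored: control not held yet
ptt.press()
ptt.feed(b"\x02\x03")
ptt.feed(b"\x04")
captured = ptt.release()
ptt.feed(b"\x05")          # ignored: control already released
```

In this sketch, `captured` contains only the bytes fed between `press()` and `release()`, which is the property the push-to-talk test is meant to confirm.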
Impact
- Enables multi-speaker transcription workflows
- Improves usability for conversational scenarios (meetings, interviews, discussions)
- Enhances input control through push-to-talk
- Provides structured output for improved frontend rendering and UX
Checklist
- API endpoints implemented
- Diarization integrated with ASR pipeline
- Push-to-talk functionality integrated
- Validation and error handling implemented
- Documentation updated
- Tested across relevant scenarios
Closes: #20 (closed)
Edited by srilatha bandari
