Implement Backend API Calls for Speaker Diarization Feature
Description
To support the speaker diarization feature, we need to introduce and standardize backend API calls that handle multi-speaker audio processing efficiently.
Currently, the system processes audio primarily for transcription, but it lacks structured API endpoints for identifying and segmenting different speakers within the same audio stream.
This issue focuses on designing and integrating backend API calls that:
- Accept audio input for diarization processing
- Return speaker-segmented transcription results
- Maintain consistency with existing ASR API structure
- Ensure scalability and performance for real-time or near real-time use
Objectives
- Create dedicated API endpoints for speaker diarization
- Ensure compatibility with existing transcription pipeline
- Optimize request/response format for frontend consumption
- Handle edge cases such as:
  - Overlapping speech
  - Low-quality audio
  - Single-speaker fallback
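As a starting point for discussion, two of the edge cases could be handled in a post-processing step roughly like the sketch below. The segment shape (`speaker`, `start`, `end`), the `overlap` flag, and both function names are hypothetical and not part of any existing code. The fallback shown is only one possible reading of "single-speaker fallback": if the diarizer returns no segments, the whole clip is attributed to one speaker.

```python
from typing import Dict, List

# Hypothetical segment shape: {"speaker": str, "start": float, "end": float}
Segment = Dict[str, object]


def mark_overlaps(segments: List[Segment]) -> List[Segment]:
    """Return copies of the segments, sorted by start time, with an
    'overlap' flag set when a segment intersects another speaker's segment."""
    ordered = sorted(segments, key=lambda s: s["start"])
    result = [dict(s, overlap=False) for s in ordered]
    for i, a in enumerate(result):
        for b in result[i + 1:]:
            # Sorted by start, so no later segment can intersect `a` either.
            if b["start"] >= a["end"]:
                break
            if b["speaker"] != a["speaker"]:
                a["overlap"] = True
                b["overlap"] = True
    return result


def apply_single_speaker_fallback(segments: List[Segment], duration: float) -> List[Segment]:
    """If the diarizer produced no segments, attribute the whole clip
    to a single default speaker (assumed label: 'Speaker 1')."""
    if segments:
        return segments
    return [{"speaker": "Speaker 1", "start": 0.0, "end": duration}]
```

Flagging overlaps instead of trimming them leaves the decision about how to render simultaneous speech to the frontend.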
Expected API Behavior
- Input: Audio file/stream
- Output:
  - Transcription text
  - Speaker labels (e.g., Speaker 1, Speaker 2)
  - Timestamped segments
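One possible shape for the output described above is sketched here as a proposal for the API contract, not a final schema. All class, field, and function names (`DiarizedSegment`, `DiarizationResponse`, `build_response`, `to_json`) are assumptions, as is the choice of seconds for timestamps.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class DiarizedSegment:
    speaker: str   # e.g. "Speaker 1"
    start: float   # assumed unit: seconds from the start of the audio
    end: float
    text: str      # transcription for this segment only


@dataclass
class DiarizationResponse:
    text: str                        # full transcription text
    segments: List[DiarizedSegment]  # timestamped, speaker-labeled segments


def build_response(segments: List[DiarizedSegment]) -> DiarizationResponse:
    """Order segments by start time and join their text into one transcript."""
    ordered = sorted(segments, key=lambda s: s.start)
    full_text = " ".join(s.text for s in ordered)
    return DiarizationResponse(text=full_text, segments=ordered)


def to_json(response: DiarizationResponse) -> str:
    """Serialize the response for the frontend."""
    return json.dumps(asdict(response))
```

Keeping a top-level `text` field alongside `segments` would let clients that only need plain transcription ignore the diarization detail.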
Tasks
- Design API contract (request & response schema)
- Implement backend route/controller for diarization
- Integrate diarization model/service
- Add validation and error handling
- Test with multi-speaker audio samples
- Document API usage
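For the validation and error-handling task, a framework-independent sketch might look like the following. The allowed extensions, the size limit, the error codes, and every name in the block are placeholder assumptions to be replaced once the API contract is agreed.

```python
import os

# Assumed limits for illustration only; the real values belong in the API contract.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".flac"}
MAX_BYTES = 50 * 1024 * 1024


class DiarizationRequestError(Exception):
    """Raised when a diarization request fails validation."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(message)
        self.code = code
        self.message = message


def validate_audio_upload(filename: str, size_bytes: int) -> None:
    """Reject uploads with an unsupported extension, no data, or too much data."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise DiarizationRequestError("unsupported_format", f"Unsupported audio format: {ext or 'none'}")
    if size_bytes <= 0:
        raise DiarizationRequestError("empty_audio", "Audio file is empty")
    if size_bytes > MAX_BYTES:
        raise DiarizationRequestError("file_too_large", "Audio file exceeds the size limit")


def error_payload(err: DiarizationRequestError) -> dict:
    """Build a consistent error body for the frontend."""
    return {"error": {"code": err.code, "message": err.message}}
```

A route handler could call `validate_audio_upload` before invoking the diarization service and return `error_payload(err)` with a 4xx status when it raises.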
Acceptance Criteria
- API correctly returns diarized output with speaker labels
- Response format is consistent and usable by frontend
- Handles the edge cases listed under Objectives (overlapping speech, low-quality audio, single-speaker input) without unhandled errors
- Error handling and logging are implemented for invalid input and processing failures