Implement Backend API Calls for Speaker Diarization Feature

Description

To support the speaker diarization feature, we need to introduce and standardize backend API calls that handle multi-speaker audio processing efficiently.

Currently, the system processes audio primarily for transcription, but it lacks structured API endpoints for identifying and segmenting different speakers within the same audio stream.

This issue focuses on designing and integrating backend API calls that:

Accept audio input for diarization processing
Return speaker-segmented transcription results
Maintain consistency with existing ASR API structure
Ensure scalability and performance for real-time or near real-time use

Objectives

Create dedicated API endpoints for speaker diarization
Ensure compatibility with existing transcription pipeline
Optimize request/response format for frontend consumption
Handle edge cases such as:
- Overlapping speech
- Low-quality audio
- Single-speaker fallback
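
Two of the edge cases above can be sketched as small post-processing helpers. This is only an illustration under assumptions: the segment shape (a dict with `speaker`, `start`, `end`, `text`) and both function names are hypothetical and not part of the existing codebase.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical segment shape: {"speaker": str, "start": float, "end": float, "text": str}
Segment = Dict[str, Any]


def apply_single_speaker_fallback(segments: List[Segment]) -> List[Segment]:
    """Label every segment "Speaker 1" when no segment carries a speaker label."""
    if any(seg.get("speaker") for seg in segments):
        return segments
    return [{**seg, "speaker": "Speaker 1"} for seg in segments]


def find_overlaps(segments: List[Segment]) -> List[Tuple[int, int]]:
    """Return index pairs of segments from different speakers whose time ranges intersect."""
    overlaps: List[Tuple[int, int]] = []
    # Keep the original indices so callers can map pairs back to their input list.
    ordered = sorted(enumerate(segments), key=lambda pair: pair[1]["start"])
    for a in range(len(ordered)):
        i, first = ordered[a]
        for b in range(a + 1, len(ordered)):
            j, second = ordered[b]
            if second["start"] >= first["end"]:
                break  # sorted by start time, so no later segment can overlap `first`
            if first.get("speaker") != second.get("speaker"):
                overlaps.append((i, j))
    return overlaps
```

Segments that merely touch (one ends exactly where the next begins) are not reported as overlapping. How overlaps should be surfaced to the frontend is left open here.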

Expected API Behavior

Input: Audio file/stream
Output:
- Transcription text
- Speaker labels (e.g., Speaker 1, Speaker 2)
- Timestamped segments
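
One possible shape for this output, as a minimal sketch rather than the final contract. The class names, field names, and the choice of seconds for timestamps are assumptions made for illustration; the actual schema is to be decided in the API contract task.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DiarizedSegment:
    speaker: str  # speaker label, e.g. "Speaker 1"
    start: float  # segment start, in seconds from the beginning of the audio (assumed unit)
    end: float    # segment end, in seconds
    text: str     # transcription of this segment only


@dataclass
class DiarizationResponse:
    text: str  # full transcription text
    segments: List[DiarizedSegment] = field(default_factory=list)

    def to_dict(self) -> dict:
        """Convert to plain dicts and lists, ready for JSON serialization."""
        return asdict(self)


def build_response(segments: List[DiarizedSegment]) -> DiarizationResponse:
    """Order segments by start time and join their text into the full transcript."""
    ordered = sorted(segments, key=lambda seg: seg.start)
    full_text = " ".join(seg.text for seg in ordered)
    return DiarizationResponse(text=full_text, segments=ordered)
```

Keeping the full `text` alongside the per-speaker `segments` would let frontend code that already consumes plain transcription output keep working, while diarization-aware views read the segments.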

Tasks

Design API contract (request & response schema)
Implement backend route/controller for diarization
Integrate diarization model/service
Add validation and error handling
Test with multi-speaker audio samples
Document API usage
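
For the validation and error handling task, a framework-independent sketch could look like the following. The accepted content types, the size limit, the error codes, and all names are assumptions chosen for illustration; only the HTTP status codes (400, 413, 415) are standard.

```python
from typing import Optional

# Assumed limits for illustration; the real values belong in the API contract.
ALLOWED_CONTENT_TYPES = {"audio/wav", "audio/mpeg", "audio/flac"}
MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # 50 MiB


class AudioValidationError(Exception):
    """Raised when an upload fails validation; carries the HTTP status to return."""

    def __init__(self, status: int, code: str, message: str) -> None:
        super().__init__(message)
        self.status = status
        self.code = code
        self.message = message


def validate_audio_upload(content_type: Optional[str], size_bytes: int) -> None:
    """Raise AudioValidationError if the upload should be rejected before processing."""
    if size_bytes <= 0:
        raise AudioValidationError(400, "empty_audio", "Audio payload is empty.")
    if content_type not in ALLOWED_CONTENT_TYPES:
        raise AudioValidationError(
            415, "unsupported_media_type", f"Unsupported content type: {content_type}"
        )
    if size_bytes > MAX_UPLOAD_BYTES:
        raise AudioValidationError(
            413, "payload_too_large", f"Audio exceeds {MAX_UPLOAD_BYTES} bytes."
        )


def error_body(err: AudioValidationError) -> dict:
    """Build a uniform JSON error body so the frontend can handle failures consistently."""
    return {"error": {"code": err.code, "message": err.message}}
```

A route handler would call `validate_audio_upload` before invoking the diarization service, then catch `AudioValidationError`, log it, and return `error_body(err)` with `err.status`.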

Acceptance Criteria

API correctly returns diarized output with speaker labels
Response format is consistent and usable by frontend
Handles overlapping speech, low-quality audio, and single-speaker input without failing the request
Proper error handling and logging implemented