Add Text-to-Speech Support for Speaker Diarization Output
Summary
Implement a Text-to-Speech (TTS) feature for the Speaker Diarization module so that users can play back the transcribed speech speaker by speaker. This enhancement improves accessibility and usability by letting users listen to the diarized transcript directly in the application.
Problem Statement
Currently, the Speaker Diarization feature displays the segmented transcript for each speaker as text only. Users cannot listen to the generated transcript or replay an individual speaker's segments.
Proposed Solution
Integrate Text-to-Speech functionality that:
- Converts diarized transcript text into audio.
- Supports per-speaker playback.
- Allows users to play/pause generated audio.
- Optionally differentiates speakers using different voices or tones.
- Works seamlessly with the existing diarization workflow.
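As a rough illustration of the per-speaker and voice-differentiation points above, the sketch below assigns each distinct speaker a voice profile and filters the transcript by speaker. The `DiarizedSegment` and `VoiceProfile` shapes and both function names are assumptions made for this example, not the module's actual data model. `pitch` and `rate` are chosen because the Web Speech API's `SpeechSynthesisUtterance` exposes properties with those names.

```typescript
// Hypothetical shape of one diarized segment; the real module's fields may differ.
interface DiarizedSegment {
  speaker: string; // e.g. "Speaker 1"
  text: string;
}

// Hypothetical voice settings used to tell speakers apart.
interface VoiceProfile {
  pitch: number;
  rate: number;
}

// Give each distinct speaker a profile, cycling through the palette
// in order of first appearance in the transcript.
function assignVoiceProfiles(
  segments: DiarizedSegment[],
  palette: VoiceProfile[]
): Record<string, VoiceProfile> {
  if (palette.length === 0) {
    throw new Error("palette must contain at least one voice profile");
  }
  const profiles: Record<string, VoiceProfile> = {};
  let assigned = 0;
  for (const seg of segments) {
    if (!Object.prototype.hasOwnProperty.call(profiles, seg.speaker)) {
      profiles[seg.speaker] = palette[assigned % palette.length];
      assigned += 1;
    }
  }
  return profiles;
}

// Keep only one speaker's segments, preserving transcript order.
function segmentsForSpeaker(
  segments: DiarizedSegment[],
  speaker: string
): DiarizedSegment[] {
  return segments.filter((seg) => seg.speaker === speaker);
}
```

With more speakers than palette entries, profiles repeat, so a real implementation would likely want a palette at least as large as the expected speaker count.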
Expected Features
- TTS generation for each identified speaker segment.
- Playback controls (Play/Pause/Stop).
- Per-speaker audio rendering.
- Proper synchronization between transcript and audio playback.
- Error handling for failed TTS generation.
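The playback, synchronization, and error-handling features listed above could be tied together by a small state machine. The sketch below is one possible shape, not a proposed final design: `SpeechEngine`, `SpokenSegment`, and `SegmentPlayer` are hypothetical names. In a browser, such an engine could be backed by `window.speechSynthesis`, whose `speak()`, `pause()`, `resume()`, and `cancel()` methods, together with the utterance `onend` and `onerror` events, map onto this interface.

```typescript
type PlaybackState = "idle" | "playing" | "paused" | "error";

// Hypothetical abstraction over whichever TTS backend is chosen.
interface SpeechEngine {
  speak(text: string, onEnd: () => void, onError: (reason: string) => void): void;
  pause(): void;
  resume(): void;
  cancel(): void;
}

// Hypothetical segment shape for this example.
interface SpokenSegment {
  speaker: string;
  text: string;
}

class SegmentPlayer {
  state: PlaybackState = "idle";
  currentIndex = -1; // segment being spoken; the UI can highlight it
  lastError: string | null = null;
  private engine: SpeechEngine;
  private segments: SpokenSegment[];

  constructor(engine: SpeechEngine, segments: SpokenSegment[]) {
    this.engine = engine;
    this.segments = segments;
  }

  play(): void {
    if (this.state === "paused") {
      this.engine.resume();
      this.state = "playing";
      return;
    }
    if (this.state === "playing") return;
    this.lastError = null;
    this.speakFrom(0); // from "idle" or "error", start over
  }

  pause(): void {
    if (this.state !== "playing") return;
    this.engine.pause();
    this.state = "paused";
  }

  stop(): void {
    this.engine.cancel();
    this.state = "idle";
    this.currentIndex = -1;
  }

  private speakFrom(index: number): void {
    if (index >= this.segments.length) {
      this.state = "idle";
      this.currentIndex = -1;
      return;
    }
    this.state = "playing";
    this.currentIndex = index;
    this.engine.speak(
      this.segments[index].text,
      () => {
        // Ignore callbacks from a segment that was stopped or superseded.
        if (this.currentIndex === index) this.speakFrom(index + 1);
      },
      (reason) => {
        if (this.currentIndex !== index) return;
        this.state = "error";
        this.lastError = reason;
      }
    );
  }
}
```

Because `currentIndex` always names the segment being spoken, the UI can read it to show which speaker is active, and `lastError` gives it something to display when generation fails.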
Technical Considerations
- Integrate browser-based TTS APIs or external TTS services.
- Ensure compatibility with existing frontend and backend architecture.
- Handle long transcript chunks efficiently.
- Maintain low latency during audio generation.
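One way to approach the long-transcript and latency points above is to split each segment into sentence-aligned chunks, so the first chunk can start playing while later ones are still being generated. The sketch below shows a simple greedy packer. `chunkText` and the `maxChars` limit are assumptions for illustration, and the appropriate limit would depend on the TTS API or service that is chosen.

```typescript
// Split text into chunks of at most maxChars characters, breaking only
// at sentence boundaries (".", "!" or "?"). A single sentence longer
// than maxChars is kept whole rather than cut mid-sentence.
function chunkText(text: string, maxChars: number): string[] {
  const sentences: string[] = text.match(/[^.!?]+[.!?]*/g) || [];
  const chunks: string[] = [];
  let current = "";
  for (const raw of sentences) {
    const sentence = raw.trim();
    if (sentence.length === 0) continue;
    const candidate = current.length === 0 ? sentence : current + " " + sentence;
    if (candidate.length <= maxChars || current.length === 0) {
      current = candidate; // still fits, or first sentence of a chunk
    } else {
      chunks.push(current);
      current = sentence;
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

// Example: a 25-character limit keeps the first two sentences together.
const exampleChunks = chunkText("Hello there. How are you? Fine.", 25);
// exampleChunks is ["Hello there. How are you?", "Fine."]
```

Chunking on sentence boundaries avoids audible breaks in the middle of a sentence, at the cost of occasionally exceeding the limit for very long sentences.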
Acceptance Criteria
- Users can generate audio from diarized transcripts.
- Audio playback works for all speaker segments.
- UI clearly identifies which speaker audio is being played.
- No regression in existing speaker diarization functionality.
- Feature works across supported browsers/devices.
Edited by srilatha bandari