Add Text-to-Speech Support for Speaker Diarization Output

Summary

Implement a Text-to-Speech (TTS) feature for the Speaker Diarization module so that transcribed speech can be played back speaker by speaker. This enhancement improves accessibility and usability by letting users listen to the diarized transcript directly in the application.

Problem Statement

Currently, the Speaker Diarization feature displays the segmented transcript for each speaker only as text. Users cannot listen to an individual speaker's segments from the generated transcript.

Proposed Solution

Integrate Text-to-Speech functionality that:

Converts diarized transcript text into audio.
Supports playback speaker by speaker.
Allows users to play/pause generated audio.
Optionally differentiates speakers using different voices or tones.
Works seamlessly with the existing diarization workflow.
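
As a rough illustration of the speaker-wise grouping and optional voice differentiation listed above, the following TypeScript sketch assigns one voice per speaker. The `DiarizedSegment` shape and both function names are assumptions made for this example; the real diarization output may look different.

```typescript
// Hypothetical shape of one diarized segment; the actual module's output may differ.
interface DiarizedSegment {
  speaker: string; // e.g. "SPEAKER_00"
  text: string;
  startSec: number;
  endSec: number;
}

// Assign each distinct speaker a voice name in order of first appearance,
// cycling through the available voices when speakers outnumber voices.
function assignVoices(
  segments: DiarizedSegment[],
  voiceNames: string[],
): Map<string, string> {
  if (voiceNames.length === 0) {
    throw new Error("at least one voice is required");
  }
  const assignment = new Map<string, string>();
  for (const seg of segments) {
    if (!assignment.has(seg.speaker)) {
      assignment.set(seg.speaker, voiceNames[assignment.size % voiceNames.length]);
    }
  }
  return assignment;
}

// Collect segment texts per speaker, preserving transcript order within each speaker.
function groupBySpeaker(segments: DiarizedSegment[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const seg of segments) {
    const texts = groups.get(seg.speaker) ?? [];
    texts.push(seg.text);
    groups.set(seg.speaker, texts);
  }
  return groups;
}
```

If the browser's Web Speech API were chosen, the voice names could come from the `name` field of the voices returned by `speechSynthesis.getVoices()`; with an external TTS service they would be that service's voice identifiers.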

Expected Features

TTS generation for each identified speaker segment.
Playback controls (Play/Pause/Stop).
Speaker-wise audio rendering.
Synchronization between transcript and audio playback, so the transcript reflects the segment currently being played.
Error handling for failed TTS generation.
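
The playback controls, synchronization, and error handling above could be modeled roughly as follows. This is a minimal sketch, not a proposed implementation: the state names, the `TimedSegment` shape, and both function names are hypothetical, and `startSec`/`endSec` are assumed to be positions on the generated audio's timeline rather than in the original recording.

```typescript
type PlaybackState = "idle" | "playing" | "paused" | "error";
type PlaybackAction = "play" | "pause" | "stop" | "fail";

// Pure transition function for the Play/Pause/Stop controls.
// "fail" represents a failed TTS generation and moves to the error state.
function nextPlaybackState(state: PlaybackState, action: PlaybackAction): PlaybackState {
  switch (action) {
    case "play":
      return "playing"; // start, resume, or retry after an error
    case "pause":
      return state === "playing" ? "paused" : state; // only applies while playing
    case "stop":
      return "idle";
    case "fail":
      return "error";
  }
}

// Hypothetical segment timing on the generated audio's timeline.
interface TimedSegment {
  speaker: string;
  startSec: number;
  endSec: number;
}

// Return the index of the segment whose [startSec, endSec) interval contains
// timeSec, or -1 if playback is in a gap. Assumes segments are sorted by
// startSec and do not overlap, which allows a binary search.
function findActiveSegment(segments: TimedSegment[], timeSec: number): number {
  let lo = 0;
  let hi = segments.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const seg = segments[mid];
    if (timeSec < seg.startSec) {
      hi = mid - 1;
    } else if (timeSec >= seg.endSec) {
      lo = mid + 1;
    } else {
      return mid;
    }
  }
  return -1;
}
```

Keeping the transitions in a pure function makes the controls easy to test without an audio backend; the UI can use the index from `findActiveSegment` to show which speaker is currently being played.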

Technical Considerations

Integrate browser-based TTS APIs or external TTS services.
Ensure compatibility with existing frontend and backend architecture.
Handle long transcript chunks efficiently.
Maintain low latency during audio generation.
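
One possible way to handle long transcript chunks is to split the text into bounded pieces before sending it to the TTS engine, so that audio for the first piece can start while later pieces are still being generated. The sketch below is an assumption about how this might be done; `chunkTranscript`, its helpers, and the `maxChars` limit are illustrative names, and the right limit depends on the TTS backend chosen.

```typescript
// Hard-cut a piece that contains no usable break points.
function splitOversized(piece: string, maxChars: number): string[] {
  const parts: string[] = [];
  for (let i = 0; i < piece.length; i += maxChars) {
    parts.push(piece.slice(i, i + maxChars));
  }
  return parts;
}

// Join consecutive units with single spaces while the result fits in maxChars.
// Callers must pass units that are each at most maxChars long.
function packGreedily(units: string[], maxChars: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const unit of units) {
    const candidate = current === "" ? unit : current + " " + unit;
    if (candidate.length <= maxChars) {
      current = candidate;
    } else {
      if (current !== "") chunks.push(current);
      current = unit;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

// Split text into chunks of at most maxChars, preferring sentence boundaries.
// Sentences longer than maxChars fall back to word boundaries, and words
// longer than maxChars are hard-cut.
function chunkTranscript(text: string, maxChars: number): string[] {
  if (maxChars <= 0) throw new Error("maxChars must be positive");
  // A sentence ends at punctuation followed by whitespace or end of text,
  // so decimals such as "3.5" are not split.
  const sentences = text.match(/\S[\s\S]*?(?:[.!?]+(?=\s|$)|$)/g) ?? [];
  const units: string[] = [];
  for (const sentence of sentences) {
    if (sentence.length <= maxChars) {
      units.push(sentence);
      continue;
    }
    const words: string[] = [];
    for (const word of sentence.split(/\s+/)) {
      if (word.length <= maxChars) words.push(word);
      else words.push(...splitOversized(word, maxChars));
    }
    units.push(...packGreedily(words, maxChars));
  }
  return packGreedily(units, maxChars);
}
```

Note that whitespace between sentences is normalized to a single space when units are joined, which should not matter for speech output.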

Acceptance Criteria

Users can generate audio from diarized transcripts.
Audio playback works for all speaker segments.
The UI clearly identifies which speaker's audio is being played.
No regression in existing speaker diarization functionality.
The feature works across supported browsers and devices.

Edited May 16, 2026 by srilatha bandari