feat(diarization): improve speaker playback, segment merging, modal UX & timestamp handling
Summary
This MR upgrades the speaker diarization workflow in the frontend. It adds:
- Optional manual speaker count control
- Intelligent diarization segment merging
- Improved playback boundary handling
- Modern speaker detail modal UI
- Karaoke-style word timestamp rendering
- Fallback timestamp synthesis for smoother highlighting
Together, these changes produce fewer and longer transcript blocks, playback that stays within the selected segment, and word highlighting that keeps working when the backend returns no word timings.
Key Features Added
1. Optional Speaker Count Support
Added optional num_speakers support across the full transcription workflow, including API integration and frontend state handling.
Highlights
- Added speaker count input beside the diarization toggle
- Supports both auto-detection and manual speaker limits
- Automatically sends valid speaker count values during transcription requests
Benefits
- More accurate diarization when the number of speakers is known in advance
- Better user control
- Cleaner UX
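For reviewers, a minimal sketch of how the optional value might be validated and attached to the request. Only num_speakers comes from this MR; the TranscriptionRequest shape, the diarize flag, and both function names are hypothetical, and no upper bound is applied because the MR does not state one.

```typescript
// Hypothetical request shape; only `num_speakers` is named in this MR.
interface TranscriptionRequest {
  diarize: boolean;
  num_speakers?: number;
}

// Returns a positive integer speaker count, or undefined to fall back
// to auto-detection. Rejects "", "abc", "2.5", "-1" and "0".
function parseSpeakerCount(raw: string): number | undefined {
  const trimmed = raw.trim();
  if (!/^\d+$/.test(trimmed)) return undefined;
  const n = Number(trimmed);
  return n >= 1 ? n : undefined;
}

// Builds the request body. `num_speakers` is omitted entirely unless
// diarization is enabled and the input is a valid count.
function buildTranscriptionRequest(
  diarize: boolean,
  rawCount: string,
): TranscriptionRequest {
  const request: TranscriptionRequest = { diarize };
  const count = diarize ? parseSpeakerCount(rawCount) : undefined;
  if (count !== undefined) {
    request.num_speakers = count;
  }
  return request;
}
```

Omitting the field, rather than sending null or 0, keeps auto-detection as the default path in this sketch.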
2. Consecutive Speaker Segment Merging
Implemented automatic merging of consecutive segments from the same speaker to reduce fragmented transcript blocks.
Improvements
- Combines continuous speaker segments
- Expands timestamps for seamless playback
- Reduces excessive speaker cards and UI clutter
Benefits
- Cleaner transcript view
- Smoother playback experience
- Better readability
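The merge step can be sketched roughly as follows. The Segment shape and the function name are assumptions for illustration, and the sketch assumes segments arrive sorted by start time.

```typescript
// Hypothetical segment shape; times are in seconds.
interface Segment {
  speaker: string;
  start: number;
  end: number;
  text: string;
}

// Folds runs of adjacent segments from the same speaker into one segment.
// The merged segment keeps the first start, extends to the latest end,
// and joins the text with a single space.
function mergeConsecutiveSegments(segments: Segment[]): Segment[] {
  const merged: Segment[] = [];
  for (const seg of segments) {
    const last = merged[merged.length - 1];
    if (last !== undefined && last.speaker === seg.speaker) {
      last.end = Math.max(last.end, seg.end);
      last.text = `${last.text} ${seg.text}`.trim();
    } else {
      // Copy so the caller's input array is not mutated.
      merged.push({ ...seg });
    }
  }
  return merged;
}
```

A speaker who returns after someone else has spoken starts a new block, so only truly consecutive segments are combined.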
3. Accurate Playback Boundaries
Updated playback behavior so audio stops exactly at the selected segment instead of auto-playing the next one.
Improvements
- Stops at exact segment end time
- Prevents unintended speaker transitions
- Properly resets playback state
Benefits
- More precise playback
- Better speaker isolation
- Predictable controls
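One way to express the boundary check, as a sketch rather than the code in this MR. PlayerLike is a hypothetical subset of HTMLAudioElement so the logic can be exercised without a browser, and SegmentPlayback is an illustrative name.

```typescript
// Hypothetical minimal player surface; an HTMLAudioElement satisfies it.
interface PlayerLike {
  currentTime: number;
  play(): unknown;
  pause(): void;
}

class SegmentPlayback {
  private readonly player: PlayerLike;
  private activeEnd: number | null = null;

  constructor(player: PlayerLike) {
    this.player = player;
  }

  // Seeks to the segment start and remembers where playback must stop.
  playSegment(start: number, end: number): void {
    this.player.currentTime = start;
    this.activeEnd = end;
    this.player.play();
  }

  // Call on every time update. Returns true when the segment just ended,
  // so the caller can reset its "playing" UI state.
  onTimeUpdate(): boolean {
    if (this.activeEnd === null) return false;
    if (this.player.currentTime < this.activeEnd) return false;
    this.player.pause();
    this.player.currentTime = this.activeEnd; // clamp any overshoot
    this.activeEnd = null;
    return true;
  }
}
```

In a browser, onTimeUpdate would typically be wired to the element's timeupdate event. That event fires at a browser-chosen rate, so a requestAnimationFrame loop may be needed if the stop has to land closer to the exact end time.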
4. Modern Speaker Detail Modal
Introduced a redesigned SpeakerDetailModal with improved animations and playback interaction.
Features
- Smooth animated modal transitions
- Modern glassmorphic UI
- Waveform/equalizer playback visuals
- Transcript metadata and inline controls
- ESC and background-click dismissal
Benefits
- Cleaner speaker review experience
- Improved usability
- More polished UI/UX
5. Karaoke-Style Word Timestamp Rendering
Added karaoke-style transcript highlighting support.
Features
- Real-time active word highlighting
- Playback-synced transcript progression
- Dynamic word-level visual feedback
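The lookup behind active word highlighting could look like the sketch below. WordTiming and activeWordIndex are hypothetical names, and the sketch assumes words are sorted by start time.

```typescript
// Hypothetical word timing shape; times are in seconds.
interface WordTiming {
  word: string;
  start: number;
  end: number;
}

// Binary search for the last word whose start is at or before `time`.
// Returns -1 before the first word begins. During a pause between words
// the previous word stays active, which avoids flicker.
function activeWordIndex(words: WordTiming[], time: number): number {
  let lo = 0;
  let hi = words.length - 1;
  let result = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const w = words[mid];
    if (w !== undefined && w.start <= time) {
      result = mid;
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return result;
}
```

With this index, words up to and including the active one can be styled as spoken and the rest as upcoming.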
Additional Improvements
Implemented fallback timestamp synthesis when backend word timing data is unavailable.
Benefits
- Smoother transcript playback experience
- More resilient rendering pipeline
- Better accessibility and readability
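A possible shape for the fallback, offered as an assumption since the MR does not say how timings are synthesized. This sketch spreads the segment duration across words in proportion to their character length; WordTiming and synthesizeWordTimings are hypothetical names.

```typescript
// Hypothetical word timing shape; times are in seconds.
interface WordTiming {
  word: string;
  start: number;
  end: number;
}

// Estimates per-word timings for a segment that has none.
// Longer words get a proportionally larger share of the duration.
function synthesizeWordTimings(
  text: string,
  start: number,
  end: number,
): WordTiming[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  if (words.length === 0) return [];
  const duration = Math.max(0, end - start); // guard against end < start
  const totalChars = words.reduce((sum, w) => sum + w.length, 0);
  let cursor = start;
  return words.map((word, i) => {
    const share = duration * (word.length / totalChars);
    const wordStart = cursor;
    // Pin the last word to the segment end to avoid rounding drift.
    const isLast = i === words.length - 1;
    const wordEnd = isLast ? start + duration : cursor + share;
    cursor = wordEnd;
    return { word, start: wordStart, end: wordEnd };
  });
}
```

The estimates are only approximate, but they are contiguous and cover the whole segment, so highlighting can advance without gaps.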
User Experience Improvements
- Cleaner speaker grouping
- Better playback precision
- More scalable transcript rendering
- Modern modal-based interaction
- Improved diarization control
- Enhanced transcript readability
- More natural audio navigation
Diarization
- Auto speaker detection works, but not 100% accurate
- Manual speaker count works
- Invalid values are ignored safely
Playback
- Segment stops at exact end timestamp
- No auto-transition occurs
- Modal playback syncs correctly
Segment Merging
- Consecutive speakers merge correctly
- Timestamps expand properly
- Transcript concatenation works
Modal
- ESC closes modal
- Background click closes modal
- Waveform animates during playback
Karaoke Highlighting
- Words highlight progressively
- Fallback timestamps render safely
- No crashes when timestamps missing
Closes #46, #47, #48
