feat(diarization): improve speaker playback, segment merging, modal UX & timestamp handling

ashritha kunjeti requested to merge speaker-merge into feat/develop-pro May 27, 2026

Summary

This MR upgrades the speaker diarization workflow in the frontend. It adds:

Optional manual speaker count control
Intelligent diarization segment merging
Improved playback boundary handling
Modern speaker detail modal UI
Karaoke-style word timestamp rendering
Fallback timestamp synthesis for smoother highlighting

Together, these changes give fewer and longer transcript blocks, playback that stops where the selected segment ends, and word highlighting that keeps working when backend word timings are missing.

Key Features Added

1. Optional Speaker Count Support

Added optional num_speakers support across the full transcription workflow, including API integration and frontend state handling.

Highlights

Added speaker count input beside the diarization toggle
Supports both auto-detection and manual speaker limits
Automatically sends valid speaker count values during transcription requests

Benefits

More accurate diarization
Better user control
Cleaner UX
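
For reviewers, the intended handling of the speaker count can be sketched as below. This is a minimal illustration, not the code in this MR: only `num_speakers` is named in the description, while `TranscriptionOptions`, `parseSpeakerCount`, `buildTranscriptionOptions`, and the `MAX_SPEAKERS` bound are hypothetical.

```typescript
// Hypothetical request shape; only `num_speakers` is named in this MR.
interface TranscriptionOptions {
  diarize: boolean;
  num_speakers?: number;
}

// Assumed upper bound, for illustration only; the real limit is not stated here.
const MAX_SPEAKERS = 20;

/** Returns a valid speaker count, or undefined to fall back to auto-detection. */
function parseSpeakerCount(raw: string): number | undefined {
  const trimmed = raw.trim();
  // Accept plain positive integers only; empty or malformed input is ignored.
  if (!/^\d+$/.test(trimmed)) return undefined;
  const n = Number(trimmed);
  return n >= 1 && n <= MAX_SPEAKERS ? n : undefined;
}

/** Adds `num_speakers` to the options only when diarization is on and the value is valid. */
function buildTranscriptionOptions(diarize: boolean, rawCount: string): TranscriptionOptions {
  const options: TranscriptionOptions = { diarize };
  const count = diarize ? parseSpeakerCount(rawCount) : undefined;
  if (count !== undefined) options.num_speakers = count;
  return options;
}
```

Omitting the field entirely, rather than sending `0` or `null`, is one way to leave auto-detection in charge whenever the input is empty or invalid.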

2. Consecutive Speaker Segment Merging

Implemented automatic merging of consecutive segments from the same speaker to reduce fragmented transcript blocks.

Improvements

Combines continuous speaker segments
Expands timestamps for seamless playback
Reduces excessive speaker cards and UI clutter

Benefits

Cleaner transcript view
Smoother playback experience
Better readability
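
The merge step described above can be sketched as a single pass over the segment list. This is an illustrative sketch under assumptions: the `Segment` shape and the function name are hypothetical, and segments are assumed to arrive sorted by start time.

```typescript
// Hypothetical segment shape; field names are assumptions for illustration.
interface Segment {
  speaker: string;
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

/** Merges runs of adjacent segments that share a speaker into one segment each. */
function mergeConsecutiveSegments(segments: Segment[]): Segment[] {
  const merged: Segment[] = [];
  for (const seg of segments) {
    const last = merged[merged.length - 1];
    if (last !== undefined && last.speaker === seg.speaker) {
      // Same speaker continues: extend the end time and append the text.
      last.end = Math.max(last.end, seg.end);
      last.text = `${last.text} ${seg.text}`.trim();
    } else {
      // New speaker: start a new block (copy so the input is not mutated).
      merged.push({ ...seg });
    }
  }
  return merged;
}
```

Because each merged block keeps the start of its first segment and the end of its last, playing a merged block covers the whole run without gaps between the original fragments.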

3. Accurate Playback Boundaries

Updated playback behavior so audio stops exactly at the selected segment instead of auto-playing the next one.

Improvements

Stops at exact segment end time
Prevents unintended speaker transitions
Properly resets playback state

Benefits

More precise playback
Better speaker isolation
Predictable controls
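
One way to express the boundary check is a small function called on every playback tick. The sketch below is an assumption about the approach, not the MR's code: `AudioLike` is a stand-in for the two `HTMLAudioElement` members it needs, and `PlaybackState` and `enforceSegmentEnd` are hypothetical names.

```typescript
// Minimal stand-in for the parts of HTMLAudioElement used here.
interface AudioLike {
  currentTime: number;
  pause(): void;
}

// Hypothetical UI state: which segment is playing, or null when idle.
interface PlaybackState {
  activeSegmentIndex: number | null;
}

/**
 * Pauses playback once the playhead reaches the end of the active segment.
 * Returns true when it stopped playback, false otherwise.
 */
function enforceSegmentEnd(audio: AudioLike, segmentEnd: number, state: PlaybackState): boolean {
  if (state.activeSegmentIndex === null || audio.currentTime < segmentEnd) {
    return false;
  }
  audio.pause();
  audio.currentTime = segmentEnd;  // clamp so the playhead does not sit inside the next segment
  state.activeSegmentIndex = null; // reset so the play button returns to its idle state
  return true;
}
```

In a browser this would typically run from a `timeupdate` listener. That event fires only a few times per second in many browsers, so a `requestAnimationFrame` loop may be needed if the stop point has to be tighter.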

4. Modern Speaker Detail Modal

Introduced a redesigned SpeakerDetailModal with improved animations and playback interaction.

Features

Smooth animated modal transitions
Modern glassmorphic UI
Waveform/equalizer playback visuals
Transcript metadata and inline controls
ESC and background-click dismissal

Benefits

Cleaner speaker review experience
Improved usability
More polished UI/UX
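
The two dismissal rules can be kept as framework-independent predicates, which makes them easy to test. This is a hedged sketch: `KeyLike`, `ClickLike`, and both function names are hypothetical, and how `SpeakerDetailModal` wires them up is not shown in this description.

```typescript
// Minimal stand-ins for the DOM event fields used below.
interface KeyLike {
  key: string;
}
interface ClickLike {
  target: unknown;
  currentTarget: unknown;
}

/** True when the pressed key is Escape ("Escape" is the standard KeyboardEvent.key value). */
function isEscapeKey(event: KeyLike): boolean {
  return event.key === "Escape";
}

/**
 * True when the click landed on the backdrop itself rather than on the
 * modal content inside it. Assumes the listener is attached to the backdrop,
 * so currentTarget is the backdrop element.
 */
function isBackdropClick(event: ClickLike): boolean {
  return event.target === event.currentTarget;
}
```

Comparing `target` with `currentTarget` avoids closing the modal when a click inside the content bubbles up to the backdrop.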

5. Karaoke-Style Word Timestamp Rendering

Added karaoke-style transcript highlighting support.

Features

Real-time active word highlighting
Playback-synced transcript progression
Dynamic word-level visual feedback
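
Active-word lookup can be sketched as follows. The `WordTiming` shape and `findActiveWordIndex` are hypothetical names, and the sketch assumes words are sorted by start time.

```typescript
// Hypothetical word timing shape; field names are assumptions.
interface WordTiming {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

/**
 * Returns the index of the last word that has started at `currentTime`,
 * or -1 if playback has not reached the first word yet.
 * Words at or before the returned index can be rendered as "spoken".
 */
function findActiveWordIndex(words: WordTiming[], currentTime: number): number {
  let active = -1;
  for (let i = 0; i < words.length; i++) {
    const w = words[i];
    if (w === undefined || w.start > currentTime) break;
    active = i;
  }
  return active;
}
```

Keying on each word's start time keeps the current word highlighted through short pauses instead of flickering off between words. A binary search would be a drop-in replacement for very long transcripts.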

Additional Improvements

Implemented fallback timestamp synthesis when backend word timing data is unavailable.
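
The description does not say how fallback timestamps are synthesized. One plausible approach, shown here purely as an assumption, is to spread the segment duration across its words in proportion to word length. `WordTiming` and `synthesizeWordTimings` are hypothetical names.

```typescript
// Hypothetical word timing shape; field names are assumptions.
interface WordTiming {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

/**
 * Builds approximate word timings for a segment that has none by dividing
 * the segment duration among words in proportion to their character count.
 * Returns an empty list when the text contains no words.
 */
function synthesizeWordTimings(text: string, start: number, end: number): WordTiming[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const totalChars = words.reduce((sum, w) => sum + w.length, 0);
  if (totalChars === 0) return [];

  const duration = Math.max(0, end - start); // guard against inverted timestamps
  const timings: WordTiming[] = [];
  let cursor = start;
  for (const word of words) {
    const span = duration * (word.length / totalChars);
    timings.push({ word, start: cursor, end: cursor + span });
    cursor += span;
  }
  return timings;
}
```

Synthesized timings only approximate real speech, so the highlight can drift from the audio within a segment. The point of the fallback is that rendering never fails when backend word timings are absent.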

Benefits

Smoother transcript playback experience
More resilient rendering pipeline
Better accessibility and readability

User Experience Improvements

Cleaner speaker grouping
Better playback precision
More scalable transcript rendering
Modern modal-based interaction
Improved diarization control
Enhanced transcript readability
More natural audio navigation

Verification Checklist

Diarization

Auto speaker detection works but is not always accurate
Manual speaker count works
Invalid values are ignored safely

Playback

Segment stops at exact end timestamp
No auto-transition occurs
Modal playback syncs correctly

Segment Merging

Consecutive same-speaker segments merge correctly
Timestamps expand properly
Transcript concatenation works

Modal

ESC closes modal
Background click closes modal
Waveform animates during playback

Karaoke Highlighting

Words highlight progressively
Fallback timestamps render safely
No crashes when timestamps are missing

Closes #46 (closed), #47 (closed), #48 (closed)

Edited May 28, 2026 by ashritha kunjeti
