
feat(diarization): improve speaker playback, segment merging, modal UX & timestamp handling

ashritha kunjeti requested to merge speaker-merge into feat/develop-pro

Summary

This MR upgrades the frontend speaker diarization workflow. It adds:

  • Optional manual speaker count control
  • Intelligent diarization segment merging
  • Improved playback boundary handling
  • Modern speaker detail modal UI
  • Karaoke-style word timestamp rendering
  • Fallback timestamp synthesis for smoother highlighting

Together, these changes reduce transcript fragmentation, make segment playback predictable, and keep word highlighting working when the backend returns no word timing data.


Key Features Added

1. Optional Speaker Count Support

Added optional num_speakers support across the full transcription workflow, including API integration and frontend state handling.

Highlights

  • Added speaker count input beside the diarization toggle
  • Supports both auto-detection and manual speaker limits
  • Automatically sends valid speaker count values during transcription requests

Benefits

  • More accurate diarization
  • Better user control
  • Cleaner UX
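
As a rough illustration of how the optional speaker count might be validated before it is sent, here is a minimal sketch. Only `num_speakers` comes from this MR; the `TranscriptionOptions` shape, the `diarize` flag, and both function names are hypothetical, and the rule "positive integers only" is an assumption about what counts as a valid value.

```typescript
// Hypothetical request options; only `num_speakers` is named in this MR.
interface TranscriptionOptions {
  diarize: boolean;
  num_speakers?: number;
}

// Parse the raw text of the speaker count input.
// Returns undefined for anything that is not a positive integer
// (assumed validity rule), which leaves the backend on auto-detection.
function parseSpeakerCount(raw: string): number | undefined {
  const trimmed = raw.trim();
  if (!/^\d+$/.test(trimmed)) return undefined;
  const n = Number(trimmed);
  return n >= 1 ? n : undefined;
}

// Build the options object, attaching `num_speakers` only when
// diarization is enabled and the input holds a valid count.
function buildTranscriptionOptions(
  diarize: boolean,
  rawCount: string
): TranscriptionOptions {
  const options: TranscriptionOptions = { diarize };
  const count = diarize ? parseSpeakerCount(rawCount) : undefined;
  if (count !== undefined) options.num_speakers = count;
  return options;
}
```

With this shape, an empty or invalid input simply omits the field rather than sending a bad value.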

2. Consecutive Speaker Segment Merging

Implemented automatic merging of consecutive segments from the same speaker to reduce fragmented transcript blocks.

Improvements

  • Combines continuous speaker segments
  • Expands timestamps for seamless playback
  • Reduces excessive speaker cards and UI clutter

Benefits

  • Cleaner transcript view
  • Smoother playback experience
  • Better readability
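
The merging step can be sketched as a single pass over the segment list. The `Segment` shape and the function name below are illustrative assumptions, not the MR's actual types, and the sketch assumes segments arrive sorted by start time.

```typescript
// Hypothetical segment shape; times are in seconds.
interface Segment {
  speaker: string;
  start: number;
  end: number;
  text: string;
}

// Merge runs of adjacent segments that share a speaker.
// The merged segment keeps the first start, extends to the latest end,
// and joins the text with a single space. The input is not mutated.
function mergeConsecutiveSegments(segments: Segment[]): Segment[] {
  const merged: Segment[] = [];
  for (const seg of segments) {
    const last = merged[merged.length - 1];
    if (last !== undefined && last.speaker === seg.speaker) {
      last.end = Math.max(last.end, seg.end);
      last.text = `${last.text} ${seg.text}`.trim();
    } else {
      merged.push({ ...seg }); // copy so later merges do not touch the input
    }
  }
  return merged;
}
```

Only adjacent segments are merged, so a speaker who returns after someone else has spoken still gets a separate block.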

3. Accurate Playback Boundaries

Updated playback behavior so audio stops at the end of the selected segment instead of continuing into the next one.

Improvements

  • Stops at exact segment end time
  • Prevents unintended speaker transitions
  • Properly resets playback state

Benefits

  • More precise playback
  • Better speaker isolation
  • Predictable controls
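
One way to enforce the boundary is a handler that runs on each playback time update and pauses once the segment end is reached. The sketch below is an assumption about the approach, not the MR's code: `PlayerLike` is a hypothetical minimal interface that an `HTMLAudioElement` would satisfy, and `onStop` stands in for whatever resets the playback state.

```typescript
// Minimal player surface (hypothetical); HTMLAudioElement matches it.
interface PlayerLike {
  currentTime: number;
  pause(): void;
}

// Returns a handler intended for the player's time-update event.
// Once playback reaches segmentEnd it pauses, clamps the position back
// to the boundary, and calls onStop exactly once.
function makeTimeUpdateHandler(
  player: PlayerLike,
  segmentEnd: number,
  onStop: () => void
): () => void {
  let stopped = false;
  return () => {
    if (stopped || player.currentTime < segmentEnd) return;
    stopped = true;
    player.pause();
    player.currentTime = segmentEnd; // undo any overshoot past the boundary
    onStop();
  };
}
```

Because time updates arrive at intervals rather than continuously, playback can overshoot the boundary slightly before the handler fires; clamping `currentTime` keeps the reported position on the segment end.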

4. Modern Speaker Detail Modal

Introduced a redesigned SpeakerDetailModal with improved animations and playback interaction.

Features

  • Smooth animated modal transitions
  • Modern glassmorphic UI
  • Waveform/equalizer playback visuals
  • Transcript metadata and inline controls
  • ESC and background-click dismissal

Benefits

  • Cleaner speaker review experience
  • Improved usability
  • More polished UI/UX
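
The two dismissal paths can be expressed as small predicates. This is a sketch under assumptions: the function names and the `KeyLike` / `ClickLike` shapes are hypothetical stand-ins for the keyboard and mouse events the modal would receive, and it is not taken from `SpeakerDetailModal` itself.

```typescript
// Minimal event shapes (hypothetical) so the logic is testable without a DOM.
interface KeyLike {
  key: string;
}

interface ClickLike {
  target: unknown;
  currentTarget: unknown;
}

// True when the pressed key should close the modal.
function isDismissKey(event: KeyLike): boolean {
  return event.key === "Escape";
}

// True when the click landed on the backdrop itself rather than on
// content nested inside it. Assumes the handler is attached to the backdrop,
// so currentTarget is the backdrop and target is the element actually clicked.
function isBackdropClick(event: ClickLike): boolean {
  return event.target === event.currentTarget;
}
```

Comparing `target` with `currentTarget` lets clicks inside the modal body bubble up through the backdrop without closing it.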

5. Karaoke-Style Word Timestamp Rendering

Added karaoke-style transcript highlighting support.

Features

  • Real-time active word highlighting
  • Playback-synced transcript progression
  • Dynamic word-level visual feedback
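
Active word highlighting comes down to mapping the current playback time to a word index. A minimal sketch, assuming a hypothetical `WordTiming` shape with times in seconds and words sorted by start time:

```typescript
// Hypothetical word timing shape; times are in seconds.
interface WordTiming {
  word: string;
  start: number;
  end: number;
}

// Returns the index of the last word that has started by currentTime,
// or -1 if playback has not reached the first word yet.
function findActiveWordIndex(words: WordTiming[], currentTime: number): number {
  let active = -1;
  words.forEach((w, i) => {
    if (w.start <= currentTime) active = i;
  });
  return active;
}
```

A renderer could then style words with an index below the result as already spoken and the word at the result as active. For very long transcripts a binary search would be a natural replacement for the linear scan.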


Additional Improvements

Implemented fallback timestamp synthesis when backend word timing data is unavailable.

Benefits

  • Smoother transcript playback experience
  • More resilient rendering pipeline
  • Better accessibility and readability
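
The MR does not say how fallback timestamps are synthesized. One plausible approach, shown here purely as an assumption, is to spread the segment's duration across its words in proportion to word length. The `WordTiming` shape and function name are hypothetical.

```typescript
// Hypothetical word timing shape; times are in seconds.
interface WordTiming {
  word: string;
  start: number;
  end: number;
}

// Estimate per-word timings for a segment that has no backend word data.
// Each word receives a share of the segment duration proportional to its
// character count (assumed heuristic). Returns [] for empty text.
function synthesizeWordTimings(
  text: string,
  start: number,
  end: number
): WordTiming[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const totalChars = words.reduce((sum, w) => sum + w.length, 0);
  if (totalChars === 0) return [];
  const duration = Math.max(0, end - start); // guard against inverted bounds
  let cursor = start;
  return words.map((word, i) => {
    const wordStart = cursor;
    // Pin the last word to the segment end to avoid floating-point drift.
    const wordEnd =
      i === words.length - 1
        ? start + duration
        : cursor + (duration * word.length) / totalChars;
    cursor = wordEnd;
    return { word, start: wordStart, end: wordEnd };
  });
}
```

The synthesized timings are contiguous and cover the whole segment, so the same highlighting logic can run whether or not real word timings exist.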

User Experience Improvements

  • Cleaner speaker grouping
  • Better playback precision
  • More scalable transcript rendering
  • Modern modal-based interaction
  • Improved diarization control
  • Enhanced transcript readability
  • More natural audio navigation

Testing

Diarization

  • Auto speaker detection works, though it is not always accurate
  • Manual speaker count works
  • Invalid values are ignored safely

Playback

  • Segment stops at exact end timestamp
  • No auto-transition occurs
  • Modal playback syncs correctly

Segment Merging

  • Consecutive segments from the same speaker merge correctly
  • Timestamps expand properly
  • Transcript concatenation works

Modal

  • ESC closes modal
  • Background click closes modal
  • Waveform animates during playback

Karaoke Highlighting

  • Words highlight progressively
  • Fallback timestamps render safely
  • No crashes when timestamps are missing

Closes #46, #47, #48

Edited by ashritha kunjeti
