
feat: add diarization in the front-end independent of back-end

vyshnavi requested to merge frontend_diarization into feat/develop-pro

Summary

This MR introduces speaker diarization in the frontend, enabling speaker identification and segmentation directly in the browser without requiring the backend ASR API to be running.

What was implemented

  • Added client-side speaker diarization using the onnx-community/pyannote-segmentation-3.0 model.
  • Implemented a dedicated Web Worker to perform diarization inference without blocking the UI thread.
  • Added lazy model initialization to load the diarization model only when required.
  • Integrated diarization with the existing transcription workflow.
  • Mapped diarization segments to transcription chunks to assign speaker labels (SPEAKER_00, SPEAKER_01, etc.).
  • Added speaker-segment rendering in the UI.
  • Preserved existing transcription functionality when diarization is disabled.
  • Added a fallback to heuristic speaker assignment when diarization results are unavailable.
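The lazy model initialization mentioned above could look roughly like the sketch below. It is not taken from this MR: `lazy`, `DiarizationModel`, and `loadDiarizationModel` are hypothetical names, and the real loader inside the Web Worker is assumed to fetch onnx-community/pyannote-segmentation-3.0. The idea is to cache the pending promise so that concurrent callers share one load, and to clear the cache on failure so a later call can retry.

```typescript
// Wraps an async loader so it runs at most once per successful load.
// The first call starts the load; later calls reuse the same promise.
function lazy<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | null = null;
  return () => {
    if (pending === null) {
      pending = load().catch((err) => {
        pending = null; // forget the failed attempt so the next call retries
        throw err;
      });
    }
    return pending;
  };
}

// Hypothetical stand-in for the real model handle.
interface DiarizationModel {
  modelId: string;
}

// Hypothetical loader: the real one is assumed to download and initialize
// the ONNX model inside the worker. Here it only returns a stub.
async function loadDiarizationModel(): Promise<DiarizationModel> {
  return { modelId: "onnx-community/pyannote-segmentation-3.0" };
}

// Nothing is loaded until getDiarizationModel() is first called.
const getDiarizationModel = lazy(loadDiarizationModel);
```

With this shape, the worker would call `getDiarizationModel()` only when a diarization request arrives, so users who never enable diarization never pay for the model download.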

Diarization Flow

  • User uploads or selects an audio file.
  • Frontend performs diarization inference locally using the Pyannote ONNX model.
  • Speaker segments are generated and stored for reuse across transcription chunks.
  • During chunk-wise transcription, the corresponding speaker is identified based on segment overlap.
  • Transcribed text is tagged with speaker labels and displayed in the editor and speaker panel.
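The overlap-based speaker lookup in the flow above can be sketched as follows. This is an illustration under assumptions, not the MR's code: the `SpeakerSegment` and `TranscriptChunk` shapes and the `assignSpeaker` name are hypothetical, times are assumed to be in seconds, and the rule "pick the speaker with the largest total overlap" is one reasonable reading of "based on segment overlap".

```typescript
// Hypothetical shapes; times are assumed to be in seconds.
interface SpeakerSegment {
  start: number;
  end: number;
  speaker: string; // e.g. "SPEAKER_00"
}

interface TranscriptChunk {
  start: number;
  end: number;
  text: string;
}

// Length of the intersection of two intervals, or 0 if they do not intersect.
function overlapSeconds(aStart: number, aEnd: number, bStart: number, bEnd: number): number {
  return Math.max(0, Math.min(aEnd, bEnd) - Math.max(aStart, bStart));
}

// Returns the speaker whose segments cover the most of the chunk,
// or null when no segment overlaps the chunk at all.
function assignSpeaker(chunk: TranscriptChunk, segments: SpeakerSegment[]): string | null {
  const totals: Record<string, number> = {};
  for (const seg of segments) {
    const shared = overlapSeconds(chunk.start, chunk.end, seg.start, seg.end);
    if (shared > 0) {
      totals[seg.speaker] = (totals[seg.speaker] ?? 0) + shared;
    }
  }
  let best: string | null = null;
  let bestTime = 0;
  for (const speaker of Object.keys(totals)) {
    if (totals[speaker] > bestTime) {
      best = speaker;
      bestTime = totals[speaker];
    }
  }
  return best;
}
```

Summing overlap per speaker, rather than taking the single longest segment, keeps the result stable when one speaker's turn is split into several short segments inside the same chunk. A `null` result is the natural place to hand over to the fallback heuristic.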

Benefits

  • No backend dependency for speaker diarization.
  • Faster user feedback by performing inference locally.
  • Reduced backend resource utilization.
  • Works in offline or backend-unavailable scenarios.
  • Improves transcription readability through speaker-attributed transcripts.

Fallback Behavior

If the diarization model fails to load or inference fails, the application falls back to the existing heuristic-based speaker assignment. Standard transcription remains unaffected when diarization is disabled.
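One way to express this fallback is a small wrapper that prefers the diarization result and otherwise defers to the heuristic. The sketch below is an assumption about structure only: `labelChunk`, `TimedChunk`, and both callback types are hypothetical, and the heuristic itself is not described in this MR, so it is passed in as an opaque function.

```typescript
// Hypothetical chunk shape; times are assumed to be in seconds.
interface TimedChunk {
  start: number;
  end: number;
  text: string;
}

// Returns a speaker label from diarization, or null if none applies.
type DiarizedLabeler = (chunk: TimedChunk) => string | null;

// Stand-in for the existing heuristic-based assignment (details unknown here).
type HeuristicLabeler = (chunk: TimedChunk) => string;

// Prefers the diarization-based label. Falls back to the heuristic when
// diarization is unavailable (null), throws, or yields no label.
function labelChunk(
  chunk: TimedChunk,
  diarized: DiarizedLabeler | null,
  heuristic: HeuristicLabeler
): string {
  if (diarized !== null) {
    try {
      const label = diarized(chunk);
      if (label !== null) {
        return label;
      }
    } catch {
      // Diarization failed for this chunk; fall through to the heuristic.
    }
  }
  return heuristic(chunk);
}
```

Passing `null` for `diarized` covers both the "model failed to load" case and the "diarization disabled" case, so the transcription path does not need to know which one occurred.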

Testing

Verified:

  • Diarization model loading and initialization.
  • Speaker segment generation for uploaded audio files.
  • Speaker label assignment during transcription.
  • UI rendering of diarized speaker segments.
  • Fallback behavior when diarization is unavailable.
  • Transcription flow without backend API connectivity.

Closes #51

Edited by vyshnavi
