Add Speaker Diarization API Support with Push-to-Talk Integration

srilatha bandari requested to merge feat/readme into develop

Summary

This merge request adds backend API support for speaker diarization, so that multi-speaker audio can be transcribed into structured, speaker-labeled output. The implementation extends the existing ASR pipeline and keeps the request and response conventions consistent with the current APIs.

Push-to-talk (hold-to-talk) input is also integrated, so that audio is captured only while the user actively holds the control. This makes input more reliable and reduces unintended recordings.


Key Changes

  • Added dedicated API endpoint(s) for speaker diarization
  • Extended ASR pipeline to support multi-speaker processing
  • Implemented speaker-labeled and timestamped transcription output
  • Defined consistent request/response schema aligned with existing APIs
  • Integrated push-to-talk (hold-to-talk) input handling
  • Added input validation and standardized error handling
  • Updated API documentation for frontend integration
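
As an illustration of the input validation and standardized error handling listed above, here is a minimal sketch. The allowed content types, the size limit, the error codes, and the `{"error": {...}}` envelope are all assumptions made for this example; the actual rules and schema are defined by the implementation in this merge request.

```python
from typing import Any, Dict, Optional

# Assumed limits for illustration only; not taken from the actual API.
ALLOWED_TYPES = {"audio/wav", "audio/mpeg", "audio/webm"}
MAX_BYTES = 25 * 1024 * 1024


def error_body(code: str, message: str) -> Dict[str, Any]:
    """Build an error payload in one shared (hypothetical) shape."""
    return {"error": {"code": code, "message": message}}


def validate_audio(content_type: str, payload: bytes) -> Optional[Dict[str, Any]]:
    """Return an error payload if the upload is invalid, else None."""
    if not payload:
        return error_body("empty_audio", "No audio data was received.")
    if content_type not in ALLOWED_TYPES:
        return error_body(
            "unsupported_media_type",
            f"Unsupported content type: {content_type}",
        )
    if len(payload) > MAX_BYTES:
        return error_body("payload_too_large", "Audio exceeds the size limit.")
    return None
```

Returning every failure in the same shape lets the frontend handle errors with a single code path.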

API Contract

Request

  • Audio input (file or stream)
  • Supports push-to-talk captured audio

Response

  • Speaker-labeled transcription
  • Timestamped segments per speaker
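
To make the response shape concrete, here is a minimal sketch of a speaker-labeled, timestamped payload. The field names (`speaker`, `start`, `end`, `text`, `segments`) and the label format are assumptions for illustration; the authoritative schema is the one in the updated API documentation.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class SpeakerSegment:
    speaker: str  # hypothetical label, e.g. "SPEAKER_00"
    start: float  # segment start, in seconds from the start of the audio
    end: float    # segment end, in seconds
    text: str     # transcript for this segment


@dataclass
class DiarizationResponse:
    segments: List[SpeakerSegment]

    def to_json(self) -> str:
        # asdict() recurses into the nested segment dataclasses.
        return json.dumps(asdict(self))


response = DiarizationResponse(
    segments=[
        SpeakerSegment("SPEAKER_00", 0.0, 2.4, "Shall we start?"),
        SpeakerSegment("SPEAKER_01", 2.4, 4.1, "Yes, go ahead."),
    ]
)
payload = response.to_json()
```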

Implementation Details

  • Diarization is processed as part of the transcription pipeline
  • Output is structured to include speaker IDs with corresponding time segments
  • Push-to-talk ensures audio is captured only during active user interaction
  • Backward compatibility with existing transcription APIs is maintained
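
One common way to attach speaker IDs to transcribed segments is to give each ASR segment the speaker whose diarization turn overlaps it the most. The sketch below shows that technique under stated assumptions: the tuple layouts, the function names, and the `"UNKNOWN"` fallback are illustrative, and the pipeline in this merge request may combine the two outputs differently.

```python
from typing import Dict, List, Tuple, Union

Turn = Tuple[float, float, str]        # (start, end, speaker)
AsrSegment = Tuple[float, float, str]  # (start, end, text)


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(
    asr_segments: List[AsrSegment],
    turns: List[Turn],
    unknown: str = "UNKNOWN",
) -> List[Dict[str, Union[str, float]]]:
    """Assign each ASR segment the speaker with the largest time overlap."""
    labeled = []
    for start, end, text in asr_segments:
        best_speaker, best_overlap = unknown, 0.0
        for t_start, t_end, speaker in turns:
            shared = overlap(start, end, t_start, t_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append(
            {"speaker": best_speaker, "start": start, "end": end, "text": text}
        )
    return labeled
```

Segments that overlap no speaker turn keep the fallback label instead of being dropped, so the transcript text is never lost.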

Testing & Validation

  • Tested with multi-speaker audio samples
  • Verified speaker segmentation accuracy and timestamp alignment
  • Validated behavior for:
    • Single-speaker input
    • Low-quality/noisy audio
  • Tested push-to-talk flow for correct audio capture and processing
  • Confirmed stability under normal usage conditions
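
A timestamp sanity check of the kind mentioned above could look like the following sketch. It assumes segments are dictionaries with `start` and `end` values in seconds and that the audio duration is known; the function name and the specific checks are illustrative and are not the tests that were actually run for this merge request.

```python
from typing import Dict, List, Optional


def find_timestamp_problems(
    segments: List[Dict[str, float]], audio_duration: float
) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    previous_start: Optional[float] = None
    for i, seg in enumerate(segments):
        if seg["start"] < 0 or seg["end"] > audio_duration:
            problems.append(f"segment {i} is outside the audio bounds")
        if seg["start"] >= seg["end"]:
            problems.append(f"segment {i} has non-positive duration")
        if previous_start is not None and seg["start"] < previous_start:
            problems.append(f"segment {i} starts before segment {i - 1}")
        previous_start = seg["start"]
    return problems
```

Overlapping segments are deliberately not reported, since speakers can talk over one another.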


Impact

  • Enables multi-speaker transcription workflows
  • Improves usability for conversational scenarios (meetings, interviews, discussions)
  • Enhances input control through push-to-talk
  • Provides structured output for improved frontend rendering and UX

Checklist

  • API endpoints implemented
  • Diarization integrated with ASR pipeline
  • Push-to-talk functionality integrated
  • Validation and error handling implemented
  • Documentation updated
  • Tested across relevant scenarios

Closes: #20 (closed)

Edited by srilatha bandari
