Phase 2: Real-time streaming

srilatha bandari requested to merge feat/phase2 into develop

Overview

Implements real-time streaming transcription by adding a new WebSocket endpoint, /stream, with chunk-based inference and resumable session handling. Clients stream 16 kHz mono 16-bit PCM audio frames and receive incremental transcript responses: low-latency partial updates during speech, followed by finalized segments.

What Was Added

WebSocket Streaming Endpoint

  • Added WS /stream endpoint for real-time transcription.
  • Accepts binary PCM audio frames from clients.
  • Supports initial JSON configuration messages such as:
{"language":"te"}
  • Falls back to the existing DEFAULT_LANGUAGE config value when no language is provided.
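
The text-versus-binary split described above can be sketched as follows. This is a minimal illustration, not the code in this MR: the `parse_config` helper, the placeholder value of `DEFAULT_LANGUAGE`, and the choice to fall back to the default on malformed JSON are all assumptions.

```python
import json
from typing import Optional, Union

# Assumption: placeholder value; the real default comes from the service config.
DEFAULT_LANGUAGE = "en"


def parse_config(message: Union[str, bytes],
                 default_language: str = DEFAULT_LANGUAGE) -> Optional[str]:
    """Return the language for a text config message, or None for an audio frame."""
    if isinstance(message, (bytes, bytearray)):
        return None  # binary frames carry PCM audio, not configuration
    try:
        payload = json.loads(message)
    except json.JSONDecodeError:
        return default_language  # assumption: malformed config falls back to the default
    if not isinstance(payload, dict):
        return default_language
    language = payload.get("language")
    if isinstance(language, str) and language:
        return language
    return default_language
```

With this sketch, `parse_config('{"language":"te"}')` yields `"te"`, an empty object yields the default, and any binary frame yields `None` so the caller can route it to the audio pipeline.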

Streaming Inference Handler

  • Added chunked audio processing pipeline for incremental ASR inference.
  • Emits partial transcript responses during speech.
  • Emits final transcript after end-of-speech detection.
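
As a rough sketch of the chunking step, the buffer below regroups arbitrarily sized incoming frames into fixed-size chunks for inference. The `PcmChunker` class and the 500 ms chunk length are hypothetical; the sample rate and sample width follow the 16 kHz mono 16-bit PCM format stated in the overview.

```python
from typing import List

SAMPLE_RATE_HZ = 16_000   # from the overview: 16 kHz mono
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHUNK_MS = 500            # assumption: the actual chunk length is not stated in this MR


class PcmChunker:
    """Accumulates PCM bytes and releases fixed-size chunks for inference."""

    def __init__(self, chunk_ms: int = CHUNK_MS) -> None:
        self.chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
        self._buffer = bytearray()

    def feed(self, frame: bytes) -> List[bytes]:
        """Append a frame and return every complete chunk now available."""
        self._buffer.extend(frame)
        chunks = []
        while len(self._buffer) >= self.chunk_bytes:
            chunks.append(bytes(self._buffer[:self.chunk_bytes]))
            del self._buffer[:self.chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return any leftover audio, for example at end of speech."""
        rest = bytes(self._buffer)
        self._buffer.clear()
        return rest
```

At 16 kHz and 2 bytes per sample, one second of audio is 32,000 bytes, so a 500 ms chunk is 16,000 bytes.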

Response Format

The server now returns streaming JSON responses in this format:

{
  "text": "...",
  "is_final": true,
  "language": "te"
}
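
One way to produce frames of this shape is a small dataclass serialized with the standard `json` module. The `TranscriptFrame` name is hypothetical; only the three field names come from the format above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TranscriptFrame:
    text: str
    is_final: bool
    language: str

    def to_json(self) -> str:
        # ensure_ascii=False keeps non-Latin transcripts (e.g. Telugu) readable on the wire
        return json.dumps(asdict(self), ensure_ascii=False)
```

A partial update would then be sent as `TranscriptFrame("...", False, "te").to_json()`, and the finalized segment with `is_final` set to `True`.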

Latency Optimizations

  • Partial transcripts targeted within 500ms of chunk submission.
  • Final transcript targeted within 2 seconds after speech ends.
  • Punctuation applied only to final segments to reduce partial-result latency.
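
The punctuation rule above amounts to gating a post-processing step on `is_final`. In the sketch below, `naive_punctuate` is a hypothetical stand-in for whatever punctuation step the service actually uses, and `render_segment` is an illustrative name.

```python
from typing import Callable


def naive_punctuate(text: str) -> str:
    """Hypothetical stand-in for a real punctuation step."""
    text = text.strip()
    if not text:
        return text
    text = text[0].upper() + text[1:]
    return text if text[-1] in ".?!" else text + "."


def render_segment(text: str, is_final: bool,
                   punctuate: Callable[[str], str] = naive_punctuate) -> str:
    """Skip punctuation on partials so they are emitted as quickly as possible."""
    return punctuate(text) if is_final else text
```

Partials pass through untouched, so the cost of punctuation is paid once per finalized segment rather than on every incremental update.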

Session / Reconnection Handling

  • Added minimal in-memory session state.
  • Supports graceful reconnection after network drops.
  • A client can reconnect and continue its existing streaming session.
  • Cleanup logic added for disconnected/expired sessions.
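
A minimal sketch of in-memory session state with resume and expiry is shown below. The `SessionStore` and `Session` names, the client-supplied session ID, and the 30-second TTL are assumptions for illustration; this MR does not state how sessions are keyed or when they expire.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List

SESSION_TTL_SECONDS = 30.0  # assumption: the real expiry window is not stated


@dataclass
class Session:
    language: str
    transcript: List[str] = field(default_factory=list)
    last_seen: float = 0.0


class SessionStore:
    def __init__(self, ttl: float = SESSION_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._sessions: Dict[str, Session] = {}

    def open(self, session_id: str, language: str) -> Session:
        """Resume a live session if one exists, otherwise start a new one."""
        self.purge_expired()
        session = self._sessions.get(session_id)
        if session is None:
            session = Session(language=language)
            self._sessions[session_id] = session
        session.last_seen = self._clock()
        return session

    def purge_expired(self) -> int:
        """Drop sessions idle for longer than the TTL; return how many were removed."""
        now = self._clock()
        expired = [sid for sid, s in self._sessions.items()
                   if now - s.last_seen > self._ttl]
        for sid in expired:
            del self._sessions[sid]
        return len(expired)
```

Injecting the clock keeps expiry testable without sleeping: a test can advance a fake clock past the TTL and check that reconnecting starts a fresh session.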

Tests Added

  • WebSocket integration tests for round-trip transcription using audio fixtures.
  • Binary PCM frame handling tests.
  • Language message handling tests.
  • Connection drop / cleanup behavior tests.
  • Session resume flow validation.
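
For the binary frame tests, a synthetic fixture can stand in for recorded audio when only framing matters. The helper below is a hypothetical sketch, not one of the fixtures in this MR: it generates a sine tone as 16 kHz mono 16-bit little-endian PCM and splits it into 20 ms frames (the frame length is an assumption).

```python
import math
import struct
from typing import List


def sine_pcm_frames(duration_s: float = 1.0, freq_hz: float = 440.0,
                    frame_ms: int = 20, sample_rate: int = 16_000) -> List[bytes]:
    """Generate a sine tone as 16-bit little-endian mono PCM, split into frames."""
    total = int(duration_s * sample_rate)
    samples = [
        int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * n / sample_rate))
        for n in range(total)
    ]
    pcm = struct.pack("<%dh" % total, *samples)
    frame_bytes = sample_rate * 2 * frame_ms // 1000  # 2 bytes per sample
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]
```

A tone will not produce a meaningful transcript, so this suits frame-handling and cleanup tests; round-trip transcription tests still need real speech fixtures.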

Acceptance criteria

  • WS /stream accepts binary PCM audio frames from client
  • Server emits JSON {text: "...", is_final: bool, language: "..."} frames
  • Partial results arrive within 500ms of chunk submission (per PRD latency budget)
  • Final results arrive within 2s of speech end
  • Punctuation applied on is_final: true segments only
  • Language can be specified by client sending {language: "..."} message
  • Integration tests verify WebSocket round-trip with audio fixtures
  • Connection drop handling is tested (server cleans up session state)

Impact

Enables low-latency, real-time speech-to-text streaming for browser and mobile clients, and lays the foundation for future automatic language detection and advanced dictation features.

closes #9 (closed)

Edited by srilatha bandari
