Phase 2: Real-time streaming (!16) · Merge requests · VISWAM / apps / Speech / Voice App Backend

srilatha bandari requested to merge feat/phase2 into develop Apr 16, 2026

Overview

Implements real-time streaming transcription by adding a new WebSocket endpoint, /stream, with chunk-based inference and resumable session handling. Clients stream 16 kHz mono 16-bit PCM audio frames and receive incremental transcript responses: low-latency partial updates followed by finalized segments.

What Was Added

WebSocket Streaming Endpoint

Added WS /stream endpoint for real-time transcription.
Accepts binary PCM audio frames from clients.
Supports initial JSON configuration messages such as:

{"language":"te"}

Defaults to the existing DEFAULT_LANGUAGE config value when no language is provided.
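
The language fallback described above could be sketched roughly as follows. This is an illustration only, not the code in this MR: the function name resolve_language is hypothetical, and the default is passed in as a parameter because the actual DEFAULT_LANGUAGE value is not stated here.

```python
import json


def resolve_language(raw_message: str, default_language: str) -> str:
    """Return the language from a JSON config message, or the default.

    Hypothetical helper: falls back to default_language when the message is
    not valid JSON, is not a JSON object, or has no usable "language" field.
    """
    try:
        payload = json.loads(raw_message)
    except json.JSONDecodeError:
        return default_language
    if not isinstance(payload, dict):
        return default_language
    language = payload.get("language")
    if isinstance(language, str) and language.strip():
        return language.strip()
    return default_language
```

Falling back silently keeps a malformed config message from dropping the stream; a stricter server could instead reply with an error frame.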

Streaming Inference Handler

Added chunked audio processing pipeline for incremental ASR inference.
Emits partial transcript responses during speech.
Emits final transcript after end-of-speech detection.
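
As a rough sketch of the chunking step, incoming binary frames of arbitrary size can be buffered and re-cut into fixed-duration chunks before inference. The class name PcmChunker and the 200 ms chunk length are assumptions for illustration; the MR does not state the chunk size it uses. The byte arithmetic follows from the stated audio format (16 kHz, mono, 16-bit).

```python
from typing import List

SAMPLE_RATE_HZ = 16_000   # from the MR: 16 kHz mono
BYTES_PER_SAMPLE = 2      # 16-bit PCM


def chunk_size_bytes(chunk_ms: int) -> int:
    """Number of PCM bytes in chunk_ms milliseconds of audio."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000


class PcmChunker:
    """Accumulates incoming frames and yields fixed-size chunks (hypothetical)."""

    def __init__(self, chunk_ms: int = 200) -> None:  # 200 ms is an assumed value
        self._chunk_bytes = chunk_size_bytes(chunk_ms)
        self._buffer = bytearray()

    def feed(self, frame: bytes) -> List[bytes]:
        """Add a frame and return every complete chunk now available."""
        self._buffer.extend(frame)
        chunks = []
        while len(self._buffer) >= self._chunk_bytes:
            chunks.append(bytes(self._buffer[: self._chunk_bytes]))
            del self._buffer[: self._chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return and clear any leftover audio, e.g. at end of speech."""
        remainder = bytes(self._buffer)
        self._buffer.clear()
        return remainder
```

At 16 kHz mono 16-bit, one second of audio is 32,000 bytes, so a 200 ms chunk is 6,400 bytes.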

Response Format

Server now returns streaming JSON responses in the format:

{
  "text": "...",
  "is_final": true,
  "language": "te"
}
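
For illustration, the response frame above can be modeled as a small dataclass and serialized to JSON. The class name TranscriptFrame is hypothetical; only the three field names come from this MR.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TranscriptFrame:
    """One streaming response frame (field names taken from the MR)."""

    text: str
    is_final: bool
    language: str

    def to_json(self) -> str:
        # ensure_ascii=False keeps non-Latin scripts such as Telugu readable
        # on the wire instead of escaping them as \uXXXX sequences.
        return json.dumps(asdict(self), ensure_ascii=False)
```

A partial update would be sent with is_final set to False, and the closing segment with is_final set to True.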

Latency Optimizations

Partial transcripts are targeted within 500 ms of chunk submission.
Final transcripts are targeted within 2 seconds after speech ends.
Punctuation applied only to final segments to reduce partial-result latency.
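
The punctuation rule above amounts to skipping the punctuation step on the partial-result path. A minimal sketch, assuming a pluggable punctuate callable: add_full_stop below is a trivial stand-in, not the punctuation model used by this service, and build_frame is a hypothetical name.

```python
from typing import Callable, Dict, Union

Frame = Dict[str, Union[str, bool]]


def add_full_stop(text: str) -> str:
    """Stand-in punctuator (hypothetical): appends a full stop if missing."""
    stripped = text.rstrip()
    if not stripped or stripped.endswith((".", "?", "!")):
        return stripped
    return stripped + "."


def build_frame(
    text: str,
    is_final: bool,
    language: str,
    punctuate: Callable[[str], str] = add_full_stop,
) -> Frame:
    """Build a response frame, punctuating only finalized segments."""
    if is_final:
        # Punctuation runs once per segment, keeping the partial path cheap.
        text = punctuate(text)
    return {"text": text, "is_final": is_final, "language": language}
```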

Session / Reconnection Handling

Added minimal in-memory session state.
Supports graceful reconnection after network drops.
A client can reconnect and continue its streaming session.
Cleanup logic added for disconnected/expired sessions.
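
One way to picture the session handling above is a small in-memory store with a time-to-live for disconnected sessions. Everything named here is an assumption for illustration: SessionStore, StreamSession, the 30-second TTL, and the idea that the client presents a session ID on reconnect are not specified in this MR.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class StreamSession:
    language: str
    last_seen: float
    connected: bool = True
    pending_audio: bytearray = field(default_factory=bytearray)


class SessionStore:
    """In-memory session state with TTL-based cleanup (hypothetical sketch)."""

    def __init__(
        self,
        ttl_seconds: float = 30.0,  # assumed value
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._sessions: Dict[str, StreamSession] = {}

    def open(self, session_id: str, language: str) -> StreamSession:
        """Resume a live session, or start a new one if missing or expired."""
        now = self._clock()
        session = self._sessions.get(session_id)
        if session is None or self._expired(session, now):
            session = StreamSession(language=language, last_seen=now)
            self._sessions[session_id] = session
        else:
            session.connected = True
            session.last_seen = now
        return session

    def mark_disconnected(self, session_id: str) -> None:
        """Record a dropped connection; the session stays resumable until TTL."""
        session = self._sessions.get(session_id)
        if session is not None:
            session.connected = False
            session.last_seen = self._clock()

    def cleanup(self) -> int:
        """Drop disconnected sessions older than the TTL; return the count."""
        now = self._clock()
        stale = [sid for sid, s in self._sessions.items() if self._expired(s, now)]
        for sid in stale:
            del self._sessions[sid]
        return len(stale)

    def _expired(self, session: StreamSession, now: float) -> bool:
        return not session.connected and now - session.last_seen > self._ttl
```

Injecting the clock makes the expiry path testable without sleeping, which suits the connection-drop and cleanup tests.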

Tests Added

WebSocket integration tests for round-trip transcription using audio fixtures.
Binary PCM frame handling tests.
Language message handling tests.
Connection drop / cleanup behavior tests.
Session resume flow validation.

Acceptance criteria

WS /stream accepts binary PCM audio frames from client
Server emits JSON {text: "...", is_final: bool, language: "..."} frames
Partial results arrive within 500ms of chunk submission (per PRD latency budget)
Final results arrive within 2s of speech end
Punctuation applied on is_final: true segments only
Language can be specified by client sending {language: "..."} message
Integration tests verify WebSocket round-trip with audio fixtures
Connection drop handling is tested (server cleans up session state)

Impact

Enables low-latency, real-time speech-to-text streaming for browser and mobile clients, and lays the foundation for future automatic language detection and advanced dictation features.

Closes #9

Edited Apr 16, 2026 by srilatha bandari
