Phase 2: Real-time streaming
Overview
Implemented real-time streaming transcription by adding a new WebSocket endpoint, /stream, with chunk-based inference and resumable session handling. Clients stream 16 kHz mono 16-bit PCM audio frames and receive incremental transcript responses: low-latency partial updates followed by finalized segments.
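For reference, a minimal client-side sketch of the audio format described above. The sample rate, bit depth, and channel count come from this PR; the little-endian byte order, the helper names, and the 20 ms frame duration are assumptions made for illustration only.

```python
import struct

SAMPLE_RATE_HZ = 16_000   # 16 kHz, as stated in the overview
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # mono


def frame_size_bytes(frame_ms: int) -> int:
    """Number of bytes in one audio frame of the given duration."""
    samples = SAMPLE_RATE_HZ * frame_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def floats_to_pcm16(samples) -> bytes:
    """Convert floats in [-1.0, 1.0] to signed 16-bit PCM bytes.

    Little-endian byte order is an assumption; the PR does not state it.
    """
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


# A 20 ms frame (hypothetical duration) at 16 kHz mono 16-bit is 640 bytes.
print(frame_size_bytes(20))
```

One second of audio in this format is 32,000 bytes, which gives a rough sense of the bandwidth a streaming client needs.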
What Was Added
WebSocket Streaming Endpoint
- Added WS /stream endpoint for real-time transcription.
- Accepts binary PCM audio frames from clients.
- Supports initial JSON configuration messages such as:
{"language":"te"}
- Defaults to the existing DEFAULT_LANGUAGE config when language is not provided.
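The language fallback described above could look roughly like the following. This is a sketch, not the code in this PR: resolve_language is a hypothetical helper name, and the "en" value for DEFAULT_LANGUAGE is a placeholder, since the actual default comes from server config.

```python
import json

DEFAULT_LANGUAGE = "en"  # placeholder; the real value comes from server config


def resolve_language(message, default=DEFAULT_LANGUAGE):
    """Return the language from a JSON config message, or the default.

    Binary frames are treated as audio, never as configuration.
    """
    if isinstance(message, (bytes, bytearray)):
        return default
    try:
        payload = json.loads(message)
    except (TypeError, ValueError):
        # Not valid JSON (json.JSONDecodeError is a ValueError subclass).
        return default
    if not isinstance(payload, dict):
        return default
    language = payload.get("language")
    if isinstance(language, str) and language:
        return language
    return default


print(resolve_language('{"language":"te"}'))
print(resolve_language("{}"))
```

Falling back silently on malformed input is one possible design; a stricter server might instead reject the message with an error frame.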
Streaming Inference Handler
- Added chunked audio processing pipeline for incremental ASR inference.
- Emits partial transcript responses during speech.
- Emits final transcript after end-of-speech detection.
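To illustrate the chunking step, here is a minimal sketch of a buffer that regroups arbitrarily sized incoming frames into fixed-size chunks for inference. ChunkBuffer and its methods are hypothetical names, and the fixed-size strategy is an assumption; the actual pipeline may split audio differently (for example, on voice-activity boundaries).

```python
from typing import List


class ChunkBuffer:
    """Accumulates PCM bytes and yields fixed-size chunks for inference."""

    def __init__(self, chunk_bytes: int):
        # 16-bit samples are 2 bytes each, so chunks must not split a sample.
        if chunk_bytes <= 0 or chunk_bytes % 2:
            raise ValueError("chunk_bytes must be a positive even number")
        self._chunk_bytes = chunk_bytes
        self._pending = bytearray()

    def feed(self, frame: bytes) -> List[bytes]:
        """Add a frame and return every complete chunk now available."""
        self._pending.extend(frame)
        chunks = []
        while len(self._pending) >= self._chunk_bytes:
            chunks.append(bytes(self._pending[: self._chunk_bytes]))
            del self._pending[: self._chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return the leftover bytes (e.g. at end of speech) and reset."""
        rest = bytes(self._pending)
        self._pending.clear()
        return rest
```

Each chunk returned by feed would be passed to the model to produce a partial transcript, while flush would supply the remaining audio once end-of-speech is detected.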
Response Format
Server now returns streaming JSON responses in the format:
{
  "text": "...",
  "is_final": true,
  "language": "te"
}
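A small sketch of how a response frame in this format might be serialized. make_response is a hypothetical helper; only the three field names are taken from this PR.

```python
import json


def make_response(text: str, is_final: bool, language: str) -> str:
    """Serialize one transcript update as a JSON text frame."""
    payload = {
        "text": text,
        "is_final": bool(is_final),
        "language": language,
    }
    # ensure_ascii=False keeps non-Latin scripts (e.g. Telugu) readable.
    return json.dumps(payload, ensure_ascii=False)


print(make_response("నమస్కారం", False, "te"))
```

Clients can use is_final to decide whether to overwrite the current partial text or commit the segment.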
Latency Optimizations
- Partial transcripts targeted within 500ms of chunk submission.
- Final transcript targeted within 2 seconds after speech ends.
- Punctuation applied only to final segments to reduce partial-result latency.
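The punctuation rule above amounts to a gate on is_final. The sketch below shows the idea; render_segment is a hypothetical name, and naive_punctuate is a toy stand-in for whatever punctuation step the server actually uses.

```python
from typing import Callable


def render_segment(text: str, is_final: bool,
                   punctuate: Callable[[str], str]) -> str:
    """Run the punctuation step only for finalized segments."""
    return punctuate(text) if is_final else text


def naive_punctuate(text: str) -> str:
    """Toy punctuator: capitalize the first letter, add a period."""
    text = text.strip()
    if not text:
        return text
    text = text[0].upper() + text[1:]
    return text if text[-1] in ".?!" else text + "."


print(render_segment("hello world", False, naive_punctuate))
print(render_segment("hello world", True, naive_punctuate))
```

Skipping the punctuation step for partials keeps that work off the path covered by the 500ms target.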
Session / Reconnection Handling
- Added minimal in-memory session state.
- Supports graceful reconnection after network drops.
- Clients can reconnect and continue an existing streaming session.
- Cleanup logic added for disconnected/expired sessions.
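The session handling above can be pictured with a minimal in-memory store. This is a sketch under assumptions: SessionStore, its method names, the session-ID scheme, and the 30-second expiry are all hypothetical, and the PR does not specify how sessions are keyed or when they expire.

```python
import time
import uuid


class SessionStore:
    """In-memory session state with expiry for dropped connections."""

    def __init__(self, ttl_seconds: float = 30.0, clock=time.monotonic):
        self._ttl = ttl_seconds      # hypothetical grace period
        self._clock = clock          # injectable for testing
        self._sessions = {}

    def create(self, language: str) -> str:
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = {
            "language": language,
            "disconnected_at": None,
        }
        return session_id

    def mark_disconnected(self, session_id: str) -> None:
        session = self._sessions.get(session_id)
        if session is not None:
            session["disconnected_at"] = self._clock()

    def resume(self, session_id: str):
        """Return the session if it is still resumable, else None."""
        session = self._sessions.get(session_id)
        if session is None:
            return None
        dropped = session["disconnected_at"]
        if dropped is not None and self._clock() - dropped > self._ttl:
            del self._sessions[session_id]
            return None
        session["disconnected_at"] = None
        return session

    def cleanup(self) -> int:
        """Remove expired disconnected sessions; return how many."""
        now = self._clock()
        expired = [
            sid for sid, s in self._sessions.items()
            if s["disconnected_at"] is not None
            and now - s["disconnected_at"] > self._ttl
        ]
        for sid in expired:
            del self._sessions[sid]
        return len(expired)
```

Injecting the clock makes the expiry path easy to exercise in tests without sleeping, which fits the connection drop and cleanup tests listed in this PR.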
Tests Added
- WebSocket integration tests for round-trip transcription using audio fixtures.
- Binary PCM frame handling tests.
- Language message handling tests.
- Connection drop / cleanup behavior tests.
- Session resume flow validation.
Acceptance criteria
- WS /stream accepts binary PCM audio frames from client
- Server emits JSON {text: "...", is_final: bool, language: "..."} frames
- Partial results arrive within 500ms of chunk submission (per PRD latency budget)
- Final results arrive within 2s of speech end
- Punctuation applied on is_final: true segments only
- Language can be specified by client sending {language: "..."} message
- Integration tests verify WebSocket round-trip with audio fixtures
- Connection drop handling is tested (server cleans up session state)
Impact
Enables low-latency real-time speech-to-text streaming for browser and mobile clients, and lays the foundation for future auto-language detection and advanced dictation features.
Closes #9
Edited by srilatha bandari