Phase 2 - Implement Real-Time Streaming via WebSocket
Description:
Add Phase 2 real-time streaming transcription support through a /stream WebSocket endpoint. The client continuously sends binary audio frames captured from the microphone as 16 kHz, mono, 16-bit PCM. The server must process incoming audio in chunks using the existing ASR pipeline and return incremental transcripts as JSON messages.
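As a rough illustration of the chunking step, the sketch below buffers incoming binary frames of arbitrary size and releases fixed-duration chunks for inference. The class name PcmChunker and the 500 ms chunk length are assumptions made for this example; this issue does not specify a chunk size, and the hand-off to the existing ASR pipeline is not shown.

```python
from __future__ import annotations

SAMPLE_RATE_HZ = 16_000  # from the issue: 16 kHz
BYTES_PER_SAMPLE = 2     # from the issue: 16-bit mono PCM
CHUNK_MS = 500           # assumption: chunk duration is not specified in the issue


class PcmChunker:
    """Hypothetical helper: accumulates frames and yields fixed-size chunks."""

    def __init__(self, chunk_ms: int = CHUNK_MS) -> None:
        # 16 kHz * 2 bytes = 32,000 bytes per second of audio.
        self.chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
        self._buffer = bytearray()

    def feed(self, frame: bytes) -> list[bytes]:
        """Add one frame and return every complete chunk now available."""
        self._buffer.extend(frame)
        chunks = []
        while len(self._buffer) >= self.chunk_bytes:
            chunks.append(bytes(self._buffer[: self.chunk_bytes]))
            del self._buffer[: self.chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return the remaining audio (for example when speech ends) and reset."""
        rest = bytes(self._buffer)
        self._buffer.clear()
        return rest
```

Each chunk returned by feed() would be passed to the ASR pipeline to produce a partial transcript; flush() covers the tail of an utterance that is shorter than one chunk.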
An optional initial client message can specify the language as a JSON text message, for example {"language": "te"}. If no language is provided, the system should use the existing DEFAULT_LANGUAGE configuration until automatic language detection is introduced in a later phase.
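One possible way to handle the optional first message is sketched below. It assumes the language message arrives as a text frame while audio arrives as binary frames; the function name resolve_language and the placeholder value of DEFAULT_LANGUAGE are illustrative only, since the real default comes from the existing configuration.

```python
import json

DEFAULT_LANGUAGE = "en"  # placeholder; the real value lives in the existing config


def resolve_language(first_message, default=DEFAULT_LANGUAGE):
    """Return (language, is_config) for the first message on a connection.

    is_config is True only when the message was a valid language selection,
    so the caller knows whether to treat the message as audio instead.
    """
    if isinstance(first_message, (bytes, bytearray)):
        # Binary frame: the client skipped the language message.
        return default, False
    try:
        payload = json.loads(first_message)
    except (TypeError, ValueError):
        # Not parseable as JSON: fall back to the default language.
        return default, False
    if isinstance(payload, dict):
        language = payload.get("language")
        if isinstance(language, str) and language:
            return language, True
    return default, False
```

For example, resolve_language('{"language": "te"}') yields ("te", True), while a binary audio frame yields the default language and False.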
To improve responsiveness, punctuation should only be applied to final transcript segments, while partial results remain unpunctuated for lower latency.
The system must also support graceful reconnection handling during temporary network interruptions. Minimal session state should be preserved so the client can reconnect and resume streaming. Disconnected or expired sessions must be cleaned up properly.
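The reconnect and cleanup behaviour could be backed by a small in-memory store like the sketch below. Everything here is an assumption for illustration: the names Session and SessionStore, the use of a UUID as the session identifier, the fields kept per session, and the 30-second expiry window are not specified in this issue.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Session:
    language: str
    last_final_text: str = ""                 # minimal state kept across reconnects
    disconnected_at: Optional[float] = None   # None while the client is connected


class SessionStore:
    """Hypothetical in-memory store; a single-process sketch, not thread-safe."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._sessions: Dict[str, Session] = {}

    def create(self, language: str) -> str:
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = Session(language=language)
        return session_id

    def mark_disconnected(self, session_id: str) -> None:
        session = self._sessions.get(session_id)
        if session is not None:
            session.disconnected_at = self._clock()

    def _expired(self, session: Session) -> bool:
        return (session.disconnected_at is not None
                and self._clock() - session.disconnected_at > self._ttl)

    def resume(self, session_id: str) -> Optional[Session]:
        """Return the session if it can still be resumed, else None."""
        session = self._sessions.get(session_id)
        if session is None:
            return None
        if self._expired(session):
            del self._sessions[session_id]
            return None
        session.disconnected_at = None
        return session

    def cleanup(self) -> int:
        """Drop sessions whose disconnect exceeded the TTL; return the count."""
        stale = [sid for sid, s in self._sessions.items() if self._expired(s)]
        for sid in stale:
            del self._sessions[sid]
        return len(stale)
```

In this sketch the server would call mark_disconnected() when the socket closes, resume() when a client reconnects with its session identifier, and cleanup() periodically to drop expired sessions.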
Expected Response Format:
{
"text": "sample transcript",
"is_final": false,
"language": "en"
}
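A minimal sketch of how a message in this format might be built, with punctuation applied only when is_final is true. The naive_punctuate function is a stand-in written for this example, not the project's punctuation model, and build_message is a hypothetical name.

```python
import json
from typing import Callable


def naive_punctuate(text: str) -> str:
    """Stand-in for the real punctuation step (hypothetical)."""
    text = text.strip()
    if not text:
        return text
    text = text[0].upper() + text[1:]
    return text if text[-1] in ".?!" else text + "."


def build_message(text: str, is_final: bool, language: str,
                  punctuate: Callable[[str], str] = naive_punctuate) -> str:
    """Serialize one transcript message in the expected response format."""
    if is_final:
        # Partials skip this step to keep latency low.
        text = punctuate(text)
    payload = {"text": text, "is_final": is_final, "language": language}
    # ensure_ascii=False keeps non-Latin transcripts (e.g. Telugu) readable.
    return json.dumps(payload, ensure_ascii=False)
```

Calling build_message("sample transcript", False, "en") leaves the partial text untouched, while the same call with is_final set to True passes the text through the punctuation step first.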
Tasks:
- Create /stream WebSocket endpoint
- Accept binary PCM audio frames from client
- Process audio in chunks for real-time inference
- Emit partial and final transcript messages
- Support optional language selection message
- Apply punctuation only on final transcripts
- Implement reconnect and resume handling
- Cleanup inactive sessions after disconnect
- Add integration and performance tests
Acceptance Criteria:
- WebSocket /stream accepts audio frames successfully
- Server emits valid transcript JSON messages
- Partial transcript latency is under 500ms
- Final transcript latency is under 2 seconds after speech ends
- Punctuation only appears on final segments
- Language override works correctly
- Reconnection resumes session without major interruption
- Session state is cleared after disconnect/timeout
- Integration tests pass with sample audio fixtures
Labels:
backend, websocket, streaming, fastapi, speech-to-text, real-time