Phase 2 - Implement Real-Time Streaming via WebSocket

Description: Add Phase 2 real-time streaming transcription through a /stream WebSocket endpoint. The client continuously sends binary audio frames captured from the microphone as 16 kHz, mono, 16-bit PCM. The server must process incoming audio in chunks using the existing ASR pipeline and return incremental transcript responses in JSON format.
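The chunking step could be sketched as follows. At 16 kHz, mono, 16-bit, the stream carries 16,000 x 2 = 32,000 bytes per second. The 500 ms chunk duration and the PcmChunker name are assumptions for illustration; this issue does not specify a chunk size.

```python
from __future__ import annotations

SAMPLE_RATE_HZ = 16_000   # from the issue: 16 kHz
BYTES_PER_SAMPLE = 2      # from the issue: 16-bit PCM
CHANNELS = 1              # from the issue: mono
CHUNK_MS = 500            # ASSUMPTION: chunk duration is not specified in the issue

BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS  # 32_000
CHUNK_BYTES = BYTES_PER_SECOND * CHUNK_MS // 1000                # 16_000


class PcmChunker:
    """Accumulates arbitrarily sized binary frames and returns fixed-size chunks."""

    def __init__(self, chunk_bytes: int = CHUNK_BYTES) -> None:
        if chunk_bytes <= 0 or chunk_bytes % BYTES_PER_SAMPLE:
            raise ValueError("chunk_bytes must be a positive multiple of the sample size")
        self._chunk_bytes = chunk_bytes
        self._buffer = bytearray()

    def feed(self, frame: bytes) -> list[bytes]:
        """Append one frame and return every complete chunk now available."""
        self._buffer.extend(frame)
        chunks = []
        while len(self._buffer) >= self._chunk_bytes:
            chunks.append(bytes(self._buffer[: self._chunk_bytes]))
            del self._buffer[: self._chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return leftover audio (for example at end of speech) and reset the buffer."""
        tail = bytes(self._buffer)
        self._buffer.clear()
        return tail
```

Each chunk returned by feed() would be handed to the existing ASR pipeline; flush() covers the tail of an utterance that is shorter than one chunk.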

An optional initial client message can specify the language as JSON, for example {"language": "te"}. If no language is provided, the system should use the existing DEFAULT_LANGUAGE configuration until automatic language detection is introduced in a later phase.
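One possible way to resolve the language from the first client message is sketched below. The resolve_language helper is hypothetical, and the "en" value is only a placeholder for whatever the existing DEFAULT_LANGUAGE configuration holds.

```python
import json

DEFAULT_LANGUAGE = "en"  # PLACEHOLDER: the real value comes from existing configuration


def resolve_language(first_message, default=DEFAULT_LANGUAGE):
    """Return the language from an optional initial message, else the default.

    Binary frames, malformed JSON, and messages without a usable
    "language" string all fall back to the default.
    """
    if first_message is None or isinstance(first_message, (bytes, bytearray)):
        return default  # no message yet, or the first frame is already audio
    try:
        payload = json.loads(first_message)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return default
    if not isinstance(payload, dict):
        return default
    language = payload.get("language")
    if isinstance(language, str) and language.strip():
        return language.strip()
    return default
```

Falling back silently keeps the stream usable when the client sends a malformed selection message; rejecting the connection instead would be an equally valid design choice that this issue leaves open.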

To keep latency low, punctuation should be applied only to final transcript segments; partial results remain unpunctuated.
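A minimal sketch of this rule, assuming the punctuation step is available as a callable. Both build_transcript_message and naive_punctuate are hypothetical names; naive_punctuate is only a stand-in for the real punctuation model.

```python
import json
from typing import Callable


def build_transcript_message(
    text: str,
    is_final: bool,
    language: str,
    punctuate: Callable[[str], str],
) -> str:
    """Serialize one transcript message, punctuating only final segments."""
    if is_final:
        text = punctuate(text)  # the slower step runs once per final segment
    # ensure_ascii=False keeps non-Latin scripts readable on the wire
    return json.dumps(
        {"text": text, "is_final": is_final, "language": language},
        ensure_ascii=False,
    )


def naive_punctuate(text: str) -> str:
    """STAND-IN ONLY: capitalize the first letter and append a full stop."""
    text = text.strip()
    if not text:
        return text
    return text[0].upper() + text[1:] + "."
```

Partial messages skip the punctuate call entirely, so its cost never counts against partial-result latency.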

The system must also support graceful reconnection handling during temporary network interruptions. Minimal session state should be preserved so the client can reconnect and resume streaming. Disconnected or expired sessions must be cleaned up properly.
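The reconnect and cleanup behaviour could be sketched with an in-memory store like the one below. The 30-second TTL, the session fields, and the SessionStore and StreamSession names are all assumptions; the issue only asks for minimal state and proper cleanup.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, Optional

SESSION_TTL_SECONDS = 30.0  # ASSUMPTION: the issue does not define a timeout


@dataclass
class StreamSession:
    language: str
    committed_text: str = ""                 # final segments emitted so far
    disconnected_at: Optional[float] = None  # None while the client is connected


class SessionStore:
    """Keeps minimal per-session state so a client can reconnect and resume."""

    def __init__(
        self,
        ttl_seconds: float = SESSION_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._sessions: Dict[str, StreamSession] = {}

    def create(self, language: str) -> str:
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = StreamSession(language=language)
        return session_id

    def mark_disconnected(self, session_id: str) -> None:
        session = self._sessions.get(session_id)
        if session is not None:
            session.disconnected_at = self._clock()

    def resume(self, session_id: str) -> Optional[StreamSession]:
        """Return the session if it is still alive, else None."""
        self.purge_expired()
        session = self._sessions.get(session_id)
        if session is not None:
            session.disconnected_at = None
        return session

    def purge_expired(self) -> int:
        """Drop sessions disconnected for longer than the TTL; return the count."""
        now = self._clock()
        expired = [
            sid for sid, s in self._sessions.items()
            if s.disconnected_at is not None and now - s.disconnected_at > self._ttl
        ]
        for sid in expired:
            del self._sessions[sid]
        return len(expired)
```

The injectable clock makes expiry testable without sleeping. A multi-process deployment would need shared storage instead of a dict, which is outside what this sketch covers.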

Expected Response Format:

{
  "text": "sample transcript",
  "is_final": false,
  "language": "en"
}
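For integration tests, the format above can be checked with a small validator. This is a sketch; validate_transcript_message is a hypothetical helper, and it assumes all three fields are required in every message.

```python
import json

# ASSUMPTION: every message carries exactly these fields with these types
REQUIRED_FIELDS = {"text": str, "is_final": bool, "language": str}


def validate_transcript_message(raw: str) -> dict:
    """Parse a transcript message and raise ValueError if it breaks the format."""
    message = json.loads(raw)
    if not isinstance(message, dict):
        raise ValueError("transcript message must be a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in message:
            raise ValueError(f"missing field: {name}")
        if not isinstance(message[name], expected_type):
            raise ValueError(f"field {name!r} must be of type {expected_type.__name__}")
    return message
```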

Tasks:

Create /stream WebSocket endpoint
Accept binary PCM audio frames from client
Process audio in chunks for real-time inference
Emit partial and final transcript messages
Support optional language selection message
Apply punctuation only on final transcripts
Implement reconnect and resume handling
Cleanup inactive sessions after disconnect
Add integration and performance tests
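The first five tasks can be tied together in a receive loop like the sketch below. It is written against any object exposing async receive() and send_text(), which matches the Starlette WebSocket class that FastAPI uses. The handle_stream name and the transcribe_chunk callable (returning a text and is_final pair) are hypothetical stand-ins for the existing ASR pipeline, and "en" is a placeholder for DEFAULT_LANGUAGE. Chunk buffering, punctuation, and session resume are left out to keep the sketch short.

```python
import asyncio
import json

DEFAULT_LANGUAGE = "en"  # PLACEHOLDER: the real value comes from existing configuration


async def handle_stream(ws, transcribe_chunk, default_language=DEFAULT_LANGUAGE):
    """Read frames until disconnect and emit one JSON message per audio frame.

    ws: object with async receive() returning ASGI-style message dicts
        and async send_text(str).
    transcribe_chunk: HYPOTHETICAL callable (audio_bytes, language) -> (text, is_final).
    """
    language = default_language
    while True:
        message = await ws.receive()
        if message.get("type") == "websocket.disconnect":
            break
        text_frame = message.get("text")
        if text_frame is not None:
            # Text frames are treated as control messages, e.g. {"language": "te"}
            try:
                payload = json.loads(text_frame)
            except ValueError:
                continue  # ignore malformed control messages
            if isinstance(payload, dict) and isinstance(payload.get("language"), str):
                language = payload["language"]
            continue
        audio = message.get("bytes")
        if audio:
            text, is_final = transcribe_chunk(audio, language)
            await ws.send_text(json.dumps(
                {"text": text, "is_final": is_final, "language": language},
                ensure_ascii=False,
            ))
```

In a FastAPI route, the handler would call await websocket.accept() before passing the connection to a loop like this one. If transcribe_chunk is CPU-bound, running it off the event loop would likely be needed to meet the latency targets.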

Acceptance Criteria:

WebSocket /stream accepts audio frames successfully
Server emits valid transcript JSON messages
Partial transcript latency is under 500 ms
Final transcript latency is under 2 seconds after speech ends
Punctuation only appears on final segments
Language override works correctly
Reconnection resumes session without major interruption
Session state is cleared after disconnect/timeout
Integration tests pass with sample audio fixtures
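The two latency criteria could be asserted in a performance test along the lines below. This is a sketch: the over_budget helper is hypothetical, and how the latencies are measured (for instance with time.perf_counter() around send and receive) is left to the test harness.

```python
PARTIAL_BUDGET_S = 0.5  # from the criteria: partial latency under 500 ms
FINAL_BUDGET_S = 2.0    # from the criteria: final latency under 2 seconds


def over_budget(latencies_s, budget_s):
    """Return the latencies that fail a strict "under budget" check."""
    return [latency for latency in latencies_s if latency >= budget_s]


def assert_latency_budgets(partial_latencies_s, final_latencies_s):
    """Raise AssertionError listing any measurement that exceeds its budget."""
    slow_partials = over_budget(partial_latencies_s, PARTIAL_BUDGET_S)
    slow_finals = over_budget(final_latencies_s, FINAL_BUDGET_S)
    assert not slow_partials, f"partial results over budget: {slow_partials}"
    assert not slow_finals, f"final results over budget: {slow_finals}"
```

Whether the budget applies to every measurement or to a percentile is not stated in the criteria; this sketch takes the strictest reading and checks every sample.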

Labels: backend websocket streaming fastapi speech-to-text real-time