feat(asr): transition to stateful streaming ASR architecture
Description: The existing transcription pipeline is based on a polling-driven HTTP model, which introduces avoidable latency and limits the system’s ability to support real-time speech-to-text use cases. This issue proposes migrating to a stateful, streaming-based architecture using WebSockets to enable continuous, low-latency transcription.
Background: The current implementation processes audio in discrete intervals via HTTP requests. While functional, this approach:
- Delays transcription delivery
- Prevents real-time feedback (interim results)
- Reduces subtitle synchronization accuracy
- Increases overhead for continuous audio streams
Objectives:
- Enable real-time, low-latency transcription
- Support continuous audio streaming
- Provide interim (partial) and final transcription outputs
- Improve timestamp precision for subtitle generation
- Maintain backward compatibility with existing APIs
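To make the interim/final distinction concrete, the sketch below shows one possible shape for the messages a streaming session could emit. The names `Word`, `TranscriptEvent`, `session_id`, `start_ms`, and `end_ms` are illustrative assumptions, not the fields used by the actual implementation.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Word:
    # Hypothetical word-level timing, in milliseconds from stream start.
    text: str
    start_ms: int
    end_ms: int


@dataclass
class TranscriptEvent:
    # Hypothetical event emitted by a streaming session.
    session_id: str
    is_final: bool  # False: interim result that may still change
    text: str
    words: List[Word] = field(default_factory=list)

    def to_json(self) -> str:
        # Replace the boolean with a "type" tag so clients can switch on it.
        payload = asdict(self)
        payload["type"] = "final" if payload.pop("is_final") else "interim"
        return json.dumps(payload)
```

A client would typically overwrite the displayed subtitle on each interim event and commit it once the matching final event arrives.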
Proposed Approach:
- Introduce a WebSocket endpoint for streaming transcription
- Implement session-based audio processing
- Integrate Voice Activity Detection (VAD) for efficient segmentation
- Refactor ASR pipeline to support incremental decoding and streaming outputs
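The session and VAD steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the VAD is a simple energy threshold (the issue does not say which VAD is used), audio is 16-bit little-endian mono PCM, and `decode` is a hypothetical stand-in for the ASR decoder. A real incremental decoder would keep its own state rather than re-decode the whole buffer.

```python
import math
import struct
from typing import Callable, List, Optional


def rms(frame: bytes) -> float:
    """Root-mean-square energy of a 16-bit little-endian mono PCM frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


class StreamingSession:
    """Per-connection state: buffers speech frames, emits interim/final results."""

    def __init__(self, decode: Callable[[bytes], str],
                 threshold: float = 500.0, hangover_frames: int = 3):
        self.decode = decode                    # hypothetical ASR decoder stub
        self.threshold = threshold              # assumed energy threshold
        self.hangover_frames = hangover_frames  # silent frames that end an utterance
        self._speech: List[bytes] = []
        self._silence_run = 0

    def feed(self, frame: bytes) -> Optional[dict]:
        if rms(frame) >= self.threshold:
            # Speech frame: extend the utterance and emit an interim result.
            self._speech.append(frame)
            self._silence_run = 0
            return {"type": "interim", "text": self.decode(b"".join(self._speech))}
        if not self._speech:
            return None  # silence before any speech: nothing to report
        self._silence_run += 1
        if self._silence_run < self.hangover_frames:
            return None  # short pause: keep the utterance open
        # Enough trailing silence: finalize the utterance and reset state.
        text = self.decode(b"".join(self._speech))
        self._speech.clear()
        self._silence_run = 0
        return {"type": "final", "text": text}
```

In a WebSocket handler, each connection would own one `StreamingSession`, call `feed` for every incoming binary frame, and send any non-`None` result back to the client.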
Acceptance Criteria:
- WebSocket endpoint supports concurrent streaming sessions
- System produces both interim and final transcription results
- Word-level timestamps are accurate and consistent
- System performs efficiently under continuous audio input
- Existing HTTP-based /api/transcribe endpoint remains functional
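The timestamp criterion above could be made testable with a check along these lines. This is a sketch of one possible interpretation of "consistent" (each word has start no later than end, words do not overlap, and no word extends past the audio). The function name and the `(text, start_ms, end_ms)` tuple layout are assumptions for illustration.

```python
from typing import List, Tuple

Word = Tuple[str, int, int]  # (text, start_ms, end_ms), hypothetical layout


def timestamp_problems(words: List[Word], audio_ms: int) -> List[str]:
    """Return a list of human-readable consistency violations (empty if none)."""
    problems: List[str] = []
    prev_end = 0
    for text, start, end in words:
        if start > end:
            problems.append(f"{text!r}: start after end")
        if start < prev_end:
            problems.append(f"{text!r}: overlaps previous word")
        if end > audio_ms:
            problems.append(f"{text!r}: ends past audio length")
        prev_end = max(prev_end, end)
    return problems
```

Running this over the words of every final result in a test recording gives a pass/fail signal for consistency. Accuracy against a reference alignment would need a separate comparison.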
Priority: High
Status: Resolved
Resolution: A stateful WebSocket-based streaming architecture has been successfully implemented. The system now supports real-time transcription with partial updates, improved latency, and enhanced subtitle alignment, addressing the limitations of the previous polling-based approach.