feat(asr): transition to stateful streaming ASR architecture

Description: The existing transcription pipeline is based on a polling-driven HTTP model, which introduces avoidable latency and limits the system’s ability to support real-time speech-to-text use cases. This issue proposes migrating to a stateful, streaming-based architecture using WebSockets to enable continuous, low-latency transcription.

Background: The current implementation processes audio in discrete intervals via HTTP requests. While functional, this approach:

Delays transcription delivery
Prevents real-time feedback (interim results)
Reduces subtitle synchronization accuracy
Increases overhead for continuous audio streams

Objectives:

Enable real-time, low-latency transcription
Support continuous audio streaming
Provide interim (partial) and final transcription outputs
Improve timestamp precision for subtitle generation
Maintain backward compatibility with existing APIs
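
As an illustration of the interim/final output and timestamp objectives, the sketch below shows one possible shape for a transcription event and how word-level timestamps could feed subtitle generation. The names Word, TranscriptEvent, srt_timestamp, and to_srt_cue are hypothetical and are not taken from the implementation; SRT is assumed here only as an example subtitle format.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds from session start (assumed convention)
    end: float


@dataclass
class TranscriptEvent:
    session_id: str
    kind: str  # "interim" (may still change) or "final" (stable)
    words: List[Word] = field(default_factory=list)

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)

    def to_json(self) -> str:
        # Serialize the event as it might be sent over the socket.
        payload = asdict(self)
        payload["text"] = self.text
        return json.dumps(payload)


def srt_timestamp(seconds: float) -> str:
    # Convert seconds to the HH:MM:SS,mmm form used by SRT subtitles.
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt_cue(index: int, event: TranscriptEvent) -> str:
    # Only final events are turned into cues; interim text can still change.
    if event.kind != "final" or not event.words:
        raise ValueError("cues are built from non-empty final events only")
    start = srt_timestamp(event.words[0].start)
    end = srt_timestamp(event.words[-1].end)
    return f"{index}\n{start} --> {end}\n{event.text}\n"
```

Keeping the interim/final distinction in the payload lets a client overwrite interim text in place and commit a subtitle cue only once a final event arrives.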

Proposed Approach:

Introduce a WebSocket endpoint for streaming transcription
Implement session-based audio processing
Integrate Voice Activity Detection (VAD) for efficient segmentation
Refactor ASR pipeline to support incremental decoding and streaming outputs
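
The session, VAD, and incremental-decoding steps above can be sketched roughly as follows. This is a minimal, transport-agnostic illustration, not the actual implementation: StreamingSession, frame_rms, and the decode callable are hypothetical, the VAD is a simple energy threshold standing in for whatever detector is really used, and the audio is assumed to be 16 kHz mono 16-bit PCM in native byte order.

```python
import array
import math
from typing import Callable, List, Tuple

SAMPLE_RATE = 16_000              # assumed input format: 16 kHz mono PCM16
FRAME_SAMPLES = 320               # 20 ms per frame at 16 kHz
FRAME_BYTES = FRAME_SAMPLES * 2   # two bytes per 16-bit sample


def frame_rms(frame: array.array) -> float:
    # Root-mean-square energy of one frame, used as a crude speech detector.
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


class StreamingSession:
    """Per-connection state: buffered bytes plus the open speech segment."""

    def __init__(self, decode: Callable[[array.array], str],
                 energy_threshold: float = 500.0,
                 hangover_frames: int = 10) -> None:
        self.decode = decode                      # hypothetical ASR decoder
        self.energy_threshold = energy_threshold
        self.hangover_frames = hangover_frames    # silence frames that end a segment
        self._buffer = bytearray()                # bytes not yet split into frames
        self._segment = array.array("h")          # samples of the current utterance
        self._silence = 0

    def feed(self, pcm: bytes) -> List[Tuple[str, str]]:
        # Accept an arbitrary-sized chunk and return (kind, text) events.
        events: List[Tuple[str, str]] = []
        self._buffer.extend(pcm)
        while len(self._buffer) >= FRAME_BYTES:
            frame = array.array("h")
            frame.frombytes(bytes(self._buffer[:FRAME_BYTES]))
            del self._buffer[:FRAME_BYTES]
            if frame_rms(frame) >= self.energy_threshold:
                # Speech: grow the segment and emit a revisable interim result.
                self._segment.extend(frame)
                self._silence = 0
                events.append(("interim", self.decode(self._segment)))
            elif self._segment:
                # Silence after speech: finalize once the hangover elapses.
                self._silence += 1
                if self._silence >= self.hangover_frames:
                    events.append(("final", self.decode(self._segment)))
                    self._segment = array.array("h")
                    self._silence = 0
        return events
```

A WebSocket handler would create one such session per connection, call feed() for every binary message, and forward the returned events to the client. Re-decoding the whole segment on each frame is done here only for brevity; a real incremental decoder would keep its own state between calls.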

Acceptance Criteria:

WebSocket endpoint supports concurrent streaming sessions
System produces both interim and final transcription results
Word-level timestamps are accurate and consistent
System performs efficiently under continuous audio input
Existing HTTP-based /api/transcribe endpoint remains functional
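
One way to check the concurrent-session criterion is to confirm that state from interleaved streams never mixes and that every session is released when its stream ends. The sketch below does this with asyncio and an in-memory registry; SessionRegistry and handle_stream are hypothetical names, and the list-of-chunks input stands in for a real WebSocket connection.

```python
import asyncio
import uuid
from typing import Dict, List, Tuple


class SessionRegistry:
    """Tracks per-session audio so concurrent streams stay isolated."""

    def __init__(self) -> None:
        self._sessions: Dict[str, List[bytes]] = {}

    def open(self) -> str:
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = []
        return session_id

    def append(self, session_id: str, chunk: bytes) -> None:
        self._sessions[session_id].append(chunk)

    def close(self, session_id: str) -> bytes:
        # Remove the session and hand back everything it received.
        return b"".join(self._sessions.pop(session_id))

    def __len__(self) -> int:
        return len(self._sessions)


async def handle_stream(registry: SessionRegistry, chunks: List[bytes]) -> bytes:
    # Stand-in for one WebSocket connection delivering audio chunks.
    session_id = registry.open()
    try:
        for chunk in chunks:
            registry.append(session_id, chunk)
            await asyncio.sleep(0)  # yield so other sessions interleave
    finally:
        audio = registry.close(session_id)  # always release session state
    return audio


async def main() -> Tuple[List[bytes], int]:
    registry = SessionRegistry()
    results = await asyncio.gather(
        handle_stream(registry, [b"aaaa", b"aaaa"]),
        handle_stream(registry, [b"bb", b"bb", b"bb"]),
    )
    return list(results), len(registry)
```

Running asyncio.run(main()) returns each stream's own audio unmixed and a registry size of zero, which is the property a concurrency test for the endpoint would want to assert.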

Priority: High
Status: Resolved

Resolution: A stateful, WebSocket-based streaming architecture has been implemented. The system now supports real-time transcription with interim (partial) updates, lower latency, and better subtitle alignment, addressing the limitations of the previous polling-based approach.