feat(asr): finalize Phase 4 LID with Whisper's built-in detection (!27) · Merge requests · VISWAM / apps / Speech / Voice App Backend

ashritha kunjeti requested to merge saas_phase4 into develop Apr 25, 2026

Description

This MR completes Phase 4, covering Language Identification (LID) and ASR stability: it upgrades the LID model, improves ASR fallback behavior, and strengthens reliability across multilingual workflows.

Key Changes

LID Engine Upgrade
- Migrated to speechbrain/lang-id-voxlingua107-ecapa
- Improves acoustic-level language detection accuracy, especially for Indic languages

ASR Fallback Improvement
- Replaced fallback with openai/whisper-base
- Eliminates decoder hallucination loops on empty/noisy inputs (e.g., YouTube boilerplate audio)

Mixed Language Handling
- Introduced explicit mixed language bypass
- Enables better handling of code-switched and multilingual inputs in router

HuggingFace Fix
- Resolved huggingface_hub 401 Unauthorized issue
- Stabilized model fetch and timeout handling

Documentation Update
- Rewrote docs/LID_SELECTION.md
- Added standardized evaluation metrics based on VoxLingua107 benchmarks

Testing Enhancements
- Strengthened integration tests
- Added strict assertion thresholds
- Achieved full workflow coverage (batch + PCM16 streaming)

Add language dropdown in Swagger using Enum
- Introduced LanguageOption Enum with 'te', 'en', 'hi', and 'other'
- Updated /transcribe endpoint to use Enum instead of raw string
- Set default language to 'other' for better UX
- Enabled automatic Swagger UI dropdown for language selection
- Added internal mapping: 'other' → None to preserve LID auto-detection flow

This improves API usability while maintaining the existing language routing logic.

Language Detection & Routing Enhancements

A new /language endpoint has been introduced to analyze uploaded audio files and automatically detect the spoken language using the upgraded LID (Language Identification) model. This allows users to quickly identify the language of any audio input before transcription.

In addition, the /transcribe endpoint now takes its language parameter as a structured LanguageOption Enum, which Swagger UI renders as a selection dropdown. The available options are:

- te (Telugu)
- en (English)
- hi (Hindi)
- other (Auto-detect)
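
As a rough illustration of the options listed above, the enum and the 'other' → None mapping could look like the sketch below. The enum name and its values come from this MR; the helper name `to_language_hint` is hypothetical and is not taken from the codebase.

```python
from enum import Enum
from typing import Optional


class LanguageOption(str, Enum):
    """Language choices for /transcribe (values as listed in this MR)."""

    te = "te"        # Telugu
    en = "en"        # English
    hi = "hi"        # Hindi
    other = "other"  # auto-detect via LID


def to_language_hint(option: LanguageOption = LanguageOption.other) -> Optional[str]:
    """Hypothetical helper: convert the enum into the hint given to the router.

    'other' becomes None so that the LID auto-detection path still runs;
    every other option is passed through as its language code.
    """
    return None if option is LanguageOption.other else option.value
```

Subclassing `str` keeps the enum members comparable to the raw strings the endpoint accepted before, so existing callers that send `te`, `en`, or `hi` should not need to change.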

Behavior:

- When a specific language is selected (te, en, or hi), the system routes the audio directly to the appropriate ASR pipeline.
- When other is selected, the system first performs Language Identification (LID):
  - If the detected language is Telugu, the audio is routed to Swecha ASR.
  - For all other detected languages, the audio is routed to Whisper ASR.

This design ensures both flexibility and accuracy by combining user control with automatic language detection when needed.
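
The routing behavior described above can be sketched as a single function. This is a minimal illustration rather than the actual router: `pick_asr`, the `SWECHA`/`WHISPER` labels, and the injected `detect_language` callable are hypothetical names. It also assumes that the detector returns a short language code such as "te", and that an explicitly selected en or hi goes to Whisper, which the description does not spell out.

```python
from typing import Callable, Optional

# Hypothetical labels for the two ASR pipelines named in this MR.
SWECHA = "swecha"
WHISPER = "whisper"


def pick_asr(
    language: Optional[str],
    audio: bytes,
    detect_language: Callable[[bytes], str],
) -> str:
    """Choose an ASR pipeline for one audio input.

    language is None when the caller selected 'other'; only then is LID run.
    """
    if language is None:
        # Auto-detect path: ask the LID model (injected here as a callable).
        language = detect_language(audio)
    # Telugu goes to Swecha ASR; everything else goes to Whisper ASR.
    # Assumption: explicit 'en' and 'hi' selections follow the same rule.
    return SWECHA if language == "te" else WHISPER
```

Passing the detector in as an argument keeps the function easy to test with a stub, and makes it visible that LID is skipped whenever the caller names a language.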

Impact

- More accurate language detection across diverse audio inputs
- Improved ASR reliability with reduced hallucinations
- Better support for multilingual and code-switched scenarios
- Increased system stability and test coverage

Checklist

- LID model upgraded and validated
- ASR fallback tested on noisy/empty audio
- Mixed language routing verified
- Docs updated
- Integration tests passing

Closes #13 (closed)

Edited Apr 25, 2026 by ashritha kunjeti
