feat(asr): finalize Phase 4 LID with Whisper's built-in detection
Description
This MR completes Phase 4 (Language Identification (LID) and ASR stability improvements) by upgrading the LID model, improving ASR fallback behavior, and strengthening reliability across multilingual workflows.
Key Changes
- LID Engine Upgrade
  - Migrated to speechbrain/lang-id-voxlingua107-ecapa
  - Improves acoustic-level language detection accuracy, especially for Indic languages
- ASR Fallback Improvement
  - Replaced fallback with openai/whisper-base
  - Eliminates decoder hallucination loops on empty/noisy inputs (e.g., YouTube boilerplate audio)
- Mixed Language Handling
  - Introduced explicit mixed language bypass
  - Enables better handling of code-switched and multilingual inputs in router
- HuggingFace Fix
  - Resolved huggingface_hub 401 Unauthorized issue
  - Stabilized model fetch and timeout handling
- Documentation Update
  - Rewrote docs/LID_SELECTION.md
  - Added standardized evaluation metrics based on VoxLingua107 benchmarks
- Testing Enhancements
  - Strengthened integration tests
  - Added strict assertion thresholds
  - Achieved full workflow coverage (batch + PCM16 streaming)
- Add language dropdown in Swagger using Enum
  - Introduced LanguageOption Enum with 'te', 'en', 'hi', and 'other'
  - Updated /transcribe endpoint to use the Enum instead of a raw string
  - Set the default language to 'other' for better UX
  - Enabled automatic Swagger UI dropdown for language selection
  - Added internal mapping: 'other' → None to preserve the LID auto-detection flow

  This improves API usability while maintaining the existing language routing logic.
- Language Detection & Routing Enhancements

  A new /language endpoint analyzes an uploaded audio file and detects the spoken language using the upgraded LID model, so users can identify the language of an audio input before transcription.

  In addition, the /transcribe endpoint now exposes a language selection dropdown in Swagger UI, backed by the structured LanguageOption Enum. The available options are:
  - te (Telugu)
  - en (English)
  - hi (Hindi)
  - other (Auto-detect)

  Behavior:
  - When a specific language is selected (te, en, or hi), the system routes the audio directly to the appropriate ASR pipeline.
  - When other is selected, the system first performs Language Identification (LID):
    - If the detected language is Telugu → routed to Swecha ASR
    - For all other detected languages → routed to Whisper ASR

  This design combines user control with automatic language detection when needed.
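The Enum mapping and routing behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this MR: the helper names `to_language_hint` and `choose_asr_backend`, the backend labels "swecha" and "whisper", and the injected `detect_language` callable are all hypothetical. It also assumes that an explicitly selected en or hi is routed to Whisper, which the description only refers to as "the appropriate ASR pipeline".

```python
from enum import Enum
from typing import Callable, Optional


class LanguageOption(str, Enum):
    """Values offered in the Swagger UI dropdown for /transcribe."""
    te = "te"
    en = "en"
    hi = "hi"
    other = "other"


def to_language_hint(option: LanguageOption) -> Optional[str]:
    """Map 'other' to None so the LID auto-detection flow still runs."""
    return None if option is LanguageOption.other else option.value


def choose_asr_backend(
    option: LanguageOption,
    detect_language: Callable[[], str],
) -> str:
    """Pick an ASR backend label (hypothetical names) for one request.

    detect_language stands in for the LID model call and is only
    invoked when the user selected 'other'.
    """
    hint = to_language_hint(option)
    language = hint if hint is not None else detect_language()
    # Assumption: Telugu goes to Swecha ASR, everything else to Whisper.
    return "swecha" if language == "te" else "whisper"
```

Because the Enum subclasses `str`, web frameworks that generate OpenAPI schemas from type hints (FastAPI, for example) typically render such a parameter as a dropdown in Swagger UI. Passing the LID call in as a callable keeps the routing decision testable without loading a model.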
Impact
- More accurate language detection across diverse audio inputs
- Improved ASR reliability with reduced hallucinations
- Better support for multilingual and code-switched scenarios
- Increased system stability and test coverage
Checklist
- LID model upgraded and validated
- ASR fallback tested on noisy/empty audio
- Mixed language routing verified
- Docs updated
- Integration tests passing
Closes #13