fix: Shift Model Loading from Lazy Loading to Preloading with Warm-Up
Summary
This MR updates the model initialization strategy by replacing lazy loading with eager preloading at application startup, along with a dummy inference (warm-up pass) to ensure all models are fully initialized before handling real requests.
Changes Made
- Moved model loading from on-demand (lazy) to startup phase
- Implemented centralized model loader
- Added warm-up step using dummy input
- Ensures all pipelines are initialized:
  - ASR (Swecha + Whisper)
  - Language Identification (LID)
  - Diarization
  - Punctuation
  - Alignment
- Updated logs to reflect loading and warm-up stages
- Ensured models are cached and reused across requests
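The changes above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this MR: `ModelRegistry`, `preload`, `warm_up`, `get`, and the stand-in loader functions are hypothetical names, and the lambdas take the place of the real ASR and LID pipelines.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("model_loader")


class ModelRegistry:
    """Hypothetical centralized loader: loads every model once and caches it."""

    def __init__(self, loaders: Dict[str, Callable[[], Callable[[Any], Any]]]):
        self._loaders = loaders
        self._models: Dict[str, Callable[[Any], Any]] = {}

    def preload(self) -> None:
        # Eager loading: a failing loader raises here, at startup,
        # instead of during a user request.
        for name, loader in self._loaders.items():
            logger.info("Loading model: %s", name)
            self._models[name] = loader()

    def warm_up(self, dummy_inputs: Dict[str, Any]) -> None:
        # One dummy inference per model so first-call initialization
        # happens before real traffic arrives.
        for name, model in self._models.items():
            logger.info("Warming up model: %s", name)
            model(dummy_inputs[name])

    def get(self, name: str) -> Callable[[Any], Any]:
        # Requests reuse the cached instance; nothing is loaded here.
        return self._models[name]


# Stand-in loaders (assumptions); real ones would build the actual pipelines.
load_counts = {"asr": 0, "lid": 0}


def load_asr() -> Callable[[Any], Any]:
    load_counts["asr"] += 1
    return lambda audio: "transcript"


def load_lid() -> Callable[[Any], Any]:
    load_counts["lid"] += 1
    return lambda audio: "en"


registry = ModelRegistry({"asr": load_asr, "lid": load_lid})
registry.preload()
registry.warm_up({"asr": b"\x00" * 16, "lid": b"\x00" * 16})
```

In a layout like this, request handlers would only ever call `registry.get(...)`, so each model is loaded exactly once, however many requests arrive.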
Motivation
Previously:
- First request experienced high latency (cold start)
- Models were loaded while handling a request, which degraded the user experience
- Risk of race conditions under concurrent requests
Now:
- Models are ready before serving traffic
- Eliminates cold-start delays
- Improves consistency and reliability
Impact
Improvements
- Faster response time for first request
- Consistent latency across all requests
- Early detection of model loading failures
- Better production readiness
Trade-offs
- Increased application startup time (~100s observed)
- Higher initial CPU/GPU usage during boot
- Slightly heavier memory footprint at idle
Performance Observations
- Model loading time: ~100 seconds
- Warm-up time: ~14 seconds
- Runtime latency improved by eliminating cold start delays
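Stage timings like the ones above could be collected with a small helper such as the one below. This is only a sketch: `timed_stage` and `durations` are hypothetical names, and the `time.sleep` calls are placeholders for the real loading and warm-up work.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("startup")
durations = {}


@contextmanager
def timed_stage(name):
    """Record and log the wall-clock time of one startup stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the stage raises, so failed stages are timed too.
        durations[name] = time.perf_counter() - start
        logger.info("%s finished in %.2f s", name, durations[name])


with timed_stage("model_loading"):
    time.sleep(0.01)  # placeholder for loading all models

with timed_stage("warm_up"):
    time.sleep(0.01)  # placeholder for the dummy inference pass
```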
Closes #22
Edited by ashritha kunjeti