Improvement: Preloading Models with Warm-Up (Dummy Inference)
Description
To avoid latency caused by lazy loading, the system has been updated to preload models at application startup. Additionally, a dummy inference (warm-up test) is executed to ensure all models are fully initialized and “hot” before serving real user requests.
What was implemented
- Models are loaded during startup
- A dummy request/inference is run immediately after loading
- This ensures:
  - Model weights are in memory
  - GPU/CPU kernels are initialized
  - Any just-in-time (JIT) compilation is completed
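The startup sequence above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: `LazyModel`, `preload_and_warm_up`, and the loader functions are hypothetical names, and the stand-in model only simulates one-time initialization with a flag.

```python
from typing import Callable, Dict, List


class LazyModel:
    """Hypothetical stand-in for a real model that initializes on first use."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.initialized = False  # flips to True on the first inference

    def infer(self, samples: List[float]) -> int:
        if not self.initialized:
            # A real model would allocate buffers, initialize kernels,
            # or trigger JIT compilation at this point.
            self.initialized = True
        return len(samples)


def preload_and_warm_up(
    loaders: Dict[str, Callable[[], LazyModel]],
    dummy_input: List[float],
) -> Dict[str, LazyModel]:
    """Load every model, then run one dummy inference on each.

    Exceptions from loading or warm-up are not caught here, so a broken
    model stops startup before any user request is served.
    """
    models: Dict[str, LazyModel] = {}
    for name, load in loaders.items():
        model = load()
        model.infer(dummy_input)  # warm-up call; the result is discarded
        models[name] = model
    return models


# Assumed pipeline names, taken from the ASR/diarization example in this note.
loaders = {
    "asr": lambda: LazyModel("asr"),
    "diarization": lambda: LazyModel("diarization"),
}
models = preload_and_warm_up(loaders, dummy_input=[0.0] * 1600)
```

In a real service this function would be called from the application's startup hook, before the server starts accepting traffic.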
Benefits
- No cold-start latency on user requests
  - The first real user request is fast
- Consistent performance
  - No initialization-related gap between the first and subsequent requests
- Improved reliability
  - Errors in model loading are caught during startup, not during user requests
- Better production readiness
  - Well suited to APIs with real-time requirements (such as ASR/diarization)
Notes on Warm-Up Strategy
- Dummy input should be:
  - Small but valid (e.g., a short audio clip for ASR)
- Warm-up should cover:
  - All critical pipelines (ASR, diarization, etc.)
- Warm-up runs should not be recorded in logs/metrics as real requests
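One way to follow these notes is sketched below, under stated assumptions: a 16 kHz sample rate, silence as the dummy clip, and a simple in-process counter in place of a real metrics system. `make_dummy_audio`, `Metrics`, `run_pipeline`, and `warm_up_all` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

SAMPLE_RATE_HZ = 16_000  # assumed sample rate, not specified in this note


def make_dummy_audio(duration_s: float = 0.5) -> List[float]:
    """Return a short clip of silence: small, but a valid input shape."""
    return [0.0] * int(SAMPLE_RATE_HZ * duration_s)


@dataclass
class Metrics:
    """Toy counters standing in for a real metrics backend."""
    real_requests: int = 0
    warmup_runs: int = 0


def run_pipeline(
    pipeline: Callable[[List[float]], object],
    audio: List[float],
    metrics: Metrics,
    is_warmup: bool = False,
) -> object:
    """Run one pipeline and count the call under the right metric."""
    result = pipeline(audio)
    if is_warmup:
        metrics.warmup_runs += 1  # kept separate from user traffic
    else:
        metrics.real_requests += 1
    return result


def warm_up_all(
    pipelines: Dict[str, Callable[[List[float]], object]],
    metrics: Metrics,
) -> None:
    """Send the same dummy clip through every critical pipeline."""
    dummy = make_dummy_audio()
    for pipeline in pipelines.values():
        run_pipeline(pipeline, dummy, metrics, is_warmup=True)


# Hypothetical stand-in pipelines; real ones would call the loaded models.
pipelines = {
    "asr": lambda audio: "",
    "diarization": lambda audio: [],
}
metrics = Metrics()
warm_up_all(pipelines, metrics)
```

Silence is the simplest valid input, but some pipelines may skip work on silent audio; in that case a short recorded clip may exercise more of the code path.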