Improvement: Preloading Models with Warm-Up (Dummy Inference)

Description

To avoid latency caused by lazy loading, the system has been updated to preload models at application startup. Additionally, a dummy inference (warm-up test) is executed to ensure all models are fully initialized and “hot” before serving real user requests.

What was implemented

- Models are loaded during startup.
- A dummy request/inference is run immediately after loading.

The warm-up ensures that:
- Model weights are in memory
- GPU/CPU kernels are initialized
- Any just-in-time (JIT) compilation is completed
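
The two steps above can be sketched as follows. This is a minimal illustration, not the actual implementation: `EchoModel`, `preload_and_warm_up`, and the shape of the `specs` registry are hypothetical names chosen for the example, and the stand-in model only simulates the one-time work (kernel initialization, JIT compilation) that a real model does on its first inference.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical registry shape: model name -> (loader, dummy-input factory).
ModelSpec = Tuple[Callable[[], Any], Callable[[], Any]]


class EchoModel:
    """Stand-in for a real model that does one-time work on its first call."""

    def __init__(self) -> None:
        self.initialized = False

    def __call__(self, x: Any) -> Any:
        if not self.initialized:
            # Placeholder for kernel initialization / JIT compilation.
            self.initialized = True
        return x


def preload_and_warm_up(specs: Dict[str, ModelSpec]) -> Dict[str, Any]:
    """Load every model, then run one dummy inference on each."""
    models: Dict[str, Any] = {}
    for name, (load, make_dummy_input) in specs.items():
        model = load()             # step 1: load during startup
        model(make_dummy_input())  # step 2: dummy inference (warm-up)
        models[name] = model
    return models
```

Called once at application startup, before the server starts accepting traffic, this returns models that have already completed their first inference.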

Benefits

- No cold-start latency: the first real user request is fast, because loading and initialization have already happened.
- Consistent performance: the first request no longer pays a one-time initialization cost that subsequent requests avoid.
- Improved reliability: errors in model loading are caught during startup, not during user requests.
- Better production readiness: well suited to APIs with real-time requirements (such as ASR or diarization).
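
The reliability benefit depends on startup actually failing when a model cannot be loaded. One way to get that behavior is sketched below, assuming each model is exposed as a zero-argument loader function; `StartupError` and `load_all_or_fail` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict


class StartupError(RuntimeError):
    """Raised when one or more models fail to load (hypothetical name)."""


def load_all_or_fail(loaders: Dict[str, Callable[[], object]]) -> Dict[str, object]:
    """Try every loader, then abort startup if any of them failed."""
    models: Dict[str, object] = {}
    failures: Dict[str, str] = {}
    for name, load in loaders.items():
        try:
            models[name] = load()
        except Exception as exc:
            # Collect all failures so one startup log shows every broken model.
            failures[name] = repr(exc)
    if failures:
        raise StartupError(f"model startup failed: {failures}")
    return models
```

Letting this exception propagate out of the startup routine keeps a broken instance from ever serving user requests.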

Notes on Warm-Up Strategy

- Dummy input should be small but valid (e.g., a short audio clip for ASR).
- Warm-up should cover all critical pipelines (ASR, diarization, etc.).
- Warm-up runs should not be recorded as real requests in logs or metrics.
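
These notes can be illustrated with a short sketch. Everything in it is an assumption made for the example: the 16 kHz sample rate, the half-second clip of silence, and the names `make_dummy_audio`, `RequestMetrics`, `run_inference`, and `warm_up_all`. Pipelines are modeled as plain callables that take a list of samples.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

SAMPLE_RATE_HZ = 16_000  # assumed; use the rate your models expect


def make_dummy_audio(seconds: float = 0.5) -> List[float]:
    """Return a short clip of silence: small, but a valid waveform."""
    return [0.0] * int(SAMPLE_RATE_HZ * seconds)


@dataclass
class RequestMetrics:
    """Keeps warm-up runs in a separate counter from real requests."""

    real_requests: int = 0
    warmup_runs: int = 0

    def record(self, warmup: bool) -> None:
        if warmup:
            self.warmup_runs += 1
        else:
            self.real_requests += 1


def run_inference(pipeline: Callable[[List[float]], Any], audio: List[float],
                  metrics: RequestMetrics, warmup: bool = False) -> Any:
    """Run one inference and record it under the appropriate counter."""
    result = pipeline(audio)
    metrics.record(warmup)
    return result


def warm_up_all(pipelines: Dict[str, Callable[[List[float]], Any]],
                metrics: RequestMetrics) -> None:
    """Run the dummy clip through every critical pipeline."""
    clip = make_dummy_audio()
    for pipeline in pipelines.values():
        run_inference(pipeline, clip, metrics, warmup=True)
```

Silence is the simplest valid input, but it may not exercise every code path (for instance, if a voice-activity detection stage skips the decoder), so a short recorded clip may be a better choice in practice.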