
fix: Shift Model Loading from Lazy Loading to Preloading with Warm-Up

ashritha kunjeti requested to merge model-loading into diarization

Summary

This MR updates the model initialization strategy by replacing lazy loading with eager preloading at application startup, along with a dummy inference (warm-up pass) to ensure all models are fully initialized before handling real requests.

Changes Made

  • Moved model loading from on-demand (lazy) to startup phase

  • Implemented centralized model loader

  • Added warm-up step using dummy input

  • Ensured all pipelines are initialized at startup:

    • ASR (Swecha + Whisper)
    • Language Identification (LID)
    • Diarization
    • Punctuation
    • Alignment
  • Updated logs to reflect loading and warm-up stages

  • Ensured models are cached and reused across requests
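
The changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this MR: `ModelRegistry`, `register`, `preload_all`, `get`, and the stand-in loaders are hypothetical names, and the real pipelines (ASR, LID, diarization, punctuation, alignment) would supply their own loading functions and dummy inputs.

```python
import logging
import threading
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger("model_loader")


class ModelRegistry:
    """Hypothetical centralized loader: loads each model once and caches it."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Any]] = {}
        self._warmups: Dict[str, Callable[[Any], None]] = {}
        self._models: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def register(
        self,
        name: str,
        loader: Callable[[], Any],
        warmup: Optional[Callable[[Any], None]] = None,
    ) -> None:
        """Record how to load a model and, optionally, how to warm it up."""
        self._loaders[name] = loader
        if warmup is not None:
            self._warmups[name] = warmup

    def preload_all(self) -> None:
        """Eagerly load every registered model, then run each warm-up pass.

        Any exception propagates, so a broken model fails at startup
        instead of during the first request.
        """
        with self._lock:
            for name, loader in self._loaders.items():
                if name not in self._models:
                    logger.info("Loading model: %s", name)
                    self._models[name] = loader()
            for name, warmup in self._warmups.items():
                logger.info("Warming up model: %s", name)
                warmup(self._models[name])

    def get(self, name: str) -> Any:
        """Return the cached model; never loads lazily."""
        try:
            return self._models[name]
        except KeyError:
            raise RuntimeError(f"Model '{name}' was not preloaded") from None


def _stand_in_loader(name: str) -> Callable[[], Any]:
    """Placeholder for a real model loader (assumption for this sketch)."""
    def load() -> Any:
        return lambda x: f"{name}:{x}"
    return load


registry = ModelRegistry()
for pipeline in ("asr", "lid", "diarization", "punctuation", "alignment"):
    # The warm-up runs one dummy inference through the freshly loaded model.
    registry.register(pipeline, _stand_in_loader(pipeline), warmup=lambda m: m("dummy"))

registry.preload_all()  # would be called once at application startup
```

Request handlers would then call `registry.get("asr")` and always receive the same cached object, so no model loading happens on the request path.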

Motivation

Previously:

  • First request experienced high latency (cold start)
  • Models were loaded while handling a request, which degraded the user experience
  • Risk of race conditions under concurrent requests

Now:

  • Models are ready before serving traffic
  • Eliminates cold-start delays
  • Improves consistency and reliability

Impact

Improvements

  • Faster response time for first request
  • Consistent latency across all requests
  • Early detection of model loading failures
  • Better production readiness

Trade-offs

  • Increased application startup time (~100s observed)
  • Higher initial CPU/GPU usage during boot
  • Slightly heavier memory footprint at idle

Performance Observations

  • Model loading time: ~100 seconds
  • Warm-up time: ~14 seconds
  • Request latency improved because the cold-start delay no longer occurs at request time
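
Timings like the ones above can be collected by wrapping each startup stage in a small timer. The sketch below is one way to do that, assuming a hypothetical `timed_stage` helper and stage names; the `time.sleep` calls are placeholders for the real loading and warm-up work, and the figures reported in this MR were not necessarily measured this way.

```python
import logging
import time
from contextlib import contextmanager
from typing import Dict, Iterator

logger = logging.getLogger("startup")

# Elapsed seconds per startup stage, keyed by stage name.
stage_durations: Dict[str, float] = {}


@contextmanager
def timed_stage(name: str) -> Iterator[None]:
    """Log the start and end of a stage and record how long it took."""
    start = time.perf_counter()
    logger.info("%s started", name)
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        stage_durations[name] = elapsed
        logger.info("%s finished in %.2f s", name, elapsed)


with timed_stage("model_loading"):
    time.sleep(0.01)  # placeholder for loading all models

with timed_stage("warm_up"):
    time.sleep(0.01)  # placeholder for the dummy inference pass
```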

Closes #22 (closed)

Edited by ashritha kunjeti
