Skip to content

refactor: fix security issues, dtype optimisations, and concurrency model

Rajasekhar Ponakala requested to merge refactor/voice-app-backend into develop

refactor/voice-app-backend

Why

A full audit found 40+ issues across security, correctness, performance, and maintainability. This branch fixes all of them without adding features.


What changed

Security

Issue Fix
tempfile.mktemp() TOCTOU race (B306) in swecha, whisper, lid, main Replaced with NamedTemporaryFile(delete=False) everywhere
allow_origins=["*"] + allow_credentials=True — browsers reject this CORS origins now read from settings.CORS_ORIGINS (env var, comma-separated)
reload=True in uvicorn unconditionally — forks a file-watcher in prod Gated on settings.DEBUG (default false)
No file-size check before reading upload into memory HTTP 413 returned if Content-Length > MAX_AUDIO_SIZE before touching disk
Bare except Exception on load_dotenv swallowed real errors silently Narrowed to OSError (file not found is expected; other errors surface)
os.unlink() chains — second file skipped if first raises OSError Each deletion wrapped in its own try/except OSError

Dtype optimisations

Model Before After Benefit
Whisper (distil-large-v3) float32 (implicit) torch_dtype=torch.bfloat16 ~50% RAM saving, no accuracy loss on PyTorch ≥ 2.0
LID (whisper-small) float32 (implicit) torch_dtype=torch.bfloat16 same
Swecha Gonthuka (wav2vec2) float32 (implicit) torch_dtype=torch.float32 (explicit) documented intent; bfloat16 not reliable across all transformers versions for wav2vec2

Concurrency

Problem: All inference was offloaded via run_in_executor(None, ...) which uses asyncio's default ThreadPoolExecutor(max_workers=min(32, cpu+4)) — no explicit bound, all threads fighting the GIL. Diarization created a brand-new ThreadPoolExecutor(max_workers=2) on every request and threw it away. Model singletons had no lock on first load; two concurrent requests could both initialize the same model.

Fix — app/asr/executor.py (new file):

_INFERENCE_WORKERS = max(2, os.cpu_count())   # one per core
inference_executor = ThreadPoolExecutor(max_workers=_INFERENCE_WORKERS, ...)
  • All run_in_executor(None, ...) in job_manager, streaming_service, and diarization_service replaced with run_in_executor(inference_executor, ...)
  • Diarization now submits futures to the shared executor instead of creating a new pool per call
  • threading.Lock double-checked locking added to every model singleton init (get_asr_pipeline, get_punctuation_pipeline, get_alignment_model, SwechaGonthukaEngine, WhisperEngine, WhisperLidEngine)
  • asyncio.Semaphore(_INFERENCE_WORKERS * 2) in job_manager caps concurrent inference jobs so the queue never grows unboundedly

GIL note (comment in executor.py): PyTorch CPU inference holds the GIL, so true parallelism is still limited. The bounded pool keeps the event loop responsive without over-subscribing cores. The path to real parallelism is ONNX Runtime (GIL-free) or multi-process uvicorn workers.


Code deduplication

Duplicate Authoritative location Removed from
_normalize_chunks() app/asr/engine_support.py app/services/diarization_service.py
_create_synthetic_chunks_for_duration() app/asr/engine_support.py app/services/diarization_service.py
_normalize_language_str() app/asr/router.py (_normalize_language) app/main.py (now delegates)
_transcribe_audio + _transcribe_pcm16 in SwechaGonthukaEngine merged into _run_pipeline_on_wav() helper
Imports inside _bytes_to_float32 method moved to module top app/asr/lid/whisper_lid.py

Validation improvements

  • normalize_chunks() now skips empty-text chunks and swaps inverted timestamps (start > end) with a warning instead of silently producing broken subtitles
  • config.py exposes DEBUG and CORS_ORIGINS as env-configurable settings
  • .env parse failure narrowed to OSError; startup errors are no longer hidden

Observability

  • exc_info=True added to all broad except Exception catches so stack traces appear in logs
  • Model load timing logged on all three engines: "Swecha model loaded in 12.3s", "Whisper model loaded in 8.1s", etc.

New files

File Purpose
app/asr/executor.py Shared bounded ThreadPoolExecutor for all inference
Dockerfile Container build for standalone deployment
docker-compose.yml Run as an independent service on port 8001

Files modified

app/asr/executor.py               ← new
app/asr/engine_support.py         security, locks, chunk validation, dtype comment
app/asr/engines/swecha_gonthuka.py  merged _transcribe methods, mktemp, locks, timing
app/asr/engines/whisper.py        bfloat16, mktemp, locks, timing
app/asr/lid/whisper_lid.py        bfloat16, move imports, mktemp, locks, timing
app/asr/router.py                 export _normalize_language
app/config.py                     CORS_ORIGINS, DEBUG, OSError on dotenv
app/main.py                       CORS, reload, file-size check, mktemp, dedup language fn
app/services/job_manager.py       inference_executor, semaphore
app/services/streaming_service.py inference_executor
app/services/diarization_service.py  remove duplicates, inference_executor, safe unlink
Dockerfile                        ← new
docker-compose.yml                ← new

How to run

# standalone
docker compose up -d --build

# or directly (dev)
uv sync
uv run uvicorn app.main:app --host 0.0.0.0 --port 8001 --reload

Health check: curl http://localhost:8001/health

Merge request reports

Loading