refactor: fix security issues, dtype optimisations, and concurrency model
refactor/voice-app-backend
Why
A full audit found 40+ issues across security, correctness, performance, and maintainability. This branch fixes all of them without adding features.
What changed
Security
| Issue | Fix |
|---|---|
tempfile.mktemp() TOCTOU race (B306) in swecha, whisper, lid, main |
Replaced with NamedTemporaryFile(delete=False) everywhere |
allow_origins=["*"] + allow_credentials=True — browsers reject this |
CORS origins now read from settings.CORS_ORIGINS (env var, comma-separated) |
reload=True in uvicorn unconditionally — forks a file-watcher in prod |
Gated on settings.DEBUG (default false) |
| No file-size check before reading upload into memory | HTTP 413 returned if Content-Length > MAX_AUDIO_SIZE before touching disk |
Bare except Exception on load_dotenv swallowed real errors silently |
Narrowed to OSError (file not found is expected; other errors surface) |
os.unlink() chains — second file skipped if first raises OSError
|
Each deletion wrapped in its own try/except OSError
|
Dtype optimisations
| Model | Before | After | Benefit |
|---|---|---|---|
| Whisper (distil-large-v3) | float32 (implicit) | torch_dtype=torch.bfloat16 |
~50% RAM saving, no accuracy loss on PyTorch ≥ 2.0 |
| LID (whisper-small) | float32 (implicit) | torch_dtype=torch.bfloat16 |
same |
| Swecha Gonthuka (wav2vec2) | float32 (implicit) |
torch_dtype=torch.float32 (explicit) |
documented intent; bfloat16 not reliable across all transformers versions for wav2vec2 |
Concurrency
Problem: All inference was offloaded via run_in_executor(None, ...) which
uses asyncio's default ThreadPoolExecutor(max_workers=min(32, cpu+4)) — no
explicit bound, all threads fighting the GIL. Diarization created a brand-new
ThreadPoolExecutor(max_workers=2) on every request and threw it away.
Model singletons had no lock on first load; two concurrent requests could both
initialize the same model.
Fix — app/asr/executor.py (new file):
_INFERENCE_WORKERS = max(2, os.cpu_count()) # one per core
inference_executor = ThreadPoolExecutor(max_workers=_INFERENCE_WORKERS, ...)
- All
run_in_executor(None, ...)injob_manager,streaming_service, anddiarization_servicereplaced withrun_in_executor(inference_executor, ...) - Diarization now submits futures to the shared executor instead of creating a new pool per call
-
threading.Lockdouble-checked locking added to every model singleton init (get_asr_pipeline,get_punctuation_pipeline,get_alignment_model,SwechaGonthukaEngine,WhisperEngine,WhisperLidEngine) -
asyncio.Semaphore(_INFERENCE_WORKERS * 2)injob_managercaps concurrent inference jobs so the queue never grows unboundedly
GIL note (comment in executor.py): PyTorch CPU inference holds the GIL, so true parallelism is still limited. The bounded pool keeps the event loop responsive without over-subscribing cores. The path to real parallelism is ONNX Runtime (GIL-free) or multi-process uvicorn workers.
Code deduplication
| Duplicate | Authoritative location | Removed from |
|---|---|---|
_normalize_chunks() |
app/asr/engine_support.py |
app/services/diarization_service.py |
_create_synthetic_chunks_for_duration() |
app/asr/engine_support.py |
app/services/diarization_service.py |
_normalize_language_str() |
app/asr/router.py (_normalize_language) |
app/main.py (now delegates) |
_transcribe_audio + _transcribe_pcm16 in SwechaGonthukaEngine |
merged into _run_pipeline_on_wav() helper |
— |
Imports inside _bytes_to_float32 method |
moved to module top | app/asr/lid/whisper_lid.py |
Validation improvements
-
normalize_chunks()now skips empty-text chunks and swaps inverted timestamps (start > end) with a warning instead of silently producing broken subtitles -
config.pyexposesDEBUGandCORS_ORIGINSas env-configurable settings -
.envparse failure narrowed toOSError; startup errors are no longer hidden
Observability
-
exc_info=Trueadded to all broadexcept Exceptioncatches so stack traces appear in logs - Model load timing logged on all three engines:
"Swecha model loaded in 12.3s","Whisper model loaded in 8.1s", etc.
New files
| File | Purpose |
|---|---|
app/asr/executor.py |
Shared bounded ThreadPoolExecutor for all inference |
Dockerfile |
Container build for standalone deployment |
docker-compose.yml |
Run as an independent service on port 8001 |
Files modified
app/asr/executor.py ← new
app/asr/engine_support.py security, locks, chunk validation, dtype comment
app/asr/engines/swecha_gonthuka.py merged _transcribe methods, mktemp, locks, timing
app/asr/engines/whisper.py bfloat16, mktemp, locks, timing
app/asr/lid/whisper_lid.py bfloat16, move imports, mktemp, locks, timing
app/asr/router.py export _normalize_language
app/config.py CORS_ORIGINS, DEBUG, OSError on dotenv
app/main.py CORS, reload, file-size check, mktemp, dedup language fn
app/services/job_manager.py inference_executor, semaphore
app/services/streaming_service.py inference_executor
app/services/diarization_service.py remove duplicates, inference_executor, safe unlink
Dockerfile ← new
docker-compose.yml ← new
How to run
# standalone
docker compose up -d --build
# or directly (dev)
uv sync
uv run uvicorn app.main:app --host 0.0.0.0 --port 8001 --reload
Health check: curl http://localhost:8001/health