Rajasekhar Ponakala requested to merge refactor/voice-app-backend into develop May 13, 2026

refactor/voice-app-backend

Why

A full audit found 40+ issues across security, correctness, performance, and maintainability. This branch fixes all of them without adding features.

What changed

Security

Issue	Fix
`tempfile.mktemp()` TOCTOU race (B306) in swecha, whisper, lid, main	Replaced with `NamedTemporaryFile(delete=False)` everywhere
`allow_origins=["*"]` + `allow_credentials=True` — browsers reject this	CORS origins now read from `settings.CORS_ORIGINS` (env var, comma-separated)
`reload=True` in uvicorn unconditionally — forks a file-watcher in prod	Gated on `settings.DEBUG` (default `false`)
No file-size check before reading upload into memory	HTTP 413 returned if `Content-Length > MAX_AUDIO_SIZE` before touching disk
Bare `except Exception` on `load_dotenv` swallowed real errors silently	Narrowed to `OSError` (file not found is expected; other errors surface)
`os.unlink()` chains — second file skipped if first raises `OSError`	Each deletion wrapped in its own `try/except OSError`

Dtype optimisations

Model	Before	After	Benefit
Whisper (distil-large-v3)	float32 (implicit)	`torch_dtype=torch.bfloat16`	~50% RAM saving, no accuracy loss on PyTorch ≥ 2.0
LID (whisper-small)	float32 (implicit)	`torch_dtype=torch.bfloat16`	same
Swecha Gonthuka (wav2vec2)	float32 (implicit)	`torch_dtype=torch.float32` (explicit)	documented intent; bfloat16 not reliable across all transformers versions for wav2vec2

Concurrency

Problem: All inference was offloaded via run_in_executor(None, ...) which uses asyncio's default ThreadPoolExecutor(max_workers=min(32, cpu+4)) — no explicit bound, all threads fighting the GIL. Diarization created a brand-new ThreadPoolExecutor(max_workers=2) on every request and threw it away. Model singletons had no lock on first load; two concurrent requests could both initialize the same model.

Fix — app/asr/executor.py (new file):

_INFERENCE_WORKERS = max(2, os.cpu_count())   # one per core
inference_executor = ThreadPoolExecutor(max_workers=_INFERENCE_WORKERS, ...)

All run_in_executor(None, ...) in job_manager, streaming_service, and diarization_service replaced with run_in_executor(inference_executor, ...)
Diarization now submits futures to the shared executor instead of creating a new pool per call
threading.Lock double-checked locking added to every model singleton init (get_asr_pipeline, get_punctuation_pipeline, get_alignment_model, SwechaGonthukaEngine, WhisperEngine, WhisperLidEngine)
asyncio.Semaphore(_INFERENCE_WORKERS * 2) in job_manager caps concurrent inference jobs so the queue never grows unboundedly

GIL note (comment in executor.py): PyTorch CPU inference holds the GIL, so true parallelism is still limited. The bounded pool keeps the event loop responsive without over-subscribing cores. The path to real parallelism is ONNX Runtime (GIL-free) or multi-process uvicorn workers.

Code deduplication

Duplicate	Authoritative location	Removed from
`_normalize_chunks()`	`app/asr/engine_support.py`	`app/services/diarization_service.py`
`_create_synthetic_chunks_for_duration()`	`app/asr/engine_support.py`	`app/services/diarization_service.py`
`_normalize_language_str()`	`app/asr/router.py` (`_normalize_language`)	`app/main.py` (now delegates)
`_transcribe_audio` + `_transcribe_pcm16` in SwechaGonthukaEngine	merged into `_run_pipeline_on_wav()` helper	—
Imports inside `_bytes_to_float32` method	moved to module top	`app/asr/lid/whisper_lid.py`

Validation improvements

normalize_chunks() now skips empty-text chunks and swaps inverted timestamps (start > end) with a warning instead of silently producing broken subtitles
config.py exposes DEBUG and CORS_ORIGINS as env-configurable settings
.env parse failure narrowed to OSError; startup errors are no longer hidden

Observability

exc_info=True added to all broad except Exception catches so stack traces appear in logs
Model load timing logged on all three engines: "Swecha model loaded in 12.3s", "Whisper model loaded in 8.1s", etc.

New files

File	Purpose
`app/asr/executor.py`	Shared bounded `ThreadPoolExecutor` for all inference
`Dockerfile`	Container build for standalone deployment
`docker-compose.yml`	Run as an independent service on port 8001

Files modified

app/asr/executor.py               ← new
app/asr/engine_support.py         security, locks, chunk validation, dtype comment
app/asr/engines/swecha_gonthuka.py  merged _transcribe methods, mktemp, locks, timing
app/asr/engines/whisper.py        bfloat16, mktemp, locks, timing
app/asr/lid/whisper_lid.py        bfloat16, move imports, mktemp, locks, timing
app/asr/router.py                 export _normalize_language
app/config.py                     CORS_ORIGINS, DEBUG, OSError on dotenv
app/main.py                       CORS, reload, file-size check, mktemp, dedup language fn
app/services/job_manager.py       inference_executor, semaphore
app/services/streaming_service.py inference_executor
app/services/diarization_service.py  remove duplicates, inference_executor, safe unlink
Dockerfile                        ← new
docker-compose.yml                ← new

How to run

# standalone
docker compose up -d --build

# or directly (dev)
uv sync
uv run uvicorn app.main:app --host 0.0.0.0 --port 8001 --reload

Health check: curl http://localhost:8001/health

refactor: fix security issues, dtype optimisations, and concurrency model