feat: Automatic ASR transcription on audio upload
Summary
Every audio record uploaded to the corpus platform is now automatically
transcribed in the background using voice-app-backend, a standalone
FastAPI service providing Swecha Gonthuka (Telugu) and Whisper (multilingual)
ASR. The transcript is stored as an ExtractedText row (extraction_type=asr)
and becomes visible via the existing GET /api/v1/records/{id} response
under the extracted_text field as soon as the background task completes.
No new API endpoints. No change to upload response time — transcription is fully async.
Architecture
User uploads audio
→ POST /api/v1/records/upload
→ record saved to DB
→ transcribe_audio_record.delay(record_id) ← fire and forget
→ HTTP 200 returned immediately
[Celery worker — background]
→ download audio from Hetzner via presigned URL
→ POST audio to voice-app-backend /transcribe → {"job_id": "..."}
→ poll GET /transcribe/{job_id} until completed (4s interval, 300s max)
→ upsert ExtractedText row in DB
segments = word-level chunks if available, else [{"text": "full transcript"}]
Frontend polls GET /api/v1/records/{id}
→ extracted_text: null (in progress)
→ extracted_text.text: "..." (done)
voice-app-backend is a separate service — it owns its own process, models,
and docker-compose. corpus-server-app only needs its URL (ASR_SERVICE_URL).
The two stacks are fully decoupled.
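The worker-side polling step in the flow above (4 s interval, 300 s cap) could look roughly like the sketch below. It is not the code in app/tasks/transcription.py: the names poll_job, fetch_status and TranscriptionTimeout are hypothetical, and the intermediate "processing" status is an assumption (only completed and failed are named in this description). The HTTP call is injected as a callable so the loop can be exercised without a running voice-app-backend.

```python
import time
from typing import Callable

POLL_INTERVAL_S = 4.0   # interval stated in the flow above
POLL_TIMEOUT_S = 300.0  # hard cap stated in the flow above


class TranscriptionTimeout(Exception):
    """Raised when the job has not finished within the timeout (hypothetical name)."""


def poll_job(
    fetch_status: Callable[[], dict],
    interval: float = POLL_INTERVAL_S,
    timeout: float = POLL_TIMEOUT_S,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch_status() until it reports a terminal status.

    fetch_status stands in for GET /transcribe/{job_id}; the caller decides
    how the request is made. Returns the final payload, whether the job
    completed or failed, and leaves interpretation to the caller.
    """
    deadline = clock() + timeout
    while True:
        result = fetch_status()
        if result.get("status") in ("completed", "failed"):
            return result
        # Give up if the next poll would land past the deadline.
        if clock() + interval > deadline:
            raise TranscriptionTimeout(f"job not finished after {timeout} s")
        sleep(interval)
```

In a Celery task, letting TranscriptionTimeout propagate is one way to hand the failure to the task's retry policy, which matches the "retries the whole task" behaviour described for this change.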
Changed files
app/tasks/transcription.py (modified)
Celery task transcribe_audio_record(record_id: str):
- Skips non-audio records and when ASR_ENABLED=false
- Downloads audio from Hetzner via a short-lived presigned URL
- Submits audio to POST /transcribe (field file) → gets job_id
- Polls GET /transcribe/{job_id} every 4 s until status=completed|failed (hard timeout 300 s; Celery retries the whole task if exceeded)
- Upserts ExtractedText: stores word-level chunks in segments when the ASR engine returns them, falling back to [{"text": "..."}]
- Auto-retries up to 3× (60 s backoff) on HTTP/network/timeout failures
- Routes to the file_processing Celery queue
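The segments fallback in the upsert step can be illustrated with a small helper. This is a sketch under assumptions: the function name build_segments is hypothetical, and the response keys "chunks" and "text" are inferred from the wording above rather than taken from voice-app-backend's actual schema.

```python
from __future__ import annotations

from typing import Any


def build_segments(asr_result: dict[str, Any]) -> list[dict[str, Any]]:
    """Return word-level chunks when present, else one whole-transcript segment.

    Assumes the ASR payload carries an optional "chunks" list and a "text"
    string; both key names are assumptions for illustration.
    """
    chunks = asr_result.get("chunks")
    if chunks:  # a non-empty list of word-level chunks
        return list(chunks)
    # No chunks (missing, None, or empty): fall back to a single segment.
    return [{"text": asr_result.get("text", "")}]
```

Treating an empty chunks list the same as a missing one keeps segments non-empty whenever a transcript exists, so consumers never need a special case for it.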
app/api/v1/endpoints/records.py (modified)
6-line addition after session.commit() in upload_record. Only fires for
media_type == audio. Upload handler is otherwise untouched.
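The guard described here might be shaped like the following sketch. It is not the actual 6-line diff: the helper name enqueue_transcription_if_audio is hypothetical, and the enqueue parameter stands in for transcribe_audio_record.delay so the example runs without Celery installed.

```python
from typing import Callable


def enqueue_transcription_if_audio(
    record_id: str,
    media_type: str,
    enqueue: Callable[[str], object],
) -> bool:
    """Enqueue background transcription for audio records only.

    Intended to be called after the record has been committed, so the worker
    can load it by ID. Returns True when a task was enqueued.
    """
    if media_type != "audio":
        return False
    # Pass the ID as a plain string so the task payload serialises cleanly.
    enqueue(str(record_id))
    return True
```

In the real handler the call would presumably be enqueue_transcription_if_audio(record.id, record.media_type, transcribe_audio_record.delay), placed after session.commit().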
app/core/config.py (modified)
Two settings:
- ASR_ENABLED (default true): kill switch; disable without redeployment
- ASR_SERVICE_URL (default http://localhost:8001): voice-app-backend URL
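For illustration, the two settings and their defaults can be read from the environment as sketched below. This uses only the standard library and is an assumption about shape, not the contents of app/core/config.py (which may well use a settings library instead); AsrSettings, load_asr_settings and the set of accepted truthy strings are all hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping

DEFAULT_ASR_SERVICE_URL = "http://localhost:8001"  # default stated above
_TRUTHY = ("1", "true", "yes", "on")  # accepted spellings are an assumption


@dataclass(frozen=True)
class AsrSettings:
    asr_enabled: bool = True
    asr_service_url: str = DEFAULT_ASR_SERVICE_URL


def load_asr_settings(env: Mapping[str, str] = os.environ) -> AsrSettings:
    """Build settings from an environment-like mapping, applying defaults."""
    raw_enabled = env.get("ASR_ENABLED", "true").strip().lower()
    url = env.get("ASR_SERVICE_URL", DEFAULT_ASR_SERVICE_URL)
    return AsrSettings(
        asr_enabled=raw_enabled in _TRUTHY,
        # Drop any trailing slash so paths like /transcribe join cleanly.
        asr_service_url=url.rstrip("/"),
    )
```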
app/tasks/__init__.py (modified)
Registers transcription module with Celery so the task is discoverable.
Deployment
1 — Start voice-app-backend (once, on the server)
cd /home/rajasekhar/voice-app-backend
docker compose up -d --build
# first start downloads models (~2 GB); allow ~3 min for health check to pass
curl http://localhost:8001/health
2 — Set env vars in corpus-server-app .env
ASR_ENABLED=true
ASR_SERVICE_URL=http://localhost:8001 # same server
# or http://<ip>:8001 if on a different machine
3 — Deploy corpus-server-app as normal
docker compose up -d
The Celery worker picks up ASR_SERVICE_URL and starts transcribing new
uploads immediately. voice-app-backend is not in corpus docker-compose.yml
— it manages itself.
How to test
# 1. verify voice-app-backend is healthy
curl http://localhost:8001/health
# → {"status":"healthy"}
# 2. upload an audio file
curl -X POST https://api.corpus.swecha.org/api/v1/records/upload \
-F "<field>=@<audio-file>" -F "media_type=audio" ...
# 3. poll until extracted_text is populated
curl https://api.corpus.swecha.org/api/v1/records/<record_id> | jq .extracted_text
# 4. monitor Celery task in Flower
open http://localhost:5555
To disable transcription without redeploying:
ASR_ENABLED=false docker compose up -d celery-worker
Considerations for reviewers
- ExtractedText upsert: re-upload or task retry overwrites the existing row (last-write-wins). Intentional.
- Word-level chunks: if voice-app-backend returns timestamps, they are stored in segments directly. This enables future subtitle/alignment features without a schema change.
- Polling vs webhook: the Celery task blocks its worker slot while polling. On long audio (> 2 min) this ties up one worker for the duration. Acceptable for current upload volumes; revisit if the queue backs up.
- Model routing: voice-app-backend routes Telugu to Swecha Gonthuka (wav2vec2) and all other languages to Whisper automatically based on the record's language field. No corpus-server code needs to know about models.
- ASR_ENABLED flag: allows disabling transcription for non-Telugu language campaigns or during model maintenance without code changes.
- Decoupling: voice-app-backend can be updated, restarted, or replaced (e.g. swapped for a faster model) without touching corpus-server-app; only ASR_SERVICE_URL needs to point at the new instance.
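The last-write-wins upsert from the first consideration can be demonstrated in isolation with an in-memory SQLite database. This is only a sketch of the semantics: the table layout, the (record_id, extraction_type) uniqueness key, and the function name upsert_transcript are assumptions, and the real code presumably goes through the project's ORM models rather than raw SQL.

```python
import json
import sqlite3

# Hypothetical table shape; the real ExtractedText model may differ.
SCHEMA = """
CREATE TABLE extracted_text (
    record_id       TEXT NOT NULL,
    extraction_type TEXT NOT NULL,
    text            TEXT NOT NULL,
    segments        TEXT NOT NULL,
    PRIMARY KEY (record_id, extraction_type)
)
"""

# On a key collision, overwrite text and segments with the incoming values.
UPSERT = """
INSERT INTO extracted_text (record_id, extraction_type, text, segments)
VALUES (?, 'asr', ?, ?)
ON CONFLICT (record_id, extraction_type)
DO UPDATE SET text = excluded.text, segments = excluded.segments
"""


def upsert_transcript(conn: sqlite3.Connection, record_id: str,
                      text: str, segments: list) -> None:
    """Insert the ASR transcript, or overwrite the existing row for this record."""
    conn.execute(UPSERT, (record_id, text, json.dumps(segments)))
    conn.commit()
```

Calling upsert_transcript twice for the same record leaves a single row holding the second transcript, which is the behaviour a task retry or re-upload would produce.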