
feat: Automatic ASR transcription on audio upload

Rajasekhar Ponakala requested to merge feat/asr-transcription into develop

Summary

Every audio record uploaded to the corpus platform is now transcribed automatically in the background by voice-app-backend, a standalone FastAPI service providing Swecha Gonthuka (Telugu) and Whisper (multilingual) ASR. The transcript is stored as an ExtractedText row (extraction_type=asr) and, once transcription completes, appears in the existing GET /api/v1/records/{id} response under the extracted_text field.

No new API endpoints. No change to upload response time — transcription is fully async.


Architecture

User uploads audio
  → POST /api/v1/records/upload
    → record saved to DB
    → transcribe_audio_record.delay(record_id)   ← fire and forget
    → HTTP 200 returned immediately

[Celery worker — background]
  → download audio from Hetzner via presigned URL
  → POST audio to voice-app-backend /transcribe    → {"job_id": "..."}
  → poll GET /transcribe/{job_id} until completed  (4s interval, 300s max)
  → upsert ExtractedText row in DB
    segments = word-level chunks if available, else [{"text": "full transcript"}]

Frontend polls GET /api/v1/records/{id}
  → extracted_text: null          (in progress)
  → extracted_text.text: "..."    (done)

voice-app-backend is a separate service — it owns its own process, models, and docker-compose. corpus-server-app only needs its URL (ASR_SERVICE_URL). The two stacks are fully decoupled.
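The segments fallback in the worker flow above can be sketched as follows. This is a minimal illustration, not the code in this MR: the function name build_segments and the response keys "chunks" and "text" are assumptions, since the actual voice-app-backend payload shape is not shown here.

```python
from typing import Any


def build_segments(result: dict[str, Any]) -> list[dict[str, Any]]:
    """Return word-level chunks when present, else one full-transcript segment."""
    # "chunks" is a hypothetical key name for the word-level output.
    chunks = result.get("chunks")
    if chunks:
        return list(chunks)
    # Fallback: a single segment holding the whole transcript.
    return [{"text": result.get("text", "")}]
```

Keeping the fallback in the same list-of-dicts shape means consumers of segments can presumably iterate over it without checking which ASR engine produced the row.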


Changed files

app/tasks/transcription.py (modified)

Celery task transcribe_audio_record(record_id: str):

  • Skips non-audio records, and exits without transcribing when ASR_ENABLED=false
  • Downloads audio from Hetzner via a short-lived presigned URL
  • Submits audio to POST /transcribe (field file) → gets job_id
  • Polls GET /transcribe/{job_id} every 4 s until status=completed|failed (hard timeout 300 s — Celery retries the whole task if exceeded)
  • Upserts ExtractedText: stores word-level chunks in segments when the ASR engine returns them, falling back to [{"text": "..."}]
  • Auto-retries up to 3× (60 s backoff) on HTTP/network/timeout failures
  • Routes to the file_processing Celery queue
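The polling step in the list above (4 s interval, 300 s hard timeout) might look roughly like this. It is a sketch under assumptions: poll_job, TranscriptionTimeout, and the injected fetch_status callable are hypothetical names, and in the real task fetch_status would wrap the GET /transcribe/{job_id} call. Only the completed and failed status values come from this MR.

```python
import time
from typing import Any, Callable


class TranscriptionTimeout(Exception):
    """Raised when the job does not finish within the polling budget."""


def poll_job(
    fetch_status: Callable[[], dict[str, Any]],
    interval: float = 4.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict[str, Any]:
    """Poll until the job reports completed or failed, or the timeout expires."""
    deadline = clock() + timeout
    while True:
        payload = fetch_status()
        if payload.get("status") in ("completed", "failed"):
            return payload
        # Give up if another sleep would push us past the deadline.
        if clock() + interval > deadline:
            raise TranscriptionTimeout(f"job not finished after {timeout:.0f}s")
        sleep(interval)
```

Injecting sleep and clock keeps the loop testable without real waiting. Raising on timeout, rather than returning, is what would let the task's retry policy take over.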

app/api/v1/endpoints/records.py (modified)

6-line addition after session.commit() in upload_record. Only fires for media_type == audio. Upload handler is otherwise untouched.
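For reviewers who want the shape of that hook without opening the diff, it is roughly equivalent to the following. This is an illustrative sketch only: maybe_enqueue_transcription and the injected enqueue callable are hypothetical, standing in for the transcribe_audio_record.delay(record_id) call.

```python
from typing import Callable


def maybe_enqueue_transcription(
    record_id: str,
    media_type: str,
    enqueue: Callable[[str], object],
) -> bool:
    """Enqueue transcription for audio records only; return True if enqueued."""
    if media_type != "audio":
        return False
    # In the upload handler this would be transcribe_audio_record.delay(record_id).
    enqueue(record_id)
    return True
```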

app/core/config.py (modified)

Two settings:

  • ASR_ENABLED (default true): kill switch; set to false and restart the Celery worker to disable transcription without a code change
  • ASR_SERVICE_URL (default http://localhost:8001) — voice-app-backend URL
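As a rough illustration of how these two settings resolve from the environment, here is a stdlib-only sketch. It is not the project's config.py, which may well use a settings library instead. AsrSettings and load_asr_settings are hypothetical names, and the set of accepted truthy strings is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AsrSettings:
    asr_enabled: bool = True
    asr_service_url: str = "http://localhost:8001"


def load_asr_settings(env: Mapping[str, str] = os.environ) -> AsrSettings:
    """Read ASR_ENABLED and ASR_SERVICE_URL, applying the documented defaults."""
    raw_enabled = env.get("ASR_ENABLED", "true").strip().lower()
    return AsrSettings(
        # Assumed truthy spellings; anything else disables transcription.
        asr_enabled=raw_enabled in ("1", "true", "yes", "on"),
        # Strip a trailing slash so paths like /transcribe join cleanly.
        asr_service_url=env.get("ASR_SERVICE_URL", "http://localhost:8001").rstrip("/"),
    )
```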

app/tasks/__init__.py (modified)

Registers transcription module with Celery so the task is discoverable.


Deployment

1 — Start voice-app-backend (once, on the server)

cd /home/rajasekhar/voice-app-backend
docker compose up -d --build
# first start downloads models (~2 GB); allow ~3 min for health check to pass
curl http://localhost:8001/health

2 — Set env vars in corpus-server-app .env

ASR_ENABLED=true
ASR_SERVICE_URL=http://localhost:8001   # same server
# or http://<ip>:8001 if on a different machine

3 — Deploy corpus-server-app as normal

docker compose up -d

The Celery worker picks up ASR_SERVICE_URL and starts transcribing new uploads immediately. voice-app-backend is not in corpus docker-compose.yml — it manages itself.


How to test

# 1. verify voice-app-backend is healthy
curl http://localhost:8001/health
# → {"status":"healthy"}

# 2. upload an audio file
curl -X POST https://api.corpus.swecha.org/api/v1/records/upload \
  -F "<file_field>=@<audio_file>" -F "media_type=audio" ...

# 3. poll until extracted_text is populated
curl https://api.corpus.swecha.org/api/v1/records/<record_id> | jq .extracted_text

# 4. monitor Celery task in Flower
open http://localhost:5555

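To disable transcription without redeploying (this works only if docker-compose.yml passes ASR_ENABLED from the shell through to the worker; otherwise set ASR_ENABLED=false in .env and restart the worker):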

ASR_ENABLED=false docker compose up -d celery-worker

Considerations for reviewers

  • ExtractedText upsert: re-upload or task retry overwrites the existing row (last-write-wins). Intentional.

  • Word-level chunks: if voice-app-backend returns timestamps, they are stored in segments directly. This enables future subtitle/alignment features without a schema change.

  • Polling vs webhook: the Celery task blocks its worker slot while polling. On long audio (> 2 min) this ties up one worker for the duration. Acceptable for current upload volumes; revisit if the queue backs up.

  • Model routing: voice-app-backend routes Telugu to Swecha Gonthuka (wav2vec2) and all other languages to Whisper automatically based on the record's language field. No corpus-server code needs to know about models.

  • ASR_ENABLED flag: allows disabling transcription for non-Telugu language campaigns or during model maintenance without code changes.

  • Decoupling: voice-app-backend can be updated, restarted, or replaced (e.g. swapped for a faster model) without touching corpus-server-app — only ASR_SERVICE_URL needs to point at the new instance.
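The last-write-wins upsert from the first consideration can be demonstrated with a small self-contained sketch. It uses SQLite from the standard library purely for illustration; the real table, its columns, and the conflict key (record_id, extraction_type) are assumptions, and the actual task presumably goes through the project's ORM and database.

```python
import json
import sqlite3


def upsert_extracted_text(conn, record_id, text, segments):
    """Insert the ASR row, or overwrite it if one already exists (last write wins)."""
    conn.execute(
        """
        INSERT INTO extracted_text (record_id, extraction_type, text, segments)
        VALUES (?, 'asr', ?, ?)
        ON CONFLICT (record_id, extraction_type)
        DO UPDATE SET text = excluded.text, segments = excluded.segments
        """,
        (record_id, text, json.dumps(segments)),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE extracted_text (
        record_id TEXT NOT NULL,
        extraction_type TEXT NOT NULL,
        text TEXT NOT NULL,
        segments TEXT NOT NULL,
        UNIQUE (record_id, extraction_type)
    )
    """
)
# Simulate a first run followed by a task retry for the same record.
upsert_extracted_text(conn, "rec-1", "first attempt", [{"text": "first attempt"}])
upsert_extracted_text(conn, "rec-1", "second attempt", [{"text": "second attempt"}])
rows = conn.execute(
    "SELECT text FROM extracted_text WHERE record_id = 'rec-1'"
).fetchall()
```

After both calls the table holds a single row for the record, containing the second transcript, which is the behavior a retry relies on.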

Edited by Rajasekhar Ponakala
