Add diarization + ASR backend pipeline integration
Description
This Merge Request introduces a new integrated API endpoint that combines speaker diarization and automatic speech recognition (ASR) into a single, cohesive pipeline. The implementation enables processing of uploaded audio files to produce speaker-labeled transcriptions in a structured JSON format.
Key Changes
- New API Endpoint
Added POST /diarize-transcribe in app/main.py. Accepts audio file uploads with an optional language selection parameter. Supports automatic language handling when “other” is selected.
- Pipeline Integration
Orchestrates the complete workflow:
  - Performs speaker diarization to segment audio by speaker.
  - Applies ASR transcription on each segmented portion.
  - Merges results into a unified response.
Ensures smooth interaction between diarization and transcription modules.
- Data Models
Introduced new response models in app/models/shared_models.py:
  - DiarizedSegment: Represents individual segments with speaker label, timestamps, and transcription.
  - DiarizationResponse: Defines the overall structured response containing language and segment list.
Maintains consistency and type safety across the API.
- File Handling & Validation
Validates uploaded audio formats against allowed extensions. Uses the job management system for temporary file storage. Implements proper error handling and logging for reliability.
- Testing
Added integration test test_diarization_integration.py. Uses FastAPI TestClient to validate endpoint behavior. Dynamically generates a test audio file using FFmpeg. Verifies response structure, including speaker labels, timestamps, and transcription fields.
Response Format
    {
      "language": "auto",
      "segments": [
        {
          "speaker": "SPEAKER_0",
          "start": 0.0,
          "end": 2.0,
          "text": "..."
        }
      ]
    }
Summary
This update establishes a unified diarization + transcription pipeline, improves API structure, and ensures a standardized response format.
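Illustrative sketch
For reviewers who want a feel for the data flow described under Key Changes, here is a rough, dependency-free sketch of the extension check, the two response models, and the merge step. It is not the code in this MR. The real models in app/models/shared_models.py are presumably Pydantic models, while this sketch uses stdlib dataclasses so it runs on its own. ALLOWED_EXTENSIONS, is_allowed_audio, build_response, and the transcribe callback are hypothetical names chosen for illustration, and the extension set is a placeholder since the MR does not list the allowed formats.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Callable, List, Tuple

# Assumption: placeholder set; the MR does not state which extensions are allowed.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".flac"}


@dataclass
class DiarizedSegment:
    """One speaker turn with its timestamps (seconds) and transcription."""
    speaker: str
    start: float
    end: float
    text: str


@dataclass
class DiarizationResponse:
    """Top-level response: the language plus the list of segments."""
    language: str
    segments: List[DiarizedSegment] = field(default_factory=list)


def is_allowed_audio(filename: str) -> bool:
    """Case-insensitive check of the file extension against the allowed set."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS


def build_response(
    turns: List[Tuple[str, float, float]],
    transcribe: Callable[[float, float], str],
    language: str,
) -> DiarizationResponse:
    """Run the (hypothetical) ASR callback on each diarized turn and merge the results."""
    segments = [
        DiarizedSegment(speaker=speaker, start=start, end=end, text=transcribe(start, end))
        for speaker, start, end in turns
    ]
    return DiarizationResponse(language=language, segments=segments)


def fake_transcribe(start: float, end: float) -> str:
    # Stand-in for the ASR module; a real call would transcribe audio[start:end].
    return "..."


# Stand-in diarization output: (speaker label, start, end).
turns = [("SPEAKER_0", 0.0, 2.0)]
response = build_response(turns, fake_transcribe, language="auto")
print(json.dumps(asdict(response), indent=2))
```

With the stand-in inputs above, the printed JSON has the same shape as the example under Response Format. Keeping the ASR step behind a callback is only a convenience for the sketch; it lets the merge logic be exercised without loading any model.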