feat: Merge Request for Phase 6: Benchmarking
Implementation:
- Implemented benchmarking engine to process audio files and compute evaluation metrics.
- Integrated with model-router to run inference on each audio sample.
- Implemented asynchronous job workflow: POST /benchmark → initiates benchmark job and returns job_id GET /benchmark/{job_id} → retrieves benchmark results
- Added support for benchmarking multiple models independently.
- Generated structured JSON reports containing evaluation metrics.
Metrics Computed
- Accuracy Metrics:
- Average WER
- Average CER
- SER
- Error Breakdowns: Substitutions, Insertions, Deletions
- Includes per-sample evaluation: Ground truth vs prediction File-level WER/CER S/I/D breakdown
- Latency Metrics
- Average Latency
- std latency
- Min/Max Latency
- p50
- Average RTF
- Min/ Max RTF
- Throughput Metrics
- avg inference sec
- throughput files per sec
- Audio processed per second
- Average RTF
- Estimated cost per minute
- Scalability Metrics
- Measured under increasing concurrent users: 10, 50, 100 (Throughput, Latency)
- Concurrency Metrics
- Requests Tested
- Avg latency, Min/Max latency
- throughput
- Inference Time
- Audio Duration
Testing
- Added integration tests using a small fixture dataset.
- Verified correctness of metrics calculation
- Tested system behavior under: sequential execution concurrent load
*closes #18 (closed)
Edited by vyshnavi