Add dots.ocr JSON support, centralized language helper, comprehensive tests, and lint compliance
Summary
This MR adds support for the dots.ocr JSON format in the extract command, introduces a centralized language helper module, raises test coverage from 0% to 86%, and brings the code into compliance with ruff, mypy (strict mode), and vulture.
Changes Made
- dots.ocr JSON Support: Added parser in extracted_text_upload.py to handle dots.ocr array format alongside existing ASR-style format, with automatic format detection
- Centralized Language Helper: New languages.py module containing all 22 scheduled languages of India with their ISO 639-1 and ISO 639-3 codes, used by both the upload and extract commands
- Test Coverage: Added tests for async logic, JSON parsing, and CLI commands, raising coverage from 0% to 86%
- Code Quality: Resolved ruff, mypy strict-mode, and vulture warnings
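
For reviewers, the automatic format detection described above could look roughly like the sketch below. It is not the code in extracted_text_upload.py. The function names detect_format and extract_text are hypothetical, and the sketch assumes that a dots.ocr result is a top-level JSON array of layout objects with an optional "text" field, while the ASR-style format is a top-level object with a "text" key.

```python
import json
from typing import Any


def detect_format(payload: Any) -> str:
    """Classify a decoded JSON payload as 'dots_ocr', 'asr', or 'unknown'."""
    # Assumption: dots.ocr output is a top-level array of layout objects.
    if isinstance(payload, list) and all(isinstance(item, dict) for item in payload):
        return "dots_ocr"
    # Assumption: the ASR-style format is a top-level object.
    if isinstance(payload, dict):
        return "asr"
    return "unknown"


def extract_text(raw: str) -> str:
    """Return the plain text carried by either supported JSON shape."""
    payload = json.loads(raw)
    fmt = detect_format(payload)
    if fmt == "dots_ocr":
        # Skip elements without text (for example pictures) and join the rest.
        parts = [str(item["text"]).strip() for item in payload if item.get("text")]
        return "\n".join(parts)
    if fmt == "asr":
        # Assumption: the transcript lives under a "text" key.
        return str(payload.get("text", "")).strip()
    raise ValueError("unrecognized extracted-text JSON format")
```

Dispatching on the top-level JSON type keeps detection cheap and avoids a user-facing format flag, at the cost of misclassifying any future format that also happens to be an array.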
Key Files Changed
- src/corpus_client_cli/extracted_text_upload.py - dots.ocr parser, format detection
- src/corpus_client_cli/languages.py - new centralized language module
- src/corpus_client_cli/cli.py - type annotations, language helper integration
- src/corpus_client_cli/upload.py - language helper integration, type annotations
- tests/ - comprehensive test suite added
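
To illustrate the idea behind languages.py, here is a minimal sketch of a shared lookup that both commands could call. It is not the module's actual API. The Language dataclass and resolve_language function are hypothetical names, and only four of the 22 languages are listed to keep the example short. Bodo is included because it has an ISO 639-3 code but no ISO 639-1 code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Language:
    name: str
    iso639_3: str
    iso639_1: Optional[str] = None  # Some languages have no two-letter code.


# Illustrative subset only; the real module is described as covering all 22.
_LANGUAGES = (
    Language("Hindi", "hin", "hi"),
    Language("Tamil", "tam", "ta"),
    Language("Bengali", "ben", "bn"),
    Language("Bodo", "brx"),
)


def resolve_language(value: str) -> Language:
    """Look up a language by name, ISO 639-1 code, or ISO 639-3 code."""
    key = value.strip().lower()
    for lang in _LANGUAGES:
        if key in (lang.name.lower(), lang.iso639_3, lang.iso639_1):
            return lang
    raise ValueError(f"unsupported language: {value!r}")
```

Keeping one table and one resolver means the upload and extract commands accept the same inputs and fail with the same error for unsupported values.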
Edited by Ahlad Pataparla