Add dots.ocr JSON support, centralized language helper, comprehensive tests, and lint compliance

Summary

This MR adds support for the dots.ocr JSON format in the extract command, introduces a centralized language helper module, adds comprehensive test coverage, and brings the code into compliance with ruff, mypy (strict mode), and vulture.

Changes Made

  • dots.ocr JSON Support: Added a parser in extracted_text_upload.py that handles the dots.ocr array format alongside the existing ASR-style format, with automatic format detection
  • Centralized Language Helper: New languages.py module containing all 22 scheduled languages of India with ISO 639-1/3 codes, used by both upload and extract commands
  • Test Coverage: Raised test coverage from 0% to 86%, with tests for async logic, JSON parsing, and CLI commands
  • Code Quality: Resolved all ruff, mypy strict mode, and vulture warnings
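
For reviewers, the automatic format detection could look roughly like the sketch below. This is an illustration only, not the code in extracted_text_upload.py: the function names `detect_format` and `extract_text` are hypothetical, and it assumes that a dots.ocr file is a top-level JSON array of objects with an optional "text" field, while the ASR-style file is a top-level object with a "text" field.

```python
import json
from typing import Any


def detect_format(payload: Any) -> str:
    """Classify a parsed JSON payload by its top-level shape (hypothetical helper)."""
    # Assumption: dots.ocr output is a JSON array of layout-element objects.
    if isinstance(payload, list) and all(isinstance(item, dict) for item in payload):
        return "dots_ocr"
    # Assumption: the ASR-style format is a single JSON object.
    if isinstance(payload, dict):
        return "asr"
    raise ValueError("unrecognized extracted-text JSON layout")


def extract_text(raw: str) -> str:
    """Return the plain text carried by either format (hypothetical helper)."""
    payload = json.loads(raw)
    if detect_format(payload) == "dots_ocr":
        # Assumption: elements without a non-empty string "text" field
        # (for example pictures) contribute nothing and are skipped.
        parts = [
            item["text"]
            for item in payload
            if isinstance(item.get("text"), str) and item["text"].strip()
        ]
        return "\n".join(parts)
    # Assumption: the ASR-style object stores its transcript under "text".
    text = payload.get("text", "")
    return text if isinstance(text, str) else ""
```

Detecting on the top-level JSON type keeps the two code paths separate and lets malformed input fail early with a clear error.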

Key Files Changed

  • src/corpus_client_cli/extracted_text_upload.py - dots.ocr parser, format detection
  • src/corpus_client_cli/languages.py - new centralized language module
  • src/corpus_client_cli/cli.py - type annotations, language helper integration
  • src/corpus_client_cli/upload.py - language helper integration, type annotations
  • tests/ - comprehensive test suite added
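
As a rough picture of what a centralized language helper can look like, here is a minimal sketch. It is not the contents of languages.py: the `Language` dataclass, the `LANGUAGES` table, and `resolve_language` are hypothetical names, and only four of the 22 languages are listed. The ISO 639-1 field is optional because some scheduled languages, such as Bodo, have only an ISO 639-3 code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Language:
    """One language entry (hypothetical structure)."""

    name: str
    iso639_1: Optional[str]  # None when no two-letter code exists
    iso639_3: str


# Illustrative subset only; the real module is described as covering all 22.
LANGUAGES = (
    Language("Hindi", "hi", "hin"),
    Language("Tamil", "ta", "tam"),
    Language("Telugu", "te", "tel"),
    Language("Bodo", None, "brx"),
)

# Index every entry by name and by each available code, case-insensitively.
_BY_KEY: dict[str, Language] = {}
for _lang in LANGUAGES:
    for _key in (_lang.name, _lang.iso639_1, _lang.iso639_3):
        if _key is not None:
            _BY_KEY[_key.lower()] = _lang


def resolve_language(value: str) -> Language:
    """Look up a language by name, ISO 639-1 code, or ISO 639-3 code."""
    try:
        return _BY_KEY[value.strip().lower()]
    except KeyError:
        raise ValueError(f"unsupported language: {value!r}") from None
```

With a single lookup table, the upload and extract commands can validate a language argument the same way instead of each keeping its own list.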
Edited by Ahlad Pataparla
