Feat: Word count metrics for machine-extracted text
✨ Feature Request
Summary: Add word count functionality (backfill and real-time) to the Corpus Platform using space-split, word-level counting based on Python's built-in str.split() method.
Problem Statement: The Corpus Platform currently lacks a word count feature, which is essential for text analysis and evaluation metrics.
Proposed Solution: Implement two modes of word count:
- Backfill: Process existing documents in the corpus to compute and store word counts.
- Real-time: Compute word count for documents as they are uploaded or updated, providing immediate feedback.
In both modes, the word count is calculated by splitting the text on spaces (after normalizing whitespace) and counting the resulting tokens.
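A minimal sketch of the counting rule described above, assuming the extracted text is available as a plain Python string. The function name `count_words` is hypothetical and not part of the platform's existing code.

```python
def count_words(text: str) -> int:
    """Return the number of whitespace-delimited tokens in `text`.

    Hypothetical helper for illustration only.
    """
    if not text:
        return 0
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and ignores leading/trailing whitespace,
    # so it covers the whitespace normalization step as well.
    return len(text.split())
```

Calling `split()` without an argument matters here: `"a  b".split(" ")` produces an empty string between the two spaces and would inflate the count, whereas `"a  b".split()` yields exactly two tokens.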
Alternatives Considered:
Benefits:
- Provides essential metrics for extracted text analysis (both OCR-extracted and ASR-transcribed).
- Helps users monitor and control document lengths.
- Can be integrated into evaluation metrics for ASR and other NLP tasks.
- The backfill feature ensures that past uploads and extractions are also counted.
Risks or Concerns:
- Performance: Processing large documents for backfill might be resource-intensive. Hence, batch processing and efficient libraries should be used.
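One possible way to keep the backfill's memory use bounded is to stream documents in fixed-size batches. The sketch below assumes documents arrive as `(doc_id, text)` pairs from some iterable; the names `batched`, `backfill_word_counts`, and `batch_size` are hypothetical, and how each batch of counts is persisted is left to the platform.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple


def batched(items: Iterable, size: int) -> Iterator[List]:
    """Yield successive lists of at most `size` items from `items`."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def backfill_word_counts(
    documents: Iterable[Tuple[str, str]], batch_size: int = 500
) -> Iterator[Dict[str, int]]:
    """Yield one {doc_id: word_count} mapping per batch of documents.

    Hypothetical sketch: the caller decides how each batch is stored.
    """
    for batch in batched(documents, batch_size):
        # Whitespace-delimited token count, same rule as the real-time path.
        yield {doc_id: len(text.split()) for doc_id, text in batch}
```

Because both functions are generators, only one batch of documents is held in memory at a time, and each yielded mapping can be written to storage before the next batch is read.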
Additional context or references:
Edited by Dr. Praveen Gorla