feat: Pattern-Based File Upload Filtering in Corpus CLI

DALIBOINA SATISH requested to merge category into develop

Problem Statement

Users uploading files through the Corpus CLI were unable to filter or select specific file types when uploading from a directory. The system either:

  • Uploaded all files without discrimination, or
  • Required users to upload files one at a time

This was inefficient for directories containing mixed file types, for example when only the .mp3 files were wanted from a folder holding audio, video, and images.


Solution

Implemented glob pattern support in the upload-files command, enabling users to filter files using standard pattern matching syntax.


Features Added

  • Pattern-Based Filtering

    • Supports patterns like:

      • *.mp3 → Upload only MP3 files
      • *.mp4 → Upload only MP4 files
      • *.jpg, *.jpeg → Upload JPEG images
      • *.csv → Upload CSV files
      • **/*.wav → Recursively upload WAV files
      • subdir/*.mp3 → Upload MP3 files from a specific folder
  • Automatic Pattern Normalization

    • Converts user-friendly inputs:

      • .mp3 → *.mp3
  • Universal File Support

    • Works with:

      • Audio, Video, Images
      • Documents (PDF, CSV, TXT, DOCX, etc.)
      • Any custom file extensions
  • Recursive Directory Support

    • Supports deep directory traversal using **/
  • Backward Compatibility

    • Default behavior unchanged:

      • Empty or * → Upload all files
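
The normalization and default rules listed above can be sketched in a few lines. This is a minimal illustration, not the code from upload.py: the function name normalize_pattern and the exact rule for detecting a bare extension are assumptions made for the example.

```python
from typing import Optional


def normalize_pattern(raw: Optional[str]) -> str:
    """Turn user input into a glob pattern (hypothetical helper)."""
    pattern = (raw or "").strip()
    # Empty input keeps the default behavior: match everything.
    if not pattern:
        return "*"
    # Assumed rule: a bare extension such as ".mp3" (no wildcard, no path
    # separator) becomes "*.mp3". Anything else is passed through untouched.
    has_wildcard = any(ch in pattern for ch in "*?[")
    if pattern.startswith(".") and not has_wildcard and "/" not in pattern:
        return "*" + pattern
    return pattern
```

Under these assumptions, normalize_pattern(".mp3") gives "*.mp3", while patterns that already contain a wildcard or a folder part, such as "**/*.wav" or "subdir/*.mp3", are returned unchanged.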

🛠 Implementation Details

Modified Files:

  • upload.py → Added glob logic in run_record_upload()
  • cli.py → Added --pattern parameter
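
For readers unfamiliar with how such a parameter is wired up, the sketch below shows one way a --pattern option with a backward-compatible default could be declared. It uses argparse from the standard library purely as an assumption; the actual cli.py may be built on a different framework, and build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare an upload-files parser with an optional --pattern (illustrative)."""
    parser = argparse.ArgumentParser(prog="upload-files")
    parser.add_argument(
        "--pattern",
        default="*",  # omitting the flag keeps the old "upload all files" behavior
        help="glob pattern used to filter files, e.g. '*.mp3' or '**/*.wav'",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Using pattern: {args.pattern}")
```

Defaulting to "*" is what keeps existing invocations working: callers who never pass --pattern see no change.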

Key Enhancements:

  • Pattern normalization utility

  • File matching using pathlib.Path.glob()

  • CLI prompt with usage examples

  • Console feedback:

    • Found X files matching pattern 'Y'
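
The matching step and the console feedback can be illustrated as follows. This is a sketch under stated assumptions rather than the body of run_record_upload(): the helper name find_matching_files is hypothetical, and filtering out directories with is_file() is a choice made for the example.

```python
from pathlib import Path
from typing import List, Union


def find_matching_files(directory: Union[str, Path], pattern: str = "*") -> List[Path]:
    """Return files under `directory` that match a glob pattern (hypothetical helper)."""
    root = Path(directory)
    # Path.glob() also yields directories, so keep regular files only.
    # Sorting makes the result order stable across platforms.
    matches = sorted(p for p in root.glob(pattern) if p.is_file())
    print(f"Found {len(matches)} files matching pattern '{pattern}'")
    return matches
```

Note that Path.glob() rejects an empty pattern with a ValueError, which is one reason to normalize empty input to "*" before matching. With pathlib, "*" matches only the top level of the directory, while a "**/" prefix descends into subdirectories.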

📖 Usage Examples

  • Upload MP3 files:

    upload-files --pattern "*.mp3"
  • Upload JPEG images:

    upload-files --pattern "*.jpg"
  • Recursive upload:

    upload-files --pattern "**/*.mp4"
  • Upload all files:

    upload-files

🧪 Testing

  • 7+ test cases covering:

    • Extension filters (.mp3)
    • Wildcards (*.mp3)
    • Recursive patterns (**/*.wav)
    • Subdirectory patterns (subdir/*.mp3)
    • Default behavior
  • Real-world validation:

    • Multiple file types
    • Files with spaces
    • Concurrent uploads
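
The pattern cases listed above could be exercised against a throwaway directory along these lines. The file names and the helpers make_tree, matched, and run_checks are invented for illustration and are not the tests shipped with this merge request; the checks call pathlib.Path.glob() directly rather than the project's upload code.

```python
import tempfile
from pathlib import Path
from typing import List


def make_tree(root: Path) -> None:
    """Create a small mixed-type directory, including a name with a space."""
    for rel in ["song one.mp3", "clip.mp4", "notes.txt",
                "subdir/track.mp3", "subdir/deep/tone.wav"]:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()


def matched(root: Path, pattern: str) -> List[str]:
    """Relative POSIX paths of regular files matching the pattern, sorted."""
    return sorted(p.relative_to(root).as_posix()
                  for p in root.glob(pattern) if p.is_file())


def run_checks() -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        make_tree(root)
        # Wildcard filter, top level only; the name contains a space.
        assert matched(root, "*.mp3") == ["song one.mp3"]
        # Recursive pattern reaches nested folders.
        assert matched(root, "**/*.wav") == ["subdir/deep/tone.wav"]
        # Subdirectory pattern is limited to that folder.
        assert matched(root, "subdir/*.mp3") == ["subdir/track.mp3"]
        # Default pattern returns every top-level file.
        assert matched(root, "*") == ["clip.mp4", "notes.txt", "song one.mp3"]
    return True
```

Calling run_checks() raises an AssertionError on the first mismatch and returns True otherwise, so the same assertions drop into a pytest function with little change.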

Benefits

  • Efficient batch uploads
  • Improved user experience
  • Flexible and powerful filtering
  • Works across all file types
  • Fully backward compatible

Acceptance Criteria

  • Users can apply glob patterns
  • System correctly filters files
  • Pattern normalization works
  • Recursive patterns supported
  • Clear CLI feedback provided
  • All tests pass
Edited by DALIBOINA SATISH
