Feat: add glob pattern filtering to upload_files command

Context: Why this MR?

Previously, the upload_files command was a "take it all" tool. If a user pointed the script at a directory, it would attempt to upload every file it found, regardless of type, size, or relevance.

This created several pain points:
Lack of Precision: Users couldn't target specific file types (such as only .mp3 or only .pdf) without manually moving files into a new folder.
Efficiency Waste: Bandwidth and time were wasted uploading system junk (such as .DS_Store) or temporary draft files.
Recursive Limitations: There was no easy way to tell the script to search sub-folders for specific patterns.

This MR introduces "Glob Pattern" support, transforming the uploader into a precision tool that allows users to filter exactly what they want to contribute to the Corpus.

Key Changes

  1. Interface Updates (cli.py)
     New Argument: Added a --pattern (or -p) option to the upload_files command.
     Default Behavior: The pattern defaults to *, so the command still uploads everything unless the user specifies a filter.
     UX Improvement: The UI panel header now displays the active filter, for example Record Upload (Filter: *.wav), so the user gets immediate confirmation that filtering is on.

  2. Logic Improvements (upload.py)
     Dynamic File Selection: Replaced the hardcoded directory listing with path.glob(pattern). This lets the script handle recursive queries such as **/*.txt (search all sub-folders).
     Smart User Feedback: Added a "Pre-check" step. Before starting the upload, the script prints how many files matched the pattern.
     Error Handling: Improved the "No files found" message. If a pattern matches nothing, the script now asks Did the pattern '{pattern}' filter everything out? instead of failing silently.
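For reviewers who want to reason about the behaviour without opening the diff, the changes above can be approximated with the standalone sketch below. It is not the code in cli.py or upload.py: the function names (build_parser, select_files, describe_selection, panel_title), the use of argparse, the exact wording of the pre-check line, and the decision to skip directories are all assumptions made for illustration. Only the --pattern/-p option, the * default, path.glob(pattern), the panel header text, and the "filter everything out" hint come from this MR.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real cli.py may use a different CLI framework.
    parser = argparse.ArgumentParser(prog="corpus-client")
    sub = parser.add_subparsers(dest="command", required=True)
    upload = sub.add_parser("upload")
    upload.add_argument("--path", type=Path, required=True)
    # Defaulting to "*" keeps the previous behaviour when no filter is given.
    upload.add_argument("-p", "--pattern", default="*")
    return parser


def select_files(root: Path, pattern: str) -> list[Path]:
    # Path.glob also yields directories, so keep regular files only
    # (an assumption about what the uploader should send).
    return sorted(p for p in root.glob(pattern) if p.is_file())


def describe_selection(files: list[Path], pattern: str) -> str:
    # Pre-check message printed before the upload starts (wording assumed).
    if not files:
        return f"No files found. Did the pattern '{pattern}' filter everything out?"
    return f"Found {len(files)} file(s) matching '{pattern}'."


def panel_title(pattern: str) -> str:
    # Header text shown in the UI panel, as described in this MR.
    return f"Record Upload (Filter: {pattern})"


if __name__ == "__main__":
    args = build_parser().parse_args()
    files = select_files(args.path, args.pattern)
    print(panel_title(args.pattern))
    print(describe_selection(files, args.pattern))
```

Keeping the selection step separate from the upload call makes the pre-check trivial: the file list exists in full before any network traffic starts, so the count and the empty-result hint can both be derived from it.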

How to Test

Single Extension: corpus-client upload --path ./data --pattern "*.pdf" (should upload only PDFs).
Deep Search: corpus-client upload --path ./data --pattern "**/*.wav" (should find WAV files in all sub-folders).
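To check the two test cases without touching a real corpus, the pattern semantics can be tried against a throwaway directory. The sketch below assumes the intended patterns are "*.pdf" and "**/*.wav" and that matching is done with pathlib's Path.glob, as this MR describes; the helper names (matches, demo) and the sample file names are made up for illustration.

```python
import tempfile
from pathlib import Path


def matches(root: Path, pattern: str) -> list[str]:
    # Matching regular files as sorted POSIX-style paths relative to root.
    return sorted(
        p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file()
    )


def demo() -> dict[str, list[str]]:
    # Build a small temporary tree with PDFs and WAVs at several depths.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "nested" / "deep").mkdir(parents=True)
        for rel in (
            "report.pdf",
            "take1.wav",
            "nested/notes.pdf",
            "nested/take2.wav",
            "nested/deep/take3.wav",
        ):
            (root / rel).write_text("placeholder")
        return {pattern: matches(root, pattern) for pattern in ("*.pdf", "**/*.wav")}


if __name__ == "__main__":
    for pattern, found in demo().items():
        print(pattern, "->", found)
```

In this tree "*.pdf" matches only the top-level report.pdf, while "**/*.wav" matches take1.wav plus the two WAV files under nested. On the command line the pattern should stay quoted so the shell passes it through unexpanded.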

Impact

Cleaner Data: Better quality control over what enters the Corpus.
Better UX: Saves users from manual file organization.
Performance: Faster uploads by skipping unnecessary files.

Edited by Vaishnavi Prabhala
