feat: add CSV-based upload support and patch option for existing records
Problem Statement
Currently, the corpus-client upload_files command uploads files from a single source (local directory or S3 bucket) with shared metadata entered via CLI prompts. Users must enter fields such as title, description, and language manually for every batch, and every file in a batch receives the same metadata even when the files need different values. There is also no way to update existing records in bulk from CSV data.
Feature 1: CSV-based Upload for New Records
Goal
Enable batch uploads where each file can have different metadata via CSV input, eliminating manual CLI prompts.
Proposed CLI Usage
corpus-client upload_files /path/to/files --csv /path/to/metadata.csv
Expected CSV Format
| filename | title | description | language | category_ids | release_rights | creator | latitude | longitude | source_label | published_date |
|---|---|---|---|---|---|---|---|---|---|---|
| audio1.mp3 | Folk Song from Kerala | A beautiful traditional... | malayalam | cat-uuid-1,cat-uuid-2 | creator | | 10.8505 | 76.2711 | SOAI 2025 | 2024-01-15 |
Required columns
- filename - relative path to the file in the provided directory
- title
- description (min 32 chars)
- language
- category_ids (comma-separated UUIDs)
- release_rights
Optional columns
- creator (required if release_rights=others)
- latitude
- longitude
- source_label
- published_date
Behavior
- If a CSV is provided, fail if any row is missing a value for a required column
- Files are resolved relative to the provided path argument (local dir or S3 prefix)
- Auto-detect media_type from file extension if not provided
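The validation rules above could be sketched roughly as follows. This is a minimal illustration, not the corpus-client implementation: the function name `validate_upload_csv` is hypothetical, and it assumes that `media_type` is the top-level MIME type (for example `audio`), which the issue does not specify.

```python
import csv
import io
import mimetypes

REQUIRED_COLUMNS = ("filename", "title", "description", "language",
                    "category_ids", "release_rights")
MIN_DESCRIPTION_CHARS = 32


def validate_upload_csv(csv_text):
    """Return (rows, errors). Rows should only be uploaded when errors is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [], ["missing required columns: " + ", ".join(missing)]

    rows, errors = [], []
    for row_no, raw in enumerate(reader, start=1):
        # Short rows yield None values; extra cells land under the key None.
        row = {k: (v or "").strip() for k, v in raw.items() if k is not None}
        for column in REQUIRED_COLUMNS:
            if not row[column]:
                errors.append(f"row {row_no}: {column} is empty")
        if row["description"] and len(row["description"]) < MIN_DESCRIPTION_CHARS:
            errors.append(
                f"row {row_no}: description is shorter than "
                f"{MIN_DESCRIPTION_CHARS} characters")
        if row["release_rights"] == "others" and not row.get("creator"):
            errors.append(f"row {row_no}: creator is required when release_rights=others")
        if not row.get("media_type"):
            guessed, _ = mimetypes.guess_type(row["filename"])
            # Assumption: media_type is the top-level MIME type, e.g. "audio".
            row["media_type"] = guessed.split("/")[0] if guessed else ""
        rows.append(row)
    return rows, errors
```

Collecting all errors before failing lets the user fix the whole CSV in one pass instead of re-running the command once per bad row.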
Feature 2: --patch Flag for Updating Existing Records
Goal
Update existing records using CSV data, where each row contains a record uid and the fields to modify.
Proposed CLI Usage
corpus-client upload_files /path/to/files --csv /path/to/updates.csv --patch
Expected CSV Format
| uid | title | description | language | release_rights | creator | latitude | longitude | source_label |
|---|---|---|---|---|---|---|---|---|
| record-uid-1 | Updated Title | Updated description | | | | 17.4459 | 78.3504 | |
| record-uid-2 | | | kannada | others | Author Name | | | |
Behavior
- Only non-empty fields in CSV are updated (PATCH semantics)
- Empty cells = no change to that field
- uid is required for identifying which record to update
- Calls backend PATCH /api/v1/records/{record_id} endpoint
PATCH-able fields
- title
- description
- location (latitude/longitude)
- release_rights
- language
- published_date
- creator
- source_label
- source_url
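The PATCH semantics described above (empty cell = no change) might be sketched like this. The helper name `build_patch` and the nested `location` payload shape are assumptions made for illustration; the issue does not define the request body, and requiring latitude and longitude together is likewise an assumption.

```python
# Scalar fields copied as-is when non-empty (taken from the PATCH-able list).
PATCHABLE_FIELDS = ("title", "description", "release_rights", "language",
                    "published_date", "creator", "source_label", "source_url")


def build_patch(row):
    """Return (path, payload) for one CSV row, skipping empty cells."""
    uid = (row.get("uid") or "").strip()
    if not uid:
        raise ValueError("uid is required to identify the record")

    payload = {}
    for field in PATCHABLE_FIELDS:
        value = (row.get(field) or "").strip()
        if value:  # empty cell means "leave this field unchanged"
            payload[field] = value

    lat = (row.get("latitude") or "").strip()
    lon = (row.get("longitude") or "").strip()
    if lat or lon:
        if not (lat and lon):
            raise ValueError("latitude and longitude must be given together")
        # Assumption: the backend accepts location as a nested object.
        payload["location"] = {"latitude": float(lat), "longitude": float(lon)}

    return f"/api/v1/records/{uid}", payload
```

For the first example row, this would produce a payload containing only `title`, `description`, and `location`; the other fields are omitted from the request rather than sent as empty strings.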
Backward Compatibility
- If CSV is NOT provided, fall back to current interactive prompt flow
- Works with a single file, directory, or S3 URI as before
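One way to keep the existing flow as the default is to make both new options optional and choose the mode from what was passed. The sketch below uses argparse purely for illustration; the CLI framework actually used by corpus-client is not stated here, and the rule that `--patch` requires `--csv` is an assumption inferred from the proposed usage.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="corpus-client")
    commands = parser.add_subparsers(dest="command", required=True)
    upload = commands.add_parser("upload_files")
    upload.add_argument("path", help="single file, directory, or S3 URI")
    upload.add_argument("--csv", dest="csv_path", default=None,
                        help="per-file metadata (or updates when --patch is set)")
    upload.add_argument("--patch", action="store_true",
                        help="update existing records instead of uploading")
    return parser


def select_mode(args):
    """Map parsed arguments to one of three flows."""
    if args.patch and args.csv_path is None:
        # Assumption: patching without a CSV has nothing to apply.
        raise SystemExit("--patch requires --csv")
    if args.csv_path is None:
        return "interactive"  # current prompt-based behaviour, unchanged
    return "csv_patch" if args.patch else "csv_upload"
```

With this layout, `corpus-client upload_files /path/to/files` still selects the interactive flow, so existing invocations keep working.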
Additional Considerations
- CSV parsing should handle comma-separated category_ids (e.g., "cat-id-1,cat-id-2")
- Validate CSV headers before processing
- Show preview of records to be uploaded/updated before proceeding
- Support resume for interrupted CSV batch operations
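Two of these points can be illustrated together: a quoted cell keeps its embedded commas when read with Python's `csv` module, and a small checkpoint file is one possible way to resume an interrupted batch. The function names and the JSON checkpoint format are hypothetical, chosen only for this sketch.

```python
import csv
import io
import json
from pathlib import Path


def parse_rows(csv_text):
    """Read rows and split the quoted category_ids cell into a list."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        cell = row.get("category_ids") or ""
        row["category_ids"] = [p.strip() for p in cell.split(",") if p.strip()]
    return rows


def run_batch(rows, key, process, checkpoint: Path):
    """Process rows once each, recording finished keys so a rerun can skip them."""
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    for row in rows:
        if row[key] in done:
            continue  # already handled in an earlier, interrupted run
        process(row)  # may raise; the checkpoint then still lists earlier rows
        done.add(row[key])
        checkpoint.write_text(json.dumps(sorted(done)))
    return done
```

Here `key` would be `filename` for uploads and `uid` for patches. Writing the checkpoint after every row trades some I/O for the guarantee that a crash loses at most the row in flight.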
Edited by Ahlad Pataparla