feat: Support S3 file references in CSV-based uploads
Description
Currently, the CSV file used for uploading data to the corpus only supports local file paths. Files must therefore be present on the local system before ingestion, which reduces the flexibility and scalability of the data upload process.
To improve the data ingestion workflow, we should extend support to include S3 bucket URLs in the CSV file. This enhancement will allow users to directly reference files stored in S3, eliminating the need for manual downloads or local file management.
Proposed Enhancement
- Enable the CSV parser to accept S3 URLs (e.g., s3://bucket-name/path/to/file or HTTPS S3 links).
- Implement logic to fetch and process files directly from S3.
- Handle authentication and authorization properly (e.g., IAM roles, access keys, or pre-signed URLs).
- Maintain backward compatibility with existing local file path support.
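As a rough illustration of how the parser could tell the two kinds of entries apart, here is a minimal sketch using only the standard library. The names `FileRef` and `parse_file_ref` are hypothetical and not part of the existing codebase. The sketch assumes HTTPS links use the virtual-hosted style (bucket name as the first part of the host); path-style links and pre-signed URLs (whose query string is ignored here) would need separate handling. Anything that is not recognized as an S3 reference is treated as a local path, which keeps existing CSV files working unchanged.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import unquote, urlparse


@dataclass(frozen=True)
class FileRef:
    """Hypothetical result type: where a CSV entry points."""
    kind: str                     # "local" or "s3"
    path: str                     # local path, or the S3 object key
    bucket: Optional[str] = None  # set only when kind == "s3"


def parse_file_ref(value: str) -> FileRef:
    """Classify one CSV cell as a local path or an S3 reference."""
    value = value.strip()
    parsed = urlparse(value)

    # s3://bucket-name/path/to/file
    if parsed.scheme == "s3":
        return FileRef("s3", parsed.path.lstrip("/"), parsed.netloc)

    # Assumed virtual-hosted style: https://<bucket>.s3.<region>.amazonaws.com/<key>
    host = parsed.netloc
    if parsed.scheme == "https" and ".s3." in host and host.endswith(".amazonaws.com"):
        bucket = host.split(".s3.", 1)[0]
        return FileRef("s3", unquote(parsed.path.lstrip("/")), bucket)

    # Everything else falls back to the existing local-path behavior.
    return FileRef("local", value)
```

With a classification step like this, the fetch logic only has to branch on `kind`; credentials (IAM role, access keys) would be resolved by whatever S3 client is used downstream.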
Benefits
- Streamlines data ingestion by removing dependency on local storage.
- Improves scalability for large datasets.
- Aligns with cloud-native workflows and storage practices.
Acceptance Criteria
- CSV file accepts both local file paths and S3 URLs.
- Files referenced via S3 URLs are successfully fetched and processed.
- Invalid or inaccessible S3 paths are reported with clear error messages.
- Documentation is updated to reflect the new capability.
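One possible way to exercise these criteria is sketched below. It is an assumption-laden outline rather than the project's actual ingestion code: `ingest_csv`, the `file_path` column name, and the injected `fetch_s3` callable are all hypothetical. In a real implementation `fetch_s3` might wrap an S3 client call such as boto3's `get_object`; it is injected here so the error-handling path can be tested without network access. Failures are collected per row instead of aborting the whole upload.

```python
import csv
import io
from typing import Callable, List, Tuple


def ingest_csv(
    csv_text: str,
    fetch_s3: Callable[[str], bytes],  # hypothetical downloader for s3:// URLs
    column: str = "file_path",         # assumed column name
) -> Tuple[List[Tuple[str, int]], List[Tuple[str, str]]]:
    """Return (processed, failed): sizes of files read, and per-row errors."""
    processed: List[Tuple[str, int]] = []
    failed: List[Tuple[str, str]] = []

    for row in csv.DictReader(io.StringIO(csv_text)):
        ref = (row.get(column) or "").strip()
        try:
            if ref.startswith("s3://"):
                data = fetch_s3(ref)
            else:
                # Existing behavior: treat the entry as a local file path.
                with open(ref, "rb") as fh:
                    data = fh.read()
        except Exception as exc:
            # Broad on purpose in this sketch; real code would likely narrow
            # this to OSError plus the S3 client's own error types.
            failed.append((ref, f"{type(exc).__name__}: {exc}"))
            continue
        processed.append((ref, len(data)))

    return processed, failed
```

Returning the failures alongside the successes lets the caller decide whether a single bad S3 path should fail the whole upload or only be reported.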