Feature Request: Extract and Store Metadata from Uploaded Files
Description: As part of enhancing the corpus collection app, we should add the ability to automatically extract and store metadata from uploaded files (documents, images, audio, and video). This will improve searchability, organization, and analysis of the collected corpus.
Motivation:
- Metadata provides valuable context and makes it easier to categorize, search, and analyze files.
- Users can benefit from advanced filtering and sorting based on metadata fields.
- Automated extraction reduces manual effort and ensures consistency.
Proposed Metadata Fields: For each media type, the following metadata should be extracted:
Documents (PDF, DOCX, TXT, etc.):
| Metadata Field | Description |
|---|---|
| Title | |
| Author | |
| Creation Date | |
| Modification Date | |
| File Size | |
| Page Count | |
| Language | |
| Keywords | |
| Subject | |
| File Type | |
| Word Count | |
| Character Count | |
| Paragraph Count | |
| Fonts Used | |
| Encryption Status | |
| Software Used | |
| Last Printed Date | |
| Version | |
| Custom Properties | |
Images (JPEG, PNG, GIF, etc.):
| Metadata Field | Description |
|---|---|
| File Name | |
| File Size | |
| Image Dimensions | |
| Resolution | |
| Color Space | |
| Color Depth | |
| Camera Make | |
| Camera Model | |
| Date Taken | |
| GPS Coordinates | |
| Orientation | |
| Exposure Time | |
| Aperture | |
| ISO | |
| Focal Length | |
| Flash | |
| Software Used | |
| Copyright | |
| EXIF Data | |
| IPTC Data | |
| XMP Data | |
Audio (MP3, WAV, AAC, etc.):
| Metadata Field | Description |
|---|---|
| Title | |
| Artist | |
| Album | |
| Track Number | |
| Genre | |
| Duration | |
| Bitrate | |
| Sample Rate | |
| Channels | |
| File Size | |
| Recording Date | |
| Encoding Software | |
| Copyright | |
| Lyrics | |
| ID3 Tags | |
| BPM | |
| Mood/Theme | |
Video (MP4, AVI, MOV, etc.):
| Metadata Field | Description |
|---|---|
| Title | |
| Duration | |
| Resolution | |
| Frame Rate | |
| Bitrate | |
| Codec | |
| Audio Codec | |
| File Size | |
| Creation Date | |
| Modification Date | |
| Camera Make | |
| Camera Model | |
| GPS Coordinates | |
| Orientation | |
| Software Used | |
| Copyright | |
| Thumbnail | |
| Subtitles | |
| Scene Information | |
| Aspect Ratio | |
Scope:
- Extract metadata for all supported file types: documents (PDF, DOCX, TXT), images (JPEG, PNG, GIF), audio (MP3, WAV, AAC), and video (MP4, AVI, MOV).
- Store extracted metadata in a structured format (e.g., JSON or database).
- Display relevant metadata in the file details view.
- Allow users to edit or supplement metadata as needed.
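One way the storage and editing items above could fit together is sketched below. It is a minimal illustration, not a proposed schema: the table name `file_metadata`, its columns, and the helper functions are all hypothetical, and SQLite with JSON text columns is assumed only to keep the example self-contained. Extracted values and user edits are stored separately so that re-running extraction never overwrites what a user typed.

```python
import json
import sqlite3

# Hypothetical schema: one row per uploaded file. Extracted metadata and
# user-supplied edits live in separate JSON columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_metadata (
    file_path  TEXT PRIMARY KEY,
    media_type TEXT NOT NULL,
    extracted  TEXT NOT NULL DEFAULT '{}',
    user_edits TEXT NOT NULL DEFAULT '{}'
)
"""


def open_store(db_path=":memory:"):
    """Open (or create) the metadata store."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def save_extracted(conn, file_path, media_type, extracted):
    """Insert or refresh the automatically extracted metadata for a file."""
    payload = json.dumps(extracted)
    cur = conn.execute(
        "UPDATE file_metadata SET media_type = ?, extracted = ? WHERE file_path = ?",
        (media_type, payload, file_path),
    )
    if cur.rowcount == 0:  # no existing row, so create one
        conn.execute(
            "INSERT INTO file_metadata (file_path, media_type, extracted) VALUES (?, ?, ?)",
            (file_path, media_type, payload),
        )
    conn.commit()


def save_user_edits(conn, file_path, edits):
    """Merge user-supplied fields into the stored edits for a file."""
    row = conn.execute(
        "SELECT user_edits FROM file_metadata WHERE file_path = ?", (file_path,)
    ).fetchone()
    if row is None:
        raise KeyError(file_path)
    merged = {**json.loads(row[0]), **edits}
    conn.execute(
        "UPDATE file_metadata SET user_edits = ? WHERE file_path = ?",
        (json.dumps(merged), file_path),
    )
    conn.commit()


def load_metadata(conn, file_path):
    """Return the metadata for display: user edits take precedence."""
    row = conn.execute(
        "SELECT extracted, user_edits FROM file_metadata WHERE file_path = ?",
        (file_path,),
    ).fetchone()
    if row is None:
        return None
    return {**json.loads(row[0]), **json.loads(row[1])}
```

With this layout the file details view would call `load_metadata`, while a re-extraction job would call only `save_extracted`. Whether user edits should always win over extracted values is a design choice this sketch assumes rather than settles.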
Technical Approach:
- Use existing libraries/tools for extraction:
  - Documents: Apache Tika, pypdf (the successor to the deprecated PyPDF2), python-docx
  - Images: Pillow, ExifTool, OpenCV
  - Audio: Mutagen, FFmpeg, eyeD3
  - Video: FFmpeg, MediaInfo, OpenCV
- Store metadata in the database alongside file references.
- Ensure privacy and security, especially for sensitive metadata (e.g., GPS coordinates).
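The approach above could be wired together roughly as follows. This is a sketch under stated assumptions: the `EXTRACTORS` registry, the field names, and the `SENSITIVE_FIELDS` default are hypothetical, and Python's standard-library `wave` module stands in for Mutagen or FFmpeg purely so the example runs without third-party dependencies. A real implementation would register one extractor per supported format, backed by the libraries listed above.

```python
import os
import wave

# Assumed default: fields stripped before metadata is shown or shared.
SENSITIVE_FIELDS = frozenset({"gps_coordinates"})


def extract_common(path):
    """Fields available for every file, regardless of type."""
    return {
        "file_name": os.path.basename(path),
        "file_size": os.path.getsize(path),
    }


def extract_wav(path):
    """Read basic audio properties from a WAV file using the stdlib."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_seconds": frames / rate if rate else 0.0,
        }


# Hypothetical registry mapping a file extension to its extractor.
# Other formats would be added here (for example ".pdf" or ".jpg").
EXTRACTORS = {".wav": extract_wav}


def extract_metadata(path):
    """Combine common fields with type-specific ones when available."""
    metadata = extract_common(path)
    extractor = EXTRACTORS.get(os.path.splitext(path)[1].lower())
    if extractor is not None:
        metadata.update(extractor(path))
    return metadata


def redact(metadata, sensitive=SENSITIVE_FIELDS):
    """Return a copy of the metadata without sensitive fields."""
    return {key: value for key, value in metadata.items() if key not in sensitive}
```

Keeping redaction as a separate step leaves open whether sensitive fields are dropped at ingestion time or only hidden at display time, which the issue does not yet decide.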
Acceptance Criteria:
- Metadata is extracted for all supported file types.
- Extracted metadata is stored and associated with each file.
- Users can view and edit metadata in the app.
- Metadata is used to enhance search and filtering.
- Documentation is updated to reflect the new feature.
Out of Scope:
- Manual metadata entry (unless supplementing extracted data).
- Advanced analytics or AI-based metadata generation (for now).
Open Questions:
- Should we allow users to customize which metadata fields are displayed or editable?
- How should we handle files with missing or corrupted metadata?
- Should we provide an API for programmatic access to metadata?
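To make the second question concrete, one possible option (an assumption for discussion, not a decision) is to record an extraction status instead of failing the upload. In the sketch below, `ExtractionResult`, `safe_extract`, and the three status values are hypothetical names; `extractor` is any callable that takes a path and returns a dict.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionResult:
    status: str                      # "ok", "empty", or "failed"
    metadata: dict = field(default_factory=dict)
    error: str = ""


def safe_extract(extractor, path):
    """Run an extractor without letting a bad file abort the upload."""
    try:
        metadata = extractor(path)
    except Exception as exc:  # corrupted or unreadable metadata
        return ExtractionResult(status="failed", error=f"{type(exc).__name__}: {exc}")
    # Drop fields the file simply does not carry.
    cleaned = {key: value for key, value in metadata.items() if value not in (None, "")}
    if not cleaned:
        return ExtractionResult(status="empty")
    return ExtractionResult(status="ok", metadata=cleaned)
```

Storing the status alongside the file would let the app distinguish "no metadata present" from "extraction failed" and offer manual supplementation in either case.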
Additional Context:
- This feature will be especially useful for researchers and analysts who need to organize and query large collections of files.
Labels:
enhancement, metadata, feature request