Feature Request: Add new media type "Docs"
Description:
The Corpus Collector App currently supports four media types: Text, Image, Audio, and Video. To better support real-world content formats and expand data collection capabilities, we propose adding a new media type: Docs.
✅ Requested Changes
- Add “Docs” as a new media type
  - Introduce “Docs” alongside existing media types in the UI and backend.
  - Include appropriate labels, icons, and tooltips to distinguish “Docs” from “Text”.
- Support the following file formats under “Docs”:
  - 📄 Document Formats: .txt, .pdf, .docx, .doc, .odt, .rtf
  - 📚 eBook Formats: .epub, .mobi, .azw3 (optional), .fb2 (optional)
💡 Rationale
- Many contributors have existing materials—like articles, papers, or books—in document or eBook formats.
- Unlike plain "Text" input, "Docs" can support longer and more structured content.
- Adding this as a separate media type avoids disrupting current "Text" workflows while enabling broader data intake.
🔧 Implementation Notes
- Enable basic metadata extraction from supported file types (e.g., title, author).
- Add robust file validation (MIME type and extension checks).
- Handle encrypted or unsupported formats gracefully (e.g., DRM-protected .epub).
- Priority: Medium
- Impact: High – Significantly broadens contributor flexibility and data diversity
- Target Release: Next feature update or onboarding sprint
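The file validation item under Implementation Notes (MIME type and extension checks) could be sketched roughly as follows. This is only an illustration, not the app's actual code: the function name `validate_doc_upload`, the `ALLOWED_DOCS` and `SIGNATURES` tables, and the idea of also peeking at the first bytes of the file are all assumptions. The optional formats (.azw3, .fb2) are left out of the sketch.

```python
from pathlib import Path

# Hypothetical table: extension -> accepted MIME types for the required
# formats listed in this request (.azw3 and .fb2 are intentionally omitted).
ALLOWED_DOCS = {
    ".txt": {"text/plain"},
    ".pdf": {"application/pdf"},
    ".docx": {"application/vnd.openxmlformats-officedocument.wordprocessingml.document"},
    ".doc": {"application/msword"},
    ".odt": {"application/vnd.oasis.opendocument.text"},
    ".rtf": {"application/rtf", "text/rtf"},
    ".epub": {"application/epub+zip"},
    ".mobi": {"application/x-mobipocket-ebook"},
}

# Leading bytes for formats with a well-known signature; formats without an
# entry here skip the content check.
SIGNATURES = {
    ".pdf": b"%PDF-",
    ".rtf": b"{\\rtf",
    ".docx": b"PK\x03\x04",  # ZIP container
    ".odt": b"PK\x03\x04",   # ZIP container
    ".epub": b"PK\x03\x04",  # ZIP container
}


def validate_doc_upload(filename, declared_mime, head):
    """Return (ok, reason) for an uploaded file.

    filename      -- original file name, used only for its extension
    declared_mime -- MIME type reported by the client or upload layer
    head          -- the first few bytes of the file content
    """
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_DOCS:
        return False, f"unsupported extension: {ext or '(none)'}"
    if declared_mime not in ALLOWED_DOCS[ext]:
        return False, f"MIME type {declared_mime!r} does not match {ext}"
    signature = SIGNATURES.get(ext)
    if signature is not None and not head.startswith(signature):
        return False, f"file content does not look like {ext}"
    return True, "ok"
```

Returning a reason string alongside the boolean lets the UI show a specific message instead of a generic failure, which fits the "handle unsupported formats gracefully" note. A call such as `validate_doc_upload("paper.pdf", "application/pdf", first_bytes)` would be made before any metadata extraction.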
Edited by Ranjith Raj