Feature Request: Add Corpus Collection Guidelines to corpus.swecha.org
Background: viswam.ai aims to build Indic LLMs trained on Indian Knowledge Systems, so the quality and relevance of our corpus are paramount. Currently, corpus.swecha.org allows users to upload files (documents, images, audio, video) with mandatory fields such as Title, Categories, Description, and Release Rights. To support this effort, we need to publish a clear, easily accessible guide for contributors on what types of content to prioritize, how to ensure quality, and what best practices to follow during collection.
Feature Request: We request the addition of a dedicated “Corpus Collection Guidelines” section or page on corpus.swecha.org. This resource should be prominently linked on the upload page and the homepage, ensuring contributors are aware of best practices before submitting content.
Proposed Content for Guidelines:
Scope and Relevance:
- Emphasize the importance of collecting content that represents Indian languages, culture, history, science, literature, and local knowledge systems.
- Encourage diverse sources: textbooks, research papers, folk literature, government documents, news articles, audio-visual lectures, and more.
Quality Standards:
- Specify the preferred formats and minimum quality for each file type (e.g., PDF/DOCX for documents, high resolution for images, clear audio for recordings).
- Discourage low-quality, duplicate, or irrelevant content.
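The format and duplicate rules above could also be backed by an automatic check at upload time. The following is a minimal sketch, not a description of how corpus.swecha.org works: the function name `check_upload`, the file-type keys, and every extension other than PDF/DOCX are assumptions made for illustration, and the check only catches byte-identical duplicates, not near-duplicates or low resolution.

```python
import hashlib
from pathlib import PurePath

# Hypothetical allowlist. Only PDF/DOCX come from the request itself;
# the remaining extensions are placeholders for the guidelines to settle.
ALLOWED_EXTENSIONS = {
    "document": {".pdf", ".docx"},
    "image": {".png", ".jpg", ".jpeg"},
    "audio": {".wav", ".mp3"},
    "video": {".mp4"},
}


def check_upload(filename, data, file_type, seen_hashes):
    """Return a list of problems; an empty list means the file passes.

    `seen_hashes` is a set of SHA-256 hex digests of files already in
    the corpus. It is only read here; the caller records new digests.
    """
    problems = []
    ext = PurePath(filename).suffix.lower()
    allowed = ALLOWED_EXTENSIONS.get(file_type)
    if allowed is None:
        problems.append(f"unknown file type: {file_type}")
    elif ext not in allowed:
        problems.append(f"{ext or 'no extension'} is not a preferred {file_type} format")

    # Exact-duplicate detection by content hash.
    if hashlib.sha256(data).hexdigest() in seen_hashes:
        problems.append("exact duplicate of an existing upload")
    return problems
```

Returning a list of problems rather than a single pass/fail value would let the upload form show every issue to the contributor at once.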
Metadata Best Practices:
- Guide users on writing descriptive, accurate titles and descriptions.
- Explain the importance of correct categorization and tagging for discoverability.
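As a rough illustration of how the metadata guidance might be enforced alongside the written guide, here is a minimal sketch that checks the four mandatory fields named in the Background. The field keys, the function name `metadata_issues`, and the five-word description threshold are all assumptions, not part of the existing upload form.

```python
# Keys mirror the mandatory fields on the upload form; the exact key
# names are an assumption for this sketch.
REQUIRED_FIELDS = ("title", "categories", "description", "release_rights")

# Hypothetical threshold for a "descriptive" description.
MIN_DESCRIPTION_WORDS = 5


def metadata_issues(metadata: dict) -> list[str]:
    """Return human-readable issues; an empty list means the metadata passes."""
    issues = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        # Treat missing, empty, and whitespace-only values as absent.
        if not value or (isinstance(value, str) and not value.strip()):
            issues.append(f"{field} is required")

    description = metadata.get("description") or ""
    if description.strip() and len(description.split()) < MIN_DESCRIPTION_WORDS:
        issues.append("description is too short to be useful")
    return issues
```

A submission titled "Telugu folk songs, Guntur district" with a one-sentence description, a category, and a license would pass; a whitespace-only title or a one-word description would be flagged.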
Legal and Ethical Considerations:
- Remind users to upload only content that they have the right to share, or that is in the public domain or licensed for reuse.
- Provide examples of acceptable release rights (e.g., Creative Commons licenses).
Encouraging Diversity:
- Highlight the need for underrepresented languages, dialects, and regional knowledge.
- Suggest sources for rare or endangered language materials.
FAQ/Examples:
- Include a short FAQ or examples of ideal submissions to clarify expectations.
Onboarding Task for New Users:
- Add a mandatory step during the first-time user onboarding process, requiring new users to read the Corpus Collection Guidelines before they can upload content.
- Include a checkbox or confirmation button: “I have read and understood the Corpus Collection Guidelines.”
Implementation Suggestion:
- Place a “Guidelines” button/link near the upload form and in the user dashboard.
- For new users, display a modal or a dedicated onboarding screen with the guidelines, and require them to acknowledge the guidelines before they can proceed to upload.
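The onboarding gate could be as simple as a per-user flag that the upload route checks. The sketch below is one possible shape under stated assumptions: the `User` class, the `acknowledged_guidelines` flag, and the step names `"guidelines_modal"` and `"upload_form"` are hypothetical and do not reflect the existing corpus.swecha.org codebase.

```python
from dataclasses import dataclass


@dataclass
class User:
    # Hypothetical user record; only the flag matters for this sketch.
    name: str
    acknowledged_guidelines: bool = False


def acknowledge_guidelines(user: User) -> None:
    """Record that the user ticked the confirmation checkbox."""
    user.acknowledged_guidelines = True


def next_step(user: User) -> str:
    """Decide what to show when the user tries to upload."""
    if not user.acknowledged_guidelines:
        return "guidelines_modal"
    return "upload_form"
```

Storing the acknowledgement on the server side, rather than only in the browser, would keep the gate in place across devices and sessions.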
Benefits:
- Improves the quality and relevance of submissions.
- Reduces moderation overhead.
- Empowers contributors to make informed decisions.
- Ensures all users are aware of best practices from the start.
Request for Feedback: We welcome suggestions from the community on what else to include in these guidelines and in the onboarding process to make them as effective as possible.