Implement Duplicate Submission Detection and Marking
Description: We need to detect and mark duplicate submissions to maintain data quality and integrity. This feature should work across all supported media types (text, audio, video, image) and ensure that only unique content is retained in the database.
Requirements:
Duplicate Detection:
- For text submissions, use content-based hashing (e.g., SHA-256) or similarity algorithms (e.g., cosine similarity for embeddings) to identify duplicates.
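- For image submissions, and for video via sampled frames, use perceptual hashing (e.g., pHash, dHash, or aHash) to detect near-duplicates. These hashes operate on pixels, so audio submissions need an acoustic fingerprint instead.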
- Configurable similarity threshold for each media type.
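As a rough illustration of the detection rules above, the sketch below covers exact-match hashing for text and a threshold check for perceptual hashes. It assumes the perceptual hashes arrive as precomputed 64-bit integers from whichever library is chosen. The names `text_fingerprint`, `is_near_duplicate`, and `MAX_HAMMING_DISTANCE`, the threshold values, and the normalization steps are all hypothetical and not part of this issue.

```python
import hashlib
import unicodedata

# Hypothetical per-media-type thresholds: the maximum Hamming distance (in bits)
# between two 64-bit perceptual hashes for them to count as near-duplicates.
MAX_HAMMING_DISTANCE = {"image": 8, "video": 10}


def text_fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the text after light normalization."""
    # Assumption: Unicode form, whitespace runs, and case should not matter.
    normalized = unicodedata.normalize("NFKC", text)
    normalized = " ".join(normalized.split()).casefold()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bits that differ between two integer hashes."""
    return bin(hash_a ^ hash_b).count("1")


def is_near_duplicate(media_type: str, hash_a: int, hash_b: int) -> bool:
    """Compare two precomputed perceptual hashes against the type's threshold."""
    return hamming_distance(hash_a, hash_b) <= MAX_HAMMING_DISTANCE[media_type]
```

Exact hashing only catches text that is identical after normalization. Paraphrased text would still need the embedding-based similarity option mentioned above.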
Marking as Invalid:
- When a duplicate is detected, mark the submission as `invalid` in the database.
- Optionally, notify the submitter (via email or in-app notification) that their submission was flagged as a duplicate.
Database Schema Updates:
- Add an `is_duplicate` boolean field to the submissions table.
- Add a `duplicate_of` foreign key field to reference the original submission, if applicable.
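One possible shape for these schema changes, sketched with Python's built-in `sqlite3` module so it runs without a database server. The real project's database, migration tooling, and column types may differ. The `content_hash` column, the index name, and the `insert_submission` helper are assumptions added for illustration; only `is_duplicate` and `duplicate_of` come from this issue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute(
    """
    CREATE TABLE submissions (
        id INTEGER PRIMARY KEY,
        content_hash TEXT NOT NULL,  -- hypothetical column holding the content fingerprint
        is_duplicate INTEGER NOT NULL DEFAULT 0 CHECK (is_duplicate IN (0, 1)),
        duplicate_of INTEGER REFERENCES submissions(id)
    )
    """
)
# An index on the hash keeps the duplicate lookup from scanning the whole table.
conn.execute("CREATE INDEX idx_submissions_content_hash ON submissions(content_hash)")


def insert_submission(conn: sqlite3.Connection, content_hash: str) -> int:
    """Insert a submission, flagging it when an earlier one has the same hash."""
    row = conn.execute(
        "SELECT id FROM submissions "
        "WHERE content_hash = ? AND is_duplicate = 0 ORDER BY id LIMIT 1",
        (content_hash,),
    ).fetchone()
    original_id = row[0] if row else None
    cur = conn.execute(
        "INSERT INTO submissions (content_hash, is_duplicate, duplicate_of) "
        "VALUES (?, ?, ?)",
        (content_hash, 1 if original_id is not None else 0, original_id),
    )
    return cur.lastrowid
```

Keeping `duplicate_of` nullable lets unique submissions leave it empty, which matches the "if applicable" wording above.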
API Endpoints:
- Extend the submission API to include duplicate checking before insertion.
- Provide an admin endpoint to manually check and mark duplicates.
Performance Considerations:
- Keep duplicate detection efficient on large datasets, for example by indexing stored hashes so that a lookup does not scan every submission.
- Consider asynchronous processing for media files to avoid blocking the API.
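The asynchronous option could look roughly like the sketch below: the API handler only enqueues the media, and a background worker computes the hash off the event loop. This is a minimal in-process sketch using `asyncio`, and a production setup might well use a separate task queue instead. `compute_media_hash` is a stand-in that uses SHA-256 purely for illustration, and all function names here are hypothetical.

```python
import asyncio
import hashlib


def compute_media_hash(data: bytes) -> str:
    """Stand-in for an expensive media hash (SHA-256 here, only for illustration)."""
    return hashlib.sha256(data).hexdigest()


async def worker(queue: asyncio.Queue, seen: dict, results: dict) -> None:
    """Hash queued submissions in a thread so the event loop stays responsive."""
    while True:
        submission_id, data = await queue.get()
        try:
            digest = await asyncio.to_thread(compute_media_hash, data)
            # Map each submission to the first submission seen with the same hash;
            # a submission that maps to its own id is unique so far.
            results[submission_id] = seen.setdefault(digest, submission_id)
        finally:
            queue.task_done()


async def process_submissions(items: list) -> dict:
    """Enqueue (submission_id, data) pairs and wait for the worker to drain them."""
    queue: asyncio.Queue = asyncio.Queue()
    seen: dict = {}
    results: dict = {}
    task = asyncio.create_task(worker(queue, seen, results))
    for item in items:
        queue.put_nowait(item)
    await queue.join()
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return results
```

With a single worker, submissions are processed in arrival order, so the earliest copy is always treated as the original. Running several workers would need an explicit rule for which copy wins.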
Geolocation Awareness:
- Optionally, consider geolocation data in duplicate detection (e.g., same media submitted from the same location within a short time window).
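To make the "same location within a short time window" idea concrete, here is a small sketch that combines a haversine distance check with a time-gap check. The 100 m radius, the 10-minute window, and the `(lat, lon, timestamp)` tuple layout are placeholder assumptions, not values specified in this issue.

```python
import math
from datetime import datetime, timedelta

# Hypothetical limits; both would be configurable in practice.
MAX_DISTANCE_M = 100.0
MAX_TIME_GAP = timedelta(minutes=10)

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def same_place_and_time(a: tuple, b: tuple) -> bool:
    """a and b are (lat, lon, datetime) tuples for two submissions."""
    close = haversine_m(a[0], a[1], b[0], b[1]) <= MAX_DISTANCE_M
    recent = abs(a[2] - b[2]) <= MAX_TIME_GAP
    return close and recent
```

This check is meant as an extra signal alongside the content hash rather than a replacement for it, since two different photos can legitimately come from the same place minutes apart.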
Review Workflow Integration:
- Allow reviewers to manually override duplicate flags if necessary.
Documentation:
- Update OpenAPI docs to reflect new fields and endpoints.
Acceptance Criteria:
- Duplicate detection is implemented for all media types.
- Detected duplicates are automatically marked as invalid.
- Admin users can review and override duplicate flags.
- API and database changes are documented and tested.
Edited by Ranjith Raj