Move file uploads to Hetzner Object Storage into a Celery task
Preface: Current Upload Flow
- Chunk Uploads: Client splits a file into chunks and uploads them to the backend. Chunks are stored temporarily (e.g., on local disk or a staging area) and tracked by an upload_id.
- Finalize Call (/upload): After all chunks are uploaded, the client calls the /upload endpoint to finalize. The server validates chunk completeness and order, assembles the file, creates the record, and currently performs the object storage upload synchronously within the request.
Problem
- Blocking Request: Synchronous transfer to object storage makes /upload slow and fragile for large files.
- Poor Resilience: Transient storage errors cause user-facing failures; retries are limited by request lifetimes.
- Resource Spikes: Large uploads can tie up workers and memory/IO, leading to timeouts and degraded API responsiveness.
- No Progress Visibility: Users cannot track long-running upload/ingest progress beyond a single request.
Proposal
- Background Task: Move the object storage transfer and finalization to a Celery task. The /upload endpoint will enqueue a job and return immediately.
- Immediate Response: /upload responds 202 Accepted with a job_id (and upload_id) for polling.
- Idempotency: Repeated finalize calls for the same upload_id should be safe (deduplicate by upload_id).
- Streaming/Multipart: The Celery task streams from chunk files and uses object storage multipart upload to avoid loading the entire file in memory.
- Cleanup: On success or terminal failure, clean up temporary chunks; enforce TTL cleanup for abandoned uploads.
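The deduplication described under Idempotency can be sketched as follows. This is a minimal illustration, not the actual handler: JobRegistry, finalize_upload, and the enqueue callback are hypothetical names, and the in-memory dict stands in for whatever shared store (for example a table with a unique constraint on upload_id) the backend would really use.

```python
import threading
import uuid
from typing import Callable, Dict, Tuple


class JobRegistry:
    """In-memory stand-in (assumption) for a shared store keyed by upload_id."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._jobs: Dict[str, str] = {}

    def get_or_create(self, upload_id: str) -> Tuple[str, bool]:
        """Return (job_id, created); created is True only on the first call."""
        with self._lock:
            job_id = self._jobs.get(upload_id)
            created = job_id is None
            if created:
                job_id = uuid.uuid4().hex
                self._jobs[upload_id] = job_id
            return job_id, created


def finalize_upload(
    registry: JobRegistry,
    upload_id: str,
    enqueue: Callable[[str], None],
) -> Tuple[int, dict]:
    """Hypothetical /upload handler body: enqueue once, always answer 202."""
    job_id, created = registry.get_or_create(upload_id)
    if created:
        # Only the first finalize call for this upload_id schedules work.
        enqueue(upload_id)
    # A real handler would report the job's current status on repeat calls;
    # "queued" is used here only to keep the sketch short.
    return 202, {"job_id": job_id, "upload_id": upload_id, "status": "queued"}
```

In a Celery deployment the enqueue callback would typically wrap the task's delay or apply_async call, so a repeated finalize returns the original job_id without scheduling a second transfer.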
API Changes
- Finalize (POST /upload): Now returns 202 Accepted with { job_id, upload_id, status: "queued" }. The legacy synchronous path remains available behind a feature flag during rollout.
- Job Status (GET /uploads/{upload_id}/status or /jobs/{job_id}): Returns { status: queued|processing|completed|failed, progress, error }.
- Optional Webhook: Support a registered callback URL for completion/failure notifications.
- Optional Cancel: DELETE /uploads/{upload_id} cancels an upload whose job has not started yet, or marks the upload as abandoned.
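One possible shape for the job status payload is sketched below. The field names come from the list above; the class names and the choice to express progress as a fraction between 0.0 and 1.0 are assumptions, since the proposal does not fix a unit.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(str, Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class JobStatus:
    status: Status
    progress: float = 0.0        # assumption: fraction of parts transferred
    error: Optional[str] = None  # set only when status is FAILED

    def to_payload(self) -> dict:
        """Serialize to the JSON body returned by the status endpoint."""
        return {
            "status": self.status.value,
            "progress": self.progress,
            "error": self.error,
        }
```

Keeping the status values in a single enum makes it harder for the API layer and the task to drift apart on spelling.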
Data Model
- Records Linkage: On completion, associate the persisted object key with the record in the existing tables.
- Chunk Manifest: Persist the ordered chunk list, chunk sizes, and per-chunk checksums so integrity can be verified before the chunks are assembled.
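A manifest check along these lines could run before any bytes are sent to object storage. This is a sketch under assumptions: ChunkEntry and verify_manifest are hypothetical names, chunks are assumed to live as files on local disk, and SHA-256 is used as the checksum because the proposal does not name an algorithm.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class ChunkEntry:
    index: int    # zero-based position in the assembled file
    path: Path    # temporary chunk file on disk (assumption)
    size: int     # expected size in bytes
    sha256: str   # expected hex digest (algorithm is an assumption)


def _sha256_of(path: Path, block_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size blocks so a chunk is never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_manifest(entries: List[ChunkEntry]) -> int:
    """Check order, size, and checksum of every chunk; return the total size."""
    total = 0
    for expected_index, entry in enumerate(sorted(entries, key=lambda e: e.index)):
        if entry.index != expected_index:
            raise ValueError(f"missing or duplicate chunk at index {expected_index}")
        if entry.path.stat().st_size != entry.size:
            raise ValueError(f"size mismatch for chunk {entry.index}")
        if _sha256_of(entry.path) != entry.sha256:
            raise ValueError(f"checksum mismatch for chunk {entry.index}")
        total += entry.size
    return total
```

Failing here is cheap compared with discovering a corrupt chunk after a multipart upload has already been started.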
Celery Task Design
- Task Name: tasks.upload_to_object_store(upload_id: str).
- Flow: Validate manifest → initiate multipart upload → stream each chunk → complete multipart → verify checksum/etag → create/update record → cleanup chunks.
- Retries: Exponential backoff on transient errors (network, 5xx from object storage). Cap the number of retries; once they are exhausted, mark the job as failed and record the error context.
- Idempotency: Use a per-upload_id lock and check if object already exists/record finalized before proceeding.
- Progress: Update progress after each part; emit structured logs with upload_id.
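The multipart portion of the flow above might look roughly like this. It is a sketch, not the task itself: transfer_chunks and MultipartClient are hypothetical names, the method names and keyword arguments mirror boto3's S3 client on the assumption that the object store is reached through an S3-compatible API, and each chunk is assumed to map to exactly one part that meets the backend's minimum part size.

```python
from typing import Callable, Iterable, List, Protocol


class MultipartClient(Protocol):
    """Minimal interface (assumption); names mirror boto3's S3 client."""

    def create_multipart_upload(self, *, Bucket: str, Key: str) -> dict: ...
    def upload_part(self, *, Bucket: str, Key: str, UploadId: str,
                    PartNumber: int, Body: bytes) -> dict: ...
    def complete_multipart_upload(self, *, Bucket: str, Key: str, UploadId: str,
                                  MultipartUpload: dict) -> dict: ...
    def abort_multipart_upload(self, *, Bucket: str, Key: str,
                               UploadId: str) -> dict: ...


def transfer_chunks(
    client: MultipartClient,
    bucket: str,
    key: str,
    chunks: Iterable[bytes],
    total_parts: int,
    on_progress: Callable[[float], None],
) -> List[dict]:
    """Upload chunks as multipart parts, reporting progress after each part."""
    mpu_id = client.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts: List[dict] = []
    try:
        # Part numbers start at 1; chunks are consumed lazily, one at a time.
        for number, body in enumerate(chunks, start=1):
            resp = client.upload_part(
                Bucket=bucket, Key=key, UploadId=mpu_id,
                PartNumber=number, Body=body,
            )
            parts.append({"PartNumber": number, "ETag": resp["ETag"]})
            on_progress(number / total_parts)
        client.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=mpu_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so the store does not keep orphaned parts, then let the
        # caller (for example Celery's retry machinery) see the error.
        client.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu_id)
        raise
    return parts
```

Inside the Celery task, a function like this would be called between manifest validation and record creation. Celery's task options autoretry_for, retry_backoff, retry_jitter, and max_retries are one way to get the exponential backoff and retry cap described above without hand-written retry loops.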
Configuration
- Celery: CELERY_BROKER_URL, CELERY_RESULT_BACKEND, queues (e.g., uploads), concurrency limits.
- Object Storage: S3_ENDPOINT_URL, S3_BUCKET, creds, region, multipart part size, upload timeouts.
- Feature Flag: UPLOADS_ASYNC_ENABLED to toggle background mode during rollout.
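The settings above could be gathered in one place at startup, for example as shown here. The variable names CELERY_BROKER_URL, CELERY_RESULT_BACKEND, S3_ENDPOINT_URL, S3_BUCKET, and UPLOADS_ASYNC_ENABLED come from this section; S3_MULTIPART_PART_SIZE, the 8 MiB default, and the accepted truthy strings are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping

_TRUTHY = {"1", "true", "yes", "on"}  # assumption: accepted flag spellings


@dataclass(frozen=True)
class UploadSettings:
    broker_url: str
    result_backend: str
    s3_endpoint_url: str
    s3_bucket: str
    part_size_bytes: int
    async_enabled: bool


def load_settings(env: Mapping[str, str] = os.environ) -> UploadSettings:
    """Read upload settings from the environment; required keys raise KeyError."""
    return UploadSettings(
        broker_url=env["CELERY_BROKER_URL"],
        result_backend=env["CELERY_RESULT_BACKEND"],
        s3_endpoint_url=env["S3_ENDPOINT_URL"],
        s3_bucket=env["S3_BUCKET"],
        # Hypothetical variable name and default; the proposal lists only
        # "multipart part size" without naming a setting.
        part_size_bytes=int(env.get("S3_MULTIPART_PART_SIZE", 8 * 1024 * 1024)),
        # Defaults to off so the legacy synchronous path stays active
        # until the flag is set explicitly.
        async_enabled=env.get("UPLOADS_ASYNC_ENABLED", "false").strip().lower() in _TRUTHY,
    )
```

Failing fast on a missing required variable keeps a misconfigured worker from accepting jobs it cannot complete.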
Operational Considerations