Feat: Implement Asynchronous Background Processing (Phase 3)
Description
The current metadata extraction pipeline processes files synchronously within the FastAPI request cycle. This leads to API timeouts for large documents (PDFs) and heavy AI inference tasks (vLLM). To scale the system to 130,000+ documents, we need a distributed task queue that can handle jobs in the background and route them to specific hardware (CPU vs GPU).
Objectives
- Integrate Celery as the task orchestrator.
- Utilize Redis as the message broker and result backend.
- Prevent API timeouts by shifting heavy lifting to background workers.
- Enable hardware-specific task routing (e.g., route vision tasks to GPU workers).
- Provide a status monitoring interface for background jobs.
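
As a rough illustration of the Celery and Redis objectives above, a settings module might look like the sketch below. Only the queue names default_queue and vlm_queue come from this issue; the Redis URLs, the task names tasks.extract_metadata and tasks.extract_vision, and the tuning values are assumptions for illustration.

```python
# celery_config.py (sketch): plain module-level settings that Celery can load
# with app.config_from_object("celery_config").

# Assumed Redis location; adjust host, port, and DB index for the deployment.
broker_url = "redis://redis:6379/0"
result_backend = "redis://redis:6379/1"

# Tasks without an explicit route land on the CPU queue.
task_default_queue = "default_queue"

# Hypothetical task names; the real names depend on how tasks.py is written.
task_routes = {
    "tasks.extract_vision": {"queue": "vlm_queue"},        # GPU workers
    "tasks.extract_metadata": {"queue": "default_queue"},  # CPU workers
}

# Report the STARTED state so a status endpoint can tell queued from running.
task_track_started = True

# Long-running jobs: fetch one message at a time and acknowledge after completion.
worker_prefetch_multiplier = 1
task_acks_late = True

# Expire stored results after one day (in seconds) so Redis does not grow unbounded.
result_expires = 24 * 60 * 60
```

With settings like these, a CPU host could start a worker with `-Q default_queue` and a GPU host with `-Q vlm_queue`, so each worker consumes only the queue that matches its hardware.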
Proposed Solution
- Create a celery_config.py for queue and worker settings.
- Implement background tasks in tasks.py that wrap the existing ExtractionPipeline.
- Add /extract/async and /jobs/{job_id} endpoints to the FastAPI application.
- Update docker-compose.yml to include Redis and Celery worker services.
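
The task wrapper in tasks.py could be structured roughly as follows. This is a minimal sketch, not the actual implementation: the signature of ExtractionPipeline is not given in this issue, so a plain callable named `extract` stands in for it, and `run_extraction` and `fake_pipeline` are hypothetical names. In the real module the function would presumably be registered as a Celery task (for example with the `@app.task` decorator).

```python
import os
import tempfile
from typing import Any, Callable, Dict


def run_extraction(file_path: str,
                   extract: Callable[[str], Dict[str, Any]]) -> Dict[str, Any]:
    """Run an extraction callable on file_path and always delete the file afterwards.

    `extract` stands in for the existing ExtractionPipeline entry point,
    whose real signature is an assumption here.
    """
    try:
        return extract(file_path)
    finally:
        # Remove the uploaded temp file whether extraction succeeded or raised.
        if os.path.exists(file_path):
            os.remove(file_path)


# Usage example with a stand-in pipeline that only reports the file size.
def fake_pipeline(path: str) -> Dict[str, Any]:
    return {"size_bytes": os.path.getsize(path)}


with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(b"%PDF-1.4 example")
    upload_path = tmp.name

result = run_extraction(upload_path, fake_pipeline)
print(result, os.path.exists(upload_path))
```

Putting the cleanup in a `finally` block means a failed extraction does not leave files behind. The sketch also assumes the API process and the workers can see the same path (for example through a shared Docker volume); if they cannot, the file contents would have to be passed another way.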
Acceptance Criteria
- Users can submit files asynchronously and receive a job_id.
- Job status and results can be retrieved via the /jobs/{job_id} endpoint.
- Tasks are correctly routed to default_queue (CPU) or vlm_queue (GPU).
- Background workers automatically clean up temporary files after processing.
- Total code coverage remains above 95%.
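
For the status criterion, one possible shape for the /jobs/{job_id} response is sketched below. The response fields and the function name `build_job_response` are assumptions, not a decided API. In the endpoint, the state and result would likely be read from Celery's `AsyncResult(job_id)` (its `state` and `result` attributes); the sketch keeps that lookup out so the mapping itself can be unit tested without a broker.

```python
from typing import Any, Dict, Optional


def build_job_response(job_id: str, state: str,
                       result: Optional[Any] = None,
                       error: Optional[str] = None) -> Dict[str, Any]:
    """Map a Celery task state string to a hypothetical response body."""
    body: Dict[str, Any] = {"job_id": job_id, "status": state.lower()}
    if state == "SUCCESS":
        # Only finished jobs carry the extracted metadata.
        body["result"] = result
    elif state == "FAILURE":
        # Expose a message rather than the raw exception object.
        body["error"] = error or "unknown error"
    return body


print(build_job_response("abc123", "STARTED"))
print(build_job_response("abc123", "SUCCESS", result={"title": "Example"}))
```

One caveat to keep in mind: Celery reports PENDING for any task ID it has no record of, so an unknown job_id looks the same as a queued job unless submitted IDs are tracked separately.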
Edited by Praneeth Ashish