Implement Document Embedding Ingestion Pipeline with Qdrant
Document Embedding Ingestion Pipeline
Overview
This feature implements a document embedding ingestion pipeline in the backend. It enables the system to process documents, convert them into vector embeddings using a local embedding model, and store them in a vector database for semantic search.
The pipeline ensures that users can upload or reference documents, transform them into embeddings, and retrieve relevant information using similarity search.
The implementation follows a minimal and modular architecture, ensuring that existing backend functionality remains unaffected.
Objective
The main goal of this feature is to create an API that:
- Accepts a record_id
- Retrieves the associated document
- Extracts text from the document
- Splits the text into smaller chunks
- Generates embeddings locally using an embedding model
- Stores embeddings in a vector database
- Enables semantic search over stored document embeddings
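The chunking step in the list above can be sketched as a fixed-size character window with overlap. This is only an illustration: the function name `split_into_chunks`, the default `chunk_size` of 500 characters, and the `overlap` of 50 are assumptions, not values specified by this feature.

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size].strip()
        if chunk:  # skip windows that contain only whitespace
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks
```

Overlapping windows keep a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.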
Embedding Model
The system uses a local embedding model.
Model: snowflake-arctic-embed-l-v2.0
Requirements
- Embedding dimension: 1024
- Embeddings must be normalized
- Must support batch embedding for multiple text chunks
- The model should be loaded only once at application startup
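Two of the requirements above, normalization and batching, can be illustrated without the model itself. The sketch below uses plain Python; the helper names `l2_normalize`, `batched`, and `dot` are hypothetical, and a real embedding library may already offer both features through its own options.

```python
import math
from typing import Iterator, Sequence


def l2_normalize(vector: Sequence[float]) -> list[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)  # avoid dividing by zero
    return [x / norm for x in vector]


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))
```

Once every stored vector has unit length, the dot product of two vectors equals their cosine similarity, which is why normalization pairs naturally with a cosine distance metric.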
Vector Database
The embeddings are stored in Qdrant (local instance).
Configuration
- Collection Name: documents
- Vector Size: 1024
- Distance Metric: cosine
If the collection does not exist, it must be created during initialization.
Each stored vector should include metadata:
- record_id
- chunk_text
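As a rough sketch of what one stored entry could look like, the snippet below pairs each chunk's vector with the two metadata fields. `PointRecord` and `build_points` are hypothetical names, and the deterministic UUID scheme is an assumption chosen so that re-ingesting the same record produces the same IDs; it is not something this feature prescribes. It deliberately stays library-free and does not call the Qdrant client.

```python
import uuid
from dataclasses import dataclass

VECTOR_SIZE = 1024  # matches the configured collection vector size


@dataclass
class PointRecord:
    """Hypothetical in-memory shape of one vector plus its metadata."""
    id: str
    vector: list[float]
    payload: dict[str, str]


def build_points(record_id: str, chunks: list[str],
                 vectors: list[list[float]]) -> list[PointRecord]:
    """Pair each chunk with its vector and attach record_id / chunk_text."""
    if len(chunks) != len(vectors):
        raise ValueError("chunks and vectors must have the same length")

    points: list[PointRecord] = []
    for index, (chunk, vector) in enumerate(zip(chunks, vectors)):
        if len(vector) != VECTOR_SIZE:
            raise ValueError(f"expected {VECTOR_SIZE} dimensions, got {len(vector)}")
        # Assumption: derive a stable UUID from record_id and chunk position.
        point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{record_id}:{index}"))
        points.append(PointRecord(
            id=point_id,
            vector=vector,
            payload={"record_id": record_id, "chunk_text": chunk},
        ))
    return points
```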
System Architecture
The pipeline is implemented using separate modular services to ensure maintainability and avoid modifying unrelated backend components.
Services
| Service | Responsibility |
|---|---|
| parser_service | Extract text from documents |
| chunk_service | Split text into manageable chunks |
| embedding_service | Generate embeddings from text |
| vector_service | Store embeddings in Qdrant |
Only the minimum required modules should be created to implement the feature.
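One way to keep the four services decoupled is to pass them into a small orchestrator as plain callables. The sketch below is an assumption about how they might be wired together: `IngestionPipeline`, its field names, and the callable signatures (including `parse` taking a `record_id`) are illustrative and not part of the described modules.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class IngestionPipeline:
    """Hypothetical orchestrator; each field stands in for one service."""
    parse: Callable[[str], str]                       # parser_service: record_id -> text
    chunk: Callable[[str], list[str]]                 # chunk_service: text -> chunks
    embed: Callable[[list[str]], list[list[float]]]   # embedding_service: chunks -> vectors
    store: Callable[[str, list[str], list[list[float]]], int]  # vector_service: returns count stored

    def ingest(self, record_id: str) -> int:
        """Run parse -> chunk -> embed -> store and return the number of stored chunks."""
        text = self.parse(record_id)
        chunks = self.chunk(text)
        if not chunks:
            return 0  # nothing to embed or store
        vectors = self.embed(chunks)  # one batch call for all chunks
        return self.store(record_id, chunks, vectors)
```

Because each stage is injected, any one service can be replaced with a stub in tests without touching the others.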
Implementation Workflow
Step 1 — Document Parsing
Create a parsing function: