
Implement Document Embedding Ingestion Pipeline with Qdrant

Greeshma Kanukunta requested to merge embedding-pipeline into feat/rag-chatbot-model

Document Embedding Ingestion Pipeline

Overview

This feature implements a document embedding ingestion pipeline in the backend. It enables the system to process documents, convert them into vector embeddings using a local embedding model, and store them in a vector database for semantic search.

The pipeline lets users upload or reference documents, converts each document into embeddings, and retrieves relevant passages through similarity search.

The implementation is minimal and modular, so existing backend functionality remains unaffected.


Objective

The main goal of this feature is to create an API that:

  • Accepts a record_id
  • Retrieves the associated document
  • Extracts text from the document
  • Splits the text into smaller chunks
  • Generates embeddings locally using an embedding model
  • Stores embeddings in a vector database
  • Enables semantic search over stored document embeddings
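The steps above can be sketched as a single orchestration function. This is a minimal illustration, not the merge request's actual code: the name ingest_record, the IngestResult type, and the five injected callables are all hypothetical, chosen so that each step maps to one of the listed responsibilities and can be swapped or tested in isolation.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class IngestResult:
    """Hypothetical summary returned to the API caller."""
    record_id: str
    chunks_stored: int


def ingest_record(
    record_id: str,
    load_document: Callable[[str], bytes],
    parse: Callable[[bytes], str],
    chunk: Callable[[str], List[str]],
    embed: Callable[[Sequence[str]], List[List[float]]],
    store: Callable[[str, List[str], List[List[float]]], None],
) -> IngestResult:
    """Run the ingestion steps in order for a single record_id."""
    raw = load_document(record_id)   # retrieve the associated document
    text = parse(raw)                # extract text from the document
    chunks = chunk(text)             # split the text into smaller chunks
    if not chunks:
        # Nothing to embed (for example, an empty document).
        return IngestResult(record_id, 0)
    vectors = embed(chunks)          # one batch call for all chunks
    if len(vectors) != len(chunks):
        raise ValueError("embedder returned a different number of vectors than chunks")
    store(record_id, chunks, vectors)  # persist vectors plus metadata
    return IngestResult(record_id, len(chunks))
```

Passing the steps in as callables is one possible design; it keeps the orchestration free of any direct dependency on the parser, the model, or Qdrant.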

Embedding Model

The system uses a local embedding model.

Model: snowflake-arctic-embed-l-v2.0

Requirements

  • Embedding dimension: 1024
  • Embeddings must be normalized
  • Must support batch embedding for multiple text chunks
  • The model should be loaded only once at application startup
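One way to meet these requirements is sketched below. It assumes the model is loaded through the sentence-transformers library under the Hugging Face id Snowflake/snowflake-arctic-embed-l-v2.0; the merge request does not state which loader it uses, so treat both as assumptions. The helper names (get_encoder, l2_normalize, embed_chunks) and the batch size of 32 are hypothetical.

```python
import math
from functools import lru_cache
from typing import Callable, List, Sequence

EMBEDDING_DIM = 1024
# Assumed Hugging Face id for the model named above.
MODEL_NAME = "Snowflake/snowflake-arctic-embed-l-v2.0"

Encoder = Callable[[Sequence[str]], List[List[float]]]


@lru_cache(maxsize=1)
def get_encoder() -> Encoder:
    """Load the model on the first call and reuse it afterwards.

    Calling this once from the application's startup hook gives the
    "loaded only once" behaviour.
    """
    # Imported lazily so the rest of the module works without the dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(MODEL_NAME)

    def encode(texts: Sequence[str]) -> List[List[float]]:
        # A single call embeds the whole batch of chunks.
        return model.encode(
            list(texts), batch_size=32, normalize_embeddings=True
        ).tolist()

    return encode


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def embed_chunks(
    chunks: Sequence[str], encoder: Encoder, dim: int = EMBEDDING_DIM
) -> List[List[float]]:
    """Embed all chunks in one batch and enforce dimension and normalization."""
    if not chunks:
        return []
    vectors = [l2_normalize(v) for v in encoder(chunks)]
    for v in vectors:
        if len(v) != dim:
            raise ValueError(f"expected dimension {dim}, got {len(v)}")
    return vectors
```

In the application, embed_chunks(chunks, get_encoder()) would produce the vectors; normalizing again in embed_chunks is redundant when the encoder already normalizes, but it keeps the guarantee independent of the encoder passed in.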

Vector Database

The embeddings are stored in Qdrant (local instance).

Configuration

  • Collection Name: documents
  • Vector Size: 1024
  • Distance Metric: cosine

If the collection does not exist, it must be created during initialization.

Each stored vector should include metadata:

  • record_id
  • chunk_text

System Architecture

The pipeline is implemented using separate modular services to ensure maintainability and avoid modifying unrelated backend components.

Services

Service            Responsibility
parser_service     Extract text from documents
chunk_service      Split text into manageable chunks
embedding_service  Generate embeddings from text
vector_service     Store embeddings in Qdrant

Only the minimum required modules should be created to implement the feature.
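As an illustration of how small one of these modules can be, here is a possible chunk_service. The merge request does not specify a chunking strategy, so the fixed-size character windows, the default sizes, and the function name split_text are all assumptions.

```python
from typing import List


def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into character windows that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    # Collapse runs of whitespace so chunk sizes reflect real content.
    text = " ".join(text.split())
    step = chunk_size - overlap

    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text, so the
        # final chunk is not followed by a redundant tail.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary partly present in both neighbouring chunks; splitting on sentence or token boundaries would be a reasonable alternative.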


Implementation Workflow

Step 1 — Document Parsing

Create a parsing function:

Edited by Greeshma Kanukunta
