PDF RAG Chatbot

Overview

PDF RAG Chatbot is an AI-powered application that allows users to upload PDF documents and ask questions about their content. The project implements a Retrieval-Augmented Generation (RAG) pipeline using LangChain, ChromaDB, Hugging Face embeddings, and a Large Language Model (LLM) to generate context-aware responses.


What Document Did We Use and Why?

For testing and evaluation, we used educational PDF documents related to Operating Systems concepts.

These documents were chosen because:

  • They contain structured theoretical content.
  • They include multiple topics and subtopics suitable for retrieval.
  • They help evaluate whether the chatbot can accurately locate and answer questions from different sections of a document.
  • They provide a realistic academic use case for students.

The chatbot is designed to work with any PDF uploaded by the user.


How Does Chunking Work?

After a PDF is uploaded, the text is extracted using PyPDFLoader.

The extracted text is then divided into smaller chunks using LangChain's RecursiveCharacterTextSplitter.

Configuration used (RecursiveCharacterTextSplitter measures length in characters by default):

chunk_size = 500
chunk_overlap = 50
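
To make these two settings concrete, the sketch below splits text with a fixed-width sliding window. It is a simplified stand-in, not LangChain's algorithm: RecursiveCharacterTextSplitter also prefers to break on separators such as paragraph and line breaks, so its chunks are often shorter than the limit. The function name `chunk_text` is hypothetical, and sizes are assumed to be counted in characters.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is an
    illustrative simplification, not the LangChain implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks
```

With the settings above, the window advances 450 characters per step, so a 1,200-character text yields three chunks starting at offsets 0, 450, and 900, and the last 50 characters of each chunk reappear at the start of the next one.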

Why Chunking?

Large Language Models have a finite context window, so a long document often cannot be passed to the model in a single prompt, and even when it fits, most of it is irrelevant to any one question.

Chunking helps by:

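  • Breaking large documents into manageable pieces.
  • Preserving context across chunk boundaries through overlap.
  • Improving retrieval accuracy, since each chunk covers a narrower topic.
  • Reducing token usage, since only the retrieved chunks are sent to the LLM.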

Chunking Workflow

PDF → Text Extraction → Chunking → Embedding Generation → Vector Database Storage


Which Embedding Model Did We Use?

Embedding Model:

sentence-transformers/all-MiniLM-L6-v2

Reasons for selecting this model:

  • Lightweight and fast.
  • Produces high-quality semantic embeddings.
  • Well-suited for Retrieval-Augmented Generation applications.
  • Efficient for CPU-based execution.

The embedding model converts text chunks into numerical vector representations, enabling semantic similarity search.
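
As a rough illustration of semantic similarity search, the sketch below compares a query vector to two chunk vectors using cosine similarity. The vectors are invented 3-dimensional toys, not real model output (all-MiniLM-L6-v2 produces 384-dimensional vectors), and cosine similarity is only one common measure: the metric actually applied depends on how the ChromaDB collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
query = [0.9, 0.1, 0.0]
chunk_about_scheduling = [0.8, 0.2, 0.1]
chunk_about_file_systems = [0.0, 0.2, 0.9]

score_scheduling = cosine_similarity(query, chunk_about_scheduling)
score_file_systems = cosine_similarity(query, chunk_about_file_systems)
```

Here `score_scheduling` is higher than `score_file_systems`, so the scheduling chunk would be ranked first for this query.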


Project Architecture

  1. User uploads a PDF.
  2. Text is extracted using PyPDFLoader.
  3. Text is split into chunks using RecursiveCharacterTextSplitter.
  4. Embeddings are generated using all-MiniLM-L6-v2.
  5. Embeddings are stored in ChromaDB.
  6. User submits a question.
  7. ChromaDB retrieves the top-k most relevant chunks.
  8. Retrieved chunks are added to the prompt context.
  9. The Hugging Face LLM is prompted to generate an answer using only the retrieved context.
  10. The answer is displayed to the user.
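
Steps 7 to 9 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's code: a list of (text, vector) pairs stands in for ChromaDB, scoring uses a dot product on vectors assumed to be unit-length, and the names `retrieve_top_k` and `build_prompt` as well as the prompt wording are hypothetical.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit-length."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(query_vector, store, k=3):
    """Return the texts of the k chunks whose vectors score highest.

    store is a list of (chunk_text, vector) pairs, a stand-in for the
    vector database in this sketch.
    """
    ranked = sorted(store, key=lambda item: dot(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Place the retrieved chunks ahead of the question in a single prompt."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

The resulting string is what would be sent to the LLM. Telling the model to rely only on the supplied context narrows its answers to the document, although it does not guarantee that the model follows the instruction.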

Technologies Used

Technology              Purpose
Python                  Core programming language
Streamlit               User Interface
LangChain               RAG Pipeline
Hugging Face            Embeddings and LLM
ChromaDB                Vector Database
PyPDFLoader             PDF Processing
Sentence Transformers   Embedding Generation

How to Run Locally

Clone the Repository

git clone <repository-url>
cd rag-chatbot

Create Virtual Environment

python -m venv venv

Activate Virtual Environment

Windows:

venv\Scripts\activate

Linux/macOS:

source venv/bin/activate

Install Dependencies

pip install -r requirements.txt

Create Environment Variables

Create a .env file:

HF_TOKEN=your_huggingface_token
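
Python does not read a .env file on its own. The file has to be loaded into the process environment first, for example with python-dotenv's `load_dotenv()`; whether this project uses that package is an assumption. Once the variable is loaded, a startup check along the following lines can fail early with a clear message. The function name `get_hf_token` is hypothetical.

```python
import os


def get_hf_token(environ=os.environ):
    """Return the Hugging Face token, or raise if it is missing or blank."""
    token = environ.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; add it to the .env file or export it in the shell"
        )
    return token
```

Calling `get_hf_token()` once at startup reports a missing token before the first question is asked, rather than as an authentication error in the middle of a request.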

Run the Application

streamlit run app.py

The application will be available at:

http://localhost:8501

Screenshot

Add screenshots of:

  1. Home page
  2. PDF upload screen
  3. Question-answer interface
  4. Retrieved source chunks

[Screenshots: D1, D2, D3, D4]


What Would We Improve With More Time?

With more development time, we would add the following:

  • Multi-PDF support
  • Chat history and conversational memory
  • Source page citations
  • Persistent vector database storage
  • Advanced prompt engineering
  • PDF summarization feature
  • Voice-based interaction
  • OCR support for scanned PDFs
  • User authentication and document management
  • Cloud deployment and scalability improvements

Learning Outcomes

Through this project, we gained hands-on experience with:

  • Retrieval-Augmented Generation (RAG)
  • Vector Databases
  • Semantic Search
  • Embedding Models
  • Prompt Engineering
  • Large Language Models
  • Streamlit Application Development
  • Natural Language Processing Workflows

Team

Developed by Five Pixels as part of an AI internship project.


License

This project is intended for educational purposes.
