PDF RAG Chatbot

Overview

PDF RAG Chatbot is an AI-powered application that allows users to upload PDF documents and ask questions about their content. The project implements a Retrieval-Augmented Generation (RAG) pipeline using LangChain, ChromaDB, Hugging Face embeddings, and a Large Language Model (LLM) to generate context-aware responses.

What Document Did We Use and Why?

For testing and evaluation, we used educational PDF documents related to Operating Systems concepts.

These documents were chosen because:

They contain structured theoretical content.
They include multiple topics and subtopics suitable for retrieval.
They help evaluate whether the chatbot can accurately locate and answer questions from different sections of a document.
They provide a realistic academic use case for students.

The chatbot is designed to work with any PDF uploaded by the user.

How Does Chunking Work?

After a PDF is uploaded, the text is extracted using PyPDFLoader.

The extracted text is then divided into smaller chunks using LangChain's RecursiveCharacterTextSplitter.

Configuration used (both values are measured in characters, the splitter's default length function):

chunk_size = 500
chunk_overlap = 50
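
To make the two settings concrete, here is a minimal sketch of overlapping chunking. It is a deliberately simplified stand-in, not the project's code: RecursiveCharacterTextSplitter tries to split on separators such as paragraph and line breaks before falling back to smaller units, while this hypothetical chunk_text helper just slides a fixed character window over the text.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: the real splitter prefers natural
    boundaries (paragraphs, lines, words) over hard character cuts.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # each window starts 450 characters after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# A 1200-character dummy document stands in for extracted PDF text.
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

With these settings, the last 50 characters of one chunk are repeated as the first 50 characters of the next, so a sentence that straddles a boundary still appears whole in at least one chunk, provided it is shorter than the overlap.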

Why Chunking?

Large Language Models have a limited context window, so an entire document often cannot be passed to the model in a single prompt.

Chunking helps by:

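Breaking large documents into manageable pieces.
Preserving context across chunk boundaries through overlap.
Improving retrieval accuracy, since each chunk covers a narrower topic.
Reducing token usage, since only the retrieved chunks are sent to the LLM.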

Chunking Workflow

PDF → Text Extraction → Chunking → Embedding Generation → Vector Database Storage

Which Embedding Model Did We Use?

Embedding Model:

sentence-transformers/all-MiniLM-L6-v2

Reasons for selecting this model:

Lightweight and fast.
Produces high-quality semantic embeddings.
Well-suited for Retrieval-Augmented Generation applications.
Efficient for CPU-based execution.

The embedding model converts text chunks into numerical vector representations, enabling semantic similarity search.
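
As an illustration of what semantic similarity search means, the sketch below ranks a few chunks by cosine similarity to a query vector. The three-dimensional vectors and the names toy_index, cosine_similarity, and top_k are made up for this example; all-MiniLM-L6-v2 produces 384-dimensional vectors, and the distance metric ChromaDB uses in this project is not stated here, so cosine similarity is an assumption.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy embeddings: each key stands for a chunk about one Operating Systems topic.
toy_index = {
    "scheduling": [0.9, 0.1, 0.0],
    "paging": [0.1, 0.9, 0.1],
    "deadlock": [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.0]  # a query that points mostly in the "scheduling" direction
result = top_k(query, toy_index, k=2)
print(result)
```

A vector database performs the same kind of ranking, but over many high-dimensional vectors and usually with an index that avoids comparing the query against every stored chunk.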

Project Architecture

User uploads a PDF.
Text is extracted using PyPDFLoader.
Text is split into chunks using RecursiveCharacterTextSplitter.
Embeddings are generated using all-MiniLM-L6-v2.
Embeddings are stored in ChromaDB.
User submits a question.
ChromaDB retrieves the top-k most relevant chunks.
Retrieved chunks are added to the prompt context.
The Hugging Face LLM generates an answer based only on the retrieved context.
The answer is displayed to the user.
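
The step where retrieved chunks are added to the prompt can be sketched as follows. The build_prompt helper and the prompt wording are hypothetical; the project's actual prompt template is not shown in this README.

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's question into one prompt string."""
    # Number the chunks so the answer can be traced back to a source passage.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Example chunks are invented placeholders for text retrieved from the vector store.
prompt = build_prompt(
    "What is a deadlock?",
    [
        "A deadlock occurs when processes wait on each other indefinitely.",
        "Deadlock prevention removes one of the four necessary conditions.",
    ],
)
print(prompt)
```

Instructing the model to rely only on the supplied context is what keeps answers grounded in the uploaded PDF rather than in the model's general knowledge.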

Technologies Used

Technology	Purpose
Python	Core programming language
Streamlit	User Interface
LangChain	RAG Pipeline
Hugging Face	Embeddings and LLM
ChromaDB	Vector Database
PyPDFLoader	PDF Processing
Sentence Transformers	Embedding Generation

How to Run Locally

Clone the Repository

git clone <repository-url>
cd rag-chatbot

Create Virtual Environment

python -m venv venv

Activate Virtual Environment

Windows:

venv\Scripts\activate

Linux/macOS:

source venv/bin/activate

Install Dependencies

pip install -r requirements.txt

Create Environment Variables

Create a .env file:

HF_TOKEN=your_huggingface_token
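
This README does not say how app.py reads the .env file; a package such as python-dotenv is a common choice. As a dependency-free sketch of the idea, the hypothetical load_env_file helper below copies KEY=value pairs from a file into the process environment without overriding variables that are already set.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> None:
    """Load KEY=value lines from a file into os.environ (existing values win)."""
    env_path = Path(path)
    if not env_path.is_file():
        return  # nothing to load; rely on variables exported in the shell
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        cleaned = value.strip().strip('"').strip("'")  # drop optional quotes
        os.environ.setdefault(key.strip(), cleaned)


load_env_file()
hf_token = os.environ.get("HF_TOKEN")  # None if the token was never provided
if hf_token is None:
    print("HF_TOKEN is not set; add it to .env or export it in your shell.")
```

Keep the .env file out of version control so the token is not published with the repository.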

Run the Application

streamlit run app.py

The application will be available at:

http://localhost:8501

Screenshot

Add screenshots of:

Home page
PDF upload screen
Question-answer interface
Retrieved source chunks

What Would We Improve With More Time?

With more development time, we would add the following enhancements:

Multi-PDF support
Chat history and conversational memory
Source page citations
Persistent vector database storage
Advanced prompt engineering
PDF summarization feature
Voice-based interaction
OCR support for scanned PDFs
User authentication and document management
Cloud deployment and scalability improvements

Learning Outcomes

Through this project, we gained hands-on experience with:

Retrieval-Augmented Generation (RAG)
Vector Databases
Semantic Search
Embedding Models
Prompt Engineering
Large Language Models
Streamlit Application Development
Natural Language Processing Workflows

Team

Developed by Five Pixels as part of an AI internship project.

License

This project is intended for educational purposes.
