PDF RAG Chatbot
Overview
PDF RAG Chatbot is an AI-powered application that allows users to upload PDF documents and ask questions about their content. The project implements a Retrieval-Augmented Generation (RAG) pipeline using LangChain, ChromaDB, Hugging Face embeddings, and a Large Language Model (LLM) to generate context-aware responses.
What Document Did We Use and Why?
For testing and evaluation, we used educational PDF documents related to Operating Systems concepts.
These documents were chosen because:
- They contain structured theoretical content.
- They include multiple topics and subtopics suitable for retrieval.
- They help evaluate whether the chatbot can accurately locate and answer questions from different sections of a document.
- They provide a realistic academic use case for students.
The chatbot is designed to work with any PDF uploaded by the user.
How Does Chunking Work?
After a PDF is uploaded, the text is extracted using PyPDFLoader.
The extracted text is then divided into smaller chunks using LangChain's RecursiveCharacterTextSplitter.
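Configuration used:
chunk_size = 500  # maximum chunk length, counted in characters by default
chunk_overlap = 50  # characters shared between consecutive chunks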
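To illustrate what these two settings control, here is a minimal sketch of overlap-based chunking in plain Python. It is a simplified stand-in, not the project's code: the function name `split_with_overlap` is hypothetical, and it cuts at fixed character offsets, whereas RecursiveCharacterTextSplitter first tries to break on paragraph, line, and word boundaries, so its chunks are usually shorter than chunk_size and end at more natural points.

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Example: a 1200-character string with the README's settings.
sample = "".join(chr(97 + i % 26) for i in range(1200))
pieces = split_with_overlap(sample, chunk_size=500, chunk_overlap=50)
print([len(p) for p in pieces])
```

With these settings each new chunk begins 450 characters after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.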
Why Chunking?
Large Language Models have a limited context window and cannot efficiently process an entire document in a single prompt.
Chunking helps by:
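- Breaking large documents into manageable pieces.
- Preserving context across chunk boundaries through chunk overlap.
- Improving retrieval accuracy, because each embedding represents one focused passage.
- Reducing token usage, because only the retrieved chunks are sent to the LLM instead of the full document.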
Chunking Workflow
PDF → Text Extraction → Chunking → Embedding Generation → Vector Database Storage
Which Embedding Model Did We Use?
Embedding Model:
sentence-transformers/all-MiniLM-L6-v2
Reasons for selecting this model:
- Lightweight and fast.
- Produces high-quality semantic embeddings.
- Well-suited for Retrieval-Augmented Generation applications.
- Efficient for CPU-based execution.
The embedding model converts text chunks into numerical vector representations, enabling semantic similarity search.
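As a rough illustration of semantic similarity search, the sketch below ranks chunk vectors against a query vector using cosine similarity. The vectors are made-up 3-dimensional toy values (all-MiniLM-L6-v2 produces 384-dimensional vectors), the function names are hypothetical, and cosine similarity is only one common metric; the metric actually used depends on how the ChromaDB collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_indices(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy vectors standing in for real embeddings (assumed values, for illustration).
query = [1.0, 0.0, 0.0]
chunk_vectors = [
    [0.9, 0.1, 0.0],  # points almost the same way as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.7, 0.7, 0.0],  # partially aligned with the query
]
best = top_k_indices(query, chunk_vectors, k=2)
print(best)
```

In the application itself this ranking is delegated to the vector database; the sketch only shows the idea behind it.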
Project Architecture
- User uploads a PDF.
- Text is extracted using PyPDFLoader.
- Text is split into chunks using RecursiveCharacterTextSplitter.
- Embeddings are generated using all-MiniLM-L6-v2.
- Embeddings are stored in ChromaDB.
- User submits a question.
- ChromaDB retrieves the top-k most relevant chunks.
- Retrieved chunks are added to the prompt context.
- The Hugging Face LLM generates an answer based only on the retrieved context.
- The answer is displayed to the user.
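The step where retrieved chunks are added to the prompt context can be sketched as follows. This is an assumed prompt layout, not the project's actual template: the function name `build_prompt` and the instruction wording are hypothetical and only show one way to restrict the model to the retrieved context.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt that asks the model to answer only from the given chunks.

    Hypothetical template for illustration; the real application may word
    its instructions differently.
    """
    # Number each chunk so an answer can refer back to its source passage.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say that you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


retrieved = [
    "A process is a program in execution.",
    "  A thread is a basic unit of CPU utilization.  ",
]
prompt = build_prompt("What is a process?", retrieved)
print(prompt)
```

Telling the model what to do when the context lacks the answer is a simple way to discourage responses that are not grounded in the uploaded PDF.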
Technologies Used
| Technology | Purpose |
|---|---|
| Python | Core programming language |
| Streamlit | User Interface |
| LangChain | RAG Pipeline |
| Hugging Face | Embeddings and LLM |
| ChromaDB | Vector Database |
| PyPDFLoader | PDF Processing |
| Sentence Transformers | Embedding Generation |
How to Run Locally
Clone the Repository
git clone <repository-url>
cd rag-chatbot
Create Virtual Environment
python -m venv venv
Activate Virtual Environment
Windows:
venv\Scripts\activate
Linux/macOS:
source venv/bin/activate
Install Dependencies
pip install -r requirements.txt
Create Environment Variables
Create a .env file:
HF_TOKEN=your_huggingface_token
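Python does not read a .env file on its own, so the token has to be loaded into the environment before it is used (a common choice is `load_dotenv()` from the python-dotenv package; whether app.py does this is an assumption). The sketch below shows a hypothetical helper, `get_hf_token`, that reads HF_TOKEN and fails early with a clear message if it is missing or still set to the placeholder value.

```python
import os
from typing import Mapping, Optional


def get_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Hugging Face token from the environment, or raise if it is unusable.

    Hypothetical helper for illustration; `env` defaults to os.environ and
    can be replaced with a plain dict when testing.
    """
    source = os.environ if env is None else env
    token = source.get("HF_TOKEN", "").strip()
    if not token or token == "your_huggingface_token":
        raise RuntimeError(
            "HF_TOKEN is not set. Add it to your .env file or export it in your shell."
        )
    return token


# Example with an explicit mapping instead of the real environment.
print(get_hf_token({"HF_TOKEN": " hf_example_value "}))
```

Checking the token at startup surfaces a configuration problem immediately instead of at the first model call.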
Run the Application
streamlit run app.py
The application will be available at:
http://localhost:8501
Screenshot
Add screenshots of:
- Home page
- PDF upload screen
- Question-answer interface
- Retrieved source chunks
What Would We Improve With More Time?
With more development time, we would add the following enhancements:
- Multi-PDF support
- Chat history and conversational memory
- Source page citations
- Persistent vector database storage
- Advanced prompt engineering
- PDF summarization feature
- Voice-based interaction
- OCR support for scanned PDFs
- User authentication and document management
- Cloud deployment and scalability improvements
Learning Outcomes
Through this project, we gained hands-on experience with:
- Retrieval-Augmented Generation (RAG)
- Vector Databases
- Semantic Search
- Embedding Models
- Prompt Engineering
- Large Language Models
- Streamlit Application Development
- Natural Language Processing Workflows
Team
Developed by Five Pixels as part of an AI internship project.
License
This project is intended for educational purposes.



