Shravan Veeravalli - RAG app
📄 Project Documentation
## What document did you use and why?
We used a PDF document named college.pdf as the primary knowledge source for the chatbot. The document contains information related to:
- Attendance policies
- Semester examinations
- Clubs and extracurricular activities
- Placement eligibility
- Academic guidelines
We chose a PDF because colleges commonly publish notices, handbooks, rules, and academic information in PDF format, so the chatbot works with the kind of file it would encounter in real-world use.
The PDF was loaded using:
PyPDFLoader("college.pdf")
PyPDFLoader reads the PDF when its load() method is called and returns the text as document objects (one per page), ready for further processing.
## How does your chunking work?
Chunking is the process of dividing large document text into smaller sections called chunks.
Since large language models cannot efficiently process very large documents at once, the extracted PDF text is split into smaller manageable pieces.
We used:
RecursiveCharacterTextSplitter
with the following parameters (both are measured in characters by default):
chunk_size=500
chunk_overlap=50
How Chunking Happens
Step 1 — Extract Text from PDF
Example:
Attendance must be above 75 percent.
Semester exams begin in December.
The photography club is called Nexus.
Step 2 — Split into Chunks
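The splitter divides the text into smaller sections. The example below is simplified to one sentence per chunk; with chunk_size=500, these three short sentences (about 110 characters in total) would actually fit in a single chunk.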
Example:
Chunk 1:
Attendance must be above 75 percent.
Chunk 2:
Semester exams begin in December.
Chunk 3:
The photography club is called Nexus.
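The recursive splitting idea can be sketched in plain Python. This is a simplified illustration, not LangChain's actual implementation: the function name `recursive_split` is hypothetical, overlap is left out, and the separator order (paragraph break, line break, space, then single characters) is assumed to mirror the splitter's defaults.

```python
def recursive_split(text, chunk_size=500, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the current chunk
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece  # start a new chunk with this piece
        else:
            # Piece is too long on its own: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


sample = (
    "Attendance must be above 75 percent.\n"
    "Semester exams begin in December.\n"
    "The photography club is called Nexus."
)
print(len(recursive_split(sample)))            # whole sample fits at the default size
print(recursive_split(sample, chunk_size=40))  # a small size forces a split per line
```

Trying coarse separators first keeps paragraphs and sentences together whenever they fit, which is why this strategy usually produces more readable chunks than cutting at fixed positions.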
Why chunk_overlap=50 Is Used
Overlap lets neighboring chunks share up to 50 characters of text.
This prevents context loss when a sentence or fact is split across a chunk boundary.
Example:
Without overlap:
Chunk 1:
Placement eligibility requires
Chunk 2:
minimum 7 CGPA
With overlap:
Chunk 1:
Placement eligibility requires minimum
Chunk 2:
requires minimum 7 CGPA
Because each chunk now carries part of its neighbor's context, semantic search is more likely to retrieve the relevant text.
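The effect of overlap can be reproduced with a minimal sliding-window sketch. The function `chunk_with_overlap` is a hypothetical helper written for illustration; unlike RecursiveCharacterTextSplitter, it cuts at fixed character positions and ignores word boundaries.

```python
def chunk_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


text = "Placement eligibility requires minimum 7 CGPA"
first, second = chunk_with_overlap(text, chunk_size=30, chunk_overlap=10)
print(repr(first))
print(repr(second))
```

With these small demo values, the last 10 characters of the first chunk reappear at the start of the second, so text near the boundary is present in both.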
Advantages of Chunking
- Faster retrieval
- Better semantic search
- Improved response quality
- Reduced memory usage
- Better context handling
## Which embedding model did you use?
We used the Hugging Face embedding model:
sentence-transformers/all-MiniLM-L6-v2
It was loaded using:
HuggingFaceEmbeddings(
model_name="sentence-transformers/all-MiniLM-L6-v2"
)
Why This Model Was Chosen
This model is:
- lightweight,
- fast,
- beginner-friendly,
- efficient for semantic search tasks.
It converts text into 384-dimensional vector embeddings that capture semantic meaning.
Example:
"attendance criteria"
↓
vector representation
These embeddings let the chatbot match text by similarity in meaning rather than by exact keywords.
For example:
"What attendance is required?"
can match with:
"Attendance must be above 75 percent."
even though the wording is different.
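One common way to compare embeddings is cosine similarity, sketched below in plain Python. The 3-dimensional vectors are made-up numbers for illustration only (the real model produces 384 dimensions), the helper names are hypothetical, and the vector store used by the app may rely on a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs):
    """Return chunk texts ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunk_vecs.items()]
    return [text for _, text in sorted(scored, reverse=True)]


# Made-up toy embeddings, NOT real model output.
toy_vectors = {
    "Attendance must be above 75 percent.": [0.9, 0.1, 0.0],
    "Semester exams begin in December.": [0.1, 0.9, 0.1],
    "The photography club is called Nexus.": [0.0, 0.1, 0.9],
}
toy_query = [0.8, 0.2, 0.1]  # stands in for "What attendance is required?"

print(rank_chunks(toy_query, toy_vectors)[0])
```

The query vector points in nearly the same direction as the attendance chunk, so that chunk ranks first even though the two texts share few exact words.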
## How to Run Locally
Step 1 — Clone Repository
git clone https://code.swecha.org/venika_2537/faq-chatbot.git
Step 2 — Open Project Folder
cd faq-chatbot
Step 3 — Install Dependencies
pip install langchain
pip install chromadb
pip install sentence-transformers
pip install transformers
pip install streamlit
pip install pypdf
pip install torch
Step 4 — Run the Application
streamlit run app.py
Step 5 — Open Browser
Streamlit automatically opens the app in your default browser at:
http://localhost:8501
The chatbot interface will appear in the browser.
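If Step 4 fails with an import error, a small script like the following can report which packages from Step 3 are missing. It is a sketch that uses only the standard library; the module names are the assumed import names of the packages listed above (for example, sentence-transformers is imported as sentence_transformers), and `missing_modules` is a hypothetical helper.

```python
import importlib.util

# Assumed import names for the packages installed in Step 3.
REQUIRED_MODULES = [
    "langchain",
    "chromadb",
    "sentence_transformers",
    "transformers",
    "streamlit",
    "pypdf",
    "torch",
]


def missing_modules(module_names):
    """Return the names that cannot be found in the current Python environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

The check only looks modules up without importing them, so it runs quickly even when large packages such as torch are installed.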
## Screenshot
## What would you improve with more time?
If given more development time, the following improvements could be added:
1. Multiple PDF Upload Support
Allow users to upload multiple documents dynamically instead of using a single static PDF.
2. Chat Memory
Store previous conversations to enable context-aware interactions.
3. Better UI/UX
Improve frontend design using:
- custom CSS,
- chat bubbles,
- dark mode,
- animations.
4. OCR Support
Enable reading scanned or image-based PDFs using Optical Character Recognition.
5. Source Citation
Display the exact page or source chunk used to generate answers.
Example:
Answer retrieved from Page 3
6. Faster Retrieval
Optimize vector search performance for larger document collections.
7. Cloud Deployment
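Deploy the chatbot on a platform that can run a Python server process, such as:
- Hugging Face Spaces,
- Render,
- Streamlit Community Cloud.
Netlify is aimed at static sites and serverless functions, so it is not a natural fit for hosting a Streamlit app.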
8. Voice-Based Interaction
Add speech-to-text and text-to-speech functionality.
9. Improved Models
Use more advanced LLMs for:
- better reasoning,
- more natural responses,
- higher accuracy.
10. Admin Dashboard
Create an admin panel for:
- managing documents,
- monitoring usage,
- updating datasets.
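As an illustration of improvement 5 (Source Citation), the sketch below appends page numbers to an answer. It assumes each retrieved chunk is a dictionary with a "metadata" entry holding a zero-based "page" index, which is how PyPDFLoader is understood to record pages; both the data shape and the helper `format_answer_with_sources` are hypothetical and should be checked against the actual app.

```python
def format_answer_with_sources(answer, retrieved_chunks):
    """Append the (1-based) page numbers of the retrieved chunks to the answer."""
    pages = sorted(
        {
            chunk["metadata"]["page"] + 1  # convert assumed 0-based index to page number
            for chunk in retrieved_chunks
            if "page" in chunk.get("metadata", {})
        }
    )
    if not pages:
        return answer  # nothing to cite
    label = "Page" if len(pages) == 1 else "Pages"
    page_list = ", ".join(str(page) for page in pages)
    return f"{answer}\n\nAnswer retrieved from {label} {page_list}"


# Hypothetical retrieval result, for illustration only.
example_chunks = [
    {"text": "Attendance must be above 75 percent.", "metadata": {"page": 2}},
]
print(format_answer_with_sources("Attendance must be above 75 percent.", example_chunks))
```

Deduplicating and sorting the pages keeps the citation short when several retrieved chunks come from the same page.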