shravan veeravalli requested to merge shravan.vvkl/venika-rag-app:main into main May 29, 2026

📄 Project Documentation

## What document did you use and why?

We used a PDF document named college.pdf as the primary knowledge source for the chatbot. The document contains information related to:

- Attendance policies
- Semester examinations
- Clubs and extracurricular activities
- Placement eligibility
- Academic guidelines

We chose a PDF because many colleges already publish notices, handbooks, rules, and academic information in that format, so the chatbot works with the kind of material it would meet in real use.

The PDF was loaded using:

PyPDFLoader("college.pdf")

Calling the loader's load() method extracts the text of the PDF and returns it as document objects, one per page, for further processing.

## How does your chunking work?

Chunking is the process of dividing large document text into smaller sections called chunks.

Embedding models and large language models can only take in a limited amount of text at once, so the extracted PDF text is split into smaller, manageable pieces.

We used:

RecursiveCharacterTextSplitter

with the following parameters (both are measured in characters by default):

chunk_size=500
chunk_overlap=50

How Chunking Happens

Step 1 — Extract Text from PDF

Example:

Attendance must be above 75 percent.
Semester exams begin in December.
The photography club is called Nexus.

Step 2 — Split into Chunks

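The splitter divides the text into smaller sections, trying paragraph and line breaks first so that sentences stay intact where possible.

Example (simplified for illustration; with chunk_size=500, three sentences this short would fit in a single chunk):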

Chunk 1:
Attendance must be above 75 percent.

Chunk 2:
Semester exams begin in December.

Chunk 3:
The photography club is called Nexus.
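
The splitting step can be sketched in plain Python. This is a simplified illustration of the idea behind a recursive splitter, not the library's actual implementation: the function name `split_text` and its separator list are assumptions made for this example, and overlap is left out. It tries coarse separators first and falls back to finer ones only when a piece is still too long.

```python
def split_text(text, chunk_size=500, separators=("\n\n", "\n", " ")):
    """Simplified recursive split (hypothetical helper, no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: cut by character count as a last resort.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the chunk being built
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece  # start a new chunk with this piece
        else:
            # Piece alone is too long: retry it with the finer separators.
            chunks.extend(split_text(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Attendance must be above 75 percent.\n"
    "Semester exams begin in December.\n"
    "The photography club is called Nexus."
)
one_chunk = split_text(sample, chunk_size=500)
three_chunks = split_text(sample, chunk_size=40)
```

With chunk_size=500 the three example sentences stay together in one chunk; the three-chunk result only appears with a much smaller limit such as 40.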

Why `chunk_overlap=50` is Used

Overlap allows neighboring chunks to share some common text, here up to 50 characters.

This prevents context loss when a piece of information is split across a chunk boundary.

Example (shortened for illustration):

Without overlap:

Chunk 1:
Placement eligibility requires

Chunk 2:
minimum 7 CGPA

With overlap:

Chunk 1:
Placement eligibility requires minimum

Chunk 2:
requires minimum 7 CGPA

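Because each chunk carries some of its neighbor's text, a fact that straddles a boundary is more likely to be retrieved together with its context during semantic search.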
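
The effect of overlap can be reproduced with a fixed-size sliding window. This is a sketch under simplifying assumptions: the helper `chunk_with_overlap` is hypothetical and cuts purely by character position, whereas a separator-aware splitter would avoid cutting in the middle of a word. The small sizes are chosen only to keep the example readable.

```python
def chunk_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sentence = "Placement eligibility requires minimum 7 CGPA"
overlapping = chunk_with_overlap(sentence, chunk_size=30, chunk_overlap=10)
disjoint = chunk_with_overlap(sentence, chunk_size=30, chunk_overlap=0)
```

Here the last 10 characters of the first chunk are repeated at the start of the second, while the run with `chunk_overlap=0` simply cuts the sentence in two.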

Advantages of Chunking

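- Faster retrieval: a search returns a few short passages instead of the whole document
- Better semantic search: each embedding represents one focused passage
- Improved response quality: the model answers from the passages most relevant to the question
- Reduced memory usage: only the retrieved chunks, not the full document, are passed to the model
- Better context handling: overlap keeps related sentences together across chunk boundaries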

## Which embedding model did you use?

We used the Hugging Face embedding model:

sentence-transformers/all-MiniLM-L6-v2

It was loaded using:

HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

Why This Model Was Chosen

This model is lightweight, fast, beginner-friendly, and efficient for semantic search tasks.

It converts each piece of text into a 384-dimensional vector embedding that captures semantic meaning.

Example:

"attendance criteria"
        ↓
vector representation

These embeddings let the chatbot match text by similarity in meaning rather than by exact keywords.

For example:

"What attendance is required?"

can match with:

"Attendance must be above 75 percent."

even though the wording is different.
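
How such a match is scored can be sketched with cosine similarity, a common way to compare embeddings. Everything numeric here is an assumption for illustration: the three-dimensional vectors are made up by hand (the real model produces 384 values per text), and `cosine_similarity` and `top_k` are hypothetical helpers, not part of any library used in the project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=1):
    """Return the k chunk texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hand-made toy vectors (assumption): not real model output.
toy_index = {
    "Attendance must be above 75 percent.": [0.9, 0.1, 0.0],
    "Semester exams begin in December.": [0.1, 0.9, 0.1],
    "The photography club is called Nexus.": [0.0, 0.1, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stands in for "What attendance is required?"
best_match = top_k(query_vector, toy_index, k=1)
```

With these toy numbers the attendance sentence ranks first because its vector points in nearly the same direction as the query vector.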

## How to Run Locally

Step 1 — Clone Repository

git clone https://code.swecha.org/venika_2537/faq-chatbot.git

Step 2 — Open Project Folder

cd faq-chatbot

Step 3 — Install Dependencies

pip install langchain
pip install chromadb
pip install sentence-transformers
pip install transformers
pip install streamlit
pip install pypdf
pip install torch

Step 4 — Run the Application

streamlit run app.py

Step 5 — Open Browser

Streamlit normally opens the app in your default browser at:

http://localhost:8501

The chatbot interface will appear in the browser.

## Screenshot

## What would you improve with more time?

With more development time, we would add the following improvements:

1. Multiple PDF Upload Support

Allow users to upload multiple documents dynamically instead of using a single static PDF.

2. Chat Memory

Store previous conversations to enable context-aware interactions.
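
One possible shape for this, sketched with the standard library only: a small buffer that keeps the last few question and answer pairs and renders them as text to place before the next prompt. The class name `ChatMemory`, its methods, and the `max_turns` limit are hypothetical; in a Streamlit app the buffer would probably live in session state.

```python
from collections import deque


class ChatMemory:
    """Keeps only the most recent turns so the prompt does not grow without bound."""

    def __init__(self, max_turns=5):
        # A deque with maxlen silently drops the oldest turn when full.
        self._turns = deque(maxlen=max_turns)

    def add_turn(self, question, answer):
        self._turns.append((question, answer))

    def as_prompt_context(self):
        """Render the stored turns as plain text for the next prompt."""
        return "\n".join(f"User: {q}\nBot: {a}" for q, a in self._turns)


memory = ChatMemory(max_turns=2)
memory.add_turn("What attendance is required?", "Above 75 percent.")
memory.add_turn("When do exams begin?", "In December.")
memory.add_turn("What is the photography club called?", "Nexus.")
context = memory.as_prompt_context()
```

Capping the number of stored turns is one simple way to keep the prompt within the model's input limit; summarizing older turns would be an alternative.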

3. Better UI/UX

Improve the frontend design with custom CSS, chat bubbles, dark mode, and animations.

4. OCR Support

Enable reading scanned or image-based PDFs using Optical Character Recognition.

5. Source Citation

Display the exact page or source chunk used to generate answers.

Example:

Answer retrieved from Page 3
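
A minimal sketch of how that line could be produced, assuming each retrieved chunk carries a metadata dictionary with a zero-based `page` entry (PDF loaders commonly attach one, but the exact key should be checked against the loader in use). The function name `format_answer_with_source` is hypothetical.

```python
def format_answer_with_source(answer, chunk_metadata):
    """Append a human-readable page reference when the chunk metadata has one."""
    page_index = chunk_metadata.get("page")  # assumed to be zero-based
    if page_index is None:
        return answer  # no page information available: return the answer unchanged
    return f"{answer}\n\nAnswer retrieved from Page {page_index + 1}"


cited = format_answer_with_source(
    "Attendance must be above 75 percent.",
    {"source": "college.pdf", "page": 2},
)
uncited = format_answer_with_source("Attendance must be above 75 percent.", {})
```

The `+ 1` converts the zero-based index into the page number a reader would see in a PDF viewer.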

6. Faster Retrieval

Optimize vector search performance for larger document collections.

7. Cloud Deployment

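Deploy the chatbot on a host that can run a Streamlit server, such as Hugging Face Spaces, Render, or Streamlit Cloud. Netlify, which targets static sites and serverless functions, would need a different frontend.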

8. Voice-Based Interaction

Add speech-to-text and text-to-speech functionality.

9. Improved Models

Use more advanced LLMs for better reasoning, more natural responses, and higher accuracy.

10. Admin Dashboard

Create an admin panel for managing documents, monitoring usage, and updating datasets.

Shravan Veeravalli - RAG app
