Shreyas Mogalapalli requested to merge ShreyasMogalapalli/ip1-icfai:main into main May 29, 2026

🧠 RAG Chatbot — Local AI Document Q&A

A fully local, privacy-first Retrieval-Augmented Generation (RAG) chatbot built with Streamlit and Ollama. Three modes in one app — chat freely, upload documents, or paste a URL — all processing happens on your machine, nothing sent to the cloud.

📸 Screenshot

Run the app and save your screenshot as screenshots/demo.png

📄 What Document Did I Use and Why?

This app is general-purpose — it works with any PDF, TXT, or Markdown file you provide, or any webpage URL. There is no hardcoded document.

Good starter documents to test with:

A research paper (e.g. Attention Is All You Need) — tests dense technical Q&A
A company policy or HR handbook — tests factual retrieval
A Wikipedia article via URL — tests live web ingestion

Why no fixed document? The app is designed to be reusable across any domain — academic, legal, personal notes, or web content — so you bring your own data.

💬 Three Modes

The app has 3 tabs, each with its own chat history:

Tab	Mode	How it works
💬 Normal Chat	Plain LLM	Just type and chat — no documents needed
📄 Document	RAG on file	Upload PDF/TXT/MD → index → answers from your file
🌐 URL	RAG on webpage	Paste a URL → fetch & index → answers from that page

✂️ How Does Your Chunking Work?

Chunking is handled in utils/loader.py using a character-based sliding window with sentence-boundary awareness.

Steps:

Document loaded as raw text (PyMuPDF for PDFs, plain read for TXT/MD, HTTP scrape for URLs)
Text split into 500-character chunks (configurable in sidebar)
Each chunk has 50-character overlap with the next — prevents answers being cut at boundaries
Before cutting, the chunker looks back up to 100 chars for a natural break: paragraph (\n\n), newline (\n), or sentence end (. , ! , ? )
Each chunk stored with metadata: source, chunk_id, start_char

Example of smart breaking:

"...attention mechanisms are used here.\n\nThe encoder then maps..."
                                         ↑ breaks here (paragraph boundary)

Parameter	Default	Configurable range
Chunk size	500 chars	200 – 1000
Chunk overlap	50 chars	0 – 200

🔢 Which Embedding Model Did I Use?

Model: nomic-embed-text:latest via Ollama

Why nomic-embed-text?

Built specifically for retrieval tasks — outperforms general-purpose models on semantic search
Produces 768-dimensional vectors — precise enough for similarity matching, efficient for local hardware
Runs fully locally via Ollama — no API key, no internet required
Lightweight at only 274 MB, already in your Ollama install

How it's used:

Index time: each text chunk → nomic-embed-text → 768-dim vector → stored in ChromaDB
Query time: user question → nomic-embed-text → query vector → cosine similarity → top-K chunks retrieved → passed to LLM as context

🚀 How to Run Locally

Prerequisites

Python 3.9+
Ollama installed

Step 1 — Start Ollama

ollama serve

Confirm you have the required models:

ollama pull nomic-embed-text   # embeddings
ollama pull llama3.1:8b        # chat LLM (default)

Step 2 — Install Dependencies

cd shreyas-rag-app
pip install -r requirements.txt

Step 3 — Run the App

streamlit run app.py

Open http://localhost:8501 in your browser.

Step 4 — Pick Your Mode

💬 Normal Chat tab — just start typing, no setup needed

📄 Document tab:

Expand Upload & Index a Document
Upload your PDF / TXT / MD
Click ⚡ Index Document
Ask questions in the chat box below

🌐 URL tab:

Expand Fetch & Index a URL
Paste any webpage URL
Click 🌐 Fetch & Index
Ask questions about the page

🗂️ Folder Structure

shreyas-rag-app/
├── app.py                  # Main Streamlit app (3-tab UI)
├── requirements.txt        # All dependencies
├── README.md               # This file
├── data/
│   └── your_document.pdf   # Place documents here
├── utils/
│   ├── __init__.py
│   ├── loader.py           # Document loading & chunking logic
│   ├── embedder.py         # Embedding generation via Ollama
│   └── retriever.py        # ChromaDB vector store & retrieval
└── screenshots/
    └── demo.png            # App screenshot

🧩 Supported Models (from your Ollama install)

Model	Best for
`llama3.1:8b`	Balanced quality & speed — default
`qwen2.5:7b`	Strong reasoning & instruction following
`deepseek-r1:8b`	Complex multi-step reasoning
`gemma2:2b`	Fast responses, low RAM usage
`llama3:latest`	General purpose
`nomic-embed-text`	Embeddings only (not for chat)

🔮 What Would I Improve With More Time?

1. 🔁 Semantic Chunking

Replace fixed character chunking with topic-boundary chunking — splits where the content actually changes, not just at a character count. Libraries like semantic-text-splitter would give much more coherent chunks.

2. 🔀 Hybrid Search (BM25 + Vector)

Combine dense vector search (semantic similarity) with sparse BM25 keyword search. Hybrid retrieval handles both vague conceptual questions and exact term lookups (names, dates, codes) better than either alone.

3. 🗃️ Multi-Document Management

A dedicated UI to manage multiple indexed documents — view, delete, or filter retrieval to a specific file. Currently all documents in a tab are searched together.

4. 💬 Smarter Conversation Memory

Replace the current 3-turn sliding window with a summarisation-based memory — older turns get compressed into a short summary so the model retains context over long conversations without hitting the context limit.

5. 📊 Retrieval Evaluation

Add a 👍 / 👎 feedback button per answer and log Hit Rate and MRR metrics to measure and improve retrieval quality over time.

6. 🖥️ Streaming Responses

Switch from stream: false to streaming Ollama responses so the answer appears word-by-word instead of all at once — much better UX for longer answers.

🔒 Privacy

Everything runs locally — no data ever leaves your machine:

Ollama runs LLMs and embeddings entirely on your hardware
ChromaDB stores all vectors in ./chroma_db/ on disk
No telemetry, no cloud calls, no API keys required EOF echo "Done"

Edited May 29, 2026 by Shreyas Mogalapalli

[Submission] Shreyas Mogalapalli — RAG App