[Submission] Shreyas Mogalapalli — RAG App
🧠 RAG Chatbot — Local AI Document Q&A
A fully local, privacy-first Retrieval-Augmented Generation (RAG) chatbot built with Streamlit and Ollama. Three modes in one app — chat freely, upload documents, or paste a URL — all processing happens on your machine, nothing sent to the cloud.
📸 Screenshot
Run the app and save your screenshot as
screenshots/demo.png
📄 What Document Did I Use and Why?
This app is general-purpose — it works with any PDF, TXT, or Markdown file you provide, or any webpage URL. There is no hardcoded document.
Good starter documents to test with:
- A research paper (e.g. Attention Is All You Need) — tests dense technical Q&A
- A company policy or HR handbook — tests factual retrieval
- A Wikipedia article via URL — tests live web ingestion
Why no fixed document? The app is designed to be reusable across any domain — academic, legal, personal notes, or web content — so you bring your own data.
💬 Three Modes
The app has 3 tabs, each with its own chat history:
| Tab | Mode | How it works |
|---|---|---|
|
|
Plain LLM | Just type and chat — no documents needed |
|
|
RAG on file | Upload PDF/TXT/MD → index → answers from your file |
|
|
RAG on webpage | Paste a URL → fetch & index → answers from that page |
✂ ️ How Does Your Chunking Work?
Chunking is handled in utils/loader.py using a character-based sliding window with sentence-boundary awareness.
Steps:
- Document loaded as raw text (PyMuPDF for PDFs, plain read for TXT/MD, HTTP scrape for URLs)
- Text split into 500-character chunks (configurable in sidebar)
- Each chunk has 50-character overlap with the next — prevents answers being cut at boundaries
- Before cutting, the chunker looks back up to 100 chars for a natural break: paragraph (
\n\n), newline (\n), or sentence end (.,!,?) - Each chunk stored with metadata:
source,chunk_id,start_char
Example of smart breaking:
"...attention mechanisms are used here.\n\nThe encoder then maps..."
↑ breaks here (paragraph boundary)
| Parameter | Default | Configurable range |
|---|---|---|
| Chunk size | 500 chars | 200 – 1000 |
| Chunk overlap | 50 chars | 0 – 200 |
🔢 Which Embedding Model Did I Use?
Model: nomic-embed-text:latest via Ollama
Why nomic-embed-text?
- Built specifically for retrieval tasks — outperforms general-purpose models on semantic search
- Produces 768-dimensional vectors — precise enough for similarity matching, efficient for local hardware
- Runs fully locally via Ollama — no API key, no internet required
- Lightweight at only 274 MB, already in your Ollama install
How it's used:
-
Index time: each text chunk →
nomic-embed-text→ 768-dim vector → stored in ChromaDB -
Query time: user question →
nomic-embed-text→ query vector → cosine similarity → top-K chunks retrieved → passed to LLM as context
🚀 How to Run Locally
Prerequisites
- Python 3.9+
- Ollama installed
Step 1 — Start Ollama
ollama serve
Confirm you have the required models:
ollama pull nomic-embed-text # embeddings
ollama pull llama3.1:8b # chat LLM (default)
Step 2 — Install Dependencies
cd shreyas-rag-app
pip install -r requirements.txt
Step 3 — Run the App
streamlit run app.py
Open http://localhost:8501 in your browser.
Step 4 — Pick Your Mode
- Expand Upload & Index a Document
- Upload your PDF / TXT / MD
- Click
⚡ Index Document - Ask questions in the chat box below
- Expand Fetch & Index a URL
- Paste any webpage URL
- Click
🌐 Fetch & Index - Ask questions about the page
🗂 ️ Folder Structure
shreyas-rag-app/
├── app.py # Main Streamlit app (3-tab UI)
├── requirements.txt # All dependencies
├── README.md # This file
├── data/
│ └── your_document.pdf # Place documents here
├── utils/
│ ├── __init__.py
│ ├── loader.py # Document loading & chunking logic
│ ├── embedder.py # Embedding generation via Ollama
│ └── retriever.py # ChromaDB vector store & retrieval
└── screenshots/
└── demo.png # App screenshot
🧩 Supported Models (from your Ollama install)
| Model | Best for |
|---|---|
llama3.1:8b |
Balanced quality & speed — default |
qwen2.5:7b |
Strong reasoning & instruction following |
deepseek-r1:8b |
Complex multi-step reasoning |
gemma2:2b |
Fast responses, low RAM usage |
llama3:latest |
General purpose |
nomic-embed-text |
Embeddings only (not for chat) |
🔮 What Would I Improve With More Time?
1. 🔁 Semantic Chunking
Replace fixed character chunking with topic-boundary chunking — splits where the content actually changes, not just at a character count. Libraries like semantic-text-splitter would give much more coherent chunks.
2. 🔀 Hybrid Search (BM25 + Vector)
Combine dense vector search (semantic similarity) with sparse BM25 keyword search. Hybrid retrieval handles both vague conceptual questions and exact term lookups (names, dates, codes) better than either alone.
3. 🗃 ️ Multi-Document Management
A dedicated UI to manage multiple indexed documents — view, delete, or filter retrieval to a specific file. Currently all documents in a tab are searched together.
4. 💬 Smarter Conversation Memory
Replace the current 3-turn sliding window with a summarisation-based memory — older turns get compressed into a short summary so the model retains context over long conversations without hitting the context limit.
5. 📊 Retrieval Evaluation
Add a
6. 🖥 ️ Streaming Responses
Switch from stream: false to streaming Ollama responses so the answer appears word-by-word instead of all at once — much better UX for longer answers.
🔒 Privacy
Everything runs locally — no data ever leaves your machine:
- Ollama runs LLMs and embeddings entirely on your hardware
- ChromaDB stores all vectors in
./chroma_db/on disk - No telemetry, no cloud calls, no API keys required EOF echo "Done"