[Submission] Joshika - RAG App
Cinerag - Tollywood and Document RAG Chatbot
Cinerag is a Retrieval-Augmented Generation project for cinema-related question answering. The existing app answers questions about the Tollywood, or Telugu-language, film industry from a local knowledge base. This submission also adds a Gradio document-RAG app that can search uploaded PDF, TXT, Markdown, or CSV files using paragraph-aware chunks.
Submission Highlights
- Preserved the existing Streamlit, backend API, frontend, and Tollywood data workflow.
- Added app.py, a Gradio document-RAG app for uploaded files.
- Added paragraph and sentence-based chunking instead of sending an entire document into the prompt.
- Added lightweight TF-IDF style retrieval with source filenames, chunk numbers, and retrieval scores.
- Included sample assignment PDFs and a helper script to regenerate them.
- Avoided hardcoded API keys, model paths, paid services, and persistent vector database setup.
Tollywood RAG App
The Tollywood app uses a local knowledge base in data/tollywood_kb.json, retrieves the most relevant passages with a pure-Python TF-IDF style retriever, and returns grounded answers with source snippets. The knowledge base includes Telugu movie titles, summaries, actors, directors, genres, release years, reception notes, and retrieval metadata.
The local dataset includes:
- Telugu cinema industry overview and history
- Hyderabad film hub and studio information
- Important Telugu directors and actors
- Music, dance, genres, awards, audience culture, and pan-Indian cinema topics
- Classic and modern Telugu films such as Mayabazar, Baahubali, RRR, Pushpa, Hanu-Man, and Kalki 2898 AD
- Actor and director profile chunks for better person-based retrieval
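The TF-IDF style retrieval described for the Tollywood app can be sketched roughly as follows. This is a minimal illustration, not the code in backend/rag_engine.py: the function names (tokenize, build_index, retrieve), the smoothed IDF formula, and the three toy passages are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(passages):
    """Precompute term counts per passage and a smoothed IDF per term."""
    term_counts = [Counter(tokenize(p)) for p in passages]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    n = len(passages)
    # One common smoothed IDF variant (an assumption, not the repo's formula).
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return term_counts, idf


def retrieve(query, passages, top_k=2):
    """Return (score, passage_index) pairs for the best matches, best first."""
    term_counts, idf = build_index(passages)
    query_terms = set(tokenize(query))
    scored = []
    for i, counts in enumerate(term_counts):
        length = sum(counts.values()) or 1
        # Sum of (term frequency * IDF) over query terms found in the passage.
        score = sum((counts[t] / length) * idf[t] for t in query_terms if t in counts)
        if score > 0:
            scored.append((score, i))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


# Toy passages for illustration only; the real data lives in data/tollywood_kb.json.
passages = [
    "Baahubali is an epic action film directed by S. S. Rajamouli.",
    "Mayabazar is a classic Telugu mythological film released in 1957.",
    "Hyderabad is the main production hub for Telugu cinema.",
]
results = retrieve("Who directed Baahubali?", passages)
for score, idx in results:
    print(f"{score:.3f}", passages[idx])
```

Passages that share no terms with the question score zero and are dropped, so the retriever can return fewer than top_k results.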
Document RAG App
The Gradio app in app.py lets users upload documents and ask questions, which are answered from retrieved evidence.
How it works:
- Documents are loaded from uploads. If no file is uploaded, bundled sample PDFs are used when present.
- Text is normalized and split using paragraph boundaries, with sentence fallback for single-block text.
- Paragraphs are grouped into chunks of about 900 characters with overlap.
- The user question is tokenized and compared with chunks using term frequency and inverse document frequency.
- The app returns the strongest retrieved chunks and uses the highest-ranked evidence as the answer source.
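The chunking steps above can be sketched as below. This is a hedged illustration under stated assumptions rather than the code in app.py: the helper names (split_paragraphs, chunk_text), the one-paragraph overlap, and the blank-line paragraph rule are hypothetical choices; only the roughly 900-character target comes from the description above.

```python
import re


def split_paragraphs(text):
    """Split on blank lines; fall back to sentences for single-block text."""
    text = text.replace("\r\n", "\n").strip()
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if len(paragraphs) <= 1:
        # Sentence fallback: break after ., ! or ? followed by whitespace.
        paragraphs = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return paragraphs


def chunk_text(text, max_chars=900, overlap_units=1):
    """Group paragraphs into chunks of roughly max_chars characters.

    The last overlap_units paragraphs of a chunk are repeated at the start
    of the next one. A single paragraph longer than max_chars is kept whole,
    so a chunk can exceed the target in that case.
    """
    chunks, current = [], []
    for unit in split_paragraphs(text):
        candidate = " ".join(current + [unit])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap_units:] if overlap_units > 0 else []
        current.append(unit)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Synthetic sample: five paragraphs of 312 characters each.
sample = "\n\n".join(f"Paragraph {i} " + "x" * 300 for i in range(5))
chunks = chunk_text(sample)
for number, chunk in enumerate(chunks, start=1):
    print(number, len(chunk), chunk[:12])
```

Overlapping by whole paragraphs keeps each chunk readable on its own, at the cost of some repeated text in the index.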
Project Structure
.
|-- app.py
|-- backend/
|   |-- app.py
|   `-- rag_engine.py
|-- data/
|   `-- tollywood_kb.json
|-- frontend/
|   |-- index.html
|   |-- script.js
|   `-- styles.css
|-- streamlit_app.py
|-- requirements.txt
|-- Prompt_Engineering_Assignment_3.pdf
|-- Prompt_Engineering_Assignment_3_Colorful.pdf
`-- build_prompt_engineering_pdf.py
Run With Streamlit
pip install -r requirements.txt
streamlit run streamlit_app.py
Then open the local URL shown in the terminal, usually:
http://localhost:8501
Run the Gradio Document RAG App
pip install -r requirements.txt
python app.py
Then open the local Gradio URL shown in the terminal, usually:
http://127.0.0.1:7860
Run Backend and HTML Frontend
Start the backend:
python backend/app.py
Then open:
frontend/index.html
The frontend expects the backend API at:
http://127.0.0.1:8000
If you deploy the backend somewhere else, update API_URL in frontend/script.js.
Example Questions
- What is Tollywood known for?
- Tell me about Baahubali and RRR
- Who are important directors in Telugu cinema?
- How did Telugu cinema become pan-Indian?
- Who acted in Mahanati and what is it about?
- Which Telugu movies did Rajamouli direct?
- Who directed Kalki 2898 AD?
- Tell me about Allu Arjun and Pushpa
Test Locally
python -m json.tool data/tollywood_kb.json
python -m py_compile backend/rag_engine.py backend/app.py streamlit_app.py app.py
Run a quick terminal smoke test:
python - <<'PY'
from backend.rag_engine import RAGEngine

engine = RAGEngine("data/tollywood_kb.json")
for question in [
    "Who acted in Mahanati and what is it about?",
    "Which Telugu movies did Rajamouli direct?",
    "Who directed Kalki 2898 AD?",
    "Tell me about Allu Arjun and Pushpa.",
]:
    result = engine.answer(question)
    print("\nQ:", question)
    print("A:", result["answer"])
    print("Sources:", [source["id"] for source in result["sources"]])
PY
Add More Knowledge
Edit data/tollywood_kb.json and add more entries with:
{
  "id": "unique-id",
  "title": "Short title",
  "category": "film",
  "metadata": {
    "type": "movie",
    "year": 2024,
    "directors": ["Director Name"],
    "actors": ["Actor One", "Actor Two"],
    "genres": ["action", "drama"],
    "rating": "Short reception note",
    "aliases": ["Optional alternate title"]
  },
  "text": "Knowledge passage..."
}
The required fields are id, title, category, and text. The metadata field is optional, but adding it improves retrieval for actor, director, year, genre, and rating questions.
Restart the backend or Streamlit app after changing the knowledge base so the in-memory index is rebuilt.
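Before restarting, it can help to check new entries against the required fields. The script below is a minimal sketch, not part of the repository: validate_entries and validate_file are hypothetical helpers, and it assumes the knowledge base is a top-level JSON list of entry objects with string values for the required fields. Adjust it if the file is laid out differently.

```python
import json

REQUIRED_FIELDS = ("id", "title", "category", "text")


def validate_entries(entries):
    """Return a list of readable problems; an empty list means all entries pass."""
    problems = []
    seen_ids = set()
    for position, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {position}: not a JSON object")
            continue
        # Assumption: every required field must be a non-empty string.
        for field in REQUIRED_FIELDS:
            value = entry.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {position}: missing or empty '{field}'")
        entry_id = entry.get("id")
        if isinstance(entry_id, str):
            if entry_id in seen_ids:
                problems.append(f"entry {position}: duplicate id '{entry_id}'")
            else:
                seen_ids.add(entry_id)
        # metadata is optional, but should be an object when present.
        if "metadata" in entry and not isinstance(entry["metadata"], dict):
            problems.append(f"entry {position}: 'metadata' must be an object")
    return problems


def validate_file(path):
    """Load a JSON file and validate it, assuming a top-level list of entries."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        return ["top level is not a list; adapt this check to the file's real layout"]
    return validate_entries(data)
```

From the project root, it could be used as print(validate_file("data/tollywood_kb.json")), which prints an empty list when every entry passes.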
Deployment Note
GitHub Pages can host only the static frontend files. This chatbot also needs the Python backend in backend/app.py, so the full app should be hosted on a Python-friendly service such as Render, Railway, Fly.io, Hugging Face Spaces, or a VPS.