Joshika gangavaram requested to merge Joshika_12/cinerag:main into main May 29, 2026

Cinerag - Tollywood and Document RAG Chatbot

Cinerag is a Retrieval-Augmented Generation (RAG) project for cinema-related question answering. The existing app answers questions about Tollywood, the Telugu-language film industry, from a local knowledge base. This submission also adds a Gradio document-RAG app that searches uploaded PDF, TXT, Markdown, or CSV files using paragraph-aware chunks.

Submission Highlights

Preserved the existing Streamlit app, backend API, frontend, and Tollywood data workflow.
Added app.py, a Gradio document-RAG app for uploaded files.
Added paragraph- and sentence-based chunking instead of sending an entire document into the prompt.
Added lightweight TF-IDF-style retrieval that reports source filenames, chunk numbers, and retrieval scores.
Included sample assignment PDFs and a helper script to regenerate them.
Avoided hardcoded API keys, model paths, paid services, and persistent vector database setup.

Tollywood RAG App

The Tollywood app uses a local knowledge base in data/tollywood_kb.json, retrieves the most relevant passages with a pure-Python TF-IDF-style retriever, and returns grounded answers with source snippets. The knowledge base includes Telugu movie titles, summaries, actors, directors, genres, release years, reception notes, and retrieval metadata.
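As a rough illustration of what TF-IDF-style retrieval looks like in pure Python, here is a minimal sketch. It is not the code in backend/rag_engine.py: the function names (tokenize, build_index, retrieve), the smoothed IDF formula, and the three sample passages are assumptions made for this example only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(passages):
    """Precompute per-passage term counts and smoothed IDF weights."""
    term_counts = [Counter(tokenize(p)) for p in passages]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    n = len(passages)
    # Smoothed IDF (an assumption): stays positive even for terms
    # that appear in every passage.
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return term_counts, idf


def retrieve(question, passages, top_k=2):
    """Return (score, passage_index) pairs for the best-matching passages."""
    term_counts, idf = build_index(passages)
    query_terms = set(tokenize(question))
    scored = []
    for i, counts in enumerate(term_counts):
        length = sum(counts.values()) or 1
        # Term frequency is normalized by passage length.
        score = sum((counts[t] / length) * idf[t] for t in query_terms if t in counts)
        if score > 0:
            scored.append((score, i))
    # Highest score first; ties broken by passage order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


# Hypothetical passages, not taken from data/tollywood_kb.json.
passages = [
    "Baahubali is a two-part epic directed by S. S. Rajamouli.",
    "Mahanati is a biographical film about the actress Savitri.",
    "Hyderabad is the main production hub for Telugu cinema.",
]
results = retrieve("Who directed Baahubali?", passages)
for score, index in results:
    print(round(score, 3), passages[index])
```

Only the first passage shares terms with the question, so it is the only one returned. Passages with a zero score are dropped rather than padded into the result, which keeps unrelated text out of the answer.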

The local dataset includes:

Telugu cinema industry overview and history
Hyderabad film hub and studio information
Important Telugu directors and actors
Music, dance, genres, awards, audience culture, and pan-Indian cinema topics
Classic and modern Telugu films such as Mayabazar, Baahubali, RRR, Pushpa, Hanu-Man, and Kalki 2898 AD
Actor and director profile chunks for better person-based retrieval

Document RAG App

The Gradio app in app.py lets users upload documents and ask questions that are answered from retrieved evidence.

How it works:

Documents are loaded from uploads. If no file is uploaded, bundled sample PDFs are used when present.
Text is normalized and split using paragraph boundaries, with sentence fallback for single-block text.
Paragraphs are grouped into chunks of about 900 characters with overlap.
The user question is tokenized and compared with chunks using term frequency and inverse document frequency.
The app returns the strongest retrieved chunks and uses the highest-ranked evidence as the answer source.
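The chunking steps above can be sketched roughly as follows. This is an illustrative approximation, not the code in app.py: the function names (split_units, chunk_text), the choice to overlap by one whole paragraph, and the exact splitting regexes are assumptions. Only the 900-character target comes from the description above.

```python
import re


def split_units(text):
    """Split on blank lines; fall back to sentences for single-block text."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if len(paragraphs) > 1:
        return paragraphs
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]


def chunk_text(text, max_chars=900, overlap_units=1):
    """Group units into chunks of roughly max_chars characters.

    The last overlap_units units of each chunk are repeated at the start
    of the next one. A single unit longer than max_chars is kept whole,
    so a chunk can exceed the target in that case.
    """
    units = split_units(text)
    chunks, current = [], []
    for unit in units:
        candidate = " ".join(current + [unit])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap_units:] if overlap_units else []
        current.append(unit)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Four synthetic paragraphs of 402 characters each.
sample = "\n\n".join(f"Paragraph {i} " + "x" * 390 for i in range(4))
chunks = chunk_text(sample)
for number, chunk in enumerate(chunks, start=1):
    print(number, len(chunk), chunk[:12])
```

With these settings each chunk holds two paragraphs, and every chunk after the first starts with the last paragraph of the previous one. Overlapping by whole paragraphs is one simple way to keep context across chunk boundaries; a character-based overlap would also work.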

Project Structure

.
|-- app.py
|-- backend/
|   |-- app.py
|   `-- rag_engine.py
|-- data/
|   `-- tollywood_kb.json
|-- frontend/
|   |-- index.html
|   |-- script.js
|   `-- styles.css
|-- streamlit_app.py
|-- requirements.txt
|-- Prompt_Engineering_Assignment_3.pdf
|-- Prompt_Engineering_Assignment_3_Colorful.pdf
`-- build_prompt_engineering_pdf.py

Run With Streamlit

pip install -r requirements.txt
streamlit run streamlit_app.py

Then open the local URL shown in the terminal, usually:

http://localhost:8501

Run the Gradio Document RAG App

pip install -r requirements.txt
python app.py

Then open the local Gradio URL shown in the terminal, usually:

http://127.0.0.1:7860

Run Backend and HTML Frontend

Start the backend:

python backend/app.py

Then open:

frontend/index.html

The frontend expects the backend API at:

http://127.0.0.1:8000

If you deploy the backend somewhere else, update API_URL in frontend/script.js.

Example Questions

What is Tollywood known for?
Tell me about Baahubali and RRR
Who are important directors in Telugu cinema?
How did Telugu cinema become pan-Indian?
Who acted in Mahanati and what is it about?
Which Telugu movies did Rajamouli direct?
Who directed Kalki 2898 AD?
Tell me about Allu Arjun and Pushpa

Test Locally

python -m json.tool data/tollywood_kb.json
python -m py_compile backend/rag_engine.py backend/app.py streamlit_app.py app.py

Run a quick terminal smoke test:

python - <<'PY'
from backend.rag_engine import RAGEngine

engine = RAGEngine("data/tollywood_kb.json")

for question in [
    "Who acted in Mahanati and what is it about?",
    "Which Telugu movies did Rajamouli direct?",
    "Who directed Kalki 2898 AD?",
    "Tell me about Allu Arjun and Pushpa.",
]:
    result = engine.answer(question)
    print("\nQ:", question)
    print("A:", result["answer"])
    print("Sources:", [source["id"] for source in result["sources"]])
PY

Add More Knowledge

Edit data/tollywood_kb.json and add more entries in this shape:

{
  "id": "unique-id",
  "title": "Short title",
  "category": "film",
  "metadata": {
    "type": "movie",
    "year": 2024,
    "directors": ["Director Name"],
    "actors": ["Actor One", "Actor Two"],
    "genres": ["action", "drama"],
    "rating": "Short reception note",
    "aliases": ["Optional alternate title"]
  },
  "text": "Knowledge passage..."
}

The required fields are id, title, category, and text. The metadata field is optional, but adding it improves retrieval for actor, director, year, genre, and rating questions.
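Before restarting the app, new entries can be checked against these rules with a small script. The sketch below is a hypothetical helper, not part of the repository: validate_entry and find_duplicate_ids are names invented here, and it assumes that required fields are non-empty strings and that ids should be unique. It works on entry dictionaries directly, so it makes no assumption about how the JSON file wraps them.

```python
from collections import Counter

REQUIRED_FIELDS = ("id", "title", "category", "text")


def validate_entry(entry):
    """Return a list of problems found in one knowledge-base entry."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        # Assumption: every required field is a non-empty string.
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty required field: {field}")
    # metadata is optional, but must be an object when present.
    if not isinstance(entry.get("metadata", {}), dict):
        problems.append("metadata must be an object when present")
    return problems


def find_duplicate_ids(entries):
    """Return ids that appear more than once, in sorted order."""
    counts = Counter(e.get("id") for e in entries if isinstance(e.get("id"), str))
    return sorted(i for i, n in counts.items() if n > 1)


good = {
    "id": "unique-id",
    "title": "Short title",
    "category": "film",
    "metadata": {"type": "movie", "year": 2024},
    "text": "Knowledge passage...",
}
incomplete = {"id": "unique-id", "title": "Short title"}

print(validate_entry(good))
print(validate_entry(incomplete))
print(find_duplicate_ids([good, incomplete]))
```

The first entry passes, the second is reported as missing category and text, and the shared id is flagged as a duplicate.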

Restart the backend or Streamlit app after changing the knowledge base so the in-memory index is rebuilt.

Deployment Note

GitHub Pages can host only the static frontend files. This chatbot also needs the Python backend in backend/app.py, so the full app should be hosted on a Python-friendly service such as Render, Railway, Fly.io, Hugging Face Spaces, or a VPS.

[Submission] Joshika -RAG App