[Submission] Bhavana-RAG App
Project Title: IPL 2026 Match Insight Assistant
Team Name: [PRISM] | Project Type: RAG-based IPL 2026 Assistant
URL: https://huggingface.co/spaces/suryavamshibhavana/IPL_2026_Match_Insight_Assistant/tree/main
1. RAG System Design
1.1 Problem Statement and Motivation
- Problem Definition: Our assistant addresses the challenge of navigating fragmented and massive match data from the IPL 2026 season. Currently, cricket fans and analysts struggle to extract specific match statistics or venue-specific performance metrics from raw CSV datasets without advanced data analysis tools.
- Need for Retrieval: We chose a RAG architecture because a standalone language model lacks access to our specific, private IPL 2026 dataset. Relying on model memory alone would lead to hallucinations regarding match scores or team records. Grounding responses in our external knowledge base ensures factual reliability and accuracy.
- Target Users and Usage Context: The primary users are sports enthusiasts and data analysts. They will use the system to query specific match outcomes or trends while viewing historical records.
- Domain Scope and Boundaries: * In-Scope: Questions about IPL 2026 match results, venues, winners, losers, and victory margins.
- Out-of-Scope: General cricket history, player contract negotiations, betting advice, or non-IPL cricket tournaments.
1.2 Knowledge Base Design and Description
-
Knowledge Source: Our knowledge base consists of a structured CSV file (
matches.csv) containing match-by-match data for the 2026 season. - Data Selection Rationale: We selected this source because it represents the official, comprehensive record of the season, ensuring trustworthiness and high domain relevance for our users.
-
Knowledge Base Characteristics: The dataset is structured, consisting of rows representing each IPL 2026 game, including attributes like
match_id,venue,winner,loser, andwin_by_runs. - Data Preparation: We standardized column names and used a Python function to convert each raw row into a semantic, natural-language sentence (e.g., "In IPL 2026, match X at Y saw A defeat B by Z runs") before vectorization. This improves retrieval quality by providing context-rich documents.
1.3 Retrieval Strategy and Methodology
-
Retrieval Method: We implemented Semantic Retrieval using Dense Embeddings via the
sentence-transformers/all-MiniLM-L6-v2model and FAISS (Facebook AI Similarity Search) for vector storage. - Retrieval Design Justification: Semantic search was chosen over keyword search to allow users to ask natural language questions (e.g., "Which team won the most at Wankhede?") instead of needing exact match IDs.
- Retrieval Challenges: We anticipate challenges with query ambiguity. We handle this by ensuring our data-to-sentence conversion process includes full team names and distinct venue identifiers to create high-quality semantic matches.
1.4 Data Augmentation and Context Enhancement
- Augmentation Methods Used: We utilized Contextual Document Formatting. Instead of raw CSV rows, we transform each row into a natural language sentence.
- Purpose: This transformation is essential because it bridges the gap between structured table data and the natural language queries that users type into the search interface, improving contextual grounding and system performance.
1.5 Generation Model Selection and Reasoning
-
Model Identification: We utilize
sentence-transformersfor embeddings and standard local processing for retrieval, compatible with LLM interfaces. - Model Selection Justification: We focused on efficiency and high instruction-following capabilities. The selected model family handles context-limited RAG prompts without excessive latency, which is crucial for our local deployment.
- Role in Workflow:
- Retrieval: Responsible for fetching the most relevant 2-3 match documents from the FAISS vector store.
- Generation: Responsible for synthesizing these facts into a clear, conversational answer.
1.6 System Prompt Design and Transparency
- Full System Prompt:
"You are an expert IPL 2026 Analyst. Use the provided match data context to answer the user's question. If the information is not present in the context, state that you do not have that specific data. Be concise and maintain a professional, analytical tone."
-
Prompt Objectives: This prompt enforces strict factual grounding, ensuring the model only uses the retrieved context. This reduces the risk of hallucinations and ensures the answer is derived directly from the
matches.csvsource.
2. Technical Setup Guide
Running the Application Locally
-
Activate Environment: Ensure your terminal shows
(venv)by running.\venv\Scripts\activate. - Run Application: Execute the following command in your project folder: streamlit run app.py
-
Access: Open the provided local URL (usually
http://localhost:8501) in your browser to interact with the assistant.