Project M.A.R.S. is an intelligent chatbot application that leverages Retrieval-Augmented Generation (RAG) to provide answers based on dynamically retrieved web content. It goes a step further by comparing the RAG-generated answer with a response from a standard Large Language Model (LLM) and provides a comprehensive evaluation of both.
General-purpose Large Language Models (LLMs) have a vast but static knowledge base. Their information is limited to what they were trained on, and they can sometimes provide outdated or incorrect information (a phenomenon known as "hallucination").
M.A.R.S. overcomes these limitations by implementing a sophisticated RAG (Retrieval-Augmented Generation) pipeline, a comprehensive evaluation system, and Reinforcement Learning for continuous improvement:
Dynamic Information Retrieval (RAG Core):
- When a user submits a query, M.A.R.S. first uses a search tool (Serper API) to find relevant web pages.
- These web pages are then scraped for their content, and the extracted text is converted into numerical representations (embeddings).
- These embeddings are stored in a vector database (Pinecone), creating a dynamic, up-to-date knowledge base specific to the query.
- A powerful LLM (Groq's Llama 3.3) then generates an answer, but critically, it is "augmented" with the retrieved and embedded context from the web. This ensures the answer is grounded in current, external information.
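A minimal sketch of this ingestion-and-generation flow is shown below. It assumes the v3+ Pinecone Python client, the `all-MiniLM-L6-v2` sentence-transformer, an index named `mars-rag`, and the `llama-3.3-70b-versatile` model id on Groq; none of these names are taken verbatim from the project's scripts.

```python
# Hedged sketch of the ingestion + RAG generation steps described above.
# Index name and model id are assumptions, not the project's exact values.
import os
import requests
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone
from groq import Groq

embedder = SentenceTransformer("all-MiniLM-L6-v2")
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("mars-rag")                          # assumed index name
groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])

def search_urls(query: str, num: int = 5) -> list[str]:
    """Ask the Serper API for web pages relevant to the query."""
    resp = requests.post(
        "https://google.serper.dev/search",
        headers={"X-API-KEY": os.environ["SERPER_API_KEY"]},
        json={"q": query},
        timeout=15,
    )
    return [r["link"] for r in resp.json().get("organic", [])[:num]]

def scrape(url: str) -> str:
    """Fetch a page and keep only its visible text."""
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15).text
    return BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

def ingest(query: str) -> None:
    """Embed scraped passages and upsert them into Pinecone."""
    passages = [scrape(u)[:2000] for u in search_urls(query)]   # truncate long pages
    vectors = embedder.encode(passages).tolist()
    index.upsert(vectors=[
        {"id": f"doc-{i}", "values": v, "metadata": {"text": t}}
        for i, (v, t) in enumerate(zip(vectors, passages))
    ])

def rag_answer(query: str, top_k: int = 3) -> str:
    """Retrieve the top_k passages and generate a grounded answer with Groq."""
    res = index.query(
        vector=embedder.encode(query).tolist(), top_k=top_k, include_metadata=True
    )
    context = "\n\n".join(m.metadata["text"] for m in res.matches)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    chat = groq_client.chat.completions.create(
        model="llama-3.3-70b-versatile",               # assumed model id
        messages=[{"role": "user", "content": prompt}],
    )
    return chat.choices[0].message.content
```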
Direct LLM Response:
- Simultaneously, the same LLM generates an answer to the user's query without any external context (i.e., a pure LLM response).
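The context-free answer can be sketched the same way; again, the Groq model id is an assumption.

```python
# Hedged sketch of the pure LLM answer: same model, no retrieved context.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def llm_answer(query: str) -> str:
    """Generate an answer from the LLM alone, with no external context."""
    chat = client.chat.completions.create(
        model="llama-3.3-70b-versatile",   # assumed model id
        messages=[{"role": "user", "content": query}],
    )
    return chat.choices[0].message.content
```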
Comprehensive Evaluation:
- Both the RAG-augmented answer and the pure LLM answer are then subjected to a rigorous, multi-metric evaluation process:
- Semantic Similarity: Measures how semantically similar the two answers are to each other using Sentence-Transformers.
- BERTScore: Evaluates the quality of the RAG answer against the LLM answer (or vice-versa) by comparing their contextual embeddings, providing precision, recall, and F1 scores.
- Factual Accuracy (QA-based): A Question-Answering (QA) model (e.g., RoBERTa-base-squad2) is used to assess if key facts from the original query are present and correctly addressed in each answer.
- Judge Model Comparison: A powerful LLM acts as an impartial judge, comparing both answers qualitatively, providing a justification, and declaring a "winner" based on overall quality, relevance, and completeness.
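A condensed sketch of how these metrics could be computed with the libraries listed in the tech stack is shown below; the embedding model, the `deepset/roberta-base-squad2` checkpoint, and the returned field names are assumptions rather than the project's exact implementation, and the judge-model step is omitted for brevity.

```python
# Hedged sketch of the multi-metric evaluation step.
from bert_score import score as bert_score
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

embedder = SentenceTransformer("all-MiniLM-L6-v2")
qa_model = pipeline("question-answering", model="deepset/roberta-base-squad2")

def evaluate(query: str, rag_answer: str, llm_answer: str) -> dict:
    # Semantic similarity between the two answers (cosine of sentence embeddings).
    sim = util.cos_sim(embedder.encode(rag_answer), embedder.encode(llm_answer)).item()

    # BERTScore of the RAG answer against the LLM answer (precision, recall, F1).
    p, r, f1 = bert_score([rag_answer], [llm_answer], lang="en")

    # QA-based factual check: can a QA model answer the query from each answer?
    rag_fact = qa_model(question=query, context=rag_answer)["score"]
    llm_fact = qa_model(question=query, context=llm_answer)["score"]

    return {
        "semantic_similarity": sim,
        "bertscore_precision": p.item(),
        "bertscore_recall": r.item(),
        "bertscore_f1": f1.item(),
        "rag_factual_score": rag_fact,
        "llm_factual_score": llm_fact,
    }
```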
Reinforcement Learning (RL) for Optimization:
- The evaluation metrics are combined into a scalar reward signal.
- A basic RL agent learns from these rewards, dynamically adjusting RAG parameters (e.g., the `top_k` value for document retrieval) to continuously improve the accuracy and quality of the RAG-generated answers over time.
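The sketch below illustrates one simple way such an agent could work: an epsilon-greedy choice over a small set of `top_k` values, with past rewards persisted to `reward_memory.json`. The action space, reward weights, and strategy are illustrative assumptions, not the exact contents of `rl_agent.py`.

```python
# Hedged sketch of reward-driven top_k selection (epsilon-greedy bandit).
import json
import random
from pathlib import Path

MEMORY_FILE = Path("reward_memory.json")
TOP_K_ACTIONS = [2, 3, 5, 8]   # assumed action space

def scalar_reward(metrics: dict) -> float:
    """Collapse the evaluation metrics into one reward signal (assumed weights)."""
    return (0.5 * metrics["rag_factual_score"]
            + 0.3 * metrics["bertscore_f1"]
            + 0.2 * metrics["semantic_similarity"])

def choose_top_k(epsilon: float = 0.2) -> int:
    """Usually pick the best-known top_k, occasionally explore a random one."""
    memory = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    if not memory or random.random() < epsilon:
        return random.choice(TOP_K_ACTIONS)
    # Pick the action with the highest average past reward.
    return int(max(memory, key=lambda k: sum(memory[k]) / len(memory[k])))

def learn(top_k: int, reward: float) -> None:
    """Record the observed reward for the chosen top_k."""
    memory = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    memory.setdefault(str(top_k), []).append(reward)
    MEMORY_FILE.write_text(json.dumps(memory))
```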
This multi-faceted approach ensures that M.A.R.S. not only provides context-specific, up-to-date answers that are less prone to hallucination, but also offers transparency into the performance differences between RAG and pure LLM approaches, and actively learns to optimize its own performance.
Here is the detailed flow of how the application works:
```mermaid
graph TD
subgraph Frontend
UI[User Query]
Display[Display Comparison and Metrics]
end
subgraph Backend
API[POST /api/rag/compare]
end
subgraph Python_Pipeline
Start[Start]
RL_Act[RL Agent Choose top_k]
subgraph Ingestion
Serper[Serper API Search]
Scrape[Web Scraping]
Embed[Embed and Upsert to Pinecone]
end
subgraph Generation
RAG_Gen[RAG Answer Groq plus Context]
LLM_Gen[Pure LLM Answer Groq]
end
Eval[Comprehensive Evaluation]
RL_Learn[RL Agent Learn from Reward]
Result[Return JSON]
end
UI --> API
API -->|Spawn Process| Start
Start --> RL_Act
RL_Act --> Ingestion
Ingestion -->|Context| RAG_Gen
Start -->|Query| LLM_Gen
RAG_Gen --> Eval
LLM_Gen --> Eval
Eval --> RL_Learn
RL_Learn --> Result
Result -->|JSON Output| API
API -->|Response| Display
```
- User Authentication: Secure user login and registration using JWT.
- RAG vs. LLM Comparison: The core feature that provides a side-by-side comparison of answers from a RAG model and a standard LLM.
- Comprehensive Evaluation: A sophisticated evaluation pipeline that uses multiple metrics to assess the quality of the generated answers.
- Reinforcement Learning Integration: A basic RL agent learns from evaluation rewards to dynamically optimize RAG parameters (e.g., `top_k` for retrieval), aiming to improve accuracy over time.
- Visual Comparison: A graph to visually compare the performance of the two models.
- Professional UI/UX: Redesigned login and signup pages for a modern and professional user experience.
- Responsive UI: A clean and simple chat interface.
- Frontend: React, Vite, Axios, Recharts
- Backend: Node.js, Express, Mongoose, JWT
- Python: Sentence-transformers, Pinecone, Groq, BERT-Score, Transformers
```
.
├── backend
│ ├── db.js
│ ├── docker.yaml
│ ├── index.js
│ ├── middleware
│ │ └── fetchuser.js
│ ├── models
│ │ ├── Score.js
│ │ └── User.js
│ ├── node
│ │ ├── comprehensive_evaluate.py
│ │ ├── embed_and_upload.py
│ │ ├── rag_query.py
│ │ ├── rag_query_compare.py
│ │ ├── reward_memory.json
│ │ ├── scraped_data.json
│ │ ├── searchurl.py
│ │ ├── webscrap.py
│ │ └── rl_agent.py
│ ├── package.json
│ └── routes
│ ├── auth.js
│ └── rag.js
├── public
│ └── vite.svg
├── src
│ ├── assets
│ │ └── react.svg
│ ├── components
│ │ ├── ChatArea.jsx
│ │ ├── ComparisonGraph.jsx
│ │ ├── ComparisonMessage.jsx
│ │ ├── EvaluationDetails.jsx
│ │ ├── Header.jsx
│ │ ├── Message.jsx
│ │ ├── MessageInput.jsx
│ │ ├── Shimmer.jsx
│ │ └── Sidebar.jsx
│ ├── pages
│ │ ├── LoginPage.jsx
│ │ └── SignupPage.jsx
│ ├── App.css
│ ├── App.jsx
│ ├── index.css
│ └── main.jsx
├── .gitignore
├── eslint.config.js
├── index.html
├── package.json
├── README.md
└── vite.config.js
```
- `POST /api/auth/signup`: Create a new user.
- `POST /api/auth/login`: Log in a user.
- `GET /api/auth/getuser`: Get the logged-in user's data.
- `POST /api/rag/query`: Get a response from the RAG model (not used in the current UI).
- `POST /api/rag/compare`: Get a comparison between the RAG and LLM models.
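A hedged example of calling the comparison endpoint from Python; the port, auth header name, request body, and response field names are assumptions about the API's shape.

```python
# Illustrative call to the comparison endpoint; field names are assumptions.
import requests

resp = requests.post(
    "http://localhost:5000/api/rag/compare",      # assumed local port
    headers={"auth-token": "<jwt_from_login>"},   # assumed auth header name
    json={"query": "What is Retrieval-Augmented Generation?"},
    timeout=300,                                  # the pipeline can take a while
)
result = resp.json()
print(result.get("rag_answer"))
print(result.get("llm_answer"))
print(result.get("evaluation"))
```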
- `rag_query_compare.py`: The main script that orchestrates the comparison pipeline. It takes a user query, generates RAG and LLM answers, and calls the evaluation script. It also interacts with `rl_agent.py` to choose optimal RAG parameters and learn from the results.
- `comprehensive_evaluate.py`: Performs a comprehensive evaluation of the two answers using various metrics and calculates a scalar reward for the RAG answer.
- `rl_agent.py`: Implements a basic Reinforcement Learning agent that learns from past evaluation rewards to dynamically select optimal RAG parameters (e.g., `top_k` for document retrieval).
- `rag_query.py`: A script for querying the RAG model (not used in the current comparison pipeline).
- `embed_and_upload.py`: A utility script to embed and upload data to the Pinecone vector database.
- `searchurl.py`: A utility script to search for URLs based on a query using the Serper API.
- `webscrap.py`: A utility script to scrape the content of a webpage.
This project has undergone several significant updates and improvements:
- Reinforcement Learning (RL) Integration: A foundational RL system has been integrated. This includes:
  - A comprehensive reward function in `comprehensive_evaluate.py` that calculates a scalar reward for RAG answers based on faithfulness, factual accuracy, completeness, clarity, and semantic similarity to the query.
  - An action space defined by the `top_k` parameter for Pinecone retrieval.
  - A basic `rl_agent.py` that learns from past rewards to dynamically choose `top_k` values, aiming to optimize RAG performance over time.
- UI/UX Enhancements:
  - The Login and Signup pages (`LoginPage.jsx`, `SignupPage.jsx`) have been professionally redesigned with a modern, dark-themed, and responsive interface.
- Feature Removal:
  - The entire chat history feature (frontend pages, backend routes, and database models) has been completely removed to streamline the application.
- Bug Fixes and Stability:
  - Resolved multiple `NameError` issues in Python scripts (`comprehensive_evaluate.py`) to ensure stable execution.
  - Implemented user-agent headers in `webscrap.py` to improve web scraping success rates and mitigate 403 Forbidden errors (though some sites may still block access).
  - Fixed malformed JSON output from `rag_query_compare.py` by ensuring proper flushing of `stdout`, resolving parsing errors in the Node.js backend.
  - Addressed chat saving validation errors by updating the backend `MessageSchema` and frontend `ChatArea.jsx` to correctly handle comparison messages (though chat saving is now removed with the history feature).
- Node.js
- Python 3.10+
- MongoDB
- Navigate to the `backend` directory:
  ```bash
  cd backend
  ```
- Install Node.js dependencies:
  ```bash
  npm install
  ```
- Install Python dependencies:
  ```bash
  pip install -r requirements.txt
  ```
  (Note: You may need to create a `requirements.txt` file first. See the "Python Dependencies" section below.)
- Create a `.env` file in the `backend` directory and add the following environment variables:
  ```
  MONGO_URI=<your_mongodb_uri>
  JWT_SECRET=<your_jwt_secret>
  PINECONE_API_KEY=<your_pinecone_api_key>
  GROQ_API_KEY=<your_groq_api_key>
  SERPER_API_KEY=<your_serper_api_key>
  ```
- Start the backend server:
  ```bash
  node index.js
  ```
- Navigate to the root directory.
- Install Node.js dependencies:
  ```bash
  npm install
  ```
- Start the frontend development server:
  ```bash
  npm run dev
  ```
The Python scripts require the following libraries. You can install them using pip:
```bash
pip install sentence-transformers pinecone-client groq bert-score torch transformers beautifulsoup4 requests python-dotenv
```

You can also create a `requirements.txt` file with the following content and run `pip install -r requirements.txt`:

```
sentence-transformers
pinecone-client
groq
bert-score
torch
transformers
beautifulsoup4
requests
python-dotenv
```