RAG-Based Multilingual News Retrieval System

Project Overview
Features
Workflow
- 1. Data Collection and Loading
- 2. Data Preprocessing
- 3. Translation
- 4. Summarization
- 5. Model Inference
- 6. Retrieval-Augmented Generation (RAG)
- 7. Similarity and Relevance Analysis
- 8. Streamlit Interactive Application
Installation
Usage Instructions
File Structure
Technologies Used
Contributors

1. Project Overview

The RAG-Based Multilingual News Retrieval System enables users to search for multilingual news articles and receive summaries of the top 5 most relevant news articles in English. This system incorporates retrieval, translation, and summarization into a unified pipeline. It leverages MBART50 for translation, SBERT embeddings for similarity, FAISS indexing for fast retrieval, and a user-friendly Streamlit-based UI for interactivity. The system ensures context-aware summaries of news articles in German, Spanish, French, Russian, Turkish, and Arabic.

2. Features

Cross-Lingual Retrieval: Users submit a query in English and receive the top 5 most relevant news articles from multilingual sources.
Multilingual Support: Supports articles in German, Spanish, French, Russian, Turkish, and Arabic.
Translation: Non-English articles are translated into English using MBART50.
Summarization: Summarized versions of retrieved articles are displayed for users.
Interactive UI: Streamlit UI allows users to submit queries, view results, and analyze similarity scores.
Context-Aware Summaries: Retrieves and summarizes articles using RAG, ensuring relevance and contextual accuracy.
Relevance Analysis: Measures similarity between the user query and retrieved articles using BERTScore, Cosine Similarity, and SBERT Score.

3. Workflow

1. Data Collection and Loading

Objective: Collect, load, and prepare multilingual data from the MLSUM dataset.
Steps:

Dataset Selection: Use the MLSUM dataset for German, Spanish, French, Russian, and Turkish.
Data Download: 700 records per language are downloaded from the training set.
Dataframe Creation: Data is stored in a Pandas DataFrame with columns:
- Text: The main content of the article.
- Summary: The reference summary of the article.
- Language: The language of the article (e.g., "de" for German).
Data Concatenation: Combine all language DataFrames into one unified DataFrame.
Dataset Conversion: Convert the DataFrame to a HuggingFace dataset format for compatibility with NLP models.

2. Data Preprocessing

Objective: Clean and prepare the text data for translation and summarization.
Steps:

Language Column Addition: Add a column specifying the language of each article.
Data Cleaning: Remove unnecessary columns (e.g., URLs) and irrelevant content.
Language Detection (Optional): Verify that the language tag of each article is correct using langdetect.

3. Translation

Objective: Translate non-English articles into English using MBART50.
Steps:

Model Selection: Use the MBART50 model for translation.
Tokenizer Initialization: Use MBart50Tokenizer with source language (src_lang) and target language (tgt_lang) set to English (en_XX).
Translation Function:
- Tokenize the input text.
- Pass the tokenized input to the MBART50 model.
- Decode the translated output.
Translation Output: Store the translated content for summarization.

4. Summarization

Objective: Summarize the translated articles using T5.
Steps:

Model Selection: Use T5 for summarization.
Summarization Function:
- Tokenize the translated text.
- Generate the summary using T5.
- Decode the summary.
Summarization Output: Store the summary for later retrieval and user interaction.

5. Model Inference

Objective: Apply translation and summarization to all articles in the dataset.
Steps:

Batch Processing: Translate and summarize each article.
Result Storage: Store the translated articles and their summaries for use in the RAG system.

6. Retrieval-Augmented Generation (RAG)

Objective: Use a RAG-based retrieval system to retrieve and summarize news articles based on a user query.
Steps:

Knowledge Base Construction:
- Index all translated articles and summaries using FAISS.
Query Embedding: Convert user query into an embedding using SBERT.
Information Retrieval:
- Retrieve the top 5 most relevant articles from the FAISS index.
RAG Model: Generate a context-aware summary using the 5 retrieved articles.
Storage of Results: Store the retrieved articles, summaries, and RAG-generated output for display in the UI.

7. Similarity and Relevance Analysis

Objective: Measure the alignment between the input query and retrieved summaries.
Steps:

BERTScore: Measures the semantic similarity between the query and summaries.
Cosine Similarity: Measures similarity using SBERT embeddings.
SBERT Score: Measures contextual alignment between the user query and system-generated summaries.

8. Streamlit Interactive Application

Objective: Allow users to interact with the system via Streamlit.
Features:

Query Submission: Users submit queries in English.
Result Display: Shows titles, summaries, and similarity scores for retrieved articles.
Configurable Settings: Users can configure dataset path, number of results, and similarity score thresholds.

4. Installation

Install Required Libraries:
```
pip install -r requirements.txt
```
Run Streamlit Application:
```
streamlit run app.py
```

5. Usage Instructions

Launch the App: Run streamlit run app.py.
Input Query: Enter an English query in the search box.
View Results: View the top 5 most relevant articles, their summaries, and similarity scores.

6. File Structure

📦 RAG-Based-Multilingual-News-Retrieval
 ┣ 📂data
 ┣ 📂models
 ┣ 📂notebooks
 ┣ 📜app.py
 ┣ 📜requirements.txt
 ┣ 📜README.md

7. Technologies Used

Python
HuggingFace Transformers (MBART50, BERT, SBERT, T5)
FAISS (Dense Vector Indexing)
Streamlit (Interactive UI)
SBERT (Sentence Embeddings)

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
Notebooks		Notebooks
WebApp		WebApp
datasets_translated_dataset_v3		datasets_translated_dataset_v3
lora_summarization_model		lora_summarization_model
translation_model		translation_model
.DS_Store		.DS_Store
.gitignore		.gitignore
Final Project Report Group 7.pdf		Final Project Report Group 7.pdf
Project_Presentation.pptx		Project_Presentation.pptx
README.md		README.md
dataset.csv		dataset.csv
datasets_translated_dataset_v3.tar		datasets_translated_dataset_v3.tar
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RAG-Based Multilingual News Retrieval System

Table of Contents

1. Project Overview

2. Features

3. Workflow

1. Data Collection and Loading

2. Data Preprocessing

3. Translation

4. Summarization

5. Model Inference

6. Retrieval-Augmented Generation (RAG)

7. Similarity and Relevance Analysis

8. Streamlit Interactive Application

4. Installation

5. Usage Instructions

6. File Structure

7. Technologies Used

About

Releases

Packages

Contributors 3

Languages

rahulodedra30/RAG-Based-Multilingual-News-Retrieval

Folders and files

Latest commit

History

Repository files navigation

RAG-Based Multilingual News Retrieval System

Table of Contents

1. Project Overview

2. Features

3. Workflow

1. Data Collection and Loading

2. Data Preprocessing

3. Translation

4. Summarization

5. Model Inference

6. Retrieval-Augmented Generation (RAG)

7. Similarity and Relevance Analysis

8. Streamlit Interactive Application

4. Installation

5. Usage Instructions

6. File Structure

7. Technologies Used

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages