- Project Overview
- Features
- Workflow
  - 1. Data Collection and Loading
  - 2. Data Preprocessing
  - 3. Translation
  - 4. Summarization
  - 5. Model Inference
  - 6. Retrieval-Augmented Generation (RAG)
  - 7. Similarity and Relevance Analysis
  - 8. Streamlit Interactive Application
- Installation
- Usage Instructions
- File Structure
- Technologies Used
- Contributors
The RAG-Based Multilingual News Retrieval System lets users search multilingual news with an English query and returns English summaries of the 5 most relevant articles. It combines retrieval, translation, and summarization in a single pipeline: MBART50 for translation, T5 for summarization, SBERT embeddings for similarity, FAISS indexing for fast retrieval, and a Streamlit UI for interaction. Summaries are generated from the retrieved articles themselves, which may be written in German, Spanish, French, Russian, Turkish, or Arabic.
- Cross-Lingual Retrieval: Users submit a query in English and receive the top 5 most relevant news articles from multilingual sources.
- Multilingual Support: Supports articles in German, Spanish, French, Russian, Turkish, and Arabic.
- Translation: Non-English articles are translated into English using MBART50.
- Summarization: Summarized versions of retrieved articles are displayed for users.
- Interactive UI: Streamlit UI allows users to submit queries, view results, and analyze similarity scores.
- Context-Aware Summaries: Retrieves and summarizes articles using RAG, ensuring relevance and contextual accuracy.
- Relevance Analysis: Measures similarity between the user query and retrieved articles using BERTScore, Cosine Similarity, and SBERT Score.
Objective: Collect, load, and prepare multilingual data from the MLSUM dataset.
Steps:
- Dataset Selection: Use the MLSUM dataset for German, Spanish, French, Russian, and Turkish.
- Data Download: 700 records per language are downloaded from the training set.
- DataFrame Creation: Data is stored in a pandas DataFrame with columns:
  - Text: The main content of the article.
  - Summary: The reference summary of the article.
  - Language: The language of the article (e.g., "de" for German).
- Data Concatenation: Combine all language DataFrames into one unified DataFrame.
- Dataset Conversion: Convert the DataFrame to a HuggingFace dataset format for compatibility with NLP models.
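The loading steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual loader: the dataset id `mlsum`, the config names in `LANGUAGES`, and the helper names `to_rows`, `load_language`, and `build_dataset` are assumptions, and depending on the installed `datasets` version the dataset may need a different repository id or extra loading options.

```python
from typing import Dict, Iterable, List

# Assumed MLSUM config names on the Hugging Face Hub ("tu" is Turkish there).
LANGUAGES = ["de", "es", "fr", "ru", "tu"]
RECORDS_PER_LANGUAGE = 700


def to_rows(records: Iterable[Dict[str, str]], language: str) -> List[Dict[str, str]]:
    """Keep only the article text and reference summary, and tag the language."""
    return [
        {"Text": r["text"], "Summary": r["summary"], "Language": language}
        for r in records
    ]


def load_language(language: str, n: int = RECORDS_PER_LANGUAGE) -> List[Dict[str, str]]:
    """Download the first n training records for one language (needs network)."""
    from datasets import load_dataset  # imported lazily; third-party dependency

    split = load_dataset("mlsum", language, split=f"train[:{n}]")
    return to_rows(split, language)


def build_dataset(languages: List[str] = LANGUAGES):
    """Concatenate the per-language frames and convert to a Hugging Face Dataset."""
    import pandas as pd
    from datasets import Dataset

    frames = [pd.DataFrame(load_language(lang)) for lang in languages]
    df = pd.concat(frames, ignore_index=True)
    return Dataset.from_pandas(df)
```

Keeping `to_rows` separate from the download makes the column mapping easy to check on a handful of hand-written records before fetching any data.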
Objective: Clean and prepare the text data for translation and summarization.
Steps:
- Language Column Addition: Add a column specifying the language of each article.
- Data Cleaning: Remove unnecessary columns (e.g., URLs) and irrelevant content.
- Language Detection (Optional): Verify that the language tag of each article is correct using langdetect.
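A possible shape for the cleaning and optional language check is shown below. The column set in `KEEP_COLUMNS`, the whitespace normalization, and the function names are illustrative assumptions; the source does not specify which cleaning rules were applied.

```python
import re
from typing import Dict

# Columns to retain; everything else (e.g. a "url" field) is dropped.
KEEP_COLUMNS = ("Text", "Summary", "Language")


def clean_row(row: Dict[str, str]) -> Dict[str, str]:
    """Drop unneeded columns and collapse runs of whitespace in the text fields."""
    cleaned = {key: row[key] for key in KEEP_COLUMNS}
    for key in ("Text", "Summary"):
        cleaned[key] = re.sub(r"\s+", " ", cleaned[key]).strip()
    return cleaned


def language_matches(text: str, expected_code: str) -> bool:
    """Optional check: does langdetect agree with the stored language tag?

    langdetect returns ISO 639-1 codes such as "de" or "tr", so tags that use
    another scheme need to be mapped before comparing.
    """
    from langdetect import detect  # imported lazily; third-party dependency
    from langdetect.lang_detect_exception import LangDetectException

    try:
        return detect(text) == expected_code
    except LangDetectException:
        # Raised when the text has no detectable features (e.g. empty string).
        return False
```

Detection on very short texts can be unreliable, which is one reason to treat this check as a sanity filter rather than a hard requirement.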
Objective: Translate non-English articles into English using MBART50.
Steps:
- Model Selection: Use the MBART50 model for translation.
- Tokenizer Initialization: Use MBart50Tokenizer with the source language (`src_lang`) set to the article's language and the target language (`tgt_lang`) set to English (`en_XX`).
- Translation Function:
  - Tokenize the input text.
  - Pass the tokenized input to the MBART50 model.
  - Decode the translated output.
- Translation Output: Store the translated content for summarization.
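The translation steps above might look like the following sketch. The checkpoint name in `MODEL_NAME`, the truncation length, and the helper names are assumptions; the source only states that MBART50 and MBart50Tokenizer are used. English output is requested by forcing the first generated token to the `en_XX` language code.

```python
# Mapping from article language tags to MBART50 language codes.
# "tu" is included as an alias because MLSUM names its Turkish config that way.
MBART_LANG_CODES = {
    "de": "de_DE",
    "es": "es_XX",
    "fr": "fr_XX",
    "ru": "ru_RU",
    "tr": "tr_TR",
    "tu": "tr_TR",
    "ar": "ar_AR",
}

# Assumed checkpoint; the project may use a different MBART50 variant.
MODEL_NAME = "facebook/mbart-large-50-many-to-many-mmt"


def load_translator(model_name: str = MODEL_NAME):
    """Load tokenizer and model (downloads weights on first use)."""
    from transformers import MBart50Tokenizer, MBartForConditionalGeneration

    tokenizer = MBart50Tokenizer.from_pretrained(model_name)
    model = MBartForConditionalGeneration.from_pretrained(model_name)
    return tokenizer, model


def translate_to_english(text: str, language: str, tokenizer, model,
                         max_length: int = 512) -> str:
    """Tokenize, generate with English as the forced target language, decode."""
    tokenizer.src_lang = MBART_LANG_CODES[language]
    inputs = tokenizer(text, return_tensors="pt", truncation=True,
                       max_length=max_length)
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("en_XX"),
        max_length=max_length,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

Long articles are truncated to `max_length` tokens here; splitting them into chunks before translation would be an alternative if full coverage matters.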
Objective: Summarize the translated articles using T5.
Steps:
- Model Selection: Use T5 for summarization.
- Summarization Function:
  - Tokenize the translated text.
  - Generate the summary using T5.
  - Decode the summary.
- Summarization Output: Store the summary for later retrieval and user interaction.
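A hedged sketch of the summarization function follows. The checkpoint `t5-small`, the generation settings, and the helper names are assumptions chosen for illustration; the source names only T5. The original T5 checkpoints expect a task prefix, which is why the input is prepended with `summarize: `.

```python
# Assumed checkpoint; any T5 variant fine-tuned for summarization could be used.
T5_MODEL_NAME = "t5-small"


def build_t5_input(text: str) -> str:
    """Normalize whitespace and add the task prefix T5 uses for summarization."""
    return "summarize: " + " ".join(text.split())


def load_summarizer(model_name: str = T5_MODEL_NAME):
    """Load tokenizer and model (downloads weights on first use)."""
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    return tokenizer, model


def summarize(text: str, tokenizer, model,
              max_input_tokens: int = 512, max_summary_tokens: int = 80) -> str:
    """Tokenize the translated article, generate a summary, and decode it."""
    inputs = tokenizer(build_t5_input(text), return_tensors="pt",
                       truncation=True, max_length=max_input_tokens)
    output_ids = model.generate(**inputs, max_new_tokens=max_summary_tokens,
                                num_beams=4, early_stopping=True)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```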
Objective: Apply translation and summarization to all articles in the dataset.
Steps:
- Batch Processing: Translate and summarize each article.
- Result Storage: Store the translated articles and their summaries for use in the RAG system.
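One way to apply both steps to every article is a small driver that takes the translation and summarization functions as arguments. The function name `process_articles` and the output keys `Translated` and `GeneratedSummary` are hypothetical; they are not taken from the project.

```python
from typing import Callable, Dict, Iterable, List


def process_articles(
    rows: Iterable[Dict[str, str]],
    translate: Callable[[str, str], str],
    summarize: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Translate and summarize each article, returning new rows with the results.

    `translate(text, language)` and `summarize(text)` are injected so the loop
    can be exercised with stubs before the real models are loaded.
    """
    results = []
    for row in rows:
        translated = translate(row["Text"], row["Language"])
        results.append({
            **row,
            "Translated": translated,
            "GeneratedSummary": summarize(translated),
        })
    return results
```

With stand-in functions the loop can be checked without any model:

```python
demo = process_articles(
    [{"Text": "hallo", "Summary": "s", "Language": "de"}],
    translate=lambda text, lang: text.upper(),
    summarize=lambda text: text[:3],
)
```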
Objective: Use a RAG-based retrieval system to retrieve and summarize news articles based on a user query.
Steps:
- Knowledge Base Construction:
  - Index all translated articles and summaries using FAISS.
- Query Embedding: Convert user query into an embedding using SBERT.
- Information Retrieval:
  - Retrieve the top 5 most relevant articles from the FAISS index.
- RAG Model: Generate a context-aware summary using the 5 retrieved articles.
- Storage of Results: Store the retrieved articles, summaries, and RAG-generated output for display in the UI.
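The retrieval steps above can be sketched as below. The SBERT checkpoint name, the choice of a flat inner-product index over normalized embeddings, and the prompt layout in `build_rag_prompt` are all assumptions; the source does not say which FAISS index type or which generator is used for the final RAG summary.

```python
from typing import List, Tuple

# Assumed SBERT checkpoint, loaded with:
#   from sentence_transformers import SentenceTransformer
#   encoder = SentenceTransformer(SBERT_MODEL_NAME)
SBERT_MODEL_NAME = "all-MiniLM-L6-v2"


def build_index(texts: List[str], encoder):
    """Embed the translated articles and store them in a FAISS index."""
    import faiss  # imported lazily; third-party dependency

    embeddings = encoder.encode(texts, convert_to_numpy=True,
                                normalize_embeddings=True).astype("float32")
    # With unit-length vectors, inner product equals cosine similarity.
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)
    return index


def retrieve(query: str, index, encoder, k: int = 5) -> List[Tuple[int, float]]:
    """Return (article position, similarity score) for the k nearest articles."""
    query_vec = encoder.encode([query], convert_to_numpy=True,
                               normalize_embeddings=True).astype("float32")
    scores, ids = index.search(query_vec, k)
    # FAISS pads with -1 when fewer than k vectors are indexed.
    return [(int(i), float(s)) for i, s in zip(ids[0], scores[0]) if i != -1]


def build_rag_prompt(query: str, articles: List[str]) -> str:
    """Combine the query and retrieved articles into one generator input."""
    context = "\n\n".join(f"[{n}] {a}" for n, a in enumerate(articles, start=1))
    return f"Query: {query}\n\nArticles:\n{context}\n\nSummary:"
```

The prompt produced by `build_rag_prompt` would then be passed to whichever sequence-to-sequence model generates the context-aware summary.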
Objective: Measure the alignment between the input query and retrieved summaries.
Steps:
- BERTScore: Measures the semantic similarity between the query and summaries.
- Cosine Similarity: Measures similarity using SBERT embeddings.
- SBERT Score: Measures contextual alignment between the user query and system-generated summaries.
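For reference, cosine similarity between two embedding vectors is their dot product divided by the product of their lengths. The sketch below implements it in plain Python on lists of floats; in the pipeline the vectors would be SBERT embeddings of the query and each summary. The `bertscore_f1` helper assumes the third-party `bert_score` package and treats the query as the reference text, which is an illustrative choice rather than something the source specifies.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their Euclidean norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: Sequence[float],
                       summary_vecs: List[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(summary_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def bertscore_f1(query: str, summaries: List[str]) -> List[float]:
    """BERTScore F1 of each summary against the query (downloads a model)."""
    from bert_score import score  # imported lazily; third-party dependency

    _, _, f1 = score(summaries, [query] * len(summaries), lang="en")
    return f1.tolist()
```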
Objective: Allow users to interact with the system via Streamlit.
Features:
- Query Submission: Users submit queries in English.
- Result Display: Shows titles, summaries, and similarity scores for retrieved articles.
- Configurable Settings: Users can configure dataset path, number of results, and similarity score thresholds.
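A minimal Streamlit layout covering the features above might look like this. The `search` callable, the result keys `title`, `summary`, and `score`, and the default widget values are hypothetical; they stand in for whatever the project's `app.py` actually wires together.

```python
from typing import Callable, Dict, List


def filter_results(results: List[Dict], min_score: float, top_k: int) -> List[Dict]:
    """Keep results at or above the threshold, best first, limited to top_k."""
    kept = [r for r in results if r["score"] >= min_score]
    kept.sort(key=lambda r: r["score"], reverse=True)
    return kept[:top_k]


def main(search: Callable[[str, str], List[Dict]]) -> None:
    """Render the UI; `search(query, dataset_path)` is a hypothetical backend."""
    import streamlit as st  # imported lazily; third-party dependency

    st.title("Multilingual News Retrieval")

    # Configurable settings live in the sidebar.
    dataset_path = st.sidebar.text_input("Dataset path", "data/")
    top_k = st.sidebar.slider("Number of results", 1, 10, 5)
    min_score = st.sidebar.slider("Minimum similarity", 0.0, 1.0, 0.0, 0.05)

    query = st.text_input("Enter an English query")
    if st.button("Search") and query:
        for item in filter_results(search(query, dataset_path), min_score, top_k):
            st.subheader(item["title"])
            st.write(item["summary"])
            st.caption(f"Similarity: {item['score']:.3f}")
```

In a real `app.py`, `main` would be called at module level with the retrieval function, since Streamlit reruns the script from top to bottom on every interaction.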
- Install Required Libraries: `pip install -r requirements.txt`
- Run Streamlit Application: `streamlit run app.py`
- Launch the App: Run `streamlit run app.py`.
- Input Query: Enter an English query in the search box.
- View Results: View the top 5 most relevant articles, their summaries, and similarity scores.
📦 RAG-Based-Multilingual-News-Retrieval
┣ 📂data
┣ 📂models
┣ 📂notebooks
┣ 📜app.py
┣ 📜requirements.txt
┗ 📜README.md
- Python
- HuggingFace Transformers (MBART50, BERT, SBERT, T5)
- FAISS (Dense Vector Indexing)
- Streamlit (Interactive UI)
- SBERT (Sentence Embeddings)