The project is a study on the use of generative AI to improve the services of SSC-ICT by supporting employees and optimizing internal processes. The original focus is on generative large language models (LLMs), in the form of Retrieval-Augmented Generation (RAG), because they can have the most significant impact on the daily work of SSC-ICT employees. This version dives deeper into the retrieval part of RAG. The original version can be found here.
This version serves as part of the Master Thesis of Nicky Ju.
The paper corresponding to this repository can be found in the TU Delft Repository.
Filenames starting with:
- `create` --> create evaluation files with specific preprocessing
- `evaluate` --> run queries on the vector database/corpus
- `ingest` --> create the vector database/corpus
- `preprocess` --> preprocess the data in different ways before creating the database
- `relevance` --> (re-)evaluate the results
This guide assumes that you are familiar with the basics of Python (such as setting up an environment and installing packages).
- First steps
  - Have your data dump downloaded from Woogle: Dump of 19/04/2024 or Daily updated dump (password protected).
  - Merge all the data using `merge_woo.ipynb` (a sketch of this step follows this list).
  - Create evaluation files with `create_evaluation_file.py` or with `create_evaluation_file_keywords_paraphrase.ipynb`.
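The merge step combines the separate files from the Woogle dump into a single dataset. A minimal sketch of what `merge_woo.ipynb` does, assuming the dump is extracted as a folder of CSV files; the folder path, file layout, and column names are assumptions and depend on the dump you downloaded:

```python
from pathlib import Path

import pandas as pd

# Hypothetical location of the extracted Woogle dump; adjust to your setup.
DUMP_DIR = Path("data/woogle_dump")

# Read every CSV in the dump folder and concatenate them into one DataFrame.
frames = [pd.read_csv(path) for path in sorted(DUMP_DIR.glob("*.csv"))]
merged = pd.concat(frames, ignore_index=True)

# Persist the merged data for the later preprocessing and ingestion steps.
merged.to_csv("data/woo_merged.csv", index=False)
print(f"Merged {len(frames)} files into {len(merged)} rows")
```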
- Preprocess Data
  - Run `preprocess_real_words.py` or `preprocess_stem_stopwords.py` to preprocess the data in different ways (see the sketch after this list).
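As an illustration of the stem/stopword variant, a minimal sketch assuming Dutch-language documents and NLTK's Dutch stopword list and Snowball stemmer; the actual scripts may use different tooling or additional cleaning steps:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# Download the Dutch stopword list once (no-op if already present).
nltk.download("stopwords", quiet=True)

STOPWORDS = set(stopwords.words("dutch"))
STEMMER = SnowballStemmer("dutch")

def preprocess_stem_stopwords(text: str) -> str:
    """Lowercase, strip punctuation, drop stopwords, and stem the remaining tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(STEMMER.stem(tok) for tok in tokens if tok not in STOPWORDS)

print(preprocess_stem_stopwords("De documenten worden opgevraagd bij het ministerie."))
```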
- Database creation
  - Create Vector Store with `ingest_embeddings.py`.
  - Create BM25 Corpus with `ingest_bm25.py` (a sketch of the BM25 step follows this list).
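For the BM25 side, a minimal sketch using the `rank_bm25` package; the input path, column name, and output location are assumptions, and the repository's script may build the corpus differently:

```python
import pickle

import pandas as pd
from rank_bm25 import BM25Okapi

# Load the preprocessed documents (hypothetical path and column name).
docs = pd.read_csv("data/woo_preprocessed.csv")
texts = docs["text"].fillna("").tolist()

# BM25 works on tokenised documents; whitespace tokenisation suffices here
# because the preprocessing step already lowercases and strips punctuation.
tokenized = [text.split() for text in texts]
bm25 = BM25Okapi(tokenized)

# Store the corpus so the evaluation step can load it without re-indexing.
with open("data/bm25_corpus.pkl", "wb") as f:
    pickle.dump(bm25, f)
```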
- Evaluation
  - Run the evaluation files on the vector store/BM25 corpus with `evaluate_bm25.py` or `evaluate_embeddings.py` (see the sketch after this list).
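Conceptually, the evaluation scripts run each query from the evaluation file against the index and keep the top-ranked documents. A minimal BM25 sketch, assuming the pickled corpus from the previous step and a hypothetical query file with a `query` column:

```python
import pickle

import pandas as pd

# Load the BM25 corpus built in the ingestion step and the evaluation queries
# (hypothetical paths and column names).
with open("data/bm25_corpus.pkl", "rb") as f:
    bm25 = pickle.load(f)
queries = pd.read_csv("data/evaluation_queries.csv")

K = 10  # number of documents to retrieve per query

results = {}
for _, row in queries.iterrows():
    # Apply the same tokenisation as at ingestion time.
    scores = bm25.get_scores(row["query"].split())
    top_k = scores.argsort()[::-1][:K]
    results[row["query"]] = top_k.tolist()
```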
- Evaluation metrics
  - `relevance_evaluation.ipynb` to calculate basic metrics like precision and recall.
  - `relevance_dossier_average.ipynb` for frequency-based metrics.
  - `relevance_dossier_MAP.ipynb` for weighted frequency-based metrics (a sketch of the underlying metrics follows this list).
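As a reference for these metrics, a minimal sketch (not the notebooks' exact implementation) of per-query precision, recall, and average precision, the building block of MAP; the document ids are hypothetical:

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision and recall of a retrieved ranking for a single query."""
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Average precision for a single query; the mean over all queries gives MAP."""
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

# Toy example with hypothetical document ids.
retrieved = ["d1", "d7", "d3", "d9"]
relevant = {"d1", "d3", "d4"}
print(precision_recall(retrieved, relevant))   # (0.5, 0.666...)
print(average_precision(retrieved, relevant))  # (1/1 + 2/3) / 3 ≈ 0.556
```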