From 73947cba4279fc13fe5135cf2375ba5cc99ad915 Mon Sep 17 00:00:00 2001 From: Sanchit Vijay Date: Tue, 23 Apr 2024 19:11:50 -0400 Subject: [PATCH 1/8] updated main readme --- README.md | 234 ++---------------------------------------------------- 1 file changed, 7 insertions(+), 227 deletions(-) diff --git a/README.md b/README.md index 310eceb..e9eec40 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ -# GRAG (note: specify the abbreviation) +# GRAG - Good RAG ![GitHub License](https://img.shields.io/github/license/arjbingly/Capstone_5) -![Linting](https://img.shields.io/github/actions/workflow/status/arjbingly/Capstone_5/sphinx-gitpg.yml?label=Docs&labelColor=yellow) +![Linting](https://img.shields.io/github/actions/workflow/status/arjbingly/Capstone_5/sphinx-gitpg.yml?label=Docs) ![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/arjbingly/Capstone_5/build_linting.yml?label=Linting) ![Static Badge](https://img.shields.io/badge/Tests-failing-red) ![Static Badge](https://img.shields.io/badge/docstring%20style-google-yellow) @@ -10,7 +10,10 @@ ![Static Badge](https://img.shields.io/badge/codestyle-pyflake-purple?labelColor=white) ![GitHub Issues or Pull Requests](https://img.shields.io/github/issues-pr/arjbingly/Capstone_5) -(note: add overview on what the purpose of this project is here. Talk briefly about RAG. Maybe copy from the proposal) + +[GRAG](https://arjbingly.github.io/Capstone_5/) is a simple python package that provides an easy end-to-end solution for implementing Retrieval Augmented Generation (RAG). + +The package offers an easy way to run various LLMs locally, thanks to LlamaCpp, and also supports vector stores like Chroma and DeepLake. It also makes it easy to integrate support for other vector stores.
Diagram of a basic RAG pipeline @@ -24,19 +27,7 @@ - [Getting Started](#getting-started) - [Requirements](#requirements) - [LLM Models](#llm-models) - - [Data](#data) - [Supported Vector Databases](#supported-vector-databases) - - [Embeddings](#embeddings) - - [Data Ingestion](#data-ingestion) -- [Main Features](#main-features) - - [1. PDF Parser](#1-pdf-parser) - - [2. Multi-Vector Retriever](#2-multi-vector-retriever) - - [3. BasicRAG](#3-basicrag) -- [GUI](#gui) - - [1. Retriever GUI](#1-retriever-gui) - - [2. BasicRAG GUI](#2-basicrag-gui) -- [Demo](#demo) -- [Repo Structure](#repo-structure) ## Project Overview @@ -61,8 +52,6 @@ Further customization can be made on the config file, `src/config.ini`. Required packages to install includes (_refer to [pyproject.toml](pyproject.toml)_): -Required packages to install includes (_refer to [pyproject.toml](pyproject.toml)_): - - PyTorch - LangChain - Chroma @@ -86,18 +75,6 @@ For more details, go to [.\llm_quantize\readme.md](.\llm_quantize\readme.md) Refer to [llama.cpp](https://github.com/ggerganov/llama.cpp) Supported Models (under Description) for list of compatible models. -### Data - -Any PDF can be used for this project. We personally tested the project using ArXiv papers. Refer [ArXiv Bulk Data](https://info.arxiv.org/help/bulk_data/index.html) for -details on how to download. - -``` -├── data -│ ├── pdf -``` - -**Make sure to specify `data_path` under `data` in `src/config.ini`** - ### Supported Vector Databases **1. [Chroma](https://www.trychroma.com)** @@ -109,202 +86,5 @@ Since Chroma is a server-client based vector database, make sure to run the serv **2. [Deeplake](https://www.deeplake.ai/)** -#### Embeddings - -- By default, the embedding model is `instructor-xl`. Can be changed by changing `embedding_type` and `embedding_model` - in `src/config.ini'. Any huggingface embeddings can be used. 
- -### Data Ingestion - -For ingesting data to the vector db: - -``` -client = DeepLakeClient() # Any vectordb client -retriever = Retriever(vectordb=client) - - -dir_path = Path(__file__).parents[2] # path to folder containing pdf files - - -retriever.ingest(dir_path) -``` - -Refer to ['cookbook/basicRAG/BasicRAG_ingest'](./cookbook/basicRAG/BasicRAG_ingest) - ---- - -## Main Features - -### 1. PDF Parser - -(note: need to rewrite this. Under contruction: test suites and documentation for every iteration) -- The pdf parser is implemented using [Unstructured.io](https://unstructured.io). -- It effectively parses any pdf including OCR documents and categorises all elements including tables and images. -- Enables contextual text parsing: it ensures that the chunking process does not separate items like list items, and keeps titles together with text. -- Tables are not chunked. - -### 2. Multi-Vector Retriever - -- It easily retrieves not only the most similar chunks (to a query) but also the source document of the chunks. - -### 3. BasicRAG - -Refer to [BasicRAG/README.md](./cookbook/Basic-RAG/README.md) -(note: fix the RAGPipeline.md link) - ---- - -## GUI - -### 1. Retriever GUI - -A simple GUI for retrieving documents and viewing config of the vector database. - -To run: `streamlit run projects/retriver_app.py -server.port=8888` - -### 2. BasicRAG GUI - -Under development. - ---- - -## Demo - -(to be added) -![Watch the video](../Sample_Capstone/demo/fig/demo.gif) - -## Repo Structure - ---- - -``` -. 
-├── LICENSE -├── README.md -├── ci -│   ├── Jenkinsfile -│   ├── env_test.py -│   ├── modify_config.py -│   └── unlock_deeplake.py -├── cookbook -│   ├── Basic-RAG -│   │   ├── BasicRAG_CustomPrompt.py -│   │   ├── BasicRAG_FewShotPrompt.py -│   │   ├── BasicRAG_ingest.py -│   │   ├── BasicRAG_refine.py -│   │   ├── BasicRAG_stuff.py -│   │   ├── RAG-PIPELINES.md -│   │   └── README.md -│   └── Retriver-GUI -│   └── retriever_app.py -├── demo -│   ├── Readme.md -│   └── fig -│   ├── demo.gif -│   └── video.mp4 -├── documentation -│   ├── AWS_Setup_Nvidia_Driver_Install.md -│   ├── AWS_Setup_Python_Env.md -│   ├── Building an effective RAG app.md -│   ├── Data Sources.md -│   ├── basic_RAG_pipeline.drawio.svg -│   └── challenges.md -├── full_report -│   ├── Latex_report -│   │   ├── File_Setup.tex -│   │   ├── Sample_Report.pdf -│   │   ├── Sample_Report.tex -│   │   ├── fig -│   │   │   ├── GW_logo-eps-converted-to.pdf -│   │   │   ├── GW_logo.eps -│   │   │   ├── ascent-archi.pdf -│   │   │   ├── certificates-log-archi.pdf -│   │   │   ├── nyush-logo.jpeg -│   │   │   └── perf-plot-1.pdf -│   │   └── references.bib -│   ├── Markdown_Report -│   ├── Readme.md -│   └── Word_Report -│   ├── Sample_Report.docx -│   └── Sample_Report.pdf -├── llm_quantize -│   └── README.md -├── presentation -│   └── Readme.md -├── proposal -│   └── proposal.md -├── pyproject.toml -├── requirements.yml -├── research_paper -│   ├── Latex -│   │   ├── Fig -│   │   │   ├── narxnet1-eps-converted-to.pdf -│   │   │   └── narxnet1.eps -│   │   ├── Paper_Temp.pdf -│   │   ├── Paper_Temp.tex -│   │   └── mybib.bib -│   ├── Readme.md -│   └── Word -│   └── Conference-template-A4.doc -└── src - ├── __init__.py - ├── config.ini - ├── grag - │   ├── __about__.py - │   ├── __init__.py - │   ├── components - │   │   ├── __init__.py - │   │   ├── embedding.py - │   │   ├── llm.py - │   │   ├── multivec_retriever.py - │   │   ├── parse_pdf.py - │   │   ├── prompt.py - │   │   ├── text_splitter.py - │ 
  │   ├── utils.py - │   │   └── vectordb - │   │   ├── __init__.py - │   │   ├── base.py - │   │   ├── chroma_client.py - │   │   └── deeplake_client.py - │   ├── prompts - │   │   ├── Llama-2_QA-refine_1.json - │   │   ├── Llama-2_QA_1.json - │   │   ├── Mixtral_QA_1.json - │   │   ├── __init__.py - │   │   └── matcher.json - │   ├── quantize - │   │   ├── __init__.py - │   │   ├── quantize.py - │   │   └── utils.py - │   └── rag - │   ├── __init__.py - │   └── basic_rag.py - ├── scripts - │   ├── reset_chroma.sh - │   ├── reset_store.sh - │   └── run_chroma.sh - └── tests - ├── README.md - ├── __init__.py - ├── components - │   ├── __init__.py - │   ├── embedding_test.py - │   ├── llm_test.py - │   ├── multivec_retriever_test.py - │   ├── parse_pdf_test.py - │   ├── prompt_test.py - │   ├── utils_test.py - │   └── vectordb - │   ├── __init__.py - │   ├── chroma_client_test.py - │   └── deeplake_client_test.py - ├── quantize - │   ├── __init__.py - │   └── quantize_test.py - └── rag - ├── __init__.py - └── basic_rag_test.py -``` - ---- +For more information refer to [Documentation](https://arjbingly.github.io/Capstone_5/). From 9b64a7d0b65fa4b30d4323e2fb1ea0c6876c8c64 Mon Sep 17 00:00:00 2001 From: Arjun Bingly Date: Wed, 24 Apr 2024 14:25:40 -0400 Subject: [PATCH 2/8] Update pyproject.toml Add sphinx dependencies --- pyproject.toml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/pyproject.toml b/pyproject.toml index 9667ac5..f2b8bb1 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -56,7 +56,8 @@ dev = [ "mypy", "ruff", "sphinx", - "sphinx-rtd-theme" + "sphinx-rtd-theme", + "sphinx-gallery" ] [project.urls] From bf3815212862c6c152eb81330e79ba36d442b90a Mon Sep 17 00:00:00 2001 From: Arjun Bingly Date: Wed, 24 Apr 2024 14:32:28 -0400 Subject: [PATCH 3/8] Update sphinx-gitpg.yml Install grag package. 
--- .github/workflows/sphinx-gitpg.yml | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/.github/workflows/sphinx-gitpg.yml b/.github/workflows/sphinx-gitpg.yml index 238a907..ae5581b 100644 --- a/.github/workflows/sphinx-gitpg.yml +++ b/.github/workflows/sphinx-gitpg.yml @@ -27,9 +27,7 @@ jobs: uses: actions/checkout@v4 - name: Install Deps - run: | - pip install sphinx - pip install -r src/docs/requirements.txt + run: pip install .[dev] - name: Sphinx Build run: | @@ -39,12 +37,11 @@ jobs: - name: Setup Pages uses: actions/configure-pages@v5 - - name: Deploy to GitHub Pages Artifact - # uses: actions/deploy-pages@v4 + - name: Upload GitHub Pages Artifact uses: actions/upload-pages-artifact@v3 with: path: "src/docs/_build/html" - - name: Deplot GitHub Pages + - name: Deploy GitHub Pages id: deployment uses: actions/deploy-pages@v4 From ea08ca6e64308469cd5594911b2ea350106207a4 Mon Sep 17 00:00:00 2001 From: Arjun Bingly Date: Wed, 24 Apr 2024 14:43:42 -0400 Subject: [PATCH 4/8] Update sphinx-gitpg.yml Install grag as dev package. 
--- .github/workflows/sphinx-gitpg.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/sphinx-gitpg.yml b/.github/workflows/sphinx-gitpg.yml index ae5581b..5312d70 100644 --- a/.github/workflows/sphinx-gitpg.yml +++ b/.github/workflows/sphinx-gitpg.yml @@ -27,7 +27,7 @@ jobs: uses: actions/checkout@v4 - name: Install Deps - run: pip install .[dev] + run: pip install -e .[dev] - name: Sphinx Build run: | From 75278e181c76d3185d1a7cccf356d7913d09f1ee Mon Sep 17 00:00:00 2001 From: Arjun Bingly Date: Wed, 24 Apr 2024 14:59:21 -0400 Subject: [PATCH 5/8] Update sphinx-gitpg.yml Setup Python --- .github/workflows/sphinx-gitpg.yml | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/.github/workflows/sphinx-gitpg.yml b/.github/workflows/sphinx-gitpg.yml index 5312d70..b5cf572 100644 --- a/.github/workflows/sphinx-gitpg.yml +++ b/.github/workflows/sphinx-gitpg.yml @@ -26,6 +26,11 @@ jobs: - name: Checkout uses: actions/checkout@v4 + - name: Setup Python + uses: actions/setup-python@v5 + with: + python-version: '3.11' + - name: Install Deps run: pip install -e .[dev] From 398671938d76546e4ccccb46ea083f63f81b3445 Mon Sep 17 00:00:00 2001 From: Sanchit Vijay Date: Wed, 24 Apr 2024 16:08:16 -0400 Subject: [PATCH 6/8] solved CUDA issue with jenkins build --- ci/Jenkinsfile | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/ci/Jenkinsfile b/ci/Jenkinsfile index 486cd12..24aad00 100644 --- a/ci/Jenkinsfile +++ b/ci/Jenkinsfile @@ -8,8 +8,8 @@ pipeline { PYTHONPATH = "${env.WORKSPACE}/.venv/bin" CUDACXX = '/usr/local/cuda-12/bin/nvcc' CMAKE_ARGS = "-DLLAMA_CUBLAS=on" - PATH="/usr/local/cuda-12.3/bin:$PATH" - LD_LIBRARY_PATH="/usr/local/cuda-12.3/lib64:$LD_LIBRARY_PATH" + PATH="/usr/local/cuda-12/bin:$PATH" + LD_LIBRARY_PATH="/usr/local/cuda-12/lib64:$LD_LIBRARY_PATH" GIT_SSH_COMMAND="ssh -o StrictHostKeyChecking=accept-new" } From 8b3a5988eeea6300a83804ed7cbec3ea50678b5b Mon Sep 17 00:00:00 2001 From: Sanchit Vijay Date: 
Wed, 24 Apr 2024 16:28:26 -0400 Subject: [PATCH 7/8] added docs and cookbooks url --- README.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/README.md b/README.md index e9eec40..fcc9189 100644 --- a/README.md +++ b/README.md @@ -10,6 +10,8 @@ ![Static Badge](https://img.shields.io/badge/codestyle-pyflake-purple?labelColor=white) ![GitHub Issues or Pull Requests](https://img.shields.io/github/issues-pr/arjbingly/Capstone_5) +[![Static Badge][Documentation-badge]][Documentation-url] +[![Static Badge][Cookbooks-badge]][Cookbooks-url] [GRAG](https://arjbingly.github.io/Capstone_5/) is a simple python package that provides an easy end-to-end solution for implementing Retrieval Augmented Generation (RAG). @@ -88,3 +90,9 @@ Since Chroma is a server-client based vector database, make sure to run the serv For more information refer to [Documentation](https://arjbingly.github.io/Capstone_5/). + + +[Documentation-badge]: https://img.shields.io/badge/Documentation-red.svg?style=for-the-badge +[Documentation-url]: https://arjbingly.github.io/Capstone_5/ +[Cookbooks-badge]: https://img.shields.io/badge/Cookbooks-blue?style=for-the-badge +[Cookbooks-url]: https://arjbingly.github.io/Capstone_5/auto_examples_index.html From 329539e67419606ee8f6287fc6c88e007e32c40d Mon Sep 17 00:00:00 2001 From: Sanchit Vijay Date: Wed, 24 Apr 2024 17:41:01 -0400 Subject: [PATCH 8/8] adding git pull To avoid failures when a Jenkins build and a merge occur at the same time --- ci/Jenkinsfile | 2 ++ 1 file changed, 2 insertions(+) diff --git a/ci/Jenkinsfile b/ci/Jenkinsfile index 24aad00..bcce629 100644 --- a/ci/Jenkinsfile +++ b/ci/Jenkinsfile @@ -109,6 +109,7 @@ pipeline { fixed{ sshagent(credentials: ['GithubKey']){ sh 'git checkout main' + sh 'git pull' sh 'python3 ci/modify_test_status.py' sh 'git add README.md' sh 'git commit -m "test status updated"' @@ -119,6 +120,7 @@ pipeline { regression{ sshagent(credentials: ['GithubKey']){ sh 'git checkout main' + sh 'git pull' sh 'python3
ci/modify_test_status.py --fail' sh 'git add README.md' sh 'git commit -m "test status updated"'