# GRAG - Good RAG

![GitHub License](https://img.shields.io/github/license/arjbingly/Capstone_5)
![Docs](https://img.shields.io/github/actions/workflow/status/arjbingly/Capstone_5/sphinx-gitpg.yml?label=Docs)
![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/arjbingly/Capstone_5/build_linting.yml?label=Linting)
![Static Badge](https://img.shields.io/badge/Tests-failing-red)
![Static Badge](https://img.shields.io/badge/docstring%20style-google-yellow)
![Static Badge](https://img.shields.io/badge/codestyle-pyflake-purple?labelColor=white)
![GitHub Issues or Pull Requests](https://img.shields.io/github/issues-pr/arjbingly/Capstone_5)

[GRAG](https://arjbingly.github.io/Capstone_5/) is a simple Python package that provides an easy end-to-end solution for implementing Retrieval Augmented Generation (RAG).

The package offers an easy way to run various LLMs locally, thanks to LlamaCpp, and supports vector stores such as Chroma and DeepLake. It also makes it easy to integrate any other vector store.

<figure>
<img src="documentation/basic_RAG_pipeline.png" alt="Diagram of a basic RAG pipeline">
</figure>

- [Getting Started](#getting-started)
  - [Requirements](#requirements)
  - [LLM Models](#llm-models)
  - [Data](#data)
  - [Supported Vector Databases](#supported-vector-databases)
    - [Embeddings](#embeddings)
  - [Data Ingestion](#data-ingestion)
- [Main Features](#main-features)
  - [1. PDF Parser](#1-pdf-parser)
  - [2. Multi-Vector Retriever](#2-multi-vector-retriever)
  - [3. BasicRAG](#3-basicrag)
- [GUI](#gui)
  - [1. Retriever GUI](#1-retriever-gui)
  - [2. BasicRAG GUI](#2-basicrag-gui)
- [Demo](#demo)
- [Repo Structure](#repo-structure)

## Project Overview

Further customization can be made in the config file, `src/config.ini`.

Required packages to install include (_refer to [pyproject.toml](pyproject.toml)_):

- PyTorch
- LangChain
- Chroma
### LLM Models

For more details on quantizing and running LLMs locally, refer to [llm_quantize/README.md](./llm_quantize/README.md).

Refer to the Supported Models list (under Description) in [llama.cpp](https://github.com/ggerganov/llama.cpp) for compatible models.

### Data

Any PDFs can be used with this project; we tested it using arXiv papers. Refer to [ArXiv Bulk Data](https://info.arxiv.org/help/bulk_data/index.html) for details on how to download them in bulk.

```
├── data
│   ├── pdf
```

**Make sure to specify `data_path` under `data` in `src/config.ini`**
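
For example, the relevant entry in `src/config.ini` might look like this (a minimal sketch: the `[data]` section and `data_path` key follow the note above; the path itself is a placeholder for your own data directory):

```ini
; Sketch only -- point data_path at the folder that holds your PDFs.
[data]
data_path = /path/to/your/data/pdf
```
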

### Supported Vector Databases

**1. [Chroma](https://www.trychroma.com)**
Since Chroma is a server-client based vector database, make sure to run the server before running the code. A helper script, [`src/scripts/run_chroma.sh`](./src/scripts/run_chroma.sh), is included in the repo.

**2. [Deeplake](https://www.deeplake.ai/)**

#### Embeddings

- By default, the embedding model is `instructor-xl`. It can be changed by setting `embedding_type` and `embedding_model` in `src/config.ini`. Any Hugging Face embedding model can be used (see the sketch below).
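
As a rough sketch, the corresponding config entry could look like the following. Only the `embedding_type` and `embedding_model` keys are taken from the bullet above; the `[embedding]` section name and the example values are assumptions, so check `src/config.ini` for the actual layout:

```ini
; Sketch only: section name and values are assumptions, not verified config.
[embedding]
embedding_type = instructor
embedding_model = hkunlp/instructor-xl
```
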

### Data Ingestion

To ingest data into the vector database:

```python
# Import paths follow the package layout under src/grag.
from pathlib import Path
from grag.components.multivec_retriever import Retriever
from grag.components.vectordb.deeplake_client import DeepLakeClient

client = DeepLakeClient()  # any vector DB client
retriever = Retriever(vectordb=client)
dir_path = Path(__file__).parents[2]  # path to the folder containing the PDF files
retriever.ingest(dir_path)
```

Refer to [`cookbook/Basic-RAG/BasicRAG_ingest.py`](./cookbook/Basic-RAG/BasicRAG_ingest.py).

---

## Main Features

### 1. PDF Parser

_(This section is under construction: test suites and documentation are being added with every iteration.)_

- The PDF parser is implemented using [Unstructured.io](https://unstructured.io) (see the sketch below).
- It effectively parses any PDF, including scanned (OCR) documents, and categorises all elements, including tables and images.
- It enables contextual text parsing: the chunking process does not split items such as list items, and keeps titles together with their text.
- Tables are not chunked.
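
To illustrate what this builds on, here is a small standalone sketch using Unstructured directly. This is not GRAG's own parser class; the file path and parameter choices are just examples:

```python
# Standalone illustration of Unstructured-based PDF parsing -- not GRAG's
# internal parser. "sample.pdf" and the parameter values are placeholders.
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

elements = partition_pdf(
    filename="sample.pdf",        # placeholder path
    strategy="hi_res",            # layout-aware parsing (needs the extra hi_res dependencies)
    infer_table_structure=True,   # keep table structure instead of flattening it
)

# Chunk by title so list items and titles stay with their surrounding text;
# tables come through as separate Table elements rather than being split.
chunks = chunk_by_title(elements)
for chunk in chunks[:5]:
    print(type(chunk).__name__, ":", str(chunk)[:80])
```
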

### 2. Multi-Vector Retriever

- It retrieves not only the chunks most similar to a query but also the source documents those chunks came from (a hedged sketch follows below).
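
A hedged sketch of what this looks like in use. The method names `get_chunk` and `get_doc` are assumptions based on the description above, not confirmed API; see the package documentation for the real interface:

```python
# Assumed method names -- illustration of the multi-vector retrieval idea only.
from grag.components.multivec_retriever import Retriever
from grag.components.vectordb.deeplake_client import DeepLakeClient

retriever = Retriever(vectordb=DeepLakeClient())
chunks = retriever.get_chunk("What is RAG?", top_k=3)  # most similar chunks (assumed signature)
docs = retriever.get_doc("What is RAG?")               # their source documents (assumed signature)
```
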

### 3. BasicRAG

Refer to [Basic-RAG/README.md](./cookbook/Basic-RAG/README.md). For an overview of the different RAG pipelines, see [RAG-PIPELINES.md](./cookbook/Basic-RAG/RAG-PIPELINES.md).
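
As a rough sketch of what a Basic-RAG script looks like: the import paths follow the repo layout (`src/grag/rag/basic_rag.py`), but the constructor arguments ("stuff" vs. "refine" document chains, passing a retriever) and call style are assumptions inferred from the cookbook file names, so treat the cookbook scripts as the source of truth:

```python
# Rough sketch, NOT verbatim cookbook code -- constructor arguments are assumptions.
from grag.components.multivec_retriever import Retriever
from grag.components.vectordb.deeplake_client import DeepLakeClient
from grag.rag.basic_rag import BasicRAG

retriever = Retriever(vectordb=DeepLakeClient())
rag = BasicRAG(doc_chain="stuff", retriever=retriever)  # assumed signature
answer = rag("What is Retrieval Augmented Generation?")
print(answer)
```
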

---

## GUI

### 1. Retriever GUI

A simple GUI for retrieving documents and viewing the config of the vector database.

To run: `streamlit run cookbook/Retriver-GUI/retriever_app.py --server.port=8888`

### 2. BasicRAG GUI

Under development.

---

## Demo

(to be added)
![Watch the video](./demo/fig/demo.gif)

---

## Repo Structure


```
.
├── LICENSE
├── README.md
├── ci
│   ├── Jenkinsfile
│   ├── env_test.py
│   ├── modify_config.py
│   └── unlock_deeplake.py
├── cookbook
│   ├── Basic-RAG
│   │   ├── BasicRAG_CustomPrompt.py
│   │   ├── BasicRAG_FewShotPrompt.py
│   │   ├── BasicRAG_ingest.py
│   │   ├── BasicRAG_refine.py
│   │   ├── BasicRAG_stuff.py
│   │   ├── RAG-PIPELINES.md
│   │   └── README.md
│   └── Retriver-GUI
│       └── retriever_app.py
├── demo
│   ├── Readme.md
│   └── fig
│       ├── demo.gif
│       └── video.mp4
├── documentation
│   ├── AWS_Setup_Nvidia_Driver_Install.md
│   ├── AWS_Setup_Python_Env.md
│   ├── Building an effective RAG app.md
│   ├── Data Sources.md
│   ├── basic_RAG_pipeline.drawio.svg
│   └── challenges.md
├── full_report
│   ├── Latex_report
│   │   ├── File_Setup.tex
│   │   ├── Sample_Report.pdf
│   │   ├── Sample_Report.tex
│   │   ├── fig
│   │   │   ├── GW_logo-eps-converted-to.pdf
│   │   │   ├── GW_logo.eps
│   │   │   ├── ascent-archi.pdf
│   │   │   ├── certificates-log-archi.pdf
│   │   │   ├── nyush-logo.jpeg
│   │   │   └── perf-plot-1.pdf
│   │   └── references.bib
│   ├── Markdown_Report
│   ├── Readme.md
│   └── Word_Report
│       ├── Sample_Report.docx
│       └── Sample_Report.pdf
├── llm_quantize
│   └── README.md
├── presentation
│   └── Readme.md
├── proposal
│   └── proposal.md
├── pyproject.toml
├── requirements.yml
├── research_paper
│   ├── Latex
│   │   ├── Fig
│   │   │   ├── narxnet1-eps-converted-to.pdf
│   │   │   └── narxnet1.eps
│   │   ├── Paper_Temp.pdf
│   │   ├── Paper_Temp.tex
│   │   └── mybib.bib
│   ├── Readme.md
│   └── Word
│       └── Conference-template-A4.doc
└── src
    ├── __init__.py
    ├── config.ini
    ├── grag
    │   ├── __about__.py
    │   ├── __init__.py
    │   ├── components
    │   │   ├── __init__.py
    │   │   ├── embedding.py
    │   │   ├── llm.py
    │   │   ├── multivec_retriever.py
    │   │   ├── parse_pdf.py
    │   │   ├── prompt.py
    │   │   ├── text_splitter.py
    │   │   ├── utils.py
    │   │   └── vectordb
    │   │       ├── __init__.py
    │   │       ├── base.py
    │   │       ├── chroma_client.py
    │   │       └── deeplake_client.py
    │   ├── prompts
    │   │   ├── Llama-2_QA-refine_1.json
    │   │   ├── Llama-2_QA_1.json
    │   │   ├── Mixtral_QA_1.json
    │   │   ├── __init__.py
    │   │   └── matcher.json
    │   ├── quantize
    │   │   ├── __init__.py
    │   │   ├── quantize.py
    │   │   └── utils.py
    │   └── rag
    │       ├── __init__.py
    │       └── basic_rag.py
    ├── scripts
    │   ├── reset_chroma.sh
    │   ├── reset_store.sh
    │   └── run_chroma.sh
    └── tests
        ├── README.md
        ├── __init__.py
        ├── components
        │   ├── __init__.py
        │   ├── embedding_test.py
        │   ├── llm_test.py
        │   ├── multivec_retriever_test.py
        │   ├── parse_pdf_test.py
        │   ├── prompt_test.py
        │   ├── utils_test.py
        │   └── vectordb
        │       ├── __init__.py
        │       ├── chroma_client_test.py
        │       └── deeplake_client_test.py
        ├── quantize
        │   ├── __init__.py
        │   └── quantize_test.py
        └── rag
            ├── __init__.py
            └── basic_rag_test.py
```

---
For more information, refer to the [Documentation](https://arjbingly.github.io/Capstone_5/).
