Merge branch 'update-docs' of https://github.com/arjbingly/Capstone_5 into update-docs
arjbingly · Apr 9, 2024 · 9ee62ec
2 parents 4650d27 + 9bc174b
commit 9ee62ec
Showing 2 changed files with 170 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -1,3 +1,173 @@
+<h1 align="center">Graph Retrieval-Augmented Generation - GRAG</h1>
+
+[![License: AGPL v3](https://img.shields.io/badge/License-AGPL_v3-blue.svg)](https://www.gnu.org/licenses/agpl-3.0)
+![Static Badge](https://img.shields.io/badge/docstring%20style-google-yellow)
+![Static Badge](https://img.shields.io/badge/linter%20-ruff-yellow)
+![Linting](https://img.shields.io/github/actions/workflow/status/arjbingly/Capstone_5/ruff_linting.yml?label=Docs&labelColor=yellow)
+![Static Badge](https://img.shields.io/badge/buildstyle-hatchling-purple?labelColor=white)
+![Static Badge](https://img.shields.io/badge/codestyle-pyflake-purple?labelColor=white)
+![GitHub Issues or Pull Requests](https://img.shields.io/github/issues-pr/arjbingly/Capstone_5)
+
+This repository provides an open-source implementation of a Retrieval-Augmented Generation pipeline, using a graph data structure in place of a vector database.
+
+<figure>
+    <img src="documentation/basic_RAG_pipeline.png" alt="Diagram of a basic RAG pipeline">
+    <figcaption style="text-align: center;"
+    >Diagram of a basic RAG pipeline</figcaption>
+</figure>
+
+## Table of Contents
+
+- [Project Overview](#project-overview)
+- [Getting Started](#getting-started)
+  - [Requirements](#requirements)
+  - [LLM Models](#llm-models)
+  - [Data](#data)
+  - [Supported Vector Databases](#supported-vector-databases)
+    - [Embeddings](#embeddings)
+  - [Data Ingestion](#data-ingestion)
+- [Main Features](#main-features)
+  - [1. PDF Parser](#1-pdf-parser)
+  - [2. Multi-Vector Retriever](#2-multi-vector-retriever)
+  - [3. BasicRAG](#3-basicrag)
+- [GUI](#gui)
+  - [1. Retriever GUI](#1-retriever-gui)
+  - [2. BasicRAG GUI](#2-basicrag-gui)
+- [Demo](#demo)
+- [Repo Structure](#repo-structure)
+
+## Project Overview
+
+- A ready-to-deploy RAG pipeline for document retrieval.
+- Basic GUI _(Under Development)_
+- Evaluation Suite _(Under Development)_
+- RAG enhancement using Graphs _(Under Development)_
+
+---
+
+## Getting Started
+
+To run the project, follow the steps below.
+
+Further customization can be made in the config file, `src/config.ini`.
+
+- `git clone` the repository
+- `cd` into the cloned repository
+- Run `pip install .`
+- _For Dev:_ `pip install -e .`
+
+### Requirements
+
+Required packages include (_refer to [pyproject.toml](pyproject.toml)_):
+
+- PyTorch
+- LangChain
+- Chroma
+- Unstructured.io
+- sentence-embedding
+- instructor-embedding
+
+### LLM Models
+
+To quantize a model, run:
+`python -m grag.quantize.quantize`
+
+For more details, go to [./llm_quantize/readme.md](./llm_quantize/readme.md).
+
+**Tested models:**
+
+1. Llama-2 7B, 13B
+2. Mixtral 8x7B
+3. Gemma 7B
+
+**Model Compatibility**
+
+Refer to [llama.cpp](https://github.com/ggerganov/llama.cpp) Supported Models (under Description) for the list of compatible models.
+
+### Data
+
+Any PDF can be used for this project. We tested the project using arXiv papers. Refer to [arXiv Bulk Data](https://info.arxiv.org/help/bulk_data/index.html) for
+details on how to download them.
+
+```
+├── data
+│   ├── pdf
+```
+
+**Make sure to specify `data_path` under `data` in `src/config.ini`**
+
+### Supported Vector Databases
+
+**1. [Chroma](https://www.trychroma.com)**
+
+Since Chroma is a server-client based vector database, make sure to run the server.
+
+- To run Chroma locally, go to `src/scripts` and run `source run_chroma.sh`. By default, the server runs on port 8000.
+- If Chroma is not run locally, change `host` and `port` under `chroma` in `src/config.ini`.
+
+**2. [Deeplake](https://www.deeplake.ai/)**
+
+#### Embeddings
+
+- By default, the embedding model is `instructor-xl`. It can be changed by setting `embedding_type` and `embedding_model`
+  in `src/config.ini`. Any Hugging Face embedding model can be used.
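The settings named so far (`data_path` under `data`, and `host` and `port` under `chroma`) live in an INI file, so they can be read with Python's standard `configparser`. The sketch below is only an illustration: the section and key names come from this README, while the values and the `load_settings` helper are assumptions. The embedding keys are left out because the README does not say which section they belong to.

```python
import configparser

# Hypothetical excerpt of src/config.ini. Only the section and key names
# (data_path under [data]; host and port under [chroma]) come from the README;
# the values are placeholders.
SAMPLE_CONFIG = """
[data]
data_path = data/pdf

[chroma]
host = localhost
port = 8000
"""


def load_settings(config_text: str) -> dict:
    """Parse INI text and return the settings the README mentions."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return {
        "data_path": parser.get("data", "data_path"),
        "chroma_host": parser.get("chroma", "host"),
        # getint raises ValueError if the port is not an integer
        "chroma_port": parser.getint("chroma", "port"),
    }


settings = load_settings(SAMPLE_CONFIG)
print(settings)
```

To inspect a real checkout instead of the sample string, `parser.read("src/config.ini")` can replace `read_string`.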
+
+### Data Ingestion
+
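+To ingest data into the vector database:
+
+```
+from pathlib import Path
+
+# DeepLakeClient and Retriever come from the grag package; import them from
+# wherever your installed version exposes them.
+client = DeepLakeClient()  # any vector database client works here
+retriever = Retriever(vectordb=client)
+
+# Path to the folder containing the PDF files
+dir_path = Path(__file__).parents[2]
+
+retriever.ingest(dir_path)
+```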
+
+Refer to ['cookbook/basicRAG/BasicRAG_ingest'](./cookbook/basicRAG/BasicRAG_ingest)
+
+---
+
+## Main Features
+
+### 1. PDF Parser
+
+(Note: this section needs to be rewritten. Under construction: test suites and documentation for every iteration.)
+
+- The PDF parser is implemented using [Unstructured.io](https://unstructured.io).
+- It parses PDFs, including scanned documents that require OCR, and categorises all elements, including tables and images.
+- It enables contextual text parsing: chunking does not split list items apart, and titles are kept together with the text that follows them.
+- Tables are not chunked.
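The chunking rules listed above can be illustrated with a small plain-Python sketch. It is not the Unstructured.io API and not GRAG's parser: the `(category, text)` tuples, the category names, and the `chunk_elements` helper are all assumptions made for the example.

```python
def chunk_elements(elements):
    """Group (category, text) pairs into chunks.

    Rules illustrated (a simplification, not GRAG's actual logic):
    - a Title starts a new chunk and stays with the text that follows it
    - list items are appended to the current chunk, never split off
    - a Table is always emitted as its own, unsplit chunk
    """
    chunks = []
    current = []

    def flush():
        # Close the chunk being built, if it has any content.
        if current:
            chunks.append("\n".join(current))
            current.clear()

    for category, text in elements:
        if category == "Table":
            flush()
            chunks.append(text)
        elif category == "Title":
            flush()
            current.append(text)
        else:
            current.append(text)

    flush()
    return chunks


# Hypothetical parsed elements from a PDF page.
elements = [
    ("Title", "Introduction"),
    ("NarrativeText", "RAG combines retrieval and generation."),
    ("ListItem", "- retriever"),
    ("ListItem", "- generator"),
    ("Table", "model | size"),
    ("Title", "Methods"),
    ("NarrativeText", "We parse PDFs."),
]

chunks = chunk_elements(elements)
for chunk in chunks:
    print(repr(chunk))
```

A production chunker would also cap chunk length; that is omitted here to keep the three rules visible.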
+
+### 2. Multi-Vector Retriever
+
+- It retrieves not only the chunks most similar to a query but also the source documents of those chunks.
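The idea of returning both the matching chunks and their source documents can be sketched as follows. This is a toy illustration and not GRAG's `Retriever`: it uses a bag-of-words cosine similarity instead of a real embedding model, and the `DOCUMENTS`, `CHUNKS`, and `retrieve` names are assumptions.

```python
import math
from collections import Counter

# Hypothetical source documents and the chunks derived from them.
DOCUMENTS = {
    "doc-a": "Document A: notes on graphs and vector databases.",
    "doc-b": "Document B: notes on parsing PDF files.",
}
CHUNKS = [
    {"doc_id": "doc-a", "text": "graphs store entities and relations"},
    {"doc_id": "doc-a", "text": "vector databases store embeddings"},
    {"doc_id": "doc-b", "text": "pdf tables are kept as single chunks"},
]


def embed(text):
    """Toy embedding: word counts stand in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, top_k=2):
    """Return the top_k most similar chunks and their de-duplicated source documents."""
    query_vec = embed(query)
    ranked = sorted(
        CHUNKS, key=lambda c: cosine(query_vec, embed(c["text"])), reverse=True
    )[:top_k]
    # dict.fromkeys keeps first-seen order while removing duplicate ids.
    doc_ids = list(dict.fromkeys(c["doc_id"] for c in ranked))
    return ranked, [DOCUMENTS[doc_id] for doc_id in doc_ids]


top_chunks, source_docs = retrieve("pdf tables", top_k=1)
print(top_chunks[0]["text"])
print(source_docs[0])
```

Keeping a `doc_id` on every chunk is what makes the second lookup possible; a vector database would typically store it as chunk metadata.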
+
+### 3. BasicRAG
+
+Refer to [BasicRAG/README.md](./cookbook/Basic-RAG/README.md)
+(note: fix the RAGPipeline.md link)
+
+---
+
+## GUI
+
+### 1. Retriever GUI
+
+A simple GUI for retrieving documents and viewing config of the vector database.
+
+To run: `streamlit run projects/retriver_app.py --server.port=8888`
+
+### 2. BasicRAG GUI
+
+Under development.
+
+---
+
+## Demo
+
+(to be added)
+![Watch the video](../Sample_Capstone/demo/fig/demo.gif)
+
 ## Repo Structure
 
 ___

diff --git a/documentation/basic_RAG_pipeline.png b/documentation/basic_RAG_pipeline.png