Skip to content

Commit

Permalink
Merge branch 'update-docs' of https://github.com/arjbingly/Capstone_5
Browse files Browse the repository at this point in the history
…into update-docs
  • Loading branch information
sanchitvj committed Apr 9, 2024
2 parents 4650d27 + 9bc174b commit 9ee62ec
Show file tree
Hide file tree
Showing 2 changed files with 170 additions and 0 deletions.
170 changes: 170 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,173 @@
<h1 align="center">Graph Retrieval-Augmented Generation - GRAG</h1>

[![License: AGPL v3](https://img.shields.io/badge/License-AGPL_v3-blue.svg)](https://www.gnu.org/licenses/agpl-3.0)
![Static Badge](https://img.shields.io/badge/docstring%20style-google-yellow)
![Static Badge](https://img.shields.io/badge/linter%20-ruff-yellow)
![Linting](https://img.shields.io/github/actions/workflow/status/arjbingly/Capstone_5/ruff_linting.yml?label=Docs&labelColor=yellow)
![Static Badge](https://img.shields.io/badge/buildstyle-hatchling-purple?labelColor=white)
![Static Badge](https://img.shields.io/badge/codestyle-pyflake-purple?labelColor=white)
![GitHub Issues or Pull Requests](https://img.shields.io/github/issues-pr/arjbingly/Capstone_5)

This GitRepo provides an open-sourced implementation of a Retrival-Augmented Generation pipeline, using a graph data structure in place of a vector database.

<figure>
<img src="documentation/basic_RAG_pipeline.png" alt="Diagram of a basic RAG pipeline">
<figcaption style="text-align: center;"
>Diagram of a basic RAG pipeline</figcaption>
</figure>

## Table of Content

- [Project Overview](#project-overview--change-this-to-features--)
- [Getting Started](#getting-started)
- [Requirements](#requirements)
- [LLM Models](#llm-models)
- [Data](#data)
- [Supported Vector Databases](#supported-vector-databases)
- [Embeddings](#embeddings)
- [Data Ingestion](#data-ingestion)
- [Main Features](#main-features)
- [1. PDF Parser](#1-pdf-parser)
- [2. Multi-Vector Retriever](#2-multi-vector-retriever)
- [3. BasicRAG](#3-basicrag)
- [GUI](#gui)
- [1. Retriever GUI](#1-retriever-gui)
- [2. BasicRAG GUI](#2-basicrag-gui)
- [Demo](#demo)
- [Repo Structure](#repo-structure)

## Project Overview

- A ready to deploy RAG pipeline for document retrival.
- Basic GUI _(Under Development)_
- Evaluation Suite _(Under Development)_
- RAG enhancement using Graphs _(Under Development)_

---

## Getting Started

To run the projects, make sure the instructions below are followed.

Further customization can be made on the config file, `src/config.ini`.

- `git clone` the repository
- `pip install .` from the repository (note: add - then change directory to the cloned repo)
- _For Dev:_ `pip install -e .`

### Requirements

Required packages to install includes (_refer to [pyproject.toml](pyproject.toml)_):

- PyTorch
- LangChain
- Chroma
- Unstructured.io
- sentence-embedding
- instructor-embedding

### LLM Models

To quantize model, run:
`python -m grag.quantize.quantize`

For more details, go to [.\llm_quantize\readme.md](.\llm_quantize\readme.md)
**Tested models:**

1. Llama-2 7B, 13B
2. Mixtral 8x7B
3. Gemma 7B

**Model Compatibility**

Refer to [llama.cpp](https://github.com/ggerganov/llama.cpp) Supported Models (under Description) for list of compatible models.

### Data

Any PDF can be used for this project. We personally tested the project using ArXiv papers. Refer [ArXiv Bulk Data](https://info.arxiv.org/help/bulk_data/index.html) for
details on how to download.

```
├── data
│   ├── pdf
```

**Make sure to specify `data_path` under `data` in `src/config.ini`**

### Supported Vector Databases

**1. [Chroma](https://www.trychroma.com)**

Since Chroma is a server-client based vector database, make sure to run the server.

- To run Chroma locally, move to `src/scripts` then run `source run_chroma.sh`. This by default runs on port 8000.
- If Chroma is not run locally, change `host` and `port` under `chroma` in `src/config.ini`.

**2. [Deeplake](https://www.deeplake.ai/)**

#### Embeddings

- By default, the embedding model is `instructor-xl`. Can be changed by changing `embedding_type` and `embedding_model`
in `src/config.ini'. Any huggingface embeddings can be used.

### Data Ingestion

For ingesting data to the vector db:

```
client = DeepLakeClient() # Any vectordb client
retriever = Retriever(vectordb=client)
dir_path = Path(__file__).parents[2] # path to folder containing pdf files
retriever.ingest(dir_path)
```

Refer to ['cookbook/basicRAG/BasicRAG_ingest'](./cookbook/basicRAG/BasicRAG_ingest)

---

## Main Features

### 1. PDF Parser

(note: need to rewrite this. Under contruction: test suites and documentation for every iteration)

- The pdf parser is implemented using [Unstructured.io](https://unstructured.io).
- It effectively parses any pdf including OCR documents and categorises all elements including tables and images.
- Enables contextual text parsing: it ensures that the chunking process does not separate items like list items, and keeps titles together with text.
- Tables are not chunked.

### 2. Multi-Vector Retriever

- It easily retrieves not only the most similar chunks (to a query) but also the source document of the chunks.

### 3. BasicRAG

Refer to [BasicRAG/README.md](./cookbook/Basic-RAG/README.md)
(note: fix the RAGPipeline.md link)

---

## GUI

### 1. Retriever GUI

A simple GUI for retrieving documents and viewing config of the vector database.

To run: `streamlit run projects/retriver_app.py -server.port=8888`

### 2. BasicRAG GUI

Under development.

---

## Demo

(to be added)
![Watch the video](../Sample_Capstone/demo/fig/demo.gif)

## Repo Structure

___
Expand Down
Binary file added documentation/basic_RAG_pipeline.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit 9ee62ec

Please sign in to comment.