Merge branch 'sphinx' of https://github.com/arjbingly/Capstone_5 into sphinx
arjbingly · Apr 23, 2024 · c94ad28
2 parents 7bec7c1 + 6546f52
commit c94ad28
Showing 49 changed files with 497 additions and 116 deletions.
diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
@@ -0,0 +1,27 @@
+name: documentation
+
+on: [ push, pull_request, workflow_dispatch ]
+
+permissions:
+  contents: write
+
+jobs:
+  docs:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - uses: actions/setup-python@v3
+      - name: Install dependencies
+        run: |
+          pip install sphinx sphinx_rtd_theme myst_parser
+      - name: Sphinx build
+        run: |
+          sphinx-build doc _build
+      - name: Deploy to GitHub Pages
+        uses: peaceiris/actions-gh-pages@v3
+        if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
+        with:
+          publish_branch: gh-pages
+          github_token: ${{ secrets.GITHUB_TOKEN }}
+          publish_dir: _build/
+          force_orphan: true
diff --git a/src/docs/_build/doctrees/environment.pickle b/src/docs/_build/doctrees/environment.pickle
diff --git a/src/docs/_build/doctrees/get_started.doctree b/src/docs/_build/doctrees/get_started.doctree
diff --git a/src/docs/_build/doctrees/get_started.introduction.doctree b/src/docs/_build/doctrees/get_started.introduction.doctree
diff --git a/src/docs/_build/doctrees/get_started.llms.doctree b/src/docs/_build/doctrees/get_started.llms.doctree
diff --git a/src/docs/_build/doctrees/get_started.parse_pdf.doctree b/src/docs/_build/doctrees/get_started.parse_pdf.doctree
diff --git a/src/docs/_build/doctrees/get_started.vectordb.doctree b/src/docs/_build/doctrees/get_started.vectordb.doctree
diff --git a/src/docs/_build/doctrees/grag.components.doctree b/src/docs/_build/doctrees/grag.components.doctree
diff --git a/src/docs/_build/doctrees/grag.components.vectordb.doctree b/src/docs/_build/doctrees/grag.components.vectordb.doctree
diff --git a/src/docs/_build/doctrees/grag.rag.doctree b/src/docs/_build/doctrees/grag.rag.doctree
diff --git a/src/docs/_build/html/.buildinfo b/src/docs/_build/html/.buildinfo
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 33176d1a0fbc2e489b6d5201070d328e
+config: 1ced34aae86d195057701cf655c56180
 tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/src/docs/_build/html/_sources/get_started.introduction.rst.txt b/src/docs/_build/html/_sources/get_started.introduction.rst.txt
@@ -3,9 +3,22 @@ GRAG Overview
 
 GRAG provides an implementation of Retrieval-Augmented Generation that is completely open-sourced.
 Since it does not use any external services or APIs, this enables a cost-saving solution as well as a solution to data privacy concerns.
-For more information, refer to :ref:`Test <Vector Stores>`.
+For more information, refer to `our readme <https://github.com/arjbingly/Capstone_5/blob/main/README.md>`_.
 
-Retrieval-Augmented Generation
-##############################
+Retrieval-Augmented Generation (RAG)
+####################################
 
-Re
+Retrieval-Augmented Generation (RAG) is a machine learning technique that enhances large language models (LLMs) by incorporating external data.
+
+In RAG, a model first retrieves relevant documents or data from a large corpus and then uses this information to guide the generation of new text. This approach allows the model to produce more informed, accurate, and contextually appropriate responses.
+
+By leveraging both the retrieval of existing knowledge and the generative capabilities of neural networks, RAG models can improve over traditional generation methods, particularly in tasks requiring deep domain-specific knowledge or factual accuracy.
+
+.. figure:: ../../_static/basic_RAG_pipeline.png
+  :width: 800
+  :alt: Basic-RAG Pipeline
+  :align: center
+
+  Illustration of a basic RAG pipeline
+
+Traditionally, RAG uses a vector database (vector store) for the retrieval step; the retrieved documents are then passed to the LLM for generation.
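Note on the RAG description in the hunk above: the retrieve-then-generate flow can be illustrated with a minimal, self-contained sketch. This is not GRAG's implementation. The bag-of-words "embedding", the helper names (`embed`, `cosine`, `retrieve`, `build_prompt`), and the three-sentence corpus are all assumptions made for illustration. A real pipeline would use a learned embedding model and a vector store, and would send the final prompt to an LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding' (assumption): a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, top_k=1):
    """Retrieval step: rank documents by similarity to the query."""
    query_vec = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, context_docs):
    """Augmentation step: prepend the retrieved documents to the question."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical corpus, used only for this illustration.
corpus = [
    "Chroma is a server-client based vector database.",
    "LlamaCPP requires models in the gguf format.",
    "DeepLake is a vector store that does not need a server.",
]
query = "Which model format does LlamaCPP require?"
top_docs = retrieve(query, corpus, top_k=1)
prompt = build_prompt(query, top_docs)
print(prompt)  # In a full pipeline, this prompt would be passed to the LLM.
```

The generation step is deliberately left out: the sketch stops at the augmented prompt, which is the point where the retrieved context and the user question meet.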
diff --git a/src/docs/_build/html/_sources/get_started.llms.rst.txt b/src/docs/_build/html/_sources/get_started.llms.rst.txt
@@ -1,4 +1,4 @@
-        `LLMs
+LLMs
 =====
 
 GRAG offers two ways to run LLMs locally:
@@ -17,10 +17,10 @@ provide an auth token*
 To run LLMs using LlamaCPP
 #############################
 LlamaCPP requires models in the form of a `.gguf` file. You can either download these model files online,
-or
+or **quantize** the model yourself following the instructions below.
 
-How to quantize models.
-************************
+How to quantize models
+***********************
 To quantize the model, run:
   ``python -m grag.quantize.quantize``
 
@@ -34,4 +34,4 @@ After running the above command, user will be prompted with the following:
 
 * If the user has the model downloaded locally, the user will be instructed to copy the model and input the name of the model directory.
 
-3.Finally, the user will be prompted to enter **quantization** settings (recommended Q5_K_M or Q4_K_M, etc.). For more details, check `llama.cpp/examples/quantize/quantize.cpp <https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/quantize.cpp#L19>`_.
+3. Finally, the user will be prompted to enter **quantization** settings (recommended Q5_K_M or Q4_K_M, etc.). For more details, check `llama.cpp/examples/quantize/quantize.cpp <https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/quantize.cpp#L19>`_.
diff --git a/src/docs/_build/html/_sources/get_started.parse_pdf.rst.txt b/src/docs/_build/html/_sources/get_started.parse_pdf.rst.txt
@@ -0,0 +1,61 @@
+Parse PDF
+=========
+
+The parsing and partitioning were primarily done using the unstructured.io library, which is designed for this purpose. However, for PDFs with complex layouts, such as nested tables or tax forms, the pdfplumber and pytesseract libraries were employed to improve the parsing accuracy.
+
+The class has several attributes that control the behavior of the parsing and partitioning process.
+
+Attributes
+##########
+
+- single_text_out (bool): If True, all text elements are combined into a single output document. The default value is True.
+
+- strategy (str): The strategy for PDF partitioning. The default is "hi_res" for better accuracy.
+
+- extract_image_block_types (list): A list of elements to be extracted as image blocks. By default, it includes "Image" and "Table".
+
+- infer_table_structure (bool): Whether to extract tables during partitioning. The default value is True.
+
+- extract_images (bool): Whether to extract images. The default value is True.
+
+- image_output_dir (str): The directory to save extracted images, if any.
+
+- add_captions_to_text (bool): Whether to include figure captions in the text output. The default value is True.
+
+- add_captions_to_blocks (bool): Whether to add captions to table and image blocks. The default value is True.
+
+- add_caption_first (bool): Whether to place captions before their corresponding image or table in the output. The default value is True.
+
+- table_as_html (bool): Whether to represent tables as HTML.
+
+Parsing Complex PDF Layouts
+###########################
+
+While unstructured.io performed well in parsing PDFs with straightforward layouts, PDFs with complex layouts, such as nested tables or tax forms, were not parsed accurately. To address this issue, the pdfplumber and pytesseract libraries were employed.
+
+Table Parsing Methodology
+=========================
+
+For each page in the PDF file, the find_tables method is called with specific table settings to find the tables on that page. The table settings used are:
+
+- ``"vertical_strategy": "text"``: This setting infers the vertical (column) boundaries of a table from the alignment of words on the page.
+
+- ``"horizontal_strategy": "lines"``: This setting uses the graphical lines drawn on the page as the horizontal (row) boundaries of a table.
+
+- ``"min_words_vertical": 3``: This setting specifies the minimum number of words that must share the same alignment before a vertical boundary is inferred.
+
+**For each table found on the page, the following steps are performed:**
+
+1. The table area is cropped from the page using the ``crop`` method and the ``bbox`` (bounding box) of the table.
+
+2. The text content of the cropped table area is extracted using the ``extract_text`` method with ``layout=True``.
+
+3. A dictionary is created with the ``table_number`` and ``extracted_text`` of the table, and it is appended to the ``extracted_tables_in_page`` list.
+After processing all the tables on the page, a dictionary is created with the ``page_number`` and the list of ``extracted_tables_in_page``, and it is appended to the ``extracted_tables`` list.
+Finally, the ``extracted_tables`` list is returned, which contains all the extracted tables from the PDF file, organized by page and table number.
+
+Limitations
+===========
+
+While the table parsing methodology using `pdfplumber` could process most tables, it could not parse every table layout accurately. The table settings need to be adjusted for different types of table layouts. Additionally, pdfplumber could not extract figure captions, whereas `unstructured.io` could.
+Future work may involve developing a more robust and flexible table parsing algorithm that can handle a wider range of table layouts and integrate seamlessly with the ParsePDF class to leverage the strengths of both unstructured.io and pdfplumber libraries.
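Note on the table parsing methodology in the hunk above: the per-page steps can be sketched with pdfplumber roughly as follows. This is a sketch under stated assumptions, not the project's actual `ParsePDF` code. The function names `extract_tables_from_page` and `extract_tables` are hypothetical, as is the choice to number pages and tables from 1. The pdfplumber calls used (`pdfplumber.open`, `page.find_tables`, `page.crop`, `extract_text(layout=True)`) and the three table settings are the ones named in the documentation text.

```python
# Table settings as listed in the documentation above.
TABLE_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "lines",
    "min_words_vertical": 3,
}


def extract_tables_from_page(page, table_settings=TABLE_SETTINGS):
    """Return one dict per table found on a pdfplumber page (hypothetical helper)."""
    extracted_tables_in_page = []
    tables = page.find_tables(table_settings=table_settings)
    # Numbering tables from 1 is an assumption made for this sketch.
    for table_number, table in enumerate(tables, start=1):
        cropped = page.crop(table.bbox)  # step 1: crop the table area
        text = cropped.extract_text(layout=True)  # step 2: extract its text
        extracted_tables_in_page.append(  # step 3: record the result
            {"table_number": table_number, "extracted_text": text}
        )
    return extracted_tables_in_page


def extract_tables(pdf_path):
    """Collect the tables of every page, organized by page and table number."""
    import pdfplumber  # third-party dependency, imported only when needed

    extracted_tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            extracted_tables.append(
                {
                    "page_number": page_number,
                    "extracted_tables_in_page": extract_tables_from_page(page),
                }
            )
    return extracted_tables
```

Calling `extract_tables("form.pdf")` (a placeholder path) would return a list with one entry per page. As the Limitations section says, the settings dictionary typically needs adjusting per table layout, which is why it is passed as a parameter here rather than hard-coded inside the loop.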
diff --git a/src/docs/_build/html/_sources/get_started.rst.txt b/src/docs/_build/html/_sources/get_started.rst.txt
@@ -5,6 +5,7 @@ Get Started
 
    get_started.introduction
    get_started.installation
+   get_started.parse_pdf
    get_started.llms
    get_started.vectordb
 
diff --git a/src/docs/_build/html/_sources/get_started.vectordb.rst.txt b/src/docs/_build/html/_sources/get_started.vectordb.rst.txt
@@ -1,5 +1,3 @@
-.. _Vector Stores:
-
 Vector Stores
 ===============
 
@@ -28,7 +26,14 @@ Since Chroma is a server-client based vector database, make sure to run the serv
 * If Chroma is not run locally, change ``host`` and ``port`` under ``chroma`` in `src/config.ini`, or provide the arguments
   explicitly.
 
-For non-supported vectorstores, (...)
+Once Chroma is running, use the Chroma Client class.
+
+DeepLake
+*********
+Since DeepLake is not a server-based vector store, it is much easier to get started.
+
+Just make sure you have DeepLake installed and use the DeepLake Client class.
+
 
 Embeddings
 ###########
@@ -52,4 +57,3 @@ For more details on data ingestion, refer to our `cookbook <https://github.com/a
 
 
     retriever.ingest(dir_path)
-
diff --git a/src/docs/_build/html/get_started.html b/src/docs/_build/html/get_started.html
@@ -53,8 +53,10 @@
 <li class="toctree-l1 current"><a class="current reference internal" href="#">Get Started</a><ul>
 <li class="toctree-l2"><a class="reference internal" href="get_started.introduction.html">GRAG Overview</a></li>
 <li class="toctree-l2"><a class="reference internal" href="get_started.installation.html">Installation</a></li>
-<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html">To run LLMs using HuggingFace</a></li>
-<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html#to-run-llms-using-llamacpp">To run LLMs using LlamaCPP</a></li>
+<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html">Parse PDF</a></li>
+<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html#table-parsing-methodology">Table Parsing Methodology</a></li>
+<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html#limitations">Limitations</a></li>
+<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html">LLMs</a></li>
 <li class="toctree-l2"><a class="reference internal" href="get_started.vectordb.html">Vector Stores</a></li>
 </ul>
 </li>
@@ -91,13 +93,20 @@ <h1>Get Started<a class="headerlink" href="#get-started" title="Link to this hea
 <div class="toctree-wrapper compound">
 <ul>
 <li class="toctree-l1"><a class="reference internal" href="get_started.introduction.html">GRAG Overview</a><ul>
-<li class="toctree-l2"><a class="reference internal" href="get_started.introduction.html#retrieval-augmented-generation">Retrieval-Augmented Generation</a></li>
+<li class="toctree-l2"><a class="reference internal" href="get_started.introduction.html#retrieval-augmented-generation-rag">Retrieval-Augmented Generation (RAG)</a></li>
 </ul>
 </li>
 <li class="toctree-l1"><a class="reference internal" href="get_started.installation.html">Installation</a></li>
-<li class="toctree-l1"><a class="reference internal" href="get_started.llms.html">To run LLMs using HuggingFace</a></li>
-<li class="toctree-l1"><a class="reference internal" href="get_started.llms.html#to-run-llms-using-llamacpp">To run LLMs using LlamaCPP</a><ul>
-<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html#how-to-quantize-models">How to quantize models.</a></li>
+<li class="toctree-l1"><a class="reference internal" href="get_started.parse_pdf.html">Parse PDF</a><ul>
+<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html#attributes">Attributes</a></li>
+<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html#parsing-complex-pdf-layouts">Parsing Complex PDF Layouts</a></li>
+</ul>
+</li>
+<li class="toctree-l1"><a class="reference internal" href="get_started.parse_pdf.html#table-parsing-methodology">Table Parsing Methodology</a></li>
+<li class="toctree-l1"><a class="reference internal" href="get_started.parse_pdf.html#limitations">Limitations</a></li>
+<li class="toctree-l1"><a class="reference internal" href="get_started.llms.html">LLMs</a><ul>
+<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html#to-run-llms-using-huggingface">To run LLMs using HuggingFace</a></li>
+<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html#to-run-llms-using-llamacpp">To run LLMs using LlamaCPP</a></li>
 </ul>
 </li>
 <li class="toctree-l1"><a class="reference internal" href="get_started.vectordb.html">Vector Stores</a><ul>

diff --git a/src/docs/_build/html/get_started.installation.html b/src/docs/_build/html/get_started.installation.html
@@ -25,7 +25,7 @@
     <script src="_static/js/theme.js"></script>
     <link rel="index" title="Index" href="genindex.html" />
     <link rel="search" title="Search" href="search.html" />
-    <link rel="next" title="To run LLMs using HuggingFace" href="get_started.llms.html" />
+    <link rel="next" title="Parse PDF" href="get_started.parse_pdf.html" />
     <link rel="prev" title="GRAG Overview" href="get_started.introduction.html" /> 
 </head>
 
@@ -53,8 +53,10 @@
 <li class="toctree-l1 current"><a class="reference internal" href="get_started.html">Get Started</a><ul class="current">
 <li class="toctree-l2"><a class="reference internal" href="get_started.introduction.html">GRAG Overview</a></li>
 <li class="toctree-l2 current"><a class="current reference internal" href="#">Installation</a></li>
-<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html">To run LLMs using HuggingFace</a></li>
-<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html#to-run-llms-using-llamacpp">To run LLMs using LlamaCPP</a></li>
+<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html">Parse PDF</a></li>
+<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html#table-parsing-methodology">Table Parsing Methodology</a></li>
+<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html#limitations">Limitations</a></li>
+<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html">LLMs</a></li>
 <li class="toctree-l2"><a class="reference internal" href="get_started.vectordb.html">Vector Stores</a></li>
 </ul>
 </li>
@@ -103,7 +105,7 @@ <h1>Installation<a class="headerlink" href="#installation" title="Link to this h
           </div>
           <footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
         <a href="get_started.introduction.html" class="btn btn-neutral float-left" title="GRAG Overview" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
-        <a href="get_started.llms.html" class="btn btn-neutral float-right" title="To run LLMs using HuggingFace" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
+        <a href="get_started.parse_pdf.html" class="btn btn-neutral float-right" title="Parse PDF" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
     </div>
 
   <hr/>

diff --git a/src/docs/_build/html/get_started.introduction.html b/src/docs/_build/html/get_started.introduction.html
@@ -52,12 +52,14 @@
 <ul class="current">
 <li class="toctree-l1 current"><a class="reference internal" href="get_started.html">Get Started</a><ul class="current">
 <li class="toctree-l2 current"><a class="current reference internal" href="#">GRAG Overview</a><ul>
-<li class="toctree-l3"><a class="reference internal" href="#retrieval-augmented-generation">Retrieval-Augmented Generation</a></li>
+<li class="toctree-l3"><a class="reference internal" href="#retrieval-augmented-generation-rag">Retrieval-Augmented Generation (RAG)</a></li>
 </ul>
 </li>
 <li class="toctree-l2"><a class="reference internal" href="get_started.installation.html">Installation</a></li>
-<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html">To run LLMs using HuggingFace</a></li>
-<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html#to-run-llms-using-llamacpp">To run LLMs using LlamaCPP</a></li>
+<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html">Parse PDF</a></li>
+<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html#table-parsing-methodology">Table Parsing Methodology</a></li>
+<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html#limitations">Limitations</a></li>
+<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html">LLMs</a></li>
 <li class="toctree-l2"><a class="reference internal" href="get_started.vectordb.html">Vector Stores</a></li>
 </ul>
 </li>
@@ -94,10 +96,19 @@
 <h1>GRAG Overview<a class="headerlink" href="#grag-overview" title="Link to this heading"></a></h1>
 <p>GRAG provides an implementation of Retrieval-Augmented Generation that is completely open-sourced.
 Since it does not use any external services or APIs, this enables a cost-saving solution as well as a solution to data privacy concerns.
-For more information, refer to <a class="reference internal" href="get_started.vectordb.html#vector-stores"><span class="std std-ref">Test</span></a>.</p>
-<section id="retrieval-augmented-generation">
-<h2>Retrieval-Augmented Generation<a class="headerlink" href="#retrieval-augmented-generation" title="Link to this heading"></a></h2>
-<p>Re</p>
+For more information, refer to <a class="reference external" href="https://github.com/arjbingly/Capstone_5/blob/main/README.md">our readme</a>.</p>
+<section id="retrieval-augmented-generation-rag">
+<h2>Retrieval-Augmented Generation (RAG)<a class="headerlink" href="#retrieval-augmented-generation-rag" title="Link to this heading"></a></h2>
+<p>Retrieval-Augmented Generation (RAG) is a machine learning technique that enhances large language models (LLMs) by incorporating external data.</p>
+<p>In RAG, a model first retrieves relevant documents or data from a large corpus and then uses this information to guide the generation of new text. This approach allows the model to produce more informed, accurate, and contextually appropriate responses.</p>
+<p>By leveraging both the retrieval of existing knowledge and the generative capabilities of neural networks, RAG models can improve over traditional generation methods, particularly in tasks requiring deep domain-specific knowledge or factual accuracy.</p>
+<figure class="align-center" id="id1">
+<a class="reference internal image-reference" href="../../_static/basic_RAG_pipeline.png"><img alt="Basic-RAG Pipeline" src="../../_static/basic_RAG_pipeline.png" style="width: 800px;" /></a>
+<figcaption>
+<p><span class="caption-text">Illustration of a basic RAG pipeline</span><a class="headerlink" href="#id1" title="Link to this image"></a></p>
+</figcaption>
+</figure>
+<p>Traditionally, RAG uses a vector database (vector store) for the retrieval step; the retrieved documents are then passed to the LLM for generation.</p>
 </section>
 </section>