This repository contains a Retrieval Augmented Generation (RAG) system that leverages Vision Language Models (VLMs) to query visually rich documents. Built using FastAPI and PyTorch, the system processes PDFs, images, and text, and utilizes Google’s Generative AI for context-aware responses.
The core focus of Retrieval Augmented Generation (RAG) is connecting your data of interest to a Large Language Model (LLM). This bridges the power of generative AI to your data, enabling complex question answering and insights grounded in your specific datasets (in our case, an NCERT PDF, or any PDF of our choice that we upload).
Here's why RAG is exciting and how it works:
Concept: RAG allows AI models to access and utilize vast amounts of up-to-date information without constant retraining.
Process:
1. When a query is received, RAG first retrieves relevant information from a knowledge base.
2. This retrieved information is then used to "augment" the input to the language model.
3. The LLM generates a response based on both its training and the retrieved information.
Benefits:
- Improved accuracy: By accessing current and specific information, RAG reduces hallucinations and outdated responses.
- Expandable knowledge: You can easily update the knowledge base without retraining the entire model.
- Transparency: RAG can provide sources for its information, increasing trust and verifiability.
Key components:
- Vectorstore: Convert your knowledge base into vector embeddings for efficient similarity search.
- Retriever: Implement algorithms to find the most relevant information for a given query.
- Generator: Use an LLM to produce human-like responses incorporating the retrieved information.
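To make these three components concrete, here is a tiny, framework-agnostic sketch of the loop. The `embed` and `llm_generate` functions are hypothetical placeholders for whatever embedding model and LLM you actually plug in; only the structure (vectorstore, retriever, generator) is the point.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding function: swap in a real embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def llm_generate(prompt: str) -> str:
    """Placeholder generator: swap in a real LLM call."""
    return f"(answer conditioned on: {prompt[:60]}...)"

# Vectorstore: embed each knowledge-base chunk once.
documents = ["Chunk about photosynthesis.", "Chunk about cell division."]
vectorstore = [(doc, embed(doc)) for doc in documents]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Retriever: rank chunks by cosine similarity to the query embedding."""
    q = embed(query)
    ranked = sorted(
        vectorstore,
        key=lambda item: float(q @ item[1] / (np.linalg.norm(q) * np.linalg.norm(item[1]))),
        reverse=True,
    )
    return [doc for doc, _ in ranked[:k]]

def answer(query: str) -> str:
    """Generator: augment the prompt with retrieved context, then call the LLM."""
    context = "\n".join(retrieve(query))
    return llm_generate(f"Context:\n{context}\n\nQuestion: {query}")

print(answer("How do plants make their food?"))
```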
Shortcomings of traditional RAG systems:
- Text-centric approach: Traditional RAG systems primarily focus on text-based information, often neglecting the visual aspects of documents. This limitation is significant when dealing with:
  - Tables
  - Figures
  - Page layouts
  - Fonts and text styling
- Limited visual understanding: Conventional RAG models struggle to efficiently exploit visual cues in documents, which can carry crucial information or context.
- Multimodal integration challenges: Traditional systems often have difficulty integrating textual and visual information seamlessly, leading to potential misinterpretations or loss of context.
- Language and domain limitations: Many existing RAG systems may not perform well across multiple languages or specialized domains, especially when visual elements are key to understanding the content.
- Inefficient processing of complex documents: Documents with rich visual structures may require multiple processing steps in traditional RAG systems, leading to increased computational overhead and potential information loss.
- Lack of end-to-end training: Many current document retrieval pipelines involve separate components for text extraction, embedding generation, and matching, making end-to-end optimization challenging.
- Performance bottlenecks: When dealing with large collections of visually rich documents, traditional RAG systems may struggle with retrieval speed and accuracy.
- Context preservation: Traditional systems might lose important contextual information provided by the visual layout and structure of documents during the retrieval process.
To address these shortcomings, this project uses a Vision Language Model.
Vision-language models (VLMs) excel at capturing both visual and textual context within images, including document images. ColPali is a ColBERT-like document retrieval model built on PaliGemma: it operates directly over image patches, so indexing takes far less time while retrieval accuracy improves.
The code imports the necessary libraries, including FastAPI, PyTorch, and 'colpali_engine'.
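As a rough sketch of those imports: the colpali_engine module paths below follow its early releases and may differ in newer versions, and pdf2image, PIL, uvicorn, and google.generativeai are assumed here for the PDF conversion, server, and Gemini steps described later.

```python
import os

import torch
import uvicorn
import google.generativeai as genai
from fastapi import FastAPI, File, UploadFile
from pdf2image import convert_from_path          # PDF pages -> PIL images (assumed)
from PIL import Image
from transformers import AutoProcessor

# ColPali model, preprocessing helpers, and evaluator from colpali_engine
from colpali_engine.models.paligemma_colbert_architecture import ColPali
from colpali_engine.trainer.retrieval_evaluator import CustomEvaluator
from colpali_engine.utils.colpali_processing_utils import process_images, process_queries
```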
The ColPali model is loaded from a pre-trained checkpoint ("vidore/colpaligemma-3b-mix-448-base"), an adapter is loaded on top of it, and a processor is initialized to handle input preprocessing.
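Continuing the sketch, the model setup would look roughly like the published colpali_engine examples. The adapter name "vidore/colpali" and the bfloat16/GPU settings are assumptions; the README only names the base checkpoint.

```python
device = "cuda" if torch.cuda.is_available() else "cpu"

# Base PaliGemma checkpoint with the ColPali retrieval adapter on top.
model = ColPali.from_pretrained(
    "vidore/colpaligemma-3b-mix-448-base",
    torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32,
    device_map=device,
).eval()
model.load_adapter("vidore/colpali")                        # assumed adapter checkpoint
processor = AutoProcessor.from_pretrained("vidore/colpali")
```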
The code checks for existing embeddings at "/tmp/embeddings.pt". If found, it loads them; otherwise, it initializes an empty list.
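That load-or-initialize step is small; the variable name `document_embeddings` is just illustrative.

```python
EMBEDDINGS_PATH = "/tmp/embeddings.pt"

# Reuse page embeddings from a previous upload if they are already on disk.
if os.path.exists(EMBEDDINGS_PATH):
    document_embeddings = torch.load(EMBEDDINGS_PATH)
else:
    document_embeddings = []
```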
Uploading a PDF:
- Accepts a PDF file upload.
- Saves the PDF temporarily.
- Converts the PDF pages to images.
- Processes these images through the ColPali model to generate embeddings.
- Saves the embeddings to "/tmp/embeddings.pt" for future use.
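A condensed sketch of that endpoint, continuing from the setup above. The route name "/upload_pdf", the temporary PDF path, and the page-by-page batching are illustrative choices, not confirmed details of this repository.

```python
app = FastAPI()

@app.post("/upload_pdf")                     # route name is illustrative
async def upload_pdf(file: UploadFile = File(...)):
    # Persist the upload to a temporary path, then render each page as an image.
    pdf_path = "/tmp/uploaded.pdf"
    with open(pdf_path, "wb") as f:
        f.write(await file.read())
    pages = convert_from_path(pdf_path)

    # Embed every page with ColPali and cache the multi-vector embeddings.
    global document_embeddings
    document_embeddings = []
    for page in pages:
        batch = process_images(processor, [page])
        batch = {k: v.to(model.device) for k, v in batch.items()}
        with torch.no_grad():
            page_embedding = model(**batch)
        document_embeddings.extend(list(torch.unbind(page_embedding.cpu())))

    torch.save(document_embeddings, EMBEDDINGS_PATH)
    return {"pages_indexed": len(pages)}
```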
Querying the documents:
- Accepts a text query.
- Loads the previously saved document embeddings.
- Processes the query through the ColPali model to generate a query embedding.
- Uses a CustomEvaluator to compare the query embedding with the document embeddings.
- Identifies the most relevant document page (image) based on the highest similarity score.
- Saves the relevant image temporarily.
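The retrieval half of that endpoint might look like this, continuing the sketch. The blank "mock image" passed to process_queries follows the colpali_engine examples, and the route name and return payload are again illustrative.

```python
@app.post("/query")                          # route name is illustrative
async def query_documents(query: str):
    # Embed the text query with ColPali; a blank image stands in for the visual input.
    mock_image = Image.new("RGB", (448, 448), (255, 255, 255))
    batch = process_queries(processor, [query], mock_image)
    batch = {k: v.to(model.device) for k, v in batch.items()}
    with torch.no_grad():
        query_embedding = model(**batch)
    qs = list(torch.unbind(query_embedding.cpu()))

    # Late-interaction (ColBERT-style) scoring of the query against every stored page.
    evaluator = CustomEvaluator(is_multi_vector=True)
    page_embeddings = torch.load(EMBEDDINGS_PATH)
    scores = evaluator.evaluate(qs, page_embeddings)

    # In the full app, the best-matching page image is saved temporarily and passed,
    # together with the query, to the Gemini step sketched below.
    best_page = int(scores[0].argmax())
    return {"best_page": best_page, "score": float(scores[0][best_page])}
```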
Generating the answer:
- Uses Google's Generative AI (a Gemini model) for further processing.
- Feeds the original query and the most relevant page image to the Gemini model.
- Generates a response based on both the query and the image content.
- Returns the generated response from the Gemini model to the user.
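A sketch of that generation step with the google-generativeai SDK. The specific Gemini model name ("gemini-1.5-flash") and the GOOGLE_API_KEY environment variable are assumptions; the README only says a Gemini model is used.

```python
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])      # key source is an assumption
gemini = genai.GenerativeModel("gemini-1.5-flash")          # model name is an assumption

def answer_with_gemini(query: str, page_image: Image.Image) -> str:
    # Gemini accepts a mixed prompt of text and PIL images in a single list.
    response = gemini.generate_content([query, page_image])
    return response.text
```

The `response.text` value is what the query endpoint sends back to the caller.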
Finally, the code sets up the FastAPI server to run on host "0.0.0.0" and port 8000.
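This typically amounts to a uvicorn entry point like the following (assuming the application object is named `app`, as in the sketches above):

```python
if __name__ == "__main__":
    # Expose the API on all network interfaces, port 8000.
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

Steps to run the project on macOS: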