Add TensorRT-LLM connector for Langchain #1

Open
wants to merge 10 commits into base: main
149 changes: 149 additions & 0 deletions libs/trt/docs/trtllmapi.ipynb
@@ -0,0 +1,149 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "cf9d3415-fe08-4cc0-bbb8-b582cb01e754",
"metadata": {},
"source": [
"# NVIDIA TensorRT-LLM\n",
"TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.\n",
"<br>\n",
"[TensorRT-LLM Github](https://github.com/NVIDIA/TensorRT-LLM)\n",
"\n",
"## TensorRT-LLM environment setup\n",
"Since TensorRT-LLM is a SDK for interacting with local models in process there are a few environment steps that must be followed to ensure that the TensorRT-LLM setup can be used.\n",
"<br>\n",
"1. Install `tensorrt_llm` following the instruction on [TensorRT-LLM Github](https://github.com/NVIDIA/TensorRT-LLM). Llama2 and Mistral models are supported with this connector. The following steps are shown for Llama2\n",
"2. Ensure you have access to the Llama 2 [repository on huggingface](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)"
]
},
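{
"cell_type": "code",
"execution_count": null,
"id": "3f2a9c61-7b4d-4e08-9a55-2c8d1e6f0b17",
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch of the two setup steps above. The exact install command\n",
"# varies by platform, so follow the TensorRT-LLM Github instructions; the\n",
"# commented line below is only an illustrative example.\n",
"# !pip install tensorrt_llm --extra-index-url https://pypi.nvidia.com\n",
"\n",
"# Log in to Hugging Face so the gated Llama 2 weights can be downloaded\n",
"# (an alternative to passing token= to snapshot_download later on).\n",
"from huggingface_hub import notebook_login\n",
"\n",
"notebook_login()"
]
},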
{
"cell_type": "markdown",
"id": "585c521c-d2ab-407c-b3fb-768c1afd0922",
"metadata": {},
"source": [
"## Langchain-nvidia-trt setup\n",
"To install from source:\n",
"1. `cd langchain-nvidia/libs/trt`\n",
"2. `!pip install -e .`"
]
},
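{
"cell_type": "code",
"execution_count": null,
"id": "9d7e5b20-14c6-4f3a-bb89-6a0c2d4e8f51",
"metadata": {},
"outputs": [],
"source": [
"# Runnable version of the source-install steps above (a sketch; adjust the\n",
"# path if the langchain-nvidia repository was cloned elsewhere).\n",
"%cd langchain-nvidia/libs/trt\n",
"%pip install -e ."
]
},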
{
"cell_type": "markdown",
"id": "01e68c59-1245-4ab3-a7c8-0ba54619f873",
"metadata": {},
"source": [
"## Use TensorRT-LLM to create engine files for the model\n",
"Llama2 and Mistral models are supported. The following steps are shown for Llama2"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aeeb384f-effa-4d1a-9bf0-3bf748daeb73",
"metadata": {},
"outputs": [],
"source": [
"from tensorrt_llm import LLM, ModelConfig\n",
"from huggingface_hub import snapshot_download\n",
"\n",
"#Download the Llama2 model\n",
"model_dir = snapshot_download(repo_id=\"meta-llama/Llama-2-7b-chat-hf\",token=\"<hf_token>\")\n",
"\n",
"# Load the model via LLM and save the .engine file\n",
"# Please restart the kernel after saving the .engine file\n",
"# to prevent OOM errors with the torch and engine loaded\n",
"config = ModelConfig(model_dir=model_dir)\n",
"llm = LLM(config)\n",
"llm.save(\"./model\")\n",
"#Plug this path to the TrtLlmAPI"
]
},
{
"cell_type": "markdown",
"id": "88f77f91-fced-408b-95ff-c9b604b9c188",
"metadata": {},
"source": [
"### Building engine files for Windows users\n",
"Instead of using the steps above, build the engine files using the following steps:\n",
"1. Clone the [TensorRT-LLM Github](https://github.com/NVIDIA/TensorRT-LLM) repository\n",
"2. Change directory to [examples/llama](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) for LLama models\n",
"3. Convert model to checkpoint format and build the engine using the following commands"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0de70883-215a-40aa-a12b-e684536dd495",
"metadata": {},
"outputs": [],
"source": [
"# Build the LLaMA 7B model using a single GPU and FP16.\n",
"python convert_checkpoint.py --model_dir ./tmp/llama/7B/ \\\n",
" --output_dir ./tllm_checkpoint_1gpu_fp16 \\\n",
" --dtype float16\n",
"\n",
"trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \\\n",
" --output_dir ./tmp/llama/7B/trt_engines/fp16/1-gpu \\\n",
" --gemm_plugin float16 \\\n",
" --context_fmha disable \\\n",
" --context_fmha_fp32_acc enable"
]
},
{
"cell_type": "markdown",
"id": "49e836a3-be37-416a-8af6-5e53562789e4",
"metadata": {},
"source": [
"## Create the TrtLlmAPI instance\n",
"When setting up an LLM object, provide the model directory where the engine built is placed, tokenizer path to the cloned huggingface repository and temperature to specify the desired deterministic nature of the responses. Call `invoke` with a prompt. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "908ac6ac-b728-47d2-ae41-11a5cf4c1057",
"metadata": {},
"outputs": [],
"source": [
"from langchain_nvidia_trt.llms import TrtLlmAPI\n",
"from langchain_core.prompts import PromptTemplate\n",
"\n",
"template = \"\"\"Question: {question}\n",
"\n",
"Answer: Let's think step by step.\"\"\"\n",
"\n",
"prompt = PromptTemplate.from_template(template)\n",
"\n",
"llm = TrtLlmAPI(\n",
" model_path=\"./model\",\n",
" tokenizer_dir=\"./model\",\n",
" temperature=1.0\n",
")\n",
"chain = prompt | llm\n",
"print(chain.invoke({\"question\": \"Who is Paul Graham?\"}))\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
4 changes: 2 additions & 2 deletions libs/trt/langchain_nvidia_trt/__init__.py
@@ -1,3 +1,3 @@
-from langchain_nvidia_trt.llms import TritonTensorRTLLM
+from langchain_nvidia_trt.llms import TritonTensorRTLLM, TrtLlmAPI
 
-__all__ = ["TritonTensorRTLLM"]
+__all__ = ["TritonTensorRTLLM", "TrtLlmAPI"]