From 5f5a199a5a97f17d8aa888765019f2344443a6a0 Mon Sep 17 00:00:00 2001
From: Xiaoyun Zhang
Date: Mon, 5 Aug 2024 13:36:00 -0700
Subject: [PATCH] [GenAI] Add readme to Microsoft.ML.GenAI.Phi (#7206)

* add readme

* Update src/Microsoft.ML.GenAI.Phi/README.md

Co-authored-by: Luis Quintanilla <46974588+luisquintanilla@users.noreply.github.com>

* Update src/Microsoft.ML.GenAI.Phi/README.md

Co-authored-by: Luis Quintanilla <46974588+luisquintanilla@users.noreply.github.com>

* Update src/Microsoft.ML.GenAI.Phi/README.md

Co-authored-by: Luis Quintanilla <46974588+luisquintanilla@users.noreply.github.com>

* Update src/Microsoft.ML.GenAI.Phi/README.md

Co-authored-by: Luis Quintanilla <46974588+luisquintanilla@users.noreply.github.com>

* Update src/Microsoft.ML.GenAI.Phi/README.md

Co-authored-by: Luis Quintanilla <46974588+luisquintanilla@users.noreply.github.com>

* Update src/Microsoft.ML.GenAI.Phi/README.md

Co-authored-by: Luis Quintanilla <46974588+luisquintanilla@users.noreply.github.com>

---------

Co-authored-by: Luis Quintanilla <46974588+luisquintanilla@users.noreply.github.com>
---
 src/Microsoft.ML.GenAI.Phi/README.md | 119 +++++++++++++++++++++++++++
 1 file changed, 119 insertions(+)
 create mode 100644 src/Microsoft.ML.GenAI.Phi/README.md

diff --git a/src/Microsoft.ML.GenAI.Phi/README.md b/src/Microsoft.ML.GenAI.Phi/README.md
new file mode 100644
index 0000000000..758a78ad47
--- /dev/null
+++ b/src/Microsoft.ML.GenAI.Phi/README.md
@@ -0,0 +1,119 @@
+# Microsoft.ML.GenAI.Phi
+TorchSharp implementation of Microsoft Phi-series models for GenAI
+
+## Supported list
+The following Phi models are supported and tested:
+- [x] [Phi-2](https://huggingface.co/microsoft/phi-2)
+- [x] [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)
+- [x] [Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct)
+- [ ] [Phi-3-small-8k-instruct](https://huggingface.co/microsoft/Phi-3-small-8k-instruct)
+- [ ] [Phi-3-small-128k-instruct](https://huggingface.co/microsoft/Phi-3-small-128k-instruct)
+- [ ] [Phi-3-medium-4k-instruct](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct)
+- [ ] [Phi-3-medium-128k-instruct](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct)
+- [ ] [Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)
+
+## Getting Started with Semantic Kernel
+
+### Download model weight (e.g. phi-3-mini-4k-instruct) from Hugging Face
+```bash
+# make sure you have git-lfs installed
+git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
+```
+
+### Load model
+```csharp
+var weightFolder = "/path/to/Phi-3-mini-4k-instruct";
+var configName = "config.json";
+var config = JsonSerializer.Deserialize<Phi3Config>(File.ReadAllText(Path.Combine(weightFolder, configName)));
+var model = new Phi3ForCasualLM(config);
+
+// load tokenizer
+var tokenizerModelName = "tokenizer.model";
+var tokenizer = Phi3TokenizerHelper.FromPretrained(Path.Combine(weightFolder, tokenizerModelName));
+
+// load weight
+model.LoadSafeTensors(weightFolder);
+
+// initialize device
+var device = "cuda";
+if (device == "cuda")
+{
+    torch.InitializeDeviceType(DeviceType.CUDA);
+}
+
+
+// create causal language model pipeline
+var pipeline = new CausalLMPipeline(tokenizer, model, device);
+```
+
+### Add pipeline as `IChatCompletionService` to Semantic Kernel
+```csharp
+var kernel = Kernel.CreateBuilder()
+    .AddGenAIChatCompletion(pipeline)
+    .Build();
+```
+
+### Chat with the model
+```csharp
+var chatService = kernel.GetRequiredService<IChatCompletionService>();
+var chatHistory = new ChatHistory();
+chatHistory.AddSystemMessage("you are a helpful assistant");
+chatHistory.AddUserMessage("write a C# program to calculate the factorial of a number");
+await foreach (var response in chatService.GetStreamingChatMessageContentsAsync(chatHistory))
+{
+    Console.Write(response);
+}
+```
+
+## Getting started with AutoGen.Net
+### Follow the same steps to download the model weight and load the model
+### Create `Phi3Agent` from pipeline
+```csharp
+var agent = new Phi3Agent(pipeline, name: "assistant")
+    .RegisterPrintMessage();
+```
+
+### Chat with the model
+```csharp
+var task = """
+write a C# program to calculate the factorial of a number
+""";
+
+await agent.SendAsync(task);
+```
+
+### More examples
+Please refer to [Microsoft.ML.GenAI.Samples](./../../docs/samples/Microsoft.ML.GenAI.Samples/) for more examples.
+
+## Dynamic loading
+It's recommended to run model inference on GPU, which requires at least 8 GB of GPU memory for the phi-3-mini-4k-instruct model when fully loaded.
+
+If you don't have enough GPU memory, you can choose to load the model weight into GPU memory dynamically. Here is how it works behind the scenes:
+- When initializing the model, the size of each layer is calculated and stored in a dictionary.
+- When loading the model weight, each layer is assigned to a device (CPU or GPU) based on the size of the layer and the remaining memory of the device. If there is not enough memory on the device, the layer is loaded to CPU memory.
+- During inference, a layer held in CPU memory is moved to GPU memory before it runs and moved back to CPU memory afterwards.
+
+Here is how to enable dynamic loading of the model:
+### Step 1: Infer the size of each layer
+You can infer the size of each layer and its target device using the `InferDeviceMapForEachLayer` API. The returned `deviceMap` is a key-value dictionary, where the key is the layer name and the value is the device name (e.g. "cuda" or "cpu").
+
+```csharp
+// manually set up the available memory on each device
+var deviceSizeMap = new Dictionary<string, long>
+    {
+        ["cuda"] = modelSizeOnCudaInGB * 1L * 1024 * 1024 * 1024,
+        ["cpu"] = modelSizeOnMemoryInGB * 1L * 1024 * 1024 * 1024,
+        ["disk"] = modelSizeOnDiskInGB * 1L * 1024 * 1024 * 1024,
+    };
+
+var deviceMap = model.InferDeviceMapForEachLayer(
+    devices: ["cuda", "cpu", "disk"],
+    deviceSizeMapInByte: deviceSizeMap);
+```
+
+### Step 2: Load model weights using `ToDynamicLoadingModel` API
+Once the `deviceMap` is calculated, you can pass it to the `ToDynamicLoadingModel` API to load the model weight.
+
+```csharp
+model = model.ToDynamicLoadingModel(deviceMap, "cuda");
+```
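
The placement rule described in the README's "Dynamic loading" section (assign each layer to a device based on the layer size and the memory still free there) can be pictured as a first-fit pass over the layers. The following is a minimal sketch under stated assumptions, not the library's implementation: `DeviceMapSketch` and `Assign` are hypothetical names, layers are assumed to be visited in order, and devices are assumed to be tried in the order given.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; it is not part of
// Microsoft.ML.GenAI.Phi and does not call InferDeviceMapForEachLayer.
public static class DeviceMapSketch
{
    // Returns a map from layer name to device name using first-fit placement.
    public static Dictionary<string, string> Assign(
        IReadOnlyList<(string Name, long SizeInBytes)> layers,
        IReadOnlyList<string> devices,
        IReadOnlyDictionary<string, long> deviceSizeMapInByte)
    {
        // Track how many bytes are still free on each device.
        var remaining = new Dictionary<string, long>();
        foreach (var device in devices)
        {
            remaining[device] = deviceSizeMapInByte[device];
        }

        var deviceMap = new Dictionary<string, string>();
        foreach (var (name, size) in layers)
        {
            var placed = false;
            foreach (var device in devices)
            {
                // Skip devices that cannot hold this layer any more.
                if (remaining[device] < size)
                {
                    continue;
                }

                remaining[device] -= size;
                deviceMap[name] = device;
                placed = true;
                break;
            }

            if (!placed)
            {
                throw new InvalidOperationException(
                    $"No device has room for layer '{name}'.");
            }
        }

        return deviceMap;
    }
}
```

With three layers of 3 bytes each, a 4-byte budget for "cuda" and an 8-byte budget for "cpu", this sketch places the first layer on "cuda" and the remaining two on "cpu", which matches the fallback behaviour the README describes.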