
# Sapien, a LLaMA 3.1 8B 🦙 Fine-Tuned LoRA model using the Alpaca Dataset


Open In Colab  

  1. Introduction
  2. Key Features
  3. 📊 Fine-tuned models
  4. 💬 Inferencing
  5. 📋 Requirements
  6. 🚀 Getting Started
  7. 🐥 Frontend
  8. 🗄️ Backend
  9. 🛠️ Code Formatting

## Introduction

Sapien is the LLaMA 3.1 8B model fine-tuned using Low-Rank Adaptation (LoRA) on the Alpaca dataset. Training is optimized for 4-bit and 16-bit precision.

Project Preview: watch the more detailed project walkthrough.

## Key Features


- LoRA (Low-Rank Adaptation) for optimizing large language models (see the training sketch below).
- 4-bit & 16-bit precision fine-tuning using advanced quantization techniques.
- Alpaca Dataset: instruction-based fine-tuning dataset.
- Model Hosting: push the trained model to Hugging Face for deployment.
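
The exact training script isn't part of this README, but the following is a minimal sketch of how such a LoRA fine-tune can be set up with Unsloth and TRL. The 4-bit base checkpoint and the 60 training steps come from the model config and trainer stats below; the dataset name, prompt template, and remaining hyperparameters are illustrative assumptions.

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Load the 4-bit quantized base model (same base as in the model config below).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/meta-llama-3.1-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Instruction-style prompt template (same shape as the one used for inference below).
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def format_prompts(examples):
    texts = [
        alpaca_prompt.format(ins, inp, out) + tokenizer.eos_token
        for ins, inp, out in zip(
            examples["instruction"], examples["input"], examples["output"]
        )
    ]
    return {"text": texts}

# "yahma/alpaca-cleaned" is a placeholder for whichever Alpaca variant was used.
dataset = load_dataset("yahma/alpaca-cleaned", split="train").map(
    format_prompts, batched=True
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,  # matches the 60 steps in the trainer stats below
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
    ),
)
trainer.train()
```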

## 📊 Fine-tuned models

Model config:

```json
{
  "_name_or_path": "unsloth/meta-llama-3.1-8b-bnb-4bit",
  "architectures": ["LlamaForCausalLM"],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 128004,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-5,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.45.1",
  "unsloth_version": "2024.9.post3",
  "use_cache": true,
  "vocab_size": 128256
}
```

Trainer stats:

```json
[
  60,
  0.8564618517955144,
  {
    "train_runtime": 441.2579,
    "train_samples_per_second": 1.088,
    "train_steps_per_second": 0.136,
    "total_flos": 5726714157219840.0,
    "train_loss": 0.8564618517955144,
    "epoch": 0.00927357032457496
  }
]
```
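
As noted under Key Features, the trained model is pushed to Hugging Face for deployment. Below is a minimal sketch of that step using the standard PEFT push methods and Unsloth's merge helper; the repo id matches the adapter referenced later in this README, and the token is a placeholder.

```python
# Push only the LoRA adapter weights and tokenizer to the Hub.
model.push_to_hub("annalhq/llama-3.1-8B-lora-alpaca", token="hf_...")
tokenizer.push_to_hub("annalhq/llama-3.1-8B-lora-alpaca", token="hf_...")

# Or merge the adapter into the base weights and push a standalone 16-bit model
# (Unsloth helper; save_method can also be "lora" or "merged_4bit").
model.push_to_hub_merged(
    "annalhq/llama-3.1-8B-lora-alpaca",
    tokenizer,
    save_method="merged_16bit",
    token="hf_...",
)
```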

## 💬 Inferencing

This will only work if you have all the model files saved locally after running the trainer.

```python
from unsloth import FastLanguageModel

# model, tokenizer and alpaca_prompt come from the training run above.
FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference mode

inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Give me the first 10 digits of Pi",  # instruction
            "3.14159",  # input
            "",  # output, left blank for generation
        )
    ],
    return_tensors="pt",
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs))
```
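
If you don't have the trainer output locally, loading the published adapter straight from the Hub should work the same way (assuming the repo contains the LoRA adapter files; Unsloth resolves the base model from the adapter config):

```python
from unsloth import FastLanguageModel

# Load the published adapter repo; the 4-bit base model is pulled in automatically.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="annalhq/llama-3.1-8B-lora-alpaca",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
```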

## 📋 Requirements

To run Sapien, you need the following:

- Node.js (version 14 or higher)
- npm (version 6 or higher)
- Python (version 3.7 or higher)
- Ollama (version 1.7 or higher, for method 2)
- llama.cpp (for method 3)

Make sure you have these installed before proceeding.


## 🚀 Getting Started

To get started with Sapien, follow these steps:

1. Clone the repository:

   ```bash
   git clone https://github.com/annalhq/sapien.git
   cd sapien
   ```

2. Install dependencies:

   ```bash
   npm install
   ```

3. Run the dev server:

   ```bash
   npm run dev
   ```

4. For the backend, refer to the Backend section.


## 🐥 Frontend

Deployed using Next.js and the shadcn/ui component library, alongside Vercel's AI SDK UI.

## 🛠️ Code Formatting

ESLint, Prettier, and Husky are set up in the project. These integrations catch server-side issues before deployment and keep the code style consistent.

If you make changes to the code, run `npm run format` before committing; otherwise Husky will prevent you from committing to the repository.

Add the `--no-verify` flag to your git command to skip Husky's hooks.

### ESLint

ESLint is used to identify and fix problems in JavaScript and TypeScript code. To run ESLint, use:

```bash
npm run check-lint
```

### Prettier

For consistent code formatting, use:

```bash
npm run check-format
```

### Husky

Husky is used to manage Git hooks. The pre-commit hook checks for formatting, linting, and type errors, and also builds the project.

## 🗄️ Backend

### Method 1: Hugging Face Serverless Inference API

The Serverless Inference API lets you run inference against my fine-tuned models, or against any other text-to-text generation model.

#### Getting tokens from HuggingFace

Log in to HuggingFace and create a token in your account settings. As recommended, prefer a fine-grained token scoped to "Make calls to the serverless Inference API".

See the official tokens guide by HuggingFace.

In v1.0.0 of this project, the HfInference client is used to handle inference from the model.

```ts
import { HfInference } from "@huggingface/inference";

// Read the token from the environment instead of hard-coding it.
const inference = new HfInference(process.env.HUGGINGFACE_API_KEY);

const result = await inference.textGeneration({
  model: "annalhq/llama-3.1-8B-lora-alpaca",
  inputs: "Hi! How are you?",
});

console.log(result);
```

Store the HF token in `.env.local` as:

```env
HUGGINGFACE_API_KEY=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```

The app will now use this serverless API with my model for streaming text.


### Method 2: Ollama (recommended for running locally)

Here you can use Ollama to serve my model locally. Since Ollama on its own does not handle streaming for the chat frontend, I've used the Vercel AI SDK with ModelFusion.

The Vercel AI SDK handles stream forwarding and rendering, while ModelFusion integrates Ollama with the Vercel AI SDK.

1. Install Ollama from the official site.
2. Pull the model into Ollama.

If you want to use my model in Ollama, follow these instructions:

1. Download HFDownloader.
2. Download my model in SafeTensors format from HF:

   ```bash
   hf -m annalhq/llama-3.1-8B-lora-alpaca
   ```

3. Import the fine-tuned adapter from the Safetensors weights.

   First, create a Modelfile with a `FROM` command pointing at the base model you used for fine-tuning and an `ADAPTER` command pointing to the directory containing your Safetensors adapter:

   ```
   FROM <base annalhq/llama-3.1-8B-lora-alpaca>
   ADAPTER /path/to/safetensors/adapter/directory
   ```

   Then create the model:

   ```bash
   ollama create annalhq/llama-3.1-8B-lora-alpaca
   ```

   Lastly, test the model:

   ```bash
   ollama run annalhq/llama-3.1-8B-lora-alpaca
   ```
### Method 3: llama.cpp

1. Clone llama.cpp:

   ```bash
   git clone https://github.com/ggerganov/llama.cpp
   cd llama.cpp
   ```

2. Compile llama.cpp using `make`:

   - On Linux or macOS:

     ```bash
     make
     ```

   - On Windows (x86/x64 only; arm64 requires cmake):

     1. Download the latest Fortran version of w64devkit.
     2. Extract w64devkit on your PC.
     3. Run `w64devkit.exe`.
     4. Use the `cd` command to reach the llama.cpp folder.
     5. From here you can run:

        ```bash
        make
        ```

3. Convert the SafeTensors files of my model to GGUF following these instructions.

4. Start the llama.cpp server:

   ```bash
   ./server -m models/llama-3.1-8B-lora-alpaca.gguf
   ```
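
Once the server is running, you can query its completion endpoint over HTTP. Below is a minimal sketch with Python's `requests` package, assuming the server is listening on the default port 8080:

```python
import requests

# POST a prompt to llama.cpp's /completion endpoint.
response = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "Give me the first 10 digits of Pi", "n_predict": 64},
)
print(response.json()["content"])
```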