- ⚡ Introduction
- ✨ Key Features
- 📊 Fine tuned models
- 💬 Inferencing
- 📋 Requirements
- 🚀 Getting Started
- 🐥 Frontend
- 🗄️ Backend
- 🛠️ Code Formatting
Sapien is the LLaMA 3.1 8B model fine-tuned using Low-Rank Adaptation (LoRA) on the Alpaca dataset. Training is optimized for 4-bit and 16-bit precision.
Watch a more detailed project walkthrough
- LoRA (Low-Rank Adaptation) for optimizing large language models.
- 4-bit & 16-bit precision fine-tuning using advanced quantization techniques.
- Alpaca Dataset: Instruction-based fine-tuning dataset.
- Model Hosting: Push the trained model to Hugging Face for deployment.
- My fine-tuned Llama model
- Official Meta Llama 3.2 for Ollama (released on 25th Sept 2024)
Model config:
{
"_name_or_path": "unsloth/meta-llama-3.1-8b-bnb-4bit",
"architectures": ["LlamaForCausalLM"],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": 128001,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pad_token_id": 128004,
"pretraining_tp": 1,
"rms_norm_eps": 1e-5,
"rope_scaling": {
"factor": 8.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.45.1",
"unsloth_version": "2024.9.post3",
"use_cache": true,
"vocab_size": 128256
}
Trainer stats (global step, training loss, and metrics):
[
60,
0.8564618517955144,
{
"train_runtime": 441.2579,
"train_samples_per_second": 1.088,
"train_steps_per_second": 0.136,
"total_flos": 5726714157219840.0,
"train_loss": 0.8564618517955144,
"epoch": 0.00927357032457496
}
]
(This will only work if you have all the model files saved locally after running the trainer.)
from unsloth import FastLanguageModel

# model, tokenizer and alpaca_prompt are assumed to already exist
# from the training run (the same objects used by the trainer).
FastLanguageModel.for_inference(model)  # switch the model into inference mode

inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Give me the first 10 digits of Pi",  # instruction
            "3.14159",  # input
            "",  # output - left blank so the model generates it
        )
    ],
    return_tensors="pt",
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs))
To run Sapien, you need the following requirements:
- Node.js (version 14 or higher)
- npm (version 6 or higher)
- Python (version 3.7 or higher)
- Ollama (version 1.7 or higher for method 2)
- llama.cpp (for method 3)
Make sure you have these installed before proceeding.
To get started with Sapien, follow these steps:
- Clone the repository:
  git clone https://github.com/annalhq/sapien.git
  cd sapien
- Install dependencies:
  npm install
- Run the dev server:
  npm run dev
- For the backend, refer to the Backend section.
The frontend is built with Next.js and the shadcn/ui component library, alongside Vercel's AI SDK UI. These integrations help avoid server-side issues during deployment and keep the code consistent.
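As a rough illustration of how the chat UI consumes streamed responses, here is a minimal sketch using the useChat hook from the Vercel AI SDK UI. The component, its markup, and the /api/chat route path are assumptions for illustration; the actual Sapien components are styled with shadcn/ui.

"use client";

import { useChat } from "ai/react";

// Minimal chat component (illustrative only; the real UI uses shadcn/ui components).
export default function Chat() {
  // useChat posts messages to the API route and streams the assistant's reply back.
  const { messages, input, handleInputChange, handleSubmit } = useChat({
    api: "/api/chat", // assumed route path
  });

  return (
    <div>
      {messages.map((m) => (
        <p key={m.id}>
          {m.role === "user" ? "You: " : "Sapien: "}
          {m.content}
        </p>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} placeholder="Ask something..." />
      </form>
    </div>
  );
}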
If you are making changes to the code, make sure to run npm run format; otherwise Husky will prevent you from committing to the repository. Use the --no-verify flag with your git command to skip the Husky hooks.
You can also check linting and formatting manually:
npm run check-lint
npm run check-format
This Serverless Inference API allows you to easily run inference on my fine-tuned model, or on any other text-to-text generation model.
Getting tokens from HuggingFace
Log in to HuggingFace and get your tokens from here.
As recommended, it is preferable to create fine-grained tokens with the scope "Make calls to the serverless Inference API".
Official Tokens guide by HuggingFace
In v1.0.0 of this project, the HfInference client is used to handle inference from the model.
import { HfInference } from "@huggingface/inference";

const inference = new HfInference(process.env.HUGGINGFACE_API_KEY);

// Query the fine-tuned model via the serverless Inference API (text generation)
const result = await inference.textGeneration({
  model: "annalhq/llama-3.1-8B-lora-alpaca",
  inputs: "Hi! How are you?",
});
console.log(result);
Store the HF token in .env.local as:
HUGGINGFACE_API_KEY=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
This will now use the serverless API with my model for streaming text.
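For streaming, a route handler along the following lines can forward tokens from the serverless API to the frontend. This is a minimal sketch, assuming the HuggingFaceStream and StreamingTextResponse helpers from the Vercel AI SDK (ai package, v2/v3) and an /api/chat route path; adapt names and parameters to the actual project.

// app/api/chat/route.ts (assumed path)
import { HfInference } from "@huggingface/inference";
import { HuggingFaceStream, StreamingTextResponse } from "ai";

const hf = new HfInference(process.env.HUGGINGFACE_API_KEY);

export async function POST(req: Request) {
  const { prompt } = await req.json();

  // Stream tokens from the serverless Inference API.
  const response = hf.textGenerationStream({
    model: "annalhq/llama-3.1-8B-lora-alpaca",
    inputs: prompt,
    parameters: { max_new_tokens: 200 },
  });

  // Convert the token iterator into a web stream and return it to the client.
  return new StreamingTextResponse(HuggingFaceStream(response));
}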
(Recommended for running locally)
Here you can use Ollama to serve my model locally. Since it does not have stream-handling capabilities for the chat frontend, I've used the Vercel AI SDK with ModelFusion: the Vercel AI SDK handles stream forwarding and rendering, and ModelFusion integrates Ollama with the Vercel AI SDK (see the sketch after the setup steps below).
- Install Ollama from the official site
- Pulling the model into Ollama
If you want to use my model in Ollama, follow these instructions:
- Download HFDownloader
- Download my model in SafeTensors format from HF:
hf -m annalhq/llama-3.1-8B-lora-alpaca
- Importing a fine-tuned adapter from Safetensors weights
First, create a Modelfile with a FROM command pointing at the base model you used for fine-tuning, and an ADAPTER command which points to the directory with your Safetensors adapter:
FROM <base annalhq/llama-3.1-8B-lora-alpaca>
ADAPTER /path/to/safetensors/adapter/directory
Next, create the model from your Modelfile:
ollama create annalhq/llama-3.1-8B-lora-alpaca
Lastly, test the model:
ollama run annalhq/llama-3.1-8B-lora-alpaca
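Once the model is available in Ollama, a route handler along these lines connects it to the chat frontend. This is a minimal sketch based on ModelFusion's Next.js + Ollama pattern; the route path, model name, and option names are assumptions and may differ across ModelFusion versions.

// app/api/chat/route.ts (assumed path)
import { ModelFusionTextStream } from "@modelfusion/vercel-ai";
import { StreamingTextResponse } from "ai";
import { ollama, streamText } from "modelfusion";

export async function POST(req: Request) {
  const { prompt } = await req.json();

  // Stream a completion from the locally served Ollama model.
  const textStream = await streamText({
    model: ollama.CompletionTextGenerator({
      model: "annalhq/llama-3.1-8B-lora-alpaca", // the name given to `ollama create`
      maxGenerationTokens: 200,
    }),
    prompt,
  });

  // Adapt ModelFusion's stream so the Vercel AI SDK can forward and render it.
  return new StreamingTextResponse(ModelFusionTextStream(textStream));
}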
- Cloning llama.cpp from here
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
- Compiling llama.cpp using make:
  - On Linux or macOS:
    make
  - On Windows (x86/x64 only; arm64 requires cmake):
    - Download the latest Fortran version of w64devkit.
    - Extract w64devkit on your PC.
    - Run w64devkit.exe.
    - Use the cd command to reach the llama.cpp folder.
    - From here you can run:
      make
- Convert the SafeTensors model files of my model to GGUF using these instructions
- Start the llama.cpp server:
./server -m models/llama-3.1-8B-lora-alpaca.gguf
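With the server running (it listens on http://localhost:8080 by default), the backend can query its /completion endpoint over HTTP. Here is a minimal sketch, assuming llama.cpp's standard server API (prompt, n_predict, and a content field in the response):

// Minimal client for the llama.cpp HTTP server.
async function complete(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:8080/completion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      prompt,        // plain-text prompt
      n_predict: 64, // maximum number of tokens to generate
    }),
  });
  const data = await res.json();
  return data.content; // generated text
}

complete("Give me the first 10 digits of Pi").then(console.log);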