Simply place the model in the `models` folder, making sure that its name contains `ggml` somewhere and ends in `.bin`.
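
For example, assuming you already have a quantized file, copying it in like this satisfies the naming rule (the destination filename below is just an illustration):

```
# Copy the quantized model into the models folder.
# Any filename that contains "ggml" and ends in ".bin" will be detected.
cp /path/to/ggml-model-q4_0.bin models/llama-7b-ggml-q4_0.bin
```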
Follow the instructions in the llama.cpp README to generate the `ggml-model-q4_0.bin` file: https://github.com/ggerganov/llama.cpp#usage
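
As a rough sketch, the conversion run from the llama.cpp folder looked like the steps below at the time of writing; the script names and arguments may have changed since, so defer to the linked README:

```
# Convert the original LLaMA weights to an fp16 ggml file,
# then quantize it to 4 bits (q4_0).
python3 convert-pth-to-ggml.py models/7B/ 1
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
```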
This was the performance of llama-7b int4 on my i5-12400F:
Output generated in 33.07 seconds (6.05 tokens/s, 200 tokens, context 17)
You can change the number of threads with `--threads N`.
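
For example, to start the web UI with 8 threads (the model filename here is only a placeholder for whatever you put in the `models` folder):

```
# Launch the web UI with a ggml model and a custom thread count.
python server.py --model llama-7b-ggml-q4_0.bin --threads 8
```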