Replies: 4 comments 1 reply
-
Great work! I'm not sure what these perplexity values are, but my guess is that they are for a fraction of WikiText2 (around 10 chunks perhaps?). You are running on the CPU, and there was a bug in the AVX2 implementation that was fixed in #5834, which I think you don't have; this might explain the higher PPL values you are observing on the master branch. In any case, I have done a complete PPL run for Mistral-7B with a context of 512 and an imatrix.
Mistral-7B PPL this PR, Final PPL = 5.9530
main: build = 2278 (7ad9511)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709531437
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = tmp
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 26
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "
Mistral-7B PPL master, Final PPL = 5.8807
main: build = 2329 (67be2ce1)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709537499
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = models
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 26
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "
I have also implemented a multiplier-based codebook, see PR #5867. For Mistral-7B I get:
-
@PeterReid I was intrigued by the fact that your codebook results in a slightly better perplexity for Mistral-7B compared to #5867, so I went ahead and tried it on LLaMA-v2-7B. It does not do very well there:
LLaMA-v2-7B PPL this PR = 5.2466
main: build = 2278 (7ad9511)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709545886
llama_model_loader: loaded meta data with 16 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = llama2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 26
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["", "
LLaMA-v2-7B PPL master = 5.1340
main: build = 2307 (7b629c3b)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709314323
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
llama_model_loader: loaded meta data with 16 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = llama2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 26
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["", "
LLaMA-v2-7B PPL #5867 = 5.2016
main: build = 2295 (8b713a98)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709458694
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
llama_model_loader: loaded meta data with 16 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = llama2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 26
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["", "
-
My perplexity runs were on the full 600-ish chunks of wiki.test, and on a GPU, so the AVX2 bug wouldn't have affected them. I wonder if the difference is that I used the instruct-tuned Mistral? I also requantized down from Q8 because I couldn't find an fp16 ggml, so maybe that is it. I will do some more testing. I did not realize you had already done all this work and more in this direction! I bet that if you used a shuffle at the end, the performance gap would close up.
-
Ah, you used an instruct-tuned Mistral-7B; that explains the large PPL values. Are you using the official one from Mistral AI or some other random tuning? With the official Mistral-Instruct-7B-v0.2 I get:
Master, Final PPL = 6.7768, PPL after 100 chunks: 6.9156
main: build = 2282 (cb49e0f8)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709560713
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = hf
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 26
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "
This PR, PPL after 100 chunks: 7.0149
main: build = 2278 (7ad9511)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709561204
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = hf
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 26
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "
-
I have been exploring ways to improve perplexity for IQ3_S quantization while speeding it up on AVX/NEON, and I think I have found one. It uses the multiply-instead-of-codebook-lookup approach that I was asking about in #5676 for its speed boost, and the insight from the IQ4_NL quantization for its perplexity improvement. Unfortunately, this is not backwards compatible, because it changes the codebook. It is also not ready for merging (I think I've broken IQ3_XXS, for example), and represents me mucking around rather than making something presentable.
I started by noticing that some values appeared much more often than others in the codebook. Specifically, the numbers of occurrences of the 8 values are 436, 344, 327, 271, 223, 185, 112, and 150. This is what you would expect from a weight distribution that is somewhat quadratic-ish, and it is the fact that IQ4_NL exploits to do better than similarly-sized methods. I decided to choose the codebook values following the same polynomial as IQ4_NL. I fit a polynomial (0.08095843x^3 + 0.0671659x^2 + 11.43774359x + 0.99047392) and ended up using the values [0, 3, 6, 9, 12, 16, 19, 23, 26, 31, 35, 40, 45, 50, 56, 62]. (I used a maximum of 62 because that was the highest value in the old codebook.)
To assemble these values into a codebook, I compute the value indices as (codebook_index * 0xd137151) & 0x0f0f0f0f, and then map the four 4-bit indices in those four bytes to the values above. That magic number is the result of me trying out a few numbers until I found one that used each value a roughly equal number of times, not of any search computation, so it may well be possible to find a better one that ends up with better-spaced points. It's also possible that there is a better list of values to use, but I didn't want to overfit anything to the one model I'm working with.
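For illustration, here is a minimal scalar sketch of what such a multiplier-based expansion could look like. The function name, the assumption that one 8-bit codebook index expands to 4 unsigned magnitudes, and leaving sign handling elsewhere are all my own; the actual code in the branch may differ.

```c
#include <stdint.h>

// The 16 grid values listed above, chosen to follow an IQ4_NL-style curve.
static const uint8_t kGrid[16] = {
    0, 3, 6, 9, 12, 16, 19, 23, 26, 31, 35, 40, 45, 50, 56, 62
};

// Hypothetical sketch: expand one codebook index into 4 weight magnitudes.
// The multiply spreads the index across the four bytes of a 32-bit word;
// the mask keeps a 4-bit grid index in each byte.
static inline void expand_entry(uint8_t codebook_index, uint8_t out[4]) {
    uint32_t x = ((uint32_t)codebook_index * 0x0d137151u) & 0x0f0f0f0fu;
    out[0] = kGrid[(x >>  0) & 0x0f];
    out[1] = kGrid[(x >>  8) & 0x0f];
    out[2] = kGrid[(x >> 16) & 0x0f];
    out[3] = kGrid[(x >> 24) & 0x0f];
}
```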
I have only done this for AVX so far. It does all of those operations vectorized, working on 32 weights at a time.
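Roughly, the vectorized path could look like the sketch below. This is my illustration under stated assumptions (8 codebook indices already widened to 32-bit lanes), not the code from the branch: one multiply, one AND, and one byte shuffle produce 32 expanded values.

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical AVX2 sketch: expand 8 codebook indices (already widened to
// 32-bit lanes) into 32 grid values with one multiply, one AND and one pshufb.
static inline __m256i expand_entries_avx2(const uint32_t idx[8]) {
    const __m256i mult = _mm256_set1_epi32(0x0d137151);
    const __m256i mask = _mm256_set1_epi32(0x0f0f0f0f);
    // _mm256_shuffle_epi8 looks up within each 128-bit lane, so the 16 grid
    // values are repeated in both lanes.
    const __m256i grid = _mm256_setr_epi8(
        0, 3, 6, 9, 12, 16, 19, 23, 26, 31, 35, 40, 45, 50, 56, 62,
        0, 3, 6, 9, 12, 16, 19, 23, 26, 31, 35, 40, 45, 50, 56, 62);
    __m256i x = _mm256_loadu_si256((const __m256i *)idx);
    x = _mm256_mullo_epi32(x, mult);     // spread each index across 4 bytes
    x = _mm256_and_si256(x, mask);       // keep a 4-bit grid index per byte
    return _mm256_shuffle_epi8(grid, x); // map each nibble to its grid value
}
```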
I've tested three versions
I have not figured out why the current version of iq3_s is performing worse than the baseline on these metrics. But in any case, the speed improvement from my version is pretty big: 158%-210% of the original speed. Plus the perplexity is better.
So, to summarize: this breaks backwards compatibility with existing IQ3_S-quantized files, but it seems like it may be worthwhile to pursue for performance and perplexity reasons. @ikawrakow?
The code is in https://github.com/PeterReid/llama.cpp/commits/iq3_s_quant_change_cleaned/