Important notes on the new llama.cpp version and the latest speed comparison #291

ymcui · 2023-05-10T05:59:24Z

ymcui
May 10, 2023
Maintainer

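llama.cpp is about to receive another major update (not merged yet). Here are a few points to watch, for reference.

This PR will be merged into the main branch soon; its main effect is a further improvement in model inference speed.

From a quick look, the speedup is modest: better than nothing, but not dramatic. Below are the 8-thread results (taken from the table in the PR).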

7B @ q4_0: 47ms -> 44ms
7B @ q8_0: 75ms -> 70ms
13B @ q4_0: 85ms -> 81ms
13B @ q8_0: 147ms -> 134ms
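To put the figures above in perspective, the short sketch below converts each before/after pair into a relative latency reduction. The numbers are copied from the list above; the dictionary and function names are only illustrative.

```python
# Latency before and after the PR, in ms (8 threads, copied from the list above).
timings = {
    "7B q4_0": (47, 44),
    "7B q8_0": (75, 70),
    "13B q4_0": (85, 81),
    "13B q8_0": (147, 134),
}


def reduction_pct(before: float, after: float) -> float:
    """Relative latency reduction, in percent of the old latency."""
    return (before - after) / before * 100.0


reductions = {name: reduction_pct(b, a) for name, (b, a) in timings.items()}

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower latency")
```

All four reductions land in single digits, which is consistent with the "modest" reading.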

Once the above PR is merged into the main branch, all old ggml files (Q4, Q5) will no longer load in the new llama.cpp.

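It is not yet clear whether a script for converting between the old and new formats will be provided.
We recommend keeping the consolidated.*.pth files so that, after llama.cpp is updated, the pth files can be converted to the new ggml format.
If you do not want to convert again, do not update to the new llama.cpp, or wait for an official conversion script (unlikely).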
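Before updating, it may help to check whether a model directory still holds the pth checkpoints needed for reconversion. The sketch below is one minimal way to do that. The `ggml-model-*.bin` naming pattern and the idea of spotting Q4/Q5 files by name are assumptions for illustration, not something this thread specifies.

```python
from pathlib import Path


def can_reconvert(model_dir) -> bool:
    """True if at least one consolidated.*.pth checkpoint is still present."""
    return any(Path(model_dir).glob("consolidated.*.pth"))


def files_needing_regeneration(model_dir):
    """List ggml files whose names mention a Q4 or Q5 quantization.

    Assumes files are named like ggml-model-q4_0.bin (hypothetical convention).
    """
    hits = []
    for path in sorted(Path(model_dir).glob("ggml-model-*.bin")):
        name = path.name.lower()
        if "q4" in name or "q5" in name:
            hits.append(path.name)
    return hits
```

If `files_needing_regeneration(...)` returns anything while `can_reconvert(...)` is false, it is probably safer to hold off on updating.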

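P.S. (2023/5/10) The current latest version fixes an issue where, in some environments, only half of a Chinese character could be deleted during interactive input. Updating is recommended (existing ggml files can still be loaded).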

ymcui · 2023-05-11T23:53:05Z

ymcui
May 11, 2023
Maintainer Author

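Update: the above PR has been merged into main.

f16 and q8_0 files still load normally
q4_0 and q5_0 files can no longer be loaded; the model files must be regenerated
The q4_2 quantization format has been removed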

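Usage recommendations (based on the tables below):

7B: Q5_1 recommended
13B: Q5_0 recommended

Of course, Q8_0 still gives the best quality, with almost no significant difference from F16. If speed matters more, choose a quantization mode following the recommendations above.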
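One way to read these recommendations: among the 4-bit and 5-bit formats, pick the one with the lowest perplexity for each model size. The sketch below applies that rule to the PPL values from the benchmark tables in this thread. The selection rule itself is an assumption about the reasoning, not something stated by the author.

```python
# Perplexity (PPL) per format, copied from the benchmark tables in this thread.
ppl = {
    "7B": {"F16": 10.793, "Q4_0": 12.416, "Q4_1": 12.002,
           "Q5_0": 11.155, "Q5_1": 10.905, "Q8_0": 10.790},
    "13B": {"F16": 9.147, "Q4_0": 9.917, "Q4_1": 9.689,
            "Q5_0": 9.325, "Q5_1": 9.344, "Q8_0": 9.147},
}

# Formats considered when speed and size matter
# (assumption: F16 and Q8_0 are left out as the slower, larger options).
CANDIDATES = ("Q4_0", "Q4_1", "Q5_0", "Q5_1")


def best_candidate(model: str) -> str:
    """Return the candidate format with the lowest perplexity for the model."""
    return min(CANDIDATES, key=lambda fmt: ppl[model][fmt])


for model in ppl:
    fmt = best_candidate(model)
    gap = ppl[model][fmt] - ppl[model]["F16"]
    print(f"{model}: {fmt} (PPL +{gap:.3f} vs F16)")
```

Under this rule the picks come out as Q5_1 for 7B and Q5_0 for 13B, matching the recommendations above.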


ymcui · 2023-05-12T04:39:22Z

ymcui
May 12, 2023
Maintainer Author

Benchmark on the new version (only the speed has changed)

7B (old version)

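| | F16 | Q4_0 | Q4_1 | Q4_2 | Q5_0 | Q5_1 | Q8_0 |
|---|---|---|---|---|---|---|---|
| PPL | 10.793 | 12.416 | 12.002 | 11.863 | 11.155 | 10.905 | 10.790 |
| Size | 13.77G | 4.31G | 5.17G | 4.31G | 4.74G | 5.17G | 7.75G |
| ms/tok @ `-t 2` | 144 | 102 | 109 | 157 | 161 | 182 | 103 |
| ms/tok @ `-t 4` | 123 | 55 | 60 | 83 | 87 | 96 | 72 |
| ms/tok @ `-t 8` | 126 | 44 | 55 | 52 | 56 | 63 | 76 |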

7B (new version)

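| | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
|---|---|---|---|---|---|---|
| PPL | 10.793 | 12.416 | 12.002 | 11.155 | 10.905 | 10.790 |
| Size | 13.77G | 4.31G | 5.17G | 4.74G | 5.17G | 7.75G |
| ms/tok @ `-t 2` | 144 | 87 | 88 | 143 | 157 | 103 |
| ms/tok @ `-t 4` | 123 | 50 | 52 | 75 | 82 | 72 |
| ms/tok @ `-t 8` | 126 | 41 | 49 | 46 | 49 | 69 |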
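To compare the two 7B tables directly, the sketch below computes the old/new latency ratio for each quantized format and thread count. The values are copied from the tables above; variable and function names are only illustrative.

```python
# ms/tok for 7B, keyed by thread count, copied from the old and new tables.
old = {
    2: {"Q4_0": 102, "Q4_1": 109, "Q5_0": 161, "Q5_1": 182, "Q8_0": 103},
    4: {"Q4_0": 55, "Q4_1": 60, "Q5_0": 87, "Q5_1": 96, "Q8_0": 72},
    8: {"Q4_0": 44, "Q4_1": 55, "Q5_0": 56, "Q5_1": 63, "Q8_0": 76},
}
new = {
    2: {"Q4_0": 87, "Q4_1": 88, "Q5_0": 143, "Q5_1": 157, "Q8_0": 103},
    4: {"Q4_0": 50, "Q4_1": 52, "Q5_0": 75, "Q5_1": 82, "Q8_0": 72},
    8: {"Q4_0": 41, "Q4_1": 49, "Q5_0": 46, "Q5_1": 49, "Q8_0": 69},
}


def speedup(threads: int, fmt: str) -> float:
    """Old latency divided by new latency; values above 1.0 mean faster."""
    return old[threads][fmt] / new[threads][fmt]


for threads in sorted(new):
    row = ", ".join(f"{fmt} x{speedup(threads, fmt):.2f}" for fmt in new[threads])
    print(f"-t {threads}: {row}")
```

In these numbers the Q5 formats gain the most at `-t 8`, while Q8_0 is unchanged at `-t 2` and `-t 4`.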


ymcui · 2023-05-12T04:51:14Z

ymcui
May 12, 2023
Maintainer Author

13B (old version)

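| | F16 | Q4_0 | Q4_1 | Q4_2 | Q5_0 | Q5_1 | Q8_0 |
|---|---|---|---|---|---|---|---|
| PPL | 9.147 | 9.917 | 9.689 | 9.845 | 9.325 | 9.344 | 9.147 |
| Size | 26.4G | 8.25G | 9.9G | 8.25G | 9.08G | 9.9G | 14.85G |
| ms/tok @ `-t 2` | - | 196 | 207 | 298 | 305 | 348 | 192 |
| ms/tok @ `-t 4` | - | 103 | 111 | 155 | 179 | 181 | 132 |
| ms/tok @ `-t 8` | - | 81 | 93 | 94 | 104 | 113 | 132 |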

13B (new version)

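| | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
|---|---|---|---|---|---|---|
| PPL | 9.147 | 9.917 | 9.689 | 9.325 | 9.344 | 9.147 |
| Size | 26.4G | 8.25G | 9.9G | 9.08G | 9.9G | 14.85G |
| ms/tok @ `-t 2` | - | 166 | 166 | 273 | 304 | 192 |
| ms/tok @ `-t 4` | - | 89 | 94 | 142 | 155 | 132 |
| ms/tok @ `-t 8` | - | 77 | 89 | 86 | 93 | 132 |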
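The 13B numbers also show how differently the formats respond to more threads. As a rough illustration, the sketch below divides the `-t 2` latency by the `-t 8` latency for each format in the new-version table. The values are copied from that table; the names are only illustrative.

```python
# ms/tok for 13B on the new version, per format and thread count (from the table above).
new_13b = {
    "Q4_0": {2: 166, 4: 89, 8: 77},
    "Q4_1": {2: 166, 4: 94, 8: 89},
    "Q5_0": {2: 273, 4: 142, 8: 86},
    "Q5_1": {2: 304, 4: 155, 8: 93},
    "Q8_0": {2: 192, 4: 132, 8: 132},
}


def scaling(fmt: str, low: int = 2, high: int = 8) -> float:
    """Latency at `low` threads divided by latency at `high` threads."""
    return new_13b[fmt][low] / new_13b[fmt][high]


for fmt in new_13b:
    print(f"{fmt}: x{scaling(fmt):.2f} from -t 2 to -t 8")
```

In this data the Q5 formats benefit most from extra threads, while Q8_0 stays at 132 ms/tok at both `-t 4` and `-t 8`, so the thread count is worth tuning per format.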

