Releases: ggerganov/llama.cpp

b4081

15 Nov 01:40
1607a5e
backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels (#9921)

* backend-cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels

---------

Co-authored-by: Diego Devesa <[email protected]>

b4080

14 Nov 18:08
ae8de6d
ggml : build backends as libraries (#10256)

* ggml : build backends as libraries

---------

Signed-off-by: Xiaodong Ye <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: R0CKSTAR <[email protected]>

b4079

14 Nov 12:54
4a8ccb3
CUDA: no -sm row for very small matrices (#10185)

b4078

14 Nov 10:57
2a82891
speculative : fix out-of-bounds access (#10289)

b4077

14 Nov 06:19
af148c9
vulkan: Optimize binary ops (#10270)

Reuse the index calculations across all of src0/src1/dst. Add a shader
variant for when src0/src1 have the same dimensions and the additional modulus
operations for src1 aren't needed. Div/mod are slow, so add "fast" div/mod that
have a fast path when the calculation isn't needed or can be done more
cheaply.
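
The "fast" div/mod idea can be illustrated with a minimal sketch. This is written in Python for readability and is not the shader code from the PR; the function names `fast_mod` and `fast_div` and the specific fast paths chosen (power-of-two divisor for mod, dividend smaller than divisor for div) are assumptions made for illustration.

```python
def fast_mod(a: int, b: int) -> int:
    """Return a % b for a >= 0 and b > 0, avoiding the modulus when possible."""
    # Assumed fast path: when b is a power of two, a bit mask gives the same result.
    if b & (b - 1) == 0:
        return a & (b - 1)
    return a % b


def fast_div(a: int, b: int) -> int:
    """Return a // b for a >= 0 and b > 0, avoiding the division when possible."""
    # Assumed fast path: a dividend smaller than the divisor always yields zero.
    if a < b:
        return 0
    return a // b


def src1_index(i: int, ne_src1: int, same_shape: bool) -> int:
    """Map a destination index to a src1 index along one dimension.

    When src0 and src1 have the same dimensions, no broadcast wrap-around is
    needed, so the modulus is skipped entirely (the separate shader variant).
    """
    return i if same_shape else fast_mod(i, ne_src1)
```

Both helpers return the same values as the plain operators for non-negative `a` and positive `b`; only the cost of the common cases changes.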

b4076

13 Nov 23:38
66798e4
vulkan: Use macros to make the mat mul pipeline creation more concise…

b4075

13 Nov 19:15
fb4a0ec
llama : propagate the results of `graph_compute` (#9525)

* llama: propagating the results of `graph_compute` to the user interface

* llama: reverting kv_cache in case of failed compute

* llama: `llama_kv_cache_state` was removed, only the result of `llama_graph_compute` is returned

* llama: restore a kv_cache in case of failed computation

* llama: correct reverting of the entire batch.
Also updates `llama_kv_cache_find_slot` so that it correctly counts the number of `used` cells for recurrent models

* llama: updated comments

* llama : add comments about KV cache state after error

---------

Co-authored-by: Georgi Gerganov <[email protected]>
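
The pattern described in these commits (return the compute status to the caller and restore the KV cache when the computation fails) can be sketched roughly as follows. This is a toy Python model, not llama.cpp code: `ToyKVCache`, `decode`, and the convention that status `0` means success are all hypothetical names and assumptions used only to show the control flow.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyKVCache:
    """Hypothetical stand-in for a KV cache: a list of cells plus a head position."""
    cells: List[int] = field(default_factory=list)
    head: int = 0


def decode(cache: ToyKVCache, batch: List[int],
           graph_compute: Callable[[List[int]], int]) -> int:
    """Process a batch and return the status reported by graph_compute.

    Assumption: status 0 means success, anything else means failure.
    On failure the cache is restored to its state before the whole batch.
    """
    # Snapshot the state so the entire batch can be reverted.
    saved_cells = list(cache.cells)
    saved_head = cache.head

    # Reserve cells for the batch before running the computation.
    cache.cells.extend(batch)
    cache.head += len(batch)

    status = graph_compute(batch)
    if status != 0:
        # Failed compute: undo everything this batch added.
        cache.cells = saved_cells
        cache.head = saved_head

    # Propagate the result to the caller instead of swallowing it.
    return status
```

A caller can then check the returned status and retry or abort, knowing that a failed call left the cache as it was before the batch.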

b4071

13 Nov 12:31
0e712a5
server : fix incorrect res in validate_model_chat_template (#10272)

* server : fix validate_model_chat_template

* server : fix chat res

b4069

13 Nov 10:55
2e82ffa
sycl : Fixes to broken builds and test-backend-ops (#10257)

* Fixes broken build for the SYCL CUDA backend caused by non-explicit gemm call in outprod (merged in with RWKV6 in #10133, "Optimize RWKV6 Operator Naming and Implement Multi-core CPU/ SYCL Acceleration")

* Marks permuted MUL_MAT as unsupported to be able to run test-backend-ops

* Fixes asserts in norm to fix debug builds.

b4068

13 Nov 08:22
80dd7ff
vulkan: Optimize contiguous copies (#10254)

* tests: Fix memory bandwidth calculation for perf tests

Add a flops calculation for flash attention.

Add one GGML_OP_CPY perf test.

* vulkan: Optimize contiguous copies

Add a variant of the copy shader for when the tensors are contiguous. Avoid
the complex addressing calculations, and do four elements per invocation
to hide some other overhead.
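
To show why the contiguous case is cheaper, here is a minimal Python sketch contrasting a general strided copy with a contiguous copy that handles four elements per "invocation". It is an illustration only, not the Vulkan shader: the function names, the 2D shape, and the loop standing in for parallel invocations are assumptions.

```python
def copy_strided(src, dst, shape, src_strides, dst_strides):
    """General 2D copy: every element needs a div/mod plus two strided offsets."""
    rows, cols = shape
    for i in range(rows * cols):
        r, c = divmod(i, cols)  # the "complex addressing" step
        src_off = r * src_strides[0] + c * src_strides[1]
        dst_off = r * dst_strides[0] + c * dst_strides[1]
        dst[dst_off] = src[src_off]


def copy_contiguous(src, dst, n, per_invocation=4):
    """Contiguous copy: flat indices only, several elements per invocation."""
    num_invocations = (n + per_invocation - 1) // per_invocation
    for inv in range(num_invocations):  # each iteration models one invocation
        base = inv * per_invocation
        for k in range(per_invocation):
            idx = base + k
            if idx < n:  # guard the tail when n is not a multiple of 4
                dst[idx] = src[idx]
```

With contiguous tensors the source and destination offsets are the same flat index, so the per-element div/mod and stride arithmetic disappears, and doing several elements per invocation spreads the fixed per-invocation cost over more work.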

Apply similar changes to the scale shader, since scale is always contiguous.

Add a "progress bar" for shader compiles.