Releases: ggerganov/llama.cpp
b4081
backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels (#9921)
Co-authored-by: Diego Devesa <[email protected]>
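A hedged sketch of what a load-time ("online") repacking flow can look like, assuming the flow rearranges standard Q4_0 data into an interleaved layout for the optimized GEMV/GEMM kernels when the CPU supports them; the names, block layout, and 4-row interleave below are hypothetical stand-ins, not the code from #9921:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a quantized Q4_0 block.
struct block_stub { uint8_t data[18]; };

// Placeholder feature check; the real code does its own CPU feature detection.
static bool cpu_has_required_features() {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    return true;
#else
    return false;
#endif
}

// Interleave blocks from groups of 4 consecutive rows so 4 rows can be processed
// together: output order becomes r0[0], r1[0], r2[0], r3[0], r0[1], ...
// (illustrative layout only; nrows is assumed to be a multiple of 4).
static std::vector<block_stub> repack_4rows(const std::vector<block_stub> & src,
                                            size_t nrows, size_t blocks_per_row) {
    std::vector<block_stub> dst(src.size());
    for (size_t r = 0; r < nrows; ++r) {
        for (size_t b = 0; b < blocks_per_row; ++b) {
            const size_t group = r / 4, lane = r % 4;
            dst[(group * blocks_per_row + b) * 4 + lane] = src[r * blocks_per_row + b];
        }
    }
    return dst;
}

int main() {
    const size_t nrows = 8, blocks_per_row = 4;
    std::vector<block_stub> weights(nrows * blocks_per_row);
    if (cpu_has_required_features()) {
        weights = repack_4rows(weights, nrows, blocks_per_row); // repack at load time
    }                                                           // else: keep Q4_0 as-is
    (void)weights;
    return 0;
}
```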
b4080
ggml : build backends as libraries (#10256)
Signed-off-by: Xiaodong Ye <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: R0CKSTAR <[email protected]>
b4079
CUDA: no -sm row for very small matrices (#10185)
b4078
speculative : fix out-of-bounds access (#10289)
b4077
vulkan: Optimize binary ops (#10270)
Reuse the index calculations across all of src0/src1/dst. Add a shader variant for when src0 and src1 have the same dimensions and the additional modulus operations for src1 aren't needed. Div/mod are slow, so add "fast" div/mod helpers that take a fast path when the calculation isn't needed or can be done more cheaply.
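A minimal C++ sketch of the fast-div/mod idea (the actual change is in the GLSL shaders, and the exact branch conditions here are assumptions): skip the expensive integer division whenever the result is known cheaply, e.g. a divisor of 1 or an index that is already in range:

```cpp
#include <cstdint>

// Illustrative only, not the shader code from #10270: take a cheap branch in the
// common cases and fall back to the real div/mod otherwise.
static inline uint32_t fast_mod(uint32_t i, uint32_t n) {
    if (n == 1) return 0;  // src1 dimension of size 1: result is always 0
    if (i < n)  return i;  // index already in range: no division needed
    return i % n;          // general (slow) path
}

static inline uint32_t fast_div(uint32_t i, uint32_t n) {
    if (n == 1) return i;
    if (i < n)  return 0;
    return i / n;
}

int main() {
    // When src0 and src1 have identical shapes, every call takes a fast path,
    // which is why a separate shader variant that drops the modulus entirely
    // pays off for the non-broadcast case.
    return fast_mod(7, 1) + fast_div(3, 8); // 0 + 0
}
```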
b4076
vulkan: Use macros to make the mat mul pipeline creation more concise…
b4075
llama : propagate the results of `graph_compute` (#9525)
* llama : propagate the results of `graph_compute` to the user interface
* llama : revert the kv_cache in case of a failed compute
* llama : remove `llama_kv_cache_state`; only the result of `llama_graph_compute` is returned
* llama : restore the kv_cache in case of a failed computation
* llama : correctly revert the entire batch; also update `llama_kv_cache_find_slot` so it correctly counts the number of `used` cells for recurrent models
* llama : update comments
* llama : add comments about the KV cache state after an error
Co-authored-by: Georgi Gerganov <[email protected]>
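A minimal sketch of the revert-on-failure pattern described above, using hypothetical stand-in types rather than the real llama.cpp structures: snapshot the KV-cache bookkeeping before running the graph, roll it back if compute fails, and propagate the status to the caller:

```cpp
#include <cstdio>

// Hypothetical stand-ins for illustration only; not the llama.cpp types/functions.
struct kv_cache_stub { int head = 0; int used = 0; };

static int graph_compute_stub(bool fail) { return fail ? -1 : 0; } // 0 = success

// Snapshot the KV-cache bookkeeping, run the graph, and revert the whole batch
// on failure so no stale cells are left behind; return the status to the caller.
static int decode_sketch(kv_cache_stub & kv, bool fail) {
    const kv_cache_stub saved = kv;   // snapshot state (head, number of used cells)

    kv.head += 4;                     // pretend the batch occupied 4 cells
    kv.used += 4;

    const int status = graph_compute_stub(fail);
    if (status != 0) {
        kv = saved;                   // revert the entire batch on failure
    }
    return status;                    // propagate the result to the user interface
}

int main() {
    kv_cache_stub kv;
    if (decode_sketch(kv, /*fail=*/true) != 0) {
        std::printf("compute failed, kv cache restored: used = %d\n", kv.used);
    }
    return 0;
}
```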
b4071
server : fix incorrect res in validate_model_chat_template (#10272)
* server : fix validate_model_chat_template
* server : fix chat res
b4069
sycl : Fixes to broken builds and test-backend-ops (#10257)
* Fix the broken build for the SYCL CUDA backend caused by a non-explicit gemm call in outprod (merged in with RWKV6 in "Optimize RWKV6 Operator Naming and Implement Multi-core CPU/SYCL Acceleration", #10133)
* Mark permuted MUL_MAT as unsupported to be able to run test-backend-ops
* Fix asserts in norm so that debug builds work
b4068
vulkan: Optimize contiguous copies (#10254)
* tests: Fix the memory bandwidth calculation for perf tests. Add a flops calculation for flash attention. Add one GGML_OP_CPY perf test.
* vulkan: Optimize contiguous copies. Add a variant of the copy shader for when the tensors are contiguous: avoid the complex addressing calculations and process four elements per invocation to hide some other overhead. Apply similar changes to the scale shader, since scale is always contiguous. Add a "progress bar" for shader compiles.
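To illustrate the contiguous fast path (a plain C++ analogue, not the Vulkan shader from #10254): when the tensors are contiguous, the per-element multi-dimensional index arithmetic of the generic copy can be dropped and each invocation can copy four consecutive elements:

```cpp
#include <cstddef>
#include <vector>

// Contiguous copy: no nd-index math, four elements per "invocation".
static void copy_contiguous(const float * src, float * dst, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {      // one invocation handles 4 consecutive elements
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    for (; i < n; ++i) {              // tail elements
        dst[i] = src[i];
    }
}

int main() {
    std::vector<float> src(1000, 1.0f), dst(1000, 0.0f);
    copy_contiguous(src.data(), dst.data(), src.size());
    return dst[999] == 1.0f ? 0 : 1;
}
```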