Releases: ggerganov/llama.cpp
b4081
backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels (#9921)
Co-authored-by: Diego Devesa <[email protected]>
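A hedged sketch of what a load-time ("online") repacking flow can look like, assuming the flow rearranges standard Q4_0 data into an interleaved layout for the optimized GEMV/GEMM kernels when the CPU supports them; the names, block layout, and 4-row interleave below are hypothetical stand-ins, not the code from #9921:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a quantized Q4_0 block.
struct block_stub { uint8_t data[18]; };

// Placeholder feature check; the real code does its own CPU feature detection.
static bool cpu_has_required_features() {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    return true;
#else
    return false;
#endif
}

// Interleave blocks from groups of 4 consecutive rows so 4 rows can be processed
// together: output order becomes r0[0], r1[0], r2[0], r3[0], r0[1], ...
// (illustrative layout only; nrows is assumed to be a multiple of 4).
static std::vector<block_stub> repack_4rows(const std::vector<block_stub> & src,
                                            size_t nrows, size_t blocks_per_row) {
    std::vector<block_stub> dst(src.size());
    for (size_t r = 0; r < nrows; ++r) {
        for (size_t b = 0; b < blocks_per_row; ++b) {
            const size_t group = r / 4, lane = r % 4;
            dst[(group * blocks_per_row + b) * 4 + lane] = src[r * blocks_per_row + b];
        }
    }
    return dst;
}

int main() {
    const size_t nrows = 8, blocks_per_row = 4;
    std::vector<block_stub> weights(nrows * blocks_per_row);
    if (cpu_has_required_features()) {
        weights = repack_4rows(weights, nrows, blocks_per_row); // repack at load time
    }                                                           // else: keep Q4_0 as-is
    (void)weights;
    return 0;
}
```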
b4080
ggml : build backends as libraries (#10256)
Signed-off-by: Xiaodong Ye <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: R0CKSTAR <[email protected]>
b4079
CUDA: no -sm row for very small matrices (#10185)
b4078
speculative : fix out-of-bounds access (#10289)
b4077
vulkan: Optimize binary ops (#10270)
Reuse the index calculations across all of src0/src1/dst. Add a shader variant for when src0 and src1 have the same dimensions and the additional modulus operations for src1 aren't needed. Div/mod are slow, so add "fast" div/mod helpers that take a fast path when the calculation isn't needed or can be done more cheaply.
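A minimal C++ sketch of the fast-div/mod idea (the actual change is in the GLSL shaders, and the exact branch conditions here are assumptions): skip the expensive integer division whenever the result is known cheaply, e.g. a divisor of 1 or an index that is already in range:

```cpp
#include <cstdint>

// Illustrative only, not the shader code from #10270: take a cheap branch in the
// common cases and fall back to the real div/mod otherwise.
static inline uint32_t fast_mod(uint32_t i, uint32_t n) {
    if (n == 1) return 0;  // src1 dimension of size 1: result is always 0
    if (i < n)  return i;  // index already in range: no division needed
    return i % n;          // general (slow) path
}

static inline uint32_t fast_div(uint32_t i, uint32_t n) {
    if (n == 1) return i;
    if (i < n)  return 0;
    return i / n;
}

int main() {
    // When src0 and src1 have identical shapes, every call takes a fast path,
    // which is why a separate shader variant that drops the modulus entirely
    // pays off for the non-broadcast case.
    return fast_mod(7, 1) + fast_div(3, 8); // 0 + 0
}
```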
b4076
vulkan: Use macros to make the mat mul pipeline creation more concise…
b4075
llama : propagate the results of `graph_compute` (#9525)
* llama : propagate the results of `graph_compute` to the user interface
* llama : revert the kv_cache in case of a failed compute
* llama : remove `llama_kv_cache_state`; only the result of `llama_graph_compute` is returned
* llama : restore the kv_cache in case of a failed computation
* llama : correctly revert the entire batch; also update `llama_kv_cache_find_slot` so it correctly counts the number of `used` cells for recurrent models
* llama : update comments
* llama : add comments about the KV cache state after an error
Co-authored-by: Georgi Gerganov <[email protected]>
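A minimal sketch of the revert-on-failure pattern described above, using hypothetical stand-in types rather than the real llama.cpp structures: snapshot the KV-cache bookkeeping before running the graph, roll it back if compute fails, and propagate the status to the caller:

```cpp
#include <cstdio>

// Hypothetical stand-ins for illustration only; not the llama.cpp types/functions.
struct kv_cache_stub { int head = 0; int used = 0; };

static int graph_compute_stub(bool fail) { return fail ? -1 : 0; } // 0 = success

// Snapshot the KV-cache bookkeeping, run the graph, and revert the whole batch
// on failure so no stale cells are left behind; return the status to the caller.
static int decode_sketch(kv_cache_stub & kv, bool fail) {
    const kv_cache_stub saved = kv;   // snapshot state (head, number of used cells)

    kv.head += 4;                     // pretend the batch occupied 4 cells
    kv.used += 4;

    const int status = graph_compute_stub(fail);
    if (status != 0) {
        kv = saved;                   // revert the entire batch on failure
    }
    return status;                    // propagate the result to the user interface
}

int main() {
    kv_cache_stub kv;
    if (decode_sketch(kv, /*fail=*/true) != 0) {
        std::printf("compute failed, kv cache restored: used = %d\n", kv.used);
    }
    return 0;
}
```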
b4071
server : fix incorrect res in validate_model_chat_template (#10272)
* server : fix validate_model_chat_template
* server : fix chat res
b4069
sycl : Fixes to broken builds and test-backend-ops (#10257)
* Fix the broken build for the SYCL CUDA backend caused by a non-explicit gemm call in outprod (merged in with RWKV6 in "Optimize RWKV6 Operator Naming and Implement Multi-core CPU/SYCL Acceleration", #10133)
* Mark permuted MUL_MAT as unsupported to be able to run test-backend-ops
* Fix asserts in norm so that debug builds work
b4068
vulkan: Optimize contiguous copies (#10254)
* tests: Fix the memory bandwidth calculation for perf tests. Add a flops calculation for flash attention. Add one GGML_OP_CPY perf test.
* vulkan: Optimize contiguous copies. Add a variant of the copy shader for when the tensors are contiguous: avoid the complex addressing calculations and process four elements per invocation to hide some other overhead. Apply similar changes to the scale shader, since scale is always contiguous. Add a "progress bar" for shader compiles.
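To illustrate the contiguous fast path (a plain C++ analogue, not the Vulkan shader from #10254): when the tensors are contiguous, the per-element multi-dimensional index arithmetic of the generic copy can be dropped and each invocation can copy four consecutive elements:

```cpp
#include <cstddef>
#include <vector>

// Contiguous copy: no nd-index math, four elements per "invocation".
static void copy_contiguous(const float * src, float * dst, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {      // one invocation handles 4 consecutive elements
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    for (; i < n; ++i) {              // tail elements
        dst[i] = src[i];
    }
}

int main() {
    std::vector<float> src(1000, 1.0f), dst(1000, 0.0f);
    copy_contiguous(src.data(), dst.data(), src.size());
    return dst[999] == 1.0f ? 0 : 1;
}
```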