Releases · ggerganov/llama.cpp
b2661
server : coherent log output for KV cache full (#6637)
b2660
llama : add gguf_remove_key + remove split meta during quantize (#6591)

* Remove split metadata when quantizing model shards
* Find metadata key by enum
* Correct loop range for gguf_remove_key and code format
* Free kv memory

Co-authored-by: z5269887 <[email protected]>
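For context, gguf_remove_key deletes a key/value pair from a gguf_context, which is what lets quantize drop split metadata from merged shards. A minimal sketch of the round trip, assuming the gguf API from llama.cpp's bundled ggml.h; the "split.count" key is used for illustration:

```c
#include <stdio.h>
#include "ggml.h"

int main(void) {
    struct gguf_context * ctx = gguf_init_empty();

    // Write a split-metadata style key, then remove it again.
    gguf_set_val_u16(ctx, "split.count", 4);
    printf("before: %d\n", gguf_find_key(ctx, "split.count")); // key index (>= 0)

    gguf_remove_key(ctx, "split.count");
    printf("after:  %d\n", gguf_find_key(ctx, "split.count")); // -1 once removed

    gguf_free(ctx);
    return 0;
}
```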
b2658
imatrix : remove invalid assert (#6632)
b2657
Correct free memory and total memory. (#6630)

Co-authored-by: MasterYi <[email protected]>
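The commit title suggests a swapped free/total pair in a device memory query. As a generic illustration (not necessarily the exact code the fix touched), the CUDA runtime's cudaMemGetInfo writes the free byte count to its first out-parameter and the total to its second:

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    size_t free_bytes  = 0;
    size_t total_bytes = 0;

    // Correct argument order: free first, total second.
    // Swapping them compiles fine but reports total as free and vice versa.
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed\n");
        return 1;
    }

    printf("free:  %zu MiB\n", free_bytes  / (1024 * 1024));
    printf("total: %zu MiB\n", total_bytes / (1024 * 1024));
    return 0;
}
```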
b2656
eval-callback: use ggml_op_desc to pretty print unary operator name (…
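ggml_op_desc helps here because all unary operators (GELU, SILU, ...) share the single op code GGML_OP_UNARY, so ggml_op_name alone prints only "UNARY". A minimal sketch, assuming ggml.h from the llama.cpp tree; the scratch-buffer size is arbitrary:

```c
#include <stdio.h>
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024, // arbitrary scratch size
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    struct ggml_tensor * y = ggml_gelu(ctx, x); // a unary operator

    // ggml_op_name(y->op) would print just "UNARY";
    // ggml_op_desc resolves the underlying unary op name instead.
    printf("%s\n", ggml_op_desc(y)); // expected: GELU

    ggml_free(ctx);
    return 0;
}
```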
b2655
ci : disable Metal for macOS-latest-cmake-x64 (#6628)
b2646
minor layout improvements (#6572)

* minor layout improvements
* added missing file, run deps.sh locally
b2645
llama : add model types for mixtral (#6589)
b2636
llama : add Command R Plus support (#6491)

* Add Command R Plus GGUF
* Loading works up to LayerNorm2D
* Export new tensors in 1D so they are not quantized
* Fix embedding layer based on Noeda's example
* Whitespace
* Add line
* Fix unexpected tokens on MPS; re-add F16 fix (Noeda)
* dranger003: Fix block index overflow in CUDA dequantizing
* Reverted blocked multiplication code as it still has issues and could affect other Llama arches
* Export norms as f32
* Fix overflow issues during quant and other cleanup
* Type convention (Co-authored-by: Georgi Gerganov <[email protected]>)
* dranger003: Fix more int overflow during quant

Co-authored-by: S <[email protected]>
Co-authored-by: S <[email protected]>
Co-authored-by: slaren <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
b2632
quantize : fix precedence of cli args (#6541)