
fix(llama.cpp): embed metal file into result binary for darwin #4279

Merged 2 commits into master on Nov 28, 2024

Conversation

@mudler (Owner) commented Nov 27, 2024

Description

This PR fixes #4274

Notes for Reviewers
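For reviewers unfamiliar with the Metal packaging: without embedding, the Metal backend loads `ggml-metal.metal` from disk at runtime. This change builds with llama.cpp's embed option so the shader library is compiled into the resulting binary. A minimal configure sketch, assuming an upstream llama.cpp checkout (both flags are upstream CMake options; the build directory name is arbitrary):

```sh
# Sketch: build with the Metal shader library embedded in the binary,
# so nothing needs to locate ggml-metal.metal on disk at runtime.
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON
cmake --build build --config Release
```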

Signed commits

  • Yes, I signed my commits.


netlify bot commented Nov 27, 2024

Deploy Preview for localai ready!

🔨 Latest commit: 725eecd
🔍 Latest deploy log: https://app.netlify.com/sites/localai/deploys/6747be3d528683000898b449
😎 Deploy Preview: https://deploy-preview-4279--localai.netlify.app

@mintyleaf (Contributor) commented

@mudler

```
...
12:24AM DBG [llama-cpp-fallback] llama-cpp variant available
12:24AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp-fallback
12:24AM DBG GRPC Service for Phi2 will be running at: '127.0.0.1:64983'
12:24AM DBG GRPC Service state dir: /var/folders/d7/46zkm5yj39nbb6dp9dtrs_d00000gn/T/go-processmanager3344125854
12:24AM DBG GRPC Service Started
12:24AM DBG Wait for the service to start up
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stdout Server listening on 127.0.0.1:64983
12:24AM DBG GRPC Service Ready
12:24AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:phi-2.Q2_K ContextSize:512 Seed:2133683766 NBatch:512 F16Memory:false MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/Users/mintyleaf/Projects/work/LocalAI/models/phi-2.Q2_K Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 LoadFormat: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:false NoKVOffload:false ModelPath:/Users/mintyleaf/Projects/work/LocalAI/models LoraAdapters:[] LoraScales:[]}
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_load_model_from_file: using device Metal (Apple M3) - 5461 MiB free
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from /Users/mintyleaf/Projects/work/LocalAI/models/phi-2.Q2_K (version GGUF V3 (latest))
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv   0:                       general.architecture str              = phi2
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv   1:                               general.name str              = Phi2
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2560
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 10240
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv   5:                           phi2.block_count u32              = 32
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32              = 32
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv   8:          phi2.attention.layer_norm_epsilon f32              = 0.000010
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv   9:                  phi2.rope.dimension_count u32              = 32
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv  10:                          general.file_type u32              = 10
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 50256
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 50256
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 50256
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - kv  19:               general.quantization_version u32              = 2
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - type  f32:  195 tensors
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - type q2_K:   33 tensors
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - type q3_K:   96 tensors
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_model_loader: - type q6_K:    1 tensors
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_vocab: missing pre-tokenizer type, using: 'default'
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_vocab:                                             
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_vocab: ************************************        
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_vocab: CONSIDER REGENERATING THE MODEL             
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_vocab: ************************************        
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_vocab:                                             
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_vocab: special tokens cache size = 944
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_vocab: token to piece cache size = 0.3151 MB
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: format           = GGUF V3 (latest)
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: arch             = phi2
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: vocab type       = BPE
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: n_vocab          = 51200
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: n_merges         = 50000
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: vocab_only       = 0
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: n_ctx_train      = 2048
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: n_embd           = 2560
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: n_layer          = 32
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: n_head           = 32
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: n_head_kv        = 32
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: n_rot            = 32
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: n_swa            = 0
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: n_embd_head_k    = 80
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: n_embd_head_v    = 80
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: n_gqa            = 1
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: n_embd_k_gqa     = 2560
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: n_embd_v_gqa     = 2560
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: f_norm_eps       = 1.0e-05
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: f_clamp_kqv      = 0.0e+00
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: f_max_alibi_bias = 0.0e+00
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: f_logit_scale    = 0.0e+00
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: n_ff             = 10240
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: n_expert         = 0
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: n_expert_used    = 0
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: causal attn      = 1
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: pooling type     = 0
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: rope type        = 2
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: rope scaling     = linear
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: freq_base_train  = 10000.0
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: freq_scale_train = 1
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: n_ctx_orig_yarn  = 2048
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: rope_finetuned   = unknown
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: ssm_d_conv       = 0
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: ssm_d_inner      = 0
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: ssm_d_state      = 0
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: ssm_dt_rank      = 0
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: ssm_dt_b_c_rms   = 0
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: model type       = 3B
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: model ftype      = Q2_K - Medium
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: model params     = 2.78 B
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: model size       = 1.09 GiB (3.37 BPW) 
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: general.name     = Phi2
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: EOT token        = 50256 '<|endoftext|>'
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: LF token         = 128 'Ä'
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: EOG token        = 50256 '<|endoftext|>'
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_print_meta: max token length = 256
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_tensors: tensor 'token_embd.weight' (q2_K) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr ggml_backend_metal_log_allocated_size: allocated buffer, size =  1076.52 MiB, ( 1076.59 /  5461.34)
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_tensors: offloading 32 repeating layers to GPU
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_tensors: offloading output layer to GPU
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_tensors: offloaded 33/33 layers to GPU
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_tensors: Metal_Mapped model buffer size =  1076.51 MiB
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llm_load_tensors:   CPU_Mapped model buffer size =    41.02 MiB
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr .........................................................................................
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_new_context_with_model: n_seq_max     = 1
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_new_context_with_model: n_ctx         = 512
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_new_context_with_model: n_ctx_per_seq = 512
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_new_context_with_model: n_batch       = 512
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_new_context_with_model: n_ubatch      = 512
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_new_context_with_model: flash_attn    = 0
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_new_context_with_model: freq_base     = 10000.0
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_new_context_with_model: freq_scale    = 1
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_new_context_with_model: n_ctx_per_seq (512) < n_ctx_train (2048) -- the full capacity of the model will not be utilized
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr ggml_metal_init: allocating
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr ggml_metal_init: found device: Apple M3
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr ggml_metal_init: picking default device: Apple M3
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr ggml_metal_init: default.metallib not found, loading from source
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr ggml_metal_init: error: could not use bundle path to find ggml-metal.metal, falling back to trying cwd
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr ggml_metal_init: loading 'ggml-metal.metal'
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr ggml_metal_init: error: Error Domain=NSCocoaErrorDomain Code=260 "The file “ggml-metal.metal” couldn’t be opened because there is no such file." UserInfo={NSFilePath=ggml-metal.metal, NSURL=ggml-metal.metal -- file:///private/tmp/localai/backend_data/backend-assets/grpc/, NSUnderlyingError=0x600001becde0 {Error Domain=NSPOSIXErrorDomain Code=2 "No such file or directory"}}
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr ggml_backend_metal_device_init: error: failed to allocate context
12:24AM DBG GRPC(Phi2-127.0.0.1:64983): stderr llama_new_context_with_model: failed to initialize Metal backend
...
```
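The log shows the fallback chain the Metal backend walks when nothing is embedded: a precompiled `default.metallib`, the `GGML_METAL_PATH_RESOURCES` directory, the app bundle path, and finally `ggml-metal.metal` in the working directory. For a non-embedded build, pointing that environment variable at the directory holding the shader source is a possible stopgap; a sketch, with a hypothetical path:

```sh
# Stopgap sketch for a non-embedded build: GGML_METAL_PATH_RESOURCES is
# checked by ggml_metal_init (see the log above). The directory must
# contain ggml-metal.metal; the path below is hypothetical.
export GGML_METAL_PATH_RESOURCES="$HOME/llama.cpp/ggml/src/ggml-metal"
./local-ai  # or however the server is started
```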

@mintyleaf (Contributor) commented

...and if ggml-metal.metal is put back, we still get an error:

```
...
12:27AM DBG GRPC(Phi2-127.0.0.1:65077): stderr ggml_metal_init: allocating
12:27AM DBG GRPC(Phi2-127.0.0.1:65077): stderr ggml_metal_init: found device: Apple M3
12:27AM DBG GRPC(Phi2-127.0.0.1:65077): stderr ggml_metal_init: picking default device: Apple M3
12:27AM DBG GRPC(Phi2-127.0.0.1:65077): stderr ggml_metal_init: default.metallib not found, loading from source
12:27AM DBG GRPC(Phi2-127.0.0.1:65077): stderr ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
12:27AM DBG GRPC(Phi2-127.0.0.1:65077): stderr ggml_metal_init: loading '/private/tmp/localai/backend_data/backend-assets/grpc/ggml-metal.metal'
12:27AM DBG GRPC(Phi2-127.0.0.1:65077): stderr ggml_metal_init: error: Error Domain=MTLLibraryErrorDomain Code=3 "program_source:7:10: fatal error: '../ggml-common.h' file not found
12:27AM DBG GRPC(Phi2-127.0.0.1:65077): stderr #include "../ggml-common.h"
12:27AM DBG GRPC(Phi2-127.0.0.1:65077): stderr          ^~~~~~~~~~~~~~~~~~
12:27AM DBG GRPC(Phi2-127.0.0.1:65077): stderr " UserInfo={NSLocalizedDescription=program_source:7:10: fatal error: '../ggml-common.h' file not found
12:27AM DBG GRPC(Phi2-127.0.0.1:65077): stderr #include "../ggml-common.h"
12:27AM DBG GRPC(Phi2-127.0.0.1:65077): stderr          ^~~~~~~~~~~~~~~~~~
12:27AM DBG GRPC(Phi2-127.0.0.1:65077): stderr }
12:27AM DBG GRPC(Phi2-127.0.0.1:65077): stderr ggml_backend_metal_device_init: error: failed to allocate context
12:27AM DBG GRPC(Phi2-127.0.0.1:65077): stderr llama_new_context_with_model: failed to initialize Metal backend
12:27AM DBG GRPC(Phi2-127.0.0.1:65077): stderr common_init_from_params: failed to create context with model '/Users/mintyleaf/Projects/work/LocalAI/models/phi-2.Q2_K'
...
```
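This second failure is why simply dropping `ggml-metal.metal` next to the binary is not enough: the shader source is compiled at runtime and its includes resolve relative to the file, so `ggml-common.h` would also have to be staged one level above it. A sketch of the layout the relative include implies (paths taken from the log; the layout is an inference from the error):

```sh
# program_source line 7 is `#include "../ggml-common.h"`, so a non-embedded
# layout would presumably need the header one level above the shader:
#   /private/tmp/localai/backend_data/backend-assets/ggml-common.h
#   /private/tmp/localai/backend_data/backend-assets/grpc/ggml-metal.metal
head -n 7 /private/tmp/localai/backend_data/backend-assets/grpc/ggml-metal.metal
```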

@mintyleaf (Contributor) commented

@mudler

```makefile
CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=OFF
```

Removing this line fixes the whole thing.

@mudler (Owner, Author) commented Nov 27, 2024

> @mudler
>
> `CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=OFF`
>
> Removing this line fixes the whole thing.

argh right, good catch! I forgot we explicitly disabled it in the backend Makefile. Thanks for testing.
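In other words, the embed flag added by this PR was being overridden by a pre-existing line in the backend Makefile. A sketch of the follow-up change (the Makefile path is an assumption about the repository layout; verify before running):

```sh
# Sketch: drop the override that forced the embed off, so the darwin build
# keeps llama.cpp's GGML_METAL_EMBED_LIBRARY behavior from this PR.
# The file path is an assumption, not confirmed by the thread.
sed -i '' '/GGML_METAL_EMBED_LIBRARY=OFF/d' backend/cpp/llama/Makefile
```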

@mudler added the "bug (Something isn't working)" label on Nov 27, 2024
@dave-gray101 self-requested a review on November 27, 2024 at 22:04
@dave-gray101 (Collaborator) left a comment

Can confirm that this works and we're using the embedded metal lib.
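A quick way to confirm which path the loader took is to watch the `ggml_metal_init` lines at startup: the failing runs above print `default.metallib not found, loading from source`, which an embedded build should not. A sketch (the log file name is hypothetical):

```sh
# Check the Metal init lines from a captured debug log; an embedded build
# should not fall back to loading ggml-metal.metal from disk.
grep "ggml_metal_init" localai.log
```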

@dave-gray101 enabled auto-merge (squash) on November 27, 2024 at 22:15
@dave-gray101 merged commit cbedf2f into master on Nov 28, 2024
31 checks passed
@dave-gray101 deleted the fix/metal_embed branch on November 28, 2024 at 04:17
Labels: bug (Something isn't working)
Projects: None yet

Development: successfully merging this pull request may close these issues:

darwin arm64 regression trying "Example: Build on mac", llama-cpp ''../ggml-common.h' file not found'

3 participants