
Looking for help understanding llama-server /metrics #10325

Answered by Allan-Luu
Allan-Luu asked this question in Q&A

Thanks for the answer, @dspasyuk. It pointed me in the right direction!

I found the answer in the function update_slots.

These lines start processing the prompts from the slots within the server; the tokens handled here are counted as the initial prompt tokens:

if (slot.state == SLOT_STATE_STARTED) {
    // record when prompt processing begins; used later for timing metrics
    slot.t_start_process_prompt = ggml_time_us();
    slot.t_start_generation     = 0;

    slot.n_past          = 0;
    slot.n_prompt_tokens = prompt_tokens.size();
    slot.state           = SLOT_STATE_PROCESSING_PROMPT;

Then these lines check whether the prompt exceeds the context size (slot.n_ctx). If so, they truncate the input to fit wit…
