This repository has been archived by the owner on Jun 24, 2024. It is now read-only.

Why is the feed_prompt process so slow? #439

Open
zackshen opened this issue Nov 4, 2023 · 5 comments

zackshen commented Nov 4, 2023

llm is indeed a fantastic library and very easy to use. However, after using it for a few days, I noticed that feed_prompt is always very slow. It consumes a significant amount of CPU and doesn't use the GPU (the hardware acceleration documentation says that feed_prompt currently doesn't use GPU resources). As a result, if I add some context during the conversation, there is a long wait for feed_prompt to complete, which hurts the user experience. I used TheBloke/Llama-2-7B-Chat-GGML/llama-2-7b-chat.ggmlv3.q2_K.bin for testing.

Using the same model and prompt, I tested with llama.cpp, and its time to first token is very short. I'm not sure how the feed_prompt process differs between llm and llama.cpp. Judging by the CPU and GPU history, llama.cpp seems to be fully utilizing the GPU for inference.

Can you please help me identify what's wrong?

Model:

  1. TheBloke/Llama-2-7B-Chat-GGML/llama-2-7b-chat.ggmlv3.q2_K.bin

System:

  1. Apple 2020 M1 16GB
  2. macOS 13.6.1 (22G313)

llama.cpp command:

./main -m {{MODEL_PATH}}  -p "[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

[/INST]

[INST] What is the largest animal in the world ? [/INST]
"

llama.cpp Result:

llama_print_timings:        load time =     473.17 ms
llama_print_timings:      sample time =      49.00 ms /   144 runs   (    0.34 ms per token,  2938.90 tokens per second)
llama_print_timings: prompt eval time =    1460.21 ms /   155 tokens (    9.42 ms per token,   106.15 tokens per second)
llama_print_timings:        eval time =   11099.90 ms /   143 runs   (   77.62 ms per token,    12.88 tokens per second)
llama_print_timings:       total time =   12666.70 ms

llm sample code:

const DEFAULT_PROMPT: &'static str = r#"[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

[/INST]

[INST] What is the largest animal in the world ? [/INST]
"#;

    let model_path = PathBuf::from(MODEL_FILE);
    let model = llm::load_dynamic(
        Some(llm::ModelArchitecture::Llama),
        &model_path,
        llm::TokenizerSource::Embedded,
        llm::ModelParameters {
            prefer_mmap: true,
            use_gpu: true,
            ..Default::default()
        },
        llm::load_progress_callback_stdout,
    )
    .unwrap();

    let session_config = InferenceSessionConfig {
        n_batch: 512,
        ..Default::default()
    };
    let mut session = model.start_session(session_config);
    let mut rng = rand::thread_rng();
    let mut output_request = llm::OutputRequest::default();
    let sampler = Arc::new(Mutex::new(
        SamplerChain::<u32, f32>::new()
            + SampleTemperature::new(0.2)
            + SampleTopK::new(40, 40)
            + SampleTopP::new(0.95, 40)
            + SampleRandDistrib::new(),
    ));
    let params = llm::InferenceParameters { sampler };
    let ts = Instant::now();
    let mut first_token_time: Option<f32> = None;
    let ret = session
        .infer::<Infallible>(
            model.as_ref(),
            &mut rng,
            &llm::InferenceRequest {
                prompt: llm::Prompt::Text(DEFAULT_PROMPT),
                parameters: &params,
                play_back_previous_tokens: false,
                maximum_token_count: Some(1500),
            },
            &mut output_request,
            llm::conversation_inference_callback("[INST]", |t| {
                if first_token_time.is_none() {
                    first_token_time = Some(ts.elapsed().as_secs_f32());
                }
                print_token(t)
            }),
        )
        .unwrap();
    println!("{stats:#?}", stats = ret,);
    println!("first time to token: {first_token_time:?}");
    println!("token count {:?}", ret.prompt_tokens + ret.predict_tokens);
    println!(
        "prompt token speed {:?}/s",
        ret.prompt_tokens as f32 / ret.feed_prompt_duration.as_secs_f32()
    );
    println!(
        "predict token speed {:?}/s",
        ret.predict_tokens as f32 / ret.predict_duration.as_secs_f32()
    );
    println!(
        "summary speed {:?}/s",
        (ret.predict_tokens + ret.prompt_tokens) as f32
            / (ret.predict_duration.as_secs_f32() + ret.feed_prompt_duration.as_secs_f32())
    );

llm sample code result:

InferenceStats {
    feed_prompt_duration: 10.74704s,
    prompt_tokens: 155,
    predict_duration: 28.863045s,
    predict_tokens: 397,
}
first time to token: Some(11.22408)
token count 552
prompt token speed 14.422576/s
predict token speed 13.754613/s
summary speed 13.935845/s
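
To put the two runs side by side, the prompt throughput can be recomputed from the figures reported above. This is a minimal sketch that uses only the numbers from the two result listings; the helper name `tokens_per_second` is made up for illustration and is not part of llm or llama.cpp.

```rust
/// Throughput in tokens per second for `tokens` processed in `seconds`.
/// Hypothetical helper, not part of llm or llama.cpp.
fn tokens_per_second(tokens: u32, seconds: f64) -> f64 {
    f64::from(tokens) / seconds
}

fn main() {
    // Figures copied from the two result listings above.
    let prompt_tokens = 155;
    let llama_cpp_secs = 1.46021; // llama.cpp "prompt eval time": 1460.21 ms
    let llm_secs = 10.74704; // llm "feed_prompt_duration": 10.74704s

    let llama_cpp_rate = tokens_per_second(prompt_tokens, llama_cpp_secs);
    let llm_rate = tokens_per_second(prompt_tokens, llm_secs);

    println!("llama.cpp prompt eval: {llama_cpp_rate:.2} tokens/s");
    println!("llm feed_prompt:       {llm_rate:.2} tokens/s");
    println!("slowdown:              {:.1}x", llama_cpp_rate / llm_rate);
}
```

With these inputs the prompt phase in llm comes out roughly seven times slower than in llama.cpp, while the generation speeds reported above (12.88 vs. 13.75 tokens per second) are close. The gap is therefore confined to prompt processing.
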
@philpax
Collaborator

philpax commented Nov 4, 2023

Hey there! Thanks for reporting this and providing lots of detail :)

The issue here is that the version of GGML we use doesn’t support a specific operation required for feeding more than one token at a time with Metal (i.e. this works fine with CUDA, not Metal). See also #403.
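
To illustrate why feeding more than one token at a time matters, the sketch below counts how many model evaluations a prompt needs for a given batch size. It shows only the arithmetic of batching and is not how llm or GGML is implemented; `eval_calls` and the placeholder token ids are assumptions made for this example.

```rust
/// Number of model evaluations needed to feed `prompt_len` tokens when
/// at most `n_batch` tokens can be evaluated per call.
/// Hypothetical helper for illustration; not an llm or GGML API.
fn eval_calls(prompt_len: usize, n_batch: usize) -> usize {
    assert!(n_batch > 0, "n_batch must be at least 1");
    // Ceiling division: a trailing partial batch still costs one call.
    (prompt_len + n_batch - 1) / n_batch
}

fn main() {
    // Placeholder token ids standing in for the 155-token prompt above.
    let prompt: Vec<u32> = (0..155).collect();

    for n_batch in [1, 8, 512] {
        // Splitting the prompt into chunks gives the same count.
        let batches = prompt.chunks(n_batch).count();
        assert_eq!(batches, eval_calls(prompt.len(), n_batch));
        println!("n_batch = {n_batch:>3}: {batches} evaluation call(s)");
    }
}
```

With `n_batch: 512`, as in the sample code above, a 155-token prompt fits in a single batch, so prompt speed depends heavily on whether the backend can evaluate that batch efficiently.
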

This has been fixed in upstream GGML/llama.cpp, but we haven’t integrated that fix yet. The work has started in #428 and that should hopefully be finished within the next week (I’m out of town but I hope to get back to it soon).

Hope that helps clarify the state of affairs!

@zackshen
Author

zackshen commented Nov 4, 2023

I'm very happy to hear this news and looking forward to the merged version. Thank you for your work.

Can I wait until after the release to close this issue?

@zackshen
Author

Hello @philpax, has there been any recent movement on this?

@philpax philpax mentioned this issue Nov 12, 2023
17 tasks
@philpax
Collaborator

philpax commented Nov 12, 2023

I started working on it, but realised that it would end up being quite a large task. Still working on it, but it'll take some time.

@zackshen
Author

thanks
