Use whole history in case of undetermined tokenization of sequence #1254

Open · wants to merge 1 commit into master
Conversation

@sbalandi (Contributor) commented Nov 26, 2024

Task: CVS-157295

  • Removed switching between EncodedInput and StringInput modes.
  • If the history is found to contain ambiguous characters (i.e. the tokens the model produced differ from the decoded and re-encoded tokens), start using the entire history for the chat; the sketch below illustrates the round-trip check.
  • Collect the prompt and model output for EncodedInput, and collect the output for StringInput, to check the history; but rewrite the history after tokenizing, before inference, so the answer stays templated.
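A minimal sketch of the idea behind the second bullet (illustrative only, not the PR code; the Tokenizer interface and helper name are hypothetical): compare the tokens the model actually produced with the result of decoding them and encoding the text again.

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical tokenizer interface, used only for illustration.
struct Tokenizer {
    virtual ~Tokenizer() = default;
    virtual std::string decode(const std::vector<int64_t>& tokens) const = 0;
    virtual std::vector<int64_t> encode(const std::string& text) const = 0;
};

bool can_trust_encoded_history(const std::vector<int64_t>& model_output_tokens,
                               const Tokenizer& tokenizer) {
    const std::string text = tokenizer.decode(model_output_tokens);
    const std::vector<int64_t> reencoded = tokenizer.encode(text);
    // If decode+encode is not the identity for this sequence, subtracting the
    // old history from the new one is unreliable, so the whole history must be
    // re-tokenized on the next turn.
    return reencoded == model_output_tokens;
}

If this check fails, the pipeline stops relying on the incrementally built token history and uses the whole chat history for subsequent turns.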

@github-actions bot added the category: visual language, category: LLM, and no-match-files labels on Nov 26, 2024
@sbalandi force-pushed the tok_hist branch 2 times, most recently from 49a4c9b to 05b3302 on November 26, 2024 at 13:52
@sbalandi marked this pull request as ready for review on November 26, 2024 at 13:52
@ilya-lavrenov added this to the 2025.0 milestone on Nov 26, 2024
@ilya-lavrenov added the port to LTS label (PR needs to be ported to LTS) on Nov 26, 2024
@ilya-lavrenov self-assigned this on Nov 26, 2024

m_tokenized_chat_history.clear();
std::copy(new_chat_tokens.input_ids.data<int64_t>(), new_chat_tokens.input_ids.data<int64_t>() + new_chat_tokens.input_ids.get_size(),
std::back_inserter(m_tokenized_chat_history));
Collaborator:
Once you override m_tokenized_chat_history with the newly tokenized history, you can trust it again by setting m_trust_encoded_history to true in generate(EncodedInputs).

Contributor:

It looks like generate with encoded inputs can always reset m_trust_encoded_history to true.

Contributor (author):

Added a reset of m_trust_encoded_history to true.

new_atten_mask.data<int64_t>() + kv_cache_len);
concatenated_attention_mask = new_atten_mask;
if (is_chat_conversation && m_history_available) {
if (!m_trust_encoded_history) {
Collaborator:

Flip the if to check the non-negated m_trust_encoded_history. You have an else branch anyway, and removing the extra ! simplifies reading.
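For illustration (hypothetical helper, not the PR code), the flipped condition reads like this:

#include <cstdint>
#include <vector>

// Sketch of the suggested flip: handle the trusted case first.
std::vector<int64_t> select_input_tokens(bool trust_encoded_history,
                                         const std::vector<int64_t>& new_tokens,
                                         const std::vector<int64_t>& tokenized_chat_history) {
    if (trust_encoded_history) {
        // History round-trips cleanly: pass only the newly provided tokens.
        return new_tokens;
    } else {
        // Otherwise fall back to the whole tokenized chat history.
        return tokenized_chat_history;
    }
}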

Contributor (author):

done

Comment on lines +279 to +283
break;
i++;
}

return i == tokenized_history.size() - 1 || i == tokenized_history.size() - 2;
Collaborator:

Why is the i == tokenized_history.size() - 2 condition treated as the same history? The last tokens differ in that case.

Contributor (author):

encoded_history = previous templated encoded prompts & answers + the last answer, not templated but decoded and re-encoded.
tokenized_history = previous templated encoded prompts & answers + the raw model output.
During decoding, the EOS token (or some stop token) is stripped, or nothing is stripped.
So either the results are the same, or encoded_history is shorter by one token.

A comment has been added.
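To restate that explanation as code (illustrative only; the PR's actual loop differs and the function name is hypothetical): the two histories are considered equivalent when they match token for token, except possibly for one trailing EOS/stop token that decoding strips.

#include <algorithm>
#include <cstdint>
#include <vector>

bool histories_match(const std::vector<int64_t>& encoded_history,
                     const std::vector<int64_t>& tokenized_history) {
    // encoded_history may be at most one token shorter than tokenized_history
    // (the stripped EOS/stop token), and never longer.
    if (encoded_history.size() + 1 < tokenized_history.size() ||
        encoded_history.size() > tokenized_history.size())
        return false;
    // Every token present in both sequences must be identical.
    return std::equal(encoded_history.begin(), encoded_history.end(),
                      tokenized_history.begin());
}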

Collaborator:

The urgency is about reaching 24.6. But this PR targets master. If it's possible to retarget this PR, do that. If it's not possible, let's finish the review here and then port to 24.6.

Contributor (author):

Moved to #1268.


m_history.clear();
m_templated_chat_history.clear();
m_tokenized_chat_history.clear();
Collaborator:
Keep the order of variable modifications aligned between start_chat() and finish_chat(). Currently start_chat() clears m_tokenized_chat_history after m_trust_encoded_history, while finish_chat() clears m_tokenized_chat_history at the end.

Contributor (author):

updated

Comment on lines +132 to +133
m_templated_chat_history
);
Collaborator:

Suggested change
m_templated_chat_history
);
m_templated_chat_history
);

size_t history_size = m_language.get_tensor("attention_mask").get_shape().at(1);
size_t inputs_embeds_size = inputs_embeds.get_shape().at(1);
// inputs_embeds contains whole history
if (inputs_embeds.get_shape().at(1) == tokenized_chat_history.size()) {
Collaborator:

Suggested change
- if (inputs_embeds.get_shape().at(1) == tokenized_chat_history.size()) {
+ if (inputs_embeds_size == tokenized_chat_history.size()) {

Contributor (author):

updated

Collaborator:

Not for this PR, but we need tests for this change. I remember you said a random-weights miniCPM diverges. This PR probably fixes that, so your case can be used as a test for VLM. LLM needs its own test as well; maybe you can provoke a random-weights model into generating such tokens.

@@ -191,6 +224,9 @@ class StatefulLLMPipeline final : public LLMPipelineImplBase {
attention_mask = data->attention_mask;
}

if (is_chat_conversation && m_chat_input_type == ov::genai::utils::GenerationChatInputsType::ENCODED_INPUTS)
std::copy(input_ids.data<int64_t>(), input_ids.data<int64_t>() + input_ids.get_size(), std::back_inserter(m_tokenized_chat_history));
Contributor:

Should this be added unconditionally, without checking m_chat_input_type? In the current method we only have EncodedInputs& inputs.

Contributor (author):

This function is also called in STRING mode (in ENCODED_INPUTS mode it is called directly by the user; in STRING mode it is called from generation with StringInputs here: https://github.com/openvinotoolkit/openvino.genai/blob/releases/2024/5/src/cpp/src/llm_pipeline.cpp#L136), so no, we do need to check which type of generation it is.
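A simplified sketch of the call flow described above (hypothetical names and structure, not the actual pipeline code): the string overload tokenizes and then delegates to the encoded overload, so the encoded overload alone cannot tell which mode the chat started in and has to check m_chat_input_type.

#include <cstdint>
#include <string>
#include <vector>

enum class GenerationChatInputsType { UNDEF, STRING_INPUTS, ENCODED_INPUTS };

struct PipelineSketch {
    GenerationChatInputsType m_chat_input_type = GenerationChatInputsType::UNDEF;
    std::vector<int64_t> m_tokenized_chat_history;

    void generate(const std::string& prompt) {               // StringInputs path
        if (m_chat_input_type == GenerationChatInputsType::UNDEF)
            m_chat_input_type = GenerationChatInputsType::STRING_INPUTS;
        generate(tokenize(prompt));                           // delegates to the encoded path
    }

    void generate(const std::vector<int64_t>& input_ids) {   // EncodedInputs path
        if (m_chat_input_type == GenerationChatInputsType::UNDEF)
            m_chat_input_type = GenerationChatInputsType::ENCODED_INPUTS;
        // Extend the tokenized history here only when the user drives the chat
        // with encoded inputs; in STRING mode the string path maintains it.
        if (m_chat_input_type == GenerationChatInputsType::ENCODED_INPUTS)
            m_tokenized_chat_history.insert(m_tokenized_chat_history.end(),
                                            input_ids.begin(), input_ids.end());
        // ... run inference ...
    }

    std::vector<int64_t> tokenize(const std::string&) const { return {}; }
};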


m_tokenized_chat_history.clear();
std::copy(new_chat_tokens.input_ids.data<int64_t>(), new_chat_tokens.input_ids.data<int64_t>() + new_chat_tokens.input_ids.get_size(),
std::back_inserter(m_tokenized_chat_history));
Contributor:

It looks like generate with encoded inputs can always reset m_trust_encoded_history to true.

// If we encounter a sequence with such a combination of symbols, we cannot correctly subtract
// the new history from the old history to find the difference to use as a prompt,
// so check for that and use the whole history in this case
if (m_history_available && m_trust_encoded_history)
Contributor:

It looks like m_history_available is redundant; we can use !m_tokenized_chat_history.empty() instead.

Contributor (author):

done

}
m_templated_chat_history = new_templated_chat_history;
m_tokenized_chat_history = new_chat_tokens;

m_tokenized_chat_history.clear();
Contributor:

Suggested change
- m_tokenized_chat_history.clear();
+ m_tokenized_chat_history.reserve(new_chat_tokens.input_ids.get_size());

@sbalandi (Contributor, author), Nov 27, 2024:

With reserve and without clear, the copy will append elements rather than rewrite them. I have added reserve, but I'm not sure I understand the comment correctly.
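A small self-contained illustration of that point (not the PR code): reserve() only pre-allocates capacity, so a back_inserter copy without clear() appends to the existing contents.

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

int main() {
    std::vector<int64_t> history = {10, 11};
    std::vector<int64_t> new_tokens = {1, 2, 3};

    history.reserve(history.size() + new_tokens.size());  // capacity hint only, contents untouched
    std::copy(new_tokens.begin(), new_tokens.end(), std::back_inserter(history));
    // history is now {10, 11, 1, 2, 3}: appended, not rewritten.

    history.clear();                                       // clearing first gives a true rewrite
    std::copy(new_tokens.begin(), new_tokens.end(), std::back_inserter(history));
    // history is now {1, 2, 3}.
    return 0;
}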

m_tokenized_chat_history = new_chat_tokens;

m_tokenized_chat_history.clear();
std::copy(new_chat_tokens.input_ids.data<int64_t>(), new_chat_tokens.input_ids.data<int64_t>() + new_chat_tokens.input_ids.get_size(),
Contributor:

Suggested change
- std::copy(new_chat_tokens.input_ids.data<int64_t>(), new_chat_tokens.input_ids.data<int64_t>() + new_chat_tokens.input_ids.get_size(),
+ std::copy_n(new_chat_tokens.input_ids.data<int64_t>(), new_chat_tokens.input_ids.get_size(),

Contributor (author):

changed
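For reference (illustrative only, with simplified names), both forms append the same tokens; std::copy_n just takes a count instead of a computed end pointer.

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

int main() {
    std::vector<int64_t> new_tokens = {1, 2, 3, 4};
    std::vector<int64_t> history_a, history_b;

    // std::copy with an explicit end pointer.
    std::copy(new_tokens.data(), new_tokens.data() + new_tokens.size(),
              std::back_inserter(history_a));

    // std::copy_n with a count; avoids repeating the begin expression.
    std::copy_n(new_tokens.data(), new_tokens.size(), std::back_inserter(history_b));

    return history_a == history_b ? 0 : 1;  // both hold {1, 2, 3, 4}
}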

}

if(m_adapter_controller) {
m_adapter_controller->apply(m_model_runner, config.adapters);
}

auto input_tokens = input_ids;
if (is_chat_conversation && !m_trust_encoded_history) {
input_tokens = tokenized_chat_history;
Contributor:

It looks like we can also set m_trust_encoded_history = true; here.

@sbalandi (Contributor, author) commented Nov 27, 2024

Comments have been addressed in #1268.
