Use whole history in case of undetermined tokenization of sequence #1254

Open · wants to merge 1 commit into master
Conversation

@sbalandi (Contributor) commented Nov 26, 2024

Task: CVS-157295

  • Removed switching between EncodedInput and StringInput modes.
  • If the history is found to contain ambiguous characters (i.e. the tokens the model produced differ from the decoded and re-encoded tokens), start using the entire history for the chat; the sketch below illustrates the round-trip check.
  • Collect the prompt and model output for EncodedInput, and collect the output for StringInput, to check the history; but rewrite the history after tokenizing, before inference, so the answer stays templated.
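A minimal sketch of the idea behind the second bullet (illustrative only, not the PR code; the Tokenizer interface and helper name are hypothetical): compare the tokens the model actually produced with the result of decoding them and encoding the text again.

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical tokenizer interface, used only for illustration.
struct Tokenizer {
    virtual ~Tokenizer() = default;
    virtual std::string decode(const std::vector<int64_t>& tokens) const = 0;
    virtual std::vector<int64_t> encode(const std::string& text) const = 0;
};

bool can_trust_encoded_history(const std::vector<int64_t>& model_output_tokens,
                               const Tokenizer& tokenizer) {
    const std::string text = tokenizer.decode(model_output_tokens);
    const std::vector<int64_t> reencoded = tokenizer.encode(text);
    // If decode+encode is not the identity for this sequence, subtracting the
    // old history from the new one is unreliable, so the whole history must be
    // re-tokenized on the next turn.
    return reencoded == model_output_tokens;
}

If this check fails, the pipeline stops relying on the incrementally built token history and uses the whole chat history for subsequent turns.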

@github-actions bot added the category: visual language, category: LLM, and no-match-files labels on Nov 26, 2024
@sbalandi force-pushed the tok_hist branch 2 times, most recently from 49a4c9b to 05b3302 on November 26, 2024 at 13:52
@sbalandi marked this pull request as ready for review on November 26, 2024 at 13:52
@ilya-lavrenov added this to the 2025.0 milestone on Nov 26, 2024
@ilya-lavrenov added the port to LTS label (PR needs to be ported to LTS) on Nov 26, 2024
@ilya-lavrenov self-assigned this on Nov 26, 2024

m_tokenized_chat_history.clear();
std::copy(new_chat_tokens.input_ids.data<int64_t>(), new_chat_tokens.input_ids.data<int64_t>() + new_chat_tokens.input_ids.get_size(),
std::back_inserter(m_tokenized_chat_history));
Collaborator:
Once you override m_tokenized_chat_history with the newly tokenized history, you can trust it again by setting m_trust_encoded_history to true in generate(EncodedInputs).

Contributor:

It looks like generate with encoded inputs can always reset m_trust_encoded_history to true.

Contributor (author):

Added a reset of m_trust_encoded_history to true.

new_atten_mask.data<int64_t>() + kv_cache_len);
concatenated_attention_mask = new_atten_mask;
if (is_chat_conversation && m_history_available) {
if (!m_trust_encoded_history) {
Collaborator:

Flip the if to check the non-negated m_trust_encoded_history. You have an else branch anyway, and removing the extra ! simplifies reading.
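For illustration (hypothetical helper, not the PR code), the flipped condition reads like this:

#include <cstdint>
#include <vector>

// Sketch of the suggested flip: handle the trusted case first.
std::vector<int64_t> select_input_tokens(bool trust_encoded_history,
                                         const std::vector<int64_t>& new_tokens,
                                         const std::vector<int64_t>& tokenized_chat_history) {
    if (trust_encoded_history) {
        // History round-trips cleanly: pass only the newly provided tokens.
        return new_tokens;
    } else {
        // Otherwise fall back to the whole tokenized chat history.
        return tokenized_chat_history;
    }
}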

Contributor (author):

done

Comment on lines +279 to +283
break;
i++;
}

return i == tokenized_history.size() - 1 || i == tokenized_history.size() - 2;
Collaborator:

Why is the i == tokenized_history.size() - 2 condition treated as the same history? The last tokens differ in that case.

Contributor (author):

encoded_history = previous templated encoded prompts & answers + the last answer, not templated but decoded and re-encoded.
tokenized_history = previous templated encoded prompts & answers + the raw model output.
During decoding, the EOS token (or some stop token) is stripped, or nothing is stripped.
So either the results are the same, or encoded_history is shorter by one token.

A comment has been added.
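To restate that explanation as code (illustrative only; the PR's actual loop differs and the function name is hypothetical): the two histories are considered equivalent when they match token for token, except possibly for one trailing EOS/stop token that decoding strips.

#include <algorithm>
#include <cstdint>
#include <vector>

bool histories_match(const std::vector<int64_t>& encoded_history,
                     const std::vector<int64_t>& tokenized_history) {
    // encoded_history may be at most one token shorter than tokenized_history
    // (the stripped EOS/stop token), and never longer.
    if (encoded_history.size() + 1 < tokenized_history.size() ||
        encoded_history.size() > tokenized_history.size())
        return false;
    // Every token present in both sequences must be identical.
    return std::equal(encoded_history.begin(), encoded_history.end(),
                      tokenized_history.begin());
}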

Collaborator:

The urgency is about reaching 24.6. But this PR targets master. If it's possible to retarget this PR, do that. If it's not possible, let's finish the review here and then port to 24.6.

Contributor (author):

Moved to #1268.


m_history.clear();
m_templated_chat_history.clear();
m_tokenized_chat_history.clear();
Collaborator:
Keep the order of variable modifications aligned between start_chat() and finish_chat(). Currently start_chat() clears m_tokenized_chat_history after m_trust_encoded_history, while finish_chat() clears m_tokenized_chat_history at the end.

Contributor (author):

updated

Comment on lines +132 to +133
m_templated_chat_history
);
Collaborator:

Suggested change
m_templated_chat_history
);
m_templated_chat_history
);

size_t history_size = m_language.get_tensor("attention_mask").get_shape().at(1);
size_t inputs_embeds_size = inputs_embeds.get_shape().at(1);
// inputs_embeds contains whole history
if (inputs_embeds.get_shape().at(1) == tokenized_chat_history.size()) {
Collaborator:

Suggested change
- if (inputs_embeds.get_shape().at(1) == tokenized_chat_history.size()) {
+ if (inputs_embeds_size == tokenized_chat_history.size()) {

Contributor (author):

updated

Collaborator:

Not for this PR, but we need tests for this change. I remember you said a random-weights miniCPM diverges. This PR probably fixes that, so your case can be used as a test for VLM. LLM needs its own test as well; maybe you can provoke a random-weights model into generating such tokens.

@@ -191,6 +224,9 @@ class StatefulLLMPipeline final : public LLMPipelineImplBase {
attention_mask = data->attention_mask;
}

if (is_chat_conversation && m_chat_input_type == ov::genai::utils::GenerationChatInputsType::ENCODED_INPUTS)
std::copy(input_ids.data<int64_t>(), input_ids.data<int64_t>() + input_ids.get_size(), std::back_inserter(m_tokenized_chat_history));
Contributor:

Should this be added unconditionally, without checking m_chat_input_type? In the current method we only have EncodedInputs& inputs.

Contributor (author):

This function is also called in STRING mode (in ENCODED_INPUTS mode it is called directly by the user; in STRING mode it is called from generation with StringInputs here: https://github.com/openvinotoolkit/openvino.genai/blob/releases/2024/5/src/cpp/src/llm_pipeline.cpp#L136), so no, we do need to check which type of generation it is.
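A simplified sketch of the call flow described above (hypothetical names and structure, not the actual pipeline code): the string overload tokenizes and then delegates to the encoded overload, so the encoded overload alone cannot tell which mode the chat started in and has to check m_chat_input_type.

#include <cstdint>
#include <string>
#include <vector>

enum class GenerationChatInputsType { UNDEF, STRING_INPUTS, ENCODED_INPUTS };

struct PipelineSketch {
    GenerationChatInputsType m_chat_input_type = GenerationChatInputsType::UNDEF;
    std::vector<int64_t> m_tokenized_chat_history;

    void generate(const std::string& prompt) {               // StringInputs path
        if (m_chat_input_type == GenerationChatInputsType::UNDEF)
            m_chat_input_type = GenerationChatInputsType::STRING_INPUTS;
        generate(tokenize(prompt));                           // delegates to the encoded path
    }

    void generate(const std::vector<int64_t>& input_ids) {   // EncodedInputs path
        if (m_chat_input_type == GenerationChatInputsType::UNDEF)
            m_chat_input_type = GenerationChatInputsType::ENCODED_INPUTS;
        // Extend the tokenized history here only when the user drives the chat
        // with encoded inputs; in STRING mode the string path maintains it.
        if (m_chat_input_type == GenerationChatInputsType::ENCODED_INPUTS)
            m_tokenized_chat_history.insert(m_tokenized_chat_history.end(),
                                            input_ids.begin(), input_ids.end());
        // ... run inference ...
    }

    std::vector<int64_t> tokenize(const std::string&) const { return {}; }
};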


m_tokenized_chat_history.clear();
std::copy(new_chat_tokens.input_ids.data<int64_t>(), new_chat_tokens.input_ids.data<int64_t>() + new_chat_tokens.input_ids.get_size(),
std::back_inserter(m_tokenized_chat_history));
Contributor:

It looks like generate with encoded inputs can always reset m_trust_encoded_history to true.

// If we encounter a sequence with such a combination of symbols, we cannot correctly subtract
// the new history from the old history to find the difference to use as a prompt,
// so check for that and use the whole history in this case
if (m_history_available && m_trust_encoded_history)
Contributor:

It looks like m_history_available is redundant; we can use !m_tokenized_chat_history.empty() instead.

Contributor (author):

done

}
m_templated_chat_history = new_templated_chat_history;
m_tokenized_chat_history = new_chat_tokens;

m_tokenized_chat_history.clear();
Contributor:

Suggested change
- m_tokenized_chat_history.clear();
+ m_tokenized_chat_history.reserve(new_chat_tokens.input_ids.get_size());

@sbalandi (Contributor, author), Nov 27, 2024:

With reserve and without clear, the copy will append elements rather than rewrite them. I have added reserve, but I'm not sure I understand the comment correctly.
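A small self-contained illustration of that point (not the PR code): reserve() only pre-allocates capacity, so a back_inserter copy without clear() appends to the existing contents.

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

int main() {
    std::vector<int64_t> history = {10, 11};
    std::vector<int64_t> new_tokens = {1, 2, 3};

    history.reserve(history.size() + new_tokens.size());  // capacity hint only, contents untouched
    std::copy(new_tokens.begin(), new_tokens.end(), std::back_inserter(history));
    // history is now {10, 11, 1, 2, 3}: appended, not rewritten.

    history.clear();                                       // clearing first gives a true rewrite
    std::copy(new_tokens.begin(), new_tokens.end(), std::back_inserter(history));
    // history is now {1, 2, 3}.
    return 0;
}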

m_tokenized_chat_history = new_chat_tokens;

m_tokenized_chat_history.clear();
std::copy(new_chat_tokens.input_ids.data<int64_t>(), new_chat_tokens.input_ids.data<int64_t>() + new_chat_tokens.input_ids.get_size(),
Contributor:

Suggested change
- std::copy(new_chat_tokens.input_ids.data<int64_t>(), new_chat_tokens.input_ids.data<int64_t>() + new_chat_tokens.input_ids.get_size(),
+ std::copy_n(new_chat_tokens.input_ids.data<int64_t>(), new_chat_tokens.input_ids.get_size(),

Contributor (author):

changed
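For reference (illustrative only, with simplified names), both forms append the same tokens; std::copy_n just takes a count instead of a computed end pointer.

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

int main() {
    std::vector<int64_t> new_tokens = {1, 2, 3, 4};
    std::vector<int64_t> history_a, history_b;

    // std::copy with an explicit end pointer.
    std::copy(new_tokens.data(), new_tokens.data() + new_tokens.size(),
              std::back_inserter(history_a));

    // std::copy_n with a count; avoids repeating the begin expression.
    std::copy_n(new_tokens.data(), new_tokens.size(), std::back_inserter(history_b));

    return history_a == history_b ? 0 : 1;  // both hold {1, 2, 3, 4}
}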

}

if(m_adapter_controller) {
m_adapter_controller->apply(m_model_runner, config.adapters);
}

auto input_tokens = input_ids;
if (is_chat_conversation && !m_trust_encoded_history) {
input_tokens = tokenized_chat_history;
Contributor:

It looks like we can also set m_trust_encoded_history = true; here.

@sbalandi (Contributor, author) commented Nov 27, 2024

Comments have been addressed in #1268.
