advancing to scalars gen2 #374
Conversation
Looks great!
A few comments inline.
    []
)  # one scalar for every element, `scalar_default_unfound_value` is used for elements that aren't scalars
all_scalars_valid_mask = []  # for each element, whether it's a scalar or not
scalar_default_unfound_value = -1000.0
What does it mean?
Since we now keep the scalar values and the mask at the size of the entire sequence, this is the default value for positions that don't actually hold a scalar value.
I chose -1000.0 rather than something like 0.0 so that it pops up easily if there are mistakes down the road.
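For illustration, a minimal sketch of the idea described above (names, positions, and shapes are assumptions, not the PR's actual code): scalar values and a validity mask are kept at full-sequence length, with the sentinel default filling the non-scalar positions.

```python
import torch

seq_len = 8
scalar_default_unfound_value = -1000.0  # sentinel that stands out if misused downstream

# hypothetical positions and values of the real scalars within the sequence
scalar_positions = torch.tensor([2, 3, 6])
scalar_values = torch.tensor([0.5, 1.7, -0.3])

# full-sequence-length containers, defaulted to "no scalar here"
all_scalars_values = torch.full((seq_len,), scalar_default_unfound_value)
all_scalars_valid_mask = torch.zeros(seq_len, dtype=torch.bool)

all_scalars_values[scalar_positions] = scalar_values
all_scalars_valid_mask[scalar_positions] = True
# non-scalar positions keep -1000.0 and a False mask entry
```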
@@ -47,7 +45,7 @@ class InjectorToModularTokenizerLib:
     @staticmethod
     def build_placeholder_meta_tokenization(
         *,
-        sequence: Union[str, list, tuple],
+        sequence: str,
Why did you change it if you still support those types?
I just copied the code over, so you probably already fixed this and it got overridden.
Fixed now to contain the Union.
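For context, a hedged sketch of what the restored Union signature implies (the function body and normalization behavior here are assumptions for illustration, not copied from the PR):

```python
from typing import Union

def build_placeholder_meta_tokenization(*, sequence: Union[str, list, tuple]) -> str:
    # if the caller passed a pre-split sequence, join it back into one string
    if isinstance(sequence, (list, tuple)):
        sequence = "".join(str(part) for part in sequence)
    # ... placeholder construction would continue from here ...
    return sequence
```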
@@ -91,19 +88,18 @@ def build_placeholder_meta_tokenization(
     if tokenizer_type.startswith("SCALARS_"):
         with_placeholders.append(
             "<@TOKENIZER-TYPE=AA>"
The general solution can't assume there is an AA sub-tokenizer.
Maybe we need a default empty sub-tokenizer? Maybe SCALARS can be an empty sub-tokenizer?
Interesting point.
SCALARS is currently fully programmatic and does not rely on any dictionary, so I would rather not mix the two.
It's probably better to have a "base" sub-tokenizer that gets generated and supported automatically, since the modular tokenizer already knows how to handle special tokens.
Maybe "Base" or "SpecialTokensBase" or something similar.
-if (
-    tokenizer_type == "SCALARS_LITERALS"
-):  # note: masking is only supported in literals (not in "from dict")
+if tokenizer_type == "SCALARS_LITERALS":
Remind me, can we put a mask in scalar literals?
I stopped supporting this option intentionally.
Scalars now support only scalar values.
scalars_masked_indices = []
prev_index_end = -1
## both `all_scalars_values` and `all_scalars_valid_mask` will contain torch tensors, which will be concatenated at the end of this function
all_scalars_values = (
all_scalar_values = [] ?
I think the static code formatter auto-"fixed" this due to the comment length
(I didn't write it like this).
I moved the comments to one line above the code, so it won't do that.
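The accumulate-then-concatenate pattern the code comment describes, as a standalone sketch (the spans and shapes are illustrative only):

```python
import torch

all_scalars_values = []      # per-span tensors collected here
all_scalars_valid_mask = []

for span in [torch.tensor([1.0, 2.0]), torch.tensor([3.0])]:
    all_scalars_values.append(span)
    all_scalars_valid_mask.append(torch.ones_like(span, dtype=torch.bool))

# a single concatenation at the end of the function, not one per iteration
all_scalars_values = torch.cat(all_scalars_values)
all_scalars_valid_mask = torch.cat(all_scalars_valid_mask)
```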
        for x in values
    ]
)
# validate that all values can be converted to fload
typo: float
thanks!
fixed
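A possible shape for that validation step, sketched under the assumption that conversion failures should raise early with a clear message (the actual implementation may differ):

```python
def validate_floats(values: list) -> list:
    floats = []
    for x in values:
        try:
            floats.append(float(x))
        except (TypeError, ValueError) as err:
            raise ValueError(f"scalar literal {x!r} cannot be converted to float") from err
    return floats

print(validate_floats(["1.5", "2", "-3.25"]))  # [1.5, 2.0, -3.25]
```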
# pad if needed
full_query_len = len(token_ids)
if full_query_len > all_scalars_values.shape[0]:
Can't it happend?
Do you mean "can it happen"?
If so, then yes, almost always.
The main code logic before this point iterates over each sub-part (with its specific sub-tokenizer), so the result does not include the padding.
I can explain more if it isn't clear.
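A sketch of the padding step in question, assuming (as described above) that the per-sub-part scalar tensor is shorter than the padded token sequence and gets extended with the sentinel default:

```python
import torch

scalar_default_unfound_value = -1000.0
all_scalars_values = torch.tensor([0.5, 1.7, -0.3])  # built per sub-part, no padding yet
full_query_len = 6                                   # len(token_ids) after padding

if full_query_len > all_scalars_values.shape[0]:
    pad_len = full_query_len - all_scalars_values.shape[0]
    pad = torch.full((pad_len,), scalar_default_unfound_value)
    all_scalars_values = torch.cat([all_scalars_values, pad])

# all_scalars_values now has length 6; the last 3 entries are -1000.0
```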
@@ -130,6 +126,7 @@ def prepare_info_for_model_step(
     *,
Maybe `build_scalars` would be a better name here?
No problem, renamed.
        )
    else:
        scalars_masked_indices = None
elif full_query_len > all_scalars_values.shape[0]:
@mosheraboh see here, related to what we talked about.
I'll try to add more unit tests with interesting cases by the end of this week