advancing to scalars gen2 #374
Conversation
Looks great!
A few comments inline.
    []
)  # one scalar for every element, `scalar_default_unfound_value` is used for elements that aren't scalars
all_scalars_valid_mask = []  # for each element, whether it's a scalar or not
scalar_default_unfound_value = -1000.0
What does it mean?
Since we now keep the scalar values and the mask at the size of the entire sequence, this is the default value for positions that don't actually hold a scalar value.
I chose -1000.0 rather than something like 0.0 so that it pops up easily if there are mistakes down the road.
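For illustration, a minimal sketch of the idea described above (names, positions, and shapes are assumptions, not the PR's actual code): scalar values and a validity mask are kept at full-sequence length, with the sentinel default filling the non-scalar positions.

```python
import torch

seq_len = 8
scalar_default_unfound_value = -1000.0  # sentinel that stands out if misused downstream

# hypothetical positions and values of the real scalars within the sequence
scalar_positions = torch.tensor([2, 3, 6])
scalar_values = torch.tensor([0.5, 1.7, -0.3])

# full-sequence-length containers, defaulted to "no scalar here"
all_scalars_values = torch.full((seq_len,), scalar_default_unfound_value)
all_scalars_valid_mask = torch.zeros(seq_len, dtype=torch.bool)

all_scalars_values[scalar_positions] = scalar_values
all_scalars_valid_mask[scalar_positions] = True
# non-scalar positions keep -1000.0 and a False mask entry
```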
@@ -47,7 +45,7 @@ class InjectorToModularTokenizerLib:
     @staticmethod
     def build_placeholder_meta_tokenization(
         *,
-        sequence: Union[str, list, tuple],
+        sequence: str,
Why did you change it if you still support those types?
I just copied the code over, so you probably already fixed this and it got overridden.
Fixed now to contain the Union.
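For context, a hedged sketch of what the restored Union signature implies (the function body and normalization behavior here are assumptions for illustration, not copied from the PR):

```python
from typing import Union

def build_placeholder_meta_tokenization(*, sequence: Union[str, list, tuple]) -> str:
    # if the caller passed a pre-split sequence, join it back into one string
    if isinstance(sequence, (list, tuple)):
        sequence = "".join(str(part) for part in sequence)
    # ... placeholder construction would continue from here ...
    return sequence
```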
@@ -91,19 +88,18 @@ def build_placeholder_meta_tokenization(
     if tokenizer_type.startswith("SCALARS_"):
         with_placeholders.append(
             "<@TOKENIZER-TYPE=AA>"
The general solution can't assume there is an AA sub-tokenizer.
Maybe we need a default empty sub-tokenizer? Maybe SCALARS can be an empty sub-tokenizer?
Interesting point.
SCALARS is currently fully programmatic and does not rely on any dictionary, so I would rather not mix the two.
It's probably better to have a "base" sub-tokenizer that gets generated and supported automatically, since the modular tokenizer already knows how to handle special tokens.
Maybe "Base" or "SpecialTokensBase" or something similar.
-if (
-    tokenizer_type == "SCALARS_LITERALS"
-):  # note: masking is only supported in literals (not in "from dict")
+if tokenizer_type == "SCALARS_LITERALS":
Remind me, can we put a mask in scalar literals?
I stopped supporting this option intentionally.
Scalars now support only scalar values.
scalars_masked_indices = []
prev_index_end = -1
## both `all_scalars_values` and `all_scalars_valid_mask` will contain torch tensors, which will be concatenated at the end of this function
all_scalars_values = (
all_scalar_values = [] ?
I think the static code formatter auto-"fixed" this due to the comment length
(I didn't write it like this).
I moved the comments to one line above the code, so it won't do that.
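The accumulate-then-concatenate pattern the code comment describes, as a standalone sketch (the spans and shapes are illustrative only):

```python
import torch

all_scalars_values = []      # per-span tensors collected here
all_scalars_valid_mask = []

for span in [torch.tensor([1.0, 2.0]), torch.tensor([3.0])]:
    all_scalars_values.append(span)
    all_scalars_valid_mask.append(torch.ones_like(span, dtype=torch.bool))

# a single concatenation at the end of the function, not one per iteration
all_scalars_values = torch.cat(all_scalars_values)
all_scalars_valid_mask = torch.cat(all_scalars_valid_mask)
```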
        for x in values
    ]
)
# validate that all values can be converted to fload
typo: float
thanks!
fixed
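A possible shape for that validation step, sketched under the assumption that conversion failures should raise early with a clear message (the actual implementation may differ):

```python
def validate_floats(values: list) -> list:
    floats = []
    for x in values:
        try:
            floats.append(float(x))
        except (TypeError, ValueError) as err:
            raise ValueError(f"scalar literal {x!r} cannot be converted to float") from err
    return floats

print(validate_floats(["1.5", "2", "-3.25"]))  # [1.5, 2.0, -3.25]
```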
# pad if needed
full_query_len = len(token_ids)
if full_query_len > all_scalars_values.shape[0]:
Can't it happend?
Do you mean "can it happen"?
If so, then yes, almost always.
The main code logic before this point iterates over each sub-part (with its specific sub-tokenizer), so the result does not include the padding.
I can explain more if it isn't clear.
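A sketch of the padding step in question, assuming (as described above) that the per-sub-part scalar tensor is shorter than the padded token sequence and gets extended with the sentinel default:

```python
import torch

scalar_default_unfound_value = -1000.0
all_scalars_values = torch.tensor([0.5, 1.7, -0.3])  # built per sub-part, no padding yet
full_query_len = 6                                   # len(token_ids) after padding

if full_query_len > all_scalars_values.shape[0]:
    pad_len = full_query_len - all_scalars_values.shape[0]
    pad = torch.full((pad_len,), scalar_default_unfound_value)
    all_scalars_values = torch.cat([all_scalars_values, pad])

# all_scalars_values now has length 6; the last 3 entries are -1000.0
```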
@@ -130,6 +126,7 @@ def prepare_info_for_model_step(
     *,
Maybe `build_scalars` would be a better name here?
No problem, renamed.
        )
    else:
        scalars_masked_indices = None
elif full_query_len > all_scalars_values.shape[0]:
@mosheraboh see here, related to what we talked about.
I'll try to add more unit tests with interesting cases by the end of this week