Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

advancing to scalars gen2 #374

Merged
merged 7 commits into from
Oct 10, 2024
Merged

advancing to scalars gen2 #374

merged 7 commits into from
Oct 10, 2024

Conversation

YoelShoshan
Copy link
Collaborator

No description provided.

mosheraboh
mosheraboh previously approved these changes Oct 8, 2024
Copy link
Collaborator

@mosheraboh mosheraboh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!
Few comments inline/

[]
) # one scalar for every element, `scalar_default_unfound_value` is used for elements that aren't scalars
all_scalars_valid_mask = [] # for each element, whether it's a scalar or not
scalar_default_unfound_value = -1000.0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we now keep the scalars values and mask at the size of the entire sequence, this is the default value for places that don't actually have a scalar value.
I chose -1000 and not something like 0 to make sure it pops up easily if there are mistakes down the road

@@ -47,7 +45,7 @@ class InjectorToModularTokenizerLib:
@staticmethod
def build_placeholder_meta_tokenization(
*,
sequence: Union[str, list, tuple],
sequence: str,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change it if you still support those types?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just copied over the code, so probably you already fixed this and it got overriden.
fixed it now to contain the union.

@@ -91,19 +88,18 @@ def build_placeholder_meta_tokenization(
if tokenizer_type.startswith("SCALARS_"):
with_placeholders.append(
"<@TOKENIZER-TYPE=AA>"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general solution can't assume there is AA subtokenizer.
Maybe we need a default empty sub-tokenizer? Maybe SCALARS can be an empty sub-tokenizer?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting point.
SCALARS is currently fully programmatic and does not rely on any dictionary, so I would rather not mix it.
Probably better to have "base" that gets automatically generated and supported , as the modular tokenizer already knows how to handle special tokens

maybe "Base" or "SpecialTokensBase" or something

if (
tokenizer_type == "SCALARS_LITERALS"
): # note: masking is only supported in literals (not in "from dict")
if tokenizer_type == "SCALARS_LITERALS":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remind me, can we put mask in scalar literals?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stopped supporting this option, intentionally.
Now scalar only supports scalars

scalars_masked_indices = []
prev_index_end = -1
## both `all_scalars_values` and `all_scalars_valid_mask` will contain torch tensors, which will be concatanated in the end of this function
all_scalars_values = (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all_scalar_values = [] ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that static code auto "fixed" this, due to the comment length
(I didn't write it like this)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the comments to one line above the code, so it won't do that.

for x in values
]
)
# validate that all values can be converted to fload
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: float

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!
fixed


# pad if needed
full_query_len = len(token_ids)
if full_query_len > all_scalars_values.shape[0]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't it happend?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean "can it happen" ?
If so, then yes, almost always.

The main code logic (before it) iterates over each sub part (with specific sub tokenizer) so it does not contain the padding.

I can explain more if it isn't clear

@@ -130,6 +126,7 @@ def prepare_info_for_model_step(
*,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe build_scalars be a better name here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np, renamed

)
else:
scalars_masked_indices = None
elif full_query_len > all_scalars_values.shape[0]:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mosheraboh see here, related to what we talked about.

I'll try to add more unit tests with interesting cases by the end of this week

yoel shoshan added 2 commits October 9, 2024 11:33
@YoelShoshan YoelShoshan requested a review from mosheraboh October 9, 2024 17:39
@mosheraboh mosheraboh merged commit 957f37f into master Oct 10, 2024
5 checks passed
@mosheraboh mosheraboh deleted the scalars_gen2_for_os branch October 10, 2024 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants