Why do you handle the dataset in this way? #619

Open
lzcchl opened this issue Sep 20, 2024 · 1 comment
lzcchl commented Sep 20, 2024

# now concatenate all samples and split according to max sequence length

Why do you concatenate all samples and split according to max sequence length, rather than processing each sample without concatenation?

I think the logic here is weak:

  1. If there is no correlation between two concatenated paragraphs, will the understanding ability of the quantized model deteriorate? For example, could there be situations where generation does not end at the correct position?
  2. After concatenating and splitting, only the first sample starts with the right begin token; every other sample starts with a wrong begin token. Will this affect the activation values?

Please re-examine whether concatenating and splitting the dataset is reasonable.
Or is there a better way to process the dataset?

@EricLiclair
Quantisation calibrations are generally and ideally performed on corpus datasets, e.g. C4, WikiText, pileval, etc. In which case:
[1] concatenate-and-split by max sequence length is essentially just making sure all samples have a common sequence length (see the sketch after this list).
[2] since it is a text corpus, there are no intermediary special tokens (EOL, etc.) between samples.
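
For illustration, here is a minimal sketch of what that concatenate-and-split step amounts to. This is not AutoAWQ's actual code; concat_and_split, calib_texts, and max_seq_len are placeholder names, and a Hugging Face-style tokenizer is assumed:

# Minimal sketch of concatenate-and-split calibration preprocessing.
# Illustrative only; names here are placeholders, not the library's code.
import torch

def concat_and_split(calib_texts, tokenizer, max_seq_len=512):
    # Tokenize every text and join the token ids into one long stream.
    all_ids = []
    for text in calib_texts:
        all_ids.extend(tokenizer(text).input_ids)
    # Cut the stream into fixed-length blocks of max_seq_len tokens,
    # dropping the trailing remainder. Every block now has a common
    # length, but only the first block starts at a true sample boundary.
    n_blocks = len(all_ids) // max_seq_len
    return [
        torch.tensor(all_ids[i * max_seq_len : (i + 1) * max_seq_len])
        for i in range(n_blocks)
    ]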

If instead you wish to calibrate with an instruction dataset, you would rather want to pass the tokenized examples from your dataset directly to the quantizer, with each sample being one complete instruction.

Something like:

# tokenizer := some tokenizer
apply_func = lambda x: tokenizer(x)

# tokenize each instruction sample on its own (no concatenation)
tokenized_data = [apply_func(sample) for sample in dataset["message"]]

awq_model.quantize(
    ...,  # other arguments
    data_samples=tokenized_data,
)
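
This way each calibration sample is one complete instruction, so every sample begins with its own correct begin token, which avoids the concern in your second point.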

Or refer to this notebook (for a real use case) that I used some time back to quantise on instruction datasets.
