`# now concatenate all samples and split according to max sequence length`
Why do you concatenate all samples and then split according to max sequence length, rather than keeping the samples separate?
I think the logic here is questionable:
If there is no correlation between two adjacent paragraphs, will the understanding ability of the quantized model deteriorate? For example, could it end up not stopping at the correct position?
After concatenating and splitting, only the first chunk starts with the correct begin token; every other chunk starts with the wrong begin token. Will this affect the activation values?
Please re-examine whether concatenating and splitting the dataset is reasonable.
Or is there a better way to process the dataset?
Quantisation calibration is generally (and ideally) performed on corpus datasets, e.g. c4, wikitext, pileval, etc. In that case:
[1] concatenating and splitting by max sequence length is essentially just making sure all calibration samples share a common sequence length (see the sketch below);
[2] since it is a plain text corpus, there are no intermediary special tokens (EOL, etc.) to preserve.
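For illustration, here is a minimal sketch of what that concatenate-and-split step amounts to, assuming a Hugging Face style `tokenizer` and a list of raw text strings `samples` (the names and the 512-token block size are placeholders, not the exact calib_data.py code):

```python
import torch

# Placeholder sketch -- `tokenizer`, `samples` and `max_seq_len` are assumed inputs.
max_seq_len = 512

# Tokenize each raw text sample individually.
tokenized = [torch.tensor(tokenizer.encode(s)) for s in samples]

# Concatenate everything into one long token stream ...
stream = torch.cat(tokenized)

# ... then split the stream into fixed-size calibration blocks.
n_blocks = stream.numel() // max_seq_len
calib_blocks = [
    stream[i * max_seq_len : (i + 1) * max_seq_len] for i in range(n_blocks)
]
```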
If you instead wish to calibrate with an instruction dataset, you would rather pass the tokenized examples from your dataset directly to the quantizer, with each sample being one complete instruction.
Something like:
```python
# tokenizer := some tokenizer
apply_func = lambda x: tokenizer(x)
tokenized_data = [apply_func(sample) for sample in dataset["message"]]

awq_model.quantize(
    ...,  # other arguments
    data_samples=tokenized_data,
)
```
Or refer to this notebook (for a real use case) that I used some time back to quantise on instruction datasets:
AutoAWQ/awq/utils/calib_data.py, line 59 (commit 7954766)