Why do you handle the dataset in this way? #619

Open
lzcchl opened this issue Sep 20, 2024 · 1 comment
lzcchl commented Sep 20, 2024

# now concatenate all samples and split according to max sequence length

Why do you concatenate all samples and split according to max sequence length, rather than processing each sample without concatenation?

I think the logic here is weak:

  1. If there is no correlation between two concatenated paragraphs, will the understanding ability of the quantized model deteriorate? For example, could there be situations where generation does not end at the correct position?
  2. After concatenating and splitting, only the first sample starts with the right begin token; every other sample starts with a wrong begin token. Will this affect the activation values?

Please re-examine whether concatenating and splitting the dataset is reasonable.
Or is there a better way to process the dataset?

@EricLiclair
Quantisation calibrations are generally and ideally performed on corpus datasets, e.g. C4, WikiText, pileval, etc. In which case:
[1] concatenate-and-split by max sequence length is essentially just making sure all samples have a common sequence length (see the sketch after this list).
[2] since it is a text corpus, there are no intermediary special tokens (EOL, etc.) between samples.
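
For illustration, here is a minimal sketch of what that concatenate-and-split step amounts to. This is not AutoAWQ's actual code; concat_and_split, calib_texts, and max_seq_len are placeholder names, and a Hugging Face-style tokenizer is assumed:

# Minimal sketch of concatenate-and-split calibration preprocessing.
# Illustrative only; names here are placeholders, not the library's code.
import torch

def concat_and_split(calib_texts, tokenizer, max_seq_len=512):
    # Tokenize every text and join the token ids into one long stream.
    all_ids = []
    for text in calib_texts:
        all_ids.extend(tokenizer(text).input_ids)
    # Cut the stream into fixed-length blocks of max_seq_len tokens,
    # dropping the trailing remainder. Every block now has a common
    # length, but only the first block starts at a true sample boundary.
    n_blocks = len(all_ids) // max_seq_len
    return [
        torch.tensor(all_ids[i * max_seq_len : (i + 1) * max_seq_len])
        for i in range(n_blocks)
    ]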

If instead you wish to calibrate with an instruction dataset, you would rather want to pass the tokenized examples from your dataset directly to the quantizer, with each sample being one complete instruction.

Something like:

# tokenizer := some tokenizer
apply_func = lambda x: tokenizer(x)

# tokenize each instruction sample on its own (no concatenation)
tokenized_data = [apply_func(sample) for sample in dataset["message"]]

awq_model.quantize(
    ...,  # other arguments
    data_samples=tokenized_data,
)
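
This way each calibration sample is one complete instruction, so every sample begins with its own correct begin token, which avoids the concern in your second point.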

Or refer to this notebook (for a real use case) that I used some time back to quantise on instruction datasets.
