llava_hf on pope benchmark gives zero results #356
Comments
Noticed that you are using llava-1.5 to perform the evaluation. Could the model's poor instruction-following ability also be a reason for this, and do you observe the same result from …
Hi @kcz358, I tried "llava-hf/llama3-llava-next-8b-hf" and I am getting the same 0 results. Note that the models (i.e. llava-1.5, llama3-llava-next-8b-hf) do work if you, for example, lower max_new_tokens. Anyway, I don't think we should rely on lowering max_new_tokens in order to get results; there should be a better way.
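For concreteness, a minimal sketch of that kind of workaround, assuming the standard lmms-eval task-yaml layout (the value below is an arbitrary illustration, not the shipped config):

```yaml
# hypothetical override in pope.yaml: cap generation so the decoded answer
# stays close to a bare "yes"/"no" (value chosen only for illustration)
generation_kwargs:
  max_new_tokens: 16
```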
Hi @giobin, I found that the issue is due to the …
The command:
```
python3 -m accelerate.commands.launch --config_file=accelerate_multi_GPU.yaml --num_processes=4 -m lmms_eval --model llava_hf --model_args pretrained="llava-hf/llava-1.5-7b-hf" --tasks pope --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_pope --output_path ./logs/ --show_config --verbosity=DEBUG
```
gives poor results like this:
"results": {
"pope": {
"alias": "pope",
"pope_accuracy,none": 0.0,
"pope_accuracy_stderr,none": "N/A",
"pope_precision,none": 0.0
"pope_precision_stderr,none": "N/A",
"pope_recall,none": 0.0,
"pope_recall_stderr,none": "N/A",
"pope_f1_score,none": 0.0,
"pope_f1_score_stderr,none": "N/A",
"pope_yes_ratio,none": 0.5,
"pope_yes_ratio_stderr,none": "N/A"
}
The problem is due to the fact that pope.yaml contains:
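(presumed excerpt, reconstructed from the surrounding discussion rather than copied verbatim from the repository)

```yaml
generation_kwargs:
  max_new_tokens: 128
```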
This lets the model generate up to 128 tokens, and then, when the exact-match metric is computed, the result is 0, since a 128-token string cannot exactly match a target answer in ['yes', 'no'].
I think the problem is in the way generation is (not) stopped, which should be controlled by the "until" parameter in generation_kwargs. But even if you change pope.yaml to contain the "until" param, the only place in llava_hf.py where it is used is this part of the code:
```python
# Set default values for until and max_new_tokens
until = [self.tok_decode(self.eot_token_id)]
```
This sets the until variable, but then never uses it in the rest of the code to actually crop the output or to stop generation in generate_until() as soon as one of the strings in until is produced.
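For illustration, this is roughly the post-processing that seems to be missing; a minimal sketch, not the repository's actual code, with a made-up helper name and example stop strings:

```python
# Sketch only: crop the decoded generation at the first occurrence of any stop
# string collected in `until`, so a long free-form answer reduces to its prefix
# before the exact-match metric compares it against "yes"/"no".
def crop_until(text: str, until: list[str]) -> str:
    for stop in until:
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]
    return text.strip()

# Hypothetical usage: a verbose answer collapses to "Yes".
print(crop_until("Yes, there is a dog in the image.\nIt is sitting on ...", ["\n", ",", "."]))
```

Alternatively, the same strings could be wrapped in a Hugging Face StoppingCriteria so that generation halts early instead of being cropped afterwards.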
It seems that this problem with "until" is also present in other variants of llava and in idefics. So what am I missing?