llama_hf on pope benchmark gives zero results. #356

Open · giobin opened this issue Oct 26, 2024 · 3 comments

giobin commented Oct 26, 2024

The command:
python3 -m accelerate.commands.launch --config_file=accelerate_multi_GPU.yaml --num_processes=4 -m lmms_eval --model llava_hf --model_args pretrained="llava-hf/llava-1.5-7b-hf" --tasks pope --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_pope --output_path ./logs/ --show_config --verbosity=DEBUG

gives all-zero results like this:

"results": {
"pope": {
"alias": "pope",
"pope_accuracy,none": 0.0,
"pope_accuracy_stderr,none": "N/A",
"pope_precision,none": 0.0
"pope_precision_stderr,none": "N/A",
"pope_recall,none": 0.0,
"pope_recall_stderr,none": "N/A",
"pope_f1_score,none": 0.0,
"pope_f1_score_stderr,none": "N/A",
"pope_yes_ratio,none": 0.5,
"pope_yes_ratio_stderr,none": "N/A"
}

The problem is that pope.yaml contains:

generation_kwargs:
  max_new_tokens: 128

which lets the model generate up to 128 new tokens; when the exact-match metric is then computed, the result is 0, because a 128-token string cannot match a target answer in ['yes', 'no'].
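A natural fix would be to add a stop-string entry to pope.yaml, something along these lines (the "\n" value is just an illustrative guess, not something taken from the repo):

```yaml
generation_kwargs:
  max_new_tokens: 128
  until:
    - "\n"
```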

I think the problem is in the way the generation is (not) stopped, which should be controlled by the "until" parameter in generation_kwargs. But even if you change pope.yaml to contain the "until" param, the only place llava_hf.py uses it is this part of the code:

```python
# Set default values for until and max_new_tokens
until = [self.tok_decode(self.eot_token_id)]

# Update values from gen_kwargs if present
if "until" in gen_kwargs:
    until = gen_kwargs.pop("until")
    if isinstance(until, str):
        until = [until]
    elif not isinstance(until, list):
        raise ValueError(f"Expected `gen_kwargs['until']` to be of type Union[str,list] but got {type(until)}")
assert self.batch_size_per_gpu == 1, "Do not support batch_size_per_gpu > 1 for now"
context = contexts[0]
```

This sets the until variable, but the rest of the code never uses it to crop the output or to stop generate_until() as soon as one of the strings in until is generated.
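For illustration, actually using those stop strings would amount to something like the following post-processing (just a sketch of what I would expect, not code from the repo):

```python
# Sketch: crop a decoded generation at the first occurrence of any stop string.
# `until` is the list built above; `text_output` stands for the decoded model output.
def truncate_at_stop_strings(text_output: str, until: list[str]) -> str:
    for stop in until:
        if stop and stop in text_output:
            # Keep only the text before the first stop string.
            text_output = text_output.split(stop, 1)[0]
    return text_output


print(truncate_at_stop_strings("Yes\nThe image shows a dog ...", ["\n"]))  # -> "Yes"
```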

It seems that this problem with "until" is also present in other variants of llava and in idefics. So what am I missing?

kcz358 (Collaborator) commented Oct 28, 2024

I noticed that you are using llava-1.5 for the evaluation. Could the model's weak instruction-following ability also be a reason for this, and do you observe the same result with llava-next-llama3 or even the llava-onevision series?

giobin (Author) commented Oct 31, 2024

Hi @kcz358, I tried "llava-hf/llama3-llava-next-8b-hf" and I am getting the same zero results. Note that the models (i.e. llava-1.5, llama3-llava-next-8b-hf) do work if you use, for example:

generation_kwargs:
  max_new_tokens: 3

This is because the response is first decoded with skip_special_tokens=True, so "<|end_of_text|>" and all the other special tokens are discarded, and then the string is stripped, so any '\n' or spaces are also removed from the generation output. If only a few tokens are generated, this leaves just the 'yes'/'no' response (provided the model is actually able to follow the instruction), and the exact match works as expected.
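Roughly, the path I am describing is the following (a sketch using the tokenizer of the model from my command; the variable names are mine, not lmms-eval's):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Pretend the model generated just "Yes" followed by its EOS token.
new_token_ids = tokenizer("Yes", add_special_tokens=False).input_ids + [tokenizer.eos_token_id]

# skip_special_tokens=True drops the special tokens and .strip() removes stray
# newlines/spaces, so with a small max_new_tokens only the bare "Yes"/"No"
# survives for the exact match.
answer = tokenizer.decode(new_token_ids, skip_special_tokens=True).strip()
print(answer)  # -> "Yes"
```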

Anyway, I don't think we should rely on lowering max_new_tokens in order to get results. There should be a better way.

kcz358 (Collaborator) commented Nov 1, 2024

Hi @giobin, I found that the issue is due to specified_eot_token_id being set to None by default. This lets the model generate with no eos_token if you don't pass the eos_token in model_args. I have removed this param and set the eos_token to the tokenizer's eos token in #386. I have done a quick dry run and the results look good now.
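In terms of the underlying generate call, the change boils down to something like this (a simplified, runnable sketch with a tiny text-only stand-in model, not the actual diff; see #386 for the real change):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Tiny stand-in model so the sketch runs; in lmms-eval this would be the llava_hf wrapper's model.
model_id = "sshleifer/tiny-gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Is there a dog in the image? Answer yes or no.", return_tensors="pt")

output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    eos_token_id=tokenizer.eos_token_id,  # stop at the tokenizer's EOS instead of a None specified_eot_token_id
    pad_token_id=tokenizer.eos_token_id,  # avoids the HF warning when no pad token is set
)
answer = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()
print(answer)
```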
