llava_hf on pope benchmark gives zero results #356
Comments
Noticed that you are using llava-1.5 to perform the evaluation. Could the model's poor instruction-following ability also be a reason for this, and do you observe the same result from …
Hi @kcz358, I tried "llava-hf/llama3-llava-next-8b-hf" and I am getting the same 0 results. Note that the models (i.e. llava-1.5, llama3-llava-next-8b-hf) do work if you, for example, lower max_new_tokens. Anyway, I don't think we should rely on lowering max_new_tokens in order to get results; there should be a better way.
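For concreteness, a minimal sketch of that kind of workaround, assuming the standard lmms-eval task-yaml layout (the value below is an arbitrary illustration, not the shipped config):

```yaml
# hypothetical override in pope.yaml: cap generation so the decoded answer
# stays close to a bare "yes"/"no" (value chosen only for illustration)
generation_kwargs:
  max_new_tokens: 16
```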
Hi @giobin, I found that the issue is due to the …
The command:
```
python3 -m accelerate.commands.launch --config_file=accelerate_multi_GPU.yaml --num_processes=4 -m lmms_eval --model llava_hf --model_args pretrained="llava-hf/llava-1.5-7b-hf" --tasks pope --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_pope --output_path ./logs/ --show_config --verbosity=DEBUG
```
gives poor results like this:
"results": {
"pope": {
"alias": "pope",
"pope_accuracy,none": 0.0,
"pope_accuracy_stderr,none": "N/A",
"pope_precision,none": 0.0
"pope_precision_stderr,none": "N/A",
"pope_recall,none": 0.0,
"pope_recall_stderr,none": "N/A",
"pope_f1_score,none": 0.0,
"pope_f1_score_stderr,none": "N/A",
"pope_yes_ratio,none": 0.5,
"pope_yes_ratio_stderr,none": "N/A"
}
The problem is due to the fact that pope.yaml contains:
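(presumed excerpt, reconstructed from the surrounding discussion rather than copied verbatim from the repository)

```yaml
generation_kwargs:
  max_new_tokens: 128
```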
This lets the model generate up to 128 tokens, and then, when the exact-match metric is computed, the result is 0, since a 128-token string cannot exactly match a target answer in ['yes', 'no'].
I think the problem is in the way generation is (not) stopped, which should be controlled by the "until" parameter in generation_kwargs. But even if you change pope.yaml to contain the "until" param, the only place in llava_hf.py where it is used is this part of the code:
```python
# Set default values for until and max_new_tokens
until = [self.tok_decode(self.eot_token_id)]
```
This sets the until variable, but then never uses it in the rest of the code to actually crop the output or to stop generation in generate_until() as soon as one of the strings in until is produced.
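For illustration, this is roughly the post-processing that seems to be missing; a minimal sketch, not the repository's actual code, with a made-up helper name and example stop strings:

```python
# Sketch only: crop the decoded generation at the first occurrence of any stop
# string collected in `until`, so a long free-form answer reduces to its prefix
# before the exact-match metric compares it against "yes"/"no".
def crop_until(text: str, until: list[str]) -> str:
    for stop in until:
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]
    return text.strip()

# Hypothetical usage: a verbose answer collapses to "Yes".
print(crop_until("Yes, there is a dog in the image.\nIt is sitting on ...", ["\n", ",", "."]))
```

Alternatively, the same strings could be wrapped in a Hugging Face StoppingCriteria so that generation halts early instead of being cropped afterwards.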
It seems that this problem with "until" is also present in other variants of llava and in idefics. So what am I missing?