MERA: Multimodal Evaluation for Russian-language Architectures
The LM-harness support for the MERA benchmark datasets.
This project provides a unified framework for testing generative language models on the MERA benchmark and its evaluation tasks.
To install lm-eval from the repository main branch, run:
pip install -e .
To support loading GPTQ quantized models, install the package with the auto-gptq extra:
pip install -e ".[auto-gptq]"
Sample command to run the benchmark with the ai-forever/rugpt3large_based_on_gpt2 model from the Hugging Face Hub:
CUDA_VISIBLE_DEVICES=0 MERA_FOLDER="$PWD/mera_results/rugpt3large_760m_defaults" MERA_MODEL_STRING="pretrained=ai-forever/rugpt3large_based_on_gpt2,dtype=auto" bash run_mera.sh
Use CUDA_VISIBLE_DEVICES to set CUDA device visibility, MERA_FOLDER for the path where outputs are stored, and MERA_MODEL_STRING to set the model_args parameter of lm-evaluation-harness's main.py.
Use MERA_COMMON_SETUP to change the default parameters for model inference with main.py (the defaults are --model hf-causal --device cuda --max_batch_size=64 --batch_size=auto --inference).
See the next section for more on these parameters.
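As an illustration of how these variables fit together, here is a minimal Python sketch that prepares the environment for run_mera.sh. The helper name build_mera_env and the override value --max_batch_size=32 are assumptions made for this example and are not part of the repository; only the variable names and run_mera.sh come from the text above.

```python
import os


def build_mera_env(folder, model_string, devices="0", common_setup=None):
    """Return a copy of the current environment with the MERA variables set.

    build_mera_env is a hypothetical helper, not part of the MERA code base.
    """
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = devices
    env["MERA_FOLDER"] = folder
    env["MERA_MODEL_STRING"] = model_string
    # Only override the script defaults when a value is given explicitly.
    if common_setup is not None:
        env["MERA_COMMON_SETUP"] = common_setup
    return env


env = build_mera_env(
    folder=os.path.join(os.getcwd(), "mera_results", "rugpt3large_760m_defaults"),
    model_string="pretrained=ai-forever/rugpt3large_based_on_gpt2,dtype=auto",
    # Illustrative override: the documented defaults with a smaller max batch size.
    common_setup="--model hf-causal --device cuda --max_batch_size=32 "
                 "--batch_size=auto --inference",
)

# To launch the benchmark with this environment, one could run:
#   import subprocess
#   subprocess.run(["bash", "run_mera.sh"], env=env, check=True)
```

Passing the environment explicitly keeps the settings of one run from leaking into the next when several models are evaluated from the same process.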
A specific benchmark task can be run with the main.py script.
Example:
CUDA_VISIBLE_DEVICES=3 python main.py --model hf-causal --model_args pretrained=mistralai/Mistral-7B-v0.1,dtype=auto,max_length=11000 \
--device cuda --output_base_path="$PWD/mera_results/Mistral-7B-v0.1_defaults" --max_batch_size=16 --batch_size=auto \
--inference --write_out --tasks rummlu --num_fewshot=5 \
--output_path="$PWD/mera_results/Mistral-7B-v0.1_defaults/rummlu_result.json"
Use --tasks to provide a comma-separated list of tasks to run. Available options are: bps, chegeka, lcs, mathlogicqa, multiq, parus, rcb, rudetox, ruethics, ruhatespeech, ruhhh, ruhumaneval, rummlu, rumodar, rumultiar, ruopenbookqa, rutie, ruworldtree, rwsd, simplear, use.
Omitting this argument runs all tasks with the same provided settings.
--num_fewshot sets the few-shot count. MERA expects tasks to be run with the following few-shot counts:
--num_fewshot=0 (zero-shot) with multiq, parus, rcb, rumodar, rwsd, use, rudetox, ruethics, ruhatespeech, ruhhh, rutie, and ruhumaneval;
--num_fewshot=2 with bps and lcs;
--num_fewshot=4 with chegeka;
--num_fewshot=5 with mathlogicqa, ruworldtree, ruopenbookqa, simplear, rumultiar, and rummlu.
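The few-shot assignments listed above can be kept in a small lookup table when tasks are launched one at a time. The following Python sketch only restates that list. The names FEWSHOT_GROUPS, FEWSHOT_BY_TASK, and fewshot_for are hypothetical and do not come from the repository.

```python
# Few-shot counts per task, transcribed from the list above.
FEWSHOT_GROUPS = {
    0: ["multiq", "parus", "rcb", "rumodar", "rwsd", "use", "rudetox",
        "ruethics", "ruhatespeech", "ruhhh", "rutie", "ruhumaneval"],
    2: ["bps", "lcs"],
    4: ["chegeka"],
    5: ["mathlogicqa", "ruworldtree", "ruopenbookqa", "simplear",
        "rumultiar", "rummlu"],
}

# Invert the grouping into a task -> few-shot count mapping.
FEWSHOT_BY_TASK = {
    task: shots for shots, tasks in FEWSHOT_GROUPS.items() for task in tasks
}


def fewshot_for(task):
    """Return the few-shot count listed for a task, or raise ValueError."""
    try:
        return FEWSHOT_BY_TASK[task]
    except KeyError:
        raise ValueError(f"unknown MERA task: {task!r}") from None


for name in ("parus", "bps", "chegeka", "rummlu"):
    print(f"--tasks {name} --num_fewshot={fewshot_for(name)}")
```

Raising an error for an unknown task name, rather than defaulting to zero, makes a typo in the task list visible before a long run starts.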
Use CUDA_VISIBLE_DEVICES to set CUDA device visibility (setting --device cuda:3 works inconsistently).
--model hf-causal is used for models compatible with the transformers AutoModelForCausalLM class and is the most stable option for the MERA benchmark. You can also try the unstable hf-causal-experimental (AutoModelForCausalLM compatible) or hf-seq2seq (AutoModelForSeq2SeqLM) for your model.
--model_args takes comma-separated parameters for the from_pretrained method of the auto class. Be aware of the hardware requirements for running large models, and limit the maximum input length with the max_length parameter to avoid out-of-memory errors during the run.
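To make the comma-separated format concrete, here is a small Python sketch that assembles a model_args string from a dictionary. The helper build_model_args is hypothetical, and the check that rejects commas and equals signs assumes that the string is split on those characters, which this document does not state explicitly.

```python
def build_model_args(params):
    """Join key/value pairs into a comma-separated key=value string.

    build_model_args is a hypothetical helper for illustration only.
    """
    parts = []
    for key, value in params.items():
        text = str(value)
        # Assumption: "," and "=" act as separators, so they cannot
        # appear inside a key or a value.
        if "," in text or "=" in text or "," in key or "=" in key:
            raise ValueError(f"separator character in {key!r}={text!r}")
        parts.append(f"{key}={text}")
    return ",".join(parts)


model_args = build_model_args({
    "pretrained": "mistralai/Mistral-7B-v0.1",
    "dtype": "auto",
    "max_length": 11000,  # caps the input length to limit memory use
})
print(model_args)
```

The resulting string matches the one passed to --model_args in the main.py example earlier in this section.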
--batch_size=auto determines the batch size automatically based on the tasks and inputs. The maximum value from which the downward search starts is set with --max_batch_size. Larger batches may speed up running the whole MERA benchmark.
--output_base_path is the path to a directory (it will be created) that stores data for submission preparation and logs.
--inference should always be set. It allows running on datasets that do not include the correct answers (a score of 0 will be reported).
--write_out turns on extra logging. It should always be on if the submission may be made public.
--no_cache turns off caching of model files (datasets are not cached).
--output_path is the path to an extra log file with the run parameters and task results. It should preferably be inside the output_base_path directory.
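The flags described above can be combined into one main.py invocation per task. The Python sketch below assembles such a command as an argument list, using only flags that appear in this section. The helper build_main_command and the <task>_result.json naming (modeled on the rummlu example) are assumptions for illustration.

```python
import os
import shlex


def build_main_command(task, num_fewshot, model_args, output_base_path,
                       max_batch_size=16):
    """Build a main.py argument list for one task (hypothetical helper)."""
    # Keep the extra log file inside output_base_path, as recommended above.
    output_path = os.path.join(output_base_path, f"{task}_result.json")
    return [
        "python", "main.py",
        "--model", "hf-causal",
        "--model_args", model_args,
        "--device", "cuda",
        f"--output_base_path={output_base_path}",
        f"--max_batch_size={max_batch_size}",
        "--batch_size=auto",
        "--inference",   # always set, per the note above
        "--write_out",   # extra logging for public submissions
        "--tasks", task,
        f"--num_fewshot={num_fewshot}",
        f"--output_path={output_path}",
    ]


cmd = build_main_command(
    task="rummlu",
    num_fewshot=5,
    model_args="pretrained=mistralai/Mistral-7B-v0.1,dtype=auto,max_length=11000",
    output_base_path="mera_results/Mistral-7B-v0.1_defaults",
)
# Print a shell-quoted version; CUDA_VISIBLE_DEVICES is set separately.
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems when paths contain spaces, and the list can be passed directly to subprocess.run.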
The Bash script above runs the submission zip packing routine. Packing can also be run manually.
To convert the outputs into a submission, run:
python scripts/log_to_submission.py
Command-line arguments:
--outputs_dir: path to the directory with outputs (MERA_FOLDER from the Bash script above)
--dst_dir: directory in which to store the submission zip
--dataset_dir: path to lm_eval/datasets/
--logs_public_submit (--no-logs_public_submit): pack logs for public submission in a separate file (true by default)
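For manual packing, the arguments above can be assembled as in the following Python sketch. The helper build_submission_command and the example directory names are hypothetical. Passing each value as a separate argument after its flag is also an assumption about how the script parses its command line.

```python
import shlex


def build_submission_command(outputs_dir, dst_dir,
                             dataset_dir="lm_eval/datasets/",
                             logs_public_submit=True):
    """Build an argument list for scripts/log_to_submission.py.

    Hypothetical helper; only the flag names come from the list above.
    """
    logs_flag = ("--logs_public_submit" if logs_public_submit
                 else "--no-logs_public_submit")
    return [
        "python", "scripts/log_to_submission.py",
        "--outputs_dir", outputs_dir,
        "--dst_dir", dst_dir,
        "--dataset_dir", dataset_dir,
        logs_flag,
    ]


pack_cmd = build_submission_command(
    outputs_dir="mera_results/rugpt3large_760m_defaults",  # same as MERA_FOLDER
    dst_dir="mera_results/rugpt3large_760m_defaults_submission",  # example name
)
print(shlex.join(pack_cmd))
```

Setting logs_public_submit=False switches the last argument to --no-logs_public_submit, which corresponds to turning off the default behavior described above.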