
Add multilingual validation #3

Merged
merged 8 commits into swiss-ai:main on Aug 15, 2024

Conversation


@TJ-Solergibert commented on Jul 24, 2024

In this PR I include the per-language validation stage! There are some comments in the code itself, but the most relevant things to be aware of are:

  • The validation stage DataLoaders are built with the same logic nanotron uses for the training ones across the different training stages (_update_dataloader_based_on_training_stages for training and _prepare_dataloader_for_validation_stage for validation). After carefully analysing the mechanism, I will have to contact them, because it doesn't properly delete the DataLoaders from previous stages. Is this something we should worry about? No, everything works, but I copied this logic for the validation stage and it's somewhat useless & messy.
  • When aggregating the per-language losses there is one case where it fails: if a DP (Data Parallel) group has NO (0) samples for a language, it will try to all-reduce that metric, which is an empty list, with all the other DP groups and will fail. I hope we don't run into this kind of issue, but I propose one (messy) solution to this problem that we could try.

To use this validation stage feature you just need to set dataset.validation_folder for each data stage, plus tokens.val_check_interval. Be aware that since we log the training & validation metrics together, the validation interval must be a multiple of the logging interval (i.e. the validation stage has to run on a training step in which we are also logging metrics).
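For reference, here is a minimal, hypothetical sketch of the relevant config fields and the interval constraint. Apart from dataset.validation_folder and tokens.val_check_interval, the field names below (logging.iteration_step_info_interval, languages, the folder paths) are assumptions rather than the actual schema in this repo:

```python
# Minimal sketch (not the actual PR config): shows the fields mentioned above and
# the constraint that tokens.val_check_interval must be a multiple of the logging
# interval. Field names such as logging.iteration_step_info_interval, the languages
# list, and the folder paths are assumptions and may differ from the real schema.
import yaml  # pip install pyyaml

CONFIG_SKETCH = """
tokens:
  val_check_interval: 50            # run validation every 50 training steps
logging:
  iteration_step_info_interval: 10  # metrics are logged every 10 steps
data_stages:
  - name: stage_0
    start_training_step: 1
    data:
      dataset:
        training_folder: datasets/train
        validation_folder: datasets/validation  # enables the validation stage
        languages: [en, fr, de]                 # used for per-language metrics
"""

config = yaml.safe_load(CONFIG_SKETCH)
val_interval = config["tokens"]["val_check_interval"]
log_interval = config["logging"]["iteration_step_info_interval"]

# Training and validation metrics are logged together, so validation has to land
# on a step where metrics are logged anyway.
assert val_interval % log_interval == 0, (
    "tokens.val_check_interval must be a multiple of the logging interval"
)
```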

You can check some wandb logs here. Keep in mind that wandb logs each metric separately, so to merge the different language losses + the global loss into a single plot you need to "edit panel" (✏️) and set "validation_loss" in the * option. I recommend trying this feature with a single wandb run rather than the whole project's runs.

@TJ-Solergibert (Author)

I've pushed the following changes:

  1. We no longer prepend the language token; instead we pass the lang_code to LlamaModel.forward(). This lang_code tensor has shape (micro_batch_size, 1).
  2. We no longer put a mapping from languages to language tokens in the config file, but just a language list, which is needed for logging and aggregating the per-language metrics.
  3. For the sparse result aggregation we first compute the local per-language result in each DP group (shape (1, NºLANGS)), then gather the local results from all DP groups into a tensor of shape (DP_SIZE, NºLANGS), and finally aggregate the final result. For DP groups that don't have any validation samples of a given language we set the loss to -1 and then ignore those values in the average loss computation (see the sketch below).

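To make point 3 concrete, here is a minimal simulated sketch of that masking scheme (not the actual PR code; in the real implementation the (DP_SIZE, NºLANGS) tensor comes from a gather across the DP process group rather than a hand-built stack):

```python
# Simulated sketch of the sparse per-language loss aggregation (single process).
# Each DP rank produces a (1, N_LANGS) tensor of local validation losses and
# writes -1.0 for languages it saw no samples of.
import torch

# Local per-language losses on two hypothetical DP ranks (languages: en, fr, de).
rank0_losses = torch.tensor([[2.10, 1.95, -1.00]])  # rank 0 had no 'de' samples
rank1_losses = torch.tensor([[2.05, -1.00, 2.40]])  # rank 1 had no 'fr' samples

# Stand-in for gathering across the DP group: shape (DP_SIZE, N_LANGS).
gathered = torch.cat([rank0_losses, rank1_losses], dim=0)

# Ignore the -1 sentinels when averaging each language's loss across DP groups.
valid = gathered != -1.0
per_lang_loss = (gathered * valid).sum(dim=0) / valid.sum(dim=0).clamp(min=1)

print(per_lang_loss)  # tensor([2.0750, 1.9500, 2.4000])
```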
Check the WANDB logs for examples/config_multilingual_nanoset.yaml with Llama3-8B (without pretrained weights), TP = 4 & DP = 2.

I think that's all. Let me know if there are any issues, questions, or typos!

negar-foroutan merged this pull request into swiss-ai:main on Aug 15, 2024
1 of 3 checks passed
negar-foroutan pushed a commit that referenced this pull request Aug 15, 2024
Add multilingual validation step.
negar-foroutan pushed a commit that referenced this pull request Sep 8, 2024
Add multilingual validation step.
negar-foroutan pushed a commit that referenced this pull request Oct 4, 2024
Add multilingual validation step.