
Add multilingual validation #3

Merged
merged 8 commits into swiss-ai:main on Aug 15, 2024

Conversation


@TJ-Solergibert commented on Jul 24, 2024

In this PR I include the per-language validation stage! There are some comments in the code itself, but the most relevant things to be aware of are:

  • The validation stage DataLoaders are built with the same logic nanotron uses for the training ones across the different training stages (_update_dataloader_based_on_training_stages for training and _prepare_dataloader_for_validation_stage for validation). After carefully analysing the mechanism, I will have to contact them, because it doesn't properly delete the DataLoaders from previous stages. Is this something we should worry about? No, everything works, but I copied this logic for the validation stage and it's somewhat useless & messy.
  • When aggregating the per-language losses there is one case where it fails: if a DP (Data Parallel) group has NO (0) samples for a language, it will try to all-reduce that metric, which is an empty list, with all the other DP groups and will fail. I hope we don't run into this kind of issue, but I propose one (messy) solution to this problem that we could try.

To use this validation stage feature you just need to set dataset.validation_folder for each data stage, plus tokens.val_check_interval. Be aware that since we log the training & validation metrics together, the validation interval must be a multiple of the logging interval (i.e. the validation stage has to run on a training step in which we are also logging metrics).
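For reference, here is a minimal, hypothetical sketch of the relevant config fields and the interval constraint. Apart from dataset.validation_folder and tokens.val_check_interval, the field names below (logging.iteration_step_info_interval, languages, the folder paths) are assumptions rather than the actual schema in this repo:

```python
# Minimal sketch (not the actual PR config): shows the fields mentioned above and
# the constraint that tokens.val_check_interval must be a multiple of the logging
# interval. Field names such as logging.iteration_step_info_interval, the languages
# list, and the folder paths are assumptions and may differ from the real schema.
import yaml  # pip install pyyaml

CONFIG_SKETCH = """
tokens:
  val_check_interval: 50            # run validation every 50 training steps
logging:
  iteration_step_info_interval: 10  # metrics are logged every 10 steps
data_stages:
  - name: stage_0
    start_training_step: 1
    data:
      dataset:
        training_folder: datasets/train
        validation_folder: datasets/validation  # enables the validation stage
        languages: [en, fr, de]                 # used for per-language metrics
"""

config = yaml.safe_load(CONFIG_SKETCH)
val_interval = config["tokens"]["val_check_interval"]
log_interval = config["logging"]["iteration_step_info_interval"]

# Training and validation metrics are logged together, so validation has to land
# on a step where metrics are logged anyway.
assert val_interval % log_interval == 0, (
    "tokens.val_check_interval must be a multiple of the logging interval"
)
```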

You can check some wandb logs here. Keep in mind that wandb logs each metric separately, so to merge the different language losses + the global loss into a single plot you need to "edit panel" (✏️) and set "validation_loss" in the * option. I recommend trying this feature with a single wandb run rather than the whole project's runs.

@TJ-Solergibert (Author)

I've pushed the following changes:

  1. We no longer prepend the language token; instead we pass the lang_code to LlamaModel.forward(). This lang_code tensor has shape (micro_batch_size, 1).
  2. We no longer put a mapping from languages to language tokens in the config file, but just a language list, which is needed for logging and aggregating the per-language metrics.
  3. For the sparse result aggregation we first compute the local per-language result in each DP group (shape (1, NºLANGS)), then gather the local results from all DP groups into a tensor of shape (DP_SIZE, NºLANGS), and finally aggregate the final result. For DP groups that don't have any validation samples of a given language we set the loss to -1 and then ignore those values in the average loss computation (see the sketch below).

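To make point 3 concrete, here is a minimal simulated sketch of that masking scheme (not the actual PR code; in the real implementation the (DP_SIZE, NºLANGS) tensor comes from a gather across the DP process group rather than a hand-built stack):

```python
# Simulated sketch of the sparse per-language loss aggregation (single process).
# Each DP rank produces a (1, N_LANGS) tensor of local validation losses and
# writes -1.0 for languages it saw no samples of.
import torch

# Local per-language losses on two hypothetical DP ranks (languages: en, fr, de).
rank0_losses = torch.tensor([[2.10, 1.95, -1.00]])  # rank 0 had no 'de' samples
rank1_losses = torch.tensor([[2.05, -1.00, 2.40]])  # rank 1 had no 'fr' samples

# Stand-in for gathering across the DP group: shape (DP_SIZE, N_LANGS).
gathered = torch.cat([rank0_losses, rank1_losses], dim=0)

# Ignore the -1 sentinels when averaging each language's loss across DP groups.
valid = gathered != -1.0
per_lang_loss = (gathered * valid).sum(dim=0) / valid.sum(dim=0).clamp(min=1)

print(per_lang_loss)  # tensor([2.0750, 1.9500, 2.4000])
```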
Check the WANDB logs for examples/config_multilingual_nanoset.yaml with Llama3-8B (without pretrained weights), TP = 4 & DP = 2.

I think that's all. Let me know if there are any issues, questions, or typos!

negar-foroutan merged this pull request into swiss-ai:main on Aug 15, 2024
1 of 3 checks passed
negar-foroutan pushed a commit that referenced this pull request Aug 15, 2024
Add multilingual validation step.
negar-foroutan pushed a commit that referenced this pull request Sep 8, 2024
Add multilingual validation step.
negar-foroutan pushed a commit that referenced this pull request Oct 4, 2024
Add multilingual validation step.