
Test readme commands #1311
Merged: 31 commits into main on May 3, 2024

Conversation

@rasbt (Collaborator) commented on Apr 17, 2024

Fixes #1296

@rasbt changed the title from "[WIP] test readme commands" to "Test readme commands" on Apr 22, 2024
@rasbt (Collaborator, Author) commented on Apr 24, 2024

I was asked to put together a CI script that runs the CLI commands as they appear in the README, which is what this PR does. To keep it feasible, I used the smallest models. It works fine locally on GPU machines.
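For reference, the approach boils down to something like the following sketch (a minimal pytest version; the actual script in this PR may differ, and the command string is the one from the README, copied from the failure log below):

```python
# Minimal sketch of a README-command test, assuming pytest and a litgpt
# install with the pythia-14m checkpoint already downloaded. Illustrative,
# not necessarily the exact harness in this PR.
import shlex
import subprocess

def test_readme_pretrain_command(tmp_path):
    cmd = (
        "litgpt pretrain --model_name pythia-14m "
        "--initial_checkpoint checkpoints/EleutherAI/pythia-14m "
        "--tokenizer_dir checkpoints/EleutherAI/pythia-14m "
        "--data TextFiles --data.train_data_path custom_texts "
        "--train.max_tokens 100 "
        f"--out_dir {tmp_path / 'out'}"
    )
    result = subprocess.run(shlex.split(cmd), capture_output=True, text=True)
    assert result.returncode == 0, result.stdout + result.stderr
```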

But in CI, which runs on a CPU, I noticed the following:

```
RuntimeError: Command 'litgpt pretrain --model_name pythia-14m --initial_checkpoint checkpoints/EleutherAI/pythia-14m --tokenizer_dir checkpoints/EleutherAI/pythia-14m --data TextFiles --data.train_data_path custom_texts --train.max_tokens 100 --out_dir out/custom_continue_pretrained' failed with exit status 1
E           Output:
E           {'data': {'batch_size': 1,
E                     'max_seq_length': -1,
E                     'num_workers': 4,
E                     'seed': 42,
E                     'tokenizer': None,
E                     'train_data_path': PosixPath('custom_texts'),
E                     'val_data_path': None},
E            'devices': 'auto',
E            'eval': {'interval': 1000, 'max_iters': 100, 'max_new_tokens': None},
E            'initial_checkpoint_dir': PosixPath('checkpoints/EleutherAI/pythia-14m'),
E            'logger_name': 'tensorboard',
E            'model_config': None,
E            'model_name': 'pythia-14m',
E            'out_dir': PosixPath('out/custom_continue_pretrained'),
E            'resume': False,
E            'seed': 42,
E            'tokenizer_dir': PosixPath('checkpoints/EleutherAI/pythia-14m'),
E            'train': {'beta1': 0.9,
E                      'beta2': 0.95,
E                      'epochs': None,
E                      'global_batch_size': 512,
E                      'learning_rate': 0.0004,
E                      'log_interval': 1,
E                      'lr_warmup_fraction': None,
E                      'lr_warmup_steps': 2000,
E                      'max_norm': 1.0,
E                      'max_seq_length': None,
E                      'max_steps': None,
E                      'max_tokens': 100,
E                      'micro_batch_size': 4,
E                      'min_lr': 4e-05,
E                      'save_interval': 1000,
E                      'tie_embeddings': False,
E                      'weight_decay': 0.1}}
E           Time to instantiate model: 0.07 seconds.
E           Total parameters: 14,067,712
E           
E           Error:
E           Using bfloat16 Automatic Mixed Precision (AMP)
E           Missing logger folder: out/custom_continue_pretrained/logs/tensorboard
E           Seed set to 42
E           Traceback (most recent call last):
E             File "/opt/hostedtoolcache/Python/3.10.14/x64/bin/litgpt", line 8, in <module>
E               sys.exit(main())
E             File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/litgpt/__main__.py", line 143, in main
E               fn(**kwargs)
E             File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/litgpt/pretrain.py", line 119, in setup
E               main(
E             File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/litgpt/pretrain.py", line 172, in main
E               optimizer = torch.optim.AdamW(
E             File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/torch/optim/adamw.py", line 69, in __init__
E               raise RuntimeError("`fused=True` requires all the params to be floating point Tensors of "
E           RuntimeError: `fused=True` requires all the params to be floating point Tensors of supported devices: ['cuda', 'xpu', 'privateuseone'].
```

I can reproduce it on a CPU machine.

Is that intended? I thought we only run bfloat16 if the device supports it, and I thought we support CPUs. Is this a bug, and should we fix it?
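For context, the traceback shows the crash comes from constructing AdamW with `fused=True` while the parameters live on CPU. A device-aware guard along these lines would avoid it; this is a sketch, not necessarily how litgpt fixes it, with hyperparameters taken from the config dump above:

```python
# Sketch of a device-aware optimizer setup: only request the fused AdamW
# kernel when the parameters are on CUDA, since fused=True is unsupported
# on CPU. Hyperparameters mirror the config dump above.
import torch

def configure_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    fused = next(model.parameters()).is_cuda
    return torch.optim.AdamW(
        model.parameters(),
        lr=4e-4,
        weight_decay=0.1,
        betas=(0.9, 0.95),
        fused=fused,
    )
```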

On a second note, how do we make running this in CI more feasible? I think we could skip this test on all machines but one, if that's okay with you. What are your thoughts in general? @awaelchli @carmocca

@rasbt (Collaborator, Author) commented on Apr 24, 2024

I saw that the precision was hardcoded in the pretraining script. I am making it consistent with the finetuning settings in #1353, which should address this issue.
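For illustration, a device-aware precision default could look like this sketch (the helper name is illustrative, not necessarily the utility litgpt uses):

```python
# Sketch of picking a training precision the current device supports:
# bf16 mixed precision on GPUs with bfloat16 support, full fp32 otherwise.
import torch

def default_precision() -> str:
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return "bf16-mixed"
    return "32-true"  # safe fallback for CPU and older GPUs
```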

@carmocca (Contributor) commented on Apr 25, 2024

My personal opinion is to not do this in its current form.

As I understand it, the original proposal is about making sure that the README's commands run.

This PR accomplishes that right now; however, it doesn't address the (very common) problem of somebody modifying the README and accidentally breaking it. The same problem exists in our tutorials, of course.

I suggest parsing the README to extract the commands to run (by looking at code blocks). These commands could then be run with mocks for the underlying functionality, as we do for the config_hub tests.
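For illustration, extracting the commands could be as simple as the following sketch (the regex and the `litgpt` prefix filter are assumptions, not existing project tooling):

```python
# Sketch: pull shell commands out of the README's fenced code blocks so a
# test can feed them to the CLI parser. Regex and filtering are assumptions.
import re
from pathlib import Path

def extract_readme_commands(readme_path: str = "README.md") -> list[str]:
    text = Path(readme_path).read_text()
    blocks = re.findall(r"```(?:bash|sh|shell)?\n(.*?)```", text, re.DOTALL)
    return [
        line.strip()
        for block in blocks
        for line in block.splitlines()
        if line.strip().startswith("litgpt ")
    ]
```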

You might then wonder, "what if the underlying functionality is broken?" To that I would argue that unit tests specific to that functionality should be created, as they are easier to write and verify. We already have a lot of those, of course.

@rasbt (Collaborator, Author) commented on Apr 25, 2024

I agree with the README parsing idea, and I think Andrei found a tool that might do that, but we decided against it since it would be over-engineered. So I think we should keep the relatively simple approach I have here, though a weakness is that we have to remember to update it.

@rasbt (Collaborator, Author) commented on Apr 25, 2024

Oh, I just remembered another reason why this is a bit harder to automate by parsing the README: the README uses more expensive models that wouldn't run on CPU. I am cutting a few corners here by using Pythia and setting max_steps and max_tokens to small values.

@rasbt (Collaborator, Author) commented on Apr 25, 2024

If you don't mind, could we merge this, @carmocca, to end this README testing saga? It was requested and has been on my list for quite a while.

@rasbt (Collaborator, Author) commented on Apr 25, 2024

Or maybe let me exclude the Windows tests, because they seem to time out.

@carmocca (Contributor) commented:

> Oh, I just remembered another reason why this is a bit harder to automate by parsing the README: the README uses more expensive models that wouldn't run on CPU. I am cutting a few corners here by using Pythia and setting max_steps and max_tokens to small values.

My point in my last message was that README testing should not run the actual chat / pretrain / evaluate ... functionality, but instead just make sure that the commands are correct and stop after parsing. For everything else, we have tests in other files. Checking parsing is also what we do for the config hub.
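Concretely, a parse-only check might look like the sketch below. The patch target `litgpt.pretrain.setup` and the dispatch through `litgpt.__main__.main` are inferred from the traceback earlier in this thread; whether `main` resolves the function late enough for the patch to take effect is an assumption:

```python
# Sketch of a parse-only test: stub out the heavy entry point so the CLI
# exercises only argument parsing. The patch target is inferred from the
# traceback above and may not match litgpt's actual dispatch.
import sys

def test_pretrain_command_parses(monkeypatch):
    captured = {}
    monkeypatch.setattr(
        "litgpt.pretrain.setup", lambda **kwargs: captured.update(kwargs)
    )
    monkeypatch.setattr(
        sys, "argv", ["litgpt", "pretrain", "--model_name", "pythia-14m"]
    )
    from litgpt.__main__ import main
    main()  # returns without training if parsing succeeded
    assert captured.get("model_name") == "pythia-14m"
```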

I won't oppose this but I don't believe it's a good idea. You may merge if you think it's worth having.

@rasbt merged commit f334378 into main on May 3, 2024
9 checks passed
@rasbt deleted the readme-tests branch on May 3, 2024 at 15:23

Closes issue: Test readme commands as part of the CI (#1296)