Add missing runtime cuda libs for deepspeed #61
base: main
Conversation
Hi! This is the friendly automated conda-forge-linting service. I just wanted to let you know that I linted all conda-recipes in your PR.
@conda-forge-admin, please rerender
…conda-forge-pinning 2024.06.07.18.45.09
Thanks @shaowei-su! Would it be possible for you to provide a small script to test this out? I just want to make sure we've got the correct runtime dependencies listed.
Thanks folks! This is the minimal conda env for me to run DeepSpeed + Torch training:

and the minimal Torch training code using Ray Train:
Sorry to hold this up, I was hoping for a more minimal example... Ideally one that doesn't have a dozen other dependencies, and is short enough to add under the `test: commands:` section in `recipe/meta.yaml` 🙂
I'll try and get this test from upstream (https://github.com/microsoft/DeepSpeed/blob/v0.14.2/tests/accelerator/test_ds_init.py) to run locally on CUDA 12, this might take a while, as it'll require a lot of trial and error.
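For context, a `test:` section of the kind being suggested might look like the sketch below. This is hypothetical, not the recipe's actual test section; the script name and commands are assumptions based on the example discussed in this thread:

```yaml
test:
  requires:
    - pip
  files:
    - deepspeed_example.py   # hypothetical: the short self-contained script from this PR
  commands:
    - pip check
    - python deepspeed_example.py
```

Note that a GPU-dependent test like this would only pass on CI runners with CUDA available, which is part of why a minimal, dependency-light script matters here.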
```yaml
    - cuda-compiler
    - cuda-cudart-dev
    - libcusparse-dev
    - libcublas-dev
    - libcusolver-dev
```
These CUDA libraries are slightly different from the ones listed under `host` above. Just want to confirm that this is the correct list, i.e. that there are no extra ones which are not needed for JIT compilation?
Slightly shorter self-contained example, adapted from microsoft/DeepSpeed#2478 (comment):

```python
# deepspeed_example.py
from deepspeed.ops.transformer import (
    DeepSpeedInferenceConfig,
    DeepSpeedTransformerInference,
)
import torch

assert torch.cuda.is_available()
torch.cuda.set_device(device=0)

deepspeed_config = DeepSpeedInferenceConfig(
    hidden_size=32,
    intermediate_size=32 * 4,
    heads=8,
    num_hidden_layers=3,
    layer_norm_eps=1e-5,
    dtype=torch.float32,
)
transformer = DeepSpeedTransformerInference(config=deepspeed_config)
transformer.cuda()

batch_size = 1
seq_len = 10
inputs = torch.ones((batch_size, seq_len, 32), dtype=torch.float32, device="cuda")
input_mask = torch.ones(*inputs.shape[:2], dtype=bool, device="cuda")

output, _ = transformer.forward(input=inputs, input_mask=input_mask)
print(f"output: \n {output}")
```

Run using:

```shell
mamba create -n deepspeed-test python=3.12 deepspeed=0.14.0=cuda120*
mamba activate deepspeed-test
python deepspeed_example.py
```

Output:

```
[2024-06-12 21:06:33,316] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-12 21:06:33,448] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 32, 'intermediate_size': 128, 'heads': 8, 'num_hidden_layers': 3, 'dtype': torch.float32, 'pre_layer_norm': True, 'norm_type': <NormType.LayerNorm: 1>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False, 'use_triton': False, 'triton_autotune': False, 'num_kv': -1, 'rope_theta': 10000}
------------------------------------------------------
Free memory : 7.434021 (GigaBytes)
Total memory: 7.693115 (GigaBytes)
Requested memory: 0.042969 (GigaBytes)
Setting maximum total tokens (input + output) to 1024
WorkSpace: 0x7f86f2000000
------------------------------------------------------
output:
 tensor([[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         ...
         [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]],
       device='cuda:0')
```

(The tensor contains ten identical rows of ones; the repeats are elided above.)

@shaowei-su, I'm not seeing the …
Checklist

- [ ] Reset the build number to 0 (if the version changed)
- [ ] Re-rendered with the latest conda-smithy (Use the phrase `@conda-forge-admin, please rerender` in a comment in this PR for automated rerendering)

DeepSpeed relies on JIT to compile its CUDA operators, and missing the key headers will lead to failures at compile time.
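To illustrate the failure mode, here is a minimal sketch of a preflight check one could run before relying on DeepSpeed's JIT build. The header list is an assumption inferred from the `-dev` packages added in this PR, and `missing_cuda_headers` is a hypothetical helper, not part of DeepSpeed's API:

```python
import os

# Assumed set of headers DeepSpeed's JIT build of CUDA ops needs at runtime,
# inferred from the -dev runtime dependencies added in this PR.
REQUIRED_HEADERS = ["cuda_runtime.h", "cusparse.h", "cublas_v2.h", "cusolverDn.h"]


def missing_cuda_headers(cuda_home=None):
    """Return the subset of REQUIRED_HEADERS not found under <cuda_home>/include."""
    cuda_home = cuda_home or os.environ.get("CUDA_HOME", "/usr/local/cuda")
    include_dir = os.path.join(cuda_home, "include")
    return [
        h for h in REQUIRED_HEADERS
        if not os.path.isfile(os.path.join(include_dir, h))
    ]


if __name__ == "__main__":
    missing = missing_cuda_headers()
    if missing:
        print(f"JIT compilation of CUDA ops would fail; missing headers: {missing}")
    else:
        print("All required CUDA headers found.")
```

In a conda environment without the `-dev` packages, the headers are absent from the environment prefix, which is exactly when DeepSpeed's JIT compile step fails.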