Add missing runtime cuda libs for deepspeed #61

Open · wants to merge 3 commits into base: main
Conversation

@shaowei-su commented on Jun 7, 2024

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

DeepSpeed relies on JIT compilation for its CUDA operators, and missing the key headers leads to failures like

python3.10/site-packages/torch/include/ATen/cuda/CUDAContextLight.h:17:10: fatal error: cusolverDn.h: No such file or directory
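For context (my own sketch, not from this PR), the JIT build can be triggered directly by loading one of DeepSpeed's op builders, assuming the `deepspeed.ops.op_builder` API; the build compiles against torch's ATen CUDA headers, which is where the missing `cusolverDn.h` surfaces:

    # Hypothetical repro sketch: force DeepSpeed to JIT-compile one of its CUDA ops.
    # The compilation includes torch's ATen CUDA headers, so it fails early if
    # runtime headers such as cusolverDn.h are missing from the environment.
    from deepspeed.ops.op_builder import FusedAdamBuilder

    fused_adam = FusedAdamBuilder().load()  # triggers the ninja/nvcc JIT build
    print(fused_adam)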

@conda-forge-webservices (Contributor)

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@shaowei-su (Author)

@conda-forge-admin, please rerender

@weiji14 (Member) commented on Jun 7, 2024

Thanks @shaowei-su! Would it be possible for you to provide a small script to test this out? I just want to make sure we've got the correct runtime dependencies listed.

@shaowei-su (Author) commented on Jun 8, 2024

Thanks folks! This is the minimal conda env I need to run DeepSpeed + Torch training:

channels:
  - conda-forge
dependencies:
  - boto3
  - python~=3.10.0
  - pip
  - httpx
  - mlflow-skinny==2.11.3
  - transformers
  - accelerate
  - pytorch
  - datasets
  - sentencepiece
  - peft
  - trl
  - bitsandbytes
  - deepspeed
  - flash-attn
  - torchmetrics
  - dm-tree
  - optimum
  - cuda-compiler
  - cuda-cudart-dev
  - libcusparse-dev
  - libcublas-dev
  - libcusolver-dev
  - pip:
      - ray[default]

and minimal Torch training code using Ray Train:

    """
    Minimal Ray Train + DeepSpeed example adapted from
    https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py

    Fine-tune a BERT model with DeepSpeed ZeRO-3, Ray Train, and Ray Data
    """

    import json
    from tempfile import TemporaryDirectory

    import deepspeed
    import ray
    import ray.train
    import torch
    from datasets import load_dataset
    from deepspeed.accelerator import get_accelerator
    from ray.train import Checkpoint, DataConfig, RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer
    from torchmetrics.classification import BinaryAccuracy, BinaryF1Score
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        set_seed,
    )

    def train_func(config):
        """Your training function that will be launched on each worker."""

        # Unpack training configs
        set_seed(config["seed"])
        num_epochs = config["num_epochs"]
        train_batch_size = config["train_batch_size"]
        eval_batch_size = config["eval_batch_size"]

        # Instantiate the Model
        model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)

        # Prepare Ray Data Loaders
        # ====================================================
        train_ds = ray.train.get_dataset_shard("train")
        eval_ds = ray.train.get_dataset_shard("validation")

        tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

        def collate_fn(batch):
            outputs = tokenizer(
                list(batch["sentence1"]),
                list(batch["sentence2"]),
                truncation=True,
                padding="longest",
                return_tensors="pt",
            )
            outputs["labels"] = torch.LongTensor(batch["label"])
            return outputs

        train_dataloader = train_ds.iter_torch_batches(batch_size=train_batch_size, collate_fn=collate_fn)
        eval_dataloader = eval_ds.iter_torch_batches(batch_size=eval_batch_size, collate_fn=collate_fn)
        # ====================================================

        # Initialize DeepSpeed Engine
        model, optimizer, _, lr_scheduler = deepspeed.initialize(
            model=model,
            model_parameters=model.parameters(),
            config=config["deepspeed_config"],  # DeepSpeed config passed in via train_loop_config
        )
        device = get_accelerator().device_name(model.local_rank)

        # Initialize Evaluation Metrics
        f1 = BinaryF1Score().to(device)
        accuracy = BinaryAccuracy().to(device)

        for epoch in range(num_epochs):
            # Training
            model.train()
            for batch in train_dataloader:
                batch = {k: v.to(device) for k, v in batch.items()}
                outputs = model(**batch)
                loss = outputs.loss
                model.backward(loss)
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()

            # Evaluation
            model.eval()
            for batch in eval_dataloader:
                batch = {k: v.to(device) for k, v in batch.items()}
                with torch.no_grad():
                    outputs = model(**batch)
                predictions = outputs.logits.argmax(dim=-1)

                f1.update(predictions, batch["labels"])
                accuracy.update(predictions, batch["labels"])

            # torchmetrics will aggregate the metrics across all workers
            eval_metric = {
                "f1": f1.compute().item(),
                "accuracy": accuracy.compute().item(),
            }
            f1.reset()
            accuracy.reset()

            if model.global_rank == 0:
                print(f"epoch {epoch}:", eval_metric)

            # Report checkpoint and metrics to Ray Train
            # ==============================================================
            with TemporaryDirectory() as tmpdir:
                # Each worker saves its own checkpoint shard
                model.save_checkpoint(tmpdir)

                # Ensure all workers finished saving their checkpoint shard
                torch.distributed.barrier()

                # Report checkpoint shards from each worker in parallel
                ray.train.report(metrics=eval_metric, checkpoint=Checkpoint.from_directory(tmpdir))
            # ==============================================================

    deepspeed_config = {
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": 2e-5,
            },
        },
        "scheduler": {"type": "WarmupLR", "params": {"warmup_num_steps": 100}},
        "fp16": {"enabled": True},
        "bf16": {"enabled": False},  # Turn this on if using AMPERE GPUs.
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "none",
            },
            "offload_param": {
                "device": "none",
            },
        },
        "gradient_accumulation_steps": 1,
        "gradient_clipping": True,
        "steps_per_print": 10,
        "train_micro_batch_size_per_gpu": 16,
        "wall_clock_breakdown": False,
    }

    training_config = {
        "seed": 42,
        "num_epochs": 3,
        "train_batch_size": 16,
        "eval_batch_size": 32,
        "deepspeed_config": deepspeed_config,
    }

    # Prepare Ray Datasets
    hf_datasets = load_dataset("glue", "mrpc")
    ray_datasets = {
        "train": ray.data.from_huggingface(hf_datasets["train"]),
        "validation": ray.data.from_huggingface(hf_datasets["validation"]),
    }

    trainer = TorchTrainer(
        train_func,
        train_loop_config=training_config,
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
        datasets=ray_datasets,
        dataset_config=DataConfig(datasets_to_split=["train", "validation"]),
        run_config=RunConfig(storage_path="s3://xxx"),
    )

    result = trainer.fit()

    # Retrieve the best checkpoints from the results
    best_checkpoints = list(map(lambda tup: (str(tup[0]), tup[1]), result.best_checkpoints))
    print(json.dumps(best_checkpoints))

@weiji14 (Member) left a comment

Sorry to hold this up, I was hoping for a more minimal example... Ideally one that doesn't have a dozen other dependencies, and is short enough to add under the test: commands: section in recipe/meta.yaml 🙂

I'll try to get this test from upstream (https://github.com/microsoft/DeepSpeed/blob/v0.14.2/tests/accelerator/test_ds_init.py) running locally on CUDA 12; this might take a while, as it'll require a lot of trial and error.

Comment on lines +95 to +99
- cuda-compiler
- cuda-cudart-dev
- libcusparse-dev
- libcublas-dev
- libcusolver-dev
@weiji14 (Member) commented on Jun 12, 2024

These CUDA libraries are slightly different from the ones listed under host above. Just want to confirm that this is the correct list, i.e. that there are no extra ones which aren't needed for JIT compilation?
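(Not part of the recipe — just a rough sketch of how one could sanity-check the list: look for the headers that torch's CUDAContextLight.h pulls in inside the activated env. The targets/x86_64-linux/include path is an assumption about the conda-forge CUDA 12 package layout.)

    # Hypothetical sanity check: verify that the CUDA headers needed for
    # DeepSpeed's JIT build exist inside the activated conda environment.
    import os

    prefix = os.environ["CONDA_PREFIX"]
    headers = ["cuda_runtime.h", "cublas_v2.h", "cusparse.h", "cusolverDn.h"]
    include_dirs = [
        os.path.join(prefix, "include"),
        # assumed layout used by conda-forge's CUDA 12 packages
        os.path.join(prefix, "targets", "x86_64-linux", "include"),
    ]
    for header in headers:
        found = any(os.path.exists(os.path.join(d, header)) for d in include_dirs)
        print(f"{header}: {'found' if found else 'MISSING'}")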

@weiji14 (Member) commented on Jun 12, 2024

A slightly shorter, self-contained example, adapted from microsoft/DeepSpeed#2478 (comment):

# deepspeed_example.py
from deepspeed.ops.transformer import (
    DeepSpeedInferenceConfig,
    DeepSpeedTransformerInference,
)
import torch

assert torch.cuda.is_available()
torch.cuda.set_device(device=0)

deepspeed_config = DeepSpeedInferenceConfig(
    hidden_size=32,
    intermediate_size=32 * 4,
    heads=8,
    num_hidden_layers=3,
    layer_norm_eps=1e-5,
    dtype=torch.float32,
)
transformer = DeepSpeedTransformerInference(config=deepspeed_config)
transformer.cuda()

batch_size = 1
seq_len = 10
inputs = torch.ones((batch_size, seq_len, 32), dtype=torch.float32, device="cuda")
input_mask = torch.ones(*inputs.shape[:2], dtype=bool, device="cuda")

output, _ = transformer.forward(input=inputs, input_mask=input_mask)
print(f"output: \n {output}")

Run using:

mamba create -n deepspeed-test python=3.12 deepspeed=0.14.0=cuda120*
mamba activate deepspeed-test
python deepspeed_example.py

Output:

[2024-06-12 21:06:33,316] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-12 21:06:33,448] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 32, 'intermediate_size': 128, 'heads': 8, 'num_hidden_layers': 3, 'dtype': torch.float32, 'pre_layer_norm': True, 'norm_type': <NormType.LayerNorm: 1>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False, 'use_triton': False, 'triton_autotune': False, 'num_kv': -1, 'rope_theta': 10000}
------------------------------------------------------
Free memory : 7.434021 (GigaBytes)  
Total memory: 7.693115 (GigaBytes)  
Requested memory: 0.042969 (GigaBytes) 
Setting maximum total tokens (input + output) to 1024 
WorkSpace: 0x7f86f2000000 
------------------------------------------------------
output: 
 tensor([[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]],
       device='cuda:0')

@shaowei-su, I'm not seeing the fatal error: cusolverDn.h: No such file or directory message you mentioned above. Could you try to isolate which layer or configuration is causing that error? Or maybe @loadams has some insight into where that error might be stemming from?
