Add missing runtime cuda libs for deepspeed #61

Open · wants to merge 3 commits into base: main
Conversation

@shaowei-su commented on Jun 7, 2024

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

DeepSpeed relies on JIT compilation for its CUDA operators, and missing the key headers leads to failures like

python3.10/site-packages/torch/include/ATen/cuda/CUDAContextLight.h:17:10: fatal error: cusolverDn.h: No such file or directory
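For context (my own sketch, not from this PR), the JIT build can be triggered directly by loading one of DeepSpeed's op builders, assuming the `deepspeed.ops.op_builder` API; the build compiles against torch's ATen CUDA headers, which is where the missing `cusolverDn.h` surfaces:

    # Hypothetical repro sketch: force DeepSpeed to JIT-compile one of its CUDA ops.
    # The compilation includes torch's ATen CUDA headers, so it fails early if
    # runtime headers such as cusolverDn.h are missing from the environment.
    from deepspeed.ops.op_builder import FusedAdamBuilder

    fused_adam = FusedAdamBuilder().load()  # triggers the ninja/nvcc JIT build
    print(fused_adam)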

@conda-forge-webservices (Contributor)

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@shaowei-su (Author)

@conda-forge-admin, please rerender

@weiji14 (Member) commented on Jun 7, 2024

Thanks @shaowei-su! Would it be possible for you to provide a small script to test this out? I just want to make sure we've got the correct runtime dependencies listed.

@shaowei-su (Author) commented on Jun 8, 2024

Thanks folks! This is the minimal conda env I need to run DeepSpeed + Torch training:

channels:
  - conda-forge
dependencies:
  - boto3
  - python~=3.10.0
  - pip
  - httpx
  - mlflow-skinny==2.11.3
  - transformers
  - accelerate
  - pytorch
  - datasets
  - sentencepiece
  - peft
  - trl
  - bitsandbytes
  - deepspeed
  - flash-attn
  - torchmetrics
  - dm-tree
  - optimum
  - cuda-compiler
  - cuda-cudart-dev
  - libcusparse-dev
  - libcublas-dev
  - libcusolver-dev
  - pip:
      - ray[default]

and minimal Torch training code using Ray Train:

    """
    Minimal Ray Train + DeepSpeed example adapted from
    https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py

    Fine-tune a BERT model with DeepSpeed ZeRO-3, Ray Train, and Ray Data
    """

    import json
    from tempfile import TemporaryDirectory

    import deepspeed
    import ray
    import ray.train
    import torch
    from datasets import load_dataset
    from deepspeed.accelerator import get_accelerator
    from ray.train import Checkpoint, DataConfig, RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer
    from torchmetrics.classification import BinaryAccuracy, BinaryF1Score
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        set_seed,
    )

    def train_func(config):
        """Your training function that will be launched on each worker."""

        # Unpack training configs
        set_seed(config["seed"])
        num_epochs = config["num_epochs"]
        train_batch_size = config["train_batch_size"]
        eval_batch_size = config["eval_batch_size"]

        # Instantiate the Model
        model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)

        # Prepare Ray Data Loaders
        # ====================================================
        train_ds = ray.train.get_dataset_shard("train")
        eval_ds = ray.train.get_dataset_shard("validation")

        tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

        def collate_fn(batch):
            outputs = tokenizer(
                list(batch["sentence1"]),
                list(batch["sentence2"]),
                truncation=True,
                padding="longest",
                return_tensors="pt",
            )
            outputs["labels"] = torch.LongTensor(batch["label"])
            return outputs

        train_dataloader = train_ds.iter_torch_batches(batch_size=train_batch_size, collate_fn=collate_fn)
        eval_dataloader = eval_ds.iter_torch_batches(batch_size=eval_batch_size, collate_fn=collate_fn)
        # ====================================================

        # Initialize DeepSpeed Engine
        model, optimizer, _, lr_scheduler = deepspeed.initialize(
            model=model,
            model_parameters=model.parameters(),
            config=config["deepspeed_config"],  # DeepSpeed config passed in via train_loop_config
        )
        device = get_accelerator().device_name(model.local_rank)

        # Initialize Evaluation Metrics
        f1 = BinaryF1Score().to(device)
        accuracy = BinaryAccuracy().to(device)

        for epoch in range(num_epochs):
            # Training
            model.train()
            for batch in train_dataloader:
                batch = {k: v.to(device) for k, v in batch.items()}
                outputs = model(**batch)
                loss = outputs.loss
                model.backward(loss)
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()

            # Evaluation
            model.eval()
            for batch in eval_dataloader:
                batch = {k: v.to(device) for k, v in batch.items()}
                with torch.no_grad():
                    outputs = model(**batch)
                predictions = outputs.logits.argmax(dim=-1)

                f1.update(predictions, batch["labels"])
                accuracy.update(predictions, batch["labels"])

            # torchmetrics will aggregate the metrics across all workers
            eval_metric = {
                "f1": f1.compute().item(),
                "accuracy": accuracy.compute().item(),
            }
            f1.reset()
            accuracy.reset()

            if model.global_rank == 0:
                print(f"epoch {epoch}:", eval_metric)

            # Report checkpoint and metrics to Ray Train
            # ==============================================================
            with TemporaryDirectory() as tmpdir:
                # Each worker saves its own checkpoint shard
                model.save_checkpoint(tmpdir)

                # Ensure all workers finished saving their checkpoint shard
                torch.distributed.barrier()

                # Report checkpoint shards from each worker in parallel
                ray.train.report(metrics=eval_metric, checkpoint=Checkpoint.from_directory(tmpdir))
            # ==============================================================

    deepspeed_config = {
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": 2e-5,
            },
        },
        "scheduler": {"type": "WarmupLR", "params": {"warmup_num_steps": 100}},
        "fp16": {"enabled": True},
        "bf16": {"enabled": False},  # Turn this on if using AMPERE GPUs.
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "none",
            },
            "offload_param": {
                "device": "none",
            },
        },
        "gradient_accumulation_steps": 1,
        "gradient_clipping": True,
        "steps_per_print": 10,
        "train_micro_batch_size_per_gpu": 16,
        "wall_clock_breakdown": False,
    }

    training_config = {
        "seed": 42,
        "num_epochs": 3,
        "train_batch_size": 16,
        "eval_batch_size": 32,
        "deepspeed_config": deepspeed_config,
    }

    # Prepare Ray Datasets
    hf_datasets = load_dataset("glue", "mrpc")
    ray_datasets = {
        "train": ray.data.from_huggingface(hf_datasets["train"]),
        "validation": ray.data.from_huggingface(hf_datasets["validation"]),
    }

    trainer = TorchTrainer(
        train_func,
        train_loop_config=training_config,
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
        datasets=ray_datasets,
        dataset_config=DataConfig(datasets_to_split=["train", "validation"]),
        run_config=RunConfig(storage_path="s3://xxx"),
    )

    result = trainer.fit()

    # Retrieve the best checkpoints from the results
    best_checkpoints = list(map(lambda tup: (str(tup[0]), tup[1]), result.best_checkpoints))
    print(json.dumps(best_checkpoints))

@weiji14 (Member) left a comment

Sorry to hold this up, I was hoping for a more minimal example... Ideally one that doesn't have a dozen other dependencies, and is short enough to add under the test: commands: section in recipe/meta.yaml 🙂

I'll try to get this test from upstream (https://github.com/microsoft/DeepSpeed/blob/v0.14.2/tests/accelerator/test_ds_init.py) running locally on CUDA 12; this might take a while, as it'll require a lot of trial and error.

Comment on lines +95 to +99
- cuda-compiler
- cuda-cudart-dev
- libcusparse-dev
- libcublas-dev
- libcusolver-dev
@weiji14 (Member) commented on Jun 12, 2024

These CUDA libraries are slightly different from the ones listed under host above. Just want to confirm that this is the correct list, i.e. that there are no extra ones which aren't needed for JIT compilation?
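(Not part of the recipe — just a rough sketch of how one could sanity-check the list: look for the headers that torch's CUDAContextLight.h pulls in inside the activated env. The targets/x86_64-linux/include path is an assumption about the conda-forge CUDA 12 package layout.)

    # Hypothetical sanity check: verify that the CUDA headers needed for
    # DeepSpeed's JIT build exist inside the activated conda environment.
    import os

    prefix = os.environ["CONDA_PREFIX"]
    headers = ["cuda_runtime.h", "cublas_v2.h", "cusparse.h", "cusolverDn.h"]
    include_dirs = [
        os.path.join(prefix, "include"),
        # assumed layout used by conda-forge's CUDA 12 packages
        os.path.join(prefix, "targets", "x86_64-linux", "include"),
    ]
    for header in headers:
        found = any(os.path.exists(os.path.join(d, header)) for d in include_dirs)
        print(f"{header}: {'found' if found else 'MISSING'}")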

@weiji14 (Member) commented on Jun 12, 2024

A slightly shorter, self-contained example, adapted from microsoft/DeepSpeed#2478 (comment):

# deepspeed_example.py
from deepspeed.ops.transformer import (
    DeepSpeedInferenceConfig,
    DeepSpeedTransformerInference,
)
import torch

assert torch.cuda.is_available()
torch.cuda.set_device(device=0)

deepspeed_config = DeepSpeedInferenceConfig(
    hidden_size=32,
    intermediate_size=32 * 4,
    heads=8,
    num_hidden_layers=3,
    layer_norm_eps=1e-5,
    dtype=torch.float32,
)
transformer = DeepSpeedTransformerInference(config=deepspeed_config)
transformer.cuda()

batch_size = 1
seq_len = 10
inputs = torch.ones((batch_size, seq_len, 32), dtype=torch.float32, device="cuda")
input_mask = torch.ones(*inputs.shape[:2], dtype=bool, device="cuda")

output, _ = transformer.forward(input=inputs, input_mask=input_mask)
print(f"output: \n {output}")

Run using:

mamba create -n deepspeed-test python=3.12 deepspeed=0.14.0=cuda120*
mamba activate deepspeed-test
python deepspeed_example.py

Output:

[2024-06-12 21:06:33,316] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-12 21:06:33,448] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 32, 'intermediate_size': 128, 'heads': 8, 'num_hidden_layers': 3, 'dtype': torch.float32, 'pre_layer_norm': True, 'norm_type': <NormType.LayerNorm: 1>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False, 'use_triton': False, 'triton_autotune': False, 'num_kv': -1, 'rope_theta': 10000}
------------------------------------------------------
Free memory : 7.434021 (GigaBytes)  
Total memory: 7.693115 (GigaBytes)  
Requested memory: 0.042969 (GigaBytes) 
Setting maximum total tokens (input + output) to 1024 
WorkSpace: 0x7f86f2000000 
------------------------------------------------------
output: 
 tensor([[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]],
       device='cuda:0')

@shaowei-su, I'm not seeing the fatal error: cusolverDn.h: No such file or directory message you mentioned above. Could you try to isolate which layer or configuration is causing that error? Or maybe @loadams has some insight into where that error might be stemming from?
