RuntimeError: CUDA error: unspecified launch failure #3

Open
maffinnn opened this issue Oct 8, 2023 · 0 comments
maffinnn commented Oct 8, 2023

Hi, I'm trying to train the zero-shot cost model and ran into the following issue. Could anyone help?
For reference, I'm running on Google Colab with a T4 GPU.
!TORCH_USE_CUDA_DSA=1 CUDA_LAUNCH_BLOCKING=1 python3 train.py --train_model --workload_runs ../zero-shot-data/runs/deepdb_augmented/airline/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/airline/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/ssb/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/ssb/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/tpc_h/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/tpc_h/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/walmart/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/walmart/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/financial/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/financial/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/basketball/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/basketball/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/accidents/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/accidents/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/movielens/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/movielens/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/baseball/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/baseball/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/hepatitis/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/hepatitis/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/tournament/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/tournament/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/credit/index_workload_100k_s2_c8220.json 
../zero-shot-data/runs/deepdb_augmented/credit/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/employee/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/employee/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/consumer/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/consumer/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/geneea/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/geneea/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/genome/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/genome/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/carcinogenesis/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/carcinogenesis/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/seznam/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/seznam/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/fhnk/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/fhnk/workload_100k_s1_c8220.json --test_workload_runs ../zero-shot-data/runs/deepdb_augmented/imdb/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/imdb/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/imdb/synthetic_c8220.json ../zero-shot-data/runs/deepdb_augmented/imdb/scale_c8220.json ../zero-shot-data/runs/deepdb_augmented/imdb/job-light_c8220.json --statistics_file ../zero-shot-data/runs/deepdb_augmented/statistics_workload_combined.json --target ../zero-shot-data/evaluation/db_generalization_tune_est/ --hyperparameter_path setup/tuned_hyperparameters/tune_est_best_config.json --max_epoch_tuples 100000 --loss_class_name QLoss --device cuda:0 --filename_model imdb_0 --num_workers 16 --database postgres --seed 0

Reading hyperparameters from setup/tuned_hyperparameters/tune_est_best_config.json
No of Plans: 190000
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
No of Plans: 5000
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
No of Plans: 5000
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
No of Plans: 4565
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
No of Plans: 382
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
No of Plans: 50
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
PostgresZeroShotModel(
  (loss_fxn): QLoss()
  (fcout): Sequential(
    (0): FcLayer(
      (layers): Sequential(
        (0): Linear(in_features=128, out_features=192, bias=True)
        (1): LeakyReLU(negative_slope=0.01, inplace=True)
      )
    )
    (1): FcLayer(
      (layers): Sequential(
        (0): Linear(in_features=192, out_features=192, bias=True)
        (1): LeakyReLU(negative_slope=0.01, inplace=True)
      )
    )
    (2): FcLayer(
      (layers): Sequential(
        (0): Linear(in_features=192, out_features=192, bias=True)
        (1): LeakyReLU(negative_slope=0.01, inplace=True)
      )
    )
    (3): FcLayer(
      (layers): Sequential(
        (0): Linear(in_features=192, out_features=192, bias=True)
        (1): LeakyReLU(negative_slope=0.01, inplace=True)
      )
    )
    (4): FcLayer(
      (layers): Sequential(
        (0): Linear(in_features=192, out_features=1, bias=True)
        (1): LeakyReLU(negative_slope=0.01, inplace=True)
      )
    )
  )
  (tree_models): ModuleDict(
    (column_output_column): MscnConv(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=256, out_features=153, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=153, out_features=153, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=153, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
    )
    (to_plan): MscnConv(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=256, out_features=153, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=153, out_features=153, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=153, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
    )
    (intra_plan): MscnConv(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=256, out_features=153, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=153, out_features=153, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=153, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
    )
    (intra_pred): MscnConv(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=256, out_features=153, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=153, out_features=153, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=153, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
    )
  )
  (node_type_encoders): ModuleDict(
    (column): NodeTypeEncoder(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=14, out_features=21, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=21, out_features=21, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=21, out_features=21, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (3): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=21, out_features=21, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (4): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=21, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
      (embeddings): ModuleDict(
        (data_type): EmbeddingInitializer(
          (embed): Embedding(10, 10)
          (do): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (table): NodeTypeEncoder(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=2, out_features=3, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=3, out_features=3, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=3, out_features=3, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (3): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=3, out_features=3, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (4): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=3, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
      (embeddings): ModuleDict()
    )
    (output_column): NodeTypeEncoder(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=5, out_features=7, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=7, out_features=7, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=7, out_features=7, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (3): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=7, out_features=7, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (4): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=7, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
      (embeddings): ModuleDict(
        (aggregation): EmbeddingInitializer(
          (embed): Embedding(5, 5)
          (do): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (filter_column): NodeTypeEncoder(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=20, out_features=30, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=30, out_features=30, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=30, out_features=30, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (3): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=30, out_features=30, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (4): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=30, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
      (embeddings): ModuleDict(
        (operator): EmbeddingInitializer(
          (embed): Embedding(5, 5)
          (do): Dropout(p=0.0, inplace=False)
        )
        (data_type): EmbeddingInitializer(
          (embed): Embedding(10, 10)
          (do): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (plan): NodeTypeEncoder(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=24, out_features=36, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=36, out_features=36, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=36, out_features=36, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (3): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=36, out_features=36, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (4): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=36, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
      (embeddings): ModuleDict(
        (op_name): EmbeddingInitializer(
          (embed): Embedding(20, 20)
          (do): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (logical_pred): NodeTypeEncoder(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=6, out_features=9, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=9, out_features=9, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=9, out_features=9, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (3): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=9, out_features=9, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (4): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=9, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
      (embeddings): ModuleDict(
        (operator): EmbeddingInitializer(
          (embed): Embedding(5, 5)
          (do): Dropout(p=0.0, inplace=False)
        )
      )
    )
  )
)
No valid checkpoint found [Errno 2] No such file or directory: '../zero-shot-data/evaluation/db_generalization_tune_est/imdb_0.pt'
Epoch 0
  0% 0/631 [00:12<?, ?it/s]
Traceback (most recent call last):
  File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/train.py", line 54, in <module>
    train_readout_hyperparams(args.workload_runs, args.test_workload_runs, args.statistics_file, args.target,
  File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/models/training/train.py", line 424, in train_readout_hyperparams
    train_model(workload_runs, test_workload_runs, statistics_file, target_dir, filename_model,
  File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/models/training/train.py", line 213, in train_model
    train_epoch(epoch_stats, train_loader, model, optimizer, max_epoch_tuples)
  File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/models/training/train.py", line 29, in train_epoch
    input_model, label, sample_idxs = custom_batch_to(batch, model.device, model.label_norm)
  File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/models/training/utils.py", line 39, in batch_to
    recursive_to(features, device)
  File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/models/training/utils.py", line 24, in recursive_to
    recursive_to(v, device)
  File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/models/training/utils.py", line 21, in recursive_to
    iterable.to(device)
RuntimeError: CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
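
The traceback ends inside the recursive `recursive_to` call in `models/training/utils.py`, so it does not show which element of the batch triggered the failure. A minimal sketch of one way to narrow that down follows. It is not code from the repository: `find_failing_leaf` is a hypothetical helper, and it assumes the batch is a nesting of dicts, lists, and tuples whose leaves expose a `.to(device)` method.

```python
from typing import Any, List, Tuple


def find_failing_leaf(obj: Any, device: str, path: str = "batch") -> List[Tuple[str, str]]:
    """Try to move every leaf of a nested dict/list/tuple to `device`.

    Hypothetical debugging helper (not part of the repository).
    Returns (path, error message) for each leaf whose .to() raised RuntimeError.
    """
    failures: List[Tuple[str, str]] = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            failures += find_failing_leaf(value, device, f"{path}[{key!r}]")
    elif isinstance(obj, (list, tuple)):
        for i, value in enumerate(obj):
            failures += find_failing_leaf(value, device, f"{path}[{i}]")
    elif hasattr(obj, "to"):
        try:
            obj.to(device)
        except RuntimeError as exc:
            # Record where in the nested structure the transfer failed.
            failures.append((path, str(exc)))
    # Leaves without a .to() method (ints, strings, ...) are skipped.
    return failures
```

Calling `find_failing_leaf(features, "cuda:0")` on the first batch would report the path of each leaf that could not be moved. Once a CUDA launch failure has occurred, later CUDA calls in the same process usually fail as well, so only the first reported entry is likely to be informative.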