Problem Description
Hi, I am getting an error when running the tune_gemm.py script.
I am inside a Docker container with access to 8 AMD MI300X GPUs (all of them show up in rocm-smi), and I have no problem running Triton otherwise.
The command I am running is: ./tune_gemm.py --gemm_size_file input.yaml --ngpus 8 --jobs 32 --verbose
The content of input.yaml is:
- {'M': 16, 'N': 13312, 'K': 16384, 'rowMajorA': 'T', 'rowMajorB': 'T'}
Here is the produced stack trace (run with --jobs 1 to reduce its size, but it is similar with --jobs 32):
SIZE: 16 13312 16384 TT nConfigs: 880 compile time: 0:00:17.674805
profiling /root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py on GPU 0
RPL: on '241021_230215' from '/opt/rocm-6.2.0' in '/root/triton/python/perf-kernels/tools/tune_gemm'
RPL: profiling '"python" "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_241021_230215_108023'
RPL: result dir '/tmp/rpl_data_241021_230215_108023/input_results_241021_230215'
ROCProfiler: input from "/tmp/rpl_data_241021_230215_108023/input.xml"
0 metrics
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 332, in _lazy_init
queued_call()
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/random.py", line 126, in cb
default_generator = torch.cuda.default_generators[i]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: tuple index out of range
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29961, in <module>
sys.exit(main())
^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29958, in main
test_gemm(16, 13312, 16384, rotating_buffer_size, 0)
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 21131, in test_gemm
tensors = gen_rotating_tensors(M, N, K, 'fp16', False, 'fp16', False, 'fp16',
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 333, in gen_rotating_tensors
in_a, in_a_fp16 = gen_input(M, K, dtype_a, need_Trans_a, 1, init_type, device='cuda')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 295, in gen_input
raw_data = init_by_size_and_type((N, M) if needTrans else (M, N), torch.float32, init_type)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 290, in init_by_size_and_type
temp = torch.randn(size, dtype=dtype, device='cuda')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 338, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: tuple index out of range
CUDA call was originally invoked at:
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29961, in <module>
sys.exit(main())
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29958, in main
test_gemm(16, 13312, 16384, rotating_buffer_size, 0)
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 21131, in test_gemm
tensors = gen_rotating_tensors(M, N, K, 'fp16', False, 'fp16', False, 'fp16',
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 333, in gen_rotating_tensors
in_a, in_a_fp16 = gen_input(M, K, dtype_a, need_Trans_a, 1, init_type, device='cuda')
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 269, in gen_input
torch.manual_seed(seed)
File "/opt/conda/lib/python3.11/site-packages/torch/random.py", line 46, in manual_seed
torch.cuda.manual_seed_all(seed)
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/random.py", line 129, in manual_seed_all
_lazy_call(cb, seed_all=True)
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 256, in _lazy_call
_lazy_seed_tracker.queue_seed_all(callable, traceback.format_stack())
ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_241021_230215_108023/input_results_241021_230215
running rocprof --stats -o results_0.csv python /root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py one more time
RPL: on '241021_230220' from '/opt/rocm-6.2.0' in '/root/triton/python/perf-kernels/tools/tune_gemm'
RPL: profiling '"python" "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_241021_230220_108139'
RPL: result dir '/tmp/rpl_data_241021_230220_108139/input_results_241021_230220'
ROCProfiler: input from "/tmp/rpl_data_241021_230220_108139/input.xml"
0 metrics
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 332, in _lazy_init
queued_call()
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/random.py", line 126, in cb
default_generator = torch.cuda.default_generators[i]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: tuple index out of range
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29961, in <module>
sys.exit(main())
^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29958, in main
test_gemm(16, 13312, 16384, rotating_buffer_size, 0)
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 21131, in test_gemm
tensors = gen_rotating_tensors(M, N, K, 'fp16', False, 'fp16', False, 'fp16',
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 333, in gen_rotating_tensors
in_a, in_a_fp16 = gen_input(M, K, dtype_a, need_Trans_a, 1, init_type, device='cuda')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 295, in gen_input
raw_data = init_by_size_and_type((N, M) if needTrans else (M, N), torch.float32, init_type)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 290, in init_by_size_and_type
temp = torch.randn(size, dtype=dtype, device='cuda')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 338, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: tuple index out of range
CUDA call was originally invoked at:
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29961, in <module>
sys.exit(main())
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29958, in main
test_gemm(16, 13312, 16384, rotating_buffer_size, 0)
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 21131, in test_gemm
tensors = gen_rotating_tensors(M, N, K, 'fp16', False, 'fp16', False, 'fp16',
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 333, in gen_rotating_tensors
in_a, in_a_fp16 = gen_input(M, K, dtype_a, need_Trans_a, 1, init_type, device='cuda')
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 269, in gen_input
torch.manual_seed(seed)
File "/opt/conda/lib/python3.11/site-packages/torch/random.py", line 46, in manual_seed
torch.cuda.manual_seed_all(seed)
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/random.py", line 129, in manual_seed_all
_lazy_call(cb, seed_all=True)
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 256, in _lazy_call
_lazy_seed_tracker.queue_seed_all(callable, traceback.format_stack())
ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_241021_230220_108139/input_results_241021_230220
Process Process-1:
Traceback (most recent call last):
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/utils.py", line 36, in run_bash_command_wrapper
run_bash_command(commandstring, capture)
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/utils.py", line 47, in run_bash_command
proc = subprocess.run(commandstring, shell=True, check=True, executable='/bin/bash')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'rocprof --stats -o results_0.csv python /root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 194, in profile_batch_kernels
run_bash_command_wrapper(
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/utils.py", line 40, in run_bash_command_wrapper
run_bash_command(commandstring, capture)
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/utils.py", line 47, in run_bash_command
proc = subprocess.run(commandstring, shell=True, check=True, executable='/bin/bash')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'rocprof --stats -o results_0.csv python /root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py' returned non-zero exit status 1.
profile time: 0:00:11.341448
Traceback (most recent call last):
File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 714, in <module>
sys.exit(main())
^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 644, in main
minTime, bestConfig, compile_time, profile_time, post_time = tune_gemm_config(
^^^^^^^^^^^^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 242, in tune_gemm_config
df_prof = [pd.read_csv(f"results_{i}.csv") for i in range(jobs)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 242, in <listcomp>
df_prof = [pd.read_csv(f"results_{i}.csv") for i in range(jobs)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
return _read(filepath_or_buffer, kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 620, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
self._engine = self._make_engine(f, self.engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
self.handles = get_handle(
^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/pandas/io/common.py", line 873, in get_handle
handle = open(
^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'results_0.csv'
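For what it's worth, the innermost failure is torch.cuda.default_generators[i] raising IndexError, which as far as I understand would only happen if PyTorch initializes with zero visible devices inside the process that rocprof launches (results_0.csv is then never written, which explains the final FileNotFoundError). A minimal check I would run to confirm that — just a sketch under that assumption, with a hypothetical file name check_devices.py, not something tune_gemm.py provides:

# check_devices.py -- diagnostic sketch: run it once directly and once as
# "rocprof --stats python check_devices.py", then compare the outputs.
import torch

print("cuda available:", torch.cuda.is_available())  # expect True on MI300X
print("device count:", torch.cuda.device_count())    # 8 expected; 0 would explain the crash
# torch.cuda.manual_seed_all indexes this per-device tuple, so an empty
# tuple reproduces the "tuple index out of range" seen in the trace
print("default generators:", len(torch.cuda.default_generators))

If the device count is 0 only under rocprof, the problem would be in how the profiled child process sees the GPUs rather than in tune_gemm.py itself.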
Could you help me with this issue, please? I would like to tune the matmul so that I can continue working on AMD with Triton. Thanks.
Operating System
Ubuntu 22.04.4 LTS
CPU
AMD EPYC 9654 96-Core Processor
GPU
AMD Instinct MI300X
ROCm Version
ROCm 6.2.2, ROCm 6.2.0
ROCm Component
No response
Steps to Reproduce
Run the tune_gemm.py script inside a Docker container with the command above. My Python version is 3.11.10 and Triton is 3.0.0.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response