
[Issue]: Error when running tune_gemm.py #650

Open
remi-or opened this issue Oct 21, 2024 · 1 comment

Comments


remi-or commented Oct 21, 2024

Problem Description

Hi, I am getting an error when running the tune_gemm.py script.
I am inside a Docker container with access to 8 AMD MI300X GPUs; they show up in rocm-smi, and I have no problem running Triton otherwise.
The command I am running is: ./tune_gemm.py --gemm_size_file input.yaml --ngpus 8 --jobs 32 --verbose
The content of input.yaml is: - {'M': 16, 'N': 13312, 'K': 16384, 'rowMajorA': 'T', 'rowMajorB': 'T'}
Here is the stack trace it produces (run with --jobs 1 to keep it short; the output is similar with --jobs 32):

SIZE: 16 13312 16384 TT nConfigs: 880 compile time: 0:00:17.674805                                                                                                                                                                     
profiling /root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py on GPU 0                                                                                                                        
RPL: on '241021_230215' from '/opt/rocm-6.2.0' in '/root/triton/python/perf-kernels/tools/tune_gemm'                                                                                                                                   
RPL: profiling '"python" "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py"'                                                                                                               
RPL: input file ''                                                                                                                                                                                                                     
RPL: output dir '/tmp/rpl_data_241021_230215_108023'                                                                                                                                                                                   
RPL: result dir '/tmp/rpl_data_241021_230215_108023/input_results_241021_230215'                                                                                                                                                       
ROCProfiler: input from "/tmp/rpl_data_241021_230215_108023/input.xml"                                                                                                                                                                 
  0 metrics                                                                                                                                                                                                                            
Traceback (most recent call last):                                                                                                                                                                                                     
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 332, in _lazy_init                                                                                                                                       
    queued_call()                                                                                                                                                                                                                      
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/random.py", line 126, in cb                                                                                                                                                 
    default_generator = torch.cuda.default_generators[i]                                                                                                                                                                               
                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^                                                                                                                                                                               
IndexError: tuple index out of range                                                                                                                                                                                                   
                                                                                                                                                                                                                                       
The above exception was the direct cause of the following exception:                                                                                                                                                                   
                                                                                                                                                                                                                                       
Traceback (most recent call last):                                                                                                                                                                                                     
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29961, in <module>                                                                                                         
    sys.exit(main())                                                                                                                                                                                                                   
             ^^^^^^                                                                                                                                                                                                                    
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29958, in main                                                                                                             
    test_gemm(16, 13312, 16384, rotating_buffer_size, 0)                                                                                                                                                                               
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 21131, in test_gemm                                                                                                        
    tensors = gen_rotating_tensors(M, N, K, 'fp16', False, 'fp16', False, 'fp16',                                                                                                                                                      
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                      
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 333, in gen_rotating_tensors                                                                                                                              
    in_a, in_a_fp16 = gen_input(M, K, dtype_a, need_Trans_a, 1, init_type, device='cuda')                                                                                                                                              
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                              
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 295, in gen_input                                                                                                                                         
    raw_data = init_by_size_and_type((N, M) if needTrans else (M, N), torch.float32, init_type)                                                                                                                                        
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                        
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 290, in init_by_size_and_type                                                                                                                             
    temp = torch.randn(size, dtype=dtype, device='cuda')                                                                                                                                                                               
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                               
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 338, in _lazy_init                                                                                                                                       
    raise DeferredCudaCallError(msg) from e                                                                                                                                                                                            
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: tuple index out of range                                                                                                                       
                                                                                                                                                                                                                                       
CUDA call was originally invoked at:                                                                                                                                                                                                   
                                                                                                                                                                                                                                       
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29961, in <module>                                                                                                         
    sys.exit(main())                                                                                                                                                                                                                   
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29958, in main                                                                                                             
    test_gemm(16, 13312, 16384, rotating_buffer_size, 0)                                                                                                                                                                               
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 21131, in test_gemm                                                                                                        
    tensors = gen_rotating_tensors(M, N, K, 'fp16', False, 'fp16', False, 'fp16',                                                                                                                                                      
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 333, in gen_rotating_tensors                                                                                                                              
    in_a, in_a_fp16 = gen_input(M, K, dtype_a, need_Trans_a, 1, init_type, device='cuda')                                                                                                                                              
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 269, in gen_input                                                                                                                                         
    torch.manual_seed(seed)                                                                                                                                                                                                            
  File "/opt/conda/lib/python3.11/site-packages/torch/random.py", line 46, in manual_seed                                                                                                                                              
    torch.cuda.manual_seed_all(seed)                                                                                                                                                                                                   
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/random.py", line 129, in manual_seed_all                                                                                                                                    
    _lazy_call(cb, seed_all=True)                                                                                                                                                                                                      
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 256, in _lazy_call                                                                                                                                       
    _lazy_seed_tracker.queue_seed_all(callable, traceback.format_stack())   

ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_241021_230215_108023/input_results_241021_230215                                                                                                                     
running rocprof --stats -o results_0.csv python /root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py one more time                                                                             
RPL: on '241021_230220' from '/opt/rocm-6.2.0' in '/root/triton/python/perf-kernels/tools/tune_gemm'                                                                                                                                   
RPL: profiling '"python" "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py"'                                                                                                               
RPL: input file ''                                                                                                                                                                                                                     
RPL: output dir '/tmp/rpl_data_241021_230220_108139'                                                                                                                                                                                   
RPL: result dir '/tmp/rpl_data_241021_230220_108139/input_results_241021_230220'                                                                                                                                                       
ROCProfiler: input from "/tmp/rpl_data_241021_230220_108139/input.xml"                                                                                                                                                                 
  0 metrics                                                                                                                                                                                                                            
Traceback (most recent call last):                                                                                                                                                                                                     
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 332, in _lazy_init                                                                                                                                       
    queued_call()                                                                                                                                                                                                                      
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/random.py", line 126, in cb                                                                                                                                                 
    default_generator = torch.cuda.default_generators[i]                                                                                                                                                                               
                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^                                                                                                                                                                               
IndexError: tuple index out of range                                                                                                                                                                                                   
                                                                                                                                                                                                                                       
The above exception was the direct cause of the following exception:                                                                                                                                                                   
                                                                                                                                                                                                                                       
Traceback (most recent call last):                                                                                                                                                                                                     
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29961, in <module>                                                                                                         
    sys.exit(main())                                                                                                                                                                                                                   
             ^^^^^^                                                                                                                                                                                                                    
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29958, in main                                                                                                             
    test_gemm(16, 13312, 16384, rotating_buffer_size, 0)                                                                                                                                                                               
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 21131, in test_gemm                                                                                                        
    tensors = gen_rotating_tensors(M, N, K, 'fp16', False, 'fp16', False, 'fp16',                                                                                                                                                      
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                      
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 333, in gen_rotating_tensors                                                                                                                              
    in_a, in_a_fp16 = gen_input(M, K, dtype_a, need_Trans_a, 1, init_type, device='cuda')                                                                                                                                              
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                              
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 295, in gen_input                                                                                                                                         
    raw_data = init_by_size_and_type((N, M) if needTrans else (M, N), torch.float32, init_type)                                                                                                                                        
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                        
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 290, in init_by_size_and_type                                                                                                                             
    temp = torch.randn(size, dtype=dtype, device='cuda')                                                                                                                                                                               
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                               
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 338, in _lazy_init                                                                                                                                       
    raise DeferredCudaCallError(msg) from e                                                                                                                                                                                            
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: tuple index out of range                                                                                                                       
                                                                                                                                                                                                                                       
CUDA call was originally invoked at:                                                                                                                                                                                                   
                                                                                                                                                                                                                                       
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29961, in <module>                                                                                                         
    sys.exit(main())                                                                                                                                                                                                                   
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29958, in main                                                                                                             
    test_gemm(16, 13312, 16384, rotating_buffer_size, 0)                                                                                                                                                                               
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 21131, in test_gemm                                                                                                        
    tensors = gen_rotating_tensors(M, N, K, 'fp16', False, 'fp16', False, 'fp16',                                                                                                                                                      
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 333, in gen_rotating_tensors                                                                                                                              
    in_a, in_a_fp16 = gen_input(M, K, dtype_a, need_Trans_a, 1, init_type, device='cuda')                                                                                                                                              
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 269, in gen_input
    torch.manual_seed(seed)
  File "/opt/conda/lib/python3.11/site-packages/torch/random.py", line 46, in manual_seed
    torch.cuda.manual_seed_all(seed)
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/random.py", line 129, in manual_seed_all
    _lazy_call(cb, seed_all=True)
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 256, in _lazy_call
    _lazy_seed_tracker.queue_seed_all(callable, traceback.format_stack())



ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_241021_230220_108139/input_results_241021_230220
Process Process-1:
Traceback (most recent call last):
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/utils.py", line 36, in run_bash_command_wrapper
    run_bash_command(commandstring, capture)
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/utils.py", line 47, in run_bash_command
    proc = subprocess.run(commandstring, shell=True, check=True, executable='/bin/bash')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'rocprof --stats -o results_0.csv python /root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 194, in profile_batch_kernels
    run_bash_command_wrapper(
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/utils.py", line 40, in run_bash_command_wrapper
    run_bash_command(commandstring, capture)
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/utils.py", line 47, in run_bash_command
    proc = subprocess.run(commandstring, shell=True, check=True, executable='/bin/bash')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'rocprof --stats -o results_0.csv python /root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py' returned non-zero exit status 1.
profile time: 0:00:11.341448
Traceback (most recent call last):
  File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 714, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 644, in main
    minTime, bestConfig, compile_time, profile_time, post_time = tune_gemm_config(
                                                                 ^^^^^^^^^^^^^^^^^
  File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 242, in tune_gemm_config
    df_prof = [pd.read_csv(f"results_{i}.csv") for i in range(jobs)]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 242, in <listcomp>
    df_prof = [pd.read_csv(f"results_{i}.csv") for i in range(jobs)]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
    self.handles = get_handle(
                   ^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/pandas/io/common.py", line 873, in get_handle
    handle = open(
             ^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'results_0.csv'

Could you help me with this issue, please? I would like to tune this matmul so I can continue working on AMD with Triton. Thanks.
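
For context, the IndexError is raised from torch.cuda.default_generators[i], which is empty when PyTorch does not see any GPU devices at that point. A minimal sanity check inside the container, assuming the ROCm build of PyTorch, would look like:

```bash
# If PyTorch sees the MI300X GPUs, this should print: True 8
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
# torch.cuda.manual_seed_all indexes into default_generators (one per device);
# a length of 0 here reproduces the "tuple index out of range" error.
python -c "import torch; print(len(torch.cuda.default_generators))"
```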

Operating System

Ubuntu 22.04.4 LTS

CPU

AMD EPYC 9654 96-Core Processor

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.2.2, ROCm 6.2.0

ROCm Component

No response

Steps to Reproduce

Run the script inside a Docker container as described above (a command sketch is included below). My Python version is 3.11.10 and Triton is 3.0.0.
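
A sketch of the reproduction as shell commands, with the working directory taken from the paths in the trace above:

```bash
# Working directory assumed from the stack trace; adjust to your checkout location.
cd /root/triton/python/perf-kernels/tools/tune_gemm
./tune_gemm.py --gemm_size_file input.yaml --ngpus 8 --jobs 32 --verbose
```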

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@zhanglx13

I just tried tune_gemm with your input yaml file and everything works.
Can you provide more information about your Docker environment (see the example commands below)? For example:

  • Triton compiler commit
  • PyTorch version
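
A sketch of commands that would collect these details, assuming the Triton checkout lives at /root/triton as shown in the trace:

```bash
# Report the installed PyTorch and Triton versions.
python -c "import torch, triton; print(torch.__version__, triton.__version__)"
# Report the Triton compiler commit (path assumed from the trace).
git -C /root/triton rev-parse HEAD
```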
