SASS driven mode failing for PyTorch traces #306

Closed
sinharudraneel opened this issue Jun 10, 2024 · 4 comments

Comments

@sinharudraneel

I have been trying to run a simple PyTorch network (a single linear layer trained on random data) on the SASS-driven mode of Accel-Sim as a test for a larger project. The traces generate without problems, but when I run them through the SASS-driven mode of Accel-Sim, I get the error below. It is an assertion failure for `active.any() == false` in shader.cc, and I do not understand the root cause. Is it caused by operations that Accel-Sim does not support, by a mistake in my program (although it runs fine on the Tesla V100 that I have been using), or by a bug in the simulator? Out of curiosity, I commented out the assert statement and ran the traces through the simulator again, and the run passed. I realize I am probably disabling an important check that should not be commented out, so could someone let me know what the underlying error might be?

Sleeping for 30s                                                                                                                                                                                                      
Calling job_status.py                                                                                                                                                                                                 
Using logfiles ['/root/accel-sim-framework/util/job_launching/../job_launching/logfiles/sim_log.simpletorch-test.24.06.07-Friday.txt']                                                                                
procman.id      Node                            App                     AppArgs                 Version                 Config          RunningTime     Mem     JobStatus                       Basic GPGPU-Sim Stats 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------                                                                                                                                                                                                               
11              156211a02f7e                    simpletorch             NO_ARGS                 simpletorch.accelsim    QV100-SASS                    0 0       ABORTED                         SIMRATE_IPS=62 K     S
IM_TIME=21 sec (21 sec) TOT_IPC=38      TOT_INSN=1 M    TOT_CYCLE=34 K                                                                                                                                                
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------                                                                                                                                                                                                               
failed job log written to /root/accel-sim-framework/util/job_launching/../job_launching/logfiles/failed_job_log_sim_log.simpletorch-test.24.06.07-Friday.txt                                                          
Passed:0/1, No error:0/1, Failed/Error:1/1, Running:0/1, Waiting:0/1                                                                                                                                                  
Contents /root/accel-sim-framework/util/job_launching/../job_launching/logfiles/failed_job_log_sim_log.simpletorch-test.24.06.07-Friday.txt:                                                                          
11              156211a02f7e                    simpletorch             NO_ARGS                 simpletorch.accelsim    QV100-SASS                    0 0       ABORTED                         SIMRATE_IPS=62 K     S
IM_TIME=21 sec (21 sec) TOT_IPC=38      TOT_INSN=1 M    TOT_CYCLE=34 K                                                                                                                                                
                                                                                                                                                                                                                      
**********************************************************                                                                                                                                                            
simpletorch-NO_ARGS--QV100-SASS. Status=ABORTED                                                                                                                                                                       
Last 10 line of /root/accel-sim-framework/util/job_launching/../../sim_run_11.7/simpletorch/NO_ARGS/QV100-SASS/simpletorch-NO_ARGS.accelsim-commit-2260456ea5e6a1420f5734f145a4b7d8ab1d4737_modified_0.0.o11          
------------------                                                                                                                                                                                                    
thread block = 0,0,0                                                                                                                                                                                                  
GPGPU-Sim: Reconfigure L1 cache to 120KB                                                                                                                                                                              
GPGPU-Sim uArch: Shader 32 bind to kernel 7 '_ZN2at6native13reduce_kernelILi512ELi1ENS0_8ReduceOpIfNS0_7MeanOpsIffffEEjfLi4EEEEEvT1_'                                                                                 
launching kernel name: _ZN2at6native13reduce_kernelILi512ELi1ENS0_8ReduceOpIfNS0_7MeanOpsIffffEEjfLi4EEEEEvT1_ uid: 7                                                                                                 
Header info loaded for kernel command : ./traces/kernel-7.traceg                                                                                                                                                      
-accelsim tracer version = 3                                                                                                                                                                                          
-nvbit version = 1.5.3                                                                                                                                                                                                
-local mem base_addr = 0x00007f8a46000000                                                                                                                                                                             
-shmem base_addr = 0x00007f8a48000000                                                                                                                                                                                 
-cuda stream id = 0                                                                                                                                                                                                   
------------------                                                                                                                                                                                                    
                                                                                                                                                                                                                      
Contents of /root/accel-sim-framework/util/job_launching/../../sim_run_11.7/simpletorch/NO_ARGS/QV100-SASS/simpletorch-NO_ARGS.accelsim-commit-2260456ea5e6a1420f5734f145a4b7d8ab1d4737_modified_0.0.e11              
------------------                                                                                                                                                                                                    
accel-sim.out: shader.cc:3782: void barrier_set_t::deallocate_barrier(unsigned int): Assertion `active.any() == false' failed.                                                                                        
/root/accel-sim-framework/util/job_launching/../../sim_run_11.7/simpletorch/NO_ARGS/QV100-SASS/slurm.sim: line 52: 16360 Aborted                 (core dumped) /root/accel-sim-framework/util/job_launching/../../sim_
run_11.7/gpgpu-sim-builds/accelsim-commit-2260456ea5e6a1420f5734f145a4b7d8ab1d4737_modified_0.0/accel-sim.out -config ./gpgpusim.config -trace ./traces/kernelslist.g                                                 
                                                                                                                                                                                                                      
------------------                                                                                                                                                                                                    
**********************************************************                                                                                                                                                            
                                                                                                                                                                                                                      
All 1 Tests Done.                                                                                                                                                                                                     
Something did not pass.
@JRPan
Collaborator

JRPan commented Jun 10, 2024

Can you share the trace file in some way?

I saw this before in some reduce kernels (actually, yours is a reduce kernel as well. Maybe this is the same issue).

In the last several lines of the trace, there is probably an EXIT with mask FFFFFFFF, which means all threads within the warp have exited. However, there will be some trace lines after that with mask 00000001, which means that one thread is still active.

This is an NVBit issue, which we also reported in NVlabs/NVBit#122. For now, you can either remove the assert or manually delete the lines after the exit.
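To locate the lines in question before deleting anything by hand, a small scan along the following lines may help. This is only a sketch, not part of Accel-Sim or NVBit. It assumes each instruction line has the layout `PC mask dest_count [dest regs] opcode ...` (inferred from the single EXIT line quoted later in this thread) and that a new warp or thread block is introduced by a line starting with `warp =` or `thread block =`. The `warp =` header and the function name `find_post_exit_lines` are assumptions for illustration. The script only reports lines; if the trace format also stores a per-warp instruction count, that count would likely need adjusting after any manual deletion.

```python
import re
import sys

# Assumed instruction-line prefix: hex PC, 8-hex-digit mask, decimal dest count.
INST_RE = re.compile(r"^[0-9a-fA-F]+ [0-9a-fA-F]{8} \d+ ")


def find_post_exit_lines(lines):
    """Return (line_number, text) for instructions whose mask contains a
    lane that already executed EXIT earlier in the same warp."""
    exited = 0  # bitmask of lanes that have executed EXIT in the current warp
    findings = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        # Assumed section headers: reset the per-warp state.
        if line.startswith("warp =") or line.startswith("thread block ="):
            exited = 0
            continue
        if not INST_RE.match(line):
            continue  # header, comment, or blank line
        tokens = line.split()
        mask = int(tokens[1], 16)
        opcode_index = 3 + int(tokens[2])  # skip the destination registers
        if opcode_index >= len(tokens):
            continue  # line does not fit the assumed layout
        opcode = tokens[opcode_index]
        if mask & exited:
            findings.append((lineno, line))
        if opcode.split(".")[0] == "EXIT":
            exited |= mask
    return findings


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_post_exit.py path/to/kernel-7.traceg
    with open(sys.argv[1]) as trace:
        for lineno, text in find_post_exit_lines(trace):
            print(f"{lineno}: {text}")
```

An EXIT with mask 00000000 adds no lanes to the exited set, so it is not reported on its own.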

Thanks

@sinharudraneel
Author

Sure! The traces were generated in this folder. There are 131 trace files for the program, and I have not been able to go through all of them yet. The ones I did go through did not seem to contain post-exit trace lines with the mask that you described, but I did find two EXIT instructions with a mask of 00000000 (lines 146 and 152) in this trace file. I will keep looking, though; what you mentioned is probably the cause.

Thank you!

@JRPan
Collaborator

JRPan commented Jun 10, 2024

It is kernel-7. https://github.com/sinharudraneel/dp-performance-accel-sim/blob/week3-traces/simpletorch3-traces/traces/kernel-7.traceg#L149C1-L149C25

The block size is 8, so only eight threads are active and the mask is 000000ff.

At:
ba20 000000ff 0 EXIT 0 0

all eight threads have exited, but one thread still shows up as active after ba20.
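The mask arithmetic above can be checked with a short sketch. It assumes a 32-lane warp with lanes filled from bit 0 upward; the helper names `active_mask`, `format_mask`, and `lanes` are hypothetical and do not come from Accel-Sim.

```python
WARP_SIZE = 32  # assumed number of lanes per warp


def active_mask(threads_in_warp):
    """Bitmask with the lowest `threads_in_warp` lanes set."""
    if not 0 <= threads_in_warp <= WARP_SIZE:
        raise ValueError("thread count must be between 0 and WARP_SIZE")
    return (1 << threads_in_warp) - 1


def format_mask(mask):
    """Render a mask as the 8-digit hex string used in the trace."""
    return f"{mask:08x}"


def lanes(mask):
    """List the lane indices that are set in a mask."""
    return [lane for lane in range(WARP_SIZE) if (mask >> lane) & 1]


print(format_mask(active_mask(8)))  # mask for a block of 8 threads
print(lanes(0x00000001))            # lanes named by a post-exit mask of 00000001
```

With a block of 8 threads the mask comes out as 000000ff, and a later mask of 00000001 names lane 0, which is one of the lanes that already executed the EXIT.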

We are currently unable to fix this because it is an NVBit problem. Ignoring it for now should not do much harm.

@sinharudraneel
Author

Ah I see, thank you!
