Execution is stuck until termination when running on CUDA in WSL2 #538

wtfil · 2024-06-02T15:41:21Z

Reproducing the behavior

The issue

Hi,
I am having issues running code with bend run-cu on CUDA inside WSL2. There are no errors, but the code does not execute either. Execution is frozen, similar to while (true) {}.
Code executes without any issues when using bend run or bend run-c.
Compiling with bend gen-cu and nvcc gives the same result.
I've tried both CUDA 12.4 and 12.5, and the result is the same.

What I attempted

I tried different code examples from the repo, but the result is always the same. Since bend can compile code to CUDA with gen-cu, I looked for what breaks inside the generated file (assuming bend run-cu uses the same code). The hang happens inside the gnet_normalize function, where the code never exits the for loop. The break is never reached (rlen always has the same value).
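To make the failure mode concrete, here is a minimal host-side sketch of a normalization loop with a watchdog. It is not HVM's actual gnet_normalize code: the names normalize_with_watchdog, step_fn, countdown_step, and stuck_step are hypothetical, and the only thing taken from the report is that the loop exits when rlen reaches zero. The sketch shows how a loop of that shape could report a stall instead of spinning forever when rlen stops changing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callback: runs one reduction step and returns the number of
   pending redexes (the "rlen" value mentioned in the report). */
typedef uint32_t (*step_fn)(void *state);

typedef enum { NORM_DONE, NORM_STALLED, NORM_LIMIT } norm_status;

/* Repeats `step` until rlen is zero. Gives up with NORM_STALLED when rlen is
   unchanged for `max_stall` consecutive steps, or with NORM_LIMIT after
   `max_iters` steps. */
norm_status normalize_with_watchdog(step_fn step, void *state,
                                    uint32_t max_iters, uint32_t max_stall) {
  assert(step != 0);
  uint32_t prev = 0;
  uint32_t stalled = 0;
  for (uint32_t i = 0; i < max_iters; ++i) {
    uint32_t rlen = step(state);
    if (rlen == 0) return NORM_DONE;      /* the normal exit */
    if (i > 0 && rlen == prev) {
      if (++stalled >= max_stall) return NORM_STALLED;
    } else {
      stalled = 0;                        /* progress was made */
    }
    prev = rlen;
  }
  return NORM_LIMIT;
}

/* Demo step: decrements a counter, so rlen eventually reaches zero. */
uint32_t countdown_step(void *state) {
  uint32_t *n = (uint32_t *)state;
  if (*n > 0) --*n;
  return *n;
}

/* Demo step: always reports the same nonzero rlen, like the hang above. */
uint32_t stuck_step(void *state) {
  (void)state;
  return 7;
}
```

Calling normalize_with_watchdog(stuck_step, 0, 1000, 3) returns NORM_STALLED after a few steps, while countdown_step with a counter of 5 returns NORM_DONE. Note that an unchanged rlen is not always a true stall in a real reducer, so the threshold is a judgment call.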

CUDA verification

Just to rule CUDA out, I have successfully installed CUDA and can confirm it is recognised by compiling and running this code.

cuda-test.cu:

#include <cuda.h>
#include <stdio.h>

int main(int argc, char** argv) {
  int driver_version = 0, runtime_version = 0;

  cudaDriverGetVersion(&driver_version);
  cudaRuntimeGetVersion(&runtime_version);

  printf(
    "Driver Version: %d\nRuntime Version: %d\n",
    driver_version,
    runtime_version
  );

  return 0;
}

output

~> nvcc cuda-test.cu -o cuda-test && ./cuda-test
Driver Version: 10010
Runtime Version: 12040
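
For readers decoding these numbers: CUDA packs a version as 1000 * major + 10 * minor, so 12040 is 12.4 and 10010 is 10.1. The sketch below is a small host-only helper for that arithmetic; the names cuda_version, decode_cuda_version, and driver_covers_runtime are hypothetical, not part of the CUDA API. The comparison is deliberately conservative (CUDA's minor-version compatibility can relax it within a major release), and a driver value lower than the runtime value, as printed above, may be worth checking but is not confirmed here as the cause of the hang.

```c
#include <assert.h>

/* CUDA packs versions as 1000 * major + 10 * minor (e.g. 12040 -> 12.4). */
typedef struct {
  int major;
  int minor;
} cuda_version;

/* Split a packed version integer into its major and minor parts. */
cuda_version decode_cuda_version(int packed) {
  assert(packed >= 0);
  cuda_version v;
  v.major = packed / 1000;
  v.minor = (packed % 1000) / 10;
  return v;
}

/* Conservative check: returns 1 when the CUDA version supported by the
   driver is at least the runtime's version, 0 otherwise. */
int driver_covers_runtime(int driver_packed, int runtime_packed) {
  return driver_packed >= runtime_packed;
}
```

With the values from the output above, decode_cuda_version(10010) yields 10.1, decode_cuda_version(12040) yields 12.4, and driver_covers_runtime(10010, 12040) returns 0.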

nvidia-smi

Calling from wsl

>  nvidia-smi.exe
Sun Jun  2 16:32:08 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 552.44                 Driver Version: 552.44         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070      WDDM  |   00000000:08:00.0  On |                  N/A |
| 49%   56C    P3             41W /  220W |    6496MiB /   8192MiB |     36%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3832    C+G   ...oogle\Chrome\Application\chrome.exe      N/A      |
|    0   N/A  N/A      8292    C+G   C:\Windows\explorer.exe                     N/A      |
|    0   N/A  N/A     21004    C+G   ...\cef\cef.win7x64\steamwebhelper.exe      N/A      |
|    0   N/A  N/A     21416    C+G   ...wekyb3d8bbwe\XboxGameBarWidgets.exe      N/A      |
|    0   N/A  N/A     27344    C+G   ...ience\NVIDIA GeForce Experience.exe      N/A      |
....
+-----------------------------------------------------------------------------------------+

System Settings

Example:

HVM: 2.0.18
Bend: 0.2.27
OS: Ubuntu 20.04.6 LTS
WSL: 2.1.5.0
CPU: AMD Ryzen 9 5900X
GPU: RTX 3070
Cuda Version: 12.5, V12.5.40

Additional context

No response


developedby · 2024-06-04T12:36:42Z

Is this for any program you try to run?
Also, what compiler version is nvcc using? By default it should be g++ and you can check its version with g++ --version

wtfil · 2024-06-04T21:25:26Z

Yes, all examples from the examples folder have the same behaviour.
For example, I compiled fib and it fails to break out of the same loop, because rlen always has the same value (which differs from run to run, but is never zero).

`nvcc` and `g++`

~ > nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0

~ > g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

keaneflynn · 2024-06-05T19:28:09Z

I would like to also add that I have this same issue on my laptop running ubuntu linux. I have tried the sorter.bend script from the README on three machines now and I have gotten it to run on my other two (one has a laptop 3080 gpu and the other running dual RTX A4000s) however it won't run on my laptop with a Quadro P600 (cuda version 12.2). It does the same thing @wtfil is describing where it simply hangs.

Looking at the processor usage, it seems as though it might be having an issue allocating GPU memory (?). I can see it running 100% on a single cpu core but never load either the RAM or VRAM and there is no processing being run on the GPU either. Here are some specs from the machine in question if this might be a bug to be fixed in the future. (Already love this programming language BTW, really hoping I can start switching over to it at work when more support is added in).

Info:
HVM: 2.0.19
Bend: 0.2.27
OS: Ubuntu 22.04.4 LTS
Kernel: 6.5.0-35
CPU: Intel i5-9300H
GPU: Nvidia Quadro P600 Mobile
g++: 11.4.0
nvcc: 12.2
nvidia driver: 535.171.04

NemoInfo · 2024-06-06T13:04:00Z

I'm getting the same issue with:
HVM: 2.0.19
Bend: 0.2.33
Ubuntu 22.04.4 LTS x86_64
Kernel: 6.5.0-35
CPU: AMD Ryzen 7 5800H
GPU: NVIDIA GeForce RTX 3070 Mobile
g++: 11.4.0
nvcc: 12.5
nvidia driver: 550.67

nmay231 · 2024-06-06T19:03:40Z

Same issue here on Ubuntu (not WSL). I tried debugging this issue in the official discord server here.

As @keaneflynn observed, it doesn't allocate vram correctly, and hangs for a long time until it crashes with the following message:

Failed to launch kernels. Error code: an illegal memory access was encountered.
Errors:
Failed to parse result from HVM.

I waited 45 minutes while another discord server member only waited 30 minutes with his example in a virtual machine. I don't think the time is as important since the program crashed shortly after I launched another application (steam in my case).

A quick summary of the debugging we did in discord: Downgrading from cuda 12.5 to 12.4 doesn't help. Examples unrelated to bend compiled by nvcc work just fine, so it's not a (simple) issue with cuda. Using run and run-c works for the examples provided by bend, while using run-cu directly, or gen-cu then compiling with nvcc, hangs. A different member of the server mentioned it could be an issue with the smaller L1 cache size of my GTX 1660, but that doesn't seem to be the case, since multiple 3070s listed here also don't work.

My Specs

HVM: 2.0.18
Bend: 0.2.27
Ubuntu 22.04.4 LTS x86_64
Kernel: 6.5.0-35
CPU: AMD Ryzen 7 5800X 8-Core Processor
GPU: NVIDIA GeForce GTX 1660 (ti I think?)
g++: 11.4.0
nvcc: V12.4.131
nvidia driver: 555.42.02

keaneflynn · 2024-06-06T20:13:13Z

> I'm getting the same issue with:
> HVM: 2.0.19
> Bend: 0.2.33
> Ubuntu 22.04.4 LTS x86_64
> Kernel: 6.5.0-35
> CPU: AMD Ryzen 7 5800H
> GPU: NVIDIA GeForce RTX 3070 Mobile
> g++: 11.4.0
> nvcc: 12.5
> nvidia driver: 550.67

This is fascinating, as I have a laptop with nearly identical specs that does manage to use run-cu properly. I have the 5800H on ubuntu 22.04, except it has a 3080 mobile. I am pretty sure the cache on these two chips is identical, per @nmay231's inquiry. Hoping to see some bug fixes here soon!

hopperelec · 2024-06-07T11:37:33Z

Same issue here

GTX 1050 Ti (Nvidia Studio driver 551.23)
Intel i5-6400
Windows 10 22H2 (OS Build 19045.4474) --> WSL 2.0.14.0 (Kernel version 5.15.133.1-1) --> Ubuntu 24.04 LTS
g++ 13.2.0
nvcc cuda_12.5.r12.5/compiler.34177558_0
hvm 2.0.18
bend 0.2.27

Imran-S-heikh · 2024-06-07T14:31:25Z

Hey, I am also facing the same issue. All the dedicated GPU memory gets full. And the process is stuck.

bend 0.2.33
hvm 2.0.19
nvcc V12.5.40
wsl-ubuntu 22.04.3 LTS
gpu NVIDIA GeForce RTX 2060

nmay231 · 2024-06-07T17:17:14Z

@Imran-S-heikh I'm not certain if we have the same issue. My Video RAM for the bend/hvm process never went above 100 MiB.

Perhaps we should all make sure we are experiencing the same thing. Here's a very simple program, that shouldn't need much memory. It hangs with bend run-cu. I checked my GPU memory usage with nvtop (98 MiB VRAM, 0% GPU, 102 MiB RAM, 100% on a CPU core).

def main:
  return (1 + 1)

Also, I forgot to mention I did try running run-cu --verbose and this is the output of the program above.

bend run-cu --verbose simple.bend

% bend run-cu -v simple.bend
(Map/empty) = Map/Leaf

(Map/get map key) = match map = map { Map/Leaf: (*, map); Map/Node: switch _ = (== 0 key) { 0: switch _ = (% key 2) { 0: let (got, rest) = (Map/get map.left (/ key 2)); (got, (Map/Node map.value rest map.right)); _ _-1: let (got, rest) = (Map/get map.right (/ key 2)); (got, (Map/Node map.value map.left rest)); }; _ _-1: (map.value, map); }; }

(Map/set map key value) = match map = map { Map/Node: switch _ = (== 0 key) { 0: switch _ = (% key 2) { 0: (Map/Node map.value (Map/set map.left (/ key 2) value) map.right); _ _-1: (Map/Node map.value map.left (Map/set map.right (/ key 2) value)); }; _ _-1: (Map/Node value map.left map.right); }; Map/Leaf: switch _ = (== 0 key) { 0: switch _ = (% key 2) { 0: (Map/Node * (Map/set Map/Leaf (/ key 2) value) Map/Leaf); _ _-1: (Map/Node * Map/Leaf (Map/set Map/Leaf (/ key 2) value)); }; _ _-1: (Map/Node value Map/Leaf Map/Leaf); }; }

(IO/MAGIC) = (13683217, 16719857)

(IO/wrap x) = (IO/Done IO/MAGIC x)

(IO/bind a b) = match a = a { IO/Done: (b a.expr); IO/Call: (IO/Call IO/MAGIC a.func a.argm λx (IO/bind (a.cont x) b)); }

(call func argm) = (IO/Call IO/MAGIC func argm λx (IO/Done IO/MAGIC x))

(print text) = (IO/Call IO/MAGIC "PUT_TEXT" text λx (IO/Done IO/MAGIC x))

(get_time) = (IO/Call IO/MAGIC "GET_TIME" * λx (IO/Done IO/MAGIC x))

(sleep hi_lo) = (IO/Call IO/MAGIC "PUT_TIME" hi_lo λx (IO/Done IO/MAGIC x))

(main) = (+ 1 1)

hopperelec · 2024-06-07T19:11:47Z

@nmay231 I get the exact same output. I also have the same results- it uses all my CPU but no GPU

TimotejFasiang · 2024-06-08T10:59:44Z

Can someone with the issue try running cuda version 11.x? (sudo apt install nvidia-cuda-toolkit will get version 11)
With my 980Ti, on cuda 11, I would instantly get the Failed to launch kernels error, whereas on cuda 12.5, the program just hangs.
This issue is most likely the same as #364, where the GPU memory architecture is the cause, because bend was only tested on an RTX 4090.

hopperelec · 2024-06-08T16:02:07Z

@TimotejFasiang For me, sudo apt install nvidia-cuda-toolkit installed version 12.0.140~12.0.1-4build4, and I couldn't find any full 11.x version numbers to specify (e.g: E: Unable to locate package [email protected]). I don't use Linux very often so I might be doing something silly, though

wtfil · 2024-06-08T17:53:11Z

I have a little update on the issue, hope this will help to understand it better.
Initially I faced this issue when using my Ubuntu 20.04 image in WSL.
Today I tried my second image, Ubuntu 22.04, on the same machine and it worked! This image was almost fresh, so I ran the installation steps from the README and bend run-cu worked right away.

Here are the versions of the relevant tools for both images

name	Ubuntu 20.04 image	Ubuntu 22.04 image
os	Ubuntu 20.04.6 LTS	Ubuntu 22.04.3 LTS
gcc	gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0	gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
nvcc	release 12.5, V12.5.40	release 12.5, V12.5.40
hvm	2.0.19	2.0.19
bend	0.2.33	0.2.33
cargo	1.80.0-nightly	1.78.0

The only major difference between the two is the gcc version.

I also noticed that nvidia-smi (not nvidia-smi.exe) fails on the Ubuntu 20.04 image with the following error:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

nvidia-smi on the Ubuntu 22.04 image works fine

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.85                 Driver Version: 555.85         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070      WDDM  |   00000000:08:00.0  On |                  N/A |
|  0%   46C    P8             22W /  220W |    2652MiB /   8192MiB |     36%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Ubuntu 22.04 image

~/www/bend-examples > time bend run bitonic_sort.bend
Result: 16646144

real    0m35.908s
user    0m33.658s
sys     0m2.250s
~/www/bend-examples > time bend run-c bitonic_sort.bend
Result: 16646144

real    0m10.611s
user    1m3.683s
sys     1m44.188s
~/www/bend-examples > time bend run-cu bitonic_sort.bend
Result: 16646144

real    0m2.410s
user    0m1.996s
sys     0m0.080s

evbxll · 2024-06-14T23:53:22Z

I am noticing something similar. When running on WSL2, I noticed that the default parallel_hello_world would not finish (I got bored of waiting and figured something was wrong). I have a standard RTX 3070, btw. I rewrote things to play around, and found that it ran plenty fast when running gen(13), just not gen(16). My guess is some issue with using too much GPU memory and getting stuck, as some have mentioned here.

Moreover, when running gen(16), my GPU continued to be fully running after CTRL+C the command line and attempting to terminate the process. Is this related to having no IO?

Info:
HVM: 2.0.19
Bend: 0.2.27
OS: wsl-ubuntu 22.04.3 LTS
Kernel: 6.5.0-35
CPU: RYZEN 3600X
GPU: RTX 3070
g++: 11.4.0

hopperelec · 2024-06-15T00:32:02Z

@evbxll Sounds like a different issue. The issue I and others are describing happens even for a very simple program like the one below

def main:
  return (1 + 1)

Also, the issue we're describing results in all our CPU being used, but none of our GPU, and CTRL+C does stop it from using all the CPU for me

evbxll · 2024-06-15T01:26:10Z

> @evbxll Sounds like a different issue. The issue I and others are describing happens even for a very simple program like the one below
> def main:
>   return (1 + 1)
> Also, the issue we're describing results in all our CPU being used, but none of our GPU, and CTRL+C does stop it from using all the CPU for me

Eh, I feel like the original issue these comments are under is similar to mine: execution stuck when running CUDA on WSL2.

Wabinab · 2024-06-17T11:47:29Z

Yeah, one also got this issue, so one followed a few steps:

1. Applied the solution from #364 (Bitonic Sort example fails with GPU kernel error), since one had a 1080 Ti, but modified it for the current latest version, hvm v2.0.19 (instead of the v2.0.13 that @hubble14567 originally mentioned).
2. Followed @wtfil's suggestion to install Ubuntu 22.04 (and 24.04 LTS just in case; each is installed on a separate disk, because you can't install both in the same folder on the same disk, otherwise they'll share a single virtual disk, and that's a huge issue). In the end, both work.
3. Updated the driver from 536.xx (forgot the exact version) to 555.99 (the current latest version). If one tries to run nvidia-smi now, one should get a segmentation fault. Restart your computer, then the error should be gone. Now run bend run-cu simple.bend -s and it'll have no problem.

The fix is quite stupid, because it seems to take so long to move data from cpu to gpu that running the simple.bend suggested above took 5-6 seconds.

wabinab@...: $ bend run-cu simple.bend -s
Result: 2
- ITRS: 2
- LEAK: 0
- TIME: 5.83s
- MIPS: 0.00

Edit: Anyway, one tried running it a second time and the time seems to decrease, although the simple example isn't worth it.

wabinab@...: $ bend run-cu simple.bend -s
Result: 2
- ITRS: 2
- LEAK: 0
- TIME: 0.29s
- MIPS: 0.00

Similarly, if we try running parallel_sum.bend as a hello world, the results aren't enticing with a 1080 Ti:

bend run-c parallel_sum.bend -s
Result: 5908768
- ITRS: 45999971
- TIME: 0.69s
- MIPS: 66.89

bend run-cu parallel_sum.bend -s
Result: 5908768
- ITRS: 45983587
- LEAK: 37606783
- TIME: 0.83s
- MIPS: 55.62

There's a lot of LEAK, and calculations are slower compared to a 4-core CPU (i5-7400).
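
The comparison above can be reproduced with a little arithmetic. This sketch assumes MIPS means millions of interactions per second, computed as ITRS / TIME / 1e6; the thread does not define it, so treat that as an assumption. The function names mips and relative_throughput are hypothetical. Because the printed TIME is rounded to two decimals, recomputed values land slightly off the printed MIPS figures.

```c
#include <assert.h>
#include <stdint.h>

/* Millions of interactions per second from an interaction count and a
   wall time in seconds (assumed definition of the MIPS line). */
double mips(uint64_t itrs, double seconds) {
  assert(seconds > 0.0);
  return (double)itrs / seconds / 1e6;
}

/* Throughput of run A relative to run B; values below 1.0 mean A is slower. */
double relative_throughput(double mips_a, double mips_b) {
  assert(mips_b > 0.0);
  return mips_a / mips_b;
}
```

With the printed figures, relative_throughput(55.62, 66.89) is roughly 0.83, so on this machine run-cu reached about 83% of the run-c throughput for parallel_sum.bend.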

tczee36 · 2024-06-23T01:54:56Z

is there a fix for this?
running into same issue

wtfil changed the title ~~Code is frozen when running on CUDA in WSL2~~ Execution is stuck until termination when running on CUDA in WSL2 Jun 2, 2024

developedby added the HVM, bug, and help wanted labels Jun 4, 2024
