[VL] coredump if compiled by g++-11 #7950

shuai-xu · 2024-11-14T08:54:06Z

Backend

VL (Velox)

Bug description

I compiled Gluten with Velox and then ran it with Spark. There are three machines in the test cluster. It always core dumps on machines 2 and 3 while running normally on machine 1. The stack is:

I then compiled Gluten on another machine, and that build runs normally on all the machines. After comparing the two compile machines, I found the main difference is the g++ version. After changing from g++-11 to g++-10, it works. The machine details are listed in the System information section.

Spark version

Spark-3.3.x

Spark configurations

No response

System information

Compile machine 1:

Compile machine 2:

Run machine 1 is the same as Compile 2.
Run machine 2:

Run machine 3:

Relevant logs

No response

PHILO-HE · 2024-11-15T06:58:14Z

@shuai-xu, thanks for raising this issue!

I think we need to clean up the -march setting that uses the native flag. Could you also remove the code below and try again?

https://github.com/apache/incubator-gluten/blob/main/cpp/core/CMakeLists.txt#L27

shuai-xu · 2024-11-15T09:42:43Z

@PHILO-HE After removing that line from cpp/core/CMakeLists.txt, it no longer core dumps.

PHILO-HE · 2024-11-15T15:00:13Z

@shuai-xu, thanks so much for your feedback!

As @zhouyuan told me, newer gcc versions (e.g., gcc-11) make full use of the native CPU's instructions and optimizations when -march=native is specified. But this can make the binary (which is compiled for your relatively new CPU architecture) not runnable on some older CPU architectures (in your case, an AVX2 CPU).
So essentially, this is not a gcc-11 issue. It's caused by our -march setting. It is probably not rare for diverse CPU architectures to coexist in a user's cluster, so maybe a generic compiler setting should be used by default in our code.
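To make the mismatch concrete, here is a minimal sketch (not part of Gluten) that checks on a worker whether the CPU advertises a given instruction-set feature. It assumes an x86 Linux host where /proc/cpuinfo contains a `flags` line. The helper names `parse_cpu_flags` and `missing_features` are hypothetical and only for illustration.

```python
from pathlib import Path
from typing import Iterable, List, Set


def parse_cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the feature flags from the first 'flags' line of cpuinfo text.

    Assumes the x86 Linux /proc/cpuinfo layout ("flags : fpu vme ...").
    Returns an empty set if no such line is found.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(cpuinfo_text: str, required: Iterable[str]) -> List[str]:
    """List the required features that the CPU does not advertise."""
    return sorted(set(required) - parse_cpu_flags(cpuinfo_text))


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        # A binary that uses AVX-512 instructions is expected to fail with an
        # illegal-instruction fault on a CPU where "avx512f" is reported missing.
        print("missing:", missing_features(cpuinfo.read_text(), ["avx2", "avx512f"]))
    else:
        print("/proc/cpuinfo not available on this platform")
```

Running this on the compile machine and on each worker shows whether the build host supports features that the workers lack.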

cc @zhouyuan, @FelixYBW

FelixYBW · 2024-11-15T18:33:26Z

@PHILO-HE Let's add -mno-avx512f to gluten cpp compile flags. It's used by Velox as well. It can solve the issue fundamentally.

@shuai-xu when you compile Gluten and Velox using -march=native, gcc optimizes the binary for the machine you are building Gluten on, not the worker machine. To get the best performance, you may set -march=ivybridge or whichever other machine type matches your worker machine.
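As a rough sketch of how one might decide which machine type is safe for a mixed cluster, the snippet below intersects the CPU feature sets of all hosts and lists the hosts that lack a given feature. The host names and flag sets are made up for illustration, and `common_features` and `hosts_lacking` are hypothetical helpers, not Gluten or Velox APIs. Mapping the common feature set to a concrete -march value is left to the gcc documentation.

```python
from typing import Dict, List, Set


def common_features(flags_by_host: Dict[str, Set[str]]) -> Set[str]:
    """Features supported by every host (empty set if no hosts are given)."""
    flag_sets = list(flags_by_host.values())
    if not flag_sets:
        return set()
    return set.intersection(*flag_sets)


def hosts_lacking(flags_by_host: Dict[str, Set[str]], feature: str) -> List[str]:
    """Hosts whose CPU does not advertise the given feature."""
    return sorted(host for host, flags in flags_by_host.items() if feature not in flags)


if __name__ == "__main__":
    # Hypothetical data: in practice each set would be collected from the
    # "flags" line of /proc/cpuinfo on the corresponding machine.
    cluster = {
        "build-host": {"sse4_2", "avx", "avx2", "avx512f"},
        "worker-1": {"sse4_2", "avx", "avx2"},
        "worker-2": {"sse4_2", "avx", "avx2"},
    }
    print("common:", sorted(common_features(cluster)))
    print("lacking avx512f:", hosts_lacking(cluster, "avx512f"))
```

If any worker shows up as lacking a feature that the build host has, a -march=native build from that host is not safe to deploy across the cluster.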

shuai-xu · 2024-11-18T02:15:48Z

@FelixYBW Thank you for explaining, I learned a lot.

surnaik · 2024-11-22T01:43:05Z

But adding -mno-avx512f doesn't help. Removing -march=native, or setting -march to build for just AVX2, could be the solution.

shuai-xu added the bug and triage labels on Nov 14, 2024
