[VL] coredump if compiled by g++-11 #7950

Open
shuai-xu opened this issue Nov 14, 2024 · 6 comments
Labels
bug Something isn't working triage

Comments

@shuai-xu
Contributor

shuai-xu commented Nov 14, 2024

Backend

VL (Velox)

Bug description

I compiled Gluten with Velox and ran it with Spark. There are three machines in the test cluster. It always core dumps on machines 2 and 3 but runs normally on machine 1. The stack is: [image]. I then compiled Gluten on another machine, and that build runs normally on all the machines. After comparing the two build machines, I found the main difference is the g++ version. After changing g++ from g++-11 to g++-10, it works. The machine details are listed in the System information section.

Spark version

Spark-3.3.x

Spark configurations

No response

System information

Compile machine 1:
image

Compile machine 2:
image

Run machine 1 is the same as Compile 2.
Run machine 2:
image

Run machine 3:
image

Relevant logs

No response

@shuai-xu shuai-xu added bug Something isn't working triage labels Nov 14, 2024
@PHILO-HE
Contributor

@shuai-xu, thanks for raising this issue!

I think we need to clean up the -march=native setting. Could you also remove the line below and try again?

https://github.com/apache/incubator-gluten/blob/main/cpp/core/CMakeLists.txt#L27

@shuai-xu
Contributor Author

> @shuai-xu, thanks for raising this issue!
>
> I think we need to clean up the -march=native setting. Could you also remove the line below and try again?
>
> https://github.com/apache/incubator-gluten/blob/main/cpp/core/CMakeLists.txt#L27

@PHILO-HE After removing this line, it does not coredump.

@PHILO-HE
Contributor

PHILO-HE commented Nov 15, 2024

@shuai-xu, thanks so much for your feedback!

As @zhouyuan told me, newer gcc versions (e.g., gcc-11) make full use of the native CPU's instructions and optimizations when -march=native is specified. But this can make the binary (which is compiled for your relatively new CPU architecture) unable to run on some older CPU architectures (in your case, an AVX2 CPU).
So essentially, this is not a gcc-11 issue. It is caused by our -march setting. It is probably not rare for diverse CPU architectures to coexist in a user's cluster, so maybe a generic compiler setting should be used by default in our code.
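
To make the mismatch described above concrete, here is a minimal sketch of a runtime check, assuming GCC or Clang on x86 (it relies on the compiler builtins `__builtin_cpu_init` and `__builtin_cpu_supports`). The struct and function names are hypothetical and are not part of Gluten or Velox. On other compilers or architectures the detection simply reports no support.

```cpp
// Hypothetical helper: which SIMD levels the CPU running this process reports.
struct HostSimd {
  bool avx2;
  bool avx512f;
};

// Queries the running CPU. Falls back to "nothing supported" when the
// GCC/Clang x86 builtins are not available.
HostSimd detectHostSimd() {
  HostSimd host{false, false};
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  host.avx2 = __builtin_cpu_supports("avx2") != 0;
  host.avx512f = __builtin_cpu_supports("avx512f") != 0;
#endif
  return host;
}

// Returns true when a binary built with AVX-512F enabled would be unsafe
// to run on the given host, which is the situation reported in this issue.
bool binaryNeedsMoreThanHost(bool builtWithAvx512f, const HostSimd& host) {
  return builtWithAvx512f && !host.avx512f;
}
```

A launcher could call `binaryNeedsMoreThanHost(true, detectHostSimd())` on each worker before loading a native library and fail with a clear message instead of crashing. Note that AVX-512F is only one of the extensions -march=native may enable, so this check is illustrative rather than exhaustive.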

cc @zhouyuan, @FelixYBW

@FelixYBW
Contributor

@PHILO-HE Let's add -mno-avx512f to the Gluten cpp compile flags. Velox uses it as well. It can solve the issue fundamentally.

@shuai-xu when you compile Gluten and Velox with -march=native, gcc optimizes the binary for the machine you are building Gluten on, not for the worker machine. To get the best performance, you can set -march=ivybridge or whichever machine type matches your worker machines.
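
One way to see what a given -march value actually enabled is to inspect the predefined macros that GCC and Clang set for each instruction-set extension. The following is a minimal sketch under that assumption. The function name is hypothetical and not part of Gluten or Velox, and the list covers only a few extensions.

```cpp
#include <string>
#include <vector>

// Returns the names of the x86 SIMD extensions the compiler was allowed to
// use for this translation unit, based on predefined macros such as
// __AVX2__ and __AVX512F__. The result depends on flags like -march.
std::vector<std::string> compiledSimdExtensions() {
  std::vector<std::string> enabled;
#if defined(__SSE4_2__)
  enabled.push_back("sse4.2");
#endif
#if defined(__AVX__)
  enabled.push_back("avx");
#endif
#if defined(__AVX2__)
  enabled.push_back("avx2");
#endif
#if defined(__AVX512F__)
  enabled.push_back("avx512f");
#endif
  return enabled;
}
```

Building this once with -march=native on the build machine and once with the -march value chosen for the workers, then printing the result, shows whether the two builds target the same extensions.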

@shuai-xu
Contributor Author

@FelixYBW Thank you for explaining, I learned a lot.

@surnaik
Contributor

surnaik commented Nov 22, 2024

> @PHILO-HE Let's add -mno-avx512f to the Gluten cpp compile flags. Velox uses it as well. It can solve the issue fundamentally.
>
> @shuai-xu when you compile Gluten and Velox with -march=native, gcc optimizes the binary for the machine you are building Gluten on, not for the worker machine. To get the best performance, you can set -march=ivybridge or whichever machine type matches your worker machines.

But adding -mno-avx512f doesn't help. Removing -march=native, or setting -march to build for just AVX2, could be the solution.
