Tpetra: compilation error with Cuda 12.2 and GCC 12.3 #12237

Open
maartenarnst opened this issue Sep 11, 2023 · 10 comments
Labels
pkg: Tpetra type: bug The primary issue is a bug in Trilinos code or tests

Comments

@maartenarnst
Contributor

@brian-kelley @csiefer2

We are compiling Trilinos with Cuda 12.2 and with GCC 12.3 as the host compiler.

We're seeing a compilation error for the file Tpetra_CrsMatrix_def.hpp:

INFO:root:#21 1477.5 ...build-with-GNU-Cuda-amd64/packages/tpetra/core/src/Tpetra_CrsMatrix_LONG_LONG_INT_LONG_LONG_SERIAL.cpp:74:16:   required from here
INFO:root:#21 1477.5 .../packages/tpetra/core/src/Tpetra_CrsMatrix_def.hpp:1689:139: error: no type named 'const_type' in 'cuda::std::__4::__is_primary_template<cuda::std::__4::iterator_traits<long long int, void> >'
INFO:root:#21 1477.5  1689 |         typename row_entries_type::const_type numRowEnt_h =
INFO:root:#21 1477.5       | 

It's quite mysterious because row_entries_type seems to be a Kokkos::View, so the line should be fine.

I think I tracked it down to this commit:

which ultimately results in an include of Kokkos_Sort.hpp in Tpetra_CrsMatrix_def.hpp. It seems to be the include of thrust/device_ptr.h and thrust/sort.h from Kokkos_Sort.hpp that ultimately causes the issue. I.e., if I compile using an older version of Trilinos and add those two thrust includes to Tpetra_CrsMatrix_def.hpp, I get the same error.

This is just a bug report. I have no explanation for this compilation error. And no fix to propose.

@maartenarnst maartenarnst added the type: bug The primary issue is a bug in Trilinos code or tests label Sep 11, 2023
@jhux2
Member

jhux2 commented Sep 11, 2023

@trilinos/tpetra

@csiefer2
Member

Lovely. We currently do not test CUDA 12 or GCC 12. Any suggestion as to where we can find a machine to reproduce this?

@brian-kelley
Contributor

@csiefer2 Weaver has both of those

@csiefer2
Member

@maartenarnst I finally got around to trying a build on weaver (IBM Power) with GCC 12.2, CUDA 12.0, and OpenMPI 4.1.4, and Tpetra compiles just fine. Can you post the configure you used here, so I can see whether it is some magic option or really a compiler issue in 12.3 that isn't in 12.2?

@maartenarnst
Contributor Author

Hi @csiefer2. Thanks for following up.

We're building and testing in a docker container based on the cuda:12.2.0-devel-ubuntu22.04 image. It's x86, GCC 12.3, Cuda 12.2 and openmpi 4.1.2.
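
To double-check which host compiler and CUDA compiler a build actually picks up inside such a container, a small helper like the one below can be compiled with the same toolchain. This is only a sketch: the function names `host_compiler_version` and `cuda_compiler_version` are made up for illustration, and it relies solely on the predefined macros `__GNUC__`, `__GNUC_MINOR__`, `__GNUC_PATCHLEVEL__` (GCC) and `__CUDACC_VER_MAJOR__`, `__CUDACC_VER_MINOR__`, `__CUDACC_VER_BUILD__` (nvcc).

```cpp
#include <string>

// Hypothetical helper: returns the host compiler version as "major.minor.patch"
// from GCC's predefined macros. Note that GCC-compatible compilers such as
// clang also define these macros, with their own compatibility values.
inline std::string host_compiler_version() {
#if defined(__GNUC__)
  return std::to_string(__GNUC__) + "." + std::to_string(__GNUC_MINOR__) +
         "." + std::to_string(__GNUC_PATCHLEVEL__);
#else
  return "unknown";
#endif
}

// Hypothetical helper: returns the nvcc version when the file is compiled by
// nvcc, and a fixed marker string otherwise.
inline std::string cuda_compiler_version() {
#if defined(__CUDACC_VER_MAJOR__)
  return std::to_string(__CUDACC_VER_MAJOR__) + "." +
         std::to_string(__CUDACC_VER_MINOR__) + "." +
         std::to_string(__CUDACC_VER_BUILD__);
#else
  return "not compiled with nvcc";
#endif
}
```

Printing both strings from a translation unit built through the same `mpicxx`/nvcc wrapper chain as Trilinos would show whether the build really uses GCC 12.3 and CUDA 12.2, or something else found earlier on the path.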

I'll try to run checks again tomorrow with our configuration, as well as with the GCC 12.2 and CUDA 12.0 that you are using. I'll keep you updated, and I'll also send more details.

Also tagging @romintomasetti.

@csiefer2
Member

@maartenarnst

Configure I used:

cmake  \
-D CMAKE_CXX_COMPILER=`which mpicxx` \
-D CMAKE_C_COMPILER=`which mpicc` \
-D TPL_ENABLE_MPI=ON \
-D TPL_ENABLE_CUDA=ON \
   -D Kokkos_ARCH_VOLTA70=ON \
-D BUILD_SHARED_LIBS=ON \
-D Trilinos_ENABLE_Epetra=OFF \
-D Trilinos_ENABLE_Tpetra=ON \
  -D Tpetra_ENABLE_TESTS=ON \
  -D Tpetra_ENABLE_EXAMPLES=ON \
-D TPL_BLAS_LIBRARIES=$OPENBLAS_LIB/libopenblas.so \
-D TPL_LAPACK_LIBRARIES=$OPENBLAS_LIB/libopenblas.so \
../Trilinos

@bathmatt
Contributor

bathmatt commented Oct 20, 2023

I know what is causing this and it is a bug in the compiler.

There is a bug filed with the nvcc team. It does not (or should not) happen with the nvc++ compiler.

I can provide a workaround. Replace

typedef decltype (myGraph_->k_numRowEntries_) row_entries_type;

with

typedef typename Kokkos::View<size_t*, Kokkos::LayoutLeft, device_type>::HostMirror row_entries_type;

and the error should go away. As I said, the bug is in our compiler and involves decltype. (You need to update the using statement too.)
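
The shape of that workaround can be sketched without Kokkos. In the snippet below, `FakeDevice`, `FakeView`, `FakeGraph`, and `FakeMatrix` are hypothetical stand-ins invented for illustration (they are not Kokkos or Tpetra types), and the sketch does not reproduce the nvcc bug itself. It only shows that a typedef deduced with `decltype` from a member and a typedef that spells the type out name the same type, so swapping one for the other should not change behavior on a conforming compiler.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical device tag; stands in for Tpetra's device_type.
struct FakeDevice {};

// Hypothetical stand-in for a View-like class exposing the nested types
// used in the thread (const_type, HostMirror). Not the real Kokkos API.
template <class T, class Device>
struct FakeView {
  using const_type = FakeView<const T, Device>;
  using HostMirror = FakeView<T, Device>;
};

// Hypothetical stand-in for the graph that owns the per-row entry counts.
template <class Device>
struct FakeGraph {
  typename FakeView<std::size_t*, Device>::HostMirror k_numRowEntries_;
};

template <class Device>
struct FakeMatrix {
  FakeGraph<Device>* myGraph_ = nullptr;

  // Original form: the type is deduced from the member with decltype.
  typedef decltype(myGraph_->k_numRowEntries_) row_entries_deduced;

  // Workaround form: the same type, written out explicitly.
  typedef typename FakeView<std::size_t*, Device>::HostMirror
      row_entries_explicit;
};

// True when both spellings name the same type.
constexpr bool workaround_preserves_type() {
  return std::is_same<FakeMatrix<FakeDevice>::row_entries_deduced,
                      FakeMatrix<FakeDevice>::row_entries_explicit>::value;
}

static_assert(workaround_preserves_type(),
              "decltype-deduced and explicit typedefs should agree");
```

The trade-off is that the explicit spelling must be kept in sync by hand if the member's type ever changes, which is presumably why `decltype` was used in the first place.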

I will link the internal bug with this ticket.

with these changes...
[74/74] Linking CXX executable packages/panzer/mini-em/example/BlockPrec/PanzerMiniEM_BlockPrec.exe

BTW, I can't provide a patch without a lot of approval, or I would.

@bathmatt
Contributor

Reproducer

template <class...> class c {
public:
  using ab = c;
};
class ac;
class ad;
typedef ac ai;
enum f { g };
template <class al> class h {
public:
  h(f = g);
  al *operator->();
};
template <class = ai> class j;
template <class, class, class> class k {
public:
  typedef c<> am;
  am l;
};
template <class, class an, class m, class ao> class n {
  using ap = j<>;
  using o = k<an, m, ao>;
  void aq(const h<const ap> &, const h<const ap> &, const h<ad> & = g);
  template <class al> h<n<al, an, m, ao>> ar() const;
  h<const ap> q() const;
  h<const ap> p() const;
  void at(const h<ad> &);
  h<o> t;
};
template <class b, b> struct av {};
typedef av<bool, false> e;
template <bool aw> using u = av<bool, aw>;
template <bool, class b = void> using r = b;
template <class, class> struct ae;
template <class b, class i> using s = u<ae<b, i>::a>;
template <template <class> class, class> e v;
template <template <class> class w, class... x> using ax = decltype(v<w, x...>);
template <class b> using ay = r<s<b, typename b::d>::a>;
template <class b> using __is_primary_template = ax<ay, b>;
template <class az, class an, class m, class ao>
void n<az, an, m, ao>::at(const h<ad> &) {
  typedef decltype(t->l) ba;
  typename ba::ab bb;
  bb;
}
template <class az, class an, class m, class ao>
void n<az, an, m, ao>::aq(const h<const ap> &, const h<const ap> &,
                          const h<ad> &bc) {
  at(bc);
}
template <class az, class an, class m, class ao>
template <class al>
h<n<al, an, m, ao>> n<az, an, m, ao>::ar() const {
  h<n> be;
  be->aq(q(), p());
}
template h<n<double, int, long, ai>> n<double, int, long, ai>::ar() const;

@csiefer2
Member

@maartenarnst I have a PR up which blindly tries @bathmatt's fix. Can you check?

csiefer2 added a commit that referenced this issue Oct 27, 2023
* Tpetra: Working around GCC 12.3 + CUDA Compiler Bug

As reported in #12237

* Tpetra: More Matt's fix

* Tpetra: Maybe fixing bug?

* Update Tpetra_CrsGraph_def.hpp

* Update Tpetra_CrsGraph_def.hpp

* Update Tpetra_CrsGraph_def.hpp

* GitHub editor introduces errors

* Update Tpetra_CrsMatrix_def.hpp

* Why does the github editor keep doing this?!?!?1
@bathmatt
Contributor

@csiefer2 et al., I just approved the fix in the compiler; it should land in 12.5. Sorry we couldn't fix it sooner.

@jhux2 jhux2 added this to Tpetra Aug 12, 2024
@jhux2 jhux2 moved this to Awaiting User Feedback in Tpetra Aug 12, 2024