[BUG] Fail to run dp train with 2.2.8 (but success with 2.2.7) #3218

Closed
link89 opened this issue Feb 2, 2024 · 7 comments · Fixed by deepmodeling-activity/deepmd-kit-installer#36

Comments

@link89
Contributor

link89 commented Feb 2, 2024

Bug summary

I get the following error when running dp train on a GPU node:

error: libdevice not found at ./libdevice.10.bc
2024-02-01 22:45:42.137246: W tensorflow/core/framework/op_kernel.cc:1827] UNKNOWN: JIT compilation failed.
2024-02-01 22:45:42.137338: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 17091566501094239012
Traceback (most recent call last):
  File "/public/groups/ai4ec/libs/conda/deepmd/2.2.8/gpu/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1402, in _do_call
    return fn(*args)
  File "/public/groups/ai4ec/libs/conda/deepmd/2.2.8/gpu/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1385, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/public/groups/ai4ec/libs/conda/deepmd/2.2.8/gpu/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1478, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) UNKNOWN: JIT compilation failed.
         [[{{node ExponentialDecay/Floor}}]]
         [[ExponentialDecay/_163]]
  (1) UNKNOWN: JIT compilation failed.
         [[{{node ExponentialDecay/Floor}}]]
0 successful operations.
0 derived errors ignored.

Training works with 2.2.7. Our current workaround is to train the model with 2.2.7 and run LAMMPS with deepmd 2.2.8.

DeePMD-kit Version

2.2.8

TensorFlow Version

2.14.0

How did you download the software?

Conda

Input Files, Running Commands, Error Log, etc.

deepmd-2.2.8.log.txt

Steps to Reproduce

Install deepmd with the following command:

CONDA_OVERRIDE_CUDA="11.8" conda create -p deepmd/2.2.8/gpu deepmd-kit=2.2.8=cuda118py310* lammps horovod -c conda-forge

Then run dp train.

Further Information, Files, and Links

No response

@link89 link89 added the bug label Feb 2, 2024
@njzjz njzjz added the upstream label Feb 2, 2024
@link89
Contributor Author

link89 commented Feb 2, 2024

What's the suggested version of tensorflow to use with deepmd 2.2.8?
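
When comparing the working 2.2.7 environment against the failing 2.2.8 one, it may help to record exactly which package versions each environment resolves. The following is only a sketch: the helper name `installed_versions` is hypothetical, and it assumes the packages register themselves under the distribution names `deepmd-kit` and `tensorflow`, which may not hold for every build.

```python
from importlib import metadata


def installed_versions(packages=("deepmd-kit", "tensorflow")):
    """Map each distribution name to its installed version, or None if absent.

    The default names are an assumption about how the packages are
    registered in the environment; adjust them if your build differs.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed (or registered under another name).
            versions[name] = None
    return versions
```

Running `print(installed_versions())` inside each environment gives a pair of results that can be pasted into the report side by side.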

@biglinn

biglinn commented Feb 2, 2024

When I run dpgen 'run' with deepmd-kit 2.2.8 and tensorflow 2.15.0, I hit a similar TensorFlow problem.

@njzjz
Member

njzjz commented Feb 2, 2024

> When I run dpgen 'run' with deepmd-kit 2.2.8 and tensorflow 2.15.0, I hit a similar TensorFlow problem.

The problem on TF 2.15 should have been fixed in conda-forge/tensorflow-feedstock#370. It is valuable to know whether the new build still has the problem.

@njzjz njzjz pinned this issue Feb 2, 2024
@biglinn

biglinn commented Feb 4, 2024

> When I run dpgen 'run' with deepmd-kit 2.2.8 and tensorflow 2.15.0, I hit a similar TensorFlow problem.

> The problem on TF 2.15 should have been fixed in conda-forge/tensorflow-feedstock#370. It is valuable to know whether the new build still has the problem.

OK, I will give it a try. Thanks!
In addition, I find that deepmd-kit 2.2.7 with tensorflow 2.9.0 works very well.

@link89
Contributor Author

link89 commented Feb 22, 2024

@njzjz deepmd/2.2.9 still has this issue. One has to run pip install nvidia-cuda-nvcc-cu11 to work around it. I guess this issue is not fixed yet.
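
Since the error says libdevice was looked up at ./libdevice.10.bc, it can be useful to check whether the file exists anywhere in the environment at all. The sketch below is only an illustration, not part of deepmd-kit or TensorFlow: the helper names and the list of search roots are assumptions. It also assumes the usual `<dir>/nvvm/libdevice/libdevice.10.bc` layout when suggesting a value for XLA's `--xla_gpu_cuda_data_dir` flag; whether setting that flag resolves this particular failure has not been verified here.

```python
import os
import sys
from pathlib import Path

LIBDEVICE = "libdevice.10.bc"


def candidate_roots():
    """Guess where a CUDA toolkit might live (assumed locations, not exhaustive)."""
    roots = [sys.prefix]  # active conda or virtualenv prefix
    for var in ("CUDA_HOME", "CUDA_PATH"):
        if os.environ.get(var):
            roots.append(os.environ[var])
    return roots


def find_libdevice(roots):
    """Return the first libdevice.10.bc found under any root, or None."""
    for root in roots:
        root = Path(root)
        if not root.is_dir():
            continue
        # rglob walks the whole tree, so this can be slow on large prefixes.
        for hit in root.rglob(LIBDEVICE):
            return hit
    return None


def xla_flags_for(libdevice_path):
    """Build an XLA_FLAGS value for a file laid out as <dir>/nvvm/libdevice/<file>.

    Returns None if the file is not in that layout.
    """
    path = Path(libdevice_path)
    if path.parent.name != "libdevice" or path.parent.parent.name != "nvvm":
        return None
    cuda_data_dir = path.parent.parent.parent
    return f"--xla_gpu_cuda_data_dir={cuda_data_dir}"


def main():
    """Print what was found and, if possible, a flag value to try."""
    hit = find_libdevice(candidate_roots())
    if hit is None:
        print("libdevice.10.bc not found under:", candidate_roots())
        return
    print("found:", hit)
    flags = xla_flags_for(hit)
    if flags is not None:
        print("value to try for XLA_FLAGS:", flags)
```

Calling `main()` before and after installing nvidia-cuda-nvcc-cu11 should show whether that package is what supplies the file in a given environment.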

@njzjz
Member

njzjz commented Feb 22, 2024

> deepmd/2.2.9 still has this issue.

@link89 the issue is not related to deepmd-kit.

@njzjz njzjz unpinned this issue Jun 28, 2024
3 participants