
about the time for train a model #97
Open
Rooders opened this issue Aug 26, 2020 · 5 comments

Comments

Rooders commented Aug 26, 2020

How much time does it take to train a model on a single 32G GPU with the default parameters? It feels very slow: on my GPU, one step takes 3.7 seconds. Is that normal?

GrittyChen (Member)

@Rooders Please check whether update_cycle is set to 1; if it is, then the training speed is abnormal. Usually each training step takes less than 1 second with the default parameters (model=Transformer, update_cycle=1, device_list=[0], batch_size=4096). The most likely reason is that your training program is running on the CPU rather than the GPU. Please make sure device_list is set to the index of the GPU you intend to use.
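A quick way to verify that TensorFlow actually sees the GPU is to list the local devices (a minimal sketch using `device_lib`, which is available in TF 1.x; it is not part of the original reply):

```python
from tensorflow.python.client import device_lib

# List every device TensorFlow can place ops on. If no entry with
# device_type == "GPU" appears here, training will silently fall
# back to the CPU regardless of what device_list is set to.
devices = device_lib.list_local_devices()
for d in devices:
    print(d.device_type, d.name)
```

If only a CPU device is printed, the problem is in the TensorFlow/CUDA installation, not in the THUMT configuration.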

Rooders commented Aug 26, 2020

Sorry, my default parameters were the recommended best parameters from UserManual.pdf: update_cycle=4, batch_size=6250.
I followed your advice and set update_cycle=1, batch_size=4096, device_list=[0], but it is still slow: each training step takes about 2.6 seconds. My GPU is a single Tesla P40 22G. I have checked the device index and it is available, but training did not use the GPU. Could the TensorFlow version be wrong? My version is tensorflow-gpu==1.15.

GrittyChen (Member)

THUMT-TensorFlow can run with tensorflow-gpu==1.15. You can run a simple TensorFlow GPU program (e.g., a matrix multiplication) to check whether it can use the GPU. If it cannot, check the CUDA version and the driver version to make sure they match.
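For example, such a matrix-multiplication check could look like the following (a sketch against the TF 1.x graph API via `tf.compat.v1`, with device placement logging turned on so each op reports where it ran):

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # use the TF 1.x graph/session API

# log_device_placement=True prints, for every op, whether it was
# placed on /device:GPU:0 or fell back to the CPU.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.ones([2, 2])
    result = sess.run(tf.matmul(a, b))

print(result)  # [[3. 3.] [7. 7.]]
print("GPU available:", tf.test.is_gpu_available())
```

If `MatMul` is logged as placed on the CPU (or `is_gpu_available()` returns False), the CUDA toolkit and driver versions are the first things to check against the TensorFlow build you installed.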

Rooders commented Aug 26, 2020

Thank you very much, the issue has been solved: the CUDA version didn't match the TensorFlow version.
By the way, if I set update_cycle=1, batch_size=4096, what BLEU score can I expect on the validation corpus when training a zh-en model on 200 million sentence pairs?

GrittyChen (Member)

@Rooders Sorry, we did not record the BLEU scores under this setting.
