About the time to train a model #97

Rooders · 2020-08-26T05:35:55Z

How long does it take to train a model on a single 32 GB GPU with the default parameters? Training feels very slow: on my GPU, one step takes 3.7 seconds. Is that normal?

GrittyChen · 2020-08-26T07:01:18Z

@Rooders Please check whether update_cycle is set to 1. If it is, then I think the training speed is abnormal. Usually, each training step takes less than 1 second with the default parameters (model=Transformer, update_cycle=1, device_list=[0], batch_size=4096). The most likely reason is that your training program is running on the CPU rather than the GPU. Please make sure device_list is set to the index of the GPU you are going to use.

Rooders · 2020-08-26T07:43:52Z

Sorry, by "default" I meant the recommended best parameters from UserManual.pdf: update_cycle=4, batch_size=6250.
I have now followed your advice and set update_cycle=1, batch_size=4096, device_list=[0], but it is still slow: each training step takes about 2.6 seconds. For this run, my GPU is a single Tesla P40 22G. I have checked the device index and the device is available, but training did not use the GPU. Could the TensorFlow version be wrong? My version is tensorflow-gpu=1.15.

GrittyChen · 2020-08-26T08:05:14Z

THUMT-TensorFlow can run with tensorflow-gpu=1.15. You can run a simple TensorFlow GPU program (for example, a matrix multiplication) to check whether it can use the GPU. If it cannot, check that the CUDA version and the driver version match.
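
As a minimal sketch of the check suggested above (not part of THUMT), the following script pins a matrix multiplication to the first GPU and reports whether it ran. The function name `check_tensorflow_gpu` and the report keys are made up for illustration. It goes through `tf.compat.v1`, which should be available in TensorFlow 1.15, and uses `tensorflow.python.client.device_lib`, an internal module path that is commonly used for listing devices but is not a stable public API.

```python
def check_tensorflow_gpu(matrix_size=1024):
    """Report whether TensorFlow imports, lists a GPU, and can run a matmul on it.

    Hypothetical helper for illustration; the report keys are not a TensorFlow API.
    """
    report = {
        "tensorflow_version": None,
        "built_with_cuda": None,
        "gpu_devices": [],
        "gpu_matmul_ok": False,
    }
    try:
        import tensorflow as tf
        from tensorflow.python.client import device_lib
    except Exception:  # missing or broken TensorFlow installation
        return report

    report["tensorflow_version"] = tf.__version__
    report["built_with_cuda"] = tf.test.is_built_with_cuda()
    # An empty list here usually means the CUDA libraries could not be loaded.
    report["gpu_devices"] = [
        d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"
    ]
    if not report["gpu_devices"]:
        return report

    v1 = tf.compat.v1
    graph = tf.Graph()
    # Pin the ops to the first GPU so a silent CPU fallback cannot hide a problem.
    with graph.as_default(), graph.device("/device:GPU:0"):
        a = v1.random_normal([matrix_size, matrix_size])
        b = v1.random_normal([matrix_size, matrix_size])
        product = tf.matmul(a, b)

    # Soft placement is disabled so that a missing GPU raises an error;
    # log_device_placement prints where each op was placed.
    config = v1.ConfigProto(allow_soft_placement=False, log_device_placement=True)
    try:
        with v1.Session(graph=graph, config=config) as session:
            session.run(product)
        report["gpu_matmul_ok"] = True
    except tf.errors.OpError:
        report["gpu_matmul_ok"] = False
    return report


if __name__ == "__main__":
    for key, value in check_tensorflow_gpu().items():
        print(key, "=", value)
```

If `gpu_devices` comes back empty or `gpu_matmul_ok` is false, training is most likely falling back to the CPU, and the CUDA and driver versions are the next thing to check.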

Rooders · 2020-08-26T09:14:21Z

Thank you very much, the issue has been solved. The cause was that the CUDA version did not match the TensorFlow version.
By the way, if I set update_cycle=1, batch_size=4096, what BLEU score can I expect on the validation corpus when training a zh-en model on 200 million sentence pairs?
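
For anyone hitting the same CUDA mismatch, here is a rough sketch of how the installed versions could be collected for comparison. It assumes `nvidia-smi` and `nvcc` are on the PATH and returns `None` for a tool that is missing. The helper names `run_tool` and `collect_versions` are made up for this example. According to TensorFlow's published tested build configurations, tensorflow-gpu 1.15 was built against CUDA 10.0 and cuDNN 7.4.

```python
import shutil
import subprocess


def run_tool(command):
    """Run a command and return its stdout, or None if it is missing or fails."""
    if shutil.which(command[0]) is None:
        return None
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def collect_versions():
    """Collect the NVIDIA driver version and the CUDA toolkit banner, if available."""
    return {
        # Driver version as reported by the NVIDIA management tool.
        "driver": run_tool(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
        ),
        # Version banner of the CUDA toolkit compiler found on PATH.
        "cuda_toolkit": run_tool(["nvcc", "--version"]),
    }


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(name, ":", value if value is not None else "not found")
```

Note that `nvcc` reports the toolkit found on the PATH, which may differ from the CUDA runtime libraries that TensorFlow actually loads, so treat the output as a starting point rather than a definitive answer.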

GrittyChen · 2020-08-28T02:45:25Z

@Rooders Sorry, we did not record the BLEU scores under this setting.
