Multi-GPUs in tensorflow #3

Open
hixiaye opened this issue May 14, 2019 · 7 comments

@hixiaye

hixiaye commented May 14, 2019

Hi @dstamoulis, thanks for your code!
I changed the TPU setup to GPU (using tf.estimator.Estimator and tf.estimator.RunConfig), and training on a single GPU works.
However, when I pass a "MirroredStrategy" to tf.estimator.RunConfig for multi-GPU training, it fails.
The error is:
I0514 20:11:40.999713 139768726693632 tf_logging.py:115] Error reported to Coordinator:
Traceback (most recent call last):
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 783, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1168, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/data/project/tensorflow/FACE/SinglePath_NAS/single-path-nas-master_multi_gpus/nas-search/search_main.py", line 361, in nas_model_fn
    train_op = ema.apply(ema_vars)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/training/moving_averages.py", line 431, in apply
    self._averages[var], var, decay, zero_debias=zero_debias))
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/training/moving_averages.py", line 84, in assign_moving_average
    with ops.colocate_with(variable):
  File "/usr/local/miniconda3/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4092, in _colocate_with_for_gradient
    with self.colocate_with(op, ignore_existing):
  File "/usr/local/miniconda3/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4144, in colocate_with
    op = internal_convert_to_tensor_or_indexed_slices(op, as_ref=True).op
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1305, in internal_convert_to_tensor_or_indexed_slices
    value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1144, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/values.py", line 447, in _tensor_conversion_mirrored
    assert not as_ref
AssertionError

Any help would be appreciated, thank you!

@fabbrimatteo

Hi @sxs11,
I would also like to get multi-GPU training working. Could you share your code? Maybe we can help each other.

@hixiaye
Author

hixiaye commented May 18, 2019

> Hi @sxs11,
> I also want to make the Multi-GPU work. Can you share your code? Maybe we can help each other.

I replaced the TPU classes with their plain Estimator counterparts:
tf.contrib.tpu.TPUEstimatorSpec() -> tf.estimator.EstimatorSpec()
tf.contrib.tpu.RunConfig() -> tf.estimator.RunConfig()
tf.contrib.tpu.TPUEstimator() -> tf.estimator.Estimator()
Other points:
I deleted the flags 'use_tpu', 'tpu', 'gcp_project', and 'tpu_zone', and set the default of 'data_dir' to None. (I only use fake data for debugging.)

I use MirroredStrategy() for multi-GPU training:
NUM_GPUS = 2
distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=NUM_GPUS)
gpu_options = tf.GPUOptions(allow_growth=True)
session_config = tf.ConfigProto(gpu_options=gpu_options)

Both distribution and session_config are passed as arguments to tf.estimator.RunConfig().
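
For anyone following along, the pieces above might be wired together roughly as follows. This is only a sketch, assuming TensorFlow 1.x with tf.contrib available (as in the traceback). The helper name build_run_config and its model_dir argument are hypothetical; train_distribute and session_config are the RunConfig parameters that the two objects would go into. TensorFlow is imported inside the function so the snippet can be loaded on a machine where it is not installed.

```python
def build_run_config(num_gpus=2, model_dir=None):
    """Hypothetical helper: a RunConfig for multi-GPU training (TensorFlow 1.x)."""
    # Assumption: TensorFlow 1.x, where tf.contrib still exists.
    import tensorflow as tf

    # Mirror the model variables across the GPUs.
    distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=num_gpus)

    # Let TensorFlow allocate GPU memory on demand instead of all at once.
    gpu_options = tf.GPUOptions(allow_growth=True)
    session_config = tf.ConfigProto(gpu_options=gpu_options)

    # Hand both objects to the Estimator through its RunConfig.
    return tf.estimator.RunConfig(
        model_dir=model_dir,
        train_distribute=distribution,
        session_config=session_config,
    )
```

The returned object would then be passed as config= to tf.estimator.Estimator().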

@fabbrimatteo

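I solved it by disabling moving_average_decay (setting its default to 0).

It seems that moving_average_decay is not compatible with multi-GPU training: the traceback above fails inside ema.apply(), which is only reached when the moving average is enabled.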
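
For context, moving_average_decay controls an exponential moving average (EMA) of the model weights. Below is a plain-Python sketch of that update and of a guard that skips it when the decay is 0. The names ema_update and maybe_apply_ema are hypothetical, and the assumption that the training script guards on "decay > 0" in this way is mine, not confirmed here.

```python
def ema_update(shadow, value, decay):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * value."""
    return decay * shadow + (1.0 - decay) * value


def maybe_apply_ema(shadow, value, moving_average_decay):
    """Hypothetical guard: skip the EMA branch entirely when the decay is 0."""
    if moving_average_decay > 0:
        return ema_update(shadow, value, moving_average_decay)
    # Assumption: with decay 0 the training code never builds the EMA ops,
    # so the failing ema.apply() call is never reached.
    return value


print(maybe_apply_ema(shadow=1.0, value=3.0, moving_average_decay=0.5))  # 2.0
print(maybe_apply_ema(shadow=1.0, value=3.0, moving_average_decay=0.0))  # 3.0
```

Note that turning the EMA off changes what gets evaluated (raw weights instead of averaged ones), so results may differ slightly from the TPU setup.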

@iamweiweishi

@sxs11 @fabbrimatteo Hi, I replaced 'tf.contrib.tpu.TPUEstimatorSpec' with 'tf.estimator.EstimatorSpec', but the latter does not have the 'host_call' parameter. How should I handle this? Many thanks.

@QueeneTam

> I solved by removing the moving_average_decay: default=0.
>
> It seems that moving_average_decay is not compatible with Multi-GPU training

Hello, I ran into this problem while trying to reproduce this work. Could you share your code? I would appreciate it very much! My email is [email protected]. Thanks a lot!

@QueeneTam

> > Hi @sxs11,
> > I also want to make the Multi-GPU work. Can you share your code? Maybe we can help each other.
>
> tf.contrib.tpu.TPUEstimatorSpec() -> tf.estimator.EstimatorSpec()
> tf.contrib.tpu.RunConfig() -> tf.estimator.RunConfig()
> tf.contrib.tpu.TPUEstimator() -> tf.estimator.Estimator()
> other points:
> I delete the flags: 'use_tpu', 'tpu', 'gcp_project','tpu_zone' and set 'data_dir' default=None. (I just use the fake data for debug)
>
> I use MirroredStrategy() for multi-gpus:
> NUM_GPUS = 2
> distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=NUM_GPUS)
> gpu_options = tf.GPUOptions(allow_growth=True)
> session_config = tf.ConfigProto(gpu_options=gpu_options)
>
> distribution and session_config are arguments of tf.estimator.RunConfig()

Hello, I ran into this problem while trying to reproduce this work. Could you share your code? I would appreciate it very much! My email is [email protected]. Thanks a lot!

@marsggbo

> @sxs11 @fabbrimatteo Hi, I replaced 'tf.contrib.tpu.TPUEstimatorSpec' with 'tf.estimator.EstimatorSpec', but I found that the latter one does not have the parameter 'host_call', how to handle the problem? Many thanks.

Hello, I found a way to solve this problem. Reading the source code of TPUEstimatorSpec, I found that it has a method as_estimator_spec(), so you only need to make the following modification for it to work on GPUs:

def model_fn():
    ...
    # Keep building the TPUEstimatorSpec as before, including host_call.
    spec = tf.contrib.tpu.TPUEstimatorSpec(
        ...
        host_call=host_call
        ...
    )
    # Convert it to a plain EstimatorSpec; note the call parentheses.
    return spec.as_estimator_spec()
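
One detail worth spelling out: as_estimator_spec is a method, so it has to be called. Returning it without parentheses hands the Estimator a bound method rather than a spec. Here is a TensorFlow-free sketch of the difference, using a hypothetical stand-in class FakeTPUSpec (not the real TPUEstimatorSpec):

```python
class FakeTPUSpec:
    """Hypothetical stand-in for TPUEstimatorSpec; not the real class."""

    def __init__(self, mode, loss):
        self.mode = mode
        self.loss = loss

    def as_estimator_spec(self):
        # The real method returns a tf.estimator.EstimatorSpec;
        # a plain dict stands in for it here.
        return {"mode": self.mode, "loss": self.loss}


spec = FakeTPUSpec(mode="train", loss=0.25)

not_called = spec.as_estimator_spec  # a bound method, not a spec
called = spec.as_estimator_spec()    # the converted spec

print(callable(not_called))  # True
print(called)                # {'mode': 'train', 'loss': 0.25}
```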
