Systems:
- local macOS machine
- Azure (Standard_NV6 with NVIDIA GPU Cloud Image)
- Google Colab (to a minor extent)
Versions of TensorFlow:
- 1.12.0
- 2 (Development Preview)
Processing unit:
- CPUs
- GPUs
Processing paradigm:
- Eager mode (interactive session)
- Non-eager mode (optimised computational graph)
The official documentation explains the difference between the two types of seeds:
- To generate different sequences across sessions, set neither the graph-level nor the op-level seed.
- To generate the same repeatable sequence for an op across sessions, set the seed for the op.
- To make the random sequences generated by all ops repeatable across sessions, set a graph-level seed.
The command tf.set_random_seed() sets the graph-level random seed (in TensorFlow 2, the command is tf.random.set_seed()). The argument seed=0 in the random normal draws sets the operation-level random seed.
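As a minimal sketch of the two seed types (assuming TensorFlow 1.x graph mode; in TensorFlow 2 you would call tf.random.set_seed and run without a session):

import tensorflow as tf

tf.set_random_seed(0)                  # graph-level seed
a = tf.random.normal([1])              # op-level seed derived from the graph seed
b = tf.random.normal([1], seed=1)      # explicit op-level seed

with tf.Session() as sess:
    # With the graph-level seed set, both draws repeat across sessions
    print(sess.run([a, b]))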
These seeds apply only to random number generation, not to the operations on the results; tf.reduce_mean, for example, does not have an argument for randomness.
Following this thread, I also tried to constrain the randomness with these settings (combined in the sketch right after this list):
- setting the NumPy random seed: np.random.seed(0) (although the code does not use it)
- resetting the default graph: tf.reset_default_graph()
- removing parallelism in non-eager mode:
  with tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1,
                                        intra_op_parallelism_threads=1)) as sess:
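Combined, these settings form the following minimal TF 1.x sketch (a stripped-down version of the full test code below):

import numpy as np
import tensorflow as tf

np.random.seed(0)                      # NumPy seed (unused by the TF ops here)
tf.reset_default_graph()               # start from a clean default graph
tf.set_random_seed(0)                  # graph-level seed

config = tf.ConfigProto(inter_op_parallelism_threads=1,
                        intra_op_parallelism_threads=1)
with tf.Session(config=config) as sess:
    t = tf.random.normal([100, 100], seed=0)   # op-level seed
    print(sess.run(t)[0, 0])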
The generated random numbers are the same across all possibilities if one sets both the graph-level and operation-level seeds. If one sets only the graph-level seed, the random numbers differ between eager and non-eager mode. Therefore, the mean and standard deviation of the random draws also differ between eager and non-eager mode when only the graph-level seed is set.
This StackOverflow thread explains why computations give different results:
Precision in floating point calculation will depend on the library compilation options and system architecture details.
It mentions this blog post, which is a bit long; the part really relevant for us is:
the results you get will depend on your compiler, your CPU, and your compiler settings, which actually helps make the point.
The same random numbers give different sums:
- across different versions of TensorFlow (1.12.0 or 2)
- within a version, across different systems (macOS or Azure)
- within a version and a system, across multiple runs on GPUs
See the test code below for the specific values. All the tests pass on a local macOS machine with TensorFlow 1.12.0 or TensorFlow 2, and on Azure running on CPUs or GPUs.
Please go to this internal repo for the files, or use this Dockerfile and the Python test code below. As more examples arise, they will be added to the GitHub repo.
Dockerfile:
# For GPUs, use tensorflow/tensorflow:latest-gpu-py3 instead
FROM tensorflow/tensorflow:latest-py3
ADD tests.py .
# Execute the script
CMD python tests.py
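To build and run the container (the image tag tf-tests is just an example; the GPU image additionally requires the NVIDIA container runtime):

docker build -t tf-tests .
docker run --rm tf-tests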
Python code:
import os
import tensorflow as tf
import numpy as np
def azure():
"""Returns True of this code is probably running on Azure."""
filepath = os.path.abspath(os.path.dirname(__file__))
return filepath.startswith("/notebooks")
def tf_1():
"""Returns True if TensorFlow is version 1"""
return tf.__version__.startswith("1.")
def format_number(n):
"""Returns the number string-formatted with 12 number after comma."""
return "%1.12f" % n
def set_top_level_seeds():
"""Sets TensorFlow graph-level seed and Numpy seed."""
if tf_1():
tf.set_random_seed(0)
else:
tf.random.set_seed(0)
np.random.seed(0)
def generate_random_numbers_non_eager(op_seed=None):
"""Returns random normal draws, non-eager mode"""
with tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1,
intra_op_parallelism_threads=1)) as sess:
set_top_level_seeds()
        if op_seed is not None:  # explicit None check, so a seed of 0 would also count as set
t = tf.random.normal([100, 100], seed=op_seed)
else:
t = tf.random.normal([100, 100])
return sess.run(t)
def generate_random_numbers_eager(op_seed=None):
"""Returns random normal draws, eager mode"""
set_top_level_seeds()
    if op_seed is not None:  # explicit None check, so a seed of 0 would also count as set
t = tf.random.normal([100, 100], seed=op_seed)
else:
t = tf.random.normal([100, 100])
return t
def generate_random_numbers_helper(eager, op_seed=None):
"""Wrapper for eager and non-eager functions"""
if eager:
return generate_random_numbers_eager(op_seed=op_seed)
return generate_random_numbers_non_eager(op_seed=op_seed)
def generate_random_number_stats_str_eager(op_seed=None):
"""Returns mean and standard deviation from random normal draws"""
t = generate_random_numbers_helper(eager=True, op_seed=op_seed)
mean = tf.reduce_mean(t)
sdev = tf.sqrt(tf.reduce_mean(tf.square(t - mean)))
return [format_number(n) for n in (mean, sdev)]
def generate_random_number_stats_str_non_eager(op_seed=None):
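    """Returns mean and standard deviation from random normal draws, non-eager mode"""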
with tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1,
intra_op_parallelism_threads=1)) as sess:
t = generate_random_numbers_helper(eager=False, op_seed=op_seed)
mean = tf.reduce_mean(t)
sdev = tf.sqrt(tf.reduce_mean(tf.square(t - mean)))
return [format_number(sess.run(n)) for n in (mean, sdev)]
def generate_random_number_stats_str_helper(eager, op_seed=None):
"""Wrapper for eager and non-eager functions"""
if eager:
return generate_random_number_stats_str_eager(op_seed=op_seed)
return generate_random_number_stats_str_non_eager(op_seed=op_seed)
def generate_random_number_1_seed(eager):
"""Returns a single random number with graph-level seed only."""
num = generate_random_numbers_helper(eager)[0, 0]
return num
def generate_random_number_2_seeds(eager):
"""Returns a single random number with graph-level seed only."""
num = generate_random_numbers_helper(eager, op_seed=1)[0, 0]
return num
def generate_stats_1_seed(eager):
"""Returns mean and standard deviation wtih graph-level seed only."""
return generate_random_number_stats_str_helper(eager)
def generate_stats_2_seeds(eager):
"""Returns mean and standard deviation with graph and operation seeds."""
return generate_random_number_stats_str_helper(eager, op_seed=1)
class Tests(tf.test.TestCase):
"""Run tests for reproducibility of TensorFlow."""
def test_version(self):
self.assertTrue(tf.__version__ == "1.12.0" or
tf.__version__.startswith("2.0.0-dev2019"))
def type_helper(self, eager):
num = generate_random_number_1_seed(eager)
num_type = num.dtype
        # conditional on `eager`, not on `tf.executing_eagerly()`,
        # because the latter is always True here (eager execution is enabled
        # at the bottom of the script), while non-eager mode is created inside
        # `generate_random_number_1_seed()` with a local session
if eager:
self.assertEqual(num_type, tf.float32)
else:
self.assertEqual(num_type, np.float32)
def test_type_eager(self):
self.type_helper(eager = True)
def test_type_non_eager(self):
self.type_helper(eager = False)
def random_number_1_seed_helper(self, eager):
num = generate_random_number_1_seed(eager)
num_str = format_number(num)
if eager:
expected_number = "1.511062622070"
else:
expected_number = "-1.409554481506"
self.assertEqual(num_str, expected_number)
def test_random_number_1_seed_eager(self):
self.random_number_1_seed_helper(eager = True)
def test_random_number_1_seed_non_eager(self):
self.random_number_1_seed_helper(eager = False)
def random_number_2_seeds_helper(self, eager):
num = generate_random_number_2_seeds(eager)
num_str = format_number(num)
self.assertEqual(num_str, "0.680345416069")
def test_random_number_2_seeds_eager(self):
self.random_number_2_seeds_helper(eager = True)
def test_random_number_2_seeds_non_eager(self):
self.random_number_2_seeds_helper(eager = False)
def arithmetic_1_seed_helper(self, eager):
mean, sd = generate_stats_1_seed(eager)
# Expected means
if not tf_1():
if azure():
expected_mean = "0.000620655250"
else:
expected_mean = "-0.008264398202"
else:
if not azure():
if eager:
expected_mean = "-0.008264393546"
else:
expected_mean = "0.001438469742"
else:
if eager:
expected_mean = "-0.008264395408"
else:
if tf.test.is_gpu_available():
expected_mean = "0.001438470092"
else:
expected_mean = "0.001438470441"
# Expected standard deviations
if not tf_1():
expected_sd = "0.995371103287"
else:
if eager:
expected_sd = "0.995371103287"
if not eager:
if not azure():
expected_sd = "0.996351540089"
else:
if tf.test.is_gpu_available():
expected_sd = "0.996351540089"
else:
expected_sd = "0.996351480484"
self.assertEqual(mean, expected_mean)
self.assertEqual(sd, expected_sd)
def test_arithmetic_1_seed_eager(self):
self.arithmetic_1_seed_helper(eager = True)
def test_arithmetic_1_seed_non_eager(self):
self.arithmetic_1_seed_helper(eager = False)
def arithmetic_2_seeds_helper(self, eager):
mean, sd = generate_stats_2_seeds(eager)
if not tf_1():
expected_mean = "0.000620646286"
else:
if not azure():
expected_mean = "0.000620653736"
else:
if not tf.test.is_gpu_available():
expected_mean = "0.000620655250"
else:
if eager:
expected_mean = "0.000620648789"
else:
expected_mean = "0.000620654318"
if not tf_1():
expected_sd = "0.997191071510"
else:
if tf.test.is_gpu_available():
expected_sd = ["0.997191190720", "0.997191071510"]
else:
expected_sd = "0.997191190720"
self.assertEqual(mean, expected_mean)
        if isinstance(expected_sd, str):
self.assertEqual(sd, expected_sd)
else:
self.assertTrue(sd in expected_sd)
def test_arithmetic_2_seeds_eager(self):
self.arithmetic_2_seeds_helper(eager = True)
def test_arithmetic_2_seeds_non_eager(self):
self.arithmetic_2_seeds_helper(eager = False)
if __name__ == '__main__':
# Syntax specific to TensorFlow 1
if tf_1():
tf.reset_default_graph()
tf.enable_eager_execution() # this will not be valid when starting a new session
tf.logging.set_verbosity(tf.logging.ERROR)
tf.test.main()
# On Google Colab, run these two lines instead of the last one:
#import unittest
#unittest.main(argv=['first-arg-is-ignored'], exit=False)
Due to rounding in floating-point arithmetic, the result of a sum of floating-point numbers depends on the order of the additions. For example, the Kahan summation algorithm carries a running compensation term for the rounding error, which often improves the precision of the result.
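As an illustration (plain Python, independent of TensorFlow), a naive left-to-right sum drifts, while Kahan's compensated summation recovers the lost low-order bits:

import math

def kahan_sum(values):
    total = 0.0
    c = 0.0                          # running compensation for lost low-order bits
    for v in values:
        y = v - c
        t = total + y                # low-order bits of y are lost here...
        c = (t - total) - y          # ...and recovered here
        total = t
    return total

values = [0.1] * 1000
print(sum(values))                   # ~99.9999999999986: the error accumulates
print(kahan_sum(values))             # ~100.0: compensated sum
print(math.fsum(values))             # correctly rounded reference sum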
Below are quotes from GitHub issues giving the developers' viewpoints; the last link explains why GPUs may sum numbers in a different order depending on the alignment of the data, which implies that different runs of an algorithm on the GPUs of the same system will give different answers.
Unfortunately, the reduction ops on GPU use asynchronous atomic adds, and are therefore fundamentally nondeterministic for floating point. Making them deterministic would require either tree-structured reductions or integer math, both significantly slower.
...
In general, there is no guarantee of determinism on GPU. Therefore, we are not sure how much effort we want to spend on it. Even if we can fix this particular kernel, we have other Cudnn kernels that do have non-determinism.
On GPU, a small amount of non-determinism in the results is expected. TensorFlow uses the Eigen library, which uses Cuda atomics to implement reduction operations, such as tf.reduce_sum etc. Those operations are non-deterministic. Each operation can introduce a small difference. If your model is not stable, it could accumulate into large errors after many steps.
Also, some non-determinism is caused by using modern instruction sets like SSE (see here), so to get 100% reproducibility you may need to recompile TF without using SSE.
Blog post explaining why GPUs are not deterministic:
SSE instructions enable low-level parallelism of floating-point arithmetic operations. For example, you can hold four single precision numbers at the same time in a 128-bit register, and operate on them all at the same time. This leads to massive time savings when working on large amounts of data.
But this may come at a price. Efficient use of SSE instructions can sometimes depend on exactly how the memory used to store vectors x and y is aligned. If it's aligned nicely - by which I mean, in the inner product example, that the addresses of the first elements of the arrays x and y are multiples of 16 bytes - then that's good. The hardware can efficiently move numbers from memory into registers to work on them, using instructions that depend on that alignment. So for our inner product, with a good optimizing compiler, we'd load numbers four at a time, multiply them together four at a time, and accumulate the results as we go along into our final result.
But if the memory is not nicely aligned - and there's a good chance it may not be - the compiler needs to generate a different code path to deal with the situation. Here the result will take longer to get because the numbers have to be accumulated one at a time. At run time, the code checks whether it can take the fast path or not, and works appropriately.
...
As new hardware with AVX instructions and 256-bit registers comes along, even more numerical work can be done in parallel. So it seems that for the foreseeable future we're just going to have to live with this.
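The same order-dependence is easy to reproduce on a CPU with NumPy (a sketch; the exact digits will vary by machine): summing the same float32 numbers sequentially, with NumPy's vectorized reduction, and in a chunked order typically disagrees in the last few digits.

import numpy as np

x = np.random.RandomState(0).normal(size=10**5).astype(np.float32)

seq = np.float32(0)
for v in x:                          # strictly sequential accumulation
    seq += v

pairwise = x.sum()                   # NumPy's pairwise/vectorized reduction
chunked = x.reshape(100, 1000).sum(axis=0).sum()   # a different association

# The three results typically differ in the last few digits
print(seq, pairwise, chunked)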