Methodology for our comparisons vs. Groute #31
A comment on your methodology (which is, of course, different from what we do): "we generally run primitives multiple times within a single binary launch and report the average time". I generally avoid this because it is a common cause of systematic errors in measurement.

Now, running multiple times and taking the average is an estimation procedure for the population mean. This procedure assumes the samples are i.i.d. Here are the runtimes of the individual samples from the K80+METIS, non-idempotent, non-DO data (https://github.com/gunrock/io/blob/master/gunrock-output/20170303/bfs_k80x2_metis_soc-LiveJournal1.txt) in #32, in order of iterations:

61.02, 46.54, 46.41, 43.97, 42.10, 42.20, 42.07, 40.22, 38.91, 38.85, 38.92, 37.24, 36.25, 36.10, 36.08, 36.05, 35.41, 35.07, 35.03, 35.06, 34.77, 34.43, 34.47, 34.42, 34.41, 34.49, 34.30, 34.40, 33.80, 34.49, 34.49, 34.46

The trend of clearly decreasing runtimes for what should be random samples from the same population is worrying. You'll see this pattern in all of your data (it's more evident in your multi-GPU runs). Is your procedure estimating the population mean correctly? I.e., is the average you compute using this procedure comparable to the population mean?
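A minimal sketch of the kind of i.i.d. sanity check being described: split the samples quoted above into an early half and a late half and compare their means. The half-split comparison is just one simple diagnostic (not a procedure either project uses); the numbers are the ones listed above.

```python
# Sanity check: do the per-iteration runtimes behave like i.i.d. samples?
# The values are the 32 per-iteration runtimes quoted above.
times = [61.02, 46.54, 46.41, 43.97, 42.10, 42.20, 42.07, 40.22,
         38.91, 38.85, 38.92, 37.24, 36.25, 36.10, 36.08, 36.05,
         35.41, 35.07, 35.03, 35.06, 34.77, 34.43, 34.47, 34.42,
         34.41, 34.49, 34.30, 34.40, 33.80, 34.49, 34.49, 34.46]

n = len(times)
overall_mean = sum(times) / n
first_half_mean = sum(times[:n // 2]) / (n // 2)
second_half_mean = sum(times[n // 2:]) / (n - n // 2)

print(f"overall mean     : {overall_mean:.2f}")
print(f"first-half mean  : {first_half_mean:.2f}")
print(f"second-half mean : {second_half_mean:.2f}")
# For i.i.d. samples the two halves should agree up to noise; here the first
# half is noticeably slower, which is the decreasing trend described above.
```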
Thanks for pointing that out; I do see this variance in running time.
I still don't know the actual condition(s) and reason(s) for this trend, but here is my guess:
Currently I think 2) and 3) are less likely than 1), but all of them point to lower-level optimizations. From what I can tell, it is far less likely that systematic errors are the cause.
Hi sgpyc,

Things like power management, cache effects, and optimizations are systematic errors; it may help to think of them as "systematic bias". In general, if the behaviour of later runs is affected by earlier runs, then your observations are not independent of each other, and their average is not a good estimator of the population mean.

I would advise figuring out exactly why the trend exists and controlling for it. If you do this, the average computed by running (say) BFS n times from the shell should not be significantly different from the average of n repetitions of BFS from within a single binary launch.

Hope this helps!
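A rough sketch of the consistency check suggested above, assuming a hypothetical BFS binary and a hypothetical parse_elapsed() helper (the actual Gunrock/Groute command lines and output formats are not shown in this thread): the mean over n independent process launches should not differ significantly from the in-binary average over n repetitions.

```python
import subprocess

N = 32
# Hypothetical command line; substitute the real binary, dataset, and flags.
CMD = ["./bin/test_bfs", "market", "soc-LiveJournal1.mtx", "--src=0"]

def parse_elapsed(stdout: str) -> float:
    """Hypothetical parser: pull one run's reported elapsed time from stdout."""
    for line in stdout.splitlines():
        if "elapsed" in line.lower():
            return float(line.split()[-1])
    raise ValueError("no elapsed time found in output")

# N runs, one process each: every run pays startup/warm-up costs independently.
shell_times = []
for _ in range(N):
    out = subprocess.run(CMD, capture_output=True, text=True, check=True).stdout
    shell_times.append(parse_elapsed(out))

print(f"mean over {N} separate launches: {sum(shell_times) / N:.2f}")
# Compare this figure against the average reported by a single launch that
# repeats the primitive N times internally; a large gap suggests later
# in-binary repetitions benefit from earlier ones.
```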
As we noted in our email communications, we think the fairest comparisons between two graph frameworks are those made against the best available performance each offers at the time the comparisons are made. For Gunrock today, that would be the 0.4 release (10 November 2016). We recognize this version was not available at the time the Groute paper was submitted (although it would have been appropriate for the camera-ready version), so we ran comparisons against a Gunrock version dated July 11, 2016 (6eb6db5d09620701bf127c5acb13143f4d8de394). Yuechao notes that to build this version, we "need to comment out the lp related includes in tests/pr/test_pr.cu, line 33 to line 35, otherwise the build will fail".
In our group, we generally run primitives multiple times within a single binary launch and report the average time (Graph500 does this, for instance). We think the most important aspect is simply to run it more than once to mitigate any startup effects. In our comparisons, we use --iteration-num=32.

By default, we use a source vertex of 0, and depending on the test, we have used both 0-source and random-source in our publications. Getting good performance on a randomized source is harder, but avoids overtuning. In our comparisons, we use source 0, as Groute does.
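A small sketch of the in-binary averaging described above; run_bfs and the optional warm-up discard are illustrative assumptions, not Gunrock's actual implementation.

```python
def average_runtime(run_bfs, iteration_num=32, discard_warmup=0):
    """Run the primitive iteration_num times in one launch and report the mean.

    run_bfs is a hypothetical callable that executes one BFS and returns its
    elapsed time. Discarding the first few iterations is one common way to
    mitigate startup effects (shown here only as an option).
    """
    times = [run_bfs() for _ in range(iteration_num)]
    kept = times[discard_warmup:]
    return sum(kept) / len(kept)
```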