RF: fix performance of feature sampling for node splits #2292
Conversation
/intelci: run
The work is not fully finished. For instance, I do not know if / how we should escalate the error if
```diff
@@ -138,6 +138,17 @@ void service_memset_seq(T * const ptr, const T value, const size_t num)
     }
 }
 
+template <typename T, CpuType cpu>
+void service_memset_ser(T * const ptr, const T startValue, const size_t num)
```
The meaning of the function name is not obvious. Maybe `service_memset_range`?
I went with `service_memset_incrementing`, do you agree?
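For illustration, a minimal standalone sketch of what such an incrementing fill could look like (not the actual `service_memory.h` implementation, which is additionally templated on `CpuType` and may carry vectorization pragmas):

```cpp
#include <cstddef>

// Fills ptr[0..num) with startValue, startValue + 1, startValue + 2, ...
// Sketch only; the real DAAL helper lives in service_memory.h.
template <typename T>
void service_memset_incrementing(T * const ptr, const T startValue, const std::size_t num)
{
    for (std::size_t i = 0; i < num; ++i)
    {
        ptr[i] = startValue + static_cast<T>(i);
    }
}
```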
Hi @Alexsandruss and @samir-nasibli, could you comment on the test failures? Is it okay to update
I have created a PR in scikit-learn-intelex to follow up on the CI failures. @Alexsandruss, could you sign off on the changes requested in this PR?
```diff
@@ -134,6 +134,7 @@ class DAAL_EXPORT Parameter
                            Default is 256. Increasing the number results in higher computation costs */
     size_t minBinSize;     /*!< Used with 'hist' split finding method only.
                            Minimal number of observations in a bin. Default is 5 */
+    bool useConstFeatures; /*!< Use or ignore constant-valued features when splitting nodes. Default is false */
```
This parameter should be added to the oneDAL interfaces too. Otherwise, it also creates an inconsistency in scikit-learn-intelex, where RandomForest is going to use the oneDAL interfaces by default.
Tried to add it here: 6ba776f
I hope I found all relevant places. Let me know if you see something obvious missing. Thank you!
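For context, a hypothetical usage sketch of the new flag through the DAAL batch training API (the accessor spelling may differ between oneDAL versions; `useConstFeatures` defaults to `false` as documented in the diff above):

```cpp
#include <daal.h>
#include <cstddef>

using namespace daal::algorithms;

void configureTraining(const std::size_t nClasses)
{
    decision_forest::classification::training::Batch<> algorithm(nClasses);
    algorithm.parameter().nTrees           = 100;
    algorithm.parameter().useConstFeatures = true; // consider constant-valued features at node splits
}
```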
Failures should be investigated and fixed on the daal4py side:
Force-pushed from eda9773 to 3efc0d0
log loss went from [*]

[*] All ratios are between 0.99 and 1.02, except for one measurement on the year_prediction_msd dataset, where a 7-9% disagreement is observed. Again, I would like to tackle that in a separate issue. Is it okay if I temporarily allow ratios
@Alexsandruss updated the test in the other repo: uxlfoundation/scikit-learn-intelex#1213
A 9% quality metric drop for some datasets looks unacceptable. Did you test the datasets from scikit-learn_bench/configs/testing/metrics/rf_*.json?
I'm running the tests right now. But regardless, could we split these issues? There is a critical underperformance in the current implementation. Compared to the stock version, the latest intelex release has a 25% worse MSE on the year prediction dataset (the same one where I see the +9%), and a factor of 3 (!!) is reported in uxlfoundation/scikit-learn-intelex#1213
Merge when fix on Python side is ready
The changes look good. Special thanks for adding comments.
But it seems that the error handling needs to be improved.
```cpp
int errorcode = rng.uniform(1, &swapIdx, _engineImpl->getState(), 0, maxFeatures - i);
if (errorcode)
{
    return false;
}
```
The correct behavior would be to return Status (or SafeStatus if this code is running in a parallel region), not bool, from the findBestSplit* functions. This Status will be able to carry a description of the error. Maybe a new error code also needs to be added to handle failures in the random number generators.
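As a rough sketch of the suggested pattern (assumed shapes and a placeholder error ID, not actual oneDAL code), the RNG call would propagate a `services::Status` instead of collapsing the failure into a `bool`:

```cpp
// Sketch only: reuses the rng.uniform call shown above; the error ID is a placeholder,
// and a dedicated RNG-failure code could be added as the review suggests.
template <typename RngType, typename IndexType>
daal::services::Status drawSwapIndex(RngType & rng, void * state, const IndexType upperBound, IndexType & swapIdx)
{
    const int errorcode = rng.uniform(1, &swapIdx, state, IndexType(0), upperBound);
    if (errorcode)
    {
        return daal::services::Status(daal::services::ErrorIncorrectParameter); // placeholder error ID
    }
    return daal::services::Status(); // success
}
```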
```diff
@@ -619,7 +619,7 @@ protected:
     const size_t nGen = (!_par.memorySavingMode && !_maxLeafNodes && !_useConstFeatures) ? n : _nFeaturesPerNode;
     *_numElems += n;
     RNGs<IndexType, cpu> rng;
-    rng.uniformWithoutReplacement(nGen, _aFeatureIdx.get(), _aFeatureIdx.get() + nGen, _engineImpl->getState(), 0, n);
+    rng.drawKFromBufferWithoutReplacement(nGen, _aFeatureIdx.get(), _aFeatureIdx.get() + nGen, _engineImpl->getState(), n);
```
drawKFromBufferWithoutReplacement returns an error code, but it is not checked here.
Please see the other comment in this review that describes a more correct way of handling errors in DAAL.
I didn't add it to this part because it's dead code; I will remove it soon.
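To illustrate the technique under discussion, here is a self-contained sketch of a partial Fisher-Yates draw of k indices from a prefilled buffer of n (using `std::mt19937` instead of DAAL's RNG service; the real `drawKFromBufferWithoutReplacement` additionally returns the RNG error code this thread is about):

```cpp
#include <cstddef>
#include <random>
#include <utility>

// Draws the first k entries of buffer (holding n distinct indices) uniformly without
// replacement by performing only k swaps, so the drawing step costs O(k).
template <typename IndexType>
void drawKWithoutReplacement(IndexType * buffer, const std::size_t n, const std::size_t k, std::mt19937 & gen)
{
    for (std::size_t i = 0; i < k; ++i) // requires k <= n
    {
        std::uniform_int_distribution<std::size_t> dist(0, n - i - 1); // offset within buffer[i..n)
        std::swap(buffer[i], buffer[i + dist(gen)]);
    }
}
```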
Force-pushed from 3efc0d0 to cb9d780
Force-pushed after rebase. @Vika-F, I will add the status checks next.
@Vika-F I have added status checks with minimal changes and am running a few performance benchmarks, just to make sure nothing changed. Edit: I can confirm that nothing has changed.
Description
With merging #1725, the `chooseFeatures()` helper function has become inefficient. It calls a helper, `rng.uniformWithoutReplacement()`, that implements a non-standard version of drawing k out of N without replacement. The algo is O(N^2) compute time, and with the changes from #1725, N went from `_nFeaturesPerNode` to `maxFeatures`, where `maxFeatures` is the total number of features in the dataset and `_nFeaturesPerNode` is the number of features per node split (default value: `sqrt(maxFeatures)`).

It was reported in uxlfoundation/scikit-learn-intelex#1050 and uxlfoundation/scikit-learn-intelex#984 that our RF classification implementation is very slow and under some circumstances even stalls. I think this is related.
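For a rough sense of scale (illustrative numbers, not measurements from this PR): with `maxFeatures` = 10,000, the default `_nFeaturesPerNode` is sqrt(10,000) = 100, so an O(N^2) draw over N = `maxFeatures` costs on the order of 10,000^2 = 10^8 operations per node split, whereas an O(k) draw with k = 100 costs on the order of 10^2, roughly a factor of 10^6 fewer operations per split.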
In an initial investigation I studied the training time for different values of `max_features`.

We see that for large values of `max_features`, the intelex implementation outperforms the stock implementation. But for smaller values (in particular including the default value `sqrt(N)`), we are slower.

Profiling the code confirmed that indeed we spend most of our time sampling features.
I have added a few changes in this PR:
- A helper in `service_memory.h` to initialize memory with sequential entries
- A Fisher-Yates based draw of k out of N without replacement in `service_rng.h`, going from O(N) -> O(k)
- A rework of the feature sampling in `findBestSplitSerial()`, which still worked on all N features, so we do not benefit from the optimization in the previous bullet; the cost goes from O(N) to O(k + number of constant features), which is the optimal solution (see the sketch after this list)
- A `useConstFeatures` setting to the training params - it used to be hardcoded to `false`
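One possible way to reach the O(k + number of constant features) bound (an illustrative guess, not necessarily how `findBestSplitSerial()` implements it) is to push features found to be constant out of the active sampling range so they are never drawn again:

```cpp
#include <cstddef>
#include <functional>
#include <random>
#include <utility>
#include <vector>

// Draws k usable feature indices while skipping constant ones; each constant feature
// costs at most one extra iteration, giving O(k + number of constant features).
std::vector<std::size_t> drawNonConstFeatures(std::vector<std::size_t> features, std::size_t k,
                                              const std::function<bool(std::size_t)> & isConstant,
                                              std::mt19937 & gen)
{
    std::vector<std::size_t> drawn;
    std::size_t active = features.size(); // features[active..) are known-constant and excluded
    for (std::size_t i = 0; drawn.size() < k && i < active;)
    {
        std::uniform_int_distribution<std::size_t> dist(i, active - 1);
        std::swap(features[i], features[dist(gen)]);
        if (isConstant(features[i]))
            std::swap(features[i], features[--active]); // exclude the constant feature
        else
            drawn.push_back(features[i++]);
    }
    return drawn;
}
```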
With these changes, the picture changes.
Note: This second run was performed on a system with fewer cores, explaining the longer overall training times. In the comparison against the stock implementation, however, this effect cancels out.
The profiler confirms that the bottleneck of calling the random number generator has disappeared.
Most of the compute time is now spent creating the histograms.
Statistical correctness
I have tested the statistical correctness of my Fisher-Yates sampling technique numerically. I draw 5 out of 10 numbers, meaning each number has p = 0.5 of being drawn. Running the test 10,000,000 times confirms the uniform probability of all numbers.

Find the source code of the test file below.
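The original test file is not included in this excerpt; a rough, standalone sketch of an equivalent numerical check (drawing 5 of 10 indices with a partial Fisher-Yates shuffle via `std::mt19937` and counting how often each index is selected) could look like:

```cpp
#include <array>
#include <cstddef>
#include <cstdio>
#include <random>
#include <utility>

// Frequency check: draw k = 5 of n = 10 indices without replacement many times;
// each index should come up with probability k/n = 0.5.
int main()
{
    constexpr std::size_t n = 10, k = 5, reps = 10000000;
    std::array<std::size_t, n> counts {};
    std::mt19937 gen(42);

    for (std::size_t r = 0; r < reps; ++r)
    {
        std::array<std::size_t, n> buffer;
        for (std::size_t i = 0; i < n; ++i) buffer[i] = i; // sequential fill, as in the PR
        for (std::size_t i = 0; i < k; ++i)                // partial Fisher-Yates draw
        {
            std::uniform_int_distribution<std::size_t> dist(0, n - i - 1);
            std::swap(buffer[i], buffer[i + dist(gen)]);
            ++counts[buffer[i]];
        }
    }
    for (std::size_t i = 0; i < n; ++i)
        std::printf("index %zu drawn with frequency %.4f\n", i, static_cast<double>(counts[i]) / reps);
    return 0;
}
```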