
Update tests after feature selection change #1213

Conversation

@ahuber21 (Contributor) commented Mar 14, 2023:

uxlfoundation/oneDAL#2292 introduces changes that need to be reflected on the scikit-learn-intelex side:

  • Add the useConstFeatures algorithm option
    • The previously hardcoded value can now be set as a parameter; it defaults to the former hardcoded value, false (a hedged usage sketch follows the commit message below).
  • Feature sampling was changed: sampling the features considered for the best node split was optimized, which results in different numerical values in the predictions. The overall result of the ensemble is still the same.
// from the commit message
    chore: update unit test reference values
    
    details: feature selection for node splitting was changed which results in
    different numerics for the prediction. mean and variance are still in good
    agreement
    
    mean:
      old -> 22.088
      new -> 22.104
    
    variance:
      old -> 49.4695
      new -> 49.4311
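
A hedged sketch of how the new option could be passed through daal4py once it is built against the updated oneDAL. Only useConstFeatures comes from this PR; the surrounding parameters and the compute() call are illustrative, following the usual daal4py batch-training pattern.

import daal4py as d4p

# Assumption: the training algorithm accepts the new flag once daal4py is built
# against oneDAL containing uxlfoundation/oneDAL#2292; False reproduces the
# previously hardcoded behavior.
train_algo = d4p.decision_forest_regression_training(
    nTrees=100,
    varImportance='MDA_Raw',
    useConstFeatures=False,  # previously hardcoded, now exposed as a parameter
)
# train_result = train_algo.compute(X_train, y_train)  # X_train/y_train: your data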

Edit:
During development I ran a RandomForestClassifier and it still produces similar results (a hedged reproduction sketch follows the timing numbers below):

Accuracy
  old -> 0.9313
  new -> 0.9316

Confusion matrix
  old
    [[6922    4]
     [ 511   63]]

  new
    [[6921    5]
     [ 508   66]]

However, the training time is greatly improved

old: 433.87 seconds
new: 17.25 seconds
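
A hedged sketch of the kind of comparison behind the numbers above. The dataset, estimator parameters, and timing code are illustrative assumptions; the PR does not state which data was used.

import time
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearnex.ensemble import RandomForestClassifier  # oneDAL-accelerated estimator

# Synthetic stand-in data; the actual dataset behind the figures above is not given.
rng = np.random.default_rng(0)
X = rng.random((30000, 20)).astype(np.float32)
y = (X[:, 0] + X[:, 1] > 1.2).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

start = time.time()
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(f"training time: {time.time() - start:.2f} s")

pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("confusion matrix:")
print(confusion_matrix(y_test, pred))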

@Alexsandruss (Contributor) left a comment:

sklearnex/preview/ensemble/forest wrappers should be updated too

maxLeafNodes=0 if self.max_leaf_nodes is None else self.max_leaf_nodes,
maxBins=self.maxBins,
minBinSize=self.minBinSize,
useConstFeatures=self.useConstFeatures,
Contributor:

daal4py wrappers for previous oneDAL versions don't have the useConstFeatures arg. Use daal_check_version for branching by oneDAL version.
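
A hedged sketch of the suggested branching. The helper name and the surrounding wrapper structure are hypothetical; daal_check_version and the keyword names appear elsewhere in this PR.

from daal4py.sklearn._utils import daal_check_version


def _forest_training_kwargs(estimator):
    # Hypothetical helper: assemble the daal4py training keywords used by the
    # sklearnex/preview/ensemble/forest wrappers.
    kwargs = dict(
        maxLeafNodes=0 if estimator.max_leaf_nodes is None else estimator.max_leaf_nodes,
        maxBins=estimator.maxBins,
        minBinSize=estimator.minBinSize,
    )
    # Older daal4py builds do not accept useConstFeatures, so gate it on the oneDAL
    # version (the exact gate value is an assumption based on this PR's test changes).
    if daal_check_version((2023, 'P', 101)):
        kwargs['useConstFeatures'] = estimator.useConstFeatures
    return kwargs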

@@ -1,127 +1,127 @@
36.70242652
Contributor:

What is the need for changing the data file?

@ahuber21 (Contributor, Author):

See the PR description. Node splitting will be changed with uxlfoundation/oneDAL#2292, resulting in different values. As far as I can tell, the algorithm is still producing correct results.

@ahuber21 force-pushed the update-tests-after-feature-selection-change branch from 49b9f1f to 6ef0a5d on March 23, 2023 15:08
@ahuber21 force-pushed the update-tests-after-feature-selection-change branch from 6ef0a5d to 15b1890 on March 23, 2023 15:09
@Alexsandruss (Contributor) commented:

CI is based on the previous oneDAL release, so the unit test file should be dispatched based on the oneDAL version for green CI.
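
A hedged sketch of such a dispatch, reusing the reference statistics quoted in the PR description above. The exact version gate and the test structure are assumptions.

from daal4py.sklearn._utils import daal_check_version

# Pick the reference values that match the installed oneDAL build so that CI running
# against the previous release keeps passing.
if daal_check_version((2023, 'P', 101)):
    expected_mean, expected_variance = 22.104, 49.4311  # new feature selection
else:
    expected_mean, expected_variance = 22.088, 49.4695  # previous oneDAL releases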

@ahuber21 force-pushed the update-tests-after-feature-selection-change branch 2 times, most recently from 629c9eb to 99ef708 on March 24, 2023 10:10
@@ -33,7 +33,12 @@

ACCURACY_RATIO = 0.95 if daal_check_version((2021, 'P', 400)) else 0.85
MSE_RATIO = 1.07
LOG_LOSS_RATIO = 1.4 if daal_check_version((2021, 'P', 400)) else 1.55
if daal_check_version((2023, 'P', 101)):
    LOG_LOSS_RATIO = 1.5
Contributor:

What has happened here?

@ahuber21 (Contributor, Author):

It is discussed in uxlfoundation/oneDAL#2292 and I am investigating the issue. I think my changes make an existing bug even more pronounced for one particular test case, but I would really like to tackle speedup and accuracy in separate PRs.

@ahuber21 force-pushed the update-tests-after-feature-selection-change branch from 99ef708 to cc22197 on March 24, 2023 14:04
@ahuber21 force-pushed the update-tests-after-feature-selection-change branch from cc22197 to b6ba186 on March 24, 2023 15:13
Comment on lines 185 to +193
'decision_forest_regression_batch.csv', lambda r: r[1].prediction, (2023, 'P', 1)),
('decision_forest_regression_hist_batch',
'decision_forest_regression_batch.csv', lambda r: r[1].prediction, (2023, 'P', 1)),
('decision_forest_regression_default_dense_batch',
'decision_forest_regression_batch_20230101.csv',
lambda r: r[1].prediction, (2023, 'P', 101)),
('decision_forest_regression_hist_batch',
'decision_forest_regression_batch_20230101.csv',
lambda r: r[1].prediction, (2023, 'P', 101)),
Contributor:

This set of examples will fail when the version updates from 2023.1.0 to 2023.1.1 because two sets of regression examples will be run.
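
A hedged illustration of the concern: the version tuples act as minimum-version gates, so on oneDAL 2023.1.1 and later both the old and the new regression entries match. One hypothetical way around it (not the test runner's actual mechanism) is to resolve a single reference CSV per build:

from daal4py.sklearn._utils import daal_check_version

# Hypothetical single-entry dispatch: choose the reference CSV by oneDAL version
# instead of listing both generations of the regression examples.
reference_csv = (
    'decision_forest_regression_batch_20230101.csv'
    if daal_check_version((2023, 'P', 101))
    else 'decision_forest_regression_batch.csv'
)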

@ahuber21 merged commit ef57673 into uxlfoundation:master on Mar 24, 2023
@ahuber21 deleted the update-tests-after-feature-selection-change branch on March 24, 2023 18:12