Add interpretability example notebooks #21
base: obliquepr
LGTM once the following changes are made:
- 0. For the simulation notebook: I would remove the Gaussian circles and focus only on sparse parity, since that shows the largest difference. Also remove `max_features=3*n_features`.
- 1. `notebook/iris_benchmark_OF_vs_RF.ipynb`: move the relevant OF content into `examples/tree/plot_iris_dtc.py`.
- 2. For the simulation notebook: add a description of the sparse parity problem according to the reference I linked. Here is a paraphrased summary of what we want to say:
Ref for sparse parity: https://epubs.siam.org/doi/epdf/10.1137/1.9781611974973.56
Sparse parity is a variation of the noisy parity problem, which itself is a multivariate generalization of the noisy XOR problem. This is a binary classification task in high dimensions.
<describe sparse parity as done in the paper in more layman's terms>
<describe the intuition for why OF would be better than RF>
e.g. OF should be more robust to high-dimensional noise. Moreover, because OF can sample more candidate splits than RF (i.e. `max_features` can be greater than `n_features`), we expect performance to improve when we are willing to spend more compute on sampling additional splits.
...
- 3. For the MNIST notebook: only show `max_features = sqrt` and `n_features`.
- Add a section describing the dataset very briefly, then link to https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html for reference.
- Add a section, similar to the sparse parity one, discussing the differences between OF and RF. Add a multi-class ROC curve.
- Add reports similar to those shown in the existing digits example.
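For item 2, here is a minimal sketch of how the sparse parity data could be generated, in case it helps with the description. It assumes the usual setup: features drawn uniformly from [-1, 1], with the label given by the parity of the number of positive values among the first few informative dimensions. The helper name `make_sparse_parity` and the defaults (20 features, 3 informative) are my own placeholders, so please check them against the reference before using them in the notebook.

```python
import numpy as np


def make_sparse_parity(n_samples, n_features=20, n_informative=3, seed=0):
    """Draw a sparse parity sample.

    Hypothetical helper for illustration only; the defaults are assumptions
    and should be checked against the linked reference.
    """
    rng = np.random.default_rng(seed)
    # Every feature is uniform on [-1, 1]; only the first n_informative
    # columns influence the label, the rest are pure noise dimensions.
    X = rng.uniform(-1.0, 1.0, size=(n_samples, n_features))
    # The label is 1 when an odd number of informative features are positive.
    y = (X[:, :n_informative] > 0).sum(axis=1) % 2
    return X, y


X, y = make_sparse_parity(1000)
print(X.shape, np.bincount(y))
```

No single feature is informative on its own here, which is the intuition we want the notebook text to convey.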
Ideally we can try to have this done by Friday so we can show these to sklearn devs at OH on Monday. If you can't have this done by then (I know you have a lot of stuff going on!), please let me know and I can help out so we can have things ready by Monday.
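For item 3, a rough sketch of the per-class reports and one-vs-rest ROC computation might look like the following. It assumes the notebook uses scikit-learn's bundled digits dataset, and it uses `RandomForestClassifier` only as a stand-in; the oblique forest estimator from this PR would be swapped in where available. The split ratio, `n_estimators`, and the `roc_auc` dictionary are placeholders, and plotting is left out.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, classification_report, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

classes = np.unique(y)
# One indicator column per digit, for one-vs-rest ROC curves.
y_test_bin = label_binarize(y_test, classes=classes)

roc_auc = {}
for max_features in ("sqrt", None):  # None means all n_features
    clf = RandomForestClassifier(
        n_estimators=100, max_features=max_features, random_state=0
    )
    clf.fit(X_train, y_train)
    # Same style of report as the existing digits example.
    print(f"max_features={max_features}")
    print(classification_report(y_test, clf.predict(X_test)))
    proba = clf.predict_proba(X_test)
    for k, label in enumerate(classes):
        fpr, tpr, _ = roc_curve(y_test_bin[:, k], proba[:, k])
        roc_auc[(max_features, int(label))] = auc(fpr, tpr)
```

The `fpr` and `tpr` arrays from each iteration are what would be passed to the plotting code for the multi-class ROC figure.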
6/13/2022 TODOs:
Additional refs from sklearn dev team
For documentation that will get merged into the PR branch:
For the real datasets, we can use cnae-9, phishing-websites, and wdbc from openml, which seemed to show differing performance for OF and RF:
Ideally we can have some intuition on why RF vs OF is better in one of these...
Reference Issues/PRs
What does this implement/fix? Explain your changes.
Add 3 interpretability example notebooks
Any other comments?