Scarliles/splitter injection #61

SamuelCarliles3 · 2024-03-18T21:10:30Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Implements the ability to inject split accept/reject conditions into the Splitter from Python.

Any other comments?

Function signatures and code organization are expected to undergo further refactoring.

github-actions · 2024-03-18T21:11:51Z

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here

`cython-lint`

cython-lint detected issues. Please fix them locally and push the changes. Here you can see the detected issues. Note that the installed cython-lint version is cython-lint=0.16.0.


/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:50:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:60:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:63:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:101:24: E261 at least two spaces before inline comment
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:124:24: E261 at least two spaces before inline comment
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:145:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:173:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:177:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:201:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:205:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:332:9: dangerous default value!
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:333:9: dangerous default value!
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:383:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:387:5: E303 too many blank lines (2)
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:392:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:404:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:414:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:839:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:845:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:868:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:871:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:875:1: W293 blank line contains whitespace

Generated for commit: 9d6091a. Link to the linter CI: here

adam2392

A few initial thoughts:

1. If this has negligible performance cost compared to the scikit-learn fork and scikit-learn main (and I assume the same would then hold for scikit-tree), then very cool, and I think this is the right way to go. So my immediate thought is to compare runtime performance vs. n_samples and n_dimensions for 100, 500, and 1000 trees. This is probably the most important thing to do ASAP to determine whether this is a suitable route. Excited to chat about that.
2. If there are no performance issues, then the next question I have is: how do we make this as usable as possible? Ideally we would do some checks upon instantiation of the Splitter class, so that it determines whether any given SplitConditionTuple is valid and will not result in a segfault, etc. Right now it seems potentially very easy to trigger one, so perhaps we can construct some guardrails? I don't know how best to do this at the moment, though.
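One possible shape for such guardrails, as a minimal pure-Python sketch rather than anything in this PR: validate every injected condition at construction time, before it can reach nogil code. The names `SplitCondition`, `MinSamplesLeafCondition`, `validate`, and `check_split_conditions` are hypothetical and only illustrate the idea.

```python
class SplitCondition:
    """Hypothetical base class for Python-facing condition wrappers."""

    def validate(self):
        """Raise ValueError if the wrapped parameters are unusable."""
        raise NotImplementedError


class MinSamplesLeafCondition(SplitCondition):
    """Hypothetical wrapper: reject splits that leave a child too small."""

    def __init__(self, min_samples):
        self.min_samples = min_samples

    def validate(self):
        if not isinstance(self.min_samples, int) or self.min_samples < 1:
            raise ValueError("min_samples must be a positive integer")


def check_split_conditions(conditions):
    """Validate user-supplied conditions; meant to run once at instantiation."""
    checked = []
    for i, cond in enumerate(conditions):
        # Only accept known wrapper types, never raw (function, payload) pairs.
        if not isinstance(cond, SplitCondition):
            raise TypeError(
                f"condition {i} is {type(cond).__name__}, "
                "expected a SplitCondition wrapper"
            )
        cond.validate()
        checked.append(cond)
    return checked
```

A check like this cannot catch a wrapper whose own internals are wrong, but it would stop arbitrary objects or out-of-range parameters from being handed to the splitter.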

SamuelCarliles3 · 2024-03-19T15:10:47Z

> 2. If there are no performance issues, then the next question I have is how do we make this as usable as possible? We would like to ideally do some checks upon instantiation of the Splitter class, so that way it determines if any kind of SplitConditionTuple is valid and will not result in a seg fault/etc. Rn, it seems very easy to do so potentially, so we can construct some guardrails perhaps? Idk how to best do this atm though.

To be explicit, there are two sus moves happening in this design pattern:

1. the unencapsulated existence of the underlying condition functions
2. the type unsafety of the condition parameter struct payloads that have to be cast inside the condition functions

These are what I would characterize as necessary evils resulting from technical constraints of Cython -- the conditions as executed in, say, node_split_best can't be Cython extension types because a) C++ template containers can't hold Cython extension types, and b) Cython extension types can't be selected out of a Python iterable in a nogil block/function. If we want encapsulation, our options are to write the whole shebang in native C++ and wrap it, to accept the GIL, or to forgo a dynamic (from the C/C++ point of view) constraint count. The extension type/nogil tradeoff similarly applies to the parameter struct typecast situation.

So my intent with the wrapper classes was that those (function, parameter struct) tuple structs never be created by hand. When you want to add a new type of condition, you write the condition function, the function-specific parameter payload struct, and the extension type wrapper that provides a usable Python interface; the condition functions and parameter structs themselves are never used directly. Once the wrapper class is there, it can be used dynamically in any Python context without having to disturb existing Cython code.

To mitigate the aforementioned shortcomings, only two things immediately come to mind:

1. documenting the design pattern with the why, the how, and the why-not-X
2. adding underscores to the condition function names (and possibly the payload structs? is that legal Cython?)

I am definitely open to other suggestions.

adam2392 · 2024-03-19T15:19:34Z

Say we implement the template in C++. Do you think it would be a lot of work? I'm not opposed to supporting C++ code as long as it's short and enables stuff we otherwise can't do easily with Cython alone.

If uncertain, then I'm okay keeping the current design for now. I'd rather get something working and benchmarked first.

SamuelCarliles3 · 2024-03-19T15:25:18Z

My personal opinion: we're using a duck-typed language to wrap the mother of all unsafe languages. It's the programming equivalent of duct-taping a rocket launcher to a table saw. I think it's fair to do the two mitigations I proposed (and any other simple ones that come to mind) and accept that future devs have some responsibility to know what they're doing.

adam2392 · 2024-03-19T15:39:18Z

Fair point. Let's see how the benchmarks turn out then with the current approach!

adam2392 · 2024-03-19T15:42:52Z

One additional note: it would also be good to confirm this parallelizes fine: #61 (review)

E.g. n_jobs = 1 vs >1
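A minimal stdlib-only sketch of how the comparison discussed in this thread could be timed over a grid (n_samples, n_dimensions, number of trees, n_jobs). It is not the asv setup used for this PR; `benchmark_grid` and the `fit(X, y, n_trees, n_jobs)` callable are hypothetical, and the caller is assumed to wrap whichever forest implementation is under test.

```python
import itertools
import random
import time


def benchmark_grid(fit, n_samples_grid, n_features_grid, n_trees_grid,
                   n_jobs_grid=(1, 2), repeats=3, seed=0):
    """Time `fit(X, y, n_trees, n_jobs)` at every grid point.

    Returns one dict per grid point holding the best wall time in seconds.
    """
    rng = random.Random(seed)
    results = []
    for n_samples, n_features in itertools.product(n_samples_grid, n_features_grid):
        # Synthetic regression data, regenerated for each data shape.
        X = [[rng.random() for _ in range(n_features)] for _ in range(n_samples)]
        y = [rng.random() for _ in range(n_samples)]
        for n_trees, n_jobs in itertools.product(n_trees_grid, n_jobs_grid):
            best = float("inf")
            for _ in range(repeats):
                start = time.perf_counter()
                fit(X, y, n_trees, n_jobs)
                best = min(best, time.perf_counter() - start)
            results.append({
                "n_samples": n_samples,
                "n_features": n_features,
                "n_trees": n_trees,
                "n_jobs": n_jobs,
                "seconds": best,
            })
    return results
```

Running the same grid against each branch (for example with `n_trees_grid=(100, 500, 1000)`) and comparing the `seconds` columns would give a first read on both the overhead of injection and whether n_jobs > 1 still scales.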

SamuelCarliles3 added 14 commits February 16, 2024 13:36

init split condition injection

8c09f7f

wip

ecfc9b1

wip

0c3d5c0

wip

5fd12a2

injection progress

b593ee0

injection progress

180fac3

split injection refactoring

c207c3e

added condition parameter passthrough prototype

7cc71c1

some tidying

2470d49

more tidying

ee3399f

splitter injection refactoring

a079e4f

cython injection due diligence, converted min_sample and monotonic_cst to injections

5397b66

tree tests pass huzzah!

44f1d57

added some splitconditions to header

4f19d53

adam2392 reviewed Mar 18, 2024

SamuelCarliles3 added 6 commits March 21, 2024 10:33

commented out some sample code that was substantially increasing peak memory utilization in asv

cb71be0

added vector resize to Splitter

7514619

cython minutiae

24bfd22

trying pointer to vector (instead of inline) in extension type

7e52d2a

asv setup_cache failing with pointer to vector, experiment cost exceeding benefit, reverting

53d7abb

pickle and vector resize refactoring

9d6091a

adam2392 mentioned this pull request Apr 30, 2024

added regression forest benchmark #66

Open
