-
Notifications
You must be signed in to change notification settings - Fork 540
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add as_sklearn
and from_sklearn
APIs to serialize to CPU sklearn-estimators for supported models
#6102
base: branch-25.02
Are you sure you want to change the base?
Conversation
Why have the methods do both the conversion cuml<>sklearn and the serialisation? Having a way to convert to and from scikit-learn seems like a useful thing by itself. Maybe because you have your own way of serialising the model, or because you need a particular type of model or who-knows-what. So to serialise it you'd do something like How hard would it be to have Name bike shedding: if we don't save things to a file, how about If we do save to a file, then |
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
to_sklearn
and from_sklearn
APIs to serialize to CPU sklearn-estimators for supported modelsas_sklearn
and from_sklearn
APIs to serialize to CPU sklearn-estimators for supported models
@betatim just implemented your suggestions and changed the functionality to not save a file but return an estimator. I think the idea of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just have two comments
return self._cpu_model | ||
|
||
@classmethod | ||
def from_sklearn(cls, model): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be great to have a global conversion table, so that we don't need to provide the class as a parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a class method, so we get the class from that, it's not something the user passes (like self
in non class methods)
A global conversion table will be useful for a follow up to add cuml.from_sklearn
library type of functionality though
""" | ||
estimator = cls() | ||
estimator.import_cpu_model() | ||
estimator._cpu_model = model |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be interesting to add an optional parameter to this function to allow a deepcopy of the sklearn model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol I asked the same thing before reading this suggestion :)
python/cuml/cuml/internals/base.pyx
Outdated
self.import_cpu_model() | ||
self.build_cpu_model() | ||
self.gpu_to_cpu() | ||
return self._cpu_model |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we not return a deep copy here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my education, why would we want to deepcopy? Mostly asking because in my experience 99% of cases where someone uses deepcopy
there is something else that we can do instead or just not do it. Mostly Python "just works" without deepcopy'ing, hence my interest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we simply return a reference to the internal model, any modification to one (additional training or something else) would affect the other. This might create a situation in which the CPU and GPU attributes are out of sync in the cuML estimator. Or inversely, the sklearn estimator returned by the function might silently be updated by the cuML estimator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps it should be a parameter, by default most users probably won't care about needing a deep copy, so I wouldn't do it by default, but if a user needs it then they can request it, what do you guys think?
else: | ||
raise ValueError(f"Serializer {format} not supported.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does click
not take care of invalid values being passed :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does, I added it by accident based on habits :P
# Convert to sklearn estimator | ||
sklearn_estimator = accelerated_estimator.as_sklearn() | ||
|
||
# Save using chosen format | ||
with open(output, "wb") as f: | ||
serializer.dump(sklearn_estimator, f) | ||
|
||
# Exit after conversion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the comments? They seem like they repeat what the code says. I like comments that explain why the code is the way it is, but I don't think we need that here as it is pretty straightforward
@pytest.fixture | ||
def random_state(): | ||
return 42 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need this instead of using 42
in the tests directly?
We could have a global version of this that allows us to run the tests with several seeds, but maybe something to tackle in the future/new PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't really need it at all
Co-authored-by: Tim Head <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pre-approving this given vacation schedules. Looks great to me once current discussion is resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, addding the deepcopy
optional parameter to both functions is probably optimal, but both solutions work for me.
No description provided.