Using pickling as much as possible for serialization #966
Comments
The focus is not about memory copy. Initially, the question was to be environment independent: the user doesn't have to think about "if I want to get an older item, but my setup has changed". Skore is currently not really a cache object, but a way to represent/track objects in a cool interface. @tuscland @MarieS-WiMLDS IMO the central question is: if we want to cache objects, we should focus on pickle. If not, we should remove all the sugar we've added, to make it clear that we are not a cache system.
OK, the trade-off is clearer to me now. I'm wondering then if we should have both modes: experimentation, which would benefit from pickling, and MLOps/tracking, which would benefit from your approach. I don't know if it is going to be super messy, or if the delimitation between both modes is actually super fuzzy.
That sounds interesting, but I don't know what it could look like.
It's already super fuzzy, e.g. when to use
An Item is public. It is a structure that represents the result of some work the user performed.
Skore is not a cache.
With a user-controlled parameter like
Before thinking about a solution, I want to make sure that experimentation actually lacks expressivity with our scheme. The serialization protocol is private. We picked JSON for simplicity, but we know it is not efficient, and that we might need to optimize it in the future.
As I previously said, IMO the central question is a product question: do we want to cache all types of objects, or not.
Actually, as a user, I'm not sure the difference between tracking and caching is clear to me. Use case example:

from skore import CrossValidationReporter
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

project.create("test", overwrite=True)
X, y = make_classification()
models_to_test = [DecisionTreeClassifier(), LogisticRegression(), GradientBoostingClassifier()]

for model in models_to_test:
    cv_reporter = CrossValidationReporter(model, X, y)
    project.put("cv_reporter", cv_reporter)

cross_val_versions = project.get_item_versions("cv_reporter")
scorings = ["auc", "recall"]
for scoring in scorings:
    best_score = 0
    best_model = None
    for cv, model in zip(cross_val_versions, models_to_test):
        if cv.cv_results[scoring] > best_score:
            best_score = cv.cv_results[scoring]
            best_model = model
    print(f"for scoring {scoring}, the best model is {best_model}")
A couple more thoughts, since I'm struggling here with the abstraction levels. If I take the use case of @MarieS-WiMLDS, it is what I define as "experimentation mode". In this mode, I am unsure that being environment/platform independent is crucial. I can even imagine dumping environment information when creating the item.

Now, from what I understood of the "tracking" mode, where I think the current environment/platform independence is useful, is in an MLOps setting where I cannot easily choose how to set up my environment, hardware, etc. In this case, it makes a lot of sense to me to: (i) restrain the complexity related to the environment and (ii) trim the insights of the model to inference only.
Currently, it seems that we are using our own way to do serialization. Let me take an example:
skore/skore/src/skore/item/cross_validation_item.py
Lines 131 to 134 in 4c0b0b8
or
skore/skore/src/skore/item/numpy_array_item.py
Lines 75 to 80 in 4c0b0b8
I'm a bit worried to see this type of pattern because we are copying the data and using an inefficient representation (not contiguous in memory). Since those data, at least for the moment, are only shared between our own interfaces, I'm wondering why we are not just using pickling with cloudpickle or joblib. It would also have the advantage that you can pickle many more types of data out of the box, without having to write a specific item class, as far as I can see.
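To make the trade-off concrete, here is a minimal, hypothetical sketch (not skore code) comparing a JSON-style tolist() representation with a plain pickle of the same NumPy array:

```python
import json
import pickle

import numpy as np

arr = np.random.default_rng(0).normal(size=1000)  # 8 kB of float64 data

# JSON-style path: copy the buffer into boxed Python floats, then into text.
json_repr = json.dumps(arr.tolist())

# Pickle path: binary dump, close to arr.nbytes plus a small header.
pickle_repr = pickle.dumps(arr)

# The text form is several times larger than the binary form.
print(len(json_repr), len(pickle_repr))
```

The JSON path also loses the dtype (everything becomes a Python float), which is why a dedicated item class has to carry that metadata separately.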
I think a good amount of work has been dedicated to avoiding memory copies when pickling NumPy arrays, which could be nice to leverage here:
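One example of that work is pickle protocol 5 (PEP 574), which NumPy supports: the array's buffer can be handed over out-of-band instead of being copied into the pickle stream. A minimal sketch:

```python
import pickle

import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)  # ~8 MB, contiguous

# Collect the raw buffers out-of-band instead of copying them
# into the pickle payload (pickle protocol 5, PEP 574).
buffers = []
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# The payload now holds only metadata; the data stays in `buffers`.
restored = pickle.loads(payload, buffers=buffers)
assert np.array_equal(arr, restored)
```

joblib and cloudpickle build on the same machinery, so storage backends that can keep buffers separate avoid the copy entirely.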