Using pickling as much as possible for serialization #966

Open
glemaitre opened this issue Dec 16, 2024 · 10 comments
@glemaitre
Member

Currently, it seems that we are using our own way to do serialization. Let me take an example:

```python
if k == "indices":
    cv_results_serialized["indices"] = {
        "train": tuple(arr.tolist() for arr in v["train"]),
        "test": tuple(arr.tolist() for arr in v["test"]),
    }
```

or

```python
import numpy

if not isinstance(array, numpy.ndarray):
    raise ItemTypeError(f"Type '{array.__class__}' is not supported.")
return cls(array_json=dumps(array.tolist()))
```

I'm a bit worried to see this type of pattern because we are copying the data and using an inefficient representation (not contiguous in memory). Since these data are, at least for the moment, only shared between our own interfaces, I'm wondering why we don't just use pickling with cloudpickle or joblib.

It would also have the advantage that you can pickle many more types of data out of the box, without having to write a specific item class, as far as I can see.

I think a good amount of work has been dedicated to avoiding memory copies when pickling NumPy arrays, which could be nice to leverage here:
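For illustration, pickle protocol 5 can hand a NumPy array's underlying buffer out-of-band, so serialization does not go through a per-element `tolist()` copy. A minimal sketch (not skore code; the `dumps_no_copy`/`loads_no_copy` names are made up here):

```python
import pickle

import numpy as np


def dumps_no_copy(array: np.ndarray) -> tuple[bytes, list]:
    """Serialize `array`, collecting its data buffer out-of-band (no element-wise copy)."""
    buffers = []
    payload = pickle.dumps(array, protocol=5, buffer_callback=buffers.append)
    return payload, buffers


def loads_no_copy(payload: bytes, buffers: list) -> np.ndarray:
    """Rebuild the array from the pickle payload and its out-of-band buffers."""
    return pickle.loads(payload, buffers=buffers)


array = np.arange(6, dtype=np.float64).reshape(2, 3)
payload, buffers = dumps_no_copy(array)
restored = loads_no_copy(payload, buffers)
assert np.array_equal(restored, array) and restored.dtype == array.dtype
```

Round-tripping this way preserves dtype, shape, and memory layout, which the `tolist()`-to-JSON path loses.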

@thomass-dev
Collaborator

thomass-dev commented Dec 16, 2024

The focus is not on memory copies. Initially, the goal was to be environment-independent: the user doesn't have to think about "what if I want to get an older item, but my setup has changed?".

Skore is currently not really a cache, but a way to represent/track objects in a nice interface.
But the further we go, the more we question ourselves.

@tuscland @MarieS-WiMLDS IMO the central question is: if we want to cache objects, we should focus on pickle. If not, we should remove all the sugar we've added to make it clear that we are not a cache system.

@glemaitre
Member Author

> The focus is not on memory copies. Initially, the goal was to be environment-independent: the user doesn't have to think about "what if I want to get an older item, but my setup has changed?".

OK, the trade-off is clearer to me now. I'm wondering then if we should have both modes: experimentation, which would benefit from pickling, and MLOps/tracking, which would benefit from your approach. I don't know if it's going to be super messy, or if the delimitation between the two modes is actually super fuzzy.

@augustebaum
Contributor

> I'm wondering then if we should have both modes

That sounds interesting, but I don't know what it could look like.

> I don't know if it's going to be super messy, or if the delimitation between the two modes is actually super fuzzy.

It's already super fuzzy, e.g. when to use put/get vs. put_item/get_item, or what an Item is (should it be private or public)...

@tuscland tuscland added needs-triage This has been recently submitted and needs attention user-reported labels Dec 17, 2024
@tuscland
Member

An Item is public. It is a structure that represents the result of some work the user performed.

get and put are conveniences that make a best effort to convert/coerce a result to the relevant item.
get_item and put_item are the base functions, with no sugar.

@tuscland
Member

> @tuscland @MarieS-WiMLDS IMO the central question is: if we want to cache objects, we should focus on pickle. If not, we should remove all the sugar we've added to make it clear that we are not a cache system.

Skore is not a cache.
We have done a significant amount of work to identify common data structures that depend as little as possible on an environment, so that we have the opportunity to offer platform-independent usage of the library.

@thomass-dev
Collaborator

thomass-dev commented Dec 17, 2024

> OK, the trade-off is clearer to me now. I'm wondering then if we should have both modes: experimentation, which would benefit from pickling, and MLOps/tracking, which would benefit from your approach. I don't know if it's going to be super messy, or if the delimitation between the two modes is actually super fuzzy.

With a user-controlled parameter like put([...], protocol=<JSON|pickle>), defaulting to JSON?
The user should be warned that by changing their environment they can break not only get, but also the export of some items in the UI...
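A minimal sketch of what such a dual-protocol API could look like; `put`/`get` and the in-memory store below are illustrative stand-ins, not skore's actual implementation:

```python
import json
import pickle

# Hypothetical storage backend: key -> (protocol used, serialized bytes).
_store: dict[str, tuple[str, bytes]] = {}


def put(key: str, value, protocol: str = "json") -> None:
    """Store `value` under `key`, serialized with the chosen protocol."""
    if protocol == "json":
        # Environment-independent, but limited to JSON-encodable values.
        _store[key] = ("json", json.dumps(value).encode())
    elif protocol == "pickle":
        # Accepts almost any Python object, but ties reads to the
        # environment that wrote the value.
        _store[key] = ("pickle", pickle.dumps(value))
    else:
        raise ValueError(f"Unknown protocol: {protocol!r}")


def get(key: str):
    """Retrieve and deserialize the value stored under `key`."""
    protocol, payload = _store[key]
    if protocol == "json":
        return json.loads(payload)
    return pickle.loads(payload)


put("scores", [0.8, 0.9])                        # default: JSON
put("model_params", {"coef": (1, 2)}, protocol="pickle")
assert get("scores") == [0.8, 0.9]
assert get("model_params") == {"coef": (1, 2)}
```

Note how the JSON path silently coerces the tuple to a list if stored that way, while the pickle path preserves it exactly; that fidelity difference is the trade-off being discussed.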

@tuscland
Member

> With a user-controlled parameter like put([...], protocol=<JSON|pickle>), defaulting to JSON?

Before thinking about a solution, I want to make sure that experimentation actually lacks expressivity under our scheme.

The serialization protocol is private. We picked JSON for simplicity, but we know it is not efficient, and that we might need to optimize it in the future.

@tuscland tuscland removed the needs-triage This has been recently submitted and needs attention label Dec 17, 2024
@tuscland tuscland self-assigned this Dec 17, 2024
@thomass-dev
Collaborator

thomass-dev commented Dec 17, 2024

As I previously said, IMO the central question is a product question: do we want to cache all types of objects, or not?

@MarieS-WiMLDS
Contributor

Actually, as a user, I'm not sure the difference between tracking and caching is clear to me.
I expect a cache to store an object for me and let me recover it easily, unchanged.
I expect a tracking system to store various states of an object, and to help me understand how it evolves and maybe how it relates to other objects. Regarding the evolution part, in skore it lives both in skore-ui and in the skore back-end. Therefore I want to be able to recover objects from the db, which seems pretty close to a cache?

Use case example:

```python
import skore
from skore import CrossValidationReporter
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

project = skore.create("test", overwrite=True)

X, y = make_classification()
models_to_test = [
    DecisionTreeClassifier(),
    LogisticRegression(),
    GradientBoostingClassifier(),
]

for model in models_to_test:
    cv_reporter = CrossValidationReporter(model, X, y)
    project.put("cv_reporter", cv_reporter)

cross_val_versions = project.get_item_versions("cv_reporter")
scorings = ["auc", "recall"]

for scoring in scorings:
    best_score = 0
    for cv, model in zip(cross_val_versions, models_to_test):
        if cv.cv_results[scoring] > best_score:
            best_score = cv.cv_results[scoring]
            best_model = model
    print(f"for scoring {scoring}, the best model is {best_model}")
```

@glemaitre
Member Author

A couple more thoughts, since I'm struggling with the abstraction levels here.

If I take @MarieS-WiMLDS's use case, it is what I would call "experimentation mode". In this mode, I am unsure that being environment/platform-independent is crucial. I can even imagine dumping environment information when creating the skore project; it would allow me to spin the environment back up if I want to get back to where I was. In this setting, I also don't want to be limited in what I can or cannot dump. However, I get that skore-UI can only process or filter a subset of the information that I potentially dumped.

Now, "tracking", which is where I think the current environment/platform independence is useful, corresponds to an MLOps setting where I cannot easily choose how to set up my environment, hardware, etc. In this case, it makes a lot of sense to me to (i) restrain the complexity related to the environment and (ii) trim the insights of the model down to inference only.
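The "dump environment information when creating the skore project" idea could amount to recording the Python version and installed packages at project-creation time. A hedged sketch using only the standard library (the function name and the returned dict layout are made up here, not a skore API):

```python
import sys
from importlib import metadata


def snapshot_environment() -> dict:
    """Record the interpreter version and installed package versions."""
    packages = {
        dist.metadata["Name"]: dist.version
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip distributions with broken metadata
    }
    return {"python": sys.version.split()[0], "packages": packages}


env = snapshot_environment()
assert "python" in env and isinstance(env["packages"], dict)
```

Such a snapshot could be written alongside the project so that a pickle-based store stays recoverable: if unpickling fails later, the user at least knows which environment to rebuild.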
