Using pickling as much as possible for serialization #966

Open
glemaitre opened this issue Dec 16, 2024 · 10 comments
@glemaitre
Member

Currently, it seems that we are using our own way to do serialization. Let me take an example:

```python
if k == "indices":
    cv_results_serialized["indices"] = {
        "train": tuple(arr.tolist() for arr in v["train"]),
        "test": tuple(arr.tolist() for arr in v["test"]),
    }
```

or

```python
import numpy

if not isinstance(array, numpy.ndarray):
    raise ItemTypeError(f"Type '{array.__class__}' is not supported.")
return cls(array_json=dumps(array.tolist()))
```

I'm a bit worried to see this type of pattern because we are copying the data and using an inefficient representation (not contiguous in memory). Since these data are, at least for the moment, only shared between our own interfaces, I'm wondering why we don't just use pickling with cloudpickle or joblib.

It would also have the advantage that you can pickle many more types of data out of the box, without having to write a specific item class, as far as I can see.

I think a good amount of work has been dedicated to avoiding memory copies when pickling NumPy arrays, which could be nice to leverage here:
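For illustration, pickle protocol 5 can hand a NumPy array's underlying buffer out-of-band, so serialization does not go through a per-element `tolist()` copy. A minimal sketch (not skore code; the `dumps_no_copy`/`loads_no_copy` names are made up here):

```python
import pickle

import numpy as np


def dumps_no_copy(array: np.ndarray) -> tuple[bytes, list]:
    """Serialize `array`, collecting its data buffer out-of-band (no element-wise copy)."""
    buffers = []
    payload = pickle.dumps(array, protocol=5, buffer_callback=buffers.append)
    return payload, buffers


def loads_no_copy(payload: bytes, buffers: list) -> np.ndarray:
    """Rebuild the array from the pickle payload and its out-of-band buffers."""
    return pickle.loads(payload, buffers=buffers)


array = np.arange(6, dtype=np.float64).reshape(2, 3)
payload, buffers = dumps_no_copy(array)
restored = loads_no_copy(payload, buffers)
assert np.array_equal(restored, array) and restored.dtype == array.dtype
```

Round-tripping this way preserves dtype, shape, and memory layout, which the `tolist()`-to-JSON path loses.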

@thomass-dev
Collaborator

thomass-dev commented Dec 16, 2024

The focus is not on memory copies. Initially, the goal was to be environment-independent: the user doesn't have to think about "what if I want to get an older item, but my setup has changed?".

Skore is currently not really a cache, but a way to represent/track objects in a nice interface.
But the further we go, the more we question ourselves.

@tuscland @MarieS-WiMLDS IMO the central question is: if we want to cache objects, we should focus on pickle. If not, we should remove all the sugar we've added to make it clear that we are not a cache system.

@glemaitre
Member Author

> The focus is not on memory copies. Initially, the goal was to be environment-independent: the user doesn't have to think about "what if I want to get an older item, but my setup has changed?".

OK, the trade-off is clearer to me now. I'm wondering then if we should have both modes: experimentation, which would benefit from pickling, and MLOps/tracking, which would benefit from your approach. I don't know if it's going to be super messy, or if the delimitation between the two modes is actually super fuzzy.

@augustebaum
Contributor

> I'm wondering then if we should have both modes

That sounds interesting, but I don't know what it could look like.

> I don't know if it's going to be super messy, or if the delimitation between the two modes is actually super fuzzy.

It's already super fuzzy, e.g. when to use put/get vs. put_item/get_item, or what an Item is (should it be private or public)...

@tuscland tuscland added needs-triage This has been recently submitted and needs attention user-reported labels Dec 17, 2024
@tuscland
Member

An Item is public. It is a structure that represents the result of some work the user performed.

get and put are conveniences that make a best effort to convert/coerce a result to the relevant item.
get_item and put_item are the base functions, with no sugar.

@tuscland
Member

> @tuscland @MarieS-WiMLDS IMO the central question is: if we want to cache objects, we should focus on pickle. If not, we should remove all the sugar we've added to make it clear that we are not a cache system.

Skore is not a cache.
We have done a significant amount of work to identify common data structures that depend as little as possible on an environment, so that we have the opportunity to offer platform-independent usage of the library.

@thomass-dev
Collaborator

thomass-dev commented Dec 17, 2024

> OK, the trade-off is clearer to me now. I'm wondering then if we should have both modes: experimentation, which would benefit from pickling, and MLOps/tracking, which would benefit from your approach. I don't know if it's going to be super messy, or if the delimitation between the two modes is actually super fuzzy.

With a user-controlled parameter like put([...], protocol=<JSON|pickle>), defaulting to JSON?
The user should be warned that by changing their environment they can break not only get, but also the export of some items in the UI...
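A minimal sketch of what such a dual-protocol API could look like; `put`/`get` and the in-memory store below are illustrative stand-ins, not skore's actual implementation:

```python
import json
import pickle

# Hypothetical storage backend: key -> (protocol used, serialized bytes).
_store: dict[str, tuple[str, bytes]] = {}


def put(key: str, value, protocol: str = "json") -> None:
    """Store `value` under `key`, serialized with the chosen protocol."""
    if protocol == "json":
        # Environment-independent, but limited to JSON-encodable values.
        _store[key] = ("json", json.dumps(value).encode())
    elif protocol == "pickle":
        # Accepts almost any Python object, but ties reads to the
        # environment that wrote the value.
        _store[key] = ("pickle", pickle.dumps(value))
    else:
        raise ValueError(f"Unknown protocol: {protocol!r}")


def get(key: str):
    """Retrieve and deserialize the value stored under `key`."""
    protocol, payload = _store[key]
    if protocol == "json":
        return json.loads(payload)
    return pickle.loads(payload)


put("scores", [0.8, 0.9])                        # default: JSON
put("model_params", {"coef": (1, 2)}, protocol="pickle")
assert get("scores") == [0.8, 0.9]
assert get("model_params") == {"coef": (1, 2)}
```

Note how the JSON path silently coerces the tuple to a list if stored that way, while the pickle path preserves it exactly; that fidelity difference is the trade-off being discussed.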

@tuscland
Member

> With a user-controlled parameter like put([...], protocol=<JSON|pickle>), defaulting to JSON?

Before thinking about a solution, I want to make sure that experimentation actually lacks expressivity under our scheme.

The serialization protocol is private. We picked JSON for simplicity, but we know it is not efficient, and that we might need to optimize it in the future.

@tuscland tuscland removed the needs-triage This has been recently submitted and needs attention label Dec 17, 2024
@tuscland tuscland self-assigned this Dec 17, 2024
@thomass-dev
Collaborator

thomass-dev commented Dec 17, 2024

As I previously said, IMO the central question is a product question: do we want to cache all types of objects, or not?

@MarieS-WiMLDS
Contributor

Actually, as a user, I'm not sure the difference between tracking and caching is clear to me.
I expect a cache to store an object for me and let me recover it easily, unchanged.
I expect a tracking system to store various states of an object, and to help me understand how it evolves and maybe how it relates to other objects. Regarding the evolution part, in skore it lives both in skore-ui and in the skore back-end. Therefore I want to be able to recover objects from the db, which seems pretty close to a cache?

Use case example:

```python
import skore
from skore import CrossValidationReporter
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

project = skore.create("test", overwrite=True)

X, y = make_classification()
models_to_test = [
    DecisionTreeClassifier(),
    LogisticRegression(),
    GradientBoostingClassifier(),
]

for model in models_to_test:
    cv_reporter = CrossValidationReporter(model, X, y)
    project.put("cv_reporter", cv_reporter)

cross_val_versions = project.get_item_versions("cv_reporter")
scorings = ["auc", "recall"]

for scoring in scorings:
    best_score = 0
    for cv, model in zip(cross_val_versions, models_to_test):
        if cv.cv_results[scoring] > best_score:
            best_score = cv.cv_results[scoring]
            best_model = model
    print(f"for scoring {scoring}, the best model is {best_model}")
```

@glemaitre
Member Author

A couple more thoughts, since I'm struggling with the abstraction levels here.

If I take @MarieS-WiMLDS's use case, it is what I would call "experimentation mode". In this mode, I am unsure that being environment/platform-independent is crucial. I can even imagine dumping environment information when creating the skore project; it would allow me to spin the environment back up if I want to get back to where I was. In this setting, I also don't want to be limited in what I can or cannot dump. However, I get that skore-UI can only process or filter a subset of the information that I potentially dumped.

Now, "tracking", which is where I think the current environment/platform independence is useful, corresponds to an MLOps setting where I cannot easily choose how to set up my environment, hardware, etc. In this case, it makes a lot of sense to me to (i) restrain the complexity related to the environment and (ii) trim the insights of the model down to inference only.
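The "dump environment information when creating the skore project" idea could amount to recording the Python version and installed packages at project-creation time. A hedged sketch using only the standard library (the function name and the returned dict layout are made up here, not a skore API):

```python
import sys
from importlib import metadata


def snapshot_environment() -> dict:
    """Record the interpreter version and installed package versions."""
    packages = {
        dist.metadata["Name"]: dist.version
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip distributions with broken metadata
    }
    return {"python": sys.version.split()[0], "packages": packages}


env = snapshot_environment()
assert "python" in env and isinstance(env["packages"], dict)
```

Such a snapshot could be written alongside the project so that a pickle-based store stays recoverable: if unpickling fails later, the user at least knows which environment to rebuild.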
