Decide how to bundle `.pickle` files from COMSOL solution results data #4026

agriyakhetarpal · 2024-04-20T18:26:06Z

.pickle files are binary files and not inherently secure. They therefore should be best included elsewhere in PyBaMM's packaging infrastructure when publishing to common distribution channels like PyPI owing to security reasons in the case where these binaries are modified with malicious intent and make it to the develop branch or a PyPI release where PyBaMM can be downloaded from (binary or non-binary). This means they should neither be included in the source distribution (controlled by MANIFEST.in), nor in the wheels (in the case where the wheel is built from the sdist). Binary platform-specific wheels for PyBaMM cannot be built from sdists right now and have to be built in-tree (#3651, #3564), and require restructuring the build system configuration at least, if not rewriting what files are copied into the package structure – since pybamm/input/comsol_results/ is a namespace package (via setuptools.command.install_data).

We have the following options:

Store in JSON, CSV, or TXT file formats (non-binary) instead of .pickle
Use pooch to reproducibly download these files when needed and adjust PyBaMM's public API plus example notebooks for this, while moving these files to a separate repository in the pybamm-team organisation, and marking the repository as an archive to prevent tampering.
Use non-binary storage solutions from xarray or zarr because they can handle multi-dimensional data

Out of these, point 2 requires an internet connection and those wishing to load COMSOL results from these files would need access to the internet to download them, and therefore should be the last option to consider.

Task list, as of 10/05/2024

There are two major things to be done:

Convert pickle files to non-binary files (done in Using JSON to import comsol results #4064)
Use https://github.com/fatiando/pooch as a required dependency to fetch these files plus adjust for the changes here
- Set up a new repository: https://github.com/pybamm-team/pybamm-data
  - Add a README and some documentation
  - Copy the files and delete them from the PyBaMM main (this) repository (and additionally store checksums in a text file somewhere), and publish a GitHub Release there to store them perhaps (shorter URL)
- Refer to the Scientific Python guide: https://learn.scientific-python.org/development/patterns/data-files/#downloading-larger-files-on-demand
- Use pytest's conditional markers or regular markers to skip tests where access to the internet is not available (adjust noxfile.py, test-related files, etc.)
- [z] Ensure files are not copied into the sdist (control MANIFEST.in and setuptools-specific package data)
- Unsure about this point – maybe provide a public API to fetch these files (with the URLs kept hardcoded)?

The text was updated successfully, but these errors were encountered:

agriyakhetarpal · 2024-04-21T08:00:31Z

NumPy arrays in the values of the dict can't be serialised to JSON directly. It might be worth converting these arrays to lists to be able to save them or just use pandas, which is already an optional dependency for PyBaMM, to convert to JSON and read it afterwards where needed

kratman · 2024-04-26T14:41:08Z

@agriyakhetarpal Why is this a release blocker?

agriyakhetarpal · 2024-04-26T14:54:38Z

Ah, I added this in error – most likely it was because I was selecting multiple issues, my apologies. It's something to look into soon enough, but by no means a release blocker. I removed the label.

santacodes · 2024-04-26T17:12:17Z

I think the first option would be better, JSON and CSV files would provide a better structure than txt files in my opinion. I could try this out if it's not being carried out by someone else.

agriyakhetarpal · 2024-04-26T17:23:11Z

Thanks, assigned you. You'll need to take care of the fact that if you convert NumPy arrays to some other format without a compatibility-provider library and convert them back again, you could potentially lose data. You'll also need to take care of things like floating-point precision (most of the elements in that have decimal values).

agriyakhetarpal · 2024-05-13T12:35:06Z

This is now a legible release blocker, added the label. The reason is that the JSON files are cumulatively summing up to 26.47 MB (https://github.com/pybamm-team/pybamm-data/releases/tag/v1.0.0), which is going to make the sdist and wheels bloated with very large files that are cumulatively greater in size than that of the Python extension module we ship inside the wheel. We will need to move to pooch + add relevant classes/functions to download them quite soon, and we will need to do this before we do the v24.5rc0 pre-release (cc: @santacodes).

The size of the sdist should not be larger than a few MBs at most, and while there is no guideline as such for sdist sizes, it is generally acceptable to keep just minimal files in them so that they are smaller than the wheels, and also to provide a reasonably sized download for users (both wheels and sdist). For context, our Windows wheels are currently ~8 MB, macOS ones are ~12 MB, and those for Linux are ~22 MB in size. This would not have been a blocker if the wheels did not include the files but the sdist did – both of them will, at this time.

agriyakhetarpal added difficulty: medium Will take a few days priority: medium To be resolved if time allows labels Apr 20, 2024

agriyakhetarpal added the release blocker Issues that need to be addressed before the creation of a release label Apr 25, 2024

agriyakhetarpal removed the release blocker Issues that need to be addressed before the creation of a release label Apr 26, 2024

agriyakhetarpal assigned santacodes Apr 26, 2024

santacodes mentioned this issue May 6, 2024

Using JSON to import comsol results #4064

Merged

8 tasks

agriyakhetarpal added the release blocker Issues that need to be addressed before the creation of a release label May 13, 2024

santacodes mentioned this issue May 16, 2024

Added pooch registries for data files #4098

Merged

8 tasks

kratman closed this as completed in #4098 May 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Decide how to bundle `.pickle` files from COMSOL solution results data #4026

Decide how to bundle `.pickle` files from COMSOL solution results data #4026

agriyakhetarpal commented Apr 20, 2024 •

edited

Loading

agriyakhetarpal commented Apr 21, 2024

kratman commented Apr 26, 2024

agriyakhetarpal commented Apr 26, 2024

santacodes commented Apr 26, 2024

agriyakhetarpal commented Apr 26, 2024

agriyakhetarpal commented May 13, 2024

Decide how to bundle .pickle files from COMSOL solution results data #4026

Decide how to bundle .pickle files from COMSOL solution results data #4026

Comments

agriyakhetarpal commented Apr 20, 2024 • edited Loading

Task list, as of 10/05/2024

agriyakhetarpal commented Apr 21, 2024

kratman commented Apr 26, 2024

agriyakhetarpal commented Apr 26, 2024

santacodes commented Apr 26, 2024

agriyakhetarpal commented Apr 26, 2024

agriyakhetarpal commented May 13, 2024

Decide how to bundle `.pickle` files from COMSOL solution results data #4026

Decide how to bundle `.pickle` files from COMSOL solution results data #4026

agriyakhetarpal commented Apr 20, 2024 •

edited

Loading