Publishing rework #451

Merged
merged 821 commits into from
Mar 20, 2024

blue442
Collaborator

@blue442 blue442 commented Mar 18, 2024

Addresses #404: adapts the existing infrastructure to publish a FoundryDataset object, replacing the previous approach of passing arguments to the Foundry.publish() function. This allows better validation of the required components, in particular the objects containing the datacite and metadata information.

To facilitate validation of the datacite and metadata information, the dc_model.py and project_model.py files were created with the datamodel code generator package, which took the existing jsonschema definitions for these objects from the MDF schema repository and produced pydantic representations to use in validation. To keep future schema changes reproducible, the generated .py files were left fully intact. They serve as superclasses for the FoundrySchema and FoundryDatacite model definitions in models.py, which add the logic the pydantic classes need to work well with the Foundry package.
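The layering described above (a generated model left untouched, plus a hand-written subclass that adds package-specific logic, plus a publish step that accepts only the validated object) can be sketched roughly as follows. This is a minimal stand-in that uses the standard library's dataclasses instead of pydantic so that it runs without extra dependencies. Every name in it (GeneratedDatacite, DataciteEntry, publish_entry, and the field names) is hypothetical and is not taken from dc_model.py, models.py, or foundry.py.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneratedDatacite:
    """Stand-in for a code-generated model: regenerated from the schema, never hand-edited."""
    titles: List[str]
    creators: List[str]
    publisher: str = ""


@dataclass
class DataciteEntry(GeneratedDatacite):
    """Hand-written subclass that layers validation on top of the generated fields."""

    def __post_init__(self) -> None:
        # Collect every empty required field so the error names all of them at once.
        missing = [name for name in ("titles", "creators") if not getattr(self, name)]
        if missing:
            raise ValueError(f"missing required datacite fields: {', '.join(missing)}")


def publish_entry(entry: DataciteEntry) -> dict:
    """Hypothetical publish step: it receives an object that has already been validated."""
    return {
        "titles": entry.titles,
        "creators": entry.creators,
        "publisher": entry.publisher,
    }


# Validation happens at construction time, before anything is published.
entry = DataciteEntry(titles=["Example dataset"], creators=["A. Author"])
record = publish_entry(entry)
```

Because the subclass adds no fields of its own, the generated file can be replaced wholesale when the schema changes, and only the hand-written layer needs review.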

The dataset_publishing.ipynb was updated to reflect the new approach, but will require additional work once #403 is undertaken.

ascourtas and others added 30 commits December 16, 2021 16:21
…-cov-conflict

downgrade pytest-cov to ~=2.12.1 to fix conflict with flake8
Update setup and requirements to remove ML packages
…tests

remove references to dlhub in test files
summer student work and additional progress, including:
- code coverage
- bubble vis of repo
- bug fixes
- project README changes and improvements
- OOP refactoring
- test updates
- updated example notebooks
… remove dataframe building section at end, clean up installs
add joblib to setup.py and requirements.txt, increment to foundry ver…
for oqmd notebook, remove reference to f.describe(), add train split,…
blaiszik and others added 22 commits February 15, 2023 15:36
_read_json was printing a debug data frame. Removed.
* add initial directory-making functionality

* add acl permission setting

* add PUT request logic and ACL setting, plus TODOs

* add logic to delete acl rule after creation

* add try/except handling to acl creation

* add prepare query param so we don't need to make dirs; fix bug when rule_id is not set

* clean up path joining logic, as well as comments

* add capability to upload all files in a folder, instead of one individual file

* update endpoint destination to use a UUID as the folder name

* break out acl rule adding to its own function, tidy up

* break out PUT request functionality

* break out upload_folder() into upload_file() and integrate https functions into publish(), with proper params

* change endpoint to NCSA, make usage more modular; small os.path bug fixes

* reorder functions to be easier to read

* add upload capability for single file, with error handling

* fix logic bugs with destination path setting s.t. all subfolders are written to destination

* cleanup var names in upload_folder() logic; making endpoint_dest path more robust

* code cleanup and breakout helper functions to reduce size of publish()

* add parameter checks to publish() and reduce param complexity

* add docstrings, plus add test param to publish()

* appease flake8

* add one more flake8 fix

* fix auths in tests, add system test for HTTPS publication, small comments

* add system test for HTTPS upload

* break out https publishing into more unit-testable method

* refactor function defs to work better for testing; add https upload unit test

* fix bug where artifact was written to uploaded dataset

* update os.walk block comparison to be more robust

* update publish() docstring and add type hints

* clean up imports, fix type hint for Response, add some context for Xtract file

* WIP to separate helpers into submodule -- need to fix test and method design

* fix typing discrepancy for requests.Response

* update modification date

* Temporarily remove ACL rule creation for https upload

* Fix flake8 comment error

* Fix flake8 once more

* Fixing local tests, flake8, kwargs

* Adding test data

* Debug result on GHA

* Debug result on GHA

* Debug result on GHA

* Debug result on GHA

* add Ben's patch to submodule

* generalize the included functions and
move make_globus_link here from foundry object

* move make_globus_link function to submodule

* update tests to generalized input format

* properly pass 'auths' object between functions

* update modification date

* prepend underscore to private function

* correct call to upload_to_endpoint() in foundry.py

* re-add ACL rule logic

* update auth passing to be more user-friendly; includes test changes

* Introduce a collection to hold authorizers

It uses a dataclass so that we can annotate the type of authorizers
that the tuple, then document them

I put it in a new module, `foundry.auth` so that it can be used
by both the foundry module and the https_upload module (avoiding
circular dependencies)

* alter args such that it's not possible for the user to have endpoint_id and gcs_auth_client misalign

* change language to endpoint_auth_clients for clarity of purpose

* docstring updates

---------

Co-authored-by: Ben Blaiszik <[email protected]>
Co-authored-by: isaac-darling <[email protected]>
Co-authored-by: Logan Ward <[email protected]>
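The authorizer collection mentioned in the commit notes above (a dataclass living in its own module so that both the main module and the upload helpers can import it) might look something like the sketch below. The class name, the function name, and the first two field names are hypothetical. Only the endpoint_auth_clients wording comes from the commit messages, and keying it by endpoint ID is an assumption about how the endpoint_id and client misalignment was avoided.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class AuthBundle:
    """Hypothetical typed container for authorizers; field names are illustrative."""
    transfer_client: Any
    auth_client: Any
    # Assumption: one client per endpoint ID, so the two cannot drift apart.
    endpoint_auth_clients: Dict[str, Any] = field(default_factory=dict)


def client_for_endpoint(auths: AuthBundle, endpoint_id: str) -> Any:
    """Return the client registered for one endpoint, failing loudly if there is none."""
    try:
        return auths.endpoint_auth_clients[endpoint_id]
    except KeyError:
        raise ValueError(f"no auth client registered for endpoint {endpoint_id!r}") from None


# Plain objects stand in for real authorizer instances here.
auths = AuthBundle(
    transfer_client=object(),
    auth_client=object(),
    endpoint_auth_clients={"endpoint-1": "client-1"},
)
```

Passing one annotated object between functions, rather than a loose tuple or dict, lets each field be documented and type-checked in a single place.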
* add initial directory-making functionality

* add acl permission setting

* add PUT request logic and ACL setting, plus TODOs

* add logic to delete acl rule after creation

* add try/except handling to acl creation

* add prepare query param so we don't need to make dirs; fix bug when rule_id is not set

* clean up path joining logic, as well as comments

* add capability to upload all files in a folder, instead of one individual file

* update endpoint destination to use a UUID as the folder name

* break out acl rule adding to its own function, tidy up

* break out PUT request functionality

* break out upload_folder() into upload_file() and integrate https functions into publish(), with proper params

* change endpoint to NCSA, make usage more modular; small os.path bug fixes

* reorder functions to be easier to read

* add upload capability for single file, with error handling

* fix logic bugs with destination path setting s.t. all subfolders are written to destination

* cleanup var names in upload_folder() logic; making endpoint_dest path more robust

* code cleanup and breakout helper functions to reduce size of publish()

* add parameter checks to publish() and reduce param complexity

* add docstrings, plus add test param to publish()

* appease flake8

* add one more flake8 fix

* fix auths in tests, add system test for HTTPS publication, small comments

* add system test for HTTPS upload

* break out https publishing into more unit-testable method

* refactor function defs to work better for testing; add https upload unit test

* fix bug where artifact was written to uploaded dataset

* update os.walk block comparison to be more robust

* update publish() docstring and add type hints

* clean up imports, fix type hint for Response, add some context for Xtract file

* WIP to separate helpers into submodule -- need to fix test and method design

* fix typing discrepancy for requests.Response

* update modification date

* Temporarily remove ACL rule creation for https upload

* Fix flake8 comment error

* Fix flake8 once more

* Fixing local tests, flake8, kwargs

* Adding test data

* Debug result on GHA

* Debug result on GHA

* Debug result on GHA

* Debug result on GHA

* add Ben's patch to submodule

* generalize the included functions and
move make_globus_link here from foundry object

* move make_globus_link function to submodule

* update tests to generalized input format

* properly pass 'auths' object between functions

* update modification date

* prepend underscore to private function

* correct call to upload_to_endpoint() in foundry.py

* re-add ACL rule logic

* update auth passing to be more user-friendly; includes test changes

* Introduce a collection to hold authorizers

It uses a dataclass so that we can annotate the type of authorizers
that the tuple, then document them

I put it in a new module, `foundry.auth` so that it can be used
by both the foundry module and the https_upload module (avoiding
circular dependencies)

* alter args such that it's not possible for the user to have endpoint_id and gcs_auth_client misalign

* change language to endpoint_auth_clients for clarity of purpose

* docstring updates

* fix bug from last round of review edits

---------

Co-authored-by: Ben Blaiszik <[email protected]>
Co-authored-by: isaac-darling <[email protected]>
Co-authored-by: Logan Ward <[email protected]>
* update publishing notebook example to use HTTPS upload primarily, along with minor fixes

* add https upload methods and data

* fix function call to publish_dataset

* remove ACL rule code to fix error issue

* update globus images in notebook

* remove commented code

* add missing scopes

* appease flake overlords

* Add search lambda authorizer (sl_authorizer) to dlhub_client
instantiation

* removed unnecessary scopes

* update curation info in notebook

---------

Co-authored-by: Ben Blaiszik <[email protected]>
Co-authored-by: isaac-darling <[email protected]>
Co-authored-by: Logan Ward <[email protected]>
Co-authored-by: Eric Blau <[email protected]>
* adding ability to specify splits for loading

* refining test

* Update splits_to_load --> splits

---------

Co-authored-by: blaiszik <[email protected]>
* update to 0.6.0 for HTTPS pub

* Upload Foundry class load() function to default download using https (#340)

* Update setup.py

Fix version number for pyPI deploy

* Update setup.py version for pyPI

* Update requirements.txt to latest DLHub SDK

This is needed to require upgrade of DLHub SDK for Foundry users when they upgrade Foundry.

* Update version to 0.6.3

* incorporate load() to foundry.__init__()

* automating api documentation using github action (#342)

* CI: Automated documentation build

* Removing remnants of XTract

* CI: Automated documentation build

* CI: Automated documentation build

* Update README.md with contributing instructions (#357)

* Update README.md with contributing instructions

* Update PR language

* merging in split specification

* flake fixes

* add jingrui examples (#363)

* removed blank line

* Load on init (#358)

* incorporate load() to foundry.__init__()

* merging in split specification

* flake fixes

* removed blank line

* Adds note for quickstart globus set to false

* Validating metadata before publishing

* remove arguments from Foundry object that are duplicated with base class

* Update setup.py version to 0.7.0

* refactor foundry to separate foundry instance from dataset objects

* fine tuning search functionality

* removing redefinition of FoundryDataset

* address comments in PR review

* remove unused import

* upgrade setup-python from v2 to v4

* upgrade other setup-python from v2 to v4

* modify limit test

---------

Co-authored-by: ascourtas <[email protected]>
Co-authored-by: Ben Blaiszik <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Marshall McDonnell <[email protected]>

what-the-diff bot commented Mar 18, 2024

PR Summary

  • Method Renaming in foundry.foundry.md

    • The list method has been renamed to get_metadata_by_doi to reflect its purpose better.
  • Updates to dataset_publishing.ipynb

    • The notebook's content has been expanded with explanations and examples on constructing FoundryDataset objects.
    • New information related to DataCite metadata and its application has been added.
    • The script for publishing datasets to Foundry has been revised.
    • A section which added datacite information via kwargs to the foundry.publish() method is now deprecated.
    • The notebook version has been updated from 3.9.4 to 3.10.0.
  • New File: dc_model.py

    • A new data model file dc_model.py has been created. It defines a data model generated by the datamodel-codegen tool.
  • Changes to foundry.py

    • Two parameters parallel_https and local_cache_dir were added to the initialisation method.
    • The search method has been updated to handle a variety of options for searching data.
    • The dataset_from_metadata and search_results_to_dataframe methods have been enhanced for improved functionality.
    • The publish_dataset method has been simplified and now includes metadata and datacite validation.
  • Changes to foundry_dataset.py

    • An add_data method has been incorporated.
    • The FoundryDataset class now supports datacite_entry and foundry_cache parameters.
    • Some methods like get_as_dict, get_citation etc., have been streamlined by removing unused arguments.
  • Updates to models.py

    • Unused imports and classes have been removed for cleaner code.
    • FoundrySchema and FoundryDatacite classes have been updated for better data handling and validation.
  • Addition of foundry/project_model.py

    • New model classes have been added for project specification.
  • Enhancements to Unit Tests

    • Tests in tests/test_foundry.py and tests/test_foundry_components.py have been revamped to align with new coding standards.
    • Updates to tests/test_foundry_cache.py improve testing accuracy with new source ID data and use of the FoundrySchema class.

@codecov-commenter

Codecov Report

Attention: Patch coverage is 93.63636%, with 28 lines in your changes missing coverage. Please review.

❗ No coverage uploaded for pull request base (dev@8b2b61a). Click here to learn what that means.

❗ Current head 4f15455 differs from pull request most recent head 69722ba. Consider uploading reports for the commit 69722ba to get more accurate results

Files                        Patch %   Lines
foundry/foundry.py           60.00%    8 Missing ⚠️
foundry/foundry_cache.py     63.63%    8 Missing ⚠️
foundry/models.py            80.00%    8 Missing ⚠️
foundry/foundry_dataset.py   80.00%    4 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@          Coverage Diff           @@
##             dev     #451   +/-   ##
======================================
  Coverage       ?   74.01%           
======================================
  Files          ?       12           
  Lines          ?      943           
  Branches       ?        0           
======================================
  Hits           ?      698           
  Misses         ?      245           
  Partials       ?        0           
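The figures in the report are mutually consistent, which can be checked with a little arithmetic. The sketch below uses only numbers printed in the report. The formula total = missing / (1 - covered fraction) is ordinary algebra on the definition of a coverage percentage, not a description of how Codecov computes its values internally.

```python
# Figures copied from the Codecov report.
hits, misses = 698, 245
lines = hits + misses                 # total tracked lines in the project

project_pct = 100 * hits / lines      # about 74.019, shown as 74.01% in the report

# Patch coverage is covered / total. With 28 lines uncovered at 93.63636%,
# the total number of changed lines follows from total = missing / (1 - fraction).
patch_missing = 28
patch_total = round(patch_missing / (1 - 0.9363636))
patch_covered = patch_total - patch_missing

# Per-file rows: (patch %, missing lines) -> changed lines in that file.
per_file = {
    "foundry/foundry.py": (60.00, 8),
    "foundry/foundry_cache.py": (63.63, 8),
    "foundry/models.py": (80.00, 8),
    "foundry/foundry_dataset.py": (80.00, 4),
}
changed = {
    name: round(missing / (1 - pct / 100))
    for name, (pct, missing) in per_file.items()
}
listed_missing = sum(missing for _, missing in per_file.values())

print(f"project: {hits}/{lines} lines covered")
print(f"patch:   {patch_covered}/{patch_total} lines covered")
print(f"listed files account for {listed_missing} of {patch_missing} missing lines")
```

Under these assumptions the patch touches 440 lines, of which 412 are covered, and the four listed files account for all 28 uncovered lines. The displayed 74.01% and 63.63% are consistent with the percentages being truncated to two decimals rather than rounded.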


@kjschmidt913 kjschmidt913 merged commit d2566a6 into dev Mar 20, 2024
5 checks passed