Use real AWS S3 data tests and apply related fixes #212

Merged
merged 87 commits into from
Jul 3, 2024

@d33bs d33bs commented Jun 25, 2024

Description

The changes in this PR address #198 by removing moto and related tests to avoid existing and future challenges with mocked S3 resources (in addition to the failing tests, a short justification can be found here). Moving forward, CytoTable will be tested on a CSV and a SQLite resource from the cellpainting-gallery (as outlined in the fixtures). As a result, I believe we are now addressing #193 within reason because of the SQLite addition (please don't hesitate to let me know if you feel otherwise and we should keep that issue open).

While developing this fix, I added a new preset which enables compatibility with JUMP data (cpg0016-jump) from the cellpainting-gallery in order to effectively perform a test from an S3 SQLite object (no other presets appeared to exactly match this need). CC @jenna-tomkinson

Some notes:

  • I noticed functions within the sources module experiencing challenges from Parsl app decoration due to how the cloudpathlib client is constructed and how logging operates with HTEX executor tasks (which sometimes makes errors less visible). Thinking this through, I recognized that there didn't seem to be a major benefit to decorating these functions this way. Because of this, I changed them to be un-decorated, which simplified troubleshooting and feels like a good option to reduce complexity in the future.
  • I moved the logic in convert that checks whether a file has at least one row into sources as _file_is_more_than_one_line. This helps keep the use of cloudpathlib.CloudPath consistent and filters data sources earlier in the flow, before we begin processing them.
  • I noticed cytominer-database was consistently experiencing challenges within Python 3.12 environments (tests couldn't find the ingest command line option). As a result, I've moved to static files produced from cytominer-database which increase the size of this repository but should help us save time during tests and avoid the error I was seeing during development.
  • After removing the cytominer-database processing step, I noticed that some CSVs were missing from the cytominer-database test data; these have been added in this PR.
  • A chunk size of 30000 was used after trying many different iterations with the S3 SQLite file test. GitHub Actions runner containers appear to experience resource challenges at sizes above this. GitHub Actions doesn't appear to provide exact details on resource use overages upfront, so I expect the chunk size may need tinkering over time.
  • Sorting results for any chunk size with the S3 SQLite file appeared to trigger GitHub Actions runner container challenges, so I've turned sorting off for that particular test.
  • I've split the tests marked large_data_tests into a separate job. Running them together with the other tests and the test matrix appeared to strain the resources available in GitHub Actions runner containers. While the time needed to run the single test is low for the time being, we may need to move it into a separate scheduled test (or something similar) if/when the marked tests grow. The resource trouble may be related to Possible Parsl manager persistence after CytoTable processing #211, Python garbage collection, or GitHub Actions runner container storage availability (more investigation is needed).
  • A minor fix for the issue described in IO Error when installing httpfs extension from Python duckdb/duckdb#3243 was applied to the duckdb connection setup (the error seems to sporadically appear, disappear, and reappear within GitHub Actions runner environments).
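
As a rough illustration of the row check mentioned in the notes above, here is a minimal sketch of filtering out header-only text sources before processing. The function name `file_has_data_rows` and the sample file names are hypothetical and are not taken from CytoTable; the sketch only assumes that the path object exposes a pathlib-style `open()`, which `pathlib.Path` and `cloudpathlib.CloudPath` both do. The actual `_file_is_more_than_one_line` implementation may differ.

```python
import tempfile
from pathlib import Path


def file_has_data_rows(path) -> bool:
    """Return True when a text file holds more than one line.

    Hypothetical stand-in for the check described above. `path` may be
    any object with a pathlib-style open(), such as pathlib.Path or
    cloudpathlib.CloudPath. Intended for text sources such as CSV.
    """
    with path.open("r") as handle:
        # Stop as soon as a second line is seen so that large files
        # are never read in full.
        for line_number, _ in enumerate(handle, start=1):
            if line_number > 1:
                return True
    return False


# Usage: keep only the sources that contain at least one data row.
with tempfile.TemporaryDirectory() as tmp:
    header_only = Path(tmp) / "header_only.csv"
    header_only.write_text("ImageNumber,ObjectNumber\n")

    with_rows = Path(tmp) / "with_rows.csv"
    with_rows.write_text("ImageNumber,ObjectNumber\n1,1\n")

    kept_names = [
        candidate.name
        for candidate in (header_only, with_rows)
        if file_has_data_rows(candidate)
    ]
```

Running the check while sources are gathered, rather than during conversion, means empty files never reach the later processing steps.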

Closes #198
Closes #193

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

@d33bs d33bs requested review from gwaybio and kenibrewer June 25, 2024 22:25
@d33bs d33bs changed the title from "Use real AWS S3 data tests and related fixes" to "Use real AWS S3 data tests and apply related fixes" Jun 25, 2024

@gwaybio gwaybio left a comment

LGTM! I went in thinking the review would be brutal, but I think it is relatively bite-sized. A few comments

.github/workflows/test.yml (resolved)
tests/conftest.py (outdated, resolved)
tests/test_convert_threaded.py (outdated, resolved)

d33bs commented Jul 3, 2024

Thank you @gwaybio ! After making some changes I feel good about how this now looks. I plan to create a new issue to explore memory resource reductions for sorted joins.

@d33bs d33bs merged commit 8e343d0 into cytomining:main Jul 3, 2024
12 checks passed
@d33bs d33bs deleted the move-to-real-s3-tests branch July 3, 2024 14:10