-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use real AWS S3 data tests and apply related fixes #212
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
d33bs
changed the title
Use real AWS S3 data tests and related fixes
Use real AWS S3 data tests and apply related fixes
Jun 25, 2024
gwaybio
approved these changes
Jun 28, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I went in thinking the review would be brutal, but I think it is relatively bite-sized. A few comments
Co-Authored-By: Gregory Way <[email protected]>
Co-Authored-By: Gregory Way <[email protected]>
Thank you @gwaybio ! After making some changes I feel good about how this now looks. I plan to create a new issue to explore memory resource reductions for sorted joins. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
This changes in this PR address #198 by removing
moto
and related tests to avoid existing and future challenges with mockeds3
resources (in addition to the failing tests, a short justification can be found here). Moving forward, CytoTable will now be tested on a CSV and SQLite resource from thecellpainting-gallery
(as outlined in the fixtures). As a result, I believe we now are addressing #193 within reason because of the SQLite addition (please don't hesitate to let me know if you feel otherwise and we should keep that issue open).In the process of developing towards this fix I added a new preset which enables compatibility with JUMP data (
cpg0016-jump
) from thecellpainting-gallery
in order to effectively perform a test from an S3 SQlite object (no other presets appeared to exactly match this need). CC @jenna-tomkinsonSome notes:
sources
module experiencing challenges from Parsl app decoration due to how thecloudpathlib
client is constructed and how logging operates with HTEX executor tasks (which sometimes makes errors less visible). Thinking through this had me recognize that there didn't seem to be a major benefit from these being decorated this way. Because of this, I changed these functions to be un-decorated, which simplified troubleshooting and feels like a good option to reduce complexity in the future.convert
for checking that a file has at least one row tosources
through_file_is_more_than_one_line
to help keep the use ofcloudpathlib.CloudPath
's consistent and filter data sources earlier in the flow before we begin processing them.cytominer-database
was consistently experiencing challenges within Python 3.12 environments (tests couldn't find theingest
command line option). As a result, I've moved to static files produced fromcytominer-database
which increase the size of this repository but should help us save time during tests and avoid the error I was seeing during development.cytominer-database
processing step, I noticed that there were some missing CSV's from thecytominer-database
test data, which have been added in this PR.30000
was used after trying many different iterations with the S3 SQLite file test. GitHub Actions runner containers appear to experience resource challenges at sizes above this. GitHub Actions doesn't appear to provide exact details on resource use overages upfront, so I expect the chunk size may need tinkering over time.large_data_tests
marked tests to be run as a separate job. Running things together with the other tests and test matrix appeared to have trouble with resources available in GitHub Actions runner containers. While the time involved to run the single test in this low for the time being, we may need to split it into another scheduled test (or something similar) if/when the mark's tests grow. This may be related to challenges with Possible Parsl manager persistence after CytoTable processing #211, Python garbage collection, or GitHub Actions runner container storage availability (more investigation is needed).duckdb
connection setup (this seems to sporadically appear, disappear, and reappear again within GitHub Actions runner environments).Closes #198
Closes #193
What is the nature of your change?
Checklist
Please ensure that all boxes are checked before indicating that a pull request is ready for review.