Cellfinder workflow #15
Conversation
Codecov Report
Additional details and impacted files

@@           Coverage Diff           @@
##            main     #15    +/-   ##
=====================================
  Coverage   0.00%   0.00%
=====================================
  Files          1       2      +1
  Lines          5      98     +93
=====================================
- Misses         5      98     +93

☔ View full report in Codecov by Sentry.
Thanks @sfmig - especially for the really detailed and good PR description!
This is a good start. I think the ability to programmatically change the input data is a key requirement that is currently missing, so I've requested changes (happy to move that to a separate issue if you prefer! In that case, ask me for a re-review and link to the issue, and I will approve). I've commented on how I suggest doing this (I could be wrong!).

> I plan to implement the other variants in future PRs (but happy to discuss this).

Let's discuss this at the next developer meeting (I've put it on the agenda).
> Should I use a dataclass decorator around the Workflow class, and just have the setup_parmeters and setup_input_data inside __init__? It would definitely look cleaner, but for now I decided to leave it more explicit, so that we have more control on what is initialised and when. This may be useful when trying to reuse these functions in asv.

Would it make sense to split it into a configuration dataclass and a setup_workflow function (see my comments around the INPUT_DATA_CACHE_DIR variable)? Maybe I'm too biased by how I approached this for brainreg?
> I assume we are not interested in timing the download from GIN, right?
I agree.
> Some functions are not part of this basic workflow (e.g. loading the output xml file, containing the result of the analysis). Should we make them part of a simpler workflow (like a data loading workflow) or should we benchmark them individually (maybe following the structure of the Python modules as in the current asv work)?
> In the future, we may want to use asv to time these functions for a range of values for certain preprocessing parameters.
As you suggest, we may want to benchmark them individually in the future 🤔 maybe open an issue and not benchmark them for now?
# Local cached directories
CELLFINDER_CACHE_DIR = Path.home() / ".cellfinder_benchmarks"
INPUT_DATA_CACHE_DIR = CELLFINDER_CACHE_DIR / "cellfinder_test_data"
I think we need a way to configure the value of INPUT_DATA_CACHE_DIR, so we can run the same workflow on different data. I am thinking the best way is to check whether an environment variable exists, and if it doesn't, use this data as the default, fetched via pooch?
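Something along these lines (an untested sketch: the environment variable name is made up, and the GIN URL/hash are placeholders):

```python
import os
from pathlib import Path

import pooch

DEFAULT_INPUT_DATA_CACHE_DIR = (
    Path.home() / ".cellfinder_benchmarks" / "cellfinder_test_data"
)


def get_input_data_dir() -> Path:
    """Return the directory holding the input data.

    If the (placeholder) CELLFINDER_BENCHMARKS_INPUT_DATA environment
    variable is set, use that path; otherwise download the default test
    data from GIN with pooch and use the local cache directory.
    """
    env_path = os.environ.get("CELLFINDER_BENCHMARKS_INPUT_DATA")
    if env_path is not None:
        return Path(env_path)

    # URL and hash are placeholders for the real GIN test-data archive
    pooch.retrieve(
        url="https://gin.g-node.org/<org>/<repo>/raw/master/test_data.zip",
        known_hash=None,
        path=DEFAULT_INPUT_DATA_CACHE_DIR,
        processor=pooch.Unzip(extract_dir="."),
    )
    return DEFAULT_INPUT_DATA_CACHE_DIR
```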
For an idea, see
def setup_workflow():
self.voxel_sizes = [5, 2, 2]  # microns
self.start_plane = 0
self.end_plane = -1
self.trained_model = None  # if None, it will use a default model
self.model_weights = None
self.model = "resnet50_tv"
self.batch_size = 32
self.n_free_cpus = 2
self.network_voxel_sizes = [5, 1, 1]
self.soma_diameter = 16
self.ball_xy_size = 6
self.ball_z_size = 15
self.ball_overlap_fraction = 0.6
self.log_sigma_size = 0.2
self.n_sds_above_mean_thresh = 10
self.soma_spread_factor = 1.4
self.max_cluster_size = 100000
self.cube_width = 50
self.cube_height = 50
self.cube_depth = 20
self.network_depth = "50"
Maybe we want to be able to set (some of) these as well, to run the workflow with different parameters?
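For instance (a rough sketch only, using a subset of the fields from the diff above; nothing here is existing code), these could live in a dataclass with defaults that a benchmark run can override:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CellfinderParams:
    """Sketch: detection/classification parameters with overridable defaults."""

    voxel_sizes: list = field(default_factory=lambda: [5, 2, 2])  # microns
    start_plane: int = 0
    end_plane: int = -1
    trained_model: Optional[str] = None  # if None, use the default model
    batch_size: int = 32
    n_free_cpus: int = 2
    soma_diameter: int = 16
    ball_xy_size: int = 6
    ball_z_size: int = 15
    ball_overlap_fraction: float = 0.6
    n_sds_above_mean_thresh: int = 10


# a benchmark run can then override only what it needs
params = CellfinderParams(batch_size=64, n_free_cpus=4)
```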
Parameters
----------
cfg : Workflow
It's interesting that you call the argument of type Workflow "cfg", no? Should Workflow really be called CellfinderConfig?
…signal and background files to None by default in dataclass (they need to be predefined).
bece87e to 8e13570
thanks for the feedback @alessandrofelder! I think it looks much better now 🤩 I followed the structure from your example.
For the config definition, I followed your approach: I check if there is an environment variable pointing to a config JSON file, and otherwise use the default dictionary (a dictionary can also be passed as an input). For retrieving the data, I check if the data exists locally, and otherwise fetch it from GIN. I also added some smoke tests for the workflow, which in retrospect was maybe a bit of a rabbit hole... I moved the tests to a different PR (still a draft; I am having some odd issues with pytest and checking the logs).
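In simplified form, the config logic looks roughly like this (a sketch only: the environment variable name is a placeholder and the real dataclass has many more fields):

```python
import json
import os
from dataclasses import dataclass


@dataclass
class CellfinderConfig:
    # minimal placeholder; the dataclass in this PR has many more fields
    input_data_dir: str = ""
    voxel_sizes: tuple = (5, 2, 2)
    batch_size: int = 32


def setup_config(default_config_dict: dict) -> CellfinderConfig:
    """Build the config from the JSON file pointed to by the (placeholder)
    CELLFINDER_CONFIG_PATH environment variable if it is set; otherwise
    fall back to the default dictionary."""
    config_path = os.environ.get("CELLFINDER_CONFIG_PATH")
    if config_path is not None:
        with open(config_path) as f:
            config_dict = json.load(f)
    else:
        config_dict = default_config_dict
    return CellfinderConfig(**config_dict)
```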
Awesome.
I've made a minor suggestion to narrow the scope of the default_dict variable for you to consider.
)
else:
    config = CellfinderConfig(**default_config_dict)
I wonder whether default_dict should only be initialised here, inside the else block, instead of as a global variable, as it's good practice to keep the scope of variables small and define them near where they are actually used? (Also, if we keep it as the module-level constant it is now, it should be called DEFAULT_DICT according to PEP8?)
assert input_config_path.exists()

# read into dict (assuming config is a json dict?)
# TODO: add error handling here?
This is a good point: we might want to move from configuration dataclasses to pydantic or attrs for implicit validation? Open an issue?
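For reference, a sketch of what that could look like with pydantic (v2 API; the fields shown are purely illustrative):

```python
from pathlib import Path

from pydantic import BaseModel, field_validator


class CellfinderConfig(BaseModel):
    """Illustrative pydantic version: field types (and the data path)
    are validated when the config is constructed."""

    input_data_dir: Path
    voxel_sizes: tuple[float, float, float] = (5, 2, 2)
    batch_size: int = 32

    @field_validator("input_data_dir")
    @classmethod
    def input_dir_must_exist(cls, value: Path) -> Path:
        if not value.exists():
            raise ValueError(f"input data directory {value} does not exist")
        return value


# raises a pydantic.ValidationError if a field has the wrong type
# or the input directory is missing
config = CellfinderConfig(input_data_dir=Path.home(), batch_size="32")
```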
…ult config from input config in workflow setup.
this PR went through quite a bit of discussion, with enough changes to make it into a separate PR - the current solution is now at #23. Closing this one for now!
This PR is a first step towards defining the cellfinder-core workflows (#10).

Workflow variants
From the README, the cellfinder workflow can be defined in two ways, one of them being to use the main() function from cellfinder_core.main (see snippet here). Within each approach, we can also choose to read in the data using Dask, or using tifffile.imread. That would make 4 different workflow variants. In this PR, I focus on the first variant (using the main() function and reading the data with Dask; a minimal sketch is included at the end of this description) - I plan to implement the other variants in future PRs (but happy to discuss this).

Workflow class
I defined a Workflow class with: the workflow parameters (such as voxel_size), and setup_ methods that wrap the steps required to process the data but that we don't plan to benchmark (such as defining processing parameters or downloading the test data). I did this mostly thinking about how we could reuse these workflows for asv, but the ideal structure for reusing this in benchmarking needs more exploration (likely in its own PR). I was assuming we will want to time the whole workflow (i.e., workflow_from_cellfinder_run), and also each of the main steps separately (here, read_with_dask, cellfinder_run, save_cells).
Note: initially I was also thinking about including some other methods in this class (not "setup_ methods") that would define wrappers around several API functions. For example, I had an extra method read_signal_and_background_with_dask, for reading the signal and background data combined, thinking that we may want to time them in combination. But I am not sure if this split is worth it... so I just kept it simple and used the read_with_dask API function instead.

Questions and feedback
- Should I use a dataclass decorator around the Workflow class, and just have the setup_parmeters and setup_input_data inside __init__? It would definitely look cleaner, but for now I decided to leave it more explicit, so that we have more control on what is initialised and when. This may be useful when trying to reuse these functions in asv.
- I assume we are not interested in timing the download from GIN, right?
- Some functions are not part of this basic workflow (e.g. loading the output xml file, containing the result of the analysis). Should we make them part of a simpler workflow (like a data loading workflow) or should we benchmark them individually (maybe following the structure of the Python modules as in the current asv work)?
- In the future, we may want to use asv to time these functions for a range of values for certain preprocessing parameters.
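For reference, a minimal sketch of that first variant (the module paths are those of cellfinder-core/imlib around the time of this PR and may have since moved; the data paths are placeholders):

```python
from cellfinder_core.main import main as cellfinder_run
from cellfinder_core.tools.IO import read_with_dask
from imlib.IO.cells import save_cells

signal_dir = "/path/to/signal"          # placeholder
background_dir = "/path/to/background"  # placeholder
voxel_sizes = [5, 2, 2]                 # microns

# lazily read the signal and background image stacks
signal_array = read_with_dask(signal_dir)
background_array = read_with_dask(background_dir)

# run detection + classification in one call
detected_cells = cellfinder_run(signal_array, background_array, voxel_sizes)

# write the detected cells to an XML file
save_cells(detected_cells, "detected_cells.xml")
```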