Cellfinder workflow #15

Closed · @sfmig wants to merge 20 commits into main from smg/cellfinder-workflow
Conversation

@sfmig (Contributor) commented Sep 18, 2023

This PR is a first step towards defining the cellfinder-core workflows (#10).

Workflow variants

From the README, the cellfinder workflow can be defined in two ways:

  • using the main() function from cellfinder_core.main (see snippet here)
  • spelling out the detection and classification steps (see example here)

Within each approach, we can also choose to read in the data using Dask, or using tifffile.imread. That would make 4 different workflow variants.

In this PR, I focus on the first variant (using the main() function and reading the data with Dask) - I plan to implement the other variants in future PRs (but happy to discuss this).
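For illustration, the first variant could look roughly like this (a sketch assuming the cellfinder_core import paths in use at the time; the data paths are placeholders):

from cellfinder_core.main import main as cellfinder_run
from cellfinder_core.tools.IO import read_with_dask

# read the data with Dask (tifffile.imread would be the other option)
signal = read_with_dask("/path/to/signal")          # placeholder path
background = read_with_dask("/path/to/background")  # placeholder path

# detection + classification in a single call
detected_cells = cellfinder_run(signal, background, voxel_sizes=[5, 2, 2])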

Workflow class

I defined a Workflow class with:

  • attributes holding preprocessing parameters (such as voxel_size), and
  • setup_ methods that wrap the steps required to prepare the data for processing, which we don't plan to benchmark (such as defining processing parameters or downloading the test data).

I did this mostly thinking about how we could reuse these workflows for asv, but the ideal structure for reusing them in benchmarking needs more exploration (likely in its own PR). I was assuming we will want to time the whole workflow (i.e., workflow_from_cellfinder_run) and also each of its main steps separately (here: read_with_dask, cellfinder_run, save_cells).
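To make this concrete, a rough sketch of the intended structure (method bodies, paths, and the save_cells import are illustrative assumptions, not the exact PR code):

from cellfinder_core.main import main as cellfinder_run
from cellfinder_core.tools.IO import read_with_dask
from imlib.IO.cells import save_cells  # assumed location at the time

class Workflow:
    def setup_parameters(self):
        # processing parameters - defined here, not benchmarked
        self.voxel_sizes = [5, 2, 2]  # microns

    def setup_input_data(self):
        # download the test data if needed, set input/output paths
        self.signal_dir = "/path/to/signal"          # placeholder
        self.background_dir = "/path/to/background"  # placeholder
        self.output_path = "/path/to/detected_cells.xml"

def workflow_from_cellfinder_run(wf: Workflow):
    # the three main steps we may want to time separately
    signal = read_with_dask(wf.signal_dir)
    background = read_with_dask(wf.background_dir)
    detected_cells = cellfinder_run(signal, background, wf.voxel_sizes)
    save_cells(detected_cells, wf.output_path)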

Note: initially I was also thinking about including some other methods in this class (not "setup_ methods") that would wrap several API functions. For example, I had an extra method read_signal_and_background_with_dask for reading the signal and background data combined, thinking that we may want to time them together. But I am not sure this split is worth it... so I kept it simple and used the read_with_dask API function instead.

Questions and feedback

  • Should I use a dataclass decorator around the Workflow class, and just have setup_parameters and setup_input_data inside __init__ (see the sketch after this list)? It would definitely look cleaner, but for now I decided to leave it more explicit, so that we have more control over what is initialised and when. This may be useful when trying to reuse these functions in asv.
  • I assume we are not interested in timing the download from GIN, right?
  • Additional thoughts about reusing this for benchmarking:
    • Some functions are not part of this basic workflow (e.g. loading the output XML file containing the results of the analysis). Should we make them part of a simpler workflow (like a data-loading workflow) or should we benchmark them individually (maybe following the structure of the Python modules, as in the current asv work)?
    • In the future, we may want to use asv to time these functions for a range of values for certain preprocessing parameters.
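For reference, a minimal sketch of the dataclass variant asked about in the first question (method body and paths are illustrative):

from dataclasses import dataclass, field

@dataclass
class Workflow:
    voxel_sizes: list = field(default_factory=lambda: [5, 2, 2])  # microns

    def __post_init__(self):
        # setup now runs implicitly on construction - cleaner, but we
        # lose control over what is initialised and when
        self.setup_input_data()

    def setup_input_data(self):
        self.signal_dir = "/path/to/signal"  # placeholder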

@sfmig sfmig marked this pull request as ready for review September 20, 2023 16:55
codecov bot commented Sep 21, 2023

Codecov Report

Attention: 98 lines in your changes are missing coverage. Please review.

Comparison is base (8665e74) 0.00% compared to head (ac14d84) 0.00%.
Report is 3 commits behind head on main.

Additional details and impacted files
@@          Coverage Diff          @@
##            main     #15   +/-   ##
=====================================
  Coverage   0.00%   0.00%           
=====================================
  Files          1       2    +1     
  Lines          5      98   +93     
=====================================
- Misses         5      98   +93     
Files Coverage Δ
brainglobe_workflows/__init__.py 0.00% <0.00%> (ø)
brainglobe_workflows/cellfinder/cellfinder_main.py 0.00% <0.00%> (ø)

... and 1 file with indirect coverage changes


@alessandrofelder (Member) left a comment

Thanks @sfmig - especially for the really detailed PR description!
This is a good start. I think the ability to programmatically change the input data is a key requirement that is currently missing, though, so I've requested changes (happy to move that to a separate issue if you prefer - in that case, ask me for a re-review and link to the issue, and I will approve). I've commented on how I suggest doing this (I could be wrong!).

I plan to implement the other variants in future PRs (but happy to discuss this).

Let's discuss this at the next developer meeting (I've put it on the agenda).

Should I use a dataclass decorator around the Workflow class, and just have setup_parameters and setup_input_data inside __init__? It would definitely look cleaner, but for now I decided to leave it more explicit, so that we have more control over what is initialised and when. This may be useful when trying to reuse these functions in asv.

Would it make sense to split it into a configuration dataclass and a setup_workflow function (see my comments around the INPUT_DATA_CACHE_DIR variable)? Maybe I'm too biased by how I approached this for brainreg?
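Roughly what I have in mind (a sketch - CellfinderConfig and setup_workflow are suggested names, not existing code):

from dataclasses import dataclass
from pathlib import Path

@dataclass
class CellfinderConfig:
    # plain data: paths and parameters, no behaviour
    input_data_dir: Path = Path.home() / ".cellfinder_benchmarks" / "cellfinder_test_data"
    voxel_sizes: tuple = (5, 2, 2)  # microns

def setup_workflow(config: CellfinderConfig) -> CellfinderConfig:
    # everything we don't want to benchmark: resolve paths, fetch data, ...
    config.input_data_dir.mkdir(parents=True, exist_ok=True)
    return config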

I assume we are not interested in timing the download from GIN, right?

I agree.

Some functions are not part of this basic workflow (e.g. loading the output XML file containing the results of the analysis). Should we make them part of a simpler workflow (like a data-loading workflow) or should we benchmark them individually (maybe following the structure of the Python modules, as in the current asv work)?
In the future, we may want to use asv to time these functions for a range of values for certain preprocessing parameters.

As you suggest, we may want to benchmark them individually in the future 🤔 maybe open an issue and not benchmark them for now?


# Local cached directories
CELLFINDER_CACHE_DIR = Path.home() / ".cellfinder_benchmarks"
INPUT_DATA_CACHE_DIR = CELLFINDER_CACHE_DIR / "cellfinder_test_data"
@alessandrofelder (Member) commented:
I think we need a way to be able to configure the value for INPUT_DATA_CACHE_DIR, so we can run the same workflow on different data?

I am thinking the best way is to check whether an environment variable exists and, if it doesn't, use this data as the default, fetched via pooch?
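For example, something along these lines (a sketch - the environment variable name, GIN URL, and hash are placeholders):

import os
from pathlib import Path

import pooch

CELLFINDER_CACHE_DIR = Path.home() / ".cellfinder_benchmarks"

if "CELLFINDER_INPUT_DATA_DIR" in os.environ:  # hypothetical variable name
    INPUT_DATA_CACHE_DIR = Path(os.environ["CELLFINDER_INPUT_DATA_DIR"])
else:
    # fall back to the default test data, fetched and cached via pooch
    pooch.retrieve(
        url="https://gin.g-node.org/<org>/<repo>/raw/master/test_data.zip",  # placeholder
        known_hash=None,  # a real hash should be pinned here
        path=CELLFINDER_CACHE_DIR,
        processor=pooch.Unzip(extract_dir="cellfinder_test_data"),
    )
    INPUT_DATA_CACHE_DIR = CELLFINDER_CACHE_DIR / "cellfinder_test_data"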

@alessandrofelder (Member) commented:

For an idea, see

Comment on lines 52 to 72
self.voxel_sizes = [5, 2, 2] # microns
self.start_plane = 0
self.end_plane = -1
self.trained_model = None # if None, it will use a default model
self.model_weights = None
self.model = "resnet50_tv"
self.batch_size = 32
self.n_free_cpus = 2
self.network_voxel_sizes = [5, 1, 1]
self.soma_diameter = 16
self.ball_xy_size = 6
self.ball_z_size = 15
self.ball_overlap_fraction = 0.6
self.log_sigma_size = 0.2
self.n_sds_above_mean_thresh = 10
self.soma_spread_factor = 1.4
self.max_cluster_size = 100000
self.cube_width = 50
self.cube_height = 50
self.cube_depth = 20
self.network_depth = "50"
@alessandrofelder (Member) commented:

Maybe we want to be able to set (some of) these as well, to run the workflow with different parameters?
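One way to expose them (a sketch, not a concrete proposal - keyword overrides falling back to the defaults above):

class Workflow:
    DEFAULTS = {
        "voxel_sizes": [5, 2, 2],  # microns
        "soma_diameter": 16,
        "batch_size": 32,
        # ... and so on for the remaining parameters
    }

    def __init__(self, **kwargs):
        for name, default in self.DEFAULTS.items():
            # keyword arguments override the class-level defaults
            setattr(self, name, kwargs.pop(name, default))
        if kwargs:
            raise TypeError(f"Unknown parameters: {sorted(kwargs)}")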


Parameters
----------
cfg : Workflow
@alessandrofelder (Member) commented:

It's interesting that you call the argument of type Workflow "cfg", no? Should Workflow really be called CellfinderConfig?

@sfmig sfmig force-pushed the smg/cellfinder-workflow branch from bece87e to 8e13570 Compare October 2, 2023 19:20
@sfmig (Contributor, Author) commented Oct 2, 2023

Thanks for the feedback @alessandrofelder! I think it looks much better now 🤩

I followed the structure from your example to:

  • define a config dataclass, and
  • combine the setup steps into one.

For the config definition, I followed your approach: I check if there is an environment variable pointing to a config JSON file; otherwise I use the default dictionary (a dictionary can also be passed as an input).

For retrieving the data, I check if the data exists locally, otherwise I fetch it from GIN.
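Schematically, the config resolution now works like this (a sketch - the environment variable name is an assumption, and CellfinderConfig / DEFAULT_CONFIG_DICT stand in for the names in the module):

import json
import os

def setup_config(input_config_dict=None):
    if input_config_dict is None:
        config_file = os.environ.get("CELLFINDER_CONFIG_PATH")  # assumed name
        if config_file is not None:
            # the environment variable points to a config JSON file
            with open(config_file) as f:
                input_config_dict = json.load(f)
        else:
            # otherwise, fall back to the default dictionary
            input_config_dict = DEFAULT_CONFIG_DICT
    return CellfinderConfig(**input_config_dict)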

I also added some smoke tests for the workflow, which in retrospect was maybe a bit of a rabbit hole... I moved the tests to a separate PR - still a draft, as I am having some odd issues with pytest and checking the logs.

@sfmig sfmig requested a review from alessandrofelder October 2, 2023 19:26
@alessandrofelder (Member) left a comment

Awesome.
I've made a minor suggestion to narrow the scope of the default_dict variable for you to consider.

)

else:
config = CellfinderConfig(**default_config_dict)
@alessandrofelder (Member) commented:

I wonder whether default_dict should only be initialised here, inside the else block, instead of as a global variable - it's good practice to keep the scope of variables small and to define them near where they are actually used. (Also, if we keep it as the module-level constant it currently is, it should be called DEFAULT_DICT according to PEP8?)

assert input_config_path.exists()

# read into dict (assuming config is a JSON dict?)
# TODO: add error handling here?
@alessandrofelder (Member) commented:

This is a good point: we might want to move from configuration dataclasses to pydantic or attrs for implicit validation? Open an issue?
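For instance, a pydantic version could validate on construction (a sketch using the pydantic v2 API; field names are illustrative):

from pathlib import Path

from pydantic import BaseModel, field_validator

class CellfinderConfig(BaseModel):
    input_config_path: Path
    voxel_sizes: tuple = (5, 2, 2)

    @field_validator("input_config_path")
    @classmethod
    def path_must_exist(cls, v: Path) -> Path:
        # replaces the bare assert above with a proper validation error
        if not v.exists():
            raise ValueError(f"{v} does not exist")
        return v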

@sfmig (Contributor, Author) commented Oct 9, 2023

This PR went through quite a bit of discussion, with enough changes that they have moved to a separate PR - the current solution is now at #23.

Closing this one for now!

@sfmig sfmig closed this Oct 9, 2023