Cellfinder workflow #15

Closed · @sfmig wants to merge 20 commits into main from smg/cellfinder-workflow
Conversation

@sfmig (Contributor) commented Sep 18, 2023

This PR is a first step towards defining the cellfinder-core workflows (#10).

Workflow variants

From the README, the cellfinder workflow can be defined in two ways:

  • using the main() function from cellfinder_core.main (see snippet here)
  • spelling out the detection and classification steps (see example here)

Within each approach, we can also choose to read in the data using Dask, or using tifffile.imread. That would make 4 different workflow variants.

In this PR, I focus on the first variant (using the main() function and reading the data with Dask) - I plan to implement the other variants in future PRs (but happy to discuss this).
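For illustration, the first variant could look roughly like this (a sketch assuming the cellfinder_core import paths in use at the time; the data paths are placeholders):

from cellfinder_core.main import main as cellfinder_run
from cellfinder_core.tools.IO import read_with_dask

# read the data with Dask (tifffile.imread would be the other option)
signal = read_with_dask("/path/to/signal")          # placeholder path
background = read_with_dask("/path/to/background")  # placeholder path

# detection + classification in a single call
detected_cells = cellfinder_run(signal, background, voxel_sizes=[5, 2, 2])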

Workflow class

I defined a Workflow class with:

  • attributes holding preprocessing parameters (such as voxel_size), and
  • setup_ methods that wrap the steps required to prepare the data for processing, which we don't plan to benchmark (such as defining processing parameters or downloading the test data).

I did this mostly thinking about how we could reuse these workflows for asv, but the ideal structure for reusing them in benchmarking needs more exploration (likely in its own PR). I was assuming we will want to time the whole workflow (i.e., workflow_from_cellfinder_run) and also each of its main steps separately (here: read_with_dask, cellfinder_run, save_cells).
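To make this concrete, a rough sketch of the intended structure (method bodies, paths, and the save_cells import are illustrative assumptions, not the exact PR code):

from cellfinder_core.main import main as cellfinder_run
from cellfinder_core.tools.IO import read_with_dask
from imlib.IO.cells import save_cells  # assumed location at the time

class Workflow:
    def setup_parameters(self):
        # processing parameters - defined here, not benchmarked
        self.voxel_sizes = [5, 2, 2]  # microns

    def setup_input_data(self):
        # download the test data if needed, set input/output paths
        self.signal_dir = "/path/to/signal"          # placeholder
        self.background_dir = "/path/to/background"  # placeholder
        self.output_path = "/path/to/detected_cells.xml"

def workflow_from_cellfinder_run(wf: Workflow):
    # the three main steps we may want to time separately
    signal = read_with_dask(wf.signal_dir)
    background = read_with_dask(wf.background_dir)
    detected_cells = cellfinder_run(signal, background, wf.voxel_sizes)
    save_cells(detected_cells, wf.output_path)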

Note: initially I was also thinking about including some other methods in this class (not "setup_ methods") that would wrap several API functions. For example, I had an extra method read_signal_and_background_with_dask for reading the signal and background data combined, thinking that we may want to time them together. But I am not sure this split is worth it... so I kept it simple and used the read_with_dask API function instead.

Questions and feedback

  • Should I use a dataclass decorator around the Workflow class, and just have setup_parameters and setup_input_data inside __init__ (see the sketch after this list)? It would definitely look cleaner, but for now I decided to leave it more explicit, so that we have more control over what is initialised and when. This may be useful when trying to reuse these functions in asv.
  • I assume we are not interested in timing the download from GIN, right?
  • Additional thoughts about reusing this for benchmarking:
    • Some functions are not part of this basic workflow (e.g. loading the output XML file containing the results of the analysis). Should we make them part of a simpler workflow (like a data-loading workflow) or should we benchmark them individually (maybe following the structure of the Python modules, as in the current asv work)?
    • In the future, we may want to use asv to time these functions for a range of values for certain preprocessing parameters.
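For reference, a minimal sketch of the dataclass variant asked about in the first question (method body and paths are illustrative):

from dataclasses import dataclass, field

@dataclass
class Workflow:
    voxel_sizes: list = field(default_factory=lambda: [5, 2, 2])  # microns

    def __post_init__(self):
        # setup now runs implicitly on construction - cleaner, but we
        # lose control over what is initialised and when
        self.setup_input_data()

    def setup_input_data(self):
        self.signal_dir = "/path/to/signal"  # placeholder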

@sfmig sfmig marked this pull request as ready for review September 20, 2023 16:55
codecov bot commented Sep 21, 2023

Codecov Report

Attention: 98 lines in your changes are missing coverage. Please review.

Comparison is base (8665e74) 0.00% compared to head (ac14d84) 0.00%.
Report is 3 commits behind head on main.

Additional details and impacted files
@@          Coverage Diff          @@
##            main     #15   +/-   ##
=====================================
  Coverage   0.00%   0.00%           
=====================================
  Files          1       2    +1     
  Lines          5      98   +93     
=====================================
- Misses         5      98   +93     
Files Coverage Δ
brainglobe_workflows/__init__.py 0.00% <0.00%> (ø)
brainglobe_workflows/cellfinder/cellfinder_main.py 0.00% <0.00%> (ø)

... and 1 file with indirect coverage changes


@alessandrofelder (Member) left a comment

Thanks @sfmig - especially for the really detailed PR description!
This is a good start. I think the ability to programmatically change the input data is a key requirement that is currently missing, though, so I've requested changes (happy to move that to a separate issue if you prefer - in that case, ask me for a re-review and link to the issue, and I will approve). I've commented on how I suggest doing this (I could be wrong!).

I plan to implement the other variants in future PRs (but happy to discuss this).

Let's discuss this at the next developer meeting (I've put it on the agenda).

Should I use a dataclass decorator around the Workflow class, and just have setup_parameters and setup_input_data inside __init__? It would definitely look cleaner, but for now I decided to leave it more explicit, so that we have more control over what is initialised and when. This may be useful when trying to reuse these functions in asv.

Would it make sense to split it into a configuration dataclass and a setup_workflow function (see my comments around the INPUT_DATA_CACHE_DIR variable)? Maybe I'm too biased by how I approached this for brainreg?
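Roughly what I have in mind (a sketch - CellfinderConfig and setup_workflow are suggested names, not existing code):

from dataclasses import dataclass
from pathlib import Path

@dataclass
class CellfinderConfig:
    # plain data: paths and parameters, no behaviour
    input_data_dir: Path = Path.home() / ".cellfinder_benchmarks" / "cellfinder_test_data"
    voxel_sizes: tuple = (5, 2, 2)  # microns

def setup_workflow(config: CellfinderConfig) -> CellfinderConfig:
    # everything we don't want to benchmark: resolve paths, fetch data, ...
    config.input_data_dir.mkdir(parents=True, exist_ok=True)
    return config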

I assume we are not interested in timing the download from GIN, right?

I agree.

Some functions are not part of this basic workflow (e.g. loading the output XML file containing the results of the analysis). Should we make them part of a simpler workflow (like a data-loading workflow) or should we benchmark them individually (maybe following the structure of the Python modules, as in the current asv work)?
In the future, we may want to use asv to time these functions for a range of values for certain preprocessing parameters.

As you suggest, we may want to benchmark them individually in the future 🤔 maybe open an issue and not benchmark them for now?


# Local cached directories
CELLFINDER_CACHE_DIR = Path.home() / ".cellfinder_benchmarks"
INPUT_DATA_CACHE_DIR = CELLFINDER_CACHE_DIR / "cellfinder_test_data"
@alessandrofelder (Member) commented:
I think we need a way to be able to configure the value for INPUT_DATA_CACHE_DIR, so we can run the same workflow on different data?

I am thinking the best way is to check whether an environment variable exists and, if it doesn't, use this data as the default, fetched via pooch?
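For example, something along these lines (a sketch - the environment variable name, GIN URL, and hash are placeholders):

import os
from pathlib import Path

import pooch

CELLFINDER_CACHE_DIR = Path.home() / ".cellfinder_benchmarks"

if "CELLFINDER_INPUT_DATA_DIR" in os.environ:  # hypothetical variable name
    INPUT_DATA_CACHE_DIR = Path(os.environ["CELLFINDER_INPUT_DATA_DIR"])
else:
    # fall back to the default test data, fetched and cached via pooch
    pooch.retrieve(
        url="https://gin.g-node.org/<org>/<repo>/raw/master/test_data.zip",  # placeholder
        known_hash=None,  # a real hash should be pinned here
        path=CELLFINDER_CACHE_DIR,
        processor=pooch.Unzip(extract_dir="cellfinder_test_data"),
    )
    INPUT_DATA_CACHE_DIR = CELLFINDER_CACHE_DIR / "cellfinder_test_data"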

@alessandrofelder (Member) commented:

For an idea, see

Comment on lines 52 to 72
self.voxel_sizes = [5, 2, 2] # microns
self.start_plane = 0
self.end_plane = -1
self.trained_model = None # if None, it will use a default model
self.model_weights = None
self.model = "resnet50_tv"
self.batch_size = 32
self.n_free_cpus = 2
self.network_voxel_sizes = [5, 1, 1]
self.soma_diameter = 16
self.ball_xy_size = 6
self.ball_z_size = 15
self.ball_overlap_fraction = 0.6
self.log_sigma_size = 0.2
self.n_sds_above_mean_thresh = 10
self.soma_spread_factor = 1.4
self.max_cluster_size = 100000
self.cube_width = 50
self.cube_height = 50
self.cube_depth = 20
self.network_depth = "50"
@alessandrofelder (Member) commented:

Maybe we want to be able to set (some of) these as well, to run the workflow with different parameters?
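One way to expose them (a sketch, not a concrete proposal - keyword overrides falling back to the defaults above):

class Workflow:
    DEFAULTS = {
        "voxel_sizes": [5, 2, 2],  # microns
        "soma_diameter": 16,
        "batch_size": 32,
        # ... and so on for the remaining parameters
    }

    def __init__(self, **kwargs):
        for name, default in self.DEFAULTS.items():
            # keyword arguments override the class-level defaults
            setattr(self, name, kwargs.pop(name, default))
        if kwargs:
            raise TypeError(f"Unknown parameters: {sorted(kwargs)}")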


Parameters
----------
cfg : Workflow
@alessandrofelder (Member) commented:

It's interesting that you call the argument of type Workflow "cfg", no? Should Workflow really be called CellfinderConfig?

@sfmig sfmig force-pushed the smg/cellfinder-workflow branch from bece87e to 8e13570 Compare October 2, 2023 19:20
@sfmig (Contributor, Author) commented Oct 2, 2023

Thanks for the feedback @alessandrofelder! I think it looks much better now 🤩

I followed the structure from your example to:

  • define a config dataclass, and
  • combine the setup steps into one.

For the config definition, I followed your approach: I check if there is an environment variable pointing to a config JSON file; otherwise I use the default dictionary (a dictionary can also be passed as an input).

For retrieving the data, I check if the data exists locally, otherwise I fetch it from GIN.
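Schematically, the config resolution now works like this (a sketch - the environment variable name is an assumption, and CellfinderConfig / DEFAULT_CONFIG_DICT stand in for the names in the module):

import json
import os

def setup_config(input_config_dict=None):
    if input_config_dict is None:
        config_file = os.environ.get("CELLFINDER_CONFIG_PATH")  # assumed name
        if config_file is not None:
            # the environment variable points to a config JSON file
            with open(config_file) as f:
                input_config_dict = json.load(f)
        else:
            # otherwise, fall back to the default dictionary
            input_config_dict = DEFAULT_CONFIG_DICT
    return CellfinderConfig(**input_config_dict)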

I also added some smoke tests for the workflow, which in retrospect was maybe a bit of a rabbit hole... I moved the tests to a separate PR - still a draft, as I am having some odd issues with pytest and checking the logs.

@sfmig sfmig requested a review from alessandrofelder October 2, 2023 19:26
@alessandrofelder (Member) left a comment

Awesome.
I've made a minor suggestion to narrow the scope of the default_dict variable for you to consider.

)

else:
config = CellfinderConfig(**default_config_dict)
@alessandrofelder (Member) commented:

I wonder whether default_dict should only be initialised here, inside the else block, instead of as a global variable - it's good practice to keep the scope of variables small and to define them near where they are actually used. (Also, if we keep it as the module-level constant it currently is, it should be called DEFAULT_DICT according to PEP8?)

assert input_config_path.exists()

# read into dict (assuming config is a JSON dict?)
# TODO: add error handling here?
@alessandrofelder (Member) commented:

This is a good point: we might want to move from configuration dataclasses to pydantic or attrs for implicit validation? Open an issue?
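For instance, a pydantic version could validate on construction (a sketch using the pydantic v2 API; field names are illustrative):

from pathlib import Path

from pydantic import BaseModel, field_validator

class CellfinderConfig(BaseModel):
    input_config_path: Path
    voxel_sizes: tuple = (5, 2, 2)

    @field_validator("input_config_path")
    @classmethod
    def path_must_exist(cls, v: Path) -> Path:
        # replaces the bare assert above with a proper validation error
        if not v.exists():
            raise ValueError(f"{v} does not exist")
        return v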

@sfmig (Contributor, Author) commented Oct 9, 2023

This PR went through quite a bit of discussion, with enough changes that they have moved to a separate PR - the current solution is now at #23.

Closing this one for now!

@sfmig sfmig closed this Oct 9, 2023