# Discussion: Inputs and outputs for workflow modules #29
I think this is very likely, and that most things will work on a per-sample level. I think the merging is actually the exception here. Also, if we were going to have anything that relied on merged objects downstream, it would be mostly separate from processes that take individual samples, so we probably wouldn't have to join this output with per-sample output. That being said, I do think using something with …

Another option that I don't know how I feel about, but could be worth thinking about, is having a single-sample workflow and a grouped workflow. Alternatively, if you update the input/output of everything to the per-sample level, then you could have a flag that runs the workflow on simulated data, including the simulate step; if the flag is not used, the workflow would use real data and skip the simulate step. But this would only work if you move any grouping of projects to within each module. Just jotting down some thoughts, but generally I think we might continue to make updates as we add more modules and see what works best.
Currently, the inputs for OpenScPCA-nf modules are channels that contain the following tuple: `[project_id, project_dir]`.
This makes for a nice, compact, and universal input, and for modules that depend on nothing but "raw" data, it is pretty easy to work with. It is therefore currently each module workflow's responsibility to filter the inputs down to the portion of the data that is needed for its analysis.
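For example, here is a minimal sketch of that filtering step; the workflow name and glob pattern are illustrative, not actual OpenScPCA-nf code:

```nextflow
// Sketch: a module workflow that filters [project_id, project_dir] input
// down to just the files it needs (names and glob are illustrative).
workflow example_module {
    take:
        project_ch // channel of [project_id, project_dir]
    main:
        // keep only the processed SCE files within each project directory
        sce_ch = project_ch.map { project_id, project_dir ->
            [project_id, files("${project_dir}/**/*_processed.rds")]
        }
    emit:
        sce_ch
}
```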
However, that structure may not be workable in the case that one module depends on the output of another module. We have not yet described standards for what a module workflow might `emit` for consumption by other module workflows, but it seems unlikely that the output would be of the form `[project_id, project_dir]`. There are a couple reasons for this:

- The `simulate-sce` workflow largely performs its simulations on a single sample at a time. While it would be possible to rejoin the results into a single directory, this would mean we would need to have a process that downloads the data for all samples within each project and then emits the folder with all of them in it. This would result in a lot of copying without much benefit.

## A motivating example
I already mentioned the `simulate-sce` module workflow, and it was the main reason I started thinking about this problem in the first place (though as I write this, I may have a simpler/better solution, discussed below). The `simulate-sce` module workflow does not currently `emit` anything, but we might want it to, in order to use the results from that workflow in downstream workflows. For example, the `merge-sce` workflow could be used to create simulated merged data.

However, the `merge-sce` workflow currently expects `[project_id, project_dir]` as input, and as discussed, this might be a pain to produce from the simulation workflow, which works on a sample-by-sample basis.

An easier output might be something of the form `[project_id, sample_id, sample_dir]`, but then the consuming workflow (`merge-sce` in this case) might have to perform some form of merging on the input channel if it wants things organized by project. This isn't necessarily terrible to do: we could do that with something like the `groupTuple()` function.
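For instance, a minimal standalone sketch (the project and sample IDs are made up):

```nextflow
// Regroup a per-sample channel by project:
// [project_id, sample_id, sample_dir] -> [project_id, [sample_ids], [sample_dirs]]
workflow {
    sample_ch = Channel.of(
        ['SCPCP000001', 'SCPCS000101', file('sim/SCPCS000101')],
        ['SCPCP000001', 'SCPCS000102', file('sim/SCPCS000102')],
        ['SCPCP000002', 'SCPCS000201', file('sim/SCPCS000201')]
    )
    // groupTuple() uses the first tuple element as the grouping key by default
    sample_ch.groupTuple().view()
}
```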
This assumes that we want to pass along the whole directory, which we probably don't need to do all the time. The `simulate-sce` workflow produces both SCE and AnnData objects: different workflows might want to consume one or the other of these, and sometimes both. The merge workflow really only needs the SCEs, and of those, only the processed files. So we could instead have a workflow emit multiple channels: a set with `[project_id, sample_id, sample_sce]`, where we might separate out processed and earlier-stage files, and separately `[project_id, sample_id, sample_anndata]`. Because AnnData objects are likely to be split by RNA and other features, `sample_anndata` might actually be an array of files at times (or when there is more than one library for a sample), though Nextflow generally handles this pretty reasonably.
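A sketch of what emitting separate SCE and AnnData channels could look like; the channel names and file patterns are illustrative:

```nextflow
// Sketch: emit SCE and AnnData outputs as separate named channels.
workflow simulate_sample {
    take:
        sample_ch // [project_id, sample_id, sample_dir]
    main:
        sce_ch = sample_ch.map { project_id, sample_id, dir ->
            [project_id, sample_id, files("${dir}/*_processed.rds")]
        }
        anndata_ch = sample_ch.map { project_id, sample_id, dir ->
            // may be more than one file per sample (e.g., RNA and other features)
            [project_id, sample_id, files("${dir}/*.h5ad")]
        }
    emit:
        sample_sce_ch = sce_ch
        sample_anndata_ch = anndata_ch
}
```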
## Other outputs

One can also imagine that some workflows might not output sample-level info at all, or might have multiple files per sample. Another likely scenario is that a workflow outputs some results that need to be combined with other data for a downstream analysis. An example of this might be cell typing or doublet detection: for efficiency, we would not want the output of these analyses to be another whole set of annotated SCE files, but instead a table of annotations that we could use to join with the original data in a downstream step.

For outputs like these, we would want to emit them on a sample-by-sample basis, to allow the consuming workflow to use a `join()` or `combine()` operation (most likely `join()`, assuming one output per sample, since it is more efficient than `combine()`).

## Thoughts toward a standard
We want to keep the inputs and outputs from module workflows relatively consistent to make interactions between them as easy as possible. So we should favor relatively short, simple tuples within the channels where we can, especially given that Nextflow does not use named tuples for channel elements. (What happens within a module can get more complex as needed.)
It seems likely that most analyses will have at least some component that is run on a sample-by-sample basis, so it probably makes sense to make that the main unit of organization. We could then require that the first two elements of the cross-workflow channels always be `[sample_id, project_id]` (in that order, to facilitate `join()` operations, which match on the first tuple element by default).

Following those two identifiers, we would have a slot for the main data, which we should probably prefer to be a file or array of files. Workflows that produce multiple kinds of outputs that are not expected to be used together often (such as SCE and AnnData output) can emit more than one channel, with different sets of files. Any `emit` statements should use descriptive names, such as `sample_processed_sce_ch`, to indicate the contents of the final element(s) of the tuple.
to indicate the contents of the final element(s) of the tuple.Putting this together, we might have a workflow that looked something like the following (though a real workflow would be more complex!):
A downstream workflow could then combine the channels above for a later analysis that uses the annotations:
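A sketch under the same assumptions, joining on both identifier elements:

```nextflow
// Sketch: join the annotation table back to the SCE channel by sample.
workflow downstream_analysis {
    take:
        sample_sce_ch       // [sample_id, project_id, sample_processed_sce]
        sample_celltypes_ch // [sample_id, project_id, celltype_table]
    main:
        // join on tuple indices 0 and 1 (sample_id and project_id) to get
        // [sample_id, project_id, sample_processed_sce, celltype_table]
        combined_ch = sample_sce_ch.join(sample_celltypes_ch, by: [0, 1])
    emit:
        combined_ch
}
```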
The main limitation I see with this plan is that it may not be uncommon for a workflow to also want access to the sample metadata sheet for the project, or other project-level data like bulk sequencing. We could pass this in as a separate channel, or include the `project_dir` in each channel as well, with the expectation that most workflows would disregard that element or simply pass it along unmodified.

## Or just run the workflow twice
While I was originally thinking about this as a way to have a single-shot run of the workflow that would either include or skip the simulation step before running everything else, it may be much simpler to just have two workflows for that use case: one to run the simulation, and a second to run afterward that works with either the real or simulated data buckets as input. Having written all this, I think this is the way we should probably go, but there may still be value in starting to codify some expectations about workflow outputs for some of the more complex cases.