Discussion: Inputs and outputs for workflow modules #29

Open
jashapiro opened this issue May 10, 2024 · 1 comment

@jashapiro
Member

Currently, the inputs for OpenScPCA-nf modules are channels that contain the following tuple:
[project_id, project_dir]

This makes for a nicely compact and universal input, and for modules that depend on nothing but "raw" data it is pretty easy to work with. It is therefore currently the module workflow's responsibility to filter the inputs down to the portion of the data that is needed for each analysis.

However, that structure may not be workable in the case that one module depends on the output of another module. We have not yet described standards for what a module workflow might emit for consumption by other module workflows, but it seems unlikely that the output would be of the form [project_id, project_dir].

There are a couple reasons for this:

  • For efficiency, workflows are likely to split input projects into multiple streams, and joining them back into a single output could be difficult
    • For example, the simulate-sce workflow largely performs its simulations on a single sample at a time. While it would be possible to rejoin the results into a single directory, this would mean we would need to have a process that downloads the data for all samples within each project and then emits the folder with all of them in it. This would result in a lot of copying without much benefit.
  • The contents of an output directory may not be as predictable as inputs, and it may be better to have more fully specified input channels which contain more information about the kind of data that is needed for each workflow.

A motivating example

I already mentioned the simulate-sce module workflow, and it was the main reason I started thinking about this problem in the first place (though as I write this I may have a simpler/better solution, discussed below). The simulate-sce module workflow does not currently emit anything, but we might want it to, in order to use the results from that workflow in downstream workflows. For example, the merge-sce workflow could be used to create simulated merged data.

However, the merge-sce workflow currently expects [project_id, project_dir] as input, and as discussed, this might be a pain to produce from the simulation workflow, which works on a sample-by-sample basis.

An easier output might be something of the form [project_id, sample_id, sample_dir], but then the consuming workflow (merge-sce in this case) might have to perform some form of merging on the input channel if it wants things organized by project. This isn't necessarily terrible to do: we could handle it with something like the groupTuple() operator, as in the sketch below.
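For illustration only (the project and sample IDs and paths here are made up), regrouping a per-sample channel back into per-project lists might look like:

workflow {
  // hypothetical per-sample channel of [project_id, sample_id, sample_dir];
  // IDs and paths are made up for illustration
  sample_dir_ch = Channel.of(
    ["SCPCP000001", "SCPCS000001", file("results/SCPCS000001")],
    ["SCPCP000001", "SCPCS000002", file("results/SCPCS000002")]
  )

  // groupTuple() collects tuples sharing the first element (project_id),
  // emitting [project_id, [sample_id, ...], [sample_dir, ...]]
  project_ch = sample_dir_ch.groupTuple()
  project_ch.view()
}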

This assumes that we want to pass along the whole directory, which we probably don't need to do all the time. The simulate-sce workflow produces both SCE and AnnData objects: different workflows might want to consume one or the other of these, and sometimes both. The merge workflow really only needs the SCEs, and of those, only the processed files. So we could instead have a workflow emit multiple channels: one with [project_id, sample_id, sample_sce] (where we might separate out processed and earlier-stage files), and separately [project_id, sample_id, sample_anndata]. Because AnnData objects are likely to be split by RNA and other features (or when there is more than one library for a sample), sample_anndata might actually be an array of files at times, though Nextflow generally handles this pretty reasonably.
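As a rough sketch of what emitting multiple channels could look like (the process and channel names here are hypothetical, not the current simulate-sce code):

workflow simulate_sce {
  take:
    sample_ch // [project_id, sample_id, sample_dir]

  main:
    // `simulate` is a hypothetical process with named outputs
    // (emit: processed_sce and emit: anndata)
    simulate(sample_ch)

  emit:
    // [project_id, sample_id, processed_sce]
    sample_processed_sce_ch = simulate.out.processed_sce
    // [project_id, sample_id, [rna_anndata, ...]] -- possibly several files per sample
    sample_anndata_ch = simulate.out.anndata
}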

Other outputs

One can also imagine that some workflows might not output sample-level info at all, or might have multiple files per sample. Another likely scenario is that a workflow outputs some results that need to be combined with other data for a downstream analysis. An example of this might be cell typing or doublet detection: for efficiency we would not want the output of these analyses to be another whole set of annotated SCE files, but instead a table of annotations that we could use to join with the original data in a downstream step.

For outputs like these, we would want to emit them on a sample-by-sample basis, to allow the downstream workflow to use a join() or combine() operation (likely join(), assuming one output per sample, since it is more efficient than combine()).

Thoughts toward a standard

We want to keep the inputs and outputs of module workflows relatively consistent to make interactions between them as easy as possible. So we should favor relatively short, simple tuples within the channels where we can, especially given that Nextflow does not use named tuples for channel elements. (What happens within a module can get more complex as needed.)

It seems likely that most analyses will have at least some component that is run on a sample-by-sample basis, so it probably makes sense to make the sample the main unit of organization. We could then always have the first two elements of cross-workflow channels be [sample_id, project_id] (in that order, to facilitate join operations).

Following those two identifiers, we would have a slot for the main data, which we should probably prefer to be a file or array of files. Workflows that produce multiple kinds of outputs that are not expected to be used together often (such as SCE and AnnData output) can emit more than one channel, with different sets of files. Any emit statements should use descriptive names, such as sample_processed_sce_ch to indicate the contents of the final element(s) of the tuple.

Putting this together, we might have a workflow that looks something like the following (though a real workflow would be more complex!):

process do_annotation {
  input:
    tuple val(sample_id), val(project_id), path(sample_sce)

  output:
    tuple val(sample_id), val(project_id), path("${sample_id}_annotation.tsv")

  script:
    """
    # placeholder command: a real process would run an annotation script here
    annotate_cells.R --input ${sample_sce} --output ${sample_id}_annotation.tsv
    """
}

workflow annotate {
  take:
    sample_sce_ch // [sample_id, project_id, sample_sce]

  main:
    do_annotation(sample_sce_ch)

  emit:
    sample_annotation_ch = do_annotation.out // [sample_id, project_id, sample_annotation.tsv]
}

A later workflow could then combine the channels above for a later analysis that used the annotations:

workflow use_annotations {
  take:
    sample_sce_ch // [sample_id, project_id, sample_sce]
    sample_annotation_ch // [sample_id, project_id, sample_annotation]

  main:
    combined_ch = sample_sce_ch.join(sample_annotation_ch)
    ...
}

The main limitation I see with this plan is that it may not be uncommon for a workflow to also want access to the sample metadata sheet for the project or other project-level data like bulk sequencing. We could pass this in as a separate channel, or include the project_dir in each channel as well, with the expectation that most workflows would disregard that element, or simply pass it along unmodified.
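If we went with a separate channel, attaching project-level data to a sample-keyed channel could look something like the following sketch, where project_dir_ch is a hypothetical channel of [project_id, project_dir] tuples:

// re-key by project_id, attach the project directory with combine(by: 0),
// then restore the [sample_id, project_id, ...] order expected downstream
sample_with_project_ch = sample_sce_ch
  .map{ sample_id, project_id, sample_sce -> [project_id, sample_id, sample_sce] }
  .combine(project_dir_ch, by: 0) // [project_id, sample_id, sample_sce, project_dir]
  .map{ project_id, sample_id, sample_sce, project_dir -> [sample_id, project_id, sample_sce, project_dir] }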

Or just run the workflow twice

While I was originally thinking about this as a way to have a single-shot run of the workflow that would either include or not include the simulation step, followed by running everything else, it may be much simpler to just have two workflows for that use case: One to run the simulation and a second to run afterward to just work with either the real or simulation data buckets as the input. Having written all this, I think this is the way we should probably go, but there may still be value in starting to codify some expectations about workflow outputs for some of the more complex cases.

@allyhawkins
Member

It seems likely that most analyses will have at least some component that is run on a sample-by-sample basis, so it probably makes sense to make the sample the main unit of organization. We could then always have the first two elements of cross-workflow channels be [sample_id, project_id] (in that order, to facilitate join operations).

I think this is very likely, and that most things will work on a per-sample level; the merging is actually the exception here. Also, if we were going to have anything that relied on merged objects downstream, it would be mostly separate from processes that take individual samples, so we probably wouldn't have to join that output with per-sample output.

That being said I do think using something with [sample_id, sample_sce, project_id] would allow us the flexibility to join things and group things as necessary within individual workflows.

While I was originally thinking about this as a way to have a single-shot run of the workflow that would either include or not include the simulation step, followed by running everything else, it may be much simpler to just have two workflows for that use case: One to run the simulation and a second to run afterward to just work with either the real or simulation data buckets as the input. Having written all this, I think this is the way we should probably go, but there may still be value in starting to codify some expectations about workflow outputs for some of the more complex cases.

Another option that I don't know how I feel about, but that could be worth thinking about, is having a single-sample workflow and a grouped workflow. So main.nf would run any analyses performed on individual samples, and a separate group-analysis.nf would do the merging and then perform analysis on merged objects. I would imagine that would be things like differential expression, NMF, etc., while the single-sample workflow contains things like cell typing.

Alternatively, if you update the inputs/outputs so that everything is on a per-sample level, then you could have a flag that, when set, runs the simulate step and sends simulated data through the workflow, and otherwise skips the simulate step and uses real data (roughly like the sketch below). But this would only work if you move any grouping of projects to within each module.
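Very roughly, and assuming a hypothetical params.simulate flag plus placeholder workflow names (none of these exist in OpenScPCA-nf yet):

// hypothetical flag; not an existing OpenScPCA-nf parameter
params.simulate = false

workflow {
  // placeholder: per-sample channel of [sample_id, project_id, sample_dir]
  sample_ch = get_samples()

  if (params.simulate) {
    // substitute simulated data for the real data before downstream steps
    simulate_sce(sample_ch)
    input_ch = simulate_sce.out
  } else {
    input_ch = sample_ch
  }

  // everything downstream consumes the same per-sample channel shape
  downstream_analyses(input_ch)
}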

Just jotting down some thoughts, but generally I think we might continue to make updates as we add more modules and see what works best.
