Parallelize get_predicted_CNV_regions #350

blakecaldwell · 2021-08-08T20:17:14Z

When running with HMM_report_by='cell', the time spent in get_predicted_CNV_regions dominates total runtime because each cell is processed sequentially. This pull request parallelizes the bulk of that loop. With more than 10k cells, this can reduce runtime from days to hours.

The parallel package is used because it is already imported in inferCNV_BayesNet.R. mclapply distributes the loop iterations across the number of cores specified by num_threads when infercnv::run() is called.
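As a rough illustration of the pattern described above, here is a minimal sketch of running per-cell work through mclapply. The helper summarize_cell and the toy cells list are hypothetical stand-ins, not code from infercnv; only mclapply and its mc.cores argument come from the parallel package.

```r
library(parallel)

# Hypothetical stand-in for the per-cell work; not the infercnv implementation.
# It collapses a vector of HMM states into runs of identical states.
summarize_cell <- function(cell_states) {
    runs <- rle(cell_states)
    data.frame(state = runs$values, length = runs$lengths)
}

# Toy input: one vector of HMM states per cell (made-up data).
cells <- list(
    cell_A = c(3, 3, 4, 4, 4, 3),
    cell_B = c(3, 2, 2, 3, 3, 3)
)

# num_threads mirrors the argument passed to infercnv::run().
# mclapply relies on forking, so fall back to one core on Windows.
num_threads <- 2
cores <- if (.Platform$OS.type == "windows") 1L else num_threads

# Each cell is handled independently; results come back in input order.
per_cell <- mclapply(cells, summarize_cell, mc.cores = cores)
```

Because each element is processed independently and the result list keeps the input order, swapping lapply for mclapply does not change the output, only how the work is scheduled.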

The refactoring was complicated by the counter variable that ensures unique names for CNV regions. The workaround is to assign region names in a loop at the end, after parallel execution. This has the same effect as incrementing a counter, but it means the call to .get_cnv_gene_region_bounds must be placed outside the parallel loop. That function is simple enough that running it on a single core should not significantly delay overall runtime.
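The deferred-naming idea can be sketched as follows. This is an assumption-laden toy, not the code in this pull request: find_regions, the cell IDs, and the "region_" name prefix are all hypothetical. The point is only that workers return unnamed regions, and a serial pass afterwards assigns sequential names.

```r
library(parallel)

# Hypothetical per-cell worker: returns regions without names,
# so it needs no shared counter across parallel workers.
find_regions <- function(cell_id) {
    n_regions <- if (cell_id == "cell_A") 2 else 1   # made-up counts
    data.frame(cell = rep(cell_id, n_regions),
               start = seq_len(n_regions),
               stringsAsFactors = FALSE)
}

cell_ids <- c("cell_A", "cell_B")
cores <- if (.Platform$OS.type == "windows") 1L else 2L
region_list <- mclapply(cell_ids, find_regions, mc.cores = cores)

# Serial pass after the parallel block: results are in input order,
# so sequential names match what an incrementing counter would produce.
regions <- do.call(rbind, region_list)
regions$name <- paste0("region_", seq_len(nrow(regions)))
```

Keeping the naming step serial avoids any need to synchronize a counter between forked processes, at the cost of one cheap pass over the combined results.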

Blake Caldwell added 2 commits August 8, 2021 14:11

Use mclapply to parallelize get_predicted_CNV_regions

bda566d

Defer region naming to only have 1 parallel block

10fccfa
