Works with SeqMonk output to produce a collector's curve of significant interactions hit vs. number of reads sampled
Generates a collector's curve to determine how completely the significant interactions have been sampled.
Requires:
-A sorted probe list generated by SeqMonk, in the format "Probe(Chr:Start-End(length in kbp)) ChrNumber StartPos EndPos" (tab-separated)
-An interaction list generated by SeqMonk, in the format "Probe1 Chromosome1 Start End Probe2 Chromosome2 Start End" (tab-separated)
-A SAM-formatted file (can be generated from a BAM file using samtools), e.g. the one used by SeqMonk to generate the above files
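The two tab-separated formats above can be read with a few lines of Python. This is only a minimal sketch, not the script's own parser: the names Probe, parse_probe, read_probes and read_interactions are hypothetical, the probe names in the demo data are made up, and skipping rows with non-numeric positions (such as a header line) is an assumption.

```python
import csv
import io
from typing import NamedTuple


class Probe(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


def parse_probe(fields):
    """Turn four fields (name, chromosome, start, end) into a Probe."""
    name, chrom, start, end = fields
    return Probe(name, chrom, int(start), int(end))


def is_data_row(row, position_columns):
    """Assumption: skip rows that are too short or whose positions are not numeric."""
    return len(row) > max(position_columns) and all(
        row[i].isdigit() for i in position_columns
    )


def read_probes(handle):
    """Read a probe list: name, chromosome, start, end."""
    rows = csv.reader(handle, delimiter="\t")
    return [parse_probe(row[:4]) for row in rows if is_data_row(row, (2, 3))]


def read_interactions(handle):
    """Read an interaction list: two probes of four fields each per line."""
    rows = csv.reader(handle, delimiter="\t")
    return [
        (parse_probe(row[0:4]), parse_probe(row[4:8]))
        for row in rows
        if is_data_row(row, (2, 3, 6, 7))
    ]


# Made-up demo data, for illustration only.
probe_text = "Probe\tChromosome\tStart\tEnd\n1:1000-2000(1.0 kbp)\t1\t1000\t2000\n"
interaction_text = (
    "1:1000-2000(1.0 kbp)\t1\t1000\t2000\t2:5000-6000(1.0 kbp)\t2\t5000\t6000\n"
)

probes = read_probes(io.StringIO(probe_text))
interactions = read_interactions(io.StringIO(interaction_text))
print(probes)
print(interactions)
```

With real files, open each path in text mode and pass the file handle to read_probes or read_interactions instead of the io.StringIO objects used here.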
Usage: python read_vs_significant_interaction_collectors_curve.py [OPTIONS]
Options:
"-p", "--probe_list_fp" The path to the probe list generated by SeqMonk
"-i", "--sig_ints_fp" The path to the interaction list generated by SeqMonk
"-s", "--sam_file_fp" The path to the SAM file used to generate SeqMonk results
"-z", "--step_size" The amount by which the proportion of reads subsampled increases at each step (default: 0.1)
"-n", "--num_iter" The number of subsamples to generate for each step (default: 100)
"-t", "--num_threads" The number of concurrent processes to start (default: 2)
"-f", "--save_sig_ints_hit_fp" Optional: save the proportions of significant interactions hit at each step to this file for plotting later
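For reference, the options above could be declared with argparse roughly as follows. This is a sketch only: the script's actual parser may be written differently, build_parser is a hypothetical name, and marking the three input paths as required is an assumption. The flag names and defaults are taken from the list above.

```python
import argparse


def build_parser():
    """Declare the documented options; types and 'required' are assumptions."""
    parser = argparse.ArgumentParser(
        description="Collector's curve of significant interactions hit vs. reads sampled"
    )
    parser.add_argument("-p", "--probe_list_fp", required=True,
                        help="Path to the probe list generated by SeqMonk")
    parser.add_argument("-i", "--sig_ints_fp", required=True,
                        help="Path to the interaction list generated by SeqMonk")
    parser.add_argument("-s", "--sam_file_fp", required=True,
                        help="Path to the SAM file used to generate SeqMonk results")
    parser.add_argument("-z", "--step_size", type=float, default=0.1,
                        help="Increase in the proportion of reads subsampled per step")
    parser.add_argument("-n", "--num_iter", type=int, default=100,
                        help="Number of subsamples to generate for each step")
    parser.add_argument("-t", "--num_threads", type=int, default=2,
                        help="Number of concurrent processes to start")
    parser.add_argument("-f", "--save_sig_ints_hit_fp", default=None,
                        help="Optional file to save the proportions hit at each step")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["-p", "probes.txt", "-i", "interactions.txt", "-s", "reads.sam", "-z", "0.2"]
)
print(args)
```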
Example:
To find the number of significant interactions hit at 10%, 20%, 30% ... 90% of reads, with 100 subsamples at each step:
./read_vs_significant_interaction_collectors_curve.py --probe_list_fp seqmonk_probe_list.txt --sig_ints_fp seqmonk_output.txt --sam_file_fp reads.sam -z 0.1 -n 100
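The subsampling procedure in this example can be sketched as below. It is an illustration under stated assumptions, not the script's implementation: each read is assumed to have already been reduced to the id of the significant interaction it supports (or None if it supports none), an interaction is assumed to count as "hit" when at least one sampled read supports it, and the names proportions and collectors_curve are hypothetical.

```python
import random


def proportions(step_size):
    """Proportions strictly between 0 and 1, e.g. 0.1, 0.2 ... 0.9 for step_size 0.1."""
    n_steps = round(1 / step_size)
    return [round(i * step_size, 10) for i in range(1, n_steps)]


def collectors_curve(read_hits, n_sig, step_size=0.1, num_iter=100, rng=None):
    """For each proportion, draw num_iter random subsamples of the reads and
    record the fraction of the n_sig significant interactions that were hit.

    read_hits: one entry per read, holding an interaction id or None
               (assumption: reads were mapped to interactions beforehand).
    """
    rng = rng or random.Random()
    curve = {}
    for proportion in proportions(step_size):
        sample_size = round(len(read_hits) * proportion)
        fractions = []
        for _ in range(num_iter):
            sample = rng.sample(read_hits, sample_size)  # without replacement
            hit = {h for h in sample if h is not None}
            fractions.append(len(hit) / n_sig)
        curve[proportion] = fractions
    return curve


# Made-up data: 100 reads, 75 of which support one of 3 significant interactions.
demo_reads = [0, 1, 2, None] * 25
curve = collectors_curve(demo_reads, n_sig=3, step_size=0.1, num_iter=100,
                         rng=random.Random(0))
for proportion, fractions in curve.items():
    print(proportion, sum(fractions) / len(fractions))
```

Each proportion maps to a list of num_iter fractions, so the mean and spread at every step can be computed afterwards for plotting.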
NB: This script is very memory intensive, with usage depending on the size of the input files. When processing large datasets, run it on a machine with plenty of RAM.
Running the small example:
In the "test" directory are some small example files which can be used to run the code and generate output. To do this, run:
./read_vs_significant_interaction_collectors_curve.py -p test/Probe_List.small.test.txt -i test/Sig_Intrxns.small.test.txt -s test/SAM_File.small.test.sam -z 0.1 -n 100
The file "Sig_Interxns.small.test_collectors_curve.png" is an example of the output produced by the program. The large error in this example is due to the very small sample size. Results differ from run to run because the reads are sampled randomly at each step.
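The effect of sample size and of random sampling on the error can be illustrated with a small sketch. Everything here is hypothetical: the data are made up, the function names are invented for the example, and whether the script itself lets you fix a random seed is not stated above. Standard deviation is used here as the measure of spread, which is an assumption about how the error is computed.

```python
import random
import statistics


def fraction_hit(read_hits, n_sig, proportion, rng):
    """Fraction of significant interactions hit by one random subsample."""
    sample_size = round(len(read_hits) * proportion)
    hit = {h for h in rng.sample(read_hits, sample_size) if h is not None}
    return len(hit) / n_sig


def spread_at(read_hits, n_sig, proportion, num_iter, seed):
    """Mean and standard deviation over num_iter subsamples, using a seeded RNG."""
    rng = random.Random(seed)
    values = [fraction_hit(read_hits, n_sig, proportion, rng) for _ in range(num_iter)]
    return statistics.mean(values), statistics.stdev(values)


# Made-up data: 5 interactions supported by 1 read each vs. 100 reads each.
small = [0, 1, 2, 3, 4] + [None] * 5
large = [0, 1, 2, 3, 4] * 100 + [None] * 500

small_mean, small_sd = spread_at(small, 5, 0.5, num_iter=100, seed=1)
large_mean, large_sd = spread_at(large, 5, 0.5, num_iter=100, seed=1)
print("small:", small_mean, small_sd)
print("large:", large_mean, large_sd)
```

Passing the same seed reproduces the same numbers, while different seeds (or an unseeded generator) give slightly different curves, which is the run-to-run variation described above.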