
Takes the list of interactions deemed significant by SeqMonk and the raw SAM file generated from the HiC experiment, and generates a collector's curve to investigate sampling depth. NOTE: HIGHLY MEMORY INTENSIVE.


alcamerone/HiCCollectorsCurve


Works with SeqMonk output to produce a collector's curve of significant interactions hit vs. number of reads sampled. The curve shows how completely the significant interactions have been sampled.

Requires:

- A sorted probe list generated by SeqMonk, in the format "Probe(Chr:Start-End(length in kbp)) ChrNumber StartPos EndPos" (tab-separated)

- An interaction list generated by SeqMonk, in the format "Probe1 Chromosome1 Start End Probe2 Chromosome2 Start End" (tab-separated)

- A SAM-formatted file (can be generated from a BAM file using samtools), e.g. the one used by SeqMonk to generate the above files
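As an illustration of the two tab-separated formats above, here is a minimal Python sketch that reads one row of each. It is not the script's own parser: the function names `parse_probe_line` and `parse_interaction_line` are hypothetical, the sample rows are made up, and it assumes the files carry no header line.

```python
import csv
import io


def parse_probe_line(fields):
    """Turn one probe row into (name, chromosome, start, end)."""
    name, chrom, start, end = fields[:4]
    return name, chrom, int(start), int(end)


def parse_interaction_line(fields):
    """Turn one interaction row into a pair of (chromosome, start, end) tuples."""
    _, chr1, start1, end1, _, chr2, start2, end2 = fields[:8]
    return (chr1, int(start1), int(end1)), (chr2, int(start2), int(end2))


# Made-up rows for illustration only; real SeqMonk exports may differ in detail.
probe_text = "Chr1:1-1000(1.0 kbp)\t1\t1\t1000\n"
interaction_text = "p1\t1\t1\t1000\tp2\t2\t5001\t6000\n"

probes = [
    parse_probe_line(row)
    for row in csv.reader(io.StringIO(probe_text), delimiter="\t")
]
interactions = [
    parse_interaction_line(row)
    for row in csv.reader(io.StringIO(interaction_text), delimiter="\t")
]
```

With real files, `io.StringIO(...)` would be replaced by an opened file handle.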

Usage: python read_vs_significant_interaction_collectors_curve.py [OPTIONS]

Options:

"-p", "--probe_list_fp" The path to the probe list generated by SeqMonk

"-i", "--sig_ints_fp" The path to the interaction list generated by SeqMonk

"-s", "--sam_file_fp" The path to the SAM file used to generate SeqMonk results

"-z", "--step_size" The amount by which the proportion of reads sub-sampled increases at each step (default: 0.1)

"-n", "--num_iter" The number of subsamples to generate for each step (default: 100)

"-t", "--num_threads" The number of concurrent processes to start (default: 2)

"-f", "--save_sig_ints_hit_fp" Optional: save the proportions of significant interactions hit at each step to this file for plotting later

Example:

To find the number of significant interactions hit at 10%, 20%, 30% ... 90% of reads, with 100 subsamples per step:

./read_vs_significant_interaction_collectors_curve.py --probe_list_fp seqmonk_probe_list.txt --sig_ints_fp seqmonk_output.txt --sam_file_fp reads.sam -z 0.1 -n 100
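The idea behind the command above can be sketched in a few lines of Python. This is a simplified illustration, not the script's implementation. It assumes each read has already been reduced to a label for the interaction it supports (the real script works from SAM records and probe coordinates), and that reads are sampled without replacement. The names `step_proportions`, `fraction_hit` and `collectors_curve` are hypothetical.

```python
import random
import statistics


def step_proportions(step_size):
    """Proportions step_size, 2 * step_size, ... strictly below 1.0."""
    count = int(round(1.0 / step_size)) - 1
    return [round(step_size * (i + 1), 10) for i in range(count)]


def fraction_hit(sampled_reads, significant):
    """Fraction of significant interactions seen at least once in the sample."""
    seen = set(sampled_reads) & significant
    return len(seen) / len(significant)


def collectors_curve(reads, significant, step_size=0.1, num_iter=100, seed=None):
    """Return (proportion, mean fraction hit, standard deviation) per step."""
    rng = random.Random(seed)
    curve = []
    for proportion in step_proportions(step_size):
        sample_size = int(round(proportion * len(reads)))
        # Assumption: each subsample is drawn without replacement.
        fractions = [
            fraction_hit(rng.sample(reads, sample_size), significant)
            for _ in range(num_iter)
        ]
        curve.append(
            (proportion, statistics.mean(fractions), statistics.pstdev(fractions))
        )
    return curve


# Toy data: each read is labelled with the interaction it supports.
toy_reads = ["a", "a", "b", "c", "x", "x", "y", "z", "z", "z"]
toy_significant = {"a", "b", "c"}
toy_curve = collectors_curve(toy_reads, toy_significant, step_size=0.1, num_iter=20, seed=1)
```

A curve that flattens before the largest proportion suggests that additional reads would recover few further significant interactions. The spread across subsamples at each step gives the error bars.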

NB: Very memory intensive, depending on the size of the input files. Intended to be run on a machine with a lot of RAM when processing large datasets.

Running the small example:

In the "test" directory are some small example files which can be used to run the code and generate output. To do this, run:

./read_vs_significant_interaction_collectors_curve.py -p test/Probe_List.small.test.txt -i test/Sig_Intrxns.small.test.txt -s test/SAM_File.small.test.sam -z 0.1 -n 100

The file "Sig_Interxns.small.test_collectors_curve.png" is an example of the output produced by the program. The large error in this case is due to the very small sample size. Results will differ between runs because the reads are sampled randomly at each step.
