Scripts and information for a methylation study using ONT modified-base basecalling.
The main script is process_gene_features.py. It processes a number of input files, described below, to generate files with C-methylation (Cm) information in relation to chromosome features and CpG islands.
Script usage: ./process_gene_features.py chromosome_features

This script processes 5 input files to extract the Cm probability of the features in relation to the CpG islands identified in the chromosome. An example and the format of each input file, as well as of the different output files, are given at the start of the script.
The 5 input files are:
- chromosome_features: gene feature file derived from the Ensembl GFF for fAstCal1.2
- CpG_chr: CpG island file derived from the output of CpGProd_linux for each chromosome. The CpG islands of each chromosome were named CpGX, where X is an incremental integer starting at 1
- fastq_mapping_sorted.bam: binary (BAM) version of the SAM file generated by mapping the fastq reads to the genome with minimap2
- two pre-processed files: one containing the positions of the CG dinucleotides in each read of the fastq files obtained after basecalling, and a second one containing the corresponding Cm probability at these positions. They were obtained by running preprocess_modif_tables.py. Both files (Preprocessed_Table_pos and Preprocessed_Table_mod) are in a FASTA-like format.

The paths to the folders containing these files are provided in the config file: process.config
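To illustrate what the position table holds, the sketch below locates the CG dinucleotides in a single read. It is a minimal example and not the code of preprocess_modif_tables.py: the function name `cg_positions` and the use of 0-based coordinates are assumptions made here for illustration.

```python
def cg_positions(sequence):
    """Return the 0-based start position of every CG dinucleotide in a read."""
    seq = sequence.upper()  # treat soft-masked (lower-case) bases like any other
    positions = []
    start = seq.find("CG")
    while start != -1:
        positions.append(start)
        # Resume the search one base further along the read.
        start = seq.find("CG", start + 1)
    return positions


if __name__ == "__main__":
    print(cg_positions("ACGTTCGCGA"))  # prints [1, 5, 7]
```

The Cm probability table would then carry one value per position returned here, in the same order.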
The time taken by each step is recorded in the process_chr.log file.
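One simple way to record step timings of this kind is a small context manager that appends a line to the log when a step finishes. This is only a sketch: the helper name `timed_step` and the tab-separated line format are assumptions, and the actual layout of the log written by the script may differ.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_step(name, log_path):
    """Append the wall-clock duration of the enclosed step to log_path."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        # Append mode keeps the timings of earlier steps in the same file.
        with open(log_path, "a") as log:
            log.write(f"{name}\t{elapsed:.2f} s\n")
```

A step would be wrapped as `with timed_step("load features", "process_chr.log"): ...`, where the step name is free text.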
The script produces the following output files:
- chrCpG_results.tsv: lists the feature name and length, then the Cm probability, number of CG dinucleotides and cumulative length of the regions within and outside the CpG islands
- chrCpG_summary.tsv: lists the feature name and length, then the Cm probability for the feature and for the regions within and outside the CpG islands, normalised by the feature length
- chr_feature_results.tsv: summarises the results for each type of feature across the whole chromosome, with the columns: feature, total_number, mean_length, mean_prob, mean_prob/length, noCpG_mean_prob, noCpG_mean_length, noCpG_mean_CGcount, features_with_CpG, inCpG_mean_length, inCpG_mean_CGcount
- Cm_chr.bed: lists the coordinates of the features and their Cm probabilities
- Cm_inCpG_chr.bed: lists the coordinates of the features and the Cm probabilities associated with the CpG island regions

The paths to the folders where these files are stored are also provided in the config file: process.config
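To make the split between regions within and outside the CpG islands concrete, the sketch below summarises one feature: it clips the islands to the feature, assigns each CG position to the inside or outside part, and reports the mean Cm probability, CG count and cumulative length of each part. It illustrates the idea under stated assumptions and is not the logic of process_gene_features.py: intervals are taken as 0-based and half-open, islands are assumed not to overlap each other, and the function name `split_by_islands` and the dictionary keys are hypothetical.

```python
from statistics import mean


def split_by_islands(feature, islands, cg_probs):
    """Summarise Cm probabilities inside and outside CpG islands for one feature.

    feature  -- (start, end) half-open interval of the feature
    islands  -- list of (start, end) half-open, non-overlapping CpG islands
    cg_probs -- dict mapping CG position -> Cm probability
    """
    f_start, f_end = feature
    # Clip every island to the feature and drop those that do not overlap it.
    clipped = [(max(s, f_start), min(e, f_end)) for s, e in islands]
    clipped = [(s, e) for s, e in clipped if s < e]

    in_probs, out_probs = [], []
    for pos, prob in cg_probs.items():
        if not f_start <= pos < f_end:
            continue  # this CG lies outside the feature
        inside = any(s <= pos < e for s, e in clipped)
        (in_probs if inside else out_probs).append(prob)

    in_len = sum(e - s for s, e in clipped)
    out_len = (f_end - f_start) - in_len

    def summary(probs, length):
        return {
            "mean_prob": mean(probs) if probs else None,
            "CG_count": len(probs),
            "length": length,
        }

    return {"inCpG": summary(in_probs, in_len),
            "noCpG": summary(out_probs, out_len)}


if __name__ == "__main__":
    # Made-up values, for illustration only.
    result = split_by_islands(
        feature=(100, 200),
        islands=[(150, 250), (0, 50)],
        cg_probs={110: 0.5, 160: 0.25, 180: 0.75, 300: 1.0},
    )
    print(result["inCpG"])  # {'mean_prob': 0.5, 'CG_count': 2, 'length': 50}
    print(result["noCpG"])  # {'mean_prob': 0.5, 'CG_count': 1, 'length': 50}
```

Returning `None` for an empty part keeps features without any CpG island distinguishable from features whose island CGs have a probability of zero.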