Please cite TaJoCGI using the following:
- Length ≥ 500 bp
- GC ≥ 0.55
- (ObsCpG / ExpCpG) ≥ 0.65
- GC = (number G + number C) / (Length)
- ObsCpG = (number CpG) / (Length)
- ExpCpG = (GC / 2)² (doi: 10.1073/pnas.0510310103)
1. In a 200-bp window at the beginning of the sequence, compute %GC and (ObsCpG / ExpCpG). Shift the window by 1 bp until it meets the criteria above.
2. If the window meets the criteria, shift the window 200 bp and evaluate again.
3. Repeat these 200-bp shifts until the window does not meet the criteria.
4. Shift the last window back 1 bp at a time (toward the 5' end) until it meets the criteria.
5. Evaluate total %GC and (ObsCpG / ExpCpG) for this combined strand.
6. If this large CpG island does not meet the criteria, trim 1 bp from each side until it does.
7. Connect two individual CpG islands if they are separated by less than 100 bp.
8. Repeat steps 5–6 to evaluate the new sequence segment until it meets the criteria.
9. Reset the start position to immediately after the CGI identified at step 8 and go to step 1.
Note: The original paper appears to apply the ≥ 500 bp length filter at the end.
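The criteria and the first and seventh steps above can be sketched in plain Python. This is a minimal illustration, not the code in cyFuns or pyFuns: the function names (`cgi_metrics`, `meets_criteria`, `find_seed_window`, `merge_close_islands`) are hypothetical, islands are assumed to be half-open `(start, end)` intervals, and the extension and trimming steps are left out.

```python
def cgi_metrics(seq):
    """Return (GC fraction, ObsCpG / ExpCpG) for a DNA string."""
    seq = seq.upper()
    length = len(seq)
    if length == 0:
        return 0.0, 0.0
    gc = (seq.count("G") + seq.count("C")) / length
    obs_cpg = seq.count("CG") / length   # ObsCpG = (number CpG) / Length
    exp_cpg = (gc / 2) ** 2              # ExpCpG = (GC / 2)^2
    ratio = obs_cpg / exp_cpg if exp_cpg > 0 else 0.0
    return gc, ratio


def meets_criteria(seq, min_gc=0.55, min_ratio=0.65):
    """Check the GC and ObsCpG / ExpCpG thresholds (length is filtered separately)."""
    gc, ratio = cgi_metrics(seq)
    return gc >= min_gc and ratio >= min_ratio


def find_seed_window(seq, start=0, window=200):
    """Slide a window 1 bp at a time; return the first qualifying start index, or None."""
    for i in range(start, len(seq) - window + 1):
        if meets_criteria(seq[i:i + window]):
            return i
    return None


def merge_close_islands(islands, max_gap=100):
    """Join (start, end) intervals separated by fewer than max_gap bp.

    Assumes half-open intervals, so the gap is next_start - previous_end.
    """
    merged = []
    for start, end in sorted(islands):
        if merged and start - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
        else:
            merged.append((start, end))
    return merged
```

For example, `meets_criteria("CG" * 100)` is true and `meets_criteria("AT" * 100)` is false. A merged island would still need to be re-evaluated against the criteria, and the ≥ 500 bp filter applied afterwards.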
If using the Cython version (default), you'll need to compile this first:
python3 cySetup.py build_ext --inplace
If it gives you the following error:
'numpy/arrayobject.h' file not found
... run the following in python3 (e.g., by running python3 in the Terminal):
import os
import numpy
print(os.path.dirname(numpy.__file__))
Copy the output, then paste it where <NUMPY> is located in the code below, and run it in the Terminal:
export numpy_loc="<NUMPY>"
cp -r "${numpy_loc}"/core/include/numpy \
/usr/local/include
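As an alternative to building the header path by hand, NumPy can report its own header directory through `numpy.get_include()`. This matters because NumPy 2.x keeps its headers under `_core/include` rather than `core/include`. The sketch below is an illustration only: the helper name `numpy_include_dir` is hypothetical, and nothing here describes what cySetup.py actually does.

```python
import importlib.util
import os


def numpy_include_dir():
    """Return NumPy's C header directory, or None if NumPy is not installed."""
    if importlib.util.find_spec("numpy") is None:
        return None
    import numpy
    return numpy.get_include()


include_dir = numpy_include_dir()
if include_dir is None:
    print("NumPy is not installed in this Python environment")
else:
    # The missing header should be at <include_dir>/numpy/arrayobject.h
    print(os.path.join(include_dir, "numpy"))
```

The printed directory could be used as the source of the `cp -r` command above, in place of `"${numpy_loc}"/core/include/numpy`.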
The Cython version is much faster, but if you have plenty of time and don't feel like compiling, you can simply change this line at the top of TJalgorithm.py
import cyFuns as cgi
to
import pyFuns as cgi
This will use the pure Python implementation.
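If you would rather not edit the import by hand, one possible pattern is to try the compiled module first and fall back to the Python one. This is only a sketch of that idea and is not part of TJalgorithm.py. The helper `import_first` is hypothetical, and the sketch assumes that cyFuns and pyFuns expose the same functions.

```python
import importlib


def import_first(*module_names):
    """Import and return the first module in module_names that is available."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of these modules could be imported: "
                      + ", ".join(module_names))


# Hypothetical use at the top of a script:
# cgi = import_first("cyFuns", "pyFuns")
```

With this pattern the script would use the Cython build when it has been compiled and the Python version otherwise.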
/path/TJalgorithm.py -c 12 -o Glazer_CGI.bed Glazer_assembly.fa
Note: Each parallel task consists of finding all CpG islands in a single sequence, so providing more cores than there are chromosomes/sequences gives no additional speedup.
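The one-task-per-sequence granularity can be illustrated with a short sketch. The names `effective_workers` and `run_per_sequence` are hypothetical. A thread pool is used here only to keep the example small, and it is not meant to reflect how TJalgorithm.py parallelizes its work.

```python
from concurrent.futures import ThreadPoolExecutor


def effective_workers(requested_cores, n_sequences):
    """Workers beyond the number of sequences would have nothing to do."""
    return max(1, min(requested_cores, n_sequences))


def run_per_sequence(task, sequences, requested_cores):
    """Apply task to each sequence (dict of name -> string), one task per sequence."""
    workers = effective_workers(requested_cores, len(sequences))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(task, sequences.values())
        return dict(zip(sequences.keys(), results))
```

For an assembly with 23 sequences, `effective_workers(12, 23)` is 12 and `effective_workers(32, 23)` is 23. Because each task covers a whole sequence, the run can be no faster than the time needed for the slowest single sequence.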
For the Glazer (2015) threespine stickleback assembly, 42,560 CpG islands were identified across 23 chromosomes in 2 minutes, 51 seconds. This run used 12 cores in parallel on an AMD Opteron processor and 3.689560 GB of RAM.
Takai D, Jones PA. 2002. Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc Natl Acad Sci USA 99:3740–3745. doi: 10.1073/pnas.052410099