Commit

Merge pull request #150 from TobyBaril/v5.0.0
Updates for Version 5!
TobyBaril authored Oct 10, 2024
2 parents b7c0bf7 + d504db7 commit 38ca68d
Showing 28 changed files with 522 additions and 48 deletions.
23 changes: 20 additions & 3 deletions README.md
@@ -9,6 +9,8 @@ Earl Grey is a full-automated transposable element (TE) annotation pipeline, lev

# Contents

[Changes in Latest Release](#changes-in-latest-release)

[Example Run](#example)

[References and Acknowledgements](#references-and-acknowledgements)
@@ -23,6 +25,20 @@ Earl Grey is a full-automated transposable element (TE) annotation pipeline, lev

<!-- toc -->

# Changes in Latest Release
Big changes in the latest release!

*Earl Grey v5.0.0 is here!*

This release incorporates the incremental improvements made throughout the life of version 4.

It is now possible to run some subroutines in Earl Grey (run either of these new commands with `-h` to see a list of options):
- `earlGreyLibConstruct` can be used to run Earl Grey for _de novo_ TE detection, consensus generation, and improvement through the BEAT process. The output will be the strained TE consensus sequences, which can then be used for subsequent annotation. This is useful when you want to build a combined library from the libraries of several different genomes, because you no longer need to spend time running the annotation steps for each genome. Once the libraries are generated and you have curated them, you can then run the next step in isolation (next point!).
- `earlGreyAnnotationOnly` can be used to run the final annotation and defragmentation steps in Earl Grey. This is useful if you have already run the BEAT process and have a library of _de novo_ TE consensus sequences that you would like to use to annotate a given genome. This script is also compatible with the `-r` flag to take known repeats from the databases used to configure RepeatMasker in addition to the custom repeat library.
- *EXPERIMENTAL FEATURE:* I have also added an option to run [HELIANO](https://github.com/Zhenlisme/heliano) for improved detection of Helitrons, which are notoriously difficult to detect and classify using homology methods. This can be implemented by adding `-e yes` to the command line options after upgrading to v5.0.0. Currently, HELIANO annotations replace those which they overlap following the RepeatMasker run, which is performed during defragmentation (in a similar way to full-length LTRs being dealt with in `RepeatCraft`). Feedback is welcomed on this implementation, and I am continuing to test and improve the implementation of HELIANO within Earl Grey.
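
As a minimal sketch of how the HELIANO option might be added to a normal run: the genome file, species name, and output directory below are placeholders, and `-o` is assumed to be the output directory option because its description is not shown in this excerpt (check `earlGrey -h`). The snippet only prints the command so it can be inspected before running it inside the Earl Grey environment.

```shell
# Placeholder inputs (hypothetical); replace with your own.
genome="genome.fasta"
species="mySpecies"
outdir="./earlGreyOutputs"

# Compose an earlGrey call with HELIANO enabled via -e yes.
# Treating -o as the output directory is an assumption; see `earlGrey -h`.
build_earlgrey_cmd() {
  printf 'earlGrey -g %s -s %s -o %s -e yes\n' "$1" "$2" "$3"
}

# Print the command rather than executing it.
build_earlgrey_cmd "$genome" "$species" "$outdir"
```

The two new subroutine commands are not shown here because their options are only documented through `-h`.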

Thank you for your continued support and enthusiasm for Earl Grey!

# Example

Given an input genome, Earl Grey will run through numerous steps to identify, curate, and annotate transposable elements (TEs). We recommend running earlGrey within a tmux or screen session, so that you can log off and leave Earl Grey running.
@@ -45,6 +61,7 @@ Required Parameters:
-d == Create soft-masked genome at the end? (yes/no, Default: no)
-n == Max number of sequences used to generate consensus sequences (Default: 20)
-a == minimum number of sequences required to build a consensus sequence (Default: 3)
-e == Optional: Run HELIANO for detection of Helitrons (yes/no, Default: no)
-h == Show help
```

@@ -118,8 +135,8 @@ As of `v4.4.5`, there is an option to generate _de novo_ TE libraries without ru

```
# BED format
NC_045808.1 4964941 4965925 LINE/Penelope 5073 +
NC_045808.1 7291353 7291525 LINE/L2 1279 +
NC45808.1 4964941 4965925 LINE/Penelope 5073 +
NC45808.1 7291353 7291525 LINE/L2 1279 +
NC_045808.1 8922477 8923791 DNA/TcMar-Tc1 11957 +
@@ -617,7 +634,7 @@ You are ready to go! Just remember to activate the _intel_ terminal, then the co
In this case, we need to bind a system directory to the docker container. In the line below, we are binding a directory called `host_data`, found on our current path, to `/data/` in the docker container. Please replace the file path before `:` with the directory you wish to bind to `/data/` in the container. This container must be run in interactive mode the first time you use it.

```
docker run -it -v `pwd`/host_data/:/data/ quay.io/biocontainers/earlgrey:5.0.0--h4ac6f70_1
docker run -it -v `pwd`/host_data/:/data/ quay.io/biocontainers/earlgrey:5.0.0--h4ac6f70_0
```

## If you are running the container for the first time, you need to enable Earl Grey to configure the Dfam libraries correctly in interactive mode.
65 changes: 53 additions & 12 deletions earlGrey
100644 → 100755
@@ -3,7 +3,7 @@
usage()
{
echo " #############################
earlGrey version 4.5.0
earlGrey version 5.0.0
Required Parameters:
-g == genome.fasta
-s == species name
@@ -20,6 +20,7 @@ usage()
-d == Create soft-masked genome at the end? (yes/no, Default: no)
-n == Max number of sequences used to generate consensus sequences (Default: 20)
-a == minimum number of sequences required to build a consensus sequence (Default: 3)
-e == Run HELIANO as an optional step to detect Helitrons (yes/no, Default: no)
-h == Show help
Example Usage:
@@ -49,6 +50,9 @@ makeDirectory()
mkdir -p $OUTDIR/${species}_RepeatLandscape/
mkdir -p $OUTDIR/${species}_mergedRepeats/
mkdir -p ${OUTDIR}/${species}_summaryFiles/
if [ ! -z "$heli" ]; then
mkdir -p ${OUTDIR}/${species}_heliano/
fi
}

# Subprocess PrepGenome #
@@ -155,7 +159,7 @@ deNovo1()
fi
}

# Subprocess strainer # CHECK FILE STRUCTURE FOR EARL GREY RUN
# Subprocess strainer
# contains the BLAST, Extract, Extend, Trim pipeline from James Galbraith
strainer()
{
@@ -192,19 +196,41 @@ novoMask()
fi
}

# Subprocess heliano
# Run HELIANO as an optional step, then replace overlapping repeats in the merged output with HELIANO outputs
### TODO:
heliano_optional()
{
cd ${OUTDIR}/${species}_heliano/
heliano -g $genome --nearest -dn 6000 -flank_sim 0.5 -o ${OUTDIR}/${species}_heliano/HEL_${timestamp} -w 10000 -n $ProcNum
awk '{OFS="\t"}{print $1, "HELIANO", "RC/Helitron", $2+1, $3, $5, $6, ".", "ID="$9"_"$11";shortTE=F"}' ${OUTDIR}/${species}_heliano/HEL_${timestamp}/RC.representative.bed > ${OUTDIR}/${species}_heliano/HEL_${timestamp}/RC.representative.gff
helitron_gff=${OUTDIR}/${species}_heliano/HEL_${timestamp}/RC.representative.gff
}
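
The BED-to-GFF conversion in `heliano_optional` can be tried on a single record. This is a sketch only: the 11-column input line is made up, and its values are placeholders rather than real HELIANO output, so it shows the column rearrangement and nothing about HELIANO's actual file contents. `OFS` is set in a `BEGIN` block here, which gives the same output as setting it in the main rule.

```shell
# Same field mapping as the awk call in heliano_optional:
# BED start ($2) is 0-based, so it is shifted by +1 for GFF.
bed_to_gff() {
  awk 'BEGIN{OFS="\t"}{print $1, "HELIANO", "RC/Helitron", $2+1, $3, $5, $6, ".", "ID="$9"_"$11";shortTE=F"}'
}

# One made-up, tab-separated record with 11 columns (placeholder values).
printf 'chr1\t99\t200\tc4\t37\t+\tc7\tc8\tHel1\tc10\tauto\n' | bed_to_gff
```

With this input the start coordinate 99 becomes 100 in the fourth GFF column, and the attribute column is built from fields 9 and 11.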

# Subprocess rcMergeRepeats
# Defragment repeat sequences to adjust for insertion times
mergeRep()
{
mkdir ${OUTDIR}/${species}_mergedRepeats/looseMerge
${SCRIPT_DIR}/rcMergeRepeatsLoose -f $genome -s $species -d ${OUTDIR}/${species}_mergedRepeats/looseMerge -u ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).out -q ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).tbl -t $ProcNum -b ${dict} -m $margin
if [ -s "$helitron_gff" ]; then
echo "Running loose merge with HELIANO output"
${SCRIPT_DIR}/rcMergeRepeatsLoose -f $genome -s $species -d ${OUTDIR}/${species}_mergedRepeats/looseMerge -u ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).out -q ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).tbl -t $ProcNum -b ${dict} -m $margin -e $helitron_gff
else
${SCRIPT_DIR}/rcMergeRepeatsLoose -f $genome -s $species -d ${OUTDIR}/${species}_mergedRepeats/looseMerge -u ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).out -q ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).tbl -t $ProcNum -b ${dict} -m $margin
fi

if [ -f "${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed" ]; then
awk '{OFS="\t"}{print $1, $2, $3, $4, $5, $6, $7, $8, toupper($9)}' ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff > ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff.1 && mv ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.gff{.1,}
fi

if [ ! -f "${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed" ]; then
echo "ERROR: loose merge defragmentation failed, trying strict merge..."
cd ${OUTDIR}/${species}_mergedRepeats/
${SCRIPT_DIR}/rcMergeRepeats -f $genome -s $species -d ${OUTDIR}/${species}_mergedRepeats/ -u ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).out -q ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).tbl -t $ProcNum -b ${dict} -m $margin
if [ -s "$helitron_gff" ]; then
${SCRIPT_DIR}/rcMergeRepeats -f $genome -s $species -d ${OUTDIR}/${species}_mergedRepeats/ -u ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).out -q ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).tbl -t $ProcNum -b ${dict} -m $margin -e $helitron_gff
else
${SCRIPT_DIR}/rcMergeRepeats -f $genome -s $species -d ${OUTDIR}/${species}_mergedRepeats/ -u ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).out -q ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).tbl -t $ProcNum -b ${dict} -m $margin
fi
if [ ! -f "${OUTDIR}/${species}_mergedRepeats/${species}.filteredRepeats.bed" ]; then
echo "ERROR: strict merge also failed, check ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/$(basename $genome).out looks as expected"
exit 2
@@ -364,10 +390,12 @@ Checks()
echo "ERROR: teStrainer module not found, please check all modules are present and run the configure script in the Earl Grey directory before attempting to run Earl Grey"; exit 1
fi

if [ "$CONDA_DEFAULT_ENV" == "earlGrey" ]; then
echo "Conda environment is active"
if [ "$heli" == "yes" ] ; then
heli=yes; echo "HELITRON detection will be run using HELIANO"
elif [ -z "$heli" ] || [ "$heli" == "no" ]; then
heli=no; echo "HELITRON detection will not be run"
else
echo "Conda environment is inactive, please activate conda environment before using Earl Grey"; exit 1
heli=no; echo "HELITRON detection not specified using (yes/no). Using default parameter (no)."
fi
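
The handling of the `-e` value added in this hunk reduces to a three-way decision: `yes` enables HELIANO, an empty value or `no` disables it, and anything else falls back to the default. A minimal standalone sketch, using a hypothetical helper name that is not part of the script:

```shell
# Hypothetical helper (not in earlGrey) mirroring the three-way check on $heli.
normalise_heli() {
  case "$1" in
    yes)   echo "yes" ;;
    ""|no) echo "no" ;;
    *)     # Unrecognised value: warn on stderr and use the default.
           echo "HELITRON detection not specified using (yes/no). Using default parameter (no)." >&2
           echo "no" ;;
  esac
}

normalise_heli "yes"
normalise_heli "maybe"
```

Because every path ends with the value set to `yes` or `no`, later stages can compare against `yes` directly.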

# biocontainer checks
@@ -395,7 +423,7 @@ Checks()

# Main #

while getopts g:s:o:t:f:i:r:c:l:m:d:n:a:h option
while getopts g:s:o:t:f:i:r:c:l:m:d:n:a:e:h option
do
case "${option}"
in
@@ -412,6 +440,7 @@ do
d) softMask=${OPTARG};;
n) no_seq=${OPTARG};;
a) min_seq=${OPTARG};;
e) heli=${OPTARG};;
h) usage; exit 0;;
esac
done
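
The change to the option string adds `e:`, where the trailing colon tells `getopts` that `-e` takes an argument. A reduced sketch with only the `-e` and `-h` options, wrapped in a hypothetical function so it can be called repeatedly:

```shell
# Hypothetical wrapper (not in earlGrey): parse only -e and -h.
parse_heli() {
  heli=""
  OPTIND=1   # reset so the function can be called more than once
  while getopts "e:h" option; do
    case "$option" in
      e) heli=$OPTARG ;;                       # value given after -e
      h) echo "usage: -e yes|no"; return 0 ;;
      *) return 1 ;;                           # unknown option
    esac
  done
  # Default to "no" when -e was not supplied.
  echo "${heli:-no}"
}

parse_heli -e yes
parse_heli
```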
@@ -421,7 +450,7 @@ SECONDS=0
# Step 1 - set up the directories and make sure all modules are present

stage="Checking Parameters" && runningTea
SCRIPT_DIR=/home/toby/projects/earlGreyREwrite/EarlGreyUpdate/scripts
SCRIPT_DIR=/data/toby/EarlGrey/scripts/
Checks
stage="Making Directories" && runningTea
makeDirectory
@@ -504,8 +533,21 @@ if [ ! -f ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/*.tbl ]; then
sleep 1
else
stage="Final masking already complete, skipping..." && runningTea
div_file=$(readlink -f $OUTDIR/${species}_RepeatLandscape/${species}.divsum)
genome_size=$(sed -n '4p' ${OUTDIR}/${species}_RepeatMasker_Against_Custom_Library/*.tbl | rev | cut -f1,1 -d ':' | rev | sed 's/ bp.*//g; s/ //g')
sleep 1
fi

# Stage 5.5 - Run HELIANO as an optional step

if [ "$heli" == "yes" ]; then
if [ ! -s ${OUTDIR}/${species}_heliano/*/RC.representative.gff ]; then
stage="Running HELIANO to Detect Helitrons" && runningTea
timestamp=$(date +"%Y%m%d_%H%M")
heliano_optional
else
stage="HELITRON detection already complete, skipping..." && runningTea
helitron_gff="$(realpath $(ls -td -- ${OUTDIR}/${species}_heliano/*/ | head -n 1))/RC.representative.gff"
echo "HELITRON GFF: $helitron_gff"
fi
sleep 1
fi
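
When a previous HELIANO result exists, the stage above reuses the most recently modified run directory by listing directories newest first and taking the first entry. A small sketch of that selection in a temporary directory; the helper name and the two directory names are made up, following the `HEL_<timestamp>` pattern used by the script:

```shell
# Hypothetical helper: return the newest subdirectory of $1 (by modification time).
latest_run_dir() {
  ls -td -- "$1"/*/ | head -n 1
}

# Build two fake run directories with different modification times.
demo=$(mktemp -d)
mkdir "$demo/HEL_20240101_0900" "$demo/HEL_20240102_0900"
touch -t 202401010900 "$demo/HEL_20240101_0900"
touch -t 202401020900 "$demo/HEL_20240102_0900"

latest=$(latest_run_dir "$demo")
echo "most recent run: $(basename "$latest")"
rm -rf "$demo"
```

Selection depends on modification time rather than on the timestamp in the directory name, so touching an older directory would change which run is picked up.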

@@ -519,7 +561,6 @@ else
sleep 1
fi


# Stage 7
stage="Generating Summary Plots" && runningTea
charts