Find and load leaders

This section is also covered in Initial data loading.

Important! The leader regions must be identified in order to get properly oriented (by strand) spacer sequences for spacer blasting. Do this before attempting spacer blasting if you want to investigate PAMs, spacer-protospacer mismatches, or anything else that depends on the orientation (strand) of the protospacer.
The potential leader regions can simply be defined as the regions adjacent to the CRISPR array. By default, the direct repeat degeneracies in the arrays are used to help narrow down the leader region (this assumes that direct repeats are most conserved near the leader, and that degeneracies exist in the array).
By default, potential leader regions will only span from the CRISPR array to either the maximum possible length of the leader region (1000 bp) or the beginning of the closest gene, whichever comes first (this assumes leaders don't extend into genes).

getting potential leader regions

CLdb_getLeaderRegions.pl -d CLdb.sqlite > possible_leaders.fna

CLdb_getLeaderRegions.pl -d CLdb.sqlite -q "AND subtype='I-B'" > leaders_IB.fna

Align the leaders using mafft or another sequence aligner.

mafft --adjustdirection possible_leaders.fna > possible_leaders_aln.fna

If two leaders were written for a locus (i.e., one from each of the 3' and 5' ends of the array), remove the one that does not align.
View the alignment (e.g., in Jalview or Geneious) and determine where leader conservation ends.
- For example: conservation ends 50 bp from the end of the alignment (make a note: 50 bp).
- This unconserved sequence will be trimmed off when the leaders are loaded, so only the conserved region up to the CRISPR array is added to CLdb.
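
The trim value noted above is simple arithmetic on alignment columns, and can be sketched as follows. This is only an illustration under stated assumptions: the demo file, its sequence names, and the `last_conserved_col` value are hypothetical, and the unconserved region is assumed to lie at the right-hand end of the alignment. In practice, read the last conserved column off your alignment viewer.

```shell
# Hypothetical demo alignment standing in for possible_leaders_aln.fna
cat > demo_aln.fna <<'EOF'
>locus1_leader
ATGCATGC--
ATGCA
>locus2_leader
ATGCATGCTT
ATG-A
EOF

# Alignment length = number of columns in the first record
# (the sequence may wrap over several lines)
aln_len=$(awk '/^>/ { if (seen) exit; seen = 1; next }
               { n += length($0) }
               END { print n }' demo_aln.fna)

# Assumption: last conserved column as noted from the alignment viewer
last_conserved_col=10

# Number of unconserved columns at the end of the alignment
trim=$((aln_len - last_conserved_col))
echo "alignment length: $aln_len; trim: $trim"
```

Because the count is in alignment columns, gaps can make it differ slightly from the number of bases in any single sequence, so treat the result as a starting point.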

Both the aligned and unaligned sequences are needed because mafft can alter sequence orientation during alignment (--adjustdirection).
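
To see which leaders mafft reoriented, you can look for the `_R_` prefix that mafft adds to the names of sequences it reverse-complemented under --adjustdirection. The snippet below is a minimal sketch on a hypothetical demo file (the sequence names are made up); point the `grep` commands at possible_leaders_aln.fna for real data.

```shell
# Hypothetical demo alignment standing in for possible_leaders_aln.fna
cat > demo_leaders_aln.fna <<'EOF'
>locus1_leader
ATGC--ATGCA
>_R_locus2_leader
ATGCTTATG-A
>locus3_leader
ATG---ATGCA
EOF

# Count sequences that mafft reverse-complemented (names prefixed with _R_)
n_rev=$(grep -c '^>_R_' demo_leaders_aln.fna)
echo "reverse-complemented: $n_rev"

# List their original names (prefix removed)
rev_names=$(grep '^>_R_' demo_leaders_aln.fna | sed 's/^>_R_//')
echo "$rev_names"
```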

CLdb_loadLeaders.pl -d CLdb.sqlite -t 50 possible_leaders.fna possible_leaders_aln.fna

'-t 50' = trim off the 50 bp of unconserved sequence at the end of the alignment furthest from the CRISPR array

CLdb_groupLeaders.pl -d CLdb.sqlite