Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Error in rule pasta" when aligning paralogous gene sequences #71

Open
finn314 opened this issue Oct 24, 2024 · 0 comments
Open

"Error in rule pasta" when aligning paralogous gene sequences #71

finn314 opened this issue Oct 24, 2024 · 0 comments

Comments

@finn314
Copy link

finn314 commented Oct 24, 2024

Hello! I'm working with a program called ROADIES (link) that uses PASTA. ROADIES basically produces guide trees from provided genome assemblies by selecting a bunch of random "c-genes" (non-recombining sites within the genomes), finding each c-gene in each genome assembly, and aligning the matches to the c-gene with PASTA to produce a gene tree for each c-gene. These gene trees are then merged to create a species tree.

However, I keep encountering an issue - the program exits due to an error with PASTA when the "c-gene" happens to be a sequence from a highly paralogous gene family . Examples: KCNJ potassium transporters and olfactory receptor genes. I cannot for the life of me understand why the program crashes -- I have confirmed that there is adequate memory and CPU allocated to the program, and that this issue doesn't seem to be on the ROADIES end (it's a PASTA error).

Here's an example of the error message encountered when one of the c-genes (in this case, a randomly selected "gene 213" that corresponds to a sequence from EPPK1). EPPK1 is a member of a family of cytoskeletal elements.

Error in rule pasta:
    jobid: 710
    input: /lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa
    output: /lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa.aln
    shell:
        
		if [[ `grep -n '>' /lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa | wc -l` -gt 15 ]] || [[ `awk 'BEGIN{l=0;n=0;st=0}{if (substr($0,1,1) == ">") {st=1} else {st=2}; if(st==1) {n+=1} else if(st==2) {l+=length($0)}} END{if (n>0) {print int((l+n-1)/n)} else {print 0} }' /lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa` -gt 750 ]]
		then
			input_file=/lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa
			output_file=/lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa.aln
			reference=""
			all_matched=true

			while IFS= read -r line; do
				line=$(echo "$line" | tr '[:lower:]' '[:upper:]')
  				if [[ "$line" != ">"* ]]; then
    				if [ -z "$reference" ]; then
      					reference="$line"
    				elif [ "$line" != "$reference" ]; then
       					all_matched=false
      					break
    				fi
 	 			fi
			done < "$input_file"

			if [ "$all_matched" = true ]; then
				cp "$input_file" "$output_file"
			else
				run_pasta.py -i /lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa -j gene_213 --alignment-suffix=fa.aln --num-cpus 4
			fi
		fi
		touch /lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa.aln

		
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job pasta since they might be corrupted:
/lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa.aln

Has anyone encountered this error before / can anyone explain to me what this section of code is trying to do, and why it might be failing?

Thank you so much for any help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant