pxrms .txt file using wildcard #183
Comments
Hi @luizmchado. pxrms has a regex option (
At the moment, the regex ( |
Thanks for the insight @josephwb. Would it work if I added multiple names delimited by commas, e.g. pxrms -s TEST/test.fa -r Pilosocereus*,Micranthocereus*? |
Unfortunately, no. Or, not yet. The regex bit was implemented independently of the lists, and so at present it assumes only a single pattern; that is why I suggested the multiple passes idea. If you are really clever it may be possible to come up with a single regex that matches multiple cases, but this will not work for many scenarios. |
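For the simple case where every taxon to drop shares a known name prefix, the "single regex" idea can be sketched as an anchored alternation. This is a minimal sketch, not a confirmed pxrms feature: `join_prefixes` is a hypothetical helper, the sequence names are made up, and whether `pxrms -r` accepts alternation syntax is an assumption, so the pattern is checked with `grep -E` first and the pxrms call is left commented out.

```shell
#!/bin/bash
# Hypothetical helper: join name prefixes into one anchored alternation,
# e.g. "^(Pilosocereus|Micranthocereus)".
join_prefixes() {
    local IFS='|'          # "$*" joins arguments with the first character of IFS
    printf '^(%s)\n' "$*"
}

pattern=$(join_prefixes Pilosocereus Micranthocereus)
echo "$pattern"

# Try the pattern on some made-up sequence names before touching real data.
printf '%s\n' Pilosocereus_sp1 Micranthocereus_sp1 Melocactus_sp1 |
    grep -E "$pattern"

# Assumption: pxrms -r understands this alternation (untested here).
# The output file name is a placeholder.
# pxrms -s TEST/test.fa -r "$pattern" -o TEST/test.reduced.fa
```

Whatever pattern is used, quoting it matters: an unquoted `Pilosocereus*` can be expanded by the shell into matching file names before pxrms ever sees it.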
Nicely done :) I'll see how difficult it will be to ingest a list of
regexes to make things run more smoothly.
On Fri., Mar. 15, 2024, 16:49 Luiz Machado wrote:
Hello Joseph, after some work I came up with a solution.
I wrote a bash script that uses a 'for' loop to iterate through every FASTA
file in the specified directory (my biggest problem was the large volume:
577 .fasta files, and 92 taxa to be removed from them).
Here's the script I used:
#!/bin/bash
# Define the list of taxon names for each pass
taxon_names=(
    "Melocactus*"
    "Another_Taxon*"
    "Yet_Another_Taxon*"
    # Add more taxon names for additional passes as needed
)
# Define initial directory for the first pass
initial_dir="/mnt/c/users/gabit/linux2/3.Combined_paralogs/reduced/reduced2"
# Iterate over each taxon name and run pxrms command
for (( pass=0; pass<${#taxon_names[@]}; pass++ )); do
    # Determine the directory for the current pass
    pass_dir="$initial_dir/pass_$((pass+1))"
    mkdir -p "$pass_dir"
    # Process files using results from previous pass if available
    if (( pass > 0 )); then
        prev_pass_dir="$initial_dir/pass_$pass"
        for input_file in "$prev_pass_dir"/*.fasta; do
            output_file="$pass_dir/$(basename "$input_file")"
            pxrms -s "$input_file" -r "${taxon_names[pass]}" -o "$output_file"
        done
    else
        # Process files from initial directory for the first pass
        # (read from $initial_dir explicitly, so the script does not depend
        # on the directory it is launched from)
        for input_file in "$initial_dir"/*.fasta; do
            output_file="$pass_dir/$(basename "$input_file")"
            pxrms -s "$input_file" -r "${taxon_names[pass]}" -o "$output_file"
        done
    fi
done
Detailed explanation provided by ChatGPT:
"In this modified script:
Each pass creates a new directory (pass_1, pass_2, etc.) where the output
files are stored.
For each pass after the first one, the script processes the output files
from the previous pass, located in the directory of the previous pass.
For the first pass, the script processes the initial .fasta files directly
from the initial directory.
The pxrms command processes each input file and generates an output file
in the current pass directory.
This script ensures that each pass aggregates the results from the
previous pass while maintaining the integrity of the original data."
Hope this helps anybody in a similar situation!
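With 577 files and many passes, a quick sanity check on the last pass directory can confirm that nothing with an unwanted prefix survived. The following is a rough sketch under stated assumptions: `count_prefix_headers` is a hypothetical helper, the demo sequences are made up, and it matches a literal prefix at the start of each header rather than reproducing whatever matching pxrms does internally.

```shell
#!/bin/bash
# Hypothetical check: count FASTA headers whose name starts with a prefix.
count_prefix_headers() {
    local prefix=$1 fasta=$2
    # index(...) == 1 means the line begins with ">" followed by the prefix.
    # "n + 0" prints 0 instead of an empty string when nothing matched.
    awk -v p=">$prefix" 'index($0, p) == 1 { n++ } END { print n + 0 }' "$fasta"
}

# Demo on a small made-up file.
demo=$(mktemp)
printf '>Melocactus_sp1\nACGT\n>Pilosocereus_sp1\nTTGA\n>Melocactus_sp2\nGGCC\n' > "$demo"
count_prefix_headers Melocactus "$demo"
rm -f "$demo"

# Possible use on real output (the directory name is a placeholder):
# for f in pass_3/*.fasta; do
#     n=$(count_prefix_headers Melocactus "$f")
#     (( n > 0 )) && echo "$f still has $n Melocactus sequences"
# done
```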
|
Reopening bc it seems doable. |
Hi, is it possible to use a wildcard in the specified .txt file, e.g. to remove every taxon whose name starts with Pilosocereus (Pilosocereus*)?
This is the command I'm using:
for i in *.fasta; do pxrms -s "$i" -f /mnt/c/users/gabit/linux2/3.Combined_paralogs/reduced/taxa_to_delete_multiple.txt -o /mnt/c/users/gabit/linux2/3.Combined_paralogs/reduced/reduced2/"${i%.fasta}.fasta"; done
I've tried it multiple times unsuccessfully, even using -n and specifying the name as Pilosocereus.
Any help would be appreciated.
taxa_to_delete_multiple.txt
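One way to get wildcard-like behaviour with a names file is to expand the prefix into explicit names per FASTA file and pass that list to -f. This is a minimal sketch, not part of pxrms: `names_with_prefix` is a hypothetical helper, the demo sequences are made up, and it assumes that a sequence's name is the full header line after ">" and that the -f file holds one name per line.

```shell
#!/bin/bash
# Hypothetical helper: print every sequence name in a FASTA file that starts
# with the given prefix (the "Pilosocereus*" case).
names_with_prefix() {
    local prefix=$1 fasta=$2
    # Keep header lines only, strip the leading ">", then keep the names
    # that begin with the literal prefix (no regex involved).
    sed -n 's/^>//p' "$fasta" |
        awk -v p="$prefix" 'index($0, p) == 1'
}

# Demo on a small made-up file.
demo=$(mktemp)
printf '>Pilosocereus_sp1\nACGT\n>Melocactus_sp1\nACGT\n>Pilosocereus_sp2\nTTGA\n' > "$demo"
names_with_prefix Pilosocereus "$demo"
rm -f "$demo"

# Possible use with pxrms (file and directory names are placeholders):
# for i in *.fasta; do
#     names_with_prefix Pilosocereus "$i" > "${i%.fasta}.names.txt"
#     pxrms -s "$i" -f "${i%.fasta}.names.txt" -o "reduced2/$i"
# done
```

Building the list per file keeps each names file limited to taxa that actually occur in that alignment.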