Skip to content

BioPython

Meg Staton edited this page Oct 21, 2020 · 4 revisions

Biopython

Demo on ISAAC

module load biopython/1.72

Example script 1 - Lets create a sequence object and do some common sequence manipulations.

from Bio.Seq import Seq
my_seq = Seq("AGTACACTGGT")
print(my_seq)
print(my_seq.complement())
print(my_seq.reverse_complement())
print(my_seq)

Example script 2 - We'll parse through the sequences in a fasta file and print their lengths. Get the fasta file first.

On command line:

wget https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta

How many sequences does it have?

grep -c '^>' ls_orchid.fasta

And the script:

from Bio import SeqIO

for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(len(seq_record))

What if we wanted the count the number of sequences?

from Bio import SeqIO

seq_count = 0
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    seq_count += 1

print("The number of sequences is: " + str(seq_count))

What if I need to get the sequence as a string? (So I can use it with regular expressions, or find an adapter, or do something else with it)

from Bio import SeqIO

seq_count = 0
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    bases = seq_record.seq
    len_bases = len(bases)
    seq_count += 1
    #print seq name and first 5 characters then total length
    print(seq_record.id + "\t" + bases[0:5] + "\t" + str(len_bases))

print("The number of sequences is: " + str(seq_count))