87.2 BioPython: Sequence Analysis and Bioinformatics
Right, so you’ve got a string of DNA. It’s just a text file, really, but if you try to parse it with standard Python, you’re going to end up writing a bunch of tedious, error-prone string manipulation code that’ll break the second someone sends you a FASTA file with a slightly different header format. This is why BioPython exists. It’s not just a library; it’s a collection of sensible, battle-hardened abstractions for the messy world of biological data. It lets you stop thinking about file formats and start thinking about biology. Let’s get our hands dirty.
First things first, you’ll need to install it. pip install biopython is your gateway drug. Now, let’s talk about the workhorse: the Seq object. This is why you’re here. You might think a DNA sequence is a string, and you’d be mostly right, but also dangerously wrong. A simple Python string doesn’t know that ‘A’ pairs with ‘T’, or that you can’t transcribe a protein sequence.
from Bio.Seq import Seq
# This is not a string. It's a Seq object, and it's about to make your life easier.
dna_sequence = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
# Looks like a string, right?
print(dna_sequence) # Output: ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
# But it can do things a string can't.
complement = dna_sequence.complement()
reverse_complement = dna_sequence.reverse_complement()
print(complement) # Output: TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC
print(reverse_complement) # Output: CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT
See that? One method call. You didn’t have to write a dictionary to map bases or reverse the string yourself. This is the BioPython ethos: encapsulate the common, error-prone operations so you don’t have to.
Reading Sequence Files: FASTA & Friends
Biology runs on plain text files, and the FASTA format is the granddaddy of them all. It’s deceptively simple, which is why everyone has their own slight variation. BioPython’s SeqIO module (Sequence Input/Output) handles this chaos for you.
from Bio import SeqIO
# Let's read a FASTA file. This is the one-liner that saves you hours.
for record in SeqIO.parse("my_sequences.fasta", "fasta"):
print(f"ID: {record.id}")
print(f"Description: {record.description}")
print(f"Sequence: {record.seq}")
print(f"Sequence length: {len(record.seq)}")
print("---")
The magic here is that SeqIO.parse() returns an iterator of SeqRecord objects, not just raw sequences. A SeqRecord holds the sequence itself (record.seq, which is a Seq object!), along with its metadata (id, description), and optionally annotations and other features. This is the correct way to model biological data, and it prevents you from inevitably creating a messy dictionary-based system yourself.
A common pitfall? Assuming the file is well-formed. Always wrap your parsing in a try-except block because you will, at some point, get a file with a blank line or a corrupted entry that will blow up the parser.
Translating Sequences: More Than Just str.replace()
Want to transcribe DNA to RNA? Or translate to a protein? This is where pretending a sequence is a basic string will lead you astray. What about ambiguous bases? What if the sequence length isn’t a multiple of three? BioPython handles these edge cases gracefully.
# Transcribe DNA to RNA (just replaces T with U, but done properly)
messenger_rna = dna_sequence.transcribe()
print(messenger_rna) # Output: AUGGCCAUUGUAUGGGCCGCUGAAAGGGUGCCCGAUAG
# Translate to a protein sequence
protein_seq = messenger_rna.translate()
print(protein_seq) # Output: MAIVMGR*KGAR*
# Wait, why are there stop codons (*)?
# Let's try a different translation table (the standard one by default is 11)
protein_seq_no_stop = messenger_rna.translate(to_stop=True)
print(protein_seq_no_stop) # Output: MAIVMGR
# Or use a different genetic code (e.g., for mitochondria)
protein_seq_mito = messenger_rna.translate(table="Vertebrate Mitochondrial")
print(protein_seq_mito) # Output: MAIVMGRWKGAR*
The to_stop=True parameter is a lifesaver. It tells the translator to stop at the first stop codon it encounters, instead of just plowing through and representing stop codons with an asterisk. This is a classic example of a rough edge in the data that BioPython anticipates for you. Knowing which translation table to use is on you, though—the library can’t read the experiment’s methodology section.
BLAST: Automating the “Hey, I’ve Seen This Before”
Running a BLAST search manually through a web browser is for chumps. BioPython can automate this, but a word of caution: be polite. Don’t hammer the NCBI’s servers with thousands of requests from a script. They will rightfully block you.
from Bio.Blast import NCBIWWW, NCBIXML
# Run a blastn search against the nr database
result_handle = NCBIWWW.qblast("blastn", "nt", dna_sequence)
# Parse the XML output (because of course BLAST returns XML)
blast_records = NCBIXML.parse(result_handle)
# Iterate through the results
for blast_record in blast_records:
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.expect < 1e-50: # E-value threshold
print(f"****Alignment****")
print(f"Sequence: {alignment.title}")
print(f"E-value: {hsp.expect}")
print(f"Length: {alignment.length}")
print(hsp.query[0:75] + "...")
print(hsp.match[0:75] + "...")
print(hsp.sbjct[0:75] + "...")
The NCBIXML module parses the truly heinous XML output into a much more manageable Python object hierarchy. The key best practice here is to always check the expect value (E-value) to judge the significance of a hit. Aligning to random junk is easy; aligning with a fantastically low E-value is biologically interesting.
Where BioPython Gets Weird
It’s not all perfect. The object hierarchy can feel deep and a bit convoluted at times. The documentation is comprehensive but can be overwhelming—you often need to know what you’re looking for before you can find it. And some modules, like the Bio.Graphics module for drawing genome diagrams, are fantastically powerful but have a learning curve steeper than a cliff face. They made a choice to provide maximum flexibility, which sometimes comes at the cost of immediate simplicity. But honestly, that’s a trade-off I’ll take. It means the library doesn’t paint you into a corner when your problem gets complex, which in bioinformatics, it always does.