Nature vs Nurture
Despite my original ambitions at the start of my career, I never ended up becoming a bioinformatician of any repute. I studied for degrees in both Science and Information Technology, but I quickly found out that I would be better suited as a bioinformatics end user rather than a bioinformatics developer.
All the software engineers in my classes had been coding since they were toddlers, but for me it felt like learning how to ride a unicycle upside down. It’s pretty embarrassing to struggle so much in programming 101 but some people’s brains are just innately better tuned for computer programming, just like others are naturally talented at music or art.
That doesn’t mean we should avoid the things we aren’t naturally talented at. In fact, there’s a lot to be said for knowing how to nurture our skills under non-ideal conditions.
Interdisciplinarity
Biologists need a better understanding of bioinformatics, just like software engineers need to know more about biology. Genes are mysterious black boxes to most students, and the power of bioinformatics as a tool for learning is that it opens those boxes up. Sure, they understand in the abstract that DNA sequence is transcribed into mRNA which is then translated into protein, but to them there’s no functional distinction between different sections of the DNA sequence. Bioinformatics forces students to actually go through the sequence base-by-base and highlight the gene’s different features. Can they find the start and stop codons within an open reading frame? What about the upstream promoter where the polymerase enzymes bind? How can you tell non-coding introns apart from coding exons? Which part of the translated gene product is phosphorylated by different enzymes, or removed altogether via peptide cleavage sites? There are bioinformatic tools to predict each of these genetic features using databases of consensus sequences, and it’s pretty much plug and play.
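To make the start and stop codon question concrete, here is a minimal sketch of an open reading frame finder. It is my own illustration rather than how any particular tool works: the function name find_orfs and the min_codons parameter are made up for this example, it only scans the three forward-strand frames, and it assumes a plain DNA string using the standard stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ATG...stop stretch on the forward strand.

    start and end are 0-based slice positions, so dna[start:end] is the ORF
    including its stop codon. min_codons counts the stop codon as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, frame))
                start = None  # look for the next ORF in this frame
    return orfs

sequence = "CCATGAAATGGTAACC"
for start, end, frame in find_orfs(sequence):
    print(frame, sequence[start:end])
```

Real gene prediction tools do a lot more than this (reverse strand, introns, alternative start codons), but even a toy version like this gets students reading the sequence in triplets.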
Back to Basics
Finding the gene or protein sequence is no longer an obstacle - it’s very much Google-able as the right databases have already been indexed by Google’s omnipresent crawler algorithms. But zooming out to compare your gene of interest to other genes is still within the realm of relatively niche bioinformatics tools. Does your gene belong to a gene family, whose members share similar features and functions? Is your gene of interest found in all types of cells, or only in select tissues or organisms? NCBI’s Basic Local Alignment Search Tool - or BLAST - is the closest thing to “Google” out of the sequence search engines. There are different flavours of BLAST depending on what you enter as your search query. BLASTn uses a nucleotide (DNA or RNA) sequence to search nucleotide databases. BLASTp uses a protein or amino acid sequence to search protein databases. These are like for like searches, but this is where it gets interesting. BLASTx takes a nucleotide sequence as its query input, and searches protein databases. How does that work? The program translates your nucleotide query in every reading frame - all 6 - to check whether the sequence could encode a matching protein in any of them. tBLASTn works in a similar way but in reverse - it accepts a protein sequence as the query, and searches nucleotide databases by translating every database sequence in all 6 reading frames and comparing the results against your protein.
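The “all 6 reading frames” idea is easier to see in code. The sketch below is only an illustration of six-frame translation using the standard genetic code, not BLAST’s actual implementation; the function names and the "+1" to "-3" frame labels are my own choices for this example.

```python
from itertools import product

# Standard genetic code, built in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons only; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frame_translation(dna):
    """Translate three offsets on each strand, labelled +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

for label, protein in six_frame_translation("ATGGCCTAA").items():
    print(label, protein)
```

Three frames come from shifting the starting base by 0, 1 or 2 on the given strand, and the other three come from doing the same on the reverse complement.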
Filtering the Outputs
What comes back is an alignment between your query sequence and a sequence flagged for similarity from the database. The algorithm tries to line up the two in the way that best showcases how similar they are, and provides metrics that help you quantify this similarity too. For example, in tBLASTn, each search result comes with:
Identities - the number of positions where the two sequences have exactly the same amino acid,
Positives - the number of positions where the two sequences have amino acids with very similar chemical properties (identical ones included), and
E-value - the number of alignments scoring at least this well that you would expect to turn up purely by “random chance” in a database of that size.
If the two sequences are related, you’d expect identities and positives to be high (as close to 100% as you can get; identity cutoffs of >50% or >70% are often used), and the E-value to be as close to 0 as possible. Strictly speaking the E-value is an expected count rather than a probability, so it can be greater than 1, but for very small values the two are nearly the same. In other words, a low E-value means there is a very low probability that the two sequences were just randomly paired up as a statistical anomaly. Comparing two sequences is known as a pairwise sequence alignment, but what if you had more than two sequences to analyse? EMBL-EBI’s Clustal Omega tool is the go-to for aligning three or more sequences, known as a multiple sequence alignment. The differences and similarities at each position across all of the sequences you’re comparing can be used to generate a phylogenetic tree, and estimate the relatedness between gene sequences with robust underlying statistical models.
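As a rough illustration of how identities can be counted once sequences are aligned, here is a small sketch that computes percent identity for every pair in a multiple sequence alignment. The sequences and species labels are invented for the example, the function names are mine, and the convention of counting a gap opposite a residue as a mismatch is an assumption; real tools may treat gaps differently.

```python
from itertools import combinations

def percent_identity(a, b):
    """Percent of aligned columns where both sequences carry the same residue.

    Columns where both sequences have a gap ('-') are ignored; a gap
    opposite a residue counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y and x != "-")
    return 100.0 * matches / len(columns)

def pairwise_identities(alignment):
    """Map each pair of sequence names to its percent identity."""
    return {
        (n1, n2): percent_identity(alignment[n1], alignment[n2])
        for n1, n2 in combinations(alignment, 2)
    }

# Made-up aligned sequences, purely for illustration.
alignment = {
    "human": "ATGGCC-TAA",
    "mouse": "ATGGCAGTAA",
    "yeast": "ATGTCA-TAA",
}
for pair, identity in pairwise_identities(alignment).items():
    print(pair, round(identity, 1))
```

Subtracting each value from 100 gives a crude distance between sequences, which is the kind of pairwise table that simple tree-building methods start from.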
These foundational bioinformatics tools are by no means at the forefront of innovation in our discipline, but they’re more than enough to design authentic research learning activities in our classes. In our latest video, I walk through a number of inquiry-based ideas centred around these bioinformatics tools, and how they can be connected to undergraduate research projects or Citizen Science.
Jack.