Haskell Meets the Central Dogma
11 Jan 2016This post will introduce some Haskell code to perform sequence operations relevant to the central dogma of molecular biology. The code here should be relatively simple, but some slight familiarity with Haskell or functional programming is assumed.
The treatment of biology in this post is also kept simple, though if you don’t know what the central dogma of molecular biology is or could do with a brief review, I have written a bit on it; it is meant to give a little background for some of the molecular biology terms used in this post, and to describe its context in the scope of biology today.
Here’s the code from this post if you’d like to follow along.
In ghci
you can load the file via :l centraldogma.hs
to use the functions defined here..
Some helpful type synonyms
I’ll be working with sequences here using Haskell’s native String
type. Though
DNA, RNA, and protein sequences will all share this same underlying type it will
be helpful, in terms of documenting the code, to be able to convey whether a
function is meant to work with a particular kind of sequence or another. To do
this let’s make use of the type
declaration which creates type synonyms.
These should help clarify the code’s intent.
Note that in Haskell a string is a special case of a list, []
, one that contains Char
elements:
[Char]
. These will be more informative than just Char
and String
later on:
Replication, Replication
So let’s get started with replication, one of the basic operations of DNA and RNA in which an exact copy (in its own medium) is produced. Since replication (when we ignore the possibility of error) can be modeled by defining a function that simply returns its own input, it’s rather trivial to implement. Here I’ll choose to define a function that works element-wise, and a sequence-wise function which is the mapping of the element-wise one over all of its elements. This pattern will be employed throughout.
When we supply a sequence string to the replication
function like
replication "GATTACA"
, it should give us "GATTACA"
when it evaluates.
You could try this out with various sequence strings, the empty string is an
important test as well. Here’s a little function to help test if the results are good.
Everybody likes complements
Now let’s take a look at how we can write code to get the complementary sequence
for a DNA or RNA sequence. Recall that DNA and RNA form double stranded
structures according to nucleobase complementarity: A
pairs with T
(or U
in RNA) and C
pairs with G
. Here’s a representation of this in code:
These examples leverage pattern matching on their input. Note that we also have
the option of writing it with guards, though I don’t prefer guard syntax for pattern
matching. Here’s complementDNABase
using guards just to demonstrate:
Giving "GATTACA"
to complementDNA
should yield "CTAATGT"
and giving the
corresponding "GAUUACA"
to complementRNA
should yield "CUAAUCU"
.
Complementation is a reciprocal process, meaning that if we produce s'
as the
complement of s
, then s
is also the complement of s'
. This is an important
property of the function, so let’s test it by making sure that if either of the
functions reproduce the input when executed twice in succession.
Use these functions on any sequence string to see if the function was correctly written.
DNA and RNA sequences are traditionally presented in a specific direction: 5’ to
3’ (details here).
It is important to remember that the complementing strand of DNA or RNA
generally “runs” in the opposite direction. That is to say, if you take GATTACA
which is reading 5’ to 3’, and produce it’s complement CTAATGT
, the complement
produced is reading 3’ to 5’. It is common then to refer to the “reverse
complement” of a sequence as the complement read in the conventional 5’ to 3’
fashion. Producing such a reverse complement function in Haskell is simple:
Here I’ve used function composition with Haskell’s
reverse
function to easily define the reverse complement functions. That’s pretty handy.
Transcription and noitpircsnarT
The process of transcription refers to the production of RNA from a DNA template
and reverse transcription the production of DNA from an RNA template. This is
again mediated by base complementarity, and the code for it can be readily made
by adapting our complementing functions from above. When a DNA sequence, s
, is
transcribed, a copy of that sequence is made in RNA using base complementarity
to the complementing DNA strand, s'
. Since RNA nucleobases differ in the use
of Uracil instead of Thymine, the matter of transcription involves replacing
our 'T'
s with 'U'
s and vice versa for reverse transcription.
Transcription and reverse transcription functions should be symmetrical. By which I mean to say that if one transcribes a DNA sequence and then reverse transcribes the result, it should reproduce the input. The same is true for reverse transcription of an RNA sequence then transcription of the result. We can make some fun functions to test whether this holds true:
These only really work if the inputs are valid. Passing it a mixed sequence like “GATUACA” should evaluate False in both functions.
Translation -> Traducción
Translation is the process of creating proteins from messenger RNA sequences. RNA is read in chunks of three nucleotides, which constitute a codon, to determine which amino acid to add to a growing protein chain. The “Genetic Code” is the (mostly universal) coding scheme used in biology that maps codons to amino acids. Without further ado, here’s the genetic code in haskell:
When it comes to expressing coding schemes, there’s just no way to cut corners so that got a little verbose. Writing that sort of code is no fun, but there was no way around it1. So now that we’ve got our genetic code, let’s put it to work translating some codons to a protein sequence.
chunksOf 3
creates a function that takes a list, and returns a list of lists
of length 3, though the last item might be shorter if the input list is not evenly
divisible by 3. We can feed our sequence string to this, strings are just lists
of characters after all, and get a list of codons back. chunksOf 3 "GAUUACA"
would produce ["GAU", "UAC", "A"]
. Mapping the genetic code function over that
list of strings would give "DY-"
.
If we want to translate DNA sequences, we could go the hard way and write a new function, or we could go the smart way and create easily with function composition:
Now we can do translateDNA "GATTACA"
to get "DY-"
.
For a change of pace, let’s do something a bit less frivolous to test our work. Let’s visit the NCBI Gene page of a gene I like and snag the sequences of an mRNA and the protein it produces. Here’s the mRNA sequence file and the protein sequence file. For simplicity I’ll put them here:
>gi|357933616|ref|NM_001252617.1| Homo sapiens lipocalin 1 (LCN1), transcript variant 2, mRNA
ACAGCCTCTCCCAGCCCCAGCAAGCGACCTGTCAGGCGGCCGTGGACTCAGACTCCGGAGATGAAGCCCC
TGCTCCTGGCCGTCAGCCTTGGCCTCATTGCTGCCCTGCAGGCCCACCACCTCCTGGCCTCAGACGAGGA
GATTCAGGATGTGTCAGGGACGTGGTATCTGAAGGCCATGACGGTGGACAGGGAGTTCCCTGAGATGAAT
CTGGAATCGGTGACACCCATGACCCTCACGACCCTGGAAGGGGGCAACCTGGAAGCCAAGGTCACCATGC
TGATAAGTGGCCGGTGCCAGGAGGTGAAGGCCGTCCTGGAGAAAACTGACGAGCCGGGAAAATACACGGC
CGACGGGGGCAAGCACGTGGCATACATCATCAGGTCGCACGTGAAGGACCACTACATCTTTTACTGTGAG
GGCGAGCTGCACGGGAAGCCGGTCCGAGGGGTGAAGCTCGTGGGCAGAGACCCCAAGAACAACCTGGAAG
CCTTGGAGGACTTTGAGAAAGCCGCAGGAGCCCGCGGACTCAGCACGGAGAGCATCCTCATCCCCAGGCA
GAGCGAAACCTGCTCTCCAGGGAGCGATTAGGGGGACACCTTGGCTCCTCAGCAGCCCAAGGACGGCACC
ATCCAGCACCTCCGTCATTCACAGGGACATGGAAAAAGCTCCCCACCCCTGCAGAACGCGGCTGGCTGCA
CCCCTTCCTACCACCCCCCGCCTTCCCCCTGCCCTGCGCCCCCTCTCCTGGTTCTCCATAAAGAGCTTCA
GCAGTTCCCAGTGA
>gi|357933617|ref|NP_001239546.1| lipocalin-1 isoform 1 precursor [Homo sapiens]
MKPLLLAVSLGLIAALQAHHLLASDEEIQDVSGTWYLKAMTVDREFPEMNLESVTPMTLTTLEGGNLEAK
VTMLISGRCQEVKAVLEKTDEPGKYTADGGKHVAYIIRSHVKDHYIFYCEGELHGKPVRGVKLVGRDPKN
NLEALEDFEKAAGARGLSTESILIPRQSETCSPGSD
The idea now is to make sure that the results of our translation function when fed the mRNA sequence jives with the reference implementation’s results. Let’s get down to work! I don’t want to have to type or copy the sequence into the source, so let’s make a function to read the sequences out of the files.
This is not a proper FASTA parsing job, but it serves for now. I’ve dropped the
first line describing the sequence and gotten rid of the newline characters. So
now I can do the following in ghci
:
It’s pretty clear that our results don’t match. Some readers might have seen this coming, because I’ve not prepared to handle an important detail! If you look carefully you’ll see that the expected sequence is there, but sandwiched between some other protein sequence. Without getting into specifics, mRNA sequences contain untranslated regions (UTRs) at both the 5’ and 3’ ends of the sequence. These untranslated regions are a part of the sequence file; to figure out what the ribosome would actually produce from this particular mRNA we’ll need to know how to drop them (UTRs are totally interesting in their own right, but that’s a story for another time).
To determine where the translated section begins, we must look for the first (5’
most) AUG
codon, this is the start codon and codes for Methionine.
Everything from that point on is
translated until we reach a stop codon: one of UAA
, UAG
, UGA
. The region
from the start codon to the stop codon is the coding region.
Let’s implement these rules in code now:
This now drops everything before the start codon, and keeps everything until the
stop codon. Now we can try this function out and test its results in ghci
:
Tada! We can translate a given DNA or RNA sequence as well as fish out the first coding
region from a sequence. By the way, have you thought it’s weird that I’ve been
using translateDNA
on an RNA sequence? Yeah, it’s not just you. Nucleic acid
sequences are often stored in the DNA alphabet, regardless of whether it makes
more physical/biological sense as RNA. It’s a bit odd, but common enough to
gloss over transcription and reverse transcription since they directly correlate.
Translation Party: Reverse Translation
In my previous post, I made something of a big deal about how reverse translation (also sometimes called “back translation”) has never been observed and we’ve got pretty good reasons to never expect it to occur. But it turns out that there are plenty of times where we might want to address reverse translation computationally.
Most amino acids have more than one codon that codes for them, but no codon codes for more than one amino acid. In this way, the genetic code is degenerate but not ambiguous. In other words, a given nucleic acid sequence passed through the genetic code will only produce one protein sequence, but a given protein sequence could be decoded to multiple nucleic acid sequences.
This leads to a very hard problem in trying to represent all the possible encoding sequences. Only for rather short protein sequences is it computationally feasible to consider representing all the combinations in memory or storage, or to iteratively consider all of the possibilities.
I think that looking deeper into reverse translation applications is worthy of one or more focused posts, but it’d be good to give a short treatment here. I’ll define a function that decodes a protein sequence to its possible DNA sequences. Might get a little lengthy again…
With this element-wise function in hand, we can create a reverse translation function to work on a protein sequence:
For each amino acid in the protein sequence this function will produce a list of the codons that can code for it, resulting in a list of codon options. If we wanted to ask how many different ways the protein sequence could be encoded, we could do the following:
This folds our list up by taking the length of each sub-list and multiplying it
by the accumulated value which I initiated at 1
. Note that I used genericLength
from Data.List
so that it would return the length as type Integer
which can
hold an arbitrarily large number (these numbers get big). Let’s give it a couple tries:
Damn, that’s a lot of ways that this short sequence (only 176 amino acids) could be
represented in DNA. Also interesting is that giving uniqueCodingsCount
a sequence
with an invalid amino acid such as 'B'
correctly reports that there are 0
ways to encode this protein.
Now let’s set to work creating a function that will actually generate the unique DNA sequence combinations that can code for a protein sequence. The signature for this function should look like this:
It’s clear that this function will make use of reverseTranslate
whose result
type is [[DNACodon]]
, so we can begin to think how we might convert that to
[DNASeq]
. To do this we can make use of a fold function (like I have already
above) with a function and an accumulator to reduce the list. So what kind of
function might we be interested in? Consider the behavior of list comprehensions:
When we draw from two lists in the creation of a new list, the list is produced as an operation with all the combinations of the elements in the drawn lists. To help illustrate how we can make use of this, here’s one more example:
Putting the pieces together:
Remember that massive number we got from uniqueCodingsCount
on the lipocalin
protein sequence? We really don’t want to go printing that many of anything…
let’s keep the tests small and reasonable.
Pretty cool, the empty protein sequence can only be encoded as an empty DNA
sequence. For a lone histidine "H"
we essentially get codonsFor 'H'
and for
"HAM"
we get a list of 9 unique combinations of codons. When I gave it
bad amino acids it tells me there is no way to encode such a sequence. We
expect length uniqueCodings s
should be equal to uniqueCodingsCount s
:
It looks like our function checks out, we now have a way to work with the DNA sequences that can code for a specified protein. To wrap this post up, I’ll put that function to work solving a real world problem (I’ve done this in my own research).
Storytime
Imagine you are a scientist who has a protein you would like to produce and study. You know the amino acid sequence for the protein and you are ordering the synthesis of a DNA sequence to code for your protein. For reasons you try valiantly (but unsuccessfully) to explain to your great uncle Max during Thanksgiving dinner, you wish to incorporate a specific DNA sequence in a particular segment of your protein’s coding region. This is to allow a restriction enzyme to cut your DNA sequence for important and inscrutable purposes.
Now your DNA sequence design has a constraint. If you can help it, you’d prefer not to mutate the protein sequence. Perhaps you can take advantage of the degeneracy of the genetic code to choose codons that provide your restriction enzyme site without modifying any amino acids.
You select a small region of your protein that is suitable for the site: “ALGYL”
You ask, “Can I put an EcoRI
or EcoRV
site in here?”
The solution
You’re smart; you decide to let a computer do the investigation.
Sadly you can’t put in an EcoRI site without changing amino acids, but you can choose your codons to incorporate the EcoRV site. I greatly abbreviated the output of that final evaluation, it actually gives 96 result sequences! If the sequence search space were much larger, it could be hard to even find the codons of note. A good way to improve the code would be to have the code identify them. In this case the EcoRV site spans three codons for “GYL”. If we narrow the search to just those three we get an exact notion of our options:
With that bit of practical problem solving, we conclude our demonstration.
Footnotes
1: I could employ the IUPAC compression format to make this shorter, but that would just pass the buck and I’d have to write the details in an IUPAC function. It would be reusable though, and might be worth doing for that.