This post recaps the previous one,
Haskell Meets the Central Dogma,
in order to introduce algebraic data
types.
Using them will let me refine the design, present a few more
features of Haskell, and continue a gradual exposition.
The first post in this series, A Review of the Central Dogma,
covers the biological context of the functions presented.
I previously began by declaring some type synonyms in my code so that I could
then differentiate between functions meant to operate with DNA (DNASeq),
RNA (RNASeq), or protein (ProteinSeq). The type synonyms were all equivalent
to Haskell’s String type: [Char]. Here is that code for quick reference:
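A minimal sketch of those declarations, assuming each was a plain alias for String as described above:

```haskell
-- Three names for one and the same type: String, i.e. [Char].
type DNASeq     = String
type RNASeq     = String
type ProteinSeq = String
```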
With these in place I could declare translate :: RNASeq -> ProteinSeq in lieu
of translate :: String -> String to indicate that the expected behavior is
meaningful for particular String values, not String values in general. Yet
since these declarations are fundamentally the same, we are still able to do
some weird things that it might be best to prevent:
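For instance, the sketch below type-checks without complaint. The transcribe function is a hypothetical stand-in (it only swaps T for U) and is not the definition from the previous post; peptide and oops are likewise made up for illustration:

```haskell
type DNASeq     = String
type RNASeq     = String
type ProteinSeq = String

-- Hypothetical stand-in for transcription: replace thymine with uracil.
transcribe :: DNASeq -> RNASeq
transcribe = map (\c -> if c == 'T' then 'U' else c)

peptide :: ProteinSeq
peptide = "MTKW"

-- Biologically meaningless, yet accepted by the compiler: a ProteinSeq is
-- just a String, and so is the DNASeq that transcribe expects.
oops :: RNASeq
oops = transcribe peptide
```

Here oops evaluates to "MUKW", a "transcribed protein" that no type annotation warned us about.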
The takeaway lesson here is that while using type to create synonyms can help
with code readability where you’d like to tersely annotate something with a
comment (it’s like laying a comment on top of your type), the synonyms don’t
meaningfully change anything below the surface.
Let’s explore using some algebraic data types to represent our
data which will simultaneously improve the performance, clarity, and rigor of
the code.
Algebraic types with data
So let’s begin by introducing some custom data types to represent nucleic acid
sequences and protein sequences. I will not preserve the conceptual distinction
between DNA and RNA at the type level as I did before; because they are
directly correlated (to the extent that we are concerned here), it makes sense
to think of DNA and RNA as simply different representations of the same
underlying nucleic acid sequence. This is a common convention in
bioinformatics, and it will help to reduce the amount and complexity of our
code. I will not cover ambiguous nucleotides or amino acids here.
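A sketch of what such declarations could look like. The constructor names Adenine, Cytosine, Guanosine, Thymine, Methionine and Tryptophan appear in this post; the remaining amino acid names, the Stop constructor, and the derived instances are assumptions made for this sketch:

```haskell
-- One constructor per base; DNA and RNA share this representation.
data Nucleotide = Adenine | Cytosine | Guanosine | Thymine
  deriving (Eq, Show)

data AminoAcid
  = Alanine | Arginine | Asparagine | AsparticAcid | Cysteine
  | GlutamicAcid | Glutamine | Glycine | Histidine | Isoleucine
  | Leucine | Lysine | Methionine | Phenylalanine | Proline
  | Serine | Threonine | Tryptophan | Tyrosine | Valine
  | Stop  -- assumed marker for a stop codon
  deriving (Eq, Show)

-- Sequences are plain lists of their subunits.
type NucleicSeq = [Nucleotide]
type ProteinSeq = [AminoAcid]
```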
Here I’ve created two new data types: Nucleotide, and AminoAcid. These new
data types have their own unique value constructors. The type synonyms
NucleicSeq and ProteinSeq suggest my intent to work with these sequences as
lists of their respective subunits.
Let’s get a quick feel for this system:
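A small sketch, assuming the four nucleotide constructors named in this post and an AminoAcid type trimmed to three constructors for brevity (gattaca and peptide are illustrative names):

```haskell
data Nucleotide = Adenine | Cytosine | Guanosine | Thymine
  deriving (Eq, Show)

-- Trimmed for brevity; the full type has one constructor per amino acid.
data AminoAcid = Methionine | Lysine | Tryptophan
  deriving (Eq, Show)

type NucleicSeq = [Nucleotide]
type ProteinSeq = [AminoAcid]

gattaca :: NucleicSeq
gattaca = [Guanosine, Adenine, Thymine, Thymine, Adenine, Cytosine, Adenine]

peptide :: ProteinSeq
peptide = [Methionine, Lysine, Tryptophan]

-- Rejected by the type checker if uncommented, because a list cannot
-- hold both Nucleotide and AminoAcid values:
-- mixed = [Adenine, Methionine]
```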
Because lists in Haskell are homogeneous structures, we are now sure that we
cannot compile some code in which we’ve accidentally tried to combine the types
into one value. Since the types are distinct, it is now clear to both the reader
and the compiler that translate :: NucleicSeq -> ProteinSeq uses
[Nucleotide] as its input type.
It looks like working with this system could get a little cumbersome in terms of
typing: "GATTACA" was much easier to type than [Guanosine, Adenine, Thymine,
Thymine, Adenine, Cytosine, Adenine]. Let’s take a moment to write some helpful
functions for nicely inputting to and outputting from this model:
So now where it is more convenient we have the ability to make sequences from
Strings and to produce Strings from our sequences like so:
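A sketch of such helpers for nucleic acid sequences, using lookups on an association list. The names nucleotideCodes, nucleotideLetters, mkNucleicSeq and showNucleicSeq are assumptions for this example, as is the choice to silently drop unrecognised characters:

```haskell
import Data.Maybe (mapMaybe)
import Data.Tuple (swap)

data Nucleotide = Adenine | Cytosine | Guanosine | Thymine
  deriving (Eq, Show)

type NucleicSeq = [Nucleotide]

-- Association list from one-letter codes to nucleotides.
nucleotideCodes :: [(Char, Nucleotide)]
nucleotideCodes =
  [('A', Adenine), ('C', Cytosine), ('G', Guanosine), ('T', Thymine)]

-- The same table, keyed by nucleotide instead.
nucleotideLetters :: [(Nucleotide, Char)]
nucleotideLetters = map swap nucleotideCodes

-- Build a sequence from a String; unrecognised characters are dropped.
mkNucleicSeq :: String -> NucleicSeq
mkNucleicSeq = mapMaybe (`lookup` nucleotideCodes)

-- Render a sequence back to its one-letter codes.
showNucleicSeq :: NucleicSeq -> String
showNucleicSeq = mapMaybe (`lookup` nucleotideLetters)
```

With these definitions, mkNucleicSeq "GATTACA" produces the seven-element list spelled out earlier, and showNucleicSeq (mkNucleicSeq "GATTACA") gives back "GATTACA".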
I should note that performing lookups on lists like I have done here is not the
most efficient method. Data structures such as Map or HashMap will
have better lookup performance.
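As an illustration of that alternative, here is a sketch of the String-to-sequence direction backed by Data.Map.Strict from the containers package. The names nucleotideMap and mkNucleicSeq are assumptions for this example, and with only four keys any speed difference is likely negligible:

```haskell
import qualified Data.Map.Strict as Map
import Data.Maybe (mapMaybe)

data Nucleotide = Adenine | Cytosine | Guanosine | Thymine
  deriving (Eq, Show)

-- Balanced-tree map from one-letter codes to nucleotides.
nucleotideMap :: Map.Map Char Nucleotide
nucleotideMap = Map.fromList
  [('A', Adenine), ('C', Cytosine), ('G', Guanosine), ('T', Thymine)]

-- Build a sequence from a String; unrecognised characters are dropped.
mkNucleicSeq :: String -> [Nucleotide]
mkNucleicSeq = mapMaybe (`Map.lookup` nucleotideMap)
```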
Re-implementation of Complementation
With this new data model in place, let’s implement the central dogma functions
again. Since replication is so trivial (I’ll say a little more on this at the
end), I’ll move right into complementation. This is exactly the same
as before but with Nucleotide values instead of Char values.
complementNucleotide expresses the Nucleotide-wise rules of complementation
and complement maps it over a list to generate the complementary list.
reverseComplement uses function composition to create a function that returns
the reverse result of the complement function (putting it into the standard 5’
to 3’ orientation). Put in practice:
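A sketch of the three functions just described, assuming the four-constructor Nucleotide type used in this post:

```haskell
data Nucleotide = Adenine | Cytosine | Guanosine | Thymine
  deriving (Eq, Show)

type NucleicSeq = [Nucleotide]

-- Base-pairing rules: A pairs with T, C pairs with G.
complementNucleotide :: Nucleotide -> Nucleotide
complementNucleotide Adenine   = Thymine
complementNucleotide Thymine   = Adenine
complementNucleotide Cytosine  = Guanosine
complementNucleotide Guanosine = Cytosine

-- Complement every position of a sequence.
complement :: NucleicSeq -> NucleicSeq
complement = map complementNucleotide

-- Complement, then reverse, so the result reads 5' to 3'.
reverseComplement :: NucleicSeq -> NucleicSeq
reverseComplement = reverse . complement
```

Applied to the GATTACA sequence, complement gives the sequence CTAATGT and reverseComplement gives TGTAATC. Because the pattern match covers all four constructors, there is no leftover case for an invalid character as there was with Char.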
Translation
The previous implementation of the geneticCode function was written to
pattern match on strings of three characters, each representing a codon. If the
function was given a string of a different length and/or a string containing an
invalid character, it would produce '-' to denote that the input is not meaningful.
Here are the first two lines of my previous definition of geneticCode:
Here is a direct re-implementation using the new algebraic types.
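A rough sketch of the contrast, showing only two codons in each style. The names geneticCodeChars and geneticCodeList, the choice of codons, the trimmed AminoAcid type, and the use of Maybe for unmatched lists are all assumptions for illustration, not the original definitions:

```haskell
data Nucleotide = Adenine | Cytosine | Guanosine | Thymine
  deriving (Eq, Show)

-- Trimmed to the single amino acid used below.
data AminoAcid = Phenylalanine
  deriving (Eq, Show)

-- String-based style: any String is accepted, and '-' marks an input
-- that matched no clause. Only two codons are listed in this sketch.
geneticCodeChars :: String -> Char
geneticCodeChars "UUU" = 'F'
geneticCodeChars "UUC" = 'F'
geneticCodeChars _     = '-'

-- List-of-Nucleotide style: every element is a valid base by construction,
-- but the list itself may still be too short or too long.
geneticCodeList :: [Nucleotide] -> Maybe AminoAcid
geneticCodeList [Thymine, Thymine, Thymine]  = Just Phenylalanine
geneticCodeList [Thymine, Thymine, Cytosine] = Just Phenylalanine
geneticCodeList _                            = Nothing
```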
In the former there is no guarantee that the characters in the input
are valid, while in the latter we have that assurance. Yet in both functions
there is the possibility for lists of incorrect length; I would like to fix that
so I will formalize a codon data type and write the function to fit.
Now I have a new data type called Codon whose values can be created using the
value constructor Codon (same name in this case, but it’s good to remember the
distinction between types and their constructors), which takes exactly
three Nucleotide values and holds them as an ordered triplet. I have prefixed
each field with ! to make it strict, essentially establishing a policy that
a Codon cannot be partially defined.
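The declaration described above can be sketched as follows; the derived instances and the startCodon example value are assumptions for illustration:

```haskell
data Nucleotide = Adenine | Cytosine | Guanosine | Thymine
  deriving (Eq, Show)

-- Exactly three strict fields: evaluating a Codon forces all three bases.
data Codon = Codon !Nucleotide !Nucleotide !Nucleotide
  deriving (Eq, Show)

startCodon :: Codon
startCodon = Codon Adenine Thymine Guanosine
```

An expression such as Codon Adenine Thymine, with only two arguments, is not a Codon at all but a function still waiting for its third base, so a triplet of the wrong length can no longer be passed where a Codon is expected.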
This provides the basis of the new definition of geneticCode:
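The sketch below is a condensed variant rather than a clause-per-codon listing: it groups the standard genetic code by the first two bases and inspects the third only where it matters. Thymine stands in for uracil, and the AminoAcid constructor names (including Stop) are assumptions:

```haskell
data Nucleotide = Adenine | Cytosine | Guanosine | Thymine
  deriving (Eq, Show)

data Codon = Codon !Nucleotide !Nucleotide !Nucleotide
  deriving (Eq, Show)

data AminoAcid
  = Alanine | Arginine | Asparagine | AsparticAcid | Cysteine
  | GlutamicAcid | Glutamine | Glycine | Histidine | Isoleucine
  | Leucine | Lysine | Methionine | Phenylalanine | Proline
  | Serine | Threonine | Tryptophan | Tyrosine | Valine
  | Stop
  deriving (Eq, Show)

-- Standard genetic code, keyed on the first two bases of the codon.
geneticCode :: Codon -> AminoAcid
geneticCode (Codon first second third) =
  case (first, second) of
    (Thymine,   Thymine)   -> if pyrimidine then Phenylalanine else Leucine
    (Thymine,   Cytosine)  -> Serine
    (Thymine,   Adenine)   -> if pyrimidine then Tyrosine else Stop
    (Thymine,   Guanosine) -> case third of
      Adenine   -> Stop
      Guanosine -> Tryptophan
      _         -> Cysteine
    (Cytosine,  Thymine)   -> Leucine
    (Cytosine,  Cytosine)  -> Proline
    (Cytosine,  Adenine)   -> if pyrimidine then Histidine else Glutamine
    (Cytosine,  Guanosine) -> Arginine
    (Adenine,   Thymine)   -> if third == Guanosine then Methionine else Isoleucine
    (Adenine,   Cytosine)  -> Threonine
    (Adenine,   Adenine)   -> if pyrimidine then Asparagine else Lysine
    (Adenine,   Guanosine) -> if pyrimidine then Serine else Arginine
    (Guanosine, Thymine)   -> Valine
    (Guanosine, Cytosine)  -> Alanine
    (Guanosine, Adenine)   -> if pyrimidine then AsparticAcid else GlutamicAcid
    (Guanosine, Guanosine) -> Glycine
  where
    -- True when the third base is a pyrimidine (T or C).
    pyrimidine = third == Thymine || third == Cytosine
```

Since the argument is a Codon, there is no clause for a wrong length or an invalid character: all sixteen leading pairs are covered, so the function is total.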
Thanks to some help from vim, I didn’t have to invest much time into retyping
that. Now to create the new translate function:
translate :: NucleicSeq -> ProteinSeq makes use of asCodons to convert the
nucleotides to codons and maps geneticCode over the result. asCodons does
the conversion by chunking the input into triplet lists whose values are then
used to construct a Codon in toCodon. If toCodon gets a list of length
other than 3, because the number of input Nucleotide values was not an integer
multiple of 3, it produces a Nothing. The mapMaybe in asCodons discards
Nothing values and extracts the values from the Just.
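The chunking step described above can be sketched as follows. The local chunksOf helper is an assumption of this sketch (a function of the same name is also provided by Data.List.Split in the split package), and geneticCode is left out to keep the example short:

```haskell
import Data.Maybe (mapMaybe)

data Nucleotide = Adenine | Cytosine | Guanosine | Thymine
  deriving (Eq, Show)

data Codon = Codon !Nucleotide !Nucleotide !Nucleotide
  deriving (Eq, Show)

type NucleicSeq = [Nucleotide]

-- Split a list into consecutive chunks of n elements (assumes n > 0);
-- the final chunk may be shorter.
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = chunk : chunksOf n rest
  where (chunk, rest) = splitAt n xs

-- A Codon can only be built from exactly three nucleotides.
toCodon :: [Nucleotide] -> Maybe Codon
toCodon [a, b, c] = Just (Codon a b c)
toCodon _         = Nothing

-- Chunk into triplets and keep only the complete ones.
asCodons :: NucleicSeq -> [Codon]
asCodons = mapMaybe toCodon . chunksOf 3
```

For the seven bases of GATTACA, asCodons yields the two codons GAT and TAC and drops the trailing A. Given a geneticCode :: Codon -> AminoAcid, translate is then map geneticCode . asCodons.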
Let’s try the functions out:
To give it a more in-depth test, I’ll adjust simpleSeqGetter and
translateCodingRegion and then translate the mRNA from before.
Reverse Translation
The first key to the puzzle of computing reverse translation will be a
function that maps each AminoAcid to its corresponding list of coding
Codons. I’ll also define a convenience function mkCodon (“make Codon”) to
help keep the lines a little shorter.
The reverse translation operation can just be the application of codonsFor to
each AminoAcid in a ProteinSeq:
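A sketch of these pieces under the standard genetic code, with Thymine standing in for uracil. The String-based signature of mkCodon, the name reverseTranslate, and the AminoAcid constructor names (including Stop) are assumptions for this example:

```haskell
data Nucleotide = Adenine | Cytosine | Guanosine | Thymine
  deriving (Eq, Show)

data Codon = Codon !Nucleotide !Nucleotide !Nucleotide
  deriving (Eq, Show)

data AminoAcid
  = Alanine | Arginine | Asparagine | AsparticAcid | Cysteine
  | GlutamicAcid | Glutamine | Glycine | Histidine | Isoleucine
  | Leucine | Lysine | Methionine | Phenylalanine | Proline
  | Serine | Threonine | Tryptophan | Tyrosine | Valine
  | Stop
  deriving (Eq, Show, Enum, Bounded)

type ProteinSeq = [AminoAcid]

-- Build a Codon from a three-letter String; partial, meant for literals only.
mkCodon :: String -> Codon
mkCodon [a, b, c] = Codon (base a) (base b) (base c)
  where
    base 'A' = Adenine
    base 'C' = Cytosine
    base 'G' = Guanosine
    base 'T' = Thymine
    base x   = error ("mkCodon: invalid base " ++ show x)
mkCodon s = error ("mkCodon: expected three bases, got " ++ show s)

-- Every codon that codes for the given amino acid.
codonsFor :: AminoAcid -> [Codon]
codonsFor aa = map mkCodon $ case aa of
  Alanine       -> ["GCT", "GCC", "GCA", "GCG"]
  Arginine      -> ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"]
  Asparagine    -> ["AAT", "AAC"]
  AsparticAcid  -> ["GAT", "GAC"]
  Cysteine      -> ["TGT", "TGC"]
  GlutamicAcid  -> ["GAA", "GAG"]
  Glutamine     -> ["CAA", "CAG"]
  Glycine       -> ["GGT", "GGC", "GGA", "GGG"]
  Histidine     -> ["CAT", "CAC"]
  Isoleucine    -> ["ATT", "ATC", "ATA"]
  Leucine       -> ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]
  Lysine        -> ["AAA", "AAG"]
  Methionine    -> ["ATG"]
  Phenylalanine -> ["TTT", "TTC"]
  Proline       -> ["CCT", "CCC", "CCA", "CCG"]
  Serine        -> ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"]
  Threonine     -> ["ACT", "ACC", "ACA", "ACG"]
  Tryptophan    -> ["TGG"]
  Tyrosine      -> ["TAT", "TAC"]
  Valine        -> ["GTT", "GTC", "GTA", "GTG"]
  Stop          -> ["TAA", "TAG", "TGA"]

-- One list of candidate codons per residue.
reverseTranslate :: ProteinSeq -> [[Codon]]
reverseTranslate = map codonsFor
```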
Now let’s create uniqueCodings :: ProteinSeq -> [NucleicSeq] which will
produce all of the possible nucleic acid sequences which could code for the
protein. I’ll have to add an intermediate step to unpack the Nucleotide
components from a Codon.
Beware: the size of the output increases exponentially with the
size of the input (unless you only have Methionine or Tryptophan in the
sequence).
Note that I have modified uniqueCodings to make use of the lists as
applicative functors. liftA2 is from Control.Applicative. You may want to
start reading here
if applicative functors are new to you.
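A sketch of how liftA2 on lists can enumerate the codings. To stay short it uses an AminoAcid type and a codonsFor table trimmed to three residues; the helper name nucleotides is an assumption:

```haskell
import Control.Applicative (liftA2)

data Nucleotide = Adenine | Cytosine | Guanosine | Thymine
  deriving (Eq, Show)

data Codon = Codon !Nucleotide !Nucleotide !Nucleotide
  deriving (Eq, Show)

-- Trimmed for brevity; the full type has one constructor per amino acid.
data AminoAcid = Methionine | Lysine | Tryptophan
  deriving (Eq, Show)

type NucleicSeq = [Nucleotide]
type ProteinSeq = [AminoAcid]

codonsFor :: AminoAcid -> [Codon]
codonsFor Methionine = [Codon Adenine Thymine Guanosine]
codonsFor Lysine     = [Codon Adenine Adenine Adenine, Codon Adenine Adenine Guanosine]
codonsFor Tryptophan = [Codon Thymine Guanosine Guanosine]

-- Unpack the three bases of a codon, in order.
nucleotides :: Codon -> [Nucleotide]
nucleotides (Codon a b c) = [a, b, c]

-- In the list applicative, liftA2 (++) appends every choice for one residue
-- to every coding of the remaining residues.
uniqueCodings :: ProteinSeq -> [NucleicSeq]
uniqueCodings = foldr (liftA2 (++)) [[]] . map (map nucleotides . codonsFor)
```

Here uniqueCodings [Methionine, Lysine, Tryptophan] produces two sequences, ATGAAATGG and ATGAAGTGG. The number of results is the product of the codon counts per residue, which is where the exponential growth comes from.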
Let’s do a few tests:
With the appropriate adjustment to constrainedUniqueCodings I can repeat my
previous work on finding out whether a certain NucleicSeq can be found among
the reverse translated NucleicSeqs from a given ProteinSeq.
What about replication?
Fidelitous (error-free) replication is such a trivial problem that we need
hardly think about it. The function that always gives back the value it was
given is known as the identity function.
The replication of a sequence can then be expressed by using the Haskell
identity function id, which is:
So if we cared to, an alias for id could be made like so:
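A minimal sketch: the Prelude's id is hidden so that its definition can be spelled out, and the alias name replication is an assumption for this example:

```haskell
import Prelude hiding (id)

data Nucleotide = Adenine | Cytosine | Guanosine | Thymine
  deriving (Eq, Show)

type NucleicSeq = [Nucleotide]

-- Same definition as the Prelude's id, repeated here for reference.
id :: a -> a
id x = x

-- Hypothetical alias: error-free replication returns its input unchanged.
replication :: NucleicSeq -> NucleicSeq
replication = id
```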
The story of replication probably won’t get interesting again until we consider
infidelity (scandalous!).