Haskell Revisits the Central Dogma
06 Feb 2016This post will recap the previous, Haskell Meets the Central Dogma, with the intent to introduce algebraic data types. Using these will allow refinement of the design, present some more features of Haskell, and continue a gradual exposition. The first in this series, A Review of the Central Dogma, reviews the biological context of the functions presented.
Here’s the code for this post. I would like to thank Roman Cheplyaka for helpful discussion and code review.
Limitations of type synonyms
I previously began by declaring some type synonyms in my code so that I could
then differentiate between functions meant to operate with DNA (DNASeq
),
RNA (RNASeq
), or protein (ProteinSeq
). The type synonyms were all equivalent
to Haskell’s String
type: [Char]
. Here is that code for quick reference:
type Element = Char -- A unitary object
type Seq = [Element] -- An ordered collection of Elements
type DNABase = Element -- DNA variant of Element
type DNASeq = [DNABase] -- DNA variant of Sequence
type RNABase = Element -- RNA variant of Element
type RNASeq = [RNABase] -- RNA variant of Sequence
type AminoAcid = Element -- Protein variant of Element
type ProteinSeq = [AminoAcid] -- Protein variant of Sequence
With these in place I could declare translate :: RNASeq -> ProteinSeq
in lieu
of translate :: String -> String
to indicate that the expected behavior is
meaningful for particular Strings
values, not String
values in general. Yet
since these declarations are fundamentally the same, we are still able to do
some weird things that it might be best to prevent:
ghci> let mydna = "GATTACA" :: DNASeq
ghci> let mypro = "ALGLYSW" :: ProteinSeq
ghci> mydna ++ mypro -- we can concatenate, but what kind of molecule is that?!
"GATTACAALGYLSW"
ghci> :t mydna ++ mypro -- result type of concatenation in ghci?
mydna ++ mypro :: [DNABase]
ghci> :t mypro ++ mydna -- it depends on the order!
mypro ++ mydna :: [AminoAcid]
ghci> :t reverseTranslate
reverseTranslate :: ProteinSeq -> [[DNACodon]]
ghci> reverseTranslate mydna -- The base Chars in DNA can interpreted as valid amino acids
[["GGA","GGC","GGG","GGT"],["GCA","GCC","GCG","GCT"],["ACA","ACC","ACG","ACT"],["ACA","ACC","ACG","ACT"],["GCA","GCC","GCG","GCT"],["TGC","TGT"],["GCA","GCC","GCG","GCT"]]
The takeaway lesson here is that while using type
to create synonyms can help
with code readability where you’d like to tersely annotate something with a
comment (it’s like laying a comment on top of your type), they don’t
meaningfully change anything below the surface.
Let’s explore using some algebraic data types to represent our
data which will simultaneously improve the performance, clarity, and rigor of
the code.
Algebraic types with data
So let’s begin by intoducing some custom data types to represent nucleic acid sequences and protein sequences. I will not preserve the conceptual distinction between DNA and RNA at the type level as I did before; because they are directly correlated (to the extent that we are concerned here), it makes sense to think of DNA and RNA as simply different representations of the same underlying nucleic acid sequence. This is a common convention in bioinformatics, and it will help to keep reduce the amount and complexity of our code. I will not cover ambiguous nucleotide or amino acids here.
--A Nucleotide may be either Adenine, Cytosine, Guanosine, or Thymine/Uracil
--Thymine is Uracil in RNA
data Nucleotide = Adenine | Cytosine | Guanosine | Thymine
deriving (Eq, Ord, Show)
--An AminoAcid may be one of the 20 standard amino acids, or a Stop
data AminoAcid = Alanine
| Arginine
| Asparagine
| Aspartate
| Cysteine
| Glutamate
| Glutamine
| Glycine
| Histidine
| Isoleucine
| Leucine
| Lysine
| Methionine
| Phenylalanine
| Proline
| Serine
| Threonine
| Tryptophan
| Tyrosine
| Valine
| Stop
deriving (Eq, Show)
type NucleicSeq = [Nucleotide] -- Nucleic Acid Sequence
type ProteinSeq = [AminoAcid] -- Protein Sequence
Here I’ve created two new data types: Nucleotide
, and AminoAcid
. These new
data types have their own unique value constructors. The type synonyms
NucleicSeq
and ProteinSeq
suggest my intent to work with these sequences as
lists of their respective subunits.
Let’s get a quick feel for this system:
ghci> let mydna = [Guanosine, Adenine, Thymine, Thymine, Adenine, Cytosine, Adenine]
ghci> let mypro = [Alanine, Leucine, Glycine, Tyrosine, Leucine, Serine, Tryptophan]
ghci> show mydna
"[Guanosine,Adenine,Thymine,Thymine,Adenine,Cytosine,Adenine]"
ghci> show mypro
"[Alanine,Leucine,Glycine,Tyrosine,Leucine,Serine,Tryptophan]"
ghci> mydna ++ mypro -- Lists are homogenous, we can't mix types in a sequence
-- Couldn't match type ‘AminoAcid’ with ‘Nucleotide’
-- Expected type: [Nucleotide]
-- Actual type: [AminoAcid]
-- In the second argument of ‘(++)’, namely ‘mypro’
-- In the expression: mydna ++ mypro
ghci> Thymine : mydna -- Prepend Thymine
[Thymine,Guanosine,Adenine,Thymine,Thymine,Adenine,Cytosine,Adenine]
Because lists in Haskell are homogenous structures, we are now sure that we
cannot compile some code in which we’ve accidentally tried to combine the types
into one value. Since the types are distinct, it is now clear to both the reader
and the compiler that translate :: NucleicSeq -> ProteinSeq
uses
[Nucleotide]
as its input type.
It looks like working with this system could get a little cumbersome in terms of
typing: "GATTACA"
was much easier to type than [Guanosine, Adenine, Thymine,
Thymine, Adenine, Cytosine, Adenine]
. Let’s take a moment to write some helpful
functions for nicely inputting to and outputting from this model:
charToNuc :: Char -> Nucleotide
charToNuc 'A' = Adenine
charToNuc 'C' = Cytosine
charToNuc 'G' = Guanosine
charToNuc 'T' = Thymine
charToNuc 'U' = Thymine
charToNuc x = error $ x:" is not a valid nucleotide character"
stringToNucleicSeq :: String -> NucleicSeq
stringToNucleicSeq = map charToNuc
dnaSymbol :: Nucleotide -> Char
dnaSymbol Adenine = 'A'
dnaSymbol Cytosine = 'C'
dnaSymbol Guanosine = 'G'
dnaSymbol Thymine = 'T'
rnaSymbol :: Nucleotide -> Char
rnaSymbol Thymine = 'U'
rnaSymbol x = dnaSymbol x
representAsDNA :: NucleicSeq -> String
representAsDNA = map dnaSymbol
representAsRNA :: NucleicSeq -> String
representAsRNA = map rnaSymbol
aaSymbolTable = [(Alanine , 'A'), (Arginine , 'R'), (Asparagine , 'N'),
(Aspartate , 'D'), (Cysteine , 'C'), (Glutamate , 'E'),
(Glutamine , 'Q'), (Glycine , 'G'), (Histidine , 'H'),
(Isoleucine , 'I'), (Leucine , 'L'), (Lysine , 'K'),
(Methionine , 'M'), (Phenylalanine , 'F'), (Proline , 'P'),
(Serine , 'S'), (Threonine , 'T'), (Tryptophan , 'W'),
(Tyrosine , 'Y'), (Valine , 'V'), (Stop , '*')]
representProtein :: ProteinSeq -> String
representProtein = map fetchAA where
fetchAA x = fromJust $ lookup x aaSymbolTable
stringToProteinSeq :: String -> ProteinSeq
stringToProteinSeq = map fetchSymbol where
fetchSymbol x = fromMaybe (error ("Invalid amino acid: " ++ show x)) $ lookup x $ invert aaSymbolTable
invert = map swap
swap (x, y) = (y, x)
So now where it is more convenient we have the ability to make sequences from
String
s and to produce String
s from our sequences like so:
ghci> let nseq = stringToNucleicSeq "GATTACA"
ghci> representAsRNA nseq
"GAUUACA"
ghci> let pseq = stringToProteinSeq "AVSLGL"
ghci> pseq
[Alanine,Valine,Serine,Leucine,Glycine,Leucine]
I should note that performing lookups on lists like I have done here is not the
most efficient method. Data structures such as Map
or HashMap
will
have better lookup performance.
Re-implementation of Complementation
With this new data model in place, let’s implement the central dogma functions
agin. Since replication is so trivial (I’ll say a little more on this at the
end), I’ll move right into complementation. This is exactly the same
as before but with Nucleotide
values instead of Char
values.
complementNucleotide :: Nucleotide -> Nucleotide
complementNucleotide Adenine = Thymine
complementNucleotide Cytosine = Guanosine
complementNucleotide Guanosine = Cytosine
complementNucleotide Thymine = Adenine
complement :: NucleicSeq -> NucleicSeq
complement s = map complementNucleotide s
reverseComplement = reverse . complement
complementNucleotide
expresses the Nucleotide
-wise rules of complementation
and complement
maps it over a list to generate the complementary list.
reverseComplement
uses function composition to create a function that returns
the reverse result of the complement
function (putting it into the standard 5’
to 3’ orientation). Put in practice:
ghci> complement [Thymine, Adenine, Cytosine, Guanosine]
[Adenine,Thymine,Guanosine,Cytosine]
ghci> complement $ stringToNucleicSeq "GATTACA"
[Cytosine,Thymine,Adenine,Adenine,Thymine,Guanosine,Thymine]
ghci> reverseComplement $ stringToNucleicSeq "GATTACA"
[Thymine,Guanosine,Thymine,Adenine,Adenine,Thymine,Cytosine]
ghci> reverseComplement [Glutamate] -- Disallowed
-- Couldn't match expected type ‘Nucleotide’
-- with actual type ‘AminoAcid’
-- In the expression: Glutamate
-- In the first argument of ‘reverseComplement’, namely ‘[Glutamate]’
Translation
The previous implementation for the geneticCode
function was written to
pattern match on strings of three characters which representing a codon. If the
function was given a string of a different length and/or a string containing an
invalid character, it would produce '-'
to denote that it is not meaningful.
Here is the first two lines of my previous definition of geneticCode
:
geneticCode :: RNACodon -> AminoAcid
geneticCode "AAA" = 'K' -- Lysine
Here is a direct re-implementation using the new algebraic types.
geneticCode :: NucleicSeq -> AminoAcid
geneticCode [Adenine, Adenine, Adenine] = Lysine
In the former there is no guarantee whether the characters in the input are valid while in the latter we have that assurance. Yet in both functions there is the possibility for lists of incorrect length; I would like to fix that so I will formalize a codon data type and write the function to fit.
data Codon = Codon !Nucleotide !Nucleotide !Nucleotide
deriving (Eq, Ord, Show)
Now I have a new data type called Codon
whose values can be created using the
type constructor Codon
(same name in this case, but it’s good to remember the
distinction between types and their constructors) which takes three and only
three Nucleotide
values and holds them as an ordered triplet. I have prepended
each field with !
to make them strict, essentially establishing a policy that
a Codon
cannot be partially defined.
This provides the basis of the new definition of geneticCode
:
geneticCode :: Codon -> AminoAcid
geneticCode (Codon Adenine Adenine Adenine ) = Lysine
geneticCode (Codon Adenine Adenine Cytosine ) = Asparagine
geneticCode (Codon Adenine Adenine Guanosine) = Lysine
geneticCode (Codon Adenine Adenine Thymine ) = Asparagine
geneticCode (Codon Adenine Cytosine Adenine ) = Threonine
geneticCode (Codon Adenine Cytosine Cytosine ) = Threonine
geneticCode (Codon Adenine Cytosine Guanosine) = Threonine
geneticCode (Codon Adenine Cytosine Thymine ) = Threonine
geneticCode (Codon Adenine Guanosine Adenine ) = Arginine
geneticCode (Codon Adenine Guanosine Cytosine ) = Serine
geneticCode (Codon Adenine Guanosine Guanosine) = Arginine
geneticCode (Codon Adenine Guanosine Thymine ) = Serine
geneticCode (Codon Adenine Thymine Adenine ) = Isoleucine
geneticCode (Codon Adenine Thymine Cytosine ) = Isoleucine
geneticCode (Codon Adenine Thymine Guanosine) = Methionine
geneticCode (Codon Adenine Thymine Thymine ) = Isoleucine
geneticCode (Codon Cytosine Adenine Adenine ) = Glutamine
geneticCode (Codon Cytosine Adenine Cytosine ) = Histidine
geneticCode (Codon Cytosine Adenine Guanosine) = Glutamine
geneticCode (Codon Cytosine Adenine Thymine ) = Histidine
geneticCode (Codon Cytosine Cytosine Adenine ) = Proline
geneticCode (Codon Cytosine Cytosine Cytosine ) = Proline
geneticCode (Codon Cytosine Cytosine Guanosine) = Proline
geneticCode (Codon Cytosine Cytosine Thymine ) = Proline
geneticCode (Codon Cytosine Guanosine Adenine ) = Arginine
geneticCode (Codon Cytosine Guanosine Cytosine ) = Arginine
geneticCode (Codon Cytosine Guanosine Guanosine) = Arginine
geneticCode (Codon Cytosine Guanosine Thymine ) = Arginine
geneticCode (Codon Cytosine Thymine Adenine ) = Leucine
geneticCode (Codon Cytosine Thymine Cytosine ) = Leucine
geneticCode (Codon Cytosine Thymine Guanosine) = Leucine
geneticCode (Codon Cytosine Thymine Thymine ) = Leucine
geneticCode (Codon Guanosine Adenine Adenine ) = Glutamate
geneticCode (Codon Guanosine Adenine Cytosine ) = Aspartate
geneticCode (Codon Guanosine Adenine Guanosine) = Glutamate
geneticCode (Codon Guanosine Adenine Thymine ) = Aspartate
geneticCode (Codon Guanosine Cytosine Adenine ) = Alanine
geneticCode (Codon Guanosine Cytosine Cytosine ) = Alanine
geneticCode (Codon Guanosine Cytosine Guanosine) = Alanine
geneticCode (Codon Guanosine Cytosine Thymine ) = Alanine
geneticCode (Codon Guanosine Guanosine Adenine ) = Glycine
geneticCode (Codon Guanosine Guanosine Cytosine ) = Glycine
geneticCode (Codon Guanosine Guanosine Guanosine) = Glycine
geneticCode (Codon Guanosine Guanosine Thymine ) = Glycine
geneticCode (Codon Guanosine Thymine Adenine ) = Valine
geneticCode (Codon Guanosine Thymine Cytosine ) = Valine
geneticCode (Codon Guanosine Thymine Guanosine) = Valine
geneticCode (Codon Guanosine Thymine Thymine ) = Valine
geneticCode (Codon Thymine Adenine Adenine ) = Stop -- Ochre
geneticCode (Codon Thymine Adenine Cytosine ) = Tyrosine
geneticCode (Codon Thymine Adenine Guanosine) = Stop -- Amber
geneticCode (Codon Thymine Adenine Thymine ) = Tyrosine
geneticCode (Codon Thymine Cytosine Adenine ) = Serine
geneticCode (Codon Thymine Cytosine Cytosine ) = Serine
geneticCode (Codon Thymine Cytosine Guanosine) = Serine
geneticCode (Codon Thymine Cytosine Thymine ) = Serine
geneticCode (Codon Thymine Guanosine Adenine ) = Stop -- Opal
geneticCode (Codon Thymine Guanosine Cytosine ) = Cysteine
geneticCode (Codon Thymine Guanosine Guanosine) = Tryptophan
geneticCode (Codon Thymine Guanosine Thymine ) = Cysteine
geneticCode (Codon Thymine Thymine Adenine ) = Leucine
geneticCode (Codon Thymine Thymine Cytosine ) = Phenylalanine
geneticCode (Codon Thymine Thymine Guanosine) = Leucine
geneticCode (Codon Thymine Thymine Thymine ) = Phenylalanine
Thanks to some help from vim, I didn’t have to invest much time into retyping
that. Now to create the new translate
function:
--Convert a [Nucleotide] to [Codon], trimming possible remainder
asCodons :: NucleicSeq -> [Codon]
asCodons s = mapMaybe toCodon (chunksOf 3 s) where
toCodon [i,j,k] = Just (Codon i j k)
toCodon _ = Nothing
translate :: NucleicSeq -> ProteinSeq
translate s = map geneticCode $ asCodons s
translate :: NucleicSeq -> ProteinSeq
makes use of asCodons
to convert the
nucleotides to codons and maps geneticCode
over the result. asCodons
does
the conversion by chunking the input into triplet lists whose values are then
used to construct a Codon
in toCodon
. If toCodon
gets a list of length
other than 3, because the number of input Nucleotide
values was an integer
multiple of 3, it produces a Nothing
. The mapMaybe
in asCodons
discards
Nothing
values and extracts the values from the Just
.
Let’s try the functions out:
ghci> asCodons [Adenine, Thymine]
[]
ghci> asCodons [Adenine, Thymine, Cytosine]
[Codon Adenine Thymine Cytosine]
ghci> asCodons [Adenine, Thymine, Cytosine, Guanosine]
[Codon Adenine Thymine Cytosine]
ghci> translate $ stringToNucleicSeq "GATTACA"
[Aspartate,Tyrosine]
To give it a more in depth test application, I’ll adjust simpleSeqGetter
and
translateCodingRegion
to then translate the mRNA from the before.
simpleSeqGetter :: FilePath -> IO NucleicSeq
simpleSeqGetter fp = do
unmunged <- readFile fp
let linedropped = drop 1 $ lines unmunged
return $ foldr ((++) . stringToNucleicSeq) ([] :: NucleicSeq) linedropped
translateCodingRegion :: NucleicSeq -> ProteinSeq
translateCodingRegion s = takeWhile (Stop /=) $ dropWhile (Methionine/=) $ translate s
ghci> myseq <- simpleSeqGetter "lipocalin1_mRNA.fa"
ghci> representProtein $ translateCodingRegion myseq
"MKPLLLAVSLGLIAALQAHHLLASDEEIQDVSGTWYLKAMTVDREFPEMNLESVTPMTLTTLEGGNLEAKVTMLISGRCQEVKAVLEKTDEPGKYTADGGKHVAYIIRSHVKDHYIFYCEGELHGKPVRGVKLVGRDPKNNLEALEDFEKAAGARGLSTESILIPRQSETCSPGSD"
Reverse Translation
The first key to the puzzle of computing reverse translation will be a
function that maps each AminoAcid
to its corresponding list of coding
Codon
s. I’ll also define a convenience function mkCodon
(“make Codon”) to
help keep the lines a little shorter.
mkCodon :: String -> Codon
mkCodon [f,s,t] = Codon (charToNuc f) (charToNuc s) (charToNuc t)
mkCodon _ = error "Incorrect number of characters to make Codon"
codonsFor :: AminoAcid -> [Codon]
codonsFor Alanine = [mkCodon "GCA", mkCodon "GCC", mkCodon "GCG", mkCodon "GCT"]
codonsFor Cysteine = [mkCodon "TGC", mkCodon "TGT"]
codonsFor Aspartate = [mkCodon "GAC", mkCodon "GAT"]
codonsFor Glutamate = [mkCodon "GAA", mkCodon "GAG"]
codonsFor Phenylalanine = [mkCodon "TTC", mkCodon "TTT"]
codonsFor Glycine = [mkCodon "GGA", mkCodon "GGC", mkCodon "GGG", mkCodon "GGT"]
codonsFor Histidine = [mkCodon "CAC", mkCodon "CAT"]
codonsFor Isoleucine = [mkCodon "ATA", mkCodon "ATC", mkCodon "ATT"]
codonsFor Lysine = [mkCodon "AAA", mkCodon "AAG"]
codonsFor Leucine = [mkCodon "CTA", mkCodon "CTC", mkCodon "CTG", mkCodon "CTT", mkCodon "TTA", mkCodon "TTG"]
codonsFor Methionine = [mkCodon "ATG"]
codonsFor Asparagine = [mkCodon "AAC", mkCodon "AAT"]
codonsFor Proline = [mkCodon "CCA", mkCodon "CCC", mkCodon "CCG", mkCodon "CCT"]
codonsFor Glutamine = [mkCodon "CAA", mkCodon "CAG"]
codonsFor Arginine = [mkCodon "AGA", mkCodon "AGG", mkCodon "CGA", mkCodon "CGC", mkCodon "CGG" , mkCodon "CGT"]
codonsFor Serine = [mkCodon "AGC", mkCodon "AGT", mkCodon "TCA", mkCodon "TCC", mkCodon "TCG" , mkCodon "TCT"]
codonsFor Threonine = [mkCodon "ACA", mkCodon "ACC", mkCodon "ACG", mkCodon "ACT"]
codonsFor Valine = [mkCodon "GTA", mkCodon "GTC", mkCodon "GTG", mkCodon "GTT"]
codonsFor Tryptophan = [mkCodon "TGG"]
codonsFor Tyrosine = [mkCodon "TAC", mkCodon "TAT"]
codonsFor Stop = [mkCodon "TAA", mkCodon "TAG", mkCodon "TGA"]
The reverse translation operation can just be the application of codonsFor
to
each AminoAcid
in a ProteinSeq
:
reverseTranslate :: ProteinSeq -> [[Codon]]
reverseTranslate = map codonsFor
ghci> reverseTranslate [Phenylalanine, Isoleucine, Glutamine, Glutamine]
[[Codon Thymine Thymine Cytosine,Codon Thymine Thymine Thymine],[Codon Adenine Thymine Adenine,Codon Adenine Thymine Cytosine,Codon Adenine Thymine Thymine],[Codon Cytosine Adenine Adenine,Codon Cytosine Adenine Guanosine],[Codon Cytosine Adenine Adenine,Codon Cytosine Adenine Guanosine]]
Now let’s create uniqueCodings :: ProteinSeq -> [NucleicSeq]
which will
produce all of the possible nucleic acid sequences which could code for the
protein. I’ll have to add an intermediate step to unpack the Nucleotide
components from a Codon
.
Beware: the size of the output increases exponentially with the
size of the input (unless you only have Methionine
or Tryptophan
in the
sequence).
-- Unpack the Codons in [Codon] to get [NucleicSeq]
unCodons :: [Codon] -> [NucleicSeq]
unCodons = map unCodon where
unCodon (Codon f s t) = [f,s,t]
--produce all the ways the protein sequence could be encoded
uniqueCodings :: ProteinSeq -> [NucleicSeq]
uniqueCodings s = foldr (liftA2 (++) . unCodons) ([[]] :: [NucleicSeq]) $ reverseTranslate s
--Count up all the ways the protein sequence could be encoded
uniqueCodingsCount :: ProteinSeq -> Integer
uniqueCodingsCount s = fold ((*) . genericLength) 1 $ reverseTranslate s
Note that I have modified uniqueCodings
to make use of the lists as
applicative functors. liftA2
is from Control.Applicative
. You may want to
start reading here
if applicative functors are new to you.
Let’s do a few tests:
ghci> uniqueCodingsCount ([] :: ProteinSeq)
1
ghci> uniqueCodingsCount [Alanine]
4
ghci> uniqueCodingscount [Alanine, Aspartate]
8
ghci> map representAsDNA $ uniqueCodings [Alanine, Aspartate]
["GCAGAC","GCAGAT","GCCGAC","GCCGAT","GCGGAC","GCGGAT","GCTGAC","GCTGAT"]
With the appropriate adjustment to constrainedUniqueCodings
I can repeat my
previous work on finding out whether a certain NucleicSeq
can be found among
the reverse translated NucleicSeq
s from a given ProteinSeq
.
constrainedUniqueCodings :: ProteinSeq -> NucleicSeq -> [NucleicSeq]
constrainedUniqueCodings s constraint = filter (isInfixOf constraint) $ uniqueCodings s
ghci> let ecoRISeq = stringToNucleicSeq "GAATTC"
ghci> let ecoRVSeq = stringToNucleicSeq "GATATC"
ghci> let proteinSegment = stringToProteinSeq "ALGYLSW"
ghci> map representAsDNA$ constrainedUniqueCodings proteinSegment ecoRISeq
[]
ghci> map representAsDNA $ constrainedUniqueCodings proteinSegment ecoRISeq
["GCACTAGGATATCTA","GCACTAGGATATCTC", ... ,"GCTTTGGGATATCTG","GCTTTGGGATATCTT"]
What about replication?
Fidelitous (error-free) replication is such a trivial problem that we needn’t
hardly think about it. The function which always gives back the value it was
given is known as the identity function.
The replication of a sequence can then be expressed by using the Haskell
identity function id
, which is:
id :: a -> a
id x = x
So if we cared to, an alias for id
could be made like so:
replication = id
The story of replication probably won’t get interesting again until we consider infidelity (scandalous!).