
[WSSA16] Emulation of the 'RNA to Amino Acid Sequence' pathway

Posted 8 years ago

Central Dogma

ABSTRACT

A living cell contains a vast number of enzymes and other molecules that influence how a protein is derived from a gene. Protein synthesis is one of the most complex processes in the cell, involving many signals, enzymes, and other molecular complexes. According to the central dogma of molecular biology, DNA is copied into an RNA sequence (Transcription), which is then decoded into an amino acid sequence (Translation); the amino acid chain then folds into a protein. Two essential molecular machines in this pipeline are the spliceosome and the ribosome. The spliceosome removes the non-coding parts (Introns) from the primary RNA transcript and joins the remaining parts (Exons), producing a mature mRNA. The ribosome, in turn, takes the mature mRNA and translates it into the corresponding amino acid sequence.

The aim of this project is to model the primary RNA transcript → mRNA → Amino Acid Sequence pipeline. We used Machine Learning techniques to design and train an Artificial Neural Network that recognizes the important sites involved in these processes, and we imitate some of the steps that the cell performs during protein synthesis.

I. Exon-Intron Splicing

Spliceosome


Data Acquisition

In the first step, we import all the data on the special sites that will be needed. Data on Intron and Exon splice site regions were acquired from a Machine Learning Database.

First, we import around 3160 sequences from the database. Each sequence is labeled "EI" for an Exon-Intron splice site, "IE" for an Intron-Exon splice site, or "N" for Neither. Throughout the project, we convert each sequence to a 2-dimensional array by one-hot encoding the nucleotide letters A, C, G and T as {1,0,0,0}, {0,1,0,0}, {0,0,1,0} and {0,0,0,1}, respectively. This gives the classifier a fixed-size numeric input instead of a string of letters.
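As a minimal sketch of this encoding (the helper name encodeSequence is hypothetical and not part of the original code), the conversion could look like this:

```wolfram
(* Map each nucleotide letter to its one-hot row. *)
oneHotRules = {"A" -> {1, 0, 0, 0}, "C" -> {0, 1, 0, 0},
   "G" -> {0, 0, 1, 0}, "T" -> {0, 0, 0, 1}};

(* encodeSequence is a hypothetical helper name used for illustration only. *)
encodeSequence[s_String] := Characters[s] /. oneHotRules;

encodeSequence["GATC"]
(* {{0, 0, 1, 0}, {1, 0, 0, 0}, {0, 0, 0, 1}, {0, 1, 0, 0}} *)

Dimensions[encodeSequence["GATC"]]
(* {4, 4} *)
```

Each nucleotide becomes one row, so a 60-nucleotide sequence yields an array of dimensions {60, 4}.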


Classifier

We build an association from the sequences to their labels and pass it to Classify to train a classifier.

Classifier


Artificial Neural Network Design and Training (Optional)

We also propose an ANN trained on the acquired data as an alternative way to classify splice sites. To test the network, the user needs to format the data as grayscale images of dimensions {60, 4}. Formatting of the data is also presented below. (Important: the whole dataset was used for training, so the user needs a separate dataset for testing.)

ANN Training

cm


Sliding Window

To analyze the sequence and find splice sites in it, we construct a 60-nucleotide "window" that slides along the primary RNA transcript one position at a time and looks for splice sites.

sliding window

takeIdxSeq[seq_, idx_] := If[idx + 59 <= Length@seq, Take[seq, {idx, idx + 59}]]
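For illustration, all windows can also be generated in one step with Partition and an offset of 1. This is only a sketch: slidingWindows is a hypothetical helper, and the toy call uses a width of 3 instead of 60 so that the result is easy to read.

```wolfram
(* slidingWindows is a hypothetical helper; the default width of 60 follows the text above. *)
slidingWindows[seq_List, width_Integer : 60] := Partition[seq, width, 1];

(* Toy example with width 3. *)
slidingWindows[Characters["ACGTA"], 3]
(* {{"A", "C", "G"}, {"C", "G", "T"}, {"G", "T", "A"}} *)
```

A sequence of length n yields n - width + 1 windows, and a sequence shorter than the window yields none.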

(* ssList, takeEIRange, takeIERange, findPositions and score are helper functions that are not shown in this post. *)
scoreEI[str_String] :=
        Module[
                {p, k, l, seq, splicesites, positions},
                seq = Characters[str] /. {"A" -> {1, 0, 0, 0}, "C" -> {0, 1, 0, 0}, "G" -> {0, 0, 1, 0}, "T" -> {0, 0, 0, 1}};
                splicesites = ssList[str];
                positions = Flatten@Position[splicesites, "EI"];
                (* 60-nucleotide window starting at every predicted donor site *)
                k = Take[seq, {#, # + 59}] & /@ positions;
                l = takeEIRange /@ k;
                p = findPositions /@ l;

                (* one {score, "EI", position} triple per predicted site *)
                Thread[{score[pwmDon, #] & /@ p, "EI", positions}]
        ]


scoreIE[str_String] :=
        Module[
                {p, k, l, seq, splicesites, positions},
                seq = Characters[str] /. {"A" -> {1, 0, 0, 0}, "C" -> {0, 1, 0, 0}, "G" -> {0, 0, 1, 0}, "T" -> {0, 0, 0, 1}};
                (* ssList receives the original string, as in scoreEI *)
                splicesites = ssList[str];
                positions = Flatten@Position[splicesites, "IE"];
                (* 60-nucleotide window starting at every predicted acceptor site *)
                k = Take[seq, {#, # + 59}] & /@ positions;
                l = takeIERange /@ k;
                p = findPositions /@ l;

                (* one {score, "IE", position} triple per predicted site *)
                Thread[{score[pwmAcc, #] & /@ p, "IE", positions}]
        ]

Donor Site and Acceptor Site PWMs

Position Weight Matrices (PWMs) are commonly used to represent motifs (patterns) in biological sequences. Both PWMs below were derived from the same database sequences.

  • Donor Site (Exon-Intron) PWM

pwmDon

  • Acceptor Site (Intron-Exon) PWM

pwmAcc
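To make the idea of a PWM concrete, here is a minimal sketch of how one could be built from aligned sequences and used for scoring. It is not necessarily how pwmDon and pwmAcc were computed: buildPWM and scoreSite are hypothetical helpers, the four training strings are made up, the pseudocount of 1 and the uniform background of 1/4 are assumptions.

```wolfram
alphabet = {"A", "C", "G", "T"};

(* buildPWM is a hypothetical helper. Rows are positions, columns follow `alphabet`.
   A pseudocount of 1 avoids Log[0]; the background is assumed to be uniform (1/4). *)
buildPWM[seqs : {__String}] :=
  Module[{cols, counts, probs},
    cols = Transpose[Characters /@ seqs];
    counts = Table[Count[col, a] + 1, {col, cols}, {a, alphabet}];
    probs = #/Total[#] & /@ counts;
    N@Log2[probs/(1/4)]
  ];

(* scoreSite is a hypothetical helper: the sum of the log-odds entries selected by the sequence. *)
scoreSite[pwm_, s_String] :=
  Total@MapThread[#1[[First@FirstPosition[alphabet, #2]]] &, {pwm, Characters[s]}];

(* Made-up donor-like motifs, for illustration only. *)
toyPWM = buildPWM[{"GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"}];

scoreSite[toyPWM, "GTAAGT"] > scoreSite[toyPWM, "CCCCCC"]
(* True *)
```

A sequence that matches the motif collects positive log-odds at most positions, while an unrelated sequence collects negative ones, so the total score separates the two.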


Removing the wrong Splice Sites

This is one of the most crucial steps in the whole process. When several splice sites of the same type are predicted in a row, only the highest-scoring one is kept. In addition, an mRNA has to start and end with an Exon, so an IE splice site that comes before the first EI site, or an EI splice site that comes after the last IE site, cannot be correct and has to be filtered out.
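The first part of this filtering can be illustrated on a toy list. The sketch below is not the code used in the project: keepBestPerRun is a hypothetical helper and the candidate scores and positions are made up.

```wolfram
(* Each candidate is {score, type, position}; the values are made up for illustration. *)
candidates = {{0.9, "EI", 10}, {0.4, "EI", 25}, {0.7, "IE", 80},
   {0.8, "IE", 95}, {0.6, "EI", 140}};

(* keepBestPerRun is a hypothetical helper: split the position-sorted list into
   runs of equal type and keep the highest-scoring candidate of each run. *)
keepBestPerRun[list_List] :=
  First@MaximalBy[#, First] & /@ SplitBy[SortBy[list, Last], #[[2]] &];

keepBestPerRun[candidates]
(* {{0.9, "EI", 10}, {0.8, "IE", 95}, {0.6, "EI", 140}} *)
```

In this toy result the trailing EI site at position 140 would still have to be dropped, because the transcript must end with an Exon.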

(* Indices (shifted by idx - 1) of the highest-scoring entries within a run of candidates. *)
findMaxpos[list_, idx_] := Flatten@Position[list[[All, 1]], Max[list[[All, 1]]]] + idx - 1;

(* Returns the sequence positions of splice sites to discard: within each run of
   consecutive sites of the same type, every site except the best-scoring one. *)
FindWrongSS[seq_String] :=
           Module[
             {list, eipos, iepos, m1, m2, k1, k2, j1, j2, f1, f2, v1, v2},

             (* all {score, type, position} triples, sorted by position *)
             list = SortBy[Join[scoreEI[seq], scoreIE[seq]], Last];

             eipos = Flatten@Position[list[[All, 2]], "EI"];
             iepos = Flatten@Position[list[[All, 2]], "IE"];

             (* {first, last} index of each run of consecutive same-type sites *)
             m1 = {Min[#], Max[#]} & /@ Split[eipos, #2 - #1 == 1 &];
             m2 = {Min[#], Max[#]} & /@ Split[iepos, #2 - #1 == 1 &];

             (* runs of length 1 need no filtering *)
             k1 = DeleteCases[m1, a_ /; First[a] == Last[a]];
             k2 = DeleteCases[m2, a_ /; First[a] == Last[a]];

             (* index of the best-scoring site in each run *)
             j1 = Flatten[findMaxpos[Part[list, First@# ;; Last@#], First@#] & /@ k1];
             j2 = Flatten[findMaxpos[Part[list, First@# ;; Last@#], First@#] & /@ k2];

             (* all indices covered by the runs *)
             f1 = Flatten[Range[First@#, Last@#] & /@ k1];
             f2 = Flatten[Range[First@#, Last@#] & /@ k2];

             (* sequence positions of every site in a run except the best one *)
             v1 = list[[#, 3]] & /@ Complement[f1, j1];
             v2 = list[[#, 3]] & /@ Complement[f2, j2];

             Join[v1, v2]
            ]

removeUnnecessary[seq_String] :=
                  Module[
                     {wrong, splicesites, eiPos, iePos},

                      wrong = FindWrongSS[seq];
                      splicesites = ssList[seq];
                      splicesites[[wrong]] = "N";

                      eiPos = Flatten@Position[splicesites, "EI"];
                      iePos = Flatten@Position[splicesites, "IE"];

                      (* An mRNA starts and ends with an Exon: drop an IE site that precedes
                         the first EI site and an EI site that follows the last IE site. *)
                      If[eiPos =!= {} && iePos =!= {},
                         If[First[iePos] < First[eiPos], splicesites[[First[iePos]]] = "N"];
                         If[Last[iePos] < Last[eiPos], splicesites[[Last[eiPos]]] = "N"]
                      ];

                      splicesites
                   ]

Final Step: Splicing Introns and Joining Exons

Spliceosome[seq_String] :=
           Module[
                {sites, a, b, k, m},

                 (* filter the splice sites once and reuse the result *)
                 sites = removeUnnecessary[seq];
                 a = Flatten@Position[sites, "EI"];
                 b = Flatten@Position[sites, "IE"];

                 (* pair each EI site with the following IE site; + 30 shifts from the
                    start of the 60-nucleotide window to its centre *)
                 k = Partition[Sort@Join[a, b], 2] + 30;
                 m = Flatten[Range @@@ k];

                 (* delete the Intron positions and join what is left *)
                 StringJoin[Delete[Characters[seq], List /@ m]]
            ]
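
The last step of Spliceosome, deleting the Intron ranges and joining what remains, can be tried in isolation. This is only a sketch: removeRanges is a hypothetical helper, and the sequence and the ranges are made up, with lower-case letters standing in for Introns.

```wolfram
(* removeRanges is a hypothetical helper; ranges are inclusive {start, end} pairs. *)
removeRanges[s_String, ranges : {{_Integer, _Integer} ...}] :=
  StringJoin@Delete[Characters[s], List /@ Flatten[Range @@@ ranges]];

(* Lower-case letters mark the made-up Introns. *)
removeRanges["AUGgtaagCCAcagUAA", {{4, 8}, {12, 14}}]
(* "AUGCCAUAA" *)
```

With an empty list of ranges the input string is returned unchanged.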

II. mRNA → Amino Acid Translation

Translation

Nucleotides (A, C, G, U) are read in triplets called codons. Of the 64 possible triplets, 61 code for amino acids; one of these, AUG, codes for Met and in most cases also serves as the start codon. The other 3 are stop codons, which terminate translation.

Genetic Code
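
The codon counts above can be checked directly. In the sketch below, toyCodonTable is a hypothetical and deliberately partial table used only for illustration; it is not the CodontoAminoAcid function used later.

```wolfram
(* All 4^3 = 64 possible codons over the RNA alphabet. *)
allCodons = StringJoin /@ Tuples[{"A", "C", "G", "U"}, 3];
Length[allCodons]
(* 64 *)

(* The three stop codons of the standard genetic code. *)
stopCodons = {"UAA", "UAG", "UGA"};
Length[Complement[allCodons, stopCodons]]
(* 61 *)

(* toyCodonTable is a hypothetical, partial table; "*" marks a stop codon. *)
toyCodonTable = <|"AUG" -> "M", "CCA" -> "P", "UAA" -> "*"|>;
Lookup[toyCodonTable, StringPartition["AUGCCAUAA", 3]]
(* {"M", "P", "*"} *)
```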

  Ribosome[seq_String] :=
              Module[
                    {rna, start, codons, prot, stop},

                    (* the splicing code works with T, the start codon below is written with U *)
                    rna = StringReplace[seq, "T" -> "U"];

                    (* first AUG in the sequence; nothing to translate if there is none *)
                    start = StringPosition[rna, "AUG", 1];
                    If[start === {}, Return[{}, Module]];

                    (* read codons in frame, starting at the start codon *)
                    codons = StringPartition[StringDrop[rna, start[[1, 1]] - 1], 3];

                    (* CodontoAminoAcid is not shown in this post; it is assumed to return "*" for stop codons *)
                    prot = CodontoAminoAcid /@ codons;

                    (* keep everything up to and including the first stop; if no stop is found, keep all *)
                    stop = FirstPosition[prot, "*", {Length[prot]}];
                    Take[prot, First[stop]]
                ]

Final Step: Primary RNA → Amino Acid

Wrapping up all of the processes:

RNAtoAA[seq_String] :=
        Module[
                {spliced},
                spliced = Spliceosome[seq];

                Ribosome[spliced]
        ]
POSTED BY: Elen Vardanyan

You earned the "Featured Contributor" badge, congratulations!

This is a great post and it has been selected for the curated Staff Picks group. Your profile is now distinguished by a "Featured Contributor" badge and displayed on the "Featured Contributor" board.

POSTED BY: Moderation Team