Message Boards Message Boards

[WSSA16] Analysing Protein Sequences and Letter Sequences of Words

GROUPS:

enter image description here

Introduction

Proteins play crucial roles in the living cell and are involved in various vital processes. Protein sequences are not random, just as random sequences of letters and spaces rarely form words or sentences. Rather, protein sequences are "designed by evolution" i.e.the polypeptide chains, which did not fold in a biologically relevant time into a functional state and made the host organism less competitive, disappeared due to natural selection. Natural protein sequences are the result of many trials and errors. Important to notice that folding into specific structures in a biologically relevant time is a necessary but not a sufficient condition for an amino acid sequence to be a protein sequence.There is more information encrypted in protein sequences that needs to be understood. Our aim is to find out the regularities in the protein sequences that are absent in the randomly generated sequences of amino acids.

Results and discussion

  • Data importation (transmembrane protein sequences)
  • Sequence length statistics
  • Sequence structure statistics
  • Random amino acid sequence generation
  • Random sequence structure statistics
  • Guessing whether given sequence is a transmembrane protein sequence The code was also implemented for analyzing sequences of letters in English, German, Italian, Russian, Arabic and Hebrew words.

We decided to take sequences of the transmembrane proteins from the human protein data that is available in Mathematica 11.

(*Chose transmembrane proteins*)
(tr = {#, ProteinData[#, "Sequence"]} & /@ 
    Take[Select[ProteinData["Classes"], 
      StringMatchQ[
        CanonicalName[#], ___ ~~ "Transmembrane" ~~ ___] &]]) // Short

(*Take sequences of transmembrane proteins*)
(trseq = DeleteCases[Flatten[tr], a_ /; StringQ[a] != True]) // Short

Next we did some basic statistical anayses of the sequence lengths.

Histogram[Map[Length, Characters[trseq]]]
Print[ "Mean sequence length = ", 
 N[Mean[Map[Length, Characters[trseq]]]]]
Print["Median sequence length = ", 
 N[Median[Map[Length, Characters[trseq]]]]]
Print["SD of the sequence lengths = ", 
 N[StandardDeviation[Map[Length, Characters[trseq]]]]]
Print["Skewness of the sequence lengths = ", 
 N[Skewness[Map[Length, Characters[trseq]]]]]

enter image description here

Mean sequence length = 714.987

Median sequence length = 559.

SD of the sequence lengths = 465.857

Skewness of the sequence lengths = 1.58423

In protein sequences there are evolutionary preserved regions that are important for some function, thus it is important to determine the sub-sequences that are frequently occurring in protein sequences. This was achieved through CharacterCounts function in Mathematica 11:

(*Finds the most frequently occurring subsequences.Probably those are 
evolutionary preserved and hence are important parts of the 
protein*)(*the computaion time takes around 1 min*)
tab = Table[
   Sort[Merge[CharacterCounts[Flatten[trseq], n], Total], 
    Greater], {n, 1, 20, 1}];

Then it was visualized as a table and bar chart:

(*Generates a hierarchical table of most frequently occurring 
subsequences of a given length.The length of the subsequence can vary 
between 1 and 20*)
Manipulate[Dataset[Take[tab[[i]], {1, 20}]], {i, 1, 20, 1}] 

(*Generates a hierarchical table of most frequently occurring 
subsequences of a given length.The length of the subsequence can vary 
between 1 and 20*)
Manipulate[
 BarChart[Take[tab[[i]], {1, 20}], 
  ChartLabels -> 
   Placed[ToString /@ Automatic, Axis, Rotate[#, Pi/2] &]], {i, 1, 20,
   1}]

enter image description here enter image description here

From the above images it can be seen that in the sub-sequences that are 6 amino acid long KTGTL and DKTGTL occur 51 times. This means that those sequences are important for some function. In case of random sequences the most frequent sub-sequences that are 6 amino acid long occur occur 3-4 times (check the notebook attached). Our next step was to find the similarities between sequences. This was quantified by calculating Damerau–Levenshtein distance. Damerau–Levenshtein distance is the number of operations that are required to get one sequence from the other. There are 4 types of operations allowed: insertion, deletion, substitution and transposition(exchange of the positions) of the neighboring symbols. For example the Damerau–Levenshtein distance between "cat" and "bet" is equal to 2 since there are two substituents required, namely substituting "c" by "b" and "a" by "e".

(*Computation of the Damerau-Levenshtein distance between the \
sequences of transmembrane proteins*)
(*The computation time is around 20 min*)
dldistance = 
  Table[DamerauLevenshteinDistance[trseq[[j]], trseq[[k]]], {j, 
    694}, {k, 694}];

(*Showing the Damerau-Levenshtein distance array plot for the 
sequences of transmembrane proteins, with rainbow colors. Violet 
means that sequences are identical.*)
ArrayPlot[dldistance, ColorFunction -> "Rainbow"]
(*Generation of the properties of the Damerau-Levenshtein 
distances for the sequences of transmembrane proteins*)
Print[ "Mean Damerau-Levenshtein distance = ", 
 N[Mean[Flatten[dldistance]]]]
Print["Median Damerau-Levenshtein distance = ", 
 N[Median[Flatten[dldistance]]]]
Print["SD of the Damerau-Levenshtein distance = ", 
 N[StandardDeviation[Flatten[dldistance]]]]
Print["Skewness of the Damerau-Levenshtein distance = ", 
 N[Skewness[Flatten[dldistance]]]]

enter image description here

Mean Damerau-Levenshtein distance = 768.983

Median Damerau-Levenshtein distance = 673.

SD of the Damerau-Levenshtein distance = 423.818

Skewness of the Damerau-Levenshtein distance = 1.47982

In the rainbow color encoded array plot the violet corresponds to the 0 Damerau-Levenshtein distance, as the color changes towards read the Damerau-Levenshtein distance increases. In order to get deeper understanding of the statistical data of protein sequences displayed above we went on and generated random amino acid sequences:

aminoac = {"L", "S", "A", "V", "G", "E", "T", "I", "P", "R", "K", "F",
    "D", "Q", "N", "Y", "C", "M", "H", "W"};

(*Generation of one random amino acid sequence of random length that 
is between given integers.*)
generatePolypeptide[m_, n_] := 
 Module[{aminoac = {"L", "S", "A", "V", "G", "E", "T", "I", "P", "R", 
     "K", "F", "D", "Q", "N", "Y", "C", "M", "H", "W"}, out},
  out = StringJoin@RandomChoice[aminoac, RandomInteger[{m, n}]]
  ]

(*Generation of several(l) random amino acid sequence of random 
length that is between given integers.*)
generatePlist[l_, ps_, pl_] := Module[{out = {}},
  Do[out = AppendTo[out, generatePolypeptide[ps, pl]], l];
  out
  ]

(*Generation of 694 random amino acid sequences that has minimum 
length of 55 and maximum length of 3097. Those numbers are the 
maximal and minimal lengths of sequences in our transmembrane protein 
data.*)
(randomseq = generatePlist[694, 55, 3079]) // Short

Next we have performed similar operations over this random amino acid sequence data as we did for the transparent protein sequences. Below is rainbow color encoded array plot of the random amino acid sequence data with some statistical characteristics.

(*Computaion of the Damerau\[Dash]Levenshtein distance between the 
random amino acid sequences.*)
(*The computaion time is around 120 min*)
dldistanceRand = 
  Table[DamerauLevenshteinDistance[randomseq[[j]], 
    randomseq[[k]]], {j, 694}, {k, 694}];

(*Showing the Damerau-Levenshtein distance array plot for the 
random amino acid sequences, with rainbow colors. Violet means that 
sequences are identical.*)
ArrayPlot[dldistanceRand, ColorFunction -> "Rainbow"]
(*Generation of the properties of the Damerau-Levenshtein 
distances for the sequences of transmembrane proteins*)
Print[ "Mean Damerau-Levenshtein distance = ", 
 N[Mean[Flatten[dldistanceRand]]]]
Print["Median Damerau-Levenshtein distance = ", 
 N[Median[Flatten[dldistanceRand]]]]
Print["SD of the Damerau-Levenshtein distance = ", 
 N[StandardDeviation[Flatten[dldistanceRand]]]]
Print["Skewness of the Damerau-Levenshtein distance = ", 
 N[Skewness[Flatten[dldistanceRand]]]]

enter image description here

Mean Damerau-Levenshtein distance = 1744.85

Median Damerau-Levenshtein distance = 1863.

SD of the Damerau-Levenshtein distance = 578.999

Skewness of the Damerau-Levenshtein distance = -0.613977

From this analysis one can conclude that transmembrane protein sequences are closely related in contrast to random amino acid sequences.

Next we have written a short script that will guess whether the proposed sequence can be a transmembrane protein sequence, try it.

(*Selection of the training data sets*)
{trainTRSeq, testTRSeq} = TakeDrop[trseq, 600];
{trainRandSeq, testRandSeq} = TakeDrop[randomseq, 600];

(*Classification of the transmembrane protein classification*)
(*The computaion time is around 3 min*)
protClassifier = 
 Classify[<|"Transmembrane protein sequence" -> trainTRSeq, 
   "Not a transmembrane protein sequence" -> testRandSeq|>, 
  Method -> "Markov", PerformanceGoal -> "Quality"]

(*Creation of validation set*)
valtrSeq = Thread[testTRSeq -> "Transmembrane protein sequence"];
valRandSeq = 
  Thread[testRandSeq -> "Not a transmembrane protein sequence"];
validation = Join[valtrSeq, valRandSeq];

(*Testing the accuracy of the classifier*)
cm = ClassifierMeasurements[protClassifier, validation]
cm["Accuracy"]

(*A decision maker that will tell wehter proposed sequence can be 
transmembrane protein seqeunce*)
decision[sequence_] := 
  If [(StringLength[sequence] < 
      30) || (Max[
       Abs@ Subtract[#[[1]], #[[2]]] & /@ 
        StringPosition[sequence, x_ ~~ x_]] > 20), 
   "Not a transmembrane protein sequence", protClassifier[sequence]];

FormFunction["Sequence" -> "String", decision[#Sequence] &]

Words in different languages are in some seance similar to protein sequences. Both are products of evolution, but these evolutions have different purposes and driving forces. Using the same codes we have analyzed words in different languages. The word data was obtained as follows (for example for German words):

german = ToLowerCase[
  WordList["KnownWords", Language -> Entity["Language", "German"]]]

Next are presented word length distributions for few languages:

English

enter image description here

Mean = 8.33284

Median = 8.

SD = 2.64659

Skewness = 0.37119

German

enter image description here

Mean = 11.6281

Median = 11.

SD = 3.95164

Skewness = 0.455154

Italian

enter image description here

Mean = 9.64161

Median = 10.

SD = 2.42446

Skewness = 0.2337

Russian

enter image description here

Mean = 8.10481

Median = 8.

SD = 2.41756

Skewness = 0.44462

Arabic

enter image description here

Mean = 6.24913

Median = 6.

SD = 1.83311

Skewness = 0.525033

Hebrew

enter image description here

Mean = 6.75852

Median = 7.

SD = 1.77025

Skewness = 0.993837

It can be noticed that the length of words in the Semitic languages that are shown here are relatively shorter. Next we are showing the array plots for Damerau–Levenshtein distances calculated between 10000 words for each language.

enter image description here

It shows that the Damerau–Levenshtein distances between Russian words and Arabic words are large in comparison English, German, Italian and Hebrew. What would be your interpretation of the image above?

Attachments:
POSTED BY: Rafael Petrosian
Answer
1 year ago

enter image description here - you earned "Featured Contributor" badge, congratulations !

This is a great post and it has been selected for the curated Staff Picks group. Your profile is now distinguished by a "Featured Contributor" badge and displayed on the "Featured Contributor" board.

POSTED BY: Moderation Team
Answer
1 year ago

Group Abstract Group Abstract