Converting DNA strands to amino acid chains

Posted 2 years ago
19 Replies
Posted 2 years ago

Hi Samikshaa,

Very nice! You should consider submitting this to the Wolfram Function Repository.

POSTED BY: Rohit Namjoshi

You have earned a Featured Contributor Badge. Your exceptional post has been selected for our editorial column Staff Picks http://wolfr.am/StaffPicks and your profile is now distinguished by a Featured Contributor Badge and is displayed on the Featured Contributor Board. Thank you!

POSTED BY: EDITORIAL BOARD
Posted 2 years ago

This is very interesting work, Samikshaa. Since you already use BioSequence[], are you already aware of the functions BioSequenceComplement[] and BioSequenceTranslate[]? For example,

BioSequenceTranslate[BioSequenceComplement[BioSequence["DNA", "AGTCGTAGTACGGAT"]]]
   BioSequence["Peptide", "SASCL", {}]

BioSequenceTranslate[BioSequenceComplement[BioSequence["DNA", "TACTTTTCGTCCGGTATAATT"]]]
   BioSequence["Peptide", "MKSRPY.", {}]

where a stop is represented by a period in BioSequence[].

POSTED BY: J. M.

It's similar to those functions, except this is one function and it returns a list of amino acid entities rather than a BioSequence. But those functions are certainly useful!

It recently was uploaded to the WFR, actually! Here's the link.

Samikshaa,

I always enjoy seeing Mathematica used for biology topics. Thank you for posting your work.

I do believe that the output of your code is, however, biologically incorrect.

Compare the list of amino acids from your first input cell to the list of amino acids returned by the built-in command BioSequenceTranslate:

In[10]:= ResourceFunction[
  ResourceObject[<|"Name" -> "DNAtoAminoAcid",
    "ShortName" -> "DNAtoAminoAcid",
    "UUID" -> "67954e72-53c2-4527-a7c0-68dd2ba1497e",
    "ResourceType" -> "Function", "Version" -> "1.0.0",
    "Description" -> "Convert a given strand of DNA to a list of amino acids",
    "RepositoryLocation" -> URL[
      "https://www.wolframcloud.com/obj/resourcesystem/api/1.0"],
    "SymbolName" -> "FunctionRepository`$093e0005691b471995f708959efa4269`DNAtoAminoAcid",
    "FunctionLocation" -> CloudObject[
      "https://www.wolframcloud.com/obj/e7ed59a3-2a4e-4af8-b48c-a1d06c90942e"]|>,
   {ResourceSystemBase -> "https://www.wolframcloud.com/obj/resourcesystem/api/1.0"}]][
 "GTATACTGGTCATAGCATTGACTGGTCCATGTACTTACCGCT"]

Out[10]= {Entity["Chemical", "LMethionine"], 
Entity["Chemical", "LThreonine"], Entity["Chemical", "LSerine"], 
Entity["Chemical", "LIsoleucine"], Entity["Chemical", "LValine"], 
Entity["Chemical", "LThreonine"], 
Entity["Chemical", "LAsparticAcid"], 
Entity["Chemical", "LGlutamine"], Entity["Chemical", "LValine"], 
Entity["Chemical", "LHistidine"], 
Entity["Chemical", "LGlutamicAcid"], 
Entity["Chemical", "LTryptophan"], Entity["Chemical", "LArginine"]}

In[11]:= BioSequenceTranslate[
BioSequence["DNA", 
"GTATACTGGTCATAGCATTGACTGGTCCATGTACTTACCGCT"]]["SequenceString"]

Out[11]= "VYWS.H.LVHVLTA"

Compare Out[10] to Out[11]: they are different.

The codon "GTA" codes for valine, not methionine.

Is this what you intended?

POSTED BY: Todd Allen

My code looks for the first instance of the "TAC" sequence since that becomes the starting methionine codon needed in translation. In the input string "GTATACTGGTCATAGCATTGACTGGTCCATGTACTTACCGCT", the "TAC" first appears after the "GTA", so the "GTA" is ignored and the translation begins at "TAC", meaning the start codon is methionine. "GTA" does indeed code for valine, but my code ignores it and starts at methionine.

Are you adopting a non-standard genetic code? In the "standard" genetic code "TAC" codes for tyrosine, whereas ATG codes for methionine.
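
For reference, these standard-code assignments can be checked with the built-in BioSequenceTranslate, which reads a DNA string as the coding strand, 5' to 3'. This is only a minimal check added for illustration, not part of the function under discussion:

```wolfram
(* Translate single codons, read as coding-strand DNA from 5' to 3'. *)
BioSequenceTranslate[BioSequence["DNA", "TAC"]]["SequenceString"]  (* "Y", tyrosine *)
BioSequenceTranslate[BioSequence["DNA", "ATG"]]["SequenceString"]  (* "M", methionine *)
```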

POSTED BY: Todd Allen

Right, but assuming the input DNA strand is the coding strand, it is first transcribed into mRNA, which is then translated into the amino acid chain. Given an mRNA strand, "AUG" codes for methionine; therefore, on the DNA strand, "TAC" corresponds to "AUG", which becomes methionine. That's why the function only starts reading at "TAC" on the input DNA strand, since that's where the methionine will be.

You are mistaken about the meaning of "coding strand." In the bioinformatics community coding strand refers to the DNA sequence that matches the mRNA (except for having Ts instead of Us). This means the coding strand is what is produced by the process of transcription, and the coding strand (in RNA form) is the template the ribosome would use to make a polypeptide.

See here: "Coding strand" on Wikipedia

In essence the coding strand contains the actual codons, so when a ribosome "sees" TAC in the coding strand which is the template the ribosome is attached to, it will insert a tyrosine. The cell does not take the TAC from the coding strand and transcribe it to AUG as you suggest because the coding strand had already been made by transcription.

Think about it and see if you can realize your code is producing incorrect output.

POSTED BY: Todd Allen

I see. I think I've confused the coding strand and template strand. My code takes the template strand, creates the complementary mRNA strand, then matches that mRNA strand with the appropriate anticodon. In that case, a "TAC" on the template strand would code for an "AUG"/methionine on the mRNA/ribosome. Is that correct?

Posted 2 years ago

Well, once you have the BioSequence[], it isn't overly difficult to get the same result as your function:

Lookup[EntityValue[Entity["BioSequenceType", "Peptide"], 
                   EntityProperty["BioSequenceType", "AlphabetRules"]], 
       Characters[BioSequenceTranslate[BioSequenceComplement[
                  BioSequence["DNA", "AGTCGTAGTACGGAT"]]] @ "SequenceString"],
        Nothing]

I invite you to study the documentation for BioSequence[] and functions related to it in more detail.

POSTED BY: J. M.

Let's take your second-to-last sentence: In that case, a "TAC" on the template strand would code for an "AUG"/methionine on the mRNA/ribosome.

You have to remember there is chemical directionality to the DNA.

So, TAC on the template strand is really 5' - TAC - 3'

Which means, on the complementary mRNA you would have: 3' - AUG - 5'

Since ribosomes always scan mRNA 5' to 3', the ribosome would actually "see" 5' - GUA - 3', so it would insert valine in the polypeptide.
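
The directionality argument can be sketched in code. This is an illustrative check only, assuming the built-in BioSequenceReverseComplement is available in your version; the variable names are made up for the example. It takes the template strand written 5' to 3', forms its reverse complement (the coding strand, also written 5' to 3'), and translates that:

```wolfram
template = BioSequence["DNA", "TAC"];             (* template strand, written 5' -> 3' *)
coding = BioSequenceReverseComplement[template];  (* coding strand, 5' -> 3': "GTA" *)
BioSequenceTranslate[coding]["SequenceString"]    (* "V", valine *)
```

Taking only the complement, without reversing, would read the strand in the wrong direction and give "ATG" instead.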

Don't get frustrated. This is not intuitive stuff that naturally flows out of most textbooks. You are learning as you push through this.

POSTED BY: Todd Allen

So does that mean the sequence gets reversed before it's translated in mRNA? So then a CAT on the template strand would be 5' - CAT - 3', so the mRNA is 3' - GUA - 5', which then is read by the ribosome as 5' - AUG - 3' and will thus insert a methionine, right?

I will explore it more, thank you!

That's correct, Samikshaa!

You've got it!

Do you see a path forward to update your code to produce correct output?

POSTED BY: Todd Allen

Thank you for your help! I've updated the post and code to reflect what you said. Is it correct now?

Your code is now producing correct output for the central dogma, DNA --> RNA --> Protein.
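
One possible way to arrange these steps with the built-in BioSequence functions is sketched below. The helper name templateToPeptide is hypothetical and this is not the source of DNAtoAminoAcid; it assumes the input is the template strand written 5' to 3', that translation starts at the first ATG on the coding strand, and it returns a one-letter peptide string rather than a list of amino acid entities:

```wolfram
(* Hypothetical helper for illustration; this is not the DNAtoAminoAcid source. *)
templateToPeptide[template_String] := Module[{coding, start, orf, peptide},
  (* The coding strand, 5' -> 3', is the reverse complement of the template strand. *)
  coding = BioSequenceReverseComplement[
     BioSequence["DNA", template]]["SequenceString"];
  (* Locate the first start codon; give up if there is none. *)
  start = StringPosition[coding, "ATG", 1];
  If[start === {}, Return[Missing["NoStartCodon"], Module]];
  (* Keep whole codons only, starting at the ATG. *)
  orf = StringDrop[coding, start[[1, 1]] - 1];
  orf = StringTake[orf, 3 Quotient[StringLength[orf], 3]];
  peptide = BioSequenceTranslate[BioSequence["DNA", orf]]["SequenceString"];
  (* A stop codon translates to a non-letter symbol; keep the residues before the first one. *)
  StringJoin[TakeWhile[Characters[peptide], LetterQ]]
]

(* The coding strand here is GGATGAAATAAGG: ATG AAA TAA, i.e. Met, Lys, stop. *)
templateToPeptide["CCTTATTTCATCC"]  (* "MK" *)
```

Working on the coding strand as a DNA string skips the explicit T-to-U transcription step, which does not change which amino acids are produced.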

If you are interested in learning more about bioinformatic topics, let me recommend this book, which is advanced - but there is nothing wrong with challenging yourself.

Bioinformatics Algorithms

POSTED BY: Todd Allen

The function can be used and called from the resource function repository: https://resources.wolframcloud.com/FunctionRepository/resources/DNAtoAminoAcid

Feel free to check it out!

POSTED BY: Zach Shelton