Message Boards Message Boards

Glycolysis Similarity Among Humans, Chimps, and Salmon

Posted 6 years ago

Well, I find myself between jobs again, so I'll take another step into the wild biology frontier. First, a few sentences on what I've learned from working remotely for the past two years for two failed companies. I've seen a lot of one strategy that doesn't work, which seems to stem from the mythos of Steve Jobs and the iPhone. Namely, a genius designing a masterpiece in isolation and then taking the world by storm with it. It doesn't seem to work very well when the goal is to automate and scale a process. Instead, the vaguely defined masterpiece never manifests itself, the team wastes huge amounts of time on work that is thrown away, zero successful completions of the process occur (in these cases getting students into universities they will like and getting people the best deals on loans), and then the company runs out of money. And even if they had been able to decide exactly what the magical app should have been, it wouldn't have worked as soon as it was confronted with the messy details of reality. I would have been much happier if they were drawing their inspiration from the big restaurant chains or stores that started as a single shop, the big transport services that started by delivering on foot within a single city, or even the early Steve Jobs days where Apple was building computers in a garage.

Anyway, back to working with biology data, which is one of my sanctuaries of slow but steady progress (the other being working on more interesting character AI for games, because current games get boring too quickly). I've been getting familiar with more of the various NCBI databases and tinkering with the API. All of the UI, terminology, and acronyms are somewhat overwhelming at first. I find having a toy genetic engineering project in mind to helps me stay focused on figuring out specific computations I would actually want to run. So my toy project is to genetically engineer grass that doesn't have to be cut. It should automatically stop growing (but stay alive) at an agreeable height for a lawn. I chose that project because it's beyond what can be done today, but perhaps much simpler than common goals like curing diseases in humans. And it's something where I can conceive of personally doing physical experiments, because the safety and regulatory concerns are much less stringent when dealing with a tray of grass. And it's perhaps even less controversial than other genetically modified plants, because it's weakening the grass from an evolutionary standpoint. So you don't have to worry about it taking over natural fields. It turns out golf courses have been interested in creating so-called dwarf grasses via selective breeding for a while, but I think to truly get the desired behavior will require engineering.

Most genetic engineering projects you read about involve adding or disabling single genes. As you start to investigate progressing to designing small gene circuits, you realize that the metabolic pathway data is lagging behind the genomic data (which is already not as complete as you want). The most popular database for pathway data is KEGG Pathway. It consists of a few hundred hand drawn pathway maps covering pathways that are somewhat consistent (called evolutionarily conserved) across many organisms, annotated with compounds and enzymes. Then for a variety of species it lists which genes produce the proteins corresponding to the various enzymes in the reactions. Of course, the completeness varies a lot between species, however, it's enough to begin imagining potential paths forward.

I've been studying general knowledge on plant and grass development, so using the pathways I can perhaps get some ideas for genes to try and disable in order to stop the growth. Plants use a much larger amount and variety of secondary metabolites than animals, so then perhaps I can find one whose concentration correlates with the height of the blade of grass. Many processes in biology are regulated by the concentration of a signaling molecule (like how your hand knows where to grow a thumb instead of a pinky). Perhaps the trickiest part will be to design an enzyme that is activated by a particular concentration of that signaling molecule, if I can't find one that already exists. But enzyme design is a hot topic, and as computational chemistry algorithms continue to advance (and become easier to use due to integration in the WL, more standardized formats, etc), eventually that will be possible too. Then the final step would be to add a gene or genes that just produce some RNA to interfere with the genes I want to disable (a common technique). Then my enzyme can promote/activate that gene when the enzyme is activated by the concentration of the signaling molecule. Whew. That's a long list of hypotheticals for such a simple sounding project, and we haven't even considered trying to control side effects. But I think this type of reasoning will become more common in the future.

However, those investigations are for future dates! Let's wrap this post up with a simple, interesting exercise using the KEGG database. Let's see how similar the DNA and protein amino acid sequences used for glycolysis (a fundamental process in all organisms) are among humans, chimps, and something really different like salmon. The way the KEGG API works is that for a given pathway we can request either the enzyme categories used in the various reactions (prefix "ec"), the formulas for the reactions the enzymes catalyze (prefix "rn"), or the genes that produce those enzymes (prefix varies per organism, "hsa" is human, "ptr" is chimp, "sasa" is salmon). KEGG has their own IDs for the genes, but then from those you can request the sequences or links back to other databases like the NCBI nucleotide and protein databases for more information. We'll start by grabbing the XML data (in a schema they call KGML) for the glycolysis genes for each species.

human = Import["http://rest.kegg.jp/get/hsa00010/kgml", "XML"][[2, 3]];

chimpanzee = 
  Import["http://rest.kegg.jp/get/ptr00010/kgml", "XML"][[2, 3]];

salmon = Import["http://rest.kegg.jp/get/sasa00010/kgml", "XML"][[2, 
    3]];

Let's define a function to download and parse out the name and sequence data for a given gene. It would be nice if they offered this data in JSON format.

getGeneInfo[keggId_] := keggId // <|
    "Names" -> 
     First@StringCases[
       Import["http://rest.kegg.jp/get/" <> #, "Text"], 
       "NAME" ~~ Whitespace ~~ Shortest[names__] ~~ "\n" :> 
        StringSplit[names, ", "]],
    "NTSeq" -> 
     StringJoin[
      StringSplit[
        Import["http://rest.kegg.jp/get/" <> # <> "/ntseq", "Text"], 
        "\n"][[2 ;;]]],
    "AASeq" -> 
     StringJoin[
      StringSplit[
        Import["http://rest.kegg.jp/get/" <> # <> "/aaseq", "Text"], 
        "\n"][[2 ;;]]]
    |> &

Now let's download the info for each of the genes mentioned in each of the pathways. Sometimes multiple genes are given for a single enzyme or reaction node in the pathway.

humanGeneInfo = 
 Select[human, #[[1]] == "entry" && #[[2, 3, 2]] == "gene" &][[All, 2,
       2, 2]] // StringSplit // Flatten // 
  Module[{i = 0}, 
    Monitor[AssociationMap[(i++; getGeneInfo@#) &, #], 
     ProgressIndicator[i, {1, Length@#}]]] &

chimpanzeeGeneInfo = 
 Select[chimpanzee, #[[1]] == "entry" && #[[2, 3, 2]] == 
         "gene" &][[All, 2, 2, 2]] // StringSplit // Flatten // 
  Module[{i = 0}, 
    Monitor[AssociationMap[(i++; getGeneInfo@#) &, #], 
     ProgressIndicator[i, {1, Length@#}]]] &

salmonGeneInfo = 
 Select[salmon, #[[1]] == "entry" && #[[2, 3, 2]] == "gene" &][[All, 
      2, 2, 2]] // StringSplit // Flatten // 
  Module[{i = 0}, 
    Monitor[AssociationMap[(i++; getGeneInfo@#) &, #], 
     ProgressIndicator[i, {1, Length@#}]]] &

You'll see some error messages thrown from a lot of the salmon genes missing names. Now let's go through each of the human genes and find the chimpanzee gene that has a shared name and list the name of the gene, the length of the nucleotide sequence, and the edit distances to get from the human to chimpanzee nucleotide and amino acid sequences for that gene. Then sort by ratio of edit distance to sequence length so the most similar ones come first and the most different ones come last.

Table[SelectFirst[chimpanzeeGeneInfo, 
    ContainsAny[ToLowerCase@humanGene[["Names"]], 
      ToLowerCase@#[["Names"]]] &] // 
   If[! MissingQ@#, {humanGene[["Names", 1]], 
      StringLength@humanGene[["NTSeq"]], 
      EditDistance[humanGene[["NTSeq"]], #[["NTSeq"]]], 
      EditDistance[humanGene[["AASeq"]], #[["AASeq"]]]}, 
     Nothing] &, {humanGene, humanGeneInfo}] // 
 SortBy[#[[3]]/#[[2]] &]

{{ENO3,1305,0,0},{PGAM2,762,1,0},{PGK1,1254,2,0},{PDHA2,1167,2,1},{LDHB,1005,2,1},{HK1,2754,7,3},{ADH5,1125,3,2},{ADPGK,1491,4,1},{ALDOB,1095,3,1},{ACSS2,2106,8,4},{GAPDH,1008,4,0},{ALDH1A3,1539,7,1},{ALDOC,1095,5,3},{ENO2,1305,6,0},{DLAT,1944,9,4},{LDHA,999,5,1},{AKR1A1,978,5,3},{ALDH2,1554,8,3},{PGAM1,765,4,0},{HK2,2754,15,3},{PDHB,1080,6,0},{GPI,1677,10,5},{ADH4,1143,7,2},{ENO1,1305,8,0},{PCK2,1923,12,6},{ALDOA,1095,7,2},{GALM,1029,7,3},{FBP1,1017,7,3},{HKDC1,2754,19,4},{ENO4,1878,13,6},{LDHAL6B,1146,8,5},{PGM1,1689,12,6},{ACSS1,2070,15,5},{PGM2,1839,14,4},{PDHA1,1173,9,1},{BPGM,780,6,2},{ADH1A,1128,9,5},{ADH1B,1128,9,2},{GAPDHS,1227,10,5},{MINPP1,1464,12,5},{ALDH1B1,1554,13,5},{G6PC2,1068,9,5},{PGK2,1254,11,5},{FBP2,1020,10,1},{G6PC,1074,12,4},{PFKL,2343,29,0},{GCK,1398,28,13},{ADH6,1107,23,9},{PKLR,1725,36,13},{PGAM4,765,17,11},{PKM,1596,57,21},{ALDH9A1,1557,82,28},{ALDH3A2,1458,79,31},{DLD,1530,92,29},{HK3,2772,190,63},{PFKP,2355,202,68},{PFKM,2343,225,74},{ADH1C,1128,145,48},{TPI1,861,114,37},{ALDH3B1,1407,220,74},{ALDH3A1,1362,224,74},{PCK1,1869,409,135},{ALDH7A1,1620,359,121},{ALDH3B2,1158,267,97},{LDHAL6A,999,238,79},{LDHC,999,348,116},{ADH7,1161,414,170},{G6PC3,1041,454,173}}

We can see most of them are very similar. Enolase 3 is completely identical between humans and chimpanzees. It has a few isoenzymes that are expressed in different tissue types in mammals. This particular one is most common in skeletal muscle. G6PC3 has an edit distance that is almost half the length of the nucleotide sequence. This gene encodes the catalytic subunit of the G6Pase enzyme. In humans, mutations in this gene can cause babies to have low white blood cell counts. Let's do the same computation for salmon.

Table[SelectFirst[salmonGeneInfo, 
    ContainsAny[ToLowerCase@humanGene[["Names"]], 
      ToLowerCase@#[["Names"]]] &] // 
   If[! MissingQ@#, {humanGene[["Names", 1]], 
      StringLength@humanGene[["NTSeq"]], 
      EditDistance[humanGene[["NTSeq"]], #[["NTSeq"]]], 
      EditDistance[humanGene[["AASeq"]], #[["AASeq"]]]}, 
     Nothing] &, {humanGene, humanGeneInfo}] // 
 SortBy[#[[3]]/#[[2]] &]

{{GAPDH,1008,213,49},{PGAM1,765,169,35},{ENO3,1305,292,64},{PGK1,1254,283,61},{PGAM4,765,175,44},{ALDH2,1554,356,107},{ALDOA,1095,252,66},{ALDOC,1095,257,75},{GCK,1398,348,99},{PCK1,1869,477,155},{ALDH7A1,1620,415,99},{ADH5,1125,305,71},{PGM1,1689,461,124},{ALDOB,1095,301,98},{FBP2,1020,284,86},{ALDH9A1,1557,468,145},{DLD,1530,461,91},{PCK2,1923,591,214},{PDHB,1080,332,92},{LDHB,1005,310,74},{LDHA,999,316,97},{ACSS1,2070,662,218},{ADPGK,1491,493,185},{ADH1C,1128,376,142},{PGM2,1839,622,182},{DLAT,1944,659,197},{GALM,1029,356,136},{ALDH3A2,1458,549,194},{G6PC3,1041,437,191},{G6PC2,1068,454,156},{ENO4,1878,819,363}}

Over half of them are missing because there are a lot of salmon genes that are unnamed or have no shared names with the human genes. The dissimilarity ratio ranges from 0% to 44% in chimpanzees and 21% to 44% in salmon. This is expected, of course, because salmon are less similar to us than chimpanzees. Now let's ignore the annotated names and just find the salmon gene ID that is the most similar to each human gene based on finding all pairwise edit distances in the set.

Table[Join[{humanGene[["Names", 1]], 
    StringLength@humanGene[["NTSeq"]]}, 
   First@SortBy[
     KeyValueMap[{EditDistance[
         humanGene[["NTSeq"]], #2[["NTSeq"]]], #} &, salmonGeneInfo], 
     First]], {humanGene, humanGeneInfo}] // SortBy[#[[3]]/#[[2]] &]

{{ENO1,1305,262,sasa:100194865},{GAPDH,1008,210,sasa:106575942},{ALDOA,1095,233,sasa:106583908},{PGAM2,762,165,sasa:106589759},{ENO3,1305,284,sasa:100196671},{ENO2,1305,285,sasa:106576545},{PGAM1,765,169,sasa:100194748},{PKM,1596,356,sasa:100195460},{PGK1,1254,283,sasa:106568020},{HK2,2754,623,sasa:106585103},{ALDOC,1095,250,sasa:106612788},{PGAM4,765,175,sasa:100194748},{ALDH2,1554,356,sasa:106578507},{HK1,2754,634,sasa:106587516},{GPI,1677,397,sasa:100196524},{GCK,1398,344,sasa:106585167},{PGK2,1254,316,sasa:100194765},{PDHB,1080,274,sasa:106566255},{FBP1,1017,259,sasa:106561989},{PCK1,1869,477,sasa:100195420},{ALDH7A1,1620,415,sasa:100194754},{PFKP,2355,604,sasa:100380410},{PFKM,2343,613,sasa:106566997},{HKDC1,2754,739,sasa:106612435},{PGM1,1689,457,sasa:106568718},{FBP2,1020,276,sasa:106593443},{ADH5,1125,305,sasa:100195992},{ALDH1A3,1539,421,sasa:106562593},{ALDH1B1,1554,427,sasa:106578507},{ALDOB,1095,301,sasa:100136522},{PDHA1,1173,335,sasa:106590467},{PFKL,2343,682,sasa:106582566},{ACSS2,2106,623,sasa:106566799},{ALDH9A1,1557,468,sasa:100195073},{DLD,1530,461,sasa:106561021},{LDHA,999,302,sasa:106573043},{PCK2,1923,582,sasa:100195420},{LDHB,1005,308,sasa:106609123},{G6PC,1074,333,sasa:106579337},{AKR1A1,978,306,sasa:106584055},{BPGM,780,247,sasa:100196266},{PDHA2,1167,370,sasa:106563586},{ACSS1,2070,662,sasa:106576722},{ADH1B,1128,361,sasa:106611602},{TPI1,861,280,sasa:106569985},{ADH1C,1128,369,sasa:106611602},{PKLR,1725,565,sasa:100195460},{ADPGK,1491,493,sasa:106560768},{ADH1A,1128,374,sasa:106611602},{PGM2,1839,622,sasa:106595213},{DLAT,1944,659,sasa:106604067},{GALM,1029,356,sasa:106608200},{LDHAL6A,999,350,sasa:106573043},{LDHC,999,351,sasa:106573043},{ADH4,1143,404,sasa:106611602},{ADH7,1161,414,sasa:106611602},{ALDH3B1,1407,506,sasa:106606778},{HK3,2772,1003,sasa:106585103},{ALDH3B2,1158,427,sasa:106606760},{ADH6,1107,415,sasa:100195992},{ALDH3A2,1458,549,sasa:100286782},{GAPDHS,1227,463,sasa:106577739},{LDHAL6B,1146,433,sasa:106573069},{G6PC2,1068,410,sasa:106579337},{ALDH3A1,1362,531,sasa:100286782},{G6PC3,1041,437,sasa:106589637},{ENO4,1878,819,sasa:106606505},{MINPP1,1464,642,sasa:106589393}}

I looked up several of these gene IDs in KEGG and then followed the links to the NCBI gene database. Many of them have names with "-like" appended to say it is like another gene or enzyme. Also, for example, the salmon doesn't have genes listed for enolase 1 or 2 (it does have 3 and 4), but it has one called alpha-enolase that is more similar to the human enolase 1 than any other pair of human to salmon genes in the set. It's also interesting that the salmon PGAM1 gene is slightly more similar to the human PGAM2 gene than to the human PGAM1 gene. Matching the full set of human genes to salmon had hardly any effect on the range of dissimilarity ratios. It changed from 21%-44% to 20%-44%. So the maximum dissimilarity in salmon remained at 44% even when considering unnamed salmon genes.

And we'll end it there for now. I've attached a pretty messy notebook. Normally, as I make multiple passes over the code, I move and organize cells toward the top, but in this case the notebook is pretty raw. It also has some scratch work where I was parsing chemical reactions from other KGML files and looking at the number of genes involved in each reaction between the species. Until next time!

Attachments:
POSTED BY: Michael Hale
6 Replies

Michael,

This looks quite interesting, but I am stymied early on in the import.

In[1]:= human = Import["http://rest.kegg.jp/get/hsa00010/kgml", "XML"][[2, 3]]; 

XML`Parser`XMLGet::prserr: 
   MalformedURLException: The URL used an unsupported protocol at Line: 2
     Character: 73 in /tmp/m00000184221/kgml.

Import::fmterr: Cannot import data as XML format.

Part::partd: Part specification $Failed[[2,3]] is longer than depth of object.

Any idea what might be amiss?

POSTED BY: Daniel Lichtblau
Posted 6 years ago

Strange. I just copied the exact line you provided into a fresh kernel. I'm using 11.3 on Win 10. Are you using an internal build?

POSTED BY: Michael Hale

No, I am using the shipping version of 11.3, but on Linux. A mystery.

POSTED BY: Daniel Lichtblau

Dear Daniel,

same as you, but on OSX.

Cheers,

Marco

Update: It is still processing, but this appears to work:

human = Import["http://rest.kegg.jp/get/hsa00010/kgml", "XML", "ReadDTD" -> False][[2, 3]];
POSTED BY: Marco Thiel
Posted 6 years ago

Hm. I did have to use "ReadDTD" -> False when getting XML from the NCBI API, but not the KEGG API. Like in the following to get a list of all of the NCBI databases.

Import["https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi", \
"XML", "ReadDTD" -> False]

But it did provide a more helpful error message than what Daniel was getting.

POSTED BY: Michael Hale

Let's wrap this post up with a simple, interesting exercise using the KEGG database. Let's see how similar the DNA and protein amino acid sequences used for glycolysis (a fundamental process in all organisms) are among humans, chimps, and something really different like salmon.

This is really cool! Thanks so much for taking the time to share your work. It may be a "simple" exercise to you but as a total beginner it's awesome to see what you can achieve with Wolfram.

Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract