Message Boards Message Boards

Avoid problem using a custom distance function in ClusteringTree?

Posted 7 years ago

I've been trying to use the function ClusteringTree to test how well it works with DNA sequences. This function uses edit distance to group sequences, and to make it more "realistic" I tried to use a custom distance function which takes into account gaps (right now I have included a gap opening cost, which is high, and a gap extension cost, which is low, at the end of the post is the custom distance function).

When I use the option DistanceFunction with my custom function the following error message appears:

"ClusteringTree was unable to compute positive and real pairwise distances for ... "

I also tried with the built in function HammingDistance, just to check, and I got the same result, so I don't think the problem was with my custom function.

I would appreciate if someone has an idea of how to include a custom distance function in ClusteringTree, Dendrogram, or a way which to use it in hierarchical clustering.

Thanks

If it helps, the code I used was:

ClusteringTree[data, 
 DistanceFunction -> dnaDistance2]

And the distance function I used is:

dnaDistance2[s1_, s2_] :=

 Block[

  {misM = OptionValue["mismatchScore"], gapOp = 10, gapExt = .1, 
   iUBsymbols = {"A", "T", "C", "G", "W", "S", "M", "K", "R", "Y", 
     "B", "D", "H", "V", "N", "X"}, prescore, substitutionscore, 
   gapScore, totalScore},

  prescore = 
   Transpose[
     Characters[{s1, s2}]] /. {{x_, x_} /; x != "-" -> 
      0, {a_, b_} /; ContainsAll[iUBsymbols, {a, b}] && a != b -> 
      misM};

  substitutionscore = Total[Cases[prescore, _Integer]];

  gapScore = Total[
    Abs[gapOp*SequenceCount[#, {"DG" ..}] - gapOp] + 
       gapExt*Count[#, x_ /; x  != "DG"]
      & /@
     Flatten[
      Split[#, 
         If[{#1, #2} === {"G2", "G1"} || {#1, #2} === {"G1", "G2"}, 
           False, True] &] & /@
       SequenceCases[
        (prescore /. {{"-", "-"} -> "DG", {"-", _} -> 
            "G1", {_, "-"} -> "G2"}),
        {Repeated["G1" | "G2" | "DG"]}]
      , 1]
    ];

  totalScore = substitutionscore + gapScore;

  Ceiling[totalScore]
  ]
POSTED BY: Carlos Muñoz
4 Replies

That's because of this I think:

EuclideanDistance[1, 2]
HammingDistance[1, 2]

the first one gives a number, while the second doesn't... So in order to fix it you have to put data in a certain format:

ClusteringTree[{1, 2, 5}, DistanceFunction -> EuclideanDistance]
ClusteringTree[{{1}, {2}, {5}}, DistanceFunction -> HammingDistance]
POSTED BY: Sander Huisman
Posted 7 years ago

Thanks! My data was an association with strings as values, so I transformed them into lists with one element, as in your example, and it worked.

POSTED BY: Carlos Muñoz
Posted 7 years ago
Update:

The results I got were a bit odd, and I though it was because of my distance function, but when I checked my distance function with the data in a different format it gave me always a distance of 0.

POSTED BY: Carlos Muñoz
Posted 7 years ago

Last update:

I just realized I was using the wrong set of data, haha.

If it helps to explain, in case anyone has a similar problem, I was using sequences of different lengths when my custom function only worked with sequences of equal lengths.

POSTED BY: Carlos Muñoz
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract