I've been trying to use the function ClusteringTree to test how well it works with DNA sequences. This function uses edit distance to group sequences, and to make it more "realistic" I tried to use a custom distance function which takes into account gaps (right now I have included a gap opening cost, which is high, and a gap extension cost, which is low, at the end of the post is the custom distance function).
When I use the option DistanceFunction with my custom function the following error message appears:
"ClusteringTree was unable to compute positive and real pairwise distances for ... "
I also tried with the built in function HammingDistance, just to check, and I got the same result, so I don't think the problem was with my custom function.
I would appreciate if someone has an idea of how to include a custom distance function in ClusteringTree, Dendrogram, or a way which to use it in hierarchical clustering.
Thanks
If it helps, the code I used was:
ClusteringTree[data,
DistanceFunction -> dnaDistance2]
And the distance function I used is:
dnaDistance2[s1_, s2_] :=
Block[
{misM = OptionValue["mismatchScore"], gapOp = 10, gapExt = .1,
iUBsymbols = {"A", "T", "C", "G", "W", "S", "M", "K", "R", "Y",
"B", "D", "H", "V", "N", "X"}, prescore, substitutionscore,
gapScore, totalScore},
prescore =
Transpose[
Characters[{s1, s2}]] /. {{x_, x_} /; x != "-" ->
0, {a_, b_} /; ContainsAll[iUBsymbols, {a, b}] && a != b ->
misM};
substitutionscore = Total[Cases[prescore, _Integer]];
gapScore = Total[
Abs[gapOp*SequenceCount[#, {"DG" ..}] - gapOp] +
gapExt*Count[#, x_ /; x != "DG"]
& /@
Flatten[
Split[#,
If[{#1, #2} === {"G2", "G1"} || {#1, #2} === {"G1", "G2"},
False, True] &] & /@
SequenceCases[
(prescore /. {{"-", "-"} -> "DG", {"-", _} ->
"G1", {_, "-"} -> "G2"}),
{Repeated["G1" | "G2" | "DG"]}]
, 1]
];
totalScore = substitutionscore + gapScore;
Ceiling[totalScore]
]