# How to program the Jaro distance?

Posted 1 month ago
416 Views | 7 Replies | 3 Total Likes
Hello everyone,

I'm trying to program the Jaro distance as requested by this page. The code below works well for the two pairs ("MARTHA", "MARHTA") and ("DIXON", "DICKSONX"), but when I try ("JELLYFISH", "SMELLYFISH") I get an error because the code counts the double S of "SMELLYFISH". Because of this error I have not been able to finish successfully. Here is what I have programmed so far:

    uno = "DIXON"; dos = "DICKSONX";
    rep = Characters[uno] \[Intersection] Characters[dos]
    scope = Max[StringLength[uno], StringLength[dos]]/2 - 1
    inter = Transpose[{Flatten[Position[Characters[uno], #] & /@ rep],
       Flatten[Position[Characters[dos], #] & /@ rep]}]
    m = Select[inter, Abs[#[[1]] - #[[2]]] < scope &]
    prb = Select[m, #[[1]] != #[[2]] &]
    trans = Length[DeleteCases[Position[prb, Reverse[#]] & /@ prb, {}]]/2
    1/3 (Length[m]/StringLength[uno] + Length[m]/StringLength[dos] +
       (Length[m] - trans)/Length[m])

Can someone help me solve this problem? Maybe the approach I'm using is wrong; unfortunately I have not been able to find a way to fix my code. I hope someone can guide me toward my goal. Any help is welcome, and thank you in advance.
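The trouble with the repeated S can be illustrated outside Mathematica. The following is a minimal Python sketch, assuming only the standard library; the helper name `positions` is made up for this illustration and is meant to mimic `Flatten[Position[Characters[...], ch]]`. It lists, for every shared letter, where it occurs in each string. For S the two lists have different lengths, so flattening all positions and pairing them up cannot work in general. This is offered as a likely explanation, not a confirmed diagnosis of the error message.

```python
def positions(text, ch):
    """1-based positions of ch in text (illustrative helper, not from the thread)."""
    return [i for i, c in enumerate(text, start=1) if c == ch]

uno, dos = "JELLYFISH", "SMELLYFISH"
shared = sorted(set(uno) & set(dos))  # distinct shared letters, like the set intersection

for ch in shared:
    print(ch, positions(uno, ch), positions(dos, ch))

# Total number of positions collected on each side
total_uno = sum(len(positions(uno, ch)) for ch in shared)
total_dos = sum(len(positions(dos, ch)) for ch in shared)
print(total_uno, total_dos)  # 8 9
```

Because the totals differ (8 versus 9), a one-to-one pairing of positions needs an explicit rule for which occurrence of a letter is matched to which, and that rule is what the matching window of the Jaro definition supplies.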
Posted 1 month ago
One could do something like this:

    Clear[check, jaro]
    check[d_Integer, l1_List, l2_List] :=
      If[l1[[1]] == l2[[1]] && Abs[l1[[2]] - l2[[2]]] <= d,
        {l1, l2},
        (* else *)
        Missing[]
      ];
    jaro[s1_String, s2_String] :=
      Block[{r, m = {}, l1 = StringLength[s1], l2 = StringLength[s2], w1, w2, l3, t = 0, l4, l5},
        r = Floor[Max[l1, l2]/2] - 1;
        w1 = Transpose[{ToCharacterCode[s1], Range[l1]}];
        w2 = Transpose[{ToCharacterCode[s2], Range[l2]}];
        l3 = DeleteMissing[Flatten[Outer[check[r, #1, #2] &, w1, w2, 1], 1]];
        m = Length[GatherBy[SortBy[l3, First[First[#]] &], First[First[#]] &]];
        {l4, l5} = SortBy[#, Last] & /@ (Union /@ Transpose[l3]);
        l4 = First /@ l4;
        l5 = First /@ l5;
        t = Length[
            Select[Subtract @@@
              Transpose[{Take[l4, Min[Length[l4], Length[l5]]],
                Take[l5, Min[Length[l4], Length[l5]]]}], # != 0 &]]/2;
        Print["r = ", r, "| m = ", m, "| t = ", t];
        If[m == 0,
          0,
          (* else *)
          (m/l1 + m/l2 + (m - t)/m)/3.
        ]
      ] /; StringLength[s1] > 0 && StringLength[s2] > 0

but it has an error when s1 and s2 contain the same character more than once:

    In[204]:= jaro["Mis", "Mis"]
    During evaluation of In[204]:= r = 0| m = 3| t = 0
    Out[204]= 1.

    In[205]:= jaro["Miss", "Miss"]
    During evaluation of In[205]:= r = 1| m = 3| t = 0
    Out[205]= 0.833333

The $s$ in $ss$ is within reach r of the other s, but seems to be counted twice, because m must be 4 to bring jaro to 1 in that case. Correspondingly, half of the test cases fail, of course:

    In[202]:= jaro @@@ {{"DWAYNE", "DUANE"}, {"MARTHA", "MARHTA"}, {"DIXON", "DICKSONX"}, {"JELLYFISH", "SMELLYFISH"}}
    During evaluation of In[202]:= r = 2| m = 4| t = 0
    During evaluation of In[202]:= r = 2| m = 5| t = 1
    During evaluation of In[202]:= r = 3| m = 4| t = 0
    During evaluation of In[202]:= r = 4| m = 7| t = 0
    (Debug) Out[202]= {0.822222, 0.822222, 0.766667, 0.825926}

So please, Luis, get it fixed so that it arrives at 0.8222222, 0.9444444, 0.7666667, 0.8962963.
Posted 1 month ago
Okay, it is fixed as follows:

    Clear[check, jaro]
    check[d_Integer, l1_List, l2_List] :=
      If[l1[[1]] == l2[[1]] && Abs[l1[[2]] - l2[[2]]] <= d,
        {l1, l2},
        (* else *)
        Missing[]
      ];
    jaro[s1_String, s2_String] :=
      Block[{r, m = {}, l1 = StringLength[s1], l2 = StringLength[s2], w1, w2, l3, t = 0, l4, l5},
        r = Floor[Max[l1, l2]/2] - 1;
        If[r < 0,
          m = 0,
          (* else *)
          w1 = Transpose[{ToCharacterCode[s1], Range[l1]}];
          w2 = Transpose[{ToCharacterCode[s2], Range[l2]}];
          l3 = DeleteMissing[Flatten[Outer[check[r, #1, #2] &, w1, w2, 1], 1]];
          m = Length[GatherBy[SortBy[l3, First[First[#]] &], First]];
          {l4, l5} = SortBy[#, Last] & /@ (Union /@ Transpose[l3]);
          t = EditDistance[First /@ l4, First /@ l5]/2
        ];
        Print["r = ", r, "| m = ", m, "| t = ", t];
        If[m == 0,
          0,
          (* else *)
          (m/l1 + m/l2 + (m - t)/m)/3.
        ]
      ] /; StringLength[s1] > 0 && StringLength[s2] > 0

This implementation is still a toy because Outer[] goes too far, and using the Levenshtein distance to compute the Jaro distance seems a bit lunatic, but at least the test cases go through:

    In[71]:= jaro["s", "s"]
    During evaluation of In[71]:= r = -1| m = 0| t = 0
    Out[71]= 0

    In[72]:= jaro["Miss", "Miss"]
    During evaluation of In[72]:= r = 1| m = 4| t = 0
    Out[72]= 1.

    In[73]:= jaro @@@ {{"DWAYNE", "DUANE"}, {"MARTHA", "MARHTA"}, {"DIXON", "DICKSONX"}, {"JELLYFISH", "SMELLYFISH"}}
    During evaluation of In[73]:= r = 2| m = 4| t = 0
    During evaluation of In[73]:= r = 2| m = 6| t = 1
    During evaluation of In[73]:= r = 3| m = 4| t = 0
    During evaluation of In[73]:= r = 4| m = 8| t = 0
    Out[73]= {0.822222, 0.944444, 0.766667, 0.896296}

    In[76]:= jaro["Miss Argentina", "Miss Brasilia"]
    During evaluation of In[76]:= r = 6| m = 8| t = 3/2
    Out[76]= 0.666438

    In[77]:= jaro["0100010100101001001001001010010", \
    "10000100100111101010101010101010"]
    During evaluation of In[77]:= r = 15| m = 31| t = 9/2
    Out[77]= 0.941196
Posted 1 month ago
Udo Krause, thank you very much for your help and support with this problem. I sincerely believed it was easier to solve, but I see that it is not; it has its details. Thanks to the code you shared, I learned an approach different from mine, one that is more sophisticated and elegant.

For my part I have written the following code, based on the Python answer given on the page I shared. I tested it with the cases requested there and with some of the ones you showed. I hope the community will discuss improvements we can make to the code so that I can progress in my learning. Thanks again for all your support.

    jaro[uno_String, dos_String] :=
      Module[{scope, slen, tlen, smatches, tmatches, matches, transpositions, i, j, start, end, k},
        scope = Floor[Max[StringLength[uno], StringLength[dos]]/2] - 1;
        slen = StringLength[uno];
        tlen = StringLength[dos];
        smatches = Table[False, {slen}];
        tmatches = Table[False, {tlen}];
        matches = 0;
        transpositions = 0;
        (* find matches: each character of dos can be matched at most once *)
        For[i = 1, i <= slen, i++,
          start = Max[1, i - scope];
          end = Min[i + scope, tlen];
          For[j = start, j <= end, j++,
            If[tmatches[[j]],
              Continue[],
              If[StringTake[uno, {i}] != StringTake[dos, {j}],
                Continue[],
                smatches[[i]] = True; tmatches[[j]] = True]];
            matches++;
            Break[]]];
        k = 1;
        i =.;
        (* count matched characters that appear in a different order *)
        For[i = 1, i <= slen, i++,
          If[Not[smatches[[i]]], Continue[]];
          While[Not[tmatches[[k]]], k++];
          If[StringTake[uno, {i}] != StringTake[dos, {k}], transpositions++];
          k++];
        transpositions /= 2;
        1/3 (matches/slen + matches/tlen + (matches - transpositions)/matches) // N]

    In[10]:= jaro @@@ {{"DWAYNE", "DUANE"}, {"MARTHA", "MARHTA"}, {"DIXON", "DICKSONX"}, {"JELLYFISH", "SMELLYFISH"}}
    Out[10]= {0.822222, 0.944444, 0.766667, 0.896296}

    In[11]:= jaro @@@ {{"Miss Argentina", "Miss Brasilia"}, {"MissArgentina", "MissBrasilia"}, {"miss Argentina", "miss Brasilia"}, {"MISSARGENTINA", "MISSBRASILIA"}, {"miss", "miss"}, {"MiSs", "MisS"}}
    Out[11]= {0.687271, 0.707265, 0.687271, 0.719017, 1., 0.916667}
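For comparison across languages, the flag-based matching described above can be sketched in Python. This is a minimal sketch using only the standard library, not the code from the referenced page. One assumption differs from the Mathematica versions in this thread: the window is clamped to 0, so two identical one-character strings score 1 instead of 0.

```python
def jaro(s, t):
    """Jaro similarity in [0, 1] using greedy one-to-one matching inside a window."""
    if not s or not t:
        return 0.0
    # Assumption: clamp the window at 0 so one-character strings can still match.
    window = max(0, max(len(s), len(t)) // 2 - 1)
    s_used = [False] * len(s)
    t_used = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            # each character of t may be matched at most once
            if not t_used[j] and t[j] == ch:
                s_used[i] = t_used[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters in the order they occur in each string
    s_seq = [c for c, used in zip(s, s_used) if used]
    t_seq = [c for c, used in zip(t, t_used) if used]
    # Transpositions: positions where the matched sequences differ, halved
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) / 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

for pair in [("DWAYNE", "DUANE"), ("MARTHA", "MARHTA"),
             ("DIXON", "DICKSONX"), ("JELLYFISH", "SMELLYFISH")]:
    print(pair, round(jaro(*pair), 6))
```

The boolean flag lists play the role of `smatches` and `tmatches`: they guarantee that a repeated letter such as the S in "SMELLYFISH" is consumed only once.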
Posted 1 month ago
"For my part I have made the following code" reminds me of the quip that one can write FORTRAN 66 in Mathematica. If it is necessary to hammer out the fact that a double loop must be done (Outer[] states exactly that), it can be done in Mathematica without using the built-in symbol For[], I guess. Let's see whether the Delayed constructs allow for it. Outer[] is overkill, and elegance is no good if it compromises performance (and/or correctness).
Posted 14 days ago
The Outer[] has been pushed down the call tree:

    Clear[check, jaro]
    check[l1_List, l2_List] :=
      Flatten[Outer[List, l1, Select[l2, #[[1]] == l1[[1, 1]] &], 1], 1]
    jaro[s1_String, s2_String] :=
      Block[{r, l1 = StringLength[s1], l2 = StringLength[s2], w1, w2, l3, m, l4, l5, t},
        r = Floor[Max[l1, l2]/2] - 1;
        If[r >= 0, (* then *)
          w1 = Transpose[{ToCharacterCode[s1], Range[l1]}];
          w2 = Transpose[{ToCharacterCode[s2], Range[l2]}];
          l3 = Flatten[
            check[{w1[[#]]}, w2[[Min[l2, Max[1, # - r]] ;; Min[l2, # + r]]]] & /@ Range[l1], 1];
          If[Length[l3] > 0,
            {l4, l5} = MapAt[First, SortBy[#, Last] & /@ (Union /@ Transpose[l3]), {{1, All}, {2, All}}];
            m = MinMax[{Length[l4], Length[l5]}];
            t = Count[PadRight[l4, m[[2]], -1] - PadRight[l5, m[[2]], -1], u_ /; u != 0]/2;
            (m[[1]]/l1 + m[[1]]/l2 + (m[[1]] - t)/m[[1]])/3.,
            (* else *)
            0
          ],
          (* else *)
          0
        ]
      ] /; StringLength[s1] > 0 && StringLength[s2] > 0

which gives

    In[12]:= jaro["Miss Australia", "Miss Brasilia"]
    Out[12]= 0.771229

    In[6]:= jaro["Miss Brasilia", "Miss Australia"]
    Out[6]= 0.771229

    In[8]:= jaro["s", "s"]
    Out[8]= 0

    In[10]:= jaro["Miss", "Miss"]
    Out[10]= 1.

    In[13]:= jaro @@@ {{"DWAYNE", "DUANE"}, {"MARTHA", "MARHTA"}, {"DIXON", "DICKSONX"}, {"JELLYFISH", "SMELLYFISH"}}
    Out[13]= {0.822222, 0.944444, 0.766667, 0.896296}

    In[25]:= jaro["Miss Mexiko", "Miss Belize"]
    Out[25]= 0.733766

    In[26]:= jaro["Miss Belize", "Miss Mexiko"]
    Out[26]= 0.733766

    In[27]:= jaro["0100010100101001001001001010010", \
    "10000100100111101010101010101010"]
    Out[27]= 0.898185

    In[28]:= jaro["Miss Mexikoooooooooooooooooooo", "Miss Belize"]
    Out[28]= 0.602146

    In[29]:= jaro["abdegopq", "cfhijklmnrstuvwyz"]
    Out[29]= 0

    In[30]:= jaro["cfhijklmnrstuvwyz", "abdegopq"]
    Out[30]= 0

    In[33]:= jaro["aasdjkdashdahsgdashdgasj", "asdjkdashdahsgdashdgasj"]
    Out[33]= 0.819444

    In[34]:= jaro["aasdjkdashdahsgdashdgasj", "aasdjkdashdahsgdashdgasj"]
    Out[34]= 1.

    In[31]:= jaro["CRATE", "TRACE"]
    Out[31]= 0.733333

    In[32]:= jaro["Mary has a little lamb", "and Meghan has the redhead Harry"]
    Out[32]= 0.465097

To check it against another implementation, textdistance has been chosen. For r = -1 it nevertheless returns 1, and it shows the following results:

    Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import textdistance
    >>> textdistance.jaro.distance('Miss', 'Miss')
    0
    >>> textdistance.jaro('Miss', 'Miss')
    1
    >>> textdistance.jaro('s', 's')
    1
    >>> textdistance.jaro('Miss Australia', 'Miss Brasilia')
    0.8166833166833167
    >>> textdistance.jaro('Miss Brasilia', 'Miss Australia')
    0.8166833166833167
    >>> textdistance.jaro('DWAYNE', 'DUANE')
    0.8222222222222223
    >>> textdistance.jaro('MARTHA', 'MARHTA')
    0.9444444444444445
    >>> textdistance.jaro('DIXON', 'DICKSONX')
    0.7666666666666666
    >>> textdistance.jaro('JELLYFISH', 'SMELLYFISH')
    0.8962962962962964
    >>> textdistance.jaro('Miss Mexiko', 'Miss Belize')
    0.7575757575757575
    >>> textdistance.jaro('Miss Belize', 'Miss Mexiko')
    0.7575757575757575
    >>> textdistance.jaro('0100010100101001001001001010010', '10000100100111101010101010101010')
    0.8308371735791091
    >>> textdistance.jaro('Miss Mexikoooooooooooooooooooo', 'Miss Belize')
    0.6232323232323232
    >>> textdistance.jaro('abdegopq', 'cfhijklmnrstuvwyz')
    0.0
    >>> textdistance.jaro('cfhijklmnrstuvwyz', 'abdegopq')
    0.0
    >>> textdistance.jaro('aasdjkdashdahsgdashdgasj', 'asdjkdashdahsgdashdgasj')
    0.841183574879227
    >>> textdistance.jaro('aasdjkdashdahsgdashdgasj', 'aasdjkdashdahsgdashdgasj')
    1
    >>> textdistance.jaro('CRATE', 'TRACE')
    0.7333333333333334
    >>> textdistance.jaro('Mary has a little lamb', 'and Meghan has the redhead Harry')
    0.5631555944055945

Most of them disagree with the above jaro[]. Does your implementation match it?
The EditDistance[] is back, and

    Clear[check, jaro]
    check[l1_List, l2_List] :=
      Flatten[Outer[List, l1, Select[l2, #[[1]] == l1[[1, 1]] &], 1], 1]
    jaro[s1_String, s2_String, prec_: $MachinePrecision] :=
      Block[{r, l1 = StringLength[s1], l2 = StringLength[s2], l3, m, l4, l5, t},
        r = Floor[Max[l1, l2]/2] - 1;
        If[r >= 0, (* then *)
          l3 = Flatten[
            check[{Transpose[{ToCharacterCode[s1], Range[l1]}][[#]]},
              Transpose[{ToCharacterCode[s2], Range[l2]}][[
                Min[l2, Max[1, # - r]] ;; Min[l2, # + r]]]] & /@ Range[l1], 1];
          If[Length[l3] > 0,
            {l4, l5} = MapAt[First, SortBy[#, Last] & /@ (Union /@ Transpose[l3]), {{1, All}, {2, All}}];
            m = Min[Length[l4], Length[l5]];
            t = EditDistance[Take[l4, m], Take[l5, m]]/2;
            N[(m/l1 + m/l2 + (m - t)/m)/3, prec],
            (* else *)
            0
          ],
          (* else *)
          0
        ]
      ] /; StringLength[s1] > 0 && StringLength[s2] > 0 && prec > 1

many results of textdistance.jaro() are matched:

    >>> textdistance.jaro('Miss Australia', 'Miss Brasilia')
    0.8166833166833167
    >>> textdistance.jaro('Miss Mexiko', 'Miss Belize')
    0.7575757575757575
    >>> textdistance.jaro('Miss Belize', 'Miss Mexiko')
    0.7575757575757575

    In[161]:= jaro["Miss Australia", "Miss Brasilia"]
    Out[161]= 0.8166833166833167

    In[162]:= jaro["Miss Mexiko", "Miss Belize"]
    Out[162]= 0.7575757575757576

    In[163]:= jaro["Miss Belize", "Miss Mexiko"]
    Out[163]= 0.7575757575757576

but some are not, as discussed in the following. First define a function jo[], usable in FindRoot[], to find the m and t that textdistance.jaro() has seemingly found:

    Clear[jo, joex]
    jo[s1_Integer, s2_Integer, m_?NumberQ, t_?NumberQ, prec_: $MachinePrecision] :=
      N[(m/s1 + m/s2 + (m - t)/m)/3, prec] /; m > 0 && prec > 0
    joex[s1_Integer, s2_Integer, m_Integer, t_?NumberQ] :=
      (m/s1 + m/s2 + (m - t)/m)/3 /; m > 0

Then one uses a jaro[] that prints its findings:

    >>> textdistance.jaro('Miss Mexikoooooooooooooooooooo', 'Miss Belize')
    0.6232323232323232

    In[136]:= jaro["Miss Mexikoooooooooooooooooooo", "Miss Belize"]
    During evaluation of In[136]:= l4 = Miss Mei
    During evaluation of In[136]:= l5 = Miss eie
    During evaluation of In[136]:= s1 = 30| s2 = 11| r = 14| m = 8| t = 1
    Out[136]= 0.6229797979797980

The result of textdistance.jaro() follows with m = 7 and t = 0:

    In[145]:= jo[30, 11, 7, 0]
    Out[145]= 0.6232323232323232

It's unclear how a disagreement in the number of matching characters can happen. This one is interesting, since only the first character differs:

    >>> textdistance.jaro('aasdjkdashdahsgdashdgasj', 'asdjkdashdahsgdashdgasj')
    0.841183574879227

    In[146]:= jaro["aasdjkdashdahsgdashdgasj", "asdjkdashdahsgdashdgasj"]
    During evaluation of In[146]:= l4 = aasdjkdashdahsgdashdgasj
    During evaluation of In[146]:= l5 = asdjkdashdahsgdashdgasj
    During evaluation of In[146]:= s1 = 24| s2 = 23| r = 11| m = 23| t = 1
    Out[146]= 0.971618357487923

This is reached with the same m (luckily), but with t = 10:

    In[149]:= jo[24, 23, 23, 10]
    Out[149]= 0.8411835748792271

This one has a great discrepancy in the number of matching characters:

    >>> textdistance.jaro('0100010100101001001001001010010', '10000100100111101010101010101010')
    0.8308371735791091

    In[67]:= jaro["0100010100101001001001001010010", \
    "10000100100111101010101010101010"]
    During evaluation of In[67]:= l4 = 0100010100101001001001001010010
    During evaluation of In[67]:= l5 = 10000100100111101010101010101010
    During evaluation of In[67]:= s1 = 31| s2 = 32| r = 15| m = 31| t = 5
    Out[67]= 0.935819892473118

because it's reached with m = 28 and t = 8:

    In[133]:= jo[31, 32, 28, 16/2]
    Out[133]= 0.8308371735791091

And last but not least, Mary and Meghan:

    >>> textdistance.jaro('Mary has a little lamb', 'and Meghan has the redhead Harry')
    0.5631555944055945

    In[167]:= jaro["Mary has a little lamb", "and Meghan has the redhead Harry"]
    During evaluation of In[167]:= l4 = Ma has a tte a
    During evaluation of In[167]:= l5 = a Meha has the eea a
    During evaluation of In[167]:= s1 = 22| s2 = 32| r = 15| m = 14| t = 9/2
    Out[167]= 0.5841450216450216

Again textdistance.jaro() disagrees with the m (it needs 13) as well as with the t (it has 4):

    In[153]:= jo[22, 32, 13, 4]
    Out[153]= 0.5631555944055944

I quit. I expected that the m fits and that the difference lies only in determining the t, which has an English description:

"The number of matching (but different sequence order) characters divided by 2 defines the number of transpositions."

as well as a French description (translated):

"The number of transpositions is obtained by comparing the i-th matching character of $s_1$ with the i-th matching character of $s_2$. The number of times these characters differ, divided by two, gives the number of transpositions."

But if one does that, even 'Miss Australia' and 'Miss Brasilia' do not agree with textdistance.jaro().
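The back-calculation done with jo[] can be reproduced in Python as a cross-check. The sketch below assumes only the standard library; the names `jaro_from_counts` and `find_counts` are hypothetical helpers introduced for this illustration. The first evaluates the Jaro formula for given string lengths, match count m and transposition count t. The second does a brute-force search, standing in for FindRoot[], over all m and all t in steps of 1/2 to list the combinations that reproduce a given score.

```python
from fractions import Fraction

def jaro_from_counts(len1, len2, m, t):
    """Jaro score from string lengths, number of matches m and transpositions t."""
    if m == 0:
        return 0.0
    m, t = Fraction(m), Fraction(t)  # exact arithmetic before the final float
    return float((m / len1 + m / len2 + (m - t) / m) / 3)

def find_counts(len1, len2, target, tol=1e-9):
    """All (m, t) pairs, t in steps of 1/2 and at most m, whose score hits target."""
    hits = []
    for m in range(1, min(len1, len2) + 1):
        for half in range(0, 2 * m + 1):
            t = Fraction(half, 2)
            if abs(jaro_from_counts(len1, len2, m, t) - target) <= tol:
                hits.append((m, t))
    return hits

# Scores reported by textdistance.jaro() in the session above
print(find_counts(30, 11, 0.6232323232323232))
print(find_counts(24, 23, 0.841183574879227))
print(find_counts(31, 32, 0.8308371735791091))
print(find_counts(22, 32, 0.5631555944055945))
```

The search only shows which (m, t) combinations are arithmetically consistent with a score. It does not show how textdistance arrives at them.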
A very interesting dissertation about the Jaro distance. I understand that some discrepancies with the textdistance results remain. I do not know how to resolve them because that is beyond my knowledge, but in case you do not already know it, the following can be used to make more comparisons:

    jD = ExperimentalJaroDistance;
    jD["DIXON", "DICKSONX"]
    0.766667
    jD["JELLYFISH", "SMELLYFISH"]
    0.896296

I give those examples so you can see what I mean; maybe my suggestion will help you. Everything you have done on this subject has been very illustrative for me, and I continue to read both your code and your observations.