# Program the distance Jaro?

Posted 7 months ago
1538 Views
|
11 Replies
|
3 Total Likes
|
 Hello everyone, I'm trying to program Jaro distance as requested by this page, I've done the following code that works well for the next two pairs ("MARTHA", "MARHTA") and ("DIXON", "DICKSONX") but when I try with ("JELLYFISH", "SMELLYFISH") I get an error because the code counts the double S of "SMELLYFISH", due to this error I have not been able to finish successfully, here is what I have programmed up to this moment: uno = "DIXON"; dos ="DICKSONX" ; rep = Characters[uno] \[Intersection] Characters[dos] scope = Max[StringLength[uno], StringLength[dos]]/2 - 1 inter = Transpose[{Flatten[Position[Characters[uno], #] & /@ rep], Flatten[Position[Characters[dos], #] & /@ rep]}] m = Select[inter, Abs[#[[1]] - #[[2]]] < scope &] prb = Select[m, #[[1]] != #[[2]] &] trans = Length[DeleteCases[Position[prb, Reverse[#]] & /@ prb, {}]]/2 1/3 (Length[m]/StringLength[uno] + Length[m]/StringLength[dos] + ( Length[m] - trans)/Length[m]) Someone who can help me solve this problem? Maybe the approach I'm using is wrong, unfortunately I have not been able to find a way to solve the problem with my code, I hope that someone can please guide me to achieve my goal, any help is welcome, thank you in advance for your help
11 Replies
Sort By:
Posted 7 months ago
 One could do something like that Clear[check, jaro] check[d_Integer, l1_List, l2_List] := If[l1[[1]] == l2[[1]] && Abs[l1[[2]] - l2[[2]]] <= d, {l1, l2}, (* else *) Missing[] ]; jaro[s1_String, s2_String] := Block[{r, m = {}, l1 = StringLength[s1], l2 = StringLength[s2], w1, w2, l3, t = 0, l4, l5}, r = Floor[Max[l1, l2]/2] - 1; w1 = Transpose[{ToCharacterCode[s1], Range[l1]}]; w2 = Transpose[{ToCharacterCode[s2], Range[l2]}]; l3 = DeleteMissing[ Flatten[Outer[check[r, #1, #2] &, w1, w2, 1], 1]]; m = Length[ GatherBy[SortBy[l3, First[First[#]] &], First[First[#]] &]]; {l4, l5} = SortBy[#, Last] & /@ (Union /@ Transpose[l3]); l4 = First /@ l4; l5 = First /@ l5; t = Length[ Select[Subtract @@@ Transpose[{Take[l4, Min[Length[l4], Length[l5]]], Take[l5, Min[Length[l4], Length[l5]]]}], # != 0 &]]/2; Print["r = ", r, "| m = ", m, "| t = ", t]; If[m == 0, 0, (* else *) (m/l1 + m/l2 + (m - t)/m)/3. ] ] /; StringLength[s1] > 0 && StringLength[s2] > 0 but it has an error if s1 and s2 are the same character. Then In[204]:= jaro["Mis", "Mis"] During evaluation of In[204]:= r = 0| m = 3| t = 0 Out[204]= 1. In[205]:= jaro["Miss", "Miss"] During evaluation of In[205]:= r = 1| m = 3| t = 0 Out[205]= 0.833333 the $s$ in $ss$ is in reach of r to the other s, but seems to count twice, because m must be 4 to bring jaro to 1 in that case; correspondingly the test cases have a 50% fail, of course: In[202]:= jaro @@@ {{"DWAYNE", "DUANE"}, {"MARTHA", "MARHTA"}, {"DIXON", "DICKSONX"}, {"JELLYFISH", "SMELLYFISH"}} During evaluation of In[202]:= r = 2| m = 4| t = 0 During evaluation of In[202]:= r = 2| m = 5| t = 1 During evaluation of In[202]:= r = 3| m = 4| t = 0 During evaluation of In[202]:= r = 4| m = 7| t = 0 (Debug) Out[202]= {0.822222, 0.822222, 0.766667, 0.825926} so, please Luis, get it fixed, to reach at 0.8222222, 0.9444444, 0.7666667, 0.8962963.
Posted 7 months ago
 Okay, fix it into Clear[check, jaro] check[d_Integer, l1_List, l2_List] := If[l1[[1]] == l2[[1]] && Abs[l1[[2]] - l2[[2]]] <= d, {l1, l2}, (* else *) Missing[] ]; jaro[s1_String, s2_String] := Block[{r, m = {}, l1 = StringLength[s1], l2 = StringLength[s2], w1, w2, l3, t = 0, l4, l5}, r = Floor[Max[l1, l2]/2] - 1; If[r < 0, m = 0, (* else *) w1 = Transpose[{ToCharacterCode[s1], Range[l1]}]; w2 = Transpose[{ToCharacterCode[s2], Range[l2]}]; l3 = DeleteMissing[ Flatten[Outer[check[r, #1, #2] &, w1, w2, 1], 1]]; m = Length[GatherBy[SortBy[l3, First[First[#]] &], First]]; {l4, l5} = SortBy[#, Last] & /@ (Union /@ Transpose[l3]); t = EditDistance[First /@ l4, First /@ l5]/2 ]; Print["r = ", r, "| m = ", m, "| t = ", t]; If[m == 0, 0, (* else *) (m/l1 + m/l2 + (m - t)/m)/3. ] ] /; StringLength[s1] > 0 && StringLength[s2] > 0 this implementation is still toy because Outer[] goes too far and then ,,,,, using the Levenshtein distance in computing the Jaro distance seems a bit lunatic ... but at least the test cases go through: In[71]:= jaro["s", "s"] During evaluation of In[71]:= r = -1| m = 0| t = 0 Out[71]= 0 In[72]:= jaro["Miss", "Miss"] During evaluation of In[72]:= r = 1| m = 4| t = 0 Out[72]= 1. In[73]:= jaro @@@ {{"DWAYNE", "DUANE"}, {"MARTHA", "MARHTA"}, {"DIXON", "DICKSONX"}, {"JELLYFISH", "SMELLYFISH"}} During evaluation of In[73]:= r = 2| m = 4| t = 0 During evaluation of In[73]:= r = 2| m = 6| t = 1 During evaluation of In[73]:= r = 3| m = 4| t = 0 During evaluation of In[73]:= r = 4| m = 8| t = 0 Out[73]= {0.822222, 0.944444, 0.766667, 0.896296} In[76]:= jaro["Miss Argentina", "Miss Brasilia"] During evaluation of In[76]:= r = 6| m = 8| t = 3/2 Out[76]= 0.666438 In[77]:= jaro["0100010100101001001001001010010", \ "10000100100111101010101010101010"] During evaluation of In[77]:= r = 15| m = 31| t = 9/2 Out[77]= 0.941196 
Posted 7 months ago
 $Udo$ $Krause$ Thank you very much for your help and support to face this problem, I sincerely believed that this problem was easier to solve, but I see that no, it has its details. Thanks to the code you shared, I learned another different approach to mine, which is more sophisticated and elegant. For my part I have made the following code based on the response they provide using python in the page I shared, I probe it with the cases that they request there and with some that you showed. I hope the community will discuss the improvements we can make to the code and thus progress in my learning, thanks again for all your support. jaro[uno_String, dos_String] := Module[{scope, slen, tlen, smatches, tmatches, matches, transpositions, i, start, end, k}, scope = Floor[Max[StringLength[uno], StringLength[dos]]/2] - 1; slen = StringLength[uno]; tlen = StringLength[dos]; smatches = Table[False, {slen}]; tmatches = Table[False, {tlen}]; matches = 0; transpositions = 0; For[i = 1, i <= slen, i++, start = Max[1, i - scope]; end = Min[i + scope, tlen]; For[j = start, j <= end, j++, If[tmatches[[j]], Continue[], If[StringTake[uno, {i}] != StringTake[dos, {j}], Continue[], smatches[[i]] = True; tmatches[[j]] = True]]; matches++; Break[]]]; k = 1; i =.; For[i = 1, i <= slen, i++, If[Not[smatches[[i]]], Continue[]]; While[Not[tmatches[[k]]], k++]; If[StringTake[uno, {i}] != StringTake[dos, {k}], transpositions++]; k++]; transpositions /= 2 ; 1/3 (matches/slen + matches/tlen + (matches - transpositions)/ matches) // N] In[10]:= jaro @@@ {{"DWAYNE", "DUANE"}, {"MARTHA", "MARHTA"}, {"DIXON", "DICKSONX"}, {"JELLYFISH", "SMELLYFISH"}} Out[10]= {0.822222, 0.944444, 0.766667, 0.896296} In[11]:= jaro @@@ {{"Miss Argentina", "Miss Brasilia"}, {"MissArgentina", "MissBrasilia"}, {"miss Argentina", "miss Brasilia"}, {"MISSARGENTINA", "MISSBRASILIA"}, {"miss", "miss"}, {"MiSs", "MisS"}} Out[11]= {0.687271, 0.707265, 0.687271, 0.719017, 1., 0.916667} 
Posted 7 months ago
 For my part I have made the following Code which reminds me to the quotation you witness $FORTRAN66$ in Mathematica. If it is necessary to hammer out the fact that a double loop must be done -- Outer[] states exactly that -- it can be done in Mathematica without using the built-in Symbol For[] - I guess. Let's see whether the Delayed constructs allow for it. Outer[] is an overkill and elegance is of no good if it compromises performance (and/or correctness)
Posted 6 months ago
 The Outer[] has been banned down the call tree Clear[check, jaro] check[l1_List, l2_List] := Flatten[Outer[List, l1, Select[l2, #[[1]] == l1[[1, 1]] &], 1], 1] jaro[s1_String, s2_String] := Block[{r, l1 = StringLength[s1], l2 = StringLength[s2], w1, w2, l3, m, l4, l5, t}, r = Floor[Max[l1, l2]/2] - 1; If[r >= 0,(* then *) w1 = Transpose[{ToCharacterCode[s1], Range[l1]}]; w2 = Transpose[{ToCharacterCode[s2], Range[l2]}]; l3 = Flatten[check[{w1[[#]]}, w2[[Min[l2, Max[1, # - r]] ;; Min[l2, # + r]]]] & /@ Range[l1], 1]; If[Length[l3] > 0, {l4, l5} = MapAt[First, SortBy[#, Last] & /@ (Union /@ Transpose[l3]), {{1, All}, {2, All}}]; m = MinMax[{Length[l4], Length[l5]}]; t = Count[PadRight[l4, m[[2]], -1] - PadRight[l5, m[[2]], -1], u_ /; u != 0]/2; (m[[1]]/l1 + m[[1]]/l2 + (m[[1]] - t)/m[[1]])/3., (* else *) 0 ], (* else *) 0 ] ] /; StringLength[s1] > 0 && StringLength[s2] > 0 to let out In[12]:= jaro["Miss Australia", "Miss Brasilia"] Out[12]= 0.771229 In[6]:= jaro["Miss Brasilia", "Miss Australia"] Out[6]= 0.771229 In[8]:= jaro["s", "s"] Out[8]= 0 In[10]:= jaro["Miss", "Miss"] Out[10]= 1. In[13]:= jaro @@@ {{"DWAYNE", "DUANE"}, {"MARTHA", "MARHTA"}, {"DIXON", "DICKSONX"}, {"JELLYFISH", "SMELLYFISH"}} Out[13]= {0.822222, 0.944444, 0.766667, 0.896296} In[25]:= jaro["Miss Mexiko", "Miss Belize"] Out[25]= 0.733766 In[26]:= jaro[ "Miss Belize", "Miss Mexiko"] Out[26]= 0.733766 In[27]:= jaro["0100010100101001001001001010010", \ "10000100100111101010101010101010"] Out[27]= 0.898185 In[28]:= jaro["Miss Mexikoooooooooooooooooooo", "Miss Belize"] Out[28]= 0.602146 In[29]:= jaro["abdegopq", "cfhijklmnrstuvwyz"] Out[29]= 0 In[30]:= jaro["cfhijklmnrstuvwyz", "abdegopq"] Out[30]= 0 In[33]:= jaro["aasdjkdashdahsgdashdgasj", "asdjkdashdahsgdashdgasj"] Out[33]= 0.819444 In[34]:= jaro["aasdjkdashdahsgdashdgasj", "aasdjkdashdahsgdashdgasj"] Out[34]= 1. In[31]:= jaro["CRATE", "TRACE"] Out[31]= 0.733333 In[32]:= jaro["Mary has a little lamb", "and Meghan has the redhead Harry"] Out[32]= 0.465097 to check it against another implementation textdistance has been choosen, it gives for a r = -1 nevertheless 1 and shows the following results Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32 bit (Intel)] on win32 Type "help", "copyright", "credits" or "license" for more information. >>> import textdistance >>> textdistance.jaro.distance('Miss', 'Miss') 0 >>> textdistance.jaro('Miss', 'Miss') 1 >>> textdistance.jaro('s', 's') 1 >>> textdistance.jaro('Miss Australia', 'Miss Brasilia') 0.8166833166833167 >>> textdistance.jaro('Miss Brasilia', 'Miss Australia') 0.8166833166833167 >>> textdistance.jaro('DWAYNE', 'DUANE') 0.8222222222222223 >>> textdistance.jaro('MARTHA', 'MARHTA') 0.9444444444444445 >>> textdistance.jaro('DIXON', 'DICKSONX') 0.7666666666666666 >>> textdistance.jaro('JELLYFISH', 'SMELLYFISH') 0.8962962962962964 >>> textdistance.jaro('Miss Mexiko', 'Miss Belize') 0.7575757575757575 >>> textdistance.jaro('Miss Belize', 'Miss Mexiko') 0.7575757575757575 >>> textdistance.jaro('0100010100101001001001001010010', '10000100100111101010101010101010') 0.8308371735791091 >>> textdistance.jaro('Miss Mexikoooooooooooooooooooo', 'Miss Belize') 0.6232323232323232 >>> textdistance.jaro('abdegopq', 'cfhijklmnrstuvwyz') 0.0 >>> textdistance.jaro('cfhijklmnrstuvwyz', 'abdegopq') 0.0 >>> textdistance.jaro('aasdjkdashdahsgdashdgasj', 'asdjkdashdahsgdashdgasj') 0.841183574879227 >>> textdistance.jaro('aasdjkdashdahsgdashdgasj', 'aasdjkdashdahsgdashdgasj') 1 >>> textdistance.jaro('CRATE', 'TRACE') 0.7333333333333334 >>> textdistance.jaro('Mary has a little lamb', 'and Meghan has the redhead Harry') 0.5631555944055945 >>> most of them disagree with the above jaro[]. Does your implementation match it?
Posted 6 months ago
 The EditDistance[] is back and Clear[check, jaro] check[l1_List, l2_List] := Flatten[Outer[List, l1, Select[l2, #[[1]] == l1[[1, 1]] &], 1], 1] jaro[s1_String, s2_String, prec_: $MachinePrecision] := Block[{r, l1 = StringLength[s1], l2 = StringLength[s2], l3, m, l4, l5, t}, r = Floor[Max[l1, l2]/2] - 1; If[r >= 0,(* then *) l3 = Flatten[ check[{Transpose[{ToCharacterCode[s1], Range[l1]}][[#]]}, Transpose[{ToCharacterCode[s2], Range[l2]}][[ Min[l2, Max[1, # - r]] ;; Min[l2, # + r]]]] & /@ Range[l1], 1]; If[Length[l3] > 0, {l4, l5} = MapAt[First, SortBy[#, Last] & /@ (Union /@ Transpose[l3]), {{1, All}, {2, All}}]; m = Min[Length[l4], Length[l5]]; t = EditDistance[Take[l4, m], Take[l5, m]]/2; N[(m/l1 + m/l2 + (m - t)/m)/3, prec], (* else *) 0 ], (* else *) 0 ] ] /; StringLength[s1] > 0 && StringLength[s2] > 0 && prec > 1 many results of textdistance.jaro() are matched >>> textdistance.jaro('Miss Australia', 'Miss Brasilia') 0.8166833166833167 >>> textdistance.jaro('Miss Mexiko', 'Miss Belize') 0.7575757575757575 >>> textdistance.jaro('Miss Belize', 'Miss Mexiko') 0.7575757575757575 In[161]:= jaro["Miss Australia", "Miss Brasilia"] Out[161]= 0.8166833166833167 In[162]:= jaro["Miss Mexiko", "Miss Belize"] Out[162]= 0.7575757575757576 In[163]:= jaro[ "Miss Belize", "Miss Mexiko"] Out[163]= 0.7575757575757576 but some not, to be discussed in the following: First define a function jo[] usable in FindRoot[] to find the m and t which have been ssemingly found by textdistance.jaro(): Clear[jo, joex] jo[s1_Integer, s2_Integer, m_?NumberQ, t_?NumberQ, prec_:$MachinePrecision] := N[(m/s1 + m/s2 + (m - t)/m)/3, prec] /; m > 0 && prec > 0 joex[s1_Integer, s2_Integer, m_Integer, t_?NumberQ] := (m/s1 + m/s2 + (m - t)/m)/3 /; m > 0 so one uses a jaro[] printing it's findings: >>> textdistance.jaro('Miss Mexikoooooooooooooooooooo', 'Miss Belize') 0.6232323232323232 In[136]:= jaro["Miss Mexikoooooooooooooooooooo", "Miss Belize"] During evaluation of In[136]:= l4 = Miss Mei During evaluation of In[136]:= l5 = Miss eie During evaluation of In[136]:= s1 = 30| s2 = 11| r = 14| m = 8| t = 1 Out[136]= 0.6229797979797980 the result ot textdistance.jaro() follows with m = 7 and t = 0: In[145]:= jo[30, 11, 7, 0] Out[145]= 0.6232323232323232 it's unclear how a disagreement in the number of matching characters can happen.This is interesting, only the first character differs: >>> textdistance.jaro('aasdjkdashdahsgdashdgasj', 'asdjkdashdahsgdashdgasj') 0.841183574879227 In[146]:= jaro["aasdjkdashdahsgdashdgasj", "asdjkdashdahsgdashdgasj"] During evaluation of In[146]:= l4 = aasdjkdashdahsgdashdgasj During evaluation of In[146]:= l5 = asdjkdashdahsgdashdgasj During evaluation of In[146]:= s1 = 24| s2 = 23| r = 11| m = 23| t = 1 Out[146]= 0.971618357487923 this is reached by the same m (luckily), but t = 10: In[149]:= jo[24, 23, 23, 10] Out[149]= 0.8411835748792271 this one has a great discrepance in the number of matching charachters >>> textdistance.jaro('0100010100101001001001001010010', '10000100100111101010101010101010') 0.8308371735791091 In[67]:= jaro["0100010100101001001001001010010", \ "10000100100111101010101010101010"] During evaluation of In[67]:= l4 = 0100010100101001001001001010010 During evaluation of In[67]:= l5 = 10000100100111101010101010101010 During evaluation of In[67]:= s1 = 31| s2 = 32| r = 15| m = 31| t = 5 Out[67]= 0.935819892473118 because it's reached with m = 28 and t = 8 In[133]:= jo[31, 32, 28, 16/2] Out[133]= 0.8308371735791091 and last but not least, Mary and Meghan >>> textdistance.jaro('Mary has a little lamb', 'and Meghan has the redhead Harry') 0.5631555944055945 In[167]:= jaro["Mary has a little lamb", "and Meghan has the redhead Harry"] During evaluation of In[167]:= l4 = Ma has a tte a During evaluation of In[167]:= l5 = a Meha has the eea a During evaluation of In[167]:= s1 = 22| s2 = 32| r = 15| m = 14| t = 9/2 Out[167]= 0.5841450216450216 again textdistance,jaro() disagrees with the m (needs 13) as well as with the t (has 4): In[153]:= jo[22, 32, 13, 4] Out[153]= 0.5631555944055944 I quit. I expected that the m fits and the difference lies only in determining the t which has an English description The number of matching (but different sequence order) characters divided by 2 defines the number of transpositions. as well as a French description Le nombre de transpositions est obtenu en comparant le i-Ã¨me caractÃ¨re correspondant de {\displaystyle s{1}} s{1} avec le i-Ã¨me caractÃ¨re correspondant de {\displaystyle s{2}} s{2}. Le nombre de fois oÃ¹ ces caractÃ¨res sont diffÃ©rents, divisÃ© par deux, donne le nombre de transpositions. but if one does that, even 'Miss Australia' and 'Miss Brasilia' do not agree with textdistance.jaro().
Posted 6 months ago
 Very interesting dissertation about distance Jaro, so I get to understand you still preserve certain discrepancies with textdistance results , I do not know how to solve them because that escapes my knowledge, but I do not know if you already know about the following that can be used to make more comparisons jD = ExperimentalJaroDistance; jD["DIXON", "DICKSONX"] 0.766667 jD["JELLYFISH", "SMELLYFISH"] 0.896296 I put those examples so you can see what I mean, maybe my suggestion will help you. What illustrative has been for me everything you have done on this subject, I continue to read both your code and your observations
Posted 6 months ago
 jD = ExperimentalJaroDistance; This function seems not to have a single point of definition: In[1]:= $Version Out[1]= "11.3.0 for Microsoft Windows (64-bit) (March 7, 2018)" In[2]:= Needs["Experimental"] In[3]:= ?JaroDistance ExperimentalJaroDistance Attributes[JaroDistance]={Protected} Options[JaroDistance]={IgnoreCase->False} In[4]:= JaroDistance["Miss Australia", "Miss Brasilia"] Out[4]= 0.816683 In[5]:= JaroDistance["Miss Australia", "Miss Brasilia"] Out[5]= 0.78638 In[6]:= JaroDistance["Miss Australia", "Miss Brasilia", IgnoreCase -> True] Out[6]= 0.816683 In[7]:= JaroDistance["Miss Australia", "Miss Brasilia", IgnoreCase -> False] Out[7]= 0.816683 this is mind-boggling; JaroDistance[] has different results under identical input, it has also different result if default option is given or not. Answer Posted 6 months ago  For the sake of completeness, I found a NIST description of the Jaro distance containing William E. Winkler and Yves Thibaudeau, An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census, Statistical Research Report Series RR91/09, U.S. Bureau of the Census, Washington, D.C., 1991. The abstract (HTML) and full paper (PDF). and there it is said on p. 12 Two characters are considered in common only if they are no further apart than (m/2 - 1) where m = max(d,r). Characters in common from two strings are assigned; remaining characters unassigned. Each string has the same number of assigned characters. That tells you to treat l3 as a whole and suggests the following implementation Needs["Experimental"] Clear[check] check[l1_List, l2_List] := Flatten[Outer[List, l1, Select[l2, #[[1]] == l1[[1, 1]] &], 1], 1] Clear[jaro] jaro[s1_String, s2_String, prec_:$MachinePrecision] := Block[{r, l1 = StringLength[s1], l2 = StringLength[s2], l3, m, t}, r = Floor[Max[l1, l2]/2] - 1; If[r >= 0,(* then *) l3 = DeleteDuplicates[ Flatten[check[{Transpose[{ToCharacterCode[s1], Range[l1]}][[#]]}, Transpose[{ToCharacterCode[s2], Range[l2]}][[ Min[l2, Max[1, # - r]] ;; Min[l2, # + r]]]] & /@ Range[l1], 1], ((First[#1] == First[#2]) || (Last[#1] == Last[#2])) &]; m = Length[l3]; If[m > 0, t = (m - Count[{1, -1} . MapAt[First, SortBy[#, Last] & /@ Transpose[l3], {{1, All}, {2, All}}], 0])/2; N[(m/l1 + m/l2 + (m - t)/m)/3, prec], (* else *) 0 ], (* else *) 0 ] ] /; StringLength[s1] > 0 && StringLength[s2] > 0 && prec > 1 which has still t sometimes off by 1/2 with respect to the textdistance.jaro. The JaroDistance[] is spurious (this was made with Mathematica 10.3). Taking the drudgery to create a table In[204]:= Grid[{{Item["s1"], Item["s2"], Item["td.jaro"], Item["jaro"], Item["Defect"], Item["JaroDistance"]}, {Item["Miss Australia"], Item["Miss Brasilia"], Item[0.8166833166833167], Item[jaro["Miss Australia", "Miss Brasilia"], Background -> Pink], Item["t+1/2"], Item[JaroDistance["Miss Australia", "Miss Brasilia", IgnoreCase -> False], Background -> Pink] }, {Item["DWAYNE"], Item["DUANE"], Item[0.8222222222222223], Item[jaro["DWAYNE", "DUANE"]], Item["-"], Item[JaroDistance["DWAYNE", "DUANE", IgnoreCase -> False], Background -> Pink] }, {Item["MARTHA"], Item["MARHTA"], Item[0.9444444444444445], Item[jaro["MARTHA", "MARHTA"]], Item["-"], Item[JaroDistance["MARTHA", "MARHTA", IgnoreCase -> False], Background -> Pink] }, {Item["DIXON"], Item["DICKSONX"], Item[0.7666666666666666], Item[jaro["DIXON", "DICKSONX"]], Item["-"], Item[JaroDistance["DIXON", "DICKSONX", IgnoreCase -> False], Background -> Pink] }, {Item["JELLYFISH"], Item["SMELLYFISH"], Item[0.8962962962962964], Item[jaro["JELLYFISH", "SMELLYFISH"]], Item["-"], Item[JaroDistance["JELLYFISH", "SMELLYFISH", IgnoreCase -> False], Background -> Pink] }, {Item["Miss Mexiko"], Item["Miss Belize"], Item[0.7575757575757575], Item[jaro["Miss Mexiko", "Miss Belize"]], Item["-"], Item[JaroDistance["Miss Mexiko", "Miss Belize", IgnoreCase -> False], Background -> Pink] }, {Item["0100010100101001001001001010010"], Item["10000100100111101010101010101010"], Item[0.8308371735791091], Item[jaro["0100010100101001001001001010010", "10000100100111101010101010101010"]], Item["-"], Item[JaroDistance["0100010100101001001001001010010", "10000100100111101010101010101010", IgnoreCase -> False]] }, {Item["aasdjkdashdahsgdashdgasj"], Item["asdjkdashdahsgdashdgasj"], Item[0.841183574879227], Item[jaro["aasdjkdashdahsgdashdgasj", "asdjkdashdahsgdashdgasj"]], Item["-"], Item[JaroDistance["aasdjkdashdahsgdashdgasj", "asdjkdashdahsgdashdgasj", IgnoreCase -> False], Background -> Pink] }, {Item["abdegopq"], Item["cfhijklmnrstuvwxyz"], Item[0.0], Item[jaro["abdegopq", "cfhijklmnrstuvwxyz"]], Item["-"], Item[JaroDistance["abdegopq", "cfhijklmnrstuvwxyz", IgnoreCase -> False]] }, {Item["Mary has a little lamb"], Item["and Meghan has the redhead Harry"], Item[0.5631555944055945], Item[jaro["Mary has a little lamb", "and Meghan has the redhead Harry"], Background -> Pink], Item["t+1/2"], Item[JaroDistance["Mary has a little lamb", "and Meghan has the redhead Harry"]] }, {Item["Take[list,-n] gives the last n elements of list."], Item["Take[list,{m,n}] gives elements m through n of list."], Item[0.791056166056166], Item[jaro["Take[list,-n] gives the last n elements of list.", "Take[list,{m,n}] gives elements m through n of list."], Background -> Pink], Item["t+1/2"], Item[JaroDistance[ "Take[list,-n] gives the last n elements of list.", "Take[list,{m,n}] gives elements m through n of list."]] } }, Background -> {None, {Cyan}}, Frame -> All] one getsand I've to confess in complete humbleness that I'm unable to get jaro[] right using the given descriptions what it is meant to be. Please read the cited PDF and tell me.[1]:William E. Winkler and Yves Thibaudeau, An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial
 Let's again consider the "Miss Brasilia", "Miss Australia" case: Clear[l1, l2, x1, x2, r, pinkify] l1 = "Miss Australia"; l2 = "Miss Brasilia"; x1 = ToCharacterCode[l1]; x2 = ToCharacterCode[l2]; pinkify[m_] := Block[{m1 = m, m2 = m, p = Position[m, 1], o, po}, For[o = 1, o <= Length[p], o++, po = p[[o]]; If[Position[m1[[1 ;; po[[1]], po[[2]]]], 1, {1}, 1][[1, 1]] == po[[1]] && Position[m1[[po[[1]], 1 ;; po[[2]]]], 1, {1}, 1][[1, 1]] == po[[2]], m2[[po[[1]], po[[2]]]] = Item[m[[po[[1]], po[[2]]]], Background -> Green], (* else *) m2[[po[[1]], po[[2]]]] = Item[m[[po[[1]], po[[2]]]], Background -> Pink]; m1[[po[[1]], po[[2]]]] = 0 ] ]; m2 ] /; MatrixQ[m] With[{r = Floor[Max[StringLength[l1], StringLength[l2]]/2] - 1}, Grid[pinkify[Normal[ SparseArray[ Join[ Table[Rule[{1, 1 + o}, Characters[l1][[o]]], {o, StringLength[l1]}], Table[ Rule[{1 + o, 1}, Characters[l2][[o]]], {o, StringLength[l2]}], Flatten[ Table[Rule[{1 + o2, 1 + o1}, If[x1[[o1]] == x2[[o2]] && r >= Abs[o1 - o2], 1, 0]], {o2, StringLength[l2]}, {o1, StringLength[l1]}], 1] ] ] ] ], Frame -> All, ItemSize -> {3/2, 4/3}, Alignment -> Center ] ] resulting in the pink 1do not go into the count, because they are not the rolling first ones in both directions seen from the beginning. One is left with m = 11, 6 perfect matches and 5 mismatches. This would count for a 2 t = 5 = 11 - 6, whereas the python implementation has 2 t = 4. This raises the idea, that not all mismatches must be count, but the existence of some permutation of more than 3 characters has to be realized, making the implemention more complex, as follows: Clear[jaro] jaro[s1_String, s2_String, prec_: \$MachinePrecision] := Block[{r, l1 = StringLength[s1], l2 = StringLength[s2], l3, m, t}, r = Floor[Max[l1, l2]/2] - 1; If[r >= 0,(* then *) l3 = DeleteDuplicates[ Flatten[check[{Transpose[{ToCharacterCode[s1], Range[l1]}][[#]]}, Transpose[{ToCharacterCode[s2], Range[l2]}][[ Min[l2, Max[1, # - r]] ;; Min[l2, # + r]]]] & /@ Range[l1], 1], ((First[#1] == First[#2]) || (Last[#1] == Last[#2])) &]; m = Length[l3]; If[m > 0, t = share[ MapAt[First, SortBy[#, Last] & /@ Transpose[l3], {{1, All}, {2, All}}]]/2; N[(m/l1 + m/l2 + (m - t)/m)/3, prec], (* else *) 0 ], (* else *) 0 ] ] /; StringLength[s1] > 0 && StringLength[s2] > 0 && prec > 1 with the function share[], taking care for correspondances between the both strings of matching characters as well as for a rolling remapping for partnerless characters: Clear[share] share::insane = "x 1 as 2 is not bijective above 0"; share[l1_List, l2_List] := Block[{m = Length[l1], t = 0, x1, x2, o1, x3, o2, o3, z, seqL = 0, r, r0, recycleQ, o4}, (* search for maximal connected sets of equal characters and denote \ them # - 1 as t *) x1 = ConstantArray[0, m]; x2 = ConstantArray[0, m]; For[o1 = 1, o1 <= m, o1++, x3 = Position[x2, 0, {1}, 1]; If[TensorRank[x3] == 2, For[o2 = x3[[1, 1]], o2 <= m, o2++, If[l2[[o2]] == l1[[o1]] && x2[[o2]] == 0, If[o1 == o2, x1[[o1]] = -1; x2[[o2]] = -1, (* else *) If[l1[[o2]] == l2[[o2]], x1[[o2]] = -1; x2[[o2]] = -1, (* else *) If[l1[[o1]] == l2[[o1]], x1[[o1]] = -1; x2[[o1]] = -1, (* else *) x2[[o2]] = o1; x1[[o1]] = o2; Break[] ] ] ] ] ] ] ]; For[o1 = 1, o1 <= m, o1++, If[x1[[o1]] > 0, If[x2[[o1]] > 0, seqL += 1, (* else *) z = l1[[o1]]; o2 = o1 + 1; r = -1; recycleQ = True; While[r != 0, o4 = Position[x1[[o2 ;;]], -1, {1}, 1]; o4 = If[TensorRank[o4] == 2, o4[[1, 1]] - 1, m]; If[FreeQ[l1[[o2 ;; o4]], z], If[x1[[o2 - 1]] > o2 - 1, x2[[x1[[o2 - 1]]]] = 0]; r = 0,(* else *) o3 = o2 - 1 + First[First[Position[l1[[o2 ;; o4]], z, {1}, 1]]]; r = x1[[o3]]; If[recycleQ, x1[[o3]] = x1[[o2 - 1]]; x2[[x1[[o2 - 1]]]] = o3; x1[[o2 - 1]] = 0; recycleQ = False, (* else *) x1[[o3]] = r0; x2[[r0]] = o3 ]; r0 = r; o2 = Min[o3 + 1, m] ] ]; If[insaneQ[x1], Message[share::insane, 1, x1]]; If[insaneQ[x2], Message[share::insane, 2, x2]]; t += (detect[x1[[o1 - seqL ;; o1 - 1]], x2[[o1 - seqL ;; o1 - 1]], o1 - seqL, o1 - 1] + If[seqL > 0, 1, 0]); seqL = 0 ], (* else *) If[x2[[o1]] > 0, z = l2[[o1]]; o2 = o1 + 1; r = -1; recycleQ = True; While[r != 0, o4 = Position[x2[[o2 ;;]], -1, {1}, 1]; o4 = If[TensorRank[o4] == 2, m, o4[[1, 1]] - 1, m]; If[FreeQ[l2[[o2 ;; o4]], z], If[x2[[o2 - 1]] > o2 - 1, x1[[x2[[o2 - 1]]]] = 0]; r = 0, (* else *) o3 = o2 - 1 + First[First[Position[l2[[o2 ;; o4]], z, {1}, 1]]]; r = x2[[o3]]; If[recycleQ, x2[[o3]] = x2[[o2 - 1]]; x1[[x2[[o2 - 1]]]] = o3; x2[[o2 - 1]] = 0; recycleQ = False, (* else *) x2[[o3]] = r0; x1[[r0]] = o3; ]; r0 = r; o2 = Min[o3 + 1, m] ] ]; If[insaneQ[x1], Message[share::insane, 1, x1]]; If[insaneQ[x2], Message[share::insane, 2, x2]]; t += (detect[x1[[o1 - seqL ;; o1 - 1]], x2[[o1 - seqL ;; o1 - 1]], o1 - seqL, o1 - 1] + If[seqL > 0, 1, 0]); seqL = 0, (* else: x1\[LeftDoubleBracket]o1\[RightDoubleBracket] \[LessEqual] 0, x2\[LeftDoubleBracket]o1\[RightDoubleBracket] \[LessEqual] 0 *) t += (detect[x1[[o1 - seqL ;; o1 - 1]], x2[[o1 - seqL ;; o1 - 1]], o1 - seqL, o1 - 1] + If[seqL > 0, 1, 0]); seqL = 0 ] ] ]; t + If[seqL > 0, detect[x1[[o1 - seqL ;; o1 - 1]], x2[[o1 - seqL ;; o1 - 1]], o1 - seqL, o1 - 1], 0] - Count[x1, -1] ] share[{l1_List, l2_List}] := share[l1, l2] /; Length[l1] == Length[l2] && Length[l1] > 0 && VectorQ[l1, NumberQ] && VectorQ[l2, NumberQ] share[{l1_List, l2_List}] := 0 /; Length[l1] == 0 && Length[l2] == 0 share[] itself depends on dectect[]: Clear[insaneQ, check, detectI, detect] insaneQ[l1_List?VectorQ] := Block[{l = Fold[DeleteCases, l1, {-1, 0}]}, Length[DeleteDuplicates[l]] != Length[l] ] check[l1_List, l2_List] := Flatten[Outer[List, l1, Select[l2, #[[1]] == l1[[1, 1]] &], 1], 1] detectI[l1_List, l2_List, o1_Integer, o2_Integer] := Block[{m = Length[l1], permQ = True, o3}, If[o1 < o2, (* Is l2 a permutation of l1 in Interval[{o1,o2}]? *) For[o3 = 1, o3 <= m, o3++, If[permQ, permQ = (o1 <= l1[[o3]] <= o2) && (o1 <= l2[[o3]] <= o2) ] ]; If[m > 2 && permQ,(* "jaro" permutation exists *) o2 - o1,(* else *) o2 - o1 + 1 ], (* else *) 1 ] ] detect[l1_List, l2_List, o1_Integer, o2_Integer] := Block[{m = Length[l1], x = 0, o3}, If[o1 < o2, (* Contain l1,l2 a permutation of more than 2 characters? *) For[o3 = 1, o3 <= m, o3++, x += If[(o1 <= l1[[o3]] <= o2) && (o1 <= l2[[o3]] <= o2), 1, 0] ]; If[x > 2,(* "jaro" permutation exists *) o2 - o1,(* else *) o2 - o1 + 1 ], (* else *) 1 ] ] if one would use detectI[] one would check for a complete permutation between matches. detect[] looks only for some (> 2) characters in both strings. With this in place look into test cases:The first fail is bizarre, because it is a permutation, but other implementations have t bigger. The second fail could be fixed if one would use detectI[] instead of detect[], but then elsewhere fails show up.For your convenience the notebook is in the appendix. Attachments:
 If one sets out to admit only in-place permutations (i.e. the range of permutating characters in string s1 is the same range as in string s2) to lower the 2 t by 1 (if there are more ranges of that type in the arguments given to detect[], 1 is subtracted only once), then the previous function detect[] did not do that. Instead one should use this one: detect[l1_List, l2_List, o1_Integer, o2_Integer] := Block[{m = Length[l1], permQ = False, x = 0, o3, o41, o42}, If[o1 < o2, (* Contain l1, l2 a permutation of more than 2 characters in place? *) For[o3 = 1, o3 <= m, o3++, If[(o1 <= l1[[o3]] <= o2) && (o1 <= l2[[o3]] <= o2), If[x == 0, o41 = o3; o42 = o3; x = 1, (* else *) If[o3 == o42 + 1, o42 = o3; x += 1, (* else *) x = 0 ] ], (* else *) If[x > 2 && Complement[l1[[o41 ;; o42]], l2[[o41 ;; o42]]] == {}, permQ = True; Break[], (* else *) x = 0 ] ] ]; If[permQ,(* "jaro" permutation exists *) o2 - o1,(* else *) If[x > 2 && Complement[l1[[o41 ;; o42]], l2[[o41 ;; o42]]] == {}, o2 - o1, (* else *) o2 - o1 + 1 ] ], (* else *) 1 ] ] with it the test cases are now:Case 11 has no in-place permutation, therefore it does not lower 2 t and has 2 t one higher than textdistance.jaro. On the other hand, case 8 has an in-place permutation, clearly seen from this illustrationand again - but into the other direction - therefore case 8 lowers 2 t by 1 and has 2 t` one lower than textdistance.jaro. As said before, that in-place permutation regulation is needed to bring the "Miss Australia" vs. "Miss Brasilia" example to the same value textdistance.jaro has. Attachments: