Message Boards Message Boards

0
|
10851 Views
|
11 Replies
|
3 Total Likes
View groups...
Share
Share this post:

Program the distance Jaro?

Posted 6 years ago

Hello everyone, I'm trying to program Jaro distance as requested by this page, I've done the following code that works well for the next two pairs ("MARTHA", "MARHTA") and ("DIXON", "DICKSONX") but when I try with ("JELLYFISH", "SMELLYFISH") I get an error because the code counts the double S of "SMELLYFISH", due to this error I have not been able to finish successfully, here is what I have programmed up to this moment:

uno = "DIXON"; dos ="DICKSONX" ;
rep = Characters[uno] \[Intersection] Characters[dos]
scope = Max[StringLength[uno], StringLength[dos]]/2 - 1
inter = Transpose[{Flatten[Position[Characters[uno], #] & /@ rep], 
   Flatten[Position[Characters[dos], #] & /@ rep]}]

m = Select[inter, Abs[#[[1]] - #[[2]]] < scope &]

prb = Select[m, #[[1]] != #[[2]] &]

trans = Length[DeleteCases[Position[prb, Reverse[#]] & /@ prb, {}]]/2

1/3 (Length[m]/StringLength[uno] + Length[m]/StringLength[dos] + (
   Length[m] - trans)/Length[m])

Someone who can help me solve this problem? Maybe the approach I'm using is wrong, unfortunately I have not been able to find a way to solve the problem with my code, I hope that someone can please guide me to achieve my goal, any help is welcome, thank you in advance for your help

POSTED BY: Luis Ledesma
11 Replies

One could do something like that

Clear[check, jaro]
check[d_Integer, l1_List, l2_List] := 
  If[l1[[1]] == l2[[1]] && Abs[l1[[2]] - l2[[2]]] <= d,
   {l1, l2}, (* else *) 
   Missing[]
   ];
jaro[s1_String, s2_String] := 
 Block[{r, m = {}, l1 = StringLength[s1], l2 = StringLength[s2], w1, 
    w2, l3, t = 0, l4, l5},
   r = Floor[Max[l1, l2]/2] - 1;
   w1 = Transpose[{ToCharacterCode[s1], Range[l1]}];
   w2 = Transpose[{ToCharacterCode[s2], Range[l2]}];
   l3 = DeleteMissing[
     Flatten[Outer[check[r, #1, #2] &, w1, w2, 1], 1]];
   m = Length[
     GatherBy[SortBy[l3, First[First[#]] &], First[First[#]] &]];
   {l4, l5} = SortBy[#, Last] & /@ (Union /@ Transpose[l3]);
   l4 = First /@ l4;
   l5 = First /@ l5;
   t = Length[
      Select[Subtract @@@ 
        Transpose[{Take[l4, Min[Length[l4], Length[l5]]], 
          Take[l5, Min[Length[l4], Length[l5]]]}], # != 0 &]]/2;
   Print["r = ", r, "| m = ", m, "| t = ", t]; 
   If[m == 0,
    0, (* else *)
    (m/l1 + m/l2 + (m - t)/m)/3.
    ]
   ] /; StringLength[s1] > 0 && StringLength[s2] > 0

but it has an error if s1 and s2 are the same character. Then

In[204]:= jaro["Mis", "Mis"]
During evaluation of In[204]:= r = 0| m = 3| t = 0
Out[204]= 1.

In[205]:= jaro["Miss", "Miss"]
During evaluation of In[205]:= r = 1| m = 3| t = 0
Out[205]= 0.833333

the $s$ in $ss$ is in reach of r to the other s, but seems to count twice, because m must be 4 to bring jaro to 1 in that case; correspondingly the test cases have a 50% fail, of course:

In[202]:= jaro @@@ {{"DWAYNE", "DUANE"},
  {"MARTHA", "MARHTA"},
  {"DIXON", "DICKSONX"},
  {"JELLYFISH", "SMELLYFISH"}}
During evaluation of In[202]:= r = 2| m = 4| t = 0
During evaluation of In[202]:= r = 2| m = 5| t = 1
During evaluation of In[202]:= r = 3| m = 4| t = 0
During evaluation of In[202]:= r = 4| m = 7| t = 0

(Debug) Out[202]= {0.822222, 0.822222, 0.766667, 0.825926}

so, please Luis, get it fixed, to reach at

0.8222222, 0.9444444, 0.7666667, 0.8962963.

POSTED BY: Udo Krause

Okay, fix it into

Clear[check, jaro]
check[d_Integer, l1_List, l2_List] := 
  If[l1[[1]] == l2[[1]] && Abs[l1[[2]] - l2[[2]]] <= d,
   {l1, l2}, (* else *) 
   Missing[]
   ];
jaro[s1_String, s2_String] := 
 Block[{r, m = {}, l1 = StringLength[s1], l2 = StringLength[s2], w1, 
    w2, l3, t = 0, l4, l5},
   r = Floor[Max[l1, l2]/2] - 1;
   If[r < 0,
    m = 0, (* else *)
    w1 = Transpose[{ToCharacterCode[s1], Range[l1]}];
    w2 = Transpose[{ToCharacterCode[s2], Range[l2]}];
    l3 = DeleteMissing[
      Flatten[Outer[check[r, #1, #2] &, w1, w2, 1], 1]];
    m = Length[GatherBy[SortBy[l3, First[First[#]] &], First]];
    {l4, l5} = SortBy[#, Last] & /@ (Union /@ Transpose[l3]);
    t = EditDistance[First /@ l4, First /@ l5]/2
    ];
   Print["r = ", r, "| m = ", m, "| t = ", t];
   If[m == 0,
    0, (* else *)
    (m/l1 + m/l2 + (m - t)/m)/3.
    ]
   ] /; StringLength[s1] > 0 && StringLength[s2] > 0

this implementation is still toy because Outer[] goes too far and then ,,,,, using the Levenshtein distance in computing the Jaro distance seems a bit lunatic ... but at least the test cases go through:

In[71]:= jaro["s", "s"]
During evaluation of In[71]:= r = -1| m = 0| t = 0
Out[71]= 0

In[72]:= jaro["Miss", "Miss"]
During evaluation of In[72]:= r = 1| m = 4| t = 0
Out[72]= 1.

In[73]:= jaro @@@ {{"DWAYNE", "DUANE"},
  {"MARTHA", "MARHTA"},
  {"DIXON", "DICKSONX"},
  {"JELLYFISH", "SMELLYFISH"}}
During evaluation of In[73]:= r = 2| m = 4| t = 0
During evaluation of In[73]:= r = 2| m = 6| t = 1
During evaluation of In[73]:= r = 3| m = 4| t = 0
During evaluation of In[73]:= r = 4| m = 8| t = 0
Out[73]= {0.822222, 0.944444, 0.766667, 0.896296}

In[76]:= jaro["Miss Argentina", "Miss Brasilia"]
During evaluation of In[76]:= r = 6| m = 8| t = 3/2
Out[76]= 0.666438

In[77]:= jaro["0100010100101001001001001010010", \
"10000100100111101010101010101010"]
During evaluation of In[77]:= r = 15| m = 31| t = 9/2
Out[77]= 0.941196
POSTED BY: Udo Krause
Posted 6 years ago

$Udo $ $Krause$ Thank you very much for your help and support to face this problem, I sincerely believed that this problem was easier to solve, but I see that no, it has its details. Thanks to the code you shared, I learned another different approach to mine, which is more sophisticated and elegant. For my part I have made the following code based on the response they provide using python in the page I shared, I probe it with the cases that they request there and with some that you showed. I hope the community will discuss the improvements we can make to the code and thus progress in my learning, thanks again for all your support.

jaro[uno_String, dos_String] := 
 Module[{scope, slen, tlen, smatches, tmatches, matches, 
   transpositions, i, start, end, k}, 
  scope = Floor[Max[StringLength[uno], StringLength[dos]]/2] - 1; 
  slen = StringLength[uno]; tlen = StringLength[dos]; 
  smatches = Table[False, {slen}]; tmatches = Table[False, {tlen}]; 
  matches = 0; transpositions = 0; 
  For[i = 1, i <= slen, i++, start = Max[1, i - scope]; 
   end = Min[i + scope, tlen]; 
   For[j = start, j <= end, j++, 
    If[tmatches[[j]], Continue[], 
     If[StringTake[uno, {i}] != StringTake[dos, {j}], Continue[], 
      smatches[[i]] = True; tmatches[[j]] = True]]; matches++; 
    Break[]]]; k = 1; i =.; 
  For[i = 1, i <= slen, i++, If[Not[smatches[[i]]], Continue[]]; 
   While[Not[tmatches[[k]]], k++]; 
   If[StringTake[uno, {i}] != StringTake[dos, {k}], transpositions++];
    k++]; transpositions /= 2 ; 
  1/3 (matches/slen + matches/tlen + (matches - transpositions)/
      matches) // N]

In[10]:= jaro @@@ {{"DWAYNE", "DUANE"}, {"MARTHA", 
   "MARHTA"}, {"DIXON", "DICKSONX"}, {"JELLYFISH", "SMELLYFISH"}}

Out[10]= {0.822222, 0.944444, 0.766667, 0.896296}

In[11]:= jaro @@@ {{"Miss Argentina", 
   "Miss Brasilia"}, {"MissArgentina", 
   "MissBrasilia"}, {"miss Argentina", 
   "miss Brasilia"}, {"MISSARGENTINA", "MISSBRASILIA"}, {"miss", 
   "miss"}, {"MiSs", "MisS"}}

Out[11]= {0.687271, 0.707265, 0.687271, 0.719017, 1., 0.916667}
POSTED BY: Luis Ledesma

For my part I have made the following Code

which reminds me to the quotation

enter image description here

you witness $FORTRAN66$ in Mathematica. If it is necessary to hammer out the fact that a double loop must be done -- Outer[] states exactly that -- it can be done in Mathematica without using the built-in Symbol For[] - I guess. Let's see whether the Delayed constructs allow for it. Outer[] is an overkill and elegance is of no good if it compromises performance (and/or correctness)

POSTED BY: Udo Krause

The Outer[] has been banned down the call tree

Clear[check, jaro]
check[l1_List, l2_List] := 
 Flatten[Outer[List, l1, Select[l2, #[[1]] == l1[[1, 1]] &], 1], 1]
jaro[s1_String, s2_String] := 
 Block[{r, l1 = StringLength[s1], l2 = StringLength[s2], w1, w2, l3, m, l4, l5, t},
   r = Floor[Max[l1, l2]/2] - 1;
   If[r >= 0,(* then *)
    w1 = Transpose[{ToCharacterCode[s1], Range[l1]}];
    w2 = Transpose[{ToCharacterCode[s2], Range[l2]}];
    l3 = Flatten[check[{w1[[#]]}, w2[[Min[l2, Max[1, # - r]] ;; Min[l2, # + r]]]] & /@ Range[l1], 1];
    If[Length[l3] > 0,
     {l4, l5} = MapAt[First, SortBy[#, Last] & /@ (Union /@ Transpose[l3]), {{1, All}, {2, All}}];
     m = MinMax[{Length[l4], Length[l5]}];
     t = Count[PadRight[l4, m[[2]], -1] - PadRight[l5, m[[2]], -1], u_ /; u != 0]/2;
     (m[[1]]/l1 + m[[1]]/l2 + (m[[1]] - t)/m[[1]])/3., (* else *)
     0 
     ], (* else *)
    0
    ]
   ] /; StringLength[s1] > 0 && StringLength[s2] > 0

to let out

In[12]:= jaro["Miss Australia", "Miss Brasilia"]
Out[12]= 0.771229

In[6]:= jaro["Miss Brasilia", "Miss Australia"]
Out[6]= 0.771229

In[8]:= jaro["s", "s"]
Out[8]= 0

In[10]:= jaro["Miss", "Miss"]
Out[10]= 1.

In[13]:= jaro @@@ {{"DWAYNE", "DUANE"},
  {"MARTHA", "MARHTA"},
  {"DIXON", "DICKSONX"},
  {"JELLYFISH", "SMELLYFISH"}}
Out[13]= {0.822222, 0.944444, 0.766667, 0.896296}

In[25]:= jaro["Miss Mexiko", "Miss Belize"]
Out[25]= 0.733766

In[26]:= jaro[ "Miss Belize", "Miss Mexiko"]
Out[26]= 0.733766

In[27]:= jaro["0100010100101001001001001010010", \
"10000100100111101010101010101010"]
Out[27]= 0.898185

In[28]:= jaro["Miss Mexikoooooooooooooooooooo", "Miss Belize"]
Out[28]= 0.602146

In[29]:= jaro["abdegopq", "cfhijklmnrstuvwyz"]
Out[29]= 0

In[30]:= jaro["cfhijklmnrstuvwyz", "abdegopq"]
Out[30]= 0

In[33]:= jaro["aasdjkdashdahsgdashdgasj", "asdjkdashdahsgdashdgasj"]
Out[33]= 0.819444

In[34]:= jaro["aasdjkdashdahsgdashdgasj", "aasdjkdashdahsgdashdgasj"]
Out[34]= 1.

In[31]:= jaro["CRATE", "TRACE"]
Out[31]= 0.733333

In[32]:= jaro["Mary has a little lamb", "and Meghan has the redhead Harry"]
Out[32]= 0.465097

to check it against another implementation textdistance has been choosen, it gives for a r = -1 nevertheless 1 and shows the following results

Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import textdistance
>>> textdistance.jaro.distance('Miss', 'Miss')
0
>>> textdistance.jaro('Miss', 'Miss')
1
>>> textdistance.jaro('s', 's')
1
>>> textdistance.jaro('Miss Australia', 'Miss Brasilia')
0.8166833166833167
>>> textdistance.jaro('Miss Brasilia', 'Miss Australia')
0.8166833166833167
>>> textdistance.jaro('DWAYNE', 'DUANE')
0.8222222222222223
>>> textdistance.jaro('MARTHA', 'MARHTA')
0.9444444444444445
>>> textdistance.jaro('DIXON', 'DICKSONX')
0.7666666666666666
>>> textdistance.jaro('JELLYFISH', 'SMELLYFISH')
0.8962962962962964
>>> textdistance.jaro('Miss Mexiko', 'Miss Belize')
0.7575757575757575
>>> textdistance.jaro('Miss Belize', 'Miss Mexiko')
0.7575757575757575
>>> textdistance.jaro('0100010100101001001001001010010', '10000100100111101010101010101010')
0.8308371735791091
>>> textdistance.jaro('Miss Mexikoooooooooooooooooooo', 'Miss Belize')
0.6232323232323232
>>> textdistance.jaro('abdegopq', 'cfhijklmnrstuvwyz')
0.0
>>> textdistance.jaro('cfhijklmnrstuvwyz', 'abdegopq')
0.0
>>> textdistance.jaro('aasdjkdashdahsgdashdgasj', 'asdjkdashdahsgdashdgasj')
0.841183574879227
>>> textdistance.jaro('aasdjkdashdahsgdashdgasj', 'aasdjkdashdahsgdashdgasj')
1
>>> textdistance.jaro('CRATE', 'TRACE')
0.7333333333333334
>>> textdistance.jaro('Mary has a little lamb', 'and Meghan has the redhead Harry')
0.5631555944055945
>>>

most of them disagree with the above jaro[]. Does your implementation match it?

POSTED BY: Udo Krause

The EditDistance[] is back and

Clear[check, jaro]
check[l1_List, l2_List] := 
 Flatten[Outer[List, l1, Select[l2, #[[1]] == l1[[1, 1]] &], 1], 1]
jaro[s1_String, s2_String, prec_: $MachinePrecision] := 
 Block[{r, l1 = StringLength[s1], l2 = StringLength[s2], l3, m, l4, 
    l5, t},
   r = Floor[Max[l1, l2]/2] - 1;
   If[r >= 0,(* then *)
    l3 = Flatten[
      check[{Transpose[{ToCharacterCode[s1], Range[l1]}][[#]]}, 
         Transpose[{ToCharacterCode[s2], Range[l2]}][[
          Min[l2, Max[1, # - r]] ;; Min[l2, # + r]]]] & /@ Range[l1], 
      1];
    If[Length[l3] > 0,
     {l4, l5} = 
      MapAt[First, 
       SortBy[#, Last] & /@ (Union /@ Transpose[l3]), {{1, All}, {2, 
         All}}];
     m = Min[Length[l4], Length[l5]];
     t = EditDistance[Take[l4, m], Take[l5, m]]/2;
     N[(m/l1 + m/l2 + (m - t)/m)/3, prec], (* else *)
     0 
     ], (* else *)
    0
    ]
   ] /; StringLength[s1] > 0 && StringLength[s2] > 0 && prec > 1

many results of textdistance.jaro() are matched

>>> textdistance.jaro('Miss Australia', 'Miss Brasilia')
0.8166833166833167
>>> textdistance.jaro('Miss Mexiko', 'Miss Belize')
0.7575757575757575
>>> textdistance.jaro('Miss Belize', 'Miss Mexiko')
0.7575757575757575

In[161]:= jaro["Miss Australia", "Miss Brasilia"]
Out[161]= 0.8166833166833167

In[162]:= jaro["Miss Mexiko", "Miss Belize"]
Out[162]= 0.7575757575757576

In[163]:= jaro[ "Miss Belize", "Miss Mexiko"]
Out[163]= 0.7575757575757576

but some not, to be discussed in the following: First define a function jo[] usable in FindRoot[] to find the m and t which have been ssemingly found by textdistance.jaro():

Clear[jo, joex]
jo[s1_Integer, s2_Integer, m_?NumberQ, t_?NumberQ, 
  prec_: $MachinePrecision] := 
 N[(m/s1 + m/s2 + (m - t)/m)/3, prec] /; m > 0 && prec > 0
joex[s1_Integer, s2_Integer, m_Integer, 
  t_?NumberQ] := (m/s1 + m/s2 + (m - t)/m)/3 /; m > 0

so one uses a jaro[] printing it's findings:

>>> textdistance.jaro('Miss Mexikoooooooooooooooooooo', 'Miss Belize')
0.6232323232323232

In[136]:= jaro["Miss Mexikoooooooooooooooooooo", "Miss Belize"]
During evaluation of In[136]:= l4 = Miss Mei
During evaluation of In[136]:= l5 = Miss eie
During evaluation of In[136]:= s1 = 30| s2 = 11| r = 14| m = 8| t = 1
Out[136]= 0.6229797979797980

the result ot textdistance.jaro() follows with m = 7 and t = 0:

In[145]:= jo[30, 11, 7, 0]
Out[145]= 0.6232323232323232

it's unclear how a disagreement in the number of matching characters can happen.

This is interesting, only the first character differs:

>>> textdistance.jaro('aasdjkdashdahsgdashdgasj', 'asdjkdashdahsgdashdgasj')
0.841183574879227

In[146]:= jaro["aasdjkdashdahsgdashdgasj", "asdjkdashdahsgdashdgasj"]
During evaluation of In[146]:= l4 = aasdjkdashdahsgdashdgasj
During evaluation of In[146]:= l5 = asdjkdashdahsgdashdgasj
During evaluation of In[146]:= s1 = 24| s2 = 23| r = 11| m = 23| t = 1
Out[146]= 0.971618357487923

this is reached by the same m (luckily), but t = 10:

In[149]:= jo[24, 23, 23, 10]
Out[149]= 0.8411835748792271

this one has a great discrepance in the number of matching charachters

>>> textdistance.jaro('0100010100101001001001001010010', '10000100100111101010101010101010')
0.8308371735791091

In[67]:= jaro["0100010100101001001001001010010", \
"10000100100111101010101010101010"]
During evaluation of In[67]:= l4 = 0100010100101001001001001010010
During evaluation of In[67]:= l5 = 10000100100111101010101010101010
During evaluation of In[67]:= s1 = 31| s2 = 32| r = 15| m = 31| t = 5
Out[67]= 0.935819892473118

because it's reached with m = 28 and t = 8

In[133]:= jo[31, 32, 28, 16/2]
Out[133]= 0.8308371735791091 

and last but not least, Mary and Meghan

>>> textdistance.jaro('Mary has a little lamb', 'and Meghan has the redhead Harry')
0.5631555944055945

In[167]:= jaro["Mary has a little lamb", "and Meghan has the redhead Harry"]
During evaluation of In[167]:= l4 = Ma has a tte a
During evaluation of In[167]:= l5 = a Meha has the eea a
During evaluation of In[167]:= s1 = 22| s2 = 32| r = 15| m = 14| t = 9/2
Out[167]= 0.5841450216450216

again textdistance,jaro() disagrees with the m (needs 13) as well as with the t (has 4):

In[153]:= jo[22, 32, 13, 4]
Out[153]= 0.5631555944055944

I quit. I expected that the m fits and the difference lies only in determining the t which has an English description

The number of matching (but different sequence order) characters divided by 2 defines the number of transpositions.

as well as a French description

Le nombre de transpositions est obtenu en comparant le i-ème caractère correspondant de {\displaystyle s{1}} s{1} avec le i-ème caractère correspondant de {\displaystyle s{2}} s{2}. Le nombre de fois où ces caractères sont différents, divisé par deux, donne le nombre de transpositions.

but if one does that, even 'Miss Australia' and 'Miss Brasilia' do not agree with textdistance.jaro().

POSTED BY: Udo Krause
Posted 6 years ago

Very interesting dissertation about distance Jaro, so I get to understand you still preserve certain discrepancies with

textdistance results

, I do not know how to solve them because that escapes my knowledge, but I do not know if you already know about the following that can be used to make more comparisons

jD = Experimental`JaroDistance;

jD["DIXON", "DICKSONX"]

    0.766667

jD["JELLYFISH", "SMELLYFISH"]

    0.896296

I put those examples so you can see what I mean, maybe my suggestion will help you. What illustrative has been for me everything you have done on this subject, I continue to read both your code and your observations

POSTED BY: Luis Ledesma

jD = Experimental`JaroDistance;

This function seems not to have a single point of definition:

In[1]:= $Version
Out[1]= "11.3.0 for Microsoft Windows (64-bit) (March 7, 2018)"

In[2]:= Needs["Experimental`"]

In[3]:= ?JaroDistance
Experimental`JaroDistance
Attributes[JaroDistance]={Protected}
Options[JaroDistance]={IgnoreCase->False}

In[4]:= JaroDistance["Miss Australia", "Miss Brasilia"]
Out[4]= 0.816683

In[5]:= JaroDistance["Miss Australia", "Miss Brasilia"]
Out[5]= 0.78638

In[6]:= JaroDistance["Miss Australia", "Miss Brasilia", IgnoreCase -> True]
Out[6]= 0.816683

In[7]:= JaroDistance["Miss Australia", "Miss Brasilia", IgnoreCase -> False]
Out[7]= 0.816683

this is mind-boggling; JaroDistance[] has different results under identical input, it has also different result if default option is given or not.

POSTED BY: Udo Krause

For the sake of completeness, I found a NIST description of the Jaro distance containing

William E. Winkler and Yves Thibaudeau, An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census, Statistical Research Report Series RR91/09, U.S. Bureau of the Census, Washington, D.C., 1991. The abstract (HTML) and full paper (PDF).

and there it is said on p. 12

Two characters are considered in common only if they are no further apart than (m/2 - 1) where m = max(d,r). Characters in common from two strings are assigned; remaining characters unassigned. Each string has the same number of assigned characters.

That tells you to treat l3 as a whole and suggests the following implementation

Needs["Experimental`"]

Clear[check]
check[l1_List, l2_List] := Flatten[Outer[List, l1, Select[l2, #[[1]] == l1[[1, 1]] &], 1], 1]

Clear[jaro]
jaro[s1_String, s2_String, prec_: $MachinePrecision] := 
 Block[{r, l1 = StringLength[s1], l2 = StringLength[s2], l3, m, t},
   r = Floor[Max[l1, l2]/2] - 1;
   If[r >= 0,(* then *)
    l3 = DeleteDuplicates[
      Flatten[check[{Transpose[{ToCharacterCode[s1], 
              Range[l1]}][[#]]}, 
          Transpose[{ToCharacterCode[s2], Range[l2]}][[
           Min[l2, Max[1, # - r]] ;; Min[l2, # + r]]]] & /@ Range[l1],
        1], ((First[#1] == First[#2]) || (Last[#1] == Last[#2])) &];
    m = Length[l3];
    If[m > 0,
     t = (m - 
         Count[{1, -1} . 
           MapAt[First, 
            SortBy[#, Last] & /@ Transpose[l3], {{1, All}, {2, All}}],
           0])/2;
     N[(m/l1 + m/l2 + (m - t)/m)/3, prec], (* else *)
     0 
     ], (* else *)
    0
    ]
   ] /; StringLength[s1] > 0 && StringLength[s2] > 0 && prec > 1

which has still t sometimes off by 1/2 with respect to the textdistance.jaro. The JaroDistance[] is spurious (this was made with Mathematica 10.3). Taking the drudgery to create a table

In[204]:=  Grid[{{Item["s1"], Item["s2"], Item["td.jaro"], Item["jaro"], 
       Item["Defect"], Item["JaroDistance"]},
      {Item["Miss Australia"], Item["Miss Brasilia"],
       Item[0.8166833166833167],
       Item[jaro["Miss Australia", "Miss Brasilia"], Background -> Pink],
       Item["t+1/2"],
       Item[JaroDistance["Miss Australia", "Miss Brasilia", 
         IgnoreCase -> False], Background -> Pink]
       },
      {Item["DWAYNE"], Item["DUANE"],
       Item[0.8222222222222223],
       Item[jaro["DWAYNE", "DUANE"]],
       Item["-"],
       Item[JaroDistance["DWAYNE", "DUANE", IgnoreCase -> False], 
        Background -> Pink]
       },
      {Item["MARTHA"], Item["MARHTA"],
       Item[0.9444444444444445],
       Item[jaro["MARTHA", "MARHTA"]],
       Item["-"],
       Item[JaroDistance["MARTHA", "MARHTA", IgnoreCase -> False], 
        Background -> Pink]
       },
      {Item["DIXON"], Item["DICKSONX"],
       Item[0.7666666666666666],
       Item[jaro["DIXON", "DICKSONX"]],
       Item["-"],
       Item[JaroDistance["DIXON", "DICKSONX", IgnoreCase -> False], 
        Background -> Pink]
       },
      {Item["JELLYFISH"], Item["SMELLYFISH"],
       Item[0.8962962962962964],
       Item[jaro["JELLYFISH", "SMELLYFISH"]],
       Item["-"],
       Item[JaroDistance["JELLYFISH", "SMELLYFISH", IgnoreCase -> False], 
        Background -> Pink]
       },
      {Item["Miss Mexiko"], Item["Miss Belize"],
       Item[0.7575757575757575],
       Item[jaro["Miss Mexiko", "Miss Belize"]],
       Item["-"],
       Item[JaroDistance["Miss Mexiko", "Miss Belize", 
         IgnoreCase -> False], Background -> Pink]
       },
      {Item["0100010100101001001001001010010"], 
       Item["10000100100111101010101010101010"],
       Item[0.8308371735791091],
       Item[jaro["0100010100101001001001001010010", 
         "10000100100111101010101010101010"]],
       Item["-"],
       Item[JaroDistance["0100010100101001001001001010010", 
         "10000100100111101010101010101010", IgnoreCase -> False]]
       },
      {Item["aasdjkdashdahsgdashdgasj"], Item["asdjkdashdahsgdashdgasj"],
       Item[0.841183574879227],
       Item[jaro["aasdjkdashdahsgdashdgasj", "asdjkdashdahsgdashdgasj"]],
       Item["-"],
       Item[JaroDistance["aasdjkdashdahsgdashdgasj", 
         "asdjkdashdahsgdashdgasj", IgnoreCase -> False], 
        Background -> Pink]
       },
      {Item["abdegopq"], Item["cfhijklmnrstuvwxyz"],
       Item[0.0],
       Item[jaro["abdegopq", "cfhijklmnrstuvwxyz"]],
       Item["-"],
       Item[JaroDistance["abdegopq", "cfhijklmnrstuvwxyz", 
         IgnoreCase -> False]]
       },
      {Item["Mary has a little lamb"], 
       Item["and Meghan has the redhead Harry"],
       Item[0.5631555944055945],
       Item[jaro["Mary has a little lamb", 
         "and Meghan has the redhead Harry"], Background -> Pink],
       Item["t+1/2"],
       Item[JaroDistance["Mary has a little lamb", 
         "and Meghan has the redhead Harry"]]
       },
      {Item["Take[list,-n] gives the last n elements of list."], 
       Item["Take[list,{m,n}] gives elements m through n of list."],
       Item[0.791056166056166],
       Item[jaro["Take[list,-n] gives the last n elements of list.", 
         "Take[list,{m,n}] gives elements m through n of list."], 
        Background -> Pink],
       Item["t+1/2"],
       Item[JaroDistance[
         "Take[list,-n] gives the last n elements of list.", 
         "Take[list,{m,n}] gives elements m through n of list."]]
       }
      }, Background -> {None, {Cyan}}, Frame -> All]

one gets

enter image description here

and I've to confess in complete humbleness that I'm unable to get jaro[] right using the given descriptions what it is meant to be. Please read the cited PDF and tell me.

[1]:William E. Winkler and Yves Thibaudeau, An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial

POSTED BY: Udo Krause

Let's again consider the "Miss Brasilia", "Miss Australia" case:

Clear[l1, l2, x1, x2, r, pinkify]
l1 = "Miss Australia";
l2 = "Miss Brasilia";
x1 = ToCharacterCode[l1];
x2 = ToCharacterCode[l2];
pinkify[m_] := Block[{m1 = m, m2 = m, p = Position[m, 1], o, po},
   For[o = 1, o <= Length[p], o++,
    po = p[[o]];
    If[Position[m1[[1 ;; po[[1]], po[[2]]]], 1, {1}, 1][[1, 1]] == 
       po[[1]]
      && Position[m1[[po[[1]], 1 ;; po[[2]]]], 1, {1}, 1][[1, 1]] == 
       po[[2]],
     m2[[po[[1]], po[[2]]]] = 
      Item[m[[po[[1]], po[[2]]]], Background -> Green], (* else *)
     m2[[po[[1]], po[[2]]]] = 
      Item[m[[po[[1]], po[[2]]]], Background -> Pink];
     m1[[po[[1]], po[[2]]]] = 0
     ]
    ];
   m2
   ] /; MatrixQ[m]

With[{r = Floor[Max[StringLength[l1], StringLength[l2]]/2] - 1},
 Grid[pinkify[Normal[
    SparseArray[
     Join[
      Table[Rule[{1, 1 + o}, Characters[l1][[o]]], {o, 
        StringLength[l1]}],
      Table[
       Rule[{1 + o, 1}, Characters[l2][[o]]], {o, StringLength[l2]}],
      Flatten[
       Table[Rule[{1 + o2, 1 + o1}, 
         If[x1[[o1]] == x2[[o2]] && r >= Abs[o1 - o2], 1, 0]], {o2, 
         StringLength[l2]}, {o1, StringLength[l1]}], 1]
      ]
     ]
    ]
   ],
  Frame -> All, ItemSize -> {3/2, 4/3}, Alignment -> Center
  ]
 ]

resulting in

enter image description here

the pink 1do not go into the count, because they are not the rolling first ones in both directions seen from the beginning. One is left with m = 11, 6 perfect matches and 5 mismatches. This would count for a 2 t = 5 = 11 - 6, whereas the python implementation has 2 t = 4. This raises the idea, that not all mismatches must be count, but the existence of some permutation of more than 3 characters has to be realized, making the implemention more complex, as follows:

Clear[jaro]
jaro[s1_String, s2_String, prec_: $MachinePrecision] := 
 Block[{r, l1 = StringLength[s1], l2 = StringLength[s2], l3, m, t},
   r = Floor[Max[l1, l2]/2] - 1;
   If[r >= 0,(* then *)
    l3 = DeleteDuplicates[
      Flatten[check[{Transpose[{ToCharacterCode[s1], 
              Range[l1]}][[#]]}, 
          Transpose[{ToCharacterCode[s2], Range[l2]}][[
           Min[l2, Max[1, # - r]] ;; Min[l2, # + r]]]] & /@ Range[l1],
        1], ((First[#1] == First[#2]) || (Last[#1] == Last[#2])) &];
    m = Length[l3];
    If[m > 0,
     t = share[
        MapAt[First, 
         SortBy[#, Last] & /@ Transpose[l3], {{1, All}, {2, All}}]]/2;
     N[(m/l1 + m/l2 + (m - t)/m)/3, prec], (* else *)
     0 
     ], (* else *)
    0
    ]
   ] /; StringLength[s1] > 0 && StringLength[s2] > 0 && prec > 1

with the function share[], taking care for correspondances between the both strings of matching characters as well as for a rolling remapping for partnerless characters:

Clear[share]
share::insane = "x `1` as `2` is not bijective above 0";
share[l1_List, l2_List] := 
 Block[{m = Length[l1], t = 0, x1, x2, o1, x3, o2, o3, z, seqL = 0, r,
    r0, recycleQ, o4},
  (* search for maximal connected sets of equal characters and denote \
them # - 1 as t *)
  x1 = ConstantArray[0, m];
  x2 = ConstantArray[0, m];
  For[o1 = 1, o1 <= m, o1++,
   x3 = Position[x2, 0, {1}, 1];
   If[TensorRank[x3] == 2,
    For[o2 = x3[[1, 1]], o2 <= m, o2++,
     If[l2[[o2]] == l1[[o1]] && x2[[o2]] == 0,
      If[o1 == o2,
       x1[[o1]] = -1;
       x2[[o2]] = -1, (* else *)
       If[l1[[o2]] == l2[[o2]],
        x1[[o2]] = -1;
        x2[[o2]] = -1, (* else *)
        If[l1[[o1]] == l2[[o1]],
         x1[[o1]] = -1;
         x2[[o1]] = -1, (* else *)
         x2[[o2]] = o1;
         x1[[o1]] = o2;
         Break[]
         ]
        ]
       ]
      ]
     ]
    ]
   ];
  For[o1 = 1, o1 <= m, o1++,
   If[x1[[o1]] > 0,
    If[x2[[o1]] > 0,
     seqL += 1, (* else *)
     z = l1[[o1]];
     o2 = o1 + 1;
     r = -1; recycleQ = True;
     While[r != 0,
      o4 = Position[x1[[o2 ;;]], -1, {1}, 1];
      o4 = If[TensorRank[o4] == 2, o4[[1, 1]] - 1, m];
      If[FreeQ[l1[[o2 ;; o4]], z],
       If[x1[[o2 - 1]] > o2 - 1, x2[[x1[[o2 - 1]]]] = 0];
       r = 0,(* else *)
       o3 = o2 - 1 + First[First[Position[l1[[o2 ;; o4]], z, {1}, 1]]];
       r = x1[[o3]];
       If[recycleQ,
        x1[[o3]] = x1[[o2 - 1]];
        x2[[x1[[o2 - 1]]]] = o3;
        x1[[o2 - 1]] = 0;
        recycleQ = False, (* else *)
        x1[[o3]] = r0;
        x2[[r0]] = o3
        ];
       r0 = r;
       o2 = Min[o3 + 1, m]
       ]
      ];
     If[insaneQ[x1], Message[share::insane, 1, x1]];
     If[insaneQ[x2], Message[share::insane, 2, x2]];
     t += (detect[x1[[o1 - seqL ;; o1 - 1]], 
         x2[[o1 - seqL ;; o1 - 1]], o1 - seqL, o1 - 1] + 
        If[seqL > 0, 1, 0]);
     seqL = 0
     ], (* else *)
    If[x2[[o1]] > 0,
     z = l2[[o1]];
     o2 = o1 + 1; 
     r = -1; recycleQ = True;
     While[r != 0,
      o4 = Position[x2[[o2 ;;]], -1, {1}, 1];
      o4 = If[TensorRank[o4] == 2, m, o4[[1, 1]] - 1, m];
      If[FreeQ[l2[[o2 ;; o4]], z],
       If[x2[[o2 - 1]] > o2 - 1, x1[[x2[[o2 - 1]]]] = 0];
       r = 0, (* else *)
       o3 = o2 - 1 + First[First[Position[l2[[o2 ;; o4]], z, {1}, 1]]];
       r = x2[[o3]];
       If[recycleQ,
        x2[[o3]] = x2[[o2 - 1]];
        x1[[x2[[o2 - 1]]]] = o3;
        x2[[o2 - 1]] = 0;
        recycleQ = False,  (* else *)
        x2[[o3]] = r0;
        x1[[r0]] = o3;
        ];
       r0 = r;
       o2 = Min[o3 + 1, m]
       ]
      ];
     If[insaneQ[x1], Message[share::insane, 1, x1]];
     If[insaneQ[x2], Message[share::insane, 2, x2]];
     t += (detect[x1[[o1 - seqL ;; o1 - 1]], 
         x2[[o1 - seqL ;; o1 - 1]], o1 - seqL, o1 - 1] + 
        If[seqL > 0, 1, 0]);
     seqL = 0, (* else: 
     x1\[LeftDoubleBracket]o1\[RightDoubleBracket] \[LessEqual] 0, 
     x2\[LeftDoubleBracket]o1\[RightDoubleBracket] \[LessEqual] 0 *)
     t += (detect[x1[[o1 - seqL ;; o1 - 1]], 
         x2[[o1 - seqL ;; o1 - 1]], o1 - seqL, o1 - 1] + 
        If[seqL > 0, 1, 0]);
     seqL = 0
     ]
    ]
   ];
  t + If[seqL > 0,
    detect[x1[[o1 - seqL ;; o1 - 1]], x2[[o1 - seqL ;; o1 - 1]], 
     o1 - seqL, o1 - 1],
    0] - Count[x1, -1]
  ]
share[{l1_List, l2_List}] := 
 share[l1, l2] /; 
  Length[l1] == Length[l2] && Length[l1] > 0 && VectorQ[l1, NumberQ] &&
    VectorQ[l2, NumberQ]
share[{l1_List, l2_List}] := 0 /; Length[l1] == 0 && Length[l2] == 0

share[] itself depends on dectect[]:

Clear[insaneQ, check, detectI, detect]
insaneQ[l1_List?VectorQ] := Block[{l = Fold[DeleteCases, l1, {-1, 0}]},
  Length[DeleteDuplicates[l]] != Length[l]
  ]
check[l1_List, l2_List] := 
 Flatten[Outer[List, l1, Select[l2, #[[1]] == l1[[1, 1]] &], 1], 1]
detectI[l1_List, l2_List, o1_Integer, o2_Integer] := 
 Block[{m = Length[l1], permQ = True, o3},
  If[o1 < o2,
   (* Is l2 a permutation of l1 in Interval[{o1,o2}]? *)
   For[o3 = 1, o3 <= m, o3++,
    If[permQ,
     permQ = (o1 <= l1[[o3]] <= o2) && (o1 <= l2[[o3]] <= o2)
     ]
    ];
   If[m > 2 && permQ,(* "jaro" permutation exists *)
    o2 - o1,(* else *)
    o2 - o1 + 1
    ], (* else *)
   1
   ]
  ]
detect[l1_List, l2_List, o1_Integer, o2_Integer] := 
 Block[{m = Length[l1], x = 0, o3},
  If[o1 < o2,
   (* Contain l1,l2 a permutation of more than 2 characters? *)
   For[o3 = 1, o3 <= m, o3++,
    x += If[(o1 <= l1[[o3]] <= o2) && (o1 <= l2[[o3]] <= o2), 1, 0]
    ];
   If[x > 2,(* "jaro" permutation exists *)
    o2 - o1,(* else *)
    o2 - o1 + 1
    ], (* else *)
   1
   ]
  ]

if one would use detectI[] one would check for a complete permutation between matches. detect[] looks only for some (> 2) characters in both strings.

With this in place look into test cases:

enter image description here

The first fail is bizarre, because it is a permutation, but other implementations have t bigger. The second fail could be fixed if one would use detectI[] instead of detect[], but then elsewhere fails show up.

For your convenience the notebook is in the appendix.

POSTED BY: Udo Krause

If one sets out to admit only in-place permutations (i.e. the range of permutating characters in string s1 is the same range as in string s2) to lower the 2 t by 1 (if there are more ranges of that type in the arguments given to detect[], 1 is subtracted only once), then the previous function detect[] did not do that. Instead one should use this one:

detect[l1_List, l2_List, o1_Integer, o2_Integer] := 
 Block[{m = Length[l1], permQ = False, x = 0, o3, o41, o42},
  If[o1 < o2,
   (* Contain l1,
   l2 a permutation of more than 2 characters in place? *)
   For[o3 = 1, o3 <= m, o3++,
    If[(o1 <= l1[[o3]] <= o2) && (o1 <= l2[[o3]] <= o2),
     If[x == 0,
      o41 = o3;
      o42 = o3;
      x = 1, (* else *)
      If[o3 == o42 + 1,
       o42 = o3;
       x += 1, (* else *)
       x = 0
       ]
      ], (* else *)
     If[x > 2 && Complement[l1[[o41 ;; o42]], l2[[o41 ;; o42]]] == {},
      permQ = True;
      Break[], (* else *)
      x = 0
      ]
     ]
    ];
   If[permQ,(* "jaro" permutation exists *)
    o2 - o1,(* else *)
    If[x > 2 && Complement[l1[[o41 ;; o42]], l2[[o41 ;; o42]]] == {},
     o2 - o1, (* else *)
     o2 - o1 + 1
     ]
    ], (* else *)
   1
   ]
  ]

with it the test cases are now:

enter image description here

Case 11 has no in-place permutation, therefore it does not lower 2 t and has 2 t one higher than textdistance.jaro. On the other hand, case 8 has an in-place permutation, clearly seen from this illustration

enter image description here

and again - but into the other direction - therefore case 8 lowers 2 t by 1 and has 2 t one lower than textdistance.jaro. As said before, that in-place permutation regulation is needed to bring the "Miss Australia" vs. "Miss Brasilia" example to the same value textdistance.jaro has.

POSTED BY: Udo Krause
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract