Group Abstract Group Abstract

Message Boards Message Boards

0
|
13.8K Views
|
11 Replies
|
3 Total Likes
View groups...
Share
Share this post:

Program the distance Jaro?

Posted 7 years ago

Hello everyone, I'm trying to program Jaro distance as requested by this page, I've done the following code that works well for the next two pairs ("MARTHA", "MARHTA") and ("DIXON", "DICKSONX") but when I try with ("JELLYFISH", "SMELLYFISH") I get an error because the code counts the double S of "SMELLYFISH", due to this error I have not been able to finish successfully, here is what I have programmed up to this moment:

uno = "DIXON"; dos ="DICKSONX" ;
rep = Characters[uno] \[Intersection] Characters[dos]
scope = Max[StringLength[uno], StringLength[dos]]/2 - 1
inter = Transpose[{Flatten[Position[Characters[uno], #] & /@ rep], 
   Flatten[Position[Characters[dos], #] & /@ rep]}]

m = Select[inter, Abs[#[[1]] - #[[2]]] < scope &]

prb = Select[m, #[[1]] != #[[2]] &]

trans = Length[DeleteCases[Position[prb, Reverse[#]] & /@ prb, {}]]/2

1/3 (Length[m]/StringLength[uno] + Length[m]/StringLength[dos] + (
   Length[m] - trans)/Length[m])

Someone who can help me solve this problem? Maybe the approach I'm using is wrong, unfortunately I have not been able to find a way to solve the problem with my code, I hope that someone can please guide me to achieve my goal, any help is welcome, thank you in advance for your help

POSTED BY: Luis Ledesma
11 Replies

If one sets out to admit only in-place permutations (i.e. the range of permutating characters in string s1 is the same range as in string s2) to lower the 2 t by 1 (if there are more ranges of that type in the arguments given to detect[], 1 is subtracted only once), then the previous function detect[] did not do that. Instead one should use this one:

detect[l1_List, l2_List, o1_Integer, o2_Integer] := 
 Block[{m = Length[l1], permQ = False, x = 0, o3, o41, o42},
  If[o1 < o2,
   (* Contain l1,
   l2 a permutation of more than 2 characters in place? *)
   For[o3 = 1, o3 <= m, o3++,
    If[(o1 <= l1[[o3]] <= o2) && (o1 <= l2[[o3]] <= o2),
     If[x == 0,
      o41 = o3;
      o42 = o3;
      x = 1, (* else *)
      If[o3 == o42 + 1,
       o42 = o3;
       x += 1, (* else *)
       x = 0
       ]
      ], (* else *)
     If[x > 2 && Complement[l1[[o41 ;; o42]], l2[[o41 ;; o42]]] == {},
      permQ = True;
      Break[], (* else *)
      x = 0
      ]
     ]
    ];
   If[permQ,(* "jaro" permutation exists *)
    o2 - o1,(* else *)
    If[x > 2 && Complement[l1[[o41 ;; o42]], l2[[o41 ;; o42]]] == {},
     o2 - o1, (* else *)
     o2 - o1 + 1
     ]
    ], (* else *)
   1
   ]
  ]

with it the test cases are now:

enter image description here

Case 11 has no in-place permutation, therefore it does not lower 2 t and has 2 t one higher than textdistance.jaro. On the other hand, case 8 has an in-place permutation, clearly seen from this illustration

enter image description here

and again - but into the other direction - therefore case 8 lowers 2 t by 1 and has 2 t one lower than textdistance.jaro. As said before, that in-place permutation regulation is needed to bring the "Miss Australia" vs. "Miss Brasilia" example to the same value textdistance.jaro has.

POSTED BY: Udo Krause
POSTED BY: Udo Krause

For the sake of completeness, I found a NIST description of the Jaro distance containing

William E. Winkler and Yves Thibaudeau, An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census, Statistical Research Report Series RR91/09, U.S. Bureau of the Census, Washington, D.C., 1991. The abstract (HTML) and full paper (PDF).

and there it is said on p. 12

Two characters are considered in common only if they are no further apart than (m/2 - 1) where m = max(d,r). Characters in common from two strings are assigned; remaining characters unassigned. Each string has the same number of assigned characters.

That tells you to treat l3 as a whole and suggests the following implementation

Needs["Experimental`"]

Clear[check]
check[l1_List, l2_List] := Flatten[Outer[List, l1, Select[l2, #[[1]] == l1[[1, 1]] &], 1], 1]

Clear[jaro]
jaro[s1_String, s2_String, prec_: $MachinePrecision] := 
 Block[{r, l1 = StringLength[s1], l2 = StringLength[s2], l3, m, t},
   r = Floor[Max[l1, l2]/2] - 1;
   If[r >= 0,(* then *)
    l3 = DeleteDuplicates[
      Flatten[check[{Transpose[{ToCharacterCode[s1], 
              Range[l1]}][[#]]}, 
          Transpose[{ToCharacterCode[s2], Range[l2]}][[
           Min[l2, Max[1, # - r]] ;; Min[l2, # + r]]]] & /@ Range[l1],
        1], ((First[#1] == First[#2]) || (Last[#1] == Last[#2])) &];
    m = Length[l3];
    If[m > 0,
     t = (m - 
         Count[{1, -1} . 
           MapAt[First, 
            SortBy[#, Last] & /@ Transpose[l3], {{1, All}, {2, All}}],
           0])/2;
     N[(m/l1 + m/l2 + (m - t)/m)/3, prec], (* else *)
     0 
     ], (* else *)
    0
    ]
   ] /; StringLength[s1] > 0 && StringLength[s2] > 0 && prec > 1

which has still t sometimes off by 1/2 with respect to the textdistance.jaro. The JaroDistance[] is spurious (this was made with Mathematica 10.3). Taking the drudgery to create a table

In[204]:=  Grid[{{Item["s1"], Item["s2"], Item["td.jaro"], Item["jaro"], 
       Item["Defect"], Item["JaroDistance"]},
      {Item["Miss Australia"], Item["Miss Brasilia"],
       Item[0.8166833166833167],
       Item[jaro["Miss Australia", "Miss Brasilia"], Background -> Pink],
       Item["t+1/2"],
       Item[JaroDistance["Miss Australia", "Miss Brasilia", 
         IgnoreCase -> False], Background -> Pink]
       },
      {Item["DWAYNE"], Item["DUANE"],
       Item[0.8222222222222223],
       Item[jaro["DWAYNE", "DUANE"]],
       Item["-"],
       Item[JaroDistance["DWAYNE", "DUANE", IgnoreCase -> False], 
        Background -> Pink]
       },
      {Item["MARTHA"], Item["MARHTA"],
       Item[0.9444444444444445],
       Item[jaro["MARTHA", "MARHTA"]],
       Item["-"],
       Item[JaroDistance["MARTHA", "MARHTA", IgnoreCase -> False], 
        Background -> Pink]
       },
      {Item["DIXON"], Item["DICKSONX"],
       Item[0.7666666666666666],
       Item[jaro["DIXON", "DICKSONX"]],
       Item["-"],
       Item[JaroDistance["DIXON", "DICKSONX", IgnoreCase -> False], 
        Background -> Pink]
       },
      {Item["JELLYFISH"], Item["SMELLYFISH"],
       Item[0.8962962962962964],
       Item[jaro["JELLYFISH", "SMELLYFISH"]],
       Item["-"],
       Item[JaroDistance["JELLYFISH", "SMELLYFISH", IgnoreCase -> False], 
        Background -> Pink]
       },
      {Item["Miss Mexiko"], Item["Miss Belize"],
       Item[0.7575757575757575],
       Item[jaro["Miss Mexiko", "Miss Belize"]],
       Item["-"],
       Item[JaroDistance["Miss Mexiko", "Miss Belize", 
         IgnoreCase -> False], Background -> Pink]
       },
      {Item["0100010100101001001001001010010"], 
       Item["10000100100111101010101010101010"],
       Item[0.8308371735791091],
       Item[jaro["0100010100101001001001001010010", 
         "10000100100111101010101010101010"]],
       Item["-"],
       Item[JaroDistance["0100010100101001001001001010010", 
         "10000100100111101010101010101010", IgnoreCase -> False]]
       },
      {Item["aasdjkdashdahsgdashdgasj"], Item["asdjkdashdahsgdashdgasj"],
       Item[0.841183574879227],
       Item[jaro["aasdjkdashdahsgdashdgasj", "asdjkdashdahsgdashdgasj"]],
       Item["-"],
       Item[JaroDistance["aasdjkdashdahsgdashdgasj", 
         "asdjkdashdahsgdashdgasj", IgnoreCase -> False], 
        Background -> Pink]
       },
      {Item["abdegopq"], Item["cfhijklmnrstuvwxyz"],
       Item[0.0],
       Item[jaro["abdegopq", "cfhijklmnrstuvwxyz"]],
       Item["-"],
       Item[JaroDistance["abdegopq", "cfhijklmnrstuvwxyz", 
         IgnoreCase -> False]]
       },
      {Item["Mary has a little lamb"], 
       Item["and Meghan has the redhead Harry"],
       Item[0.5631555944055945],
       Item[jaro["Mary has a little lamb", 
         "and Meghan has the redhead Harry"], Background -> Pink],
       Item["t+1/2"],
       Item[JaroDistance["Mary has a little lamb", 
         "and Meghan has the redhead Harry"]]
       },
      {Item["Take[list,-n] gives the last n elements of list."], 
       Item["Take[list,{m,n}] gives elements m through n of list."],
       Item[0.791056166056166],
       Item[jaro["Take[list,-n] gives the last n elements of list.", 
         "Take[list,{m,n}] gives elements m through n of list."], 
        Background -> Pink],
       Item["t+1/2"],
       Item[JaroDistance[
         "Take[list,-n] gives the last n elements of list.", 
         "Take[list,{m,n}] gives elements m through n of list."]]
       }
      }, Background -> {None, {Cyan}}, Frame -> All]

one gets

enter image description here

and I've to confess in complete humbleness that I'm unable to get jaro[] right using the given descriptions what it is meant to be. Please read the cited PDF and tell me.

[1]:William E. Winkler and Yves Thibaudeau, An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial

POSTED BY: Udo Krause
Posted 7 years ago
POSTED BY: Luis Ledesma
POSTED BY: Udo Krause
Posted 7 years ago
POSTED BY: Luis Ledesma
POSTED BY: Udo Krause

The Outer[] has been banned down the call tree

Clear[check, jaro]
check[l1_List, l2_List] := 
 Flatten[Outer[List, l1, Select[l2, #[[1]] == l1[[1, 1]] &], 1], 1]
jaro[s1_String, s2_String] := 
 Block[{r, l1 = StringLength[s1], l2 = StringLength[s2], w1, w2, l3, m, l4, l5, t},
   r = Floor[Max[l1, l2]/2] - 1;
   If[r >= 0,(* then *)
    w1 = Transpose[{ToCharacterCode[s1], Range[l1]}];
    w2 = Transpose[{ToCharacterCode[s2], Range[l2]}];
    l3 = Flatten[check[{w1[[#]]}, w2[[Min[l2, Max[1, # - r]] ;; Min[l2, # + r]]]] & /@ Range[l1], 1];
    If[Length[l3] > 0,
     {l4, l5} = MapAt[First, SortBy[#, Last] & /@ (Union /@ Transpose[l3]), {{1, All}, {2, All}}];
     m = MinMax[{Length[l4], Length[l5]}];
     t = Count[PadRight[l4, m[[2]], -1] - PadRight[l5, m[[2]], -1], u_ /; u != 0]/2;
     (m[[1]]/l1 + m[[1]]/l2 + (m[[1]] - t)/m[[1]])/3., (* else *)
     0 
     ], (* else *)
    0
    ]
   ] /; StringLength[s1] > 0 && StringLength[s2] > 0

to let out

In[12]:= jaro["Miss Australia", "Miss Brasilia"]
Out[12]= 0.771229

In[6]:= jaro["Miss Brasilia", "Miss Australia"]
Out[6]= 0.771229

In[8]:= jaro["s", "s"]
Out[8]= 0

In[10]:= jaro["Miss", "Miss"]
Out[10]= 1.

In[13]:= jaro @@@ {{"DWAYNE", "DUANE"},
  {"MARTHA", "MARHTA"},
  {"DIXON", "DICKSONX"},
  {"JELLYFISH", "SMELLYFISH"}}
Out[13]= {0.822222, 0.944444, 0.766667, 0.896296}

In[25]:= jaro["Miss Mexiko", "Miss Belize"]
Out[25]= 0.733766

In[26]:= jaro[ "Miss Belize", "Miss Mexiko"]
Out[26]= 0.733766

In[27]:= jaro["0100010100101001001001001010010", \
"10000100100111101010101010101010"]
Out[27]= 0.898185

In[28]:= jaro["Miss Mexikoooooooooooooooooooo", "Miss Belize"]
Out[28]= 0.602146

In[29]:= jaro["abdegopq", "cfhijklmnrstuvwyz"]
Out[29]= 0

In[30]:= jaro["cfhijklmnrstuvwyz", "abdegopq"]
Out[30]= 0

In[33]:= jaro["aasdjkdashdahsgdashdgasj", "asdjkdashdahsgdashdgasj"]
Out[33]= 0.819444

In[34]:= jaro["aasdjkdashdahsgdashdgasj", "aasdjkdashdahsgdashdgasj"]
Out[34]= 1.

In[31]:= jaro["CRATE", "TRACE"]
Out[31]= 0.733333

In[32]:= jaro["Mary has a little lamb", "and Meghan has the redhead Harry"]
Out[32]= 0.465097

to check it against another implementation textdistance has been choosen, it gives for a r = -1 nevertheless 1 and shows the following results

Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import textdistance
>>> textdistance.jaro.distance('Miss', 'Miss')
0
>>> textdistance.jaro('Miss', 'Miss')
1
>>> textdistance.jaro('s', 's')
1
>>> textdistance.jaro('Miss Australia', 'Miss Brasilia')
0.8166833166833167
>>> textdistance.jaro('Miss Brasilia', 'Miss Australia')
0.8166833166833167
>>> textdistance.jaro('DWAYNE', 'DUANE')
0.8222222222222223
>>> textdistance.jaro('MARTHA', 'MARHTA')
0.9444444444444445
>>> textdistance.jaro('DIXON', 'DICKSONX')
0.7666666666666666
>>> textdistance.jaro('JELLYFISH', 'SMELLYFISH')
0.8962962962962964
>>> textdistance.jaro('Miss Mexiko', 'Miss Belize')
0.7575757575757575
>>> textdistance.jaro('Miss Belize', 'Miss Mexiko')
0.7575757575757575
>>> textdistance.jaro('0100010100101001001001001010010', '10000100100111101010101010101010')
0.8308371735791091
>>> textdistance.jaro('Miss Mexikoooooooooooooooooooo', 'Miss Belize')
0.6232323232323232
>>> textdistance.jaro('abdegopq', 'cfhijklmnrstuvwyz')
0.0
>>> textdistance.jaro('cfhijklmnrstuvwyz', 'abdegopq')
0.0
>>> textdistance.jaro('aasdjkdashdahsgdashdgasj', 'asdjkdashdahsgdashdgasj')
0.841183574879227
>>> textdistance.jaro('aasdjkdashdahsgdashdgasj', 'aasdjkdashdahsgdashdgasj')
1
>>> textdistance.jaro('CRATE', 'TRACE')
0.7333333333333334
>>> textdistance.jaro('Mary has a little lamb', 'and Meghan has the redhead Harry')
0.5631555944055945
>>>

most of them disagree with the above jaro[]. Does your implementation match it?

POSTED BY: Udo Krause
POSTED BY: Udo Krause
POSTED BY: Udo Krause

Okay, fix it into

Clear[check, jaro]
check[d_Integer, l1_List, l2_List] := 
  If[l1[[1]] == l2[[1]] && Abs[l1[[2]] - l2[[2]]] <= d,
   {l1, l2}, (* else *) 
   Missing[]
   ];
jaro[s1_String, s2_String] := 
 Block[{r, m = {}, l1 = StringLength[s1], l2 = StringLength[s2], w1, 
    w2, l3, t = 0, l4, l5},
   r = Floor[Max[l1, l2]/2] - 1;
   If[r < 0,
    m = 0, (* else *)
    w1 = Transpose[{ToCharacterCode[s1], Range[l1]}];
    w2 = Transpose[{ToCharacterCode[s2], Range[l2]}];
    l3 = DeleteMissing[
      Flatten[Outer[check[r, #1, #2] &, w1, w2, 1], 1]];
    m = Length[GatherBy[SortBy[l3, First[First[#]] &], First]];
    {l4, l5} = SortBy[#, Last] & /@ (Union /@ Transpose[l3]);
    t = EditDistance[First /@ l4, First /@ l5]/2
    ];
   Print["r = ", r, "| m = ", m, "| t = ", t];
   If[m == 0,
    0, (* else *)
    (m/l1 + m/l2 + (m - t)/m)/3.
    ]
   ] /; StringLength[s1] > 0 && StringLength[s2] > 0

this implementation is still toy because Outer[] goes too far and then ,,,,, using the Levenshtein distance in computing the Jaro distance seems a bit lunatic ... but at least the test cases go through:

In[71]:= jaro["s", "s"]
During evaluation of In[71]:= r = -1| m = 0| t = 0
Out[71]= 0

In[72]:= jaro["Miss", "Miss"]
During evaluation of In[72]:= r = 1| m = 4| t = 0
Out[72]= 1.

In[73]:= jaro @@@ {{"DWAYNE", "DUANE"},
  {"MARTHA", "MARHTA"},
  {"DIXON", "DICKSONX"},
  {"JELLYFISH", "SMELLYFISH"}}
During evaluation of In[73]:= r = 2| m = 4| t = 0
During evaluation of In[73]:= r = 2| m = 6| t = 1
During evaluation of In[73]:= r = 3| m = 4| t = 0
During evaluation of In[73]:= r = 4| m = 8| t = 0
Out[73]= {0.822222, 0.944444, 0.766667, 0.896296}

In[76]:= jaro["Miss Argentina", "Miss Brasilia"]
During evaluation of In[76]:= r = 6| m = 8| t = 3/2
Out[76]= 0.666438

In[77]:= jaro["0100010100101001001001001010010", \
"10000100100111101010101010101010"]
During evaluation of In[77]:= r = 15| m = 31| t = 9/2
Out[77]= 0.941196
POSTED BY: Udo Krause
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard