How to match strings with variable diacritics, punctuation, and spelling?

Posted 10 months ago

I have long lists of strings (names of figs) from different sources, each with accompanying attributes. I'd like to find matches among the names so I can pool the attributes. A few entries from one list look like:

{"Hative d\[CloseCurlyQuote]Argenteuil","Algerian (Watts)","White Marseilles"}

while another has entries like

{"Hâtive d'Argenteuil","Algerian Watts","White Marseillaise"}

I found that StringMatchQ[] does not support IgnoreDiacritics, but pre-processing with RemoveDiacritics[] works well. But how can I ignore or remove punctuation? Note that the two lists use different marks for the apostrophe (a curly quote in one, a straight quote in the other). Then there is the issue of alternate spellings: I suppose I could use EditDistance[] to generate lists of possible matches to review by hand. Maybe I'm trying to reinvent someone's wheel?

Thank you in advance :)

2 Replies
Posted 10 months ago

Richard,

TextWords and WordStem might help.

For text cleanup, I found that my old friend Regex was needed to clear punctuation. Also, I found occurrences of prefixed and postfixed adjectives (e.g., colors) in the names so I ended up sorting the words within a name:

testCase = "(Green)\nd'Argentéuil";
StringJoin[
  Riffle[Sort[
    StringSplit[
     StringDelete[RemoveDiacritics[testCase], 
      RegularExpression["[[:punct:]]"]]]], " "]] // InputForm
"dArgenteuil Green"

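The cleanup steps above could be wrapped into a reusable key function along these lines. This is only a sketch: the name normalizeName is made up for illustration, and it swaps the regex for Except[WordCharacter] so that curly quotes are dropped along with ASCII punctuation. It also lower-cases the text, which the code above does not do.

```wolfram
(* Hypothetical helper, not from the original post: build a comparison key. *)
(* Steps: strip diacritics, lower-case, turn runs of whitespace into one space, *)
(* delete every remaining character that is not a letter or digit, then sort the words. *)
normalizeName[name_String] := StringRiffle[
  Sort[
   StringSplit[
    StringReplace[
     ToLowerCase[RemoveDiacritics[name]],
     {WhitespaceCharacter .. -> " ", Except[WordCharacter] -> ""}]]],
  " "]

(* The two apostrophe variants from the question should yield the same key. *)
normalizeName["Hative d\[CloseCurlyQuote]Argenteuil"] ===
 normalizeName["Hâtive d'Argenteuil"]
```

With keys like these, "Algerian (Watts)" and "Algerian Watts" both reduce to "algerian watts", so exact matching on the key handles the punctuation and diacritic cases. Spelling variants such as Marseilles and Marseillaise still produce different keys and need a fuzzy comparison.
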
For the harder case of mixed spellings, both EditDistance and WordStem have merits:

mixedSpellingCase = {{"Bourjassotte", "Bourjasotte"}, {"Marseilles", 
    "Marseillaise"}};
Print[EditDistance[#[[1]], #[[2]]] & /@ mixedSpellingCase];
Print[WordStem[mixedSpellingCase] // InputForm];
{1,3}
{{"Bourjassott", "Bourjasott"}, {"Marseil", "Marseillais"}}
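
To turn the edit distances into a shortlist for hand review, one possible sketch is below. The helper names clean and candidateMatches and the threshold of 3 are assumptions chosen for illustration, not anything built in; the threshold would need tuning on the real lists.

```wolfram
(* Sample names taken from the question. *)
list1 = {"Hative d\[CloseCurlyQuote]Argenteuil", "Algerian (Watts)", "White Marseilles"};
list2 = {"Hâtive d'Argenteuil", "Algerian Watts", "White Marseillaise"};

(* Hypothetical helper: light cleanup before measuring distance. *)
clean[s_String] := ToLowerCase[RemoveDiacritics[s]]

(* Hypothetical helper: every entry of pool within maxDist edits of name, *)
(* returned as {candidate, distance} pairs, closest first. *)
candidateMatches[name_String, pool_List, maxDist_Integer : 3] :=
 SortBy[
  Select[
   {#, EditDistance[clean[name], clean[#]]} & /@ pool,
   Last[#] <= maxDist &],
  Last]

(* Shortlist of candidates in list2 for each name in list1. *)
AssociationMap[candidateMatches[#, list2] &, list1]
```

On these three sample names, a threshold of 3 should pair each entry with its counterpart in the other list and nothing else. On long lists a fixed threshold will admit false positives for short names, so scaling it by string length may be worth trying.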