How to match strings with variable diacritics, punctuation, and spelling?

Posted 10 months ago

I have long lists of strings (names of figs) from different sources, each with accompanying attributes. I'd like to find matches among the names so I can pool attributes. A few entries from one list look like:

{"Hative d\[CloseCurlyQuote]Argenteuil","Algerian (Watts)","White Marseilles"}

while another has entries like

{"Hâtive d'Argenteuil","Algerian Watts","White Marseillaise"}

I found that StringMatchQ[] does not support IgnoreDiacritics, but pre-processing with RemoveDiacritics[] works well. But how do I ignore or remove punctuation? Notice that different marks are used for the apostrophe. And then there is the issue of alternate spellings ... I suppose with EditDistance[] I can generate lists of possible matches to review by hand. Maybe I'm trying to reinvent someone's wheel?

Thank you in advance :)

2 Replies

For text cleanup, I found that my old friend Regex was needed to clear punctuation. Also, I found occurrences of prefixed and postfixed adjectives (e.g., colors) in the names so I ended up sorting the words within a name:

testCase = "(Green)\nd'Argentéuil";
(* strip diacritics, delete punctuation, split on whitespace, sort the words, rejoin with spaces *)
StringRiffle[
  Sort[
   StringSplit[
    StringDelete[RemoveDiacritics[testCase], 
     RegularExpression["[[:punct:]]"]]]], " "] // InputForm
"dArgenteuil Green"
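
One gap in the cleanup above is the curly apostrophe from the original lists: depending on how the regex engine treats non-ASCII characters, `[[:punct:]]` may not match `\[CloseCurlyQuote]`. Here is a minimal sketch that maps curly quotes to the ASCII apostrophe before deleting punctuation, and also lowercases the name. The helper name `normalizeName` and the lowercasing step are my own assumptions for illustration, not something from the original post.

```wolfram
(* normalizeName is a hypothetical helper, not from the original post *)
normalizeName[s_String] := StringRiffle[
  Sort[
   StringSplit[
    StringDelete[
     (* map curly quotes to the plain apostrophe so the regex below removes them *)
     StringReplace[
      ToLowerCase[RemoveDiacritics[s]],
      {"\[CloseCurlyQuote]" -> "'", "\[OpenCurlyQuote]" -> "'"}],
     RegularExpression["[[:punct:]]"]]]],
  " "]

(* the sample entries from the question *)
listA = {"Hative d\[CloseCurlyQuote]Argenteuil", "Algerian (Watts)", "White Marseilles"};
listB = {"Hâtive d'Argenteuil", "Algerian Watts", "White Marseillaise"};

(* names whose normalized forms agree exactly *)
Intersection[normalizeName /@ listA, normalizeName /@ listB]
```

With the sample entries from the question, the names that differ only in diacritics and punctuation should normalize to the same string ("algerian watts" and "dargenteuil hative"), which leaves the Marseilles / Marseillaise pair for fuzzy matching.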

For the harder case of mixed spellings, both EditDistance and WordStem have merits:

mixedSpellingCase = {{"Bourjassotte", "Bourjasotte"}, {"Marseilles", 
    "Marseillaise"}};
Print[EditDistance[#[[1]], #[[2]]] & /@ mixedSpellingCase];
Print[WordStem[mixedSpellingCase] // InputForm];
{1,3}
{{"Bourjassott", "Bourjasott"}, {"Marseil", "Marseillais"}}
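
Since the stems in both pairs still differ while the edit distances are small, one way to build a list for review by hand is to keep every cross-list pair whose edit distance is at or below a threshold. The sketch below assumes a cutoff of 3 (chosen only because it covers both examples above) and uses the hypothetical names `namesA`, `namesB`, and `maxDistance`. Comparing every pair is quadratic in the list lengths, so for very long lists something like Nearest may be a better fit.

```wolfram
(* hypothetical sample lists built from the spellings above *)
namesA = {"Bourjassotte", "Marseilles"};
namesB = {"Bourjasotte", "Marseillaise"};

(* assumed cutoff; tune it against real data *)
maxDistance = 3;

(* keep every {a, b} pair whose edit distance is within the cutoff *)
candidates = Select[
  Tuples[{namesA, namesB}],
  EditDistance[First[#], Last[#], IgnoreCase -> True] <= maxDistance &]
```

For these four names only the two intended pairs fall within the cutoff; on real data a cutoff this loose will probably also pull in false positives among short names, so the result is a starting point for review rather than a final match.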
Posted 10 months ago

Richard,

TextWords and WordStem might help.
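
A rough sketch of how those two functions might be combined: split each name into words with TextWords, stem each word, and compare the sorted stems. The helper name `stemKey` is hypothetical, and as the WordStem output earlier in the thread shows, stems of variant spellings can still differ, so this is likely only a partial aid.

```wolfram
(* stemKey is a hypothetical helper: sorted, lowercased word stems of a name *)
stemKey[s_String] := Sort[
  WordStem /@ ToLowerCase /@ TextWords[RemoveDiacritics[s]]]

(* compare the keys of two variants of the same name *)
stemKey /@ {"Algerian (Watts)", "Algerian Watts"}
```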
