
How to match strings with variable diacritics, punctuation, and spelling?

Posted 3 years ago
POSTED BY: Richard Frost
2 Replies

For text cleanup, I needed my old friend Regex to strip punctuation. The names also contain adjectives (e.g., colors) that appear sometimes as a prefix and sometimes as a suffix, so I ended up sorting the words within each name:

(* Strip diacritics and punctuation, split into words, then sort the words *)
testCase = "(Green)\nd'Argentéuil";
StringJoin[
  Riffle[Sort[
    StringSplit[
     StringDelete[RemoveDiacritics[testCase], 
      RegularExpression["[[:punct:]]"]]]], " "]] // InputForm
(* Output: *)
"dArgenteuil Green"
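
The same steps could be wrapped in a small helper so that differently formatted names can be compared directly. This is only a sketch: `normalizeName` is a hypothetical name, and `StringRiffle` is swapped in for the `StringJoin`/`Riffle` pair used above.

```wolfram
(* normalizeName is a hypothetical helper, not part of the original post *)
normalizeName[s_String] := StringRiffle[
  Sort[
   StringSplit[
    StringDelete[RemoveDiacritics[s], 
     RegularExpression["[[:punct:]]"]]]], " "]

(* Two renderings of the same name; both should reduce to the same string *)
normalizeName /@ {"(Green)\nd'Argentéuil", "d'Argenteuil Green"}
```

If both entries come back identical, the names can be matched with a plain equality test or grouped with `GatherBy`.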

For the harder case of mixed spellings, both EditDistance and WordStem have merits:

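mixedSpellingCase = {{"Bourjassotte", "Bourjasotte"}, {"Marseilles", 
    "Marseillaise"}};
(* Edit distance between the two spellings in each pair *)
Print[EditDistance[#[[1]], #[[2]]] & /@ mixedSpellingCase];
(* Stem of every word; InputForm keeps the string quotes visible *)
Print[WordStem[mixedSpellingCase] // InputForm];
(* Output: *)
{1,3}
{{"Bourjassott", "Bourjasott"}, {"Marseil", "Marseillais"}}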
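
One way to turn the edit distance into a yes/no match is to compare it against a cutoff. The sketch below assumes a cutoff of 2, which is an arbitrary choice for illustration; `similarNamesQ` and `maxDistance` are hypothetical names.

```wolfram
(* similarNamesQ and maxDistance are hypothetical; the default cutoff of 2 is an assumption *)
similarNamesQ[a_String, b_String, maxDistance_Integer : 2] := 
 EditDistance[RemoveDiacritics[a], RemoveDiacritics[b], 
   IgnoreCase -> True] <= maxDistance

(* Apply the test to each pair of spellings *)
similarNamesQ @@@ {{"Bourjassotte", "Bourjasotte"}, {"Marseilles", 
   "Marseillaise"}}
```

With the distances of 1 and 3 reported above, this cutoff would accept the first pair and reject the second. Whether "Marseilles" and "Marseillaise" should count as the same name is a judgment call, so the cutoff likely needs tuning against real data.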
POSTED BY: Richard Frost
Posted 3 years ago

Richard,

TextWords and WordStem might help.
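
A minimal sketch of how the two functions might be combined, assuming the names are short strings like the test case earlier in the thread. How `TextWords` tokenizes the apostrophe is not assumed here, so no expected output is shown.

```wolfram
(* Split a name into words, lowercase them, and reduce each word to its stem *)
words = TextWords[RemoveDiacritics["(Green)\nd'Argentéuil"]];
Sort[WordStem[ToLowerCase[words]]]
```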

POSTED BY: Rohit Namjoshi