How to Identify and Extract Foreign Character Strings?

Posted 9 years ago

Hello, I am trying to write a function to extract elements from a dataset based on character strings from several foreign-language alphabets. The approach I am considering is to extract elements (rows) based on content, i.e. strings that contain characters from specific built-in alphabets for recognized languages, e.g. "Chinese" or "Russian". For example, suppose I start with source data like this (notice the values in the last position of each list, which contain a mixture of English, Russian, and Chinese characters):

{{1, "GA", ".网址 (xn--ses554g)"}, {2, "GA", ".公司 (xn--55qx5d)"}, 
 {3, "GA", ".tokyo"}, {4, "GA", ".网络 (xn--io0a7i)"}, 
 {5, "GA", ".москва (xn--80adxhks)"}}

I'd like to end up with data like this:

{{1, "GA", ".xn--ses554g"}, {2, "GA", ".xn--55qx5d"}, 
 {3, "GA", ".tokyo"}, {4, "GA", ".xn--io0a7i"}, 
 {5, "GA", ".xn--80adxhks"}}
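
One way to get from the first list to the second can be sketched as follows. This is a minimal sketch under two assumptions that are not stated in the post: "foreign" is taken to mean any character with a code point above 127, and the non-ASCII labels in the sample data are taken to be the ones the Punycode forms decode to. The helper names `asciiOnly` and `tidy` are made up for illustration.

```wolfram
(* Sample data; the non-ASCII labels are assumed from the Punycode forms *)
data = {{1, "GA", ".网址 (xn--ses554g)"}, {2, "GA", ".公司 (xn--55qx5d)"},
   {3, "GA", ".tokyo"}, {4, "GA", ".网络 (xn--io0a7i)"},
   {5, "GA", ".москва (xn--80adxhks)"}};

(* Keep only the characters whose code point is below 128 (plain ASCII) *)
asciiOnly[s_String] := FromCharacterCode[Select[ToCharacterCode[s], # < 128 &]]

(* Remove the spaces and parentheses left around the Punycode label *)
tidy[s_String] := StringReplace[s, (" " | "(" | ")") -> ""]

(* Apply both steps to the third element of every row *)
cleaned = {#1, #2, tidy[asciiOnly[#3]]} & @@@ data
```

With the sample rows above, `cleaned` should match the target list; rows with no non-ASCII characters, such as `.tokyo`, pass through unchanged.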

I've thought of a couple of approaches. If I could isolate just the characters from a single alphabet in each row, I could easily define a replacement rule. That is, a way to extract something like this would be very helpful...

{网址, 公司, "", 网络, москва}


{xn--ses554g, xn--55qx5d, .tokyo, xn--io0a7i, xn--80adxhks}
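
Isolating the non-ASCII run in each row, and then using it in a replacement rule, might look like the sketch below. It assumes that "foreign" means a code point above 127 rather than membership in a particular alphabet, and that the sample strings carry the labels their Punycode forms decode to. `nonASCIIRuns`, `foreignParts`, and `stripped` are illustrative names only.

```wolfram
(* Third column of the sample data; non-ASCII labels assumed from the Punycode *)
strings = {".网址 (xn--ses554g)", ".公司 (xn--55qx5d)", ".tokyo",
   ".网络 (xn--io0a7i)", ".москва (xn--80adxhks)"};

(* Split the character codes into runs that are all ASCII or all non-ASCII,
   keep only the non-ASCII runs, and turn them back into strings *)
nonASCIIRuns[s_String] := FromCharacterCode /@
   Select[SplitBy[ToCharacterCode[s], # > 127 &], First[#] > 127 &]

(* One string per row; rows without non-ASCII characters give "" *)
foreignParts = StringJoin /@ (nonASCIIRuns /@ strings)

(* A per-row replacement rule then removes the isolated run *)
stripped = MapThread[
   If[#2 === "", #1, StringReplace[#1, #2 -> ""]] &, {strings, foreignParts}]
```

The `If` guards against building a rule with an empty pattern for rows such as `.tokyo`, which have nothing to remove.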

Alternatively, a solution that simply extracts a list of all the foreign characters (or foreign character strings) would still work for my purposes, because I could replace those with whitespace and go from there, i.e. something to produce this...

{网, 址, 公, 司, 网, 络, м, о, с, к, в, а}
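
A flat list of the individual characters, followed by the whitespace replacement, could be sketched like this. As before, the code-point test (above 127) is an assumption standing in for "foreign", the sample strings use the labels decoded from the Punycode forms, and the variable names are illustrative.

```wolfram
(* Third column of the sample data; non-ASCII labels assumed from the Punycode *)
strings = {".网址 (xn--ses554g)", ".公司 (xn--55qx5d)", ".tokyo",
   ".网络 (xn--io0a7i)", ".москва (xn--80adxhks)"};

(* Every character of every string, as one flat list *)
allChars = Flatten[Characters /@ strings];

(* Keep the characters whose code point is above 127 *)
foreignChars = Select[allChars, First[ToCharacterCode[#]] > 127 &]

(* Replace each distinct foreign character with a single space *)
blanked = StringReplace[#, Thread[DeleteDuplicates[foreignChars] -> " "]] & /@ strings
```

Duplicates are removed before the rules are built only to keep the rule list short; the replacement result is the same either way.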

Note: I can handle the leading and trailing "(" and ")" myself, so there is no need to cover that in an answer. I'm just including them for reference. :-D

A few records of the sample data shown above are attached to this post as a .nb file, for your convenience. Thank you all in advance! I look forward to your suggestions and ideas.

POSTED BY: Caitlin Ramsey