How to Identify and Extract Foreign Character Strings?

Posted 10 years ago

Hello, I am trying to write a function to extract elements from a dataset based on character strings from several foreign-language alphabets. The approach I am considering is to extract elements (rows) based on content, i.e. strings that contain characters from specific built-in alphabets for recognized languages, e.g. "Chinese" or "Russian". For example, suppose I start with source data like this (notice the values in the last position of each list, which contain a mixture of English, Russian, and Chinese characters):

{{1, "GA", ".网址 (xn--ses554g)"}, {2, "GA", ".公司 (xn--55qx5d)"}, 
 {3, "GA", ".tokyo"}, {4, "GA", ".网络 (xn--io0a7i)"}, 
 {5, "GA", ".москва (xn--80adxhks)"}}

I'd like to end up with data like this:

{{1, "GA", ".xn--ses554g"}, {2, "GA", ".xn--55qx5d"}, 
 {3, "GA", ".tokyo"}, {4, "GA", ".xn--io0a7i"}, 
 {5, "GA", ".xn--80adxhks"}}
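
One possible way to get from the first list to the second, offered only as a rough sketch: treat any character outside the ASCII range as "foreign" and replace each run of such characters, together with the parenthesized code that follows it, by the code alone. The name `cleanTLD` is made up for illustration, the non-ASCII labels in the sample are assumed to be the decoded forms of the `xn--` codes, and "non-ASCII" is only an approximation of "belongs to a foreign alphabet".

```mathematica
(* Sample data; the non-ASCII labels are an assumption (decoded from the xn-- codes) *)
data = {{1, "GA", ".网址 (xn--ses554g)"}, {2, "GA", ".公司 (xn--55qx5d)"},
   {3, "GA", ".tokyo"}, {4, "GA", ".网络 (xn--io0a7i)"},
   {5, "GA", ".москва (xn--80adxhks)"}};

(* cleanTLD is a hypothetical helper: it replaces
   "<run of non-ASCII characters> (<code>)" with just "<code>".
   Strings that contain only ASCII characters are returned unchanged. *)
cleanTLD[s_String] := StringReplace[s,
   RegularExpression["[^\\x00-\\x7F]+\\s*\\(([^)]*)\\)"] -> "$1"];

(* Apply the helper to the third element of every row *)
{#1, #2, cleanTLD[#3]} & @@@ data
(* {{1, "GA", ".xn--ses554g"}, {2, "GA", ".xn--55qx5d"}, {3, "GA", ".tokyo"},
    {4, "GA", ".xn--io0a7i"}, {5, "GA", ".xn--80adxhks"}} *)
```

Because the pattern keys on the character range rather than on a named language, it does not distinguish Chinese from Russian; a per-alphabet test would need a different character class.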

I've thought of a couple of approaches. If I could isolate just the characters from a single alphabet in each row, I could easily define a replacement rule. That is, a way to extract something like this would be very helpful:

{"网址", "公司", "", "网络", "москва"}

and

{xn--ses554g, xn--55qx5d, .tokyo, xn--io0a7i, xn--80adxhks}

Alternatively, a solution that simply extracts a list of all foreign characters (or foreign character strings) would still work for my purposes, because I could replace those with whitespace and go from there, i.e. something that produces this:

{网, 址, 公, 司, 网, 络, м, о, с, к, в, а}
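
A minimal sketch of this second idea, again assuming that "foreign" can be approximated as "character code above 127". The helper names `nonASCIIQ` and `foreignChars` are hypothetical, and the sample strings are assumed to be the decoded forms of the `xn--` codes.

```mathematica
(* Third-column strings from the sample; the non-ASCII labels are an assumption *)
strings = {".网址 (xn--ses554g)", ".公司 (xn--55qx5d)", ".tokyo",
   ".网络 (xn--io0a7i)", ".москва (xn--80adxhks)"};

(* True when a single-character string lies outside the ASCII range *)
nonASCIIQ[c_String] := Max[ToCharacterCode[c]] > 127;

(* Hypothetical helper: the non-ASCII characters of a string, in order *)
foreignChars[s_String] := Select[Characters[s], nonASCIIQ];

(* One flat list of every non-ASCII character across all strings *)
Flatten[foreignChars /@ strings]
(* {"网", "址", "公", "司", "网", "络", "м", "о", "с", "к", "в", "а"} *)

(* Per-string version, joined back into strings; "" where nothing matched *)
StringJoin /@ (foreignChars /@ strings)
(* {"网址", "公司", "", "网络", "москва"} *)
```

Either list could then be fed to `StringReplace` as the set of substrings to blank out.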

Note: I can handle the leading "(" and trailing ")" myself, so there is no need to cover that in an explanation. I'm just including them for reference. :-D

A few records of sample data shown above are attached to this post in .nb, for your convenience. Thank you all in advance! Look forward to your suggestions and ideas.

Attachments:
POSTED BY: Caitlin Ramsey