Hello, I am trying to write a function to extract elements from a dataset based on character strings from several foreign-language alphabets. The approach I am considering is to extract elements (rows) based on content, i.e. strings that contain characters from specific built-in alphabets for recognized languages, e.g. "Chinese" or "Russian". For example, suppose I start with source data like this (notice the values in the last position of each list, which contain a mixture of English, Russian, and Chinese characters):
{{1, "GA", ".网址 (xn--ses554g)"}, {2, "GA", ".公司 (xn--55qx5d)"},
{3, "GA", ".tokyo"}, {4, "GA", ".网络 (xn--io0a7i)"},
{5, "GA", ".москва (xn--80adxhks)"}}
I'd like to end up with data like this:
{{1, "GA", ".xn--ses554g"}, {2, "GA", ".xn--55qx5d"},
{3, "GA", ".tokyo"}, {4, "GA", ".xn--io0a7i"},
{5, "GA", ".xn--80adxhks"}}
I've thought of a couple of approaches. If I could isolate just the characters from a single alphabet in each row, I could easily define a replacement rule. That is, a way to extract something like this would be very helpful:
{网址, 公司, "", 网络, москва}
and
{xn--ses554g, xn--55qx5d, .tokyo, xn--io0a7i, xn--80adxhks}
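To make the idea concrete, here is a rough sketch of the kind of extraction I have in mind. It assumes that "foreign" can be approximated as "any character outside the ASCII range", which is my own simplification and not a true per-language alphabet test. The names `data`, `nonAscii`, `foreign`, and `ascii` are placeholders I made up, and the non-ASCII labels in the sample rows are my decoding of the Punycode forms.

```wolfram
(* Three sample rows; the non-ASCII labels are my decoding of the Punycode forms *)
data = {{1, "GA", ".网址 (xn--ses554g)"}, {3, "GA", ".tokyo"},
   {5, "GA", ".москва (xn--80adxhks)"}};

(* Assumption: a "foreign" run is one or more characters outside ASCII *)
nonAscii = RegularExpression["[^\\x00-\\x7F]+"];

(* Non-ASCII runs joined per row; a row with none gives the empty string *)
foreign = StringJoin[StringCases[#, nonAscii]] & /@ data[[All, 3]]

(* The same strings with the non-ASCII runs deleted *)
ascii = StringReplace[#, nonAscii -> ""] & /@ data[[All, 3]]
```

This does not tell Russian apart from Chinese; it only separates ASCII from everything else, which may or may not be enough for the real dataset.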
Alternatively, a solution that simply extracts a list of all the foreign characters (or foreign character strings) would still work for my purposes, because I could replace those with whitespace and go from there, i.e. something that produces this:
{网, 址, 公, 司, 网, 络, м, о, с, к, в, а}
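A minimal sketch of that character-level extraction, assuming "foreign" simply means a character code above 127 (an assumption on my part; it does not distinguish one language from another). `strings`, `foreignChars`, `allForeign`, and `blanked` are made-up names, and the sample strings are my decoding of the Punycode labels.

```wolfram
strings = {".网址 (xn--ses554g)", ".tokyo", ".москва (xn--80adxhks)"};

(* Keep a character if its code lies outside the ASCII range 0-127 *)
foreignChars[s_String] :=
  Select[Characters[s], First[ToCharacterCode[#]] > 127 &];

(* One flat list of every non-ASCII character across all strings *)
allForeign = Flatten[foreignChars /@ strings]

(* Replace each of those characters with a space, as described above *)
blanked = StringReplace[strings, Thread[DeleteDuplicates[allForeign] -> " "]]
```

Working on character codes avoids any dependence on which alphabets are built in, at the cost of treating every non-ASCII character the same way.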
Note: I can handle the leading and trailing "(" and ")" myself, so there is no need to cover that in your explanation. I am just including them for reference. :-D
A few records of the sample data shown above are attached to this post as a .nb file, for your convenience. Thank you all in advance! I look forward to your suggestions and ideas.