WOLFRAM COMMUNITY

UK Place Name Generator

Alan Joyce, Wolfram|Alpha

Posted 9 years ago

Every time I see another article pop up wherein somebody trains a neural net to generate names of something, I feel obligated to go back and run the same training sets through my dead-simple setup to do the same thing in the Wolfram Language. This pass at generating British place names seemed like a fun one, since the training set goes deep into small towns and villages across the UK.

I'll start with a small set of functions I've used before for this sort of thing: "decamel" is a utility function that cleans up and splits apart any incidental camel-cased words that show up in predictions; "nameGenerator" does some minimal string processing on a provided list of Wolfram Language entities or raw strings and produces a SequencePredictorFunction; and "predictionList" uses a predictor function to produce a list of results of varying lengths:

decamel[str_] := 
 StringTrim[
  StringJoin[
   StringSplit[
    str, {RegularExpression["([a-z])([A-Z])"] -> "$1 $2", 
     RegularExpression["([0-9])([A-Z])"] -> "$1 $2", 
     RegularExpression["([a-z])([0-9])"] -> "$1 $2"}]]]
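
To make the effect of these three rules concrete, here is a minimal sketch that applies the same regular expressions with StringReplace rather than the StringSplit/StringJoin combination above. The name decamelSketch and the sample inputs are hypothetical, chosen only for illustration.

```wolfram
(* Hypothetical helper: the same three rules as decamel, applied with StringReplace *)
decamelSketch[str_String] := StringTrim[
  StringReplace[str, {
    RegularExpression["([a-z])([A-Z])"] -> "$1 $2", (* lowercase then uppercase *)
    RegularExpression["([0-9])([A-Z])"] -> "$1 $2", (* digit then uppercase *)
    RegularExpression["([a-z])([0-9])"] -> "$1 $2"  (* lowercase then digit *)
  }]]

decamelSketch["StaintonDoirkmill"]  (* "Stainton Doirkmill" *)
decamelSketch["Area51North"]        (* "Area 51 North" *)
```

Each rule inserts a single space at a boundary where two words were run together, and leaves ordinary capitalized words alone.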

(* Generate num names, each sampled with a random length between min and max elements *)
predictionList[func_, num_, min_, max_, decam_: True] := 
 With[{raw = 
    Table[StringTrim@
      StringReplace[
       func["|", "RandomNextElement" -> RandomInteger[{min, max}]], 
       "|" -> " "], num]},
  (* optionally split camel-cased words in the results *)
  If[TrueQ[decam], decamel /@ raw, raw]]

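The core of predictionList is a single call to the predictor with the "RandomNextElement" property. The sketch below isolates that call on a toy model; the five training strings and the names toyNames, toyPredictor, and raw are invented for illustration, and because the sampling is random no particular output should be expected.

```wolfram
(* Toy training set, invented for illustration; "|" marks the start and end of each name *)
toyNames = {"|Ashford|", "|Ashton|", "|Ashby|", "|Askham|", "|Aston|"};

(* Character-level predictor, using the same "SegmentedCharacters" extractor as later in the post *)
toyPredictor = SequencePredict[toyNames, FeatureExtractor -> "SegmentedCharacters"];

(* Starting from the delimiter, sample the next 6 elements at random *)
raw = toyPredictor["|", "RandomNextElement" -> 6]

(* Turn delimiters back into spaces and trim, as predictionList does *)
StringTrim[StringReplace[raw, "|" -> " "]]
```

With so little training data the results will mostly be fragments of the inputs; the point is only to show the shape of the call.
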
nameGenerator[entOrString_List, extractor_: "SegmentedWords"] :=
 Block[{names, list},
  With[{heads = DeleteDuplicates[Head /@ entOrString]},
   Which[
    heads === {Entity},
    names = CommonName[DeleteMissing[entOrString]];
    list = 
     StringRiffle[StringSplit["|" <> # <> "|"], "|"] & /@ names;
    SequencePredict[list, FeatureExtractor -> extractor],
    heads === {String},
    names = StringTrim /@ DeleteMissing[entOrString];
    list = 
     StringRiffle[StringSplit["|" <> # <> "|"], "|"] & /@ names;
    SequencePredict[list, FeatureExtractor -> extractor]]]]
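
Both branches of nameGenerator prepare the names the same way before training. As a small illustration, the pure function below is the one mapped over the names above, pulled out under the hypothetical name mark so it can be tried on its own.

```wolfram
(* The same pure function nameGenerator maps over its input names *)
mark = StringRiffle[StringSplit["|" <> # <> "|"], "|"] &;

mark["Upper Largo"]  (* "|Upper|Largo|" *)
mark["Usk"]          (* "|Usk|" *)
```

Every name is wrapped in "|" and every space becomes "|", so the predictor sees one delimiter for name boundaries and word breaks alike. This is why predictionList seeds generation with "|" and converts "|" back to a space afterwards.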

Next, I'll import the file used in the original article and grab just the place names out of it (it also includes some numerical IDs and county names):

uknames = 
  Import["https://cdn.obrienmedia.co.uk/cdn/farfuture/5-\
1bFjgWmjONhWhk9sGAeYzlIzhwHRSBIF_Fzr55UYs/mtime:1425905283/sites/\
default/files/uk_towns_and_counties.csv"];

namelist = uknames[[All, 2]] // Rest // DeleteDuplicates;
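
For readers without access to the file, here is a sketch of what that extraction does on a stand-in table. The rows, the header names, and the variable sampleRows are invented; the only assumption carried over from the post is that the names sit in the second column below a header row.

```wolfram
(* Invented rows standing in for the imported CSV; the real headers may differ *)
sampleRows = {
  {"id", "place_name", "county"},
  {1, "Usk (Brynbuga)", "County A"},
  {2, "Dalby", "County B"},
  {3, "Dalby", "County C"}};

(* Column 2, minus the header row, with repeats removed *)
sampleRows[[All, 2]] // Rest // DeleteDuplicates
(* {"Usk (Brynbuga)", "Dalby"} *)
```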

In[84]:= Select[namelist, StringContainsQ["("]][[;; 10]]

Out[84]= {"Wdig (Goodwick)", "Vermuden's Drain (Forty Foot)", "Valley \
(Dyffryn)", "Usk (Brynbuga)", "Upper Largo (Kirkton of Largo)", \
"Uisage Dubh (Black Water)", "Tyddewi (St David's)", "Treorci \
(Treorchy)", "Treorchy (Treorci)", "Trent (Piddle)"}

I don't want to try to generate names with parenthetical transcriptions or alternate forms, so let's split those up and treat the parentheticals as distinct names for training purposes:

In[78]:= splitter[rec_] := 
 StringTrim[StringSplit[rec, "("], {" ", ")"}]

In[105]:= newlist = Flatten[splitter /@ namelist];

In[83]:= Length[newlist] 
Out[83]= 41245
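
As a quick sketch of why Flatten is needed here, the snippet below applies splitter to a few entries. The first two come from the output above; the plain entry "Treorchy" and the name perRecord are added for illustration.

```wolfram
(* splitter as defined in the post *)
splitter[rec_] := StringTrim[StringSplit[rec, "("], {" ", ")"}]

(* Each call returns a list: two pieces when there is a parenthetical, one otherwise *)
perRecord = splitter /@ {"Wdig (Goodwick)", "Usk (Brynbuga)", "Treorchy"};

(* Flatten merges the per-record lists into one flat list of training names *)
Flatten[perRecord]
```

Note that pairs such as "Treorci (Treorchy)" and "Treorchy (Treorci)" contribute each form twice, since nothing de-duplicates the list after splitting.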

Then all that's left to do is make the SequencePredictorFunction, and generate some names (removing predicted names that were already in the training set, or that end with words of fewer than 5 characters):

ukpl = nameGenerator[newlist, "SegmentedCharacters"];

Multicolumn[
 Complement[
  Select[predictionList[ukpl, 400, 8, 18], 
   StringLength[StringSplit[#, " "][[-1]]] > 4 &], newlist], 6]
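
The filtering step can be tried in isolation. In this sketch the lists candidates and training are invented stand-ins for the generated names and for newlist.

```wolfram
(* Invented candidates and training names, for illustration only *)
candidates = {"Tattin Grime", "Low of Gosbe", "Kirk by Sea", "Dalby", "Tattin Grime"};
training = {"Dalby", "Usk"};

(* Keep candidates whose final word has at least 5 characters *)
longEnding = Select[candidates, StringLength[StringSplit[#, " "][[-1]]] > 4 &];

(* Drop anything already in the training set; Complement also sorts and de-duplicates *)
Complement[longEnding, training]
(* {"Low of Gosbe", "Tattin Grime"} *)
```

Because of the Select and Complement steps, the grid usually shows fewer than the 400 names requested.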

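Not all of these are gems (or even readable), but there's some good stuff in here. My personal favorites include:

Blackleaze Ferry Bleburgh Farmlingthorpe Kirphook Low of Gosbe Roebucklecott Stainton Doirkmill Tattin Grime Toberland Garker Winstapleton Dalby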