Every time I see another article pop up wherein somebody trains a neural net to generate names of something, I feel obligated to go back and run the same training sets through my dead-simple setup to do the same thing in the Wolfram Language. This pass at generating British place names seemed like a fun one, since the training set goes deep into small towns and villages across the UK. So I'll start with a small set of functions I've used before for this sort of thing "decamel" is a utility function to clean up and split apart any incidental camelcased words that show up in predictions; "nameGenerator" does some minimal string processing on a provided list of Wolfram Language entities or raw strings, and produces a SequencePredictorFunction; "predictionList" produces a list of results of varying lengths using a predictor function:
decamel[str_] :=
StringTrim[
StringJoin[
StringSplit[
str, {RegularExpression["([a-z])([A-Z])"] -> "$1 $2",
RegularExpression["([0-9])([A-Z])"] -> "$1 $2",
RegularExpression["([a-z])([0-9])"] -> "$1 $2"}]]]
predictionList[func_, num_, min_, max_, decam_: True] :=
If[decam == True,
decamel /@
Table[StringTrim@
StringReplace[
func["|", "RandomNextElement" -> RandomInteger[{min, max}]],
"|" -> " "], num],
Table[StringTrim@
StringReplace[
func["|", "RandomNextElement" -> RandomInteger[{min, max}]],
"|" -> " "], num]]
nameGenerator[entOrString_List, extractor_: "SegmentedWords"] :=
Block[{names, list},
With[{heads = DeleteDuplicates[Head /@ entOrString]},
Which[
heads === {Entity},
names = CommonName[DeleteMissing[entOrString]];
list =
StringRiffle[StringSplit["|" <> # <> "|"], "|"] & /@ names;
SequencePredict[list, FeatureExtractor -> extractor],
heads === {String},
names = StringTrim /@ DeleteMissing[entOrString];
list =
StringRiffle[StringSplit["|" <> # <> "|"], "|"] & /@ names;
SequencePredict[list, FeatureExtractor -> extractor]]]]
So I'll start by importing the file used in the original article, and just grabbing place names out of it (it also includes some numerical IDs, and county names):
uknames =
Import["https://cdn.obrienmedia.co.uk/cdn/farfuture/5-\
1bFjgWmjONhWhk9sGAeYzlIzhwHRSBIF_Fzr55UYs/mtime:1425905283/sites/\
default/files/uk_towns_and_counties.csv"];
namelist = uknames[[All, 2]] // Rest // DeleteDuplicates;
In[84]:= Select[namelist, StringContainsQ["("]][[;; 10]]
Out[84]= {"Wdig (Goodwick)", "Vermuden's Drain (Forty Foot)", "Valley \
(Dyffryn)", "Usk (Brynbuga)", "Upper Largo (Kirkton of Largo)", \
"Uisage Dubh (Black Water)", "Tyddewi (St David's)", "Treorci \
(Treorchy)", "Treorchy (Treorci)", "Trent (Piddle)"}
I don't want to try to generate names with parenthetical transcriptions or alternate forms, so let's split those up and treat the parentheticals as distinct names for training purposes:
In[78]:= splitter[rec_] :=
StringTrim[StringSplit[rec, "("], {" ", ")"}]
In[105]:= newlist = Flatten[splitter /@ namelist];
In[83]:= Length[newlist]
Out[83]= 41245
Then all that's left to do is make the SequencePredictorFunction, and generate some names (removing predicted names that were already in the training set, or that end with words of fewer than 5 characters):
ukpl = nameGenerator[newlist, "SegmentedCharacters"];
Multicolumn[
Complement[
Select[predictionList[ukpl, 400, 8, 18],
StringLength[StringSplit[#, " "][[-1]]] > 4 &], newlist], 6]
Not all of these are gems (or even readable), but there's some good stuff in here my personal favorites include:
- Blackleaze Ferry
- Bleburgh
- Farmlingthorpe
- Kirphook
- Low of Gosbe
- Roebucklecott
- Stainton Doirkmill
- Tattin Grime
- Toberland Garker
- Winstapleton Dalby