A related idea is to search for the smallest hash collisions for the two weak hash functions (CRC32 and Adler32) and use all possible strings of a certain size. Unfortunately, for the generated wordlists that I could test, there wasn't any hash collisions which surprised me a little. Anyway, here's the idea and the code.
Let's assume that we use words with strings of the characters [a-zA-Z0-9] and generate all possible strings up to a certain length, i.e. size = 3 generates all possible strings of length 1, 2, and 3.
Here's the alphabet and 30 random samples:
alpha = Flatten[{CharacterRange["a", "z"], CharacterRange["A", "Z"],
CharacterRange["0", "9"]}];
RandomSample[
Table[StringJoin[#] & /@ Tuples[alpha, n], {n, 1, 3}] // Flatten, 30]
{"oEg", "LMZ", "j3n", "GDY", "dUe", "7ma", "Z5x", "jtb", "6OI", \
"HdL", "EzE", "VMY", "OYM", "p7Z", "PNg", "Vqz", "obw", "Kyj", "OUi", \
"pda", "LwI", "S2A", "4KO", "Ah0", "VsX", "XI4", "ZiC", "tRE", "EMv", \
"fc3"}
We reuse some of the functions from the original notebook:
crc32[word_] := Hash[word, "CRC32", "HexString"]
adler32[word_] := Hash[word, "Adler32", "HexString"]
listCollisions2[words_, fun_] :=
Module[{hashList, counts, hashMap, collisions},
hashList = fun[#] & /@ words;
If[Length[hashList] != Length[words],
counts = Counts[hashList];
hashMap = AssociationThread[words -> hashList];
collisions = Keys@Select[counts, # > 1 &];
Table[Select[hashMap, # == c & ], {c, collisions}] ,
{} (* else: no collisions *)
]
]
As mentioned above, we don't found any hash collisions for CRC32 or Adler32 for sizes up to 4:
Table[{n, listCollisions2[generateWords[n], crc32]}, {n, 1, 4}] // AbsoluteTiming
{25.0142, {{1, {}}, {2, {}}, {3, {}}, {4, {}}}}
Table[{n, listCollisions2[generateWords[n], adler32]}, {n, 1, 4}] // AbsoluteTiming
{25.9066, {{1, {}}, {2, {}}, {3, {}}, {4, {}}}}
One major problem with this approach is that the size of wordlists increases very fast and soon be forbiddingly large:
Table[{n, Length[generateWords[n]]}, {n, 1, 4}] // TableForm
{
{1, 62},
{2, 3 906},
{3, 242 234},
{4, 150 18 570}
}
For size 5, I couldn't even generate the wordlist (got the error "No more memory available. Mathematica kernel has shut down"), but using WL's excellent FindSequenceFunction
we can at least estimate the sizes of the wordlists for larger sizes:
FindSequenceFunction[{62, 3906, 242234, 15018570}]
{#, %[#]} & /@ Range[1, 10] // TableForm
which outputs
62/61 (-1 + 62^#1) &
{
{1, 62},
{2, 3906},
{3, 242234},
{4, 15018570},
{5, 931151402},
{6, 57731386986},
{7, 3579345993194},
{8, 221919451578090},
{9, 13759005997841642},
{10, 853058371866181866}
}
Thus, generateWords[5]
would give a wordlist of 931 151 402 words.
Update:
Perhaps it's not that surprising that there are no hash conflicts for Adler32. Here are some of the words and their Adler32 hashes for the size 4 generated words, i.e. the hashes are distinct and just incrementing from the last (two counters) and are in order.
Short[{#, adler32[#]} & /@ generateWords[4], 20]
{{"a", "00620062"}, {"b", "00630063"}, {"c", "00640064"}, {"d",
"00650065"}, {"e", "00660066"}, {"f", "00670067"}, {"g",
"00680068"}, {"h", "00690069"}, {"i", "006a006a"}, {"j",
"006b006b"}, {"k", "006c006c"}, {"l", "006d006d"}, {"m",
"006e006e"}, {"n", "006f006f"}, {"o", "00700070"}, {"p",
"00710071"}, {"q", "00720072"}, {"r", "00730073"}, {"s",
"00740074"}, {"t", "00750075"}, {"u", "00760076"}, {"v",
"00770077"}, {"w", "00780078"}, {"x", "00790079"}, {"y",
"007a007a"}, {"z", "007b007b"}, {"A", "00420042"}, {"B",
"00430043"}, {"C", "00440044"}, {"D", "00450045"}, {"E",
"00460046"}, {"F", "00470047"}, <<15018506>>, {"999E",
"024a00f1"}, {"999F", "024b00f2"}, {"999G", "024c00f3"}, {"999H",
"024d00f4"}, {"999I", "024e00f5"}, {"999J", "024f00f6"}, {"999K",
"025000f7"}, {"999L", "025100f8"}, {"999M", "025200f9"}, {"999N",
"025300fa"}, {"999O", "025400fb"}, {"999P", "025500fc"}, {"999Q",
"025600fd"}, {"999R", "025700fe"}, {"999S", "025800ff"}, {"999T",
"02590100"}, {"999U", "025a0101"}, {"999V", "025b0102"}, {"999W",
"025c0103"}, {"999X", "025d0104"}, {"999Y", "025e0105"}, {"999Z",
"025f0106"}, {"9990", "023500dc"}, {"9991", "023600dd"}, {"9992",
"023700de"}, {"9993", "023800df"}, {"9994", "023900e0"}, {"9995",
"023a00e1"}, {"9996", "023b00e2"}, {"9997", "023c00e3"}, {"9998",
"023d00e4"}, {"9999", "023e00e5"}}
Testing with some longer words and it use the same principle of two counters:
{#, adler32[#]} & /@ {"aaaaa", "aaaab", "aaaac", "aaaaaaaaaaa",
"aaaaaaaaaab",
"aaaaaaaaaaaaaaaaaaa", "aaaaaaaaaaaaaaaaaab"}
{{"aaaaa", "05b401e6"}, {"aaaab", "05b501e7"}, {"aaaac",
"05b601e8"}, {"aaaaaaaaaaa", "190d042c"}, {"aaaaaaaaaab",
"190e042d"}, {"aaaaaaaaaaaaaaaaaaa",
"48110734"}, {"aaaaaaaaaaaaaaaaaab", "48120735"}}
The hashes for CRC32 is much more random like:
Short[{#, crc32[#]} & /@ generateWords[4], 20]
{{"a", "e8b7be43"}, {"b", "71beeff9"}, {"c", "06b9df6f"}, {"d",
"98dd4acc"}, {"e", "efda7a5a"}, {"f", "76d32be0"}, {"g",
"01d41b76"}, {"h", "916b06e7"}, {"i", "e66c3671"}, {"j",
"7f6567cb"}, {"k", "0862575d"}, {"l", "9606c2fe"}, {"m",
"e101f268"}, {"n", "7808a3d2"}, {"o", "0f0f9344"}, {"p",
"82079eb1"}, {"q", "f500ae27"}, {"r", "6c09ff9d"}, {"s",
"1b0ecf0b"}, {"t", "856a5aa8"}, {"u", "f26d6a3e"}, {"v",
"6b643b84"}, {"w", "1c630b12"}, {"x", "8cdc1683"}, {"y",
"fbdb2615"}, {"z", "62d277af"}, {"A", "d3d99e8b"}, {"B",
"4ad0cf31"}, {"C", "3dd7ffa7"}, {"D", "a3b36a04"}, {"E",
"d4b45a92"}, {"F", "4dbd0b28"}, <<15018506>>, {"999E",
"8fef8e8d"}, {"999F", "16e6df37"}, {"999G", "61e1efa1"}, {"999H",
"f15ef230"}, {"999I", "8659c2a6"}, {"999J", "1f50931c"}, {"999K",
"6857a38a"}, {"999L", "f6333629"}, {"999M", "813406bf"}, {"999N",
"183d5705"}, {"999O", "6f3a6793"}, {"999P", "e2326a66"}, {"999Q",
"95355af0"}, {"999R", "0c3c0b4a"}, {"999S", "7b3b3bdc"}, {"999T",
"e55fae7f"}, {"999U", "92589ee9"}, {"999V", "0b51cf53"}, {"999W",
"7c56ffc5"}, {"999X", "ece9e254"}, {"999Y", "9beed2c2"}, {"999Z",
"02e78378"}, {"9990", "af800b3e"}, {"9991", "d8873ba8"}, {"9992",
"418e6a12"}, {"9993", "36895a84"}, {"9994", "a8edcf27"}, {"9995",
"dfeaffb1"}, {"9996", "46e3ae0b"}, {"9997", "31e49e9d"}, {"9998",
"a15b830c"}, {"9999", "d65cb39a"}}