Hi Richard,
I had some time left to experiment with another approach.
This works due to the limited length of the subsets.
My approach is basically trying to solve in one pass this puzzle.
I calculate the possible subsets and label them with Primes and then delete the subsets if present in an association database (db).
The benefit of this approach is that it it also handles the duplicates I was talking about.
I tested Sander's routine with 100000 subsets and it solved in 670 seconds. My routine did this in 64 seconds. I tested this by removing any duplicates in the testset to be sure to get the same results as Sanders approach.
ss = RandomChoice[words, {100000, 10}];
ss = Table[Take[s, RandomInteger[{1, Length[s]}]], {s, ss}];
ss = DeleteDuplicates /@ ss;
ss = Sort /@ SortBy[ss, Length];
RemoveSubsetsv2[subsets_] :=
Module[{data, uniquedata, uniquerules, db, step},
data = Gather /@ subsets;
uniquedata = SortBy[DeleteDuplicates@Flatten[data, 1], Length];
uniquerules =
Dispatch[
Flatten[{Thread[
uniquedata -> Rest@Prime@(Range[Length@uniquedata + 1])],
Thread[
Rest@Prime@(Range[Length@uniquedata + 1]) -> uniquedata]},
1]];
db = Times @@ # -> # & /@ (data /. uniquerules) // Association;
step = 0;
PrintTemporary[Dynamic[step]];
(step++;
KeyDropFrom[db,
Rest@ReverseSort@
MapApply[Times,
DeleteDuplicates[(Gather /@ Subsets[#]) /.
uniquerules]]]) & /@ ReverseSortBy[subsets, Length];
SortBy[Flatten /@ (Values@db /. uniquerules), Length]
]
RemoveSubsetsv2[{{"a"}, {"a", "a", "b", "b", "c"}, {"a", "a"}, {"a",
"a", "b", "b", "a", "c"}, {"c", "b", "b", "b", "a",
"a"}, {"b"}, {"a", "b"}, {"a", "a", "a", "a"}, {"c"}}]
{{"a", "a", "a", "a"}, {"a", "a", "a", "b", "b", "c"}, {"c", "b", "b",
"b", "a", "a"}}
RemoveSubsets[{{"a"}, {"a", "a", "b", "b", "c"}, {"a", "a"}, {"a",
"a", "b", "b", "a", "c"}, {"c", "b", "b", "b", "a",
"a"}, {"b"}, {"a", "b"}, {"a", "a", "a", "a"}, {"c"}}]
{{"a", "a", "b", "b", "b", "c"}}
Would love to hear how it works on your real data.