Message Boards Message Boards

1
|
4910 Views
|
22 Replies
|
16 Total Likes
View groups...
Share
Share this post:

How to efficiently eliminate subsets from numerous sets?

Posted 2 years ago

I'm looking for a sanity check and possible methods of speed up for filtering a long list of sets, some of which are subsets of others.

Specifically I have N = 120k to 200k sets of length < 10. The sets contain Strings of StringLength 4 to 64, most commonly StringLength 5 to 8. The N sets are sorted by length, short to long, and the interior sets are sorted alphabetically. All sets are unique, but there are less than 1000 unique Strings.

I have devised an algorithm with shortcircuit logic:

  • a storage array is initialized to size N
  • counter k = 0
  • an outer "i" Do loop runs from 1 to N-1
  • an inner "j" While loop runs from i+1 to N or ceases if the ith set is a subset of the jth
  • SubsetQ is only called if the ith set is shorter than the jth
  • if no superset was found then the ith is placed in the storage array at position ++k
  • at completion, the Nth set is placed in the storage array at ++k

Current runtimes are 12 to 18 minutes per 1000 "i"; i.e. 2-3 days per set of N.

POSTED BY: Richard Frost
22 Replies

Hi David, I would be honoured. Glad we could help you out.

POSTED BY: l van Veen

That is some tight code. May I also acknowledge your assistance in my article to Frontiers?

POSTED BY: Richard Frost

Hi Sander, good point on replace. I noticed that the Subset is not very fast. A speed up could also be made to hardcode subsets using compile (up to 10 items since it’s the max) but that is a lot of work and probably not worth it. The dataset David posted is already cleaned up in 2.5 seconds on my system. This was fun!

POSTED BY: l van Veen

Yeah I was also thinking of using Primes or numbers in ‘fancy’ bases like you did but thought it would be slower. You proved me wrong. You could probably still speed it up a bit by using Replace with a level spec rather than /. ReplaceAll which does all levels by default. Not sure if it speeds up a lot since you’re using Dispatch already… I built mine explicitly assuming multiplicities don’t matter.

POSTED BY: Sander Huisman

Speed demon! I love it.

Here are two pairs of real data files, about 1MB each. The "Subsets" are input and "Results" are from my hybrid algorithm above. The hosting service for my website runs a clean ship so don't be too concerned about picking up a virus.

https://frostconcepts.org/data/RichardSubsets1.csv

https://frostconcepts.org/data/RichardResult1.csv

https://frostconcepts.org/data/RichardSubsets2.csv

https://frostconcepts.org/data/RichardResult2.csv

POSTED BY: Richard Frost

Hi Richard, I had some time left to experiment with another approach. This works due to the limited length of the subsets. My approach is basically trying to solve in one pass this puzzle. I calculate the possible subsets and label them with Primes and then delete the subsets if present in an association database (db). The benefit of this approach is that it it also handles the duplicates I was talking about. I tested Sander's routine with 100000 subsets and it solved in 670 seconds. My routine did this in 64 seconds. I tested this by removing any duplicates in the testset to be sure to get the same results as Sanders approach.

ss = RandomChoice[words, {100000, 10}];
ss = Table[Take[s, RandomInteger[{1, Length[s]}]], {s, ss}];
ss = DeleteDuplicates /@ ss;
ss = Sort /@ SortBy[ss, Length];

RemoveSubsetsv2[subsets_] := 
 Module[{data, uniquedata, uniquerules, db, step},
  data = Gather /@ subsets;
  uniquedata = SortBy[DeleteDuplicates@Flatten[data, 1], Length];
  uniquerules = 
   Dispatch[
    Flatten[{Thread[
       uniquedata -> Rest@Prime@(Range[Length@uniquedata + 1])],
      Thread[
       Rest@Prime@(Range[Length@uniquedata + 1]) -> uniquedata]}, 
     1]];
  db = Times @@ # -> # & /@ (data /. uniquerules) // Association;
  step = 0;
  PrintTemporary[Dynamic[step]];
  (step++; 
     KeyDropFrom[db, 
      Rest@ReverseSort@
        MapApply[Times, 
         DeleteDuplicates[(Gather /@ Subsets[#]) /. 
           uniquerules]]]) & /@ ReverseSortBy[subsets, Length];
  SortBy[Flatten /@ (Values@db /. uniquerules), Length]
  ]

RemoveSubsetsv2[{{"a"}, {"a", "a", "b", "b", "c"}, {"a", "a"}, {"a", 
   "a", "b", "b", "a", "c"}, {"c", "b", "b", "b", "a", 
   "a"}, {"b"}, {"a", "b"}, {"a", "a", "a", "a"}, {"c"}}]

{{"a", "a", "a", "a"}, {"a", "a", "a", "b", "b", "c"}, {"c", "b", "b", "b", "a", "a"}}

RemoveSubsets[{{"a"}, {"a", "a", "b", "b", "c"}, {"a", "a"}, {"a", 
   "a", "b", "b", "a", "c"}, {"c", "b", "b", "b", "a", 
   "a"}, {"b"}, {"a", "b"}, {"a", "a", "a", "a"}, {"c"}}]

{{"a", "a", "b", "b", "b", "c"}}

Would love to hear how it works on your real data.

POSTED BY: l van Veen

Sander and I van: Thank you both again for taking an interest in this computation. I've created a hybrid version of it, that first extracts multiplicities from the input data:

multiplicities = Table[
   If[Length[set] > Length[DeleteDuplicates[set]], set, Nothing],
   {set, sslist}
   ];
sslist = Complement[sslist, multiplicities];

These are often about 2% of the total and processed quickly by my original method. I then apply Sander's approach to the remainder, and afterwards take the union. Overall it's running about 3 minutes per list of sets. A fabulous improvement over my original 50 hours!

POSTED BY: Richard Frost

I see, very good point. Let me screen my data for such cases and I'll post back tomorrow.

POSTED BY: Richard Frost

Yes so realize that the method provided from Sander's algo will be

RemoveSubsets[{{"A", "A", "B", "B"}, {"A", "A", "B", "B", "B"}, {"A", "A", "A", "B", "B"}}]

{"A", "A", "B", "B", "B"}

POSTED BY: l van Veen

In this list of sets:

 {"A", "A", "B", "B"}
 {"A", "A", "B", "B", "B"}
 {"A", "A", "A", "B", "B"}

the first is subset of the 2nd and 3rd, and neither are subsets of each other, so the result is:

 {"A", "A", "B", "B", "B"}
 {"A", "A", "A", "B", "B"}
POSTED BY: Richard Frost

Hi Sander and Richard, The problem stated by Richard is an interesting one indeed. I do think that there is some ambiguity like {{"A", "A", "B", "B"}, {"A", "A", "B", "B", "B"}, {"A", "A", "A", "B", "B"}} resolves to {{"A", "A", "B", "B", "B"}} but it could just also have been , {"A", "A", "A", "B", "B"} What would be the correct one for your application? (I know nothing about genes except that I know I have some) If I understand it correctly you might expect to answer above with { {"A", "A", "B", "B", "B"}, {"A", "A", "A", "B","B"}}

POSTED BY: l van Veen

For the supersets of "A"'s you list, a reduction to a single "A" represents a reduction to primaries. It is something I've had occassion to use when analyzing repeating motifs in an entire chromosome.

But for the present, Sander has provided exactly what I wanted: fast elimination of subsets. The reason for doing so is born out of the application area. In particular, I am trying to determine which elements of IAS-OLI16 are causing it to fluoresce with O. europaea in PCR tests. (https://www.genome.gov/about-genomics/fact-sheets/Polymerase-Chain-Reaction-Fact-Sheet)

POSTED BY: Richard Frost

I had thought about that. But if you don’t want duplicates then it is easy to remove duplicates at the beginning as a preprocessing step.

POSTED BY: Sander Huisman

Thx Richard, How would you solve a set like: ss = {{"A", "A", "A", "A", "A", "A"}, {"A", "B"}, {"B"}, {"A", "A", "A", "B"}, {"A", "A", "A", "A", "B"}, {"A", "A", "B", "A"}, {"A", "A", "D"}} Sanders solution (Nice!) is {{"A", "A", "A", "A", "A", "A"}, {"A", "B"}, {"B"}, {"A", "A", "A", "B"}, {"A", "A", "A", "A", "B"}, {"A", "A", "B", "A"}, {"A", "A", "D"}}

and for: {{"A"}, {"A", "A"}, {"A", "A", "A"}, {"A", "A", "A", "A"}, {"A", "A", "A", "A", "A"}, {"A", "A", "A", "A", "A", "A"}, {"A", "A", "A", "A", "A", "A", "A"}, {"A", "A", "A", "A", "A", "A", "A", "A"}, {"A", "A", "A", "A", "A", "A", "A", "A", "A"}, {"A", "A", "A", "A", "A", "A", "A", "A", "A", "A"}}

you get {{"A", "A", "A", "A", "A", "A", "A", "A", "A", "A"}}

Is that what you need or could above also resolve to just {"A"}

POSTED BY: l van Veen

Yes, there are several hundred sets of motifs containing multiplicities. For example:

{GAACA,TTTT,TTTT}

occurs many times in primer IAS-OLI16 forward matches with the 23 chromosomes of Olea europaea (cultivated Olive).

Here is IAS-OLI16:

ATCGGAGAGTTGCTACAGAACAACATCATCATCGCTATCATCATCATCGCCCATTCACATTTTTCTCCAA\
GACCGTGTGTGTTGTTGGTGGAGAGAGAGAGAGAGAGAGAGAGAGAGAGCGAGGGGGGCTGGTGTGGTAA\
ATTTGACTGGCGTTTTTAGGTGGGGATTTTTATTGCTCGACTTGGAAGGGATGCTGTTTTATGAAAGGCA\
TTTATTACTACAGCCCCG

A high-resolution dataset of an O. europaea genome can be found here: https://www.ncbi.nlm.nih.gov/data-hub/genome/GCA_902713445.1/

POSTED BY: Richard Frost
Posted 2 years ago

Hi Richard, Just wondering. Are there in the subsets any duplicate strings?

POSTED BY: Updating Name

Please double-check if it works properly, I was unable to check deeply. I play around with the index i a bit, and I think I got it correct. But it always feels a bit strange to do this manual work.

The main idea is:

Scan from large to small (so in reverse). Then for this element i, make a pattern that matches any subset of that element i. Find the position of the elements 1 through i-1 that matches the pattern (using the Position function). Delete those. Now update the index i because we delete stuff 'in front of' the current element. Since you start with the biggest sets first, this will (potentially) delete the most stuff, and after each iteration the number of items to scan decreases because i decreases, and because elements are deleted.

Still a n^2 algorithm though. It might be solveable in a neat way by using hypergraph but I'm not sure if Mathematica is set-up for that at the moment, and whether this is possible memory-wise for list of 10^5 elements and above.

Of course you can acknowledge me :-)

I'll think about turning this into a Wolfram Function Repository function in the future…

POSTED BY: Sander Huisman

Wow Sander, you turned the problem upside down and boy does it run fast. I really like your approach. It could be a kernel function DeleteSubsets[].This computation is part of a digital analysis of high resolution long-read genomes from fruiting plants. May I acknowledge your contribution in a research article I'll write this winter?

POSTED BY: Richard Frost

Sander, thank you for your suggestions. Your recommendation to replace the String elements with indices is very good -- something I completely overlooked. Trying it in my algorithm, it reduces the elapsed time by 25%. I've yet to try your algorithm though. I will do so tonight and report back tomorrow.

POSTED BY: Richard Frost

I'm still not 100% sure what you want, but I think it can be done using this code:

ClearAll[RemoveSubsets]
RemoveSubsets[sslist_List] := Module[{unique, db, idb, ss, i, sel, patt, poss},
  unique = Union[Flatten[sslist]];
  db = Thread[unique -> Range[Length[unique]]];
  idb = Thread[Range[Length[unique]] -> unique];
  ss = Replace[sslist, db, {2}];
  ss = Sort /@ SortBy[ss, Length];

  PrintTemporary[Dynamic[{i, Length[ss]}]];

  i = Length[ss];
  While[i > 1,
   sel = ss[[i]];
   patt = {Repeated[Alternatives @@ sel]};
   poss = Position[ss[[;; i - 1]], patt, {1}];
   ss = Delete[ss, poss];
   i -= Length[poss];
   i--;
   ];

  ss /. idb
  ]

SeedRandom[1234];
words = ResourceFunction["RandomString"][Characters["atcg"], {1000,10}]; (* database of 1000 words (of length 10) *)

(* made 10k lists of max length 10*)
ss = RandomChoice[words, {10000, 10}]; 
ss = Table[Take[s, RandomInteger[{1, Length[s]}]], {s, ss}];

RemoveSubsets[ss]

Is this what you want? Can probably be made faster using Replace instead of Position/Delete…

POSTED BY: Sander Huisman

Hi Sander, consider this list of String sets:

{
{"ATGA"},
{"GAGAC","TTTA"},
{"ATGA","TTTA"},
{"AAAAAT","GCGAGAA"},
{"AAAAAT","GAGCC","GCGAGAA"}
}

The first is a subset of the third and the fourth is a subset of the fifth, so after filtering the result would be:

{
{"GAGAC","TTTA"},
{"ATGA","TTTA"},
{"AAAAAT","GAGCC","GCGAGAA"}
}
POSTED BY: Richard Frost

Hi Richard,

Can you give a sample code? It is not clear which should or should not be subset of what? Since you have a list of lists with subsets, it is not clear. Do you have a sample with (say) 1000 entries? And what you are trying to achieve. Maybe something can be sped up by using rules or …

POSTED BY: Sander Huisman
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract