Thank you for the example.
I tested RegularExpression and StringExpression on one of my genomic computations, which checks 2773 different patterns against several chromosomes. In these tests, the chromosome is 35.7M characters long.
Here are two examples of the 2773 match sequences for the RegularExpression test:
GGTG.{35,35}TTAT
GGTG.{13,13}CCAA.{86,86}TTAT
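For readers who want to try the pattern form on something small, here is a minimal sketch of how the first pattern might be applied. The toy string `seq` and the use of StringPosition are assumptions for illustration; the actual computation in the timed runs is not shown in this post.

```mathematica
(* Hypothetical toy sequence standing in for a chromosome:
   "GGTG", then 35 filler bases, then "TTAT", with two bases of padding on each side *)
seq = "AC" <> "GGTG" <> StringJoin[ConstantArray["A", 35]] <> "TTAT" <> "CA";

(* Start and end positions of each match of the first example pattern *)
StringPosition[seq, RegularExpression["GGTG.{35,35}TTAT"]]
(* {{3, 45}} *)
```

The quantifier `{35,35}` means exactly 35 characters, so `.{35}` would be an equivalent spelling.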
Here are the timing results:
read "EM03 chr 1 F parts clusters.mx.gz" (31.7237MB) (2.05 seconds)
r, k = 4713, 123 (24.5 seconds)
r, k = 9426, 311 (36.4 seconds)
r, k = 14139, 570 (50.1 seconds)
r, k = 18852, 792 (42.9 seconds)
r, k = 23565, 1123 (1.07 minutes)
r, k = 28278, 1429 (59.8 seconds)
r, k = 32991, 1767 (1.10 minutes)
r, k = 37704, 2072 (59.6 seconds)
r, k = 42417, 2415 (1.13 minutes)
r, k = 47130, 2773 (1.20 minutes)
finished cluster vetting r, k = 47132, 2773 in 9.05 minutes
Here are the same two match sequences for the StringExpression test:
GGTG~~Repeated[_,{35,35}]~~TTAT
GGTG~~Repeated[_,{13,13}]~~CCAA~~Repeated[_,{86,86}]~~TTAT
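As entered in code, the literal fragments would need to be strings; the two lines above appear to be the displayed form, where string quotes are not shown. A minimal sketch on a toy input follows (the string `seq` and the use of StringPosition are assumptions for illustration, not the code that was timed):

```mathematica
(* Hypothetical toy sequence standing in for a chromosome:
   "GGTG", then 35 filler bases, then "TTAT", with two bases of padding on each side *)
seq = "AC" <> "GGTG" <> StringJoin[ConstantArray["A", 35]] <> "TTAT" <> "CA";

(* StringExpression form of the first example pattern;
   _ matches any single character, repeated exactly 35 times *)
pat = "GGTG" ~~ Repeated[_, {35, 35}] ~~ "TTAT";

StringPosition[seq, pat]
(* {{3, 45}} *)
```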
Here are the timing results:
read "EM03 chr 1 F parts clusters.mx.gz" (31.7237MB) (2.01 seconds)
r, k = 4713, 123 (1.64 minutes)
r, k = 9426, 311 (2.86 minutes)
r, k = 14139, 570 (4.06 minutes)
r, k = 18852, 792 (3.46 minutes)
r, k = 23565, 1123 (5.28 minutes)
r, k = 28278, 1429 (4.89 minutes)
r, k = 32991, 1767 (5.56 minutes)
r, k = 37704, 2072 (4.97 minutes)
r, k = 42417, 2415 (5.58 minutes)
r, k = 47130, 2773 (5.80 minutes)
finished cluster vetting r, k = 47132, 2773 in 44.1 minutes
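To put the two runs side by side, the arithmetic can be sketched as follows. The numbers are copied from the logs above, with the per-batch times converted to minutes; the variable names are only for illustration.

```mathematica
(* Total run times in minutes, from the two "finished cluster vetting" lines *)
regexTotal = 9.05;
stringExprTotal = 44.1;

(* Overall slowdown of StringExpression relative to RegularExpression *)
stringExprTotal/regexTotal
(* roughly 4.9 *)

(* Per-batch times in minutes, in log order (seconds converted by dividing by 60) *)
regexBatches = {24.5/60, 36.4/60, 50.1/60, 42.9/60, 1.07, 59.8/60, 1.10, 59.6/60, 1.13, 1.20};
stringExprBatches = {1.64, 2.86, 4.06, 3.46, 5.28, 4.89, 5.56, 4.97, 5.58, 5.80};

(* Per-batch slowdown factors *)
stringExprBatches/regexBatches
```

On these figures the StringExpression run takes roughly 4 to 5 times as long in every batch, and about 4.9 times as long overall.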