Group Abstract Group Abstract

Message Boards Message Boards

0
|
3.2K Views
|
3 Replies
|
3 Total Likes
View groups...
Share
Share this post:

Finding strings in a 12 million entries database

Posted 1 year ago
POSTED BY: Rodrigo Amor
3 Replies
Posted 1 year ago

Thank you very much Daniel and Chris. As soon as my code finish running, I will try your solutions.

Amazing to have this kind of help.

Regards Rodrigo

POSTED BY: Rodrigo Amor

This should help. First some helper code to construct an example.

randomString[n_] := StringJoin @@ Map[FromCharacterCode, RandomChoice[Range[65, 90], n]]
randomEntry[n_, m_] := Prepend[RandomInteger[{0, 9}, m], randomString[n]]
randomEntrySet[n_, m_, l_] := Table[randomEntry[n, m], {l}]

Quick example of what this does:

In[1229]:= randomEntrySet[3, 4, 10]

Out[1229]= {{"LWN", 9, 1, 8, 7}, {"JDD", 2, 2, 0, 7}, {"BKE", 3, 2, 6, 7},
  {"ZGS", 5, 9, 3, 2}, {"JPW", 1, 1, 3, 9}, {"FBK", 5, 8, 4, 8},
  {"IPR", 7, 9, 1, 9}, {"FQD", 9, 3, 0, 2}, {"SDD", 9, 4, 3,  2}, {"UVG", 1, 3, 3, 5}}

Create a random example for a dataset of length one million and a set of one thousand keys selected at random from that set.

SeedRandom[1234];
dsize = 10^6;
randdata = randomEntrySet[10, 10, dsize];
ksize = 10^3;
randSKUs = randdata[[RandomChoice[Range@dsize, ksize], 1]];

Your lookup iterates over all the SKUs we seek. This has the possible advantage of finding them in the order they are listed (if that's important to your work).

Timing[skuposns = Flatten@Table[
     Position[randdata[[All, 1]], randSKUs[[i]]], {i, 1, Length@randSKUs}];]

(* Out[1235]= {44.2355, Null} *)

If we use Alternatives it is much faster to find these positions.

Timing[skuposns2 = Flatten[Position[randdata[[All, 1]], Apply[Alternatives, randSKUs]]];]

(* Out[1236]= {0.095339, Null} *)

These are not in the same ordering. The first is ordered by where the SKU appears in the search list while the second is ordered by where a sought-for SKU appears in the data set. It is easy to show this so I'll skip that and just show that the two are equivalent as unordered sets.

skuposns === skuposns2, Union[skuposns] === Union[skuposns2]}

(* Out[1263]= {False, True} *)

Why did I use Union instead of, say, Sort? In this particular example it happens that a key we sought for is repeated in the set, so the result lengths differ by one since there's a duplicate in the first set. Which might be a disadvantage, depending on actual usage requirements.

If you do not care about the ordering of the positions then you are done. If you require the ordering of the first result, that can be obtained with a bit more work. And it's much faster since we've reduced the size of the set we'd need to work with. If this is a requirement feel free to note that and people can provide guidance.

POSTED BY: Daniel Lichtblau
Posted 1 year ago
POSTED BY: Chris P
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard