Message Boards Message Boards

0
|
654 Views
|
3 Replies
|
3 Total Likes
View groups...
Share
Share this post:

Finding strings in a 12 million entries database

Posted 2 months ago

Hello Wolfram community

My problem is really easy and I am sure it has a faster solution than the one I have.

I have a Table in the form

database= {{"AAAAAAAAAA",1,2,3,4,5,6...},{"BBBBBBBBBB",1,2,3,4,5,6...}...,,{"Z9923X2130",1,2,3,4,5,6...}}

I want to know the position (row) of "AAAAAAAAAA", "BBBBBBBBBB" and so on. Right now what I am doing is

Position[database[[All,1]], "AAAAAAAAAA"]

I can put all the strings I want to find in a list and then find all in "bulk"

skustofind ={"AAAAAAAAAA", "BBBBBBBBBB"}
results=Flatten@Table[Position[database[[All,1]], skustofind[[i]]],{i,1,Length@skustofind}]

This works perfectly fine. However, my list of strings to find can be thousands of elements at a time. As the database is also 12 million rows, this process takes too long.

I have tried Select also but it seems it is quite slow also. I am sure I am missing something and would love to have another point of view.

Thanks
Rodrigo

POSTED BY: Rodrigo Amor
3 Replies
Posted 2 months ago
POSTED BY: Rodrigo Amor

This should help. First some helper code to construct an example.

randomString[n_] := StringJoin @@ Map[FromCharacterCode, RandomChoice[Range[65, 90], n]]
randomEntry[n_, m_] := Prepend[RandomInteger[{0, 9}, m], randomString[n]]
randomEntrySet[n_, m_, l_] := Table[randomEntry[n, m], {l}]

Quick example of what this does:

In[1229]:= randomEntrySet[3, 4, 10]

Out[1229]= {{"LWN", 9, 1, 8, 7}, {"JDD", 2, 2, 0, 7}, {"BKE", 3, 2, 6, 7},
  {"ZGS", 5, 9, 3, 2}, {"JPW", 1, 1, 3, 9}, {"FBK", 5, 8, 4, 8},
  {"IPR", 7, 9, 1, 9}, {"FQD", 9, 3, 0, 2}, {"SDD", 9, 4, 3,  2}, {"UVG", 1, 3, 3, 5}}

Create a random example for a dataset of length one million and a set of one thousand keys selected at random from that set.

SeedRandom[1234];
dsize = 10^6;
randdata = randomEntrySet[10, 10, dsize];
ksize = 10^3;
randSKUs = randdata[[RandomChoice[Range@dsize, ksize], 1]];

Your lookup iterates over all the SKUs we seek. This has the possible advantage of finding them in the order they are listed (if that's important to your work).

Timing[skuposns = Flatten@Table[
     Position[randdata[[All, 1]], randSKUs[[i]]], {i, 1, Length@randSKUs}];]

(* Out[1235]= {44.2355, Null} *)

If we use Alternatives it is much faster to find these positions.

Timing[skuposns2 = Flatten[Position[randdata[[All, 1]], Apply[Alternatives, randSKUs]]];]

(* Out[1236]= {0.095339, Null} *)

These are not in the same ordering. The first is ordered by where the SKU appears in the search list while the second is ordered by where a sought-for SKU appears in the data set. It is easy to show this so I'll skip that and just show that the two are equivalent as unordered sets.

skuposns === skuposns2, Union[skuposns] === Union[skuposns2]}

(* Out[1263]= {False, True} *)

Why did I use Union instead of, say, Sort? In this particular example it happens that a key we sought for is repeated in the set, so the result lengths differ by one since there's a duplicate in the first set. Which might be a disadvantage, depending on actual usage requirements.

If you do not care about the ordering of the positions then you are done. If you require the ordering of the first result, that can be obtained with a bit more work. And it's much faster since we've reduced the size of the set we'd need to work with. If this is a requirement feel free to note that and people can provide guidance.

POSTED BY: Daniel Lichtblau
Posted 2 months ago

Yes it very simple to make it 10000 x faster when testing a 12 million strings list ;)

The simplest solution is maybe to build first an optimized (Dispatch) list of rules (Rule) which associates the strings with their position in the database.

Let's make a 12 million strings list similar to the one (skustofind) you give in your example:

chars = Join[CharacterRange["A", "Z"], CharacterRange["0", "9"]];
skustofind = Table[RandomChoice[chars, 10] // StringJoin, 12000000];

Then we build the optimized table describing which string corresponds to which position:

dispatch = Dispatch@Table[ skustofind[[i]] -> i, {i, 1, Length@skustofind}];

(Or in a more functional programming way you could write also dispatch=Dispatch@Thread@Rule[skustofind, Range[Length@skustofind]])

That's it, let's compare the time it takes using Position and the Dispatch table.

For example if the test string is:

testString=skustofind[[10000000]]

EH2AMN0CW0

then

Position[skustofind, testString] // AbsoluteTiming

{0.268778, {{10000000}}}

while

testString /. dispatch // AbsoluteTiming

{0.00001, 10000000}

The dispatch approach is about (0.26/0.00002 =)10000x faster here. The bigger your string list is, the longer it will take for Position to find while it is practically the same constant time with the dispatch way.

Alternative approaches (to build the same kind of optimized table as dispatch) is for example to use Association or a DataStructure but the final timings are the same.

POSTED BY: Chris P
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract