Message Boards

WOLFRAM COMMUNITY

654 Views

3 Replies

3 Total Likes

View groups...

Follow this post

Share this post:

GROUPS:

Data Science Wolfram Language

Finding strings in a 12 million entries database

Rodrigo Amor

Posted 2 months ago

Hello Wolfram community My problem is really easy and I am sure it has a faster solution than the one I have. I have a Table in the form database= {{"AAAAAAAAAA",1,2,3,4,5,6...},{"BBBBBBBBBB",1,2,3,4,5,6...}...,,{"Z9923X2130",1,2,3,4,5,6...}} I want to know the position (row) of "AAAAAAAAAA", "BBBBBBBBBB" and so on. Right now what I am doing is Position[database[[All,1]], "AAAAAAAAAA"] I can put all the strings I want to find in a list and then find all in "bulk" skustofind ={"AAAAAAAAAA", "BBBBBBBBBB"} results=Flatten@Table[Position[database[[All,1]], skustofind[[i]]],{i,1,Length@skustofind}] This works perfectly fine. However, my list of strings to find can be thousands of elements at a time. As the database is also 12 million rows, this process takes too long. I have tried Select also but it seems it is quite slow also. I am sure I am missing something and would love to have another point of view. Thanks Rodrigo

POSTED BY: Rodrigo Amor

3 Replies

Sort By:

Rodrigo Amor

Posted 2 months ago

POSTED BY: Rodrigo Amor

Daniel Lichtblau

Daniel Lichtblau, Wolfram Research

Posted 2 months ago

This should help. First some helper code to construct an example. randomString[n_] := StringJoin @@ Map[FromCharacterCode, RandomChoice[Range[65, 90], n]] randomEntry[n_, m_] := Prepend[RandomInteger[{0, 9}, m], randomString[n]] randomEntrySet[n_, m_, l_] := Table[randomEntry[n, m], {l}] Quick example of what this does: In[1229]:= randomEntrySet[3, 4, 10] Out[1229]= {{"LWN", 9, 1, 8, 7}, {"JDD", 2, 2, 0, 7}, {"BKE", 3, 2, 6, 7}, {"ZGS", 5, 9, 3, 2}, {"JPW", 1, 1, 3, 9}, {"FBK", 5, 8, 4, 8}, {"IPR", 7, 9, 1, 9}, {"FQD", 9, 3, 0, 2}, {"SDD", 9, 4, 3, 2}, {"UVG", 1, 3, 3, 5}} Create a random example for a dataset of length one million and a set of one thousand keys selected at random from that set. SeedRandom[1234]; dsize = 10^6; randdata = randomEntrySet[10, 10, dsize]; ksize = 10^3; randSKUs = randdata[[RandomChoice[Range@dsize, ksize], 1]]; Your lookup iterates over all the SKUs we seek. This has the possible advantage of finding them in the order they are listed (if that's important to your work). Timing[skuposns = Flatten@Table[ Position[randdata[[All, 1]], randSKUs[[i]]], {i, 1, Length@randSKUs}];] (* Out[1235]= {44.2355, Null} ) If we use `Alternatives` it is much faster to find these positions. Timing[skuposns2 = Flatten[Position[randdata[[All, 1]], Apply[Alternatives, randSKUs]]];] ( Out[1236]= {0.095339, Null} ) These are not in the same ordering. The first is ordered by where the SKU appears in the search list while the second is ordered by where a sought-for SKU appears in the data set. It is easy to show this so I'll skip that and just show that the two are equivalent as unordered sets. skuposns === skuposns2, Union[skuposns] === Union[skuposns2]} ( Out[1263]= {False, True} *) Why did I use `Union` instead of, say, `Sort`? In this particular example it happens that a key we sought for is repeated in the set, so the result lengths differ by one since there's a duplicate in the first set. Which might be a disadvantage, depending on actual usage requirements. If you do not care about the ordering of the positions then you are done. If you require the ordering of the first result, that can be obtained with a bit more work. And it's much faster since we've reduced the size of the set we'd need to work with. If this is a requirement feel free to note that and people can provide guidance.

POSTED BY: Daniel Lichtblau

Chris P

Posted 2 months ago

Yes it very simple to make it 10000 x faster when testing a 12 million strings list ;) The simplest solution is maybe to build first an optimized (`Dispatch`) list of rules (`Rule`) which associates the strings with their position in the database. Let's make a 12 million strings list similar to the one (`skustofind`) you give in your example: chars = Join[CharacterRange["A", "Z"], CharacterRange["0", "9"]]; skustofind = Table[RandomChoice[chars, 10] // StringJoin, 12000000]; Then we build the optimized table describing which string corresponds to which position: dispatch = Dispatch@Table[ skustofind[[i]] -> i, {i, 1, Length@skustofind}]; (Or in a more functional programming way you could write also `dispatch=Dispatch@Thread@Rule[skustofind, Range[Length@skustofind]]`) That's it, let's compare the time it takes using Position and the Dispatch table. For example if the test string is: testString=skustofind[[10000000]] `EH2AMN0CW0` then Position[skustofind, testString] // AbsoluteTiming `{0.268778, {{10000000}}}` while testString /. dispatch // AbsoluteTiming `{0.00001, 10000000}` The dispatch approach is about (0.26/0.00002 =)10000x faster here. The bigger your string list is, the longer it will take for Position to find while it is practically the same constant time with the dispatch way. Alternative approaches (to build the same kind of optimized table as dispatch) is for example to use Association or a DataStructure but the final timings are the same.

POSTED BY: Chris P

Reply to this discussion

Reply Preview

Attachments

Remove Add a file to this post

Follow this discussion

or Discard

Group Abstract

Feedback