Group Abstract

Message Boards

9K Views

2 Replies

3 Total Likes

View groups...

Follow this post

Share this post:

GROUPS:

Computer Science Mathematics Mathematica Statistics and Probability

Posted 11 years ago

Sampling without replacement is a critical component of many Monte-Carlo and machine-learning applications. Consider the following call of `RandomSample`, which computes a small sample, without replacement, uniformly from the interval $[0,10^7)$: Timing@RandomSample[Range[10000000] - 1, 5] {0.120953, {7991714, 6487537, 7898730, 1186713, 3651374}} Not bad, but it scales linearly with the population size, and is slow for larger populations: Timing@RandomSample[Range[1000000000] - 1, 5] {33.424484, {334003748, 219706235, 447223582, 997340429, 315472718}} An uncritical implementation (in Mathematica!) of Jeffrey S. Vitter's "Method D" is orders of magnitude faster and scales with the sample size, though it produces the values in order: Timing@Reap[ sampleWithoutReplacementMethodD[1000000000, 5], {"sample"}][[2, 1, 1]] {0.000307, {306405595, 308696939, 451351161, 545005165, 730112554}} The method is described in the paper Jeffrey Scott Vitter in "An Efficient Algorithm for Sequential Random Sampling," ACM Transactions on Mathematical Software, 13(1), March 1987, 58-67 and the implementation is in the attached notebook. Attachments:

POSTED BY: Brian Beckman

2 Replies

Sort By:

Posted 11 years ago

That's a nice solution. The only downsides w.r.t. Vitter's that I can think of are (1) the storage for the hash table, which can be mitigated by making it a Module variable so it's garbage-collected; permuting the ordered output of Vitter's in the ordinary use-case would also entail extra storage, so, in practice, I call that a wash, (2) the time to regenerate numbers in case of collision, negligible unless the sample size approaches the population size, which is not the use case here, where we are interested in small samples from very large populations of integers.

POSTED BY: Brian Beckman

Posted 11 years ago

`RandomSample` takes a set as input. It scales linearly in the size of the set, in the above examples, for the fairly obvious reason that it has to first create the set. A fast way to get a few elements between 1 and a given maximum, without replacement, is as below. It uses hashing under the hood to remember what's been chosen. We allow for multiple calls, keeping already chosen elements from being reselected. I did not add error handling for the case where there are not sufficiently many elements remaining to fill the order. randomSampleFromRange[n_, r_, taken_] := Reap[Module[ {j = 0, val}, While[j < r, val = RandomInteger[{1, n}]; If[! TrueQ[taken[val]], j++; Sow[val]; taken[val] = True]; ]; ]][[2, 1]] Example: Clear[taken] Timing[randomSampleFromRange[10^10, 20, taken]] (* Out[15]= {0., {5225044465, 7784713962, 4676630142, 5502367616, 561195181, 3490396528, 4458882112, 5747601606, 8671904831, 7764579636, 6624948369, 4611866065, 4057914586, 8132719602, 4118571904, 6751169498, 9765405784, 2128317632, 6907625021, 1835393080}} *)

POSTED BY: Daniel Lichtblau

Reply to this discussion

Reply Preview

Attachments

Remove Add a file to this post

Follow this discussion

or Discard

Feedback