Message Boards Message Boards

0
|
7244 Views
|
2 Replies
|
1 Total Likes
View groups...
Share
Share this post:

Sampling Without Replacement much faster than Built-In

Posted 10 years ago

Sampling without replacement is a critical component of many Monte-Carlo and machine-learning applications.

Consider the following call of RandomSample, which computes a small sample, without replacement, uniformly from the interval $[0,10^7)$:

Timing@RandomSample[Range[10000000] - 1, 5]
{0.120953, {7991714, 6487537, 7898730, 1186713, 3651374}}

Not bad, but it scales linearly with the population size, and is slow for larger populations:

Timing@RandomSample[Range[1000000000] - 1, 5]
{33.424484, {334003748, 219706235, 447223582, 997340429, 315472718}}

An uncritical implementation (in Mathematica!) of Jeffrey S. Vitter's "Method D" is orders of magnitude faster and scales with the sample size, though it produces the values in order:

Timing@Reap[
   sampleWithoutReplacementMethodD[1000000000, 5], {"sample"}][[2, 1, 1]]
{0.000307, {306405595, 308696939, 451351161, 545005165, 730112554}}

The method is described in the paper Jeffrey Scott Vitter in "An Efficient Algorithm for Sequential Random Sampling," ACM Transactions on Mathematical Software, 13(1), March 1987, 58-67 and the implementation is in the attached notebook.

Attachments:
POSTED BY: Brian Beckman
2 Replies

RandomSample takes a set as input. It scales linearly in the size of the set, in the above examples, for the fairly obvious reason that it has to first create the set.

A fast way to get a few elements between 1 and a given maximum, without replacement, is as below. It uses hashing under the hood to remember what's been chosen. We allow for multiple calls, keeping already chosen elements from being reselected. I did not add error handling for the case where there are not sufficiently many elements remaining to fill the order.

randomSampleFromRange[n_, r_, taken_] := Reap[Module[
    {j = 0, val},
    While[j < r,
      val = RandomInteger[{1, n}];
      If[! TrueQ[taken[val]], j++; Sow[val]; taken[val] = True];
      ];
    ]][[2, 1]]

Example:

Clear[taken]
Timing[randomSampleFromRange[10^10, 20, taken]]

(* Out[15]= {0., {5225044465, 7784713962, 4676630142, 5502367616, 
  561195181, 3490396528, 4458882112, 5747601606, 8671904831, 
  7764579636, 6624948369, 4611866065, 4057914586, 8132719602, 
  4118571904, 6751169498, 9765405784, 2128317632, 6907625021, 
  1835393080}} *)
POSTED BY: Daniel Lichtblau

That's a nice solution. The only downsides w.r.t. Vitter's that I can think of are (1) the storage for the hash table, which can be mitigated by making it a Module variable so it's garbage-collected; permuting the ordered output of Vitter's in the ordinary use-case would also entail extra storage, so, in practice, I call that a wash, (2) the time to regenerate numbers in case of collision, negligible unless the sample size approaches the population size, which is not the use case here, where we are interested in small samples from very large populations of integers.

POSTED BY: Brian Beckman
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract