Sampling without replacement is a critical component of many Monte-Carlo and machine-learning applications.
Consider the following call of RandomSample
, which computes a small sample, without replacement, uniformly from the interval $[0,10^7)$:
Timing@RandomSample[Range[10000000] - 1, 5]
{0.120953, {7991714, 6487537, 7898730, 1186713, 3651374}}
Not bad, but it scales linearly with the population size, and is slow for larger populations:
Timing@RandomSample[Range[1000000000] - 1, 5]
{33.424484, {334003748, 219706235, 447223582, 997340429, 315472718}}
An uncritical implementation (in Mathematica!) of Jeffrey S. Vitter's "Method D" is orders of magnitude faster and scales with the sample size, though it produces the values in order:
Timing@Reap[
sampleWithoutReplacementMethodD[1000000000, 5], {"sample"}][[2, 1, 1]]
{0.000307, {306405595, 308696939, 451351161, 545005165, 730112554}}
The method is described in the paper Jeffrey Scott Vitter in "An Efficient Algorithm for Sequential Random Sampling," ACM Transactions on Mathematical Software, 13(1), March 1987, 58-67 and the implementation is in the attached notebook.
Attachments: