Message Boards Message Boards

Generate weighted random numbers

Posted 11 years ago
Hi there,

I need to generate weighted random numbers.
The resulting distribution of random numbers should correspond to a scalar function W(a, b).
The function Random[] only gives a uniformly distribution. I think there is a function to generate weighted random numbers, but seems it does not work with a continuously function.
Any ideas how to do this with mathematica? Thanks a lot in advance!

Cheers,
Michael
POSTED BY: Michael Knoerzer
13 Replies
Posted 5 months ago

If I wanted a random sampling of a weighted distribution that was not a normal distribution or probability distribution, WM documentation implies that I use

p5 = RandomSample[w -> n, 5]

where w = weight and n = variable. How does Mathematica make this weighted sampling??

POSTED BY: Raymond Low
Posted 5 months ago

Probably by accumulating weights together, taking an uniform sample on the total range of those weights, and figuring out on what interval corresponding to a value the sample ended (or doing the same by subtracting weights from a random sample until it is below zero), then performing the same with remaining values and weights for the next sample in the case of RandomSample. Like this, but more efficiently:

ClearAll[randomSample];
randomSample[w0_List -> v0_List, n0_ : default] :=
  With[{n = Min[n0, default] /. default -> Length[w0]},
   Last[Nest[
     Apply[Function[{w, v, l},
       With[{r = Total[w] RandomReal[]},
        LengthWhile[Accumulate[Most[w]], # < r &] + 1 //
         {Delete[w, #], Delete[v, #], Append[l, v[[#]]]} &]]],
     {w0, v0, {}}, n]]];

randomSample[{1, 3, 5} -> {a, b, c}, 2]

(* {c, b} *)
POSTED BY: Jari Kirma
Posted 4 months ago

Thank you, Jari. I have read your code many times, but I don't know WM well enough to picture the verbiage, I imagined that w -> n would have to regenerate the database enumerating the variables as many times as their weight. RandomSample - to create random numbers to choose variables and then remove them before the next selection. It would seem simpler to take RandomSample of the original database, and then to Tally the database before taking the RandomSample. If the database had duplicates, does RandomSample remove the duplicates or just not replace the variable selected?

Again thank you for your insights.

POSTED BY: Raymond Low
Posted 4 months ago

If a value is present multiple times in the input (and zero weights are not involved) it can also be present multiple times in the result from RandomSample.

I'm not entirely certain how to interpret your question, but it should also be noted that - although not obvious as a feature - weights can actually affect the order of items on the result, even when it is a permutation. Consider the following, for instance:

RandomSample[{1, 1000, 1000000, 1000000000} -> {1, 2, 1, 3}]

This almost always results the following:

{3, 1, 2, 1}

Thus tallying in order to avoid manipulating the value and weight lists is unlikely to work as expected.

POSTED BY: Jari Kirma

If only a small number of values are required it will just pick uniform random values between 0 an 1, use the weights in effect to form subintervals of (0,1) to see which input from the list gets selected. That element is then marked as removed, so further selections of that element will be discarded.

When many elements are needed, the idea is to prune the list every so often, and reweight accordingly. This at least is how I recall it being done. But this was done years ago and not by me, so I might be remembering incorrectly how it was actually coded.

POSTED BY: Daniel Lichtblau
Posted 4 months ago

"Discarding" a value in further selections can mean two things: re-running selection step if the item is already taken, or discarding it from further selection candidates. With very skewed probabilities the first option can be very inefficient, while the second is just (relatively) fine.

Based on run time performance I could imagine something like the following being done per resulting value in C pseudocode:

item_t getone(item_t *values, double *weights, double *weightsum) {
  r = *weightsum * randomreal();
  i = 0;

  while (r > weights[i]) {
    r -= weights[i];
    i++;
  }

  *weightsum -= weights[i];
  weights[i] = 0;

  return values[i];
}

One could even perform in-place compaction for such arrays... but I guess I'm thinking on quite non-Mathematica abstraction level here.

POSTED BY: Jari Kirma

I meant the first. Actual deletion is O(n), so not really fine, unless absolutely necessary, if the list is long. I do not know how the code deals with the lopsided weights issue but most likely by having a threshold of misses-to-hits after which reweighting is done; that's what I mean by "absolutely necessary".

As for in-place compaction, possibly that can be done using an indexed priority queue, if one is willing to accept each selection being O(log n) instead of O(1). Operative term being "possibly".

POSTED BY: Daniel Lichtblau
Posted 4 months ago

There are certainly plenty of options available here if one really wants to go into optimising certain kinds of workloads. Just that it's pretty obvious from run time observations that the simplest strategy of "repeat if item is marked already used" is not at least always used, because if it would, taking a permutation of a list with trillion-to-one weight distribution would not be effectively instantaneous...

POSTED BY: Jari Kirma
Posted 4 months ago

the nuances escape me

db = RandomInteger[89, 5589];
db[[10 ;; 20]]
Histogram[db, 89]

enter image description here

My question would be - which method would be a better representation of the database, RandomSample on the database or RandomSample on the Tally of the database?

dbTally = Tally[db];
dbTally[[10 ;; 20]];
w = dbTally[[All, 2]];
n = dbTally[[All, 1]];
sampledb = RandomSample[db, 7]
sampleTally = RandomSample[w -> n, 7]

I guess you have answered the question "Thus tallying in order to avoid manipulating the value and weight lists is unlikely to work as expected." And the extra work only adds distortion to the selection.

POSTED BY: Raymond Low
Posted 4 months ago

Ah, it may be that I misunderstood something...

RandomSample removes the sampled value from the value list before sampling again. So, RandomSample[{1000, 1} -> {1, 2}, 2] may have 1 only once in a result, and only twice in RandomSample[{999, 1, 1} -> {1, 2, 3}, 3], while with corresponding RandomChoice both are capable and very likely to return only ones.

If RandomSample db directly it is capable of returning each unique value the number of times Tally would indicate...

POSTED BY: Jari Kirma
Thanks a lot for your replies. I need to check if I can somehow use a built-in distribution. I am not sure about this.
POSTED BY: Michael Knoerzer
Posted 11 years ago
As Daniel Lichtblau stated, RandomVariate combined with a specific distribution (such as NormalDistribution) does the trick, and Mathematica provides plenty of built-in distributions to play with. One benefit of using built-in distributions is that we can easily and efficiently manipulate them symbolically.

For instance, numeric random variates of a specific distributions are available:
 RandomVariate[NormalDistribution[0, 1], {20}]
 
 (* {0.270132, -1.79075, 0.0748062, 0.715661, 0.617855, 0.202395, \
 -1.01686, -0.125683, 1.5585, 1.42078, 2.04629, -0.309086, -1.33024, \
 0.591959, 0.530648, -0.52868, -1.07803, 0.504637, 0.327951, -0.569275} *)
 
 Variance[%]
 
 (* 0.963508 *)
... but also, it is possible compare this with symbolic result:
Variance[NormalDistribution[0, 1]]

(* 1 *)

EDIT
: This is probably closest to what you want. This is an example directly from Mathematica documentation:
dist = ProbabilityDistribution[(Sqrt[2] / Pi) (1 / (1 + x^4)), {x, -Infinity, Infinity}];
It's important to define the function so that it integrates to unity over range of defined ProbabilityDistribution.
Integrate[(Sqrt[2] / Pi) (1 / (1 + x^4)), {x, -Infinity, Infinity}]

(* 1 *)
Some random variates:
Histogram[RandomVariate[dist, {1000}]]

And symbolic manipulation is possible:
Variance[dist]

(* 1 *)

Also, using TransformedDistribution, we can also map one distribution to another. Let's take inverse function of cumulative distribution function of a normal distribution:
InverseFunction[CDF[NormalDistribution[0, 1]]][x]

(* -Sqrt[2] InverseErfc[2 x] *)
Now we use this to map uniform distribution to above normal distribution:
dist = TransformedDistribution[-Sqrt[2] InverseErfc[2 x], x \[Distributed] UniformDistribution[{0, 1}]];
We can generate transformed numeric random variates this way:
Histogram[RandomVariate[dist, {10000}]]

Also, symbolic manipulation is still possible. Now it's considerably slower, though (first value is runtime in seconds, second is the symbolic result):
Variance[dist] // AbsoluteTiming

(* {2.605953, 1} *)
POSTED BY: Jari Kirma
If your W corresponds to a probability distribution (which I would think is the case) then RandomVariate is meant for this purpose.
POSTED BY: Daniel Lichtblau
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract