CentralMoment[2] vs. Variance using WeightedData

Dear All,

I have the following data:

value = {3,4,5,6,7,8,9,10,11,12,13,14,15,16,17};
weights = {11396160,108204708,262511086,298022123,200846772,91005916,29926435,7513828,1495797,242405,32428,3602,358,42,1};
ListPlot[{value,weights}//Transpose]

[Image: plot of weights against value]

The weights can be interpreted as (unnormalized) probabilities, and the value-weight pairs as a distribution.

wd = WeightedData[value,weights]

To calculate the variance, WeightedData can help. First I calculate the mean:

mean = N@Mean[wd]  (* same as value.weights/Total[weights] *)
5.99879

This seems fine: I calculated it in two ways, and they match. Now for the variance:

((value - mean)^2).weights/Total[weights]
(* 1.78035 *)
CentralMoment[wd,2]//N
(* 1.78035 *)
MomentConvert[CentralMoment[2],"Moment"]/.Moment[x_]:>Moment[wd,x]//N
(* 1.78035 *)
Variance[wd]//N
(* 2.26599 *)

The first three are identical, but the fourth is not. This might be because CentralMoment divides by Length[value] while Variance divides by Length[value]-1. We can correct for that:

Variance[wd](Length[value]-1)/Length[value]//N
2.11492

But still not the same value! How does Variance work with WeightedData? And why does it differ from CentralMoment[...,2]?

POSTED BY: Sander Huisman
5 Replies
Posted 7 years ago

To estimate variance (or the mean for that matter) of some distribution you'll need something stronger to justify the "standardized weights" as probabilities. Was this a "simple random sample" from a larger population of integer values?

The formula that seems to be used for Variance with weighted data is as follows:

Variance[WeightedData[x, w]]
(Total[w x^2] - Total[w x]^2/Total[w])/(Total[w] - Total[w^2]/Total[w])
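
In conventional notation the expression above can be read as follows. This is only a sketch: the symbols $V_1 = \sum_i w_i$, $V_2 = \sum_i w_i^2$ and $\mu^* = \sum_i w_i x_i / V_1$ are introduced here for illustration and are not taken from the documentation.

$$\sum_i w_i x_i^2 - \frac{\left(\sum_i w_i x_i\right)^2}{V_1} = \sum_i w_i (x_i - \mu^*)^2$$

$$s^2 = \frac{\sum_i w_i (x_i - \mu^*)^2}{V_1 - V_2/V_1}$$

Multiplying every weight by a constant $c$ scales both the numerator and the denominator by $c$, so under this reading the result depends only on the relative weights, not on their overall size.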

This (as you've noticed) doesn't always match what one would expect when the weights are frequency counts. Here are two examples. When the weights are all 1, then all formulas match:

x = {1, 12, 3};
w = {1, 1, 1};
Variance[WeightedData[x, w]]
(* 103/3 *)
(Total[w x^2] - Total[w x]^2/Total[w])/(Total[w] - Total[w^2]/Total[w])
(* 103/3 *)
Variance[{1, 12, 3}]
(* 103/3 *)
(Total[w x^2] - Total[w x]^2/Total[w])/(Total[w] - 1)
(* 103/3 *)

But if the weights are all changed from 1 to 2, then Variance with weighted data no longer matches the result for frequency counts:

x = {1, 12, 3};
w = {2, 2, 2};
Variance[WeightedData[x, w]]
(* 103/3 *)
(Total[w x^2] - Total[w x]^2/Total[w])/(Total[w] - Total[w^2]/Total[w])
(* 103/3 *)
Variance[{1, 1, 12, 12, 3, 3}]
(* 412/15 *)
(Total[w x^2] - Total[w x]^2/Total[w])/(Total[w] - 1)
(* 412/15 *)
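
The frequency-count interpretation can be checked by expanding the data explicitly. The sketch below assumes integer weights; `freqVariance` and `expanded` are hypothetical names used only for this illustration.

```mathematica
(* Hypothetical helper: variance when w holds integer frequency counts *)
freqVariance[x_List, w_List] :=
  (Total[w x^2] - Total[w x]^2/Total[w])/(Total[w] - 1)

x = {1, 12, 3};
w = {2, 2, 2};

(* Repeat each value w[[i]] times: {1, 1, 12, 12, 3, 3} *)
expanded = Flatten[MapThread[ConstantArray, {x, w}]];

(* The ordinary sample variance of the expanded list agrees with the helper *)
Variance[expanded] == freqVariance[x, w]
(* True; both sides are 412/15 *)
```

The only difference from the formula used by Variance on weighted data is the denominator: Total[w] - 1 here versus Total[w] - Total[w^2]/Total[w] there.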

I, too, don't know where they got their formula for the variance of weighted data. At minimum, it would seem to call for a more specific reference in the Documentation Center. If the values are what one expects only when the weights are all 1, then that seems to defeat the purpose of weighted data.

POSTED BY: Jim Baldwin

I found the same formula, but I was not sure how or when it should be applied. I think I will stick to CentralMoment for now; I guess the result will be 'biased', but the answer seems more correct, especially if you look at the distribution visually... I should have plenty of samples (10^9)...

I'm still confused about how to use WeightedData properly though...

Oh, by the way, this was not a sample of a larger population; this is the entire population: each value with its number of observations.

POSTED BY: Sander Huisman

Hello! I suppose that's a correction for bias.

Suppose the weights are uniform and normalized. Then Dot[weights, weights]=1/n and const=n/(n-1). But, in the general case,

const = 1 / (1 - Dot[weights, weights])

seems different from the standard coefficient

https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights

which should be the inverse of

1 - (Dot[weights^2, weights^2]/Dot[weights, weights]^2)

For instance

pds = Table[1/p, {p}];
1 - (Dot[pds^2, pds^2]/Dot[pds, pds]^2)

gives (p-1)/p. Maybe there are elementary alterations of the weights (normalization, etc.)?
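
For comparison, here is a sketch of how the two coefficients might relate. It assumes that the linked section defines $V_1 = \sum_i w_i$ and $V_2 = \sum_i w_i^2$, and it writes $p_i = w_i/V_1$ for the normalized weights (notation introduced here for illustration):

$$\frac{1}{1 - \sum_i p_i^2} = \frac{1}{1 - V_2/V_1^2} = \frac{V_1}{V_1 - V_2/V_1}$$

Under that reading, const is the reliability-weights correction applied to normalized weights, so normalization alone would account for the apparent difference. For uniform weights $p_i = 1/n$ it reduces to $n/(n-1)$.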

POSTED BY: Claude Mante

I think you are right: it has to do with unbiased and biased estimates of the variance. In my case I'm talking about frequency weights. My variable weights holds the number of observations of each value.

So I guess I have to use: https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Frequency_weights_2

$$s^2 = \frac{\sum_{i=1}^N w_i (x_i - \mu)^2}{V_1 - 1}$$

$$V_1 = \sum_{i=1}^N w_i$$

I did not subtract 1, but my $V_1$ is about $10^9$, so that should hardly matter...

The weights that the algorithm gets are normalized:
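
How little the -1 matters for these data can be estimated directly. This is a minimal sketch using the value and weights lists from the question; the names v1, mu, s, biased and freq are illustrative.

```mathematica
value = Range[3, 17];
weights = {11396160, 108204708, 262511086, 298022123, 200846772,
   91005916, 29926435, 7513828, 1495797, 242405, 32428, 3602, 358, 42, 1};

v1 = Total[weights];             (* total number of observations *)
mu = value.weights/v1;           (* weighted mean, kept exact *)
s = ((value - mu)^2).weights;    (* weighted sum of squared deviations *)

biased = s/v1;                   (* divides by V1, like CentralMoment[wd, 2] *)
freq = s/(v1 - 1);               (* frequency-weights estimate *)

(* Relative difference; algebraically this is 1/(v1 - 1), on the order of 10^-9 *)
N[freq/biased - 1]
```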

weights = weights / Total[weights]

Thanks a lot!

POSTED BY: Sander Huisman

After some digging into the internal code I found out what the factor is. It is given by:

const = 1 / (1 - Dot[weights, weights])

where weights are now the normalized weights, weights = weights/Total[weights]. This works out to 1.27277 in my case, which exactly explains why Variance gives a higher value. The origin of this factor is, however, unclear to me... anyone?
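
A minimal sketch for reproducing that factor from the data in the question (the names p and const are illustrative, and p is used so that the original weights list is not overwritten):

```mathematica
value = Range[3, 17];
weights = {11396160, 108204708, 262511086, 298022123, 200846772,
   91005916, 29926435, 7513828, 1495797, 242405, 32428, 3602, 358, 42, 1};
wd = WeightedData[value, weights];

(* Normalized weights *)
p = weights/Total[weights];

(* Correction factor as read from the internal code *)
const = 1/(1 - p.p);

N[const]
N[Variance[wd]/CentralMoment[wd, 2]]
```

If the factor is indeed all that separates the two results, both numbers should come out near the 1.27277 quoted above.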

POSTED BY: Sander Huisman
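
One possible origin of the factor can be sketched as follows. The assumptions are mine, not something confirmed from the internal code: the $x_i$ are independent with a common mean and a common variance $\sigma^2$, and the $p_i$ are normalized weights treated as fixed numbers (reliability-type weights rather than frequency counts).

$$\mu^* = \sum_j p_j x_j, \qquad \sum_j p_j = 1$$

$$\mathrm{E}\left[(x_i - \mu^*)^2\right] = \sigma^2\left(1 - 2 p_i + \sum_j p_j^2\right)$$

$$\mathrm{E}\left[\sum_i p_i (x_i - \mu^*)^2\right] = \sigma^2\left(1 - \sum_i p_i^2\right)$$

Under these assumptions the weighted second central moment underestimates $\sigma^2$ by the factor $1 - \sum_i p_i^2$, and dividing by it removes the bias, which has the same form as const = 1/(1 - Dot[weights, weights]).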