WOLFRAM COMMUNITY

GROUPS: Mathematics, Mathematica, Algebra, Wolfram Language, Statistics and Probability

CentralMoment[2] vs. Variance using WeightedData

Sander Huisman, University of Twente

Posted 7 years ago

Dear All,

I have the following data:

    value = {3,4,5,6,7,8,9,10,11,12,13,14,15,16,17};
    weights = {11396160,108204708,262511086,298022123,200846772,91005916,29926435,7513828,1495797,242405,32428,3602,358,42,1};
    ListPlot[{value,weights}//Transpose]

where the weights can be interpreted as probabilities (though not normalized) and the pairs of values as a distribution.

    wd = WeightedData[value,weights]

To calculate the variance, one can use WeightedData to help out. First I calculate the mean:

    mean = N@Mean[wd] (* same as value.weights/Total[weights] *)

    5.99879

This seems OK: I calculated it in two ways, and they match. Now for the variance:

    ((value - mean)^2).weights/Total[weights]
    CentralMoment[wd,2]//N
    MomentConvert[CentralMoment[2],"Moment"]/.Moment[x_]:>Moment[wd,x]//N
    Variance[wd]//N

    1.78035
    1.78035
    1.78035
    2.26599

The first three are identical, but the fourth one is not. Of course, it might be because CentralMoment divides by Length[value] and Variance by Length[value]-1. We can correct for that:

    Variance[wd](Length[value]-1)/Length[value]//N

    2.11492

But that is still not the same value! How does Variance work with WeightedData? And why does it differ from CentralMoment[...,2]?

POSTED BY: Sander Huisman

5 Replies

Jim Baldwin, Retired

Posted 7 years ago

To estimate the variance (or the mean, for that matter) of some distribution, you'll need something stronger to justify treating the "standardized weights" as probabilities. Was this a "simple random sample" from a larger population of integer values?

The formula that seems to be used for `Variance` with weighted data is as follows:

    Variance[WeightedData[x, w]]
    (Total[w x^2] - Total[w x]^2/Total[w])/(Total[w] - Total[w^2]/Total[w])

This (as you've noticed) doesn't always match what one would expect when the weights are frequency counts. Here are two examples. When the weights are all 1, all the formulas match:

    x = {1, 12, 3};
    w = {1, 1, 1};
    Variance[WeightedData[x, w]]
    (* 103/3 *)
    (Total[w x^2] - Total[w x]^2/Total[w])/(Total[w] - Total[w^2]/Total[w])
    (* 103/3 *)
    Variance[{1, 12, 3}]
    (* 103/3 *)
    (Total[w x^2] - Total[w x]^2/Total[w])/(Total[w] - 1)
    (* 103/3 *)

But if the weights are changed from 1's, then `Variance` with weighted data no longer matches the variance of the expanded data:

    x = {1, 12, 3};
    w = {2, 2, 2};
    Variance[WeightedData[x, w]]
    (* 103/3 *)
    (Total[w x^2] - Total[w x]^2/Total[w])/(Total[w] - Total[w^2]/Total[w])
    (* 103/3 *)
    Variance[{1, 1, 12, 12, 3, 3}]
    (* 412/15 *)
    (Total[w x^2] - Total[w x]^2/Total[w])/(Total[w] - 1)
    (* 412/15 *)

I, too, don't know where they got their formula for the variance of weighted data. At minimum, it would seem to call for a more specific reference in the Documentation Center. If the values are what one expects only when the weights are 1, then that seems to defeat the purpose of weighted data.

POSTED BY: Jim Baldwin
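
The second example above can be checked by expanding the frequency counts into an explicit list. The following is a minimal sketch, assuming the weights are integer counts; the names `expanded`, `popVar` and `weightedPopVar` are introduced here for illustration only. It shows that the divide-by-N variance of the expanded list equals the weighted second central moment computed from the values and weights.

```mathematica
x = {1, 12, 3};
w = {2, 2, 2};

(* Expand the frequency counts into an explicit list: {1, 1, 12, 12, 3, 3} *)
expanded = Flatten[MapThread[ConstantArray, {x, w}]];

(* Population (divide-by-N) variance of the expanded list *)
popVar = Total[(expanded - Mean[expanded])^2]/Length[expanded]   (* 206/9 *)

(* The same quantity computed directly from values and weights *)
mu = w . x/Total[w];
weightedPopVar = w . (x - mu)^2/Total[w]                         (* 206/9 *)

(* Comparison with the built-in; the original post reports that
   CentralMoment agrees with this hand-written formula *)
CentralMoment[WeightedData[x, w], 2] == weightedPopVar
```

Multiplying 206/9 by 6/5 gives 412/15, the sample variance of the expanded list, so for frequency counts the two conventions differ only by the usual N/(N - 1) factor.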

Sander Huisman, University of Twente

Posted 7 years ago

I found the same formula, but I was not sure how or when it should be applied. I think I will stick to CentralMoment for now. I guess the results will be 'biased', but the answers seem more correct, especially if you look at the distribution visually. I should have plenty of samples (10^9). I'm still confused about how to use WeightedData properly, though.

Oh, by the way, this was not a sample from a larger population. This is the entire population: each value together with its number of observations.

POSTED BY: Sander Huisman

Claude Mante, Retired

Posted 7 years ago

Hello! I suppose that's a correction for bias. Suppose the weights are uniform. Then Dot[weights, weights] = 1/n and const = n/(n-1). But in the general case,

    const = 1 / (1 + -Dot[weights, weights])

seems different from the standard coefficient (https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights), which should be the inverse of

    1 - (Dot[weights^2, weights^2]/Dot[weights, weights]^2)

For instance,

    pds = Table[1/p, {p}];
    1 - ((Dot[pds^2, pds^2]/Dot[pds, pds]^2))

gives (p-1)/p. Maybe there are elementary alterations of the weights (normalization, etc.)?

POSTED BY: Claude Mante
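
The two coefficients can be compared directly on small exact examples. This is only a sketch: it assumes the Wikipedia reliability-weights coefficient is read as 1 - V2/V1^2 with V1 = Total[w] and V2 = Total[w^2], and the helper names `constNormalized` and `constReliability` are hypothetical, not part of any library. Under that reading, normalizing the weights makes V1 equal to 1, and the two expressions appear to coincide.

```mathematica
(* Normalized-weight form of the factor, as in const = 1/(1 - Dot[weights, weights]) *)
constNormalized[w_List] := With[{p = w/Total[w]}, 1/(1 - p . p)]

(* Reliability-weights form, with V1 = Total[w] and V2 = Total[w^2] (assumed reading) *)
constReliability[w_List] := 1/(1 - Total[w^2]/Total[w]^2)

constNormalized[ConstantArray[1, 5]]   (* 5/4, i.e. n/(n - 1) for n = 5 *)
constNormalized[{1, 2, 3}]             (* 18/11 *)
constReliability[{1, 2, 3}]            (* 18/11 *)
```

If this reading is right, the factor would be the reliability-weights correction applied to normalized weights, which fits the suggestion that normalization is the "elementary alteration" involved.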

Sander Huisman, University of Twente

Posted 7 years ago

I think you are right: it has to do with unbiased versus biased estimates of the variance. In my case I'm talking about frequency weights: my weights are the number of observations of each value. So I guess I have to use the formula from https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Frequency_weights_2:

$$s^2 = \frac{\sum^N_{i=1} w_i (x_i - \mu)^2 }{V_1 - 1}$$

$$V_1 = \sum^N_{i=1} w_i $$

I didn't subtract the 1, but my $V_1$ is ~10^9, so that should not matter much. The weights that the algorithm gets are normalized:

    weights = weights / Total[weights]

Thanks a lot!

POSTED BY: Sander Huisman
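
The frequency-weights formula quoted above is short enough to write out. A minimal sketch, assuming the weights are integer observation counts; `frequencyVariance` is a hypothetical helper name, not a built-in function. It is checked against the small example used earlier in the thread.

```mathematica
(* Hypothetical helper: unbiased variance for frequency weights,
   s^2 = Sum[w_i (x_i - mu)^2] / (V1 - 1), with V1 = Total[w] *)
frequencyVariance[x_List, w_List] := Module[{v1 = Total[w], mu},
  mu = w . x/v1;               (* weighted mean *)
  w . (x - mu)^2/(v1 - 1)]

frequencyVariance[{1, 12, 3}, {2, 2, 2}]   (* 412/15 *)
Variance[{1, 1, 12, 12, 3, 3}]             (* 412/15 *)
```

Dividing by V1 instead of V1 - 1 changes the result by a relative amount of about 1/V1, which is of order 10^-9 for the data in this thread.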

Sander Huisman, University of Twente

Posted 7 years ago

After some digging into the internal code, I found out what the factor is. It is given by:

    const = 1 / (1 + -Dot[weights, weights])

where the weights are now normalized:

    weights = weights/Total[weights]

This works out to be 1.27277 in my case, and that exactly explains why Variance gives a higher value. The origin of this factor is, however, unclear to me. Anyone?

POSTED BY: Sander Huisman
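
The factor can be related to the formula quoted by Jim Baldwin with a short derivation. The notation here is an assumption borrowed from the Wikipedia article rather than from the Mathematica source: $V_1 = \sum_i w_i$, $V_2 = \sum_i w_i^2$, normalized weights $p_i = w_i / V_1$, and weighted mean $\mu = \sum_i p_i x_i$.

$$\frac{\sum_i w_i x_i^2 - \left(\sum_i w_i x_i\right)^2 / V_1}{V_1 - V_2 / V_1} = \frac{V_1 \left(\sum_i p_i x_i^2 - \mu^2\right)}{V_1 \left(1 - \sum_i p_i^2\right)} = \frac{\sum_i p_i (x_i - \mu)^2}{1 - \sum_i p_i^2}$$

The numerator on the right is the weighted second central moment, and the denominator is the inverse of const. Under these assumptions, Variance[wd] would equal CentralMoment[wd, 2] times const. With the numbers reported in this thread, 1.78035 × 1.27277 ≈ 2.2660, which is consistent with the 2.26599 returned by Variance up to rounding of the displayed digits.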
