Message Boards Message Boards

0
|
9201 Views
|
7 Replies
|
8 Total Likes
View groups...
Share
Share this post:

Calculation of variance of weighted data

The WDC example (/Scope/Data: "Find the variance of WeightedData") merges the internal intermediary result (of the mean) and shows the end result in a single line only, so i cannot back track anymore.

Inspired by that WDC example, please could anyone demonstrate how the $\frac{8800}{23}$ was calculated (just the start/from which definition)? Feel free to use two lines: 1 line for the numeric mean, 1 for the variance using that numeric mean.

In[1]:= data = {-30, 10, 10, 10, 10, 10, 10, 10, 20, 20};(* sample data *)
{Mean[data], Variance[data]}(* bias-corrected sample variance*)

Out[2]= {8, 1760/9}

In[3]:= edis = EmpiricalDistribution[data];(* population *)
{Mean[edis], Variance[edis]}(* population variance *)

Out[4]= {8, 176}

In[5]:= wdata = WeightedData[{-30, 10, 20}, {1/10, 7/10, 2/10}];
wedis = EmpiricalDistribution[wdata];
{Mean[wedis], Variance[wedis]}(* okay,as expected *)

Out[7]= {8, 176}

In[8]:= {Mean[wdata], Variance[wdata]}(* which formula/definition used here, why? *)

Out[8]= {8, 8800/23}

I cannot figure it out, thank you! Best wishes.

POSTED BY: Raspi Rascal
7 Replies

Not a question today. I answered it on my own and took these notes, hopefully someone agrees or disagrees or finds it useful.

POSTED BY: Raspi Rascal
Posted 4 years ago

Raspi:

Thank you for the great question. Though I am not sure I fully understand your question, I think I might understand your frustration.

The documentation for Variance states that Variance takes either a Distribution or a List as its argument.

The head of a WeightedData is WeightedData. While one of the properties of WeightedData is an EmpericalPDF, I do not think that a Probability Density Function is the same thing as a Distribution.

So, my best guess is that there is an undocumented use of [Variance] that takes a WeightedData argument.

It would be great if someone from Wolfram jumped in to help us better understand.

POSTED BY: Mike Besso
Posted 4 years ago

Since Mathematica can do many caclulations symbolic, you can build a small symbolic weighted data list and use this:

wdata2 = WeightedData[Array[v, 3], Array[w, 3]]

Here you get the formula how Mean and Variance are calculated

Mean[wdata2]
Variance[wdata2]
POSTED BY: Michael Helmle

Thanks but that's exactly the point which i was making. Your output shows the end result, not the definition. I cannot verify where the output comes from. You will not find such an output as the literal definition of the variance, in any reference book, textbook, or elsewhere.

I shouldn't give up here.

The topic is relevant in practice. Knowing what you're doing is relevant in practice ... because weighted data is everywhere (including introtexts) and the calculation of variance is everywhere too. So if you're not sure what you're doing, you will end up producing the wrong variance value for your problem. Thanks to the documentation of Mathematica :P

POSTED BY: Raspi Rascal
Posted 4 years ago

@MichaelHelmle did give you enough information to determine the answer for WeightedData. And it is readily available at a Wiki page (and elsewhere).

Using Michael's notation the weighted mean is given by

$$\bar{v}=\frac{\sum _{i=1}^n v_i w_i}{\sum _{i=1}^n w_i}$$

And the weighted variance is given by

$$\frac{\sum _{i=1}^n w_i (v_i-\bar{v})^2}{V_1-\frac{V_2}{V_1}}$$

where

$$V_i=\sum_{j=1}^n w_j^i$$

The weights are assumed to be known and not random variables.

POSTED BY: Jim Baldwin

@Jim I am not capable of following the wiki article, my bad. Your posted middle formula does the trick for me though, thanks:

In[9]:= V1 = 1/10 + 7/10 + 2/10;(* ==1, True*)
V2 = (1/10)^2 + (7/10)^2 + (2/10)^2;(* ==(27/50), True *)
176/(V1 - (V2/V1))

Out[9]= 8800/23

So … Mathematica uses the formula for the "Reliability weights-weighted unbiased (estimate of) sample variance", aha good to know! I find the explanation elsewhere more understandable, especially the paraphrase "this expression reduces to the unweighted bias-corrected sample variance with the familiar $\frac{1}{n-1}$ factor when there are $n$ equal weights in the population". Wiki claims that the expectancy of this quantity equals the population variance: $\mathcal{E}[s_w^2]=\sigma _{\text{actual}}^2$

Two things i learn from this (please correct if i'm mistaken):

  1. wdata and data are not related in any way, if data is to be viewed as a sample with $n=10$ observations. hence the results Out[2] and Out[8] have a very different meaning. In particular, the weights in In[8] do represent the weights in the population, not the weighting of $-30$ (and of $10$ and $20$) within the sample! For 2 days i was wrong, thinking the opposite; whose fault is it? I am gonna blame the ones whose job is to tell/teach the right thing fully, clearly, exemplarily but didn't do so.
  2. The quality of this estimator function is not too good, its "convergence" (and performance) is slow. It is interesting to learn that the estimator does indeed its job (and better for increasing sample sizes $n$) but using the regular estimator $s_{N-1}^2$ gives a far higher quality estimation of the population variance: $\mathcal{E}(s_{N-1}^2)=\sigma _{\text{actual}}^2$

Here are two codes to illustrate the correct use of the Variance function on sample data (not population data). The first variant is more intelligible but the second variant should be preferred imho. It is good to know both for learning/doing statistics in $Wolfram$ $L$.

n = 10;(* taking n observations PER SAMPLE *)
A = Tuples[{0, 1}, n];(* drawing with replacement *)
prules = {0 -> 1/3, 1 -> 2/3};(* the 2 elements in population are weighted *)
Total[A /. a : {_?NumericQ ..} :> Variance[WeightedData[a, a /. prules]]]/Length[A] (* should converge to 2/9 *)

As opposed to working with the unbiased sample variance:

piData = Sort@(A /. a : {_?NumericQ ..} :> {Variance[a], Times @@ (a /. prules)}); 
pData = {#[[1, 1]], Total[#[[All, 2]]]} & /@ Split[piData, First[#1] == First[#2] &];
var = Times @@@ pData // Total (* expectancy of s^2 is 2/9 *)

This was really essential for me to figure out, i am slowly understanding what the built-in statistics functions do for me, hehe. Also i am still wondering if the Mathematica of today is better than Mathstatica (powered by Mathematica) and the ubiquitous $R$ for programming in stochastics/statistics. Today i got my hands on the two Murray Spiegel Schaum's Outline workbooks on statistics. Oh, just found a new one, hmmm.

@Mike I was frustrated indeed for 48h haha but things are clearing up now. Thanks for your attention, appreciated! Hope you enjoyed this 'contribution'. I am all new to statistics, so my steps are tiny but need to be firmly rooted (for "knowing what i am doing" — very important when handling a software application, or maths).

POSTED BY: Raspi Rascal
Posted 4 years ago

@Raspi Rascal

Yes, I am enjoying this discussion. And for me, it is timely. My job is requiring me to do more and more statistics.

Keep up the good work.

POSTED BY: Mike Besso
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract