
[WSG23] Daily Study Group: Introduction to Statistics

A Wolfram U Daily Study Group on "Introduction to Statistics" begins on April 3, 2023.

Join a cohort of fellow statistics enthusiasts to learn about collecting, describing, analyzing and interpreting data and trends in science, industry and society. Learn about techniques for data visualization and descriptive statistics, methods for calculating confidence intervals and tools for hypothesis testing from video lessons created by veteran instructor and developer David Withoff. Participate in live Q&A and review your understanding through interactive in-session polls. Complete quizzes at the end of the study group to get your certificate of program completion.

April 3-14, 2023, 11am-12pm CT (4-5pm GMT)

REGISTER HERE


Please feel free to use this thread to collaborate and share ideas, additional resources and questions with the instructors as well as with other members of the study group.

47 Replies

Hi;

While going through our class notes, in particular lesson 23, "Confidence Intervals for Proportions", I found a line of code about halfway down page 3 for calculating z* as follows:

z* = InverseCDF[NormalDistribution[0, 1], 0.975]

Where does the 0.975 number in this line of code come from, and where does the 95% confidence level that we are trying to achieve, as stated in the line of text immediately above the code, appear in the calculation? It is obviously an important number, because if it changes then z* changes, and with it the rest of the values.

Thanks so much,

Mitchell Sandlin

POSTED BY: Mitchell Sandlin
Posted 1 year ago

The 0.975 number is the probability needed for looking up z-star in tables, or for using the InverseCDF function to get z-star.

The end points of a 95% confidence interval are the points that bound the central 95% of the probability in the sampling distribution of the underlying test statistic. Tables of the sampling distribution, however, do not give those points directly: they give the point such that some fraction of the probability lies to its left, so to use those tables (or the InverseCDF function) it is necessary to work out those left-tail probabilities. For a 95% confidence interval, the end points of the central 95% leave 2.5% of the probability to the left of the left end point and 2.5% to the right of the right end point, which leaves 97.5% of the probability to the left of the right end point. So to get the end points of a 95% confidence interval, the points needed from the table (or from InverseCDF) are the point with 2.5%, or a fraction 0.025, of the probability to its left, and the point with 97.5%, or a fraction 0.975, of the probability to its left.

So the need for the 0.975 number is simply a consequence of how probability tables and the InverseCDF function work. Tables (and functions like InverseCDF) are based on probability to the left of any point, rather than points that include some probability around the center of the distribution.

The origin of the 0.975 number is also described, with illustrations, in lesson 16 "Computing Confidence Intervals".
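
As a quick illustration of how the tail probabilities map to z* (a sketch using the standard normal distribution, as in the lesson):

InverseCDF[NormalDistribution[0, 1], 0.025] (* left endpoint, about -1.95996 *)
InverseCDF[NormalDistribution[0, 1], 0.975] (* right endpoint, about 1.95996 *)
InverseCDF[NormalDistribution[0, 1], 0.995] (* z* for a 99% interval, about 2.57583 *)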

POSTED BY: Dave Withoff

David -

RE: Resource for Conditional Probability Problems

Your remark during the Statistics DSG that the internet is a great source of problem solving help was right on target. The following link goes to some discussion and a collection of solved problems in conditional probability. Working through these problems helps develop the ability to understand what is being asked in conditional probability problems and what techniques can be used to solve them.

https://www.studocu.com/ph/document/central-luzon-state-university/differential-equation-l/prob3160ch4-math/33698873
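
For anyone who wants to check their worked answers in Mathematica, conditional probabilities can be evaluated directly; a minimal sketch with a made-up example (not one of the linked problems):

(* P(X > 2 given X > 1) for a standard normal X *)
Probability[x > 2 \[Conditioned] x > 1, x \[Distributed] NormalDistribution[]]
(* about 0.1434 *)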

POSTED BY: Joseph Smith

The Wolfram U course Introduction to Statistics is now available with updated content (based on the feedback we received from DSG attendees) and also features a Final Exam. We would like to encourage all our DSG attendees to take the exam and claim their Level 1 certification. Good luck!

I received an email today reminding me to complete the quizzes and exam for the statistics DSG. However, requesting an exam does not work. There is just an endless loading symbol.

POSTED BY: Joseph Smith

Hi everyone! A message to the Statistics Study Group was sent in error at 11:00 CT today. The message reminds you to take the course quizzes for your certificate of completion by Friday, April 28, which is accurate. It also says the exam is available in the framework, which is not accurate. We continue our testing, and the exam will be deployed in the next day or two. We have to ask for your continued patience regarding the exam. We will post here, on Community, when the exam is available.

Keep in mind there is no deadline for taking the exam. Once that is part of the framework, the Level 1 certificate will be available.

POSTED BY: Jamie Peterson
Posted 1 year ago

Hi Jamie,

I notice that the final exam is now active in the course framework. Since an official announcement has not been made, I am thinking that the exam is not quite ready. In any case, I would like to privately report an error that I found, without disclosing anything about the final to the group. Please let me know where I can send an email about that.

Regards,

Bob

POSTED BY: Bob Renninger

Hi @Bob Renninger. Yes, the exam has been added to the framework! Please send bug reports to wolfram-u@wolfram.com. Thank you!

POSTED BY: Jamie Peterson
Posted 1 year ago

Thanks, I have reported the issues.

POSTED BY: Bob Renninger

Dear Jose and Arben,

I suspect that there are two errors in the Quiz 6.

Problem 8: There is no correct answer. Multiple choice D) could be an answer, but only if it were modified to read "Approximately normal and with a standard error of 80 g."

Problem 10: There is no correct answer. Even if we change the order of the two samples in MeanDifferenceCI, there is still no matching multiple-choice option.

Please see the attached notebook.

Attachments:
POSTED BY: Hee-Young Shin

Hi Hee-Young—thanks for your comments and notebook.

Re: Question 10, I've looked at your provided notebook, and the answer you calculate is listed among the choices I see on the deployed quiz (granted, rounded to the nearest 0.01, but we'll call that good enough!). Do you see something different from the following answer choices?

[screenshot of the answer choices from the deployed quiz]

As for Question 8, I believe you are correct and will get that updated. Thanks again for pointing it out!

POSTED BY: Arben Kalziqi
Posted 1 year ago

Thank you Arben for your time.

POSTED BY: José Dordá

Of course! Thanks for helping us catch errors. Even with lots of review, things still sometimes get through...

POSTED BY: Arben Kalziqi
Posted 1 year ago

About quizzes:

While going through the quizzes, it seems that there is no correct option for Quiz 6, Question 10. I tried it in Mathematica and also in a specialized statistics package, and none of the options corresponds to the correct answer for the data supplied.

POSTED BY: José Dordá

Hi José,

I've just checked with the quiz author, and we are able to get the correct answer as listed in the choices. While we're not sure of what answer you calculated, they suggest that it's possible to get the reverse of the correct answer if you swap the two provided samples for one another in the argument of the relevant function.
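
To illustrate with made-up numbers (a sketch; MeanDifferenceCI is from the HypothesisTesting` package), swapping the two sample arguments simply flips the signs of the interval endpoints:

Needs["HypothesisTesting`"]
sampleA = {1.2, 1.5, 1.7, 1.9}; (* hypothetical data *)
sampleB = {2.0, 2.2, 2.4, 2.8};
MeanDifferenceCI[sampleA, sampleB] (* interval for mean(A) - mean(B) *)
MeanDifferenceCI[sampleB, sampleA] (* same interval with signs reversed *)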

Please let us know whether this helps, and feel free to provide more information if it doesn't.

Arben

POSTED BY: Arben Kalziqi
Posted 1 year ago

Quiz 15, Problem 4 asks: "What is the expected value of the mean from a sampling distribution of sample size n, drawn from a normally distributed population with standard deviation σ?" Going through the possible answers provided, the one marked as correct depends on the sample problem in the video for notebook 44. The central limit theorem governs this situation, and it says the expected value of the mean of a sampling distribution equals the mean (µ) of the population. What is offered instead is the numerical value of µ from the specific situation in the sampling problem. A quick check with made-up numbers is sketched below.
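
Here is that check, using the standard result that the sampling distribution of the mean is NormalDistribution[\[Mu], \[Sigma]/Sqrt[n]] (the values are hypothetical):

With[{\[Mu] = 10, \[Sigma] = 3, n = 25},
 Mean[NormalDistribution[\[Mu], \[Sigma]/Sqrt[n]]]]
(* 10, i.e. the population mean \[Mu], regardless of n and \[Sigma] *)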

POSTED BY: Tom Ogilvy

Hi Tom,

Thanks for bringing this to our attention. We're rolling out a fix; you may have to clear your cookies/cache to see it, but the answer will be updated.

POSTED BY: Arben Kalziqi

Hi;

When I try to request my final exam from the course framework, nothing happens. I get a wait indicator that never produces a final exam.

POSTED BY: Mitchell Sandlin

Hi @Mitchell Sandlin, the exam has not yet been added to the course framework. We will send an email notification to the Study Group participants when this is available next week.

POSTED BY: Jamie Peterson

Hi;

I am attempting to weight temperature data for use in creating a normal distribution with the EstimatedDistribution function - see attached notebook. However, all I am getting is a batch of error messages, and no normal distribution is produced. Please tell me what I am doing incorrectly.

Thanks,

Mitch Sandlin

Attachments:
POSTED BY: Mitchell Sandlin

Take a look at

PDF[NormalDistribution[], #] & /@ wdAllFlat

The temperature values are far from 0 which is the mean used by NormalDistribution[], so the probability is tiny.
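
For example, with a hypothetical data value of 80:

N[PDF[NormalDistribution[], 80]]
(* about 7.*10^-1391, far below the smallest machine number *)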

POSTED BY: Rohit Namjoshi

Hi Rohit;

In implementing your suggestion, I did not see much change - see attached notebook. However, there is a good possibility that I misunderstood your suggestion and implemented it incorrectly.

As a little background, I ran exactly the same calculations without weighting the data, and all the probability calculations, including the NormalDistribution[] function, looked reasonable. However, I wanted to run the same calculations with weighted data to take into account that the current temperatures may be a better approximation due to climate change, and therefore assign more weight to the current temperatures and less weight to the temperatures 20 years ago.

Thanks,

Mitch Sandlin

Attachments:
POSTED BY: Mitchell Sandlin
Posted 1 year ago

Yes, with weighting function PDF[NormalDistribution[], #] & and data values around 80, the weights are values of the probability density function of the standard normal distribution around 80 standard deviations above the mean, so the weights (around 10^-1400) are smaller than the smallest possible machine number (which is about 10^-323). NormalDistribution[] refers to the standard normal distribution, the normal distribution with a mean of zero and a standard deviation of 1, so NormalDistribution[] is equivalent to NormalDistribution[0,1]. I don't know the limitations of the EstimatedDistribution function, but the error messages suggest that it is not designed to deal with weights that small. I'm not sure if this is what was intended here, but the following worked when I tried it:

(* base the weighting function on the sample's own mean and standard
   deviation, so the weights are of reasonable size *)
sampleMean = Mean[wdAllFlat];
sampleStandardDeviation = StandardDeviation[wdAllFlat];

weightedData = WeightedData[wdAllFlat, 
   PDF[NormalDistribution[sampleMean, sampleStandardDeviation], #] &];

EstimatedDistribution[weightedData, NormalDistribution[\[Mu], \[Sigma]]]
POSTED BY: Dave Withoff

I have been experimenting with TTest and I am observing puzzling behavior.

As the attached notebook shows, the SignificanceLevel option does not seem to change the p-value for the sample dataset used in Lesson 27.

The attached notebook also shows experiments where the p-value for a TTest on normally distributed data does not seem to decrease as the standard deviation of the data decreases.

Perhaps I am making a mistake in how I am using the TTest command.

Attachments:
POSTED BY: Joseph Smith
Posted 1 year ago

1) The significance level in a hypothesis test has no effect on the p-value. The significance level is the threshold for interpreting the p-value (deciding whether the p-value indicates a statistically significant departure from the null hypothesis). The SignificanceLevel option in the TTest function has an effect only if the interpretation is included in the output, as in:

In[]:= TTest[{1, 2, 3, 4, 5}, 1, {"PValue", "TestConclusion"}]

Out[]= {0.0474207, 
  The null hypothesis that the mean of the population is equal to 1 is rejected at the 5 percent level based on the T test.}

In[]:= TTest[{1, 2, 3, 4, 5}, 1, {"PValue", "TestConclusion"}, SignificanceLevel -> 0.01]

Out[]= {0.0474207, 
  The null hypothesis that the mean of the population is equal to 1 is not rejected at the 1 percent level based on the T test.}

2) Unlike the sample that was used in the video, the mean of the sample generated by data1 = RandomVariate[NormalDistribution[1000, 0.254], 2500] is very close to 1000, and the sample size is 100 times bigger than the sample used in the video. Of those two differences, the more important one here is that the sample mean is very close to 1000, which is the population mean under the null hypothesis. TTest[data1, 1000] will return a large p-value because the mean of the sample is (by construction) very close to 1000. Replacing 0.254 by 0.0254 or 0.00254 makes the population standard deviation smaller, but since the mean of the sample is still 1000, the p-values will still be large.

It is easier to see the effect of the population standard deviation on the p-value by using samples from populations whose mean is not equal to the mean under the null hypothesis, such as:

In[]:= data1 = RandomVariate[NormalDistribution[1001, 3], 10];
TTest[data1, 1000]

Out[]= 0.760372

In[]:= data2 = RandomVariate[NormalDistribution[1001, 1], 10];
TTest[data2, 1000]

Out[]= 0.0812006

In[]:= data3 = RandomVariate[NormalDistribution[1001, 0.5], 10];
TTest[data3, 1000]

Out[]= 0.0000260985
POSTED BY: Dave Withoff

David,

Thanks for your response. I will need to digest this carefully but I really appreciate your help.

Joe Smith

POSTED BY: Joseph Smith

Thanks again. Some additional experiments with data where the mean moves away from 1000 illustrate the interaction between PValue, the significance level, and the conclusion of the TTest with respect to rejection of the null hypothesis.

Attachments:
POSTED BY: Joseph Smith

In the lecture today on the Multiple Testing Problem, a website was mentioned whose author has made it his hobby to collect false positives, or "spurious correlations": e.g., a time series of people drowning in swimming pools over successive years that appears to be very similar to the time series of movies coming out starring Nicolas Cage. The website shows many more funny coincidences.

This is the website: https://www.tylervigen.com/spurious-correlations

There's a book as well by the same author: https://www.amazon.com/Spurious-Correlations-Tyler-Vigen/dp/0316339431

Please post the link to the course framework on this community site. The chat pane does not appear in the recording of the course so if I miss that lecture I won't see the link. Thanks!

POSTED BY: Joseph Smith

Unfortunately the course framework is not ready for release yet. We are only sharing a beta version with our study group attendees. We can include the link in our reminder emails, so you can get to it, even if you miss the live session.

Thanks. Understood.

POSTED BY: Joseph Smith

Hi;

In using both functions NormalCI and MeanCI, I am getting some unexpected results - please see attached notebook.

Thanks,

Mitch Sandlin

Attachments:
POSTED BY: Mitchell Sandlin
Posted 1 year ago

Hi Mitch,

I suspect it has something to do with the handling of temperatures and temperature differences, see the link in Abrita's post.

I'm always a bit cautious when it comes to quantities with units. Especially with large data sets, I have the feeling that calculations are faster with pure numbers. Therefore, just like you, I would have loaded the temperatures first and put them into a consistent format; however, I would then have dropped the units. You can do that, for example, like this:

wdAllFlat = QuantityMagnitude@Flatten[wdAll];

With dimensionless numbers you get the expected results (as you have already shown with your "copy&paste approach").

A comment on your second question: the variable wdAllFlat still has units in your case, so you get a normal distribution with units (http://reference.wolfram.com/language/ref/QuantityDistribution.html). If you work with pure numbers, however, you get the result in pure numbers only, which makes plotting much easier, for example...
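
A small sketch of the difference with made-up readings (plain numbers give back a plain NormalDistribution, while the same data wrapped in Quantity would come back as a QuantityDistribution):

temps = {21.5, 22.0, 23.1, 24.9}; (* hypothetical readings, units already dropped *)
EstimatedDistribution[temps, NormalDistribution[\[Mu], \[Sigma]]]
(* NormalDistribution[22.875, 1.30456] *)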

By the way, this is not an argument against the use of units in WL; on the contrary, I am a big fan of them, just not in every context.

POSTED BY: lara wag
Posted 1 year ago

In the notebook "11.NumericalSummariesOfData.nb" there are images of this kind (picture taken from the notebook). You animated these diagrams in the accompanying video, and I wondered how you did it. Unfortunately, there are only static pictures in the notebook (which makes sense, since you give explanations in the notebook that are "on the soundtrack" in the video). Even if it's a little off-topic: would you be so kind as to share with us the code of this stunning animation?

POSTED BY: lara wag
Posted 1 year ago

The animations are done using dynamic displays with off-screen controls. The programming is off-screen to avoid the potential distraction from the topic of the video, and avoid cluttering up the screen with a bunch of code that has nothing to do with statistics.

For a simple example of the basic idea, evaluate:

x = 0; Slider[Dynamic[x], {0, 10}]

to get a slider for controlling the global value of x (which is initially set to zero) and then evaluate

Dynamic[AngularGauge[x, {0, 10}]]

to get an angular gauge that shows the value of x. Moving the position of the slider changes the angular gauge.

The animation in the video is done by putting the slider in an off-screen notebook, and copying the dynamic AngularGauge display to the notebook shown in the video. The off-screen slider then controls the dynamic display in the video.

Other than that, the only difference in that particular animation is that there are two off-screen controls (one for use when describing "skewness" and one for use when describing "outliers") and the display is a labeled NumberLinePlot graphic rather than an AngularGauge display.

There are several other animated illustrations in the statistics course that are done in this same way.

A disclaimer regarding the code for this animation is that this code is ad hoc and not intended for public release. This code is also not intended to show exemplary programming practice. It's single-use code written to get the desired animation in this one example. There are lots of details here for getting the display to be exactly what is wanted. None of those details are especially tricky or complex, but there are nevertheless lots of details.

With that disclaimer, here is the code that was used to generate the animation in this video.

x = 0; Slider[Dynamic[x], {-4.3, 3.3}]
y = 0; Slider[Dynamic[y], {-30, 26}]
data = {45.4, 48.1, 49.7, 50.4, 51.5, 52.5, 53.4, 54.3, 54.8, 55.9, 
  57.8, 60.9}; Dynamic[
 md = Median[data];
 d1 = Take[data, 5] - md;
 d2 = Take[data, -5] - md;
 pts = If[x > 0, Join[md + d1, data[[{6, 7}]], md + (1 + x) d2], 
    Join[md + (1 - x) d1, data[[{6, 7}]], md + d2]] + {UnitStep[-y] y,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, UnitStep[y] y};
 m = Mean[pts];
 md = Median[pts]; 
 Show[NumberLinePlot[pts, BaseStyle -> 14, ImageSize -> 600,
   Epilog -> {Text[
      StringTemplate["mean = ``"][Round[m, 0.001]], {m, 4.8}], 
     Arrow[{{m, 2}, {m, 1.5}}], 
     Text[StringTemplate["median = ``"][md], {md, -3.2}], 
     Arrow[{{md, -1}, {md, 0}}]}, AspectRatio -> 0.2, 
   PlotStyle -> PointSize[0.01]], 
  Graphics[{Opacity[0], Point[{45, 5}], Point[{45, -5}]}], 
  PlotRange -> {{20, 80}, Automatic}]]
POSTED BY: Dave Withoff
Posted 1 year ago

Dear Mr. Withoff,

Thank you very much for your answer.

You are absolutely right, the code does distract a lot from the actual topic. There is also a separate course on the topic of "animations" here on Wolfram U.

Still, thanks for sharing. I found your method of moving the five points, or the single point, via the Join function (pts = ...) very enlightening. I also would not have expected that the final image could be assembled with only Epilog.

POSTED BY: lara wag

Hi;

I am interested in finding the value of x that gives a known probability. Intuitively one would assume that the Solve function should produce the answer by simply solving for x - see below. However, neither Solve, SolveValues nor Reduce returns the desired result. In some situations I know the probability I am interested in, but finding the value of x that gives that probability takes a lot of trial and error, so I hope someone can point me in the right direction.

Thanks, Mitch Sandlin

Solve[0.6 == Probability[x, x \[Distributed] NormalDistribution[998, 202]], x]

POSTED BY: Mitchell Sandlin

Dear Mr. Sandlin:

I am NO expert on Wolfram or Probability but the following may help you.

Attachments:

Hi Juan;

Thanks so much. It was actually the InverseCDF function that performed the correct calculation.
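
For anyone else looking, this is the kind of calculation that did it (a sketch using the distribution from my original post):

InverseCDF[NormalDistribution[998, 202], 0.6]
(* about 1049.18, the value of x with cumulative probability 0.6 *)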

Mitch Sandlin

POSTED BY: Mitchell Sandlin

How to take the Quizzes and the Online Course Exam?

POSTED BY: Md Mohsin

We will share links to quizzes and final exam during the DSG.

Show does not work properly in Mathematica 13.2.x: the overlay example in the downloaded notebook 8.UsingHistogramData.nb fails to display the overlay!

POSTED BY: Marvin Schaefer

Hi Marvin,

The reason is that Show takes PlotRange from its first argument (Histogram) and you will notice that the second argument (ListLinePlot) has a very different PlotRange. Why???

Because, unexpectedly, MovingAverage performs a unit conversion from Celsius to Kelvin.

MovingAverage[{Quantity[22, "DegreesCelsius"], Quantity[24, "DegreesCelsius"]}, 2]
(* {Quantity[5923/20, "Kelvins"]} *)

To get the plots to overlay correctly, convert back to Celsius.

Show[Histogram[data, Automatic, "PDF"], 
 ListLinePlot[
  Transpose[
   MapAt[UnitConvert[MovingAverage[#, 2], "DegreesCelsius"] &, 
    HistogramList[data, Automatic, "PDF"], 1]], PlotMarkers -> All]]

This looks like a bug to me, or at least unexpected, undocumented (as far as I can tell) and can result in unexpected errors. You should report it to Wolfram Support.

POSTED BY: Rohit Namjoshi

There was an update to how temperature units are handled in 13.2. This should provide an explanation: https://reference.wolfram.com/language/tutorial/TemperatureUnits.html

Thank you, Abrita. Seeing the documentation is what I badly needed!

-Marv

POSTED BY: Marvin Schaefer

Thank you very much for diagnosing the problem, Rohit! I got a response from support about the units on the x-axis not corresponding between the ListLinePlot and the graphic, which also was not making sense to me at the time. Your response made it all very clear and, as you indicate, I do not think this change in implicit conversion was documented in the prerelease documents I received for 13.2 or 13.3. It certainly is not compatible with the documentation and is possibly an easily corrected bug.

I truly appreciate your persisting in analysis of this conundrum! I’ll re-report it to Support.

-Marv Schaefer

POSTED BY: Marvin Schaefer

Reminder that the statistics group starts Monday! Author @David Withoff will join us as we kick off our latest Daily Study Group. A pre-release version of the interactive course framework will be shared with participants. Sign up here.

POSTED BY: Jamie Peterson