Histogram and HistogramList do not honor the requested number of bins

Posted 10 years ago

Hi list. I'm new here. Medium skills with Mathematica.

Histogram and HistogramList both have allowable forms such as this:

HistogramList[some_list, bspec]

where bspec can be the number of bins. The actual number of bins used seems to have a very casual relationship to the number of bins requested. For example, the following bit of code

numBins = 1;
uniformSamples = RandomReal[{-Pi/2, Pi/2}, 1000];
list = HistogramList[uniformSamples, numBins];
{numBins, Length[list[[1]]] - 1}

returns the following pairs of {number of bins requested, number of bins used} for a few choices of numBins. Note that there is one fewer bin than there are bin delimiters (edges), hence Length[list[[1]]] - 1.

{1, 1}
{2, 2}
{3, 2}
{4, 4}
{5, 4}
{6, 8}
{7, 8}
{8, 8}
{9, 8}
{10, 16}
{11, 16}
{16, 16}
{17, 16}
{21, 16}
{22, 32}

It seems that there is something that is overriding the input request and instead returning bin numbers that are powers of two.

How can I get the number of bins that I request without ginning up my own list of bin edges? And who knows if that would even work. I can't find this behavior in the documentation for these functions, which states unambiguously, "The following bin specifications bspec can be given: n use n bins"

Jerry

POSTED BY: Jerry
9 Replies

Hi Jerry,

What is not written there is that it chooses 'nice' subdivisions around the range of your numbers. If you look at the boundaries of the bins, they will be 'nice' numbers. If your domain runs from -Pi/2 to Pi/2, there are no 'nice' subdivisions that match it exactly. It probably uses something like FindDivisions to find 'nice' bin boundaries.

What I always use is the {xmin,xmax,dx} specification. That way, I know exactly where the bins start and how many I will get. Even if n worked as you expect, the start point would still be 'random', i.e. a specification of 'n' does not uniquely define the divisions, whereas {xmin,xmax,dx} does.
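A minimal sketch of the {xmin,xmax,dx} approach, assuming the same kind of uniform samples as in the original question. The names xmin, xmax and dx are only illustrative, and dx is simply chosen so that the range divides into the requested number of bins:

```mathematica
SeedRandom[2];
uniformSamples = RandomReal[{-Pi/2, Pi/2}, 1000];

numBins = 7;                        (* number of bins actually wanted *)
{xmin, xmax} = {-Pi/2, Pi/2};       (* known range of the data (assumption) *)
dx = (xmax - xmin)/numBins;         (* exact bin width *)

(* explicit {xmin, xmax, dx} specification instead of a bare integer *)
{edges, counts} = HistogramList[uniformSamples, {xmin, xmax, dx}];

(* compare the requested and the returned number of bins *)
{numBins, Length[edges] - 1}
```

With this specification the edges should be the explicit grid from xmin to xmax in steps of dx, so the bin count should match the request and all bins should have the same width.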

POSTED BY: Sander Huisman

No need to open a new thread on this related topic, I just want to get this out:

The histogram functions in Mathematica use the "right-continuous" bin specification $[b_1,b_2),[b_2,b_3),[b_3,b_4),\ldots$ (left border included, right border excluded) to decide which bin a number on a border belongs to. My textbook defines the bins the other way, namely "left-continuous" $(b_1,b_2],(b_2,b_3],(b_3,b_4],\ldots$, for a good reason: when you have "data" given in the form of classes and their counts (absolute frequencies) and want to approximate the empirical distribution function by a normal distribution, the bins can be interpreted as perfect representations of those classes, as long as the right border of each bin is included. A distribution function is always defined for $X\leq x$, as in $P(X\leq x)$, and from this we can see that the right border is included.

$$P(X\leq x)\approx \Phi \left(\frac{x-\mu }{\sigma }\right)$$

I am working on six similar textbook problems, and the classes (in other words, intervals) are always defined such that the left border is excluded and the right border is included. I found an undocumented way of inputting these classes directly into Mathematica, and I am realizing, as I write this, that the lamentable bin specification mentioned above is built in and cannot be altered, and may never become an option (or a global system setting).

If I am the first ever to complain about this, then maybe it isn't of any concern in practice (outside of textbook problems), or practitioners don't care. I just wish the user had the flexibility to set the bin specification. If there are other important sources (books, professors, Maple, MATLAB, YouTube videos, wikis, etc.) that use the bin specification I am looking for, then maybe that is motivation enough for Wolfram to offer such an option to the end user in the future.

Of course, it is an arbitrary choice. The same author (of a table, a book, or an article) could define the classes to be right-continuous for problem 3 and left-continuous for problem 11, or even mix them. But the interpretation of (cumulative) classes as an empirical distribution function (CDF) described above is a strong argument for my case.
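For readers who need the $(b_i,b_{i+1}]$ convention today, here is a small workaround sketch. It is not the undocumented input method mentioned above, just a mirror trick with BinCounts that relies on the fact that $x\in(a,b]$ exactly when $-x\in[-b,-a)$. The data and edges are made-up illustrative values:

```mathematica
data = {1, 2, 2, 3, 4};   (* illustrative values sitting on bin borders *)
edges = {0, 2, 4};

(* built-in convention: bins [0,2) and [2,4); the value 4 falls outside *)
rightOpenCounts = BinCounts[data, {edges}]
(* {1, 3} *)

(* mirror trick: negate data and edges, count, then restore the bin order *)
leftOpenCounts = Reverse[BinCounts[-data, {Reverse[-edges]}]]
(* {3, 2}, i.e. bins (0,2] and (2,4] *)
```

The two conventions only differ for data lying exactly on a bin border, which is why the choice matters for tabulated class data but hardly ever for continuous samples.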

POSTED BY: Raspi Rascal

Are you saying that with "ProbabilityDensity" the results would not integrate to 1? That would be a big problem! Please show us an example!

POSTED BY: Sander Huisman

However, does it matter for a continuous (like normal) distribution? P(X<=x) = P(X<x) ? In reality, for continuous data you don't have data at exactly x…

I've been making PDFs/histograms/CDFs for a decade and never has this ever been important :-)

POSTED BY: Sander Huisman
Posted 4 years ago

The arbitrary choice is a good case for not doing histograms in the 21st century. Try SmoothHistogram instead. (Not to mention that no roughly continuous distribution looks as blocky as a histogram even with large amounts of data and small bin widths.)

POSTED BY: Jim Baldwin

Alternatively, one uses HistogramList and then just uses ListPlot with InterpolationOrder -> 1 or 2 and it looks fine.
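A rough sketch of that idea, assuming normally distributed sample data and an explicit bin grid picked only for illustration. The bin centers are taken as the midpoints of consecutive edges:

```mathematica
SeedRandom[2];
data = RandomVariate[NormalDistribution[], 5000];   (* illustrative sample *)

(* density-normalized histogram on an explicit grid *)
{edges, heights} = HistogramList[data, {-4, 4, 0.25}, "PDF"];

(* midpoints of consecutive edges: one center per bin *)
centers = MovingAverage[edges, 2];

(* join the bin heights with a smooth curve instead of drawing bars *)
ListPlot[Transpose[{centers, heights}], Joined -> True,
 InterpolationOrder -> 2, PlotRange -> All]
```

InterpolationOrder -> 1 gives a simple polygon through the bin centers, while 2 rounds the corners a little.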

POSTED BY: Sander Huisman
Posted 10 years ago

Thanks, Sander. That is exactly what I needed to know. Indeed, the divisions returned for my example are simple rational numbers beginning with -8/5 to sort of match the -Pi/2 request. This strikes me as silly, but at the very least it should be mentioned in the docs. In addition to usually giving an unexpected number of bins, it also frequently gives nonuniform bin widths at the ends, which results in unexpected bin counts. This happens even when “ProbabilityDensity” is given as an option, in which case (at least in this case) the nonuniform bin widths need to be taken into account when presenting the probability density estimate. The more I think about this, the more it looks like a bug, not just a documentation problem.

Thanks again. {xmin, xmax, dx} it is!

Jerry

POSTED BY: Jerry
Posted 10 years ago

What I’m saying is that due to the goofy way in which the first and last bin edges are calculated, the first and last bins can be too narrow, and thus the counts for those two bins will be low. When there are nonuniform bin widths and the histogram is being used to estimate a density function, the width of each bin has to be incorporated into the calculation of the histogram in order to get a proper pdf estimate. Mathematica counts data for these “thin” bins but then plots them as if they had the same width as the other bins, in the process producing short bars that do not accurately portray the pdf.

Here’s a bit of code you can play with:

SeedRandom[2];
numBins = 8;
uniformSamples = RandomReal[{-Pi/2, Pi/2}, 1000];
(* plot with heights normalized as a probability density *)
Histogram[uniformSamples, numBins, "PDF"]
(* the same binning as data: {bin edges, bin heights} *)
histogram = HistogramList[uniformSamples, numBins, "PDF"];
histogramDistribution = HistogramDistribution[uniformSamples, numBins];
binEdges = histogram[[1]]
binCounts = histogram[[2]] (* with "PDF" these are densities, not raw counts *)
binWidths = Differences[binEdges]
{numBins, Length[binEdges] - 1} (* {requested, actual} number of bins *)
Total[binCounts] (* sum of the heights, ignoring the widths *)
Total[binCounts*binWidths] (* area under the histogram *)
Plot[CDF[histogramDistribution, x], {x, -2, 2}]

This produces the attached screen shot. As you can see, the first and last bins have a very low count and are plotted with a “full” bin width, seeming to indicate an incorrect pdf. Instead, they should be plotted with a full height and a reduced width. As you can see, the areas do total to one but only if you correct for the (constant) bin width, which is certainly an odd feature since the bin width is not reported by either of the histogram functions. Yet, HistogramDistribution takes care of the (constant) bin width issue but not the nonuniform bin width issue, as seen in the last plot.

Attachments:
POSTED BY: Jerry

Thanks for your feedback. Yes, I can imagine that in real-life problems the borders are not hit exactly at x. My book problems are with discrete data (absolute frequencies), and I can see a notable difference in the normal distribution approximations.

I am going to post my treatment of the problem in a new thread because it should be very interesting and instructive for first-time learners of the topic (statistics and probability).

POSTED BY: Raspi Rascal