How does FindDistribution calculate BIC and AIC?

Posted 3 years ago
6 Replies

A new tool called FindDistribution calculates (among other things) BIC and AIC. But when I check the values using the log-likelihood value together with the other parameters, I get different values. Thank you.


POSTED BY: jl cb

It does not appear that a standard definition of AIC is used in FindDistribution. The usual formula with $n$ observations and $k$ parameters is $AIC=-2 \log L+2 k$. But FindDistribution seems to use $AIC=2 \log L - 2k/(n-k-1)$.

Here's "proof" of what FindDistribution uses. (This works unless a MixtureDistribution is found with more than two distributions.)

Set the sample size and get a random sample:

n = 100;
data = RandomVariate[ExponentialDistribution[1], n];

Find the five best-fitting distributions and collect the AIC and LogLikelihood values:

nbest = 5;
fd = FindDistribution[data, nbest, {"AIC", "LogLikelihood"}]
(* {{ExponentialDistribution[1.07497], {-1.83436, -0.906976}},
{MixtureDistribution[{0.72248, 0.27752}, {GammaDistribution[1.2916, 0.369832], UniformDistribution[{0.0177466, 3.77159}]}], {-1.75598, -0.824797}},
{MixtureDistribution[{0.765586, 0.234414}, {GammaDistribution[1.3553, 0.342329], NormalDistribution[2.45319, 0.808439]}],{-1.78878, -0.841199}},
{LogNormalDistribution[-0.708397,1.24473], {-1.84787, -0.903319}},
{WeibullDistribution[0.92819,0.897149], {-1.85254, -0.905651}}} *)

Find $k$ (the number of parameters):

k = StringCount[Table[ToString[fd[[i, 1]]], {i, nbest}], ","] + 1
(* {1,6,6,2,2} *)
nMixtures = StringCount[Table[ToString[fd[[i, 1]]], {i, nbest}], "MixtureDistribution"]
(* {0,1,1,0,0} *)
k = k - nMixtures
(* {1,5,5,2,2} *)

Extract the LogLikelihood and AIC values and show the AIC values from the formula used by FindDistribution:

logL = Table[fd[[i, 2, 2]], {i, nbest}]
(* {-0.9069763208709088`,-0.8247971259593575`,-0.8411993739988386`,-0.903318900970027`,-0.9056506499137531`} *)
aic = Table[fd[[i, 2, 1]], {i, nbest}]
(* {-1.8343608050071236`,-1.7559772306421193`,-1.7887817267210813`,-1.847874915342116`,-1.8525384132295681`} *)
2 logL - 2 k/(n - k - 1)
(* {-1.8343608050071238`,-1.7559772306421193`,-1.7887817267210815`,-1.847874915342116`,-1.8525384132295681`} *)

The formula used appears to be wrong: it looks like a mix of $AIC$ and the small-sample corrected $AIC_c$, where $AIC_c =-2\log L + 2 k n/(n-k-1)$.
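
For reference, the algebra linking the two standard criteria can be written out next to the expression that reproduces the reported values. The symbol $\ell$ is introduced here only for the comparison and stands for the "LogLikelihood" value that FindDistribution reports.

```latex
\begin{align*}
AIC   &= -2\log L + 2k \\
AIC_c &= AIC + \frac{2k(k+1)}{n-k-1}
       = -2\log L + \frac{2k\,(n-k-1) + 2k(k+1)}{n-k-1}
       = -2\log L + \frac{2kn}{n-k-1} \\
\text{reported} &= 2\ell - \frac{2k}{n-k-1}
\end{align*}
```

As a check against the sample above with $n=100$ and $k=1$: $2(-0.906976) - 2/98 \approx -1.83436$, which is the value listed for the exponential fit.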

Posted 3 years ago

Thanks for the clarification. I don't remember seeing that alternative version of AIC. For now I will use the standard version of AIC. Do you know what criterion FindDistribution uses to rank the fits from best to worst? Thanks again.

POSTED BY: jl cb

I was too mild in my assessment: the AIC definition used by FindDistribution is wrong. I'm assuming that it is a coding error rather than an alternative formulation of AIC. (Likely the intent was to use $AIC_c$, but the sample size $n$ in the numerator of the adjustment for the number of parameters was left out of the equation. Someone will correct me if I'm wrong.) That wrong definition greatly under-corrects for the number of parameters in a model, so the ranking of models from FindDistribution is suspect if the number of parameters varies among the models considered.

To make the usual statements about $AIC$: this measure allows the ranking of models and cannot be used as a measure of absolute fit. The model with the best $AIC$ might be the best model of a bunch of bad models or the best of a bunch of very good models.

While FindDistribution can get you a parsimonious description of the distribution of your data (a few parameters + a particular family of distributions) that you can easily use outside of Mathematica, if you want a more justifiable description of the data, I'd suggest using SmoothKernelDistribution. You don't get a parsimonious description of your data but it will almost certainly provide a better fit.
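
A minimal sketch of that suggestion, assuming simulated exponential data like that used elsewhere in this thread (the names `data` and `skd` are only illustrative). It overlays the kernel density estimate on a histogram so the fit can be judged by eye:

```mathematica
(* Simulated sample; replace with your own data *)
data = RandomVariate[ExponentialDistribution[1], 500];

(* Nonparametric kernel density estimate with default bandwidth and kernel *)
skd = SmoothKernelDistribution[data];

(* Compare the estimated density with a PDF-scaled histogram of the data *)
Show[
 Histogram[data, Automatic, "PDF"],
 Plot[PDF[skd, x], {x, 0, Max[data]}, PlotRange -> All]
]
```

The default settings are used here; the bandwidth and kernel can be changed through the additional arguments of SmoothKernelDistribution if the default estimate looks too smooth or too rough.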

Posted 3 years ago

Thanks. I'll try SmoothKernelDistribution then. Can't even the LogLikelihood values be used?

POSTED BY: jl cb

The "LogLikelihood" values reported by FindDistribution also appear to differ from what one would normally use. (The reported LogLikelihood is the actual log-likelihood at the parameter estimate divided by the sample size, so it appears to be the average log-likelihood contribution of a single observation.) Also, the parameter estimates do not appear to be maximum likelihood estimates.

Below is some code to show these issues:

n = 500;
data = RandomVariate[ExponentialDistribution[1], n];

fd = FindDistribution[data, 1, {"AIC", "LogLikelihood"}]
(* {{ExponentialDistribution[1.040152396547962`],{-1.9080105994525727`,-0.9519972675977724`}}} *)

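(* Check on reported LogLikelihood: full log-likelihood at the reported estimate, divided by n *)
(LogLikelihood[ExponentialDistribution[\[Lambda]],
    data] /. {\[Lambda] -> 1.040152396547962`})/n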
(* -0.9519972675977716` *)

(* Maximum likelihood estimate: two different ways *)
FindDistributionParameters[data, ExponentialDistribution[\[Lambda]]]
(* {\[Lambda]\[Rule]1.049212868865103`} *)
EstimatedDistribution[data, ExponentialDistribution[\[Lambda]]]
(* ExponentialDistribution[1.049212868865103`] *)
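
For anyone who wants the textbook criteria instead, here is a sketch that computes them directly from the maximum likelihood fit. It assumes a fresh simulated sample and the one-parameter exponential model, so $k = 1$; the names `mle`, `logL`, `aic`, `aicc` and `bic` are only illustrative. Note that `logL` here is the full log-likelihood, not the per-observation average reported by FindDistribution.

```mathematica
n = 500;
data = RandomVariate[ExponentialDistribution[1], n];

(* Maximum likelihood fit of the exponential model *)
mle = EstimatedDistribution[data, ExponentialDistribution[\[Lambda]]];

k = 1;                            (* number of estimated parameters *)
logL = LogLikelihood[mle, data];  (* full log-likelihood at the MLE *)

aic = -2 logL + 2 k               (* standard AIC *)
aicc = -2 logL + 2 k n/(n - k - 1) (* small-sample corrected AIC *)
bic = -2 logL + k Log[n]          (* standard BIC *)
```

With these definitions smaller values indicate the preferred model, and `aicc` is always slightly larger than `aic` because $2kn/(n-k-1) > 2k$.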

I see that in the documentation for the experimental FindDistribution there is a RandomSeed option and "The internal information criterion uses a Bayesian information criterion together with priors over TargetFunctions." So this tells me maximum likelihood is not being used but there aren't too many details given in the documentation. Maybe with more details in the documentation everything will become clear.

Posted 3 years ago

Similarly, $AIC$ from LinearModelFit appears to be wrongly computed.
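
One way to check this yourself is to compare the reported property with a hand computation. The sketch below assumes simulated straight-line data, Gaussian errors with the variance estimated by maximum likelihood, and a parameter count that includes the error variance; all of these are assumptions made for the illustration, and a different convention will give a different hand-computed value.

```mathematica
n = 50;
x = RandomReal[{0, 10}, n];
y = 1 + 2 x + RandomVariate[NormalDistribution[0, 1], n];

(* Fit y = a + b t *)
lm = LinearModelFit[Transpose[{x, y}], t, t];

(* Gaussian log-likelihood evaluated at the ML variance estimate rss/n *)
rss = Total[lm["FitResiduals"]^2];
logL = -n/2 (Log[2 Pi] + Log[rss/n] + 1);

(* Regression coefficients plus the error variance (assumed convention) *)
k = Length[lm["BestFitParameters"]] + 1;
manualAIC = -2 logL + 2 k;

(* Reported value next to the hand-computed one *)
{lm["AIC"], manualAIC}
```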
