Message Boards Message Boards

Given an exact formula get probability distribution with matching PDF?

Crossposting here to get a wider opinion. So, given some data, Mathematica 10.2 can now attempt to figure out what probability distribution might have produced it. Cool! But suppose that, instead of having data, we have something that is in some ways better -- a formula. Let's call it $f$. We suspect -- perhaps because $f$ is non-negative over some domain and because the integral of $f$ over that domain is 1 -- that $f$ is actually the PDF of some distribution (Normal, Lognormal, Gamma, Weibull, etc.) or some relatively simple transform of that distribution.

Is there any way that Mathematica can help figure out the distribution (or simple transform) whose PDF is the same as $f$?

Example: Consider the following formula:

1/(2*E^((-m + Log[5])^2/8)*Sqrt[2*Pi])

$$\frac{e^{-\frac{1}{8} (\log (5)-m)^2}}{2 \sqrt{2 \pi }}$$

As it happens -- and as I discovered with some research and guesswork -- this formula is the PDF of NormalDistribution[Log[5], 2] evaluated at $m$. But is there a better way than staring or guessing to discover this fact? That is, help me write FindExactDistribution[f_, params_].

Notes

  • The motivation for the problem comes from thinking about Conjugate Prior distributions but I suspect it might have a more general application.

  • One could start with mapping PDF evaluated at $m$ over a variety of continuous distributions. And if I did this I would at some point get to what I will call $g$, which is the PDF or the NormalDistribution with parameters $a$ and $b$ evaluated at $m$.

    1/(b*E^((-a + m)^2/(2*b^2))*Sqrt[2*Pi])
    

$$\frac{e^{-\frac{(m-a)^2}{2 b^2}}}{\sqrt{2 \pi } b}$$

But unless I knew that if I replaced $a$ by Log[5] and $b$ by $2$ that I would get $f$, this fact would not mean a lot to me. I suppose I could look at the TreeForm of $f$ and $g$ and I would notice certain similarities, and that might be a hint, but I am not sure how to make much progress beyond that observation. Ultimately, the problem looks to be about finding substitutions in parts of a tree ( $g$) which, after evaluation, yield a tree that matches a target $f$. I have the suspicion that this is a difficult problem with an NKS flavor but one for which Mathematica and its ability to transform expressions might be well suited.

POSTED BY: Seth Chandler
10 Replies

Maybe this is not exactly what you are looking for, but at least this gives you a very close guess and it is easy to figure out the rest.

dis = ProbabilityDistribution[
   1/(2*E^((-m + Log[5])^2/8)*Sqrt[2*Pi]), {m, -Infinity, Infinity}];

PDF[dis, m] // TraditionalForm

enter image description here

Generate data:

data = RandomVariate[dis, 10^4];

Match:

FindDistribution[data]

NormalDistribution[1.60001, 1.97652]

Check:

Log[5] // N

1.6094379124341003

So you did not get symbolic Log[5] but at least you got distribution right. This will work in much more complicated cases.

  • Possible continuous distributions for TargetFunctions are: BetaDistribution, CauchyDistribution, ChiDistribution, ChiSquareDistribution, ExponentialDistribution, ExtremeValueDistribution, FrechetDistribution, GammaDistribution, GumbelDistribution, HalfNormalDistribution, InverseGaussianDistribution, LaplaceDistribution, LevyDistribution, LogisticDistribution, LogNormalDistribution, MaxwellDistribution, NormalDistribution, ParetoDistribution, RayleighDistribution, StudentTDistribution, UniformDistribution, WeibullDistribution.

  • Possible discrete distributions for TargetFunctions are: BenfordDistribution, BinomialDistribution, BorelTannerDistribution, DiscreteUniformDistribution, GeometricDistribution, LogSeriesDistribution, NegativeBinomialDistribution, PascalDistribution, PoissonDistribution, WaringYuleDistribution, ZipfDistribution.

POSTED BY: Vitaliy Kaurov

I appreciate the responses here. But let me provide an example that is perhaps not so easy. Suppose the target function f is as follows: $\frac{7}{10 (a-2)^2}$ for the domain ( $-\infty,\frac{13}{10}$]. If we create a probability distribution out of this and then generate 10,000 random samples from the distribution and then run FindDistribution

 dis = ProbabilityDistribution[7/(10 (-2 + a)^2), {a, -\[Infinity], 13/10}];
 rv = RandomVariate[dis,10^4];
 fd=FindDistribution[rv,5]

The result is a mixture distribution of normal distributions, a beta distribution, a weibull distribution, a normal distribution and a mixture distribution of a normal distribution and a gamma distribution.

The mixture distributions are clearly of the wrong form, the normal distribution is clearly not right, Although I am not positive, I don't believe the Weibull Distribution or the Beta Distribution is correct either. In fact, I don't know what the correct answer is, though I think it might be a fairly simple transform of a single parameter distribution. The point, however, is that the FindDistribution process, does not seem to work in this case. And that's why I am hoping for something better.

ADDENDUM

Actually, I have now found (using a combination of my brain and a little Mathematica) a distribution whose PDF corresponds to the expression above. It is a transform of a Pareto Distribution. This is something that the FindDistribution method described above did not discover. The question is whether there is some algorithmic way whereby Mathematica can "see" this fact or at least make a good guess. My suspicion is that the answer will involve a closeness metric on expression trees.

TransformedDistribution[2 + (7 x)/10, 
 x \[Distributed] TransformedDistribution[-\[FormalX], \[FormalX] \[Distributed] ParetoDistribution[1, 1]]]
POSTED BY: Seth Chandler
f[m_] = 1/(2*E^((-m + Log[5])^2/8)*Sqrt[2*Pi]);

Integrate[f[m], {m, -Infinity, Infinity}]

1

dist = ProbabilityDistribution[f[m], {m, -Infinity, Infinity}];

Since the integral of f[m] is unity, f[m] does not have to be scaled to be a distribution. A candidate distribution will probably have two parameters and must be defined on the interval {-Infinity, Infinity}. The built-in distributions with two parameters and defined on the interval {-Infinity, Infinity} are

distributions = ToExpression /@ Names["*Distribution"];

twoParamDist = Select[distributions,
    FreeQ[PDF[#[a, b], x], #] &&
      DistributionDomain[#[a, b]] === Interval[{-Infinity, Infinity}] &] // 
   Quiet;

The target PDF for comparison is

PDF[dist, m]

enter image description here

Comparing this PDF with the form of those of the built-in two-parameter distributions will greatly limit the candidates

({#, PDF[#[a, b], m]} & /@ twoParamDist) // 
  Prepend[#, {Distribution, PDF}] & //
 Grid[#, Frame -> All] &

enter image description here

The NormalDistribution is the likely distribution

dist2 = NormalDistribution[a, b];

param = Solve[{Mean[dist2] == Mean[dist],
   Variance[dist2] == Variance[dist],
   DistributionParameterAssumptions[dist2]}, {a, b}]

{{a -> Log[5], b -> 2}}

Verifying that the PDFs are identical

PDF[dist2 /. param[[1]], m] == PDF[dist, m] // Simplify

True

So the target distribution is

dist2 /. param[[1]]

NormalDistribution[Log[5], 2]

POSTED BY: Robert Hanlon

Hi Vitaliy,

I have suggested a similar approach somewhere else in this Community, but one has to be careful. For example, as you said the mean/expected value is not quite right yet. You could try to generate more random points and then use

WolframAlpha["1.60867", IncludePods -> "PossibleClosedForm"]

to get a good/sumbolic estimate of the parameter, but this is what happens:

dis = ProbabilityDistribution[1/(2*E^((-m + Log[5])^2/8)*Sqrt[2*Pi]), {m, -Infinity, Infinity}];
Grid[Join[{{"Number of points", "Estimated distribution"}}, {#, FindDistribution[RandomVariate[dis, #]]} & /@ Table[10^i, {i, 1, 7}]], Frame -> All]

enter image description here

I sort of understand the Student-t distribution, because of the high degree of freedom. The mixture distribution is a bit different though. Both are not very useful to estimate the mean of the respective Normal Distribution. There is, of course, another parameter for FindDistribution to spit out the n most likely distributions. This works like so:

FindDistribution[RandomVariate[dis, 10^6], 3]
 {StudentTDistribution[1.6071, 1.99998, 15947.3], NormalDistribution[1.60267, 2.10096], NormalDistribution[1.60267, 2.10096]}

The last two of the results coincide and are NormalDistributions. The means are still to far off for WolframAlpha to find $Log(5)$ as a likely scenario.

Cheers,

Marco

POSTED BY: Marco Thiel

I believe that for this problem to be uniquely solvable one has to be able to first define a basis from the family of allowed functions/distributions, as well as allowed composition rules (e.g. ParameterMixtureDistribution and TransformedDistribution) in the space of finitely integrable positive functions (i.e. pdfs). I don't really know anything about this topic but it might be very hard to do so.

POSTED BY: Eduardo Serna

This approach makes use of the new Find Formula function. It can currently cope with either increasing or decreasing functions of a continuous random variable. The Method estimates g in the following equation Y=g(X) where Y and X are random variables.

index[assoc_Association, key_String]:=Map[Prepend[#[[2]], key -> #[[1]]] &, Normal@assoc];
index[assoc_List, key_String]:=Association@Map[#[key]->KeyDrop[#, key] &, assoc];

findgincreasingtransform[d1_,d2_,u_]:={Quantile[d1,u],Quantile[d2,u]};
findgdecreasingtransform[d1_,d2_,u_]:={Quantile[d1,u],Quantile[d2,1-u]};
formuladata=FindFormula[#,x,1,All,SpecificityGoal -> "Low",PerformanceGoal -> "Speed"]&;

finddistributiontransform[d1_,d2_,n_]:=Module[{fit1,fit2,datatofit1,datatofit2,u},
datatofit1=Table[findgincreasingtransform[d1,d2,N@(u/n)],{u,1,n-1}];
datatofit2=Table[findgdecreasingtransform[d1,d2,N@(u/n)],{u,1,n-1}];
fit1=index[Normal@formuladata[(datatofit1)],"Equation"];
fit2=index[Normal@formuladata[(datatofit2)],"Equation"];
{Prepend[First@fit1,{"Targetdistribution"->d2,"ProposedDistribution"->d1,"FitType"->"IncreasingTransform","Plot"->ListPlot@datatofit1,"Data"->datatofit1}],
Prepend[First@fit1,{"Targetdistribution"->d2,"ProposedDistribution"->d1,"FitType"->"DecreasingTransform","Plot"->ListPlot@datatofit2,"Data"->datatofit2}]}
]

distributionsearch[targetdist_, possibledists_, n_] :=Flatten@(finddistributiontransform[#, targetdist, n] & /@ (Flatten[{possibledists}]));

Example

distributions = {ArcSinDistribution[], UniformDistribution[],NormalDistribution[], CauchyDistribution[], 
   ChiSquareDistribution[1], WeibullDistribution[1, 1]};
Dataset@Query[MaximalBy[#Score &] /* First]@distributionsearch[TransformedDistribution[u^3, 
    u \[Distributed] WeibullDistribution[1, 1]], distributions, 100]

enter image description here

It looks like this method can be extended to cover more general uni variate transformations by taking information from the target probability density to identify where the critical points of the transformation are.

POSTED BY: Emerson Willard

So, I am thinking that this is quite a brilliant idea. The insight is to let FindFormula come up with a transform of a candidate distribution that best fits the underlying data. I'm going to explore and fiddle with the code a bit and will post something more detailed later, but I did want to herald that Mr. Willard's approach looks really good.

Here's a question for the community that would help me assess the utility of the method: Is it always the case that any member of a distribution family can be expressed as some mathematical transformation of another (base) form of the distribution? This is true for the normal distribution:

TransformedDistribution[b + a x,Distributed[x,NormalDistribution[0,1]] === NormalDistribution[a,b].  

And it is true for the UniformDistribution.

TransformedDistribution[a - a x + b x, 
 x \[Distributed] UniformDistribution[{0, 1}]]===UniformDistribution[{a,b}]

But is it true for all others, some others, few others? (The transformation doesn't have to be linear or affine; it just has to be some sort of formula).

If it is true for all or most distributions then the Willard method has greater promise than if it is true only for a few. Why? Suppose -- and I have no idea whether this is true -- that it is not true for the BetaDistribution. That is, suppose that one can not express BetaDistribution[3,2] as a transformation of BetaDistribution[1,1]. How, then would we find that our data matches a transform of some sort of BetaDistribution. We would have to specify BetaDistribution[1,1], BetaDistribution[2,1][, BetaDistribution[1,2], etc. (an infinite list) in order to find the match. What we need for the Willard method to work best would be

(a) a lot of distributions have this closure property (b) their probability density functions can be expressed exclusively in terms of the primitive TargetFunctions in FindFormula: Possible functions for TargetFunctions are Plus, Times, Power, Sin, Cos, Tan, Cot, Log, Sqrt, Csc, Sec, Sqrt, Abs, and Exp. If not, if, for example, there's a Gamma or a beta function in the PDF, I don't see how FindFormula is going to work well.

More later.

POSTED BY: Seth Chandler

One immediate question that came up in my mind with respect to Mr. Willard's idea was whether it could accommodate situations in which the target distribution was a censored distribution or a truncated distribution. The question becomes whether those derived distributions can be expressed as a transformed distribution such that FindFormula can find the proper transform. Unfortunately, at present it looks as if the answer is "no" for both. I think Mathematica's FindFormula could probably be tweaked so that at least some censored distributions would work; getting truncated distributions to work would be a lot harder.

Here's why.

A distribution censored to the interval [a,b] is really the equivalent of the transformed of that distribution clipped to a,b. Thus:

CensoredDistribution[{a,b},dist]===TransformedDistribution[Max[a,Min[b,x]],Distributed[x,dist]]

So, if the basis functions (TargetFunctions) in FindFormula included Min and Max, it should be able to find the right transformation. Unfortunately, if you evaluate this, you get the following error message.

FindFormula[
 Map[{#, Clip[3 #, {2, 5}]} &, Range[0, 2, 0.1]], x, 3, All, 
 TargetFunctions -> {Plus, Times, Sqrt, Abs, Exp, Min, Max}]

FindFormula::nofun: {Plus,Times,Sqrt,Abs,Exp,Min,Max} is not a non empty list of functions supported by FindFormula. >>

So, what has to happen is that Mathematica's FindFormula needs to support Min and Max. I would not think this would be difficult to implement.

Truncation is another matter. In theory, one can express a truncated distribution as a transformed distribution if one allows piecewise transformations. We need a function that maps a value greater than any right truncation point to the rightmost point of the domain of the underlying distribution and that maps a value less than any left truncation point to the leftmost point of the domain of the underlying distribution. If the value is inside the truncation interval then we need to map that point such that its CDF in the truncated distribution as the same as the CDF of the target point in the underlying distribution. So, the following Mathematica code should get you the transformation:

Refine[Quantile[dist, CDF[TruncatedDistribution[{a, b}, dist], x]], a < x < b]

The problem then becomes that Mathematica's FindFormula function may not have the basis functions sufficient to find the transformation. Here's a relatively simple example showing the problem. Let's see if FindFormula could find the transform necessary to mimic a NormalDistribution truncated to the interval [1,3].

With[{dist = NormalDistribution[0, 1]}, 
 Refine[Quantile[dist, CDF[TruncatedDistribution[{a, b}, dist], x]], 
  a < x < b]]

Here's the output you get. It's a conditional expression, but I am going to cheat and assume (correctly) that the condition is always true inside the interesting region. And I am going to simplify.

-Sqrt[2] InverseErfc[(2 (Erf[a/Sqrt[2]] - Erf[x/Sqrt[2]]))/(Erf[a/Sqrt[2]] - Erf[b/Sqrt[2]])]

FindFormula does not have the primitives that will ever let you find this exact formula. It does not have Erf or InverseErfc and these functions can not be decomposed into operations with the FindFormula primitives. So, we are going to have two problems. First, FindFormula does not understand piecewise functions. Second, FindFormula does not have primitives such as InverseErfc and Erf. While, conceivably Piecewise functions might be supported by a future FindFormula -- but how many pieces would you permit -- I have doubts that Mathematica if included gobs of special functions such as InverseErfc and Erf the whole system would work well. There might have to be some hierarchical system in which, if the regular functions didn't achieve some degree of fit, one went to special functions.

The short answer is that, as it stands, if the ProbabilityDistribution obtained in your formula is best represented as a censored or transformed distribution, the Willard method outlined above probably won't find it. This does NOT mean that the Willard method is bad -- as I've said, it is quite brilliant -- but it does seem to have limitations. More later.

POSTED BY: Seth Chandler

Now, transformations of even functions of a random variable can also be found.

index[assoc_Association,key_String]:=Map[Prepend[#[[2]],key->#[[1]]]&,Normal@assoc];
index[assoc_List,key_String]:=Association@Map[#[key]->KeyDrop[#,key]&,assoc];

findgincreasingtransform[d1_,d2_,x_]:={x,Quantile[d2,CDF[d1,x]]};
findgdecreasingtransform[d1_,d2_,x_]:={x,Quantile[d2,1-CDF[d1,x]]};
finduprigtheventransform[d1_,d2_,x_]:={x,Quantile[d2,Sign[x]*(CDF[d1,x]-CDF[d1,-x])]};
findinvertedheventransform[d1_,d2_,x_]:={x,Quantile[d2,1-Sign[x]*(CDF[d1,x]-CDF[d1,-x])]};

transformtypedata={<|"TransformFunction"->findgincreasingtransform,"FitType"->"Increasing"|>,<|"TransformFunction"->findgdecreasingtransform,"FitType"->"Decreasing"|>,<|"TransformFunction"->finduprigtheventransform,"FitType"->"Even"|>,<|"TransformFunction"->findinvertedheventransform,"FitType"->"InvertedEven"|>};

formuladata=FindFormula[#,x,1,All]&;

particularfit[d1_,d2_,transform_,data_]:=Module[{datatofit,fit},datatofit=transform["TransformFunction"][d1,d2,#]&/@data;
fit=index[Normal@formuladata[(datatofit)],"Equation"];
Prepend[First@fit,{"Targetdistribution"->d2,"ProposedDistribution"->d1,"FitType"->transform["FitType"],"Plot"->ListPlot@datatofit,"Data"->datatofit}]];

finddistributiontransform[d1_,d2_,n_]:=Module[{fnames,flabel,data},data=RandomVariate[d1,n];
ParallelMap[particularfit[d1,d2,#,data]&,transformtypedata]];

distributionsearch[targetdist_,possibledists_,n_]:=(*Query[MaximalBy[#Score&]/*First]@*)Flatten@(Map[finddistributiontransform[#,targetdist,n]&,(Flatten[{possibledists}])]);
index[assoc_Association, key_String] := 
  Map[Prepend[#[[2]], key -> #[[1]]] &, Normal@assoc];

Search

Query[MaximalBy[#Score &] /* First]@
 Dataset@distributionsearch[
   ChiSquareDistribution[1], {NormalDistribution[], 
    UniformDistribution[{0, 1}]}, 20 ]

enter image description here

It looks like it will be quite easy to find candidate distributions in the Truncated case. If values of the candidate distribution d1 are generated in the same way as above then a plot of the CDF of d2 versus the CDF of d1 is linear once those points which correspond to truncation are removed. These points are easy to remove because the value of the CDF of U2 will be either one or zero. I will add this functionality in the next few days.

POSTED BY: Emerson Willard

Search for Truncated random variables seems to work for the few cases I have tested. Some error and message handling logic really should be included but here here is the altered code.

index[assoc_Association,key_String]:=Map[Prepend[#[[2]],key->#[[1]]]&,Normal@assoc];
index[assoc_List,key_String]:=Association@Map[#[key]->KeyDrop[#,key]&,assoc];

findgincreasingtransform[d1_,d2_,x_]:={x,Quantile[d2,CDF[d1,x]]};
findgdecreasingtransform[d1_,d2_,x_]:={x,Quantile[d2,1-CDF[d1,x]]};
finduprigtheventransform[d1_,d2_,x_]:={x,Quantile[d2,Sign[x]*(CDF[d1,x]-CDF[d1,-x])]};
findinvertedheventransform[d1_,d2_,x_]:={x,Quantile[d2,1-Sign[x]*(CDF[d1,x]-CDF[d1,-x])]};
trucated[d1_,d2_,x_]:={CDF[d1,x],CDF[d2,x]}/.{y_,1|0}:>Nothing;

transformtypedata={<|"TransformFunction"->trucated,"FitType"->"Truncated"|>,<|"TransformFunction"->findgincreasingtransform,"FitType"->"Increasing"|>,<|"TransformFunction"->findgdecreasingtransform,"FitType"->"Decreasing"|>,<|"TransformFunction"->finduprigtheventransform,"FitType"->"Even"|>,<|"TransformFunction"->findinvertedheventransform,"FitType"->"InvertedEven"|>};

formuladata=FindFormula[#,x,1,All]&;

particularfit[d1_,d2_,transform_,data_]:=Block[{datatofit,fit},datatofit=transform["TransformFunction"][d1,d2,#]&/@data;
fit=index[Normal@formuladata[(datatofit)],"Equation"];
Prepend[First@fit,{"Targetdistribution"->d2,"ProposedDistribution"->d1,"FitType"->transform["FitType"],"Plot"->ListPlot@datatofit,"Data"->datatofit}]];

particularfit[d1_,d2_,transform_,data_]/;(transform["FitType"]=="Truncated"):=Block[{datatofit,fit,equation,x},datatofit=transform["TransformFunction"][d1,d2,#]&/@data;
fit=index[Normal@formuladata[(datatofit)],"Equation"];
equation=fit[[1]]["Equation"];
Prepend[First@fit,{"TruncationInterval"->{Quantile[d1,x]/.First@NSolve[equation==0,x],Quantile[d1,x]/.First@NSolve[equation==1,x]},
"Targetdistribution"->d2,"ProposedDistribution"->d1,"FitType"->transform["FitType"],"Plot"->ListPlot@datatofit,"Data"->datatofit}]];

finddistributiontransform[d1_,d2_,n_]:=Module[{fnames,flabel,data},data=RandomVariate[d1,n];
ParallelMap[particularfit[d1,d2,#,data]&,transformtypedata]];

distributionsearch[targetdist_,possibledists_,n_]:=(*Query[MaximalBy[#Score&]/*First]@*)Flatten@(Map[finddistributiontransform[#,targetdist,n]&,(Flatten[{possibledists}])]);

Usage

Dataset@Query[MaximalBy[#Score &]/*First]@distributionsearch[TruncatedDistribution[{0.2,2},ChiSquareDistribution[1]],{NormalDistribution[],UniformDistribution[],ChiSquareDistribution[1]},100]

enter image description here

POSTED BY: Emerson Willard
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract