Group Abstract Group Abstract

Message Boards Message Boards

Given an exact formula get probability distribution with matching PDF?

POSTED BY: Seth Chandler
10 Replies
POSTED BY: Seth Chandler

Now, transformations of even functions of a random variable can also be found.

index[assoc_Association,key_String]:=Map[Prepend[#[[2]],key->#[[1]]]&,Normal@assoc];
index[assoc_List,key_String]:=Association@Map[#[key]->KeyDrop[#,key]&,assoc];

findgincreasingtransform[d1_,d2_,x_]:={x,Quantile[d2,CDF[d1,x]]};
findgdecreasingtransform[d1_,d2_,x_]:={x,Quantile[d2,1-CDF[d1,x]]};
finduprigtheventransform[d1_,d2_,x_]:={x,Quantile[d2,Sign[x]*(CDF[d1,x]-CDF[d1,-x])]};
findinvertedheventransform[d1_,d2_,x_]:={x,Quantile[d2,1-Sign[x]*(CDF[d1,x]-CDF[d1,-x])]};

transformtypedata={<|"TransformFunction"->findgincreasingtransform,"FitType"->"Increasing"|>,<|"TransformFunction"->findgdecreasingtransform,"FitType"->"Decreasing"|>,<|"TransformFunction"->finduprigtheventransform,"FitType"->"Even"|>,<|"TransformFunction"->findinvertedheventransform,"FitType"->"InvertedEven"|>};

formuladata=FindFormula[#,x,1,All]&;

particularfit[d1_,d2_,transform_,data_]:=Module[{datatofit,fit},datatofit=transform["TransformFunction"][d1,d2,#]&/@data;
fit=index[Normal@formuladata[(datatofit)],"Equation"];
Prepend[First@fit,{"Targetdistribution"->d2,"ProposedDistribution"->d1,"FitType"->transform["FitType"],"Plot"->ListPlot@datatofit,"Data"->datatofit}]];

finddistributiontransform[d1_,d2_,n_]:=Module[{fnames,flabel,data},data=RandomVariate[d1,n];
ParallelMap[particularfit[d1,d2,#,data]&,transformtypedata]];

distributionsearch[targetdist_,possibledists_,n_]:=(*Query[MaximalBy[#Score&]/*First]@*)Flatten@(Map[finddistributiontransform[#,targetdist,n]&,(Flatten[{possibledists}])]);
index[assoc_Association, key_String] := 
  Map[Prepend[#[[2]], key -> #[[1]]] &, Normal@assoc];

Search

Query[MaximalBy[#Score &] /* First]@
 Dataset@distributionsearch[
   ChiSquareDistribution[1], {NormalDistribution[], 
    UniformDistribution[{0, 1}]}, 20 ]

enter image description here

It looks like it will be quite easy to find candidate distributions in the Truncated case. If values of the candidate distribution d1 are generated in the same way as above then a plot of the CDF of d2 versus the CDF of d1 is linear once those points which correspond to truncation are removed. These points are easy to remove because the value of the CDF of U2 will be either one or zero. I will add this functionality in the next few days.

POSTED BY: Emerson Willard

Search for Truncated random variables seems to work for the few cases I have tested. Some error and message handling logic really should be included but here here is the altered code.

index[assoc_Association,key_String]:=Map[Prepend[#[[2]],key->#[[1]]]&,Normal@assoc];
index[assoc_List,key_String]:=Association@Map[#[key]->KeyDrop[#,key]&,assoc];

findgincreasingtransform[d1_,d2_,x_]:={x,Quantile[d2,CDF[d1,x]]};
findgdecreasingtransform[d1_,d2_,x_]:={x,Quantile[d2,1-CDF[d1,x]]};
finduprigtheventransform[d1_,d2_,x_]:={x,Quantile[d2,Sign[x]*(CDF[d1,x]-CDF[d1,-x])]};
findinvertedheventransform[d1_,d2_,x_]:={x,Quantile[d2,1-Sign[x]*(CDF[d1,x]-CDF[d1,-x])]};
trucated[d1_,d2_,x_]:={CDF[d1,x],CDF[d2,x]}/.{y_,1|0}:>Nothing;

transformtypedata={<|"TransformFunction"->trucated,"FitType"->"Truncated"|>,<|"TransformFunction"->findgincreasingtransform,"FitType"->"Increasing"|>,<|"TransformFunction"->findgdecreasingtransform,"FitType"->"Decreasing"|>,<|"TransformFunction"->finduprigtheventransform,"FitType"->"Even"|>,<|"TransformFunction"->findinvertedheventransform,"FitType"->"InvertedEven"|>};

formuladata=FindFormula[#,x,1,All]&;

particularfit[d1_,d2_,transform_,data_]:=Block[{datatofit,fit},datatofit=transform["TransformFunction"][d1,d2,#]&/@data;
fit=index[Normal@formuladata[(datatofit)],"Equation"];
Prepend[First@fit,{"Targetdistribution"->d2,"ProposedDistribution"->d1,"FitType"->transform["FitType"],"Plot"->ListPlot@datatofit,"Data"->datatofit}]];

particularfit[d1_,d2_,transform_,data_]/;(transform["FitType"]=="Truncated"):=Block[{datatofit,fit,equation,x},datatofit=transform["TransformFunction"][d1,d2,#]&/@data;
fit=index[Normal@formuladata[(datatofit)],"Equation"];
equation=fit[[1]]["Equation"];
Prepend[First@fit,{"TruncationInterval"->{Quantile[d1,x]/.First@NSolve[equation==0,x],Quantile[d1,x]/.First@NSolve[equation==1,x]},
"Targetdistribution"->d2,"ProposedDistribution"->d1,"FitType"->transform["FitType"],"Plot"->ListPlot@datatofit,"Data"->datatofit}]];

finddistributiontransform[d1_,d2_,n_]:=Module[{fnames,flabel,data},data=RandomVariate[d1,n];
ParallelMap[particularfit[d1,d2,#,data]&,transformtypedata]];

distributionsearch[targetdist_,possibledists_,n_]:=(*Query[MaximalBy[#Score&]/*First]@*)Flatten@(Map[finddistributiontransform[#,targetdist,n]&,(Flatten[{possibledists}])]);

Usage

Dataset@Query[MaximalBy[#Score &]/*First]@distributionsearch[TruncatedDistribution[{0.2,2},ChiSquareDistribution[1]],{NormalDistribution[],UniformDistribution[],ChiSquareDistribution[1]},100]

enter image description here

POSTED BY: Emerson Willard

So, I am thinking that this is quite a brilliant idea. The insight is to let FindFormula come up with a transform of a candidate distribution that best fits the underlying data. I'm going to explore and fiddle with the code a bit and will post something more detailed later, but I did want to herald that Mr. Willard's approach looks really good.

Here's a question for the community that would help me assess the utility of the method: Is it always the case that any member of a distribution family can be expressed as some mathematical transformation of another (base) form of the distribution? This is true for the normal distribution:

TransformedDistribution[b + a x,Distributed[x,NormalDistribution[0,1]] === NormalDistribution[a,b].  

And it is true for the UniformDistribution.

TransformedDistribution[a - a x + b x, 
 x \[Distributed] UniformDistribution[{0, 1}]]===UniformDistribution[{a,b}]

But is it true for all others, some others, few others? (The transformation doesn't have to be linear or affine; it just has to be some sort of formula).

If it is true for all or most distributions then the Willard method has greater promise than if it is true only for a few. Why? Suppose -- and I have no idea whether this is true -- that it is not true for the BetaDistribution. That is, suppose that one can not express BetaDistribution[3,2] as a transformation of BetaDistribution[1,1]. How, then would we find that our data matches a transform of some sort of BetaDistribution. We would have to specify BetaDistribution[1,1], BetaDistribution[2,1][, BetaDistribution[1,2], etc. (an infinite list) in order to find the match. What we need for the Willard method to work best would be

(a) a lot of distributions have this closure property (b) their probability density functions can be expressed exclusively in terms of the primitive TargetFunctions in FindFormula: Possible functions for TargetFunctions are Plus, Times, Power, Sin, Cos, Tan, Cot, Log, Sqrt, Csc, Sec, Sqrt, Abs, and Exp. If not, if, for example, there's a Gamma or a beta function in the PDF, I don't see how FindFormula is going to work well.

More later.

POSTED BY: Seth Chandler

This approach makes use of the new Find Formula function. It can currently cope with either increasing or decreasing functions of a continuous random variable. The Method estimates g in the following equation Y=g(X) where Y and X are random variables.

index[assoc_Association, key_String]:=Map[Prepend[#[[2]], key -> #[[1]]] &, Normal@assoc];
index[assoc_List, key_String]:=Association@Map[#[key]->KeyDrop[#, key] &, assoc];

findgincreasingtransform[d1_,d2_,u_]:={Quantile[d1,u],Quantile[d2,u]};
findgdecreasingtransform[d1_,d2_,u_]:={Quantile[d1,u],Quantile[d2,1-u]};
formuladata=FindFormula[#,x,1,All,SpecificityGoal -> "Low",PerformanceGoal -> "Speed"]&;

finddistributiontransform[d1_,d2_,n_]:=Module[{fit1,fit2,datatofit1,datatofit2,u},
datatofit1=Table[findgincreasingtransform[d1,d2,N@(u/n)],{u,1,n-1}];
datatofit2=Table[findgdecreasingtransform[d1,d2,N@(u/n)],{u,1,n-1}];
fit1=index[Normal@formuladata[(datatofit1)],"Equation"];
fit2=index[Normal@formuladata[(datatofit2)],"Equation"];
{Prepend[First@fit1,{"Targetdistribution"->d2,"ProposedDistribution"->d1,"FitType"->"IncreasingTransform","Plot"->ListPlot@datatofit1,"Data"->datatofit1}],
Prepend[First@fit1,{"Targetdistribution"->d2,"ProposedDistribution"->d1,"FitType"->"DecreasingTransform","Plot"->ListPlot@datatofit2,"Data"->datatofit2}]}
]

distributionsearch[targetdist_, possibledists_, n_] :=Flatten@(finddistributiontransform[#, targetdist, n] & /@ (Flatten[{possibledists}]));

Example

distributions = {ArcSinDistribution[], UniformDistribution[],NormalDistribution[], CauchyDistribution[], 
   ChiSquareDistribution[1], WeibullDistribution[1, 1]};
Dataset@Query[MaximalBy[#Score &] /* First]@distributionsearch[TransformedDistribution[u^3, 
    u \[Distributed] WeibullDistribution[1, 1]], distributions, 100]

enter image description here

It looks like this method can be extended to cover more general uni variate transformations by taking information from the target probability density to identify where the critical points of the transformation are.

POSTED BY: Emerson Willard

I believe that for this problem to be uniquely solvable one has to be able to first define a basis from the family of allowed functions/distributions, as well as allowed composition rules (e.g. ParameterMixtureDistribution and TransformedDistribution) in the space of finitely integrable positive functions (i.e. pdfs). I don't really know anything about this topic but it might be very hard to do so.

POSTED BY: Eduardo Serna
f[m_] = 1/(2*E^((-m + Log[5])^2/8)*Sqrt[2*Pi]);

Integrate[f[m], {m, -Infinity, Infinity}]

1

dist = ProbabilityDistribution[f[m], {m, -Infinity, Infinity}];

Since the integral of f[m] is unity, f[m] does not have to be scaled to be a distribution. A candidate distribution will probably have two parameters and must be defined on the interval {-Infinity, Infinity}. The built-in distributions with two parameters and defined on the interval {-Infinity, Infinity} are

distributions = ToExpression /@ Names["*Distribution"];

twoParamDist = Select[distributions,
    FreeQ[PDF[#[a, b], x], #] &&
      DistributionDomain[#[a, b]] === Interval[{-Infinity, Infinity}] &] // 
   Quiet;

The target PDF for comparison is

PDF[dist, m]

enter image description here

Comparing this PDF with the form of those of the built-in two-parameter distributions will greatly limit the candidates

({#, PDF[#[a, b], m]} & /@ twoParamDist) // 
  Prepend[#, {Distribution, PDF}] & //
 Grid[#, Frame -> All] &

enter image description here

The NormalDistribution is the likely distribution

dist2 = NormalDistribution[a, b];

param = Solve[{Mean[dist2] == Mean[dist],
   Variance[dist2] == Variance[dist],
   DistributionParameterAssumptions[dist2]}, {a, b}]

{{a -> Log[5], b -> 2}}

Verifying that the PDFs are identical

PDF[dist2 /. param[[1]], m] == PDF[dist, m] // Simplify

True

So the target distribution is

dist2 /. param[[1]]

NormalDistribution[Log[5], 2]

POSTED BY: Robert Hanlon

I appreciate the responses here. But let me provide an example that is perhaps not so easy. Suppose the target function f is as follows: $\frac{7}{10 (a-2)^2}$ for the domain ( $-\infty,\frac{13}{10}$]. If we create a probability distribution out of this and then generate 10,000 random samples from the distribution and then run FindDistribution

 dis = ProbabilityDistribution[7/(10 (-2 + a)^2), {a, -\[Infinity], 13/10}];
 rv = RandomVariate[dis,10^4];
 fd=FindDistribution[rv,5]

The result is a mixture distribution of normal distributions, a beta distribution, a weibull distribution, a normal distribution and a mixture distribution of a normal distribution and a gamma distribution.

The mixture distributions are clearly of the wrong form, the normal distribution is clearly not right, Although I am not positive, I don't believe the Weibull Distribution or the Beta Distribution is correct either. In fact, I don't know what the correct answer is, though I think it might be a fairly simple transform of a single parameter distribution. The point, however, is that the FindDistribution process, does not seem to work in this case. And that's why I am hoping for something better.

ADDENDUM

Actually, I have now found (using a combination of my brain and a little Mathematica) a distribution whose PDF corresponds to the expression above. It is a transform of a Pareto Distribution. This is something that the FindDistribution method described above did not discover. The question is whether there is some algorithmic way whereby Mathematica can "see" this fact or at least make a good guess. My suspicion is that the answer will involve a closeness metric on expression trees.

TransformedDistribution[2 + (7 x)/10, 
 x \[Distributed] TransformedDistribution[-\[FormalX], \[FormalX] \[Distributed] ParetoDistribution[1, 1]]]
POSTED BY: Seth Chandler

Maybe this is not exactly what you are looking for, but at least this gives you a very close guess and it is easy to figure out the rest.

dis = ProbabilityDistribution[
   1/(2*E^((-m + Log[5])^2/8)*Sqrt[2*Pi]), {m, -Infinity, Infinity}];

PDF[dis, m] // TraditionalForm

enter image description here

Generate data:

data = RandomVariate[dis, 10^4];

Match:

FindDistribution[data]

NormalDistribution[1.60001, 1.97652]

Check:

Log[5] // N

1.6094379124341003

So you did not get symbolic Log[5] but at least you got distribution right. This will work in much more complicated cases.

  • Possible continuous distributions for TargetFunctions are: BetaDistribution, CauchyDistribution, ChiDistribution, ChiSquareDistribution, ExponentialDistribution, ExtremeValueDistribution, FrechetDistribution, GammaDistribution, GumbelDistribution, HalfNormalDistribution, InverseGaussianDistribution, LaplaceDistribution, LevyDistribution, LogisticDistribution, LogNormalDistribution, MaxwellDistribution, NormalDistribution, ParetoDistribution, RayleighDistribution, StudentTDistribution, UniformDistribution, WeibullDistribution.

  • Possible discrete distributions for TargetFunctions are: BenfordDistribution, BinomialDistribution, BorelTannerDistribution, DiscreteUniformDistribution, GeometricDistribution, LogSeriesDistribution, NegativeBinomialDistribution, PascalDistribution, PoissonDistribution, WaringYuleDistribution, ZipfDistribution.

POSTED BY: Vitaliy Kaurov

Hi Vitaliy,

I have suggested a similar approach somewhere else in this Community, but one has to be careful. For example, as you said the mean/expected value is not quite right yet. You could try to generate more random points and then use

WolframAlpha["1.60867", IncludePods -> "PossibleClosedForm"]

to get a good/sumbolic estimate of the parameter, but this is what happens:

dis = ProbabilityDistribution[1/(2*E^((-m + Log[5])^2/8)*Sqrt[2*Pi]), {m, -Infinity, Infinity}];
Grid[Join[{{"Number of points", "Estimated distribution"}}, {#, FindDistribution[RandomVariate[dis, #]]} & /@ Table[10^i, {i, 1, 7}]], Frame -> All]

enter image description here

I sort of understand the Student-t distribution, because of the high degree of freedom. The mixture distribution is a bit different though. Both are not very useful to estimate the mean of the respective Normal Distribution. There is, of course, another parameter for FindDistribution to spit out the n most likely distributions. This works like so:

FindDistribution[RandomVariate[dis, 10^6], 3]
 {StudentTDistribution[1.6071, 1.99998, 15947.3], NormalDistribution[1.60267, 2.10096], NormalDistribution[1.60267, 2.10096]}

The last two of the results coincide and are NormalDistributions. The means are still to far off for WolframAlpha to find $Log(5)$ as a likely scenario.

Cheers,

Marco

POSTED BY: Marco Thiel
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard