
Probability distribution of two numbers randomly changing?

Posted 8 years ago

Hello,

I need help from this community, hoping that there is an experienced mathematician reading this post. I would like to calculate the probability distribution for the following problem: two numbers, each drawn from a given set of a values (for instance values from 1 to 6, like dice, so a = 6), can change randomly with probability m at each trial. What is the probability that after v trials the numbers reach a given pair of target numbers? This is simple if m = 1, that is, if both numbers change at every trial. In that case, the probability for any number to be the right one is p = 1/a. On an unsuccessful trial either both numbers are wrong, or one is right and the other wrong. So before both numbers are correct with probability p^2 on trial v, there are v - 1 unsuccessful trials, which gives

P[v_] := ((1 - p)^2 + 2 (1 - p) p)^(v - 1) p^2
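
As a quick sanity check (taking for example a = 6, so p = 1/6), these probabilities sum to 1 and the mean number of trials is 1/p^2 = a^2 = 36:

p = 1/6;                          (* p = 1/a with a = 6 *)
Sum[P[v], {v, 1, Infinity}]       (* the probabilities sum to 1 *)
Sum[v P[v], {v, 1, Infinity}]     (* mean number of trials: 1/p^2 = 36 *)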

Now if m < 1 one would expect that p = m/a with the same distribution as above, but this is far from being the case, as numerical results show. I used the following code to get these numerical results:

z = 10^3;                            (* number of simulation runs *)
Do[e[i] = 0, {i, 0, 10^7}];          (* counters for the number of trials v *)
a = 6; t = 2; target = {2, 5};       (* alphabet size, number of digits, target pair *)
Do[
 v = 0; parent = RandomInteger[{1, a}, t];
 While[parent != target,
  v = v + 1;
  Do[
   mutate = RandomInteger[{1, 20}];  (* each digit mutates with probability m = 1/20 *)
   If[mutate == 1, parent[[i]] = RandomInteger[{1, a}]],
   {i, 1, t}]];
 e[v] = e[v] + 1,                    (* record how many trials this run needed *)
 {z}]

For large z, the quantity e[v]/z tends to the probability P(v); the mutation probability here is m = 1/20 (set via the mutate variable). I found an exact analytical solution, but it is horribly complicated: I would need several pages to explain it here. This is why I wonder whether there is not a simpler solution. I am sure that this calculation has already been made and is written down in some book. Does anyone have an idea?
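
To look at the result, one can for instance plot the empirical distribution e[v]/z afterwards (a minimal sketch; the cutoff of 500 trials is arbitrary):

empirical = Table[{v, N[e[v]/z]}, {v, 1, 500}];   (* e[v]/z approximates P(v) *)
ListPlot[empirical, AxesLabel -> {"v", "P(v)"}, PlotRange -> All]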

POSTED BY: Ulrich Utiger
11 Replies
Posted 7 years ago

Well, for me this is so obvious that there is no need to spell it out. If evolution is not random, what else can it be? What is the opposite of random? There are not a lot of alternatives. I don't answer this in the article because I want the reader to answer it for him/herself.

POSTED BY: Ulrich Utiger

Interesting article, with obviously a lot of work behind it. My knowledge of this topic is almost zero, but the direction of the thinking seems clear. Perhaps the conclusion is obvious, but I felt it is not really spelled out. You ended your research with:

This is why the time is overdue to unmask this blind passenger, let him row back to his country of origin, the land of fantasy, and open the mind for the obvious origin of the immense and beautiful variety of all vegetal and animal species.

I just wondered, what is the obvious origin you're mentioning? Just curious.

POSTED BY: l van Veen
Posted 7 years ago

In case someone is interested, I published an update on ResearchGate. Do you know a journal that would be interested in publishing this kind of stuff?

POSTED BY: Ulrich Utiger
Posted 7 years ago

Hello,

I finally solved all the problems and finished my paper. It took a lot of time... I published it on academia.edu; if you are interested, you can see it here. Any comments and suggestions are highly appreciated; you can leave them directly in the article.

POSTED BY: Ulrich Utiger

After a bit of reflection, I see what's still missing from my story: everything I wrote is correct, but only for a process that does not terminate. That's where the nuance lies. The problem is now to find the probability distribution for the first time the process reaches a given state (which I haven't calculated yet). At the moment I don't have time to go into depth about this (and I'd have to figure it out for myself as well), but I suggest you take a look at FirstPassageTimeDistribution in Mathematica.

For example, if you have a single letter drawn from A, T, C and G that mutates with probability m, then Mathematica gives the probability of reaching the state A for the first time after v steps as:

PDF[
  FirstPassageTimeDistribution[
     DiscreteMarkovProcess[{1, 1, 1, 1}/4, transitionMatrix[m]], 
     1
  ], v] = (9 (1 - m/3)^v m) / (4 (-3 + m)^2) 

(for v > 1). I suggest you play around with that.
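
If you want a self-contained snippet to play with (a minimal sketch; transitionMatrix is the 4-letter matrix from my other reply, repeated here, and m = 1/20 in the last line is just an example value), you could try:

transitionMatrix[m_] := {
   {1 - m, m/3, m/3, m/3},
   {m/3, 1 - m, m/3, m/3},
   {m/3, m/3, 1 - m, m/3},
   {m/3, m/3, m/3, 1 - m}};
fpt[m_] := FirstPassageTimeDistribution[
   DiscreteMarkovProcess[{1, 1, 1, 1}/4, transitionMatrix[m]], 1];
FullSimplify[PDF[fpt[m], v], v > 1 && 0 < m < 1]  (* the closed form quoted above *)
Mean[fpt[1/20]] // N                              (* mean waiting time for m = 1/20 *)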

POSTED BY: Sjoerd Smit
Posted 8 years ago

Maybe your method is much better than mine for this kind of problem, but it still lacks something, as it does not yield the same result as my solution, which I know is exact because it agrees with the numerical one (see the plot below).

[Plot: probability distribution; blue points = numerical result, red line = analytical result]

What I got is the following for a=4, v=2 and v=3:

P[2] = -(3/256) m (-8 + 3 m);
P[3] = -((3 m (-128 + 80 m - 48 m^2 + 21 m^3))/4096);
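
These expressions can be checked with a quick Monte Carlo run (a rough sketch along the lines of the simulation in my first post; the values m = 1/10 and target = {2, 3} are just examples, and vmax and counts are helper names used only in this sketch):

a = 4; t = 2; m = 1/10; target = {2, 3}; z = 10^5; vmax = 3;
counts = ConstantArray[0, vmax];
Do[
 parent = RandomInteger[{1, a}, t]; v = 1;   (* generation 1 = initial random draw *)
 While[parent =!= target && v < vmax,
  v++;
  Do[If[RandomReal[] < m, parent[[i]] = RandomInteger[{1, a}]], {i, t}]];
 If[parent === target, counts[[v]]++],
 {z}];
N[counts/z]   (* empirical P(1), P(2), P(3) *)
{1/a^2, -(3/256) m (-8 + 3 m), -((3 m (-128 + 80 m - 48 m^2 + 21 m^3))/4096)} // N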

Your solution does not seem to yield the same. Maybe this is because the letters are not independent of each other: for a trial v to be successful, all letters must be correct, that is, correspond to the target. Also, whether a letter is correct on trial v depends on the outcome of trial v - 1, the previous one. For instance, if it was correct previously and does not mutate, it stays correct; this probability is P(correct and does not mutate) = (1/a)(1 - m). On the other hand, P(wrong and mutates to the correct number) = (1 - 1/a)(m/a). These are not the same, so there are conditional probabilities that you possibly did not consider. But your method may well be improvable. At the moment I don't see it clearly, since I am not familiar with Markov chains, but I will definitely take a closer look at this.

POSTED BY: Ulrich Utiger

Thanks for the clarification, that helps a lot. I can now safely say that the problem you're interested in (without natural selection) is definitely a Markov chain (or actually a set of Markov chains). So for this problem, I highly recommend reading up on those; there is a lot of knowledge about them that you can find and that will help you.

So if I understand correctly, in the problem without natural selection, every generation each letter (I'm going to stick with letters for now) has a probability to change. Each change (if it occurs) draws a new value completely at random from a set of letters of size a. From my point of view, this means that the probability distributions of the individual letters are completely independent of each other: all you need to know is the probability P(v) that one letter has the right value at a certain generation v. Then you just raise that probability to the power of the number n of letters that you have. Do you agree with that?

This probability P(v) is straightforward to calculate from the theory of Markov chains. I'll try to do the translation as best I can. Suppose we have as our letters A, T, G and C, and we are going to follow what happens to a particular nucleotide as it mutates. At any given generation v, we will denote the probabilities that the nucleotide is a particular flavor by a vector:

{PA(v), PT(v), PC(v), PG(v)},

where:

PA(v) + PT(v) + PC(v) + PG(v) == 1

(because it's definitely one of the four). So e.g., PA(v) is the probability that the nucleotide has value A at generation v. The vector

{PA(v), PT(v), PC(v), PG(v)}

is called the state vector of the Markov chain at step v. Next, we want to do something with this vector: we want to advance it one step of time. This we do with a matrix that contains the transition probabilities to go from one state to another. For our 4-letter alphabet, this matrix is:

      transitionMatrix[m_] = 
        { 
          {1 - m, m/3, m/3, m/3},
          {m/3, 1 - m, m/3, m/3},
          {m/3, m/3, 1 - m, m/3},
          {m/3, m/3, m/3, 1 - m}
        }

Take the first row of this matrix: these are the transition probabilities if the nucleotide is an A right now. The 1 - m entry (which is on the diagonal of the matrix) is the probability that the nucleotide doesn't change: if it was an A, it stays an A. The off-diagonal m/3 entries are the probabilities that it changes into T (2nd entry in the row: m/3), C (3rd entry in the row: also m/3) or G (4th entry in the row: again m/3). The remaining rows are the transition probabilities for when the nucleotide is currently T, C or G, respectively. Note that every row sums to 1.
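
A quick way to check that last point is to evaluate the row totals of the matrix defined above:

Simplify[Total /@ transitionMatrix[m]]   (* every row sums to 1: {1, 1, 1, 1} *)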

Now here comes the crucial bit: to get the state

{PA(v+1), PT(v+1), PC(v+1), PG(v+1)}

for the next generation, you perform a left-side matrix multiplication:

 {PA(v+1), PT(v+1), PC(v+1), PG(v+1)} =  
   {PA(v), PT(v), PC(v), PG(v)} . transitionMatrix[m] 

Judging from what you wrote down, you already understand the basic idea here, though make sure you really understand this step.
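
For instance, you can iterate this step directly (a small sketch, with a hypothetical mutation rate m = 1/20 and a nucleotide that starts as A with certainty):

NestList[#.transitionMatrix[1/20] &, {1, 0, 0, 0}, 3] // N
(* successive state vectors, drifting towards {1/4, 1/4, 1/4, 1/4} *)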

Bear with me for a bit longer, we're nearly there. You assumed the letters at the first generation (v = 1) are random, so {PA(1), PT(1), PC(1), PG(1)} = {1/4, 1/4, 1/4, 1/4}. Note that this is a left eigenvector of transitionMatrix[m] (with eigenvalue 1):

{1/4, 1/4, 1/4, 1/4}.transitionMatrix[m] == {1/4, 1/4, 1/4, 1/4 }

This means that the probability vector doesn't change: at every step, the probabilities for A, T, C and G remain 1/4, no matter the value of m!

If you now want the state vector at a given v for any given starting state, you use DiscreteMarkovProcess in Mathematica. For example, try:

FullSimplify[
 PDF[
  DiscreteMarkovProcess[{1,0,0,0}, transitionMatrix[m]][v], 
  k]
]

Here, {1,0,0,0} is the starting state (the nucleotide starts as A with probability 1). v is the generation again, and k is the index of the letters: k = 1 represents A, k = 2 represents T, k = 3 represents C, and k = 4 represents G. If you evaluate this code, you'll see that you get a probability distribution that exponentially decays to the equilibrium state vector {1/4, 1/4, 1/4, 1/4} with increasing v.
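
For instance, with a hypothetical m = 1/20, the probability of finding an A (k = 1) at successive generations looks like this:

Table[
  PDF[DiscreteMarkovProcess[{1, 0, 0, 0}, transitionMatrix[1/20]][v], 1],
  {v, 1, 6}] // N
(* decays exponentially towards the equilibrium value 0.25 *)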

So there you have it. I think this should answer your problem, though if I misinterpreted part of the question you should let me know. At the very least this should put you on the right track, I hope.

POSTED BY: Sjoerd Smit
Posted 8 years ago

I am looking for an analytical solution to the problem Richard Dawkins explored numerically in his book "The Blind Watchmaker". His goal was to show that functional pieces of DNA can be formed by chance within reasonable time limits. To do this, he used an algorithm similar to the one in my first post. His target was the Shakespeare phrase "methinks it is like a weasel", so here a = 27, the length of the English alphabet with the space included. This sentence can be reached within only a few steps if natural selection enters into the process. To simulate this, he created a parent in the first iteration, that is, an arbitrarily chosen series of 28 letters. Then he made kids of this parent by copying it a number of times. These are not verbatim copies: each letter can mutate with probability m when it is copied. He then takes the kid that is closest to the target, makes it the new parent, and the algorithm restarts. He showed that in this manner the target is reached after some 50 iterations or so.

In my algorithm above, I did not include natural selection yet. Numerically, this is easy to do, but it is more difficult to get an analytical solution, which is what I am working on right now. I simplified the algorithm by using numbers instead of letters (it works faster) and a reduced alphabet. The probability distribution above was found with matrices for the target {2, 5}; it doesn't matter what the target is, it could also be {1, 1}. So my goal is to find an analytical probability distribution for a target of arbitrary length t: what is the probability P(v) that the target is attained in v trials? From this the mean value of v could then be calculated, which is a function of a, m, t and the number n of kids produced. With such a function, it would be possible to estimate whether natural selection is a probable process. This is the final goal.
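
Just for illustration, the selection step itself is easy to simulate directly (a rough sketch of the algorithm described above; the values n = 100 kids per generation and m = 1/20 are only examples):

target = Characters["methinks it is like a weasel"];
alphabet = Append[CharacterRange["a", "z"], " "];   (* a = 27 symbols *)
n = 100; m = 1/20;
mutate[s_] := Map[If[RandomReal[] < m, RandomChoice[alphabet], #] &, s];
parent = RandomChoice[alphabet, Length[target]];    (* arbitrary initial parent *)
v = 0;
While[parent =!= target,
 v++;
 kids = Table[mutate[parent], {n}];
 parent = First[MinimalBy[kids, HammingDistance[#, target] &]]];
v   (* number of generations until the target is reached *)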

POSTED BY: Ulrich Utiger

I get the impression that we don't really understand each other here. Could you repeat your basic problem again, but as clearly as possible? I'm not quite sure I understand what you're actually interested in and what you're doing.

POSTED BY: Sjoerd Smit
Posted 8 years ago

Hello Sjoerd,

Thanks for your answer. Your proposal is interesting. However, I don't yet see how this problem can be solved with Markov chains, as I am not familiar with this theory, but I will read more about it. The distribution depends strongly on m: take the case m = 0, then P(1) = 1/a^2 and P(v) = 0 for every v > 1; if m = 1, then P(v) is as in my first post. The solution I found has a similar form:

P[v_] := If[v == 1, 1/a^2, 
   Sum[(G.MatrixPower[B, v - 2].A)[[i]], {i, 1, 3}]];

The matrices G and B are as follows (they stand for good and bad outcomes):

B = {{1 - m + ((-1 + a) m^2)/a^2, ((-1 + a) m^2)/a^2, ((a - m) m)/a^2},
   {((-1 + a) m^2)/a^2, 1 - m + ((-1 + a) m^2)/a^2, ((a - m) m)/a^2},
   {((-1 + a) (a - m) m)/a^2, ((-1 + a) (a - m) m)/a^2, (a - m)^2/a^2}};
G = {{((1 - m) m)/a + m^2/a^2, 0, 0},
   {0, ((1 - m) m)/a + m^2/a^2, 0},
   {0, 0, m^2/a^2}};

And the vector A is

A = {(-1 + a)/a^2, (-1 + a)/a^2, (-1 + a)^2/a^2};

As you can see, the dimension could be reduced to 2x2 matrices since some elements are equal, but it was easier to "visualize" the different events this way. It is quite difficult to explain how I found this result. I would have liked a more straightforward method, as I would like to calculate the distribution for a target of arbitrary length (not just for a pair of numbers). With the method I used, this will probably be a difficult task.
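
With these definitions, the expressions I posted earlier can be reproduced, for example by setting a = 4:

Simplify[P[2] /. a -> 4]   (* should reproduce -(3/256) m (-8 + 3 m) *)
Simplify[P[3] /. a -> 4]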

POSTED BY: Ulrich Utiger

The problem you describe sounds like two Markov chains (or discrete-time Markov processes). Each number (call them x and y) occupies a certain state, and at each step of the process it has a probability m of moving to a different state (with each new state being equally likely).

This problem is actually fairly straightforward: the stationary distribution of the process that x (or y) follows is simply that x is equally likely to be in any of the given states, so P(x = k) = 1/a. Because the initial value of x is also uniformly random, it follows that x is already in the stationary distribution from the start.

It may sound strange, but the value of m has no influence on this. For example, try this code:

StationaryDistribution[
   DiscreteMarkovProcess[
      {1,0,0,0},
      { 
        {1 - m, m/3, m/3, m/3},
        {m/3, 1 - m, m/3, m/3},
        {m/3, m/3, 1 - m, m/3},
        {m/3, m/3, m/3, 1 - m}
      }
   ]
]

The matrix is the transition matrix of the Markov process (for 4 states in this case). Each entry M_ij is the probability that the next state is j if the current state is i (note that the rows of M sum to 1). As you can see by executing the code, the stationary distribution does not depend on m; it is simply 1/4 for every state. This knowledge should make your problem fairly easy, I think.
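
To see the actual numbers, you can plug in a concrete value, say m = 1/20 (the choice does not matter as long as 0 < m < 1):

With[{m = 1/20},
 proc = DiscreteMarkovProcess[{1, 0, 0, 0},
   {{1 - m, m/3, m/3, m/3}, {m/3, 1 - m, m/3, m/3},
    {m/3, m/3, 1 - m, m/3}, {m/3, m/3, m/3, 1 - m}}]];
Table[PDF[StationaryDistribution[proc], k], {k, 1, 4}]   (* gives {1/4, 1/4, 1/4, 1/4} *)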

Of course, if the transition matrix is different (for example, not every state is equally likely to be visited), then the story becomes different. I suggest you read up on Markov chains and processes if you want to know more.

POSTED BY: Sjoerd Smit