Results - Markov models for the evolution of duplicate genes, and microsatellites


where $V_P$ is a $\ast \times 2$ matrix of transition rates from the transient states into $P$, where $P = \{s_p, s_{pn}\}$ is the collection of states which correspond to pseudogenization (there are only ever two in the reduced state space),

$$s_p = \begin{bmatrix} z & 0 \\ 0 & 0 \end{bmatrix}, \quad \text{and} \quad s_{pn} = \begin{bmatrix} z & 0 \\ 0 & 1 \end{bmatrix}, \tag{4.17}$$

with $s_p$ corresponding to the usual pseudogenization scenario, and $s_{pn}$ corresponding to the case where the preserved copy underwent neofunctionalization before the pseudogenization of the other copy.

Further, the extension of the model to a population of gene duplicates evolving under sub- and neofunctionalization is exactly analogous to the analysis in Section 3.7. The log likelihood is given by Equations (3.107) and (3.108), but with $\tilde{F}(t)$ taken as the cumulative distribution of the time to absorption into the states associated with pseudogenization from this model, i.e. the set $\{s_p, s_{pn}\}$. We restate these here for convenience,

$$\beta(t) = \beta_0 \left(1 - \tilde{F}(t)\right), \tag{3.107}$$

$$\log L(\theta \mid D) = \sum_i \left[ D_i \log(\beta(s_i)) - \beta(s_i) - \log\Gamma(D_i + 1) \right]. \tag{3.108}$$
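As an illustration, the log likelihood of Equations (3.107) and (3.108) can be evaluated along the lines of the following minimal Python sketch. It assumes that the final term is the log-gamma normalising constant of a Poisson count model, that `D` holds the duplicate counts per age interval and `s` the corresponding interval ages, and that $\tilde{F}$ is supplied as a callable. The function names, the exponential stand-in for $\tilde{F}$ and the example counts are hypothetical placeholders, not the model or data of this chapter.

```python
import math

def beta(t, beta0, F_tilde):
    """Expected count at age t: beta(t) = beta0 * (1 - F_tilde(t)), cf. Eq. (3.107)."""
    return beta0 * (1.0 - F_tilde(t))

def log_likelihood(D, s, beta0, F_tilde):
    """Sum of D_i*log(beta(s_i)) - beta(s_i) - log Gamma(D_i + 1), cf. Eq. (3.108)."""
    total = 0.0
    for D_i, s_i in zip(D, s):
        b = beta(s_i, beta0, F_tilde)
        total += D_i * math.log(b) - b - math.lgamma(D_i + 1)
    return total

# Hypothetical stand-in for F_tilde: an exponential absorption time with rate 2.
def F_exp(t):
    return 1.0 - math.exp(-2.0 * t)

counts = [120, 95, 80]      # made-up duplicate counts per interval
ages = [0.01, 0.02, 0.03]   # made-up interval ages
print(log_likelihood(counts, ages, beta0=125.0, F_tilde=F_exp))
```

In practice `F_tilde` would be replaced by the absorption-time distribution into $\{s_p, s_{pn}\}$, and the function would be passed to a numerical optimiser over the model parameters.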

4.1.3 Results

In this section, we discuss the data-driven analysis of the model described in Section 4.1.2. We have fit the model to the Mus musculus and Homo sapiens data from Section 3.8. We excluded the Canis familiaris and Rattus norvegicus datasets on the basis of the earlier analysis in Section 3.8, which showed these genomes to be uninformative. Further, we have tested the effect of including neofunctionalization on the pseudogenization rate as compared to the subfunctionalization model in Section 3.1. We examine a wide range of $u_n$ with the remaining parameters fixed at biologically realistic values. We calculate the probability of neofunctionalization, and the relative probability of neofunctionalization before and after subfunctionalization given that it occurs at all.

Fitting to the Mus musculus and Homo sapiens genomes gave similar parameter estimates (for the shared parameters) to the earlier model (with MLEs given in Table 3.1 of Section 3.8). The maximum likelihood parameter estimates of $u_r$ and $u_c$ for the two models were identical to at least the first 5 significant figures for both genomes. The MLE for the neofunctionalization rate was $u_n \approx 10^{-10}$ for both. $u_n$ was similarly small for all but $z = 2$, for which $u_n = 2.52 > u_r = 1.74$, and $u_n = 2.37 > u_r = 1.60$ for Mus musculus and Homo sapiens respectively. The likelihood values themselves were also essentially equal to those of the subfunctionalization-only model (Section 3.8), which, having fewer parameters, would be preferred by either BIC or AIC (see Section 2.2).
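To make the model comparison concrete: both criteria penalise the number of free parameters, so when two nested models attain essentially the same maximised log likelihood, the smaller model scores better. The sketch below uses the standard definitions $\mathrm{AIC} = 2k - 2\log L$ and $\mathrm{BIC} = k\log n - 2\log L$. The log-likelihood value, parameter counts and sample size are hypothetical placeholders, not values from the fits reported here.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 log L (lower is better)."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k log n - 2 log L (lower is better)."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical values for illustration only.
log_lik = -350.0   # shared maximised log likelihood of both models
n_obs = 100        # number of observations
k_sub = 3          # parameters in the subfunctionalization-only model
k_neo = 4          # same parameters plus the neofunctionalization rate u_n

print("AIC:", aic(log_lik, k_sub), "vs", aic(log_lik, k_neo))
print("BIC:", bic(log_lik, k_sub, n_obs), "vs", bic(log_lik, k_neo, n_obs))
```

With equal log likelihoods, the difference between the two scores is exactly the penalty for the extra parameter, which is why the simpler model is selected.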

As mentioned in Section 4.1, we expected that there may be identifiability issues in fitting this model to data. Our reasoning was based on the conclusion from Chapter 3 that subfunctionalization alone could lead to survival distributions of the kind which have previously been attributed to neofunctionalization. We did not find model identifiability to be a problem in practice, with a range of starting points converging to the same maximum likelihood parameter estimates for $u_r$, $u_c$, and $z$. However, there was some variation in $u_n$. Moreover, for both genomes, the difference in likelihood between $u_n = 10^{-10}$ and $u_n = 10^{-3}$ (with the other parameters at their MLE values) was extremely small. The two likelihood values were equal up to the 6th and 5th significant figure for Mus musculus and Homo sapiens respectively. So, although we were able to identify a clear maximum likelihood estimate, the parameter $u_n$ in particular would be associated with a wide confidence interval.

Our estimates were based on intervals with $s = 0.01$, corresponding to roughly 1.1 million years (as discussed in Section 3.8), so that one unit of model time corresponds to roughly 110 million years. As such, $u_n \approx 10^{-10}$ implies that on average 1 out of every $10^{10}$ regulatory regions would neofunctionalize every 110 million years. With roughly 20000 genes in the genomes we examined, even assuming all of these genes are targets for neofunctionalization at all times, with 3 regulatory regions each, $u_n \approx 10^{-10}$ would correspond to the fixation of less than one beneficial mutation on average over the entire history of life on earth. Clearly, $u_n = 10^{-10}$ is an underestimate. Since the likelihood did not drop off significantly until $u_n > 10^{-3}$, estimates closer to this end of the interval seem more realistic. In either case, our results are suggestive of a very low rate of neofunctionalization as compared to regulatory nonfunctionalization.
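The back-of-envelope argument above can be checked with a few lines of Python. The rate, gene count, number of regulatory regions and time scaling are taken from the text; the age of life on earth (about 3.8 billion years) is an assumption introduced here purely for illustration, and the function name is hypothetical.

```python
# Back-of-envelope check of the expected number of neofunctionalization events.
u_n = 1e-10                     # rate per regulatory region per unit of model time
years_per_unit_time = 110e6     # s = 0.01 corresponds to roughly 1.1 million years
genes = 20_000                  # approximate gene count, from the text
regions_per_gene = 3            # regulatory regions per gene, from the text
history_of_life_years = 3.8e9   # ASSUMPTION: rough age of life on earth

def expected_events(rate, n_regions, years, years_per_unit):
    """Expected event count = rate * number of regions * elapsed model time."""
    return rate * n_regions * (years / years_per_unit)

expected = expected_events(u_n, genes * regions_per_gene,
                           history_of_life_years, years_per_unit_time)
print(f"expected neofunctionalization events: {expected:.1e}")
```

Under these assumptions the expected count is of order $10^{-4}$, far below one event, which is the sense in which $u_n = 10^{-10}$ must be an underestimate.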

We computed $h(t)$ (Equation (3.34)) and $h_n(t)$ (Equation (4.16) above) at the maximum likelihood parameter estimates of $u_c = 20.1$, $u_r = 3.26$, and $z = 3$ from the Mus musculus genome for a range of $u_n$. The rates $h(t)$, and $h_n(t)$ with $u_n = 0.001u_r$, $0.01u_r$ and $0.1u_r$, are shown in Figure 4.1. The rates were somewhat divergent for $u_n = 0.1u_r$, but a ratio of 1 : 10 beneficial to neutral mutations in the regulatory regions is extremely high, and it vastly exceeds the maximum likelihood estimate of $4.8 \times 10^{-10}$. For the other values, the graphs were very close on examination by eye, with $u_n = 0.001u_r$ indistinguishable by eye from $u_n = 0$ (noting that $h(t) = h_n(t)$ when $u_n = 0$).

We calculated the probability of neofunctionalization before and after subfunctionalization (which is of particular interest). We treated the states associated with neofunctionalization as absorbing, and computed the probability that the modified process is eventually absorbed into the states associated with neofunctionalization only, and into those associated with neofunctionalization following subfunctionalization. We again chose $u_r$ and $u_c$ as the MLE parameters from the Mus musculus genome. Figure 4.2 shows the probability that the process ever neofunctionalizes as a function of $u_n$ in the range from 0 to $0.5u_r$ for $z = 2, \ldots, 9$ (of course, this goes to 0 as $u_n$ does). Figure 4.3 shows the probability that subfunctionalization occurs before neofunctionalization, conditional on neofunctionalization occurring, as a function of $u_n$ for the same values of $z$. Most likely $u_n$ is orders of magnitude smaller than $u_r$, since beneficial mutations are expected to be very rare [42], and our fit to the Mus musculus and Homo sapiens genomes supports this.
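The absorption calculation described above can be illustrated with a minimal sketch. For a continuous-time Markov chain with transient-to-transient generator block $Q_{TT}$ and transient-to-absorbing rate block $R$, the matrix of absorption probabilities $B$ solves $(-Q_{TT})B = R$. The two-transient-state chain below, its rates and all function names are hypothetical and chosen only to show the mechanics; it is not the reduced state space of Section 4.1.2.

```python
def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, A[i])) + list(map(float, B[i])) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]

def absorption_probabilities(Q_TT, R):
    """Row i gives the probabilities of ending in each absorbing state from state i."""
    neg_Q = [[-q for q in row] for row in Q_TT]
    return solve(neg_Q, R)

# Hypothetical rates: subfunctionalization, neofunctionalization, and
# pseudogenization before and after subfunctionalization.
u_sub, u_neo, p_before, p_after = 1.0, 0.1, 2.0, 3.0

# Transient states: 0 = not yet subfunctionalized, 1 = subfunctionalized.
Q_TT = [[-(u_sub + u_neo + p_before), u_sub],
        [0.0, -(u_neo + p_after)]]
# Absorbing states (columns): neo only, sub then neo, pseudogenized.
R = [[u_neo, 0.0, p_before],
     [0.0, u_neo, p_after]]

B = absorption_probabilities(Q_TT, R)
p_neo_only, p_sub_then_neo, p_pseudo = B[0]
p_neo = p_neo_only + p_sub_then_neo
print("P(neofunctionalization) =", p_neo)
print("P(sub then neo | neo)   =", p_sub_then_neo / p_neo)
```

The first row of `B` corresponds to starting in the unresolved duplicate state; the conditional probability is the ratio of the "sub then neo" entry to the total probability of neofunctionalization.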

Based on this analysis, it appears that neofunctionalization is not a significant contributor to the preservation of gene duplicates. Neofunctionalizing mutations appear to be extremely unlikely during the timescales over which regulatory subfunctionalization is resolved. Thus copies which do not undergo subfunctionalization are lost before neofunctionalization has a chance to occur. Given that neofunctionalization does occur, the probability (associated with the MLEs) that it occurs before subfunctionalization was 0.80 for both genomes.

From the modelling perspective, this is an intuitive result, given that for $z = 3$, after subfunctionalization occurs the next non-neofunctionalizing mutation must lead to the absorption of the process. As such, the window of opportunity for neofunctionalization after subfunctionalization is likely to be much shorter than it is beforehand in this case. However, this result only applies for the timescales over which regulatory subfunctionalization is resolved, and should not be regarded as counter-evidence to the hypothesis that subfunctionalization plays a protective role in support of subsequent neofunctionalization.

In reality, it is likely that neofunctionalization may occur over much larger timescales: potentially, the nonfunctionalized regulatory regions would have a non-zero rate of neofunctionalizing mutations, provided that the coding region remains intact. A model in which the nonfunctionalized regulatory regions can become neofunctionalized at some small rate may be more realistic in this sense, and the proportion of neofunctionalization events preceded by subfunctionalization could only increase under such an assumption.

Figure 4.1: Pseudogenization rate $h_n(t)$ for $u_n = 0$, $0.001u_r$, $0.01u_r$, $0.1u_r$ with $u_c = 20.1$, $u_r = 3.26$ and $z = 3$, which were the maximum likelihood parameter estimates associated with the Mus musculus genome. $h(t)$ is given by Equation (3.34), while $h_n(t)$ is given by Equation (4.16). [Horizontal axis: $t$; the curve for $u_n = 0$ is $h(t) = h_n(t)$.]

Rastogi and Liberles [97] concluded from a lattice model analysis that neofunctionalization was the ultimate fate for all preserved gene duplicates. They noted that subfunctionalization could act to protect the genes from pseudogenization during short timescales, and our results appear to support that hypothesis. Under our model, if a gene is preserved at all, it is almost certainly due to sub- and not neofunctionalization. Subsequent neofunctionalization is extremely unlikely in the event that one of the copies becomes pseudogenized by null mutation in the coding region. Thus, for neofunctionalization to occur some other process must almost always have preceded it to preserve the duplicate pair, and subfunctionalization appears to be a likely candidate. In this regard, our analysis can be thought of as providing an estimate of a lower bound on the proportion of neofunctionalization events preceded by subfunctionalization (which we estimate to be 0.2 for $z = 3$, 0.35 for $z = 4$, and 0.46 for $z = 5$ for the Mus musculus genome). Given the extremely low probability of neofunctionalization during the timescale of our model, and the possibility for subsequent neofunctionalization over a much larger timescale, the proportion of neofunctionalization events preceded by subfunctionalization is likely much larger. A caveat is that processes other than subfunctionalization (most notably dosage balance [116]) could play a similar role, but based on this analysis neofunctionalization alone is not very likely.

Figure 4.2: The overall probability of neofunctionalization as a function of $u_n$ for $z = 2, \ldots, 9$ (plots are strictly ascending in $z$), with $u_c = 20.07$, $u_r = 3.26$. [Axes: $u_n$ (units of $u_r$) against P(Neofunctionalization).]

Figure 4.3: Probability that subfunctionalization occurs before neofunctionalization, conditional on neofunctionalization occurring, as a function of $u_n$ for $z = 2, \ldots, 9$ (plots are strictly ascending in $z$), with $u_c = 20.07$, $u_r = 3.26$. [Axes: $u_n$ (units of $u_r$) against P(Sub then neo | neo).] When $z = 2$ only one of sub-

4.2 Preliminary work modeling the evolution of gene families