Distributions of cherries and pitchforks for the Ford model
Gursharn Kaur∗, Kwok Pui Choi†, and Taoyang Wu‡
October 7, 2021
Abstract
Distributional properties of tree shape statistics under random phylogenetic tree models play an important role in investigating evolutionary forces underlying real world phylogenies. In this paper, we study two subtree counting statistics, the number of cherries and that of pitchforks, for Ford’s alpha model, a one-parameter family of random phylogenetic tree models which includes as specific instances both the uniform and the Yule models, two tree models commonly used in phylogenetics. Based on a version of the extended Pólya urn models, in which negative entries are permitted for their replacement matrices, we obtain the strong laws of large numbers and the central limit theorems for the joint distribution of these two count statistics for the Ford model. Furthermore, we derive a recursive formula for computing the exact joint distribution of these two statistics, which leads to higher order asymptotic expansions of their marginal and joint moments.
1 Introduction
One important topic in many branches of biology is to understand evolutionary events and forces leading to current biological systems, such as a group of species or strains of a virus.
To this end, evolutionary relationships among the biological system under investigation are typically represented by a phylogenetic tree, that is, a binary tree whose leaves are labelled by the taxon units in the system. As these events and forces, such as rates of speciation and expansion, are often not directly observable [22, 16], one popular approach is to compare empirical shape indices computed from trees inferred from real datasets with those predicted by a null tree growth model [5, 15]. Furthermore, topological tree shapes are also closely related to several fundamental statistics in population genetics [12, 2] and certain important parameters in the dynamics of virus evolution and propagation [9].
∗Biocomplexity Institute, University of Virginia, Charlottesville, USA 22911.
†Department of Statistics and Data Science, and the Department of Mathematics, National University of Singapore, Singapore 117546.
‡School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, U.K.
arXiv:2110.02850v1 [math.PR] 6 Oct 2021
One important family of tree shape statistics are balance indices, such as Colless’ index, Sackin’s index and the number of subtrees (see, e.g. [13] and the references therein). Various properties concerning these statistics have been established in the past decades on the following two fundamental random phylogenetic tree models: the Yule model (aka the Yule-Harding-Kingman (YHK) model) [24, 11, 18] and the uniform model (aka the proportional to distinguishable arrangements (PDA) model) [21, 6, 25, 8]. However, for phylogenetic trees inferred from real datasets, the Yule or uniform model may not always be a good fit [5], and several general classes of random trees have been proposed for modelling and analysing the observed data, two popular ones being Ford’s alpha model [14] and Aldous’ beta model [1].
In this paper, we confine ourselves to Ford’s alpha model, a one-parameter family of random tree growth models introduced by Daniel J. Ford in his PhD thesis [14]. More precisely, under the Ford model with a fixed parameter 0 ≤ α ≤ 1, a random tree with a given number of leaves is generated as follows: at any step in which a tree Tn with n leaves has been constructed from previous steps, a new leaf attaches to any given internal edge of Tn with probability α/(n − α) and to any given leaf edge of Tn with probability (1 − α)/(n − α). Since Tn has n leaf edges and n − 1 internal edges (including the one incident with the root), these probabilities sum to [n(1 − α) + (n − 1)α]/(n − α) = 1. The resulting random tree model will be referred to as the Ford model (with parameter α) in this paper, which is also known as the alpha tree model (see, e.g. [10]). Note that the Ford model is a family of random tree models which includes the Yule model with α = 0, the uniform model with α = 1/2, and the Comb model with α = 1.
The tree shape indices studied in this paper are the number of cherries and that of pitchforks. Here a cherry is a subtree with precisely two leaves and a pitchfork is a subtree with precisely three leaves. The asymptotic properties of the number of cherries were first studied by McKenzie and Steel [21], who showed that the number of cherries is asymptotically normal for the Yule and the uniform models as the number of leaves tends to infinity. Later, similar properties of the number of cherries were extended to the Ford model [14, Theorem 57] and to the Crump-Mode-Jagers branching process [23]. For the number of pitchforks, Rosenberg [24] obtained its mean and variance, and Chang and Fuchs [6] proved that the number of pitchforks is also asymptotically normal for the Yule and the uniform models.
For the joint distribution of cherries and pitchforks, Holmgren and Janson [18] showed that it is asymptotically normal for the Yule model. This was recently extended by us to the uniform model based on a uniform version of the extended urn models in which negative entries are permitted for their replacement matrices [7].
In this paper, we establish the strong law of large numbers and the central limit theorem for the joint distribution of cherries and pitchforks under the Ford model (Theorem 3.2) by considering an associated nonuniform urn model (Theorem 3.1). These results are presented in Section 3, following Section 2 in which we collect background concerning the Ford model and limiting theorems on uniform urn models. Furthermore, we derive a recurrence formula for computing the exact joint distribution under the Ford model (Theorem 4.2) in Section 4, generalizing the results in [25, 8] for the Yule and the uniform model. This recurrence formula enables us to obtain exact expressions for the mean and variance of the number
of cherries and that of pitchforks and their covariance under the Ford model. This, in particular, generalises the exact expressions of mean and variance for the number of cherries and that of pitchforks for the Yule and the uniform models [21, 24, 6] and the number of cherries for the Ford model [14, Theorem 60]. As an application, in Section 5 we obtain higher order expansions of the first and second moments of the joint distributions.
2 Ford Model and Urn Model
In this section, we first introduce the Ford model, which is a one-parameter family of random phylogenetic tree models. Next we present a nonuniform version of the extended urn models associated with the Ford tree model. Finally, we recall certain conditions on the related uniform version of the extended urn model under which the strong law of large numbers and the central limit theorem are obtained.
2.1 Ford model
A rooted binary tree is a finite connected simple graph without cycles that contains a unique vertex of degree 1 designated as the root and all the remaining vertices are of degree 3 (interior vertices) or 1 (leaves). A phylogenetic tree with n leaves is a rooted binary tree whose leaves are bijectively labelled by the elements in {1, . . . , n}. Edges incident with leaves are referred to as pendant edges.
Under the Ford model with parameter 0 ≤ α ≤ 1, a random phylogenetic tree Tn with n leaves is constructed recursively by adding one leaf at a time as follows. Fix a random permutation (x1, . . . , xn) of {1, . . . , n}. The initial tree T2 contains precisely two leaves (i.e., one cherry), which are labelled x1 and x2. For the recursive step, given a tree Tm with m leaves constructed so far, choose a random edge in Tm according to the distribution that assigns weight 1 − α to each pendant edge (i.e., those incident with a leaf) and weight α to each of the other edges. The new leaf labelled xm+1 bifurcates the selected edge and joins in the middle. Every single addition of a leaf to the tree results in a replacement of the selected edge with two new edges. Finally, we let An and Cn denote the numbers of pitchforks and cherries in tree Tn, respectively.
2.2 An urn model associated with trees
Consider an urn containing balls of d different colours where colours are denoted by integers {1, 2, . . . , d}. Let Un= (Un,1, . . . , Un,d) be the configuration vector of length d such that the i-th element of Un is the number of balls of colour i at time n. Let U0 be the initial vector of colour configuration, then at every time n ≥ 1, a ball is selected uniformly at random from the urn and if the colour of the selected ball is i then the ball is replaced along with Ri,j many balls of colour j, for every 1 ≤ j ≤ d. The dynamics of the urn configuration depends on its initial configuration U0 and the d × d replacement matrix R = (Ri,j)1≤i,j≤d.
Figure 1: A sample path of the Ford model and the associated trajectory under the urn model. (i) A sample path of the Ford model evolving from T2 with two leaves to T6 with six leaves. The labels of the leaves are omitted for simplicity. The type of an edge is indicated by the circled number next to it. For 2 ≤ i ≤ 5, the edge selected in Ti to generate Ti+1 is highlighted in bold and the associated edge type is indicated in the circled number above the arrow. (ii) The associated urn model with six colours, derived from the types of the edges in the trees. In vector form, U0 = (0, 2, 0, 0, 1, 0), U1 = (2, 0, 1, 0, 0, 2), U2 = (0, 4, 0, 0, 2, 1), U3 = (2, 2, 1, 0, 1, 3), and U4 = (2, 2, 1, 1, 1, 4).
We study the limiting properties of the numbers of cherries and pitchforks via an equivalent urn process. Towards this, we use six different colours and assign one colour to each type of edge of the tree in the following scheme introduced in [7]: colour 1 for all pendant edges of a cherry in a pitchfork; colour 2 for pendant edges of a cherry not contained in a pitchfork; colour 3 for pendant edges in a pitchfork but not in any cherry; colour 4 for pendant edges in neither a cherry nor a pitchfork; colour 5 for internal edges adjacent to a cherry but not in a pitchfork (i.e., those adjacent to colour 2 edges), and colour 6 for all other (necessarily internal) edges (including the one incident with the root). See Fig. 1 for an illustration of the scheme.
Consider an urn with colour configuration at time n as Un= (Un,1, . . . , Un,6), where Un,i denotes the number of edges of colour i in the tree at time n, which has precisely n + 2 leaves.
Then U0 = (0, 2, 0, 0, 1, 0), since at the initial time step (n = 0) there is one internal edge and one essential cherry in a rooted tree; see T2 in Fig. 1. Based on the colouring scheme of the edges, at any time n ≥ 0, we have
\[
(A_{n+2}, C_{n+2}) = \tfrac{1}{2}\,\bigl(U_{n,1},\; U_{n,1} + U_{n,2}\bigr), \tag{1}
\]
where $A_{n+2}$ and $C_{n+2}$ are the numbers of pitchforks and cherries in $T_{n+2}$, respectively. Under the alpha tree model, the corresponding urn process evolves according to the following replacement matrix:
\[
R = \begin{pmatrix}
 0 &  0 &  0 &  1 &  0 &  1 \\
 2 & -2 &  1 &  0 & -1 &  2 \\
-2 &  4 & -1 &  0 &  2 & -1 \\
 0 &  2 &  0 & -1 &  1 &  0 \\
 2 & -2 &  1 &  0 & -1 &  2 \\
 0 &  0 &  0 &  1 &  0 &  1
\end{pmatrix}.
\]
Let $e_i$, $1 \le i \le 6$, denote the 6-vector whose $i$-th component is 1 and whose other components are 0, and let $\chi_n$ be the random vector taking value $e_i$ if, at time $n$, speciation happens at an edge of type $i$. Thus, we have the following recursion
\[
U_n = U_{n-1} + \chi_n R, \qquad n \ge 1,
\]
where
\[
P(\chi_n = e_i \mid \mathcal{F}_{n-1}) \;\propto\;
\begin{cases}
(1-\alpha)\,U_{n-1,i}, & \text{for } i \in \{1, 2, 3, 4\},\\
\alpha\,U_{n-1,i}, & \text{for } i \in \{5, 6\}.
\end{cases} \tag{2}
\]
Observe that the process $(U_n)_{n\ge 0}$, which describes the dynamics of the numbers of cherries and pitchforks, is a nonuniform urn model: the balls are not selected uniformly at random from the urn, in contrast with the classical uniform urn models (see, e.g. [17, Chapter 7]).
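The recursion above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the helper names `apply_colour`, `selection_weights`, `simulate` and `replay` are hypothetical, and the colour sequence used in the usage note is the one that reproduces the vectors U0, ..., U4 listed in the caption of Figure 1.

```python
import random

# Replacement matrix R of Section 2.2; row i-1 is the change in the urn
# when an edge of colour i is selected.
R = [
    [0, 0, 0, 1, 0, 1],
    [2, -2, 1, 0, -1, 2],
    [-2, 4, -1, 0, 2, -1],
    [0, 2, 0, -1, 1, 0],
    [2, -2, 1, 0, -1, 2],
    [0, 0, 0, 1, 0, 1],
]
U0 = [0, 2, 0, 0, 1, 0]  # configuration of T_2: one cherry and the root edge


def apply_colour(u, colour):
    """Return U + e_colour R for a 1-based colour index."""
    return [x + r for x, r in zip(u, R[colour - 1])]


def selection_weights(u, alpha):
    """Unnormalised weights from (2): 1 - alpha for colours 1-4, alpha for 5-6."""
    return [(1 - alpha) * x for x in u[:4]] + [alpha * x for x in u[4:]]


def simulate(steps, alpha, rng):
    """Run the nonuniform urn for `steps` draws and return the final vector."""
    u = list(U0)
    for _ in range(steps):
        colour = rng.choices(range(1, 7), weights=selection_weights(u, alpha))[0]
        u = apply_colour(u, colour)
    return u


def replay(colours):
    """Apply a fixed sequence of selected colours and return the whole path."""
    path = [list(U0)]
    for colour in colours:
        path.append(apply_colour(path[-1], colour))
    return path
```

For instance, `replay([2, 3, 2, 1])` walks through the five vectors of Figure 1, and `simulate(200, 0.3, random.Random(1))` returns a configuration whose first four entries sum to 202 (pendant edges) and whose last two sum to 201 (internal edges), as every row of R adds exactly one edge of each kind.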
2.3 Limiting theorems on uniform urn models
In this subsection, we recall the strong laws of large numbers and the central limit theorems on a version of uniform urn models developed in [7], which will be related to the nonuniform urn process in Subsection 2.2 later using the urn coupling idea in [4].
For the classical uniform urn models, it has been shown (see [3]) that the random process Un/n converges almost surely to the left eigenvector of R corresponding to the maximal eigenvalue and that asymptotic normality holds with a known limiting variance matrix under certain assumptions on R. Standard assumptions made in urn model theory are that the replacement matrix is irreducible with a constant row sum and that all the off-diagonal elements are non-negative (see, e.g. [20]). In [7], we extend this to the case when off-diagonal elements of a replacement matrix can be negative, provided that the following set of assumptions (A1)–(A4), slightly rephrased from [7], is satisfied. Let diag(a1, . . . , ad) denote the diagonal matrix whose diagonal elements are a1, . . . , ad.
(A1): Tenable: It is always possible to draw balls and follow the replacement rule.
(A2): Small: All eigenvalues of $R$ are real; the maximal eigenvalue $\lambda_1$, called the principal eigenvalue, is positive, and $\lambda_1 > 2\lambda$ holds for all other eigenvalues $\lambda$ of $R$.
(A3): Strictly balanced: The column vector $u_1 = (1, 1, \dots, 1)^\top$ is a right eigenvector of $R$ corresponding to $\lambda_1$; and $R$ has a principal left eigenvector $v_1$ (i.e., a left eigenvector corresponding to $\lambda_1$) that is also a probability vector.
(A4): Diagonalisable: There exists an invertible matrix $V$ with real entries whose first row is $v_1$ such that the first column of $V^{-1}$ is $u_1$ and
\[
V R V^{-1} = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_d) =: \Lambda, \tag{3}
\]
where $\lambda_1 > \lambda_2 \ge \cdots \ge \lambda_d$ are the eigenvalues of $R$.
Let N (0, Σ) be the multivariate normal distribution with mean vector 0 = (0, . . . , 0) and covariance matrix Σ. Then we have the following result from [7, Theorems 1 & 2], which can also be alternatively derived from [19, Theorems 3.21 & 3.22 and Remark 4.2].
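Theorem 2.1. Under assumptions (A1)–(A4), we have
\[
(n\lambda_1)^{-1} U_n \xrightarrow{\ a.s.\ } v_1
\quad\text{and}\quad
n^{-1/2}\,(U_n - n\lambda_1 v_1) \xrightarrow{\ d\ } N(0, \Sigma), \tag{4}
\]
where $\lambda_1$ is the principal eigenvalue and $v_1$ is the principal left eigenvector of $R$, and
\[
\Sigma = \sum_{i,j=2}^{d} \frac{\lambda_1 \lambda_i \lambda_j\, u_i^\top \mathrm{diag}(v_1)\, u_j}{\lambda_1 - \lambda_i - \lambda_j}\; v_i^\top v_j, \tag{5}
\]
where $v_j$ is the $j$-th row of $V$ and $u_j$ the $j$-th column of $V^{-1}$ for $2 \le j \le d$.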
3 Limit Theorems for the Joint Distribution
In this section, we present the strong laws of large numbers and the central limit theorems on the joint distribution of the number of cherries and that of pitchforks under the Ford model.
3.1 Main convergence results
For later use, we consider the following polynomials in α:
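\[
\begin{aligned}
\varphi_1 &= 8\alpha^3 - 32\alpha^2 + 45\alpha - 23, & \varphi_4 &= 8\alpha^3 - 40\alpha^2 + 37\alpha + 13,\\
\varphi_2 &= 40\alpha^3 - 164\alpha^2 + 221\alpha - 97, & \varphi_5 &= 40\alpha^3 - 112\alpha^2 - 31\alpha + 181,\\
\varphi_3 &= 56\alpha^3 - 248\alpha^2 + 367\alpha - 181, & \varphi_6 &= 8\alpha^3 + 4\alpha^2 - 71\alpha + 71;
\end{aligned} \tag{6}
\]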
and for simplicity of notation, we do not indicate the φi’s as functions of α. Moreover, it can be verified directly that φ1, φ2, φ3 < 0 and φ4, φ5, φ6 > 0 for α ∈ (0, 1). Then, we have the following result on the joint asymptotic properties of the urn model process associated with the α-tree model.
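Theorem 3.1. Suppose $(U_n)_{n\ge 0}$ is the urn process associated with the Ford model with parameter $\alpha \in (0, 1)$. Then,
\[
\frac{U_n}{n} \xrightarrow{\ a.s.\ } v
\quad\text{and}\quad
\frac{U_n - n v}{\sqrt{n}} \xrightarrow{\ d\ } N(0, \Sigma), \tag{7}
\]
as $n \to \infty$, where
\[
v = \frac{1}{2(3-2\alpha)}\,\bigl(2(1-\alpha),\; 2(1-\alpha),\; 1-\alpha,\; 1+\alpha,\; 1-\alpha,\; 5-3\alpha\bigr) \tag{8}
\]
and, with the polynomials $\varphi_1, \dots, \varphi_6$ defined in (6),
\[
\Sigma = \frac{1-\alpha}{4(3-2\alpha)^2(5-4\alpha)(7-4\alpha)}
\begin{pmatrix}
-12\varphi_1 & 4\varphi_2 & -6\varphi_1 & -2\varphi_4 & 2\varphi_2 & -2\varphi_2 \\
4\varphi_2 & -4\varphi_3 & 2\varphi_2 & -2\varphi_6 & -2\varphi_3 & 2\varphi_3 \\
-6\varphi_1 & 2\varphi_2 & -3\varphi_1 & -\varphi_4 & \varphi_2 & -\varphi_2 \\
-2\varphi_4 & -2\varphi_6 & -\varphi_4 & \varphi_5 & -\varphi_6 & \varphi_6 \\
2\varphi_2 & -2\varphi_3 & \varphi_2 & -\varphi_6 & -\varphi_3 & \varphi_3 \\
-2\varphi_2 & 2\varphi_3 & -\varphi_2 & \varphi_6 & \varphi_3 & -\varphi_3
\end{pmatrix}. \tag{9}
\]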
The proof of Theorem 3.1 is given at the end of this section.
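The sign claims for the polynomials in (6) and the normalisation of the limit vector in (8) can be spot-checked with exact rational arithmetic. The sketch below is only a sanity check on a finite grid, not a proof; the function names `phis` and `limit_vector` are hypothetical.

```python
from fractions import Fraction


def phis(a):
    """Return (phi_1, ..., phi_6) from (6) evaluated at alpha = a."""
    return (
        8 * a**3 - 32 * a**2 + 45 * a - 23,
        40 * a**3 - 164 * a**2 + 221 * a - 97,
        56 * a**3 - 248 * a**2 + 367 * a - 181,
        8 * a**3 - 40 * a**2 + 37 * a + 13,
        40 * a**3 - 112 * a**2 - 31 * a + 181,
        8 * a**3 + 4 * a**2 - 71 * a + 71,
    )


def limit_vector(a):
    """Return the vector v of (8) for a Fraction alpha = a."""
    c = 1 / (2 * (3 - 2 * a))
    return [c * x for x in (2 * (1 - a), 2 * (1 - a), 1 - a, 1 + a, 1 - a, 5 - 3 * a)]


# Check phi_1, phi_2, phi_3 < 0 < phi_4, phi_5, phi_6 on a grid inside (0, 1).
grid = [Fraction(k, 1000) for k in range(1, 1000)]
signs_ok = all(
    max(phis(a)[:3]) < 0 and min(phis(a)[3:]) > 0 for a in grid
)
```

The entries of `limit_vector(a)` sum to 2 for every α, which matches the fact that each step adds two edges to the tree.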
Remark 1. For later use, here we present the limiting results on the urn model using a scaling factor relating to the time n (which is motivated by noting that the number of leaves in the tree at time n is n + 2). However, the results can be readily rephrased using the proportion of color balls in the urn process.
Remark 2. Using the approach outlined in [7], Theorem 3.1 continues to hold for the unrooted α-tree models.
With Theorem 3.1, we are ready to present one of our main results in this paper concerning limit theorems on the joint distribution of the number of cherries Cn and the number of pitchforks An under the Ford model.
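Theorem 3.2. Under the Ford model with parameter $\alpha \in [0, 1]$, we have
\[
\frac{1}{n}(A_n, C_n) \xrightarrow{\ a.s.\ } (\nu, \mu) := \frac{1-\alpha}{2(3-2\alpha)}\,(1, 2),
\]
and
\[
\frac{(A_n, C_n) - n(\nu, \mu)}{\sqrt{n}} \xrightarrow{\ d\ } N\bigl((0, 0), S\bigr),
\]
where
\[
S = \begin{pmatrix} \tau^2 & \rho \\ \rho & \sigma^2 \end{pmatrix}
= \frac{1-\alpha}{(3-2\alpha)^2(5-4\alpha)}
\begin{pmatrix}
\dfrac{-24\alpha^3 + 96\alpha^2 - 135\alpha + 69}{4(7-4\alpha)} & -\dfrac{(2-\alpha)(1-2\alpha)}{2} \\[2ex]
-\dfrac{(2-\alpha)(1-2\alpha)}{2} & 2-\alpha
\end{pmatrix}. \tag{10}
\]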
Remark 3. We consider special cases of α-tree model, which are commonly studied in phy- logenetics. The first two have been established in [7].
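1. The uniform model corresponds to $\alpha = 1/2$, where all edges, internal or leaf, are selected with equal weight and the limit results hold with
\[
(\nu, \mu) = \frac{1}{8}(1, 2) \quad\text{and}\quad
\begin{pmatrix} \tau^2 & \rho \\ \rho & \sigma^2 \end{pmatrix}
= \frac{1}{64}\begin{pmatrix} 3 & 0 \\ 0 & 4 \end{pmatrix}.
\]
2. The Yule model corresponds to $\alpha = 0$, where only leaf edges are selected with equal weight and the limit results hold with
\[
(\nu, \mu) = \frac{1}{6}(1, 2) \quad\text{and}\quad
\begin{pmatrix} \tau^2 & \rho \\ \rho & \sigma^2 \end{pmatrix}
= \frac{1}{45}\begin{pmatrix} 69/28 & -1 \\ -1 & 2 \end{pmatrix}.
\]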
3. The Comb model corresponds to α = 1, a degenerate case. It is easy to see that (ν, µ) = (0, 0) and τ2 = ρ = σ2 = 0.
Proof of Theorem 3.2. First note that the case α = 1 reduces to the degenerate Comb model, and therefore we only consider α ∈ [0, 1). The limiting results for the case α = 0 have been obtained in [7], and they agree with the above results when α = 0. Thus, it is enough to prove the result for α ∈ (0, 1).
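By (1), we have $(A_n, C_n) = U_n Q$ with
\[
Q^\top = \frac{1}{2}\begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 \end{pmatrix}. \tag{11}
\]
Since
\[
\frac{U_n}{n} \xrightarrow{\ a.s.\ } v = \frac{1}{2(3-2\alpha)}\,\bigl(2(1-\alpha),\; 2(1-\alpha),\; 1-\alpha,\; 1+\alpha,\; 1-\alpha,\; 5-3\alpha\bigr), \tag{12}
\]
using the relation from equation (1) we get
\[
\frac{1}{n}(A_n, C_n) = \frac{U_n}{n}\,Q \xrightarrow{\ a.s.\ } v\,Q = \frac{1-\alpha}{2(3-2\alpha)}\,(1, 2).
\]
This concludes the proof of the almost sure convergence. We now prove the central limit theorem and obtain the expression for the limiting variance matrix.
Denoting the covariance matrix $\Sigma$ by $(\sigma_{i,j})$ for $1 \le i, j \le 6$, we consider the matrix
\[
\begin{aligned}
S = Q^\top \Sigma Q
&= \frac{1}{4}\begin{pmatrix}
\sigma_{1,1} & \sigma_{1,1} + \sigma_{1,2} \\
\sigma_{1,1} + \sigma_{2,1} & \sigma_{1,1} + \sigma_{2,1} + \sigma_{1,2} + \sigma_{2,2}
\end{pmatrix} \\
&= \frac{1-\alpha}{16(3-2\alpha)^2(5-4\alpha)(7-4\alpha)}
\begin{pmatrix}
-12\varphi_1 & -12\varphi_1 + 4\varphi_2 \\
-12\varphi_1 + 4\varphi_2 & -12\varphi_1 + 8\varphi_2 - 4\varphi_3
\end{pmatrix} \\
&= \frac{1-\alpha}{(3-2\alpha)^2(5-4\alpha)}
\begin{pmatrix}
\dfrac{-24\alpha^3 + 96\alpha^2 - 135\alpha + 69}{4(7-4\alpha)} & -\dfrac{(2-\alpha)(1-2\alpha)}{2} \\[2ex]
-\dfrac{(2-\alpha)(1-2\alpha)}{2} & 2-\alpha
\end{pmatrix}.
\end{aligned}
\]
Since $(A_n, C_n) = U_n Q$, where $Q$ is as defined in (11), we get
\[
\frac{(A_n, C_n) - n(\nu, \mu)}{\sqrt{n}} = \frac{1}{\sqrt{n}}\,(U_n - n v)\,Q \xrightarrow{\ d\ } N\bigl(0, Q^\top \Sigma Q\bigr) = N(0, S).
\]
This completes the proof.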
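The algebra behind the reduction S = QᵀΣQ can be checked with exact rational arithmetic. The following sketch (function names `phis`, `sigma`, `project` and `closed_form` are hypothetical) builds Σ from (9), projects it with the matrix Q of (11), and compares the result with the closed form in (10) at a few rational values of α.

```python
from fractions import Fraction


def phis(a):
    """Return (phi_1, ..., phi_6) from (6) evaluated at alpha = a."""
    return (
        8 * a**3 - 32 * a**2 + 45 * a - 23,
        40 * a**3 - 164 * a**2 + 221 * a - 97,
        56 * a**3 - 248 * a**2 + 367 * a - 181,
        8 * a**3 - 40 * a**2 + 37 * a + 13,
        40 * a**3 - 112 * a**2 - 31 * a + 181,
        8 * a**3 + 4 * a**2 - 71 * a + 71,
    )


def sigma(a):
    """The 6 x 6 limiting covariance matrix of (9) for a Fraction alpha = a."""
    p1, p2, p3, p4, p5, p6 = phis(a)
    c = (1 - a) / (4 * (3 - 2 * a) ** 2 * (5 - 4 * a) * (7 - 4 * a))
    rows = [
        [-12 * p1, 4 * p2, -6 * p1, -2 * p4, 2 * p2, -2 * p2],
        [4 * p2, -4 * p3, 2 * p2, -2 * p6, -2 * p3, 2 * p3],
        [-6 * p1, 2 * p2, -3 * p1, -p4, p2, -p2],
        [-2 * p4, -2 * p6, -p4, p5, -p6, p6],
        [2 * p2, -2 * p3, p2, -p6, -p3, p3],
        [-2 * p2, 2 * p3, -p2, p6, p3, -p3],
    ]
    return [[c * x for x in row] for row in rows]


# Q maps the urn vector to (A_n, C_n) = U_n Q, see (11).
HALF = Fraction(1, 2)
Q = [[HALF, HALF], [0, HALF], [0, 0], [0, 0], [0, 0], [0, 0]]


def project(sig):
    """Return S = Q^T Sigma Q as a 2 x 2 list of Fractions."""
    return [
        [sum(Q[i][r] * sig[i][j] * Q[j][s] for i in range(6) for j in range(6))
         for s in range(2)]
        for r in range(2)
    ]


def closed_form(a):
    """The matrix S of (10) for a Fraction alpha = a."""
    c = (1 - a) / ((3 - 2 * a) ** 2 * (5 - 4 * a))
    tau2 = c * (-24 * a**3 + 96 * a**2 - 135 * a + 69) / (4 * (7 - 4 * a))
    rho = -c * (2 - a) * (1 - 2 * a) / 2
    return [[tau2, rho], [rho, c * (2 - a)]]
```

Evaluating `closed_form(Fraction(1, 2))` gives the uniform-model matrix (1/64)·diag(3, 4) of Remark 3, and `closed_form(Fraction(0))` gives the Yule-model matrix.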
We end this subsection with the following results on the behaviour of the first and second moments of the limiting joint distribution of cherries and pitchforks over the parameter region, as indicated by their plots in Figure 2.
Corollary 3.3. (i) For $0 < \alpha < 1$, $A_n/C_n \xrightarrow{\ a.s.\ } 1/2$ as $n \to \infty$. That is, the number of pitchforks is asymptotically equal to the number of essential cherries.
(ii) $A_n/n \xrightarrow{\ a.s.\ } \frac{1-\alpha}{2(3-2\alpha)}$, which decreases strictly from 1/6 to 0 as $\alpha$ increases from 0 to 1.
(iii) The limiting variance of $A_n/\sqrt{n}$, $\tau^2$, decreases strictly from 23/420 to 0 as $\alpha$ increases from 0 to 1.
(iv) The limiting variance of $C_n/\sqrt{n}$, $\sigma^2$, increases strictly from 2/45 to 0.0695 over $(0, a_0)$ and decreases from 0.0695 to 0 over $(a_0, 1)$, where $a_0 = 0.7339$ is the unique root of $19 - 48\alpha + 36\alpha^2 - 8\alpha^3 = 0$ in $(0, 1)$.
(v) The limiting covariance of $A_n/\sqrt{n}$ and $C_n/\sqrt{n}$ changes sign from negative to positive at $\alpha = 1/2$. Specifically, it increases from $-1/45$ to 0.0225 over $(0, a_1)$ and decreases from 0.0225 over $(a_1, 1)$, where $a_1 = 0.8688$ is the unique root of $-24\alpha^4 + 160\alpha^3 - 370\alpha^2 + 358\alpha - 123 = 0$ in $(0, 1)$.
Figure 2: Plot of the limiting covariances of the joint distribution of cherries and pitchforks with respect to the parameter α under the Ford model.
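The turning points quoted in Corollary 3.3 can be reproduced numerically. The sketch below uses a plain bisection (the helper `bisect_root` is a hypothetical name, and floating-point arithmetic is an assumption made for simplicity) on the two polynomials stated in the corollary, and then evaluates the entries of S from (10) at the roots.

```python
def tau2(a):
    """Limiting variance of A_n / sqrt(n), from (10)."""
    return ((1 - a) * (-24 * a**3 + 96 * a**2 - 135 * a + 69)
            / (4 * (3 - 2 * a) ** 2 * (5 - 4 * a) * (7 - 4 * a)))


def sigma2(a):
    """Limiting variance of C_n / sqrt(n), from (10)."""
    return (1 - a) * (2 - a) / ((3 - 2 * a) ** 2 * (5 - 4 * a))


def rho(a):
    """Limiting covariance of A_n / sqrt(n) and C_n / sqrt(n), from (10)."""
    return -(1 - a) * (2 - a) * (1 - 2 * a) / (2 * (3 - 2 * a) ** 2 * (5 - 4 * a))


def bisect_root(f, lo, hi, tol=1e-12):
    """Bisection for a sign change of f on [lo, hi]."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2


# Roots of the two polynomials given in Corollary 3.3 (iv) and (v).
a0 = bisect_root(lambda a: 19 - 48 * a + 36 * a**2 - 8 * a**3, 0.0, 1.0)
a1 = bisect_root(
    lambda a: -24 * a**4 + 160 * a**3 - 370 * a**2 + 358 * a - 123, 0.0, 1.0
)
```

With these definitions, `sigma2(a0)` and `rho(a1)` give the peak values of the limiting variance of cherries and of the limiting covariance, and `rho` is negative below α = 1/2 and positive above it.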
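3.2 A uniform urn model derived from U_n
For $\alpha \in (0, 1)$, consider the diagonal $6 \times 6$ matrix $T_\alpha = \mathrm{diag}(1-\alpha, 1-\alpha, 1-\alpha, 1-\alpha, \alpha, \alpha)$ and
\[
\tilde U_n := U_n T_\alpha = \bigl((1-\alpha)U_{n,1}, \dots, (1-\alpha)U_{n,4},\; \alpha U_{n,5},\; \alpha U_{n,6}\bigr).
\]
Clearly, there is a one-to-one correspondence between $U_n$ and $\tilde U_n = U_n T_\alpha$ for $\alpha \in (0, 1)$, and therefore it is sufficient to obtain the limiting results for the urn process $\tilde U_n$. Note that the off-diagonal elements of the replacement matrix $R_\alpha$ are not all non-negative; therefore we will use the limit results from [7] to obtain the convergence results for the urn process $\tilde U_n$.
Theorem 3.4. Suppose $\alpha \in (0, 1)$. Then $(\tilde U_n)_{n \ge 0}$ is a uniform urn process with replacement matrix $R_\alpha = R T_\alpha$ and
\[
\frac{\tilde U_n}{n} \xrightarrow{\ a.s.\ } \tilde v_1, \tag{13}
\]
where
\[
\tilde v_1 = \frac{1}{2(3-2\alpha)}\,\bigl(2(1-\alpha)^2,\; 2(1-\alpha)^2,\; (1-\alpha)^2,\; 1-\alpha^2,\; \alpha(1-\alpha),\; \alpha(5-3\alpha)\bigr) \tag{14}
\]
is the normalized left eigenvector of $R_\alpha$ corresponding to the largest eigenvalue $\lambda_1 = 1$.
Furthermore,
\[
\frac{\tilde U_n - n \tilde v_1}{\sqrt{n}} \xrightarrow{\ d\ } N(0, \tilde\Sigma), \tag{15}
\]
where, with the polynomials $\varphi_1, \dots, \varphi_6$ defined in (6) and $\beta = 1 - \alpha$,
\[
\tilde\Sigma = \frac{\beta}{4(3-2\alpha)^2(5-4\alpha)(7-4\alpha)}
\begin{pmatrix}
-12\beta^2\varphi_1 & 4\beta^2\varphi_2 & -6\beta^2\varphi_1 & -2\beta^2\varphi_4 & 2\alpha\beta\varphi_2 & -2\alpha\beta\varphi_2 \\
4\beta^2\varphi_2 & -4\beta^2\varphi_3 & 2\beta^2\varphi_2 & -2\beta^2\varphi_6 & -2\alpha\beta\varphi_3 & 2\alpha\beta\varphi_3 \\
-6\beta^2\varphi_1 & 2\beta^2\varphi_2 & -3\beta^2\varphi_1 & -\beta^2\varphi_4 & \alpha\beta\varphi_2 & -\alpha\beta\varphi_2 \\
-2\beta^2\varphi_4 & -2\beta^2\varphi_6 & -\beta^2\varphi_4 & \beta^2\varphi_5 & -\alpha\beta\varphi_6 & \alpha\beta\varphi_6 \\
2\alpha\beta\varphi_2 & -2\alpha\beta\varphi_3 & \alpha\beta\varphi_2 & -\alpha\beta\varphi_6 & -\alpha^2\varphi_3 & \alpha^2\varphi_3 \\
-2\alpha\beta\varphi_2 & 2\alpha\beta\varphi_3 & -\alpha\beta\varphi_2 & \alpha\beta\varphi_6 & \alpha^2\varphi_3 & -\alpha^2\varphi_3
\end{pmatrix}. \tag{16}
\]
Proof of Theorem 3.4. First, observe that at any time n, there are n + 2 pendant edges and n + 1 internal edges in a rooted tree. That is,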
Un,1+ Un,2+ Un,3+ Un,4= n + 2 and Un,5+ Un,6= n + 1.
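This gives
\[
\|\tilde U_n\|_1 = (1-\alpha)\sum_{j=1}^{4} U_{n,j} + \alpha \sum_{j=5}^{6} U_{n,j} = (1-\alpha)(n+2) + \alpha(n+1) = n + 2 - \alpha.
\]
Therefore, from (2) we get
\[
E[\chi_n \mid \mathcal{F}_{n-1}] = \frac{U_{n-1}T_\alpha}{\|U_{n-1}T_\alpha\|_1} = \frac{U_{n-1}T_\alpha}{n + 1 - \alpha},
\]
and
\[
E[U_n \mid \mathcal{F}_{n-1}] = U_{n-1} + E[\chi_n \mid \mathcal{F}_{n-1}]\,R = U_{n-1} + \frac{1}{n + 1 - \alpha}\,U_{n-1}T_\alpha R.
\]
Multiplying both sides by $T_\alpha$ on the right, we get
\[
E[\tilde U_n \mid \mathcal{F}_{n-1}] = \tilde U_{n-1} + \left(\frac{1}{\|\tilde U_{n-1}\|_1}\,\tilde U_{n-1}\right) R\,T_\alpha.
\]
Hence, $(\tilde U_n)_{n \ge 0}$ is a classical uniform urn model with replacement matrix $R_\alpha = R T_\alpha$.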
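Note that (A1) holds because the general Ford dynamics on a rooted tree is well defined at every time $n$; thus the corresponding urn model satisfies the assumption of tenability. That is, it is always possible to draw balls without getting stuck with the replacement rule. Note that $R_\alpha$ is diagonalisable, as
\[
V R_\alpha V^{-1} = \Lambda \quad\text{holds with}\quad \Lambda = \mathrm{diag}\bigl(1, 0, 0, 0, -2(1-\alpha), -(3-2\alpha)\bigr),
\]
\[
V^{-1} = \begin{pmatrix}
1 & \frac{1}{\beta} & 0 & 0 & 1 & 1-\alpha \\
1 & 0 & \frac{1}{\beta} & 0 & 1 & 3-\alpha \\
1 & -\frac{2}{\beta} & 0 & \frac{3}{\beta} & -\frac{2-\alpha}{\beta} & -5+\alpha \\
1 & 0 & 0 & \frac{1}{\beta} & -\frac{2-\alpha}{\beta} & -3+\alpha \\
1 & 0 & -\frac{2}{\alpha} & \frac{1}{\alpha} & 1 & 3-\alpha \\
1 & 0 & 0 & -\frac{1}{\alpha} & 1 & 1-\alpha
\end{pmatrix} \tag{17}
\]
and
\[
V = \frac{1}{2(3-2\alpha)}
\begin{pmatrix}
2\beta^2 & 2\beta^2 & \beta^2 & (1+\alpha)\beta & \alpha\beta & \alpha(5-3\alpha) \\
2\beta(1+\alpha-\alpha^2) & -2\beta^3 & -(2-\alpha)\beta^2 & (2-\alpha)\beta^2 & -\alpha\beta^2 & -\alpha\beta(5-3\alpha) \\
2\alpha\beta^2 & 2\alpha(2-\alpha)\beta & \alpha\beta^2 & -\alpha\beta^2 & -\alpha(3-\alpha)\beta & -3\alpha\beta^2 \\
2\alpha(2-\alpha)\beta & 2\alpha\beta^2 & \alpha(2-\alpha)\beta & -\alpha(2-\alpha)\beta & \alpha^2\beta & -3\alpha(2-\alpha)\beta \\
2(2-\alpha)\beta & -2\beta^2 & (2-\alpha)\beta & -(4-\alpha)\beta & -\alpha\beta & \alpha\beta \\
-2\beta & 2\beta & -\beta & \beta & \alpha & -\alpha
\end{pmatrix}. \tag{18}
\]
Here the $(2,2)$ entry of $V$ carries a minus sign, as required by $V V^{-1} = I$. Therefore, $R_\alpha$ satisfies condition (A4). Next, (A2) holds because $R_\alpha$ has eigenvalues
\[
1,\; 0,\; 0,\; 0,\; -2(1-\alpha),\; -(3-2\alpha),
\]
which are all real. The maximal eigenvalue $\lambda_1 = 1$ is positive, and $\lambda_1 > 2\lambda$ holds for all other eigenvalues $\lambda$ of $R_\alpha$. Furthermore, put $u_i = V^{-1} e_i^\top$ and $v_i = e_i V$ for $1 \le i \le 6$. Then (A3) follows by noting that $u_1 = (1, 1, 1, 1, 1, 1)^\top$ is the principal right eigenvector, and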
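\[
\tilde v_1 = \frac{1}{2(3-2\alpha)}\,\bigl(2(1-\alpha)^2,\; 2(1-\alpha)^2,\; (1-\alpha)^2,\; 1-\alpha^2,\; \alpha(1-\alpha),\; \alpha(5-3\alpha)\bigr)
\]
is the principal left eigenvector.
Since all the assumptions (A1)–(A4) are satisfied by the replacement matrix $R_\alpha$, by Theorem 2.1, (13) holds. Furthermore, since
\[
\tilde\Sigma = \sum_{i,j=2}^{6} \frac{\lambda_i \lambda_j\, u_i^\top \mathrm{diag}(\tilde v_1)\, u_j}{1 - \lambda_i - \lambda_j}\; v_i^\top v_j, \tag{19}
\]
it follows from Theorem 2.1 that (15) holds.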
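The diagonalisation used in this proof can be verified with exact rational arithmetic. The sketch below is a check under one stated assumption: the (2,2) entry of V is taken to be −2β³, which is the value the relation V V⁻¹ = I requires. The helper names `matmul` and `build` are hypothetical.

```python
from fractions import Fraction

# Replacement matrix R of Section 2.2.
R = [
    [0, 0, 0, 1, 0, 1],
    [2, -2, 1, 0, -1, 2],
    [-2, 4, -1, 0, 2, -1],
    [0, 2, 0, -1, 1, 0],
    [2, -2, 1, 0, -1, 2],
    [0, 0, 0, 1, 0, 1],
]


def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def build(a):
    """Return (R_alpha, V, V_inv, eigenvalues) for a Fraction alpha = a in (0, 1)."""
    b = 1 - a
    t = [b, b, b, b, a, a]                       # diagonal of T_alpha
    r_alpha = [[R[i][j] * t[j] for j in range(6)] for i in range(6)]
    c = 1 / (2 * (3 - 2 * a))
    v = [
        [2*b**2, 2*b**2, b**2, (1 + a)*b, a*b, a*(5 - 3*a)],
        # (2,2) entry assumed to be -2*b**3 (see lead-in).
        [2*b*(1 + a - a**2), -2*b**3, -(2 - a)*b**2, (2 - a)*b**2, -a*b**2, -a*b*(5 - 3*a)],
        [2*a*b**2, 2*a*(2 - a)*b, a*b**2, -a*b**2, -a*(3 - a)*b, -3*a*b**2],
        [2*a*(2 - a)*b, 2*a*b**2, a*(2 - a)*b, -a*(2 - a)*b, a**2*b, -3*a*(2 - a)*b],
        [2*(2 - a)*b, -2*b**2, (2 - a)*b, -(4 - a)*b, -a*b, a*b],
        [-2*b, 2*b, -b, b, a, -a],
    ]
    v = [[c * x for x in row] for row in v]
    v_inv = [
        [1, 1/b, 0, 0, 1, 1 - a],
        [1, 0, 1/b, 0, 1, 3 - a],
        [1, -2/b, 0, 3/b, -(2 - a)/b, -5 + a],
        [1, 0, 0, 1/b, -(2 - a)/b, -3 + a],
        [1, 0, -2/a, 1/a, 1, 3 - a],
        [1, 0, 0, -1/a, 1, 1 - a],
    ]
    eigenvalues = [1, 0, 0, 0, -2*b, -(3 - 2*a)]
    return r_alpha, v, v_inv, eigenvalues
```

For any rational α in (0, 1), `matmul(V, V_inv)` is the identity and `matmul(matmul(V, R_alpha), V_inv)` is the diagonal matrix Λ; the first row of V is the probability vector ṽ1 of (14).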
3.3 Proof of Theorem 3.1
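Proof. Observe that $\sum_{i=1}^{6} U_{n,i} = 3 + 2n$ (since 2 balls are added into the urn at every time point), thus the vector of colour proportions is $U_n/(3 + 2n)$. Since $\alpha \in (0, 1)$, it follows that $T_\alpha$ is invertible and its inverse is
\[
T_\alpha^{-1} = \frac{1}{\alpha(1-\alpha)}\,\mathrm{diag}(\alpha, \alpha, \alpha, \alpha, 1-\alpha, 1-\alpha),
\]
which is also a diagonal matrix, and so $(T_\alpha^{-1})^\top = T_\alpha^{-1}$. Note that we have $U_n = \tilde U_n T_\alpha^{-1}$, and consider
\[
v = \tilde v_1 T_\alpha^{-1} = \frac{1}{2(3-2\alpha)}\,\bigl(2(1-\alpha),\; 2(1-\alpha),\; 1-\alpha,\; 1+\alpha,\; 1-\alpha,\; 5-3\alpha\bigr).
\]
Since $\tilde U_n / n \xrightarrow{\ a.s.\ } \tilde v_1$ holds in view of (13) in Theorem 3.4, we get
\[
\frac{U_n}{n} \xrightarrow{\ a.s.\ } v, \tag{20}
\]
which concludes the proof of the almost sure convergence in (7).
Consider the covariance matrix $\tilde\Sigma$ for $\tilde U_n$ as stated in (16); then by straightforward calculation we have
\[
\Sigma = (T_\alpha^{-1})^\top\, \tilde\Sigma\, T_\alpha^{-1} = T_\alpha^{-1}\, \tilde\Sigma\, T_\alpha^{-1}.
\]
Therefore, since
\[
\frac{\tilde U_n - n\tilde v_1}{\sqrt{n}} \xrightarrow{\ d\ } N(0, \tilde\Sigma)
\]
in view of Theorem 3.4, we get
\[
\frac{U_n - n v}{\sqrt{n}} \xrightarrow{\ d\ } N\bigl(0, (T_\alpha^{-1})^\top\, \tilde\Sigma\, T_\alpha^{-1}\bigr) = N(0, \Sigma).
\]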
This completes the proof.
4 Exact Distributions
In this section, we present recursion formulas for exact computation of the joint distributions of cherries and pitchforks, their means, variances and covariance for fixed n under the Ford model.
We begin with the following notation. Given a phylogenetic tree T, let E1(T) be the set of pendant edges that are contained in a pitchfork but not in a cherry; E2(T) the set of edges in T that are contained in a cherry but not in a pitchfork (note that in our notation a cherry contains three edges: its two pendant edges and the interior edge directly above them); E3(T) the set of pendant edges that are contained in neither a cherry nor a pitchfork; and E4(T) = E(T) \ (E1(T) ∪ E2(T) ∪ E3(T)). In particular, E(T) decomposes into the disjoint union of these four sets of edges, i.e., E(T) = E1(T) ⊔ E2(T) ⊔ E3(T) ⊔ E4(T), where ⊔ denotes disjoint union. Let C(T) and A(T) be the numbers of cherries and pitchforks in a tree T, respectively. The following result presented in [25] will be useful later.
Lemma 4.1. Suppose that $T$ is a phylogenetic tree with $n$ leaves. Then we have
\[
E(T) = E_1(T) \sqcup E_2(T) \sqcup E_3(T) \sqcup E_4(T). \tag{21}
\]
In addition, we have $|E_1(T)| = A(T)$, $|E_2(T)| = 3(C(T) - A(T))$, $|E_3(T)| = n - A(T) - 2C(T)$, and $|E_4(T)| = n - 1 + 3A(T) - C(T)$. Furthermore, suppose that $e$ is an edge in $T$ and $T' = T[e]$ (the tree obtained from $T$ by attaching a new leaf to $e$). Then we have
\[
A(T') = \begin{cases}
A(T) & \text{if } e \in E_3(T) \cup E_4(T),\\
A(T) - 1 & \text{if } e \in E_1(T),\\
A(T) + 1 & \text{if } e \in E_2(T);
\end{cases}
\qquad\text{and}\qquad
C(T') = \begin{cases}
C(T) & \text{if } e \in E_2(T) \cup E_4(T),\\
C(T) + 1 & \text{if } e \in E_1(T) \cup E_3(T).
\end{cases}
\]
We start with the following result on the exact computation of the joint probability mass function (pmf) of An and Cn, which can be regarded as a generalization of the previous results on the Yule model (i.e., α = 0 [25, Theorem 1]) and the uniform model (i.e., α = 1/2 [25, Theorem 4]). A related result for unrooted trees is presented in [8].
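Theorem 4.2. For $n \ge 3$, $0 \le a \le n/3$ and $1 \le b \le n/2$, under the Ford model with parameter $\alpha \in [0, 1]$ we have
\[
\begin{aligned}
\mathbb{P}(A_{n+1} = a, C_{n+1} = b)
&= \frac{2a + \alpha(n - a - b - 1)}{n - \alpha}\,\mathbb{P}(A_n = a, C_n = b)
 + \frac{(1-\alpha)(a+1)}{n - \alpha}\,\mathbb{P}(A_n = a+1, C_n = b-1) \\
&\quad + \frac{(2-\alpha)(b - a + 1)}{n - \alpha}\,\mathbb{P}(A_n = a-1, C_n = b)
 + \frac{(1-\alpha)(n - a - 2b + 2)}{n - \alpha}\,\mathbb{P}(A_n = a, C_n = b-1).
\end{aligned}
\]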
Proof of Theorem 4.2. Fix $n \ge 3$, and let $T_2, \dots, T_n, T_{n+1}$ be a sequence of random trees generated by the Ford process, that is, $T_2$ contains two leaves and $T_{i+1} = T_i[e_i]$ for a random edge $e_i$ in $T_i$ chosen according to the Ford model for $2 \le i \le n$. Then we have
\[
\begin{aligned}
\mathbb{P}(A_{n+1} = a, C_{n+1} = b) &= \mathbb{P}(A(T_{n+1}) = a, C(T_{n+1}) = b) \\
&= \sum_{p,q} \mathbb{P}(A(T_{n+1}) = a, C(T_{n+1}) = b \mid A(T_n) = p, C(T_n) = q)\,\mathbb{P}(A(T_n) = p, C(T_n) = q) \\
&= \sum_{p,q} \mathbb{P}(A(T_{n+1}) = a, C(T_{n+1}) = b \mid A(T_n) = p, C(T_n) = q)\,\mathbb{P}(A_n = p, C_n = q),
\end{aligned} \tag{22}
\]
where the equalities follow from the law of total probability and the definition of the random variables $A_n$ and $C_n$.
Let $e_n$ be the edge in $T_n$ chosen in the above Ford process for generating $T_{n+1}$, that is, $T_{n+1} = T_n[e_n]$. Since Lemma 4.1 implies that
\[
\mathbb{P}(A(T_{n+1}) = a, C(T_{n+1}) = b \mid A(T_n) = p, C(T_n) = q) = 0 \tag{23}
\]
for $(p, q) \notin \{(a, b), (a+1, b-1), (a-1, b), (a, b-1)\}$, it suffices to consider the following four cases in the summation in (22): case (i): $p = a$, $q = b$; case (ii): $p = a+1$, $q = b-1$; case (iii): $p = a-1$, $q = b$; and case (iv): $p = a$, $q = b-1$.
Firstly, Lemma 4.1 implies that case (i) occurs if and only if $e_n \in E_4(T_n)$. Using Lemma 4.1 again, it follows that $E_4(T_n)$ contains precisely $2A(T_n)$ pendant edges and $(n-1) + A(T_n) - C(T_n)$ interior edges. Therefore we have
\[
\mathbb{P}(A(T_{n+1}) = a, C(T_{n+1}) = b \mid A(T_n) = a, C(T_n) = b)
= \frac{2A(T_n)(1-\alpha) + \alpha\,(n - 1 + A(T_n) - C(T_n))}{n - \alpha}
= \frac{2a + \alpha(n - a - b - 1)}{n - \alpha}. \tag{24}
\]
Similarly, Lemma 4.1 implies that case (ii) occurs if and only if $e_n \in E_1(T_n)$. Using Lemma 4.1 again, it follows that $E_1(T_n)$ contains precisely $A(T_n)$ pendant edges and no interior edges. Therefore we have
\[
\mathbb{P}(A(T_{n+1}) = a, C(T_{n+1}) = b \mid A(T_n) = a+1, C(T_n) = b-1) = \frac{(a+1)(1-\alpha)}{n - \alpha}. \tag{25}
\]
Next, Lemma 4.1 implies that case (iii) occurs if and only if $e_n \in E_2(T_n)$. Using Lemma 4.1 again, it follows that $E_2(T_n)$ contains precisely $2(C(T_n) - A(T_n))$ pendant edges and $C(T_n) - A(T_n)$ interior edges. Thus we have
\[
\mathbb{P}(A(T_{n+1}) = a, C(T_{n+1}) = b \mid A(T_n) = a-1, C(T_n) = b)
= \frac{2(b - a + 1)(1-\alpha) + \alpha(b - a + 1)}{n - \alpha}
= \frac{(2-\alpha)(b - a + 1)}{n - \alpha}. \tag{26}
\]
Finally, Lemma 4.1 implies that case (iv) occurs if and only if $e_n$ is contained in $E_3(T_n)$. Using Lemma 4.1 again, it follows that $E_3(T_n)$ contains precisely $n - A(T_n) - 2C(T_n)$ pendant edges and no interior edges. Hence, it follows that
\[
\mathbb{P}(A(T_{n+1}) = a, C(T_{n+1}) = b \mid A(T_n) = a, C(T_n) = b-1) = \frac{(1-\alpha)(n - a - 2b + 2)}{n - \alpha}. \tag{27}
\]
Substituting Eq. (24)–(27) into Eq. (22) completes the proof of the theorem.
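The recursion of Theorem 4.2 translates directly into a forward dynamic programme. The sketch below is one possible implementation, not the authors' code: the function name `joint_pmf` is hypothetical, the parameter α is assumed to be passed as a `Fraction` so that all probabilities are exact, and the starting point is the observation that T3 is always a pitchfork, so (A3, C3) = (1, 1).

```python
from collections import defaultdict
from fractions import Fraction


def joint_pmf(n_leaves, alpha):
    """Exact joint pmf of (A_n, C_n) as a dict {(a, c): probability}, n_leaves >= 3."""
    pmf = {(1, 1): Fraction(1)}                 # T_3 contains one pitchfork and one cherry
    for n in range(3, n_leaves):
        nxt = defaultdict(Fraction)
        for (a, c), p in pmf.items():
            # Total selection weight of each edge class of Lemma 4.1 and the
            # state reached when the new leaf is attached to an edge in it.
            moves = (
                ((a, c), 2 * a + alpha * (n - a - c - 1)),       # E4
                ((a - 1, c + 1), (1 - alpha) * a),               # E1
                ((a + 1, c), (2 - alpha) * (c - a)),             # E2
                ((a, c + 1), (1 - alpha) * (n - a - 2 * c)),     # E3
            )
            for state, weight in moves:
                if weight:
                    nxt[state] += p * weight / (n - alpha)
        pmf = dict(nxt)
    return pmf


def mean_cherries(pmf):
    """E[C_n] computed from a joint pmf."""
    return sum(c * p for (_, c), p in pmf.items())
```

For example, `joint_pmf(4, Fraction(1, 3))` has two states: (1, 1) with probability 2/(3 − α) and (0, 2) with probability (1 − α)/(3 − α), the latter corresponding to the balanced tree with four leaves.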
To study the moments of An and Cn, we present below a functional recursion form of Theorem 4.2, whose proof is straightforward and hence omitted here.
Theorem 4.3. Let ϕ : N × N → R be an arbitrary function. For n ≥ 3, under the Ford model with parameter α ∈ [0, 1] we have
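\[
\begin{aligned}
(n - \alpha)\,\mathbb{E}\,\varphi(A_{n+1}, C_{n+1}) = \mathbb{E}\Bigl[
&\bigl(\alpha(n - A_n - C_n - 1) + 2A_n\bigr)\,\varphi(A_n, C_n)
 + (1-\alpha)A_n\,\varphi(A_n - 1, C_n + 1) \\
&+ (2-\alpha)(C_n - A_n)\,\varphi(A_n + 1, C_n)
 + (1-\alpha)(n - A_n - 2C_n)\,\varphi(A_n, C_n + 1)
\Bigr].
\end{aligned}
\]
For a fixed integer $k$, consider the indicator function $I_k(x, y)$ that equals 1 if $y = k$, and 0 otherwise. Applying Theorem 4.3 to $\varphi = I_k$ yields the following result on the distribution of cherries.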
Corollary 4.4. For integers n ≥ 3 and 0 ≤ k ≤ n/2, under the Ford model with parameter α ∈ [0, 1] we have
\[
(n - \alpha)\,\mathbb{P}(C_{n+1} = k) = \bigl[(n-1)\alpha + 2(1-\alpha)k\bigr]\,\mathbb{P}(C_n = k) + (1-\alpha)(n - 2k + 2)\,\mathbb{P}(C_n = k - 1).
\]
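The marginal recursion for the number of cherries can be sketched as follows. Both probabilities on the right-hand side are taken at time n, that is, P(C_n = k) and P(C_n = k − 1), which is what Theorem 4.3 applied to the indicator function gives. The function name `cherry_pmf` is hypothetical and α is assumed to be a `Fraction`.

```python
from fractions import Fraction


def cherry_pmf(n_leaves, alpha):
    """Exact pmf of C_n as a list indexed by k, for n_leaves >= 3."""
    pmf = [Fraction(0), Fraction(1)]            # C_3 = 1 with probability one
    for n in range(3, n_leaves):
        size = (n + 1) // 2 + 1                 # C_{n+1} is at most (n + 1) / 2
        nxt = [Fraction(0)] * size
        for k in range(size):
            stay = pmf[k] if k < len(pmf) else 0
            up = pmf[k - 1] if 1 <= k <= len(pmf) else 0
            nxt[k] = (((n - 1) * alpha + 2 * (1 - alpha) * k) * stay
                      + (1 - alpha) * (n - 2 * k + 2) * up) / (n - alpha)
        pmf = nxt
    return pmf


def mean(pmf):
    """Mean of a pmf given as a list indexed by value."""
    return sum(k * p for k, p in enumerate(pmf))
```

As a quick check of the special cases, `cherry_pmf(n, Fraction(1))` puts all mass on k = 1 (the Comb model has a single cherry), and for the Yule model `mean(cherry_pmf(9, Fraction(0)))` equals 9/3 = 3.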
For use in the next section, we end this section by writing the recurrence relations for the first and second moments in the following form.
Corollary 4.5. For n ≥ 3, under the Ford model with parameter α ∈ [0, 1] we have
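\[
\begin{aligned}
(n - \alpha)\,\mathbb{E}[C_{n+1}] - (n - 2 + \alpha)\,\mathbb{E}[C_n] &= n(1-\alpha), && (28)\\
(n - \alpha)\,\mathbb{E}[A_{n+1}] - (n - 3 + \alpha)\,\mathbb{E}[A_n] &= (2-\alpha)\,\mathbb{E}[C_n], && (29)\\
(n - \alpha)\,\mathbb{E}[C_{n+1}^2] - (n - 4 + 3\alpha)\,\mathbb{E}[C_n^2] &= 2(n-1)(1-\alpha)\,\mathbb{E}[C_n] + n(1-\alpha), && (30)\\
(n - \alpha)\,\mathbb{E}[A_{n+1}C_{n+1}] - (n - 5 + 3\alpha)\,\mathbb{E}[A_nC_n] &= (n-1)(1-\alpha)\,\mathbb{E}[A_n] + (2-\alpha)\,\mathbb{E}[C_n^2], && (31)\\
(n - \alpha)\,\mathbb{E}[A_{n+1}^2] - (n - 6 + 3\alpha)\,\mathbb{E}[A_n^2] &= 2(2-\alpha)\,\mathbb{E}[A_nC_n] + (2-\alpha)\,\mathbb{E}[C_n] - \mathbb{E}[A_n], && (32)
\end{aligned}
\]
with initial conditions $\mathbb{E}[A_3] = \mathbb{E}[C_3] = \mathbb{E}[A_3^2] = \mathbb{E}[C_3^2] = \mathbb{E}[A_3C_3] = 1$.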
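Because (28)–(32) form a closed linear system, the first and second moments for any fixed n can be computed by simple iteration. The sketch below is one way to do this (the function names `moments` and `covariance_summary` are hypothetical, and α is assumed to be a `Fraction` so that the output is exact).

```python
from fractions import Fraction


def moments(n_leaves, alpha):
    """Exact (E[A], E[C], E[A^2], E[AC], E[C^2]) for n_leaves >= 3 via (28)-(32)."""
    ea = ec = eaa = eac = ecc = Fraction(1)     # initial conditions at n = 3
    for n in range(3, n_leaves):
        d = n - alpha
        ec_new = ((n - 2 + alpha) * ec + n * (1 - alpha)) / d                      # (28)
        ea_new = ((n - 3 + alpha) * ea + (2 - alpha) * ec) / d                     # (29)
        ecc_new = ((n - 4 + 3 * alpha) * ecc
                   + 2 * (n - 1) * (1 - alpha) * ec + n * (1 - alpha)) / d         # (30)
        eac_new = ((n - 5 + 3 * alpha) * eac
                   + (n - 1) * (1 - alpha) * ea + (2 - alpha) * ecc) / d           # (31)
        eaa_new = ((n - 6 + 3 * alpha) * eaa
                   + 2 * (2 - alpha) * eac + (2 - alpha) * ec - ea) / d            # (32)
        ea, ec, eaa, eac, ecc = ea_new, ec_new, eaa_new, eac_new, ecc_new
    return ea, ec, eaa, eac, ecc


def covariance_summary(n_leaves, alpha):
    """Return (var(A_n), cov(A_n, C_n), var(C_n))."""
    ea, ec, eaa, eac, ecc = moments(n_leaves, alpha)
    return eaa - ea**2, eac - ea * ec, ecc - ec**2
```

For instance, `covariance_summary(6, Fraction(0))[2]` returns 4/15, which agrees with the value 2n/45 of the variance of the number of cherries under the Yule model at n = 6.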
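Remark 4. Let $\mu_n = \mathbb{E}[C_n]$ and $\sigma_n^2 = \mathrm{var}(C_n)$. Substituting $\mathbb{E}[C_n^2] = \sigma_n^2 + \mu_n^2$ into (30) and applying (28), we obtain below a recurrence relation for $\sigma_n^2$, which was also obtained in Ford’s thesis (Theorem 60, [14]):
\[
(n - \alpha)\,\sigma_{n+1}^2 - (n - 4 + 3\alpha)\,\sigma_n^2
= -\frac{4(1-\alpha)^2}{n - \alpha}\,\mu_n^2
+ \frac{2(1-\alpha)\bigl[(1-2\alpha)n + \alpha\bigr]}{n - \alpha}\,\mu_n
+ \frac{\alpha(1-\alpha)\,n(n-1)}{n - \alpha}.
\]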
5 Higher Order Asymptotic Expansion of the Joint Moments
Although the leading terms of the first and second moments of the distributions of cherries and pitchforks, E[An], E[Cn], var(An), var(Cn) and cov(An, Cn), can be identified from Theorem 3.2, for a better understanding of their asymptotic behaviour we derive their higher order expansions in this section.
We start with the following result on the first moments. Note that Proposition 5.1 (i) has been obtained in [14].
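Proposition 5.1. Under the Ford model with parameter $\alpha \in [0, 1]$, the following exact expansions hold for $\mathbb{E}[C_n]$ and $\mathbb{E}[A_n]$.
(i) $\displaystyle \mathbb{E}[C_n] = \frac{1-\alpha}{3-2\alpha}\,n + \frac{\alpha}{2(3-2\alpha)} + x_n$, where
\[
x_2 = \frac{2-\alpha}{2(3-2\alpha)}, \qquad
x_3 = \frac{\alpha}{2(3-2\alpha)}, \qquad
x_n = \frac{\alpha}{2(3-2\alpha)} \prod_{i=3}^{n-1} \frac{i - 2 + \alpha}{i - \alpha}, \quad n \ge 4.
\]
Further, as $n \to \infty$,
\[
x_n = \frac{\alpha\,\Gamma(3-\alpha)}{2(3-2\alpha)\,\Gamma(1+\alpha)}\; n^{-2(1-\alpha)}\,(1 + o(1)). \tag{33}
\]
(ii) $\displaystyle \mathbb{E}[A_n] = \frac{1-\alpha}{2(3-2\alpha)}\,n + \frac{\alpha}{2(3-2\alpha)} + y_n$, where
\[
y_2 = \frac{\alpha - 2}{2(3-2\alpha)}, \qquad
y_3 = \frac{1}{2}, \qquad
y_n = \frac{1}{2} \prod_{i=3}^{n-1} \frac{i - 3 + \alpha}{i - \alpha}
 + \frac{(2-\alpha)\alpha}{2(3-2\alpha)} \cdot \frac{n-3}{n - 3 + \alpha} \prod_{i=3}^{n-1} \frac{i - 2 + \alpha}{i - \alpha}, \quad n \ge 4.
\]
Further, as $n \to \infty$,
\[
y_n = \frac{(2-\alpha)\,\Gamma(3-\alpha)}{2(3-2\alpha)\,\Gamma(\alpha)}\; n^{-2(1-\alpha)}\,(1 + o(1)). \tag{34}
\]
Proposition 5.2. Under the Ford model with parameter $\alpha \in [0, 1]$, the following asymptotic expansions hold for $\mathrm{var}(C_n)$, $\mathrm{cov}(A_n, C_n)$ and $\mathrm{var}(A_n)$:
(i)
\[
\mathrm{var}(C_n) = \frac{(1-\alpha)(2-\alpha)}{(3-2\alpha)^2(5-4\alpha)}\,n - \frac{\alpha(1-\alpha)(2-\alpha)}{(3-2\alpha)^2(5-4\alpha)} + O\bigl(n^{-2(1-\alpha)}\bigr).
\]
(ii)
\[
\mathrm{cov}(A_n, C_n) = -\frac{(1-\alpha)(2-\alpha)(1-2\alpha)}{2(3-2\alpha)^2(5-4\alpha)}\,n - \frac{\alpha(1-\alpha)(2-\alpha)}{(3-2\alpha)^2(5-4\alpha)} + O\bigl(n^{-2(1-\alpha)}\bigr).
\]
(iii)
\[
\mathrm{var}(A_n) = \frac{(1-\alpha)(69 - 135\alpha + 96\alpha^2 - 24\alpha^3)}{4(3-2\alpha)^2(5-4\alpha)(7-4\alpha)}\,n + \frac{3\alpha(1-\alpha)(1-2\alpha)(5-3\alpha)}{4(3-2\alpha)^2(5-4\alpha)(7-4\alpha)} + O\bigl(n^{-2(1-\alpha)}\bigr).
\]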
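The exact expressions in Proposition 5.1 can be compared with the mean recurrences (28) and (29). The sketch below is illustrative only: the function names are hypothetical, α is assumed to be a `Fraction`, and the case n = 3 of part (ii) is handled separately using y_3 = 1/2.

```python
from fractions import Fraction


def mean_cherries(n, alpha):
    """E[C_n] from Proposition 5.1(i), for n >= 3."""
    x = alpha / (2 * (3 - 2 * alpha))                     # x_3
    for i in range(3, n):
        x *= (i - 2 + alpha) / (i - alpha)
    return (1 - alpha) * n / (3 - 2 * alpha) + alpha / (2 * (3 - 2 * alpha)) + x


def mean_pitchforks(n, alpha):
    """E[A_n] from Proposition 5.1(ii), for n >= 3."""
    if n == 3:
        y = Fraction(1, 2)                                # y_3
    else:
        p = Fraction(1, 2)
        q = (2 - alpha) * alpha / (2 * (3 - 2 * alpha))
        for i in range(3, n):
            p *= (i - 3 + alpha) / (i - alpha)
            q *= (i - 2 + alpha) / (i - alpha)
        y = p + q * (n - 3) / (n - 3 + alpha)
    return (1 - alpha) * n / (2 * (3 - 2 * alpha)) + alpha / (2 * (3 - 2 * alpha)) + y


def means_by_recursion(n, alpha):
    """(E[A_n], E[C_n]) obtained by iterating (29) and (28) from n = 3."""
    ea = ec = Fraction(1)
    for m in range(3, n):
        ea, ec = (((m - 3 + alpha) * ea + (2 - alpha) * ec) / (m - alpha),
                  ((m - 2 + alpha) * ec + m * (1 - alpha)) / (m - alpha))
    return ea, ec
```

For the Yule model (α = 0) both correction terms vanish for n ≥ 4, so the formulas reduce to E[C_n] = n/3 and E[A_n] = n/6.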
Remark 5. When n is large, Cov(An, Cn) changes sign. Specifically, for α ∈ (0, 1/2), An and Cn are negatively correlated, which is expected; and for α ∈ (1/2, 1), An and Cn are positively correlated, which is unexpected.
5.1 Proofs of Propositions 5.1 and 5.2
We need the lemmas below to prove the two propositions.
Lemma 5.3. Suppose a real sequence $\{X_n, n \ge n_0\}$ satisfies the recursion
\[
X_{n+1} = f_n X_n + g_n, \quad n \ge n_0,
\]
where $\{f_n, n \ge n_0\}$ and $\{g_n, n \ge n_0\}$ are sequences such that for every $\ell \ge n_0$, $\big|\prod_{i=\ell}^{n} f_i\big| \le C(n/\ell)^{-a}$ and $|g_\ell| \le C\ell^{-b}$, for some finite $a$, $b$ and $C > 0$. Then there exists a finite positive constant $C'$ (which depends on $|X_{n_0}|$ and $C$) such that $|X_n| \le C' n^{-q_{a,b}}$, where $q_{a,b} := \min\{a, b-1\}$.
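Proof of Lemma 5.3. It is easy to verify that the solution to the given recursion is
\[
X_n = X_{n_0} \prod_{i=n_0}^{n-1} f_i + \sum_{i=n_0}^{n-1} g_i \prod_{j=i+1}^{n-1} f_j, \quad n \ge n_0.
\]
Therefore,
\[
|X_n| \le |X_{n_0}| \left| \prod_{i=n_0}^{n-1} f_i \right| + \sum_{i=n_0}^{n-1} |g_i| \left| \prod_{j=i+1}^{n-1} f_j \right|.
\]
Under the assumptions of the lemma,
\[
|X_{n_0}| \left| \prod_{i=n_0}^{n-1} f_i \right| \le C |X_{n_0}|\, n^{-a} \le C' n^{-a},
\]
and
\[
\sum_{i=n_0}^{n-1} |g_i| \left| \prod_{j=i+1}^{n-1} f_j \right| \le C \sum_{i=n_0}^{n-1} |g_i| (n/i)^{-a} \le C^2 n^{-a} \sum_{i=n_0}^{n-1} i^{-b}\, i^{a} \le C' n^{-a}\, n^{-b+a+1} = C' n^{-b+1}.
\]
Thus
\[
|X_n| \le C' \max\big(n^{-a}, n^{-b+1}\big) = C' n^{-q_{a,b}},
\]
where $q_{a,b} = \min\{a, b-1\}$. This completes the proof.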
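To illustrate Lemma 5.3 on a concrete instance, the sketch below iterates the recursion for one hypothetical choice of coefficients, $f_n = (n-3)/n$ and $g_n = n^{-5/2}$ with $n_0 = 4$, so that $a = 3$, $b = 5/2$ and $q_{a,b} = 3/2$. These coefficients are chosen only for illustration and do not come from the Ford model; the function name is likewise an assumption.

```python
def iterate_recursion(n_max, n0=4, x0=1.0, a=3, b=2.5):
    """Iterate X_{n+1} = f_n X_n + g_n with f_n = (n - a)/n and g_n = n**(-b).

    Returns a dict mapping n to X_n for n0 <= n <= n_max.
    """
    x = x0
    values = {n0: x}
    for n in range(n0, n_max):
        x = (n - a) / n * x + n ** (-b)
        values[n + 1] = x
    return values

values = iterate_recursion(10_000)
q = min(3, 2.5 - 1)  # q_{a,b} = min{a, b - 1} for this illustrative choice
for n in (100, 1_000, 10_000):
    # The rescaled sequence n**q * |X_n| should remain bounded as n grows.
    print(n, values[n] * n ** q)
```

The printed values of $n^{q_{a,b}} X_n$ should stay bounded, in line with the bound $|X_n| \le C' n^{-q_{a,b}}$.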
Lemma 5.4. For finite non-negative integers $l$, $k$ such that $l \ge k$, $m \ge 1$ and $\alpha \in [0, 1]$, there exists a positive constant $K = K(\alpha, l)$ such that
\[
\prod_{i=l}^{n-1} \frac{i-k+m\alpha}{i-\alpha} \le K\,(n/l)^{-k+(m+1)\alpha} \quad \text{for all } 1 \le l \le n-1, \tag{35}
\]
and, as $n \to \infty$,
\[
\prod_{i=l}^{n-1} \frac{i-k+m\alpha}{i-\alpha} = \frac{\Gamma(l-\alpha)}{\Gamma(l-k+m\alpha)}\, n^{-k+(m+1)\alpha}\,(1+o(1)). \tag{36}
\]
Proof of Lemma 5.4. The bound in (35) follows from Lemma 2 of [7]. We now prove (36).
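Note that we can write
\[
\frac{i-k+m\alpha}{i-\alpha} = \frac{\Gamma(i+1-k+m\alpha)\,\Gamma(i-\alpha)}{\Gamma(i-k+m\alpha)\,\Gamma(i+1-\alpha)}.
\]
Thus
\begin{align*}
\prod_{i=l}^{n-1} \frac{i-k+m\alpha}{i-\alpha}
&= \prod_{i=l}^{n-1} \frac{\Gamma(i+1-k+m\alpha)\,\Gamma(i-\alpha)}{\Gamma(i-k+m\alpha)\,\Gamma(i+1-\alpha)} \\
&= \frac{\Gamma(n-k+m\alpha)}{\Gamma(l-k+m\alpha)} \cdot \frac{\Gamma(l-\alpha)}{\Gamma(n-\alpha)} \tag{37} \\
&= \frac{\Gamma(l-\alpha)}{\Gamma(l-k+m\alpha)} \cdot \frac{\Gamma(n+m\alpha)}{\Gamma(n-\alpha)} \prod_{j=1}^{k} \frac{1}{n-j+m\alpha}.
\end{align*}
Since $k$ is fixed,
\[
\prod_{j=1}^{k} \frac{1}{n-j+m\alpha} = n^{-k}\,(1+o(1)). \tag{38}
\]
By Stirling's approximation formula, $\Gamma(x) = \sqrt{2\pi}\, x^{x-1/2} e^{-x}(1+o(1))$, we have
\begin{align*}
\frac{\Gamma(n+m\alpha)}{\Gamma(n-\alpha)}
&= \frac{\sqrt{2\pi}\,(n+m\alpha)^{n+m\alpha-1/2}\, e^{-(n+m\alpha)}}{\sqrt{2\pi}\,(n-\alpha)^{n-\alpha-1/2}\, e^{-(n-\alpha)}}\,(1+o(1)) \\
&= n^{(m+1)\alpha}\, \frac{(1+m\alpha/n)^{n+m\alpha-1/2}}{(1-\alpha/n)^{n-\alpha-1/2}}\, e^{-(m+1)\alpha}\,(1+o(1)) \\
&= n^{(m+1)\alpha}\,(1+o(1)). \tag{39}
\end{align*}
Substituting (38) and (39) into (37), we get (36).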
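The asymptotic formula (36) can be sanity-checked numerically. The sketch below compares the exact product with the leading term on the right-hand side of (36); the function names and the sample parameters ($l = 3$, $k = 2$, $m = 1$, the case used later for $x_n$) are illustrative assumptions.

```python
import math

def product_exact(n, l, k, m, alpha):
    """Left-hand side of (36): prod_{i=l}^{n-1} (i - k + m*alpha) / (i - alpha)."""
    result = 1.0
    for i in range(l, n):
        result *= (i - k + m * alpha) / (i - alpha)
    return result

def product_asymptotic(n, l, k, m, alpha):
    """Leading term of (36); requires l - k + m*alpha > 0 so both Gamma arguments are positive."""
    log_ratio = math.lgamma(l - alpha) - math.lgamma(l - k + m * alpha)
    return math.exp(log_ratio) * n ** (-k + (m + 1) * alpha)

for n in (10, 100, 1_000, 10_000):
    ratio = product_exact(n, 3, 2, 1, 0.3) / product_asymptotic(n, 3, 2, 1, 0.3)
    print(n, ratio)  # expected to approach 1 as n grows
```

Working with `math.lgamma` rather than `math.gamma` avoids overflow if larger values of $l$ are tried.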
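Proof of Proposition 5.1. Recall $\mu_n = E[C_n]$. By Theorem 3.2, $\mu_n = \frac{1-\alpha}{3-2\alpha}\,n + O(1)$. Thus, we write $\mu_n$ as
\[
\mu_n = \frac{1-\alpha}{3-2\alpha}\,n + \frac{\alpha}{2(3-2\alpha)} + x_n. \tag{40}
\]
For simplicity, the dependence of $\mu_n$ and $x_n$ on $\alpha$ is suppressed.
Since $\mu_2 = \mu_3 = 1$, we get $x_2 = 1 - \frac{4-3\alpha}{2(3-2\alpha)} = \frac{2-\alpha}{2(3-2\alpha)}$ and $x_3 = 1 - \frac{6-5\alpha}{2(3-2\alpha)} = \frac{\alpha}{2(3-2\alpha)}$. Substituting (40) into (28) leads to
\[
(n-\alpha)\,x_{n+1} - (n-2+\alpha)\,x_n = 0, \quad n \ge 2,
\]
and hence
\[
x_n =
\begin{cases}
\dfrac{\alpha}{2(3-2\alpha)} \displaystyle\prod_{i=3}^{n-1} \frac{i-2+\alpha}{i-\alpha}, & n \ge 4, \\[2ex]
\dfrac{\alpha}{2(3-2\alpha)}, & n = 3, \\[2ex]
\dfrac{2-\alpha}{2(3-2\alpha)}, & n = 2.
\end{cases}
\]
To prove (33), we rewrite $x_n$ as follows:
\[
x_n = x_3 \prod_{i=3}^{n-1} \frac{i-2+\alpha}{i-\alpha} = x_3\, \frac{\Gamma(3-\alpha)}{\Gamma(1+\alpha)} \cdot \frac{\Gamma(n-2+\alpha)}{\Gamma(n-\alpha)}, \quad n \ge 4.
\]
Applying Lemma 5.4, we see that (33) holds. Consequently,
\[
\mu_n = \frac{1-\alpha}{3-2\alpha}\,n + \frac{\alpha}{2(3-2\alpha)} + \frac{\alpha\,\Gamma(3-\alpha)}{2(3-2\alpha)\,\Gamma(1+\alpha)}\, n^{-2(1-\alpha)}\,(1+o(1)). \tag{41}
\]
This completes the proof of part (i).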
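As an illustration of part (i), the following sketch iterates the recursion $(n-\alpha)x_{n+1} = (n-2+\alpha)x_n$ from $x_3$, compares the result with the leading term in (33), and assembles $E[C_n]$ from (40). The function names are hypothetical, and the comparison with (33) assumes $\alpha > 0$, since both sides vanish when $\alpha = 0$.

```python
import math

def x_exact(n, alpha):
    """x_n for n >= 3, from (n - alpha) x_{n+1} = (n - 2 + alpha) x_n started at x_3."""
    x = alpha / (2 * (3 - 2 * alpha))  # x_3
    for i in range(3, n):
        x *= (i - 2 + alpha) / (i - alpha)
    return x

def x_asymptotic(n, alpha):
    """Leading term of (33)."""
    const = alpha * math.gamma(3 - alpha) / (2 * (3 - 2 * alpha) * math.gamma(1 + alpha))
    return const * n ** (-2 * (1 - alpha))

def mean_cherries(n, alpha):
    """E[C_n] assembled from (40); valid in this sketch for n >= 3."""
    d = 3 - 2 * alpha
    return (1 - alpha) / d * n + alpha / (2 * d) + x_exact(n, alpha)

for n in (10, 100, 1_000, 10_000):
    print(n, x_exact(n, 0.5) / x_asymptotic(n, 0.5))  # expected to approach 1
```

For example, `mean_cherries(3, alpha)` returns 1 up to rounding for any admissible α, matching $\mu_3 = 1$.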
The same method can be used to prove part (ii). Recall $\nu_n = E[A_n]$. By Theorem 3.2, $\nu_n = \frac{1-\alpha}{2(3-2\alpha)}\,n + O(1)$, and we write it as
\[
\nu_n = \frac{1-\alpha}{2(3-2\alpha)}\,n + \frac{\alpha}{2(3-2\alpha)} + y_n, \tag{42}
\]
where, again, the dependence of $\nu_n$ and $y_n$ on $\alpha$ is suppressed. Substituting (42) into (29) leads to
\[
y_{n+1} = \frac{n-3+\alpha}{n-\alpha}\, y_n + \frac{2-\alpha}{n-\alpha}\, x_n, \quad n \ge 3.
\]
The solution to this recurrence relation is given by
\[
y_n = y_3 \prod_{i=3}^{n-1} \frac{i-3+\alpha}{i-\alpha} + \sum_{i=3}^{n-1} \frac{2-\alpha}{i-\alpha}\, x_i \prod_{j=i+1}^{n-1} \frac{j-3+\alpha}{j-\alpha}.
\]
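Since $y_3 = 1/2$, using the expression for $x_i$ from part (i), we get
\begin{align*}
y_n &= y_3 \prod_{i=3}^{n-1} \frac{i-3+\alpha}{i-\alpha} + \sum_{i=3}^{n-1} \frac{2-\alpha}{i-\alpha} \cdot \frac{\alpha}{2(3-2\alpha)} \prod_{j=3}^{i-1} \frac{j-2+\alpha}{j-\alpha} \cdot \prod_{j=i+1}^{n-1} \frac{j-3+\alpha}{j-\alpha} \\
&= \frac{1}{2} \prod_{i=3}^{n-1} \frac{i-3+\alpha}{i-\alpha} + \frac{(2-\alpha)\alpha}{2(3-2\alpha)} \sum_{i=3}^{n-1} \prod_{j=i+1}^{n-1} \frac{j-3+\alpha}{j-\alpha} \cdot \frac{1}{i-\alpha} \cdot \prod_{j=3}^{i-1} \frac{j-2+\alpha}{j-\alpha} \\
&= \frac{1}{2} \prod_{i=3}^{n-1} \frac{i-3+\alpha}{i-\alpha} + \frac{(2-\alpha)\alpha}{2(3-2\alpha)} \sum_{i=3}^{n-1} \frac{1}{3-\alpha} \prod_{j=4}^{n-1} \frac{j-3+\alpha}{j-\alpha} \\
&= \frac{1}{2} \prod_{i=3}^{n-1} \frac{i-3+\alpha}{i-\alpha} + \frac{(2-\alpha)\alpha}{2(3-2\alpha)} \cdot \frac{n-3}{3-\alpha} \prod_{j=4}^{n-1} \frac{j-3+\alpha}{j-\alpha}.
\end{align*}
In the third equality, the numerators combine as $\prod_{j=3}^{i-1}(j-2+\alpha) \prod_{j=i+1}^{n-1}(j-3+\alpha) = \prod_{j=4}^{n-1}(j-3+\alpha)$, so each summand is independent of $i$. Thus, for $n \ge 5$,
\[
y_n = \frac{1}{2} \prod_{i=3}^{n-1} \frac{i-3+\alpha}{i-\alpha} + \frac{(2-\alpha)\alpha}{2(3-2\alpha)} \cdot \frac{n-3}{3-\alpha} \prod_{j=4}^{n-1} \frac{j-3+\alpha}{j-\alpha}. \tag{43}
\]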
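By Lemma 5.4, $\frac{1}{2}\prod_{i=3}^{n-1} \frac{i-3+\alpha}{i-\alpha} = O(n^{-3+2\alpha})$, and hence
\begin{align*}
y_n &= O\big(n^{-3+2\alpha}\big) + \frac{(2-\alpha)\alpha}{2(3-2\alpha)} \cdot \frac{n-3}{3-\alpha} \cdot \frac{\Gamma(4-\alpha)}{\Gamma(1+\alpha)}\, n^{-3+2\alpha}\,(1+o(1)) \\
&= \frac{(2-\alpha)\alpha}{2(3-2\alpha)} \cdot \frac{\Gamma(3-\alpha)}{\Gamma(1+\alpha)}\, n^{-2+2\alpha}\,(1+o(1)) \\
&= \frac{(2-\alpha)\,\Gamma(3-\alpha)}{2(3-2\alpha)\,\Gamma(\alpha)}\, n^{-2(1-\alpha)}\,(1+o(1))
\end{align*}
as $n \to \infty$. This completes the proof of part (ii) and hence of the proposition.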
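The closed form for $y_n$ in Proposition 5.1 (ii) can be cross-checked against a direct iteration of the recurrence for $y_n$ started at $y_3 = 1/2$. The sketch below does this numerically; the function names are hypothetical and the choice of α in the loop is arbitrary.

```python
def x_seq(n_max, alpha):
    """x_3, ..., x_{n_max} via (n - alpha) x_{n+1} = (n - 2 + alpha) x_n."""
    xs = {3: alpha / (2 * (3 - 2 * alpha))}
    for n in range(3, n_max):
        xs[n + 1] = (n - 2 + alpha) / (n - alpha) * xs[n]
    return xs

def y_by_recursion(n_max, alpha):
    """y_3, ..., y_{n_max} via y_{n+1} = ((n - 3 + alpha) y_n + (2 - alpha) x_n) / (n - alpha)."""
    xs = x_seq(n_max, alpha)
    ys = {3: 0.5}
    for n in range(3, n_max):
        ys[n + 1] = ((n - 3 + alpha) * ys[n] + (2 - alpha) * xs[n]) / (n - alpha)
    return ys

def y_closed_form(n, alpha):
    """Closed form of Proposition 5.1 (ii), for n >= 4."""
    p1 = 1.0  # prod_{i=3}^{n-1} (i - 3 + alpha) / (i - alpha)
    p2 = 1.0  # prod_{i=3}^{n-1} (i - 2 + alpha) / (i - alpha)
    for i in range(3, n):
        p1 *= (i - 3 + alpha) / (i - alpha)
        p2 *= (i - 2 + alpha) / (i - alpha)
    coeff = (2 - alpha) * alpha / (2 * (3 - 2 * alpha))
    return 0.5 * p1 + coeff * (n - 3) / (n - 3 + alpha) * p2

ys = y_by_recursion(50, 0.3)
for n in (4, 10, 50):
    print(n, ys[n], y_closed_form(n, 0.3))  # the two columns should agree
```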
Proof of Proposition 5.2. The method of proof is similar to that of Proposition 5.1.
Recall $\sigma_n^2 = \mathrm{var}(C_n)$. From Theorem 3.2, we have $\sigma_n^2 = \frac{(1-\alpha)(2-\alpha)}{(5-4\alpha)(3-2\alpha)^2}\,n + O(1)$. We first consider $E[C_n^2]$. As
\[
E[C_n^2] = \mu_n^2 + \sigma_n^2 = \frac{(1-\alpha)^2}{(3-2\alpha)^2}\,n^2 + O(n),
\]
we rewrite it as
\[
E[C_n^2] = \frac{(1-\alpha)^2}{(3-2\alpha)^2}\,n^2 + \frac{2(1-\alpha)(1+2\alpha-2\alpha^2)}{(5-4\alpha)(3-2\alpha)^2}\,n - \frac{\alpha(8-17\alpha+8\alpha^2)}{4(5-4\alpha)(3-2\alpha)^2} + z_n, \tag{44}
\]
and derive below a recursion for $z_n$. Substituting (44) into (30), straightforward algebraic simplification gives
\[
(n-\alpha)\,z_{n+1} - (n-4+3\alpha)\,z_n = 2(1-\alpha)(n-1)\,x_n, \quad n \ge 2.
\]
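Since $C_2 = C_3 = 1$, evaluating (44) at $n = 2$ and $n = 3$ gives $z_2 = \frac{3(2-\alpha)(8\alpha^2-21\alpha+14)}{4(3-2\alpha)^2(5-4\alpha)}$ and $z_3 = \frac{40\alpha^3-117\alpha^2+104\alpha-24}{4(3-2\alpha)^2(5-4\alpha)}$. Consequently,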
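\[
\sigma_n^2 = \frac{(1-\alpha)(2-\alpha)}{(5-4\alpha)(3-2\alpha)^2}\,n - \frac{\alpha(1-\alpha)(2-\alpha)}{(5-4\alpha)(3-2\alpha)^2} + v_n - x_n^2,
\]
where
\[
v_n = z_n - \frac{2(1-\alpha)}{3-2\alpha}\,n\,x_n - \frac{\alpha}{3-2\alpha}\,x_n = z_n - \frac{2(1-\alpha)n+\alpha}{3-2\alpha}\,x_n.
\]
Then, for $n \ge 6$,
\begin{align*}
(n-\alpha)\,v_{n+1} &= (n-\alpha)\,z_{n+1} - \frac{2(1-\alpha)(n+1)+\alpha}{3-2\alpha}\,(n-\alpha)\,x_{n+1} \\
&= (n-4+3\alpha)\,z_n + 2(1-\alpha)(n-1)\,x_n - \frac{2(1-\alpha)(n+1)+\alpha}{3-2\alpha}\,(n-2+\alpha)\,x_n \\
&= (n-4+3\alpha)\,v_n + \frac{(n-4+3\alpha)[2(1-\alpha)n+\alpha]}{3-2\alpha}\,x_n + 2(1-\alpha)(n-1)\,x_n - \frac{2(1-\alpha)(n+1)+\alpha}{3-2\alpha}\,(n-2+\alpha)\,x_n \\
&= (n-4+3\alpha)\,v_n - \frac{2(1-\alpha)}{3-2\alpha}\,x_n.
\end{align*}
Equivalently,
\[
v_{n+1} = \frac{n-4+3\alpha}{n-\alpha}\,v_n - \frac{2(1-\alpha)}{3-2\alpha} \cdot \frac{x_n}{n-\alpha}.
\]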
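Applying Lemma 5.3 with $f_n = \frac{n-4+3\alpha}{n-\alpha}$, $g_n = -\frac{2(1-\alpha)\,x_n}{(3-2\alpha)(n-\alpha)}$, $a = 4-4\alpha$ and $b = 3-2\alpha$ (the value of $a$ comes from (35) with $k = 4$ and $m = 3$, and the value of $b$ from (33)), we get $v_n = O(n^{-2+2\alpha})$. Since $x_n^2 = O(n^{-4(1-\alpha)})$ by (33), this proves part (i) of the proposition.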
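The quantities in this part of the proof can be evaluated numerically. The sketch below starts from $x_2$ and $z_2$ as given in the text, iterates the recursions for $x_n$ and $z_n$, and assembles $\mathrm{var}(C_n)$ from the decomposition $\sigma_n^2 = cn - \alpha c + v_n - x_n^2$, where $c$ denotes the leading coefficient. The function name and the symbol `c` are illustrative assumptions.

```python
def var_cherries(n_max, alpha):
    """var(C_n) for n = 2..n_max, assembled from the recursions for x_n and z_n."""
    d = 3 - 2 * alpha
    c = (1 - alpha) * (2 - alpha) / ((5 - 4 * alpha) * d ** 2)  # leading coefficient
    x = (2 - alpha) / (2 * d)  # x_2
    z = 3 * (2 - alpha) * (8 * alpha ** 2 - 21 * alpha + 14) / (4 * d ** 2 * (5 - 4 * alpha))  # z_2
    out = {}
    for n in range(2, n_max + 1):
        v = z - (2 * (1 - alpha) * n + alpha) / d * x  # v_n
        out[n] = c * n - alpha * c + v - x * x
        # Advance both recursions from n to n + 1 (z uses the current x_n).
        z = ((n - 4 + 3 * alpha) * z + 2 * (1 - alpha) * (n - 1) * x) / (n - alpha)
        x = (n - 2 + alpha) / (n - alpha) * x
    return out

variances = var_cherries(1_000, 0.5)
for n in (2, 3, 4, 1_000):
    print(n, variances[n])
```

Since $C_2 = C_3 = 1$ deterministically, the computed variances at $n = 2$ and $n = 3$ should vanish up to rounding, which gives a simple consistency check on the starting values.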
Part (ii) is proved in a similar fashion. By Theorem 3.2, $\mathrm{cov}(A_n, C_n) = -\frac{(1-\alpha)(2-\alpha)(1-2\alpha)}{2(3-2\alpha)^2(5-4\alpha)}\,n + O(1)$. Since $E[A_n C_n] = \mathrm{cov}(A_n, C_n) + \mu_n \nu_n$, with $\mu_n$ and $\nu_n$ found in Proposition 5.1, we write
\[
E[A_n C_n] = \frac{(1-\alpha)^2}{2(3-2\alpha)^2}\,n^2 - \frac{(1-\alpha)(4-25\alpha+16\alpha^2)}{4(5-4\alpha)(3-2\alpha)^2}\,n - \frac{\alpha(8-17\alpha+8\alpha^2)}{4(5-4\alpha)(3-2\alpha)^2} + t_n. \tag{45}
\]
Combining (31) and (45), $t_n$ satisfies the recursion
\[
(n-\alpha)\,t_{n+1} - (n-5+3\alpha)\,t_n = (2-\alpha)\,z_n + (1-\alpha)(n-1)\,y_n, \quad n \ge 6. \tag{46}
\]
By (40), (42) and (45),
\[
\mathrm{cov}(A_n, C_n) = -\frac{(1-\alpha)(2-\alpha)(1-2\alpha)}{2(5-4\alpha)(3-2\alpha)^2}\,n - \frac{\alpha(1-\alpha)(2-\alpha)}{(5-4\alpha)(3-2\alpha)^2} + w_n - x_n y_n,
\]