Chapter 3 Optimising Quantile Risk
3.3 The Markov Chain
In this section we find an analytic description of the Markov chain that our optimisation process follows (at least when it is at thermal equilibrium). Later sections use this result as the foundation for further key results which, as highlighted in the introduction, allow for very effective control of the optimisation process. We use a combination of (noisy) simulated annealing and underlying data samples to perform the quantile optimisation. Throughout this section we assume that the inverse temperature $\beta$ and the sample size $n$ used to estimate $V_D(R_T)$ are both fixed.
As discussed in the previous section, we assume that we are optimising the design of a random variable, $F_D$, at a certain risk (or probability) level, $R_T$. During the optimisation process we will always search $D$ using a symmetric proposal density, i.e. $h(D \to D') = h(D' \to D)$. Consequently, if we could calculate $V_D(R_T)$ precisely at each step, the relative probability density for each design in a simulated annealing process using the Metropolis acceptance function at fixed $\beta$ would be $\pi_a(D) = e^{\beta V_D(R_T)}$. Note that this is the Boltzmann distribution, as defined in Definition 2.2.5, with $E = -V_D$.
We now further assume that $F_D$ is composed from $m$ underlying real-valued random variables whose joint distribution is independent of $D$. In other words, only the composition of these random variables to form $F_D$ depends on $D$. The underlying random variables are not required to be independent of each other. We then consider $X$ to be a sample set of $n$ independent, $m$-dimensional elements, each drawn from the joint distribution of the $m$ random variables from which $F_D$ is composed. We consider $\mathcal{X}$ to be the space of all possible choices of $X$. Since $V_D(R_T)$ is estimated using sampling, it will inherently have some error associated with it. The precise error will depend on the specific sample set used. We can therefore instead consider a Markov chain which operates over the joint space of $D$ and $X$ (instead of just $D$), where $\beta$ and $n$ are fixed.
We denote the estimate of $V_D(R_T)$ using $X$ as $\hat{V}_D(k, X)$ for some (given) choice of $k$. We define $\hat{V}_D(k, X)$ to be the $k$th ranked value of $V_D$ obtained from the $n$ elements contained in the sample set $X$. We would naively expect an appropriate choice of $k$ to be $\lfloor R_T n \rfloor$; however, as we shall see, this may not always be the best choice. More specifically, $\hat{V}_D(k, X)$ is calculated by composing the $n$ sample elements contained in the sample set $X$ into $n$ samples drawn from the distribution of $F_D$, using the composition parametrised by $D$. These samples from the distribution of $F_D$ are then put in ascending order and the $k$th ordered point is selected to be $\hat{V}_D(k, X)$.
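The construction of the estimate described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names `v_hat` and `compose`, the linear toy composition, the Gaussian underlying variables and the chosen risk level are all assumptions made for the example and do not come from the text.

```python
import math
import random

def v_hat(compose, design, samples, k):
    """k-th ranked (1-based, ascending) value of the n composed samples of F_D."""
    values = sorted(compose(design, x) for x in samples)
    return values[k - 1]

# Hypothetical toy composition with m = 2: F_D = D[0] * x[0] + D[1] * x[1].
def compose(design, x):
    return design[0] * x[0] + design[1] * x[1]

rng = random.Random(0)
n, risk_level = 1000, 0.05            # illustrative values only
# Sample set X: n independent, m-dimensional elements (here standard normals).
X = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]

k = max(1, math.floor(risk_level * n))  # the naive choice k = floor(R_T * n)
estimate = v_hat(compose, (0.5, 0.5), X, k)
```

Note that the sample set is drawn without reference to the design; only `compose` depends on $D$, mirroring the assumption that the joint distribution of the underlying variables is independent of $D$.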
We will use a similar Metropolis acceptance function for this new Markov chain. That is, for fixed $\beta$ and $k$, the probability of accepting a move is

$$a(D, X \to D', X') = \min\left\{1,\; e^{\beta\left(\hat{V}_{D'}(k, X') - \hat{V}_D(k, X)\right)}\right\}. \tag{3.1}$$
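As a small sketch of equation 3.1, the acceptance rule can be written as follows. The function names are hypothetical, and the early return for non-negative exponents is an implementation convenience assumed here to avoid floating-point overflow, not something stated in the text.

```python
import math
import random

def acceptance_probability(beta, v_hat_current, v_hat_candidate):
    """Equation (3.1): min(1, exp(beta * (candidate - current)))."""
    delta = beta * (v_hat_candidate - v_hat_current)
    if delta >= 0:
        return 1.0                 # skip exp() for large positive exponents
    return math.exp(delta)

def accept(beta, v_hat_current, v_hat_candidate, rng=random):
    """Draw a uniform number and accept the move with the probability above."""
    return rng.random() < acceptance_probability(beta, v_hat_current, v_hat_candidate)
```

A move to a larger estimate is always accepted, while a move to a smaller estimate is accepted with a probability that decays exponentially in $\beta$ times the decrease.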
Furthermore, the proposal densities for $D$ and $X$ will be mutually independent,

$$h(D, X \to D', X') = h(D \to D')\, h(X \to X'), \tag{3.2}$$

and, as before, the proposal density for $D$ will be symmetric,

$$h(D \to D') = h(D' \to D). \tag{3.3}$$

The proposal density for $X$ will reflect the underlying densities of the random variables from which $F_D$ is composed. A candidate sample set $X'$ will be constructed by taking $n$ new randomly drawn samples from these underlying distributions. Alternatively, $l$ randomly selected elements of the current sample set $X$, where $0 < l < n$, will be replaced by $l$ newly drawn elements. The method used will be consistent throughout a simulation. Hence, by construction, the proposal density for $X$ will satisfy

$$\frac{h(X \to X')}{h(X' \to X)} = \frac{H(X')}{H(X)}, \tag{3.4}$$

where $H(X) = \prod_{x \in X} \eta(x)$ and $\eta(x)$ is the joint probability density of the $m$ random variables which are composed to form $F_D$. Note that the product is over the $n$ sample elements contained in the sample set $X \in \mathcal{X}$. It should be clear that each sample element, $x \in X$, will itself be $m$-dimensional.
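The two ways of proposing a candidate sample set described above (a full redraw, or replacing $l$ randomly selected elements) might be sketched as below. The names `propose_sample_set` and `draw_element`, and the choice of two independent standard normals for the underlying variables, are assumptions made purely for illustration.

```python
import random

def propose_sample_set(current, draw_element, l=None, rng=random):
    """Return a candidate X'. With l=None all n elements are redrawn;
    otherwise l randomly selected elements are replaced by new draws."""
    n = len(current)
    if l is None:
        return [draw_element(rng) for _ in range(n)]
    if not 0 < l < n:
        raise ValueError("l must satisfy 0 < l < n")
    candidate = list(current)               # leave the current set untouched
    for i in rng.sample(range(n), l):       # l distinct positions
        candidate[i] = draw_element(rng)
    return candidate

# Hypothetical underlying distribution: m = 2 independent standard normals.
def draw_element(rng):
    return (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))

rng = random.Random(1)
X = [draw_element(rng) for _ in range(10)]
X_full = propose_sample_set(X, draw_element, rng=rng)          # full redraw
X_partial = propose_sample_set(X, draw_element, l=3, rng=rng)  # replace 3 elements
```

In both variants the new elements are drawn from the underlying density itself, which is what makes the ratio of forward to backward proposal densities depend only on $H(X')$ and $H(X)$.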
Proposition 3.3.1. The relative probability density of the resulting Markov chain in the space of $(D, X)$ will hence be

$$\pi(D, X) = H(X)\, e^{\beta \hat{V}_D(k, X)}$$

for a fixed choice of $\beta$, $n$ and $k$.
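Proof. [Roberts and Rosenthal, 2004] state that, given a factorisation of the proposal densities (equations 3.2, 3.3, 3.4) and acceptance probabilities (equation 3.1), the relative probability density of the Markov chain is as stated in the proposition, provided that the detailed balance condition can be proved. That is, we are required to show that

$$\pi(D, X)\, q(D, X \to D', X') = \pi(D', X')\, q(D', X' \to D, X), \tag{3.5}$$

where $q(D, X \to D', X') = a(D, X \to D', X')\, h(D, X \to D', X')$. Starting from the left-hand side and substituting equations 3.2, 3.3, 3.1 and 3.4 leads to

$$\begin{aligned}
\pi(D, X)\, q(D, X \to D', X')
&= \pi(D, X)\, h(D, X \to D', X')\, a(D, X \to D', X') \\
&= \pi(D, X) \min\left\{ h(D, X \to D', X'),\; h(D', X' \to D, X)\, \frac{H(X')}{H(X)}\, e^{\beta\left(\hat{V}_{D'}(k, X') - \hat{V}_D(k, X)\right)} \right\} \\
&= \min\left\{ \pi(D, X)\, h(D, X \to D', X'),\; \pi(D', X')\, h(D', X' \to D, X) \right\},
\end{aligned}$$

where the final equality uses $\pi(D, X)\, \frac{H(X')}{H(X)}\, e^{\beta(\hat{V}_{D'}(k, X') - \hat{V}_D(k, X))} = \pi(D', X')$, which follows from the form of $\pi$ given in the proposition. The final expression is symmetric under the interchange of $(D, X)$ and $(D', X')$, and hence we have shown the required equality (equation 3.5).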
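The detailed balance condition (equation 3.5) can also be checked numerically on a very small discrete example. The sketch below assumes a toy setting that is not taken from the text: two designs, a single underlying variable taking three values with hypothetical probabilities `eta`, a sample set of size $n = 1$ with $k = 1$, a design proposal that always flips the design, and a full redraw of the sample set.

```python
import itertools
import math

beta = 1.5
eta = {-1.0: 0.2, 0.0: 0.5, 2.0: 0.3}   # hypothetical density of the underlying variable
designs = (0, 1)                         # the D-proposal always flips, so it is symmetric

def v_hat(d, x):
    # With n = 1 and k = 1 the estimate is just the composed value (toy composition).
    return x if d == 0 else 0.5 * x + 0.25

def pi(d, x):
    # Relative density of Proposition 3.3.1, with H(X) = eta(x) since n = 1.
    return eta[x] * math.exp(beta * v_hat(d, x))

def h(d, x, d2, x2):
    # Factorised proposal (3.2): flip the design, redraw the sample from eta.
    return (1.0 if d2 != d else 0.0) * eta[x2]

def a(d, x, d2, x2):
    # Metropolis acceptance probability (3.1).
    return min(1.0, math.exp(beta * (v_hat(d2, x2) - v_hat(d, x))))

states = list(itertools.product(designs, eta))
# Largest difference between the two sides of (3.5) over all pairs of states.
max_violation = max(
    abs(pi(*s) * h(*s, *t) * a(*s, *t) - pi(*t) * h(*t, *s) * a(*t, *s))
    for s in states
    for t in states
)
```

In this toy setting `max_violation` should vanish up to floating-point rounding, since both sides reduce to $\eta(x)\,\eta(x')\min\{e^{\beta \hat{V}_D}, e^{\beta \hat{V}_{D'}}\}$.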
We can now use the inversion method to project the stationary density from $(D, X)$ to $(D, [0,1]^n)$. We can then further project the stationary density onto $(D, [0,1])$. This allows for a more interpretable description of the relative probability density.
To do this, we first note that $\hat{V}_D(k, X)$ must equal $V_D(R_k)$ for a corresponding choice of $R_k \in [0,1]$, for every possible choice of $k$. Here $R_k$ is the $k$th ranked of $n$ samples from the uniform distribution on the unit interval (by the inversion method). Such an equivalence must exist by construction of $\hat{V}_D(k, X)$. Proof and further details can be found in [Devroye, 1986].
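The equivalence between the $k$th ranked sample and the quantile function evaluated at the $k$th ranked uniform can be illustrated with a short sketch. It assumes, purely for the example, that $V_D(\cdot)$ is the quantile function of a unit exponential distribution; the variable names are hypothetical.

```python
import math
import random

def quantile(r):
    """Hypothetical V_D(R): quantile function of a unit exponential, -log(1 - R)."""
    return -math.log(1.0 - r)

rng = random.Random(2)
n, k = 200, 190
uniforms = [rng.random() for _ in range(n)]   # values in [0, 1)

# Samples of F_D generated by inversion, then ranked: the k-th is the estimate.
v_hat = sorted(quantile(u) for u in uniforms)[k - 1]

# The k-th ranked uniform, R_k, pushed through the quantile function.
r_k = sorted(uniforms)[k - 1]
v_at_r_k = quantile(r_k)
```

Because the quantile function is non-decreasing, ranking the transformed samples and transforming the ranked uniforms select the same point, so `v_hat` and `v_at_r_k` agree.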
The projected relative density would then become

$$\pi(D, R_1, \dots, R_n) = \zeta(R_1, \dots, R_n)\, e^{\beta V_D(R_k)},$$

where $\zeta(R_1, \dots, R_n)$ is the joint density function of all of the $R_k$s. If we can then find the relative probability density of $R_k$, we can further project the density to

$$\pi(D, R_k) = \zeta_k(R_k)\, e^{\beta V_D(R_k)}, \tag{3.6}$$

$$\pi(D) = \int_0^1 dR_k\, \zeta_k(R_k)\, e^{\beta V_D(R_k)}. \tag{3.7}$$

The first statement is the result of integrating $\pi(D, R_1, \dots, R_n)$ with respect to $R_i$ for all $i \neq k$. The second statement is the result of integrating $\pi(D, R_k)$ with respect to $R_k$. These integrals will always exist when $V_D(\cdot)$ is bounded above on $[0,1]$.
The relative density of the $R_k$s can be easily calculated, since these $R_k$ relate to uniform draws from $[0,1]$ by the inversion method. We can use the binomial distribution to calculate their joint a priori probabilities of occurrence. The probability density of $R_k$, labelled $\zeta_k(R_k)$, is then given by

$$\begin{aligned}
& P(k-1 \text{ samples are less than } R_k) \\
\times\; & P(n-k \text{ samples are greater than } R_k) \\
\times\; & n \quad \text{(possible choices for the $k$th sample)} \\
\times\; & \binom{n-1}{k-1} \quad \text{(possible permutations of the remaining samples)}.
\end{aligned}$$
By substituting the correct binomial probabilities into the above calculation and simplifying, we find that

$$\zeta_k(R_k) = \binom{n}{k}\, k\, R_k^{k-1} (1 - R_k)^{n-k}. \tag{3.8}$$

We note that $R_k$ has a Beta distribution with parameters $\alpha = k$ and $\beta = n - k + 1$ (here $\beta$ denotes the second Beta shape parameter, not the inverse temperature).
Remark 3.3.2. In order to consider expectations over this chain we will need to assume that the normalisation constant for the above relative densities exists, i.e.

$$N(\beta, n, k) = \int_{\mathcal{D}} dD \int_0^1 dR_k\, \zeta_k(R_k)\, e^{\beta V_D(R_k)} < \infty. \tag{3.9}$$

Clearly, existence of the normalisation constant depends on the choice of $V_D(\cdot)$, which, as previously discussed, is unlikely to be known in closed form (if at all). However, if we further assume that $\mathcal{D}$ is bounded and that $V_D(\cdot)$ is bounded in the region near $R_T$, then it seems reasonable to expect that the integral will exist for a sufficiently large choice of $n$ and the corresponding choice $k = \lfloor R_T n \rfloor$.
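As a numerical sanity check on equations 3.7 and 3.8, the sketch below evaluates $\zeta_k$, compares it with the Beta density written in terms of gamma functions, and integrates the projected density for a single design. The toy choice $V_D(R) = R$, the parameter values and the midpoint-rule integrator are assumptions made for illustration only.

```python
import math

def zeta_k(r, n, k):
    """Equation (3.8): density of the k-th ranked of n uniform draws."""
    return math.comb(n, k) * k * r ** (k - 1) * (1.0 - r) ** (n - k)

def beta_pdf(r, a, b):
    """Beta(a, b) density written with gamma functions."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * r ** (a - 1) * (1.0 - r) ** (b - 1)

def integrate(f, steps=20000):
    """Midpoint rule on [0, 1]."""
    width = 1.0 / steps
    return sum(f((i + 0.5) * width) for i in range(steps)) * width

def v_toy(r):
    """Hypothetical bounded quantile function, V_D(R) = R."""
    return r

n, k, inv_temp = 20, 18, 2.0     # inv_temp plays the role of the inverse temperature

total_mass = integrate(lambda r: zeta_k(r, n, k))          # zeta_k is a proper density
projected = integrate(                                      # pi(D) of equation (3.7)
    lambda r: zeta_k(r, n, k) * math.exp(inv_temp * v_toy(r))
)
```

Since the toy $V_D$ is bounded on $[0,1]$, the integral in equation 3.7 is finite here; for an unbounded $V_D$ the same computation would need the behaviour near $R_k = 1$ to be examined separately.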
It is worth noting that the two factors in $\pi(D, R_k)$ above will attempt to push $\langle R_k \rangle$ in different directions. The $\zeta_k(R_k)$ term will seek to select values of $R_k$ near to $k/n$, whereas the $e^{\beta V_D(R_k)}$ term will seek to bias the chain towards larger choices of $R_k$. This is because it will give more weight to overestimates of $V_D(R_T)$ than to underestimates of $V_D(R_T)$. The pressure each term exerts on the choices of $R_k$ will be driven by $n$ and $\beta$ respectively. This tension will mean that in practice $\langle R_k \rangle$ will almost always exceed $k/n$.
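The tension between the two factors can be made concrete by computing $\langle R_k \rangle$ numerically for a single design. The sketch below again assumes the toy choice $V_D(R) = R$ and illustrative values of $n$, $k$ and the inverse temperature; none of these come from the text, and the integration uses a simple midpoint rule.

```python
import math

def zeta_k(r, n, k):
    """Density of the k-th ranked of n uniform draws (equation 3.8)."""
    return math.comb(n, k) * k * r ** (k - 1) * (1.0 - r) ** (n - k)

def mean_r_k(inv_temp, n, k, v, steps=20000):
    """<R_k> under the density proportional to zeta_k(R) * exp(inv_temp * v(R))."""
    width = 1.0 / steps
    grid = [(i + 0.5) * width for i in range(steps)]
    weights = [zeta_k(r, n, k) * math.exp(inv_temp * v(r)) for r in grid]
    return sum(r * w for r, w in zip(grid, weights)) / sum(weights)

def v_toy(r):
    """Hypothetical increasing, bounded quantile function, V_D(R) = R."""
    return r

n, k = 100, 90
mean_untilted = mean_r_k(0.0, n, k, v_toy)   # exponential factor switched off
mean_tilted = mean_r_k(50.0, n, k, v_toy)    # exponential factor switched on
```

With the exponential factor switched off, the mean is that of the Beta distribution, $k/(n+1)$, which lies slightly below $k/n$. For this toy choice, switching the factor on moves the mean above $k/n$, in line with the discussion above; how far it moves depends on the assumed $V_D$ and on the relative sizes of $n$ and $\beta$.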
In this section we have found the Markov chain weights for our optimisation process. In later sections we build on the above results to prove some of the key results contained in this chapter.