An auto validating trans dimensional von Neumann rejection sampler

(1)

An auto-validating trans-dimensional von

Neumann rejection sampler

Raazesh Sainudiin

Biomathematics Research Centre, Department of Mathematics & Statistics,

University of Canterbury, Christchurch, NZ

[with Tom York, Cornell University, Ithaca, NY, USA]

-2 0 2 4

(2)

Outline

• _{A Statistical Sampling Problem}

• Validated Numerics

• What is it ?

• How does it work ?

• _{A Solution to the Sampling Problem}

• Moore Rejection Sampling

• Examples

(3)

(4)

Sampling from a Density - MCMC (M-H)

Support : Θ _∋ θ Target : p := p∗/Np (Unknown Np) Proposal : q := q∗/Nq

p∗

q_I∗

q_L∗

1 Choose an arbitrary starting point θ0 and set i = 0.

2 Generate a candidate point θ′ _∼ q(θi,·) and u ∼ U(0,1).

3 Set:

θi+1 =

(

θ′ if u _≤ p∗(θ′)q(θ′,θi)

p∗₍_θ_i₎_q₍_θ_i_,θ′₎

θi otherwise

(5)

Sampling from a Density - Rejection Sampling

Support : Θ _∋ θ Target : p := p∗/Np (Unknown Np) Proposal : q := q∗/Nq

p∗

q∗

fq

Find “envelope function” fq(θ) = cq∗(θ) such that fq(θ) ≥ p∗(θ), ∀θ ∈ Θ

1 Generate a candidate point θ _∼ q(_·).

2 Draw u _∼ U(0,1).

3 If (u < p∗(θ)/fq(θ)), then θ is an exact and independent sample from p. DONE.

(6)

MCMC vs Rejection Sampling

MCMC (extensions of Metropolis-Hastings Chains)

•

_{“Easy” to implement; almost any proposal works – BUT ONLY asymptotically...,}

•

_{Poor proposal} _⇒ _{slow convergence – heuristic proposal ‘tuning’,}

•

_{Convergence diagnostics are generally not rigorous – can be misleading.}

Rejection Sampling (due to Von Neumann)

•

_{“Hard” to implement; envelope property is NECESSARY – or will NOT sample} _p_,

•

_{Poor proposal} _⇒ _{low acceptance probability – MUCH} _≪ ₁₀−10_,

(7)

What makes a density hard to sample?

•

_Generally:

1 Complexity – many peaks and valleys – size of Θ. 2 Curse of dimensionality.

•

_{If ignorant of global behavior of density:}

3 Widely separated peaks (hard to get from one to next), 4 Narrow peaks on smooth background (hard to find),

5 Peaks of strange shapes (e.g. Rosenbrock’s banana density) 6 Others... ?

Don’t treat target density as black box.

•

_{Look at expression (or code) for p*(x).}

•

_{Find locations, widths, shapes of peaks, etc.}

•

_{Construct a better proposal.}

(8)

Trivariate Needle in a Haystack – Heuristic Diagnostics

p∗(x) = 1

σ₁3 exp{−

1

2((x − µ1)/σ1)

2

} + 1

σ₂3 exp{−

1

2((x − µ2)/σ2)

2

}

µ1 = (0,0,0), µ2 = (1,1,1), σ1 = 1, σ2 = 0.006

-1 -0.5 0 0.5 1

10 100 1000 10000 100000 1e+06

(9)

Trivariate Needle in a Haystack – Heuristic Diagnostics

p∗(x) = 1

σ₁3 exp{−

1

2((x − µ1)/σ1)

2

} + 1

σ₂3 exp{−

1

2((x − µ2)/σ2)

2

}

µ1 = (0,0,0), µ2 = (1,1,1), σ1 = 1, σ2 = 0.006

-1 -0.5 0 0.5 1

10 100 1000 10000 100000 1e+06 1e+07

(10)

Some possibilities through

(11)

Validated Numerics

What is it ?

•

_{set-valued mathematics}

•

_{intervals replace real numbers}

Why use it ?

•

_{provides rigorous error bounds}

•

_{naturally models uncertainty in data}

•

_{may produce faster numerical methods}

Where has it been used recently ?

•

_{J. Hass, M. Hutchings, and R. Schlafly, The Double Bubble Conjecture. Electr.}

Research Announcements of the Amer. Math. Soc., 1, 98-102, 1995.

•

_{W. Tucker, A Rigorous ODE Solver and Smale’s 14th Problem,}

Found. Comput. Math. 2:1, 53-117, 2002.

•

_{T. C. Hales, Some algorithms arising in the proof of the Kepler conjecture. Discrete}

and computational geometry, 25, 489-507, Algorithms Combin., 2003.

Early work:

(12)

Intervals

Notations, Definitions and Features

h

(

X, Y

) =

y

−

x

d

(

X

)

m

(

X

)

0 x

y

h

X

i

|

X

|

y

−

x

•

_real: _x _∈ R

•

_{(compact) real interval:} _X _{= [}_x_,_x_{] = [inf(}_X₎_,_sup(_X_)]

•

_{set of all real intervals:} IR _:= _{_[_{a, b}_{] :} _a _≤ _{b, a, b} _∈ R_}

•

_Example: _[1_{, π}_]_,₁₇_,√₂ _∈ IR_{, but not} _[2_,_1] _or _[1_,_∞_]_.

•

_{real interval vector or} _box_: _X _{= (}_X₁_,_{· · ·} _{, X}_n₎T _∈ _IRn_{, where} _X

i = [xi, xi] ∈ IR,

1 _≤ i _≤ n

(13)

Arithmetic Over

_IR

Definition. If ◦is one of the operators +,−, /,· and if X, Y ∈ IR

X _◦ Y := _{x _◦ y : x _∈ X, y _∈ Y _}

except thatX/Y is undefined if 0 ∈ Y .

Uncountable many cases to consider!

Continuity, Monotonicity, and Compactness _⇒

X + Y = [x + y, x + y], X _· Y = [min_{xy, xy, xy, xy_},max_{xy, xy, xy, xy_}],

X ₋ Y = [x ₋ y, x ₋ y], and X/Y = X _· [1/y,1/y], 0 _∈/ Y.

On a computer we use directed rounding:

X + Y = [_▽(x_⊕ y),_△(x _⊕ y)]

(14)

Interval Extensions

One of the the main goals is to enclose the range of a real-valued function f :

R(f;D) := _{f(x) : x _∈ D_}

This is achieved by constructing an interval extension F : IR _→ IR _of _f _: R _→ R

Monotone functions are easy !

exp(X) = [exp(x),exp(x)]

p

(X) = [p(x),p(x)], if 0 _≤ x

log(X) = [log(x),log(x)], if 0 < x

arctan(X) = [arctan(x),arctan(x)].

Piecewise monotone functions are also OK!

Xn =

8 > > > < > > > :

[xn, xn] : if n _∈ Z+ _{is odd}_,

[_hX_in,_|X_|n] : if n _∈ Z+ _{is even}_,

[1,1] : if n = 0,

(15)

Class of Standard Elementary Functions

E

We define the class of standard functions to be the set

S₌ _{_exp_x,_log_{x, x}a_,_|_x_|_,_sin_x,_cos_x,_tan_x,_{· · ·} _,_arccos_x,_arctan_x,_sinh_x,_cosh_x,_tanh_x_}

For any f _∈ S_{, we can construct a} _sharp _{interval extension} _F_{, i.e.,}

f _∈ S_⇒ _R₍_f_;_X_{) =} _F₍_X₎_.

Building new functions is easy...

We use finite combinations of constants, elements of S_, _{₊_,₋_,_·_{, /}_}_{, and their}

compositions to build the elementary functions E_{. Interval versions of} S_and _{₊_,₋_,_·_{, /}_}

provide the corresponding interval extensions.

But, we may now over-estimate the range ...

If f(x) = ₁₊x_x2, then F(X) = ₁₊X_X2. For the interval X = [1,2], we have

R(f; [1,2]) = [2 5,

1

2] ⊆ [ 1

(16)

Interval Enclosures

Theorem (1). If f(x) ∈ E, and F(X) is well-defined, then

R(f;X) _⊆ F(X)

How tight is the enclosure ?

Theorem (2). If f ∈ E, X = X₁ ∪X₂ · · · ∪ X_k, and F(X) is well-defined, then

R(f;X) _⊆

k

[

i=1

F(Xi) ⊆ F(X)

If f is Lipschitz on X there is aK ≥ 0 s.t.

d

k

[

i=1

F(Xi)

!

− d(R(f;X)) = K max

i d(Xi)

(17)

Computer-Aided Proofs

A consequence of Theorem 1 : y /_∈ F(X) _⇒ y /_∈ R(f;X)

Exercise 1: Let f(x) = cos (x)3 + sin (x). Prove that f(x) ₆= 0, _∀x _∈ [0, 5₄].

Solution: Define F(X) = cos (X)3 + sin (X). Then, by Theorem 1, we have

R(f; [0, 5

4]) ⊆ F([0, 5

4]) = [cos ( 5 4)

3

,1] + [0,sin (5

(18)

‘Impossible’cases too

Exercise 2: Draw the graph of the function

fa(x) = x2 −

3 10e

−(a(x−1₂))2_{, for} _a _{= 200}_{, over the interval} _[

−1,1].

(19)

Rejection Envelopes via Computer-Aided Proofs

Obtain an envelope of the Lipschitz function ₋P5_k₌₁ k x sin (k(x₃−3)).

Solution: Define F(X) = ₋P5_k₌₁ k X sin“k(X₃−3)”. Then, by Theorem 1 R(f;X) _⊆ F(X)

Theorem 2 we can bisect the domain into smaller pieces until maxi d(F(Xi)) ≤ TOL

-10 -5 0 5

-150 -75

0 75 150

Recursive evaluation of the sub-expressions si by fi on the DAG for x · sin((x − 3)/3)

s

6

=

x

sin

`

_x₋₃

3

´

s

5

= sin

`

_x₋₃

3

´

s

3

=

x

−

3 s

2

= 3

s

1

=

x

s

4

=

x−3₃

f

5

= sin

f

4

=

/

f

3

=

−

f

1

=

s

1

f

2

=

s

2

(20)

Auto-validating von Neumann RS

⇔

Moore RS (MRS)

Suppose,

Compact domain Θ _{= [}_{θ, θ}_]

Target shape p∗(θ) : Θ _→ R

Target integral Np :=

R

Θp∗(θ)dθ Target density p(θ) := p∗_N(θ)

p :

Θ _→ R

Interval extension of p∗ P∗(Θ) : IΘ _→ IR

Partition ofΘ T _:= _{_Θ(1)_,_Θ(2)_{, ...,}_Θ(|T|) _}

then, by Theorem 1

p∗(Θ(i)) _⊆ P∗(Θ(i)) := [P∗(Θ(i)), P∗(Θ(i))], _∀i _{∈ {}1,2, ...,_|T_|}_.

Construct the T_{-specific proposal} _qT₍_θ₎ _{as a normalized simple function over} Θ

qT(θ) = “N_qT

”−1 X|T|

i=1

P∗(Θ(i))1

{θ ∈ Θ(i)_} , N_qT :=

|T_|

X

i=1

“

d(Θ(i)) _· P∗(Θ(i))”.

Then an envelope function f(qT(θ)) guaranteeing the necessary inequality is

f_qT(θ) =

|T_|

X

i=1

P∗(Θ(i))1

(21)

Auto-validating von Neumann RS

⇔

Moore RS (MRS)

Suppose,

Compact domain Θ _{= [}_{θ, θ}_]

Target shape p∗(θ) : Θ _→ R

Target integral Np :=

R

Θp∗(θ)dθ Target density p(θ) := p∗_N(θ)

p :

Θ _→ R

Interval extension of p∗ P∗(Θ) : IΘ _→ IR

Partition ofΘ T _:= _{_Θ(1)_,_Θ(2)_{, ...,}_Θ(|T|) _}

Efficiency _⇔ Large average acceptance probability

Ap

T =

R

Θ p∗(θ)∂θ R

Θ fq(θ)∂θ

= _P Np

|T_|

i=1

“

d(Θ(i)₎ _· _P∗_(Θ(i)₎” ≥

P|T_|

i=1

`

d(Θ(i)) _· P∗(Θ(i))´

P|T_|

i=1

“

d(Θ(i)₎ _· _P∗_(Θ(i)₎”

Furthermore, if p∗ _∈ E_L_{, the Lipschitz class of elementary functions}

Ap

T ր 1 − O

„

max

i∈{1,...,T_}d(Θ (i)₎

«

.

Efficiency can be further improved by:

(22)

MRS – Bivariate Levy Densities

E(X1, X2) = 5

X

i=1

icos ((i− 1)X1 + i)

5

X

j=1

j cos ((j + 1)X2 +j) + (X1 + 1.42513)2 + (X2 + 0.80032)2

lT(X1, X2) = exp{−E(X1, X2)/T} There are 700 modes !

-10

-5

0

5

10

-10

-5

0 5

10

0 20 40 60

-10

-5

0

5

-10

-5

(23)

MRS – Bivariate Levy Densities

Adaptive partitioning of the domain [−10,10] × [−10,10]into 150 rectangles for Moore rejection sampling from the Levy target density l40 (acceptance probab. = 0.01).

-10 -5 0 5 10

-10 -5

(24)

MRS – Bivariate Levy Densities

Acceptance probability (AlT

V_α) versus partition size (|Vα|) for Levy targetslT, where T is the is the

temperature parameter. There is an optimal CPU time (2.0GHz) to generate 104 samples.

1e-05 0.0001 0.001 0.01 0.1 1

1 10 100 1000 10000 100000 1e+06

A c c e p . P ro b . ( A

lT V

α

)

Partition size (_|V_α_|₎

(25)

MRS – Trivariate Needle in a Haystack

p∗(x) = 1

σ₁3 exp{− 1

2((x − µ1)/σ1)

2

}+ 1

σ₂3 exp{− 1

2((x− µ2)/σ2)

2 } -1 -0.5 0 0.5 1

10 100 1000 10000 100000 1e+06

R u n n in g M e a n o f x1 chain 1 chain 2 chain 3 chain 4 0.0001 0.01 1 V a ri a n c e o f x1 B W B/W

(26)

MRS – Trivariate Needle in a Haystack

0 10 20 30 40 50 60 70 80 90

-0.2 0 0.2 0.4 0.6 0.8 1

R

e

p

lic

a

te

s

Mean x1

LMHS(0.31) MRS (0.15) LMHS(3.36) MRS (0.15)

1

•

MCMC with Metropolis proposal (uniform in cube of side 6σ1 centered at x)

•

_{With burn-in defined as ending at} _B/W _{= 0}_.₀₅

•

_{Run length is} ₁₀ _{times burn-in (typical run length 20000-50000 for} _σ₂ _{= 0}_.₀₁_).

(27)

MRS – Multivariate Equi-Exponential Mixtures

eD_c (X) =

c

X

i=1

exp“−|X − α(i)|”, α_j(i) = a(i) ∈ Θ _{= [}₋_100,_100]D_{, j} _{= 1, . . . , D, i} _{= 1, . . . , c}

1e-05 0.0001 0.001 0.01 0.1 1

1 10 100 1000 10000 100000 1e+06

A

c

e

p

ta

n

c

e

P

ro

b

a

b

ili

ty

(

A

e

D c

V

α

)

e2₁

e2₁₀

e10₁

(28)

MRS – Multivariate Rosenbrock’s Density

rD(X) = exp{−

D

X

i=2

(100(Xi − X_i2₋₁)2 + (1− xi−1)

2

)}

-2 0 2 4

(29)

MRS – Multivariate Rosenbrock’s Density

Acceptance probability (ArD

V_α) versus partition size (|Vα|) for Rosenbrock targets rD, where D is

the is the dimension (Θ _{= [}₋_10,_10]D_{). CPU time (2.0GHz) to generate} ₁₀4 _samples.

1e-05 0.0001 0.001 0.01 0.1 1

1 10 100 1000 10000 100000 1e+06

A c c e p . P ro b . ( A

rD V

α

)

(30)

MRS – Multivariate Witch’s Hats

Acceptance probability (AhDr

V_α) versus partition size (|Vα|) for Witch’s Hat targets h D

r , where D is

the support dimension and R = 10−r

is the hat’s radius.

0.001 0.01 0.1 1

1 10 100 1000 10000 100000 1e+06

A c c e p . P ro b . ( A h D r V α )

h2₀ h2₂ h2₅ h2₁₀

b

h2₀ h10₀

(31)

Phylogenetic Likelihood

1:

input: (i) a tree T_k, (ii) branch lengths t = (t₁, t₂, . . . , t_b

k), (iii) transition probability

Pa_i,a_j(t) for any ai, aj ∈ A, (iv) stationary distribution π(ai) over each characters

ai ∈ A, (v) site pattern (data) at site q

2:

output: ℓ_q(k, t), the likelihood at site q

3:

initialize: For a leaf node h with observed character a_i at site q, set l(_hai) = 1 and

l_h(aj) = 0 for all j ₆= i. For any internal node h, set lh := (1,1, . . . ,1).

4:

recurse: compute l_h for each sub-terminal node h, then those of their ancestors

recursively to finally compute lr for the root node r to obtain the likelihood for site q,

ℓq(k, t) = lr =

X

a_i∈A

(π(ai) · l(rai) ) .

For an internal node h with descendants s1, s2, . . . , sℏ,

l

(ai)

h

=

X

j1,...,jℏ∈A

{

l

(j1)

s1

· P

ai,j1

(

t

s1

)

· l

(j2)

s2

· P

ai,j2

(

t

s2

)

. . . l

(jℏ)

(32)

Auto-validating Posterior Estimates of Human-Neandertal Divergence Time

Envelope via Interval-extended post-order traversals

site : 1 1 1 1 1 1

pattern : 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5

. . . .

neandertal : t t c a g g t g t c a a c a a

human : t t c a g g t a c c a g t a g

chimpanzee : t c c a g a a a t t g a c t g

. . . .

site : 6 1 6 6 4 1 2 1 2 1 1 1 1 1 1

pattern : 0 4 0 8 5 0 4 5

counts : 5 3 5 0

THE Pos. Distn. of Human-Neandertal Divergence Time

0.1 0.2 0.3 0.4 0.5 200

400 600 800

4MY H-C Div : (272680,571124,1073375)

8MY H-C Div : (545360,1142248,2146749)

Human-Neandertal Divergence Estimate

(461000,821000) of Green et. al. (Nature,

2006) is too narrow

10,000 i.i.d. samples from the posterior over the Chimpanzee, Human and

(33)

MRS –

Interval-extended Recursions on Trans-dimensional Tree Spaces

(i)0_t

1 0_t1 2 0_t1 3 A A A A A A A 0_t1 r

(ii)1_t

1 1_t1 2 1_t1 A A A A 0 1_t0 3 A A A A A A A

1_t0₊1_t1 r

(iii)2_t

2 2_t1 3 2_t1 A A A A 0 2_t0 1 A A A A A A A

2_t0₊2_t1 r

(iv)3_t

3 3_t1 1 3_t1 A A A A 0 3_t0 2 A A A A A A A

3_t0₊3_t1 r

(v)4_t

1 4_t1 2 4_t 3 3 S S S S 4_t2 0

Model Selection in Primate Inter-relatedness across 3 × 109 Human-Chimp-Gorilla Genomes

(34)

Summary

•

_Advantages_:

•

_{Exploits the DAG encoding of the function in the machine}

•

_{Automatically constructs proposal density shape adapted to the target density}

•

_{Produce guaranteed independent samples from inclusion isotonic densities}

•

_{Can be used to account for physical limits on numerical and empirical}

resolutions in inferential procedures

•

_Limitations_:

•

_{Can be overwhelmed by complexity, domain size, many dimensions.}

•

_{Much work refining partition before first sample.}

•

_{The MRS is ultimately RAM limited}

•

_{Discrete spaces without an apparent metric structure ?}

•

_POA_:

•

_{Pre-enclosures – DAG dissection – Hash Access}

•

_{Algebraic Statistics for dissolving symmetries in DAG (’minimal sufficiency’)}

•

_{Tighter enclosures via AD – higher-order Taylor expansions}

•

_{Extending arithmetic and} E _{to regular sub-pavings}

•

_{The support need not necessarily be Euclidean (CAT(0) space of trees is OK !)}

•

_{The puniest(}_< ₁ _ulp_)-headed ₁₁_{-dimensional witch "takes off her hat" to MRS !}

•

_{Moore Rejection Sampler is an Auto-validating von Neumann Rejection Sampler}

(35)

Acknowledgments

People:

Rick Durrett (MATH@Cornell) – Listening and guiding

Michael Nussbaum (MATH@Cornell) – Math. stats. seminars & discussons Laurent Saloff-Coste (MATH@Cornell) – MCMC asymptotics

Rob Strawderman (STATISICS@Cornell) – Finite mixtures

Warwick Tucker (MATH@Uppsala, SW) – Introduction to Interval analysis

Karen Vogtmann (MATH@Cornell) – Geometry of tree space

Marty Wells (STATISTICS@Cornell) – Various statistical insights

Comp. Sci. Univ. of Karlsruhe, DL – > 25 human years of prog. C++ librs.

Funding:

Research Fellow of the Royal Commission for the Exhibition of 1851 (Oxford)

-2 0 2 4

0 5 10 15 20