Outcome Weighted Learning as Maximum Likelihood

CHAPTER 5: OUTCOME WEIGHTED LEARNING AND MAXIMUM

5.3 Outcome Weighted Learning as Maximum Likelihood

For clarity, we consider linear treatment regimes of the form $\pi(x) = \mathrm{sign}(x^\top \beta)$. Extensions to nonlinear decision rules are discussed in Remark 5.1. Let $g$ be defined as in Lemma 5.1. We suppose that $g(0) = 0$ and $\int_{w \in \mathbb{R}} |w| \exp\{-g(w)\}\,dw < \infty$. Our results concern the working model

$$Y = m(X) + A\,q(X)\,X^\top \beta^* + \epsilon, \tag{5.1}$$

where $q(x) \ge 0$ for all $x \in \mathbb{R}^p$, $\|\beta^*\| = 1$, and $\epsilon$ has density $f_\epsilon(u) \propto \exp\{-g(u)\}$. The following result is proved in Appendix D.

Lemma 5.2. Assume working model (5.1) and that $m(X)$ is known. Then, the profile log-likelihood for $\beta^*$ is given by

$$\ell(\beta) = -\mathbb{E}_n\, g\{Y - m(X)\}\, 1[\mathrm{sign}\{Y - m(X)\}A \neq \mathrm{sign}(X^\top \beta)].$$

Corollary 5.1. Assume working model (5.1) and that $m(X)$ is known. Then, the maximum likelihood estimator of $\beta^*$ is

$$\hat\beta_n = \arg\min_{\beta \in S^p} \mathbb{E}_n\, g\{Y - m(X)\}\, 1[\mathrm{sign}\{Y - m(X)\}A \neq \mathrm{sign}(X^\top \beta)],$$

where $S^p$ is the $p$-dimensional unit sphere.
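The objective in Corollary 5.1 is a weighted misclassification loss and can be evaluated directly. The following Python sketch shows one way to compute it; the function name `profile_loss`, the coding of treatment as $A \in \{-1, 1\}$, and the default choices $m \equiv 0$ and $g(u) = |u|$ are assumptions made for illustration.

```python
import numpy as np

def profile_loss(beta, X, A, Y, m=None, g=np.abs):
    """Empirical mean of g{Y - m(X)} * 1[sign{Y - m(X)} A != sign(X'beta)].

    Assumes A is coded as -1/+1. `m` is a callable returning m(X) for all
    rows of X (m = 0 if omitted); `g` is applied elementwise.
    """
    X = np.asarray(X, dtype=float)
    A = np.asarray(A, dtype=float)
    Y = np.asarray(Y, dtype=float)
    r = Y - (0.0 if m is None else m(X))             # residual Y - m(X)
    mismatch = np.sign(r) * A != np.sign(X @ beta)   # indicator in the loss
    return float(np.mean(g(r) * mismatch))
```

Passing `g=np.square` gives the corresponding loss for normally distributed errors. Observations with a zero residual or with $X^\top\beta = 0$ are handled by whatever `np.sign` returns at zero, which is a convention of this sketch rather than of the text.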

The preceding result shows that the objective function considered by Zhao et al. (2012) and Zhang et al. (2012a) is a profile log-likelihood for (5.1) when $m(x) \equiv 0$ and $g(u) = |u|$, so that $\epsilon$ follows a Laplace (double exponential) distribution. This result also suggests a number of alternative estimators. For example, if the errors are normally distributed, so that $g(u) \propto u^2$, the maximum likelihood estimator is

$$\hat\beta_n = \arg\min_{\beta \in S^p} \mathbb{E}_n\, \{Y - m(X)\}^2\, 1[\mathrm{sign}\{Y - m(X)\}A \neq \mathrm{sign}(X^\top \beta)].$$

Generally, the function $m$ is unknown and must be estimated from data. Given estimates $\hat\beta_n$ of $\beta^*$, $\hat q_n$ of $q$, and a known function $g$, an estimator of $m$ is

$$\hat m_n = \arg\min_{m \in \mathcal{M}} \mathbb{E}_n\, g\{Y - A\,\hat q_n(X)\,X^\top \hat\beta_n - m(X)\},$$

where $\mathcal{M}$ is a class of working models. In some settings, it may be desirable to treat $g$ as unknown and estimate it from the data. For illustration, we describe our estimation algorithm using a kernel density estimator for the distribution of $\epsilon$, which implies an estimator for $g$. Alternatively, $g$ can be held fixed at a value informed by underlying theory or preliminary exploratory data analyses. We use the following iterative algorithm to construct $\hat\beta_n$.

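1. Initialize $\hat\beta_n^{(0)}$ to a starting vector in $\mathbb{R}^p$, $\hat q_{n,1}^{(0)}, \ldots, \hat q_{n,n}^{(0)}$ to starting values in $[0, \infty)$, and $\hat g_n^{(0)}(u)$ to $-\log f_0(u)$ for a starting density $f_0(u)$. Set $t = 1$ and fix $\sigma > 0$ and some kernel function $K$.

2. Repeat the following steps until convergence:

   (a) compute $\hat m_n^{(t)} = \arg\min_{m \in \mathcal{M}} \sum_{i=1}^n \hat g_n^{(t-1)}\{Y_i - A_i\,\hat q_{n,i}^{(t-1)} X_i^\top \hat\beta_n^{(t-1)} - m(X_i)\}$;

   (b) set $\hat q_{n,i}^{(t)} = \max[A_i\{Y_i - \hat m_n^{(t)}(X_i)\} / X_i^\top \hat\beta_n^{(t-1)},\, 0]$, $i = 1, \ldots, n$;

   (c) compute $\hat\beta_n^{(t)} = \arg\min_{\beta} \sum_{i=1}^n \hat g_n^{(t-1)}\{Y_i - \hat m_n^{(t)}(X_i) - A_i\,\hat q_{n,i}^{(t)} X_i^\top \beta\}$;

   (d) if $g$ is fixed and known, set $\hat g_n^{(t)} = \hat g_n^{(t-1)}$; otherwise define $\hat e_{n,i}^{(t)} = Y_i - \hat m_n^{(t)}(X_i) - A_i\,\hat q_{n,i}^{(t)} X_i^\top \hat\beta_n^{(t)}$, $\bar e_n^{(t)} = n^{-1} \sum_{i=1}^n \hat e_{n,i}^{(t)}$,

   $$\hat f_n^{(t)}(e) = (n\sigma)^{-1} \sum_{i=1}^n K\{(e - \hat e_{n,i}^{(t)} + \bar e_n^{(t)})/\sigma\},$$

   and set $\hat g_n^{(t)}(x) = -\log \hat f_n^{(t)}(x)$.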
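The density update in the last step of the algorithm (centering the residuals, forming a kernel density estimate, and taking its negative logarithm) can be sketched in Python as follows. This is a minimal illustration only: the function name `kernel_g` is hypothetical, and the Gaussian kernel is an assumption, since the text leaves the kernel unspecified.

```python
import numpy as np

def kernel_g(residuals, sigma):
    """Return (f_hat, g_hat) built from residuals and bandwidth sigma > 0.

    f_hat(e) = (n * sigma)^{-1} * sum_i K{(e - e_i + e_bar) / sigma},
    g_hat(e) = -log f_hat(e), with K a Gaussian kernel (an assumption).
    """
    e = np.asarray(residuals, dtype=float)
    n = e.size
    centered = e - e.mean()                      # e_i - e_bar

    def f_hat(u):
        u = np.atleast_1d(np.asarray(u, dtype=float))
        z = (u[:, None] - centered[None, :]) / sigma
        kern = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
        return kern.sum(axis=1) / (n * sigma)

    def g_hat(u):
        return -np.log(f_hat(u))

    return f_hat, g_hat
```

Far in the tails the estimated density can underflow to zero, in which case `g_hat` returns infinity; a practical implementation would likely need to guard against this.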

The above algorithm can be implemented for a standard class of models, $\mathcal{M}$ (e.g., linear). In our implementation we terminated the algorithm when

$\|\hat\beta_n^{(t)} - \hat\beta_n^{(t-1)}\|$ fell below a prespecified tolerance.
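As a rough illustration of the iteration, the following Python sketch specializes steps 2(a) to 2(c) to a fixed $g(u) = u^2$ and a linear working class with intercept, so that the updates for $m$ and $\beta$ are ordinary least-squares problems and the density update is skipped. The function name `fit_beta_alternating`, the arguments `tol` and `max_iter`, the starting values $q_i = 1$, the coding $A \in \{-1, 1\}$, and the final rescaling of $\beta$ to unit norm are all assumptions of this sketch, not details taken from the text.

```python
import numpy as np

def fit_beta_alternating(X, A, Y, beta0, max_iter=50, tol=1e-8):
    """Alternating updates of m, q and beta for fixed g(u) = u^2."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])        # linear class for m (assumed)
    beta = np.asarray(beta0, dtype=float)
    q = np.ones(n)                               # starting values in [0, inf)
    losses = []
    for _ in range(max_iter):
        xb = X @ beta
        # (a) update m: least squares of Y - A q X'beta on the design
        coef, *_ = np.linalg.lstsq(Xd, Y - A * q * xb, rcond=None)
        m = Xd @ coef
        # (b) update q: max[A (Y - m) / X'beta, 0], set to 0 when X'beta = 0
        safe = np.where(xb == 0.0, 1.0, xb)
        q = np.where(xb == 0.0, 0.0, np.maximum(A * (Y - m) / safe, 0.0))
        # (c) update beta: least squares of Y - m on A q X
        Z = (A * q)[:, None] * X
        beta_new, *_ = np.linalg.lstsq(Z, Y - m, rcond=None)
        losses.append(float(np.mean((Y - m - Z @ beta_new) ** 2)))
        converged = np.linalg.norm(beta_new - beta) < tol
        beta = beta_new
        if converged:
            break
    # Rescale to the unit sphere (assumes the final beta is nonzero).
    return beta / np.linalg.norm(beta), losses
```

Because each step minimizes the same squared-error criterion over one block of parameters, the recorded losses should be non-increasing across iterations, which offers a simple check on an implementation.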

Step 2b follows from the relation

$$\hat q_{n,i}^{(t)} = \arg\min_{q_i \ge 0}\, g\{Y_i - \hat m_n^{(t)}(X_i) - A_i\,q_i\,X_i^\top \hat\beta_n^{(t-1)}\}$$

(see proof in Appendix D), which can be defined arbitrarily when $X_i^\top \hat\beta_n^{(t-1)} = 0$.
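To make the relation behind Step 2b concrete, the following worked derivation sketches why the minimizer takes the form of a truncated ratio. It assumes that treatment is coded as $A_i \in \{-1, 1\}$ and that $g$ is minimized at zero and nondecreasing in $|u|$; the shorthand $r_i$ and $u_i$ is introduced here for illustration and does not appear in the text.

```latex
% Shorthand (illustrative): r_i = Y_i - \hat m(X_i), u_i = X_i^\top \hat\beta, with u_i \neq 0.
\begin{align*}
  \text{Objective:} \quad & g(r_i - A_i q_i u_i), \qquad q_i \ge 0. \\
  \text{Unconstrained zero:} \quad & r_i - A_i q_i u_i = 0
      \iff q_i = \frac{r_i}{A_i u_i} = \frac{A_i r_i}{u_i}
      \quad (\text{since } A_i^2 = 1). \\
  \text{If } A_i r_i / u_i \ge 0: \quad & q_i = A_i r_i / u_i \text{ is feasible and attains } g(0). \\
  \text{If } A_i r_i / u_i < 0: \quad & |r_i - A_i q_i u_i| = |r_i| + q_i |u_i|
      \text{ is increasing in } q_i \ge 0, \text{ so } q_i = 0. \\
  \text{Hence:} \quad & \hat q_i = \max\!\left( \frac{A_i r_i}{u_i},\, 0 \right).
\end{align*}
```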

For $g(u) = |u|$, the proposed algorithm provides an alternative to the convex relaxation derived by Zhao et al. (2012). However, as we show next, the convex relaxation proposed by Zhao et al. (2012) cannot correspond to a maximum likelihood estimator under any model of the form $Y = m(X) + A\,\Psi(X, \beta)\,\eta(X) + \epsilon$, where $\Psi$ is an arbitrary function and $\eta$ is a nuisance parameter. The following result is proved in Appendix D.

Lemma 5.3. Let $\varphi(u)$ be a continuous, onto function from $\mathbb{R}$ to $\mathbb{R}_+$. Let $\epsilon$ have density $f_\epsilon(u) \propto \exp\{-g(u)\}$, where $g$ satisfies the above conditions. There exists no model of the form $Y = m(X) + A\,\Psi(X, \beta)\,\eta(X) + \epsilon$, where $\eta : \mathbb{R}^p \to H \subset \mathbb{R}$ is unknown, for which

$$-\mathbb{E}_n\, g\{Y - m(X)\}\, \varphi[\mathrm{sign}\{Y - m(X)\}A\,X^\top \beta]$$

is the profile likelihood for $\beta$.

We now turn to proving consistency of the maximum profile likelihood estimator. The proof of the following consistency result, which is included in Appendix D, involves verifying the conditions of Theorem 2.12 of Kosorok (2008). We allow for $m$ to be estimated from the data and assume that the estimator for $m$ is an element of a Vapnik-Chervonenkis (VC) class (Hastie et al., 2009).

Theorem 5.1. Let $g$ be a known function satisfying the conditions stated above and further assume that $g$ is continuous. Let $\hat m_n \in \mathcal{M}$ be an estimator of $m$ and assume that $\mathcal{M}$ is a VC-class. Let

$$\hat\beta_n = \arg\min_{\beta \in S^p} \mathbb{E}_n\, g\{Y - \hat m_n(X)\}\, 1[\mathrm{sign}\{Y - \hat m_n(X)\}A \neq \mathrm{sign}(X^\top \beta)]$$

be the maximum profile likelihood estimator for model (5.1). Then,

$$\|\hat\beta_n - \beta^*\| \to 0.$$

We conclude this section with a few brief remarks.

Remark 5.1. We need not restrict ourselves to linear decision rules. Nonlinear decision rules can be estimated by substituting any parametric or semiparametric function of $X$ in place of $X^\top \beta$ in the algorithm above.

Remark 5.2. We can also consider a slightly more general case of working model (5.1). Let $L$ be a real-valued, odd function which is positive on $(0, \infty)$. Then it can be shown that the model $Y = m(X) + A\,q(X)\,L(X^\top \beta^*) + \epsilon$ has the same profile likelihood for $\beta^*$ as model (5.1). This follows from the relation $\mathrm{sign}(X^\top \beta) = \mathrm{sign}\{L(X^\top \beta)\}$.
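The sign relation in Remark 5.2 is easy to check numerically for a candidate link function. The short Python sketch below does so on a grid of values; the helper name `same_sign_under` is hypothetical, and a numerical check over a finite grid is an illustration rather than a proof.

```python
import numpy as np

def same_sign_under(L, values):
    """True if sign(L(v)) == sign(v) for every v in `values`."""
    values = np.asarray(values, dtype=float)
    return bool(np.all(np.sign(L(values)) == np.sign(values)))
```

For example, `same_sign_under(np.tanh, np.linspace(-3, 3, 13))` checks the odd function $L(v) = \tanh(v)$, while a function that is not odd, such as the cosine, fails the check on the same grid.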

Remark 5.3. Zhou et al. (2017) introduced residual weighted learning as an alternative to outcome weighted learning, proposing to approximate the minimizer of $\mathbb{E}_n \{Y - m(X)\}\, 1\{\pi(X) \neq A\}$ over $\pi$ for a function $m$. Assuming $E(Y \mid X = x, A = a) = m(x) + a\,c(x)$ and $P(A = 1 \mid X = x) = 1/2$ with probability one, it is easy to see that $m(x) = E(Y \mid X = x)$, which can easily be estimated by regressing $Y$ on $X$. Zhou et al. (2017) proposed a smoothed ramp loss and a difference-of-convex algorithm to approximate the minimizer of $\mathbb{E}_n \{Y - \hat m_n(X)\}\, 1\{\pi(X) \neq A\}$. Fixing $\hat m_n$ to be the regression fit of $Y$ on $X$ (ignoring treatment assignment) and proceeding with the proposed algorithm provides an alternative to the difference-of-convex algorithm proposed by Zhou et al. (2017).
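The residual-weighted criterion in Remark 5.3 can be sketched in Python as follows. The function name `residual_weighted_loss` is hypothetical, and using an ordinary least-squares fit with intercept for the regression of $Y$ on $X$ is an assumption of this sketch; it evaluates the criterion for a given rule and does not implement the smoothed ramp loss or the difference-of-convex algorithm of Zhou et al. (2017).

```python
import numpy as np

def residual_weighted_loss(pi, X, A, Y):
    """Empirical mean of {Y - m_hat(X)} * 1{pi(X) != A}.

    m_hat is a least-squares fit of Y on X that ignores treatment (assumed
    linear here); `pi` maps the rows of X to treatments coded like A.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])            # intercept plus covariates
    coef, *_ = np.linalg.lstsq(Xd, Y, rcond=None)
    resid = Y - Xd @ coef                            # Y - m_hat(X)
    return float(np.mean(resid * (pi(X) != np.asarray(A))))
```

A rule can be passed as a function of the covariate matrix, for instance `lambda Z: np.sign(Z @ beta)` for a linear rule with a given coefficient vector `beta`.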

In document Luckett_unc_0153D_17640.pdf (Page 84-87)