
Outcome Weighted Learning as Maximum Likelihood

In document Luckett_unc_0153D_17640.pdf (Page 84-87)

CHAPTER 5: OUTCOME WEIGHTED LEARNING AND MAXIMUM

5.3 Outcome Weighted Learning as Maximum Likelihood

For clarity, we consider linear treatment regimes of the form $\pi(x) = \mathrm{sign}(x^\top\beta)$. Extensions to nonlinear decision rules are discussed in Remark 5.1. Let $g$ be defined as in Lemma 5.1. We suppose that $g(0) = 0$ and $\int_{w\in\mathbb{R}} |w| \exp\{-g(w)\}\,dw < \infty$. Our results concern the working model

$$Y = m(X) + A\,q(X)\,X^\top\beta + \epsilon, \qquad (5.1)$$

where $q(x) \ge 0$ for all $x \in \mathbb{R}^p$, $\|\beta\| = 1$, and $\epsilon$ has density $f(u) \propto \exp\{-g(u)\}$. The following result is proved in Appendix D.

Lemma 5.2. Assume working model (5.1) and that $m(X)$ is known. Then, the profile log-likelihood for $\beta^*$ is given by $\ell(\beta) = -\mathbb{E}_n\, g\{Y - m(X)\}\, 1[\mathrm{sign}\{Y - m(X)\}A \ne \mathrm{sign}(X^\top\beta)]$.
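The following is only a heuristic sketch of why profiling over $q$ produces an indicator; the formal proof is the one in Appendix D. It assumes $A \in \{-1, 1\}$ and that $g$ is minimized at zero and nondecreasing in $|u|$, which are assumptions made here for illustration because the conditions of Lemma 5.1 are not restated in this section. The symbols $R_i$ and $q_i$ are shorthand introduced for the sketch.

```latex
% Shorthand (introduced here): R_i = Y_i - m(X_i), q_i = q(X_i) >= 0.
\begin{align*}
\min_{q_i \ge 0} g\{R_i - A_i q_i X_i^\top\beta\}
  &=
  \begin{cases}
    g(0) = 0, & \text{if } \mathrm{sign}(R_i) A_i = \mathrm{sign}(X_i^\top\beta)
                \quad (\text{take } q_i = A_i R_i / X_i^\top\beta > 0),\\[4pt]
    g(R_i),   & \text{otherwise}
                \quad (\text{then } |R_i - A_i q_i X_i^\top\beta| = |R_i| + q_i |X_i^\top\beta|,
                \text{ so } q_i = 0).
  \end{cases}
\end{align*}
% Averaging the negated minima over i gives, up to terms free of beta,
\[
\ell(\beta) = -\mathbb{E}_n\, g\{Y - m(X)\}\,
  1[\mathrm{sign}\{Y - m(X)\} A \ne \mathrm{sign}(X^\top\beta)].
\]
```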

Corollary 5.1. Assume working model (5.1) and that $m(X)$ is known. Then, the maximum likelihood estimator of $\beta^*$ is $\hat\beta_n = \arg\min_{\beta\in S^p} \mathbb{E}_n\, g\{Y - m(X)\}\, 1[\mathrm{sign}\{Y - m(X)\}A \ne \mathrm{sign}(X^\top\beta)]$, where $S^p$ is the $p$-dimensional unit sphere.
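To make the objective in Corollary 5.1 concrete, here is a minimal Python sketch that evaluates the weighted misclassification criterion and minimizes it by brute force over the unit circle. It is an illustration under stated assumptions, not the implementation used in the source: the function names `owl_objective` and `grid_search_unit_circle`, the simulated data (with $m \equiv 0$, $q \equiv 1$, $A \in \{-1, 1\}$, Laplace errors), and the grid search itself are all choices made here, and the grid search only works for $p = 2$.

```python
import numpy as np

def owl_objective(beta, X, A, Y, m=None, g=np.abs):
    """Empirical criterion E_n g{Y - m(X)} 1[sign{Y - m(X)} A != sign(X^T beta)]."""
    resid = Y - (0.0 if m is None else m(X))
    mismatch = np.sign(resid) * A != np.sign(X @ beta)
    return np.mean(g(resid) * mismatch)

def grid_search_unit_circle(X, A, Y, n_grid=720, **kwargs):
    """Brute-force minimizer over the unit circle (illustrative, p = 2 only)."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False)
    candidates = np.column_stack([np.cos(angles), np.sin(angles)])
    values = np.array([owl_objective(b, X, A, Y, **kwargs) for b in candidates])
    return candidates[np.argmin(values)], values.min()

# Simulated data (assumption): model (5.1) with m = 0, q = 1, Laplace errors.
rng = np.random.default_rng(0)
n = 2000
beta_true = np.array([0.6, 0.8])          # unit-norm direction
X = rng.normal(size=(n, 2))
A = rng.choice([-1.0, 1.0], size=n)       # randomized treatment coded as -1/+1
Y = A * (X @ beta_true) + rng.laplace(scale=0.5, size=n)

beta_hat, value = grid_search_unit_circle(X, A, Y)
```

The default `g=np.abs` corresponds to the Laplace case; passing `g=np.square` gives a squared-error weighting, which has the same minimizer as any positive multiple of $u^2$.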

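The preceding result shows that the objective function considered by Zhao et al. (2012) and Zhang et al. (2012a) is a profile log-likelihood for (5.1) when $m(x) \equiv 0$ and $g(u) = |u|$, so that $\epsilon$ follows a Laplace (double exponential) distribution. This result also suggests a number of alternative estimators. For example, if the errors are normally distributed, so that $g(u) \propto u^2$, the maximum likelihood estimator given by Corollary 5.1 is

$$\hat\beta_n = \arg\min_{\beta\in S^p} \mathbb{E}_n \{Y - m(X)\}^2\, 1[\mathrm{sign}\{Y - m(X)\}A \ne \mathrm{sign}(X^\top\beta)].$$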

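Generally, the function $m$ is unknown and must be estimated from data. Given estimates $\hat\beta_n$ of $\beta^*$, $\hat q_n$ of $q$, and a known function $g$, an estimator of $m$ is

$$\hat m_n = \arg\min_{m\in\mathcal{M}} \mathbb{E}_n\, g\{Y - A\,\hat q_n(X)\,X^\top\hat\beta_n - m(X)\},$$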

where $\mathcal{M}$ is a class of working models. In some settings, it may be desirable to treat $g$ as unknown and estimate it from the data. For illustration, we describe our estimation algorithm using a kernel density estimator for the distribution of $\epsilon$, which implies an estimator for $g$. Alternatively, $g$ can be held fixed at a value informed by underlying theory or preliminary exploratory data analyses. We use the following iterative algorithm to construct $\hat\beta_n$.

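1. Initialize $\hat\beta_n^{(0)}$ to a starting vector in $\mathbb{R}^p$, $\hat q_{n,1}^{(0)}, \ldots, \hat q_{n,n}^{(0)}$ to starting values in $[0, \infty)$, and $\hat g_n^{(0)}(u)$ to $-\log f_0(u)$ for a starting density $f_0(u)$. Set $t = 1$ and fix $\sigma > 0$ and some kernel function $K$.

2. Repeat the following steps until convergence:

(a) compute $\hat m_n^{(t)} = \arg\min_{m\in\mathcal{M}} \sum_{i=1}^n \hat g_n^{(t-1)}\{Y_i - A_i\,\hat q_{n,i}^{(t-1)} X_i^\top\hat\beta_n^{(t-1)} - m(X_i)\}$;

(b) set $\hat q_{n,i}^{(t)} = \max[A_i\{Y_i - \hat m_n^{(t)}(X_i)\} / X_i^\top\hat\beta_n^{(t-1)},\, 0]$ for $i = 1, \ldots, n$;

(c) compute $\hat\beta_n^{(t)} = \arg\min_{\beta\in S^p} \sum_{i=1}^n \hat g_n^{(t-1)}\{Y_i - \hat m_n^{(t)}(X_i) - A_i\,\hat q_{n,i}^{(t)} X_i^\top\beta\}$;

(d) if $g$ is fixed and known, set $\hat g_n^{(t)} = \hat g_n^{(t-1)}$; otherwise define $\hat e_{n,i}^{(t)} = Y_i - \hat m_n^{(t)}(X_i) - A_i\,\hat q_{n,i}^{(t)} X_i^\top\hat\beta_n^{(t)}$, $\bar e_n^{(t)} = n^{-1}\sum_{i=1}^n \hat e_{n,i}^{(t)}$,

$$\hat f_n^{(t)}(e) = (n\sigma)^{-1} \sum_{i=1}^n K\{(e - \hat e_{n,i}^{(t)} + \bar e_n^{(t)})/\sigma\},$$

and set $\hat g_n^{(t)}(x) = -\log \hat f_n^{(t)}(x)$.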
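Step 2(d) can be sketched in a few lines of Python. This is a minimal illustration, assuming a Gaussian kernel for $K$ (the source does not say which kernel was used); the names `gaussian_kernel` and `estimate_g`, the bandwidth value, and the simulated residuals are hypothetical stand-ins.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def estimate_g(residuals, sigma, kernel=gaussian_kernel):
    """Return (g_hat, f_hat), where g_hat(x) = -log f_hat(x) and f_hat is a
    kernel density estimate built from the mean-centered residuals."""
    residuals = np.asarray(residuals, dtype=float)
    centered = residuals - residuals.mean()      # e_i - e_bar
    n = centered.size

    def f_hat(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        # (n sigma)^{-1} sum_i K{(x - e_i + e_bar) / sigma}
        z = (x[:, None] - centered[None, :]) / sigma
        return kernel(z).sum(axis=1) / (n * sigma)

    def g_hat(x):
        return -np.log(f_hat(x))

    return g_hat, f_hat

# Hypothetical residuals standing in for e_hat_{n,i}^{(t)}.
rng = np.random.default_rng(0)
residuals = rng.laplace(loc=0.3, scale=1.0, size=500)
g_hat, f_hat = estimate_g(residuals, sigma=0.3)

# Sanity check: the density estimate should integrate to roughly one.
grid = np.linspace(-20.0, 20.0, 8001)
integral = f_hat(grid).sum() * (grid[1] - grid[0])
```

Far from the observed residuals the estimated density can underflow to zero, in which case `g_hat` returns infinity; a practical implementation would need to guard against that.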

The above algorithm can be implemented for a standard class of models $\mathcal{M}$ (e.g., linear). In our implementation we terminated the algorithm when $\|\hat\beta_n^{(t)} - \hat\beta_n^{(t-1)}\|$ became sufficiently small.

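Step 2b follows from the relation

$$\hat q_{n,i}^{(t)} = \arg\min_{q_i \ge 0}\, g\{Y_i - \hat m_n^{(t)}(X_i) - A_i\,q_i\,X_i^\top\hat\beta_n^{(t-1)}\}$$

(see proof in Appendix D), where $\hat q_{n,i}^{(t)}$ can be defined arbitrarily when $X_i^\top\hat\beta_n^{(t-1)} = 0$.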
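The closed form in Step 2b can be checked numerically. The sketch below, which assumes $A_i \in \{-1, 1\}$ and uses $g(u) = |u|$ purely for illustration, compares the closed-form update against a brute-force minimization over a grid of $q$ values. The function name `update_q`, the convention of setting $q_i = 0$ when the linear predictor is zero, and the toy numbers are all choices made here.

```python
import numpy as np

def update_q(A, resid, xb):
    """Step 2(b): q_i = max[A_i * resid_i / xb_i, 0], where
    resid_i = Y_i - m_hat(X_i) and xb_i = X_i^T beta_hat."""
    A, resid, xb = (np.asarray(v, dtype=float) for v in (A, resid, xb))
    q = np.zeros_like(resid)
    nonzero = xb != 0.0        # q_i is arbitrary when xb_i == 0; 0 is used here
    q[nonzero] = np.maximum(A[nonzero] * resid[nonzero] / xb[nonzero], 0.0)
    return q

# Toy inputs (hypothetical), including one case with xb_i == 0.
A = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
resid = np.array([1.0, -2.0, 0.5, -0.3, 1.5])
xb = np.array([0.5, 1.0, -2.0, -0.6, 0.0])

q_closed = update_q(A, resid, xb)

# Brute force: minimize |resid_i - A_i * q * xb_i| over a grid of q >= 0.
q_grid = np.linspace(0.0, 5.0, 5001)
loss = np.abs(resid[:, None] - A[:, None] * q_grid[None, :] * xb[:, None])
q_brute = q_grid[np.argmin(loss, axis=1)]
```

When the treatment-adjusted residual and the linear predictor disagree in sign, any positive $q_i$ only enlarges the residual, which is why the update truncates at zero.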

For $g(u) = |u|$, the proposed algorithm provides an alternative to the convex relaxation derived by Zhao et al. (2012). However, as we show next, the convex relaxation proposed by Zhao et al. (2012) cannot correspond to a maximum likelihood estimator under any model of the form $Y = m(X) + A\,\Psi(X, \beta)\,\eta(X) + \epsilon$, where $\Psi$ is an arbitrary function and $\eta$ is a nuisance parameter. The following result is proved in Appendix D.

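Lemma 5.3. Let $\phi(u)$ be a continuous, onto function from $\mathbb{R}$ to $\mathbb{R}_+$. Let $\epsilon$ have density $f(u) \propto \exp\{-g(u)\}$, where $g$ satisfies the above conditions. There exists no model of the form $Y = m(X) + A\,\Psi(X, \beta)\,\eta(X) + \epsilon$, where $\eta: \mathbb{R}^p \to \mathbb{R}$ is unknown, for which

$$-\mathbb{E}_n\, g\{Y - m(X)\}\, \phi[\mathrm{sign}\{Y - m(X)\}\,A\,X^\top\beta]$$

is the profile likelihood for $\beta$.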

We now turn to proving consistency of the maximum profile likelihood estimator. The proof of the following consistency result, which is included in Appendix D, involves verifying the conditions of Theorem 2.12 of Kosorok (2008). We allow for $m$ to be estimated from the data and assume that the estimator for $m$ is an element of a Vapnik-Chervonenkis (VC) class (Hastie et al., 2009).

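Theorem 5.1. Let $g$ be a known function satisfying the conditions stated above and further assume that $g$ is continuous. Let $\hat m_n \in \mathcal{M}$ be an estimator of $m$ and assume that $\mathcal{M}$ is a VC-class. Let

$$\hat\beta_n = \arg\min_{\beta\in S^p} \mathbb{E}_n\, g\{Y - \hat m_n(X)\}\, 1[\mathrm{sign}\{Y - \hat m_n(X)\}A \ne \mathrm{sign}(X^\top\beta)]$$

be the maximum profile likelihood estimator for model (5.1). Then, $\hat\beta_n$ is consistent for $\beta^*$.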

We conclude this section with a few brief remarks.

Remark 5.1. We need not restrict ourselves to linear decision rules. Nonlinear decision rules can be estimated by substituting any parametric or semiparametric function of $X$ in place of $X^\top\beta$ in the algorithm above.

Remark 5.2. We can also consider a slightly more general case of working model (5.1). Let $L$ be a real-valued, odd function which is positive on $(0, \infty)$. Then it can be shown that the model $Y = m(X) + A\,q(X)\,L(X^\top\beta) + \epsilon$ has the same profile likelihood for $\beta$ as model (5.1). This follows from the relation $\mathrm{sign}(X^\top\beta) = \mathrm{sign}\{L(X^\top\beta)\}$.

Remark 5.3. Zhou et al. (2017) introduced residual weighted learning as an alternative to outcome weighted learning, proposing to approximate the minimizer of $\mathbb{E}_n\{Y - m(X)\}\,1\{\pi(X) \ne A\}$ over $\pi$ for a function $m$. Assuming $E(Y \mid X = x, A = a) = m(x) + a\,c(x)$ and $P(A = 1 \mid X = x) = 1/2$ with probability one, it is easy to see that $m(x) = E(Y \mid X = x)$, which can easily be estimated by regressing $Y$ on $X$. Zhou et al. (2017) proposed a smoothed ramp loss and a difference-of-convex algorithm to approximate the minimizer of $\mathbb{E}_n\{Y - \hat m_n(X)\}\,1\{\pi(X) \ne A\}$. Fixing $\hat m_n$ to be the regression fit of $Y$ on $X$ (ignoring treatment assignment) and proceeding with the proposed algorithm provides an alternative to the difference-of-convex algorithm proposed by Zhou et al. (2017).
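The identity $m(x) = E(Y \mid X = x)$ used in Remark 5.3 can be written out in one line. The step below assumes that treatment is coded as $A \in \{-1, 1\}$, which is an assumption made here because the coding is not restated in this section.

```latex
% Law of total expectation over A in {-1, 1}, with P(A = 1 | X = x) = 1/2:
\begin{align*}
E(Y \mid X = x)
  &= \tfrac{1}{2}\, E(Y \mid X = x, A = 1) + \tfrac{1}{2}\, E(Y \mid X = x, A = -1) \\
  &= \tfrac{1}{2}\,\{m(x) + c(x)\} + \tfrac{1}{2}\,\{m(x) - c(x)\} \\
  &= m(x).
\end{align*}
```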
