OWL views the problem as one of weighted classification, where the class labels are $A_i$ and the weights are $Y_i$ (Zhao et al., 2012). By doing so, support vector machine methods (Cortes and Vapnik, 1995) may be applied by using the hinge loss in place of the 0-1 loss in (4.2). Unfortunately, estimates of $D_0$ from OWL tend to assign patients to their observed $A_i$ when $Y_i$ is negative (Chen et al., 2018; Wang et al., 2016; Zhou et al., 2017), spurring the development of improvements to OWL that better handle negative rewards, such as GOWL (Chen et al., 2018). GOWL instead uses $|Y_i|$ as the weights and $A_i\,\mathrm{sign}\{Y_i\}$ as the class labels, with a piecewise hinge loss function.
4.2.2 Case-Control Generalized Outcome Weighted Learning
In a case-control design, $n_1$ patients are randomly drawn from the control (healthy) population and $n_0$ patients are randomly drawn from the case (diseased) population, for a total of $n = n_1 + n_0$ in the sample. Because of the stratified sampling in case-control studies, the study sample is not representative of the overall population, particularly when the disease is rare (Glicksberg et al., 2018). In some scenarios, $n_0$ may include all cases from a dataset. Due to the nature of a case-control design, the outcome is binary, i.e., $Y \in \{0,1\}$. Contrary to popular convention, we denote cases as $Y = 0$ and controls as $Y = 1$ so that higher values of $Y$ are still desirable. Otherwise, we retain the same notation as before. Oftentimes, a matched selection process is used whereby patients are matched on some subset of $X$. For the proposed method, matching is not necessary.
Next, let us define $q(X)$ and $p(X)$ as the density of $X$ in the study sample and the overall population, respectively. Because of selection bias, maximization of (4.1) from the empirical data may not provide a consistent estimator for $D_0$. Thus, we propose inclusion of a “selection factor,” namely $\theta(X) = p(X)/q(X)$, such that we instead minimize
$$E\left[\frac{|\theta(X)Y|}{P(A \mid X)}\,1\{A \neq D(X)\}\right]. \tag{4.3}$$
Under Lemma 4.1, optimization of (4.3) is equivalent to optimization of (4.2); the proof is left to Appendix B.
Lemma 4.1. Under the given assumptions,
$$D_0 = \operatorname*{argmin}_{D \in \mathcal{D}} E_q\left[\frac{\theta(X)Y}{P(A_1 \mid X)}\,1\{A_1 \neq D(X)\}\right].$$
To estimate $D_0$, we then minimize the objective function given in (4.4) for $f \in \mathcal{F}$, a reproducing kernel Hilbert space:
$$\operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \frac{|\theta(X_i)Y_i|}{P(A_{i1} \mid X_i)}\,\psi\{Y_i, A_i f(X_i)\} + \lambda_n \|f\|^2. \tag{4.4}$$
Here, $\psi(u, v) = \max\{1 - \mathrm{sign}(u)v, 0\}$, $\lambda_n$ is a tuning parameter, and $\|\cdot\|$ is the $L_2$ norm. For details on solving the minimization problem in (4.4), we defer to Chen et al. (2018) and Kimeldorf and Wahba (1970).
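As a rough illustration of the objective in (4.4), the sketch below evaluates it for a linear decision function. The function names, the restriction to a linear $f(x) = b + w^\top x$, and the use of the squared Euclidean norm of $w$ as the penalty are assumptions made for this example only; the selection factors and propensities are taken as given inputs.

```python
def sign(u):
    """Return -1, 0, or 1 according to the sign of u."""
    return (u > 0) - (u < 0)


def psi(u, v):
    """Piecewise hinge loss psi(u, v) = max{1 - sign(u) * v, 0}."""
    return max(1.0 - sign(u) * v, 0.0)


def gowl_objective(w, b, X, A, Y, theta, prop, lam):
    """Empirical objective (4.4) for a linear rule f(x) = b + w . x.

    Assumptions (not from the text): f is linear and ||f||^2 is taken
    to be the squared Euclidean norm of w. theta and prop hold the
    selection factors and the propensities P(A_i | X_i).
    """
    n = len(Y)
    total = 0.0
    for x_i, a_i, y_i, th_i, p_i in zip(X, A, Y, theta, prop):
        f_i = b + sum(w_j * x_ij for w_j, x_ij in zip(w, x_i))
        total += abs(th_i * y_i) / p_i * psi(y_i, a_i * f_i)
    return total / n + lam * sum(w_j * w_j for w_j in w)
```

Because $Y \in \{0,1\}$ here, cases ($Y = 0$) receive weight zero in this sum, so only controls contribute to the loss term.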
Especially with large $p$, estimation of $\theta(X)$ can be very difficult using typical kernel density estimation (Parzen, 1962; Rosenblatt, 1956; Silverman, 2018). However, by Lemma 4.2, we may more easily estimate $\theta(X)$ from estimates of $P(A = a \mid X)$ and $P(Y = y \mid X, A = a)$, for $a \in \{-1,1\}$ and $y \in \{0,1\}$, obtained through regression (e.g., random forests). We further assume knowledge of $P(Y)$, either from previous literature or a representative database. Although selection bias is present in the sample, we may still obtain valid estimates for $P(Y \mid X, A)$ (Van Der Laan, 2008), and a weighted model for $P(A \mid X)$ may be fit (Walsh et al., 2012). In particular, in our implementation with random forests, each individual may be weighted for sampling into each bootstrapped tree.
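Lemma 4.2. Let $r_k = n_k/n$ for $k = 0,1$. Under the given assumptions,
$$\theta(X) = \left[\sum_{a \in \{-1,1\}} P(A = a \mid X)\left\{\frac{r_0\,P(Y = 0 \mid X, A = a)}{P(Y = 0)} + \frac{r_1\,P(Y = 1 \mid X, A = a)}{P(Y = 1)}\right\}\right]^{-1}.$$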
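To make the plug-in formula in Lemma 4.2 concrete, the following sketch computes $\theta(X)$ at a single covariate value from already-estimated probabilities. The function name and argument names are hypothetical, and the inputs stand in for whatever regression estimates (e.g., from random forests) are available.

```python
def selection_factor(p_a1, p_y1_a1, p_y1_am1, r1, prev_y1):
    """Selection factor theta(X) from Lemma 4.2 at one covariate value.

    p_a1     : estimate of P(A = 1 | X)
    p_y1_a1  : estimate of P(Y = 1 | X, A = 1)
    p_y1_am1 : estimate of P(Y = 1 | X, A = -1)
    r1       : n1 / n, the sample fraction of controls (Y = 1)
    prev_y1  : population P(Y = 1), assumed known
    """
    r0 = 1.0 - r1
    prev_y0 = 1.0 - prev_y1
    total = 0.0
    # Sum over a in {1, -1}, using P(A = -1 | X) = 1 - P(A = 1 | X).
    for p_a, p_y1 in ((p_a1, p_y1_a1), (1.0 - p_a1, p_y1_am1)):
        total += p_a * (r0 * (1.0 - p_y1) / prev_y0 + r1 * p_y1 / prev_y1)
    return 1.0 / total
```

As a sanity check, when the sample fractions match the population prevalence ($r_1 = P(Y = 1)$), the inner term equals 1 for every $a$ and the function returns 1, i.e., no correction is applied.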
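Thus, we define our estimated decision function as
$$\hat{f}_n^* = \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \frac{|\hat{\theta}(X_i)Y_i|}{P(A_i \mid X_i)}\,\psi\{Y_i, A_i f(X_i)\} + \lambda_n \|f\|^2, \tag{4.5}$$
and our proposed estimator of the optimal ITR is $\hat{D}^*(X) = \mathrm{sign}\{\hat{f}_n^*(X)\}$, where
$$D^* = \operatorname*{argmin}_{D \in \mathcal{D}} E\left[\frac{|\theta(X)Y|}{P(A \mid X)}\,\psi\{Y, A f(X)\}\right].$$
4.3 Theoretical Results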
Let $R(f) = E[Y\,1\{A \neq \mathrm{sign}[f(X)]\}/P(A \mid X)]$ be the risk under 0-1 loss and $R_\psi(f) = E[\hat{\theta}(X)|Y|\,\psi\{Y, A f(X)\}/P(A \mid X)]$ be that under $\psi$-loss with the estimated adjustment for selection bias. Under Theorem 4.1, we have Fisher consistency for $D^*(X) = \mathrm{sign}\{f^*(X)\}$, where $f^*(X) = \operatorname*{argmin}_{f \in \mathcal{F}} R_\psi(f)$.
Theorem 4.1. Under the given assumptions, $D^*(X) = D_0(X)$.
Next, let $\bar{\mathcal{F}}$ be the closure of $\mathcal{F}$, and let $f_0$ and $f_0^*$ be the minimizers of $R(f)$ and $R_\psi(f)$, respectively, over all functions $f$. Theorem 4.2 then provides us with global consistency.
Theorem 4.2. Let $\lambda_n > 0$ be a sequence such that $\lambda_n \to 0$ and $\lambda_n n \to \infty$ with probability going to 1 as $n \to \infty$. For any distribution $P$ of $(X, A, Y)$, $\lim_{n \to \infty} R_\psi(\hat{f}_n^*) \to_P R_\psi(f^*)$.
4.4 Numerical Experiments
To demonstrate the performance of the proposed method, we present results from a set of simulations. The covariates were generated as 50 i.i.d. variables from a $U(-1,1)$ distribution. $A$ was sampled from $\{-1,1\}$ with probability $\mathrm{expit}\{3(1 + X_1 + X_2)\}$ of being 1. $Y$ was sampled from a binomial distribution with probability $\mathrm{expit}\{5 - X_1 - 2X_2 + X_3 + 5A[1 + 3(X_1 + X_2)]\}$. A population dataset of size $n_{pop} = 100{,}000$ was generated. Under the data generation described, the population dataset contained a 9.1% prevalence of cases. The allocation of the true optimal ITR, i.e., $\mathrm{sign}\{1 + 3(X_1 + X_2)\}$, was 34.7% to $A = -1$ and 65.3% to $A = 1$. Simulations were repeated 1,000 times each, varying the sample size of the study dataset across $n = 50, 100, 200, 500, 1000$, where cases and controls were sampled at a 1:1 ratio.
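The data-generating mechanism described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used for the reported simulations. It assumes that the stated expit probability for $Y$ is $P(Y = 1)$, i.e., the probability of being a control; the function names and seeds are hypothetical.

```python
import math
import random


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def generate_population(n_pop, p=50, seed=0):
    """Generate (x, a, y) triples following the simulation description.

    Assumption: the expit expression for Y is P(Y = 1), with Y = 1
    denoting controls and Y = 0 denoting cases.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(n_pop):
        x = [rng.uniform(-1.0, 1.0) for _ in range(p)]
        a = 1 if rng.random() < expit(3.0 * (1.0 + x[0] + x[1])) else -1
        lin = (5.0 - x[0] - 2.0 * x[1] + x[2]
               + 5.0 * a * (1.0 + 3.0 * (x[0] + x[1])))
        y = 1 if rng.random() < expit(lin) else 0
        population.append((x, a, y))
    return population


def sample_case_control(population, n_cases, n_controls, seed=0):
    """Draw cases (Y = 0) and controls (Y = 1) separately, without replacement."""
    rng = random.Random(seed)
    cases = [row for row in population if row[2] == 0]
    controls = [row for row in population if row[2] == 1]
    return rng.sample(cases, n_cases) + rng.sample(controls, n_controls)
```

For example, `sample_case_control(generate_population(100000), 25, 25)` would give a study sample of size 50 at a 1:1 ratio; a smaller population is enough for quick checks.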
A random forest was fit for $P(Y \mid X, A)$ and a weighted random forest was fit for $P(A \mid X)$ (Breiman, 2001; Ishwaran et al., 2008) using the randomForestSRC R package (Ishwaran and Kogalur, 2007); cases were weighted by $P(Y = 0)$ and controls were weighted by $P(Y = 1)$. $P(Y)$ was assumed to be known from the population dataset. The proposed method was compared to the “naive” approach, in which a case-control sample is analyzed without any correction for selection bias. A “cohort” method was also performed for comparison, in which data of size $2n$ were randomly sampled from the population dataset. For all methods, GOWL was performed using the DynTxRegime package in R, version 3.4.1 (Holloway et al., 2018; R Core Team, 2017). The kernel was correctly specified to be linear, and $\lambda_n$ was selected from the grid $[0.1, 0.5, 1, 5, 10, 50, 100, 500]/n$ via 5-fold cross-validation.
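The case weighting described above can be illustrated with a weighted bootstrap draw. The sketch below is only a conceptual stand-in for how weights might enter the per-tree resampling; it does not reproduce the internals of randomForestSRC, and the function name is hypothetical.

```python
import random


def weighted_bootstrap(Y, prev_case, n_draws, seed=0):
    """Draw bootstrap indices with cases (Y = 0) weighted by P(Y = 0)
    and controls (Y = 1) weighted by P(Y = 1) = 1 - P(Y = 0).

    prev_case is the population prevalence of cases, assumed known.
    """
    rng = random.Random(seed)
    weights = [prev_case if y == 0 else 1.0 - prev_case for y in Y]
    return rng.choices(range(len(Y)), weights=weights, k=n_draws)
```

In a 1:1 case-control sample with a case prevalence of 9.1%, roughly 9.1% of the resampled observations would then be cases, so each bootstrap sample resembles the population mix more closely than the study sample does.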
Figure 4.1 displays the average classification accuracy and
$$\frac{P_{n_{pop}}\left[Y\,1\{A = \hat{D}^*(X)\}/P(A \mid X)\right]}{P_{n_{pop}}\left[1\{A = \hat{D}^*(X)\}/P(A \mid X)\right]},$$
the estimated mean value (Qian and Murphy, 2011), where $P_n$ is the empirical mean. At smaller sample sizes, the proposed and naive methods perform similarly in accuracy, with some advantage over the cohort method. With larger $n$, however, the proposed method performs somewhere between the naive and cohort approaches in terms of classification accuracy. The estimated values for the three methods appear to be parallel as $n$ increases, with the proposed method performing somewhere in between the naive and cohort methods. These results show promise for the proposed method at certain sample sizes, although the exact scenario in which the proposed method will greatly outperform the others is still unclear. We believe the data generation mechanism must produce data which are sufficiently different in $p(X)$ and $q(X)$ while maintaining a reasonable difference in treatment effect.

Figure 4.1: Average classification accuracy (A) and estimated value (B) across 1,000 replications at each sample size $n$ in simulation.
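The two evaluation metrics reported in Figure 4.1 can be computed along the following lines. This is a minimal sketch with hypothetical function names; it assumes the propensities $P(A \mid X)$ are supplied for each observation and that the true optimal rule is available for computing accuracy, as it is in simulation.

```python
def estimated_value(Y, A, D, prop):
    """Ratio estimator of the mean value of rule D (Qian and Murphy, 2011).

    Y, A : observed outcomes and treatments
    D    : treatments recommended by the estimated rule
    prop : propensities P(A_i | X_i) for the observed treatments
    """
    num = 0.0
    den = 0.0
    for y, a, d, p in zip(Y, A, D, prop):
        if a == d:  # only observations that followed the rule contribute
            num += y / p
            den += 1.0 / p
    return num / den


def classification_accuracy(D, D_opt):
    """Share of observations where the estimated rule matches the true optimal rule."""
    return sum(1 for d, d0 in zip(D, D_opt) if d == d0) / len(D)
```

The ratio form normalizes by the sum of inverse-propensity weights rather than by the sample size, which keeps the estimate within the range of the observed outcomes.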
Results on the mean square error (MSE) for estimating $\theta(X)$ across all simulations are provided in Table 4.1. Although the MSE decreases with $n$, the rate of convergence toward 0 may be of concern. These results merit further investigation into the algorithm for estimating $\theta(X)$. It is known that tree-based methods produce non-smooth estimates (Friedman et al., 2001); in the future, we may instead use a different regression technique such as multivariate adaptive regression splines (Friedman, 1991).
Table 4.1: Mean (sd) mean square error for estimating $\theta(X)$.
n 50 100 200 500 1000