Heterogeneous Treatment Effects Estimation

Part II Financial Engineering and Machine Learning Mod-

Chapter 4 Heterogeneous Treatment Effects in Asset Pricing

4.2 Heterogeneous Treatment Effects Estimation

The main model we consider is

$$Y_i = \mu^*(X_i) + Z_i \tau^*(X_i) + \epsilon_i, \qquad E[\epsilon_i \mid X_i, Z_i] = 0, \tag{4.1}$$

where $Y$ denotes the outcome of interest and $Z$ denotes a continuous treatment variable whose effect on $Y$ is given by $\tau^*(X)$. $X$ collects the features other than the treatment, and $\mu^*(X)$ is the base-case outcome as a function of $X$ when $Z = 0$. The condition $E[\epsilon_i \mid X_i, Z_i] = 0$ is often called the exogeneity assumption, or the no-endogeneity assumption, in linear regression models. We will arrive at equation (4.1) later in this section. First, we recall the framework of causal inference and treatment effect estimation in the binary treatment case, which dominates the causal inference literature. We then discuss the generalization to continuous treatments and the assumptions it requires. Binary and continuous treatment variables are conceptually very similar, but for clarity we use different notation: $Z$ denotes a continuous treatment, while $W$ denotes a binary treatment. Lastly, we discuss how to estimate HTE, and cast the traditional regression approach in asset pricing as a simple special case to build the connection. In short, we treat future stock returns as $Y$, one firm characteristic as the treatment $W$ or $Z$, and the remaining characteristics as the features $X$.

4.2.1 Causal inference framework: Binary treatment W

We first briefly recall the causal inference framework before turning to HTE estimation. The framework we adopt is often called the potential outcome framework. We assume we have $n$ i.i.d. samples $(X_i, Y_i, W_i)$ drawn from a joint distribution $f(X, Y, W)$, where

• $Y_i \in \mathbb{R}$: observed outcome for individual $i$.

• $X_i \in \mathbb{R}^p$: features for individual $i$, of dimension $p$.

• $W_i \in \{0, 1\}$: whether $i$ receives the treatment or not.

• “Potential outcomes”: $\{Y_i(0), Y_i(1)\}$, depending on whether $i$ receives the treatment; the observed outcome is $Y_i = Y_i(W_i)$.

The focus is on estimating HTE, defined as

$$\tau^*(x) = E[Y(1) - Y(0) \mid X = x], \tag{4.2}$$

which is also called the conditional average treatment effect (CATE). We use the superscript $*$ to emphasize that it is an unknown population quantity. As hinted in the introduction, the empirical study in this chapter mainly uses continuous treatment variables such as the book-to-market ratio, but interesting binary treatment variables also exist. For example, $W$ could represent a firm-level event, such as a dividend cancellation announcement, whose effect could be heterogeneous.

The fundamental problem of causal inference is that we observe only one of $\{Y_i(0), Y_i(1)\}$ for each $i$; the other is always missing. In the medical trial setting, the problem can be stated as: how long would a patient in the control group have lived had he taken the drug? In our case, the problem can be stated as

in the current month. This is what makes the causal inference problem special: if we could observe both $Y(0)$ and $Y(1)$, we would also observe the treatment effect $\tau_i = Y_i(1) - Y_i(0)$ for each observation $i$, and HTE estimation would reduce to a supervised learning task in which any ML method is fitted to the data set $(X_i, Y_i(1) - Y_i(0))$.

To tackle the fundamental problem of causal inference, we need some structure and assumptions in order to recover $\tau^*$. Two types of studies are typically used to infer treatment effects: randomized experiments and observational studies. In randomized experiments, researchers conduct experiments in a controlled environment by randomly assigning treatment to individuals; such experiments are common in drug trials and in A/B testing conducted by internet companies such as Facebook and Google. Experiments remain the gold standard of causal inference but, unfortunately, they are not feasible in most economic applications. In that case we must work with observed data only, which is called an observational study. One standard assumption in observational studies is no unmeasured confounders: conditional on the observed features $X_i$, the treatment assignment $W_i$ is independent of the potential outcomes $(Y_i(0), Y_i(1))$, i.e.

• No unmeasured confounders:

$$(Y_i(0), Y_i(1)) \perp\!\!\!\perp W_i \mid X_i. \tag{4.3}$$

Note that in a controlled randomized experiment, assumption (4.3) is automatically satisfied. In observational studies, however, we usually need to assume that (4.3) holds. If we have reason to believe that assumption (4.3) is violated, we have an endogeneity problem and need to find instrumental variables (IV) to identify treatment effects. See the corresponding chapters in textbooks such as [89] for more details in that direction. We explain the meaning of this assumption in our empirical context in the next subsection, where we describe the continuous treatment case.

Under assumption (4.3), we can decompose the observed $Y_i$ into the following three terms:

$$Y_i = E[Y_i(0) \mid X = X_i] + W_i \tau^*(X_i) + \epsilon_i, \tag{4.4}$$

where the last term $\epsilon_i$ satisfies

$$E[\epsilon_i \mid W_i, X_i] = E\big[Y_i - E[Y_i(0) \mid X = X_i] - W_i \tau^*(X_i) \mid W_i, X_i\big] = 0.$$

Our goal is then to flexibly estimate HTE, that is, the CATE function $\tau^*(X_i)$.
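The zero-mean property of the residual in (4.4) is stated without its intermediate steps. A short derivation, using only the consistency relation $Y_i = Y_i(W_i)$, assumption (4.3) and definition (4.2), can be sketched as follows for $w \in \{0, 1\}$:

```latex
\begin{align*}
E[Y_i \mid W_i = w, X_i = x]
  &= E[Y_i(w) \mid W_i = w, X_i = x] && \text{since } Y_i = Y_i(W_i) \\
  &= E[Y_i(w) \mid X_i = x]          && \text{by (4.3)} \\
  &= E[Y_i(0) \mid X_i = x] + w\,\tau^*(x) && \text{by (4.2)}.
\end{align*}
```

Subtracting the right-hand side from $Y_i$ and conditioning on $(W_i, X_i)$ then gives a residual with conditional mean zero.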

4.2.2 Causal inference framework: Continuous treatment Z

We now generalize the model to the case of a continuous treatment variable, denoted by $Z$. Continuous treatments are not as widely studied in the causal inference literature as binary treatments. Again, the outcome variable is denoted by $Y$, and the observed outcome becomes $Y_i = Y_i(Z_i)$, $Z_i \in \mathcal{Z}$, where $\mathcal{Z}$ can be a continuum; for each individual $i$, we observe the potential outcome at only one treatment level $Z_i$. The problem is more complicated than in the binary case, so we need to assume more than equation (4.3). In particular, in addition to no unmeasured confounders, we assume that, conditional on $X$, the effect of increasing $Z$ by one unit is constant regardless of the current level of $Z$. The assumption can be stated as follows:

• Linear treatment and no unmeasured confounders:

$$E[Y_i \mid X_i = x, Z_i = z] = \mu^*(x) + z \tau^*(x). \tag{4.5}$$

The assumption above implies equation (4.1) from the opening paragraph of this section, which we repeat below:

$$Y_i = \mu^*(X_i) + Z_i \tau^*(X_i) + \epsilon_i, \qquad E[\epsilon_i \mid X_i, Z_i] = 0.$$

The specific model in equation (4.5) has also been studied recently by, for example, [9] and [1]. We assume equation (4.5) throughout the paper, and therefore equation (4.1) always holds in our setting. The zero conditional expectation of the residual given $X, Z$ in equation (4.1) is often called the exogeneity assumption. It may at first seem a strong assumption; however, we explain what it means in our empirical setting in the following section, and it turns out to be no more restrictive than what is assumed in the usual factor model estimation. In fact, equation (4.1) is much more flexible than the standard assumptions of the regression approach used in empirical asset pricing.

4.2.3 Linear factor models as a special case

Before turning to HTE estimation, we cast the traditional regression approach to factor models as a special case of our main specification in equation (4.5). We simplify the HTE estimation problem to a linear, homogeneous setting and build the connection between this setting and the estimation of traditional linear factor models.

Because we are generally interested in continuous variables in asset pricing models, we cast regression models in our causal framework using equation (4.1), which is implied by assumption (4.5). We make the following two additional assumptions, which simplify the problem significantly.

• The treatment effect is homogeneous, that is,

$$\tau^*(x) = \tau. \tag{4.6}$$

• The base-case response $\mu^*(x)$ is linear in the features $X$:

$$\mu^*(x) = x\beta. \tag{4.7}$$

Then, under assumptions (4.5), (4.6) and (4.7), equation (4.1) becomes

$$Y_i = X_i \beta + Z_i \tau + \epsilon_i, \qquad E[\epsilon_i \mid X_i, Z_i] = 0. \tag{4.8}$$

OLS theory says that $\hat\tau_{OLS}$ is consistent for the true $\tau$. In linear factor models, we want to estimate whether a particular factor, say the book-to-market ratio of stock $j$ in month $t$, can predict stock $j$'s return in month $t+1$ after controlling for other firm characteristics represented by $X_{jt}$. What we have is essentially a panel data regression, but we care only about the cross-sectional relationship between the book-to-market ratio and next-month stock returns. One way to do the analysis is therefore to run a panel regression with time fixed effects, which focuses on variation in the cross section and averages it out along the time axis. In terms of point estimates, adding time fixed effects is equivalent to demeaning both the target variable and all regressors by their monthly averages. Denote the stock return by $r$, the stock features by $X$, and the book-to-market ratio by $BM$. Following [40], we use the log of the book-to-market ratio, $\log BM_{jt}$, in the regression because its distribution is closer to normal. We have the panel regression equation with cross-sectionally demeaned variables:

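$$r_{jt} - \bar{r}_t = (X_{jt} - \bar{X}_t)\beta + (\log BM_{jt} - \log \overline{BM}_t)\tau + \epsilon_{jt}, \tag{4.9}$$

where $\bar{r}_t = \frac{1}{J_t} \sum_{l=1}^{J_t} r_{lt}$, and $\log \overline{BM}_t$ is similarly defined.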
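The claim that time fixed effects and cross-sectional demeaning give the same point estimate can be checked numerically. The following is a minimal sketch on simulated data, not the empirical procedure of this chapter: the panel is assumed balanced (the same number of stocks `J` every month, whereas the text allows $J_t$ to vary), and all sizes, coefficients and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, J, p = 24, 50, 3                      # months, stocks per month, controls (hypothetical sizes)
beta_true = np.array([0.5, -0.3, 0.2])   # hypothetical coefficients on the controls
tau_true = 0.8                           # hypothetical homogeneous treatment effect

month = np.repeat(np.arange(T), J)       # month index of each stock-month row
alpha = rng.normal(size=T)[month]        # time effect shared by all stocks in a month
X = rng.normal(size=(T * J, p)) + alpha[:, None]
log_bm = 0.5 * X[:, 0] + alpha + rng.normal(size=T * J)
r = alpha + X @ beta_true + tau_true * log_bm + rng.normal(scale=0.5, size=T * J)

def demean_by_month(a):
    """Subtract the cross-sectional (within-month) mean from every row."""
    cube = a.reshape(T, J, -1)           # rows are ordered month by month
    return (cube - cube.mean(axis=1, keepdims=True)).reshape(a.shape)

# (a) OLS on cross-sectionally demeaned variables, as in equation (4.9)
D = np.column_stack([demean_by_month(X), demean_by_month(log_bm)])
coef_dm, *_ = np.linalg.lstsq(D, demean_by_month(r), rcond=None)
tau_demeaned = coef_dm[-1]

# (b) OLS on raw variables with one dummy per month (time fixed effects)
dummies = (month[:, None] == np.arange(T)[None, :]).astype(float)
F = np.column_stack([X, log_bm, dummies])
coef_fe, *_ = np.linalg.lstsq(F, r, rcond=None)
tau_fixed_effects = coef_fe[p]

print(tau_demeaned, tau_fixed_effects)
```

The two estimates of $\tau$ agree up to floating-point error, which is the Frisch-Waugh-Lovell result applied to the month dummies.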

Regression equation (4.9) is related to the possibly more popular Fama-MacBeth regression proposed by [42]. See [75] for more on that connection.

Comparing equations (4.8) and (4.9), we can map quantities in factor models to those in the causal framework: $r_{jt} - \bar{r}_t$ is our outcome variable $Y$; $X_{jt} - \bar{X}_t$ are the features (other controls) $X$, of dimension $p$; and, most importantly, $\log BM_{jt} - \log \overline{BM}_t$ is the treatment $Z$, whose effect on next-month returns relative to the cross-sectional average, $r_{jt} - \bar{r}_t$, is of primary interest. This comparison also shows which assumptions lie behind the usual OLS estimation of traditional linear regressions in factor models (equation (4.9)). For the OLS estimator in equation (4.9) to be consistent, equation (4.8) must hold, which is often called the exogeneity condition.⁴ We have seen that this is equivalent to assumptions (4.5), (4.6) and (4.7). We keep assumption (4.5) throughout the paper. However, in section 4.2.4 we introduce HTE estimation techniques developed by machine learning and causal inference researchers, which help us relax assumptions (4.6) and (4.7) in our empirical studies.

⁴ Actually, the weakest condition that ensures OLS is consistent is $E[X_i \epsilon_i] = 0$ and $E[Z_i \epsilon_i] = 0$.

We next discuss the assumption of no unmeasured confounders, or no endogeneity, in equations (4.3) and (4.5). These assumptions imply an exogeneity condition such as $E[\epsilon \mid X, Z] = 0$, which is the type of assumption that justifies OLS in linear models. We assume equation (4.5), and thus (4.1), throughout this chapter. We argue that this assumption is reasonable for the following two reasons. First, it is no stronger than the assumptions made in the literature when estimating traditional linear factor models by regression. If we run the type of panel regression in equation (4.9), or similar Fama-MacBeth regressions, and start to interpret the sign and magnitude of $\tau$ or conduct statistical inference on $\tau$, we are implicitly assuming that equation (4.5) or (4.1) holds. Second, compared with prior regression approaches in the finance literature that assume exogeneity, our empirical study based on HTE estimation is in a better position to assume no unmeasured confounders, and thus to mitigate the endogeneity problem, for two reasons. (1) In the existing literature, only a few characteristics are collected as controls $X$ for regression equations like (4.9), whereas we include around 40 features as controls $X$. The endogeneity problem caused by omitted-variable bias is therefore much less of an issue in our study. (2) We also use the features $X$ in a much more flexible way, in that we estimate a nonlinear function $\mu^*(X)$ of $X$, as in equation (4.1), instead of using the linear form $X\beta$. It is possible that $X^2$ should have been included as a control, but in the linear setup researchers fail to add it. The misspecification causes omitted-variable bias and the endogeneity

problem. In our procedure described in the following sections, we could estimate a

flexible functional form for the base model µ∗(X) to alleviate this concern compared
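The omitted $X^2$ argument can be illustrated with a small simulation. This is a sketch under assumed, hypothetical data-generating values (a single feature, a treatment that loads on $X^2$, and a true effect of 1.0), not a result from our data: when the base outcome depends on $X^2$ but only a linear control is used, the OLS coefficient on the treatment absorbs part of the omitted term.

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau_true = 5000, 1.0                  # hypothetical sample size and true effect

x = rng.normal(size=n)
z = x ** 2 + rng.normal(size=n)          # treatment is related to X only through X^2
y = 2.0 * x ** 2 + tau_true * z + rng.normal(size=n)   # base outcome is nonlinear in X

def ols_last_coef(columns, target):
    """OLS with an intercept; returns the coefficient on the last column (the treatment)."""
    A = np.column_stack([np.ones(len(target))] + columns)
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef[-1]

tau_linear = ols_last_coef([x, z], y)             # controls: X only (misspecified)
tau_quadratic = ols_last_coef([x, x ** 2, z], y)  # controls: X and X^2

print(tau_linear, tau_quadratic)
```

In this design the misspecified regression overstates the effect by a wide margin, while adding $X^2$ as a control brings the estimate back close to `tau_true`.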

4.2.4 R-learner for HTE estimation

In this subsection, we explain how we estimate the model in (4.1). There are many alternatives, but we focus on one particular estimator, the R-learner, proposed by [73]. The problem [73] tackle is how to turn a good generic black-box predictor into a good treatment effect estimator with nice theoretical properties. In terms of the machine learning tools we can use, the R-learner is more flexible than most HTE estimation methods, where “causal” variants of machine learning methods still require effort from specialized researchers. Our empirical studies in section 4.4 focus on applications of the R-learner.

In this section, we assume only assumption (4.5). Taking expectations of both sides of equation (4.5) conditional on $X_i$ only, we have

$$E[Y_i \mid X_i, Z_i] = \mu^*(X_i) + Z_i \tau^*(X_i),$$

$$E[Y_i \mid X_i] = \mu^*(X_i) + E[Z_i \mid X_i] \tau^*(X_i). \tag{4.10}$$

Subtracting (4.10) from regression equation (4.1), we have the following:

$$Y_i - m^*(X_i) = (Z_i - e^*(X_i)) \tau^*(X_i) + \epsilon_i, \qquad E[\epsilon_i \mid X_i, Z_i] = 0, \tag{4.11}$$

where

$$m^*(x) := E[Y_i \mid X_i = x],$$

$$e^*(x) := E[Z_i \mid X_i = x].$$
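For completeness, the step from (4.1) and (4.10) to (4.11) can be written out. With the definitions of $m^*$ and $e^*$ above, (4.10) reads $m^*(X_i) = \mu^*(X_i) + e^*(X_i)\tau^*(X_i)$, so that:

```latex
\begin{align*}
Y_i - m^*(X_i)
  &= \big[\mu^*(X_i) + Z_i \tau^*(X_i) + \epsilon_i\big]
     - \big[\mu^*(X_i) + e^*(X_i) \tau^*(X_i)\big] \\
  &= \big(Z_i - e^*(X_i)\big) \tau^*(X_i) + \epsilon_i .
\end{align*}
```

The base-case term $\mu^*(X_i)$ cancels, which is why only $m^*$ and $e^*$ need to be estimated as nuisance components.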

In the binary treatment setting, $e^*(x) = E[Z_i \mid X_i = x]$ gives the conditional probability of receiving treatment and is often called the propensity score in the causal inference literature. The exposition here follows [73]; the only minor difference is that we have a continuous treatment variable $Z_i$ rather than a binary one. Equation (4.11) is our main estimation equation in the R-learner. In equation (4.11), we

This step is often called residualization and is very intuitive: we want to take out the effects of the other controls $X_i$ and isolate the treatment effect $\tau^*$. [78] first utilized the residualization form of equation (4.11) for the binary case.

We next recall the R-learner HTE estimator and a few of its variations. First, an oracle with knowledge of the true population quantities $m^*(x)$ and $e^*(x)$ can estimate the function $\tau^*$ by

• Oracle estimator $\tilde\tau(x)$:

$$\tilde\tau := \arg\min_{\tau(x)} \left\{ \frac{1}{n} \sum_{i=1}^{n} \big( Y_i - m^*(X_i) - (Z_i - e^*(X_i)) \tau(X_i) \big)^2 + \Lambda_n(\tau(x)) \right\}, \tag{4.12}$$

where $\Lambda_n$ is a regularization term that should be tuned by cross-validation (CV). We can think of it as similar to the $L_1$ penalty term in LASSO regression, or the maximum tree depth in tree-based methods. This term is crucial: without regularization, we could minimize the training error to the point of overfitting the data and failing to generalize.

In reality, we cannot implement the oracle estimator, but we can first estimate $m^*$ and $e^*$ and plug our estimates $\hat m$ and $\hat e$ into the minimization problem in (4.12), which gives the R-learner proposed by [73]:

• R-learner $\hat\tau(x)$ is estimated as below, where $\hat m^{(-i)}(X_i)$ and $\hat e^{(-i)}(X_i)$ denote hold-out predictions made by models $\hat m$ and $\hat e$ fitted to data without the $i$th data point:

$$\hat\tau := \arg\min_{\tau(x)} \left\{ \frac{1}{n} \sum_{i=1}^{n} \Big( Y_i - \hat m^{(-i)}(X_i) - \big(Z_i - \hat e^{(-i)}(X_i)\big) \tau(X_i) \Big)^2 + \Lambda_n(\tau(x)) \right\} \tag{4.13}$$

In summary, we can implement the R-learner in the following two steps:

• Step 1: Fit $\hat m(x)$ and $\hat e(x)$ via any black-box predictive method, tuned for optimal predictive accuracy using CV.

• Step 2: Minimize the causal loss function plus the regularization term $\Lambda_n(\tau(x))$, again via any black-box method. Use CV to tune the hyperparameters to combat overfitting:

$$\hat\tau := \arg\min_{\tau(x)} \left\{ \frac{1}{n} \sum_{i=1}^{n} \Big( Y_i - \hat m^{(-i)}(X_i) - \big(Z_i - \hat e^{(-i)}(X_i)\big) \tau(X_i) \Big)^2 + \Lambda_n(\tau(x)) \right\}$$

Note that we use hold-out predictions for the nuisance components $m^*(x)$ and $e^*(x)$ when estimating the quantity of interest in step 2. This use of hold-out predictions is also known as cross-fitting, for its similarity to cross-validation. The difference is that here we use hold-out predictions to fit our main parameters of interest, as opposed to evaluating the performance of predictive models. It is a widely used technique for making correct statistical inference when machine learning methods are involved ([10] and [30]). Usually, $k$-fold cross-fitting is used; in our empirical study, we set $k = 5$.
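The two steps and the cross-fitting scheme can be sketched in code. The following is a minimal illustration on simulated data, not the implementation used in our empirical study: the nuisance learners are assumed to be ridge regressions on polynomial features (a stand-in for any black-box predictor), $\tau(x)$ is assumed linear in $x$, the penalties are fixed rather than tuned by CV, and all function and variable names are hypothetical.

```python
import numpy as np

def poly_features(X):
    """Intercept, linear and squared terms: a simple stand-in for a flexible learner."""
    return np.column_stack([np.ones(len(X)), X, X ** 2])

def ridge_fit(A, y, lam):
    """Closed-form ridge regression; lam plays the role of the regularization term."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

def cross_fit_predict(X, target, k=5, lam=1e-3, seed=0):
    """Hold-out predictions of E[target | X] from k-fold cross-fitting."""
    n = len(X)
    fold = np.random.default_rng(seed).permutation(n) % k
    A = poly_features(X)
    pred = np.empty(n)
    for f in range(k):
        held_out = fold == f
        coef = ridge_fit(A[~held_out], target[~held_out], lam)  # fit without fold f
        pred[held_out] = A[held_out] @ coef                     # predict on fold f only
    return pred

def r_learner_linear(X, Y, Z, k=5, lam_tau=1e-3):
    """R-learner with tau(x) = [1, x] @ theta, fitted by penalized least squares."""
    m_hat = cross_fit_predict(X, Y, k=k)      # step 1: hold-out estimate of m*(X_i)
    e_hat = cross_fit_predict(X, Z, k=k)      # step 1: hold-out estimate of e*(X_i)
    y_res, z_res = Y - m_hat, Z - e_hat       # residualization
    basis = np.column_stack([np.ones(len(X)), X])
    # step 2: minimize sum_i (y_res_i - z_res_i * basis_i @ theta)^2 + lam_tau * |theta|^2
    return ridge_fit(z_res[:, None] * basis, y_res, lam_tau)

# Simulated example with a known effect function tau*(x) = 1 + 0.5 * x_1
rng = np.random.default_rng(42)
n = 4000
X = rng.normal(size=(n, 2))
Z = 0.5 * X[:, 0] + rng.normal(size=n)                 # e*(x) = 0.5 * x_1
tau_star = 1.0 + 0.5 * X[:, 0]
Y = X[:, 0] + X[:, 1] ** 2 + Z * tau_star + rng.normal(size=n)

theta_hat = r_learner_linear(X, Y, Z)
print(theta_hat)   # coefficients on [1, x_1, x_2]
```

In this design the true coefficients are $(1, 0.5, 0)$, so the printed vector should be close to those values. Replacing `ridge_fit` in either step with a tree ensemble or a neural network leaves the structure unchanged, which is the flexibility the R-learner is designed to offer.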

[73] show that the R-learner has nice theoretical guarantees. For example, with additional assumptions on the true form of $\tau^*$ and on the machine learning algorithms used, [73] prove that $E[(\hat\tau_n(X_i) - \tau^*(X_i))^2]$ converges to 0 as fast as $E[(\tilde\tau_n(X_i) - \tau^*(X_i))^2]$. Also, as mentioned earlier, although these theoretical properties require certain conditions, we are free to use any black-box predictor in the R-learner when implementing it in practice, which makes the approach very flexible.

[73] named their procedure R-learner, based on Robinson’s transformation, partly

to recognize [78]’s work and partly to emphasize the importance of residualization.

In the next section, we describe how we apply R-learner to our setting of factor

In document Essays on Demand Estimation, Financial Economics and Machine Learning (Page 155-165)