Part II Financial Engineering and Machine Learning Mod-
Chapter 4 Heterogeneous Treatment Effects in Asset Pricing
4.2 Heterogeneous Treatment Effects Estimation
The main model we consider is
$$ Y_i = \mu^*(X_i) + Z_i \tau^*(X_i) + \varepsilon_i, \qquad E[\varepsilon_i \mid X_i, Z_i] = 0, \tag{4.1} $$
where Y denotes the outcome we care about and Z denotes a continuous treatment
variable whose effect on Y is denoted by τ∗(X). The X's are the features other than the
treatment, and µ∗(X) is the base-case outcome as a function of X when Z = 0. The
condition E[εi|Xi, Zi] = 0 is often called the exogeneity assumption or the no-endogeneity
assumption in linear regression models. We will arrive at equation (4.1) later in
this section. First, we recall the framework of causal inference and treatment effect
estimation in the binary treatment case, which is dominant in the causal inference
literature. We then discuss generalizations to continuous treatment cases and the
assumptions made therein. Binary and continuous treatment variables are
conceptually very similar, but for clarity we use different notation: Z denotes
a continuous treatment variable, while W denotes a binary treatment. Lastly, we
discuss how to estimate HTE, and cast the traditional regression approach in
asset pricing as a simple special case to build the connection. Basically, we treat
future stock returns as Y, one firm characteristic as the treatment W or Z, and the rest
of the characteristics as the features X.
4.2.1 Causal inference framework: Binary treatment W
We first briefly recall the causal inference framework before diving into HTE estimation.
The framework we adopt is often called the potential outcome framework. We assume
we have n i.i.d. samples of (Xi, Yi, Wi) drawn from a joint distribution f(X, Y, W),
where
• Yi ∈ R: observed outcome for individual i.
• Xi ∈ R^p: features for individual i, of dimension p.
• Wi ∈ {0, 1}: whether i receives treatment or not.
• “Potential outcomes”: {Yi(0), Yi(1)}, depending on whether i gets treatment; the observed outcome is Yi = Yi(Wi).
The focus is on estimating the HTE, defined as
$$ \tau^*(x) = E[Y(1) - Y(0) \mid X = x], \tag{4.2} $$
which is also called the conditional average treatment effect (CATE). We use the superscript
∗ to emphasize that it is an unknown population quantity. Note that, as
hinted in the introduction section, in this chapter’s empirical study we mainly use
continuous treatment variables like book-to-market ratio but we could also think of
interesting binary treatment variables. For example, W could represent some events
of firms, such as dividend cancellation announcements, where the effects of the event
could potentially be heterogeneous.
The fundamental problem of causal inference is that we only observe one of
{Yi(0), Yi(1)} for each i. The other one in {Yi(0), Yi(1)} is always missing. In the
medical trial setting, this problem can be stated as: how long would a patient in the
control group have lived had he taken the drug? In our case, the problem can be stated as
in the current month. This is what makes the causal inference problem special: if we could
observe both Y(0) and Y(1), we would observe the treatment effect τi = Yi(1) − Yi(0) for
each observation i, and HTE estimation would reduce to a supervised learning task
in which any ML method could be fitted to the data set (Xi, Yi(1) − Yi(0)).
To tackle the fundamental problem of causal inference, we need some structure
and assumptions in order to recover τ∗. There are usually two types of studies that try
to infer treatment effects: randomized experiments and observational studies. Ran-
domized experiments are settings where researchers are able to conduct experiments
in a controlled environment by randomly assigning treatment to individuals, which
are common in drug trials and A/B testing conducted by internet companies such as
Facebook and Google. Experiments remain the gold standard of causal inference,
but, unfortunately, in most economic applications experiments are not feasible. In
this case, we need to work with observed data only; such settings are called observational
studies. One standard assumption in observational studies is no unmeasured
confounders: conditional on the observed features Xi, the treatment assignment
Wi is independent of the potential outcomes (Yi(0), Yi(1)), i.e.
• No unmeasured confounders
$$ (Y_i(0), Y_i(1)) \perp\!\!\!\perp W_i \mid X_i \tag{4.3} $$
Note that in a controlled randomized experiment, assumption (4.3) is automatically
satisfied. In observational studies, however, we usually need to assume that
(4.3) holds. If we have reasons to believe that assumption (4.3) is violated, we have
an endogeneity problem and need to find instrumental variables (IV) for identification
of treatment effects. See the corresponding chapters in textbooks such as [89] for
more details in that direction. We explain the meaning of this assumption in our
empirical context in the next subsection, where we describe the continuous treatment case.
Under assumption (4.3), we can decompose the observed Yi into the following three
terms:
$$ Y_i = E[Y(0) \mid X = X_i] + W_i \tau(X_i) + \varepsilon_i, \tag{4.4} $$
where the last term εi satisfies
$$ E[\varepsilon_i \mid W_i, X_i] = E\big[\, Y_i - E[Y_i(0) \mid X = X_i] - W_i \tau(X_i) \,\big|\, W_i, X_i \big] = 0. $$
Our goal is then to flexibly estimate the HTE, that is, the CATE function τ(Xi).
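The zero conditional mean of the residual in (4.4) can be spelled out step by step. The following is a sketch that uses only the consistency relation Yi = Yi(Wi), assumption (4.3), and the CATE definition (4.2), reading τ in (4.4) as the CATE function; it holds for w ∈ {0, 1}.

```latex
\begin{align*}
E[Y_i \mid W_i = w, X_i = x]
  &= E[Y_i(w) \mid W_i = w, X_i = x]
     && \text{since } Y_i = Y_i(W_i) \\
  &= E[Y_i(w) \mid X_i = x]
     && \text{by (4.3)} \\
  &= E[Y_i(0) \mid X_i = x] + w\, E[Y_i(1) - Y_i(0) \mid X_i = x]
     && \text{since } w \in \{0, 1\} \\
  &= E[Y_i(0) \mid X_i = x] + w\, \tau(x)
     && \text{by (4.2)}.
\end{align*}
```

Subtracting the right-hand side from Yi and taking the expectation conditional on (Wi, Xi) then gives a residual with conditional mean zero.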
4.2.2 Causal inference framework: Continuous treatment Z
We generalize the model to the continuous treatment case, where the treatment is
denoted by Z. Continuous treatment is not as widely studied in the causal inference
literature as binary treatment. Again, the outcome variable is denoted by Y,
and the observed outcome becomes Yi = Yi(Zi), with $Z_i \in \mathcal{Z}$, where the set of treatment levels $\mathcal{Z}$ can be a continuum;
for each individual i, we only get to observe the potential outcome at the single realized level Zi.
The problem here is more complicated than the binary case, and thus we need to
assume more than just equation (4.3). In particular, we assume that, conditional on
X, the effect of increasing Z by one unit is constant regardless of the current level of
Z, in addition to no unmeasured confounders. The assumption in this case can be
described as below:
• Linear treatment and no unmeasured confounders
$$ E[Y_i \mid X_i = x, Z_i = z] = \mu^*(x) + z\, \tau^*(x) \tag{4.5} $$
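The assumption above implies equation (4.1) in the starting paragraph of this section,
which we repeat below:
$$ Y_i = \mu^*(X_i) + Z_i \tau^*(X_i) + \varepsilon_i, \qquad E[\varepsilon_i \mid X_i, Z_i] = 0. \tag{4.1} $$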
The specific model of equation (4.5) has also been studied recently by, for example, [9]
and [1]. We assume equation (4.5) throughout the paper, and therefore equation
(4.1) always holds in our setting. The zero conditional expectation of the residual given
X, Z in equation (4.1) is often called the exogeneity assumption. It may seem at first
to be a strong assumption; however, we explain what it means in our
empirical setting in the following section, and it turns out to be no more
restrictive than what is assumed in the usual factor model estimation. In fact, equation
(4.1) is much more flexible than the standard assumptions of the regression approach used in
empirical asset pricing.
4.2.3 Linear factor models as a special case
Before turning to HTE estimation, we cast the traditional regression approach to factor
models as a special case of our main specification in equation (4.5). We simplify the
HTE estimation problem to a linear, homogeneous setting and build the connection
between this setting and the estimation of traditional linear factor models.
Because we are generally interested in continuous variables in asset pricing models,
we cast regression models under our causal framework using equation (4.1), which is
implied by assumption (4.5). We make the following two additional assumptions,
which simplify the problem significantly.
• The treatment effect is homogeneous, that is,
$$ \tau^*(x) = \tau \tag{4.6} $$
• The base-case response µ∗(x) is linear in the features X:
$$ \mu^*(x) = x\beta \tag{4.7} $$
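Then, under assumptions (4.5), (4.6) and (4.7), equation (4.1) becomes
$$ Y_i = X_i \beta + Z_i \tau + \varepsilon_i, \qquad E[\varepsilon_i \mid X_i, Z_i] = 0. \tag{4.8} $$
OLS theory says that the OLS estimator $\hat\tau_{OLS}$ is consistent for the true τ. In linear factor models, we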
want to estimate whether or not a particular factor, say, the book-to-market ratio of
stock j in month t, can predict stock j's return for month t + 1, after controlling for
other firm characteristics represented in Xjt. What we have is essentially a panel data
regression, but we care only about the cross-sectional relationship between the book-to-market
ratio and next-month stock returns. Therefore, one way to do the analysis is to
run a panel data regression with time fixed effects, which focuses on variation in the
cross section and averages out along the time axis. In terms of the point estimates,
adding time fixed effects is equivalent to demeaning both the target variable and all
regressors by their monthly averages. Denote the stock return by r, the stock features by X, and
the book-to-market ratio by BM. Following [40], we use the log of the book-to-market
ratio, log BMjt, in the regression because its distribution is closer to normal.
We have the panel regression equation with cross-sectionally demeaned variables:
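$$ r_{jt} - \bar r_t = (X_{jt} - \bar X_t)\,\beta + (\log BM_{jt} - \log \overline{BM}_t)\,\tau + \varepsilon_{jt}, \tag{4.9} $$
where $\bar r_t = \frac{1}{J_t} \sum_{l=1}^{J_t} r_{lt}$ and $\log \overline{BM}_t$ is similarly defined.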
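The equivalence between time fixed effects and cross-sectional demeaning can be checked numerically. The sketch below is illustrative only: the simulated panel, the coefficient values, and the helper `demean_by_month` are assumptions made for this example and are not taken from the empirical study. It compares OLS on monthly demeaned variables, as in (4.9), with OLS that includes one dummy variable per month.

```python
import numpy as np

rng = np.random.default_rng(0)
T, J = 12, 50                           # months and stocks per month (balanced panel, assumed)
month = np.repeat(np.arange(T), J)      # month index of each stock-month observation

# Simulated data (hypothetical): one control x, the treatment log_bm, and returns r.
x = rng.normal(size=T * J) + 0.3 * month
log_bm = 0.5 * x + rng.normal(size=T * J)
month_shock = rng.normal(size=T)[month]  # common shock shared by all stocks in a month
r = 0.2 * x + 0.1 * log_bm + month_shock + rng.normal(size=T * J)

def demean_by_month(v):
    """Subtract the cross-sectional (monthly) average from each observation."""
    means = np.bincount(month, weights=v) / np.bincount(month)
    return v - means[month]

# (a) OLS on cross-sectionally demeaned variables, as in equation (4.9).
A = np.column_stack([demean_by_month(x), demean_by_month(log_bm)])
coef_demeaned, *_ = np.linalg.lstsq(A, demean_by_month(r), rcond=None)

# (b) OLS on raw variables plus one dummy per month (time fixed effects).
dummies = (month[:, None] == np.arange(T)[None, :]).astype(float)
B = np.column_stack([x, log_bm, dummies])
coef_fe, *_ = np.linalg.lstsq(B, r, rcond=None)

# The slope estimates (beta, tau) from the two regressions coincide.
print(coef_demeaned, coef_fe[:2])
```

Only the point estimates are compared here; standard errors would need separate treatment.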
Regression equation (4.9) is related to the possibly more popular Fama-MacBeth
regression proposed by [42]. See [75] for more about that connection.
Comparing equations (4.8) and (4.9), we can then map quantities in factor models
to those in the causal framework: $r_{jt} - \bar r_t$ is our outcome variable Y, $X_{jt} - \bar X_t$ are the
features/other controls X of dimension p, and, most importantly, $\log BM_{jt} - \log \overline{BM}_t$
is the treatment Z, whose effects on next-month returns relative to the cross-sectional
averages, $r_{jt} - \bar r_t$, are of primary interest. On the other hand, this comparison also
shows what assumptions are behind the usual OLS estimation for traditional linear
regression in factor models (equation (4.9)). For the OLS estimator in equation (4.9) to be
consistent, equation (4.8) must hold, which is often called the exogeneity condition.⁴ We've
seen that this is equivalent to assumptions (4.5), (4.6) and (4.7). We keep assumption
(4.5) throughout the paper. However, in section 4.2.4 we introduce HTE estimation
techniques developed by machine learning and causal inference researchers, which
help us relax assumptions (4.6) and (4.7) in our empirical studies.
⁴ Actually, the weakest condition that ensures OLS is consistent is E[Xi εi] = 0 and E[Zi εi] = 0.
We next discuss the assumption of no unmeasured confounders, or no endogeneity,
in equations (4.3) and (4.5). Those assumptions imply an exogeneity
condition like E[ε|X, Z] = 0, which is the type of assumption that justifies OLS in linear models. We assume equation (4.5) and thus (4.1) throughout this chapter. We
argue that this assumption is reasonable for the following two reasons. First, it is no
stronger than the assumptions made in the literature when estimating traditional linear
factor models using regressions. If we run the type of panel regression in equation
(4.9) or similar Fama-MacBeth regressions and start to interpret the sign and magnitude
of τ or conduct statistical inference on τ, we are implicitly assuming that equation
(4.5) or (4.1) holds. Second, compared with the prior regression approach that assumes
exogeneity in the finance literature, our empirical study based on HTE estimation
is in a better position to assume no unmeasured confounders and thus mitigate the endogeneity
problem, for the following two reasons: (1) In the existing literature, only
a few characteristics are collected as controls X for regression equations like (4.9),
whereas we include around 40 features as controls X. The endogeneity problem caused
by omitted-variable bias is therefore much less of an issue in our study. (2) We also utilize
the features X in a much stronger way, in that we estimate a nonlinear function of
X, µ∗(X), as in equation (4.1), instead of using a linear form Xβ. It is possible that
X^2 should have been included as a control, but in the linear setup researchers fail
to add it. The misspecification causes omitted-variable bias and the endogeneity
problem. In our procedure described in the following sections, we can estimate a
flexible functional form for the base model µ∗(X) to alleviate this concern compared
with the linear setup.
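The omitted-variable concern in point (2) can be illustrated with a small simulation. This is a sketch under assumed values, not part of the empirical study: the data-generating process, the true effect τ = 1, and the helper `ols` are all hypothetical. The base outcome depends on X^2, and the treatment Z is also correlated with X^2, so a regression that controls for X only linearly attributes part of the X^2 effect to Z.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(size=n)

# Hypothetical data-generating process: E[Z | X] = X^2, mu*(X) = 2 X^2, tau = 1.
z = x**2 + rng.normal(size=n)
y = 2.0 * x**2 + 1.0 * z + rng.normal(size=n)

def ols(columns, target):
    """Least-squares coefficients for an intercept plus the given regressors."""
    design = np.column_stack([np.ones(len(target))] + columns)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

# Misspecified base model: controls for X linearly and omits X^2.
tau_linear = ols([x, z], y)[-1]

# Correctly specified base model: includes X^2 as a control.
tau_quadratic = ols([x, x**2, z], y)[-1]

print(f"tau with linear controls only: {tau_linear:.3f}")
print(f"tau with X^2 included:         {tau_quadratic:.3f}")
```

In this design the estimate from the linear specification lies well above the true value of 1, while the specification that includes X^2 recovers it closely. A flexible estimate of µ∗(X) plays the role of the added X^2 term without requiring the researcher to know the right functional form in advance.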
4.2.4 R-learner for HTE estimation
In this subsection, we explain how we are going to estimate the model in (4.1). There
are many alternatives but we focus on one particular estimator, R-learner, proposed
by [73]. The problem [73] try to tackle is how to turn a good generic black-box predic-
tor into a good treatment effect estimator that has some nice theoretical properties.
In terms of what machine learning tools we can use, R-learner is more flexible than
most HTE estimation methods where “causal” variants of machine learning methods
still require efforts from specialized researchers. Our empirical studies in section 4.4
focus on applications of R-learner.
In this section, we only make assumption (4.5). Taking expectations on
both sides of equation (4.5) conditional on Xi only, we have
$$ E[Y_i \mid X_i, Z_i] = \mu^*(X_i) + Z_i \tau^*(X_i), $$
$$ E[Y_i \mid X_i] = \mu^*(X_i) + E[Z_i \mid X_i]\, \tau^*(X_i). \tag{4.10} $$
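Plugging (4.10) back into regression equation (4.1), we have the following:
$$ Y_i - m^*(X_i) = \big(Z_i - e^*(X_i)\big)\, \tau^*(X_i) + \varepsilon_i, \qquad E[\varepsilon_i \mid X_i, Z_i] = 0, \tag{4.11} $$
where
$$ m^*(x) := E[Y_i \mid X_i = x], \qquad e^*(x) := E[Z_i \mid X_i = x]. $$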
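The algebra behind (4.11) is short. As a sketch, write (4.10) as m∗(Xi) = µ∗(Xi) + e∗(Xi)τ∗(Xi) and subtract it from (4.1); the base-case term µ∗(Xi) cancels:

```latex
\begin{align*}
Y_i - m^*(X_i)
  &= \big[\mu^*(X_i) + Z_i \tau^*(X_i) + \varepsilon_i\big]
     - \big[\mu^*(X_i) + e^*(X_i)\, \tau^*(X_i)\big] \\
  &= \big(Z_i - e^*(X_i)\big)\, \tau^*(X_i) + \varepsilon_i .
\end{align*}
```

The residual εi is the same one as in (4.1), so its conditional mean given (Xi, Zi) remains zero.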
In the binary treatment setting, e∗(x) = E[Zi|Xi = x] described above gives the conditional probability of getting treatment and is often called propensity score in the
causal inference literature. The exposition here follows from [73] and the only minor
difference is that we have a continuous treatment variable Zi as opposed to binary.
Equation (4.11) is our main estimation equation in R-learner. In equation (4.11), we
subtract from Yi and Zi their conditional expectations given Xi.
This step is often called residualization and is very intuitive: we want to take out
the effects of the other controls Xi and isolate the treatment effect τ∗. [78] first utilized
the residualization form of equation (4.11) for the binary case.
We next recall the R-learner HTE estimator and a few of its variations below.
First, an oracle with knowledge of the true population quantities m∗(x) and e∗(x) can
estimate the function τ∗ by
• Oracle estimator $\tilde\tau(x)$:
$$ \tilde\tau := \arg\min_{\tau(x)} \left\{ \frac{1}{n} \sum_{i=1}^{n} \Big( Y_i - m^*(X_i) - \big(Z_i - e^*(X_i)\big)\, \tau(X_i) \Big)^2 + \Lambda_n(\tau(x)) \right\}, \tag{4.12} $$
where Λn is the regularization term that should be tuned by cross-validation (CV).
We could think of it as similar to the L1 penalty term in LASSO regression, or the
maximum tree depth in tree related methods. This term is crucial since without this
regularization, we could minimize the training error to a point overfitting the data
and failing to generalize.
In reality, we cannot implement the oracle estimator, but we can first estimate
m∗ and e∗ and plug our estimates $\hat m$ and $\hat e$ into the minimization problem in (4.12),
which is the R-learner proposed by [73]:
• R-learner $\hat\tau(x)$ is estimated as below, where $\hat m^{(-i)}(X_i)$ and $\hat e^{(-i)}(X_i)$ denote hold-out
predictions made by models $\hat m$ and $\hat e$ fitted to data without the ith data
point:
$$ \hat\tau := \arg\min_{\tau(x)} \left\{ \frac{1}{n} \sum_{i=1}^{n} \Big( Y_i - \hat m^{(-i)}(X_i) - \big(Z_i - \hat e^{(-i)}(X_i)\big)\, \tau(X_i) \Big)^2 + \Lambda_n(\tau(x)) \right\} \tag{4.13} $$
In summary, we could implement the R-learner in the following two steps:
• Step 1: Fit $\hat m(x)$ and $\hat e(x)$ via any black-box predictive methods tuned for optimal predictive accuracy using CV.
• Step 2: Minimize the causal loss function plus the regularization term Λn(τ(x)),
again via any black-box method. Use CV to tune hyperparameters to combat
overfitting:
$$ \hat\tau := \arg\min_{\tau(x)} \left\{ \frac{1}{n} \sum_{i=1}^{n} \Big( Y_i - \hat m^{(-i)}(X_i) - \big(Z_i - \hat e^{(-i)}(X_i)\big)\, \tau(X_i) \Big)^2 + \Lambda_n(\tau(x)) \right\} $$
Note that we use hold-out predictions for the nuisance components m∗(x) and e∗(x)
when estimating the quantity we care about in step 2. This usage of hold-out predictions is
also known as cross-fitting for its similarity to cross-validation. The difference is that
here we use hold-out predictions in fitting our main parameters of interest,
as opposed to evaluating the performance of predictive models. It is a widely used trick
for making correct statistical inference when machine learning methods are involved
([10] and [30]). Usually, k-fold cross-fitting is used, and in our empirical study, we
set k = 5.
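The two steps and the cross-fitting scheme can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the empirical study: ridge regression on polynomial features stands in for the black-box learners, τ(x) is restricted to a linear form a + bx, the penalty is fixed rather than tuned by CV, and the simulated data and all function names are hypothetical.

```python
import numpy as np

def poly_features(x, degree):
    """Polynomial basis 1, x, ..., x^degree for a one-dimensional feature."""
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(F, y, lam):
    """Ridge coefficients minimizing ||y - F c||^2 + lam * ||c||^2."""
    return np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ y)

def cross_fit_predict(x, y, k, degree, lam, rng):
    """Hold-out predictions: each point is predicted by a model fitted without its fold."""
    folds = rng.permutation(len(y)) % k          # random, balanced fold labels
    pred = np.empty(len(y))
    for f in range(k):
        held_out = folds == f
        coef = ridge_fit(poly_features(x[~held_out], degree), y[~held_out], lam)
        pred[held_out] = poly_features(x[held_out], degree) @ coef
    return pred

def r_learner_linear(x, y, z, k=5, degree=5, lam=1e-3, seed=0):
    """R-learner with tau(x) = a + b*x; returns the estimated (a, b)."""
    rng = np.random.default_rng(seed)
    # Step 1: cross-fitted nuisance estimates of m*(x) = E[Y|X] and e*(x) = E[Z|X].
    m_hat = cross_fit_predict(x, y, k, degree, lam, rng)
    e_hat = cross_fit_predict(x, z, k, degree, lam, rng)
    # Step 2: regress residualized Y on residualized Z times the basis of tau.
    y_res, z_res = y - m_hat, z - e_hat
    design = z_res[:, None] * poly_features(x, 1)
    return ridge_fit(design, y_res, lam)

# Simulated example (assumed): mu*(x) = sin(2x), e*(x) = x^2, tau*(x) = 1 + 0.5x.
rng = np.random.default_rng(42)
n = 4000
x = rng.uniform(-1, 1, n)
z = x**2 + rng.normal(size=n)
y = np.sin(2 * x) + z * (1 + 0.5 * x) + 0.5 * rng.normal(size=n)

theta = r_learner_linear(x, y, z)
print(theta)  # estimated (a, b), to be compared with the true (1, 0.5)
```

With a linear basis for τ, step 2 reduces to a ridge regression of the outcome residuals on the treatment residuals interacted with the basis, which is why one `ridge_fit` call suffices. The ridge penalty here multiplies the sum rather than the mean of squared errors, which only rescales the penalty level relative to (4.13). A more flexible τ would replace that final regression with any learner that can minimize the same weighted squared loss.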
[73] show that R-learner has nice theoretical guarantees. For example, with additional
assumptions on the true form of τ∗ and the machine learning algorithms used, [73]
prove that $E[(\hat\tau_n(X_i) - \tau^*(X_i))^2]$ converges to 0 as fast as $E[(\tilde\tau_n(X_i) - \tau^*(X_i))^2]$.
Also, as mentioned earlier, although the nice properties from theory require certain
conditions, we are free to use any black box predictors in R-learner when implementing
it in practice, which makes the approach very flexible.
[73] named their procedure R-learner, based on Robinson’s transformation, partly
to recognize [78]’s work and partly to emphasize the importance of residualization.
In the next section, we describe how we apply R-learner to our setting of factor