2.3 Impact analysis
2.3.2 Propensity score matching
Causal effects of LPWFI and NDLP
Both LPWFI and NDLP are programmes that are accessible to all eligible persons on IS. The central question for social scientists as well as for policy makers is whether these programmes actually increase the employment chances of the people they seek to help. However, such a causal effect of LPWFI/NDLP is difficult to detect because only one outcome can be observed for the participants: the participation outcome. The alternative comparison/non-participation outcome, with which the observed participation outcome has to be contrasted, is not observed. This contrast is required to estimate the effect of treatment-on-the-treated, the causal effect of the programme following the causality concepts of, e.g., Roy (1951) or Rubin (1974).
The causal effect of treatment-on-the-treated can be identified by comparing the results of a programme for the participating individuals after the treatment with the hypothetical situation of the same individuals if they had not taken part in the programme (see Section A.4.1 for a formal statistical exposition of this discussion). The evaluation aim is to estimate the effect of treatment-on-the-treated, given by the difference between the expected outcome of the group that participated and the expected outcome this group would have had without participation. The latter outcome cannot be observed for programmes implemented non-experimentally, i.e. the participants will never provide the comparison/non-participation outcome.
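The verbal definition above can be written compactly. The notation here is an assumption made for illustration and may differ from that used in Section A.4.1: Y^1 and Y^0 denote the potential outcomes with and without participation, and D = 1 marks participants.

```latex
\[
\theta_{ATT}
  = E\left[\,Y^{1} - Y^{0} \mid D = 1\,\right]
  = \underbrace{E\left[\,Y^{1} \mid D = 1\,\right]}_{\text{observed}}
  - \underbrace{E\left[\,Y^{0} \mid D = 1\,\right]}_{\text{counterfactual, never observed}}
\]
```

The first term can be estimated directly from the participants' outcomes; the evaluation problem lies entirely in the second term.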
In principle, two alternative approaches can be applied to estimate the average non-treatment outcome: the situation of programme participants before treatment (before/after comparison) or a control group of people who did not participate. Matching uses the second of these approaches. However, the average value of the outcome of non-participants typically does not represent the correct average non-treatment outcome. Participants and non-participants might differ in characteristics that influence the outcome variable. The participants then differ from non-participants before treatment in observable and unobservable characteristics, giving rise to a selection bias. To correct for selection on observables, the conditional independence assumption (CIA) is used. It implies that the average outcome without treatment can be estimated from people in the comparison/non-participating group, as long as they have the same characteristics as the participants.
Conditional independence assumption
Under the CIA, the participating group and the non-participating group in a programme are comparable in their non-treatment outcome conditional on the observable characteristics. The observable characteristics should include as many attributes as necessary to describe all the differences between the participants and non-participants. The CIA as discussed here is formally defined in Section A.4.2.
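As a compact restatement of the paragraph above, the assumption can be sketched as follows. The symbols are illustrative assumptions (Y^0 for the non-treatment outcome, D for participation, X for the observable characteristics) and need not match the formal definition in Section A.4.2.

```latex
\[
Y^{0} \perp D \mid X
\quad\Longrightarrow\quad
E\left[\,Y^{0} \mid D = 1, X = x\,\right]
  = E\left[\,Y^{0} \mid D = 0, X = x\,\right]
\]
```

In words: among people with the same observable characteristics x, the non-participants' average outcome can stand in for the participants' unobserved non-treatment outcome.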
As described under Section 2.3.1, more than one outcome of the programme alternatives in the case of LPWFI/NDLP is observed. In particular, there are three outcomes of LPWFI/NDLP combinations and one outcome of total non-participation in the programme alternatives. In terms of a causal analysis, such a multiple treatment structure is exactly equivalent to the single-treatment case: only one of the potential outcomes is observed, and all the other alternatives remain hypothetical counterfactuals.
To find out how much better off a person is due to participation in a combined LPWFI/NDLP, one must consider that this person could have been in either of the two other programme alternatives or could have been a non-participant.
Lechner (1999) proves that the CIA in the case of multiple treatments can be statistically described. Hence any causal effect of participation in a specific programme alternative (e.g. LPWFI/NDLP) compared with the hypothetical participation in another programme (e.g. LPWFI only) can be estimated if these two specific groups are used, conditioning on all the observable characteristics (in this case, the estimated effect would show the relative NDLP effect for participants in LPWFI). All the important policy effects of LPWFI/NDLP can be estimated as outlined under Section 2.3.1, and for each case of impact estimation, only the sub samples of participants in the treatments compared are needed to estimate the relative impact.
The propensity score
The major disadvantage of matching is the ‘curse of dimensionality’, i.e. it might be difficult to match with respect to a high-dimensional vector of observable characteristics, because one might not be able to find appropriate comparison observations. Therefore, most evaluation studies use the result by Rosenbaum and Rubin (1983) that the CIA also holds with respect to the probability of treatment (‘propensity score’) as a function of the observable characteristics. The statistical definition of the propensity score as outlined here is provided in Section A.4.3. This result allows matching on the one-dimensional probability. Effectively, the ‘closeness’ of the propensity scores of control observations to those of the treated individuals is used to construct an estimator for the non-treatment outcome. This dimension reduction diminishes the problem of finding adequate matches and the problem of empty cells. However, propensity score matching requires the propensity score itself to be estimated. Therefore, to draw robust inference for the estimated treatment effect, the standard error of the estimated treatment parameter should take account of the fact that the propensity score used for matching is a pre-estimated quantity (see Heckman et al., 1999, Section 7.4.1).
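The dimension reduction described above can be illustrated with a minimal sketch on synthetic data. Everything in it is an assumption made for illustration rather than the procedure used in this evaluation: the data are simulated, the function names are hypothetical, a logit model fitted by plain gradient ascent stands in for the probit model, and one-to-one nearest-neighbour matching with replacement stands in for the local linear weighting.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(row, beta):
    """Estimated participation probability for one covariate row."""
    z = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
    return sigmoid(z)


def fit_logit(X, d, steps=500, lr=0.5):
    """Fit a logit participation model by gradient ascent on the mean log-likelihood."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)  # intercept first, then one coefficient per covariate
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for row, di in zip(X, d):
            resid = di - score(row, beta)
            grad[0] += resid
            for j, x in enumerate(row):
                grad[j + 1] += resid * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def att_nearest_neighbour(scores, d, y):
    """Treatment-on-the-treated via nearest-neighbour matching on the score (with replacement)."""
    controls = [(s, yi) for s, di, yi in zip(scores, d, y) if di == 0]
    gaps = []
    for s, di, yi in zip(scores, d, y):
        if di == 1:
            # Closest control in terms of the one-dimensional score.
            _, y0 = min(controls, key=lambda c: abs(c[0] - s))
            gaps.append(yi - y0)
    return sum(gaps) / len(gaps)


def naive_difference(d, y):
    """Unadjusted difference in mean outcomes between participants and non-participants."""
    treated = [yi for di, yi in zip(d, y) if di == 1]
    control = [yi for di, yi in zip(d, y) if di == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def simulate(n=1000, effect=1.0, seed=0):
    """Synthetic data: two covariates drive both participation and the outcome."""
    rng = random.Random(seed)
    X, d, y = [], [], []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        di = 1 if rng.random() < sigmoid(0.8 * x1 - 0.5 * x2) else 0
        X.append([x1, x2])
        d.append(di)
        y.append(2.0 * x1 - x2 + effect * di + rng.gauss(0, 1))
    return X, d, y
```

Calling `simulate()`, then `fit_logit(X, d)`, scoring every row with `score`, and passing the scores to `att_nearest_neighbour` should give an estimate near the simulated effect of 1.0, whereas `naive_difference` is pushed well above it by selection on the two covariates. The matching step only ever compares scalar scores, which is the point of the Rosenbaum and Rubin result.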
As in the single-treatment case, rather than conditioning on the observable characteristics directly, it is possible to condition instead on the propensity score in the multiple-treatment case. Lechner (1999) proves that:
• it is sufficient to condition on the propensity score instead of a vector of observable characteristics X;
• as before, only the sub samples of participants in the two treatments being compared are needed to estimate the relative impacts.
If the CIA is satisfied, matching offers an attractive means of carrying out programme evaluations since it is not dependent on any functional form assumptions. Matching then allows for the heterogeneity of effects across individuals and can correct for important biases associated with evaluations (Heckman et al., 1999). However, participating individuals can only be matched if individuals with similar characteristics exist among the comparisons/non-participants (i.e. common support in the observable characteristics).
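The common support requirement can be checked in several ways. The sketch below uses a simple minimum/maximum rule on the propensity score, which is one common convention and an assumption here, not necessarily the rule applied in this evaluation; the function name is hypothetical.

```python
def treated_on_common_support(scores, treated):
    """Indices of treated units whose score lies within the range of the control scores."""
    control_scores = [s for s, t in zip(scores, treated) if t == 0]
    lo, hi = min(control_scores), max(control_scores)
    return [
        i
        for i, (s, t) in enumerate(zip(scores, treated))
        if t == 1 and lo <= s <= hi
    ]
```

Treated units outside the returned set have no comparable non-participants under this rule, so any estimated effect would apply only to the participants who remain.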
Subject to all requirements mentioned above, the CIA allows non-participants’ outcomes to be used to infer participants’ counterfactual outcomes, therefore allowing impacts to be estimated. The assumption required to estimate the effects of multiple treatment programmes is an intuitive generalisation of the single case. Now the outcome that would result from treatment is assumed to be independent of treatment group after controlling for differences in individual characteristics.
The quality of matching and bootstrap
Balancing properties for observable characteristics
A simple test for the quality of matching is the standard t-test, which assesses whether the means of two groups are statistically different from each other with respect to the observable characteristics prior to treatment. The observable characteristics of the matched controls are predicted for the matched sample with a local linear model, applying the same weighting formula as for the dependent variable. These ‘non-treatment characteristics of the treated’ are then subject to a simple t-test (see footnote 12). The complete formula of the test is shown in Section A.4.4.
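The balancing check can be sketched for a single covariate as follows. This is a simplified illustration under stated assumptions: it computes an unweighted two-sample (Welch) t-statistic and ignores the local linear weighting described above, so it is not the formula given in Section A.4.4. The function name is hypothetical.

```python
import math
from statistics import mean, variance


def balance_t_statistic(treated_values, matched_values):
    """Difference in covariate means divided by its estimated standard error."""
    n1, n0 = len(treated_values), len(matched_values)
    difference = mean(treated_values) - mean(matched_values)
    standard_error = math.sqrt(
        variance(treated_values) / n1 + variance(matched_values) / n0
    )
    return difference / standard_error
```

A statistic close to zero for every covariate suggests that the matched comparison group resembles the participants before treatment; large absolute values point to covariates that remain unbalanced.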
Pre-programme tests
Pre-programme tests investigate whether the chosen method has properly controlled for selection effects. The idea behind the pre-programme test is that the correction for selection bias should make the employment history of participants and non-participants comparable before treatment. If the matching does not align the pre-programme outcomes of the future participants and the control group, the validity of the matching is rejected (Heckman, Hotz 1989: 866). Put differently, significant differences before treatment indicate remaining time-invariant differences that the matching might not have captured, so that the treatment effect cannot be obtained simply by comparing the outcomes of the treatment group with those of the matched control group (see footnote 13). The formal definition of this is shown in Section A.4.4. In practice, this pre-programme test is implemented by showing the benefit level for the participant and non-participant groups before matching, in graphical representations of the outcomes in Section 2.3.4.
12 The t-test is a ratio of the difference between the two means of the treatment and the matched comparison group (numerator) and the dispersion of the scores (denominator). It is an example of the signal-to-noise metaphor: the difference between the means is the signal; the denominator is a measure of variability that is essentially noise that may make it harder to see the group difference.
Standard errors and bootstrap
The dimension reduction feature of matching on the propensity score comes at a cost: the propensity score itself has to be estimated, here by a parametric probit model. Standard errors that ignore this estimation step are therefore likely to underestimate the variability of the estimated treatment effects. To take account of the sampling variability of the propensity score estimate, one can implement a bootstrap procedure for the construction of confidence intervals. This procedure is further described in Section A.4.4. However, Abadie and Imbens (2005) find that the bootstrap estimates are not in general valid. As a result, bootstrapping is not implemented, and the standard errors should be interpreted with caution because they are potentially underestimated.