Part II Statistical Methodology
9.2 Investigating the drop in performance of the IPD-SIDES method
To understand why the issue occurs, the first step is to observe what happens when the method is applied to a single simulated dataset. A single dataset (N = 5000) was generated with one very large one-way interaction, and the IPD-SIDES method was applied to it. Table 9.1 displays the final subgroups detected. The method should detect only one subgroup (i.e. the first row of Table 9.1); however, it goes on to split further and detects another subgroup. One possible reason is that no restriction is placed on the computed differential effect p-value, i.e. the splitting criterion (final column in Table 9.1). Recall how the method works. First, it evaluates the differential effect splitting criterion for all possible splits and retains, for each covariate, the single best split, i.e. the split with the smallest p-value. These splits are then ordered by differential effect p-value from smallest to largest. Thereafter, only the top M splits with the largest differential effect (i.e. the smallest p-values) are considered, where M is pre-specified by the user. The ordered splits, though they may not be significant, are explored individually, and from each split only the subgroup with the more enhanced treatment effect is retained, provided it meets the continuation criterion. Hence, the differential effect splitting criterion is used merely to order the splits, regardless of their significance, and it is therefore possible for the method to detect a spurious subgroup resulting from a split with a non-significant differential effect. One possible solution would be to impose a restriction on the computed differential effect p-values so that the procedure only considers splits whose p-value falls below a certain threshold, e.g. 0.05 or 0.10. If such a restriction is applied to the example in Table 9.1, then only one
criterion is used to aid the search process of the method to identify candidate subgroups. Therefore, the restriction placed on the splitting criterion p-value need not be that strict (p ≤ 0.05), and instead a less stringent restriction can be imposed (p ≤ 0.10) so that the method has the flexibility to identify subgroups that might be plausible.
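The ordering step and the proposed p-value restriction can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Split` class, the `candidate_splits` function, the `alpha` argument and the example p-values are all hypothetical and chosen only to show how a threshold changes which splits are explored.

```python
from dataclasses import dataclass


@dataclass
class Split:
    covariate: str   # covariate the split is made on
    cut: float       # cut-point defining the two child nodes
    p_value: float   # differential effect p-value (splitting criterion)


def candidate_splits(splits, M, alpha=None):
    """Keep the best split per covariate, order by p-value, take the top M.

    If alpha is given (an assumed, user-chosen threshold such as 0.10),
    splits with a differential effect p-value above alpha are dropped.
    """
    best = {}
    for s in splits:
        if s.covariate not in best or s.p_value < best[s.covariate].p_value:
            best[s.covariate] = s
    ordered = sorted(best.values(), key=lambda s: s.p_value)
    top = ordered[:M]
    if alpha is not None:
        top = [s for s in top if s.p_value <= alpha]
    return top


# Hypothetical p-values: one strong split and two non-significant ones.
splits = [
    Split("age", 50, 1e-8),
    Split("age", 60, 0.02),
    Split("bmi", 25, 0.41),
    Split("sex", 0.5, 0.73),
]

unrestricted = candidate_splits(splits, M=2)
restricted = candidate_splits(splits, M=2, alpha=0.10)

print([s.covariate for s in unrestricted])  # ['age', 'bmi']
print([s.covariate for s in restricted])    # ['age']
```

Without the threshold, the non-significant split is still explored simply because it ranks second; with the threshold it is never considered.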
Another reason why the procedure goes on to detect additional subgroups is the continuation criterion used by the method. Recall from chapter 6 that the method only keeps a split provided it satisfies the continuation criterion $p_c \le \gamma \cdot p_p$, i.e. the one-sided treatment effect p-value of the newly formed child node ($p_c$) must be less than or equal to the complexity parameter value ($\gamma$) multiplied by the one-sided treatment effect p-value of the parent node from which it came ($p_p$). As we are dealing with large sample sizes, the treatment effect estimates in the simulated datasets have more precision, i.e. smaller standard errors. Therefore, if the effect size is large, dividing by a very small standard error produces an extremely small one-sided p-value for the treatment effect. Often the p-values are so small that computational limitations mean they are calculated or stored as zeros. For example, the first subgroup selected in Table 9.1 (first row) has a treatment effect of 0.808 and a one-sided p-value of zero. The method then goes on to select a second subgroup (second row in Table 9.1) because its one-sided p-value is equal to the p-value of the parent node from which it came: both are stored as zero, so the continuation criterion is satisfied. A solution to this problem would be to change the continuation criterion so that the one-sided p-value of the child node must be strictly less than the complexity parameter value multiplied by the one-sided p-value of the parent node, i.e. $p_c < \gamma \cdot p_p$. However, the aim of the method is to identify several candidate subgroups with enhanced treatment effect. Changing the inequality to a strict one would therefore disregard several candidate subgroups that are similar to the parent node from which they were formed.
Thus, changing the inequality might not be a suitable solution with regard to the objective of the method.
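The effect of p-values being stored as zero can be reproduced with a short Python sketch. The test statistics (50 and 52), the value of the complexity parameter and the function names are assumptions made purely for illustration; they are not taken from Table 9.1 or from the authors' code.

```python
import math


def one_sided_p(z):
    """Upper-tail standard normal p-value, 1 - Phi(z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def keeps_split(p_child, p_parent, gamma, strict=False):
    """Continuation criterion: p_child <= gamma * p_parent (or < if strict)."""
    bound = gamma * p_parent
    return p_child < bound if strict else p_child <= bound


# Hypothetical, very large test statistics (large effect, tiny standard error).
p_parent = one_sided_p(50.0)
p_child = one_sided_p(52.0)
gamma = 0.5  # hypothetical complexity parameter

print(p_parent, p_child)                                   # 0.0 0.0
print(keeps_split(p_child, p_parent, gamma))               # True
print(keeps_split(p_child, p_parent, gamma, strict=True))  # False
```

Both p-values underflow to exactly zero in double precision, so the non-strict criterion reduces to 0 ≤ 0 and the child node is retained whatever the value of the complexity parameter.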
From observing the output from the single simulated dataset, another issue with the method becomes apparent. Notice how the second subgroup identified by the method has a treatment effect of 0.843. This treatment effect is actually not that different from the treatment effect in the disregarded subgroup (0.773). The reason the method considers this split is because of the differential effect splitting criterion used by the method. Recall that this splitting criterion is of the form
$$p = 2 \cdot \left[ 1 - \Phi\left( \frac{|Z_{E1} - Z_{E2}|}{\sqrt{2}} \right) \right] \qquad (9.1)$$
where $Z_{E1}$ and $Z_{E2}$ are the one-sided test statistics from the tests computed in child nodes 1 and 2 respectively. As explained earlier, the splitting criterion is only used to order all of the splits in terms of the differential effect p-value. Thereafter, the best M splits are considered, from which the subgroups with the larger treatment effect are retained (provided they meet the continuation criterion). Although the authors propose this splitting criterion to perform the differential effect search, it is probably not the most appropriate way to evaluate the differential effect directly. As a result, it is quite possible that larger differential effects will go unnoticed. The reason is that the criterion can give a significant p-value when the treatment effects in the two child nodes are similar but the standard errors are sufficiently different. In this case the difference in standardized Z statistics in (9.1) may be large even if the difference in (non-standardized) effect sizes is small. As an example, suppose a split is formed where both child nodes have a treatment effect of 4.0, but the SE in the left child node is 0.2 and the SE in the right child node is 0.9. The one-sided test statistics $Z_{E1} = 4.0/0.2 = 20$ and $Z_{E2} = 4.0/0.9 \approx 4.44$ suggest that there is a big differential effect between the two groups, and thus the splitting criterion defined by equation (9.1) would indicate that the differential effect is highly significant. If the same split were evaluated using a regression model that includes an interaction effect, the test statistic for the interaction effect would be very small, indicating that no differential effect is present. What this means is that the current splitting criterion is more likely to select splits where there is a larger subgroup with more precision, i.e. smaller SE, compared to a smaller subgroup with less precision, i.e. larger SE, regardless of whether the treatment effects are the same or not. The current splitting criterion is therefore not an appropriate approach for performing the differential effect search for the objective we wish to use it for, and so an alternative criterion is required. As mentioned earlier in this thesis, the most appropriate method of directly evaluating a differential subgroup effect is to use a statistical test for interaction. Therefore, one way to get the method to do what we require is to define a new splitting criterion that uses the interaction effect estimate test statistic to obtain a differential effect p-value. We can define the new splitting criterion as follows
$$p = 2 \cdot \left[ 1 - \Phi(Z_{int}) \right] \qquad (9.2)$$
where $Z_{int}$ is the two-sided hypothesis test statistic computed for the interaction effect estimate obtained from fitting an appropriate regression model (linear, fixed or mixed model) that includes an interaction term, and where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution.
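The contrast between criteria (9.1) and (9.2) can be checked numerically on the worked example (treatment effect 4.0 in both child nodes, SE 0.2 and 0.9). The Python sketch below is illustrative only and rests on two assumptions not stated in the text: the interaction test statistic is approximated by a Wald statistic for the difference between two independent subgroup estimates, rather than being taken from a fitted regression model, and the absolute value of that statistic is used so that the p-value stays between 0 and 1. The function names are hypothetical.

```python
import math


def two_sided_p(z):
    """2 * [1 - Phi(|z|)], computed with erfc to keep precision in the tail."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def p_original(effect1, se1, effect2, se2):
    """Criterion (9.1): based on the difference in standardized Z statistics."""
    z1 = effect1 / se1
    z2 = effect2 / se2
    return two_sided_p((z1 - z2) / math.sqrt(2.0))


def p_interaction(effect1, se1, effect2, se2):
    """Criterion (9.2) with an assumed Wald approximation for Z_int:
    difference in effects divided by the SE of that difference."""
    z_int = (effect1 - effect2) / math.sqrt(se1 ** 2 + se2 ** 2)
    return two_sided_p(z_int)


# Worked example from the text: equal effects, unequal standard errors.
p_91 = p_original(4.0, 0.2, 4.0, 0.9)
p_92 = p_interaction(4.0, 0.2, 4.0, 0.9)

print(p_91)  # vanishingly small: (9.1) flags a "highly significant" split
print(p_92)  # 1.0: no differential effect
```

Under these assumptions the original criterion reports an extremely small p-value even though the two treatment effects are identical, while the interaction-based criterion returns a p-value of 1.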
A closer inspection of the code provided by the authors identified yet another important issue with the method. Having ordered the splits using the original splitting criterion, the method selects from each split the subgroup with the larger one-sided test statistic, i.e. the smaller p-value, rather than the subgroup with the larger treatment effect. Thus the method aims to find subgroups with the smallest p-value rather than the largest treatment effect, which is not what we want the method to do; the code provided by the authors does not do what we require. To address this, the code can be changed so that the method selects the subgroup with the largest treatment effect instead of the subgroup with the largest one-sided test statistic. In this way, the objective of the method changes to identifying subgroups with enhanced treatment effect rather than subgroups with the smallest p-value. From now onwards, this will be referred to as the modified IPD-SIDES method.
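The difference between the two selection rules can be seen in a small Python sketch. The `Child` class, the two picker functions and the effect and SE values are hypothetical and serve only to show a case where the rules disagree; they do not reproduce the authors' code.

```python
from dataclasses import dataclass


@dataclass
class Child:
    label: str
    effect: float  # estimated treatment effect in the child node
    se: float      # standard error of that estimate

    @property
    def z(self):
        """One-sided test statistic for the treatment effect."""
        return self.effect / self.se


def pick_original(children):
    """Rule found in the authors' code: largest test statistic (smallest p-value)."""
    return max(children, key=lambda c: c.z)


def pick_modified(children):
    """Modified rule: largest treatment effect."""
    return max(children, key=lambda c: c.effect)


# Hypothetical split: a precise child with a modest effect and
# a less precise child with a larger effect.
left = Child("left", effect=0.50, se=0.05)
right = Child("right", effect=0.80, se=0.20)

print(pick_original([left, right]).label)  # left
print(pick_modified([left, right]).label)  # right
```

Here the original rule retains the more precisely estimated child, whereas the modified rule retains the child with the larger estimated treatment effect.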