
The reliability of demonstrator-consistent responding, and the distribution of effect sizes

Tables 5.2 and 5.3 summarise the effect sizes for the complete, somewhat heterogeneous, sample of study effects. Effect sizes are greater than zero when there is a demonstrator-consistent responding tendency, and less than zero when there is a demonstrator-inconsistent responding tendency. In Table 5.2, all 162 effect sizes are represented in the form of a stem-and-leaf display, which illustrates the distribution of these effect sizes. Table 5.3 displays statistics summarising these data. The discrepancy between the value of .75(Q3 - Q1) and SD (.8 and 1.3, respectively) suggests that these effect sizes are not normally distributed, because the quantity .75(Q3 - Q1) is similar to SD when a set of scores is normally distributed (Rosenthal, 1991). This might be because the distribution is slightly bimodal. In addition to the clear frequency peak for effect sizes that are approximately zero (mode = 0.05, 20 cases), there appears to be a second, smaller, peak for larger effect sizes (mode = 1.45, 7 cases). This suggests that the effect size of a demonstrator-consistent responding tendency might be influenced by more than just random error, and that certain testing conditions might result in a relatively large demonstrator-consistent responding effect. An evaluation of whether either the type of experimental design employed, or the identity of the experimenter, has moderated the magnitude of these effect sizes is presented in a later section of this chapter.
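The rule of thumb used above can be sketched in a few lines of Python. This is only an illustration: the effect sizes listed are hypothetical (they are not the 162 values of Table 5.2), and the function name `iqr_sd_check` is invented for the example. The sketch first confirms that .75(Q3 - Q1) is close to one SD for a normal distribution, and then shows how a cluster near zero combined with a few large values pulls the two quantities apart.

```python
import statistics
from statistics import NormalDist

def iqr_sd_check(values):
    """Return (.75 * (Q3 - Q1), SD) for a list of effect sizes."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 0.75 * (q3 - q1), statistics.stdev(values)

# For a standard normal distribution, .75 * IQR is close to one SD.
z = NormalDist()
normal_ratio = 0.75 * (z.inv_cdf(0.75) - z.inv_cdf(0.25))

# Hypothetical effect sizes: a peak near zero, a second cluster of larger
# values, and one extreme value. These are NOT the thesis data.
effects = [-0.3, -0.1, 0.0, 0.05, 0.05, 0.1, 0.2, 0.3, 1.4, 1.45, 1.5, 5.5]
scaled_iqr, sd = iqr_sd_check(effects)

print(f"normal reference: .75(Q3 - Q1) = {normal_ratio:.2f} SD units")
print(f"sample: .75(Q3 - Q1) = {scaled_iqr:.2f}, SD = {sd:.2f}")
```

A marked gap between the two sample quantities, as in the data reported here (.8 versus 1.3), is a sign of heavy tails or more than one mode rather than proof of either.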

The stem-and-leaf display presented in Table 5.2 indicates that there are a number of quite extreme values. For example, the maximum effect size (5.5; JSl, NDR test, reported as Heyes & Dawson, 1990) is 4.1 standard deviations from the (unweighted) mean, while the minimum effect size (-10.7; JSl 1, unpub.) deviates by 8.4 standard deviation units. The studies yielding the 11 effect sizes categorised in the stem-and-leaf display as extreme values were small in terms of number of subjects; the average number of observers employed in these studies (Mdn = 8) was small relative to that for the complete sample (Mdn = 13). In general, effect sizes for smaller scale studies are thought to be representative of the true population effect size less often than those for larger scale studies (see, e.g., Rosenthal, 1991).

Apart from the mode, measures of the central tendency of these effect sizes suggest that the population treatment effect for bidirectional control experiments is a weak demonstrator-consistent responding tendency. The unweighted mean, weighted mean, and median effect sizes all provide different estimates of the population treatment effect, but all are in agreement that this is small and positive. The difference between the weighted mean (M = .22) and the unweighted mean (M = .19) arises because the former takes the size of the study (number of observers) into account. Because smaller scale studies were found to be a more variable estimator of a demonstrator-consistent responding effect, the weighted version is probably the preferable type of mean in this situation. In comparison to these means, the median effect size (Mdn = .12) was found to be small. In contrast to the mean, the median is fairly unaffected by the second peak that occurred for larger effect sizes. Thus, the median more faithfully represents the central tendency of the main part of the distribution.
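The relationship between the three indices can be illustrated with a minimal sketch. The effect sizes and observer counts below are hypothetical, and weighting each effect size directly by its number of observers is an assumption made for the example; the text states only that the weighted mean takes study size into account.

```python
import statistics

def weighted_mean(effects, n_observers):
    """Mean of effect sizes, each weighted by its number of observers."""
    total = sum(n_observers)
    return sum(e * n for e, n in zip(effects, n_observers)) / total

# Hypothetical study effects and observer counts (NOT the thesis data).
# The smallest study contributes the most extreme value.
effects = [0.05, 0.10, 0.12, 0.30, 1.45, -2.0]
n_obs = [20, 16, 14, 12, 10, 6]

unweighted = statistics.mean(effects)
weighted = weighted_mean(effects, n_obs)
median = statistics.median(effects)

print(f"unweighted mean = {unweighted:.3f}")
print(f"weighted mean   = {weighted:.3f}")
print(f"median          = {median:.3f}")
```

In this illustration the extreme value from the smallest study pulls the unweighted mean towards zero, whereas the weighted mean discounts it and the median ignores its magnitude altogether.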

The large variability within this sample of effect sizes limits the confidence with which the average effect size, measured by any of the indices of central tendency, can be taken to represent the true extent of a demonstrator-consistent responding effect. Only slightly more than half (59%) of effect sizes were in the positive direction, and their standard deviation was large relative to the average effect size. This is illustrated by the 95% confidence interval which was calculated for the unweighted mean (based upon the number of effect sizes, not the number of observers), and found to range from -.02 to .40. The confidence interval can be thought of as the precision with which one may pinpoint the population effect size: an interval constructed in this way would contain the population value on 95% of occasions on which it was calculated for a different random sample of 162 effect sizes drawn from the same theoretical population. Thus, the population effect size for demonstrator-consistent responding cannot be located precisely, and a true effect size of 0.00 cannot be ruled out as being unlikely (at the 5% level).
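As a rough check on this interval, the calculation can be sketched from the rounded summary statistics reported in this section (unweighted mean = .19, SD = 1.3, 162 effect sizes). The sketch assumes a normal-theory interval, which may not be the exact procedure that was used, and because the inputs are rounded it reproduces the reported limits only approximately. The function name `mean_ci` is introduced for illustration.

```python
import math
from statistics import NormalDist

def mean_ci(mean, sd, k, level=0.95):
    """Normal-theory confidence interval for a mean of k effect sizes."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z_crit * sd / math.sqrt(k)
    return mean - half_width, mean + half_width

# Rounded values taken from the text; k is the number of effect sizes,
# not the number of observers.
lo, hi = mean_ci(mean=0.19, sd=1.3, k=162)
print(f"95% CI: {lo:.2f} to {hi:.2f}")
```

The interval obtained in this way spans zero, in agreement with the conclusion that a true effect size of 0.00 cannot be excluded at the 5% level.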

Figure 5.1: Boxplot of effect size by the seven categories of experimental design. Boxes contain the 50% of effect sizes between Quartiles 1 and 3; the central bar represents the median; whiskers extend to the highest and lowest values, excluding outliers and extreme values; outliers (circles) are between 1.5 and 3 box lengths from the box. Three extreme values (more than 3 box lengths from the box), which are not illustrated, were produced by the NDR test (-10.7, 5.2, 5.5). "NDR", non-differential reinforcement test; "DR", differential reinforcement test; "REV", reversal test; "EXT", extinction test; "TRAN", transfer test following conditional discrimination training; "offB", shift from baseline; "NoTrain_X", extinction test without pretraining.

[Figure 5.1 axis labels, N per design category: DR, 14; REV, 4; EXT, 11; NoTrain_X, 3; NDR, 123; TRAN, 4; offB, 3.]

Nonetheless, a demonstrator-consistent responding tendency was found to be statistically reliable using the two types of significance test most commonly employed in meta-analyses (Rosenthal, 1991). The first of these tests is the vote counting method advocated by Hedges and Olkin (1980). Under the null hypothesis, 50% of the effect sizes should be positive (suggesting demonstrator-consistent responding), and 50% of the effect sizes should be negative (suggesting demonstrator-inconsistent responding). In fact, as was noted above, 59% of the effect sizes were positive. Using the binomial test, this proportion was found to be unlikely to have resulted from chance alone, Z = 2.1, p = .033. Thus, any numerical difference between group means found for a bidirectional control experiment is more likely to be in accordance with a demonstrator-consistent responding tendency than a demonstrator-inconsistent responding tendency.
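The vote counting test can be sketched as a normal approximation to the binomial distribution. Two assumptions are made here that are not stated in the text: that 59% of 162 corresponds to 95 positive effect sizes, and that a continuity correction and a two-tailed probability were used. Under these assumptions the sketch gives values close to those reported, but it should be read as an illustration rather than a reconstruction of the original analysis.

```python
import math
from statistics import NormalDist

def sign_test_z(n_positive, n_total, continuity=True):
    """Normal approximation to the binomial (sign) test against p = .5.

    Returns the magnitude of Z and the two-tailed probability.
    """
    expected = n_total / 2
    sd = math.sqrt(n_total) / 2          # sqrt(n * .5 * .5)
    diff = abs(n_positive - expected)
    if continuity:
        diff = max(diff - 0.5, 0.0)      # continuity correction
    z = diff / sd
    p_two_tailed = 2 * (1 - NormalDist().cdf(z))
    return z, p_two_tailed

# ASSUMPTION: 95 of the 162 effect sizes were positive (about 59%).
z, p = sign_test_z(n_positive=95, n_total=162)
print(f"Z = {z:.2f}, two-tailed p = {p:.3f}")
```

Because the test counts only the direction of each effect size, it ignores both the magnitude of the effects and the size of the studies that produced them.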

A different approach was taken for the second significance test. Rather than testing for significance on the basis of 162 study effects, this approach takes the subject as the unit of analysis (N = 2113). When the findings of the sample of study effects were combined using the Stouffer method (see, e.g., Rosenthal, 1991), observers' tendency to respond in the same direction as that of their demonstrator was found to be highly significant, Z = 4.3, p < .00003. This result suggests that a demonstrator-consistent responding tendency occurs in bidirectional control tests, and that it is detectable when data are pooled to approximate a large sample of observers. The discrepancy in the confidence with which the Stouffer and vote counting methods indicate a demonstrator-consistent responding tendency could be accounted for if bidirectional control tests are not sensitive enough to reliably establish demonstrator-consistent responding effects with the sample sizes that have typically been used. The sensitivity of bidirectional control tests is examined in a later section.
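The Stouffer method of combining results can be sketched as follows. Each study contributes a standard normal deviate Z, and the combined value is the sum of these deviates divided by the square root of their number; an optional weighted form is also shown. The Z values below are hypothetical, and whether the original analysis used the unweighted or a weighted form is not stated in this section, so the choice here is an assumption.

```python
import math
from statistics import NormalDist

def stouffer_z(z_scores, weights=None):
    """Combine per-study Z scores using the Stouffer method.

    With no weights this is sum(Z) / sqrt(k); with weights it is
    sum(w * Z) / sqrt(sum(w ** 2)).
    """
    if weights is None:
        weights = [1.0] * len(z_scores)
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator

# Hypothetical per-study Z scores (NOT the thesis data); positive values
# indicate demonstrator-consistent responding.
z_scores = [0.4, 1.1, -0.3, 0.9, 1.6, 0.2]
combined = stouffer_z(z_scores)
p_one_tailed = 1 - NormalDist().cdf(combined)
print(f"combined Z = {combined:.2f}, one-tailed p = {p_one_tailed:.3f}")
```

In this illustration no single study reaches conventional significance, yet the combined Z is larger than most of the individual values. This is the sense in which pooling can reveal a weak but consistent tendency that individual small studies are not sensitive enough to detect.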