There are several identifiable limitations of the present study that must be borne in mind when interpreting the results. One of the most critical of these relates to the nature of the stimuli on which ratings were based. Participants were required to evaluate a hypothetical lecturer whose performance was
represented by statements contained in a written vignette. Using this method
The use of "paper people" as rating stimuli poses questions regarding the
external validity of the research. Can the results from the present study be generalised to "real life" rating situations, such as performance appraisals or
reference reports?
There is evidence that studies using "paper person" designs result in different experimental outcomes (larger effect sizes) compared to those that have used
direct observation designs (Murphy, Herr, Lockhart, & Maguire, 1986). Woehr
and Lance (1991) have tested competing explanations for these observed
differences in effect sizes. They concluded that differences are attributable to
a lower signal-to-noise ratio in direct observation studies. That is, effect sizes are smaller in direct observation studies because they include more performance-irrelevant information (background noise) than do paper people studies. Interestingly, Woehr and Lance found that scripts in which
performance statements were embedded in written descriptions that included irrelevant information resulted in recognition and accuracy rating outcomes
similar to those obtained using videotape stimuli. They suggest that carefully constructed performance scripts can simulate some of the additional cues present in "real life" rating situations. However, they also point out that none of the laboratory methodologies, including those using direct observation techniques such as videotape, are likely to capture fully all aspects of the
rating situation inherent in evaluations conducted in the "real world."
Nevertheless, it has to be acknowledged that the present study represents an idealised rating situation where performance-irrelevant information has been minimised. Therefore, further research is required to establish if the results
can generalise to more complex and "noisier" environments found in applied rating situations.
A related issue concerns the nature of the rating task and the setting in which the study was conducted. Participants in the present investigation were students who were required to evaluate the performance of a lecturer. It has been suggested by some researchers that results from laboratory studies conducted in educational settings using upward appraisal may not generalise to different settings with non-student raters (Dipboye, 1985; Gordon, Slade, & Schmitt, 1986; Ilgen & Favero, 1985; Slade & Gordon, 1988). However,
others have argued that the processes elucidated from laboratory research
have external validity and that laboratory and field methodologies are
complementary (Dobbins, Lane, & Steiner, 1988a, 1988b; Mook, 1983; Woehr & Lance, 1991). Concerns that have been expressed regarding the generality
of research findings are certainly reasonable. However, in the present case,
characteristics of the sample may mitigate some of these concerns. More
specifically, most of the distance education students who comprised the
sample were experienced raters and were very familiar with reference reports and performance appraisals. Moreover, the majority of the sample were
working full time, and, in addition, many of the participants were employed as managers or supervisors. The background and experience of the present sample sets them apart from the typical student participant used in many
other investigations. In fact, their profile is likely to closely match that of raters in applied settings to whom the results are supposed to generalise. Nevertheless, limitations imposed by the artificial rating situation and nature
of the rating task remain, and place constraints on external validity in the
present study.
In addition to the problem of external validity, there are several other
methodological limitations that challenge the robustness of the results. For
example, the manipulation of rating purpose was simplistic and poorly done.
Part of the rationale for the manipulation was the avoidance of demand
characteristics. However, in hindsight, a more thorough explanation of rating
purpose would have been more likely to have communicated and established
the desired motivational context. Another limitation was the fact that no
manipulation checks were included. This omission means that it is difficult to
determine whether the failure to observe effects was due to the weak manipulation of the variable, to inattention on the part of participants, or simply to the variable being irrelevant. One might also ask questions about the reliability
of the measure of rater affect. Unfortunately, because it was a single-item
measure, no reliability coefficients could be calculated.
The low return rate in the present study is also of concern. Requests for
participation were sent out to more than 900 individuals. Slightly fewer than
300 participated in the study, a return rate of only 31%. In hindsight, it would have been worthwhile to include a follow-up letter, which might have
helped to bolster participant numbers. However, while the return rate was
low, it must be pointed out that it is consistent with those reported in other
studies which have used postal surveys (e.g., Cleveland et al., 1989; Judge,
Furthermore, many investigations have reported far lower return rates (e.g.,
Arthur & Bennett, 1995; Lin, 1996; Shaw, Kirkbride, Fisher, & Tang, 1995).
Nevertheless, because of the low return rate the representativeness of the sample cannot be guaranteed, and questions remain concerning the external
validity of the results. Finally, the investigation would have been improved if
participants had been required to evaluate more than one ratee. The
inclusion of multiple ratees would have resulted in a design that allowed for
the calculation of the entire range of accuracy measures.