Towards a Rationale for Consensus Scoring

2.3 Solution Strategies to Scoring Challenges: Empirical Scoring

2.3.4 Concluding Comments

2.3.4.2 Towards a Rationale for Consensus Scoring

Given these findings, two areas that may contribute to a consolidated rationale for consensus scoring were presented in the previous section: the consensus theory of truth and the WoC effect. Do these areas help establish a rationale for consensus scoring that supports or goes beyond the reasons provided by Legree et al. (2005)? Is the consensus of individuals a sufficient source of knowledge to fulfill the definition of correctness? If so, under what conditions?

From a philosophical perspective, consensus as an indicator of truth is controversial. Even philosophical theories that support consensus as a criterion of truth set strict preconditions for it. From that perspective, consensus resulting from ideal group interaction, as opposed to random de facto consensus, is viewed as an appropriate criterion of truth. The literature on the WoC effect, in particular the popular-science literature, assigns consensus a variety of meanings and presents various examples of the effect. However, even here, limiting preconditions of the WoC effect are defined. These two areas partly underpin the rationale for consensus scoring as presented by Legree et al. (2005); however, both the WoC effect and the consensus theory of truth emphasize the limits of consensus in terms of preconditions.

One of the preconditions for consensus to indicate truth, ability, is a common element in philosophical discussions of consensus as a criterion of truth and in the literature on the WoC effect. Both areas support Legree et al. (2005), who named knowledge as a factor influencing the quality of CBM. Hence, the role of ability/knowledge is not only supported by theoretical assumptions but is highlighted as probably the key influence on the quality of consensus scoring.

The other theoretical preconditions mentioned are not supported throughout the literature. For consensus scoring to indicate unknown knowledge structures, diversity cannot be justified on a theoretical basis. The assumption that the variety of experience of journeymen might exceed that of experts is not directly supported when it comes to true knowledge structures (for scoring in ability measurement). There is no theoretical link between the variety or diversity of a group and its mean ability. In contrast, agreement among individuals, that is, less diversity, has been seen as a result of a high level of knowledge. Hence, it remains unclear why diversity should increase the quality of consensus scoring when individual judgments are averaged. However, diversity might play a different role when it comes to the interaction of groups, and also regarding the consensus of opinions and decisions that are not directly related to the concepts of correctness or true knowledge structures.

Moreover, independence is not required as a theoretical precondition for consensus to indicate facts. The ideal speech situation, fictional as it is, describes a situation of highly dependent individuals. As an interacting group can predict objective target values (Gigone & Hastie, 1997), there is no empirical or theoretical reason why interacting groups per se should be less appropriate than mean tendencies of individual judgments. However, the consensus of independent and dependent groups might differ with regard to possible conclusions (Lorenz et al., 2011; Solomon, 2006). As studies from social psychology have shown, group interaction might lead to conformity under special conditions (Asch, 1955, 1956; Deutsch & Gerard, 1955). Conformity is a kind of consensus that might differ from independent group consensus, as group processes such as bias may shape it. On the other hand, consensus resulting from an ideal interaction need not necessarily differ from independent consensus, or might even be better in terms of possible conclusions (Gigone & Hastie, 1997). Independence has not been discussed as a major influencing variable for the quality of consensus scoring, yet consensus scoring methods require independent judgments.

Hence, transferring these thoughts to consensus scoring, the rationale of CBM (Legree et al., 2005) is only partly supported. Although consensus has been mentioned in the intelligence literature as one evaluation standard for correctness, the theoretical rationale for consensus scoring, that is, consensus as an indicator of truth, cannot be rigorously justified. From the theoretical perspective, only the consensus of highly able individuals may indicate correctness. If this condition is fulfilled and consensus can indicate true knowledge structures, would it then meet state-of-the-art standards for psychological measurement?

To investigate consensus as a scoring criterion - which is the major aim here - one needs to canvass philosophical standards of measurement in order to evaluate this correctness criterion. From the philosophical perspective, the distinction between realist and non-realist theories of truth (and hence correctness) is important insofar as realist measurement ought to be based on a realist definition of scoring rules. The consensus theory of truth, however, is mostly seen as a non-realist criterion of truth, although some philosophers have viewed parts of it as realist. If, on the other hand, correctness is defined on the basis of non-realist criteria, measurement is aligned with a non-realist perspective. Or can a correctness criterion that is not independent of human thinking about it be used for realist measurement? A more thorough philosophical discussion of this point is beyond the scope of this study. From a realist point of view, however, consensus should not define correctness, but rather - if at all - provide evidential support for (realist) correctness.

To conclude, consensus-based scoring should be viewed critically from a theoretical perspective, as there is no unconditional theoretical support for consensus as an indicator of truth. However, because it might work under certain conditions, and because the idea has been used for scoring intelligence tests, a systematic empirical investigation is required to evaluate empirical scoring methods that are based on consensus. For this systematic evaluation, the true correctness of a response has to be known, as has been partly implemented in EI research. More studies are needed that compare consensus scoring keys with veridical scoring keys. Besides the pending evaluation of the CBM scoring methods, the investigation of other potentially promising scoring methods that have not been used for EI or SI is a research aim addressed in the present studies. Without further investigation of scoring methods, one can surmise that other challenges concerning the measurement of new ability constructs, which were only touched upon in this chapter, will remain unsolved.
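A comparison of a consensus key with a veridical key can be illustrated with a small sketch. The Python code below is illustrative only and rests on assumptions not made in the text: it derives a mode-based consensus key (the option endorsed most often per item), which is only one possible way of forming a consensus key, and it uses simple percent agreement as the comparison measure. The function names `modal_key` and `key_agreement` and the toy data are hypothetical.

```python
from collections import Counter


def modal_key(responses):
    """Return the most frequently endorsed option for each item.

    `responses` holds one list per respondent with the chosen option per item.
    Ties go to the option encountered first among the respondents.
    """
    n_items = len(responses[0])
    key = []
    for item in range(n_items):
        counts = Counter(person[item] for person in responses)
        key.append(counts.most_common(1)[0][0])
    return key


def key_agreement(consensus_key, veridical_key):
    """Proportion of items on which both keys name the same option."""
    matches = sum(c == v for c, v in zip(consensus_key, veridical_key))
    return matches / len(veridical_key)


# Toy data (hypothetical): five respondents, four items.
responses = [
    ["a", "b", "c", "a"],
    ["a", "b", "d", "a"],
    ["a", "c", "d", "b"],
    ["b", "b", "d", "a"],
    ["a", "b", "c", "a"],
]
veridical = ["a", "b", "c", "a"]  # assumed to be known in advance

consensus = modal_key(responses)
print(consensus)                            # ['a', 'b', 'd', 'a']
print(key_agreement(consensus, veridical))  # 0.75
```

In this toy example the majority endorses a wrong option on the third item, so the consensus key agrees with the veridical key on three of four items. Such a comparison is only possible when the veridical key is available, which is the condition stated above.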

3 Description of Empirical Scoring Methods

This chapter describes different methods that have been used, or suggested for use, for scoring measures of ability constructs. Because of the intense critique of CBM scoring, these alternative methods are introduced and investigated in the present studies. As empirical scoring methods, they supplement the CBM described in section 2.3 and might represent promising alternatives to CBM scoring methods. The first of these methods is based on the assumption that agreement among respondents (consensus) indicates truth. Two other methods do not explicitly state this assumption, but use response information to estimate the ordering of response options with respect to the latent ability. These data-driven scoring methods are used on the assumption that an underlying true ordering of response options exists, although this may not yet be evident - due, for example, to the lack of an elaborated theory.

Note, however, that the selection of scoring methods presented below is not claimed to be complete. Other empirical scoring methods not investigated any further here may also be available (e.g., Clemen, 1989; Merkle & Steyvers, 2011; Turner et al., 2014).

3.1 Consensus Analysis

Consensus Analysis comprises a class of methods, also known as Cultural Consensus Theory (CTT), that were developed to obtain objective empirical evidence as a basis for true cultural knowledge - that is, the knowledge inherent in a culture (Romney, Weller, & Batchelder, 1986; Romney, Batchelder, & Weller, 1987). These methods emerged in a period when the objectivity of anthropological studies was questioned. According to Batchelder and Romney (1988), CTT provides a statistical model that can objectively describe and define cultural knowledge. However, as Karabatsos and Batchelder (2003) state, the knowledge in question is group-specific, so the terms true and objective are somewhat misleading.

information, but that knowledge statements of respondents are probabilistic; that is, the probability of giving a correct response to a specific question increases as a function of knowledge. Respondents with high ability tend to have a higher probability of knowing the correct response to a question and thus give better information about the correctness of the endorsed response options. In CTT, the probability of a response option being correct is defined mainly by the competence of the respondents endorsing that option. Additional factors such as the difficulty of an item or a bias parameter may also be modeled. The CTT model is structurally related to the latent class model (Batchelder & Romney, 1988; Romney, 1999). However, it is not applied to respondents, but to items allocated to correctness classes.

As Batchelder and Romney (1988) state, CTT can be used for ability testing, in particular knowledge testing, whenever the researcher assumes that a common knowledge structure exists, but does not yet possess that knowledge. The crucial assumption of the model is that individuals who agree share a common knowledge, and that the level of agreement of independent respondents represents a measurable degree of true knowledge. Thus, like CBM scoring, CTT uses the consensus among respondents to estimate an unknown response key. However, in contrast to CBM, responses are weighted by the competencies of the respondents. As Weller (1987) has stated, unweighted consensus among respondents also converges towards a true response; however, with CTT models, smaller sample sizes might suffice and higher validity might be achieved.
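The difference between unweighted and competence-weighted consensus can be illustrated with a simplified sketch. The Python code below is not one of the estimation procedures from the cited literature. It assumes a simple response model that is not spelled out in the text: a respondent with competence D knows the correct option with probability D and otherwise guesses uniformly among L options. Under this assumption the expected match rate with the true key is D + (1 - D)/L, and the likelihood ratio in favour of an endorsed option is (D(L - 1) + 1)/(1 - D), whose logarithm serves as the voting weight. The function name `competence_weighted_key`, the clipping constant `eps`, and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def competence_weighted_key(responses, n_options, n_iter=10, eps=0.01):
    """Alternate between a weighted vote and competence re-estimation.

    The first pass uses equal weights, so it yields the plain majority key.
    Returns the estimated key and the estimated competence per respondent.
    """
    n_items = len(responses[0])
    weights = [1.0] * len(responses)
    competences = [None] * len(responses)
    key = None
    for _ in range(n_iter):
        # Weighted vote: each respondent adds their weight to the chosen option.
        new_key = []
        for item in range(n_items):
            votes = defaultdict(float)
            for person, w in zip(responses, weights):
                votes[person[item]] += w
            new_key.append(max(votes, key=votes.get))
        if new_key == key:
            break  # the key no longer changes
        key = new_key
        # Re-estimate competence from agreement with the current key,
        # corrected for guessing: D = (L * p_match - 1) / (L - 1).
        competences, weights = [], []
        for person in responses:
            p_match = sum(r == k for r, k in zip(person, key)) / n_items
            d = (n_options * p_match - 1) / (n_options - 1)
            d = min(max(d, eps), 1 - eps)  # keep the log weight finite
            competences.append(d)
            weights.append(math.log((d * (n_options - 1) + 1) / (1 - d)))
    return key, competences


# Toy data (hypothetical): two consistent respondents, three inconsistent ones.
responses = [
    list("abcdab"),
    list("abcdab"),
    list("acddba"),
    list("bbdaca"),
    list("cdcbaa"),
]
majority_key, _ = competence_weighted_key(responses, n_options=4, n_iter=1)
weighted_key, competences = competence_weighted_key(responses, n_options=4)
print(majority_key)  # ['a', 'b', 'c', 'd', 'a', 'a']
print(weighted_key)  # ['a', 'b', 'c', 'd', 'a', 'b']
```

In the toy data, three respondents endorse option a on the last item, so the unweighted majority key selects a. Because these three respondents agree with the key on few other items, they receive low weights, and the weighted vote selects b, the option endorsed by the two respondents with high estimated competence.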

CTT has been suggested as an alternative scoring method for EI by Legree et al. (2005) and Schulze et al. (2007). However, no research has yet been published that uses CTT to score tests for EI, SI, or ability tests in psychological research in general. Possible reasons for this may be the complexity of the methods, strong model assumptions, or software limitations. In recent years, the development of CTT models has progressed, and many versions of such models, as well as estimation procedures, are available (Anders & Batchelder, 2015; Aßfalg & Erdfelder, 2012a; Karabatsos & Batchelder, 2003; Oravecz, Anders, & Batchelder, 2015). The following paragraphs will first describe the statistical model and its variants with reference to different estimation procedures and then present a number of CTT applications.

3.1.1 Statistical Model and Estimation. In its central function, CTT aims to

In document An Investigation of Empirical Scoring Methods for Ability Measurement (Page 69-75)