Chapter 6: Methods and Analysis – Phase Three (Organization Survey)
6.7 Evaluating Interrater Agreement
6.7.1 Aggregation Indices
Generally speaking, indices of IRA are used to gauge the level of agreement or absolute consensus between raters or judges who are evaluating a single target. For example, students in a class rate an instructor, television viewers rate a show, or in this case, employees rate their firm’s entrepreneurial culture. The primary question with IRA is whether or not the raters are interchangeable with one another. If one wanted to make inferences about the kind of organizational culture a firm possessed, would it matter whether one asked Rater A or Rater B? One might reach different conclusions if Rater A and Rater B were highly interchangeable (i.e., high agreement) than if they were barely interchangeable (i.e., low agreement). Conceptually, IRA differs from indices of interrater reliability, which concern the consistency of ratings given by
multiple raters of multiple targets (LeBreton & Senter, 2008). For example, if multiple students rated multiple instructors, how consistently are students in these classrooms evaluating or discriminating across faculty? Are raters rank ordering targets in a
consistent manner? It is possible that raters can have perfect reliability but non-existent agreement.
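The distinction can be illustrated with a minimal sketch. It assumes, purely for illustration, that a Pearson correlation stands in for a consistency-type reliability index and that the mean absolute difference between two raters stands in for (dis)agreement; the ratings and function names are hypothetical and are not drawn from the study data.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length rating vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

def mean_abs_difference(x, y):
    """Average absolute gap between two raters' scores on the same targets."""
    return mean(abs(a - b) for a, b in zip(x, y))

# Hypothetical ratings of three targets on a 5-point scale.
rater_a = [1, 2, 3]
rater_b = [3, 4, 5]  # same rank order as rater_a, but always two points higher

consistency = pearson_r(rater_a, rater_b)             # stand-in for reliability
disagreement = mean_abs_difference(rater_a, rater_b)  # stand-in for lack of agreement

print(f"correlation = {consistency:.2f}, mean absolute difference = {disagreement:.2f}")
```

In this sketch the two raters order the targets identically, so the correlation is perfect, yet they never assign the same score to any target.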
In cases where a single individual is surveyed for the purposes of representing an entire organization’s culture, there is an implicit assumption of perfect IRA. In other words, one respondent’s perspective on culture is assumed to be perfectly interchangeable with that of any other organizational rater. This is a very strong assumption which intuitively seems unlikely to be true, hence past concerns from researchers about characterizations of culture from a management-centric perspective (Alvesson, 2002). In fact, as past in-depth research into culture has indicated, there are likely to be many different interpretations of and sentiments towards organizational culture (Kunda, 1992). For these reasons this research has endeavoured, from the outset, to sample a number of organizational respondents and use agreement indices to evaluate consensus on culture within the firm.
At least three different kinds of aggregation index can be used to evaluate IRA. These include the traditional variance comparison index rwg (James, Demaree, & Wolf, 1984; LeBreton & Senter, 2008), the more recent average deviation index AD (Burke & Dunlap, 2002; Burke, Finkelstein, & Dusig, 1999), and the most recent ratio agreement index, awg (Brown & Hauenstein, 2005). For rwg, single item rwg(1) and multiple item rwg(j) indices were calculated using both the traditional uniform null distribution and the slight skew distribution for comparison (LeBreton & Senter, 2008). The formula for rwg(1) is:

$$ r_{wg(1)} = 1 - \frac{S_x^2}{\sigma_E^2} $$

where $S_x^2$ is the observed variance and $\sigma_E^2$ is the variance obtained from a theoretical distribution representing different proportions of responses. The formula for rwg(j) is:

$$ r_{wg(j)} = \frac{J\left(1 - \dfrac{\bar{S}_{x_j}^2}{\sigma_E^2}\right)}{J\left(1 - \dfrac{\bar{S}_{x_j}^2}{\sigma_E^2}\right) + \dfrac{\bar{S}_{x_j}^2}{\sigma_E^2}} $$

where $\bar{S}_{x_j}^2$ is the mean of the observed variances for J essentially parallel items. For AD, single item ADM(j) and multiple item ADM(J) indices were also calculated. These formulas essentially represent the average deviation of the scores from their central tendency (mean). However, because it has been argued that the median is more robust than the mean in cases with high observed variance and small samples (Burke, Finkelstein, & Dusig, 1999), median calculations were used as well. These indices are called ADMd(j) and ADMd(J) for a single item and multiple items, respectively. The formula for ADM(j) is:

$$ AD_{M(j)} = \frac{\sum_{k=1}^{K} \left| X_{jk} - \bar{X}_j \right|}{K} $$

where k = 1 to K judges, $X_{jk}$ is the kth judge’s rating on the jth item, and $\bar{X}_j$ is the item mean (or median in ADMd(j)) taken over the judges. The formula for ADM(J) is the average of the ADM(j) scores over the J items.
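The rwg and AD calculations described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the procedure actually used in the study. It assumes that the observed variance is the sample variance (K - 1 denominator), that the uniform null variance for an A-point scale is (A^2 - 1)/12, and that the slightly skewed null can be represented by the response proportions shown, which are included only as an example and should be checked against LeBreton and Senter (2008). All function names and ratings are hypothetical.

```python
from statistics import mean, median, variance

def uniform_null_variance(n_options):
    """Variance of a discrete uniform distribution over 1..n_options."""
    return (n_options ** 2 - 1) / 12

def null_variance_from_proportions(proportions):
    """Variance of a null distribution given response proportions for options 1..A."""
    points = range(1, len(proportions) + 1)
    mu = sum(p * x for p, x in zip(proportions, points))
    return sum(p * (x - mu) ** 2 for p, x in zip(proportions, points))

def rwg_single(ratings, null_var):
    """rwg(1): one minus the ratio of observed variance to null variance."""
    return 1 - variance(ratings) / null_var

def rwg_multi(items, null_var):
    """rwg(j) for J items, using the mean of the observed item variances."""
    j = len(items)
    ratio = mean(variance(r) for r in items) / null_var
    return j * (1 - ratio) / (j * (1 - ratio) + ratio)

def average_deviation(ratings, center=mean):
    """ADM(j) when center=mean, ADMd(j) when center=median."""
    c = center(ratings)
    return sum(abs(x - c) for x in ratings) / len(ratings)

def average_deviation_multi(items, center=mean):
    """ADM(J) or ADMd(J): the average of the single-item values over J items."""
    return mean(average_deviation(r, center) for r in items)

# Hypothetical ratings from five judges on two 5-point items.
item_1 = [4, 4, 5, 5, 4]
item_2 = [3, 4, 4, 5, 4]
items = [item_1, item_2]

uniform_var = uniform_null_variance(5)
# Assumed proportions for a slightly skewed null (illustration only).
slight_skew_var = null_variance_from_proportions([0.05, 0.15, 0.20, 0.35, 0.25])

print("rwg(1), uniform null:    ", round(rwg_single(item_1, uniform_var), 3))
print("rwg(1), slight skew null:", round(rwg_single(item_1, slight_skew_var), 3))
print("rwg(j), uniform null:    ", round(rwg_multi(items, uniform_var), 3))
print("ADM(j), ADMd(j):         ", round(average_deviation(item_1), 3),
      round(average_deviation(item_1, median), 3))
print("ADM(J), ADMd(J):         ", round(average_deviation_multi(items), 3),
      round(average_deviation_multi(items, median), 3))
```

Note that the sketch does not truncate rwg values, so an observed variance larger than the null variance would produce a value below zero. The same ratings yield a lower rwg under the slightly skewed null than under the uniform null, because the skewed null has the smaller expected variance.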
Brown and Hauenstein (2005) argued that both the rwg and AD indices can be problematic because of their use of the uniform distribution. In the case of rwg, the uniform distribution is commonly used as the comparative distribution ($\sigma_E^2$). In other words, the observed variance is compared against a uniform distribution of responses indicating random responses (e.g., an equal probability of responses 1-5 on a 5-point Likert scale). However, Brown and Hauenstein (2005) argue that the uniform distribution is inappropriately applied in the majority of cases because the population of respondents is unlikely to truly respond in a random fashion. The use of different probability distributions has been advocated to address this concern; however, Brown and Hauenstein (2005) and others (e.g., LeBreton & James, 2008) argue that these need to be theoretically justified. The AD index does not use a distribution directly; instead, the range of values constituting acceptable levels of agreement is indexed to a distribution. Brown and Hauenstein (2005) developed the awg index as an index that does not rely on a probability distribution for comparison. As with rwg and AD, the awg index exists for individual items, awg(1), and multiple items, awg(j). The awg index is based on Cohen’s kappa, which estimates the level of agreement as a ratio of cases in agreement. The formula for awg is:

$$ a_{wg(1)} = 1 - \frac{2 S_x^2}{\left[(H + L)\bar{X} - \bar{X}^2 - HL\right]\left[K/(K - 1)\right]} $$

where H and L are the high and low rating anchors (5 and 1 respectively on a 5-point Likert scale), $S_x^2$ is the observed variance, $\bar{X}$ is the observed mean, and K is the number of judges. For multiple items, awg(j) is calculated as the average of the awg(1) scores over the J items.
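The awg calculation can be sketched in the same way. Again this is a hedged illustration, not the study's actual code: it assumes the observed variance is the sample variance (K - 1 denominator), the function names and ratings are hypothetical, and returning 1.0 when the maximum possible variance is zero (all judges at one scale endpoint) is a convention adopted for this sketch only.

```python
from statistics import mean, variance

def awg_single(ratings, low=1, high=5):
    """awg(1): one minus twice the observed variance over the maximum
    possible variance given the observed mean and the scale anchors."""
    k = len(ratings)
    m = mean(ratings)
    max_var = ((high + low) * m - m ** 2 - high * low) * (k / (k - 1))
    if max_var <= 0:
        # Mean sits at a scale endpoint, so every judge gave the same rating.
        # Treating this as perfect agreement is an assumption of this sketch.
        return 1.0
    return 1 - 2 * variance(ratings) / max_var

def awg_multi(items, low=1, high=5):
    """awg(j): the average of the awg(1) values over the J items."""
    return mean(awg_single(r, low, high) for r in items)

# Hypothetical ratings from five judges on two 5-point items.
item_1 = [4, 4, 5, 5, 4]
item_2 = [3, 4, 4, 5, 4]

print("awg(1), item 1:", round(awg_single(item_1), 3))
print("awg(1), item 2:", round(awg_single(item_2), 3))
print("awg(j):        ", round(awg_multi([item_1, item_2]), 3))
```

Because the denominator depends on the observed mean, the maximum possible variance shrinks as the mean approaches either anchor, which is how the index avoids comparing against a fixed null distribution.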
While the different indices each have their own standards for evaluating acceptable levels of agreement, LeBreton and Senter (2008), in a comprehensive review of IRA, argue for plurality. They indicate that the various measures of IRA tend to yield highly consistent conclusions. Monte Carlo simulations find highly convergent results among the indices. As a result, generally speaking, the indices can be used together to “point in the right direction.” In any case, the use of agreement indices and the standards for determining acceptable levels of agreement should depend on the context of the study, the nature/severity of the decision being made with the information (e.g., rating a TV show versus firing an employee), and the judgment of the researcher (LeBreton & Senter, 2008).