
Protocol Validation

4.1 Inter-Rater Agreement

The protocol coding scheme and coding procedure have gone through revisions to improve the reliability and validity of the results. After initially coding videos of the three participants, the protocol was revised to reduce subjectivity of the definitions and clarify the coding procedure.

After the revision of the protocol, inter-rater agreement for the coding scheme was assessed using three different fifteen-minute video segments, one from each participant. Three raters encoded the data in each video segment.

The agreement between each pair of raters was calculated on a second-by-second basis. The action data from the encoding worksheet was modified to list the action observed for every second within the given design process. This way, the ratio of the number of seconds where the two raters were in agreement compared to where they differed could be calculated. This method of calculation was chosen for a few reasons:

• This method does not penalize longer actions that were observed by both raters, but not recorded at the same time stamp.

• This method provides an intuitive means for understanding the amount of agreement between raters. With the second-by-second calculation, an agreement of 0.8 means that 80% of the time was encoded the same by each rater.

• The major focus of this research is understanding how designers interact with physical media in engineering design. Also, the coding of the entities involved was designed to be as objective as possible, with each available entity classified before the experiment was conducted. For these reasons, the agreement calculations focused on the actions rather than the entities.
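Read together with the 0.8 example in the list above, the agreement value appears to be the fraction of coded seconds on which two raters match. As a sketch, with N_agree and N_differ introduced here as assumed symbols (they are not the author's notation):

```latex
P_{a} = \frac{N_{\text{agree}}}{N_{\text{agree}} + N_{\text{differ}}}
```

For instance, a fifteen-minute segment contains 900 seconds; hypothetical counts of 720 agreeing seconds and 180 differing seconds would give 720/900 = 0.8.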

The second-by-second approach to the agreement calculation penalizes the raters in proportion to the amount of time that the encoding differs. For example, if Rater 1 observed a manipulation action from 10:34 to 10:42 while Rater 2 observed a manipulation action but recorded timestamps of 10:35 and 10:43, the two raters disagree on the action for two seconds but agree for eight seconds. A method that only takes into account the starting times of the actions would classify the whole time as a disagreement, when the difference could be attributed to the random error of starting and stopping video playback.
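The worked example above can be reproduced with a short sketch. It assumes timestamps in "MM:SS" form, treats both endpoints of an interval as inclusive, and counts only the seconds coded by at least one rater; the function names and the "-" placeholder for an uncoded second are hypothetical and not taken from the encoding worksheet.

```python
from typing import Dict, List, Tuple

# (start, end, action); both endpoints are treated as inclusive (assumption).
Interval = Tuple[str, str, str]


def to_seconds(stamp: str) -> int:
    """Convert an "MM:SS" timestamp to a number of seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def expand(intervals: List[Interval]) -> Dict[int, str]:
    """Map every second covered by an interval to its action code."""
    per_second: Dict[int, str] = {}
    for start, end, action in intervals:
        for t in range(to_seconds(start), to_seconds(end) + 1):
            per_second[t] = action
    return per_second


def agreement(a: List[Interval], b: List[Interval], none_code: str = "-") -> float:
    """Fraction of seconds on which two raters recorded the same action.

    Only seconds coded by at least one rater are counted (assumption).
    """
    sec_a, sec_b = expand(a), expand(b)
    window = sorted(set(sec_a) | set(sec_b))
    if not window:
        raise ValueError("no coded seconds to compare")
    same = sum(sec_a.get(t, none_code) == sec_b.get(t, none_code) for t in window)
    return same / len(window)


# The manipulation example from the text ("M" is used as the action label).
rater_1 = [("10:34", "10:42", "M")]
rater_2 = [("10:35", "10:43", "M")]
print(agreement(rater_1, rater_2))  # 8 agreeing seconds out of 10 -> 0.8
```

Under these assumptions the two raters agree on 10:35 through 10:42 and differ on 10:34 and 10:43, which matches the eight and two seconds stated above.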

Using this method, the joint probability of agreement was calculated, comparing each rater to the other two raters. This was repeated for all three video segments for a total of nine comparisons. The results of the joint probability of agreement calculations are presented in Table 4.1.

Table 4.1: Calculated joint probability of agreement for three video segments.

Rater Pair    AP     BM     EN
1–2           0.52   0.49   0.33
1–3           0.54   0.49   0.31
2–3           0.80   0.71   0.62
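The three pairwise comparisons made for each video segment can be sketched as follows. This is a minimal illustration, assuming each rater's coding has already been expanded to one action label per second; the five-second excerpt is invented for the example and is not data from the study.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def joint_agreement(a: List[str], b: List[str]) -> float:
    """Fraction of seconds on which two per-second codings match."""
    if len(a) != len(b):
        raise ValueError("codings must cover the same number of seconds")
    if not a:
        raise ValueError("codings must not be empty")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pairwise_agreement(codes: Dict[int, List[str]]) -> Dict[Tuple[int, int], float]:
    """Joint agreement for every pair of raters on one video segment."""
    return {
        (r1, r2): joint_agreement(codes[r1], codes[r2])
        for r1, r2 in combinations(sorted(codes), 2)
    }


# Hypothetical excerpt: rater number -> one action label per second.
segment = {
    1: ["H", "H", "S", "S", "W"],
    2: ["H", "S", "S", "S", "W"],
    3: ["H", "S", "S", "W", "W"],
}
result = pairwise_agreement(segment)
for pair, value in result.items():
    print(pair, value)
```

Running the same function on each of the three segments yields three values per segment, which corresponds to the nine comparisons reported in Table 4.1.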

As shown in Table 4.1, the agreement between raters 1 and 2 and between raters 1 and 3 is quite low and does not give much confidence in the reliability of the protocol and coding scheme. However, there was much better agreement between raters 2 and 3. It should be noted that raters 2 and 3 were more active in the development of the entity and action definitions as well as the coding scheme. Rater 1 was less active in these developments and was therefore less familiar with the coding scheme. This indicates that the protocol has the potential to be reliable, but in this case, there may have been a lack of necessary training.

Another test of the reliability was conducted after refining the coding scheme and rater instructions. A new video segment of EN was given to a fourth rater not previously associated with the experiment. This rater was given an overview of all the encoding procedures, entity definitions, and action definitions, as well as training on how to perform the analysis. The agreement between this fourth rater and rater 3 from Table 4.1 was 0.6. This agreement is approximately equal to the agreement between raters 2 and 3 in Table 4.1 (0.62 for the EN segment). Further analysis was done to determine the actions where the most disagreement was occurring, and thus where the coding scheme needs the most refinement. For each second in the coded data where the two raters differed, the actions specified by each rater were noted. The sums of the actions where differences occurred are presented in Table 4.2.

Table 4.2: Total number of seconds of difference for each action coded.

         Action Codes
Rater    A  B  E  R  F    H   M   U   T   L  G  Pause   P   S   W  D
4        0  0  0  0  0   61  13  30  15  55  0      4   3  75  15  0
3        0  0  0  0  0  101  61  10   0  11  0     18  24  10  36  0
Total    0  0  0  0  0  162  74  40  15  66  0     22  27  85  51  0
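The tally behind Table 4.2 can be sketched as below, assuming per-second action labels for both raters: for every second on which the two codings differ, the action recorded by each rater is counted once in that rater's row. Both rater rows in Table 4.2 sum to 271, which is consistent with this way of counting. The function name and the six-second sample are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def disagreement_tally(a: List[str], b: List[str]) -> Tuple[Counter, Counter]:
    """Count, per rater, the action coded on each second where the raters differ."""
    if len(a) != len(b):
        raise ValueError("codings must cover the same number of seconds")
    tally_a: Counter = Counter()
    tally_b: Counter = Counter()
    for x, y in zip(a, b):
        if x != y:
            tally_a[x] += 1
            tally_b[y] += 1
    return tally_a, tally_b


# Hypothetical six-second excerpt, one action label per second.
rater_3 = ["H", "H", "S", "S", "Pause", "M"]
rater_4 = ["H", "S", "S", "Pause", "Pause", "L"]

tally_3, tally_4 = disagreement_tally(rater_3, rater_4)
total = tally_3 + tally_4  # corresponds to the "Total" row

print(dict(tally_3))
print(dict(tally_4))
print(dict(total))
```

Because every differing second adds one count to each rater's tally, the two rows always have the same sum, and the total row is twice the number of differing seconds.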

The largest number of differences occurred when one of the raters coded a handling action, and the second-highest number of differences occurred with sketching actions. It was observed that during sketching or writing, the participants would periodically pause. Often, the participants would continue to hold the writing utensil during these pauses. Figure 4.1 shows an excerpt of a time distribution plot containing holding, sketching, and writing actions as well as pauses. It can be seen that these actions are often adjacent to each other. Many of the differences in the coding could be a result of one rater documenting one of the intermittent holding actions while the other rater did not. Also, picking up a writing utensil at the beginning or end of a sketching or writing action could be documented as a handling action in some cases but not others, causing more discrepancies between the raters. Because these two actions are often adjacent to each other, clarifying the definitions of these actions and the instructions regarding timing could greatly increase the reliability of the protocol.

Figure 4.1: Time distribution excerpt of Holding, Sketching, Writing, and Pause actions.

The action coding differences shown in Table 4.2 also indicated large differences in the coding of manipulating and looking actions. The definition of manipulating is subjective in that the rater must decide if the participant is focusing on the object or not. Likewise, when coding a looking action, the rater must decide if the participant is actively looking through materials in a bin versus casually looking in that particular direction. The definitions of manipulating and looking are also prime candidates for coding scheme revisions.

Chapter 5
