
3 Research design

3.3 Description of research sample

3.3.6 Internal validity of research design

This section describes aspects of the internal validity of the research design for the experiments.

Identification tasks

Regarding the identification audit tasks selected in the first experiment, the combination of an expert panel and an experiment contributes positively to the internal validity of the research design, since the judgment performance measures (accuracy, see also section 3.4) are based on external experts performing the same audit task on the same case description.

A potential threat to the internal validity of the design is the free response format used in the questionnaire, because the wordings used by the participants and those used by the expert panel must be interpreted accurately. The participants’ answers are used to calculate a judgment performance measure. This measure (see Section 4.4 for a more detailed discussion) implies an equality relationship between the expert panel’s list of business risks and controls and the participant’s list of business risks and controls. The participants were requested to describe the client’s top five business risks in short wordings. It is hence possible that one participant used a catchword that covered a broader area than the wordings of other participants. For example, one participant identified the control ‘segregation of duties’, whereas other participants specified segregation of duties along departments or functions.

To mitigate this threat to a certain extent, I prepared the ‘mapping’ of participants’ responses to the expert panel list as well as the underlying accuracy calculations. The entire mapping and calculation process was subsequently reviewed by and discussed with two other persons (an auditor of a Big4 audit firm and a member of the technical department of a Big4 audit firm and a researcher in auditing), resulting in final accuracy scores. It was not possible to compute kappa as a measure of inter-rater reliability. Kappa presumes equal values of the initial and final accuracy calculations, and the dataset did not contain equal values for all of the accuracy measures. Instead, two-tailed Pearson correlations were computed, which are shown in the next table.

Table 3.2 Inter-rater Pearson correlations

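|           | Acc1ini | Acc1final      | Acc2ini | Acc2final      | Acc3ini | Acc3final      | Acc4ini | Acc4final      |
|-----------|---------|----------------|---------|----------------|---------|----------------|---------|----------------|
| Acc1ini   | 1       | .922* (p=.000) | n.a.    | n.a.           | n.a.    | n.a.           | n.a.    | n.a.           |
| Acc1final |         | 1              | n.a.    | n.a.           | n.a.    | n.a.           | n.a.    | n.a.           |
| Acc2ini   |         |                | 1       | .865* (p=.000) | n.a.    | n.a.           | n.a.    | n.a.           |
| Acc2final |         |                |         | 1              | n.a.    | n.a.           | n.a.    | n.a.           |
| Acc3ini   |         |                |         |                | 1       | .732* (p=.000) | n.a.    | n.a.           |
| Acc3final |         |                |         |                |         | 1              | n.a.    | n.a.           |
| Acc4ini   |         |                |         |                |         |                | 1       | .866* (p=.000) |
| Acc4final |         |                |         |                |         |                |         | 1              |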

Table 3.2 shows that, for each of the four accuracy measures, the initial and final scores correlate strongly and significantly (correlations between .732 and .922, each with a reported p of .000).
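The kind of correlation reported in Table 3.2 can be illustrated with a minimal sketch. It assumes that each accuracy measure yields one initial and one final score per participant; the function name `pearson_r` and the six score pairs are hypothetical and are not the study’s data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations from the mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical accuracy scores for six participants (NOT the study's data):
# the score before the review and the score after the review.
acc_initial = [0.40, 0.60, 0.20, 0.80, 0.60, 0.40]
acc_final = [0.40, 0.80, 0.20, 0.80, 0.60, 0.20]

r = pearson_r(acc_initial, acc_final)
print(f"r = {r:.3f}")
```

The sketch computes only the coefficient. The two-tailed significance levels in Table 3.2 would additionally require a t-test on r, which is left out here.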

A second internal validity issue relates to the maximum of five business risks and five entity-level controls that the participants were requested to identify.

Limiting the response format to five items implies that participants may identify risks or controls that, although they do not match the expert panel’s list of risks and controls, are not simply wrong; they merely do not match the most important risks and controls identified by the expert panel. Judgment performance scores therefore need to be interpreted cautiously. This issue is inherent to the design chosen.

Assessment task (assessing the impact of client’s business risks and controls on audit risk)

For the assessment task, the following issues concerning the internal validity of the research design will be discussed:

• Cue selection;

• Consistency of participant’s response.

Cue selection

Section 3.3.5 described the connection between the expert panel’s tasks and the experiment. The selection of cues in the experiment is based on various criteria, among them the requirement that the expert panel exhibits substantial agreement on the importance of those cues (see also Tan and Libby, 1997)¹³. As no other fully suitable external criterion was available beforehand, this procedure contributed to a certain extent to the internal validity of the experiment as a mode of observation.

Consistency of participant’s response

Concerning the assessment tasks, a full factorial design was chosen. This implies that all selected cues are presented to the participants in all possible combinations (present: yes/no), resulting in 15 cases of cue combinations (the 2^4 = 16 combinations of the four cues, excluding the combination with zero cues present). A threat to the internal validity of this design is the potential impact of participants’ fatigue or boredom (‘demand effects’) due to the length of the questionnaire and the relatively similar cases to be assessed. To mitigate this potential threat to a certain extent, the design of the questionnaire included four elements:

• In addition to the 15 case combinations, participants were provided with four repeat cases. These replicated four of the original fifteen cases of cue combinations. This procedure allowed for measuring the stability¹⁴ of the participants’ responses (see also Colbert, 1988; Cooksey, 1995);

• To avoid memory-carryover effects, each individual case of cue combinations was presented on a separate page. In addition, participants were requested to complete the questionnaire page by page and not to turn back to previous cases and responses;

• Three versions of the questionnaire were developed, in which different sequences of case combinations were presented to participants in order to counterbalance the order effect;

• The debriefing questionnaire contained a question relating to the level of self-insight, asking the participants to provide their subjective assessment of individual cue impact.

13 “Since the expert panel’s responses were to serve as benchmarks it was crucial that items included in the final measure of tacit managerial knowledge exhibit substantial agreement among the partners” (Tan and Libby, 1997).

14 Other studies (e.g., Ashton, 1973) included an additional measure of stability, namely the stability over time. This procedure involved performing the same experiment with the same participants at a different point in time. As this procedure doubles the resource capacity the Big4 audit firms would have made

The three sequences/versions were distributed over the group of audit managers as follows:

• Group 1: auditors with sequence 1 (n=31);

• Group 2: auditors with sequence 2 (n=30); and

• Group 3: auditors with sequence 3 (n=24).

Table 3.3 Sequence of cue presence

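| Cues present | Position in sequence 1 | Position in sequence 2 | Position in sequence 3 |
|--------------|------------------------|------------------------|------------------------|
| 1234         | 1                      | 15                     | 1                      |
| 123          | 2                      | 14                     | 3                      |
| 124          | 3                      | 13                     | 5                      |
| 134          | 4                      | 12                     | 7                      |
| 234          | 5                      | 11                     | 9                      |
| 24           | 6                      | 10                     | 11                     |
| 23           | 7                      | 9                      | 13                     |
| 14           | 8                      | 8                      | 15                     |
| 13           | 9                      | 7                      | 14                     |
| 34           | 10                     | 6                      | 12                     |
| 12           | 11                     | 5                      | 10                     |
| 2            | 12                     | 4                      | 8                      |
| 3            | 13                     | 3                      | 6                      |
| 1            | 14                     | 2                      | 4                      |
| 4            | 15                     | 1                      | 2                      |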

Participants who completed the “sequence 1” version of the questionnaire first responded to the case setting in which all cues were present (“1234”), subsequently responded to the case settings in which three cues were present (“123”, “124”, “134”, and “234”), and so on. Appendix B, tasks 5.1 to 5.15, is an example of sequence 1.

Participants who completed the “sequence 2” version of the questionnaire started with the case settings in which only one cue was present (“4”, “1”, “3”, and “2”), subsequently responded to the case settings with two cues present, and so on. Sequence 2 is thus the reverse of sequence 1. Sequence 3 was a mixture of sequences 1 and 2.
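The full factorial set of cases and the three presentation orders can be sketched as follows. The order of sequence 1 is transcribed from Table 3.3. The interleaving rule for sequence 3 is inferred from the positions in that table and is not a rule stated in the text, and the names `all_cases`, `sequence_2` and `sequence_3` are illustrative only.

```python
from itertools import combinations

CUES = (1, 2, 3, 4)

# Full factorial over four present/absent cues, minus the empty combination:
# every non-empty subset of the cues, written as a string such as "124".
all_cases = {
    "".join(str(cue) for cue in combo)
    for size in range(1, len(CUES) + 1)
    for combo in combinations(CUES, size)
}

# Order of presentation in sequence 1, transcribed from Table 3.3.
SEQUENCE_1 = ["1234", "123", "124", "134", "234", "24", "23", "14",
              "13", "34", "12", "2", "3", "1", "4"]


def sequence_2(order):
    """Sequence 2 presents the cases of sequence 1 in reverse order."""
    return list(reversed(order))


def sequence_3(order):
    """Inferred rule: the first half of sequence 1 takes the odd positions
    (1, 3, 5, ...); the remaining cases fill the even positions counting
    down from the end (14, 12, ..., 2)."""
    n = len(order)
    half = (n + 1) // 2
    slots = [None] * n
    for k, case in enumerate(order, start=1):
        position = 2 * k - 1 if k <= half else 2 * (n + 1 - k)
        slots[position - 1] = case
    return slots


print(len(all_cases))                 # number of cue combinations
print(sequence_2(SEQUENCE_1)[:4])     # first four cases of sequence 2
print(sequence_3(SEQUENCE_1)[:4])     # first four cases of sequence 3
```

Under this reading, sequence 3 alternates between cases taken from the start of sequence 1 and cases taken from its end, which is consistent with its description as a mixture of sequences 1 and 2.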

Section 3.3 has described the elements of the research sample. The next section will discuss the dependent variables, that is, the measures of judgment performance.

In document Auditors' Performance in Risk and Control judgments : An Empirical Study (Page 65-69)