CiteSeerX — Detection of Large-Scale Item Preknowledge in Computerized Adaptive Testing via Kullback–Leibler Divergence

(1)

LSAC RESEARCH REPORT SERIES

 Detection of Large-Scale Item Preknowledge in

Computerized Adaptive Testing via Kullback–Leibler Divergence

Dmitry I. Belov

 Law School Admission Council Research Report 12-01

March 2012

(2)

The Law School Admission Council (LSAC) is a nonprofit corporation that provides unique, state-of-the-art products and services to ease the admission process for law schools and their applicants worldwide. Currently, 219 law schools in the United States, Canada, and Australia are members of the Council and benefit from LSAC's services. All law schools approved by the American Bar Association are LSAC members. Canadian law schools recognized by a

provincial or territorial law society or government agency are also members. Accredited law schools outside of the United States and Canada are eligible for membership at the discretion of the LSAC Board of Trustees; Melbourne Law School, the University of Melbourne is the first LSAC-member law school outside of North America. Many nonmember schools also take advantage of LSAC’s services. For all users, LSAC strives to provide the highest quality of products, services, and customer service.

Founded in 1947, the Council is best known for administering the Law School Admission Test (LSAT^®), with about 100,000 tests administered annually at testing centers worldwide. LSAC also processes academic credentials for an average of 60,000 law school applicants annually, provides essential software and information for admission offices and applicants, conducts educational conferences for law school professionals and prelaw advisors, sponsors and publishes research, funds diversity and other outreach grant programs, and publishes LSAT preparation books and law school guides, among many other services. LSAC electronic applications account for 98 percent of all applications to ABA-approved law schools.

All rights reserved. No part of this work, including information, data, or other portions of the work published in electronic form, may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, or by any information storage and retrieval system, without permission of the publisher. For information, write:

Communications, Law School Admission Council, 662 Penn Street, PO Box 40, Newtown, PA, 18940-0040.

This study is published and distributed by LSAC.

(3)

Table of Contents

Executive Summary ... 1

Introduction ... 1

Detection of Test Collusion ... 4

Case 1 ... 4

Case 2 ... 7

Case 3 ... 11

Computer Simulations ... 11

Computational Study of Case 1 ... 12

Summary ... 22

References ... 23

(4)

(5)

Executive Summary

Detection of test collusion is a new research direction in the area of test security.

Test collusion may be described as large-scale sharing of test materials, including answers to test questions (items). Test collusion for high-stakes testing programs can affect the scoring of the exam because of the potentially large number of test takers involved. Therefore, identifying such test takers in order to withdraw their responses from the data is an important task. Current methods of detecting test collusion are based on statistics also used in answer-copying detection. In computerized adaptive testing, these methods lose power because the items presented are not the same for all test takers. This report addresses this problem by introducing a new approach that searches through various partitions of the test and analyzes differences between corresponding distributions of test-taker ability. The suggested approach was found to be effective in detecting groups of test takers with item preknowledge, meaning those with access to a (possibly unknown) subset (or subsets) of items prior to the exam.

Introduction

Current research on test security is moving beyond the classical problem of

detecting answer copying between two test takers (Angoff, 1974; Belov, 2011; Belov &

Armstrong, 2010; Frary, 1993; Harpp & Hogan, 1993, 1998; Harpp, Hogan, & Jennings, 1996; Holland, 1996; Sotaridona, van der Linden, & Meijer, 2006; Wesolowsky, 2000;

Wollack, 1997; Wollack & Cohen, 1998) or detecting person misfit (Armstrong & Shi, 2009; Belov & Armstrong, 2011; Belov, Pashley, Lewis, & Armstrong, 2007; Guttman, 1944; Harnisch & Linn, 1981; Karabatsos, 2003; Meijer, 1996; Meijer & Sijtsma, 2001;

van der Flier, 1982; van Krimpen-Stoop & Meijer, 2000, 2001) and is focusing on methods of detecting larger groups involved in test collusion (Jacob & Levitt, 2003;

Wollack & Maynes, 2011; Zhang, Searcy, & Horn, 2011).

Wollack and Maynes (2011) outline the following types of test collusion:

Illegal coaching by a teacher or test-prep school, examinees accessing stolen test content posted on the world wide web, examinees communicating about test answers during an exam, examinees harvesting and sharing exam content using e-mail or the Internet, and teachers or administrators changing answers after tests have been administered. (p. 2)

They point out that the “last activity is actually tampering, but when the tampering is performed on several answer sheets consistently, it can be detected using collusion analysis” (p. 2).

In order to identify classrooms where teachers have changed answers, Jacob and Levitt (2003) analyzed the joint distribution of two summary statistics: answer strings summary and unexpected score fluctuations summary. Both cluster analysis (Wollack &

Maynes, 2011) and factor analysis (Zhang et al., 2011) were demonstrated to be applicable for detecting various types of test collusion. However, all three methods rely

(6)

there is a lack of matching between incorrect responses within a group of test takers involved in test collusion. Furthermore, statistics based on matching responses are generally useless for multiple-stage testing and computerized adaptive testing (CAT), where the actual test varies among test takers. This report presents a new approach that does not rely on response-matching statistics but instead utilizes the difference between posterior distributions, measured by the Kullback–Leibler divergence (Kullback

& Leibler, 1951).

Throughout the report the following notation is used:

 Lowercase letters a b c, , ,...;   , , ,... denote scalars (including random variables).

 Capital letters A B C, , ,... denote sets; S denotes the number of elements in a set .

S

 Bold capital letters A B C, , ,... denote functions (including discrete distributions defined by probability mass functions).

 Capital Greek letters   , , ,... denote collections of subsets.

This report studies the following type of test collusion. At some test center, a group of test takers had access to a subset of items S R prior to the exam, where R is a given CAT pool. This type of test collusion is known as item preknowledge (Zhang et al., 2011). Practical examples of S include difficult items, stolen items, or items with high exposure during a previous CAT administration(s). Test takers involved in test collusion will be called aberrant test takers. They form aberrant groups. Test centers with

aberrant groups will be called aberrant test centers. Three cases are considered:

Case 1: S is known (Figure 1).

Case 2: S is unknown but belongs to a known collection  



W W1, 2,...,W_m



. In other words, for each aberrant group, its S is completely covered by the collection . Case 3: S is unknown and S  Q R (Figure 2), where Q is a reduction of R (e.g., items with exposure higher than a certain threshold). In addition, l S u, where l and

u are known lower and upper bounds on the size of S, respectively.

Note 1: In Case 2 and Case 3 (Figure 2), subset S may vary among aberrant groups.

However, it is assumed that for aberrant groups within each aberrant test center the subset S does not vary. In practice, this assumption means that each aberrant test center has no more than one aberrant group.

(7)

FIGURE 1. Each aberrant group had access to the same known subset SR

FIGURE 2. An example of three aberrant groups each of which had access to a certain unknown subset of items from QR

Item preknowledge influences the posterior distribution

S Tj

P of ability or speed of each involved test taker ,j estimated on S T_j, where T_j is an actual subset of items administered to test taker j during CAT. For example, if answers to stolen items S are known to a group of test takers, then there should be a large difference between the posterior distributions of ability _\

T Sj

P and

S Tj

P for each low-ability test taker j from the group, assuming S T_j  . Therefore, the Kullback–Leibler divergence (Kullback &

Leibler, 1951) between corresponding posterior distributions ( _\ || )

j j

T S S T

D P P can be used as an instrumental statistic.

Given two distributions P and ₁ P the properties of the Kullback–Leibler divergence ₂, include D P( ₁||P₂)0, D P( ₁||P₂) 0 P₁ P , and in general, ₂ D P( ₁||P₂)D P( ₂||P ₁) (Cover & Thomas, 1991; Kullback & Leibler, 1951).

R

S

R

S1

S2

S3

Q

(8)

When S is known (Case 1), the use of Kullback–Leibler divergence is effective for detecting aberrant response patterns (Belov & Armstrong, 2011; Belov et al., 2007), including detection of pairs of test takers involved in answer copying for a high-stakes paper-and-pencil test (Belov & Armstrong, 2010). When S is unknown but belongs to a collection  



W W1, 2,...,W_m



(Case 2), then the multiple comparison problem will arise, because each empirical distribution of ( _\ || ),

j i i j

T W W T

D P P i1, 2,...,m has to be analyzed.

A standard approach to resolving this problem is to use Bonferroni or Šidák corrections for significance level (Abdi, 2007), which will decrease the statistical power. This report considers an even more difficult situation—when S and  are both unknown (see the description of Case 3 above).

Detection of Test Collusion Case 1 ( S is known)

Given a subset S R and a set of test takers J, a straightforward algorithm searching for aberrant test takers is introduced (Algorithm 1). For each test taker, it computes statistic :h the Kullback–Leibler divergence between corresponding posterior distributions. Then, given a significance level, the algorithm computes the critical value of the statistic using its empirical distribution.

Algorithm 1

Step 1: For each test taker jJ, the statistic  ( _\ || )

j j

j T S S T

h D P P is computed.

Step 2: The empirical distribution H of h _j, jJ is computed by binning h _j. Step 3: Given a significance level _H, the critical value v for H is computed.

Step 4: A test taker j is reported as aberrant if h_j v.

Algorithm 1 loses power in the presence of large aberrant groups. One can address this data-contamination problem by computing the critical value via simulations.

However, the use of a simulated distribution may lead to a large increase in the Type I error rate. The next algorithm addresses the data-contamination problem by the

following process:

(9)

1. Potentially aberrant test centers are identified via a new statistic g, which indicates how the distribution of statistic h within a particular test center is different from distributions calculated from other test centers.

2. Data from the identified test centers are ignored when computing the empirical distribution of statistic .h

Given a subset S R, a set of test takers J, and a set of test centers C, an algorithm searching for aberrant test takers is introduced (Algorithm 2). The subset

c

J J contains test takers taking a test at test center cC . Algorithm 2

Step 1: A subset of potentially aberrant test centers is initialized as A . Step 2: For each test taker jJ, the statistic  ( _\ || )

j j

j T S S T

h D P P is computed.

Step 3: For each test center cC, the empirical distribution H of _c h , _j jJ is _c computed. Figure 3 shows two simulated H for nonaberrant and aberrant test _c centers (see details of computer simulations below). The next step introduces a new statistic that capitalizes on the difference between distributions H _c.

Step 4: For each test center cC, the statistic



⁽ ^|| ⁾ ⁽ ^|| ⁾











c c x x c

x C

g D H H D H H is

computed. Note: The compensation for the asymmetry of Kullback–Leibler divergence is due to a small support of H _c.

Step 5: The empirical distribution G of g _c, c C is computed. 

Step 6: Given a significance level _G, the critical value v for G is computed.

Step 7: For each test center cC, if g_c v then AA { }.c Step 8: The empirical distribution H of h _j,



 _c

c A

j J is computed.

Step 9: Given a significance level _H, the critical value w for H is computed.

Step 10: A test taker j from test center cA is reported as aberrant if h_j w .

(10)

(a)

(b)

FIGURE 3. Two simulated distributions H_c for nonaberrant (a) and aberrant (b) test centers (see Algorithm 2)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 6 11 17 22 28 33 39 44 50 56

Hc

Kullback–Leibler divergence

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 6 11 17 22 28 33 39 44 50 56

Hc

Kullback–Leibler divergence

(11)

Note 2: Informally, Algorithm 2 works in two stages. In Stage 1, potentially aberrant test centers are identified; in Stage 2, potentially aberrant test takers from identified test centers are reported. If number of examinees is 10,000 and number of test centers is 100 with 100 examinees per test center, then the number of incorrectly reported examinees is bounded by 100_G100_H (see results of computer simulations below).

Note 3: A particular set of test centers C influences the performance of Algorithm 2.

Commonly, the subset J_c J contains test takers taking a test at the corresponding geographic location c C (room, college, state, region, country, and so on). However,  each element c C can be a logical condition defining a relation between test takers.  Examples of such relations include same high school (or group of high schools

geographically close to one another), same undergraduate college (or group of colleges geographically close to one another), and relationships identified from Facebook or other social networks. This allows the detection of groups of aberrant test takers even if they take the actual exam at different geographic locations.

Case 2 ( S is unknown but belongs to a known collection  



W W1, 2,...,W_m



) Given the collection  



W W1, 2,...,W_m



, a set of test takers J, and a set of test centers C, an algorithm searching for aberrant test takers is introduced (Algorithm 3).

This algorithm, a modification of Algorithm 2, addresses the multiple comparison problem caused by enumeration through W , _i i1, 2,...,m (see Steps 3–6).

Algorithm 3

Step 1: A subset of potentially aberrant test centers is initialized as A . Step 2: For each i1, 2,..., ,m the following is performed:

Step 2.1: For each test taker jJ, the statistic _,  ( _\ || )

j i j i

j i T W T W

h D P P is

computed.

Step 2.2: For each test center cC, the empirical distribution H of _{c i}_, h , _{j i}_, jJ _c is computed.

Step 2.3: For each test center cC, the statistic

 

, ( , || ,) ( , || ,)









c i c i x i x i c i

x C

g D H H D H H is computed.

Step 2.4: Statistic g is normalized: _{c i}_, _, _, / _,







c i c i x i

x C

g g g , cC Figure 4 shows two . simulated examples of normalized g with 100 test centers, where the last 20 _{c i}_, test centers are aberrant (see details of computer simulations below).

(12)

Step 3: Statistic _,

1

,





^m

c c i

i

g g c C is computed. Figure 5 shows a simulated example  of g with 100 test centers, where the last 20 test centers were simulated as _c

aberrant and m100. One can see that aberrant test centers have the highest values of g _c.

Step 4: Empirical distribution G of g _c, c C is computed. Figure 6 demonstrates  the empirical distribution G of statistic g from Figure 5. _c

Step 5: Given a significance level _G, the critical value v for G is computed.

Step 6: For each test center c C the following is performed:  Step 6.1: If g_c v then A A { }.c

Step 6.2: ^* _,

1,2,...,

arg max .



c  c i

i m

i g

Step 7: For each i1, 2,..., ,m the following is performed:

Step 7.1: The empirical distribution H of _i h , _{j i}_,



 _c

c A

j J is computed.

Step 7.2: Given a significance level _H, the critical value w for _i H is computed. _i Step 8: A test taker j from test center cA is reported as aberrant if ,*  *

c c

j i i

h w .

(13)

(a)

(b)

FIGURE 4. Two simulated examples of normalized g_{c i}_, with 100 test centers, where the last 20 test centers are aberrant (see Algorithm 3). In (a), the maximum value of statistic

,1

gc i (unknown subset

 _i1

S W ) corresponds to aberrant test center 93. In (b), the maximum value of statistic

,2

gc i (unknown subset

 _i2

S W ) corresponds to aberrant test center 82.

0 0.05 0.1 0.15 0.2 0.25

1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96

gc,i

Test center

0 0.05 0.1 0.15 0.2 0.25

1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96

gc,i

Test center

(14)

FIGURE 5. Simulated example of g_c with 100 test centers, where the last 20 test centers were simulated as aberrant and m100 (see Algorithm 3)

FIGURE 6. Empirical distribution G of statistic g_c from Figure 5 (see Algorithm 3)

0 0.5 1 1.5 2 2.5 3 3.5

1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96

gc

Test center

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.00 0.34 0.67 1.01 1.35 1.69 2.02 2.36 2.70 3.04 3.37

G

gc

(15)

Case 3 ( S is unknown and S  Q R)

In this case, the collection  covering each unknown subset S is also unknown.

One can generate a large number of random subsets for the collection  and then run Algorithm 3. The following heuristic relies on a possibly large overlap between each unknown S and one of the generated subsets.

Algorithm 4

Step 1: Simulate CAT without test collusion and then compute item exposure.

Alternatively, one can compute item exposure from real data.

Step 2: Compute discrete distribution P for items from QR, where P is normalized item exposure.

Step 3: Form a random collection  



V V1, 2,...,V , where each _n



V is unique, each _i has a random size from [ , ]l u , and its items are drawn from the discrete distribution

.

P Using P is more realistic than applying the uniform distribution, because the greater the number of exposed items, the higher the probability that they will be used in test collusion.

Step 4: Run Algorithm 3 given collection .

Computer Simulations

Multiple simulation studies were conducted using disclosed Logical Reasoning (LR) items of the Law School Admission Test (LSAT). The response probability for each item was modeled by the three-parameter logistic (3PL) model (Lord, 1980). CAT pool R contained 500 LR items; details of the pool’s structure can be found in Belov,

Armstrong, and Weissman (2008). Nonaberrant test takers were simulated with abilities drawn from an N(0, 1) distribution; aberrant test takers were simulated with abilities drawn from a U(−3, 0) distribution. There were 100 test centers with 100 test takers per test center.

The item selection criterion for CAT was the maximization of Fisher information at the current estimate of ability ˆ. The estimator of  was the expected a posteriori (EAP) estimator with a uniform prior distribution over [−5, 5]. The ability estimate for each simulated examinee was initialized at ˆ 0.

Each aberrant test center could have no more than one aberrant group (see Note 1 in the Introduction). The number of aberrant test centers was n_c{5, 10, 20}. The number of aberrant test takers in each aberrant test center was n_e{5, 10, 20}. Each group was randomly assigned to a test center such that the total number of test takers in each test center would be 100. Algorithms 1–4 were implemented in C++ by the

(16)

The Type I error rate (that is, the empirical probability of a test taker being falsely reported as aberrant) was computed as follows:

[number of reported examinees] [number of correctly reported examinees]

[number of all nonaberrant examinees]

 (1)

The detection rate was computed according to the following:

[number of correctly reported examinees]

[number of all aberrant examinees] ⁽²⁾

Computational Study of Case 1

The following simulation study was performed:

1. Simulate CAT without test collusion.

2. Compute item exposure.

3. Given threshold  {0.7, 0.6}, form subsets of stolen items with exposure higher than . This step resulted in two subsets with 4 and 12 items, respectively.

4. Simulate CAT with test collusion for the two subsets S, where n_c{5, 10, 20}

and n_e{5, 10, 20}.

5. Run Algorithms 1 and 2 to identify aberrant test takers.

The results for _H {0.0025, 0.005, 0.0075, 0.01, 0.025, 0.05}, _G 0.2 are

presented in Figures 7–10. Both algorithms performed with Type I error rates that were less than the nominal levels (used for computing critical values); see Figures 7–8. As expected, Algorithm 2 has a higher detection rate than Algorithm 1 in the presence of large-scale test collusion, with larger differences for smaller values of _H (see

Figures 9–10). Because of the two-stage structure of Algorithm 2 (see Note 2 above), on average it was able to reduce the Type I error rate by half and to double the

detection rate in comparison to Algorithm 1 (see Figures 7–10).

(17)

nc=5, n_e=5 n_c=5, n_e=10 n_c=5, n_e=20

nc=10, n_e=5 n_c=10, n_e=10 n_c=10, n_e=20

nc=20, n_e=5 n_c=20, n_e=10 n_c=20, n_e=20

FIGURE 7. Empirical Type I error rates (Case 1: S is known, S 4), where the abscissa is _H

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

Nominal Boundary Algorithm 1 Algorithm 2

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

(18)

nc=5, n_e=5 n_c=5, n_e=10 n_c=5, n_e=20

nc=10, n_e=5 n_c=10, n_e=10 n_c=10, n_e=20

nc=20, n_e=5 n_c=20, n_e=10 n_c=20, n_e=20

FIGURE 8. Empirical Type I error rates (Case 1: S is known, S 12), where the abscissa is _H

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

(19)

nc=5, n_e=5 n_c=5, n_e=10 n_c=5, n_e=20

nc=10, n_e=5 n_c=10, n_e=10 n_c=10, n_e=20

nc=20, n_e=5 n_c=20, n_e=10 n_c=20, n_e=20

FIGURE 9. Empirical detection rates (Case 1: S is known, S 4), where the abscissa is _H

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.01 0.02 0.03 0.04 0.05

Empirical detection rate

Nominal Type I error rate Algorithm 1

Algorithm 2

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.01 0.02 0.03 0.04 0.05

Nominal Type I error rate

Algorithm 1 Algorithm 2

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.01 0.02 0.03 0.04 0.05

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.01 0.02 0.03 0.04 0.05

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.01 0.02 0.03 0.04 0.05

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.01 0.02 0.03 0.04 0.05

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.01 0.02 0.03 0.04 0.05

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.01 0.02 0.03 0.04 0.05

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.01 0.02 0.03 0.04 0.05

(20)

nc=5, n_e=5 n_c=5, n_e=10 n_c=5, n_e=20

nc=10, n_e=5 n_c=10, n_e=10 n_c=10, n_e=20

nc=20, n_e=5 n_c=20, n_e=10 n_c=20, n_e=20

FIGURE 10. Empirical detection rates (Case 1: S is known, S 12), where the abscissa is _H

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.01 0.02 0.03 0.04 0.05

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.01 0.02 0.03 0.04 0.05

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.01 0.02 0.03 0.04 0.05

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.01 0.02 0.03 0.04 0.05

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.01 0.02 0.03 0.04 0.05

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.01 0.02 0.03 0.04 0.05

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.01 0.02 0.03 0.04 0.05

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.01 0.02 0.03 0.04 0.05

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.01 0.02 0.03 0.04 0.05

(21)

Computational Study of Case 2

The following simulation study was performed:

Step 1: Simulate CAT without test collusion.

Step 2: Compute item exposure.

Step 3: Form subset QR of potentially stolen items with an exposure value higher than 0.4. This step resulted in Q with 51 items.

Step 4: Compute discrete distribution P for items from QR, where P is driven by the item exposure (the higher [or lower] the value of exposure for an item, the higher [or lower] the corresponding value of P).

Step 5: To each aberrant group, assign a unique random subset S Q such that 20 S 30 and items are drawn from .P Using P is more realistic than applying the uniform distribution, because the greater the number of exposed items, the higher the probability they will be used in test collusion.

Step 6: Simulate CAT with test collusion, where n_c{5, 10, 20} and n_e{5, 10, 20}.

Step 7: Run Algorithms 1–3 to identify aberrant test takers.

Since S was unknown and varied among aberrant groups, Algorithms 1 and 2 ran on QR. For Algorithm 3, m100 random subsets of random size from [20, 30] with items from Q were generated, such that the resultant collection  included all subsets generated at Step 5 above. The results for _H {0.0025, 0.005, 0.0075, 0.01, 0.025, 0.05}, _G 0.2 are presented in Figures 11–12. Algorithms 2 and 3 performed with Type I error rates that were substantially below the nominal level and below the Type I error rates for Algorithm 1 (see Figure 11). Algorithm 3 had the highest detection rate (see Figure 12), except when n =10, _c n =5. Because of the two-stage structure of _e Algorithm 3, it was able to simultaneously reduce the Type I error rate and increase the detection rate in comparison to Algorithm 1 (see Figures 11–12).

(22)

nc=5, n_e=5 n_c=5, n_e=10 n_c=5, n_e=20

nc=10, n_e=5 n_c=10, n_e=10 n_c=10, n_e=20

nc=20, n_e=5 n_c=20, n_e=10 n_c=20, n_e=20

FIGURE 11. Empirical Type I error rates (Case 2: S is unknown but covered by ), where the abscissa is _H

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

Nominal Boundary

Algorithm 1 Algorithm 2 Algorithm 3

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

Nominal Boundary

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

Nominal Boundary

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

Nominal Boundary

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

Nominal Boundary

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

Nominal Boundary

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

Nominal Boundary

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

Nominal Boundary

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05

0 0.01 0.02 0.03 0.04 0.05

Empirical

Nominal Boundary