
What makes teachers effective? A critical factor is their instructional policy, which specifies the manner and content of instruction. We use the term ‘policy’ in the standard sense: a set of procedures governing action, in this case, rules that guide how a student should be taught. Electronic tutoring systems have been constructed that implement domain-specific instructional policies (e.g., J. R. Anderson et al., 1989; K. R. Koedinger & Corbett, 2006; Martin & van Lehn, 1995). A tutoring system decides at every point in a session whether to present some new material, provide a detailed example to illustrate a concept, pose new problems or questions that are similar to previously presented examples, or lead the student step-by-step to discover an answer. Prior efforts have focused on higher cognitive domains (e.g., algebra) in which policies result from an expert-systems approach involving careful handcrafted analysis and design followed by iterative evaluation and refinement. As a complement to these efforts, we are interested in addressing fundamental questions in the design of instructional policies that pertain to basic cognitive skills. For example, how long should the teacher wait after posing a question before providing an answer? How much time should the teacher spend on each subtopic within a topic? When the teacher asks a question, should the teacher offer additional support in the form of hints or partial answers to provide scaffolding for learning, and what hints should be provided? How difficult a question should the teacher select given the student’s study and performance history? Should successive questions concern the same concept/topic, or should a switch be made to a different concept/topic?

Consider a concrete example: training individuals to discriminate between two perceptual or conceptual categories, such as determining whether mammogram x-ray images are negative or positive for an abnormality. In training from examples, should the instructor tend to alternate between categories (as in pnpnpnpn for positive and negative examples) or present a series of instances from the same category (as in ppppnnnn) (Goldstone & Steyvers, 2001)? Both of these strategies, interleaving and blocking, respectively, are adopted by human instructors (Khan, Zhu, & Mutlu, 2011). Reliable advantages of one strategy over the other have been observed (S. H. K. Kang & Pashler, 2011; Kornell & Bjork, 2008), and factors influencing the relative effectiveness of each have been explored (Carvalho & Goldstone, 2011). Why blocking vs. interleaving? The points of comparison are often selected based on the experimenter’s intuition about what will be effective and, in order to obtain a publishable comparison, ineffective.

Empirical evaluation of blocking and interleaving policies involves training a set of human subjects with a fixed-length sequence of exemplars drawn from one policy or the other. During training, exemplars are presented one at a time, and typically subjects are asked to guess the category label associated with the exemplar, after which they are told the correct label. (Jacoby, Wahlheim, and Coane (2010) have shown that actively engaging subjects by requiring them to assign labels yields better learning than passive viewing of labeled exemplars.) Following training, mean classification accuracy is evaluated over a set of test exemplars. Such an experiment yields an intrinsically noisy evaluation of the two policies, limited by the number of subjects and inter-individual variability. Consequently, the goal of a typical psychological experiment is to find a statistically reliable difference between the training conditions, allowing the experimenter to conclude that one policy is superior.

Blocking and interleaving are but two points in a space of policies that could be parameterized by the probability, ρ, that the exemplar presented on trial t + 1 is drawn from the same category as the exemplar on trial t. Blocking and interleaving correspond to ρ near 1 and 0, respectively. (There are many more interesting ways of constructing a policy space that includes blocking and interleaving—e.g., ρ might vary with t or with a student’s running-average classification accuracy—

Figure 5.1: (left) Samples from a function space that characterizes policies for choosing the category of training exemplars over a sequence of trials; (right) Illustration of a 1D instructional policy space: dashed line is performance as a function of policy; vertical black bars are experiment outcomes with uncertainty; red line and pink shading represent Gaussian Process posterior density.

but we will use the simple fixed-ρ policy space for illustration.) Although one would ideally like to explore the policy space exhaustively, limits on the availability of experimental subjects and laboratory resources make it challenging to conduct studies evaluating more than a few candidate policies to the degree necessary to obtain statistically significant differences.
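To make the fixed-ρ policy space concrete, the following is a minimal sketch of how a training sequence could be sampled for a given ρ. The function name `sample_category_sequence`, the two-category labels `p` and `n`, and the uniform choice of the first trial's category are illustrative assumptions, not details taken from the text.

```python
import random


def sample_category_sequence(rho, n_trials, categories=("p", "n"), seed=None):
    """Sample a sequence of category labels under a fixed-rho policy.

    Each trial after the first repeats the previous trial's category with
    probability rho and switches to the other category otherwise.
    Assumes exactly two categories (an illustrative simplification).
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    if len(categories) != 2:
        raise ValueError("this sketch assumes exactly two categories")

    rng = random.Random(seed)
    # Assumption: the category of the first trial is chosen uniformly.
    sequence = [rng.choice(categories)]
    for _ in range(n_trials - 1):
        previous = sequence[-1]
        if rng.random() < rho:
            sequence.append(previous)  # repetition
        else:
            other = categories[1] if previous == categories[0] else categories[0]
            sequence.append(other)  # switch
    return "".join(sequence)


if __name__ == "__main__":
    for rho in (0.0, 0.5, 1.0):
        print(rho, sample_category_sequence(rho, 8, seed=1))
```

With ρ = 0 the sequence alternates strictly (pure interleaving), and with ρ = 1 it never leaves the first category; a practical blocked schedule such as ppppnnnn corresponds to ρ near, but below, 1.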

Figure 5.1a shows some examples of the former kind, time-dependent policies. The fixed interleaved and blocked policies are also depicted (the horizontal lines). These policies have the functional form

\[
\Pr(\text{category repetition on trial } t+1) = \beta_1 + \frac{\beta_2}{1 + e^{\beta_3 t + \beta_4}}, \tag{5.1}
\]

where β defines a four-dimensional policy space, which includes time-invariant policies such as blocking and interleaving.
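As a rough illustration of Equation 5.1, the sketch below evaluates the repetition probability for a given β. The function name `repetition_probability`, the example β values, and the clipping of the result to [0, 1] are assumptions made for illustration; the text does not state how β is constrained.

```python
import math


def repetition_probability(t, beta):
    """Evaluate Pr(category repetition on trial t + 1) from Equation 5.1.

    beta is a 4-tuple (beta1, beta2, beta3, beta4).
    """
    b1, b2, b3, b4 = beta
    z = b3 * t + b4
    # Guard against overflow in exp for very large exponents.
    logistic = 0.0 if z > 700 else 1.0 / (1.0 + math.exp(z))
    p = b1 + b2 * logistic
    # Assumption: clip to a valid probability; the source does not
    # specify constraints on beta that would guarantee this.
    return min(1.0, max(0.0, p))


if __name__ == "__main__":
    blocked = (0.9, 0.0, 0.0, 0.0)      # beta2 = 0: constant, near-blocking
    interleaved = (0.1, 0.0, 0.0, 0.0)  # beta2 = 0: constant, near-interleaving
    ramp = (0.1, 0.8, -0.2, 10.0)       # hypothetical: interleave early, block late
    for name, beta in [("blocked", blocked), ("interleaved", interleaved), ("ramp", ramp)]:
        probs = [round(repetition_probability(t, beta), 3) for t in (0, 25, 50, 75, 100)]
        print(name, probs)
```

Setting β2 = 0 removes the dependence on t, so the probability is the constant β1; this is how the time-invariant blocking and interleaving policies arise as special cases of the four-dimensional space.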