Chapter 6. Difficulty Level and Barrier Detection
6.1 Barrier Detection
As mentioned above, Ko et al. [37] have identified several barriers. Our own data, drawn from the students we monitored in help sessions, suggested a simpler classification scheme: algorithm design issues and difficulty correcting incorrect output. However, by the time the difficulty studies were done in the course, issues with using this tool had been ironed out.
Therefore, we decided to determine if it was possible to automatically distinguish between design and incorrect output barriers.
Our recordings of the help sessions provided us with the data required to make this distinction. As before, we used coders to derive this information. To enable this process, we developed a variation of the video observation tool of Figure 3.13 (chapter 3), which shows all segments where participants asked for help, and enables observers to identify the barriers the participants faced. The coders agreed on 44 out of 50 difficulty points (κ = 0.79). Of these, 66% were classified as design barriers and 34% as incorrect output barriers.
Now that we had ground truth, we had to identify appropriate features to automatically detect the barriers. Based on our observations of the recordings around difficulty points, we found the following: When programmers had incorrect output, the frequency of debug commands increased and the frequency of edit commands decreased. When they had design problems, they spent a large amount of time outside of the programming environment.
The feature set of our difficulty detection tool had been deliberately chosen to ignore wall-clock time. The reason was that we wanted to prevent our mechanism from classifying idle phases as difficult ones. Based on the observations above, it seemed we now had to consider the passage of time. We envisioned a two-phase difficulty detection scheme in which our previous time-independent detection features are first used to detect difficulties, and a new set of classification features is then used to identify the barrier.
We included all of the previous detection features in the second classification set. In addition, we added features measuring the rate of interaction with the programming environment. The result was the following set of features. An asterisk indicates the previous detection features.
Classification Features
(1) *Insertion ratio = # of insertions / # of total events.
(2) *Deletion ratio = # of deletions / # of total events.
(3) *Navigation ratio = # of navigations / # of total events.
(4) *Debug ratio = # of debugs / # of total events.
(5) *Focus ratio = # of focus changes / # of total events.
(6) Mean time between events = total time / # of total events.
(7) Mean insertion time = total insertion time / # of insertion events.
(8) Mean deletion time = total deletion time / # of deletion events.
(9) Mean focus time = total focus time / # of focus events.
(10) Mean navigation time = total navigation time / # of navigation events.
(11) Mean debug time = total debug time / # of debug events.
All of these times were measured in milliseconds. As before, we divided a log into 50-command segments, and computed these features independently for each segment.
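The eleven features above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the `Event` record, its `kind` labels, and the idea that the log attributes a duration in milliseconds to each event are assumptions made for the example, as is keeping a short trailing segment when the log length is not a multiple of 50.

```python
from dataclasses import dataclass
from typing import Dict, List

SEGMENT_SIZE = 50  # commands per segment, as stated in the text
KINDS = ("insertion", "deletion", "navigation", "debug", "focus")


@dataclass
class Event:
    kind: str           # one of KINDS (hypothetical labels)
    duration_ms: float  # time attributed to the event, in ms (assumed log field)


def split_into_segments(events: List, size: int = SEGMENT_SIZE) -> List[List]:
    """Cut a log into consecutive segments of `size` events.

    A shorter trailing segment is kept; the text does not say how the
    original tool treats it.
    """
    return [events[i:i + size] for i in range(0, len(events), size)]


def segment_features(events: List[Event]) -> Dict[str, float]:
    """Compute the five ratios and six mean times for one segment."""
    total = len(events)
    total_time = sum(e.duration_ms for e in events)
    features: Dict[str, float] = {}
    for kind in KINDS:
        of_kind = [e for e in events if e.kind == kind]
        # Features (1)-(5): share of events of this kind.
        features[f"{kind}_ratio"] = len(of_kind) / total if total else 0.0
        # Features (7)-(11): mean time per event of this kind,
        # defined as 0.0 here when the kind does not occur in the segment.
        features[f"mean_{kind}_time"] = (
            sum(e.duration_ms for e in of_kind) / len(of_kind) if of_kind else 0.0
        )
    # Feature (6): total time divided by the total number of events.
    features["mean_time_between_events"] = total_time / total if total else 0.0
    return features
```

Each segment returned by `split_into_segments` would be passed to `segment_features` independently, which matches the per-segment computation described above.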
To determine how indicative the detection and classification features are of programmers' behavior, we graphed the programming behavior of six programmers. In each graph, the x-axis is session time and the y-axis is the percent or time (in milliseconds) for each feature. Figure 6.1 shows portions of the graphs created for participants 1 and 2, respectively, illustrating commonalities in the behavior of the programmers when they are having difficulty correcting incorrect output. In both cases, participants' debug percentages increased, and the edit (insertion and deletion) percentages decreased. Figure 6.2 shows commonalities in the behavior of participants 2 and 4 when they are having algorithm design issues. In both cases, the participants spent a large amount of time outside of the programming environment, which is indicated by the mean focus time. In particular, participant 3 (4) spent 120 (350) seconds outside of the programming environment. Thus, the four graphs support our choice of features.
Figure 6.2: Programming activities when participants are having issues designing algorithms.
We fed to our decision tree algorithm the features of (a) each segment during which the programmer had explicitly indicated difficulty, which we refer to as an explicit segment, and (b) each segment that preceded an explicit segment and occurred within two minutes of the explicit segment, which we refer to as an implicit segment. The reason for (b) is that, on average, coders took two minutes to determine the barrier, and we assumed an algorithm would need the same amount of information.
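The selection of explicit and implicit segments can be illustrated as follows. This is a sketch under stated assumptions: the `Segment` record and its timestamp fields are hypothetical, and "within two minutes" is interpreted here as the gap between the end of a preceding segment and the start of an explicit segment.

```python
from dataclasses import dataclass
from typing import List

WINDOW_MS = 2 * 60 * 1000  # two minutes, as stated in the text


@dataclass
class Segment:
    start_ms: int   # hypothetical field: segment start, ms since session start
    end_ms: int     # hypothetical field: segment end, ms since session start
    explicit: bool  # True if the programmer indicated difficulty in this segment


def training_segments(segments: List[Segment]) -> List[Segment]:
    """Return explicit segments plus the implicit segments preceding them."""
    explicit = [s for s in segments if s.explicit]
    selected = []
    for s in segments:
        if s.explicit:
            selected.append(s)
        # Implicit: ends at most WINDOW_MS before some explicit segment starts.
        elif any(0 <= e.start_ms - s.end_ms <= WINDOW_MS for e in explicit):
            selected.append(s)
    return selected
```

Segments that follow an explicit segment, or that end more than two minutes before one, are left out of the training data under this interpretation.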
We used a standard technique known as 10-fold cross validation, which executes 10 trials of model construction and, in each trial, splits the logged data so that 90% of the data are used for training and 10% for evaluation. We used a group model, as in chapter 3, in which the data of multiple programmers were aggregated during the training phase.
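The 90%/10% splitting across 10 trials can be sketched as below. This only illustrates the partitioning step; the shuffling, the fixed seed, and the function name `k_fold_indices` are assumptions, and the text does not say whether the original splits were shuffled or stratified.

```python
import random
from typing import List, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index pairs over n pooled segments.

    Every index appears in exactly one test fold, so each trial trains on
    roughly (k-1)/k of the data and evaluates on the remaining 1/k.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # assumed: shuffle before splitting
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, test))
    return splits
```

For a group model, the indices would range over the pooled segments of all programmers, so a single programmer's segments can appear in both the training and the evaluation part of a trial.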
Results
The confusion matrix of Table 6.1 shows the results of using the decision tree algorithm on the group model. Incorrect output is the positive class and design is the negative class. The accuracy of this algorithm is 82%, the true positive rate is 73%, the true negative rate is 86%, the false negative rate is 27%, and the false positive rate is 14%. These results show that the algorithm correctly classified 25 of the 29 (86%) design barriers, and 11 of the 15 (73%) incorrect output barriers. To provide evidence to support part (a) of sub-thesis V, we compare the results of the decision tree algorithm to the baselines described in chapter 3. We use the same approach described in chapter 3 to compute each baseline.
Table 6.1: Barrier Confusion Matrix for Help Sessions.
                          Predicted Incorrect Output   Predicted Design
Actual Incorrect Output   11 (True positives)          4 (False negatives)
Actual Design             4 (False positives)          25 (True negatives)
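The rates quoted for the decision tree follow directly from the four cells of Table 6.1. The short sketch below recomputes them; the function name `rates` is illustrative and not part of the original tool.

```python
def rates(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Derive the standard rates from a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,  # correct predictions over all cases
        "tpr": tp / (tp + fn),          # incorrect output barriers found
        "tnr": tn / (tn + fp),          # design barriers found
        "fnr": fn / (tp + fn),          # incorrect output barriers missed
        "fpr": fp / (tn + fp),          # design barriers misclassified
    }


# Cells of Table 6.1 (positive class: incorrect output).
table_6_1 = rates(tp=11, fn=4, fp=4, tn=25)
percent = {name: round(100 * value) for name, value in table_6_1.items()}
print(percent)
```

Rounded to whole percentages, this gives 82% accuracy, 73% true positive rate, 86% true negative rate, 27% false negative rate, and 14% false positive rate, matching the figures in the text.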
Table 6.2 shows the results for the random baseline. The accuracy of this baseline is 53%, the true positive rate is 53%, the true negative rate is 52%, the false negative rate is 47%, and the false positive rate is 48%. This baseline correctly identifies 53% of the incorrect output barriers and 52% of the design barriers. As in chapter 3, we use the binomial test to determine if there is a statistically significant difference between each baseline and the results of our decision tree algorithm. The decision tree algorithm performs significantly better than the random baseline on both rates (TPR=73% vs. TPR=53%, p < .05; TNR=86% vs. TNR=52%, p < .001).
Table 6.2: Confusion matrix for random baseline.
                          Predicted Incorrect Output   Predicted Design
Actual Incorrect Output   8 (True positives)           7 (False negatives)
Actual Design             14 (False positives)         15 (True negatives)
Table 6.3 shows the results for the modal baseline. The accuracy of this baseline is 66%, the true positive rate is 0%, the true negative rate is 100%, the false negative rate is 100%, and the false positive rate is 0%. This baseline correctly identifies all of the design barriers, but never identifies the incorrect output barriers. As mentioned in chapter 3, we do not use a significance test to compare the modal baseline to our approach. The true positive rate of our approach (73%) is better than the true positive rate of the modal baseline (0%).
Table 6.3: Confusion matrix for modal baseline.
                          Predicted Incorrect Output   Predicted Design
Actual Incorrect Output   0 (True positives)           15 (False negatives)
Actual Design             0 (False positives)          29 (True negatives)
Table 6.4 shows the results for the data distribution baseline. The accuracy of this baseline is 55%, the true positive rate is 33%, the true negative rate is 66%, the false negative rate is 67%, and the false positive rate is 34%. This baseline identifies 66% of the design barriers, but only 33% of the incorrect output barriers. The decision tree algorithm performs significantly better than the data distribution baseline (TPR=73% vs. TPR=33%, p < .001) (TNR=86% vs. TNR=66%, p < .001).
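To make the three baselines easier to compare, the sketch below computes the rates one would expect from a classifier that ignores the features and predicts "incorrect output" with a fixed probability. This is an assumption about how such baselines behave in general, not a restatement of the chapter 3 procedure: 0.5 for the random baseline, 0 for the modal baseline (always the majority class, design), and the class share 15/44 for the data distribution baseline.

```python
def expected_rates(q_positive: float, n_positive: int, n_negative: int) -> dict:
    """Expected rates when the positive class is predicted with probability
    q_positive, independently of the true label (illustrative assumption)."""
    total = n_positive + n_negative
    tpr = q_positive        # chance a positive case is labelled positive
    tnr = 1.0 - q_positive  # chance a negative case is labelled negative
    accuracy = (tpr * n_positive + tnr * n_negative) / total
    return {"tpr": tpr, "tnr": tnr, "accuracy": accuracy}


N_POS, N_NEG = 15, 29  # incorrect output and design barriers (row totals of Table 6.1)

random_baseline = expected_rates(0.5, N_POS, N_NEG)
modal_baseline = expected_rates(0.0, N_POS, N_NEG)
distribution_baseline = expected_rates(N_POS / (N_POS + N_NEG), N_POS, N_NEG)
```

Under these assumptions the modal baseline has 0% TPR, 100% TNR, and about 66% accuracy, and the data distribution baseline has about 34% TPR, 66% TNR, and 55% accuracy. The small differences from the reported figures (for example 33% rather than 34%) presumably arise because the tables contain whole counts.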
Table 6.4: Confusion matrix for data distribution baseline.
                          Predicted Incorrect Output   Predicted Design
Actual Incorrect Output   5 (True positives)           10 (False negatives)