Chapter 3: Programming Activity Difficulty Detection and Implementation
3.11 Limitations
A central limitation in this chapter is the size of both studies.
Size of studies: Both studies in this chapter had a limited sample size, which could limit
the generalizability of the studies. The first study had six developers and the second study had 14 developers. The number of participants in the second study is typical for studying software developers.
Developer experience: Developers in the first study were students, which could limit the
findings to education. Developers with more industry experience may perform different
programming actions when they are having difficulty. The second study partially addresses this limitation by using five industrial participants. However, as mentioned above, this study also had a limited sample size (14 developers).
Maintenance tasks: In both of our studies, developers implemented programs from
scratch, which could mean our results may not apply to maintenance tasks. Maintenance tasks can be expected to involve more navigation for the same degree of difficulty.
Difficulty-Detection Algorithm Performance: A limitation of the difficulty detection
algorithm is that it has a high false negative rate. This limitation can be addressed in a number of ways. One way is to determine the most accurate machine learning algorithm. In this chapter, we tried several machine learning algorithms and showed that the decision tree algorithm performed the best. Another way is to investigate different approaches that address the class imbalance problem. We use the SMOTE algorithm, which is a form of oversampling that increases the size of the minority class (in our case, having difficulty). There are additional approaches such as undersampling, which decreases the size of the majority class (in our case, making progress), and cost-sensitive learning. In a two-class problem, as in our case, cost-sensitive learning assigns a cost to both classes (making progress and having difficulty). Difficulty detection modules can use these costs to a) change the distribution of the data set according to the costs or b) only predict the high-cost class when the module is confident about the prediction. It would also be useful to investigate different log segmentation methods such as segmenting the log based on time or using a sliding window approach. An alternative is to investigate additional features such as idle time, the amount of pressure on a mouse or keyboard, or features that can be computed from non-standard equipment such as body posture. The performance of the difficulty detection algorithm is addressed in the next chapter.
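As a rough illustration of two of the class-imbalance options discussed above, the following Python sketch shows random undersampling of the majority class and a cost-sensitive decision rule. It is not the implementation used in our studies: the function names, the label strings, the cost values, and the assumption that a classifier outputs a probability of having difficulty are all hypothetical choices made for this example.

```python
import random

PROGRESS = "making progress"
DIFFICULTY = "having difficulty"


def undersample(segments, labels, majority=PROGRESS, seed=0):
    """Randomly drop majority-class segments until the classes are balanced.

    `segments` and `labels` are parallel lists; original order is preserved.
    """
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    minority_idx = [i for i, y in enumerate(labels) if y != majority]
    n_keep = min(len(majority_idx), len(minority_idx))
    kept_majority = rng.sample(majority_idx, n_keep)
    keep = sorted(kept_majority + minority_idx)
    return [segments[i] for i in keep], [labels[i] for i in keep]


def min_cost_label(p_difficulty, cost_false_alarm, cost_missed_difficulty):
    """Choose the label with the lower expected misclassification cost.

    p_difficulty: assumed classifier estimate that the segment is a
        having-difficulty segment (a value between 0 and 1).
    cost_false_alarm: cost of predicting difficulty during progress.
    cost_missed_difficulty: cost of predicting progress during difficulty.
    """
    expected_cost_if_progress = p_difficulty * cost_missed_difficulty
    expected_cost_if_difficulty = (1.0 - p_difficulty) * cost_false_alarm
    if expected_cost_if_difficulty < expected_cost_if_progress:
        return DIFFICULTY
    return PROGRESS


if __name__ == "__main__":
    labels = [PROGRESS] * 8 + [DIFFICULTY] * 2
    segments = list(range(len(labels)))
    _, balanced = undersample(segments, labels)
    print(balanced.count(PROGRESS), balanced.count(DIFFICULTY))
    print(min_cost_label(0.3, cost_false_alarm=1, cost_missed_difficulty=5))
```

With these illustrative costs, a missed difficulty is five times as expensive as a false alarm, so the rule predicts having difficulty even at a modest probability. Reversing the costs makes the rule predict the costly class only when the estimated probability is high, which corresponds to option b) above.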
Ground Truth: Another limitation is using developers’ perceptions as ground truth. The argument for
using developers’ perceptions is that they are programming and should know whether they are having difficulty. However, people tend to underestimate their problems, which could mean developers may not always admit when they are having difficulty. Given that having difficulty is a rare event and that developers may not admit when they are having difficulty, we could miss collecting a large amount of data. To reduce the chance of missing a large amount of data, our study involved three observers. The first observer, I, watched participants while they were programming and labeled times when I thought they were having difficulty. Studies that rely on the experimenter to label data for difficulty detection modules are subject to experimenter’s bias. To reduce this bias, we recruited two additional observers who were not associated with the experiment to blindly label the making progress and having difficulty moments indicated by the first observer and participants.
Privacy: We did not formally evaluate privacy controls, but we have gotten some initial
feedback on users’ preferences.
3.12 Summary
approaches by logging developers’ interactions with programming environments and inputting these actions into a difficulty detection module that predicts whether developers are having difficulty or making progress. To evaluate this component, we describe several performance metrics, define three baselines, and conduct a small field study (6 participants). Our results show that the component performs better than baseline measures.
We also evaluate this component in a lab study (14 participants), which is larger than our field study (6 participants). Our results show that the framework performs better than baseline measures when using the perceptions of observers and developers as ground truth. These results, combined with the results from our field study, provide evidence to support sub-thesis I, which we restate here.
Programming Activity Difficulty Detection Sub-Theses (Sub-thesis I):
It is possible to develop an approach that a) uses developers' interactions with their
programming environment to determine whether developers are having difficulty with their task and b) performs better than baseline measures.
To determine how well programming-activity difficulty detection works in practice, we develop the reusable difficulty detection framework, which uses two standard design patterns, Mediator and Strategy, to enable the component to be used in two programming environments. Our results show that the number of lines of code needed to implement the reusable difficulty detection framework (4,643) is significantly smaller than the number of lines of code needed to implement difficulty detection modules written specifically for Visual Studio (9,096) and Eclipse (11,000). These results provide evidence to support sub-thesis II, which we restate here.
Implementation Sub-Theses (Sub-thesis II): It is possible to develop a common set of difficulty detection modules for different programming environments that have significantly fewer lines of code than difficulty detection modules written specifically for each programming environment.
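To make the role of the two design patterns concrete, the following Python sketch shows one way a shared Mediator and a pluggable prediction Strategy could let thin environment-specific adapters reuse the same detection code. It is only a sketch under stated assumptions and does not reproduce the framework's actual classes: every class name, the event identifiers for Eclipse and Visual Studio, the fixed segment size, and the toy navigation-ratio rule are hypothetical.

```python
from abc import ABC, abstractmethod


class PredictionStrategy(ABC):
    """Strategy: interchangeable rule that labels one segment of commands."""

    @abstractmethod
    def predict(self, commands):
        ...


class NavigationRatioStrategy(PredictionStrategy):
    """Toy rule (an assumption): many navigation commands suggest difficulty."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def predict(self, commands):
        if not commands:
            return "making progress"
        ratio = sum(1 for c in commands if c == "navigate") / len(commands)
        return "having difficulty" if ratio > self.threshold else "making progress"


class DifficultyMediator:
    """Mediator: adapters and listeners talk only to this object."""

    def __init__(self, strategy, segment_size=4):
        self.strategy = strategy
        self.segment_size = segment_size
        self._buffer = []
        self._listeners = []

    def add_listener(self, callback):
        self._listeners.append(callback)

    def on_command(self, command):
        # Collect commands into fixed-size segments, then predict and notify.
        self._buffer.append(command)
        if len(self._buffer) == self.segment_size:
            status = self.strategy.predict(self._buffer)
            self._buffer = []
            for listener in self._listeners:
                listener(status)


class EnvironmentAdapter:
    """Environment-specific part: maps raw events to a shared vocabulary."""

    def __init__(self, mediator, event_map):
        self.mediator = mediator
        self.event_map = event_map

    def handle(self, raw_event):
        command = self.event_map.get(raw_event)
        if command is not None:  # ignore events outside the shared vocabulary
            self.mediator.on_command(command)


# Hypothetical event names, invented for this example only.
ECLIPSE_EVENTS = {"example.eclipse.openFile": "navigate",
                  "example.eclipse.typeText": "edit"}
VISUAL_STUDIO_EVENTS = {"Example.VS.GoToDefinition": "navigate",
                        "Example.VS.TypeChar": "edit"}

if __name__ == "__main__":
    mediator = DifficultyMediator(NavigationRatioStrategy())
    mediator.add_listener(print)
    eclipse = EnvironmentAdapter(mediator, ECLIPSE_EVENTS)
    for event in ["example.eclipse.openFile"] * 3 + ["example.eclipse.typeText"]:
        eclipse.handle(event)
```

In this arrangement only the event maps differ between environments, while segmentation, prediction, and notification are written once, which is the kind of reuse that the line counts above are meant to measure.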
Chapter 4: Multimodal Difficulty Detection