6.4 Evaluation Setup
6.4.3 Baseline System
Now that the data set used for the evaluation has been presented, we now present the baseline system implemented. The baseline system serves as a lower bound for the performance to be expected by the matrix factorization algorithm.
Following the literature, a baseline recommender system based on the ideas pre- sented in [BSM06, Mic00] was implemented to compare the matrix factorization ap- proach with. These systems are based on association rule mining.
6.5 Evaluation Results and Discussion Method λr λc factors g+
Class context Softmargin 10 1 45 10
Logistic 20 20 30 10
Method context Softmargin 10 1 30 10
Logistic 10 10 30 10
Table 6.1: Parameter settings used in the software engineering experiments
The problem of mining association rules is to find all rules A→B that associate one set of method calls with another set. Hereby, A is referred to as the antecedent and B as the consequent of the rule. Two measures for the relevance of a rule exist. The sup- port specifies how often the method calls appear together. The confidence measures how frequently they conditionally appear, in our case how many rows in the matrix that contain the method calls of the antecedent also contain all method calls of the conse- quent. To limit the result to most likely meaningful rules, a maximum antecedent size, a minimum support and a minimum confidence is specified for the mining process.
For mining the association rules, a freely available Apriori rule miner2is used. In order to obtain a reasonable baseline for our approach, we run a set of experiments with different thresholds for minimum confidence and minimum support, namely five min- imum confidence values (0.01, 0.2, 0.5, 0.7 & 0.9 ) and three different absolute support thresholds (5, 20 & 40). Furthermore, we used the rule miner’s built-in statistical sig- nificance pruning with standard settings. We restricted the antecedent size of the asso- ciation rules to one. We selected these parameters based on results reported in [BSM06] for the FrUiT code recommender system. Experiments showed that an absolute sup- port threshold of 5 and a minimum confidence threshold of 0.9 lead to the best results according to the F1 measure.
6.5 Evaluation Results and Discussion
For both the class and the method context, parameter tuning on one of the ten data splits for optimal F1 score was performed. Table 6.1 contains these best parameters found.
Figure 6.5 shows the performance for the two loss functions used in the evaluation as well as the results obtained with the baseline system. Table 6.2 contains the precise numerical results. Note that the unanswered cases, the cases for which the systems don’t suggest any call, are generally fewer for the matrix factorization system. This allows the system to help the learning programmer in more cases, a generally desirable property.
F1 Precision Recall Rules Soft Margin Logistic Regression Class data 0.67 0.73 0.74 0.79 0.74 0.78 0.58 0.73 0.70 F1 Precision Recall Method data 0.4 0.5 0.6 0.7 0.8 0.9 0.68 0.73 0.74 0.83 0.72 0.82 0.57 0.75 0.68
Figure 6.5: F1, precision and recall results for the Method and Class data for the rule based approach and matrix factorization using the soft margin and the re- gression loss functions.
Context Algorithm F1 Precision Recall Unanswered
Method Rules 0.68±0.01 0.83±0.02 0.57±0.01 144±11 Softmargin 0.73±0.01 0.72±0.02 0.75±0.01 118±11 Logistic 0.74±0.01 0.82±0.02 0.68±0.02 124±8 Class Rules 0.67±0.02 0.79±0.03 0.58±0.02 56±5 Softmargin 0.73±0.02 0.74±0.02 0.73±0.02 48±10 Logistic 0.74±0.02 0.78±0.02 0.70±0.02 50±8
Table 6.2: Averaged results in terms of F1, precision and recall over 10 runs. Note that the differences in F1 between softmargin and logistic loss as well as the differ- ence in precision between rules and logistic loss are not statistical significant
6.5 Evaluation Results and Discussion
We observe that the matrix factorization algorithm with a logistic regression loss per- forms best in terms of the F1 score, closely matched by the soft margin loss. Matrix fac- torization overall performs significantly better than the rule based approach in terms of the F1 score. However, both loss functions show very different performance character- istics: For the logistic regression loss, recall and precision differ significantly while the performance of the soft margin loss function is more balanced.
Influence of the weight
Note that precision and recall are very sensitive to the value of g+, the positive weight. Thus, a desired balance between the two measures can be easily achieved by tweaking this value.
Figure 6.6 depicts the relation between the performance measures and the weight pa- rameter for the matrix factorization system with a soft margin loss. As to be expected, the value of precision decreases as the positive weight increases. The recall behaves in an inverse manner and increases as the weight goes up. The optimal value of the F1 is reached for the weight value 10 (2.3 in the log plot). The weight parameter thus pro- vides a convenient way of adjusting the performance of the system to the requirements of the particular situation.
Influence of the number of factors:
Another interesting parameter to study is the number of factors and its impact on sys- tem performance. The run-time performance of the system is almost only determined by its value.
Figure 6.7 depicts the values of F1, precision and recall for different numbers of fac- tors for the soft margin loss function. We observe that for about 30 factors the system reaches an almost optimal performance while further increases do not affect the perfor- mance in terms of the F1 score in a significant way.
Runtime performance
Training the initial model takes about 7 minutes for the class context data and 25 min- utes for the method context data using the logistic loss function. Based on this model, the system can predict calls in a few milliseconds in the current scenario. This is fast enough to implement the algorithm within an interactive environment.
The main memory needed is the one for C, as R can be discarded after initial training. C for 45 factors needs 1557∗45∗8 Bytes = 547 kBytes of memory. This also is easily available on a developer workstation.
0 2 4 6 8 0.0 0.2 0.4 0.6 0.8 1.0 log(weight) F1 Precision Recall
Figure 6.6: F1, precision and recall for the matrix factorization system with a soft mar- gin loss for different value of the weight parameter (on a natural log scale)
6.5 Evaluation Results and Discussion 0 50 100 150 200 250 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Number of factors F1 Precision Recall
Figure 6.7: Results for F1, precision and recall obtained with the matrix factorization system with a soft margin loss and different values for the number of factors parameter
6.6 Conclusion
We presented an application of the Detailed Machine Teaching approach in this chapter. As before, the focus was on the question whether a machine learning model, namely the matrix factorization approach described in Chapter 4 can be applied to the task of generating detailed feedback for learners.
We chose the software engineering domain for this evaluation, as source code is natu- rally digital and well structured. Both of which render source code analysis suitable for this evaluation. In addition to this easy availability and interpret-ability of the artifacts, the knowledge needed to create them is usually not externalized. These observations make software engineering a suitable target for an evaluation of the Detailed Feedback Machine Teaching approach.
The results, which were obtained based upon a simulated programmer, indicate that the machine learning model is indeed capable of capturing important knowledge from the provided expert code. This is especially noteworthy, as we did not perform any adaptations of the model and algorithm as introduced in Chapter 4: The main design decision was to decide for the structural elements to operate on and for a loss function. Both decisions are inevitable and do not form a special case for the application in this chapter.
Thus, the argument from Section 2.5 on page 47 proofed itself to be true in this ex- ample: Even if one could design a machine learning model for each kind of artifact, it is not always necessary for sufficient performance. Future work should investigate the suitability of the algorithm presented in Chapter 4 to further substantiate that claim.
The results also show empirically that the algorithm meets the requirements set out for a recommender system to be used in Machine Teaching in Section 2.5 on page 49 in this application domain, namely:
R1: Accurate solutions for partial artifacts: As we have described, the system is capa- ble of predicting well given 50 % of structural elements, in this case of the calls in a context.
R2: Interactive Performance: Prediction generally took only a few milliseconds. Obvi- ously, this is fast enough for interactive use of the system, where a response times of up to 0.1 seconds are generally considered to be unnoticeable (see e. g. [Mil68] and [NL06], Chapter 5).
This performance, coupled with the low memory footprint, also facilitates further application of the approach to the multi-framework case where the number of potential callees rises significantly.
R3: Recommendation Coherence: This again follows from the evaluation results: The system achieved high precision scores in the order of 0.75 which indicate that it was indeed capable of providing coherent suggestions to the user. This is espe- cially notable as the system showed a far higher recall than the baseline system.
6.6 Conclusion R4: Easy Adaptation to Multiple Types of Prediction: Through choosing the weighted
variants of the logistic and hinge loss functions, we adapted the system to the type or prediction necessary in this case.
Even though specialized models may outperform the application of the rather gen- eral matrix factorization framework done here, it is thus safe to conclude that the ma- chine learning aspect of Detailed Feedback Machine Teaching can be provided by ma- trix factorization models, especially by the one introduced in Chapter 4 of this thesis. The predictions generated are accurate and quickly available, even in the case where changed artifacts occur frequently.
7.1 Summary
Problem Statement By its very design, current Technology Enhanced learning sys- tems cannot convey knowledge that is not standardized, structured and most of all ex- ternalized. However, important knowledge such as intuition, experience or more gen- erally speaking implicit knowledge does not meet these requirements. Additionally, the structured externalization of knowledge in the form of electronic learning material is often impossible for economic reasons, as its creation is costly.
In traditional education, the apprenticeship with its master-student relationship is used to convey this knowledge. In this setting, the master explains to, discusses with and corrects the learner in a situation centered around the learned activity. This ap- proach is very effective. Yet, it by design requires the availability of a master in the field to the learner.
Approach MACHINETEACHING is proposed in this thesis to overcome these limita- tions: We notice that knowledge, even if not standardized, structured and externalized, can still be observed through its application. We refer to this observable knowledge as PRACTICED KNOWLEDGE. Machine Teaching is build upon this notion: Machine Learning techniques are used to extract machine models of Practiced Knowledge from observational data. The models thus capture traces of the Practiced Knowledge that went into the artifacts and processes, be it implicit or explicit. These models are then applied in the learner’s context for his support.
It is important to note that the people creating the artifacts or following the processes need not to be aware of their role in the system: They are not teachers, they are prac- titioners. Therefore, Machine Teaching focuses on the practice, not the ideal, just as the apprenticeship.
While both processes and artifacts can be the anchor in a Machine Teaching scenario, the automatic detection of processes still is an open field of research. As this is a pre- condition to a successful implementation of a Machine Teaching system, the thesis puts a focus on artifacts.
Levels of Feedback
We identified two important sub-classes of Machine Teaching: General and Detailed Feedback Machine Teaching.
General Feedback Machine Teaching General feedback in a way is like the grades given at school: Given an artifact, the Machine Teaching system determines its qual- ity. This is a straight-forward application of techniques from the field of supervised machine learning and the main effort in building a specific Machine Teaching System lies in deriving appropriate feature extraction procedures. Thus, General Feedback Ma- chine Teaching is readily applicable to a wide range of scenarios where the feature engineering task is either straight-forward or has been studied to great extend. This in- cludes tasks related to text, images and of course artifacts which exhibit feature-vector
7.1 Summary structure themselves. The application presented in this thesis, learning to write for a specific community, shows that the approach is viable and that the predicted quality of the artifact can in fact be very close to the one perceived by the community.
Detailed Feedback Detailed Feedback Machine Teaching aims at providing feedback regarding the attributes of the artifact to the learner. To do so, the machine learning model needs to have a notion of these artifact and their relations. This, in principle, provokes the creation of a specific machine learning model for each detailed feedback application of Machine Teaching. However, it has been shown in this thesis that De- tailed Feedback Machine Teaching often exhibits a striking similarity to the task of a recommender system. Where the latter suggests items to customers, the former “sug- gests” attributes to artifacts. Despite this obvious similarity, Machine Teaching poses additional requirements onto the underlying algorithm not usually occurring in Rec- ommender Systems. These include a greater variety in the attributes of artifacts, the frequent prediction upon yet unseen artifacts and the real-time nature of a Machine Teaching system. Last but not least, Detailed Feedback Machine Teaching imposes con- sistency requirements for one artifact as a whole, while a Recommender System is usu- ally focused on predicting the rating a user would associate with a single item.
Matrix Factorization Model and Method This thesis presented a novel model and al- gorithm based on factor models that substantially generalizes the state-of-the-art in this field to address these requirements uniquely found in Machine Teaching. The al- gorithm is a valuable contribution to the field of Recommender Systems in its own right as it is the first to be able to predict rankings as opposed to ratings. Additionally, it forms a hybrid Recommender System that can use known features of the user (artifacts) and items (structural elements) in addition to the collaborative effect present in such data. The base algorithm as well as the extensions thereof proposed in this thesis have shown favorable performance on standard Recommender Systems evaluation data sets avail- able. It is important to note that the algorithm excels especially in what is called the new-user problem in Recommender Systems: Given a new, yet unseen user with some rated items and a trained model based on other users, the goal is to predict well for this new user. While this situation is the exception in Recommender Systems, it is the norm in Machine Teaching: every time the learner presents the system with an artifact, that artifact constitutes a “new user”. Thus, excellent predictive performance on this task is crucial for Machine Teaching.
Machine Teaching for Software Engineering The domain of learning to program well within a software framework is modeled as a task for Detailed Feedback Machine Teaching by forming a caller-callee matrix of representative code. This matrix stores every call to the software framework from the code analyzed, where each row repre- sents one method or class of the analyzed code and each column represents a possibly callable method of the framework. The goal of a Machine Teaching system when con- fronted with a partial row is to predict missing calls. As we have shown in the empirical
evaluation, the model is capable of capturing the structure in this matrix and can there- fore compute meaningful feedback for the learning programmer.
This application exemplifies the expressive power of the factor model presented in this thesis, as the model was applied to the problem in a straight-forward manner. So even while purpose-built models are likely to perform even better, the factor model can serve as the underlying model for a wide range of Machine Teaching applications. It therefore provides Detailed Feedback Machine Teaching with a foundation just as supervised machine learning forms the base of General Feedback Machine Teaching.