Using different mechanisms, we collected several types of information at different moments in time: before, during, and after the experiment. We used blind marking for grading the correctness of the participants’ solutions.
8.3.1
Personal Information
Before the experiment, we collected both personal information (e.g., gender, age, nationality) used for statistics and professional information (e.g., current job position and experience levels in a set of four skills identified as important for the experiment) used for blocking, by means of an online questionnaire presented in Section A.1. The collected data allowed us to know at all times the number of participants that we could rely on and to plan our experimental runs.
8.3.2
Timing Data
To measure the completion time, we asked the participants to log the start time, split times, and end time. However, we learned that this measure alone was not reliable when, on one occasion, several participants, excited by the upcoming task, simply forgot to write down the split times. Moreover, we needed to make sure that none of them would use more than ten minutes per task, which was not something we could ask them to watch for.
To tackle this issue, we developed a timing web application in Smalltalk using the Seaside framework. During the experimental sessions, the timing application would run on the experimenter’s computer and project the current time (see Figure 8.2).
Figure 8.2. The output of our timing web application
The displayed time was used as a common reference by the participants whenever they were required in the questionnaire to log the time. In addition, the application displayed the following information for each participant: the name, the current task, and the maximum remaining time for the current task.
The subjects were asked to notify the experimenter every time they logged the time, so that the experimenter could reset their personal timer by clicking on the hyperlink marked with the name of the subject. Whenever a subject was unable to finish a task in the allotted time, the message “overtime” would appear beside his name, prompting the experimenter to ask the subject to move on to the next task immediately, before resetting his timer.
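The per-participant timer described above can be illustrated with a minimal Python sketch. This is not the original Smalltalk/Seaside application: the class name `ParticipantTimer`, its fields, and the use of seconds on a shared clock are assumptions made for illustration. Only the ten-minute limit, the “overtime” message, and the reset-on-click behavior come from the text.

```python
from dataclasses import dataclass, field

TASK_LIMIT_SECONDS = 10 * 60  # ten-minute limit per task, as stated in the text


@dataclass
class ParticipantTimer:
    """Hypothetical model of one participant's row in the timing display."""
    name: str
    task_started_at: float                        # seconds on the shared clock
    current_task: int = 1
    recorded: list = field(default_factory=list)  # (task, elapsed seconds) pairs

    def remaining(self, now: float) -> float:
        """Maximum remaining time for the current task; negative once over the limit."""
        return TASK_LIMIT_SECONDS - (now - self.task_started_at)

    def status(self, now: float) -> str:
        """Text shown beside the participant's name."""
        left = self.remaining(now)
        return "overtime" if left < 0 else f"{int(left)} s left"

    def reset(self, now: float) -> None:
        """Experimenter clicks the name: store the split time and start the next task."""
        self.recorded.append((self.current_task, now - self.task_started_at))
        self.current_task += 1
        self.task_started_at = now


timer = ParticipantTimer(name="P01", task_started_at=0.0)
print(timer.status(now=120.0))  # 480 s left
timer.reset(now=300.0)          # participant announces the move to task 2
print(timer.status(now=950.0))  # overtime
```

Keeping the recorded splits inside the timer object is what makes an end-of-session export possible without relying on the participants' own notes.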
Since most of the time the experimenter would only get to meet the subjects just before the experiment, associating names with the persons by memory was not something we wanted to rely on. Therefore, the experimenter would always bring with him a set of name tags, which would be placed near the corresponding subjects and would thus help the experimenter identify them in a timely manner.
At the end of an experimental session, the experimenter would export the recorded times of every participant. This apparently redundant measure allowed us to recover the times of participants who either forgot to log the time, or logged it with insufficient details (i.e., only hours and minutes) in spite of the clear guidelines.
Conversely, relying completely on the timing application would also have been suboptimal: on one occasion, the timing application froze and the only timing information available for the particular tasks the participants were solving was their own logs. On another isolated occasion, a participant raised a hand to ask a question and the experimenter assumed that the participant was announcing his move to the next task and reset his timer. In this case, the incident was noted by the experimenter and the time was later recovered from the participant’s questionnaire. The timing data we collected is presented in Table A.4 in Section A.4.
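The way the two timing sources complement each other can be sketched as a small reconciliation rule. The function below is an illustrative assumption, not the procedure actually used: the name `reconcile_time`, the `HH:MM:SS` string format, and the preference order (self-logged time first, exported application time as fallback) are all hypothetical.

```python
from typing import Optional


def reconcile_time(self_logged: Optional[str], app_recorded: Optional[str]) -> Optional[str]:
    """Pick one timestamp for a task boundary from the two sources.

    Assumed format: "HH:MM:SS". A self-logged "HH:MM" value counts as
    insufficiently detailed.
    """
    def has_seconds(stamp: Optional[str]) -> bool:
        return stamp is not None and len(stamp.split(":")) == 3

    if has_seconds(self_logged):
        return self_logged   # the participant's own log is complete
    if app_recorded is not None:
        return app_recorded  # missing or too coarse: use the exported time
    return self_logged       # application unavailable: keep what was logged (may be None)


print(reconcile_time("10:02", "10:02:17"))  # 10:02:17
print(reconcile_time("10:02:15", None))     # 10:02:15
```

Under this assumed rule, a frozen application or a mistaken reset does not affect the result as long as the participant's own entry is complete.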
8.3.3
Correctness Data
The first step towards obtaining the correctness data points was to collect the solutions from our subjects using the questionnaires presented in detail in Section A.1.
Then, to convert task solutions into quantitative information, we needed an oracle set, which would provide both the superset of correct answers and the grading scheme for each task. However, given the complexity of our experiment’s design, a single oracle set was not enough. On the one hand, we needed a separate oracle for each of the two object systems. On the other hand, we needed separate oracles for the solutions obtained by analyzing an object system with different tools. This is because, for the first few tasks, the data source of the control groups is the source code, while that of the experimental groups is a FAMIX model of the object system extracted from the source code, which in practice is never 100% accurate [JS03].
To obtain the four oracle sets, three experimenters independently solved the tasks with each combination of treatments (i.e., on the assigned object system, using the assigned tool) and came up with a grading scheme for the tasks. In addition, for the two experimental groups, they computed the results using queries on the model of the object system, to make sure that they did not miss any information caused by limitations of the visualization (e.g., occlusion, too small buildings, etc.). Eventually, by merging the solutions and after discussing the divergences, we obtained the four oracle sets, presented in Section A.4.
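The merging step can be pictured as a set operation: answers found by all three experimenters are accepted directly, while the remaining ones are the divergences to be discussed. The sketch below is only an illustration under that assumption; the function name `merge_solutions` and the class names in the example are hypothetical.

```python
def merge_solutions(solutions):
    """Split independently obtained answer sets into agreed answers and divergences."""
    sets = [set(s) for s in solutions]
    agreed = set.intersection(*sets)        # found by every experimenter
    divergent = set.union(*sets) - agreed   # found by some but not all: to be discussed
    return agreed, divergent


# Hypothetical answers of three experimenters for one task
agreed, divergent = merge_solutions([
    {"ClassA", "ClassB"},
    {"ClassA", "ClassB", "ClassC"},
    {"ClassA", "ClassB"},
])
print(sorted(agreed))     # ['ClassA', 'ClassB']
print(sorted(divergent))  # ['ClassC']
```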
Finally, we needed to grade the solutions of the subjects. To remove subjectivity when grading, we employed blinding, which implies that, when grading a solution, the experimenter is not aware whether the subject that provided the solution has used an experimental treatment or a control treatment. For this, one of the experimenters created four code names for the groups and created a mapping between groups and code names, known only to him. Then he provided the other two experimenters with the encoded data, along with the encoded corresponding oracle, which allowed them to perform blind grading. In addition, the experimenter who encoded the data performed his grading unblinded. Eventually, the experimenters discussed the differences and converged on the final grading, presented in Table A.3 of Section A.4.
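A minimal sketch of such an encoding step is given below. It assumes that each solution is stored as a record with a `group` field and that code names are neutral labels assigned in random order. The function names, the record layout, and the group labels are hypothetical and serve only to illustrate the idea.

```python
import random


def make_code_names(groups, rng=None):
    """Assign each group a neutral code name in random order.

    The returned mapping stays with the experimenter who performs the encoding.
    """
    rng = rng or random.Random()
    codes = [f"G{i + 1}" for i in range(len(groups))]
    rng.shuffle(codes)
    return dict(zip(groups, codes))


def encode(records, mapping):
    """Return copies of the records with the group label replaced by its code name."""
    return [{**record, "group": mapping[record["group"]]} for record in records]


# Hypothetical labels for the four groups (two tools x two object systems)
groups = ["experimental/system-1", "experimental/system-2",
          "control/system-1", "control/system-2"]
mapping = make_code_names(groups)
blinded = encode([{"subject": "S01", "group": "control/system-1", "answer": "..."}], mapping)
print(blinded[0]["group"])  # one of G1..G4; the treatment is no longer visible
```

The same mapping would be applied to the oracle sets, so that graders can match each encoded solution with the right oracle without learning the treatment.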
8.3.4
Participants’ Feedback
The questionnaire handout ends with a debriefing section, in which the participants are asked to assess the level of difficulty for each task and the overall time pressure, to give us feedback that could potentially help us improve the experiment, and optionally, to share with us any interesting insights they encountered during the analysis.