Data preprocessing is an important, and often neglected, activity within the machine learning field. Kotsiantis, Kanellopoulos, and Pintelas (2006) state that preprocessing tasks such as cleaning, transformation, and selection should be considered in order to generate suitable training and testing datasets. In addition, for supervised learning, the success of knowledge acquisition during the learning phase is highly influenced by the representation and quality of the datasets. Thus, in this study, cleaning and selection were the preprocessing tasks primarily considered for dataset preparation; students’ solution paths were then transformed into sequential datasets.
1. Data Preparation
We performed data cleaning (correcting incorrect data and removing confidential data) and a selection analysis (eliminating noisy data) before any data transformation. We looked mainly for incorrect or invalid data and for outliers that could have a significantly negative impact on the ITS learning process. However, most of the collected observations were considered acceptable, and the entire dataset was considered representative of our learning domain by our expert with more experience in network security. Although some configuration sequences differed significantly in length (number of configuration rules) and content (use of parameters), these outliers were considered necessary for our classification process. Therefore, all CSs were selected and no significant cleaning, other than eliminating students’ identification data, was necessary.
An additional data preparation step for a portion of our experiments was the categorization of each CS. An expert labeled each CS as correct, partially correct, or incorrect based on whether the sequence addressed the intended security requirements. Then, CSs labeled as incorrect were sub-categorized based on the major type of configuration error. CSs with the same configuration error were grouped and assigned a tutorial action that would help students identify and correct the configuration misconception. The result of this data preparation step was a set of clusters of similar CSs as defined by the domain expert. Two CSs are considered similar if they require the same tutorial action from the ITS. Hence, for each cluster the expert formulated a tutorial action that is appropriate for all CSs in the cluster.
Interestingly, we observed that the sizes of the clusters (in terms of number of CSs) generated during preprocessing were generally highly unbalanced. While some clusters contained many CSs (e.g., tutorial actions 2 and 11 in Figure 16), which typically differed only insignificantly from each other, several other clusters had only one or a small number of CSs (e.g., tutorial actions 1 and 10 in Figure 16). We expected that such small clusters, which result from students’ misconceptions, might be quite common, given the degree to which our domain is ill-defined. It is an advantage of our mixed-response approach over design-time approaches that such misconception-driven clusters can be identified and (as we will illustrate later) handled by the ITS.
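The grouping step and the resulting cluster-size imbalance can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the record layout, the CS identifiers, and the specific assignments of CSs to tutorial actions are hypothetical and chosen only to mimic an unbalanced distribution.

```python
from collections import Counter, defaultdict

# Hypothetical expert labels: (cs_id, label, tutorial_action).
# All IDs and assignments below are invented for illustration only.
labeled_cs = [
    ("cs01", "incorrect", 2),
    ("cs02", "incorrect", 2),
    ("cs03", "incorrect", 2),
    ("cs04", "incorrect", 11),
    ("cs05", "incorrect", 11),
    ("cs06", "incorrect", 1),
    ("cs07", "partially correct", None),
    ("cs08", "correct", None),
]


def cluster_by_action(records):
    """Group incorrect CSs by the tutorial action assigned by the expert."""
    clusters = defaultdict(list)
    for cs_id, label, action in records:
        if label == "incorrect" and action is not None:
            clusters[action].append(cs_id)
    return dict(clusters)


clusters = cluster_by_action(labeled_cs)

# Cluster sizes, largest first, to make the imbalance visible.
sizes = Counter({action: len(members) for action, members in clusters.items()})
for action, size in sizes.most_common():
    print(f"tutorial action {action}: {size} CS(s)")
```

In this sketch, a cluster is simply the set of CSs that share a tutorial action, which mirrors the similarity definition given above.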
2. Data Transformation
As described earlier, solution paths selected by students consist of configuration sequences (CSs), which in turn consist of configuration rules (CRs). An example illustrating CSs from three students is shown in Table 3: Student 1 uses two CSs with three CRs each, while Student 2 and Student 3 each use only one CS.
In order to better represent CSs, we further parse each CR as a sequence of parameters at two different levels of granularity: token level and paired-symbols level. For instance, the rule “iptables -A FORWARD -j DROP” has five tokens and three paired symbols. At token level the rule consists of individual symbols (“iptables”, “-A”, “FORWARD”, “-j”, and “DROP”), while at paired-symbols level it consists of individual symbols and parameter-value pairs (“iptables”, “-A FORWARD”, and “-j DROP”). A third, configuration-rule level was used for data representation as well. This coarse-grained level consists of entire CRs (“iptables -A FORWARD -j DROP”).
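The three granularity levels can be sketched as follows. This is a minimal illustration under a simplifying assumption that is not taken from the study: an option (a token starting with “-”) is paired with the next token whenever that token is not itself an option. The function names are hypothetical.

```python
from typing import List


def token_level(rule: str) -> List[str]:
    """Split a configuration rule into individual whitespace-separated tokens."""
    return rule.split()


def paired_symbol_level(rule: str) -> List[str]:
    """Group each option with its value; keep other tokens as single symbols.

    Assumption: an option starts with '-' and takes the following token as its
    value unless that token is also an option (or there is no following token).
    """
    tokens = rule.split()
    symbols = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        has_value = (
            tok.startswith("-")
            and i + 1 < len(tokens)
            and not tokens[i + 1].startswith("-")
        )
        if has_value:
            symbols.append(f"{tok} {tokens[i + 1]}")
            i += 2
        else:
            symbols.append(tok)
            i += 1
    return symbols


def rule_level(rule: str) -> List[str]:
    """Treat the whole rule (with normalized whitespace) as a single symbol."""
    return [" ".join(rule.split())]


rule = "iptables -A FORWARD -j DROP"
print(token_level(rule))
print(paired_symbol_level(rule))
print(rule_level(rule))
```

For the example rule this yields five tokens, three paired symbols, and one rule-level symbol, matching the counts given in the text.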
Table 3
Frequent behaviors and misconceptions in configuration rules

Student ID  Configuration Rules                                 Command Type
            iptables --flush                                    reset configuration command
1           iptables -L                                         informative command
            iptables -A FORWARD -p tcp --dport http -j ACCEPT

            iptables -F                                         different/correct command
1           iptables -A FORWARD -j DROP                         different/incorrect rule order
            iptables -A FORWARD --dport http -p tcp -j ACCEPT   different parameter order

            iptables -A FORWARD -p tcp --dport httt -j ACCEPT   incorrect parameter name
2           iptables -A FORWARD -p tcp --dport ftp -j ACCEPT    incorrect parameter value
            iptables -F                                         different/correct command
We note that the coarser-grained the representation is, the more domain specific the parser must be. While any tokenizer is able to process input at token level, a parser must understand the concept of command-line input to generate paired symbols from student input. In order to parse input at configuration-rule level, the parser must understand what configuration rules are, and thus have a detailed understanding of the underlying domain.
Table 3 also displays examples of different configuration types within each CS, of semantic and syntactic misconceptions, and of parameter functionality types. For example, the third CRs in the two CSs of Student 1 are semantically the same but differ in the order of their parameters, and the first CRs in these two CSs are semantically the same but use the complete command name and its alias, respectively. As an example of parameter functionality types, the first CS of Student 1 consists of a resetting command, an informative command, and finally a configuration command.
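The notion of two CRs being semantically the same despite differing in parameter order or in the use of an alias can be illustrated with a small canonicalization sketch. This is not described as part of the study; it is one possible way to compare rules, assuming that option order does not affect meaning and that each long option maps to a short alias. The alias table and the function name are hypothetical choices for this example.

```python
# Long-form iptables options mapped to their short aliases (small subset).
ALIASES = {
    "--flush": "-F",
    "--list": "-L",
    "--append": "-A",
    "--jump": "-j",
    "--protocol": "-p",
}


def canonical(rule):
    """Return an order-insensitive, alias-normalized form of a rule.

    Assumptions: the first token is the command name, an option takes the next
    token as its value unless that token is also an option, and the order of
    options carries no meaning.
    """
    tokens = [ALIASES.get(tok, tok) for tok in rule.split()]
    command, rest = tokens[0], tokens[1:]
    pairs = []
    i = 0
    while i < len(rest):
        if (
            rest[i].startswith("-")
            and i + 1 < len(rest)
            and not rest[i + 1].startswith("-")
        ):
            pairs.append((rest[i], rest[i + 1]))
            i += 2
        else:
            pairs.append((rest[i], ""))
            i += 1
    return (command, tuple(sorted(pairs)))


# Same rule with a different parameter order.
same_order = canonical("iptables -A FORWARD -p tcp --dport http -j ACCEPT") == canonical(
    "iptables -A FORWARD --dport http -p tcp -j ACCEPT"
)
# Complete command name versus alias.
same_alias = canonical("iptables --flush") == canonical("iptables -F")
# A misspelled parameter value is not normalized away.
same_typo = canonical("iptables -A FORWARD -p tcp --dport http -j ACCEPT") == canonical(
    "iptables -A FORWARD -p tcp --dport httt -j ACCEPT"
)
print(same_order, same_alias, same_typo)
```

Under these assumptions, the first two comparisons treat the rules as equivalent, while the rule containing “httt” remains distinct from the one containing “http”.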