Decision Tree Induction - Theoretical Background

3. DIAGNOSING THE EFFICACY OF VIRTUAL OFFSHORE EGRESS TRAINING

3.4. Theoretical Background

3.4.2. Decision Tree Induction

Among the different supervised machine learning techniques, this paper uses decision trees (DTs).

DTs can be constructed quickly and do not require prior assumptions about the data, particularly when compared to other methods such as artificial neural networks or support vector machines (Duffy 2009). DTs were selected for their visual simplicity and diagnostic capabilities. From a diagnostic perspective, the DT model of participants’ decision strategies can determine whether participants have achieved competence. This was especially important for assessing the efficacy of different training methods because the goal of this research was to provide a training diagnostic lens for instructional designers who do not have domain expertise in data mining.

The decision tree algorithm is based on an induction process whereby generalizations are made based on observed phenomena (Badino 2004). Following the

rule-performance data from simulation training. In this paper, information from each participant’s performance in VE scenarios is used to populate a data matrix consisting of scenarios (S1-Sn), attributes (A1-An), values (V11-Vnn), and actions (E1-En). The scenarios and attributes are labelled inputs to the matrix, and the participants’ corresponding actions in the scenarios are known as classes. As depicted in Figure 3.1, the induction process creates generalized decision rules based on the content of the data matrix. The goal of the induction process is to classify the data in the matrix into groups such that the data in each group belong to the same class. This paper uses the ID3 decision tree algorithm, which uses information gain as the attribute selection measure for classifying the data into groups (Han et al. 2011).

Figure 3.1: Decision tree development framework.

The ID3 decision tree algorithm takes two basic inputs: the performance data matrix from the VE scenarios, and the list of attributes that were varied in each scenario. The output is a decision tree that describes a participant’s decision preferences and can also be used to predict their future decisions based on the value of the attributes in a given scenario.
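The two inputs described above can be sketched as plain Python structures. This is a minimal illustration only: the attribute names (`hazard`, `alarm`), their values, and the action labels are hypothetical placeholders, not the attributes used in the study.

```python
# Hypothetical attribute list (A1, A2); the real attributes are those
# varied across the VE scenarios.
ATTRIBUTES = ["hazard", "alarm"]

# Hypothetical data matrix: one row per scenario (S1..S4), holding the
# attribute values (V) and the participant's action (E), i.e. the class.
DATA_MATRIX = [
    {"hazard": "no", "alarm": "muster", "action": "primary"},
    {"hazard": "no", "alarm": "abandon", "action": "primary"},
    {"hazard": "yes", "alarm": "muster", "action": "secondary"},
    {"hazard": "yes", "alarm": "abandon", "action": "secondary"},
]


def split_inputs_and_classes(rows, class_key="action"):
    """Separate the labelled inputs (attribute values) from the classes."""
    inputs = [{k: v for k, v in row.items() if k != class_key} for row in rows]
    classes = [row[class_key] for row in rows]
    return inputs, classes


inputs, classes = split_inputs_and_classes(DATA_MATRIX)
```

Keeping the class label in the same row as the attribute values mirrors the matrix layout described in the text; an induction routine would read both from each row.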

During decision tree induction, data are iteratively classified using the attribute that has the highest information gain. The ID3 algorithm identifies this attribute using three main calculations: 1) the entropy of the dataset, 2) the average information entropy of each attribute, and 3) the information gain for each attribute.

First, the entropy of the entire dataset is calculated as a measure of the uncertainty of the data (Duffy 2009). This is achieved by defining the data matrix training set as S, where S contains m class labels and Si is a subset of scenarios within the training set S. Then the entropy of S is calculated using Eq. 1.
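For reference, the entropy used by ID3 in its standard formulation (as presented in Han et al. 2011) can be written as follows. The notation is an assumption made to match the definitions above: S_i is taken to be the subset of scenarios in S carrying the i-th class label, and p_i is the proportion of S that falls in that class.

```latex
\mathrm{Entropy}(S) = -\sum_{i=1}^{m} p_i \log_2 p_i,
\qquad p_i = \frac{|S_i|}{|S|}
```

Under this form the entropy is zero when every scenario in S has the same class and is largest when the classes are equally represented.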

Second, the training set S is partitioned using attribute A, where A has k distinct outcomes. This partition results in subsets Sj, with j = 1, ..., k. The average information entropy of each attribute (A1-An) over its subsets Sj is calculated using Eq. 2.
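In the standard ID3 formulation (Han et al. 2011), the average information entropy of an attribute is the size-weighted mean of the entropies of the subsets it produces. A sketch in the notation above, assuming S_j is the subset of S in which attribute A takes its j-th value:

```latex
\mathrm{Entropy}(A) = \sum_{j=1}^{k} \frac{|S_j|}{|S|}\,\mathrm{Entropy}(S_j)
```

Each term weights the entropy of one subset by the fraction of scenarios that fall into it, so an attribute that splits the data into pure subsets has an average entropy of zero.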

Finally, the information gain, which is the difference in entropy before and after splitting the dataset on attribute A, is calculated for each attribute in the data matrix using Eq. 3.

Gain(A) = Entropy(S) − Entropy(A)    (3)
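The three calculations can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the function names, the `action` class key, and the toy rows (with hypothetical attributes `hazard` and `alarm`) are all invented for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels (first calculation)."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def average_entropy(rows, attribute, class_key="action"):
    """Size-weighted entropy of the subsets produced by one attribute (second calculation)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[class_key])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


def information_gain(rows, attribute, class_key="action"):
    """Entropy before the split minus average entropy after it (third calculation)."""
    labels = [row[class_key] for row in rows]
    return entropy(labels) - average_entropy(rows, attribute, class_key)


# Hypothetical toy data: the action depends only on `hazard`.
rows = [
    {"hazard": "no", "alarm": "muster", "action": "primary"},
    {"hazard": "no", "alarm": "abandon", "action": "primary"},
    {"hazard": "yes", "alarm": "muster", "action": "secondary"},
    {"hazard": "yes", "alarm": "abandon", "action": "secondary"},
]
gains = {a: information_gain(rows, a) for a in ("hazard", "alarm")}
```

In this toy data, splitting on `hazard` yields pure subsets, so its gain equals the full entropy of the dataset (1 bit), while `alarm` leaves the classes evenly mixed and has a gain of zero.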

The attribute with the highest information gain is selected as the root node, which begins the partition of the dataset. The root node represents the attribute that minimizes the information needed to classify the data and reduces the randomness of the partitions (Han et al. 2011). This partitioning process is repeated until no attributes are left for classification, or the data set is empty, or the data in each group belong to the same class and no further classification is needed (Musharraf et al. 2018). A complete tree has branches leading to leaf nodes, which represent the class label (the final action of the participant). Algorithm 1 describes the iterative steps used to develop a decision tree.

Algorithm 1. General algorithm to generate DT from data matrix (Han et al. 2011)

Inputs: data matrix; attribute list; information gain attribute selection method

Output: a decision tree

Method:

Start

(1) Create a node, Ai

(2) If all scenario examples at the current node are of the same class, then label the leaf nodes with the class labels and stop (e.g. branch, Vn; leaf node, En).

(3) If the data subset at the current node is empty then label the node with the majority class label in its parent data set (e.g. branch, Vn; internal node, An).

(4) If no attributes are left for further classification, then label the leaf node with the majority class label in the current data subset and stop (e.g. branch, Vn; leaf node, En).

(5) For each remaining attribute An, compute the value of information gain Gain(An)

(6) Choose the attribute with the highest Gain(An) to branch the current node.

(7) For each branch node, go to step 2.

End
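The procedure in Algorithm 1 can be sketched as a short recursive function. This is a minimal sketch under stated assumptions rather than the code used in the study: the nested-dict tree representation, the `domains` parameter (the possible values of each attribute, needed so that empty subsets can occur), and the toy data are all hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def gain(rows, attr, cls="action"):
    """Information gain of splitting `rows` on `attr`."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[cls])
    avg = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[cls] for r in rows]) - avg


def majority(rows, cls="action"):
    return Counter(r[cls] for r in rows).most_common(1)[0][0]


def id3(rows, attrs, domains, parent_rows=None, cls="action"):
    """Return a leaf (class label) or a nested dict {attribute: {value: subtree}}."""
    if not rows:                      # empty subset: majority class of the parent
        return majority(parent_rows, cls)
    labels = {r[cls] for r in rows}
    if len(labels) == 1:              # all examples share one class: leaf
        return labels.pop()
    if not attrs:                     # no attributes left: majority class
        return majority(rows, cls)
    best = max(attrs, key=lambda a: gain(rows, a, cls))   # highest gain
    remaining = [a for a in attrs if a != best]
    return {best: {v: id3([r for r in rows if r[best] == v],
                          remaining, domains, rows, cls)
                   for v in domains[best]}}


# Hypothetical toy data matrix and attribute domains.
rows = [
    {"hazard": "no", "alarm": "muster", "action": "primary"},
    {"hazard": "no", "alarm": "abandon", "action": "primary"},
    {"hazard": "yes", "alarm": "muster", "action": "secondary"},
    {"hazard": "yes", "alarm": "abandon", "action": "secondary"},
]
domains = {"hazard": ["no", "yes"], "alarm": ["muster", "abandon"]}
tree = id3(rows, ["hazard", "alarm"], domains)
```

For this toy data the function selects `hazard` as the root and stops after one split, because each branch already contains a single class. Representing the tree as nested dictionaries keeps the sketch short; a node class would serve equally well.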

3.5. Methodology

The decision tree development and analysis framework used in this paper is depicted in Figure 3.2. First, a pedagogical experiment was conducted in the VE with 55 participants.

These participants were trained using the SBML approach. The participants’ performance data were collected and divided into two datasets: a training dataset and a testing dataset. The training data were stored in a repository in the form of a data matrix. The test scenarios were set aside to form the testing dataset. The data matrix was used to train the DT algorithm and form the decision trees, which represent participants’ behavioural patterns for route selection (Musharraf et al. 2018). The testing dataset was used to calculate the prediction accuracy of the newly formed DTs. The resulting DTs were used to compare participants’ understanding of the training with the intended learning objectives and to assess the efficacy of different training techniques. Section 3.1 describes the experimental design, including a description of the participants, the AVERT simulator, and how the SBML training was applied to the VE. Section 3.2 describes the decision tree modeling from the SBML data, including the development of the data matrices, how scenario frames were used from dynamic scenarios, and an illustration of the DT development.
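The prediction-accuracy step can be illustrated with a short Python sketch. It assumes a tree stored as nested dictionaries and computes accuracy as the fraction of test scenarios whose predicted action matches the observed one. The tree, the attribute name `hazard`, and the test rows are hypothetical and do not come from the study's data.

```python
def predict(tree, scenario):
    """Follow the branches matching the scenario's attribute values down to a leaf."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))          # attribute tested at this node
        tree = tree[attribute][scenario[attribute]]
    return tree                               # leaf: predicted action


def prediction_accuracy(tree, test_rows, class_key="action"):
    """Fraction of test scenarios whose predicted action equals the observed action."""
    correct = sum(predict(tree, row) == row[class_key] for row in test_rows)
    return correct / len(test_rows)


# Hypothetical tree induced from a participant's training data.
tree = {"hazard": {"no": "primary", "yes": "secondary"}}

# Hypothetical held-out test scenarios with the participant's observed actions.
test_rows = [
    {"hazard": "no", "action": "primary"},
    {"hazard": "yes", "action": "secondary"},
    {"hazard": "yes", "action": "primary"},
    {"hazard": "no", "action": "primary"},
]
accuracy = prediction_accuracy(tree, test_rows)
```

Here the tree reproduces three of the four observed actions, giving an accuracy of 0.75. The third scenario is a case where the participant's behaviour departs from the induced rule.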

Figure 3.2: Process used to develop decision trees and assess training efficacy (after Musharraf et al. 2018).

In document Pedagogical studies in virtual offshore safety training (Page 92-97)