Chapter 3. Model Specification and Study Design
3.2 ANN Model
3.2.1 The General Construct of ANN
ANN techniques have evolved quickly over decades of development, but the fundamental theory and principles have remained stable. Classical texts such as Bishop (1995) and Ripley (1996) introduce the general ANN model. Literature comparing statistical methods with ANN is also readily accessible (Warner and Misra 1996, Dreiseitl and Ohno-Machado 2002, Kumar 2005, Paliwal and Kumar 2009). Applications of artificial intelligence in business have been growing (Hruschka 1993, West et al. 1997, Wong et al. 1997, Baesens et al. 2002, Cui and Curry 2005, Xu et al. 2013). Online resources are rich and are updated more frequently to cover the latest techniques. For example, the online book by Michael Nielsen (Nielsen 2015) explains the principles and techniques of ANN with vivid examples. The UFLDL Tutorial, contributed by the well-recognized ANN researcher Andrew Ng and his colleagues, covers general ANN models and techniques (Ng 2010).
The resources mentioned above are consistent in how they explain ANN principles and general ANN models. This section provides a general introduction to the ANN construct. It adopts two figures from West et al. (1997) to explain the basic idea of NN models.
Figure 3.1 Single neuron perceptron, adopted from (West et al. 1997)
Figure 3.1 shows the construct of a single neuron. First, the aggregation node inside the big circle takes $I$ inputs ($x_1$ to $x_I$) and calculates an aggregated value. The activation function node then transforms the aggregated value into an output value. In most ANN implementations, the aggregation is a linear combination, or weighted summation, which can be written as $\sum_i x_i w_i$. The weights $w_i$ are the parameters to be optimized by the learning algorithm. A commonly used activation function is the sigmoid, which econometricians call the logistic function. The big circle containing the aggregation and activation functions represents a neuron. Many such neurons can be connected to form a network, as shown in Figure 3.2.
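The single neuron described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`sigmoid`, `neuron_output`) and the numeric values are hypothetical and are not taken from West et al. (1997) or from the study's data.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic (sigmoid) activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs: Sequence[float], weights: Sequence[float]) -> float:
    """Aggregate the inputs as a weighted sum, then apply the activation."""
    aggregated = sum(x * w for x, w in zip(inputs, weights))
    return sigmoid(aggregated)


# Illustrative inputs and weights (assumed values, for demonstration only).
x = [1.0, 0.5, -2.0]
w = [0.4, -0.6, 0.1]
y = neuron_output(x, w)  # a value strictly between 0 and 1
```

The two steps in `neuron_output` mirror the two nodes inside the big circle of Figure 3.1: aggregation first, activation second.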
Figure 3.2 One Hidden Layer NN, adopted from (West et al. 1997)
Figure 3.2 shows an ANN model with three hidden neurons (nodes $H$). The node on the right-hand side is the output node. The output node can take the form of a neuron or of any aggregation and transformation function defined by the model designer.
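A forward pass through a network of this shape might look like the sketch below. It assumes, for illustration only, that the output node is itself a sigmoid neuron and that there are no bias terms. The names `hidden_layer` and `forward` and all numeric values are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + math.exp(-z))


def weighted_sum(values: Sequence[float], weights: Sequence[float]) -> float:
    """Linear aggregation used by every neuron in this sketch."""
    return sum(v * w for v, w in zip(values, weights))


def hidden_layer(x: Sequence[float],
                 hidden_weights: Sequence[Sequence[float]]) -> List[float]:
    """One output per hidden neuron; hidden_weights[j] holds neuron j's weights."""
    return [sigmoid(weighted_sum(x, w_j)) for w_j in hidden_weights]


def forward(x: Sequence[float],
            hidden_weights: Sequence[Sequence[float]],
            output_weights: Sequence[float]) -> float:
    """Feed the inputs through the hidden layer, then through the output node."""
    h = hidden_layer(x, hidden_weights)
    return sigmoid(weighted_sum(h, output_weights))


# Two inputs and three hidden neurons (assumed values, for demonstration only).
x = [0.5, -1.0]
hidden_weights = [[0.1, 0.4], [-0.3, 0.2], [0.7, -0.5]]
output_weights = [0.6, -0.2, 0.9]
y = forward(x, hidden_weights, output_weights)
```

Replacing the final `sigmoid` call with another function corresponds to the designer-defined output transformation mentioned above.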
Multi-layer NNs are not considered here for two reasons. First, one hidden layer is capable of capturing non-linear relationships (Intrator and Intrator 2001, West et al. 1997, Paliwal and Kumar 2009). Second, a multi-layer NN is much more complicated in terms of choosing the training algorithm and designing the optimal structure (Bengio et al. 2009, Nielsen 2015, Chapter 5). For example, it may be more prone to getting trapped in local optima (Baczyński and Parol 2004), especially when gradient descent learning is used. This project focuses on the one-hidden-layer ANN model for cross-category predictions. A remaining question is how many hidden nodes should be chosen; this question is discussed in Section 3.2.4.
Figure 3.3 shows the conceptual decision-making model in the cross-category context.
Figure 3.3 Conceptual decision making model
To train an ANN model, one needs to specify a learning algorithm, an error function, and a stopping rule. The learning algorithm is a process that searches for a set of optimal parameters that minimize the error function. There is no closed-form (analytical) solution to an ANN model; thus, learning is in essence a repeated "search" and "test" process. The main parameters are the weights, $W_{ij}$, as shown in Figure 3.2.
• Model Initialization
A common practice is to initialize the weights with random numbers drawn from the range 0 to 1.
• Learning algorithm
A learning algorithm takes many rounds to update the parameters (weights) toward an optimal solution. Back-propagation is arguably the most commonly used algorithm. In each round, it feeds data to the ANN model and calculates the error under the current parameters. The parameters are then updated using a search rule, and the new parameters are used to calculate the error in the next round. Gradient descent, a first-order method, is usually used to update the parameters in each round. Equation (3.9) describes the basic gradient descent method.
$$w_{ij}^{t+1} = w_{ij}^{t} - \eta \frac{\partial E}{\partial w_{ij}^{t}} \tag{3.9}$$
In round $t$, the set of weights $w_{ij}^{t}$ is either learned from the previous round or, when $t = 1$, set to the initial values. $E$ is the error function. $\frac{\partial E}{\partial w_{ij}^{t}}$ is the first derivative of $E$ with respect to the weight parameter $w_{ij}^{t}$. This term is the gradient of the error; stepping against it, as the minus sign in (3.9) does, changes $w_{ij}^{t}$ in the direction of the steepest local reduction of the error function. The new weight $w_{ij}^{t+1}$ is calculated using formula (3.9). The learning rate parameter $\eta$ controls the speed of parameter updating. A convenient feature of the gradient descent learning formula is that the derivative term $\frac{\partial E}{\partial w_{ij}^{t}}$ usually has a known form that depends on the error function. Thus, it can be calculated in each round and used to obtain the weights for the next round.
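As a minimal sketch of the update rule in (3.9), the code below applies one gradient descent step to a single sigmoid neuron. It assumes a squared error $E = \frac{1}{2}(y - \text{target})^2$, for which the derivative has the known form $(y - \text{target})\, y (1 - y)\, x_i$. The function names, the learning rate `eta = 0.5`, and the data values are hypothetical choices for illustration, not the study's settings.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: Sequence[float], x: Sequence[float]) -> float:
    """Output of a single sigmoid neuron."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def squared_error(weights: Sequence[float], x: Sequence[float], target: float) -> float:
    """E = 0.5 * (y - target)^2 for one observation (assumed error function)."""
    return 0.5 * (predict(weights, x) - target) ** 2


def gradient(weights: Sequence[float], x: Sequence[float], target: float) -> List[float]:
    """dE/dw_i = (y - target) * y * (1 - y) * x_i, by the chain rule."""
    y = predict(weights, x)
    delta = (y - target) * y * (1.0 - y)
    return [delta * xi for xi in x]


def gradient_step(weights: Sequence[float], x: Sequence[float],
                  target: float, eta: float) -> List[float]:
    """One round of the update: w_new = w_old - eta * dE/dw."""
    grad = gradient(weights, x, target)
    return [w - eta * g for w, g in zip(weights, grad)]


# Assumed starting weights and a single observation, for demonstration only.
weights = [0.2, 0.8]
x, target = [1.0, 2.0], 0.0
new_weights = gradient_step(weights, x, target, eta=0.5)
```

With these values, the error at `new_weights` is lower than at `weights`. A much larger `eta` could overshoot, which is why the learning rate is treated as a tuning parameter.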
• Error function
The learning algorithm is designed to minimize an error function. An error function is a quantity that measures the difference between the true values and the estimated values of the dependent variables. Commonly used error functions are the sum of squared errors and the cross-entropy error.
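The two error functions named above can be written as follows. This is a sketch under assumptions: the cross-entropy version shown is the binary form, for targets coded 0 or 1 and predictions in (0, 1). The small clipping constant `eps` is an implementation convenience added here to avoid taking the logarithm of zero.

```python
import math
from typing import Sequence


def sum_squared_error(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Sum over observations of (true value - estimated value)^2."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions))


def cross_entropy_error(targets: Sequence[float], predictions: Sequence[float],
                        eps: float = 1e-12) -> float:
    """Binary cross-entropy: -sum[t*log(p) + (1 - t)*log(1 - p)]."""
    total = 0.0
    for t, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() stays finite
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total


# Assumed targets and predictions, for demonstration only.
targets = [1.0, 0.0, 1.0]
predictions = [0.9, 0.2, 0.6]
sse = sum_squared_error(targets, predictions)
cee = cross_entropy_error(targets, predictions)
```

Both quantities shrink toward zero as the predictions approach the targets, so either can serve as the objective that the learning algorithm minimizes.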
• Stopping rule
The learning algorithm stops when either the solution converges to a point where further rounds of learning would not reduce the error much, or the predefined maximum number of rounds is reached without convergence. The former case is a successful learning run, while the latter is a learning failure.
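The pieces above (initialization, gradient descent updates, an error function, and a stopping rule) can be combined into one small training loop. The sketch below is illustrative only: it trains a single sigmoid neuron rather than the one-hidden-layer model used in this project, it assumes a squared error, and the tolerance `tol`, the round limit `max_rounds`, the learning rate `eta`, and the toy data set `OR_DATA` are all hypothetical choices.

```python
import math
import random
from typing import List, Sequence, Tuple

Observation = Tuple[Sequence[float], float]


def sigmoid(z: float) -> float:
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: Sequence[float], x: Sequence[float]) -> float:
    """Output of a single sigmoid neuron."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def total_error(weights: Sequence[float], data: Sequence[Observation]) -> float:
    """Half the sum of squared errors over all observations."""
    return sum(0.5 * (predict(weights, x) - t) ** 2 for x, t in data)


def train(data: Sequence[Observation], eta: float = 0.5, tol: float = 1e-5,
          max_rounds: int = 10_000, seed: int = 0) -> Tuple[List[float], int, bool]:
    """Return (weights, rounds used, converged flag)."""
    rng = random.Random(seed)
    # Model initialization: random weights drawn from the range 0 to 1.
    weights = [rng.random() for _ in range(len(data[0][0]))]
    prev_error = total_error(weights, data)
    for round_no in range(1, max_rounds + 1):
        # Accumulate dE/dw over the whole data set.
        grad = [0.0] * len(weights)
        for x, t in data:
            y = predict(weights, x)
            delta = (y - t) * y * (1.0 - y)
            for i, xi in enumerate(x):
                grad[i] += delta * xi
        # Gradient descent update for this round.
        weights = [w - eta * g for w, g in zip(weights, grad)]
        error = total_error(weights, data)
        # Stopping rule, case 1: the error barely changes any more.
        if abs(prev_error - error) < tol:
            return weights, round_no, True
        prev_error = error
    # Stopping rule, case 2: the round limit is reached without convergence.
    return weights, max_rounds, False


# Toy data: logical OR, with a constant first input acting as an intercept.
OR_DATA: List[Observation] = [
    ([1.0, 0.0, 0.0], 0.0),
    ([1.0, 0.0, 1.0], 1.0),
    ([1.0, 1.0, 0.0], 1.0),
    ([1.0, 1.0, 1.0], 1.0),
]
weights, rounds, converged = train(OR_DATA)
```

The returned flag separates the two outcomes of the stopping rule: `True` when the change in error falls below `tol`, and `False` when `max_rounds` is exhausted first. Calling `train(OR_DATA, max_rounds=3)` is one way to see the second case.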