Chapter 3. Model Specification and Study Design

3.2 ANN Model

3.2.1 The General Construct of ANN

ANN techniques have evolved quickly over decades of development, but the fundamental theory and principles have remained stable. Classical texts introducing the general ANN model include Bishop (1995) and Ripley (1996). Literature comparing statistical methods with ANN is also readily accessible (Warner and Misra 1996, Dreiseitl and Ohno-Machado 2002, Kumar 2005, Paliwal and Kumar 2009). Applications of artificial intelligence in business have been growing (Hruschka 1993, West et al. 1997, Wong et al. 1997, Baesens et al. 2002, Cui and Curry 2005, Xu et al. 2013). Online resources are rich and are updated more frequently to reflect the latest techniques. For example, the online book by Michael Nielsen (Nielsen 2015) explains the principles and techniques of ANN with vivid examples. The UFLDL Tutorial, contributed by the well-recognized ANN researcher Andrew Ng and his colleagues, covers general ANN models and techniques (Ng 2010).

The resources mentioned above are consistent in their explanations of ANN principles and general ANN models. This section provides a general introduction to the ANN construct. It adopts two figures from West et al. (1997) to explain the basic idea of NN models.

Figure 3.1 Single neuron perceptron, adopted from (West et al. 1997)

Figure 3.1 shows the construct of a single neuron. First, the aggregation node inside the big circle takes $I$ inputs ($x_1$ to $x_I$) and calculates an aggregated value. The activation function node then takes the aggregated value and transforms it into an output value. In most ANN implementations, the aggregation is a linear combination, or weighted summation, which can be written as $\sum_i x_i w_i$. The weights $w_i$ are the parameters to be optimized by the learning algorithm. A commonly used activation function is the sigmoid, which econometricians call the logistic function. The big circle containing the aggregation and activation functions represents a neuron. Many such neurons can be connected to form a network, as shown in Figure 3.2.
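
The single neuron described above can be illustrated with a minimal Python sketch. It assumes a plain weighted sum without an intercept term and a sigmoid activation; the function names and the numeric inputs and weights are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights):
    """Aggregate the inputs as a weighted sum, then apply the activation."""
    aggregated = sum(x * w for x, w in zip(inputs, weights))
    return sigmoid(aggregated)

# Hypothetical inputs and weights, chosen only for illustration.
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.0]
# The weighted sum is 0.5 - 0.5 + 0.0 = 0.0, so the output is sigmoid(0) = 0.5.
y = neuron_output(x, w)
```

Keeping aggregation and activation as two separate steps mirrors the two nodes inside the big circle of Figure 3.1.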

Figure 3.2 One Hidden Layer NN, adopted from (West et al. 1997)

Figure 3.2 shows an ANN model with three hidden neurons (nodes $H$). The node $O$ on the right-hand side is the output node. The output node can take the form of a neuron or of any aggregation and transformation function defined by the model designer.

Multiple-layer NN is not considered here for two reasons. First, one hidden layer is capable of capturing non-linear relationships (Intrator and Intrator 2001, West et al. 1997, Paliwal and Kumar 2009). Second, multi-layer NN is much more complicated in terms of choosing the training algorithm and designing the optimal structure (Bengio et al. 2009, Nielsen 2015 Chapter 5). For example, it may be more prone to becoming trapped in local optima (Baczyński and Parol 2004), especially when gradient descent learning is used. This project therefore focuses on the one-hidden-layer ANN model for cross-category predictions. A remaining question is how many hidden nodes should be chosen; this question is discussed in Section 3.2.4.
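
To make the one-hidden-layer structure concrete, the following Python sketch computes a forward pass through a network with three hidden neurons and one output node. It is an illustration under stated assumptions rather than the model used in this study: the output node is assumed to be a sigmoid neuron, intercept terms are omitted, and the names `forward`, `W`, and `v` as well as all numeric values are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden_weights, output_weights):
    """Forward pass through one hidden layer and a single sigmoid output node.

    hidden_weights[j][i] is the weight from input i to hidden neuron j;
    output_weights[j] is the weight from hidden neuron j to the output node.
    """
    hidden = [
        sigmoid(sum(x * w for x, w in zip(inputs, weights_j)))
        for weights_j in hidden_weights
    ]
    output = sigmoid(sum(h * v for h, v in zip(hidden, output_weights)))
    return hidden, output

# Hypothetical example: two inputs, three hidden neurons, one output node.
x = [1.0, -1.0]
W = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]  # one row of weights per hidden neuron
v = [1.0, -1.0, 0.0]                      # hidden-to-output weights
hidden, output = forward(x, W, v)
```

Each hidden neuron reuses the same aggregation and activation steps as a single neuron, so the number of hidden nodes is simply the number of rows in `W`.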

Figure 3.3 below presents the conceptual decision-making model in the cross-category context.

Figure 3.3 Conceptual decision making model

To train an ANN model, one needs to specify a learning algorithm, an error function, and a stopping rule. The learning algorithm is a process that searches for a set of optimal parameters that minimize the error function. There is no closed-form solution for the parameters of an ANN model. Thus, learning is in essence a repeated "search" and "test" process. The main parameters are the weights, $W_{ij}$, as shown in Figure 3.2.

• Model Initialization

A common practice is to initialize the weights with random numbers drawn from the range 0 to 1.

• Learning algorithm

A learning algorithm takes many rounds to update the parameters (weights) toward an optimal solution. Back-propagation is arguably the most commonly used algorithm. In each round, it feeds data to the ANN model and calculates the error under the current parameters. The parameters are then updated using a search rule, and in the next round the new parameters are used to calculate the error. Gradient descent, a first-order method that uses only the first derivative of the error function, is usually used to update the parameters in each round. Equation (3.9) describes the basic gradient descent method.

$$w_{ij}^{t+1} = w_{ij}^{t} - \eta \frac{\partial E}{\partial w_{ij}^{t}} \tag{3.9}$$

In round $t$, the weight $w_{ij}^{t}$ is either learned from the previous round or set to its initial value when $t = 1$. $E$ is the error function, and $\partial E / \partial w_{ij}^{t}$ is the first derivative of $E$ with respect to the weight $w_{ij}^{t}$. This term is the gradient of the error; because the error decreases fastest in the direction opposite to the gradient, the term is subtracted from $w_{ij}^{t}$. The new weight $w_{ij}^{t+1}$ is calculated using formula (3.9). The learning rate $\eta$ controls the speed of parameter updating. A convenient feature of the gradient descent formula is that the derivative term $\partial E / \partial w_{ij}^{t}$ usually has a known form that depends on the error function. Thus, it can be evaluated in each round and used to produce the weights for the next round.
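
As a small worked illustration of the update rule in Equation (3.9), the Python sketch below applies it to a single weight. The quadratic error function, the starting weight, and the learning rate are all assumptions made for this example; they are not taken from the model in this study.

```python
def error(w):
    """Hypothetical one-parameter error function with its minimum at w = 3."""
    return (w - 3.0) ** 2

def error_gradient(w):
    """Known closed-form derivative dE/dw of the error function above."""
    return 2.0 * (w - 3.0)

def gradient_descent_step(w, learning_rate):
    """One application of Equation (3.9): w(t+1) = w(t) - eta * dE/dw(t)."""
    return w - learning_rate * error_gradient(w)

w = 0.0    # initial weight at t = 1 (assumed)
eta = 0.1  # learning rate (assumed)

# First round: the gradient is 2 * (0 - 3) = -6, so the weight moves to about 0.6.
first_update = gradient_descent_step(w, eta)

# Repeating the update moves the weight toward the minimizer at w = 3.
for _ in range(50):
    w = gradient_descent_step(w, eta)
```

With this error function each round shrinks the distance to the minimum by a constant factor of 0.8, which shows how the learning rate governs the speed of updating.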

• Error function

The learning algorithm is designed to minimize an error function. An error function is a quantity that measures the difference between the true values and the estimated values of the dependent variables. Commonly used error functions are the sum of squared errors and the cross-entropy error.
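
The two error functions named above can be sketched in Python as follows. The cross-entropy is written here in its binary form (targets of 0 or 1), which is an assumption about the intended variant; the clipping constant `eps` and the sample values are likewise hypothetical.

```python
import math

def sum_of_squared_errors(targets, predictions):
    """Sum over observations of (true value - estimated value) squared."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions))

def cross_entropy_error(targets, predictions, eps=1e-12):
    """Binary cross-entropy, summed over observations.

    Predictions are clipped to [eps, 1 - eps] to avoid taking log(0).
    """
    total = 0.0
    for t, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total

# Hypothetical true values and model outputs.
y_true = [1.0, 0.0, 1.0]
y_hat = [0.9, 0.2, 0.6]
sse = sum_of_squared_errors(y_true, y_hat)  # 0.01 + 0.04 + 0.16, about 0.21
ce = cross_entropy_error(y_true, y_hat)     # -(ln 0.9 + ln 0.8 + ln 0.6)
```

Both functions equal or approach zero when the estimated values match the true values and grow as the estimates move away from them.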

• Stopping rule

The learning algorithm stops when either the solution converges to a point where further rounds of learning would not appreciably reduce the error, or the predefined maximum number of rounds is reached without convergence. The former case is a success, while the latter is a failure of learning.
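
The four ingredients above (initialization, learning algorithm, error function, and stopping rule) can be combined in a short Python sketch. To stay brief it trains a single sigmoid neuron rather than a full network with back-propagation, and it assumes the sum of squared errors as the error function. The function name `train_neuron`, the default learning rate, tolerance, and maximum number of rounds, and the data are all hypothetical.

```python
import math
import random

def sigmoid(z):
    """Logistic (sigmoid) activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(data, learning_rate=0.5, tolerance=1e-6, max_rounds=10000, seed=0):
    """Train one sigmoid neuron by gradient descent on the sum of squared errors.

    Returns (weights, final_error, rounds_used, converged).
    """
    rng = random.Random(seed)
    n_inputs = len(data[0][0])
    # Model initialization: random weights drawn from the range 0 to 1.
    weights = [rng.uniform(0.0, 1.0) for _ in range(n_inputs)]
    previous_error = float("inf")
    error = previous_error

    for round_number in range(1, max_rounds + 1):
        error = 0.0
        gradient = [0.0] * n_inputs
        for inputs, target in data:
            output = sigmoid(sum(x * w for x, w in zip(inputs, weights)))
            error += (target - output) ** 2
            # dE/dw_i for the squared error with a sigmoid activation.
            delta = -2.0 * (target - output) * output * (1.0 - output)
            for i, x in enumerate(inputs):
                gradient[i] += delta * x
        # Stopping rule, success case: the error has stopped improving.
        if abs(previous_error - error) < tolerance:
            return weights, error, round_number, True
        # Gradient descent update, as in Equation (3.9).
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
        previous_error = error

    # Stopping rule, failure case: maximum number of rounds reached.
    return weights, error, max_rounds, False

# Hypothetical data: the first input is a constant 1.0 acting as an intercept.
data = [([1.0, 0.0], 0.2), ([1.0, 1.0], 0.8)]
weights, final_error, rounds_used, converged = train_neuron(data)
```

The returned `converged` flag separates the two outcomes of the stopping rule: it is true when the change in error falls below `tolerance`, and false when `max_rounds` is exhausted first.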
