
3.2 Regression methods for machine learning potentials

3.2.4 Neural network

The linear regression model discussed in section 3.2.1 takes the form

$$ y(x; \theta) = f(\theta^T \phi(x)), \tag{3.57} $$

where $f(\cdot)$ is simply the identity function in this case. To apply such models to large-scale problems, the basis functions $\phi(\cdot)$ must be adapted to the data [96]. One approach is to make the basis functions parametric and to adjust their parameters, along with the original parameters $\theta$, during training. A feed-forward NN is a series of models of the form of equation (3.57) composed on top of each other, with the outermost function $f(\cdot)$ remaining the identity but the others replaced by nonlinear activation functions.⁸ In this way, each basis function becomes a nonlinear function of a linear combination of the inputs.

An NN can be represented graphically as a network diagram, as shown in figure 3.1. This example NN consists of an input layer, two hidden layers, and an output layer. Each node in a hidden layer is connected to all the nodes in the previous layer and in the next layer, and the node value is⁹

$$ y_n^m = \sigma\left( \sum_{n'} y_{n'}^{m-1} w_{n',n}^m + b_n^m \right), \quad m = 1, 2, 3, \tag{3.58} $$

⁸ If the inner activation functions are linear, then the network can be replaced by an equivalent model in the form of equation (3.57). This follows from the fact that the composition of successive linear transformations is itself a linear transformation.

⁹ Following the NN literature, a bias parameter $b$ is separated from the set of weight parameters. The bias parameter $b$ is associated with an input variable whose value is clamped at 1.
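The statement in footnote 8 can be checked with a short worked example. The following is an illustration only, written under the assumption that $y^m$ denotes the row vector of node values in layer $m$, $W^m$ the weight matrix, and $b^m$ the row vector of biases. With the activation removed from two successive layers, the result collapses into a single affine map:

```latex
\begin{aligned}
y^2 &= \left( y^0 W^1 + b^1 \right) W^2 + b^2 \\
    &= y^0 \left( W^1 W^2 \right) + \left( b^1 W^2 + b^2 \right) \\
    &= y^0 \tilde{W} + \tilde{b},
\qquad \tilde{W} = W^1 W^2, \quad \tilde{b} = b^1 W^2 + b^2 .
\end{aligned}
```

Here $\tilde{W}$ and $\tilde{b}$ are names introduced only for this illustration. Adding further linear layers repeats the same step, so depth alone adds no expressive power without a nonlinear activation.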

[Figure 3.1: network diagram with an input layer, hidden layer 1, hidden layer 2, and an output layer; nodes are labeled $y_n^m$, and the example weights $w_{5,4}^1$, $w_{4,4}^2$, and $w_{4,1}^3$ are marked.]

Figure 3.1: Schematic representation of an NN comprising an input layer, two hidden layers, and an output layer. Each arrow connecting two nodes in adjacent NN layers represents a weight. Biases and activation functions are not shown in this plot. See text for explanation of the variables.

where $y_n^m$ is the value of node $n$ in layer $m$, $w_{n',n}^m$ is the weight connecting node $n'$ in layer $m-1$ to node $n$ in layer $m$, $b_n^m$ is the bias applied to node $n$ of layer $m$, and $\sigma$ is an activation function (e.g. the hyperbolic tangent) that introduces nonlinearity into the NN. More compactly, equation (3.58) can be written as $y^m = \sigma(y^{m-1} W^m + b^m)$,¹⁰ where $y^m$ is a row vector of the node values in layer $m$, $W^m$ is a weight matrix, and $b^m$ is a row vector of the biases. For example, for the NN shown in figure 3.1, $y^1$ and $b^1$ are row vectors with 4 elements each and $W^1$ is a $5 \times 4$ matrix. Consequently, the output can be expressed as¹¹

$$ y^3 = \sigma\left[ \sigma\left[ y^0 W^1 + b^1 \right] W^2 + b^2 \right] W^3 + b^3. \tag{3.59} $$
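As a minimal sketch of equation (3.59), the forward pass can be written with plain Python lists. Several things here are assumptions for illustration: the helper names (`affine`, `forward`, `init_params`), the random initialization, the choice of tanh as the activation, and the layer widths 5, 4, 4, 1 (the text confirms only the 5 inputs and the 4 nodes of hidden layer 1; the remaining widths are read off the labels in figure 3.1).

```python
import math
import random

def affine(y_prev, W, b):
    """Return the row vector y_prev W + b; W has one row per input node."""
    return [sum(y_prev[i] * W[i][n] for i in range(len(y_prev))) + b[n]
            for n in range(len(b))]

def forward(y0, weights, biases):
    """Evaluate the network: tanh on hidden layers, identity on the output layer."""
    y = y0
    last = len(weights) - 1
    for m, (W, b) in enumerate(zip(weights, biases)):
        z = affine(y, W, b)
        y = z if m == last else [math.tanh(v) for v in z]
    return y

def init_params(sizes, seed=0):
    """Hypothetical initializer: small uniform random weights, zero biases."""
    rng = random.Random(seed)
    weights = [[[rng.uniform(-0.5, 0.5) for _ in range(n_out)]
                for _ in range(n_in)]
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    biases = [[0.0] * n_out for n_out in sizes[1:]]
    return weights, biases

sizes = [5, 4, 4, 1]  # assumed layer widths for the network of figure 3.1
weights, biases = init_params(sizes)
y3 = forward([0.1, -0.2, 0.3, 0.0, 0.5], weights, biases)
print(len(weights[0]), len(weights[0][0]))  # shape of W^1: 5 4
print(y3)  # one output value
```

Storing each $W^m$ with one row per input node matches the row-vector convention $y^{m-1} W^m$ used in the text.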

¹⁰ The activation function is applied element-wise.

In essence, the NN model is nothing but a nonlinear function $y = f(x; \theta)$ that maps a set of inputs to a set of outputs, controlled by the adjustable parameters $\theta = \{W, b\}$. Therefore, training an NN is no different from training any other nonlinear parametric model. Given a training set $D = (X, y)$, we minimize the loss function

$$ L(\theta) = \frac{1}{2} \sum_{i=1}^{N} \| f(x_i; \theta) - y_i \|^2, \tag{3.60} $$

with respect to the parameters $\theta$. Many minimization algorithms, if not all, require the gradient of the loss function with respect to the parameters. Thanks to the structure of the NN model, the gradient of the loss function in equation (3.60) can be evaluated efficiently. This is achieved with a local message-passing scheme in which information is sent alternately forwards and backwards through the NN, known as error backpropagation [196]. Error backpropagation requires an overall computational cost of only $O(W)$, proportional to the number of weight parameters in the NN [96].
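The forward and backward passes of error backpropagation can be sketched for a single training example as follows. This is an illustrative sketch, not the implementation used in this work: it assumes tanh hidden layers with an identity output layer, and the function names, layer sizes, and random test data are all hypothetical. The result is compared with a central finite-difference estimate as a sanity check.

```python
import math
import random

def forward(y0, weights, biases):
    """Return all layer outputs [y^0, ..., y^M]; tanh hidden, identity output."""
    ys = [y0]
    last = len(weights) - 1
    for m, (W, b) in enumerate(zip(weights, biases)):
        z = [sum(ys[-1][i] * W[i][n] for i in range(len(W))) + b[n]
             for n in range(len(b))]
        ys.append(z if m == last else [math.tanh(v) for v in z])
    return ys

def loss(y0, target, weights, biases):
    """Single-example loss 0.5 * ||f(x) - y||^2."""
    out = forward(y0, weights, biases)[-1]
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, target))

def backprop(y0, target, weights, biases):
    """Gradient of the single-example loss w.r.t. every weight and bias."""
    ys = forward(y0, weights, biases)            # forward pass
    # Error signal at the identity output layer: dL/dz = f(x) - y.
    delta = [o - t for o, t in zip(ys[-1], target)]
    grad_W, grad_b = [], []
    for m in range(len(weights) - 1, -1, -1):    # backward pass
        y_prev = ys[m]
        grad_W.insert(0, [[y_prev[i] * d for d in delta]
                          for i in range(len(y_prev))])
        grad_b.insert(0, list(delta))
        if m > 0:
            # Propagate one layer back; tanh'(z) = 1 - tanh(z)^2 = 1 - y^2.
            W = weights[m]
            delta = [(1.0 - y_prev[i] ** 2)
                     * sum(W[i][n] * delta[n] for n in range(len(delta)))
                     for i in range(len(y_prev))]
    return grad_W, grad_b

rng = random.Random(0)
sizes = [3, 4, 3, 2]  # hypothetical small network for the check
weights = [[[rng.uniform(-1, 1) for _ in range(o)] for _ in range(i)]
           for i, o in zip(sizes[:-1], sizes[1:])]
biases = [[rng.uniform(-1, 1) for _ in range(o)] for o in sizes[1:]]
x, y = [0.2, -0.4, 0.7], [0.5, -1.0]
grad_W, grad_b = backprop(x, y, weights, biases)

def finite_difference(m, i, n, h=1e-6):
    """Central-difference estimate of dL/dW[m][i][n]."""
    old = weights[m][i][n]
    weights[m][i][n] = old + h
    up = loss(x, y, weights, biases)
    weights[m][i][n] = old - h
    down = loss(x, y, weights, biases)
    weights[m][i][n] = old
    return (up - down) / (2 * h)

max_err = max(abs(grad_W[m][i][n] - finite_difference(m, i, n))
              for m in range(len(weights))
              for i in range(len(weights[m]))
              for n in range(len(weights[m][i])))
print(max_err)
```

The backward loop visits each weight once, which is where the $O(W)$ cost comes from, whereas the finite-difference check needs two full forward passes per weight.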

In principle, we can use any minimization algorithm to optimize the parameters, such as the Levenberg–Marquardt (LM) method discussed in section 2.1.2 and the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method that we shall use in section 3.3. The loss function in equation (3.60) decomposes as a sum over the training set, and so does its gradient:

$$ \nabla_\theta L(\theta) = \frac{1}{2} \sum_{i=1}^{N} \nabla_\theta \| f(x_i; \theta) - y_i \|^2. \tag{3.61} $$

Therefore, the computational cost of one minimization step is $O(N)$, proportional to the number of data points $N$ in the training set. A recurring problem in machine learning is that large training sets are necessary for good generalization. Batch optimization methods that require the whole training set at every step (e.g. LM and BFGS) are therefore computationally expensive for machine learning problems, although they typically converge well and reach a small final loss.

In practice, nearly all machine learning is powered by the stochastic gradient descent (SGD) algorithm [197], an extension of the standard gradient descent algorithm. The insight of SGD is to treat the gradient as an expectation and to estimate this expectation using a (small) subset of the training data. Specifically, at each minimization step, instead of using the whole training set to compute the gradient, we sample a minibatch of examples $\{x_1, \ldots, x_{N'}\}$ from the training set. The minibatch size $N'$ is typically chosen to be relatively small, ranging from 1 to a few hundred, and it is usually held fixed as the training set size $N$ grows. In this way, the cost of estimating the gradient at each minimization step is $O(1)$ with respect to $N$:

$$ \nabla_\theta L(\theta) \approx \frac{N}{2N'} \sum_{i=1}^{N'} \nabla_\theta \| f(x_i; \theta) - y_i \|^2. \tag{3.62} $$
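The scaling factor $N / (2N')$ in equation (3.62) can be checked numerically. The sketch below is illustrative only: it swaps the NN for a linear model $f(x; \theta) = \theta \cdot x$ so that the per-example gradient is available in closed form, and the synthetic data, function names, and sizes are assumptions. To make the check deterministic, it partitions the shuffled training set into disjoint minibatches rather than sampling them independently; averaging the minibatch estimates over such a partition recovers the full gradient of equation (3.61).

```python
import random

def per_example_grad(theta, x, y):
    """Gradient of ||f(x; theta) - y||^2 for the linear model f = theta . x."""
    r = sum(t * xi for t, xi in zip(theta, x)) - y
    return [2.0 * r * xi for xi in x]

def full_gradient(theta, X, Y):
    """Equation (3.61): one half of the sum over all N examples."""
    g = [0.0] * len(theta)
    for x, y in zip(X, Y):
        for k, gk in enumerate(per_example_grad(theta, x, y)):
            g[k] += 0.5 * gk
    return g

def minibatch_gradient(theta, X, Y, batch):
    """Equation (3.62): N / (2 N') times the sum over the N' batch examples."""
    scale = len(X) / (2.0 * len(batch))
    g = [0.0] * len(theta)
    for i in batch:
        for k, gk in enumerate(per_example_grad(theta, X[i], Y[i])):
            g[k] += scale * gk
    return g

rng = random.Random(0)
N, N_batch = 1000, 10  # hypothetical training-set and minibatch sizes
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(N)]
Y = [1.5 * x[0] - 0.5 * x[1] + rng.gauss(0, 0.1) for x in X]
theta = [0.0, 0.0]

exact = full_gradient(theta, X, Y)

# Disjoint minibatches covering the whole training set exactly once.
order = list(range(N))
rng.shuffle(order)
batches = [order[s:s + N_batch] for s in range(0, N, N_batch)]
estimates = [minibatch_gradient(theta, X, Y, b) for b in batches]
mean_estimate = [sum(e[k] for e in estimates) / len(estimates)
                 for k in range(len(theta))]

print(exact)
print(mean_estimate)  # agrees with the full gradient up to rounding
print(estimates[0])   # a single minibatch estimate is noisy
```

Each call to `minibatch_gradient` touches only $N'$ examples, so its cost does not grow with $N$.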

Another crucial feature of SGD is that it allows the loss to increase during minimization. This is inevitable when training dropout NNs, which will be discussed in section 4.3, because the NN structure changes from step to step when dropout is applied.
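A bare-bones SGD loop built on the minibatch estimate of equation (3.62) might look like the following sketch. It is an assumption-laden illustration rather than the training procedure used in this work: the model is a linear stand-in $f(x; \theta) = \theta \cdot x$ without dropout, and the data, learning rate, batch size, and step count are hypothetical. The full loss of equation (3.60) is recorded after every step, so one can count how often it goes up between consecutive steps.

```python
import random

def predict(theta, x):
    """Linear stand-in model f(x; theta) = theta . x."""
    return sum(t * xi for t, xi in zip(theta, x))

def total_loss(theta, X, Y):
    """Equation (3.60) evaluated on the whole training set."""
    return 0.5 * sum((predict(theta, x) - y) ** 2 for x, y in zip(X, Y))

def sgd(theta, X, Y, batch_size, learning_rate, steps, rng):
    """Plain SGD; returns the final parameters and the loss after each step."""
    N = len(X)
    history = [total_loss(theta, X, Y)]
    for _ in range(steps):
        batch = rng.sample(range(N), batch_size)  # sample a minibatch
        grad = [0.0] * len(theta)
        for i in batch:
            r = predict(theta, X[i]) - Y[i]
            for k, xk in enumerate(X[i]):
                # N / (2 N') * d(r^2)/d(theta_k) = (N / N') * r * x_k
                grad[k] += (N / batch_size) * r * xk
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
        history.append(total_loss(theta, X, Y))
    return theta, history

rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(1000)]
Y = [1.5 * x[0] - 0.5 * x[1] + rng.gauss(0, 0.1) for x in X]

theta, history = sgd([0.0, 0.0], X, Y, batch_size=10,
                     learning_rate=1e-4, steps=300, rng=rng)

# Number of steps at which the full loss went up rather than down.
increases = sum(1 for a, b in zip(history, history[1:]) if b > a)
print(theta)
print(history[0], history[-1], increases)
```

Recording the full loss costs $O(N)$ per step and is done here only for monitoring; the parameter update itself uses the minibatch alone.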

3.3 A neural network potential for multilayer graphene
