CHAPTER 3 DEEP NEURAL NETWORKS

3.2 Multi Layer Perceptrons (MLPs)

3.2.1 Activation Functions

A non-smooth perceptron activation function means that the error rate of the model is a discontinuous function of the weight parameters, which makes it difficult to find optimal weights by minimizing the loss function. To address this problem, we apply a continuous activation function. We saw in the previous chapter that the logistic sigmoid is an appropriate choice for this purpose, since it is capable of learning nonlinear decision boundaries for data that are not linearly separable. Activation functions in an MLP should be nonlinear, continuously differentiable and monotonically non-decreasing (Rosen-Zvi et al., 1998). Additionally, it is desirable to choose an activation function whose derivative can be computed easily. Assuming a nonlinear activation function such as the sigmoid, it can be proven that an MLP is a universal function approximator. This means that a single-hidden-layer MLP with a sufficiently large but finite number of neurons can approximate any continuous function on a compact domain to arbitrary accuracy (Cybenko, 1989).

Differentiability is required when we apply the backpropagation learning method with gradient-based approaches. A finite range is a separate, desirable property of the activation function. Typically, when the activation function is bounded within a certain range, gradient-based optimization methods tend to be more stable, as pattern presentations significantly change only a limited number of weight parameters. For activation functions with infinite range, learning is typically more efficient, since pattern presentations tend to affect most of the weight parameters significantly. Hence, smaller learning rates are generally needed for updating the parameters in that case (Snyman, 2005).

If the activation function is monotonic, then the error surface associated with a single-layer network (no hidden layer) is guaranteed to be convex, as in logistic regression (Wu, 2009). According to the proof provided in (Kingman, 1961), if a function f is concave and g is convex and monotonically non-increasing over a univariate domain, then h(x) = g(f(x)) is convex. It follows that the global minimum for a single perceptron can be found. For multi-layer perceptrons the problem becomes more complicated, although a solution can still be found. When the solution is approached using backpropagation with a gradient-based method, the loss surface is in general no longer convex and has several local minima (Goodfellow et al., 2014). Therefore, the main task in the learning problem is to derive the update rules for the weight adjustments via a gradient-based algorithm. During learning, backpropagation determines how strongly each neuron influences the neurons in the next layer. With a non-monotonic activation function, increasing a neuron's weight may reduce its influence on the neurons in the next layer, and the model might not converge to a solution.

Another desirable property of an activation function is that it approximates the identity near the origin. In this case, if the weight parameters are initialized with small random values, the MLP model tends to learn efficiently. Otherwise, special care is needed when initializing the weights (Sussillo, 2014).
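As a rough illustration, reading the property above as f(x) ≈ x for small |x|, the following minimal sketch compares tanh and the logistic sigmoid on a few small inputs. The helper name sigmoid and the chosen sample points are assumptions made for this example only.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


# Small inputs around the origin (illustrative values, not from the text).
small_inputs = [-0.1, -0.01, 0.0, 0.01, 0.1]

# Largest deviation from the identity f(x) = x over the sample points.
tanh_gap = max(abs(math.tanh(x) - x) for x in small_inputs)
sigmoid_gap = max(abs(sigmoid(x) - x) for x in small_inputs)

print(f"max |tanh(x) - x|    = {tanh_gap:.6f}")
print(f"max |sigmoid(x) - x| = {sigmoid_gap:.6f}")
```

On these points tanh stays very close to the identity, whereas the sigmoid does not, because sigmoid(0) = 0.5 rather than 0.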

The most common choice of activation function in an MLP is the logistic sigmoid, which takes a real-valued input between −∞ and +∞ and squashes it into a bounded range between 0 and 1, the values used to represent the output classes in a binary classification problem.

Despite the popularity of the sigmoid, a sigmoidal activation function in the form of a hyperbolic tangent is sometimes preferred, both empirically and theoretically, for deep MLPs. Two forms of sigmoidal activation functions are commonly used: tanh(x), whose range is normalized to −1 to 1, and the logistic sigmoid 1/(1 + exp(−x)), whose range is bounded between 0 and 1. The former function is a rescaling of the logistic sigmoid:

tanh(x) = 2sigmoid(2x) − 1. (3.51)
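Equation (3.51) can be checked numerically. The sketch below is only a sanity check on a handful of arbitrarily chosen points; the function names are assumptions introduced for this example.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x: float) -> float:
    """Right-hand side of Eq. (3.51): 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


# Arbitrary test points; the identity holds for every real x.
points = [-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0]
max_error = max(abs(math.tanh(x) - tanh_via_sigmoid(x)) for x in points)
print(f"largest absolute difference: {max_error:.2e}")
```

The difference should be at the level of floating-point rounding error.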

The hyperbolic tangent maps its input from (−∞, +∞) to (−1, +1) and is symmetric around the origin, while the sigmoid is not. The output of a logistic sigmoid is always a positive value in (0, 1). This asymmetry of the sigmoid around zero makes the following layers more prone to saturation during training with gradient-based algorithms, which in turn makes learning more difficult.

In (LeCun et al., 1998b), the authors show that the logistic sigmoid slows down learning because of its non-zero mean, which yields singular values in the Hessian during gradient-based learning. To address this problem, they provide detailed evidence that normalizing the inputs to have mean 0 and variance 1 along each feature typically leads to better and faster convergence during gradient-based learning. They also suggest choosing an anti-symmetric activation function of the form:
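The input normalization mentioned above can be sketched as follows. This is a minimal illustration that assumes the data is given as a list of rows with one column per feature; the function name standardize_columns and the toy data are hypothetical, and the population standard deviation is used as one possible convention.

```python
import statistics


def standardize_columns(rows):
    """Shift and scale each feature (column) to mean 0 and variance 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]  # population std
    return [
        # A constant feature (std == 0) is mapped to 0.0 to avoid dividing by zero.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Toy data: two features on very different scales.
data = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
scaled = standardize_columns(data)
for row in scaled:
    print(row)
```

After this transformation both features are on the same scale, regardless of their original units.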

f (x) = a tanh(bx), (3.52)

where f(·) maps (−∞, +∞) to (−a, +a), with a = 1.7159 and b = 2/3.
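A minimal sketch of the function in Equation (3.52), using a = 1.7159 and b = 2/3, is given below. The constant and function names are assumptions for this example. With these constants, f(1) is approximately 1 and f(−1) is approximately −1, which the code prints for inspection.

```python
import math

A = 1.7159      # output scale a
B = 2.0 / 3.0   # input scale b


def scaled_tanh(x: float) -> float:
    """Anti-symmetric activation f(x) = a * tanh(b * x) from Eq. (3.52)."""
    return A * math.tanh(B * x)


for x in (-1.0, 0.0, 1.0):
    print(f"f({x:+.1f}) = {scaled_tanh(x):+.4f}")
```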

Both the sigmoid and the tanh function, however, have derivatives that can be computed simply and efficiently, which is a compelling reason for using them during gradient-based optimization of an MLP.
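To make this concrete, both derivatives can be written in terms of the function value itself: sigmoid'(x) = sigmoid(x)(1 − sigmoid(x)) and tanh'(x) = 1 − tanh(x)^2. The sketch below compares these closed forms with a central finite difference; the helper names and the step size are assumptions for this example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid, reusing the forward value s: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x: float) -> float:
    """Derivative of tanh, reusing the forward value t: 1 - t^2."""
    t = math.tanh(x)
    return 1.0 - t * t


def numerical_grad(f, x: float, eps: float = 1e-6) -> float:
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


for x in (-2.0, 0.0, 1.5):
    print(x, sigmoid_grad(x), numerical_grad(sigmoid, x))
    print(x, tanh_grad(x), numerical_grad(math.tanh, x))
```

Because the derivative reuses the value already computed in the forward pass, the backward pass needs only a few extra multiplications per unit.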

While sigmoid and tanh have been commonly used activation functions, the more recent work of (Glorot et al., 2011) provides evidence that rectified linear units (relu) lead to faster and more effective learning of deep neural networks on complex and high-dimensional data. Relu computes the function f(x) = max(0, x), that is, it simply thresholds its input at zero. One of the most important advantages of relu, compared to sigmoid and its counterpart tanh, is that it requires no expensive computation, only a comparison and a multiplication. Furthermore, relu does not saturate for positive inputs, so gradients pass backward through active units without shrinking, which makes it a suitable choice particularly for deep networks. These advantages, combined with its ability to produce sparse activations, are considered benefits for optimizing deep MLPs and CNNs. On the downside, relu units suffer from an important potential problem during training: dying relu neurons. The dying problem occurs when no gradient flows backward through a relu unit, so that the neuron never fires again from that point on. Typically, a large number of units in a multi-layer network may be pushed into dead states when a high learning rate is selected. One way to alleviate this problem is to use a small learning rate for the weight updates in gradient descent.
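The dying relu problem can be illustrated with a single unit. In the sketch below, the weight, bias, inputs and learning rate are hypothetical values chosen so that the pre-activation is negative for every input, as might happen after one overly large update; it is an illustration under these assumptions, not a reproduction of any experiment.

```python
def relu(z: float) -> float:
    """f(z) = max(0, z)."""
    return max(0.0, z)


def relu_grad(z: float) -> float:
    """Derivative of relu: 1 for z > 0, otherwise 0."""
    return 1.0 if z > 0 else 0.0


inputs = [0.5, 1.0, 2.0]   # hypothetical training inputs
w, b = 1.0, -10.0          # hypothetical state with a strongly negative bias
learning_rate = 0.1        # hypothetical step size

# Gradient of the summed unit output with respect to w (chain rule).
grad_w = sum(relu_grad(w * x + b) * x for x in inputs)
outputs = [relu(w * x + b) for x in inputs]

# A gradient descent step leaves the weight unchanged: the unit is dead.
w_new = w - learning_rate * grad_w
print(outputs, grad_w, w_new)
```

Since the unit outputs zero and receives zero gradient for every input, no further update can bring it back, whatever loss is placed on top of it.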

Another approach to mitigating this issue is to use leaky relu instead of relu, which keeps a small non-zero slope for negative inputs, that is, when the unit is not active. A parametric rectified linear unit (prelu), which generalizes the rectified unit by turning these slopes into parameters that are adapted along with the other network parameters, was proposed by (He et al., 2015). There are several other commonly used activation functions you may encounter in practice. Figure 3.8 illustrates several common activation functions and their behavior.
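A minimal sketch of the two variants follows. The slope value 0.01 for leaky relu is an assumed, commonly used default rather than a value taken from the text, and the function names are hypothetical. For prelu, the slope is treated as a trainable parameter, so its gradient is shown as well.

```python
def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """x for x > 0, otherwise negative_slope * x (fixed slope, assumed 0.01)."""
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x: float, negative_slope: float = 0.01) -> float:
    """Gradient w.r.t. the input: never exactly zero, so the unit cannot die."""
    return 1.0 if x > 0 else negative_slope


def prelu(x: float, slope: float) -> float:
    """Same form as leaky relu, but the slope is a learned parameter."""
    return x if x > 0 else slope * x


def prelu_grad_wrt_slope(x: float) -> float:
    """Gradient of prelu w.r.t. its slope: 0 for x > 0, otherwise x."""
    return 0.0 if x > 0 else x


for x in (-2.0, 3.0):
    print(x, leaky_relu(x), leaky_relu_grad(x), prelu(x, 0.25), prelu_grad_wrt_slope(x))
```

Unlike plain relu, the gradient for negative inputs is non-zero, so an inactive unit can still receive weight updates.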
