Universal Approximation Properties and Depth

Without a hidden layer, and for convex loss functions and regularizers (which is typically the case), we obtain a convex training criterion. We end up with linear regression, logistic regression, and other log-linear models that are used in many applications of machine learning. This is appealing (no local minima, no saddle points) and convenient (no strong dependency on initial conditions) but comes with a major disadvantage: such learners are very limited in their ability to represent more complex functions. In the case of classification tasks, we are limited to linear decision surfaces. In practice, this limitation can be somewhat circumvented by handcrafting a large enough or discriminant enough set of hardwired features extracted from the raw input, on which the linear predictor can then be easily trained. See the next section for a longer discussion of feature learning.
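As a small illustration of the limitation to linear decision surfaces, the sketch below enumerates the binary functions of two binary inputs that a single linear threshold unit can compute, and contrasts this with a hand-set network that has one hidden layer. The weight grid, the function names and the threshold non-linearity are assumptions made for this example, not something prescribed by the text.

```python
from itertools import product

INPUTS = list(product((0, 1), repeat=2))  # the four points (x1, x2)

def linear_threshold(w1, w2, b):
    """Truth table of the linear classifier [w1*x1 + w2*x2 + b > 0]."""
    return tuple(int(w1 * x1 + w2 * x2 + b > 0) for x1, x2 in INPUTS)

# Small hypothetical grid of weights and biases, chosen for illustration.
WEIGHTS = (-1, 0, 1)
BIASES = (-1.5, -0.5, 0.5, 1.5)
reachable = {linear_threshold(w1, w2, b)
             for w1 in WEIGHTS for w2 in WEIGHTS for b in BIASES}

XOR = tuple(x1 ^ x2 for x1, x2 in INPUTS)

def xor_with_hidden_layer(x1, x2):
    """Hand-set one-hidden-layer threshold network computing XOR."""
    h_or = int(x1 + x2 - 0.5 > 0)    # hidden unit 1: OR
    h_and = int(x1 + x2 - 1.5 > 0)   # hidden unit 2: AND
    return int(h_or - h_and - 0.5 > 0)  # output: OR and not AND

print(len(reachable), "of 16 binary functions reached by a linear unit")
print("XOR reachable without hidden layer:", XOR in reachable)
print("XOR with one hidden layer:", [xor_with_hidden_layer(*x) for x in INPUTS])
```

On this grid the linear unit reaches 14 of the 16 possible functions; the two it misses are XOR and its negation, which are not linearly separable, so no choice of weights would help.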

To make the family of functions rich enough, the other option is to introduce one or more hidden layers. It can be proven (White, 1990; Barron, 1993; Girosi, 1994) that with a single hidden layer of sufficient size and a reasonable choice of non-linearity (including the sigmoid, hyperbolic tangent, and RBF unit), one can represent any smooth function to some desired accuracy (the greater the required accuracy, the more hidden units are required). However, these theorems generally do not tell us how many hidden units will be required to achieve a given accuracy for particular data distributions: in fact the worst-case scenario is generally going to require an exponential number of hidden units (to basically record every input configuration that needs to be distinguished). It is easier to understand how bad things can be by considering the case of binary input vectors: the number of possible binary functions on vectors $v \in \{0,1\}^d$ is $2^{2^d}$, and selecting one such function requires $2^d$ bits, which will in general require $O(2^d)$ degrees of freedom.
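The counting argument for binary inputs can be checked by brute force for small d. The following minimal sketch (the function name is hypothetical) enumerates every truth table over d binary inputs: each table has one output bit per input configuration, so it takes 2^d bits to pick one, and there are 2^(2^d) tables in total.

```python
from itertools import product

def count_boolean_functions(d):
    """Count binary functions on {0,1}^d by enumerating their truth tables."""
    n_configurations = 2 ** d  # number of distinct input vectors
    # A function is one output bit per input configuration.
    return sum(1 for _ in product((0, 1), repeat=n_configurations))

for d in range(1, 4):
    print(f"d={d}: bits to select a function = {2 ** d}, "
          f"functions = {count_boolean_functions(d)}")
```

The enumeration is already impractical around d = 5 (2^32 tables), which is the point of the argument: a model able to select an arbitrary function from this family needs a number of degrees of freedom that grows exponentially with d.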

However, machine learning is not about learning equally easily any possible function: we mostly care about the kinds of functions that are needed to represent the world around us. The no-free-lunch theorems for machine learning (Wolpert, 1996) essentially say that without any prior on the ground truth (the data generating distribution or the optimal function), no learning algorithm is “universal”, i.e., dominates all the others against all possible ground truths.

One of the central priors that is exploited in deep learning is that the target function to be learned can be efficiently represented as a deep composition of simpler functions (“features”), where features at one level can be re-used to define many features at the next level. This is connected to the notion of underlying factors described in the next section. One therefore assumes that these factors or features are organized at multiple levels, corresponding to multiple levels of representation. The number of such levels is what we call depth in this context. The computation of these features can therefore be laid down in a flow graph or circuit, whose depth is the length of the longest path from an input node to an output node. Note that we can define the operations performed in each node in different ways. For example, do we consider a node that computes the affine operations of a neural net followed by a node that computes the non-linear neuron activation, or do we consider both of these operations as one node or one level of the graph? Hence the notion of depth really depends on the allowed operations at each node and one flow graph of a given depth can be converted into an equivalent flow graph of a different depth by redefining which operations can be performed at each node. However, for neural networks, we typically talk about a depth 1 network if there is no hidden layer, a depth 2 network if there is one hidden layer, etc. The universal approximation properties for neural nets basically tell us that depth 2 is sufficient to approximate any reasonable function to any desired finite accuracy.
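To make the dependence of depth on the chosen node operations concrete, here is a minimal sketch that computes the longest input-to-output path in a flow graph. The dictionary representation (each node mapped to its list of predecessors) and the node names are assumptions made for this example. The same one-hidden-layer network is described twice: once with affine maps and non-linearities as separate nodes, and once with each layer as a single node.

```python
def depth(graph, node, cache=None):
    """Length of the longest path from any input node to `node`.

    `graph` maps each non-input node to the list of its predecessors;
    nodes that do not appear as keys are treated as inputs (depth 0).
    """
    if cache is None:
        cache = {}
    if node not in cache:
        predecessors = graph.get(node, [])
        cache[node] = 0 if not predecessors else 1 + max(
            depth(graph, p, cache) for p in predecessors)
    return cache[node]

# One hidden layer, affine map and non-linearity counted as separate nodes.
fine_grained = {
    "affine1": ["x"],
    "hidden": ["affine1"],
    "affine2": ["hidden"],
    "y": ["affine2"],
}

# The same network, with each layer (affine + non-linearity) as one node.
coarse_grained = {
    "hidden": ["x"],
    "y": ["hidden"],
}

print("fine-grained depth:", depth(fine_grained, "y"))
print("coarse-grained depth:", depth(coarse_grained, "y"))
```

The coarse-grained description gives depth 2, matching the convention for a network with one hidden layer, while the fine-grained description of the same computation gives depth 4.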

From the point of view of approximation properties, the important result is that one can find families of functions which can be approximated very efficiently when a particular depth is allowed, but which might require a much larger (typically exponentially larger) model (e.g. more hidden units) if depth is insufficient (or is limited to 2). Such results have been proven for logic gates (Håstad, 1986), linear threshold units with non-negative weights (Håstad and Goldmann, 1991), polynomials (Delalleau and Bengio, 2011) organized as deep sum-product networks (Poon and Domingos, 2011), and more recently, for deep networks of rectifier units (Pascanu et al., 2013). Of course, there is no guarantee that the kinds of functions we want to learn in applications of machine learning (and in particular for AI) share such a property. However, a lot of experimental evidence suggests that better generalization can be obtained with depth, for many such AI tasks (Bengio, 2009b; Mesnil et al., 2011; Goodfellow et al., 2011; Ciresan et al., 2012; Krizhevsky et al., 2012a; Sermanet et al., 2013; Farabet et al., 2013; Couprie et al., 2013; Ebrahimi et al., 2013). This suggests that depth is indeed a useful prior and that, in order to take advantage of it, the learner’s family of functions needs to allow multiple levels of representation.
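A standard textbook illustration of this kind of gap, offered here as a sketch and not as the construction used in the cited papers, is the parity function on d bits. A deep circuit can compute it with a chain of d - 1 two-input XOR gates, whereas a depth-2 OR-of-ANDs formula (disjunctive normal form) needs one AND term per odd-weight input, i.e. 2^(d-1) terms. The function names below are hypothetical.

```python
from functools import reduce
from itertools import product

def parity_deep(bits):
    """Parity via a chain of len(bits) - 1 two-input XOR gates."""
    return reduce(lambda a, b: a ^ b, bits)

def parity_dnf_terms(d):
    """AND terms of the depth-2 formula: one per odd-weight input vector."""
    return [x for x in product((0, 1), repeat=d) if sum(x) % 2 == 1]

def parity_shallow(bits, terms):
    """OR over AND terms; a term fires only on its own input vector."""
    return int(any(tuple(bits) == t for t in terms))

for d in range(2, 7):
    terms = parity_dnf_terms(d)
    agree = all(parity_deep(x) == parity_shallow(x, terms)
                for x in product((0, 1), repeat=d))
    print(f"d={d}: deep gates = {d - 1}, shallow terms = {len(terms)}, "
          f"same function = {agree}")
```

Both representations compute the same function, but the size of the deep one grows linearly with d while the size of the shallow one doubles with every added input.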
