Unsupervised Model - Preliminary: Simple, Shallow Neural Networks

2. Preliminary: Simple, Shallow Neural Networks

2.2 Unsupervised Model

Unlike in supervised learning, unsupervised learning considers a case where there is no target value. In this case, the training setDconsists of only input vectors:

D=x(n)

n=1. (2.10)

Similarly to the supervised case, we may assume that eachxinDis a noisy obser- vation of an unknown hidden variable such that

x=f(h). (2.11) Whereas we aimed to ﬁnd the function or mappingf given both input and output previously in supervised models, our aim here is to ﬁnd both the unknown function

fand the hidden variablesh∈Rq. This leads tolatent variable modelsin statistics (see, e.g., Murphy, 2012).

This is, however, not the only way to formulate an unsupervised model. Another way is to build a model that learns direct relationships among the input components

x₁, . . . , xp. This does not require any hidden variable, but still learns an (unknown)

structure of the model.

2.2.1 Linear Autoencoder and Principal Component Analysis

In this section, we look at the case where hidden variables are assumed to have lin- early generated training samples. In this case, it is desirable for us to learn not only an unknown functionf, but also another functiongthat is an (approximate) inverse function off. Opposite tof,grecognizes a given sample by ﬁnding a corresponding

x1 x2 xp ˜ x1 ˜ x2 ˜ xp h1 hq

(a) Linear Autoencoder

x1 x2 x3 xp

(b) Hopﬁeld Network

Figure 2.3.Illustrations of a linear autoencoder and Hopfield network. An undirected edge in the Hop- field network indicates that signal flows in both ways.

state of the hidden variables.3

Let us construct a neural network with linear units. There are as many input units aspcorresponding to components of an input vector, denoted byx, andqlinear units that correspond to the hidden variables, denoted byh. Additionally, we add another set ofplinear units, denoted byx˜. We connect directed edges fromxtohand from

htox˜. Each edgeeijwhich connects thei-th input unit to thej-th hidden unit has a

corresponding weightwij. Also, edgeejkwhich connects thej-th hidden unit to the k-th output unit has its weightujk. See Fig. 2.3 (a) for the illustration.

This model is called a linear autoencoder.4_{The encoder of the autoencoder is}

h=f(x) =Wx+b, (2.12)

and the decoder is

x=g(h) =Uh(x) +c, (2.13)

where we uses the matrix-vector notation for simplicity. W = [wij]_p_×_q are the

encoder weights, U = [ujk]_q_×_p the decoder weights, andbandcare the hidden

biases and the visible biases, respectively. It is usual to call the layer of the hidden units abottleneck.5_{Note that without loss of generality we will omit biases whenever}

it is necessary to make equations uncluttered.

In this linear autoencoder, the encoder in Eq. (2.12) acts as an inverse function

gthat recognizes a given sample, whereas the decoder in Eq. (2.13) simulates the unknown functionfin Eq. (2.11).

If we tied the weights of the encoder and decoder so thatW =U, we can see the connection between the linear autoencoder and the principal component analysis

3_{Note that it is not necessary for}_g_{to be an explicit function. In some models such as sparse} coding in Section 3.2.5,gmay be deﬁned implicitly.

4_{The same type of neural networks is also called}_{autoassociative}_{neural networks. In this} thesis, however, we use the termautoencoderwhich has become more widely used recently. 5_{Although the term}_bottleneck_{implicitly implies that the size of the layer is smaller than that} of either the input or output layers, it is not necessarily so.

(PCA). Although there are many ways to formulate PCA (see, e.g. Bishop, 2006), one way is to use a minimum-error formulation6that minimizes

J(θ) =1 2 N n=1 x(n)−x˜(n)2 2. (2.14)

The minimum of Eq. (2.14) is in fact exactly the solution the linear autoencoder aims to ﬁnd.

This connection and equivalence between the linear autoencoder and the PCA have been noticed and shown by previous research (see, for instance, Oja, 1982; Baldi and Hornik, 1989). However, it should be reminded that minimizing the cost function in Eq. (2.14) by an optimization algorithm is unlikely to recover the principal components, but an arbitrary basis of the subspace spanned by the principal components, unless we explicitly constrain the weight matrix to be orthogonal.

This linear autoencoder has several restrictions. The most obvious one is that it is only able to learn a correct model when the unknown functionfis linear. Secondly, due to its linear nature, it is not possible to model any hierarchical generative process. Adding more hidden layers is equivalent to simply multiplying the weight matrices of additional layers, and this does not help in any way.

Another restriction is that the number of hidden unitsqis upper-bounded by the input dimensionalityp. Although it is possible to useq > p, it will not make any difference, as it does not make any sense to use more thanpprincipal components in PCA. This could be worked around by using regularization as in, for instance, sparse coding (Olshausen and Field, 1996) or independent component analysis (ICA) with reconstruction cost (Le et al., 2011b).

As was the case with the supervised models, this encourages us to investigate more complex, overcomplete models that have multiple layers of nonlinear hidden units.

2.2.2 Hopﬁeld Networks

Now let us consider a neural network consisting of visible units only, and each visible unit is anonlinear,deterministicunit, following Eq. (2.6), that corresponds to each component of an input vectorx. We connect each pair of the binary unitsxiandxj

with an undirected edgeeijthat has the weightwij, as in Fig. 2.3 (b). We add to each

unitxia bias termbi. Furthermore, let us deﬁne anenergyof the constructed neural

network as −E(x|θ) =1 2 i=j wijxixj+ i xibi, (2.15)

6_{Actually the minimum-error formulation minimizes the mean-squared error}_E _x₋_˜_x2 2

which is in most cases not available for evaluation. The cost function in Eq. (2.14) is an approximation to the mean-squared error using a ﬁnite number of training samples.

whereθ= (W,b). We call this neural network a Hopfield network (Hopfield, 1982). The Hopfield network aims to finding a set of weights that makes the energy of the presented patterns low via the training setD=x(1), . . . ,x(N)(see, e.g., Mackay, 2002). Given a fixed set of weights and an unseen, possibly corrupted input, the Hopfield network can be used to find a clean pattern by finding the nearestmodein the energy landscape.

In other words, the weights of the Hopﬁeld network can be obtained by minimizing the following cost function given a setDof training samples:

J(θ) = N n=1 E x(n)θ . (2.16)

The learning rule for each weightwijcan be derived by taking a partial derivative

of the cost functionJwith respect to it. The learning rule is

wij←wij− η N N n=1 ∂Ex(n)|θ ∂wij =wij+ η N N n=1 x(_in)x(_jn)=wij+ηxixj_d, (2.17)

whereηis a learning rate andx_Prefers to the expectation ofxover the distribution

P. We denote by d the data distribution from which samples in the training setD

come. Similarly, a biasbican be updated by

bi=bi+ηxid. (2.18)

This learning rule is known as the Hebbian learning rule (Hebb, 1949). This rule states that the weight between two units, or neurons, increases if they are active to- gether. After learning, the weight will be strongly positive, if the activities of the two connected units are highly correlated.

With the learned set of weights, we can simulate the network by updating each unit

xiwith xi=φ ⎛ ⎝ j=i wijxj+bi ⎞ ⎠, (2.19)

whereφis a Heaviside function as in Eq. (2.6).

It should be noticed that because the energy function in Eq. (2.15) is not lower bounded and the gradient in Eq. (2.17) does not depend on the parameters, we may simply set each weight by

wij=cxixj_d,

wherecis an arbitrary , positive constant, given a ﬁxed set of training samples. An arbitrarycis possible, since the output of (2.19) is invariant to the scaling of the parameters.

In summary, the Hopﬁeld network memorizes the training samples and is able to retrieve them, starting from either a corrupted input or a random sample. This is one way of learning an internal structure of a given training set in an unsupervised manner.

The Hopfield network learns the unknown structure of training samples. However, it is limited in the sense that only direct correlations among visible units are modeled. In other words, the network can only learn second-order statistics. Furthermore, the use of the Hopfield network is highly limited by a few fundamental deficiencies in- cluding the emergence of spurious states (for more details, see Haykin, 2009). These encourage us to extend the model by introducing multiple hidden units as well as making them stochastic.

In document Foundations and Advances in Deep Learning (Page 33-37)