• No results found

Unsupervised Model

2. Preliminary: Simple, Shallow Neural Networks

2.2 Unsupervised Model

Unlike in supervised learning, unsupervised learning considers a case where there is no target value. In this case, the training setDconsists of only input vectors:

D=x(n)

N

n=1. (2.10)

Similarly to the supervised case, we may assume that eachxinDis a noisy obser- vation of an unknown hidden variable such that

x=f(h). (2.11) Whereas we aimed to find the function or mappingf given both input and output previously in supervised models, our aim here is to find both the unknown function

fand the hidden variableshRq. This leads tolatent variable modelsin statistics (see, e.g., Murphy, 2012).

This is, however, not the only way to formulate an unsupervised model. Another way is to build a model that learns direct relationships among the input components

x1, . . . , xp. This does not require any hidden variable, but still learns an (unknown)

structure of the model.

2.2.1 Linear Autoencoder and Principal Component Analysis

In this section, we look at the case where hidden variables are assumed to have lin- early generated training samples. In this case, it is desirable for us to learn not only an unknown functionf, but also another functiongthat is an (approximate) inverse function off. Opposite tof,grecognizes a given sample by finding a corresponding

x1 x2 xp ˜ x1 ˜ x2 ˜ xp h1 hq

(a) Linear Autoencoder

x1 x2 x3 xp

(b) Hopfield Network

Figure 2.3.Illustrations of a linear autoencoder and Hopfield network. An undirected edge in the Hop- field network indicates that signal flows in both ways.

state of the hidden variables.3

Let us construct a neural network with linear units. There are as many input units aspcorresponding to components of an input vector, denoted byx, andqlinear units that correspond to the hidden variables, denoted byh. Additionally, we add another set ofplinear units, denoted byx˜. We connect directed edges fromxtohand from

htox˜. Each edgeeijwhich connects thei-th input unit to thej-th hidden unit has a

corresponding weightwij. Also, edgeejkwhich connects thej-th hidden unit to the k-th output unit has its weightujk. See Fig. 2.3 (a) for the illustration.

This model is called a linear autoencoder.4The encoder of the autoencoder is

h=f(x) =Wx+b, (2.12)

and the decoder is

˜

x=g(h) =Uh(x) +c, (2.13)

where we uses the matrix-vector notation for simplicity. W = [wij]p×q are the

encoder weights, U = [ujk]q×p the decoder weights, andbandcare the hidden

biases and the visible biases, respectively. It is usual to call the layer of the hidden units abottleneck.5Note that without loss of generality we will omit biases whenever

it is necessary to make equations uncluttered.

In this linear autoencoder, the encoder in Eq. (2.12) acts as an inverse function

gthat recognizes a given sample, whereas the decoder in Eq. (2.13) simulates the unknown functionfin Eq. (2.11).

If we tied the weights of the encoder and decoder so thatW =U, we can see the connection between the linear autoencoder and the principal component analysis

3Note that it is not necessary forgto be an explicit function. In some models such as sparse coding in Section 3.2.5,gmay be defined implicitly.

4The same type of neural networks is also calledautoassociativeneural networks. In this thesis, however, we use the termautoencoderwhich has become more widely used recently. 5Although the termbottleneckimplicitly implies that the size of the layer is smaller than that of either the input or output layers, it is not necessarily so.

(PCA). Although there are many ways to formulate PCA (see, e.g. Bishop, 2006), one way is to use a minimum-error formulation6that minimizes

J(θ) =1 2 N n=1 x(n)x˜(n)2 2. (2.14)

The minimum of Eq. (2.14) is in fact exactly the solution the linear autoencoder aims to find.

This connection and equivalence between the linear autoencoder and the PCA have been noticed and shown by previous research (see, for instance, Oja, 1982; Baldi and Hornik, 1989). However, it should be reminded that minimizing the cost function in Eq. (2.14) by an optimization algorithm is unlikely to recover the principal compo- nents, but an arbitrary basis of the subspace spanned by the principal components, unless we explicitly constrain the weight matrix to be orthogonal.

This linear autoencoder has several restrictions. The most obvious one is that it is only able to learn a correct model when the unknown functionfis linear. Secondly, due to its linear nature, it is not possible to model any hierarchical generative process. Adding more hidden layers is equivalent to simply multiplying the weight matrices of additional layers, and this does not help in any way.

Another restriction is that the number of hidden unitsqis upper-bounded by the input dimensionalityp. Although it is possible to useq > p, it will not make any difference, as it does not make any sense to use more thanpprincipal components in PCA. This could be worked around by using regularization as in, for instance, sparse coding (Olshausen and Field, 1996) or independent component analysis (ICA) with reconstruction cost (Le et al., 2011b).

As was the case with the supervised models, this encourages us to investigate more complex, overcomplete models that have multiple layers of nonlinear hidden units.

2.2.2 Hopfield Networks

Now let us consider a neural network consisting of visible units only, and each visible unit is anonlinear,deterministicunit, following Eq. (2.6), that corresponds to each component of an input vectorx. We connect each pair of the binary unitsxiandxj

with an undirected edgeeijthat has the weightwij, as in Fig. 2.3 (b). We add to each

unitxia bias termbi. Furthermore, let us define anenergyof the constructed neural

network as −E(x|θ) =1 2 i=j wijxixj+ i xibi, (2.15)

6Actually the minimum-error formulation minimizes the mean-squared errorE x˜x2 2

which is in most cases not available for evaluation. The cost function in Eq. (2.14) is an approximation to the mean-squared error using a finite number of training samples.

whereθ= (W,b). We call this neural network a Hopfield network (Hopfield, 1982). The Hopfield network aims to finding a set of weights that makes the energy of the presented patterns low via the training setD=x(1), . . . ,x(N)(see, e.g., Mackay, 2002). Given a fixed set of weights and an unseen, possibly corrupted input, the Hopfield network can be used to find a clean pattern by finding the nearestmodein the energy landscape.

In other words, the weights of the Hopfield network can be obtained by minimizing the following cost function given a setDof training samples:

J(θ) = N n=1 E x(n)θ . (2.16)

The learning rule for each weightwijcan be derived by taking a partial derivative

of the cost functionJwith respect to it. The learning rule is

wij←wij− η N N n=1 ∂Ex(n)|θ ∂wij =wij+ η N N n=1 x(in)x(jn)=wij+ηxixjd, (2.17)

whereηis a learning rate andxPrefers to the expectation ofxover the distribution

P. We denote by d the data distribution from which samples in the training setD

come. Similarly, a biasbican be updated by

bi=bi+ηxid. (2.18)

This learning rule is known as the Hebbian learning rule (Hebb, 1949). This rule states that the weight between two units, or neurons, increases if they are active to- gether. After learning, the weight will be strongly positive, if the activities of the two connected units are highly correlated.

With the learned set of weights, we can simulate the network by updating each unit

xiwith xi=φ ⎛ ⎝ j=i wijxj+bi ⎞ ⎠, (2.19)

whereφis a Heaviside function as in Eq. (2.6).

It should be noticed that because the energy function in Eq. (2.15) is not lower bounded and the gradient in Eq. (2.17) does not depend on the parameters, we may simply set each weight by

wij=cxixjd,

wherecis an arbitrary , positive constant, given a fixed set of training samples. An arbitrarycis possible, since the output of (2.19) is invariant to the scaling of the parameters.

In summary, the Hopfield network memorizes the training samples and is able to retrieve them, starting from either a corrupted input or a random sample. This is one way of learning an internal structure of a given training set in an unsupervised manner.

The Hopfield network learns the unknown structure of training samples. However, it is limited in the sense that only direct correlations among visible units are modeled. In other words, the network can only learn second-order statistics. Furthermore, the use of the Hopfield network is highly limited by a few fundamental deficiencies in- cluding the emergence of spurious states (for more details, see Haykin, 2009). These encourage us to extend the model by introducing multiple hidden units as well as making them stochastic.