A multivariate Gaussian distribution generalizes the one-dimensional (univariate) Gaussian distribution to higher dimensions. The probability density function (pdf) of a normalized multivariate Gaussian distribution is given by:
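\[
P(x \mid \mu, \Sigma) = \mathcal{N}(\mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right), \tag{4.4}
\]
where $\mu$ is a $D$-dimensional mean vector, $\Sigma$ is a $D \times D$ covariance matrix, and $|\Sigma|$ denotes the determinant of $\Sigma$.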
For the Gaussian distribution to be well defined, all eigenvalues $\lambda_i$ of the covariance matrix must be strictly positive (that is, $\Sigma$ must be positive definite); otherwise the distribution cannot be properly normalized.
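As a minimal sketch of how Eq. 4.4 can be evaluated in practice, the following NumPy function computes the log-density through a Cholesky factorization. The function name `mvg_logpdf` is a hypothetical choice made for this illustration, and working in the log domain is an assumption made here for numerical stability, not something the text prescribes. The factorization fails when the covariance matrix is not positive definite, which mirrors the eigenvalue condition stated above.

```python
import numpy as np


def mvg_logpdf(x, mu, sigma):
    """Log of the multivariate Gaussian density in Eq. 4.4 (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = mu.shape[0]

    # Cholesky factor L with sigma = L L^T. NumPy raises LinAlgError when
    # sigma is not positive definite; symmetry is assumed, not checked.
    chol = np.linalg.cholesky(sigma)

    # Solve L z = (x - mu), so that z^T z = (x - mu)^T sigma^{-1} (x - mu).
    z = np.linalg.solve(chol, x - mu)

    # log|sigma| = 2 * sum(log(diag(L))).
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))

    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + z @ z)


# Standard 2-D Gaussian evaluated at its mean: the log-density is -log(2*pi).
value_at_mean = mvg_logpdf([0.0, 0.0], [0.0, 0.0], np.eye(2))
```

Exponentiating the result recovers the density itself; the log form avoids underflow when $D$ is large.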
Figure 4.2: Probability density function and contours of a multivariate Gaussian distribution in 2 dimensions; the mean $\mu$ of the distribution is zero, and the spread is shown by the eigenvectors of $\Sigma$, which define the major and minor axes of the ellipse, with axis lengths governed by the eigenvalues $\lambda_i$.
For a $D$-dimensional MVG model, the multivariate normal density is completely specified by $D(D+1)/2 + D = D(D+3)/2$ parameters, which consist of the elements of the mean vector $\mu$ and the independent elements of the covariance matrix $\Sigma$. The total number of parameters therefore grows quadratically with $D$, and for large $D$ the computational task of manipulating and inverting large matrices becomes problematic. With high-dimensional data becoming readily available, one is frequently faced with the problem of estimating covariance matrices in high dimensions. In most cases these estimates are unsatisfactory because the estimated matrix is singular (i.e., its determinant is zero, which makes the inverse impossible to compute). Various techniques have been proposed in the literature to resolve this issue, including banding (Bickel & Levina, 2006), tapering (Furrer & Bengtsson, 2007; Wu & Pourahmadi, 2003) and shrinkage-based regularization (Copas, 1993). An alternative is to use restricted forms of the covariance matrix, such as a diagonal covariance matrix ($\Sigma = \mathrm{diag}(\sigma_i^2)$) or an isotropic Gaussian ($\Sigma = \sigma^2 I$). The number of independent parameters is then linear in $D$ ($2D$ and $D+1$ respectively), and the cost of computing the inverse is much smaller than for the full covariance matrix. Interestingly, experience in the context of classification in machine learning suggests that using a diagonal covariance matrix, or ignoring the off-diagonal entries, can lead to better classification results than those based on full covariance matrix estimation (Pazzani, 1997).
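The parameter counts and the singularity problem described above can be illustrated with a small NumPy sketch. The helper `n_params`, the random seed, and the toy sizes (10 dimensions, 5 samples) are assumptions chosen only for this illustration. With fewer samples than dimensions, the sample covariance is rank deficient, while its diagonal restriction remains trivially invertible.

```python
import numpy as np


def n_params(d, kind="full"):
    """Number of free parameters of a d-dimensional Gaussian (illustrative helper)."""
    if kind == "full":
        return d * (d + 3) // 2   # d means + d(d+1)/2 covariance entries
    if kind == "diagonal":
        return 2 * d              # d means + d variances
    if kind == "isotropic":
        return d + 1              # d means + one shared variance
    raise ValueError(f"unknown covariance type: {kind}")


rng = np.random.default_rng(0)
d, n = 10, 5                      # fewer samples than dimensions
data = rng.standard_normal((n, d))

# Sample covariance (rows are observations); its rank is at most n - 1.
sample_cov = np.cov(data, rowvar=False)
rank = np.linalg.matrix_rank(sample_cov)

# Diagonal restriction: keep only the per-dimension variances.
diag_cov = np.diag(np.diag(sample_cov))
diag_inv = np.diag(1.0 / np.diag(sample_cov))   # inverse costs O(d)
```

Because `rank` is smaller than `d`, the full sample covariance cannot be inverted, whereas `diag_inv` is obtained by elementwise reciprocals of the variances.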
The Gaussian is widely used as a data density model for several reasons. First, it is analytically tractable, i.e. a large number of results involving this distribution can be derived in explicit form. Secondly, the normal distribution arises as the outcome of the central limit theorem, which states that, under mild conditions, the sum of a large number of independent random variables is distributed approximately normally. Finally, the bell shape of the normal distribution makes it a convenient choice for modeling a large variety of random variables encountered in practice.
4.3.1 Karklin and Lewicki’s Model of Scene Analysis
Neurons in the early visual pathway act as linear feature detectors of natural scenes; however, how image features from similar objects are combined to give an invariant, abstract representation in the brain is poorly understood. Image regions that are perceptually distinct produce response patterns that are highly overlapping and cannot be distinguished using individual features or low-level linear transformations alone. Identifying the computations required to achieve this generalization across visual stimuli is an important research problem that has not yet been completely resolved. Karklin and Lewicki (Karklin & Lewicki, 2009) address this issue by proposing a computational model of visual feature generalization that takes into account the pattern variability of visual scenes and learns a compact set of features for the image distributions typically encountered in natural scenes. The proposed model defines the neural probability distributions as a hierarchical statistical model in which the input image is represented at different levels of abstraction: first by a set of linear features $b_k$ and then by neural activities $y_j$. This model is a generalization of the standard model of complex cell properties, where each complex cell takes as input the squared output of two simple cells. In the proposed model, a neuron integrates the squared responses of a large number of image features $b_k$ and learns them by correlating the pattern against its weights $w_{jk}$.
Given the activity $y$ of the model neurons, the input image $x$ is described by a multivariate Gaussian distribution:
\[
P(x \mid y) = \frac{1}{(2\pi)^{N/2} |C|^{1/2}} \exp\left( -\frac{1}{2} x^{T} C^{-1} x \right), \tag{4.5}
\]
with mean $\mu = 0$ and covariance matrix $C$ defining the range and pattern of variability of the features $b_k$. Here $N$ is the dimensionality of the data, and the covariance matrix $C$ is a function of the neural activity $y$. This functional representation has the advantage that the model can in principle describe arbitrary correlation patterns in the features while remaining mathematically tractable:
\[
C = f(y) = \exp\left( \sum_{jk} y_j w_{jk} b_k b_k^{T} \right). \tag{4.6}
\]
Inside the matrix exponential, the covariance is built from the outer products $b_k b_k^{T}$ of localized, oriented, edge-like feature vectors $b_k$, scaled by the neural activities $y_j$ and by weights $w_{jk}$ that modify the encoded distribution of features $b_k$ as follows:
\[
w_{jk}
\begin{cases}
> 0, & \text{if the neuron responds to a wider range of stimuli;} \\
< 0, & \text{if the neuron responds to a smaller range of stimuli;} \\
= 0, & \text{if the neuron remains neutral.}
\end{cases}
\]
Figure 4.3: Distributed coding model proposed by Karklin and Lewicki, which infers for each image the most likely distribution (ellipses) encoding it. The top row shows the activation patterns of the model neurons $y_j$. Absence of activity corresponds to a lack of image structure, which is therefore represented by a canonical distribution that reflects the statistics over all natural images (black circle). Increased neural activity represents deviations from this canonical distribution and captures statistical patterns in local image regions (middle and right panels) (Karklin & Lewicki, 2009).
This model allows us to determine the most excitatory and inhibitory features for each model neuron. We compute the covariance matrix given in Eq. 4.6 by turning on only one neuron ($y_j = 1$) and leaving the rest at 0. This fully specifies the distribution of images encoded by neuron $j$ and accounts for all the contributions of the individual features $b_k$. When the neural activity is off ($y = 0$), the covariance matrix equals the identity matrix $I$, corresponding to the canonical distribution of whitened images. Non-zero values of the neural activity $y$ warp the encoded distribution by stretching or contracting it along the linear features $b_k$. The model parameters $b_k$ and $w_{jk}$ are initialized with small random values and optimized by maximizing the likelihood of the data under the model, $P(x \mid b_k, w_{jk})$, using standard gradient ascent. By adapting the model parameters $\theta = \{b_k, w_{jk}\}$ to the data, one can find an efficient way to use a limited number of neurons to describe the wide range of distributions observed in natural images. See Figure 4.3 for an illustration of the proposed encoding model.
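The construction of the covariance in Eq. 4.6 can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the function names `sym_expm` and `covariance`, the toy sizes, and the random parameter values are all assumptions. Since the exponent is a symmetric matrix, its matrix exponential is computed here from an eigendecomposition.

```python
import numpy as np


def sym_expm(a):
    """Matrix exponential of a symmetric matrix via its eigendecomposition."""
    evals, evecs = np.linalg.eigh(a)
    return (evecs * np.exp(evals)) @ evecs.T


def covariance(y, w, b):
    """C = exp(sum_jk y_j w_jk b_k b_k^T), following Eq. 4.6.

    y: (J,) neural activities, w: (J, K) weights,
    b: (K, N) feature vectors stored as rows.
    """
    coeff = y @ w                  # (K,) net scaling applied to each feature
    log_c = (b.T * coeff) @ b      # sum_k coeff_k * b_k b_k^T, shape (N, N)
    return sym_expm(log_c)


# Toy setup (all sizes and values are arbitrary assumptions).
rng = np.random.default_rng(0)
n_dim, n_feat, n_neur = 4, 6, 3
b = rng.standard_normal((n_feat, n_dim))
b /= np.linalg.norm(b, axis=1, keepdims=True)    # unit-norm features
w = 0.5 * rng.standard_normal((n_neur, n_feat))

# All neurons off: the covariance reduces to the identity matrix.
c_off = covariance(np.zeros(n_neur), w, b)

# Only neuron 0 on: the distribution encoded by that neuron.
y_on = np.zeros(n_neur)
y_on[0] = 1.0
c_on = covariance(y_on, w, b)
```

Inspecting the eigenvectors of `c_on` with the largest and smallest eigenvalues would show the directions along which that neuron stretches or contracts the encoded distribution.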
To compute the response of the model neurons $y$, the most probable neural representation given the input image $x$ is found by maximizing the posterior probability $P(y \mid x, \{b_k, w_{jk}\})$, i.e. by computing the maximum a posteriori (MAP) estimate:
\[
\hat{y} = \arg\max_{y} P(y \mid x, \{b_k, w_{jk}\}) = \arg\max_{y} P(x \mid y, \{b_k, w_{jk}\}) \, P(y).
\]
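The model places a sparse prior on the neural activity $y$. To write the model likelihood function of interest, which in its general form is
\[
\log P(x \mid y) = -\frac{ND}{2} \ln(2\pi) - \frac{N}{2} \ln |C| - \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^{T} C^{-1} (x_n - \mu),
\]
where, in this expression, $N$ counts the data points $x_n$ and $D$ is their dimensionality, the following form of the covariance matrix and the following matrix relations have been used: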
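\[
\begin{aligned}
C &= \exp\left( \sum_{jk} y_j w_{jk} b_k b_k^{T} \right), \\
\log |C| &= \mathrm{trace}(\log C) = \sum_{jk} \mathrm{trace}\left( y_j w_{jk} b_k b_k^{T} \right) = \sum_{jk} y_j w_{jk} |b_k|^{2} = \sum_{jk} y_j w_{jk}.
\end{aligned}
\]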
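The log-determinant identity above can be checked numerically. The sketch below uses arbitrary toy values (sizes, seed and parameter scales are assumptions) and unit-norm feature vectors, and compares the log-determinant of the explicitly constructed $C$ with the closed form $\sum_{jk} y_j w_{jk}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_dim, n_feat, n_neur = 5, 8, 3

# Feature vectors as rows, normalized so that |b_k| = 1.
b = rng.standard_normal((n_feat, n_dim))
b /= np.linalg.norm(b, axis=1, keepdims=True)
w = 0.1 * rng.standard_normal((n_neur, n_feat))
y = rng.standard_normal(n_neur)

# Exponent of Eq. 4.6: sum_jk y_j w_jk b_k b_k^T.
coeff = y @ w
log_c = (b.T * coeff) @ b

# C = exp(log_c), computed through the eigendecomposition of the
# symmetric matrix log_c.
evals, evecs = np.linalg.eigh(log_c)
c = (evecs * np.exp(evals)) @ evecs.T

# Left-hand side: log|C| from a numerically stable routine.
sign, logdet = np.linalg.slogdet(c)

# Right-hand side: sum_jk y_j w_jk (valid only for unit-norm b_k).
closed_form = coeff.sum()
```

The two quantities `logdet` and `closed_form` should agree to numerical precision; without the normalization of the rows of `b`, the closed form would need the factors $|b_k|^2$.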
The norm of the vectors $b_k$ is fixed to 1, as the weights can absorb any scaling. Thus $\log P(x \mid y)$ becomes
\[
\log P(x \mid y) \propto -\frac{1}{2} \sum_{jk} y_j w_{jk} - \frac{1}{2} x^{T} \exp\left( -\sum_{jk} y_j w_{jk} b_k b_k^{T} \right) x. \tag{4.7}
\]
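To make the MAP inference concrete, the following sketch maximizes the log-posterior built from Eq. 4.7 for a single image. Several choices here are assumptions made for illustration and are not taken from the text: the sparse prior is taken to be Laplacian, the gradient is approximated by central finite differences, and the ascent uses a simple step-halving rule. The function names and toy data are likewise hypothetical.

```python
import numpy as np


def log_posterior(y, x, w, b, prior_scale=1.0):
    """Eq. 4.7 plus an assumed Laplacian log-prior, up to an additive constant."""
    coeff = y @ w
    log_c = (b.T * coeff) @ b                    # exponent of Eq. 4.6
    evals, evecs = np.linalg.eigh(log_c)
    c_inv = (evecs * np.exp(-evals)) @ evecs.T   # C^{-1} = exp(-log_c)
    # The log-determinant term assumes unit-norm feature vectors b_k.
    log_lik = -0.5 * coeff.sum() - 0.5 * (x @ c_inv @ x)
    log_prior = -np.sum(np.abs(y)) / prior_scale  # assumed sparse prior
    return log_lik + log_prior


def numerical_grad(f, y, eps=1e-5):
    """Central finite-difference gradient of f at y."""
    grad = np.zeros_like(y)
    for i in range(y.size):
        step = np.zeros_like(y)
        step[i] = eps
        grad[i] = (f(y + step) - f(y - step)) / (2.0 * eps)
    return grad


def map_estimate(x, w, b, n_iter=200, step=0.1):
    """Gradient ascent that accepts a step only if the objective improves."""
    f = lambda y: log_posterior(y, x, w, b)
    y = np.zeros(w.shape[0])
    best = f(y)
    for _ in range(n_iter):
        candidate = y + step * numerical_grad(f, y)
        value = f(candidate)
        if value > best:
            y, best = candidate, value
        else:
            step *= 0.5          # shrink the step after a failed move
    return y


# Toy example (all values are arbitrary assumptions).
rng = np.random.default_rng(2)
n_dim, n_feat, n_neur = 4, 6, 3
b = rng.standard_normal((n_feat, n_dim))
b /= np.linalg.norm(b, axis=1, keepdims=True)
w = rng.standard_normal((n_neur, n_feat))
x = 3.0 * b[0]                   # an input with high variance along b_0
y_hat = map_estimate(x, w, b)
```

Because a step is accepted only when it increases the objective, `y_hat` never scores worse than the all-zero starting point; an analytic gradient would be preferable for realistic problem sizes.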
The proposed model was trained on a large set of 20 × 20 image patches, sampled randomly from grayscale photographs of outdoor scenes (Hateren & Schaaf, 1998). The number of neurons was set to 150 and the number of linear features $b_k$ was set to 1000. After training, each of the model neurons was found to be tuned to different image structure properties, such as phase invariance, orientation, location and complex suppressive effects. To compare the behavior of the model neurons with that of cells in the visual cortex, the authors tested their responses to stimuli used in classical physiological experiments and found that the model learns a much more general set of features, determined by the statistical structure of images.