
Deep neural networks are multi-layer neural network models that connect several processing layers sequentially in order to learn complex and abstract features from given datasets. Classical deep neural networks are deterministic rather than probabilistic models: they are normally designed to predict the best estimate of an uncertain output and therefore return a single deterministic number or vector. This thesis, however, seeks a probabilistic model of deep neural networks; in other words, the proposed deep neural networks should be able to learn and generate probability distributions. This section discusses the probabilistic deep learning model and its theoretical basis: Bayesian Modelling and Variational Inference (VI). Bayesian Modelling is highly regarded because its model capacity allows it to cope with most complex real-world problems. However, implementing Bayesian Modelling suffers from high computational costs caused by extensive sigma functions and cross-corpora calculations. In addition, such models perform poorly with very small data sets. Therefore, in this section, deep learning models are employed as the model basis for Bayesian Modelling. Dropout regularization units are applied specifically to introduce randomness, which enables the deterministic deep learning model to be used for probabilistic Bayesian Modelling. The efficacy and effectiveness of this approach are discussed and demonstrated in the following sections.

4.4.1.1 Probabilistic Modelling based on Bayesian Modelling

Classical deep learning models map model inputs to the corresponding outputs in the form of deterministic numerical vectors. Denoting the model input and output variables by $x$ and $y$, a classical deep neural network can be regarded as a composite function $f(\cdot)$. Specifically, $f$ is the composition of the matrix multiplications and activation functions of each processing layer. The model parameters are denoted $\theta = \{W, b, \dots\}$, comprising the weight matrices, the biases and any other parameters of the particular network architecture. The model output for an input $x$, conditional on the parameters $\theta$, is therefore:

$y = f(x; \theta)$, where both $x$ and $y$ are deterministic variables, which is why classical deep learning models are deterministic models.

In the probabilistic model, both the model inputs and the model outputs are instead treated as random variables. For notational consistency, a tilde is added to a symbol to replace the deterministic variable with the corresponding random variable; for example, $\tilde{y}$ denotes the uncertain model output. From this viewpoint, the deterministic model predicts an estimate of the expectation of the probability distribution of the uncertain output $\tilde{y}$, i.e.:

$$\mathbb{E}_{\theta}[\tilde{y}] = \mathbb{E}_{\theta}[f(x; \theta)] = \mathbb{E}_{\theta}[p(\tilde{y} \mid x, \theta)] \tag{4-1}$$

Denoting the probabilistic counterpart of the deep neural network by $f'$, the probabilistic output of the network can then be formulated as:

$$\tilde{y} = f'(x; \tilde{\theta}) \sim p(\tilde{y} \mid x; \tilde{\theta}) \tag{4-2}$$

where the conditional distribution $p(\tilde{y} \mid x; \tilde{\theta})$ is the probabilistic model of the deep neural network. The comparison between the classical deterministic model and the probabilistic model of deep neural networks is shown in Figure 4-1 below:

Figure 4-1 Illustration of two types of deep neural networks: a) deterministic model; b) probabilistic model.
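The contrast between the two model types can be illustrated with a minimal NumPy sketch. It is not the network used in this thesis: the two-layer architecture, the layer sizes, the random weights and the keep probability of 0.8 are all assumptions chosen for illustration, and the function name `forward` is hypothetical. Dropout on the hidden units is used here as the source of randomness, which is equivalent to randomly zeroing rows of the output weight matrix.

```python
import numpy as np

def forward(x, theta, rng=None, keep_prob=0.8):
    """Two-layer network: deterministic if rng is None, stochastic otherwise."""
    W1, b1, W2, b2 = theta
    h = np.maximum(x @ W1 + b1, 0.0)            # hidden layer with ReLU activation
    if rng is not None:
        mask = rng.random(h.shape) < keep_prob  # Bernoulli dropout mask
        h = h * mask / keep_prob                # inverted-dropout rescaling
    return h @ W2 + b2                          # linear output layer

# Illustrative (untrained) parameters; sizes are arbitrary assumptions.
init = np.random.default_rng(0)
theta = (init.normal(size=(3, 16)), np.zeros(16),
         init.normal(size=(16, 1)), np.zeros(1))
x = init.normal(size=(1, 3))                    # one input with three features

# Deterministic model: y = f(x; theta) is identical on every call.
y_det = forward(x, theta).item()
y_det_again = forward(x, theta).item()

# Probabilistic model: every call draws a new mask, so each output is a
# sample of the random output for the same input x.
rng = np.random.default_rng(1)
y_samples = np.array([forward(x, theta, rng).item() for _ in range(2000)])

print("deterministic output:", y_det)
print("sample mean and std :", y_samples.mean(), y_samples.std())
```

Because the output layer is linear and the mask is rescaled by `keep_prob`, the sample mean should lie close to the deterministic output, while the sample spread gives an empirical picture of the output distribution.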

Given historical datasets $X$, $Y$ as training data, we can obtain the model parameters that are most likely to generate the labels $Y$ from the probabilistic deep neural network given the inputs $X$. Mathematically, we look for the posterior distribution of the model parameters, $p(\tilde{\theta} \mid X, Y)$, given the dataset $X$, $Y$. By invoking Bayes' theorem, the posterior distribution of interest can be formulated as:

$$\tilde{y} = f'(x; \tilde{\theta}) \sim p(\tilde{y} \mid x; \tilde{\theta}) \tag{4-3}$$

$$p(\tilde{\theta} \mid X, Y) = \frac{p(Y \mid X; \tilde{\theta})\, p(\tilde{\theta})}{p(Y \mid X)} \tag{4-4}$$

With an appropriate assumption on the prior distribution of the parameters, $p(\tilde{\theta})$, the posterior distribution of the model parameters can be estimated more easily with integrals over the training datasets. This process is widely known as Bayesian Modelling [89, 90, 91].
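As a worked illustration of Bayes' theorem applied to model parameters, the sketch below evaluates a posterior on a grid for a deliberately simple stand-in model. Everything specific in it is an assumption made for illustration and not taken from this thesis: a one-parameter linear model $y = \theta x + \varepsilon$ with known noise standard deviation 0.5, a standard normal prior on $\theta$, and synthetic data. For this conjugate case the posterior mean is known in closed form, which allows the grid result to be checked.

```python
import numpy as np

# Synthetic training data from an assumed one-parameter linear model.
rng = np.random.default_rng(0)
sigma = 0.5                                   # known noise standard deviation
X = rng.uniform(-1.0, 1.0, size=20)
Y = 1.5 * X + sigma * rng.normal(size=20)

theta_grid = np.linspace(-3.0, 5.0, 4001)     # candidate parameter values
dtheta = theta_grid[1] - theta_grid[0]

# Log-likelihood log p(Y | X; theta) and log-prior log p(theta), both up to
# additive constants that cancel when the posterior is normalized.
resid = Y[None, :] - theta_grid[:, None] * X[None, :]
log_lik = -0.5 * np.sum(resid**2, axis=1) / sigma**2
log_prior = -0.5 * theta_grid**2

# Bayes' theorem: posterior = likelihood * prior / evidence.
log_joint = log_lik + log_prior
unnorm = np.exp(log_joint - log_joint.max())  # subtract max for stability
evidence = unnorm.sum() * dtheta              # Riemann sum over the grid
posterior = unnorm / evidence

grid_mean = np.sum(theta_grid * posterior) * dtheta

# Closed-form posterior mean for the conjugate Gaussian case.
precision = 1.0 + np.sum(X**2) / sigma**2
exact_mean = np.sum(X * Y) / sigma**2 / precision

print("posterior mean (grid) :", grid_mean)
print("posterior mean (exact):", exact_mean)
```

A grid is only feasible for a handful of parameters; for the parameter vector of a deep network the normalizing integral cannot be evaluated this way.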

For test realizations of the inputs $x^*$, the prediction of the target distribution of $y^*$ can be inferred accordingly as:

$$p(y^* \mid x^*; X, Y) = \int p(y^* \mid x^*; \tilde{\theta})\, p(\tilde{\theta} \mid X, Y)\, d\tilde{\theta} \tag{4-5}$$

Notably, $p(y^* \mid x^*; \tilde{\theta})$ is the output distribution of the probabilistic model of the deep neural network, which can be sampled and simulated simply by feeding realizations of $x^*$ into the well-tuned deep neural network. Therefore, the target posterior distribution $p(y^* \mid x^*; X, Y)$ can be readily evaluated with sampling approaches such as Gibbs Sampling [92].
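The predictive integral in equation (4-5) can be approximated by Monte Carlo: draw parameters from the posterior, then draw an output for each parameter sample. The sketch below does this for an assumed one-parameter linear model with a Gaussian prior and known noise level, chosen only because its posterior and predictive distribution are available in closed form for comparison. Direct sampling from the Gaussian posterior is used here in place of Gibbs Sampling, since the posterior of this toy model can be sampled exactly; the model, the data and the test input `x_star` are illustrative assumptions.

```python
import numpy as np

# Synthetic training data from an assumed one-parameter linear model.
rng = np.random.default_rng(0)
sigma = 0.5                                   # known noise standard deviation
X = rng.uniform(-1.0, 1.0, size=20)
Y = 1.5 * X + sigma * rng.normal(size=20)

# Closed-form Gaussian posterior p(theta | X, Y) under a N(0, 1) prior.
post_var = 1.0 / (1.0 + np.sum(X**2) / sigma**2)
post_mean = post_var * np.sum(X * Y) / sigma**2

# Monte Carlo version of the predictive integral:
#   theta_k ~ p(theta | X, Y),   y_k ~ p(y* | x*; theta_k)
x_star = 0.8
n_samples = 200_000
theta_samples = rng.normal(post_mean, np.sqrt(post_var), size=n_samples)
y_samples = rng.normal(theta_samples * x_star, sigma)

mc_mean, mc_var = y_samples.mean(), y_samples.var()

# Exact predictive moments for this conjugate model, for comparison.
exact_mean = post_mean * x_star
exact_var = post_var * x_star**2 + sigma**2

print("predictive mean (MC, exact):", mc_mean, exact_mean)
print("predictive var  (MC, exact):", mc_var, exact_var)
```

The predictive variance splits into a parameter-uncertainty term (`post_var * x_star**2`) and a noise term (`sigma**2`), which is the decomposition that the integral over the posterior makes available.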

4.4.1.2 Variational Inference to Train Probabilistic Models

As discussed in the previous section, the key to feed-forward prediction with the probabilistic model is obtaining the posterior distribution $p(\tilde{\theta} \mid X, Y)$ from the given data samples. Normally, this distribution cannot be modelled directly by deep networks because of their fixed structure. To approximate the target distribution with the probability model of the deep network, $q(\tilde{\theta})$, variational inference is introduced in the training process to minimize the difference between $p(\tilde{\theta} \mid X, Y)$ and $q(\tilde{\theta})$.

The similarity between the target distribution $p(\tilde{\theta} \mid X, Y)$ and the distribution of the deep neural network $q(\tilde{\theta})$ can be measured by the KL divergence:

$$\mathrm{KL}\big(q(\tilde{\theta}) \,\|\, p(\tilde{\theta} \mid X, Y)\big) = \int q(\tilde{\theta}) \log \frac{q(\tilde{\theta})}{p(\tilde{\theta} \mid X, Y)}\, d\tilde{\theta} \tag{4-6}$$

The KL divergence approaches its minimum when the distribution of the deep learning model $q(\tilde{\theta})$ is close to the target posterior distribution $p(\tilde{\theta} \mid X, Y)$; the minimizing distribution is denoted $q^*(\tilde{\theta})$. Therefore, we can replace the posterior distribution with the neural network model in equations (4-13) to (4-19):

$$p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, \tilde{\theta})\, q^*(\tilde{\theta})\, d\tilde{\theta} \tag{4-7}$$

This means that the probability model of the output $y^*$ for given inputs $x^*$ can be quantified by sampling through the deep learning model with the variational parameter $\tilde{\theta}$ that follows the distribution $q^*(\tilde{\theta})$.
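To make the KL divergence concrete, the following sketch evaluates the integral $\int q \log (q/p)\, d\tilde{\theta}$ numerically for two one-dimensional Gaussian densities and compares it with the known closed-form expression for Gaussians. The two densities stand in for $q(\tilde{\theta})$ and the target distribution; their means and standard deviations, the integration grid and the helper names are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_pdf(t, mean, std):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((t - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def kl_numeric(mq, sq, mp, sp, grid):
    """KL(q || p) as a Riemann sum of q * log(q / p) over the grid."""
    q = gaussian_pdf(grid, mq, sq)
    p = gaussian_pdf(grid, mp, sp)
    step = grid[1] - grid[0]
    return np.sum(q * np.log(q / p)) * step

def kl_closed_form(mq, sq, mp, sp):
    """Closed-form KL divergence between two univariate Gaussians."""
    return np.log(sp / sq) + (sq**2 + (mq - mp) ** 2) / (2.0 * sp**2) - 0.5

grid = np.linspace(-10.0, 10.0, 20001)

# q = N(1, 0.5^2) as the approximation, p = N(0, 1) as the target.
kl_qp_numeric = kl_numeric(1.0, 0.5, 0.0, 1.0, grid)
kl_qp_exact = kl_closed_form(1.0, 0.5, 0.0, 1.0)
kl_pq_exact = kl_closed_form(0.0, 1.0, 1.0, 0.5)   # arguments swapped
kl_same = kl_closed_form(1.0, 0.5, 1.0, 0.5)       # identical distributions

print("KL(q||p) numeric    :", kl_qp_numeric)
print("KL(q||p) closed form:", kl_qp_exact)
print("KL(p||q) closed form:", kl_pq_exact)
print("KL(q||q)            :", kl_same)
```

The divergence is zero only when the two distributions coincide, and it is not symmetric in its arguments, so the order $q \,\|\, p$ used in variational inference matters.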

Notably, the KL divergence is intractable in many cases because the posterior distribution appears inside the integral. Hence, minimizing the KL divergence is replaced by an equivalent problem, namely maximizing the evidence lower bound (ELBO) [93]:

$$\mathcal{L}(\tilde{\theta}) := \int q(\tilde{\theta}) \log p(Y \mid X, \tilde{\theta})\, d\tilde{\theta} - \mathrm{KL}\big(q(\tilde{\theta}) \,\|\, p(\tilde{\theta})\big) \tag{4-8}$$
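The link between the ELBO and the KL divergence to the posterior can be made explicit with a short derivation. It is a standard identity rather than a result specific to this thesis; it uses only Bayes' theorem (4-4) and the fact that $q(\tilde{\theta})$ integrates to one.

```latex
\begin{align*}
\log p(Y \mid X)
  &= \int q(\tilde{\theta}) \log p(Y \mid X)\, d\tilde{\theta} \\
  &= \int q(\tilde{\theta}) \log
     \frac{p(Y \mid X; \tilde{\theta})\, p(\tilde{\theta})}
          {p(\tilde{\theta} \mid X, Y)}\, d\tilde{\theta} \\
  &= \int q(\tilde{\theta}) \log p(Y \mid X; \tilde{\theta})\, d\tilde{\theta}
     - \int q(\tilde{\theta}) \log \frac{q(\tilde{\theta})}{p(\tilde{\theta})}\, d\tilde{\theta}
     + \int q(\tilde{\theta}) \log \frac{q(\tilde{\theta})}{p(\tilde{\theta} \mid X, Y)}\, d\tilde{\theta} \\
  &= \mathcal{L}(\tilde{\theta})
     + \mathrm{KL}\big(q(\tilde{\theta}) \,\|\, p(\tilde{\theta} \mid X, Y)\big)
\end{align*}
```

Since $\log p(Y \mid X)$ does not depend on $q$ and the KL divergence is non-negative, $\mathcal{L}$ is a lower bound on the log evidence, and any increase in $\mathcal{L}$ is matched by an equal decrease in the KL divergence to the posterior.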

Maximizing the ELBO accordingly minimizes the KL divergence between $q(\tilde{\theta})$ and the posterior $p(\tilde{\theta} \mid X, Y)$, and the resulting optimum is $q^*(\tilde{\theta})$. Hence, the feed-forward prediction of the proposed model can be further simplified into the compact form:

$$p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, \tilde{\theta})\, q^*(\tilde{\theta})\, d\tilde{\theta} \approx q^*(y^* \mid x^*, \tilde{\theta}) \tag{4-9}$$