Deep neural networks are multi-layer models that connect multiple processing layers sequentially in order to learn complex and abstract features from given datasets. Classical deep neural networks are deterministic rather than probabilistic models: they are normally designed to predict the best estimate of an uncertain output and therefore generate a deterministic number or vector. This thesis, however, seeks a probabilistic model of deep neural networks; in other words, the proposed deep neural networks should be able to learn and generate probability distributions. This section discusses the probabilistic deep learning model and its theoretical basis: Bayesian Modelling and Variational Inference (VI).
Bayesian Modelling is highly regarded because its model capacity allows it to cope with most complex real-world problems. However, its implementation suffers heavily from expensive computational costs caused by extensive sigma functions and cross-corpora calculations. In addition, it performs poorly with very small data sets. Therefore, in this section, Deep Learning models are employed as the model basis for Bayesian Modelling. Dropout Regularization units are specifically applied to introduce randomness, which enables the deterministic deep learning model to support probabilistic Bayesian Modelling. The effectiveness of this approach is discussed and demonstrated in the following sections.
4.4.1.1 Probabilistic Modelling based on Bayesian Modelling
Classical deep learning models map model inputs to the corresponding outputs in the form of deterministic numerical vectors. Denoting the model input and output variables as $x$ and $y$, a classical deep neural network can be regarded as a composite function $f(\cdot)$. Specifically, $f$ is the composition of the matrix multiplications and activation functions of each processing layer. The model parameters are denoted as $\omega = \{W, b, \dots\}$, which consist of weight matrices, bias matrices and other parameters of certain network architectures. Therefore, the model outputs can be obtained from the input variable $x$ conditioned on the parameter vector $\omega$:
$y = f(x; \omega)$, where both $x$ and $y$ are deterministic variables, which indicates that classical deep learning models are deterministic models.
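As a minimal sketch of the deterministic mapping $y = f(x; \omega)$, the following Python snippet composes two affine layers with a ReLU activation. The layer sizes, the weight values and the helper names `relu`, `affine` and `f` are illustrative assumptions and are not taken from this thesis.

```python
def relu(v):
    """Element-wise ReLU activation."""
    return [max(0.0, a) for a in v]


def affine(W, b, v):
    """Affine map W v + b, with W given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def f(x, omega):
    """Deterministic forward pass y = f(x; omega) of a two-layer network."""
    W1, b1, W2, b2 = omega
    hidden = relu(affine(W1, b1, x))
    return affine(W2, b2, hidden)


# Hypothetical parameters omega = {W1, b1, W2, b2}, chosen only for illustration.
omega = (
    [[1.0, -1.0], [0.5, 0.5]],  # W1
    [0.0, 0.0],                 # b1
    [[1.0, 2.0]],               # W2
    [0.1],                      # b2
)

x = [2.0, 1.0]
y_first = f(x, omega)
y_second = f(x, omega)  # same input and parameters give the same output
```

Because neither `x` nor `omega` is random, repeated calls return the same vector, which is the deterministic behaviour described above.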
For the probabilistic model, both model inputs and outputs are further considered as random variables. To keep the notation consistent, a tilde accent is added to a symbol to replace the deterministic variable with the corresponding random variable, e.g., $\tilde{y}$ denotes the uncertain model output. In this case, the deterministic model predicts an estimate of the expectation of the probability distribution of the uncertain output $\tilde{y}$, i.e.:
$$\mathbb{E}_{p}[\tilde{y}] = \mathbb{E}_{p}[f(x; \omega)] = \mathbb{E}_{p}[p(\tilde{y} \mid x, \omega)] \tag{4-1}$$
Assuming the probabilistic model of the corresponding deep neural network is $f'$, the probabilistic outputs of the deep neural network can then be formulated as:
$$\tilde{y} = f'(x; \tilde{\omega}) \sim p(\tilde{y} \mid x; \tilde{\omega}) \tag{4-2}$$
where the conditional distribution $p(\tilde{y} \mid x; \tilde{\omega})$ is the probabilistic model of the deep neural network. The comparison between the classical deterministic model and the probabilistic model of deep neural networks is shown in Figure 4-1 below:
Figure 4-1 Illustration of two types of deep neural networks: a) deterministic model; b) probabilistic model.
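The difference between the two model types can be sketched in code. The snippet below assumes, in line with the dropout units mentioned in the introduction of this section, that randomness enters through a Bernoulli mask on the hidden units, so that each forward pass draws one realization of $\tilde{y} = f'(x; \tilde{\omega})$. The network size, the weights, the keep probability, the inverted-dropout scaling and the name `f_prime` are illustrative assumptions.

```python
import random


def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def f_prime(x, W1, W2, keep_prob, rng):
    """One stochastic forward pass, i.e. one sample of the random output."""
    hidden = relu(matvec(W1, x))
    # Bernoulli mask: each hidden unit is kept with probability keep_prob.
    mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in hidden]
    # Inverted-dropout scaling keeps the expected activation unchanged.
    dropped = [h * m / keep_prob for h, m in zip(hidden, mask)]
    return matvec(W2, dropped)[0]


# Hypothetical weights, chosen only for illustration.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
W2 = [[0.5, 0.5, 0.5, 0.5]]
x = [1.0, 2.0]

rng = random.Random(0)  # fixed seed for reproducibility
samples = [f_prime(x, W1, W2, keep_prob=0.5, rng=rng) for _ in range(5000)]

# Deterministic counterpart: the same network without the random mask.
deterministic = matvec(W2, relu(matvec(W1, x)))[0]
sample_mean = sum(samples) / len(samples)
```

In this sketch the deterministic output is a single number, whereas the stochastic passes produce a spread of values whose average lies close to the deterministic output.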
Given historical datasets $X, Y$ as training data, we can obtain the model parameters that are most likely to generate the labels $Y$ from the probabilistic deep neural network given the inputs $X$. Mathematically, we look for the posterior distribution of the model parameters $p(\tilde{\omega} \mid X, Y)$ given the dataset $X, Y$. By invoking Bayes' theorem, the posterior distribution of interest can be formulated as:
$$\tilde{y} = f'(x; \tilde{\omega}) \sim p(\tilde{y} \mid x; \tilde{\omega}) \tag{4-3}$$
$$p(\tilde{\omega} \mid X, Y) = \frac{p(Y \mid X; \tilde{\omega})\, p(\tilde{\omega})}{p(Y \mid X)} \tag{4-4}$$
With an appropriate assumption on the prior distribution of the parameters $p(\tilde{\omega})$, the posterior distribution of the model parameters can be more easily estimated with integrals over the training datasets. This process is widely known as Bayesian Modelling [89, 90, 91].
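To make the use of Bayes' theorem concrete, the following sketch evaluates the posterior on a grid for a toy model with a single scalar weight $w$, where $y = w x$ plus Gaussian noise. The toy data, the Gaussian prior, the noise level and the grid are assumptions made only for illustration; a deep neural network has far too many parameters for such a grid evaluation.

```python
import math


def gaussian_logpdf(v, mean, std):
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (v - mean) ** 2 / (2.0 * std ** 2)


# Hypothetical training data and model assumptions.
X = [1.0, 2.0, 3.0]
Y = [2.1, 3.9, 6.2]
NOISE_STD = 0.5   # assumed observation noise
PRIOR_STD = 1.0   # assumed prior w ~ N(0, 1)

# Candidate values of the scalar weight w.
grid = [i * 0.01 for i in range(-500, 501)]


def log_joint(w):
    """log p(Y | X; w) + log p(w), the numerator of Bayes' theorem in log form."""
    log_lik = sum(gaussian_logpdf(y, w * x, NOISE_STD) for x, y in zip(X, Y))
    return log_lik + gaussian_logpdf(w, 0.0, PRIOR_STD)


log_joints = [log_joint(w) for w in grid]
shift = max(log_joints)                      # subtract the maximum for stability
unnormalised = [math.exp(v - shift) for v in log_joints]
normaliser = sum(unnormalised)               # plays the role of the evidence p(Y | X)
posterior = [u / normaliser for u in unnormalised]

w_map = grid[posterior.index(max(posterior))]
posterior_mean = sum(w * p for w, p in zip(grid, posterior))
```

The normalising sum corresponds to the evidence term in the denominator of Bayes' theorem, which is the part that becomes expensive when the parameter space is high-dimensional.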
For a test realization of the inputs $x^{*}$, the prediction of the target distribution of $y^{*}$ can accordingly be inferred as:
$$p(y^{*} \mid x^{*}; X, Y) = \int p(y^{*} \mid x^{*}; \tilde{\omega})\, p(\tilde{\omega} \mid X, Y)\, d\tilde{\omega} \tag{4-5}$$
Notably, $p(y^{*} \mid x^{*}; \tilde{\omega})$ is the output distribution of the probabilistic model of the deep neural network, which can be sampled and simulated simply by feeding realizations of $x^{*}$ into the well-tuned deep neural network. Therefore, the targeted posterior distribution $p(y^{*} \mid x^{*}; X, Y)$ can be readily obtained by sampling approaches such as Gibbs Sampling [92].
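The predictive integral can be approximated by drawing parameters from the posterior and then drawing an output for each parameter sample. The sketch below uses plain Monte Carlo sampling rather than Gibbs Sampling, and it assumes a toy scalar model with a Gaussian posterior over the weight so that the estimate can be compared with the exact predictive moments. All numerical values and the name `sample_prediction` are hypothetical.

```python
import random

# Hypothetical posterior p(w | X, Y) = N(POST_MEAN, POST_STD^2) and noise level.
POST_MEAN = 2.0
POST_STD = 0.13
NOISE_STD = 0.5


def sample_prediction(x_star, rng):
    """Draw one y* by first sampling the weight, then sampling the output."""
    w = rng.gauss(POST_MEAN, POST_STD)        # w ~ p(w | X, Y)
    return rng.gauss(w * x_star, NOISE_STD)   # y* ~ p(y* | x*; w)


rng = random.Random(0)  # fixed seed for reproducibility
x_star = 4.0
draws = [sample_prediction(x_star, rng) for _ in range(20000)]

mc_mean = sum(draws) / len(draws)
mc_var = sum((d - mc_mean) ** 2 for d in draws) / len(draws)

# Exact moments of the predictive distribution for this linear-Gaussian toy model.
exact_mean = POST_MEAN * x_star
exact_var = (POST_STD * x_star) ** 2 + NOISE_STD ** 2
```

For a deep neural network no closed form exists, but the sampling procedure is the same: each draw corresponds to one stochastic forward pass of the network.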
4.4.1.2 Variational Inference to Train Probabilistic Models
As discussed in the previous section, the key component for invoking the feed-forward prediction of the probabilistic model is the posterior distribution $p(\tilde{\omega} \mid X, Y)$ given the data samples. Normally, this distribution cannot be modelled directly by deep networks because of their fixed structure. To approximate the target distribution with the probability model of the deep network $q(\tilde{\omega})$, variational inference is introduced in the training process to minimize the difference between $p(\tilde{\omega} \mid X, Y)$ and $q(\tilde{\omega})$.
The dissimilarity between the target distribution $p(\tilde{\omega} \mid X, Y)$ and the distribution of the deep neural network $q(\tilde{\omega})$ can be measured by the KL divergence:
$$\mathrm{KL}\big(q(\tilde{\omega}) \,\|\, p(\tilde{\omega} \mid X, Y)\big) = \int q(\tilde{\omega}) \log \frac{q(\tilde{\omega})}{p(\tilde{\omega} \mid X, Y)}\, d\tilde{\omega} \tag{4-6}$$
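As a small numerical illustration of the KL divergence, the sketch below evaluates the integral of $q \log(q/p)$ for two univariate Gaussian distributions with a midpoint rule and compares it with the known closed-form expression for Gaussians. The two distributions, the integration range and the function names are assumptions chosen only for illustration.

```python
import math


def normal_logpdf(v, mean, std):
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (v - mean) ** 2 / (2.0 * std ** 2)


def kl_numeric(q, p, lo=-12.0, hi=12.0, n=24000):
    """Midpoint-rule estimate of the integral of q * log(q / p)."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n):
        v = lo + (i + 0.5) * step
        log_q = normal_logpdf(v, *q)
        log_p = normal_logpdf(v, *p)
        total += math.exp(log_q) * (log_q - log_p) * step
    return total


def kl_closed_form(q, p):
    """Closed-form KL divergence between two univariate Gaussians (mean, std)."""
    (m_q, s_q), (m_p, s_p) = q, p
    return math.log(s_p / s_q) + (s_q ** 2 + (m_q - m_p) ** 2) / (2.0 * s_p ** 2) - 0.5


q = (0.0, 1.0)  # hypothetical approximating distribution (mean, std)
p = (1.0, 2.0)  # hypothetical target distribution (mean, std)

kl_qp = kl_numeric(q, p)
kl_pq = kl_numeric(p, q)  # the KL divergence is not symmetric
kl_qq = kl_numeric(q, q)  # identical distributions give zero divergence
```

The divergence is zero only when the two distributions coincide, and swapping the arguments changes its value, which is why the order of $q$ and $p$ in the definition matters.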
The KL divergence approaches its minimum when the distribution of the deep learning model $q(\tilde{\omega})$ is close to the target posterior distribution $p(\tilde{\omega} \mid X, Y)$; the minimizing distribution is denoted as $q^{*}(\tilde{\omega})$. Therefore, we can replace the posterior distribution in equation (4-5) with the neural network model:
$$p(y^{*} \mid x^{*}, X, Y) = \int p(y^{*} \mid x^{*}, \tilde{\omega})\, q^{*}(\tilde{\omega})\, d\tilde{\omega} \tag{4-7}$$
This means that the probability model of the output $y^{*}$ given the inputs $x^{*}$ can be quantified by sampling through the deep learning model with the variational parameter $\tilde{\omega}$ that follows the distribution $q^{*}(\tilde{\omega})$.
Notably, the KL divergence is intractable in many cases because the posterior distribution appears inside the integral. Hence, minimizing the KL divergence is replaced by an equivalent formulation, i.e., maximizing the evidence lower bound (ELBO) [93]:
$$\mathcal{L}(\tilde{\omega}) := \int q(\tilde{\omega}) \log p(Y \mid X, \tilde{\omega})\, d\tilde{\omega} - \mathrm{KL}\big(q(\tilde{\omega}) \,\|\, p(\tilde{\omega})\big) \tag{4-8}$$
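By maximizing the ELBO, the KL divergence between $q(\tilde{\omega})$ and the posterior $p(\tilde{\omega} \mid X, Y)$ is accordingly minimized, which yields $q^{*}(\tilde{\omega})$. Hence, the feed-forward prediction of the proposed model can be further simplified into the compact form:
$$p(y^{*} \mid x^{*}, X, Y) = \int p(y^{*} \mid x^{*}, \tilde{\omega})\, q^{*}(\tilde{\omega})\, d\tilde{\omega} \approx q^{*}(y^{*} \mid x^{*}, \tilde{\omega}) \tag{4-9}$$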
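The equivalence between maximizing the ELBO and minimizing the KL divergence to the posterior can be checked numerically on a toy model in which every quantity has a closed form. The sketch below assumes a scalar weight with a Gaussian prior, Gaussian observation noise and a Gaussian variational distribution $q$; the data and all numerical values are hypothetical. It verifies that the log evidence minus the ELBO equals the KL divergence between $q$ and the exact posterior.

```python
import math

# Hypothetical training data and model assumptions.
X = [1.0, 2.0, 3.0]
Y = [2.1, 3.9, 6.2]
NOISE_VAR = 0.25  # assumed observation noise variance
PRIOR_VAR = 1.0   # assumed prior w ~ N(0, PRIOR_VAR)


def kl_gauss(m_q, v_q, m_p, v_p):
    """KL(N(m_q, v_q) || N(m_p, v_p)), parameterised by variances."""
    return 0.5 * (math.log(v_p / v_q) + (v_q + (m_q - m_p) ** 2) / v_p - 1.0)


def elbo(m, v):
    """ELBO for q(w) = N(m, v): expected log-likelihood minus KL(q || prior)."""
    expected_log_lik = sum(
        -0.5 * math.log(2.0 * math.pi * NOISE_VAR)
        - ((y - m * x) ** 2 + v * x ** 2) / (2.0 * NOISE_VAR)
        for x, y in zip(X, Y)
    )
    return expected_log_lik - kl_gauss(m, v, 0.0, PRIOR_VAR)


n = len(X)
sxx = sum(x * x for x in X)
sxy = sum(x * y for x, y in zip(X, Y))
syy = sum(y * y for y in Y)

# Exact posterior of the linear-Gaussian toy model.
post_var = 1.0 / (1.0 / PRIOR_VAR + sxx / NOISE_VAR)
post_mean = post_var * sxy / NOISE_VAR

# Exact log evidence log p(Y | X): Y is Gaussian with covariance
# NOISE_VAR * I + PRIOR_VAR * x x^T.
log_det = (n - 1) * math.log(NOISE_VAR) + math.log(NOISE_VAR + PRIOR_VAR * sxx)
quad = (syy - PRIOR_VAR * sxy ** 2 / (NOISE_VAR + PRIOR_VAR * sxx)) / NOISE_VAR
log_evidence = -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)

# An arbitrary, non-optimal variational distribution q(w) = N(1.5, 0.09).
m, v = 1.5, 0.09
gap = log_evidence - elbo(m, v)
kl_to_posterior = kl_gauss(m, v, post_mean, post_var)
elbo_at_posterior = elbo(post_mean, post_var)
```

Since the log evidence does not depend on $q$, any increase of the ELBO reduces the gap by the same amount, and the gap vanishes when $q$ equals the exact posterior.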