
As discussed in Section 2.3.1, one way to estimate how fast an asset deteriorates is to find a parametric statistical distribution that can describe the rate of its deterioration. Maximum Likelihood Estimation (MLE) aims to find the best estimate of the parameter values that have the maximum likelihood of producing the evidence (data) and is a common approach for fitting a distribution. Unlike the MLE method that determines point estimates of the parameters of a statistical distribution, in a Bayesian statistical model, the parameters of the distribution are treated as random variables. By doing so, the uncertainty of the parameters is characterised, and the information from data is captured in the posterior distribution of the parameters. Moreover, in contrast to MLE where the estimate of the parameters depends solely on the data, Bayesian estimation integrates the data with any prior knowledge about the parameters.

When estimating a parametric statistical distribution in a Bayesian statistical model, Bayes’ theorem in Equation 3.1 is used as:

$$
P(\theta \mid D) = \frac{P(\theta)\,P(D \mid \theta)}{P(D)} \tag{3.6}
$$

where θ is a vector of parameters; for example, θ = {λ} for an exponential distribution and θ = {β, η} for a Weibull distribution. D denotes a vector of observations that can be used to fit the selected distribution.

8 http://www.agenarisk.com/
9 http://www.bayesia.com

To obtain the posterior distribution P(θ|D), we need to define P(θ), P(D|θ) and P(D). P(θ) is the collection of prior distributions of the parameters (how to define a prior distribution is discussed in Section 3.4.2). The likelihood function is the probability density function of the distribution parameterised by θ, for example, the pdf of a Weibull distribution in Equation 2.4. Assuming the observations are mutually independent and share the same parameters θ, the likelihood P(D|θ) is the product of the likelihoods of the individual observations. P(D) is the marginal distribution of the observations; it serves as a normalisation factor that ensures the posterior density integrates to 1. P(D) can be represented as:

$$
P(D) = \int P(\theta)\,P(D \mid \theta)\,d\theta \tag{3.7}
$$
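For a single parameter, Equations 3.6 and 3.7 can be approximated directly on a grid. The following is a minimal sketch, not the method used in this work: it assumes exponentially distributed lifetimes with rate λ, a flat prior on (0, 2), and made-up observations. The function names and the data are hypothetical.

```python
import math

def exp_log_likelihood(lam, data):
    """log P(D | lambda) for independent exponential observations."""
    return len(data) * math.log(lam) - lam * sum(data)

def grid_posterior(data, prior_pdf, lam_lo, lam_hi, n_grid=2000):
    """Approximate P(lambda | D) on a midpoint grid (Equations 3.6 and 3.7)."""
    step = (lam_hi - lam_lo) / n_grid
    grid = [lam_lo + (j + 0.5) * step for j in range(n_grid)]
    # Unnormalised posterior: prior times likelihood at each grid point.
    unnorm = [prior_pdf(lam) * math.exp(exp_log_likelihood(lam, data)) for lam in grid]
    # Midpoint-rule estimate of the marginal P(D) in Equation 3.7.
    evidence = sum(unnorm) * step
    posterior = [u / evidence for u in unnorm]
    return grid, posterior, evidence, step

# Hypothetical lifetimes and an illustrative flat prior on (0, 2).
data = [2.1, 0.7, 3.4, 1.2, 5.0]
flat_prior = lambda lam: 0.5
grid, posterior, evidence, step = grid_posterior(data, flat_prior, 0.0, 2.0)
posterior_mean = sum(lam * p for lam, p in zip(grid, posterior)) * step
```

The grid needs a number of points that grows exponentially with the number of parameters, so this approach is only practical in one or two dimensions.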

Evaluating the integral in Equation 3.7 analytically becomes increasingly difficult as the number of parameters grows. Fortunately, it can be approximated with the numerical inference algorithms mentioned in Section 3.3.

To use the posterior distribution for prediction, we can evaluate the distribution of unobserved data D_pred conditional on the observed data D, where all data follow the distribution parameterised by θ. The posterior predictive distribution is therefore:

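$$
P(D_{\mathrm{pred}} \mid D) = \int P(D_{\mathrm{pred}}, \theta \mid D)\,d\theta = \int P(D_{\mathrm{pred}} \mid \theta, D)\,P(\theta \mid D)\,d\theta \tag{3.8}
$$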

Assuming D_pred and D are conditionally independent given θ, we have the predicted deterioration distribution:

$$
P(D_{\mathrm{pred}} \mid D) = \int P(D_{\mathrm{pred}} \mid \theta)\,P(\theta \mid D)\,d\theta \tag{3.9}
$$
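Equation 3.9 is an expectation over the posterior, so it can be estimated by averaging P(D_pred|θ) over posterior draws of θ. The sketch below is illustrative only and rests on assumptions that are not part of this work: exponential lifetimes with rate λ, a conjugate Gamma(shape 2, rate 1) prior, and made-up data. Under those assumptions the integral also has a closed form, which is used here as a check.

```python
import math
import random

def gamma_posterior(a_prior, b_prior, data):
    """Conjugate update for an exponential rate: Gamma(shape, rate) prior to posterior."""
    return a_prior + len(data), b_prior + sum(data)

def predictive_pdf_mc(t, a_post, b_post, n_draws=20000, seed=0):
    """Monte Carlo estimate of Equation 3.9: average P(t | lambda) over posterior draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        # random.gammavariate takes (shape, scale), so the rate is inverted.
        lam = rng.gammavariate(a_post, 1.0 / b_post)
        total += lam * math.exp(-lam * t)
    return total / n_draws

def predictive_pdf_exact(t, a_post, b_post):
    """Closed form of the same integral for this conjugate pair (a Lomax density)."""
    return a_post * b_post ** a_post / (b_post + t) ** (a_post + 1)

data = [2.1, 0.7, 3.4, 1.2, 5.0]                  # hypothetical lifetimes
a_post, b_post = gamma_posterior(2.0, 1.0, data)  # illustrative prior
mc = predictive_pdf_mc(1.0, a_post, b_post)
exact = predictive_pdf_exact(1.0, a_post, b_post)
```

When no closed form exists, as for the Weibull distribution with both parameters unknown, the same averaging can be applied to draws produced by a sampling algorithm.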

Sometimes the data fall into different groups, each with its own distribution. In that case, models can be built hierarchically so that the parameters of each group share a prior distribution and are themselves learned from the data within each subgroup. This type of modelling is called Bayesian hierarchical modelling.

3.4.1 Bayesian Hierarchical Modelling

A Bayesian hierarchical model extends the Bayesian parameter estimation method to include multiple layers of information. This is achieved by introducing additional parameters (called hyperparameters) as the parents of the local parameters; the hyperparameters capture the uncertainty about the local parameters themselves. In a hierarchical model, the distribution of the individuals within each subgroup is governed by a set of parameters, and these parameters are drawn from a distribution governed by a set of hyperparameters that describes the characteristics of all the groups together.

Let D_i denote the individual observations from subgroup i, which follow a distribution with parameter set θ_i, and let all the parameter sets be governed by a set of hyperparameters Θ. For a two-level hierarchical model, we have three stages:

• Stage 1: D_i | θ_i ∼ P(D_i | θ_i, Θ)

• Stage 2: θ_i | Θ ∼ P(θ_i | Θ)

• Stage 3: Θ ∼ P(Θ)

where P(D_i|θ_i, Θ) is the likelihood function of the individual observations from subgroup i, P(θ_i|Θ) is the prior distribution of the parameters θ_i of subgroup i, and P(Θ) is the hyperprior of the hyperparameters Θ that govern them. Assume that the individuals within each group are mutually independent, that the subgroups within the overall population are also mutually independent, and that D_i depends on Θ only through θ_i, so that P(D_i|θ_i, Θ) = P(D_i|θ_i). The posterior distribution is then proportional to:

$$
P(\theta, \Theta \mid D) \propto P(\Theta) \prod_{i=1}^{k} P(D_i \mid \theta_i)\,P(\theta_i \mid \Theta) \tag{3.10}
$$

Note that θ_i here represents the collection of parameters for subgroup i. Hence, if the likelihood function has multiple parameters, Equation 3.10 can be further decomposed with an additional product over those parameters.
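The right-hand side of Equation 3.10 is usually evaluated on the log scale, where the product over subgroups becomes a sum. The sketch below shows that structure for one deliberately simple model; the distributional choices (exponential lifetimes with rate θ_i, an exponential prior with mean Θ, an Exponential(1) hyperprior), the function names and the data are all assumptions made for illustration.

```python
import math

def log_hyperprior(big_theta):
    """log P(Theta): Exponential(rate=1) hyperprior (illustrative choice)."""
    return -big_theta

def log_prior(theta_i, big_theta):
    """log P(theta_i | Theta): exponential prior with mean Theta."""
    return -math.log(big_theta) - theta_i / big_theta

def log_likelihood(group_data, theta_i):
    """log P(D_i | theta_i): independent exponential lifetimes with rate theta_i."""
    return len(group_data) * math.log(theta_i) - theta_i * sum(group_data)

def log_joint_posterior(thetas, big_theta, groups):
    """Unnormalised log of Equation 3.10 for k = len(groups) subgroups."""
    if big_theta <= 0 or any(t <= 0 for t in thetas):
        return float("-inf")  # outside the support of the priors
    total = log_hyperprior(big_theta)
    for group_data, theta_i in zip(groups, thetas):
        total += log_likelihood(group_data, theta_i) + log_prior(theta_i, big_theta)
    return total

# Hypothetical data: one subgroup with six observations, one with a single observation.
groups = [[2.1, 0.7, 3.4, 1.2, 5.0, 2.8], [4.0]]
value = log_joint_posterior([0.4, 0.3], 0.5, groups)
```

A function of this form is what a sampling algorithm needs as input: it only has to return the log posterior up to an additive constant. The subgroup with a single observation still contributes a likelihood term, while its parameter is constrained by the shared prior P(θ_i|Θ).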

As a result, the population-level parameters are jointly informed by the parameters of all subgroups, and each subgroup's parameters are informed by the other subgroups' parameters through the estimate of the population-level parameters [85]. This suggests that Bayesian hierarchical modelling can be leveraged to tackle the challenge of limited data (Objective II), by letting a group of assets with little deterioration data learn from other groups with more deterioration data.

3.4.2 Prior Probability Distributions for Parameters

Estimating statistical distributions in a Bayesian statistical model requires us to quantify the prior probability distributions for any unknown parameters. This provides a natural framework for encoding knowledge in the priors as a specification of the uncertainty about the parameters. The priors are modelled as continuous variables and, in this subsection, we describe how they are elicited.

When the parameters have conjugate priors (see the list of families in Fink [48]), choosing such a prior gives a closed-form expression for the posterior, which simplifies the calculation of the posterior distribution. However, not all distributions have conjugate priors for their parameters. For example, there is no conjugate prior for a Weibull distribution when both the shape and scale parameters are unknown. Another common approach is to assign an uninformative prior when no past information is available, for example, a uniform distribution with extreme bounds. With uniform priors for the parameters, the Bayesian point estimate given by Maximum A Posteriori (MAP) estimation is identical to the MLE result [120]. When relevant information about the parameter is available, however, we can create an informative prior.
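The equivalence between MAP estimation under a uniform prior and MLE follows because a constant log prior shifts the log posterior without moving its maximum. The sketch below checks this numerically for an assumed exponential model with made-up data; the uniform bounds are chosen as extreme values for illustration, and the search grid covers only part of the prior's support.

```python
import math

def log_likelihood(lam, data):
    """log P(D | lambda) for independent exponential observations."""
    return len(data) * math.log(lam) - lam * sum(data)

def log_uniform_prior(lam, lo=1e-4, hi=1e4):
    """Log density of Uniform(lo, hi): constant inside the bounds."""
    return -math.log(hi - lo) if lo <= lam <= hi else float("-inf")

data = [2.1, 0.7, 3.4, 1.2, 5.0]   # hypothetical lifetimes

# Closed-form maximum likelihood estimate of an exponential rate.
mle = len(data) / sum(data)

# Posterior mode found by searching a fine grid (0.001 to 5.0 in steps of 0.001).
grid = [0.001 * j for j in range(1, 5001)]
map_estimate = max(
    grid,
    key=lambda lam: log_likelihood(lam, data) + log_uniform_prior(lam),
)
```

The two estimates agree up to the grid resolution. Replacing the uniform prior with an informative one would add a non-constant term to the objective and move the mode away from the MLE.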

Lunn et al. [103] and Gelman et al. [52] introduce a range of methods for defining an informative prior, including estimation based on data, a mixture of data and judgement, and pure expert judgement. When some historical data are available, we may obtain a prior distribution for the parameter from an empirical estimate, for example, by matching the prior to the mean and standard deviation of the data pool. When we believe the historical data are not fully representative, we may elicit knowledge from experts to discount the weight of the data by building a power prior [71, 103] or by using a bias modelling procedure [59, 103]. Elicitation of subjective information from experts is also often used to determine a prior distribution, but a subjective prior distribution for an uncertain parameter is often difficult to specify precisely. A simple approach is to ask experts directly about the central tendency and variation of the parameter and use them as the mean and standard deviation of its prior distribution. For example, Neil et al. [125] used a triangular distribution to estimate the prior from vague ranges: a lower bound, a middle value and an upper bound. The challenge is to explain these quantities to experts and to convince them to assign values, since parameters in different distributions have different meanings and different impacts on the posterior distributions. Tackling this challenge is part of our objectives (Objective I) and is addressed in the next chapter.
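Two of the elicitation routes described above reduce to short calculations. The sketch below is illustrative only: the choice of a gamma family for moment matching is an assumption (the text does not prescribe a family), the middle value of the triangular prior is taken to be its mode, and the function names are hypothetical.

```python
import math

def gamma_from_moments(mean, sd):
    """Gamma(shape, rate) whose mean and standard deviation match the inputs."""
    shape = (mean / sd) ** 2
    rate = mean / sd ** 2
    return shape, rate

def triangular_pdf(x, low, mode, high):
    """Density of a triangular prior; assumes low < mode < high."""
    if x < low or x > high:
        return 0.0
    if x <= mode:
        return 2.0 * (x - low) / ((high - low) * (mode - low))
    return 2.0 * (high - x) / ((high - low) * (high - mode))

# Moment matching with made-up summary statistics (mean 400, standard deviation 100).
shape, rate = gamma_from_moments(mean=400.0, sd=100.0)

# Expert-style triple for a shape parameter: lower bound 1, middle value 2, upper bound 4.
density_at_mode = triangular_pdf(2.0, 1.0, 2.0, 4.0)
```

Moment matching fixes the first two moments only, so the tails of the resulting prior depend entirely on the family chosen. The triangular prior, by contrast, assigns zero density outside the stated bounds, which the posterior can never recover from.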

Figure 3.6 Effects of the prior distribution and data quantity in learning distribution.

To illustrate the differences when distributions are fitted with different types of priors, two examples are shown in Figure 3.6 with the corresponding configuration in Table 3.1.

Table 3.1 Fitting Weibull distributions using Bayesian parameter estimation with different priors and data amounts.

Parameter Name   With Uninformative Priors   With Informative Priors      True Value
Shape            Uniform (0.0001, 10000)     Triangular (1, 2, 4)         1.5
Scale            Uniform (0.0001, 10000)     Triangular (100, 500, 700)   400

Simulation is used to generate a set of data from a Weibull distribution with a shape of 1.5 and a scale of 400. The fitted distribution governed by uninformative priors is shown in yellow, the one governed by informative priors in blue, and the true distribution in red. Figure 3.6 (a) shows that even with little data (5 data points), informative priors allow the posterior distribution to approximate the true distribution to a reasonable degree. This shows the advantage of choosing informative priors over uninformative priors when they are available. With a larger sample (50 data points), as shown in Figure 3.6 (b), the posteriors with uninformative and informative priors both become almost identical to the true distribution (the yellow distribution overlaps the blue distribution). With a large enough sample (50 data points in this case), the choice of prior distribution has only a minor effect on the posterior. This reflects the Bernstein–von Mises theorem: as the sample size increases, the posterior distribution of the parameters becomes asymptotically independent of the prior distribution [171]. Conversely, if the data are scarce or only indirect information for learning the parameter is available, assigning a good prior is essential [103, 52].
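The configuration in Table 3.1 can be explored with a small grid approximation. The sketch below is not the procedure used to produce Figure 3.6, and it makes no claim to reproduce its results: it assumes a coarse grid truncated to shape ≤ 10 and scale ≤ 3000 (so the uniform priors are flat over the grid only), uses an arbitrary random seed, and summarises each posterior by its mean. All function names are hypothetical.

```python
import math
import random

def weibull_log_pdf(t, shape, scale):
    """Log density of a Weibull(shape, scale) distribution at t > 0."""
    z = t / scale
    return math.log(shape / scale) + (shape - 1.0) * math.log(z) - z ** shape

def triangular_pdf(x, low, mode, high):
    """Density of a triangular prior; assumes low < mode < high."""
    if x < low or x > high:
        return 0.0
    if x <= mode:
        return 2.0 * (x - low) / ((high - low) * (mode - low))
    return 2.0 * (high - x) / ((high - low) * (high - mode))

def posterior_means(data, shape_prior, scale_prior, shapes, scales):
    """Posterior means of (shape, scale) from a normalised grid of prior times likelihood."""
    log_w = []
    for b in shapes:
        for e in scales:
            prior = shape_prior(b) * scale_prior(e)
            if prior <= 0.0:
                continue  # grid point outside the prior's support
            ll = sum(weibull_log_pdf(t, b, e) for t in data)
            log_w.append((math.log(prior) + ll, b, e))
    top = max(lw for lw, _, _ in log_w)  # subtract the maximum for numerical stability
    weights = [(math.exp(lw - top), b, e) for lw, b, e in log_w]
    total = sum(w for w, _, _ in weights)
    mean_shape = sum(w * b for w, b, _ in weights) / total
    mean_scale = sum(w * e for w, _, e in weights) / total
    return mean_shape, mean_scale

rng = random.Random(1)
# random.weibullvariate takes (scale, shape) in that order.
data = [rng.weibullvariate(400.0, 1.5) for _ in range(5)]

shapes = [0.1 * j for j in range(1, 101)]    # 0.1 to 10.0
scales = [10.0 * j for j in range(1, 301)]   # 10 to 3000

flat = lambda x: 1.0  # uniform prior: constant over the grid, constants cancel
vague = posterior_means(data, flat, flat, shapes, scales)
informative = posterior_means(
    data,
    lambda b: triangular_pdf(b, 1.0, 2.0, 4.0),
    lambda e: triangular_pdf(e, 100.0, 500.0, 700.0),
    shapes,
    scales,
)
```

Rerunning the sketch with 50 simulated points instead of 5 gives a way to compare the `vague` and `informative` summaries at the two sample sizes discussed above.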