Bayesian Framework - Uncertainty Quantification

2.6 Uncertainty Quantification

2.6.3 Bayesian Framework

Bayes Theorem is named after Rev. Thomas Bayes (1701–1761) and is stated mathematically in a simple form as follows:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \tag{2.20}$$

In essence, Bayes Theorem states that, given an initial prior probability of event $A$, $P(A)$, we can compute its posterior probability, $P(A \mid B)$, after the occurrence of event $B$ from the conditional probability $P(B \mid A)$, termed the likelihood, divided by the evidence $P(B)$. In the history matching framework, Bayes Theorem can be used to answer the question “how likely is one set of model parameterisation $A$ given the historical observed data $B$?”.

A further application of Bayes Theorem is Bayesian inference, in which we consistently update our knowledge of one or more unknown quantities of a physical system based on the observed data from the system. This technique can be applied to update the initial probabilities of the reservoir model parameters based on observation data such as production rates. Starting with a set of initial knowledge in the form of a prior, we sample a number of possible reservoir descriptions from the prior and examine how well the predictions made using these reservoir descriptions fit the data. The models that fit the data well are assigned higher probabilities, with the numerical values of the probabilities given by Bayes Theorem, as in Equation (2.21).

$$p(m \mid O) = \frac{p(O \mid m)\,p(m)}{\int p(O \mid m)\,p(m)\,\mathrm{d}m} \tag{2.21}$$

where the prior $p(m)$ contains the initial probabilities for the model parameters $m$, $p(O \mid m)$ is the likelihood function that measures the degree to which the observed and modelled data differ, and $p(m \mid O)$ is the new posterior probability of the model parameters based on the observations $O$. The integral in the denominator of (2.21) is the evidence and acts as a normalisation constant, so Bayes Theorem can be simply described as in Equation (2.22).

$$\text{posterior} \propto \text{likelihood} \times \text{prior} \tag{2.22}$$
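The proportionality in Equation (2.22) can be illustrated with a minimal sketch in which the integral in the denominator of (2.21) is replaced by a sum over a small set of candidate models. The function name `posterior_on_grid` and the three likelihood values are hypothetical and chosen only for illustration; they do not come from any study in the thesis.

```python
def posterior_on_grid(prior, likelihood):
    """Return normalised posterior weights: posterior ∝ likelihood × prior."""
    unnormalised = [l * p for l, p in zip(likelihood, prior)]
    # Discrete stand-in for the evidence integral in Equation (2.21).
    evidence = sum(unnormalised)
    if evidence == 0.0:
        raise ValueError("all models have zero likelihood; posterior is undefined")
    return [u / evidence for u in unnormalised]


# Hypothetical example: three candidate models with a non-informative (uniform) prior.
prior = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
likelihood = [0.10, 0.40, 0.50]  # assumed values, for illustration only
posterior = posterior_on_grid(prior, likelihood)
```

With a uniform prior, the posterior weights are simply the likelihoods rescaled to sum to one, which is why the likelihood definition dominates the results when non-informative priors are used.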

2.6.3.1 Prior Information

Priors represent our knowledge of the unknown model parameters before seeing the observed data. The information contained in the priors can be based on previous knowledge of the problem or system (e.g. expert knowledge) or quantitative data from reliable sources (e.g. scientific publications, reports). Based on how much information we have, Gelman [211] categorised prior distributions into three groups:

1. Non-informative prior distributions, which are usually set as uniform distributions on one or more uncertain parameters. This type of distribution assigns the same probability to all values within the parameter ranges.

2. Highly informative prior distributions, which are used when a fair amount of information about the possible values of the model parameters is available. Typically, this type of prior is represented by a normal distribution defined by its mean and standard deviation.

3. Moderately informative hierarchical prior distributions, which are used when limited information about the model parameter values is available.

Prior distributions that incorporate the available information about the parameter probabilities improve the posterior probability estimates: the more information the prior contains, the more the posterior probability estimates are constrained towards realistic values.

Some studies on the use of informative priors exist in the literature. Arnold [14] introduced a technique for modelling geological prior information based on published equations that relate channel width and thickness. The study highlighted the impact of preserving the realism of geological models in history matching. Rojas [61] captured complex relationships between geological features using a machine learning technique (a support vector machine) and built sedimentological intelligent priors. The inclusion of intelligent priors in this study resulted in more realistic history-matched models and a reduction in the computational time of history matching.

However, to maintain the focus of this thesis, non-informative prior distributions are used in all of the history matching and uncertainty quantification studies.

2.6.3.2 Likelihood Definition

The likelihood of a reservoir model can be defined by comparing the simulation results with the observed data (i.e. by using the misfit value in a similar way to the objective function defined by Equation (2.7) in Section 2.3.2.2). For instance, if we are matching on oil rate, the likelihood $p(O \mid m)$ is the probability that the measured observation $q^{obs}$ is equal to the simulated value $q^{sim}$ given the reservoir model $m$. Assuming that the measurement errors at any time are Gaussian, independent and identically distributed (all have the same variance) with zero mean, and that there are no simulation errors, the likelihood at timestep $t$ can be defined as:

$$p(O \mid m)_t = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{q_t^{obs} - q_t^{sim}}{\sigma}\right)^2\right] \tag{2.23}$$

where $\sigma$ is the standard deviation of the measurement error, and $q^{obs}$ and $q^{sim}$ are the observed and simulated data, respectively.

As the measurement errors are assumed to be independent between timesteps, the joint probability density is the product of the probabilities of each measurement for $n$ data points, as given by Equation (2.24).

$$p(O \mid m) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \prod_{t=1}^{n} \exp\left[-\frac{1}{2}\left(\frac{q_t^{obs} - q_t^{sim}}{\sigma}\right)^2\right] \tag{2.24}$$

As $\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n$ is a constant:

$$p(O \mid m) \propto \prod_{t=1}^{n} \exp\left[-\frac{1}{2}\left(\frac{q_t^{obs} - q_t^{sim}}{\sigma}\right)^2\right] \tag{2.25}$$

Hence, if we use the misfit definition $M$ in Equation (2.7), we can define the likelihood function as in Equation (2.26), so that by minimising the misfit $M$ we maximise the likelihood [6].

$$p(O \mid m) \propto e^{-M} \tag{2.26}$$
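The step from the product of Gaussian terms to the exponential of the negative misfit can be checked numerically. The sketch below assumes that the misfit of Equation (2.7), which is not reproduced in this section, takes the least-squares form of the sum of squared residuals divided by twice the error variance; this is the form that makes (2.25) and (2.26) agree. The rate values and the standard deviation are hypothetical.

```python
import math


def misfit(observed, simulated, sigma):
    """Least-squares misfit, assumed here to be sum((obs - sim)^2) / (2 sigma^2)."""
    return sum((o - s) ** 2 for o, s in zip(observed, simulated)) / (2.0 * sigma ** 2)


def joint_likelihood(observed, simulated, sigma):
    """Joint Gaussian likelihood over all timesteps, as in Equation (2.24)."""
    n = len(observed)
    constant = (1.0 / (sigma * math.sqrt(2.0 * math.pi))) ** n
    return constant * math.exp(-misfit(observed, simulated, sigma))


# Hypothetical oil rates at three timesteps and two candidate models.
observed = [100.0, 95.0, 90.0]
model_a = [101.0, 94.0, 91.0]   # close to the data
model_b = [110.0, 85.0, 100.0]  # far from the data
sigma = 5.0                     # assumed measurement error standard deviation

misfit_a = misfit(observed, model_a, sigma)
misfit_b = misfit(observed, model_b, sigma)

# The normalising constant cancels when two models are compared,
# so exp(-misfit) alone gives the same likelihood ratio.
ratio_full = joint_likelihood(observed, model_a, sigma) / joint_likelihood(observed, model_b, sigma)
ratio_unnormalised = math.exp(-misfit_a) / math.exp(-misfit_b)
```

Because only likelihood ratios enter the acceptance criteria of the sampling methods, the constant in front of the product can be dropped without affecting the posterior.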

2.6.3.3 Posterior Probability Distribution

Determining the posterior probability distribution (PPD) is probably the most important part of the inverse problem [6]. In general, an inverse problem can be stated as follows: given prior information on some model parameters, inexact measurements of some observable parameters, and an uncertain relation between the observed data and the model parameters, how should we modify the prior probability density function (PDF) to include the information provided by the observed data? In this sense, the PPD refers to the modified PDF. Hence, the solution to the inverse problem is the construction of an estimate of the PPD, from which realisations of the model are then constructed by sampling.

Many history-matched models, typically referred to as the ensemble, are required to estimate the PPD, because a single history-matched model is usually not a good representative of the PPD. The probability estimates for the reservoir predictions can then be calculated from the PPD estimate of each matched model.

Analogous to the frequentist’s confidence interval, the probability estimates in the predictions are usually described by a Bayesian credible interval. First, the cumulative distribution function is constructed from the PPD for a particular production quantity. Then, the credible interval can be reported in different ways. For instance, a 90% maximum credible interval $(a, b)$ represents the largest interval for which the estimated probability of encapsulating the true value of that production quantity is 0.9. Alternatively, we can take $a = 0$ and report the $b$ that corresponds to the 0.9 quantile of the cumulative distribution function. In this case, the true value is below the reported $b$ with a probability of 0.9. In this thesis, the latter approach is used, and the 0.1, 0.5 and 0.9 quantiles are reported as P90, P50 and P10, respectively, as shown in Figure 2.16.
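The quantile convention described above can be sketched as follows. The helper `weighted_quantile` is a hypothetical name, and the forecast values and visit counts are invented for illustration; the posterior weight of each model is assumed to be proportional to how often it was visited during resampling.

```python
def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative posterior weight reaches quantile q."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight / total
        # Small tolerance guards against floating-point round-off in the sum.
        if cumulative >= q - 1e-12:
            return value
    return pairs[-1][0]


# Hypothetical forecasts of one production quantity at one timestep,
# with the number of times each model was visited by the resampling.
forecasts = [1300.0, 900.0, 1100.0, 1000.0, 1200.0]
visits = [1, 1, 4, 2, 2]

# Convention used in this thesis: the 0.1, 0.5 and 0.9 quantiles are
# reported as P90, P50 and P10, respectively.
p90 = weighted_quantile(forecasts, visits, 0.1)
p50 = weighted_quantile(forecasts, visits, 0.5)
p10 = weighted_quantile(forecasts, visits, 0.9)
```

Repeating this calculation at every timestep of the prediction period and connecting the results gives the P10, P50 and P90 lines.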

Figure 2.16—History matching and uncertainty quantification in the prediction period. Black dots show the observed data in the historical period; coloured lines show plausible fitting models. The history-matched models are used for making predictions, and the results are shown as typical P10, P50, and P90 credible intervals.

Techniques for Determining Posterior Probability Distribution

There are many techniques for estimating a PPD and consequently quantifying the uncertainty for an ensemble of models. These techniques fall into two categories: exact methods and approximate methods [6].

An exact method for estimating the PPD analytically was demonstrated in [6] for a known function. However, in real-world inverse problems such as history matching, the analytical method encounters difficulty because the problem is complex, nonlinear, high-dimensional and ill-posed. Thus, it can be difficult to characterise the PPD analytically, and a numerical integration method is needed to estimate the PPD. A particularly common technique in this category is Monte Carlo sampling, in which the parameter space is sampled by Monte Carlo simulation runs, which are computationally expensive.

Another widely used class of techniques for uncertainty quantification is the approximate methods, which often provide better solutions than the exact method considering the quality of the approximation to the correct PPD. Included in these methods are experimental design and the response surface method (see e.g. [212,213]). These statistical approaches first identify the uncertain parameters that most affect the response variable using a limited number of simulations and then fit a surface, usually a linear or quadratic model, to the response variable. Afterwards, this surface is used as a proxy for the simulation runs in Monte Carlo sampling. The main issue with these methods is that the small number of samples provides only an approximation of the true model response. Another issue is that the use of a proxy model instead of forward simulations may introduce significant modelling errors. Other limitations are the number of parameters that the methods can handle and the difficulty of handling highly nonlinear surfaces. Nonetheless, such methods are useful for appraisal-stage uncertainty analysis, where data is limited and no production data is available.

Elsewhere in the literature, Erbas [13] distinguished three types of method for estimating the PPD, which are compared in [12], as listed below. Some terminology is required first.

Maximum likelihood (ML) solution represents the best history-matched model in the

term is incorporated. If the prior is uniform and wide enough to overlap the likelihood, ML and MAP solutions are the same.

1. Methods that characterise the PPD locally around the ML or MAP model.

In this method, the PPD is determined by characterising the distribution around the ML or MAP model, and this information is then used in the prediction. An extended version of this approach uses more than one equally good solution, as in the case of a multi-modal objective function. In this case, the local characterisation is performed on the multiple ML or MAP solutions found. An example of this extension is linearisation about the MAP (LMAP) [5].

2. Methods that use only a subset of the ensemble generated.

Randomised Maximum Likelihood (RML) and Pilot Point (PP) methods fall into this category. RML works by randomly sampling initial models from the prior reservoir model and randomly sampling the observed data. Then, each pair of samples (model and data) is history-matched individually using an optimisation technique. The objective function includes both the misfit between the simulated response and the data sample, and the discrepancy of the reservoir model from the initially sampled model. The PP method is an approximation to RML in which the model parameters are only varied at particular locations (pilot points) in the parameter space.

3. Methods that sample from the complete PPD.

Rejection Sampling (RS) and Markov chain Monte Carlo (MCMC) methods fall into this category. RS produces independent samples from an initial model distribution, and they are either accepted or rejected based on the acceptance function. In the MCMC method, a chain of randomly selected samples is produced by taking a random step along each parameter axis from the present location to a new model. Then the model is accepted or rejected based on the acceptance criteria. The most common method for this mechanism in MCMC is the Metropolis-Hastings sampler [214, Ch. 6].

In both RS and MCMC, the acceptance criteria are defined in proportion to the likelihood of the models. Hence, it is expected that the PPD can be represented after a long chain. However, the main problem with these methods is that they require a large number of computationally expensive samples to determine the PPD accurately.
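The MCMC mechanism described in this category can be sketched with a random-walk Metropolis-Hastings sampler on a one-parameter toy problem. Everything specific here is an assumption made for illustration: the analytic `misfit` function stands in for a forward reservoir simulation, the uniform prior range is 0 to 5, and the step size, chain length and burn-in are arbitrary.

```python
import math
import random

TRUE_VALUE = 2.0  # hypothetical best-fitting parameter value
SIGMA = 0.5       # hypothetical spread of the toy posterior


def misfit(x):
    """Cheap stand-in for a forward simulation followed by a misfit calculation."""
    return (x - TRUE_VALUE) ** 2 / (2.0 * SIGMA ** 2)


def log_posterior(x, lower=0.0, upper=5.0):
    """Log of (uniform prior × likelihood), with likelihood ∝ exp(-misfit)."""
    if x < lower or x > upper:
        return -math.inf  # outside the prior range
    return -misfit(x)


def metropolis_hastings(n_steps, start, step_size, rng):
    """Random-walk Metropolis-Hastings; `start` must lie inside the prior range."""
    chain = [start]
    current, current_lp = start, log_posterior(start)
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step_size)
        proposal_lp = log_posterior(proposal)  # one "simulation" per proposed state
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        chain.append(current)
    return chain


rng = random.Random(42)
chain = metropolis_hastings(20000, start=4.5, step_size=0.5, rng=rng)
samples = chain[1000:]  # discard an assumed burn-in of 1000 steps
posterior_mean = sum(samples) / len(samples)
```

Each proposed state costs one misfit evaluation, which in history matching means one reservoir simulation; this is the computational burden referred to above.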

Liu et al. [12] considered RS and MCMC to be the correct sampling methods for PPD construction, whereas the others (i.e. LMAP, RML and PP) are only approximately correct. They evaluated all of these methods on a simple 1D reservoir case and found that LMAP failed to achieve an approximate match to the data, the PP method overestimated the uncertainty, and the RML method provided an acceptable uncertainty estimate that can be an alternative to MCMC.

NA-Bayes (NAB)

In this thesis, a methodology based on MCMC called NA-Bayes (NAB) [215] is used to estimate the PPD. The main difference between MCMC and NAB is that NAB resamples an ensemble of models previously generated by a search algorithm (e.g. NA, GA or PSO), whereas MCMC requires a new forward simulation and misfit calculation for each newly proposed state of the Markov chain. NAB also infers information from the complete ensemble rather than only a subset of it, and the resampling itself requires no further solution of the forward problem; forward simulations are only needed for the new ‘resampled’ ensemble.

NAB constructs Voronoi cells to represent the model parameter space and to interpolate the PPD at unknown points in the high-dimensional parameter space. The Voronoi interpolation assigns the misfit value of each sample point as a constant throughout its cell. By means of this interpolation, NAB does not require any forward simulation during the posterior sampling. NAB then uses a Gibbs sampler, a special case of the Metropolis-Hastings sampler, to resample the ensemble of models. In summary, two points can be highlighted on the use of NAB for posterior inference:

1. All the models in the ensemble are used to infer the information and to evaluate the PPD for the credible interval in prediction.

2. No further forward reservoir simulation is needed for all the models generated by the sampling algorithm, only for the ones resampled by NAB. This helps to save computational time.

Figure 2.17 illustrates the working mechanism of the Gibbs sampler in NAB resampling for a two-dimensional problem, which is summarised in Algorithm 2.6.

Figure 2.17—NAB resampling process showing two random walks of Gibbs sampler (taken from [215]).

The PPD for each model is determined by the sampling density for that particular model. The sampling density is affected by the size of the Voronoi cell for that particular model and the likelihood value of the sampled model. In [13, Fig. 2.23], Erbas demonstrated that the correlation between the PPD of the resampled models and the likelihood $L$ is lower than the correlation between the PPD of the resampled models and "$L \times \text{Cell size}$". Hence, in the PPD construction from NAB, the sampling density is directly proportional to "$L \times \text{Cell size}$".

The limitation of NAB is the assumption that all the models in one Voronoi cell have the same misfit value, and consequently the same likelihood. This assumption may cause well-matched models to be missed when they lie within an unrefined cell that is represented by a poorly matched model. This problem can occur when the misfit surface has several steep minima, or in higher-dimensional problems.

Figure 2.18—NAB workflow for uncertainty prediction (h = history; f = forecast; $p(m \mid O)$ = posterior distribution) (taken from [13]).

Figure 2.18 illustrates how the results of NAB resampling are used to quantify the uncertainties in production forecasts. The number of resampled models is less than the number of models in the input ensemble. The visit frequencies of these resampled models are counted and the posterior distribution, $p(m \mid O)$, is constructed. Then, forward simulations are performed on these resampled models for production forecasts. Note that the posterior inference is constructed from the complete ensemble; however, we only need to run forward simulations of the resampled models for forecasting. The Bayesian credible intervals (P10, P50 and P90 values) are then computed individually at each timestep in the prediction period and connected to construct the P10, P50 and P90 lines.

Several studies highlight the use of NAB as a method to estimate the PPD. Arnold [14, Fig. 3.9] provides an excellent summary and discussion of different methods for estimating the PPD. In that work, NAB was chosen because it provides a robust method of producing probabilistic results with greater accuracy (as it uses the entire ensemble) than LMAP, RML and PP, without the computational burden of MCMC or RS. In other studies, NAB has been applied successfully to history matching and uncertainty quantification in reservoir simulation studies (see e.g. [84,87,88,92,108,216] for details).

Algorithm 2.6—Gibbs sampler in NAB resampling [215].

Step 1 A random walk starts at point B that can be a model from the input ensemble. A useful selection can be from the position of the better data fitting models;

Step 2 From this point, a series of steps or random walks is taken along each parameter axis in turn (i.e. two steps for the example in Figure 2.17);

Step 3 An interval ($l_i$ to $u_i$ in Figure 2.17) is defined for each axis covering the entire parameter space, in which a conditional probability is constructed by computing the intersection points of the interval with the ensemble’s Voronoi cells. This results in a PDF like $P_{NA}(x_i \mid x_{-i})$ shown in Figure 2.17. The random walks produce samples with a distribution that tends towards the approximate posterior distribution by taking the likelihood of the models (defined in (2.26)) as the probabilities that construct the conditional PDF above;

Step 4 A new random step, $x_i'$ (point B’), is proposed by a uniform random deviate between the end points of the axis (i.e. in the interval $l_i$ to $u_i$ shown in Figure 2.17). The probability of proposing B’ in cell A is defined in Equation (2.27);

$$P(B' \in A) = \frac{\text{Cell size}_A}{|u_i - l_i|} \tag{2.27}$$

where $u_i$ and $l_i$ are the upper and lower bounds of parameter $i$, respectively, and $\text{Cell size}_A$ is the length of the section of the axis between $l_i$ and $u_i$ that passes through cell A.

Step 5 This proposed step is accepted if a second random deviate, $r$, generated on the unit interval (0,1), satisfies Equation (2.28);

$$r \le \frac{P_{NA}(x_i' \mid x_{-i})}{P_{NA}(x_i^{max} \mid x_{-i})} \tag{2.28}$$

where $P_{NA}(x_i^{max} \mid x_{-i})$ is the maximum value of the conditional along the axis.

Step 6 If the proposed step is rejected, then the whole procedure is repeated until an accepted step is produced;

Step 7 The Gibbs sampler continues by generating the next step and cycles through each parameter axis in turn. An iteration is completed when all dimensions have been cycled through once;

Step 8 The constructed conditional PDF is believed to be a good approximation to the true posterior distribution after many independent walks starting from different locations [3], [4], [13], [108], [216].
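The steps of Algorithm 2.6 can be sketched for a small two-parameter ensemble. This is a simplified illustration under stated assumptions, not the implementation of [215]: parameters are assumed to be scaled to the unit interval, Voronoi cell membership is found by a nearest-neighbour search instead of computing axis intersections, and the maximum of the conditional in Equation (2.28) is replaced by the largest likelihood in the ensemble, which is an upper bound that keeps the rejection step valid at the cost of some extra rejections. The four models and their misfits are hypothetical.

```python
import math
import random


def nearest_cell(point, ensemble):
    """Index of the ensemble model whose Voronoi cell contains `point`."""
    return min(range(len(ensemble)),
               key=lambda k: sum((p - e) ** 2 for p, e in zip(point, ensemble[k])))


def gibbs_step(point, axis, ensemble, likelihoods, bounds, rng):
    """Steps 4 to 6: propose uniformly along one axis until a step is accepted."""
    lower, upper = bounds[axis]
    p_max = max(likelihoods)  # upper bound on the conditional along the axis
    while True:
        candidate = list(point)
        candidate[axis] = rng.uniform(lower, upper)  # uniform proposal, Eq. (2.27)
        p = likelihoods[nearest_cell(candidate, ensemble)]
        if rng.random() <= p / p_max:  # acceptance test in the spirit of Eq. (2.28)
            return candidate


def nab_walk(start, n_iterations, ensemble, misfits, bounds, rng):
    """Step 7: cycle through every axis once per iteration; no forward simulation."""
    likelihoods = [math.exp(-m) for m in misfits]  # likelihood ∝ exp(-misfit)
    point = list(start)
    samples = []
    for _ in range(n_iterations):
        for axis in range(len(point)):
            point = gibbs_step(point, axis, ensemble, likelihoods, bounds, rng)
        samples.append(tuple(point))
    return samples


# Hypothetical ensemble: four models whose Voronoi cells are equal quadrants.
ensemble = [(0.2, 0.2), (0.8, 0.2), (0.2, 0.8), (0.8, 0.8)]
misfits = [0.1, 3.0, 3.0, 6.0]
bounds = [(0.0, 1.0), (0.0, 1.0)]

rng = random.Random(0)
samples = nab_walk((0.2, 0.2), 2000, ensemble, misfits, bounds, rng)

# Visit frequency of each cell approximates its posterior probability.
visits = [0] * len(ensemble)
for s in samples:
    visits[nearest_cell(s, ensemble)] += 1
```

Because a uniform proposal lands in a cell with probability proportional to the cell's extent along the axis and is accepted in proportion to the cell's likelihood, the visit frequencies scale with the product of likelihood and cell size.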

The computational overhead of NAB depends on the setup of the algorithm. The user has to define the number of chains, which determines the number of independent random walks, each starting from a different point in the model parameter space. Using multiple chains significantly reduces computation time, as the calculations are performed simultaneously, and improves the sampling of the parameter space, because each walk starts in a different place [13].

The user also has to define the burn-in period for NAB, which is the number of steps discarded at the beginning of each independent random walk. The motivation for using a burn-in period is to improve the robustness of the results. The most commonly used method for determining the burn-in period is visual inspection of plots of the output, as illustrated in Figure 2.19.

Figure 2.19—Schematic representation of a random walk and burn-in period. A number of iterations at the beginning of the random walk for each chain is discarded to ensure that the resulting target distribution is independent of the starting point (taken from [13]).

The chain length determines the number of steps to be performed on each chain for a particular simulation case. This length is related to the convergence of the chain. Convergence is achieved when independent chains converge to the same distribution [13].
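The effect of the burn-in period and the idea that independent chains should agree can be illustrated numerically. The sketch below uses the Gelman-Rubin statistic, a common convergence diagnostic that is introduced here only for illustration; the text above relies on visual inspection, and this is not claimed to be the check used with NAB. The function `toy_chain` is a hypothetical autoregressive random walk standing in for a real resampling chain, and the burn-in of 200 steps is an assumed value.

```python
import math
import random
import statistics


def toy_chain(start, n_steps, rng, memory=0.9, noise=0.2):
    """Hypothetical random walk that slowly forgets its starting point."""
    chain = [start]
    for _ in range(n_steps - 1):
        chain.append(memory * chain[-1] + rng.gauss(0.0, noise))
    return chain


def gelman_rubin(chains):
    """Gelman-Rubin statistic for equal-length chains; values near 1 suggest agreement."""
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


rng = random.Random(1)
# Three independent walks starting from different places in parameter space.
chains = [toy_chain(start, 2000, rng) for start in (10.0, 0.0, -10.0)]

# Early steps still remember where each walk started.
r_hat_early = gelman_rubin([c[:50] for c in chains])

# After discarding the burn-in, the chains sample the same distribution.
burn_in = 200
r_hat_late = gelman_rubin([c[burn_in:] for c in chains])
```

A statistic well above 1 for the early steps and close to 1 after the burn-in is the numerical counterpart of the visual check sketched in Figure 2.19.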

In document Multi-objective methods for history matching, uncertainty prediction and optimisation in reservoir modelling (Page 93-105)