
Some probability concepts in engineering

2.3 Statistical parameters from observational data

2.3.2 Interval estimation

With point estimation, the degree of accuracy of the estimates is not conveyed.

Interval estimation is therefore concerned with calculating a range of values for the parameter from the observational data. The interval conveys a stated degree of confidence in the estimated quantity.

Let $\bar{x}$ be the sample mean of $n$ observational data, used to estimate the population mean $\mu$. The accuracy of this estimate depends on the sample size $n$. The following sections present two approaches for determining the confidence interval of the mean.

2.3.2.1 Confidence interval mean with known variance

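Consider a sample $x_1, x_2, \ldots, x_n$ of size $n$ from a population $X$. The sample mean $\bar{x}$ and its expected value are given as follows:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,$$

$$E(\bar{x}) = \frac{1}{n}\sum_{i=1}^{n} E(x_i) = \mu. \tag{2.26}$$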

The expected value of the sample mean $\bar{x}$ is equal to the population mean $\mu$, so $\bar{x}$ is an 'unbiased' estimator of $\mu$. The value of $\bar{x}$ depends on the sample drawn, so $\bar{x}$ is itself a random variable. Assuming independent observations with population variance $\sigma^2$, its variance is derived as:

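$$\mathrm{Var}(\bar{x}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(x_i) = \frac{\sigma^2}{n}. \tag{2.28}$$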

Based on Eq. (2.26) and Eq. (2.28), the sample mean has a mean of $\mu$ and a standard deviation of $\sigma/\sqrt{n}$. If the sample size $n$ is greater than 30, the sample mean can be assumed to follow a normal distribution by the central limit theorem (Nowak & Collins, 2012). The PDF and CDF of the sample mean are shown in Fig. 2.4.

Figure 2.4: PDF and CDF of the sample mean.
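The $\sigma/\sqrt{n}$ result can be checked numerically. The following is a minimal sketch, not part of the original study: it assumes a normal population with hypothetical values $\mu = 10$, $\sigma = 2$ and $n = 40$, and the function name `sample_mean_spread` is introduced here for illustration only.

```python
import random
import statistics

def sample_mean_spread(mu, sigma, n, trials=5000, seed=0):
    """Draw `trials` samples of size n and return the mean and
    standard deviation of the resulting sample means."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.fmean(means), statistics.stdev(means)

# Hypothetical population parameters (assumptions for illustration).
mu, sigma, n = 10.0, 2.0, 40
mean_of_means, std_of_means = sample_mean_spread(mu, sigma, n)

print("mean of sample means:", round(mean_of_means, 3))
print("std of sample means: ", round(std_of_means, 3))
print("sigma / sqrt(n):     ", round(sigma / n ** 0.5, 3))
```

With these settings the empirical standard deviation of the sample means should lie close to $\sigma/\sqrt{n} \approx 0.316$, and it shrinks further as $n$ grows.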

The following algorithm summarises the general procedure for establishing the confidence interval of the mean $\bar{x}$ with known variance $\sigma^2$:

1. Choose the confidence level (1 − α).

2. The bound values of the sample mean can be calculated as:

$$\bar{x}_{\alpha/2} = F^{-1}(\alpha/2), \tag{2.29}$$

$$\bar{x}_{1-\alpha/2} = F^{-1}(1 - \alpha/2), \tag{2.30}$$

in which $F^{-1}(\cdot)$ is the inverse of the CDF of the distribution assumed for the sample mean, which in this case is the normal distribution, and $\bar{x}_{\alpha/2}$ and $\bar{x}_{1-\alpha/2}$ are the $100(\alpha/2)$th and $100(1 - \alpha/2)$th percentiles of the sample mean, respectively.
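The two steps above can be sketched in Python using only the standard library. This is an illustrative sketch under stated assumptions rather than the procedure used in the study: the sampling distribution is taken as normal with standard deviation $\sigma/\sqrt{n}$ and is centred on the observed sample mean, the data are synthetic, and the function name `mean_ci_known_sigma` is hypothetical.

```python
import random
from statistics import NormalDist, fmean

def mean_ci_known_sigma(data, sigma, alpha=0.05):
    """Return (lower, upper) bounds of the 100(1 - alpha)% confidence
    interval of the mean when the population sigma is known."""
    n = len(data)
    x_bar = fmean(data)
    # Assumption: the sample mean is normal with standard deviation
    # sigma / sqrt(n), centred on the observed sample mean.
    sampling_dist = NormalDist(mu=x_bar, sigma=sigma / n ** 0.5)
    lower = sampling_dist.inv_cdf(alpha / 2)       # Eq. (2.29)
    upper = sampling_dist.inv_cdf(1 - alpha / 2)   # Eq. (2.30)
    return lower, upper

# Synthetic data for illustration: 36 observations, known sigma = 4.
rng = random.Random(1)
sigma = 4.0
data = [rng.gauss(50.0, sigma) for _ in range(36)]

lower, upper = mean_ci_known_sigma(data, sigma, alpha=0.05)
print(f"95% confidence interval of the mean: [{lower:.2f}, {upper:.2f}]")
```

The interval width here is $2 \times 1.96 \times \sigma/\sqrt{n}$, so quadrupling the sample size halves the width.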

The accuracy of $\bar{x}$ as an estimate of $\mu$ depends on the sample size $n$. As $n$ increases, the sample mean $\bar{x}$ moves closer to the population mean $\mu$. In the limiting case, as $n \to \infty$, $\bar{x} \to \mu$.

If the sample size n is less than 30, the confidence interval of the sample mean needs to be determined by a non-parametric or a bootstrap method. The present study will focus on the bootstrap method which is described in the next section.

2.3.2.2 The bootstrap method

The bootstrap method is a computer-intensive re-sampling technique that was introduced by Efron (1979) for making certain kinds of statistical inference. The philosophy underlying the approach is that, when little is known about the distribution and the confidence interval, the observed data contain all the available information about the underlying statistical characteristics of the data. Two asymptotic concepts justify this idea theoretically. Firstly, the empirical distribution function (EDF) of the sample approaches the population cumulative distribution function (CDF) as the sample size n increases to infinity. Secondly, the bootstrap estimate of a statistic approaches the value implied by the EDF as the number of re-samples B approaches infinity. Re-sampling from the sample is therefore the best available guide to the distribution and to the statistic of interest, such as the mean, median or standard deviation of the data.

In general, the bootstrap can be classified into two types: the non-parametric and the parametric method. In the non-parametric bootstrap, the original sample is regarded as a miniature of the population that retains all the available information concerning the population. The original sample is therefore treated as a virtual population, and the re-sampling procedure draws repeatedly from the original sample. In the parametric bootstrap, a particular mathematical model, such as a density or mass function, must first be fitted to the original sample data. The bootstrap samples are then drawn from this fitted model (Hardle & Mammen, 1993). The present study will focus on the non-parametric bootstrap method.
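The difference between the two types can be illustrated with a short sketch. The data values and function names below are hypothetical, and the normal model used for the parametric case is only an assumption chosen for illustration; any fitted density or mass function could take its place.

```python
import random
from statistics import fmean, stdev

def nonparametric_resample(data, rng):
    """Draw len(data) values from the data themselves, with replacement."""
    return rng.choices(data, k=len(data))

def parametric_resample(data, rng):
    """Draw len(data) values from a model fitted to the data.
    Assumption: a normal model fitted by the sample mean and
    sample standard deviation."""
    mu_hat, sigma_hat = fmean(data), stdev(data)
    return [rng.gauss(mu_hat, sigma_hat) for _ in data]

# Hypothetical observations for illustration.
sample = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4]
rng = random.Random(0)

np_sample = nonparametric_resample(sample, rng)
p_sample = parametric_resample(sample, rng)

print("non-parametric:", np_sample)   # only values present in `sample`
print("parametric:    ", [round(v, 2) for v in p_sample])
```

The non-parametric re-sample can only contain values that were observed, possibly repeated, whereas the parametric re-sample can contain new values generated by the fitted model.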

Each bootstrapped sample must have the same size as the original sample. It may include some original data points more than once, so the statistical parameters of each bootstrapped sample differ slightly from those of the original sample. All the data in the original sample are assumed to be independent and identically distributed. The distribution and confidence interval of the statistic of interest can then be estimated from these re-sampled data sets (Yu, 2003).

The bootstrap method is essentially a sampling method (sampling with replacement) which can be used to estimate the variation of point estimates (Efron, 1979). Each draw selects one observation at random with replacement, and every observation is assumed to have the same probability of being selected. Reliable results require a sufficient number of re-samples. Between 1000 and 2000 re-samples are suggested for estimating 90-95% confidence intervals (C.I.) of distribution parameters (Davison, 1997; Efron & Tibshirani, 1994). Bootstrap sampling is particularly useful when examining or collecting data for the entire population is impossible or too costly.

Given sample values $x_1, x_2, \ldots, x_n$ from a population $X$, the procedure for computing the $100(1 - 2\alpha)\%$ confidence interval for the mean $\mu$ by the non-parametric bootstrap can be summarised as follows:

1. Compute a point estimate, $\hat{\mu}$, for $\mu$ from the original dataset as:

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

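2. Construct a bootstrap sample $(x_1^*, x_2^*, \ldots, x_n^*)$ by drawing $n$ values from the original sample with replacement. Compute its mean $\hat{\mu}^*$ and the bootstrap difference $\delta = \hat{\mu}^* - \hat{\mu}$.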

3. Repeat Step 2 B times to obtain $(\delta_1, \ldots, \delta_B)$, in which $\delta_i$ represents the bootstrap difference for the $i$th bootstrap sample.

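4. Determine the $100\alpha$th and $100(1 - \alpha)$th percentiles of $(\delta_1, \ldots, \delta_B)$, denoted by $\delta_{\alpha}$ and $\delta_{1-\alpha}$. The $100(1 - 2\alpha)\%$ confidence interval for $\mu$ is then calculated as $[\hat{\mu} - \delta_{1-\alpha},\ \hat{\mu} - \delta_{\alpha}]$.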

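5. The standard error of the sample mean can be determined as:

$$s = \sqrt{\frac{1}{B-1}\sum_{i=1}^{B}\left(\hat{\mu}_i^* - \bar{\mu}^*\right)^2},$$

in which $\hat{\mu}_i^*$ is the mean of the $i$th bootstrap sample and $\bar{\mu}^*$ is the average of the B bootstrap means.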
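The five steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the data are hypothetical, the names `bootstrap_mean_ci` and `percentile` are introduced here, percentiles are obtained by linear interpolation, and the standard error is taken as the standard deviation of the bootstrap means with a $B - 1$ divisor. The interval is ordered so that its lower endpoint is $\hat{\mu} - \delta_{1-\alpha}$ and its upper endpoint is $\hat{\mu} - \delta_{\alpha}$.

```python
import random
from statistics import fmean, stdev

def percentile(sorted_values, q):
    """Percentile of pre-sorted values by linear interpolation, q in [0, 1]."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def bootstrap_mean_ci(data, alpha=0.05, B=2000, seed=0):
    """Return (mu_hat, (lower, upper), se) for the 100(1 - 2*alpha)%
    non-parametric bootstrap confidence interval of the mean."""
    rng = random.Random(seed)
    n = len(data)
    mu_hat = fmean(data)                                  # Step 1
    # Steps 2-3: B re-samples of size n, drawn with replacement.
    boot_means = [fmean(rng.choices(data, k=n)) for _ in range(B)]
    deltas = sorted(m - mu_hat for m in boot_means)
    d_alpha = percentile(deltas, alpha)                   # Step 4
    d_one_minus_alpha = percentile(deltas, 1 - alpha)
    ci = (mu_hat - d_one_minus_alpha, mu_hat - d_alpha)
    se = stdev(boot_means)                                # Step 5
    return mu_hat, ci, se

# Hypothetical small sample (n < 30) for illustration.
data = [12.1, 9.8, 11.4, 10.9, 13.2, 8.7, 10.1, 12.6, 9.5, 11.8, 10.4, 11.1]
mu_hat, ci, se = bootstrap_mean_ci(data, alpha=0.05, B=2000)

print(f"point estimate: {mu_hat:.3f}")
print(f"90% bootstrap CI: [{ci[0]:.3f}, {ci[1]:.3f}]")
print(f"bootstrap standard error: {se:.3f}")
```

With `alpha=0.05` the call returns a 90% interval, because the procedure is parameterised as $100(1 - 2\alpha)\%$. The choice `B=2000` follows the range of re-samples suggested in the text.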

The bootstrap method is useful when the distribution type of the population is unknown, or when deriving its statistical parameters analytically is too complicated (Felsenstein, 1985). Its only cost is computing time: once the bootstrap sampling algorithm has been set up in software, the computer does all the work. Matlab has built-in functions for bootstrap sampling, such as 'bootstrp'. With the current development of computer technology, the bootstrap is becoming increasingly popular.