4 The unwanted consequences
4.5 Creating the response surface equation
Creating the response surface equation for computer model outputs requires information on both the input values and the computer output results. Regression analysis is used to obtain the analytical relationship between the input parameters and their corresponding output (Ang et al., 1975).
Several methods are available to create this analytical equation, such as the method of least squares and the method of maximum likelihood. The response surfaces used in this thesis were derived using the method of least squares.
4.5.1 The linear two-dimensional case
The simplest case of curve fitting is linear regression analysis, in which the data are represented by a straight line. The task is to estimate λ and δ in the expression
y = λ + δx + e [4.3]
giving the estimate of the real variable y (Figure 4.1). The equation can also be interpreted as providing the conditional estimate E(y|x). The term e represents the uncertainty in y. The regression equation does not have to be restricted to two variables. Multiple-variable regression analysis is similar, but its theoretical background will not be presented here.
[Figure: data points scattered about a regression line, with one positive and one negative residual marked.]
Figure 4.1. Simple linear regression.
The regression analysis introduces new uncertainties, as the parameters λ and δ can only be estimated and are therefore associated with uncertainty, described for example by a mean and a standard deviation. This means that λ, δ and e are all subject to uncertainty as a result of the regression analysis.
The method of least squares works with any curve characteristics as the only objective is to minimise the difference between the sample data and the predicted surface. The important issue is to find a relation that describes the output in the best way and with as small a deviation from the data as possible.
The residuals, i.e. the vertical differences between the data and the regression line, are balanced on both sides of the regression line. This is a result of the method, which minimises the sum of the squares of the residuals. With the constant term λ included in the model, the sum of the residuals is equal to 0.
The residual variance, s_e^2, is a measure of how well the regression line fits the data. It shows the variation around the regression line. The variable e in Equation [4.3] is usually modelled as a normal distribution, N(0, s_e).
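The least-squares estimates described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the thesis: the function name `fit_line`, the sample data and the choice of n − 2 degrees of freedom for the residual variance are all assumptions made for the example.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares estimates for y = lam + delta * x + e (Equation [4.3])."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Closed-form least-squares slope and intercept.
    delta = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    lam = y_mean - delta * x_mean
    # Vertical differences between the data and the fitted line.
    residuals = y - (lam + delta * x)
    # Residual variance; n - 2 degrees of freedom is an assumption here.
    s_e2 = np.sum(residuals ** 2) / (len(x) - 2)
    return lam, delta, residuals, s_e2

# Hypothetical sample data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
lam, delta, residuals, s_e2 = fit_line(x, y)
```

For a model that includes the constant term, `residuals.sum()` is zero up to rounding error, which is the property noted in the text.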
Figure 4.2 shows the residuals from one of the sample calculations presented in Chapter 7. The regression equation estimates the time before lethal conditions arise in the corridor of a hospital ward. All data points should preferably be located close to the solid horizontal line, which represents the regression line. The vertical distances between the data points and the line show the deviation between the computer results for this variable and the expression used in the uncertainty analysis. The independent variable, the fire growth rate αf, is shown on the horizontal axis. The values on the vertical axis are logarithmic (ln(tucorr)) for reasons explained in Section 4.5.2.
[Figure: residuals plotted against the fire growth rate αf.]
Figure 4.2. Residuals from the regression analysis of time before lethal conditions arise in a health care ward corridor as a function of the fire growth rate αf (kW/s²). No sprinklers are activated. The dotted lines indicate ± one s_e.
The residuals are in the same units as the variable y, which means that the values from different regression analyses cannot be compared directly to determine whether or not a regression shows good agreement. A normalised measure of the deviation is the correlation coefficient. The correlation coefficient, r, is a measure of how close the data are to a linear relationship, and is defined as
$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$ [4.4]
The correlation coefficient can vary between -1 and +1, and values close to the outer limits of this interval represent good agreement. The sign indicates whether the correlation is positive or negative, see Figure 4.3.
[Figure: two scatter plots, one with r > 0 and one with r < 0.]
Figure 4.3. Correlation coefficients for a sample.
In multiple linear regression analysis, the coefficient of determination, R2, is used instead of the correlation coefficient. For the linear case with only one independent variable, r2 = R2.
$$ R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} $$ [4.5]
where \hat{y}_i is the value predicted by the regression equation and \bar{y} is the sample mean of y. The coefficient of determination is a measure of how much of the variation in y is explained by the regression model. The value should be as close as possible to 1. It is clear that the uncertainty in the prediction of y will depend on the sample size, n. Increasing the sample size decreases the overall uncertainty. The coefficient of determination, R2, for the analysis presented in Figure 4.2 above is 0.97, which indicates good agreement between the regression line and the data.
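As an illustration of Equations [4.4] and [4.5], the sketch below computes both measures for a small data set and confirms that r2 = R2 for a straight-line fit. It assumes that [4.5] is the ratio of the explained to the total sum of squares; the function names and the data are hypothetical.

```python
import numpy as np

def correlation_coefficient(x, y):
    """Sample correlation coefficient r, following Equation [4.4]."""
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

def coefficient_of_determination(y, y_pred):
    """R^2 as explained sum of squares over total sum of squares."""
    y_mean = y.mean()
    return np.sum((y_pred - y_mean) ** 2) / np.sum((y - y_mean) ** 2)

# Hypothetical sample data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# np.polyfit returns the highest-order coefficient first: slope, then intercept.
delta, lam = np.polyfit(x, y, 1)
y_pred = lam + delta * x

r = correlation_coefficient(x, y)
R2 = coefficient_of_determination(y, y_pred)
```

The equality of `r ** 2` and `R2` holds only for a least-squares line with a constant term; for other fitted curves the two quantities need not coincide.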
One of the problems that may occur when using a response surface instead of the actual computer output is that the residuals may increase as the value of one or more variables is increased. If this happens, the uncertainty introduced by the regression analysis may have to be considered important.
As the regression analysis is used together with other variables that are subject to uncertainty, the uncertainties introduced by the regression analysis must be compared to the other uncertainties. In most cases these newly introduced uncertainties can be omitted, as their contribution to the overall uncertainty can be considered small.
4.5.2 Nonlinear problems
Linear problems are rare in most engineering disciplines. Most models result in nonlinear solutions, and traditional linear regression then gives a poor representation of the data. There are two ways of solving this problem: optimising a nonlinear expression, or transforming the model into a form that is linear, at least locally in the area of interest.
Most nonlinear solutions are based on approximating the data by a polynomial of some degree, for example a 2nd order polynomial. The curve-fitting technique is more or less the same as that described above. This approach is normally considered rather laborious and other means are preferable if they are available.
The second technique transforms the data into a form in which the transformed variables are linear. One such transformation is to use the logarithmic values of the data. Other transformations such as squares or exponentials can also be considered. If the transformed values appear to be linearly dependent, linear regression analysis can be performed. The coefficient of determination can be used to determine the agreement between the data and the response surface for both the nonlinear and the transformed solutions.
There are two good reasons for using the logarithmic values in some engineering applications.
1. In some cases the variation in the input variables is several orders of magnitude. The values located close to the upper limit of the response surface output will then influence the parameters in the equation more than others.
2. For some parameter combinations, a polynomial relationship can result in negative responses which are physically impossible. This must definitely be avoided.
It appears that the linear approximation of the logarithmic data in determining the response surfaces is an appropriate choice for the cases considered in this thesis. The coefficient of determination, R2, is generally very high in all equations. The large difference in magnitude of the variables is drastically reduced, and no negative responses can be derived using this approach. The response surface will have the following general appearance:
$$ y = \exp(\lambda) \prod_{i=1}^{n} x_i^{\delta_i} $$ [4.6]
where n is the number of variables, and λ and δ_i are the linear regression parameters. A problem arises when the uncertainties in λ, δ and e are to be transformed. If a numerical procedure is used for the uncertainty analysis, this will normally not be a problem. For an analytical method using hand calculations, these new uncertainties become a problem which might cause exclusion of the method. An approximate solution can be used, i.e. excluding these uncertainties, or special software capable of considering regression parameter uncertainty can be used.
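The logarithmic transformation behind Equation [4.6] can be sketched as follows: taking logarithms gives ln(y) = λ + Σ δ_i ln(x_i), which is linear in the parameters and can be fitted by ordinary least squares. The function names and the synthetic data below are assumptions made for illustration; they are not the response surfaces or the software used in the thesis.

```python
import numpy as np

def fit_power_law(X, y):
    """Fit ln(y) = lam + sum_i delta_i * ln(x_i) by linear least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: a column of ones for lam, then ln(x_i) for each variable.
    A = np.column_stack([np.ones(len(y)), np.log(X)])
    coeffs, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
    return coeffs[0], coeffs[1:]

def predict(lam, deltas, X):
    """Evaluate y = exp(lam) * prod_i x_i ** delta_i (Equation [4.6])."""
    X = np.asarray(X, dtype=float)
    return np.exp(lam) * np.prod(X ** deltas, axis=1)

# Hypothetical, noise-free data from a known power law, spanning
# several orders of magnitude in the inputs.
rng = np.random.default_rng(0)
X = rng.uniform(0.01, 10.0, size=(50, 2))
y = 3.0 * X[:, 0] ** -0.5 * X[:, 1] ** 0.8

lam, deltas = fit_power_law(X, y)
y_fit = predict(lam, deltas, X)
```

Because the prediction is an exponential multiplied by powers of positive inputs, it is positive for every parameter combination, which is the second reason given above for working with logarithmic values.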
In the risk analysis presented here, both the standard QRA and the extended QRA, these uncertainties are omitted, as they are small in comparison with the other variable uncertainties. To be able to draw this conclusion, the single subscenario uncertainty analysis was performed both with and without the uncertainty information in λ, δ and e.
4.5.3 Design of experiments
Creating a response surface to represent the output from a computer program requires a set of outputs from the program, together with the corresponding input data. Several sampling methods are available which describe how the calculation procedure should be designed in order to minimise the total number of computer runs. The most extensive sampling method is the factorial method, which requires an output for every combination of input variables, Figure 4.4. The figure illustrates how computer outputs are calculated for every combination of the input variable values. The input data for the two variables are represented by a1 to a6 and b1 to b4.
[Figure: grid of circles with Variable A (a1 to a6) on one axis and Variable B (b1 to b4) on the other.]
Figure 4.4. A 4 by 6 level factorial sample. A circle indicates that an output is calculated.
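A complete factorial sample such as the one in Figure 4.4 can be enumerated as the Cartesian product of the levels. The sketch below is illustrative only; the function name `full_factorial` and the numerical level values are hypothetical.

```python
from itertools import product

def full_factorial(levels):
    """Return one dict per computer run, covering every combination of levels."""
    names = list(levels)
    combos = product(*(levels[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

# Hypothetical level values: six levels for variable A, four for variable B.
levels = {
    "A": [1, 2, 3, 4, 5, 6],
    "B": [10, 20, 30, 40],
}
runs = full_factorial(levels)
```

The number of runs is the product of the numbers of levels, here 6 × 4 = 24, which is why the method becomes expensive as variables and levels are added.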
If the number of input variables and the different levels for each variable are large, methods are available which can reduce the number of computer runs by selecting certain combinations of the variables and corresponding levels. In these cases, methods such as the fractional factorial and Latin square methods (Vardeman, 1993) should be used. The term level is used here to define the number of values each variable will have in the calculation process defining the various outputs. Using these methods will inevitably lead to some loss of information, but that has to be weighed against the time gained through the smaller number of computer simulations.
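To indicate how such a reduced design saves runs, the sketch below builds a Latin square design for three variables with k levels each, using a simple cyclic construction. The cyclic rule and the function name are assumptions chosen for brevity; they are not taken from Vardeman (1993) or from the thesis, where complete factorial studies were used.

```python
def latin_square_design(k):
    """Runs (a, b, c) for three variables with k levels each.

    The level index of the third variable is assigned cyclically, so every
    level of c appears exactly once with each level of a and of b.
    """
    return [(a, b, (a + b) % k) for a in range(k) for b in range(k)]

# Four levels per variable: 16 runs instead of the 64 of a full factorial.
runs = latin_square_design(4)
```

The design needs k² runs rather than k³, at the price of not being able to separate interaction effects between the variables, which is the loss of information mentioned above.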
In this thesis, both the number of variables and the number of levels are considered low, and complete factorial studies have been performed. Usually, only one or a few variables are used as uncertainty variables in calculating the response. In the computer programs, several other input variables are needed, but these are treated as deterministic constants without any uncertainty. This is, of course, a simplification, but it can be explained by the choice of subscenarios. Variables with great influence on the results, apart from the fire growth rate αf, are the dimensions of the building. As the example risk analysis calculations have been performed on a standardised building with fixed dimensions, this simplification is justified.