
6.2 Statistical Investigation

6.2.2 Data Exploration

Wetherill [74] also defines some features to be aware of when exploring the characteristics of the data; some of these are briefly described below. Not all of the features are applicable to the current study; however, they have been kept in mind when examining the slam data. If any of these features are identified, it is best practice to attempt to understand why that particular feature is present and, if necessary, to tailor the regression to suit the nature of the data.

Linear relationships between explanatory variables are a common issue in regression analysis. There are several methods to combat multicollinearity among the variables; with careful selection of variables it is not thought to be a significant issue in this project.

Replication of observations refers to repeated observations in the response variable space. This can be useful in estimating the variance of the underlying error, which gives an indication of repeatability. This feature has been observed in the experimental data; for example, a given slam load has resulted from a range of relative vertical velocities. This large variance in the observed data is not the result of measurement error; rather, it is an indication of the non-deterministic nature of attempting to predict the slam load from relative vertical velocity alone, and other variables may be required to reduce the error variance.

Slam events could possibly be influenced by time trends to some extent. Slam events have been visually noticed to occur on consecutive waves in packets of three or four with the vessel motions becoming more violent after each wave encounter until the final, most severe slam effectively halts the heave and pitch motions of the vessel. This suggests that slams do influence motions and a periodic factor could be included in the slam model to emulate this observation if necessary. This observation is purely subjective at the moment; a future study focusing on ‘slam clusters’ would be valuable in investigating this phenomenon further.

Boundary points within the data should also be identified. One obvious boundary is the slam load; it is never observed to be negative. The regression model must be examined carefully to ensure it behaves correctly close to the boundary regions; if it does not, the model may require modification, or sections of data should be ignored to prevent poor fitting in these regions. An example of a poor-fitting model can be seen in Figure 6.4, where the relative vertical velocity at the slam instant is used to predict the slam load in a linear fashion. When the relative vertical velocity is less than -3 m/s, a negative slam load results. Unrealistic outcomes such as this must be avoided; if the model shown in Figure 6.4 is to be used, a piecewise function could be applied in which a second model is used outside the ‘operating point’ of the first model.

[Figure 6.4: scatter plot of experimental data with a linear regression line. Horizontal axis: full scale relative vertical velocity (m/s), from −10 to 10. Vertical axis: full scale slam load (N), from −2 to 4 ×10^7.]

Figure 6.4: A poor regression model showing unrealistic negative slam loads when the relative vertical velocity at slam instant is less than -3 m/s.
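The piecewise idea described above can be sketched as follows. This is only an illustration under stated assumptions: the data values are hypothetical (chosen so that the fitted line crosses zero at -3 m/s, as in Figure 6.4), and the second model is assumed to be a simple zero load outside the operating range of the linear fit. The names `fit_line` and `predict_piecewise` are not taken from the original work.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_piecewise(v, a, b):
    """Linear model inside its operating range, zero load outside it.

    The zero-load branch is an assumed 'second model'; it only enforces
    the physical boundary that a slam load cannot be negative.
    """
    return max(0.0, a + b * v)


# Hypothetical data: relative vertical velocity (m/s) and slam load (N).
velocity = [-2.0, 0.0, 2.0, 4.0, 6.0]
load = [0.2e7, 0.6e7, 1.0e7, 1.4e7, 1.8e7]

a, b = fit_line(velocity, load)
print("Plain linear prediction at -5 m/s:", a + b * (-5.0))
print("Piecewise prediction at -5 m/s:   ", predict_piecewise(-5.0, a, b))
```

The plain linear model returns a negative load at -5 m/s, while the piecewise version returns zero there and is unchanged wherever the linear prediction is positive.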

In some analyses system change points are apparent; these occur when the system response changes at some point. In these cases a better approach may be to alter the regression model to account for this change, or to have multiple models, one for each system. An example of this is the modified Wagner method proposed by Whelan [75] (recall Section 2.2.1 and Figure 2.5), where one model is applied during the first stage of the method and a separate one is applied for the second stage.

Outliers may be the result of measurement error, or may arise when sampling from a population with a heavy-tailed distribution. Care must be taken when analysing data with outliers; the outlying points may be neglected or robust methods employed. In this work outliers in the data are intentionally included in the regression analysis, as the primary goal of the work is to model the slam behaviour, which was found to be inherently heavy-tailed.

The chosen potential slam prediction variables are shown in Table 6.1. Originally 25 different variables were collected, such as the forward and aft demihull vertical bending moment (VBM), the maximum elevations measured on the boat-mounted wave probes prior to and at the slam instant, and the maximum immersion at the centre bow truncation (CBT) after a slam event. Preliminary correlation analysis, inappropriateness and difficulties in adapting a variable for use in the time-domain method quickly eliminated some variables, leaving ten potentially important indicators of slam characteristics. For example, the vertical bending moment was shown to be a good indicator of slam load (see Section 5.4.6); however, it is more of an output parameter, since the demihull vertical bending moment is caused by the centre bow slam load and not vice versa. The ten remaining variables, shown in Table 6.1, are: the maximum slam load on the centre bow (once inertia and the global load are accounted for, see Section 5.3), the relative displacement of the wave surface and the centre bow truncation at the slam instant (immersion, measured in metres), the instantaneous (at slam instant) and maximum relative vertical velocity at the CBT, the pitch angle and pitch velocity prior to and at the slam event, the location of the slam (measured from the transom) and the vessel speed.

All variables in Table 6.1 that require a reference point, such as relative vertical velocity and immersion, are taken from the centre bow truncation by convention. Variables referring to a maximum are always measured prior to the slam, while all others are taken at the instant the slam is identified (the time at which the surface pressure was measured to be maximum; see Section 5.2 for the slam identification procedure). Variables that occur post-slam (such as the maximum immersion after the slam event) are neglected because they are inappropriate in a time-domain method, where only past and present data are known.

A preliminary correlation analysis was conducted on all 25 variables by calculating Pearson’s correlation coefficient, r, and Spearman’s rank correlation coefficient, ρ, for each variable against each other variable. This is a first-pass indicator of the importance of a particular variable with regard to slam load, and it was also used to identify correlations between the explanatory variables. If high correlations exist between the explanatory variables, then multicollinearity may become an issue, resulting in unstable estimated regression coefficients. The effects of multicollinearity can be reduced by using a ridge regression technique.
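A minimal sketch of how a ridge penalty can stabilise the coefficients when two explanatory variables are nearly collinear is given below. The data, the penalty value `lam` and the function name are hypothetical, and the predictors and response are assumed to be centred so that no intercept is needed; this is not the implementation used in the original work.

```python
def ridge_two_predictors(x1, x2, y, lam):
    """Solve (X'X + lam*I) beta = X'y for two centred predictors.

    With lam = 0 this reduces to ordinary least squares.
    """
    s11 = sum(a * a for a in x1) + lam
    s22 = sum(b * b for b in x2) + lam
    s12 = sum(a * b for a, b in zip(x1, x2))
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    # Closed-form solution of the 2x2 linear system.
    det = s11 * s22 - s12 * s12
    beta1 = (s22 * t1 - s12 * t2) / det
    beta2 = (s11 * t2 - s12 * t1) / det
    return beta1, beta2


# Hypothetical, centred data: x2 is almost a copy of x1 (strong collinearity).
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-2.0, -1.1, 0.1, 0.9, 2.1]
y = [-4.1, -2.0, 0.2, 1.8, 4.1]

ols = ridge_two_predictors(x1, x2, y, lam=0.0)
ridge = ridge_two_predictors(x1, x2, y, lam=1.0)
print("OLS coefficients:  ", ols)
print("Ridge coefficients:", ridge)
```

With these numbers the OLS coefficients differ noticeably between the two near-identical predictors, while the ridge coefficients are pulled closer together and have a smaller overall magnitude. The price is a small bias, so the penalty would normally be tuned rather than fixed.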

Perhaps the most important axiom to remember when performing a statistical analysis is “Correlation does not imply causation” [74]. If a strong correlation is detected between two variables, it cannot be concluded from the correlation analysis alone that one variable causes the other one. Other variables that have not been included in the analysis may be responsible for the observed correlation, or the variables may be collinear.

Table 6.1: Slam characteristic variables.

Symbol    Description                                               Units
Fs        Slam load on centre bow                                   N
I         Centre bow immersion at slam instant                      m
V         Relative vertical velocity at slam instant                m/s
Vmax      Maximum relative vertical velocity prior to slam event    m/s
x50       Pitch angle at slam instant                               rad
x50max    Maximum pitch angle prior to slam event                   rad
ẋ50       Pitch velocity at slam instant                            rad/s
ẋ50max    Maximum pitch velocity prior to slam event                rad/s
Loc       Slam location from transom                                m
U         Vessel speed                                              m/s

Spearman’s rank correlation method has the advantage over Pearson’s in that it is better suited to detecting non-linear trends in data. Where Pearson’s r reflects the linear relationship between two variables, Spearman’s ρ describes the degree to which the relationship between the variables is monotonically increasing (or decreasing). However, some information is lost in Spearman’s method, as the magnitude of the data is not considered, only its rank within the data set; this does, however, make it more robust to outliers.
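The difference between the two coefficients can be illustrated with a short sketch. The data below are hypothetical (a monotonic but strongly non-linear relationship), and the simple ranking helper assumes there are no tied values; a statistics package would normally be used for real data.

```python
import math


def pearson(x, y):
    """Pearson's r: strength of the linear relationship between x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks 1..n of the values (this simple version assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


def spearman(x, y):
    """Spearman's rho: Pearson's r applied to the ranks of the data."""
    return pearson(ranks(x), ranks(y))


# Hypothetical monotonic but strongly non-linear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [math.exp(v) for v in x]

print("Pearson r:   ", round(pearson(x, y), 3))
print("Spearman rho:", round(spearman(x, y), 3))
```

Spearman’s ρ equals 1 for this data because y always increases with x, whereas Pearson’s r is noticeably lower because the relationship is not a straight line.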

The Pearson product-moment correlation coefficient, r, determines the degree to which a linear relationship exists between two variables (the predictor variable, X, and the criterion variable, Y). The definition formula for the Pearson product-moment correlation coefficient is given by Equation 6.5:

\[ r_{XY} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N s_X s_Y} \tag{6.5} \]

Here N is the number of observations, x_i and y_i are the ith observations of X and Y respectively, x̄ and ȳ are the means, and s_X and s_Y are the standard deviations of X and Y.
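As a quick check of Equation 6.5, consider a small made-up data set (not taken from the slam data). The working below assumes that s_X and s_Y are computed with a divisor of N, so that they are consistent with the N in the denominator of the equation:

```latex
\begin{align*}
X &= (1, 2, 3, 4), \quad Y = (2, 4, 5, 9), \quad N = 4, \quad \bar{x} = 2.5, \quad \bar{y} = 5 \\
\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) &= (-1.5)(-3) + (-0.5)(-1) + (0.5)(0) + (1.5)(4) = 11 \\
s_X &= \sqrt{\tfrac{1}{4}(2.25 + 0.25 + 0.25 + 2.25)} = \sqrt{1.25}, \quad
s_Y = \sqrt{\tfrac{1}{4}(9 + 1 + 0 + 16)} = \sqrt{6.5} \\
r_{XY} &= \frac{11}{4 \sqrt{1.25}\,\sqrt{6.5}} = \frac{11}{\sqrt{130}} \approx 0.965
\end{align*}
```

The value is close to +1, consistent with Y rising almost linearly with X in this example.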

The correlation coefficient r is an estimate of the correlation of the population from which the variables were sampled; it always falls within the range −1 ≤ r ≤ +1. The absolute value of r indicates the ‘strength’ of correlation between the two variables: as |r| → 1 the linear relationship becomes stronger. As |r| → 0 the predictive relationship weakens, and the use of X to predict Y may be no more accurate than a prediction based on a random process. The sign of r shows the direction of the relationship: a positive r implies a positive association, whereas a negative r suggests a negative association.

The Pearson product-moment correlation coefficient method assumes that a linear association (i.e. y = mx + c) exists between the two variables. If the relationship is better described by a higher order function, the value of r may not represent the true extent of correlation between the variables. That being said, if the computed r is close to 0, a higher order correlation is considered very unlikely [61].

The assumptions of the Pearson product-moment correlation coefficient are:

1. The sample of n measurements is randomly selected from the population it represents.

2. The two variables have a bivariate normal distribution.

3. Homoscedasticity (constant variance of the residuals).

4. Nonautoregression (no autocorrelation, also known as serial correlation).

These assumptions are quite standard in statistical analysis; similar assumptions are made when applying ordinary least squares (OLS) regression models. Heteroscedasticity was detected in the preliminary slam module described in Section 6.1.1, and an attempt was made to correct this by scaling the residuals with the predicted slam load.
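A crude way to look for heteroscedasticity, and to see the effect of scaling, is sketched below. It compares the residual variance for the larger predictions with that for the smaller predictions. The data are hypothetical, the function name is illustrative, and dividing each residual by its predicted value is only one assumed reading of ‘scaling the residuals with the predicted slam load’; the exact procedure used in the preliminary slam module is not given here.

```python
import statistics


def variance_ratio(predicted, residuals):
    """Residual variance for the upper half of predictions over the lower half.

    A ratio far from 1 hints that the residual spread changes with the
    predicted value, i.e. that the data may be heteroscedastic.
    """
    pairs = sorted(zip(predicted, residuals))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[-half:]]
    return statistics.pvariance(high) / statistics.pvariance(low)


# Hypothetical residuals whose spread grows with the predicted load.
predicted = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
residuals = [1.0, -2.0, 3.0, -4.0, 5.0, -6.0, 7.0, -8.0]

ratio = variance_ratio(predicted, residuals)

# Assumed correction: scale each residual by its predicted value.
scaled = [r / p for r, p in zip(residuals, predicted)]
scaled_ratio = variance_ratio(predicted, scaled)

print("Variance ratio before scaling:", round(ratio, 2))
print("Variance ratio after scaling: ", round(scaled_ratio, 2))
```

For this constructed example the ratio is well above 1 before scaling and close to 1 afterwards. Formal tests would be preferable to this informal check when drawing conclusions from real data.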

It is important to remember that the correlation coefficient is not a robust statistic; therefore r can be greatly influenced by high leverage points (outliers), and some caution must be exercised when excluding potential explanatory variables from the regression analysis based primarily on r values.

The correlation coefficients of all the chosen variables in Table 6.1 are shown in Table 6.2. These variables are the most likely candidates for inclusion in the slam module. Slam load, Fs, displays the highest correlation with relative vertical velocity, V (r = 0.58), and it also has moderate correlations with pitch angle (both maximum and instantaneous) and maximum pitch velocity prior to slam (ẋ50max). Vessel speed, U, has negligible correlation with slam load and poor correlation with the other variables. This is attributed to testing at only three different speeds; however, it will be included in the slam load prediction method, as the inclusion of this variable is found to reduce the residual variance [27].

Table 6.3 shows Spearman’s rank correlation coefficients for all the chosen variables. As noted above, these coefficients represent the degree to which one variable monotonically increases (or decreases) with another. Comparing Table 6.2 with Table 6.3, it can be seen that if two variables are moderately correlated according to Pearson, they generally achieve a similar Spearman coefficient as well. One notable exception is centre bow immersion, I; its Pearson correlation coefficient with slam load is rather small (0.36) while the Spearman coefficient is considerably greater (0.68), suggesting that although the slam load tends to increase with immersion, the relationship is not modelled well linearly.

Homoscedasticity refers to the distribution of residuals. The Pearson correlation coefficient assumes that the variance of the residuals is constant; this means the strength of correlation between X and Y is equal along the entire range of both variables [61]. If this is not so, the data are said to be heteroscedastic. Heteroscedasticity does not cause OLS regression coefficient estimates to be biased; however, it may cause the standard errors of the estimates to be biased. This makes inferences drawn from the data unreliable.

Table 6.2: Pearson product-moment correlation coefficients, r, for some selected variables. Correlations greater than 0.4 are highlighted in bold.

              Fs       I    Vmax       V     Loc     x50  x50max     ẋ50  ẋ50max       U
Fs          1.00
I           0.36    1.00
Vmax        0.45    0.27    1.00
V           0.58    0.31    0.19    1.00
Loc         0.28    0.16    0.06    0.23    1.00
x50        -0.48   -0.26   -0.07   -0.49   -0.41    1.00
x50max     -0.46   -0.25    0.05   -0.44   -0.39    0.93    1.00
ẋ50         0.17    0.24   -0.26   -0.01    0.18    0.05   -0.18    1.00
ẋ50max     -0.54   -0.32   -0.17   -0.49   -0.42    0.94    0.89    0.02    1.00
U           0.00   -0.21    0.08   -0.15    0.10    0.28    0.30    0.21    0.21    1.00

Table 6.3: Spearman’s rank correlation coefficients, ρ, for some selected variables. Correlations greater than 0.4 are highlighted in bold.

              Fs       I    Vmax       V     Loc     x50  x50max     ẋ50  ẋ50max       U
Fs          1.00
I           0.68    1.00
Vmax        0.44    0.26    1.00
V           0.61    0.61    0.17    1.00
Loc         0.29    0.29    0.03    0.24    1.00
x50        -0.48   -0.54   -0.06   -0.49   -0.44    1.00
x50max     -0.46   -0.57    0.06   -0.46   -0.45    0.94    1.00
ẋ50         0.18    0.18   -0.28    0.01    0.22    0.00   -0.20    1.00
ẋ50max     -0.54   -0.56   -0.15   -0.49   -0.45    0.94    0.90   -0.02    1.00
U           0.00   -0.42    0.10   -0.17    0.12    0.29    0.30    0.19    0.22    1.00

[Figure 6.5: scatter-plot matrix of the variables F, I, V, X50, dX50/dt and U.]

Figure 6.5: Scatter plots comparing variable correlations with each other. The diagonal plots show histograms of each variable, giving an indication of its distribution.

‘Instantaneous’ variables are chosen over the ‘maximum prior to slam’ variables, as it is more realistic, and preferable, to be able to predict the slamming behaviour of the vessel purely from the current state of the ship, as opposed to recalling its past state. Relationships between the instantaneous variables are shown graphically in Figure 6.5, where all variables of note are plotted against each other, forming a matrix of plots. Where a variable would be plotted against itself, a histogram showing the distribution of the variable is shown instead.

Further to the correlation coefficients, basic statistics of the chosen variables are shown in Tables 6.4 and 6.5. These tables effectively show a summary of each variable’s distribution. The minimum, maximum, range and lower and upper quartiles (25th and 75th percentiles respectively) are included in Table 6.4, while the interquartile range (IQR), median, mean, variance and standard deviation are shown in Table 6.5.

Table 6.4: Basic statistics of selected slam variables

Variable  Units     Minimum     Maximum       Range  Lower quartile  Upper quartile
Fs        N          6.5186    388.60      382.08           42.7380        118.65
I         mm       -13.8229     79.8825     93.7054         39.1547         50.7344
V         m/s       -1.3554      1.3154      2.6708          0.0218          0.3285
x50       rad       -0.0913      0.0339      0.1252         -0.0460         -0.0129
ẋ50       rad/s     -0.2592      0.5750      0.8343          0.0974          0.1778
U         m/s        1.5400      2.9200      1.3800          1.5400          2.9200

Table 6.5: Basic statistics of selected slam variables - continued

Variable  Units   Interquartile range     Median       Mean    Variance  Standard deviation
Fs        N                   75.9090    72.1030    87.6750   3583                  59.8580
I         mm                  11.5797    45.3780    44.2985    107.2347             10.3554
V         m/s                  0.3067     0.1794     0.1741      0.0711              0.2666
x50       rad                  0.0331    -0.0291    -0.0303      0.00045             0.0211
ẋ50       rad/s                0.1543     0.0974     0.1044      0.0135              0.1163
U         m/s                  1.3800     2.9200     2.1294      0.3129              0.5594
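The summary statistics listed in Tables 6.4 and 6.5 can be computed with a few lines of code, sketched here using only the Python standard library. The sample values are hypothetical, the function name `summarise` is illustrative, and two conventions are assumed because the original does not state them: the ‘inclusive’ quartile method and the sample variance (divisor n - 1).

```python
import statistics


def summarise(values):
    """Return the kinds of summary statistics reported in Tables 6.4 and 6.5."""
    # Quartile conventions differ between packages; the 'inclusive'
    # method used here is an assumption, not necessarily the original choice.
    lower_q, median, upper_q = statistics.quantiles(
        values, n=4, method="inclusive"
    )
    return {
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "lower_quartile": lower_q,
        "upper_quartile": upper_q,
        "iqr": upper_q - lower_q,
        "median": median,
        "mean": statistics.mean(values),
        # Sample variance (divisor n - 1) is assumed here.
        "variance": statistics.variance(values),
        "std_dev": statistics.stdev(values),
    }


# Hypothetical sample, not the measured slam data.
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
summary = summarise(sample)
for name, value in summary.items():
    print(f"{name:>15}: {value:.4f}")
```

Different quartile conventions can give slightly different quartiles and interquartile ranges for small samples, so the convention should be stated alongside any reported values.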

When selecting variables for the regression analyses detailed in the following sections, the tables of correlations (Tables 6.2 and 6.3) and statistics (Tables 6.4 and 6.5), along with the features described above, are revisited to ensure appropriateness and to be aware of any limitations of a linear regression.

In document Slamming of large high speed catamarans in irregular seas (Page 143-151)