Parameter probabilistic analysis
4.2 Dispersion of values and outliers
In the previous topics, it was found that the variation and dispersion of results are largely influenced by the presence of values that appear to deviate markedly from the sample.
These values are often denominated as outliers, defined in ISO 16269-4 (ISO, 2010) as a member of a small subset of observations that appears to be inconsistent with the remainder of a given sample. Outliers can indicate a measurement error or that the sample has a heavy-tailed distribution. In the first case, one may suggest to discard the outliers or to use robust statistics, while in the second case the presence of outliers may indicate that the adjacent distribution has a high kurtosis (probability distribution evidencing a sharper peak and longer, fatter tails, while low kurtosis distribution has a more rounded peak and shorter, thinner tails). For the case of high kurtosis distributions, an outlying observation is merely an extreme manifestation of the random variability inherent in the data and the value should be retained and processed in the same manner as the other observations in the sample (ASTM E178, 2008).
Model-based methods which are often used for outlier detection take as premise that the data may be represented by a normal distribution, and by that assumption they identify observations that are likely outliers based on mean and standard deviation. A commonly used method is the Grubbs' test for outliers, where in a two-sided hypothesis test (either having or not having outliers in the sample), the test statistic is given by the largest absolute deviation from the sample mean in units of the sample standard deviation. The hypothesis of no outliers in a Grubb's test is then rejected regarding a given significance level.
Graphical methods, without any premise about the statistical distribution, are also used to evidence the dispersion of the sample and the possible presence of outliers. Within these methods, box plots are one of the most common. A box plot is a graphical method to describe the dispersion of a numerical data sample through the representation of five parameters: the smallest observation (sample minimum), the lower quartile (Q1), the median (middle quartile, Q2), the upper quartile (Q3) and the largest observation (sample maximum). The body of a box plot is composed by a rectangular shape box delimited by the lower and upper quartiles. The box is then connected to the maximum and minimum values by straight lines usually known as whiskers. In a box plot, outliers are identified based on the interquartile range, such that an outlier may be defined as any observation that is outside the range (respectively for lower and upper outliers):
( ) ( )
[
Q1−k⋅ Q3 −Q1 , Q3 +k⋅ Q3 −Q1]
(4.1)where k is a constant, that for usual analysis takes the value of 1.5 or 3.
Box plots are non-parametric since they display differences between populations without any inference or assumption of the underlying statistical distribution. The interval between the different sections of the box permits to evaluate the degree of dispersion and skewness in the data, and to identify outliers.
Skewness is a measure of the asymmetry of a probability distribution of a certain random variable, being either negative or positive, respectively depending whether data points are skewed to the left or to the right of the data average. A null skewness indicates that the data is relatively evenly distributed on both sides of the mean. In a box plot, the median line (Q2) can also suggest skewness in the distribution if it is noticeably shifted away from the centre. The location of the box within the whiskers can provide insight on the normality of the sample's distribution, such that if the box is shifted significantly towards the minimum value it presents positive skewness, whereas if shifted towards the other direction (maximum value) it may be indicative of a negative skewness. Box plots may also bring insight to the kurtosis of the distribution, such that a very thin box in relation to the whiskers may evidence a higher concentration of values within the interquartile range, as found in distributions with high kurtosis. A larger box compared to the whiskers may be associated to a low kurtosis distribution. Nevertheless, assumptions about the shape parameters of the underlying sample distribution, either skewness or kurtosis, must be taken with caution when only assessing a box plot, since it may lead to erroneous conclusions.
In this study, a dispersion analysis using box plots was made to the different test result samples in the four experimental phases. For outlier range definition a k = 1.5 was chosen.
4.2.1 Phase 1
The penetration depth was analysed in terms of dispersion by box plot and the results evidenced that no outlier was found (Figure 4.3). However, a rather significant dispersion is found regarding that the distance between maximum value and third quartiles is higher than the interquartile difference (Q3 - Q1). Attending to the location of the box plot, a small positive skewness is found.
Figure 4.3: Box plot for penetration impact depth in old beams.
4.2.2 Phase 2
With respect to the experimental tests made to the sawn beams, the propagation velocity, the MOE in bending (both local and global) and the bending strength were analysed. When
6 8 10 12 14 16
old beams
d(mm)
considering the propagation velocity results (Figure 4.4a), it was found that both tests made to lateral and bottom faces evidenced a significant number of lower outliers, denoting extreme values in the left tail of the underlying distribution (specially to the lateral face measurements). Comparing lateral and bottom measurements, a higher dispersion between quartiles Q1 and Q3, and longer whiskers, are found in the bottom measurements. However, both present similar median. No significant indication about the shape of the distributions is noticeable.
A comparative analysis between Em,l and Em,g (Figure 4.4b) indicates a higher dispersion for the sample of the latter, although two outliers (one lower and one upper) were found when analysing the values of the Em,l sample. The median value of Em,l is higher and its position is more centred with the quartiles Q1 and Q3 when compared to the Em,g median value which is closer to the inferior quartile Q1, indicative of a possible positive skewness. In both cases, the dispersion is quite high taking into account the range between the maximum and minimum values without outliers. The box plot for the bending strength (Figure 4.4c) must be considered with care, since the sample number is insufficient to construct a reliable dispersion statistic.
a) b) c)
Figure 4.4: Box plot for tests made on sawn beams: a) propagation velocity in ultrasound testing by the indirect method; b) MOE in bending tests; c) bending strength.
4.2.3 Phase 3
With respect to the experimental tests made to the sawn boards, the propagation velocity, the MOE in bending (both local and global), penetration depth and drilling resistance were analysed. When considering the propagation velocity results (Figure 4.5a), it was found that the tests made to bottom face of the boards by indirect method, presented a large number of lower outliers. This evidences extreme values in the left tail of the underlying distribution, which has also similarly occurred in the previous phase for the analogous measurements. In this case, by comparison to the distance from quartile Q1 to the lower outliers, a rather thin box is found, however, a small difference is found if comparing the size of the box with the length of the whiskers without outliers. The median is centred with quartiles Q1 and Q3.
The dispersion analysis of MOE in bending (Figure 4.5b) evidences a higher dispersion for the Em,l rather than the Em,g, which did not occur in the previous phase, however, the median of Em,l is still higher than the median of Em,g. In both cases, a significant number of observations are considered lower outliers, and also upper outliers are found for Em,l. The significant dispersion found in Em,l is consequent of the existence of a large number of extreme observations and a rather thin difference between quartiles Q1
and Q3, thus evidencing a possible underlying distribution with high kurtosis.
Regarding the penetration depth and drilling resistance, a larger dispersion of values was found for the samples of tests made to the lateral faces, although in both cases, a larger number of outliers were found in the bottom face measurements (Figure 4.5c, d). Within the same test, penetration depth median was similar in both bottom and lateral measurements, whereas in the case of the drilling resistance, a quantitative increase of quartiles Q1, Q2 and Q3 was found for the lateral measurements.
a) b)
c) d)
Figure 4.5: Box plot for tests made on sawn boards: a) propagation velocity in ultrasound testing by the indirect method; b) MOE in bending tests; c) penetration impact depth; d) drilling resistance by resistance measure.
0 2000 4000 6000
Ind. bottom vp(m/s) -sawn boards
0 10000 20000 30000
Em,l Em,g
MOE (N/mm2) -sawn boards
4 6 8 10 12 14
perp. bottom perp. lateral
d(mm) -sawn boards
150 200 250 300 350 400
perp. bottom perp. lateral
RM(bit) -sawn boards
bottom face Em,l Em,g
4.2.4 Phase 4
Phase 4 comprised a set of tests made to small wood specimens, and thus it is expectable that dispersion due to the material variability is decreased, since the specimens are selected in order to minimize the presence of defects which are a cause of mechanical property higher variation.
In the compression parallel to the grain tests, the propagation velocity, the MOE and ultimate strength in compression parallel to grain, were analysed. When considering the propagation velocity results (Figure 4.6a), a larger dispersion in the indirect measurements, with an extreme lower outlier is found. For direct measurements the values are more concentrated, especially in the interquartile interval (between Q1 and Q3). In both cases the median is centred within the box and both evidence lower outliers. The quartile Q3 and the maximum value are similar in both measurements.
A moderate dispersion in the Ec,0 sample (Figure 4.6b) is found, however it presents lower outliers and the box is closer to the minimum value, which may be indicative of a positive skewness. On the other hand, the fc,0 sample (Figure 4.6c) presents a higher dispersion, also evidencing lower outliers. Its median is centred in the interquartile interval and the box is centred within the extent of the whiskers.
a) b) c)
Figure 4.6: Box plot for tests made on compression parallel to grain specimens:
a) propagation velocity in ultrasound testing; b) MOE in compression parallel to grain tests; c) compression parallel to grain strength.
In the tension parallel to the grain tests, the propagation velocity, the MOE and ultimate strength in tension parallel to grain were analysed. In neither tests an outlier was found although large dispersions were detected. When considering the propagation velocity results (Figure 4.7a), a relative thin box is found centred within the maximum and minimum values, which may be indicative of a significant dispersion in the tails of the underlying distribution with a possible high kurtosis. Analogous situations are found with respect to Et,0 and ft,0 (Figure 4.7b, c), where also significant dispersions are found with
disperse in the lower tail. In the analysis of the box plot of ft,0, the approximation of the box to the upper quartile Q3 may be indicative of a slight negative skewness.
a) b) c)
Figure 4.7: Box plot for tests made on tension parallel to grain specimens: a) propagation velocity in ultrasound testing; b) MOE in tension parallel to grain tests;
c) tension parallel to grain strength.
Comparing the box plots for compression and tension parallel to grain, it is found that the latter present a higher dispersion even though no outliers are present. Previously it was already found that the COV in the tension parallel to grain tests was higher than the equivalent values for compression parallel to grain tests. Therefore, in this case the different experimental phases, it is found that lower outliers are found in almost all cases which may be either indicative of a tendency of these measures to take left tailed distributions, or that a measurement error associated to these procedure is being systematically considered. Measurement errors are most commonly random, thus taking either negative or positive values, however in this case no upper outliers are found and so the probability of considering the lower outliers as consequence of measurement errors must be considered low.
With respect to the density and moisture content determination tests, high dispersions are found, although no outlier is found (Figure 4.8). The median is centred both within the box and between the extreme values, evidencing a possible symmetric distribution such as of a normal distribution. The box plots for ρ0 and ρ12 are similar with only a shift of the
a) b)
Figure 4.8: Box plot for tests made on small clear specimens: a) density; b) moisture content.