CHAPTER 4: Data Acquisition and Exploratory Data Analysis

4.4 Analytical Methods

4.4.1 Spearman’s rank correlation

Unlike the Pearson correlation, Spearman's rank correlation is a nonparametric measure of the association between two variables whose relationship is monotonic, whether linear or nonlinear (Cohen, 1988; Puth, Neuhäuser, & Ruxton, 2015). Spearman's correlation coefficient, r_s, takes values between −1 and +1. A positive value of r_s indicates a positive correlation between the two variables, whereas a negative value indicates an inverse relation between them.

[Figure: pipe age (years) plotted against pipe roughness (mm). (a) Cast Iron pipes in WSZ1: Pipe age = 27.099(Pipe roughness) + 17.253, R² = 0.69. (b) Ductile Iron pipes in WSZ2: Pipe age = 277.18(Pipe roughness) − 8.322, R² = 0.63.]

The classification of the strength of r_s is subjective; it depends on the type of data and the purpose of the study. While an r_s value of ±0.7 may be classified as high in clinical research, it may be regarded as medium or low in aeronautics research. For this research, because Fe and Mn depend on many variables, some of which are interrelated, high correlation coefficients were not expected. Therefore, the classification by Cohen (1988) was adopted, in which |r_s| > 0.5 is a strong correlation, 0.3 ≤ |r_s| ≤ 0.5 a moderate correlation, and 0 < |r_s| < 0.3 a weak correlation. The equation for calculating r_s is given as:

$$
r_s = 1 - \frac{6 \sum_{i=1}^{n_s} DF_i^2}{n_s (n_s^2 - 1)} \tag{4.8}
$$

where n_s = the number of pairs of values in the sample; and

DF_i = the difference between the ranks of the values in the ith pair.
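Equation (4.8) and the adopted strength classes can be illustrated with a minimal Python sketch. The function names and the sample values are hypothetical and are not taken from this study. The sketch also assumes that no values are tied, since Equation (4.8) is exact only in that case.

```python
def ranks(values):
    """Return 1-based ranks (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rs(x, y):
    """Spearman's r_s from Equation (4.8); exact only when there are no ties."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n_s = len(x)  # number of pairs of values
    # DF_i is the difference between the ranks of the ith pair
    sum_df_sq = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * sum_df_sq / (n_s * (n_s ** 2 - 1))


def classify_strength(r):
    """Classify |r| with the thresholds adopted in this section."""
    magnitude = abs(r)
    if magnitude > 0.5:
        return "strong"
    if magnitude >= 0.3:
        return "moderate"
    if magnitude > 0:
        return "weak"
    return "none"


# Hypothetical illustrative values, not data from this study.
turbidity = [0.2, 0.5, 0.9, 1.4, 2.0]
fe = [0.01, 0.03, 0.02, 0.08, 0.20]
rs = spearman_rs(turbidity, fe)
print(rs, classify_strength(rs))
```

Because only the ranks enter the calculation, any monotonic transformation of either variable leaves r_s unchanged, which is why the coefficient suits nonlinear but monotonic relationships.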

The aim of this study is not to predict any variable, but to understand the influence of chemical and biological processes on Fe and Mn accumulation in WDNs. Spearman's rank correlation analysis was therefore performed at the DMA level. Fe was used as the dependent variable and was plotted against each of the 36 water quality variables (independent variables) in turn. Similarly, Mn was plotted against each of the 36 water quality variables. When computing r_s for a given pair of variables in a given DMA, it is important to compare its value with those from other DMAs for the same pair of variables. This comparison indicates whether the two variables are genuinely correlated or are correlated only by chance. Figure 4.5 shows selected plots that illustrate strong, moderate, and weak correlations between Fe (and Mn) and some water quality variables.

The percentages of graphs at the DMA level with negative or positive correlations of Fe and Mn against the water quality variables were also determined. Knowing whether an independent variable correlates negatively or positively with a dependent variable is important because it helps in the formulation of fuzzy rules in FISs. Details of the formation of fuzzy rules are presented in Sections 3.3.4.1 and 6.3.4.
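The cross-DMA comparison described above could be sketched as follows. This is an illustration only: the DMA labels, the sample values, and the function names are hypothetical, and tied values are assumed to be absent so that Equation (4.8) applies directly.

```python
def spearman_rs(x, y):
    """Spearman's r_s via Equation (4.8); assumes no tied values."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        out = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            out[idx] = rank
        return out

    n_s = len(x)
    d_sq = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_sq / (n_s * (n_s ** 2 - 1))


def sign_percentages(samples_by_dma):
    """Return (% positive, % negative) of r_s across DMAs for one variable pair."""
    coefficients = [spearman_rs(x, y) for x, y in samples_by_dma.values()]
    if not coefficients:
        raise ValueError("at least one DMA is required")
    total = len(coefficients)
    positive = sum(1 for r in coefficients if r > 0)
    negative = sum(1 for r in coefficients if r < 0)
    return 100 * positive / total, 100 * negative / total


# Hypothetical DMA labels and values (independent variable, Fe); not study data.
samples_by_dma = {
    "DMA_A": ([1.0, 2.0, 3.0, 4.0], [0.02, 0.03, 0.05, 0.09]),
    "DMA_B": ([1.5, 2.5, 3.5, 4.5], [0.04, 0.03, 0.07, 0.08]),
    "DMA_C": ([0.5, 1.0, 1.5, 2.0], [0.10, 0.08, 0.05, 0.01]),
    "DMA_D": ([2.0, 3.0, 4.0, 5.0], [0.01, 0.04, 0.03, 0.06]),
}
pct_positive, pct_negative = sign_percentages(samples_by_dma)
print(f"positive: {pct_positive:.0f}%, negative: {pct_negative:.0f}%")
```

A variable pair whose sign agrees across most DMAs is a more defensible basis for a fuzzy rule than one whose sign varies from DMA to DMA.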

Figure 4.5 Plots showing ((a) and (b)) strong, ((c) and (d)) moderate, and ((e) and (f)) weak correlations between Fe (and Mn) and selected water quality variables

4.4.2 Linear regression

Regression models have been applied in almost every field of study, including economics, medicine, political science, sociology, and psychology. They have also been used extensively in water resource engineering. For example, Murdoch and Shanley (2006) used segmented regression analysis to assess water quality trends. Rajendra Prasad, Sadashivaiah and Ranganna (2011) used a regression model to predict total dissolved solids from electrical conductivity values, while Christensen, Rasmussen and Ziegler (2002) developed a real-time water-quality monitoring model that uses regression analysis to estimate nutrient and bacteria concentrations in Kansas streams, USA. Joarder, Raihan, Alam and Hasanuzzaman (2008) used a linear regression equation to predict groundwater quality from variables such as electrical conductivity, calcium, and dissolved solids.

Despite the extensive use of regression models in the past few decades, they have been superseded by more sophisticated models with strong learning capabilities, such as ANN and neuro-fuzzy logic models. In addition, the requirement that the variables of most regression models be continuous and normally distributed makes them inappropriate for some data.

Pearson’s correlation coefficient, R, was used to determine any existing correlations between customer complaints and selected water quality variables. The equation for calculating R is given as:

$$
R = \frac{\sum_{i=1}^{sp} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{sp} (X_i - \bar{X})^2 \sum_{i=1}^{sp} (Y_i - \bar{Y})^2}} \tag{4.9}
$$

where R = Pearson's correlation coefficient; X = independent variable;

Y = dependent variable; X̄ and Ȳ are the means of X and Y, respectively; and

sp = the number of observations.

R measures the strength and direction of the linear relationship between two variables. In a positive correlation, the values of the dependent variable increase as the values of the independent variable increase. Conversely, in an inverse or negative correlation, the values of the dependent variable decrease as the values of the independent variable increase. R ranges from −1 to +1, and the strength of a correlation, whether positive or negative, is judged from |R|, which lies between 0 and 1. For this research, the classification of R given by Rodgers and Nicewander (1988) was adopted. They classified |R| > 0.5 as strong correlation, 0.3 ≤ |R| ≤ 0.5 as moderate correlation and 0 < |R| < 0.3 as weak correlation.
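As an illustration of Equation (4.9), the following Python sketch computes R directly from the sums in the numerator and denominator. The function name and the sample values (Fe concentrations and complaint counts) are hypothetical and are not data from this study.

```python
import math


def pearson_r(x, y):
    """Pearson's R following Equation (4.9)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sp = len(x)  # number of observations
    mean_x = sum(x) / sp
    mean_y = sum(y) / sp
    # Numerator: sum of products of deviations from the means
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator terms: sums of squared deviations
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("R is undefined when either variable is constant")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical illustrative values, not data from this study.
fe = [0.05, 0.10, 0.15, 0.20, 0.25]   # independent variable X
complaints = [2, 4, 5, 4, 5]          # dependent variable Y
r = pearson_r(fe, complaints)
print(f"R = {r:.3f}")
```

Under the classification adopted above, the magnitude |R| of the returned value determines whether the correlation counts as strong, moderate, or weak.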