Procedure for Topology Extraction Methods

The specific procedure followed to extract topology using each topology extraction method (TEM) is discussed in this section.

4.3.1. Linear cross-correlation

In order to extract topology from data using the LC method the following procedure was followed:

1) For each possible combination of variable pairs and for the chosen number of lags, Equation 3-1 was applied to each pair of variables in the data matrix. Since this method is symmetrical, the calculation only needed to be performed for the upper triangle of the connectivity matrix (CM).

2) This calculation resulted in an MxM connectivity matrix with the maximum correlation for each pair of variables as the entries, as well as a corresponding MxM lag matrix. A lag of zero indicated no time delay between the variables and therefore no evidence of causality, so these entries were assigned zero in the CM. A lag of less than zero meant that the causality was in the other direction, so the entry was moved to the corresponding position below the diagonal.

3) The remaining entries in the CM had to be tested for significance, so the values below the significance threshold were assigned zeros as well (setting the significance threshold is discussed in Section 4.3.4).

4) The CM was then used to construct a connectivity graph.
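The four steps above can be sketched in code. This is only an illustrative sketch: Equation 3-1 is not reproduced in this section, so the normalised cross-correlation estimator below is an assumption, as are the function names, the convention that a positive lag means the first variable leads the second, and the convention that rows of the CM are causes and columns are effects.

```python
import math
import random


def cross_correlation(x, y, lag):
    """Normalised cross-correlation between x[i] and y[i + lag] (assumed estimator)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    if lag >= 0:
        pairs = zip(x[: n - lag], y[lag:])
    else:
        pairs = zip(x[-lag:], y[: n + lag])
    return sum((a - mx) * (b - my) for a, b in pairs) / (sx * sy)


def lc_connectivity(data, max_lag, threshold):
    """data is a list of M equal-length series; returns (cm, lags), both M x M."""
    m = len(data)
    cm = [[0.0] * m for _ in range(m)]
    lags = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):  # step 1: upper triangle only
            best_lag = max(
                range(-max_lag, max_lag + 1),
                key=lambda k: abs(cross_correlation(data[i], data[j], k)),
            )
            rho = abs(cross_correlation(data[i], data[j], best_lag))
            # steps 2 and 3: zero lag or insignificant correlation leaves a zero
            if best_lag == 0 or rho < threshold:
                continue
            if best_lag > 0:  # variable i leads variable j
                cm[i][j], lags[i][j] = rho, best_lag
            else:  # variable j leads variable i: move below the diagonal
                cm[j][i], lags[j][i] = rho, -best_lag
    return cm, lags


# Hypothetical test signal: y is x delayed by three samples.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(300)]
y = [0.0, 0.0, 0.0] + x[:-3]
cm, lags = lc_connectivity([x, y], max_lag=10, threshold=0.5)
print(cm, lags)
```

The non-zero entries of the returned CM are the edges of the connectivity graph in step 4; the fixed threshold of 0.5 is a placeholder for the sample-size-dependent threshold of Section 4.3.4.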

4.3.2. Partial cross-correlation

In order to extract topology from data using the PC method a similar procedure to that for LC was followed:

1) For each possible combination of variable pairs and for the chosen number of lags, Equation 3-4 was applied to each pair of variables in the data matrix, while conditioning on all remaining variables in the data. Since this method is symmetrical this calculation only needed to be performed for the upper triangle of the CM.

2) This calculation gave an MxM connectivity matrix with the maximum correlation for each pair as the entries, as well as a corresponding MxM lag matrix. A lag of zero indicated no time delay between the variables and therefore no evidence of causality, so these entries were assigned zero in the CM. A lag of less than zero meant that the causality was in the other direction, so the entry was moved to the corresponding position below the diagonal.
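The effect of conditioning can be illustrated with a small example. The sketch below is a simplification and not Equation 3-4: it uses the standard first-order partial correlation formula with a single conditioning variable at zero lag, whereas the procedure above conditions on all remaining variables over a range of lags. The signals and function names are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Ordinary correlation coefficient between two equal-length series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical case: z drives both x and y, which are not directly connected.
rng = random.Random(1)
z = [rng.gauss(0.0, 1.0) for _ in range(2000)]
x = [v + 0.3 * rng.gauss(0.0, 1.0) for v in z]
y = [v + 0.3 * rng.gauss(0.0, 1.0) for v in z]
print(round(pearson(x, y), 3), round(partial_correlation(x, y, z), 3))
```

In this example the ordinary correlation between x and y is large because of the common driver, while the partial correlation is close to zero, which is why PC is less prone than LC to reporting indirect connections.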

Chapter 4 -Fault Diagnosis Method Page 47

4.3.3. Transfer entropy

To extract topology from data using the TE method the following procedure was followed:

1) For each possible combination of variable pairs and for the chosen number of lags, Equation 3-5 was applied to each pair of variables in the data matrix to give an MxM matrix with entries $T_{x \to y}$. The values suggested by Bauer et al. (2007) for the parameters used in this equation were a prediction horizon $h = 4$, a sampling period $\tau = 4$, and embedding dimensions $l_x = l_y = 1$. This method is asymmetrical, so it had to be performed on all possible pairs of variables.

2) To give the causality measure, which is the difference between $T_{x \to y}$ and $T_{y \to x}$ (Equation 3-7), the transpose of this matrix was subtracted from the matrix itself.

3) This calculation gave an MxM connectivity matrix with the TE causality measure for each pair as the entries, as well as a corresponding MxM lag matrix. The significance of each entry had to be tested, and the values below the significance threshold were assigned values of zero (setting the significance threshold is discussed in Section 4.3.4).

4) The CM was then used to construct a connectivity graph.
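Steps 2 and 3 can be sketched as follows, assuming the transfer entropies from step 1 are already available as a square matrix. The function name, the example values and the threshold are hypothetical, and rows are again taken as causes and columns as effects.

```python
def te_causality(te, threshold):
    """Subtract the transpose of the TE matrix and zero insignificant entries."""
    m = len(te)
    # Step 2: causality measure, entry (i, j) = T(i -> j) - T(j -> i).
    diff = [[te[i][j] - te[j][i] for j in range(m)] for i in range(m)]
    # Step 3: keep only entries at or above the significance threshold.
    return [[d if d >= threshold else 0.0 for d in row] for row in diff]


# Hypothetical transfer entropies for three variables.
te = [
    [0.00, 0.30, 0.02],
    [0.05, 0.00, 0.01],
    [0.02, 0.20, 0.00],
]
cm = te_causality(te, threshold=0.1)
print(cm)
```

Because the difference matrix is antisymmetric, each pair of variables contributes at most one positive entry, so thresholding with a positive value leaves a single directed edge per significant pair.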

4.3.4. Setting significance thresholds for topology extraction

Each TEM requires selection of a significance threshold. For LC and PC, the hypothesis that a causal relationship exists between two variables is rejected if there is no evidence of a time delay between them and/or if the maximum correlation is not large enough to indicate causality. For TE, the hypothesis that a causal relationship exists between two variables is rejected if the difference between the transfer entropy from x to y and that from y to x is small.

Significance threshold for linear cross-correlation

Bauer and Thornhill (2008) presented a method which can be used for the selection of this significance threshold. This approach empirically estimates the distribution of the correlation under the null hypothesis that two variables, x and y, are uncorrelated random time sequences. The correlation between two connected series from the plant data is unlikely to have originated from this distribution, so their correlation value should be higher. Using a one-sided hypothesis test, the null hypothesis that two time series are uncorrelated is rejected, and the correlation is deemed to be indicative of causality between the variables, when the condition presented in Equation 4-1 is satisfied.

In Equation 4-1, the subscript rnd indicates values calculated for the random vectors, and $\mu_{\rho_{\max,rnd}^{LC}}$ and $\sigma_{\rho_{\max,rnd}^{LC}}$ are the mean and standard deviation, respectively, of $\rho_{\max}^{LC}$. The mean and standard deviation are functions of the number of samples, or length of the series, N. Therefore, to determine an empirical distribution (i.e. $\mu$ and $\sigma$) of $\rho_{\max}^{LC}$, random time sequences were generated with a varying number of samples, $N_{rnd}$, from 0 to 3000. For each $N_{rnd}$ the correlation of each pair of time series (31x31 pairs) was calculated, and the mean and standard deviation of $\rho_{\max}^{LC}$ were then determined.
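The estimation of the null distribution described above can be sketched as follows. The sketch is deliberately scaled down so that it runs quickly in plain Python: it uses 6 random series, a maximum lag of 5 and three sample sizes, all of which are assumptions chosen for illustration rather than the settings used to produce Figure 4-6. The cross-correlation estimator and function names are likewise assumed.

```python
import math
import random
import statistics


def max_abs_xcorr(x, y, max_lag):
    """Largest absolute cross-correlation over lags -max_lag..max_lag."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    norm = math.sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            s = sum(a * b for a, b in zip(dx[: n - lag], dy[lag:]))
        else:
            s = sum(a * b for a, b in zip(dx[-lag:], dy[: n + lag]))
        best = max(best, abs(s) / norm)
    return best


def null_statistics(n_samples, n_series=6, max_lag=5, seed=0):
    """Mean and standard deviation of the maximum correlation between
    pairs of independent Gaussian random sequences of length n_samples."""
    rng = random.Random(seed)
    series = [
        [rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in range(n_series)
    ]
    values = [
        max_abs_xcorr(series[i], series[j], max_lag)
        for i in range(n_series)
        for j in range(i + 1, n_series)
    ]
    return statistics.mean(values), statistics.stdev(values)


for n in (50, 200, 800):
    mu, sigma = null_statistics(n)
    print(n, round(mu, 3), round(sigma, 3))
```

Repeating this for a finer grid of sample sizes gives the mean and standard deviation curves to which the trend equations are fitted.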

Figure 4-6: Linear cross-correlation mean and standard deviation for 31 pairs of random sequences with changing sample size

Figure 4-6 shows the plots of the mean and standard deviation against the number of samples, N. Both follow a decreasing power-law trend that can be described by Equation 4-2 and Equation 4-3.

$\mu_{\rho_{\max,rnd}^{LC}} = a_1 N^{-b_1}$   Equation 4-2

$\sigma_{\rho_{\max,rnd}^{LC}} = a_2 N^{-b_2}$   Equation 4-3

Substituting Equation 4-2 and Equation 4-3 into Equation 4-1 results in an equation for $\rho_{\max,rnd}^{LC}$, which is now designated as the significance threshold for the correlation, $\rho_{\max,th}^{LC}$, as a function of N, as shown in Equation 4-4.
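The substitution can be written out as a short worked step. This is a sketch that assumes Equation 4-1 is a three-sigma test on the maximum correlation, which is the form suggested by the factor of 3 in Equation 4-6; the symbols follow Equation 4-2 and Equation 4-3.

```latex
\rho_{\max}^{LC} \ge \rho_{\max,th}^{LC}
  = \mu_{\rho_{\max,rnd}^{LC}} + 3\,\sigma_{\rho_{\max,rnd}^{LC}}
\quad\Longrightarrow\quad
\rho_{\max,th}^{LC}(N) = a_1 N^{-b_1} + 3\,a_2 N^{-b_2}
```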

Curve fitting was used to determine the parameters in Equation 4-4, resulting in an equation for the threshold as a function of sample size, as shown in Equation 4-5.

$\rho_{\max,th}^{LC}(N) = 3N^{-0.452} + 0.11N^{-0.658}$   Equation 4-5

The fitted curves are also shown as the red dashed lines in Figure 4-6.
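The curve-fitting method is not specified here. One simple possibility, shown below as an assumption rather than the method actually used, is an ordinary least-squares straight-line fit in log-log space, which recovers the coefficient a and exponent b of a trend of the form a N^(-b). The function name and the synthetic data are hypothetical.

```python
import math


def fit_power_law(ns, values):
    """Fit values ~ a * n**(-b) by least squares on log(values) versus log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(v) for v in values]
    k = len(xs)
    x_bar, y_bar = sum(xs) / k, sum(ys) / k
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    a = math.exp(y_bar - slope * x_bar)
    return a, -slope  # the exponent b is the negative of the log-log slope


# Synthetic, noise-free data generated from a known power law.
ns = [100, 300, 1000, 3000]
values = [2.0 * n ** -0.5 for n in ns]
a, b = fit_power_law(ns, values)
print(round(a, 6), round(b, 6))
```

Fitting the mean and the standard deviation separately in this way gives the two pairs of parameters that are combined in the threshold equation.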

Significance threshold for partial cross-correlation

As with LC, PC also requires selection of a threshold to determine the significance of the maximum PC calculated for a pair of variables. The same approach as described for LC can be used for selection of the threshold for PC.

Figure 4-7: Partial cross-correlation mean and standard deviation for 31 pairs of random sequences with changing sample size

Figure 4-7 shows the same kind of decreasing power-law trend for PC as was observed for LC. The general equation for the threshold $\rho_{\max,th}^{PC}$ as a function of N is therefore similar to Equation 4-4, and is shown in Equation 4-6.

$\rho_{\max,th}^{PC}(N) = a_1 N^{-b_1} + 3a_2 N^{-b_2}$   Equation 4-6

Curve fitting was used to determine the parameters in Equation 4-6, resulting in an equation for the PC threshold as a function of sample size, shown in Equation 4-7.

$\rho_{\max,th}^{PC}(N) = 1.647N^{-0.428} + 3.864N^{-0.772}$   Equation 4-7

Significance threshold for transfer entropy

Again, the selection of a threshold is required for the hypothesis testing using TE. Bauer et al. (2007) set this threshold using the method suggested by Schreiber and Schmitz (2000). This approach generates surrogate time series data and uses Monte Carlo methods to determine the mean and standard deviation of $t_{\mathbf{x} \to \mathbf{y}}$. The significance is then defined using a six-sigma threshold, as shown in Equation 4-8.

$t_{\mathbf{x} \to \mathbf{y}} \ge t_{\mathbf{x} \to \mathbf{y},th} = \mu_{t_{\mathbf{x} \to \mathbf{y},rnd}} + 6\sigma_{t_{\mathbf{x} \to \mathbf{y},rnd}}$   Equation 4-8

However, Bauer et al. (2007) did not consider that the TE varies with sample size, as is the case with LC and PC. Using random sequences of increasing sample sizes, the TE mean and standard deviation were calculated and plotted against sample size, as shown in Figure 4-8. Figure 4-8 illustrates that, unlike LC and PC, the TE increases with increasing sample size.

Figure 4-8: Transfer entropy mean and standard deviation for 31 pairs of random sequences with changing sample size

The mean and standard deviation both follow an increasing trend that can be described generally according to Equation 4-9 and Equation 4-10 respectively.

$\mu_{t_{\mathbf{x} \to \mathbf{y},rnd}} = a_1 N^{b_1}$   Equation 4-9

$\sigma_{t_{\mathbf{x} \to \mathbf{y},rnd}} = a_2 N^{b_2}$   Equation 4-10

Substituting Equation 4-9 and Equation 4-10 into Equation 4-8 gives a general equation for the threshold as a function of sample size, N, as shown in Equation 4-11.

$t_{\mathbf{x} \to \mathbf{y},th}(N) = a_1 N^{b_1} + 6a_2 N^{b_2}$   Equation 4-11

Curve fitting was used to determine the parameters in Equation 4-11, resulting in an equation for the threshold as a function of sample size, as shown in Equation 4-12.

$t_{\mathbf{x} \to \mathbf{y},th}(N) = 0.0018N^{0.465} + 0.0054N^{0.412}$   Equation 4-12

The fitted curves are also shown as the red dashed lines in Figure 4-8.
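For reference, the three fitted thresholds can be evaluated side by side. The sketch below simply codes Equation 4-5, Equation 4-7 and Equation 4-12 with the coefficients as they are read from this section; the function names and the chosen sample sizes are assumptions for illustration.

```python
def lc_threshold(n):
    """Equation 4-5: significance threshold for linear cross-correlation."""
    return 3 * n ** -0.452 + 0.11 * n ** -0.658


def pc_threshold(n):
    """Equation 4-7: significance threshold for partial cross-correlation."""
    return 1.647 * n ** -0.428 + 3.864 * n ** -0.772


def te_threshold(n):
    """Equation 4-12: significance threshold for the TE causality measure."""
    return 0.0018 * n ** 0.465 + 0.0054 * n ** 0.412


for n in (100, 500, 1000, 3000):
    print(n, round(lc_threshold(n), 4), round(pc_threshold(n), 4), round(te_threshold(n), 4))
```

The LC and PC thresholds fall as the sample size grows, whereas the TE threshold rises, so a fixed threshold would be too lenient for TE on long data sets and too strict for LC and PC.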

In document Exploiting process topology for optimal process monitoring (Page 68-73)