Constructing the MBR Bayesian Belief Network

Conventional data analysis methods, such as plotting a scatterplot and determining a correlation coefficient like r² or Pearson's r (ρ), can capture the relationship between two environmental parameters when a strong linear correlation exists. For example, the scatterplot and ρ in Figure 4 demonstrate a clear linear relationship between mixed liquor suspended solids (MLSS) and mixed liquor volatile suspended solids (MLVSS) from 9 full-scale MBRs across Australia. However, a scatterplot and ρ fail to clearly capture the relationships among multiple operational parameters such as dissolved oxygen (DO), hydraulic retention time (HRT), solid retention time (SRT), temperature, MLSS, MLVSS, and MBR permeate quality, because MBR treatment is a complicated dynamic process that is simultaneously affected by microbial, chemical and physical factors. Assessing which factors affect MBR permeate quality is difficult because of complex reaction mechanisms that vary with both time and the physical attributes (environmental conditions) of the system. Given the nonlinearity, uncertainty, and dynamic features of the MBR process, an alternative data analysis technique is needed. Artificial neural networks are able to capture the relationships between multiple operational parameters and treated water quality (Côté et al. 1995, Lee et al. 2005), but they tend to be "black box" models which neither show dependencies between variables nor provide probabilistic predictions (Pittman 2008).
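The limitation of a linear correlation coefficient can be illustrated with a minimal sketch. The values below are hypothetical and are not data from the study; the function name `pearson_r` is likewise only illustrative. The sketch shows Pearson's r being close to 1 for a linear relationship and close to 0 for a relationship that is strong but nonlinear.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for illustration only (not MBR measurements).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [0.8 * v + 1.0 for v in x]   # y depends linearly on x
curved = [v ** 2 for v in x]          # y depends on x, but not linearly

r_linear = pearson_r(x, linear)
r_curved = pearson_r(x, curved)
print(round(r_linear, 3), round(r_curved, 3))
```

In the second case the dependence is deterministic, yet the coefficient gives no hint of it, which is the kind of gap that motivates a different modelling approach.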

Over the last decade, Bayesian belief networks (BBNs) have been increasingly used for modelling complex domains such as ecosystems and environmental management (Uusitalo 2007). A BBN is a probabilistic graphical model for reasoning under uncertainty, consisting of a set of variables (nodes) and directed arcs that describe the conditional dependencies between variables (Pearl 2000, Korb et al. 2011). Based on these characteristics, BBNs offer several advantages, including:

- Capability to model complex systems where there are multiple variables influencing each other;

- Ability to deal with uncertainty, as the content of each variable is presented as a probability distribution, so a BBN gives not only the result but also its expected frequency;

- Transparency, which provides the opportunity to gain insights about the system and makes the BBN a good communication tool (Sahely et al. 2001);

- Ability to deal with missing data, as the algorithms in a BBN can handle missing observations, which are common in environmental data;

- Capability to combine different sources of knowledge, e.g. expert knowledge regarding variables on which little or no data exist can be introduced as prior information to the net. These priors are then updated with real data to provide a synthesis of expert knowledge and real data. This synthesis can then be used as a prior in a new study (Uusitalo 2007);

- Bidirectionality: the same network can be used without modification to diagnose the causes of specific problems given information about the output variables, or to predict increases in operational efficiency given information about the input variables (Sahely et al. 2001);

- Relatively easy to modify and update with new data and knowledge (Sahely et al. 2001).
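The bidirectional use of a network listed above can be sketched with a two-node example. Everything here is an assumption made for illustration: the arc from MLSS to permeate turbidity, the state names, and all probabilities are hypothetical and do not come from the study or from Netica™.

```python
# Hypothetical two-node network: MLSS (parent) -> permeate turbidity (child).
# All probabilities below are made up for illustration.
prior_mlss = {"low": 0.5, "high": 0.5}

# Conditional probability table: P(turbidity state | MLSS state).
cpt_turbidity = {
    "low":  {"low": 0.9, "high": 0.1},
    "high": {"low": 0.6, "high": 0.4},
}

def predict(child_state):
    """Forward use: marginal probability of a turbidity state."""
    return sum(prior_mlss[m] * cpt_turbidity[m][child_state] for m in prior_mlss)

def diagnose(child_state):
    """Backward use: P(MLSS state | observed turbidity state) via Bayes' rule."""
    evidence = predict(child_state)
    return {
        m: prior_mlss[m] * cpt_turbidity[m][child_state] / evidence
        for m in prior_mlss
    }

p_high_turbidity = predict("high")   # prediction from inputs
posterior_mlss = diagnose("high")    # diagnosis from an observed output
print(p_high_turbidity, posterior_mlss)
```

The same prior and conditional probability table serve both queries, which is what allows one network to be used for prediction and for diagnosis without modification.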

Although BBNs offer many benefits for modelling complex systems, their application in wastewater treatment systems is still very limited. Bayesian analysis of MBR operating data in combination with log reduction value (LRV) has not been attempted before. Accordingly, this study aims to develop and apply a BBN to identify factors affecting the LRV of microbial indicators through MBRs.

7.2.2 Constructing a Bayesian network for MBRs

The development of the MBR Bayesian net in this study follows the major steps in developing a BBN presented in Figure 5. First, the objective and scope of the model are determined. Next, the model structure is developed, which includes defining the nodes and the connections between them. The model is then parameterised, which includes defining states and intervals and filling the conditional probability table (CPT) for each node. The final step is to evaluate and validate the model. More details about these steps are provided below.

[Figure 5: flowchart of four steps: define model objective and scope; define model structure; parameterise the model; evaluate and validate the model.]

Figure 5 - Major steps in developing a BBN (adapted from Ticehurst et al. (2008) and Kraft (2009)).

7.2.2.1 Defining model objective and scope

As stated above, the objective of this study is to develop and apply a BBN to identify factors affecting the LRV of microbial indicators through MBRs.

7.2.2.2 Defining model structure

In this step, variables (nodes) and connections between the variables in the model are determined.

Selecting variables for the model

The variables in a BBN should be controllable, predictable or observable (Borsuk et al. 2004, Chen et al. 2012). Insignificant variables should not be included, as they increase the complexity of the network and reduce the sensitivity of the model outputs to important variables (Chen et al. 2012). In this study, variables for the MBR BBN were selected by considering the literature on key membrane design and operating parameters (Judd et al. 2011), the Victorian guidelines for validation of MBRs (VDoH 2013), previous validation reports of full-scale MBR plants, and the data available for model evaluation and validation. As presented in Section 7.2.1, MLSS and MLVSS data are linearly correlated with ρ = 0.98, so only one of the two parameters needs to be included in the model. MLSS was selected because it is quicker and easier to analyse offline and can also be monitored online. The selected variables and the range of data available for them are presented in Table 22. The LRV of each indicator was calculated from the influent and permeate indicator densities. Because the indigenous influent indicator density cannot be controlled, factors that potentially increase the likelihood of a higher permeate indicator density at a fixed influent indicator density were treated as equivalent to causes of low LRV. This approach was applied throughout this study in determining factors influencing the LRV of indicators.
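The LRV calculation mentioned above can be sketched as follows, assuming the usual definition of a log reduction value as the difference between the log10 influent and log10 permeate densities. The function names and example densities are hypothetical, and both densities are assumed to be expressed in the same units.

```python
import math

def lrv(influent_density, permeate_density):
    """Log10 reduction value from raw influent and permeate densities."""
    if influent_density <= 0 or permeate_density <= 0:
        raise ValueError("densities must be positive to take log10")
    return math.log10(influent_density) - math.log10(permeate_density)

def lrv_from_logs(log_influent, log_permeate):
    """Same quantity when densities are already log10-transformed."""
    return log_influent - log_permeate

# Hypothetical densities in matching units (illustration only).
example_lrv = lrv(1_000_000, 10)
example_lrv_logs = lrv_from_logs(6.0, 1.0)
print(example_lrv, example_lrv_logs)
```

For a fixed influent density, any factor that raises the permeate density lowers the LRV by the same amount on the log scale, which is why the study can reason about permeate density in place of LRV.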

Defining connections between variables

The structure of a BBN can be developed based on expert knowledge or automatic structure learning. The literature has shown that environmental processes, which often include a lot of variation and uncertainty, cannot be estimated with complete accuracy from available data alone (Uusitalo 2007, Chen et al. 2012). However, where expert knowledge of the system is incomplete, the structure learning process provides a new perspective on the problem, a better appreciation of the complexity of the system, and a better understanding of the system and the limitations of the data (Alameddine et al. 2011). In this study, automatic structure learning was first conducted using R software (R-project 2014) to provide better insight into the system and the limitations of the data. The structure of the net was then developed based on expert knowledge through an iterative process during a series of workshops between experts of SP1 and SP4. The net was constructed in Netica™ Bayesian modelling software (Norsys 2015), which provides a popular and simple graphical interface for building and working with BBNs.

Table 22 - Selected variables for the MBR BBN and range of data available for these variables.

| Group of variables | Variables (nodes) | Unit | Data range¹ (Low, IQR, High) |
|---|---|---|---|
| Reactor variables | Solid retention time (SRT) | d | 12, 32-126, 147 |
| | MLSS | g/L | 0.1, 3.4-13, 20 |
| | Hydraulic retention time (HRT) | h | 4, 17-39, 100 |
| | Dissolved oxygen (DO) | mg/L | 0, 1.5-4.8, 8.3 |
| | Temperature in mixed liquor | °C | 16, 21-25, 30 |
| Membrane conditions | Flux | LMH | 0.4, 5.2-22, 37 |
| | Transmembrane pressure (TMP) | kPa | 0.4, 5.9-6.9, 50 |
| | Permeability | LMH/kPa | 0.1, 0.8-5.1, 33 |
| | Membrane age | months | 1.0, 5.0-27, 217 |
| | Membrane pore size | µm | 0.04-0.4 |
| | Membrane configuration | | |
| | Chemical cleaning type | | |
| | Time after chemical cleaning | h | |
| Bulk quality parameters | Influent dissolved organic carbon (DOC) | mg/L | 9.0, 61-88, 182 |
| | Mixed liquor DOC | mg/L | 4.6, 10-21, 77 |
| | Permeate DOC | mg/L | 4.9, 7.4-12, 932 |
| | Permeate turbidity | NTU | 0.01, 0.03-0.13, 3.7 |
| | Influent pH | | 6.3, 7.5-8.2, 10 |
| | Mixed liquor pH | | 3.8, 6.9-7.7, 9.0 |
| | Permeate pH | | 3.0, 6.9-7.7, 9.0 |
| | Capillary suction time (CST) | s | 11, 22-44, 274 |
| Microbial indicator densities | Log Somatic influent | | 1.6, 5.0-5.5, 7.4 |
| | Log Somatic mixed liquor | | 2.0, 4.6-5.6, 6.7 |
| | Log Somatic permeate | | 1.0, 1.0-1.7, 3.1 |
| | Log FRNA influent | | 3.0, 5.0-6.0, 7.0 |
| | Log FRNA mixed liquor | | 2.0, 4.0-5.0, 6.0 |
| | Log FRNA permeate | | 1.0, 1.0, 2.0 |
| | Log E. coli influent | | 4.3, 6.6-7.1, 9.4 |
| | Log E. coli mixed liquor | | 3.3, 5.5-6.3, 8.6 |
| | Log E. coli permeate | | 0.0, 0.0-0.5, 2.2 |
| | Log Perfringen influent | | 3.5, 5.2-5.6, 7.1 |
| | Log Perfringen mixed liquor | | 5.8, 6.5-6.9, 7.4 |
| | Log Perfringen permeate | | 0, 0, 2.6 |
| Calculated LRV microbial indicators | LRV Somatic | | Calculated from influent and permeate densities |
| | LRV FRNA | | Calculated from influent and permeate densities |
| | LRV E. coli | | Calculated from influent and permeate densities |
| | LRV Perfringen | | Calculated from influent and permeate densities |

¹ IQR = interquartile range, Low = lowest and High = highest values of parameters from full-scale site sampling.

7.2.2.3 Parameterising the model

This step includes defining states and intervals for each node in the net. The more states a node has, the more data are needed to fill its CPT. In practice, datasets are often not large enough to allow a high number of intervals per variable. Therefore, in order to build a meaningful BBN, the number of states is often restricted (Uusitalo 2007). In this study, due to limited variability in the available data, 2 states were selected for each node in the net to minimise empty probabilities in the CPTs. The intervals were defined by automatic discretisation using the equal-frequency method in the Netica™ interface.
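Equal-frequency discretisation into two states amounts to splitting each variable at a cut point that leaves roughly the same number of cases on each side. The sketch below shows one generic way to do this with the Python standard library (Python 3.8 or later for `statistics.quantiles`); it is not Netica™'s implementation, whose exact cut-point rule may differ, and the MLSS values and state labels are hypothetical.

```python
import bisect
import statistics

def equal_frequency_cutpoints(values, n_states=2):
    """Cut points that split the values into n_states bins of similar size."""
    return statistics.quantiles(values, n=n_states)

def discretise(value, cutpoints, labels):
    """Map a continuous value to the label of the interval it falls in."""
    return labels[bisect.bisect_right(cutpoints, value)]

# Hypothetical MLSS measurements in g/L (illustration only).
mlss = [2.1, 3.4, 5.0, 8.2, 11.0, 13.0]
cutpoints = equal_frequency_cutpoints(mlss, n_states=2)
states = [discretise(v, cutpoints, ["low", "high"]) for v in mlss]
print(cutpoints, states)
```

With two states there is a single cut point near the median, so each state receives about half of the cases, which helps keep every cell of a CPT populated when data are limited.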

7.2.2.4 Evaluate and validate the model

A 10-fold cross-validation method was used to validate the model using the R software (R-project 2014). This approach first partitions the data into 10 equally sized sets, then uses 9 of these partitions for parameter learning and the remaining one for holdout testing (Koller et al. 2009). This is repeated 10 times so that each of the 10 partitions is tested once. The cross validation was conducted using the RNetica package in R (Almond R. 2014), which provides an R interface to Netica™ with the same functionality as the Netica™ software. The script was designed to perform K-fold cross validation.
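The partitioning logic of K-fold cross validation can be sketched independently of any BBN software. The following is a generic Python illustration, not the RNetica script used in the study; the function name, the seed, and the number of cases are assumptions made for the example.

```python
import random

def k_fold_indices(n_cases, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)          # randomise case order
    folds = [indices[i::k] for i in range(k)]     # k near-equal partitions
    for i, test_idx in enumerate(folds):
        # All cases outside the held-out fold are used for parameter learning.
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train_idx, test_idx

# Hypothetical dataset of 50 cases split into 10 folds.
splits = list(k_fold_indices(n_cases=50, k=10))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx))
```

Each case appears in exactly one test fold, so every observation is used once for holdout testing and nine times for learning.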

In this study, the area under the receiver operating characteristic (ROC) curve was used to assess the accuracy of the model. The ROC curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate against the false positive rate at various threshold settings. The area under the ROC curve (AUC) is widely used to evaluate the accuracy of classification tests (Flach et al. 2011). An area of 1 represents a perfect test and an area of 0.5 represents a worthless test.
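The AUC can be computed without drawing the curve, because it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties counted as one half). The sketch below uses that equivalence; the function name, labels and scores are hypothetical and this is not the routine used in the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive correctly ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = high state) and predicted probabilities.
auc_mixed = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
auc_perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
auc_uninformative = roc_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])
print(auc_mixed, auc_perfect, auc_uninformative)
```

The perfect and uninformative cases reproduce the two reference values given above, 1 and 0.5.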

7.2.3 Identifying factors affecting LRV of microbial indicators by MBRs

As stated in Section 7.2.2.2, the LRV of each indicator was calculated from the influent and permeate indicator densities. Because the indigenous influent indicator density cannot be controlled, factors affecting the LRV of microbial indicators were considered to be the same as factors that increase the likelihood of a higher permeate indicator density for the same influent density.

Factors affecting log permeate indicator density were determined by considering the importance of each factor in predicting log permeate indicator density. First, the AUC score for predicting log permeate indicator density when data for all other variables in the net were available was determined and used as the baseline AUC. Then, the data for one variable in the net were removed, and the AUC score for predicting log permeate indicator density was recalculated. This step was repeated for every other variable in the net. The AUC scores obtained in the absence of each variable's data were then compared with the baseline AUC. A variable was considered an important factor influencing log permeate indicator density if removing its data reduced the AUC score by more than 1% compared with the baseline AUC. In all cases, each AUC score was calculated 3 times with 3 different seeding ratios, and the mean of these 3 calculations was used.
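The comparison step described above can be sketched as follows. The sketch assumes that the 1% threshold is a relative reduction with respect to the baseline AUC, which the text does not state explicitly. The AUC values, variable selection and function name are hypothetical; the scores would in practice come from the cross-validated BBN.

```python
from statistics import mean

def important_variables(baseline_runs, ablated_runs, threshold=0.01):
    """Return variables whose removal lowers the mean AUC by more than threshold.

    The drop is taken relative to the baseline mean AUC (an assumption).
    """
    baseline = mean(baseline_runs)
    flagged = {}
    for name, runs in ablated_runs.items():
        relative_drop = (baseline - mean(runs)) / baseline
        if relative_drop > threshold:
            flagged[name] = relative_drop
    return flagged

# Hypothetical AUC scores, three runs each (illustration only).
baseline_runs = [0.80, 0.82, 0.81]
ablated_runs = {
    "MLSS": [0.74, 0.76, 0.75],   # large drop when MLSS data are removed
    "HRT":  [0.80, 0.81, 0.82],   # no drop
    "DO":   [0.805, 0.81, 0.80],  # drop below the 1% threshold
}

flagged = important_variables(baseline_runs, ablated_runs)
print(flagged)
```

Averaging the three runs before comparing reduces the chance that a single unlucky partition of the data flags or hides a variable.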

In document National Validation Guidelines for Water Recycling: Membrane Bioreactors (Page 55-59)