How To Predict Chlorophyll A


Predicting chlorophyll-a in freshwater lakes by hybridising process-based models and genetic algorithms

P.A. Whigham a,*, Friedrich Recknagel b

a Department of Information Science, University of Otago, PO Box 56, Dunedin, New Zealand

b Department of Soil and Water, Adelaide University, Waite Campus, Glen Osmond, Adelaide 5064, South Australia, Australia

Abstract

This paper describes the application of several machine learning techniques to modify a process-based difference equation. The original process equation was developed to model phytoplankton abundance based on measured limnological and climate variables. A genetic algorithm is shown to be capable of calibrating the constants of the process model, based on the data describing a lake environment. The resulting process model has a significantly improved performance based on unseen test data. A symbolic genetic algorithm is then applied to the process model to evolve new expressions for the grazing term of the equation. The results indicate that this approach can be used to explore new process formulations and to improve the generalisation and predictive response of process models. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Chlorophyll-a prediction; Process-based model; Genetic algorithm; Time series model


1. Introduction

This paper describes the application of a process-based model to predict the timing and magnitudes of algal blooms for Lake Kasumigaura, in the South-Eastern part of Japan. The process model is calibrated using a genetic algorithm, and subsequently portions of the model are modified using a symbolic learning system. This data has previously been studied using artificial neural networks (Recknagel, 1997; Recknagel et al., 1997, 1998; Liu and Yao, 1999), which demonstrated the potential for these tools to predict highly non-linear phenomena such as blue-green algal blooms in freshwater lakes. Although neural networks have produced accurate predictive models for this dataset, it is difficult to use them for extracting process knowledge of the system. The purpose of this paper is to explore the combination of machine learning techniques and process-based descriptions to develop better predictive models and explore extensions to process understanding in freshwater systems.

1.1. Lake Kasumigaura dataset

Lake Kasumigaura is situated in the South-Eastern part of Japan. It is a large, shallow water body where no thermal stratification occurs. Water temperatures vary widely, from 4°C in the winter to 30°C in summer. The lake has high external and internal nutrient loadings and therefore primary productivity is high. A number of climatic and limnological variables have been collected over a 10 year period (1984 – 1993) for Kasumigaura (Takamura et al., 1992), as shown in Table 1. A simple linear interpolation has been used to fill missing values to produce a complete daily time series for this period. For this study the last 5 years of data (1989 – 1993) have been selected, since they have some regularities in terms of chlorophyll-a, and include the largest peak concentration measured over the 10 year period.

* Corresponding author. Tel.: +64-3-4797391; fax: +64-3-4798311.

E-mail address: [email protected] (P.A. Whigham).

P.A. Whigham, F. Recknagel / Ecological Modelling 146 (2001) 243 – 251

1.2. The ecology of freshwater phytoplankton

Phytoplankton includes representatives of several groups of algae and cyanobacteria. They are usually distinguished by being freely floating and dependent on water movement for maintenance and transport (Reynolds, 1984). Many factors affect their population dynamics and they vary depending on the type of phytoplankton under consideration. However, all algae species rely on light as a basic input for photosynthesis and require nutrients such as nitrogen and phosphorus for growth and reproduction. Factors such as water temperature, turbidity, mixing, competition and grazing are also relevant to the population dynamics of algae. Even though much work has been done on phytoplankton, there are still difficulties with developing reliable predictive models for algal growth. There are several reasons for this, including the highly non-linear behaviour of the population, the inherent errors associated with data collection, the time scales for data collection versus the possible time scales of the system, and the complex, dynamic behaviour of aquatic ecosystems.

2. Genetic algorithms and genetic programming

Originally formalised by Holland (Holland, 1975), genetic algorithms (GA’s) are general optimisation techniques based on using a population to search for good solutions. Standard GA’s use a bit string to represent each member of the population, which is then decoded to give the values of the variables being optimised for the problem. These population-based techniques have been successfully applied to many different problems (Grefenstette, 1985; Cleveland and Smith, 1989; Goldberg, 1989; Bennet et al., 1991; De Jong et al., 1993), and are generally applicable to problems where there is little knowledge available about the form of the best solution. The main search operators used with GA’s are crossover, where two parent bit-strings are combined to produce two children, and mutation, where random elements in a bit-string are flipped from a zero to one or vice-versa. The driving force behind GA’s (and all evolutionary systems) is the use of a fitness measure to bias selection of individuals. Population members that are fitter are more likely to be selected for breeding and therefore good partial solutions are propagated throughout the population.

Genetic Programming (GP) extended the concepts of GA’s by allowing a more flexible representation for the population members (Koza, 1992). Each member of the population in GP is a functional program, represented as a tree. The search operators of crossover and mutation act on these trees, allowing the possible solutions to expand and grow during the evolution. One particular extension to GP uses a context-free grammar (CFG) to represent the language that can be expressed by the functional programs (Whigham, 1995, 1996; Whigham and Crapper, 1999; Whigham, 2000) and has been successfully applied to several different spatial and temporal problems. This system, entitled CFG – GP, allows the form of any program to be controlled by a grammar, and therefore the system allows the user to search within a constrained representation of the problem. CFG – GP will be used in this paper to explore several forms of a partial extension to a process-based difference equation.

Table 1

Factors measured with the daily time series data

Measured factor          Average ± S.D.     Units
Ortho-phosphate (p)      14.14 ± 25.71      mg/l
Solar radiation (l)      1281 ± 671         MJ/m
Water temperature (t)    16.36 ± 7.79       °C
Copepoda (cop)           156.4 ± 83.7       ind/l
Cladocera (clad)         169.9 ± 221.7      ind/l
Chlorophyll-a (chla)     74.43 ± 42.51      mg/l

Fig. 1. Prediction using difference equation (training RMSE=87.3, test RMSE=82.7).

3. Training and test data setup

Table 1 shows the measured variables used for developing the models. For all experiments, 2 years of daily data (1989 – 1990) were used for training and 3 years of daily data (1991 – 1993) used for testing the generalisation behaviour of the resulting equations. The root mean square error (RMSE) was used as the fitness function for the training data and as a measure of accuracy for the test data. A lower RMSE was taken to indicate a better prediction of the test data. When comparing two different learning techniques a lower RMSE for the unseen (test) data implied that the learning system had better generalised the patterns found in the training data.
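As a minimal sketch of the evaluation protocol described above, the following Python fragment computes the RMSE and separates a daily series into the 1989 – 1990 training period and the 1991 – 1993 test period. The function names and the (year, value) record layout are assumptions made for illustration only; they are not taken from the original study.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def split_by_year(records, train_years=(1989, 1990), test_years=(1991, 1992, 1993)):
    """Split (year, value) records into training and test lists.

    The (year, value) layout is a hypothetical convenience, not the
    format of the Lake Kasumigaura dataset.
    """
    train = [value for year, value in records if year in train_years]
    test = [value for year, value in records if year in test_years]
    return train, test
```

A model would then be fitted against the training list only, with the RMSE on the test list reported as the measure of generalisation.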

4. Process-based modelling

A difference equation for algal growth, Eq. (1), was originally developed as part of a module of the lake ecosystem model SALMO (Recknagel and Benndorf, 1982) based on the theoretical behaviour of freshwater systems and validated using laboratory experiments and field data. Using this original equation the prediction for the training and test periods is shown in Fig. 1. The RMSE for the period 1989 – 1990 was 87.3 and for 1991 – 1993 was 82.7.

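$$\mathrm{Chla}_{t+1} = \mathrm{Chla}_t + \mathrm{Chla}_t\,(\mathrm{Phot} - \mathrm{Resp}) - \mathrm{Chla}_t\,(\mathrm{Cop} + \mathrm{Clad}) \times 0.0001, \tag{1}$$

where

$$\mathrm{Phot} = (0.068\,T) \times \frac{0.025\,L}{28 + 0.025\,L} \times \frac{P/\mathrm{Chla}_t}{\dfrac{1.7}{X} + \dfrac{P}{X} + \dfrac{1.7}{\mathrm{Chla}_t} + \dfrac{P}{\mathrm{Chla}_t}},$$

$$X = 5.76\,\mathrm{Chla}_t^{\,0.41},$$

and

$$\mathrm{Resp} = (0.00228\,T) + 0.3\,\mathrm{Phot}.$$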
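One way to make the structure of Eq. (1) concrete is the Python sketch below, which performs a single daily update. It assumes that the nutrient factor in Phot is the ratio of P/Chla to the sum 1.7/X + P/X + 1.7/Chla + P/Chla and that X uses Chla raised to the power 0.41; both are readings of the printed equation rather than confirmed forms. The class and field names are hypothetical, while the numerical values are those given in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Constants:
    """Constants of Eq. (1). Field names are illustrative; values are from the text."""
    k_temp: float = 0.068
    k_light_num: float = 0.025
    k_light_half: float = 28.0
    k_light_den: float = 0.025
    k_p: float = 1.7
    x_coef: float = 5.76
    x_exp: float = 0.41
    k_resp_temp: float = 0.00228
    k_resp_phot: float = 0.3
    k_graze: float = 0.0001

def phot(chla, T, L, P, c=Constants()):
    """Photosynthesis term; requires chla > 0."""
    x = c.x_coef * chla ** c.x_exp
    light = c.k_light_num * L / (c.k_light_half + c.k_light_den * L)
    # Assumed reading of the nutrient factor (see the lead-in above).
    nutrient = (P / chla) / (c.k_p / x + P / x + c.k_p / chla + P / chla)
    return c.k_temp * T * light * nutrient

def resp(T, phot_value, c=Constants()):
    """Respiration term."""
    return c.k_resp_temp * T + c.k_resp_phot * phot_value

def step(chla, T, L, P, cop, clad, c=Constants()):
    """One daily update: chlorophyll-a at t+1 from the state and drivers at t."""
    ph = phot(chla, T, L, P, c)
    growth = chla * (ph - resp(T, ph, c))
    grazing = chla * (cop + clad) * c.k_graze
    return chla + growth - grazing
```

Keeping the constants in one immutable record means that a calibrated variant of the model only needs a different `Constants` instance, with the update rule itself left unchanged.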

4.1. Using genetic algorithms to calibrate the process model

In an attempt to improve the performance of Eq. (1) the constants were tuned based on the training data. Each constant could vary within a range of ±20%, with the search based on the 2 years of training data. The optimal settings for the constants were searched using the Genetic Algorithm Optimisation Toolbox for Matlab 5. The parameters for this GA used a population of 100, with a simple binary crossover (90%) and binary mutation (5%). The selection scheme used was roulette wheel, and the GA was run for 150 generations. This setup was run 20 times, with the final solution selected from the run with the lowest training error. The RMSE was used as the fitness measure; however, only values of chlorophyll-a that were above 75 mg/l contributed to the error measure, in an attempt to force the equation to better model the peak events. The prediction for the test and training period is shown in Fig. 2 and the modified equation is shown as Eq. (2). This equation had a dramatically improved RMSE of 27.6 for the training period and 31.9 for the 3 years of test data.

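$$\mathrm{Chla}_{t+1} = \mathrm{Chla}_t + \mathrm{Chla}_t\,(\mathrm{Phot} - \mathrm{Resp}) - \mathrm{Chla}_t\,(\mathrm{Cop} + \mathrm{Clad}) \times 0.00008, \tag{2}$$

where

$$\mathrm{Phot} = (0.07634\,T) \times \frac{0.0295\,L}{22.4 + 0.0299\,L} \times \frac{P/\mathrm{Chla}_t}{\dfrac{1.36}{X} + \dfrac{P}{X} + \dfrac{1.36}{\mathrm{Chla}_t} + \dfrac{P}{\mathrm{Chla}_t}},$$

$$\mathrm{Resp} = (0.00273\,T) + 0.2696\,\mathrm{Phot},$$

and

$$X = 4.608\,\mathrm{Chla}_t^{\,0.328}.$$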

Fig. 2. Prediction of chlorophyll-a using a difference equation with constants calibrated using the training data (training RMSE=27.6, test RMSE=31.9).


The success of this equation demonstrates that the constants of process-based models may be successfully calibrated to conditions in specific freshwater lakes using simple machine-learning techniques. The ability to also search for new constant values within some range (in this case ±20%) is a useful method for constraining the search within limits that are physically meaningful.
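The calibration procedure can be sketched as a small binary-coded GA in Python. This is not the Matlab toolbox used in the study: the number of bits per constant, the inverse-error roulette weights, the single-point crossover and the reading of the 5% mutation rate as a per-child probability of one bit flip are all assumptions made for illustration. The helper `peak_rmse` assumes that the 75 mg/l threshold is applied to the observed values.

```python
import math
import random

def peak_rmse(observed, predicted, threshold=75.0):
    """RMSE over days whose observed value exceeds the threshold (0.0 if none do)."""
    pairs = [(o, p) for o, p in zip(observed, predicted) if o > threshold]
    if not pairs:
        return 0.0
    return math.sqrt(sum((o - p) ** 2 for o, p in pairs) / len(pairs))

def decode(bits, nominal, n_bits=10, spread=0.20):
    """Map a bit string onto constants, each within +/- spread of its nominal value."""
    values = []
    for i, c0 in enumerate(nominal):
        chunk = bits[i * n_bits:(i + 1) * n_bits]
        frac = int("".join(map(str, chunk)), 2) / (2 ** n_bits - 1)  # in [0, 1]
        values.append(c0 * (1.0 - spread + 2.0 * spread * frac))
    return values

def roulette(pop, errors, rng):
    """Roulette-wheel selection: a lower error earns a larger slice of the wheel."""
    weights = [1.0 / (e + 1e-12) for e in errors]
    return rng.choices(pop, weights=weights, k=1)[0]

def calibrate(error_fn, nominal, pop_size=100, generations=150,
              p_cross=0.9, p_mut=0.05, n_bits=10, seed=0):
    """Return (best constants, best error) found by the GA sketch."""
    rng = random.Random(seed)
    length = n_bits * len(nominal)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best, best_err = None, float("inf")
    for _ in range(generations):
        errors = [error_fn(decode(ind, nominal, n_bits)) for ind in pop]
        for ind, err in zip(pop, errors):
            if err < best_err:
                best, best_err = ind[:], err
        children = []
        while len(children) < pop_size:
            a, b = roulette(pop, errors, rng)[:], roulette(pop, errors, rng)[:]
            if rng.random() < p_cross:  # single-point crossover
                cut = rng.randrange(1, length)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for child in (a, b):
                if rng.random() < p_mut:  # flip one randomly chosen bit
                    child[rng.randrange(length)] ^= 1
                children.append(child)
        pop = children[:pop_size]
    return decode(best, nominal, n_bits), best_err
```

In use, `error_fn` would run the difference equation over the training period with the decoded constants and return `peak_rmse` against the measured chlorophyll-a; repeating `calibrate` with 20 different seeds and keeping the lowest training error mirrors the protocol described above.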

4.2. Learning modifications to the process-based model

This section will consider using CFG – GP to extend the process-based model by searching for new representations of one component of the process model, a single component at a time. For example, the photosynthesis, respiration or grazing terms could be evolved within the original process-based model framework, while keeping the other factors in the model constant.

To demonstrate this concept, the grazing term from Eq. (2) will be evolved within the original model. The purpose is to search for a better representation of the grazing term in relation to the theoretically and experimentally determined process model. The original grazing term including parameter calibration was $-\mathrm{chla}_t \times (\mathrm{cop} + \mathrm{clad}) \times 0.00008$. Two approaches will be described; the first will evolve a grazing term of the form $\mathrm{chla}_t \times f(\mathrm{cop}, \mathrm{clad})$; the second will allow the $\mathrm{chla}_t$ variable to be used more than once in the grazing term, in other words, the grazing term is a function of all three variables: $f(\mathrm{chla}_t, \mathrm{cop}, \mathrm{clad})$. Note that neither equation allows the use of the ‘minus’ operator, to ensure that the grazing term is always positive and therefore, when subtracted, always lowers the predicted $\mathrm{chla}_{t+1}$ concentration.

The following grammars represent the two grazing equations that will be evolved.


Fig. 3. Chlorophyll-a prediction using $G_{\mathrm{grazing1}}$.

Both equations were evolved using a population size of 1000 for 50 generations. Crossover was set to 90% and mutation 5% over the non-terminal GT. Each setup was run 20 times, and the best result based on the training data, over the 20 runs, selected as the solution. The fitness measure used was RMSE, and the evaluation of the grazing term was achieved by incorporating Eq. (2), with the grazing term removed, into the fitness evaluation. Although it was possible to define Eq. (2) directly in the grammar, and let the system just evolve new terms for grazing, incorporating Eq. (2) into the CFG – GP program lowered the evaluation time for the program.
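To illustrate how a grammar can constrain the expressions that are searched, the Python sketch below derives random expression trees from a single non-terminal GT and evaluates them as a grazing term. The productions shown are a hypothetical stand-in and should not be read as the grammars used in this study; the only properties taken from the text are the non-terminal name GT, the restriction to cop and clad for the first form, and the absence of a minus operator.

```python
import random

# Hypothetical productions for the single non-terminal GT:
#   GT -> (GT + GT) | (GT * GT) | cop | clad | const
# Subtraction is deliberately absent, so the evolved term cannot
# become negative for non-negative inputs.
PRODUCTIONS = ["add", "mul", "cop", "clad", "const"]
TERMINALS = ["cop", "clad", "const"]

def derive(rng, depth):
    """Randomly expand GT into an expression tree of bounded depth."""
    kind = rng.choice(PRODUCTIONS if depth > 0 else TERMINALS)
    if kind in ("add", "mul"):
        return (kind, derive(rng, depth - 1), derive(rng, depth - 1))
    if kind == "const":
        return ("const", rng.uniform(0.0, 1.0))
    return (kind,)

def evaluate(tree, cop, clad):
    """Evaluate an expression tree for given zooplankton abundances."""
    kind = tree[0]
    if kind == "add":
        return evaluate(tree[1], cop, clad) + evaluate(tree[2], cop, clad)
    if kind == "mul":
        return evaluate(tree[1], cop, clad) * evaluate(tree[2], cop, clad)
    if kind == "const":
        return tree[1]
    return cop if kind == "cop" else clad

def grazing(tree, chla, cop, clad):
    """First form of the grazing term: chla_t times an evolved f(cop, clad)."""
    return chla * evaluate(tree, cop, clad)
```

Crossover and mutation would then exchange or regenerate subtrees rooted at GT, so every individual in the population remains a sentence of the grammar.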

4.2.1. Resulting grazing equations

The best solution for $G_{\mathrm{grazing1}}$ had a training error of 28.5 and a test error of 35.42. Note that these results are comparable with the original calibrated difference equation. Since it was not possible to significantly improve on the original equation when constrained to only use Copepoda and Cladocera as variables, it is possible to conclude that this form of equation is an appropriate formalisation of the model. The grazing term produced using $G_{\mathrm{grazing1}}$ was:

$$\mathrm{chla}_t \times 0.000048 \times \left[\,2\,\mathrm{cop} + \mathrm{clad} + 0.000069 + X\,\right], \tag{3}$$

$$X = (0.000069\,\mathrm{clad} + 0.000069\,\mathrm{cop})\,\cosh\!\left(\frac{\mathrm{clad}}{\mathrm{cop}}\right).$$

Eq. (3) is basically the same form as the original grazing term and can be simplified to $\mathrm{chla}_t \times 0.000048 \times (2\,\mathrm{cop} + \mathrm{clad})$, since the cosh terms are multiplied by a factor of $10^{-10}$ and therefore can be ignored. The resulting prediction is shown in Fig. 3. The main difference compared with the original equation is the increased dominance of the Copepoda term. Note that the constant 0.000069 appears a number of times in Eq. (3). This is a result of the crossover operator propagating useful sub-expressions over a number of generations to construct the solution. Because the constant was associated with good partial solutions it has spread throughout the population. In terms of producing process equations this is a useful side effect of the technique, since it allows general constants in an equation to be discovered, which may have some fundamental meaning for the system.
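The size of the terms dropped in the simplification of Eq. (3) can be checked numerically. The Python sketch below evaluates the full and simplified grazing terms at the mean values from Table 1, assuming that the argument of cosh is the ratio clad/cop; that reading, and the function names, are assumptions made for illustration.

```python
import math

def grazing_full(chla, cop, clad):
    """Eq. (3) as read here, with cosh applied to clad/cop (an assumption)."""
    x = (0.000069 * clad + 0.000069 * cop) * math.cosh(clad / cop)
    return chla * 0.000048 * (2.0 * cop + clad + 0.000069 + x)

def grazing_simplified(chla, cop, clad):
    """Simplified form: chla_t * 0.000048 * (2 cop + clad)."""
    return chla * 0.000048 * (2.0 * cop + clad)

# Mean values from Table 1.
chla, cop, clad = 74.43, 156.4, 169.9
full = grazing_full(chla, cop, clad)
simple = grazing_simplified(chla, cop, clad)
relative_difference = (full - simple) / simple
print(f"relative difference: {relative_difference:.2e}")
```

Under this reading the two forms agree closely at the mean abundances, although the agreement depends on the ratio clad/cop staying moderate, since cosh grows rapidly with its argument.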

The best solution for $G_{\mathrm{grazing2}}$ had a training error of 22.26 and a test error of 29.63, which is an improvement over the original, calibrated process model. The resulting predicted curve is shown in Fig. 4. The grazing term produced using $G_{\mathrm{grazing2}}$ is shown in Eq. (4).

$$\frac{\exp(\mathrm{chla}_t/18.03)}{\mathrm{cop}} \times \frac{(18.03/\mathrm{clat}) + (\mathrm{clad}/\mathrm{cop}) + \mathrm{clad} + \mathrm{cop}}{(28.86/\mathrm{clat}) + \mathrm{chla}_t}. \tag{4}$$

Note that the constant 18.03 is repeated in this equation, and an interesting question arises as to whether 18.03 should also replace 28.86. This would indicate a general constant in the grazing term that could be given some meaning related to how the system is functioning. These considerations, however, are beyond the scope of this preliminary work. Eq. (4) has also used the $\mathrm{chla}_t$ term several times. Since this formalisation was not part of Eq. (2) it has extended the possible interpretation of the grazing term. The form of Eq. (4) is reminiscent of hyperbolic/inverse hyperbolic relationships that are quite common in multiple resources dynamics. This is a promising outcome of the work and shows that it may be possible to use this data-driven approach to reconstruct and extend theories regarding freshwater ecosystem dynamics.

5. Discussion

The preceding studies have demonstrated that models can be developed for the non-linear dynamics of phytoplankton. The use of a genetic algorithm to calibrate the constants of the process-based equation has been shown to significantly improve the predictive performance of the model. Additionally, an extended genetic programming system has been used to evolve new components of the process equation, improving the overall performance on unseen data. This is a promising area of work that should allow the exploration of new formulations of photosynthesis, respiration and grazing terms for process-based equations. Although this approach is not likely to produce the best predictor (Liu and Yao, 1999) based on the training and test data, it shows the possibility of allowing better process models to be produced and therefore to extend our knowledge of the underlying dynamics of ecological systems. In that respect, it is a complementary approach to other machine learning and hybrid systems, since the combination of different techniques can help support new descriptions and developments in a fundamental manner.

Several more general issues have been highlighted by this work. The root mean square error was used as the fitness function; however, this is essentially a non-temporal measure for time series since it does not take into account the shape of the curves that are being compared (Whigham and Aldridge, 2000). Other error metrics for time series, which take into account the shape of the series, may improve the developed models by changing the shape of the fitness landscape and driving the system towards better partial solutions. In particular, previous work in time series similarity and data mining has shown that RMSE is not always the most appropriate measure for comparing a predicted and actual time series (Keogh and Pazzini, 1999). Further work is required to determine whether better similarity measures would improve the development of these process-based equations, for both the constant calibration and the CFG – GP approaches.

This work has not attempted to attach any physical meaning to the evolved expression for grazing. Rather, the fact that the new grazing term improves the prediction on the test data has been used to demonstrate that improvements for some portion of the process model are possible. Future work is required to explore whether using this approach to gradually evolve small modifications to the current process model can extend process understanding; however, the initial results presented here are very promising.

Fig. 4. Chlorophyll-a prediction using $G_{\mathrm{grazing2}}$.

The general formulation of Eq. (1), with some extensions, can also be applied to individual algal species. Future work will investigate the use of GA’s to calibrate the constants of this equation for predicting various algal species dynamics in different freshwater lakes. By comparing the evolved constants within each equation the driving factors for the behaviour of each species should be able to be inferred and perhaps generalised by means of data from different lakes. Comparing the inferred behaviour with the known dynamics of each species will help to understand how a particular population of algal species behaves under the physical constraints of a variety of freshwater lakes.

6. Conclusion

The calibration of a deterministic process-based model using genetic algorithms has been demonstrated. The dramatically improved performance of the calibrated model indicates that GA’s are an appropriate method for tuning the constants of a process-based model. The application of a symbolic machine learning system has also been shown for exploring modifications to the process model, by evolving a new term for one particular component of the system. The success of this approach indicates that it is possible to explore process descriptions in an incremental fashion. Areas for future research have also been highlighted.

References

Bennet, K., Ferris, M.C., Ioannidis, Y.E., 1991. A genetic algorithm for database query optimization. In: Belew, R.K., Booker, L.B. (Eds.), Proceedings of the Fourth International Conference on Genetic Algorithms.

Cleveland, G.A., Smith, S.F., 1989. Using genetic algorithms to schedule flow shop releases. In: Schaffer, J.D. (Ed.), Proceedings of the Third International Conference on Genetic Algorithms.

De Jong, K., Spears, W., Gordon, D., 1993. Using Genetic Algorithms for Concept Learning. Machine Learn. 13, 5 – 29.

Goldberg, D.E., 1989. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley.

Grefenstette, J.J., 1985. Optimization of control parameters for genetic algorithms. IEEE Trans. Syst., Man Cybern. 16 (1), 122 – 128.

Holland, J.H., 1975. Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor, MI.

Keogh, E.J., Pazzini M.J., 1999. Relevance Feedback Retrieval of Time Series Data. The 22nd Annual international ACM-SIGIR conference on research and development of information retrieval.

Koza, J.R., 1992. Genetic Programming: On the Programming of Computers by means of Natural Selection. MIT Press, Cambridge, MA.

Liu, Y., Yao, X., 1999. Time Series Prediction by Using Negatively Correlated Neural Networks. Lecture Notes in Artificial Intelligence, vol. 1585. Springer-Verlag, Berlin, pp. 325 – 332.

Recknagel, F., 1997. ANNA — Artificial Neural Network model for predicting species abundance and succession of blue-green algae. Hydrobiologia 394, 47 – 57.

Recknagel, F., Benndorf, J., 1982. Validation of the ecological simulation model SALMO. Int. Revue ges. Hydrobiol. 67 (1), 113 – 125.

Recknagel, F., French, M., Harkonen, P., Yabunaka, K., 1997. Artificial neural network approach for modelling and prediction of algal blooms. Ecol. Model. 96 (1 – 3), 11 – 28.

Recknagel, F., Fukushima, T., Hanazato, T., Takamura, N., Wilson, H., 1998. Modelling and prediction of phyto- and zooplankton dynamics in Lake Kasumigaura by artificial neural networks. Lakes and Reservoirs: Research and Management 3, 123 – 133.

Reynolds, C.S., 1984. The Ecology of Freshwater Phytoplankton. Press Syndicate of the University of Cambridge, New York.

Takamura, N., Otsuki, A., Aizaki, M., Nojiri, Y., 1992. Phytoplankton species shift accompanied by transition from nitrogen dependence to phosphorus dependence of primary production in Lake Kasumigaura. Arc. Hydrobiol. 124, 129 – 148.

Whigham, P.A., 1995. Grammatically-based Genetic Programming. In: Rosca, J. (Ed.), Proceedings of the Workshop on Genetic Programming: From Theory to Real-World Applications.

Whigham, P.A., 1996. Search bias, language bias and genetic programming. In: Koza, J.R., Goldberg, J.R., Fogel, D.E., Riolo, R.L. (Eds.), Genetic Programming 1996: Proceedings of the First Annual Conference, Stanford University. MIT Press, Cambridge, MA.

Whigham, P.A., 2000. Induction of a marsupial density model using genetic programming and spatial relationships. Ecol. Model. 131 (2 – 3), 299 – 317.

Whigham, P.A., Aldridge, C., 2000. A shape metric for evolving time series models. Information Science Department Discussion Series, University of Otago.

Whigham, P.A., Crapper, P.F., 1999. Time series modelling using genetic programming: an application to rainfall-runoff models. In: Spector, L., Langdon, W.B., O’Reilly, U., Angeline, P.J. (Eds.), Advances in Genetic Programming 3, vol. 5. MIT Press, Cambridge, MA, USA, pp. 89 – 104.