Predicting chlorophyll-a in freshwater lakes by hybridising process-based models and genetic algorithms
P.A. Whigham a,*, Friedrich Recknagel b
a Department of Information Science, University of Otago, PO Box 56, Dunedin, New Zealand
b Department of Soil and Water, Adelaide University, Waite Campus, Glen Osmond, Adelaide 5064, South Australia, Australia
Abstract
This paper describes the application of several machine learning techniques to modify a process-based difference equation. The original process equation was developed to model phytoplankton abundance based on measured limnological and climate variables. A genetic algorithm is shown to be capable of calibrating the constants of the process model, based on the data describing a lake environment. The resulting process model has a significantly improved performance based on unseen test data. A symbolic genetic algorithm is then applied to the process model to evolve new expressions for the grazing term of the equation. The results indicate that this approach can be used to explore new process formulations and to improve the generalisation and predictive response of process models. © 2001 Elsevier Science B.V. All rights reserved.
Keywords: Chlorophyll-a prediction; Process-based model; Genetic algorithm; Time series model
1. Introduction
This paper describes the application of a process-based model to predict the timing and magnitudes of algal blooms for Lake Kasumigaura, in the South-Eastern part of Japan. The process model is calibrated using a genetic algorithm, and subsequently portions of the model are modified using a symbolic learning system. This dataset has previously been studied using artificial neural networks (Recknagel, 1997; Recknagel et al., 1997, 1998; Liu and Yao, 1999), which demonstrated the potential for these tools to predict highly non-linear phenomena such as blue-green algal blooms in freshwater lakes. Although neural networks have produced accurate predictive models for this dataset, it is difficult to use them for extracting process knowledge of the system. The purpose of this paper is to explore the combination of machine learning techniques and process-based descriptions to develop better predictive models and explore extensions to process understanding in freshwater systems.
1.1. Lake Kasumigaura dataset
Lake Kasumigaura is situated in the South-Eastern part of Japan. It is a large, shallow water body where no thermal stratification occurs. Water temperatures vary widely, from 4°C in the winter to 30°C in summer. The lake has high external and internal nutrient loadings and therefore primary productivity is high. A number of climatic and limnological variables have been collected over a 10 year period (1984–1993) for Kasumigaura (Takamura et al., 1992), as shown in Table 1. A simple linear interpolation has been used to fill missing values to produce a complete daily time series for this period. For this study the last 5 years of data (1989–1993) have been selected, since they have some regularities in terms of chlorophyll-a, and include the largest peak concentration measured over the 10 year period.

* Corresponding author. Tel.: +64-3-4797391; fax: +64-3-4798311.
E-mail address: [email protected] (P.A. Whigham).
0304-3800/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved. PII: S0304-3800(01)00310-6
P.A. Whigham, F. Recknagel / Ecological Modelling 146 (2001) 243–251

1.2. The ecology of freshwater phytoplankton
Phytoplankton includes representatives of several groups of algae and cyanobacteria. They are usually distinguished by being freely floating and dependent on water movement for maintenance and transport (Reynolds, 1984). Many factors affect their population dynamics and they vary depending on the type of phytoplankton under consideration. However, all algae species rely on light as a basic input for photosynthesis and require nutrients such as nitrogen and phosphorus for growth and reproduction. Factors such as water temperature, turbidity, mixing, competition and grazing are also relevant to the population dynamics of algae. Even though much work has been done on phytoplankton, there are still difficulties with developing reliable predictive models for algal growth. There are several reasons for this, including the highly non-linear behaviour of the population, the inherent errors associated with data collection, the time scales for data collection versus the possible time scales of the system, and the complex, dynamic behaviour of aquatic ecosystems.
2. Genetic algorithms and genetic programming
Originally formalised by Holland (Holland, 1975), genetic algorithms (GA’s) are general optimisation techniques based on using a population to search for good solutions. Standard GA’s use a bit string to represent each member of the population, which is then decoded to give the values of the variables being optimised for the problem. These population-based techniques have been successfully applied to many different problems (Grefenstette, 1985; Cleveland and Smith, 1989; Goldberg, 1989; Bennet et al., 1991; De Jong et al., 1993), and are generally applicable to problems where there is little knowledge available about the form of the best solution. The main search operators used with GA’s are crossover, where two parent bit-strings are combined to produce two children, and mutation, where random elements in a bit-string are flipped from a zero to one or vice-versa. The driving force behind GA’s (and all evolutionary systems) is the use of a fitness measure to bias selection of individuals. Population members that are fitter are more likely to be selected for breeding and therefore good partial solutions are propagated throughout the population.
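The operators described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names, the one-point form of crossover, the toy fitness function and the small population settings are all assumptions chosen for illustration.

```python
import random

def decode(bits, lo, hi):
    """Map a bit string (list of 0/1) to a real value in [lo, hi]."""
    as_int = int("".join(map(str, bits)), 2)
    return lo + (hi - lo) * as_int / (2 ** len(bits) - 1)

def one_point_crossover(a, b, rng):
    """Combine two parent bit strings into two children at a random cut."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in bits]

def roulette_select(population, fitnesses, rng):
    """Pick one member with probability proportional to its fitness."""
    return rng.choices(population, weights=fitnesses, k=1)[0]

def toy_fitness(bits):
    """Hypothetical fitness: largest when the decoded value is near 2."""
    x = decode(bits, 0.0, 4.0)
    return 1.0 / (1.0 + (x - 2.0) ** 2)

def evolve(fitness, n_bits=16, pop_size=30, generations=40, seed=0):
    """Generational GA: select parents, cross them over, mutate children."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        fits = [fitness(member) for member in pop]
        children = []
        while len(children) < pop_size:
            p1 = roulette_select(pop, fits, rng)
            p2 = roulette_select(pop, fits, rng)
            if rng.random() < 0.9:  # crossover probability
                c1, c2 = one_point_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            children += [mutate(c1, 0.05, rng), mutate(c2, 0.05, rng)]
        pop = children[:pop_size]
    return max(pop, key=fitness)

best = evolve(toy_fitness)
print(decode(best, 0.0, 4.0))
```

Roulette wheel selection as written requires non-negative fitness values, which is why the toy fitness is an inverse of the squared error rather than the error itself.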
Table 1
Factors measured with the daily time series data

Measured factor          Average ± S.D.     Units
Ortho-phosphate (p)      14.14 ± 25.71      mg/l
Solar radiation (l)      1281 ± 671         MJ/m
Water temperature (t)    16.36 ± 7.79       °C
Copepoda (cop)           156.4 ± 83.7       ind/l
Cladocera (clad)         169.9 ± 221.7      ind/l
Chlorophyll-a (chla)     74.43 ± 42.51      mg/l

Fig. 1. Prediction using difference equation (training RMSE=87.3, test RMSE=82.7).

Genetic Programming (GP) extended the concepts of GA’s by allowing a more flexible representation for the population members (Koza, 1992). Each member of the population in GP is a functional program, represented as a tree. The search operators of crossover and mutation act on these trees, allowing the possible solutions to expand and grow during the evolution. One particular extension to GP uses a context-free grammar (CFG) to represent the language that can be expressed by the functional programs (Whigham, 1995, 1996; Whigham and Crapper, 1999; Whigham, 2000) and has been successfully applied to several different spatial and temporal problems. This system, entitled CFG-GP, allows the form of any program to be controlled by a grammar, and therefore the system allows the user to search within a constrained representation of the problem. CFG-GP will be used in this paper to explore several forms of a partial extension to a process-based difference equation.
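The idea of a grammar constraining the form of candidate programs can be sketched as follows. The grammar shown is hypothetical and is not one of the grammars used in this paper; the non-terminal names, the constants and the depth limit are assumptions. The sketch only generates random expressions from the grammar and does not implement the crossover and mutation operators of CFG-GP.

```python
import random

# Hypothetical grammar: expressions over cop and clad using only + and *.
GRAMMAR = {
    "GT": [
        ["(", "GT", "+", "GT", ")"],
        ["(", "GT", "*", "GT", ")"],
        ["VAR"],
        ["CONST"],
    ],
    "VAR": [["cop"], ["clad"]],
    "CONST": [["0.0001"], ["0.5"], ["2.0"]],
}

def derive(symbol, rng, depth=0, max_depth=4):
    """Expand `symbol` into a string by choosing random productions."""
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol: emit as-is
    options = GRAMMAR[symbol]
    if symbol == "GT" and depth >= max_depth:
        # Past the depth limit, only allow non-recursive productions.
        options = [opt for opt in options if "GT" not in opt]
    production = rng.choice(options)
    return " ".join(derive(s, rng, depth + 1, max_depth) for s in production)

rng = random.Random(3)
expression = derive("GT", rng)
print(expression)
```

Because the grammar contains no subtraction and only positive constants, every expression it generates is positive for positive inputs; this is the kind of structural guarantee that a grammar can place on the search space.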
3. Training and test data setup
Table 1 shows the measured variables used for developing the models. For all experiments, 2 years of daily data (1989–1990) were used for training and 3 years of daily data (1991–1993) were used for testing the generalisation behaviour of the resulting equations. The root mean square error (RMSE) was used as the fitness function for the training data and as a measure of accuracy for the test data. A lower RMSE was taken to indicate a better prediction of the test data. When comparing two different learning techniques a lower RMSE for the unseen (test) data implied that the learning system had better generalised the patterns found in the training data.
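For reference, the error measure and the split by year can be written as a short Python sketch. The function names and the (year, value) record layout are assumptions made for illustration.

```python
import math

def rmse(predicted, observed):
    """Root mean square error between two equally long sequences."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

def split_by_year(records, train_years, test_years):
    """Split (year, value) records into training and test value lists."""
    train = [value for year, value in records if year in train_years]
    test = [value for year, value in records if year in test_years]
    return train, test

records = [(1989, 60.0), (1990, 80.0), (1991, 70.0), (1993, 90.0)]
train, test = split_by_year(records, {1989, 1990}, {1991, 1992, 1993})
print(train, test)
```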
4. Process-based modelling
A difference equation for algal growth (Eq. (1)) was originally developed as part of a module of the lake ecosystem model SALMO (Recknagel and Benndorf, 1982) based on the theoretical behaviour of freshwater systems and validated using laboratory experiments and field data. Using this original equation the prediction for the training and test periods is shown in Fig. 1. The RMSE for the period 1989–1990 was 87.3 and for 1991–1993 was 82.7.
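$$
\mathrm{Chla}_{t+1} = \mathrm{Chla}_t + \mathrm{Chla}_t\,(\mathrm{Phot} - \mathrm{Resp}) - \mathrm{Chla}_t\,(\mathrm{Cop} + \mathrm{Clad}) \times 0.0001, \tag{1}
$$

where

$$
\mathrm{Phot} = (0.068\,T)\,\frac{0.025\,L}{28 + 0.025\,L}\,\frac{\dfrac{P}{\mathrm{Chla}_t}}{\dfrac{1.7}{X} + \dfrac{P}{X} + \dfrac{1.7}{\mathrm{Chla}_t} + \dfrac{P}{\mathrm{Chla}_t}},
$$

$$
X = 5.76\,\mathrm{Chla}_t^{0.41},
$$

and

$$
\mathrm{Resp} = (0.00228\,T) + 0.3\,\mathrm{Phot}.
$$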
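One way to evaluate Eq. (1) numerically is sketched below in Python. The sketch assumes that the phosphorus factor of Phot is the nested fraction (P/Chla) / (1.7/X + P/X + 1.7/Chla + P/Chla), and that the model is iterated from an initial concentration using its own previous prediction; neither the function names nor the iteration scheme are taken from the paper.

```python
def phot(chla, T, L, P):
    """Photosynthesis term of Eq. (1) under the assumed fraction layout."""
    X = 5.76 * chla ** 0.41
    light = 0.025 * L / (28 + 0.025 * L)
    nutrient = (P / chla) / (1.7 / X + P / X + 1.7 / chla + P / chla)
    return 0.068 * T * light * nutrient

def step(chla, T, L, P, cop, clad):
    """Advance chlorophyll-a by one time step using Eq. (1)."""
    ph = phot(chla, T, L, P)
    resp = 0.00228 * T + 0.3 * ph
    return chla + chla * (ph - resp) - chla * (cop + clad) * 0.0001

def simulate(chla0, drivers):
    """Iterate Eq. (1) over (T, L, P, cop, clad) tuples, one per day."""
    series = [chla0]
    for T, L, P, cop, clad in drivers:
        series.append(step(series[-1], T, L, P, cop, clad))
    return series

# Illustrative driver values near the averages listed in Table 1.
print(simulate(50.0, [(16.4, 1281.0, 14.1, 156.4, 169.9)] * 3))
```

Under this reading the phosphorus factor simplifies algebraically to P/(1.7 + P) multiplied by X/(X + Chla), that is, a saturating phosphorus response scaled by a density-dependent factor.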
4.1. Using genetic algorithms to calibrate the process model
In an attempt to improve the performance of Eq. (1) the constants were tuned based on the training data. Each constant could vary within a range of ±20% of its original value, with the search based on the 2 years of training data. The optimal settings for the constants were searched using the Genetic Algorithm Optimisation Toolbox for Matlab 5. The parameters for this GA used a population of 100, with a simple binary crossover (90%) and binary mutation (5%). The selection scheme used was roulette wheel, and the GA was run for 150 generations. This setup was run 20 times, with the final solution selected from the lowest training result. The RMSE was used as the fitness measure; however, only values of chlorophyll-a that were above 75 mg/l contributed to the error measure, in an attempt to force the equation to better model the peak events. The prediction for the test and training period is shown in Fig. 2 and the modified equation is shown as Eq. (2). This equation had a dramatically improved RMSE of 27.6 for the training period and 31.9 for the 3 years of test data.
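$$
\mathrm{Chla}_{t+1} = \mathrm{Chla}_t + \mathrm{Chla}_t\,(\mathrm{Phot} - \mathrm{Resp}) - \mathrm{Chla}_t\,(\mathrm{Cop} + \mathrm{Clad}) \times 0.00008, \tag{2}
$$

where

$$
\mathrm{Phot} = (0.07634\,T)\,\frac{0.0295\,L}{22.4 + 0.0299\,L}\,\frac{\dfrac{P}{\mathrm{Chla}_t}}{\dfrac{1.36}{X} + \dfrac{P}{X} + \dfrac{1.36}{\mathrm{Chla}_t} + \dfrac{P}{\mathrm{Chla}_t}},
$$

$$
\mathrm{Resp} = (0.00273\,T) + 0.2696\,\mathrm{Phot},
$$

and

$$
X = 4.608\,\mathrm{Chla}_t^{0.328}.
$$

Fig. 2. Prediction of chlorophyll-a using a difference equation with constants calibrated using the training data (training RMSE=27.6, test RMSE=31.9).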
The success of this equation demonstrates that the constants of process-based models may be successfully calibrated to conditions in specific freshwater lakes using simple machine-learning techniques. The ability to also search for new constant values within some range (in this case ±20%) is a useful method for constraining the search within limits that are physically meaningful.
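The two ingredients of this calibration, a bounded search range for each constant and an error measure restricted to peak values, can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the 75 mg/l threshold is applied to the observed rather than the predicted values, which the text does not specify.

```python
import math

def bounds(nominal, fraction=0.20):
    """Search range of +/- `fraction` around each nominal constant."""
    return {
        name: (value * (1 - fraction), value * (1 + fraction))
        for name, value in nominal.items()
    }

def within_bounds(candidate, limits):
    """True when every candidate constant lies inside its search range."""
    return all(lo <= candidate[name] <= hi for name, (lo, hi) in limits.items())

def peak_rmse(predicted, observed, threshold=75.0):
    """RMSE over the time steps whose observed value exceeds `threshold`."""
    pairs = [(p, o) for p, o in zip(predicted, observed) if o > threshold]
    if not pairs:
        raise ValueError("no observed values above the threshold")
    return math.sqrt(sum((p - o) ** 2 for p, o in pairs) / len(pairs))

# Three of the original constants of Eq. (1); the names are illustrative.
limits = bounds({"half_saturation": 1.7, "grazing_rate": 0.0001, "x_scale": 5.76})
print(limits["half_saturation"])
print(peak_rmse([70.0, 100.0], [50.0, 90.0]))
```

A GA can respect such limits by decoding each bit string directly into the corresponding interval, so that every member of the population is a physically admissible parameter set.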
4.2. Learning modifications to the process-based model
This section will consider using CFG-GP to extend the process-based model by searching for new representations of one component of the process model, a single component at a time. For example, the photosynthesis, respiration or grazing terms could be evolved within the original process-based model framework, while keeping the other factors in the model constant.
To demonstrate this concept, the grazing term from Eq. (2) will be evolved within the original model. The purpose is to search for a better representation of the grazing term in relation to the theoretically and experimentally determined process model. The original grazing term including parameter calibration was (−chla_t*(cop+clad)*0.00008). Two approaches will be described; the first will evolve a grazing term of the form (chla_t*(cop,clad)); the second will allow the chla_t variable to be used more than once in the grazing term, in other words, the grazing term is a function of all three variables: f(chla_t,cop,clad). Note that neither equation is allowed to use the ‘minus’ operator, to ensure that the grazing term is always positive and therefore, when subtracted, always lowers the predicted chla_{t+1} concentration. The following grammars represent the two grazing equations that will be evolved.
Fig. 3. Chlorophyll-a prediction using G_grazing1.
Both equations were evolved using a population size of 1000 and evolved for 50 generations. Crossover was set to 90% and mutation 5% over the non-terminal GT. Each setup was run 20 times, and the best result based on the training data, over the 20 runs, selected as the solution. The fitness measure used was RMSE, and the evaluation of the grazing term was achieved by incorporating Eq. (2), with the grazing term removed, into the fitness evaluation. Although it was possible to define Eq. (2) directly in the grammar, and let the system just evolve new terms for grazing, incorporating Eq. (2) into the CFG-GP program lowered the evaluation time for the program.
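The fitness evaluation described above, in which Eq. (2) is fixed and only the grazing term is supplied by the evolved program, might look like the following Python sketch. It assumes the nested-fraction reading of the phosphorus factor in Eq. (2) and a one-step-ahead prediction from the measured chlorophyll-a; the function names and the driver tuple layout are also assumptions rather than details given in the paper.

```python
import math

def step_with_grazing(chla, T, L, P, cop, clad, grazing):
    """Eq. (2) with its grazing term replaced by the callable `grazing`."""
    X = 4.608 * chla ** 0.328
    light = 0.0295 * L / (22.4 + 0.0299 * L)
    nutrient = (P / chla) / (1.36 / X + P / X + 1.36 / chla + P / chla)
    ph = 0.07634 * T * light * nutrient
    resp = 0.00273 * T + 0.2696 * ph
    return chla + chla * (ph - resp) - grazing(chla, cop, clad)

def calibrated_grazing(chla, cop, clad):
    """Grazing term of the calibrated model, Eq. (2)."""
    return chla * (cop + clad) * 0.00008

def fitness(grazing, drivers, observed):
    """RMSE of one-step-ahead predictions; lower is better.

    `drivers[i]` holds (T, L, P, cop, clad) for day i, and `observed`
    holds one more chlorophyll-a value than there are driver tuples.
    """
    predicted = [
        step_with_grazing(observed[i], *drivers[i], grazing)
        for i in range(len(drivers))
    ]
    errors = [(p - o) ** 2 for p, o in zip(predicted, observed[1:])]
    return math.sqrt(sum(errors) / len(errors))

# Synthetic check: data generated with the calibrated term is fitted exactly.
drivers = [(20.0, 1281.0, 14.0, 150.0, 170.0)] * 10
observed = [50.0]
for d in drivers:
    observed.append(step_with_grazing(observed[-1], *d, calibrated_grazing))
print(fitness(calibrated_grazing, drivers, observed))
```

Keeping the fixed part of the model in ordinary code means that each candidate program only has to evaluate its own grazing expression, which is consistent with the reduction in evaluation time noted above.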
4.2.1. Resulting grazing equations
The best solution for G_grazing1 had a training error of 28.5 and a test error of 35.42. Note that these results are comparable with the original calibrated difference equation. Since it was not possible to significantly improve on the original equation, when constrained to only use Copepoda and Cladocera as variables, it is possible to conclude that this form of equation is an appropriate formalisation of the model. The grazing term produced using G_grazing1 was:
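$$
\mathrm{chla}_t \times 0.000048 \times \left[\,2\,\mathrm{cop} + \mathrm{clad} + 0.000069 + X\,\right], \tag{3}
$$

$$
X = (0.000069\,\mathrm{clad} + 0.000069\,\mathrm{cop})\,\cosh\!\left(\frac{\mathrm{clad}}{\mathrm{cop}}\right).
$$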
Eq. (3) is basically the same form as the original grazing term and can be simplified to chla_t*0.000048*(2cop+clad), since the cosh terms are multiplied by a factor of 10^-10 and therefore can be ignored. The resulting prediction is shown in Fig. 3. The main difference compared with the original equation is the increased dominance of the Copepoda term. Note that the constant 0.000069 appears a number of times in Eq. (3). This is a result of the crossover operator propagating useful sub-expressions over a number of generations to construct the solution. Because the constant was associated with good partial solutions it has spread throughout the population. In terms of producing process equations this is a useful side effect of the technique, since it allows general constants in an equation to be discovered, which may have some fundamental meaning for the system.
The best solution for G_grazing2 had a training error of 22.26 and a test error of 29.63, which is an improvement over the original, calibrated process model. The resulting predicted curve is shown in Fig. 4. The grazing term produced using G_grazing2 is shown in Eq. (4).
$$
\frac{\exp(\mathrm{chla}_t/18.03)}{\mathrm{cop}} \times \frac{(18.03/\mathrm{chla}_t) + (\mathrm{clad}/\mathrm{cop}) + \mathrm{clad} + \mathrm{cop}}{(28.86/\mathrm{chla}_t) + \mathrm{chla}_t}. \tag{4}
$$

Note that the constant 18.03 is repeated in this equation, and an interesting question arises as to whether 18.03 should also replace 28.86. This would indicate a general constant in the grazing term that could be given some meaning related to how the system is functioning. These considerations, however, are beyond the scope of this preliminary work. Eq. (4) has also used the chla_t term several times. Since this formalisation was not part of Eq. (2) it has extended the possible interpretation of the grazing term. The form of Eq. (4) is reminiscent of hyperbolic/inverse hyperbolic relationships that are quite common in multiple resources dynamics. This is a promising outcome of the work and shows that it may be possible to use this data-driven approach to reconstruct and extend theories regarding freshwater ecosystem dynamics.
5. Discussion
The preceding experiments have demonstrated that models can be developed for the non-linear dynamics of phytoplankton. The use of a genetic algorithm to calibrate the constants of the process-based equation has been shown to significantly improve the predictive performance of the model. Additionally, an extended genetic programming system has been used to evolve new components of the process equation, improving the overall performance on unseen data. This is a promising area of work that should allow the exploration of new formulations of photosynthesis, respiration and grazing terms for process-based equations. Although this approach is not likely to produce the best predictor (Liu and Yao, 1999) based on the training and test data, it shows the possibility of allowing better process models to be produced and therefore to extend our knowledge of the underlying dynamics of ecological systems. In that respect, it is a complementary approach to other machine learning and hybrid systems, since the combination of different techniques can help support new descriptions and developments in a fundamental manner.
Several more general issues have been highlighted by this work. The root mean square error was used as the fitness function; however, this is essentially a non-temporal measure for time series since it does not take into account the shape of the curves that are being compared (Whigham and Aldridge, 2000). Other error metrics for time series, which take into account the shape of the series, may improve the developed models by changing the shape of the fitness landscape and driving the system towards better partial solutions. In particular, previous work in time series similarity and data mining has shown that RMSE is not always the most appropriate measure for comparing a predicted and actual time series (Keogh and Pazzani, 1999). Further work is required to determine whether better similarity measures would improve the development of these process-based equations, for both the constant calibration and the CFG-GP approaches.
This work has not attempted to place any physical meaning on the evolved expression for grazing. Rather, the fact that the new grazing term improves the prediction on the test data has been used to demonstrate that improvements for some portion of the process model are possible. Future work is required to explore whether using this approach to gradually evolve small modifications to the current process model can extend process understanding; however, the initial results presented here are very promising.

Fig. 4. Chlorophyll-a prediction using G_grazing2.
The general formulation of Eq. (1), with some extensions, can also be applied to individual algal species. Future work will investigate the use of GA’s to calibrate the constants of this equation for predicting various algal species dynamics in different freshwater lakes. By comparing the evolved constants within each equation the driving factors for the behaviour of each species should be able to be inferred and perhaps generalised by means of data from different lakes. Comparing the inferred behaviour with the known dynamics of each species will help to understand how a particular population of algal species behaves under the physical constraints of a variety of freshwater lakes.
6. Conclusion
The calibration of a deterministic process-based model using genetic algorithms has been demonstrated. The dramatically improved performance of the calibrated model indicates that GA’s are an appropriate method for tuning the constants of a process-based model. The application of a symbolic machine learning system has also been shown for exploring modifications to the process model, by evolving a new term for one particular component of the system. The success of this approach indicates that it is possible to explore process descriptions in an incremental fashion. Areas for future research have also been highlighted.
References
Bennet, K., Ferris, M.C., Ioannidis, Y.E., 1991. A genetic algorithm for database query optimization. In: Belew, R.K., Booker, L.B. (Eds.), Proceedings of the Fourth International Conference on Genetic Algorithms
Cleveland, G.A., Smith, S.F., 1989. Using genetic algorithms to schedule flow shop releases. In: Schaffer, J.D. (Ed.), Proceedings of the Third International Conference on Genetic Algorithms
De Jong, K., Spears, W., Gordon, D., 1993. Using Genetic Algorithms for Concept Learning. Machine Learn. 13, 5 – 29.
Goldberg, D.E., 1989. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley.
Grefenstette, J.J., 1985. Optimization of control parameters for genetic algorithms. IEEE Trans. Syst., Man Cybern. 16 (1), 122 – 128.
Holland, J.H., 1975. Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor, MI.
Keogh, E.J., Pazzani, M.J., 1999. Relevance Feedback Retrieval of Time Series Data. The 22nd Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval.
Koza, J.R., 1992. Genetic Programming: On the Programming of Computers by means of Natural Selection. MIT Press, Cambridge, MA.
Liu, Y., Yao, X., 1999. Time Series Prediction by Using Negatively Correlated Neural Networks. Lecture Notes in Artificial Intelligence, vol. 1585. Springer-Verlag, Berlin, pp. 325 – 332.
Recknagel, F., 1997. ANNA — Artificial Neural Network model for predicting species abundance and succession of blue-green algae. Hydrobiologia 394, 47 – 57.
Recknagel, F., Benndorf, J., 1982. Validation of the ecological simulation model SALMO. Int. Revue ges. Hydrobiol. 67 (1), 113 – 125.
Recknagel, F., French, M., Harkonen, P., Yabunaka, K., 1997. Artificial neural network approach for modelling and prediction of algal blooms. Ecol. Model. 96 (1 – 3), 11 – 28.
Recknagel, F., Fukushima, T., Hanazato, T., Takamura, N., Wilson, H., 1998. Modelling and prediction of phyto- and zooplankton dynamics in Lake Kasumigaura by artificial neural networks. Lakes and Reservoirs: Research and Management 3, 123 – 133.
Reynolds, C.S., 1984. The Ecology of Freshwater Phytoplankton. Press Syndicate of the University of Cambridge, New York.
Takamura, N., Otsuki, A., Aizaki, M., Nojiri, Y., 1992. Phytoplankton species shift accompanied by transition from nitrogen dependence to phosphorus dependence of primary production in Lake Kasumigaura. Arch. Hydrobiol. 124, 129 – 148.
Whigham, P.A., 1995. Grammatically-based Genetic Programming. In: Rosca, J. (Ed.), Proceedings of the Workshop on Genetic Programming: From Theory to Real-World Applications
Whigham, P.A., 1996. Search bias, language bias and genetic programming. In: Koza, J.R., Goldberg, D.E., Fogel, D.B., Riolo, R.L. (Eds.), Genetic Programming 1996: Proceedings of the First Annual Conference, Stanford University. MIT Press, Cambridge, MA.
Whigham, P.A., 2000. Induction of a marsupial density model using genetic programming and spatial relationships. Ecol. Model. 131 (2 – 3), 299 – 317.
Whigham, P.A., Aldridge, C., 2000. A shape metric for evolving time series models. Information Science Department Discussion Series, University of Otago.
Whigham, P.A., Crapper, P.F., 1999. Time series modelling using genetic programming: an application to rainfall-runoff models. In: Spector, L., Langdon, W.B., O’Reilly, U., Angeline, P.J. (Eds.), Advances in Genetic Programming 3, vol. 5. MIT Press, Cambridge, MA, USA, pp. 89 – 104.