4.2 Methods
4.2.4 Prey assessment through statistical methods: Linear Models (LM) and Generalized Linear Models (GLM)
Linear Models (LM) form the core of classical statistics, and many modern modelling and analytical techniques build on their methodology. LM are simply a combination of elements from analysis of variance and regression, using a similar method for partitioning variance between explanatory variables, which can be either continuous or categorical. Using the statistical software S-PLUS (6.1 for Windows, MathSoft Inc., Seattle, Washington), the structure of the model can be specified in the model formula as a word equation:
• response variable ∼ explanatory variable(s);
where the dependent variable is on the left, and the variables we suspect of influencing the data are on the right-hand side of the formula. The tilde symbol reads "is modelled as a function of" (Venables and Ripley, 1997).
Following standard procedures for linear models, the relationships between prey lengths and estimated mass according to franciscana dolphin sex and sexual maturity, temporal variation (season), and spatial variation (northern and southern areas) were analysed. Log-transformations were performed when the residuals violated the assumption of normality. Accordingly, the models can be written as:
• LM (log(prey length) ∼ area);
• LM (log(prey estimated mass) ∼ season).
From the models, analysis of variance tables were used to obtain the values of degrees of freedom, residuals, F, and p, to identify significant relationships.
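The variance partitioning behind such a table can be sketched for the simplest case, a log-transformed response and one categorical factor. The sketch below is in Python rather than S-PLUS and is not the procedure used in this study; the function name `one_way_anova` and the prey lengths are hypothetical and serve only as an illustration.

```python
import math

def one_way_anova(groups):
    """Partition the variance of a response across the levels of one factor.

    `groups` maps each factor level to its list of response values.
    Returns (df_factor, df_residual, F).
    """
    values = [v for g in groups.values() for v in g]
    n = len(values)
    grand_mean = sum(values) / n
    ss_between = 0.0  # variation explained by the factor
    ss_within = 0.0   # residual variation inside each level
    for g in groups.values():
        mean_g = sum(g) / len(g)
        ss_between += len(g) * (mean_g - grand_mean) ** 2
        ss_within += sum((v - mean_g) ** 2 for v in g)
    df_between = len(groups) - 1
    df_within = n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return df_between, df_within, f_stat

# Hypothetical prey lengths (mm), invented for illustration only.
lengths_mm = {
    "northern": [52.0, 61.0, 58.0, 70.0, 66.0],
    "southern": [88.0, 95.0, 79.0, 102.0, 91.0],
}
# Log-transform the response, as in LM (log(prey length) ~ area).
log_lengths = {area: [math.log(x) for x in xs] for area, xs in lengths_mm.items()}
df_area, df_resid, f_value = one_way_anova(log_lengths)
print("df (area):", df_area, "df (residual):", df_resid, "F:", round(f_value, 2))
```

The p value would then be read from the F distribution with the two degrees of freedom shown; that step is left to a statistics package.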
The estimated lengths (mm) and mass (g) of the prey species (n=31) used for the LM analysis were calculated from the regression analyses previously cited (Tables 4.2 and 4.3).
For LM it is assumed that the variance is constant and the errors are normally distributed. However, in count data, where the response variable is an integer and there are often a lot of zeros in the data frame, the variance may increase linearly with the mean. Additionally, the response variable is expected to follow a Poisson distribution. Thus, LM is not a good choice for handling this kind of data (Crawley, 2002).
In this case, we are looking at estimates of fish abundance in terms of fish numbers (count data), and the data may be assumed to come from a Poisson distribution. The way to deal with these problems, in a single theoretical framework, is the technique of GLM.
GLM is used in the same way as the model-fitting procedure of LM, but we also need to specify a family of error structures (in our case, Poisson for count data) and a particular link function. For categorical explanatory variables, such as our variable "area", a two-level factor (northern and southern), the method uses the log link function and a linear variance-mean relationship; a GLM of this kind is called a log-linear model (Grafen and Hails, 2002). Another advantage of GLM is that it works reasonably well with unbalanced data, as is the case for the categorical variable season (see the chi-square test for the hypothesis of homogeneity of groups in Section 4.3.1).
Furthermore, the log link is frequently used for count data, where negative values are prohibited. For our data we performed an analysis of deviance for a categorical explanatory variable with count data, using the chi-square test at the 0.05 significance level (Crawley, 2002). In addition, to compensate for overdispersion (where the residual deviance is greater than the residual degrees of freedom), the F test was used as a remedial measure. The F test uses an estimate equivalent to the error variance, and is a much harsher test than the chi-square test. From the models, analysis of deviance was performed to obtain the values of degrees of freedom, residuals, and p, to identify significant relationships.
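A minimal sketch of this analysis of deviance is given below, in Python rather than S-PLUS. It relies on the fact that, for a Poisson model with a single categorical factor, the fitted values are the group means. The counts are invented, the function names are hypothetical, and the dispersion estimate (residual deviance divided by residual degrees of freedom) is an assumption made for illustration; it is not necessarily the estimate reported by S-PLUS.

```python
import math

def poisson_deviance(counts, fitted):
    """Poisson deviance 2 * sum(y*log(y/mu) - (y - mu)), taking 0*log(0) as 0."""
    total = 0.0
    for y, mu in zip(counts, fitted):
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += log_term - (y - mu)
    return 2.0 * total

def chi_square_p_one_df(x):
    """Upper-tail p value of a chi-square statistic with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def analysis_of_deviance(groups):
    """Compare the null model (one mean) with a one-factor model (group means)."""
    values = [y for g in groups.values() for y in g]
    grand_mean = sum(values) / len(values)
    null_dev = poisson_deviance(values, [grand_mean] * len(values))
    resid_dev = sum(
        poisson_deviance(g, [sum(g) / len(g)] * len(g)) for g in groups.values()
    )
    df_factor = len(groups) - 1
    df_resid = len(values) - len(groups)
    change = max(null_dev - resid_dev, 0.0)  # deviance explained by the factor
    dispersion = resid_dev / df_resid        # assumed overdispersion estimate
    # F test: scale the change in deviance by the dispersion estimate.
    f_stat = (change / df_factor) / dispersion if dispersion > 0 else float("inf")
    return {"df_factor": df_factor, "df_resid": df_resid, "change": change,
            "resid_dev": resid_dev, "dispersion": dispersion, "F": f_stat}

# Hypothetical counts of one prey species per stomach, invented for illustration.
counts = {"northern": [0, 3, 1, 0, 5, 2], "southern": [4, 9, 0, 12, 7, 6]}
table = analysis_of_deviance(counts)
p_chi = chi_square_p_one_df(table["change"])  # valid here because area has 2 levels
print(table)
print("chi-square p:", p_chi, "overdispersed:", table["dispersion"] > 1.0)
```

When the dispersion estimate exceeds 1, the F statistic is smaller than the raw change in deviance, which is why the F test is the harsher of the two.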
In short, the GLM has three important properties:
• (1) The error structure - a GLM allows the specification of a variety of different error distributions (e.g. binomial errors, useful with data on proportions; Poisson errors, useful with count data). The error structure is defined by means of the family directive, used as part of the model formula (see number 3).
• (2) The linear predictor - the right-hand side of the model equation is of exactly the same form in GLM as in LM. It can contain categorical terms, continuous terms, and all kinds of interactions. But instead of being set equal to the fitted value directly, this expression is called the linear predictor and is related to the fitted value by a link function. The identity link, in which the linear predictor is simply set equal to the fitted value, is the link function for an ordinary LM.
• (3) The link function - the link function relates the mean value of y to its linear predictor. An important criterion in the choice of link function is to ensure that the fitted values stay within reasonable bounds. We would want to ensure, for example, that counts are all greater than or equal to 0 (negative count data are not feasible values). In this case, a log link is appropriate because the fitted values are antilogs of the linear predictor, and all antilogs are positive. Moreover, the most appropriate link function may be the one which produces the minimum residual deviance. The use of a link function replaces transformation of variables without corrupting the error structure.
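The three properties can be summarised in symbols for the Poisson case with a log link. The notation is introduced here for illustration only and is not taken from the cited texts: y_i is the observed count, μ_i its mean (the fitted value), η_i the linear predictor, x_i an indicator for a two-level factor such as area, and β_0, β_1 the coefficients.

```latex
\begin{align*}
  y_i &\sim \mathrm{Poisson}(\mu_i), \qquad \operatorname{Var}(y_i) = \mu_i
      && \text{(error structure)} \\
  \eta_i &= \beta_0 + \beta_1 x_i
      && \text{(linear predictor)} \\
  \log \mu_i &= \eta_i \;\Longrightarrow\; \mu_i = e^{\eta_i} > 0
      && \text{(log link)}
\end{align*}
```

With the identity link, μ_i = η_i, the last line reduces to the ordinary LM.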
Thus, for example, the GLM formula can be written as:
• GLM (prey species ∼ area, family = poisson (link = log)).
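To show what fitting such a model involves, the sketch below estimates a Poisson model with a log link and one two-level factor by iteratively reweighted least squares, in plain Python. It is an illustration under stated assumptions, not the S-PLUS routine: the function name `fit_poisson_log_link`, the 0/1 coding of area (northern = 0, southern = 1), and the counts are all hypothetical.

```python
import math

def fit_poisson_log_link(y, x, iterations=25, tol=1e-10):
    """Fit log(mu) = b0 + b1 * x by iteratively reweighted least squares."""
    b0 = math.log(sum(y) / len(y))  # start from the overall mean count
    b1 = 0.0
    for _ in range(iterations):
        eta = [b0 + b1 * xi for xi in x]   # linear predictor
        mu = [math.exp(e) for e in eta]    # inverse of the log link
        # Working response; the Poisson working weights equal mu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        s_w = sum(mu)
        s_wx = sum(m * xi for m, xi in zip(mu, x))
        s_wxx = sum(m * xi * xi for m, xi in zip(mu, x))
        s_wz = sum(m * zi for m, zi in zip(mu, z))
        s_wxz = sum(m * xi * zi for m, xi, zi in zip(mu, x, z))
        # Solve the 2 x 2 weighted least-squares normal equations.
        det = s_w * s_wxx - s_wx ** 2
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = abs(new_b0 - b0) < tol and abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

# Hypothetical counts of one prey species per stomach, invented for illustration.
prey_counts = [0, 3, 1, 0, 5, 2, 4, 9, 0, 12, 7, 6]
area = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]  # assumed coding: 0 northern, 1 southern
intercept, area_effect = fit_poisson_log_link(prey_counts, area)
# On the count scale, exp(intercept) is the northern mean and
# exp(area_effect) is the ratio of the southern mean to the northern mean.
print(math.exp(intercept), math.exp(area_effect))
```

Because the coefficients live on the log scale, the fitted counts can never be negative, which is the practical reason for choosing the log link here.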
Log-linear models were fitted using the GLM procedure in S-PLUS to analyse differences in franciscana prey frequencies according to franciscana sex and sexual maturity, season, and sites (northern and southern). The values of degrees of freedom, residuals, F, and p are presented in the variance (LM) and deviance (GLM) tables.
The results from the GLM analysis are used not only to identify significant relationships but also to determine whether prey importance is similar to that obtained using the traditional method (IRI). In addition, the LM and GLM methods were chosen because they allow the analysis to be extended. These models allow the inclusion of more than one explanatory variable, and the combination of categorical and continuous variables, which permits exploring interactions between them as well as adding and analysing oceanographic variables (e.g. sea surface temperature, chlorophyll-a). It is here that the flexibility of these models will become apparent for further analysis, mainly in Chapter 6. Therefore, identifying the significant parameters for the variety of franciscana prey through LM and GLM in this chapter is the first step towards the more complex models in Chapter 6.