Data preparation and drought identification methods
3.3 TIME SERIES CONSTRUCTION
3.3.1 Data infilling
It is common for meteorological time series to include periods of missing data of variable length, ranging from a few days or months to years and decades. A variety of methods were used to infill these gaps, chosen according to the length of the gap in each series and the availability of data for the same period from alternative sources and neighbouring stations. Single missing months were infilled using linear interpolation, calculated as the arithmetic mean of the preceding and following monthly values. Consecutive months and years of missing data were infilled primarily using data from neighbouring stations and linear regression analysis. Where data for the same period were not available from a nearby station, values were estimated as the mean of all earlier values for the same calendar month in the series (e.g. a missing January was estimated as the average of all January values prior to it). This approach was rarely applied; it was required only for series that include the earliest rainfall observations, for which few, if any, records are available for the same period from a nearby station.
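As an illustration only, the two simpler infilling rules described above can be sketched in Python. The function names, the use of None to mark a missing month, and the assumption that the series is a gap-free monthly sequence (so that values twelve positions apart share a calendar month) are choices made for this sketch and are not taken from the thesis workflow.

```python
from statistics import mean


def infill_single_gaps(values):
    """Fill isolated missing months (None) with the mean of the two neighbours."""
    filled = list(values)
    for i in range(1, len(values) - 1):
        before, after = values[i - 1], values[i + 1]
        # Only a single missing month with observed values on both sides qualifies.
        if values[i] is None and before is not None and after is not None:
            filled[i] = (before + after) / 2
    return filled


def infill_with_prior_monthly_mean(values):
    """Fill gaps with the mean of all earlier observed values for the same calendar month."""
    filled = list(values)
    for i, value in enumerate(values):
        if value is None:
            # Indices i % 12, i % 12 + 12, ... below i fall in the same calendar month.
            earlier = [values[j] for j in range(i % 12, i, 12) if values[j] is not None]
            if earlier:
                filled[i] = mean(earlier)
    return filled
```

In this sketch the second function averages only observed values, never previously infilled ones, so that estimates do not feed into later estimates. Gaps with no earlier observation for that calendar month are left unfilled.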
3.3.2 Infilling and extension using Linear Regression
Linear regression is a frequently used statistical method, which has been applied in studies of long meteorological time series in the UK (Macdonald et al. 2008; Gormally et al. 2011). The purpose of linear regression is to assess the relationship between a single dependent variable y and one or more independent variables x1, x2, x3, and so on. Where there is only one independent variable, the method is known as simple linear regression; this is the method applied in this thesis.
Simple linear regression attempts to model the relationship between two variables by fitting a regression line and linear equation to observed data. A common method for fitting the regression line is that of least-squares, which fits a straight line through a set of n points so that the sum of the squared residuals of the model (the vertical distances between the points in a dataset and the fitted line) is as small as possible. A numerical measure of the association between two variables is the correlation coefficient, which ranges between -1 and 1 and indicates the strength and direction of the relationship between the variables. A linear regression line has an equation of the form:
y = a + bx    (3.3)

where y is the dependent variable, x is the independent variable, b is the slope of the line and a is the intercept (the value of y when x = 0), as depicted in Figure 3.3a. This method was first applied to assess the strength of the relationship between the primary rainfall series (dependent variable) and neighbouring rainfall series (independent variable), and so to select objectively the stations used for infilling periods of missing data in the primary series and extending record lengths to present. A scatter plot, regression line and linear equation were produced using monthly data for the period of overlap between the primary and neighbouring rainfall series; the longest available overlap between the two series was used. Correlation was quantified in Excel as r² values.

No direct guidance exists on what correlation value indicates a relationship strong enough for a neighbouring station to be accepted for infilling and extending a primary series; a threshold must instead be inferred from the extensive literature on homogenising climate time series. Peterson and Easterling (1994) provided guidance on constructing reference meteorological series for homogeneity testing and recommended that a correlation coefficient of ≥0.8 be achieved for a neighbouring station to be considered reliable enough to use. This threshold was adhered to as often as possible and was particularly applicable to data from 1860 onwards; however, lower correlation values were accepted for the earliest periods because of a lack of available data. Although the threshold was recommended for the purpose of homogeneity adjustment, it is also applicable to infilling periods of missing data, since both applications require a reference series.
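For reference, the least-squares estimates of the slope b and intercept a in Equation 3.3, and the correlation coefficient r between the two series, take the standard textbook forms below. These are general formulas written for n paired monthly values (x_i, y_i) from the overlap period; the bar notation for sample means is introduced here and is not a transcription of the spreadsheet calculation used in the thesis.

```latex
\begin{align*}
b &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \\
a &= \bar{y} - b\,\bar{x}, \\
r &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
          {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}.
\end{align*}
```

In simple linear regression with an intercept, the r² value reported alongside a fitted trendline is the square of this r, so a correlation of 0.8 corresponds to an r² of 0.64.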
Once a reliable neighbouring station was chosen, the linear regression equation derived between the two data series was used to predict values in the primary (y) rainfall series, both to infill periods of missing data and to extend record lengths to present. Where values of x were available without accompanying values of y, the fitted linear equation (shown in Figure 3.3a as it appears in Microsoft Excel) was used to predict the value of y (Figure 3.3b).
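The regression-based infilling described above can be sketched in Python as follows. This is a minimal illustration rather than the Excel procedure used in the thesis: the function names, the use of None for missing months, and the min_r2 parameter are assumptions. In particular, whether the 0.8 threshold is applied to r or to r² is a choice the caller must make, which is why the cut-off has no default here.

```python
def fit_line(x, y):
    """Least-squares fit of y = a + b*x over paired values; returns (a, b, r2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    r2 = sxy ** 2 / (sxx * syy)
    return a, b, r2


def infill_from_neighbour(primary, neighbour, min_r2):
    """Fill gaps (None) in primary from neighbour using a line fitted on their overlap."""
    # The overlap is every month where both series have an observation.
    overlap = [(x, y) for x, y in zip(neighbour, primary)
               if x is not None and y is not None]
    xs, ys = zip(*overlap)
    a, b, r2 = fit_line(xs, ys)
    if r2 < min_r2:
        # Hypothetical acceptance rule: reject weakly related neighbours.
        raise ValueError(f"r2 = {r2:.2f} is below the accepted minimum {min_r2}")
    filled = [a + b * x if y is None and x is not None else y
              for x, y in zip(neighbour, primary)]
    return filled, (a, b, r2)
```

Extending a record to present works the same way in this sketch: months after the primary series ends are represented as None in primary and predicted from the neighbour wherever it has data.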
Figure 3.3: Schematic of the simple linear regression method applied to constructing long meteorological time series
3.3.3 Homogeneity testing
In order for analyses of long-term climate series to be accurate, and for their application in drought analyses to be of value, climate data must be homogeneous. Conrad and Pollak (1950) defined a homogeneous climate series as one where variations are caused only by variations in weather and climate. It is common, however, for long climate time series to be affected by non-climatic factors that make them unrepresentative of actual climate variation over time. Factors such as a change of instrument, relocation of the station, different observing practices and a change in the local environment of the station all cause variation in climate time series. Some factors cause distinct discontinuities that are easily detectable (e.g. instrument change), whereas others cause a gradual change in the data that is harder to detect (e.g. change in the environment of the station). These non-climatic changes are known as inhomogeneities and may be present in addition to variations arising from weather and climate. To avoid misinterpretation of the climate data, it is important to identify and remove these inhomogeneities (Peterson et al. 1998).
Inhomogeneities are detected through homogeneity testing. A plethora of homogeneity testing methods exist, divided into two categories: direct and indirect methodologies. These are summarised in a review of methods by Peterson et al. (1998) and in the World Meteorological Organization's guidelines on climate metadata and homogenization (Aguilar et al. 2003). Methods have been developed according to the climatic variable being studied (e.g. temperature or precipitation), the climate of a particular country, the availability of data (e.g. length of time series) and the availability of metadata. Consequently, not all methods are applicable to all areas and types of data; they are necessarily country and data specific.
Following recommendations made in Peterson et al. (1998) and Aguilar et al. (2003), and a review of homogeneity testing methods commonly applied in studies of long-term climate series for the UK and Europe (Craddock 1979; Burt 2009; Burt and Howden 2011; Camuffo et al. 2013), the following homogeneity tests were undertaken in this thesis:
Inspection of plotted long climate time series to identify inhomogeneities shown by distinct discontinuities (steps or breaks)
Double mass curve plots using the primary climate series and data from nearby stations as reference series (Craddock 1979)
Standard normal homogeneity test, also using the primary climate series and data from nearby stations as reference series (Alexandersson 1986)
Review of primary station metadata to identify physical causes of inhomogeneities (Guttman 1998)
Review of additional sources of climate data to verify the climate of a particular period identified as inhomogeneous (e.g. using data and descriptions of the weather included in volumes of Symons British Rainfall)
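Two of the tests listed above, the double mass curve and the standard normal homogeneity test, lend themselves to a short Python sketch. This is an illustration under stated assumptions rather than the procedure followed in the thesis: the function names are hypothetical, inputs are taken to be complete series of equal length (for example annual totals) with a non-zero reference, the ratio form of the test is assumed for rainfall, and the ratios are standardised with the sample standard deviation.

```python
from itertools import accumulate
from statistics import mean, stdev


def double_mass_curve(primary, reference):
    """Return (cumulative reference, cumulative primary) pairs for plotting.

    A homogeneous primary series plots close to a straight line; a change in
    slope suggests a possible inhomogeneity at that point in the record.
    """
    return list(zip(accumulate(reference), accumulate(primary)))


def snht(primary, reference):
    """Single-shift standard normal homogeneity test on the ratio series.

    Returns (t_max, k), where k is the number of values before the most
    likely break and t_max is the test statistic at that position.
    """
    ratios = [p / r for p, r in zip(primary, reference)]
    n = len(ratios)
    ratio_mean = mean(ratios)
    ratio_sd = stdev(ratios)
    z = [(q - ratio_mean) / ratio_sd for q in ratios]
    t_max, best_k = 0.0, None
    for k in range(1, n):
        # Compare the mean standardised ratio before and after position k.
        z_before = mean(z[:k])
        z_after = mean(z[k:])
        t = k * z_before ** 2 + (n - k) * z_after ** 2
        if t > t_max:
            t_max, best_k = t, k
    return t_max, best_k
```

The statistic returned by snht would then be compared with a tabulated critical value for the series length, which is not reproduced here. A large value indicates only that a shift is statistically likely at position k; station metadata would still be needed to confirm a physical cause.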
It was not necessary to undertake all of the above tests on every climate time series used in this thesis, as the series were of varying quality. For example, some series, such as the Radcliffe Observatory rainfall record, had already been the subject of a large amount of homogenisation work before being used here. For records such as these, it was necessary only to undertake linear regression analysis to infill the record and extend it to present, and to visually inspect the constructed series. For other records that have received much less attention, such as the Carlisle rainfall series, all of the methods outlined above were undertaken; these are described in more detail in Chapter four. Methods for homogeneity testing and adjustment of climate time series are varied and produce different results. Selecting appropriate methods is challenging and depends on the data in question, the country of origin and the experience of the climatologist (Peterson et al. 1998; Aguilar et al. 2003).