Large data set example - The tag location problem

We describe a simple approach, implemented in freely available tools, that can be used for fast arbitrary access to very large time series of oceanographic data. The example uses the 4-D GEM climatology described in Section 6.4.1.

The GEM data set consists of four 4-D (longitude, latitude, depth, time) arrays for temperature, salinity, and current vectors, totalling about 150 Gb of double-precision floating point values. The data span the entire meridional extent of the Southern Ocean, between latitudes -70 and -34.74. Depth extends from the surface to 5400 m in 36 steps. The temporal range of the data is from 4 October 1992 to 10 September 2006 in weekly steps. Within the ocean region the data are relatively complete with few gaps, although horizontal coverage reduces in the deepest areas due to the ocean floor topography, and changes over time. The latitude and vertical axes are not regular, though the geographic space can be interpreted as a regular grid by using the Mercator projection. The vertical step length increases with increasing depth. Land areas and missing data are identified with missing values.
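As a quick consistency check on the temporal axis, the number of weekly steps implied by the stated date range can be computed directly. This is only an illustrative sketch: the variable names are ours, and it assumes that both end dates are themselves time steps.

```python
from datetime import date

# Start and end dates of the GEM time axis, as stated in the text
start = date(1992, 10, 4)
end = date(2006, 9, 10)

elapsed_days = (end - start).days
n_weeks, remainder = divmod(elapsed_days, 7)

# Assumption: both end dates are included as time steps
n_steps = n_weeks + 1

print(elapsed_days, n_weeks, remainder, n_steps)  # 5089 727 0 728
```

The range is a whole number of weeks, which is consistent with a regular weekly time axis.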

These data were converted to a generic form to provide arbitrary access tools that could overcome some memory limitations. The files were converted in Octave from Matlab export format to generic binary arrays with a reversed index order.⁵ These were then used to populate memory-mapped files controlled by the R package ff (Adler et al., 2010).

The final system consists of three memory-mapped objects for each variable, a read function that generates a new object as a single array from any subset of the 4-D data (up to the entire array), and lookup functions that sample point values from the field. Using this system the entire GEM data set can be made accessible, and large subsets of it (up to ~16 Gb each) can be manipulated in a single object for efficient and simple use with track data. The data for Figure 6.1 were prepared using this system, and via the lookup functions these data can be made available to complex location models such as the examples presented in Chapters 3 and 4.
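The idea of a memory-mapped array with lookup functions can be sketched in a few lines. The system described above was built with the R package ff; the following is not that implementation but a minimal analogue using Python's standard mmap and struct modules. The tiny grid size, the file name, the index order (time varying fastest) and the function names offset, lookup and read_profile are all assumptions made for illustration.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical tiny grid standing in for (longitude, latitude, depth, time)
N_LON, N_LAT, N_DEPTH, N_TIME = 8, 6, 4, 5
ITEM = struct.calcsize("<d")  # 8 bytes per double-precision value

def offset(i, j, k, t):
    """Byte offset of cell (i, j, k, t), assuming time varies fastest."""
    return (((i * N_LAT + j) * N_DEPTH + k) * N_TIME + t) * ITEM

# Write a generic binary file of doubles (here simply 0, 1, 2, ...)
path = os.path.join(tempfile.mkdtemp(), "temperature.bin")
n_values = N_LON * N_LAT * N_DEPTH * N_TIME
with open(path, "wb") as f:
    f.write(struct.pack(f"<{n_values}d", *map(float, range(n_values))))

# Map the file read-only: values are paged in only when they are accessed
fh = open(path, "rb")
field = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)

def lookup(i, j, k, t):
    """Point value at one grid cell, read directly from the mapped file."""
    return struct.unpack_from("<d", field, offset(i, j, k, t))[0]

def read_profile(i, j, t):
    """All depths at one horizontal cell and time step, as an in-memory list."""
    return [lookup(i, j, k, t) for k in range(N_DEPTH)]

print(lookup(2, 3, 1, 2))     # 307.0
print(read_profile(0, 0, 0))  # [0.0, 5.0, 10.0, 15.0]
```

Because only the requested bytes are read, the cost of a point lookup does not depend on the total size of the file, which is the property that makes this kind of access practical on modest hardware.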

6.6 Conclusion

We have demonstrated that there is far greater potential for the incorporation of large environmental data sets into location models than has been previously explored. Computing tools and environments must be carefully chosen to accommodate the requirements of these large-scale data sets. We have shown methods for accessing a very large oceanographic model that can be deployed on relatively low-end desktop computers, demonstrating that modern statistical techniques for location estimation can incorporate large and complex data sets.

⁵ This orientation is not necessary for the final memory-mapped system, but allowed simple georeferencing of the raw files for efficient conversion to GIS formats via the GDAL virtual raster format (Warmerdam and the GDAL development team, 2010). Also, there are R packages that provide reading of Matlab files, but installation details and machine limitations meant that this was not feasible.

CONCLUSION

Location estimation for animal tracking is a developing field with a long history of contributions for many different applications and techniques. It is impossible to classify all of the contributions that have been made, and the great variety of issues dealt with by different applications. This thesis presents a unification of many disparate issues in animal tracking analysis focussed on the use of large raw data sets.

Chapter 2 surveyed a range of problems in tracking data regarding the accuracy of location estimates and the representation of track data. Uncertainties in location are compounded by inconsistencies with track representations and the lack of integrated software tools for analysing and exploring different methods. Seemingly simple choices, such as representing a track via points or lines, introduce assumptions and requirements that are not commonly dealt with explicitly, for example in the way track data are manipulated to produce summaries of residency or time spent. The need for more systematic representations of track data and location error was demonstrated with examples, including easily available software. The trip package provides an integrated software environment for employing traditional methods. Track data are automatically validated as part of a formal system in order to avoid common problems, providing access to a huge range of tools required for dealing with spatial and temporal data. Also demonstrated was the need for wider use of geographic map projections for the representation of tracks and for simulation studies. Access to software tools providing efficient conversion between map projections is also made available by trip. Chapter 2 also presented traditional methods for track data with modern software tools, providing much needed access to exploratory analyses in a single environment. The issues discussed regarding traditional methods were then used to put a new perspective on track estimation from data sources that are not inherently spatial, providing the context for a general framework for location estimation.

Chapter 3 provided a general framework for location estimation using Bayesian methods. The approach provides a broad classification of sources of location information that can be used for a variety of location methods. These sources are prior knowledge, primary location data, auxiliary location data and movement models. The framework includes a model for track representation that explicitly differentiates primary locations from intermediate locations. This distinction bridges the point and line track representations for traditional analyses and provides all the required metrics and measures for modelling animal movement. The framework was applied to two example data sets: Argos Service locations and measurements from an archival tag, demonstrating the generality of the approach. The full detail of the light level geo-location example was omitted here in order to focus on the general applicability of the framework.

Chapter 4 focussed on the full detail of the light level geo-location example of Chapter 3. This provides a novel method for determining location from light level by relating measured light to solar elevation. Extensible and freely available software for running light level geo-location is provided in the R package tripEstimation to enable the application of this approach.

Chapter 5 expanded on the track representation model presented in Chapter 3. This includes an integrated data structure design and binning mechanism for efficient visualization and analysis of individual location estimates or full-path estimates with in-built measures of uncertainty. The data structures can be easily used for generating residency or time spent estimates from groups of individuals and visualizing, or otherwise quantifying, changing patterns of spatial residency over time. These provide a powerful mechanism for comparing spatial estimates consisting of probability distributions with data sets of environmental covariates. Common problems with requirements for data resampling by disparate scales or different map projections are reduced. For example, by binning samples from the posterior to match an environmental data set, the warping of gridded data that would otherwise be required can be avoided.

Chapter 6 illustrated the potential for using subsurface ocean properties for estimating location for marine animals. There is a largely unrecognized and unrealized potential for the use of both surface and subsurface data, and there are challenging issues with scale and sampling disparities, data error problems and efficient access methods to large data sets. We have demonstrated methods for exploiting subsurface temperatures and salinity with large multi-dimensional ocean models for inclusion in statistical models.

For the first time, the wide range of data analysis techniques required for animal tracking has been unified. This approach provides consistent data models for spatial and temporal data, location estimation and track representation, and flexibility for exploring new methods.

MARKOV CHAIN MONTE CARLO

The computational challenge in applying Bayesian techniques is the evaluation of complex integrals required for practical inference. Markov Chain Monte Carlo (MCMC) techniques provide a simple and generic solution to this problem.

It is trivial to determine the posterior density to within a multiplicative constant, as by Bayes’ rule (3.1)

$$p(\theta \mid y) \propto p(y \mid \theta)\,p(\theta).$$

But any practical inference, such as the calculation of expected values and quantiles, requires the calculation of the normalization constant

$$\int p(y \mid \theta)\,p(\theta)\,d\theta,$$

and other associated integrals of the posterior distribution. Unfortunately, the evaluation of the required integrals is often computationally demanding.

MCMC techniques provide a generic and simple approach to this problem through simulation. MCMC methods allow samples to be drawn from the posterior distribution when the posterior density is known only to within a multiplicative constant. Any property of the posterior can then be approximated by the corresponding property of the sample; in essence, the complex integrals required for inference are evaluated by Monte Carlo quadrature.
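To make the quadrature step concrete, consider the posterior expectation of some function of the parameters. The symbols g, N and θ^(i) below are introduced here for illustration only: g is any function of interest and θ^(1), ..., θ^(N) are N draws from the posterior.

$$\mathrm{E}\left[g(\theta) \mid y\right] = \int g(\theta)\,p(\theta \mid y)\,d\theta \approx \frac{1}{N}\sum_{i=1}^{N} g\left(\theta^{(i)}\right).$$

Taking g(θ) = θ gives the posterior mean, and taking g to be the indicator of the event θ ≤ c gives the posterior probability of that event, from which quantiles follow. No normalization constant appears on the right-hand side.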

A.1 Metropolis Hastings

The Metropolis Hastings (MH) algorithm is the most generic form of MCMC. Given a target distribution p from which we wish to draw samples, the Metropolis Hastings algorithm constructs a Markov Chain that has p as its stationary distribution.

The algorithm is surprisingly simple. To generate a sequence X1, X2, X3, ... of draws from p, at each stage i a candidate point Y is drawn from a proposal distribution q(.|Xi). The candidate is accepted with probability α(Y, Xi), where

$$\alpha(Y, X) = \min\left(1, \frac{p(Y)\,q(X \mid Y)}{p(X)\,q(Y \mid X)}\right).$$

If the candidate is accepted, then Xi+1 = Y; otherwise Xi+1 = Xi.

The sample {X1, X2, X3, ...} constructed in this way will have p as its limiting distribution, and so any property of p can be approximated from the sample.

The key feature of the Metropolis Hastings algorithm is that p occurs in both the numerator and denominator of α, and so p need only be known to within a multiplicative constant. This is the reason the MH algorithm is so useful for Bayesian inference.
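The algorithm described above can be sketched as follows. This is a minimal illustration, not the software used in this thesis: the function names, the standard normal target, the Gaussian proposal and the step size are all assumptions chosen for the example. Densities are handled on the log scale to avoid numerical underflow, which is a design choice of this sketch.

```python
import math
import random

def metropolis_hastings(log_p, propose, log_q, x0, n_samples, rng):
    """Run a Metropolis Hastings chain with stationary distribution p.

    log_p(x):        log of the target density, up to an additive constant
    propose(x, rng): draws a candidate Y from the proposal q(.|x)
    log_q(a, b):     log q(a|b), up to an additive constant
    """
    x = x0
    chain = []
    n_accepted = 0
    for _ in range(n_samples):
        y = propose(x, rng)
        # log of alpha(Y, X) = min(1, p(Y) q(X|Y) / (p(X) q(Y|X)))
        log_alpha = min(0.0, log_p(y) + log_q(x, y) - log_p(x) - log_q(y, x))
        if rng.random() < math.exp(log_alpha):
            x = y              # candidate accepted
            n_accepted += 1
        chain.append(x)        # on rejection the current point is repeated
    return chain, n_accepted / n_samples

# Illustrative target: a standard normal with the normalization constant dropped
def log_target(x):
    return -0.5 * x * x

STEP = 1.0  # assumed proposal standard deviation

def propose(x, rng):
    return rng.gauss(x, STEP)

def log_q(a, b):
    return -0.5 * ((a - b) / STEP) ** 2

rng = random.Random(1)
chain, acceptance_rate = metropolis_hastings(
    log_target, propose, log_q, 0.0, 20000, rng)

mean = sum(chain) / len(chain)
variance = sum((x - mean) ** 2 for x in chain) / len(chain)
print(mean, variance, acceptance_rate)
```

The sample mean and variance should be close to 0 and 1, the moments of the normalized target, even though the sampler was only ever given the unnormalized density.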

The choices of the initial point X1 and proposal distribution q(.|.) are effectively arbitrary, but poor choices impact efficiency. A poor choice of X1 can result in points in the neighbourhood of X1 being over-represented in the sample. For this reason it is common to discard the initial segment of the chain to reduce the dependence on X1. A poor choice of q(.|.) will result in fewer candidates being accepted, and it may become necessary to draw a very large sample to obtain accurate results.
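Both effects can be seen in a small experiment. The sketch below is illustrative only: it uses a compact random-walk sampler with a symmetric Gaussian proposal, for which the q terms cancel in α, and the target, starting points, step sizes and burn-in length are all assumed values.

```python
import math
import random

def random_walk_metropolis(log_p, x0, step, n_samples, rng):
    """Metropolis sampler with symmetric Gaussian proposal N(x, step^2)."""
    x, chain, accepted = x0, [], 0
    for _ in range(n_samples):
        y = rng.gauss(x, step)
        # Symmetric proposal: alpha reduces to min(1, p(Y) / p(X))
        if rng.random() < math.exp(min(0.0, log_p(y) - log_p(x))):
            x, accepted = y, accepted + 1
        chain.append(x)
    return chain, accepted / n_samples

def log_target(x):
    return -0.5 * x * x  # standard normal, up to a constant

rng = random.Random(42)

# Poor initial point: start far from the bulk of the target,
# then discard the initial segment of the chain
chain, _ = random_walk_metropolis(log_target, 50.0, 1.0, 10000, rng)
burn_in = 1000  # assumed length of the discarded segment
kept = chain[burn_in:]
mean_kept = sum(kept) / len(kept)

# Poor proposal: a very wide proposal is rarely accepted
_, rate_moderate = random_walk_metropolis(log_target, 0.0, 1.0, 5000, rng)
_, rate_wide = random_walk_metropolis(log_target, 0.0, 50.0, 5000, rng)

print(mean_kept, rate_moderate, rate_wide)
```

After the initial segment is discarded, the mean of the retained sample should be close to 0, and the wide proposal should show a much lower acceptance rate than the moderate one.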

More details of the properties of the Metropolis Hastings algorithm can be found in Gilks et al. (1995).
