• No results found

Data Analysis: Statistical Modeling and Computation in Applications

N/A
N/A
Protected

Academic year: 2022

Share "Data Analysis: Statistical Modeling and Computation in Applications"

Copied!
81
0
0

Loading.... (view fulltext now)

Full text

(1)

Data Analysis:

Statistical Modeling and Computation in Applications

Spatial and Environmental Data:

Introduction, Local Correlations

Stefanie Jegelka (and Caroline Uhler) 1 / 35

(2)

Overview

Environmental data Modeling flows

Short-range spatial correlations intuition

2 variables multiple variables

Stefanie Jegelka (and Caroline Uhler) 2 / 35

(3)

Environmental Data – Examples

Air Quality, Water Quality (pollutants)

Weather & climate (temperature, winds, moisture, precipitation, extreme conditions, . . . )

Storms

Ocean dynamics

Vegetation (forests, algae, . . . ) Wildlife monitoring

Earthquake magnitudes

Why do we care? What are questions of interest?

Stefanie Jegelka (and Caroline Uhler) 3 / 35

(4)

Environmental Data – Examples

Air Quality, Water Quality (pollutants)

Weather & climate (temperature, winds, moisture, precipitation, extreme conditions, . . . )

Storms

Ocean dynamics

Vegetation (forests, algae, . . . ) Wildlife monitoring

Earthquake magnitudes

Why do we care? What are questions of interest?

Stefanie Jegelka (and Caroline Uhler) 3 / 35

(5)

Environmental Data – Examples

Air Quality, Water Quality (pollutants)

Weather & climate (temperature, winds, moisture, precipitation, extreme conditions, . . . )

Storms

Ocean dynamics

Vegetation (forests, algae, . . . ) Wildlife monitoring

Earthquake magnitudes

Why do we care? What are questions of interest?

Stefanie Jegelka (and Caroline Uhler) 3 / 35

(6)

Environmental Data – Examples

Air Quality, Water Quality (pollutants)

Weather & climate (temperature, winds, moisture, precipitation, extreme conditions, . . . )

Storms

Ocean dynamics

Vegetation (forests, algae, . . . )

Wildlife monitoring Earthquake magnitudes

Why do we care? What are questions of interest?

Stefanie Jegelka (and Caroline Uhler) 3 / 35

(7)

Environmental Data – Examples

Air Quality, Water Quality (pollutants)

Weather & climate (temperature, winds, moisture, precipitation, extreme conditions, . . . )

Storms

Ocean dynamics

Vegetation (forests, algae, . . . ) Wildlife monitoring

Earthquake magnitudes

Why do we care? What are questions of interest?

Stefanie Jegelka (and Caroline Uhler) 3 / 35

(8)

Environmental Data – Examples

Air Quality, Water Quality (pollutants)

Weather & climate (temperature, winds, moisture, precipitation, extreme conditions, . . . )

Storms

Ocean dynamics

Vegetation (forests, algae, . . . ) Wildlife monitoring

Earthquake magnitudes

Why do we care? What are questions of interest?

Stefanie Jegelka (and Caroline Uhler) 3 / 35

(9)

Environmental Data – Examples

Air Quality, Water Quality (pollutants)

Weather & climate (temperature, winds, moisture, precipitation, extreme conditions, . . . )

Storms

Ocean dynamics

Vegetation (forests, algae, . . . ) Wildlife monitoring

Earthquake magnitudes

Why do we care? What are questions of interest?

Stefanie Jegelka (and Caroline Uhler) 3 / 35

(10)

Environmental Data – Why do we care?

understand underlying processes, changes

(e.g. climate change: “statistics of weather over time”)

impacts on environment, health, economics, society policies

forecast events, warnings (e.g. community seismic network, storms, . . . )

resource/energy management, e.g., water, renewable energies use in planning, routing, backtracking, control(ships, airplanes, . . . )

Questions

relationships (correlations, association) trends; forecasting

planning

quantifying uncertainty, adaptive sensing

Stefanie Jegelka (and Caroline Uhler) 4 / 35

(11)

Environmental Data – Why do we care?

understand underlying processes, changes

(e.g. climate change: “statistics of weather over time”)

impacts on environment, health, economics, society

policies

forecast events, warnings (e.g. community seismic network, storms, . . . )

resource/energy management, e.g., water, renewable energies use in planning, routing, backtracking, control(ships, airplanes, . . . )

Questions

relationships (correlations, association) trends; forecasting

planning

quantifying uncertainty, adaptive sensing

Stefanie Jegelka (and Caroline Uhler) 4 / 35

(12)

Environmental Data – Why do we care?

understand underlying processes, changes

(e.g. climate change: “statistics of weather over time”)

impacts on environment, health, economics, society policies

forecast events, warnings (e.g. community seismic network, storms, . . . )

resource/energy management, e.g., water, renewable energies use in planning, routing, backtracking, control(ships, airplanes, . . . )

Questions

relationships (correlations, association) trends; forecasting

planning

quantifying uncertainty, adaptive sensing

Stefanie Jegelka (and Caroline Uhler) 4 / 35

(13)

Environmental Data – Why do we care?

understand underlying processes, changes

(e.g. climate change: “statistics of weather over time”)

impacts on environment, health, economics, society policies

forecast events, warnings (e.g. community seismic network, storms, . . . )

resource/energy management, e.g., water, renewable energies use in planning, routing, backtracking, control(ships, airplanes, . . . )

Questions

relationships (correlations, association) trends; forecasting

planning

quantifying uncertainty, adaptive sensing

Stefanie Jegelka (and Caroline Uhler) 4 / 35

(14)

Environmental Data – Why do we care?

understand underlying processes, changes

(e.g. climate change: “statistics of weather over time”)

impacts on environment, health, economics, society policies

forecast events, warnings (e.g. community seismic network, storms, . . . )

resource/energy management, e.g., water, renewable energies

use in planning, routing, backtracking, control(ships, airplanes, . . . )

Questions

relationships (correlations, association) trends; forecasting

planning

quantifying uncertainty, adaptive sensing

Stefanie Jegelka (and Caroline Uhler) 4 / 35

(15)

Environmental Data – Why do we care?

understand underlying processes, changes

(e.g. climate change: “statistics of weather over time”)

impacts on environment, health, economics, society policies

forecast events, warnings (e.g. community seismic network, storms, . . . )

resource/energy management, e.g., water, renewable energies use in planning, routing, backtracking, control(ships, airplanes, . . . )

Questions

relationships (correlations, association) trends; forecasting

planning

quantifying uncertainty, adaptive sensing

Stefanie Jegelka (and Caroline Uhler) 4 / 35

(16)

Environmental Data – Why do we care?

understand underlying processes, changes

(e.g. climate change: “statistics of weather over time”)

impacts on environment, health, economics, society policies

forecast events, warnings (e.g. community seismic network, storms, . . . )

resource/energy management, e.g., water, renewable energies use in planning, routing, backtracking, control(ships, airplanes, . . . )

Questions

relationships (correlations, association) trends; forecasting

planning

quantifying uncertainty, adaptive sensing

Stefanie Jegelka (and Caroline Uhler) 4 / 35

(17)

Environmental Data – Why do we care?

understand underlying processes, changes

(e.g. climate change: “statistics of weather over time”)

impacts on environment, health, economics, society policies

forecast events, warnings (e.g. community seismic network, storms, . . . )

resource/energy management, e.g., water, renewable energies use in planning, routing, backtracking, control(ships, airplanes, . . . )

Questions

relationships (correlations, association) trends; forecasting

planning

quantifying uncertainty, adaptive sensing

Stefanie Jegelka (and Caroline Uhler) 4 / 35

(18)

Environmental Data – Why do we care?

understand underlying processes, changes

(e.g. climate change: “statistics of weather over time”)

impacts on environment, health, economics, society policies

forecast events, warnings (e.g. community seismic network, storms, . . . )

resource/energy management, e.g., water, renewable energies use in planning, routing, backtracking, control(ships, airplanes, . . . )

Questions

relationships (correlations, association) trends; forecasting

planning

quantifying uncertainty, adaptive sensing

Stefanie Jegelka (and Caroline Uhler) 4 / 35

(19)

Environmental Data – What is special?

2 Chapter 1. Climate and Mathematics

Figure 1.1. Schematic view of the Earth’s climate system. Reprinted with permission from IPCC.

Looking at the cartoon, we see a climate system that is schematically made up of five components: theatmosphere; the hydrosphere (oceans, lakes, and other bodies of water);

thecryosphere (snow and ice); the lithosphere (land surface); and the biosphere (all living things). The components do not exist in isolation; they are interconnected and interact at several levels, either directly or indirectly. The system as a whole is powered by so- lar radiation and evolves under the influence of its own internal dynamics through ocean currents and atmospheric circulation. In addition, there are external factors which drive the system; these are calledforcings and include both natural phenomena such as cyclical changes in the Earth’s orbit around the Sun, volcanic eruptions, and variations in the solar output, and human-induced (anthropogenic) factors like changes in atmospheric compo- sition, land use, and so on. Let us agree that this is acomplex system, and let us also agree that it is not obvious how to approach it mathematically.

1.2 Modeling Earth’s Climate

As mathematicians, we are used to setting up models for physical phenomena, usually in the form of equations. For example, we recognize the second-order differential equation

L ¨x(t) + g sin x(t) = 0

as a model for the motion of a physical pendulum under the influence of gravity. Every symbol in the equation has its counterpart in the physical world:x(t) stands for the angle between the arm of the pendulum and its rest position (straight down) at the timet; L is the length of the pendulum arm; andg is gravitational acceleration. The mass of the bob turns out to be unimportant and therefore does not appear in the equation. The model is understood by all to be an approximation, and part of the modeling effort consists in

Downloaded 10/10/16 to 18.9.61.111. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php

underlying physical processes; scientific models

Image source: H. Kaper, H. Engler. Mathematics & Climate.

Stefanie Jegelka (and Caroline Uhler) 5 / 35

(20)

Environmental Data – What is special?

2 Chapter 1. Climate and Mathematics

Figure 1.1. Schematic view of the Earth’s climate system. Reprinted with permission from IPCC.

Looking at the cartoon, we see a climate system that is schematically made up of five components: theatmosphere; the hydrosphere (oceans, lakes, and other bodies of water);

thecryosphere (snow and ice); the lithosphere (land surface); and the biosphere (all living things). The components do not exist in isolation; they are interconnected and interact at several levels, either directly or indirectly. The system as a whole is powered by so- lar radiation and evolves under the influence of its own internal dynamics through ocean currents and atmospheric circulation. In addition, there are external factors which drive the system; these are calledforcings and include both natural phenomena such as cyclical changes in the Earth’s orbit around the Sun, volcanic eruptions, and variations in the solar output, and human-induced (anthropogenic) factors like changes in atmospheric compo- sition, land use, and so on. Let us agree that this is acomplex system, and let us also agree that it is not obvious how to approach it mathematically.

1.2 Modeling Earth’s Climate

As mathematicians, we are used to setting up models for physical phenomena, usually in the form of equations. For example, we recognize the second-order differential equation

L ¨x(t) + g sin x(t) = 0

as a model for the motion of a physical pendulum under the influence of gravity. Every symbol in the equation has its counterpart in the physical world:x(t) stands for the angle between the arm of the pendulum and its rest position (straight down) at the timet; L is the length of the pendulum arm; andg is gravitational acceleration. The mass of the bob turns out to be unimportant and therefore does not appear in the equation. The model is understood by all to be an approximation, and part of the modeling effort consists in

Downloaded 10/10/16 to 18.9.61.111. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php

underlying physical processes; scientific models

Image source: H. Kaper, H. Engler. Mathematics & Climate.

Stefanie Jegelka (and Caroline Uhler) 5 / 35

(21)

Environmental Data – What is special?

spatial and temporal correlation

from https: // www. epa. gov/ outdoor-air-quality-data ,

http: // public. dep. state. ma. us/ MassAir/ Pages/ MapCurrent. aspx? &ht= 1& hi= 101

Stefanie Jegelka (and Caroline Uhler) 6 / 35

(22)

Environmental Data – What is special?

spatial and temporal correlation

from https: // www. epa. gov/ outdoor-air-quality-data ,

http: // public. dep. state. ma. us/ MassAir/ Pages/ MapCurrent. aspx? &ht= 1& hi= 101

Stefanie Jegelka (and Caroline Uhler) 6 / 35

(23)

Environmental Data – What is special?

1.6. Models from Data 9

posite ways when compared between regions. Thus, ENSO is characterized by a dipole with centers over the western Pacific and over the central/eastern Pacific. These dipoles interact, and there is strong evidence that their influence can extend over thousands of kilometers, a phenomenon known asteleconnection. We emphasize that teleconnections are initially purely observational phenomena. They may manifest themselves differently in different dipole pairs and may require very different physical explanations. Many of the structures are poorly understood, in the sense that not even conceptual models ex- ist for them, although they contribute to global weather variability over time scales of decades. Indeed, a better understanding of interdecadal climate variability is one of the main current goals of the IPCC.

A systematic search for dipole structures and teleconnections is now possible. By ap- plying data mining techniques to the vast amounts of data from satellite observations and computer simulations and data that predate the satellite age, one can reconstruct recent climate states.1 Figure 1.8 shows some results for sea-level pressure data for the period 1948–1967 generated from the NCEP/NCAR Reanalysis project [75]. The color codes correspond to aggregates of various physical observations. Black lines correspond to tele- connections, which have been identified from 20 years of processed (“reanalyzed”) data.

Some of these teleconnections are well known. Others may represent new climatological phenomena, they may be consequences of known teleconnections, or they may just be the result of spurious observations.

Figure 1.8. Dipoles in NCEP sea-level data for the period 1948–1967. The color background shows the regions of high activity. The edges represent dipole connections between regions.

What can we infer from these empirical observations? Do these dipole structures rep- resent oscillatory modes of the climate system? What are possible coupling mechanisms that cause the teleconnections? Is it too far-fetched to think of the climate system as a system of coupled oscillators?

1This approach was already articulated by Sir Gilbert, who wrote in 1932: “I think that the relationships of world weather are so complex that our only chance of explaining them is to accumulate the facts empirically;

we know that it was impossible to explain cyclones (lows) until data of the upper air conditions were available, and there is a strong presumption that when we have data of pressure and temperature at 10 and 20 km, we shall find a number of new relations that are of vital importance.”

Downloaded 10/10/16 to 18.9.61.111. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php

long-range spatial (anti-)correlations

Source: H. Kaper, H. Engler. Mathematics & Climate.

Stefanie Jegelka (and Caroline Uhler) 7 / 35

(24)

Environmental Data – What is special?

1.6. Models from Data 9

posite ways when compared between regions. Thus, ENSO is characterized by a dipole with centers over the western Pacific and over the central/eastern Pacific. These dipoles interact, and there is strong evidence that their influence can extend over thousands of kilometers, a phenomenon known asteleconnection. We emphasize that teleconnections are initially purely observational phenomena. They may manifest themselves differently in different dipole pairs and may require very different physical explanations. Many of the structures are poorly understood, in the sense that not even conceptual models ex- ist for them, although they contribute to global weather variability over time scales of decades. Indeed, a better understanding of interdecadal climate variability is one of the main current goals of the IPCC.

A systematic search for dipole structures and teleconnections is now possible. By ap- plying data mining techniques to the vast amounts of data from satellite observations and computer simulations and data that predate the satellite age, one can reconstruct recent climate states.1 Figure 1.8 shows some results for sea-level pressure data for the period 1948–1967 generated from the NCEP/NCAR Reanalysis project [75]. The color codes correspond to aggregates of various physical observations. Black lines correspond to tele- connections, which have been identified from 20 years of processed (“reanalyzed”) data.

Some of these teleconnections are well known. Others may represent new climatological phenomena, they may be consequences of known teleconnections, or they may just be the result of spurious observations.

Figure 1.8. Dipoles in NCEP sea-level data for the period 1948–1967. The color background shows the regions of high activity. The edges represent dipole connections between regions.

What can we infer from these empirical observations? Do these dipole structures rep- resent oscillatory modes of the climate system? What are possible coupling mechanisms that cause the teleconnections? Is it too far-fetched to think of the climate system as a system of coupled oscillators?

1This approach was already articulated by Sir Gilbert, who wrote in 1932: “I think that the relationships of world weather are so complex that our only chance of explaining them is to accumulate the facts empirically;

we know that it was impossible to explain cyclones (lows) until data of the upper air conditions were available, and there is a strong presumption that when we have data of pressure and temperature at 10 and 20 km, we shall find a number of new relations that are of vital importance.”

Downloaded 10/10/16 to 18.9.61.111. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php

long-range spatial (anti-)correlations

Source: H. Kaper, H. Engler. Mathematics & Climate.

Stefanie Jegelka (and Caroline Uhler) 7 / 35

(25)

Environmental Data – What is special?

correlations in time

correlations in space (fields)

scientific models + statistics; simulations methodological challenges for statistics

no controlled studies, only observational data hypotheses from data

large data sets heterogeneous data proxy data

Stefanie Jegelka (and Caroline Uhler) 8 / 35

(26)

Environmental Data – What is special?

correlations in time

correlations in space (fields)

scientific models + statistics; simulations methodological challenges for statistics

no controlled studies, only observational data

hypotheses from data large data sets heterogeneous data proxy data

Stefanie Jegelka (and Caroline Uhler) 8 / 35

(27)

Environmental Data – What is special?

correlations in time

correlations in space (fields)

scientific models + statistics; simulations methodological challenges for statistics

no controlled studies, only observational data hypotheses from data

large data sets heterogeneous data proxy data

Stefanie Jegelka (and Caroline Uhler) 8 / 35

(28)

Environmental Data – What is special?

correlations in time

correlations in space (fields)

scientific models + statistics; simulations methodological challenges for statistics

no controlled studies, only observational data hypotheses from data

large data sets

heterogeneous data proxy data

Stefanie Jegelka (and Caroline Uhler) 8 / 35

(29)

Environmental Data – What is special?

correlations in time

correlations in space (fields)

scientific models + statistics; simulations methodological challenges for statistics

no controlled studies, only observational data hypotheses from data

large data sets heterogeneous data

proxy data

Stefanie Jegelka (and Caroline Uhler) 8 / 35

(30)

Environmental Data – What is special?

correlations in time

correlations in space (fields)

scientific models + statistics; simulations methodological challenges for statistics

no controlled studies, only observational data hypotheses from data

large data sets heterogeneous data proxy data

Stefanie Jegelka (and Caroline Uhler) 8 / 35

(31)

Overview

Environmental data Modeling flows

Short-range spatial correlations intuition

2 variables multiple variables

Stefanie Jegelka (and Caroline Uhler) 9 / 35

(32)

Data for the problem set: flow in space & time

Stefanie Jegelka (and Caroline Uhler) 10 / 35

(33)

Flow fields

V (x , t): flow at location x at time t vector in 2d or 3d

velocity: vector length

V (x , t) at a fixed time t

What does variation in time mean?

Stefanie Jegelka (and Caroline Uhler) 11 / 35

(34)

Flow fields

V (x , t): flow at location x at time t vector in 2d or 3d

velocity: vector length

V (x , t) at a fixed time t

What does variation in time mean?

Stefanie Jegelka (and Caroline Uhler) 11 / 35

(35)

Flow fields

V (x , t): flow at location x at time t vector in 2d or 3d

velocity: vector length

V (x , t) at a fixed time t What does variation in time mean?

Stefanie Jegelka (and Caroline Uhler) 11 / 35

(36)

Working with flow fields

forward prediction

simulate (propagate) distributions, include variation in time hindcasting

Stefanie Jegelka (and Caroline Uhler) 12 / 35

(37)

Working with flow fields

forward prediction

simulate (propagate) distributions, include variation in time hindcasting

Stefanie Jegelka (and Caroline Uhler) 12 / 35

(38)

Working with flow fields

forward prediction

simulate (propagate) distributions, include variation in time

hindcasting

Stefanie Jegelka (and Caroline Uhler) 12 / 35

(39)

Working with flow fields

forward prediction

simulate (propagate) distributions, include variation in time hindcasting

Stefanie Jegelka (and Caroline Uhler) 12 / 35

(40)

8 March 2014: Malaysian Airlines Flight 370 (MH370) disappeared on its way from Kuala Lumpur to Beijing.

source: Wikipedia

Stefanie Jegelka (and Caroline Uhler) 13 / 35

(41)

Airplane pieces found in multiple locations.

image source: BBC

Stefanie Jegelka (and Caroline Uhler) 14 / 35

(42)

full report:

http: // www. geomar. de/ fileadmin/ content/ service/ presse/ Pressemitteilungen/ 2016/ MH370_ Report_ May2016. pdf

Stefanie Jegelka (and Caroline Uhler) 15 / 35

(43)

full report:

http: // www. geomar. de/ fileadmin/ content/ service/ presse/ Pressemitteilungen/ 2016/ MH370_ Report_ May2016. pdf

Stefanie Jegelka (and Caroline Uhler) 16 / 35

(44)

Overview

Environmental data Modeling flows

Short-range spatial correlations intuition

2 variables multiple variables

Stefanie Jegelka (and Caroline Uhler) 17 / 35

(45)

Sensing and correlations in space

KRAUSE, SINGH ANDGUESTRIN

0 5 10 15

0 0.5 1 1.5 2

Number of sensors placed

RMS error

Empirical covariance N30 S30 Disk

model

(a) Comparison with disk model (Temperature)

0 5 10 15 20 25 30

0.8 1 1.2 1.4 1.6

Number of sensors placed

RMS error Disk

model N40 S40

Empirical covariance

(b) Comparison with disk model (Precipitation)

Figure 10: RMS curves for placements of increasing size, optimized using the disk model, station- ary and nonstationary GPs. Prediction for all placements is done using the empirical covariance. Stationary GPs and nonstationary GPs estimated from 30 sensors (N30, S30, for temperature data) or 40 sensors (N40, S40, for precipitation data).

(a) Placements of temperature sensors (b) Placements of rain sensors

Figure 11: Example sensor placements for temperature and precipitation data. Squares indicate locations selected by mutual information, diamonds indicate those selected by entropy.

Notice how entropy places sensors closer to the border of the sensing field.

for the morning (between 8 am and 9 am) and noon (between 12 pm and 1 pm) in the Intel lab data.

Figure 12(a) and Figure 12(b) plot the log-likelihood of the test set observations with increasing number of sensors for both models. Figure 12(e) presents the RMS error for a model estimated for the entire day. We can see that mutual information in almost all cases outperforms entropy, achieving better prediction accuracies with a smaller number of sensors.

Figure 12(f) presents the same results for the precipitation data set. Mutual information signifi- cantly outperforms entropy as a selection criterion—often several sensors would have to be addi- tionally placed for entropy to reach the same level of prediction accuracy as mutual information.

Figure 11(b) shows where both objective values would place sensors to measure precipitation. It can be seen that entropy is again much more likely to place sensors around the border of the sensing area than mutual information.

268

Measure & model correlations in space?

Estimate temperature / rainfall / gold in other locations?

Intuition: correlation is a function of distance

Stefanie Jegelka (and Caroline Uhler) 18 / 35

(46)

Sensing and correlations in space

KRAUSE, SINGH ANDGUESTRIN

0 5 10 15

0 0.5 1 1.5 2

Number of sensors placed

RMS error

Empirical covariance N30 S30 Disk

model

(a) Comparison with disk model (Temperature)

0 5 10 15 20 25 30

0.8 1 1.2 1.4 1.6

Number of sensors placed

RMS error Disk

model N40 S40

Empirical covariance

(b) Comparison with disk model (Precipitation)

Figure 10: RMS curves for placements of increasing size, optimized using the disk model, station- ary and nonstationary GPs. Prediction for all placements is done using the empirical covariance. Stationary GPs and nonstationary GPs estimated from 30 sensors (N30, S30, for temperature data) or 40 sensors (N40, S40, for precipitation data).

(a) Placements of temperature sensors (b) Placements of rain sensors

Figure 11: Example sensor placements for temperature and precipitation data. Squares indicate locations selected by mutual information, diamonds indicate those selected by entropy.

Notice how entropy places sensors closer to the border of the sensing field.

for the morning (between 8 am and 9 am) and noon (between 12 pm and 1 pm) in the Intel lab data.

Figure 12(a) and Figure 12(b) plot the log-likelihood of the test set observations with increasing number of sensors for both models. Figure 12(e) presents the RMS error for a model estimated for the entire day. We can see that mutual information in almost all cases outperforms entropy, achieving better prediction accuracies with a smaller number of sensors.

Figure 12(f) presents the same results for the precipitation data set. Mutual information signifi- cantly outperforms entropy as a selection criterion—often several sensors would have to be addi- tionally placed for entropy to reach the same level of prediction accuracy as mutual information.

Figure 11(b) shows where both objective values would place sensors to measure precipitation. It can be seen that entropy is again much more likely to place sensors around the border of the sensing area than mutual information.

268

Measure & model correlations in space?

Estimate temperature / rainfall / gold in other locations?

Intuition: correlation is a function of distance

Stefanie Jegelka (and Caroline Uhler) 18 / 35

(47)

Where we are headed

Gaussian Process Regression

-3 -2 -1 0 1 2 3

-2 -1 0 1 2

y( x)

x

Figure: Examples include WiFi localization, C14 callibration curve.

Stefanie Jegelka (and Caroline Uhler) 19 / 35

(48)

A simple start

Stefanie Jegelka (and Caroline Uhler) 20 / 35

(49)

Intuition: correlated (Gaussian) random variables

weaker correlation

Stefanie Jegelka (and Caroline Uhler) 21 / 35

(50)

Intuition: correlated (Gaussian) random variables

strong correlation

Stefanie Jegelka (and Caroline Uhler) 22 / 35

(51)

Intuition: 50 Gaussian random variables

Stefanie Jegelka (and Caroline Uhler) 23 / 35

(52)

Intuition: 50 Gaussian random variables

Stefanie Jegelka (and Caroline Uhler) 23 / 35

(53)

Intuition: 50 Gaussian random variables

Stefanie Jegelka (and Caroline Uhler) 23 / 35

(54)

Multivariate Gaussian distribution

y ∼ N (µ, Σ) p(y ) = 1

(2π)−d/2|Σ|1/2exp

12(y − µ)>Σ−1(y − µ)

Σij = cov(yi, yj)

Stefanie Jegelka (and Caroline Uhler) 24 / 35

(55)

Multivariate Gaussian distribution

y ∼ N (µ, Σ) p(y ) = 1

(2π)−d/2|Σ|1/2exp

12(y − µ)>Σ−1(y − µ)

Σij = cov(yi, yj)

Stefanie Jegelka (and Caroline Uhler) 24 / 35

(56)

Covariance for 2 variables: intuition

Prediction of f2 from f1

-1 0 1

-1 0 1

f1

f2

1 0.96587

0.96587 1

I The single contour of the Gaussian density represents the joint distribution, p( f1,f2).

I We observe that f1= 0.313.

I Conditional density:p( f2| f1= 0.313).

Illustrations: Neil Lawrence

Stefanie Jegelka (and Caroline Uhler) 25 / 35

(57)

Covariance for 2 variables: intuition

Prediction of f2 from f1

-1 0 1

-1 0 1

f1

f2

1 0.96587

0.96587 1

I The single contour of the Gaussian density represents the joint distribution, p( f1,f2).

I We observe thatf1= 0.313.

I Conditional density:p( f2| f1= 0.313).

Illustrations: Neil Lawrence

Stefanie Jegelka (and Caroline Uhler) 26 / 35

(58)

Covariance for 2 variables: intuition

Prediction of f2from f1

-1 0 1

-1 0 1

f1

f2

1 0.96587

0.96587 1

I The single contour of the Gaussian density represents the joint distribution, p( f1,f2).

I We observe that f1= 0.313.

I Conditional density:p( f2| f1= 0.313).

Illustrations: Neil Lawrence

Stefanie Jegelka (and Caroline Uhler) 27 / 35

(59)

Covariance for 2 variables: intuition

Prediction of f5 from f1

-1 0 1

-1 0 1

f1

f5

1 0.57375

0.57375 1

I The single contour of the Gaussian density represents the joint distribution, p( f1,f5).

I We observe that f1= 0.313.

I Conditional density:p( f5| f1= 0.313).

Illustrations: Neil Lawrence

Stefanie Jegelka (and Caroline Uhler) 28 / 35

(60)

Covariance for 2 variables: intuition

Prediction of f5 from f1

-1 0 1

-1 0 1

f1

f5

1 0.57375

0.57375 1

I The single contour of the Gaussian density represents the joint distribution, p( f1,f5).

I We observe thatf1= 0.313.

I Conditional density:p( f5| f1= 0.313).

Illustrations: Neil Lawrence

Stefanie Jegelka (and Caroline Uhler) 29 / 35

(61)

Overview

Environmental data Modeling flows

Short-range spatial correlations intuition

2 variables multiple variables

Stefanie Jegelka (and Caroline Uhler) 30 / 35

(62)

Prediction: conditional probabilities

YA, YB Gaussian random variables. We observe YB = yB.

 YA YB



∼ N

 µA µB

 ,

 σA2 σAB σAB σ2B



Conditioning: p(YA|YB = yB) is also Gaussian with mean and variance

µA|B = µAAB

σB2 (yB − µB) σ2A|BA2 −σABσ−2B σAB.

Stefanie Jegelka (and Caroline Uhler) 31 / 35

(63)

Prediction: conditional probabilities

YA, YB Gaussian random variables. We observe YB = yB.

 YA YB



∼ N

 µA µB

 ,

 σA2 σAB σAB σ2B



Conditioning: p(YA|YB = yB) is also Gaussian with mean and variance

µA|B = µAAB

σB2 (yB − µB) σ2A|BA2 −σABσ−2B σAB.

Stefanie Jegelka (and Caroline Uhler) 31 / 35

(64)

Prediction: conditional probabilities

YA, YB Gaussian random variables. We observe YB = yB.

 YA YB



∼ N

 µA µB

 ,

 σA2 σAB σAB σ2B



Conditioning: p(YA|YB = yB) is also Gaussian with mean and variance

µA|B = µAAB

σB2 (yB − µB)

σ2A|BA2 −σABσ−2B σAB.

Stefanie Jegelka (and Caroline Uhler) 31 / 35

(65)

Prediction: conditional probabilities

YA, YB Gaussian random variables. We observe YB = yB.

 YA YB



∼ N

 µA µB

 ,

 σA2 σAB σAB σ2B



Conditioning: p(YA|YB = yB) is also Gaussian with mean and variance

µA|B = µAAB

σB2 (yB − µB) σ2A|BA2 −σABσ−2B σAB.

Stefanie Jegelka (and Caroline Uhler) 31 / 35

(66)

Conditioning on multiple variables / partial observation

Assume Y1, . . . , YN, Y are jointly Gaussian with µ = 0 and

σN∗ = [ cov(Y, Y1), . . . , cov(Y, YN) ] σ2= var(Y)

Observe y1, . . . , yN. Conditioning yields a Gaussian p(Y|y1, . . . yN) with parameters

µA|B = µAAB

σB2 (yB − µB) ⇒ µ∗|1:N = µ + σN∗> Σ−1N (y1:N − µ1:N) σ2A|BA2 −σABσB−2σAB ⇒ σ∗|1:N22 − σ>N∗Σ−1N σN∗.

Stefanie Jegelka (and Caroline Uhler) 32 / 35

(67)

Conditioning on multiple variables / partial observation

Assume Y1, . . . , YN, Y are jointly Gaussian with µ = 0 and

σN∗ = [ cov(Y, Y1), . . . , cov(Y, YN) ] σ2= var(Y)

Observe y1, . . . , yN. Conditioning yields a Gaussian p(Y|y1, . . . yN) with parameters

µA|B = µAAB

σB2 (yB − µB) ⇒ µ∗|1:N = µ + σN∗> Σ−1N (y1:N − µ1:N) σ2A|BA2 −σABσB−2σAB ⇒ σ∗|1:N22 − σ>N∗Σ−1N σN∗.

Stefanie Jegelka (and Caroline Uhler) 32 / 35

(68)

Conditioning on multiple variables / partial observation

Assume Y1, . . . , YN, Y are jointly Gaussian with µ = 0 and

σN∗ = [ cov(Y, Y1), . . . , cov(Y, YN) ]

σ2= var(Y)

Observe y1, . . . , yN. Conditioning yields a Gaussian p(Y|y1, . . . yN) with parameters

µA|B = µAAB

σB2 (yB − µB) ⇒ µ∗|1:N = µ + σN∗> Σ−1N (y1:N − µ1:N) σ2A|BA2 −σABσB−2σAB ⇒ σ∗|1:N22 − σ>N∗Σ−1N σN∗.

Stefanie Jegelka (and Caroline Uhler) 32 / 35

(69)

Conditioning on multiple variables / partial observation

Assume Y1, . . . , YN, Y are jointly Gaussian with µ = 0 and

σN∗ = [ cov(Y, Y1), . . . , cov(Y, YN) ] σ2= var(Y)

Observe y1, . . . , yN. Conditioning yields a Gaussian p(Y|y1, . . . yN) with parameters

µA|B = µAAB

σB2 (yB − µB) ⇒ µ∗|1:N = µ + σN∗> Σ−1N (y1:N − µ1:N) σ2A|BA2 −σABσB−2σAB ⇒ σ∗|1:N22 − σ>N∗Σ−1N σN∗.

Stefanie Jegelka (and Caroline Uhler) 32 / 35

(70)

Conditioning on multiple variables / partial observation

Assume Y1, . . . , YN, Y are jointly Gaussian with µ = 0 and

σN∗ = [ cov(Y, Y1), . . . , cov(Y, YN) ] σ2= var(Y)

Observe y1, . . . , yN. Conditioning yields a Gaussian p(Y|y1, . . . yN) with parameters

µA|B = µA+ σAB

σB2 (yB − µB) ⇒ µ∗|1:N = µ + σN∗> Σ−1N (y1:N − µ1:N) σ2A|B2A−σABσB−2σAB ⇒ σ∗|1:N22 − σ>N∗Σ−1N σN∗.

Stefanie Jegelka (and Caroline Uhler) 32 / 35

(71)

Conditioning on multiple variables / partial observation

Assume Y1, . . . , YN, Y are jointly Gaussian with µ = 0 and

σN∗ = [ cov(Y, Y1), . . . , cov(Y, YN) ] σ2= var(Y)

Observe y1, . . . , yN. Conditioning yields a Gaussian p(Y|y1, . . . yN) with parameters

µA|B = µA+ σAB

σB2 (yB − µB)

⇒ µ∗|1:N = µ + σN∗> Σ−1N (y1:N − µ1:N) σ2A|B2A−σABσB−2σAB ⇒ σ∗|1:N22 − σ>N∗Σ−1N σN∗.

Stefanie Jegelka (and Caroline Uhler) 32 / 35

(72)

Conditioning on multiple variables / partial observation

Assume Y1, . . . , YN, Y are jointly Gaussian with µ = 0 and

σN∗ = [ cov(Y, Y1), . . . , cov(Y, YN) ] σ2= var(Y)

Observe y1, . . . , yN. Conditioning yields a Gaussian p(Y|y1, . . . yN) with parameters

µA|B = µA+ σAB

σB2 (yB − µB) ⇒ µ∗|1:N = µ + σN∗> Σ−1N (y1:N − µ1:N)

σ2A|B2A−σABσB−2σAB ⇒ σ∗|1:N22 − σ>N∗Σ−1N σN∗.

Stefanie Jegelka (and Caroline Uhler) 32 / 35

(73)

Conditioning on multiple variables / partial observation

Assume Y1, . . . , YN, Y are jointly Gaussian with µ = 0 and

σN∗ = [ cov(Y, Y1), . . . , cov(Y, YN) ] σ2= var(Y)

Observe y1, . . . , yN. Conditioning yields a Gaussian p(Y|y1, . . . yN) with parameters

µA|B = µA+ σAB

σB2 (yB − µB) ⇒ µ∗|1:N = µ + σN∗> Σ−1N (y1:N − µ1:N) σA|B22A−σABσB−2σAB

⇒ σ∗|1:N22 − σ>N∗Σ−1N σN∗.

Stefanie Jegelka (and Caroline Uhler) 32 / 35

(74)

Conditioning on multiple variables / partial observation

Assume Y1, . . . , YN, Y are jointly Gaussian with µ = 0 and

σN∗ = [ cov(Y, Y1), . . . , cov(Y, YN) ] σ2= var(Y)

Observe y1, . . . , yN. Conditioning yields a Gaussian p(Y|y1, . . . yN) with parameters

µA|B = µA+ σAB

σB2 (yB − µB) ⇒ µ∗|1:N = µ + σN∗> Σ−1N (y1:N − µ1:N) σA|B22A−σABσB−2σAB ⇒ σ2∗|1:N2 − σ>N∗Σ−1N σN∗.

Stefanie Jegelka (and Caroline Uhler) 32 / 35

(75)

Comments

prediction: mean/mode ˆy = µ∗|1:N = µ + σ>N∗Σ−1N y1:N (for µ = 0)

variance shrinks as we observe more data: σ2∗|1:N ≤ σ2 example of Bayesian Inference.

Just need the covariance. . .

Idea: covariance is a function of distance!

Stefanie Jegelka (and Caroline Uhler) 33 / 35

(76)

Comments

prediction: mean/mode ˆy = µ∗|1:N = µ + σ>N∗Σ−1N y1:N (for µ = 0)

variance shrinks as we observe more data: σ2∗|1:N ≤ σ2

example of Bayesian Inference.

Just need the covariance. . .

Idea: covariance is a function of distance!

Stefanie Jegelka (and Caroline Uhler) 33 / 35

(77)

Comments

prediction: mean/mode ˆy = µ∗|1:N = µ + σ>N∗Σ−1N y1:N (for µ = 0)

variance shrinks as we observe more data: σ2∗|1:N ≤ σ2 example of Bayesian Inference.

Just need the covariance. . .

Idea: covariance is a function of distance!

Stefanie Jegelka (and Caroline Uhler) 33 / 35

(78)

Comments

prediction: mean/mode ˆy = µ∗|1:N = µ + σ>N∗Σ−1N y1:N (for µ = 0)

variance shrinks as we observe more data: σ2∗|1:N ≤ σ2 example of Bayesian Inference.

Just need the covariance. . .

Idea: covariance is a function of distance!

Stefanie Jegelka (and Caroline Uhler) 33 / 35

(79)

Comments

prediction: mean/mode ˆy = µ∗|1:N = µ + σ>N∗Σ−1N y1:N (for µ = 0)

variance shrinks as we observe more data: σ2∗|1:N ≤ σ2 example of Bayesian Inference.

Just need the covariance. . .

Idea: covariance is a function of distance!

Stefanie Jegelka (and Caroline Uhler) 33 / 35

(80)

Summary

Specifics of environmental data: spatio-temporal dependencies, physical models, simulations

Modeling flows via discretization

Modeling short-range spatial correlations with Gaussians

Stefanie Jegelka (and Caroline Uhler) 34 / 35

(81)

References and Further Reading

Environmental Data, Climate Informatics, Comp. Sustainability:

H. Kaper, H. Engler. Mathematics & Climate. Chapters 1, 8.

W. Menke, J. Menke. Environmental Data Analysis with Matlab Tackling Climate Change with Machine Learning.

https://arxiv.org/abs/1906.05433 list of high impact problems

Computational Sustainability: Computing for a Better World and a Sustainable Future. Communications of the ACM, Sep 2019. https://cacm.acm.org/

magazines/2019/9/238970-computational-sustainability/fulltext http://www.climateinformatics.org/

Stefanie Jegelka (and Caroline Uhler) 35 / 35

References

Related documents

Another important aspect of density/porosity analysis is that pyroclastic deposits commonly present a large range of density values, so sample sets must comprise a large num- ber

The distributed blackboard extends the centralized blackboard in many ways; it provides support for efficient data movement among heterogeneous hosts, scales for large data sets,

the distributed processing of large data sets across clusters of computers using simple programming models..

There is need to develop algorithms for mining large, massive and heterogeneous data sets, which increases continuously, with rapid growth of complex data types,

Technology: The project will develop capacity to access very large multivariate datasets; represent heterogeneous data types; integrate multiple GIS data sets stored in many GIS file

In summary, we showed that microservices allow for efficient horizontal scaling of analyses on multiple computational nodes, enabling the processing of large data sets. ,

Longitudinal analyses often use only respondents to each wave of interest [35, 19]. Even though the available data until the time of attrition could be used, at- triters are

• Quality control of initial data will help.. Challenges: Integration of multiple heterogeneous