
Data Science for Gapfilling Complex Earth Observations


Academic year: 2021


Authors: Verena Bessenbacher, Lukas Gudmundsson, Sonia I. Seneviratne

Verena Bessenbacher
7th May, EGU General Assembly
Land Climate Dynamics, ETH Zürich


INTRODUCTION

Why gapfilling?

- missing values are ubiquitous and unavoidable
- fragmentation of the observed record limits widespread use
- patterns of missingness are non-trivial

- non-trivial covariance structure
- neighborhood relations
- temporal autocorrelation
- underlying physical constraints

Key limitations of state-of-the-art statistical imputation methods

- cannot incorporate the special structure of geoscientific datasets

WHY

Figure: MODIS Skin Temperature, 1st August 2010


INTRODUCTION

There are three fundamentally different ways in which data can be missing.

Missing Completely At Random
= whether a point is missing does not depend on any other variable; the missingness can be described as a purely random process.
This is rarely the case with satellite data.

Missing At Random
= whether a point is missing may depend on observed variables, but not on the missing value itself. Given the observed data, the missing values share the same statistical properties as the observed values.
Swaths in satellite data leave such patterns.

Missing Not At Random
= the missing points are systematically different from the observed points.
E.g. skin temperature below clouds can be expected to be lower than under clear sky, so the unobserved values differ from the observed ones.
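The three mechanisms can be made concrete with a small synthetic experiment. The following is a minimal sketch and is not taken from the presentation: the variables `cloud` and `temp`, their linear relation and the missingness rates are hypothetical choices for illustration only. It fits a line to the observed points and shows that the fitted slope stays close to the true one under MCAR and MAR, but not under MNAR.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical toy data: skin temperature is cooler under higher cloud fraction.
cloud = rng.uniform(0.0, 1.0, n)                      # covariate, always observed
temp = 25.0 - 10.0 * cloud + rng.normal(0.0, 2.0, n)  # true slope is -10

# In every mask, True marks a missing temperature value.
masks = {
    # MCAR: every point has the same 40 % chance of being missing.
    "MCAR": rng.uniform(size=n) < 0.4,
    # MAR: missingness depends only on the observed covariate.
    "MAR": rng.uniform(size=n) < 0.8 * cloud,
    # MNAR: missingness depends on the unobserved value itself (coldest 40 %).
    "MNAR": temp < np.quantile(temp, 0.4),
}


def observed_slope(missing):
    """Slope of temp against cloud, fitted on the observed points only."""
    return np.polyfit(cloud[~missing], temp[~missing], 1)[0]


slopes = {name: observed_slope(mask) for name, mask in masks.items()}
for name, slope in slopes.items():
    print(f"{name}: fitted slope {slope:.2f} (true slope -10)")
```

In this toy setting the relation between the covariate and the target can still be learned from the observed points under MCAR and MAR, whereas under MNAR the observed points alone give a distorted picture.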


OBJECTIVE

HOW

We use reanalysis data from the ERA5 project (Guillory et al, 2017), which provides gap-free estimates of essential climate variables. We employ a "perfect dataset approach": we assume the reanalysis data to be the "true" state of the land-climate interactions and introduce artificial missing values that are subsequently imputed.

The analysis is confined to daily, global, land-only ERA5 data from 2003 to 2012 at 0.25° resolution. We exclude Antarctica and Greenland because soil moisture is not well defined in permanently glaciated areas. Furthermore, we only consider ERA5 variables that can be matched with available satellite remote sensing products: MODIS Aqua skin temperature (Parkinson et al, 2003), GPM precipitation (Huffmann et al, 2019) and ESA-CCI soil moisture of the uppermost soil layer (Dorigo et al, 2017; Gruber and Scanlon, 2019; Gruber et al, 2017). Additionally, we assume constant maps of vegetation type, vegetation cover, topographic height and topographic complexity to be known and gap-free.
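The perfect dataset approach can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the arrays are random stand-ins for an ERA5 field and a satellite availability mask, and the function names `apply_missingness` and `score_on_gaps` are hypothetical. The idea is to blank out the gap-free field wherever the satellite product has no value, and to score any imputation only on those artificially removed points.

```python
import numpy as np


def apply_missingness(truth, observed):
    """Copy `truth` and set it to NaN wherever `observed` is False."""
    gappy = np.array(truth, dtype=float, copy=True)
    gappy[~observed] = np.nan
    return gappy


def score_on_gaps(truth, imputed, observed):
    """Pearson correlation and RMSE on the artificially removed points only."""
    gaps = ~observed
    r = np.corrcoef(truth[gaps], imputed[gaps])[0, 1]
    rmse = np.sqrt(np.mean((truth[gaps] - imputed[gaps]) ** 2))
    return r, rmse


rng = np.random.default_rng(1)
truth = rng.normal(size=(30, 8, 10))            # stand-in for a gap-free (time, lat, lon) field
observed = rng.uniform(size=truth.shape) > 0.4  # stand-in for a satellite availability mask
gappy = apply_missingness(truth, observed)

# Placeholder "imputation" (truth plus noise), used only to exercise the scoring.
imputed = np.where(observed, truth, truth + rng.normal(scale=0.5, size=truth.shape))
r, rmse = score_on_gaps(truth, imputed, observed)
print(f"correlation on gaps: {r:.2f}, RMSE on gaps: {rmse:.2f}")
```

Because the "true" values behind the gaps are known, any gapfilling method can be evaluated this way, which is not possible with the satellite record itself.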

Usually, imputation focuses on gapfilling one variable only. This is often done with the help of other variables, or by spatial or temporal interpolation.

We attempt

- multivariate, i.e. using more than one variable
- mutual, i.e. gapfilling each variable with the help of all others
- multiple imputation, i.e. producing several estimates for each missing value

incorporating:

- covariance structure between variables
- spatial correlation among variables
- temporal autocorrelation among variables

Figure (perfect dataset approach): MODIS Skin Temperature, 1st August 2010; ERA5 Reanalysis, 1st August 2010, with MODIS missingness pattern


METHOD

Candidate functions f: Ridge Regression, Gaussian Process, Random Forest, Neural Network

Predictors:

- constant variables
- skin temperature, precipitation, surface layer soil moisture
- skin temperature spatial interpolation
- skin temperature temporal interpolation
- precipitation spatial interpolation
- precipitation temporal interpolation
- surface layer soil moisture spatial interpolation
- surface layer soil moisture temporal interpolation

while not converged: # iterative estimation of model and missing values
    for variable in variables: # variables switch places so that each variable is predictor once
        for random sample of data points: # bagging approach
            variable = f(constant variables, skin temperature, precipitation, surface layer soil moisture, …)

We sample random data points from the ERA5 variables and impute all missing values in this sample. We iteratively produce estimates for the missing values and fit a model to the data for each variable, in an expectation-maximisation-like fashion. This procedure is repeated until the estimates for the missing data points converge. The method harnesses the highly structured nature of gridded, covarying observation datasets within the flexible function-learning toolbox of data-driven approaches. The imputation utilises (1) the temporal autocorrelation and spatial neighborhood within one variable or dataset and (2) the different missingness patterns across different variables or datasets, i.e. the fact that if one variable is missing at a given point in space and time, another covarying variable might be observed, so that their local covariance can be learned. A simple ridge regression is already able to outperform simple "ad-hoc" gapfilling procedures on high-resolution daily satellite data. We are working on additionally testing nonlinear methods (Gaussian Process, Random Forest and Neural Network).
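The iterative scheme described above can be sketched with a closed-form ridge regression. This is a simplified illustration and not the authors' implementation: it omits the bagging step and the spatial and temporal interpolation predictors, it assumes every variable has at least some observed values, and the function names, the `alpha` penalty and the convergence tolerance are hypothetical choices.

```python
import numpy as np


def ridge_predict(X_train, y_train, X_new, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(Xc.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ (y_train - y_mean))
    return (X_new - x_mean) @ coef + y_mean


def iterative_impute(data, max_iter=20, tol=1e-4, alpha=1.0):
    """Gapfill `data` of shape (n_points, n_variables); NaN marks missing values."""
    missing = np.isnan(data)
    if not missing.any():
        return data.copy()
    # Initial gapfill: per-variable mean of the observed values.
    filled = np.where(missing, np.nanmean(data, axis=0), data)
    for _ in range(max_iter):
        previous = filled.copy()
        for j in range(data.shape[1]):      # each variable is the target once
            gaps = missing[:, j]
            if not gaps.any():
                continue
            predictors = np.delete(filled, j, axis=1)
            # Fit on originally observed targets, predict the gaps.
            filled[gaps, j] = ridge_predict(
                predictors[~gaps], filled[~gaps, j], predictors[gaps], alpha
            )
        # Stop once the estimates for the missing points no longer change.
        if np.max(np.abs(filled - previous)[missing]) < tol:
            break
    return filled


# Toy example: three covarying synthetic variables with 30 % of values removed.
rng = np.random.default_rng(2)
z = rng.normal(size=2000)
truth = np.column_stack([
    z + 0.3 * rng.normal(size=2000),
    -z + 0.3 * rng.normal(size=2000),
    0.5 * z + 0.3 * rng.normal(size=2000),
])
gappy = truth.copy()
gappy[rng.uniform(size=truth.shape) < 0.3] = np.nan
filled = iterative_impute(gappy)
```

Swapping `ridge_predict` for another regressor would be the natural place to test nonlinear methods, since the outer loop does not depend on the model type.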


RESULTS

Figure: Pearson correlation coefficient. Panels: skin temperature, surface layer soil moisture.

Skin temperature is missing where cloud fraction is high, with a global fraction of missing values of 42%. ESA-CCI soil moisture has an impressive 68% of missing values, effectively obscuring tropical rainforest regions all the time and high-latitude areas with snow cover around half the time. Soil moisture measurements therefore exhibit a non-trivial missingness pattern with a comparatively high fraction of missingness among remote sensing products, which makes them especially challenging for imputation. The Pearson correlation between the gapfilled values and the original values for each land point mirrors that: correlation is high where much data can be observed, and low where data is frequently missing. However, correlation is never negative, showing that the gap filling procedure indeed improves the estimates for the missing values.
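A per-grid-point correlation map of this kind could be computed as follows. This is a hedged sketch on synthetic data: the function name `pixelwise_pearson`, the `(time, lat, lon)` array layout and the `min_points` threshold are assumptions, and the source does not state how its maps were computed. The correlation is evaluated only over the time steps that were missing at each grid point.

```python
import numpy as np


def pixelwise_pearson(truth, filled, missing, min_points=3):
    """Pearson correlation over time at each grid point, on gapfilled steps only.

    All inputs have shape (time, lat, lon); `missing` is True where a value
    was gapfilled. Grid points with fewer than `min_points` gaps give NaN.
    """
    t = np.where(missing, truth, np.nan)
    f = np.where(missing, filled, np.nan)
    n = missing.sum(axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        t_anom = t - np.nansum(t, axis=0) / n
        f_anom = f - np.nansum(f, axis=0) / n
        cov = np.nansum(t_anom * f_anom, axis=0)
        norm = np.sqrt(np.nansum(t_anom**2, axis=0) * np.nansum(f_anom**2, axis=0))
        r = cov / norm
    r[n < min_points] = np.nan
    return r


# Synthetic stand-ins for the original field, the gap mask and the gapfilled field.
rng = np.random.default_rng(3)
truth = rng.normal(size=(60, 4, 5))
missing = rng.uniform(size=truth.shape) < 0.5
missing[:, 0, 0] = False   # one grid point that is always observed
filled = np.where(missing, truth + rng.normal(scale=0.5, size=truth.shape), truth)

r_map = pixelwise_pearson(truth, filled, missing)
print(r_map.round(2))
```

Restricting the calculation to the gapfilled time steps avoids inflating the score with points where the gapfilled and original values are identical by construction.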


RESULTS

Figure: skin temperature 13h00, close to Basel, year 2003 (y-axis: temperature [°C])

To illustrate how the gap filling works, the plot shows the daily skin temperature of Basel for the year 2003. In black, the ERA5 skin temperature is plotted. In green, the same data is used, but only the values that would have been observed by a satellite are shown. Days on which Basel was overcast cannot be seen by the satellite, for example much of December 2003.

In red, the initial gap filling procedure is shown. We use the temporal mean.

In blue, the final result is shown. The iterative procedure reduces the bias and increases the correlation of original data and gapfilled values by incorporating information

- from the other variables (soil moisture and precipitation)
- from the neighboring grid points
- from the day before and after
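Information from neighboring grid points and adjacent days can be turned into predictors in several ways. The sketch below shows one simple possibility, a NaN-aware mean over the two direct neighbours along chosen axes. The source does not specify how its spatial and temporal interpolation is computed, so the helper names `shifted` and `neighbour_mean` and the choice of direct neighbours only are assumptions.

```python
import numpy as np


def shifted(field, axis, shift):
    """Shift `field` by one or more steps along `axis`, padding the edge with NaN."""
    out = np.full(field.shape, np.nan)
    src = [slice(None)] * field.ndim
    dst = [slice(None)] * field.ndim
    if shift > 0:
        src[axis], dst[axis] = slice(None, -shift), slice(shift, None)
    else:
        src[axis], dst[axis] = slice(-shift, None), slice(None, shift)
    out[tuple(dst)] = field[tuple(src)]
    return out


def neighbour_mean(field, axes):
    """Mean of the direct neighbours along `axes`, ignoring NaN neighbours."""
    stack = np.stack([shifted(field, ax, s) for ax in axes for s in (-1, 1)])
    valid = ~np.isnan(stack)
    count = valid.sum(axis=0)
    total = np.where(valid, stack, 0.0).sum(axis=0)
    return np.where(count > 0, total / np.maximum(count, 1), np.nan)


# Temporal predictor: mean of the day before and the day after (axis 0 = time).
series = np.arange(5.0).reshape(5, 1, 1)
series[2] = np.nan                       # a missing day
temporal = neighbour_mean(series, axes=(0,))

# Spatial predictor: mean of the four adjacent grid points (axes 1, 2 = lat, lon).
grid = np.arange(9.0).reshape(1, 3, 3)
spatial = neighbour_mean(grid, axes=(1, 2))
```

Padding with NaN rather than wrapping around keeps the first and last day, and the grid edges, from borrowing values from the opposite end of the array. A global grid might instead wrap in longitude.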


Figure legend: ERA5 data; satellite observable ERA5 data; init gapfill

RESULTS: the correlations per variable align well with artificial experiments


Plotting the fraction of missing data against the Pearson correlation of gapfilled vs. original values allows the merit of different gap filling procedures to be compared. A perfect gap filling procedure would show a Pearson correlation of 1 for all fractions of missing data. The mean initial gapfill is shown with the diamond. As expected, filling in the mean yields values with no variance and therefore no correlation with the original values. The iterative procedure increases the correlation for all variables (square, triangle and circle), but the higher the fraction of missing values in a variable, the lower the correlation with the original values.

To benchmark the gap filling procedure, we additionally consider an artificial missingness pattern, where we introduce "artificial swaths" into the ERA5 dataset. With an increasing fraction of missing values, the imputation merit decreases for the artificial case (solid lines). However, our points with the real missingness pattern fall in the area of the lines. This means that although real-world satellite observations are missing not at random, we still achieve a correlation as if they were missing at random. This suggests that the high physical dependency of the three variables helps overcome their complex missingness pattern.
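An artificial swath pattern with a controllable fraction of missing values could be generated as in the sketch below. The diagonal stripe geometry, the function name `artificial_swath_mask` and its parameters are purely hypothetical; the presentation does not describe how its artificial swaths were constructed.

```python
import numpy as np


def artificial_swath_mask(n_lat, n_lon, missing_fraction, swath_period=8, slope=2):
    """Boolean (lat, lon) mask of diagonal stripes; True marks missing points.

    Within every `swath_period` grid cells along longitude, a share of roughly
    `missing_fraction` is masked; `slope` tilts the stripes with latitude.
    """
    lat, lon = np.indices((n_lat, n_lon))
    position = (lon + slope * lat) % swath_period
    n_masked = int(round(missing_fraction * swath_period))
    return position < n_masked


# Sweep the missing fraction, as one would for a benchmark curve.
for fraction in (0.25, 0.5, 0.75):
    mask = artificial_swath_mask(20, 40, fraction)
    print(f"target {fraction:.2f}, realised {mask.mean():.2f}")
```

Applying such masks to the gap-free data at increasing fractions, and scoring the imputation each time, would give a reference curve against which the real missingness patterns can be placed.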


CONCLUSION & OUTLOOK

Conclusions

- We gapfill several remote sensing datasets and test possible algorithms on gap-free ERA5 data
- A simple Ridge Regression is able to outperform trivial initial gapfilling procedures
- The high physical dependency between the variables makes gapfilling possible although a missing not at random pattern is observed
- Soil moisture observations are missing around 68% of the time, making it a challenging case for gapfilling

Outlook

- consider another initial gap fill, using climatology
- add non-linear method for gapfilling
- add net radiation as a variable
- check physical consistency of imputed values (e.g. soil gets wet when it rains)

References

Bessenbacher, V., L. Gudmundsson and S. I. Seneviratne (2019): Testing Random Forest Imputation for Land Hydrology Data. Proceedings of the 9th International Workshop on Climate Informatics, pp 73-77.

van Buuren, S. (2018): Flexible Imputation of Missing Data. Chapman and Hall.

Dorigo, W. et al (2017): ESA CCI Soil Moisture for improved Earth system understanding: State-of-the art and future directions. Remote Sensing of Environment, 203, 185-215.

Gruber, A. et al (2017): Triple Collocation-Based Merging of Satellite Soil Moisture Retrievals. IEEE Transactions on Geoscience and Remote Sensing, 55, 12.

Gruber, A. and Scanlon, T. (2019): Evolution of the ESA CCI Soil Moisture climate data records and their underlying merging methodology. Earth System Science Data, 11, 717-739.

Guillory, A. (2017): ERA5. https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5

Huffmann, G. et al (2019): Integrated Multi-satellite Retrievals for GPM (IMERG) version 4.4. NASA's Precipitation Processing Center.

Parkinson, C. L. (2003): Aqua: an earth-observing satellite mission to examine water and other climate variables. IEEE Transactions on Geoscience and Remote Sensing, 41, 2.

Reichstein, M. et al (2019): Deep learning and process understanding for data-driven Earth system science. Nature, 566, 7743, 195ff.

Rubin, D. B. (1976): Inference and missing data. Biometrika, 63, 3, pp 581-92.

Scher, S. et al (2019): Weather and climate forecasting with neural networks: using GCMs with different complexity as study-ground. Geoscientific Model Development, 12, 2797-2809.

Shen, H. et al (2015): Missing Information Reconstruction of Remote Sensing Data: A technical review. IEEE Geosci. Remote Sens. Mag., 3, 3, 61-81.

Stekhoven, D. J. and P. Bühlmann (2012): MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics, 28, 1, 112-118.
