Data Science for Gapfilling Complex Earth Observations
authors:
Verena Bessenbacher
Lukas Gudmundsson
Sonia I. Seneviratne
Verena Bessenbacher
7th May, EGU General Assembly
Land Climate Dynamics, ETH Zürich
INTRODUCTION
Why gapfilling?
• missing values are ubiquitous and unavoidable
• fragmentation of the observed record limits widespread use
• patterns of missingness are non-trivial
• non-trivial covariance structure
• neighborhood relations
• temporal autocorrelation
• underlying physical constraints
Key limitation of state-of-the-art statistical imputation methods:
they cannot incorporate the special structure of geoscientific datasets
WHY
[Figure: MODIS skin temperature, 1st August 2010]
INTRODUCTION
There are three fundamentally different ways in which data can be missing.
Missing Completely At Random
= the fact that a point is missing does not depend on any other variable, but can be described as a random process.
This is rarely the case with satellite data.
Missing At Random
= missingness may depend on observed variables but, given those, the missing values share the same statistical properties as the observed values.
Swaths in satellite data leave such patterns.
Missing Not At Random
= the missing points are systematically different from the observed points.
E.g. skin temperature below clouds can be expected to be lower than under clear sky, leaving the unobserved values different from the observed ones.
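The difference between these mechanisms can be made concrete with a toy experiment. The sketch below is only an illustration under stated assumptions: the variables `cloud` and `temperature`, the linear cooling under clouds, and the three mask recipes are hypothetical and not taken from MODIS or ERA5. It compares the mean of the removed values with the mean of the remaining values for a purely random mask, a regular swath-like mask, and a cloud-driven mask.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical toy data: skin temperature is lower where cloud fraction is high.
cloud = rng.uniform(0.0, 1.0, n)
temperature = 25.0 - 10.0 * cloud + rng.normal(0.0, 2.0, n)

# Three illustrative missingness patterns (True = missing).
masks = {
    "random": rng.uniform(size=n) < 0.4,   # each point dropped independently
    "swath": (np.arange(n) % 100) < 40,    # regular stripes, unrelated to the values
    "cloud": cloud > 0.6,                  # missing wherever it is cloudy
}

def missing_minus_observed(values, missing):
    """Mean of the removed values minus mean of the remaining values."""
    return float(values[missing].mean() - values[~missing].mean())

bias = {name: missing_minus_observed(temperature, m) for name, m in masks.items()}
```

In this toy setting the random and swath-like masks leave the two means practically equal, while the cloud-driven mask removes systematically colder values, which is the situation described above for skin temperature.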
OBJECTIVE
HOW
We use reanalysis data from the ERA5 project (Guillory et al, 2017), which provides gap-free estimates of essential climate variables. We employ a "perfect dataset approach", where we assume the reanalysis data to be the "true" state of the land-climate interactions and introduce artificial missing values that are subsequently imputed.
The analysis is confined to daily, global, land-only ERA5 data from 2003 to 2012, at 0.25° resolution. We exclude Antarctica and Greenland because soil moisture is not well defined in permanently glaciated areas. Furthermore, we consider only ERA5 variables that can be matched with available satellite remote sensing products: MODIS Aqua skin temperature (Parkinson et al, 2003), GPM precipitation (Huffmann et al, 2019) and ESA-CCI soil moisture (Dorigo et al, 2017, Gruber and Scanlon, 2019, Gruber et al, 2017) of the uppermost soil layer. Additionally, we assume constant maps of vegetation type, vegetation cover, topographic height and topographic complexity to be known and gap-free.
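The perfect dataset approach can be sketched in a few lines. This is a minimal illustration, not the processing chain used for the poster: the random array stands in for a gap-free reanalysis field, the random mask stands in for a satellite missingness pattern, and the helper names `apply_missingness` and `rmse_on_gaps` are hypothetical.

```python
import numpy as np

def apply_missingness(truth, missing):
    """Return a copy of `truth` with NaN wherever `missing` is True."""
    gappy = truth.astype(float).copy()
    gappy[missing] = np.nan
    return gappy

def rmse_on_gaps(truth, filled, missing):
    """Root-mean-square error evaluated only at the artificially removed points."""
    diff = filled[missing] - truth[missing]
    return float(np.sqrt(np.mean(diff ** 2)))

rng = np.random.default_rng(1)
# Stand-in for a gap-free reanalysis field with shape (time, lat, lon).
truth = rng.normal(15.0, 5.0, size=(30, 4, 5))
# Stand-in for a missingness pattern borrowed from a satellite product.
missing = rng.uniform(size=truth.shape) < 0.4

gappy = apply_missingness(truth, missing)
# Simplest possible imputation: per-grid-cell temporal mean of the observed values.
temporal_mean = np.nanmean(gappy, axis=0, keepdims=True)
filled = np.where(missing, temporal_mean, gappy)
error = rmse_on_gaps(truth, filled, missing)
```

Because the removed values are known, any imputation method can be scored in the same way by replacing the temporal-mean step.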
Usually, imputation focuses on gapfilling one variable only. This is often done with the help of other variables, or with spatial or temporal interpolation.
We attempt an imputation that is
• multivariate, i.e. using more than one variable
• mutual, i.e. gapfilling each variable with the help of all others
• multiple, i.e. producing several estimates for each missing value
incorporating:
- covariance structure between variables
- spatial correlation among variables
- temporal autocorrelation among variables
[Figure: perfect dataset approach. Panels: MODIS skin temperature, 1st August 2010; ERA5 reanalysis, 1st August 2010, with MODIS missingness pattern]
METHOD
Models considered for f: Ridge Regression, Gaussian Process, Random Forest, Neural Network
Variables: skin temperature, precipitation, surface layer soil moisture
Predictors derived from each variable: spatial interpolation, temporal interpolation
for random sample of data points: # bagging approach
    while not converged: # iterative estimation of model and missing values
        for variable in variables: # variables switch places so that each variable is predictor once
            variable = f(constant variables, spatial and temporal interpolation of each variable, …)
We sample random data points from the ERA5 variables and impute all missing values in this sample. We iteratively produce estimates for the missing values and fit a model to the data for each variable, in an expectation-maximisation-like fashion. This procedure is repeated until the estimates for the missing data points converge. The method harnesses the highly structured nature of gridded, covarying observation datasets within the flexible function-learning toolbox of data-driven approaches. The imputation utilises (1) the temporal autocorrelation and spatial neighborhood within one variable or dataset and (2) the different missingness patterns across different variables or datasets, i.e. the fact that if one variable is missing at a given point in space and time, another covarying variable might be observed and their local covariance could be learned. A simple ridge regression is already able to outperform simple "ad-hoc" gapfilling procedures on high-resolution daily satellite data. We are additionally working on testing nonlinear methods (Gaussian Process, Random Forest and Neural Network).
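The iterative scheme described above can be sketched with NumPy alone. The code below is a simplified stand-in under stated assumptions, not the implementation behind the poster: it uses a closed-form ridge regression, a column-mean initial gapfill, a table of samples by variables instead of gridded fields, and it omits the bagging step and the spatial and temporal predictors. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_fit_predict(X_train, y_train, X_pred, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(Xc.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ (y_train - y_mean))
    return (X_pred - x_mean) @ coef + y_mean

def iterative_impute(data, alpha=1.0, tol=1e-4, max_iter=50):
    """Mutually gapfill the columns of `data` (samples x variables, NaN = missing)."""
    missing = np.isnan(data)
    # Initial gapfill: mean of the observed values of each variable.
    filled = np.where(missing, np.nanmean(data, axis=0), data)
    for _ in range(max_iter):
        previous = filled.copy()
        for j in range(data.shape[1]):  # each variable takes a turn as the target
            gaps = missing[:, j]
            if not gaps.any():
                continue
            others = np.delete(filled, j, axis=1)
            # Fit on rows where the target is observed, predict where it is missing.
            filled[gaps, j] = ridge_fit_predict(
                others[~gaps], filled[~gaps, j], others[gaps], alpha
            )
        # Stop once the imputed values no longer change.
        if np.max(np.abs(filled - previous)) < tol:
            break
    return filled

# Synthetic test: three covarying variables with 30 % of the values removed.
rng = np.random.default_rng(2)
n = 2000
latent = rng.normal(size=n)
truth = np.column_stack([latent + 0.3 * rng.normal(size=n) for _ in range(3)])
missing = rng.uniform(size=truth.shape) < 0.3
gappy = np.where(missing, np.nan, truth)

mean_filled = np.where(missing, np.nanmean(gappy, axis=0), gappy)
iter_filled = iterative_impute(gappy)
rmse_mean = float(np.sqrt(np.mean((mean_filled - truth)[missing] ** 2)))
rmse_iter = float(np.sqrt(np.mean((iter_filled - truth)[missing] ** 2)))
```

Swapping `ridge_fit_predict` for a nonlinear regressor would leave the surrounding loop unchanged, which is one reason to keep the model behind a single fit-and-predict function.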
RESULTS
[Figure: Pearson correlation coefficient. Panels: skin temperature; surface layer soil moisture]
Skin temperature is missing where cloud fraction is high, with a global fraction of missing values of 42%. ESA-CCI soil moisture has an impressive 68% of missing values, effectively obscuring tropical rainforest regions all the time and high-latitude areas with snow cover around half the time. Soil moisture measurements therefore exhibit a non-trivial missingness pattern with a comparatively high fraction of missingness among remote sensing products, making them especially challenging for imputation. The Pearson correlation between the gapfilled values and the original values for each land point mirrors that: correlation is high where much data can be observed, and low where data is frequently missing. However, correlation is never negative, showing that the gap filling procedure indeed improves the estimates for the missing values.
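A per-grid-point correlation map of this kind can be computed as sketched below. This is an illustrative sketch, assuming arrays of shape (time, lat, lon) and a boolean mask that is True at the artificially removed points; the function name `pearson_on_gaps` and the synthetic data are hypothetical.

```python
import numpy as np

def pearson_on_gaps(truth, filled, missing):
    """Per-grid-cell Pearson correlation over time, using only gapfilled time steps.

    Cells with fewer than three gapfilled values or zero variance yield NaN.
    """
    t = np.where(missing, truth, np.nan)
    f = np.where(missing, filled, np.nan)
    n = missing.sum(axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        t_anom = t - np.nansum(t, axis=0) / n
        f_anom = f - np.nansum(f, axis=0) / n
        cov = np.nansum(t_anom * f_anom, axis=0)
        norm = np.sqrt(np.nansum(t_anom ** 2, axis=0) * np.nansum(f_anom ** 2, axis=0))
        r = cov / norm
    r[n < 3] = np.nan
    return r

# Synthetic check: the "gapfilled" values are the truth plus noise.
rng = np.random.default_rng(3)
truth = rng.normal(size=(200, 3, 4))
missing = rng.uniform(size=truth.shape) < 0.5
filled = np.where(missing, truth + 0.5 * rng.normal(size=truth.shape), truth)
r_map = pearson_on_gaps(truth, filled, missing)
```

Restricting the correlation to the removed points avoids rewarding a method for the observed values it simply passes through.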
RESULTS
[Figure: skin temperature at 13h00, close to Basel, year 2003; y-axis: temperature [°C]]
To illustrate how the gap filling works, the plot shows the daily skin temperature of Basel for the year 2003. In black, the ERA5 skin temperature is plotted. In green, the same data is used, but only the values that would have been observed by a satellite are shown. Days on which Basel was overcast cannot be seen by the satellite, for example much of December 2003.
In red, the initial gap filling procedure is shown. We use the temporal mean.
In blue, the final result is shown. The iterative procedure reduces the bias and increases the correlation of original data and gapfilled values by incorporating information
- from the other variables (soil moisture and precipitation)
- from the neighboring grid points
- from the day before and after
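The temporal-mean initial gapfill and the neighborhood information listed above could be turned into predictors roughly as follows. This is a sketch under assumptions: fields have shape (time, lat, lon), the spatial neighborhood is the four adjacent grid cells, the temporal neighborhood is one day on either side, and edges repeat the nearest value. None of these choices is stated on the poster, and the function names are hypothetical.

```python
import numpy as np

def initial_fill(gappy):
    """Replace NaNs by the per-grid-cell temporal mean of the observed values."""
    return np.where(np.isnan(gappy), np.nanmean(gappy, axis=0, keepdims=True), gappy)

def temporal_neighbours(field):
    """Mean of the day before and the day after (edges repeat the nearest day)."""
    padded = np.pad(field, ((1, 1), (0, 0), (0, 0)), mode="edge")
    return 0.5 * (padded[:-2] + padded[2:])

def spatial_neighbours(field):
    """Mean of the four adjacent grid cells (edges repeat the border cell)."""
    p = np.pad(field, ((0, 0), (1, 1), (1, 1)), mode="edge")
    return 0.25 * (p[:, :-2, 1:-1] + p[:, 2:, 1:-1] + p[:, 1:-1, :-2] + p[:, 1:-1, 2:])

# Small synthetic example with 30 % of the values removed.
rng = np.random.default_rng(4)
gappy = rng.normal(10.0, 3.0, size=(10, 4, 5))
gappy[rng.uniform(size=gappy.shape) < 0.3] = np.nan

first_guess = initial_fill(gappy)
predictors = np.stack(
    [temporal_neighbours(first_guess), spatial_neighbours(first_guess)], axis=-1
)
```

The resulting `predictors` array has one extra trailing axis with the temporal and spatial neighborhood means, which can be flattened into regression features alongside the other variables.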
[Figure legend: ERA5 data; satellite observable ERA5 data; init gapfill]
RESULTS: the correlations per variable align well with artificial experiments
By plotting the fraction of missing data against the Pearson correlation of gapfilled vs. original values, the merit of different gap filling procedures can be compared. A perfect gap filling procedure would show a Pearson correlation of 1 for all fractions of missing data. The mean initial gapfill is shown with the diamond. As expected, filling in the mean shows no variance and therefore no correlation with the original values. The iterative procedure increases the correlation for all variables (square, triangle and circle), but the higher the fraction of missing values in a variable, the lower the correlation with the original values.
To benchmark the gap filling procedure, we additionally consider an artificial missingness pattern, where we introduce "artificial swaths" into the ERA5 dataset. With an increasing fraction of missing values, the imputation merit decreases for the artificial case (solid lines). However, our points with the real missingness pattern fall in the area of the lines. This means that although real-world satellite observations are missing not at random, we still achieve a correlation as if they were missing at random. This suggests that the high physical dependency of the three variables helps overcome their complex missingness pattern.
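An artificial swath pattern with a controllable fraction of missing values could be generated as in the sketch below. The geometry is purely an assumption made for illustration (slanted stripes that shift by a fixed number of grid cells per day); the poster does not specify how its artificial swaths were constructed, and `swath_mask` with its `period` and `shift` parameters is a hypothetical helper.

```python
import numpy as np

def swath_mask(n_time, n_lat, n_lon, fraction, period=20, shift=3):
    """Boolean mask (True = missing) made of slanted stripes that move every day.

    `fraction` is the share of each stripe period that is masked.
    """
    t, y, x = np.meshgrid(
        np.arange(n_time), np.arange(n_lat), np.arange(n_lon), indexing="ij"
    )
    phase = (x + y + shift * t) % period
    return phase < fraction * period

# Masks with an increasing fraction of missing values on a small test grid.
fractions = [0.25, 0.5, 0.75]
masks = {frac: swath_mask(10, 8, 40, frac) for frac in fractions}
achieved = {frac: float(mask.mean()) for frac, mask in masks.items()}
```

Applying such masks to a gap-free field and scoring the gapfill for each fraction gives one curve of the kind the real missingness patterns are compared against.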
CONCLUSION & OUTLOOK
- consider another initial gap fill, using climatology
- add non-linear method for gapfilling
- add net radiation as a variable
- check physical consistency of imputed values (e.g. soil gets wet when it rains)
References
Bessenbacher V., L. Gudmundsson and S. I. Seneviratne (2019): Testing Random Forest Imputation for Land Hydrology Data, Proceedings of the 9th International Workshop on Climate Informatics, pp 73-77
van Buuren, S. (2018): Flexible Imputation of Missing Data, Chapman and Hall.
Dorigo W. et al (2017): ESA CCI Soil Moisture for improved Earth system understanding: State-of-the art and future directions. Remote Sensing of Environment, 203, 185-215.
Gruber, A. et al (2017): Triple Collocation-Based Merging of Satellite Soil Moisture Retrievals. IEEE Transactions on Geoscience and Remote Sensing, 55, 12.
Gruber, A. and Scanlon, T. (2019): Evolution of the ESA CCI Soil Moisture climate data records and their underlying merging methodology, Earth System Science Data, 11, 717-739.
Guillory, A. (2017): ERA5. https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5
Huffmann, G. et al (2019): Integrated Multi-satellite Retrievals for GPM (IMERG) version 4.4. NASA's Precipitation Processing Center.
Parkinson, C. L. (2003): Aqua: an earth-observing satellite mission to examine water and other climate variables. IEEE Transactions on Geoscience and Remote Sensing, 41, 2.
Reichstein, M. et al (2019): Deep learning and process understanding for data-driven Earth system science, Nature, 566, 7743, 195ff
Rubin, D. B. (1976): Inference and missing data. Biometrika, 63, 3, pp 581-92
Scher, S. et al (2019): Weather and climate forecasting with neural networks: using GCMs with different complexity as study-ground. Geoscientific Model Development, 12, 2797-2809
Shen, H. et al (2015): Missing Information Reconstruction of Remote Sensing Data: A technical review. IEEE Geosci. Remote Sens. Mag., 3, 3, 61-81.
Stekhoven, D. J. and P. Bühlmann (2012): MissForest — non-parametric missing value imputation for mixed-type data. Bioinformatics, 28, 1, 112-118.