GRAVITY DATA VALIDATION AND OUTLIER DETECTION USING L 1 -NORM

(1)

GRAVITY DATA VALIDATION AND OUTLIER DETECTION USING L1-NORM

BARRIOT Jean-Pierre, SARRAILH Michel

BGI/CNES 18.av.E.Belin 31401 TOULOUSE Cedex 4 (France) Email: [email protected]

1/Introduction

The Bureau Gravimétrique International is managing a worldwide gravity database. These data have different origins and must be controlled to detect and eliminate outliers. Up to now, we used a prediction technique based on the L₂-norm (collocation) method. We have developed a new method, using the L₁-norm. We present here shortly the outlines of this method, and compare it for different test cases with the L2-method.

2/ Theory of the L1 prediction method

Self-validation is the detection of outliers in a survey from the cross-comparison of all the values of the survey.

Let g

~

the N- vector of the set of observed gravity values over a survey: g g g g_N ~ . =             1 2 The N-vector g ~ ^

of the "true" (unknown) values is related to the N-vector g

~ of observed values by I g_N g ~ ^ ~ ~

= +ε (1), where I_N is the identity matrix of order N and ε

~ is the N-vector of errors. In a perfect world, ε ~ =0~ and then g_~ g ^ ~

= . In an imperfect (our) world, ε

~ ≠0~. We have then to

solve Eq. (1) contaminated by errors.

• L₂- norm solution: ε

~ and g_~ ^

are considered as random variables with a priori 0

~ means and respectively

σε2IN and Cov g( )

~ ^

covariances. The L₂ a posteriori estimate of g

~ ^ as then g* Cov g Cov g I_N g ~ ^ ~ ^ ~ ^ ~ ( ) ( ) =  +   − σε2 1 mean

(2)

and σ_ε− − − +_ _ __       2 1 1 I_N Cov g ~ ^ covariance.

This is the usual least-squares collocation solution.

• L₁- norm solution:

From a L₁-norm point of view, we select the particular g i ~ ^ ( ) , which realizes min ^ ( ) ^ ( ) i l i l j l l N j j M g −g = = = =

∑

1 1

over a set of M realizations of the N-vectors g

~ and ε ~, with g g i i i ~ ^ ( ) ~ ( ) ~ ( )

= +ε . Of course, in the real world, we have to cope with a unique realization of g

~

and ε

~ (and we know only their sum g_~ ^ ), so we 2-1/ select from g ~ ^ observed M subvectors γ ~ ^ ( )i (i=1,...,M) of dimension K,

2-2/ complete through a given interpolation-extrapolation procedure the missing N−K values in order to get M vectors Γ

~ ^ ( )i

of dimension N, 2-3/ select the best estimate Γ

~ ^ ( )i

of g

~

as the one which realizes min ( ) ^

i l i l l N g Γ − =

∑

1 .

Fig. 1: Fitting a line through 3 data points. The L2 solution (a) goes through the 3 data points( , )x yi i by realizing a b i i i y a x b ,

min

∑

( −( + ))2. For the L1 solution, the solution fulfills

a b i i i y a x b , min

∑

−( + ), and

corresponds to one of the lines (b1, b2, b3) that joins the 3 two points subsets.

For L₁ norm, there is no equivalent of covariance matrices, so if we want to have some indication about the robustness of the solution, we can only construct Monte-Carlo estimates of the errors

(3)

by adding to the observed g

~ ^

values a random vector ζ

~

of 0

~ mean and known σζ 2

variance and

infer from this perturbation the corresponding mean and variance of Γ

~ ^ ( )i

.

3/ L1 prediction method algorithm

For a given gravity station where we have to predict the gravity value:

3-1: search of all the neighbouring points, up to a given radius;

3-2: determination of the "best" plane (Fig. 2) or paraboloid (Fig. 3) approximation of the local gravity around the station, by using the gravity values of a subset of selected neighbours, in the sense of the L₁ norm. The gravity value at the predicted point is excluded. As we consider only a limited number of neighbouring points, we study all the subsets of neighbours (subsets of 3 points for the "best" plane, subsets of 6 points for the "best" paraboloid), instead of considering the simplex method;

3-3: computation of the difference between observed and predicted anomaly, interpolated from the "best" L₁-surface, at the location of the predicted value;

3-4: comparison with a given threshold;

3-5: rejection or validation of the gravity value. 4/ Pros and cons of the L1 norm method

4-1 Pro: no "contamination" of the neighbouring points by "bad" points (i.e. a "good" point can be flagged as false, if compared to erroneous ("bad") neighbouring points);

4-2 Pro: no need to use residual anomalies; 4-3 Con: systematic rejection of extrema;

4-4 Con: rejection of points near the edges of the map (only with paraboloid prediction);

4-5 Con: rejection only based on a threshold on the difference between observed and predicted anomaly;

4-6 Con: no error estimate of the predicted anomaly. 5/ Pros and cons of the L2 norm method

5-1 Pro: rejection based on thresholds for the difference between observed and predicted anomaly and for the standard deviation error of the predicted anomaly;

5-2 Con: robustless solution: a "good" point can be flagged as "false", if compared to "bad" neighbouring points (see 4-1);

5-3 Con: need of computing residual anomalies before prediction; 5-4 Con: rejection of extrema.

(4)

Fig. 2: Dark triangle: « best » approximating plane going through the neighbouring gravity values. Black bar: difference between the observed and the predicted anomaly on the selected point (dot).

Fig. 3: Light grey: « best » approximating paraboloid going through the neighbouring gravity values. Black bar: difference between the observed and the predicted anomaly on the selected point (dot).

6/ Future improvements of the L1 method

6-1: estimating of the error on the predicted anomaly by Monte-Carlo method (see point 2); 6-2: Replacing planar or paraboloidal approximation by collocation prediction, to take into

account the covariance function of the anomalies. This will realize a "mix" between L₁ and L₂ methods.

(5)

7/ Example of data validation

Fig. 4: Bouguer anomaly map: good points (cross marker), doubtful points (circle marker) are identified by prediction using a collocation technique, taking into account the local covariance function.

Fig. 5: With the collocation technique, "bad" points can "contaminate" neighbouring points. Such points must be repredicted, after flagging of the erroneous ("bad") points with the largest differences between observed and predicted anomaly (see points 4-1 and 5-2).

(6)

Fig. 6: Prediction using L1 norm and plane approximation (Fig. 2). Seven neighbouring points are selected

per predicted point. The points predicted are considered doubtful if the difference between the observed anomaly and the predicted one is larger than 7 mGals.

Fig. 7: Prediction using L1 norm and paraboloid approximation (Fig. 3). Ten neighbouring points are

selected per predicted point. The points predicted are considered doubtful if the difference between the observed anomaly and the predicted one is larger than 7 mGals.