Robust Statistics for (big) data analytics

Riani M1., Atkinson A.C.2, Corbellini A.1, Perrotta D.3 & Torti F.3

University of Parma

2_LSE 3

European Commission Joint Researc Centre [email protected]

Abstract

Data rarely follow the simple models of mathematical statistics. Often, there will be distinct subsets of observations so that more than one model may be appropriate. Further, parameters may gradually change over time. In addition, there are often dispersed or grouped outliers which, in the context of international trade data, may correspond to fraudulent behavior. All these issues are present in the datasets that are analyzed on a daily basis by the Joint Research Centre of the European Commission and can only tackled by using methods which are robust to deviations to model assumptions (see for example Perrotta et al., 2020).

This distance between mathematical theory and data reality has led, over the last sixty years, to the development of a large body of work on robust statistics. In the seventies of last century it was expected that in the near future “any author of an applied article who did not use the robust alterative would be asked by the referee for an explanation” (Stigler, 2010). Now, a further forty years on, there does not seem to have been the foreseen breakthrough into the wider scientific universe. In this talk, we initially sketch what we see as some of the reasons for this failure, suggest a system of interrogating robust analyses, which we call “monitoring” (Cerioli et al. 2018) and describe the robust and efficient methods which are currently used by the Joint Research Centre of the European Commission to detect model deviations, groups of homogeneous observations (Torti et al. 2018), multiple outliers and/or sudden level shifts in time series (Rousseeuw et al. 2019).

Particular attention will be given to robust and efficient methods (kwown as forwad search) which enables to use a flexible level of trimming and understand the effect that each unit (outlier or not) exerts on the model (see for example Atkinson and Riani, 2000, Riani Atkinson and Cerioli, 2009). Finally we discuss the extension of the above methods to transformations (Atkinson et al. 2020) and to the big data context. With a large set of data, any model is liable to be approximate. Our ﬁrst guesses at a model may even be seriously inadequate. But a suitable measure of lack of ﬁt of themodel can provide low-dimensional plots that are informative about inadequacies. The inadequacy may be systematic, or it may be due to lack of homogeneity in the data as well as to the presence of, possibly large, numbers of outliers. For

these reasons, we need to ﬁt the model in a robust manner. It is customary in robust analyses to assume that at least 50% of the observations come from the same uncontaminated distribution. However, this need not be the case when there are several clusters. For small samples, starting in different clusters can be informative about cluster structure. Random start forward searches (Atkinson et al. 2018) provide one way of starting in a variety of conditions. Theoretical results on such extreme trimming are given by Cerioli et al. (2019). All the methods described in the talk have been included in the FSDA Matlab toolbox freely donwloadable as a toolbox from Mathworks file exchange or from github at the web address https://uniprjrc.github.io/FSDA/

References:

Atkinson A.C., Riani M. (2000) Robust Diagnostic Regression Analysis, Springer–Verlag, NewYork.

Atkinson A.C., Riani M., Cerioli A. (2010) The forward search: theory and data analysis (with discussion). J Korean Stat Soc 39:117–134.

https://doi.org/10.1016/j.jkss.2010.02.007

Atkinson A.C., Riani M., Cerioli A. (2018) Cluster detection and clustering with random start forward searches. J Appl Stat 45:777–798.

https://doi.org/10.1080/02664763.2017.1310806

Atkinson A.C., Riani M., Corbellini A. (2020). The analysis of

transformations for profit-and-loss data, Journal of the Royal Statistical Society, Series C, Applied Statistics, https://doi.org/10.1111/rssc.12389 Cerioli A., Farcomeni A., Riani M. (2019) Wild adaptive trimming for robust estimation and cluster analysis. Scand J Stat 46:235–256.

https://doi.org/10.1111/sjos.12349

Cerioli A., Riani M., Atkinson A.C., Corbellini A. (2018) The power of monitoring: how to make the most of a contaminated multivariate sample (with discussion). Stat Methods Appl 27: 559–587.

https://doi.org/10.1007/s10260-017-0409-8

Cerioli A., Riani M., Atkinson A.C., Corbellini A. (2018) The power of monitoring: how to make the most of a contaminated multivariate sample (with discussion). Stat Methods Appl 27:559–587.

https://doi.org/10.1007/s10260-017-0409-8

Perrotta D., Cerasa A., Torti F., Riani M. (2020), The Robust Estimation of MonthlyPrices of Goods Traded by the Euro-pean Union A technical guide, JRC report, https://ec.europa.eu/jrc/en/publication/robust-estimation- monthly-prices-goods-traded-european-union

Riani,M., Atkinson, A. C. and Cerioli, A. (2009). Finding an unknown number of multivariate outliers. J. R. Stat. Soc. Ser. B Stat. Methodol. 71

447–466. https://doi.org/10.1111/j.1467-9868.2008.00692.x Rousseeuw P.J., Perrotta D., Riani M., Hubert M. (2018). Robust Monitoring of Time Series with Application to Fraud Detection.

Econometrics and Statistics. https://doi.org/10.1016/j.ecosta.2018.05.001 Stigler S.M. (2010). The changing history of robustness. The American Statistician, 64, 277–281.

Torti F.. Perrotta D., Riani M. Cerioli A. (2018). Assessing Trimming Methodologies for Clustering Linear Regression Data, Advances in Data Analysis and Classification: 227-257, https://doi.org/10.1007/s11634-018- 0331-4

A proposal of an algorithm to simulation right

In document Advances in Reliability, Risk and Safety Analysis with Big Data: Proceedings of the 57th ESReDA Seminar: Hosted by the Technical University of Valencia, 23-24 October, 2019, Valencia, Spain (Page 100-103)