Statistical Models for Data Analysis

(1)

Studies in Classiﬁ cation, Data Analysis,

and Knowledge Organization

Statistical Models

for Data Analysis

Paolo Giudici

Salvatore Ingrassia

(2)

Studies in Classiﬁcation, Data Analysis,

and Knowledge Organization

Managing Editors Editorial Board H.-H. Bock, Aachen D. Baier, Cottbus

W. Gaul, Karlsruhe F. Critchley, Milton Keynes M. Vichi, Rome R. Decker, Bielefeld C. Weihs, Dortmund E. Diday, Paris

M. Greenacre, Barcelona C.N. Lauro, Naples J. Meulman, Leiden P. Monari, Bologna S. Nishisato, Toronto N. Ohsumi, Tokyo O. Opitz, Augsburg G. Ritter, Passau M. Schader, Mannheim

For further volumes:

(3)

(4)

Paolo Giudici

Salvatore Ingrassia

Maurizio Vichi

Editors

Statistical Models

for Data Analysis

(5)

Editors Paolo Giudici

Department of Economics and Management University of Pavia

Pavia Italy

Maurizio Vichi Department of Statistics

University of Rome “La Sapienza” Rome

Italy

Salvatore Ingrassia

Department of Economics and Business University of Catania

Catania Italy

ISSN 1431-8814

ISBN 978-3-319-00031-2 ISBN 978-3-319-00032-9 (eBook) DOI 10.1007/978-3-319-00032-9

Springer Cham Heidelberg New York Dordrecht London Library of Congress Control Number: 2013941993

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a speciﬁc statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

(6)

Preface

This volume contains revised versions of the selected papers presented at the 8th biannual meeting of the Classiﬁcation and Data Analysis Group (CLADAG) of the Italian Statistical Society, organized by the Department of Economics and Management of the University of Pavia, in September 2011.

The conference has encompassed 170 presentations, organized in 3 plenary talks and 46 sessions. With 230 attendees from 10 different countries, the conference provided an attractive interdisciplinary international forum for discussion and mutual exchange of knowledge. The topics of all plenary and specialized sessions were chosen, in a peer-review process, to fit the mission of CLADAG which is to promote methodological, computational and applied research, within the fields of classification, data analysis and multivariate statistics.

The contributions in this volume were selected in a second peer-review process, after the conference. In addition to the fundamental areas of clustering and dis-crimination, multidimensional data analysis and data mining, the volume contains manuscripts concerning data analysis and statistical modelling in application areas like economics and ﬁnance, education and social sciences and environmental and biomedical sciences.

We would like to express our gratitude to all members of the scientiﬁc program committee, for their ability in attracting interesting contributions. We also thank the session organizers, the invited speakers, the chairpersons and the discussants of all sessions for a very stimulating scientiﬁc atmosphere. We are very grateful to the referees, for their careful reviews of the submitted papers and for the time spent in this professional activity.

We gratefully acknowledge ﬁnancial support from the Italian Ministry of Research (PRIN programme), the University of Pavia, the Credito Valtellinese banking group and the IT company ISED. We also thank PRAGMA Congressi for the precious support in the organization of the conference.

A special thanks is due to the local organizing committee and, in particular, to its coordinator, Dr. Paola Cerchiello, for a very well organized conference, with the related scientiﬁc proceedings. Finally we would like to thank Ruth Milewski and

(7)

vi Preface

Dr. Martina Bihn of Springer-Verlag, Heidelberg, for the support and dedication to the production of this volume.

Pavia, Italy Paolo Giudici

Catania, Italy Salvatore Ingrassia

Roma, Italy Maurizio Vichi

(8)

Contributors

G. Adelﬁo Dipartimento di Scienze Statistiche e Matematiche “S. Vianelli”, Universit´a degli Studi di Palermo, Palermo, Italy

Claudio Agostinelli Department of Environmental Science, Informatics and Statistics, Ca’ Foscari University of Venice, Venice, Italy

Massimo Attanasio Dipartimento di Scienze Statistiche e Matematiche “Silvio Vianelli”, Universit degli Studi di Palermo, Palermo, Italy

Antonio Balzanella Second University of Naples, Naples, Italy

Tonio Di Battista University G. D’Annunzio, Rome, Italy

Annamaria Bianchi DMSIA, University of Bergamo, Bergamo, Italy

Silvia Bifﬁgnandi DMSIA, University of Bergamo, Bergamo, Italy

Giuseppe Boari Dipartimento di Scienze statistiche, Universit´a Cattolica del Sacro Cuore, Milano, Italy

Andrea Bonanomi Dipartimento di Scienze statistiche, Universit´a Cattolica del Sacro Cuore, Milano, Italy

Stefano Bonnini Department of Economics, University of Ferrara, Ferrara, Italy

Giovanni Boscaino Dipartimento di Scienze Statistiche e Matematiche “Silvio Vianelli”, Universit degli Studi di Palermo, Palermo, Italy

Anne-Laure Boulesteix Biometry and Epidemiology of the Faculty of Medicine, Department of Medical Informatics, University of Munich, Munich, Germany

Giuseppe Bove Dipartimento di Scienze dell’Educazione, Rome, Italy

Riccardo Bramante Department of Statistical Sciences, Catholic University of Milan, Milan, Italy

Gabriele Cantaluppi Dipartimento di Scienze statistiche, Universit´a Cattolica del Sacro Cuore, Milano, Italy

(13)

xii Contributors

Andrea Capotorti Dip. Matematica e Informatica, Universit´a di Perugia, Perugia, Italy

Carlo Capuano University of Naples Federico II, Napoli, Italy

Vincenza Capursi Dipartimento di Scienze Statistiche e Matematiche “Silvio Vianelli”, Universit degli Studi di Palermo, Palermo, Italy

Paola Cerchiello Department of Economics and Management, University of Pavia, Lombardy, Italy

M. Chiodi Dipartimento di Scienze Statistiche e Matematiche “S. Vianelli”, Universit´a degli Studi di Palermo, Palermo, Italy

Rosario D’Agata University of Catania, Catania, Italy

A. D’Alessandro Istituto Nazionale di Geoﬁsica e Vulcanologia, Centro Nazionale Terremoti, Terremoti, Italy

Livia Dancelli Department of Quantitative Methods, University of Brescia, Brescia, Italy

Maria Rosaria D’Esposito University of Salerno, Fisciano (SA), Italy

Carlo Drago University of Naples “Federico II” Complesso Universitario Monte Sant’Angelo via Cinthia, Naples, Italy

Lara Fontanella University G. d’Annunzio, Chieti-Pescara, Italy

S. Fontanella University G. d’Annunzio, Chieti-Pescara, Italy

C. Fusilli University G. d’Annunzio, Chieti-Pescara, Italy

L.A. Garc´ıa-Escudero Facultad de Ciencias, Universidad de Valladolid, Valladolid, Spain

A. Gordaliza Facultad de Ciencias, Universidad de Valladolid, Valladolid, Spain

Alessandra Guglielmi Politecnico di Milano, Milano, Italy

Francesca Ieva Politecnico di Milano, Milano, Italy

Domenica Fioredistella Iezzi Tor Vergata University, Rome, Italy

L. Ippoliti University G. d’Annunzio, Chieti-Pescara, Italy

Antonio Irpino Dipartimento di Studi Europei e Mediterrranei, Seconda Universit degli Studi di Napoli, Caserta, Italy

Valen E. Johnson University of Texas MD Anderson Cancer Center, Houston, TX, USA

Frank Klawonn Ostfalia, University of Applied Sciences, Wolfenb¨uttel, Germany

(14)

Contributors xiii

Francesco Lagona University Roma Tre, Rome, Italy

Carlo Lauro University of Naples “Federico II” Complesso Universitario Monte Sant’Angelo via Cinthia, Naples, Italy

Vincent Lemaire Orange Labs, Lannion, France

Caterina Liberati Economics Department, University of Milano-Bicocca, Milan, Italy

Eleonora Lorenzini Department of Economics and Management, University of Pavia, Lombardy, Italy

Pietro Giorgio Lovaglio Department of Quantitative Methods, University of Bicocca-Milan, Milan, Italy

D. Luzio Dipartimento di Scienza della Terra e del Mare, Universit´a degli Studi di Palermo, Palermo, Italy

Mario Mastrangelo Sapienza University, Rome, Italy

C. Matr´an Facultad de Ciencias, Universidad de Valladolid, Valladolid, Spain

Marica Manisera Department of Quantitative Methods, University of Brescia, Brescia, Italy

Paolo Mariani Statistics Department, University of Milano-Bicocca, Milan, Italy

Federico Marsich Trieste, Italy

A. Mayo-Iscar Facultad de Ciencias, Universidad de Valladolid, Valladolid, Spain

Angelo Mazza Dipartimento di Economia e Impresa, Universit di Catania, Catania, Italy

Paul D. McNicholas University of Guelph, Guelph, ON, Canada

Alfredo Del Monte University of Naples Federico II, Napoli, Italy

Isabella Morlini Department of Economics, University of Modena and Reggio Emilia, Modena, Italy

Riccardo Di Nisio University G. D’Annunzio, Rome, Italy

Eugenia Nissi Dipartimento di Economia, Universit`a “G. d’Annunzio” di Chieti-Pescara, Chieti-Pescara, Italy

Margret-Ruth Oelker Department of Statistics, University of Munich, Munich, Germany

Biometry and Epidemiology of the Faculty of Medicine, Department of Medical Informatics, University of Munich, Munich, Germany

(15)

xiv Contributors

Akinori Okada Graduate School of Management and Information Sciences, Tama University, Tama, Japan

Anna Maria Paganoni Politecnico di Milano, Milano, Italy

Marco Picone University Roma Tre, Rome, Italy Marine Service, Ispra, Italy

Antonella Plaia Dipartimento di Scienze Statistiche e Matematiche “Silvio Vianelli”, Universit degli Studi di Palermo, Palermo, Italy

Mariano Porcu Dipartimento di Scienze Sociali e delle Istituzioni, Universit´a di Cagliari, Cagliari, Italy

Antonio Punzo Dipartimento di Economia e Impresa, Universit di Catania, Catania, Italy

Emanuela Rafﬁnetti Department of Economics, Management and Quantitative Methods, Universit`a degli Studi di Milano, Milano, Italy

Agnese Rapposelli Dipartimento di Economia, Universit`a “G. d’Annunzio” di Chieti-Pescara, Pescara, Italy

Lidia Rivoli University of Naples Federico II, Naples, Italy

Luca La Rocca Dipartimento di Comunicazione e Economia, University of Modena and Reggio Emilia, Reggio Emilia, Italy

Mario Romanazzi Department of Environmental Science, Informatics and Statis-tics, Ca’ Foscari University of Venice, Venice, Italy

Elvira Romano Second University of Naples, Caserta, Italy

David Rossell Institute for Research in Biomedicine of Barcelona, Barcelona, Spain

Fabrizio Ruggeri CNR IMATI Milano, Milano, Italy

Luigi Salmaso Department of Management and Engineering, University of Padova, Vicenza, Italy

Christophe Salperwyck Orange Labs, Lannion, France LIFL, Universit´e de Lille 3, Villeneuve d’Ascq, France

Scipione Sarlo Sapienza University, Rome, Italy

A. Sarra University G. D’Annunzio, Rome, Italy

Monica Scannapieco Istat, Rome, Italy

Germana Scepi University of Naples “Federico II” Complesso Universitario Monte Sant’Angelo via Cinthia, Naples, Italy

(16)

Contributors xv

Klaus B. Schebesch Faculty of Economics, Vasile Goldis¸, Western University Arad, Arad, Romania

Gabriella Schoier Dipartimento di Scienze Economiche Aziendali Matematiche e Statistiche, Universit´a di Trieste, Trieste, Italy

Ricardo Silva University College London, London, UK

Francesca Solmi Department of Statistical Sciences, University of Padova, Padova, Italy

Ralf Stecking Department of Economics, Carl von Ossietzky University Oldenburg, Oldenburg, Germany

Domenico De Stefano University of Trieste, Trieste, Italy

Isabella Sulis Dipartimento di Scienze Sociali e delle Istituzioni, Universit´a di Cagliari, Cagliari, Italy

Donatello Telesca University of California, Los Angeles, CA, USA

Venera Tomaselli University of Catania, Catania, Italy

Hiroyuki Tsurumi College of Business Administration, Yokohama National University, Yokohama, Japan

Barbara Vantaggi Dip. Scienze di Base e Applicate per l’Ingegneria, Universit´a La Sapienza, Roma, Italy

Rosanna Verde Second University of Naples, Naples, Italy

Dipartimento di Studi Europei e Mediterrranei, Seconda Universit degli Studi di Napoli, Caserta, Italy

Maria Prosperina Vitale University of Salerno, Fisciano (SA), Italy

Giorgio Vittadini Department of Quantitative Methods, University of Bicocca-Milan, Bicocca-Milan, Italy

Marika Vezzoli Department of Quantitative Methods, University of Brescia, Brescia, Italy

Roland Winkler German Aerospace Center, Braunschweig, Germany

Angelo Zanella Dipartimento di Scienze statistiche, Universit´a Cattolica del Sacro Cuore, Milano, Italy

Diego Zardetto Istat, Rome, Italy

(17)

Ordering Curves by Data Depth

Claudio Agostinelli and Mario Romanazzi

Abstract Application of depth methods to functional data provides new tools of analysis, in particular an ordering of curves from the center outwards. Two specific depth definitions are band depth and half-region depth (López-Pintado

& Romo (2009).Journal of the American Statistical Association, 104, 718–734;

L´opez-Pintado & Romo (2011). Computational Statistics & Data Analysis, 55,

1679–1695). Another research area is local depth (Agostinelli and Romanazzi

(2011). Journal of Statistical Planning and Inference, 141, 817–830.) aimed to

identify multiple centers and dense subsets of the space. In this work we suggest local versions for both band and half-region depth and illustrate an application with real data.

1 Introduction

The data considered here are a set ofncurvesyi.t /yi,t 2T R,iD1; : : : ; n

to be interpreted as IID observations of a functional variabley.t / deﬁned in an

infinite dimensional space. Examples arise in chemometrics (spectrometrics curves), environmetrics (time trajectories of concentrations of pollutants), finance (stock prices), medicine (ECG and EEG curves) and many other fields. Two comprehensive

references on the statistical methods areRamsay and Silverman(2005) andFerraty

and Vieu (2006). A general problem is to deﬁne an ordering for functional data able to rank the curves according to centrality. Useful notions of mean curve and

median curve are already available (e.g.,Ferraty and Vieu(2006, p. 127)) but a more

general solution is in terms of functional depth, an extension of the order statistic.

C. Agostinelli ()M. Romanazzi

Department of Environmental Science, Informatics and Statistics, Ca’ Foscari University of Venice, Venice, Italy

e-mail:[email protected];[email protected]

P. Giudici et al. (eds.),Statistical Models for Data Analysis, Studies in Classiﬁcation, Data Analysis, and Knowledge Organization, DOI 10.1007/978-3-319-00032-9 1, © Springer International Publishing Switzerland 2013

(18)

2 C. Agostinelli and M. Romanazzi

Data depth was established at the end of the last century as a general nonparametric

methodology for multivariate numerical data (Liu 1990; Zuo and Serﬂing 2000).

The extension to functional data is a more recent research area (López-Pintado and Romo 2009) with some useful achievements including a center-outwards ordering of curves and the notion of depth regions to be interpreted similarly to interquartile intervals. An important application is supervised classification of a curve into one of several classes (López-Pintado and Romo 2005).

In this work we concentrate on local versions of functional depth. Unlike the usual deﬁnition, local depth functions (Agostinelli and Romanazzi 2011) allow recognition of multiple centers and dense subsets of the reference space, a topic of interest also in the context of functional data. An obvious application is curve clustering. The content of the paper is the following. The basic deﬁnitions of

functional depth are reviewed in Sect.2 and their local versions are described in

Sect.3. Some real data applications are discussed in Sect.4.

2 Functional Depth

The basic tool in data depth is the depth function which maps each object of the reference space to a non negative real number—the depth value—to be interpreted as a centrality rank. Depth ranks are different from the standard ranks generated by the univariate order statistic. The maximum depth identiﬁes the center whereas the minimum depth corresponds to the outskirts of the distribution. In the case

of the euclidean spaceRp _{some popular deﬁnitions are Mahalanobis’, halfspace,}

projection and simplicial depth (Zuo and Serﬂing 2000). Half-region and band depth can be interpreted as extensions of halfspace and simplicial depth, respectively, to the functional setting. Roughly speaking, half-region depth is the minimum probability that a randomly sampled curve lies below or above the curve under consideration whereas band depth is the probability that the curve is covered by the band determined by a suitable number of random copies of the underlying stochastic process. Alternative, less restrictive, deﬁnitions are obtained by replacing

probabilities with the expected time the curve satisﬁes either constraint and are

calledmodiﬁeddepths. Precise notation and deﬁnitions for the sample situation are

given below. The population versions can be found in the original papers.

The observed data are a collection of functions yn D fyi 2 C.T / W i D

1; ; ng, with C.T / denoting some functional space, e.g., the space of all

continuous functions on some compact intervalT R. The graph of a function

yis the subset of the spaceG.y/D f.t; y.t //Wt 2Tg.

We start with the deﬁnition of half-region depth. The hypograph (epigraph) of

a curve y is the set of points lying on or below (on or above) its graph. Let

Rhypo.yIyn/ (Repi.yIyn/) be the sample proportion of graphs belonging to the

(19)

Ordering Curves by Data Depth 3

Deﬁnition 1 (Half-region depth,L´opez-Pintado and Romo 2011). For a

func-tional datasetyn, the half-region depth of a curveyis

dHR.yIyn/Dmin.Rhypo.yIyn/; Repi.yIyn// : (1)

Let./denote Lebesgue measure. The modiﬁed version of (1) is

dMHR.yIyn/Dmin.EL.yIyn/;HL.yIyn//; (2) where EL.yIyn/D _{n.T /}1 Pn iD1 .t 2T Wy.t /yi.t // ; (3) HL.yIyn/D _{n.T /}1 Pn iD1 .t 2T Wy.t /yi.t // : (4)

We proceed with the deﬁnition of band depth. For any choice ofknfunctions

yi1; yi2; ; yik out of thenfunctionsy1; y2; ; yn, a bandB.yi1; yi2; ; yik/is

the smallest region of the plane including all of them, that is

B.yi1; yi2; ; yik/D [k lD1hypo.yil/ \[k mD1epi.yim/ (5) D [k lD1[kmD1 hypo.yil/\epi.yim/ : The quantity d_B.k/.yIyn/D n k !1 _X 1i1<i2<<ikn 1.G.y/B.yi1; yi2; ; yik// (6)

is the proportion of bands B.yi1; yi2; ; yik/ containing the whole graph ofy.

Here,1./is the indicator function of its argument.

Deﬁnition 2 (Band depth,L´opez-Pintado and Romo 2009). Let2 K nbe

a ﬁxed value. The band depth of a curveyis

dB;K.yIyn/D

K X kD2

d_B.k/.yIyn/ : (7)

The modiﬁed version of (7) is

dMB;K.yIyn/D K X kD2 d_MB.k/.yIyn/ ; (8) where d_MB.k/.yIyn/D n k !1 1 .T / X 1i1<i2<<ikn .t 2T WG.y/B.yi1; yi2; ; yik// : (9)

(20)

An often used value isKD2.

Together with the depth values, another basic statistic is the deepest—or

median—curve. A useful generalization is the100p% depth regionDOn.p/, deﬁned

as the subset of the100p% deepest curves,0 < p < 1. Frequently,p D 0:5and

O

Dn.0:5/is a possible generalization of the interquartile interval. We end this section

recalling that band and half-region depth, and their maximizers, are consistent estimators of their population counterparts (L´opez-Pintado and Romo 2009, Th. 4, L´opez-Pintado and Romo 2011, Th. 3 and 4).

3 Local Functional Depth

The rationale of local depth is to measure depth conditional on a bounded neighbourhood of the objects under consideration, so as to capture features of the

nearby portion of the space, only. It depends on a tuning parameter , constant

across the space, measuring the size of the neighbourhoods. When grows higher,

local depth becomes more similar to ordinary depth. For simplicial depth, the tuning parameter is just simplex size measured, e.g., by diameter or volume. For halfspace depth, it is the width of minimum probability slabs. The reader is referred toAgostinelli and Romanazzi(2011) for more details. With functional data sets, local versions of both half-region and band depth are obtained by constraining the

widths of hypo/epigraphs and bands to a ﬁnite value. Setting equal to thep-th

order percentile of pairwise distances of curves works well in practice, withp in

the range 10%–30%. Accordingly, local half-region depth of a curve y.t /is the

minimum proportion of curves lying at a distance not greater thanbelow or above

y.t /and local band depth is the sample proportion of bands coveringy.t /formed

byK curves with pairwise distances not greater than. The computing formulae

are given in Eqs. (10) and (11) below.

ldHR.yIyn; /Dmin 1 n n X iD1 1.G.yi/hypo.yI //;1 n n X iD1 1.G.yi/epi.yI // ! ; (10) where

hypo.yI /Dhypo.y/\epi.y /D f.t;z/Wt 2T; y.t / zy.t /g;

epi.yI /Dhypo.yC /\epi.y/D f.t;z/Wt 2T; y.t /zy.t /Cg:

ldB;K.yIyn; /D K X kD2 ld.k/_B .yIyn; / (11) D K X kD2 n k !1

(21)

Ordering Curves by Data Depth 5

In the previous equation,s.B.y1; ; yk//is a scalar measure of band size, like

the diameter

diam.B.y1; ; yk//Dmax

t2T jmax.y1.t /; ; yk.t // min.y1.t /; ; yk.t //j :

Themodiﬁedversions are deﬁned in a similar way as the global counterparts and

are omitted. Forx; y2C.T /, two often-used distances are

ı1.x; y/ Dsupt2T jx.t / y.t /j;

ı2.x; y/D.

R

T.x.t / y.t //

2_dt_/1=2_:

The consistency of the local depth functions deﬁned in this section can be proved

along the same lines asAgostinelli and Romanazzi(2011).

4 Applications

We use two data sets to illustrate curve ranking. The ﬁrst one includes absorbance curves of 240 samples of meat recorded in the wavelength range 850–1,050 nm. See Ferraty and Vieu(2006) for more details. From Fig.1, the curves appear very regular and, apart from a vertical shift, they exhibit a similar behaviour. The second data set is the gross domestic product (GDP) dataset. Observed data are time trajectories over the period 1970–2009 of log GDP pro capite of 83 countries (yearly data

in dollars; source: World Bank). Graphical inspection of the curves (see Fig. 2)

shows different locations but a largely common shape. Rich countries form a fairly separated cluster, notably stable over time.

Our analysis is based on modiﬁed half-region depth, the tuning parameter of the

local version being the10-th percentile of pairwiseı1distances of trajectories.

For the meat data, the ranks provided by global and local depth largely agree and there is no evidence of partial centers or clusters. Both functions reach the maximum value on the dense subset of curves in the lower half of the plot and they steadily decrease when proceeding towards the lower or the upper extreme of the absorbance range. The shift of local central region towards the lower absorbance values reﬂects

the typical behaviour of local depth withasymmetricdistributions (Agostinelli and

Romanazzi 2011). A referee suggested that the depth ordering ofderivativesof the curves could provide information on curve shape. In the present instance this proves to be true because local depth of the second derivatives recognizes three classes of

curves corresponding to those found byFerraty and Vieu(2006), p. 136–137.

A different picture arises from GDP data (see Fig. 2). The deepest curves

according to global depth are Botswana and Colombia and the extreme curves are Burundi, Malawi and Norway. In most cases local depth is very different from global depth, with variations in both directions, which supports the hypothesis that GDP is a compound process. The (normalized) local and global depth values of

(22)

6 C. Agostinelli and M. Romanazzi 850 900 950 1000 1050 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 Meat Data Wavelength Global Depth Absorbance 850 900 950 1000 1050 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 Meat Data Wavelength Local Depth Absorbance

Fig. 1 Absorbance curves with gray level proportional to depth (darker means deeper;dashed line: deepest curve).Left: modiﬁed half-region depth,right: local version

1970 1980 1990 2000 2010 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

GDP PRO CAPITE 1970−2009 (World Bank)

Year Global Depth LOG GDP PR O CAPITE ($) 1970 1980 1990 2000 2010 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

GDP PRO CAPITE 1970−2009 (World Bank)

Year Local Depth

LOG GDP PR

O CAPITE ($)

Fig. 2 GDP curves with gray level proportional to depth (darker means deeper).Left: modiﬁed half-region depth (dashed line: Botswana, dotted line: Colombia);right: local version (dashed line: Austria,dotted line: Guatemala)

Botswana are 0.078 and 1, respectively, whereas the corresponding values of Austria are 1 and 0.194. Indeed, Austria can be considered the centroid of the group of richest countries. The second highest value of local depth is Guatemala, also shown

in Fig.2. The different behaviour of global and local depth is best evaluated from

Fig.2 where the gray level of GDP curves is proportional to (normalized) depth

(black corresponds to maximum depth). It is conﬁrmed that the richest countries form a densely packed group. Another two dense regions are suggested, but they are more dispersed. However, the gap between rich and non rich countries is ﬁlled with several curves with low depth values that can mask the group structure.

Previous results suggest that meat curves are determinations of a homogeneous process whereas GDP curves can come from a mixture of (at least) two processes describing GDP dynamics of rich and non rich countries, respectively. An important

(23)

Ordering Curves by Data Depth 7 l l l l l l l l 0 20 40 60 0 20 40 60 80 100

Percentile Order of Distances

Bootstr ap P−V alue l l l l l l l l

Fig. 3 Bootstrapp-value of homogeneity test withT3;n. /

test function (solid: GDP data,dashed: meat data)

problem is to test the (null) hypothesis that the underlying process is homogeneous. Therank transformationof the local and global depth values are obvious candidates to build a test function because, under the null hypothesis, local and global depth

produce the same ordering. Two natural test functions are the sample meanT1;n. /

of the absolute differences of local and global ranks,

T1;n. /Dn1

n X iD1

jd_i.R/ ld.R/_i . /j;

and the sample correlationT2;n. /of local and global ranks,

T2;n. /Dcorr.d.R/;ld.R/. // :

Here, fori D 1; ; n,d_i.R/ (ld.R/_i . /) stands for the rank of the i-th global

(local) depth value and corr denotes Pearson’s correlation. A more selective

version of T1;n. /arises from consideration of rank differences corresponding to

maximizers of local and global depth functions, only. This test function is denoted

withT3;n. /.

The homogeneity test was performed on both data sets, using the three test

functions with the percentile order of distances ranging from 5% to 70%. The

p-values are estimated through bootstrap resampling with 500 replicates. The

p-values corresponding toT3;n. / are shown in Fig.3. For meat data, according

to the initial guess, the test is never signiﬁcant, whatever test function or value are

used. For GDP data, thep-values corresponding to lower values of conﬁrm the

observed differences between local and global ranks to be important but they always remain above the standard critical values. Accordingly, the differences between rich and non rich countries could just be explained in terms of chance error. Another possibility, supported by simulations not discussed here, is that the power of the three test functions is poor when mixture components are not well separated.

(24)

References

Agostinelli, C., & Romanazzi, M. (2011). Local depth. Journal of Statistical Planning and Inference,141, 817–830.

Ferraty, F., & Vieu, P. (2006).Nonparametric functional data analysis. New York: Springer. Liu, R. (1990). On a notion of data depth based on random simplices.The Annals of Statistics, 18,

405–414.

L´opez-Pintado, S., & Romo, J. (2005).Depth-based classiﬁcation for functional data. Working Paper 2005.56, Departamento de Estadistica, Universidad Carlos III de Madrid

L´opez-Pintado, S., & Romo, J. (2009). On the concept of depth for functional data.Journal of the American Statistical Association, 104, 718–734.

L´opez-Pintado, S., & Romo, J. (2011). A half-region depth for functional data.Computational Statistics & Data Analysis, 55, 1679–1695.

Ramsay, J. O., & Silverman, B. W. (2005).Functional data analysis. New York: Springer Zuo, Y., & Serﬂing, R. (2000). General notions of statistical depth function.Annals of Statistics,

(25)

Can the Students’ Career be Helpful in

Predicting an Increase in Universities Income?

Massimo Attanasio, Giovanni Boscaino, Vincenza Capursi, and Antonella Plaia

Abstract The students’ academic failure and the delay in obtaining their final degree are a significant issue for the Italian universities and their stakeholders. Based on indicators proposed by the Italian Ministry of University, the Italian universities are awarded a financial incentive if they reduce the students’ attrition and failure. In this paper we analyze the students’ careers performance using: (1) aggregate data; (2) individual data. The first compares the performances of the Italian universities using the measures and the indicators proposed by the Ministry. The second analyzes the students’ careers through an indicator based on credit earned by each student in seven academic years. The primary goal of this paper is to highlight elements that can be used by the policy makers to improve the careers of the university students.

1 Introduction

The Article 5 of the Law 537 of 1993 (Legge 2003) issued by the Italian Government is the first step towards the evaluation of the Italian University System (IUS). It marks the beginning of the financial autonomy of the universities with the hope to contain public expenditures. The law identifies the evaluation criteria of the various aspects of the IUS with the aim to obtain a fair distribution of the financial resources.

The same Article has created an Evaluation Committee (“Nuclei di Valutazione”)

M. Attanasio ()G. BoscainoV. CapursiA. Plaia

Dipartimento di Scienze Economiche, Aziendali e Statistiche e Matematiche “Silvio Vianelli”, Universit`a degli Studi di Palermo, Palermo, Italy

e-mail:[email protected];[email protected];[email protected];

[email protected]

(26)

10 M. Attanasio et al.

within each university and a National Assessment Committee, named “Osservatorio

Permanente”, whose primary scope is the management of the resources allocation.

Since late 1993 the law has experienced many modifications, such as the Ministerial Decree 509 in 1999. These reforms were created in order to improve the efficiency and the effectiveness of the universities due to the high number of drop outs and of long survivor students, defined as those students who stayed much more over the legal duration of the degree course (Lambert and Butler 2006). To cope with these aspects, the M.D. 509 introduced a new structure of university curricula,

introducing (1) University Educational Credit (Credito Formativo Universitario—

CFU), as a measurement student academic workload; (2) two qualiﬁcations (for both public and private institutions): ﬁrst degree (L) or 3-year Bachelor, second degree (LS) corresponding to two cycles of courses or 2-year Master degree. Since

2004 a part of the National University Funding System (Fondo di Finanziamento

Ordinario—FFO) have been distributed to the universities following new criteria

onuniversity performance, essentially based on teaching and research. Inspite of

the fact that the 509 reform was implemented in the academic year 2001/2002 and was revised in 2004, the problems of dropouts and of long survivor students is still unchanged.

Monitoring students’ careers allows us to monitor our system in order to improve efficiency for the benefit of the students, and it allows real time monitoring of the performance indicators provided by the Ministry of Education of the University to better distribute the competitive funds (which was around 10% of the Total Funds—FFO—in 2010). Indeed, monitoring of the students’ CFUs rates is both a commitment and a source of essential knowledge for those responsible for the creation of university degree programs. It also allows to collect information on the strengths and weaknesses of specific education programs; to acquire important data for programming and prevention and to establish checks and confirmations in order to create a process that is constantly under control. Furthermore, in order to improve educational services and identify appropriate ways to improve weak students’ performance it is important to analyze the determining factors of university courses successes and failures (Boscaino et al. 2007).

In this paper the student career performance (SCP) is analyzed using both aggregate data, taking into account the FFO criteria and the outcome of the 2010

universities fund allocation (Sect.2), and individual longitudinal data, taking into

account the heterogeneity of students and the monitoring of their careers (Sect.3).

The SCP data analysis raises several questions. For instance, is it possible to ﬁgure out simple and straightforward actions to accelerate the improvement FFO’s indicators? Is it possible to ﬁgure out a policy to improve FFO’s indicators in the long term? Are there simple numbers or indicators, useful to address university

policies, which can beextractedfrom individual SCP analysis? All these questions

lead to the ultimate purpose of this paper: which policymy university will/could

(27)

Can the Students’ Career be Helpful in Predicting an Increase in Universities Income? 11

2 Aggregate Data: FFO Structure and Criteria

Italian FFO has been divided in three parts since 2009: the QB (Quota Base)

corresponding to the general budget, theMP(Modello Premiale) corresponding to

the competitive funds (introduced to reward teaching and research quality), and a

residual part mostly related to salary increase (RE) (D. M. 2009,2010). Every year

the Ministry sets the FFO amount, theMPpercentage amount and how to compute

theQB, whileREis obtained as residual.

In 2010 the Ministry established the amount of the funds to be distributed to the universities and set the MP to be 10% of the yearly FFO while the QB was the

80% of the 2009 FFO. The 2010 FFO is composed by 10%MP, 83%QB, and 7%

RE. Moreover, The Ministry has announced that the MP will increase to 12% of the

2011 FFO .

TheMPaward is proportional to 6 performance indicators, 2 for teaching and 4

for research. The IUS 54 universities have received theMPaccording to (1):

MPD0:17A1C0:17A2C0:23B1C0:10B2C 0:20B3

C0:13B4 (1)

The two indicators relevant to the “quality of teaching”,A1 andA2, will be

analyzed in detail in this paper, while we do not analyze the others becauseB1 B4

represent the quality of the research outcome, which is not relevant to the paper

aims.A1andA2 consider respectively the demand of each University Education

and the number of CFUs earned by the students:

A1D.4STUDAC3STUDB C2STUDC C1STUDD/.KT CKA/

(2)

A2DCFU PACCFU PB CCFU PCCCFU PD (3)

with: CFU Pi D CFUEi CFUTi Mediani CFU Ei (4) where:

1. i D A; B; C; D (ministerial course classiﬁcation based on the ﬁnancial aid

allocated to each student).

2. STUDi is the number of “active students” attending typeicourse.

3. An “active student” is deﬁned as a student with the following features: they have been in the university system for a number of years less or equal to the legal duration of the course; and they have earned at least 5 CFUs per year.

(28)

Fig. 1 Active students in groups B, C, and D (%)

4. KT and KA are correction factors related to the local context (based on the

net income distribution of the university region) and to the sustainability of the courses (the ratio between the number of teachers and the number of Degree Courses), respectively.

5. CFU Eiare the CFU earned in a year by the students of the i-th group.

6. CFU Tiare the CFU correlated to the workload of the full-time students per year

in thei-th group.

7. Mediani is the median over the rates CFU Ei=CFU Ti computed for the 54

universities, in each group.

The MP could cause high losses/gains which will inﬂuence the FFO in the

subsequent years, since theQBis computed as a percentage (80%) of the previous

year FFO. In this way, successive losses are summed up each year since theQBis

progressive, so it is crucial to undertake actions to improveA1andA2in order to

avoid extra losses in the successive years. But how universities can improve their

own performance? According toA1, universities are funded proportionally to the

number of “active students”, but nothing is said about the “inactive students”: how do these students affect the total number of students? The graphical representation

of the “location” of the single universities (Fig.1) allows to summarize and compare

quickly the students’ performance (in terms of active students percentage) with respect to the groups B, C, and D. Group A is not included because it covers just 4% of the students.

The worst universities performance correspond to the black dots in the III

(29)

performances for the beneﬁt of the groups (B to D) by increasing the number of students earning at least 5 CFUs per year and/or by reducing the number of long survivor students. In practice, a university should increase the number of “active students” or, eventually, decrease the number of “inactive students”: this target can be achieved by avoiding students to stay in the university system for a period not longer than the legal duration of the course, and by letting them accumulate at least 5 CFUs in a year. This second task, which allows to stimulate earning 5 CFUs per year, will improve A2 too. A great advantage is to be able to identify early those students who could become “inactive” or could earn few CFUs each year.

To cope with these problems, a cohort study, as described in the next section, can give useful information.

3 Individual Data: Student Career Performance

In this section we investigate the student career performance applying individual longitudinal data. We will use the CFUs accumulation over 7 years.

The data concern the cohort of the Palermo University freshmen, enrolled in 2002/2003 and followed up till the 31st May 2010. For simplicity’s sake, we analyze only those students who never change faculty during 7 years, who have payed university fees in the legal terms every year and never dropped out (the core students). Moreover, we examine 3 faculties (Economics, Engineering, and Arts) belonging to three different cultural areas. The performance is deﬁned as the number

of CFUs earned by the i-th students at the end of the ﬁrstjyears (j D1; 2; : : : ; J)

(CFUi.j /). Table 1 reports the distribution of freshmen enrolled in 2002/2003

classiﬁed according to the student career status in 2009. Our classiﬁcation of the students is the following:

Core The student who never change faculty.

Mover The student who changed faculty, course, and/or university.

Withdrawal The student who quits.

Lost The student who never quitted, but whose follow up ended

before 7th year.

The latter is largely represented in Economics and Arts Faculties, but for the purpose of this paper we will not investigate this group issue because it deserves in-depth examination.

Table1shows that the most efﬁcient faculty at Palermo University is the Faculty

of Engineering. In this Faculty the graduation rate for the bachelor degree is greater than the one of two other faculties, reaching 46% at the end of the sixth year (may be it will be higher if a part of the lost is considered as withdrawal). The low success rate aetiology is difﬁcult to be proved due to its complexity. To simplify we focus our attention to few aspects, with the aim of ﬁnding out simple and useful information for the universities.

Based on these ﬁndings we investigate the number of CFUs earned at the end of the ﬁrst year (CFU(1)) versus the years needed to obtain the degree.

(30)

Table 1 Distribution of Palermo University freshmen enrolled in 2002/2003 by Status and Faculty Faculty

Status in 2009 Economics (%) Arts (%) Engineering (%)

Core 536 (341a₎ _49.1 _{1,339 (960}a₎ _54.2 _{733 (526}a₎ _63.7 Mover to other Faculties 53 (7a₎ _8.8 _{67 (12}a₎ _5.1 _{88 (18}a₎ _9.7 Mover to other Universities 43 60 24 Withdrawal 141 42.1 304 40.6 95 26.6 Lost 318 700 211 Total 1,091 100 2,470 10 1,151 100

a_{Students who took a degree (bachelors)}

This investigation is needed to ﬁgure out if CFU(1) is a good predictor. We restrict our analysis to those students who never changed Degree Course (core) in 7 years. Following Cozzucoli and Ingrassia (2005), we deﬁne:

B The bachelors, or those students that graduated within 7 years.

O The others (with respect to theB’s), those students not yet graduated in 2009.

CFUi.j / The number of CFUs earned by thei-th students at the end of the ﬁrstj

years (j D1; 2; : : : ; JIJ D7).

Xi The number of years expected to obtain a degree by thei-th student, considering

the number of CFUs earned by the i-th student at the end of the ﬁrst year

(CFUi(1)), that is 180=CFUi.1/, where 180 is the amount of CFUs needed to

get a degree. We groupX into 4 classes:4; 4-j5; 5-j6; and>6; obviously, the

students whose CFU.1/D0(12% of core students) have been excluded.

Yi The number of years observed to obtain a degree fori-th bachelor (B).

EYi The number of years expected to obtain a degree for the others (O). EYi

is computed extrapolating the student’s annual earning speed (i) based on

CFUi.j /, and followed to the attainment of 180 CFU:

i D CFUi.j / j (5) EYi D 180 CFUi.j / i Cj (6)

EYi ﬁrst term is the number of years beyond thej-th to “get” the degree. For

instance, if a student earned 105 CFUs at the end of the ﬁrst 7 years (j D7), we

expect that they will get their degree by the 12th year of attendance:

i D CFUi.7/ 7 D 105 7 D15 EYi D 180 CFUi.7/ i C7D 180 105 15 C7D12

(31)

Fig. 2 Box-plots of degree observed or expected years, by expected years at 1st year and student’s status (O and B)—Faculty of Engineering, Palermo University. UpperO’s boxes are truncated for reason of space

Figure2summarizes the results for the Faculty of Engineering because the other

faculties’ results are quite similar. The X axis is the number of the years to obtain the degree, as expected at the end of the ﬁrst academic year (computed through X), for bachelors and others. The Y axis reports: for B’s the observed number of years to obtain the degree; for O’s the expected number of years to obtain the degree

(computed through EY). Figure2clearly shows in most cases that a good start is

a good predictor for success. For instance, 117B students obtained the degree in

4-j5, as expected by their performance in the 1st year, and only 34Ostudents did

not “keep the promise” held in the 1st year. Interestingly, the ratio betweenB’s

andO’s decreases towards the class 5-j6, showing the “best” promise for class4.

These results suggest the importance of the ﬁrst year CFUs as predictor of success.

The box-plots relative to the expected years to obtain the degree (O’s) are very large

and their upper bounds are always over 20. This unusual number is due to the large number of students who have earned very few CFUs in 7 years. This dysfunction

obviously affects the denominator ofEYi.

4 Concluding Remarks and Future Developments

In conclusion, Palermo University aggregate career students data show that the number of earned CFUs are below the national median while the individual SCPs show the dramatic slowness to obtain a degree.

These analysis are the ﬁrst step to investigate the current IUS Evaluation. This Evaluation system presents several incongruities, whose analysis deserves speciﬁc attention. For instance, there is a need for different policies within the single

(32)

university and, more importantly, competitive policies within the different Italian universities. A good example is the FFO implementation by the Ministry for the four types of degree courses A, B, C and D. Nevertheless this is not enough. It is also very important to underline that FFO distributes the competitive fund on the basis of crude indicators. In fact, the process of comparison is conducted using the same set of indicators for all universities, without taking into account the different characteristics of universities and students. Moreover this process is somehow unfair, because good universities can attract more easily good students, and a vicious circle could be boosted by this system of awards (Lambert and Butler 2006).

Further statistical investigations with individual career data will provide detailed information on other covariates that may influence the success/failure of the students. For instance, Fasola (2011) applied a discrete-time competing risks models to the cohort of freshmen of 2002/2003. This model gets in a discrete-time setting simultaneous estimates of the degree and failure risks, including several covariates, such as the high school grades; the high school qualification; the age and the gender (in some faculties). These results may be useful as a basis to create a policy able to address specific actions for specific types of students with the aim to improve the quality of teaching outcomes and to provide recommendations to improve FFO indicators.

Acknowledgments The article is the result of the productive collaboration among the authors. In particular, paragraph 1 can be ascribed to Vincenza Capursi, paragraph 2 can be ascribed to Antonella Plaia, paragraph 3 can be ascribed to Giovanni Boscaino, and paragraphs 4 can be ascribed to Massimo Attanasio.

This paper has been supported from Italian Ministerial grant PRIN 2008 “Measures, statistical models and indicators for the assessment of the University System”, n. 2008BWXSLH

References

Boscaino, G., Capursi, V., & Giambona, F. (2007). La performance delle carriere di una coorte di studenti universitari, DSSM Working Papers - n. 2007.1.

Cozzucoli, P. C., & Ingrassia, S. (2005). Indicatori Dinamici di Efﬁcienza Didattica dei Corsi di Laurea Universitari.Statistica e Applicazioni, III, special issue 1, 61–68.

Fasola, S. (2011).Laurea o abbandono: il buongiorno si vede dal mattino? Graduation Thesis, University of Palermo.

Lambert, R., & Butler, N. (2006).The future of European universities: Renaissance or decay? London: Centre for European Reform.

Legge. (2003). 24-12-3, n. 537, Interventi correttivi di ﬁnanza pubblica. Italian Government. D. M. (2009). 23-9-2009, n. 45, Decreto criteri di Ripartizione del Fondo di Finanziamento

Ordinario (FFO) delle Universit`a per l’anno 2009. Italian Government.

D. M. (2010). 21-12-10, n. 655, Decreto criteri di Ripartizione del Fondo di Finanziamento Ordinario (FFO) delle Universit`a per l’anno 2010. Italian Government.

(33)

Model-Based Classiﬁcation Via Patterned

Covariance Analysis

Luca Bagnato

Abstract This work deals with the classification problem in the case that groups are known and both labeled and unlabeled data are available. The classification rule is derived using Gaussian mixtures where covariance matrices are given according to a multiple testing procedure which asesses a pattern among heteroscedasticity, homometroscedasticity, homotroposcedasticity, and homoscedasticity. The mixture models are then fitted using all available data (labeled and unlabeled) and adopting the EM and the CEM algorithms. The performance of the proposed procedure is evaluated by a simulation study.

1 Introduction

The main purpose of discriminant analysis (DA) is to assign an object to one of theK

groupsG1; : : : ; GK, according to a rule which is based on a vectorxD

x1; : : : ; xp

of observations ofpvariables. Following the well-known Bayes rule, the object with

measurementxis assigned toGkwhen

kDargmax

1jK

jfj.x/; (1)

wherej is the unconditional prior probability of observing a classj member and

fj.x/denotes thej-th group conditional density ofx. Then, whenj andfj.x/

for allj D1; : : : ; Kare known (or estimated), an object can be classiﬁed using the

rule (1). The most common DA methods assume that the data within groupj are

L. Bagnato ()

Dipartimento di Discipline Matematiche, Finanza Matematica e Econometria, Universit`a Cattolica del Sacro Cuore, Milano, Italy

e-mail:[email protected]

(34)

18 L. Bagnato

generated from ap-variate normal with mean_j and covariance matrix˙j. Then,

the global density can be written as a mixture of normal components:

f .x/D K X jD1 jfj.x/; (2) withfj.x/D.2/p=2ˇˇ˙jˇˇ 1=2

expŒ 1=2.x _j/0˙_j1.x _j/. These methods

differ depending on how˙j,j D1; : : : ; K, are deﬁned. While linear discriminant

analysis (LDA) assumes homoscedasticity, quadratic discriminant analysis (QDA) assumes heteroscedasticity. Operationally, in order to opt for LDA, assumptions of normality and homoscedasticity must be checked; with this aim, a possible

solution consists in adopting the test proposed byHawkins(1981). Unfortunately,

this is a rigid practice, since intermediate conﬁgurations between heteroscedasticity and homoscedasticity can be observed in the data. Methods such as regularized

DA (RDA) and eigenvalue decomposition DA (EDDA) (see Friedman 1989 and

Bensmail and Celeux 1996, respectively) are discriminant methodologies offering different ways to modelize such situations. Another approach proposed in literature arises from a test on the between-group covariance structure: some sub-models of

EDDA, such as proportional covariance matrices (PCMs) (see Flury 1986) and

common principal components (CPCs) (see Flury 1984), have been singularly

applied in DA. Also equal correlation matrices (Manly and Rayner 1987) have been

used to this purpose in literature (seeMcLachlan 1992, Sect. 5.4 and the references

therein for applications of these and other related testing-based approaches of DA). The traditional discriminant analysis involves samples of known origin (labeled data) and provides classiﬁcation rules for samples of unknown origin (unlabeled data). These methodologies suffer whenever only a few known observations are available. Furthermore, unlabeled data could contain important information in order

to deﬁne the classiﬁcation rule. Our proposal is to adopt the idea presented inDean

et al.(2006) where both labeled and unlabeled data are involved in the estimation procedure. In particular, we use the family of models (as structures for DA) and the multiple testing procedure (as model selection criteria) recently introduced in Greselin et al.(2011). The combination of these two approaches (Dean et al. 2006; Greselin et al. 2011) results in a novel procedure composed by two steps: ﬁrstly the common covariance structure is chosen by means of a test of hypothesis and then the model parameters are obtained by constrained estimation. Conversely, in Dean et al.(2006) the choice of the underlying model is performeda posteriori

according to the highest BIC- value after having estimated all the models. It is worth noticing that the new procedure accommodates also for the case of equal orientation covariance matrices (the homotroposcedastic model), which has never been used in the context of mixture models, despite it has been frequently observed in real data. Then, the obtained classiﬁcation rule permits to achieve parameter reduction and to

increase the interpretability of the results. In Sect.2, after a brief introduction to

the adopted family of models and the estimation procedure, two real examples are presented. Furthermore, some considerations about the relation with the CPC model

(35)

Model-Based Classiﬁcation Via Patterned Covariance Analysis 19

are provided. Finally, in Sect.3, the validity of the proposed procedure is illustrated

by a simulation study.

2 Patterned Covariance Analysis and Model Estimation

Suppose that the density of the data is given by (2) and let˙j D jj0j be the

spectral decomposition of the matrix˙j,j D 1; : : : ; K, wherej is the diagonal

matrix of the eigenvalues of˙j sorted in non-increasing order, andj is thepp

orthogonal matrix whose columns are the normalized eigenvectors of˙j ordered

according to their eigenvalues. Each component of the spectral decomposition has

a different morphologic interpretation in terms of the group scatters:j governs

theorientation, whilej controls thesize-shape. By allowingj andj to be

equal (E) or variable (V) between groups, we obtain four parsimonious and easily interpreted models which are appropriate to describe various practical situations

(see Table 1). Thus, for example, by imposing 1 D D K D we

obtain homometroscedasticity (model EV in our notation), while by constraining

1 D DK D we have homotroposcedasticity (model VE). We propose to use this family of models which allows to exploit the augmentation multiple testing

procedure inGreselin et al.(2011) as model selection criteria (instead of the

com-mon use of AIC or BIC as inDean et al.,2006). Without going into details, the main

idea of this procedure is based on giving two separate tests for EV and VE which are

combined by using the relationEV\VEDEE. In particular, EV is tested exploiting

the relationship between the eigenvalues inj and the variances of the principal

components in thej-th group,j D1; : : : ; K. The test for VE is obtained by adding

some modiﬁcation to the CPC test proposed byFlury and Gautschi(1986).

To estimate the models we follow the idea proposed inDean et al.(2006) where N

labeled observations and M unlabeled observations are available and all of them are

used in the mixture model estimation. EM (seeDempster et al. 1977) and CEM (see

Celeux and Govaert 1992) algorithms, which respectively maximize the likelihood and the complete-data likelihood, are used here. In addition to the different set of adopted models, the main contribution of our proposal is given by the choice of the

model restrictions. While inDean et al.(2006) the structure is chosenex-post(the

model estimation) using the BIC, we propose to chooseex-antethe structure using

only the labeled data. Note that we can apply the test only on the labeled data since they can be assumed to be independent.

2.1 Examples of Homometroscedasticity

and Heteroscedasticity

In this section two real examples of EV and VE conﬁgurations are presented. The

ﬁrst example is the famous crab data set (genusLeptograpsus) also considered in

(36)

20 L. Bagnato

Table 1 Model, covariance restrictions, nomenclature and number of covariance parameters for each considered model (seeGreselin et al. 2011)

Model j j Nomenclature # of covariance parameters

EE Equal Equal Homoscedasticity ppC1

2

EV Equal Variable Homometroscedasticity KppC1

2 .K1/ p

VE Variable Equal Homotroposcedasticity ppC1

2 CKp

VV Variable Variable Heteroscedasticity KppC1

2 6 8 10 12 14 16 18 10 20 30 40 50 RW CL

Fig. 1 Scatter plot of variables RW and CL for

n1D50males andn2D50

females blue crabs (ıdenotes male andfemale). The ellipses of equal (95%) concentration are also superimposed

and McLachlan(2000), the attention is focused on the sample ofn D 100blue

crabs, there beingn1 D 50males (group 1) andn2 D 50females (group 2), each

specimen havingp D 2 measurements (in millimeters) on the rear width (RW)

and the length along the midline (CL) of the carapace. In Fig.1the scatter plot of

RW versus CL, in both groups, is shown jointly with the ellipses of equal (95%) concentration (arising from estimation under the assumption of heteroscedasticity). Here, the homometroscedasticity structure stands out since the group scatters are

similar in size and shape but different in orientation. As shown inGreselin et al.

(2011) this conjecture is conﬁrmed also by applying the multiple testing procedure. The second real example comes from the Swiss 1,000-franc bank notes, also

considered inFlury and Riedwyl(1983), on which the following two variables are

measured

LEFTDwidth of bank note; measured on left side;

RIGHTDwidth of bank note; measured on right side:

There arek D 2groups of bills, genuine (group 1) and forged (group 2), each of

(37)

Model-Based Classiﬁcation Via Patterned Covariance Analysis 21 LEFT RIGHT 129.0 129.5 130.0 130.5 131.0 129.0 129.5 130.0 130.5 131.0 129.0 129.5 130.0 130.5 131.0 129.0 129.5 130.0 130.5 131.0 LEFT RIGHT a b

Fig. 2 Scatter plots, and related ellipses of equal (95%) concentration, of variables LEFT and RIGHT in two groups of Swiss bank notes. Coinciding points are marked by single symbol only. (a) Genuine, (b) Forged

LEFT versus RIGHT, for both groups, are jointly reported with the ellipses of equal (95%) concentration arising from estimation under the assumption of heteroscedas-ticity. As conﬁrmed by the multiple test, the structure of homotroposcedasticity can be assumed: size and shape are different, but the orientations agree.

2.2 Homotroposcedasticity Versus Common Principal

Components

The deﬁnition of model VE has been never used in DA, while models EE, EV and VV are considered also in EDDA. The EDDA counterpart of homotroposcedasticity

exploits the Common Principal Components (CPC) model (seeFlury 1984). The

latter is in principle equal to VE but with the difference that in the spectral

decompositions of˙j the eigenvalues ofj,j D 1; : : : ; K, are not constrained

to be in decreasing order. In CPC the matrices of normalized eigenvectors are equal between groups but the requirement of decreasing eigenvalues is abandoned. Thus, the family of CPC models contains the family of models VE. Even if both conﬁgurations reach the same parsimony in terms of parameters, VE could provide better discrimination if this conﬁguration really holds. Starting from the estimation of CPC, an ad hoc procedure to estimate VE has been implemented.

The difference between homotroposcedasticity and CPC is graphically explained

in Fig.3considering ellipses of equal concentration of two normal groups. Both the

ellipses have the same axes, according to the CPC model, although they differ in orientation. This means that homotroposcedasticity does not hold.

Statistical Models for Data Analysis

Studies in Classiﬁ cation, Data Analysis,

and Knowledge Organization

Statistical Models

for Data Analysis

Paolo Giudici

Salvatore Ingrassia

Studies in Classiﬁcation, Data Analysis,

and Knowledge Organization

Paolo Giudici

Salvatore Ingrassia

Maurizio Vichi

Editors

Statistical Models

for Data Analysis

Preface

Contents

Contributors

Ordering Curves by Data Depth

1

Introduction

2

Functional Depth

3

Local Functional Depth

4

Applications

References

Can the Students’ Career be Helpful in

Predicting an Increase in Universities Income?

1

Introduction

2

Aggregate Data: FFO Structure and Criteria

3

Individual Data: Student Career Performance

4

Concluding Remarks and Future Developments

References

Model-Based Classiﬁcation Via Patterned

Covariance Analysis

1

Introduction

2

Patterned Covariance Analysis and Model Estimation

2.1

Examples of Homometroscedasticity

and Heteroscedasticity

2.2

Homotroposcedasticity Versus Common Principal

Components