Studies in Classification, Data Analysis, and Knowledge Organization


Managing Editors

H.-H. Bock, Aachen
W. Gaul, Karlsruhe
M. Vichi, Rome

Editorial Board

Ph. Arabie, Newark
D. Baier, Cottbus
F. Critchley, Milton Keynes
R. Decker, Bielefeld
E. Diday, Paris
M. Greenacre, Barcelona
C.N. Lauro, Naples
J. Meulman, Leiden
P. Monari, Bologna
S. Nishisato, Toronto
N. Ohsumi, Tokyo
O. Opitz, Augsburg
G. Ritter, Passau
M. Schader, Mannheim
C. Weihs, Dortmund


Francesco Palumbo · Carlo Natale Lauro
Editors

Data Analysis and Classification

Proceedings of the 6th Conference of the Classification and Data Analysis Group of the Società Italiana di Statistica

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permissions for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Printed on acid-free paper

Editors

Professor Francesco Palumbo
Department of Institution in Economics and Finance
Università di Macerata
Via Crescimbeni, 20
62100 Macerata
Italy
[email protected]

Professor Carlo Natale Lauro
Department of Mathematics and Statistics
Università Federico II di Napoli
Via Cinthia - Complesso Universitario
80126 Napoli
Italy
[email protected]

Professor Michael J. Greenacre
Department of Economics and Business
Universitat Pompeu Fabra
Ramon Trias Fargas, 25–27
08005 Barcelona
Spain
[email protected]

ISSN 1431-8814
ISBN 978-3-642-03738-2
e-ISBN 978-3-642-03739-9
DOI: 10.1007/978-3-642-03739-9

Springer Heidelberg Dordrecht London New York

Library of Congress Control Number: 2009936001

Cover design: SPi Publisher Services

Springer is part of Springer Science + Business Media (www.springer.com)

Preface

This volume contains revised versions of selected papers presented at the biennial meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society, which was held in Macerata, September 12–14, 2007. Carlo Lauro chaired the Scientific Programme Committee and Francesco Palumbo chaired the Local Organizing Committee.

The scientific programme comprised 150 oral presentations and one poster session. It was organised into five plenary sessions, 10 invited paper specialised sessions and 24 solicited paper sessions. There were 54 contributed papers and 12 posters.

Five eminent scholars who have had an important impact on the fields of Classification and Data Analysis were invited as keynote speakers: H. Bozdogan, S.R. Masera, G. McLachlan, A. Montanari and A. Rizzi.

Invited Paper Specialised Sessions focused on the following topics:

Knowledge extraction from temporal data models
Statistical models with errors-in-covariates
Multivariate analysis for microarray data
Cluster analysis of complex data
Educational processes assessment by means of latent variables models
Classification of complex data
Multidimensional scaling
Statistical models for public policies
Classification models for enterprise risk management
Model-based clustering

It is worth noting that two of the ten specialised sessions were organised by the French (Classification of complex data) and Japanese (Multidimensional scaling) classification societies. The SPC is grateful to Professors Okada (Japan) and Zighed (France), who took charge of organising the Japanese and French specialised sessions, respectively. The SPC is also grateful to the Italian statisticians who actively cooperated in the organisation of the specialised and solicited sessions: they were mainly responsible for the success of the conference.

On the occasion of the ClaDAG conference in Macerata, the SPC decided to have two sessions dedicated to young researchers who had finished their PhD programme during the year before the conference.

Thus, the conference provided a large number of scientists and experts from home and abroad with an attractive forum for discussion and mutual exchange of knowledge. The topics of the plenary and specialised sessions were chosen to fit the mission of ClaDAG within the fields of Classification, Data Analysis and Multivariate Statistics.

All papers published in the present volume were reviewed by highly qualified scholars from many countries, selected for each specific topic. The review process was long but thorough, in order to meet the publisher’s standard of quality and the prestige of the series.

The more methodologically oriented papers focus on developments in clustering and discrimination, multidimensional data analysis and data mining. Many papers also provide significant contributions in a wide range of fields of application. This suggested presenting the 51 selected papers in nine parts, with one further section consisting of the keynote lectures. The section names are listed below:

1. Keynote lectures
2. Cluster analysis
3. Multidimensional scaling
4. Multivariate analysis and applications
5. Classification and classification trees
6. Statistical models
7. Latent variables
8. Knowledge extraction from temporal data
9. Statistical methods for financial and economics data
10. Missing values

We wish to express our gratitude to the other members of the Scientific Programme Committee: Andrea Cerioli (Università degli Studi di Parma), Paolo Giudici (Università degli Studi di Pavia), Antonio Giusti (Università degli Studi di Firenze), Pietro Mantovan (Università degli Studi “Cà Foscari” di Venezia), Angelo Marcello Mineo (Università degli Studi di Palermo), Domenico Piccolo (Università degli Studi di Napoli Federico II), Marilena Pillati (Università degli Studi di Bologna), Roberto Rocci (Università degli Studi di Roma “Tor Vergata”) and Sergio Zani (Università degli Studi di Parma).

We gratefully acknowledge the University of Macerata and its Departments of Istituzioni Economiche e Finanziarie and Studi sullo Sviluppo Economico for financial support. We are also indebted to SISTAR Marche, which partially supported the publication of the present volume. We thank all the members of the Local Organizing Committee, D. Bruzzese, C. Davino, M. Gherghi, G. Giordano, L. Scaccia and G. Scepi, for their excellent work in managing the organisation of the sixth ClaDAG conference. We wish to express our special thanks to Cristina Davino for her skilful accomplishment of the duties of Scientific Secretary of ClaDAG 2007, and to Dr. Rosaria Romano for her assistance in producing this volume.


Finally, we would like to thank Dr. Martina Bihn of Springer-Verlag, Heidelberg, for her support and dedication to the production of this volume.

Macerata Francesco Palumbo

Naples Carlo N. Lauro

Barcelona Michael J. Greenacre


List of Referees

We are indebted to our colleagues who kindly agreed to review one or more papers. Their work has been essential to the quality of the present volume.

T. Aluja Banet, J. Antoch, E. Beccalli, D. Blei, S.A. Blozis, D. Bruzzese, M. Chavent, D. Dorn, G. Elliott, V. Esposito-Vinzi, A. Flores-Lagunes, L.C. Freeman, G. Giampaglia, Z. Huang, F. Husson, S. Ingrassia, C. Kascha, H.A.L. Kiers, S. Klink, I. Lerman, P.G. Lovaglio, A.H. Marshall, G. McLachlan, S. Mignani, M. Misuraca, A. Morineau, I. Moustaki, F. Murtagh, A. Nasraoui, L. Lebart, L. Norden, T. Poibeau, M. Riani, F. Rijmen, J. Sander, G. Saporta, Y. Sheng, F.P. Schoenberg, T.A.B. Snijders, R. Turner, L. Trinchera, A. Uhlendorff, J.K. Vermunt, B.Y. Yeap, T.P. York, N.L. Zhang, J. Zhuang, D. Zighed

Contents

Part I Key-note

Clustering of High-Dimensional and Correlated Data
Geoffrey J. McLachlan, Shu-Kay Ng, and K. Wang

Statistical Methods for Cryptography
Alfredo Rizzi

Part II Cluster Analysis

An Algorithm for Earthquakes Clustering Based on Maximum Likelihood
Giada Adelfio, Marcello Chiodi, and Dario Luzio

A Two-Step Iterative Procedure for Clustering of Binary Sequences
Francesco Palumbo and A. Iodice D’Enza

Clustering Linear Models Using Wasserstein Distance
Antonio Irpino and Rosanna Verde

Comparing Approaches for Clustering Mixed Mode Data: An Application in Marketing Research
Isabella Morlini and Sergio Zani

The Progressive Single Linkage Algorithm Based on Minkowski Ultrametrics
Sergio Scippacercola

Visualization of Model-Based Clustering Structures
Luca Scrucca

Part III Multidimensional Scaling

Models for Asymmetry in Proximity Data
Giuseppe Bove

Intimate Femicide in Italy: A Model to Classify How Killings Happened
Domenica Fioredistella Iezzi

Two-Dimensional Centrality of Asymmetric Social Network
Akinori Okada

The Forward Search for Classical Multidimensional Scaling When the Starting Data Matrix Is Known
Nadia Solaro and Massimo Pagani

Part IV Multivariate Analysis and Application

Discriminant Analysis on Mixed Predictors
Rafik Abdesselam

A Statistical Calibration Model for Affymetrix Probe Level Data
Luigi Augugliaro and Angelo M. Mineo

A Proposal to Fuzzify Categorical Variables in Operational Risk Management
Concetto Elvio Bonafede and Paola Cerchiello

Common Optimal Scaling for Customer Satisfaction Models: A Point to Cobb–Douglas’ Form
Paolo Chirico

Structural Neural Networks for Modeling Customer Satisfaction
Cristina Davino

Dimensionality of Scores Obtained with a Paired-Comparison Tournament System of Questionnaire Items
Luigi Fabbris

Using Rasch Measurement to Assess the Role of the Traditional Family in Italy
Domenica Fioredistella Iezzi and Marco Grisoli

Preserving the Clustering Structure by a Projection Pursuit Approach
Giovanna Menardi and Nicola Torelli

Association Rule Mining of Multimedia Content
Adalbert F.X. Wilhelm, Arne Jacobs, and Thorsten Hermes

Part V Classification and Classification Tree

Automatic Dictionary- and Rule-Based Systems for Extracting Information from Text
Sergio Bolasco and Pasquale Pavone

Several Computational Studies About Variable Selection for Probabilistic Bayesian Classifiers
Adriana Brogini and Debora Slanzi

Semantic Classification and Co-occurrences: A Method for the Rules Production for the Information Extraction from Textual Data
Alessio Canzonetti

The Effectiveness of University Education: A Structural Equation Model
Bruno Chiandotto, Bruno Bertaccini, and Roberta Varriale

Simultaneous Threshold Interaction Detection in Binary Classification
Claudio Conversano and Elise Dusseldorp

Detecting Subset of Classifiers for Multi-attribute Response Prediction
Claudio Conversano and Francesco Mola

Clustering Textual Data by Latent Dirichlet Allocation: Applications and Extensions to Hierarchical Data
Matteo Dimai and Nicola Torelli

Multilevel Latent Class Models for Evaluation of Long-term Care Facilities
Giorgio E. Montanari, M. Giovanna Ranalli, and Paolo Eusebi

Author–Coauthor Social Networks and Emerging Scientific Subfields
Yasmin H. Said, Edward J. Wegman, and Walid K. Sharabati

Part VI Statistical Models

A Hierarchical Model for Time Dependent Multivariate Longitudinal Data
Marco Alfò and Antonello Maruotti

Covariate Error Bias Effects in Dynamic Regression Model Estimation and Improvement in the Prediction by Covariate Local Clusters
Pietro Mantovan and Andrea Pastore

Local Multilevel Modeling for Comparisons of Institutional Performance
Simona C. Minotti and Giorgio Vittadini

Modelling Network Data: An Introduction to Exponential Random Graph Models
Susanna Zaccarin and Giulia Rivellini

Part VII Latent Variables

An Analysis of Earthquakes Clustering Based on a Second-Order Diagnostic Approach
Giada Adelfio

Latent Regression in Rasch Framework
Silvia Bacci

A Multilevel Latent Variable Model for Multidimensional Longitudinal Data
Silvia Bianconcini and Silvia Cagnone

Turning Point Detection Using Markov Switching Models with Latent Information
Edoardo Otranto

Part VIII Knowledge Extraction from Temporal Data

Statistical and Numerical Algorithms for Time Series Classification
Roberto Baragona and Salvatore Vitrano

Mining Time Series Data: A Selective Survey
Marcella Corduas

Predictive Dynamic Models for SMEs
Silvia Figini

Clustering Algorithms for Large Temporal Data Sets
Germana Scepi

Part IX Outlier Detection and Robust Methods

Robust Clustering for Performance Evaluation
Anthony C. Atkinson, Marco Riani, and Andrea Cerioli

Outliers Detection Strategy for a Curve Clustering Algorithm
Balzanella Antonio, Elvira Romano, and Rosanna Verde

Robust Fuzzy Classification
Matilde Bini and Bruno Bertaccini

Weighted Likelihood Inference for a Mixed Regressive Spatial Autoregressive Model
Carlo Gaetan and Luca Greco

Detecting Price Outliers in European Trade Data with the Forward Search
Domenico Perrotta and Francesca Torti

Part X Statistical Methods for Financial and Economics Data

Comparing Continuous Treatment Matching Methods in Policy Evaluation
Valentina Adorno, Cristina Bernini, and Guido Pellegrini

Temporal Aggregation and Closure of VARMA Models: Some New Results
Alessandra Amendola, Marcella Niglio, and Cosimo Vitale

An Index for Ranking Financial Portfolios According to Internal Turnover
Laura Attardi and Domenico Vistocco

Bayesian Hidden Markov Models for Financial Data
Rosella Castellano and Luisa Scaccia

Part XI Missing Values

Regression Imputation for Space-Time Datasets with Missing Values
Antonella Plaia and Anna Lisa Bondì

A Multiple Imputation Approach in a Survey on University Teaching Evaluation
Isabella Sulis and Mariano Porcu

Contributors

Rafik Abdesselam ERIC EA 3038, University of Lyon 2, 69676, Bron, France, [email protected]

Giada Adelfio Department of Statistical and Mathematical Sciences, University of Palermo, viale delle Scienze, ed 13, 90128, Palermo, Italy, [email protected]

Valentina Adorno Department of Economics, University of Bologna, Piazza Scaravilli, 2 Bologna, [email protected]

Marco Alfò Dipartimento di Statistica, Probabilità e Statistiche Applicate, Piazzale Aldo Moro, 5 - 00185 Roma, [email protected]

Alessandra Amendola Di.S.E.S. Università degli Studi di Salerno, Via Ponte Don Melillo 84084 Fisciano (SA), Italy, [email protected]

Balzanella Antonio Università degli Studi di Napoli Federico II, Via Cinthia I-80126 Napoli, Italy, [email protected]

Anthony C. Atkinson London School of Economics, London WC2A 2AE, UK, [email protected]

Laura Attardi Dip.to di Progettazione Aeronautica, Università di Napoli, Italy, [email protected]

Luigi Augugliaro Dipartimento di Scienze Statistiche e Matematiche, Università di Palermo, Viale delle Scienze, Edificio 13, 90128, Palermo, Italy, [email protected]

Silvia Bacci Department of Statistics “G. Parenti”, Viale Morgagni 59, 50134 Firenze, Italy, [email protected]

Roberto Baragona Department of Sociology and Communication, Sapienza University of Rome, Via Salaria 113, 00198 Rome, Italy, [email protected]

Cristina Bernini Department of Statistics, University of Bologna, Via Belle Arti 41, Bologna, Italy, [email protected]

Silvia Bianconcini Department of Statistics, University of Bologna, Via Belle Arti, 41 - 40126 Bologna, Italy, [email protected]

Matilde Bini Department of Statistics “G. Parenti”, Viale Morgagni, 59, 50134 Firenze, Italy, [email protected]

Sergio Bolasco Dipartimento di Studi Geoeconomici, Linguistici, Statistici, Storici per l’Analisi Regionale, Sapienza, University of Rome, Via del Castro Laurenziano 9, Roma, [email protected]

Elvio Bonafede University of Pavia, Corso Strada Nuova 65, Italy, concetto. [email protected]

Anna Lisa Bondì Department of Statistical and Mathematical Sciences “S. Vianelli”, University of Palermo, viale delle Scienze - ed. 13, 90128 Palermo, Italy, [email protected]

Giuseppe Bove Dipartimento di Scienze dell’Educazione, Università degli Studi Roma Tre, Italy, [email protected]

Adriana Brogini Department of Statistics, University of Padova, via Cesare Battisti 241, 35121, Padova, Italy, [email protected]

Bruno Bertaccini Department of Statistics, Università degli Studi di Firenze “G. Parenti”, Viale Morgagni, 59, 50134 Firenze, Italy, [email protected]

Silvia Cagnone Department of Statistics, University of Bologna, Via Belle Arti, 41 - 40126 Bologna, Italy, [email protected]

Alessio Canzonetti Dipartimento Studi Geoeconomici, Linguistici, Statistici, Storici per l’Analisi regionale - Facoltà di Economia - Sapienza Università di Roma, Via del Castro Laurenziano 9, Roma, [email protected]

Rosella Castellano DIEF, Università di Macerata, Via Crescimbeni, 20, 62100 Macerata, Italy, [email protected]

Paola Cerchiello University of Pavia, Corso Strada Nuova 65, Italy, [email protected]

Andrea Cerioli Dipartimento di Economia, University of Parma, Via Kennedy 6, Italy, [email protected]

Bruno Chiandotto Università degli Studi di Firenze, Dip.to di Statistica ‘G. Parenti’, Italy, [email protected]

Marcello Chiodi Department of Statistical and Mathematical Sciences, University of Palermo, viale delle Scienze, ed 13, 90128, Palermo, Italy, [email protected]

Paolo Chirico Dipartimento di Statistica e Matematica applicata, Via Maria Vittoria 38, 10100, Torino, Italy, [email protected]

Claudio Conversano Department of Economics, University of Cagliari, Viale Fra Ignazio 17, I-09123, Cagliari, Italy, [email protected]

Marcella Corduas Dipartimento di Scienze Statistiche, Università di Napoli Federico II, Via L. Rodino, 80138, Napoli (I), Italy, [email protected]

Cristina Davino University of Macerata, Dipartimento di Studi sullo sviluppo economico, Italy, [email protected]

Alfonso Iodice D’Enza Dipartimento di Scienze Economiche e Finanziarie, Università di Cassino, Rome, [email protected]

Matteo Dimai Department of Economics and Statistics, University of Trieste, P.le Europa 1, 34127 Trieste, Italy, [email protected]

Elise Dusseldorp TNO Quality of Life, Department of Statistics, Leiden, the Netherlands, [email protected]

Paolo Eusebi Dipartimento di Economia, Finanza e Statistica; Università degli Studi di Perugia, Italy, [email protected]

Luigi Fabbris Statistics Department, University of Padua, Via C. Battisti 241, 35121 Padova, Italy, [email protected]

Silvia Figini Department of Statistics and Applied Economics L. Lenti, University of Pavia, Italy, [email protected]

Carlo Gaetan Department of Statistics, University Ca’ Foscari, Venice, Italy, [email protected]

Marta Giorgino EURES, Via Col di Nava, 3 - 00141 Roma, Italy, [email protected]

Luca Greco Department PE.ME.IS - Section of Statistics, University of Sannio, Benevento, Italy, [email protected]

Marco Grisoli Project Manager - Area Excelencia y Marketing Estratégico, France Telecom España, [email protected]

Thorsten Hermes Universität Bremen, Am Fallturm 1, D-28359 Bremen, Germany, [email protected]

Domenica Fioredistella Iezzi Università degli Studi di Roma “Tor Vergata”, Italy, [email protected]

Antonio Irpino Dipartimento di Studi Europei e Mediterranei, Second University of Naples, Via del Setificio, 15, Belvedere di San Leucio, 81100 Caserta, Italy, [email protected]

Arne Jacobs Universität Bremen, Am Fallturm 1, D-28359 Bremen, Germany, [email protected]

Dario Luzio Dipartimento di Chimica e Fisica della Terra, University of Palermo, via Archirafi, 26, 90123, Palermo, Italy, [email protected]

Pietro Mantovan Department of Statistics, University Ca’ Foscari, S Giobbe, Cannaregio, 873 - I-30121 Venezia, Italy, [email protected]

Antonello Maruotti Dipartimento di Statistica, Probabilità e Statistiche Applicate, Piazzale Aldo Moro, 5 - 00185 Roma, [email protected]

Geoffrey J. McLachlan Department of Mathematics and Institute for Molecular Bioscience, University of Queensland, Brisbane, QLD 4072, Australia, [email protected]

Giovanna Menardi Department of Economics and Statistics, P.le Europa, 1 Trieste, Italy, [email protected]

Angelo M. Mineo Dipartimento di Scienze Statistiche e Matematiche, Università di Palermo, Viale delle Scienze, Edificio 13, 90128, Palermo, Italy, [email protected]

Simona Caterina Minotti Dipartimento di Statistica, Università degli Studi di Milano-Bicocca, Via Bicocca degli Arcimboldi 8, 20126 Milano, Italy, [email protected]

Francesco Mola Department of Economics, University of Cagliari, Viale Fra Ignazio 17, I-09123, Cagliari, Italy, [email protected]

Giorgio E. Montanari Dipartimento di Economia, Finanza e Statistica; Università degli Studi di Perugia, Italy, [email protected]

Isabella Morlini DSSCQ, Università di Modena e Reggio Emilia, Modena, Italy, [email protected]

Marcella Niglio Di.S.E.S. Università degli Studi di Salerno, Via Ponte Don Melillo 84084 Fisciano (SA), Italy, [email protected]

S.K. Ng Department of Mathematics, University of Queensland Brisbane, QLD 4072, Australia, [email protected]

Akinori Okada Graduate School of Management and Information Sciences, Tama University, 4-1-1 Hijirigaoka, Tama-shi, Tokyo 206-0022, Japan, [email protected]

Edoardo Otranto Dipartimento di Economia, Impresa e Regolamentazione, Via Torre Tonda 34, 07100 Sassari, Italy, [email protected]

Massimo Pagani “Luigi Sacco” Hospital, University of Milan, Via G.B. Grassi 74, 20157 Milan, Italy, [email protected]

Francesco Palumbo Dipartimento di Istituzioni Economiche e Finanziarie, Università di Macerata, Faculty of Economics, Macerata, Italy, [email protected]

Andrea Pastore Department of Statistics, University Ca’ Foscari, S Giobbe, Cannaregio, 873 - I-30121 Venezia, Italy, [email protected]

Pasquale Pavone Dipartimento di Studi Geoeconomici, Linguistici, Statistici, Storici per l’Analisi Regionale, Sapienza, University of Rome, Via del Castro Laurenziano 9, Roma, [email protected]

Guido Pellegrini Department of Economic Theory and Quantitative Methods for Political Choices, Sapienza University of Rome, Piazzale Aldo Moro 5, Roma, Italy, [email protected]

Domenico Perrotta European Commission (EC), Joint Research Centre (JRC), Institute for the Protection and Security of the Citizens (IPSC), Global Security and Crisis Management (GSCM), Via Enrico Fermi 2749, Ispra, Italy, domenico. [email protected]

Antonella Plaia Department of Statistical and Mathematical Sciences “S. Vianelli”, University of Palermo, viale delle Scienze - ed. 13, 90128 Palermo, Italy, [email protected]

Mariano Porcu Dip. Ric. Economiche e Sociali - Univ. di Cagliari, Viale S. Ignazio 78, Italy, [email protected]

M. Giovanna Ranalli Dipartimento di Economia, Finanza e Statistica; Università degli Studi di Perugia, Italy, [email protected]

M. Riani Dipartimento di Economia, University of Parma, Via Kennedy 6, Italy, [email protected]

Giulia Rivellini Università Cattolica del Sacro Cuore, Largo Gemelli 1, 20123 Milano, Italy, [email protected]

Alfredo Rizzi Dipartimento di Statistica, Probabilità e Statistiche Applicate, Università di Roma “La Sapienza”, P.le A. Moro, 5 - 00185 Roma, [email protected]

Roberta Varriale Università degli Studi di Firenze, Dip.to di Statistica ‘G. Parenti’, Italy, [email protected]

Elvira Romano Seconda Università degli Studi di Napoli, via Del Setificio 81100 Caserta, Italy, [email protected]

Yasmin H. Said Isaac Newton Institute for Mathematical Sciences, Cambridge University, Cambridge, CB3 0EH UK, [email protected] and Department of Computational and Data Sciences, George Mason University MS 6A2, Fairfax, VA 22030, USA

Luisa Scaccia DIEF, Università di Macerata, Via Crescimbeni, 20, 62100 Macerata, Italy, [email protected]

Germana Scepi University of Naples, Via Cinthia, Monte Sant’Angelo (NA), Italy, [email protected]

Sergio Scippacercola Dipartimento di Matematica e Statistica - Università degli studi di Napoli Federico II - Via Cinthia, 80126 – Napoli, Italy, sergio. [email protected]

Luca Scrucca Dipartimento di Economia, Finanza e Statistica, Università degli Studi di Perugia, Perugia, Italy, [email protected]

Walid K. Sharabati Department of Statistics, Purdue University, West Lafayette, IN 47907, USA, [email protected]

Debora Slanzi Department of Statistics, University Ca’ Foscari, San Giobbe Cannaregio 873, 30121, Venezia, Italy, [email protected]

Nadia Solaro Department of Statistics, University of Milan-Bicocca, Via Bicocca degli Arcimboldi 8, 20126 Milan, Italy, [email protected]

Isabella Sulis Dip. Ric. Economiche e Sociali - Univ. di Cagliari, Viale S. Ignazio 78, Italy, [email protected]

Nicola Torelli Department of Economics and Statistics, University of Trieste, P.le Europa 1, 34127 Trieste, Italy, [email protected]

Francesca Torti Università Milano Bicocca, Facoltà di Statistica, Milano, Italy, [email protected], [email protected]

Rosanna Verde Dipartimento di Studi Europei e Mediterranei, Second University of Naples, Via del Setificio, 15, Belvedere di San Leucio, 81100 Caserta, Italy, [email protected]

Domenico Vistocco Dip.to di Scienze Economiche, Università di Cassino, Italy, [email protected]

Cosimo Vitale Di.S.E.S. Università degli Studi di Salerno, Via Ponte Don Melillo 84084 Fisciano (SA), Italy, [email protected]

Salvatore Vitrano Statistical Office, Ministry for Cultural Heritage and Activities, Collegio Romano 27, 00186 Rome, Italy, [email protected]

Giorgio Vittadini Dipartimento di Metodi Quantitativi per le Scienze Economiche ed Aziendali, Università degli Studi di Milano-Bicocca, Via Bicocca degli Arcimboldi 8, 20126 Milano, Italy, [email protected]

K. Wang Department of Mathematics, University of Queensland Brisbane, QLD 4072, Australia, [email protected]

Adalbert F.X. Wilhelm Jacobs University Bremen, P.O. Box 75 05 61, D-28725 Bremen, Germany, [email protected]

Edward J. Wegman Department of Computational and Data Sciences, George Mason University, Fairfax, VA, USA, [email protected]

Susanna Zaccarin Università di Trieste, Piazzale Europa 1, 34127 Trieste, Italy, [email protected]

Sergio Zani Dipartimento di Economia, Università di Parma, Italy, sergio.zani@unipr.it


Clustering of High-Dimensional and Correlated Data

Geoffrey J. McLachlan, Shu-Kay Ng, and K. Wang

Abstract Finite mixture models are being commonly used in a wide range of applications in practice concerning density estimation and clustering. An attractive feature of this approach to clustering is that it provides a sound statistical framework in which to assess the important question of how many clusters there are in the data and their validity. We consider the applications of normal mixture models to high-dimensional data of a continuous nature. One way to handle the fitting of normal mixture models is to adopt mixtures of factor analyzers. However, for extremely high-dimensional data, some variable-reduction method needs to be used in conjunction with the latter model such as with the procedure called EMMIX-GENE. It was developed for the clustering of microarray data in bioinformatics, but is applicable to other types of data. We shall also consider the mixture procedure EMMIX-WIRE (based on mixtures of normal components with random effects), which is suitable for clustering high-dimensional data that may be structured (correlated and replicated) as in longitudinal studies.

1 Introduction

Clustering procedures based on finite mixture models are being increasingly preferred over heuristic methods due to their sound mathematical basis and to the interpretability of their results. Mixture model-based procedures provide a probabilistic clustering that allows for overlapping clusters corresponding to the components of the mixture model. The uncertainties that the observations belong to the clusters are provided in terms of the fitted values for their posterior probabilities of component membership of the mixture. As each component in a finite mixture model corresponds to a cluster, it allows the important question of how many clusters there are in the data to be approached through an assessment of how many components are needed in the mixture model. These questions of model choice can be considered in terms of the likelihood function.

G.J. McLachlan
Department of Mathematics and Institute for Molecular Bioscience, University of Queensland, Brisbane, QLD 4072, Australia
e-mail: [email protected]

F. Palumbo et al. (eds.), Data Analysis and Classification, Studies in Classification, Data Analysis, and Knowledge Organization, DOI 10.1007/978-3-642-03739-9_1, © Springer-Verlag Berlin Heidelberg 2010

Scott and Symons (1971) were among the first to adopt a model-based approach to clustering. Assuming that the data were normally distributed within a cluster, they showed that their approach is equivalent to some commonly used clustering criteria with various constraints on the cluster covariance matrices. However, from an estimation point of view, this approach yields inconsistent estimators of the parameters. This inconsistency can be avoided by working with the mixture likelihood formed under the assumption that the observed data are from a mixture of classes corresponding to the clusters to be imposed on the data, as proposed by Wolfe (1965) and Day (1969). Finite mixture models have since been increasingly used to model the distributions of a wide variety of random phenomena and to cluster data sets; see, for example, McLachlan and Peel (2000).

2 Definition of Mixture Models

We let $Y$ denote a random vector consisting of $p$ feature variables associated with the random phenomenon of interest. We let $y_1, \ldots, y_n$ denote an observed random sample of size $n$ on $Y$. With the finite mixture model-based approach to density estimation and clustering, the density of $Y$ is modelled as a mixture of a number ($g$) of component densities $f_i(y)$ in some unknown proportions $\pi_1, \ldots, \pi_g$. That is, each data point is taken to be a realization of the mixture probability density function (p.d.f.),

\[ f(y; \Psi) = \sum_{i=1}^{g} \pi_i f_i(y), \tag{1} \]

where the mixing proportions $\pi_i$ are nonnegative and sum to one. In density estimation, the number of components $g$ can be taken sufficiently large for (1) to provide an arbitrarily accurate estimate of the underlying density function. For clustering purposes, each component in the mixture model (1) corresponds to a cluster. The posterior probability that an observation with feature vector $y_j$ belongs to the $i$th component of the mixture is given by

\[ \tau_i(y_j) = \pi_i f_i(y_j) / f(y_j) \tag{2} \]

for $i = 1, \ldots, g$. A probabilistic clustering of the data into $g$ clusters can be obtained in terms of the fitted posterior probabilities of component membership for the data.
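As a minimal sketch of the mixture density (1) and the posterior probabilities (2), the following Python fragment assumes univariate normal component densities. The function names and the two-component parameter values are illustrative assumptions and are not taken from the chapter.

```python
import math


def normal_pdf(y, mean, sd):
    """Univariate normal density, used here as an assumed component density f_i."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(y, props, means, sds):
    """Mixture density: the sum over components of pi_i * f_i(y), as in (1)."""
    return sum(p * normal_pdf(y, m, s) for p, m, s in zip(props, means, sds))


def posterior_probs(y, props, means, sds):
    """Posterior probabilities pi_i * f_i(y) / f(y) for each component, as in (2)."""
    weighted = [p * normal_pdf(y, m, s) for p, m, s in zip(props, means, sds)]
    total = sum(weighted)  # equals mixture_pdf(y, ...)
    return [w / total for w in weighted]


# Hypothetical two-component mixture (values chosen only for illustration).
props = [0.3, 0.7]
means = [0.0, 4.0]
sds = [1.0, 1.0]

# Posterior probabilities of component membership for the observation y = 1.0.
tau = posterior_probs(1.0, props, means, sds)
```

The posterior probabilities sum to one over the components, so `tau` gives a probabilistic (soft) allocation of the observation rather than an outright assignment.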

An outright partitioning of the observations into $g$ (nonoverlapping) clusters $C_1, \ldots, C_g$ is effected by assigning each observation to the component to which it has the highest estimated posterior probability of belonging. Thus the $i$th cluster $C_i$ contains those observations $y_j$ with $\hat{z}_{ij} = 1$, where $\hat{z}_{ij} = 1$ if $i = h$, and zero otherwise, with

$$h = \arg\max_{h} \hat{\tau}_h(y_j); \tag{3}$$

here $\hat{\tau}_i(y_j)$ is an estimate of $\tau_i(y_j)$. As the notation implies, $\hat{z}_{ij}$ can be viewed as an estimate of $z_{ij}$ which, under the assumption that the observations come from a mixture of $g$ groups $G_1, \ldots, G_g$, is defined to be one or zero according as the $j$th observation $y_j$ does or does not come from $G_i$ $(i = 1, \ldots, g;\ j = 1, \ldots, n)$.
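The posterior probabilities and the outright assignment rule can be illustrated with a minimal Python sketch. It assumes univariate normal component densities and parameters that have already been estimated; the function names (`normal_pdf`, `posterior_probs`, `hard_assign`) and all numerical values are illustrative assumptions, not taken from the original text.

```python
import math

def normal_pdf(y, mean, var):
    """Univariate normal density, standing in for a component density f_i(y)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_probs(y, props, means, variances):
    """Posterior probabilities pi_i f_i(y) / f(y) of component membership."""
    weighted = [p * normal_pdf(y, m, v) for p, m, v in zip(props, means, variances)]
    mixture_density = sum(weighted)  # the mixture density f(y)
    return [w / mixture_density for w in weighted]

def hard_assign(y, props, means, variances):
    """Index of the component with the highest posterior probability."""
    tau = posterior_probs(y, props, means, variances)
    return max(range(len(tau)), key=lambda i: tau[i])

# Hypothetical fitted values for a two-component mixture (assumed for illustration).
props, means, variances = [0.4, 0.6], [0.0, 5.0], [1.0, 1.0]
data = [-0.3, 0.8, 4.1, 5.6]
labels = [hard_assign(y, props, means, variances) for y in data]
```

The list `labels` holds, for each observation, the (zero-based) index of the cluster to which it is assigned; the posterior probabilities themselves give the probabilistic clustering.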

3 Maximum Likelihood Estimation

On specifying a parametric form $f_i(y_j; \theta_i)$ for each component density, we can fit this parametric mixture model

$$f(y_j; \Psi) = \sum_{i=1}^{g} \pi_i f_i(y_j; \theta_i) \tag{4}$$

by maximum likelihood (ML). Here $\Psi = (\omega^T, \pi_1, \ldots, \pi_{g-1})^T$ is the vector of unknown parameters, where $\omega$ consists of the elements of the $\theta_i$ known a priori to be distinct. In order to estimate $\Psi$ from the observed data, it must be identifiable. This will be so if the representation (4) is unique up to a permutation of the component labels. The maximum likelihood estimate (MLE) of $\Psi$, $\hat{\Psi}$, is given by an appropriate root of the likelihood equation,

$$\partial \log L(\Psi) / \partial \Psi = 0, \tag{5}$$

where $L(\Psi)$ denotes the likelihood function for $\Psi$,

$$L(\Psi) = \prod_{j=1}^{n} f(y_j; \Psi).$$

Solutions of (5) corresponding to local maximizers of $\log L(\Psi)$ can be obtained via the expectation-maximization (EM) algorithm of Dempster et al. (1977); see also McLachlan and Krishnan (1997). Let $\hat{\Psi}$ denote the estimate of $\Psi$ so obtained.
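As a rough illustration of how the EM algorithm climbs the log likelihood, the following Python sketch fits a univariate normal mixture with unrestricted component variances. The simulated data, the starting values, the fixed number of iterations and the function names are all assumptions made for this example; it is not the implementation used by the authors.

```python
import math
import random

def normal_pdf(y, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, props, means, variances):
    """Log likelihood: sum over observations of the log mixture density."""
    return sum(
        math.log(sum(p * normal_pdf(y, m, v) for p, m, v in zip(props, means, variances)))
        for y in data
    )

def em_step(data, props, means, variances):
    """One EM iteration for a univariate normal mixture."""
    n, g = len(data), len(props)
    # E-step: posterior probabilities of component membership.
    tau = []
    for y in data:
        weighted = [p * normal_pdf(y, m, v) for p, m, v in zip(props, means, variances)]
        total = sum(weighted)
        tau.append([w / total for w in weighted])
    # M-step: posterior-weighted proportions, means and variances.
    new_props, new_means, new_vars = [], [], []
    for i in range(g):
        n_i = sum(t[i] for t in tau)
        mean_i = sum(t[i] * y for t, y in zip(tau, data)) / n_i
        var_i = sum(t[i] * (y - mean_i) ** 2 for t, y in zip(tau, data)) / n_i
        new_props.append(n_i / n)
        new_means.append(mean_i)
        new_vars.append(var_i)
    return new_props, new_means, new_vars

# Simulated data from two well-separated groups (an assumption for this demo).
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(100)] + [random.gauss(5.0, 1.0) for _ in range(100)]

# Arbitrary starting values; in practice several starts would be tried.
props, means, variances = [0.5, 0.5], [min(data), max(data)], [1.0, 1.0]
history = [log_likelihood(data, props, means, variances)]
for _ in range(50):
    props, means, variances = em_step(data, props, means, variances)
    history.append(log_likelihood(data, props, means, variances))
```

The sequence `history` is nondecreasing, which reflects the monotonicity property of EM; it converges here to a local maximizer that depends on the starting values.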

4 Choice of Starting Values for the EM Algorithm

McLachlan and Peel (2000) provide an in-depth account of the fitting of finite mixture models. Briefly, with mixture models the likelihood typically will have multiple maxima; that is, the likelihood equation will have multiple roots. Thus the EM algorithm needs to be started from a variety of initial values for the parameter vector $\Psi$ or for a variety of initial partitions of the data into $g$ groups. The latter can be obtained by randomly dividing the data into $g$ groups corresponding to the $g$ components of the mixture model. With random starts, the central limit theorem implies that the initial component parameters tend to be similar to one another, at least in large samples. Nonrandom partitions of the data can be obtained via some clustering procedure such as $k$-means.

The choice of root of the likelihood equation in the case of homoscedastic normal components is straightforward in the sense that the ML estimate exists as the global maximizer of the likelihood function. The situation is less straightforward in the case of heteroscedastic normal components as the likelihood function is unbounded. Usually, the intent is to choose as the ML estimate of the parameter vector $\Psi$ the local maximizer corresponding to the largest of the local maxima located. But in practice, consideration has to be given to the problem of relatively large local maxima that occur as a consequence of a fitted component having a very small (but nonzero) variance for univariate data or generalized variance (the determinant of the covariance matrix) for multivariate data. Such a component corresponds to a cluster containing a few data points either relatively close together or almost lying in a lower-dimensional subspace in the case of multivariate data. There is thus a need to monitor the relative size of the fitted mixing proportions and of the component variances for univariate observations, or of the generalized component variances for multivariate data, in an attempt to identify these spurious local maximizers.
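The monitoring described above could be automated along the following lines. This Python sketch covers the univariate case only; the thresholds `min_prop` and `min_var_ratio`, the function name `flag_spurious` and the two fitted solutions are arbitrary assumptions chosen for illustration, not recommendations from the source.

```python
def flag_spurious(props, variances, min_prop=0.02, min_var_ratio=1e-3):
    """Return the indices of fitted components that look spurious.

    A component is flagged when its mixing proportion is very small or when
    its variance is tiny relative to the largest fitted component variance.
    Both thresholds are illustrative assumptions.
    """
    largest_var = max(variances)
    flagged = []
    for i, (p, v) in enumerate(zip(props, variances)):
        if p < min_prop or v / largest_var < min_var_ratio:
            flagged.append(i)
    return flagged

# Hypothetical fitted solutions obtained from two different EM starts.
solution_a = {"props": [0.45, 0.55], "variances": [1.1, 0.9]}
solution_b = {"props": [0.985, 0.015], "variances": [6.8, 0.0004]}

flags_a = flag_spurious(**solution_a)  # no component flagged
flags_b = flag_spurious(**solution_b)  # second component flagged
```

A solution such as `solution_b` may have a larger likelihood than `solution_a`, yet its second component fits only a few nearly coincident points, so it would be set aside when choosing among the local maximizers.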

5 Clustering via Normal Mixtures

Frequently, in practice, the clusters in the data are essentially elliptical, so that it is reasonable to consider fitting mixtures of elliptically symmetric component densities. Within this class of component densities, the multivariate normal density is a convenient choice given its computational tractability.

Under the assumption of multivariate normal components, the $i$th component-conditional density $f_i(y; \theta_i)$ is given by

$$f_i(y; \theta_i) = \phi(y; \mu_i, \Sigma_i), \tag{6}$$

where $\theta_i$ consists of the elements of $\mu_i$ and the $\tfrac{1}{2}p(p+1)$ distinct elements of $\Sigma_i$ $(i = 1, \ldots, g)$. Here

$$\phi(y; \mu_i, \Sigma_i) = (2\pi)^{-p/2} |\Sigma_i|^{-1/2} \exp\{-\tfrac{1}{2}(y - \mu_i)^T \Sigma_i^{-1} (y - \mu_i)\}. \tag{7}$$

One attractive feature of adopting mixture models with elliptically symmetric components, such as the normal or $t$-densities, is that the implied clustering is invariant under affine transformations of the data; that is, invariant under transformations of the feature vector $y$ of the form

$$y \rightarrow Cy + a, \tag{8}$$

where $C$ is a nonsingular matrix. If the clustering of a procedure is invariant under (8) for only diagonal $C$, then it is invariant under change of measuring units but not rotations.
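The invariance can be checked directly for normal components. As a sketch, write an affine transformation as $y^* = Cy + a$ with $C$ nonsingular (the vector $a$ and the starred symbols are notation introduced here for illustration), and suppose the fitted parameters transform accordingly, $\mu_i^* = C\mu_i + a$ and $\Sigma_i^* = C \Sigma_i C^T$, as ML estimates with unrestricted covariance matrices do. Then

$$\begin{aligned}
(y^* - \mu_i^*)^T (\Sigma_i^*)^{-1} (y^* - \mu_i^*)
  &= (y - \mu_i)^T C^T (C \Sigma_i C^T)^{-1} C (y - \mu_i)
   = (y - \mu_i)^T \Sigma_i^{-1} (y - \mu_i), \\
|\Sigma_i^*|^{-1/2} &= |\det C|^{-1} \, |\Sigma_i|^{-1/2}, \\
\phi(y^*; \mu_i^*, \Sigma_i^*) &= |\det C|^{-1} \, \phi(y; \mu_i, \Sigma_i).
\end{aligned}$$

The factor $|\det C|^{-1}$ is the same for every component, so it cancels in the ratio of $\pi_i f_i$ to the mixture density that defines the posterior probabilities of component membership, and the implied clustering is unchanged.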

It can be seen from (7) that the mixture model with unrestricted component-covariance matrices in its normal component distributions is a highly parameterized one with $\tfrac{1}{2}p(p+1)$ parameters for each component-covariance matrix $\Sigma_i$ $(i = 1, \ldots, g)$. As an alternative to taking the component-covariance matrices to be the same or diagonal, we can adopt some model for the component-covariance matrices that is intermediate between homoscedasticity and the unrestricted model, as in the approach of Banfield and Raftery (1993). They introduced a parameterization of the component-covariance matrix $\Sigma_i$ based on a variant of the standard spectral decomposition of $\Sigma_i$.

The mixture model with normal components (7) is sensitive to outliers since it adopts the multivariate normal family for the distributions of the errors. An obvious way to improve the robustness of this model for data which have longer tails than the normal or atypical observations is to consider using the multivariate $t$-family of elliptically symmetric distributions; see McLachlan and Peel (1998, 2000). The $t$-family has an additional parameter, called the degrees of freedom, that controls the length of the tails of the distribution. Although the number of outliers needed for breakdown is almost the same as with the normal distribution, the outliers have to be much larger.

6 Factor Analysis Model for Dimension Reduction

As remarked earlier, the $g$-component normal mixture model with unrestricted component-covariance matrices is a highly parameterized model with $\tfrac{1}{2}p(p+1)$ parameters for each component-covariance matrix $\Sigma_i$ $(i = 1, \ldots, g)$. As discussed above, Banfield and Raftery (1993) introduced a parameterization of the component-covariance matrix $\Sigma_i$ based on a variant of the standard spectral decomposition of $\Sigma_i$ $(i = 1, \ldots, g)$. However, if $p$ is large relative to the sample size $n$, it may not be possible to use this decomposition to infer an appropriate model for the component-covariance matrices. Even if it is possible, the results may not be reliable due to potential problems with near-singular estimates of the component-covariance matrices when $p$ is large relative to $n$.

A common approach to reducing the number of dimensions is to perform a principal component analysis (PCA). But as is well known, projections of the feature data $y_j$ onto the first few principal axes are not always useful in portraying the group structure. A global nonlinear approach to dimension reduction can be obtained by postulating a finite mixture of linear submodels for the distribution of the full observation vector $Y_j$ given the (unobservable) factors; see Hinton et al. (1997), McLachlan and Peel (2000), and McLachlan et al. (2003). The mixture of factor analyzers model is given by

$$f(y_j; \Psi) = \sum_{i=1}^{g} \pi_i \phi(y_j; \mu_i, \Sigma_i), \tag{9}$$

where the $i$th component-covariance matrix $\Sigma_i$ has the form

$$\Sigma_i = B_i B_i^T + D_i \quad (i = 1, \ldots, g) \tag{10}$$

and where $B_i$ is a $p \times q$ matrix of factor loadings and $D_i$ is a diagonal matrix $(i = 1, \ldots, g)$. The parameter vector $\Psi$ now consists of the mixing proportions $\pi_i$ and the elements of the $\mu_i$, the $B_i$, and the $D_i$. With this approach, the number of free parameters is controlled through the dimension of the latent factor space. By working in this reduced space, it allows a model for each component-covariance matrix with complexity lying between that of the isotropic and full covariance structure models without any restrictions on the covariance matrices. The mixture of factor analyzers model can be fitted by using the alternating expectation–conditional maximization (AECM) algorithm of Meng and van Dyk (1997).
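To make the saving in parameters concrete, the following Python sketch counts covariance parameters. It assumes the usual count of $pq + p - \tfrac{1}{2}q(q-1)$ free parameters for a component-covariance matrix built from a $p \times q$ loading matrix plus a diagonal matrix, where the last term allows for the rotational indeterminacy of the loadings; the function names and the example dimensions are illustrative assumptions.

```python
def full_cov_params(p):
    """Distinct elements of an unrestricted p x p covariance matrix."""
    return p * (p + 1) // 2

def factor_cov_params(p, q):
    """Free parameters in B B^T + D with B of size p x q and D diagonal.

    p*q loadings plus p diagonal terms, less q(q-1)/2 for the
    rotational indeterminacy of the loadings.
    """
    return p * q + p - q * (q - 1) // 2

# Hypothetical setting: p = 50 variables, g = 3 components.
p, g = 50, 3
for q in (1, 2, 5):
    print(q, g * factor_cov_params(p, q), g * full_cov_params(p))
```

With these assumed dimensions the unrestricted model needs 1275 covariance parameters per component, against 149 for a two-factor model.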

A formal test for the number of factors can be undertaken using the likelihood ratio $\lambda$, as regularity conditions hold for this test conducted at a given value for the number of components $g$. For testing the null hypothesis $H_0: q = q_0$ vs. the alternative $H_1: q = q_0 + 1$, the statistic $-2\log\lambda$ is asymptotically chi-squared with $d = g(p - q_0)$ degrees of freedom. However, in situations where $n$ is not large relative to the number of unknown parameters, we prefer the use of the BIC criterion. Applied in this context, it means that twice the increase in the log likelihood $(-2\log\lambda)$ has to be greater than $d \log n$ for the null hypothesis to be rejected.
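The BIC-based decision rule is simple enough to state in a few lines of Python. In this sketch the maximized log likelihoods are hypothetical numbers, and the function names are assumptions introduced for illustration.

```python
import math

def lrt_degrees_of_freedom(g, p, q0):
    """Degrees of freedom d = g(p - q0) for testing q = q0 against q = q0 + 1."""
    return g * (p - q0)

def bic_rejects_null(loglik_q0, loglik_q1, g, p, q0, n):
    """True if twice the increase in log likelihood exceeds d log n."""
    d = lrt_degrees_of_freedom(g, p, q0)
    statistic = 2.0 * (loglik_q1 - loglik_q0)  # equals -2 log(lambda)
    return statistic > d * math.log(n)

# Hypothetical fits with g = 2 components, p = 10 variables, n = 100 observations.
# Here d = 18 and d log n is roughly 82.9.
reject_large_gain = bic_rejects_null(-1500.0, -1450.0, g=2, p=10, q0=1, n=100)
reject_small_gain = bic_rejects_null(-1500.0, -1470.0, g=2, p=10, q0=1, n=100)
```

In the first case the statistic is 100, which exceeds the threshold, so the extra factor is retained; in the second it is 60, so the null hypothesis $q = q_0$ is kept.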

The mixture of factor analyzers model is sensitive to outliers since it uses normal errors and factors. Recently, McLachlan et al. (2007) have considered the use of mixtures of $t$ analyzers in an attempt to make the model less sensitive to outliers.

7 Some Recent Extensions for High-Dimensional Data

The EMMIX-GENE program of McLachlan et al. (2002) has been designed for the normal mixture model-based clustering of a limited number of observations that may be of extremely high dimension. It was called EMMIX-GENE as it was designed specifically for problems in bioinformatics that require the clustering of a relatively small number of tissue samples containing the expression levels of possibly thousands of genes. But it is applicable to clustering problems outside the field of bioinformatics involving high-dimensional data. In situations where the dimension $p$ is very large relative to the sample size $n$, it might not be practical to fit mixtures of factor analyzers to data on all the variables, as it would involve a considerable amount of computation time. Thus initially some of the variables may have to be removed. Indeed, the simultaneous use of too many variables in the cluster analysis may serve only to create noise that masks the effect of a smaller number of variables. Also, the intent of the cluster analysis may not be to produce a clustering of the observations on the basis of all the available variables, but rather to discover and study different clusterings of the observations corresponding to different subsets of the variables; see, for example, Soffritti (2003) and Galimberti and Soffritti (2007).

Therefore, the EMMIX-GENE procedure has two optional steps before the final step of clustering the observations. The first step considers the selection of a subset of relevant variables from the available set of variables by screening the variables on an individual basis to eliminate those which are of little use in clustering the observations. The usefulness of a given variable to the clustering process can be assessed formally by a test of the null hypothesis that it has a single component normal distribution over the observations. A faster but ad hoc way is to make this decision on the basis of the interquartile range. Even after this step has been completed, there may still remain too many variables. Thus there is a second step in EMMIX-GENE in which the retained variables are clustered (after standardization) into a number of groups on the basis of Euclidean distance so that variables with similar profiles are put into the same group. In general, care has to be taken with the scaling of variables before clustering of the observations, as the nature of the variables can be intrinsically different. Also, as noted above, the clustering of the observations via normal mixture models is invariant under changes in scale and location. The clustering of the observations can be carried out on the basis of the groups considered individually (using some or all of the variables within a group) or collectively. For the latter, we can replace each group by a representative (a metavariable) such as the sample mean as in the EMMIX-GENE procedure.
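The ad hoc screening and the formation of a metavariable can be sketched in Python as follows. This is not the EMMIX-GENE implementation: the interquartile-range threshold, the toy data, the function names and the grouping of the retained variables (taken here as given rather than obtained by clustering) are all assumptions for illustration.

```python
import statistics

def iqr(values):
    """Interquartile range, using the default method of statistics.quantiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def screen_by_iqr(columns, threshold):
    """Indices of the variables whose interquartile range exceeds the threshold."""
    return [k for k, col in enumerate(columns) if iqr(col) > threshold]

def standardize(col):
    """Centre a variable and scale it to unit (population) standard deviation."""
    mean, sd = statistics.fmean(col), statistics.pstdev(col)
    return [(x - mean) / sd for x in col]

def metavariable(columns, group):
    """Sample mean, observation by observation, of the variables in one group."""
    n = len(columns[group[0]])
    return [statistics.fmean([columns[k][j] for k in group]) for j in range(n)]

# Toy data: three variables measured on five observations (values assumed).
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [5.0, 5.1, 4.9, 5.0, 5.0],    # almost constant, so of little use in clustering
    [2.0, 4.0, 6.0, 8.0, 10.0],
]
kept = screen_by_iqr(columns, threshold=0.5)
scaled = {k: standardize(columns[k]) for k in kept}
# Assume the retained variables fall into a single group of similar profiles.
meta = metavariable(scaled, kept)
```

After standardization the two retained variables have identical profiles, so the metavariable coincides with each of them; with real data it would average over similar but not identical profiles.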

8 Mixtures of Normal Components with Random Effects

Up to now, we have considered the clustering of data on entities under two assumptions that are commonly adopted in practice; namely:

(a) There are no replications on any particular entity specifically identified as such.
(b) All the observations on the entities are independent of one another.

These assumptions should hold for the clustering of, say, tissue samples consisting of the expression levels of many (possibly thousands of) genes, although the tissue samples have been known to be correlated for different tissues due to flawed experimental conditions. However, condition (b) will not hold for the clustering of gene profiles, since not all the genes are independently distributed, and condition (a) will generally not hold either as the gene profiles may be measured over time or on technical replicates. While this correlated structure can be incorporated into the normal mixture model (9) by appropriate specification of the component-covariance matrices $\Sigma_i$, it is difficult to fit the model under such specifications. For example, the M-step may not exist in closed form.

Accordingly, Ng et al. (2006) have developed the procedure called EMMIX-WIRE (EM-based MIXture analysis With Random Effects) to handle the clustering of correlated data that may be replicated. They adopted conditionally a mixture of linear mixed models to specify the correlation structure between the variables and to allow for correlations among the observations. It also enables covariate information to be incorporated into the clustering process.

To formulate this procedure, we consider the clustering of $n$ gene profiles $y_j$ $(j = 1, \ldots, n)$, where we let $y_j = (y_{1j}^T, \ldots, y_{mj}^T)^T$ contain the expression values for the $j$th gene profile and

$$y_{tj} = (y_{1tj}, \ldots, y_{r_t tj})^T \quad (t = 1, \ldots, m)$$

contains the $r_t$ replicated values in the $t$th biological sample $(t = 1, \ldots, m)$ on the $j$th gene. The dimension $p$ of $y_j$ is given by $p = \sum_{t=1}^{m} r_t$.

With the EMMIX-WIRE procedure, the observed $p$-dimensional vectors $y_1, \ldots, y_n$ are assumed to have come from a mixture of a finite number, say $g$, of components in some unknown proportions $\pi_1, \ldots, \pi_g$, which sum to one. Conditional on its membership of the $i$th component of the mixture, the profile vector $y_j$ for the $j$th gene $(j = 1, \ldots, n)$ follows the model

$$y_j = X\beta_i + U b_{ij} + V c_i + \varepsilon_{ij}, \tag{11}$$

where the elements of $\beta_i$ are fixed effects (unknown constants) modelling the conditional mean of $y_j$ in the $i$th component $(i = 1, \ldots, g)$. In (11), $b_{ij}$ (a $q_b$-dimensional vector) and $c_i$ (a $q_c$-dimensional vector) represent the unobservable gene- and tissue-specific random effects, respectively. These random effects represent the variation due to the heterogeneity of genes and samples (corresponding to $b_i = (b_{i1}^T, \ldots, b_{in}^T)^T$ and $c_i$, respectively). The random effects $b_i$ and $c_i$, and the measurement error vector $(\varepsilon_{i1}^T, \ldots, \varepsilon_{in}^T)^T$ are assumed to be mutually independent. The matrices $X$, $U$, and $V$ are known design matrices for the corresponding fixed or random effects. The presence of the random effect $c_i$ for the expression levels of genes in the $i$th component induces a correlation between the profiles of genes within the same cluster.

With the linear mixed model (LMM), the distributions of $b_{ij}$ and $c_i$ are taken, respectively, to be multivariate normal $N_{q_b}(0, H_i)$ and $N_{q_c}(0, \theta_{ci} I_{q_c})$, where $H_i$ is a $q_b \times q_b$ covariance matrix and $I_{q_c}$ is the $q_c \times q_c$ identity matrix. The measurement error vector $\varepsilon_{ij}$ is also taken to be multivariate normal $N_p(0, A_i)$, where $A_i = \mathrm{diag}(W\phi_i)$ is a diagonal matrix constructed from the vector $(W\phi_i)$ with $\phi_i = (\sigma_{i1}^2, \ldots, \sigma_{iq_e}^2)^T$ and $W$ a known $p \times q_e$ zero-one design matrix.
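The correlation induced by the shared tissue-specific effect can be made explicit. The following is a short derivation sketch, not taken from the source. It writes $H_i$ for the covariance matrix of the gene-specific effect $b_{ij}$, $\theta_{ci}$ for the common variance of the elements of $c_i$, and $A_i$ for the covariance matrix of the measurement error, and it assumes that the gene-specific effects and the errors are independent across genes. For two distinct genes $j \neq k$ in the $i$th component,

$$\begin{aligned}
\operatorname{cov}(y_j) &= U H_i U^T + \theta_{ci} V V^T + A_i, \\
\operatorname{cov}(y_j, y_k) &= \theta_{ci} V V^T \quad (j \neq k).
\end{aligned}$$

Under these assumptions, genes in the same cluster are marginally correlated only through the tissue-specific term, and their profiles are independent once $c_i$ is conditioned on, which is consistent with fitting the model conditionally on the $c_i$.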

We let $\Psi = (\psi_1^T, \ldots, \psi_g^T, \pi_1, \ldots, \pi_{g-1})^T$ be the vector of all the unknown parameters, where $\psi_i$ is the vector containing the unknown parameters $\beta_i$, the distinct elements of $H_i$, $\theta_{ci}$, and $\phi_i$ of the $i$th component density $(i = 1, \ldots, g)$. The estimation of $\Psi$ can be obtained by the ML approach via the EM algorithm, proceeding conditionally on the tissue-specific random effects $c_i$ as formulated in Ng et al. (2006). The E- and M-steps can be implemented in closed form. In particular, an approximation to the E-step by carrying out time-consuming Monte Carlo methods is not required. A probabilistic or an outright clustering of the genes into $g$ components can be obtained, based on the estimated posterior probabilities of component membership given the profile vectors and the estimated tissue-specific random effects $\hat{c}_i$ for $i = 1, \ldots, g$; see Ng et al. (2006).

References

Banfield, J., & Raftery, A. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803–821.

Day, N. (1969). Estimating the components of a mixture of two normal distributions. Biometrika, 56, 463–474.

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B, 39, 1–38.

Galimberti, G., & Soffritti, G. (2007). Model-based methods for identifying multiple cluster structures in a data set. Computational Statistics and Data Analysis, 52, 520–536.

Hinton, G., Dayan, P., & Revow, M. (1997). Modeling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks, 8, 65–73.

McLachlan, G., Bean, R., & Ben-Tovim Jones, L. (2007). Extension of the mixture of factor analyzers model to incorporate the multivariate t distribution. Computational Statistics and Data Analysis, 51, 5327–5338.

McLachlan, G., Bean, R., & Peel, D. (2002). A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18, 413–422.

McLachlan, G., & Krishnan, T. (1997). The EM algorithm and extensions. New York: Wiley.

McLachlan, G., & Peel, D. (1998). Robust cluster analysis via mixtures of multivariate t-distributions. In: A. Amin, D. Dori, P. Pudil, & H. Freeman (Eds.), Lecture notes in computer science (Vol. 1451, pp. 658–666). Berlin: Springer.

McLachlan, G., & Peel, D. (2000). Finite mixture models. New York: Wiley.

McLachlan, G., Peel, D., & Bean, R. (2003). Modelling high-dimensional data by mixtures of factor analyzers. Computational Statistics and Data Analysis, 41, 379–388.

Meng, X., & van Dyk, D. (1997). The EM algorithm – an old folk song sung to a fast new tune (with discussion). Journal of the Royal Statistical Society B, 59, 511–567.

Ng, S., McLachlan, G., Wang, K., Ben-Tovim Jones, L., & Ng, S. (2006). A mixture model with random-effects components for clustering correlated gene-expression profiles. Bioinformatics, 22, 1745–1752.

Scott, A., & Symons, M. (1971). Clustering methods based on likelihood ratio criteria. Biometrics, 27, 387–397.

Soffritti, G. (2003). Identifying multiple cluster structures in a data matrix. Communications in Statistics – Simulation and Computation, 32, 1151–1177.

Wolfe, J. (1965). A computer program for the computation of maximum likelihood analysis of types (Technical Report SRM 65-112). US Naval Personnel Research Activity, San Diego.


Alfredo Rizzi

Abstract In this note, after recalling certain results regarding prime numbers, we will present the following theorem of interest to cryptography: Let two discrete s.v.'s (statistical variables) $X$, $Y$ assume the values $0, 1, 2, \ldots, m-1$. Let $X$ be uniformly distributed, that is, it assumes the value $i$ $(i = 0, 1, \ldots, m-1)$ with probability $1/m$, and let the second s.v. $Y$ assume the value $i$ with probability $p_i$ $(\sum_{i=0}^{m-1} p_i = 1,\ p_i \geq 0)$. If the s.v. $Z = X + Y \pmod{m}$ is uniformly distributed and $m$ is a prime number, at least one of the two s.v.'s $X$ and $Y$ is uniformly distributed.

1 Introduction

In today’s world the need to protect vocal and written communication between individuals, institutions, entities and commercial agencies is ever present and growing. Digital communication has, in part, been integrated into our social life. For many, the day begins with the perusal of e-mail and the tedious task of eliminating spam and other messages we do not consider worthy of our attention. We turn to the internet to read newspaper articles, to see what’s on at the cinema, to check flight arrivals, the telephone book, the state of our checking account and stock holdings, to send and receive money transfers, to shop on line, for students’ research and for many other reasons. But the digital society must adequately protect communication from intruders, whether persons or institutions, which attack our privacy. Cryptography (from κρυπτός, hidden), the study and creation of secret writing systems in numbers or codes, is essential to the development of digital communication that is private in the sense that it cannot be read by anyone to whom it is not addressed. Cryptography seeks to study and create systems for ciphering and to verify and authenticate the integrity of data. One must make the distinction between

A. Rizzi

Dipartimento di Statistica, Probabilità e Statistiche Applicate, Università di Roma “La Sapienza”, P.le A. Moro, 5 - 00185 Roma
e-mail: [email protected]

F. Palumbo et al. (eds.), Data Analysis and Classification, Studies in Classification, Data Analysis, and Knowledge Organization, DOI 10.1007/978-3-642-03739-9_2, © Springer-Verlag Berlin Heidelberg 2010