Facing online challenges using learning classifier systems


DOCTORAL THESIS

Title: Facing Online Challenges Using Learning Classifier Systems

By

Andres Sancho Asensio

at the Centre

Escola Tècnica Superior d’Enginyeria Electrònica i Informàtica La Salle

and in the Department

Computer Science

Supervised by

Dr. Elisabet Golobardes i Ribé

Dr. Jorge Casillas Barranquero

C.I.F. G: 59069740 Universitat Ramon Llull Fundació Privada. Rgtre. Fund. Generalitat de Catalunya núm. 472 (28-02-90). C. Claravall, 1-3, 08022 Barcelona. Tel. 936 022 200, Fax 936 022 249. E-mail: [email protected], www.url.es


This Doctoral Thesis was defended on the day __________ of __________

at the Centre _______________________________________________________________

of Ramon Llull University

before the Examination Committee formed by the undersigned Doctors, having obtained the grade:

President

_______________________________

Member

_______________________________

Member

_______________________________

Member

_______________________________

Secretary

_______________________________

Doctoral candidate


FACING ONLINE CHALLENGES USING LEARNING CLASSIFIER SYSTEMS

By

Andreu Sancho-Asensio

Ramon Llull University & Granada University for the degree of

Doctor of Philosophy

Grup de Recerca en Sistemes Intel·ligents

Supervisors:

Dr. Jorge Casillas Barranquero

Dr. Elisabet Golobardes i Ribé


Recent advances in machine learning have fostered the design of competent algorithms that are able to learn and extract novel and useful information from data. Some of these techniques have recently been applied successfully to real-world problems in distinct technological, scientific and industrial areas; problems that the traditional engineering methodology of analysis could not handle, either because of their inherent complexity or because of the huge volumes of data involved. Following the initial success of these pioneers, current machine learning systems face harder problems that hamper the learning process of such algorithms, which has raised practitioners' interest in designing systems that tackle real-world problems scalably and efficiently.

One of the most appealing machine learning paradigms is that of Learning Classifier Systems (LCSs), and more specifically Michigan-style LCSs, an open framework that combines an apportionment of credit mechanism with a knowledge discovery technique inspired by biological processes to evolve its internal knowledge. In this regard, LCSs mimic human experts by making use of rule lists to choose the best action for a given problem situation, acquiring their knowledge through experience. LCSs have been applied with relative success to a wide set of real-world problems such as cancer prediction or business support systems, among many others. Furthermore, in some of these areas LCSs have demonstrated a learning capacity that exceeds that of human experts for that particular task.

The purpose of this thesis is to explore the online learning nature of Michigan-style LCSs for mining large amounts of data in the form of continuous, high-speed and time-changing streams of information. Extracting knowledge from these data is most often key to gaining a better understanding of the processes that the data describe. Learning from these data poses new challenges to traditional machine learning techniques, which are not typically designed to deal with data in which concepts and noise levels may vary over time. The contribution of this thesis takes the extended classifier system (XCS), the most studied Michigan-style LCS and one of the most competent machine learning algorithms, as its starting point. The challenges addressed in this thesis are twofold. The first challenge is to build a competent supervised system, guided by Michigan-style LCSs, that learns from data streams and reacts quickly to concept changes and noisy inputs. As many scientific and industrial applications generate vast amounts of unlabeled data, the second challenge is to apply the lessons learned from the first to the design of unsupervised Michigan-style LCSs that handle online problems without assuming any a priori structure in the input data.


The great advances in the field of machine learning have resulted in the design of competent machines that are able to learn and to extract useful and novel information from experience. Recently, some of these learning techniques have been successfully applied to solve real-world problems in technological, medical, scientific and industrial areas, problems that could not be handled with conventional analysis techniques either because of their complexity or because of the large volume of data to be processed. Given this initial success, learning systems currently face problems of higher complexity, which has led to an increase in research activity around systems capable of tackling new real-world problems efficiently and scalably.

One of the most promising families of algorithms in machine learning is that of the genetics-based classifier systems (LCSs), whose operation is inspired by nature. LCSs try to represent the action policies of human experts with a set of rules that are used to choose the best actions to perform at every moment. Thus, these systems learn action policies incrementally as they acquire experience through the new information that is presented to them over time. LCSs have been successfully applied to fields as diverse as prostate cancer prediction or stock market investment support, among others. Moreover, in some cases LCSs have been shown to perform tasks with a precision exceeding that of human beings.

The purpose of this thesis is to explore the nature of online learning in Michigan-style LCSs for mining large amounts of data in the form of continuous, high-speed and time-changing streams of information. Very often, extracting knowledge from these data sources is key to gaining a better understanding of the processes that the data describe. Learning from these data therefore poses new challenges to traditional machine learning techniques, which are not designed to deal with continuous data streams in which concepts and noise levels may vary arbitrarily over time. The contribution of this thesis takes the eXtended Classifier System (XCS), the most studied Michigan-style LCS and one of the most competent machine learning algorithms, as its starting point. The challenges addressed in this thesis are thus twofold: the first challenge is to build a competent supervised system on the Michigan-style LCS framework that learns from data streams with a fast reaction capacity to concept changes and noisy inputs. As many scientific and industrial applications generate large amounts of unlabeled data, the second challenge is to apply the lessons learned to continue with the design of Michigan-style LCSs capable of solving online problems without assuming an a priori structure in the input data.


The great advances in the field of machine learning have resulted in the design of machines that are able to learn and to extract useful and novel information from experience. Recently, some of these learning techniques have been successfully applied to solve real-world problems in technological, medical, scientific and industrial areas, problems that could not be handled with conventional analysis techniques either because of their complexity or because of the large volume of data to be processed. Given this initial success, machine learning systems currently face problems of ever higher complexity, which has led to an increase in research activity on systems capable of tackling new real-world problems efficiently and scalably.

One of the most promising families within machine learning is that of the genetics-based classifier systems (LCSs), whose operation is inspired by nature. LCSs try to represent the action policies of human experts using sets of rules that are employed to choose the best actions to perform at every moment. Thus, these systems learn action policies incrementally as they acquire experience through the new information that is presented to them. LCSs have been successfully applied in fields as diverse as prostate cancer prediction or stock market support systems, among others. Moreover, in some cases LCSs have been shown to perform tasks with a precision exceeding that of human experts.

The purpose of this thesis is to explore the online nature of the learning employed by Michigan-style LCSs for mining large amounts of data in the form of continuous, high-speed and time-changing streams of information. Extracting knowledge from these data sources is key to gaining a better understanding of the processes being described. Learning from these data therefore poses new challenges to traditional techniques, which are not designed to deal with continuous data streams in which concepts and noise levels may vary arbitrarily over time. The contribution of this thesis takes the eXtended Classifier System (XCS), the most studied Michigan-style LCS and one of the most competent machine learning systems, as its starting point. The challenges addressed in this thesis are thus twofold: the first challenge is to build a competent supervised system on the Michigan-style LCS framework that learns from data streams with a fast reaction capacity to concept changes and to noise. As many scientific and industrial applications generate large volumes of unlabeled data, the second challenge is to apply the lessons learned to continue with the design of new Michigan-style LCSs capable of solving online problems without assuming an a priori structure in the input data.


The present thesis is the result of three years of painstaking work, and it would not have been possible without the guidance of the many individuals who contributed to the completion of this dissertation.

First, I would like to thank Elisabet Golobardes and Jorge Casillas for their valuable support and guidance as supervisors. I owe my deepest gratitude and respect to Albert Orriols, who introduced me to the exciting world of research. Without his help and patience I would certainly not have come this far. I am also grateful to the Research Group in Intelligent Systems¹ (GRSI) and the Soft Computing and Intelligent Information Systems² (SCI2S) group for letting me grow as a researcher.

I would like to show my gratitude to the Interdisciplinary Computing and Complex Systems³ (I(CO)2S) group for giving me the pleasure of visiting them, and especially to Jaume Bacardit for hosting me and for his valuable support and guidance.

The present work is the result of collaboration with many researchers from distinct areas. In this regard, I would like to thank Joan Navarro, Álvaro García, Germán Terrazas, Salvador García, María Martínez, Núria Macià, Nunzia Lopiccolo, Xaver Solé, Isaac Triguero, María Franco, Agustín Zaballos, Albert Fornells, José Enrique Armendáriz, José Antonio Moral, Xavier Vilasís, Rosa Sanz, Miquel Beltrán, Rubén Nicolás, Francesc Xavier Babot, Francesc Teixidó, Joan Camps, Xavier Canaleta, David Vernet, Joaquim Rios, Carles Garriga, and all the others I forgot. Thank you!

Last, but not least, I would like to thank all my family and friends for the unconditional support they have given me over this time. In particular, I want to thank my parents Andreu and María Isabel, and my brother Sergi. I would also like to especially thank María José for her great support despite the distance.

¹ http://salleurl.edu/GRSI

² http://sci2s.ugr.es

³ http://icos.cs.nott.ac.uk

The research done in this thesis is framed within the graduate program in Information Technology and Management of La Salle at Ramon Llull University. It has been developed in the GRSI, a research group created in 1994 and recognised as a consolidated research group by the Government of Catalonia since 2002 (2002-SGR-00155, 2005-SGR-00302, and 2009-SGR-183). During the development of my PhD thesis I have had the opportunity to relate my research to three projects, two national and one European, which are detailed below:


1. KEEL-III: Knowledge Discovery based on Evolutionary Learning: Current Trends and New Challenges (TIN2008-06681-C06-05). Focused on knowledge extraction from data using evolutionary algorithms, KEEL-III aims (1) to continue the development of the KEEL software tool, (2) to continue developing and improving evolutionary learning models and adapting them to specific contexts associated with current trends in knowledge extraction based on evolutionary learning, (3) to develop studies on new challenges in knowledge extraction, and (4) to characterise specific real problems and the applicability of evolutionary learning algorithms.

2. INTEGRIS: Intelligent Electrical Grid Sensor Communications (FP7-ICT-ENERGY-2009-1). INTEGRIS proposes the development of a novel and flexible ICT infrastructure based on a hybrid Power Line Communication-wireless integrated communications system able to completely and efficiently fulfil the communications requirements foreseen for the Smart Electricity Networks of the future.

3. PATRICIA: Pain and Anxiety Treatment based on social Robot Interaction with Children to Improve pAtient experience (TIN2012-38416-C03-01). A major focus of children’s quality-of-life programs in hospitals is improving their experiences during procedures. In anticipation of treatment, children may become anxious, and pain appears during procedures. The challenge of the coordinated project is to design pioneering techniques based on the use of social robots to improve the patient experience by eliminating or minimising pain and anxiety. In line with this challenge, this research aims to design and develop specific human-social robot interaction with pet robots. Robot interactive behaviour will be designed based on modular skills using soft-computing paradigms.

My research has been co-supervised by Dr. Jorge Casillas and Dr. Elisabet Golobardes i Ribé, and it has been supported by the Generalitat de Catalunya, the Commission for Universities and Research of the DIUE and the European Social Fund under the FI grant (with references 2011FI B 01028, 2012FI B1 00158, and 2013FI B2 00089). This thesis would not have been possible without the financial support of the Departament d’Universitats, Recerca i Societat de la Informació (DURSI) and the European Social Fund (ESF) under a scholarship in the FI research program.


Abstract

Resum

Resumen

Acknowledgements

List of Figures

List of Tables

1 Introduction

1.1 Thesis Scope

1.2 Thesis Objectives and Contributions

1.3 Roadmap

2 Theoretical Background

2.1 Machine Learning, a Brief Tour

2.1.1 Supervised Learning

2.1.2 Unsupervised Learning

2.1.3 Reinforcement Learning

2.1.4 Offline and Online Learning

2.2 Data Streams

2.3 Nature-inspired Learning Algorithms

2.3.1 Genetic Algorithms

2.3.2 The Theory behind GA Design

2.4 Learning Classifier Systems, a Quick Survey

2.4.1 Michigan-style LCSs

2.4.2 Pittsburgh-style LCSs

2.4.3 Iterative Rule Learning

2.4.4 Genetic Cooperative-Competitive Learning

2.5 Summary

3 The Michigan-style LCS Framework Through XCS

3.1 A Concise XCS Overview

3.1.1 XCS Knowledge Representation

3.1.2 XCS Learning Organisation

3.1.3 XCS Action Inference in Test Phase

3.1.4 Limitations and Further Improvements to XCS

3.1.5 Theoretical Insights on Why XCS Works

3.2 Specializing XCS for Supervised Tasks: UCS

3.2.1 Knowledge Representation in UCS

3.2.2 Learning Organization in UCS

3.2.3 UCS Class Inference in Test Phase

3.3 Summary and Conclusions

4 Supervised Algorithms for Data Streams

4.1 Supervised Learning from Data Streams

4.2 Related Work

4.3 Description of SNCS

4.3.1 Knowledge Representation

4.3.2 Interaction with the Environment

4.3.3 Classifier Evaluation

4.3.4 Evolutive Component

4.3.5 Inference System

4.3.6 Algorithm Complexity

4.4 Experiments on Data Stream Problems

4.4.1 The Rotating Hyperplane Problem

4.4.2 The SEA Problem

4.4.3 The SEA Problem with Varying Noise Levels

4.4.4 The SEA Problem with Virtual Drifts

4.4.5 The SEA Problem with Padding Variables under High Dimensional Spaces

4.4.6 The SEA Problem with Non-Linearities

4.4.7 Methodology of Experimentation

4.4.8 Analysis of the Results

4.4.9 Summary and Discussion

4.5 Experiments on Real-world Problems

4.5.1 Methodology

4.5.2 Results

4.6 Discussion

4.7 Summary, Conclusions and Critical Analysis

4.7.1 Summary and Conclusions

4.7.2 Critical Analysis of SNCS

5 Clustering through Michigan-Style LCS

5.1 Introduction

5.2 Data Concerns in Smart Grids and Framework

5.2.1 Data Partitioning

5.3 An Effective Online Clustering for Smart Grids

5.3.2 Learning Organisation

5.3.3 Rule Compaction Mechanism

5.3.4 Cost of the Algorithm

5.3.5 Insights on Why XCScds Works

5.4 Experiments

5.4.2 Experiment 1: Clustering Synthetic Data Streams

5.4.3 Experiment 2: Evolving Component in Clustering Synthetic Data Streams

5.4.4 Experiment 3: Online Clustering in a Real Environment

5.5 Summary, Conclusions and Critical Analysis

5.5.2 Critical Analysis of XCScds

6 A Prospective Approach to Association Streams

6.1 Introduction to Association Streams

6.2 Framework

6.2.1 Association Rules: A Descriptive Introduction

6.2.2 Quantitative Association Rules by Means of Intervals

6.2.3 Fuzzy Logic and Association Rules

6.2.4 Obtaining Rules from Data

6.2.5 Learning from Data Streams

6.2.6 Association Streams in a Nutshell

6.3 Description of Fuzzy-CSar

6.3.2 Learning Interaction

6.3.3 Cost of the Algorithm

6.3.4 Insights on Why Fuzzy-CSar Works

6.4 Experiments on Association Streams

6.4.1 On the Difficulty of Evaluating Association Streams

6.4.2 Methodology of Experimentation

6.4.3 Experiment 1

6.4.4 Experiment 2

6.4.5 Experiment 3

6.4.6 Experiment 4

6.4.7 Discussion

6.5 Experiment on a Real Data Stream Problem

6.5.2 Analysis of the Results

6.6 Experiments on Real-World Data Sets with Static Concepts

6.6.1 Analysis of the Computational Complexity and Scalability

6.6.2 Analysis of the Quality of the Models

6.7.2 Critical Analysis of Fuzzy-CSar

7 A Deeper Look at Fuzzy-CSar

7.1 Introduction

7.2 Framework

7.2.1 Learning Challenges, a Brief Tour

7.2.2 The Covering Challenge

7.2.3 The Schema and Reproductive Opportunity Challenge

7.2.4 The Learning Time Challenge

7.2.5 The Solution Sustenance Challenge

7.3 Parameter Setting Guidelines

7.4 Summary and Conclusions

8 Summary, Conclusions and Future Work Lines

8.1 Summary and Concluding Remarks

8.2 Future Work Lines

A Statistical Comparisons of Learning Algorithms

A.1 Essential Concepts

A.2 Pairwise Comparisons: The Wilcoxon Signed-Ranks Test

A.3 Multiple Comparisons

A.3.1 The Friedman Test

A.3.2 Post-hoc Nemenyi Test

A.3.3 Holm’s Procedure

B Index Terms

B.1 SNCS Internal Variables

B.2 XCScds Internal Variables

B.3 Fuzzy-CSar Internal Variables

1.1 The distinct representations explored in this thesis: (a) a multilayer perceptron for data classification in the tao problem, (b) two clustering rules in a two-dimensional problem, and (c) three fuzzy association rules in a two-dimensional problem using five fuzzy sets per variable.

1.2 Conceptual chart displaying the applicability of the studied algorithms to real-world data stream problems versus readability of the representation used. Notice that the metric is not to scale.

2.1 Evolution of a GA population through the GA cycle.

3.1 Interaction of the distinct pressures in XCS (Butz, 2006; Butz et al., 2004).

4.1 Michigan-style LCS Framework.

4.2 The knowledge representation of SNCS in the tao problem.

4.3 Rotating hyperplane problem: comparison of the test error achieved by SNCS, CVFDT, IBk (with k = 1), and NB. Every 10 000 data samples there is a concept drift. Results are averages of ten runs.

4.4 Nemenyi’s test at α = 0.1 on the rotating hyperplane problem. Classifiers that are not significantly different are connected.

4.5 Rotating hyperplane problem: comparison of the test error achieved by SNCS, XCS and UCS. Every 10 000 data samples there is a concept drift. Results are averages of ten runs.

4.6 SEA problem: comparison of the test error achieved by SNCS, CVFDT, IBk (with k = 1), and NB. Every 12 500 data samples there is a concept drift. Results are averages of ten runs.

4.7 Nemenyi’s test at α = 0.05 on the SEA problem. Classifiers that are not significantly different are connected.

4.8 SEA problem: comparison of the test error achieved by SNCS, XCS and UCS. Every 12 500 data samples there is a concept drift. Results are averages of ten runs.

4.9 SEA problem with varying noise levels for each concept: comparison of the test error achieved by SNCS, CVFDT, IBk (with k = 1), and NB. Every 12 500 data samples there is a concept drift. Results are averages of ten runs.

4.10 Nemenyi’s test at α = 0.1 on the SEA problem with varying noise levels. Classifiers that are not significantly different are connected.

4.11 SEA problem with varying noise levels for each concept: comparison of the test error achieved by SNCS, XCS and UCS. Every 12 500 data samples there is a concept drift. Results are averages of ten runs.

4.12 SEA problem with virtual drifts: comparison of the test error achieved by SNCS, CVFDT, IBk (with k = 1), and NB. Every 12 500 data samples there is a concept drift. Results are averages of ten runs.

4.13 Nemenyi’s test at α = 0.05 on the SEA problem with virtual drifts. Classifiers that are not significantly different are connected.

4.14 SEA problem with virtual drifts: comparison of the test error achieved by SNCS, XCS and UCS. Every 12 500 data samples there is a concept drift. Results are averages of ten runs.

4.15 SEA5 problem: comparison of the test error achieved by SNCS, CVFDT, IBk (with k = 1), and NB. Every 12 500 data samples there is a concept drift. Results are averages of ten runs.

4.16 SEA7 problem: comparison of the test error achieved by SNCS, CVFDT, IBk (with k = 1), and NB. Every 12 500 data samples there is a concept drift. Results are averages of ten runs.

4.17 Nemenyi’s test at α = 0.05 on the SEA5 problem. Classifiers that are not significantly different are connected.

4.18 Nemenyi’s test at α = 0.05 on the SEA7 problem. Classifiers that are not significantly different are connected.

4.19 SEA5 problem: comparison of the test error achieved by SNCS, XCS and UCS. Every 12 500 data samples there is a concept drift. Results are averages of ten runs.

4.20 SEA7 problem: comparison of the test error achieved by SNCS, XCS and UCS. Every 12 500 data samples there is a concept drift. Results are averages of ten runs.

4.21 SEA problem with non-linearities: comparison of the test error achieved by SNCS, CVFDT, IBk (with k = 1), and NB. Every 12 500 data samples there is a concept drift. Results are averages of ten runs.

4.22 Nemenyi’s test at α = 0.05 on the SEA problem with non-linearities. Classifiers that are not significantly different are connected.

4.23 SEA problem with non-linearities: comparison of the test error achieved by SNCS, XCS and UCS. Every 12 500 data samples there is a concept drift. Results are averages of ten runs.

4.24 Nemenyi’s test at α = 0.05 on the classification problems using test accuracy. Classifiers that are not significantly different are connected.

4.25 Nemenyi’s test at α = 0.05 on the classification problems using Cohen’s kappa statistic. Classifiers that are not significantly different are connected.

4.26 Nemenyi’s test at α = 0.05 on the classification problems using F-Measure. Classifiers that are not significantly different are connected.

4.27 Illustration of the significant differences (at α = 0.05) among classifiers using accuracy as a test measure. An edge L1 → L2 indicates that the learner L1 outperforms L2 with the corresponding p-value.

4.28 Illustration of the significant differences (at α = 0.05) among classifiers using Cohen’s kappa statistic. An edge L1 → L2 indicates that the learner L1 outperforms L2 with the corresponding p-value.

4.29 Illustration of the significant differences (at α = 0.05) among classifiers using the F-Measure. An edge L1 → L2 indicates that the learner L1 outperforms L2 with the corresponding p-value.

5.1 Architecture of the deployed XCScds system over the Smart Grid holding two data partitions.

5.2 Detailed schema of the XCScds system.

5.3 Knowledge representation used by XCScds in a two-dimensional problem.

5.4 Results of the data stream experiment at the end of (a) the first concept, (b) the second concept, and (c) third concept drift. Results are averages of 10 runs. Blue lines are the boundaries of the discovered clusters.

5.5 Results of the evolving data stream experiment at the end of (a) the first concept, (b) the second concept, (c) third concept and (d) the fourth concept. Results are averages of 10 runs. Blue lines are the boundaries of the discovered clusters.

5.6 Results of the data stream experiment in the real environment. The curve is the average of 10 runs.

6.1 Triangular-shaped membership function.

6.2 Representation of a fuzzy partition for a variable with three uniformly distributed triangular-shaped membership functions using the fuzzy labels small, medium and large.

6.3 Knowledge representation used by Fuzzy-CSar in a two-dimensional problem using five linguistic labels per variable.

6.4 Results obtained in the first experiment. Every 12 500 data samples there is a concept drift. Curves are averages of 30 runs.

6.5 Results of Fuzzy-CSar on the second problem. Concept drifts happen at iteration 1 000 and 21 000. The resulting curve is the average of 30 runs.

6.6 Results obtained in the third experiment. Every 12 500 data samples there is a concept drift. Curves are averages of 30 runs.

6.7 Results obtained in the fourth experiment. Concept drift happens at iteration 20 000 and at 35 000. Curves are averages of 30 runs.

6.8 Number of rules obtained by Fuzzy-CSar on the New South Wales Electricity market problem.

6.9 Relationship between the runtime (in minutes) and the number of variables used with the 100% of transactions and five linguistic terms.

6.10 Relationship between the runtime (in minutes) and the number of transactions using all the 40 features of the problem and five linguistic terms.

A.1 Comparison of the performance of all learning algorithms against each other with the Nemenyi test. Groups of algorithms that are not significantly different at α = 0.05 are connected.

2.1 Training data of the weather problem taken from the UCI machine learning repository (Bache and Lichman, 2013). We can see the input features outlook, temperature, humidity and windy, and the desired output play tennis?

4.1 Configurations used to test the sensitivity of SNCS to configuration parameters in data stream problems.

4.2 Holm / Shaffer Table for α = 0.05 on the rotating hyperplane problem. Algorithms that perform significantly different according to both Holm’s and Shaffer’s procedures are marked in bold.

4.3 Holm / Shaffer Table for α = 0.05 on the SEA problem. Algorithms that perform significantly different according to both Holm’s and Shaffer’s procedures are marked in bold.

4.4 Holm / Shaffer Table for α = 0.1 on the SEA problem with varying noise levels. Algorithms that perform significantly different according to both Holm’s and Shaffer’s procedures are marked in bold.

4.5 Holm / Shaffer Table for α = 0.05 on the SEA problem with virtual drifts. Algorithms that perform significantly different according to both Holm’s and Shaffer’s procedures are marked in bold.

4.6 Holm / Shaffer Table for α = 0.05 on the SEA5. Algorithms that perform significantly different according to both Holm’s and Shaffer’s procedures are marked in bold.

4.7 Holm / Shaffer Table for α = 0.05 on the SEA7. Algorithms that perform significantly different according to both Holm’s and Shaffer’s procedures are marked in bold.

4.8 Holm / Shaffer Table for α = 0.05 on the SEA problem with non-linearities. Algorithms that perform significantly different according to both Holm’s and Shaffer’s procedures are marked in bold.

4.9 Summary of the experiments with data streams. For each problem it shows its description and the different positions in the ranking for each algorithm according to Friedman’s test. Lower values are better.

4.10 Summary of the properties of the data sets used. The columns describe: the identifier of the data set (Id.), the number of instances (#Inst.), the total number of features (#Feat.), the number of numeric features (#Num.), the number of nominal features (#Nom.), and the number of classes (#Class.).

(24)

4.11 Comparison table of the average test performance of the ten times stratified ten-fold cross-validation obtained with the different data mining algorithms analyzed. Columns describe: the identifier of the data set (Id.), the test accuracy (A), the Cohen’s kappa statistic (K) and the F-Measure (F ), for each algorithm. The last two rows show the Friedman’s average ranking and the position for (1) the test accuracy, (2) the Cohen’s kappa statistic and (3) the F-Measure. . . 67

4.12 Holm / Shaffer Table for α = 0.05 on the classification problems using test accuracy. Algorithms that perform significantly different according to both Holm’s and Shaffer’s procedures are marked in bold. . . 70

4.13 Holm / Shaffer Table for α = 0.05 on the classification problems using Cohen’s kappa statistic. Algorithms that perform significantly different according to both Holm’s and Shaffer’s procedures are marked in bold. . . 71

4.14 Holm / Shaffer Table for α = 0.05 on the classification problems using the F-Measure. Algorithms that perform significantly different according to both Holm’s and Shaffer’s procedures are marked in bold. . . 72

4.15 Critical analysis of SNCS. . . 74

5.1 Critical analysis of XCScds. . . 95

6.1 Predefined rule base used to generate the first experiment stream. The consequent variable is accentuated in bold. Notice that, from concept to concept, one variable is left taking random values. . . 116

6.2 Number of rules and its average quality in terms of support, confidence, lift and accuracy at the end of each concept. . . 117

6.3 Rule base used to generate the second problem. The consequent variable is accentuated in bold. Notice that the variable x3 is left taking random values. 117

6.5 Predefined rule base used to generate the third experiment stream. . . 119

6.7 Rule base used to generate the fourth problem. The consequent variable is accentuated in bold. Notice that, from concept to concept, some variables are left taking random values. . . 121

6.9 Properties of the data sets considered for the experimental study. Columns describe: the identifier (Id.), the number of instances (#Inst), the number of features (#Fe), the number of continuous features (#Re) and the number of integer features (#In). . . 125


6.10 Comparisons of results between Fuzzy-CSar and Fuzzy-Apriori. Columns describe: the identifier of the data set (id.), the average number of rules (#R), the average number of variables in the antecedent (#AV), the average support (sup), the average confidence (con), and the average time in seconds (#T). These values are extracted from rules with a confidence ≥ 0.75. . . 128

6.11 Critical analysis of Fuzzy-CSar. . . 130

A.1 Relations between truth/falseness of H0 and the outcomes of the test. . . 150

A.2 Comparison of the performance of the learning algorithms A1 and A2 over 20 data sets. For each data set, δ is the difference between A2 and A1. . . 151

A.3 Comparison of the performance of algorithms A1, A3, A4, A5, A6, and A7 over 30 data sets. For each algorithm and data set the average rank is supplied in parentheses. The last two rows show (1) the average rank of each learning algorithm and (2) the Friedman ranking. . . 153

A.4 Critical values for the two-tailed Nemenyi test for α = 0.05 and for α = 0.1. k is the number of learning algorithms in the comparison. These values have been taken from (Demšar, 2006). . . 154

A.5 Differences between the average Friedman rankings of the learning algorithms. 155

A.6 Holm / Shaffer Table for α = 0.05. Algorithms that perform significantly different according to both Holm’s and Shaffer’s procedures are marked in bold. . . 156

B.1 The distinct internal variables used by SNCS. . . 159

B.2 The distinct internal variables used by XCScds. . . 160


1

Introduction

The field of machine learning has attracted the interest of scientists and engineers since its conception in the late 1950s. Deeply rooted in artificial intelligence, machine learning is concerned with the design and development of computer programs that learn from past experience, without being explicitly programmed, to solve problems that are far too complex for human experts to unravel (Mitchell, 1997). In this regard, machine learning is very attractive for extracting useful information from the ever-increasing, massively collected data of many industrial and scientific applications. The main feature of this kind of environment is that data are potentially unbounded in size, since current real-world applications output continuous flows or streams of information with a fast arrival rate. Another feature is that the data distribution may change over time, and hence a fixed distribution cannot be assumed. Also, as most of these data come from sensor networks, large and varying amounts of noise are expected. Therefore, processing such information requires learning techniques with a high degree of plasticity.

To tackle the above-mentioned challenges, Nature has recurrently inspired practitioners to design and develop a large variety of learning algorithms: individual cell biomechanics, group behaviour and population genetics, to mention just a few, have commonly been used as analogies to obtain computer programs that solve these challenges. One of these studies gave rise to learning classifier systems (LCSs) as cognitive systems that receive perceptions from their environment and, in response to these perceptions, perform actions to achieve certain goals (Holland, 1992, Orriols-Puig, 2008). The learning process of LCSs is guided by the general principles of Darwinian evolution (Darwin, 1859) and cognitive learning (Butz, 2006), and their goal is to provide human-readable rules that model the unknown structure of the problems faced by the system. Although the family of algorithms that make up LCSs has proven to be mature, flexible and competitive, the challenge of mining streams of time-changing information is only starting to be addressed by researchers (Orriols-Puig and Casillas, 2010b).


Therefore, the purpose of this thesis is to further investigate the online learning architecture of LCSs by analysing their behaviour when applied to data stream problems not in a single branch of machine learning, but covering distinct issues from different disciplines. In this chapter we present the framework of this thesis, detailing the dissertation scope to set the reader in the appropriate context to follow the presented work. Finally, we provide the road map of the thesis.

1.1 Thesis Scope

Strongly inspired by the general principles of Darwinian evolution and cognitive learning, Holland (1976) envisaged LCSs as cognitive systems that receive perceptions from an environment and perform actions to achieve goals, thus imitating human experts. In this regard, the very first LCS, the cognitive system one or CS-1 (Holland and Reitman, 1977), was designed as a computer program capable of being aware of the environment it was embedded in and able to take decisions online that could affect that environment. Another important characteristic is that it automatically learns from experience, inferring patterns, behaviours or conclusions from data by modelling the unknown structure of the problems faced. Like the pioneering CS-1, modern LCSs are characterised by three fundamental aspects (Orriols-Puig, 2008): (1) a knowledge representation made up of individuals that enables the system to map sensorial states to actions, (2) an apportionment of credit mechanism which shares the credit obtained among individuals, and (3) a search algorithm, typically a genetic algorithm (GA; Holland, 1976), to discover new promising individuals. GAs, developed as the core knowledge-discovery component of LCSs, are search methods based on the mechanics of natural selection and genetics, and they rely on two fundamental concepts: (1) survival of the fittest and (2) the genetic effects of selection and recombination.

Shortly after the appearance of CS-1, Smith (1980) took a different route and designed a new kind of LCS, the learning system one or LS-1, focused on evolving rule sets instead of individual rules, as opposed to Holland’s idea. In this way the complex rule sharing mechanism is avoided and the learning process is greatly simplified, the learning architecture being much more similar to traditional offline GAs (Bacardit, 2004). The algorithms resulting from the two approaches were distinguished by name: Holland’s approach was coined Michigan-style LCSs, whilst Smith’s approach was labelled Pittsburgh-style LCSs (Orriols-Puig, 2008).

Despite the early success of these pioneers, no complete learning theory was developed for LCSs (Butz, 2004): neither learning nor convergence could be assured, and the learning interactions of early LCSs appeared to be too complex and were therefore not well understood. It was not until the mid-1990s, with the conception of XCS (Wilson, 1995), that interest in the field was renewed. XCS simplifies the learning architecture of early LCSs while being (1) accurate and (2) robust, i.e., XCS is based on a general-purpose framework that performs well in any type of problem. Shortly after its appearance, several theoretical analyses of XCS were published, resulting in the renaissance of the field (Butz, 2004). Consequently, XCS has become the standard framework for Michigan-style LCSs. So far, this framework has demonstrated competitive behaviour in several applications, performing at the same level as, or even surpassing, the most well-studied machine learning techniques (Orriols-Puig, 2008).

Machine learning techniques are especially practical when the problem to solve is too complex for traditional engineering methods, when the problem requires a high degree of adaptation to changing environments, and when huge amounts of data must be processed to extract useful information. As reported by Aggarwal (2007), Angelov (2012), Gama (2010) and others, the data of most industrial and scientific applications are nowadays generated online in the form of dynamic streams of information. This issue has drawn the attention of machine learning practitioners, leading to the emergence of the field of data streams. Unlike the data handled by traditional data mining processes, data streams come in continuous flows of information and possess the following characteristics: (1) these flows of data are potentially unbounded in size and therefore there are strong constraints on memory usage, (2) these data can only be read in a single pass (i.e., one epoch) due to their large size and fast arrival rate, and (3) these data are inherently dynamic and consequently a fixed data distribution cannot be assumed. This dynamic nature is the distinctive trait of data streams, and it hampers the learning process of traditional machine learning algorithms, which are ill-suited for the task. Another important issue of data streams is that, as data come mostly from sensor networks, large and varying amounts of noise are present; therefore, algorithms that are robust against noise are required in order to mine real-world scenarios.

In this regard, the scope of the present work is focused on Michigan-style LCSs as a well-suited framework for mining data streams. Even though this framework has proven to be mature, flexible and competitive, the problem of mining streams of time-changing information is only starting to be addressed by practitioners (Orriols-Puig and Casillas, 2010b). The purpose of this thesis is to go beyond this pioneering exploratory study and further investigate the online learning architecture of LCSs by analysing their behaviour when applied to data stream problems not in a single branch of machine learning (e.g., label prediction), but covering distinct issues from different disciplines. In the subsequent sections the thesis objectives and the roadmap followed in this work are detailed.

1.2 Thesis Objectives and Contributions

The fundamental objective of the present thesis is to explore the online learning nature of the Michigan-style LCS architecture for mining data streams without limiting our analysis to a single learning paradigm or technique (i.e., supervised or unsupervised learning; classification or clustering), but covering a broader, global vision. As there is no recipe to guide practitioners in tackling a new, previously unsolved real-world problem, having a broader vision is of the utmost importance. Regarding this issue, some problems require the application of supervised techniques whereas others require the unsupervised paradigm. Further, a mixture of both learning strategies may often lead researchers towards their goal. Taking the mature and open framework of XCS, by far the most well-studied and influential Michigan-style LCS (Orriols-Puig, 2008), as a departure point, we propose to:

1. Revise and improve the characteristics of Michigan-style LCSs for supervised learning in data stream classification tasks.


2. Revise, extend and improve the characteristics of Michigan-style LCSs for unsupervised learning in clustering data streams.

3. Explore and enrich the characteristics of Michigan-style LCSs for unsupervised learning by introducing association streams.

A more detailed discussion for each one of the three objectives is provided in the following.

Revise and improve the characteristics of Michigan-style LCSs for supervised learning in data stream classification tasks. In spite of the ever-increasing interest of researchers in obtaining novel and useful knowledge out of data streams, little research has been done using Michigan-style LCSs (Abbass et al., 2004, Orriols-Puig and Casillas, 2010b). These studies suggest that, although XCS and its derivatives, specifically the supervised classifier system UCS (Bernadó-Mansilla and Garrell-Guiu, 2003, Orriols-Puig and Bernadó-Mansilla, 2008), are able to tackle such problems satisfactorily, they require a huge population to obtain accurate results, and their responsiveness to a sudden concept drift is not competitive enough in high-dimensional spaces. This issue is mainly caused by the overlapping rule-based representation these algorithms use, which, as preliminary studies suggest, is not the best individual representation for these kinds of problems. Therefore, this thesis starts by identifying the common difficulties that hamper the learning phase of supervised techniques when applied to data stream problems, with the aim of obtaining a very robust system with a high reaction capacity to concept changes and noisy inputs, even in high-dimensional problems.

Revise, extend and improve the characteristics of Michigan-style LCSs for unsupervised learning in clustering data streams. The first objective shows that Michigan-style LCSs are indeed a competitive choice for mining prediction problems when applied to data streams, i.e., supervised tasks. However, many real-world problems require unsupervised learning schemes due to the dearth of training examples and the associated costs of labelling the information to train the system. Moreover, an interesting application of unsupervised learning that has received special attention from the community is clustering data streams (Bifet et al., 2010, Gama, 2012). Therefore, in this second objective we propose the study and application of the Michigan-style LCS framework for clustering data streams by means of a specialisation of XCS: the extended classifier system for clustering, XCSc (Shi et al., 2011, Tamee et al., 2007a). So far, XCSc has not been applied to online problems because of its internal architecture. Thus, we propose a revision of this learning architecture to handle the strict requirements of clustering data streams. Also, we deploy this enhanced version of the algorithm to tackle a real-world scenario.

Explore and enrich the characteristics of Michigan-style LCSs for unsupervised learning by introducing association streams. Current trends in knowledge discovery are encouraging practitioners to extract potentially useful knowledge out of unlabelled data composed of continuous features that flow over time while providing a high degree of interpretability (Orriols-Puig and Casillas, 2010a, Orriols-Puig et al., 2012). Sound examples of these kinds of environments are found in Smart Grids and in network security monitoring, where it is of the utmost importance to detect intrusion attacks while they are happening without assuming any a priori underlying structure. A realistic response to this challenge is association stream mining, a field closely related to frequent pattern mining but which, unlike the latter, is focused on obtaining rules directly from the streams and adapting to concept drift in a purely online fashion; thus, it does not allow the use of a classic offline process for rule generation. Due to its relevance, in the last objective of this thesis we explore the new field of association streams by means of the Michigan-style LCS architecture. Furthermore, we propose ways of tackling these kinds of problems. Also, with the goal of achieving high interpretability, we explore and enrich the crossbreeding of fuzzy logic with LCSs.

In addition, we take the first steps towards a theory of generalisation and learning for Fuzzy-CSar, departing from the existing formalisms of XCS and its related algorithms. The models resulting from this analysis provide several configuration recommendations for applying Fuzzy-CSar.

Figure 1.1 depicts the distinct representations explored with Michigan-style LCSs in this thesis: the highly flexible, highly accurate and non-human-readable neural network, the accurate and human-friendly clustering rule, and finally the linguistic fuzzy association rule, the most readable of them all. At this point it is important to highlight that, as we approach more realistic data stream problems, we improve the readability of the representation used. Also, we refine the applicability of our framework: as supervised learners are a replacement for human operators in daunting tasks, their applicability in real environments is limited, as we will discuss in later chapters. On the other side of the spectrum, unsupervised learners are support tools designed for helping experts. Consequently, readability is a desirable feature in the unsupervised field. Figure 1.2 conceptualises the aforementioned applicability versus readability of the techniques presented in this thesis in a purely qualitative way.

1.3 Roadmap

In this section we detail the overall structure of the present thesis, which is composed of, in addition to the present chapter, seven chapters plus a complementing appendix.

Chapter 2 provides the required theoretical background to follow the present work. It starts with a brief introduction to machine learning, beginning with a formal definition and providing a taxonomy of the three main branches of learning algorithms. Afterwards, evolutionary computation, the core field of the learning algorithms of this thesis, is described in detail, providing the theoretical background on designing competent algorithms. Finally, a survey on LCSs is presented, which gives the reader the big picture.

Chapter 3 reviews the Michigan-style LCS framework by concisely detailing the XCS classifier system. It departs from an overview of the system, introduces the knowledge representation used by this system, details the learning organisation, shows its action inference, and briefly details the theoretical framework on which XCS (and hence every Michigan-style LCS) rests. Also, UCS, a supervised extension of XCS, is described in detail.


[Figure 1.1 panels: (a) network with inputs x and y, hidden nodes f1(z) to f3(z) and output nodes f4(z) (black) and f5(z) (white), each node computing $f_k(z) = 1 / (1 + e^{-(w_0 + \sum_{i=1}^{n} w_i \cdot z_i)})$; (b) scatter plot over axes x and y showing Cluster 1 and Cluster 2, with the rules "if x ∈ [1.04, 3.72] and y ∈ [1.50, 8.21] then C1" and "if x ∈ [0.89, 8.65] and y ∈ [−2.49, 2.75] then C2"; (c) membership functions µA(x) and µB(y) over the fuzzy sets XS, S, M, L and XL, with the rules "if x is {S} ⇒ y is {XS or S} [supp: 0.4; conf: 1]", "if x is {L or XL} ⇒ y is {XS} [supp: 0.2; conf: 0.3]" and "if y is {M} ⇒ x is {L} [supp: 0.3; conf: 1]".]

Figure 1.1: The distinct representations explored in this thesis: (a) a multilayer perceptron for data classification in the tao problem, (b) two clustering rules in a two-dimensional problem, and (c) three fuzzy association rules in a two-dimensional problem using five fuzzy sets per variable.

Chapter 4 starts by detailing the challenges of learning from data streams in supervised tasks and the difficulties that XCS, the reference Michigan-style LCS, has in handling these environments. This chapter presents the supervised neural constructivist system (SNCS), a brand new member of the Michigan-style LCS family that is specifically designed to learn from data streams with a fast reaction capacity to concept changes and a remarkable robustness to noisy inputs. The behaviour of SNCS on data stream problems with different characteristics is carefully analysed and compared with other state-of-the-art techniques in the field. Furthermore, SNCS is also compared with XCS and UCS using the same testbed to highlight their differences. This comparison is extended to a large collection of real-world problems.

Figure 1.2: Conceptual chart displaying the applicability of the studied algorithms (SNCS, XCSc and Fuzzy-CSar) to real-world data stream problems versus readability of the representation used. Notice that the chart is not to scale.

Chapter 5 exploits the Michigan-style LCS framework beyond supervised tasks when applied to data streams. Despite the good results obtained in the supervised field, these techniques assume an a priori underlying structure for the set of features of the problem: supervised techniques require the class of each training example in order to build a reliable model. This requirement is often unrealistic, especially in real-world problems, making the direct application of purely supervised learners ill-suited. With this issue in mind, this chapter is dedicated to improving XCSc to face the challenges of clustering data streams.

Chapter 6 presents the new field of association streams, devoted to modelling dynamically complex domains via production rules without assuming any a priori structure. This chapter discusses the challenges that association streams pose to learning algorithms and presents the fuzzy classifier system for association rules (Fuzzy-CSar), an algorithm that inherits the main architecture of XCS and Fuzzy-UCS for extracting useful knowledge out of continuous streams of unlabelled data.

Chapter 7 goes further and takes the first steps towards a theory of generalisation and learning of Fuzzy-CSar, departing from the existing formalisms of XCS and its related algorithms. Also, the lessons learned from this analysis result in several configuration recommendations for applying this algorithm to any type of problem.


Chapter 8 recapitulates the contributions of this thesis by summarising them, providing key conclusions, and proposing future lines of work.

The material presented in the distinct chapters is complemented with an appendix (A) that details the required background on the statistical tests employed in the different chapters of this work.


2

Theoretical Background

Over the past decades the study of machine learning has grown from the early efforts of engineers exploring whether computers could learn to perform some intelligent actions to a broad discipline (Mitchell, 2006). This discipline, strongly related to artificial intelligence and statistics, is concerned with the development of computer programs that can learn from past experience. This chapter provides the necessary background to understand the theoretical foundations described in this thesis, starting with the formal definition of machine learning, arguing why it is such an interesting field, and then showing the classical taxonomy, i.e., supervised learning, unsupervised learning and reinforcement learning. Following that, we further classify machine learning into offline and online learning, where the latter leads to the field of data streams. Afterwards, this chapter presents the core field of the thesis: evolutionary computation (EC), a family of learning algorithms inspired by natural selection and genetics. Next, the theoretical foundations on which this family is based and the background on designing competent EC algorithms are described. Finally, the two main models of EC learning systems, Michigan-style and Pittsburgh-style LCSs, and their derivatives are detailed, showing their differences.

2.1 Machine Learning, a Brief Tour

Machine learning is a field of computer science closely related to artificial intelligence (AI) that is concerned with the development of computer programs that learn from data obtained, for example, from input sensors or databases, and that automatically improve their results with the experience gained when manipulating these data. Mitchell (1997) proposed a more precise and formal definition of machine learning: “a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

So, in general, to have a well-defined learning problem, the algorithm designer has to identify three key features: (1) the class of the task to be performed by the machine, (2) the measure of performance to be improved, and (3) the source of experience. For example, a computer program that learns to drive a robot through a maze can be formally defined as follows:

• Task T: driving a robot through the maze.

• Performance measure P: average distance traveled before an error occurs.

• Training experience E: a set of sensor data classified by an expert.

There are a vast number of useful applications of machine learning in distinct fields such as search engines, medical diagnosis, pattern recognition or recommender systems, just to mention a few. As mentioned by Orriols-Puig (2008), the most important reasons that may lead to the application of machine learning to solve problems are enumerated as follows:

1. Problems that are too complex to manually design and code an algorithm to solve it. This is the case of computer vision: it turns out that it is very complex to write an algorithm for recognising human faces using traditional software engineering tools.

2. Necessity of programs that continuously adapt to changing environments. This is the case of robot control, for instance the Mars Science Laboratory Curiosity.

3. Necessity of processing huge amounts of data to extract novel, interesting and useful knowledge from patterns hidden in these data. This is the objective, in fact, of data mining (Witten et al., 2011).

From the classic point of view, machine learning algorithms are classified based on the task to perform. In general, there are three fundamental types of learning: supervised learning, where an expert or teacher provides feedback in the learning process; unsupervised learning, where there is no expert or teacher while the learning process is running; and reinforcement learning, where the program learns by interacting with the environment. Also, these families of learners can be further classified into offline and online methods, depending on the strategy followed by the algorithm. These concepts are elaborated in more detail in what follows.

2.1.1 Supervised Learning

Supervised learning is the process of extracting a function or model from training data. These training data are composed of a set of input features (often also called attributes) and the desired output, as shown in Table 2.1. The main characteristic of the supervised learning approach is that the computer program needs an expert or teacher that provides feedback in the learning process. The supervised paradigm assumes an a priori underlying structure and thus requires the use of existing information to obtain its knowledge (i.e., the learner uses the output of the training data as guidance).


Orriols-Puig (2008) gives a more precise definition: “supervised learning is the process of extracting a function or model that maps the relation between a set of descriptive input attributes and one or several output attributes.”

outlook    temperature  humidity  windy  play tennis?
sunny      85           85        false  no
sunny      80           90        true   no
overcast   83           86        false  yes
rainy      70           96        false  yes
rainy      65           70        true   no
overcast   64           65        true   yes
sunny      72           95        false  no
sunny      69           70        false  yes
sunny      75           70        true   yes
overcast   72           90        true   yes
overcast   81           75        false  yes
rainy      71           91        true   no

Table 2.1: Training data of the weather problem taken from the UCI machine learning repository (Bache and Lichman, 2013). We can see the input features outlook, temperature, humidity and windy, and the desired output play tennis?

We can further classify supervised learning depending on the type of the output features as data classification or data regression. In data classification the goal is to find a model that predicts the class (that is, the output feature) of new input instances not previously seen by the algorithm during the training phase. These output features are said to be categorical (i.e., the output represents the classes of the examples). Table 2.1 shows a classic example of a data classification problem. In data regression, the goal is to find a function that predicts the output value of new and previously unseen input instances, that is, modeling a function. In this case, the output features are said to be continuous.
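As a minimal illustration of data classification, the following Python sketch learns a one-attribute rule (the majority class for each value of outlook) from the rows of Table 2.1 and uses it to predict the class of an instance. The function names `train_one_rule` and `predict` are hypothetical, and this simple rule learner is only an illustrative stand-in, not one of the algorithms studied in this thesis.

```python
from collections import Counter, defaultdict

# Rows of Table 2.1: (outlook, temperature, humidity, windy, play tennis?)
WEATHER = [
    ("sunny", 85, 85, False, "no"),
    ("sunny", 80, 90, True, "no"),
    ("overcast", 83, 86, False, "yes"),
    ("rainy", 70, 96, False, "yes"),
    ("rainy", 65, 70, True, "no"),
    ("overcast", 64, 65, True, "yes"),
    ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"),
    ("sunny", 75, 70, True, "yes"),
    ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"),
    ("rainy", 71, 91, True, "no"),
]


def train_one_rule(rows, feature_index=0):
    """Map each value of one input feature to its majority class."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature_index]][row[-1]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def predict(model, row, feature_index=0):
    """Predict the class of an instance from the chosen input feature.

    Raises KeyError for a feature value never seen during training.
    """
    return model[row[feature_index]]


model = train_one_rule(WEATHER)
# Training accuracy only: a real study would evaluate on unseen instances.
accuracy = sum(predict(model, r) == r[-1] for r in WEATHER) / len(WEATHER)
print(model, accuracy)
```

The model here is a plain dictionary from attribute value to class, which makes the learned "function" of the definition above directly inspectable; a regression learner would return a numeric value instead of a class label.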

2.1.2 Unsupervised Learning

Unsupervised learning encompasses a range of techniques for extracting a representation from data (these data consist only of input features); thus, these techniques do not assume any a priori underlying structure in data (i.e., the learner does not know what it is looking for and, hence, has no feedback from the environment).

The kinds of problems that unsupervised learning approaches handle are those in which it is required to determine how the data are organised. The main unsupervised techniques are clustering, dimensionality reduction and association rules. The objective of clustering is to separate a finite unlabelled set into a discrete finite set of hidden data structures. The aim of dimensionality reduction is to find a subset of the input features that preserves as much information as possible, that is, to reduce the number of input features of the problem. Association rules are methods for discovering interesting relations hidden among the variables of the problem.
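To make the notion of an association rule more concrete, the sketch below computes the two standard quality measures of a rule A ⇒ B over a set of transactions: support (the fraction of transactions containing both A and B) and confidence (the fraction of transactions containing A that also contain B). The toy transactions and the function names are hypothetical and serve only as an illustration under these standard definitions.

```python
def count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)


def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent."""
    joint = set(antecedent) | set(consequent)
    return count(transactions, joint) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of antecedent-matching transactions that also match the consequent.

    Assumes the antecedent occurs at least once (otherwise division by zero).
    """
    joint = set(antecedent) | set(consequent)
    return count(transactions, joint) / count(transactions, antecedent)


# Hypothetical toy data: each transaction is a set of items.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

rule_support = support(TRANSACTIONS, {"bread"}, {"milk"})
rule_confidence = confidence(TRANSACTIONS, {"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

Note that no class label is involved: the rule is scored purely from co-occurrences among the input variables, which is what distinguishes this setting from the supervised one.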


2.1.3 Reinforcement Learning

Reinforcement learning is a family of techniques that envisage learning what to do by interacting with the environment. In this interaction the agent receives what are called perceptions from the environment and performs actions with the aim of achieving one or several goals. The agent then receives positive or negative rewards as a consequence of its actions. Sutton and Barto (1998) describe reinforcement learning as the problem faced by an agent that must learn behaviour through trial-and-error interactions with a dynamic environment.

Sutton and Barto (1998) define reinforcement learning somewhat more formally: “reinforcement learning is how to map situations to actions so as to maximise a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them. Actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards.”

In this regard, reinforcement learning lies between supervised and unsupervised learning: the agent has no examples to learn from, so it must be able to learn from its own experience. The agent has to exploit what it already knows in order to obtain a reward, but it also has to explore in order to make better action selections in the future.
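The trade-off between exploitation and exploration can be sketched with an epsilon-greedy agent on a multi-armed bandit, a textbook strategy used here purely as an illustration (it is not claimed to be the mechanism used by the systems in this thesis). The reward probabilities, the function name `epsilon_greedy` and all parameter values are assumptions made for the example.

```python
import random


def epsilon_greedy(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Run an epsilon-greedy agent on a Bernoulli multi-armed bandit.

    reward_probs: hypothetical probability of a reward of 1 for each action.
    Returns the estimated value and the number of pulls of each action.
    """
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    values = [0.0] * n_actions  # running estimate of each action's reward
    pulls = [0] * n_actions
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action to improve future decisions.
            action = rng.randrange(n_actions)
        else:
            # Exploit: take the action currently believed to be best.
            action = max(range(n_actions), key=values.__getitem__)
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        pulls[action] += 1
        # Incremental mean: past rewards do not need to be stored.
        values[action] += (reward - values[action]) / pulls[action]
    return values, pulls


values, pulls = epsilon_greedy([0.2, 0.5, 0.8])
best = max(range(len(values)), key=values.__getitem__)
print(best, pulls)
```

With `epsilon=0` the agent would only exploit and could stay on the first action that ever paid off; the small exploration rate is what lets it discover the better actions from its own experience.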

2.1.4 Offline and Online Learning

So far we have presented the classic taxonomy that identifies the three types of learning. The aforementioned families can be further classified into offline and online methods. Offline algorithms require all the data to be available and analysed in order to build a comprehensive system model and infer any kind of knowledge from it, whether in the supervised, unsupervised or reinforcement learning paradigm. Online learning approaches, on the other hand, build a dynamic system model that attempts to adapt itself to the specificities of the environment.

Offline techniques can be exported to online environments by means of windowing techniques. In this manner, the algorithm can be retrained continuously and thus obtain nearly-online results.
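A minimal sketch of the windowing idea follows. It is an illustration under stated assumptions rather than a method used in this thesis: the offline learner is a deliberately trivial, hypothetical majority-class model (`fit_majority`), the window size is arbitrary, and the stream is a synthetic one whose label changes halfway through.

```python
from collections import Counter, deque

def fit_majority(batch):
    """Hypothetical offline learner: predicts the most frequent label of the batch."""
    labels = [y for _, y in batch]
    return Counter(labels).most_common(1)[0][0]

def windowed_learning(stream, window_size=50):
    """Test-then-train over a stream, refitting an offline model on a sliding window."""
    window = deque(maxlen=window_size)  # the oldest examples are dropped automatically
    model = None
    hits = []
    for x, y in stream:
        if model is not None:
            hits.append(model == y)   # first predict with the current model
        window.append((x, y))
        model = fit_majority(window)  # then retrain offline on the window only
    return hits

# Synthetic stream (assumption): the label switches from 0 to 1 after 200 examples.
stream = [(i, 0) for i in range(200)] + [(i, 1) for i in range(200, 400)]
hits = windowed_learning(stream)
```

After the change, the model keeps failing until the new examples outnumber the old ones in the window, and then recovers. The window size therefore controls how quickly the retrained model follows the environment, at the cost of learning from fewer examples.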

2.2 Data Streams

In our Information Age, the data of most industrial and scientific applications are generated online and collected continuously and massively, so a static model cannot be assumed. Moreover, these data present the following characteristics (Angelov, 2012, Gama, 2012, 2010, Gama and Gaber, 2007, Lughofer and Angelov, 2011, Núñez et al., 2007):

• Potentially unbounded in size.
• Limited usage of memory.
• Data can only be handled once.
• Fast arrival rate.
• Continuous flow of information.
• Target concepts may vary over time (concept drifts).
• Noise levels may change.
• A fixed data distribution cannot be assumed.

The field that monitors, tracks and controls such data under these constraints is known as data streams and has attracted the attention of machine learning practitioners in recent years. One of the main challenges that data streams present is the dynamic nature of the target concept, which may change over time (referred to as concept drift) and makes the learning process difficult. Typically, concept drifts are classified as abrupt, incremental or recurring (Gama, 2010), according to the type of change inside the data. Also, most often examples can be seen only once, which limits the accuracy of the models generated by the learners. Some real-world examples of data streams can be found in sensor monitoring, Smart Grids, stock market analysis, web click-stream analysis, road traffic congestion analysis, market basket mining or credit card fraud detection, among many others (Orriols-Puig and Casillas, 2010b).
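The effect of an abrupt concept drift on a single-pass, constant-memory learner can be sketched as follows. The example is illustrative only and rests on assumptions that are not part of this thesis: the "concept" is simply the mean of a synthetic numeric stream, and the forgetting factor `alpha` is an arbitrary value. Both estimators see each example once and store a fixed amount of state, but only the one that gradually forgets old data tracks the new concept.

```python
def running_mean(stream):
    """Single-pass mean that weights every example seen so far equally."""
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n
    return mean

def fading_mean(stream, alpha=0.05):
    """Single-pass mean with exponential forgetting of old examples."""
    mean = None
    for x in stream:
        mean = x if mean is None else (1 - alpha) * mean + alpha * x
    return mean

def drifting_stream():
    """Synthetic stream (assumption): the value jumps abruptly from 0 to 10."""
    for _ in range(1000):
        yield 0.0
    for _ in range(1000):
        yield 10.0

static_estimate = running_mean(drifting_stream())
adaptive_estimate = fading_mean(drifting_stream())
```

The equally weighted mean ends halfway between the old and the new concept, whereas the fading mean ends close to the current value of the stream. The price of forgetting is a noisier estimate when the concept is stable.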

So far we have presented the classic taxonomy of machine learning techniques based on the task they perform. Distinct machine learning techniques have been developed to perform some of the aforementioned tasks. One of the most successful approaches to the distinct types of problems in machine learning is learning classifier systems (LCSs) (Holland, 1971, 1976, 1992, Holland et al., 1999), originally designed for reinforcement learning tasks (Holland and Reitman, 1977). Since then, the LCS family has been extended to deal with supervised (Bacardit and Butz, 2007, Bernadó-Mansilla and Garrell-Guiu, 2003) and unsupervised (Orriols-Puig and Casillas, 2010a, Tamee et al., 2007a) learning tasks, hence proving to be a flexible and trustworthy learning architecture (Orriols-Puig, 2008). The remainder of this chapter focuses on evolutionary computation, the core set of techniques used by most LCSs to discover new knowledge, and, in turn, on LCSs themselves.

2.3 Nature-inspired Learning Algorithms

Strongly inspired by the way Nature solves complex problems, Evolutionary Computation (EC) is a field of study that solves problems using procedures inspired by natural processes (Bacardit, 2004). EC does not refer to a single type of learning algorithm, but to a series of distinct techniques that share the same principles of natural selection and evolution for competent problem solving (Orriols-Puig, 2008).
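As a rough illustration of these shared principles, the sketch below evolves a population of bit strings using only selection and mutation. It is a toy example under assumptions of our own, not one of the algorithms studied in this thesis: the fitness function simply counts the ones in the string (the classic OneMax problem), selection is a binary tournament, the best individual is always copied to the next generation, and all parameter values are arbitrary.

```python
import random

def evolve_onemax(n_bits=20, pop_size=20, generations=100, seed=1):
    """Toy evolutionary loop: selection plus mutation on bit strings (illustrative)."""
    rng = random.Random(seed)
    fitness = sum  # OneMax: the fitness of a bit string is its number of ones
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = [max(map(fitness, pop))]  # best fitness per generation
    for _ in range(generations):
        elite = max(pop, key=fitness)
        children = [elite[:]]           # elitism: the best individual survives
        while len(children) < pop_size:
            a, b = rng.sample(pop, 2)   # binary tournament selection
            parent = a if fitness(a) >= fitness(b) else b
            # mutation: flip each bit with probability 1 / n_bits
            child = [bit ^ 1 if rng.random() < 1.0 / n_bits else bit
                     for bit in parent]
            children.append(child)
        pop = children
        history.append(max(map(fitness, pop)))
    return history

history = evolve_onemax()
```

Fitter individuals are more likely to be selected as parents, and random variation occasionally produces better offspring, so the best fitness in `history` never decreases and gradually approaches the optimum.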

In Nature, all living entities are made of cells, and each contains chromosomes—strings of DNA—that define the blueprint for the organism. A chromosome is divided into genes, the functional blocks of DNA, each of which encodes a particular protein (Mitchell, 1998). In order to understand the principles that guide Nature, two fundamental concepts have to be defined: the concept of genotype and the concept of phenotype, which are elaborated in the following.

