Imitation learning in artificial intelligence

(1)

warwick.ac.uk/lib-publications

A Thesis Submitted for the Degree of PhD at the University of Warwick

Permanent WRAP URL:

http://wrap.warwick.ac.uk/94683

Copyright and reuse:

This thesis is made available online and is protected by original copyright.

Please scroll down to view the document itself.

Please refer to the repository record for this item for information to help you to cite it.

Our policy information is available from the repository home page.

(2)

Imitation Learning in Artificial Intelligence

by

Alexandros Gkiokas

Thesis

Submitted to the University of Warwick

for the degree of Computer Science.

Supervisor: Alexandra I. Cristea

Doctor of Philosophy

Department of Computer Science

(3)

Acknowledgments

I would like to express my sincere gratitude to my supervisor Alexandra Cristea,

without whom this thesis and research would never have taken place. My parents

George and Maria, for never giving up on me and always believing. My current

employer Stratos Arampatzis and Ortelio Ltd, as well as my previous employer

Matthew Sewell and Athium Ltd, for their support, funding and understanding.

Karen Stepanyan and Panagiotis Petridis for putting up with my questions, offering

their support and advice, and guiding me through this adventure. Finally, Mina

Makridi for putting up with me and for the past four years, and entertaining the

(8)

Declarations

This thesis is submitted to the University of Warwick in support of my application

for the degree of Doctor of Philosophy. All experimental data presented and

sim-ulations were carried out by the author, except in the following cases: Daedalus

experiments were done on-line after being approved by the BSREC with

refer-ence REGO-2015-1529. All experiments for Daedalus were carried out on the web

and they represent anonymous data. Chapter 3 contains theoretical formulations

[Gkiokas et al., 2014] carried out in cooperation with Matthew Thorpe from the

Mathematics Institute in Warwick. Parts of this thesis have been published by the

author, including submitted papers:

- Training a Cognitive Agent to Acquire and Represent Knowledge from RSS

feeds onto Conceptual Graphs [Gkiokas and Cristea, 2014a].

- Unsupervised neural controller for Reinforcement Learning action-selection:

Learning to represent knowledge [Gkiokas and Cristea, 2014b].

- Self-reinforced meta learning for belief generation [Gkiokas et al., 2014].

- Cognitive Agents and Machine Learning by Example: Representation with

Conceptual Graphs [Gkiokas and Cristea, 2016a].

- Deep Learning and Encoding in Natural Language Understanding: Sparse

and Dense Encoding Schemes for Neural-based Parsing [Gkiokas and Cristea,

(9)

Abstract

Acquiring new knowledge often requires an agent or a system to explore, search and

discover. Yet us humans build upon the knowledge of our forefathers, as did they,

using previous knowledge; there does exist a mechanism which allows transference

of knowledge without searching, exploration or discovery. That mechanism is known

as imitation and it exists everywhere in nature; in animals, insects, primates, and

humans. Enabling artificial, cognitive and software agents to learn by imitation

could potentially be crucial to the emergence of the field of autonomous systems,

robotics, cyber-physical and software agents. Imitation in AI implies that agents

can learn from their human users, other AI agents, through observation or using

physical interaction in robotics, and therefore learn a lot faster and easier.

Describing an imitation learning framework in AI which uses the Internet as

the source of knowledge requires a rather unconventional approach: the procedure

is a temporal-sequential process which uses reinforcement based on behaviouristic

Psychology, deep learning and a plethora of other Algorithms. Ergo an agent using a

hybrid simulating-emulating strategy is formulated, implemented and experimented

with. That agent learns from RSS feeds using examples provided by the user; it

adheres to previous research work and theoretical foundations and demonstrates

that not only is imitation learning in AI possible, but it compares and in some cases

(10)

Abbreviations

ADABOOST Adaptive Boosting

AGI Artificial General Intelligence

ANN Artificial Neural Networks

AI Artificial Intelligence

BFGS Broyden-Fletcher-Goldfarb-Shanno

BPROP Back Propagtion

CA Cognitive Agent

CE Cross Entropy

CG Conceptual Graph

CNN Convolutional Neural Networks

CRF Conditional Random Field

DARPA Defense Advanced Research Projects Agency

DNN Deep Neural Networks

FOL First Order Logic

FSM Finite State Machine

(11)

GOFAI Good Old Fashioned Artificial Intelligence

HOL Higher Order Logic

KR Knowledge Representation

LBFGS Limited storage Broyden-Fletcher-Goldfarb-Shanno

LMA Levenberg-Marquardt Algorithm

LSTM Long Short Term Memory

MBSGD Mini Batch Stochastic Gradient Descent

MDP Markov Decision Process

ML Machine Learning

MRL Meaning Representation Language

MSE Mean Squared Error

NLP Natural Language Processing

NLU Natural Language Understanding

PBD Programming by Demonstration

PBE Programming by Example

POS Part Of Speech

RBM Restricted Boltzmann Machine

ReLU Rectified Linear Unit

RNN Recurrent Neural Network

RPROP Resilient Propagation

(12)

SVM Support Vector Machine

(13)

List of Tables

2.1 State of the Art NLU Software Tools. . . 50

4.1 Common NLP Datasets . . . 84

5.1 Accuracy Metrics Equivalence. . . 97

(14)

List of Figures

2.1 Simple MRL structure. . . 15

2.2 MRL with Edges. . . 16

2.3 Annotated MRL with Edges and Meta-data. . . 16

2.4 Conceptual Graph example. . . 17

2.5 Conceptual Graph Directed Edges. . . 17

2.6 Example of a Neural Network. . . 20

2.7 Example of a deep fully connected neural network. . . 25

2.8 Reinforcement Learning Episode. . . 28

2.9 Reinforcement Learning Agent Interaction with Environment. . . 29

2.10 Rewarding the Terminal State . . . 29

2.11 Example of a Policy. . . 30

2.12 Epicentre of Multiple Episodes. . . 30

2.13 Deep Reinforcement Learning Agent & Environment. . . 31

2.14 Knowledge sharing by agent and user . . . 42

2.15 WordNet Semantic Tree Graph. . . 47

3.1 Icarus Engine Blueprint. . . 56

3.2 Agent Observes Example . . . 59

3.3 Agent Recreates Example . . . 59

3.4 Agent Decomposes Example . . . 60

3.5 Shift-Reduce Graph Polulation . . . 61

3.6 Shift-Reduce Edge Creation . . . 61

3.7 Agent as a Decision Maker. . . 65

3.8 Decision Making Process. . . 66

3.9 Semantic Graph Union. . . 69

3.10 Deep Learning Cascade. . . 73

4.1 A CG example . . . 83

(15)

4.3 Clustering Coefficient: CG rhombus versus KR triplet . . . 89

5.1 Semantic-Heuristic Accuracy. . . 100

5.2 Probability Histograms. . . 104

5.3 Probability Value Map . . . 105

5.4 Probability Accuracy and Action Ratios. . . 107

5.5 Filtering withθ . . . 108

5.6 Cyclic Process: MDP to Statistics to ANN . . . 113

5.7 ANN Classifying and Mapping Search Space. . . 115

5.8 ANN Accuracy. . . 116

5.9 Sparse POS and Distributed Term Classifiers . . . 123

5.10 Sparse-Encoded POS and Term Edge Classifiers . . . 124

5.11 Sparse-Encoded Term Edge and Distributed-Encoded POS Classifier 125 5.12 Distributed Term Edges and Distributed POS Edges . . . 126

5.13 Complexity Mapping to Accuracy PCA . . . 127

B.1 Synthetic marijuana is made with shredded plant material coated with chemicals designed to mimic THC psychoactive compound found in marijuana. . . 138

B.2 Artificial magnetic bacteria turn food into natural drugs. . . 139

B.3 Ataluren Phase 3 trial results in nonsense mutation cystic fibrosis. . 140

B.4 UK drug company AstraZeneca rejects improved final £69 billion takeover offer from US firm Pfizer. . . 141

B.5 Lewy body dementia is most misdiagnosed dementia affecting 13 mil-lion Americans. . . 142

B.6 Action star has got himself tank and destroys piano and birthday cake with it. . . 143

B.7 Plan to offer better care and treatment for 500000 patients living with neurological conditions. . . 144

B.8 Each cell uses particular schemes of molecular interaction which psy-chologists call intercellular signaling pathways. . . 145

B.9 Branching with Column Graphs . . . 146

(16)

Chapter 1

Introduction

In the never-ending quest for Artificial Intelligence (AI), we take example from our-selves and our own intellect, since we are what we believe to be the most intelligent

species on the planet. Intelligence however, did not spontaneously come into

ex-istence, but was the result of a painstaking process of evolution [Bjorklund, 2006; Wynn, 1985; Sternberg, 1982]; even more interestingly, scientists argue that there

exist multiple intelligences and not just one [Gardner, 2011]. How those intelligences

arose is a topic biologists, geneticists and psychologists researching human intelli-gence have been working on for more than a century; yet their beliefs and theories

directly affect computer scientists working on AI. Our focus are machines: robots,

software and hardware, artificial artifacts, upon which humanity is trying to instill

intelligence and make them as smart as humans. Yet we cannot dismiss how

hu-man intelligence arose as outside the scope of AI, not only because it may be very

relevant to the actual processes we’re trying to recreate, in that there may be cru-cial information in the emerge of intelligence in Homininae, information that could

make Artificial General Intelligence (AGI) a reality (AGI as in, an irrefutably

in-telligent, sentient and self-aware technological singularity [Goertzel and Pennachin, 2007; Kurzweil, 2005]).

1.1 Human Intelligence

Nowadays we know that one of the pieces of the puzzle that aided the emergence of

intelligence isimitation[Dautenhahn and Nehaniv, 2002b]. Not only did imitation

aid human intelligence, but there is overwhelming evidence in nature which suggests that imitation is one of the core mechanisms behind intelligence in animals, insects

(17)

2002; Galef Jr, 1988]. It is human intelligence that interests us the most, and there

have been speculations that the ”great leap forward”, a period 200,000 years ago

when our intellect exploded and we started creating tools, is mostly attributed

to our ability to imitate [Ramachandran, 2000] in combination with our tendency

to congregate and socialise. Thus it appears that the two driving factors of the

leap forward were the development of communities and the ability to learn from

others. By doing so, theknowledge of previous generations was passed down to the

next generations, and thus individual knowledge, and by extension the collective knowledge began accumulating [Jones, 2009]. Through imitation, human societies

taught their offspring how to create tools, how to farm, and most importantly how

to survive. It thus becomes apparent that learning and imitation are closely knit

together [Heyes, 2002]; we can’t have imitation without the ability to learn, and,

vice-versa, being able to learn is of questionable use when there exists no mechanism

through which to acquire learning material.

1.2 Imitation in Humans

Imitation is a very broad term but appears to be a low-level to mid-level ability

of identifying examples, and learning from peers either via direct demonstration

or from observation [Dautenhahn and Nehaniv, 2002a]. Contemporary researchers

argue that imitation is a unitary competency, a behavioural process that could have evolved as a unit and can be inherited as well as shared across a species

[Myowa-Yamakoshi et al., 2004; Ferrari et al., 2006]. The implication of this argument is

that in AI an imitation process is something that can be learnt; an algorithm, a

heuristic or a cybernetic system.

Other researchers add that a neurological mechanism enables imitation in

humans [Iacoboni and Dapretto, 2006; Iacoboni, 2005; Gr`ezes et al., 2003; Decety

et al., 2002; Iacoboni et al., 1999] known as the ”mirror system”. Whilst the

neuron-based imitation in humans is not fully understood, such a hypothesis could imply

that an artificial neural-based imitation system could in theory be implemented.

However, it is not currently known if the mirror system enables higher cognitive

functions, or only sensor-motor functions [Bonini and Ferrari, 2011], although there

exists evidence to suggest that social and higher level functions are indeedpartially

attributed to the mirror-system.

There exist many types of imitation: high-level, physical, hierarchical,

struc-tured and more [Dautenhahn and Nehaniv, 2002a]. Imitation may be supervised

(18)

when observation is the only means of acquiring learning material.

1.3 Imitation in Artificial Intelligence

Artificial imitation research has mostly focused on Robotics, in order to achieve a

similar process to how infants learn movement and sensormotor abilities [Suleiman

et al., 2008; Nakaoka et al., 2007; Breazeal et al., 2005; Breazeal and Scassellati, 2002]. Artificial imitation in applications not related to robotics deals mostly with

programming by example (PBE) also known as programming by demonstration (PBD), the only imitation-related topic in non-robotic AI [Lieberman, 2001, 2000;

Halbert, 1984]. Research in PBE/PBD is more than 15 years old, and was mostly

concerned with programming and focused on user interfaces (UI). Regardless of the

advent and subsequent sunset of PBE, imitation as a learning mechanism for AI,

and more specifically for software agents, has been ignored and is to this very day

an esoteric and perplexing topic.

The transition from PBE to AI agents is not an easy one; whereas PBE was

concerned only with programming, AI agents are focusing on autonomy, learning,

self-organisation, knowledge representation, logic and reasoning. An imitating AI

agent is an even more peculiar entity: it does utilise the aforementioned topics, but

it revolves around the combination of learning via imitation.

The notion that imitation is not learning is often perplexing, the reason

being that learning is the main focal point of an agent without describinghow the

training material or samples have been acquired and used. Imitation thus focuses

on acquiring learning material, training samples, data or information which is of use to the agent, and serves the purpose of acquiring new behaviours, performing new

tasks or procedures [Dautenhahn and Nehaniv, 2002a]

Whilst learning and imitation are sometimes used interchangeably in AI and machine learning (ML), fundamentally they are different. The way in which the

model, agent or algorithm acquires information and translates it into knowledge, is

what differentiates learning from imitation [Dautenhahn and Nehaniv, 2002a]. A neural network being trained and evaluated by a user displays no form of

imitation; yet an autonomous, self-trained and self-evaluating agent requires that it

is able to identify paradigms from which it can extract samples, pre-process them,

and then use them appropriately so that learning may occur. Before acquiring

learning material an imitating agent must be able to extract or decode some kind of a paradigm which relates to what is being learnt. Post learning the agent should be

(19)

to other agents [Lawniczak and Di Stefano, 2010].

1.4 Cognitive Artificial Intelligence

Due to the cognitive nature of such agents this research further explores cognitive AI

and AI architectures. The termcognitive agent (CA) interchangeably used with the

term ”cognitive AI”, requires that the agent must meet certain criteria [Lawniczak and Di Stefano, 2010].

- perceive information in the environment provided by other agents

- reason about this information using existing knowledge

- judge the obtained information using existing knowledge

- respond to other agents

- learn and augment current knowledge if newly acquired information allows it.

The above basic criteria set the bar for a cognitive agent, but an imitating agent

has additional requirements, which are discussed in detail in Chapter 2.

This thesis sets the imitation requirements by taking into consideration the

cognitive nature of such systems; the core premise of the imitative ability being the acquisition of knowledge by the agent and by drawing parallelisms from the

observable, ostensible and discernible processes of the human cognitive system,

thereby recreating the outcome of that process. Whilst the goal is not to produce a biologically-plausible system, the agent is driven by biomimicry since it

demon-strates how AI might mimic human intelligence.

From the background research and work carried out in the last two decades, it can be asserted that imitation is not a unitary model, a finite-state machine

(FSM) or an algorithm, nor is it a theoretical abstract; it is in fact a group of

models, algorithms, a fusion or cascade of existing and new models into a software middle-ware, an agent. However imitation in AI and in robotics has not delivered its

promises; PBE has eclipsed as a field, and robotics to this very day still depend on

heuristic controllers. High level cognitive functions are usually programmed rather

than learnt and few state-of-the-art experimental research have so far focused on

learning by imitation. The main research question is therefore:

(20)

Due to recent developments in deep learning and cognitive AI and because of

the complexity of such agents, this thesis implements an agent using an AI existing

architecture: the Icarus cognitive model developed by Stanford University

[Lang-ley et al., 2003]. Icarus is a hybrid cognitive AI architecture, funded by Defence

Advanced Research Projects Agency (DARPA) Information Processing Techniques Office, United States Office of Naval Research, and the Unites States National

Sci-ence Foundation. It incorporates various models from across computer sciSci-ence, and

it mainly focuses onaction and perception over cognition.

Furthermore, Icarus separates categories from skills, uses a hierarchical

struc-ture for long-term memory [Langley et al., 2004] and usescorrespondence between

short-term and long-term memory [Langley et al., 2009]. Those four fundamental notions of Icarus are the basis upon which we draw comparison with the biological

counterpart and the human imitation mechanisms and implement the set of those

mechanisms in software. The advantages of implementing Icarus as a software cog-nitive agent are that it allows to examine the abilities, algorithms, processes and

qualities that anartificially imitating learning agentshould or mayposes, formulate

a theoretical model, examine the hypotheses via experimentation, and consolidate our conclusions through evaluation.

1.5 Icarus Engine

The Icarus implementation (called hereinafterIcarus engine) is greatly inspired by

PBE which has its roots in Henry Lieberman’s work [Lieberman, 2001]. However

in stark comparison to Lieberman’s PBE (described in Chapter 2.6), in this thesis Icarus is deployed as a stand-alone autonomous agent with the sole purpose of

acquiring knowledge from the Internet.

The reason for aiming at acquiring knowledge from the Internet is its ubiq-uitous nature. According to the United Nations Telecommunications development

Sector (ITU-D) around 40% of the global population has Internet access [Pe˜

na-L´opez et al., 2009], and most of those users generate content, information, news,

knowledge and data. Most of the human knowledge is being accumulated on the

Internet, either in open and public sites such asWikipedia or in specialist platforms,

suchQuora orStackExchange. Other knowledge engines (such asWolfram Alpha or

DBpedia) offer tailored meta-data, and last but not least, the blogs, new-sites, RSS

feeds and social networks all provide free information and knowledge. Hence, the

core research question is rephrased as:

(21)

Using the largest knowledge pool in the history of the human civilisation is

a promising source from which future AI agents can mature and reach higher-levels of intelligence. Thus the Icarus engine aims to acquire knowledge extracted from

widely and freely available information found on the Internet. The Icarus engine is

the first step towards an agent which learns by being taught”how to read and

under-stand” the Internet data thereby transforming information into knowledge. Albeit

the domain is natural language, it is not constrained by algorithms or models

tai-lored for natural language processing (NLP) and should be able to parse and acquire knowledge from other domains. Its main purpose is to project textual information

found on the Internet onto a knowledge representation (KR) structure. It does that

by satisfying all the cognitive agent (CA) criteria set by [Lawniczak and Di

Ste-fano, 2010], but it is not limited by finite-states or heuristics. Furthermore, the

way thememory is organised adheres to the Icarus specifications of a hierarchical

and structured knowledge index, with corresponding short and long term memory.

The Icarus engine is in a sense a parser, which instead of being programmed how

to parse, learns how to parse by example. By doing so, this thesis researches and

experiments into the specifics ofimitation learning and extracts conclusions about

the suitability of such agents and systems for cognitive AI functionality.

1.6 Research Scope & Biological Plausibility

Choosing to implement such a CA as a parsing agent is due to the fact that such

an ability is considered to be one of the high-level developmental steps in humans:

learning to read and understand information [Stuart and Coltheart, 1988]. The

approach taken is that of simulation: the process of mimicking the outwardly

ob-servable behaviour of children who learn how to read by being shown repeatedly

text inputs of various (usually increasing) sizes. This thesis focuses on the third and fourth state of reading development [Seymour, 1999] due to the fact that those

are the stages where decoding and hierarchical structuring develops. However, the

work reported in this thesis draws no parallelism to the human brain, nor does it

claim tosimulate the same processes. Yet the choices of machine learning (ML) are

all biologically-inspired, some based on behaviourist psychology, others use

artifi-cial neural networks, and only a handful are mathematical or heuristic components.

This approach is thus indirectly based on thehuman information processing models

[Berger et al., 2013], but does not implement them or as a whole; it only appears

to be functioning in a similar way, in order to achieve similar goals, however as a

(22)

The advantage of taking this approach outweighs the effort of designing and

implementing what can seem to be a complex agent: first and foremost the work is

focused on an artificial agent which does not rely on pre-programmed logic but on

learnt logic. This agent’s learning is not constrained by the logic embedded in the

program, but is adaptive and flexible. The implementation and experimentation of

the Icarus engine theory is based upon the following rationale: not programming

an agent, only teaching it. Whereas the actual implementation does indeed require

to be programmed, it is done via a neuro-dynamic agent [Bertsekas and Tsitsiklis,

1995] using abehaviouristapproach, similar to how humans learn from reinforcement

Thorndike [1901]; Galef Jr [1988].

The advantages ofimitation in AI are the same as the premise of imitation

in nature; allowing agents to acquire knowledge and information from their peers,

their social structure and our society, as well as surviving and evolving into capable

entities. The implied novelty of imitating agents (cognitive or not) is promising: AI software which can seamlessly and effortlessly acquire and manipulate knowledge

and information from humans directly or indirectly (through the internet), robots

which can learn how to reason and use logic by example and through observing human interactions, and much more. A comparison between traditional software

systems and imitating agents can thus provide the incentive to further explore imi-tation and support the usage of such agents in real-life applications. The advantages

of enabling autonomous agents and systems to acquire and evolve their knowledge

base only recently have been explored as corporations are gearing towards AI as-sistants, such as Microsoft’s Cortana, Apple’s Sirii, and Amazon’s Alexa. It is a

fact that such AI agents require imitation because it is the only known mechanism

through which passive observation and proactive teaching enables information and knowledge to be acquired and manipulated. Therefore less central but still

impor-tant research questions are:

- What are the differences between learning by imitation and programming by example [Lieberman, 2001]?

- What are the advantages of agents which learn by imitation? [Dautenhahn

and Nehaniv, 2002a]

- How do artificially imitating agents compare to traditional software systems?

Other questions related to the imitation learning literature [Dautenhahn and

Nehaniv, 2002a], such as ”what makes a good teacher?” are still inherently relevant,

(23)

in AI has numerous applications and can be applied in a variety of ways with the

potential to change high-level cognitive functions, such aslearning, reasoning, logic,

decision-making, etc. Although all those areas are relevant and applicable to the

work described hereinafter, it would be impossible to include them all, experiment

with a broad array of applications, or address all the entailing issues from each of

those fields. Therefore, our only scope islearning and not logic, reasoning or other

cognitive abilities. However, the Icarus engine sets a basis upon which logic and

reasoning can take place in addition to learning using our theoretical model and software engine.

1.7 Contributions

- The main and foremost contribution to the field of AI is the formulation and

combinationof a Markov decision process (MDP) in atemporal-spatial fashion

through which learning of symbolic KR structures (conceptual graphs) takes

place [Gkiokas and Cristea, 2014a].

- This novel approach enables reinforcement learning [Sutton and Barto, 1998]

to manipulate as an episodic process the creation and representation of a KR structure learnt by example.

- The importance of this contribution is explained in detail in Chapter 3 and

challenges the way in which symbolic and connectionistic AI deal with data

and knowledge, due to it demonstratinghow those two foundational approaches

in AI can be bridged.

- Furthermore, I address ”learning by imitation” at the highest possible level in

AI, that of symbolism, but learn it via reinforcement and deep learning.

- The imitation paradigm is given by a human user and is decomposed based

on observations, similar to visual decomposition in the brain [Biederman and Gerhardstein, 1993], and inspired by the decoding process of the 3rd and 4th

reading developmental stages in infants [Seymour, 1999].

Furthermore I created new algorithms and used them in experiments;

de-composition heuristics, relational and attribute semantics and statistical inference.

Those algorithms were implemented as parts of the Icarus CA, and used in order to examine both accuracy and suitability of such agents in AI [Gkiokas and Cristea,

2016a] and simulate the imitation process in humans, thereby formulating,

(24)

earlier. The cascade of various learning models used within the Icarus CA included

artificial neural networks (ANN) and restricted Boltzmann machines (RBM) in com-bination with the reinforcement learning algorithm, as an action-selection

mecha-nism for KR construction [Gkiokas and Cristea, 2014b], thus addressing

observa-tional qualities of the agent and off-policy exploration as well as inference. I also

employed sparse and dense encoding with deep learning in Icarus in order to examine

how it compares to more traditional shallow networks, drawing conclusions on the

advantages and the complexity involved when using sparse non-processed encoding, whilst formulating alternatives to dealing with unknown input [Gkiokas and Cristea,

2016b]. In addition to experimenting with natural language understanding (NLU)

in the Icarus CA, this work expanded into the field ofmeta-learning by formulating

a new model based upon the same principle of MDP knowledge graph construction.

We usedabstraction of existing KR graphs and theorised it is possible to compress

andgeneralise knowledge intobeliefs; autonomously generated meta-KR constructs which represent a group or cluster of highly related KR instances [Gkiokas et al.,

2014].

The contributions therefore are numerous and address imitation as a

mecha-nism in AI and cognitive agents, all the processes involved, such asdecomposition or

decoding of paradigms, the main learning mechanism and models used to bothlearn

and associate paradigms with understanding of the input (both semantically and

syntactically) as well as a variety of learning models, algorithms and sub-processes

required by the ad-hoc Icarus engine. I have expanded all research questions and mapped the characteristics and attributes that govern them and describe possible

solutions to previous questions raised in the imitation learning literature

[Dauten-hahn and Nehaniv, 2002a].

1.8 Thesis Overview

This thesis is organised in the following manner: in Chapter 2 is described what has

been researched in the past, all related fields, models and systems. The theoretical agent model is formulated and analysed in Chapter 3, and correlated to both the

biological mechanisms and the Icarus CA design. Following the theoretical

descrip-tion, the data-set created to evaluate the Icarus engine is presented and analysed in Chapter 4. Chapter 5 analyses experiments carried out using the Icarus engine,

describes in detail the components and algorithms, and reports on results and find-ings. The last Chapter 6 discuss in detail various findings, conclusions and future

(25)

Chapter 2

Background and Literature

Review

In this chapter the background literature is detailed and an analysis and presentation

on the work on which this thesis is based is examined. This aids in justifying the

arguably unconventional - but highly utilitarian - approach taken.

2.1 Alan Turing and the intelligent machines

One of the forefathers of Artificial Intelligence was Alan Turing, amongst others such as Allen Newell, J.C. Shaw, Herbert Simon, John McCarthy and Marvin Minsky.

Turing envisioned AI not just as an an intelligent machine, but as an artificial

child, a synthetic entity which has to go through a developmental process to achieve intelligence [Turing, 1950]. How Turing had imagined that specific progression and

development of AI is matter of speculation or scientific debate [Muggleton, 2014],

but we can deduce from his paper that he believed AI would have to follow a developmental phase similar to that of infants. And as aforementioned, two of the

most important developmental phases in infants revolve around imitation learning,

symbol decoding and hierarchical representation [Seymour, 1999].

The brief history of computer science showcases that modern AI has not

indulged Turing’s original thoughts and ideas. Two different schools of thought

have existed since the birth of AI:Symbolic AI, also know as good old fashioned AI

(GOFAI), andConnectionism orConnectionistic AI1. Whereas Symbolism takes a

modelling or aprogrammatic approach,Connectionism takes a pseudo-biological or

1_{Also known as}_{sub-symbolic AI}_{, albeit that term may have been coined by the symbolic school}

(26)

network-centric approach [Smolensky, 1987]. It is not clear if Turing had intended

to take any of the two approaches or combine them; in fact although it was in the that the first artificial neural networks were theorised, actual implementations

and models were published later, in the late 1950s, after his death. Therefore,

Turing’s approach was mostly theoretical and albeit based on biomimicry, it did

not explicitly specifyhow those intelligent machines would be created, since when

he wroteComputing Machinery and Intelligence [Turing, 1950], Connectionism was

still in its infancy.

Around the same time, the field ofCybernetics [Wiener, 1948] was described

as a interdisciplinary field for examining systems, their structures and organisation.

Whereas Norbert Wiener differentiated Cybernetics from AI, there was a clear ten-dency of describing AI systems and agents in a cybernetic fashion: as well defined

and modelled systems. The late 1940s and 1950s therefore saw the genesis of modern

AI, which was for the largest part based on symbolic approaches: models, theorems, well defined processes and programs.

2.2 Imitation in Nature

2.2.1 What is imitation?

The exact nature of imitation has been studied only in modern sciences; it wasn’t until the early 20th century that Edward Thorndike begun studying imitation in

animals [Galef Jr, 1988; Thorndike, 1901]. The study of imitation in humans was

be-lated; only after the modern field of Psychology started advancing (e.g., Jean Piaget and developmental theory, Raymond Cattell and crystallised intelligence, Burrhus

Skinner’s behaviourism and reinforcement, Erik Erikson and developmental psy-chology, and Albert Bandura and social cognitive theory) was imitation given some

attention. The field of developmental psychology and in specific the development

of children is what mostly interests us. As discussed in Chapter 1, Sections 1.1 and 1.2, one of the core mechanisms which enable human intelligence to develop and

progress into what Cattell refers to ascrystallised intelligence [Cattell, 1963], is the

ability to learn from others, our environment, our parents and peers.

That development enables us to acquire knowledge through subjective

ex-perience, what Haikonen refers to as qualia [Haikonen, 2009]. However, that

ex-perience is in effect a knowledge transference, either sensorimotor, or more general

and abstract. Verbalisation and learning by description plays a role (such as a

teacher explaining things), but even higher level learning (such as learning how to

(27)

phys-ical and non-physphys-ical: the hand learns how to draw and write, yet the brain is

conditioned into learning representations of letters, and then words. Psychologists suggest that imitative learning is a mechanism inherited and shared across humans

[Myowa-Yamakoshi et al., 2004; Ferrari et al., 2006] and one which is the product of

evolution; a hypothesis supported by the fact that not only humans are capable of imitation, but so are homininae, primates, animals and insects [Fritz and Kotrschal,

2002; Herman, 2002; Visalberghi and Fragaszy, 2002; Galef Jr, 1988], thus making

imitation a cognitive function shared across multiple species (but not necessarily of

the same competence level). Imitation therefore is the ability to acquire knowledge

from peers or others either via active demonstration, or passive observation.

2.2.2 How does imitation work?

Not taking into account the neurological and morphological properties of the brain,

and how those have evolved or how they enable imitation, Albert Bandura explicitly

states (direct quotation):

”Attentional processes regulate exploration and perception of modelled

activities; through retentional processes, transitory experiences are

con-verted into symbolic conceptions that serve as internal models for re-sponse production and standard for rere-sponse correction; production

pro-cesses govern the organisation of constituent sub-skills into new response

patterns; and motivational processes determine whether or not observa-tionally acquired competencies will be put to use.” [Bandura, 1986]

Analysing the above quote we can deduce a few key components and prop-erties of imitation and its overall structure:

- Perception produces a model (or a modelled activity)

- A process retains a symbolic conception (through experience)

- The internal models serve the purpose of providing responses (or correcting responses)

- Motivation determines if a competency (skill) will be re-used.

Thus, and similar to how Thorndike [Galef Jr, 1988] and others have demon-strated, imitation uses rewarding (either via motivation or reinforcement) in order

tolearn a competency. The subjective experience (qualia) must perceive a temporal

(28)

uses some form of symbolism or conceptualisation and serves the purpose of being

reused, either so that the agent can provide responses, or correct its responses. The key components are a model or structure which uses symbolism to represent an

action, a sequence or behaviour, an episode or sequence, the related reinforcement,

and the re-usability of the model.

The perceptive and cognitive abilities which relate to imitation have been

studied in human infants [Seymour, 1999] and involve a decomposition (or

decod-ing) of structures, objects or symbols and the internal hierarchical modelling (or

representation) of those within the short-term, and if rewarded or reinforced, into

the long-term memory. Visual understanding also uses decomposition [Biederman

and Gerhardstein, 1993], thus it appears that the process of breaking down stimuli or information into primitives or archetypes, prior to internally representing them

or modelling them, is a commonly occurring phenomenon.

2.3 Symbolic Artificial Intelligence

The symbolic school of thought has its roots in philosophical and centuries old beliefs

about what intelligence and cognition is, and many of those beliefs stem from Julien

Offray de La Mettrie and theL’homme machine (Man - Machine) [de La Mettrie,

1912]. The philosophical beliefs that followed suit of de La Mettrie’s medical

ex-periments, although revolutionary for his time2 set the path for future research and

development in Medicine, Psychology and eventually Artificial Intelligence. Those

beliefs described the human mind as a complicated machine, a notion which

sup-ports Artificial Intelligence; it did however indoctrinate later research in Psychology, Cognitive sciences and AI to the effect where mental and cognitive abilities were

be-lieved to be definable as a process or a model, such as in the case of Allen Newell and

theLogic Theorist [Newell and Simon, 1956] or theGeneral Problem Solver [Newell and Shaw, 1959].

This belief still echoes in modern AI: many designs for cognitive systems

take a purely programmatic approach, identical to a finite state machine (FSM)

[Gill et al., 1962; Minsky, 1967] and thereby limited by its very finite nature and

the logic enabling it. Admittedly, Symbolism and FSM are a necessity and part of

what is today the modern field of AI and Von Neumann computers operate on a

symbolic level. There are realistically many advantages to modern computing and

Symbolism which are not easily dismissed and a plethora of algorithms and models

2_{He was forced to quit his position with the French Guards, due to his materialistic and}

(29)

with decades of research and a proven track record.

2.3.1 Knowledge Representation

One of the topics most important and relevant to imitation learning of Symbolic

AI is Knowledge Representation or KR [Sowa, 1999]. Representing knowledge is

a major field under AI, with close ties to Philosophy and Psychology. Whilst the philosophical aspects of KR are outside the scope of artificial imitation (e.g., ”what

is knowledge”) the psychological attributes aren’t; since the research in this thesis is

using the developmental cycle of human intelligence as a reference point, employing KR requires that this work is at least partially based on models which are perceived

to be plausible mental representation models.

It is not by chance that the issue of KR first arose during the development

of the General Problem Solver [Newell and Shaw, 1959], that was also the moment

when the symbolic school of AI underpinned the basis of KR. In the past three

decades, multiple KR models and schemes have been devised, ranging from the

family ofKL-ONE [Woods and Schmolze, 1992], to Sowa’sconceptual graphs (CG)

[Chein and Mugnier, 2008; Sowa, 1999, 1984], and to more modern schemes such

as the resource description framework or RDF [Klyne and Carroll, 2005] and the

web ontology language or OWL [Bechhofer, 2009]. Most KR schemes feature similar

designs: an ontology and the relations required to describe the hierarchy; it is intended to be used by computers (although it can be readable) and serves the

purpose of representation but may also enable reasoning [Sowa, 1999; Levesque

and Brachman, 1984]. KR can be used by first order logic or FOL [Fitting, 1990]

which operates on the actual KR structure, and may be combined with or represent

semantics [Fellbaum, 1998]. KR uses primitives (e.g., domain archetypes) which,

depending on the application domain, may change. Meta-representation (or meta-data) is also used in most modern KR schemes such as RDF and OWL, which are

appliedon top of or as extensions to the primitives.

The most important aspects or topics related to KR are:

- incompleteness orcompleteness (e.g., semantic, functional, refutation or syn-tacticcompleteness associated with statement in the structure) [Lipschutz and

Judith, 1916; Duffy, 1991]. Fuzzy logic is one of the sub-fields of AI which deals

with a certain degree with incompleteness [Nov´ak et al., 2012; Zadeh, 1996].

- definitions, universals, facts and defaults are general rules and patterns which

relate tospecificity and offer quantification and generality employed by logic

(30)

- non-monotonic reasoning or hypothetical reasoning, asserts new hypotheses

based on rules or facts [Dung, 1995].

- expressiveness or functional completeness relates to the ability of adequately

expressing all truth tables when using first order logic (FOL) [Fitting, 1990]

expressions such asAND, OR as well asNAND, NOR.

- reasoning in general terms relates to the ability of the agent or system to

be updated, develop new inferences, and operate within reasonable time con-straints.

From the above list it is possible to identify related material which overlaps

KR, logic, learning and imitation. Whereas KR serves the purpose ofrepresentation

anddescription, it is used by logic. For the intents and purposes of this thesis, I am mostly concerned with KR and how it is to be used and manipulated by imitating

agents, rather than the logic enabled or applicable, which albeit relevant is outside

the scope of the work described hereinafter. There is a clear connection between KR and logic, how KR enables or allows inference, hypotheses and propositions,

and more important how the agent may learn to form such hypotheses, assertions

and propositions. Yet my scope is focusing on creating KR and not logically

ma-nipulating it.

Another form of KR is the meaning representation (MR) or the most

com-monly used term meaning representation language (MRL). Those structures have

been widely used in parsing (see Section 2.7.5) and are mostly related to natural

language processing (NLP). The phrase ”quick brown fox jumps over lazy dog” is

shown as a simple MRL in Figure 2.1, whereas the same phrase with directededges

is shown in Figure 2.2.

quick brown fox jumps

over

lazy dog

Figure 2.1: Simple MRL structure.

Whereas the first Figure 2.1 is overly simplistic, the addition of directed edges

(31)

quick brown fox jumps

over

[image:31.595.267.375.114.224.2]

lazy dog

Figure 2.2: MRL with Edges.

In certain KR such as a Penn treebank tree[Marcus et al., 1993], direction is often implied originating from the root label/node of the tree graph. Such structures can

often become esoteric or obscure, since they may include syntactic attributes mixed

with words and labels, making them hard to understand. A modern MRL form of the same phrase which includes propositional and syntactic meta-data is shown in

Figure 2.3; in this example, the blue labels in capitals are syntactic attributes (part

of speech tags, see Section 2.7.4) with extra information including a rudimentary propositional logic (edge labels in pink).

JJ

quick

JJ

brown

NN

fox

NNS

jumps

IN

over

JJ

lazy

NN

dog

amod amod

nn prep

pobj

[image:31.595.253.388.423.571.2]

amod

Figure 2.3: Annotated MRL with Edges and Meta-data.

The equivalent of that same phrase as a conceptual graph (CG) is shown in Figure 2.4. In this instance and according to Sowa’s publications [Sowa, 1999,

1984], rectangles depict concepts and circles depict relations. Sowa’s CG contain

implied relations and not just relations extracted from the text such as relations

between concepts. That approach might be confusing; whereas theedges describe

(32)

and relations are of different types. The edge label (number) indicates the order

of the edges and some of the labels in the concepts are implied and not contained

within the phrase, such as the relationsattr (attribute),act (actions),loc (location)

andrcpt (recipient).

quick brown

ANIMAL: Fox

jumps

over

lazy PET: dog

Attr Act

1 2

1

Loc

2

Rcpt

Attr

1

1 1

Figure 2.4: Conceptual Graph example.

Furthermore, Sowa in his original publications providedcontradicting

exam-ples such as the one shown in Figure 2.5. Those examexam-ples are called adisplay form,

which is a visualised conceptual graph. Figure 2.5 demonstrates the phrase ”a cat

on a mat”; the relation on is contained within the phrase, and is used to link the

concepts.

[image:32.595.229.410.185.386.2]

cat on mat

Figure 2.5: Conceptual Graph Directed Edges.

Part of the reason why Sowa treats CG (in their display form) in such a

way maypossibly be because he derives them from a logic form. The logic formula

of Figure 2.4 may be as shown in (2.1) when ignoring the attributes. The root is

the actionjump which in the Figure is shown as a concept and based on a Pierce

formula from which Sowa derived the CG in the display form.

(33)

[F ox]−>(J ump)−>[Dog]. (2.2)

Similarly, Sowa describes Figure 2.5 as having a linear form, as shown in

(2.3).

[Cat]−>(On)−>[M at]. (2.3)

Theconceptual graphs interchange format (CGIF), which was developed sim-ilar to the ISO Common Logic Project, would represent the linear form of (2.2) as

(J umps[F ox][Dog]) and the linear form of (2.3) as (On[Cat][M at]). Therefore, Sowa

did not attribute importance to either the order of appearance, nor to the

direc-tion of edges. He hypothesised that, albeit those forms may appear different, their

semantic foundations translate to the same predicate calculus. Later, researchers in CG have used a better defined approach [Obitko, 2007; Amati and Ounis, 2000],

in which the order of the graph is drawn from the first appearing node (concept

or entity), and the direction of the edges is important, as it demonstrates the logic continuation of the phrase being represented.

Furthermore, whereas Sowa used relations in an implicit manner (e.g.,

ex-tracting roles of concepts), there is no rule or limitation as to why relations (the

nodes within the graph) must beimplied entities; Sowa in his examples also used

relationsexplicitly obtained from the original phrase.

2.3.2 Criticism and Limitations of Symbolic Artificial Intelligence

When researching imitating agents, we expect a form of KR structure to be produced

which is identical orhighly similar to the one the teacher, paradigm or demonstrator

provided to the agent. That process, when described using symbolic AI, is a

well-defined model, an algorithm, a program or a heuristic process, as that is the nature of

Symbolism. Therefore, that process is in fact afinite state machineor a combination

of FSM, described by a program or agent operating on the information on a symbolic level. In the history of AI the two cornerstones supporting the suitability of symbolic

AI for our intents and purposes are theChurch-Turing thesis [Searle, 2001] and the

Myhill-Nerode theorem [Ignjatovi´c et al., 2010]. The Church-Turing thesis states that a function is computable by a human following an algorithm, if it is computable

by a Turing machine. Church-Turing thesis implies what can physically be computed

by a computer [Piccinini, 2011] or what could realistically be computed, however multiple researchers in the past have argued for or against it [Goldin and Wegner,

(34)

The Myhill-Nerode theorem offers insight into what can and cannot be done

when using FSM: it suggests that any language can be recognised by a model, by mapping strings in the language to unique accepting states, and strings not in the

language to unique non-accepting states. However, not all languages are regular,

that is they do not correspond to the language accepted by any FSM, and equiva-lently, there may be no regular expression to represent that language. Furthermore,

natural languages contain ambiguity and contradictions [Gorrell, 2006], which

cre-ate exceptions to rules and the logic enabled by the FSM. Finally, the FSM may

suffer from the halting problem [Stannett, 1990], and therefore the use of FSM,

al-beit advantageous, due to decades of research and the existence of multiple models,

theories and algorithms, does have the aforementioned drawbacks.

2.4 Connectionistic Artificial Intelligence

Connectionism takes a black-box approach and is based on Neuroscience and the

observation that our own intelligenceemerges from neural networks in the human

brain. Neuroscience and modern medical imaging have allowed us to observe and

analyse the neuronalnetwork mechanisms which enable cognitive functions and

in-telligent behaviour. It is from Neuroscience that artificial neural networks (ANN)

were inspired, and in 1943 McCulloch described the first artificial neuronal model

[McCulloch and Pitts, 1943]. In the 1949 Donald Hebb first describes Hebbian

net-works [Hebb, 1949], but it isn’t until 1958 that thePerceptron is published

[Rosen-blatt, 1958], the precursor of modern ANN. In 1969 Minsky criticises Connectionism

and neural networks [Minski and Papert, 1969] due to their computational limita-tions in training and deploying them, but most important due the fact that they

couldn’t learn to perform an exclusive (XOR) on the input. Those issues were

ad-dressed in the 1980ies and 1990ies with the advent of the personal computer, and

multi-layered networks in combination with back-propagation [Werbos, 1974]

show-cased that they could process complex input. At the same time GOFAI lost interest

from the research community, yet connectionism and the related ANN did not make any considerable breakthroughs until the late first decade of the third millennium

withdeep learning.

2.4.1 Artificial Neural Networks

An artificial neural network is a non-biologically plausible, yet bio-inspired network.

It works upon the premise of associative memory and learns to associate input

(35)

n1

n2

n3

n4

n5

n6

Input

hidden

output w1

w2

w3

w4

w5

w6

w7

w8

[image:35.595.249.393.115.248.2]

w9

Figure 2.6: Example of a Neural Network.

[Picton, 1994; Hertz et al., 1991; Hopfield, 1988]. There are multiple types of neural

networks, but we’ll only examine and present the most typically used, the fully

connected feed-forward neural network, also known as the single-layer or multi-layer

perceptron. From the Figure 2.6 we observe what a neural network is: layers of

nodes (or neurons) which are connected by synapses, or weights. In that figure

above, there are three layers: the input layer with nodes n1 and n2, one hidden

layer with nodes n3, n4 and n5 and an output layer with a single node n6. The

premise upon which the feed-forward perceptron relies is forward propagation: an

input vector of two values is fed into the input nodes, and is the activated, using anactivation function, most often alogistic function, with the most popular shown

below.

f(x) = 1

1 +e−x. (2.4)

f(x) = e

x₋_e−x

ex₊_e−x. (2.5)

The above functions are the sigmoid shown in (2.4) and hyperbolic tangent

shown in (2.5); others less popular are the sigmoid bipolar in (2.6), the scaled

hy-perbolic tangent [LeCun et al., 2012] in (2.7), and thesoft-sign in (2.8).

f(x) = −1 + 2

1 +e−x. (2.6)

f(x) = 1.7159∗tanh(2

(36)

f(x) = x

1 +|x|. (2.8)

A characteristic of the logistic functions is that they produce a S shaped

output. Non-logistic functions such as theGaussian have been used but are not as

popular. Common criticism for such networks, is that logistic activation functions

cannot be used in deep networks; often there is no need for such networks, and

one hidden layer may suffice to learn the task at hand. However, logistic activation

functions often lead tosaturation[Heaton, 2015], an issue which has been addressed

by usinglinear activation functions.

The forward propagation in neural networks is a task of accumulating the

inputIj multiplied by the weightsWi for each node and activating it viaf(Ii). The

most commonly used approach is to do vector and matrix arithmetic: we obtain

previous layer node output denoted as Oj where subscript j is the previous layer.

For input layers, Oj is the input value Ij whereas for hidden layers we substitute

Oj with Ii, e.g., the output of the activation function f(x) from the previous layer

j. Multiplying each node output Oj ·W i with every corresponding weight Wi is

a vector-matrix product, we then sum each input vector for the nodes in layer i:

P

(Oj ·Wi), e.g. reducing each produced matrix into a row vector of values, each

row corresponding to node input Ii in layer i. Finally, each value in that vector

isactivated using one of the functions f(Ii) as shown earlier in (2.4,2.5,2.6 or 2.7).

The above step is repeated for all layers; that is in essenceforward propagation.

Training neural networks is considerably more complex; one of the early

critiques on neural networks was the computational requirements, and the biggest issue was that training was inefficient [Minski and Papert, 1969]. Nowadays, there

are many training algorithms, the most notable of which are theback propagation

(or BPROP); other methods also have been used, such as reinforcement learning,

theLevenberg-Marquardt [Mor´e, 1978], and the Broyden-Fletcher-Goldfarb-Shanno

algorithm [Shanno, 1985] abbreviated as (BFGS), amongst the most famous. The

most commonly used training approach for ANN and shallow networks throughout the last two decades, has been the BPROP as described by Seppo Linnainmaa

[Linnainmaa, 1970], and was later demonstrated by application [Werbos, 1982]. The

basic mechanism behind BPROP, is that the observable error at the output layer, is back-propagated to account for each weight adjustment, until the desirable, or

highly similar to the desirable output is produced by the network. We define the

error asE shown in (2.9), e.g., the squared difference for each output node, where

(37)

E= (yi−yˆi)2. (2.9) We then proceed to reversely iterate all the layers, first we calculate the

output delta error, shown in (2.10).

δi=−E·f0

X

(Oi)

. (2.10)

In (2.10) f0 is the prime or derivative of the activation function and Oi is

the node input. The sigmoid and tanh derivatives are shown in (2.11) and (2.12).

f0(x) =f(x)·(1−f(x)). (2.11)

f0(x) = 1−tanh2(x). (2.12)

For hidden layers we calculate the value of the derivative on the node input

f0(P

(Oj·Wji)), e.g., the vector-matrix multiplication. Then those values are

mul-tiplied by thenext weightsWik and the next layer’s delta error3 δk, which are called

node deltas, shown in (2.13).

δi =f0

X

(Oj·Wji)·

X

(Wik·δk)

. (2.13)

Finally, we calculate the weight gradient for layeritok, as shown in (2.14),

with each gradient multiplying the next layer’s node delta and the observed node

outputOi.

∂E

∂W ik =δk·Oi. (2.14)

That gradient is then used in the update of the weight values. There are

different ways of updating weights; in batch training, the gradients are summed

P ∂E

∂Wik

and then used to adjust the individual weights at the end of an epoch

(a training sample iteration), whereas in on-line training, the weights are updated

after propagating a single training sample. The update rule in BPROP for batch

training is shown in (2.15), the time-steptdefines the index in time, and hencet−1

is the previous update, thelearning rate α defines how large adjustments are made

and themomentum µaffects current adjustment using previous adjustments.

3

Please note we swapδkwithδi from formula (2.10). At every backwards iteration, we replace

δkin (2.13) with the next layerδi. Because this is a reverse iteration, we start at the output layer,

(38)

∆W_ik(t)=α·

∂E ∂Wik

+µ∗ ∆W_ik(t−1). (2.15)

Other training algorithms have been based on BPROP, most notablyresilient

back-propagation (RPROP) [Riedmiller and Braun, 1993], as well as many of its

derivatives. The RPROP uses incremental small weight adjustments, and is assumed

to be faster than BRPOP, as shown in (2.16).

∆W_ik(t)=

        

−∆W_ik(t), if_∂W∂E

ik (t)

>0

+∆W_ik(t), if_∂W∂E

ik (t)

<0

0, otherwise

. (2.16)

The increments and decrements taking place are constant (or range bound)

and ∆W_ik(t) increases or decreases by η, where 0 < η− < 1 < η+. Redmiller and

Braun published akernel of the algorithm, and reported significantly better results

than BPROP [Riedmiller and Braun, 1993], since then newer versions have been

published.

Neural networks have various ways of calculating theerror; the most notable

being themean square error (MSE): _n1Pn

i=1(ˆyi−yi)2 fornsamples. Nowadays for

classification thecross entropy/log loss(CE) [Heaton, 2015, 104,120] is also used as

shown in (2.17) whereyi is the ideal node output and ˆyi is the actual node output4.

CE=−1

N

X

i=1

yilog(ˆyi) + (1−yi)log(1−yˆi)

. (2.17)

All the above networks are trained in a supervised manner where training

samples are obtained and associate input to specific output. Non-supervised learning before the advent of deep learning was a peculiar topic; the few models able to learn

unsupervised were theself-organising maps (SOM) [Kohonen, 1990], the

Hebbian-inspired gas networks [Fritzke et al., 1995] and the K-means family of clustering

kernels [MacQueen et al., 1967].

Evidently, the field of neural networks had its ups and downs; in its infancy it was heavily criticised, it appeared to not live up to the expectations of revolutionising

AI, and even after the 1980 developments they hadn’t been widely adopted.

4

(39)

2.4.2 General Purpose Computing on Graphic Processing Units

Part of the reason of the advent of deep learning has to do with advances and recent changes from the traditional ANN to deep networks and activation functions, but it

is also attributed to the new GPU hardware. Most of the graphics processing units

(GPUs) on modern computers have hundreds of simple cores and even

consumer-grade GPUs nowadays have thousands of cores. At the early third millennium

researchers started using GPUs for general purpose computing [Thompson et al.,

2002; Pharr and Fernando, 2005], coined general-purpose computing on graphics

processing units(GP2U), once it was realised that GPUs favoured parallel algorithms

[Nickolls et al., 2008; Che et al., 2008; Owens et al., 2007] such as the training

algorithms which are used for neural networks.

The nature of the matrix form operations on the input Ij and weights Wi,

the vector-matrix multiplications during forward propagation or the back propaga-tion such as (2.13, 2.14) as well as the training update rules such as (2.15, 2.16)

can execute a lot faster when using GP2U. Because GP2U kernels are able to run

asynchronous parallel operations on multiple training samples, weights and input vectors, a resurgence in the research of ANN took place before and during the rebirth

of the field, now calleddeep learning.

2.4.3 Deep Learning

The pragmatic approach of using GP2U from the research community enabled deep

learning to advance intorealistic applications dealing with computer vision, speech

recognition, pattern recognition, classification, prediction, regression, approximation

and auto-encoding.

Whereas the early ANN were mostly used fortoy problems and trained using

small data-sets, deep learning is able to crunch big data and has proven usable in

real-life sceanrios and most important on par with human performance [He et al.,

2015; Taigman et al., 2014] thus having a profound effect to the adaptation of deep

learning in multiple fields and domains. Perhaps the most important breakthrough

was recently in 2016 by Google [Silver et al., 2016] whenDeepMind beat the world

champion in the gameGo by 5 - 0, thereby proving that deep learning and

Connec-tionism are capable of super-human performance. Modern ANN (hereinafter deep

networks ordeep learning) differ from the traditional ANN in five distinct ways:

- Traditional ANN use logistic activation functions [Zadeh et al., 2010], e.g.,

sigmoid, tanh, arctan. Deep learning most often use linear functions such as

(40)

and Bengio, 2010].

- ANN normally use no hidden layers, or one to two hidden layers. In comparison

deep neural networks use multiple hidden layers, hence the termdeep. In some

cases it is possible to stack different types of networks (e.g., as in the case of

auto-encoders).

- Although not limited to deep networks,node dropout,L1 andL2regularisation

are new optimisation techniques which evolved as optimisation techniques for

deep learning [Dahl et al., 2013; Ngiam et al., 2011b,a; Bengio, 2009].

- Other types of deep networks such as convolutional neural networks (CNN)

are profoundly different from traditional ANN. They use max pooling and

convolution layers (often in alternating multi-layered fashion) and perform

classification at the final fully connected layer [Krizhevsky et al., 2012; Ciresan

et al., 2011].

- Development of deep learning saw the use of soft-max as the output layer

activation, instead of using traditional logistic functions. Whilst soft-max is

applicable in shallow networks, its use in deep learning proved very successful [Glorot and Bengio, 2010].

The Figure 2.7 below showcases the difference between traditional and deep networks. More often than not deep networks are harder to train [Glorot and

Ben-gio, 2010], they have more hyper-parameters which do not always warrant better

accuracy or performance, such as in the case of increasing hidden layers and amount of nodes and quite often require more training iterations or epochs.

n1

n2

n3

n4

n5

Input hidden output

n6

n7

n8

n9

n10

n11

n12

n13

n14

hidden hidden hidden

n15

n16

Figure 2.7: Example of a deep fully connected neural network.

Other types of networks used for deep learning are thedeep Boltzmann

Imitation learning in artificial intelligence

warwick.ac.uk/lib-publications

A Thesis Submitted for the Degree of PhD at the University of Warwick

Permanent WRAP URL:

http://wrap.warwick.ac.uk/94683

Copyright and reuse:

This thesis is made available online and is protected by original copyright.

Please scroll down to view the document itself.

Please refer to the repository record for this item for information to help you to cite it.

Our policy information is available from the repository home page.

Imitation Learning in Artificial Intelligence

Alexandros Gkiokas

Department of Computer Science

Contents

Acknowledgments

Declarations

Abstract

Abbreviations

List of Tables

List of Figures

Chapter 1

Introduction

1.1

Human Intelligence

1.2

Imitation in Humans

1.3

Imitation in Artificial Intelligence

1.4

Cognitive Artificial Intelligence

1.5

Icarus Engine

1.6

Research Scope & Biological Plausibility

1.7

Contributions

1.8

Thesis Overview

Chapter 2

Background and Literature

Review

2.1

Alan Turing and the intelligent machines

2.2

Imitation in Nature

2.3

Symbolic Artificial Intelligence

2.4

Connectionistic Artificial Intelligence