Relational Visual Recognition


Faculty of Engineering Science

Laura Antanas

Dissertation presented in partial fulfillment of the requirements for the degree of Doctor in Engineering

June 2014

Supervisor:

Prof. Dr. Luc De Raedt

Co-supervisor:

Prof. Dr. ir. Tinne Tuytelaars

Examination committee:

Prof. Dr. Adhemar Bultheel, chair

Prof. Dr. Luc De Raedt, supervisor

Prof. Dr. ir. Tinne Tuytelaars, co-supervisor

Prof. Dr. ir. Herman Bruyninckx

Prof. Dr. ir. Maurice Bruynooghe

Dr. ir. Kurt Driessens (Maastricht University, The Netherlands)

Prof. Dr. Paolo Frasconi (Università degli Studi di Firenze, Italy)

Prof. Dr. David Hogg (University of Leeds, UK)


All rights reserved. No part of this publication may be reproduced and/or made public by means of print, photocopy, microfilm, electronic or any other means without the prior written permission of the publisher.

ISBN 978-94-6018-855-8 D/2014/7515/79


Abstract

In contrast to statistical visual recognition, relational visual recognition employs relational representations to solve visual recognition problems. For high-level tasks involving complex objects and scenes, low- and mid-level visual features do not always suffice. In these cases it is the component objects, their structure and their semantic configuration that guide recognition. These are best described in terms of relational languages or (higher-order) graphs. Relational approaches were popular in early vision work. They were convenient at the time, given the limitations of the hardware, data, scientific technologies and low-level vision routines, but they are rarely used in visual recognition today, mainly because of their purely symbolic nature. Nevertheless, recent successes in combining them with statistical learning principles, together with the maturity of the aforementioned resources, motivate us to reinvestigate their use. Starting from low- and mid-level solutions and building on top of them, (statistical) relational learning offers the prospect of moving towards more general, complete and effective relational visual recognition systems.

The thesis makes several contributions in this direction, three in the field of computer vision and two in the field of robotics. We first introduce a new relational distance-based framework for hierarchical image understanding. Applied to the house facade domain, the relational distance shows good detection results, while demonstrating the interplay between structural and appearance-based aspects. The second contribution is the use of a kernel-appearance-based relational language for scene classification and tagging. Part of this contribution is the use of the kernel-based language to understand images of houses. These recognition tasks use a similar relational representation and language, showing its generality and benefits. Our third contribution is a probabilistic logic pipeline for task-dependent robot grasping. It contains a new module based on causal probabilistic logic and symbolic object parts, such that, given a set of probabilistic observations about the world, it can semantically reason about object category, suitable tasks and pre-grasp configurations with respect to the intended task. Experimental results, including those obtained with a real robot platform, confirm the importance of high-level reasoning and world knowledge for robot grasping, as opposed to using solely local object shape information. Further, in the context of robot grasping, our fourth contribution is a relational approach to numerical feature pooling. It combines numerical shape features, qualitative spatial relations and kernels for graphs to recognize graspable object points. Finally, we contribute the use of sequential statistical relational techniques to capture underlying concepts in video streams. In particular, we focus on monitoring card games and learning to detect fraudulent sequences. Overall, the experimental results provide evidence that we can develop effective and real-world relational visual recognition systems that benefit from statistical


Acknowledgments

Working on this Ph.D. thesis has been a challenging, but also an exciting and rewarding experience. Now, at the end, I would like to thank several people. Without their support, advice and help I would never have been able to get this far.

First and foremost, I would like to thank my supervisors, Luc De Raedt and Tinne Tuytelaars. They always encouraged me to look for ideas and inspired me with great pointers and suggestions. They always took the time to discuss my ideas and problems, and to write many useful comments on my drafts. I am especially thankful for the support they gave me in the most difficult times, when I could not immediately see the results of my work. Their persistent belief in me was a great motivation during this Ph.D.

Next, I want to kindly thank all the members of my jury, Maurice Bruynooghe, Herman Bruyninckx, Kurt Driessens, Paolo Frasconi and David Hogg for carefully reading my text and for their highly appreciated feedback. Their suggestions improved the content of this dissertation. I also want to thank Adhemar Bultheel for chairing the defense.

Research involves working with many people. The help and input I received during my Ph.D. was of great value. My thanks go to the entire DTAI research group for creating a motivating and enjoyable environment. I would also like to thank the people with whom I wrote papers and discussed research ideas, who helped me with feedback or made my time at work more fun by sharing an office, going for coffee or having nice chats. I start with Kurt D, Tom and Robby, my first officemates in the DTAI group. Tom and Kurt D, thank you for your help on our first paper together. It whetted my appetite for research and gave me a great and highly motivating start to my Ph.D. Next, I want to thank Angelika, Bernd, Bogdan, Davide, Fabian, Guy, Ingo, Jose, Mathias, McElory, Tias and Thanh for being not only wonderful colleagues or officemates, but also my friends, helping me in difficult moments on both the professional and the personal side. I also thank Francesco, Martijn and Parisa for the times we shared as officemates. Furthermore, I want to thank my co-authors: Bogdan, David, Fabian, Fabrizio, Jose, Kristian, Marion, Martijn, McElory, Paolo, Plinio, Niels, Thomas and Wannes. Special thanks to Mathias, McElory, Ingo, Fabrizio, Paolo, Kurt DG and Francesco for our long discussions on different research ideas. I also want to thank Albrecht, Anton, Bjorn, Dimitar, Fernando, Hendrik, Jan Ramon, Jesse, Joaquin, Joris, Kostas, Siegfried, Theo and Wannes for their help on different work-related issues during these years. Furthermore, I would like to thank my friends. They have always encouraged me to finish my work and always offered me moral support. It is an honour to have you as friends.

Finally, I sincerely thank my family. First, I want to thank my husband, Cip, for his patience and understanding during these years. There were days when I had to work very late or could not spend more time with my family because a conference deadline is a deadline! I deeply thank my parents for supporting and helping me throughout all my years of study, for creating an environment that allowed me to focus on my studies, and for always encouraging me to do what I was passionate about and to follow my dreams. I want to thank my sister Simona for being such a wonderful friend. Very special thanks go to my son, David, for being such a good child. Thank you so much for your patience, my sweet baby. David, my dear little boy, mommy thanks you for your patience. David and Ana, mommy loves you very much. Thank you, my dear family, from the bottom of my heart!


Contents

Overture

1 Introduction
1.1 Context
1.1.1 Machine Learning, Computer Vision and Robotics
1.1.2 Visual Recognition
1.1.3 Statistical Relational Learning
1.2 Motivation and Research Question
1.3 Thesis Contribution
1.4 Thesis Roadmap
1.5 Publication List

2 Preliminaries
2.1 Foundations: Machine Learning and Reasoning
2.1.1 Statistical Learning and Reasoning
2.1.2 Relational Data Representations and Learning
2.1.3 Statistical Relational Learning and Reasoning
2.2 Background: Visual Recognition and Robot Grasping
2.2.1 Local Features, Points and Regions of Interest
2.2.2 Feature Descriptors
2.2.3 Object Recognition and Scene Understanding
2.2.4 Robot Grasping

I Relational Scene Understanding

3 History of Relational Representations in Visual Recognition
3.1 Back in History: Syntactic Pattern Recognition
3.2 Lessons Learned and the Move towards Statistical Learning
3.3 Bringing Back Relations in Visual Recognition
3.3.1 Timeline and Axes for Discussion
3.3.2 Recent SRL-related Work in Visual Recognition
3.4 Conclusions

4 Understanding Images of Houses Relationally
4.1 The Hierarchical Framework
4.2 From Images to Visual Primitives
4.3 Relational Problem Formulation
4.3.1 Declarative and Relational Feature Construction
4.4 A Relational Distance-based Approach
4.4.1 The Distance Metric
4.4.2 Contextual Candidate Selection
4.5 A Relational Kernel-based Approach with Context
4.5.1 Graphicalization in kLog
4.5.2 The Kernel Function
4.6 Post Processing
4.7 Experiments
4.7.1 Datasets and Evaluation
4.7.2 Baselines and Comparisons
4.7.3 Results
4.8 Related Work
4.9 Conclusions and Future Work
4.9.1 Future Work

5 Relational Scene Classification and Tagging
5.1 Scene Primitives
5.2 The Relational Scene Representation
5.3 The Relational Learning Tasks
5.3.1 Graphicalization in kLog
5.3.2 Feature Generation
5.4.1 Datasets and Evaluation
5.4.2 Features Used
5.4.3 Results
5.6.1 Future Work

II Relational Recognition for Robot Grasping

6 Leveraging World Knowledge and Low-Level Data for Robot Grasping
6.1 The robot grasping scenario
6.2 Task-dependent Grasping: A Probabilistic Logic Pipeline
6.2.1 The proposed pipeline
6.2.2 Vision-based Scene Description
6.2.3 The Probabilistic Logic Module
6.2.4 Observations about the world
6.2.5 World knowledge: ontologies and affordances
6.2.6 The CP-theory for semantic grasping
6.2.7 Shape-based Grasping
6.3.1 Datasets and evaluation scenarios
6.3.2 Evaluation measures
6.3.3 Results and discussion
6.4 Related work
6.4.1 Visual-dependent grasping
6.4.2 Task-dependent grasping
6.4.3 SRL for robot grasping and other robotic tasks
6.5.1 Future Work

7.1 Robot grasping primitives
7.2 Relational Grasping Problem Formulation
7.2.1 Data modeling
7.2.3 The Relational Problem Definition
7.3 Relational Kernel Features
7.3.1 Soft matching
7.3.2 Hard-soft matching
7.4.1 Dataset
7.4.2 Evaluation measures
7.4.3 Results and discussion
7.6.1 Future Work

III SRL for Video Sequence Recognition

8 Monitoring Card Games using SRL for Video Sequences
8.1 Card Game Video Streams as Relational Sequences
8.1.1 Relational Uno Sequences
8.2 Learning Statistical Relational Models from Relational Sequences
8.2.1 R-grams
8.2.2 TildeCRF
8.3.1 Datasets
8.3.3 Results
8.4 Augmented r-grams
8.6.1 Future Work

Finale

9 Summary and Future Work
9.1 Thesis Summary
9.2 Discussion
9.2.1 General Remarks and Take Away Messages
9.3 Future Work

Appendix

A Simulated UNO datasets

Bibliography

Curriculum Vitae


List of Figures

1.1 This dissertation is situated in three subfields of AI: statistical relational learning (right), computer vision and robotics (left).

2.1 The max-margin in SVMs (right). The classes to be separated are −1 (circles) and +1 (rectangles). The dotted arrow is the margin. The function φ maps the data into a feature space where the nonlinear pattern (left) is now linear (right). The kernel computes inner products in this feature space directly from the inputs.

2.2 Graphical representation of a linear-chain CRF.

2.3 How would you characterize this train? A possible way is by “every second car that is not an engine and has the same shape as its cargo”. This rule can distinguish this train from others that do not have similar structure and properties.

2.4 (Partial) graph of a real-world dining room visual scene. Rectangles are entities, diamonds are relationships among entities and they have properties.

2.5 E/R diagram for the train domain example.

2.6 Examples of dense sampling, interest points and regions of interest on a house facade image.

2.7 Illustration of the HOG descriptor (a) and PFH computation [Rusu and Cousins, 2011] (b).

2.8 The Gist descriptor of a bar scene.

2.9 An illustration of the BoW and the spatial pyramid representations.

2.10 Strategy of grasp selection.

3.1 Blocks world scenes; (a) illustrates the blocks world in [Roberts, 1963] (photograph used with permission of Lawrence Roberts), while (b) shows the representation scheme of vertices, regions and edges used by [Guzmán, 1968].

3.2 Examples of graph-based representations used in early vision.

3.2 Two hierarchical logical/relational representational schemes for model-based image understanding in early vision.

4.1 Examples of house facades in Eindhoven. The third image from left to right is a house facade annotated with windows, doors and individual houses.

4.2 The hierarchical framework.

4.3 Information flow at one layer: detection of visual primitives, relational representation and declarative feature construction, relational distance/kernel module, statistical learner and regions selection.

4.4 Examples of corner detections in an image at the primitive layer.

4.5 E/R diagram.

4.6 Description of the house facade image at the object layer. Entities are purple/yellow squares, relationships are diamonds (green/blue for spatial/functional constraints, grey for membership constraints), properties are circles. Candidate entities not belonging to a class of interest are empty squares. A visual interpretation i = (x, y) is on the right; x specifies the input features, while y is the learning target.

4.7 A description of a facade image at the house layer. Entities are yellow/red squares, the rest is kept the same as for the object layer.

4.8 Graph representations of an example (left) and an image

4.9 Part of the graphicalized visual interpretation in Figure 4.5(a). A neighborhood-pair feature with R = 2 and D = 4 is marked in yellow. The root vertices or kernel points are the candidate vertices and the balls are marked as yellow ellipses.

4.10 Data flow in the four-level hierarchy of the facades domain. Input layers: pixels, corner primitives and objects. Corresponding output layers: corner primitives, objects and houses, respectively.

4.11 Object layer segmentation, class door, D60. The influence of the structure component w_s on precision/recall values for different values of k.

4.12 Object layer segmentation, class window, D60. The influence of the structure parameter w_s on precision/recall values for different values of k.

4.13 Object layer segmentation, class door, D164. The influence of the structure parameter w_s on precision/recall values for different values of k.

4.14 Object layer segmentation, class window, D164. The influence of the structure parameter w_s on precision/recall values for different values of k.

4.15 House layer segmentation (annotations), class house, D60. The influence of the structure parameter w_s on precision/recall values for different values of k.

4.16 House layer segmentation (annotations), class house, D164. The influence of the structure parameter w_s on precision/recall values for different values of k.

4.17 Hierarchical segmentation, class house, D60. The influence of the structure parameter w_s on precision/recall values for different values of k.

4.18 Hierarchical segmentation, class house, D164. The influence of the structure parameter w_s on precision/recall values for different values of k.

4.19 PR curves, classes window, door, house using the fixed split, D60,

4.20 Relational distance (RD) vs. baselines. PR curves, class house, D164 (5-fold cv). We recall that performance for our RD (hierarchy) approach is measured as a precision-recall point due to the selection step.

5.1 Sample indoor scenes belonging to categories inside pool, restaurant, bar and office (from left to right).

5.2 Visual interpretations of the office scene containing instances of the relational learning tasks considered.

5.3 E/R modeling of the two tasks. Rectangles denote entity vertices, diamonds denote relationships, and ovals (except obj id) denote properties.

5.4 E/R groundings on a particular image for the object recognition task. Each obj/3 relation is a training/testing instance. The target is the dotted diamond. The subgraph pair roots are marked in green. The paths with distances D = 1 (case a) and D = 2 (case b) are marked with a thick, dashed line. The radii R = 0 (case a) and R = 1 (case b) are marked as ellipses around the roots.

5.5 Kernel features calculation reproducing the BoW setting. They are obtained using the exact (or hard) match kernel for obj and p_L0 as kernel roots and R = 0/D = 1. The graph identifier (i.e., 314) is computed as the hash of the sorted list of edge hashes. An edge hash is computed as the hash of the sequence of the two endpoints' new labels (i.e., 11 15). The new label of a vertex is calculated as the hash (i.e., 11) of the sorted list of distance-vertex label pairs (i.e., 1root0p_L0w1 1obj).

5.6 Kernel features calculation for the object recognition task using soft matching without context (a) and with context (b). Hyperparameters used are R = 1/D = 0 for (a) and R = 1/D = 2 for (b).

5.7 Graphicalized (partial) interpretation of the image for the scene classification problem. Illustration of NSPDK features when D = 2, R = 2 for the same graphicalized interpretation. The subgraph pair roots are marked in green. The path with distance D = 2 is marked with a dashed line and the radius as ellipses around the roots. The roots are, in this case, nodes with signature name obj or object entities.

5.8 Scene images misclassified by Object bank/Gist and correctly classified by our relational approach (top). Scene images where our approach fails (bottom).

6.1 Robot grasping scenario. The table is in front of the mobile platform, the arm is vertical, the objects are on the table and the range sensor is marked by the green rectangle.

6.2 A partial point cloud of a can placed on the table. The (i, j, k) is the reference frame of the camera centred at the sample point and its normal is the black line. The (i1, j, k1) is the reference frame of the 3D grid, which is obtained by rotating the (i, j, k) frame along the y axis.

6.3 The task-dependent grasping pipeline on a cup point cloud example. Top row (left to right): object (1), symbolic object parts (2) with labels top (yellow), middle (blue), bottom (red), and handle (green), k-nn graph (3) with part labels, k = 4 (the edges are colored according to the colors of the adjacent nodes), manifolds model with its outcome and visual description of the object (pose, containment and parts). Bottom row: probabilistic logic module with its components and reasoning outcome, predicted pre-grasp middle (4), shape-based grasping model and predicted grasping point.

6.4 The task-dependent grasping pipeline. In blue are marked the vision-based scene module and the grasping execution module. The contributions of this chapter are situated in the green boxes: the probabilistic logic module and the grasping pose prediction module framed in a relational formulation.

6.5 Objects having approximate rotational symmetry.

6.6 Semantic parts for several objects after applying the completion algorithm. The colors correspond to parts as follows: yellow: top, blue: middle, red: bottom, green: handle, and magenta: usable area.

6.7 An object ontology.

6.8 A task ontology.

6.9 Examples of the pre-grasp gripper poses for a face of the top part

6.10 Gripper and volume of interest, showing the reference frame origin for the orthogonal projection of the DI image from Eq. (6.3) (top left). Object (top right) and its corresponding point cloud (bottom left). The blue points show the selected points of a graspable region of the remote control. The bottom right image shows the points enclosed by the gripper volume.

6.11 Example of a depth image (10×21 pixels) and its corresponding gradient magnitude (8×19 pixels).

6.12 Experimental settings with the real robot. Each picture shows the objects utilized for each scenario. Additional object constraints are: the gray bottle of scenario 3 is full of water, the white bottle is empty and the coffee container is full of coffee.

6.13 Accuracy (%) of PLM for task and pre-grasp prediction using all evaluation settings.

7.1 From point clouds to feature vectors in kLog.

7.2 Relational robot grasping in kLog.

7.3 From point cloud graph to feature vectors in kLog.

7.4 Point clouds representing partial views of a cup.

7.5 ROC curves for the two kernel variants and different hyper-parameters (sphere features, VFH/PFH/SC + closeBy2).

8.1 The Uno game domain.

8.2 A learned regression tree by TildeCRF representing the gradient in the first iteration. Internal nodes represent tests (queries in Prolog form) and leaves represent the output. Parts of the tree have been removed due to space restrictions (indicated by ...).

8.3 Accuracy for different r-gram lengths (UnoReal).

8.4 Performance of TildeCRF on UnoReal. With a relational language bias TildeCRF outperforms the propositional setting. The plain gradient optimization and the Vi majority classifier were used for this experiment.

8.5 Performance on UnoReal with Vi majority: relational (left) vs propositional (right) CRFs.

8.6 All datasets; influence of the noise on accuracy performance for all methods; classification method: Viterbi majority; 5-fold cross-validated.


List of Tables

3.1 Axes for discussing old and new related work. Old papers are displayed in red while new ones in blue. The papers marked in black indicate the transition period. One can notice the dominance of red in the first column indicating that crisp logical and relational approaches were highly popular in early vision. Relational languages and logic-based approaches have been rarely used in modern computer vision, in crisp or probabilistic formulations, given the size of the computer vision community.

4.1 kLog vs. relational distance (RD); classes house, door and window using the fixed split, D60.

4.2 Relational distance vs. baselines, class house, D164 (5-fold cv).

4.3 Relational distance (RD) vs. baselines, class house, D60 (5-fold cv).

5.1 Overall accuracy for scene classification on the considered datasets. L denotes local object attributes, R denotes unary/binary/ternary/quadruple relationships and G denotes global information. The best result for the G+L+R setting was obtained when R=2/D=0.

5.2 AP for categories chair and table on the 15MIT dataset. The BoW, SP and SP+context settings were obtained for kernel parameters R=0/D=1, R=1/D=0 and R=1/D=2, respectively.

6.1 Object-Task affordances.

6.2 Accuracy (%): PLM vs. propagation kernel (Manifolds) vs. random baseline for object categorization.

6.3 Accuracy (%): PLM for task prediction.

6.4 Accuracy (%): PLM for pre-grasp prediction.

6.5 Percentage (%) of successfully graspable points that have “visually graspable” probability less than (lt) 0.3, 0.4 or 0.5: Pipeline vs. local shape grasp prediction.

6.6 Percentage of successful grasps in the real robot scenarios. Different levels of SROBOT complexity.

7.1 Performance results using sphere features. Per object evaluation using hard-soft matching (R=2, D=2).

7.2 Performance results using the gripper cell setup. Per object evaluation using hard-soft matching (R=2, D=2).

8.1 Performance of TildeCRF (conjugated gradient) on all Uno datasets using the defined classification approaches. The bold notation shows the best accuracy scores.

8.2 Classification results for UnoReal. The bold notation shows the

Overture


Chapter 1

Introduction

1.1 Context

1.1.1 Machine Learning, Computer Vision and Robotics

Artificial Intelligence (AI) has the long-standing goal of building intelligent machines that can perceive, think and act in ways similar to humans [Landwehr, 2009]. This goal makes AI very challenging, but also an inspiration to many researchers. Driven by it, AI is today an important field of computer science that successfully solves tasks in many real-world applications, such as natural language understanding, drug discovery, fraud detection, 3D reconstruction or autonomous robot navigation.

This dissertation is situated at the intersection of three subfields of AI: machine learning, robotics and computer vision.

Machine learning is concerned with building systems that improve their performance on a task with experience, beyond human experts. Machine learning systems typically learn concepts from examples, either by observing an expert, or by interacting with the environment. From given examples, machine learning can automatically infer a model that is a formal representation of the structure inherent in the data. The model can be used to predict the environment or to assist humans in understanding the environment. In either case, the learned model requires reasoning, that is, making predictions or analyzing how modifying the system’s input will change its output. For example, a machine learning system can learn a model based on the medical records of previous patients. Presented with a new patient record, the system could reason if the patient has a certain disease or not.

Robotics and computer vision are fields that emerged from AI in the early 1970s. They give AI the means to exhibit real-world intelligence by directly perceiving and manipulating the environment. That is, computer vision gives the machine eyes, or a means of perception, while robotics gives the artificial mind a body, or a means of control and action. Computer vision studies ways to turn the pixels of images into interpretable concepts, such as objects, scenes, events and beyond, so that computers can understand images in a manner similar to humans. Thus, computer vision methods acquire, process, analyze, and understand real-world images in order to produce numerical or symbolic information. Computer vision tasks include object tracking, visual recognition and 3D reconstruction. Robotics deals with the construction, operation, and application of robots, as well as systems for their control and sensor processing. Robotics tasks include robot localization, navigation, planning and manipulation.

While classical robotics and computer vision use predefined models (or manual encoding of knowledge) of the robot or its environment for task solving, nowadays, both fields see learning as a central topic. The classical approaches have either proven unsatisfactory in real-world vision tasks, or, although successful for robotic industrial applications, they have fallen behind the more ambitious goal of robotics as a test platform for AI. Machine learning has brought a plethora of advances in extracting statistical models from abstract data in modern computer vision and robotics (similar to speech and bioinformatics). Furthermore, many robotic tasks depend on visual components. For example, robot grasping relies on object recognition, object segmentation and graspable point detection. Thus, the three fields intersect each other and the key contributions of this thesis lie at these intersections.

1.1.2 Visual Recognition

Of all the computer vision tasks, visual recognition is probably the most challenging. Visual recognition consists of analyzing static or dynamic scenes and recognizing their constituent entities. In a dynamic setting the task is one of recognizing sequences of interest or events. Following the definition by [Szeliski, 2010], in a static setting, visual recognition can be divided along several axes. It subsumes the detection task, which is defined as checking whether a specific element (e.g., a face or an interest point) is in the image and where the match may occur. If the query entity to be recognized is a rigid template then the task is that of instance recognition. The most challenging variant is that of category recognition, which involves recognizing instances of extremely varied classes. Visual category recognition may refer to object recognition, i.e., naming constituent objects of a certain object category, scene recognition, i.e., categorizing an image as belonging to one category of a large range of categories, or scene understanding, i.e., naming all constituent objects, their categories and potentially their semantic, spatial and functional relationships. Interlinked in all these tasks is the topic of learning from example images.

This dissertation is concerned with visual recognition. Several visual recognition tasks are essential for robot grasping, while others are useful for robot navigation or are prerequisites for truly intelligent artificial agents. For example, detecting good contact points for the robot hand is a critical step for successful object grasping. Object recognition is an important task for robot grasping and navigation. Scene recognition and understanding play a major role in mobile robot navigation. Finally, recognizing sequences in video data is of central importance in complex dynamic scenes. All these are important visual recognition problems in AI and the main motivation driving this work.

1.1.3 Statistical Relational Learning

Traditional machine learning is concerned with learning from examples represented in an attribute-value (or propositional) format. Propositional representations express knowledge about a single set of properties of the world and do not associate it with objects in the world. For example, in the medical diagnosis case, attributes may be the patient’s symptoms, medical record and current medication. In the object grasping domain, attributes can be object shape properties that locally characterize object points.

Although propositional learning has made much progress over the last decades with sophisticated and rigorous statistical techniques yielding accurate models in the presence of noisy data [Bishop, 2006, Szeliski, 2010], in many complex real-world domains a propositional representation is often not appropriate. In the real world, instances are themselves structured and/or interrelated. For example, in the medical diagnosis problem we may want to also consider the medical records and symptoms of the patient’s relatives, as well as an explicit family genealogy or relationship. Similarly, for the grasping points we may want to consider properties of neighboring points satisfying certain spatial constraints. In these cases the data exhibits a complex structure and examples are best represented in terms of entities and relationships amongst them.
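To make the contrast concrete, the following minimal Python sketch represents the grasping-point example in both ways. It is an illustration only and is not taken from the systems described in this thesis: the `curvature` attribute, the `close_by` relation and the distance threshold are hypothetical choices. The propositional view keeps one fixed-length attribute vector per point, whereas the relational view keeps the points as entities and adds an explicit relation between neighboring points.

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist


@dataclass(frozen=True)
class Point:
    pid: str
    xyz: tuple          # 3D coordinates of the object point
    curvature: float    # hypothetical local shape attribute


points = [
    Point("p1", (0.00, 0.00, 0.00), 0.10),
    Point("p2", (0.01, 0.00, 0.00), 0.80),
    Point("p3", (0.50, 0.50, 0.00), 0.15),
]


def propositional(points):
    """Attribute-value view: each point is described in isolation."""
    return {p.pid: [p.curvature] for p in points}


def relational(points, threshold=0.05):
    """Relational view: entities with properties plus a close_by relation."""
    entities = {p.pid: {"curvature": p.curvature} for p in points}
    close_by = {
        (a.pid, b.pid)
        for a, b in combinations(points, 2)
        if dist(a.xyz, b.xyz) <= threshold  # hypothetical spatial constraint
    }
    return entities, close_by


def neighbors(pid, close_by):
    """Points related to pid through the (symmetric) close_by relation."""
    return sorted({b if a == pid else a for a, b in close_by if pid in (a, b)})


entities, close_by = relational(points)
print(propositional(points))
print(close_by, neighbors("p1", close_by))
```

In the relational view a learner can refer to the properties of a point's neighbors, for instance the curvature of every point returned by `neighbors`, which the flat attribute vectors cannot express without fixing the number of neighbors in advance.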

Furthermore, often real-world problems, such as the ones in computer vision or robotics, cannot rely on complete and precise descriptions of the environment. Thus, the artificial agent should be able to make abstractions and to cope with incomplete or uncertain knowledge. For example, if the goal is to grasp a cup, then we do not care about its color or the number of handles (as long as there is at least one). If complete descriptions are available, making learning and reasoning effective requires exploiting symmetries and redundancies in the domain, and thus, generalizing over similar situations. These are critical aspects of intelligence not solved yet in computer vision or robotics. This thesis aims to achieve them by means of relational representations [De Raedt, 2008], which are most easily described by first-order logic or related formalisms, such as (hyper-)graphs, and best supplied by relational languages. When combined with probabilities and statistics, they also provide the possibility to handle uncertainty.

[Figure 1.1 diagram labels: Robotics, Computer Vision, Learning, Learning (Reasoning), Relations, Logic, Probabilities, SRL]

Figure 1.1: This dissertation is situated in three subfields of AI: statistical relational learning (right), computer vision and robotics (left).

Statistical relational learning (SRL) is an area of machine learning that successfully combines statistical learning and reasoning with relational representations in many complex applications, such as social network modeling, text mining or bioinformatics [Getoor and Taskar, 2007, De Raedt, 2008]. Prominent examples are probabilistic logical models, which tackle a long-standing goal of AI, namely unifying first-order logic (capturing regularities and symmetries) and probabilities (capturing uncertainty). Figure 1.1 shows the general aim of relational visual recognition, that is, the use of (statistical) relational learning techniques instead of traditional machine learning to solve visual recognition problems for computer vision and robotics.

1.2

Motivation and Research Question

Computer vision and robotics have developed many techniques for visual recognition that use a plethora of local low to medium-level features, including geometric primitives, point clouds, shape and invariant features [Szeliski, 2010].

However, for high-level tasks involving complex objects and scenes such features are not always enough. As examples, consider the tasks of understanding and recognizing individual house facades, distinguishing between restaurant and bar scenes, or finding the best robot grasp based on the object configuration and task-related constraints. In these cases it is the component objects and their complex semantic configuration and interaction that helps recognition. It is more intuitive to understand and describe typical houses as consisting of aligned elements such as a roof, some windows, one or more doors and possibly a chimney. In the bar/restaurant scene example, the differentiating patterns are the consistent qualitative spatial and functional configurations between chairs. One can describe a bar scene as having ‘a variable number of chairs of similar size, close to each other and aligned horizontally along a counter’. Finally, high-level reasoning about symbolic object configurations and tasks reduces possible grasps and hence, improves performance. At the same time it allows grasp transfer to novel objects that share similar parts.

In the early days of computer vision, it was felt that hierarchical structure and relations are key components of a scene understanding system [Guzmán, 1968], [Kanade, 1977], [Hanson and Riseman, 1978], [González and Thomason, 1978], [Matsuyama and Hwang, 1985, Fu, 1974]. Popular in early work on syntactic or structural pattern recognition [Haralick, 1983], relational formalisms, such as ‘figure description languages’ and symbolic graphs, fell out of favor in the 1990s [Bunke and Sanfeliu, 1990] for several reasons: the high computational cost of handling complex graphs, low- and mid-level vision features too immature to support such ambitious representations, and the limitations of purely relational approaches in handling noisy data. The focus in computer vision then shifted towards low-level representations:

“We have showed the use of relational representations, we must yet discover the use of low-level knowledge.” Linda Shapiro, 1983

Furthermore, to perceive, interpret and grasp objects in arbitrary and dynamic environmental scenarios, robot vision capabilities are essential. The majority of grasping methods learn direct mappings from visual perceptions to grasping parameters. However, these methods have a major shortcoming: it is difficult to link gripper parameters solely to local sensor features when dealing with an exploding complexity in the environment and variation in tasks. Only recently have methods that take more global and symbolic knowledge into account gained more interest. Incorporating domain knowledge (e.g., ontologies) that directly collaborates with the controllers and the (visual) sensors brings increased robustness and can generate more accurate robot grasps.

This thesis wants to contribute towards the idea that visual scenes, grasping scenarios and world knowledge are best described using high-level representational devices that are based on semantically meaningful entities such as graphs, and even more generally using logical and relational languages. We shall argue that the advantages of these rich symbolic representations are: i) they can abstract spatial relations between scene components away from exact locations and thus, generalize over similar situations and viewpoints, ii) they provide means to obtain analytic descriptions of scenes and thus, semantic consistency, iii) they offer contextual knowledge exploitation via symbolic relations, and iv) they transfer knowledge to novel scenarios that share similar semantic entities and thus, generalize over similar (multiple) entities.

Unlike in early work in computer vision, relational representations have shown robustness to noise when combined with statistical techniques [Antanas et al., 2013a]. Moreover, low- and mid-level vision features are now much more mature. Nevertheless, relational representations have not yet been used to solve visual-based grasping problems and have rarely been used to address visual recognition problems in general (exceptions are grammars for image understanding [Han and Zhu, 2009, Girshick et al., 2011, Zhu et al., 2012], and graph mining and rule induction for video data [Sridhar et al., 2010a, Dubba et al., 2010]). Thus, it is time to reconsider old problems with new and successful (statistical) relational learning techniques.

The main research questions of this thesis are whether visual recognition can benefit from SRL and whether we can develop effective and real-world relational visual recognition systems.

One of the main problems in robotic grasping is generalization across many similar objects and/or tasks. Similarly, one challenge in computer vision is the optimal exploitation of contextual information and generalization across configurations of visual elements. Thus, on one hand, the extraction of similarities between objects and scenes requires relational representations. On the other hand, robotics and computer vision are fields continuously confronted with real-world uncertainties. As a result, we have strong reasons to suspect that SRL techniques can be beneficial for visual recognition tasks in computer vision and robotics.

1.3

Thesis Contribution

We answer these questions via the key contribution of this dissertation, which is the use of several (statistical) relational learning techniques for different computer and robot vision problems. This is an important step towards relational visual recognition and thus, towards closing the loop with the old literature. To achieve this goal, the thesis makes five main contributions, three in the field of computer vision and two in the field of robotics. We will now list and briefly describe these contributions.

1. A relational distance-based framework for hierarchical understanding of images. Application: house facades.

Our first contribution is a new relational distance-based framework for hierarchical image understanding. This contribution includes the following:

• a new relational distance function between visual descriptions,

• the use of recent results in relational distance metrics as a relational generalization technique to recognize qualitative high-level structures in images,

• the use of relational generalization throughout all layers of the hierarchy, in a unified way.

2. The employment of a kernel-based relational language for scene classification and scene tagging

Our second contribution is a new relational representation of visual scenes for two important and challenging problems in computer vision: scene classification and scene tagging with object categories. Both problems use a similar relational representation, showing its generality and benefits. Part of this contribution is the employment of the kernel-based language to understand images of houses. Additional contributions are:

• a high-level relational scene description based on semantic objects and the spatial relationships that hold among them,

• a powerful and expressive representation using (hyper)-graphs,

• a principled way to represent exact metric locations as higher-order relations among objects,

• a deeper insight in scene understanding by employing relations among semantic off-the-shelf object detections.

3. A probabilistic logic pipeline for task-dependent robot grasping

Our third contribution is for robot grasping. It is a new reasoning module based on causal probabilistic logic [Vennekens et al., 2009] and symbolic object parts for task-dependent robot grasping. Given a set of probabilistic observations about the world, the model can semantically reason about object category, suitable tasks and pre-grasp configurations with respect to the intended task. This contribution comprises:

• the integration of object categorical and task-dependent information for semantic pre-grasp prediction,

• the use of world knowledge about object-task affordances and object/task ontologies to encode general rules that allow generalization over similar object parts and object/task categories,

• a first probabilistic logic module for task-dependent robot grasping.

4. A relational kernel-based approach to numerical feature pooling for robot grasping. Application: graspable point recognition.

Our fourth contribution integrates, using kernels for graphs, numerical appearance features with qualitative spatial relations. Given a 3D point cloud and local shape features of each point, we construct a numerical attributed and symbolic graph by defining spatial relations among points in the cloud. Our goal is to investigate whether the structure of the object can improve graspable point recognition. To achieve it, our approach includes:

• the exploitation of the object graph for extended contextual information,

• the use of spatial proximity to pool numerical shape features.

5. The employment of state-of-the-art SRL systems for video sequence recognition. Application: video streams of the Uno game.

This last contribution uses sequential statistical relational techniques to capture underlying concepts in video streams. In particular, we focus on monitoring card games and learning to detect fraudulent sequences in Uno video streams. It includes two main steps:

• learning the rules of the Uno game by observing humans playing it from video streams,

• recognizing fraudulent behavior using the learned rules.

Some of the SRL solutions proposed as contributions to the recognition problems considered are framed upon a similar relational kernel-based approach. More precisely, contributions 2 and 4 rely on the same relational and logical language of a kernel-based framework. They are obtained by changing the relational representation of the problem, while keeping the framework engine. Thus, the SRL approaches proposed are characterized not only by the expressivity of the relational representations, but also by generality with respect to the visual recognition problems addressed. This is an important step towards a general purpose relational visual recognition system.

1.4

Thesis Roadmap

The final part of this introduction gives a brief tour of this thesis. We review robotics, computer vision and statistical and relational learning foundations in Chapter 2. The core of the thesis is divided into three other main parts.

Part I is devoted to relational scene understanding and tackles several recognition problems: object recognition, scene recognition and scene understanding. Chapter 3 provides an insight into the history of syntactic pattern recognition, relations and graphs in visual recognition. Its role is to point out why the popular relational frameworks of the 1970s failed and were abandoned. We discuss what is different now and how SRL can help to solve the old problems. Starting from the trends back then, we overview the recent SRL work for computer vision and point out what would be possible if SRL succeeded. To this aim, Chapter 4 proposes two new relational approaches to hierarchical image understanding, where the goal is to recognize constituent objects of interest at different levels of semantic granularity. We consider as application the house facades domain. The first approach is a relational distance-based approach which combines robust feature extraction, qualitative spatial relations, relational instance-based learning and compositional hierarchies in one framework. The second approach extends the first one by replacing the relational distance with a kernel for relational structures. This chapter is based on the following publications:

• Antanas, L., van Otterlo, M., Oramas Mogrovejo, J. A., Tuytelaars, T., and De Raedt, L. Not far away from home: A relational distance-based approach to understand images of houses. In Lecture Notes in Computer Science, vol. 6489, pp. 22-29, Inductive Logic Programming, Springer, 2010.

• Antanas, L., Frasconi, P., Tuytelaars, T., and De Raedt, L. Employing logical languages for image understanding. In IEEE Workshop on Kernels and Distances for Computer Vision, International Conference on Computer Vision, 2011.

• Antanas, L., van Otterlo, M., Oramas Mogrovejo, J. A., Tuytelaars, T., and De Raedt, L. A relational distance-based framework for hierarchical image understanding. In Proceedings of the 1st International Conference on Pattern Recognition - Applications and Methods, 2012, Best Paper Award.

• Antanas, L., Frasconi, P., Costa, F., Tuytelaars, T., and De Raedt, L. A relational kernel-based framework for hierarchical image understanding. In Lecture Notes in Computer Science, vol. 7626, pp. 171-180, Structural, Syntactic, and Statistical Pattern Recognition, Springer, 2012.

• Antanas, L., van Otterlo, M., Oramas M., J. A., Tuytelaars, T., and De Raedt, L. There are plenty of places like home: using relational representations in hierarchies for distance-based image understanding. Neurocomputing Journal, 2013.

In Chapter 5 we move towards more generic scene understanding where, in a first phase, we contribute a relational kernel-based language for scene recognition. We show that semantic object detections and qualitative spatial constraints between them can improve recognition. In a second phase, we employ a similar relational kernel-based language for scene tagging with object categories. We then iteratively combine object and scene recognition to boost the performance on both tasks. The chapter is based on the following contribution:

• Antanas, L., Hoffmann, M., Frasconi, P., Tuytelaars, T., and De Raedt, L. A relational kernel-based approach to scene classification. In Proceedings of Workshop on Applications of Computer Vision, 2013.

Part II discusses SRL techniques for robot grasping. We demonstrate their benefits in Chapter 6, which proposes a new probabilistic logic pipeline for object grasping, and in Chapter 7, which presents a new SRL technique to recognize good grasping points in point clouds. The pipeline leverages world knowledge, in the form of object/task ontologies, and low-level data, in the form of point clouds, to improve robot grasping. Starting from a symbolic vision-based scene description, the pipeline first employs a probabilistic logic module to semantically reason about object category, suitable tasks and pre-grasp configurations with respect to the intended task. Once the pre-grasp is determined, the second step in the pipeline maps part-related shape features to good grasping hypotheses. The mapping is done in Chapter 7 using relational kernels. Chapter 6 is based on the paper:

• Antanas, L., Moreno, P., Figueiredo, R., Neumann, M., Kersting, K., and De Raedt, L. High-level reasoning and low-level learning for grasping: a probabilistic logic pipeline. Submitted to IEEE Transactions on Robotics, 2013.

Part III focuses on visual sequence recognition with state-of-the-art SRL techniques. The contribution is explained in Chapter 8 and considers as application the UNO card game. The work presented in this chapter has been previously published in:

• Antanas, L., van Otterlo, M., De Raedt, L., and Thon, I. Learning probabilistic relational models from sequential video data with applications in table-top and card games. In the Belgian-Dutch Conference on Machine Learning, 2009.

• Antanas, L., Thon, I., van Otterlo, M., Landwehr, N., and De Raedt, L. Probabilistic logical sequence learning for video. In Inductive Logic Programming, 2009.

• Antanas, L., Gutmann, B., Thon, I., Kersting, K., and De Raedt, L. Combining video and sequential statistical relational techniques to monitor card games. In Proceedings of the ICML Workshop on Machine Learning and Games, 2010.

• Antanas, L., Gutmann, B., Thon, I., Kersting, K., and De Raedt, L. Combining video and sequential statistical relational techniques to monitor card games. In Proceedings of the Belgian-Dutch Conference on Machine Learning, 2010.

A concluding chapter summarizes the thesis, points out the implications of the results and gives an outlook on future work.

Some of the work performed during my Ph.D. research has not been included in the previous chapters. It is either work I am currently investigating and is briefly summarized in Chapter 9 in the context of related future work, or is listed in the publication list.

1.5

Publication List

Journals

• Antanas, L., van Otterlo, M., Oramas M., J. A., Tuytelaars, T., and De Raedt, L. There are plenty of places like home: Using relational representations in hierarchies for distance-based image understanding. Neurocomputing Journal, volume 123, pages 75-85, 2014.

• Janssens, T., Antanas, L., Derde, S., Vanhorebeek, I., Van den Berghe, G., Guiza Grandas, F. Charisma: An integrated approach to automatic H&E-stained skeletal muscle cell segmentation using supervised learning and novel robust clump splitting. Medical Image Analysis, volume 17, issue 8, pages 1206-1219, 2013.

Conferences and Workshops

• Antanas, L., Hoffmann, M., Frasconi, P., Tuytelaars, T., and De Raedt, L. A relational kernel-based approach to scene classification. In Proceedings of Workshop on Applications of Computer Vision, pages 133-139, 2013.

• Neumann, M., Moreno, P., Antanas, L., Garnett, R., Kersting, K. Graph kernels for object category prediction in task-dependent robot grasping. In Online Proceedings of the Eleventh Workshop on Mining and Learning with Graphs, pages 1-6, 2013.

• Billiet, L., Oramas M., J., Hoffmann, M., Meert, W., Antanas, L. Rule-based hand posture recognition using qualitative finger configurations acquired with the Kinect. In Proceedings of the 2nd International Conference on Pattern Recognition - Applications and Methods, pages 539-542, 2013.

• Moldovan, B., Antanas, L., Hoffmann, M. Opening doors: An initial SRL approach. In Lecture Notes in Computer Science Post Proceedings, Inductive Logic Programming, Springer, pages 178-192, 2013.

• Robben, D., Smeets, D., Ruijters, D., Hoffmann, M., Antanas, L., Maes, F., Suetens, P. Intra-patient non-rigid registration of 3D vascular cerebral images. In Lecture Notes in Computer Science, MICCAI Workshop on Clinical Image-based Procedures: From Planning to Intervention, Springer, pages 106-113, 2013.

• Antanas, L., van Otterlo, M., Oramas Mogrovejo, J. A., Tuytelaars, T., and De Raedt, L. A relational distance-based framework for hierarchical image understanding. In Proceedings of the 1st International Conference on Pattern Recognition - Applications and Methods, pages 206-218, 2012, Best Paper Award.

• Antanas, L., Frasconi, P., Costa, F., Tuytelaars, T., and De Raedt, L. A relational kernel-based framework for hierarchical image understanding. In Lecture Notes in Computer Science, Structural, Syntactic, and Statistical Pattern Recognition, Springer, pages 171-180, 2012.

• Derde, M., Antanas, L., De Raedt, L., Guiza Grandas, F. An interactive learning approach to histology image segmentation. In Proceedings of the 24th Benelux Conference on Artificial Intelligence, pages 1-8, 2012.

• Janssens, T., Antanas, L., Derde, S., Vanhorebeek, I., Van den Berghe, G., Guiza Grandas, F. Charisma: An Integrated Approach to Automatic H&E-stained Skeletal Muscle Cell Segmentation Using Supervised Learning and Novel Robust Clump Splitting Techniques. In Bioimaging, abstract, 2012.

• Antanas, L., Frasconi, P., Tuytelaars, T., and De Raedt, L. Employing logical languages for image understanding. In IEEE Workshop on Kernels and Distances for Computer Vision, International Conference on Computer Vision, pages 1-2, 2011.

• Antanas, L., van Otterlo, M., Oramas Mogrovejo, J. A., Tuytelaars, T., and De Raedt, L. Not far away from home: A relational distance-based approach to understand images of houses. In Lecture Notes in Computer Science, Inductive Logic Programming, Springer, pages 22-29, 2010.

• Antanas, L., Gutmann, B., Thon, I., Kersting, K., and De Raedt, L. Combining video and sequential statistical relational techniques to monitor card games. In Proceedings of the ICML Workshop on Machine Learning and Games, pages 1-6, 2010.

• Antanas, L., Gutmann, B., Thon, I., Kersting, K., and De Raedt, L. Combining video and sequential statistical relational techniques to monitor card games. In Proceedings of the Belgian-Dutch Conference on Machine Learning, pages 1-6, 2010.

• Antanas, L., Thon, I., van Otterlo, M., Landwehr, N., and De Raedt, L. Probabilistic logical sequence learning for video. In Online Proceedings of Inductive Logic Programming, pages 1-6, 2009.

• Antanas, L., van Otterlo, M., De Raedt, L., and Thon, I. Learning probabilistic relational models from sequential video data with applications in table-top and card games. In Proceedings of the Belgian-Dutch Conference on Machine Learning, pages 1-2, 2009.

• Antanas, L., Driessens, K., Croonenborghs, T., Ramon, J. Using decision trees as the answer network in temporal-difference networks. In Proceedings of the 18th European Conference on Artificial Intelligence, pages 847-848, 2008.

Chapter 2

Preliminaries

This chapter provides the foundations for the work presented in this thesis. They include the necessary background on robot grasping, computer vision and statistical relational learning. Along with the definitions and explanations we will also roughly categorize existing work and thus, provide more context for the contributions. We describe some concepts informally through examples and others more formally, following the existing literature.

We start by defining fundamental concepts of machine learning that are used throughout this text (Section 2.1.1). Next, we introduce relational data representations in Section 2.1.2. Relational learning and reasoning settings are outlined in Section 2.1.3. Finally, Section 2.2 explains the visual recognition and robot grasping setups in this dissertation.

2.1

Foundations: Machine Learning and Reasoning

This section briefly outlines some fundamental concepts of statistical and relational machine learning and reasoning, and introduces notation and terminology that is used throughout the thesis. More details can be found in [Flach, 2012, Barber, 2011] for statistical machine learning and reasoning and in [De Raedt, 2008] for (statistical) relational learning and reasoning.

2.1.1

Statistical Learning and Reasoning

The general setup in statistical machine learning is based on objects of interest called instances. The set of all possible instances is the instance space $\mathcal{X}$. Each instance $x \in \mathcal{X}$ is a point in an $m$-dimensional instance space $\mathcal{X} = d_1 \times \cdots \times d_m$, where $d_i$ is the domain of the $i$-th attribute describing the input feature $x$. Instances may have labels and all instance labels together define the output space $\mathcal{Y}$. Both instances and labels can be binary, categorical or continuous. Learning from labeled instances is called supervised learning.

We consider supervised learning tasks throughout this thesis. We are given a training set $D$ containing $n$ labelled instances or training examples $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. $D \subset \mathcal{X} \times \mathcal{Y}$ is also called training data and $e = (x, y)$ a training example. Then the supervised learning task is to find a mapping or a model $\bar{h}$ from the instance space to the output space. It is assumed that examples are independently drawn from a fixed (unknown) distribution $P$. Such examples are said to be i.i.d. Starting from the type of labels several settings are possible. In this work we focus on classification, where labels are binary or categorical, i.e., classes. In binary classification it is assumed that $y \in \{-1, +1\}$.

Definition 1. (Supervised statistical learning). Given a set of training examples $D$, a space of possible (probabilistic) classifiers $H = \{h \mid h: \mathcal{X} \to \mathcal{Y}\}$ and a loss function $L_H: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, find the classifier $\bar{h} \in H$ with low approximation error $Err(\bar{h})$ on the training data as well as on unseen examples. $Err(h)$ is estimated based on a combination of the training error, e.g., $\frac{1}{n} \sum_{i=1}^{n} L_H(h(x_i), y_i)$.
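The training error in Definition 1 can be illustrated with a short sketch. The 0-1 loss, the threshold classifier and the toy data are assumptions chosen for illustration; Definition 1 leaves the loss function $L_H$ open.

```python
def zero_one_loss(prediction, label):
    """One possible choice for L_H: 0 for a correct prediction, 1 otherwise."""
    return 0.0 if prediction == label else 1.0


def training_error(h, data, loss=zero_one_loss):
    """Average loss (1/n) * sum_i L_H(h(x_i), y_i) over the training set."""
    return sum(loss(h(x), y) for x, y in data) / len(data)


# Toy binary problem with labels in {-1, +1} (hypothetical data).
toy_data = [((1.0,), +1), ((2.0,), +1), ((-1.0,), -1), ((0.5,), -1)]


def threshold_classifier(x):
    """Hypothetical classifier: predict +1 when the first attribute is positive."""
    return +1 if x[0] > 0 else -1
```

On this toy set the classifier misclassifies one of the four examples, so `training_error(threshold_classifier, toy_data)` evaluates to 0.25.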

Example 1. A supervised learning task example is that of patient disease prediction. $\mathcal{X}$ can be the space of all possible patients and $\mathcal{Y}$ is then the space of all possible diagnoses. The $i$-th attribute may be the patient’s glucose level at some point in time.

The classifier above is assumed to be deterministic, that is, it returns a class label $h(x) \in \mathcal{Y}$. A scoring classifier $h$ is a mapping from the instance space $\mathcal{X}$ to a $k$-vector of real numbers $\mathbb{R}^k$, where $h_i(x)$ is the score assigned to class $C_i$ for instance $x$. $h$ becomes a probabilistic classifier if $h(x)$ is a probability vector over classes, that is $h: \mathcal{X} \to [0,1]^k$, where $\sum_{i=1}^{k} h_i(x) = 1$. This provides a confidence value for any prediction, allowing further inspection in ambiguous cases.
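One common way to turn a scoring classifier into a probabilistic one is to normalize the scores with a softmax. The text above does not prescribe this normalization, so the following is only a sketch of one possible choice, with hypothetical function names.

```python
import math


def softmax(scores):
    """Map a k-vector of real scores to a probability vector in [0, 1]^k."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_confidence(scores):
    """Return the index of the most probable class and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]
```

The returned probability can serve as the confidence value mentioned above: a prediction whose top probability is close to $1/k$ is a candidate for further inspection.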

One of the central paradigms in statistical machine learning is to identify the relevant random variables $x \in \mathcal{X}$ from training data, and make a probabilistic

probabilistic classifier) defines a probability distribution $P(\cdot)$ over a set of random variables. The set of assignments a random variable $x$ can take is the domain of $x$. $P(x)$ denotes the probability distribution of the random variable $x$ on all values in its space. Probabilistic reasoning is performed by introducing evidence that sets variables in known states, and subsequently computing probabilities of interest of their interaction, conditioned on this evidence. The distribution $P(x)$ can then be used to evaluate a conditional probability distribution $P(x \mid e) = \frac{P(x, e)}{P(e)}$, called target distribution. In this case $x$ involves random variables and $e$ is the evidence or a partial value assignment of the random variables.
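The conditioning step $P(x \mid e) = P(x, e) / P(e)$ can be sketched on a small joint distribution stored as a table. The two variables (glucose level, disease) and all probability values below are made up for illustration.

```python
# Joint distribution over (glucose, disease); the numbers are hypothetical.
joint = {
    ("high", True): 0.20,
    ("high", False): 0.10,
    ("normal", True): 0.05,
    ("normal", False): 0.65,
}


def condition(joint, evidence_index, evidence_value):
    """Return P(rest | e) = P(rest, e) / P(e) for a joint {assignment: prob}."""
    # P(e): sum the joint over all assignments that agree with the evidence.
    p_e = sum(p for a, p in joint.items() if a[evidence_index] == evidence_value)
    if p_e == 0:
        raise ValueError("evidence has zero probability")
    target = {}
    for a, p in joint.items():
        if a[evidence_index] == evidence_value:
            rest = a[:evidence_index] + a[evidence_index + 1:]
            target[rest] = target.get(rest, 0.0) + p / p_e
    return target
```

Calling `condition(joint, 0, "high")` sets the glucose variable to a known state and returns the target distribution over the disease variable.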

Example 2. In the case of the patient disease prediction example the glucose level is a random variable and its domain is the set of all possible glucose levels. The set of all attributes and the disease target forms the set $\mathcal{X}$ of random variables. At inference time the target distribution is the probability of the disease given the attribute values as evidence.

The conditional probability distribution $P(y \mid x)$ expresses a relation between the random variables, that is the probability that the random variable $y$ has a particular value given the knowledge of the evidence $x$. The random variables can also be related with conjunction instead of condition. $P(x, y)$ is called the joint probability distribution over all possible values of $x$ and $y$. A generative model provides an estimate of the full joint probability distribution $P(x, y)$ on the inputs $x$ and label $y$. It uses Bayes’ rule to calculate $P(y \mid x)$ and pick the most likely label $y$. A discriminative model provides an estimate of the conditional probability distribution $P(y \mid x)$ directly or learns a direct map from the inputs $x$ to the class labels $y$.
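To make the generative route concrete, the sketch below factorizes the joint as $P(x, y) = P(x \mid y) P(y)$ and applies Bayes’ rule to pick the most likely label. The class prior and likelihood tables are hypothetical numbers, not estimates from any dataset in this thesis.

```python
def posterior(prior, likelihood, x):
    """P(y | x) = P(x | y) P(y) / sum_y' P(x | y') P(y') (Bayes' rule)."""
    joint = {y: prior[y] * likelihood[y][x] for y in prior}  # P(x, y)
    evidence = sum(joint.values())                           # P(x)
    return {y: p / evidence for y, p in joint.items()}


def most_likely_label(prior, likelihood, x):
    """Generative classification: return the label maximizing P(y | x)."""
    post = posterior(prior, likelihood, x)
    return max(post, key=post.get)


# Hypothetical model of a single discretized attribute (glucose level).
prior = {"sick": 0.25, "healthy": 0.75}
likelihood = {
    "sick": {"high": 0.8, "normal": 0.2},
    "healthy": {"high": 0.1, "normal": 0.9},
}
```

A discriminative model would skip the two tables and estimate $P(y \mid x)$, or the decision itself, directly.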

In the following we explain the basics of several well-established statistical learning methods used in this thesis: Support Vector Machines (SVMs), $k$-Nearest Neighbor ($k$NN), Conditional Random Fields (CRFs) and $n$-grams. However, in our contributions they are upgraded to relational representations. While $n$-grams are based on generative learning, the other classifiers are discriminative methods.

Support Vector Machines and Kernels. Support Vector Machines (SVMs) are very popular because of their good performance on noisy data and high-dimensional spaces [Boser et al., 1992, Vapnik, 1995]. We will briefly sketch their principles here; more details can be found in [Shawe-Taylor and Cristianini, 2004, Gartner, 2008]. Assuming a binary classification problem, the goal of a classification method is to find $\bar{h}$ that best separates the two classes.

A linear classifier $h$ is a linear function in the form of a vector of weights $w$:

$$y(x) = \mathrm{sign}(\langle w, x \rangle + b), \tag{2.1}$$

where $\langle \cdot, \cdot \rangle$ is the inner product and $b$ is a constant. Such a linear classifier assumes that the data can be embedded into a space where the separating hypothesis (or hyperplane) is a linear relation $\langle w, x \rangle + b$ in $\mathbb{R}^m$. The examples on one side of the hyperplane are classified as positive, while the others as negative. The Euclidean distance between a point $x$ and the hyperplane is $\frac{|\langle w, x \rangle + b|}{\|w\|}$, where $\|\cdot\|$ denotes the $l_2$ norm, that is $\|w\| = \sqrt{\langle w, w \rangle}$.
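Equation (2.1) and the distance formula can be sketched as follows. The function names are our own, and a point lying exactly on the hyperplane is assigned to the negative class, a convention assumed here for definiteness.

```python
import math


def dot(u, v):
    """Inner product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def predict(w, b, x):
    """Linear classifier y(x) = sign(<w, x> + b), with a zero score mapped to -1."""
    return 1 if dot(w, x) + b > 0 else -1


def distance_to_hyperplane(w, b, x):
    """Euclidean distance |<w, x> + b| / ||w|| between x and the hyperplane."""
    return abs(dot(w, x) + b) / math.sqrt(dot(w, w))
```

For instance, with $w = (3, 4)$, $b = -5$ and $x = (3, 4)$ we get $\langle w, x \rangle + b = 20$ and $\|w\| = 5$, so the point is classified as positive and lies at distance 4 from the hyperplane.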

Training a linear classifier is equivalent to finding $\bar{w}$ which constructs a hyperplane (or set of hyperplanes) in the input feature space $\mathcal{X}$ that best interpolates the training set $D$ and can be used for classification:

$$\bar{w} = \arg\min_{w} \frac{1}{n} \sum_{i=1}^{n} L_H(w, x_i, y_i) + \lambda(w), \tag{2.2}$$

where $\lambda$ is a regularization term that constrains the weights $w$ and $L_H$ is the loss function. At inference time, the prediction for any input $x$ is made using the learned vector $\bar{w}$:

$$y = \arg\max_{y \in \mathcal{Y}} (\langle \bar{w}, x \rangle + b), \quad x \in \mathcal{X}. \tag{2.3}$$

If the model is learned using probability estimates, the probabilistic inference step means estimating the target distribution $P(y \mid x)$.

The ingredients of the SVMs are the maximum-margin principle addressing robustness, slack variables addressing class overlap, and the kernel trick addressing non-linear structure. We first assume a perfect linear classifier and explain the linearly separable examples case. Figure 2.1 (right) depicts such a situation. We then extend it to the non-separable case (slack variables) and non-linear case (kernel trick).

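Perfect classification implies that for any $x$

$$y(x) = \begin{cases} +1 & \text{if } \langle w, x \rangle + b > 0, \\ -1 & \text{if } \langle w, x \rangle + b \leq 0, \end{cases} \tag{2.4}$$

which translates into $\forall x \in \mathcal{X},\ y \cdot (\langle w, x \rangle + b) \geq 1$, with the equality holding for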

[Figure 2.1 diagram labels: $\langle w, x \rangle + b$, $b / \|w\|$]

Figure 2.1: The max-margin in SVMs (right). The classes to be separated are $-1$ (circles) and $+1$ (rectangles). The dotted arrow is the margin. The function $\phi$ maps the data into a feature space where the nonlinear pattern (left) is now linear (right). The kernel computes inner products in this feature space directly from the inputs.

In a linearly separable setting, there are several perfect classifiers. However, an optimal partition is achieved by the hyperplane that has the largest distance to the nearest training data points of any class or the maximal margin. In general, the larger the margin the lower the generalization error of the classifier. If such a hyperplane exists, it is also known as the maximum margin hyperplane and the linear classifier it defines is known as a max-margin classifier. The margin is illustrated also in Figure 2.1 on the right.

These constraints are cast in the following optimization problem:

$$\min_{w, b} \; \frac{1}{2} \langle w, w \rangle, \quad \text{subject to } \forall x_i \in X:\; y_i \cdot (\langle w, x_i \rangle + b) \ge 1. \tag{2.5}$$
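As a small illustration of the constraints in (2.5), the following sketch checks whether a given pair (w, b) satisfies y_i · (⟨w, x_i⟩ + b) ≥ 1 on a toy data set and computes the distance 1/||w|| from the hyperplane to the points for which equality holds. The data, the values of w and b, and the helper names are assumptions made for illustration only.

```python
import math

def dot(u, v):
    """Inner product <u, v> of two equally long vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def satisfies_margin_constraints(w, b, data):
    """True if y_i * (<w, x_i> + b) >= 1 holds for every (x_i, y_i)."""
    return all(y * (dot(w, x) + b) >= 1 for x, y in data)

def margin_width(w):
    """Distance 1/||w|| from the hyperplane to the points with equality."""
    return 1.0 / math.sqrt(dot(w, w))

# Hypothetical toy data: each entry is (x_i, y_i) with y_i in {+1, -1}.
toy_data = [
    ([2.0, 2.0], +1),
    ([3.0, 3.0], +1),
    ([0.0, 0.0], -1),
    ([-1.0, 0.0], -1),
]

# Hand-picked candidate hyperplane (an assumption, not a learned solution).
w, b = [0.5, 0.5], -1.0

feasible = satisfies_margin_constraints(w, b, toy_data)
margin = margin_width(w)
```

Under the scaling imposed by the constraints, minimizing ½⟨w, w⟩ is the same as maximizing this distance 1/||w||, which is why (2.5) yields the max-margin classifier.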

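If the data is not separable, SVMs add a tolerance for misclassifications by introducing slack variables $\xi_i$ in the constraints:

$$\min_{w, b, \xi} \; \frac{1}{2} \langle w, w \rangle + c \sum_{i=1}^{n} \xi_i, \quad \text{subject to } \forall x_i \in X:\; y_i \cdot (\langle w, x_i \rangle + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0. \tag{2.6}$$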
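To make the role of the slack variables in (2.6) concrete, the sketch below computes, for a fixed pair (w, b), the smallest ξ_i that satisfies both constraints, namely ξ_i = max(0, 1 − y_i · (⟨w, x_i⟩ + b)), and then evaluates the objective for a given constant c. The toy data, the chosen values of w, b and c, and the function names are illustrative assumptions.

```python
def dot(u, v):
    """Inner product <u, v> of two equally long vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def smallest_slacks(w, b, data):
    """Smallest xi_i >= 0 with y_i * (<w, x_i> + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in data]

def soft_margin_objective(w, b, data, c):
    """Value of 1/2 <w, w> + c * sum_i xi_i for the smallest feasible slacks."""
    return 0.5 * dot(w, w) + c * sum(smallest_slacks(w, b, data))

# Hypothetical toy data; the last negative example overlaps the positive class.
overlapping_data = [
    ([2.0, 2.0], +1),
    ([3.0, 3.0], +1),
    ([0.0, 0.0], -1),
    ([-1.0, 0.0], -1),
    ([1.0, 1.0], -1),
]

w, b = [0.5, 0.5], -1.0  # hand-picked candidate, not a learned solution

slacks = smallest_slacks(w, b, overlapping_data)
objective = soft_margin_objective(w, b, overlapping_data, c=10.0)
```

Only the overlapping example receives a non-zero slack here; every example that already satisfies the hard constraint of (2.5) keeps ξ_i = 0.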

The positive constant $c$ is the cost parameter of the error term. By increasing $c$


The reason why SVMs are so popular is that, for most problems, only a limited number of instances lie on the margin. This implies that the expensive computation required by equation (2.2) to find a solution $\bar{w}$ can become sparse, that is, features that are not in support vectors get zero weight. This is feasible when the hinge loss function is used for $L_H$:

$$L_H(w, x, y) = \max\{0, 1 - y \cdot f(x)\} = \max\{0, 1 - y \cdot (\langle w, x \rangle + b)\}.$$
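The following minimal sketch shows how the hinge loss enters the regularized objective of equation (2.2) and how that objective can be minimized. It rests on several assumptions that are not taken from the text: the regularizer is taken to be λ(w) = lam · ⟨w, w⟩, the minimizer is plain subgradient descent with a fixed step size, and the toy data and all function names are hypothetical. It is meant as an illustration rather than as a practical SVM solver.

```python
def dot(u, v):
    """Inner product <u, v> of two equally long vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def hinge_loss(w, b, x, y):
    """L_H(w, x, y) = max{0, 1 - y * (<w, x> + b)}."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))

def objective(w, b, data, lam):
    """Average hinge loss plus the assumed regularizer lam * <w, w>."""
    avg_loss = sum(hinge_loss(w, b, x, y) for x, y in data) / len(data)
    return avg_loss + lam * dot(w, w)

def train(data, lam=0.01, step=0.1, epochs=200):
    """Subgradient descent on the objective above (illustrative only)."""
    n, dim = len(data), len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [2.0 * lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for x, y in data:
            # Only examples with a non-zero hinge loss contribute.
            if y * (dot(w, x) + b) < 1.0:
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - step * gj for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
    return w, b

def predict(w, b, x):
    """Binary decision rule: +1 if <w, x> + b > 0, otherwise -1."""
    return 1 if dot(w, x) + b > 0 else -1

# Hypothetical, linearly separable toy data.
toy_data = [
    ([2.0, 2.0], +1),
    ([3.0, 3.0], +1),
    ([0.0, 0.0], -1),
    ([-1.0, 0.0], -1),
]

w_hat, b_hat = train(toy_data)
predictions = [predict(w_hat, b_hat, x) for x, _ in toy_data]
```

Examples whose hinge loss is zero drop out of the update entirely, which is the mechanism behind the sparsity mentioned above: only examples on or inside the margin influence the solution.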

Kernel-defined feature mappings. Although the original problem may be stated in a finite-dimensional space, it often happens that the classes to discriminate are not linearly separable in that space. Using a feature mapping function $\phi(x)$ instead of $x$, it is possible to project the original feature space into a higher-dimensional space, presumably making the separation easier in that space (see Figure 2.1). In this context $\phi(x)$ can be either computed explicitly or defined implicitly, via a kernel function $k(x, x') = \langle \phi(x), \phi(x') \rangle$. SVMs use

a mapping designed via the kernel. The ef
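The identity k(x, x') = ⟨φ(x), φ(x')⟩ can be checked numerically in a simple case. The sketch below uses the homogeneous degree-2 polynomial kernel on two-dimensional inputs, with the explicit map φ(x) = (x1², √2·x1·x2, x2²). This kernel, the input vectors and the function names are assumptions chosen for illustration; the text does not prescribe a particular kernel.

```python
import math

def dot(u, v):
    """Inner product <u, v> of two equally long vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def phi(x):
    """Explicit feature map for 2-D inputs: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

def poly2_kernel(x, z):
    """Kernel k(x, z) = <x, z>^2, evaluated without constructing phi."""
    return dot(x, z) ** 2

# Hypothetical inputs in the original 2-D space.
x, z = [1.0, 2.0], [3.0, 4.0]

explicit = dot(phi(x), phi(z))  # inner product in the 3-D feature space
implicit = poly2_kernel(x, z)   # same quantity computed from the 2-D inputs
```

Both routes give the same value up to floating-point rounding, but the kernel never forms the higher-dimensional vectors, which is what makes implicitly defined feature spaces practical.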
