Robot Learning from Data

(1)

Robot Learning from Data

(2)

General Pipeline

1. Data acquisition (e.g., from 3D sensors) 2. Feature extraction and representation

construction

3. Robot learning, e.g.:

classification (recognition),

clustering (knowledge discovery) RL for action generation

4. Action execution

(3)

Robot Learning

• There are generally three types of robot learning:

➢Learning by demonstration

➢Reinforcement learning

➢Learning from data

(4)

Learning from data

(5)

Learning from data

• Supervised learning (Classification)

▪ Support Vector Machines (SVMs)

▪ Decision Trees

▪ Ensemble methods

• Unsupervised

learning (Clustering)

▪ Gaussian Mixture Models (GMM)

(6)

Support Vector Machines

• A Support Vector Machine (SVM) is a supervised learning (i.e., classification) model that recognizes patterns using a separating hyperplane.

• Given a set of training data (with semantic labels), an SVM training algorithm builds a model that

outputs an optimal hyperplane which assigns new examples into one category or the other

• SVM is a non-probabilistic binary linear classifier ^(as

our first SVM model for discussion)

(7)

Support Vector Machines

• The original SVM algorithm was invented by

Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963, based on Statistical Learning Theory

• In 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create

nonlinear classifiers by applying the kernel trick to maximum-margin hyperplanes

• Empirically good performance: successful

applications in many fields (e.g., text processing, image processing, bioinformatics, robotics, etc.)

(8)

Support Vector Machines

• For example: To recognize human behaviors

• We have a training dataset of skeletal data for the human activity “jump forward”

• Now given a new (previously unseen) skeleton sequence, we want to answer the question: is the human performing

“jump forward”, or not?

(9)

Support Vector Machines

How can we choose a line to separate the data?

As an illustration, here we deal with lines and points in the Cartesian

plane instead of hyperplanes and vectors in a high dimensional space.

(10)

Support Vector Machines

H₁

Is this a good line?

(11)

Support Vector Machines

H₂

(12)

Support Vector Machines

H₃

(13)

Support Vector Machines

H₁ H₂ H₃

Which line is the best? …and Why?

(14)

Support Vector Machines

H₁ H₂ H₃

H3 can best separate two categories of data, as it maximizes the margin

(15)

Support Vector Machines

(16)

Support Vector Machines

• We are given a training dataset of 𝑛 instances with the format:

• The 𝑦_𝑖 are either 1 or −1, each indicating the class to which the point 𝒙_𝑖 belongs

• Each 𝒙_𝑖 is a 𝑝-dimensional real vector

𝒙₁, 𝑦₁ , … , (𝒙_𝑛, 𝑦_𝑛)

(17)

Support Vector Machines

• We are given a training dataset of 𝑛 instances with the format:

• We want to find the "maximum-margin hyperplane" that divides the dataset of

instances 𝒙_𝑖, where 𝑦_𝑖 = 1 if 𝒙_𝑖 belong to the category and 𝑦_𝑖 = −1 if 𝒙_𝑖 does not belong to the category

• The hyperplane is defined so that the distance between the hyperplane and the nearest data point from either group is maximized

𝒙₁, 𝑦₁ , … , (𝒙_𝑛, 𝑦_𝑛)

(18)

Support Vector Machines

• Any hyperplane can be written as the set of points 𝒙 satisfying

where 𝒘 is the

parameter (normal vector) of the

hyperplane

𝒘 ∙ 𝒙 − 𝑏 = 0

(19)

Support Vector Machines

• The parameter

𝑏 𝒘

determines the offset of the hyperplane

from the origin along the normal vector 𝒘

(20)

Support Vector Machines

• If the data instances are linearly separable, we can select two parallel hyperplanes (i.e., as decision boundaries) that separate the two categories of data

• The distance (i.e.,

margin) between these two hyperplanes is as large as possible

• The maximum-margin hyperplane is the one that lies halfway

between them

(21)

Support Vector Machines

• These hyperplanes can be described by the equations

and

𝒘 ∙ 𝒙 − 𝑏 = 1 𝒘 ∙ 𝒙 − 𝑏 = −1

Why 1 and -1?

(22)

Support Vector Machines

• Geometrically, the distance between these two planes is

2 𝒘

• Thus, maximizing the distance between the planes is equivalent to minimize 𝒘

(23)

Support Vector Machines

• We also need to

prevent data points from falling into the margin, we add the following constraint:

for each 𝑖 either or

• These constraints state that each data point must lie on the correct side of the margin

𝒘 ∙ 𝒙_𝑖 − 𝑏 ≥ 1, if 𝑦_𝑖 = 1 𝒘 ∙ 𝒙_𝑖 − 𝑏 ≤ −1, if 𝑦_𝑖 = −1

(24)

Support Vector Machines

• We also need to

prevent data points from falling into the margin, we add the following constraint:

for each 𝑖 either or

• The two equations can be rewritten as:

𝒘 ∙ 𝒙_𝑖 − 𝑏 ≥ 1, if 𝑦_𝑖 = 1

𝑦_𝑖(𝒘 ∙ 𝒙_𝑖 − 𝑏) ≥ 1

𝒘 ∙ 𝒙_𝑖 − 𝑏 ≤ −1, if 𝑦_𝑖 = −1

(25)

Support Vector Machines

• The final optimization problem becomes:

minimize 𝑤

subject to:

𝑦_𝑖 𝒘 ∙ 𝒙_𝑖 − 𝑏 ≥ 1 for all 1 ≤ 𝑖 ≤ 𝑛

(26)

Support Vector Machines

Relevant courses: Machine Learning and Convex Optimization

(27)

Support Vector Machines

• The final optimization problem becomes:

• After solving this optimization problem, given a new measurement 𝒙, we classify it as:

minimize 𝑤

subject to:

𝑦_𝑖 𝒘 ∙ 𝒙_𝑖 − 𝑏 ≥ 1 for all 1 ≤ 𝑖 ≤ 𝑛

𝒙 → 𝑦 = sgn 𝒘 ∙ 𝒙 − 𝑏

(28)

Support Vector Machines

• Support vectors are the data points that lie on the two

decision boundaries

• They are the most difficult to classify

• They have direct influence on the

optimum location of the decision

boundaries

(29)

Support Vector Machines

So far, we discussed our first SVM model, that is a binary linear classifier.

• How to classify data instances that are not linear spreadable?

(30)

Support Vector Machines

How about when data is not linearly separatable?

(31)

Support Vector Machines

Projecting data to a higher

dimensional space for separation

(32)

Projecting data to a higher

dimensional space for separation

Support Vector Machines

(33)

Support Vector Machines

SVMs – The kernel trick:

Project data into a higher dimensional space, then construct a hyperplane in that space so all math is the same.

(34)

(35)

(36)

(37)

(38)

(radial basis function)

(39)

(radial basis function)

𝛾 is the hyperparameter of the RBF kernel

(40)

Support Vector Machines

How to deal with non-separable data?

(i.e., when some points always fall on the wrong side of the decision boundaries even after using the kernel trick)

(41)

Support Vector Machines

• We can use the “soft”

margin (based on the slack variables 𝜉)

(42)

Support Vector Machines

• We can use the “soft”

margin (based on the slack variables 𝜉), and the problem can be rewritten as

(43)

Support Vector Machines

•The hyperparameter 𝐶 adds a penalty for each misclassified data point.

• If 𝐶 is small, the penalty for misclassified points is low, so a decision boundary with a large margin is chosen at the expense of a greater number of misclassifications.

• If 𝐶 is large, SVM tries to minimize the number of misclassified examples due to high penalty which results in a decision boundary with a smaller

margin.

(44)

Support Vector Machines

So far, we discussed our first SVM model, that is a binary linear classifier.

• How to classify data instances that are not linearly spreadable?

• How to classify categorical data with more than two categories?

(45)

Support Vector Machines

• Multi-class classification: one-versus-all (also called one-versus-rest)

Class 1

Class 2 Class 3

(46)

Support Vector Machines

• Multi-class classification: one-versus-all (also called one-versus-rest) or one-versus-one

Class 1

Class 2 Class 3

(47)

Support Vector Machines

So far, we discussed our first SVM model, that is a binary linear classifier.

• How to deal with time?

(48)

Support Vector Machines

• Online, onboard applications: Sliding window techniques

(49)

Support Vector Machines

So far, we discussed our first SVM model, that is a binary linear classifier.

• How to deal with time?

• How to evaluate performance?

(50)

Support Vector Machines

• Evaluation of a learning algorithm: Training and Testing separation

(51)

Support Vector Machines

• Evaluation of a learning algorithm: Cross- validation during training

Valid- ation

Valid- ation Valid-

ation

(52)

Support Vector Machines

• Evaluation of a learning algorithm: Confusion Matrix

Ground truth labels

Recognized labels

(53)

Support Vector Machines

• Evaluation of a learning algorithm: Confusion Matrix (normalized)

(54)