Robot Learning from Data
General Pipeline
1. Data acquisition (e.g., from 3D sensors) 2. Feature extraction and representation
construction
3. Robot learning, e.g.:
classification (recognition),
clustering (knowledge discovery) RL for action generation
4. Action execution
Robot Learning
• There are generally three types of robot learning:
➢Learning by demonstration
➢Reinforcement learning
➢Learning from data
Learning from data
Learning from data
• Supervised learning (Classification)
▪ Support Vector Machines (SVMs)
▪ Decision Trees
▪ Ensemble methods
• Unsupervised
learning (Clustering)
▪ Gaussian Mixture Models (GMM)
Support Vector Machines
• A Support Vector Machine (SVM) is a supervised learning (i.e., classification) model that recognizes patterns using a separating hyperplane.
• Given a set of training data (with semantic labels), an SVM training algorithm builds a model that
outputs an optimal hyperplane which assigns new examples into one category or the other
• SVM is a non-probabilistic binary linear classifier (as
our first SVM model for discussion)
Support Vector Machines
• The original SVM algorithm was invented by
Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963, based on Statistical Learning Theory
• In 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create
nonlinear classifiers by applying the kernel trick to maximum-margin hyperplanes
• Empirically good performance: successful
applications in many fields (e.g., text processing, image processing, bioinformatics, robotics, etc.)
Support Vector Machines
• For example: To recognize human behaviors
• We have a training dataset of skeletal data for the human activity “jump forward”
• Now given a new (previously unseen) skeleton sequence, we want to answer the question: is the human performing
“jump forward”, or not?
Support Vector Machines
How can we choose a line to separate the data?
As an illustration, here we deal with lines and points in the Cartesian
plane instead of hyperplanes and vectors in a high dimensional space.
Support Vector Machines
H1
Is this a good line?
Support Vector Machines
H2
Is this a good line?
Support Vector Machines
H3
Is this a good line?
Support Vector Machines
H1 H2 H3
Which line is the best? …and Why?
Support Vector Machines
H1 H2 H3
H3 can best separate two categories of data, as it maximizes the margin
Support Vector Machines
Support Vector Machines
• We are given a training dataset of 𝑛 instances with the format:
• The 𝑦𝑖 are either 1 or −1, each indicating the class to which the point 𝒙𝑖 belongs
• Each 𝒙𝑖 is a 𝑝-dimensional real vector
𝒙1, 𝑦1 , … , (𝒙𝑛, 𝑦𝑛)
Support Vector Machines
• We are given a training dataset of 𝑛 instances with the format:
• We want to find the "maximum-margin hyperplane" that divides the dataset of
instances 𝒙𝑖, where 𝑦𝑖 = 1 if 𝒙𝑖 belong to the category and 𝑦𝑖 = −1 if 𝒙𝑖 does not belong to the category
• The hyperplane is defined so that the distance between the hyperplane and the nearest data point from either group is maximized
𝒙1, 𝑦1 , … , (𝒙𝑛, 𝑦𝑛)
Support Vector Machines
• Any hyperplane can be written as the set of points 𝒙 satisfying
where 𝒘 is the
parameter (normal vector) of the
hyperplane
𝒘 ∙ 𝒙 − 𝑏 = 0
Support Vector Machines
• The parameter
𝑏 𝒘
determines the offset of the hyperplane
from the origin along the normal vector 𝒘
Support Vector Machines
• If the data instances are linearly separable, we can select two parallel hyperplanes (i.e., as decision boundaries) that separate the two categories of data
• The distance (i.e.,
margin) between these two hyperplanes is as large as possible
• The maximum-margin hyperplane is the one that lies halfway
between them
Support Vector Machines
• These hyperplanes can be described by the equations
and
𝒘 ∙ 𝒙 − 𝑏 = 1 𝒘 ∙ 𝒙 − 𝑏 = −1
Why 1 and -1?
Support Vector Machines
• Geometrically, the distance between these two planes is
2 𝒘
• Thus, maximizing the distance between the planes is equivalent to minimize 𝒘
Support Vector Machines
• We also need to
prevent data points from falling into the margin, we add the following constraint:
for each 𝑖 either or
• These constraints state that each data point must lie on the correct side of the margin
𝒘 ∙ 𝒙𝑖 − 𝑏 ≥ 1, if 𝑦𝑖 = 1 𝒘 ∙ 𝒙𝑖 − 𝑏 ≤ −1, if 𝑦𝑖 = −1
Support Vector Machines
• We also need to
prevent data points from falling into the margin, we add the following constraint:
for each 𝑖 either or
• The two equations can be rewritten as:
𝒘 ∙ 𝒙𝑖 − 𝑏 ≥ 1, if 𝑦𝑖 = 1
𝑦𝑖(𝒘 ∙ 𝒙𝑖 − 𝑏) ≥ 1
𝒘 ∙ 𝒙𝑖 − 𝑏 ≤ −1, if 𝑦𝑖 = −1
Support Vector Machines
• The final optimization problem becomes:
minimize 𝑤
subject to:
𝑦𝑖 𝒘 ∙ 𝒙𝑖 − 𝑏 ≥ 1 for all 1 ≤ 𝑖 ≤ 𝑛
Support Vector Machines
Relevant courses: Machine Learning and Convex Optimization
Support Vector Machines
• The final optimization problem becomes:
• After solving this opti- mization problem, given a new measurement 𝒙, we classify it as:
minimize 𝑤
subject to:
𝑦𝑖 𝒘 ∙ 𝒙𝑖 − 𝑏 ≥ 1 for all 1 ≤ 𝑖 ≤ 𝑛
𝒙 → 𝑦 = sgn 𝒘 ∙ 𝒙 − 𝑏
Support Vector Machines
• Support vectors are the data points that lie on the two
decision boundaries
• They are the most difficult to classify
• They have direct influence on the
optimum location of the decision
boundaries
Support Vector Machines
So far, we discussed our first SVM model, that is a binary linear classifier.
• How to classify data instances that are not linear spreadable?
Support Vector Machines
How about when data is not linearly separatable?
Support Vector Machines
Projecting data to a higher
dimensional space for separation
Projecting data to a higher
dimensional space for separation
Support Vector Machines
Support Vector Machines
SVMs – The kernel trick:
Project data into a higher dimensional space, then construct a hyperplane in that space so all math is the same.
(radial basis function)
(radial basis function)
𝛾 is the hyperparameter of the RBF kernel
Support Vector Machines
How to deal with non-separable data?
(i.e., when some points always fall on the wrong side of the decision boundaries even after using the kernel trick)
Support Vector Machines
• We can use the “soft”
margin (based on the slack variables 𝜉)
Support Vector Machines
• We can use the “soft”
margin (based on the slack variables 𝜉), and the problem can be rewritten as
Support Vector Machines
•The hyperparameter 𝐶 adds a penalty for each misclassified data point.
• If 𝐶 is small, the penalty for misclassified points is low, so a decision boundary with a large margin is chosen at the expense of a greater number of misclassifications.
• If 𝐶 is large, SVM tries to minimize the number of misclassified examples due to high penalty which results in a decision boundary with a smaller
margin.
Support Vector Machines
So far, we discussed our first SVM model, that is a binary linear classifier.
• How to classify data instances that are not linearly spreadable?
• How to classify categorical data with more than two categories?
Support Vector Machines
• Multi-class classification: one-versus-all (also called one-versus-rest)
Class 1
Class 2 Class 3
Support Vector Machines
• Multi-class classification: one-versus-all (also called one-versus-rest) or one-versus-one
Class 1
Class 2 Class 3
Support Vector Machines
So far, we discussed our first SVM model, that is a binary linear classifier.
• How to classify data instances that are not linearly spreadable?
• How to classify categorical data with more than two categories?
• How to deal with time?
Support Vector Machines
• Online, onboard applications: Sliding window techniques
Support Vector Machines
So far, we discussed our first SVM model, that is a binary linear classifier.
• How to classify data instances that are not linearly spreadable?
• How to classify categorical data with more than two categories?
• How to deal with time?
• How to evaluate performance?
Support Vector Machines
• Evaluation of a learning algorithm: Training and Testing separation
Support Vector Machines
• Evaluation of a learning algorithm: Cross- validation during training
Valid- ation
Valid- ation
Valid- ation Valid-
ation
Support Vector Machines
• Evaluation of a learning algorithm: Confusion Matrix
Ground truth labels
Recognized labels
Support Vector Machines
• Evaluation of a learning algorithm: Confusion Matrix (normalized)