2.2.1.2 Models and Methods
In order to make any predictions from previously seen data, every model and method has to make prior assumptions. These structural choices encode intrinsic assumptions, such as how the observed data points are correlated. As stated by the no free lunch theorem (D. H. Wolpert et al. 1997), there cannot exist a single machine learning method which outperforms or is on par with all other methods on all possible datasets and domains. However, it is fair to assume that there exist subsets of problems, relevant to robotic manipulation or to humans in general, for which a learning algorithm can outperform all other methods since it is based on the right assumptions. Models and methods for machine learning can be broadly grouped into three categories: (i) Lazy learning methods such as nearest neighbor (C. M. Bishop 2006) store the training data and perform inference by computing statistics on the set of k-nearest neighbors. Locally weighted regression, another example of lazy learning, does not simply store all training examples but builds local models, lowering the inference complexity. (ii) Eager learning methods based on parametric models, such as logistic regression (Chapter 2.2.2.3) and deep neural networks (Chapter 2.2.2.4), do not require any data samples at inference time. Such methods update a parametric function representation to explain all observed data samples. Hyper-parameters for parametric models might encode assumptions such as the functional form, or how data affects the parametric model. (iii) Models which grow with new data samples form the last category, typically referred to as non-parametric models, e.g. random forests (Ho 1995), decision trees (Quinlan 1986) and Bayesian non-parametric methods (Murphy 2012).
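The difference between lazy and eager learning can be illustrated with a minimal sketch. The toy data, the function names and the choice of a straight line as the parametric model are assumptions made purely for illustration; they are not taken from the methods used later in this thesis.

```python
def knn_predict(train, x_query, k=3):
    """Lazy learner: keeps all (x, y) pairs and averages the k nearest targets."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x_query))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def fit_line(train):
    """Eager learner: compresses the data into two parameters (slope, intercept)."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    var_x = sum((x - mean_x) ** 2 for x, _ in train)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def line_predict(theta, x_query):
    """Inference only needs the parameters theta, not the training data."""
    slope, intercept = theta
    return slope * x_query + intercept


# Hypothetical noise-free toy data on the line y = 2x + 1.
train = [(float(x), 2.0 * x + 1.0) for x in range(6)]

theta = fit_line(train)                 # training data could be discarded after this
line_at_10 = line_predict(theta, 10.0)  # uses the assumed functional form
knn_at_10 = knn_predict(train, 10.0)    # has to consult the stored samples
```

Under these assumptions the line model recovers the underlying function and continues it beyond the data, whereas the nearest neighbor prediction is bounded by the targets it has stored. Which behavior is preferable depends on whether the assumed functional form is right.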
The broad family of parametric models is most relevant for this thesis and most commonly used in the robotic manipulation literature. In the following, we represent parametric models as
$$f(\cdot\,; \theta) : X \to Y \tag{2.2}$$
where θ encapsulates all parameters optimized from data. The number of open parameters θ, in conjunction with the functional structure, defines the model capacity. Generally, one can assume that the more parameters a model has, the higher its capacity, which is typically required to express complex mappings from data. Conversely, it is often beneficial to use the smallest number of parameters that explains the data, a principle known as Occam's razor (William of Ockham, c. 1287–1347).
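To make the link between the number of parameters and model capacity concrete, the following sketch fits a piecewise-constant model with a growing number of bins to samples of a sine wave. The model class, the data and all names are illustrative assumptions; here θ is simply the list of per-bin means.

```python
import math


def fit_piecewise_constant(xs, ys, n_bins):
    """Fit one constant per bin on [0, 1); theta holds n_bins parameters."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def predict(theta, x):
    """Evaluate f(x; theta) by looking up the bin that contains x."""
    n_bins = len(theta)
    return theta[min(int(x * n_bins), n_bins - 1)]


def mse(theta, xs, ys):
    """Mean squared error of the model on the given samples."""
    return sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training set: 32 evenly spaced samples of one sine period.
xs = [i / 32 for i in range(32)]
ys = [math.sin(2 * math.pi * x) for x in xs]

# Training error for models with 1, 2, 4, 8 and 32 parameters.
train_errors = {
    n: mse(fit_piecewise_constant(xs, ys, n), xs, ys) for n in (1, 2, 4, 8, 32)
}
```

On this data the training error shrinks as parameters are added and vanishes once there is one parameter per sample. A zero training error alone says nothing about predictions between the samples, which is the concern that Occam's razor addresses.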
Within the framework of probability theory, supervised learning can be formalized as follows:
$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)} = \frac{p(x, y)}{p(x)} \tag{2.3}$$
Here, p(y|x) is the feature conditional, p(x|y) the target conditional, p(y) the target prior and p(x) the marginal distribution. Hence, we can categorize our parametric model f(·;θ) depending on the term it attempts to fit: either the left-hand side p(y|x) directly (discriminative) or the right-hand side (generative).
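As a small worked check of this decomposition, the sketch below computes p(y|x) in two ways from a hand-picked joint distribution over a binary feature and a binary target. The probability table and function names are hypothetical and serve only to show that both routes agree.

```python
# Hypothetical joint distribution p(x, y) over binary x and binary y.
joint = {(0, 0): 0.30, (0, 1): 0.10, (1, 0): 0.15, (1, 1): 0.45}


def marginal_x(x):
    """p(x): sum the joint over all targets."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(y):
    """p(y): the target prior, obtained by summing the joint over all features."""
    return sum(p for (_, yi), p in joint.items() if yi == y)


def posterior_from_joint(y, x):
    """p(y|x) = p(x, y) / p(x)."""
    return joint[(x, y)] / marginal_x(x)


def posterior_from_bayes(y, x):
    """p(y|x) = p(x|y) p(y) / p(x), with p(x|y) the target conditional."""
    target_conditional = joint[(x, y)] / marginal_y(y)
    return target_conditional * marginal_y(y) / marginal_x(x)


p_direct = posterior_from_joint(1, 1)
p_bayes = posterior_from_bayes(1, 1)
```

Both routes give p(y=1|x=1) = 0.45 / 0.60 = 0.75 up to floating point rounding. A generative model estimates the terms on the right-hand side, while a discriminative model targets the left-hand side without them.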
Generative Model: Generative models allow us to synthesize data, reason about missing features x, and transparently encode prior knowledge about the targets y as well as knowledge of the data generating process. Hence, if we have a good model of the underlying data generation process, less data is required to achieve good regression and classification results. However, a severely wrong prior model introduces a strong bias, so the resulting fitted model might not achieve good performance. Generative models often lend themselves to analysis since the assumptions about the data generation process are explicitly expressed in the model. Additionally, we can draw samples from the model and thereby gain insights into its shortcomings. Further, generative models provide an uncertainty estimate and therefore allow better reasoning about their predictions. Since robotic manipulation systems act in their environment and can actively generate and acquire new data, uncertainty estimates can be exploited to make more informed decisions about future actions. However, capturing uncertainty as well as modeling the data generation process is a much more complex learning problem compared to the simple discriminative learning discussed hereafter. For example, in order to classify objects, we might not be interested in also rendering natural images, as an inverse graphics approach would require. More related to manipulation, if we are interested in inferring whether a grasp hypothesis results in a stable or unstable grasp from RGB-D sensor observations, we do not need to understand how realistic, partially occluded depth maps are generated and used to inform a grasp.
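The properties listed above (sampling, explicit assumptions, uncertainty estimates) can be sketched with a very small generative classifier. The one-dimensional Gaussian class-conditional p(x|y), the toy data and all names are assumptions chosen for illustration and are not the models used in this thesis.

```python
import math
import random


def fit_gaussian_generative(xs, ys):
    """Estimate the prior p(y) and a 1-D Gaussian p(x|y) for every class."""
    model = {}
    for label in sorted(set(ys)):
        values = [x for x, y in zip(xs, ys) if y == label]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        model[label] = {"prior": len(values) / len(xs), "mean": mean, "var": var}
    return model


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def posterior(model, x):
    """p(y|x) via Bayes' rule: normalize p(x|y) p(y) over all classes."""
    joint = {
        label: p["prior"] * gaussian_pdf(x, p["mean"], p["var"])
        for label, p in model.items()
    }
    evidence = sum(joint.values())
    return {label: j / evidence for label, j in joint.items()}


def sample(model, rng):
    """Synthesize a new (x, y) pair: draw y from the prior, then x from p(x|y)."""
    labels = list(model)
    label = rng.choices(labels, weights=[model[l]["prior"] for l in labels])[0]
    p = model[label]
    return rng.gauss(p["mean"], math.sqrt(p["var"])), label


# Hypothetical 1-D training data with two classes.
xs = [0.8, 1.0, 1.2, 2.8, 3.0, 3.2]
ys = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_generative(xs, ys)

confident = posterior(model, 1.0)   # close to the class 0 samples
uncertain = posterior(model, 2.0)   # halfway between both classes
synthetic = [sample(model, random.Random(0)) for _ in range(3)]
```

With this data the posterior at x = 2.0 is split evenly between both classes, which is the kind of uncertainty signal a robot could use to decide where to collect more data. If the Gaussian assumption were badly wrong, the same explicit structure would introduce the bias discussed above.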
Discriminative Model: The second model class attempts to directly learn a mapping from features x to targets y, therefore learning a discriminative function p(y|x). For classification (Chapter 2.2.1.6) this results in learning a decision boundary, whereas for regression (Chapter 2.2.1.5) it results in predicting the values of y. Models in this category are often computationally less expensive and scale better to large datasets and high dimensional feature spaces compared to generative models. Further, since this model class learns a direct mapping, inference is straightforward and requires only a simple model evaluation. Hence, discriminative models result in fast inference. Different from generative models, information about the underlying data distribution has to be learned implicitly during the optimization. Therefore, large datasets are essential for good performance, which usually calls for gradient-based optimization methods that have been studied extensively in the optimization literature (Bottou et al. 2018; Nocedal et al. 1999). Although discriminative models are very flexible and allow assumptions to be encoded implicitly, it is often not clear how to encode domain knowledge. For instance, it is unclear how to use existing knowledge about a robot's kinematics, which is useful for robot manipulation learning tasks, within a discriminative learning approach such as deep neural networks (Chapter 2.2.2.4). Introspection of the resulting models is another aspect in which they differ from generative models. Understanding the biases of the learned model and analyzing error cases is inherently difficult. Additionally, many state-of-the-art discriminative models do not provide well-calibrated uncertainty estimates. Thus, it is unclear whether a robotic system should rely on, e.g., torque predictions from such methods, since there is no assurance that the resulting predictions are safe to execute.
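A minimal sketch of discriminative learning is logistic regression fitted with plain gradient descent. The toy data, learning rate and step count below are assumptions for illustration only; the model parameterizes p(y=1|x) directly and never models how x is generated.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.5, steps=2000):
    """Minimize the mean cross-entropy of p(y=1|x) = sigmoid(w*x + b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def predict_proba(theta, x):
    """Inference is a single evaluation of the learned function."""
    w, b = theta
    return sigmoid(w * x + b)


# Hypothetical 1-D data: class 0 left of the origin, class 1 right of it.
xs = [-1.2, -1.0, -0.8, 0.8, 1.0, 1.2]
ys = [0, 0, 0, 1, 1, 1]
theta = fit_logistic(xs, ys)

p_positive = predict_proba(theta, 1.0)
p_negative = predict_proba(theta, -1.0)
```

Because this toy data is perfectly separable, the weight keeps growing with more gradient steps and the predicted probabilities approach 0 and 1. This is a simple instance of the overconfident, poorly calibrated estimates mentioned above.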
Global Model: Supervised learning approaches attempt to capture the important structure in the feature space to infer targets y in an efficient manner. Global models such as Support Vector Machines (SVMs) (Chapter 2.2.2.1) and deep neural networks (DNNs) (Chapter 2.2.2.4) consist of one model f(·;θ) for all data samples in D. Therefore, each data sample (x, y) ∈ D affects the prediction for all other data samples (x′, y′) ∈ D \ {(x, y)}. In the case of discriminative approaches, this property facilitates extrapolation since the parametric model does not encode a notion of proximity to the training data. Gaussian processes (Rasmussen et al. 2006), a generative Bayesian approach, do encode proximity to the training data, which is essential to obtain a measure of uncertainty. Still, the open parameters are optimized based on all data points, which can result in worse performance if the underlying data distribution has very different noise characteristics in different parts of the feature space. Continuous online learning is another challenge for global models. Since updates have a global effect, old data has to be revisited in order not to forget its mapping. Generally speaking, a representative dataset has to be maintained; otherwise, the model performance degrades with respect to the previously learned mapping.
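The global effect of a single sample can be seen in a small sketch that uses an ordinary least-squares line as a stand-in for a global model f(·;θ). The data and names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of one global line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


# Hypothetical samples on the line y = x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 2.0, 3.0, 4.0]

slope_before, intercept_before = fit_line(xs, ys)

# A single new sample far away, at x = 10, with a conflicting target.
slope_after, intercept_after = fit_line(xs + [10.0], ys + [0.0])

# The prediction at x = 0 is the intercept of the fitted line.
prediction_before = intercept_before
prediction_after = intercept_after
```

In this example the prediction at x = 0 moves from 0 to 35/19 ≈ 1.84 although the new sample lies far away at x = 10. This is why a global model needs a representative dataset to preserve a previously learned mapping.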
Local Model: In order to better cope with shifting data distributions and streaming data, local learning methods such as locally weighted regression (LWR) (C. G. Atkeson et al. 1997) and incremental local Gaussian regression (iLGR) (Meier et al. 2014b) allocate new local models over time. Learning the “optimal” similarity metric is a challenging task for this learning regime since a globally accessible proximity evaluation is vital to scale such methods to high dimensional feature spaces. Further, changing the similarity metric over time introduces additional complexity since efficient data structures such as k-d trees (Bentley 1975), required for fast proximity computation, would have to be updated over time as well. Other challenges are adding models over time and altering local models while maintaining the best possible performance. Extrapolation with local models poses further difficulties: since global effects and structures are not captured, local models cannot be used to extrapolate to unseen parts of the feature space.
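A minimal sketch of locally weighted regression illustrates the local behavior described here. The Gaussian kernel, the fixed bandwidth and the toy data are assumptions for illustration; this is not the iLGR algorithm and it omits learning the similarity metric.

```python
import math


def lwr_predict(xs, ys, x_query, bandwidth=1.0):
    """Fit a kernel-weighted line around x_query and evaluate it there."""
    weights = [math.exp(-0.5 * ((x - x_query) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    cov_xy = sum(
        w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys)
    )
    var_x = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    slope = cov_xy / var_x
    return mean_y + slope * (x_query - mean_x)


# Hypothetical samples on the line y = x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 2.0, 3.0, 4.0]

before = lwr_predict(xs, ys, 0.0)

# Add a distant sample at x = 10 with a conflicting target and query again.
after = lwr_predict(xs + [10.0], ys + [0.0], 0.0)
```

The prediction at x = 0 is practically unchanged because the kernel weight of the distant sample is about exp(-50). The flip side is that queries far from all data receive vanishing weights everywhere, so the fit there is determined by whichever sample happens to be least far away rather than by any global structure.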