prohibitively large, and precludes conventional approaches, such as those based on simple Markovian dynamics.
The study of such fine-grained behavior is enabled by the ongoing explosion of recorded tracking data. Beyond sports (Miller et al., 2014; Yue et al., 2014b; Zheng, Yue, and Lucey, 2016; Le et al., 2017), examples include video games (Ross, Gordon, and Bagnell, 2011), video & motion capture (Suwajanakorn, Seitz, and Kemelmacher-Shlizerman, 2017; Taylor et al., 2017; Xue et al., 2016), navigation & driving (Ziebart et al., 2009; J. Zhang and Cho, 2017; Li, Song, and Ermon, 2017), laboratory animal behaviors (Johnson et al., 2016; Eyjolfsdottir et al., 2017), and tele-operated robotics (Abbeel and Ng, 2004; Lin et al., 2006). One popular research direction, which we also study, is to learn a policy that imitates demonstrated behavior, also known as imitation learning (Abbeel and Ng, 2004; Ziebart et al., 2008; Daumé, Langford, and Marcu, 2009; Ross, Gordon, and Bagnell, 2011; Ho and Ermon, 2016). Arguably the simplest form of imitation learning is behavioral cloning, where the goal is to mimic a batch of pre-collected demonstration data, e.g., from human experts.

In this paper, we quantify the predictive uncertainty of deep models by following a Bayesian nonparametric approach. In particular, we propose kernel functions which fully encapsulate the structural properties of LSTMs, for use with Gaussian processes. The resulting model enables Gaussian processes to achieve state-of-the-art performance on sequential regression tasks, while also allowing for a principled representation of uncertainty and non-parametric flexibility. Further, we develop a provably convergent semi-stochastic optimization algorithm that allows mini-batch updates of the recurrent kernels. We empirically demonstrate that this semi-stochastic approach significantly improves upon the standard non-stochastic first-order methods in runtime and in the quality of the converged solution. For additional scalability, we exploit the algebraic structure of these kernels, decomposing the relevant covariance matrices into Kronecker products of circulant matrices, for O(n) training time and O(1) test predictions (Wilson et al., 2015; Wilson and Nickisch, 2015). Our model can be interpreted not only as a Gaussian process with a recurrent kernel, but also as a deep recurrent network with probabilistic outputs, infinitely many hidden units, and a utility function robust to overfitting.

The present thesis contributes to learning over unsupervised, complex, and adversarial data.
Emphasis is laid on concocting online, scalable and robust algorithms, enabling streaming analytics of sequential measurements based on vector, matrix, and tensor-based views of supervised and unsupervised learning tasks. For online and scalable learning, a novel kernel-based feature extraction framework is put forth, in which limited memory and computational resources are accounted for via maintaining an affordable budget. Furthermore, complex interactions of real-world networks are analyzed from a community identification point-of-view, in which a novel tensor-based representation along with provable optimization techniques robustify state-of-the-art alternatives. Finally, the performance of deep convolutional neural network based image classifiers is investigated when adversaries disturbing input images are modeled as imperceptible yet carefully-crafted perturbations. To this end, a general class of high-performance Bayesian detectors of adversaries is developed. Extensive experimentation on synthetic as well as numerous real datasets demonstrates the effectiveness, interpretability and scalability of the proposed learning, identification, and detection algorithms. More importantly, the process of design and experimentation sheds light on the behavior of different methods and the peculiarities of real-world data, while at the same time it generates new ideas and directions to be explored.

2 Inria, Centre Rennes – Bretagne Atlantique, France
Abstract. Polynomial regression is a recurrent problem with a large number of applications. In computer vision it often appears in motion analysis. Whatever the application, standard methods for regression of polynomial models tend to deliver biased results when the input data is heavily contaminated by outliers. Moreover, the problem is even harder when outliers have strong structure. Departing from problem-tailored heuristics for robust estimation of parametric models, we explore deep convolutional neural networks. Our work aims to find a generic approach for training deep regression models without the explicit need for supervised annotation. We bypass the need for a tailored loss function on the regression parameters by attaching to our model a differentiable hard-wired decoder corresponding to the polynomial operation at hand. We demonstrate the value of our findings by comparing with standard robust regression methods. Furthermore, we demonstrate how to use such models for a real computer vision problem, i.e., video stabilization. The qualitative and quantitative experiments show that neural networks are able to learn robustness for general polynomial regression, with results that clearly surpass the scores of traditional robust estimation methods.

An integral part of the model building process was the tuning of the decision thresholds. Though individual thresholds per code are possible, the best micro-F1 results were always achieved with a common decision threshold over all codes. This behavior again reflects the data sparsity issues, as not all modeled codes appear in the dev set and many other codes are so sparse that no robust threshold estimation was possible.

execution. In parallel to the computation at every cycle, the TMMU reads the next node from the input buffer and saves it to the registers Reg_b. Consequently, the registers Reg_a and Reg_b can be used alternately. For the calculation, we use a pipelined binary adder tree structure to optimize the performance. As depicted in Fig. 3, the weight data and the node data are saved in BRAMs and registers. The pipeline takes advantage of time-sharing the coarse-grained accelerators. As a consequence, this implementation enables the TMMU unit to produce a partial sum result every clock cycle.
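
As a rough software illustration of the binary adder tree mentioned above, the sketch below sums a list pairwise, stage by stage. It is not the authors' hardware design: the function name `adder_tree_sum` is hypothetical, and pipelining (a register between stages so that a new input set can enter every cycle) is not modeled. It only shows that n inputs need about log2(n) adder stages.

```python
def adder_tree_sum(values):
    """Sum values pairwise, stage by stage, as a binary adder tree would.

    Returns (total, number_of_stages). Hypothetical helper for illustration.
    """
    level = list(values)
    stages = 0
    while len(level) > 1:
        # Each stage adds adjacent pairs in parallel (here: in one list pass).
        next_level = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        # An odd leftover element passes through to the next stage unchanged.
        if len(level) % 2:
            next_level.append(level[-1])
        level = next_level
        stages += 1
    total = level[0] if level else 0
    return total, stages


if __name__ == "__main__":
    products = [1, 2, 3, 4, 5, 6, 7, 8]  # e.g. weight * node products
    print(adder_tree_sum(products))
```

In a pipelined implementation, each of the stages counted here would be separated by registers, which is what allows one result per clock cycle once the pipeline is full.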

cross-dataset approaches, our MMFA network is a one-step learn-and-adapt method, which can simultaneously learn the feature representation and adapt to the target domain in a single end-to-end training procedure.
Although our MMFA framework improves the scalability of Person Re-ID models in real-world deployment, it needs a vast number of unlabelled images obtained from the new system. It also requires some additional adaptive training to create a bespoke model for the new system. In Chapter 6, we aim to develop a robust feature learner that needs to be trained only once and can be deployed out-of-the-box for any new camera network without further data collection or adaptive training. With this motivation, we propose a domain generalisation model (MMFA-AAE) that can leverage the labelled images from multiple datasets to learn a universal representation of people’s appearances. Our MMFA-AAE architecture learns a domain-invariant feature representation by jointly optimising an adversarial auto-encoder with the MMD distance regularisation. The adversarial auto-encoder is designed to learn a latent feature space among different Person Re-ID datasets by matching the distribution of the hidden codes to an arbitrary prior distribution. The MMD-based regularisation further enhances the domain-invariant features by aligning the distributions among different domains. Extensive experiments demonstrate that our proposed MMFA-AAE is able to learn domain-invariant features, which lead to state-of-the-art performance on many Person Re-ID datasets.
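
To make the MMD regularisation more concrete, here is a minimal sketch of a squared maximum mean discrepancy estimate between two sets of feature vectors. The excerpt does not say which kernel or estimator the thesis uses, so the RBF kernel, the `bandwidth` parameter, and the biased (V-statistic) estimator below are assumptions chosen for simplicity.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""

    def mean_kernel(a_samples, b_samples):
        total = sum(
            rbf_kernel(a, b, bandwidth) for a in a_samples for b in b_samples
        )
        return total / (len(a_samples) * len(b_samples))

    return (
        mean_kernel(xs, xs)
        + mean_kernel(ys, ys)
        - 2.0 * mean_kernel(xs, ys)
    )


if __name__ == "__main__":
    domain_a = [[0.0, 0.0], [1.0, 0.0]]  # toy features from one dataset
    domain_b = [[5.0, 5.0], [6.0, 5.0]]  # toy features from another dataset
    print(mmd_squared(domain_a, domain_a))  # identical samples: zero
    print(mmd_squared(domain_a, domain_b))  # well-separated samples: large
```

Used as a regulariser, a term like this would be added to the training loss so that features from different datasets are pushed toward the same distribution.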

Chapter 4 discusses a distributionally robust chance constrained approximate AC-OPF.
The power flow model employed in the proposed OPF formulation combines an exact AC power flow model at the nominal operation point and an approximate linear power flow model to reflect the system response under uncertainties. The ambiguity set employed in the distributionally robust formulation is the Wasserstein ball centered at the empirical distribution. The proposed OPF model minimizes the expectation of the quadratic cost function w.r.t. the worst-case probability distribution and guarantees that the chance constraints are satisfied for any distribution in the ambiguity set. The whole method is data-driven in the sense that the ambiguity set is constructed from historical data without any presumption on the type of the probability distribution, and more data leads to a smaller ambiguity set and a less conservative strategy. Moreover, special structures of the proposed problem formulation are exploited to develop an efficient and scalable solution approach. Case studies are carried out on the IEEE 14 and 118 bus systems to show the accuracy and necessity of the approximate AC model and the attractive features of the distributionally robust optimization approach compared with other methods for dealing with uncertainties.

A detailed mathematical analysis is provided establishing sufficient conditions for the proposed method to correctly cluster the data points. The numerical simulations with both real and synthetic data demonstrate that Innovation Pursuit notably outperforms the state-of-the-art subspace clustering algorithms. For the robust PCA problem, we focus on both the outlier detection and the matrix decomposition problems. For the outlier detection problem, we present a new algorithm, termed Coherence Pursuit, in addition to two scalable randomized frameworks for the implementation of outlier detection algorithms. The Coherence Pursuit method is the first provable and non-iterative robust PCA method which is provably robust to both unstructured and structured outliers. Coherence Pursuit is remarkably simple and it notably outperforms the existing methods in dealing with structured outliers. In the proposed randomized designs, we leverage the low dimensional structure of the low rank component to apply the robust PCA algorithm to a random sketch of the data as opposed to the full scale data. Importantly, it is analytically shown that the presented randomized designs can make the computation or sample complexity of the low rank matrix recovery algorithm independent of the size of the data. At the end, we focus on the column sampling problem.

The first scoring function proposed in this paper, which we refer to as the explicit-accuracy-based score, builds upon the method originally proposed by [16]. The main advantage of our approach is that we assume that experts are heterogeneous, i.e., different experts have different levels of accuracy. In addition, with our second score, referred to as the marginalization-based score, we are able to handle the problem that the estimated experts’ accuracies may not be reliable, and we obtain a more robust score by marginalizing out the experts’ accuracy parameters. Experimental results reveal that exploiting experts’ knowledge can improve structure learning if we take the experts’ accuracies into account. Specifically, if the experts’ accuracies can be confidently estimated, it is suggested to explicitly use the estimated accuracies in the scoring process; otherwise, marginalizing out the accuracy parameters yields more robust scores.

Bristol, U.K.
james.smith@uwe.ac.uk
Abstract- This paper presents and examines the behaviour of a system whereby the rules governing local search within a Memetic Algorithm are co-evolved alongside the problem representation. We describe the rationale for such a system, and then describe the implementation of a simple version in which the evolving rules are encoded as (condition:action) patterns applied to the problem representation. We investigate the behaviour of the algorithm on a suite of test problems, and show considerable performance improvements over a simple Genetic Algorithm, a Memetic Algorithm using a fixed neighbourhood function, and a similar Memetic Algorithm which uses random rules, i.e. with the learning mechanism disabled. Analysis of these results enables us to draw some conclusions about the way that even the simplified system is able to discover and exploit certain forms of structure and regularities if these exist within the problem space. We show that this “meta-learning”

Second, each binned object pixel may combine pixels from both the object and background regions, introducing incorrect (noisy) ground-truth. Robust training using noisy ground-truth has been shown in other CNN tasks [59]. In essence, the CNN learns the invariants and filters out the random noise. Our results suggest that the downsampling has little effect on the final results. Next, for both training and testing, the input speckles are normalized between 0 and 1 by dividing each image by its maximum.

From the above brief review of previous work, it is clear that handcrafted feature extraction methods are specifically designed for a particular application or domain. Although they have superior description ability for textural objects, they cannot learn the intrinsic characteristics of the signal and generally adapt poorly. In contrast, learned features can overcome these weaknesses. However, they normally require a large number of training samples and are expensive in terms of computational load and hardware cost. Therefore, a straightforward idea is to fuse these two kinds of features, which is naturally a promising way toward more robust and adaptable representations. However, to the best of our knowledge, few works have been done along this direction. In this paper, the low-level handcrafted feature is used as the input of the networks to learn higher representations. In addition, convergence speed and stability are two other major problems in deep networks with a propagation feedback scheme. Existing methods [19, 20] fail to solve these issues. In order to address the above problems, we propose a new deep learning framework, which only requires a small-scale CNN but achieves higher performance with lower computational cost. The proposed synchronized multi-stage feature (SMF) structure can lead to an ensemble of heterogeneous models, thus speeding up the training process and making our model more powerful by deeply investigating the diversity among different features. Moreover, the proposed boosting-like algorithm gradually tunes the sample weights in the feedback propagation, thereby gaining more stability and performance improvement. Based on the convolutional neural network (CNN) [21], our proposed deep structure contains two convolutional layers to obtain higher-level feature representations. The final classifier is only a simple single neural network.
The performance of this framework is verified by two challenging applications: pedestrian detection and action recognition.

Figure 1. The structure of the convolutional neural network used in visual tracking.
Scale Estimation
We separate position estimation and scale estimation into two phases in each frame. The first phase does nearly the same thing as normal correlation filter tracking, except that deep learning features are used instead of HoG and color features. To handle scale variation, a pool of trackers at S different scales is introduced. After the position is predicted, differently scaled search windows at the estimated position are cropped and interpolated to the same fixed size before being sent to the convolutional neural network. The outputs of the conv layers are concatenated as features and then reshaped into 1-dimensional W*H*D vectors for each scale. A separate correlation filter based scale tracker is trained to predict the best scale. The reason we use two separate trackers instead of a single one is that the measure of correct scale is not as clear as that of position, which makes the tracker drift if a single tracker keeps putting samples of different scales into the same model. We use a Gaussian function to give the current scale a higher weight and other scales lower weights. Intuitively, this means that if the tracker believes the current scale is not good enough, it has to find another scale that achieves a much better score. In practice, this strategy suppresses drifting caused by scale estimation jitter.
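
The Gaussian weighting of scales can be sketched as follows. The excerpt does not give the exact formula, so this assumes that each scale's response score is multiplied by a Gaussian weight centered on the index of the current scale. The function name `pick_scale`, the parameter `sigma`, and the use of non-negative scores are all hypothetical choices for illustration.

```python
import math


def pick_scale(scores, current_index, sigma=1.0):
    """Pick the best scale index after Gaussian weighting around the current scale.

    scores: non-negative response score per candidate scale (assumed).
    current_index: index of the scale used in the previous frame.
    sigma: width of the Gaussian prior over scale indices (hypothetical).
    """
    weighted = [
        score * math.exp(-((i - current_index) ** 2) / (2.0 * sigma ** 2))
        for i, score in enumerate(scores)
    ]
    # Return the index of the largest weighted score.
    return max(range(len(weighted)), key=weighted.__getitem__)


if __name__ == "__main__":
    # A slightly better neighbouring scale does not trigger a switch ...
    print(pick_scale([1.0, 1.0, 1.05], current_index=1))
    # ... but a much better one does.
    print(pick_scale([1.0, 1.0, 3.0], current_index=1))
```

A smaller `sigma` makes the tracker more reluctant to change scale, which is the mechanism that damps jitter in this sketch.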

Specifically, the Mapping-Net is an encoder and decoder architecture for describing the 3D structure of the environment, while the Tracking-Net is a Recurrent Convolutional Neural Network (RCNN) architecture for capturing the camera motion. The Loop-Net is a pre-trained binary classifier for detecting loop closures. DeepSLAM can simultaneously generate a pose estimate, a depth map and an outlier rejection mask. We evaluate its performance on various datasets, and find that DeepSLAM achieves good performance in terms of pose estimation accuracy, and is robust in some challenging scenes.

The way in which the experience is defined depends on how the available dataset is used. Learning problems can be divided into unsupervised and supervised ones. A dataset is a collection of many different samples. If each sample of the dataset is associated with a target or label, we are facing a supervised problem. Otherwise, the problem is unsupervised and the algorithm has to learn the structure of the dataset without explicit labeling. Examples of this kind of algorithm are clustering, where the goal is to group similar samples, and some dimensionality reduction algorithms such as Principal Component Analysis (PCA) [120], where the objective is to reduce the number of features while retaining the relevant information, without targets or labels. Besides unsupervised and supervised tasks, there are other settings such as reinforcement learning, where the program interacts with an environment in pursuit of a goal.

The first conclusion is that the best performing resolution overall is not the best resolution for each and every acoustic event class separately. In general, and consistent with previous assumptions, certain low-energy events such as “keyboard typing” are better tracked with short frame resolutions, whereas long frames perform better for a “door slam” (50 ms). This is also the case with “applause” (40 ms), which has a very similar structure in the frequency domain. On the other hand, with events such as “chair move”, switching the frame length seems to have almost no effect on performance. Other sounds such as a “laugh” perform in various ways with no strong trend.

the output layer. The same DNN architecture was used for all frequency bands and we did not optimise it for individual frequencies.
The neural network was initialised with a single hidden layer, and the number of hidden layers was gradually increased in later training phases. In each training phase, mini-batch gradient descent with a batch size of 128 was used, including a momentum term with the momentum rate set to 0.5. The initial learning rate was set to 1, which gradually decreased to 0.05 after 20 epochs. After the learning rate decreased to 0.05, it was held constant for a further 5 epochs. We also included a validation set and the training procedure was stopped earlier if no new best error on the validation set could be achieved within the last 5 epochs. At the end of each training phase, an extra hidden layer was added between the last hidden layer and the output layer, and the training phase was repeated until the desired number of hidden layers was reached (two hidden layers in this study).
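
The schedule and stopping rule described above can be sketched as two small helpers. The excerpt only says the learning rate "gradually decreased" from 1 to 0.05 over 20 epochs, so the linear decay shape used here is an assumption, and the function names `learning_rate` and `should_stop` are hypothetical.

```python
def learning_rate(epoch, start=1.0, end=0.05, decay_epochs=20):
    """Learning rate for a 0-based epoch index.

    Decays from `start` to `end` over `decay_epochs` epochs (linear decay is
    an assumed shape), then stays at `end`.
    """
    if epoch >= decay_epochs:
        return end
    return start + (end - start) * epoch / decay_epochs


def should_stop(val_errors, patience=5):
    """True if no new best validation error occurred in the last `patience` epochs.

    val_errors: list of validation errors, one per completed epoch.
    """
    if len(val_errors) <= patience:
        return False
    # Index of the first occurrence of the best (lowest) validation error.
    best_epoch = min(range(len(val_errors)), key=val_errors.__getitem__)
    return best_epoch < len(val_errors) - patience


if __name__ == "__main__":
    for epoch in (0, 10, 20, 24):
        print(epoch, learning_rate(epoch))
    print(should_stop([5.0, 4.0, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5]))
```

With these defaults a training phase runs for at most 25 epochs (20 of decay plus 5 at the constant rate), unless `should_stop` ends it earlier.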

Chapter 5
Rademacher Complexity for
Adversarially Robust Generalization
Many machine learning models are vulnerable to adversarial attacks; for example, adding adversarial perturbations that are imperceptible to humans can often make machine learning models produce wrong predictions with high confidence. Moreover, although we may obtain robust models on the training dataset via adversarial training, in some problems the learned models cannot generalize well to the test data. In this chapter, we focus on ℓ∞ attacks, and study the adversarially robust generalization problem through the lens of Rademacher complexity. For binary linear classifiers, we prove tight bounds for the adversarial Rademacher complexity, and show that the adversarial Rademacher complexity is never smaller than its natural counterpart, and it has an unavoidable dimension dependence, unless the weight vector has bounded ℓ1 norm. The results also extend to multi-class linear classifiers. For (nonlinear) neural networks, we show that the dimension dependence in the adversarial Rademacher complexity also exists. We further consider a surrogate adversarial loss for one-hidden-layer ReLU networks and prove margin bounds for this setting. Our results indicate that having ℓ1 norm constraints on the weight matrices might be a potential way to improve generalization in the adversarial setting. We demonstrate experimental results that validate our theoretical findings.
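
One way to see why the ℓ1 norm of the weights shows up for ℓ∞ attacks is a standard fact about linear classifiers. The worst-case margin under a perturbation of ℓ∞ size at most eps equals the clean margin minus eps times the ℓ1 norm of the weight vector. The sketch below is an illustration of that fact and is not taken from the chapter. It checks the closed form against a brute-force search over the corners of the perturbation ball, and all function names are hypothetical.

```python
from itertools import product


def worst_case_margin(w, x, y, eps):
    """Closed form for min over ||delta||_inf <= eps of y * <w, x + delta>."""
    clean_margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return clean_margin - eps * sum(abs(wi) for wi in w)


def brute_force_margin(w, x, y, eps):
    """Same quantity by enumerating the corners of the l_inf ball.

    A linear function attains its minimum over a box at one of the corners.
    """
    return min(
        y * sum(wi * (xi + di) for wi, xi, di in zip(w, x, corner))
        for corner in product((-eps, eps), repeat=len(w))
    )


if __name__ == "__main__":
    w = [1.0, -2.0, 0.5]
    x = [0.3, 0.1, -0.4]
    print(worst_case_margin(w, x, y=1, eps=0.1))
    print(brute_force_margin(w, x, y=1, eps=0.1))
```

Because the penalty term grows with the ℓ1 norm of the weights, bounding that norm limits how much an ℓ∞ adversary can reduce the margin.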

In Chapter 11, the problem tackled is the segmentation of Magnetic Resonance Imaging (MRI) data. MRI allows the acquisition of high-resolution images of the brain. The diagnosis of various brain illnesses is supported by the separate analysis of the different kinds of brain tissue, which requires their segmentation and classification. Brain MRI is organized in volumes composed of millions of voxels (at least 65,536 per slice, for at least 50 slices), hence the problem of labeling the brain tissue classes in the composition of atlases and ground truth references [153], which are needed for the training and validation of the machine-learning methods employed for brain segmentation. We propose a stacking classification scheme that does not require any other anatomical a priori information to identify the 3 classes: Gray Matter (GM), White Matter (WM) and Cerebro-Spinal Fluid (CSF). We employed two different MR sequences: FLuid Attenuated Inversion Recovery (FLAIR) and Double Inversion Recovery (DIR). The former highlights both GM and WM; the latter highlights GM alone. Features are extracted by means of a local multi-scale texture analysis, computed for each pixel of the DIR and FLAIR sequences. The 9 textures considered are average, standard deviation, kurtosis, entropy, contrast, correlation, energy, homogeneity, and skewness, evaluated on neighborhoods of 3x3, 5x5, and 7x7 pixels. Two stacked classifiers were created exploiting the a priori knowledge about DIR and FLAIR images. The results highlight an improvement in classification performance with respect to using all the features in a state-of-the-art single classifier. The proper use of a priori information was further developed, and we noted that the better performance depends on the a priori decision to use the two different images in a hierarchical manner with a specific order.
