The study of such fine-grained behavior is enabled by the ongoing explosion of recorded tracking data. Beyond sports (Miller et al., 2014; Yue et al., 2014b; Zheng, Yue, and Lucey, 2016; Le et al., 2017), examples include video games (Ross, Gordon, and Bagnell, 2011), video & motion capture (Suwajanakorn, Seitz, and Kemelmacher-Shlizerman, 2017; Taylor et al., 2017; Xue et al., 2016), navigation & driving (Ziebart et al., 2009; J. Zhang and Cho, 2017; Li, Song, and Ermon, 2017), laboratory animal behaviors (Johnson et al., 2016; Eyjolfsdottir et al., 2017), and tele-operated robotics (Abbeel and Ng, 2004; Lin et al., 2006). One popular research direction, which we also study, is to learn a policy that imitates demonstrated behavior, also known as imitation learning (Abbeel and Ng, 2004; Ziebart et al., 2008; Daumé, Langford, and Marcu, 2009; Ross, Gordon, and Bagnell, 2011; Ho and Ermon, 2016). Arguably the simplest form of imitation learning is behavioral cloning, where the goal is to mimic a batch of pre-collected demonstration data, e.g., from human experts. Many decision problems can be naturally modeled as requiring high-level, long-term macro-goals, which span time horizons much longer than the timescale of low-level micro-actions (cf. He, Brunskill, and Roy, 2010; Hausknecht and Stone, 2016). A natural example of such macro-micro behavior occurs in spatiotemporal games, such as basketball, where players execute complex trajectories. The micro-actions of each agent are to move around the court and, if they have the ball, dribble, pass, or shoot the ball. These micro-actions operate at the centisecond scale, whereas their macro-goals, such as "maneuver behind these 2 defenders towards the basket", span multiple seconds. Figure 2.1 depicts an example from a professional basketball game, where the player must make a sequence of movements (micro-actions) in order to reach a specific location on the basketball court (macro-goal).
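
Behavioral cloning, as described above, reduces imitation to supervised learning on demonstration pairs. The following is a minimal sketch under strong simplifying assumptions that are not from the text: a one-dimensional state, a one-dimensional action, and a linear policy fit by ordinary least squares. The names `fit_linear_policy` and `demos` are hypothetical.

```python
def fit_linear_policy(demos):
    """Fit action = w * state + b to (state, action) pairs by least squares.

    `demos` is a list of (state, action) tuples; this 1-D linear form is an
    illustrative assumption, not the policy class used in the source.
    """
    n = len(demos)
    mean_s = sum(s for s, _ in demos) / n
    mean_a = sum(a for _, a in demos) / n
    # Closed-form simple linear regression: slope = cov(s, a) / var(s).
    cov = sum((s - mean_s) * (a - mean_a) for s, a in demos)
    var = sum((s - mean_s) ** 2 for s, _ in demos)
    w = cov / var
    b = mean_a - w * mean_s
    return w, b


# Hypothetical expert demonstrations generated by action = 2 * state + 1.
demos = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = fit_linear_policy(demos)


def policy(state):
    """Cloned policy: predicts the expert's action for a new state."""
    return w * state + b
```

The cloned policy only sees states that appear in the demonstrations, which is one reason the text calls this the simplest form of imitation learning.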

Abstract- This paper presents and examines the behaviour of a system whereby the rules governing local search within a Memetic Algorithm are co-evolved alongside the problem representation. We describe the rationale for such a system, and then describe the implementation of a simple version in which the evolving rules are encoded as (condition:action) patterns applied to the problem representation. We investigate the behaviour of the algorithm on a suite of test problems, and show considerable performance improvements over a simple Genetic Algorithm, a Memetic Algorithm using a fixed neighbourhood function, and a similar Memetic Algorithm which uses random rules, i.e. with the learning mechanism disabled. Analysis of these results enables us to draw some conclusions about the way that even the simplified system is able to discover and exploit certain forms of structure and regularities if these exist within the problem space. We show that this “meta-learning” of problem features provides a means of creating highly scalable algorithms for some types of problems. We further demonstrate that in the absence of such exploitable patterns, the use of continually evolving neighbourhood functions for the local search operators adds robustness to the Memetic Algorithm in a manner similar to Variable Neighbourhood Search. Finally we draw some initial conclusions about the way in which this meta-learning takes place, via examination of the use of different pivot rules and pairing strategies between the population of solutions and the population of rules.

Convolution has the nice property of being translation-invariant. Intuitively, this means that each convolution filter represents a feature of interest (e.g., from edge detectors to eyes and noses). By hierarchically connecting these convolution filters, the CNN algorithm can learn a robust feature combination which can comprise the resulting reference (i.e., a face). The output signal strength is not dependent on where the features are located, but on whether the features are present. Hence, a face could be located in different positions, and the CNN algorithm would still be able to recognise it. Moreover, we need to specify other important parameters such as channel depth, stride, and zero-padding. The channel depth corresponds to the number of filters we use for the convolution operation. The more filters we have, the more image features are extracted and the better the network becomes at recognising patterns in unseen images. Stride is the number of pixels (i.e., the displacement) by which we slide our filter matrix over the input matrix. When the stride is 1, we move the filters one pixel at a time. When the stride is 2, the filters jump 2 pixels at a time as we slide them around. A larger stride produces smaller feature maps. Sometimes it is convenient to pad the input matrix with zeros along the border, so that we can apply the filter to the bordering elements of our input image matrix. A useful feature of zero padding is that it allows us to control the size of the feature maps.
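
The effect of stride and zero-padding on feature-map size can be made concrete with the standard output-size formula, floor((n + 2p - k) / s) + 1, for input size n, filter size k, padding p, and stride s. The sketch below is illustrative only; the function names are hypothetical and the filter size is a parameter the passage does not discuss explicitly.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Size of a feature map along one spatial dimension."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    span = input_size + 2 * padding - filter_size
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    # Number of valid filter positions when sliding `stride` pixels at a time.
    return span // stride + 1


def conv_output_shape(height, width, filter_size, num_filters, stride=1, padding=0):
    """(height, width, depth) of the output; depth equals the number of filters."""
    return (
        conv_output_size(height, filter_size, stride, padding),
        conv_output_size(width, filter_size, stride, padding),
        num_filters,
    )
```

For a 32 × 32 input and a 5 × 5 filter, stride 1 without padding gives a 28 × 28 map, stride 2 gives 14 × 14, and padding of 2 with stride 1 preserves the 32 × 32 size.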

Remark 5.4. According to Theorems 1-4, the sufficient number of random linear observations depends on r and the coherency parameters. The coherency parameters of the column/row spaces depend on the distribution of the rows/columns within the row/column spaces. For instance, if the distribution of the columns within the column space admits a clustering structure and the distribution of the data is highly non-uniform, the row space coherency will be high [112, 117]. As an example, consider a scenario in which the columns lie in a union of two independent low-dimensional subspaces but 95 percent of the data lies in the first subspace. In this case, the row space coherency is high and one would need to sample too many columns to ensure that the sampled columns span the column space, as confirmed by the theoretical analysis. Interestingly, the coherency of a subspace can even be independent of the dimension of the ambient space. In [105], it was shown that the coherency of a randomly generated low-dimensional subspace is upper-bounded by a fixed constant with high probability. Thus, if the row space and column space of L are randomly generated r-dimensional subspaces, the sample complexity of the proposed randomized approaches will be independent of the size of the data.
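
To give a feel for the quantity discussed in the remark, the sketch below computes a commonly used coherence measure of an r-dimensional subspace of R^n with orthonormal basis U, namely (n / r) · max_i ‖row_i(U)‖². This definition is an assumption; the excerpt does not state which definition the source uses. The function name and the example bases are hypothetical.

```python
def coherence(basis_columns):
    """Coherence (n / r) * max_i ||row_i(U)||^2 of a subspace.

    `basis_columns` is a list of r orthonormal vectors of length n
    (the columns of U). Orthonormality is assumed, not checked.
    """
    r = len(basis_columns)
    n = len(basis_columns[0])
    # Squared norm of row i of U, i.e. how much of coordinate i lies in the subspace.
    row_norms_sq = [sum(col[i] ** 2 for col in basis_columns) for i in range(n)]
    return (n / r) * max(row_norms_sq)


# Energy spread evenly over all coordinates: lowest possible coherence.
spread = [[0.5, 0.5, 0.5, 0.5]]
# Energy concentrated on one coordinate: highest possible coherence (n / r).
spiky = [[1.0, 0.0, 0.0, 0.0]]
```

Under this definition the value ranges from 1 (evenly spread) to n / r (concentrated), which matches the intuition that highly non-uniform data leads to high coherency.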

Audio sensing is another area where deep learning offers effective solutions. Graves et al. [26] propose an RNN-based approach for speech recognition, while Lee et al. [57] utilize deep convolutional networks for audio classification. Deep learning is applied to build systems that are robust to noise for audio sensing tasks [51], such as inferring daily activities (eating, coughing, and driving), detecting the ambient environment, and deducing user states (stress and emotion). In addition to its application to single-modal sensing data such as images and audio, deep learning is effective at combining data from different modalities for content retrieval or human activity recognition [11, 79, 72]. Yao et al. [98] present DeepSense, a framework to effectively fuse multi-modal sensor input, which can be applied to either regression or classification problems by adapting the output layer of the framework. The CNN structure in DeepSense allows features from multiple sensors to be extracted and fused effectively, while the RNN structure enables modelling of the temporal relationship, resulting in the ability to learn the comprehensive temporal-spatial dependency from multi-modality sensor data. DeepHR builds upon the foundation of DeepSense; however, it targets the calibration of heart rate measurement instead of simply object or activity recognition. No applications of deep learning have been found for calibrating the heart rate sensing measurements collected from wearables. We apply a deep learning based approach to calibrate heart rate monitoring on smart wearables, directly utilizing the heart rate together with motion information. Unlike most previous works, our approach works directly on heart rate instead of the raw PPG signal or the RR intervals, without the need to design heavily hand-crafted motion features extracted from the accelerometer and gyroscope.

Abstract—Learning Bayesian network structures from data is known to be hard, mainly because the number of candidate graphs is super-exponential in the number of variables. Furthermore, using observational data alone, the true causal graph is not discernible from other graphs that model the same set of conditional independencies. In this paper, it is investigated whether Bayesian network structure learning can be improved by exploiting the opinions of multiple domain experts regarding cause-effect relationships. In practice, experts have different individual probabilities of correctly labeling the inclusion or exclusion of edges in the structure. The accuracy of each expert is modeled by three parameters. Two new scoring functions are introduced that score each candidate graph based on the data and experts’ opinions, taking into account their accuracy parameters. In the first scoring function, the experts’ accuracies are estimated using an expectation-maximization-based algorithm and the estimated accuracies are explicitly used in the scoring process. The second function marginalizes out the accuracy parameters to obtain more robust scores when it is not possible to obtain a good estimate of experts’ accuracies. The experimental results on simulated and real-world datasets show that exploiting experts’ knowledge can improve the structure learning if we take the experts’ accuracies into account.

Abstract:- In this scientific manuscript, a robust framework for deep learning predictive modeling is introduced. The prime aim of this computational system is to characterize and predict wireless spectrum data sets at lower computational cost. The cost-effective design of the formulated system trains a convolutional neural network (CNN) to strengthen the prediction accuracy. The computational modeling and design optimization are carried out considering ANN stacks along with their corresponding feature neuron sets. The design is also non-recursive and less iterative, which makes it more scalable and robust and yields better classification accuracy than conventional approaches. The model validation is carried out with respect to a set of performance metrics such as Mean Absolute Error (MAE), Mean Relative Error (MRE), Correlation Density Function (CDF) and Root Mean Square Error (RMSE) in a numerical computing environment.
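
Three of the error metrics named in the abstract have standard textbook forms, sketched below. The exact variants used by the authors are not given in the excerpt, so the definitions here (in particular MRE as the mean of absolute error divided by the absolute true value) are assumptions, and the function names are hypothetical.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mre(y_true, y_pred):
    """Mean Relative Error: average of |true - predicted| / |true|.

    Assumes no true value is zero; otherwise the ratio is undefined.
    """
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```

RMSE penalizes large individual errors more heavily than MAE, while MRE normalizes each error by the magnitude of the target.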

The employed ResNet model has been pre-trained on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 dataset to classify 1.3 million images into 1000 ImageNet classes [52]. The ResNet consists of convolutional layers, pooling layers, and fully connected layers. The network takes images of size 224 × 224 pixels as input and passes them through its layers in a forward pass, applying filters to the input image. When treating networks as a fixed feature extractor, we cut off the network at an arbitrary point (normally prior to the last fully-connected layers); thus, the features of all images are extracted directly from the activations of the convolutional feature maps. This yields a 2048-D feature vector for every image, containing the activations of the hidden layer immediately before the classifier. The 2048-D feature vectors are used directly for computing the similarity between images. The computational complexity and retrieval process may become cumbersome as the dimensionality grows. This requires us to optimize the retrieval process by proposing a hierarchically nested indexing structure and recursive similarity measurements to facilitate faster access and comparison of multi-dimensional feature vectors as described in the following sections.
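
As a rough illustration of comparing feature vectors, the sketch below ranks a small gallery by cosine similarity to a query vector. Cosine similarity is an assumption (the excerpt does not name the similarity measure), the brute-force ranking is not the hierarchical indexing the authors propose, and the 3-D vectors stand in for 2048-D features. All names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, gallery):
    """Return gallery keys ordered from most to least similar to the query."""
    return sorted(
        gallery,
        key=lambda name: cosine_similarity(query, gallery[name]),
        reverse=True,
    )


# Toy 3-D stand-ins for image feature vectors.
query = [1.0, 0.0, 0.0]
gallery = {"a": [0.9, 0.1, 0.0], "b": [0.0, 1.0, 0.0], "c": [0.5, 0.5, 0.0]}
ranking = rank_by_similarity(query, gallery)
```

A linear scan like this costs one comparison per stored image, which is the kind of cost an indexing structure is meant to reduce.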

Abstract—In this paper, we propose DeepSLAM, a novel unsupervised deep learning based visual Simultaneous Localization and Mapping (SLAM) system. The DeepSLAM training is fully unsupervised since it only requires stereo imagery instead of annotated ground-truth poses. Its testing takes a monocular image sequence as the input. Therefore, it is a monocular SLAM paradigm. DeepSLAM consists of several essential components, including Mapping-Net, Tracking-Net, Loop-Net and a graph optimization unit. Specifically, the Mapping-Net is an encoder and decoder architecture for describing the 3D structure of the environment, while the Tracking-Net is a Recurrent Convolutional Neural Network (RCNN) architecture for capturing the camera motion. The Loop-Net is a pre-trained binary classifier for detecting loop closures. DeepSLAM can simultaneously generate a pose estimate, a depth map and an outlier rejection mask. We evaluate its performance on various datasets, and find that DeepSLAM achieves good performance in terms of pose estimation accuracy, and is robust in some challenging scenes.

Recently, we have seen the potential of directly modeling the real spectrogram in AED in studies such as [7, 8]. The idea is that a detail-rich input such as a high-resolution spectrogram is sparse enough to deal with complex scenarios with overlapping sounds. This complexity does not appear only in the frequency domain, but also in the form of a wide range of temporal structures. In [8] (Fig. 1a), the spectrogram patch concept is used to describe a model that receives an input including a context of frames from a spectrogram. This is rather typical in deep learning these days, but it is stressed here since a sufficient amount of short-time temporal structure regarding sounds can be packaged if the context is wide enough. This approach is possible given the ability of DNNs to model such a high-dimensional input. This contrasts with traditional approaches in which the classifier models predefined acoustic features (e.g., MFCC or Mel-filter banks) [9, 10], which compress and neglect details that we actually need. Espi et al. [8] succeed in modeling spectrogram patch inputs as a whole, i.e., the model learns features that describe “globally” a short-time spectrogram patch. However, this dismisses important properties of sounds (e.g., stationarity, transiency, burstiness, etc.), a taxonomy that could also help to model acoustic events.

We proposed a method for learning kernels with recurrent long short-term memory structure on sequences. Gaussian processes with such kernels, termed the GP-LSTM, have the structure and learning biases of LSTMs, while retaining a probabilistic Bayesian nonparametric representation. The GP-LSTM outperforms a range of alternatives on several sequence-to-reals regression tasks. The GP-LSTM also works on data with low and high signal-to-noise ratios, and can be scaled to very large datasets, all with a straightforward, practical, and generally applicable model specification. Moreover, the semi-stochastic scheme proposed in our paper is provably convergent and efficient in practical settings, in conjunction with structure-exploiting algebra. In short, the GP-LSTM provides a natural mechanism for Bayesian LSTMs, quantifying predictive uncertainty while harmonizing with the standard deep learning toolbox. Predictive uncertainty is of high value in robotics applications, such as autonomous driving, and could also be applied to other areas such as financial modeling and computational biology.

The presence of ubiquitous sensors continuously recording massive amounts of information has led to unprecedented data collection, whose exploitation is expected to bring about scientific and social advancements in everyday lives. Along with the ever-increasing amount of data, incredible progress in the fields of Machine Learning, Pattern Recognition, and Optimization has also contributed to the growing expectations. Such progress, however, has also brought to light certain limitations in state-of-the-art learning machines, manifesting the roadblocks in the research path ahead. For instance, in addition to practical considerations pertaining to non-stationary, noisy and unsupervised settings, various applications often run on limited memory and stringent computational resources, thus requiring efficient and light-weight algorithms to cope with extreme volumes. Furthermore, certain characteristics such as the presence of outliers or adversaries, as well as the complex nature of real-world interactions, call for robust algorithms whose performance will be resilient in the face of deviations from nominal settings.

Adversarially robust generalization. As discussed in Section 5.1, it has been observed by Madry et al. [132] that there might be a significant generalization gap when training deep neural networks in the adversarial setting. This generalization problem has been further studied by Schmidt et al. [162], who show that to correctly classify two separated d-dimensional spherical Gaussian distributions, in the natural setting one only needs O(1) training data, but in the adversarial setting one needs Θ(√d) data. Getting distribution-agnostic generalization bounds (also known as the PAC-learning framework) for the adversarial setting is proposed as an open problem by Schmidt et al. [162]. In a subsequent work, Cullina, Bhagoji, and Mittal [47] study PAC-learning guarantees for binary linear classifiers in the adversarial setting via VC-dimension, and show that the VC-dimension does not increase in the adversarial setting. This result does not explain the empirical observation that adversarially robust generalization may be hard. In fact, although VC-dimension and Rademacher complexity can both provide valid generalization bounds, VC-dimension usually depends on the number of parameters in the model while Rademacher complexity usually depends on the norms of the weight matrices and data points, and can often provide tighter generalization bounds [15]. Suggala et al. [175] discuss a similar notion of adversarial risk but do not prove explicit generalization bounds. Attias, Kontorovich, and Mansour [12] prove adversarial generalization bounds in a setting where the number of potential adversarial perturbations is finite, which is a weaker notion than the ℓ∞ attack that

An integral part of the model building process was the tuning of the decision thresholds. Though individual thresholds per code are possible, the best micro-F1 results were always achieved with a common decision threshold over all codes. This behavior again reflects the data sparsity issues, as not all modeled codes appear in the dev set and many other codes are so sparse that no robust threshold estimation was possible.
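
Tuning a single decision threshold shared by all codes can be sketched as a sweep over candidate thresholds, keeping the one with the highest micro-F1 on the dev set. This is an illustrative sketch, not the authors' procedure: the candidate grid, the toy scores, and the function names are all hypothetical.

```python
def micro_f1(scores, labels, threshold):
    """Micro-averaged F1 over all (document, code) pairs at one shared threshold."""
    tp = fp = fn = 0
    for score_row, label_row in zip(scores, labels):
        for score, label in zip(score_row, label_row):
            predicted = score >= threshold
            if predicted and label:
                tp += 1
            elif predicted and not label:
                fp += 1
            elif label:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_common_threshold(scores, labels, candidates):
    """Candidate threshold with the highest micro-F1 (first one wins on ties)."""
    return max(candidates, key=lambda t: micro_f1(scores, labels, t))


# Toy dev set: 2 documents x 3 codes, with model scores and gold labels.
scores = [[0.9, 0.2, 0.4], [0.6, 0.7, 0.1]]
labels = [[1, 0, 1], [0, 1, 0]]
threshold = best_common_threshold(scores, labels, [0.3, 0.5, 0.65])
```

Because the counts are pooled over all codes before computing F1, codes that are rare or absent in the dev set do not need their own threshold estimate, which is consistent with the sparsity argument above.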

truncation that shortens the protein sequences at either the N-terminal or C-terminal sometimes still retains the structural fold [101]. A good method of extracting fold-related features from sequences should capture the consistent patterns despite the evolutionary changes. Therefore, we simulated these four residue changes to check whether the fold-related features extracted from protein sequences by DeepSF are robust against mutation, insertion, deletion and even truncation. To analyze the effects of mutation, insertion, and deletion, we selected some proteins that have 100 residues, and randomly selected the positions for insertion, deletion, or substitution with one or more residues randomly sampled from the 20 standard amino acids. At most 20 residues in total are deleted from or inserted into the sequences. Each change was repeated 50 times, and duplicate sequences were removed after sampling. For example, for domain ’d1lk3h2’ we generated 44 sequences with at least one residue deleted, 44 sequences with at least one residue inserted, and 18 sequences with at least one residue mutated. The SF-features for these mutated sequences are generated and compared to the SF-features of the original wild-type sequence. We also randomly sampled 500 sequences with lengths in the range of 80 to 120 residues from the SCOP 1.75 dataset as a control, and compared their SF-features with those of the original sequence. The distributions of KL divergences between the SF-features of these sequences and the original sequence are shown in Figure 3.7. The divergence of the sequences with mutations, insertions, and deletions from the original sequence is much smaller than that of random sequences. The p-value of the difference according to the Wilcoxon rank sum test is < 2.2e-16. The same analysis is applied to two other proteins, ’d1foka3’ and ’d1ipaa2’, and the same phenomenon was observed (see Figure 3.8). The results suggest that the feature extraction of DeepSF is robust

Recently, with the advance of neural network techniques, deep learning methods (Zeng et al., 2014, 2015) have been introduced, and the hope is to model the noisy distant supervision process in the hidden layers. However, their approach only selects the single most plausible instance per entity pair, inevitably missing out on a lot of valuable training instances. Recently, Lin et al. (2016) proposed an attention mechanism to select plausible instances from a set of noisy instances. However, we believe that soft attention weight assignment might not be the optimal solution, since the false positives should be completely removed and placed in the negative set. Ji et al. (2017) incorporate external knowledge to enrich the representation of the entity pair, thereby improving the accuracy of the attention weights. Even though the above-mentioned methods can select high-quality instances, they ignore the false positive case in which all the sentences of one entity pair belong to the false positives. In this work, we take a radical approach to solve this problem: we make use of the distantly labeled resources as much as possible, while learning an independent false-positive indicator to remove false positives and place them in the right place. After our ACL submission, we noticed that a contemporaneous study, Feng et al. (2018), also adopts reinforcement learning to learn an instance selector, but their reward is calculated from the prediction probabilities. In contrast, in our method the reward is intuitively reflected by the performance change of the relation classifier. Our approach is also complementary to most of the approaches above, and can be directly applied on top of any existing relation extraction classifier.

unstructured data is inescapable: it is, or soon will be, a challenge for every organization, in every sector, in every economy. While exploiting open-source content might once have concerned only government analysts and data geeks, Big Data is now a critical source for leaders across global business and governments to create “new intelligence.” No Big Data strategy lacking a robust solution to find, harvest, and curate the 90 percent of unstructured data will fully succeed in the ever-changing landscape of the Deep Web.

Moreover, we consider a more general case where the hierarchical structure is not available. A hierarchical structure can give us more insight into the relations among features, but learning it from data is a difficult problem. To the best of our knowledge, there is no work that directly learns the hierarchical structure among features. Here we make a first attempt based on the DHS method by proposing a Learning Deep Hierarchical Structure (LDHS) method. Given the height of the hierarchical structure, the LDHS method assumes that the paths from the root to the leaf nodes, each leaf corresponding to a data feature, do not share any node with one another, then uses a generalized fused-Lasso regularizer to enforce nodes to fuse at each height, and finally designs a sequential constraint to make the learned structure form a hierarchical structure. For optimization, we use the GIST algorithm (Gong et al. 2013) to solve the objective function of the LDHS method. By comparing with several state-of-the-art baseline methods, experiments on several synthetic and real-world datasets show the effectiveness of the proposed models.

This paper presents a Discriminative Deep Dyna-Q (D3Q) approach to improving the effectiveness and robustness of Deep Dyna-Q (DDQ), a recently proposed framework that extends the Dyna-Q algorithm to integrate planning for task-completion dialogue policy learning. To obviate DDQ’s high dependency on the quality of simulated experiences, we incorporate an RNN-based discriminator in D3Q to differentiate simulated experience from real user experience in order to control the quality of training data. Experiments show that D3Q significantly outperforms DDQ by controlling the quality of simulated experience used for planning. The effectiveness and robustness of D3Q are further demonstrated in a domain extension setting, where the agent’s capability of adapting to a changing environment is tested.

Most real-world problems involve interactions between multiple agents, and the complexity of the problem increases significantly when the agents co-evolve together. Thanks to the recent advances of deep reinforcement learning (DRL) on single-agent scenarios, which led to successes in playing Atari games (Mnih et al. 2015), playing Go (Silver et al. 2016) and robotics control (Levine et al. 2016), it has been a rising trend to adapt single-agent DRL algorithms to multi-agent learning scenarios, and many works have shown great successes on a variety of problems, including automatic discovery of communication and language (Sukhbaatar, Fergus, and others 2016; Mordatch and Abbeel 2017), multiplayer games (Peng et al. 2017a; OpenAI 2018), traffic control (Wu et al. 2017) and the analysis of social dilemmas (Leibo et al. 2017).
