We use Graph-CNNs to address this problem and compare classification performance on the standard benchmark datasets NCI1 and D&D. NCI1 is a balanced graph dataset of chemical compounds screened for activity against non-small cell lung cancer cell lines. D&D graphs are protein structures that can be classified into enzyme and non-enzyme categories. These data are highly complex in terms of the size and structure of individual samples. Each graph sample is heterogeneous and contains multiple adjacency matrices, each indicating the presence of a specific bond type between two atoms. Detailed statistics and classification results on these datasets are listed in Table 2.5. The Graph-CNN architectures achieve state-of-the-art performance compared with other recent approaches.
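As an illustration of how a single Graph-CNN layer might consume several bond-type adjacency matrices at once, here is a minimal NumPy sketch. The layer form, filter names, and toy graph are our own assumptions for exposition, not the paper's exact architecture:

```python
import numpy as np

def multi_adjacency_graph_conv(H, adjacencies, weights, W_self):
    """One Graph-CNN layer over a heterogeneous molecular graph.

    H           : (n_nodes, d_in) node features
    adjacencies : list of (n_nodes, n_nodes) matrices, one per bond type
    weights     : list of (d_in, d_out) filters, one per bond type
    W_self      : (d_in, d_out) self-connection filter
    """
    out = H @ W_self
    for A, W in zip(adjacencies, weights):
        out += A @ H @ W          # aggregate neighbours linked by this bond type
    return np.maximum(out, 0.0)   # ReLU

# Toy graph: 3 atoms, 2 bond types, 4-dim features -> 2-dim output
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
A_single = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
A_double = np.array([[0, 0, 1], [0, 0, 0], [1, 0, 0]], dtype=float)
Ws = [rng.normal(size=(4, 2)) for _ in range(2)]
W_self = rng.normal(size=(4, 2))
out = multi_adjacency_graph_conv(H, [A_single, A_double], Ws, W_self)
print(out.shape)  # (3, 2)
```

Summing the per-bond-type aggregations lets one layer mix information from all edge types before the non-linearity.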
This is why, for better performance and general applicability, a number of more advanced CL strategies are needed.
In this section, the CWR, CWR+ and AR1 strategies have been evaluated on the iCIFAR-100 and CORe50 benchmarks in the SIT scenario. Early results on these benchmarks show that AR1 allows complex models such as CaffeNet and GoogLeNet to be trained sequentially while limiting the detrimental effects of catastrophic forgetting. AR1 accuracy was higher than that of existing regularization approaches such as LWF, EWC and SI. While we did not explicitly consider rehearsal techniques in our comparison sessions, preliminary results indicate that AR1 was also competitive with iCaRL on CORe50 [Lomonaco and Maltoni, 2018]. AR1 overhead in terms of storage is very limited, and most of the extra computation is based on information made available by stochastic gradient descent. We showed that stopping SGD early, after very few epochs (e.g., 2), is sufficient to incrementally learn new data on CORe50. Further ideas could be investigated in the future to quantify weight importance for old tasks, such as exploiting the moving average of squared gradients already used by methods like RMSprop [Hinton, 2012] and Adam [Kingma and Ba, 2014], or the Hebbian-like reinforcement between active neurons recently proposed by Aljundi et al. Class-incremental learning (the NC update content type) is only one of the cases of interest in SIT. The new instances and classes (NIC) update content type, available under CORe50, is a more realistic scenario for real applications and will constitute the main target of our future research. Extending AR1 to unsupervised (or semi-supervised) implementations, such as those described in Section 3.6.1 and [Parisi et al., 2018b], is another scenario of interest for future studies. In particular, Parisi et al. [2018b] propose an interesting 2-level self-organizing model, built on top of a convolutional feature extractor, that is capable of exploiting temporal coherence in CORe50 videos and provides good results even with weak supervision.
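The RMSprop-style idea mentioned above can be sketched in a few lines: keep a running average of squared gradients per weight and use it as an importance term in a quadratic penalty around the old weights. This is an illustrative assumption about how such a scheme could look, not the published AR1 algorithm:

```python
import numpy as np

def update_importance(F, grad, decay=0.9):
    """RMSprop-style running average of squared gradients,
    used here as a per-weight importance estimate."""
    return decay * F + (1 - decay) * grad ** 2

def regularized_grad(grad_task, w, w_old, F, lam=1.0):
    """Gradient of: task loss + lam * F * (w - w_old)^2.
    The penalty discourages moving weights important for old tasks."""
    return grad_task + 2 * lam * F * (w - w_old)

w_old = np.array([1.0, -0.5])
F = np.zeros(2)
for grad in [np.array([0.2, 2.0]), np.array([0.1, 1.5])]:
    F = update_importance(F, grad)
# F[1] > F[0]: the second weight saw larger gradients, so it is more protected
g = regularized_grad(np.array([1.0, 1.0]), np.array([1.2, 0.0]), w_old, F)
print(F, g)
```

The importance estimate is essentially free, since the squared gradients are already computed by SGD-based optimizers.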
Although much more validation in complex settings, as well as new and better approaches, will be necessary, based on these preliminary results we can optimistically envision a new generation of systems and applications that, once deployed, can continue to acquire new skills and knowledge without needing to be retrained from scratch.
Training a deep neural network such as VGG from scratch requires over one million images. In many cases the available dataset is not large enough to meet this requirement, so two techniques are proposed.
The first is to fine-tune an already pre-trained network. By doing this, the network is adapted to solve a particular problem different from the one it was originally trained for. The authors of  explain the benefits of fine-tuning over other methods. The methodology for adapting a model consists of freezing the weights of some part of the network, typically the layers nearest the input, so that when the training process starts these layers are not updated and only the weights and biases of the desired layers are modified. The other technique, also used in other application fields of ML, is called data augmentation, which consists of enlarging the training dataset by applying different transformations to the data. Possible transformations include rotations, horizontal or vertical reflections, adding white noise, changing the brightness, and many more. The appropriate transformations vary with the dataset being used; for example, a hand-written digit dataset like MNIST would tolerate rotations or brightness changes, but reflecting the images horizontally or vertically would break the logic of the set. In  two types of data augmentation are implemented: the first consists of image translations and reflections that enlarge the set by a factor of 2048, and the second consists of manipulating the RGB colour channels.
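The two techniques can be sketched together in plain NumPy. The toy two-layer network, the frozen first layer, and the specific augmentations are illustrative assumptions; a real implementation would use a deep learning framework's parameter-freezing and augmentation utilities:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Fine-tuning sketch: freeze the early layer, update only the head ---
W1 = rng.normal(size=(8, 4))   # "pre-trained" early layer (frozen)
W2 = rng.normal(size=(4, 2))   # new task head (trainable)

x = rng.normal(size=(1, 8))
target = np.array([[1.0, 0.0]])
lr = 0.1
W1_before = W1.copy()
for _ in range(5):
    h = np.maximum(x @ W1, 0.0)
    y = h @ W2
    dY = 2 * (y - target)      # gradient of squared error w.r.t. output
    W2 -= lr * h.T @ dY        # only the head is updated
    # W1 receives no update: it is frozen

# --- Data augmentation sketch: reflections and additive noise ---
img = rng.random((2, 28, 28))
augmented = np.concatenate([
    img,
    img[:, :, ::-1],                          # horizontal reflection
    img + rng.normal(0, 0.05, img.shape),     # white noise
])
print(augmented.shape)  # (6, 28, 28)
```

Here the dataset triples in size, and the early-layer weights are provably untouched after training.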
Local Winner-Take-All Networks
Consider a layer of units in an MLP being trained to classify input patterns as one of the ten Arabic digits (0–9). The units are expected to learn to identify features in the inputs that are indicative of the digit's identity, such as edges oriented at various angles. Assume that the dataset is very simple, so that identifying horizontal and vertical edges would be sufficient to classify the data correctly. When the network is not completely trained, it is likely that many units respond to edges but do not respond strongly to these two orientations; instead, several units respond to edges with intermediate orientations. Backpropagation will then assign some credit to all of these units, and they will all be slowly adjusted towards one of the desired orientations at each learning step until the objective function is minimized.
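A Local Winner-Take-All activation, in contrast, lets only the strongest unit in each local block respond. The block size and tie handling below are our own simplifying choices for a minimal sketch:

```python
import numpy as np

def lwta(activations, block_size=2):
    """Local Winner-Take-All: within each block of units, only the
    strongest unit keeps its activation; the rest are zeroed."""
    a = activations.reshape(-1, block_size)
    mask = a == a.max(axis=1, keepdims=True)   # ties all survive in this sketch
    return (a * mask).reshape(activations.shape)

a = np.array([0.3, 0.9, -0.2, 0.1, 0.5, 0.5])
# winners: 0.9 in block 1 and 0.1 in block 2; the last block is a tie
print(lwta(a, block_size=2))
```

Because losers are zeroed, backpropagation assigns credit only to the local winners rather than spreading it across all weakly responding units.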
Another hybrid example can be found in the work of Pandey and Dukkipati [7], who use wide learning (i.e., a single-layer architecture of infinite width) through an arc-cosine kernel. They propose exact and approximate procedures to train single-layer wide networks. They provide an approximate strategy to compute the kernel matrix of an arc-cosine kernel, which is constructed with the weight matrix learnt by a Restricted Boltzmann Machine (RBM). The kernel matrix is finally fed into a linear kernel classifier. Approximations for concatenating several arc-cosine kernel layers are also presented, but these show lower performance than the single-layer architecture. They show that this wide network can achieve better results than single-layer and deep belief networks. On the other hand, while this method takes advantage of kernel methods, it cannot obtain the stratified representations of the data that make deep learning architectures a good choice for transfer learning. Several works in the literature have approximated kernel methods within neural networks, establishing training procedures that do not deviate much from the traditional deep learning approach (e.g., use of the same or a very similar hyperparameter set, the same cost function, or the same optimization algorithm) while being efficient and easy to implement at the same time. Mehrkanoon et al. [3] propose the basis for implementing hybrid neural networks by stacking, onto a fully connected layer, an additional layer that approximates a Gaussian kernel using Random Fourier Features. They empirically show that they can match the performance of traditional kernel methods (e.g., LS-SVM) while being able to scale to datasets of any size. In this work we generalize the aforementioned work by stacking multiple kernel blocks to build deeper architectures and by providing procedures to train them in two different scenarios (i.e., structured and non-structured data).
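The Random Fourier Feature construction for approximating a Gaussian kernel can be sketched as follows; the sampling scheme is the standard RFF recipe, while the sizes and bandwidth are illustrative choices:

```python
import numpy as np

def rff_layer(X, n_features=5000, gamma=0.5, seed=0):
    """Random Fourier Features: z(x) such that
    z(x) . z(y) ~= exp(-gamma * ||x - y||^2)  (Gaussian kernel)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Spectral sampling for exp(-gamma r^2): w ~ N(0, 2*gamma*I)
    W = rng.normal(0, np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 3))
Z = rff_layer(X)
approx = Z[0] @ Z[1]
exact = np.exp(-0.5 * np.sum((X[0] - X[1]) ** 2))
print(abs(approx - exact))  # small: the inner product tracks the kernel
```

Stacking such a layer on top of a fully connected layer gives the kernel block that deeper hybrid architectures are built from.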
and time in identifying the disease.
However, many other data points are available alongside medical images, such as omics data, biomarker measurements, and patient demographics and history. All of these can enhance disease classification or progression prediction with the help of machine learning/deep learning modules. However, it is very difficult to find a comprehensive dataset with all the different modalities and features in a healthcare setting due to privacy regulations. Hence, in this thesis we explore medical imaging data with clinical data points, as well as genomics datasets separately, for classification tasks using combinational deep learning architectures. We use deep neural networks with 3D volumetric structural magnetic resonance images from an Alzheimer's disease dataset for disease classification. A separate study is implemented to understand classification based on clinical data points using machine learning algorithms. For bioinformatics applications, sequence classification is a crucial step in many metagenomics applications; however, it requires extensive preprocessing, such as sequence assembly or sequence alignment, before raw whole-genome sequencing data can be used, making it time consuming, especially for bacterial taxonomy classification. There are only a few approaches to sequence classification, and they mainly involve convolutions and deep neural networks. A novel method is developed that uses the intrinsic nature of recurrent neural networks for 16S rRNA sequence classification and can be adapted to use read sequences directly. For this classification task, the accuracy is improved using optimization techniques with a hybrid neural network.
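As a rough sketch of how a recurrent network can consume read sequences directly, the following one-hot encodes a DNA read and classifies it with a vanilla RNN. The encoding, state size, and three hypothetical taxa are illustrative assumptions, not the thesis's actual model:

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a read as a (length, 4) one-hot matrix, no alignment needed."""
    out = np.zeros((len(seq), 4))
    out[np.arange(len(seq)), [BASES.index(b) for b in seq]] = 1.0
    return out

def rnn_classify(seq, Wx, Wh, Wo):
    """Vanilla RNN: consume the read base by base, classify from the final state."""
    h = np.zeros(Wh.shape[0])
    for x in one_hot(seq):
        h = np.tanh(x @ Wx + h @ Wh)
    return int(np.argmax(h @ Wo))

rng = np.random.default_rng(0)
Wx = rng.normal(size=(4, 8))
Wh = rng.normal(size=(8, 8)) * 0.1
Wo = rng.normal(size=(8, 3))            # 3 hypothetical taxa
label = rnn_classify("ACGTGGTTCA", Wx, Wh, Wo)
print(label)
```

The point of the recurrent formulation is that raw reads of any length map to a fixed-size state, bypassing assembly and alignment.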
Our main contributions are as follows. First, we demonstrate that deep CNNs offer a solution for ultra-wide baseline matching. Inspired by recent efforts in patch matching [14, 43, 31], we build a siamese/classification hybrid model using two AlexNet networks, cut off at the last pooling layer. The networks share weights and are followed by a number of fully-connected layers embodying a binary classifier. Second, we show how to extend the previous model with a Spatial Transformer (ST) module, which embodies an attention mechanism that allows our model to propose possible patch matches (see Fig. 1), which in turn increases performance. These patches are described and compared with MatchNet. As with the first model, we train this network end-to-end and only with a same/different training signal, i.e., the ST module is trained in a semi-supervised manner. In Sections 3.2 and 4.6 we discuss the difficulties in training this network and offer insights in this direction. Third, we conduct a human study to help us characterize the problem and benchmark our algorithms against human performance. This experiment was conducted on Amazon Mechanical Turk, where participants were shown pairs of images from our dataset. The results confirm that humans perform exceptionally well while responding relatively quickly. Our top-performing model falls within 1% of human accuracy.
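The weight-sharing idea behind the siamese/classification hybrid can be sketched in a few lines: one trunk applied to both inputs, with a binary head on the concatenated features. The single-layer trunk below stands in for the shared AlexNet branches and is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W_shared = rng.normal(size=(16, 8))   # shared trunk (one layer here)
W_cls = rng.normal(size=(16, 1))      # binary same/different head

def features(x):
    """Both branches use the very same weights: a siamese trunk."""
    return np.maximum(x @ W_shared, 0.0)

def same_probability(x1, x2):
    f = np.concatenate([features(x1), features(x2)])
    return 1.0 / (1.0 + np.exp(-(f @ W_cls)[0]))   # sigmoid over the head

a, b = rng.normal(size=16), rng.normal(size=16)
p = same_probability(a, b)
print(p)
```

Because the trunk is shared, gradients from the same/different loss update a single set of feature weights, which is what makes end-to-end training with only pair labels possible.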
When we learn, we constantly rely on our prior experiences. When learners encounter new problems, they base their thinking on what they have experienced previously. Existing prior knowledge can both facilitate and hinder learning and transfer (Bransford, Brown, & Cocking, 2000). According to Gilbert, Bolte, and Pilot (2011), learning is mediated by what is already known. “Seductive” details trigger perspectives when encountering problems. These details potentially increase with the richness of the initial information. For example, while a stick figure may trigger a general perspective, a diagram or photograph includes many more details that require filtering to distinguish relevant from irrelevant information. While this added information when coupled with prior experience can enhance problem solving, it can also stir “noisy,” irrelevant schema that may be counter-productive to problem solving (Son & Goldstone, 2009). For learners, it is imperative to sift through and identify both relevant and irrelevant information as they prioritize what to learn. Specific prior knowledge must serve as a lens for assimilating new content, not the focus. Reducing distractors, seductive details, and noisiness will promote transfer (Day & Goldstone, 2012).
critical steps. For example, the human expert may check the misclassifications with the highest loss for incorrect labels, thus effectively reducing label noise. With shorter training times, such feedback loops can be executed faster. In the CheXpert dataset, which was used as a groundwork for the present analysis, labels for the images were generated using a specifically developed natural language processing tool, which did not produce perfect labels. For example, the F1 scores for the mentioning and subsequent negation of cardiomegaly were 0.973 and 0.909, and the F1 score for an uncertainty label was 0.727. Therefore, it can be assumed that there is a certain amount of noise in the training data, which might affect the accuracy of the models trained on it. Implementing a human-in-the-loop approach for partially correcting the label noise could further improve the performance of networks trained on the CheXpert dataset [21]. Our findings differ from the techniques applied in previous literature, where deeper network architectures, mainly a DenseNet-121, were used to classify the CheXpert dataset [6,9,22]. The authors of the CheXpert dataset achieved an average overall AUROC of 0.889 [3] using a DenseNet-121, which was not surpassed by any of the models used in our analysis, although the differences between the best performing networks and the CheXpert baseline were smaller than 0.01. It should be noted, however, that in our analysis the hyperparameters for the models were probably not selected as precisely as in the original CheXpert paper by Irvin et al., since the focus of this work was on comparing different architectures rather than optimizing one specific network. Keeping all other hyperparameters constant across the models might also have affected certain architectures more than others, thus lowering the comparability between the different networks we evaluated.
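The feedback loop described above, where the expert checks the misclassifications with the highest loss first, can be sketched as a simple ranking. The binary setup and cross-entropy loss are illustrative assumptions:

```python
import math

def review_queue(labels, probs, k=2):
    """Rank misclassified samples by cross-entropy loss so a human
    expert reviews the most suspicious labels first."""
    items = []
    for i, (y, p) in enumerate(zip(labels, probs)):
        pred = 1 if p >= 0.5 else 0
        if pred != y:                              # misclassified only
            loss = -math.log(p if y == 1 else 1 - p)
            items.append((loss, i))
    items.sort(reverse=True)                       # highest loss first
    return [i for _, i in items[:k]]

labels = [1, 0, 1, 1]
probs  = [0.9, 0.8, 0.05, 0.4]   # samples 1, 2 and 3 are misclassified
print(review_queue(labels, probs))  # [2, 1]
```

Sample 2 tops the queue because a confident prediction of 0.05 against label 1 yields the largest loss, exactly the kind of case most likely to be a labeling error.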
In this work we explore whether the TL paradigm can be successfully applied to three different art classification problems. We use four neural architectures that have obtained strong results on the ImageNet challenge in recent years, and we investigate their performance in attributing the authorship of different artworks, recognizing the material used by the artists in their creations, and identifying the artistic category the artworks fall into. We do so by comparing two possible approaches to the different classification tasks. The first, known as off-the-shelf classification, simply retrieves the features that were learned by the DCNNs on other datasets and uses them as input for a new classifier. In this scenario the weights of the DCNN do not change during the training phase, and the final, top-layer classifier is the only component of the architecture that is actually trained. This changes in our second approach, known as fine-tuning, where the weights of the original DCNNs are “unfrozen” and the neural architectures are trained together with the final classifier.
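The off-the-shelf approach can be sketched as follows: features come from a frozen network and only a logistic-regression head is trained on top of them. The random "pretrained" layer and toy labels are stand-ins for a real DCNN and art dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
W_pre = rng.normal(size=(10, 6))   # stand-in for a frozen, pretrained DCNN

def off_the_shelf_features(X):
    """Frozen network: features are extracted once and never updated."""
    F = np.maximum(X @ W_pre, 0.0)
    return np.hstack([F, np.ones((len(F), 1))])   # append a bias column

X = rng.normal(size=(40, 10))
F = off_the_shelf_features(X)
y = (F[:, 0] > np.median(F[:, 0])).astype(float)  # toy, linearly separable labels

# Train only a logistic-regression head; W_pre is never touched
w = np.zeros(F.shape[1])
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-(F @ w)))
    w -= 0.1 * F.T @ (p - y) / len(y)
acc = np.mean(((F @ w) > 0) == (y == 1))
print(acc)
```

Fine-tuning would differ only in that gradients would also flow into `W_pre` instead of leaving it frozen.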
University of Cambridge
Multi-modal distributional models learn grounded representations for improved performance in semantics. Deep visual representations, learned using convolutional neural networks, have been shown to achieve particularly high performance. In this study, we systematically compare deep visual representation learning techniques, experimenting with three well-known network architectures. In addition, we explore the various data sources that can be used for retrieving relevant images, showing that images from search engines perform as well as, or better than, those from manually crafted resources such as ImageNet. Furthermore, we explore the optimal number of images and the multi-lingual applicability of multi-modal semantics. We hope that these findings can serve as a guide for future research in the field.
These models were pretrained on a subset of the ImageNet database (www.ImageNet.org), which was used in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [24,25]. Both networks, trained on ImageNet, are able to classify images into thousands of object categories, learning rich feature representations for a wide range of images. Thanks to the generalization property common to neural networks, an appropriate learning procedure can be developed to adapt the networks to classify images of a different domain, produced by the CWT block, into 24 diagnostic classes. Both networks were tested in the first phase of the Challenge: SqueezeNet trained faster, whereas GoogLeNet performed better, and thus the latter was used in the official phase.
Abstract—Deep learning has taken over, both in problems beyond the realm of traditional, hand-crafted machine learning paradigms and in capturing the imagination of the practitioner sitting on top of petabytes of data. As the public perception of the efficacy of deep neural architectures in complex pattern recognition tasks grows, up-to-date primers on the current state of affairs must follow. In this review, we present a refresher on the many different stacked, connectionist networks that make up deep learning architectures, followed by automatic architecture optimization protocols using multi-agent approaches. Further, since guaranteeing system uptime is fast becoming an indispensable asset across multiple industrial modalities, we include an investigative section on testing neural networks for fault detection and subsequent mitigation. This is followed by an exploratory survey of several application areas where deep learning has emerged as a game-changing technology, be it anomalous behavior detection in financial applications, financial time-series forecasting, predictive and prescriptive analytics, medical imaging, natural language processing, or power systems research. The thrust of this review is to outline emerging areas of application-oriented research within the deep learning community and to provide a handy reference for researchers seeking to embrace deep learning in their work for what it is: statistical pattern recognizers with unparalleled hierarchical structure-learning capacity and the ability to scale with information.
We then evaluate the generalisation capability of the models trained only on IEMOCAP. We re-train these models to output three emotions according to the mapping in Table 4 and report performance on the out-of-domain corpora. The results are presented in Table 2 on the WA and UA metrics. One can see that the unweighted accuracy for the out-of-domain corpora is very poor, which is in line with previous works. Note that in Table 2, a UA of 33.33% means that the whole test set is classified as negative. This shows that none of these architectures, when trained on a single corpus, generalise at all to out-of-domain data.
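The two metrics can be made concrete with a small sketch: weighted accuracy (WA) is overall accuracy, while unweighted accuracy (UA) averages per-class recalls, so a classifier that outputs a single class on a three-class problem scores exactly 33.33% UA regardless of class imbalance:

```python
from collections import defaultdict

def weighted_accuracy(y_true, y_pred):
    """WA: fraction of all samples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def unweighted_accuracy(y_true, y_pred):
    """UA: mean of per-class recalls; a constant classifier on k classes scores 1/k."""
    per_class = defaultdict(lambda: [0, 0])   # class -> [correct, total]
    for t, p in zip(y_true, y_pred):
        per_class[t][0] += int(t == p)
        per_class[t][1] += 1
    return sum(c / n for c, n in per_class.values()) / len(per_class)

y_true = ["neg"] * 8 + ["neu"] * 1 + ["pos"] * 1
y_pred = ["neg"] * 10                      # everything classified as negative
print(weighted_accuracy(y_true, y_pred))   # 0.8
print(round(unweighted_accuracy(y_true, y_pred), 4))  # 0.3333
```

On this imbalanced toy set WA looks respectable while UA exposes the degenerate classifier, which is why UA is the more telling metric for out-of-domain emotion recognition.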
In this paper we examined the current architectures for generating image captions with deep learning and argued that, in their present setup, they fail to ground the meaning of spatial descriptions in the image but nonetheless achieve good performance in generating spatial language, which is surprising given the constraints of the architecture they are working with. The information they use to generate spatial descriptions is not spatial but distributional, based on word co-occurrence in a sequence as captured by a language model. While such information is required to successfully predict spatial language, it is not sufficient. We see at least two useful areas of future work. On the one hand, it should be possible to extend the deep learning configurations for image description to take into account, and specialise to learn, geometric representations of objects, just as the current deep learning configurations are specialised to learn visual features that are indicative of objects. The work on modularity of neural networks such as (Andreas et al., 2016; Johnson et al., 2017) may be relevant in this respect. On the other hand, we want to study how much information can be squeezed out of language models to successfully model spatial language, and what kind of language models can be built to do so.
b Airborne & Space Systems Division, Leonardo MW Ltd, Edinburgh, United Kingdom
In long-range imagery, the atmosphere along the line of sight can produce unwanted visual effects. Random variations in the refractive index of the air cause light to shift and distort. When captured by a camera, this randomly induced variation results in blurred and spatially distorted images. The removal of such effects is greatly desired. Many traditional methods are able to reduce the effects of turbulence within images; however, they require complex optimisation procedures or have large computational complexity. The use of deep learning for image processing has now become commonplace, with neural networks able to outperform traditional methods in many fields. This paper presents an evaluation of various deep learning architectures on the task of turbulence mitigation. The core disadvantage of deep learning is its dependence on a large quantity of relevant data. For the task of turbulence mitigation, real-life data are difficult to obtain, as a clean, undistorted image is not always obtainable. Turbulent images were therefore generated with a turbulence simulator, which was able to accurately represent atmospheric conditions and apply the resulting spatial distortions onto clean images. This paper provides a comparison between current state-of-the-art image reconstruction convolutional neural networks. Each network is trained on simulated turbulence data and then assessed on a series of test images. It is shown that the networks are unable to produce high-quality output images; however, they are able to reduce the effects of spatial warping within the test images. This paper provides a critical analysis of the effectiveness of applying deep learning in this setting. It is shown that deep learning has potential in this field and can be used to make further improvements in the future.
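A toy version of such a simulator can be sketched by resampling an image through a random displacement field; the coarse grid and nearest-neighbour resampling below are simplifications of a real turbulence model:

```python
import numpy as np

def random_warp(img, strength=1.5, seed=0):
    """Toy turbulence: displace each pixel by a smooth random field.
    Nearest-neighbour resampling keeps the sketch dependency-free."""
    rng = np.random.default_rng(seed)
    h, w = img.shape
    # Low-frequency displacement: a coarse random grid repeated to full size
    dy = np.repeat(np.repeat(rng.normal(0, strength, (h // 4, w // 4)), 4, 0), 4, 1)
    dx = np.repeat(np.repeat(rng.normal(0, strength, (h // 4, w // 4)), 4, 0), 4, 1)
    yy, xx = np.mgrid[0:h, 0:w]
    ys = np.clip(np.round(yy + dy).astype(int), 0, h - 1)
    xs = np.clip(np.round(xx + dx).astype(int), 0, w - 1)
    return img[ys, xs]

clean = np.zeros((16, 16))
clean[8, :] = 1.0                  # a horizontal line to be distorted
warped = random_warp(clean)
print(warped.shape)
```

Pairing each clean image with its warped counterpart yields the supervised training data that the reconstruction networks require.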
School of Computing Sciences, University of East Anglia, UK
This paper proposes and compares a range of methods to improve the naturalness of visual speech synthesis. A feedforward deep neural network (DNN) and many-to-one and many-to-many recurrent neural networks (RNNs) using long short-term memory (LSTM) are considered. Rather than using acoustically derived units of speech, such as phonemes, viseme representations are considered, and we propose using dynamic visemes together with a deep learning framework. The input feature representation to the models is also investigated, and we determine that including wide phoneme and viseme contexts is crucial for predicting realistic lip motions that are sufficiently smooth but not under-articulated. A detailed objective evaluation across a range of system configurations shows that a combined dynamic viseme-phoneme speech unit combined with a many-to-many encoder-decoder architecture models visual co-articulations effectively. Subjective preference tests reveal no significant difference between animations produced using this system and animations using ground-truth facial motion taken from the original video. Further-
Abstract: Preemptive defenses against the various malware created by domain generation algorithms (DGAs) have traditionally relied on manually crafted domain features obtained by a heuristic process. However, most research on detecting DGA-based malicious domain names is difficult to deploy in the real world due to poor performance and high time consumption. Based on the recent overwhelming success of deep learning networks in a broad range of applications, this article transfers five advanced ImageNet-trained models (AlexNet, VGG, SqueezeNet, Inception, and ResNet) to classify DGA and non-DGA domains, an approach which: (i) is suited to automating feature extraction from raw inputs; (ii) has fast inference speed and good accuracy; and (iii) is capable of handling large-scale data. The results show that the proposed approach is effective and efficient.
For all deep learning architectures, transfer learning increased performance in detecting GON and decreased the training time needed for model convergence. A possible reason that transfer learning helps the model learn faster with less data is that it mimics the way our visual system develops. The visual system develops circuits to identify simple features (e.g., edges and orientations), combines these features to process more complex scenes (e.g., extracting and identifying objects), and associates this complex set of features with other knowledge regarding the scene. By the time a clinician learns to interpret medical images, they already have vast experience in interpreting a wide variety of visual scenes. Initializing models via transfer learning is an important approach that should be considered whenever training a CNN to perform a new task, especially when limited data is a concern. There are some limitations to this study. The ground truth used here was derived from the generally accepted gold standard of subjective assessment of stereoscopic fundus photography by at least 2 graders certified by the UCSD Optic Disc Reading Center, with adjudication by a third experienced grader in cases of disagreement. Because the images were collected and reviewed over the course of ~15 years, the ground truth was not generated by a single pair of graders and a single adjudicator. This means that model classifications cannot be compared to 2 specific graders to determine agreement. Rather, our results were compared to the ground truth based on the final grade. For our dataset, two graders agreed in the assessment of an image in approximately 77% of cases, and adjudication was required in approximately 23% of cases. This is comparable to previously published levels of agreement between graders and in ground truth data used to train automated systems [52,53].
Deep learning for root systems
The prevailing methodology when working with images in deep learning is the CNN. CNNs improve upon traditional machine learning through their ability to learn not only solutions to problems but also the most effective way to transform data to make this goal easier. This representation learning provides CNNs with unparalleled discriminative power and has seen them quickly move into a dominant position within the field of computer vision. A CNN is a layered structure that performs successive image-filtering operations, transforming an image from a traditional RGB input into a new feature representation. This transformation is learned during training and provides the final layers of the CNN with the best possible view of the data on which to base decisions. The deeper into a CNN data flows, the more abstracted and powerful the representation becomes. While the initial layers may compute simple primitives such as edges and corners, deeper into the network fea-
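The successive image-filtering operations described above can be made concrete with a minimal valid convolution and a hand-built vertical-edge filter of the kind early layers tend to learn; the filter and image here are illustrative:

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2-D convolution: one image-filtering step of a CNN layer."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# An early-layer primitive: a vertical-edge filter
img = np.zeros((6, 6))
img[:, 3:] = 1.0                        # left half dark, right half bright
edge_kernel = np.array([[-1.0, 1.0]])   # responds to a left-to-right increase
response = conv2d(img, edge_kernel)
print(response.max())  # strongest response sits exactly on the edge
```

In a trained CNN such filters are not hand-built but learned, and later layers apply the same mechanism to the feature maps produced by earlier ones, yielding progressively more abstract representations.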