Existing asynchronous parallel learning methods are designed only for sparse-feature models, and they face new challenges on dense-feature models such as neural networks (e.g., LSTM, RNN). The problem for dense features is that asynchronous parallel learning introduces gradient errors caused by overwrite actions. We show that gradient errors are common and inevitable. Nevertheless, our theoretical analysis shows that, despite gradient errors, the learning process can still converge toward the optimum of the objective function for many practical applications. We therefore propose AsynGrad, a simple method for asynchronous parallel learning with gradient errors. Experiments on various dense-feature models (LSTM, dense-CRF) and various NLP tasks show that AsynGrad achieves substantial improvements in training speed without any loss in accuracy.
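The overwrite behavior described above can be illustrated with a minimal Hogwild-style sketch: worker threads read and write a shared parameter vector without locks, so stale reads and overwrites (the "gradient errors") can occur. This is a generic illustration, not the AsynGrad algorithm itself; the function names, learning rate, and step counts are our own assumptions.

```python
import threading

def async_sgd(grad_fn, w, n_workers=4, steps=250, lr=0.01):
    """Lock-free parallel SGD: each worker repeatedly reads a (possibly
    stale) snapshot of the shared parameters w, computes a stochastic
    gradient with grad_fn, and writes updates in place without any
    synchronization. Overwrites between workers are allowed by design."""
    def worker():
        for _ in range(steps):
            g = grad_fn(list(w))        # snapshot may already be stale
            for i in range(len(w)):
                w[i] -= lr * g[i]       # unsynchronized in-place update
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```

For a smooth convex objective such as (w - 1)^2, the iterates still settle near the optimum even though individual updates may clobber each other, which is the phenomenon the theoretical analysis above addresses.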
Sentence-level evaluation in MT has turned out to be far more difficult than corpus-level evaluation. Existing sentence-level metrics employ a limited set of features, most of which are rather sparse at the sentence level, and their intricate models are rarely trained for ranking. This paper presents a simple linear model exploiting 33 relatively dense features, some of which are novel while others are known but seldom used, and trains it under the learning-to-rank framework. We evaluate our metric on the standard WMT12 data and show that it outperforms the strong baseline METEOR. We also analyze the contribution of individual features and the choice of training data, language-pair vs. target-language data, providing new insights into this task.
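The learning-to-rank setup can be sketched as a minimal pairwise perceptron over dense feature vectors: for each training pair (better translation, worse translation), the weight vector is nudged until it scores the better one higher. This is a generic stand-in for illustration; the 33 actual features and the exact ranking loss used in the paper are not reproduced here.

```python
def train_pairwise_ranker(pairs, dim, epochs=10, lr=0.1):
    """Perceptron-style pairwise training. `pairs` is a list of
    (better_features, worse_features) tuples; whenever the current
    weights rank the pair incorrectly, update w toward the better one."""
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            margin = sum(wi * (b - c) for wi, b, c in zip(w, better, worse))
            if margin <= 0:  # ranking violated
                for i in range(dim):
                    w[i] += lr * (better[i] - worse[i])
    return w
```

At test time the learned w simply scores each hypothesis by a dot product, which is exactly the "simple linear model" regime described above.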
Scalable discriminative training methods are now broadly available for estimating phrase-based, feature-rich translation models. However, the sparse feature sets typically appearing in research evaluations are less attractive than standard dense features such as language and translation model probabilities: they often overfit, do not generalize, or require complex and slow feature extractors. This paper introduces extended features, which are more specific than dense features yet more general than lexicalized sparse features. Large-scale experiments show that extended features yield robust BLEU gains for both Arabic-English (+1.05) and Chinese-English (+0.67) relative to a strong feature-rich baseline. We also specialize the feature set to specific data domains, identify an objective function that is less prone to overfitting, and release fast, scalable, and language-independent tools for implementing the features.
Nevertheless, there remain challenging problems of how to encode all the available information from the configuration and how to model higher-order features based on the dense representations. In this paper, we train a neural network classifier to make parsing decisions within a transition-based dependency parser. The neural network learns compact dense vector representations of words, part-of-speech (POS) tags, and dependency labels. This results in a fast, compact classifier, which uses only 200 learned dense features while yielding good gains in parsing accuracy and speed on two languages (English and Chinese) and two different dependency representations (CoNLL and Stanford dependencies). The main contributions of this work are: (i) showing the usefulness of dense representations that are learned within the parsing task, (ii) developing a neural network architecture that gives good accuracy and speed, and (iii) introducing a novel activation function.
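The input layer of such a classifier can be sketched as follows: tokens extracted from the parser configuration (words, POS tags, dependency labels at fixed stack/buffer positions) are mapped to dense embeddings and concatenated into one feature vector. The embedding dimension, token names, and initialization below are toy assumptions for illustration; in the parser the embeddings are learned jointly with the classifier.

```python
import random

random.seed(0)
EMB_DIM = 4  # toy dimension; the classifier above learns larger embeddings

def make_embeddings(vocab):
    """Randomly initialized dense embeddings, one vector per token type
    (words, POS tags, and dependency labels share this mechanism)."""
    return {t: [random.uniform(-0.1, 0.1) for _ in range(EMB_DIM)] for t in vocab}

def featurize(config_tokens, emb):
    """Concatenate the embeddings of the tokens read off a parser
    configuration (e.g. top items of stack and buffer) into one dense
    input vector; unknown tokens fall back to a zero vector."""
    vec = []
    for tok in config_tokens:
        vec.extend(emb.get(tok, [0.0] * EMB_DIM))
    return vec
```

The concatenated vector is then fed to a small feed-forward network that scores each transition, which is what makes the classifier both compact and fast.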
Most modern Statistical Machine Translation (SMT) systems, for example (Koehn et al., 2003; Och and Ney, 2004; Chiang, 2005; Marcu et al., 2006; Shen et al., 2008), employ a large rule set that may contain tens of millions of translation rules or even more. In these systems, each translation rule has about 20 dense features, which represent key statistics collected from the training data, such as word translation probability, phrase translation probability, etc. Apart from these common features, there is no connection among the translation rules; the translation rules are treated as independent events.
can reconstruct 96 night images, twice as many as the baseline method using COLMAP with DoG+RootSIFT. This result validates the benefit of densely detected features, which can provide correspondences across large illumination changes because they suffer a smaller loss in keypoint detection repeatability than a standard DoG. On the other hand, methods with both sparse and dense features work well for reconstructing day images. The difference between with and without keypoint localization can be seen more clearly in the next evaluation.
To demonstrate the effectiveness of the individual improvements, we show results for four different En-De systems: (1) a baseline that contains only the 19 dense features, (2) a feature-rich translation system with the additional rich features, (3) a feature-rich translation system with an additional word class LM, and (4) a feature-rich translation system with an additional word class LM and a huge language model. For Fr-En we only built systems (1)-(3). Results for all systems can be seen in Table 5 and Table 6. From these results, we can see that both language pairs benefited from adding rich features (+0.4 BLEU for En-De and +0.5 BLEU for Fr-En). However, we only see improvements from the class-based language model in the case of the En-De system (+0.4 BLEU). For this reason our Fr-En submission did not use a class-based language model. Using additional data in the form of a huge language model further improved our En-De system.
As mentioned, we pair the handcrafted features with a Gradient Boosted Trees (GBT) model (Drucker and Cortes, 1996). Table 1 shows all hyperparameter values set for the GBT model. These values are the result of extensive grid searching, optimizing for F1-score (the task's official metric) and selecting the best-performing model on 5-fold cross-validated results.
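The grid-search procedure can be sketched generically as exhaustive search over parameter combinations with k-fold cross-validation, keeping the combination with the best mean score. The helper names (`train_fn`, `score_fn`) and the simple modulo fold split are assumptions for illustration, not the paper's exact tooling.

```python
from itertools import product

def grid_search_cv(train_fn, score_fn, X, y, param_grid, k=5):
    """Exhaustive grid search with k-fold cross-validation.
    train_fn(params, X_train, y_train) -> model
    score_fn(model, X_val, y_val) -> float (e.g. F1)
    Returns the parameter dict with the best mean validation score."""
    names = sorted(param_grid)
    folds = [list(range(i, len(X), k)) for i in range(k)]  # simple modulo split
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = []
        for held_out in folds:
            held = set(held_out)
            tr = [i for i in range(len(X)) if i not in held]
            model = train_fn(params, [X[i] for i in tr], [y[i] for i in tr])
            scores.append(score_fn(model, [X[i] for i in held_out],
                                   [y[i] for i in held_out]))
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_score, best_params = mean, params
    return best_params
```

In practice a library implementation (e.g. scikit-learn's `GridSearchCV` with `scoring="f1"`) does the same search; the sketch just makes the selection logic explicit.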
2010) and the closely related fields of named entity recognition (Li et al., 2018) and entity mention detection (Shen et al., 2015), with many different approaches. State-of-the-art named entity detection models have historically employed a combination of hand-crafted features, rules, natural language processing, string-pattern matching, and domain knowledge, using supervised learning on relatively small manually annotated corpora (Piskorski and Yangarber, 2013). A common approach to toponym detection has been to utilize place name gazetteers, which are directories of geographic names and their corresponding geolocations, to perform string matching of place names in text (Lieberman et al., 2010).
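The gazetteer string-matching approach can be sketched as a longest-match lookup of token n-grams against a name-to-coordinates dictionary. This is a minimal sketch; real systems handle tokenization, name ambiguity, and disambiguation between candidate geolocations far more carefully.

```python
def find_toponyms(text, gazetteer):
    """Greedy longest-match lookup of place names in `text` against a
    gazetteer mapping names to (lat, lon). Tries up to 3-token spans,
    preferring longer matches (so "New York" beats "York")."""
    tokens = text.split()
    found = []
    i = 0
    while i < len(tokens):
        for j in range(min(len(tokens), i + 3), i, -1):
            cand = " ".join(tokens[i:j]).strip(".,;")
            if cand in gazetteer:
                found.append((cand, gazetteer[cand]))
                i = j - 1  # skip past the matched span
                break
        i += 1
    return found
```

This illustrates why gazetteer matching alone is brittle: it finds only exact surface forms and returns every gazetteer entry regardless of context, which is what the learned approaches above aim to improve on.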
For video classification problems, there are some generally accepted algorithms based on deep learning methods [8, 12-15]. In order to extract more spatiotemporal features from a sequence of frames, 3D convolutional networks were introduced: a 3D kernel can learn spatiotemporal features. Tran et al. proposed a 3D architecture using 3D convolution kernels that slide over the whole video. They studied the 3D structure of the ResNet system and improved the structure of C3D. In addition, recurrent networks are also a good approach to extracting temporal relations between frames [16, 17, 12]. Donahue et al. proposed an approach using an LSTM to integrate features from a CNN. In practice, however, LSTM-based networks are inefficient and unsatisfactory at extracting spatiotemporal features; in the action recognition field, the performance of LSTM-based methods consistently lags behind that of CNN-based methods. Since the introduction of 3D CNNs, new approaches and methods have emerged one after another [10, 14, 18]. The common characteristic of these methods is the use of a sliding window to obtain short-term temporal context. However, such methods are computationally expensive because the average score over these windows needs to be calculated before final fusion.
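The sliding-window fusion shared by these methods can be sketched as follows: overlapping clips of fixed length are scored independently and the clip scores are averaged for the video-level prediction. Here `score_clip` stands in for any clip-level classifier (e.g. a 3D CNN forward pass); the names and window parameters are our own.

```python
def sliding_window_scores(frames, window, stride, score_clip):
    """Score overlapping `window`-frame clips of a video and average the
    per-class clip scores for the final video-level prediction. The cost
    grows with the number of windows, which is why this scheme is
    expensive: every clip requires a full classifier evaluation."""
    clip_scores = []
    for start in range(0, len(frames) - window + 1, stride):
        clip_scores.append(score_clip(frames[start:start + window]))
    n = len(clip_scores)
    n_classes = len(clip_scores[0])
    return [sum(s[c] for s in clip_scores) / n for c in range(n_classes)]
```

For a video of F frames, roughly (F - window) / stride + 1 classifier calls are needed before fusion, which makes the computational-cost objection above concrete.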
and Zitnick, 2014; Donahue et al., 2014; Fang et al., 2014; Karpathy and Fei-Fei, 2014) using neural network models. Although the above studies have shown interesting results, our task is arguably more complex than generating text descriptions: in addition to the visual and textual signals, we have to model the popular votes as a third dimension for learning. For example, we cannot simply train a convolutional neural network image parser on billions of images and use recurrent neural networks to generate texts such as "There is a white cat sitting next to a laptop." for Figure 1. Additionally, since not all images are suitable as meme images, collecting training images is also more challenging in our task. In contrast to prior work, we take a very different approach: we investigate copula methods (Schweizer and Sklar, 1983; Nelsen, 1999), in particular the nonparanormals (Liu et al., 2009), for joint modeling of raw images, text descriptions, and popular votes. Copulas are a statistical framework for modeling dependence among random variables (Liu et al., 2012) and are often used in Economics (Chen and Fan, 2006). Only very recently have researchers from the machine learning and information retrieval communities (Ghahramani et al., 2012; Han et al., 2012; Eickhoff et al., 2013) started to understand the theory and the predictive power of copula models. Wang and Hua (2014) were the first to introduce the semiparametric Gaussian copula (a.k.a. the nonparanormal) for text prediction. However, their approach may be prone to overfitting. In this work, we generalize Wang and Hua's method to jointly model text and vision features with popular votes, while scaling up the model using effective dropout regularization.
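The core nonparanormal idea can be sketched concretely: each feature is mapped through its empirical CDF and then through the Gaussian quantile function, so arbitrary marginal distributions become Gaussian scores whose correlations can be modeled jointly. The sketch below (pure Python, no ties handling) is a simplified illustration of the transform, not the full estimator of Liu et al. (2009).

```python
from statistics import NormalDist

def nonparanormal_transform(column):
    """Map one feature column to Gaussian scores via its empirical CDF:
    rank each value, rescale ranks into (0, 1), and apply the standard
    normal quantile function. Monotone, so rank order is preserved."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    nd = NormalDist()
    return [nd.inv_cdf(r / (n + 1)) for r in ranks]
```

After transforming each modality (image features, text features, votes) this way, the joint dependence can be estimated with an ordinary Gaussian correlation matrix on the transformed scores, which is what makes the model semiparametric.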
We propose a method to represent dependency trees as dense vectors through the recursive application of Long Short-Term Memory networks to build Recursive LSTM Trees (RLTs). We show that the dense vectors produced by Recursive LSTM Trees remove the need for structural features when used as feature vectors for a greedy Arc-Standard transition-based dependency parser. We also show that RLTs can incorporate useful information from the bi-LSTM contextualized representations used by Cross and Huang (2016) and Kiperwasser and Goldberg (2016b). The resulting dense vectors express both structural information relating to the dependency tree and sequential information relating to the position in the sentence. The resulting parser only requires the vector representations of the top two items on the parser stack, which is, to the best of our knowledge, the smallest feature set published for Arc-Standard parsers to date, while still achieving competitive results.
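The recursive composition can be sketched structurally: a subtree's vector is built bottom-up by combining the head's embedding with its children's composed vectors. In the sketch below a simple averaging combiner stands in for the LSTM cell, purely to show the recursion; the actual RLT composition function is an LSTM, not an average.

```python
def compose_tree(node, embed, combine):
    """Recursively build a dense vector for a dependency subtree.
    `node` is (head_token, [child_subtrees]); children are composed
    first, then merged with the head's embedding by `combine`."""
    head, children = node
    child_vecs = [compose_tree(c, embed, combine) for c in children]
    return combine(embed[head], child_vecs)

def average_combine(head_vec, child_vecs):
    """Toy stand-in for the LSTM cell: dimension-wise average of the
    head embedding and the composed child vectors."""
    vecs = [head_vec] + child_vecs
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(head_vec))]
```

Because composition is bottom-up, each stack item carries a single vector summarizing its whole subtree, which is why only the top two stack vectors suffice as parser features.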
The double rows of pits  and  in Field 2 represent the avenue of limes present from c.1701 to 1758, which were removed to construct the new carriage drive (Bell 1880–90, 13). The two hexagonal features  and  are both c.148m either side of this avenue, are on the same alignment, and lie at the same distance from the Dover road. It is possible, therefore, that the hexagons and the avenue of trees were part of the same c.1701 programme, although excavation (Wilkinson 2008, Wilkinson and Macpherson-Grant 2014) suggested that the northern of the two hexagons was cut by burials of 5th/6th-century date. A garden wall (?), perhaps associated with Oswald's Lodge, may be found at the southern part of Field 3, and the other boundaries in Field 3 appear to have been present in the 18th and 19th centuries. The artificial lake, brick foundations of a boathouse, cottages and lodges in the Park, and several landscaping features date to the 19th century.
uses Principal Component Analysis (PCA) to assess the population variance and is referred to as a Point Distribution Model (PDM). Both methods rely on manually annotated landmarks that are used directly or as a basis for constructing a dense point correspondence [1,4-6]. This means that both direct distances and statistically based methods are prone to human operator annotation errors. Several surface-based automatic registration methods for point correspondence exist; still, manual annotation, at least of a sparse set of landmarks, is widely used when facial analysis is applied in clinical settings. Understanding the variance (noise) introduced by manually annotated landmarks is important for knowing the
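The PDM construction can be sketched in pure Python: stack the flattened landmark vectors, subtract the mean shape, form the sample covariance, and extract the leading mode of variation (here via power iteration instead of a full eigendecomposition, to keep the sketch self-contained). Names and the single-mode simplification are our own; a real PDM keeps several modes.

```python
def pca_first_mode(shapes, iters=200):
    """Mean shape and leading PCA mode of a set of landmark shapes.
    `shapes` is a list of flattened landmark vectors [x1, y1, x2, y2, ...].
    The leading eigenvector of the sample covariance is found by power
    iteration; it is the direction of greatest shape variation."""
    n, d = len(shapes), len(shapes[0])
    mean = [sum(s[j] for s in shapes) / n for j in range(d)]
    X = [[s[j] - mean[j] for j in range(d)] for s in shapes]
    C = [[sum(X[i][a] * X[i][b] for i in range(n)) / (n - 1)
          for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return mean, v
```

Annotation noise enters exactly here: jitter in the input landmarks inflates the covariance C, so the estimated modes mix true population variance with operator error, which motivates the variance study described above.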
In this work, we first propose an accessible and general approach to collect, transform, and represent snapshots of road networks marked with congestion levels. We then apply it to build a dataset named SATCS for traffic congestion research. We develop a deep learning model, DCPN, that combines a DAE-inspired feature learning architecture with dense layers to learn representational features and temporal correlations from historical traffic congestion data, in order to predict future congestion levels in a transportation network near the Seattle area, Washington state, USA. To evaluate the effectiveness of the proposed DCPN model for short-term traffic congestion forecasting, we compare its prediction performance with that of two state-of-the-art deep learning neural network models using the back-testing technique. Results on the SATCS benchmark dataset show that our proposed DCPN is more effective and computationally efficient for short-term traffic congestion forecasting.
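The back-testing evaluation can be sketched as rolling-origin splits over the time axis: the model is trained on an expanding history and evaluated on the next horizon, repeatedly. This is the generic technique, not necessarily the exact split sizes used for SATCS; the parameter names are ours.

```python
def backtest_splits(series_len, initial_train, horizon, step):
    """Yield (train_indices, test_indices) pairs for rolling-origin
    back-testing on a time series of length `series_len`: each split
    trains on all observations before time t and tests on the next
    `horizon` points, with t advancing by `step`. No future leakage."""
    t = initial_train
    while t + horizon <= series_len:
        yield list(range(t)), list(range(t, t + horizon))
        t += step
```

Averaging the forecast error over all splits gives a single comparable score per model, which is how the comparison against the two baselines above can be made fair on temporal data.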
In earlier years, most studies focused on hand-crafted features. For nuclei detection and segmentation, a series of level set methods equipped with hand-crafted features, such as the Hough transform, concavity, and gradient, were proposed as a fundamental prerequisite in many breast cancer histopathological applications. These features were exquisitely designed with prior knowledge of boundary, region, or shape. Besides, a series of wavelet filters equipped with hand-crafted features, such as isotropic phase symmetry and texture descriptors, were proposed for the detection of beta cells, lymphocytes, or glandular structures. Afterwards, researchers utilized machine learning methods, like the Bayesian classifier and the Support Vector Machine (SVM), to detect or segment nuclei. These kinds of hand-crafted methods were also applied in a wide range of other digital histopathological applications, such as level sets for tubule segmentation in breast cancer and SVMs for gland detection in prostate cancer. However, hand-crafted features are limited in representational capability for solving such complex problems.
The Independent Component Analysis (ICA) based face recognizer captures the higher-order statistics of the image, considering both the amplitude and phase spectra, and thus overcomes the drawback of PCA-based systems, which consider only second-order statistics. Given training images as input, BICA extracts significant features and stores them in the database; in BICA, the optimal discriminant features are calculated. Face recognition using BICA is based on the computation of a feature space F (from the training set). The whole image is partitioned into many sub-images, i.e., blocks of the same size, and then a common demixing matrix for all the blocks is calculated. Compared with ICA, whose training vector is stretched from the whole image, B-ICA stretches only part of the face image as the training vector. B-ICA greatly dilutes the dimensionality dilemma of ICA because the dimension of the training vector is much smaller. The algorithm for BICA is given below:
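As a minimal sketch of the first step of the algorithm, the block-partition stage can be written as follows; the pooled block vectors from all training images then form the training set on which the common demixing matrix is estimated (e.g. via FastICA, omitted here). Function and parameter names are our own.

```python
def image_to_blocks(image, block_h, block_w):
    """Partition a 2-D image (list of rows) into non-overlapping
    block_h x block_w blocks, each flattened into a vector. These
    low-dimensional block vectors, rather than the full image vector,
    are the B-ICA training vectors, which is what shrinks the
    dimensionality relative to whole-image ICA."""
    H, W = len(image), len(image[0])
    blocks = []
    for r in range(0, H - block_h + 1, block_h):
        for c in range(0, W - block_w + 1, block_w):
            blocks.append([image[r + i][c + j]
                           for i in range(block_h)
                           for j in range(block_w)])
    return blocks
```

For a 64x64 face split into 8x8 blocks, each training vector has dimension 64 instead of 4096, which concretely illustrates how B-ICA "dilutes the dimensionality dilemma."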
In Table 3, we can see that the overall accuracy is about 82%. The wood mouse is recognized correctly 100% of the time, which is surprising considering that no biometric features are used. For over one third of the 18 species, this experiment obtained classification accuracy over 90%, including paca, ocelot, red deer, and wild boar. As expected, the red brocket deer is easily misclassified as the white-tailed deer because the two are of the same ontology and have a similar appearance. In order to better classify such species, biometric features, such as spots on the fur and the shape of antlers, play a key role in species recognition. However, to the best of our knowledge, automatically identifying biometric features remains a challenging task.
This discussion focuses only on the factors influencing LST changes in the change periods that show significant differences in LST value based on the t-statistics test, namely very dense (2003-2013), semi-dense (2003-2013), and semi-dense (1995-2013) (Table 7.5). The possible factors influencing this variation in LST changes are time of day, seasonality, and LULC condition (Zhou and Wang, 2011). The three images used in this study were captured at around the same local time, around 9-10 AM, thereby theoretically excluding the influence of the time-of-day factor. In addition, the seasonal factor that can influence LST differences has also been excluded by standardization in this analysis. Thus, this variation of UHI is assumed to be caused only by the LULC condition, which is triggered by anthropogenic factors. The increasing temperature, together with dense building density and geometry, limits wind circulation, which further impacts human comfort and triggers the use of air cooling. As a result, artificial heat sources and air pollution influence the LST change. The more air conditioners are used, the more heat is released, and vice versa, which eventually leads to the variation of UHI hot spot features. Therefore, the possible reason for this significant decline in LST might be associated with different seasons influencing human behavior and activities.