This paper describes the results of the NILC team at CWI 2018. We developed solutions following three approaches: (i) a feature engineering method using lexical, n-gram and psycholinguistic features, (ii) a shallow neural network method using only word embeddings, and (iii) a Long Short-Term Memory (LSTM) language model, which is pre-trained on a large text corpus to produce a contextualized word vector. The feature engineering method obtained our best results for the classification task and the LSTM model achieved the best results for the probabilistic classification task. Our results show that deep neural networks are able to perform as well as traditional machine learning methods using manually engineered features for the task of complex word identification in English.
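As a brief illustration of the feature-engineering side, the sketch below builds a tiny lexical feature vector (word length, a crude syllable count, corpus frequency) and trains a scikit-learn classifier; the feature set, the toy frequency table and the classifier choice are illustrative assumptions, not the paper's exact setup.

```python
from sklearn.linear_model import LogisticRegression

def lexical_features(word, freq_table):
    """Map a target word to a small lexical feature vector."""
    w = word.lower()
    vowels = "aeiouy"
    # Crude syllable estimate: count groups of consecutive vowels.
    syllables = sum(1 for i, ch in enumerate(w)
                    if ch in vowels and (i == 0 or w[i - 1] not in vowels))
    return [len(word),                  # character length
            max(syllables, 1),          # syllable estimate
            freq_table.get(w, 0)]       # corpus frequency (rarer = more complex)

freq_table = {"the": 1_000_000, "cat": 50_000, "ubiquitous": 120, "ephemeral": 90}
words = ["the", "cat", "ubiquitous", "ephemeral"]
y = [0, 0, 1, 1]                        # 0 = simple, 1 = complex
X = [lexical_features(w, freq_table) for w in words]

clf = LogisticRegression().fit(X, y)
print(clf.predict([lexical_features("ubiquitous", freq_table)]))  # predicted label
```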
In this work, we experiment with multiple deep learning models for compound type classification. Our extensive experiments include standard neural models comprising Multi-Layer Perceptrons (MLPs), Convolutional Neural Networks (CNNs) (Zhang et al., 2015) and recurrent models such as Long Short-Term Memory (LSTM) configurations. Unlike the feature-rich representation of Krishna et al. (2016), we rely on various word embedding approaches, including character-level, sub-word-level, and word-level embeddings. Through end-to-end training, the pretrained embeddings are fine-tuned into task-specific embeddings; all the architectures are therefore integrated with end-to-end training (Kim, 2014). Our best system, an end-to-end LSTM architecture initialised with fastText embeddings, shows promising results in terms of F-score (0.73) compared to the state-of-the-art classifier of Krishna et al. (2016) (0.74) and outperforms it in terms of accuracy (77.68%). In summary, we find that the models we experimented with report results competitive with the current state-of-the-art model for compound type identification. We achieve this without making use of any feature engineering or domain expertise. We release the codebase for all our models at https://github.com/Jivnesh/ISCLS-19.
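A minimal PyTorch sketch of such an end-to-end LSTM classifier follows; the embedding matrix here is a random stand-in for pretrained fastText vectors, and the hidden size and class count are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class CompoundTypeLSTM(nn.Module):
    def __init__(self, pretrained, num_classes):
        super().__init__()
        # Initialise with fastText vectors, then fine-tune end-to-end.
        self.embed = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.lstm = nn.LSTM(pretrained.size(1), 128, batch_first=True)
        self.out = nn.Linear(128, num_classes)

    def forward(self, token_ids):
        emb = self.embed(token_ids)          # (batch, seq, dim)
        _, (h_n, _) = self.lstm(emb)         # final hidden state
        return self.out(h_n[-1])             # class logits

pretrained = torch.randn(1000, 300)          # stand-in for fastText vectors
model = CompoundTypeLSTM(pretrained, num_classes=4)
logits = model(torch.randint(0, 1000, (2, 6)))  # two compounds, six tokens each
print(logits.shape)                          # torch.Size([2, 4])
```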
In the Parlando subcorpus, we find on average 37 lines, 18 lines with finite verbs, and 25 lines containing punctuation. In the Variable Foot subcorpus, the same distribution is 20, 10, and 11. This indicates that the poetic lines hardly contain complete sentences in either class, Parlando or Variable Foot. The results of classifying poems as dominated by Parlando or Variable Foot are presented in Table 1. As can be seen, among the classifiers using manually engineered features, the best results were obtained using only the parser information (F-measure 0.69), which is unexpected given that, according to theory, pauses identify the Variable Foot pattern. The classification results indicate that the method based on neural networks outperforms the manually engineered features, in particular when taking speech (and pausing) into account. The neural network that uses only text is inferior to the manual parsing features. This indicates that the neural network is better able to make use of information contained in the speech audio than can be captured by traditional feature engineering approaches.
In this paper, we study, compare and combine two state-of-the-art approaches to automatic feature engineering: Convolution Tree Kernels (CTKs) and Convolutional Neural Networks (CNNs) for learning to rank answer sentences in a Question Answering (QA) setting. When dealing with QA, the key aspect is to encode relational information between the constituents of question and answer in learning algorithms. For this purpose, we propose novel CNNs using relational information and combine them with relational CTKs. The results show that (i) both approaches achieve the state of the art on a question answering task, where CTKs produce higher accuracy, and (ii) combining such methods leads to unprecedentedly high results.
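One simple way to combine the two representations, sketched below under stated assumptions, is to sum a precomputed tree-kernel Gram matrix with a linear kernel over CNN sentence embeddings and train a single SVM; both matrices here are random stand-ins, and the paper's actual combination strategy may differ.

```python
import numpy as np
from sklearn.svm import SVC

def combined_kernel(tree_gram, cnn_vecs, alpha=0.5):
    """Convex combination of a CTK Gram matrix and a linear CNN kernel."""
    return alpha * tree_gram + (1 - alpha) * cnn_vecs @ cnn_vecs.T

rng = np.random.default_rng(0)
n = 8
A = rng.standard_normal((n, n))
tree_gram = A @ A.T                       # stand-in PSD tree-kernel matrix
cnn_vecs = rng.standard_normal((n, 50))   # stand-in CNN sentence embeddings
y = np.array([0, 1] * 4)                  # correct / incorrect answer labels

K = combined_kernel(tree_gram, cnn_vecs)
svm = SVC(kernel="precomputed").fit(K, y) # single SVM over the summed kernel
print(svm.predict(K))                     # predictions on the training pairs
```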
Our goal is to predict the first language (L1) of the authors of English essays with the help of the TOEFL11 corpus, where L1, prompts (topics) and proficiency levels are provided. We thus approach this task as a classification task employing machine learning methods. Among the key concepts of machine learning, we focus on feature engineering. We design features across all the L1 languages without making use of knowledge of prompt or proficiency level. During system development, we experimented with various techniques for feature filtering and combination, optimized with respect to the notions of mutual information and information gain. We trained four different SVM models and combined them through majority voting, achieving an accuracy of 72.5%.
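A compact sketch of this pipeline using scikit-learn appears below; the four feature views, here obtained by mutual-information filtering at different thresholds over synthetic data, are hypothetical stand-ins for the paper's feature sets.

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.datasets import make_classification

# Stand-in for the TOEFL11 essays: 4 of the 11 L1 classes, 50 features.
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           n_classes=4, random_state=0)

def svm_view(k):
    # Keep the k features with highest mutual information, then fit an SVM.
    return make_pipeline(SelectKBest(mutual_info_classif, k=k), SVC())

voter = VotingClassifier(
    [(f"svm{k}", svm_view(k)) for k in (10, 20, 30, 40)],  # four SVM models
    voting="hard")                                          # majority voting
voter.fit(X, y)
print(voter.score(X, y))
```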
Lina Wang et al. developed a comprehensive approach to feature engineering by combining features from all three domains, including statistical features in the time domain, and obtained higher average accuracy [5]. The neural-network-based BCI model proposed by Ankita Mazumder et al. [6] used a time-varying adaptive autoregressive (TVAAR) algorithm for the extraction of features in the time domain. Changjian Yang et al. introduced a fuzzy logic system using time-domain statistical features for the recognition of EEG signals [7].
This paper proposes a dependency-tree-based SRL system with proper pruning and extensive feature engineering. Official evaluation on the CoNLL 2008 shared task shows that our system achieves 76.19 in labeled macro F1 for the overall task, 84.56 in labeled attachment score for syntactic dependencies, and 67.12 in labeled F1 for semantic dependencies on the combined test set, using the standalone MaltParser. Besides, this paper also presents our unofficial system, obtained by 1) applying a new effective pruning algorithm; 2) including additional features; and 3) adopting a better dependency parser, MSTParser. Unofficial evaluation on the shared task shows that our system achieves 82.53 in labeled macro F1, 86.39 in labeled attachment score, and 78.64 in labeled F1, using MSTParser on the combined test set. This suggests that proper pruning and extensive feature engineering contribute much to dependency-tree-based SRL.
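For illustration, the sketch below implements one common dependency-tree pruning heuristic (collecting the dependents of the predicate and of each of its ancestors on the path to the root); the paper's own pruning algorithm may differ in its details.

```python
def prune_candidates(heads, predicate):
    """heads[i] is the head index of token i (-1 for the root).
    Returns argument-candidate indices for the given predicate index."""
    children = {}
    for tok, head in enumerate(heads):
        children.setdefault(head, []).append(tok)
    candidates, node = [], predicate
    while node != -1:                        # walk from predicate up to the root
        candidates += [c for c in children.get(node, []) if c != predicate]
        node = heads[node]                   # move to the syntactic head
    return candidates

# Toy tree: token 2 is the root; token 1 is a verb headed by 2;
# tokens 0 and 3 depend on the verb.
heads = [1, 2, -1, 1]
print(prune_candidates(heads, predicate=1))  # [0, 3]
```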
Abstract: The bankruptcy of manufacturing corporates is an important factor affecting economic stability. Corporate bankruptcy has become a hot research topic, studied mainly through financial data analysis and prediction. With the development of data science and artificial intelligence, machine learning technology helps researchers improve the accuracy and robustness of classification models. Ensemble learning, with its strong predictive power and robustness, plays an important role in machine learning and binary classification prediction. In this study, we propose a bankruptcy classification model combining a feature engineering method and an ensemble learning method: the Synthetic Minority Oversampling Technique (SMOTE) imbalanced-data learning algorithm is applied to generate a balanced dataset, a multi-interval discretization filter is applied to enhance the interpretability of the features, and an ensemble learning method is applied to obtain an accurate and objective prediction. To demonstrate the validity and performance of the proposed model, we conducted comparative experiments with ten other baseline classifiers, showing that the SMOTE imbalanced learning algorithm and the feature engineering method with multi-interval discretization were effective. The comparative experiment results show that the ensemble learning method has a good effect on improving the performance of the proposed model. The final results show that the proposed model achieves better performance and robustness than the other baseline classifiers in terms of classification accuracy, F-measure and Area Under the Curve (AUC).
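A minimal sketch of this pipeline is given below, using imbalanced-learn's SMOTE, scikit-learn's KBinsDiscretizer as a rough stand-in for the multi-interval discretization filter, and a random forest as the ensemble learner; the synthetic data and all hyperparameters are illustrative only.

```python
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Stand-in for the imbalanced financial dataset (roughly 5% bankrupt).
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)     # balance classes
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_disc = disc.fit_transform(X_bal)                          # interval features
clf = RandomForestClassifier(random_state=0).fit(X_disc, y_bal)
print(clf.score(disc.transform(X), y))                      # accuracy on original data
```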
Abstract: With the widespread use of modern technologies and social media networks, a new form of bullying occurring anytime and anywhere has emerged. This new phenomenon, known as cyberaggression or cyberbullying, refers to aggressive and intentional acts aimed at repeatedly causing harm to another person through rude, insulting, offensive, teasing or demoralising comments in online social media. As these aggressions represent a threatening experience to Internet users, especially kids and teens who are still shaping their identities, social relations and well-being, it is crucial to understand how cyberbullying occurs in order to prevent it from escalating. Considering the massive information on the Web, the development of intelligent techniques for automatically detecting harmful content is gaining importance, allowing the monitoring of large-scale social media and the early detection of unwanted and aggressive situations. Even though several approaches have been developed over the last few years, based both on traditional and deep learning techniques, several concerns arise over the duplication of research and the difficulty of comparing results. Moreover, there is no agreement regarding either which type of technique is better suited for the task or the type of features on which learning should be based. The goal of this work is to shed some light on the effects of learning paradigms and feature engineering approaches for detecting aggressions in social media texts. In this context, this work provides an evaluation of diverse traditional and deep learning techniques based on diverse sets of features, across multiple social media sites.
We describe the systems of the NLP-CIC team that participated in the Complex Word Identification (CWI) 2018 shared task. The shared task aimed to benchmark approaches for identifying complex words in English and other languages from the perspective of non-native speakers. Our goal is to compare two approaches: feature engineering and a deep neural network. Both approaches achieved comparable performance on the English test set. We demonstrated the flexibility of the deep-learning approach by using the same deep neural network setup in the Spanish track. Our systems achieved competitive results: all our systems were within 0.01 of the best macro-F1 score on the test sets, except on the Wikipedia test set, on which our best system is 0.04 below the best macro-F1 score.
significant feature engineering with a bi-LSTM neural network with and without feature engineering and word embeddings. We experiment with tagging each clitic in context and with tagging all clitics in a word collectively. We also compare both systems with MADAMIRA, a state-of-the-art Arabic POS tagging system. We show that adding explicit features to the bi-LSTM neural network and employing word embeddings separately improve POS tagging results. However, combining both explicit features and embeddings together leads to sub-optimal results. For testing, we employ the so-called “WikiNews” test set, which is composed of freely available recent news articles in multiple genres (Abdelali et al., 2016). We are making all resultant systems available as open-source systems.
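The sketch below shows, in PyTorch, one way explicit features can be added to a bi-LSTM tagger: a per-token feature vector is concatenated with the word embedding before the recurrent layer. All dimensions and the feature content are illustrative assumptions, not the cited system's setup.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim, feat_dim, hidden, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim + feat_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, token_ids, feats):
        # feats: explicit per-token features, e.g. clitic-segment indicators.
        x = torch.cat([self.embed(token_ids), feats], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)                   # per-token tag logits

tagger = BiLSTMTagger(vocab_size=5000, emb_dim=100, feat_dim=12,
                      hidden=64, num_tags=30)
logits = tagger(torch.randint(0, 5000, (2, 9)), torch.randn(2, 9, 12))
print(logits.shape)                          # torch.Size([2, 9, 30])
```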
The goal of this paper is to examine the impact of simple feature engineering mechanisms before applying more sophisticated techniques to the task of medical NER. Sometimes papers using scientifically sound techniques present raw baselines that could be improved by adding simple and cheap features. This work focuses on entity recognition in the clinical domain for three languages: English, Swedish and Spanish. The task is tackled using simple features, starting from the window size, capitalization and prefixes, and moving to POS and semantic tags. This work demonstrates that a simple initial step of feature engineering can improve the baseline results significantly. Hence, the contributions of this paper are: first, a short list of guidelines well supported by experimental results on three languages and, second, a detailed description of the relevance of these features for medical NER.
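The following sketch illustrates such a simple feature set (context window, capitalization, prefixes, POS) in the dictionary format used by CRF toolkits such as sklearn-crfsuite; the semantic-tag features are omitted and the example sentence is invented.

```python
def token_features(sent, i, window=2):
    """sent is a list of (token, pos) pairs; returns features for token i."""
    tok, pos = sent[i]
    feats = {
        "word.lower": tok.lower(),
        "word.isupper": tok.isupper(),
        "word.istitle": tok.istitle(),   # capitalization cues
        "prefix3": tok[:3],              # cheap affix feature
        "pos": pos,
    }
    for d in range(-window, window + 1): # surrounding context window
        if d != 0 and 0 <= i + d < len(sent):
            feats[f"word[{d:+d}].lower"] = sent[i + d][0].lower()
    return feats

sent = [("Patient", "NN"), ("denies", "VBZ"), ("chest", "NN"), ("pain", "NN")]
print(token_features(sent, 2))
```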
The paper describes the various methodologies involved in data cleaning and feature engineering. It emphasizes the step-by-step process required to perform a feature engineering task on a machine learning problem. It also describes the basic technical aspects of machine learning concepts needed to grasp the basics of ML. We obtain statistical insight into the data using various EDA tools, and perform univariate, bivariate and multivariate analyses to understand individual features and their relationships with the target variable. The analysis of data may vary depending upon the dataset.
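A brief pandas/seaborn sketch of these EDA steps follows; the file name and column names (age, income, target) are hypothetical placeholders for any tabular dataset.

```python
import pandas as pd
import seaborn as sns

df = pd.read_csv("data.csv")              # hypothetical input file

# Statistical insight into the data.
print(df.describe())
print(df.isna().sum())                    # missing values per column

# Univariate analysis: distribution of a single feature.
sns.histplot(df["age"])

# Bivariate analysis: one feature against the target variable.
sns.boxplot(data=df, x="target", y="income")

# Multivariate analysis: pairwise correlations across features.
sns.heatmap(df.corr(numeric_only=True), annot=True)
```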
We described our system for the universal dependency parsing task that relies heavily on feature engineering for each component in the pipeline. Our system achieves reasonable performance. An important observation we have is regarding the pretrained word embeddings. Unlike neural-net-based parsers, which can effectively use large unlabeled data via pretrained word embeddings, the picture of semi-supervised learning approaches for feature-engineering-based systems is unclear. Though we tried different ways in our work, the improvement is quite limited. In our future work, we plan to combine our system with neural-net-based approaches and explore other semi-supervised learning techniques.
This paper describes our system for multilingual semantic dependency parsing (SRL-only) for our participation in the shared task of CoNLL-2009. We illustrate that semantic dependency parsing can be transformed into a word-pair classification problem and implemented as a single-stage machine learning system. For each input corpus, large-scale feature engineering is conducted to select the best-fit feature template set, incorporated with a proper argument pruning strategy. The system achieved the top average score in the closed challenge: 80.47% labeled semantic F1.
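A toy sketch of the word-pair formulation follows: each (predicate, candidate) pair is mapped to features and classified into a role label in a single stage. The feature template and role labels here are deliberately minimal stand-ins for the paper's large-scale feature engineering.

```python
from sklearn.linear_model import LogisticRegression

def pair_features(sent, pred_idx, arg_idx):
    """Very small feature template over a (predicate, candidate) word pair."""
    return [
        arg_idx - pred_idx,                 # signed linear distance
        len(sent[arg_idx]),                 # candidate word length
        int(sent[arg_idx].istitle()),       # capitalization
    ]

sent = ["John", "gave", "Mary", "books"]
pairs = [(1, 0), (1, 2), (1, 3)]            # predicate "gave" vs. other words
X = [pair_features(sent, p, a) for p, a in pairs]
y = ["A0", "A2", "A1"]                      # toy semantic role labels

clf = LogisticRegression().fit(X, y)        # single-stage classifier
print(clf.predict([pair_features(sent, 1, 2)]))
```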
Different from previous approaches that use tree-edit information derived from syntactic trees, our kernel-based learning approach also uses tree structures but with rather different learning methods, i.e., SVMs and structural kernels, to automatically extract salient syntactic patterns relating questions and answers. In (Severyn et al., 2013c), we have shown that such relational structures encoding input text pairs can be directly used within the kernel learning framework to build state-of-the-art models for predicting semantic textual similarity. Furthermore, semantically enriched relational structures have been previously explored for answer passage reranking in (Severyn et al., 2013b; Severyn et al., 2013a). This paper demonstrates that this model also works for building a reranker at the sentence level, and extends the previous work by applying the idea of automatic feature engineering with tree kernels to answer extraction.
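For concreteness, the sketch below implements the classic subset-tree (SST) kernel of Collins and Duffy (2002) over NLTK trees, the basic building block behind such structural-kernel models; the relational links between question and answer structures used in the cited systems are not shown.

```python
from nltk.tree import Tree

def sst_kernel(t1, t2, lam=0.4):
    """Subset-tree kernel: counts shared tree fragments between two
    parses, weighted by a decay factor lam (Collins & Duffy, 2002)."""
    memo = {}

    def prod(n):
        # A node's production: its label plus the ordered child labels.
        return (n.label(),
                tuple(c.label() if isinstance(c, Tree) else c for c in n))

    def C(n1, n2):
        key = (id(n1), id(n2))
        if key not in memo:
            if prod(n1) != prod(n2):
                memo[key] = 0.0
            elif all(not isinstance(c, Tree) for c in n1):
                memo[key] = lam              # matching pre-terminals
            else:
                score = lam
                for c1, c2 in zip(n1, n2):   # aligned children (same production)
                    if isinstance(c1, Tree):
                        score *= 1.0 + C(c1, c2)
                memo[key] = score
        return memo[key]

    return sum(C(a, b) for a in t1.subtrees() for b in t2.subtrees())

q = Tree.fromstring("(SQ (VBZ is) (NP (DT the) (NN capital)))")
a = Tree.fromstring("(S (NP (DT the) (NN capital)) (VP (VBZ is)))")
print(sst_kernel(q, a))                      # similarity of the two trees
```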
This paper describes the winning solution of team National Taiwan University for track 1 of KDD Cup 2013. Track 1 of KDD Cup 2013 considers the paper-author identification problem, which is to identify whether a paper is truly written by an author. First, we conduct feature engineering to transform the various types of provided text information into 97 features. Second, we train classification and ranking models using these features. Finally, we combine our individual models to boost performance, using results on the internal validation set and the official Valid set. Some effective post-processing techniques are also proposed. Our solution achieves a 0.98259 MAP score and ranks first on the private leaderboard of the Test set.
Knowledge tracing is a vital element in personalized and adaptive educational systems. In order to investigate the peculiarities of SLA and explore the applicability of existing knowledge tracing techniques for SLA modeling, we conducted extensive data analyses on three newly released Duolingo datasets. We identified a number of factors affecting students' learning performance in SLA. We extracted a set of 23 features from student trace data and used them as input for the GTB model to predict students' knowledge state. Our experimental results showed that (i) a student's engagement plays an important role in achieving good exercise performance; (ii) contextual factors like the device being used and the learning format should be taken into account for SLA modeling; (iii) repetitive practice of words and exercises affects students' performance considerably; and (iv) GTB can effectively use some of the designed features for SLA modeling, and there is a need for further investigation on feature engineering. Apart from the future work already outlined in previous sections, we also plan to investigate deep knowledge tracing approaches and the inclusion of some
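A minimal sketch of a GTB predictor over engineered trace features is shown below, with scikit-learn's gradient boosting standing in for the paper's GTB model and synthetic data standing in for the 23 extracted features.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for the 23 engineered features per exercise attempt;
# label 1 = the student answers the exercise correctly.
X, y = make_classification(n_samples=2000, n_features=23, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gtb = GradientBoostingClassifier(n_estimators=200, max_depth=3)
gtb.fit(X_tr, y_tr)
print(gtb.score(X_te, y_te))
print(gtb.feature_importances_[:5])   # which engineered features matter most
```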
For the constrained mode, dictionaries (including the static mapping dictionary and the similarity index), classification feature calculation and classifier training are based on the same data set. This causes overfitting, because the dictionaries and the support and confidence features leak label information. However, our cross-validation results show that learning the dictionaries, the support and confidence features, and the classifier on the same data set nevertheless generalizes better. It leads to a better F1 score than splitting the data set into two parts and learning the dictionaries and features on one part and the classifier on the other. This is because having large dictionaries is crucial for candidate generation: the correct canonical form cannot be found if it is not among the candidates. Using all the available data instead of splitting it allows the system to learn larger dictionaries, which more than makes up for the overfitting problem.
In this work, unsupervised feature selection for CWS is based on frequent strings that are extracted automatically from unlabeled corpora. For convenience, these features are referred to as unsupervised features in the rest of this paper. Unsupervised features are suitable for closed training evaluation, where external resources or extra information are not allowed, especially for cross-domain tasks such as the SIGHAN CWS bakeoff 2010 (Zhao & Liu, 2010). Without proper knowledge, the closed training evaluation of word segmentation can be difficult with OOV words, where frequent strings collected from the test data may help. For incorporating unsupervised features into character-position-based CRF for CWS, Zhao and Kit (2007) tried strings based on accessor variety (AV), which was developed by Feng et al. (2004), and on co-occurrence strings (COS). Jiang et al. (2010) applied a feature similar to COS, called term-contributed boundary (TCB).
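As a concrete illustration, the sketch below computes the accessor variety of a candidate string: AV(s) is the minimum of the number of distinct left neighbours and distinct right neighbours of s observed in an unlabeled corpus (Feng et al., 2004). The toy corpus and boundary marker are illustrative.

```python
def accessor_variety(s, corpus, boundary="#"):
    """corpus is one long string; boundaries are marked by `boundary`."""
    left, right = set(), set()
    start = corpus.find(s)
    while start != -1:
        # Character immediately before / after this occurrence of s.
        left.add(corpus[start - 1] if start > 0 else boundary)
        end = start + len(s)
        right.add(corpus[end] if end < len(corpus) else boundary)
        start = corpus.find(s, start + 1)   # next (possibly overlapping) match
    return min(len(left), len(right))

corpus = "我喜欢北京#他住在北京#北京很大"
print(accessor_variety("北京", corpus))     # min(3 left, 2 right) = 2
```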