MORPHOLOGICAL ANALYSIS MARTIN KAY A computer program that is intended to carry out nontrivial operations on texts in an ordinary language must start by recognizing[.]
This paper describes a project tagging a spontaneous speech corpus with morphological information such as word segmentation and parts-of-speech. We use a morphological analysis system based on a maximum entropy model, which is independent of the domain of corpora. In this paper we show the tagging accuracy achieved by using the model and discuss problems in tagging the spontaneous speech corpus. We also show that a dictionary developed for a corpus on a certain domain is helpful for improving accuracy in analyzing a corpus on another domain.
This paper describes a finite-state approach to morphological analysis and generation of Gagauz, a Turkic language spoken in the Republic of Moldova. Finite-state approaches are commonly used in morphological modelling, but one of the novelties of our approach is that we explicitly handle orthographic errors and variance, in addition to loan words. The resulting model has a reasonable coverage (above 90%) over a range of freely-available corpora.
Most previous systems use the morpheme as the processing unit for morphological analysis. We would like to examine the effectiveness of the proposed models based on the Eojeol and the syllable. First, compare the models that use the Eojeol-unit analysis with the others (“M” vs. “EM”, “S” vs. “ES”, and “MS” vs. “EMS”). When the Eojeol-unit analysis is applied, AA decreases, and AIS and 1A increase. Then, compare the models that use the syllable-unit analysis with the others (“E” vs. “ES”, “M” vs. “MS”, and “EM” vs. “EMS”). When the syllable-unit analysis is applied, AIR and 1A increase, and FR decreases. Therefore, both models are very useful when compared with the morpheme-unit model alone.
A TWO LEVEL MORPHOLOGICAL ANALYSIS OF KOREAN Deok Bong Kim, Sung Jin Lee, Key Sun Choi, and Gil C[.]
Disambiguation of morphological analysis in Bantu languages Arvi Hurskainen Department of Asian and Africa[.]
MORPHOLOGICAL ANALYSIS AS A STEP IN AUTOMATED SYNTACTIC ANALYSIS OF A TEXT GUSTAV LEUNBACH Introduction T[.]
We focus on the computational processing of Dravidian morphology, a critical issue since the family exhibits rich agglutinative inflectional morphology as well as highly productive compounding. For example, Dravidian nouns are typically inflected with gender, number and case in addition to various postpositions. E.g., consider the word agniparvvatattinṟeyeāppam ( അഗ്നിപർവ്വതത്തിന്റെയോപ്പം ) in Malayalam, which is composed of the compound noun stem agni+paṟavvatam (fire+mountain) and the following suffixes: tta (inflectional increment), inṟe (genitive case marker), ye (inflectional increment) and oppam (postposition). These combine to give the meaning of the English phrase "with a volcano." This complexity makes morphological analysis obligatory for the Dravidian languages.
The task of morphological analysis is to produce a complete list of lemma+tag analyses for a given word-form. We propose a discriminative string transduction approach which exploits plain inflection tables and raw text corpora, thus obviating the need for expert annotation. Experiments on four languages demonstrate that our system has much higher coverage than a hand-engineered FST analyzer, and is more accurate than a state-of-the-art morphological tagger.
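To make the task definition concrete, the following is a minimal sketch of a lookup baseline that inverts inflection tables into a form-to-analyses map. It is not the discriminative string transduction approach of the excerpt: it only returns analyses for forms it has already seen. The table rows, the tag strings, and the names `build_lookup` and `analyze` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy inflection-table rows: (lemma, tag, word form).
TABLE_ROWS = [
    ("walk", "V;NFIN", "walk"),
    ("walk", "V;PST", "walked"),
    ("walk", "V;3;SG;PRS", "walks"),
    ("walk", "N;SG", "walk"),
    ("walk", "N;PL", "walks"),
]


def build_lookup(rows):
    """Invert inflection tables: word form -> set of (lemma, tag) analyses."""
    lookup = defaultdict(set)
    for lemma, tag, form in rows:
        lookup[form].add((lemma, tag))
    return lookup


def analyze(form, lookup):
    """Return every known lemma+tag analysis of a form, sorted for stable output."""
    return sorted(lookup.get(form, set()))


lookup = build_lookup(TABLE_ROWS)
walks_analyses = analyze("walks", lookup)   # ambiguous: noun plural or verb 3sg
unseen_analyses = analyze("walking", lookup)  # not in the tables, so no analysis
print(walks_analyses)
print(unseen_analyses)
```

The empty result for the unseen form shows the coverage gap that a learned transducer is meant to close.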
(Morphological Analysis and GEneration for Arabic and its Dialects) system (Habash et al., 2005; Habash and Rambow, 2006). This system, which we use as starting point in this paper, compiles abstract high-level linguistic information of different types to finite state machinery. The second type is typically not implemented in finite-state technology. Examples include the Buckwalter Arabic Morphological Analyzer (BAMA) (Buckwalter, 2004) and its extension ALMORGEANA (Habash, 2007). These
There are usually two perspectives to be considered when NLP tools are evaluated: the developer’s and the users’ view. Developers validate their tool by comparing the input/output pairs to what they expect, but they also check e.g. the processing speed or other system parameters. Such validation of specific targets by the developer is dependent on the system’s knowledge base (e.g. lexicon contents and processing rules); in other words, developers validate and report on the performance of their system on the basis of what they expect it to be capable of doing. From the users’ perspective, system performance has to satisfy their requirements. We refer to Underwood (1998) who states – for NLP lexicons – that users’ requirements may differ significantly from what a system has to offer; this ranges from needing far less information than what the system has to offer to needing to extend or modify even the best output. Additionally, in the light of an increasing number of web services offering linguistic analysis (including morphological analysis), the user should have the possibility to compare the different tools on offer.
work exploring morphological tagging for Finnish include Kanerva et al. (2018) and Silfverberg et al. (2015). However, work on full data-driven morphological analysis, where the task is to return all and only the valid analyses for each token irrespective of sentence context, is almost non-existent for Uralic languages. The only system known to the authors is the recent neural analyzer for Finnish presented by Silfverberg and Hulden (2018). The system first encodes an input word form into a vector representation using an LSTM encoder. It then applies one binary logistic classifier conditioned on this vector representation for each morphological tag (for example NOUN|Number=Sg|Case=Nom). The classifier is used to determine if the tag is a valid analysis for the given input word form. Similarly to Silfverberg and Hulden (2018), our system is also a neural morphological analyzer but unlike Silfverberg and Hulden (2018) we incorporate lemmatization. Moreover, the design of our system considerably differs from their system as explained below in Section 3.
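The per-tag decision step described above can be sketched as follows. This is an illustration under stated assumptions, not the cited system: the LSTM encoder is replaced by a fixed two-dimensional vector, and the weights, biases, the verb tag, the 0.5 threshold, and the function name `valid_tags` are all hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def valid_tags(word_vector, tag_weights, tag_biases, threshold=0.5):
    """Run one independent binary logistic classifier per tag.

    Every tag whose probability reaches the threshold is returned, so a word
    form can receive zero, one, or several analyses.
    """
    accepted = []
    for tag, weights in tag_weights.items():
        score = sum(w * x for w, x in zip(weights, word_vector)) + tag_biases[tag]
        if sigmoid(score) >= threshold:
            accepted.append(tag)
    return accepted


# Stand-in for the encoder output (assumption: a real system would use an LSTM).
word_vector = [1.0, -1.0]

# Hypothetical classifier parameters, one weight vector and bias per tag.
tag_weights = {
    "NOUN|Number=Sg|Case=Nom": [2.0, 0.0],
    "VERB|Mood=Ind|Tense=Pres": [-1.0, 1.0],
}
tag_biases = {
    "NOUN|Number=Sg|Case=Nom": 0.0,
    "VERB|Mood=Ind|Tense=Pres": 0.0,
}

accepted_tags = valid_tags(word_vector, tag_weights, tag_biases)
print(accepted_tags)
```

Because each tag is scored independently, the output is a set of analyses rather than a single best label, which matches the "all and only the valid analyses" formulation.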
in compound formation. Finite verbal forms are marked for number and (4) person (1st, 2nd, 3rd), and (5) by a complex system of tenses and modes. The rule-based morphological analyzer produces fine-grained annotations that cover these five morphological categories. Because the classification methods used in this paper require a single output variable from a nominal scale, an obvious approach would use the Cartesian product of the five morphological categories as target variable. However, this approach unnecessarily complicates the learning process, because most feature combinations cannot co-occur in the morphological analysis of a single word. As a consequence, Hellwig (2015) reduced the tag set used
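The growth of a Cartesian-product label set, and how few of its members can actually occur, can be illustrated with a small sketch. The value inventories and the co-occurrence constraint below are hypothetical placeholders, not the tag set of the excerpt; only the idea of five categories combined into one target variable comes from the text.

```python
from itertools import product

# Hypothetical value inventories; "none" means the category does not apply.
CATEGORIES = {
    "case": ["nom", "acc", "none"],
    "number": ["sg", "pl", "none"],
    "gender": ["m", "f", "n", "none"],
    "person": ["1", "2", "3", "none"],
    "tense_mode": ["pres", "past", "none"],
}


def all_combinations(categories):
    """Enumerate the full Cartesian product as a list of feature dicts."""
    names = list(categories)
    return [dict(zip(names, values)) for values in product(*categories.values())]


def is_plausible(combo):
    """Assumed constraint: a word is either nominal or a finite verb, never a mix."""
    nominal = (
        combo["case"] != "none"
        and combo["gender"] != "none"
        and combo["number"] != "none"
        and combo["person"] == "none"
        and combo["tense_mode"] == "none"
    )
    verbal = (
        combo["person"] != "none"
        and combo["tense_mode"] != "none"
        and combo["number"] != "none"
        and combo["case"] == "none"
        and combo["gender"] == "none"
    )
    return nominal or verbal


combos = all_combinations(CATEGORIES)
plausible = [c for c in combos if is_plausible(c)]
print(len(combos), len(plausible))
```

With these toy inventories the product has 432 labels, of which only 24 satisfy the constraint, so a classifier trained on the raw product would spend most of its label space on combinations that never appear.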
The morphological analyzer illustrated in this paper falls into the first class of the Gold (2001) classification. The system aims at high accuracy of morphological analysis of English, with morphological rules obtained through unsupervised machine learning. The analyzer applies the letter transitional probability proposed in Keshava & Pilter (2005) both in morphological rule learning and in disambiguation of morphological analysis. An initial evaluation of the analyzer shows a promising result with 88.42% precision, 78.46% recall and an 83.14% F-score, which exceeds the best results for English reported in Unsupervised Segmentation of Words into Morphemes – Challenge 2005.
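The reported figures are mutually consistent if the F-score is the balanced harmonic mean of precision and recall, which is an assumption since the excerpt does not state the weighting. A short check, with the hypothetical helper name `f_score`:

```python
def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the excerpt, in percent.
reported_f = round(f_score(88.42, 78.46), 2)
print(reported_f)  # 83.14
```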
KyTea (Neubig et al., 2011) is a similar tool that can perform morphological analysis for languages written in continuous script. It can also be trained using partially annotated data and can output point-wise confidence scores for the analysis result, which were used for creating partially annotated data in an active learning scenario. Still, by using a point-wise approach and estimating auxiliary tags (like POS) after computing segmentation, KyTea trades off accuracy for simplicity. Juman++ is faster, has better accuracy, does tag estimation jointly with segmentation, uses an online learning approach and can use longer contexts in the form of RNNLM and trigram features.
In order to alleviate the performance loss of maximum matching, we propose the concept of Context Independent Strings (CISs), which are strings having no ambiguity in terms of morphological analysis. We also propose an algorithm for building the CIS dictionary from a large amount of automatically analysed text. The dictionary maps CISs to the results of morphological analysis (sequences of words and POS tags).
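A minimal sketch of greedy maximum matching over such a dictionary is given below. It illustrates the general longest-match technique only; the excerpt does not specify its algorithm at this level of detail. The dictionary entries, the unspaced English toy input, the `None` marker for unmatched characters, and the function name `max_match` are assumptions.

```python
def max_match(text, cis_dict, max_len=None):
    """Greedy left-to-right longest match against a string -> analysis dictionary.

    Returns a list of (matched string, analysis) pairs. A character that starts
    no dictionary entry is emitted alone with analysis None, standing in for a
    fallback analyzer.
    """
    if max_len is None:
        max_len = max((len(key) for key in cis_dict), default=0)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in cis_dict:
                result.append((chunk, cis_dict[chunk]))
                i += length
                break
        else:
            result.append((text[i], None))
            i += 1
    return result


# Hypothetical dictionary: each string maps to its (word, POS) sequence.
cis_dict = {
    "thecat": [("the", "DET"), ("cat", "NOUN")],
    "the": [("the", "DET")],
    "sat": [("sat", "VERB")],
}

segments = max_match("thecatsatx", cis_dict)
for chunk, analysis in segments:
    print(chunk, analysis)
```

Storing whole unambiguous strings with their full analyses lets a single lookup replace several word-level decisions, which is the motivation stated in the excerpt.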
However, this morphological knowledge can be exploited by adding, as training features, the results of the rule-based morphological analysis described in section 4. That gives a reasonably accurate list (containing the correct form in 98% of cases) of the tags that seem possible for each word. So in addition to the classifier training features commonly used for other languages, we also supply a list of possible part-of-speech and tag options for the selected word and its closest neighbours. We also provide a ‘recommended’ POS and tag, calculated as described in section 5.1, which gives a ~1% additional boost in accuracy. This change augments the machine learning of relations between endings (letter n-grams) and morphological features with the linguistic rules in the analyser, and makes it possible to achieve good results with rather small training corpora.
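The feature design described above might look roughly like the following sketch, which combines ending n-grams with the candidate tags that an analyzer proposes for the current word and its immediate neighbours. The feature names, the toy English tokens, the dictionary standing in for the rule-based analyzer, and the function name `token_features` are assumptions; the 'recommended' tag is left out because its calculation is not given in this excerpt.

```python
def token_features(tokens, index, candidate_tags, max_suffix=3):
    """Build a sparse feature dict for the token at `index`.

    candidate_tags is a stand-in for the rule-based analyzer: it maps a word
    to the list of tags the analyzer considers possible.
    """
    word = tokens[index]
    features = {}
    # Letter n-gram endings of the current word.
    for n in range(1, min(max_suffix, len(word)) + 1):
        features[f"suffix{n}={word[-n:]}"] = 1
    # Analyzer-proposed tags for the previous, current and next word.
    for offset, name in ((-1, "prev"), (0, "cur"), (1, "next")):
        pos = index + offset
        if 0 <= pos < len(tokens):
            for tag in candidate_tags.get(tokens[pos], ()):
                features[f"{name}_can_be={tag}"] = 1
    return features


tokens = ["the", "dog", "walks"]
candidate_tags = {"the": ["DET"], "dog": ["NOUN"], "walks": ["NOUN", "VERB"]}

features = token_features(tokens, 2, candidate_tags)
print(sorted(features))
```

The classifier then only has to choose among tags the analyzer already allows, which is one plausible reason such features help when training data is small.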
This study attempts to explore the discriminative phenotype features of the common facial morphological variations between the Mainland Japanese and the Ryukyuan; differences in phenotype features between these two populations are expected to help infer differences in gene base sequences. In order to explore the phenotype features of facial morphology between the Mainland Japanese and the Ryukyuan, we propose a general framework of 3D facial morphological analysis, which is shown in Figure 1. Our framework mainly includes two steps: (1) registration and landmark correspondence (procedures 1-4 in Fig. 1), (2) statistical analysis of 3D facial morphological variation and population classification using facial morphological features (procedures 5 and 6 in Fig. 1). Both principal component analysis (PCA) and Mean Hyperplane are used for exploring the facial morphological variations. Experiments show that our proposed strategy can give promising identification performance between the Mainland Japanese and the Ryukyuan.
From the viewpoint of automatic processing, the non-standard word-forms described in Section 4 should be divided into two groups: those that have to be included in the user lexicon manually, and those that can be normalized using some kind of rewriting rules prior to the morphological analysis, or added to the user lexicon automatically. This division corresponds roughly to that of frequent, irregular, non-productive on one side, and infrequent, regular, productive morphological or orthographic changes, on the other side.