The Slavic languages provide fertile ground for corpus-based and computational investigations. For one, the Slavic languages combined count more than 315 million speakers (Sussex and Cubberley 2006) who have produced and continue to produce massive amounts of data on a daily basis. The Slavic languages also display peculiar typological features, which make them more challenging from a computational perspective than other commonly studied languages of the Indo-European family. For example, due to the richness of their morphology, Slavic languages exhibit relatively extensive freedom in word order in comparison to the Germanic and Romance languages. From a computational viewpoint, this property is challenging because it results in greater data sparsity, i.e., there are considerably more surface realisations for a given underlying linguistic phenomenon, be it the number of forms a word can display, or the number of places the subject or object of a sentence can occupy relative to the verb. Take, for example, the number of morphological forms of a verb: if participial forms are counted, Russian transitive verbs yield about 80 different forms, while English verbs have at most five. Another example concerns the increased number of surface patterns for such pairs as Verb-Direct Object, because the position of the Direct Object can be quite flexible. Data sparsity therefore needs to be adequately addressed in computational research on Slavic languages. Yet the same morphological richness and regularity in inflection also impacts computational studies of Slavic languages positively: it is possible to predict with reasonable precision the Part-of-Speech (PoS) category and the syntactic function of word forms from their endings, something that is considerably more difficult to achieve in the Germanic and Romance languages. Finally, the high regard in which both linguistics and mathematics are held in Slavic countries has yielded remarkable results.
One of the earliest examples of research with Slavic corpora is the seminal paper by Andrej Markov, which concerned predictions of word sequences on the basis of Eugene Onegin (Markov 1913; Hayes et al., 2013). This study led to the development of Markov models, which are commonly used in modern computational linguistics for predicting a linguistic phenomenon from the adjacent context.
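The kind of model this study inspired can be sketched in a few lines: a first-order Markov model estimates the probability of the next symbol category from the current one. The vowel/consonant classification and the example sentence below are illustrative assumptions, not Markov's actual data.

```python
from collections import Counter, defaultdict

def classify(ch, vowels="aeiou"):
    """Map a letter to 'V' (vowel) or 'C' (consonant)."""
    return "V" if ch in vowels else "C"

def train_bigram(text):
    """Estimate P(next category | current category) from a letter sequence."""
    seq = [classify(c) for c in text.lower() if c.isalpha()]
    counts = defaultdict(Counter)
    for prev, nxt in zip(seq, seq[1:]):
        counts[prev][nxt] += 1
    # Normalise raw counts into conditional probabilities.
    return {p: {n: c / sum(ctr.values()) for n, c in ctr.items()}
            for p, ctr in counts.items()}

model = train_bigram("the quick brown fox jumps over the lazy dog")
# In this sample, a consonant is far more likely than a vowel after a vowel.
assert model["V"]["C"] > model["V"].get("V", 0.0)
```

The same estimation scheme, applied to words or part-of-speech tags instead of letter categories, underlies the "predict from adjacent context" use of Markov models mentioned above.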
Automatic Survey Article Generation: The iOPENER Project (Information Organization for PENning Expositions on Research), a newly-commenced NSF-funded collaboration between the University of Maryland and the University of Michigan, will link automatic summarization (e.g., (Zajic et al., 2007; Radev et al., 2004; Radev et al., 2005)) and visualization work with citation classification. Key developments in this work will include extending techniques in summarization to handle redundancy, contradictions, and temporal ordering based on citation analyses (Elkiss et al., 2008). The intended result is a set of readily-consumable surveys of different scientific domains and topics, targeted to different audiences and levels. The project will leverage existing publicly-available resources such as the ACL Anthology, ACM Digital Library, CiteSeer, and others for analysis, retrieval, selection, and survey/timeline creation and visualization. The iOPENER software and resulting surveys and timelines will be made publicly available.
differentiates written and spoken genres by separating the characteristic style of a text or a discourse from the medium (graphic vs phonic) in which it appears. Modelled on a continuum of proximity vs distance between the interlocutors, style is determined by the conditions of a specific communication (dialogue vs monologue, familiarity of interlocutors, presence in time and space, etc.) and the strategies of verbalization (permanence, density of information, complexity, etc.). Spoken language prototypically displays several characteristics with respect to morpho-syntax (anacoluthon, paratactic sequences, holophrastic utterances, etc.), lexis (e.g. low variation of lexical items), and pragmatics (discourse particles, self-corrections, etc.) that differ from prototypical written genres such as fictional and journalistic prose. In many cases the characteristic style corresponds with the medium, i.e. informal conversations between friends are carried out orally, while an administrative regulation is published in written form. However, the relation between style and medium is not immutable, and the importance of distinguishing between the two becomes obvious when they are at opposite ends. A sermon, for example, features characteristics of written texts, and, in fact, it will be produced in written form. Nevertheless, it is usually presented orally. Thus, a sermon is transmitted orally (hence the medium is phonic), but, with respect to the conception of the text, it is based on a written tradition and reflects the characteristic style of written language. Changes in the relation between style and medium can also go the other way around. Particularly nowadays, with the rise of the new media, people can chat with each other using keyboards and touch-screens – a form of communication that was not possible some twenty years ago.
New media thus facilitate proximity communication such as informal chatting using a graphic representation of language resembling habits of spoken language use.
The model has been applied to learn simple English motion constructions from a corpus of child-directed utterances, paired with situation representations. The resulting learning trends reflect cross-linguistic acquisition patterns, including the incremental growth of the constructional inventory based on experience, the prevalence of early grammatical markers for conceptually basic scenes (Slobin, 1985) and the learning of lexically specific verb island constructions before more abstract grammatical patterns (Tomasello, 1992). For current purposes, the systems described demonstrate the utility of the ECG formalism for supporting computational modeling and offer a precisely specified instantiation of the simulation-based approach to language.
In this paper, we combine techniques from rule-based and corpus-based MT in a hybrid approach. We use only a dictionary, basic analytical resources, and a monolingual target-language corpus in order to enable the construction of an MT system for lesser-resourced languages. Statistical and example-based systems usually do not involve many linguistic notions. Cutting up sentences into linguistically sound subunits improves the quality of the translation. Demarcating clauses, verb groups, noun phrases, and prepositional phrases restricts the number of possible translations and hence also the search space. The sentence chunks are translated using a dictionary and a limited set of mapping rules. By matching the different translated items and higher-level structure bottom-up against the database information, one or more plausible translated sentences are constructed. A search engine ranks them using the frequencies of occurrence and the matching accuracy in the target-language corpus.
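The chunk-translate-rank pipeline described above can be sketched in miniature. This is not the authors' system; the Dutch-English dictionary entries, the toy target-language corpus, and the bigram-based ranking below are all invented for illustration.

```python
from itertools import product

# Hypothetical dictionary: source chunks mapped to candidate target translations.
DICTIONARY = {
    "de kat": ["the cat"],
    "jaagt op": ["chases", "hunts"],
    "de muis": ["the mouse"],
}

# Toy monolingual target-language corpus used only for ranking candidates.
TARGET_CORPUS = "the cat chases the mouse . a cat chases a bird .".split()

def bigram_count(words, corpus):
    """Count how often each consecutive word pair of the candidate occurs in the corpus."""
    pairs = list(zip(corpus, corpus[1:]))
    return sum(pairs.count(bg) for bg in zip(words, words[1:]))

def translate(chunks):
    """Translate each chunk via the dictionary, then rank full candidates by corpus frequency."""
    options = [DICTIONARY[c] for c in chunks]
    candidates = [" ".join(combo) for combo in product(*options)]
    return max(candidates, key=lambda s: bigram_count(s.split(), TARGET_CORPUS))

best = translate(["de kat", "jaagt op", "de muis"])
```

Here the corpus frequencies favour "chases" over "hunts", illustrating how a monolingual target-language corpus can disambiguate among dictionary alternatives without bilingual training data.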
levels of state-of-the-art NLP tools for textual data, such as CoreNLP (Manning et al., 2014) and OpenNLP (OpenNLP, 2017), but also implement open-source SDKs for tool developers to promote adoption. These workflow engines can operate different, separately developed tools only because of the underlying data interchange formats that impose a common I/O language between those tools. For such an interchange format, the LAPPS Grid uses the LAPPS Interchange Format (LIF), based on JSON-LD serialization (Verhagen et al., 2015), while WebLicht uses the XML-based Text Corpus Format (TCF) (Heid et al., 2010). Additionally, the LAPPS Grid defines a semantic linked data vocabulary that ensures semantic interoperability (Ide et al., 2015). Implementing in-platform interoperability has led to a multi-platform collaboration between LAPPS and CLARIN (Hinrichs et al., 2018).
The SMT concept comes from information theory, and it uses statistical models in order to generate the output. No customization work is needed because the translation tool learns from statistical analysis of bilingual corpora. It is less expensive than RBMT and makes better use of resources. A corpus is the basis of this method, but its creation is expensive when resources are limited. SMT does not work well with languages that have different word orders, and it is not possible to predict the result in SMT. E.g.: n-gram-based SMT.
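The information-theoretic core of SMT is the noisy-channel objective: choose the target sentence ê = argmax_e P(e) · P(f|e), combining a translation model with a target-language model. The sketch below scores a single word choice under this objective; all probability values are made-up illustrative numbers, not estimates from real corpora.

```python
import math

# Toy noisy-channel components (illustrative numbers, not real estimates):
# translation model P(f|e) and target language model P(e).
translation_model = {("casa", "house"): 0.8, ("casa", "home"): 0.7}
language_model = {"house": 0.002, "home": 0.004}

def score(f_word, e_word):
    """Noisy-channel objective in log space: log P(f|e) + log P(e)."""
    return math.log(translation_model[(f_word, e_word)]) + math.log(language_model[e_word])

# Even though "house" has the higher translation probability, the language
# model tips the decision: 0.7 * 0.004 > 0.8 * 0.002.
best = max(["house", "home"], key=lambda e: score("casa", e))
```

An n-gram-based SMT system applies the same argmax over whole sentences, with the language model factored into n-gram probabilities.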
However, despite the importance of understanding the size and quality of Arabic content available online, little research has been done to systematically assess and measure this content in a rigorous manner. This paper presents the results of a research project conducted by King Abdulaziz City for Science and Technology (KACST) and aimed at the development of an indicator for Arabic online content based on computational linguistic corpora. The paper is structured as follows: The next section provides a background for the research and its design, followed by a detailed description of the corpora development process and results. The paper then highlights the findings of the project and provides pointers for further research.
The theme of this workshop is the interaction between computational linguistics (CL) and general linguistics. The organizers ask whether it has been virtuous, vicious, or vacuous. They use only three of the rather extraordinary number of v-initial adjectives. Is the relationship vital, valuable, venturesome, visionary, versatile, and vibrant? Or vague, variable, verbose, and sometimes vexatious? Has it perhaps been merely vestigial and vicarious, with hardly any general linguists really participating? Or vain, venal, vaporous, virginal, volatile, and voguish, yet vulnerable, a relationship at risk? Or would the best description use adjectives like vengeful, venomous, vilificatory, villainous, vindictive, violent, vitriolic, vociferous, and vulpine?
In this paper we have outlined the MiLCA project, a Germany-wide joint project on distance learning in the field of Computational Linguistics funded by the German ministry for education and research. The immediate result of the project is the pool of course material itself. The courses cover all main areas of CL, theoretical as well as applied, and a number of special topics. All material is available in a uniform XML markup including metadata which can be imported into the open-source learning platform ILIAS. A collecting society has been founded to take care of the future administration and distribution of the course material. As a second result of the project, a number of XML standards which existed only as proposals at the beginning of the project have been further developed, tested, and documented by project partners and within a number of co-operations with external experts.
In 1974 a signal of 1,679 bits was considered potentially significant and challenging to the technology of the time; for example, it took three minutes to transmit. A quarter of a century later, we are used to processing messages of megabytes, gigabytes, or more in terrestrial communication networks such as the Internet. It is clear that we could look beyond a single pictogram or collection of diagrams to design a much larger corpus of data to represent humanity. Vakoch (1998c) advocates that the message constructed to transmit to extraterrestrials should include a broad, representative collection of perspectives rather than a single viewpoint or genre; this should strike a chord with corpus linguists, for whom a central principle is that a corpus must be “balanced” to be representative.
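The 1,679-bit figure itself rewards a small arithmetic check: 1,679 is the product of exactly two primes, 23 × 73, so the bit stream admits essentially one non-trivial rectangular arrangement, which is how the 1974 message encoded its pictogram. A minimal verification:

```python
def prime_factors(n):
    """Return the prime factorisation of n as a sorted list of factors."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

# 1,679 = 23 * 73: a semiprime, so a receiver can only arrange the bits
# as a 23x73 (or 73x23) grid.
assert prime_factors(1679) == [23, 73]
```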
COLING 1973 Volume 1: Computational and Mathematical Linguistics. Proceedings of the International Conference on Computational Linguistics. Biblioteca dell'«Archivum Romanicum», Ser[.]
The rest of the document is organised as follows. In section 2, after a brief introduction to the humble beginnings of Unit Terjemahan Melalui Komputer (UTMK), a computer-aided translation unit in Malaysia, and some of its past and present projects, we discuss UTMK's current projects and its participation in MABBIM's proposed Malay Linguistics portal. The text initiative comes under this portal.
Programming of Reversible Systems in Computational Linguistics. Gerhard Engelien, Forschungsgruppe LIMAS, Bonn. In my pa[.]
It is well known that lexically annotated text corpora are extremely helpful in lexical ambiguity resolution, especially in computational linguistics tasks. Lexical annotation means that polysemous words occurring in the corpus (ideally, all such words) are tagged for concrete lexical senses, specified by some lexicographic resource, be it a traditional explanatory dictionary or an electronic thesaurus like WordNet. Such lexically annotated corpora play a crucial role in word sense disambiguation (WSD) tasks. These tasks are normally solved by machine learning techniques, which are rapidly developing and improving. Research work in this area performed in varied paradigms for a multitude of languages is immense; recent papers, to cite but a few, include a comprehensive review by Navigli 2009, a paper by Moro et al. 2014, and recent research involving neural networks presented by Dayu Yuan et al. 2016.
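To illustrate the WSD task itself, the sketch below implements a simplified Lesk-style disambiguator: it picks the sense whose signature words overlap most with the context. This is a knowledge-based baseline, not one of the machine-learning approaches cited above, and the sense signatures are toy data.

```python
# Toy sense inventory: each sense paired with a set of signature words,
# as a lexicographic resource or thesaurus might provide.
SENSES = {
    "bank/finance": {"money", "deposit", "loan", "account"},
    "bank/river": {"river", "water", "shore", "slope"},
}

def disambiguate(context_words):
    """Return the sense whose signature overlaps most with the context (simplified Lesk)."""
    context = set(context_words)
    return max(SENSES, key=lambda s: len(SENSES[s] & context))

sense = disambiguate("she opened an account at the bank to deposit money".split())
```

A lexically annotated corpus turns this into a supervised problem: instead of fixed signatures, a classifier learns which context features predict each annotated sense.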
This conference represents just the third time in their 40+ year history that the two premier conferences in natural language processing, computational linguistics, and language technology have merged for a joint COLING/ACL event; and it’s the first time that the joint conference will be held in the southern hemisphere. It is fitting, then, that we received a record number of 630 submissions from 40+ countries: 39% from 13 countries in Asia, 29% from 17 countries in Europe, 25% from Canada and the United States, 4% from Australia and New Zealand, 2% from 4 countries in the Middle East, and less than 1% from South America (Brazil) and Africa (South Africa and Tunisia). Of the 630 submissions, 23% were accepted for paper presentations and an additional 20% for poster presentations.
Assigning author factions can be seen as a network classification problem, where the goal is to label nodes in a network such that there is (i) a correlation between a node’s label and its observed attributes and (ii) a correlation between labels of interconnected nodes (Sen et al., 2008). Such collective network-based approaches have been used on scientific literature to classify papers/web pages into their subject categories (Kubica et al., 2002; Getoor, 2005; Angelova and Weikum, 2006). If we knew the word distributions between factions beforehand, learning the author factions in our model would be equivalent to the network classification task, where
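The two correlations named above, node label with node attributes and node label with neighbour labels, can be sketched as a toy iterative collective classifier. The graph, the attribute scores, and the mixing weight below are all invented for illustration.

```python
# Toy network: nodes with neighbour lists and per-label attribute scores.
GRAPH = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
ATTR_SCORE = {"a": {"X": 0.9, "Y": 0.1}, "b": {"X": 0.5, "Y": 0.5},
              "c": {"X": 0.4, "Y": 0.6}, "d": {"X": 0.1, "Y": 0.9}}

def collective_classify(graph, attr, iterations=10, w=0.5):
    """Iteratively relabel nodes to agree with both their own attributes
    (weight w) and their neighbours' current labels (weight 1 - w)."""
    labels = {n: max(attr[n], key=attr[n].get) for n in graph}  # init from attributes alone
    for _ in range(iterations):
        for n in graph:
            votes = {}
            for lab in ("X", "Y"):
                neighbour_agree = sum(labels[m] == lab for m in graph[n]) / len(graph[n])
                votes[lab] = w * attr[n][lab] + (1 - w) * neighbour_agree
            labels[n] = max(votes, key=votes.get)
    return labels

labels = collective_classify(GRAPH, ATTR_SCORE)
```

Node "b" has uninformative attributes (0.5/0.5), so its final label is decided by its neighbours, which is exactly the second correlation the network classification formulation exploits.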