Top 20 Large Scale Text Collection for Unwritten Languages

Large Scale Text Collection for Unwritten Languages

... Existing methods for collecting texts from endangered languages are not creating the quantity of data that is needed for corpus studies and natural language processing tasks. This is because the process of ...

Towards Efficient Framework for Semantic Query Search Engine in Large-Scale Data Collection

... a text, and hence the meaning of the text; Second, it is able to represent a text by a compact, binary code, which enables fast ... input text such that the learned compact binary codes can be ...

Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages

... The Projekt Deutscher Wortschatz (Quasthoff, 1998) started more than 15 years ago by creating a corpus-based monolingual dictionary of the German language available at http://wortschatz.uni-leipzig.de. Since June 2006 ...

A Large scale Recipe and Meal Data Collection as Infrastructure for Food Research

... We organized the recipes and their related data in cookpad to help researchers use them. First, we collected approximately 1.7 million recipes that had been uploaded to cookpad by September 2014. Figure 2 gives an ...

A Large-Scale Web Data Collection as a Natural Language Processing Infrastructure

... of text data, and these applications mainly aim at web pages. Hence, a large amount of analyzed web pages is desirable as the language resources based on the document ...

Large-Scale Hierarchical Alignment for Data-driven Text Rewriting

... on large-scale sentence alignment is in machine translation, where adding pseudo-parallel pairs to an existing parallel dataset has been shown to boost the translation performance (Munteanu and Marcu, ...

Summarizing large text collection using topic modeling and clustering based on MapReduce framework

... multi-document text summarizer based on MapReduce framework is presented in this ... a large text collection and the summarization performance parameters compression ratio, retention ratio and ...

A Large Scale Comparison of Historical Text Normalization Systems

... ied collection of datasets used for historical text normalization so far, covering eight languages from different language families: English, German, Hungarian, Icelandic, Spanish, Portuguese, ...

Lexical Coverage Evaluation of Large scale Multilingual Semantic Lexicons for Twelve Languages

... Different from many existing lexical resources, which are built as independent lexical knowledge bases, our semantic lexicons form components of the USAS system, in which the lexicons and software framework are ...

The Creation of Large Scale Annotated Corpora of Minority Languages using UniParser and the EANC platform

... When designing a parsing tool for middle-sized and large corpora in different languages, we had in mind several requirements it should conform to. First, it should work fast enough to cope with big amounts ...

Design and Evaluation of a Parallel Classifier for Large Scale Arabic Text

... of text classification for different languages and is included in numerous experiments as a basis for ... of text classification research, and is one of the best classifiers within the field [4, ... a ...

A novel clustering algorithm for large-scale text collection and its incremental version

... of text collection. Therefore, NMF spends much more time when text collection ... from text collection to separate texts into several clusters of different ...

Acquisition of Translation Lexicons for Historically Unwritten Languages via Bridging Loanwords

... colloquial languages that were historically unwritten are starting to be written for the first ... these languages, there are extremely limited (approximately zero) resources available, not even ...

The Human Language Project: Building a Universal Corpus of the World’s Languages

... CLARIN (Simons and Bird, 2003; Broeder and Wittenburg, 2006; Váradi et al., 2008), but this is not the same as collecting and disseminating data. Initiatives to develop standard formats for linguistic annotations are ...

Normalising Audio Transcriptions for Unwritten Languages

... The task of documenting the world’s languages is a mainstream activity in linguistics which is yet to spill over into computational linguistics. We propose a new task of transcription normalisation as an algo- ...

GATECloud.net: a Platform for Large-Scale, Open-Source Text Processing on the Cloud

... potentially large amounts of data, including the text-processing application file(s), the document collection to be processed, the execution reports and the results files (if any are ...

Collection and linguistic processing of a large-scale corpus of medical articles

... In the next step, medical terms are identified. We currently have two different term identifiers: one is based on a robust lookup in the medical database UMLS (Lindberg et al., 1993; Humphreys et al., 1998), the other ...

The SPIRIT collection: an overview of a large web collection

... web collection and a number of statistics derived from our initial ... The collection appears to be a useful resource for those who require geographically more heterogeneous data than existing web ...

Large-scale development, characterization, and cross-amplification of EST–SSR markers in Chinese chive

... Chinese chive (A. tuberosum Rottler ex Spr.) is a tetraploid (2n = 4X = 32) perennial that belongs to the Liliaceae, and the species contain an abundance of organic sulfur compounds, which are responsible for plant’s ...

Text To Speech for Languages without an Orthography

... the languages of the world do not have a standardized writing ... such languages. It may seem useless to develop a text-to-speech system when there is no text ... these languages, and ...
