Top PDF text corpus

Construction and Analysis of a Large Vietnamese Text Corpus

... Vietnamese text processing started to become active about twelve years ...a corpus consisting of newspapers coming from two news sources collected within 6 months in ...This corpus was annotated ...

5

Statistical Analysis of Multilingual Text Corpus and Development of Language Models

... optimal text and of the language model largely depends on the quality of the text corpus ...The corpus should be unbiased and large enough to convey the entire syntactic behaviour of the ...

5

Approach for Transforming Monolingual Text Corpus into XML Corpus

... Figure 3.1: Sample Input Text File (input.txt). The English Stanford Tagger has three modes: tagging, training, and testing. Tagging allows you to use a pre-trained model (two English models are included) to ...

5

Mining Paraphrasal Typed Templates from a Plain Text Corpus

... Finding paraphrases in text is an important task with implications for generation, summarization and question answering, among other applications. Of particular interest to those applications is the ...

11

Risamálheild: A Very Large Icelandic Text Corpus

... Icelandic text corpus has been evident for some ...large text corpora and other textual resources has increased ...a corpus such as the one described here has therefore been considered a top ...

6

A Survey on Identification of Emotion from Text Corpus

... written text. Emotion detection from text is a research issue as it is difficult to automate the recognition of feelings through the ...the text corpus and different emotion models is also ...

5

Multi-Layer Discourse Annotation of a Dutch Text Corpus

... Text has been normalized to UTF-8, tokenized and segmented into sentences using the Alpino tools. The RST annotation is created using O'Donnell's RST tool. The MMAX annotation tool was used to mark pairs of ...

6

Certification and Cleaning up of a Text Corpus: Towards an Evaluation of the “Grammatical” Quality of a Corpus

... Another method to detect syntactic errors is based upon the language rules. This method consists in applying syntactic rules to the corpus under analysis. (Mitton, 1996) recommends, in case of analysis failure, to ...

8

Web Text Corpus for Natural Language Processing

... web text, training on 153 manually marked web ...newspaper text only use regular text features, such as words and ...web text uses HTML tag features in addition to regular text ...web ...

8

Identifying and Reducing Gender Bias in Word Level Language Models

... Many text corpora exhibit socially problematic biases, which can be propagated or amplified in the models trained on such ...a text corpus and the text generated from a recurrent neural net- ...

9

Large Coverage Root Lexicon Extraction for Hindi

... This paper describes a method using morphological rules and heuristics, for the automatic extraction of large-coverage lexicons of stems and root word-forms from a raw text corpus. We cast the problem ...

9

Development of Speech corpora for different Speech Recognition tasks in Malayalam language

... of text corpus and speech corpus for each task is being ...of text and speech corpus used for each recognition task is ...these text and speech corpora are explained ...

8

Computer Program for Counting the Part of Speeches, Text Narrations by using Secondary Data Algorithm Techniques

... In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as ...

6

Analysis of Selective Strategies to Build a Dependency Analyzed Corpus

... This paper discussed several sampling strategies for Japanese dependency-analyzed corpora, testing them with the Kyoto Text Corpus and the IPAL corpus. The IPAL corpus was constructed ...

8

A Fully Lexicalized Probabilistic Model for Japanese Syntactic and Case Structure Analysis

... Fujio and Matsumoto presented a syntactic analysis method based on lexical statistics (Fujio and Matsumoto, 1998). They made use of a probabilistic model defined by the product of a probability of having a dependency ...

8

AUTHOR IDENTIFICATION OF HINDI POETRY

... a corpus of 3,000 passages which is the work of three Bengali ...available text corpus were used as dataset for extracting features namely English books and Reuters corpus volume ...

5

A Structured Approach for Building Assamese Corpus: Insights, Applications and Challenges

... language text, a well-structured text corpus is very much ...a corpus can directly influence the performance of various Natural Language Processing ...Assamese Corpus in UNICODE ...

8

Ambiguity in Semantically Related Word Substitutions: an investigation in historical Bible translations

... We use three English Bibles. The first is the King James Version (KJV) from 1611–1769. It has been annotated with word senses. The other two Bibles are the Bible in Basic English (BBE)— 1941–1949—and Robert Young’s ...

6

COUNTER: COrpus of Urdu News TExt Reuse

... a corpus is clear from the above discussion, and for us, it represents the first stage in a larger ...this corpus to inform the design of an Urdu text reuse detection ...the corpus will serve ...

26

Modality in Text: a Proposal for Corpus Annotation

... up-to-date corpus of contemporary Portuguese. The written sub-part of the corpus consists of 310 million words, sampled from texts mostly after 1970 gathered from many different genres and domains such as ...

8
