Web corpora

RIDIRE-CPI: an Open Source Crawling and Processing Infrastructure for Supervised Web-Corpora Building

... of web corpora with a specific design through a targeted crawling ...balanced web corpus for ...friendly web interface, several conversion and cleaning tools, an anti-duplicate filter, a ...

6

On Bias-free Crawling and Representative Web Corpora

... many corpora intended for use in corpus lin- ...option, web corpora are often regarded with reservation, partly because the sources from which they are compiled and their exact composition are unknown ...

7

Generating Paired Transliterated-cognates Using Multiple Pronunciation Characteristics from Web corpora

... A novel approach to automatically extracting paired transliterated-cognates from Web corpora is proposed in this paper. One of the most important issues addressed is that of taking multiple pronunciation ...

8

Mining and Modeling Relations between Formal and Informal Chinese Phrases from Web Corpora

... In this paper, we present a novel method for discovering and modeling the relationship between informal Chinese expressions found in web corpora and their formal equivalents. Specifically, we im- ...

10

Vocabulary-Based Language Similarity using Web Corpora

... Web corpora of the LCC for 346 languages are used in the following experiments. Text collections from different sources are compiled to test robustness of the approaches in regard to factors such as subject ...

6

Mining Key Phrase Translations from Web Corpora

... mixed-language web pages and extracting their ...retrieve web page snippets containing this phrase using the Google search ...mixed-language web page ...

8

{bs,hr,sr}WaC - Web Corpora of Bosnian, Croatian and Serbian

... constructed web corpora is quite an underresearched topic, with the exception of boilerplate removal / content extraction approaches that deal with this problem implicitly (Baroni et ...in web ...

7

Building a Database of Japanese Adjective Examples from Special Purpose Web Corpora

... purpose Web corpora (SPW corpora) and investigates the characteristics of examples in the database by comparison with examples that are collected from a general purpose Web corpus (GPW ...SPW ...

5

Efficient construction of metadata enhanced web corpora

... regarding Web resources, as linguistic evidence cannot be cited or identified properly in the sense of the ...all” web corpora may undermine the relevance of web texts for linguistic ...

10

Orthographic Errors in Web Pages: Toward Cleaner Web Corpora

... In our practical work we found that the collection and analysis of very large Web corpora is difficult for many reasons. For example, it is not clear how to treat pages with artificial vocabulary that is ...

46

Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor’s Love Affair

... As regards the hrenWaC corpus, it is based on a crawl of 6.1 million documents acquired from 25,924 domains, from which only 6,228 contained documents both in English and Croatian. From the collection of documents ...

8

Creating Multilingual Translation Lexicons with Regional Variations Using Web Corpora

... To examine the feasibility of the proposed method in identifying Chinese in Taiwan, mainland China and Hong Kong, we conducted a preliminary experiment. To avoid the data sparseness of using a tri-gram language model, we ...

8

Building Large Corpora from the Web Using a New Efficient Tool Chain

... of web corpora was examined successfully in the BootCaT/WaCky community from a linguistic ...of corpora based on linguistic ...a web corpus (or a set of search engine results) might never be ...

8

Ensembles of Classifiers for Cleaning Web Parallel Corpora and Translation Memories

... difference between the best performing classifiers for the positive and negative classes is 4 percent. There is no difference between the worst performing classifiers for the positive and negative classes. These ...

7

Evaluating Different Methods for Automatically Collecting Large General Corpora for Basque from the Web

... filtering web documents by their size improved the quality of the web ...from web servers or tend to have little textual content once page headers, menus, ...linguistic corpora, since they are ...

18

Evaluation of Web-based Corpora: Effects of Seed Selection and Time Interval

... of Web corpora, particularly, for constructing general balanced corpora meant to represent a language as a ...of Web documents and then find a good method to randomly sample a set of ...

6

Combining Lexical and Formatting Cues for Named Entity Acquisition from the Web

... While traditional electronic corpora such as journal articles or corpus resources BNC, SUSANNE, Brown corpus are satisfactory for classical lexical acquisition, Web corpora are another s[r] ...

9

Building a Korean Web Corpus for Analyzing Learner Language

... Other possibilities There are other ways to increase the size of a web corpus using BootCaT. First, one can increase the number of returned pages for a particular query. We set the limit at 20, as anything ...

9

The Challenges and Joys of Analysing Ongoing Language Change in Web-based Corpora: a Case Study

... Researchers of language variation and change often need to go to great lengths to find sufficient data, particularly when they shall be used for a sound statistical analysis of the phenomenon in question. The recent ...

8

Investigating the distribution of some (but not all) implicatures using corpora and web-based methods

... scalar implicatures such as those from some to not all, is that they fall into the class of GCIs and as such, constitute a homogeneous class of highly regularized and context-independent implicatures. This paper reports ...

55
