The experiment compared automated and manual reuse of arbitrarily selected open corpus content through a real-life user trial. A simple online e-assessment application (available in English, Spanish and French), built for this experiment, presented users with traditional gap-filler grammar exercises built from different sets of open corpus resources converted (manually or automatically) into grammatical e-assessment exercises. Verb chunks conjugated in specified tenses were removed and replaced by gaps, which users had to fill according to the particular infinitives and tenses specified for each gap. The answers provided were compared to the original verb chunks and users were assigned a score for each grammar point.
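The gap-and-score mechanism described above can be sketched in a few lines. This is a minimal illustration, not the experiment's actual implementation; all function names and the sample data are assumptions.

```python
def make_exercise(sentence, verb_chunks):
    """Replace each annotated verb chunk with a numbered gap.

    `verb_chunks` maps a surface form to its (infinitive, tense)
    prompt, e.g. {"has eaten": ("to eat", "present perfect")}.
    Returns the gapped sentence and the prompts shown to the user.
    """
    gapped = sentence
    prompts = []
    for i, (chunk, (infinitive, tense)) in enumerate(verb_chunks.items(), 1):
        gapped = gapped.replace(chunk, f"___({i})___", 1)
        prompts.append((i, infinitive, tense, chunk))
    return gapped, prompts

def score(prompts, answers):
    """Compare user answers to the original verb chunks; one point per match."""
    return sum(1 for (i, _inf, _tense, chunk) in prompts
               if answers.get(i, "").strip().lower() == chunk.lower())

gapped, prompts = make_exercise(
    "She has eaten lunch and will call you later.",
    {"has eaten": ("to eat", "present perfect"),
     "will call": ("to call", "future simple")})
# gapped → "She ___(1)___ lunch and ___(2)___ you later."
# score → 1 (only the first gap matches the original chunk)
points = score(prompts, {1: "has eaten", 2: "will phone"})
```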
First of all, for a subject area like SQL, a wide range of content is openly available on the World Wide Web, as is true for a great many subject areas. However, narrowing the subject of interest to very specific, almost niche areas could affect the overall behaviour of the system; this needs further research. Because both domain experts (to create the domain ontology) and subject-experienced annotators were available, the area of SQL provided a good starting point for testing all aspects of the system. Changing the scope, however, would require a generalised process for ontology creation and annotation generation. Additionally, no validation of annotations was performed, which assumes that the annotators gave reliable information. Changing the scope of the domain might require changing this model, either by using expert users (as opposed to experienced users) or by using large crowds of annotators with the facility to vote for the most popular annotation values per document.
Aim and Hypothesis: As discussed in section 2, in order to supply long-tail niche requirements, a content production service should guarantee the provision of i) large volumes of content, ii) at low production cost, iii) suitable for the arbitrary activities performed. Although the first condition is necessarily achieved through the selection of an open corpus reuse strategy, the performance of a slicer with respect to the two other conditions is yet to be examined. For this reason, this evaluation focuses, as a first step, upon investigating the suitability and cost of content produced automatically by a slicing system with respect to content manually produced for the same activity. This initial experiment is performed upon a sample of independently selected niche requirements and open corpus resources. Positive results would indicate that such a technique could also scale to any niche requirement using any open corpus content, while any issues detected at this early stage would only be amplified within a large-scale deployment, making scaling pointless. Additionally, since production cost depends upon local prices of manual labor as well as individual server specifications, an estimate of production time was considered a better proxy for production cost differences. For this reason, the time required to produce content batches was measured for both the educators and the slicer. The hypotheses tested within this evaluation are therefore as follows:
specifically for the purpose of this experiment. Within the context of this paper, this educational application serves only as a "content reuse vehicle" for evaluating the slicer (i.e. educational aspects are not discussed). The application represents a well-known approach to grammatical teaching, via a Computer Adaptive Testing (CAT) tool, similar in nature to the item response assessment system introduced by Lee et al. For this reason, it provides a good example of the kind of educational application which could avail of third-party content. In this application, users are presented with series of individual open corpus resources (involving no re-composition), repurposed (manually or automatically) as traditional gap-filler exercises. Verb chunks are removed and replaced by gaps, which users must fill according to the particular infinitives and tenses specified for each gap. Answers provided are then compared to the original verb chunks and users are assigned a score for each specific grammar point. Slicing open corpus content is known to affect the reading flow and annotation quality of the content produced. An evaluation performed within the context of a language-learning task (by individuals with varying levels of competency in the English language) should hence make participants very sensitive to the reading-flow properties of content. Moreover, since grammar e-assessment units are created from verb chunk annotations, any annotation quality issue would provoke major assessment errors, making any training over such content pointless and time-consuming for users. Suitability, within the context of this use case, hence refers to the ability to correctly assess an individual.
Countries such as the UK, USA, Canada and Australia are increasingly investing in national repositories to encourage the reuse of digital content. Examples of such initiatives include the ARIADNE Foundation (European Union) [ARIADNE], Education Network Australia Online (Australia) [EdNA], eduSource (Canada) [eduSource], BELLE (Canada) [BELLE], Multimedia Educational Resources for Learning and Online Teaching and the Gateway to Educational Materials (USA) [MERLOT], and the National Institute of Multimedia Education (Japan) [NIME]. Open corpus learning content can exist in numerous formats. It can be wrapped and tagged as a learning object or as an entire course; it can take the form of video, audio or graphical streaming; it can be in the form of documents, PDF files or PowerPoint presentations. Any text, regardless of its granularity, from single words up to entire pages, also constitutes open corpus content.
Some educational hypermedia systems, such as KBS Hyperbook and SIGUE, allow the incorporation of content sourced from the WWW. However, leveraging such content for use within these systems requires significant manual effort on the part of a domain expert or course designer in advance of incorporation. In KBS Hyperbook, the content must be sourced, which is not a trivial task, and then tagged using the LOM metadata schema and associated with domain model concepts to enable integration. SIGUE similarly requires metadata generation for all new content and manual integration with a domain model. Knowledge Sea II also allows the integration of open corpus content; however, to add an open corpus resource to the system, a comparative analysis must be conducted between the new resource and every existing resource in the collection, using a keyword-based similarity metric. Knowledge Sea II employs self-organizing topic maps to categorize content and create a linked flow between resources. While these AH systems are attempting to utilize open corpus content, none have yet attempted to implement the automated harvesting and use of open corpus knowledge models.
However, there are a number of issues which restrict, and in some cases prevent, such content from being discovered, harvested and reused. One of the main issues is the "discoverability" of such content. The vast quantities of content available via the World Wide Web are seldom organised and vary in purpose, topic, format and exposure method. How content is stored can dictate how accurately machines can classify and interpret it, which in turn determines how accurately content requirements can be matched to suitable candidate content. Several methods of addressing these issues have been identified and are discussed in this paper, such as content aggregation, automatic metadata generation, semantic and vocabulary mapping, and web crawling. However, a scenario has yet to be created whereby all these methods are united and evolved into a content discovery and harvesting mechanism that can deliver open corpus content to eLearning environments while accurately matching learning content requirements.
Traditional educational hypertext systems are based upon the generation of links and anchors between content objects. However, the dynamic incorporation of open corpus educational content in eLearning requires the generation of relationships between educational concepts and hypertext documents. One approach to creating this overlay between concept and content is to use a Mindmap interface that allows learners to explore and associate hypertext content with knowledge maps of their own creation. This paper presents the Open Corpus Content Service (OCCS), a framework that uses the hypertext structure of the WWW to provide methods of educational content discovery and harvesting. The OCCS semantically examines linked content both on the WWW and in digital content repositories, and creates concept-specific caches of content. The paper also introduces U-CREATe, a novel user-driven Mindmap interface for supporting the exploration and assembly of content cached by the OCCS in a pedagogically meaningful manner. The combination of these systems benefits both the educator and the learner, empowering the learner through ownership of the educational experience and allowing the educator to focus on the pedagogical design of educational offerings rather than content authoring.
As educators attempt to incorporate the use of educational technologies in course curricula, the lack of appropriate and accessible digital content resources acts as a barrier to adoption. Quality educational digital resources can prove expensive to develop and have traditionally been restricted to use in the environment in which they were authored. As a result, educators who wish to adopt these approaches are compelled to produce large quantities of high quality educational content. This can lead to excessive workloads being placed upon the educator, whose efforts are better exerted on the pedagogical aspects of eLearning design. The accessibility, portability, repurposing and reuse of digital resources thus became, and remain, major challenges. The key motivation of this research is to enable the utilisation of the vast amounts of accumulated knowledge and educational content accessible via the World Wide Web in Technology Enhanced Learning (TEL). This thesis proposes an innovative approach to the targeted sourcing of open corpus content from the WWW and the resource-level reuse of such content in pedagogically beneficial TEL offerings. The thesis describes the requirements, both educational and technical, for a tool-chain that enables the discovery, classification, harvesting and delivery of content from the WWW, and a novel TEL application which demonstrates the resource-level reuse of open corpus content in the execution of a pedagogically meaningful educational offering. Presented in this work are the theoretical foundations, design and implementation of two applications: the Open Corpus Content Service (OCCS); and the User-driven Content Retrieval, Exploration and Assembly Toolkit for eLearning (U-CREATe). 
To evaluate and validate this research, a detailed analysis of the different aspects of the research is presented, outlining and addressing the discovery, classification and harvesting of open corpus content from the WWW and open corpus content utilisation in TEL. This analysis provides confidence in the ability of the OCCS to generate collections of highly relevant open corpus content in defined subject areas. The analysis also provides confidence that the resource-level reuse of such content in educational offerings is possible, and that these educational offerings can be pedagogically beneficial to the learner. A novel approach to the sourcing of open corpus educational content for integration and reuse in TEL is the primary contribution to the State of the Art made by this thesis and the research described therein. This approach differs significantly from those used by current TEL systems in the creation of learning offerings and provides a service which is considerably different to that offered by general purpose web search engines.
Natural language inference (NLI) is a key part of natural language understanding. The NLI task is defined as the decision problem of whether a given sentence, the hypothesis, can be inferred from a given text. Typically, the text consists of just a single premise/single sentence, which is called the single premise entailment (SPE) task. Recently, the derived task of NLI from multiple premises (MPE) was introduced together with a first annotated corpus and several corresponding strong baselines. Nevertheless, further development in the MPE field requires the availability of huge amounts of annotated data. In this paper we introduce a novel method for rapidly deriving MPE corpora from existing NLI (SPE) annotated data that requires no additional annotation work. The proposed approach is based on using an open information extraction system. We demonstrate the application of the method on the well-known SNLI corpus. Over the obtained corpus, we provide first evaluations and state a strong baseline.
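The core transformation can be sketched as follows: a single SPE premise is decomposed into several shorter premises, while the hypothesis and label are carried over. The toy clause splitter below is purely an assumed placeholder for the open information extraction system the paper uses; function and field names are likewise illustrative.

```python
def derive_mpe(premise, hypothesis, label, extract=None):
    """Turn one SPE example into an MPE example by splitting the
    premise into multiple premises. A real pipeline would delegate
    `extract` to an open information extraction system; here a toy
    clause splitter on "and"/commas stands in."""
    if extract is None:
        extract = lambda s: [c.strip() for c in
                             s.replace(" and ", ", ").split(",") if c.strip()]
    return {"premises": extract(premise),
            "hypothesis": hypothesis,
            "label": label}

ex = derive_mpe("A man is cooking and a dog sleeps nearby.",
                "An animal is resting.", "entailment")
# ex["premises"] → ["A man is cooking", "a dog sleeps nearby."]
```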
Abstract. LHC experiments make extensive use of web proxy caches, especially for software distribution via the CernVM File System and for conditions data via the Frontier Distributed Database Caching system. Since many jobs read the same data, cache hit rates are high and hence most of the traffic flows efficiently over Local Area Networks. However, it is not always possible to have local web caches, particularly for opportunistic cases where experiments have little control over site services. The Open High Throughput Computing (HTC) Content Delivery Network (CDN), openhtc.io, aims to address this by using web proxy caches from a commercial CDN provider. Cloudflare provides a simple interface for registering DNS aliases of any web server and does reverse proxy web caching on those aliases. The openhtc.io domain is hosted on Cloudflare's free tier CDN which has no bandwidth limit and makes use of data centers throughout the world, so the average performance for clients is much improved compared to reading from CERN or a Tier 1. The load on WLCG servers is also significantly reduced. WLCG Web Proxy Auto Discovery is used to select local web caches when they are available and otherwise select openhtc.io caching. This paper describes the Open HTC CDN in detail and provides initial results from its use for LHC@Home and USCMS opportunistic computing.
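The proxy auto-discovery behaviour described above amounts to a simple fallback policy: try each local web cache, and only if none is reachable fall back to the CDN alias. The sketch below is an assumed simplification of WLCG Web Proxy Auto Discovery (which in reality serves a PAC file); the proxy URLs are illustrative, not actual openhtc.io endpoints.

```python
import socket
from urllib.parse import urlparse

def choose_proxy(local_proxies, cdn_proxy="http://cms.openhtc.io"):
    """Prefer a reachable local squid cache; otherwise fall back to
    CDN-backed caching. URLs here are illustrative placeholders."""
    for proxy in local_proxies:
        parsed = urlparse(proxy)
        host, port = parsed.hostname, parsed.port or 3128
        try:
            # Cheap TCP reachability check on the proxy port.
            with socket.create_connection((host, port), timeout=2):
                return proxy
        except OSError:
            continue
    return cdn_proxy

# Opportunistic site with no local squid: falls back to the CDN alias.
proxy = choose_proxy([])
```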
First, the frequencies of all the nodes in the query graph are determined independently, using only the restrictions on the nodes themselves, i.e. restrictions on word form, lemma, part of speech, word class, and type of incoming or outgoing dependency relations. The nodes are then sorted by frequency in ascending order. For the least frequent node, all matching corpus positions and sentence IDs are determined using a CQP query. The sentence IDs and corpus positions are stored in a data structure, and the active corpus in the Corpus Workbench is limited to the sentences that matched the query. The procedure is repeated on the restricted subcorpus for all remaining nodes in the query graph. By always limiting the corpus to the results of the previous query, the sentence IDs of the last query represent all sentences in the corpus that match all the node restrictions in the query graph. Beginning this process with the least frequent node minimizes subcorpus size from the start.
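The frequency-ordered restriction loop above can be sketched without CQP by modelling the corpus as a map from sentence IDs to tokens and each node restriction as a token predicate. This is a simplified sketch under those assumptions, not the Corpus Workbench implementation.

```python
def match_sentences(corpus, node_predicates):
    """Frequency-ordered node matching.

    `corpus` maps sentence IDs to token lists; each predicate in
    `node_predicates` tests a single token (node restrictions only,
    no edge constraints). Returns the IDs of sentences satisfying
    every node restriction.
    """
    # 1. Estimate each node's frequency independently on the full corpus.
    def freq(pred):
        return sum(any(pred(tok) for tok in sent) for sent in corpus.values())

    # 2. Sort nodes by frequency, least frequent first.
    ordered = sorted(node_predicates, key=freq)

    # 3. Repeatedly restrict the active subcorpus to matching sentences,
    #    so each later query runs on a smaller corpus.
    active = set(corpus)
    for pred in ordered:
        active = {sid for sid in active
                  if any(pred(tok) for tok in corpus[sid])}
    return active

corpus = {1: ["the", "cat", "sleeps"],
          2: ["a", "cat", "runs"],
          3: ["dogs", "run"]}
hits = match_sentences(corpus, [lambda t: t == "cat",
                                lambda t: t.endswith("s")])
# hits → {1, 2}
```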
Special thanks to Jorge Valeriano Assem and Elena Vera for their support of the social service program dedicated to the CIEMPIESS Project. We thank Professor Javier Cuétara for his advice on the creation of the automatic phonetizer and Alejandra Chavarria for her assistance in detecting bugs in it. We also thank Felipe Orduña for his criticism of the first version of this document and Ivan V. Meza for his observations and advice on the final version. Finally, we thank all of the social service students who helped in the creation of the corpus.
Since entity linking (EL) systems' results vary widely according to the corpora and to the annotation types needed by the user, we present a workflow that combines different EL systems' results so that the systems complement each other. Conflicting annotations are resolved by a voting scheme which had not previously been attempted for EL. Besides automatic entity selection, a measure of coherence helps users decide on the validity of an annotation. The workflow's results are presented in a UI that allows navigating a corpus using text search and entity facets. The UI helps users assess annotations via the measures displayed and via access to the corpus documents.
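A voting scheme over conflicting annotations can be as simple as a majority vote per mention. The sketch below is an assumed minimal version (the paper's workflow additionally uses a coherence measure, omitted here); names and the tie-breaking rule are illustrative.

```python
from collections import Counter

def vote(annotations):
    """Resolve conflicting entity annotations for one mention by
    majority vote across EL systems. `None` means a system produced
    no annotation; ties fall back to the first answer encountered
    (Counter preserves insertion order among equal counts)."""
    counts = Counter(a for a in annotations if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Three systems annotate the same mention "Paris":
winner = vote(["dbpedia:Paris", "dbpedia:Paris", "dbpedia:Paris_Hilton"])
# winner → "dbpedia:Paris"
```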
Mehdad (2010) used CrowdFlower for the creation of a bilingual Textual Entailment corpus for the English/Spanish language pair, taking advantage of an already existing monolingual English RTE corpus. Although the benefits of collecting corpora using crowdsourcing techniques are numerous (data is cheap, quick to acquire, and varied in nature), the approach carries risks such as quality control and workers who try to "game" the system.
for her help in the early stages of the project, and to the Finnish team, Graham Wilcock and Niklas Laxström, for their work on the WikiTalk application, and Hanna Kellokoski and Jani Koskinen for assisting in the data collection. We are extremely grateful to all the people who spread the word about the project and helped us in many ways to contact potential participants and organize the events. Moreover, we would like to thank the participating schools and their principals for accepting our data collection visits, and in particular the teachers for helping us before and during the actual events: Marjaana Aikio (Ivalo), Helmi Länsman (Utsjoki), and Mette Anti Gaup (Karasjok). The biggest thanks go to the participants themselves, who produced Wikipedia articles, read aloud three articles, and took part in the video conversations; without them the DigiSame corpus would not exist.
For the Wikipedia sections, a different preprocessing process was used. To extract a coherent, domain-specific part of Wikipedia we followed the established methodology and toolchain of Ytrestøl et al. (2009). For the Linux-related WDC sub-corpus, sub-domains are approximated by creating a seed set from relevant items in the Wikipedia category system. We then performed a connectivity analysis of the articles linked to by the seed set, and discarded those that are infrequently referenced. Once linguistic content had been extracted, we segmented the text into 'sentence' (or other top-level utterance) units. Again following Ytrestøl et al. (2009), for this first public release of the WDC we combined the open-source tokeniser package with a
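The connectivity analysis described above can be sketched as counting, for each candidate article, how often the seed set links to it, and keeping only those referenced at least a threshold number of times. This is a simplified sketch; the threshold, data shapes and names are assumptions, not the paper's actual parameters.

```python
from collections import Counter

def expand_seed(seed_articles, links, min_refs=2):
    """Connectivity-analysis sketch: keep articles that the seed set
    links to at least `min_refs` times.

    `links` maps an article title to the list of titles it links to;
    the threshold is illustrative.
    """
    inbound = Counter(target
                      for article in seed_articles
                      for target in links.get(article, []))
    return {t for t, n in inbound.items()
            if n >= min_refs and t not in seed_articles}

links = {"Linux": ["Kernel", "GNU", "Unix"],
         "Kernel_(OS)": ["Kernel", "Unix", "Scheduler"]}
kept = expand_seed({"Linux", "Kernel_(OS)"}, links)
# "GNU" and "Scheduler" are referenced only once and are discarded.
```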
The ever-increasing importance of machine learning in Natural Language Processing is accompanied by an equally increasing need for large-scale training and evaluation corpora. Due to its size, openness and relative quality, Wikipedia has already been a source of such data, but on a limited scale. This paper introduces the DBpedia Abstract Corpus, a large-scale, open corpus of annotated Wikipedia texts in six languages, featuring over 11 million texts and over 97 million entity links. The properties of the Wikipedia texts are described, as well as the corpus creation process, its format and interesting use cases, such as Named Entity Linking training and evaluation.
The original English TUNA corpora were collected in 2006 and used for multiple shared tasks (Gatt and Belz 2010) on Referring Expression Generation (REG) and other work in this area (van Deemter et al. 2012). Each corpus consists of REs produced by humans presented with a target item (1 or 2 pieces of furniture, or 1 or 2 people's faces) and a set of distractors (other pieces of furniture or faces) in a web-based elicitation paradigm. A Dutch TUNA study was conducted in 2011 (DTuna, Koolen et al. 2011) and an Arabic one in 2015 (Khan 2015).