The TERRE-ISTEX project aims to identify scientific research dealing with specific geographical territories, based on heterogeneous digital content available in scientific papers. The project is divided into three main work packages: (1) identification of the periods and places of the empirical studies reflected in the publications drawn from the analyzed text samples, (2) identification of the themes that appear in these documents, and (3) development of a web-based geographical information retrieval (GIR) tool. The first two actions combine Natural Language Processing patterns with text mining methods. The integration of the spatial, thematic and temporal dimensions in a GIR contributes to a better understanding of what kind of research has been carried out, of its topics, and of its geographical and historical coverage. Another original aspect of the TERRE-ISTEX project is the heterogeneous character of the corpus, which includes PhD theses and scientific articles from the ISTEX digital libraries and the CIRAD research center.
were either one character or more than 15 words long. As a step towards finding canonical names, we automatically detected abbreviations and their expanded forms from the full text of papers by searching for text between parentheses and considering the phrase before the parentheses as the expanded form (similar to (Schwartz and Hearst, 2003)). We obtained a high-precision list by picking the most frequently occurring pairs of abbreviations and their expanded forms, and created groups of phrases by merging all the phrases that use the same abbreviation. We then changed all the phrases in the extracted-phrases dataset to their canonical names.
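The parenthesis-based heuristic can be sketched as follows. This is a simplification in the spirit of Schwartz and Hearst (2003), not the authors' exact implementation: the regex, the word window, and the initial-matching check are illustrative assumptions.

```python
import re
from collections import Counter

def extract_abbreviation_pairs(text):
    """Find (abbreviation, expanded form) candidates: an all-caps token in
    parentheses, with the preceding words taken as the expanded form."""
    pairs = []
    # An abbreviation candidate: 2-10 uppercase letters inside parentheses.
    for match in re.finditer(r'\(([A-Z]{2,10})\)', text):
        abbr = match.group(1)
        before = text[:match.start()].split()
        # Take len(abbr) preceding words as the candidate expansion and
        # require its initials to match the abbreviation.
        candidate = before[-len(abbr):]
        initials = ''.join(w[0].upper() for w in candidate)
        if initials == abbr:
            pairs.append((abbr, ' '.join(candidate)))
    return pairs

corpus = ("We train a Conditional Random Field (CRF) model. "
          "The Conditional Random Field (CRF) outperforms the "
          "Support Vector Machine (SVM) baseline.")
counts = Counter(extract_abbreviation_pairs(corpus))
# Keep the most frequent expansion per abbreviation as its canonical form.
canonical = {}
for (abbr, expansion), _ in counts.most_common():
    canonical.setdefault(abbr, expansion)
print(canonical)
```

Picking only the top-occurring pairs, as the authors describe, trades recall for precision: a rare mismatched pair is simply never promoted to canonical status.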
As a future goal, we aim to detect bibliographical reference zones in PDF files, and not only in structured (XML/TEI) or semi-structured files. Since our work will be introduced as a new feature of the open-source software BILBO, taking PDF files directly as input would be practical, saving the time and work of converting files, not to mention the cost of tools that convert files from PDF to XML/TEI. We could also use a machine learning technique such as Conditional Random Fields (CRFs) for labeling reference zones after the references have been detected by the SVM model. With CRFs, we can reduce the errors made by the SVM model.
Sources and consequences of uncertainties concerning the environmental fate of nanoparticles were discussed in review papers but not in research papers. The main source of these uncertainties and knowledge gaps is the lack of available analytical methods to separate, characterize, and detect engineered nanoparticles in environmental media at environmental concentrations [54,56,59-61]. One of the biggest challenges is to separate and characterize the small amount of engineered nanoparticles in environmental matrices, which contain high amounts of highly heterogeneous natural nanoparticles. Because of the lack of analytical methods, the form of nanoparticles at release and the surface properties transformed and aged in the environment are currently not known [8,56,60]. Therefore, important factors in the fate of nanoparticles in the environment, namely the exact surface properties and the aggregation state, are still unclear, and the characteristics of the chosen bare or coated nanoparticles may not be relevant under environmental conditions. This source of uncertainty is also relevant for other fields of environmental nanoparticle research, e.g., ecotoxicology. Likewise, the nanoparticles (with or without coatings) chosen for toxicity tests may not be relevant under environmental conditions. Another problem in toxicity tests is caused by the high concentrations of salts and nanoparticles used in the tests: these may induce aggregation of the particles, which may alter their behavior towards the test organisms. As a consequence of these uncertainties and knowledge gaps, quantitative risk assessment, regulation, and management concerning engineered nanoparticles are still based on modeling data or studies in model systems.
Table 4 Mentioned uncertainties in selected review papers
Current domain-specific information extraction systems represent an important resource for biomedical researchers, who need to process vast amounts of knowledge in a short time. Automatic discourse causality recognition can further reduce their workload by suggesting possible causal connections and aiding in the curation of pathway models. We here describe an approach to the automatic identification of discourse causality triggers in the biomedical domain using machine learning. We create several baselines and experiment with various parameter settings for three algorithms, i.e., Conditional Random Fields (CRF), Support Vector Machines (SVM) and Random Forests (RF). We also evaluate the impact of lexical, syntactic and semantic features on each of the algorithms and analyze the errors. The best performance of 79.35% F-score is achieved by CRFs when using all three feature types.
separately. Among these fields, Paper ID is the unique identifier of a paper in CiteseerX (DOI), and Doc ID can be used to quickly locate the relevant files in the data set. The text field is extracted from the full-text files, but we only index the first 400 words instead of the whole text: the purpose of indexing the text is to find possible titles, and titles usually appear in the first few lines. Moreover, if we include too much of a paper's text (especially the trailing part), other paper titles may be included, e.g., cited paper titles listed in the references. The paper's full text needs to be normalized before being added to the Lucene index. We only perform lightweight normalization here, lowercasing all English letters and deleting punctuation. For index searching, we use an efficient Lucene directory implementation, MMapDirectory, to search the full-text field of the index. We also provide a web-based search interface over our indexed CiteSeerX data set, built with the Django web framework.
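The lightweight normalization step (lowercase, strip punctuation, keep the first 400 words) could be sketched as below. The function name and sample text are illustrative; the actual pipeline performs this before adding documents to the Lucene index.

```python
import string

def normalize_for_index(full_text, max_words=400):
    """Lightweight normalization before indexing: lowercase, strip
    punctuation, and keep only the first `max_words` words, since
    titles tend to appear near the top of a paper."""
    # Delete ASCII punctuation, then lowercase.
    table = str.maketrans('', '', string.punctuation)
    cleaned = full_text.translate(table).lower()
    return ' '.join(cleaned.split()[:max_words])

sample = "A Survey of Deep Learning!  (Draft, 2015)\nAbstract: We review..."
print(normalize_for_index(sample))
```

Truncating to 400 words also keeps the index small and, as noted above, avoids indexing cited titles from the reference section as if they belonged to the paper itself.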
In order to evaluate the detailed behaviour of our proposed system, we examined the data whose annotation results differ between the baseline system and the proposed system. We confirmed that there are cases in which new parameter terms that did not exist in the training data are extracted. For example, "temperature of (MeCp)2Mn" did not appear at all in the training data, and could not be annotated by the CRF in the test data before adding the physical quantities list. After marking "temperature" as a parameter, the CRF was able to find "temperature of (MeCp)2Mn" by learning a compound-term construction rule (i.e., PAR of CM may be a parameter term). On the other hand, merging parameters into one category and then classifying them seems to be effective, because we can make use of larger training data. For example, "partial pressure" (where "pressure" is in the physical quantities list) did not exist in the training data. It was recognized neither by the baseline system nor by the suggested system without parameter classification; however, when we merged the parameters into one category (allowing larger training data for the identification of parameters), the suggested system with parameter classification was able to identify this entity. Another example is "period of the mask openings", where "period" is in the physical quantities list.
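The gazetteer idea above, marking tokens found in the physical quantities list so that the CRF can learn compound constructions such as "PAR of CM", might be encoded as a token feature function along these lines. The feature names and the tiny quantities list are hypothetical, not the authors' actual feature set.

```python
# Hypothetical excerpt of a physical quantities gazetteer.
PHYSICAL_QUANTITIES = {"temperature", "pressure", "period"}

def token_features(tokens, i):
    """Feature dict for token i, in the style used by CRF toolkits.
    'in_pq_list' encodes membership in the physical quantities list,
    letting the model generalize to unseen compounds such as
    'temperature of (MeCp)2Mn' or 'partial pressure'."""
    word = tokens[i]
    return {
        'word.lower': word.lower(),
        'in_pq_list': word.lower() in PHYSICAL_QUANTITIES,
        # A following 'of' hints at the PAR-of-CM compound pattern.
        'next_is_of': i + 1 < len(tokens) and tokens[i + 1].lower() == 'of',
    }

tokens = "the temperature of (MeCp)2Mn was varied".split()
print(token_features(tokens, 1))
```

Because the gazetteer flag fires for any listed quantity word, the learned compound rule transfers to parameter terms whose head word never occurred in training, which is exactly the behavior reported above.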
Automatic term extraction is an important component of many natural language processing systems. It is often used in applications such as knowledge discovery, knowledge management and automatic text indexing. Many studies have been conducted on the recognition of terminology in research papers, especially in the biological domains, where new terms are created constantly. These studies primarily focused on the extraction of domain-specific concepts such as nouns and collocations. However, it has not been possible to apply general rules to extract all the terms.
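As a minimal illustration of collocation-based term extraction, one of the simplest strategies alluded to above, recurring word bigrams can serve as candidate multi-word terms. Real systems add part-of-speech patterns and statistical association measures; this sketch and its example text are purely illustrative.

```python
import re
from collections import Counter

def candidate_terms(text, min_freq=2):
    """Treat alphabetic-word bigrams that recur at least `min_freq`
    times as candidate multi-word terms."""
    words = re.findall(r'[A-Za-z]+', text.lower())
    bigrams = Counter(zip(words, words[1:]))
    return [' '.join(bg) for bg, c in bigrams.items() if c >= min_freq]

text = ("Gene expression is regulated. Gene expression profiling "
        "measures gene expression across conditions.")
print(candidate_terms(text))
```

The limitation the passage points out shows up immediately: frequency alone over-generates on common word pairs and misses rare but valid terms, which is why no single general rule extracts all terms.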
a statement of the authors' anticipated knowledge gain. This is shown in examples (1) and (2) in Table 1. New Knowledge: A relation or event is considered as New Knowledge if it corresponds to a novel research outcome resulting from the work the author is describing, as per examples (3) and (4) in Table 1. Whereas the value assigned to each of the core MK dimensions of Thompson et al. is completely independent of the values assigned to the other core dimensions, our newly introduced dimensions do not maintain this independence. Rather, Research Hypothesis and New Knowledge possess the property of mutual exclusivity, as an event or relation cannot simultaneously be both a Research Hypothesis and New Knowledge. We chose to enrich two different corpora with attributes encoding Research Hypothesis and New Knowledge, i.e., a subset of the biomolecular interactions annotated as events in the GENIA-MK corpus, and the biomarker-relevant relations involving genes, diseases and treatments in the EU-ADR corpus. Leveraging the previously added core MK annotations in the GENIA-MK corpus, we explored how these can contribute to the accurate recognition of New Knowledge and Research Hypothesis. Specifically, we have introduced new approaches for predicting the values of the core Knowledge Type and Knowledge Source dimensions, demonstrating an improvement over the former state of the art for Knowledge Type. We subsequently use supervised methods to automatically detect New Knowledge and Research Hypothesis, incorporating the values of Knowledge Type, Knowledge Source and Uncertainty as features into the trained models.
Plant Protection Science (PPS). PPS is a major international scientific journal covering plant disease, pest and weed control research in the Czech Republic (CZ); it is published by the Czech Academy of Agricultural Sciences (http://www.cazv.cz) at the Institute of Agricultural and Food Information (Slezská 7, 120 56 Prague 2, Czech Republic, http://www.uzpi.cz) and is financed by the Ministry of Agriculture of the Czech Republic. The journal is published quarterly. The abstracts from this journal are included in the AGRIS/FAO database, the Phytomed database, the BIOSIS Previews database, the CAB Abstracts database and the Czech Agricultural and Food Bibliography. The journal publishes original scientific papers, short communications and reviews together with book reviews, proceedings and other items. Since 1998, a plant health care glossary has been issued step by step as a supplement to the journal. Terms from the major fields of plant health and their definitions are given in this glossary in both English and Czech.
indicates the key phrases frequently used as indicators of the common rhetorical roles of sentences (e.g., phrases such as "We agree with court" or "Question for consideration is"). In this study, we encoded this information and automatically generated explicit linguistic features. Feature functions for these rules are set to 1 if they match words/phrases in the input sequence exactly. Named entity recognition: this type of recognition has not been considered fully in summarizing scientific articles (Teufel & Moens, 2002). In our work, however, we included a few named entities such as Supreme Court and Lower Court, and generated binary-valued entity-type features that take the value 0 or 1 to indicate the presence or absence of a particular entity type in a sentence.
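The binary cue-phrase and entity-type features described above might look like the following sketch. The two cue phrases come from the text; the entity surface forms, feature names, and example sentence are illustrative assumptions.

```python
# Cue phrases from the text; entity surface forms are illustrative.
CUE_PHRASES = ["we agree with court", "question for consideration is"]
ENTITY_TYPES = {
    "SUPREME_COURT": ["supreme court"],
    "LOWER_COURT": ["lower court"],
}

def sentence_features(sentence):
    """Binary features: 1 if a cue phrase matches the sentence exactly,
    and a 0/1 flag for the presence of each entity type."""
    s = sentence.lower()
    feats = {f'cue:{p}': int(p in s) for p in CUE_PHRASES}
    for etype, surface_forms in ENTITY_TYPES.items():
        feats[f'ent:{etype}'] = int(any(f in s for f in surface_forms))
    return feats

print(sentence_features(
    "The Supreme Court held that the question for consideration is moot."))
```

Keeping the features binary, as the passage specifies, makes them directly usable as indicator feature functions in a sequence labeler or linear classifier.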
SAVE Science (Ketelhut et al., 2010; Ketelhut et al., 2009; Ketelhut et al., 2012) is a novel project for evaluating students' understanding of the scientific method — problem identification, gathering data, analyzing data, developing a hypothesis, and communicating results — by asking students to solve a mystery in a virtual world through the application of the scientific method to a content-based problem. Using immersive virtual environments for assessments is a current area of focus among education researchers (Clarke-Midura, 2010); SAVE Science is unique in its attempt to assess understanding of both inquiry and content. That is, the test is designed to assess students' ability to apply their knowledge of the scientific inquiry processes to a problem they have never seen before, but within a content area they have just studied. To be successful, students must explore a virtual environment, collect appropriate data about it, and find evidence that supports their inference about the cause of the mystery. Part of the reasoning for a particular conclusion draws on scientific knowledge learned in the classroom, but for these mysteries such knowledge of scientific content is insufficient. Students must also be able to explore the virtual world and create a hypothesis about the cause of the problem, based on their observations and analysis of collected data.
Generally speaking, automatic related work section generation is a strikingly different problem and much more difficult than general multi-document summarization tasks. For example, multi-document summarization of news articles aims at synthesizing the content of similar news items and removing the redundant information contained in different news articles. However, each scientific paper has much content specific to its own work and contribution. Even for papers that investigate the same research topic, the contributions and contents can be totally different. The related work section generation task needs to find the specific contributions of individual papers and arrange them into one or several paragraphs.
• Knowledge Base: The automatic watering system for these plants uses several rules covering conditions likely to occur in the plants being controlled. There is actually no limit on the number of such rules or statements; the more rules are defined, the more precise and detailed the designed tool becomes. Table 4.1 below lists the rule statements for the automatic watering system, which uses a fuzzy logic controller with 15 rules.
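A two-rule miniature of such a fuzzy controller is sketched below. The full system uses 15 rules; the membership ranges, rule conditions, and output durations here are hypothetical, chosen only to show the mechanism.

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def watering_duration(soil_moisture, temperature):
    """Two illustrative Mamdani-style rules (hypothetical ranges):
    R1: IF soil is dry AND temp is hot THEN water long  (60 s)
    R2: IF soil is wet                 THEN water short (10 s)"""
    dry = tri(soil_moisture, 0, 20, 50)
    wet = tri(soil_moisture, 40, 80, 101)
    hot = tri(temperature, 25, 35, 45)
    r1 = min(dry, hot)  # AND = min of the antecedent memberships
    r2 = wet
    if r1 + r2 == 0:
        return 0.0
    # Weighted-average defuzzification over the two rule outputs.
    return (r1 * 60 + r2 * 10) / (r1 + r2)

print(watering_duration(soil_moisture=15, temperature=35))
```

Each additional rule refines the control surface in the same way, which is why the passage notes that more rules make the tool more precise and detailed.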
We present two complementary annotation schemes for sentence-based annotation of full scientific papers, CoreSC and AZ-II, which have been applied to primary research articles in chemistry. The AZ scheme is based on the rhetorical structure of a scientific paper and follows the knowledge claims made by the authors. It has been shown to be reliably annotated by independent human coders and has proven useful for various information access tasks. AZ-II is its extended version, which has been successfully applied to chemistry. The CoreSC scheme takes a different view of scientific papers, treating them as the humanly readable representations of scientific investigations. It therefore seeks to retrieve the structure of the investigation from the paper as generic high-level Core Scientific Concepts (CoreSC). CoreSCs have been annotated by 16 chemistry experts over a total of 265 full papers in physical chemistry and biochemistry. We describe the differences and similarities between the two schemes in detail and present the two corpora produced using each scheme. There are 36 shared papers in the corpora, which allows us to quantitatively compare aspects of the annotation schemes. We show the correlation between the two schemes, their strengths and weaknesses, and discuss the benefits of combining a rhetorical analysis of the papers with a content-based one.
Survey research shows that cancer has become a major problem affecting human health worldwide. Among all types of cancer, lung cancer mortality is higher than that of any other type. The 2015 global cancer statistics were published in the journal CA on February 4, 2015. It was estimated that in 2012 about 8.2 million patients worldwide died of cancer and 14.1 million new cancer cases were diagnosed, with more cancer patients and cancer deaths in developing countries than in developed countries. At present, lung cancer is the leading cause of cancer death in male patients. In recent years, owing to extreme urban pollution, a substantial increase in the smoking population, and the fact that lung cancer is difficult for doctors to detect early and difficult to cure at late stages, the incidence of lung cancer in the population is increasing year by year.
The idea for this project comes from our experience with the XMM-Newton database interface, which we operate in Strasbourg. XMM-Newton is an X-ray space observatory that has been in orbit since 2000, working in pointed observation mode. Observations are requested by guest observers, who get proprietary access to the data for one year. After this period, the data are made public. Every year or so, all data are compiled into a catalog merging all source detections and linking them with other scientific products. The whole catalog is accessible through different interfaces, among them the XCatDB deployed in Strasbourg by the XMM-Newton Science Survey Consortium. In addition to source parameters, XCatDB users can take advantage of access to technical data such as filter responses, energy band definitions and so on. This kind of information is spread out over various documents, such as the XMM user documentation or Web pages. Although these documents are publicly available, they would be more valuable if they were automatically connected with a database user interface.
Electronic learning and Virtual Reality can be applied to Biology in multiple modalities. These include CAI-type modalities within the “HSEA” computer application, PowerPoint slide presentations and video presentations, as well as Internet-based ones: YouTube videos, Chat Relay, e-mail, discussion forum sites and Video Conferencing, with numerous examples of learning units, lessons and courses delivered through E-learning and VR. The recommended didactical models are the technocentric ones, based on technique and virtual reality, or the technocentric model combined with the empiriocentric model of rediscovering biological concepts; the basic didactical strategy is computer-focused, or combined (computer-focused and heuristic), with a methodology based on didactical methods such as CAI, simulation, experiment, observation, conversation and modelling. Numerous functions, roles and tasks of the computer have been identified; by taking on many roles and tasks of the biology teacher, it could be called the “teacher” computer. E-learning and VR have many advantages, as well as disadvantages and limitations, but also ways to counteract the latter. E-learning and Virtual Reality bring a cognitive as well as an affective gain, contributing to bioethical education, ecological education, health education and first-aid education as aspects of Biology-specific education, as demonstrated by the results of the research assessment of E-learning and Virtual Reality.
The scientific research has achieved its purpose and objectives, and its hypotheses have been confirmed. Numerous recommendations concerning E-learning and Virtual Reality, correlated with the presented paper, have been made with a view to making biological education more efficient, such as: discussions and debates on Chat Relay, on site forums and by Video Conferencing with pupil and student groups; organizing direct, team-based sociocentric activities integrated into the E-learning lesson to prevent dehumanization of the educational process; and alternating E-learning activities with others applying the techniques specific to the study of Biology, so that the psychomotor objectives set by the biology school curricula in force can be achieved. “New approaches are emerging, with increasing reliance on “blended models” that combine elements of face-to-face education and online modalities in different proportions” (OECD, 2015).
With the aim of supporting automated analysis and thus easier access to this information, we have generated a multi-layered annotated corpus of scientific discourse. In this article, after introducing our multi-layered scientific annotation schema, we describe the way we collaboratively annotate our corpus so as to create its gold-standard version. Corpus annotations have been provided in two stages. In the first stage (Fisas et al., 2015), annotators were asked to characterize the argumentative structure of papers by assigning each sentence to one of 5 categories (Challenge, Background, Approach, Outcome and Future Work), optionally specifying for Challenge sentences a subcategory (Goal or Hypothesis) and distinguishing, among the Outcome sentences, those that describe the authors' Contribution. Based on the work of Liakata et al. (2010) and Teufel (2010), we developed an annotation schema and produced an annotated corpus. Its quality was evaluated in terms of inter-annotator agreement (K=0.66), comparable to the values attained by the aforementioned researchers.
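The reported inter-annotator agreement (K=0.66) is a kappa statistic; for two annotators over the same sentences, Cohen's kappa can be computed as below. The example labels are invented for illustration, not taken from the corpus.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance from each annotator's label distribution."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the product of per-label marginals.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["Approach", "Background", "Outcome", "Approach", "Challenge", "Approach"]
b = ["Approach", "Background", "Approach", "Approach", "Challenge", "Outcome"]
print(round(cohens_kappa(a, b), 3))
```

A value of 0.66 is conventionally read as substantial agreement, consistent with how the passage compares it to the results of Liakata et al. (2010) and Teufel (2010).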