A major challenge in data-driven TF function annotation is to minimize the impact of false bindings and to reliably extract gene function signals. We combined multiple statistical strategies to achieve this. First, TGs from ChIP-seq experiments were extracted with a stringent FDR, which was calculated using a statistical framework modified from TIP that combines binding location and intensity information to enrich for true TF-DNA binding events over false signals. Second, we defined the target functions of a TF as the consensus functions among its putative TGs; the statistical enrichment analysis hence further filtered out noise from the remaining false TGs. Third, we chose conservative gene universes specific to each type of function, so as to minimize spurious associations. Finally, we applied the Benjamini–Hochberg multiple-testing correction procedure and required an FDR of 5% for all associations reported. With these measures, approximately 10,000 significant TF-target function associations were obtained. Meanwhile, the total number of true TF-target function associations was estimated to be over 80,000, indicating the presence of rich functional signals in the TFTG data (Fig. 3). We believe there is room for further improvement to retrieve a higher number of TF-target function annotations at a controlled FDR.
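The final correction step can be illustrated with a minimal sketch of the standard Benjamini–Hochberg step-up procedure. The function name and the toy p-values are hypothetical and are not taken from the study; the sketch only shows how a 5% FDR threshold selects significant associations.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha."""
    m = len(p_values)
    if m == 0:
        return []
    # Rank the hypotheses by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical enrichment p-values for five TF-function pairs.
toy_p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
significant = benjamini_hochberg(toy_p_values, alpha=0.05)
```

Because the rule is step-up, a p-value that misses its own rank threshold is still rejected when a higher-ranked p-value meets its threshold.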
The elusive nature of protein functions necessitates the use of appropriate function taxonomies to properly identify the set of activities a protein performs. Approaches that couple PPI data with a standard taxonomy of functions have demonstrated better results than those relying on direct annotation transfer, as several studies have shown. Most of these techniques use the Gene Ontology (GO) as a functional classification scheme. GO is a structured and controlled vocabulary of terms that provides consistency in annotating how a protein behaves in a cellular context. It is arranged as a directed acyclic graph (DAG) of nodes linked by parent-child relationships, with each node representing a functional term. Nodes are connected by “is_a” (special case of the parent node/term) or “part_of” (sub-process of the parent node/term) relationships. Functionally characterized proteins are annotated with one or more nodes of the GO hierarchy, and because of the parent/child associations, a protein annotated with a child term is also annotated with all of that term's ancestors in the hierarchy.
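The propagation of annotations from a child term to all of its ancestors can be sketched as a simple graph traversal. The term and protein names below are placeholders rather than real GO identifiers, and the child-to-parents dictionary is an assumed representation that ignores the distinction between “is_a” and “part_of” edges.

```python
def ancestors(term, parents):
    """Collect every ancestor of `term` in a DAG given a child -> parents mapping."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct_annotations, parents):
    """Extend each protein's direct terms with all ancestor terms."""
    full = {}
    for protein, terms in direct_annotations.items():
        closure = set(terms)
        for term in terms:
            closure |= ancestors(term, parents)
        full[protein] = closure
    return full


# Hypothetical hierarchy: C has two parents (B1, B2), which share parent A.
toy_parents = {"C": ["B1", "B2"], "B1": ["A"], "B2": ["A"]}
toy_direct = {"P1": {"C"}, "P2": {"B1"}}
toy_full = propagate(toy_direct, toy_parents)
```

The `seen` set ensures that a term reachable through several paths, as is common in a DAG, is visited only once.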
In conclusion, many computational methods that integrate heterogeneous data for predicting protein (or gene) functions have been suggested. Most of these techniques follow the same basic paradigm: first, they generate various functional association networks by analyzing the implicit information on shared protein functions in different data sources. These individual networks are then combined into a composite and highly reliable network through a weighted sum. The weight of each individual network represents the contribution of the corresponding data source to the function prediction. A correct setting of these weights is thought to be the key to designing an effective function prediction method. In general, the weight adjustment of individual networks is mainly guided by human experience and statistical analysis. The major drawback is that the appropriate weighting of each network varies between datasets. Furthermore, the functions of proteins are diverse, and some of them only occur under specific conditions. Different functional association networks play different roles and have varying importance in function prediction. Combining heterogeneous data sources into a single weighted network could obscure the inherent nature of the protein function.
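The weighted-sum combination described above can be sketched as follows. The edge-dictionary representation, the network names, and the weights are illustrative assumptions, not values from any particular method; each network is assumed to list every undirected edge only once.

```python
def combine_networks(networks, weights):
    """Combine edge-score dictionaries into one network via a weighted sum.

    networks: list of dicts mapping (protein_u, protein_v) -> association score
    weights:  list of floats, one weight per network
    """
    if len(networks) != len(weights):
        raise ValueError("need exactly one weight per network")
    combined = {}
    for network, weight in zip(networks, weights):
        for (u, v), score in network.items():
            # Treat edges as undirected by storing them in sorted order.
            edge = tuple(sorted((u, v)))
            combined[edge] = combined.get(edge, 0.0) + weight * score
    return combined


# Hypothetical association networks derived from two data sources.
toy_ppi = {("A", "B"): 1.0, ("B", "C"): 0.5}
toy_coexpression = {("B", "A"): 0.4, ("A", "C"): 0.8}
toy_composite = combine_networks([toy_ppi, toy_coexpression], [0.7, 0.3])
```

In this toy case the edge between A and B receives 0.7 × 1.0 + 0.3 × 0.4, so changing the weights directly changes which associations dominate the composite network.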
Cis-natural antisense transcripts (cis-NATs) are a new class of RNAs identified in various species. However, the biological functions of cis-NATs are largely unknown. In this study, we investigated the transcriptional characteristics and functions of cis-NATs in the muscle tissue of lean Landrace and indigenous fatty Lantang pigs. In total, 3,306 cis-NATs of 2,469 annotated genes were identified in the muscle tissue of pigs. More than 1,300 cis-NATs correlated with their sense genes at the transcriptional level, and approximately 80% of them were co-expressed in the two breeds. Furthermore, over 1,200 differentially expressed cis-NATs were identified during muscle development. Function annotation showed that the cis-NATs participated in muscle development mainly by co-expressing with genes involved in energy metabolic pathways, including the citrate cycle (TCA cycle), glycolysis or gluconeogenesis, mitochondrial activation and so on. Moreover, the expression of these cis-NATs and their sense genes increased abruptly at the transition from the late fetal stages to the early postnatal stages and then decreased as muscle development proceeded. In conclusion, the cis-NATs in the muscle tissue of pigs were identified and determined to be mainly co-expressed with their sense genes. The co-expressed cis-NATs and their sense genes were primarily related to energy metabolic pathways during muscle development in pigs. Our results offer novel evidence on the roles of cis-NATs during the muscle development of pigs.
The advent of NGS platforms has accelerated sequence discovery and function annotation to a whole new level, with the publication of shotgun assemblies of genomic sequences that are rarely completed to finished chromosomes. In this new era of shotgun sequencing and assembly, short-sequence read collections and incomplete genome surveys, additional checks are inevitable to ensure that those genome-aware methods, once considered groundbreaking, achieve their full potential. How would the community address the type of errors described here in a systematic manner? One solution might be to allow the inclusion of additional metadata to flag NGS-related projects, thus enabling the modification of annotations at the assembly and/or sequence boundary levels. Annotation efforts for NGS projects will therefore need to flag, and treat differently, quasi-correct genome sequences, erroneous or elliptic assemblies and inexact gene predictions. We highlight the issue for automated function prediction, which, apart from the current agenda, should further consider any substantial NGS artifacts as a novel challenge that has not been adequately addressed so far.
In this paper, we have introduced various tools addressing three different stages of the Semantic Web-based knowledge management life cycle (knowledge creation, knowledge capture and knowledge reuse) for assisting engineers using the Geodise toolkit. Accessing and reasoning with the ontology and instances is facilitated via the OntoView mechanism, on top of which function annotation and reuse services are built. The reuse services include ontology-driven queries over instances and the semantic-matching-based knowledge advisor for function configuration and assembly. We are currently extending this system to incorporate the semantic annotation and retrieval of the configured workflows.
With the brat rapid annotation tool (Stenetorp et al., 2012), a web-based open-source annotation tool was introduced for the first time that supports collaborative annotation for multiple annotation layers simultaneously on a single copy of the document and is based on a client-server architecture. However, the current version of brat has limitations: (i) it is slow for documents of more than 100 sentences, (ii) it supports only a limited range of file formats, (iii) web-based configuration of tagsets/tags is not possible and (iv) configuring the display of multiple layers is not yet supported. While we use brat’s excellent visualization front end in WebAnno, we decided to replace the server layer to support user and quality management and monitoring tools, as well as to add the interface to crowdsourcing.
the search term ‘climate change’. The search showed that very few comments were made on articles relating to climate change before 2008. This may indicate some delay in the feature being taken up substantially by the readership or may be because there was originally a limit of 50 comments set for each article. In order to test the semantic annotation function of Wmatrix against the largest dataset, the articles were sorted according to the greatest number of comments elicited, and articles making only a passing reference to, for example, “Chris Huhne, Secretary of State for Energy and Climate Change” were excluded. Thirty-three ‘climate change’ articles from The Guardian website elicited 500+ comments, with the highest being 1679 comments. This demonstrates the depth of information available for conducting a longitudinal, cross-case comparison between articles and between newspapers. The top three ranking articles by number of user comments elicited were identified for analysis. Other researchers might consider alternative criteria such as the date, authorship, or source material in the gathering of their data. The three articles with the highest number of comments taken from The Guardian website can be seen in Table 1.
Overall, the idea of such studies is to understand how well an enzyme adapts its function. Indeed, evolutionary relationships exist between proteins that show diverse folds or topologies but share similar function. It is unlikely that these folds evolved independently. Many elegant studies [8, 9] suggest that enzymes evolved from pre-existing enzymes via gene duplication, using common binding sites or mechanistic features to catalyze different reactions. The most probable scenario is that these folds evolved from a smaller, less diverse set of ancestral proteins. The earliest enzymes were probably weakly catalytic and multifunctional with broad specificities [10–12]. Gradually, evolutionary events (gene duplication, mutation and divergence) allowed more numerous, effective, and specific enzymes to evolve from the multifunctional enzymes. Hence, they share a set of protein functions to effect the reaction [4, 13]. This in turn suggests that there are numerous functions which have evolved in unrelated structures.
Davidson (1967) proposed to treat events as individual objects, facilitating the semantic interpretation of adverbs, like “quickly” and “passionately”, and adverbial quantifying expressions such as “everywhere”, “never” and “at least three times”. Following Parsons (1990), this event-based semantics can be expressed in semantic representations by means of one-place predicates applied to existentially quantified event variables, and two-place predicates to indicate the semantic roles of the participants in an event. This ‘neo-Davidsonian’ approach has been adopted in the ISO annotation standards 24617-1 (Time and events), 24617-4 (Semantic roles), 24617-7 (Spatial information), and 24617-8 (Discourse relations). Champollion (2015) has shown that GQT and neo-Davidsonian semantics can be combined successfully. Still, natural language quantification is a semantically extremely complex set of phenomena, and especially the interpretation of plural noun phrases presents certain theoretical challenges for GQT (see e.g. Schwertel, 2005), some of which have been successfully approached in DRT (Kamp and Reyle, 1993), which has other limitations. Luckily, providing a semantics for quantification annotations is less challenging than providing a semantics for natural language expressions involving quantifications.
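As an illustration of the neo-Davidsonian format, the sentence “Brutus stabbed Caesar quickly” (a common textbook example, not drawn from the ISO standards cited here) could be represented roughly as follows. The predicate and role names are illustrative choices: the event type and the adverb are one-place predicates of the event variable, and the semantic roles are two-place predicates relating the event to its participants.

```latex
\[
\exists e \, \bigl[\, \mathit{stab}(e) \wedge \mathit{Agent}(e, \mathit{brutus}) \wedge \mathit{Theme}(e, \mathit{caesar}) \wedge \mathit{quick}(e) \,\bigr]
\]
```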
The scores of the interaction clusters for metabolism of xenobiotics by cytochrome P450, oxidative phosphorylation and ribosome were very high, suggesting that altered succinylation levels on these proteins may contribute to the development of NAFLD. In the metabolism of xenobiotics by cytochrome P450 function module, most of the interacting proteins belonged to the cytochrome P450 (CYP450) family or the glutathione S-transferase (GST) family. These two families participate together in the metabolism of various metabolites, especially in secondary metabolism involving steroids, fatty acids, xenobiotics and so on [26–28]. Lysine succinylation on members of the CYP450 and GST families may disturb various secondary metabolic processes and induce NAFLD. This is consistent with the aforementioned classification and enrichment analysis, as various terms related to molecular metabolism were significantly enriched. The major proteins in the oxidative phosphorylation and ribosome clusters were subunits of mitochondrial F0F1 ATP synthase and the ribosome, which implies that succinylation of F0F1 ATP synthase and the ribosome is an important event in NAFLD development. Energy production and protein synthesis were likely to be disrupted by increased lysine succinylation in the liver of the NAFLD model.
Several studies support the ‘signaling endosome hypothesis’, in which neurotrophic factors initiate ligand-mediated endocytosis of receptor tyrosine kinases into clathrin-coated vesicles that contain activators such as G proteins and downstream effector molecules involved in Ras-mitogen-activated protein kinase (MAPK) signaling [58-61]. In the electric organ, we identified guanine nucleotide binding proteins and inhibitors that act on the Rho family of Ras-related G proteins and that may be involved in signaling endosomes. These include G subunit β-1 (GNB1), subunit β-2-like 1 (RACK1), G(s) subunit α (GNAS1), and Rho GDP-dissociation inhibitor 1 (ARHGDIA). GNB1 forms part of the catalytic machinery of GTPases and provides docking regions for interacting proteins. ARHGDIA prevents the release of GDP from Rho proteins (Rho, Rac, cdc42, TC10). RACK1 is the receptor of protein kinase C (PKC), which is known to inactivate Rho; PKC also phosphorylates serine residues of the AChR δ subunit to promote receptor desensitization and disassembly [62-64]. The identification of several proteins involved in ligand-mediated endocytosis, activation and inhibition of GTPases, signaling, and lysosomal and proteasomal degradation supports the maintenance of protein function across myogenic-derived cell types.
To measure the effectiveness of MDDs as a metric of syntactic difficulty and cognitive demand in a broad sense, the testing sets of 20 languages with two versions of annotations were drawn from UD 2.2 and SUD 2.2 to form 20 corresponding treebank pairs. These 20 languages are Arabic (ara), Bulgarian (bul), Catalan (cat), Chinese (chi), Czech (cze), Danish (dan), Dutch (dut), Greek (ell), English (eng), Basque (eus), German (ger), Hungarian (hun), Italian (ita), Japanese (jpn), Portuguese (por), Romanian (rum), Slovenian (slv), Spanish (sp), Swedish (swe) and Turkish (tur). These 20 treebank pairs help to demonstrate the features and distinctions of syntactic- and semantic-oriented annotation schemes in measuring syntactic complexity and cognitive constraint.
Abstract. With the development of image processing and storage technology, rapid classification and annotation of huge volumes of digital images have attracted much attention. However, the complex and ambiguous relationship between images and concept classes poses significant challenges to building effective annotation models. Structured machine learning methods have been studied to tackle the problem of complex relationships between concept classes for prediction and have proved effective for image understanding tasks. We propose a novel image annotation model based on structured machine learning that introduces a learned kernel function in the sample space, aiming to capture the underlying distribution of concept classes in the training data set. The model is evaluated on two benchmark data sets, and the results show that it is promising compared to current state-of-the-art methods.
High-throughput genome sequencing has made large amounts of genome data available to the research community. Accurate gene structure prediction and annotation is the fundamental step towards understanding genome function. A large number of gene prediction tools and pipelines have been developed over the past years. To understand whether these prediction tools and pipelines give the same or different results for the same genome, we manually compared the gene prediction results of RAST (Rapid Annotations using Subsystems Technology), AMIgene (Annotation of MIcrobial Genes) and GeneMark.hmm for the organism Mycoplasma genitalium against the GenBank CDS (coding sequence) or gene annotations. The comparative analysis revealed both similarities and variations in the prediction results of each tool. Variations were seen in the total number of CDSs predicted, the gene coordinates and the gene lengths. We have tried to find the reasons behind these variations and to relate our analysis to present-day high-throughput data analysis. Analyses of this type are useful for annotating a newly sequenced genome.
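Although the comparison reported here was carried out manually, the same kind of coordinate-level comparison can be sketched programmatically. The function, the category names, and the coordinates below are hypothetical and serve only to show how exact matches, shared stops with differing starts, and unmatched predictions might be separated. Each CDS is assumed to be a (start, end, strand) tuple with start < end.

```python
def stop_key(cds):
    """Return the stop-codon end of a CDS: `end` on '+', `start` on '-'."""
    start, end, strand = cds
    return (end, strand) if strand == "+" else (start, strand)


def compare_predictions(reference, predicted):
    """Classify predicted CDSs against a reference set of (start, end, strand)."""
    exact = reference & predicted
    unmatched_ref = reference - exact
    unmatched_pred = predicted - exact
    ref_stops = {stop_key(cds) for cds in unmatched_ref}
    # Same stop but different start: the tools disagree on the start codon.
    same_stop = {cds for cds in unmatched_pred if stop_key(cds) in ref_stops}
    pred_stops = {stop_key(cds) for cds in unmatched_pred}
    return {
        "exact": exact,
        "same_stop_different_start": same_stop,
        "predicted_only": unmatched_pred - same_stop,
        "reference_only": {c for c in unmatched_ref if stop_key(c) not in pred_stops},
    }


# Hypothetical coordinates, not taken from any real annotation.
toy_reference = {(100, 400, "+"), (500, 900, "-"), (1000, 1300, "+")}
toy_predicted = {(100, 400, "+"), (500, 870, "-"), (1500, 1700, "+")}
toy_report = compare_predictions(toy_reference, toy_predicted)
```

Grouping by stop position reflects the common situation in which tools agree on the reading frame and stop codon but choose different start codons, which changes both the gene coordinates and the gene length.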
wind was an important factor in determining migration and short-term flight patterns of pelagic birds [4,24,36,48]. Here, we show how Env-DATA annotation can assist an investigation of wind dependencies and flow assistance. Figures 7 and 8 show a single albatross trajectory annotated by tail-wind support and side-wind (cross wind), two derived variables (Table 1) computed using wind direction, wind speed and the movement direction (flight heading) of the albatross along its flight path, based on the formulation from [24,36]. The space-time-cube illustrates how wind assistance facilitates the albatross’ flights toward the Galapagos Islands (orange to red colors represent higher wind assistance), while the flights to the coast are often challenged by head wind (aquamarine to blue colors represent wind resistance). The flight pattern in Figures 7 and 8 is characteristic of most other flight tracks in our albatross dataset. As seen in Figure 7 and Figure 8a–b, the albatross repeatedly takes a more northern route to the coast, relying mostly on side winds, and then moves south (presumably foraging) before returning to the Galapagos Islands using a tail-wind assisted route (cf. Figure 8a and Figure 8c). The observed clockwise pattern is in accordance with previous findings [4,48]. Weimerskirch et al. found that albatrosses prefer tail or side winds and therefore use predictable weather systems to fly in large looping tracks; when going south, movements are in a clockwise direction. This enables albatrosses to achieve high speeds while expending little energy. The travel direction towards continental South America and back to the Galapagos undertaken by waved albatrosses means they almost always have side-winds (cf. Figure 8b and Figure 8d).
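One simple way to derive tail-wind support and cross wind from wind speed, wind direction and flight heading is a trigonometric decomposition of the wind vector along and across the heading. The sketch below is an assumption for illustration and is not necessarily the formulation from [24,36]. In particular, it assumes that wind direction is given as the direction the wind blows toward, in degrees clockwise from north, the same convention as the heading.

```python
import math


def wind_components(wind_speed, wind_dir_deg, heading_deg):
    """Split wind into tail-wind and cross-wind components relative to heading.

    wind_dir_deg: direction the wind blows TOWARD (assumed convention),
                  in degrees clockwise from north.
    heading_deg:  flight heading, in degrees clockwise from north.
    Returns (tail, cross): tail > 0 is support, tail < 0 is head wind;
    cross > 0 pushes the bird to the right of its heading.
    """
    angle = math.radians(wind_dir_deg - heading_deg)
    tail = wind_speed * math.cos(angle)
    cross = wind_speed * math.sin(angle)
    return tail, cross


# Hypothetical example: bird heading east (90), 10 m/s wind blowing south (180).
toy_tail, toy_cross = wind_components(10.0, 180.0, 90.0)
```

With meteorological wind directions, which report where the wind comes from, 180 degrees would have to be added to the wind direction before applying this decomposition.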
An observation that holds for all semantic role labelling schemes is that certain labels seem to be more similar than others, based on their ability to occur in the same syntactic environment and to be expressed by the same function words. For example, Agent and Instrumental Cause are often subjects (of verbs selecting animate and inanimate subjects respectively); Patients/Themes can be direct objects of transitive verbs and subjects of change of state verbs; Goal and Beneficiary can be passivised and undergo the dative alternation; Instrument and Comitative are expressed by the same preposition in many languages (see Levin and Rappaport Hovav, 2005). However, most annotation schemes in NLP and linguistics assume that semantic role labels are atomic. It is therefore hard to explain why labels do not appear to be equidistant in meaning, but rather to form equivalence classes in certain contexts.
The initial set of annotations contained a high level of redundancy, as more than half of the annotated transcripts were alternative splicing isoforms. To avoid this redundancy in subsequent analyses, ORFs from six Nicotiana species were clustered using the cd-hit-est command in CD-HIT v4.6.1-2012-08-27 with a >95 % similarity cutoff, and only the representative ORFs in each cluster were retained. This yielded non-redundant sequence datasets for N. glauca (22,934 genes), N. noctiflora (26,788 genes), N. cordifolia (29,356 genes), N. knightiana (22,168 genes), N. setchellii (26,579 genes) and N. tomentosiformis (24,213 genes). These non-redundant sequences were defined as unigenes. Similarly, tomato coding sequences (CDSs) obtained from ITAG2.4 were also clustered, and 33,721 genes were obtained. OrthoMCL v1.4 was used to identify ortholog relationships between the six wild Nicotiana species and tomato.
2013; Bonin et al., 2012). This is consistent with observations on chat and chunk phases – laughter is more common in chat phases which often provide a ‘buffer’ between single speaker chunks. The structure of chat following chunks is also consistent with Schneider’s description of ‘idling’ phases at the end of topics, where speakers contribute short utterances referring back to previous content rather than committing to starting a new topic. In terms of the distribution of phases across conversations, chat was more common at the start of the multiparty conversations studied, in line with descriptions of casual conversation in the literature, where the initial stages comprise light interactive talk. Chunk phases become more prevalent as conversation continues. These observations lead to the interesting question of whether, once the conversation is well established, the extent of topics may correlate with the stretch from the beginning of one chunk to the beginning of the next chunk, with any intervening chat reflecting the decay of one topic and the process of starting the next. We intend to investigate this question using the topic annotation currently under way on the data.
In this paper, we introduce YEDDA, a lightweight but efficient and comprehensive open-source tool for text span annotation. YEDDA provides a systematic solution for text span annotation, ranging from collaborative user annotation to administrator evaluation and analysis. It overcomes the low efficiency of traditional text annotation tools by annotating entities through both command line and shortcut keys, which are configurable with custom labels. YEDDA also gives intelligent recommendations by learning from the up-to-date annotated text. An administrator client is developed to evaluate the annotation quality of multiple annotators and generate a detailed comparison report for each annotator pair. Experiments show that the proposed system can reduce the annotation time by half compared with existing annotation tools, and that the annotation time can be further reduced by 16.47% through intelligent recommendation.