As mentioned above, both the edge-based and the node-based measures have their limitations. Some works take both the nodes and the edges of the graph into account. Wang et al. presented a similarity measure that combines the structure of the GO graph with the semantic information of the GO terms. They quantified the semantics of a term by its S-value, which integrates the contributions of all terms in the GO subgraph consisting of the term itself and all of its ancestors. Their similarity measure between two terms was defined as the percentage of S-value they share. Several works [29-31] have demonstrated the advantage of this measure. However, this approach suffers from some shortcomings, namely that the semantic contribution value of an edge is determined empirically, and that dynamically calculating the semantic values of GO terms is rather time consuming. Recently, Wu et al. proposed a hybrid measure in which they used node information to improve the edge-based measure they had introduced previously, and showed its superiority in determining protein-protein interactions. Bandyopadhyay and Mallick developed a new hybrid method to address the issue of shallow annotation in the GO structure. Song et al. introduced an aggregate information content approach in which they defined the semantics of a term as the aggregate contribution of the semantic weights of all its ancestors and the term itself, and the similarity between two terms as the ratio of the semantics they share.
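The S-value idea described above can be illustrated with a minimal sketch. It assumes that a term contributes 1 to its own semantics, that each ancestor receives the largest product of edge factors along any path from the term, and that the similarity is the shared S-value divided by the total S-value of both terms. The toy ontology, the function names, and the edge factors (0.8 and 0.6) are illustrative placeholders, not values taken from the text.

```python
def s_values(term, parents, edge_weight):
    """S-value of `term` and of each of its ancestors (assumed definition).

    parents: dict mapping a term to a list of (parent, relation) pairs.
    edge_weight: dict mapping a relation name to its contribution factor.
    """
    s = {term: 1.0}  # a term contributes fully to its own semantics
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, relation in parents.get(child, []):
            candidate = edge_weight[relation] * s[child]
            # keep the best (maximum) contribution reaching each ancestor
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                stack.append(parent)
    return s


def term_similarity(a, b, parents, edge_weight):
    """Share of the total S-value carried by terms common to both subgraphs."""
    s_a = s_values(a, parents, edge_weight)
    s_b = s_values(b, parents, edge_weight)
    shared = set(s_a) & set(s_b)
    numerator = sum(s_a[t] + s_b[t] for t in shared)
    return numerator / (sum(s_a.values()) + sum(s_b.values()))


# Hypothetical toy ontology: C and D are both children of B, which is a child of root.
PARENTS = {
    "C": [("B", "is_a")],
    "D": [("B", "is_a")],
    "B": [("root", "is_a")],
}
EDGE_WEIGHT = {"is_a": 0.8, "part_of": 0.6}  # illustrative, empirically chosen factors

sim_cd = term_similarity("C", "D", PARENTS, EDGE_WEIGHT)
sim_cc = term_similarity("C", "C", PARENTS, EDGE_WEIGHT)
print(round(sim_cd, 4), round(sim_cc, 4))
```

Because the edge factors are free parameters, changing them changes every similarity score, which is the empirical-weight shortcoming noted above.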
The algorithms introduced above have been developed based on topological features of PPI networks. However, due to experimental limitations, there exist false positives and false negatives in PPIs. Besides physically interacting pair-wise relationships between proteins, semantic similarity describes another type of relationship between pairs of proteins by measuring the closeness between the two proteins, based on estimates of ontology-based functional similarity [18,19]. The Gene Ontology (GO) is the main focus of investigation of semantic similarity in molecular biology. Many measures [19,21-23] for computing semantic similarities have been proposed by using annotations from the three GO hierarchies: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). It has been confirmed that GO-driven similarity among genes is a relevant indicator of functional interaction in the investigation of assessment and evaluation of semantic similarity. Results in the study also demonstrated that there is a significant correlation between the semantic similarity of pair-wise proteins and their co-complex membership. It has been shown that semantic similarity assists in validating the results obtained from biomedical studies, such as gene clustering and gene expression data analysis. Therefore, in this paper, it is assumed that incorporating semantic similarity into the clustering process can improve the accuracy of identifying protein complexes.
In this paper, we attempt to make use of GO annotations and the ontology structure of GO terms to measure semantic similarity of GO terms and proteins. The similarity of two GO terms is measured based on their average distance to their lowest common ancestors in the ontology structure. Semantic similarity between proteins is computed as the similarity of two sets of GO terms, which annotate the two proteins respectively. PPIs in the network are then weighted by the similarity of interacting proteins for the filtering and clustering steps. As far as we know, most approaches filter the predicted complexes with low density or statistical significance in post processes [4,9,11,12], which still introduce some unreliable interactions in the results. In our method, however, the low-weight interactions are filtered first, followed by a cluster-expanding algorithm to identify high quality complexes consisting of only reliable interactions. Considering the core-attachment structure revealed by Gavin et al., which reflects the inherent organization of protein complexes, we propose a network clustering algorithm to identify the core and attachment proteins of complexes successively. Firstly, cliques in the filtered network are detected. Highly overlapping cliques are merged to form cores of complexes. Secondly, we add attachment proteins to the cores, making use of the cluster-expanding strategy in the RRW algorithm, which is appropriate for expanding clusters consisting of multiple nodes in weighted networks. By applying the clustering algorithm on the purified PPI network, our method identifies complexes with high biological significance and functional homogeneity.
Methods
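The filtering and core-detection steps outlined above can be sketched as follows. This is a minimal illustration under stated assumptions: the edge weights are given as input, cliques are enumerated with the basic Bron-Kerbosch recursion, and cliques are merged greedily when their overlap ratio reaches a threshold. The function names, the thresholds `min_weight` and `min_overlap`, and the toy network are hypothetical. The attachment step based on RRW is not covered.

```python
def filter_edges(weighted_edges, min_weight):
    """Keep only interactions whose similarity weight reaches min_weight."""
    return {pair: w for pair, w in weighted_edges.items() if w >= min_weight}


def build_adjacency(edges):
    """Undirected adjacency sets from an iterable of (a, b) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def merge_overlapping(cliques, min_overlap):
    """Single greedy pass: merge a clique into the first core it overlaps enough.

    Overlap is measured as shared proteins divided by the smaller set size
    (an assumed definition of "highly overlapping").
    """
    cores = []
    for clique in sorted(cliques, key=len, reverse=True):
        for i, core in enumerate(cores):
            if len(core & clique) / min(len(core), len(clique)) >= min_overlap:
                cores[i] = core | clique
                break
        else:
            cores.append(set(clique))
    return cores


# Hypothetical weighted PPI network; weights stand in for protein similarities.
weights = {
    ("A", "B"): 0.90, ("A", "C"): 0.80, ("B", "C"): 0.85,
    ("B", "D"): 0.75, ("C", "D"): 0.70, ("D", "E"): 0.10,
}
reliable = filter_edges(weights, min_weight=0.5)
cliques = maximal_cliques(build_adjacency(reliable))
cores = merge_overlapping(cliques, min_overlap=0.6)
print(cores)
```

Filtering first means the unreliable D-E interaction never reaches the clique search, so protein E cannot enter any core.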
Abstract—One of the most important objects in bioinformatics is a gene product (protein or RNA). For many gene products, functional information is summarized in a set of Gene Ontology (GO) annotations. For these genes, it is reasonable to include similarity measures based on the terms found in the GO or another taxonomy. In this paper, we introduce several novel measures for computing the similarity of two gene products annotated with GO terms. The fuzzy measure similarity (FMS) has the advantage that it takes into consideration the context of both complete sets of annotation terms when computing the similarity between two gene products. When the two gene products are not annotated by common taxonomy terms, we propose a method that avoids a zero similarity result. To account for the variations in the annotation reliability, we propose a similarity measure based on the Choquet integral. These similarity measures provide extra tools for the biologist in search of functional information for gene products. The initial testing on a group of 194 sequences representing three protein families shows a higher correlation of the FMS and Choquet similarities to the BLAST sequence similarities than the traditional similarity measures such as pairwise average or pairwise maximum.
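For orientation, the two traditional baselines named above, pairwise average and pairwise maximum, can be sketched in a few lines. The term-level similarity table, the term labels, and the function names are hypothetical. The FMS and Choquet measures themselves are not reproduced here.

```python
def pairwise_scores(terms_a, terms_b, term_sim):
    """Term-level similarity for every cross pair of annotation terms."""
    return [term_sim(a, b) for a in terms_a for b in terms_b]


def pairwise_average(terms_a, terms_b, term_sim):
    scores = pairwise_scores(terms_a, terms_b, term_sim)
    return sum(scores) / len(scores)


def pairwise_maximum(terms_a, terms_b, term_sim):
    return max(pairwise_scores(terms_a, terms_b, term_sim))


# Hypothetical term-to-term similarities; identical terms score 1.0.
TABLE = {
    frozenset({"t1", "t2"}): 0.5,
    frozenset({"t1", "t3"}): 0.2,
    frozenset({"t2", "t3"}): 0.4,
}


def lookup_sim(a, b):
    return 1.0 if a == b else TABLE[frozenset({a, b})]


gene_x = ["t1", "t2"]
gene_y = ["t2", "t3"]
avg_score = pairwise_average(gene_x, gene_y, lookup_sim)
max_score = pairwise_maximum(gene_x, gene_y, lookup_sim)
print(avg_score, max_score)
```

The maximum is driven entirely by the single best pair, while the average is diluted by every weak pair, so neither reflects the two complete annotation sets as a whole.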
flage. Results in Figure 3(b) (for the case K → B) describe an evident scenario of negative transfer, where the adaptation performance with SCL falls below the baseline. However, the proposed algorithm still sustains the performance by transferring knowledge in proportion to the similarity between the two domains. To further analyze the effect of similarity, we segregated the 12 cross-domain classification cases into two categories based on the similarity between the two participating domains, i.e. 1) > 0.5 and 2) < 0.5. Table 5 shows that for the 6 out of 12 cases that fall in the first category, the average accuracy gain is 10.8% as compared to the baseline, while for the remaining 6 cases that fall in the second category, the average accuracy gain is 15.4% as compared to the baseline. This shows that the proposed similarity-based iterative algorithm not only adapts well when the domain similarity is high but also yields a gain in accuracy when the domains are largely dissimilar. Figure 4 also shows how the weight for the target domain classifier w_t varies with the number of iterations. It further strengthens our assertion that if the domains are similar, the algorithm can readily adapt and converges in a few iterations. On the other hand, for dissimilar domains, slow iterative transfer, as opposed to one-shot transfer, can achieve similar performance; however, it may take more iterations to converge. While the effect of similarity on domain adaptation performance is evident, this work opens possibilities for further investigations.
3) Effect of varying threshold θ_1 & θ_2: Figure 5(a) explains the effect of varying θ_1 on the final classification accuracy. If θ_1 is low, C_t may get trained on incorrectly predicted pseudo labeled instances; whereas, if θ_1 is high, C_t may be deficient of instances to learn a good decision boundary. On the other hand, θ_2 influences the number of iterations required by the algorithm to reach the
"SNOMED-CT" stands for "Systematized Nomenclature of Medicine — Clinical Terms" is a systematically organized computer readable collection of medical terminology covering most areas of clinical information where its first version was released in 2002. SNOMED-CT ontology provides a global and broad hierarchical terminology for clinical data storage, encoding, and the retrieval of health and diseases information (Lee et al., 2014; Schulz et al., 2014; Schulz and Martínez-Costa, 2013). Basically, SNOMED-CT has been designed to be used by computer applications to represent clinical data in consistent and unambiguous manner. Then, the resulted data can be used for electronic health records (EHRs) and decision-support (DS) systems and finally to enable semantic interoperability which is precisely the goal (Sicilia, 2014; Campbell et al., 2013, Duarte et al., 2014). SNOMED-CT as an internationally accepted standard ontology is included in the UMLS repository.
The increasing popularity of question answering websites has led to the emergence of a new area of research called community question answering, which has to deal with two distinct but complementary tasks. The first task, called question-to-question similarity, has to provide related questions to a given original question. The identification of similar question pairs aims at preventing duplicate posts in the forums and at redirecting users towards posts that might contain an appropriate answer. The second task, called question-answering similarity, aims at providing a correct answer to a given original question. If several users contribute to a given post, it is important to automatically extract the correct answers among tens or hundreds of answers, since a manual exploration becomes hard to achieve. These tasks pose a key challenge because they have to deal with textual similarity not only in terms of lexical similarity but also in terms of reformulation, paraphrasing, duplicates and near duplicates, textual entailment, semantics, etc. Over the past years, there have been several studies on community question answering (Qiu and Huang, 2015; Filice et al., 2016; Barrón-Cedeño et al., 2016; Franco-Salvador et al., 2016; Nakov et al., 2016; Nakov et al., 2017; Patra, 2017). Most of them addressed this task through specific datasets such as the programming Q&A website Stackoverflow, the Quora dataset for duplicate extraction,
In this paper, we describe an unsupervised measure for quantifying the ‘informativeness’ of correlation matrices formed from the pairwise similarities or relationships among data instances. The measure quantifies the heterogeneity of the correlations and is defined as the distance between a correlation matrix and the nearest correlation matrix with constant off-diagonal entries. This non-parametric notion generalizes existing test statistics for equality of correlation coefficients by allowing for alternative distance metrics, such as the Bures and other distances from quantum information theory. For several distance and dissimilarity metrics, we derive closed-form expressions of informativeness, which can be applied as objective functions for machine learning applications. Empirically, we demonstrate that informativeness is a useful criterion for selecting kernel parameters, choosing the dimension for kernel-based nonlinear dimensionality reduction, and identifying structured graphs. We also consider the problem of finding a maximally informative correlation matrix around a target matrix, and explore parameterizing the optimization in terms of the coordinates of the sample or through a lower-dimensional embedding. In the latter case, we find that maximizing the Bures-based informativeness measure, which is maximal for centered rank-1 correlation matrices, is equivalent to minimizing a specific matrix norm, and present an algorithm to solve the minimization problem using the norm’s proximal operator. The proposed correlation denoising algorithm consistently improves spectral clustering. Overall, we find informativeness to be a novel and useful criterion for identifying non-trivial correlation structure.
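To make the definition concrete, here is a minimal sketch for one particular choice of distance, the Frobenius (Euclidean) distance, which is an assumption made for illustration; the Bures-based variant is not shown. Under this distance, the nearest matrix with constant off-diagonal entries is obtained by replacing every off-diagonal entry with the mean of the off-diagonal entries. The function name and the example matrices are hypothetical.

```python
import math


def frobenius_informativeness(corr):
    """Frobenius distance from `corr` to the nearest constant-off-diagonal matrix.

    corr: square correlation matrix given as a list of lists (unit diagonal).
    The minimizing constant is the mean of the off-diagonal entries.
    """
    n = len(corr)
    off_diagonal = [corr[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diagonal) / len(off_diagonal)
    return math.sqrt(sum((c - mean) ** 2 for c in off_diagonal))


# Homogeneous correlations: every pair is equally related.
uniform = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]

# Two tight groups with no correlation between them.
blocks = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.9],
    [0.0, 0.0, 0.9, 1.0],
]

score_uniform = frobenius_informativeness(uniform)
score_blocks = frobenius_informativeness(blocks)
print(score_uniform, score_blocks)
```

A matrix whose off-diagonal entries are all equal scores zero, while the block-structured matrix scores higher because its correlations are heterogeneous.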
Although the semantic similarity between two GO terms has been extensively investigated [1-4], how to define similarity between two gene products based on GO annotations for a specific application remains unclear. To date annotation similarity has been computed by four general approaches: the set-based approach; the graph-based approach; the vector-based approach; and the term-based approach. In the set-based approach an annotation is viewed as a 'bag of words'. Two annotations are similar if there is a large overlap between their sets of terms. A graph-based approach views similarity as a graph-matching procedure. Vector-based methods embed annotations in a vector space where each possible term in the ontology forms a dimension. Term-based approaches compute similarity between individual terms and then combine these similarities to produce a measure of annotation similarity. None of the above approaches considers the semantics of relationships between terms. How terms are related can significantly alter how an annotation, which is a set of terms, is interpreted. In the GO there are two main types of relations: is_a and part_of. The is_a relation represents a taxonomic relationship between terms that can be modeled using the improper subset relation, which is a partial ordering of terms. The part_of relation represents a partonomic relationship between terms that can also be modeled in terms of a partial order. Though the partial orders represented by taxonomies and partonomies are well understood, there has been little attention given as to how these two partial orderings combine. Using the various cases identified by combining taxonomies and partonomies we construct an algorithm called SSA (Semantic Similarity of Annotations) that identifies the terms that can be associated with an annotation and terms that relate to both annotations. 
Instances associated with these terms are then used to construct a Resnik-like measure of annotation similarity, thus extending the underlying intuitions behind this term-based measure to the annotation level. A measure of term or annotation similarity should be based on a set of principles that form the basis for what is considered similar. The nature of similarity has been the focus of intense research in the areas of aesthetics [6,7] and psychology. In mathematics, properties such as identity, symmetry and the triangle inequality have been used to form the basis of measures of similarity of mathematical objects. Principles of term and annotation similarity have been suggested by various authors. This work intends to build on these principles and introduce additional principles that a measure of similarity should seek to satisfy. Similarity between objects is normally expressed as a number that ranges along an interval on the real numbers ℝ. However the main purpose of similarity is usually to
At the core of the system’s knowledge representation framework is an ontology covering the ICT domain, within which content from registered providers will be classified. In addition, Diogene will provide users with the opportunity to use free Web content in their domain of interest; this material may have a limited pedagogical value but can be used as additional material during training sessions. One method for drawing useful freeware resources from the Web and making them available to the system is to “map” external ontologies in the same domain to Diogene’s ontology. Given the decentralised nature of Semantic Web development, it is likely that the number of ontologies will greatly increase over the next few years (Doan et al., 2002), and that many will describe similar or overlapping domains, providing a rich source of material.
Abstract. Cross-domain recommender systems adopt multiple methods to build relations from the source domain to the target domain in order to alleviate the problems of cold start and sparsity, and to improve the performance of recommendations. The majority of traditional methods tend to associate users and items, neglecting the strong influence of friend relations on the recommendation. In this paper, we propose a cross-domain item recommendation model called CRUS based on user similarity, which is the first to introduce the trust relation among friends into cross-domain recommendation. Although friends usually tend to have similar interests in some domains, they also differ in others. Considering this, we define all the users similar to the target user as Similar Friends. By modifying the transfer matrix in the random walk, friends sharing similar interests are highlighted. Extensive experiments on the Yelp data set show that CRUS outperforms the baseline methods on MAE and RMSE. Keywords: cross domain recommendation, trust relation, user similarity, rating prediction, random walk.
63. 5 MCCARTHY, supra note 29, § 25A:51 (“It is irrelevant under the ACPA that confusion about a web site’s source or sponsorship could be resolved by visiting the web site identified by the accused domain name.”); BLAKESLEE, supra note 41 (“The fact that confusion about a website’s source or sponsorship could be resolved by visiting the website is not relevant to whether the domain name itself is identical or confusingly similar to a plaintiff’s trademark.”); SoftCom Technology Consulting Inc. v. Olariu Romeo/Orv Fin Group S.L., WIPO Case No. D2008-0792, Judgment for Complainant, § 6 (July 8, 2008), https://www.wipo.int/amc/en/domains/decisions/html/2008/d2008-0792.html [https://perma.cc/Q44W-SQQ7] (Domain <myhostingfree.com> transferred to complainant. Confusing similarity only examines whether the letter string of the domain name is confusingly similar to the letter string of the trademark, devoid of marketplace factors. Panel explicitly addresses that it does not look at mental reaction of Internet users to the domain name); Harry Winston Inc. and Harry Winston S.A. v. Jennifer Katherman, WIPO Case No. D2008-1267, Complaint Denied, § 7 (Oct. 18, 2008), https://www.wipo.int/amc/en/domains/decisions/html/2008/d2008-1267.html [https://perma.cc/SE7J-N4Q2] (Although such content may be regarded as highly relevant in the assessment of intent to create confusion under subsequent elements (i.e., rights or legitimate interests and bad faith)).
Abstract. This paper presents an empirical study on four techniques of language model adaptation, including a maximum a posteriori (MAP) method and three discriminative training models, in the application of Japanese Kana-Kanji conversion. We compare the performance of these methods from various angles by adapting the baseline model to four adaptation domains. In particular, we attempt to interpret the results given in terms of the character error rate (CER) by correlating them with the characteristics of the adaptation domain measured using the information-theoretic notion of cross entropy. We show that such a metric correlates well with the CER performance of the adaptation methods, and also show that the discriminative methods are not only superior to a MAP-based method in terms of achieving larger CER reduction, but are also more robust against the similarity of background and adaptation domains.
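The role of cross entropy as a measure of how far an adaptation domain lies from the background domain can be sketched as follows. The sketch assumes a character unigram model with add-one smoothing, chosen only to keep the example short; the abstract does not specify the language model used for this measurement, and the function names and toy strings are hypothetical.

```python
import math
from collections import Counter


def train_unigram(text):
    """Character counts and total length of the background text."""
    counts = Counter(text)
    return counts, sum(counts.values())


def cross_entropy(model, text, vocab_size):
    """Average negative log2 probability per character of `text` under `model`.

    Add-one smoothing keeps unseen characters from getting zero probability.
    """
    counts, total = model
    log_prob = sum(
        math.log2((counts[ch] + 1) / (total + vocab_size)) for ch in text
    )
    return -log_prob / len(text)


background = "aaaabbbb"
similar_domain = "aabb"    # same characters as the background
distant_domain = "cccc"    # characters never seen in the background

vocab_size = len(set(background + similar_domain + distant_domain))
model = train_unigram(background)
h_similar = cross_entropy(model, similar_domain, vocab_size)
h_distant = cross_entropy(model, distant_domain, vocab_size)
print(h_similar, h_distant)
```

A lower cross entropy means the background model predicts the adaptation text well, so the two domains are similar; a higher value signals a more distant domain.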
Personalized Spam Filtering (Spam): The data set comes from the ECML/PKDD 2006 discovery challenge. The goal is to adapt a spam filter trained on a common pool of 4000 labeled emails to three individual users’ personal inboxes, each containing 2500 emails. We use bag-of-word features for this task, and we report classification accuracy.
Gene Name Recognition (NER): The data set comes from BioCreAtIvE Task 1B (Hirschman et al., 2005). It contains three sets of Medline abstracts with labeled gene names. Each set corresponds to a single species (fly, mouse or yeast). We consider domain adaptation from one species to another. We use standard NER features including words, POS tags, prefixes/suffixes and contextual features. We report F1 scores for this task.
Relation Extraction (Relation): We use the ACE2005 data where the annotated documents are from several different sources such as broadcast news and conversational telephone speech. We report the F1 scores of identifying the 7 major relation types. We use standard features including entity types, entity head words, contextual words and other syntactic features derived from parse trees.
3.2 Methods for Comparison
While pattern-based and vertical relation approaches yield a high precision, they simultaneously suffer from a poor recall, particularly because patterns rarely apply in real documents. Likewise, the high-recall distributional approaches suffer from low precision. One problem is that many unrelated terms co-occur simply because each of them occurs frequently enough. Secondly, data sparseness arises because many domain terms are multi-word terms, which tend to appear rarely in corpora, hampering the collection of statistically evaluable context information.
performed for each list of genes of interest compared to the gene background. No threshold is applied and results are combined together. The most popular test to perform a functional enrichment analysis is Fisher’s exact test. P-values measure the degree of independence between belonging to the GO term and being enriched. They are unadjusted for multiple testing in this exploratory context. ViSEAGO offers all statistical tests and algorithms developed in the Bioconductor topGO R package, taking into account the topology of the GO graph by using the ViSEAGO::create_topGOdata method followed by the topGO::runTest method. A table of results that summarizes the functional enrichment tests performed for each list of genes is built using the ViSEAGO::merge_enrich_terms method. The number of enriched GO terms is displayed in a barchart plot using ViSEAGO::GOcount. The number of GO terms overlapping between lists of interest is also available in the upset plot with ViSEAGO::Upset (Fig. 2, Section “Enrichment Analysis”). Thus, ViSEAGO allows comparison of biological functions associated with each list of enriched GO terms in the study. Users can interactively sort the table of results by p-values or query by GO term.
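The enrichment test mentioned above can be illustrated independently of the R packages. The following Python sketch computes a one-sided Fisher's exact test for over-representation as a hypergeometric tail probability. It is not the ViSEAGO or topGO API; the one-sided variant, the function name, and the counts in the example are assumptions made for illustration, and no correction for the GO graph topology is applied.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact test p-value, P(X >= k).

    N: genes in the background.
    K: background genes annotated with the GO term.
    n: genes in the list of interest.
    k: genes of interest annotated with the GO term.
    """
    upper = min(n, K)
    # Sum the hypergeometric probabilities of observing k or more annotated genes.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts: 20 background genes, 5 annotated with the term,
# 5 genes of interest, 3 of which carry the annotation.
p_enriched = enrichment_p_value(k=3, n=5, K=5, N=20)
p_trivial = enrichment_p_value(k=0, n=5, K=5, N=20)
print(p_enriched, p_trivial)
```

Running one such test per GO term and per gene list yields the raw, unadjusted p-values that populate the table of results.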
Unsupervised domain adaptation. Our work relates to unsupervised domain adaptation (UDA) where no labeled target images are available during training. In this community, some methods aim to learn a mapping between source and target distributions [37, 13, 9, 38]. Correlation Alignment (CORAL) proposes to match the mean and covariance of two distributions. Recent methods [18, 4, 28] use an adversarial approach to learn a transformation in the pixel space from one domain to another. Other methods seek to find a domain-invariant feature space [34, 31, 10, 30, 42, 11, 2]. Long et al. and Tzeng et al. use the Maximum Mean Discrepancy (MMD) for this purpose. Ganin et al. and Ajakan et al. introduce a domain confusion loss to learn domain-invariant features. Different from the settings in this paper, most of the UDA methods assume that class labels are the same across domains, while different re-ID datasets contain entirely different person identities (classes). Therefore, the approaches mentioned above cannot be utilized directly for domain adaptation in re-ID.
Smart phones are becoming a more integrated and important part of people’s daily lives due to their powerful computational capabilities, which support applications such as email, online banking and online shopping. The use of mobile devices has increased in our lives, offering almost the same functionality as personal computers. Android devices have appeared lately and, since then, the number of applications available for this operating system has increased exponentially. Finding similar or related Android applications is a feature in popular search engines (e.g., Play store, Galaxy apps). For example, after users submit search queries, Google play displays the search results together with a group of relevant applications labeled as similar applications. Market-specific search engines identify similar apps by relying on textual descriptions only. However, a match between words in a search query and words in the descriptions or in the source code of applications doesn't guarantee that these applications are relevant. In addition, many application repositories are polluted with poorly functioning projects. In this paper, the aim is to compare the similarity between Android applications' graphical interfaces and their functions to figure out if there is any association between them. This can open a new direction in different research concerned
One way of determining what a community represents is to examine its leader, which is the focus of the NSL algorithm. The leaders of communities 1, 2 and 3 are the accounts @HoughtonProbs, @CCCU_news, and @JoAnneLyonGS, respectively. The @HoughtonProbs account is one of the many meme accounts found on Twitter, presumably taking inspiration from the “first world problems” meme; it publishes tweets referring to problems or “problems” peculiar to the subculture of Houghton, New York. The account @CCCU_news is operated by The Council for Christian Colleges and Universities, which describes itself as “an international association of intentionally Christian colleges and universities.” Houghton College is listed on the CCCU website as a member, along with 117 other colleges and universities. The account posts news stories relevant to those institutions. The third major leader in the network, @JoAnneLyonGS, is a prominent member of the Wesleyan denomination of the Protestant church, the official denomination of Houghton College.