The vast increase in the amount of online text, and the need for different types of information, have driven interest in several areas of information retrieval, such as multimedia retrieval, chemical and biological informatics, topic detection and summarization. Textual similarity underlies all of these fields. Textual similarity functions play an important role in text-related research and in applications such as information retrieval, topic detection and text classification. These functions are partitioned into string-based, corpus-based and knowledge-based functions. String-based functions are further divided into the character-based approach and the term-based approach. The term-based approach makes use of the Jaccard, Cosine, Dice and Overlap similarity functions. If binary weights are used, the weight of a term is 1 if the term occurs in the document and 0 if it does not; the formulae stated in Section III of the paper for these similarity functions then take the binary term-vector forms shown in Table 1. X is defined as the set of all terms occurring in document X, and Y as the set of all terms occurring in document Y. |X| is the number of terms that occur in set X.
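Table 1 itself is not reproduced here, so the following is only a minimal sketch of the four term-based measures in their standard set form for binary weights, where X and Y are the term sets of two documents. The function names and the example documents are illustrative assumptions, not taken from the paper.

```python
import math

# Standard set-based forms of the term-based measures for binary weights.
# X and Y are sets of the terms occurring in two documents.

def jaccard(x, y):
    # |X ∩ Y| / |X ∪ Y|
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def dice(x, y):
    # 2 |X ∩ Y| / (|X| + |Y|)
    total = len(x) + len(y)
    return 2 * len(x & y) / total if total else 0.0

def cosine(x, y):
    # |X ∩ Y| / sqrt(|X| * |Y|)
    denom = math.sqrt(len(x) * len(y))
    return len(x & y) / denom if denom else 0.0

def overlap(x, y):
    # |X ∩ Y| / min(|X|, |Y|)
    smaller = min(len(x), len(y))
    return len(x & y) / smaller if smaller else 0.0

# Hypothetical documents, already reduced to term sets.
X = {"text", "similarity", "function", "retrieval"}
Y = {"text", "similarity", "classification"}

for name, fn in [("Jaccard", jaccard), ("Dice", dice),
                 ("Cosine", cosine), ("Overlap", overlap)]:
    print(f"{name}: {fn(X, Y):.3f}")
```

With binary weights, all four measures share the numerator |X ∩ Y| and differ only in how they normalise by the set sizes, which is why they tend to rank document pairs similarly but not identically.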
Memory-based methods simply memorize the rating matrix and issue recommendations based on the relationship between the queried user and item and the rest of the rating matrix. The most popular memory-based CF methods are neighbourhood-based or user-based methods, which predict ratings by referring to users whose ratings are similar to those of the queried user, or to items that are similar to the queried item. This is motivated by the assumption that if two users have similar ratings on some items, they will have similar ratings on the remaining items. Alternatively, if two items have similar ratings from a portion of the users, the two items will have similar ratings from the remaining users. Model-based methods, on the other hand, fit a parametric model to the training data that can later be used to predict unseen ratings and issue recommendations. Model-based methods include cluster-based CF [3, 4, 5, 6, 7], Bayesian classifiers [8, 9], and regression-based methods. The slope-one method fits a linear model to the rating matrix, achieving fast computation and reasonable accuracy. In this paper, a user-based recommendation generation algorithm is analysed and implemented. The paper focuses on different similarity functions for computing user-to-user similarities in order to generate recommendations.
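The user-based idea described above can be sketched as follows. This is an illustrative example only: the rating data, the choice of cosine similarity over co-rated items, and the similarity-weighted average used for prediction are assumptions, not the specific algorithm or similarity functions evaluated in the paper.

```python
import math

# Hypothetical rating matrix: user -> {item: rating}.
ratings = {
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob":   {"a": 4, "b": 2, "c": 5, "d": 4},
    "carol": {"a": 1, "b": 5, "d": 2},
}

def cosine_sim(u, v):
    """Cosine similarity between two users over the items both have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other in ratings:
        if other == user or item not in ratings[other]:
            continue
        s = cosine_sim(user, other)
        num += s * ratings[other][item]
        den += abs(s)
    # None signals that no neighbour has rated the item.
    return num / den if den else None

print(predict("alice", "d"))
```

Swapping `cosine_sim` for another user-to-user similarity function changes only the neighbour weights, which makes this structure convenient for comparing similarity functions.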
Several authors have presented systems that estimate the audio similarity of two pieces of music through the calculation of a distance metric, such as the Euclidean distance, between spectral features calculated from the audio, related to the timbre or pitch of the signal. These features can be augmented with other, temporally or rhythmically based features such as zero-crossing rates, beat histograms, or fluctuation patterns to form a more well-rounded music similarity function. It is our contention that perceptual or cultural labels, such as the genre, style, or emotion of the music, are also very important features in the perception of music. These labels help to define complex regions of similarity within the available feature spaces. We demonstrate a machine-learning-based approach to the construction of a similarity metric, which uses this contextual information to project the calculated features into an intermediate space where a music similarity function that incorporates some of the cultural information may be calculated.
Fifty years ago, Hans A. Panofsky published a paper entitled "Determination of stress from wind and temperature measurements". In this famous paper, he presented a new profile function for the mean horizontal wind speed under diabatic stratification that includes his integral similarity function. With this integral similarity function, he opened the door for Monin-Obukhov scaling in a wide range of micrometeorological and microclimatological applications. In a historical survey ranging from the 1960s to the present day, we present integral similarity functions for momentum, sensible heat, and water vapor for both unstable and stable stratification, addressing free-convection conditions on the one hand and strongly stable stratification on the other.
In this paper, we proposed hybrid matching techniques based on instance values and on schema information, such as datatypes, cardinality, and relationships. The techniques uniformly apply similarity functions to generate matchings and are grounded on the traditionally accepted interpretation that “terms have the same extension when true of the same things”. In our context, two concepts match if they denote similar sets of objects. The techniques essentially differ in the nature of the sets to be compared and in the similarity functions adopted. For example, and in a very intuitive way, two classes match if their sets of observed instances are similar; two terms from different thesauri match if the sets of instances they classify are similar; and properties match if their sets of observed values are similar.
of a concept can contain one or more attribute descriptions as well as references to other concepts, thus allowing the user to create object-oriented case representations. A description of an attribute can be formulated using one of the following data types: Boolean, Double, Integer, Interval, String or Symbol. For each data type, myCBR provides a default similarity function, together with type-specific functionality for defining more sophisticated similarity functions for the data type chosen for an attribute. To ease the prototyping of CBR systems, a single attribute description can have more than one similarity measure, allowing for rapid testing and experimentation to find the most suitable similarity functions.
We introduce a relation extraction method to identify the sentences in biomedical text that indicate an interaction among the protein names mentioned. Our approach is based on the analysis of the paths between two protein names in the dependency parse trees of the sentences. Given two dependency trees, we define two separate similarity functions (kernels) based on cosine similarity and edit distance among the paths between the protein names. Using these similarity functions, we investigate the performance of two classes of learning algorithms, Support Vector Machines and k-nearest-neighbor, and the semi-supervised counterparts of these algorithms, transductive SVMs and harmonic functions, respectively. A significant improvement over previous results in the literature is reported, and a new benchmark dataset is introduced. Semi-supervised algorithms perform better than their supervised counterparts by a wide margin, especially when the amount of labeled data is limited.
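As a rough illustration of comparing two dependency paths, the sketch below treats each path as a list of tokens and computes a token-level edit distance and a bag-of-tokens cosine similarity. The path representation, the normalisation of edit distance into a similarity, and the example paths are all assumptions made for illustration; the kernels defined in the paper may differ in detail.

```python
import math
from collections import Counter

def edit_distance(p, q):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, 1):
        cur = [i]
        for j, b in enumerate(q, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute a -> b
        prev = cur
    return prev[-1]

def edit_similarity(p, q):
    """Assumed normalisation: 1 - distance / length of the longer path."""
    longest = max(len(p), len(q))
    return 1.0 - edit_distance(p, q) / longest if longest else 1.0

def cosine_similarity(p, q):
    """Cosine similarity between token-count vectors of the two paths."""
    cp, cq = Counter(p), Counter(q)
    dot = sum(cp[t] * cq[t] for t in cp)
    norm = (math.sqrt(sum(c * c for c in cp.values()))
            * math.sqrt(sum(c * c for c in cq.values())))
    return dot / norm if norm else 0.0

# Hypothetical dependency paths between two protein mentions.
path1 = ["PROT1", "nsubj", "interacts", "prep_with", "PROT2"]
path2 = ["PROT1", "nsubj", "binds", "dobj", "PROT2"]

print(edit_similarity(path1, path2), cosine_similarity(path1, path2))
```

Edit distance is sensitive to token order along the path, whereas the cosine measure ignores order, so the two functions can disagree on paths that contain the same tokens in different arrangements.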
Abstract: Information retrieval is a key technology for accessing the vast amount of data present on today’s World Wide Web. Numerous challenges arise at various stages of information retrieval from the web, such as many relevant documents being missed, static user queries, and an ever-changing, enormous document collection. Therefore, more powerful strategies are required to search for relevant documents. In this paper, a PSO methodology hybridized with Simulated Annealing is proposed with the aim of optimizing the Web Information Retrieval (WIR) process. The hybridized PSO has a high impact on reducing the query response time of the system and hence contributes to system efficiency. A novel similarity measure called SMDR acts as the fitness function in the hybridized PSO-SA algorithm. Evaluation measures such as accuracy, MRR, MAP, DCG, IDCG, F-measure and specificity are used to measure the effectiveness of the proposed system and to compare it with the existing system. Experiments are carried out extensively on the large RCV1 collection. The precision-recall rates achieved demonstrate that the proposed system is considerably more effective than the existing one.
Genetically, bunyaviruses are highly diverse, and some species can cause lethal infections in humans or livestock. RVFV is one of the most important bunyaviruses, and has caused large outbreaks in both ruminants and humans, resulting in devastating economic loss in affected regions. The major virulence factor, NSs, exerts several different biological functions to hijack infected cells. Each mechanism is apparently similar to those used by other bunyavirus NSs proteins, as summarized in Table 1. The degradation of PKR is induced by RVFV and TOSV NSs proteins. RVFV NSs protein also promotes the degradation of TFIIH p62 protein, whereas TOSV NSs protein triggers the degradation of RIG-I. Since RVFV NSs protein functions as a viral adaptor protein to form the E3 ligase complex for p62 or PKR degradation, similar mechanisms may be utilized by TOSV NSs protein for PKR or RIG-I degradation. Host general transcription suppression is induced by RVFV, BUNV, and LACV NSs proteins. These NSs proteins induce DNA damage responses, and trigger host translational shutoff. The expression of the IFN-β gene is suppressed at the transcriptional level, which is in sharp contrast with other NSs proteins, as they only localize in the cytoplasm (e.g., TOSV, SFTSV). TOSV or SFTSV NSs proteins inhibit specific proteins in the IFN-β gene induction pathway (e.g., RIG-I, TBK-1). Further grouping of phleboviruses and orthobunyaviruses based on NSs
The work presented in this paper proposes a novel policy administration mechanism, ICPA, to meet the requirements of the changing trust model, which has led to widespread overclaiming of privileges. ICPA leverages similar policies to design or verify a target policy set, and thereby simplifies policy administration. This work provides a definition of the formal model of ICPA and the design of its enforcement framework. Additionally, it proposes a text-mining-based similarity measure to obtain similar policies. As future work, a safety definition will be investigated and evaluated to improve the permission model. The analysis of ICPA also requires greater mathematical depth.
Table 1 provides the complete list of 93 protein functional families collected from the UniProt database. The performance of the popular protein function prediction methods (BLAST, SVM, KNN and PNN) was measured on an independent test dataset (the way this dataset was generated is described in Section 2.2). These 93 families included 12 molecular binding families (e.g. sodium-binding, potassium-binding, SH3-binding, RNA-binding), 15 ligand families (e.g. plastoquinone ligand, vitamin C ligand, and ubiquinone ligand), 58 functional families defined by Gene Ontology (40 molecular functions and 18 biological processes) and 8 broad families defined by the UniProt database. All families were contained in the keyword categories provided by the UniProt database, and the majority (82.7%) of these 93 families could be mapped to their corresponding GO terms (Table 1). Protein entries in a keyword category that had not been manually annotated and reviewed by UniProtKB curators were not considered for analysis in this study. As a result, 107 to 49,517 protein entries from 93 functional families across various species were collected.
The modified inspectional analysis is very similar to the Lie scaling method both in development and in application. It is a somewhat intuitive version of the Lie scaling technique. It operates by examining the governing equations, applying a scaling transformation to the variables in the system, and enforcing invariance of the scaling transformation. As with dimensional scaling, this method does not explicitly consider that parameters or secondary processes that are functions of the scaled state variables of the system need to satisfy self-similarity relationships. Rather, model processes can be chosen to have different dynamical properties than the prototype processes (Bear, 1972). For example, an aquifer with an anisotropic hydraulic conductivity can be modeled as an isotropic aquifer.
To provide efficiency at runtime, Magnitude uses a custom “.magnitude” file format instead of “.bin”, “.txt”, “.vec”, or “.hdf5” that word2vec, GloVe, fastText, and ELMo use (Mikolov et al., 2013; Pennington et al., 2014; Joulin et al., 2016; Peters et al., 2018). The “.magnitude” file is a SQLite database file. There are 3 variants of the file format: Light, Medium, Heavy. Heavy models have the largest file size but support all of the Magnitude library’s features. Medium models support all features except approximate similarity search. Light models do not support approximate similarity searches or interpolated OOV lookups, but they still support basic OOV lookups. See Figure 1 for more information about the structure and layout of the “.magnitude” format.
The calculations have been carried out using the same method as in our recent studies of the optical properties of III-V (001) growth planes [14,15]. In short, we use density-functional theory in the local-density approximation (DFT-LDA) together with nonlocal norm-conserving pseudopotentials [16] to determine the structurally relaxed ground state of the system. A massively parallel, real-space finite-difference method [17] is used to deal efficiently with the large surface unit cell and the many states required for the calculation of the dielectric function. A multigrid technique is employed for convergence acceleration. The spacing of the finest grid used to represent the electronic wave functions and charge density was determined through a series of bulk calculations. We find that structural and electronic properties are converged for a spacing of 0.238 Å. This corresponds to an energy cutoff in plane-wave calculations of about 24 Ry. The calculations yield a bulk equilibrium lattice constant of 5.378 Å and a bulk modulus of 0.979 Mbar (experiment [18]: 5.43 Å and 0.96–0.99 Mbar). The calculated excitation energies suffer from the neglect of self-energy effects in DFT-LDA and are smaller than the measured values: the calculated indirect band gap is 0.58 eV (room temperature experiment [18]: 1.11 eV). A similar underestimation of about 0.5 eV occurs for the E1 and E2 critical points (CP’s) of the
Abstract. In DNA-based computation and DNA nanotechnology, the design of good DNA sequences has turned out to be an essential problem and one of the most practical and important research topics. Basically, the DNA sequence design problem is a multi-objective problem, and it can be evaluated using four objective functions, namely, H measure, similarity, continuity, and hairpin. In this paper, particle swarm optimization
There is a multitude of clustering methods available in the literature, which can be distinguished with respect to their algorithmic properties. First, partition algorithms strive for a successive improvement of an existing clustering and can be further classified into exemplar-based and commutation-based approaches. These approaches need information about the expected cluster number k. Representatives are k-means and k-medoid. Second, hierarchical algorithms create a tree of node subsets by successively merging (agglomerative approach) or subdividing (divisive approach) the objects. In order to obtain a unique clustering, a second step is necessary that prunes this tree at adequate places. Representatives are k-nearest-neighbor and linkage. Finally, density-based algorithms try to separate a similarity graph into subgraphs of high connectivity values. In the ideal case, they can determine the cluster number k automatically and detect clusters of arbitrary shape and size. Representatives are DBSCAN and Chameleon.
The analysis of the GpSGHV genome reported here provides a second set of evidence that this virus not only differs from baculoviruses, nudiviruses, and nimaviruses but in fact cannot be assigned to any of the large dsDNA virus families so far known to infect invertebrates or vertebrates. The similarity between the GpSGHV genome and those of other large dsDNA viruses appears to be limited to a very small number of genes with well-known functions, including, surprisingly, four putative orthologues to baculovirus envelope genes p74, pif-1, pif-2, and pif-3 and a DNA polymerase gene similar to those of herpesviruses. However, the phylogenetic trees of these five GpSGHV genes clearly established their ancient divergence from their putative orthologues. The similarity to about 40 additional viral genes from different virus families, whose function is mostly unknown, further underlines the uniqueness of GpSGHV. Taken together, these data justify, in our opinion, the proposal that GpSGHV represents a new type of insect virus for which a taxonomic position has to be defined.
This paper proposes a query-by-example system for generic audio. Section 2 gives an overview of the system and previous similarity measures. We observe that the similarity of audio signals can be measured by the difference between the probability density functions (pdfs) of their frame-wise features. The empirical pdfs of continuous-valued features cannot be estimated directly, but they are modeled using Gaussian mixture models (GMMs). A GMM parametrizes each sample efficiently with a small number of parameters, retaining the necessary information for similarity measurement. An overview of other applications utilizing GMMs in music information retrieval can be found in .
performed for each list of genes of interest compared to the gene background. No threshold is applied and results are combined together. The most popular test to perform a functional enrichment analysis is Fisher’s exact test. P-values measure the degree of independence between belonging to the GO term and being enriched. They are unadjusted for multiple testing in this exploratory context. ViSEAGO offers all statistical tests and algorithms developed in the Bioconductor topGO R package, taking into account the topology of the GO graph by using the ViSEAGO::create_topGOdata method followed by the topGO::runTest method. A table of results that summarizes functional enrichment tests performed for each list of genes is built using the ViSEAGO::merge_enrich_terms method. The number of enriched GO terms is displayed in a barchart plot using ViSEAGO::GOcount. The number of GO terms overlapping between lists of interest is also available in the upset plot with ViSEAGO::Upset (Fig. 2, Section “Enrichment Analysis”). Thus, ViSEAGO allows comparison of biological functions associated with each list of enriched GO terms in the study. Users can interactively sort the table of results by p-values or query by GO term.
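To make the enrichment test concrete, the sketch below computes a one-sided Fisher's exact test p-value for a single GO term as a hypergeometric upper tail. It is a plain-Python illustration and not ViSEAGO or topGO code; the function name, its parameters, and the example counts are hypothetical, and it ignores the GO graph topology that the topGO algorithms can take into account.

```python
from math import comb

def fisher_enrichment_pvalue(background_size, background_annotated,
                             list_size, list_annotated):
    """One-sided Fisher's exact test for over-representation.

    Probability of seeing at least `list_annotated` genes annotated to the
    GO term in a list of `list_size` genes drawn without replacement from a
    background of `background_size` genes, of which `background_annotated`
    carry the annotation.
    """
    total = comb(background_size, list_size)
    upper = min(background_annotated, list_size)
    tail = sum(
        comb(background_annotated, k)
        * comb(background_size - background_annotated, list_size - k)
        for k in range(list_annotated, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20000 background genes, 200 annotated to the term,
# 100 genes of interest, 8 of which are annotated.
p = fisher_enrichment_pvalue(20000, 200, 100, 8)
print(f"unadjusted p-value: {p:.3g}")
```

The p-value returned is unadjusted, matching the exploratory setting described above; running the test over many GO terms would normally call for a multiple-testing correction.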
Abstract— Process mining is attracting growing attention and interest in the development of web-based information technology (IT). Process conformance, one of the process mining techniques, may be the most common method used to find (dis)similar working processes. The distinguishing processes found are instrumental in making business decisions, such as conduction strategies, service tactics and manufacturing processes, and in improving working efficiency. Several process conformance approaches have been developed and discussed over the past decade, but few of those discussions investigate why working processes are (un)conformable. To advance the function of process conformance, this paper introduces two parameters, Support and Confidence, which are used to newly define the distinguishability among various processes and to identify the root causes of the differences between processes. The “Support” parameter evaluates process similarity based on the working activity sequences (also called “from-to” workflows) and on the relationships among the various processes evaluated; the “Confidence” parameter measures process relationships as the ratio of the identical activities within two processes to the total activities of each individual process. Moreover, the two proposed parameters have been applied to a real case, in which the nursing processes of the pediatrics department of a hospital were measured and improved. To the best of the authors' knowledge, no existing technique fully considers the whole situation in which all process conformance factors are involved, and this paper is no exception. Nevertheless, this paper provides a way to find a lower bound on process distinguishability through the two proposed conformance parameters.
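The abstract gives only a verbal definition of the two parameters, so the following is a loose sketch under stated assumptions rather than the paper's formulas. It assumes a process is an ordered list of activity names, reads Confidence as the share of one process's distinct activities that also appear in the other, and extracts "from-to" pairs as consecutive activities; the Support computation itself is not reconstructed. All names and the example processes are hypothetical.

```python
def confidence(process_a, process_b):
    """Assumed reading: shared activities / total activities of process_a."""
    a, b = set(process_a), set(process_b)
    return len(a & b) / len(a) if a else 0.0

def from_to_pairs(process):
    """Consecutive (from, to) activity pairs of an ordered process."""
    return set(zip(process, process[1:]))

# Hypothetical nursing processes from two wards.
ward_a = ["admit", "triage", "examine", "medicate", "discharge"]
ward_b = ["admit", "examine", "medicate", "observe", "discharge", "bill"]

print(confidence(ward_a, ward_b))   # relative to ward_a's activities
print(confidence(ward_b, ward_a))   # relative to ward_b's activities
print(from_to_pairs(ward_a) & from_to_pairs(ward_b))  # shared transitions
```

Because the ratio is taken against each process's own activity count, the measure is asymmetric: two processes can share the same activities yet have different Confidence values depending on which one is the reference.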