Mining Association Rules from Gene Ontology and Protein Networks: Promises and Challenges.

(1)

Mining Association Rules from Gene Ontology and Protein

Networks: Promises and Challenges.

Pietro Hiram Guzzi

1

, Marianna Milano

1

, and Mario Cannataro

1,2 1 _{Department of Medical and Surgical Sciences, University of Catanzaro, Italy}

hguzzi,m.milano,cannataro@unicz.it

2 _{ICAR-CNR, RENDE, Italy.}

cannataro@icar.cnr.it

Abstract

In biology, the accumulation of raw experimental data has been accompanied by the ac-cumulation of functional information encoded by using annotation terms. Such annotations are organized into ontologies such as Gene Ontology. Each annotation may be derived using diﬀerent methods that yield to diﬀerent reliability of such information. The analysis of this annotated data using association rules may evidence the co-occurrence of annotations improv-ing for instance annotation quality. Here we give a short survey of these methods discussimprov-ing possible future directions of research. We considered in particular the impact of the nature of annotations on algorithms performances by discussing two case studies and evidencing the impact of quality of annotations.

Keywords: Association rules; Gene Ontology; Annotation; Itemset; Electronic Inferred Annotations; Manual Annotations.

1 Introduction

The accumulation of raw experimental data about genes and proteins has been accompanied by the accumulation of functional information [7, 6, 8]. Usually biological knowledge is encoded by using annotation terms, i.e. terms describing for instance function or localization of genes and proteins. Terms are organized into ontologies, that oﬀer a formal framework to represent biological knowledge [17]. For instance, Gene Ontology (GO) provides a set of description of biological aspects (namely GO Terms), structured into three main taxonomies: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). Terms are then linked to the related biological concept (e.g. proteins) by a process known as annotation. Then for each protein a set of related terms (or annotation) is available and stored in publicly available databases, such as the Gene Ontology Annotation (GOA) database [5].

Consequently, for each biological concept, e.g. a gene or a protein, we may associate a set of GO Terms producing a set of records as follows: P06727, GO:0002227,GO:0006810,GO:0006869.

Volume 29, 2014, Pages 1970–1980

ICCS 2014. 14th International Conference on Computational Science

(2)

Classical approaches for analysing annotated data are for instance enrichment analysis, i.e. the determination of the under or over representation of an annotation in a set of annotated data [9], or semantic similarity, i.e. the determination of relatedness of two or more annotating terms, (see [17] for a complete review). Nevertheless the possibility to study annotated data by using classical pattern mining techniques such as itemset mining is still an unexplored area. Litera-ture reports some approach of frequent itemset mining for: (i) prediction of novel annotations, (ii) improving the unsupervised annotation of biomolecules ( see [27, 25]). A recent survey by Naularets et al, reports only two works that apply frequent itemset mining on annotated data [29].

Here we extend the analysis of that survey, by providing an overview of frequent item-set mining technique, a concise formulation of the problem, and a deep discussion of related problems.The main contribution of this paper is to provide an introductory survey on frequent itemset mining over annotations and annotated data for the development of novel algorithms.

The rest of the paper is structured as follows: Section 2 introduces the problem from a Computer Science perspective. Section 3 introduces GO and GO-based annotation. Section 4 discusses current approaches and main issues. Finally Section 5 concludes the paper.

2 Background on Frequent Itemset and Association Rules

Mining

2.1 Itemset Mining

We here deﬁne the problem of Itemset mining by using as introductory examples some pro-teins annotated with Gene Ontology terms. Let I be the set of all possible items, i.e. all the GO Terms taken from Gene Ontology. An itemset is a subset X = {i₁, i₂....i_k};, e.g. {GO:0002227,GO:0006810,GO:0006869}. A transaction over I is a pair T=(tid, I), where tid is the transaction identiﬁer and I is an itemset; e.g. T = {(P 06727), (GO : 0002227, GO : 0006810, GO : 0006869)}. A transaction database D is a set of transactions over I. The support of an itemset X is the number of transactions that contain the itemset X: support(X, D) = |{tid|(tid, I) ∈ D, X ⊆ I|}. A frequent itemset is a itemset whose support is more than or equal to some threshold minimum supportσ [28].

2.2 Mining Association rules from Itemset

An association rule is an implication expression of the form X ⇒ Y where X and Y are disjoint itemsets. The sets of items X and Y are respectively called antecedent (left-hand-side or LHS) and consequent (right-hand-side or RHS) of the rule. An example rule, in a database of annotated proteins, is the implication: “ Nuclear localization ⇒ Origin:eukaryota ”, that means that every protein annotated as localized in nucleus has a eukaryotic origin [1]. The strength of an association rule can be measured in terms of its support and conﬁdence. The support of an association ruleX ⇒ Y is the support of X ∪ Y

support(X ⇒ Y, D) = support(X ∪ Y, D)

The conﬁdence of an association ruleX ⇒ Y is the conditional probability of having Y contained in a transaction, given that X is contained in that transaction:

conﬁdence(X ⇒ Y, D) =support(X∪Y,D)_support(X,D) .

The rule is called confidance if its confidence is higher than a minimal confidence threshold γ, where 0 ≤ γ ≤ 1. Itemset mining is actually a special case of association rule mining.

(3)

Association rule mining is typically the step conducted after the actual itemset mining, as the rules can be derived from the itemsets.

2.3 Evaluating Discovered Itemsets: Interestingness measures

Support and conﬁdence are the two most widely interestingness measures used. Support is an important measure because a rule that has low support may occur simply by chance. Con-ﬁdence, instead, measures the reliability of the inference made by a rule. Other frequently used measures are lift and coverage. The lift of an association rule is the ratio of the observed support for this association rule, to the expected support if X and Y were independent: lift (X ⇒ Y, D)=_{support(X,D)support(Y,D)}support(X∪Y,D) .

The coverage of an association rule measures how often the rule is applicable in the transaction database: coverage (X ⇒ Y, D)=support(X ,D)

2.4 Algorithms for Itemset Mining

Frequent itemsets play an essential role in many data mining tasks that try to find interesting patterns from databases such as association rules, correlations, sequences, classifiers, clusters and many more of which the mining of association rules is one of the most popular problems. Efficient algorithms for mining frequent itemsets are crucial for these data mining tasks [15]. There are many algorithms developed to mine frequent itemsets. In general, all itemset min-ing algorithms repeatedly generate relatively small collections of candidate frequent itemsets, count their supports and remove all itemsets that turn out to be unfrequent. The problem of identifying all the frequent itemsets is the bottleneck of all the algorithms. Given a set of items I, there are potentially 2|I|itemsets, and however only a small fraction of these is frequent [28]. Consequently, different algorithms have been introduced to improve the extraction of relevant itemsets.

Apriori [28] is the most important data mining algorithm for mining frequent itemsets

and association rules. In order to mine frequent itemsets, the algorithm considers each item as candidate, then it counts its support and prunes the candidate itemsets if they appear in few transactions. In the next iteration, the algorithm generates candidate itemsets using the frequent itemsets discovered in previous step. In this way the exponential growth of candidates is controlled. The process terminates when Apriori generates none candidate itemsets. Apriori achieves good performance by reducing the size of candidate sets. However, Apriori algorithm usually generates redundant itemsets, and the number of patterns it ﬁnds rapidly explodes unless parameters are stringently controlled [28].

The FP-growth [19] algorithm introduced a novel data structure, called frequent pattern tree ( FP-tree) [15] for frequent itemsets mining. FP-tree is constructed by using a transactional database and mapping each transaction onto a path in the FP-tree. Each node in the tree contains the label of an item with a counter that records the number of transactions mapped onto the given path [19]. FP-Growth algorithm is applied on this FP-tree [15]. FP-growth finds all the frequent itemsets ending with a particular suffix using a divide-and-conquer strategy to split the problem into smaller subproblems. Since every transaction is mapped onto a path in the FP-tree, it is possible to derive the frequent itemsets ending with a particular suffix, by examining only the paths containing a particular node. In turn, each of these subproblems is further decomposed into smaller subproblems. The solutions of each subproblem return frequent itemsets. [32].

(4)

together with its tidlist. The algorithm computes the support of an itemset by simply inter-secting the tidlist of any subsets [15]. To extract the frequent items the support count of each item is computed. If the support count of an itemset is greater than the minimum support threshold, then the item is a frequent item.

2.5 Tools for Frequent Itemset Mining

Many popular software frameworks for frequent itemset mining are available. Among these

Arules [18] for R (www.r-project.org) and Weka [33] are widely used.

Arules is a R package providing support for managing transaction data and for extracting

frequent itemsets and association rules. It receives as input transaction data that in general consists of tuples of the form: (< transactionID, itemID, ... >). Then it transforms data into a binary incidence matrix with columns corresponding to the different items and rows corresponding to transactions. Each element (i, j) of this matrix contains 1 if an item is present in a particular transaction, 0 elsewhere. This format is often called the horizontal database layout. Alternatively, transaction data can be represented in a vertical database layout in the form of transaction ID lists. In Arules both layouts are implemented. The result of mining transaction data in Arules are associations. Generally, associations are sets of objects describing the relationship between some items (e.g., as an itemset or a rule) which have assigned values for different measures of quality such, support, confidence, lift. The associations currently implemented in the arules package are sets of itemsets (e.g. used for frequent itemsets) and sets of rules (e.g. association rules). Class itemsets contain one itemMatrix object to store the items as a binary matrix where each row in the matrix represents an itemset. In addition, it may contain transaction ID lists as an object of class tid Lists. Class rules consist of two itemMatrix objects representing the left-hand-side (LHS) and the right-hand-side (RHS) of the rules, respectively. Arules implements both Apriori and Eclat algorithms and it is capable to produce frequent itemsets as output. The output of both algorithms is the sets of mined associations that can be further analyzed using the functionality provided for these classes.

Weka is a comprehensive suite of Java class libraries that enable to implement many machine

learning and data mining algorithms. Weka contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality. For the generation of association rules, Weka contains an implementation of the Apriori algorithm. It looks for any rules to capture strong associations between diﬀerent attributes [33].

3 Gene Ontology and Annotations

Gene Ontology [20] (GO) is one of the main resource of biological information since it provides a specific definition about protein functions. GO is a structured and controlled vocabulary of terms, called GO terms. GO is subdivided in three non-overlapping ontologies, Molecular Func-tion (MF), Biological Process (BP) and Cellular Component (CC), so, each ontology describes a particular aspect of a gene or protein functionality. GO has a specific structure, Directed Acyclic Graph (DAG), where the terms are the nodes and the relations among terms are the edges. This allows for more flexibility than a hierarchy, since each term can have multiple relationships to broader parent terms and more specific child terms [12]. Genes or proteins are connected with GO terms through annotations, this procedure is known as annotation process. Each annotation in the GO has a source and a database entry attributed to it. The source can be a literature reference, a database reference or computational evidence. Each biological

(5)

molecule is associated with the most specific set of terms that describe its functionality. Then, if a biological molecule is associated with a term, it is also associated with all the parents of that term [12]. 18 different annotation processes exist that are identified by an evidence code, the main attribute of an annotation. The evidence codes available describe the basis for the annotation. A main distinction among evidence codes is represented by electronIcally infErred Annotations (IEA), i.e. annotations that are determined without user supervision, and non-IEA or manual annotations, i.e. annotations that are supervised by experts.

3.1 GO-based Analysis of Biological Data

Main approaches based on the use of Gene Ontology to analyse biological data can be subdivided in: (i) tools for ﬁnding a set of over(under)-represented annotations (functional enrichment

analysis), (ii) tools for evaluating the semantic similarity of two GO terms or two genes or

proteins.

The rationale of functional enrichment analysis is that an experiment investigating a biolog-ical phenomenum should individuate a set of genes/proteins whose annotations are correlated, e.g. they present a common annotation that is statistically over represented considering all the annotations of the population. Functional enrichment analysis therefore individuates a set of shared annotations among selected genes, in order to make a biological meaning to the selected genes/proteins. There exist currently more than 60 bioinformatics tools that per-form such analysis, for example GO-Lorize [14] that visualizes functional enrichment on a map, ClueGO [3] that performs different functional enrichments, and CytoSevis [16] that enables to visualizes biological network by using a semantic-based layout, (see paper by Huang for a complete review [21]). They can be cathegorized on the basis of different criteria [22]: the kind of statistical model, the annotation databases, or the kind of biological data in input. Here we follow the classification proposed in [22] that distinct all the approaches in: (i) singular enrich-ment analysis (SEA), (ii) gene set enrichenrich-ment analysis (GSEA), and (iii) modular enrichenrich-ment analysis (MEA).

Regarding the tools for evaluating the Semantic Similarity of GO Terms, it should be consid-ered the diﬀerence with respect to other similarity measures well established in computational biology community. While sequence or structure-based similarity of genes and proteins has been largely investigated, the similarity based on functions presents a more complex scenario. In fact, while primary and tertiary structures can be compared in terms of number of shared amminoacids or in terms of spatial conformation, the comparison of the functions need the in-troduction of a comparison metrics among terms that are expressed often in natural language. The adoption of ontologies for managing annotations provides means to compare entities on aspects that would otherwise not be comparable. For instance, if two gene products are anno-tated within the same schema, we can compare them by comparing the terms with which they are annotated [30].

The annotations of biological concepts are currently organized in simple taxonomies or more complex ontologies, such as Gene Ontology. The use of ontologies enables the comparison of annotations in terms of analysis of the ontology schema. Thus, the problem to deﬁne the semantic similarity of two terms can be solved in terms of analysis of the underlying ontology starting from the similarity of shared terms. Several approaches are available to quantify semantic similarity between terms or annotated entities in an ontology represented as a directed acyclic graph such as GO. The most common measures are: the Resnik’s [31], Lin’s [26] and Jiang and Conrath’s [23] measures.

(6)

FussiMeg1 [10] is a measure of similarity among gene products. It is implemented in a freely available web server. It oﬀers the calculation of similiarity among GO terms or among pro-teins. For proteins, the similarity is calculated considering the associated GO terms stored in GOA. FussiMeg uses as similarity measures the well-know Lin, Jiang-Conrath and Resnik measures. The main limitation of FussiMeg is the possibility to insert only two protein identi-ﬁers. FussiMeg is currently available but is no longer maintained because it has been evolved in ProteinOn.

ProteinOn [11] is a web server that enables to retrieve the annotations shared in a list of interacting proteins not only a pair. The User can copy and paste this list into the web server, and the server calculates the similarity among all the protein pairs of the list. It employs the same similarity measures of FussiMeg. With respect to its ancestor it also oﬀers the possibility to retrieve all the interactors of a query protein, the associated GO terms and their statistical signiﬁcance.

4 Mining Annotation Data

Currently there are few approaches of Frequent Itemset Mining of annotated data. They may be categorized on the basis of their aims in: (i) Methods for prediction of Novel Annotations, i.e. the co-occurrence of annotations associated to other information e.g. pattern of co-expression or interaction may be used to annotated protein with unknown functions; (ii) Methods for prediction of Novel Interaction Among Proteins, e.g Proteins that share a set of annotations are more likely to be interacting; (iii) Methods for Identiﬁcation of hidden relationships among proteins on the basis of the co-occurrence of annotations.

Karpinets et al., [24], introduced a framework to mine annotations (GO, sequence, struc-ture) of biological concepts. Such annotations are integrated into a network model called Anets. Anets are networks of annotations in which nodes are annotations while edges connect related annotations on the basis of analysis of their co-occurrence using association rules. The frame-work utilizes a semantic-preserving vocabulary to convert records of biological annotations of an object, such as an organism, gene, chemical or sequence, into networks (Anets) of the asso-ciated annotations. An association between a pair of annotations in an Anet is determined by the similarity of their co-occurrence patterns with all the other annotations in the data. This feature captures associations between annotations that do not necessarily co-occur with each other and facilitates discovery of the most signiﬁcant relationships in the collected data through clustering and visualization of the Anets.

4.1 Issues on Frequent Itemset Mining

This section will explain main issues that arise from the use of itemset mining for the analysis of annotation data. We distinct them in internal problems, i.e. problems related to the algorithms themselves, and external problems, i.e. problems related to the structures of GO and to the quality and quantity of available annotations.

4.2 Internal Problems: Issues related to Mining Algorithms

As noted in [4] computational biology poses speciﬁc challenges to Itemset Mining. A ﬁrst chal-lenge is represented by the dimension of typical dataset. For instance, as stated in GOA

(7)

Table 1: Annotation Statistics for Diﬀerent Species. IEA column reports the number of IEA annotations. non-IEA column reports the number of non-IEA annotations.

Species IEA non IEA Total AVG for protein

UNIPROT 213544213 1448170 214992383 6 HUMAN 173436 211291 384727 8 MOUSE 102518 256872 359390 8 RAT 89256 68191 157447 7 ARABIDOSIS 52194 101804 153998 5 ZEBRAFISH 69047 21266 90313 3 CHICKEN 55232 6933 62165 4 COW 95548 16019 111567 5 DICTY 13892 19355 33247 4 DOG 108310 5132 113441 5 PIG 100215 4773 104988 5 FLY 24818 849075 10972 6 WORM 27910 27551 55461 4 YEAST 10239 53839 64078 9

database statistics, Homo Sapiens has 46, 346 proteins and 384, 727 annotations (see Ta-ble 1 that summarizes these data for main organisms). Consequently, if we map each pair (protein, annotations) as a transaction in which each protein is a row (or transaction identiﬁer -TID-) and each annotation as a column (or item), the table becomes very wide. Such datasets pose a great challenge for existing (closed) frequent itemset mining algorithms, since they have an exponential number of combinations of items with respect to the row length.

4.3 External Problems: Issues related to Gene Ontology Annotations

Issues related to the nature of GO annotations may be grouped on two main categories: (i) Number of Annotations, (ii) Nature of Annotations.

Number of Annotations

Due to the different methods, and to the different availability of experimental data, the number of annotations for each protein or gene is highly variable within the same GO taxonomy and over different species. The number of annotation terms is highly variable among proteins within the same GO and over different GOs and species.

Nature of Annotations

The association among each biological concept and its related GO Term may be performed with 14 diﬀerent methods. For example, esperimental methods infer annotations directly from experimental evidence. These annotations have an associated evidence code as, EXP, IDA, IPI, IMP etc...; or computational methods which infer annotations which have an associated evidence code as, ISS, ISO, ISA, ISM etc... [2] These methods are in general grouped onto two main categories: experimentally veriﬁed (or manual) and electronIcally infErred Annotations (IEA). To keep track of the method used to annotate a protein with GO Terms, each annotation is labeled with an evidence code (EC).

Manuals annotations are in general more precise [17] and speciﬁc than IEA ones. Unfor-tunately their number is in general lower and this ratio is variable as clearly shown in Table 1. A considerable number of genes and proteins is annotated with generic GO terms (this is particular evident when considering novel or not well studied genes). The role of these general

(8)

Table 2: Results on CYC2008 database using IEA, non-IEA and both.

Parameter

-Dataset

IEA non-IEA Both

Sensitivity 0,53 0,65 0,58

Speciﬁcity 0,42 0,51 0,50

annotations is to suggest the area in which the proteins or genes operate. This phenomenon affects particularly IEA annotations derived, from instance, from literature. Since classical frequent itemset algorithms do not distinct among attributes they may be influenced from the presence or absence (of IEA terms). A possible solution is represented on the definition of novel itemset algorithms that are aware of the Evidence Code of annotations.

4.4 Impact of Evidence Codes on annotation mining: two case

stud-ies.

We here presents some preliminary studies on the impact of the presence/absence of IEA an-notation through two case studies of application of association rules. In the first example we consider the problem of discovering protein complexes starting from co-occurrence of GO an-notation. For these aims we use as case studies yeast protein complexes contained in CYC2008 database and related GO annotations. We first use only non-IEA annotation, then only IEA, and finally both IEA and non-IEA. We used the Arules R package for mining.

We define the problem of extracting the GO signature of protein complexes as the mining for frequent co-occurrences in the annotation corpus. We built each transaction in the following way. The transaction identifier represents the interaction and it is realized by merging the identifiers of two interacting proteins. The items are composed by the union of the annotations corresponding to each proteins. Let us consider for instance a protein complex composed from proteins A, B, and C. Let us suppose that binary interactions are (A,B), (B,C), and (A,C). Consequently our transaction database is made from three transactions that have respectively AB, BC, and AC as identifiers and related annotations as items.

The underlying hypothesis is that co-occurrence of items should correspond to related pro-tein complexes or as a signature of a speciﬁc propro-tein complexes. In this preliminary work we studied how a frequent annotation may be a signature of a complex (e.g. a predictor of a protein complex).

Consequently, we considered this problem as a multi-class classification problems and we considered traditional measures. We runned Arules employing the Apriori algorithm over the cyc2008 database and latest GO annotations in order to quantify the impact of the presence of GO annotations obtaining results summarized in Table 2. Consequently, for each run we measured Sensitivity (i.e. proportion of actual true positives which are correctly identified as such e.g. how many times the co-occurrence of annotations predict correctly a protein complex), and specificity ( i.e. how many times the co-occurrence of the annotation happens in a given complex). Results confirm that there is not a big difference and in particular the presence of IEA annotations does not impact in a positive way the mining process.

In the second case study we considered the analysis of co-occurrences of annotation in Pfam [13] families. Pfam is a structured database of protein domains and families. The current release of Pfam (22.0) contains 9318 protein families. Pfam is available on the web (http://pfam.sanger.ac.uk/. Each family of Pfam is therefore composed by proteins that have a similar structure and then may have a close function. The hypothesis is that co-occurence

(9)

of annotation may be indicative of the Pfam family, i.e. proteins of the same family should share the same co-annotations.

We define the problem of mining annotated data as the mining for frequent co-occurrences in the annotation corpus. We built each transaction in the following way. The transaction identifier represents a protein. Let us consider for instance a protein family composed from proteins A, B, and C. Consequently our transaction database is made from three transactions that have respectively A, B, and C as identifiers and related annotations as items.

In this preliminary work we studied how a frequent annotation may be a signature of a family (similarly to the first problem). We considered as input dataset proteins in the Pfam databases and three different annotation corpus: only non-IEA, only IEA and both IEA and non-IEA. Then for each run we extracted rules using A-rule with following parameters: support=0,01 and confidance=0,1.

Finally, we measured the average support,average conﬁdence and average lift. Then we decided to measure the average information content (IC) of GO term [17] for each rule and we considered the average IC of all the rules. Conversely there is a signiﬁcative growth in File size. Table 3 summarizes these results.

Table 3: Results on Pfam Database

Dataset File Size Rules Avg Support Avg Conﬁdence Avg Lift Avg IC

IEA 3000 45 0,62 0,911 3,187 101,34

non-IEA 2000 25 0,629 0,644 2,008 54,33

IEA-nonIEA 5000 17 0,69 0,735 5,733 75,21

5 Conclusion and Future Directions

Gene Ontology (GO) provides a set of annotations (namely GO Terms) for each biological concept, i.e. genes or proteins. The analysis of this annotated data using itemset mining seems to be a trivial task. Nevertheless, the use of frequent itemset mining is less popular with respect to other techniques, such as statistical-based methods or semantic similarities. Nevertheless, Frequent Itemset Mining may be used to extract more meaningful information from annotated biological data when some speciﬁc challenges will be faced.

First issue is dependent from quantity of annotations. It should be investigated deeply the impact of presence of IEA annotation on mining algorithms and the loss of information due to their absence. We think that this problem may be dependent on the species. As evidenced in this preliminary study we noted that the presence of IEA annotations has not a positive impact on the quality of association rules, while we noted a significative growth in execution times that we do no report for brevity. We retain that this problem may be faced on two possible ways: (i) by developing Evidence Code aware algorithm, or (ii) by introducing a preprocessing step that filters out annotations on the basis of their impact. We also retain that a more deep study by considering also the different ontologies of GO, i.e. Molecular Function, Cellular Component and Biological Process, should be made.

Acknowledgements

This work has been partially funded by project MIUR PON Smarticities DICET-INMOTO-ORCHESTRA PON04a2 D.

(10)

References

[1] Irena I. Artamonova, Goar Frishman, Mikhail S. Gelfand, and Dmitrij Frishman. Mining sequence annotation databanks for association patterns. Bioinformatics, 21(Suppl 3):iii49–iii57, November 2005.

[2] Daniel Barrell, Emily Dimmer, Rachael P. Huntley, David Binns, Claire O’Donovan, and Rolf Apweiler. The GOA database in 2009–an integrated gene ontology annotation resource. Nucleic acids research, 37(Database issue):D396–403, January 2009.

[3] Gabriela Bindea, Bernhard Mlecnik, Hubert Hackl, Pornpimol Charoentong, Marie Tosolini, Amos Kirilovsky, Wolf-Herman Fridman, Franck Pagès, Zlatko Trajanoski, and Jérôme Galon. Cluego: a cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation net-works. Bioinformatics, 25(8):1091–1093, 2009.

[4] Christian Borgelt. Eﬃcient implementations of apriori and eclat. In Proc. 1st IEEE ICDM Workshop on Frequent Item Set Mining Implementations (FIMI 2003, Melbourne, FL). CEUR Workshop Proceedings 90, 2003.

[5] Evelyn Camon, Michele Magrane, Daniel Barrell, Vivian Lee, Emily Dimmer, John Maslen, David Binns, Nicola Harte, Rodrigo Lopez, and Rolf Apweiler. The gene ontology annotation (goa) database: sharing knowledge in uniprot with gene ontology. Nucl. Acids Res., 32(suppl 1):D262– 266, January 2004.

[6] Mario Cannataro, Pietro H Guzzi, and Pierangelo Veltri. Impreco: Distributed prediction of protein complexes. Future Generation Computer Systems, 26(3):434–440, 2010.

[7] Mario Cannataro, Pietro H. Guzzi, and Pierangelo Veltri. Protein-to-protein interactions: Tech-nologies, databases, and algorithms. ACM Comput. Surv., 43:1:1–1:36, December 2010.

[8] Mario Cannataro, Pietro Hiram Guzzi, and Pierangelo Veltri. Using ontologies for querying and analysing protein-protein interaction data. Procedia Computer Science, 1(1):997–1004, 2010. [9] Young-Rae Cho, Marco Mina, Yanxin Lu, Nayoung Kwon, and Pietro H Guzzi. M-ﬁnder:

Un-covering functionally associated proteins from interactome data integrated with go annotations. Proteome Sci, 11(Suppl 1):S3, 2013.

[10] Francisco M. Couto, Mrio J. Silva, and Pedro M. Coutinho. Measuring semantic similarity between gene ontology terms. Data & Knowledge Engineering, 61(1):137 – 152, 2007. Business Process Management - Where business processes and web services meet.

[11] Faria Daniel, Pesquita Catia, Couto Francisco, and Falcao Andre. Proteinon: A web tool for protein semantic similarity. In in proceedings of ISMB/ECCB 2007.

[12] Louis du Plessis, Nives Skunca, and Christophe Dessimoz. The what, where, how and why of gene ontology–a primer for bioinformaticians. Brieﬁngs in bioinformatics, 12(6):723–735, November 2011.

[13] Robert D. Finn, John Tate, Jaina Mistry, Penny C. Coggill, Stephen John J. Sammut, Hans-Rudolf R. Hotz, Goran Ceric, Kristoﬀer Forslund, Sean R. Eddy, Erik L. Sonnhammer, and Alex Bateman. The pfam protein families database. Nucleic acids research, 36(Database issue):D281– D288, January 2008.

[14] Olivier Garcia, Cosmin Saveanu, Melissa Cline, Micheline Fromont-Racine, Alain Jacquier, Benno Schwikowski, and Tero Aittokallio. Golorize: a cytoscape plug-in for network visualization with gene ontology-based layout and coloring. Bioinformatics, 23(3):394–396, 2007.

[15] B. Goethals. Survey on frequent pattern mining, 2003.

[16] PH Guzzi and M Cannataro. Cyto-sevis: semantic similarity-based visualisation of protein inter-action networks. EMBnet. journal, 18(A):pp–32, 2012.

[17] PH Guzzi, Marco Mina, Concettina Guerra, and Mario Cannataro. Semantic similarity analysis of protein data: assessment with biological features and issues. Brieﬁngs in bioinformatics, 13(5):569– 585, 2012.

(11)

itemsets, 2006, URL http://cran.r-project.org/, r package version. SIGKDD Explorations, 2:0–4, 2007.

[19] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In Weidong Chen, Jeﬀrey Naughton, and Philip A. Bernstein, editors, 2000 ACM SIGMOD Intl. Conference on Management of Data, pages 1–12. ACM Press, May 2000.

[20] M. A. Harris, J. Clark, A. Ireland, J. Lomax, M. Ashburner, R. Foulger, K. Eilbeck, S. Lewis, B. Marshall, C. Mungall, J. Richter, G. M. Rubin, J. A. Blake, C. Bult, M. Dolan, H. Drabkin, J. T. Eppig, D. P. Hill, L. Ni, M. Ringwald, R. Balakrishnan, J. M. Cherry, K. R. Christie, M. C. Costanzo, S. S. Dwight, S. Engel, D. G. Fisk, J. E. Hirschman, E. L. Hong, R. S. Nash, A. Sethu-raman, C. L. Theesfeld, D. Botstein, K. Dolinski, B. Feierbach, T. Berardini, S. Mundodi, S. Y. Rhee, R. Apweiler, D. Barrell, E. Camon, E. Dimmer, V. Lee, R. Chisholm, P. Gaudet, W. Kibbe, R. Kishore, E. M. Schwarz, P. Sternberg, M. Gwinn, L. Hannick, J. Wortman, M. Berriman, V. Wood, P. Tonellato, P. Jaiswal, T. Seigfried, and R. White. The gene ontology (go) database and informatics resource. Nucleic Acids Res Nucleic Acids Res, 32(Database issue):258–61, Jan-uary 2004.

[21] Da Wei Huang, Brad T. Sherman, and Richard A. Lempicki. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Research, 37(1):1–13, 2009.

[22] Da Wei Huang, Brad T. Sherman, and Richard A. Lempicki. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Research, 37(1):1–13, 2009.

[23] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In International Conference Research on Computational Linguistics (ROCLING X), pages 9008+, September 1997.

[24] Tatiana V. Karpinets, Byung H. Park, and Edward C. Uberbacher. Analyzing large biological datasets with association networks. Nucleic Acids Research, 40(17):e131, 2012.

[25] M. Koyut¨urk, Y. Kim, S. Subramaniam, W. Szpankowski, and A. Grama. Detecting conserved interaction patterns in biological networks. J Comput Biol, 13(7):1299–1322, September 2006. [26] Dekang Lin. An information-theoretic deﬁnition of similarity. In Jude W. Shavlik and Jude W.

Shavlik, editors, ICML, pages 296–304. Morgan Kaufmann, 1998.

[27] Prashanti Manda, Seval Ozkan, Hui Wang, Fiona McCarthy, and Susan M. Bridges. Cross-Ontology multi-level association rule mining in the gene ontology. PLoS ONE, 7(10):e47411+, October 2012.

[28] Stefan Naulaerts, Pieter Meysman, Wout Bittremieux, Trung N. Vu, Wim V. Berghe, Bart Goethals, and Kris Laukens. A primer to frequent itemset mining for bioinformatics. Brieﬁngs in Bioinformatics, pages bbt074+, October 2013.

[29] Stefan Naulaerts, Pieter Meysman, Wout Bittremieux, Trung Nghia Vu, Wim Vanden Berghe, Bart Goethals, and Kris Laukens. A primer to frequent itemset mining for bioinformatics. Brieﬁngs in Bioinformatics, 2013.

[30] Catia Pesquita, Daniel Faria, Andre O. Falcao, Phillip Lord, and Francisco M. Couto. Semantic similarity in biomedical ontologies. PLoS Comput Biol, 5(7):e1000443, 07 2009.

[31] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In IJCAI, pages 448–453, 1995.

[32] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Addison Wesley, June 2006.

[33] I. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes, and S. Cunningham. Weka: Practical machine learning tools and techniques with java implementations, 1999.

[34] Mohammed J. Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, and Wei Li. New algorithms for fast discovery of association rules. Technical report, Rochester, NY, USA, 1997.