

2.2.3 Graphs

In mathematics, a graph is an abstract representation of a set of objects and their relationships. The objects are called vertices, and the links that connect pairs of vertices are called edges. Typically, a graph is depicted in diagrammatic form as a set of dots for the vertices, joined by lines or curves for the edges. More formally, a graph is defined as follows.

Definition 2.1 A graph is a tuple G(V, E), with V a finite set of vertices and E ⊆ {(u, v) | u, v ∈ V} a set of edges.

Table 2.2: Example of a relational database representing molecules.

Molecule

Molecule ID   Mutagenicity
1             yes
2             yes
3             no
...

Atom

Atom ID   Molecule ID   Element    Type   Charge
1         1             carbon     22     -0.117
2         1             carbon     22     -0.117
3         1             carbon     22     -0.117
4         1             carbon     195    -0.087
5         1             carbon     195     0.013
6         1             carbon     22     -0.117
7         1             hydrogen   3       0.142
...

Bond

Molecule ID   Atom ID1   Atom ID2   Type
1             1          2          7
1             2          3          7
1             3          4          7
1             4          5          7
1             5          6          7
1             6          1          7
1             1          7          1
...

If G is a graph, we denote by V(G) the set of vertices of G and by E(G) the set of edges of G. A graph is called:

• undirected if edges (u, v) and (v, u) are not distinguished; otherwise the graph is directed. An edge of the form (v, v) is called a loop;

• a multigraph if the set of edges E is a multi-set;

• connected if there is a path between every two vertices;

2.2 Structured data 17

Figure 2.1: Two representations of the molecule aspirin: (a) graph structure; (b) 3D structure.

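• labeled if there is a labeling function λ : V ∪ E → Σ that assigns a label from an alphabet Σ to every vertex and edge.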

We denote the set of all graphs by G. In this text, we mostly consider undirected, labeled graphs G(V, E, λ, Σ), such as the one in Fig. 2.1a.
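To make the notion of an undirected, labeled graph concrete, the following minimal sketch stores the labels of vertices and edges in two dictionaries and fills them with the atom and bond rows of molecule 1 listed in Table 2.2. The class name LabeledGraph and its methods are illustrative assumptions, not part of the original text, and only the rows shown in the table are used.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Undirected labeled graph; the labeling function is split into two dicts."""
    vertex_labels: dict = field(default_factory=dict)  # vertex -> label
    edge_labels: dict = field(default_factory=dict)    # frozenset({u, v}) -> label

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("both endpoints must already be vertices")
        # A frozenset key makes (u, v) and (v, u) the same edge (undirected).
        self.edge_labels[frozenset((u, v))] = label

    def neighbors(self, v):
        return {w for e in self.edge_labels if v in e for w in e if w != v}


# Rows of molecule 1 as listed in Table 2.2: (Atom ID, Element).
atoms = [(1, "carbon"), (2, "carbon"), (3, "carbon"), (4, "carbon"),
         (5, "carbon"), (6, "carbon"), (7, "hydrogen")]
# Rows of molecule 1 as listed in Table 2.2: (Atom ID1, Atom ID2, Type).
bonds = [(1, 2, 7), (2, 3, 7), (3, 4, 7), (4, 5, 7),
         (5, 6, 7), (6, 1, 7), (1, 7, 1)]

g = LabeledGraph()
for atom_id, element in atoms:
    g.add_vertex(atom_id, element)
for a, b, bond_type in bonds:
    g.add_edge(a, b, bond_type)

print(len(g.vertex_labels), len(g.edge_labels))  # 7 7
print(sorted(g.neighbors(1)))                    # [2, 6, 7]
```

Keying edges by frozenset is one possible design choice for undirected graphs; a directed graph would use ordered pairs (u, v) as keys instead.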

Graphs can be used to represent many types of relational data. In this thesis, we will mainly use them as a representation for molecules, but there are many other examples, such as roadmaps (in which vertices are crossroads and edges are roads), social networks (in which vertices are people and edges represent the friend relationship) or protein-protein interaction networks (in which vertices are proteins and edges their interactions).

Example 2.3 The graph in Fig. 2.1a is a representation of the molecule aspirin, for which the 3D structure is depicted in Fig. 2.1b. In Fig. 2.1a, colours are used for the node labels, which indicate the chemical element: black for carbon, red for oxygen and white for hydrogen. Note that we do not show edge labels in this example.

Example 2.4 Figure 2.2 shows an example of the Gene Ontology [Ashburner et al., 2000], which is a hierarchy of gene functions that is developed by biologists. The graph that represents this ontology consists of directed edges, whose direction is indicated by the arrows. Moreover, there are no directed cycles, that is, there is no way to start at some vertex v and follow a sequence of edges that eventually loops back to v again. Such graphs are called directed acyclic graphs.
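The absence of directed cycles can be tested mechanically. The sketch below uses Kahn's algorithm, a standard technique that is not named in the text: it repeatedly removes vertices without incoming edges, and the graph is acyclic exactly when every vertex can be removed. The function name is_acyclic and the toy vertices are assumptions for illustration; they are not Gene Ontology terms.

```python
from collections import deque


def is_acyclic(vertices, edges):
    """Return True if the directed graph has no directed cycle (Kahn's algorithm)."""
    vertices = list(vertices)
    indegree = {v: 0 for v in vertices}
    successors = {v: [] for v in vertices}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    # Start from all vertices that have no incoming edge.
    queue = deque(v for v in vertices if indegree[v] == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)

    # Vertices on a directed cycle never reach indegree 0, so they are never removed.
    return removed == len(vertices)


# Hypothetical toy graphs with placeholder vertex names.
dag_edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
cyclic_edges = dag_edges + [("d", "a")]

print(is_acyclic(["a", "b", "c", "d"], dag_edges))     # True
print(is_acyclic(["a", "b", "c", "d"], cyclic_edges))  # False
```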

Figure 2.2: Example of a part of the Gene Ontology represented as a directed acyclic graph. [Node labels: biological process; physiological process; cellular process; cell cycle; M phase; meiotic cell cycle; cytokinesis; cellular physiological process; cell division; M phase of meiotic cell cycle; cytokinesis after meiosis I.]

The advantages of graphs are that they can naturally represent relational data and that they are intuitive because of their visual nature. However, the increased expressivity compared to attribute-value data comes at the cost of efficiency: many operations on graphs are computationally expensive. As an example, we introduce the subgraph isomorphism problem, which is known to be NP-complete [Garey and Johnson, 1979].

Definition 2.2 Two graphs G and H are isomorphic if there exists a bijection ϕ : V(G) → V(H) such that ∀u, v ∈ V(G) the following holds: (i) {u, v} ∈ E(G) ⇔ {ϕ(u), ϕ(v)} ∈ E(H), (ii) λ_G(u) = λ_H(ϕ(u)), and (iii) {u, v} ∈ E(G) ⇒ λ_G({u, v}) = λ_H({ϕ(u), ϕ(v)}).

Definition 2.3 Let G and H be graphs. G is a subgraph of H if (i) V(G) ⊆ V(H), (ii) E(G) ⊆ E(H), and (iii) λ_G(x) = λ_H(x) holds for every x ∈ V(G) ∪ E(G).


Figure 2.3: Molecular fragments represented as graphs (panels a, b and c).

Definition 2.4 A graph G is subgraph isomorphic to H iff G is isomorphic to a subgraph of H.

Intuitively, subgraph isomorphism checks whether a graph G is a part of another graph H. If it is, we say that G can be embedded in H. Figure 2.3 shows three graphs that could be representations of molecular fragments. The graphs in Fig. 2.3a and Fig. 2.3b can be embedded in the graph representing aspirin (see Fig. 2.1a), while the graph in Fig. 2.3c cannot.
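As an illustration of why this test is expensive, the following brute-force sketch enumerates all injective mappings from V(G) to V(H) and keeps those that preserve vertex labels, edges and edge labels, which is what being isomorphic to a subgraph of H amounts to under Definitions 2.2 to 2.4. The function names, the dictionary-based graph encoding and the toy graphs are assumptions made for this example; the graphs are not the fragments of Fig. 2.3. The number of candidate mappings grows factorially, in line with the NP-completeness of the problem.

```python
from itertools import permutations


def embeddings(g_vlabels, g_elabels, h_vlabels, h_elabels):
    """Yield every injective map phi: V(G) -> V(H) that embeds G in H.

    Vertex labels are dicts vertex -> label; edge labels are dicts
    frozenset({u, v}) -> label (undirected graphs without loops).
    """
    g_vertices = list(g_vlabels)
    for image in permutations(h_vlabels, len(g_vertices)):
        phi = dict(zip(g_vertices, image))
        # Vertex labels must be preserved.
        if any(g_vlabels[v] != h_vlabels[phi[v]] for v in g_vertices):
            continue
        # Every edge of G must map to an edge of H with the same label.
        ok = True
        for edge, label in g_elabels.items():
            mapped = frozenset(phi[v] for v in edge)
            if mapped not in h_elabels or h_elabels[mapped] != label:
                ok = False
                break
        if ok:
            yield phi


def is_subgraph_isomorphic(g_vlabels, g_elabels, h_vlabels, h_elabels):
    return next(embeddings(g_vlabels, g_elabels, h_vlabels, h_elabels), None) is not None


# Hypothetical host graph H: C-C=O with a hydrogen attached to the first carbon.
h_v = {1: "C", 2: "C", 3: "O", 4: "H"}
h_e = {frozenset((1, 2)): "single", frozenset((2, 3)): "double",
       frozenset((1, 4)): "single"}

# Hypothetical pattern G: a carbon double-bonded to an oxygen.
g_v = {"a": "C", "b": "O"}
g_e = {frozenset(("a", "b")): "double"}

print(list(embeddings(g_v, g_e, h_v, h_e)))  # [{'a': 2, 'b': 3}]

# The same pattern with a single bond cannot be embedded in H.
g_e_single = {frozenset(("a", "b")): "single"}
print(is_subgraph_isomorphic(g_v, g_e_single, h_v, h_e))  # False
```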

Graph theory has been an active research field for several centuries, and the properties of graphs have been thoroughly investigated. For example, classes of graphs that are well known in computer science are trees (connected graphs for which |V(G)| = |E(G)| + 1) and sequences (trees in which every vertex has at most two incident edges). We denote the set of all trees by T and the set of all sequences by S.

Example 2.5 Figure 2.3b shows an example of a tree, while Figure 2.3c is an example of a sequence.
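The two characterisations above translate directly into code. The sketch below checks connectivity with a depth-first search, then applies the vertex and edge count condition for trees and the degree condition for sequences. The function names and the toy graphs are assumptions for illustration; edges are given as pairs of an undirected simple graph without duplicates.

```python
def is_connected(vertices, edges):
    """Depth-first search: True if every vertex is reachable from an arbitrary start."""
    vertices = set(vertices)
    if not vertices:
        return True
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    start = next(iter(vertices))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for w in adjacency[u] - seen:
            seen.add(w)
            stack.append(w)
    return seen == vertices


def is_tree(vertices, edges):
    """A tree is a connected graph with |V| = |E| + 1."""
    return is_connected(vertices, edges) and len(set(vertices)) == len(edges) + 1


def is_sequence(vertices, edges):
    """A sequence is a tree in which every vertex has at most two incident edges."""
    degree = {v: 0 for v in vertices}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return is_tree(vertices, edges) and all(d <= 2 for d in degree.values())


# Hypothetical toy graphs.
star = (["c", "x", "y", "z"], [("c", "x"), ("c", "y"), ("c", "z")])
path = (["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")])
triangle = (["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")])

print(is_tree(*star), is_sequence(*star))          # True False
print(is_tree(*path), is_sequence(*path))          # True True
print(is_tree(*triangle), is_sequence(*triangle))  # False False
```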

Because sequences and trees have a more restricted structure than general graphs, the subgraph isomorphism can be computed for these in polynomial time [Garey and Johnson, 1979].

As Fig. 2.4 shows, the various classes of graphs are nested. The set of sequences is a subset of the set of trees, which is in turn a subset of the set of general graphs. There exist many graph classes X beyond sequences and trees that are still more specific than general graphs. These are of particular interest to us, since a possible strategy to improve the efficiency of graph mining techniques consists of looking for specific properties of particular graph classes and exploiting these properties [Horváth et al., 2006; Horváth and Ramon, 2008]. The idea is to find graph classes that are able to represent molecules and for which problems such as subgraph isomorphism are still efficiently computable.

Figure 2.4: The hierarchy of graphs. [Diagram: nested sets S ⊆ T ⊆ X ⊆ G.]

If the concept of a graph is generalised to a hypergraph, where an edge can connect any number of vertices, there is a clear relationship with relational databases: each tuple (v1, v2, ..., vn, l1, l2, ..., lm) in a relation R then corresponds to a hyperedge (v1, v2, ..., vn) with labels (l1, l2, ..., lm).
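The correspondence between tuples and hyperedges can be sketched in a few lines. In the example below, each row of a relation is split into its first n columns, taken as the vertices of a hyperedge, and its remaining columns, taken as labels. The names Hyperedge and relation_to_hyperedges are assumptions for illustration. The bond rows come from Table 2.2, reordered as (Atom ID1, Atom ID2, Type) with the Molecule ID column dropped, which is also an assumption; the ternary relation is purely hypothetical.

```python
from typing import NamedTuple


class Hyperedge(NamedTuple):
    vertices: tuple  # (v1, ..., vn)
    labels: tuple    # (l1, ..., lm)


def relation_to_hyperedges(rows, n):
    """Treat the first n columns of each tuple as vertices and the rest as labels."""
    return [Hyperedge(tuple(row[:n]), tuple(row[n:])) for row in rows]


# Some bond rows of Table 2.2, reordered as (Atom ID1, Atom ID2, Type).
bond_rows = [(1, 2, 7), (2, 3, 7), (1, 7, 1)]
bond_edges = relation_to_hyperedges(bond_rows, n=2)
print(bond_edges[0])

# Hypothetical relation with three vertex columns and one label column.
ternary_rows = [("v1", "v2", "v3", "red")]
ternary_edges = relation_to_hyperedges(ternary_rows, n=3)
print(ternary_edges[0].vertices, ternary_edges[0].labels)
```

With n = 2, the hyperedges are ordinary labeled edges, so a binary relation such as Bond yields a plain graph as a special case.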

For a more detailed overview of graph theory, we refer the reader to the book by Diestel [2000].