5 Multi-Relational Rule Discovery

5.2 Implementation

The basic algorithm for multi-relational rule discovery, as outlined in the previous section, demonstrates the primary concepts involved in an MRDM algorithm. For a truly practical implementation, however, a number of details must be dealt with, especially where efficiency and scalability are concerned. This section presents such an implementation.

The commercial Data Mining package Safarii [93] provides an implementation of multi-relational rule discovery. The package boasts a Client/Server architecture with a clear separation between the relatively lightweight search process and the computation-intensive evaluation of candidate patterns in the database. This separation is attractive from an architectural point of view because it allows a variety of RDBMSs to be used, ranging from well-known commercial database systems to Data Mining-optimised query engines such as Monet [15], without having to alter software on the client side. Furthermore, the client and server can be run on separate workstations, each optimised for the specific workload, thus making a first, but essential, step towards scalability.

The separation of the search process from the evaluation of candidate patterns is achieved through the definition of so-called multi-relational data mining primitives: small data structures of statistical information concerning some aspect of the contents of the database. A data mining primitive typically contains sufficient information to determine the value of the evaluation measure for a range of similar candidate patterns. The mining algorithm never accesses the data directly, but only through a small set of predefined primitives. Each statistical summary can generally be produced by a single query to the database, for example a single SQL statement. As a result, the RDBMS can optimise the data access required for evaluating multiple patterns in one single process. Furthermore, some of the decisions about exactly which patterns to evaluate are left to the RDBMS.

Multi-relational data mining primitives are covered in detail in Chapter 10, but we briefly demonstrate their benefits here. Consider a good example of a molecular database: the Mutagenesis database [96], which describes a set of 188 molecules falling into two classes, mutagenic and non-mutagenic. Assume that we are considering the multi-relational pattern ‘molecules that contain an atom’. We have already obtained counts for the positive (i.e. mutagenic = T) and the negative examples, 125 and 63 respectively. We can now run a primitive known as NominalCrossTable in order to obtain information concerning the presence of atoms of a specific element, and the effect this presence may have on the mutagenicity. The primitive in question can be expressed in SQL as follows. It lists the elements that occur and, for each element, counts the positive and negative examples in which it occurs.

Figure 5.2 The Safarii MRDM system.

SELECT molecule.mutagenic, atom.element, COUNT(DISTINCT molecule.id)
FROM molecule, atom
WHERE molecule.id = atom.molecule_id
GROUP BY molecule.mutagenic, atom.element

Figure 5.3 shows the result of this query. The statistical information obtained tells us that a condition refinement involving atom.element is of limited use in the present context. The elements carbon, hydrogen, nitrogen and oxygen (C, H, N, O) appear in all individuals, and the remaining elements are infrequent, producing rules of low support. Chlorine may be of some interest, as it appears in 12.7% of the negative examples, compared to 2.4% for the positive examples. A rule involving Cl has a slight (negative) novelty for a rule predicting mutagenic = T:

$$P(ST) - P(S)\,P(T) = \frac{3}{188} - \frac{11}{188} \cdot \frac{125}{188} = -0.023$$

Rules with a negative novelty indicate that a rule with the inverse conclusion is novel. Note that it does not make sense to add up counts over the elements, as molecules can contain different elements. It does, however, make sense to add counts over the target attribute, as this appears in the target table. Furthermore, note how the primitive only reports combinations that actually occur in the database for the pattern at hand. For example, it does not consider the 100-odd remaining elements of the periodic table.
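The novelty of the chlorine rule can be checked with a few lines of Python. This is only a sketch: the function name `novelty` and its parameter names are chosen here for illustration and are not part of Safarii. The counts are those given in the text (188 molecules, 125 of them mutagenic, 11 containing chlorine, 3 of which are mutagenic).

```python
def novelty(n_total, n_target, n_subgroup, n_both):
    """Novelty P(ST) - P(S)P(T), computed from absolute counts.

    n_total    -- number of individuals in the database
    n_target   -- individuals with the target value (T)
    n_subgroup -- individuals covered by the pattern (S)
    n_both     -- individuals covered by the pattern that have the target value
    """
    p_st = n_both / n_total
    p_s = n_subgroup / n_total
    p_t = n_target / n_total
    return p_st - p_s * p_t


# Rule "contains Cl -> mutagenic = T": 3 of the 11 Cl molecules are mutagenic.
cl_positive = novelty(188, 125, 11, 3)

# Inverse rule "contains Cl -> mutagenic = F": the other 8 Cl molecules.
cl_negative = novelty(188, 63, 11, 8)

print(round(cl_positive, 3))  # -0.023
print(round(cl_negative, 3))  # 0.023
```

The two values are equal in magnitude and opposite in sign, which is why a negative novelty for one conclusion signals a novel rule for the inverse conclusion.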

Figure 5.3 A NominalCrossTable primitive.
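To experiment with the NominalCrossTable query, it can be run against a small in-memory database. The sketch below uses Python's standard `sqlite3` module rather than the Safarii client; the three toy molecules are invented for illustration and are not Mutagenesis data, and the helper `nominal_cross_table` is a hypothetical name. The table and column names follow the SQL statement given earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE molecule (id INTEGER PRIMARY KEY, mutagenic TEXT);
    CREATE TABLE atom (id INTEGER PRIMARY KEY, molecule_id INTEGER, element TEXT);
    -- Toy data (invented): two mutagenic molecules and one non-mutagenic one.
    INSERT INTO molecule VALUES (1, 'T'), (2, 'T'), (3, 'F');
    INSERT INTO atom (molecule_id, element) VALUES
        (1, 'C'), (1, 'C'), (1, 'H'),
        (2, 'C'), (2, 'Cl'),
        (3, 'C'), (3, 'Cl'), (3, 'Cl');
""")


def nominal_cross_table(conn):
    """Run the cross-table query and return {(target, element): count}."""
    rows = conn.execute("""
        SELECT molecule.mutagenic, atom.element, COUNT(DISTINCT molecule.id)
        FROM molecule, atom
        WHERE molecule.id = atom.molecule_id
        GROUP BY molecule.mutagenic, atom.element
    """).fetchall()
    return {(target, element): count for target, element, count in rows}


table = nominal_cross_table(conn)
for key in sorted(table):
    print(key, table[key])
```

Because of `COUNT(DISTINCT molecule.id)`, a molecule with several atoms of the same element is counted once. Combinations that do not occur in the data, such as a non-mutagenic molecule containing hydrogen in this toy set, are simply absent from the result.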

Besides deciding on a proper architecture, a practical implementation also needs to deal with a variety of details that may influence the efficiency or the quality of the results obtained. The following choices were made in the design of Safarii:

• Condition refinements involving numeric attributes will often give rise to a large number of very similar patterns, each of which may be the starting point for a very lengthy search process. This means that an unacceptable amount of computation may be spent on considering lots of slight variations. More importantly, the set of resulting rules will contain many copies of rules that only differ in a single numeric threshold. As a solution, only the condition refinement with the optimal numeric threshold will be reported and added to the queue of candidates.

• Association refinements do not necessarily lead to a reduced coverage. This is true in particular for refinements over associations that have a one or one-or-more multiplicity (e.g. a molecule has at least one atom). Therefore, interesting rules can give rise to rules of equal interest, containing irrelevant nodes. Although such rules are essential for further refinement of the irrelevant nodes, they should not be reported.

• Irrelevant nodes can give rise to overly complex queries to the database. When queries are produced from refinements, they should be filtered for irrelevant joins.
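The first design choice, reporting only the refinement with the optimal numeric threshold, can be illustrated with a minimal sketch. The text does not say how Safarii locates the optimum, so the following is an assumption throughout: it scans all distinct values of a numeric attribute in the target table, scores each condition `value <= threshold` by novelty, and returns only the best one. The function name `best_threshold` and the example data are hypothetical.

```python
def best_threshold(examples):
    """Return (threshold, novelty) for the best 'value <= threshold' condition.

    examples -- list of (value, is_positive) pairs, one per individual.
    Assumption: candidates are the distinct attribute values, scored by novelty.
    """
    n = len(examples)
    n_pos = sum(1 for _, positive in examples if positive)
    ordered = sorted(examples, key=lambda e: e[0])

    best = None
    covered = covered_pos = 0
    for i, (value, positive) in enumerate(ordered):
        covered += 1
        covered_pos += int(positive)
        # Evaluate a threshold only after the last occurrence of each value.
        if i + 1 < n and ordered[i + 1][0] == value:
            continue
        score = covered_pos / n - (covered / n) * (n_pos / n)
        if best is None or score > best[1]:
            best = (value, score)
    return best


# Invented data: positives tend to have low attribute values.
examples = [(0.2, True), (0.5, True), (0.5, False),
            (0.8, True), (1.1, False), (1.6, False)]
threshold, score = best_threshold(examples)
print(threshold, round(score, 4))  # 0.8 0.1667
```

Of the five candidate thresholds in this example, only `value <= 0.8` would be added to the queue of candidates; the near-duplicates at 0.2, 0.5, 1.1 and 1.6 are discarded.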