
The rationale behind a software architecture for inductive databases is clear: a well-designed architecture makes software better, longer-lived and less error-prone [BCK03]. However, although much research has been done on various aspects of inductive databases, the implementation of an inductive database has received very little attention. Implementation is nevertheless vital for performance (which is of paramount importance not only for inductive querying, but also for KD in general) and for the extensibility of the database system (which has a huge impact on the data mining power of the inductive database).

Before we discuss the software architecture, we first note that the distinction between patterns and data is not only an intuitive one: patterns and data differ in a number of aspects. Raw data usually has a rigid structure, while patterns are often semi-structured. Studies in the PANDA project have shown that storing patterns in a relational way can be very inefficient, due to their semi-structured nature [CMM+04]. Therefore, we propose an inductive database architecture that has a separate database and a separate patternbase, connected by a fusion component, as outlined in Figure 3.1. Note that this is a general architecture, and that there are always special cases that do not need or benefit from a patternbase; a nice example is distance-based methods, which fit quite well with relational databases [KAH+05].

Figure 3.1: The fusion architecture

In Figure 3.1, the blue components and arrows denote data components and data flows, and the red components and arrows indicate functional components and functional flows. Let us consider a simple scenario: the user specifies a query, which is processed in the inductive querying layer. As we shall see later, the required sub-query calls are made from this layer to the fusion layer, and the data mining operations are supplied along with them, as indicated by the red arrow from the querying layer to the fusion component. The necessary data and patterns are then loaded through the APIs and transformed into an internal representation. Finally, the sub-query is executed in the data operator component.

A crucial part of this architecture is formed by the data and pattern representation structures. According to [BCM04], a PBMS should contain three layers: a pattern layer containing the patterns, a pattern type layer containing the pattern types, and a class layer containing pattern classes, i.e., collections of semantically related patterns. Regardless of how a pattern is represented within the patternbase, a pattern has at least the following information attached to it:

∙ The pattern source 𝑠, i.e., the table(s) or view(s) from which the pattern is derived.

∙ The pattern parameter collection 𝑃, which is a (possibly empty) list of parameter values used by the function 𝑓 that produces the pattern.

The information specified above is the minimum amount of information needed to update patterns in case their source tables change. Changes can be discovered and handled automatically by database triggers supported in the DBMS query language, or by registering for them in the DBMS API. Current relational databases are unfit to represent such an architecture, and XML databases have been proposed instead to store and represent patterns [MP02, CMM+04]. Therefore, we prefer to use an XML database for the patternbase. For the representation of patterns in XML, the leading standard is currently the Predictive Model Markup Language (PMML), a standard for representing statistical and data mining models.
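The trigger-based change discovery mentioned above can be sketched as follows. This is a minimal illustration only: it uses SQLite through Python's standard `sqlite3` module as a stand-in DBMS, and the `sales` table, the `change_log` table and the trigger names are hypothetical names chosen for the example, not part of the proposed architecture.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the DBMS (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (id INTEGER PRIMARY KEY, item TEXT, amount REAL);

-- Hypothetical change log that a fusion component could poll.
CREATE TABLE change_log (relation TEXT, operation TEXT);

CREATE TRIGGER sales_ins AFTER INSERT ON sales
BEGIN
    INSERT INTO change_log VALUES ('sales', 'insert');
END;

CREATE TRIGGER sales_del AFTER DELETE ON sales
BEGIN
    INSERT INTO change_log VALUES ('sales', 'delete');
END;
""")

# Every modification of the source relation leaves a trace in change_log.
conn.execute("INSERT INTO sales (item, amount) VALUES ('bread', 2.5)")
conn.execute("DELETE FROM sales WHERE item = 'bread'")

changed = conn.execute(
    "SELECT relation, operation FROM change_log ORDER BY rowid"
).fetchall()
print(changed)
```

A fusion component could read such a log to decide which patterns need to be updated; registering callbacks through the DBMS API would be an alternative to polling.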

Apart from query execution, the fusion component is also responsible for the synchronization of patterns with their corresponding source data, and for maintaining data structures that allow these procedures to proceed as efficiently as possible. The fusion component should implement the following pattern and data synchronization operations:

∙ 𝑅𝑒𝑐𝑎𝑙𝑐(𝑟), which recalculates the patterns in the patternbase affected by a change of database relation 𝑟, according to the specified function 𝑓 and parameter values 𝑃 over source 𝑠. The function is located and known in the data mining layer.

∙ 𝐷𝑒𝑙(𝑟), which deletes a pattern if (part of) its source 𝑠 is no longer present in the database.
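The two synchronization operations can be illustrated with a small in-memory sketch. It is not the proposed implementation: the `Pattern` and `Fusion` classes, the dictionary-based toy database and the `frequent_items` function are all hypothetical, and a real patternbase would be a separate (XML) store.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Pattern:
    source: Tuple[str, ...]        # s: relations the pattern is derived from
    function: Callable[..., Any]   # f: data mining function
    params: Tuple[Any, ...] = ()   # P: parameter values used by f
    value: Any = None              # current content of the pattern


class Fusion:
    """Toy fusion component over a dict-based database (assumption)."""

    def __init__(self, db: Dict[str, List[Any]]):
        self.db = db
        self.patterns: List[Pattern] = []

    def _compute(self, p: Pattern) -> None:
        tables = [self.db[name] for name in p.source]
        p.value = p.function(*tables, *p.params)

    def add(self, p: Pattern) -> None:
        self._compute(p)
        self.patterns.append(p)

    def recalc(self, r: str) -> int:
        """Recalc(r): recompute every pattern whose source contains r."""
        affected = [p for p in self.patterns if r in p.source]
        for p in affected:
            self._compute(p)
        return len(affected)

    def delete(self, r: str) -> int:
        """Del(r): drop patterns whose source relation r no longer exists."""
        if r in self.db:
            return 0
        before = len(self.patterns)
        self.patterns = [p for p in self.patterns if r not in p.source]
        return before - len(self.patterns)


def frequent_items(rows: List[str], min_count: int) -> List[str]:
    """Hypothetical mining function: items occurring at least min_count times."""
    counts: Dict[str, int] = {}
    for item in rows:
        counts[item] = counts.get(item, 0) + 1
    return sorted(i for i, c in counts.items() if c >= min_count)


db = {"basket": ["bread", "milk", "bread"]}
fusion = Fusion(db)
p = Pattern(source=("basket",), function=frequent_items, params=(2,))
fusion.add(p)                           # initial pattern: only "bread" is frequent

db["basket"].append("milk")             # the source relation changes
recalculated = fusion.recalc("basket")  # pattern is recomputed from s, f and P

del db["basket"]                        # the source relation disappears
removed = fusion.delete("basket")       # pattern is removed from the patternbase
```

Because each pattern carries its source, function and parameters, the fusion component needs no further information to bring the patternbase back in line with the database.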

Before a query is executed, it first needs to be processed in the inductive querying layer. A few specialized inductive query languages have been proposed and implemented, such as MINE RULE [MPC98], MSQL [IV99], DMQL [HFW+96] and XMine [BCKL02]. What these languages have in common is that they are existing SQL or XML query languages extended with data mining operators. We envision a query architecture as depicted in Figure 3.2, in which the following components are involved in the querying process:

∙ Query Parser

All queries posed to the system first go through the query parser. Here, queries are parsed and examined, and individual relations, data mining operations and standard query types are identified and passed to the query analyzer. Identification proceeds by matching each lexical unit (e.g., a word) in the query against both the data mining operation repository and the query language typing component.

Figure 3.2: The inductive query architecture

∙ Data Mining Operation Repository

All data mining operations are stored in the data mining operation repository. Each operation should be annotated with its lexical value (for the parser), with meta-data concerning performance and dependency indications (for the analyzer), and with its required operation parameters and output type (for the scheduler).

∙ Query Language Typing

The query language typing component tells the parser what lexical units are part of the underlying query language of the DBMS and PBMS, so that the parser can forward those segments without having to check for data mining operations.

∙ Query Analyzer and Optimizer

The query analyzer analyzes (sub)queries to see if they can be optimized. These optimizations include logical optimizations and optimizations based on concurrent execution. To optimize data mining operations, it uses the meta-data provided by the operation repository.

∙ Query Scheduler and Execution Handler

The query scheduler schedules the (sub)queries for execution, including concurrent scheduling and the choice of execution platform (local and/or remote), whereby it uses load balancing to arrive at an optimal execution profile. When this process is completed, the execution handler configures any data mining primitives according to their meta-data and sends the operation and query to the Fusion Layer (on a remote site and/or locally), where they will be handled. After receiving all the query answers, it sends them all to the Fusion Layer again, until a final answer is received.
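The identification step performed by the query parser can be sketched as a simple lookup against the two components it consults. The sketch below is illustrative only: the repository contents, the keyword set, the `MINE_RULES` operator and the whitespace-based tokenization are assumptions made for the example and do not correspond to any of the query languages cited above.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical data mining operation repository: lexical value -> meta-data.
OPERATION_REPOSITORY: Dict[str, dict] = {
    "MINE_RULES": {"params": ["min_support"], "output_type": "rules"},
}

# Hypothetical query language typing: lexical units of the host language.
QUERY_LANGUAGE_KEYWORDS: Set[str] = {"SELECT", "FROM", "WHERE"}


def classify(query: str, relations: Set[str]) -> List[Tuple[str, str]]:
    """Label each lexical unit of the query for the query analyzer."""
    labelled = []
    for token in query.split():  # naive tokenization (assumption)
        upper = token.upper()
        if upper in OPERATION_REPOSITORY:
            kind = "mining_operation"
        elif upper in QUERY_LANGUAGE_KEYWORDS:
            kind = "query_language"
        elif token in relations:
            kind = "relation"
        else:
            kind = "other"
        labelled.append((token, kind))
    return labelled


tokens = classify("SELECT MINE_RULES FROM basket", {"basket"})
for token, kind in tokens:
    print(f"{token}: {kind}")
```

Segments labelled as belonging to the query language can be forwarded unchanged, while units labelled as mining operations carry the repository meta-data that the analyzer and scheduler need.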

As shown in Figure 3.2, the data mining operation repository is heavily involved in all steps, with its meta-data supporting execution, scheduling and optimization. Furthermore, data mining operations have to comply with a global typing system in order to achieve typing closure, meaning that the output type is always a subset or element of the input type. Since services are always annotated with the necessary parameters and support rigid typing schemes, we argue that services are excellent candidates for data mining operations in inductive databases. Moreover, if web services are used, remote computing can be achieved fairly easily, without having to create a custom framework.
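One possible reading of the typing closure requirement is that every operation must return a type that the global typing system itself accepts as input, so that results can be queried again. The check below sketches that reading; the type names, the operation signatures and the `violates_closure` helper are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical global typing system: types accepted as operation input.
GLOBAL_TYPES: Set[str] = {"table", "itemsets", "rules"}

# Hypothetical operation signatures as they might appear in the repository.
OPERATIONS: Dict[str, Dict[str, str]] = {
    "frequent_itemsets": {"input": "table", "output": "itemsets"},
    "rule_generation": {"input": "itemsets", "output": "rules"},
}


def violates_closure(operations: Dict[str, Dict[str, str]],
                     global_types: Set[str]) -> List[str]:
    """Return the operations whose output type lies outside the global types."""
    return sorted(
        name for name, sig in operations.items()
        if sig["output"] not in global_types
    )


violations = violates_closure(OPERATIONS, GLOBAL_TYPES)
bad = violates_closure(
    {"plot": {"input": "rules", "output": "image"}}, GLOBAL_TYPES
)
```

Under this reading, an operation such as the hypothetical `plot` would be rejected, because its result could not serve as input to any further query.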

Apart from acting as data mining operators, web services can also serve as executors for other parts of the inductive querying process. For example, the query parser can be a service that takes as input the query and the grammar of the inductive query language, and returns errors or sub-queries. This service could be implemented remotely using a parser generator such as Bison.