
Using Data Mining for Assessing Whether an Objective is “Realistic”

Chapter 5: The Development of the Framework Based on ILP and Data Mining

5.2 Using Data Mining Techniques for Assessing the Objectives

5.2.2 Using Data Mining for Assessing Whether an Objective is “Realistic”

With regard to the SMART approach, an objective is considered “realistic” if sufficient resources are available to achieve it (Hurd et al., 2008). For example, assessing whether a business objective in a domain such as sales is “realistic” requires classifying the resources that must be allocated to accomplish the desired sales within a specified period of time. The classification task here is based on a decision-making process in which sales staff use historical data to make accurate future decisions regarding the quality, quantity and availability of the resources required to achieve the expected sales. The resources required to achieve the target sales for a specific product may include items, staff, and time. Several data mining techniques can support this decision-making process and classify data by inducing decision and classification rules.

Thus, the aim here is to check whether data mining methods could also be used as part of the framework to assess whether an objective is “realistic” by classifying the resources required to achieve it, such as identifying the number of items and staff needed to attain the desired sales within a given timeframe.

To perform this task, the JRIP classification algorithm, which is the WEKA implementation of the RIPPER rule induction algorithm (Cohen, 1995), is applied to assess whether an objective is “realistic”. In this study, JRIP uses information extracted by ALEPH (i.e. ‘Product’ and ‘Date’ patterns) together with additional information (i.e. the available staff and items) to induce classification rules that classify the product type with the number of staff and items required for a proposed date in order to obtain the expected sales.

For the experiments, the JRIP rule induction algorithm has been applied to a dataset of randomly generated examples created from a set of suggested rules that describe two products (i.e. PC and mobile).¹¹ These rules were chosen to produce a dataset that the rule induction algorithm can exploit. They are provided as input to a simple Java program, which generates the random positive and negative examples used by the JRIP algorithm.

The proposed rules are:

1. if (year = 2008) and (product = mobile) and (staff >= 100) and (staff <= 150) and (items >= 500) and (items <= 550) then realistic=yes

2. if (year = 2009) and (product = mobile) and (staff >= 170) and (staff <= 200) and (items >= 570) and (items <= 600) then realistic=yes

3. if (year = 2008) and (product = PC) and (staff >= 300) and (staff <= 350) and (items >= 1000) and (items <= 1300) then realistic=yes

4. if (year = 2009) and (product = PC) and (staff >= 400) and (staff <= 450) and (items >= 1500) and (items <= 1600) then realistic=yes
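As an illustration only, the proposed rules can be encoded and used to label random examples roughly as follows. The thesis used a Java program whose source is not shown; this Python sketch is not that program. The names `PROPOSED_RULES`, `is_realistic` and `generate_examples`, the 50/50 sampling split, and the sampling ranges for negatives are all assumptions. Rule 2 is taken as 570 <= items <= 600, which is the range the induced WEKA rule reports.

```python
import random

# The four proposed rules: (year, product, staff range, items range).
# Rule 2 uses items in 570..600, matching the induced rule in the WEKA output.
PROPOSED_RULES = [
    (2008, "mobile", (100, 150), (500, 550)),
    (2009, "mobile", (170, 200), (570, 600)),
    (2008, "PC", (300, 350), (1000, 1300)),
    (2009, "PC", (400, 450), (1500, 1600)),
]


def is_realistic(year, product, staff, items):
    """Return True if any proposed rule covers the example."""
    for r_year, r_product, (s_lo, s_hi), (i_lo, i_hi) in PROPOSED_RULES:
        if (year == r_year and product == r_product
                and s_lo <= staff <= s_hi and i_lo <= items <= i_hi):
            return True
    return False


def generate_examples(n, seed=0):
    """Draw n random examples and label each one with the proposed rules.

    About half are sampled inside a rule's ranges (positives); the rest come
    from wider, assumed ranges and are labelled by checking the rules.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        if rng.random() < 0.5:
            year, product, (s_lo, s_hi), (i_lo, i_hi) = rng.choice(PROPOSED_RULES)
            staff = rng.randint(s_lo, s_hi)
            items = rng.randint(i_lo, i_hi)
        else:
            year = rng.choice([2008, 2009])
            product = rng.choice(["mobile", "PC"])
            staff = rng.randint(50, 500)    # assumed sampling range
            items = rng.randint(300, 1800)  # assumed sampling range
        label = "yes" if is_realistic(year, product, staff, items) else "no"
        rows.append((product, year, items, staff, label))
    return rows
```

Each row follows the attribute order (product, year, items, staff, realistic), so the label is always consistent with the rules regardless of which branch produced the example.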

For the experiments, the generated dataset is given to WEKA. It consists of 790 instances and 5 attributes (product, year, items, staff, and realistic), of which 3 are nominal (“product”, “year”, “realistic”) and 2 are numeric (“items”, “staff”); “realistic” is the target class attribute. The instances comprise 392 positive examples and 398 negative examples.
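A dataset with this structure could be written in WEKA's ARFF text format along the following lines. This is a minimal sketch: the function name `to_arff` and the relation name are hypothetical, and the nominal value sets are inferred from the rules rather than taken from the original file.

```python
def to_arff(rows, relation="realistic_objectives"):
    """Serialise (product, year, items, staff, realistic) rows as ARFF text.

    The relation name is a placeholder; nominal attributes list their
    allowed values in braces, numeric attributes use the 'numeric' type.
    """
    lines = [
        f"@relation {relation}",
        "",
        "@attribute product {PC,mobile}",
        "@attribute year {2008,2009}",
        "@attribute items numeric",
        "@attribute staff numeric",
        "@attribute realistic {yes,no}",
        "",
        "@data",
    ]
    for product, year, items, staff, realistic in rows:
        lines.append(f"{product},{year},{items},{staff},{realistic}")
    return "\n".join(lines) + "\n"
```

Declaring “year” as nominal rather than numeric mirrors the attribute types described above, so a rule learner tests it for equality instead of thresholding it.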

11 This was done since data was unavailable, though one can envision that such data could be collected if the kind of system proposed is deployed in practice.

Using WEKA, a classifier model has been built from this dataset, generating a set of JRIP classification rules in if-then form. The output of the JRIP rule induction model within the WEKA toolkit is as follows:

1. (staff >= 101) and (staff <= 150) and (items >= 501) and (items <= 550) and (year = 2008) and (product = mobile) => realistic=yes (94.0/0.0)

2. (staff >= 170) and (staff <= 200) and (year = 2009) and (items <= 600) and (product = mobile) and (items >= 570) => realistic=yes (98.0/0.0)

3. (items >= 1012) and (staff >= 304) and (staff <= 342) and (items <= 1298) and (year = 2008) and (product = PC) => realistic=yes (84.0/0.0)

4. (staff >= 400) and (items >= 1506) and (items <= 1600) and (staff <= 450) and (product = PC) and (year = 2009) => realistic=yes (94.0/0.0)

5. (items >= 1007) and (items <= 1261) and (staff <= 350) and (staff >= 301) and (year = 2008) and (product = PC) => realistic=yes (14.0/0.0)

6. => realistic=no (406.0/8.0)

As shown above, the JRIP model has induced 6 classification rules from the training dataset. The rules are explicit and understandable, and together they cover most of the examples. All these rules are constructed in a bottom-up fashion. The first rule can be read as:

If the year is equal to 2008 and the product is mobile and the number of staff is greater than or equal to 101 and less than or equal to 150 and the number of items is greater than or equal to 501 and less than or equal to 550, then the result is “realistic”. The numbers in brackets (94.0/0.0) mean that the rule covers 94 of the training examples and misclassifies none of them.

The other rules can be read in a similar way. The last rule is the default: it is activated only if none of the rules above it is applicable.
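Read this way, the induced rules behave as an ordered decision list: the first matching rule decides the class, and the default rule applies otherwise. The sketch below encodes the six rules from the WEKA output; the names `INDUCED_RULES`, `COVERAGE` and `classify` are hypothetical and used only for illustration.

```python
# Induced JRIP rules 1-5, in the order WEKA printed them:
# (product, year, staff_min, staff_max, items_min, items_max)
INDUCED_RULES = [
    ("mobile", 2008, 101, 150, 501, 550),
    ("mobile", 2009, 170, 200, 570, 600),
    ("PC", 2008, 304, 342, 1012, 1298),
    ("PC", 2009, 400, 450, 1506, 1600),
    ("PC", 2008, 301, 350, 1007, 1261),
]

# (covered, misclassified) counts reported next to each rule, including
# the default rule 6.
COVERAGE = [(94, 0), (98, 0), (84, 0), (94, 0), (14, 0), (406, 8)]


def classify(product, year, staff, items):
    """Apply the rules in order; the first rule that matches wins."""
    for r_product, r_year, s_lo, s_hi, i_lo, i_hi in INDUCED_RULES:
        if (product == r_product and year == r_year
                and s_lo <= staff <= s_hi and i_lo <= items <= i_hi):
            return "yes"
    return "no"  # rule 6: default when no earlier rule applies


print(classify("mobile", 2008, 120, 520))  # matched by rule 1
print(classify("PC", 2008, 345, 1100))     # missed by rule 3, matched by rule 5
print(sum(covered for covered, _ in COVERAGE))  # total instances covered
```

The covered counts sum to 790, the size of the dataset, and the only reported misclassifications are the 8 examples that fall through to the default rule.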

The performance of the JRIP classification model has been evaluated using 10-fold cross-validation in WEKA in order to obtain a representative estimate of classification accuracy. In 10-fold cross-validation, the data is randomly partitioned into ten folds; each fold is used in turn as the set of testing examples to evaluate the model, while the remaining nine folds form the set of training examples. The rule induction model has achieved a classification accuracy of 95%, with 752 of the 790 instances correctly classified.
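The mechanics of 10-fold cross-validation and the reported accuracy figure can be sketched as follows. This is a simplified, assumed illustration rather than WEKA's own procedure: the helper names `k_fold_indices` and `accuracy` are hypothetical, and the sketch does not stratify folds by class.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Indices are shuffled once, dealt into k folds, and each fold serves
    exactly once as the test set while the other k-1 folds form the
    training set.
    """
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def accuracy(correct, total):
    """Classification accuracy as a percentage."""
    return 100.0 * correct / total


# With 790 instances, each of the 10 folds holds 79 test instances.
for train, test in k_fold_indices(790):
    assert len(test) == 79 and len(train) == 711

print(round(accuracy(752, 790), 2))  # 95.19
```

Because every instance is tested exactly once, the 752 correct predictions are counted over all 790 instances, which gives the accuracy of about 95%.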

The aim above is to illustrate what a classification algorithm could achieve in assessing whether an objective is “realistic” as part of the framework, if relevant data were available. Clearly, in practice, the accuracy will depend on the availability of data and whether such data can be collected. If not, alternative subjective methods, such as expert judgment, may need to be explored.