The Environment - Using Ourmine - OURMINE: An open source data mining toolkit

3.4 Using Ourmine

3.4.1 The Environment

OURMINE’s environment was constructed with modularity in mind. Each type of process used in the toolkit lives in its own location, which makes the scripts easy to customize. Here I will proceed from the initial execution of the environment through to modifying scripts, to provide a better understanding of how the interconnected code of OURMINE works as a whole.

By the time this section is finished, the user should have a working knowledge of where code segments are located in OURMINE, and also how to modify and add new segments as desired.

Figure 3.5: The OURMINE homescreen after installation.

After installing OURMINE (see the Appendix), the default prompt is given, as in Figure 3.5.

This prompt notifies the user that the OURMINE environment is available for use, regardless of location. Since the data mining environment is loaded into memory in addition to standard shell commands (this becomes very powerful in later use), normal command line operations can still be used to their full extent while in OURMINE. To exit the environment, type exit at the prompt.

To re-enter OURMINE, navigate to where the toolkit was installed, for example the default location $HOME/opt/ourmine/our, and issue the command ./ourmine. Executing the script ourmine sets a variable to the current directory (the default install), and this value is then passed to a shell script called minerc.sh, located in $HOME/opt/ourmine/our/lib/sh. While this may sound complicated, it’s actually very simple. Figure 3.6 shows the contents of the minerc file as it is upon installation. Note that this file can grow much larger and more complex as code is added to the environment.

As can be seen in Figure 3.6, the variable Base is set to the default installation path. Using this variable, we can construct other variables representing locations of important directories within OURMINE. For example, the variable Data is set to the location of the $HOME/opt/ourmine/our/arffs directory. This is where all of the data that comes packaged with OURMINE is stored. On the same note, the variables Sh and Java store the locations of shell code to be executed and of Java code, respectively.
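The variable layout described above can be sketched as follows. This is a minimal illustration rather than the actual contents of minerc.sh: Base is hard-coded here, whereas the ourmine script derives it from the current directory, and the value given to Java is an assumption (only the lib/sh and arffs locations are stated in the text).

```shell
# Sketch of the path variables described in the text.
# Base and Data follow the stated default install; Sh matches the stated
# location of minerc.sh; the Java path is an ASSUMPTION for illustration.
Base="$HOME/opt/ourmine/our"
Data="$Base/arffs"        # data sets packaged with OURMINE
Sh="$Base/lib/sh"         # shell code (minerc.sh, learn.sh, ...)
Java="$Base/lib/java"     # assumed location of the Java code

echo "Data=$Data"
echo "Sh=$Sh"
echo "Java=$Java"
```

Deriving every path from Base means that relocating the install only requires changing one value.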

#define and create required directories

[ -f "$config" ] && . $config
done

PS1="OURMINE> "

Figure 3.6: The minerc file used in OURMINE for setting up the environment.

These variables, set to their respective paths, can be used for many applications. For instance, we can see in Figure 3.6 that $Java is used to set another variable, $Weka, which gives us access to all of the data mining capabilities of the Weka toolkit. Afterwards, every file declared in the string $Files is sourced, importing all BASH functions from their respective files. Let’s examine one of these files further, namely $Sh/learn.sh.
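The loading step can be illustrated with a small, self-contained sketch. It assumes that $Files is a space-separated list of script paths and that each one is sourced if it exists, which is what the surviving fragment of Figure 3.6 suggests. The file contents and the function names hello_util and hello_learn are hypothetical, and a temporary directory stands in for the real $Sh.

```shell
# Stand-in for $Sh: a temporary directory holding two tiny function files.
# The function names and bodies are hypothetical placeholders.
Sh=$(mktemp -d)
printf 'hello_util() { echo "util loaded"; }\n'   > "$Sh/util.sh"
printf 'hello_learn() { echo "learn loaded"; }\n' > "$Sh/learn.sh"

# Space-separated list of files to import; missing.sh does not exist
# and is skipped by the -f test below. (Assumes paths without spaces.)
Files="$Sh/util.sh $Sh/missing.sh $Sh/learn.sh"

# Source each file that exists, so its functions become available
# in the current shell.
for config in $Files; do
    [ -f "$config" ] && . "$config"
done

hello_util
hello_learn

rm -rf "$Sh"   # the functions stay defined after the files are removed
```

Because the files are sourced rather than executed, their functions are defined in the running shell itself, which is why they can be called directly from the prompt.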

learn.sh is one of many shell script files located in $HOME/opt/ourmine/our/lib/sh, and each is named according to its use: util.sh contains standard utility functions used in building experiments in OURMINE, fss.sh holds code for Feature Subset Selection (FSS), cluster.sh has scripts responsible for calling code used to cluster data, and so on. Since our main concern at the moment is learn.sh, I will spend a bit of time discussing how it works, and how its scripts can easily be modified and extended for further use.

Figure 3.7 shows a subset of the available functions in learn.sh that can be accessed from the OURMINE prompt. Here, we can see the function nb() that is used to call Weka’s Naïve Bayes using as input a training set, followed by a testing set. Likewise, j48() executes the J48 decision tree on the training set. Notice those functions ending in 10, such as nb10 and j4810. This is an arbitrary convention that has been used since the earlier days of OURMINE’s construction, and simply means that there is no testing data provided, and thus all testing is conducted on the training set.
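As a rough sketch of what such wrapper functions might look like (not the actual contents of learn.sh), the block below defines the four functions named above. The Weka class names and the -t / -T options are Weka's own, but the classpath is an assumption, and giving j48() a test-set argument simply mirrors nb(). The leading echo makes this a dry run that prints the command instead of invoking Weka.

```shell
# Dry run: "echo" prints the command line instead of running Weka.
# The classpath is an ASSUMPTION; in OURMINE, $Weka is set up by minerc.sh.
Weka="echo java -cp weka.jar"

# Train on $1, test on $2. ($Weka is left unquoted on purpose so that
# it splits into the command and its arguments.)
nb()  { $Weka weka.classifiers.bayes.NaiveBayes -t "$1" -T "$2"; }
j48() { $Weka weka.classifiers.trees.J48 -t "$1" -T "$2"; }

# "10" variants: only a training file is supplied.
nb10()  { $Weka weka.classifiers.bayes.NaiveBayes -t "$1"; }
j4810() { $Weka weka.classifiers.trees.J48 -t "$1"; }

nb train.arff test.arff
j4810 weather.arff
```

Removing the echo from Weka, and pointing the classpath at a real Weka jar, would turn the dry run into an actual invocation.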

Running a learner on input data is simple, and is detailed in Figure 3.8. At the top of the screenshot, we can see the command j4810 $Data/discrete/weather.arff being issued to OURMINE.

Internally, this calls Weka’s J48 decision tree classifier on the extremely popular weather data set found in the $Data directory under discrete (for discrete data). Further information about the results of running the learner is also printed to the screen, such as the decision tree itself (and corresponding information pertaining to it), the execution time, classification accuracy, and per-class statistics obtained from the confusion matrix, such as precision and recall.

nb() {

Figure 3.7: A few of the contents of the learn.sh file used in OURMINE for data training and classification.

Figure 3.8: Running the J48 algorithm in OURMINE.

Configuring shell functions in OURMINE for any desired task is simple. For example, suppose a user wished to import the OneR rule learner to be used in future experiments. This operation requires the addition of one BASH function. The example in Figure 3.9 shows that, at first, the environment does not recognize the command oner10 when it is given the input training data. However, by simply editing learn.sh to include the function code for the learner, and then issuing the command reload (discussed in Section 3.4.2), the environment is reloaded and now contains working function code. Figure 3.9 shows the rules learned from running the newly created function.
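A function of this kind might look like the sketch below. It is an illustration under assumptions, not the code shown in Figure 3.9: the classpath in Weka is assumed, and the leading echo makes it a dry run that prints the command rather than executing it. weka.classifiers.rules.OneR is Weka's OneR class.

```shell
# Dry run: "echo" prints the command line instead of running Weka.
# The classpath is an ASSUMPTION; in OURMINE, $Weka is set up by minerc.sh.
Weka="echo java -cp weka.jar"

# OneR with a training file only, following the "10" naming convention.
oner10() { $Weka weka.classifiers.rules.OneR -t "$1"; }

oner10 weather.arff
```

Once a function like this is saved in learn.sh, the reload command makes it available at the prompt without restarting the environment.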

Thus, in summary, OURMINE’s structure operates by taking advantage of BASH functions located in clearly defined files. These functions, however, can call any number of other functions and libraries written in almost any other programming language. The result is an extremely powerful "sandbox" in which massive amounts of modification and experimentation can take place.
