9 Conclusions and Further Work
9.1 RESEARCH CONTRIBUTION
The main contribution of the research presented in this thesis is a new machine learning method entitled DynamicWEB. This method was required in order to meet the six needs that were outlined in the introduction, and it provides the following six capabilities:
1. The learner is able to profile object activity over an extended time period.
2. The learner is able to establish relationships between the profiles.
3. The learner is able to adapt to concept drift.
4. The learner is able to adapt to object drift.
5. The learner is able to preserve context across multiple observations.
6. The learner is able to track a large number of target objects simultaneously in real-time.
Existing methods in this area of research are largely supervised learners or batch-based approaches, and cannot build a profile of a single target object over time. DynamicWEB is presented as an unsupervised, online method that builds profiles across multiple observations of a set of target objects. This online approach allows DynamicWEB to operate on a stream of data within time-sensitive contexts. The method is a hierarchical probabilistic conceptual clustering learner built upon COBWEB (Fisher 1987; Gennari, Langley et al. 1989). COBWEB is a respected method in the machine learning field and, as such, is seen as a solid foundation to build upon. An index structure was added to COBWEB to enable fast search, which in turn facilitates the addition of new operations. This index structure gives the learner a scalable search that is able to meet the size requirements of large datasets containing many objects of interest (6).
Two new operations were added in DynamicWEB to those already present in COBWEB: Remove and Update. These operations were implemented with care to maintain the integrity of the existing concept hierarchy. As change takes place within the concepts over many observations, it is vitally important to maintain the quality of the concepts that are being learned. This allows DynamicWEB to tackle learning domains which COBWEB was incapable of examining, while still remaining true to the theoretical base on which it was founded.
Once the hierarchy was modified to enable profiles to be added and removed, DynamicWEB had the capacity to update profiles by combining data from multiple observations. This allowed the profiles stored within the hierarchy to contain the most recently observed data and, by re-adding each updated profile to the hierarchy, allowed the learner to adapt to changes in any of the profiles. Because changes can take place within the concept hierarchy, both concept drift and object drift can be accommodated by the learning process (3 and 4). It is DynamicWEB’s ability to operate in the presence of these twin forms of drift that sets it apart from other machine learning methods.
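The interplay between the index, Remove and Update can be illustrated with a minimal sketch. It is not the DynamicWEB implementation: the class and method names are hypothetical, the hierarchy is limited to two levels, and the placement rule (attach to the concept sharing the most attribute values) is a simple stand-in for COBWEB's placement decision. What it aims to show is that an index from object identifier to leaf makes removal a direct lookup, and that an update can be expressed as a removal, a merge with the new observation, and a re-addition.

```python
from collections import Counter


class Node:
    """A concept (or leaf) summarising the profiles stored beneath it."""

    def __init__(self, parent=None, profile=None):
        self.parent = parent
        self.children = []
        self.profile = profile      # set on leaves only
        self.count = 0              # number of profiles summarised
        self.values = Counter()     # (attribute, value) -> frequency

    def adjust(self, profile, sign):
        """Add (sign=+1) or subtract (sign=-1) a profile from the summary."""
        self.count += sign
        for item in profile.items():
            self.values[item] += sign
            if self.values[item] == 0:
                del self.values[item]


class IndexedHierarchy:
    def __init__(self):
        self.root = Node()
        self.index = {}             # object id -> leaf node

    def _choose_concept(self, profile):
        # Stand-in placement rule (an assumption, not COBWEB's rule):
        # pick the concept that shares the most attribute values.
        best, best_overlap = None, 0
        for concept in self.root.children:
            overlap = sum(1 for item in profile.items() if item in concept.values)
            if overlap > best_overlap:
                best, best_overlap = concept, overlap
        if best is None:            # nothing similar: start a new concept
            best = Node(parent=self.root)
            self.root.children.append(best)
        return best

    def add(self, object_id, profile):
        concept = self._choose_concept(profile)
        leaf = Node(parent=concept, profile=dict(profile))
        concept.children.append(leaf)
        node = leaf
        while node is not None:     # propagate counts up to the root
            node.adjust(profile, +1)
            node = node.parent
        self.index[object_id] = leaf

    def remove(self, object_id):
        leaf = self.index.pop(object_id)    # direct lookup, no tree search
        node = leaf
        while node is not None:     # undo the leaf's contribution on its path
            node.adjust(leaf.profile, -1)
            node = node.parent
        concept = leaf.parent
        concept.children.remove(leaf)
        if not concept.children:    # drop concepts that become empty
            self.root.children.remove(concept)
        return leaf.profile

    def update(self, object_id, observation):
        """Remove the stored profile, merge the new observation, re-add."""
        profile = self.remove(object_id) if object_id in self.index else {}
        profile.update(observation)
        self.add(object_id, profile)
```

In this sketch, an object whose behaviour changes is re-placed each time it is updated, so it can migrate from one concept to another, which is the effect described above for object drift.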
Once DynamicWEB was provided with the ability to update profiles when new observations occur, it was possible to preserve past observations by using derived attributes to store historical context (1 and 5). This allowed the learner to become aware of trends occurring in relation to each observed object. DynamicWEB’s ability to preserve context allows it to profile the behaviour of an object over a number of observations. This is a simple aim for a learner; however, it is an ability that the bulk of machine learning methods lack. In addition to preserving context, the derived attributes also improve the ability of the learner to handle noise in the dataset, because they enable the profile to contain some attributes that have a smoothing effect over past observations.
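A small sketch may help to make the idea of derived attributes concrete. The particular attributes chosen here (an observation count, a running mean, an exponentially weighted average and a one-step trend), the function name and the smoothing factor `alpha` are assumptions made for illustration only; they are not necessarily the derived attributes used in the experiments.

```python
def update_derived(profile, value, alpha=0.3):
    """Fold one new numeric observation into a profile's derived attributes.

    `profile` is a dict of derived attributes (empty for a new object).
    All attribute names and the default `alpha` are illustrative assumptions.
    """
    count = profile.get("count", 0) + 1
    prev_mean = profile.get("mean", value)
    prev_ewma = profile.get("ewma", value)
    return {
        "count": count,
        "last": value,
        # Running mean: every past observation carries equal weight.
        "mean": prev_mean + (value - prev_mean) / count,
        # Exponentially weighted average: recent observations weigh more.
        "ewma": alpha * value + (1 - alpha) * prev_ewma,
        # Change since the previous observation (0 for the first one).
        "trend": value - profile.get("last", value),
    }


profile = {}
for observation in [10, 10, 40]:    # the final value is a sudden spike
    profile = update_derived(profile, observation)
```

After the spike, the raw attribute `last` jumps to 40, while the smoothed attributes move only part of the way (to roughly 20 for the mean and 19 for the weighted average). This illustrates how a profile can carry historical context and dampen the effect of a single noisy observation.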
DynamicWEB, with the ability to profile objects over multiple observations, is able to establish relationships between the profiles created (2). The hierarchy that is produced relies upon the previous work carried out in COBWEB and CLASSIT, but builds upon this foundation to learn in an environment quite different from that for which these methods were intended. Using these relationships, which are constantly being updated, DynamicWEB’s concept hierarchy can be used in several different learning tasks. It can be used to discover patterns between different activity profiles, or it can be used to make classifications in real-time. The results that were produced using DynamicWEB in this research are summarised in the next section.
9.2 RESULTS SUMMARY
After DynamicWEB was introduced in Chapter 5, the remainder of the thesis described the results of applying the method to a number of different knowledge domains. An important aim of the research was to ensure that DynamicWEB would be suited to a range of application domains. While its original innovation derived from a single application (port scanning reconnaissance profiling), for it to be truly useful it needs to be more broadly applicable.
After initially confirming that the COBWEB implementation used within DynamicWEB produced comparable results to those produced by COBWEB itself, several small machine learning datasets were examined. DynamicWEB was then demonstrated on the Dynamic Weather dataset, which was created specifically for that purpose. This dataset illustrates several examples of object drift in an environment where the concepts are themselves not strictly defined.
The Quadruped Animals dataset was examined to illustrate DynamicWEB’s ability to function as an ensemble learner. Here, using the “wisdom of the crowd”, DynamicWEB was able to produce improved classification accuracy by splitting the attributes in the dataset across eight trees, derived in parallel. The final machine learning dataset examined was the STAGGER Concepts dataset. This dataset demonstrates concept drift, with three distinct concepts being present within the dataset. The dataset has been used by multiple authors and was used when comparing the performance of DynamicWEB with that of COBBIT. DynamicWEB demonstrated an improved performance compared with COBBIT on a single ordering of the data. DynamicWEB also performed well, in comparison with two supervised learners, when tested over 100 trials.
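The attribute-splitting ensemble used on the Quadruped Animals dataset can be outlined in a few lines. The sketch below makes two assumptions that are not taken from the experiments: attributes are dealt out to the trees in round-robin order, and the per-tree predictions are combined by simple majority vote. The function names and attribute names are hypothetical, and the per-tree learner itself is omitted.

```python
from collections import Counter


def split_attributes(attributes, n_trees):
    """Deal the attribute names out to n_trees subsets in round-robin order."""
    return [attributes[i::n_trees] for i in range(n_trees)]


def project(instance, subset):
    """Restrict an instance (a dict) to the attributes seen by one tree."""
    return {name: instance[name] for name in subset}


def majority_vote(predictions):
    """Combine one prediction per tree into a single class label."""
    return Counter(predictions).most_common(1)[0][0]


# Sixteen hypothetical attributes spread over eight trees: two per tree.
attributes = [f"attr{i}" for i in range(16)]
subsets = split_attributes(attributes, 8)

# Each tree would be trained and queried on its own projection of the data;
# here the eight per-tree predictions are simply written out by hand.
votes = ["dog", "dog", "cat", "dog", "horse", "dog", "cat", "dog"]
label = majority_vote(votes)
```

Because each tree sees only its own subset of the attributes, the trees can be built independently of one another, which is what allows them to be derived in parallel.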
After the validation was completed, several real world data mining datasets were examined (Chapters 7 and 8). These spanned several application domains, and included a combination of datasets from the Australian Bureau of Statistics (ABS) as well as a data mining workshop dataset (Chapter 7). All of these datasets are real world datasets and most have not been examined using machine learning techniques prior to this research.
Several of the datasets did not have existing class labels, so DynamicWEB was used to examine them to see whether any structure could be extracted from the data. In some cases this was not possible, and the results relating to those datasets are recorded in the appendices. In the cases where DynamicWEB was able to discover structure, it was assisted by the use of derived attributes, which preserved context over time and thus allowed trends within the data to be discovered.
Real world datasets were also used to evaluate DynamicWEB’s predictive ability. DynamicWEB demonstrated an acceptable level of success when predicting the class distribution in the “watching TV” class of the BodyMedia dataset (2004). On the ABS network performance dataset, attribute values for some nominal classes (type and brand) were predicted reasonably accurately. It was concluded that DynamicWEB could potentially be applied within the domain in order to locate computers that were behaving abnormally.
The original inspiration for this research was the port scanning problem, in which users change IP address whilst undertaking scan activities in order to avoid detection. Examination of the port scanning dataset (Chapter 8) showed that groups of similar port scan profiles, with scans stretching across large time periods, could be revealed by the unsupervised learner and then compared with each other to establish relationships between them. A key complication of this problem, which helped to focus the research that was undertaken, is that these scan profiles change over time: a profile may be built up and considered benign before later becoming of interest (object drift), and the behaviours themselves may change over time (concept drift). The method described here, as demonstrated first on the smaller datasets, is able to adapt under these learning difficulties. Relationships between profiles were also extracted from the scan dataset, further illustrating DynamicWEB’s ability to meet these challenges.