3.2 Association Rule Mining
3.2.2 Establishing Support Value
Because the ARM algorithm has no prior knowledge of the dataset and its output is based on frequent itemsets, it may extract meaningless rules that add no value to the security knowledge. Choosing a suitable support value is therefore critical to the quality and quantity of the generated rules. In traditional ARM, the algorithm takes a single support value and outputs every rule whose support is at or above that value. For example, if the input support is 0.5, all rules with support between 0.5 and 1 are output.
Events triggered by security activities may be fewer in number than routine events, because system administrators do not often repeat the same security actions. We can therefore assume that lowering the support value produces better-quality results. However, if the support value is kept minimal in order to capture low-frequency activities, more routine events are also allowed to correlate, which generates a large number of meaningless rules. For this reason, both a minimum and a maximum support value, hereafter termed the support range (SR), need to be defined together to guide the algorithm towards less frequent events whilst avoiding a large number of routine events. As each dataset has different properties, and specifying appropriate support thresholds without knowledge of the dataset can be difficult, we have devised an automated mechanism to determine the SR. The implemented ARM algorithm takes both minimum (minsup) and maximum (maxsup) support values and finds all rules with minsup ≤ support ≤ maxsup. Moreover, as the aim is to generate high-quality association rules and the SR may come out quite low (e.g. 3%-10%), the confidence is always set to the maximum. The confidence value indicates how often the rule has been found to be true, and setting it to 100% ensures the discovery of the most interesting and reliable association rules [128] within the limits of the SR.
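The filtering condition described above can be illustrated with a minimal sketch. The `Rule` structure, the function name and the sample rules are hypothetical and serve only to show the condition minsup ≤ support ≤ maxsup combined with 100% confidence; they are not taken from the implemented ARM algorithm.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """Hypothetical container for one mined association rule."""
    antecedent: frozenset
    consequent: frozenset
    support: float
    confidence: float


def filter_by_support_range(rules, minsup, maxsup, min_confidence=1.0):
    """Keep rules with minsup <= support <= maxsup and full confidence."""
    return [
        r for r in rules
        if minsup <= r.support <= maxsup and r.confidence >= min_confidence
    ]


# Hypothetical rules: one routine (too frequent), one in range, one unreliable.
rules = [
    Rule(frozenset({"A"}), frozenset({"B"}), support=0.60, confidence=1.0),
    Rule(frozenset({"C"}), frozenset({"D"}), support=0.05, confidence=1.0),
    Rule(frozenset({"E"}), frozenset({"F"}), support=0.07, confidence=0.8),
]
kept = filter_by_support_range(rules, minsup=0.03, maxsup=0.10)
print([sorted(r.antecedent) for r in kept])  # [['C']]
```

The upper bound removes the frequent routine rule, while the confidence threshold removes the rule that does not always hold.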
The SR is calculated from the object frequency distribution (OFD) of event types. The OFD is a set of positive numbers, where each element gives the number of times a unique object has appeared in a given event log dataset. Two other techniques of a similar nature have also been developed [129, 130]; however, they do not calculate support values from the dataset but instead rely on confidence to define a threshold and filter interesting rules. The first approach cannot find interesting rules composed of more than two items, which makes it unsuitable for event correlation, as more than two events can exist in an associative relationship. The second approach generates all possible association rules, regardless of support and confidence thresholds, and then uses the transitive set property and manual examination to filter interesting patterns [131]. This approach is also impractical, as it would require huge computing resources (or manual effort) for large-scale datasets. In this research, therefore, the ARM algorithm uses the frequency of objects to estimate the SR. The formulae used to determine the minimum and maximum support values from an OFD are shown in Equations 3.4a and 3.4b. Note that the algorithm uses the frequency of objects rather than the frequency of events for the SR calculation, as the rules are mined from the object-based model, not directly from the event log dataset.
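$$
\mathrm{SR} =
\begin{cases}
\dfrac{F_{min}}{F_{tot}} \ \text{to} \ \dfrac{F_{max}}{F_{tot}} & \text{if OFD is normal} \quad (3.4a) \\[1.5ex]
\dfrac{F_{avg}}{F_{tot}} \ \text{to} \ \dfrac{F_{max}}{F_{tot}} & \text{if OFD is not normal} \quad (3.4b)
\end{cases}
$$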
where $F_{min}$ is the minimum of the OFD, $F_{max}$ is the maximum of the OFD, $F_{avg}$ is the average of the OFD and $F_{tot}$ is the sum of all elements of the OFD. In a normal distribution, the data points are symmetric about the mean, and if the size of the distribution is relatively small, there will not be a substantial difference between the minimum and maximum elements [132]. The minimum support value is therefore calculated as the ratio of the minimum frequency to the total of the OFD, while the maximum support value is calculated as the ratio of the maximum frequency to the total of the OFD. If the distribution is not normal, the minimum support value is instead calculated as the ratio of the average frequency to the total of the OFD, while the maximum support value is calculated as before. If Equation 3.4a were used in this case, the large difference between $F_{min}$ and $F_{max}$ would produce a wider SR: the value of $F_{min}/F_{tot}$ would become significantly lower, which would force the ARM algorithm to include less interesting and redundant rules. Hence the distinction between a normal and a non-normal distribution is necessary when calculating the SR. This distinction is made using a normality test, which allows the ARM algorithm to include interesting and useful rules while preventing the extraction of irrelevant ones.
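Equations 3.4a and 3.4b can be sketched in a few lines. The function name `support_range` and the `is_normal` flag are assumptions made for illustration; the outcome of the normality test is simply passed in as a boolean. The input is the OFD used in the worked example of Equation 3.8.

```python
def support_range(ofd, is_normal):
    """Return (minsup, maxsup) for an object frequency distribution.

    Equation 3.4a (normal OFD):     F_min / F_tot  to  F_max / F_tot
    Equation 3.4b (non-normal OFD): F_avg / F_tot  to  F_max / F_tot
    """
    if not ofd:
        raise ValueError("the OFD must contain at least one frequency")
    f_tot = sum(ofd)
    f_max = max(ofd)
    # Lower bound: minimum frequency if normal, average frequency otherwise.
    f_low = min(ofd) if is_normal else f_tot / len(ofd)
    return f_low / f_tot, f_max / f_tot


ofd_1 = [1, 2, 3, 4, 14, 22, 31, 59, 60, 68, 91, 144]
minsup, maxsup = support_range(ofd_1, is_normal=False)
print(round(minsup, 2), round(maxsup, 2))  # 0.08 0.29
```

Since $F_{avg} = F_{tot}/n$ for an OFD with n elements, the lower bound of Equation 3.4b reduces to 1/n, which is 1/12 ≈ 0.08 here.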
3.2.2.1 Normality Test
Many methods are available to determine whether a given distribution is normal, such as the Shapiro–Wilk (SW) and two-sample Kolmogorov–Smirnov (TSKS) tests. The distribution size can grow large, as hundreds of distinct events can be triggered. The SW test provides better results only for small sample sizes (50 or fewer) [133]. Recent comparisons [134, 135] show that the TSKS test is an effective method and is suitable for large sample sizes. The TSKS test takes two one-dimensional distributions and decides whether they differ significantly from each other. It is a non-parametric test [136], which means that it makes no assumptions about the distribution; it quantifies a distance between the empirical distribution functions of the samples. The normality test proceeds as follows:
Step 1 (Generate a random normal distribution). The first step in the TSKS test is to produce a known normal distribution, which is used as a reference against the object frequency distribution (OFD) of the dataset. We used an algorithm proposed in [137] that takes either a mean and standard deviation pair or the first and last elements of the OFD to generate a reference normal distribution (RND). The RND has a similar range of values to the OFD, which helps to increase the accuracy of the normality test.
Step 2 (Determine empirical distribution functions). The next step of the TSKS test is to determine the empirical distribution function (EDF) [138] of both the OFD and the RND. The EDF assigns a probability of 1/n to each of the n elements and outputs a discrete distribution. It is computed using the formula in Equation 3.5.
$$
EDF_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{x_i \le t} \qquad (3.5)
$$
where $\mathbf{1}_{x_i \le t}$ is the indicator function, which outputs 1 if $x_i \le t$ is true and 0 otherwise. The EDF is therefore a step function.
Step 3 (Test the hypothesis). The next step is to test the empirical distributions of the OFD ($EDF_{OFD}$) and the RND ($EDF_{RND}$) under the hypothesis that both samples come from a common distribution. If the hypothesis is accepted, the OFD is normal, as the RND is known to be normal. If the hypothesis is refuted, the OFD is not normal. The decision is based on the maximum distance (or difference) between the items of $EDF_{OFD}$ and $EDF_{RND}$, found using Equation 3.6:
$$
D = \sup \left| EDF_{OFD} - EDF_{RND} \right| \qquad (3.6)
$$
where sup is the supremum, i.e. the maximum of the distances found. The maximum distance, D, is then compared with a critical value, as given in Equation 3.7, to decide whether the hypothesis is accepted or rejected.
$$
D > 1.36 \sqrt{\frac{n + m}{n m}} \quad \text{(critical value)} \qquad (3.7)
$$
where n is the size of $EDF_{OFD}$ and m is the size of $EDF_{RND}$; the coefficient 1.36 corresponds to a significance level of 0.05. If D is greater than the critical value, the hypothesis is accepted and the OFD is considered a normal distribution; otherwise it is considered not normal [139].
Step 4 (Repeat the test for better accuracy). This process is repeated with several distinct reference normal distributions generated from the same pair of values from the OFD. The dominant outcome of the hypothesis test, which is either true or false, is taken as the final outcome. Running multiple tests improves the consistency of the TSKS test, which in turn yields better support values for correlation mining.
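The building blocks of Steps 1 to 4 can be sketched as follows. Several choices here are assumptions rather than details from the text: `reference_normal` covers only the mean and standard deviation input and is not the algorithm of [137]; `ks_distance` follows the textbook definition of the two-sample statistic, comparing both EDFs at every point of the pooled samples; and ties in `dominant_outcome` go to the outcome seen first. The accept or reject decision is left to the caller, so the sketch should not be read as reproducing the implemented test. All function names and the sample data are hypothetical.

```python
import math
import random
from bisect import bisect_right
from collections import Counter


def reference_normal(mean, std, size, seed=None):
    """Step 1 (assumed form): draw `size` values from N(mean, std^2)."""
    rng = random.Random(seed)
    return sorted(rng.gauss(mean, std) for _ in range(size))


def edf(sample):
    """Step 2: return t -> fraction of sample values <= t (Equation 3.5)."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda t: bisect_right(ordered, t) / n


def ks_distance(sample_a, sample_b):
    """Step 3: largest absolute gap between the two EDFs (Equation 3.6)."""
    f_a, f_b = edf(sample_a), edf(sample_b)
    # Both EDFs are step functions, so the supremum occurs at a sample point.
    pooled = list(sample_a) + list(sample_b)
    return max(abs(f_a(t) - f_b(t)) for t in pooled)


def critical_value(n, m, coefficient=1.36):
    """Right-hand side of Equation 3.7 for sample sizes n and m."""
    return coefficient * math.sqrt((n + m) / (n * m))


def dominant_outcome(outcomes):
    """Step 4: most frequent outcome; a tie goes to the one seen first."""
    return Counter(outcomes).most_common(1)[0][0]


# Hypothetical samples, not taken from the text.
sample_a = [1, 2, 3, 4]
sample_b = [3, 4, 5, 6]
print(ks_distance(sample_a, sample_b))           # 0.5
print(dominant_outcome([True, False, False]))    # False
```

Keeping the four parts as separate functions allows each repetition in Step 4 to reuse the same OFD with a freshly generated reference sample.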
Consider the OFD shown in Equation 3.8. The total number of objects in $OFD_1$ is 12, the minimum value is 1, the maximum value is 144, the average is 41.58 and the sum of all elements is 499. The empirical distribution of the OFD, $EDF_{OFD_1}$, is presented in Equation 3.9.
$$
OFD_1 = \{1, 2, 3, 4, 14, 22, 31, 59, 60, 68, 91, 144\} \qquad (3.8)
$$
$$
EDF_{OFD_1} = \{0.08, 0.17, 0.25, 0.33, 0.42, 0.50, 0.58, 0.67, 0.75, 0.83, 0.92, 1.00\} \qquad (3.9)
$$
Now consider the RND shown in Equation 3.10, which is generated using the minimum (1) and maximum (144) values of $OFD_1$. The total number of objects in $RND_1$ is 12, the minimum value is 9.80, the maximum value is 137.35, the average is 71.09 and the sum of all elements is 853.09. The empirical distribution of the RND, $EDF_{RND_1}$, is presented in Equation 3.11.
$$
RND_1 = \{9.80, 40.60, 43.03, 57.11, 60.50, 73.42, 74.15, 79.30, 82.98, 90.93, 103.92, 137.35\} \qquad (3.10)
$$
$$
EDF_{RND_1} = \{0.08, 0.17, 0.25, 0.33, 0.42, 0.50, 0.58, 0.67, 0.75, 0.83, 0.92, 1.00\} \qquad (3.11)
$$
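The values in Equation 3.11 can be reproduced by evaluating Equation 3.5 at each element of $RND_1$. This is a minimal sketch; the function name `edf_values` is an assumption made for illustration.

```python
def edf_values(sample):
    """Evaluate Equation 3.5 at every sample point, in ascending order."""
    n = len(sample)
    return [sum(1 for x in sample if x <= t) / n for t in sorted(sample)]


rnd_1 = [9.80, 40.60, 43.03, 57.11, 60.50, 73.42,
         74.15, 79.30, 82.98, 90.93, 103.92, 137.35]
print([round(v, 2) for v in edf_values(rnd_1)])
# [0.08, 0.17, 0.25, 0.33, 0.42, 0.5, 0.58, 0.67, 0.75, 0.83, 0.92, 1.0]
```

Evaluated this way, the result depends only on the rank of each element, so any sample of 12 distinct values yields the same list.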
After applying Equations 3.6 and 3.7 to $EDF_{OFD_1}$ and $EDF_{RND_1}$, the value of D is found to be 0, which is not greater than the critical value of 0.54. Hence, the hypothesis is rejected. After repeating this process 12 times (as there are 12 elements), the dominant outcome is still false. This means that $OFD_1$ has failed the TSKS test and is considered not normal. Equation 3.4b will therefore be used to define the range of support values for $OFD_1$. The minimum support value would be $\frac{41.58}{499} = 0.08$, whereas the maximum support value would be $\frac{144}{499} = 0.29$. Therefore, the ARM algorithm will mine all (object-based) association rules whose support is between 0.08 and 0.29. At this stage, correlations amongst event objects have been discovered and extracted, and the next stage is to translate these relationships into connections among event types. The object-based rules describe how the objects contained in event entries are related on a particular machine; however, the aim of this work is to determine generic relationships amongst event types.