3.3 Data Preparation
3.3.4 Data Discretization and Noise Mitigation
The approach proposed in this work is based on time-series analysis. Therefore, continuous monitoring data collected from computing systems should be discretized prior to analysis. System logs are discrete time series by nature, so further discretization is not required. However, discrete binning of syslog entries can be used to mitigate noise. In this work, dynamic time binning is applied to syslog entries. Each bin contains the accumulated number of events that occurred in a certain time window, per node and per event class.
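The binning step can be illustrated with a minimal sketch. The function name `bin_events`, the tuple layout of an event, and the node names are assumptions made for illustration only; the bucket size is taken as given here.

```python
from collections import Counter

def bin_events(events, bucket_size, origin=0):
    """Count events per (node, event_class, bucket index).

    events: iterable of (epoch_seconds, node, event_class) tuples
            (a hypothetical layout, not the format used in this work).
    bucket_size: width of one bin in seconds.
    origin: epoch second at which bucket 0 starts.
    """
    counts = Counter()
    for timestamp, node, event_class in events:
        bucket = (timestamp - origin) // bucket_size
        counts[(node, event_class, bucket)] += 1
    return counts

# Hypothetical sample: three events on node1, one on node2.
events = [
    (1517266890, "node1", "a5803a8a"),
    (1517266892, "node1", "a5803a8a"),
    (1517266896, "node1", "a5803a8a"),
    (1517266893, "node2", "a5803a8a"),
]
binned = bin_events(events, bucket_size=5, origin=1517266890)
```

Re-running `bin_events` with a different `bucket_size` gives the coarser or finer views of the same events that are compared when looking for periodic patterns.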
Dynamic binning contributes significantly to the detection of periodic patterns in monitoring data. Many patterns are only visible at a certain (binning) bucket size. Larger bucket sizes may completely hide a pattern, and smaller bucket sizes may reduce the significance of patterns. A sample binning of system logs using three different bucket sizes is shown in Figure 3.17. A significant periodic pattern can be detected only in Figure 3.17b.
(a) Bucket size = 1 second: Patterns are not significant.
(b) Bucket size = 5 seconds: Significant periodic patterns.
(c) Bucket size = 10 seconds: No detectable patterns.
Figure 3.17: Significance of data binning bucket size on detectability of periodic patterns (x-axis: time as epoch seconds; y-axis: event frequency).
To calculate the suitable bucket size, syslog entries of correlated nodes¹⁷, collected over a period of one hour, are re-sampled using multiple bucket sizes. The bucket size varies from 60 to 3600 seconds in 60-second steps. Each bucket holds the average number of syslog entries generated per second during its time window. For each bucket size, the standard deviation of the bucket values is calculated.
¹⁶ Node vicinity, defined in Section 4.2.1, further expands the concept of neighborhood homogeneity.
¹⁷ Refer to Section 4.2.1 for more information.
The smallest bucket size whose standard deviation (1) is a local minimum compared to the nearest smaller and larger bucket sizes, (2) is below a certain threshold¹⁸, and (3) follows a descending trend is chosen as the suitable bucket size. Figure 3.18 illustrates the final step of this calculation for Taurus. Green dots mark potentially suitable bucket sizes (local minima). The horizontal yellow line indicates the threshold (standard deviation = 1), and the vertical red line represents the automatically chosen bucket size (600 seconds = 10 minutes) for data binning. This calculation is repeated after each major change in the syslog generation pattern.
Figure 3.18: Calculation of suitable bucket size for data binning (x-axis: bucket size in seconds, from the original data and 60 to 3600 in steps of 60; y-axis: standard deviation).
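The selection procedure described above can be sketched roughly as follows. The function names are hypothetical, and the "descending trend" criterion is not defined precisely in the text; the sketch assumes it means that the candidate's standard deviation lies below the mean of the values for all smaller bucket sizes. Trailing partial buckets are ignored, which is also an assumption.

```python
from statistics import pstdev

def std_per_bucket_size(per_second_counts, sizes):
    """Map each bucket size to the standard deviation of its bucket averages."""
    result = {}
    for size in sizes:
        full = len(per_second_counts) // size  # ignore a trailing partial bucket
        if full < 2:
            continue  # a single bucket has no spread to measure
        averages = [
            sum(per_second_counts[i * size:(i + 1) * size]) / size
            for i in range(full)
        ]
        result[size] = pstdev(averages)
    return result

def choose_bucket_size(std_by_size, threshold=1.0):
    """Return the smallest size meeting the three criteria, or None."""
    sizes = sorted(std_by_size)
    for i in range(1, len(sizes) - 1):
        prev_std = std_by_size[sizes[i - 1]]
        cur_std = std_by_size[sizes[i]]
        next_std = std_by_size[sizes[i + 1]]
        is_local_min = cur_std < prev_std and cur_std < next_std
        below_threshold = cur_std < threshold
        # Assumed reading of "descending trend": the candidate lies below
        # the mean of the values for all smaller bucket sizes.
        earlier = [std_by_size[s] for s in sizes[:i]]
        descending = cur_std < sum(earlier) / len(earlier)
        if is_local_min and below_threshold and descending:
            return sizes[i]
    return None

# Hypothetical standard deviations, not measured values.
example = {60: 1.8, 120: 1.5, 180: 1.6, 240: 1.2,
           300: 0.9, 360: 1.0, 420: 0.7, 480: 0.8}
chosen = choose_bucket_size(example)
```

In this made-up example, 120 seconds is a local minimum but lies above the threshold, so the first size that satisfies all three conditions is 300 seconds.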
Noise is the erroneous presence or absence of entries within the monitoring data. Syslogs are generated by applications on individual computing nodes; thus, any failure directly affects syslog entries by introducing random noise, interrupting log generation, or impeding log collection. Furthermore, even harmless errors may introduce random noise into syslog entries.
To identify the normal behavior of computing systems, it is necessary to remove random noise. Besides software and hardware failures, which may inject random noise into the monitoring data, other actions such as software updates, administration activities, and system maintenance can also introduce noise. In addition, most production HPC systems are used by various groups of users and for different applications. Therefore, random noise in monitoring data is highly plausible due to human error and application misbehavior [57, 5]. Part of this noise can be removed via discrete binning of the monitoring data. However, excessively coarse binning can decrease the accuracy of anomaly detection by reducing the precision of the monitoring data and hiding existing patterns.
This work utilizes the neighborhood homogeneity of HPC systems to mitigate random noise. Computing nodes in HPC systems are divided into smaller subsets such as chassis or racks. The majority of these small subsets consist of homogeneous computing nodes that share various physical resources such as power supply, cooling system, and network infrastructure. Homogeneous computing nodes that are physically collocated (adjacent) and share similar physical resources tend to exhibit similar behavior [250]. Therefore, in a homogeneous subset of computing nodes, the common behavior of the majority can be considered the normal behavior of that particular subset.
Figure 3.19 shows the extraction of common node behavior from noisy syslog entries on Taurus in a subset consisting of 8 homogeneous computing nodes. Colored cells mark the occurrences of event a5803a8a (event pattern) on 8 adjacent nodes during 32 minutes. The bucket size is 60 seconds. The bottom row indicates the normal pattern of event occurrences, extracted via majority voting among the 8 computing nodes. Events are placed in each bin according to the time elapsed since midnight. Further time synchronization is not required.
Figure 3.19: A sample of normal behavior extraction using majority voting. Occurrence pattern of one event class (a5803a8a) on 8 nodes (rows: Node 1 to Node 8) over 32 minutes (columns: bins 1 to 32) with a bucket size of 60 seconds. The bottom row, shown in green, holds the result of majority voting on all 8 nodes. Darker shades indicate a higher number of entries. Since each event class has its own pattern, only one event class is shown to emphasize the important details.
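Majority voting over the bins of adjacent nodes can be sketched as below. This is an illustration under stated assumptions, not the implementation used in this work: the function name is hypothetical, the vote considers only whether an event occurred in a bin (not how many entries were logged), and a tie does not count as a majority.

```python
def majority_vote(node_bins):
    """Per bin, mark an event as normal if more than half the nodes logged it.

    node_bins: list of equally long lists, one per node; each value is the
    number of events of one class that the node logged in that bin.
    """
    n_nodes = len(node_bins)
    normal = []
    for column in zip(*node_bins):
        active = sum(1 for count in column if count > 0)
        # Strict majority (assumption): a tie is treated as "not normal".
        normal.append(1 if 2 * active > n_nodes else 0)
    return normal

# Hypothetical counts for 4 nodes and 5 bins.
nodes = [
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0],  # extra event in bin 3: treated as noise
    [0, 0, 0, 1, 0],  # missing event in bin 1: treated as noise
]
pattern = majority_vote(nodes)
```

Both kinds of noise named in the text appear in this toy input: an erroneous presence (third node) and an erroneous absence (fourth node). Neither changes the extracted pattern.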
Outliers are valid events that are distant from the norm. Outliers can impede correct analysis of system behavior. However, in contrast to noise, outliers are part of the system behavior; thus, they should not be removed. Outliers are not always indicators of abnormal behavior. It is worth emphasizing that the goal of this stage is to extract the pattern of normal (healthy) system behavior. Therefore, standardizing the data range (scaling) is sufficient to suppress the negative effect of outliers.
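The text does not state which scaling method is applied. As one possible illustration, the sketch below uses min-max scaling, which maps the bin values of a series linearly to the range [0, 1]; both the choice of method and the function name are assumptions.

```python
def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant series: no range to scale
    return [(v - low) / (high - low) for v in values]

# Hypothetical bin counts; the value 10 plays the role of an outlier.
scaled = min_max_scale([2, 4, 10])
```

The outlier is kept in the data and only mapped to the upper end of the range, consistent with the statement that outliers should not be removed.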
Considering the noise mitigation approach shown in Figure 3.19, when the majority of computing nodes exhibit abnormal behavior, the extracted behavior pattern will be incorrect. However, analysis of Taurus behavior revealed that, except during major system failures that affect the majority of computing nodes, utilizing neighborhood homogeneity and majority voting extracts the common event patterns correctly. Nevertheless, the system log entries collected during major system failures must be excluded from the training data (ground truth) to prevent unexpected results.