Monitoring Data - Toward Resilience in High Performance Computing: A Prototype to Analyze and P

System logs6 _{are the main source of monitoring data in this work. Syslog entries pro-}

vide a wide range of information about the behavior of the underlying hardware, software, and users. All current TOP500 [234] HPC systems are powered by Linux. The Linux logging protocol is implemented according to RFC 5424 [64] thus, there is a high degree of consistency among system log entries generated by all TOP500 HPC systems7_{. Therefore, the}

results of syslog-based analysis as a general approach can be applied to all other TOP500 HPC systems without further modiﬁcations.

In addition to syslog entries, three other data sources were employed to assist the anal- ysis in this work: outages database, service notifications, and job status reports. The outages database reports all system outages from the users perspective, the service notifications notify users regarding scheduled maintenance and system-wide outages, and the job status reports indicate the final status of submitted jobs after their allocation. Table 3.2 provides an overview of the four main data sources used in this work.

Table 3.2: Four main data sources used as monitoring data

Data source Data collection Information Granularity

System logs Automatic Software and hardware Component

Outage database Semi-automatic Service availability Entire system Service notiﬁcations Manual Service availability Entire system Job status report Automatic Job completion status Node

System Log Entries

Syslog entries typically consist of four parts: PRIVAL, timestamp, source and message. PRIVAL, timestamp and source contain structured data while message is unstructured. The timestamp denotes the time at which an event occurs. The source provides information about the location of the event occurrence, and the message describes the event properties. Table 3.3 illustrates three syslog entries divided into their timestamp, source and message ﬁelds. In this example, the timestamps are in the UNIX time format8_{, the}

sources are node IDs, and the messages contain event details.

3.2.1 Data Collection

Taurus syslog entries were collected since September 2014. During this time two fundamental hardware and software upgrades affected the data collection process. From the

6_{System log and syslog are used interchangeably in this work}

7_{It worth to mention that /proc/kmsg and /dev/kmsg provide the kernel ring buffer, thus their entries are not} in RFC 5424 format.

Table 3.3: Sample of raw syslog entries

Timestamp Source Message

1515625261 taurusi1230 (siavash) CMD (/home/config.sh > output.stat) 1515625370 taurusi3417 pam_unix: session closed for siavash

1515625713 taurusi6201 disabling lock debugging due to kernel taint

available collection of syslog entries, the period of January to December 2017 is chosen as the main data source for this work. The main reasons for focusing on the time frame of 01-01-2017 to 31-12-2017 are (1) experiencing the lowest number of updates and unplanned maintenance, (2) having the most reliable data sub-collection, (3) constant monitoring of data consistency, and (4) availability of other monitoring data (i.e. Table 3.2). Figure 3.9 provides an overview of the Taurus syslog collection timeline since 2014.

September 2014

Beginning of Syslog collection Islands 1-3 on an external server

April 2016

Switching to syslog collection on an internal Taurus node

May 2015

Extending Syslog collection to Islands 4-6 (on an external server)

June 2018 Fundamental update of software stack to SCS5 April 2019 Live collection of System logs… Reliable collection Unreliable collection No collection

Figure 3.9: Timeline of syslog collection on Taurus

The collection mechanism is passive. Syslog daemons, on each node, collect the syslog entries and forward them to a central log collector. To maintain the neutrality of the results and providing a general approach suitable for all HPC systems, the pre-conﬁgured syslog daemons were used without any modiﬁcations.

Although the most reliable period of data collection is chosen for the purpose of this study, there are certain gaps in the data. These gaps are mainly incurred due to the in- terruption of the data collection mechanism. Syslog entries cover the entire period of the year 2017. The job status reports generated by Slurm covers the period of 28-02-2017 to 14- 11-2017. Service notiﬁcations and outages database, shown in Figure 3.10a and Figure 3.10b respectively, provide information from a higher perspective and are available for the entire period of one year from January to December 2017.

(a) Service notiﬁcations (b) Outages database Figure 3.10: Availability and Maintenance Notiﬁcations of Taurus in 2017

In each section of this work, different subsets of the collected Taurus syslog entries (2014-2018) are used according to the requirements of the respective analysis. However, the

behavior of Taurus is generally analyzed for year 2017 with the focus being on the period of 01-03-2017 to 31-10-2017. Existence of gaps in data sources is a common challenge in similar works [163, 15].

During the ﬁrst round of data collection from September 2014 until April 2016, an external server was used as the central data collector. The external data collector was passively receiving syslog entries over UDP9_{protocol. The UDP protocol was chosen because of its}

higher transmission speed and lower overhead. However, further analysis revealed various inconsistencies between the remotely collected syslog entries and their local copy on Taurus computing nodes. Most inconsistencies were observed during the high workload periods of the HPC system. Therefore, the data transfer protocol was switched to TCP10_.

Using the TCP protocol, fewer inconsistencies between the remotely collected syslog entries and their local copy were detected. However, high network congestion during major system-wide failures still introduced inconsistencies and delays.

Since April 2016, the central syslog collector resides inside the Taurus HPC system, and is using TCP protocol to collect syslog entries. Using an internal syslog collector, the syslog collection may not be accessible during a major system-wide failure, however, the prob- ability of such major system-wide failures are signiﬁcantly low and does not impair the functionality of the current setup. Additionally, the collected syslog entries can be backed up on external storage for further analysis and long-term archiving.

Beside the noises and gaps among collected syslog entries, several irregularities were identified. These irregularities are part of the system’s normal behavior. Thus, they should not be considered as abnormal behavior. Due to the insignificant difference of these irregular system logs and regular syslog entries, in most cases they can not be easily detected. The proposed approach in this work is highly noise-tolerant. Therefore, syslog irregularities are not affecting the final results. The most frequent irregularities observed on Taurus are (1) out-of-order entries, (2) out of sync clock, (3) daylight saving, and (4) invalid entries.

The sequential arrangement of syslog entries is one of the most important character- istics that make them a highly adequate data source for behavioral analysis. However, among the collected syslog entries various instances of irregular appearance of syslog entries were observed. In Table 3.4 three occurrences of such timestamp confusions are marked in red. Further analysis revealed that this problem is mainly caused by unsuccess- ful timezone detection. The problem can be solved by restarting the problematic daemons on the HPC system. However, the occurrence of such timestamp confusions on Taurus is so few that they can safely be ignored without signiﬁcant negative impacts.

During cold boots, similar confusions may happen due to delayed clock synchroniza- tion, as shown in Table 3.5. Data analyzing approach in this work skips a short interval of instability after each reboot (and failure) to address the early-life failures suggested by Bathtub Curve (Figure 1.1 on page 4), which additionally addresses the out-of-sync clock phenomena.

Another important point to consider is the missing and overlapping hours caused by

9_{https://tools.ietf.org/html/rfc768} 10_{https://tools.ietf.org/html/rfc7805}

Table 3.4: System logs with out-of-order timestamps

Example 1 Example 2 Example 3

1514974801 2018-01-03 11:20:01 taurusi5003 1516901631 2018-01-25 18:33:51 taurusi5003 1516344245 2018-01-19 07:44:05 taurusi5003 1514974835 2018-01-03 11:20:35 taurusi5003 1516902001 2018-01-25 18:40:01 taurusi5003 1516344286 2018-01-19 07:44:46 taurusi5003 1514974835 2018-01-03 11:20:35 taurusi5003 1516902241 2018-01-25 18:44:01 taurusi5003 1516344601 2018-01-19 07:50:01 taurusi5003 1514974852 2018-01-03 11:20:52 taurusi5003 1516898641 2018-01-25 17:44:01 taurusi5003 1516345017 2018-01-19 07:56:57 taurusi5003

1514971326 2018-01-03 10:22:06 taurusi5003 1516902601 2018-01-25 18:50:01 taurusi5003 1516341419 2018-01-19 06:56:59 taurusi5003

1514974951 2018-01-03 11:22:31 taurusi5003 1516902834 2018-01-25 18:53:54 taurusi5003 1516345201 2018-01-19 08:00:01 taurusi5003 1514974993 2018-01-03 11:23:13 taurusi5003 1516902834 2018-01-25 18:53:54 taurusi5003 1516345201 2018-01-19 08:00:01 taurusi5003

1514975032 2018-01-03 11:23:52 taurusi5003 1516345201 2018-01-19 08:00:01 taurusi5003

1516345261 2018-01-19 08:01:01 taurusi5003

Table 3.5: Invalid syslog entries timestamp caused be out of sync system clock

Example 1 Syslog message

1486462610 2017-02-07 11:16:50 taurusi6086 Disabling lock debugging due to kernel taint

1486462610 2017-02-07 11:16:50 taurusi6086 Succesfully Started x86 Adapt Processor Feature Device Driver 1486462610 2017-02-07 11:16:50 taurusi6086 0.0.0.0 c61c 0c clock_step -3459.669479 s

1486459151 2017-02-07 10:19:11 taurusi6086 0.0.0.0 c615 05 clock_sync 1486459152 2017-02-07 10:19:12 taurusi6086 0.0.0.0 c618 08 no_sys_peer 1486459154 2017-02-07 10:19:14 taurusi6086 br0: no IPv6 routers present 1486459154 2017-02-07 10:19:14 taurusi6086 eth0: no IPv6 routers present

daylight saving time changes. Therefore, for the data used in this work (collected in 2017), one missing hour of information on March 26th from 02:00 to 03:00, as well as one overlapping hour of information on October 29th from 02:00 to 03:00 is considered.

The TCP connections guarantee the correct transmission of syslog entries from their source to the central syslog collector. However, due to component failures at the source as well as the silent errors, syslog messages may be incorrectly generated or incompletely stored. Therefore, each syslog entry which does not contain all the expected data ﬁelds is considered invalid and is excluded from further analysis.

3.2.2 Taurus System Log Dataset

During the year 2017, in total more than 3.2 billion syslog entries with a total size of 344 GiB were collected. Less than 0.33% of the collected syslog entries were incorrect or pro- vided no useful information for the purpose of this work. Detailed statistics regarding the number of syslog entries collected on Taurus divided by their originating island are shown in Table 3.6. The column /Node/Day shows the average number of entries generated by a single node per day in its respective island. The #Event patterns column indicates the number of extracted unique event patterns per island. The event pattern extraction process is further explained in Section 3.3.1.

Table 3.6: Statistics of Taurus syslog Entries in year 2017

#Log entries #Event patterns

Island #Nodes Total Invalid /Node/Day Total Exclusive

1 270 299,046,032 535,719 3,034 1,601 291 2 108 58,106,243 603,848 1,474 1,360 357 3 180 56,778,161 16,443 864 1,254 215 4 264 490,299,895 4,585,815 5,088 1,667 285 5 612 1,034,193,160 1,846,396 4,629 1,945 340 6 612 1,328,650,822 3,264,407 5,947 2,168 483 All islands 2046 3,267,074,313 10,852,628 4,374 4,026 1,488

Since the message field of syslog entries is an unstructured free-format text, the message length has a great impact on the performance and implementation of log processing methods. The message field of Taurus syslog entries has a diverse length of 1 to 1108 characters. Most of this variation is caused by variables such as hardware addresses and file paths. Although there are extremely long messages with 1108 characters, the majority of Taurus syslog messages are shorter than 100 characters. Figure 3.11a illustrates the frequency of syslog messages with their lengths count in characters.

2.9E9 2.9E8 1.8E5 5.9E6 1E7 1.2E4 3.2E4 1E1 9.1E6 5.6E6 4E5 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 1-100 101 -200 201 -300 301 -400 401 -500 501- 600 601- 700 701- 800 801- 900 90 1-10 00 100 1-11 00 Fr eque nc y

Message Length (characters)

(a) 90% of syslog messages are shorter than 100 characters 2.9E9 3.2E8 1.9E7 1.5E7 1.7E1 16.4E3 6 1E1 9 7.5E2 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 1-10 _11-20 _21-30 _31-40 _41-50 _51-60 _61-70 _71-80 _81-90 91-10 0 Fr eque nc y

Message Length (words)

(b) 89% of syslog messages are shorter than 10 words

Figure 3.11: Frequency of Taurus syslog messages with various lengths

The impact of variables on diversifying the length of syslog messages can be reduced via performing word processing rather than character processing. Considering words as the measurement unit decreases the length diversity by the factor of 10. Although Taurus syslog messages consist of 1 to 99 words, using the new measurement unit, the majority of Taurus syslog messages are shorter than 10 words. Which is an acceptable message length for achieving a near real-time processing performance. Figure 3.11b illustrates the frequency of syslog messages with their length measured in words. Converting syslog messages into event patterns (Section 3.3.1), reduces the size of syslog entries to a constant length that signiﬁcantly improves the syslog processing performance.

In document Toward Resilience in High Performance Computing: A Prototype to Analyze and Predict System Behavior (Page 36-40)