3.5 Prescriptive Maintenance
4.1.1 Rsyslog
The system used to collect the log files from every machine is rsyslog.
The Rocket-fast system for Log processing (rsyslog) is an open source project used for forwarding log messages in a network [43]. The service is based on syslog which is a common standard for messaging logging.
The standard consists in adding to the log message information such as the hostname, the timestamp of the logging and in some case a tag corresponding to the severity level (i.e. INFO, WARN, ERROR etc.).
The log messages collected by the system are usually directed to a text file in the machine that produced the log.
4.1.2
InfluxDB
Some information not included in log messages are acquired by a monitoring sys- tem, that collects various metrics from the machines in the centre and stores them into a database called InfluxDB.
InfluxDB is a time-series database used to store large amount of timestamped data [44]. The data collected from the monitoring system is stored into different tables in the database, one for each metric acquired.
A database table is indexed by the timestamps and contains a column corre- sponding to the hostname of the machine to which that measure corresponds and the value of the measure. Various tags are used to group the different machine corresponding to the same service or experiment.
The system grants the use of different retention policies according to the lifetime of the data — i.e. the measurements are logged every 5 minutes for a week, with the ‘one week’ retention policy; every 15 minutes for one month, with the ‘one month’ retention policy, etc.
The data collected this way is then visualised through the use of a Grafana web interface. Grafana is an open source platform used to visualise InfluxDB data (and data from other sources) into dynamic dashboards [45]. The dynamicity of the dashboard allows the user to filter the data coming from the InfluxDB databases
without the need of manually typing queries.
4.2
Log Repository
In April 2019, the CNAF management, started exploring the development of a single Log repository to be used in the centre. The purpose of this repository is to unify the collection of log files and store them into a single storage system both for safe-keeping and to ease the collection and analysis effort in view of future improvement to the maintenance process of the centre.
The system developed so far makes use of rsyslog to redirect the flow of logs from every machine in the centre to the same repository. The repository is struc- tured in the way shown in Table 4.1.
host 1 year month day 1 log file 1 log file 2 day 2 log file 3 host 2 year month 1 day log file 4 month 2 day log file 5
Table 4.1: Sample directory structure for the Log Repository. Details in the text. This platform allows for easier retrieval of the log files. The access to the repository is allowed via Network File System (NFS), a distributed file system that allows to access files over network as if they were in local storage. The NFS is an open standard defined in a Request For Comments (RFC) [46].
Figure 4.1: Total size of the log files stored into the Log Repository as a function of time.
The number of log files redirected into the Log Repository increased gradually in the last month, while more and more machines were configured to start logging into the repository. At the time of this thesis, the machines that are logging into the Log Repository are 1133 and include all of the Storage hosts, all of the Network hosts and all of the Farm hosts with every Worker Node included. Each machine stores daily an average of 16 log files, for an average of 14 GB and a maximum of 19 GB of data daily. Figure 4.1 shows the amount of data stored in the Log Repository in the last two weeks. Every day more than 18000 log files are stored; the number of log files in the repository at the time of recording was 577585, for a total of 502 GB of data.
Few characteristic log files from StoRM will be briefly presented in the following section. These log types are relevant since they are the type of log files studied in the last part of this thesis.
4.3
StoRM logs
StoRM is a storage manager service for generic disk-based storage systems [25]. StoRM is the storage management solution adopted by the INFN-CNAF Tier-1, and it has been developed in the context of WLCG computational Grid framework with the specific aim of providing high performing parallel file systems to manage the resource distribution for the storage of data in disks. StoRM has a multilayer architecture made by two stateless components, called Frontend (FE) and Backend (BE), and one database use to store the SRM requests and its metadata. A simple StoRM architecture is shown in Figure 4.2.
StoRM supports the Storage Resource Manager (SRM) protocol dividing the operation int synchronous and asynchronous requests. The asynchronous requests include srmPrepareToPut, srmPrepareToGet, srmBringOnLine, srmCopy; while synchronous requests are Namespace operations (srmLs, srmMkdir,etc.), Discovery operation (srmPing) and space operations (srmReserveSpace, srmGetSpaceMeta- data, etc.).
The logging activity represents an important functionality of both BE and FE StoRM components, and of the services linked to them. There are different kind of files in which specific information is stored. A summary of the roles of the two main components is discussed below, with example of their log files.
4.3.1
Frontend server
The FE provides the SRM web service interface available to the user, manages user credentials and authentication, stores SRM requests data into a database, retrieves the status of ongoing requests, and interacts with the BE.
Frontend services logs into two different files: storm-frontend-server.log and monitoring.log. The former has been the first log type analysed throughout Section 6.2; a sample line of the log file is shown in Table 4.2.
When a new SRM request is managed, the FE logs a new line on the file. A log line contains information on the type of request, the user, the success of failure of the request and the token that links it to the BE process.
Figure 4.2: Simple StoRM Service Architecture schema with one Backend and one Fron- tend.
12/07 03:16:09.282 Thread 28 - INFO [0c79ddfe-07cd-48e8- 8547-53d9ca38d154]: Result for request ’PTP’ is
’SRM REQUEST QUEUED’. Produced request token: ’cae6a710 -d292-4404-9bdf-c7a67dad4759’
00:00:00.144 - INFO [xmlrpc-5916] - srmRm: user ¡/DC=ch/DC= cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN= Robot: ATLAS Pilot1¿ operation on [SURL: srm://storm-fe
.cr.cnaf.infn.it/atlas/atlasdatadisk/rucio/mc15 13TeV/bf/ b0/log.16290046. 051554.job.log.tgz.1.rucio.upload]
failed with: [status: SRM INVALID PATH: File does not exist]
Table 4.3: Sample of log line from storm-backend-server.log from 8/12/2018. 00:00:17.827-synch.ls [(count=10669590, m1 rate=329.633,
m5 rate=362.073, m15 rate=501.721) (max=613.200, min =9.381, mean=41.350, p95=171.993, p99=378.712) ] duration units=milliseconds , rate units=events/minute
Table 4.4: Sample of log line from storm-backend-metrics.log from the 8/12/2018.
4.3.2
Backend server
The BE is the core of StoRM service since it executes all synchronous and asyn- chronous SRM functionalities. It processes the SRM requests managing files and space, it enforces authorisation permissions and it can interact with other Grid services.
Table 4.3 shows a sample line of the storm-backend-server.log file.
The BE server log files provide information on the execution process of all SRM requests, error or warning. BE logging is based on the logback framework. Logback provides a way to set the level of verbosity depending on the use case. The level supported are FATAL, ERROR, INFO, WARN, DEBUG. At the INFO level, the BE logs for each SRM operation who has requested the operation (DN), on which files (SURLs) and with which result.
Backend logging also consists of the heartbeat log file, that contains information on the number requests processed by the system from its startup, adding new information at each beat, and the storm-backend-metrics.log log file. An example of the latter is shown in Table 4.4.
The information stored in this logfile are the type of operation, the number of operation in the last minute, the number of operation from last startup, the
maximum, minimum and average duration of the operation from the last minute and the highest duration of the last 95th and 99th percentile of operation.
Chapter 5
Infrastructure setup
As explained in the previous section, the amount of data being regularly stored in the Log Repository averages to 12 GB of data per day. This amount of data, while still being manageable is more and more close to the definition of BD, discussed in Section 1.2.
To elaborate this amount of data, a standard computer may prove insufficient. A single machine can perform elaboration in parallel on its cores, but oftentimes this parallelisation is not enough for BD. A step forward from core parallelisation is distributed computing: A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions to divide a problem into many tasks, each of which is solved by one or more computers.
In many cases it may be more cost-efficient to obtain the desired level of per- formance by using a cluster of several low-end computers, in comparison with a single high-end computer.
Since April 2019, CNAF started studying different system concept to develop a distributed system for Log Ingestion and Analysis, loosely inspired by previous works at CERN and at the Fermi National Accelerator Laboratory (FNAL).
I.e at CERN a complex system of heterogeneous log ingestion and analysis has been put in place to consolidate the data managing structure. A schematic representation of the framework in use at CERN is shown in Figure 5.1.
Figure 5.1: Schematic representation of the unified monitoring system in use at CERN [47].
It is worth mentioning that the tools and techniques used to extract and analyse data from such a large and complicated “dataset” at CERN are many, and very different in scope and functions. While some have been developed by HEP experts for handling big HEP datasets, there is an increasing need within the field to make an effective use of industry developments, i.e. tools that have not been developed in-house but - given the synergy among HEP and non-HEP on BD-related need - they nevertheless seem to offer exactly what HEP needs.
The adoption of industry techniques that have not been developed in-house by HEP experts implies (at least) a training curve and (for sure) additional work to implement some level of adaptation of such tools to the specific details of HEP work-flows, which is often a far from trivial work. At the same time, industry tech- niques, mainly if open source, can rely on a large and open developers communities, whose knowledge is capable to streamline technology adoption and troubleshoot- ing on a stable basis and in a sustainable fashion, which is a hard to meet goal for CERN specific products, which suffer from risk related to inability to guar- antee stable funding schemes for hardware, software and human resources on a multi-year scale.
Figure 5.2: Schematic representation of the distributed log ingestion and analysis plat- form being implemented at CNAF.
veloped to satisfy the need of the new CNAF log repository is schematically rep- resented in Figure 5.2.
A complete description of the technical components of this framework is beyond the scope of this thesis, but few highlights on the major software toolkits in use is given in the following sections.
5.1
DODAS
The infrastructure developed is based on Openstack [48], a cloud operating system that controls and manages the resources of the datacentre. The instances created through Openstack form a cloud network over which is distributed the software that defines the framework described above.
The distribution is done via the Dynamic On Demand Analysis Service (DO- DAS), a Platform as a Service tool which allows to instantiate on-demand container- based clusters [49]. DODAS as been developed by the INDIGO-DataCloud H2020 project [50] and has been funded in the scope of the European Open Science Cloud hub (EOSC-hub) Horizon 2020 project[51].
The purpose of DODAS is to reduce the learning curve, as well as the opera- tional cost of managing community specific services running on distributed cloud, by automating the process of provisioning, creating, managing and accessing com- puting and storage resources. DODAS can deploy both HTCondor batch systems, and platforms for Big Data analysis based on Spark or Hadoop such as our case.
DODAS has already been integrated by the Submission Infrastructure of CMS, as well as by the Alpha Magnetic Spectrometer (AMS-02) computing environment.
5.2
Hadoop-HDFS-Yarn
Apache Hadoop is a collection of open-source software utilities that facilitate using a distributed platform to run jobs [15]. The main software that compose this toolkit are:
• Hadoop Distributed File System (HDFS) - a distributed file-system that stores data on machines, with a very high bandwidth across the cluster; It has many similarities with existing distributed file systems but is designed to be deployed on low-cost hardware with a high failure tolerance. It is tuned to manage vast volumes and can easily scale to hundreds of nodes. The main difference between HDFS and other file systems such as NFS, is that it follows the assumption that is better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. As such is suitable for application with large data sets, supporting tens of millions of files in a single instance.
• MapReduce - MapReduce is the implementation of a programming model for processing and generating big data sets with parallel, distributed algorithms
on the cluster. A MapReduce framework is defined by two operations: Map: a function that each worker node applies to the local data, writing the output to a temporary storage and Reduce: a function applied to the data based on the result of the mapping function. The role of the MapReduce is to redistribute the data resulting from the map function such as all the data with the same result is Reduced by the same worker node.
• Yet Another Resource Negotiator(YARN) - YARN is the middleware between HDFS and the processing engines being used to run applications and can dynamically allocate resources to applications as needed. By decentralising the execution and monitoring of processing jobs YARN removes performance bottlenecks and scalability problems that appears as the cluster sizes and the number of applications increased.
Using YARN to separate HDFS from MapReduce allows the Hadoop environ- ment to be more suitable for real-time processing and online applications that would suffer from serialisation of processes [52].
5.3
Flume-Kafka
On the log ingestion side of the framework, two software manages the stream of logs:
• Apache Flume - Apache Flume is a distributed, software for efficiently col- lecting, aggregating, and moving large amounts of log data [53]. It has an architecture based on streaming data flows. It uses a simple extensible data model that allows for online analytic application. Flume can be used to transport massive quantities of event data in a reliable way: A Flume source consumes events delivered to it by an external source; the events are then staged in a channel, and are removed from it only after they are stored in the next channel or in the terminal repository. This ensures that the set of events are reliably passed from point to point in the flow.
• Apache Kafka - Apache Kafka is a distributed streaming platform to manage streams of data [54]. The platform can publish and subscribe to streams of records, similar to a message queue, reading incoming data and outputting data on the same stream. The software allows also to process the stream as it occurs making it possible to transform data as it arrives.
The combination of these two software, Flume for the stream of the logs and Kafka for its ability to transform the data of the stream as it occurs allows to process event with sub-second latency and scales greatly with large amount of data [55].
5.4
Spark
The previous software discussed, following the definitions of Figure 1.3, deals with the Data Management part of a BDA process.
Apache Spark is a fast and general-purpose cluster computing system used to perform Analytics investigation on data [56]. It provides high-level APIs in four different coding languages for data management and analysis: Java, Scala, Python and R. In addition, it provides an optimised engine for graphs, higher-level tools for dealing with relational databases and implements an efficient Machine Learning library.
• Spark Core - Spark Core is the base of the framework. It manages the dispatch of distributed and basic I/O functionalities.
• Spark SQL - Spark SQL is a component on top of Spark Core that introduced a DataFrames based data abstraction, which provides support for structured and semi-structured data and standard SQL queries for extract data. • Spark MLlib - Spark MLlib is a distributed machine-learning framework
on top of Spark Core. Many common machine learning and statistical al- gorithms have been implemented simplifying large-scale machine learning pipelines
All of these functions are optimised to run on distributed computing following the MapReduce model.
The system presented is still in development, and as such it is not yet in active use and no software has been yet developed that fully makes use of the platform. This type of infrastructure is still being experimented on at CNAF, with low prior knowledge and experience: using open source tools with industry wide acknowledgement and distribution is, therefore, a great advantage for the success of this type of experimentation.
The next and final chapter of this dissertation presents the development of a BDA algorithm for the clustering of Log files. The script presented is developed with the objective of presenting a preliminary work that could be performed on the distributed platform described above.
However, due to the still in-development state of the system, the algorithm is not yet optimised to be run on a distributed system and future efforts are needed to adapt the code presented and employ it on the platform in a real use-case scenario.
Chapter 6
Log files analytics
In Summer 2018, the INFN-CNAF management decided to launch an exploratory effort to investigate the feasibility of the collection of various log data from CNAF