Masarykova univerzita Fakulta informatiky

Application Log Analysis

Master’s thesis

Júlia Murínová

I hereby declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Júlia Murínová

I would like to express my gratitude to doc. RNDr. Vlastislav Dohnal, Ph.D. for his guidance and help during the work on this thesis. Furthermore, I would like to thank my parents, friends and family for their continuous support. My thanks also belong to my boyfriend for all his assistance and help.

The goal of this thesis is to introduce the area of log analysis in general, compare available systems for web log analysis, choose an appropriate solution for sample data and implement the proposed solution. The thesis contains an overview of monitoring and log analysis, the specifics of application log analysis and definitions of log file formats. Various available systems for log analysis, both proprietary and open-source, are compared and categorized, with comparison tables giving an overview of the supported functionality.

Based on the comparison and the requirements analysis, an appropriate solution for the sample data is chosen. The ELK stack (Elasticsearch, Logstash and Kibana) and the ElastAlert framework are deployed and configured for analysis of sample application log data. The Logstash configuration is adjusted for collecting, parsing and processing the sample data input, supporting reading from a file as well as online collection of logs over a socket. Additional information for anomaly detection is computed and added to log records during Logstash processing. Elasticsearch is deployed as the indexing and storage system for the sample logs. Various Kibana dashboards for overall statistics, metrics and anomaly detection are created and provided. ElastAlert rules are set for real-time alerting based on sudden changes in the monitored events. The system supports two types of input, server logs and client logs, which can be reviewed in the same UI.

log analysis, threat detection, application log, machine learning, knowledge discovery, anomaly detection, real-time monitoring, web analytics, log file format, Elasticsearch, Kibana, Logstash, ElastAlert, dashboarding, alerting

1 Introduction

2 Monitoring & Data analysis

2.1 Monitoring in IT

2.2 Online service/application monitoring

2.3 Data analysis

2.3.1 Big data analysis

2.3.2 Data science

2.3.3 Data analysis in statistics

2.4 Data mining

2.5 Machine learning

2.6 Business intelligence

3 Log analysis

3.1 Web log analysis

3.2 Analytic tests

3.3 Data anomaly detection

3.4 Security domain

3.5 Software application troubleshooting

3.6 Log file contents

3.6.1 Basic types of log files

3.6.2 Common Log File contents

3.6.3 Log4j files contents

3.7 Analysis of log files contents

4 Comparison of systems for log analysis

4.1 Comparison measures

4.1.1 Tracking method

4.1.2 Data processing location

4.2 Client-side information processing software

4.3 Web server log analysis

4.4 Custom application log analysis

4.5 Software supporting multiple log files types analysis with advanced functionality

4.6 Custom log file analysis using multiple software solutions integration

5 Requirements analysis

5.1 Task description

5.2 Requirements and their analysis

5.3 System selection

5.4 Proposed solution

6.1 Server log file

6.2 Client log file

6.3 Data contents issues

7 Logstash configuration

7.1 Input

7.1.1 File input

7.1.2 Multiline

7.1.3 Socket based input collection

7.2 Filter

7.2.1 Filter plugins used in configuration

7.2.2 Additional computed fields

7.2.3 Adjusting and adding fields

7.2.4 Other Logstash filters

7.3 Output

7.3.1 Elasticsearch output

7.3.2 File output

7.3.3 Email output

7.4 Running Logstash

8 Elasticsearch

8.1 Query syntax

8.2 Mapping

8.3 Accessing Elasticsearch

9 Kibana configuration

9.1 General dashboard

9.2 Anomaly dashboard

9.3 Client dashboard

9.4 Encountered issues and summary

10 ElastAlert

10.1 Types of alert rules

10.2 Created alert rules

11 Conclusion

11.1 Future work

11.1.1 Nested queries

11.1.2 Alignment of client/server logs

12 Appendix 1: Electronic version

13 Appendix 2: User Guide

13.1 Discover tab

13.2 Settings tab

13.3 Dashboard tab

14.1 Logstash setup

14.2 Elasticsearch setup

14.3 Kibana setup

14.4 ElastAlert setup

15 Appendix 4: List of compared log analysis software

Millions of online accesses and transactions per day create great amounts of data that are a significant source of valuable information. Analysis of such large amounts of data requires appropriate and sophisticated methods to process them promptly, efficiently and precisely.

Data logging is an important asset in web application monitoring and reporting, as it captures massive amounts of data about the application behavior. Analysis of logged data can be a great help in reporting malicious use, detecting intruders, assuring compliance and detecting anomalies that might lead to actual damage.

In my master’s thesis I will be looking into the main benefits of monitoring, web application service log analysis and log record processing. I will be comparing a number of available systems for collecting and processing log records, considering both existing commercial and open-source solutions. With regard to the sample data collected from a chosen web application, the most fitting solution will be chosen and proposed for the required data processing.

This solution will then be implemented, deployed and tested on the sample application log records. Goals of this thesis are:

• Get familiar with the terms of monitoring, data mining and the log records analysis;

• Investigate possibilities and benefits of log records data collecting and analysis;

• Look into different types of log formats and information they contain;

• Compare and categorize commercial and open-source systems available for log analysis;

• Propose an appropriate solution for the sample log records analysis based on previous comparison and requisites;

• Implement the proposed solution, then deploy and test it on the sample data;

• Summarize the results of the implementation and list possible future improvements.

Monitoring¹ as a verb means: To watch and check a situation carefully for a period of time in order to discover something about it.

The fundamental challenge in the IT monitoring process is to adapt quickly to continuous changes and to make sure that cost-effective and appropriate software tools are used. The strength of the controlling process is based on both preventive and detective controls, which are also crucial parts of change monitoring. There might be some bottlenecks with regard to the different types of data that need to be monitored, as not all types of monitoring systems allow record logging. Automated data logging processes also might not be cost-effective, because they can slow down the processing of the data itself. The strategies for automated monitoring include IT-inherent, IT-configurable, IT-dependent manual or manual guidelines, and these need to be evaluated carefully considering the requisites and available resources. [1]

2.1 Monitoring in IT

For information technology in particular, there are a few types of monitoring that are distinguished by their purpose rather than by the contents of the monitoring processes themselves, as those often overlap. Some of the types are listed and briefly described below:

• System monitoring – system monitor (SM) is a basic process of collecting and storing system state data;

• Network monitoring – monitoring system set for reporting network issues (slow processing, connection discrepancies);

• Error monitoring – focuses on error detection, catching and handling potential issues within the code;

• Website monitoring – specific monitoring of website contents and access, reporting broken functionality or other issues related to the monitored website;

• APM (Application performance management)[2] – Based on end user experience and other IT metrics, APM is a fundamental software application monitoring and reporting system that ensures a certain level of service. It consists of four elements, see Figure 2.1:

– Top Down Monitoring (Real-time Application Monitoring) – focuses on end-user experience and can be active or passive;

– Bottom Up Monitoring (Infrastructure Monitoring) – monitoring of operations and a central collection point for events within processes;

– Incident Management Process (as defined in ITIL) – foundation pillar of APM, focuses on improvement of the application;

– Reporting (Metrics) – monitor collecting raw data for analysis of application performance.

1. From Cambridge dictionary: http://dictionary.cambridge.org/dictionary/british/monitor

Figure 2.1: Anatomy of APM [2]

Online service/application monitoring using log analysis is often compared to APM or error monitoring and shares many overlapping processes with them. The main difference between them is in the core purpose of monitoring. For APM, the emphasis is put more on the end user perspective and on enabling the best application performance possible. Error monitoring focuses on catching potential code errors by implementing an adequate level of error-handling mechanisms in the code.

2.2 Online service/application monitoring

Near real-time monitoring of data logging with automatic reporting is needed to reach the expected levels of security and quality, which need to be maintained 24 hours a day. A certain level of uniformity in logging patterns is important, as it allows the log analysis process to be standardized further. Specified event levels and categories should simplify detecting and handling suspicious activity or system failures. [3]

The crucial part of the monitoring and reporting process is identifying the problematic data in log records and evaluating the appropriate response: automatic, semi-automatic or manual. When deciding on the rules to be run for recognition of these malicious patterns, the speed of detection and the processing capacity need to be considered.

2.3 Data analysis

The term data analysis usually refers to a process of preparing, transforming, evaluating and modeling data in order to discover useful information that helps in subsequently drawing conclusions and making data-driven decisions.

The process itself includes obtaining raw data, converting it to a format appropriate for analysis (cleaning the dataset), applying the required algorithms to the collected data and visualizing the output for evaluation.

2.3.1 Big data analysis

The term Big data² is mostly used for data sets that are much larger and more complicated than usual. Huge amounts of records cause great challenges in their treatment and processing, as the traditional approaches are often not effective enough. Advanced techniques are needed to extract and analyze Big data, and new promising approaches are being developed specifically for its treatment.

Big data processing focuses on the collection and management of large amounts of various data to serve large-scale web applications and sensor networks. The field called data science focuses on discovering underlying patterns in complex data and modeling them into the required output. [32]

2.3.2 Data science

2. Definition from Cambridge dictionary: http://dictionary.cambridge.org/dictionary/english/big-data

A basic data science process consists of a few phases (see Figure 2.2 for visualization). The process is iterative, as new characteristics may be introduced during execution. The phases of data science are listed below: [4]

• Data requirements – Clear understanding of data specifics that need to be analyzed;

• Data collection – Collection of data from specific sources (sensors in environment, recording, online monitoring etc.);

• Data processing – Organization and processing of obtained data into a suitable form;

• Data cleaning – Process of detecting and correcting errors in data (missing, duplicate and incorrect values);

• Exploratory data analysis – Summarizing main characteristics of data and its properties;

• Models and algorithms – Data modeling using specific algorithms based on the type of problem;

• Data product – Result of the analysis based on required output;

• Communication – Visualization and evaluation of the data product, modifications based on feedback.
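The data cleaning phase from the list above can be illustrated with a minimal Python sketch. The record layout (the `timestamp`, `user` and `status` keys) and the validity rule for `status` are assumptions made for illustration only; they are not taken from the sample data used in this thesis.

```python
from typing import Optional

# Hypothetical raw records; the field names are illustrative assumptions.
RAW_RECORDS = [
    {"timestamp": "2015-03-01T10:00:00", "user": "alice", "status": "200"},
    {"timestamp": "2015-03-01T10:00:00", "user": "alice", "status": "200"},  # duplicate
    {"timestamp": "2015-03-01T10:00:05", "user": None, "status": "404"},     # missing value
    {"timestamp": "2015-03-01T10:00:09", "user": "bob", "status": "abc"},    # incorrect value
    {"timestamp": "2015-03-01T10:00:12", "user": "bob", "status": "500"},
]


def clean_record(record: dict) -> Optional[dict]:
    """Return a normalized copy of the record, or None if it cannot be used."""
    if any(record.get(key) is None for key in ("timestamp", "user", "status")):
        return None  # missing value
    if not record["status"].isdigit():
        return None  # incorrect value
    return {**record, "status": int(record["status"])}


def clean(records: list) -> list:
    """Drop missing, incorrect and duplicate records, keeping the input order."""
    seen = set()
    cleaned = []
    for record in records:
        normalized = clean_record(record)
        if normalized is None:
            continue
        key = (normalized["timestamp"], normalized["user"], normalized["status"])
        if key in seen:
            continue  # duplicate of an earlier record
        seen.add(key)
        cleaned.append(normalized)
    return cleaned


CLEANED = clean(RAW_RECORDS)
```

In this sketch invalid records are simply dropped; depending on the data requirements, correcting or imputing the values might be preferable.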

2.3.3 Data analysis in statistics

Statistical methods are essential in data analysis, as they can derive the most important characteristics from the data set and use this information directly for visualization via basic information graphics (line chart, histogram, plots and charts). In statistics, data analysis can be divided into three different areas: [5]

• Descriptive statistics – It is mostly used for quantitative description. It contains basic functions (sum, median, mean) as characteristics of the data set.

• Confirmatory data analysis (CDA) (also referred to as hypothesis testing) – It is based on probability theory (significance level). It is used to confirm or reject a hypothesis.

• Exploratory data analysis (EDA) – In comparison to Confirmatory data analysis, EDA does not have a pre-specified hypothesis. It is mostly used for summarizing main characteristics and exploring data without formal modeling or testing of content assumptions.
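As a small illustration of descriptive statistics, the following Python sketch computes the basic characteristics named above using the standard `statistics` module. The response times are invented values, assumed purely for the example.

```python
import statistics

# Hypothetical response times in milliseconds (illustrative values only).
response_times_ms = [120, 95, 130, 110, 95, 2400, 105]

summary = {
    "count": len(response_times_ms),
    "sum": sum(response_times_ms),
    "mean": statistics.mean(response_times_ms),
    "median": statistics.median(response_times_ms),
    "mode": statistics.mode(response_times_ms),
}
```

The single very large value pulls the mean far above the median, which is one reason why several characteristics are usually inspected together.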

2.4 Data mining

Data mining is, in a sense, a deeper step into the analyzed data. It is a computational process of discovering patterns in the records of the full data set to gain knowledge about its contents. Data mining combines the areas of artificial intelligence, machine learning, statistics and database systems to extract significant information and transform it into a simplified format for future use. The basic purpose of data mining is the mostly automatic analysis of large amounts of data to detect outstanding patterns, which might subsequently be used for further analysis by machine learning or other analytics. There are six basic tasks in data mining:

• Anomaly detection (Outlier/change/deviation detection) – Detection of outstanding records in a data set;

• Association rule learning (Dependency modeling) – Detection of relationships between variables and attributes;

• Clustering – Detection of similar properties of analyzed data and creating groups based on this information;

• Classification – Generalization of type of structure and classification of input data based on the learnt information;

• Regression – Detection of a function to model data with the least error;

• Summarization – Detection of compact structure representing the data-set (often using visualization and reports).

Data mining is also considered to be the analytics part of the Knowledge Discovery in Databases (KDD) process, used for processing data stored in database systems. The placement of data mining in the KDD process is shown in Figure 2.3. The additional parts, such as data collection and preparation or results evaluation, do not belong to data mining but rather to the KDD process as a whole. [7]

Figure 2.3: Data mining placed in the KDD Process [7]

2.5 Machine learning

Machine learning is a field exploring the possibilities of using algorithms that are capable of learning from data. These algorithms are based on finding structural foundations to build a model from the training data and derive rules and predictions. Based on the given input, machine learning is divided into the main categories listed below:

• Supervised learning – Example input and corresponding output are presented in training data.

• Unsupervised learning – No upfront information is given about the data, leaving the pattern recognition to the algorithm itself.

• Semi-Supervised Learning – Incomplete input information is provided. It is a mixture of known and unknown desired output information.

• Reinforcement learning – It is based on interaction with a dynamic environment to reach a certain goal (e.g. winning a game and developing a strategy based on the previous success).

Machine learning and data mining contain similar methods and often overlap. However they can be distinguished based on the properties they are processing. While machine learning is working with known properties learnt from the training data, data mining focuses on unknown properties and pattern recognition. [6]

2.6 Business intelligence

Business intelligence (BI) is a set of tools and technologies used for processing raw data and other relevant information into business analysis. There are numerous definitions of what exactly BI consists of. In this thesis, the definition in which internal data analysis is considered a part of BI is used³: business intelligence is the process of collecting business data and turning it into information that is meaningful and actionable towards a strategic goal.

BI is based on transformation of available data into a presentable form enabling easy-to-use visualization. This information might be crucial for strategic business decisions, threats and opportunities detection and better business insight. [9] Basic elements of Business intelligence are:

• Reporting – Accessing and processing of raw data into a usable form;

• Analysis – Identifying patterns in reported data and initial analysis;

• Data mining – Extraction of relevant information from collected data;

• Data quality and interpretation – Quality assurance and comparison between the obtained data and the real objects they represent;

• Predictive analysis – Using the output information to predict probabilities and trends.

3. Definition available on World Wide Web: <http://www.logianalytics.com/resources/bi-encyclopedia/business-intelligence/>

The term is usually used for monitoring systems where data logs are records of the events detected by the sensors. The data logs are further processed in the log analysis.

Log analysis consists of the subsequent research, interpretation and processing of the records generated by data logging. The most usual reasons for log analysis are security, regulation, troubleshooting, research and automatic incident response. The semantics of specific log records are designed by the developers of the software; they might therefore differ between areas of usage, and sometimes these differences are not fully documented. As a result, a significant amount of time might be needed for pre-processing the log records and modifying them into a form usable for the following data analysis.

In this thesis I will mainly focus on web log analysis, i.e. the analysis of logs generated in web communication and interaction. The following sections include general information about web log analysis, its possible uses and the common formats of these logs.

3.1 Web log analysis

A web log is basically an electronic record of the interaction between the system and its user. Therefore, there may be additional user actions that trigger the creation of a record (not only requests for a connection or data transmission, but also the overall behavior on the webpage, such as clicking links and buttons). The area that targets the measurement, collection, analysis and reporting of web data is called web analytics. Web analytics has been studied and improved significantly over the past years, mainly because of its significance in increasing the usability of web applications and, from the marketing point of view, in gaining more users and customers. [11]

In comparison, Business Intelligence is focused more on marketing-based analysis of internal data from multiple sources. Even though there are various approaches and software solutions available, BI is still considered freer in terms of implementation and depends highly on the needs, structure and tools of the organization. Web analytics, on the other hand, is specialized in the analysis of web traffic and web usage trends. As a whole, it offers a solution for one area and is separated from the rest of the data. However, the borders are now more blurred, and web analytics can sometimes be perceived as a specific data flow from one source, used alongside others as part of Business Intelligence.

The purpose of web log analysis also lies in monitoring the communication between the system and the user. The actions of this communication are stored in electronic records and are subsequently analyzed for behavior patterns. These patterns are important for the research of both user and system behavior and their reactions to various actions. The users’ actions can include useful information about their usage of web applications and can be analyzed for system improvements, detection of security defects and compliance records. The system replies and actions can reveal malfunctions on the server side, unusual behavior in the treatment of specific actions and erroneous responses. As a result, there are specific areas for the analytic tests performed on the data logs, which are discussed in the following section.

3.2 Analytic tests

From the point of view of statistical analysis of the data, there are two main approaches, or branches of communication information classification. The quantitative approach focuses on the numbers of accesses, transmissions and requests/actions, their distribution over time and the number of clients/ports/sessions. The qualitative approach, on the other hand, detects parts of the communication which are out of the ordinary. Based either on the expected usage of the web application or on the analysis of test data, a certain basic behavioral pattern is expected to be seen in the log records output. The records that follow the expected values are considered the normal dataset, and most of the overall analyzed dataset usually belongs to this group.
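The quantitative approach can be sketched in a few lines of Python. The example assumes that the timestamp and the client address have already been extracted from the log records; both the tuple layout and the sample values are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical (timestamp, client address) pairs extracted from log records.
ACCESSES = [
    ("2015-03-01 10:02:11", "10.0.0.1"),
    ("2015-03-01 10:15:42", "10.0.0.2"),
    ("2015-03-01 10:59:03", "10.0.0.1"),
    ("2015-03-01 11:01:27", "10.0.0.1"),
    ("2015-03-01 13:30:00", "10.0.0.3"),
]


def requests_per_hour(accesses):
    """Count accesses in each hour-long bucket."""
    counts = Counter()
    for timestamp, _client in accesses:
        moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        counts[moment.strftime("%Y-%m-%d %H:00")] += 1
    return counts


def requests_per_client(accesses):
    """Count accesses made by each client address."""
    return Counter(client for _timestamp, client in accesses)


PER_HOUR = requests_per_hour(ACCESSES)
PER_CLIENT = requests_per_client(ACCESSES)
```

Counts of this kind describe the distribution over time and over clients; the qualitative approach would then look for buckets or clients that deviate from the expected pattern.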

3.3 Data anomaly detection

There are often records that indicate different results than expected and might be significantly different from the other records in the dataset. These are considered anomalies in the data and one of the most important goals of log analysis is their detection and treatment.

Detection of anomalies, also called outliers, is one of the primary steps in data-mining applications. The first steps of an analysis include the detection of outlying observations, which may be considered errors or noise, but which also carry significant information, as such observations might lead to incorrect specification and results. Some definitions of outliers are more general than others, depending on the context, data structure and method of detection used. The most basic view is that an outlier is an observation in the data set which appears inconsistent with the rest of the data. There are multiple methods for outlier detection, differing according to the specifics of the data set; they are often based on distance measures, clustering and spatial methods. Outlier/anomaly detection is used in various applications, such as credit card fraud detection, data cleansing, network intrusion detection, weather prediction and other data-mining tasks. [10]
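One of the simplest distance-based views of an outlier can be sketched as follows: an observation is flagged when it lies more than a chosen number of standard deviations from the mean. This is only an illustrative technique, not the detection method used later in this thesis; the threshold of 2.0 and the sample values are assumptions.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    deviation = statistics.pstdev(values)
    if deviation == 0:
        return []  # all values are identical, nothing stands out
    return [
        (index, value)
        for index, value in enumerate(values)
        if abs(value - mean) / deviation > threshold
    ]


# Hypothetical number of requests observed in consecutive minutes.
requests_per_minute = [52, 48, 50, 47, 53, 49, 51, 250]
outliers = find_outliers(requests_per_minute)
```

Because the mean and the standard deviation are themselves influenced by extreme values, more robust measures (for example those based on the median) may be preferable for small or heavily skewed data sets.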

The subsequent anomaly analysis is essential for investigating the root cause of the detected anomaly, and it helps greatly in the prevention of both inside and outside threats. The inside kind of defects might include malfunctions in the system code or erroneous request processing. The outside threats are often web-based attacks and intrusion attempts. Anomaly detection plays a significant role in the detection of web-based attacks; systems built on it are called anomaly-based intrusion detection systems (IDSs). A basic intrusion detection system monitors the web communication against a directory of known types of intrusion attacks and takes action once suspicious behavior is detected. However, to ensure a certain level of security against unknown types of attacks, potentially anomalous communication should also be monitored for possible threats. In this area, monitoring the anomalies of web traffic is essential for finding new types of attack attempts that can be detected from behavior records stored in the data logs. [13]

3.4 Security domain

Frequent attack attempts are based on finding applications with flawed functionality. Taking advantage of vulnerabilities, the attacker inserts code which is executed by the web application, causing a transfer of malicious code into the backend or reading of unauthorized data from the database.

These types of attacks can be detected in the log files, as the injected code is recorded when it is sent to the server. Post-detection is important for avoiding future attacks, but because it runs only after the fact, pro-active monitoring is essential as well. Basic regular expressions or more complicated methods can be used to create rules for detecting known attacks. Communication containing injected harmful code is then rejected as a result.
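A minimal sketch of rule making with regular expressions is shown below. The three signatures are deliberately simplified assumptions chosen for illustration; they are far from a complete rule set and are not the rules used in this thesis.

```python
import re

# Illustrative signatures only; a production rule set would be far more extensive.
ATTACK_RULES = {
    "sql_injection": re.compile(
        r"('|%27)\s*(or|and)\s+\d+\s*=\s*\d+|union\s+select", re.IGNORECASE
    ),
    "xss": re.compile(r"<\s*script\b|%3Cscript", re.IGNORECASE),
    "path_traversal": re.compile(r"\.\./|%2e%2e%2f", re.IGNORECASE),
}


def match_rules(request: str) -> list:
    """Return the sorted names of all rules whose pattern occurs in the request string."""
    return sorted(
        name for name, pattern in ATTACK_RULES.items() if pattern.search(request)
    )
```

A request line for which `match_rules` returns a non-empty list would be rejected or reported; an empty list only means that no known signature matched, not that the request is harmless.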

The application runs on the 7th layer of the ISO/OSI model, and for detection to be efficient it has to see the relevant traffic. There are multiple parts of the communication that can be subject to attack. Figure 3.1 illustrates high-level attack detection in a network.

On the lower layers (network and transport layer), a firewall analyzes traffic based on common protocols. It can detect anomalies in protocols. However, it cannot detect attacks on the application, as it does not see the additional data from higher layers. A web application firewall, on the other hand, processes the higher-layer protocols and can analyze the traffic more precisely. It has enough information for filtering and detection and, as a result, is a good place for defining allowance rules for specific requests and for attack detection.

Figure 3.1: Illustration of communication zones for attack detection [18]

Web servers such as Apache and IIS usually create log files in the Common Log Format (CLF), described further in the following sections. However, this format does not contain data sent in the HTTP request headers or body (e.g. POST parameters). Since this information can contain important data about possible attacks, this is a great deficiency of web server logs.

As a part of the application logic, there should also be a certain degree of input and output data validation and integrated logging of security information. The application log files should contain full information about the actions of the user and therefore allow wide possibilities for misuse and threat detection mechanisms.

A network intrusion detection system (NIDS) analyzes the whole traffic to and from the application. However, it has some disadvantages, such as difficulties with decrypting SSL communication and with real-time processing under high traffic load. Also, working on ISO/OSI layers 3 and 4 makes it unable to detect attacks targeting information on higher layers.

For attack detection, there are two possibilities: log file analysis and full traffic analysis. Even though log files do not contain all data about the communication, they are easily available and collected. Because servers log to standard formats by default and applications usually contain a basic logging process for traceability of users’ actions, log files provide an easily set-up basis for security monitoring.

Attacks can be detected using two strategies, static rules and dynamic rules, which differ in the way they are created. The recommended attack monitoring system should consist of both types of rules. [18]

• Rule-based detection (static rules) – This strategy defines static rules based on known attack patterns that need to be rejected in order to avoid attacks. These rules are prepared manually beforehand, based on previously known information, and stay the same during detection. Static rules can be divided into two models:

– Negative security model – The blacklist approach allows everything by default: all traffic is considered normal and the policy defines what is not allowed (listed on a blacklist). The biggest disadvantage lies in the dependence on the quality of the policy and the need to update it regularly.

– Positive security model – The positive model is the opposite of the negative one: it denies all traffic except for that allowed by the policy (listed on a whitelist). The whitelist contents can be learnt in a training phase by a machine learning algorithm or defined manually.

• Anomaly-based detection (dynamic rules) – Dynamic rules are not prepared beforehand from known information. They are obtained in a learning phase on a training dataset using machine learning algorithms. It is essential to make sure that the dataset is free of attacks and anomalies, so that correct rules are generated. Afterwards, traffic considered different from the normal dataset is flagged as anomalous.
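The difference between the negative and the positive security model can be sketched as follows. The blacklist fragments, the whitelisted paths and the function names are hypothetical, and `learn_whitelist` is only a very simple stand-in for a learning phase, not a machine learning algorithm.

```python
# Hypothetical policies; the paths and forbidden fragments are illustrative.
BLACKLIST = ("union select", "<script", "../")
WHITELIST = {"/", "/login", "/products", "/cart"}


def allowed_by_negative_model(request_path: str) -> bool:
    """Negative model: allow everything except requests containing a blacklisted fragment."""
    lowered = request_path.lower()
    return not any(fragment in lowered for fragment in BLACKLIST)


def allowed_by_positive_model(request_path: str) -> bool:
    """Positive model: deny everything except paths listed on the whitelist."""
    path_only = request_path.split("?", 1)[0]
    return path_only in WHITELIST


def learn_whitelist(training_paths) -> set:
    """Collect the paths seen in attack-free training traffic."""
    return {path.split("?", 1)[0] for path in training_paths}
```

For example, a request to the unknown path `/admin/backup` passes the negative model, because no blacklisted fragment is present, but is denied by the positive model. The positive model in this sketch checks only the path; a realistic policy would also constrain the request parameters.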

Anomalous patterns may also be helpful in other application monitoring areas like system troubleshooting. While security monitoring targets detection of suspicious behavior coming from the outside, system performance analysis and troubleshooting are focused on the inside behavior. Internal behavior patterns might reveal errors in code or even in design of the application or system setup.

3.5 Software application troubleshooting

Log files can be used in multiple stages of software development, mainly debugging and functionality testing. Using log files for information extraction, it is possible to check the logic of a program without the need to run it in a debug mode. Another advantage is that this type of testing is not affected by the probe effect (time-based issues introduced when testing in a specific run-time environment) and does not require the environment and system settings to be generated as in currently used testing and debugging practice. Log files also offer important insight into the overall functionality and performance of a system.

With sufficient background implementation for automatic log file analysis in software testing, making use of language and specification capabilities, log file analysis can be considered a useful methodology for software verification, somewhere between current testing practice and formal verification methodologies. [26]

From the software development, testing and monitoring perspective, there is valuable information that can be extracted from the log files. This information can be divided into several main classes: [23]

• Generic statistics (e.g. peak and average values, median, modus, deviations) – They are mostly used in setting hardware requirements, accounting and general view into the system functionality.

• Program or system warnings (e.g. power failure, low memory) – They are mostly used in system maintenance and performance analysis.

• Security related warnings – They are used in security monitoring, discussed in the previous section.

• Validation of program runs – It is used as a type of software testing, included in development cycle.

• Time related characteristics – They are important for software profiling and benchmarking, can also reveal system performance issues.

• Causality and trends – They contain essential information about the processed transactions and are used mostly in data mining.

• Behavioral patterns – They are mostly used in system troubleshooting, performance and reliability monitoring.
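The first class above, generic statistics, can be illustrated with a minimal Python sketch. The response times are hypothetical sample values, and the function name `summarize` is introduced here only for illustration.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics for a list of numeric measurements."""
    return {
        "peak": max(values),
        "average": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "deviation": statistics.pstdev(values),  # population standard deviation
    }

# Hypothetical response times in milliseconds extracted from log records.
response_times_ms = [120, 95, 120, 310, 101, 120, 87]
summary = summarize(response_times_ms)
```

For the sample values the peak is 310 ms, while the median and the mode are both 120 ms, which shows how a single slow request raises the average without affecting the median.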

For system troubleshooting, there are various types of valuable information logged and their extraction can provide essential knowledge about the system behavior and detect performance issues that are not easily found otherwise. Some of the most basic ways to use log analysis for system performance analysis are: [24]

• Slow response – Detection of slow response times can point out directly the functionality area that should be optimized and checked for eventual code errors.

• Memory issues and Garbage collection – Basic analysis of error messages can indicate faulty behavior in specific scenarios, and out-of-memory issues are among the most common ones. These might also be caused by slow or long-lasting garbage collection, which can result in overall slow application behavior.


• Deadlocks and Threading issues – The more users access the application resources simultaneously, the greater the potential for deadlock situations¹. Preventing as well as dealing with these occurrences is therefore an important part of application logic, and their detection can significantly help performance optimization.

• High resource usage (CPU/Disk/Network) – High resource usage might result in slowing down the performance or even halting the system. These irregularities can therefore help to detect the busiest times of system usage or even need of additional resources allocation due to increased user demands.

• Database issues – Once the applications are communicating directly with the database, the queries results as well as response times and potential multithread access issues are significant to overall functionality and application responsiveness.
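As an illustration of the first item in the list above, the following sketch counts slow requests per endpoint. It assumes that each log record already carries a response time in milliseconds, which is not part of every log format; the threshold value and the record layout are assumptions made for this example.

```python
from collections import Counter

SLOW_THRESHOLD_MS = 1000  # assumed threshold; it has to be tuned per application

def slow_endpoints(records, threshold_ms=SLOW_THRESHOLD_MS):
    """Count, per endpoint, the requests whose duration exceeds the threshold."""
    counts = Counter()
    for endpoint, duration_ms in records:
        if duration_ms > threshold_ms:
            counts[endpoint] += 1
    # Most frequently slow endpoints first.
    return counts.most_common()

# Hypothetical (endpoint, duration in ms) pairs extracted from log records.
records = [
    ("/search", 1800),
    ("/home", 120),
    ("/search", 2400),
    ("/report", 1500),
    ("/home", 90),
]
slow = slow_endpoints(records)
```

The result points directly at the functionality area that is worth optimizing first, here the hypothetical `/search` endpoint.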

However, not only what occurs in the system is worth detecting. Inactivity, which can be easily found by log file analysis, also provides important insight for system monitoring. If an important action that was scheduled to run did not happen, it would not generate any error message, but it would still have a significant impact. As a result, it is important not only to monitor and search the logged data for error messages and behavior patterns that happened, but also to detect those situations when nothing happened even though it should have. It is therefore worth looking into the possibilities of their detection and compilation in order to maintain a certain quality of service. [25]
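Detection of inactivity can be sketched as a comparison of the expected run times of a scheduled action with the timestamps actually found in the log. The following Python example is a minimal sketch under the assumption that the action runs at a fixed interval; the function name, the interval and the tolerance are hypothetical.

```python
from datetime import datetime, timedelta

def missing_runs(event_times, start, end, interval, tolerance):
    """Return expected run times in [start, end] with no logged event within the tolerance."""
    missing = []
    expected = start
    while expected <= end:
        if not any(abs(logged - expected) <= tolerance for logged in event_times):
            missing.append(expected)
        expected += interval
    return missing

# Hypothetical timestamps of an hourly job found in the log; the 02:00 run is absent.
logged = [
    datetime(2015, 2, 3, 0, 0, 5),
    datetime(2015, 2, 3, 1, 0, 2),
    datetime(2015, 2, 3, 3, 0, 1),
]
gaps = missing_runs(
    logged,
    start=datetime(2015, 2, 3, 0, 0),
    end=datetime(2015, 2, 3, 3, 0),
    interval=timedelta(hours=1),
    tolerance=timedelta(minutes=5),
)
```

Here `gaps` contains only the 02:00 run, which produced no error message and would therefore stay unnoticed by a search for errors alone.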

However, the contents of log files can differ greatly from system to system. Depending on the desired information, the format of the log files often needs to be adjusted to contain the specific information. Basic server log files usually contain standard information used for server-side monitoring and troubleshooting; for analysis of specific application logic, additional log files with more descriptive information may need to be generated.

3.6 Log file contents

The web log analysis software (sometimes also called the web log analyzer) is a tool that processes a log file from a server and, based on its values, obtains knowledge about who accessed the system, when and from where, and what actions took place during a session. There are various approaches to log file generation and processing. Log files may be parsed and analyzed in real time or may be collected and stored in databases to be examined later on. The subsequent analysis then depends on the required metrics and types of data the analysis focuses on. The basic information contained in the web log format tends to be similar across different systems. This however depends on the software application type. As a result, different log file output is generated by intrusion detection systems, antivirus software, and the operating system or web server when creating access logs. These differences need to be taken into account when storing and processing data from multiple sources. There are also various recommendations for log management security published by the National Institute of Standards and Technology that should be followed when processing log records internally within organizations. [14]

1. Deadlock – a situation in which two or more competing actions are each waiting for the other to finish, and thus neither ever does.

There are some default types of variables and values that are generated for the web logs by specific web server software. However, even for web server solutions like the Apache web server software², it is possible to alter and configure the generated web log format according to specific needs. [67]

3.6.1 Basic types of log files

There are basic log file types that are used by web server logging services. These may differ according to the type of server as well as its version, and an important part of preparation for log file analysis is getting familiar with their contents and requirements. Multiple different logs are also generated depending on their triggering event, contents and logic, such as error logs, access logs, security logs and piped logs. Selected web server log formats are: [15]

• NCSA Log Formats – The NCSA log formats are based on NCSA httpd and are mostly used as a standard for HTTP server logging contents. There are also specific types of NCSA formats such as:

– NCSA Common (also referred to as access log) – It contains only basic HTTP access information. Its specific contents are listed in the following section.

– NCSA Combined Log Format – It is an extension of the Common NCSA format as it contains the same information with additional fields (referrer, user agent and cookie field).

– NCSA Separate (three-log format) – In this case the information is stored in three separate logs – access log, referrer log and agent log.

2. The Apache web server software is one of the most widely used open-source solutions worldwide. More information available from World Wide Web: <https://www.apache.org/>


• W3C Extended Log Format – This type of log format is used by Microsoft IIS (Internet Information Service) versions. It contains a set of lines that might consist of directive or entry. Entries are made of fields corresponding to HTTP transactions, separated by spaces and using a dash for fields with missing values. Directives contain information about the rules for the logging process.
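The structure of the W3C Extended Log Format described above (directive lines and space-separated entries with a dash for missing values) can be illustrated by a minimal parsing sketch. It relies on the convention that directive lines start with the `#` character and that the `Fields` directive names the columns; the sample lines are hypothetical, and quoted values containing spaces are not handled.

```python
def parse_w3c_extended(lines):
    """Split W3C Extended Log Format lines into directives and entries."""
    directives = {}
    fields = []
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line, e.g. "#Fields: date time c-ip".
            name, _, value = line[1:].partition(":")
            directives[name.strip()] = value.strip()
            if name.strip() == "Fields":
                fields = value.split()
            continue
        # Entry line: space-separated values, "-" marks a missing value.
        values = [None if value == "-" else value for value in line.split()]
        entries.append(dict(zip(fields, values)))
    return directives, entries

# Hypothetical sample log contents.
sample = [
    "#Version: 1.0",
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status cs(Referer)",
    "2015-02-03 00:00:12 192.0.2.10 GET /index.html 200 -",
]
directives, entries = parse_w3c_extended(sample)
```

Because the column names are taken from the `Fields` directive, the same sketch works for differently configured logs without changes in the code.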

Apart from the main server log file types, there are multiple specific ones that might be generated by FTP servers, supplemental servers and application servers³.

It is also important to set up application logging to simplify troubleshooting and maintenance as well as to increase protection from outside threats. Many systems contain server and database logging, but the application event logging is missing, disabled or poorly configured. However, application logging provides valuable insight into the application specifics and has the potential to bring much more information than basic server data compilation. The application log formats might differ greatly as they are highly dependent on the application specifics, its development and needs. Nevertheless, within the application, organization or infrastructure the log file format should be consistent and as close to standards as possible. [27]

There are also logging utilities created to simplify the definition of a consistent application logging and tracking API. Using a standardized log file format makes subsequent pre-processing and analysis much simpler. An example of a widely used API is the open source log4j API for Java, which offers a whole package of logging capabilities and is often used for log generation in applications written in Java. [28]

However, for basic logging functionality or a simple web application, the default utilities might generate sufficient records. To decide whether the default log file contents are enough for the needs of users, basic insight into the common log file formats is required.

3.6.2 Common Log File contents

A Common Log Format or the NCSA Common log format is based on logging information about the client accessing the server. Due to its standardization, it can be more easily used in multiple web log analysis software tools. It contains the requested resource and some additional information, but no referral, user agent or cookie information. All the log contents are stored in a single file.

3. For example, the server log types of the Tomcat server: <https://support.pivotal.io/hc/en-us/articles/202653818-Tomcat-tc-Server-log-file-types-2009881>


An example of the log file format is:

host id username date:time request status bytes

• host – the IP address of the HTTP client that made a request;

• id – the identifier used for a client identification;

• username – the username or the user ID for the authentication of the client;

• date:time – the date and time stamp of the HTTP request;

• request – the HTTP request, containing three pieces of information – resource (e.g. URL), the HTTP method (e.g. GET/POST) and the HTTP protocol version;

• status – the numeric code indicating the success/failure of the request;

• bytes – the number of bytes of the data transferred as part of the request without the HTTP header.
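The fields listed above can be extracted from a log record with a regular expression. The following Python sketch assumes the usual Apache rendering of the Common Log Format, with the time stamp in square brackets and the request in double quotes; the sample line and the group names are chosen for illustration only.

```python
import re
from datetime import datetime

# One named group per field of the Common Log Format.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<id>\S+) (?P<username>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_clf(line):
    """Parse one Common Log Format line into a dictionary, or return None."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A dash means that no data was transferred.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record

line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pub.gif HTTP/1.0" 200 2326'
record = parse_clf(line)
```

Lines that do not match the pattern, for example malformed requests, yield `None` in this sketch and would need separate handling in a real deployment.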

The described type of the common log file format contains only the most essential information. Usually more items obtained throughout the session are added to the log, depending on the type of data that needs to be retrieved from the web server visit logs. Information about the browser type and its version, the operating system, or other actions of the user during the session is often included.

3.6.3 Log4j files contents

The Log4j Java logging utility is developed under the Apache Software Foundation and is platform independent. Log file contents are labeled with defined standard levels of severity of the generated message. The basic Log4j log message levels are listed below: [30]

• OFF – The OFF level has the highest possible rank and is intended to turn logging off.

• FATAL – The FATAL level designates very severe error events that will presumably lead the application to abort.

• ERROR – The ERROR level designates error events that might still allow the application to continue running.


• WARN – The WARN level designates potentially harmful situations.

• INFO – The INFO level designates informational messages that highlight the progress of the application at coarse-grained level.

• DEBUG – The DEBUG level designates fine-grained informational events that are most useful for application debugging.

• TRACE – The TRACE level designates finer-grained informational events than the DEBUG level type.

• ALL – The ALL level has the lowest possible rank and is intended to turn all logging on.
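The ranking of the levels listed above determines which messages are written: a message is logged when its level ranks at or above the threshold configured for the logger. The following Python sketch only models this ordering and is not part of the log4j API; the names `LEVELS`, `RANK` and `is_enabled` are hypothetical.

```python
# Levels ordered from the lowest rank (ALL) to the highest rank (OFF).
LEVELS = ["ALL", "TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL", "OFF"]
RANK = {name: position for position, name in enumerate(LEVELS)}

def is_enabled(event_level, threshold):
    """Return True when an event of the given level passes the logger threshold."""
    return RANK[event_level] >= RANK[threshold]

# Levels that can actually be assigned to a message.
EVENT_LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"]
passed_at_warn = [level for level in EVENT_LEVELS if is_enabled(level, "WARN")]
```

With the threshold set to WARN, only the WARN, ERROR and FATAL messages pass. The threshold OFF ranks above every message level and therefore suppresses all output, while ALL lets every message through.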

Log4j file contents can be adjusted using a properties file, an XML file or through the Java code itself. The log4j logging utility is based on three main components which can be configured:

• Loggers – Loggers are logical log file names which can be independently configured according to their level of logging; they are used in application code to log a message.

• Appenders – Appenders are responsible for sending a log message to an output, e.g. a file or a remote computer. Multiple appenders can be assigned to a logger to send its information to more outputs.

• Layouts – Layouts are used by appenders for output formatting. The most commonly used layout is PatternLayout, which writes every log entry on one line containing the defined information and whose format is specified using the ConversionPattern parameter.

The PatternLayout is a flexible layout type defined by a conversion pattern string (a format string similar to that of the C printf function, not a regular expression). The goal is to format logging event information into a suitable format and return it as a string. Each conversion specifier starts with a percent sign (%) and is followed by optional format modifiers and a conversion character. The conversion character specifies the type of data, e.g. category, priority, date, thread name. Any literal text can be inserted into the pattern. [31] Conversion characters are listed in Table 3.1. As a result, the ConversionPattern can be used to define the specific logger output format using the listed characters in its definition.


| Conversion character | Type of data |
|---|---|
| c | Category of the logging event |
| C | Class name of the caller issuing the logging request |
| d | Date of the logging event |
| F | File name where the logging request was issued |
| l | Location information of the caller which generated the logging event |
| L | Line number from where the logging request was issued |
| m | Application supplied message associated with the event |
| M | Method name where the logging request was issued |
| n | Platform dependent line separator character or characters |
| p | Priority of the logging event |
| r | Number of milliseconds elapsed from the construction of the layout until the creation of the logging event |
| t | Name of the thread that generated the logging event |
| x | NDC (nested diagnostic context) associated with the thread that generated the logging event |
| X | MDC (mapped diagnostic context) associated with the thread that generated the logging event |
| % | The sequence %% outputs a single percent sign. |

Table 3.1: List of Conversion characters used in ConversionPattern

For example, the desired pattern can be defined by the string sequence:

%d [%t] %-5p %c - %m%n

Possible output might then display like:

2015-02-03 00:00 [main] INFO  log4j.SortAlgo - Start sort

Meanings of the items separated by spaces in the example are:

• %d (2015-02-03 00:00) – date of the logging event;

• %t ([main]) – name of the thread that generated the logging event (in brackets according to pattern definition);

• %-5p (INFO) – priority of the logging event (the conversion specifier %-5p means the priority of the logging event should be left justified to a width of five characters);

• %c (log4j.SortAlgo) – category of the logging event;

• %m (Start sort) – application supplied message associated with the logging event (dash is an added literal character between category and the message full text according to pattern definition);

• %n – adds line separator after logging event record.
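For later analysis, a record written with the example pattern has to be split back into its items. The following Python sketch does this with a regular expression that mirrors the pattern `%d [%t] %-5p %c - %m%n`. The date part accepts the short form used in the example as well as optional seconds and milliseconds; this is an assumption, since the exact date output depends on the log4j configuration.

```python
import re

LOG4J_LINE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}(?::\d{2}(?:,\d{3})?)?) "  # %d
    r"\[(?P<thread>[^\]]+)\] "                                        # [%t]
    r"(?P<priority>[A-Z]+)\s+"                                        # %-5p (padded)
    r"(?P<category>\S+) - "                                           # %c and literal " - "
    r"(?P<message>.*)"                                                # %m
)

def parse_log4j_line(line):
    """Split one log line written with the example pattern into named items."""
    match = LOG4J_LINE.match(line.rstrip("\r\n"))  # drop the %n line separator
    return match.groupdict() if match else None

event = parse_log4j_line("2015-02-03 00:00 [main] INFO  log4j.SortAlgo - Start sort\n")
```

The sketch only fits this particular pattern; a different ConversionPattern requires a correspondingly different regular expression.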

Log4j logging utility provides wide possibilities for adjusting the format, contents and functionality of application logging, which can ease the subsequent analysis and log files management.

There is also a variety of possibilities for filtering message contents in generated log file records. Full-text searches and result filtering based on specific message strings that might reveal potential threats or system malfunctions can be configured and automated. Contextual patterns that are potentially important to review can often be easily defined, e.g. by regular expressions, and searched for.
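Such automated filtering can be sketched as a set of named regular expressions applied to every message. The patterns and sample messages below are hypothetical examples and do not form a complete rule set.

```python
import re

# Hypothetical patterns pointing at malfunctions or potential threats.
SUSPICIOUS_PATTERNS = {
    "out_of_memory": re.compile(r"OutOfMemoryError"),
    "sql_injection_probe": re.compile(r"union\s+select", re.IGNORECASE),
    "failed_login": re.compile(r"login failed for user \S+", re.IGNORECASE),
}

def flag_messages(messages, patterns=SUSPICIOUS_PATTERNS):
    """Return (message index, pattern name) pairs for every matching message."""
    hits = []
    for index, message in enumerate(messages):
        for name, pattern in patterns.items():
            if pattern.search(message):
                hits.append((index, name))
    return hits

messages = [
    "Start sort",
    "java.lang.OutOfMemoryError: Java heap space",
    "GET /items?id=1 UNION SELECT password FROM users",
    "Login failed for user admin",
]
hits = flag_messages(messages)
```

Only the last three messages are flagged, each with the name of the pattern that matched, so the result can be used directly for highlighting or alerting.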

3.7 Analysis of log files contents

To gain the desired knowledge from the log file contents, the relevant parts of the records need to be collected, extracted, pre-processed and analyzed as a dataset. The subsequent visual representation makes behavior and pattern recognition easier from the development or marketing point of view. Some of the basic metrics learnt from the web log analysis are:

• Number of users and their visits;

• Number of visits and their duration;

• Amount and the size of the accessed/transferred data;

• Days/hours with the highest numbers of visits;

• Additional information about the users (e.g. domain, country, OS).

The goal of the web log analysis software is therefore to obtain, among others, the listed information from the generated log records. The following chapter gives an overview of selected available software systems designed for this task and their comparison.
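Several of the metrics listed above can be computed directly from parsed log records. The following Python sketch assumes records reduced to the client host, the time stamp and the number of transferred bytes. It identifies a user by the host address and starts a new visit after 30 minutes of inactivity; both are simplifying assumptions of this example rather than fixed rules.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity limit between visits

def basic_metrics(records):
    """Compute users, visits, transferred bytes and the busiest hour of the day."""
    times_by_host = defaultdict(list)
    for host, timestamp, _size in records:
        times_by_host[host].append(timestamp)

    visits = 0
    for times in times_by_host.values():
        times.sort()
        visits += 1  # the first request of a host opens its first visit
        for previous, current in zip(times, times[1:]):
            if current - previous > SESSION_TIMEOUT:
                visits += 1

    hours = Counter(timestamp.hour for _host, timestamp, _size in records)
    return {
        "users": len(times_by_host),
        "visits": visits,
        "bytes": sum(size for _host, _timestamp, size in records),
        "busiest_hour": hours.most_common(1)[0][0] if hours else None,
    }

# Hypothetical (host, time stamp, bytes) records.
records = [
    ("192.0.2.1", datetime(2015, 2, 3, 9, 0), 500),
    ("192.0.2.1", datetime(2015, 2, 3, 9, 10), 1500),
    ("192.0.2.1", datetime(2015, 2, 3, 14, 0), 300),
    ("192.0.2.2", datetime(2015, 2, 3, 9, 5), 700),
]
metrics = basic_metrics(records)
```

For the sample records the sketch reports two users, three visits, 3000 transferred bytes and the hour starting at 9:00 as the busiest one.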


When choosing the most appropriate analytics software, several things need to be taken into account. These include the required or expected functionality of the analysis software, the specifics of the web application and data storage, and the size and amount of data for analysis. The support and competency available on premises and the financial options should also be evaluated when making the decision. There are various possibilities for categorizing the available systems for log analysis. In this thesis, I first describe multiple different approaches and categorize them according to their main focus. I then choose and compare some existing systems that belong to the specified categories according to the capabilities they offer. This comparison is based on the information offered publicly by the selected systems and is meant primarily as a high-level overview of the available functionality.

4.1 Comparison measures

Considering web analytics as not only a tool for web traffic measurement but also a business research information source, the offerings of some web analytics software types might contain functionality closer to web page optimization and performance increase with on-page actions monitoring.

Web analytics is divided into off-site web analytics, which analyzes the visibility of the web page on the Internet as a whole, and on-site web analytics, which tracks user actions while visiting the page. To ease the use of web analytics without on-premises demands, and also to enable client-side monitoring, a method other than log file analysis emerged: page tagging. As a result, software can be divided into categories according to the tracking method it uses: client-side tracking (page tagging), physical log file tracking and analysis, or full network traffic monitoring.

4.1.1 Tracking method

The two main approaches, considered mainly in the web log analysis area, are tracking the client-side and server-side information. Page tagging is a tracking method based on adding a third-party script to the webpage code, which records user actions on the client side using JavaScript and cookies and sends the information to an outside server. These types of solutions are also often based on the hosted software approach, or Software as a Service (SaaS). On the other hand, log files are generated on the server side and therefore contain server-side information. However, log files can also be transferred outside for processing, and there are also hosted software solutions for log file analysis that is done on third-party premises. Some of the differences in contents between the client-side and server-side information processing are listed below: [19]

• Visits – Due to tracking based on JavaScript and cookies, the hosted software might not output completely accurate information because of users with disabled JavaScript, regularly deleted cookies or blocked access to analytics. It also does not track robots and spiders, while all the interaction information, including the above mentioned, is recorded in the web logs.

• Page views – Because the log file tracks only communication going through the server, it does not include a page reload when the page is served from the browser cache. Client-side software, on the other hand, records the re-visit.

• Visitors – There is a difference in visitor recognition, as the tagging script uses cookies (which might be deleted) to identify the user, while the log file records the Internet address and browser.

• Privacy – Specifically for the SaaS tagging-based systems – as the third party is collecting and processing the obtained information there are some privacy concerns which are not present for local log file analysis.

To sum up, there are advantages and disadvantages to both approaches and the decision should be based on the specific requirements for the software. Log file analysis does not require changes to webpages and contains the basic required information by default (as default logging can be easily enabled and tracked for web servers). The data are also stored and processed on premises, and more inside information can be extracted from the records. Page tagging, on the other hand, provides information from the client side that is not recorded in log files (e.g. on-click events, cached re-visits), and it is available to web page owners that do not have local web servers and support for on-premises analysis. Often both approaches are combined and used for in-depth analytics.

However, even though the term page tagging is mostly used for client-side information tracking, PHP server-based tags can also be used for generating additional information. As already mentioned, a number of valuable data sources are omitted when using only client-side tagging, while physical log files contain too much redundant information. PHP tagging enables both acquiring server-side information and choosing the information that needs to be collected.

There are also other tracking methods that can be used in (mostly) web-based applications and systems, such as full network traffic monitoring. Network traffic monitoring might include much more information about the overall system behavior than log files or page-tag output. However, it is also more complicated to implement, and the whole monitoring process needs to be set up carefully and manually, while logging is generally a built-in capability that is easy to set up, adjust and process.

4.1.2 Data processing location

As partly noted in the previous section, the systems for log analysis can also be categorized according to how (or where) the obtained data is collected or processed. From this point of view the basic distinction is between the Hosted (SaaS) type, which processes data on centrally hosted servers (also used as on-demand software), and the Self-hosted (On premise) type, which runs on the user’s local server.

Gradually increasing interest in cloud-based and outsourced services shows that it is often the easiest solution for standalone non-complex applications and small businesses without a sufficient hardware and software foundation. Software as a service, or the hosted type of software solution, is based on a delivery model where data is processed (and sometimes also collected) on the premises of the software provider. The main advantages of this approach are that the user does not need to own hardware and software with the desired capacity and performance, and does not need to cover maintenance, support and additional technical services. The basic idea of hosted software is that the service is managed entirely by the software provider and the user only gets the desired results of the process. The obvious disadvantage is that data (often containing sensitive information) is transferred to and processed by a third party, which raises security and privacy concerns. Even though the cloud and SaaS providers are legally required to commit to certain data protection, transparency and security, users might consider processing their data on premise as a safer and more convenient approach.

The second type of data processing location is the traditional self-hosted approach, also called deployment on premise. This approach includes installation and setup of the software solution on the user’s server, allowing it to process data locally.

In conclusion, the requirements for the location of data processing might differ according to the type of organization, the on-premise hardware and software support or the sensitivity of the data contents. Apart from the data location and the type of tracking used, there is one more important thing to consider when choosing an appropriate solution: the price and license of the software.


4.2 Client-side information processing software

Client-side information is usually obtained using page tagging, even though some of the software solutions listed in this section also include log analysis as an additional source of input data. The common feature of these types of software is a focus on tracking user actions and activity, together with basic statistics containing information about the background of the user. The aim of client-side tracking software is to optimize the performance of a web-based application or page so that it is appealing to current customers and users as well as attractive to new ones. Selected client-side software solutions:

• Google Analytics [33] – One of the most used web analytics software worldwide, contains a wide variety of features. It includes anomaly detection [21], is easy to use and for basic use is free (possibility to upgrade to paid premium version).

• Clicky web analytics [34] – Hosted analytics software that offers real time results processing, basic customer interaction monitoring functionality and ease of use. Pricing depends on the daily page views and number of tracked web pages.

• KISSmetrics [35] – Tool offering funnel (visitors’ progression through specified flows)¹, A/B test and behavior change reports. It offers a 14-day trial and the starter price begins at $200 per month.

• ClickTale [36] – Software focused on customer interaction monitoring, providing heat map analytics, session playback and conversion funnels along with basic web analytics reports. It offers a trial demo and pricing depends on the purchased solution.

• CardioLog [37] – Software designed to work on the Windows platform, intended for use with on-premises SharePoint servers, Yammer and hybrid deployments including Active Directory integration. It contains basic analytic reporting with a UI built directly into the SharePoint site and is easy to deploy. A 30-day trial is available; full functionality pricing depends on the chosen solution (On premise/On demand/Hybrid) and the chosen features.

• WebTrends [38] – Solution offering rich functionality containing mobile, web, social and SharePoint monitoring. Apart from reports there is a possibility to integrate internal data to statistics and use performance monitoring for anomalies detection.

1. More information about funnel functionality: <http://support.kissmetrics.com/tools/funnels/>


• Mint [39] – On-premises JavaScript tagging-based tool, offering basic reports for visits, page views, referrers etc. It requires Apache with MySQL and PHP scripting; the price is $30 per site.

• Open Web Analytics [40] – Open source web analytics software written in PHP and working with a MySQL database. It is deployed on premise but also uses tagging for analytics processing. There is also built-in support for content management frameworks like WordPress and MediaWiki.

• Piwik [41] – Open analytics platform that offers, apart from the default JavaScript tracking and PHP server-side tagging, an option to import log files to the Piwik server for analysis and reporting. There are more possibilities to adjust the reporting according to needs; as a result, however, the solution is not as easy to use. Piwik PRO also contains on-premises solutions for Enterprise and SharePoint with pricing depending on the scale.

• CrawlTrack² [42] – Open source analytics tool based on PHP tagging, enabling a wider range of obtained information including spider hits and other server-side information.

• W3Perl [43] – CGI-based open source web analytics tool that works with both page tracking tags and reporting from log files.

Some chosen features of client-side tracking software types are compared in Table 4.1. The first compared feature, Tracking traffic sources & visitors, is a fundamental functionality of client-side analysis software, as it is based on information from the client-side log source and unique IDs of visitors. The Tracking robot visits feature is less often supported, as robots are not usually detected using a client script only (however, they can be detected by PHP tagging). The Custom dashboard feature compares the capability of adjusting the contents of the dashboard or statistics report output. Real-time analysis relies on information being continuously received and processed thanks to the script present on pages, which makes this functionality simpler to support than in log file analysis. Keyword analysis can be a very helpful feature, mainly for SEO work, but it does not always belong to the basic features of client-side analyzers. Mobile geo-location is a useful feature for increased tracking ability, but it is supported by only a limited number of the reviewed solutions.

2. CrawlTrack uses PHP tagging which enables also server-side information, however due to its main focus on basic client-side statistics with only spiders hit included it is listed among the client-side tracking type of software


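| Solution | Tracking traffic sources & visitors | Tracking robot visits | Custom dashboard | Real-time analysis | Keyword analysis | Mobile geo-location |
|---|---|---|---|---|---|---|
| Google Analytics | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ |
| Clicky | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ |
| KISSmetrics | ✓ | ✗ | ✓ | ✓ | ✓ | – |
| ClickTale | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ |
| CardioLog | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ |
| WebTrends | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
| Mint | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ |
| Open Web Analytics | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ |
| Piwik | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| CrawlTrack | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ |
| W3Perl | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ |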

Table 4.1: Comparison of selected client-based software features

4.3 Web server log analysis

Web server log analysis tools use the log file in its standard format (IIS or Apache generated) and are optimized for its processing. Even though they might also support analysis of customized log file formats, the output is mostly designed for basic server connectivity statistics and monitoring, with no additional features that might be required for application log analysis.

• AWStats [44] – Free open source tool that works as a CGI script on the web server or is launched from the command line. It evaluates the log file records and creates basic reports for visits, page views, referrers etc. It can also be used for FTP and mail logs.

• Analog [45] – Open source web log analysis program that runs on all major operating systems and is provided in multiple languages. It processes configurable log file formats as well as the standard ones for Apache, IIS and iPlanet.

• Webalizer [46] – Portable free platform-independent solution with advantages in scalability and speed. However, it does not support as wide a range of reporting mechanisms as other alternatives.

• GoAccess [47] – Open source real-time web log analyzer for Unix-like systems with interactive view running in terminal. It includes mostly general statistics in server report on the fly for system administrators.


• Angelfish [48] – Proprietary option for on-premise analysis, often accompanying page tagging solutions. It also contains traffic and bandwidth analysis and can include client-side information gained from web analytics tagging software in the reports. Pricing starts at $1,295 per year.

Some chosen features of server log file analysis software are compared in Table 4.2. The first is the Custom log format capability, which might not always be available but is often a crucial requirement when slightly modified log files are to be analyzed. The Unique human visitors feature is quite easy to accomplish with client-side tracking; from the log file analysis standpoint, however, it is not always a priority, along with the Session duration property. On the other hand, log files offer an easy-to-get Report countries capability based on domain and IP address. Detailed Daily statistics are often supported, but Weekly statistics might not be supported by analyzers with basic functionality due to the high number of records to be computed.

| Solution | Custom log format | Unique human visitors | Session duration | Report countries | Daily statistics | Weekly statistics |
|---|---|---|---|---|---|---|
| AWStats | ✓ | ✓ | ✓ | IP & Domain | ✓ | ✗ |
| Analog | ✓ | ✗ | ✗ | Domain name | ✓ | ✗ |
| Webalizer | ✗ | ✗ | ✗ | Domain name | ✓ | ✗ |
| GoAccess | ✓ | ✗ | ✗ | IP & Domain | ✓ | ✗ |
| Angelfish | ✓ | ✓ | ✓ | IP & Domain | ✓ | ✓ |

Table 4.2: Comparison of selected server log file analysis software features
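The Daily statistics and Weekly statistics features compared above differ only in the aggregation key. The following Python sketch counts hits per calendar day and per ISO week; the use of ISO week numbering is an assumption of this example, and the listed tools may define weeks differently.

```python
from collections import Counter
from datetime import date

def hits_per_day_and_week(dates):
    """Count hits per calendar day and per (ISO year, ISO week) pair."""
    daily = Counter(dates)
    weekly = Counter(tuple(day.isocalendar())[:2] for day in dates)
    return daily, weekly

# Hypothetical dates of individual requests taken from log records.
request_dates = [
    date(2015, 2, 2),
    date(2015, 2, 3),
    date(2015, 2, 3),
    date(2015, 2, 9),
]
daily, weekly = hits_per_day_and_week(request_dates)
```

For the sample dates the daily counter reports two hits on 2015-02-03, while the weekly counter reports three hits in ISO week 6 and one hit in ISO week 7 of 2015.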

4.4 Custom application log analysis

The fundamental functionality expected from application log analysis consists of parsing custom fields in log records, viewing the records in a consolidated form, searching for specific data using custom queries and highlighting results that might be of interest. For a simple application, log file viewers with searching capabilities might offer sufficient functionality for basic application monitoring, as they can be set up to search large numbers of log file records for specific issues and to work with custom log field data that differ across platforms and application types. Searching and filtering are often based on regular expression input and configurable queries. Some of the application log file viewing and analysis tools are:


• Log Expert [49] – Free open source tool for Windows that offers search, filtering, highlighting and timestamp features.

• Chainsaw [50] – Open source project under Apache Logging Services that focuses on GUI-based viewing, monitoring and processing of Log4J files. It offers searching, filtering and highlighting features.

• BareTail [51] – A free real-time log file monitoring tool with built-in filtering, searching and highlighting capabilities supporting multiple platforms and also configurable user preferences.

• GamutLogViewer [52] – Free Windows log file viewer that works with Log4J, Log4Net, NLog and user-defined formats including ColdFusion. It supports filtering, searching, highlighting and other useful features.

• OtrosLogViewer [53] – Open source software for the analysis of logs and traces. It offers searching, filtering with automatic highlighting based on filters, and multiple additional options provided by plugins.

• LogMX [54] – Universal log analyzer for multiple types of log files. It includes a built-in customizable parser, filtering and searching options for large files, and real-time monitoring with alerts and auto-response options. Pricing starts at $99 for a basic single-user license.

• Retrospective [55] – Commercial solution for managing log file data that works on multiple platforms and offers wide search, monitoring, security and analytic capabilities with a friendly UI design. Pricing for personal use starts at $92.

These tools differ in the supported Log files (even though a custom log file format is often configurable) and in the supported Platform. In the following table, All in the Platform column stands for Windows, OS X and Unix-like systems, while Win stands for the Windows platform. While client-side analyzers are mostly based on statistics, visitor and source referrer tracking and their visualization in dashboards, application log analysis tools might not support Statistics generation at all, as they are mainly used for intermediate processing after logging and before visualization. Their capabilities are built around Filter & Highlight tools that help to make better sense of multiple types of data. Log files can be designed to be straightforward, in which case only specific types of log data are of interest; these can be easily retrieved with configurable searching and automatic highlighting and filtering. Regex search, i.e. regular expression³ functionality, is invaluable when searching custom data sources, as regular expressions are a powerful tool for retrieving valuable information in a specified format. As for Real time support, it can be a plus when gathering multiple format types locally (mainly for monitoring), but it is often not treated as a priority.

3. A regular expression (regex or regexp for short) is a special text string for describing a search pattern – more information at http://www.regular-expressions.info/

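| Solution       | Platform | Statistics | Log files          | Filter & Highlight | Regex search | Real time |
|----------------|----------|------------|--------------------|--------------------|--------------|-----------|
| Log Expert     | Win      | ✗          | Custom             | ✓                  | ✓            | ✗         |
| Chainsaw       | All      | ✗          | Log4j              | ✓                  | –            | ✗         |
| BareTail       | Win      | ✗          | IIS/Unix/custom    | ✓                  | ✗            | ✓         |
| GamutLogViewer | Win      | ✗          | Log4j/custom       | ✓                  | ✗            | ✗         |
| OtrosLogViewer | All      | ✗          | Log4j/Java logs    | ✓                  | ✓            | ✗         |
| LogMX          | All      | ✓          | Log4j/custom       | ✓                  | ✓            | ✓         |
| Retrospective  | All      | ✓          | Server/Java/custom | ✓                  | ✓            | ✓         |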

Table 4.3: Comparison of selected application log file analysis software features

The solutions listed so far mostly contain basic functionality for extracting and visualizing visitor, page view and referrer statistics, working either with client-side tracked information (obtained by page tagging) or with standard web server log file formats. The custom application log viewers offer basic searching and highlighting capabilities based on custom search rules. Even though some of them also offer additional functionality such as bandwidth monitoring, anomaly detection or performance monitoring, they are mostly recommended for small to midsize businesses monitoring web pages or simple web applications.

Once application log files need to be processed in more depth, or specific statistics for security and compliance are required, standard reporting mechanisms might not be sufficient for web log file analysis.
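The filtering, highlighting and regex search capabilities compared in Table 4.3 can be illustrated by a minimal sketch. It assumes a custom record layout of timestamp, level, component in square brackets and a free-text message; this layout, the sample lines and the `search` function are hypothetical and do not reproduce the behaviour of any particular viewer.

```python
import re

# Hypothetical custom application log lines, invented for this illustration.
LOG_LINES = [
    "2016-10-10 13:55:36,120 INFO  [OrderService] Order 1042 created",
    "2016-10-10 13:55:37,003 ERROR [PaymentService] Timeout after 3000 ms",
    "2016-10-10 13:55:38,450 WARN  [OrderService] Retry 1 for order 1042",
    "2016-10-10 13:55:39,912 ERROR [OrderService] Order 1042 failed",
]

# Assumed record layout: timestamp, level, [component], message.
RECORD = re.compile(
    r"(?P<timestamp>\S+ \S+)\s+(?P<level>[A-Z]+)\s+"
    r"\[(?P<component>[^\]]+)\]\s+(?P<message>.*)"
)


def search(lines, pattern, level=None):
    """Filter records by level and regex, marking every hit in the message."""
    query = re.compile(pattern)
    results = []
    for line in lines:
        record = RECORD.match(line)
        if record is None:
            continue  # line does not follow the assumed layout
        if level is not None and record.group("level") != level:
            continue  # filtered out by log level
        message = record.group("message")
        if not query.search(message):
            continue  # filtered out by the regex query
        # "Highlight" each hit by wrapping it in >> << markers.
        highlighted = query.sub(lambda hit: f">>{hit.group(0)}<<", message)
        results.append(
            (record.group("timestamp"), record.group("level"),
             record.group("component"), highlighted)
        )
    return results


if __name__ == "__main__":
    for row in search(LOG_LINES, r"[Oo]rder \d+", level="ERROR"):
        print(" | ".join(row))
```

In a graphical viewer the markers would be replaced by colouring, but the underlying combination of field parsing, filtering and pattern matching is the same.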

4.5 Software supporting multiple log files types analysis with advanced functionality

Software solutions that include deeper analytics capabilities as well as processing of distinct log file formats can be used for analyzing both basic log files and client-side tagging output. They often offer a fully functional platform for log file analysis that can also factor in additional data input streams. Depending on specific needs, the contents (input and/or output) can be highly customized to fit the user's requirements.


To recall, the basic steps of data analysis are: data collection, pre-processing, data cleaning, analysis, results overview and communication. It is possible to get a complete solution covering all the required steps. Alternatively, the output can be assembled from separate software tools, depending on the data management systems already used in the organization. Some of the tools that can also be used for application log file analytics include:

• Logentries [56] – Hosted, cloud-based SaaS alternative for log file collection and analysis. It collects and analyzes log data in real time using a pre-processing layer to filter, correlate and visualize. The software offers rich functionality including security alerts, anomaly detection and both log file and on-page analytics. A free trial is available, and a limited-functionality option for sending less than 5 GB/month is free. The starter pack for up to 30 GB/month costs $29 per month.

• Sawmill [57] – Mostly universal solution using both log file entry analysis and on-page script tagging, which can be deployed locally or hosted. It also covers web, media, mail, security, network and application logs and supports most platforms and databases. Pricing depends on the chosen solution; the lite pack with limited functionality starts at $99.
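The analysis steps recalled above (data collection, pre-processing, data cleaning, analysis and results overview) can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the record layout (timestamp, level, message separated by spaces), the sample input and all function names are hypothetical, and dropping exact duplicates during cleaning is a simplification that would not be appropriate for every log source.

```python
from collections import Counter

# Hypothetical raw input, invented for this illustration.
RAW = [
    "2016-10-10T13:55:36 INFO login ok",
    "2016-10-10T13:55:37 ERROR db timeout",
    "",
    "corrupted line",
    "2016-10-10T13:55:37 ERROR db timeout",
    "2016-10-10T13:55:40 WARN slow response",
]

LEVELS = {"DEBUG", "INFO", "WARN", "ERROR"}


def collect(source):
    """Data collection: here an in-memory list stands in for a file or socket."""
    return list(source)


def preprocess(lines):
    """Pre-processing: split each line into (timestamp, level, message)."""
    records = []
    for line in lines:
        parts = line.split(" ", 2)
        records.append(tuple(parts) if len(parts) == 3 else None)
    return records


def clean(records):
    """Cleaning: drop unparsable records, unknown levels and exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        if record is None or record[1] not in LEVELS or record in seen:
            continue
        seen.add(record)
        cleaned.append(record)
    return cleaned


def analyse(records):
    """Analysis: count records per log level."""
    return Counter(level for _, level, _ in records)


def report(counts):
    """Results overview: a one-line textual summary."""
    return ", ".join(f"{level}={count}" for level, count in sorted(counts.items()))


if __name__ == "__main__":
    print(report(analyse(clean(preprocess(collect(RAW))))))
```

A complete solution bundles all of these stages, while a composed solution would replace individual functions with separate tools already present in the organization.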

Depending on the needs of the analysis, there are multiple possibilities for acquiring data from local or hosted log files. The system used for data collection, as well as the overall data management, plays a significant role in choosing the appropriate tool [17]. Some of the solutions with richer functionality for web log monitoring and analysis are: