Novel Data Extraction Language for Structured Log Analysis

P.W.D.C. Jayathilake

99X Technology, Sri Lanka.

ABSTRACT

This paper presents the implementation of a new log data extraction language. The theoretical formulation of the language schema was presented in our previous work (Jayathilake, 2011). In designing the new language we focus on specific problems encountered in automating log analysis, with emphasis on the structured nature of log files. A brief review of existing data format description mechanisms is also provided. After describing the implementation of the new language, we compare it with another popular data description language to highlight its unique capabilities.

KEYWORDS

Log data extraction, Data format description language, Log analysis, Declarative language

INTRODUCTION

Software log files contain information pertaining to most user and system actions within an organization. Regulations such as PCI DSS, FISMA and HIPAA, and frameworks like ISO 27001 and COBIT, emphasize standards on logging. If utilized properly, log data can generate considerable value in various facets of a business. Log analysis has proven its potential in intrusion detection, unintended user activity identification, system compliance testing, software troubleshooting, software monitoring, performance benchmarking and functional testing. Despite its benefits, log analysis incurs a huge cost if the entire range of its phases is performed manually. The reasons are twofold: log analysis requires expertise, and a significant amount of time is spent digging through large volumes of data and making inferences.

Commercial tools available in the market deliver a range of functionality for automating certain stages of the analysis process. This includes log data collection from different sources, data indexing, searching, automatic identification of common log file constructs such as timestamps and IP addresses, customizable dashboards for data visualization, highlighting of anomalies, automatic compliance checks, etc. All existing tools treat log data as unstructured information. Though the high entropy of log information justifies this practice, it imposes numerous limitations on automating log analysis. Lack of contextual correctness, for example, poses many challenges in creating semantics for inferring results automatically.

Jayathilake (2011) published an initial version of a framework that creates a platform for structured log analysis. Its core constituent was a new procedural language, which was designed to be used in every phase of automated log analysis. Though the language proved to be powerful in processing log data, we soon realized that it was ill-suited to log data extraction. Jayathilake (2011) later published a specification for a declarative language for describing the format of any log file. The intention was to make the log file format declaration more readable and to make it easier to pick information of interest from log files. We demonstrated the flexibility of the specification in expressing the formats of different log file types such as line logs, highly structured logs and tabular logs. Furthermore, we verified that the specification is resilient to log file corruption, which is a prominent problem in the domain.

This paper presents an implementation of that specification based on Simple Declarative Language (SDL), an easy representation mechanism for data structures (Leuck, 2012). Java and .NET implementations of SDL already exist, so a syntax expressed in a compliant format can be parsed easily. We formulate a syntax that facilitates easy expression of all language constructs. Since the syntax is compliant with SDL, we could use an existing SDL parser for lexical analysis. The interpretation stage is implemented according to a new algorithm, which uses recursion extensively. Inconsistent log data are handled through a hierarchical fault tolerance mechanism that lets users select the level of recovery after a log file corruption is detected. Selective data extraction is supported so that users can cherry-pick data from huge log files for further analysis. Supporting routines are added to reduce the effort of dealing with common log file constructs such as timestamps, IP addresses, port numbers and error codes. The output of the data extraction process is a tree that captures the semantic relationships between log entries. High expressiveness, simplicity, a short learning curve, readability and resilience to log file corruption are the strengths that we identify in the language.

EXISTING DATA DESCRIPTION LANGUAGES

This section provides an overview on existing data description languages.

EAST - EAST is an ISO standard data description language developed by the Consultative Committee for Space Data Systems (CCSDS, 2007). It provides a rich mechanism to express a data format completely and unambiguously. Data are regarded as a collection of data entities, and the EAST description is used to interpret and gain access to those entities. Its main design goals are strong data description capabilities, human readability, and computer interpretability. One prominent problem with EAST is the lack of support for describing file structures in which the position of one data entity must be determined at run time by examining fields in other entities.

DRB – Data Request Broker is an open source Java application programming interface (GAEL, 2009). It is an expansion on EAST. It can be used for reading, writing and processing heterogeneous data. DRB is a software abstraction layer that developers can use to program applications independently of the way data are encoded within files. It is also possible to perform calculations using XQuery from within the data description, allowing full description of files in which the locations of data fields must be calculated from other data fields. However, these calculations must be described in XQuery, which can increase complexity and hence reduce human readability.

PADS/ML - This is a domain-specific language designed to improve the productivity of data analysts. It is a functional language for formally specifying the logical and physical structure of data (Mandelbaum et al., 2007). In contrast to other data description languages, PADS/ML provides a platform where the description can also stand as sound documentation of the data. However, it does not offer a satisfactory level of support for describing semantic information.

DFDL – Data Format Description Language is an open standard that arose from the need to represent text and binary data of various formats in a common XML paradigm (OGF, 2010). It also allows data to be taken from an XML model and written out in its native format. By describing a data format with a DFDL description that is accessible to multiple applications, one can provide a common interface to the data, thereby facilitating data interchange. DFDL does not inherently support semantic information but can be used in conjunction with ontologies for this purpose. One drawback is the verbose nature of DFDL, caused by XML metadata, which affects human readability.

HAWK – This is a powerful, flexible language for log file analysis that relies on simple analysis methods. Its basis in pattern-action pairs allows programs to be combined flexibly. It supports a range of log file analysis functions such as filtering, recoding, and counting, and provides the processing power needed for analysing log files (HAWK, 2009).

BFD - The Binary Format Description language is an XML-based language for expressing binary data formats (National Collaboratories, 2003). It is an extension to eXtensible Scientific Interchange Language (XSIL). A BFD template can be used to extract data from a set of files and put them into an XML document for further processing.

REQUIREMENT FOR A NEW LOG DATA EXTRACTION LANGUAGE

The above-mentioned languages are mostly generic data format description languages that target a wide range of applications. Log analysis is one niche where those tools can be utilized. However, log analysis, as a separate domain, exhibits unique characteristics and poses specific problems. For example, corrupted data is a prominent challenge facing any attempt to automate the analysis process. Huge amounts of data, inconsistent formats and frequent format changes add further to this. It is also vital to have a data description scheme that results in more human-readable templates than highly verbose XML solutions.

THE NEW LANGUAGE

In order to address these unique needs we designed a new log data extraction language based on a simple schema. Jayathilake (2011) discussed the theoretical formulation of this language along with case studies on its applications. In summary, it is based on interpreting a log file as a hierarchy of units termed “log entities”. Three types of log entities are identified.

1. Type A

This type is defined as a sequence of other log entries given by the pair ([$LE_1, LE_2, \ldots, LE_N$], ERROR_RECOVERY), where each $LE_i$ is a log entry. The sequence must contain the log entries in the same order as they are specified inside the square brackets in the first element of the pair. ERROR_RECOVERY is a flag that indicates whether the system should try to recover from parse errors for this type of log entry.

2. Type B

This is a sequence of other log entries defined by the 4-tuple ({$LE_1, LE_2, \ldots, LE_N$}, MAX, MIN, ERROR_RECOVERY), where each $LE_i$ is a log entry. The sequence can be built from those log entries in any order, and each $LE_i$ can appear in the sequence zero or more times. The list containing the $LE_i$ is termed the candidate list for the sequence. MAX is the maximum number of log entries permitted in the sequence; a value of -1 means that there is no limit on the length of the sequence. Similarly, MIN is the minimum number of log entries that must be present in the sequence; -1 indicates that there is no lower bound on the length of the sequence. ERROR_RECOVERY is a flag with the same semantics as in the definition of Type A.

3. Type C

A singleton (k) where k is a fixed sequence of bytes.
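The three definitions above map naturally onto a small data model. The following Python sketch is only an illustration under our own assumptions: the class names, field names and the `length_ok` helper are hypothetical and are not taken from the implementation described in this paper.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TypeC:
    """A fixed sequence of bytes, i.e. the singleton (k)."""
    name: str
    literal: bytes


@dataclass
class TypeA:
    """An ordered sequence: every child appears in the listed order."""
    name: str
    children: List["Entry"]
    error_recovery: bool = False


@dataclass
class TypeB:
    """A sequence built from a candidate list, in any order."""
    name: str
    candidates: List["Entry"]
    max_count: int = -1  # -1 means no upper bound (MAX)
    min_count: int = -1  # -1 means no lower bound (MIN)
    error_recovery: bool = False

    def length_ok(self, n: int) -> bool:
        """Check a sequence length n against the MIN and MAX bounds."""
        above_min = self.min_count == -1 or n >= self.min_count
        below_max = self.max_count == -1 or n <= self.max_count
        return above_min and below_max


Entry = Union[TypeA, TypeB, TypeC]

# Example: whitespace made of at least two spaces or tabs, with no upper bound.
space = TypeC("Space", b" ")
tab = TypeC("Tab", b"\t")
gap = TypeB("Gap", [space, tab], max_count=-1, min_count=2)
```

Keeping -1 as the "unbounded" marker mirrors the convention in the Type B definition; an optional integer would be an equally reasonable choice.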

The language also provides a mechanism to recover from corruptions in a log file. When a part of the text that does not follow the format in the description is detected, the interpreter can fall back to the next log entry and continue execution without terminating prematurely.
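The fall-back idea can be illustrated at the level of whole lines. The Python sketch below is a simplification under stated assumptions: the line format (`HH:MM:SS` followed by a message), the `LINE` pattern and the `extract` function are invented for this example, and the actual interpreter applies recovery per log entry according to its ERROR_RECOVERY flag rather than per line.

```python
import re
from typing import List, Tuple

# Hypothetical line format used only for this illustration: "<HH:MM:SS> <message>".
LINE = re.compile(r"(\d{2}:\d{2}:\d{2}) (.+)")


def extract(lines: List[str]) -> Tuple[List[Tuple[str, str]], List[int]]:
    """Return (parsed entries, indexes of lines skipped as corrupt)."""
    parsed: List[Tuple[str, str]] = []
    skipped: List[int] = []
    for index, line in enumerate(lines):
        match = LINE.fullmatch(line)
        if match is None:
            # Corrupt text: record it and fall back to the next entry
            # instead of aborting the whole extraction.
            skipped.append(index)
            continue
        parsed.append((match.group(1), match.group(2)))
    return parsed, skipped


log = ["10:00:01 service started", "\x00\x00garbage", "10:00:05 request served"]
entries, corrupt = extract(log)
```

Recording the skipped positions, rather than silently dropping them, keeps the corruption visible to later analysis stages.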

IMPLEMENTATION

We implemented the language syntax in Simple Declarative Language (SDL), which provides infrastructure for describing arbitrary data formats. Below we explain the syntax for each of the three log entry types through examples.

1. Type A

Line typeA Timestamp Process TID Area Category ER=true

This syntax defines a log entry named “Line”, which is built from a sequence of other log entries “Timestamp”, “Process”, “TID”, “Area”, and “Category”. Error recovery (ER) is set to true.

2. Type B

Gap typeB Space Tab Max=-1 Min=2 ER=false

This is a definition of a log entry named “Gap”, which stands for an empty space created by two or more spaces and tabs. Spaces and tabs can occur in any order and quantity. Error recovery (ER) is set to false.

3. Type C

Char typeC a

The Type C log entry “Char” defined here stands for the character “a”.
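To make the shape of these three definition lines concrete, the following Python sketch splits each line into its name, type, operands and attributes. It is a hypothetical whitespace tokenizer written for illustration only; the implementation described in this paper relies on an existing SDL parser instead, and `parse_definition` is not part of it.

```python
from typing import Dict, List


def parse_definition(line: str) -> Dict[str, object]:
    """Split one definition line into name, type, operands and attributes."""
    name, kind, *rest = line.split()
    operands: List[str] = []
    attributes: Dict[str, str] = {}
    for token in rest:
        # Tokens such as ER=true or Max=-1 are attributes; for typeC the
        # remaining token is the literal itself, even if it contains "=".
        if "=" in token and kind != "typeC":
            key, value = token.split("=", 1)
            attributes[key] = value
        else:
            operands.append(token)
    return {"name": name, "type": kind, "operands": operands, "attributes": attributes}


line_def = parse_definition("Line typeA Timestamp Process TID Area Category ER=true")
gap_def = parse_definition("Gap typeB Space Tab Max=-1 Min=2 ER=false")
char_def = parse_definition("Char typeC a")
```

Splitting on whitespace cannot express a Type C literal that itself contains whitespace, which is one reason a real parser with quoting rules is preferable.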

The implementation of the language is shown in Fig. 1. It consists of two main components: the lexical analysis module (parser) and an interpreter module. The parser module processes the given format specification using SDL. This is possible since the new language syntax is compliant with SDL. Log file content is lexically analysed with respect to the pre-processed format specification. After that, the interpreter extracts log file content and converts it to a proprietary binary format. This data format is ready to be processed by the log data analysis framework presented in Jayathilake (2011). A recursive algorithm is used to implement the interpreter module.
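A recursive interpreter of this kind can be sketched in a few lines of Python. This is not the algorithm used in our implementation: the `SPEC` dictionary, the tuple encoding of entries, the `match` function and the sample entries `Key` and `Code` are all assumptions made for the example, and error recovery and the binary output format are omitted for brevity.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory form of a format specification:
#   ("A", [children])              ordered sequence
#   ("B", [candidates], min, max)  any order, -1 means unbounded
#   ("C", "literal")               fixed text
SPEC: Dict[str, tuple] = {
    "Space": ("C", " "),
    "Tab": ("C", "\t"),
    "Gap": ("B", ["Space", "Tab"], 1, -1),
    "Key": ("C", "ERR"),
    "Code": ("C", "42"),
    "Line": ("A", ["Key", "Gap", "Code"]),
}

Node = Tuple[str, object]  # (entry name, matched text or list of child nodes)


def match(name: str, text: str, pos: int) -> Optional[Tuple[Node, int]]:
    """Try to match entry `name` at text[pos:]; return (tree node, new position)."""
    rule = SPEC[name]
    if rule[0] == "C":
        literal = rule[1]
        if text.startswith(literal, pos):
            return (name, literal), pos + len(literal)
        return None
    if rule[0] == "A":
        children: List[Node] = []
        for child in rule[1]:
            result = match(child, text, pos)  # recurse into each child in order
            if result is None:
                return None
            node, pos = result
            children.append(node)
        return (name, children), pos
    # Type B: repeatedly take the first candidate that matches and consumes input.
    _, candidates, minimum, maximum = rule
    children = []
    while maximum == -1 or len(children) < maximum:
        for candidate in candidates:
            result = match(candidate, text, pos)
            if result is not None and result[1] > pos:
                node, pos = result
                children.append(node)
                break
        else:
            break  # no candidate matched, so the sequence ends here
    if minimum != -1 and len(children) < minimum:
        return None
    return (name, children), pos


tree, end = match("Line", "ERR \t42", 0)
```

The returned value is a nested tuple that mirrors the hierarchy of log entries, which is the tree-shaped output the extraction process is meant to produce.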

Figure 1: Implementation of the new language

COMPARISON WITH DFDL

In this section we compare the new language with DFDL, which is another promising technology for expressing file formats. Like any other XML-based schema, DFDL incurs significant metadata overhead. Log entry formats expressed in DFDL are much more verbose than their expressions in our schema.

Figure 2: Comparison between our schema and DFDL

Our schema: less verbose; resilient to log corruptions; optimized for log file formats. Example: Line typeA Timestamp Process TID Area Category ER=true

DFDL: lot of metadata; offers a powerful type system; unable to handle log corruptions.

Fig. 2 compares the expressions of one log entry in our schema and in DFDL. The new schema results in more compact and readable format expressions. Because the new language is designed specifically for log file formats, whereas DFDL is a generic format expression mechanism, it offers a few other benefits as well. One prominent advantage is its ability to deal with log corruptions. On the other hand, DFDL provides a rich type system in which most common data types are natively identified.

CONCLUSION

The new log data extraction language can express a wide range of log file formats while offering a simple, human-readable syntax. Its hierarchical interpretation of log entries enables it to capture difficult log formats containing many peculiarities on which many other existing data format description languages fail. The schema is proven to work with many industrial log file types such as line logs, message logs, XML logs and tabular logs. A prominent feature of the new language is its ability to deal with inconsistencies and corruptions in log files. It strengthens the automated log analysis mechanism with the ability to use as much “correct data” as possible. Simple Declarative Language provided a useful platform for implementing the language syntax. The current implementation of the language supports only text log files, which is a limitation. It can be enhanced to handle binary logs too. A further improvement would be to add the capability to handle log file formats in which the location of one log entry must be read dynamically from another log entry.

REFERENCES

Jayathilake, D. (2011) ‘A mind map based framework for automated software log file analysis’, Proceedings of the International Conference on Software and Computer Applications (ICSCA 2011), pp. 1-6.

Jayathilake, D. (2011) ‘A novel mind map based approach for log data extraction’, Proceedings of the 6th IEEE International Conference on Industrial and Information Systems (ICIIS 2011), pp. 130-135.

Andrews, J. H. (1998) ‘Testing using log file analysis: tools, methods and issues’, Proceedings of the 13th IEEE International Conference on Automated Software Engineering, pp. 157-166.

Valdman, J. (2001) ‘Log file analysis’, Department of Computer Science and Engineering (FAV UWB), Tech. Rep. DCSE/TR-2001-04.

Consultative Committee for Space Data Systems, 2007. The Data Description language EAST Specification. [pdf]. Available at: <http://standards.gsfc.nasa.gov/reviews/ccsds/ccsds-644.0-p-2.1/ccsds-644.0-p-2.1.pdf> [Accessed 05 May 2012].

GAEL Consultant, 2009. Data Request Broker. [online] Available at: <http://www.gael.fr/drb> [Accessed 05 May 2012].

Mandelbaum, Y., Fisher, K., Walker, D., Fernandez, M. and Gleyzer, A. (2007) ‘PADS/ML: A Functional Data Description Language’, Proceedings of the 34th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages (POPL 07), pp. 77-83.

OGF Data Format Description Language Working Group, 2010. Data Format Description Language (DFDL) v1.0 Core Specification. [pdf]. Available at: <http://www.ogf.org/Public_Comment_Docs/Documents/2010-03/draft-gwdrp-dfdl-core-v1.0.pdf> [Accessed 05 May 2012].

HAWK Network Defense, 2009. The Future: Dynamic Log Analysis. [pdf]. Available at: <http://www.cleartechnologies.net/wp-content/uploads/2011/08/Dynamic-Log-Analysis-Whitepaper4.pdf> [Accessed 05 May 2012].

National Collaboratories, 2003. Binary Format Description (BFD) Language. [online] Available at: <http://collaboratory.emsl.pnl.gov/sam/bfd> [Accessed 05 May 2012].

Daniel Leuck, 2012. Simple Declarative Language. [online] Available at: <http://107.20.201.134/display/SDL/Home> [Accessed 05 May 2012].