High Performance Logging System for Embedded UNIX and GNU/Linux Applications

(1)

High Performance Logging System for

Embedded

UNIX

and GNU/Linux Applications

Jaein Jeong

Cisco Systems

San Jose, California 95134, USA [email protected]

Abstract—We present a high performance logging system for embedded UNIX and GNU/Linux applications. Compared to the standard UNIX and GNU/Linux logging method, syslog, our method has two orders of magnitude lower latency and an order of magnitude higher message throughput. This speed-up is mainly due to the use of a memory-mapped file as the means of inter-process communication, fewer memory copies and the batching of output messages in the logging daemon. In addition, our logging system also accepts syslog messages, providing compatibility with existing applications. Our logging system is in production use in the Cisco UCS Virtual Interface Card.

I. INTRODUCTION

Logging is a helpful tool for analyzing the behavior of computer systems and applications. By analyzing a sequence of events in a log, one can see whether computer systems or applications behave properly. In production environments logs are even more important and may be the only way to analyze system problems. In this paper, we focus on logging systems for embedded UNIX applications (we will use UNIX to refer to all UNIX-like systems including GNU/Linux). Traditional

UNIXapplications log messages by writing them over a socket to a system daemon which then writes the messages to log files and/or forwards them to peers running on remote hosts. The system daemon is syslogd and the application library interface is syslog.

There are two performance problems with this design. The first is the time it takes for an application to transmit a message to syslogd. The second is the inefficiency of syslogd

writing messages to a file one at a time using unbuffered writes. These unbuffered writes are not a big problem on a conventional file system using a hard disk (the file system buffer cache reduces disk I/O and hard drives are fast com-pared to flash memory), but are very slow when writing to flash memory using a flash memory file system. The first of these problems increases application response times when there are many messages to log. The second reduces the number of messages that can be logged per second by at least an order of magnitude, and also increases the application response time as no new messages may be transmitted to syslogd while it is writing the previous message.

In order to address these performance problems, we devel-oped a logging method that uses shared memory to transfer messages to the system logging daemon and briefly buffers received messages before writing them to permanent storage.

In our logging method, application processes create log mes-sages by writing into a shared memory segment. A write lock is used to ensure applications don’t conflict with one another as they write into the shared memory. A logging daemon retrieves messages from the shared memory segment and dispatches them to one or more destinations, such as permanent local storage, a remote syslogd daemon, or a dynamically connectable network client. We note that our logging method drops incoming messages when the shared memory becomes full due to a burst of incoming messages. This trade-off was made to meet real-time requirements for applications.

After a message is written to the shared memory, the logging daemon must be notified. This is done by writing to a pipe. The logging daemon waits on the receiving end of the pipe and wakes up when data (i.e. a message notification) is available. Event-driven applications may defer notification until their event handler completes. This has two positive effects, the most important of which is reducing the notification overhead from once per message to once per handler invocation. It also defers the cost of notification until the handler has completed its work. Finally, when the logging daemon is running it sets a flag in the shared memory area indicating that (since it’s already running) it doesn’t need to be notified.

As the logging daemon processes each message, those that go to a network destination (either a remote syslogdor a dynamic log viewer) are sent to the destination immediately. Those to be stored locally are formatted into their final text representation and written via a pipe to a buffering process,

flashlogger. After the first message that arrives when its buffer is empty, flashloggerwaits for up to 0.5 seconds for additional messages. After the wait period or sooner if its buffer fills up,flashloggerwrites its buffered messages to flash and resumes waiting for additional messages. This buffer-ing reduces the number of writes to local storage and enables a much higher overall throughput than obtained by writing each message to storage individually. In addition to forwarding to remote syslogd processes, the logging daemon also acceptssyslogmessages from local (only) applications. This provides full compatibility for existing applications. Messages received via the syslog interface are dispatched the same way (i.e. according to their priority and possibly source) as messages received via the shared memory segment.

(2)

log-ging system is almost two orders of magnitude faster for an application than syslog. This is because logging a message only involves user-space memory copying without a context switch. The maximum message throughput (how many can be saved per second before having to drop or throttle messages) with the buffering is over an order of magnitude greater.

The rest of this paper is organized as follows. In Sections II and III, we review previous work and the evolution of logging systems on which our work is based. We describe the design principles and the implementation details for our logging method in Sections IV and V. We evaluate the performance in Section VI, describe a further optimization in Section VII and conclude this paper in Section VIII.

II. RELATEDWORK

Introduced in BSD in the early 1980’s, syslog[1], [2] is still the most notable logging method for UNIX applications due to its simplicity, flexibility and public availability. This is exemplified by a number applications that use or extend it [3]– [9]. In order to provide more features, a number of extensions

of syslog have been implemented. Syslog-ng [10] and

rsyslog[11] are two of the most popular ones [12]–[14].

Syslog-ngis an extension ofsyslogthat was created to support the original BSD syslogprotocol [1] from another

syslogextension,nsyslogd[15]. Its goals are to provide additional features such as reliable transport, encryption, and richer set of information and filtering. Over time, it has evolved to support the latest syslog protocol [2] and is used by several UNIX distributions. Rsyslog [11] is another well-known extension of syslog and is being used in many of latest Linux distributions. Rsyslog extends syslog not only in features but also in performance with a multi-threaded implementation. By using threads it avoids the problem of blocking the acceptance of new messages while writing pre-viously received messages.

However, these existing systems are not designed for em-bedded/flash logging and they use the same fundamental design in their message handling and thus have the same limi-tations when used for embedded flash logging as mentioned in Section I: (i) slow message passing due to message copying over the kernel to user space boundary, and (ii) unbuffered writes of messages to output devices. In the next section, we describe our shared-memory logging methodology and shows how it addresses the performance problems of existing logging systems.

III. CISCOUCS VIRTUALINTERFACECARD AND

EVOLUTION OFLOGGINGSYSTEM

In this section, we introduce the Cisco UCS Virtual Interface Card and explain how its logging system evolved based on design requirements and experience using it.

A. Cisco UCS Virtual Interface Card

Cisco Unified Computing System (UCS) [16] is a data-center server system that enables large deployments of vir-tualized servers with scalable performance through hardware-accelerated I/O virtualization. Cisco UCS mainly consists

of the following components: servers, virtualization interface cards and switches. A UCS server provides computation capability, and it can be used for a computation-intensive single server or a number of virtualized servers. A UCS virtual interface card [17], or VIC, is a high-speed network interface card for a UCS server, and one of its key features is support of virtual network interfaces (VNICs). To the server, a VNIC is seen as a separate device with a unique PCI bus ID and separate resources. To the network, it is seen as a network frame with a special tag, calledVN-Tag. Thus, it is treated by a server in the same way as other PCI devices and does not cause any overhead for translation. A VIC is implemented by an ASIC that consists of a management processor, two fiber-channel processors, VNIC resources and resource mappings. The management processor, which the firmware runs on and is the target of our logging system, has a 500MHz MIPS 24Kc processor core and runs Linux kernel 2.6.23-rc5.

B. Evolution of VIC Logging System

• logd - A Simple Logging Daemon: We initially wrote

logd, a simple logging daemon that accepts messages with different severity levels from multiple processes through IPC calls, formats the message and writes the formatted output message to flash storage using stdio functions. Although it is very simple,logdachieved its goals with reasonable performance. Since writes through stdio functions are buffered, logd was able to write messages reasonably quickly most of the time.

• Unbuffered syslogd: logd served its purpose well enough at first, but there came a need to support forward-ing serious errors to syslogd on the UCS switches. Instead of extending logd with such a feature, we decided to usesyslogd, which of course already has the feature. Usingsyslogdachieved the immediate goal but its write performance was much worse becausesyslogd

writes log messages without buffering them.

• Buffered syslogd: In order to address the write perfor-mance problem ofsyslogd we introduced a buffering process, called flashlogger, betweensyslogdand the flash storage.flashloggerhas an adjustable wait time (currently half a second) to collect messages before it writes out what it has collected. However, if its buffer fills up before the wait time has passed, it goes ahead and writes out the buffer immediately. This combination of

syslogdandflashloggerwas used for the logging

system for VIC until we developed the new logging system that is introduced in this paper.

IV. DESIGNREQUIREMENTS

A. Faster Message Transfer

A representation ofsyslogdis shown in Figure 1(a). The round-trip time between a user application process and the

syslogd process is typically around 700us on average and has been observed as high as 13,000us under heavy load. This is not a small number for a modern computing system. The reason why accessingsyslogdtakes a long time is because

(3)

Switch Switch System Process System Process App Process log syslogd System Process

…

Process App Process log Process App

…

Flash log Process JFFS2 Flash Logger syslog g USER JFFS2

KERNEL Buffer Named pipe

(a) TypicalUNIXLogging

Switch Switch System Process System Process App P log mqlogd System Process

…

dequeue Process log App

Process log Process

…

Flash enqueue q Process App JFFS2 Flash Logger Memory Mapped File App Process log USER JFFS2 File

KERNEL Named pipe

(b) Shared Memory Logging Fig. 1. Data Flow of Log Messages

it requires passing data between user and kernel spaces and context switches between the application and syslogd. In order to address the long round-trip problem of syslogd, we designed a logging scheme that does not require message transfer between user and kernel spaces in its critical data path, as illustrated in Figure 1(b). With this scheme, an application or system process uses the same logging API to log a message. The message is copied to shared memory by the log functions and forwarded to the flash memory or a remote host by a message collector mqlogd. Since the shared memory is allocated in user space, message transfer can be very fast once the shared memory section is initialized. Thus, a user or system application is finished logging right after it writes a message to the shared memory instead of waiting for the reply from the message collector, and this improves latency.

The faster message transfer of the shared memory logging is also attributable to the number of times a message is copied in memory, which is a significant factor for message transfer time. With syslogd, a message is copied four times in memory: first, it is formatted into a user buffer; second, it is written from the user buffer to a kernel socket; third, it is read from the kernel socket to a syslogd user buffer; and fourth, it is written from the syslogduser buffer to a file. With the shared memory logging, the number reduces to two. A process that logs a message formats the message into the shared memory. Since the logging daemon accesses the shared memory, the first three message copies in syslogd

are reduced to a single message copy.

B. Compatibility with Existing Logging Applications

VIC applications use the following logging macros:

log_info, log_debug, log_warn, and log_error.

These macros format input strings and previously made

syslog calls. Applications that use these logging macros were converted to shared-memory-based logging without mod-ification of application code by changing the syslog calls to message copies to the shared memory. However, there are

some applications that directly call syslog without using the logging macros. Therefore the logging daemon of our logging system also accepts messages intended forsyslogd. This allows existing applications that use syslogd to be used with our logging system without modifications. One such application isklogd.

C. Output Flash File System

The syslogdprogram for VIC used theflashlogger

program to buffer messages, and our logging system takes the same approach.

D. Support of Destination-Aware Message Formatting One of the advantages of syslogd is that it allows log messages to be sent to multiple destinations of different types, where a destination may be a simple file, a named pipe or a socket on a remote host. However, there is a drawback.

syslogduses the same data representation shown in Figure 2 for all destinations. This representation has two problems. The first problem is inefficiency. The string representation of date and time requires more space than a binary or simpler text representation. In addition, the host name is redundant for flash logs because all of the events that are stored in the flash memory are local and are from the same host. The second problem is loss of precision. The message format specifies the time of a log message at a granularity of seconds, while the system supports timing in milliseconds. Logging data is more useful with more precise timings.

Our logging system formats messages differently depend-ing on the destination. For remote forwarddepend-ing, it outputs log messages in the same way as the current Cisco ver-sion of syslogd does for compatibility. For writing to

flashlogger, it changes the message format to use a higher precision timestamp while omitting the hostname.

(4)

Format: DateStr TimeStr Host app.module[pid] : Message

Example: Mar 9 15:53:23 simserv01 log.logtest[16665]: This is a message

Fig. 2. Message format for syslogd

CLIENT

logperf Client Utility (an example)

CLIENT log Receives API call

Data Transfer

mq Shared Memory Management

logger Client Utility (an example)

Memory Mapped File mq Queue Management

UDP Unix Socket syslog() system call

SERVER

mqlogd Outputs Log mq

Shared Memory Management Queue Management mqlogd _{Data Transfer} mq Queue Management Message Formatting

Named Pipe UDP IP Socket TCP IP Socket

OUTPUT UTILITY Flash l Local Logging OUTPUT UTILITY syslogd Remote OUTPUT UTILITY netcat Dynamic Remote logger Local Logging syslogd Forwarding netcat Forwarding

Fig. 3. Program Modules for themqlogdLogging Service

V. IMPLEMENTATION

A. Platform

While our logging method can be used for anyUNIXsystem that does logging, it is especially designed for the embedded firmware for the Cisco UCS Virtual Interface Card [17] where low-latency logging is desirable and the storage medium log files are written to is slow.

B. Modular Description

Our logging method consists of a core library, server modules, client modules and output utilities as illustrated in Figure 3. The module mq, which is in charge of managing the memory-mapped file, the circular queue, and message formatting, is made as a library that can be used by both a logging daemon and clients. The logging daemon, mqlogd, accepts logging messages either through the shared memory or a UDP UNIX socket, and it outputs messages through the named pipe or a UDP IP socket. The client modules receive logging requests in aprintf-like format and insert them into the shared memory. As for the client utilities, we consider two different kinds depending on how they submit logging requests: one uses shared memory and the other uses the

syslog library call and UDP UNIX socket. Output utilities are connected to the server through a named pipe, a UDP IP socket, or a TCP IP socket. The named pipe interface is for a program such asflashloggerthat stores logging messages locally. The UDP IP socket interface is to forward messages to thesyslogdprocess on the switch. The server also supports dynamic remote forwarding through a TCP IP socket interface. This interface is different from the UDP IP socket interface in that it dynamically maintains forwarding destinations to a client such asnetcaton a remote host.

Header Entry (48B in size) Flag

Unused Reader

PID Mutex Unused Head Tail

(1B) PID Unused

(4B) (20B) Data Entry (48B in size)

(3B) (4B) (4B) Prog Name (16B) Logging ID (16B) Log Lvl (2B) Msg Length (2B) PID (4B) Time-stamp (8B) Msg String (48B)

Data Entry only with Msg String (48B in size)

(48B)

Queue Memory Layout Offset

Header Entry (48B) Non-Header Entry (48B) 0 1 * BlockSize Non-Header Entry (48B)

…

2 * BlockSize

…

Non-Header Entry (48B) (N-1) * BlockSize

Fig. 4. Memory Layout of the Circular Queue

C. Shared Memory and Circular Queue Management 1) Data Representation: The shared memory is used as an array of fixed-sized blocks that are managed as a circular queue as illustrated in Figure 4. The first block in the memory-mapped file is used as the header entry, which contains all the bookkeeping information for the circular queue and the logging service. The rest of the blocks are data entries, which contain the information for each logging message. Just after each message header, the message uses as many blocks as it needs for the message string. The length of the message string is specified in the message length field in the first data entry. Since data entries are consecutive, we can handle multiple segments of message string fields like a single field by pointing to the starting address of the first message string field. However, there is an exception to this. When a sequence of data entries roll around the end of the memory-mapped file, we cannot treat the message string as consecutive memory. In that case, the message string must be handled as two pieces: the first one is from the beginning of the message string and to the end of the memory-mapped file, the second one is from the beginning of the memory-mapped file and the location where

(5)

the last data entry ends. Thus, a message string can be handled as one or two consecutive byte arrays.

2) Notification Mechanism: The queue management mod-ule notifies the logging daemon when a client writes a logging message in the queue or when the queue is full. It implements two different methods for client-to-server notification: one using write-and-select (the default) and the other with a signal. Each time a client writes a logging message to the memory-mapped file, it checks whether it should notify the logging daemon. If the daemon sets the notify flag in the header entry, the client does not send a notification. If the daemon clears the notify flag, the client sends a data-available command, which makes the daemon retrieve all the logging messages in the queue. This notification batching mechanism suppresses redundant notifications and helps improve the latency and throughput for message passing and dispatch. The client also sends a queue-full notice when it detects that the queue is full. It then drops the message. This makes the daemon output a queue-full message to the destinations in order to mark that there was a drop.

3) Locking Mechanism: The queue management module uses a locking mechanism in order to maintain the consistency of the circular queue. It implements two different mechanisms: semaphore lock and pthread mutex lock. The semaphore lock is a set of lock functions based on the Linux semaphore API. While it is not really fast due to a kernel access for each lock or unlock operation, it is widely supported. The pthread mutex lock uses the Linux pthread API. While it is faster than the semaphore lock, it is unreliable for inter-process use in our current version of Linux. Thus, the deployed queue management module uses the semaphore lock.

D. Client Logging Interface

Client applications in the original logging system first initialized application-wide logging objects before they called one of the logging macros (log_info, log_debug,

log_warn, and log_error). This two-step approach fits well to the shared memory based logging and allowed existing applications to be converted to the new logging without modification of application code. In order to handlesyslog

messages the logging daemon waits on the UNIX socket at

/dev/log, adding received messages to the circular queue. This functionality provides the same application interface as

syslog, allowing clients for syslog to be used with the shared memory logging without any modification.

E. Server Output Interface

After receiving each logging message, the logging daemon formats it and outputs the formatted message to the internal flash storage logger. For a logging message with a high priority level (i.e. error or warn), the logging daemon also outputs the formatted message to a UDP socket on remote hosts. By default, the daemon sends the formatted message string to UDP port 514 of the two switches to which a UCS Virtual Interface Card is attached. Since UDP port 514 is a designated

port forsyslogd, this makes thesyglogdprocesses on the switches receive the message.

In order to support dynamic remote forwarding, the logging daemon listens to a TCP port. When a client on a remote host makes a TCP connection to this port, the daemon adds the file descriptor of the TCP connection to a pool of active TCP connections. For each incoming logging message, the daemon formats the message and sends it to clients in the pool. When a remote client disconnects the TCP connection, any attempt to send a message to the client generates a file error. The daemon detects this as an inactive TCP connection and removes the file descriptor from the pool. The dynamic remote forwarding may look similar to the UDP port forwarding, but these two methods are designed for different purposes. The UDP port forwarding sends high-priority messages to the switches and is always enabled. The TCP dynamic remote forwarding, which is designed for easy debugging, sends messages of all priority levels and is enabled only on explicit requests to avoid unnecessary overhead.

VI. EVALUATION

A. Experiment Set-up

1) Metrics: For evaluation, we measured the performance of our shared memory-based service (mqlogd) andsyslogd

for two metrics – request latency and drop rate:

• Request Latency:We measured statistics (average, mini-mum, and maximum) for the request latency by repeating a logging message multiple times for each set of pa-rameters. Among these statistics, average and maximum request time are more interesting because they show the expected and worst-case service times.

• Request Drop Rate:We also measured the request drop rate. syslogd has a zero drop rate at the sacrifice of latency, which is an order of magnitude higher. In con-trast, themqlogd logging service drops requests under high load. By measuring the request drop rate for various sets of parameters, we will show the conditions where the

mqlogdlogging service achieves lower request latency while achieving a zero or acceptably low drop rate. 2) Parameters: For performance comparison of mqlogd

andsyslogd, we varied the number of clients and number of iterations for both logging services:

• Number of clients - 1 and 2: Changing the number of clients gives the lower bound for performance (a single client) and shows system performance as contention increases (two clients). No more than 2 clients were measured because most of the time, there is only one client logging messages.

• Number of iterations - 100, 1000, 5000, 10000, and 50000: Currently, the queue size of the system is set to 10240 items. In this test, we see how our logging service module behaves with different numbers of requests. For all of the measured configurations, five iterations were run.

(6)

For evaluating the most suitable design out of a few design variations, we measured the performance while varying the following parameters – locking mechanism and notification mechanism:

• Locking Mechanism - semaphore lock and pthread mutex lock: The purpose of this test is to see how the semaphore lock performs compared to the pthread mutex lock and how it performs when we change various parameters.

• Notification Mechanism - signal vs. write-and-select: While both signals and write-and-select are legitimate ways of notifying the server, signals may have side effects when used with the VIC library code. Thus, the purpose of this test is to see how write-and-select performs relative to signal, which has a smaller latency.

B. Performance Results

As for the average request latency, mqlogdshowed more than 10-times speed-up for all configurations compared to

syslogd.

1) Effects of Notification and Lock Mechanisms: For notifi-cation mechanisms, the measurements showed little difference between the implementation with write-and-select and the one with signal. In addition, signals cause problems with VIC library. This made us use write-and-select.

For lock mechanisms, the implementation with the pthread mutex lock was about 40% faster than the one with the semaphore lock. This implies that we can easily get an addi-tional 40% improvement by replacing calls to the semaphore lock with the pthread mutex lock when a working version is available. As for minimum request latency, mqlogd showed more than 20-times speed-up for all its configurations: 25-times speed-up for semaphore lock implementations and 40-times speed-up for pthread mutex lock implementations com-pared tosyslogd. As for maximum request latency,mqlogd

showed more than a 2x speed-up for all its configurations: by a factor of 2.3 for semaphore lock implementations and 4.0 for pthread mutex lock implementations compared to syslogd. 2) Effects of Queue Size: While syslogd did not drop requests, mqlogd dropped logging messages for a burst longer than or equal to 10000 (the queue size), and this became more severe as there were more requests. We can see that we need to have the queue size larger than the maximum expected size of a burst to avoid drops.

3) Effects of Multiple Clients: We measured the request latency and the request drop rate for both one and two clients. As for the average request latency, we have observed that having two clients makes about a factor of two increase. This ratio gets higher when the number of requests is larger than the size of queue as in the case of 50000 iterations. This shows that setting the queue size is important to have the performance bounded unsurprisingly. We notice that with two clients the request starts to drop with with smaller number of iterations than the case with a single client.

100 1000 5000 10000 50000 0 200 400 600 800 1000 1200 Number of Iterations Latency (us)

Average Request Latency − 1 Client syslogd

mqlogd (select, semaphore) mqlogd (signal, semaphore) mqlogd (select, pthread) mqlogd(signal, pthread) # - Iterations 100 1000 5000 10000 50000 syslogd 710.6 741.0 748.0 739.8 716.0 mqlogd (select, semaphore) 50.4 49.8 42.2 45.8 72.6 mqlogd (signal, semaphore) 90.4 49.2 46.0 50.4 85.2 mqlogd (select, pthread) 191.4 27.4 23.4 27.8 78.8 mqlogd (signal, pthread) 69.2 26.2 23.4 31.2 94.2 Speed-up over syslogd 100 1000 5000 10000 50000 mqlogd (select, semaphore) 14.1 14.9 17.7 16.2 9.9 mqlogd (signal, semaphore) 7.9 15.1 16.3 14.7 8.4 mqlogd (select, pthread) 3.7 27.0 32.0 26.6 9.1 mqlogd (signal, pthread) 10.3 28.3 32.0 23.7 7.6

Fig. 5. Average Request Latency for a Single Client

4) Effects of Client Interface Type: We evaluated the perfor-mance of our logging system using theUNIXsocket interface and compared it with the performance with shared memory. For reference, we compared this with the performance of

syslogd. The experiment results are shown in Figure 11. We find that the average latency with theUNIXsocket is more than an order of magnitude longer on average than that with the shared memory and about the same level as that ofsyslogd. While we provided theUNIXsocket interface for compatibility reasons, the use of the UNIX socket interface (e.g., syslog) should be minimized to achieve the best performance.

VII. OPTIMIZATION

We have shown thatmqlogdprovides a logging service that is at least 10 times faster thansyslogdby using user-space

(7)

100 1000 5000 10000 50000 0 100 200 300 400 500 600 700 800 900 1000 Number of Iterations Latency (us)

Minimum Request Latency − 1 Client syslogd

mqlogd (select, semaphore) mqlogd (signal, semaphore) mqlogd (select, pthread) mqlogd(signal, pthread) # - Iterations 100 1000 5000 10000 50000 syslogd 575.2 574.6 560.8 478.2 302.4 mqlogd (select, semaphore) 23.0 19.0 19.0 18.8 18.6 mqlogd (signal, semaphore) 23.4 21.0 21.0 20.4 20.2 mqlogd (select, pthread) 122.4 10.0 10.0 10.0 10.0 mqlogd (signal, pthread) 16.4 10.0 10.2 10.0 10.0 Speed-up over syslogd 100 1000 5000 10000 50000 mqlogd (select, semaphore) 25.0 30.2 29.5 25.4 16.3 mqlogd (signal, semaphore) 24.6 27.4 26.7 23.4 15.0 mqlogd (select, pthread) 4.7 57.5 56.1 47.8 30.2 mqlogd (signal, pthread) 35.1 57.5 55.0 47.8 30.2

Fig. 6. Minimum Request Latency for a Single Client

shared memory as the means of inter-process communication. In this section, we discuss a notification mechanism that can improve this speed-up even further.

A. Deferred Notification Mechanism

In order to improve the processing time of logging requests, we introduce the deferred notification mechanism. With de-ferred notification, a logging client sends one notification for a batch of messages rather than sending a separate notification for each message. Deferred notification is expected to give a smaller latency than per-message notification, but it has a restriction in that it requires an application to determine an appropriate time to send notifications. This is easy for most applications in VIC because they are event driven and the completion of an event handler is a very suitable time to send

100 1000 5000 10000 50000 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5x 10 4 Number of Iterations Latency (us)

Maximum Request Latency − 1 Client syslogd

mqlogd (select, semaphore) mqlogd (signal, semaphore) mqlogd (select, pthread) mqlogd(signal, pthread) # - Iterations 100 1000 5000 10000 50000 syslogd 9129 11977 14366 24143 41100 mqlogd (select, semaphore) 1984 11814 12234 12251 15844 mqlogd (signal, semaphore) 1976 9796 12635 13813 15060 mqlogd (select, pthread) 696 7440 12223 14516 14980 mqlogd (signal, pthread) 789 8159 11850 11415 13927 Speed-up over syslogd 100 1000 5000 10000 50000 mqlogd (select, semaphore) 4.6 1.0 1.2 2.0 2.6 mqlogd (signal, semaphore) 4.6 1.2 1.1 1.7 2.7 mqlogd (select, pthread) 13.1 1.6 1.2 1.7 2.7 mqlogd (signal, pthread) 11.6 1.5 1.2 2.1 3.0

Fig. 7. Maximum Request Latency for a Single Client

a notification.

B. Host-Side Round-Trip Time Measurement

In this section, we compare the performance of the deferred notification mechanism by measuring the round-trip time from the host. To make this measurement, we execute a program called devcmd on the host. The devcmd program sends queries to the main firmware application, which may produce logging messages, and measures the total latency from the perspective of the host. Usingdevcmdis useful because the latency on the host side is the latency a user of VIC (i.e., a network driver) will experience.

Thedevcmdprogram allows a user to specify a command to issue, and each command results in a different action in

(8)

100 1000 5000 10000 50000 0 10 20 30 40 50 60 70 80 90 100 Number of Iterations Percent

Request Drop Rate − 1 Client mqlogd (select, semaphore) mqlogd (signal, semaphore) mqlogd (select, pthread) mqlogd(signal, pthread) # - Iterations 100 1000 5000 10000 50000 mqlogd (select, semaphore) 0.0% 0.0% 0.0% 21.2% 80.0% mqlogd (signal, semaphore) 0.0% 0.0% 0.0% 27.7% 83.1% mqlogd (select, pthread) 0.0% 0.0% 0.0% 35.0% 86.7% mqlogd (signal, pthread) 0.0% 0.0% 0.0% 39.3% 87.6%

Fig. 8. Request Drop Rate for a Single Client

firmware. Firmware processes the received command and logs messages when the minimum threshold for logging is set to the

debuglevel. When the minimum threshold is set to theinfo

level, firmware processes a received command without logging messages. The difference in the round-trip time between the two modes is the logging time for the specified command. The parameters we considered are summarized as follows:

• Notification mechanisms: write-and-select, deferred and

syslogd

• Commands: macaddr (gets the MAC address) and

capability(checks availability of a given command) • Minimum threshold for logging: infoanddebug

Figure 12 shows the latency for the capability mand. Figure 12(a) is the graph for total latency for all com-binations of notification mechanisms and minimum threshold levels. As expected, the total latencies for the three notification mechanisms are very close when there is no logging. For total latencies with a minimum logging threshold ofdebug, we see differences among the notification mechanisms. Since the difference in latency between the debug level and the

info level is the time to process logging messages, we can quantify the cost of each notification mechanism. Figure 12(b) shows thatmqlogdwith the deferred notification mechanism achieves more than 2x speed-up compared to the one with write-and-select.

Figure 13 shows the latency for the macaddrcommand.

100 1000 5000 10000 50000 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Number of Iterations Latency (us)

Average Request Latency − 1 and 2 Clients syslogd (1 client)

syslogd (2 clients) mqlogd (select, 1 client) mqlogd (select, 2 clients)

# - Iterations 100 1000 5000 10000 50000 syslogd (1 client) 710.6 741.0 748.0 739.8 716.0 syslogd (2 clients) 1376.8 1433.0 1458.2 1400.4 1439.6 mqlogd(select, 1 client) 50.4 49.8 42.2 45.8 72.6 mqlogd(select, 2 clients) 139.4 73.6 72.2 81.8 206.4 mqlogd (signal 1 client) 90.4 49.2 46.0 50.4 85.2 mqlogd (signal 2 clients) 111.0 82.2 63.4 84.2 196.0 Speed-up over 1 client 100 1000 5000 10000 50000 syslogd 1.9 1.9 1.9 1.9 2.0 mqlogd (select) 2.8 1.5 1.7 1.8 2.8 mqlogd (signal) 1.2 1.7 1.4 1.7 2.3

Fig. 9. Average Request Latency for Two Clients

We see that the ratio among each notification mechanism is similar to that Figure 12; mqlogd with the deferred notification mechanism achieves about 2x speed-up compared to the one with write-and-select. The difference is that absolute values of latency for macaddr are smaller than those for

capabilitybecausemacaddris a trivial command – it’s essentially retrieving 4 bytes.

VIII. CONCLUSION

We presented a logging system for embeddedUNIX applica-tions that provides faster logging times thansyslogd, while providing application-level interface compatibility. It achieves almost two orders of magnitude speed-up in latency and an order of magnitude improvement in message throughput. This

(9)

0 500 1000 1500 2000 Time (us)

Total Latency for ’capability’ Command

write−and− select (debug) write−and− select (info) deferred (debug) deferred (info) syslogd (debug) syslogd (info) Average (us)

(a) Total Latency

0 500 1000 1500 2000 Time (us)

Logging Latency for ’capability’ Command

write−and− select

deferred syslogd Logging Time (us)

(b) Logging Latency Fig. 12. Latency for the Capability Command

0 500 1000

Time (us)

Total Latency for ’macaddr’ Command

write−and− select (debug) write−and− select (info) deferred (debug) deferred (info) syslogd (debug) syslogd (info) Average (us)

(a) Total Latency

0 500 1000

Time (us)

Logging Latency for ’macaddr’ Command

write−and− select

deferred syslogd Logging Time (us)

(b) Logging Latency Fig. 13. Latency for the Macaddr Command

100 1000 5000 10000 50000 0 10 20 30 40 50 60 70 80 90 100 Number of Iterations Percent

Request Drop Rate − 1 and 2 Clients

mqlogd (select, 1 client) mqlogd (select, 2 clients)

# - Iterations 100 1000 5000 10000 50000 mqlogd (select, 1 client) 0.0% 0.0% 0.0% 21.2% 80.0% mqlogd (select, 2 clients) 0.0% 0.0% 10.0% 45.7% 91.9%

Fig. 10. Request Drop Rate for Two Clients

speed-up is attributed to the use of user-level shared memory as the means of inter-process message communication. It also supports the sysloginterface, allowing system applications that usesyslogto be interoperable without additional modi-fications while achieving about the same level of performance

as syslogd.

ACKNOWLEDGMENTS

I would like to thank my colleagues at Cisco, Jim Orosz, Brad Smith, Puneet Shenoy and James Dong for having fruitful discussions and helping edit and proofread this paper.

REFERENCES

[1] C. Lonvick, “The BSD Syslog Protocol,” 2001. http://tools.ietf.org/html/ rfc3164.

[2] R. Gerhards, “The Syslog Protocol,” 2009. http://tools.ietf.org/html/ rfc5424.

[3] T. Qiu, Z. Ge, D. Pei, J. Wang, and J. Xu, “What happened in my network: mining network events from router syslogs,” inProceedings of the 10th ACM SIGCOMM conference on Internet measurement (IMC ‘10), ACM, November 2010.

[4] K. Slavicek, J. Ledvinka, M. Javornik, and O. Dostal, “Mathematical processing of syslog messages from routers and switches,” in Proceed-ings of the 4th International Conference on Information and Automation for Sustainability (ICIAFS 2008), IEEE, December 2008.

[5] M. Bing and C. Erickson, “Extending unix system logging with sharp,” in Proceedings of the 14th Systems Administration Conference (LISA 2000), USENIX, December 2000.

[6] M. Hutter, A. Szekely, and J. Wolkerstorfer, “Embedded system manage-ment using wbem,” inIFIP/IEEE International Symposium on Integrated Network Management, 2009 (IM ‘09), IEEE, June 2009.

[7] T. Park and I. Ra, “Design and evaluation of a network forensic logging system,” in Convergence and Hybrid Information Technology, 2008. ICCIT’08. Third International Conference on, vol. 2, pp. 1125–1130, IEEE, 2008.

[8] M. Roeschet al., “Snort-lightweight intrusion detection for networks,” in

Proceedings of the 13th USENIX conference on System administration, pp. 229–238, Seattle, Washington, 1999.

[9] M. Sch¨utte, “Implementation of ietf syslog protocols in the netbsd syslogd,” 2009.

[10] B. Scheidler, “syslog-ng.” http://www.balabit.com/network-security/ syslog-ng.

(10)

100 1000 5000 10000 50000 0 200 400 600 800 1000 1200 Number of Iterations Latency (us)

Average Request Latency − 1 Client syslogd

mqlogd (select, semaphore) mqlogd (Unix socket)

# - Iterations 100 1000 5000 10000 50000 syslogd 701.8 717.8 707.2 714.4 691.2 mqlogd (select, semaphore) 50.4 49.8 42.2 45.8 72.6 mqlogd (UNIX socket) 743.6 767.6 754.6 747.6 734.8 Speed-up over syslogd 100 1000 5000 10000 50000 mqlogd (select, semaphore) 14.1 14.9 17.7 16.2 9.9 mqlogd (UNIX socket) 1.0 1.0 1.0 1.0 1.0

Fig. 11. Average Request Latency for a Single Client with Different Client Interfaces

[12] A. Tomono, M. Uehara, M. Murakami, and M. Yamagiwa, “A log management system for internal control,” inNetwork-Based Information Systems, 2009. NBIS’09. International Conference on, pp. 432–439, IEEE, 2009.

[13] F. Nikolaidis, L. Brarda, J.-C. Garnier, and Nikifeld, “A universal logging system for lhcb online,”International Conference on Computing in High Energy and Nuclear Physics (CHEP 2010), October 2010. [14] J. Garnier, “Lhcb online log analysis and maintenance system,” in

Proceedings of the 13th International Conference on Accelerator and Large Experimental Physics Control Systems, October 2011.

[15] D. Reed, “nsyslogd.” http://coombs.anu.edu.au/∼_{avalon/nsyslog.html.} [16] S. Gai, T. Salli, and R. Andersson,Project California: a Data Center

Virtualization Server UCS (Unified Computing System). lulu.com, March 2009.

[17] Cisco Systems, “Cisco UCS M81KR Virtual Interface Card Data Sheet.” http://www.cisco.com/en/US/prod/collateral/ps10265/ps10280/ data sheet c78-525049.html.