in Large UNIX Environments
Robb Newman
NORTHWEST AIRLINES
ms: MSPB3300
5101 Northwest Drive
St. Paul, MN 55111

Abstract
What has been standard resource usage and capacity management practice for most mainframe shops is not yet common practice in many UNIX based environments. With the continued movement of business "critical" IS functions to UNIX based distributed processing systems, that situation is changing. This paper recommends a UNIX resource usage data collection process that satisfies the data requirements of UNIX capacity management. It also addresses the problem of data overload in large UNIX environments by recommending a specific set of key indicators to retain, as well as the UNIX system and network resource usage levels to use for exception reporting. The recommendations are based on experience gained working in Northwest Airlines' large, highly integrated UNIX environment, which is used for the processing of several important IS business applications. The solution implemented at Northwest Airlines was to loosely couple a UNIX based, SNMP compliant, resource usage and event monitoring software package and a network monitoring software package with the UNIX version of SAS. Together they provide the ability to collect, manage, analyze, and report on system and network resource usage.
What has been standard production resource usage and capacity management practice for most mainframe shops is not yet common practice in many UNIX based distributed processing datacenters. There are several reasons why the UNIX shops, while running on the hottest architectures and latest graphical user interfaces, have been "backward cousins" to the mainframe datacenters in this area. The primary reason for the difference is that only recently have a significant number of UNIX systems moved into large production environments that necessitate performance and capacity management. The most common usage of UNIX systems in business to date has been by engineers and white collar professionals [1]. Generally each user had her/his own workstation with minimal dependence on other systems on the network. When there was contention for a shared resource, it usually involved a small number of users and a single server. In general terms, the environments, while distributed, did not have complex dependencies for shared resources. In this "traditional" UNIX based business environment nothing usually needed to be done relating to resource usage and capacity planning until a system became so slow the users could no longer tolerate it. Basically, the "cost" of not having a resource usage and capacity planning process was acceptable.

UNIX is a registered trademark of Unix System Laboratories, Inc.
Today, however, an increasing number of UNIX based business environments are supporting very important IS applications with time critical processing requirements. In these types of environments it is unacceptable to allow the point to be reached where throughput degrades to a snail's pace. The "cost" of not having a resource usage and capacity planning process can become very significant in a short period of time.
When a UNIX IS management team realizes that it needs to implement standard resource usage and capacity management practices, many at first find themselves in a "catch 22" situation. Because relatively few businesses have implemented mission-critical UNIX based distributed processing environments, there are relatively few analysts working in IS departments with a knowledge of both performance/capacity management and UNIX. Compounding the lack of experience has been a general lack of "off the shelf" UNIX computer performance evaluation (CPE) tools.
2.0 DATA COLLECTION

2.1 TRADITIONAL PROCESS
Because of a general lack of tools, the most common method used to collect the resource usage data required to satisfy the data requirements of UNIX capacity management has been to write relatively simple UNIX shell scripts and execute them on a regular basis. The scripts would execute a number of standard UNIX system and network utilization commands such as VMSTAT (reports virtual memory statistics) and collect reams of data. If the site had a fairly good knowledge of the UNIX kernel, it may have even written C programs to query the internal resource usage tables for the data. These methods until recently were basically the only options available.
The basic problem with both methods is not a matter of what data is collected, but the amount of labor required up front and throughout the life of the process. In an environment with many critical systems and networks, scripts must be written to execute on each system and return the output to a central location. A system administrator must constantly be "baby-sitting" the process to ensure data continues to be collected from all systems. In addition, there is a problem collecting data from non-UNIX network devices, such as routers and concentrators. Obtaining resource usage data from these types of units usually required the purchase of a separate network management station and the development of a process for transferring the data collected on it to the central location.
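The traditional approach can be sketched as follows. This is a minimal illustration in Python rather than the shell scripts described above; the column layout in SAMPLE is an assumed vmstat-like format (real layouts differ between UNIX variants), and `parse_vmstat` and `unique_names` are hypothetical helper names.

```python
def unique_names(fields):
    """Return column names, numbering repeats (a second "sy" becomes "sy_2")."""
    seen, names = {}, []
    for name in fields:
        seen[name] = seen.get(name, 0) + 1
        names.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return names


def parse_vmstat(text):
    """Turn vmstat-style text into a list of {column: int} dicts, one per sample."""
    names, samples = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if all(f.isdigit() for f in fields):
            # A row of numbers is a sample; keep it only if it matches the header.
            if len(fields) == len(names):
                samples.append(dict(zip(names, map(int, fields))))
        else:
            # Any non-numeric row is treated as a (re)printed column header.
            names = unique_names(fields)
    return samples


# Assumed sample output; a real script would capture this from the command itself.
SAMPLE = """\
 r  b  swpd   free  si  so  bi  bo  in  cs  us  sy  id
 1  0     0  81234   0   0   5   2 110 240   3   1  96
 4  1   128  40960   2   9  60  45 380 910  71  22   7
"""

samples = parse_vmstat(SAMPLE)
busiest = max(samples, key=lambda s: s["us"] + s["sy"])
print(f"{len(samples)} samples, peak CPU busy {busiest['us'] + busiest['sy']}%")
```

Even a parser this small has to be deployed, scheduled, and watched on every system, which is the labor cost this section describes.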
2.2 RECOMMENDED PROCESS
One of the recent benefits of a UNIX standard setting body has been the development of a Simple Network Management Protocol (SNMP) standard. It specifies a protocol for the management of multivendor, heterogeneous environments, including both systems and network interfaces (such as concentrators). While the protocol can be used to physically manage devices on the network, it can also be used to query a device for resource usage data.
A number of the large UNIX computer vendors have announced both an SNMP management console/station based on the OSI Network management framework, and the counterpart (called an agent) that runs on the managed objects. Examples of managed objects are workstations, servers, and network interfaces (bridges, routers, concentrators, etc.). Together the management console and its agents have the capability of greatly reducing the labor required to collect data.
The agents run on the managed systems, collecting utilization data at predefined intervals and reporting back to the management console. The data can be actively monitored as well as logged in ASCII files for later analysis. The console communicates with agents using the industry standard Simple Network Management Protocol (SNMP). This allows the product to communicate with agents running on other vendors' platforms as well as with SNMP compliant network interfaces.
Using an SNMP management console and corresponding agents to collect resource usage data represents a significant improvement over the previous methods. It is reliable, does not require a significant amount of development or support time, and has minimal system and network overhead.
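The console/agent collection pattern described above can be sketched as follows. This is only an illustration of the polling and ASCII logging flow: no SNMP traffic is generated, each agent is stood in for by a plain Python callable, and the function name, field names, and log line layout are all assumptions rather than anything a real management console defines.

```python
import io


def poll_once(agents, timestamp, log):
    """Query every agent once; append one ASCII line per reply. Returns replies logged."""
    logged = 0
    for (host, agent_name), fetch in sorted(agents.items()):
        try:
            values = fetch()
        except OSError as exc:
            # An unreachable host is recorded and skipped so the others still report.
            log.write(f"{timestamp} {host} {agent_name} ERROR {exc}\n")
            continue
        fields = " ".join(f"{key}={values[key]}" for key in sorted(values))
        log.write(f"{timestamp} {host} {agent_name} {fields}\n")
        logged += 1
    return logged


def unreachable():
    raise OSError("no response")


# Hypothetical agents: plain callables standing in for SNMP queries.
agents = {
    ("server01", "na.hostperf"): lambda: {"cpu_busy": 42, "queue_len": 1},
    ("server01", "na.iostat"): lambda: {"disk0_busy": 17},
    ("server02", "na.hostperf"): unreachable,
}

log = io.StringIO()  # a real collector would append to a file instead
count = poll_once(agents, "09:15", log)
print(log.getvalue(), end="")
```

The point of the pattern is that one central loop replaces a separate script on every system, and a host that stops answering shows up in the log instead of silently dropping out.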
An additional reason to use one of the SNMP management console stations is the constant flow of new UNIX system management products which are controlled by the SNMP management console. A good example is Concord's TRAKKER product. It is an intelligent box which, when connected to a network, reports network resource usage, workload characterization, and network problems back to Sun's (and soon HP's) SNMP management console. Before products such as this, a very expensive network analyzer had to be connected to the network and used to capture a snapshot of the traffic.
2.3 WHAT DATA TO COLLECT
Having an easy, reliable mechanism for collecting resource usage data only amplifies the already existing problem of data overload. In a large environment the potential exists to collect several hundred megabytes of performance data a day! Given this problem, effort must be exerted to limit what is being collected, and then to further reduce that to only the key indicator values.
One way is to run only those agents that collect important resource usage data. As explained in section 4, Northwest Airlines selected Sun's SunNet Manager product as the SNMP manager. Table 1 lists the four agents, of the more than twenty supplied, which are recommended to be run on a regular basis on important servers in order to collect the resource usage data required for capacity management.
Table 1: SunNet Mgr Agents Used

Agent        Data collected
na.rpcnfs    RPC & NFS data (server or client variant depending on system type)
na.layers    network usage data (udp variant of layers agent for NFS servers, tcp variant for all)
na.iostat    disk I/O data
na.hostperf  system data
2.4 WHAT DATA TO RETAIN
The next method to reduce the amount of data collected is to strip out and retain only those data elements deemed critical. This still results in too much data to look at daily, necessitating an exception reporting process. Exception reporting is the process whereby only those data elements exceeding a pre-specified threshold are reported. Exception reporting ensures that potential problems are highlighted. Table 2 contains a list of recommended data items deemed critical and the associated values used for the generation of an exception report. Note that only eleven of the thirty-three data items retained are used for exception reporting. Decisions concerning the data items and values were derived from a variety of sources, including Sun's Analyzing Network and File Server Performance paper [4,5]. As environments change, the list of key indicators will change and the threshold values will also change. Included in the appendix is a very high level list of the reasons behind choosing these performance indicators and threshold values. If more detail is required concerning what to retain and the corresponding threshold values, it is recommended that at a minimum the reader obtain Sun's paper.
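Exception reporting as defined above can be sketched in a few lines. This is a minimal illustration, not the SAS implementation used at Northwest Airlines: the two TCP duplicate-packet thresholds and the network utilization threshold follow Table 2 and Appendix A, while the dictionary keys, the record layout, and the sample values are invented for the example.

```python
# Thresholds in percent. The two TCP values and the network utilization value
# follow Table 2 and Appendix A; the dictionary keys are hypothetical names.
THRESHOLDS = {
    "tcp_dup_sent_pct": 10.0,
    "tcp_dup_recv_pct": 10.0,
    "net_util_pct": 30.0,
}


def pct(part, total):
    """Ratio as a percentage; 0.0 when there was no traffic at all."""
    return 100.0 * part / total if total else 0.0


def exception_report(samples, thresholds=THRESHOLDS):
    """Return (host, time, indicator, value, limit) for every exceeded threshold."""
    report = []
    for sample in samples:
        for name, limit in thresholds.items():
            value = sample["values"].get(name)
            if value is not None and value > limit:
                report.append((sample["host"], sample["time"], name, value, limit))
    return report


# Invented sample data for illustration only.
samples = [
    {"host": "server01", "time": "09:15",
     "values": {"tcp_dup_sent_pct": pct(30, 1000), "net_util_pct": 22.0}},
    {"host": "server02", "time": "09:15",
     "values": {"tcp_dup_sent_pct": pct(180, 1200), "net_util_pct": 34.5}},
]

for host, time, name, value, limit in exception_report(samples):
    print(f"{time} {host}: {name} = {value:.1f} (threshold {limit:.1f})")
```

Indicators without an entry in the threshold table are still retained but never reported, which mirrors the split between retained items and exception items.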
3.0 DATA MANAGEMENT
Again because of a general lack of tools, the most common process for managing the reams of UNIX resource usage data collected was a three-ring binder. After a couple of binders filled, another UNIX shell script would be written to reduce the amount of data printed by using a standard UNIX pattern scanning command such as GREP or AWK.
An alternative to the binder method of data management was to transfer the data up to a mainframe and use a traditional CPE tool like SAS to manage the data. The logical progression of this alternative is to keep the data on UNIX and use the UNIX version of SAS for data management. Running SAS on UNIX allows the benefits of UNIX to be taken advantage of, and simplifies the process of data management. SAS, of course, offers much more than data
management to the UNIX performance
analyst/capacity planner. It also provides easy data access, as well as near limitless analysis and presentation capabilities.
Table 2: Resource Usage Indicators

Indicators collected using Sun's SunNet Mgr product      Threshold for daily exception reporting (1)

TCP (na.layers agent)
  ratio duplicate pkts sent/total pkts sent               10%
  packets received
  ratio duplicate pkts received/total pkts received       10%
  connections dropped by keepalive
NFS (na.rpcnfs agent)
  RPC calls
  RPC timeouts
  ratio RPC retransmits/total RPC calls (2)
  ratio RPC bad transaction ids/total RPC calls (2)
  NFS calls
  ratio NFS bad calls/total NFS calls
  ratio NFS writes/total NFS calls
UDP (na.layers agent)
  socket overflows
UNIX (na.hostperf & na.iostat agents)
  CPU busy
  CPU avg. queue length                                   3
  total disk I/Os
  disk I/Os for each disk                                 30 per second
  % disk busy (for each disk on system)                   60%
  page ins
  page outs
  pages swapped in
  pages swapped out
  output packets
  ratio output packet errors/total output pkts            0.025%
  ratio output collisions/total output pkts               5%

Indicators collected by NetMetrix product                Daily exception threshold
  Network utilization                                     30%
  NFS response time

Indicators collected by other products
  Database Mgr/Server response time (4)
  Database client request timeouts

(1) Not all indicators are used for exception processing
(2) Collect on NFS clients only
(3) Collect on NFS servers only
(4) Data not currently collected

Four further threshold values (10%, 10%, 10%, 5) belong to rows in the NFS and UDP groups; their row alignment is not recoverable in this copy.
4.0 CASE STUDY
An example of a complex and business critical UNIX based distributed processing environment requiring resource data collection and capacity management is Northwest Airlines' Passenger Revenue Accounting (PRA) environment. The environment is responsible for all accounting exception processing relating to ticket sales. To perform the exception processing Northwest Airlines' PRA department has almost 400 Sun SparcStation workstations running UNIX in use by accounting clerks. Shared resources in the PRA network include over 30 Sun servers that are providing a variety of file, boot, application, and database services, plus several image management systems and a scanner. There are over 20
production SYBASE relational databases, seven of which approach or exceed 1 GB in size. The largest database is a UNIX record breaking 4.5 GB in size.
The applications in use on the workstations concurrently access many of the databases in a distributed client server mode, as well as numerous data files. Many database requests first go to a common database and then are directed to the server that has the required database. The data files are also distributed across most of the servers.
The PRA environment processes a large volume of work using many shared resources, most of which are dependent on other shared resources. The potential for resource contention is nearly limitless. Should one of the several critical shared resources approach capacity, productivity for the entire environment degrades. Production UNIX based distributed processing environments with as high a requirement for resource usage and capacity management as PRA's were extremely rare not that long ago.
At Northwest Airlines, it did not take long for the people responsible for the PRA development project to realize the need for resource usage data in order to solve the several performance problems that arose during implementation. Due to the general lack of any "off the shelf" UNIX computer performance evaluation (CPE) tools, UNIX shell scripts were written to collect the data by executing a number of the standard UNIX system and network utilization commands. The reams of data collected were then whittled down and summarized using the UNIX utilities GREP and AWK. While the process was rough and did not include any long term data management or automated data analysis, it greatly improved the team's ability to find performance bottlenecks during
As mentioned previously, while this type of resource usage collection process works, and can even be improved with further custom development, it is very labor and resource intensive. Fortunately, there are a few "off the shelf" UNIX resource usage tool packages available today that have a much cleaner implementation and are therefore less labor and resource intensive.
When Sun released its SNMP management console product named SunNet Manager, Northwest Airlines stopped using its original process and switched to this product as the primary solution for system and network resource usage data collection. Switching from a non-standard, internally supported set of UNIX shell scripts to a standard, vendor supported, SNMP management product like SunNet Manager was an easy decision. It provides an easy to use graphical user interface, flexibility, and multiple vendor support.
Unfortunately, there were a few additional data items deemed important for which SNMP agents were not supplied by Sun or another vendor, and have not yet been written in-house. Until the agents are available, a process similar to the early UNIX shell script process is used to collect SYBASE database kernel resource usage, and the product from Metrix named NetMetrix is used to calculate and collect NFS response times, network load, and some network workload characterization data.
While SunNet Manager contains some capabilities for summarizing and reporting utilization, it is lacking when compared to other data analysis and reporting packages. For that reason the SAS software system on UNIX was chosen for data reduction, long term data management and retention, automated performance bottleneck and resource usage reporting, as well as statistical data analysis. While SAS CPE products exist for several operating systems, there was not yet one for UNIX when this project began. The absence of a SAS CPE product required Northwest Airlines to write the SAS programs needed to maintain, process, and report on the UNIX workload and resource usage. Now, however, version 6.0.7 of SAS supplies in the samples directory a suite of SAS programs written specifically to read and manage Sun's SunNet Manager data. After it was shown that the SAS provided code had at least as good functionality and reliability, the in-house written code was dropped.
Having the tools available to collect, maintain, process, and report on the resource usage data still, however, leaves problems unique to large distributed processing environments to be solved. A number of decisions have to be made concerning what to collect, and of that, what is really important.
Effort was exerted to limit what is being collected, and then to further reduce that to only the key indicator values, and rely on exception reporting to highlight problems.
In Northwest Airlines' PRA environment, four SunNet Manager agents are run on each of the servers and on a small (5%) random sample of workstations. Each agent reports to the management console once every fifteen minutes. The agents used are listed in Table 1.
The total amount of raw data collected is approximately 10 MB a day. At the end of each day a combination of shell scripts and SAS programs is executed to reduce the logs by keeping only those data items currently deemed critical. Combined, the SAS data sets grow by approximately 1 MB per day. The unreduced logs are compressed and archived in case a data element not retained in a SAS data set is needed at a later date.
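As a rough consistency check on these volumes, the figures quoted in this case study can be combined as below. The host counts are approximations taken from the text ("over 30" servers, "almost 400" workstations), around-the-clock reporting is assumed, and 1 MB is taken as 2**20 bytes, so the result is an order-of-magnitude estimate only.

```python
SERVERS = 30                 # "over 30" servers, so a lower bound
WORKSTATIONS = 400           # "almost 400" workstations, so an upper bound
SAMPLE_FRACTION = 0.05       # 5% random sample of workstations
AGENTS_PER_HOST = 4          # the four agents of Table 1
REPORTS_PER_HOUR = 4         # one report every fifteen minutes
RAW_BYTES_PER_DAY = 10 * 1024 * 1024   # about 10 MB, taking 1 MB = 2**20 bytes

# Hosts that run agents: every server plus the sampled workstations.
monitored_hosts = SERVERS + round(WORKSTATIONS * SAMPLE_FRACTION)

# Assumes agents report around the clock, 24 hours a day.
reports_per_day = monitored_hosts * AGENTS_PER_HOST * REPORTS_PER_HOUR * 24

# Average raw log size implied for a single agent report.
bytes_per_report = RAW_BYTES_PER_DAY / reports_per_day

# SAS data sets grow by about 1 MB per day.
retained_mb_per_year = 1 * 365

print(f"{monitored_hosts} hosts, {reports_per_day} reports/day, "
      f"about {bytes_per_report:.0f} bytes per report, "
      f"about {retained_mb_per_year} MB retained per year")
```

Under these assumptions each agent report averages roughly half a kilobyte of raw log, and the retained SAS data grows by a few hundred megabytes a year.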
One megabyte of performance data is, of course, still too much data to look at daily. With this in mind another SAS program scans the day's data and generates an exception report containing only the performance "problems" observed, and a second report which summarizes the day's prime time activity. The process of exception reporting allows the 10 MB of raw data collected to be reduced to a couple of pages to be looked at by a performance analyst (unless, of course, there are numerous problems).
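The prime time summary report might look like the following sketch. The prime shift hours (08:00 up to 17:00), the tuple layout, and the sample values are assumptions made for illustration; the paper does not define them, and the actual report was produced by a SAS program.

```python
from statistics import mean

PRIME_START, PRIME_END = 8, 17   # assumed prime shift: 08:00 up to 17:00


def prime_time_summary(samples, start=PRIME_START, end=PRIME_END):
    """Return {(host, indicator): (mean, peak)} over samples inside prime time."""
    grouped = {}
    for hour, host, indicator, value in samples:
        if start <= hour < end:
            grouped.setdefault((host, indicator), []).append(value)
    return {key: (mean(vals), max(vals)) for key, vals in grouped.items()}


# Invented samples: (hour of day, host, indicator, value)
samples = [
    (3, "server01", "cpu_busy", 95.0),    # overnight batch, outside prime time
    (9, "server01", "cpu_busy", 40.0),
    (13, "server01", "cpu_busy", 60.0),
    (9, "server01", "queue_len", 1.0),
    (13, "server01", "queue_len", 4.0),
]

summary = prime_time_summary(samples)
for (host, indicator), (avg, peak) in sorted(summary.items()):
    print(f"{host} {indicator}: mean {avg:.1f}, peak {peak:.1f}")
```

Restricting the summary to prime time keeps overnight batch activity from masking the service level seen by interactive users.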
In addition to the daily processing, several SAS programs were written to perform a number of other CPE functions, including: resource usage reporting, high-level management reports, spotting of unbalanced resource usage, comparing current
usage to past usage, trends, and using regression analysis to estimate future resource usage. To a large number of UNIX performance analysts who do not come from a mainframe background, SAS is a relatively foreign application. SAS, however, is a very powerful tool that can greatly improve a performance analyst's ability to detect problems and make correct decisions. Its use on UNIX should be seriously considered.
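The regression-based estimate of future resource usage mentioned above can be illustrated with an ordinary least-squares trend line. This sketch uses plain Python rather than SAS; the weekly disk-busy figures are invented, the 60% limit is borrowed from the disk busy threshold in Table 2, and the function names are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(limit, slope, intercept, now):
    """Periods from `now` until the fitted line reaches `limit`; None if it never does."""
    if slope <= 0:
        return None
    return max(0.0, (limit - intercept) / slope - now)


# Invented weekly averages of % disk busy for one disk.
weeks = [0, 1, 2, 3, 4, 5]
disk_busy = [40.0, 43.0, 44.0, 47.0, 48.0, 51.0]

slope, intercept = linear_fit(weeks, disk_busy)
remaining = periods_until(60.0, slope, intercept, now=weeks[-1])
print(f"growth {slope:.2f} points/week; about {remaining:.1f} weeks until 60% busy")
```

A straight-line fit is the simplest choice and assumes steady growth; a workload change such as a new application would invalidate the projection.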
Clearly resource usage and capacity management practices need to be incorporated into the management of complex UNIX based distributed processing environments, especially those responsible for critical business IS functions. While there are other methods by which to implement these practices, at Northwest Airlines the loose coupling of SunNet Manager and SAS has been successfully implemented without investing an unreasonable amount of money.
[1] E. Liederman, "An Interview with Andy Bechtolsheim," Sun World 4,3 (March 1991), pp. 14-15.
[2] Sun Microsystems, Inc., SunNet Manager 1.1 Installation and User's Guide. Mountain View, CA: Sun Microsystems, 1991. 277pp.
[3] C.K. Law, "SunNet Manager," Sun World 4,3 (March 1991), pp. 60-68.
[4] Sun Microsystems, Inc., "Analyzing Network and File Server Performance," (Sun Microsystems, [July 1990]), 50pp.
[5] G. Dennis, "PRA System Management Pilot," (Anderson Consulting, [Nov. 1990]), 32pp.
[6] R. Berry and J. Hellerstein, "An Approach to Detecting Changes in the Factors Affecting the Performance of Computer Systems," Proceedings of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems 19,1 (May 1991), pp. 39-49.
Appendix A: High Level Overview of Key Performance and Resource Usage Indicators
Network Utilization: How much of the Ethernet bandwidth is being used. General rule of thumb is that Ethernet network utilization should not exceed 30% to 35% for any length of time.
TCP Packets Sent & Received: For this site, TCP traffic is primarily database requests and replies. These elements give an indication as to the database workload.
Percent of Duplicate TCP Packets Sent & Received: Indicates that a performance problem exists. Either the servers and/or the network are not keeping up.
Connections Dropped by Keepalive: Indicates that an NFS performance problem exists. The NFS server took too long to reply.
RPC Calls: For this site, RPC calls are primarily NFS calls. This element gives an indication of the NFS workload characteristics. Should the number of RPC calls differ significantly from the number of NFS calls, that would indicate a change in this environment's workload characteristics.
RPC Time-outs: Indicates that an NFS performance problem exists. The NFS server took too long to reply.
Percent of RPC Retransmits: Indication of the server's and network's ability to keep up with the number of RPC (NFS) calls.
Percent of RPC Bad Transaction IDs: Number of duplicate acknowledgments received for a single NFS request. Indication of a server and/or network problem. If the number of bad transaction ids is close to the number of retransmissions, it is an indicator that the server is having performance problems. If the number of bad transaction ids is much less than the number of retransmissions, it is an indicator that the network is causing performance problems.
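The bad transaction id rule of thumb can be written down as a small classifier. The paper gives no numeric meaning for "close to" or "much less than", so the 0.8 and 0.2 cutoffs below are hypothetical values chosen only to make the sketch concrete, as is the function name.

```python
def diagnose_retransmits(retransmits, bad_xids, close=0.8, far=0.2):
    """Classify the likely cause of RPC retransmissions seen by an NFS client.

    `close` and `far` are assumed cutoffs for "close to" and "much less than".
    """
    if retransmits == 0:
        return "none"
    ratio = bad_xids / retransmits
    if ratio >= close:
        # Nearly every retransmit was eventually answered twice: the server is slow.
        return "server"
    if ratio <= far:
        # Few duplicate replies came back: requests or replies are being lost.
        return "network"
    return "unclear"


for retransmits, bad_xids in [(100, 95), (100, 5), (100, 50), (0, 0)]:
    print(retransmits, bad_xids, diagnose_retransmits(retransmits, bad_xids))
```

Counts between the two cutoffs are reported as unclear rather than forced into either category, since the rule of thumb only covers the two extremes.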
NFS Calls: This element gives an indication of the NFS workload characteristics.
Percent of Bad NFS Calls: Indication of potential hardware problems.
NFS Reads, Writes, and Ratio of Writes to Total Calls: These elements give a breakdown of the NFS workload. The number of writes is of particular interest because a number of vendors sell products which will "speed up" the writes by allowing them to complete asynchronously.
UDP Socket Overflows: Indication of NFS performance problems. The server is unable to keep up with the NFS requests.
CPU Statistics: Usual CPU statistics collected on all systems. Queue size was deemed more important for exception processing, because it is a slightly better indicator of the level of service being delivered to the users. With batch processing in the background, servers will run close to 100% utilized, but possibly with minimal impact on the users.
I/O Statistics: Usual I/O statistics collected on all systems. Provide an indication as to the I/O workload characteristics. Until recently SunNet Mgr was weak in this area. They have now distributed an unsupported agent that collects the same data as the iostat command.
Memory Statistics: Usual memory statistics collected on virtual memory systems. Provide an indication as to the memory usage characteristics of the workload.
Network Statistics: The standard na.hostperf agent also provides some network statistics which are not broken down. Retained are indicators of the amount of traffic generated by each system, potential hardware problems, and an additional indicator as to how busy the network is as experienced by this system.