If something isn’t going well on your server, you want to know about it. So, before you can conduct any process management, you need to monitor process activity. Linux has an excellent tool that allows you to see exactly what’s happening on your server: the top utility. This utility shows you almost everything you need to know. Starting it is easy: just run the top command. When the utility starts, you’ll see something like Figure 6-1.
Figure 6-1. The top utility gives you everything you need to know about the current state of your server.
Using top to Monitor System Activity
The top window consists of two major parts. The first (upper) part provides a generic overview of the current state of your system. These are the first five lines in Figure 6-1. In the second (lower) part of the output, you can see a list of processes, with information about the activity of these processes.
The first line of the top output starts with the current system time. This time is followed by the “up” time; in Figure 6-1, you can see that the system has been up for only a few minutes. Next, you see the number of users currently logged in to your server. The end of the first line contains some very useful information: the load average. It is shown as three numbers. The first is the load average for the last minute, the second is the load average for the last 5 minutes, and the third is the load average for the last 15 minutes.
The load average is a number that indicates the activity of the process queue: the average number of processes that are running on a CPU or waiting to run. (On Linux, processes in uninterruptible sleep, typically waiting for disk I/O, are counted as well.) On a system with one CPU, a load average of 1.00 indicates that the CPU is completely occupied but no processes are waiting in the queue. If the value rises above 1.00, processes are lining up and users may experience delays while communicating with your server. It’s hard to say exactly what a critical value is. On many systems, a value anywhere between 1 and 4 indicates that the system is just busy, but if you want your server to run as smoothly as possible, make sure that this value exceeds 1.00 only rarely.
If an intensive task (such as a virus scanner) becomes active, the load average can easily rise to 4. It can even reach an extreme value such as 254, in which case processes are likely to wait in the queue for so long that they time out or fail. What exactly indicates a healthy system can be determined only by properly baselining your server. In general, 1.00 is the reference value for a one-CPU system. If your server has hyperthreading, a dual-core CPU, or two CPUs, that value is 2.00. On a 32-CPU system with hyperthreading enabled on all CPUs, it is 64. The bottom line is that each (virtual) CPU counts as 1 toward the overall value.
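The rule that each (virtual) CPU counts as 1 can be turned into a quick check by dividing the load average by the number of CPUs. The following is a minimal Python sketch, not part of top itself; the function name load_per_cpu is made up for this illustration. A result above 1.00 suggests that processes are queuing.

```python
import os


def load_per_cpu(load_average: float, cpu_count: int) -> float:
    """Return the load average divided by the number of (virtual) CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


if __name__ == "__main__":
    # os.cpu_count() reports logical CPUs, so hyperthreads are included.
    cpus = os.cpu_count() or 1
    # os.getloadavg() returns the 1, 5 and 15 minute load averages.
    for label, value in zip(("1 min", "5 min", "15 min"), os.getloadavg()):
        print(f"{label}: load {value:.2f}, per CPU {load_per_cpu(value, cpus):.2f}")
```

With this normalization, a load of 2.00 on a dual-core server and a load of 64 on the 32-CPU hyperthreaded system both come out as 1.00 per CPU.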
The second line of the top output shows you how many tasks are currently active on your server and also shows you the status of these tasks. A task can have one of four statuses:
• Running: In the last polling interval, the process has been active. You will normally see that this number is rather low.
• Sleeping: The process is not using the CPU because it is waiting for something, such as input. This is the typical status of an inactive daemon process.
• Stopped: The process has been suspended, for example because a user pressed Ctrl+Z in a shell or because it received a stop signal. It stays in this status until it is told to continue or is terminated.
• Zombie: The process has terminated, but its parent process hasn’t collected its exit status yet. A zombie that stays around for a long time is a typical sign of bad programming in the parent. Zombie processes disappear when the parent collects the exit status or exits itself, and they will always be gone after you reboot your system.
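The four statuses above can also be counted without top. On Linux, each process exposes a one-letter state in /proc/<pid>/stat (R, S, T, and Z for the four statuses listed here). The sketch below is an illustration only: the function names are invented for this example, and any other state letter is simply counted as "other" rather than mapped to one of top's four categories.

```python
from collections import Counter
from pathlib import Path

# One-letter kernel states for the four statuses described in the text.
STATE_NAMES = {"R": "running", "S": "sleeping", "T": "stopped", "Z": "zombie"}


def state_of(stat_line: str) -> str:
    """Extract the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' instead of on whitespace.
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(stat_lines) -> Counter:
    """Count processes per status; unknown letters are counted as 'other'."""
    counts = Counter()
    for line in stat_lines:
        counts[STATE_NAMES.get(state_of(line), "other")] += 1
    return counts


def read_stat_lines(proc: Path = Path("/proc")):
    """Yield the stat line of every process currently visible in /proc."""
    for entry in proc.iterdir():
        if entry.name.isdigit():
            try:
                yield (entry / "stat").read_text()
            except OSError:
                continue  # the process exited while we were reading


if __name__ == "__main__" and Path("/proc").is_dir():
    for status, number in sorted(count_states(read_stat_lines()).items()):
        print(f"{number:5d} {status}")
```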
The third line of the top output provides information about current CPU activity. This activity is separated into different statistics:
• us: CPU activity in user space. Typically, these are commands that have been started by normal users.
• sy: CPU activity in system space. Typically, these are kernel routines that are doing their work. Although the kernel is the operating system, kernel routines are still often conducting work on behalf of user processes or daemons.
• id: CPU inactivity, also known as the idle loop. A high value here just indicates that your system is doing nothing.
• wa: For “waiting,” this is the percentage of time that the CPU has been waiting for I/O to complete. This should normally be a very low value; if not, it’s time to make sure that your hard disk can still keep up with the other system activity.
• hi: For “hardware interrupt,” this is the time the CPU has spent communicating with hardware. It will be rather high if, for example, you’re reading large amounts of data from an optical drive.
• si: For “software interrupt,” this is the time your CPU has spent handling interrupts generated by software rather than by hardware. It should be rather low on all occasions.
• st: This parameter indicates the time that is stolen from a virtual machine by the virtualization hypervisor (see Chapter 12 for more details about virtualization and the hypervisor). On a server that doesn’t use any virtualization, this parameter should be 0 at all times. In a virtual machine that sees a lot of activity, the parameter will rise from time to time.
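Percentages like these can be derived from the cumulative counters on the first line of /proc/stat by taking two samples and comparing them. The following Python sketch illustrates that idea; it is not a description of how top is implemented. The function names are made up for this example, and the short labels are assumed to map to the first eight counters of the cpu line (user, nice, system, idle, iowait, irq, softirq, steal). The extra label ni (time spent on niced user processes) is not covered in the list above.

```python
import time
from pathlib import Path

# Labels for the first eight counters on the "cpu" line of /proc/stat.
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def parse_cpu_line(line: str) -> dict:
    """Turn the aggregate 'cpu ...' line of /proc/stat into a dict of counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(value) for value in parts[1:9])))


def cpu_percentages(before: dict, after: dict) -> dict:
    """Return the share of time per category between two samples, in percent."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def sample_cpu(path: Path = Path("/proc/stat")) -> dict:
    """Read the current counters from the first line of /proc/stat."""
    return parse_cpu_line(path.read_text().splitlines()[0])


if __name__ == "__main__" and Path("/proc/stat").exists():
    first = sample_cpu()
    time.sleep(1)  # the interval over which activity is measured
    usage = cpu_percentages(first, sample_cpu())
    print("  ".join(f"{name} {value:5.1f}%" for name, value in usage.items()))
```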
The fourth and fifth lines of the top output display the memory statistics. These lines show you information about the current use of physical RAM (memory) and swap space. (Similar information can also be displayed using the free utility.) The important thing to check here is that not much swap space is in use. Swapping is bad, because the disk space used to compensate for the lack of physical memory is orders of magnitude slower than real RAM.
If all memory is in use, look at how much of it is taken by buffers and cache. Both are memory that the kernel uses to speed up disk access: buffers hold raw block-device data such as file system metadata, and cache holds the contents of files that have recently been read or written. The kernel can reclaim most of this memory quickly when processes need it, so memory used for buffers and cache is not a sign of a shortage. Memory that processes themselves have claimed is a different matter; it cannot be freed without stopping those processes or pushing their data out to swap. On a healthy server, a substantial share of the memory in use should therefore be cache.
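The same memory figures are available from /proc/meminfo, which reports values such as MemTotal, Buffers, Cached, SwapTotal, and SwapFree in kB. The following Python sketch reads that file and reports how much swap is in use; the function names are invented for this illustration.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo content into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def swap_used_kb(info: dict) -> int:
    """Return the amount of swap space in use, in kB."""
    return info["SwapTotal"] - info["SwapFree"]


if __name__ == "__main__" and Path("/proc/meminfo").exists():
    info = parse_meminfo(Path("/proc/meminfo").read_text())
    for key in ("MemTotal", "Buffers", "Cached"):
        print(f"{key:10s} {info.get(key, 0):>12d} kB")
    print(f"{'Swap used':10s} {swap_used_kb(info):>12d} kB")
```

A swap figure that keeps growing between runs is the signal to look for, more than any single reading.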
The lower part of the top window lists the processes on your system, by default sorted by CPU usage, so the most active process is listed first. Each line displays the following information about a process:
• PID: Every process has a unique process ID (the so-called PID). Many tools, such as kill, need this PID for process management.
• User: This is the name of the user ID the process is using. Many processes run as root, so you will see the username root rather often.
■Note For well-programmed processes, it’s generally not a problem that they’re running as root. Logging in as the user root is a different story, though.
• PR: This is the priority indication for the process. This number is an indication of when the process will get some CPU cycles again. A lower value indicates a higher priority, so the process will get its share of CPU cycles sooner. The value RT indicates that it is a real-time process and is therefore given top priority by the scheduler.
• NI: The nice value of the process. See “Setting Process Priority” later in this chapter for more details on nicing processes.
• VIRT: This is the total amount of memory that is claimed by the process.
• RES: The resident memory size is the memory that the process is actually using at the moment.
• SHR: The amount of shared memory is what the process shares with other processes. You’ll see this quite often, as processes often share libraries with other processes.
• S: This is the status of the process. The statuses are the same as the ones in the second line of the top screen, shown as a single letter (for example, R for running and S for sleeping).
• %CPU: This is the share of CPU time that the process has used during the last polling cycle (by default, top refreshes every 3 seconds).
• %MEM: This is the percentage of physical memory that the process is currently using.
• TIME+: This indicates the total amount of CPU time that the process has used since it was first started. To measure the CPU time of a command that you start yourself, use the time command, followed by the command that you want to measure.
• Command: This is the command that started the process.
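If you want to process these fields in a script, top can be run non-interactively with the -b (batch mode) and -n (number of iterations) options, and its output can then be split into columns. The following Python sketch shows one way to do that. The function names are made up for this example, and the column names are taken from whatever header line top prints, which varies with the top version and its configuration.

```python
import subprocess


def run_top_once() -> str:
    """Run top for a single iteration in batch mode and return its output."""
    result = subprocess.run(
        ["top", "-b", "-n", "1"], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_process_table(output: str) -> list:
    """Return one dict per process line, keyed by the column names top printed."""
    lines = output.splitlines()
    # The process table starts at the header line whose first column is PID.
    for index, line in enumerate(lines):
        if line.split()[:1] == ["PID"]:
            break
    else:
        raise ValueError("no process table header found")
    columns = lines[index].split()
    rows = []
    for line in lines[index + 1:]:
        # Split only as often as needed, so that a command containing
        # spaces stays together in the last column.
        values = line.split(None, len(columns) - 1)
        if len(values) == len(columns):
            rows.append(dict(zip(columns, values)))
    return rows


# Example use on a system where top is installed:
#   for row in parse_process_table(run_top_once())[:5]:
#       print(row["PID"], row["%CPU"], row["COMMAND"])
```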
As you have seen, the top command provides a lot of information about current system activity. Based on this information, you can tune your system so that it works as efficiently as possible.