Chapter 2. Performance tuning getting started
2.2 CPU performance overview
When investigating a performance problem, a CPU constraint is probably the easiest to find. That is why most performance analysts start by checking for CPU constraints and then work their way through the flowchart shown in Figure 3.
If a system is CPU bound, investigation should focus on the two entities that use the CPU: processes and threads. The CPU's basic work unit is the thread, so a process must have at least one thread. Commonly (on AIX Version 4), a process is multi-threaded, which means that a process can use multiple threads to accomplish its task. In Figure 4, the relationship between processes and threads is symbolized.
When initiating a process, the first resource to be allocated is a slot in the process table; before this slot is assigned, the process is in the SNONE state. While the process is undergoing creation, waiting for resources (memory) to be allocated, it is in the SIDL state. These two states are together called the I state.
Figure 4. Process state
When a process is in the A state, one or more of its threads are in the R state, which means they are ready to run. A thread in this state has to compete for the CPU with all other threads in the R state. Only one thread can use a given CPU at any given time.
If a thread is waiting for an event or for I/O, the thread is said to be sleeping, or in the S state. When the I/O is complete, the thread is awakened and placed in the ready to run queue.
If a thread is stopped with the SIGSTOP signal (to be awakened with the SIGCONT signal), it is in the T state while suspended.
Manipulating the run queue, the process and thread dispatcher, and priority calculation are all ways to tune (and mistune, if not done carefully) the CPU. The run queue, and how the thread to be prioritized is chosen, are discussed in Chapter 6, “Performance management tools” on page 177. When tuning the CPU, you need to know what can be tuned on a process level and what can be tuned on a thread level, and choose accordingly. Table 2 provides a list that associates several process-related properties with thread-related properties.
Table 2. Processes and threads
Process properties    Thread properties
PID and PGID          TID
UID and GID           Stack
Environment           Scheduling policy
Cwd                   Pending signals
When working in the area of CPU performance tuning, you should use historical performance information for comparison. Performance is usually judged subjectively. To avoid confusion, hard copies of performance statistics from a time when users did not report poor system performance should be kept on file. A very useful tool for this task is the sar command.
The sar command
Two shell scripts, /usr/lib/sa/sa1 and /usr/lib/sa/sa2, are structured to be run by the cron command and provide daily statistics and reports. Sample stanzas are included (but commented out) in the /var/spool/cron/crontabs/adm crontab file to specify when the cron daemon should run the shell scripts. The sa1 script creates one output file each day, and the sa2 script collects data and saves the data for one week. Another useful feature of sar is that the output can be specific about the usage for each processor in a multiprocessor environment, as seen in the following output. The last line shows the average.
# sar -P ALL 2 1

AIX client1 3 4 000BC6DD4C00    07/06/00

14:46:52 cpu    %usr    %sys    %wio   %idle
14:46:54  0        0       0       0     100
          1        0       1       0      99
          2        0       0       0     100
          3        0       0       0     100
          -        0       0       0     100
More information on the sar command can be found in Section 3.1, “The sar command” on page 47.
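The per-processor report shown above can also be scanned mechanically when many samples are filed for later comparison. The following is a minimal sketch, not part of the sar tooling: busiest_cpu and sample_sar are hypothetical helper names, and the parsing assumes the five-column layout (cpu, %usr, %sys, %wio, %idle) of the report in this section.

```shell
#!/bin/sh
# busiest_cpu (hypothetical helper): read "sar -P ALL" style text on stdin
# and print the processor with the lowest %idle value.
busiest_cpu() {
    awk '
    {
        # The first row of a sample starts with an HH:MM:SS timestamp,
        # so the cpu id and %idle sit one field further to the right.
        off  = ($1 ~ /^[0-9]+:[0-9]+:[0-9]+$/) ? 1 : 0
        cpu  = $(1 + off)
        idle = $(5 + off)
        # Skip the banner, the column header, and the "-" average row.
        if (NF != 5 + off || cpu !~ /^[0-9]+$/) next
        if (seen == 0 || idle + 0 < min) { min = idle + 0; who = cpu; seen = 1 }
    }
    END { if (seen) printf "cpu %s busy %d%%\n", who, 100 - min }
    '
}

# The report from this section, reproduced so the sketch runs anywhere.
sample_sar() {
    cat <<'EOF'
AIX client1 3 4 000BC6DD4C00    07/06/00

14:46:52 cpu    %usr    %sys    %wio   %idle
14:46:54  0        0       0       0     100
          1        0       1       0      99
          2        0       0       0     100
          3        0       0       0     100
          -        0       0       0     100
EOF
}

sample_sar | busiest_cpu    # prints: cpu 1 busy 1%
# On a live system the same filter could be fed directly:
#   sar -P ALL 2 1 | busiest_cpu
```

On the sample report only processor 1 shows any activity, so it is the one reported. "Busy" here simply means 100 minus %idle, which also counts %wio as busy time; adjust the arithmetic if I/O wait should be treated separately.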
Occasionally, the time spent in an application execution or an application startup can be useful to have as reference material. The time command can be used for this.
The time command
Use the time command to understand the performance characteristics of a single program and its synchronous children. It reports the real time, that is, the elapsed time from beginning to end of the program. It also reports the amount of CPU time used by the program. The CPU time is divided into user and sys components. The user value is the time used by the program itself and any library subroutines it calls. The sys value is the time used by system calls invoked by the program (directly or indirectly). An example output follows:
# time ./tctestprg4
real    0m5.08s
user    0m1.00s
sys     0m1.59s
The sum of user + sys is the total direct CPU cost of executing the program. This does not include the CPU costs of parts of the kernel that can be said to run on behalf of the program, but which do not actually run on the program's thread. For example, the cost of stealing page frames to replace the page frames taken from the free list when the program started is not reported as part of the program's CPU consumption. Another example of the time command is provided in Section 7.1, “CPU performance scenario” on page 201.
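The user + sys arithmetic can be worked through for the tctestprg4 figures reported above. The following is a minimal sketch; to_seconds and cpu_summary are hypothetical helper names, and the parsing assumes values in the NmS.SSs form shown in the example output.

```shell
#!/bin/sh
# to_seconds (hypothetical helper): convert a value such as 0m1.59s to seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.2f\n", $1 * 60 + $2 }'
}

# cpu_summary <real> <user> <sys>: print the direct CPU cost (user + sys)
# and the share of the elapsed time that it represents.
cpu_summary() {
    real=$(to_seconds "$1")
    user=$(to_seconds "$2")
    sys=$(to_seconds "$3")
    awk -v r="$real" -v u="$user" -v s="$sys" 'BEGIN {
        if (r <= 0) { print "elapsed time must be positive"; exit 1 }
        cpu = u + s
        printf "cpu=%.2fs elapsed=%.2fs cpu/elapsed=%.0f%%\n", cpu, r, 100 * cpu / r
    }'
}

# Figures from the example output in this section.
cpu_summary 0m5.08s 0m1.00s 0m1.59s
# prints: cpu=2.59s elapsed=5.08s cpu/elapsed=51%
```

In this example the program consumed 2.59 seconds of CPU during 5.08 seconds of elapsed time. The remaining time was spent off the CPU, for example waiting for I/O or for other threads to be dispatched; the time output alone does not say which.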
When starting to analyze a performance problem, most analysts start with the vmstat command, because it provides a brief overall picture of both CPU and memory usage.
The vmstat command
The vmstat command reports statistics about kernel threads, virtual memory, disks, traps, and CPU activity. Reports generated by the vmstat command can be used to balance system load activity. These system-wide statistics (among all processors) are calculated as averages for values expressed as percentages, and as sums otherwise. Most interesting from a CPU point of view are the two left-hand columns (r and b) and the four right-hand columns (us, sy, id, and wa) in the following output:
# vmstat 2 4
kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 0  0 16998 14612   0   0   0   0    0   0 101   10   8 55  0 44  0
 0  1 16998 14611   0   0   0   0    0   0 411 2199  54  0  0 99  0
 0  1 16784 14850   0   0   0   0    0   0 412  120  51  0  0 99  0
 0  1 16784 14850   0   0   0   0    0   0 412   88  50  0  0 99  0
The r column shows threads in the R state, while the b column shows threads in the S state, as shown in Figure 4 on page 22. The four right-hand columns break down CPU time, in percentages, into time spent on user threads (us), time spent on system threads (sy), CPU idle time (id, running the wait process), and CPU idle time when the system had outstanding disk or NFS I/O requests (wa). For further discussion on the vmstat command, see Section 3.2, “The vmstat command” on page 60.
If the system has poor performance because of a lot of threads on the run queue or many threads waiting for I/O, then ps output is useful to determine which process has used the most CPU resources.
The ps command
The ps command is a flexible tool for identifying the programs that are running on the system and the resources they are using. It displays statistics and status information about processes on the system, such as process or thread ID, I/O activity, CPU, and memory utilization. In Section 3.3, “The ps command” on page 68, the ps command output relevant to CPU tuning is discussed.
When looking for a run-away process, the next step in the analysis is to find out which part of the process uses the CPU. For this, a profiler is needed. The AIX profiler of preference is tprof.
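Before a profiler can be pointed at a run-away process, that process has to be picked out of the ps output. The following is a minimal sketch of that step. It assumes ps aux style columns with the PID in the second field, %CPU in the third, and the command starting in the eleventh; top_cpu and sample_ps are hypothetical helper names, and the sample rows are invented for illustration, not taken from a real system.

```shell
#!/bin/sh
# top_cpu (hypothetical helper): read "ps aux" style text on stdin and print
# "%CPU PID COMMAND" for the N largest CPU consumers (default 3).
top_cpu() {
    n=${1:-3}
    # NR > 1 skips the header line; the remaining rows are sorted
    # numerically on %CPU, largest first.
    awk 'NR > 1 { print $3, $2, $11 }' | sort -rn | head -n "$n"
}

# Hypothetical sample rows for illustration only; not real measurements.
sample_ps() {
    cat <<'EOF'
USER       PID %CPU %MEM   SZ  RSS    TTY STAT    STIME  TIME COMMAND
root     12345 48.3  1.0  120  140  pts/0 A    14:40:01  2:31 ./tctestprg4
user1     6789  0.2  0.0  300  320  pts/1 A    13:05:12  0:00 ksh
root      2064  3.1  0.0  208  216      - A    08:00:00  0:12 syncd
EOF
}

sample_ps | top_cpu 2
# On a live system the same filter could be fed directly:
#   ps aux | top_cpu 5
```

The field positions break if a column contains embedded blanks (for example, a start time printed as a month and a day), so check the layout of the local ps output before relying on a filter like this.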
The tprof command
The tprof command can be run over a time period to trace the activity of the CPU. CPU utilization is divided into kernel, user, shared, and other to show how many clock timer ticks were spent in each respective address space. If the user column shows high values, application tuning may be necessary. More information about the tprof command can be found in Section 3.4, “The tprof command” on page 73.
When you find a process that cannot be optimized, another way to tune it is to lower its priority in the run queue. This can be accomplished by grouping processes together to be handled by the AIX Version 4.3 Workload Manager or by using the nice and renice commands.
The nice and renice commands
The nice command can run a process at a priority lower than the process's normal priority. You must have root user authority to run a process at a higher priority. The priority of a process is often called its nice value, but the two differ: the priority is recalculated at every clock timer tick, whereas the nice value is stable and is changed only with the nice or renice commands. The nice value can range from 0 to 39, with 39 being the lowest priority. For example, if a process normally runs with a default nice value of 20, resetting the nice value with an increment of 5 runs the process at a lower priority, 25, and the process may run slower. More information about the priorities and nice values can be found in Section 6.1.1, “Priority calculation on AIX versions prior to 4.3.2” on page 179, Section 6.1.2, “Priority calculation on AIX Version 4.3.2 and later” on page 182, and Section 6.3.2, “The nice and renice commands” on page 189.
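The nice value arithmetic described above can be sketched as follows. The helper name new_nice is hypothetical, and the sketch assumes that a request outside the 0 to 39 range is clamped to the nearest limit rather than rejected. The program name and PID in the trailing comments are placeholders.

```shell
#!/bin/sh
# new_nice <current> <increment> (hypothetical helper): apply an increment to
# a nice value and keep the result inside the 0..39 range.
new_nice() {
    v=$(( $1 + $2 ))
    if [ "$v" -lt 0 ]; then v=0; fi      # assumed: clamp at the highest priority
    if [ "$v" -gt 39 ]; then v=39; fi    # assumed: clamp at the lowest priority
    echo "$v"
}

new_nice 20 5     # default nice value 20, increment 5: prints 25
new_nice 20 25    # increment past the limit: prints 39
new_nice 20 -5    # higher priority, needs root authority: prints 15

# Corresponding commands (placeholders for the program and the PID):
#   nice -n 5 ./tctestprg4      start a program with its nice value raised by 5
#   renice -n 5 -p 12345        raise the nice value of a running process by 5
```

A larger nice value means a lower priority, so the first call corresponds to the example in the text: the process moves from 20 to 25 and may run slower.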
Finally, in the list of common performance tools, there is the schedtune command. This command is mentioned last for a reason: do not manipulate the scheduler without thorough knowledge of the scheduler mechanism.
The schedtune command
The priority of most user processes varies with the amount of CPU time the process has used recently. The CPU scheduler's priority calculations are based on two variables, SCHED_R (the weighting factor) and SCHED_D (the decay factor). More information about the scheduler and the schedtune command can be found in Section 6.1, “The AIX scheduler” on page 177 and in Section 6.3.1, “The schedtune command” on page 186.
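To give a feel for how the two variables interact, the following sketch works through a priority calculation. Everything in it beyond the names SCHED_R and SCHED_D is an assumption that this section does not state: it assumes the scheme commonly described for AIX versions prior to 4.3.2, in which the priority value is a base of 40 plus the nice value plus a CPU penalty of recent CPU usage multiplied by SCHED_R/32, and in which recent CPU usage is multiplied by SCHED_D/32 once per second. Section 6.1.1 is the place to confirm the actual formulas. The helper names priority and decay are hypothetical.

```shell
#!/bin/sh
BASE=40   # assumed base priority value of a user process

# priority <recent_cpu> <nice> <sched_r>
# Assumed formula: base + nice value + recent_cpu * SCHED_R / 32.
# A larger result means a less favored thread.
priority() {
    echo $(( BASE + $2 + $1 * $3 / 32 ))
}

# decay <recent_cpu> <sched_d>
# Assumed formula: recent CPU usage after the once-per-second decay.
decay() {
    echo $(( $1 * $2 / 32 ))
}

priority 120 20 16        # 40 + 20 + 120*16/32: prints 120
c=$(decay 120 16)         # 120*16/32 = 60
priority "$c" 20 16       # 40 + 20 + 60*16/32: prints 90
priority 120 20 0         # SCHED_R of 0 removes the CPU penalty: prints 60
```

Under these assumptions, a smaller SCHED_R makes recent CPU usage matter less to the priority, and a smaller SCHED_D makes past usage be forgotten faster. This is the kind of side effect that should be understood before the scheduler is changed with schedtune.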