
CPU Idle, Run Queues and Load Average

Among the many FAQs (Frequently Asked Questions) that get kicked around various internal
and external mail aliases is this: why does my system have idle CPU cycles (%idle is greater than zero)
with runnable threads on the run queues? This seems counter-intuitive. If I have runnable threads, they
should be running and burning CPU cycles. Theoretically, if a system's run queue is non-zero, I should
not see idle CPU time.

It's a reasonable question, and the answer lies in understanding how the metrics in question are
derived. The short answer is that we can see non-zero run queues with idle CPUs because these
metrics are derived using different methods, and CPU utilization (idle time) is more accurate than the
run queue metrics.

CPU utilization in Solaris 10 (and of course OpenSolaris) is measured using per-CPU
microstates (which is very different from the clock-based polling mechanism used in previous
releases). The CPU microstates tracked are time spent in user mode, time spent in the kernel (system),
and idle time. You may be familiar with the concept of microstates as applied to kernel threads in
Solaris 10. Thread state changes are tracked and timestamped, enabling fine-grained measurement of
where the kernel threads are spending their time. You can observe this using “prstat -Lm” (my favorite
first-thing-to-run when looking at a system). With the “-m” flag, the columns presented by prstat(1M)
are the various microstates (USR, SYS, SLP, etc.), and the numeric values reported in each row are the
percentages of time the thread corresponding to that row spent in each microstate. The microstate
columns in each row should always sum to 100, accounting for 100% of that thread's time over the
sampling period (within rounding, of course).
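
For example, per-thread microstates can be watched with a 5 second sampling interval (a typical
invocation; the interval is arbitrary):

# prstat -Lm 5

Each row of output then represents one thread (LWP), and the USR through LAT microstate columns in
that row sum to roughly 100.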

CPU microstate accounting works much the same way. When a CPU goes from idle to running
in user mode, a timestamp captures the exit from the previous state (idle) and the entry into the new
state (running in user mode). If the CPU then enters the kernel, this is another microstate change, and
another timestamp captures the exit from user mode and the entry into kernel mode. The deltas are
accumulated to maintain the total time spent in each microstate, which can be examined using kstat(1M).
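
For example, the sys kstats for a single CPU can be dumped with an invocation along these lines
(shown here for instance 1, to match the output below):

# kstat -m cpu -i 1 -n sys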

module: cpu                             instance: 1
name:   sys                             class:    misc
        bawrite                         44
        bread                           771
        bwrite                          46753
        canch                           2011
        cpu_load_intr                   0
        cpu_nsec_idle                   19600843919092196
        cpu_nsec_intr                   8453061052574
        cpu_nsec_kernel                 87997351396578
        cpu_nsec_user                   10876052845953
        cpu_ticks_idle                  1960084391
        cpu_ticks_kernel                8799735
        cpu_ticks_user                  1087605
        cpu_ticks_wait                  0
        . . .

The above sample shows that the CPU states are tracked both at nanosecond granularity (via high
resolution timestamps) and in ticks, where a tick corresponds to the 10 millisecond clock interrupt
interval. Dividing cpu_nsec_idle by the 10,000,000 nanoseconds in a tick yields cpu_ticks_idle
(19600843919092196 / 10,000,000 ≈ 1960084391 in the output above). The raw kstats for one or more
CPUs on a system can be monitored with various incantations of kstat(1M):
# kstat -p -s cpu_nsec_user 1
cpu:0:sys:cpu_nsec_user 92743762816
cpu:1:sys:cpu_nsec_user 57543394495
cpu:2:sys:cpu_nsec_user 71279077930
cpu:3:sys:cpu_nsec_user 53435407234
cpu:4:sys:cpu_nsec_user 70497699811
cpu:5:sys:cpu_nsec_user 52251562152
cpu:6:sys:cpu_nsec_user 60368583736
cpu:7:sys:cpu_nsec_user 49661873803
cpu:40:sys:cpu_nsec_user 134050079157
cpu:41:sys:cpu_nsec_user 60151580054
cpu:42:sys:cpu_nsec_user 106208115248
cpu:43:sys:cpu_nsec_user 61801106156
cpu:44:sys:cpu_nsec_user 80068148411
cpu:45:sys:cpu_nsec_user 54260903619
cpu:46:sys:cpu_nsec_user 75393852625
cpu:47:sys:cpu_nsec_user 52891692971

cpu:0:sys:cpu_nsec_user 92744256016
cpu:1:sys:cpu_nsec_user 57543394495
cpu:2:sys:cpu_nsec_user 71279080830
cpu:3:sys:cpu_nsec_user 53435407234
cpu:4:sys:cpu_nsec_user 70936335088
cpu:5:sys:cpu_nsec_user 52251562152
cpu:6:sys:cpu_nsec_user 60368682336
cpu:7:sys:cpu_nsec_user 49661873803
cpu:40:sys:cpu_nsec_user 134050079157
cpu:41:sys:cpu_nsec_user 60151580054
cpu:42:sys:cpu_nsec_user 106208357448
cpu:43:sys:cpu_nsec_user 61801106156
cpu:44:sys:cpu_nsec_user 80068215111
cpu:45:sys:cpu_nsec_user 54260903619
cpu:46:sys:cpu_nsec_user 75393852625
cpu:47:sys:cpu_nsec_user 52891692971

The example above dumps the number of nanoseconds spent in user mode by each CPU on the
system at 1 second intervals. Named kstats are interesting and can be useful, but when it comes to CPU
utilization, the standard tools, vmstat(1M) and mpstat(1M), do the job just fine, so there's no need to
begin crafting complex scripts that attempt to do the math to derive utilization.

The summary statement here is that CPU utilization, as reported by the standard tools in Solaris, is
generally very accurate. Things get somewhat more complicated with current processor technology
(multicore sockets with multiple threads per core, etc.), but for the purposes of this discussion, we can
rely on accurate reporting of CPU utilization.

Run queue depth, as reported in the “r” column of vmstat(1M), is derived using a sampling
mechanism driven by the clock interrupts, which by default fire every 10 milliseconds (100Hz).
This is a much, much coarser method, and it is the key contributing factor in the seeming anomaly of
idle CPU time with runnable threads.
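
To get a sense of how coarse clock-rate sampling is relative to scheduling activity, run queue depth
can itself be sampled 100 times per second with a quick DTrace sketch (an illustration, not the actual
mechanism vmstat uses; it reads the same cpu_disp->disp_nrunnable field used by the script later in
this article):

# dtrace -qn 'profile-100 /* fires 100 times per second on every CPU */
{ @depth[cpu] = lquantize(curthread->t_cpu->cpu_disp->disp_nrunnable, 0, 32, 1); }
tick-10sec { exit(0); }'

A sampler like this can easily report empty run queues even though threads are being enqueued and
dispatched thousands of times per second between samples.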

Example

Here's a sample of vmstat(1M) that shows the FAQ-generator:

# vmstat 1
. . .
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s3 s3 s3 in sy cs us sy id
7 0 0 9980248 6774016 130 1758 0 0 0 0 0 -0 13 1 16 112674 44694 206269 9 24 67
10 1 0 7767536 4331964 530 4454 0 0 0 0 0 0 46 0 0 128757 86760 269168 5 32 63
7 1 0 7767544 4350432 170 3167 0 0 0 0 0 0 77 0 0 171521 124763 391469 7 34 60
11 1 0 8208252 4357208 237 3696 0 0 0 0 0 0 4 0 0 192055 141030 467303 7 35 58
13 1 0 8207760 4356752 237 4002 0 0 0 0 0 0 0 0 0 171798 123960 395777 6 32 62
6 1 0 6622908 4350140 275 5056 0 0 0 0 0 0 2 0 0 91344 56916 154653 3 28 69
9 1 0 7042548 4323808 237 4012 0 0 0 0 0 0 0 0 0 164210 130582 403526 6 35 58
5 1 0 6010252 4356328 315 5163 0 0 0 0 0 0 1 0 0 101912 74108 202730 4 30 66
7 1 0 5928760 4332960 201 3562 0 0 0 0 0 0 2 0 0 168142 136473 423903 8 42 51
18 1 0 8208056 4357008 217 3521 0 0 0 0 0 0 3 0 0 181251 146643 467283 8 42 50
19 1 0 7767544 4354528 212 3460 0 0 0 0 0 0 1 0 0 177194 157213 476082 8 41 50
16 1 0 8208252 4357208 43 713 0 0 0 0 0 0 105 0 0 268182 240806 821644 14 58 27
16 1 0 8208252 4357208 0 0 0 0 0 0 0 0 15 0 0 286035 266280 909418 15 60 25
24 1 0 8208252 4357208 0 0 0 0 0 0 0 0 3 0 28 98448 78900 264947 5 86 9
18 1 0 8208252 4357208 0 0 0 0 0 0 0 0 141 0 72 92266 57938 195136 10 76 14
15 1 0 8208252 4357208 0 0 0 0 0 0 0 0 0 0 76 144323 90247 360482 13 66 21
11 1 0 8208252 4357208 0 0 0 0 0 0 0 0 0 0 98 200226 114333 493278 10 54 36
9 1 0 7065076 4319712 38 1005 0 0 0 0 0 0 1 0 35 222837 164410 568452 9 47 44

Note in the sample above we see a fair amount of CPU idle time, although it does drop towards
the end of the sample as the run queue depth goes into double digits. Nonetheless, with this number of
runnable threads on the run queues, one would think the CPUs would have zero idle time.

In order to illustrate why this can happen, a DTrace script was written that tracks the insertion
of runnable threads onto CPU run queues. Remember how idle time is derived: when a kernel thread is
context switched off a CPU, that CPU checks its run queue for the next highest-priority runnable thread
to run. If the CPU's run queue is empty, it enters the idle loop, where the CPU microstate timestamp is
taken, marking the beginning of a state change to idle. Inside the idle loop, the CPU periodically
checks its run queue. If the run queue remains empty, it enters another code path in the kernel to
check for runnable threads on the run queues of other CPUs, adhering to the latency group hierarchy
implemented in the kernel to help minimize the performance effects of NUMA (Non-Uniform Memory Access)
systems. The key point here is that for a CPU to have been accumulating idle time, its run queue must
have been empty at the moment a runnable thread was inserted onto it. If a CPU's run queue were never
empty, it would never enter the idle loop, and thus would never accumulate idle time.
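
Entries into the kernel code path that scans other CPUs' queues can be counted as a rough indicator of
CPUs going idle and looking for work elsewhere. A sketch, assuming the scan is performed by a function
named disp_getwork(), as it is in the OpenSolaris sources, and that it is visible to the fbt provider:

# dtrace -n 'fbt::disp_getwork:entry /* assumed name of the queue-scanning function */
{ @[cpu] = count(); }'

Let it run for a while, then interrupt it with Ctrl-C to see the per-CPU counts; CPUs with high counts
were repeatedly finding their own run queues empty.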

The DTrace script below instruments the kernel run queue insertion functions (setfrontdq() and
setbackdq()) using the DTrace fbt provider [1]. The first clause in the script does not have a
predicate, and tracks the total number of calls into these functions system-wide (@total_ticks) and
per-CPU (@per_cpu). The second clause for the same set of probes has a predicate that tests whether the
number of runnable threads on the target CPU's run queue is zero. If it is, the predicate evaluates
true and the action is taken, counting the number of times this happened on a per-CPU basis. The tick
probe provides per-second statistics.

[1] Note there's a third kernel run queue insertion function, setkpdq(), which inserts threads onto a
queue dedicated to threads in the realtime scheduling class.

#!/usr/sbin/dtrace -s

#pragma D option quiet

fbt::setfrontdq:entry,
fbt::setbackdq:entry
{
        @total_ticks = count();
        @per_cpu[args[0]->t_cpu->cpu_id] = count();
}

fbt::setfrontdq:entry,
fbt::setbackdq:entry
/ args[0]->t_cpu->cpu_disp->disp_nrunnable == 0 /
{
        @zero[args[0]->t_cpu->cpu_id] = count();
}

tick-1sec
{
        printa("TOTAL TICKS: %@d\n", @total_ticks);
        printf("%-8s %-16s %-16s\n", "CPU ID", "ENQUEUES", "WAS ZERO");
        printa("%-8d %-@16d %-@16d\n", @per_cpu, @zero);
        trunc(@total_ticks);
        trunc(@per_cpu);
        trunc(@zero);
}

What we end up with is a count on a per-CPU basis of the number of times per second the
insertion routines ran, along with the number of times they ran when a CPU's dispatch queue was
empty. Let's have a look:
# ./dqz.d
TOTAL TICKS: 298898
CPU ID ENQUEUES WAS ZERO
9 3036 2996
1 3828 3641
5 4189 4115
4 4653 4600
0 6355 6253
13 8586 8397
12 8924 8731
8 15053 15003
11 17708 16822
10 18192 14768
15 19384 14534
14 23843 14371
2 30250 29929
3 31273 29069
6 51326 51076
7 52298 50557
TOTAL TICKS: 275961
CPU ID ENQUEUES WAS ZERO
9 3040 2969
5 3268 3159
1 3347 3192
8 5240 5153
0 6702 6565
4 7694 7616
12 8243 8059
13 8281 8052
10 10785 9125
11 10980 10309
15 23066 18298
14 29473 18361
2 36705 36268
3 38489 35409
6 39647 39015
7 41001 38650

The resulting data shows us that most of the time, when a thread was inserted onto a CPU's run
queue, that CPU's run queue was indeed empty, which caused it to enter the idle loop. Note the TOTAL
TICKS values: almost 300k, meaning that during this data collection period the thread run queue
insertion routines were being called almost 300,000 times per second, or roughly once every 3
microseconds. As Solaris Kernel Engineer Bart Smaalders reminds us, the Nyquist sampling theorem states
that a continuous signal must be sampled at at least 2X its highest frequency to get an accurate
measurement [2]. The run queue values, as reported by vmstat(1M), are derived based on the system clock
frequency of 100Hz (100 times per second, or every 10 milliseconds). The sample from our test system
shows that, on a busy system, the run queue insertion rate can be 300kHz, and potentially more on a
large system with faster CPUs. Thus, the sampling rate for the run queue calculation is much too low.
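
The insertion rate can also be cross-checked without relying on private kernel function names by using
the stable sched provider (a sketch; it assumes the enqueue probe's args[2] is the cpuinfo_t of the CPU
whose queue the thread is being placed on, as documented for that provider):

# dtrace -qn 'sched:::enqueue /* args[2]: cpuinfo_t of the CPU receiving the thread */
{ @[args[2]->cpu_id] = count(); }
tick-1sec { printa("CPU %d: %@d enqueues\n", @); trunc(@); }'

On a system like the one above, this should report per-CPU enqueue rates in the same ballpark as the
fbt-based script.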

The other key component to this is that the threads themselves are spending a very short amount
of time on-CPU. This particular system was executing an I/O-intensive workload, where many of the
threads were getting on a CPU, issuing a blocking system call (read(2), write(2)), and very quickly
getting switched off. prstat -Lm can be used to determine where threads are spending most of their
time, and to verify that most of the threads are not accumulating substantial on-CPU time (USR or SYS).
The -L flag provides per-thread state information, where each row represents a thread in a process.
[2] This is why the sampling rate for audio CDs is 44.1kHz, or roughly 2X the peak audible frequency of
music (20kHz).

Here's an example from the test case:
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
573 root 29 33 0.3 0.0 0.0 0.0 29 8.6 12K 980 31K 0 java/75
573 root 31 31 0.5 0.0 0.0 0.0 27 11 19K 1K 45K 0 java/90
573 root 30 30 2.2 0.0 0.0 0.0 21 16 12K 9K 33K 0 java/96
573 root 27 34 0.4 0.0 0.0 0.0 29 10 16K 1K 39K 0 java/91
573 root 28 29 1.7 0.0 0.0 0.0 23 19 8K 4K 23K 0 java/81
573 root 26 29 2.1 0.0 0.0 0.0 20 23 9K 5K 26K 0 java/85
573 root 28 28 1.6 0.0 0.0 0.0 31 12 9K 6K 24K 0 java/76
573 root 27 28 1.3 0.0 0.0 0.0 28 16 10K 3K 30K 0 java/89
573 root 0.4 0.3 0.1 0.0 0.0 0.0 98 1.0 104 126 204 0 java/110
573 root 0.2 0.3 0.0 0.0 0.0 0.0 99 0.5 102 99 205 0 java/111
573 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.1 17 28 16 0 java/27
573 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 1 0 1 0 java/19
573 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 java/14
573 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 java/13

The prstat(1M) data shows a Java Virtual Machine with many threads (note the PID in each row
is the same), several of which are compute bound (USR and SYS), or waiting for a CPU (LAT), while
others are sleeping (SLP) or waiting on a user lock (LCK – 100% in LCK typically indicates a mutex
lock associated with a condition variable, not necessarily serious user lock contention).

Also, the data can be skewed by the unpredictable and potentially bursty nature of runnable
threads in a given workload. Relatively large numbers of threads may become runnable around the
same time, causing a large number of insertions into run queues within a very short time window,
which can skew the other metric related to this subject: load average.

The load average is intended to provide a glimpse of how busy a system is over the last 1, 5 and
15 minute periods. Historically, load average was derived based on the number of runnable and running
threads. CR 4982219 was filed as a result of observing a system running Java workloads that generated
large numbers of GC (garbage collection) threads in bursts, resulting in large load average values on a
relatively idle system. With CR 6911646, the method for calculating load average was changed in
Solaris 10. The calculation is now made by summing high-resolution user time, system time, and thread
wait time, then processing this total to generate averages with an exponential decay. The way you
interpret load average is the same: use it as an indicator of how busy the system is over the past 1, 5
and 15 minutes.
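
The load averages themselves are exported as kstats and can be read directly; for example (this assumes
the unix:0:system_misc kstat, whose avenrun_* values are fixed-point numbers that uptime(1) scales down
by 256 before display):

# kstat -p 'unix:0:system_misc:avenrun*'

Dividing the reported avenrun_1min, avenrun_5min and avenrun_15min values by 256 yields the familiar
1, 5 and 15 minute load averages.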

That said, keep in mind that the load average is calculated in the kernel clock interrupt handler
every second, so, like the run queue depth values, it is generated at a relatively coarse granularity.
This means it is possible to observe both run queue values and load average values that appear
inconsistent with CPU utilization.

Summary

Because of the different methods used to track CPU utilization, the number of runnable threads,
and load average, it's possible for a system to show more idle time than expected while runnable
threads are present on the run queues, along with load average values that also seem inconsistent. If
you're looking at a system showing similar metrics, you can use the included DTrace script to
demonstrate that a large number of threads are being inserted onto the run queues of CPUs that have a
zero runnable thread count, and use prstat -Lm to determine just how CPU-intensive the workload is by
tracking the USR and SYS components of the threads' microstates.

References

CR 6911646 Solaris 10 changed the way average load is calculated/displayed

CR 4982219 load average computation can be fooled by java gc behavior
