Among the many FAQs (Frequently Asked Questions) that get kicked around various internal
and external mail aliases is this: why does my system have idle CPU cycles (%idle is greater than zero)
with runnable threads on the run queues? This seems counter-intuitive. If I have runnable threads, they
should be running and burning CPU cycles. Theoretically, if a system's run queue is non-zero, I should
not see idle CPU time.
It's a reasonable question, and the answer lies in understanding how the metrics in question are
derived. The short answer is that we can see non-zero run queues alongside idle CPUs because the two
metrics are derived using different methods, and CPU utilization (idle time) is more accurate than the
run queue metrics.
CPU microstate accounting works much the same way. When a CPU goes from idle to running
in user mode, a timestamp captures the exit from the previous state (idle) and entry into the new state
(running in user mode). If the CPU then enters the kernel, this is another microstate change, and
another timestamp captures the exit from user and entry into kernel mode. The math is done to maintain
total time in a given microstate, which can be examined using kstat(1).
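As a rough sketch of that bookkeeping (a toy model in Python, purely illustrative; the kernel's actual structures and code differ), each state transition charges the interval since the last timestamp to the state being exited:

```python
class MicrostateAccounting:
    """Toy model (illustrative only) of per-CPU microstate accounting:
    every state transition charges the interval since the last
    timestamp to the state being exited."""

    def __init__(self, now_nsec):
        self.state = "idle"
        self.since = now_nsec                     # entry time of current state
        self.nsec = {"idle": 0, "user": 0, "kernel": 0}

    def transition(self, new_state, now_nsec):
        # Close out the old state, open the new one.
        self.nsec[self.state] += now_nsec - self.since
        self.state = new_state
        self.since = now_nsec

# A CPU idle for 5 ms, running user code for 3 ms, then 0.5 ms in the kernel:
acct = MicrostateAccounting(now_nsec=0)
acct.transition("user", now_nsec=5_000_000)
acct.transition("kernel", now_nsec=8_000_000)
acct.transition("idle", now_nsec=8_500_000)
print(acct.nsec)   # {'idle': 5000000, 'user': 3000000, 'kernel': 500000}
```

The totals are exact to the resolution of the timestamps, which is why utilization derived this way is so accurate.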
The sample above shows that CPU states are tracked both at nanosecond granularity (via high-resolution
timestamps) and in ticks, which align with the clock interrupt period of 10 milliseconds.
The raw kstats for one or more CPUs on a system can be monitored with various incantations of kstat(1):
# kstat -p -s cpu_nsec_user 1
cpu:0:sys:cpu_nsec_user 92743762816
cpu:1:sys:cpu_nsec_user 57543394495
cpu:2:sys:cpu_nsec_user 71279077930
cpu:3:sys:cpu_nsec_user 53435407234
cpu:4:sys:cpu_nsec_user 70497699811
cpu:5:sys:cpu_nsec_user 52251562152
cpu:6:sys:cpu_nsec_user 60368583736
cpu:7:sys:cpu_nsec_user 49661873803
cpu:40:sys:cpu_nsec_user 134050079157
cpu:41:sys:cpu_nsec_user 60151580054
cpu:42:sys:cpu_nsec_user 106208115248
cpu:43:sys:cpu_nsec_user 61801106156
cpu:44:sys:cpu_nsec_user 80068148411
cpu:45:sys:cpu_nsec_user 54260903619
cpu:46:sys:cpu_nsec_user 75393852625
cpu:47:sys:cpu_nsec_user 52891692971
cpu:0:sys:cpu_nsec_user 92744256016
cpu:1:sys:cpu_nsec_user 57543394495
cpu:2:sys:cpu_nsec_user 71279080830
cpu:3:sys:cpu_nsec_user 53435407234
cpu:4:sys:cpu_nsec_user 70936335088
cpu:5:sys:cpu_nsec_user 52251562152
cpu:6:sys:cpu_nsec_user 60368682336
cpu:7:sys:cpu_nsec_user 49661873803
cpu:40:sys:cpu_nsec_user 134050079157
cpu:41:sys:cpu_nsec_user 60151580054
cpu:42:sys:cpu_nsec_user 106208357448
cpu:43:sys:cpu_nsec_user 61801106156
cpu:44:sys:cpu_nsec_user 80068215111
cpu:45:sys:cpu_nsec_user 54260903619
cpu:46:sys:cpu_nsec_user 75393852625
cpu:47:sys:cpu_nsec_user 52891692971
The example above dumps the number of nanoseconds each CPU on the system has spent in user
mode, at 1-second intervals. Named kstats are interesting and can be useful, but when it comes to CPU
utilization, the standard tools, vmstat(1M) and mpstat(1M), do the job just fine, so there's no need to
craft complex scripts that attempt to do the math to derive utilization.
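That said, the math those tools perform is simple. Using the cpu0 values from the two snapshots above (a sketch, assuming an exact 1-second interval between snapshots):

```python
# Two consecutive cpu:0:sys:cpu_nsec_user values from the kstat output
# above; utilization over the interval is simply delta / interval.
snap1 = 92_743_762_816
snap2 = 92_744_256_016
interval_nsec = 1_000_000_000          # assumed 1-second sampling interval

user_pct = 100.0 * (snap2 - snap1) / interval_nsec
print(f"cpu0 %usr over the interval: {user_pct:.3f}%")   # ~0.049%
```

cpu0 spent roughly 493 microseconds of that second in user mode: essentially idle, which is exactly what mpstat(1M) would report.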
The summary statement here is that CPU utilization, as reported by the standard tools in Solaris, is
generally very accurate. Things get somewhat more complicated with current processor technology
(multicore sockets with multiple threads per core, etc.), but for the purposes of this discussion, we can
rely on accurate reporting of CPU utilization.
Run queue depth, as reported in the “r” column of vmstat(1M), is derived using a sampling
mechanism driven by the clock interrupt, which by default fires every 10 milliseconds.
This is a much, much coarser method, and it is the key contributing factor in the seeming anomaly of
idle CPUs with runnable threads.
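A toy simulation (the numbers here are invented for illustration, not measured from any real system) shows how badly a 100Hz sampler can misrepresent a queue whose occupants come and go in microseconds:

```python
import random

random.seed(1)

# Illustrative workload: threads are enqueued roughly 2000 times per
# second, and each one stays runnable for only 100 microseconds.
episodes = []
t = 0.0
while t < 1.0:
    t += random.expovariate(2000)        # time of the next enqueue
    episodes.append((t, t + 0.0001))     # runnable for 100 us

def depth(at):
    """Run-queue depth at a single instant in time."""
    return sum(1 for start, end in episodes if start <= at < end)

# vmstat-style sampling: inspect the queue once per 10 ms clock tick.
samples = [depth(i * 0.010) for i in range(100)]
print(f"enqueues in 1 second: {len(episodes)}")
print(f"clock ticks that saw an empty queue: {samples.count(0)} of 100")
```

Thousands of threads transit the run queue every second, yet most 10-millisecond samples land when the queue is empty; conversely, a sample that happens to land on a burst reports a deep queue for work that drains in microseconds.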
Example
Note in the sample above we see a fair amount of CPU idle time, although it does drop towards
the end of the sample as the run queue depth goes into double digits. Nonetheless, with this number of
runnable threads on the run queues, one would think the CPUs would have zero idle time.
To illustrate why this can happen, a DTrace script was written that tracks the insertion
of runnable threads onto CPU run queues. Recall how idle time is derived: when a kernel thread is
context-switched off a CPU, that CPU checks its run queue for the next best-priority runnable thread to
run. If the CPU's run queue is empty, it enters the idle loop, where a CPU microstate timestamp is
taken, marking the beginning of a state change to idle. Inside the idle loop, the CPU periodically
checks its run queue. If its run queue remains empty, it enters another code path in the kernel to
check for runnable threads on the run queues of other CPUs, adhering to the latency group hierarchy
implemented in the kernel, which helps minimize the effects of NUMA (Non-Uniform Memory Access)
systems. The key point here is that in order for a CPU to accumulate idle time, its run queue must be
empty when a runnable thread is inserted onto the queue. If a CPU's run queue were never empty, it
would never enter the idle loop, and thus would never accumulate idle time.
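The decision logic just described can be sketched in simplified Python (the class, field, and function names here are invented for illustration; the real Solaris dispatcher code is considerably more involved):

```python
class CPU:
    """Minimal stand-in for a CPU with a run queue; names invented
    for illustration, not the kernel's actual structures."""
    def __init__(self, cpu_id, lgroup):
        self.cpu_id, self.lgroup = cpu_id, lgroup
        self.run_queue = []              # runnable threads, best priority first
        self.idle = False

    def lgroup_distance(self, other):
        # 0 = same latency group, 1 = remote; real hierarchies are deeper.
        return 0 if self.lgroup == other.lgroup else 1

def pick_next_thread(cpu, all_cpus):
    if cpu.run_queue:
        return cpu.run_queue.pop(0)      # best local runnable thread
    cpu.idle = True                      # idle microstate timestamp taken here
    # Look for work on other CPUs, nearest latency group first, to
    # limit cross-node migrations on NUMA systems.
    for other in sorted(all_cpus, key=cpu.lgroup_distance):
        if other is not cpu and other.run_queue:
            return other.run_queue.pop(0)
    return None                          # stay in the idle loop

# An idle CPU in lgroup 0 steals the only available work, on a remote CPU:
cpus = [CPU(0, lgroup=0), CPU(1, lgroup=0), CPU(2, lgroup=1)]
cpus[2].run_queue.append("thread-A")
print(pick_next_thread(cpus[0], cpus))   # thread-A
```

Note that the idle flag is set the moment the local queue is found empty; that is the window during which idle time accrues, however briefly.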
The DTrace script below instruments the kernel run queue insertion functions (setfrontdq and
setbackdq) using the DTrace fbt provider1. The first clause in the script does not have a predicate, and
tracks the total number of calls into these functions system wide (@total_ticks) and per-CPU
(@per_cpu). The second clause for the same set of probes has a predicate that tests to see if the number
of runnable threads on the CPU's run queue is zero. If it is, the predicate evaluates true and the action is
taken, which counts the number of times this happened on a per-CPU basis. The tick probe provides
per-second statistics.
1 Note there's a third kernel run queue insertion function, setkpdq(), which inserts threads onto a queue dedicated to
threads in the realtime scheduling class.
#!/usr/sbin/dtrace -s

fbt::setfrontdq:entry,
fbt::setbackdq:entry
{
        @total_ticks = count();
        @per_cpu[args[0]->t_cpu->cpu_id] = count();
}

fbt::setfrontdq:entry,
fbt::setbackdq:entry
/ args[0]->t_cpu->cpu_disp->disp_nrunnable == 0 /
{
        @zero[args[0]->t_cpu->cpu_id] = count();
}

tick-1sec
{
        printa("TOTAL TICKS: %@d\n", @total_ticks);
        printf("%-8s %-16s %-16s\n", "CPU ID", "ENQUEUES", "WAS ZERO");
        printa("%-8d %-@16d %-@16d\n", @per_cpu, @zero);
        trunc(@total_ticks);
        trunc(@per_cpu);
        trunc(@zero);
}
What we end up with is a count on a per-CPU basis of the number of times per second the
insertion routines ran, along with the number of times they ran when a CPU's dispatch queue was
empty. Let's have a look:
# ./dqz.d
TOTAL TICKS: 298898
CPU ID ENQUEUES WAS ZERO
9 3036 2996
1 3828 3641
5 4189 4115
4 4653 4600
0 6355 6253
13 8586 8397
12 8924 8731
8 15053 15003
11 17708 16822
10 18192 14768
15 19384 14534
14 23843 14371
2 30250 29929
3 31273 29069
6 51326 51076
7 52298 50557
TOTAL TICKS: 275961
CPU ID ENQUEUES WAS ZERO
9 3040 2969
5 3268 3159
1 3347 3192
8 5240 5153
0 6702 6565
4 7694 7616
12 8243 8059
13 8281 8052
10 10785 9125
11 10980 10309
15 23066 18298
14 29473 18361
2 36705 36268
3 38489 35409
6 39647 39015
7 41001 38650
The resulting data shows us that most of the time, when a thread was inserted into a CPU's run
queue, that run queue was indeed empty, which caused the CPU to enter the idle loop. Note the TOTAL
TICKS values – almost 300k, meaning that during this data collection period the thread run queue
insertion routines were being called almost 300,000 times per second, or about once every 3 microseconds. As
Solaris Kernel Engineer Bart Smaalders reminds us, the Nyquist sampling theorem states that a
continuous signal must be sampled at 2X its peak frequency to get an accurate measurement2. The run
queue values, as reported by vmstat(1M), are derived based on the system clock frequency of 100Hz
(100 times per second, or every 10 milliseconds). The sample from our test system shows that, on a busy
system, the run queue insertion rate can approach 300kHz – potentially more on a large system with faster
CPUs. Thus, the sampling rate for the run queue calculation is much too low.
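The arithmetic behind that statement, using the TOTAL TICKS value from the first one-second interval above:

```python
clock_hz = 100                 # run queue is sampled once per clock tick
insertions_per_sec = 298_898   # TOTAL TICKS from the first interval above

# Mean time between run-queue insertions, in microseconds.
print(f"one insertion every ~{1e6 / insertions_per_sec:.1f} us")

# Nyquist: capturing a signal changing at ~300 kHz would require sampling
# at twice that rate; the 100 Hz clock samples thousands of times too slowly.
required_hz = 2 * insertions_per_sec
print(f"required sampling rate: {required_hz} Hz, actual: {clock_hz} Hz")
print(f"undersampling factor: ~{required_hz // clock_hz}x")
```

In other words, the clock-driven sampler would need to run nearly 6000 times faster to faithfully capture run queue activity at this insertion rate.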
The other key component is that the threads themselves must be spending a very short
amount of time on-CPU. This particular system was executing an I/O-intensive workload, where many
of the threads were getting on a CPU, issuing a blocking system call (read(2), write(2)), and very quickly
getting switched off. prstat -Lm can be used to determine where threads are spending most of their
time, and to verify that most of the threads are not accumulating substantial on-CPU time (USR or SYS).
The -L flag provides per-thread state information, where each row represents a thread in a process.
2 This is why the sampling rate for audio CDs is 44.1kHz, or roughly 2X the peak audible frequency of music (20kHz)
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
573 root 29 33 0.3 0.0 0.0 0.0 29 8.6 12K 980 31K 0 java/75
573 root 31 31 0.5 0.0 0.0 0.0 27 11 19K 1K 45K 0 java/90
573 root 30 30 2.2 0.0 0.0 0.0 21 16 12K 9K 33K 0 java/96
573 root 27 34 0.4 0.0 0.0 0.0 29 10 16K 1K 39K 0 java/91
573 root 28 29 1.7 0.0 0.0 0.0 23 19 8K 4K 23K 0 java/81
573 root 26 29 2.1 0.0 0.0 0.0 20 23 9K 5K 26K 0 java/85
573 root 28 28 1.6 0.0 0.0 0.0 31 12 9K 6K 24K 0 java/76
573 root 27 28 1.3 0.0 0.0 0.0 28 16 10K 3K 30K 0 java/89
573 root 0.4 0.3 0.1 0.0 0.0 0.0 98 1.0 104 126 204 0 java/110
573 root 0.2 0.3 0.0 0.0 0.0 0.0 99 0.5 102 99 205 0 java/111
573 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.1 17 28 16 0 java/27
573 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 1 0 1 0 java/19
573 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 java/14
573 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 java/13
The prstat(1M) data shows a Java Virtual Machine with many threads (note the PID in each row
is the same), several of which are compute bound (USR and SYS), or waiting for a CPU (LAT), while
others are sleeping (SLP) or waiting on a user lock (LCK – 100% in LCK typically indicates a mutex
lock associated with a condition variable, not necessarily serious user lock contention).
Also, the data can be skewed by the unpredictable and potentially bursty nature of runnable
threads in a given workload. Relatively large numbers of threads may become runnable around the
same time, causing a large number of insertions into run queues within a very short time window,
which can skew the other metric related to this subject – load average.
The load average is intended to provide a glimpse of how busy a system has been over the last 1-, 5- and
15-minute periods. Historically, load average was derived from the number of runnable and running
threads. CR 4982219 was filed as a result of observing a system running Java workloads generating
large numbers of GC (garbage collection) threads in bursts, resulting in large load average values on a
relatively idle system. With CR 6911646, the method for calculating load average was changed in
Solaris 10. The calculation is now made by summing high-resolution user time, system time, and thread
wait time, then processing this total to generate averages with an exponential decay. The way you
interpret load average is the same – use it as an indicator of how busy the system is over the past 1, 5
and 15 minutes.
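The flavor of that calculation can be illustrated with a toy exponentially decayed average (the constants and update form below are illustrative only, not the actual kernel code):

```python
import math

def decay_factor(interval_sec, window_sec):
    """Weight retained by the old average each update, per the classic
    exponentially decayed moving average."""
    return math.exp(-interval_sec / window_sec)

def update(load, instantaneous, interval_sec=1.0, window_sec=60.0):
    # Blend the old average with the new instantaneous demand.
    f = decay_factor(interval_sec, window_sec)
    return load * f + instantaneous * (1.0 - f)

# A 10-second burst of 32 runnable threads on an otherwise idle system:
load = 0.0
for sec in range(60):
    demand = 32.0 if sec < 10 else 0.0
    load = update(load, demand)
print(f"1-minute load, 50 seconds after the burst ends: {load:.2f}")
```

The burst never drives the 1-minute figure anywhere near 32, and it fades smoothly afterward, which is precisely why short GC-style bursts no longer produce alarming load averages on an otherwise idle system.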
That said, keep in mind that the load average is calculated in the kernel clock interrupt handler
every second, so, similar to the run queue depth values, the data generated is calculated at a relatively
coarse granularity. This results in the possibility of observing both run queue values and load average
values that appear inconsistent with CPU utilization.
Summary
Because of the different methods used to track CPU utilization, the number of runnable threads,
and load average, it's possible for a system to show more idle time than expected while runnable
threads are present on the run queues, along with load average values that also seem inconsistent. If
you're looking at a system showing similar metrics, you can use the included DTrace script to
demonstrate that a large number of threads are being inserted onto the run queues of CPUs with a zero
runnable thread count, along with prstat -Lm to determine just how CPU-intensive the workload is by
tracking the USR and SYS components of the threads' microstates.
References