Brendan's Blog The USE Method - Solaris Performance Checklist

12/14/13 Brendan's blog The USE Method: Solaris Performance Checklist
Brendan Gregg's professional blog
The USE Method: Solaris Performance Checklist
The USE Method provides a strategy for performing a complete check of system health, identifying common
bottlenecks and errors. For each system resource, metrics for utilization, saturation and errors are identified
and checked. Any issues discovered are then investigated using further strategies.
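The strategy can be sketched as a loop over resources and the three metric types. This is only an illustration in Python, not code from the checklist itself: the names Finding, run_use_checklist and the example check are hypothetical, and recording a missing metric as a finding is my own design choice.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

METRIC_TYPES = ("utilization", "saturation", "errors")

# A check returns a short description of a problem, or None if all is well.
Check = Callable[[], Optional[str]]


@dataclass
class Finding:
    resource: str
    metric_type: str
    detail: str


def run_use_checklist(checklist: Dict[str, Dict[str, Check]]) -> List[Finding]:
    """Visit every resource and every metric type, collecting issues."""
    findings: List[Finding] = []
    for resource, checks in checklist.items():
        for metric_type in METRIC_TYPES:
            check = checks.get(metric_type)
            if check is None:
                # No metric known for this cell: note it rather than skip it.
                findings.append(Finding(resource, metric_type, "no metric available"))
                continue
            detail = check()
            if detail is not None:
                findings.append(Finding(resource, metric_type, detail))
    return findings


# Hypothetical checklist with hard-coded results, for illustration only.
example = {
    "CPU": {
        "utilization": lambda: "CPU 3 at 98% usr + sys",
        "saturation": lambda: None,
        "errors": lambda: None,
    },
}
findings = run_use_checklist(example)
```

Each finding is then a starting point for the follow-up strategies, not a diagnosis in itself.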
In this post, I'll provide an example of a USE-based metric list for the Solaris family of operating systems.
I'm writing this for later Solaris 10, Oracle Solaris 11, and illumos-based systems: SmartOS, OmniOS. This
is primarily intended for system administrators of the physical systems (not tenants of cloud or zone instances;
for those users, see my SmartOS performance checklist).
Last Updated: 29-Sep-2013
Physical Resources
component            type         metric
CPU                  utilization  per-CPU: mpstat 1, usr + sys; system-wide: vmstat 1, us + sy; per-process: prstat -c 1 (CPU == recent), prstat -mLc 1 (USR + SYS); per-kernel-thread: lockstat -Ii rate, DTrace profile stack()
CPU                  saturation   system-wide: uptime, load averages; vmstat 1, r; DTrace dispqlen.d (DTT) for a better vmstat r; per-process: prstat -mLc 1, LAT
CPU                  errors       fmadm faulty; cpustat (CPC) for whatever error counters are supported (e.g., thermal throttling)
Memory capacity      utilization  system-wide: vmstat 1, free (main memory), swap (virtual memory); per-process: prstat -c, RSS (main memory), SIZE (virtual memory)
Memory capacity      saturation   system-wide: vmstat 1, sr (bad now), w (was very bad); vmstat -p 1, api (anon page ins == pain), apo; per-process: prstat -mLc 1, DFL; DTrace anonpgpid.d (DTT), vminfo:::anonpgin on execname
Memory capacity      errors       fmadm faulty and prtdiag for physical failures; fmstat -s -m cpumem-retire (ECC events); DTrace failed malloc()s
Network Interfaces   utilization  nicstat (latest version); kstat; dladm show-link -s -i 1 interface
Network Interfaces   saturation   nicstat; kstat for whatever custom statistics are available (e.g., nocanputs, defer, norcvbuf, noxmtbuf); netstat -s, retransmits
Network Interfaces   errors       netstat -i, error counters; dladm show-phys; kstat for extended errors, look in the interface and link statistics (there are often custom counters for the card); DTrace for driver internals
Storage device I/O   utilization  system-wide: iostat -xnz 1, %b; per-process: DTrace iotop
Storage device I/O   saturation   iostat -xnz 1, wait; DTrace iopending (DTT), sdqueue.d (DTB)
Storage device I/O   errors       iostat -En; DTrace I/O subsystem, e.g., ideerr.d (DTB), satareasons.d (DTB), scsireasons.d (DTB), sdretry.d (DTB)
Storage capacity     utilization  swap: swap -s; file systems: df -h; plus other commands depending on FS type
Storage capacity     saturation   not sure this one makes sense; once it's full, ENOSPC
Storage capacity     errors       DTrace; /var/adm/messages file system full messages
Storage controller   utilization  iostat -Cxnz 1, compare to known IOPS/tput limits per-card
Storage controller   saturation   look for kernel queueing: sd (iostat wait again), ZFS zio pipeline
Storage controller   errors       DTrace the driver, e.g., mptevents.d (DTB); /var/adm/messages
Network controller   utilization  infer from nicstat and known controller max tput
Network controller   saturation   see network interface saturation
Network controller   errors       kstat for whatever is there / DTrace
CPU interconnect     utilization  cpustat (CPC) for CPU interconnect ports, tput / max (e.g., see the amd64htcpu script)
CPU interconnect     saturation   cpustat (CPC) for stall cycles
CPU interconnect     errors       cpustat (CPC) for whatever is available
Memory interconnect  utilization  cpustat (CPC) for memory busses, tput / max; or CPI greater than, say, 5; CPC may also have local vs remote counters
Memory interconnect  saturation
Memory interconnect  errors       cpustat (CPC) for whatever is available
I/O interconnect     utilization  busstat (SPARC only); cpustat for tput / max if available; inference via known tput from iostat/nicstat/...
I/O interconnect     saturation
I/O interconnect     errors       cpustat (CPC) for whatever is available
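To make the per-CPU utilization entry (mpstat 1, usr + sys) concrete, here is a minimal Python sketch that sums usr + sys for each CPU in one interval of mpstat output and reports the busiest one. The function name busiest_cpu and the sample numbers are made up for illustration. The column names are assumed to match the usual Solaris mpstat header, and the parser finds columns by name rather than by position.

```python
from typing import Tuple


def busiest_cpu(mpstat_text: str) -> Tuple[int, int]:
    """Return (cpu_id, usr + sys) for the busiest CPU in one mpstat interval."""
    lines = [ln.split() for ln in mpstat_text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    cpu_i = header.index("CPU")
    usr_i = header.index("usr")
    sys_i = header.index("sys")
    busy = {int(r[cpu_i]): int(r[usr_i]) + int(r[sys_i]) for r in rows}
    cpu = max(busy, key=busy.get)
    return cpu, busy[cpu]


# Hypothetical single interval; the values are invented, not measured.
SAMPLE = """
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   300  100  200    5    2    1    0   500    4   2   0  94
  1    0   0    0   120   20  150    3    1    0    0   400   91   7   0   2
"""
hot_cpu, hot_pct = busiest_cpu(SAMPLE)
```

When feeding it real output, skip the first interval that mpstat prints, since that one is a summary since boot.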
CPU utilization: a single hot CPU can be caused by a single hot thread, or by a mapped hardware interrupt.
Relief of the bottleneck usually involves tuning to use more CPUs in parallel.
lockstat and plockstat are DTrace-based since Solaris 10 FCS.
vmstat r: this is coarse as it is only updated once per second.
CPC == CPU Performance Counters (aka Performance Instrumentation Counters (PICs), or
Performance Monitoring Events), read via programmable registers on each CPU, by cpustat(1M) or
the DTrace cpc provider. These have traditionally been hard to work with due to differences
between CPUs, but are getting much easier with the PAPI standard. Still, expect to spend some
quality time (days) with the processor vendor manuals (what cpustat -h tells you to read), and to
post-process cpustat with awk or perl. See my short talk (video) about CPC (2010). (Many years
ago, I made a toolkit including CPC scripts, CacheKit, that was too much work to maintain.)
Memory capacity utilization: interpreting vmstat's free has been tricky across different Solaris
versions (we documented it in the Perf & Tools book), due to different ways it was calculated, and
tunables that affect when the system will kick off the page scanner. It'll also typically shrink as the
kernel uses unused memory for caching (ZFS ARC).
Be aware that kstat can report bad data (so can any tool); there isn't really a test suite for kstat data,
and engineers can add new code paths and forget to add the counters.
DTT == DTraceToolkit scripts, DTB == DTrace book scripts.
CPI == Cycles Per Instruction (others use IPC == Instructions Per Cycle).
I/O interconnect: this includes the CPU to I/O controller busses, the I/O controller(s), and device
busses (eg, PCIe).
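Two of the calculations used for the interconnect and controller rows are simple ratios: CPI (cycles divided by instructions, with "greater than, say, 5" as the rough flag) and utilization as tput / max. A minimal Python sketch follows. The function names and all input numbers are hypothetical, and collecting the cycle and instruction counts from cpustat is left out, since the event names differ between CPUs.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles Per Instruction for one sampling interval."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def utilization_pct(observed_tput: float, max_tput: float) -> float:
    """Utilization as observed throughput over the known maximum, in percent."""
    if max_tput <= 0:
        raise ValueError("max_tput must be positive")
    return 100.0 * observed_tput / max_tput


# Invented counter values for one interval.
sample_cpi = cpi(cycles=6_000_000, instructions=1_000_000)
memory_bound_suspect = sample_cpi > 5  # threshold taken from the table above

# Invented throughput figures, in the same units (e.g., MB/s).
controller_pct = utilization_pct(observed_tput=450, max_tput=600)
```

The inverse of cpi() gives IPC for those who prefer that form.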
Software Resources
component            type         metric
Kernel mutex         utilization  lockstat -H (held time); DTrace lockstat provider
Kernel mutex         saturation   lockstat -C (contention); DTrace lockstat provider; spinning shows up with dtrace -n 'profile-997 { @[stack()] = count(); }'
Kernel mutex         errors       lockstat -E, e.g., recursive mutex enter (other errors can cause kernel lockup/panic, debug with mdb -k)
User mutex           utilization  plockstat -H (held time); DTrace plockstat provider
User mutex           saturation   plockstat -C (contention); prstat -mLc 1, "LCK"; DTrace plockstat provider
User mutex           errors       DTrace plockstat and pid providers, for EAGAIN, EINVAL, EPERM, EDEADLK, ENOMEM, EOWNERDEAD, ... see pthread_mutex_lock(3C)
Process capacity     utilization  sar -v, proc-sz; kstat, unix:0:var:v_proc for max, unix:0:system_misc:nproc for current; DTrace (`nproc vs `max_nprocs)
Process capacity     saturation   not sure this makes sense; you might get queueing on pidlinklock in pid_allocate(), as it scans for available slots once the table gets full
Process capacity     errors       "can't fork()" messages
Thread capacity      utilization  user-level: kstat, unix:0:lwp_cache:buf_inuse for current, prctl -n zone.max-lwps -i zone ZONE for max; kernel: mdb -k or DTrace, nthread for current, limited by memory
Thread capacity      saturation   threads blocking on memory allocation; at this point the page scanner should be running (vmstat sr), else examine using DTrace/mdb.
Thread capacity      errors       user-level: pthread_create() failures with EAGAIN, EINVAL, ...; kernel: thread_create() blocks for memory but won't fail.
File descriptors     utilization  system-wide (no limit other than RAM); per-process: pfiles vs ulimit or prctl -t basic -n process.max-file-descriptor PID; a quicker check than pfiles is ls /proc/PID/fd | wc -l
File descriptors     saturation   does this make sense? I don't think there is any queueing or blocking, other than on memory allocation.
File descriptors     errors       truss or DTrace (better) to look for errno == EMFILE on syscalls returning fds (e.g., open(), accept(), ...).
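The process capacity utilization entry compares unix:0:system_misc:nproc (current) with unix:0:var:v_proc (max). As a rough Python sketch, assuming the parseable "statistic, whitespace, value" lines that kstat -p prints, the ratio could be computed as below. The helper names and the sample values are hypothetical.

```python
from typing import Dict


def parse_kstat_p(text: str) -> Dict[str, int]:
    """Parse 'module:instance:name:statistic<TAB>value' lines into a dict."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def process_table_pct(stats: Dict[str, int]) -> float:
    """Percent of the process table in use: nproc / v_proc."""
    current = stats["unix:0:system_misc:nproc"]
    maximum = stats["unix:0:var:v_proc"]
    return 100.0 * current / maximum


# Invented values standing in for real kstat -p output.
SAMPLE_KSTAT = "unix:0:system_misc:nproc\t150\nunix:0:var:v_proc\t30000\n"
proc_pct = process_table_pct(parse_kstat_p(SAMPLE_KSTAT))
```

Keep in mind that kstat values are only as good as the counters behind them, so a surprising result is worth cross-checking with sar -v.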
lockstat/plockstat often drop events due to load; I often roll my own to avoid this using the DTrace
lockstat/plockstat provider (examples in the DTrace book).
File descriptor utilization: while other OSes have a system-wide limit, Solaris doesn't (at least at the
moment, this could change; see my writeup about it).
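The quick per-process check from the table (counting entries under /proc/PID/fd and comparing against the descriptor limit) can be sketched in Python as follows. This is an illustration under assumptions: the function names are hypothetical, the limit used is the soft RLIMIT_NOFILE of the calling process, and the demo counts files in a temporary directory standing in for a real /proc/PID/fd.

```python
import os
import resource
import tempfile


def fd_utilization(fd_dir: str, soft_limit: int) -> float:
    """Percent of the descriptor limit in use, counting entries in fd_dir."""
    return 100.0 * len(os.listdir(fd_dir)) / soft_limit


def self_fd_utilization() -> float:
    """Same check for the current process, using its soft RLIMIT_NOFILE."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return 0.0  # no per-process limit to measure against
    return fd_utilization(f"/proc/{os.getpid()}/fd", soft)


# Demo against a stand-in directory holding three entries (hypothetical data).
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("0", "1", "2"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_pct = fd_utilization(demo_dir, soft_limit=256)
```

The count is approximate for a live process, because listing the directory can itself briefly hold a descriptor.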
What's Next
See the USE Method for the follow-up strategies after identifying a possible bottleneck. If you complete this
checklist but still have a performance issue, move on to other strategies: drill-down analysis and latency
analysis.
Also see my USE method performance checklists for SmartOS, Linux, Mac OS X, and FreeBSD.
Posted on March 1, 2012 at 7:30 am by Brendan Gregg. Permalink: dtrace.org/blogs/brendan/2012/03/01/the-use-method-solaris-performance-checklist/

In: Performance Tagged with: illumos, omnios, performance, smartos, solaris, usemethod
2 Responses
1. Written by Kebabbert
on March 8, 2012 at 1:48 am
Wow! Great list!!!! Thank you for sharing this info. :o)
2. Written by Harsha Nippani
on March 22, 2012 at 8:49 am
Brendan,
I have been using DTrace to successfully identify bottlenecks on Sun servers (particularly global
zones with 20-30 containers). I am sure the USE method takes it further in quickly isolating
performance problems when dealing with resource contention. I like the template that the USE method
provides, and we can always expand on it.
I have been using fishbone methodology in my troubleshooting efforts in enterprise environments. I
am now going to leverage both (Fishbone and USE) and I am sure, sysadmins will be thrilled to see the
quick results that this tool will deliver particularly when users are screaming of slowness in the
system.
Appreciate your breakdown of key metrics.
-Harsha Nippani
Copyright 2013 Brendan Gregg, all rights reserved