Tuned
NUMA Tuning
Power Management
Performance Engineering
Overview
Micro-Benchmarks
Applications/Benchmarks
Profiling challenges
Data address profiling (cache-to-cache detection), providing:
the hottest contended cachelines,
the process names, addresses, PIDs, and TIDs causing that contention,
the CPUs they ran on,
and how the cacheline is being accessed (read or write)
Performance Optimization
Out-of-the-box
Automatic Tuning
Manual Tuning
Tuned
N/A
Transparent Hugepages
Static Hugepages
numad
irqbalance
RHEL7
numa_balancing
Need Determinism
Overview of Performance
Analysis Utilities
perf
perf list
perf top
perf record
perf report: /dev/zero
perf report: /dev/zero, oflag=direct
perf diff
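The perf record, perf report and perf diff slides can be tied together in one workflow. The sketch below is an assumption about how the two reports were produced: the same dd workload run once buffered and once with oflag=direct, then compared with perf diff. The helper names (perf_step, compare_dd_profiles), the PERF_RUN switch and the perf.*.data file names are illustrative and not from the slides. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: profile one dd workload with and without oflag=direct, then diff.
# perf_step, compare_dd_profiles, PERF_RUN and the perf.*.data names are
# illustrative assumptions, not part of the original slides.

# Print each command; execute it only when PERF_RUN=1 is set.
perf_step() {
    echo "+ $*"
    if [ "${PERF_RUN:-0}" = "1" ]; then
        "$@"
    fi
}

compare_dd_profiles() {
    local out=${1:-/root/2GB}
    # Buffered write: goes through the page cache.
    perf_step perf record -o perf.buffered.data -- \
        dd if=/dev/zero of="$out" count=2048 bs=1M
    # Direct write: bypasses the page cache.
    perf_step perf record -o perf.direct.data -- \
        dd if=/dev/zero of="$out" count=2048 bs=1M oflag=direct
    # Hot spots of each run, then the per-symbol delta between the two runs.
    perf_step perf report -i perf.buffered.data --stdio
    perf_step perf report -i perf.direct.data --stdio
    perf_step perf diff perf.buffered.data perf.direct.data
}

compare_dd_profiles "${1:-/root/2GB}"
```

Setting PERF_RUN=1 runs the commands for real; this needs perf installed and write access to the output path.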
Overview of Performance
Analysis Utilities
Performance Co-Pilot (PCP)
Overview: CPU%/Load/IOPS/Net/Memory
# CPU
/root/pig -s 5
# DISK
dd if=/dev/zero of=/root/2GB count=2048 bs=1M oflag=direct
# NETWORK
netperf -H lab7 -l 5
# MEMORY
/root/pig -m 16384 -l sleep -s 5
CPU %
Load Avg
IOPS
Network
Memory
Allocated
collectl mode
CPU
IOPS
NET
MEM
atop mode
Questions so far?
NUMA Tuning
Discovery
NUMA Node 1
PCI Devices
Architecture:          x86_64
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
Logical Cores/HT
RHEL6
RHEL7 numabalance
Enable / Disable
sysctl kernel.numa_balancing={0,1}
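Before toggling automatic NUMA balancing it helps to check its current state. A minimal sketch, assuming the setting is exposed at /proc/sys/kernel/numa_balancing (the file behind the kernel.numa_balancing sysctl); numa_balancing_state is a hypothetical helper name, and its optional argument exists only so the function can be pointed at another file.

```shell
#!/usr/bin/env bash
# Sketch: report whether automatic NUMA balancing is switched on.
# numa_balancing_state is a hypothetical helper name.

numa_balancing_state() {
    local f=${1:-/proc/sys/kernel/numa_balancing}
    if [ ! -r "$f" ]; then
        echo "unsupported"      # kernel without automatic NUMA balancing
        return 0
    fi
    case "$(cat "$f")" in
        0) echo "disabled" ;;
        *) echo "enabled" ;;
    esac
}

numa_balancing_state
# To change it at runtime (root required):
#   sysctl kernel.numa_balancing=0   # disable
#   sysctl kernel.numa_balancing=1   # enable
```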
Checklist
Tool
Research Topology
lstopo/lscpu
cgroups, numactl
Consider I/O
irqbalance/PCI Bus
Virtualization
numatune/numad
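For the "Research Topology" step, the NUMA layout can be pulled out of lscpu output with a small filter. This is a sketch only: numa_cpu_map is a hypothetical helper name, and the sample input reuses the 16-CPU, two-node layout shown earlier in the deck.

```shell
#!/usr/bin/env bash
# Sketch: list which CPUs belong to each NUMA node from lscpu-style text.
# numa_cpu_map is a hypothetical helper name.

numa_cpu_map() {
    # Matches lines such as "NUMA node0 CPU(s):     0-7".
    awk -F: '/^NUMA node[0-9]+ CPU\(s\)/ {
        node = $1
        sub(/^NUMA /, "", node)
        sub(/ CPU\(s\)$/, "", node)
        cpus = $2
        gsub(/[ \t]/, "", cpus)
        print node, cpus
    }'
}

# On a live system: lscpu | numa_cpu_map
numa_cpu_map <<'EOF'
CPU(s):                16
NUMA node(s):          2
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
EOF
```

The resulting CPU lists are what a binding tool such as numactl would be given to keep a workload on one node.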
                    node0        node1
numa_hit         77587739    131990042
numa_miss               0            0
numa_foreign            0            0
interleave_hit      30254        30099
local_node       69302710    129511360
other_node        8285029      2478682
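One way to read these counters is to ask what share of each node's successful allocations (numa_hit) were made while the requesting process was running on a different node (other_node). The sketch below computes that ratio from numastat-style text; other_node_pct is a hypothetical helper name, and it assumes the columns are node0, node1, ... in order.

```shell
#!/usr/bin/env bash
# Sketch: per node, other_node as a percentage of numa_hit.
# other_node_pct is a hypothetical helper; columns are assumed to be
# node0, node1, ... in order.

other_node_pct() {
    awk '
        $1 == "numa_hit"   { for (i = 2; i <= NF; i++) hit[i] = $i; n = NF }
        $1 == "other_node" { for (i = 2; i <= NF; i++) other[i] = $i }
        END {
            for (i = 2; i <= n; i++) {
                pct = (hit[i] > 0) ? 100 * other[i] / hit[i] : 0
                printf "node%d %.2f\n", i - 2, pct
            }
        }'
}

# On a live system: numastat | other_node_pct
other_node_pct <<'EOF'
                    node0        node1
numa_hit         77587739    131990042
numa_miss               0            0
numa_foreign            0            0
interleave_hit      30254        30099
local_node       69302710    129511360
other_node        8285029      2478682
EOF
```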
Node 0
-----65491
60366
5124
2650
2021
1686
964
964
341
340
380
208
173
134
NUMA Tuning
numad
Maintains responsiveness
BUT!
Using remote memory degrades performance!
Userspace solution
numad
numabalance
Before numad
PID           Node 0   Node 1
2578 (pig)      2123    11878
2579 (pig)      1988    12013
2580 (pig)     14000        1
2581 (pig)      1981    12020
After numad
PID           Node 0   Node 1
2578 (pig)     14000        0
2579 (pig)         0    14000
2580 (pig)     14000        0
2581 (pig)         0    14000
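The effect of numad in these per-process numbers can be summarised as memory locality: the share of each process's memory that sits on its dominant node. A minimal sketch, using the figures from the slide; locality_pct is a hypothetical helper name, and the input format (PID followed by one value per node) is an assumption made for the example.

```shell
#!/usr/bin/env bash
# Sketch: for each process, the node holding most of its memory and that
# node's share of the total. locality_pct is a hypothetical helper.
# Input rows: PID node0_value node1_value ...

locality_pct() {
    awk '{
        total = 0; best = 0; node = 0
        for (i = 2; i <= NF; i++) {
            total += $i
            if ($i > best) { best = $i; node = i - 2 }
        }
        if (total > 0)
            printf "%s node%d %.1f\n", $1, node, 100 * best / total
    }'
}

echo "before numad:"
locality_pct <<'EOF'
2578 2123 11878
2579 1988 12013
2580 14000 1
2581 1981 12020
EOF

echo "after numad:"
locality_pct <<'EOF'
2578 14000 0
2579 0 14000
2580 14000 0
2581 0 14000
EOF
```

Before numad, three of the four processes have a noticeable slice of their memory on the other node; afterwards each process is entirely on one node.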
Effect of numad/numabalance
[Chart: Automatic NUMA Balancing - NUMAD. NODE-0-MB and NODE-1-MB over time (y-axis 0 to 14000), with markers "numad begins" and "numad done"]
Questions on NUMA?
tuned
What is tuned?
Larger is better
[Chart: Latency (Microseconds), y-axis 0 to 250. Max latency per C-state: C6, C3, C1, C0]
default
enterprise-storage
sched_min_granularity_ns
4ms
10ms
10ms
10ms
10ms
sched_wakeup_granularity_ns
4ms
15ms
15ms
15ms
15ms
dirty_ratio
20% RAM
40%
10%
40%
40%
dirty_background_ratio
10% RAM
5%
swappiness
60
10
30
CFQ
deadline
deadline
deadline
Filesystem Barriers
On
Off
Off
Off
CPU Governor
ondemand
performance
Disk Read-ahead
virtual-host
virtual-guest
latency-performance
deadline
deadline
performance
performance
4x
Disable THP
Yes
CPU C-States
Locked @ 1
throughput-performance
Installed by default!
Desktop/Workstation: balanced
Server/HPC: throughput-performance
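The mapping from system role to default profile can be written down as a small lookup. This is a sketch under assumptions: profile_for_role and the ROLE variable are hypothetical, and the script only reports the active profile rather than changing it.

```shell
#!/usr/bin/env bash
# Sketch: map a system role to the tuned profile named on the slide.
# profile_for_role and ROLE are hypothetical names.

profile_for_role() {
    case "$1" in
        desktop|workstation) echo "balanced" ;;
        server|hpc)          echo "throughput-performance" ;;
        *) echo "unknown role: $1" >&2; return 1 ;;
    esac
}

role=${ROLE:-server}
if profile=$(profile_for_role "$role"); then
    echo "profile for $role: $profile"
    if command -v tuned-adm >/dev/null 2>&1; then
        tuned-adm active || true         # show the currently active profile
        # tuned-adm profile "$profile"   # uncomment to apply (root required)
    fi
fi
```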
Parents
throughput-performance
balanced
latency-performance
Children
network-throughput
desktop
network-latency
Your-DB
Your-Middleware
virtual-host
virtual-guest
Your-Web
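A custom child such as Your-DB is a directory holding a tuned.conf that includes its parent and overrides only what differs. The sketch below writes such a file into a scratch directory; write_child_profile, the profile name your-db and the single sysctl override are illustrative assumptions, not recommended values. On a real system the directory would live under /etc/tuned.

```shell
#!/usr/bin/env bash
# Sketch: create a child tuned profile that inherits from a parent profile.
# write_child_profile, "your-db" and the override value are hypothetical.

write_child_profile() {
    local root=$1 name=$2 parent=$3
    mkdir -p "$root/$name"
    cat > "$root/$name/tuned.conf" <<EOF
[main]
include=$parent

[sysctl]
# Illustrative override only; choose values from your own measurements.
vm.dirty_background_ratio=5
EOF
    echo "$root/$name/tuned.conf"
}

# Demo in a scratch directory so nothing on the system is modified.
demo_root=$(mktemp -d)
conf=$(write_child_profile "$demo_root" your-db throughput-performance)
cat "$conf"
rm -rf "$demo_root"
# After installing the directory under /etc/tuned:
#   tuned-adm profile your-db
```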
Tunable                      Units         balanced      throughput-performance  network-throughput
Inherits From/Notes                                                              throughput-performance
sched_min_granularity_ns     nanoseconds   auto-scaling  10000000
sched_wakeup_granularity_ns  nanoseconds   3000000       15000000
dirty_ratio                  percent       20            40
dirty_background_ratio       percent       10            10
swappiness                   weight 1-100  60            10
deadline
Boolean
CPU Governor
Enabled
ondemand
Disk Read-ahead
KB
128
Disable THP
Boolean
Enabled
normal
performance
4096
performance
kernel.sched_migration_cost_ns
nanoseconds
500000
Percent
auto-scaling
tcp_rmem
Bytes
auto-scaling
Max=16777216
tcp_wmem
Bytes
auto-scaling
Max=16777216
udp_mem
Max=16777216
100
Units
Balanced
latency-performance
network-latency
Inherits From/Notes
latency-performance
sched_min_granularity_ns
nanoseconds
auto-scaling
sched_wakeup_granularity_ns
nanoseconds
3000000
dirty_ratio
percent
20
10
dirty_background_ratio
percent
10
swappiness
Weight 1-100
60
10
10000000
10000000
deadline
Boolean
CPU Governor
Enabled
ondemand
performance
N/A
No
CPU C-States
N/A
Locked @ 1
normal
performance
Disable THP
Boolean
kernel.sched_migration_cost_ns
nanoseconds
percent
net.core.busy_read
microseconds
50
net.core.busy_poll
microseconds
50
net.ipv4.tcp_fastopen
Boolean
kernel.numa_balancing
N/A
Yes
Boolean
5000000
100
Enabled
Disabled
Units
throughput-performance
Inherits From/Notes
sched_min_granularity_ns
nanoseconds
10000000
sched_wakeup_granularity_ns
nanoseconds
15000000
dirty_ratio
percent
40
dirty_background_ratio
percent
10
swappiness
Weight 1-100
10
virtual-host
virtual-guest
throughput-performance
throughput-performance
30
5
Boolean
CPU Governor
Disk Read-ahead
performance
Bytes
4096
performance
kernel.sched_migration_cost_ns
nanoseconds
5000000
min_perf_pct
(intel_pstate only)
30
Power Management
Power Saving
Performant
C-state Impact on Jitter
[Chart: Latency (Microseconds), y-axis 0 to 250. Max latency per C-state: C6, C3, C1, C0]
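The idle states a CPU offers, and the exit latency the kernel advertises for each, can be read from sysfs. A minimal sketch, assuming the cpuidle directory for cpu0 is present; list_cstates is a hypothetical helper name, and its optional argument exists only so the function can be pointed at another directory. It prints nothing when the directory is missing (for example in some virtual machines).

```shell
#!/usr/bin/env bash
# Sketch: print each idle state's name and advertised exit latency
# (microseconds) for one CPU. list_cstates is a hypothetical helper.

list_cstates() {
    local dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    local s
    for s in "$dir"/state*; do
        [ -r "$s/name" ] || continue
        printf '%s %s\n' "$(cat "$s/name")" "$(cat "$s/latency")"
    done
}

list_cstates
```

Deeper states generally advertise longer exit latencies, which matches the jitter pattern in the chart.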
Kernel Build: +12.5%
Disk Read: +32.2%
Disk Write: +25.6%
Unpack tar.gz: +23.3%
Active Idle: +41%
Columns: pk cor CPU %c0 GHz TSC %c1 %c3 %c6 %c7
Default
%c1    %c3    %c6    %c7
5.72   1.32   0.00   92.72
3.13   0.15   0.00   94.18
1.47   0.00   0.00   96.25
1.21   0.47   0.12   96.44
latency-performance
0   0.00   0.00   0.00   0.00
1   0.00   0.00   0.00   0.00
2   0.00   0.00   0.00   0.00
3   0.00   0.00   0.00   0.00
Take-aways
Helpful Utilities
Supportability: redhat-support-tool, sos, kdump, perf, psmisc, strace, sysstat, systemtap, trace-cmd, util-linux-ng
NUMA: hwloc, Intel PCM, numactl, numad, numatop (01.org)
Power/Tuning: cpupowerutils (R6), kernel-tools (R7), powertop, tuna, tuned
Networking: dropwatch, ethtool, netsniff-ng (EPEL6), tcpdump, wireshark/tshark
Storage: blktrace, iotop, iostat
Helpful Links
Perf
Questions