
Maximizing File Transfer Performance

Using 10Gb Ethernet and Virtualization


A FedEx Case Study
THE CHALLENGE
Spiraling data center network complexity, cable maintenance and troubleshooting costs,
and increasing bandwidth requirements led FedEx Corporation to investigate techniques
for simplifying the network infrastructure and boosting file transfer throughput.
In response to this challenge, FedEx, in collaboration with Intel, conducted a case
study to determine the most effective approach to achieving near-native 10-gigabit
file transfer rates in a virtualized environment based on VMware vSphere* ESX* 4
running on servers powered by the Intel® Xeon® processor 5500 series. The servers
were equipped with Intel 10 Gigabit AF DA Dual Port Server Adapters supporting
direct-attach copper twinaxial cable connections. For this implementation, the Intel®
Virtual Machine Device Queues (VMDq) feature was enabled in VMware NetQueue*.
File transfer applications are widely used in production environments, including
replicated data sets, databases, backups, and similar operations. As part of this case
study, several of these applications are used in the test sequences. This case study:
• Investigates the platform hardware and software limits of file transfer performance
• Identifies the bottlenecks that restrict transfer rates
• Evaluates trade-offs for each of the proposed solutions
• Makes recommendations for increasing file transfer performance in 10 Gigabit Ethernet
(10G) native Linux* and a 10G VMware virtualized environment
The latest 10G solutions let users cost-effectively consolidate the many Ethernet and
Fibre Channel adapters deployed in a typical VMware ESX implementation. VMware ESX,
running on Intel Xeon processor 5500 series-based servers, provides a reliable,
high-performance solution for handling this workload.
THE PROCESS
In the process of building a new data center, FedEx Corporation, the largest express
shipping company in the world, evaluated the potential benefits of 10G, considering
these key questions:
• Can 10G make the network less complex and streamline infrastructure deployment?
• Can 10G help solve cable management issues?
• Can 10G meet our increasing bandwidth requirements as we target higher virtual
machine (VM) consolidation ratios?
CASE STUDY
Intel Xeon processor 5500 series
Virtualization
FedEx Corporation is a
worldwide information and
business solution company, with a
superior portfolio of transportation
and delivery services.
Cost Factors
How does 10G affect costs? Both 10GBASE-T and Direct Attach Twinax cabling
(sometimes referred to as 10GSFP+Cu or SFP+ Direct Attach) cost less than USD 400
per adapter port. In comparison, a 10G Short Reach (10GBASE-SR) fiber connection
costs approximately USD 700 per adapter port.¹
In existing VMware production
environments, FedEx had used eight
1GbE connections implemented with two
quad-port cards (plus one or two 100/1000
ports for management) in addition to two
4Gb Fibre Channel links. Based on market pricing, moving the eight 1GbE connections
on the quad-port cards to two 10GBASE-T or 10G Twinax connections is cost effective
today. Using two 10G connections can actually consume less power than the eight
1GbE connections they replace, providing additional savings over the long term.
Having less cabling typically reduces instances of wiring errors and lowers
maintenance costs, as well. FedEx used IEEE 802.1Q trunking to separate traffic
flows on the 10G links.
Engineering Trade-offs
Engineering trade-offs are also a consideration.

Direct Attach/Twinax 10G cabling has a maximum reach of seven meters for passive
cables, which affects the physical wiring scheme to be implemented. The servers have
to be located within a seven-meter radius of the 10G switches to take advantage of
this new technology. However, active cables are available that can extend this range
if necessary. A key feature of the Twinax cable technology is that it uses exactly
the same form-factor connectors that the industry-standard SFP+ optical modules use.
Using this technology allows you to select the most cost-effective and power-efficient
passive Twinax for short reaches and then move up to active Twinax, SR fiber, or even
Long Reach (LR) fiber for longer runs. Twinax adapters consume approximately 3W per
port when using passive cabling.
10GBASE-T's maximum reach of 100 meters makes it a flexible, data-center-wide
deployment option for 10GbE. 10GBASE-T is also backwards-compatible with today's
widely deployed Gigabit Ethernet infrastructures. This feature makes it an excellent
technology for migrating from GbE to 10GbE, as IT can use existing infrastructures
and deployment knowledge. Current-generation 10GBASE-T adapters use approximately
10W per port, and upcoming products will consume even less, making them suitable for
integration onto motherboards in future server generations.

Figure 1. Configuration of the server wiring when using eight 1GbE connections.

Figure 2. Configuration of the server wiring when two 10G connections replace eight
1GbE connections.
Framing the Challenge
File transfer applications are widely used
in production environments to move data
between systems for various purposes
and to replicate data sets across servers
and applications that share these data
sets. FedEx, for example, uses ftp, scp, and rsync for data replication and
distribution in their production networks.
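For reference, the sketch below shows what a basic replication job of this kind might look like. It is illustrative only; the host name and paths are placeholders, not FedEx's actual configuration.

```
# Illustrative replication commands (host and paths are hypothetical).
# rsync copies only changed files and can remove files deleted at the source.
rsync -av --delete /data/repos/ backup@replica.example.com:/data/repos/

# scp performs a simple one-shot copy of a directory tree over SSH.
scp -r /data/exports backup@replica.example.com:/data/
```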
In addition to the considerations
associated with cable consolidation, cost
factors, and the power advantages of
using 10G, another key question remained: Can today's servers effectively take
advantage of 10G pipes? Using ftp, FedEx was able to drive 320 Mbps over a 1G
connection. Initial 10G testing, however, indicated that they could only achieve
560 Mbps, despite the potential capabilities of 10x faster pipes.
Plugging a 10G NIC into a server does
not automatically deliver 10 Gbps of
application-level throughput. An obvious question arises: what can be done to
maximize file transfer performance on modern servers using 10G?
Hardware: Intel Xeon processor X5560 @ 2.8 GHz (8 cores, 16 threads); SMT, NUMA, VT-x, VT-d, EIST, and Turbo enabled (default in BIOS); 24 GB memory; Intel 10GbE CX4 Server Adapter with VMDq
Test methodology: RAM disk used, not disk drives; the focus is on network I/O, not disk I/O
What is being transferred: A directory structure (part of a Linux repository) totaling approximately 8 GB, with files of varying sizes
Data collection tools: Linux* utility sar, used to capture receive throughput and CPU utilization (see the sketch after this table)
Application tools: netperf (common network micro-benchmark); OpenSSH, OpenSSL (standard Linux layers); HPN-SSH (optimized version of OpenSSH); scp, rsync (standard Linux file transfer utilities); bbcp (BitTorrent-like file transfer utility)
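As a data-collection sketch, the commands below show one way sar can capture receive throughput and CPU utilization during a transfer; the interface name is a placeholder.

```
# Sample per-interface network counters and CPU utilization once per second
# for 60 seconds while a transfer is running (eth2 is a placeholder name).
sar -n DEV 1 60 | grep -E "Average|eth2"   # per-interface receive/transmit rates
sar -u 1 60                                # overall CPU utilization
```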
Figure 3. Native test configuration details. [Diagram: source and destination servers, each running RHEL 5.3 64-bit on an Intel® Xeon® processor X5500 series platform with netperf, bbcp, scp, and rsync over SSH/HPN-SSH; Intel Oplin 10GbE CX4 adapters directly connected back-to-back; arrow indicates file transfer direction.]
FedEx and Intel delved deeper into the investigation, using several common file
transfer tools on both native Linux* and VMware Linux VMs. The test environment
featured Red Hat Enterprise Linux (RHEL) 5.3 and VMware vSphere ESX 4 running on
Intel Xeon processor 5500 series-based servers. These servers were equipped with
Intel 10 Gigabit AF DA Dual Port Server Adapters supporting direct-attach copper
twinaxial (10GBASE-CX1) cable connections and the VMDq feature supported by
VMware NetQueue.
Native Test Configuration
Figure 3 details the components of the native test configuration. The test systems
were connected back-to-back over 10G to eliminate variables that could be caused by
the network switches themselves. This should be considered a best-case scenario,
because any switch adds some finite amount of latency in real-world scenarios,
possibly degrading performance for some workloads.

A RAM disk, rather than physical disk drives, was used in all testing to focus on
network I/O performance rather than being limited by disk I/O performance.

The default bulk encryption used in OpenSSH and HPN-SSH is Advanced Encryption
Standard (AES) 128-bit. This was not changed during testing.
The application test tools included the following:
• netperf: This commonly used network-oriented, low-level synthetic micro-benchmark does very little processing beyond forwarding the packets. It is effective for evaluating the capabilities of the network interface itself.
• OpenSSH, OpenSSL: These standard Linux layers perform encryption for remote access, file transfers, and so on.
• HPN-SSH: This optimized version of OpenSSH was developed by the Pittsburgh Supercomputing Center (PSC). For more details, visit www.psc.edu/networking/projects/hpn-ssh/
• scp: The standard Linux secure copy utility
• rsync: The standard Linux directory synchronization utility
• bbcp: A peer-to-peer file copy utility (similar to BitTorrent) developed by the Stanford Linear Accelerator Center (SLAC). For more details, go to www.slac.stanford.edu/~abh/bbcp/
Synthetic Benchmarks versus
Real File Transfer Workloads
for Native Linux*
The following two sections examine
the differences between synthetic
benchmarking and benchmarks generated
during actual workloads while running
native Linux.
Synthetic Benchmarks
Figure 4 shows the results of two test cases using netperf. In the first case, the
10G card was plugged into a PCIe* Gen1 x4 slot, which limited throughput to about
6 Gbps because of the PCIe bus bottleneck. In the second case, the card was plugged
into a PCIe Gen1 x8 slot, which allowed full throughput at near line rate.
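A minimal netperf invocation of this kind might look like the following sketch; the destination address is a placeholder, and netserver must already be running on the receiver.

```
# On the destination server (run once): start the netperf listener.
netserver

# On the source server: 60-second TCP bulk-throughput test toward the
# destination's 10G interface (10.0.0.2 is a placeholder address).
netperf -H 10.0.0.2 -t TCP_STREAM -l 60
```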
PCIe slots can present a problem if the physical slot size does not match the actual
connection to the chipset. To determine which slots are capable of full PCIe width
and performance, check with the system vendor. The proper connection width can also
be verified using system tools and log files. PCIe Gen1 x8 (or PCIe Gen2 x4, if
supported) is necessary to achieve 10 Gbps throughput for one 10G port. A dual-port
10G card requires twice the PCIe bus bandwidth.
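One way to verify the negotiated link width is sketched below using lspci; the bus address is a placeholder that you would replace with your adapter's address.

```
# Locate the 10GbE adapter, then inspect its PCIe link capability and status.
# LnkCap shows what the device supports; LnkSta shows what was actually
# negotiated with the slot (for example, x4 instead of x8).
lspci | grep -i ethernet
lspci -s 07:00.0 -vv | grep -E "LnkCap:|LnkSta:"   # 07:00.0 is a placeholder
```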
As demonstrated, achieving 10 Gbps transfer rates is quite easy using a synthetic
benchmark. The next section looks at a case where actual workloads are involved.
[Figure 4 chart: Native Linux*: Synthetic Benchmark; netperf receive throughput (Mbps) in a PCIe* Gen1 x4 slot versus a PCIe* Gen1 x8 slot.]
Figure 4. Synthetic benchmark for native Linux*.
Benchmarks Based on Actual Workloads
Real applications present their own unique challenges. Figure 5 shows the earlier
netperf result as a reference bar on the left and seven test cases to the right.

Tool choice obviously matters, but the standard tools are not very well threaded,
so they don't take full advantage of the eight cores, 16 threads, and NIC queues
(more than 16) available in this particular hardware platform. The scp tool running
over standard ssh, labeled scp(ssh) in Figure 5, and the rsync(ssh) case both achieve
only 400-550 Mbps, about the same as FedEx's initial disappointing results with ftp.
Multi-threaded file transfer tools offer a potential performance boost, and two
promising candidates emerged during a web search: HPN-SSH from PSC and bbcp from SLAC.
Figure 5. Comparison of various file copy tools (one stream). [Chart: Native Linux*: Various File Copy Tools (1 stream); receive throughput (Mbps) and average CPU utilization (%) for netperf, scp(ssh), rsync(ssh), scp(HPN-SSH), rsync(HPN-SSH), scp(HPN-SSH + no crypto), rsync(HPN-SSH + no crypto), and bbcp.]
Using the HPN-SSH layer to replace the OpenSSH layer drives throughput up to about
1.4 Gbps. HPN-SSH uses up to four threads for the cryptographic processing and more
than doubles application-layer throughput. Clearly, performance is moving in the
right direction.

If improving the bulk cryptographic processing helps that much, what could the team
achieve by disabling it entirely? HPN-SSH offers that option as well. Without bulk
cryptography, scp achieves 2 Gbps and rsync achieves 2.3 Gbps. The performance gains
in these cases, however, rely on bypassing encryption for file transfers. HPN-SSH
provides significant performance advantages over the default OpenSSH layer that is
provided in most Linux distributions; this approach warrants further study.
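The sketch below shows how such a transfer might be invoked with an HPN-SSH build. The NoneEnabled/NoneSwitch options are the HPN-SSH patch's switches for disabling bulk encryption after authentication, so confirm the exact names against the HPN-SSH release you install; the host and paths are placeholders.

```
# scp over HPN-SSH with multi-threaded AES (the patch's default behavior).
scp -r /srv/dataset user@10.0.0.2:/srv/

# scp over HPN-SSH with bulk encryption disabled after authentication
# (HPN-SSH "none" cipher switch; verify option names for your build).
scp -r -oNoneEnabled=yes -oNoneSwitch=yes /srv/dataset user@10.0.0.2:/srv/

# rsync can reuse the same transport by pointing -e at the patched ssh client.
rsync -a -e "ssh -oNoneEnabled=yes -oNoneSwitch=yes" /srv/dataset/ user@10.0.0.2:/srv/dataset/
```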
According to a presentation created by SLAC titled "P2P Data Copy Program bbcp"
(http://www.slac.stanford.edu/grp/scs/paper/chep01-7-018.pdf), bbcp encrypts
sensitive passwords and control information but does not encrypt the bulk data
transfer. This design trade-off sacrifices privacy for speed, yet it can still be
acceptable in environments where data privacy is not a critical concern. With bbcp,
using the default four threads, the file transfer rate reached 3.6 Gbps. This
represents the best result so far, surpassing even the HPN-SSH cases with no
cryptographic processing, and it is more than six times better than the initial test
results with scp(ssh). The bbcp approach is very efficient and bears further
consideration.
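A bbcp invocation might look like the sketch below; bbcp must be installed and in the PATH on both hosts, the -s option sets the number of parallel streams (the default of four was used in these tests), and the host and paths are placeholders.

```
# Copy a file with bbcp using four parallel streams (the default).
bbcp -s 4 /srv/dataset/big.tar user@10.0.0.2:/srv/dataset/
```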
Based on these results, FedEx is actively evaluating the use of HPN-SSH and bbcp in
their production environments. None of the techniques tried so far, however, has
come close to achieving 10 Gbps of throughput; none has even reached half of that
target.
[Figure 6 chart: Native Linux*: Various File Copy Tools (8 streams vs. 1 stream); receive throughput (Mbps) and average CPU utilization (%) for one stream versus eight streams across the same eight tools.]
Figure 6. Comparison of various file copy tools (eight streams).
At this point, the Intel and FedEx engineering team focused on identifying the
bottlenecks preventing the 10 Gbps file transfer target from being reached. Short of
rewriting all of the tools to enhance their performance through multi-threading, the
team wanted to know whether any other available techniques could boost file transfer
rates.

To gain a better understanding of the performance issues, the engineering team ran
eight file transfers in parallel streams to attempt to drive up aggregate file
transfer throughput and obtain better utilization of the platform hardware resources.
Figure 6 indicates the results.
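One simple way to launch eight parallel streams is a shell loop like the sketch below; this is illustrative only, not the team's actual test harness, and the host and paths are placeholders.

```
# Start eight scp streams in the background, each copying its own
# subdirectory so the streams do not overlap, then wait for all of them.
for i in $(seq 1 8); do
  scp -r "/srv/dataset/part$i" "user@10.0.0.2:/srv/dataset/" &
done
wait
```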
In Figure 6, the first four red bars show that using eight parallel streams overcomes
the threading limits of these tools and drives aggregate bulk-encrypted throughput
much higher:
• 2.7 Gbps with scp(ssh)
• 3.3 Gbps with rsync(ssh)
• 4.4 Gbps with scp(HPN-SSH)
• 4.2 Gbps with rsync(HPN-SSH)
These results demonstrate that using more parallelism scales performance up
dramatically, by more than five times, but the testing did not show eight times the
throughput when using eight parallel streams. The resolution to the problem therefore
does not lie simply in the brute-force approach of running more streams in parallel.
These results also show that bulk encryption is expensive, in terms of both
throughput and CPU utilization. HPN-SSH, with its multi-threaded cryptography, still
provides a significant benefit, but not as dramatic a benefit as in the single-stream
case.
The results associated with the remaining three red bars of Figure 6 are instructive.
The first two cases use HPN-SSH with no bulk cryptography, and the third case is the
eight-thread bbcp result, in which bulk data transfers are not encrypted. These
results demonstrate that it is possible to achieve nearly the same 10 Gbps line rate
throughput as the netperf micro-benchmark result when running real file transfer
applications.

As this testing indicates, using multiple parallel streams and disabling bulk
cryptographic processing is effective for obtaining near-10-Gbps line rate file
transfer throughput in Linux. These trade-offs may or may not be applicable for a
given network environment, but they do indicate an effective technique for gaining
better performance with fewer trade-offs in the future. Intel is continuing to work
with the Linux community to find solutions that increase file transfer performance.
The next-generation Intel Xeon processor 5600 family, code-named Westmere, and other
future Intel Xeon processors will add a new instruction set enhancement, AES-NI, to
improve performance for Advanced Encryption Standard (AES) bulk encryption. This
advance promises to deliver faster file transfer rates when bulk encryption is turned
on with AES-NI-enabled platforms.
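When AES-NI-capable platforms arrive, a quick sanity check like the sketch below can confirm that the CPU exposes the instructions and give a rough feel for bulk AES throughput; the OpenSSL build must also support AES-NI for the accelerated path to be used.

```
# Does the CPU advertise the AES instruction set?
grep -m1 -o aes /proc/cpuinfo || echo "AES-NI not reported by this CPU"

# Rough AES throughput measurements with OpenSSL's built-in benchmark;
# the -evp form exercises the code path that can use hardware acceleration.
openssl speed aes-128-cbc
openssl speed -evp aes-128-cbc
```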
Best Practices for Native Linux
Follow these practices to achieve the best performance results when performing file
transfer operations under native Linux:
• Configuration: PCIe Gen1 x8 minimum for one 10G port; PCIe Gen2 x8 minimum for two 10G ports on one card.
• BIOS settings: Turn ON Energy Efficient mode (Turbo, EIST), SMT, and NUMA.
• Turn ON Receive (Rx) Multi-Queue (MQ) support (enabled by default in RHEL). Transmit (Tx) is currently limited to one queue in RHEL; SLES 11 RC supports MQ Tx. One way to verify queue usage is shown in the sketch after this list.
• Factor in these limitations of Linux file transfer tools and SSH/SSL layers:
  - scp and ssh: single threaded
  - rsync: dual threaded
  - HPN-SSH: four cryptography threads and a single-threaded MAC layer
  - bbcp: encrypted setup handshake, but not bulk transfer; defaults to four threads
• Use multiple parallel streams to overcome tool thread limits and maximize throughput.
• Bulk cryptography operations limit performance. Disable cryptography for those environments where that is acceptable.
• The Intel Xeon processor 5600 family (code-named Westmere) and future Intel Xeon processors will improve bulk cryptographic performance using AES-NI.
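As referenced in the multi-queue item above, the sketch below shows one way to confirm that receive traffic is actually being spread across multiple NIC queues; the interface name is a placeholder, and the statistic names depend on the driver.

```
# Each Rx (or TxRx) queue usually appears as its own interrupt vector.
grep eth2 /proc/interrupts

# Per-queue packet counters, if the driver exposes them (names vary by driver).
ethtool -S eth2 | grep -i queue
```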
Achieving Native Performance
in a Virtualized Environment
Earlier sections illustrated effective techniques for maximizing file transfer
performance using various tools in the native Linux environment and described some
of the factors that limit file transfer performance.

The following sections examine performance in a virtualized environment to determine
the level of network throughput that can be achieved for both synthetic benchmarks
and various file transfer tools.
In this virtualized environment, the test systems were provisioned with VMware
vSphere ESX 4 running on Intel Xeon processor 5500 series-based servers. The test
systems were connected back-to-back over 10G with VMDq enabled in VMware NetQueue.
The VMs on these servers were provisioned with RHEL 5.3 (64-bit) and, as in the
native cases, the test team used the same application tools and test methodology,
with one exception: the team used the esxtop utility for measuring the servers'
throughput and CPU utilization.
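For reference, esxtop can also be run non-interactively; a batch capture such as the sketch below records the counters (including per-vmnic network throughput and CPU utilization) to a CSV file for later analysis. The output path is a placeholder.

```
# Batch mode: -b writes CSV output, -d sets the sampling interval in seconds,
# -n sets the number of samples (here, 120 samples at 5-second intervals).
esxtop -b -d 5 -n 120 > /tmp/esxtop_filecopy.csv
```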
Within the virtualized environment, these test scenarios were used:
• One virtual machine with eight vCPUs and 12 GB RAM
• One virtual machine (eight vCPUs and 12 GB RAM) with VMDirectPath I/O
• Eight virtual machines, each with one vCPU and 2 GB RAM
• Eight virtual machines, both with and without the VMDq feature
CASE 1: One Virtual Machine with
Eight vCPUs and 12 GB RAM
Each server had one VM configured with eight vCPUs and 12 GB of RAM, using RHEL 5.3
(64-bit) as the guest operating system (OS). The test team ran netperf as a
micro-benchmark as well as the various file transfer tools, transferring a directory
structure of approximately 8 GB from the VM on the first server to the VM on the
second server. Figure 7 shows the test configuration.

As in the native Linux testing, the test team compared the data from netperf, one
stream of file transfer, and eight parallel file transfers. Figure 8 shows the
receive network throughput and total average CPU utilization for the netperf
micro-benchmark and the various file transfer tools when running a single copy
stream. The figure compares the virtualized results with the native results.
Figure 7. Test configuration for Case 1. [Diagram: each server runs VMware ESX 4.0 on an Intel® Xeon® processor X5500 series platform and hosts one RHEL 5.3 VM (8 vCPUs) running the file transfer applications; the VMs connect through vSwitches to Intel Oplin 10GbE CX4 adapters, directly connected back-to-back; arrow indicates file transfer direction.]
Figure 8 indicates that the throughput in a VM is lower across all cases when
compared with the native case. In the virtualized case, the throughput for the
netperf test is 5.8 Gbps, compared to 9.3 Gbps in the native case. Even with file
transfer tools such as scp and rsync running over standard ssh, the throughput ranges
from 300 Mbps to 500 Mbps, slightly lower than in the native case.

Using the HPN-SSH layer to replace the OpenSSH layer increases the throughput.
Disabling cryptography also increases the throughput, though not to line rate. The
shape of the curve, however, is similar.
The same limitations that occurred in the native case (such as standard tools not
being well threaded and cryptography adding to the overhead) also apply here.
Because of this, the file transfer tools cannot take full advantage of multiple cores
and the NIC queues.

Most of the tools and utilities, including ssh and scp, are single threaded; the
rsync utility is dual threaded. Using the HPN-SSH layer to replace the OpenSSH layer
helps increase the throughput. In HPN-SSH, the cryptography operations are
multi-threaded (four threads), which boosts performance significantly. The
single-threaded MAC layer, however, still creates a bottleneck. When HPN-SSH is run
with cryptography disabled, the performance increases, but the benefits of encrypted
data transfer are lost. This is similar to the case with bbcp, which is
multi-threaded (using four threads by default) but does not encrypt the bulk
transfer.
The next test uses eight parallel streams, attempting to work around the threading
limitations of the various tools. Figure 9 shows the receive network throughput and
CPU utilization for the various file transfer tools when running eight parallel
streams of copies. This chart also includes comparisons with the native results.
Figure 8. Throughput comparison of various file copy tools (one stream).
[Figure 8 chart: ESX* 4.0 GA (1 VM with 8 vCPU): Various File Copy Tools (1 stream); receive throughput (Mbps) and average CPU utilization (%), native versus virtualized.]
Figure 9. Throughput comparison of various file copy tools (eight streams).
[Figure 9 chart: ESX* 4.0 GA (1 VM with 8 vCPU): Various File Copy Tools (8 streams); receive throughput (Mbps) and average CPU utilization (%), native versus virtualized.]
From Figure 9 it is clear that using eight parallel streams overcomes the tools'
threading limitations. Figure 9 also shows that cryptography operations are a limiter
in the first four file transfer cases. Better throughput is achieved in the last
three cases, in which the copies were made without using cryptography. Even though
these results are better than the one-stream data, they still do not come close to
the line rate (approximately 10 Gbps) achieved in the native case. Since this testing
used one VM with eight vCPUs, the test team determined that this might be a good case
for direct assignment of the 10G NIC to the virtual machine.
CASE 2: One Virtual Machine with
Eight vCPUs and VMDirectPath
The test bed setup and configuration in this case was similar to that of the previous
case, except that the 10G NIC was directly assigned to the VM. Figure 10 shows the
test configuration for this case.

The test team started with the synthetic netperf benchmark and then ran eight
parallel streams of copies for the various file transfer tools. The results shown in
Figure 11 compare the performance of the VM with VMDirectPath I/O to that of the VM
without VMDirectPath I/O and to the native case.

As Figure 11 illustrates, VMDirectPath (VT-d direct assignment) of the 10G NIC to the
VM increases performance to a level that is close to the native performance results.
Nonetheless, the trade-offs associated with using VMDirectPath are substantial.
A number of features are not available for a VM configured with VMDirectPath,
including VMotion*, suspend/resume, record/replay, fault tolerance, high
availability, DRS, and so on. Because of these limitations, the use of VMDirectPath
will continue to be a niche solution awaiting future developments that minimize them.
VMDirectPath may be practical to use today for virtual security-appliance VMs, since
these VMs typically do not migrate from host to host. VMDirectPath may also be useful
for other applications that have their own clustering technology and do not rely on
the VMware platform features for fault tolerance.
Figure 10. Test configuration for Case 2.
Figure 11. Throughput comparison of various file copy tools (eight streams) using VMDirectPath.
[Figure 11 chart: ESX* 4.0 GA (1 VM with 8 vCPU and VMDirectPath I/O): Various File Copy Tools (8 streams); receive throughput (Mbps) and average CPU utilization (%) for native, VM without VMDirectPath I/O, and VM with VMDirectPath I/O.]
[Figure 10 diagram: same back-to-back configuration as Case 1, with the Intel Oplin 10GbE CX4 adapter assigned directly to the RHEL 5.3 VM (8 vCPUs) on each VMware ESX 4.0 host.]
CASE 3: Eight Virtual Machines,
Each with One vCPU and 2 GB RAM
The two previous cases incorporated a single large VM with eight vCPUs, which is
similar to a native server. In Case 3, the scenario includes eight virtual machines,
each configured with one vCPU and 2 GB RAM. The guest OS is still RHEL 5.3 (64-bit).
Both servers host eight virtual machines, and the tests run one stream of copy per
VM (so, in effect, eight parallel streams of copies are running). Figure 12 shows the
configuration for this case.

Figure 13 indicates the performance results when running a synthetic micro-benchmark
and real file transfer applications on eight VMs in parallel. Comparisons with the
results for the native server are shown in blue.
As shown in Figure 13, the aggregate throughput with netperf across eight VMs reaches
the same level of throughput as achieved in the native case. With the file transfer
tools, cryptography still imposes performance limitations, but the multi-threaded
cryptography in HPN-SSH improves performance when compared to the standard utilities,
such as ssh. Larger benefits are gained when bulk cryptography is disabled, as
indicated by the last three red bars. Several of these results show that the
virtualized case can equal the performance of the native case.
Figure 12. Test configuration for Case 3.
Figure 13. Throughput comparison of various file copy tools with eight VMs and one stream per VM.
[Figure 13 chart: ESX* 4.0 GA (8 VMs with 1 vCPU each): Various File Copy Tools; receive throughput (Mbps) and average CPU utilization (%), native versus virtualized.]
[Figure 12 diagram: each VMware* ESX* 4.0 host on an Intel® Xeon® processor X5500 series platform runs eight VMs (VM1 through VM8, one vCPU each) connected through vSwitches to Intel Oplin 10GbE CX4 adapters, directly connected back-to-back; arrow indicates file transfer direction.]
Sub Case: Eight Virtual Machines,
with and without VMDq
Because the test configuration includes multiple VMs, the VMDq feature offers some
advantages. This feature helps offload network I/O data processing from the virtual
machine monitor (VMM) software to the network silicon. The test team ran a sub-case
comparing the receive throughput from various file transfer tools with the VMDq and
associated NetQueue feature enabled and disabled.

The results of the first four copy operations in Figure 14 show that there is no
difference in throughput, regardless of whether VMDq is enabled or disabled. This is
primarily because of the limitations imposed by bulk cryptography operations. When
the file transfer tools are run with cryptography disabled, as in the last three
cases, the advantages of the VMDq feature become clear. The results of the last three
operations show the advantage of VMDq with the cryptography bottleneck removed. These
results indicate that VMDq improves performance with multiple VMs if there are no
system bottlenecks in place (such as cryptography or slot width).

Based on the full range of test results, the test team developed a set of best
practices to improve performance in virtualized environments, as detailed in the
next section.
Figure 14. Throughput comparison of various file copy tools (eight streams) with and without Virtual Machine Device Queues (VMDq).
[Figure 14 chart: ESX* 4.0 GA, 8 VMs: Various File Copy Tools (VMDq vs. No VMDq); receive throughput (Mbps) and average CPU utilization (%) by type of file copy, with and without VMDq.]
Best Practices for Virtualized
Environments (ESX* 4.0)
Consider the following guidelines for achieving the best data transfer performance
in a virtualized environment running VMware vSphere ESX 4:
• Confirm the actual width of PCIe slots.
  - Physical slot size doesn't always match the actual connection to the chipset; check with your server vendor to determine which ones are really full width.
  - Verify the proper connection using system tools.
  - PCIe Gen1 x8 (or PCIe Gen2 x4, if supported) is required to achieve 10 Gbps throughput for one 10G port. PCIe Gen2 x8 is necessary for two 10G ports.
  - The transfer rate limitation with PCIe Gen1 x4 is approximately 6 Gbps.
• By default, NetQueue/VMDq is enabled for both Rx and Tx queues.
  - One queue is allocated per core/thread, up to the hardware limits.
  - Performance is improved with multiple VMs if there are no system bottlenecks, such as slot width or cryptography.
• Use vmxnet3 (new in ESX 4.0) or vmxnet2 (ESX 3.5) instead of the default e1000 driver (see the sketch after this list).
• Use multiple VMs to improve throughput rather than one large VM.
  - Two-vCPU VMs show better throughput than one-vCPU VMs in certain cases.
• Tool limitations, slot limitations, and cryptography limitations still apply, as in the native case.
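As noted in the vmxnet3 item above, the virtual NIC type is normally selected when the adapter is added to the VM; the sketch below simply checks which device a VM's first virtual NIC uses. The datastore path is a placeholder, and the .vmx file should only be edited with the VM powered off.

```
# Inspect the virtual NIC device type in the VM's configuration file.
grep -i "ethernet0.virtualDev" /vmfs/volumes/datastore1/vm1/vm1.vmx
# For best 10G performance on ESX 4.0 the expected value is:
#   ethernet0.virtualDev = "vmxnet3"
```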
The Results
This case study and the results obtained through the collaborative efforts of FedEx
and Intel engineers suggest a number of ways in which file transfer performance in
the data center can be maximized, depending on the tools and virtualization
techniques used, as well as the hardware configuration.

The most important performance-related considerations, based on the observations and
data results obtained during testing, are as follows:
• Be aware of potential PCIe slot hardware issues. Ensure that the PCIe bus width and speed provide sufficient I/O bandwidth for full performance.
• Consider Twinax cabling to greatly reduce the cost of 10G Ethernet. The SFP+ form factor provides a single physical interface standard that can cover the application range, from the lowest-cost, lowest-power, shortest-reach networking available to the longest-reach situations that may be encountered. Choosing the cable type on a port-by-port basis can provide a cost-effective and flexible approach for environmental needs. With this new cabling approach, 10G can be cost effective today compared to commonly deployed usage models with six, eight, or more 1G ports per server.
• Set the appropriate system configuration parameters. The test team evaluated each of the recommended settings detailed in the body of this case study and determined that these settings either improved performance or were neutral for the test workloads. Make sure the BIOS is set correctly and that the OS, VMM, and applications take full advantage of modern platform features.
• Synthetic benchmarks are not the same as real workloads. Although synthetic benchmarks can be useful tools to evaluate subsystem performance, they can be misleading for estimating application-level throughput.
• Choose the right tools and usage models. Identify tool-threading limits, and use more parallelism when possible. Cryptography can be a bottleneck, so disable it if it is not required. If it is required, choose an implementation that has optimal performance.
• Set your expectations correctly. A configuration using multiple VMs will often outperform a single VM. Can your application scale in this fashion? In some cases, a multi-vCPU VM is better if the applications are well threaded. Moving directly from your physical server configurations and tunings to the nearest equivalent VM is an effective starting point, but ultimately may not be the best trade-off. Virtualization features, such as VMDq and NetQueue, perform better when multiple VMs are being used and there are no system bottlenecks, such as PCIe bandwidth limitations or cryptography processing.
This case study demonstrates that it is possible to achieve close to 10G line rate
throughput from today's servers powered by Intel Xeon processors running RHEL 5.3
and VMware ESX 4.0. Use the testing methods described to validate the 10G network you
are building and to help identify and resolve problems. The results detailed in this
case study make it clear that tool choices, usage models, and configuration settings
are more important than whether the application is running in a virtualized
environment.
For More Information
For answers to questions about server performance, data center architecture, or
application optimization, check out the resources on The Server Room site:
http://communities.intel.com/community/openportit/server. Intel experts are available
to help you improve performance within your network infrastructure and achieve your
data center goals.
AUTHORS
Chris Greer, chief engineer, FedEx Services
Chris Greer is a chief engineer of technical architecture at FedEx Services. He has worked
for 12 years at FedEx as a network and systems engineer. Currently, he works to research
and define network and server standards with an emphasis on server virtualization.
Bob Albers, I/O usage architect, Intel Corporation
Bob Albers is an I/O usage architect in Intel's Digital Enterprise Group's End-User
Platform Integration team. He has over 32 years of experience in computer and network
design and architecture, including 25 years at Intel. For the last several years Bob has
been focused on understanding the usage models and improving the performance of
network and security workloads running on native and virtualized Intel servers.
Sreeram Sammeta, senior systems engineer, Intel Corporation
Sreeram Sammeta is a senior systems engineer in Intel's Digital Enterprise Group's
End-User Platform Integration group. He has worked in various engineering positions
in Information Technology over the last 7 years and holds an M.S. in electrical
engineering from SIU. His primary focus is building end-to-end proofs of concept and
exploring, deploying, and validating emerging end-user usage models. His recent
interests include evaluating next-generation data center virtualization technologies,
focusing on the performance of virtualized network and security infrastructure appliances.
SOLUTION PROVIDED BY:
¹ Based on list prices for current Intel Ethernet products.
This document and the information given are for the convenience of Intel's customer base and are provided AS IS WITH NO WARRANTIES WHATSOEVER, EXPRESS OR IMPLIED, INCLUDING ANY IMPLIED WARRANTY OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT OF INTELLECTUAL PROPERTY RIGHTS. Receipt or possession of this document does not grant any license to any of the intellectual property
described, displayed, or contained herein. Intel products are not intended for use in medical, life-saving, life-sustaining, critical control, or safety systems, or in nuclear facility applications.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or
configuration may affect actual performance. Intel may make changes to specifications, product descriptions and plans at any time, without notice.
Copyright © 2010 Intel Corporation. All rights reserved. Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and other countries.
Other names and brands may be claimed as the property of others. Printed in USA 1010/PDF Please Recycle 323329-002US