CASE STUDY
Intel® Xeon® processor 5500 series
Virtualization

A FedEx Case Study

THE CHALLENGE
Spiraling data center network complexity, cable maintenance and troubleshooting costs, and increasing bandwidth requirements led FedEx Corporation to investigate techniques for simplifying the network infrastructure and boosting file transfer throughput.

In response to this challenge, FedEx, in collaboration with Intel, conducted a case study to determine the most effective approach to achieving near-native 10-gigabit file transfer rates in a virtualized environment based on VMware vSphere* ESX* 4 running on servers powered by the Intel® Xeon® processor 5500 series. The servers were equipped with Intel® 10 Gigabit AF DA Dual Port Server Adapters supporting direct-attach copper twinaxial cable connections. For this implementation, the Intel® Virtual Machine Device Queues (VMDq) feature was enabled in VMware NetQueue*.

File transfer applications are widely used in production environments, including replicated data sets, databases, backups, and similar operations. As part of this case study, several of these applications are used in the test sequences. This case study:
- Investigates the platform hardware and software limits of file transfer performance
- Identifies the bottlenecks that restrict transfer rates
- Evaluates trade-offs for each of the proposed solutions
- Makes recommendations for increasing file transfer performance in 10 Gigabit Ethernet (10G) native Linux* and a 10G VMware virtualized environment

The latest 10G solutions let users cost-effectively consolidate the many Ethernet and Fibre Channel adapters deployed in a typical VMware ESX implementation. VMware ESX, running on Intel Xeon processor 5500 series-based servers, provides a reliable, high-performance solution for handling this workload.

THE PROCESS
In the process of building a new data center, FedEx Corporation, the largest express shipping company in the world, evaluated the potential benefits of 10G, considering these key questions:
- Can 10G make the network less complex and streamline infrastructure deployment?
- Can 10G help solve cable management issues?
- Can 10G meet our increasing bandwidth requirements as we target higher virtual machine (VM) consolidation ratios?

FedEx Corporation is a worldwide information and business solution company, with a superior portfolio of transportation and delivery services.

Cost Factors
How does 10G affect costs? Both 10GBASE-T and Direct Attach Twinax cabling (sometimes referred to as 10GSFP+Cu or SFP+ Direct Attach) cost less than USD 400 per adapter port. In comparison, a 10G Short Reach (10GBASE-SR) fiber connection costs approximately USD 700 per adapter port.¹
In existing VMware production environments, FedEx had used eight 1GbE connections implemented with two quad-port cards (plus one or two 100/1000 ports for management) in addition to two 4Gb Fibre Channel links. Based on market pricing, moving the eight 1GbE connections on the quad-port cards to two 10GBASE-T or 10G Twinax connections is cost effective today. Using two 10G connections can actually consume less power than the eight 1GbE connections they replace, providing additional savings over the long term. Having less cabling typically reduces instances of wiring errors and lowers maintenance costs as well. FedEx used IEEE 802.1Q trunking to separate traffic flows on the 10G links.

Figure 1. Configuration of the server wiring when using eight 1GbE connections.
Figure 2. Configuration of the server wiring when two 10G connections replace eight 1GbE connections.

Engineering Trade-offs
Engineering trade-offs are also a consideration. Direct Attach/Twinax 10G cabling has a maximum reach of seven meters for passive cables, which affects the physical wiring scheme to be implemented. The servers have to be located within a seven-meter radius of the 10G switches to take advantage of this new technology. However, active cables are available that can extend this range if necessary.

A key feature of the Twinax cable technology is that it uses exactly the same form-factor connectors as the industry-standard SFP+ optical modules. This allows you to select the most cost-effective and power-efficient passive Twinax for short reaches and then move up to active Twinax, SR fiber, or even Long Reach (LR) fiber for longer runs. Twinax adapters consume approximately 3W per port when using passive cabling.

10GBASE-T's maximum reach of 100 meters makes it a flexible, data center-wide deployment option for 10GbE. 10GBASE-T is also backwards compatible with today's widely deployed Gigabit Ethernet infrastructures, which makes it an excellent technology for migrating from GbE to 10GbE, as IT can use existing infrastructures and deployment knowledge. Current-generation 10GBASE-T adapters use approximately 10W per port, and upcoming products will consume even less, making them suitable for integration onto motherboards in future server generations.

Framing the Challenge
File transfer applications are widely used in production environments to move data between systems for various purposes and to replicate data sets across servers and applications that share these data sets. FedEx, for example, uses ftp, scp, and rsync for data replication and distribution in its production networks.

In addition to the considerations associated with cable consolidation, cost factors, and the power advantages of using 10G, another key question remained: can today's servers effectively take advantage of 10G pipes? Using ftp, FedEx was able to drive 320 Mbps over a 1G connection. Initial 10G testing, however, indicated that they could only achieve 560 Mbps, despite the potential capabilities of 10x faster pipes. Plugging a 10G NIC into a server does not automatically deliver 10 Gbps of application-level throughput. An obvious question arises: what can be done to maximize file transfer performance on modern servers using 10G?

Hardware
Intel Xeon processor X5560 @ 2.8 GHz (8 cores, 16 threads); SMT, NUMA, VT-x, VT-d, EIST, and Turbo enabled (default in BIOS); 24 GB memory; Intel 10GbE CX4 Server Adapter with VMDq.

Test Methodology
A RAM disk, rather than physical disk drives, was used so that the tests measure network I/O, not disk I/O. The data being transferred is a directory structure (part of a Linux repository) totaling approximately 8 GB; a sketch of a comparable RAM disk setup follows.
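The study does not describe how the RAM disk was provisioned. As a minimal sketch, assuming a Linux host with enough free memory to hold the test tree, a tmpfs mount can serve the same purpose (the mount point, size, and source path below are illustrative, not taken from the FedEx configuration):

    # Create a RAM-backed staging area so the copy tools are not limited by disk I/O.
    mkdir -p /mnt/ramdisk
    mount -t tmpfs -o size=10g tmpfs /mnt/ramdisk

    # Stage the ~8 GB test directory tree in the RAM disk before starting any transfers.
    cp -a /data/test-tree /mnt/ramdisk/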
Data Collection Tools
Linux* utility sar: captures receive throughput and CPU utilization.

Application Tools
netperf (common network micro-benchmark); OpenSSH and OpenSSL (standard Linux encryption layers); HPN-SSH (optimized version of OpenSSH); scp and rsync (standard Linux file transfer utilities); bbcp (multi-threaded file transfer utility).

Figure 3. Native test configuration: two RHEL 5.3 (64-bit) servers based on the Intel® Xeon® processor X5500 series, directly connected back-to-back with Intel 10GbE CX4 adapters, running netperf, bbcp, scp, rsync, ssh, and HPN-SSH.

FedEx and Intel delved deeper into the investigation, using several common file transfer tools on both native Linux* and VMware Linux VMs. The test environment featured Red Hat Enterprise Linux (RHEL) 5.3 and VMware vSphere ESX 4 running on Intel Xeon processor 5500 series-based servers. These servers were equipped with Intel 10 Gigabit AF DA Dual Port Server Adapters supporting direct-attach copper twinaxial (10GBASE-CX1) cable connections and the VMDq feature supported by VMware NetQueue.

Native Test Configuration
Figure 3 details the components of the native test configuration. The test systems were connected back-to-back over 10G to eliminate variables that could be caused by the network switches themselves. This should be considered a best-case scenario, because any switch adds some finite amount of latency in real-world scenarios, possibly degrading performance for some workloads. A RAM disk, rather than physical disk drives, was used in all testing to focus on network I/O performance rather than being limited by disk I/O performance. The default bulk encryption used in OpenSSH and HPN-SSH is Advanced Encryption Standard (AES) 128-bit; this was not changed during testing.

The application test tools included the following:
- netperf: This commonly used network-oriented, low-level synthetic micro-benchmark does very little processing beyond forwarding packets. It is effective for evaluating the capabilities of the network interface itself.
- OpenSSH, OpenSSL: These standard Linux layers perform encryption for remote access, file transfers, and so on.
- HPN-SSH: This optimized version of OpenSSH was developed by the Pittsburgh Supercomputing Center (PSC). For more details, visit www.psc.edu/networking/projects/hpn-ssh/
- scp: The standard Linux secure copy utility
- rsync: The standard Linux directory synchronization utility
- bbcp: A peer-to-peer file copy utility (similar to BitTorrent) developed by the Stanford Linear Accelerator Center (SLAC). For more details, go to www.slac.stanford.edu/~abh/bbcp/

Representative invocations of these tools are sketched below.
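The exact command lines used by the test team are not published in the case study. The following is a minimal sketch of representative single-stream invocations, assuming the RAM disk layout above, a destination host named dest-server, and a netserver instance already running on the destination; all names and paths are illustrative:

    # Synthetic micro-benchmark: one 60-second TCP stream to the destination server.
    netperf -H dest-server -t TCP_STREAM -l 60

    # Single-stream copies of the test tree with the standard tools (OpenSSH underneath).
    scp -r /mnt/ramdisk/test-tree dest-server:/mnt/ramdisk/
    rsync -a /mnt/ramdisk/test-tree dest-server:/mnt/ramdisk/

    # On the receiving server, sample NIC throughput and CPU utilization once per second with sar.
    sar -n DEV 1
    sar -u 1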
Synthetic Benchmarks versus Real File Transfer Workloads for Native Linux*
The following two sections examine the differences between synthetic benchmarking and benchmarks generated by actual workloads when running native Linux.

Synthetic Benchmarks
Figure 4 shows the results of two test cases using netperf. In the first case, the 10G card was plugged into a PCIe* Gen1 x4 slot, which limited throughput to about 6 Gbps because of the PCIe bus bottleneck. In the second case, the card was plugged into a PCIe Gen1 x8 slot, which allowed full throughput at near line rate. PCIe slots can present a problem if the physical slot size does not match the actual connection to the chipset. To determine which slots are capable of full PCIe width and performance, check with the system vendor. The proper connection width can also be verified using system tools and log files, as sketched below.
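The case study does not name the specific tools used for this check. One common approach on Linux, shown here as an illustrative sketch (the PCI bus address is an example), is to read the adapter's PCIe link capability and negotiated link status with lspci:

    # Find the 10GbE adapter's PCI bus address.
    lspci | grep -i ethernet

    # Compare the link width the card supports (LnkCap) with the width actually
    # negotiated by the slot (LnkSta); an x8 card trained at x4 indicates a bottleneck.
    lspci -vv -s 07:00.0 | grep -E 'LnkCap|LnkSta'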
PCIe Gen1 x8 (or PCIe Gen2 x4, if supported) is necessary to achieve 10 Gbps of throughput for one 10G port; a dual-port 10G card requires twice the PCIe bus bandwidth. As demonstrated, achieving 10 Gbps transfer rates is quite easy using a synthetic benchmark. The next section looks at a case where actual workloads are involved.

Figure 4. Synthetic benchmark for native Linux*: netperf receive throughput (Mbps) in a PCIe Gen1 x4 slot versus a PCIe Gen1 x8 slot.

Benchmarks Based on Actual Workloads
Real applications present their own unique challenges. Figure 5 shows the earlier netperf result as a reference bar on the left and seven test cases to the right. Tool choice obviously matters, but the standard tools are not very well threaded, so they do not take full advantage of the eight cores, 16 threads, and NIC queues (more than 16) available in this particular hardware platform. The scp tool running over standard ssh, shown as scp(ssh) in Figure 5, and the rsync(ssh) case both achieve only 400-550 Mbps, about the same as FedEx's initial disappointing results with ftp. Multi-threaded file transfer tools offer a potential performance boost, and two promising candidates emerged during a web search: HPN-SSH from PSC and bbcp from SLAC.

Figure 5. Comparison of various file copy tools (one stream) on native Linux*: receive throughput (Mbps) and average CPU utilization (%) for netperf, scp(ssh), rsync(ssh), scp(HPN-SSH), rsync(HPN-SSH), scp(HPN-SSH + no crypto), rsync(HPN-SSH + no crypto), and bbcp.
Using the HPN-SSH layer in place of the OpenSSH layer drives throughput up to about 1.4 Gbps. HPN-SSH uses up to four threads for the cryptographic processing and more than doubles application-layer throughput. Clearly, performance is moving in the right direction.

If improving the bulk cryptographic processing helps that much, what could the team achieve by disabling it entirely? HPN-SSH offers that option as well. Without bulk cryptography, scp achieves 2 Gbps and rsync achieves 2.3 Gbps. The performance gains in these cases, however, rely on bypassing encryption for file transfers. HPN-SSH provides significant performance advantages over the default OpenSSH layer provided in most Linux distributions; this approach warrants further study.

According to a presentation created by SLAC titled "P2P Data Copy Program bbcp" (http://www.slac.stanford.edu/grp/scs/paper/chep01-7-018.pdf), bbcp encrypts sensitive passwords and control information but does not encrypt the bulk data transfer. This design trade-off sacrifices privacy for speed. Even without encrypting the bulk data, this can still be an effective trade-off for environments where data privacy is not a critical concern. With bbcp, using the default four threads, the file transfer rates reached 3.6 Gbps. This represents the best result so far, surpassing even the HPN-SSH cases with no cryptographic processing, and it is more than six times better than the initial test results with scp(ssh). The bbcp approach is very efficient and bears further consideration. Based on these results, FedEx is actively evaluating the use of HPN-SSH and bbcp in its production environments.
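The study does not list the switches used to disable bulk encryption or to set the bbcp stream count. As an illustration only: HPN-SSH adds a "none" cipher that can be requested per connection (it must also be permitted on the server side), and bbcp takes its stream count on the command line. The option names below reflect common HPN-SSH and bbcp usage and should be verified against the installed versions; paths and host names are illustrative.

    # scp over HPN-SSH with the bulk cipher switched off after authentication
    # (NoneEnabled/NoneSwitch are HPN-SSH options; the login itself stays encrypted).
    scp -r -oNoneEnabled=yes -oNoneSwitch=yes /mnt/ramdisk/test-tree dest-server:/mnt/ramdisk/

    # bbcp copy using four parallel streams (its default); bulk data is not encrypted by design.
    bbcp -s 4 /mnt/ramdisk/bigfile.tar dest-server:/mnt/ramdisk/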
None of the techniques tried so far, however, has come close to achieving 10 Gbps of throughput, not even reaching half of that target. At this point, the Intel and FedEx engineering team focused on identifying any bottlenecks preventing the 10 Gbps file transfer target from being reached. Other than rewriting all of the tools to enhance performance through multi-threading, the team also wanted to know whether any other available techniques could boost file transfer rates.

To gain a better understanding of the performance issues, the engineering team ran eight file transfers in parallel streams to attempt to drive up aggregate file transfer throughput and obtain better utilization of the platform hardware resources. Figure 6 indicates the results.

In Figure 6, the first four red bars show that using eight parallel streams overcomes the threading limits of these tools and drives aggregate bulk encrypted throughput much higher:
- 2.7 Gbps with scp(ssh)
- 3.3 Gbps with rsync(ssh)
- 4.4 Gbps with scp(HPN-SSH)
- 4.2 Gbps with rsync(HPN-SSH)

Figure 6. Comparison of various file copy tools (eight streams versus one stream) on native Linux*.

These results demonstrate that using more parallelism dramatically scales up performance, by more than five times, but the testing did not demonstrate eight times the throughput when using eight streams in parallel. The resolution to the problem does not lie in simply using the brute-force approach of running more streams in parallel. These results also show that bulk encryption is expensive, in terms of both throughput and CPU utilization. HPN-SSH, with its multi-threaded cryptography, still provides a significant benefit, but not as dramatic a benefit as in the single-stream case.

The results associated with the remaining three red bars of Figure 6 are instructive. The first two cases use HPN-SSH with no bulk cryptography, and the third case is the bbcp result with eight threads, in which bulk data transfers are not encrypted. These results demonstrate that it is possible to achieve nearly the same 10 Gbps line rate throughput as the netperf micro-benchmark result when running real file transfer applications. As this testing indicates, using multiple parallel streams and disabling bulk cryptographic processing is effective for obtaining near-10 Gbps line rate file transfer throughput in Linux; a sketch of the parallel-stream approach follows at the end of this section. These trade-offs may or may not be applicable for a given network environment, but they do indicate an effective technique for gaining better performance with fewer trade-offs in the future. Intel is continuing to work with the Linux community to find solutions that increase file transfer performance.

The next-generation Intel Xeon processor 5600 series, code-named Westmere, and other future Intel Xeon processors will have a new instruction set enhancement, AES-NI, to improve performance for Advanced Encryption Standard (AES) bulk encryption. This advance promises to deliver faster file transfer rates when bulk encryption is turned on with AES-NI-enabled platforms.
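The mechanism used to launch the eight parallel streams is not shown in the study. A simple way to approximate it, assuming the test tree is split into eight similarly sized subdirectories (names illustrative), is a shell loop that starts the copies in the background and waits for completion:

    # Launch eight scp copies in parallel, one per subdirectory, then wait for all of them.
    # Aggregate receive throughput and CPU utilization can be watched with sar on the destination.
    for i in 1 2 3 4 5 6 7 8; do
        scp -r "/mnt/ramdisk/test-tree/part$i" dest-server:/mnt/ramdisk/ &
    done
    wait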
Best Practices for Native Linux
Follow these practices to achieve the best performance results when performing file transfer operations under native Linux:
- Configuration: PCIe Gen1 x8 minimum for one 10G port; PCIe Gen2 x8 minimum for two 10G ports on one card.
- BIOS settings: Turn ON Energy Efficient mode (Turbo, EIST), SMT, and NUMA.
- Turn ON Receive (Rx) Multi-Queue (MQ) support (enabled by default in RHEL); Transmit (Tx) is currently limited to one queue in RHEL, while SLES 11 RC supports MQ Tx.
- Factor in these limitations of the Linux file transfer tools and SSH/SSL layers:
  - scp and ssh: single threaded
  - rsync: dual threaded
  - HPN-SSH: four cryptography threads and a single-threaded MAC layer
  - bbcp: encrypted setup handshake, but not bulk transfer; defaults to four threads
- Use multiple parallel streams to overcome tool thread limits and maximize throughput.
- Bulk cryptography operations limit performance. Disable cryptography for those environments where it is acceptable. The Intel Xeon processor 5600 series (code-named Westmere) and future Intel Xeon processors will improve bulk cryptographic performance using AES-NI.

Achieving Native Performance in a Virtualized Environment
Earlier sections illustrated effective techniques for maximizing file transfer performance using various tools in the native Linux environment and described some of the factors that limit file transfer performance. The following sections examine performance in a virtualized environment to determine the level of network throughput that can be achieved for both synthetic benchmarks and various file transfer tools.

In this virtualized environment, the test systems are provisioned with VMware vSphere ESX 4 running on Intel Xeon processor 5500 series-based servers. The test systems are connected back-to-back over 10G with VMDq enabled in VMware NetQueue. The VMs on these servers were provisioned with RHEL 5.3 (64-bit) and, as in the native cases, the test team used the same application tools and test methodology, with one exception: the team used the esxtop utility for measuring the servers' throughput and CPU utilization, as sketched below.
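The case study names esxtop as the measurement tool but does not show how it was invoked. A common pattern, sketched here with illustrative interval and sample counts, is to record batch-mode output on each ESX host and pull the per-VM network and CPU counters from the resulting CSV afterwards:

    # Record all esxtop counters in batch mode: 5-second samples, 120 samples (10 minutes),
    # written as CSV for offline analysis of receive throughput and CPU utilization.
    esxtop -b -d 5 -n 120 > /var/tmp/esxtop-filecopy.csv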
Within the virtualized environment, these test scenarios were used:
- One virtual machine with eight vCPUs and 12 GB RAM
- One virtual machine (eight vCPUs and 12 GB RAM) with VMDirectPath I/O
- Eight virtual machines, each with one vCPU and 2 GB RAM
- Eight virtual machines, both with and without the VMDq feature

CASE 1: One Virtual Machine with Eight vCPUs and 12 GB RAM
Each server had one VM configured with eight vCPUs and 12 GB of RAM, using RHEL 5.3 (64-bit) as the guest operating system (OS). The test team ran netperf as a micro-benchmark and also ran the various file transfer tools, transferring a directory structure of approximately 8 GB from the VM on the first server to the VM on the second server. Figure 7 shows the test configuration. Similar to the testing done in the native Linux case, the test team compared the data from netperf, one stream of file transfer, and eight parallel file transfers.

Figure 7. Test configuration for Case 1: one RHEL 5.3 VM with eight vCPUs on each server, connected through the vSwitch in VMware ESX 4.0 over directly connected Intel 10GbE CX4 adapters.

Figure 8 shows the receive network throughput and total average CPU utilization for the micro-benchmark (netperf) and the various file transfer tools when running a single stream of copy, comparing the native results with the results in the virtualized environment. Figure 8 indicates that the throughput in a VM is lower across all cases when compared with the native case. In the virtualized case, the throughput for the netperf test is 5.8 Gbps compared to 9.3 Gbps in the native case. Even with file transfer tools such as scp and rsync running over standard ssh, the throughput ranges from 300 Mbps to 500 Mbps, which is slightly lower than the native case. Using the HPN-SSH layer to replace the OpenSSH layer increases the throughput. Disabling cryptography also increases the throughput, but not to as high a level as the line rate. The shape of the curve, however, is similar.

The same limitations that occurred in the native case (such as standard tools not being well threaded and cryptography adding to the overhead) also apply in this case. Because of this, the file transfer tools cannot take full advantage of the multiple cores and the NIC queues. Most of the tools and utilities, including ssh and scp, are single threaded; the rsync utility is dual threaded. Using the HPN-SSH layer in place of the OpenSSH layer helps increase the throughput: in HPN-SSH, the cryptography operations are multi-threaded (four threads), which boosts performance significantly. The single-threaded MAC layer, however, still creates a bottleneck. When HPN-SSH is run with cryptography disabled, the performance increases, but the benefits of encrypted data transfer are lost. This is similar to the case with bbcp, which is multi-threaded (using four threads by default) but does not encrypt the bulk transfer.

The next test uses eight parallel streams, attempting to work around the threading limitations of the various tools. Figure 9 shows the receive network throughput and CPU utilization for the various file transfer tools when running eight parallel streams of copies; this chart also includes comparisons with the native results.

Figure 8. Throughput comparison of various file copy tools (one stream): native versus virtualized (ESX* 4.0 GA, one VM with eight vCPUs).
Figure 9. Throughput comparison of various file copy tools (eight streams): native versus virtualized (ESX* 4.0 GA, one VM with eight vCPUs), showing receive throughput (Mbps) and average CPU utilization (%).
From Figure 9 it is clear that using eight parallel streams overcomes the tools' threading limitations. Figure 9 also shows that cryptography operations are a limiter in the first four file transfer cases; better throughput is indicated in the last three cases, in which the copies were made without using cryptography. Even though these results are better than the one-stream data, they still do not come close to the line rate (approximately 10 Gbps) achieved in the native case. Since the testing used one VM with eight vCPUs, the test team determined that this might be a good case for using direct assignment of the 10G NIC to the virtual machine.

CASE 2: One Virtual Machine with Eight vCPUs and VMDirectPath
The test bed setup and configuration in this case was similar to that of the previous case except that the 10G NIC was directly assigned to the VM. Figure 10 shows the test configuration for this case. The test team started with the synthetic benchmark netperf and then ran eight parallel streams of copies for the various file transfer tools. The results shown in Figure 11 compare the performance numbers from the VM with VMDirectPath I/O to those of the VM without VMDirectPath I/O and to the native results.

As Figure 11 illustrates, VMDirectPath (VT-d direct assignment) of the 10G NIC to the VM increases performance to a level that is close to the native performance results. Nonetheless, the trade-offs associated with using VMDirectPath are substantial. A number of features are not available for a VM configured with VMDirectPath, including VMotion*, suspend/resume, record/replay, fault tolerance, high availability, DRS, and so on. Because of these limitations, the use of VMDirectPath will continue to be a niche solution awaiting future developments that minimize them. VMDirectPath may be practical to use today for virtual security-appliance VMs, since these VMs typically do not migrate from host to host. VMDirectPath may also be useful for other applications that have their own clustering technology and do not rely on the VMware platform features for fault tolerance.

Figure 10. Test configuration for Case 2: one RHEL 5.3 VM with eight vCPUs on each server, with the 10G NIC directly assigned to the VM.
Figure 11. Throughput comparison of various file copy tools (eight streams) using VMDirectPath: native versus VM without VMDirectPath I/O versus VM with VMDirectPath I/O.
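VMDirectPath assignment is normally configured from the vSphere Client by enabling passthrough for the device on the host and then adding a PCI device to the VM; the result is recorded as pciPassthru entries in the VM's .vmx file. The excerpt below is an illustrative sketch only: the bus address must match the host's passthrough-enabled 10G NIC, and the remaining identifier keys are filled in by the client when the device is added.

    # Illustrative .vmx excerpt for a VM using VMDirectPath (VT-d) assignment of a 10G NIC.
    # The PCI address 07:00.0 is an example; additional vendor/device/system ID keys
    # are generated automatically when the device is added through the vSphere Client.
    pciPassthru0.present = "TRUE"
    pciPassthru0.id = "07:00.0"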
CASE 3: Eight Virtual Machines, Each with One vCPU and 2 GB RAM
The two previous cases incorporated a single large VM with eight vCPUs, which is somewhat similar to a native server. In Case 3, the scenario includes eight virtual machines, with each VM configured with one vCPU and 2 GB RAM. The guest OS is still RHEL 5.3 (64-bit). Both of the servers include eight virtual machines, and the tests run one stream of copy per VM (so, in effect, eight parallel streams of copies are running); a sketch of this arrangement follows the figure captions below. Figure 12 shows the configuration for this case.

Figure 13 indicates the performance results when running a synthetic micro-benchmark and real file transfer applications on eight VMs in parallel. Comparisons with the results for the native server are shown in blue. As shown in Figure 13, the aggregate throughput with netperf across eight VMs reaches the same level of throughput as achieved in the native case. With the file transfer tools, cryptography still imposes performance limitations, but the multi-threaded cryptography in HPN-SSH improves performance when compared to the standard utilities, such as ssh. Larger benefits are gained when bulk cryptography is disabled, as indicated by the results shown in the last three red bars. Several of these results show that the virtualized case can equal the performance of the native case.

Figure 12. Test configuration for Case 3: eight RHEL 5.3 VMs (one vCPU each) on each server, connected through the vSwitch in VMware ESX 4.0.
Figure 13. Throughput comparison of various file copy tools with eight VMs and one stream per VM: native versus virtualized (ESX* 4.0 GA).
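The study runs one copy stream per VM across the eight VM pairs. As a rough sketch of how that can be driven from an administration host, assuming passwordless SSH and illustrative guest host names src-vm1..8 and dest-vm1..8:

    # Start one copy inside each source VM, targeting its matching destination VM,
    # so that eight streams run in parallel across the 10G link; then wait for all of them.
    for i in 1 2 3 4 5 6 7 8; do
        ssh "src-vm$i" "scp -r /mnt/ramdisk/test-tree dest-vm$i:/mnt/ramdisk/" &
    done
    wait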
Sub Case: Eight Virtual Machines, with and without VMDq
Because the test configuration includes multiple VMs, the VMDq feature offers some advantages. This feature helps offload network I/O data processing from the virtual machine monitor (VMM) software to the network silicon. The test team ran a sub case, comparing the receive throughput of the various file transfer tools with the VMDq and associated NetQueue feature enabled and disabled.

The results of the first four copy operations in Figure 14 show that there is no difference in throughput, regardless of whether VMDq is enabled or disabled. This is primarily because of the limitations imposed by bulk cryptography operations. When the file transfer tools are run with cryptography disabled, as in the last three cases, the advantages of the VMDq feature become clear: the results of the last three operations show the advantage of VMDq with the cryptography bottleneck removed. These results indicate that VMDq improves performance with multiple VMs if there are no system bottlenecks in place (such as cryptography or slot width). Based on the full range of test results, the test team developed a set of best practices to improve performance in virtualized environments, as detailed in the next section.

Figure 14. Throughput comparison of various file copy tools (eight streams) with and without Virtual Machine Device Queues (VMDq) in ESX* 4.0 GA, showing receive throughput (Mbps) and average CPU utilization (%).
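The case study does not document how VMDq was enabled and disabled for this sub case. On ESX 4.0, NetQueue is on by default, and the number of VMDq queues exposed by the Intel 10GbE driver is typically controlled through a driver module option set with esxcfg-module; the option name and values below are assumptions that should be checked against the documentation for the installed driver, and the host must be rebooted for the change to take effect.

    # Assumed example: request 16 VMDq queues on each port of the ixgbe-based adapter,
    # then confirm the options currently recorded for the driver module.
    esxcfg-module -s "VMDQ=16,16" ixgbe
    esxcfg-module -g ixgbe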
Best Practices for Virtualized Environments (ESX* 4.0)
Consider the following guidelines for achieving the best data transfer performance in a virtualized environment running VMware vSphere ESX 4:
- Confirm the actual width of PCIe slots.
  - Physical slot size doesn't always match the actual connection to the chipset; check with your server vendor to determine which slots are really full width.
  - Verify the proper connection using system tools.
  - PCIe Gen1 x8 (or PCIe Gen2 x4, if supported) is required to achieve 10 Gbps throughput for one 10G port. PCIe Gen2 x8 is necessary for two 10G ports.
  - The transfer rate limitation with PCIe Gen1 x4 is approximately 6 Gbps.
- By default, NetQueue/VMDq is enabled for both Rx and Tx queues.
  - One queue is allocated per core/thread, up to the hardware limits.
  - Performance is improved with multiple VMs if there are no system bottlenecks, such as slot width or cryptography.
- Use vmxnet3 (new in ESX 4.0) or vmxnet2 (ESX 3.5) instead of the default e1000 driver (see the configuration sketch after this list).
- Use multiple VMs to improve throughput rather than one large VM.
  - Two-vCPU VMs show better throughput than one-vCPU VMs in certain cases.
- Tool limitations, slot limitations, and cryptography limitations still apply, as in the native case.
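For the vmxnet3 recommendation in the list above, the virtual NIC type is normally selected when the network adapter is added to the VM; in the VM's .vmx file the choice appears as the virtualDev setting for that adapter (the adapter index here is illustrative):

    # Illustrative .vmx excerpt: use the paravirtualized vmxnet3 device for the VM's first vNIC
    # instead of the default emulated e1000 device.
    ethernet0.virtualDev = "vmxnet3"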
The Results
This case study and the results obtained through the collaborative efforts of FedEx and Intel engineers suggest a number of ways in which file transfer performance in the data center can be maximized, depending on the tools and virtualization techniques used, as well as the hardware configuration. The most important performance-related considerations, based on the observations and data obtained during testing, are as follows:
- Be aware of potential PCIe slot hardware issues. Ensure that the PCIe bus width and speed provide sufficient I/O bandwidth for full performance.
- Consider Twinax cabling to greatly reduce the cost of 10G Ethernet. The SFP+ form factor provides a single physical interface standard that can cover the full application range, from the lowest-cost, lowest-power, shortest-reach networking available to the longest-reach situations that may be encountered. Choosing the cable type on a port-by-port basis can provide a cost-effective and flexible approach for environmental needs. With this cabling approach, 10G can be cost effective today compared to commonly deployed usage models with six, eight, or more 1G ports per server.
- Set the appropriate system configuration parameters. The test team evaluated each of the recommended settings detailed in the body of this case study and determined that these settings either improved performance or were neutral for the test workloads. Make sure the BIOS is set correctly and that the OS, VMM, and applications take full advantage of modern platform features.
- Synthetic benchmarks are not the same as real workloads. Although synthetic benchmarks can be useful for evaluating subsystem performance, they can be misleading for estimating application-level throughput.
- Choose the right tools and usage models. Identify tool threading limits, and use more parallelism when possible. Cryptography can be a bottleneck, so disable it if it is not required. If it is required, choose an implementation that has optimal performance.
- Set your expectations correctly. Configurations using multiple VMs will often outperform a single VM; consider whether your application can scale in this fashion. In some cases a multi-vCPU VM is better, if the applications are well threaded. Moving directly from your physical server configurations and tunings to the nearest equivalent VM is an effective starting point, but it ultimately may not be the best trade-off. Virtualization features such as VMDq and NetQueue perform better when multiple VMs are in use and there are no system bottlenecks, such as PCIe bandwidth limitations or cryptography processing.

This case study demonstrates that it is possible to achieve close to 10G line rate throughput from today's servers powered by Intel Xeon processors running RHEL 5.3 and VMware ESX 4.0. Use the testing methods described here to validate the 10G network you are building and to help identify and resolve problems. The results detailed in this case study make it clear that tool choices, usage models, and configuration settings are more important than whether the application is running in a virtualized environment.

For More Information
For answers to questions about server performance, data center architecture, or application optimization, check out the resources on The Server Room site: http://communities.intel.com/community/openportit/server. Intel experts are available to help you improve performance within your network infrastructure and achieve your data center goals.

AUTHORS
Chris Greer, chief engineer, FedEx Services
Chris Greer is a chief engineer of technical architecture at FedEx Services. He has worked for 12 years at FedEx as a network and systems engineer. Currently, he works to research and define network and server standards with an emphasis on server virtualization.

Bob Albers, I/O usage architect, Intel Corporation
Bob Albers is an I/O usage architect in Intel's Digital Enterprise Group End-User Platform Integration team. He has over 32 years of experience in computer and network design and architecture, including 25 years at Intel. For the last several years Bob has focused on understanding usage models and improving the performance of network and security workloads running on native and virtualized Intel servers.

Sreeram Sammeta, senior systems engineer, Intel Corporation
Sreeram Sammeta is a senior systems engineer in Intel's Digital Enterprise Group End-User Platform Integration group. He has previously worked in various engineering positions in Information Technology during the last 7 years. He holds an M.S. in electrical engineering from SIU. His primary focus area is building end-to-end proofs of concept, and exploring, deploying, and validating emerging end-user usage models. His recent interests include evaluating next-generation data center virtualization technologies, focusing on the performance of virtualized network and security infrastructure appliances.
¹ Based on list prices for current Intel Ethernet products.

This document and the information given are for the convenience of Intel's customer base and are provided AS IS WITH NO WARRANTIES WHATSOEVER, EXPRESS OR IMPLIED, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT OF INTELLECTUAL PROPERTY RIGHTS. Receipt or possession of this document does not grant any license to any of the intellectual property described, displayed, or contained herein. Intel products are not intended for use in medical, life-saving, life-sustaining, critical control, or safety systems, or in nuclear facility applications.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Intel may make changes to specifications, product descriptions, and plans at any time, without notice.

Copyright © 2010 Intel Corporation. All rights reserved. Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and other countries. Other names and brands may be claimed as the property of others. 323329-002US