
A STUDY OF APPLICATIONS FOR OPTICAL CIRCUIT-SWITCHED NETWORKS

A Thesis Presented to the faculty of the School of Engineering and Applied Science University of Virginia

In Partial Fulfillment of the requirements for the Degree Master of Science (Computer Science)

by

Xiuduan Fang
May 2006

APPROVAL SHEET
This thesis is submitted in partial fulfillment of the requirements for the degree of Master of Science (Computer Science)

Xiuduan Fang

This thesis has been read and approved by the examining committee:

Malathi Veeraraghavan (Advisor)

Marty Humphrey (Chair)

Alfred Weaver

Accepted for the School of Engineering and Applied Science:

Dean, School of Engineering and Applied Science

May 2006

Abstract

The networking community has made a significant investment in GMPLS networks, which are connection-oriented networks that support dynamic call-by-call bandwidth sharing. Currently, GMPLS switches are call blocking, and GMPLS control-plane protocols only support immediate requests for bandwidth. This thesis first addresses the question of the suitability of different types of applications for GMPLS networks. Using the Erlang-B formula, we reason that GMPLS networks are well suited for applications in which the required per-circuit bandwidth is on the order of one-hundredth of the shared link capacity. Then, we propose two applications for the GMPLS network, CHEETAH, which we have deployed as part of an NSF-sponsored project.

The first is a web transfer application, for which we design and implement a software package called WebFT. We integrate the CHEETAH end-host software modules into WebFT to provide deterministic data-transfer services transparently to users. The CHEETAH network provides connection-oriented services in addition to the connectionless service offered by the Internet. This add-on design allows the WebFT package to provide normal web access to non-CHEETAH clients through the Internet while simultaneously serving CHEETAH clients on dedicated circuits. The experiments conducted on the CHEETAH testbed show that WebFT can achieve low-variance, end-to-end transfer delays at different circuit rates and low transfer delays when high-speed circuits are possible.

The second application is parallel file transfers on CHEETAH. We identify two factors that limit file-transfer throughput on networks with a high bandwidth-delay product: TCP's congestion-control algorithm and end-host limitations. We propose a general cluster solution to overcome these two factors. The solution uses GridFTP striped transfer and Parallel Virtual File System, version 2 (PVFS2) to transfer data amongst multiple hosts in parallel over dedicated circuits. To minimize end-host network and disk contention, we modify the GridFTP and PVFS2 code such that each pair of sending and receiving hosts is responsible only for blocks located in its local disks, which results in improved throughput.

Acknowledgments

I am indebted to my advisor, Professor Malathi Veeraraghavan, for her consistent guidance and support. Professor Veeraraghavan has tirelessly guided me, teaching me how to do research in a systematic way. She has spent significant time on improving my writing skills. She has been and will always be an excellent role model for me. I am also grateful to all the other members of our research group, Dr. Xuan Zheng, Xiangfei Zhu, Zhanxiang Huang, Tao Li, and Anant P. Mudambi, for all their help. I am especially grateful to my grandmother, my parents, my brother Kevin, and my husband Lin for their continuous love and support. Without them, I could not have achieved what I have achieved today. Finally, this work was carried out under the sponsorship of NSF ITR-0312376, NSF EIN-0335190, and DOE DE-FG02-04ER25640 grants.

Contents

Acknowledgments

1 INTRODUCTION

2 BACKGROUND
   2.1 CO Networking
      2.1.1 CO Networks and GMPLS Control-Plane Protocols
      2.1.2 Existing Switches, Gateways, and Networks
   2.2 CHEETAH Network
      2.2.1 CHEETAH Concept and Network
      2.2.2 CHEETAH End-Host Software

3 ANALYTICAL MODELS OF GMPLS NETWORKS
   3.1 Bandwidth Sharing Model
      3.1.1 Model for Applications in which Call-Holding Time is Independent of Per-Circuit Bandwidth
      3.1.2 Model for Applications in which Call-Holding Time is Dependent on Per-Circuit Bandwidth
   3.2 Numerical Results
      3.2.1 Applications in which Call-Holding Time is Independent of Per-Circuit Bandwidth
      3.2.2 Applications in which Call-Holding Time is Dependent on Per-Circuit Bandwidth
   3.3 Conclusions

4 WEB TRANSFER APPLICATION ON CHEETAH
   4.1 WebFT Design
      4.1.1 WebFT Architecture
      4.1.2 CGI Scripts
      4.1.3 The WebFT Sender
      4.1.4 The WebFT Receiver
   4.2 Experimental Testbed and Results
   4.3 Conclusions

5 PARALLEL FILE TRANSFERS ON CHEETAH
   5.1 Introduction
   5.2 Background
      5.2.1 FTP and GridFTP
      5.2.2 PVFS2
   5.3 The Single-Host Solution
   5.4 The General-Case Cluster Solution
      5.4.1 The Splitting Degree
      5.4.2 Design
      5.4.3 Implementation: Modifications to PVFS2
      5.4.4 Implementation: Modifications to GridFTP
      5.4.5 Experimental Results
   5.5 The Specific Cluster Solution for TSI
   5.6 Conclusions

6 CONCLUSIONS AND FUTURE WORK
   6.1 Conclusions
   6.2 Future Work

Bibliography

List of Figures

2.1 Distributed call-setup process progressing hop-by-hop
2.2 CHEETAH concept
2.3 CHEETAH experimental testbed
2.4 CHEETAH end-host software
3.1 Call-based sharing model for any single link of a switch
3.2 A bandwidth sharing model for file transfers
3.3 Plots of Pb vs. m for U = 40%, 60%, 80%, and 90%
3.4 Plots of ρ vs. m and ρ/m vs. m
3.5 Plots of Pb vs. τ and U vs. τ for m = 10, 100, and 1000, Nλ₀ = 50 and 100, α = 1.1, and k = 1.25 MB
3.6 Plot of Nλ₀ vs. τ for m = 10, 100, and 1000, U = 60% and 80%, α = 1.1, and k = 1.25 MB
3.7 Plots of N vs. m for U = 40%, 60%, 80%, and 90%
4.1 WebFT architecture
4.2 The flow of events from running CGI scripts
4.3 The flow chart for the WebFT sender
4.4 CHEETAH testbed for WebFT
4.5 The web page to test WebFT
5.1 The single-host solution vs. the general-case cluster solution
5.2 The model and flow chart of third-party control
5.3 The model and flow chart of GridFTP striped transfer
5.4 PVFS system architecture
5.5 A model of using GridFTP partial file transfer to implement the transferring step
5.6 A model of using GridFTP striped transfer to implement the transferring step
5.7 A snippet of pvfs2-fs2.conf, the PVFS2 configuration file on sunfire6
5.8 A part of the output for pvfs2-fs-dump
5.9 The content of an s KB file
5.10 A part of the output for the command more testfile/pvfs2cp2 | grep connect
5.11 A part of the output of the command more testfile/pvfs2cp2 | grep writev | more
5.12 The pvfs2-fs-dump output for the test 1000M file
5.13 A snippet from the file pvfs2cp
5.14 A part of the output for the strace command
5.15 A snippet of the source code for PINT_cached_config_get_next_io()
5.16 The commands to start GridFTP servers on sunfire
5.17 A part of the debug output for the GridFTP striped transfer
5.18 The tcptrace outputs for GridFTP striped transfer before we modified GridFTP code
5.19 The tcptrace outputs for GridFTP striped transfer after we modified GridFTP code
5.20 The specific cluster solution for TSI

List of Tables

2.1 A classification of networks that reflects sharing modes
4.1 Average throughputs and delays at a variety of circuit rates
5.1 A summary of possible approaches to implement the general-case cluster solution
5.2 The logical server numbers for the physical I/O servers
5.3 The file descriptors and IP addresses for sunfire6 through sunfire10
5.4 The data-distribution pattern for /pvfs2/test 1000M

List of Abbreviations

API       application programming interface
AS        autonomous system
CHEETAH   Circuit-switched High-speed End-to-End Transport ArcHitecture
CGI       Common Gateway Interface
CL        connectionless
CN        compute node
CO        connection-oriented
C-TCP     Circuit-TCP
DNS       Domain Name Server
DRAGON    Dynamic Resource Allocation via GMPLS Optical Networks
FTP       File Transfer Protocol
GbE       Gigabit Ethernet
Gb/s      gigabit per second
GB        gigabyte
GFP       Generic Framing Procedure
GMPLS     Generalized Multiprotocol Label Switching
GPFS      General Parallel File System
GSR       Gigabit Switch Router
GT        Globus Toolkit
I/O       Input/Output
ION       I/O node
IP        Internet Protocol
KB        kilobyte
LAN       Local Area Network
LMP       Link Management Protocol
MAN       Metropolitan Area Network
Mb/s      megabit per second
MB        megabyte
MPLS      Multiprotocol Label Switching
MSPP      Multi-Service Provisioning Platform
MTU       Maximum Transmission Unit
NCSU      North Carolina State University
NFS       Network File System
NIC       network interface card
OC        Optical Carrier
OCS       Optical Connectivity Service
ORNL      Oak Ridge National Laboratory
PCIX      Peripheral Component Interconnect Extended
PVFS2     Parallel Virtual File System, version 2
QoS       Quality of Service
RAID      redundant array of inexpensive disks
RD        routing decision
RSVP-TE   Resource ReSerVation Protocol-Traffic Engineering
RTP       Research Triangle Park
RTT       round-trip delay time
SDM       Space Division Multiplexing
SLR       Southern Light Rail
SNMP      Simple Network Management Protocol
SONET     Synchronous Optical Network
SOX       Southern Crossroads
TB        terabyte
TCP       Transmission Control Protocol
TDM       Time Division Multiplexing
TE        traffic engineering
TSI       Terascale Supernova Initiative
VC        virtual circuit
VLSR      Virtual Label Switch Router
WAN       Wide Area Network
WDM       Wavelength Division Multiplexing

Chapter 1
INTRODUCTION

The networking community has made a significant investment in connection-oriented (CO) networking. Allowing the reservation of bandwidth in the form of a dedicated circuit, or virtual circuit (VC), through a CO network prior to data transfers, this networking mode is recognized for its ability to offer service guarantees at some cost in utilization and fairness. A number of optical CO testbeds, some of which use Generalized Multiprotocol Label Switching (GMPLS), have been deployed for research and educational purposes. These include CANARIE's CA*net 4 [11], OMNInet [34], SURFnet [49], UKLight [55], DOE's UltraScience Net [41], Dynamic Resource Allocation via GMPLS Optical Networks (DRAGON) [46], and Circuit-switched High-speed End-to-End Transport ArcHitecture (CHEETAH) [13]. Further software projects to enable the use of MPLS tunnels across Internet2 [26] and across the Department of Energy's ESnet [15] are also underway.

Most of these networks are primarily designed for large-scale scientific applications. Some of these applications require high-bandwidth circuits and long call-holding times. To create large-scale circuit or VC networks, we need to extend the usage of these networks beyond scientific applications to millions of users. Thus, we need to identify and design more applications to use these networks efficiently.

The first goal of this thesis is to determine what applications are well served by GMPLS networks, which currently only support immediate-request calls. We use the Erlang-B formula to analyze the suitability of different types of applications. The study of application suitability for GMPLS networks identifies applications suited to these networks in general, and specifically to the CHEETAH testbed.

Then, we study two applications for CHEETAH. The first is a web transfer application, where we present a solution to improve web performance by leveraging CHEETAH without requiring modifications to existing web server and client software. We implement a CGI-based software package called WebFT. WebFT is integrated with the CHEETAH end-host software modules to provide deterministic data-transfer services transparently to users. With dedicated circuits on CHEETAH, WebFT can achieve low-variance, end-to-end transfer delays at different circuit rates and low transfer delays when high-speed circuits are possible. The second application is parallel file transfers on CHEETAH, where we study how to achieve multi-Gb/s throughput for bulk data transfers over WANs. We identify two factors that limit throughput to hundreds of Mb/s: TCP's congestion-control algorithm and end-host limitations. Then, we present a cluster solution over dedicated circuits, using GridFTP striped transfer and Parallel Virtual File System, version 2 (PVFS2) to achieve multiple-host parallelism and, thus, improve overall throughput.

The rest of this thesis is organized as follows. In Chapter 2, we provide background information on a class of call-blocking CO networks and the CHEETAH experimental testbed. In Chapter 3, we explore the suitability of different types of applications for call-blocking CO networks. In Chapter 4, we design and implement a software package, called WebFT, to improve web performance through CHEETAH. In Chapter 5, we propose a cluster solution using GridFTP striped transfer and PVFS2 for parallel file transfers. Finally, we present our conclusions and list future-work items in Chapter 6.

Chapter 2
BACKGROUND

In this chapter, we first review different types of GMPLS networks and control-plane protocols. We point out that current GMPLS implementations use a call-blocking approach. Then, we briefly describe existing equipment and networks in which CO services can be enabled. Finally, we overview the CHEETAH network and CHEETAH end-host software, because all the work in this thesis has been conducted as part of the CHEETAH project.

2.1 CO Networking

Networks are commonly classified by scale into Local Area Networks (LANs), Metropolitan Area Networks (MANs), Wide Area Networks (WANs), wireless networks, home networks, and internetworks [50]. This classification, however, misses the critical aspect of networking: resource sharing. To reflect how resources are shared in networks, Veeraraghavan and Karol gave a classification of networks based on both switching type and networking type, as shown in Table 2.1 [56]. In this section, we focus on the CO networking mode and, more specifically, on a class of call-blocking GMPLS networks.

2.1.1 CO Networks and GMPLS Control-Plane Protocols


There are two types of CO networks: packet-switched and circuit-switched (see Table 2.1).

Table 2.1: A classification of networks that reflects sharing modes

Networking type \ Switching type   Packet-switched                         Circuit-switched
Connectionless                     e.g., IP networks; Ethernet networks    Not an option
Connection-oriented                e.g., X.25, ATM, MPLS                   e.g., Telephone network, SONET/SDH, WDM

Packet-switched CO networks include:

- Intserv IP networks [8]
- Multiprotocol Label Switched (MPLS) [42] and Asynchronous Transfer Mode (ATM) networks
- IEEE 802.1p and 802.1q Virtual LAN (VLAN) Ethernet switch based networks [25]

Circuit-switched networks include:

- Time-Division Multiplexed (TDM) SONET/SDH networks
- All-optical Wavelength Division Multiplexed (WDM) networks
- Space-Division Multiplexed (SDM) Ethernet switch based networks (an SDM connection is created by mapping two ports into an untagged VLAN)

The GMPLS control-plane protocols are defined as a common control plane for these different types of CO networks even though their data-plane protocols differ significantly. This common control plane consists of:

1. Link Management Protocol (LMP) [29]
2. Open Shortest Path First-Traffic Engineering (OSPF-TE) routing protocol [27]
3. Resource Reservation Protocol-Traffic Engineering (RSVP-TE) signaling protocol [3]


These three protocols are designed to be implemented in a control processor at each network switch. Each of these protocols provides an increasing degree of automation, and a corresponding decreasing dependence upon manual network administration. This triple combination serves as an excellent basis on which to create large-scale CO networks, in which switches can cooperate in a completely automated fashion to respond to requests for end-to-end bandwidth. We consider each protocol in a little more detail below, starting with LMP.

Primarily, the LMP module automatically establishes and manages the control channels between adjacent nodes, to discover and verify data-plane connectivity, and to correlate data-plane link properties. In GMPLS networks, there could be multiple data-plane links between two adjacent nodes, and the control channel could be established on a separate physical link from any of the data-plane links. A mechanism is required to automatically discover these data-plane links, verify their properties, combine them into a single traffic-engineering (TE) link, and correlate data-plane links to the control channel. Thus, LMP contributes to our plug-and-play goal for CO networks by minimizing manual administration.

The OSPF-TE routing protocol software module, located at a switch, enables the switch to send topology, reachability, and the loading conditions of its interfaces to other switches, and to receive corresponding information from them. This data-dissemination process allows the route computation module at the switch to determine the next-hop switch toward which to direct a connection setup (this module could be part of the signaling-protocol module or could be used to pre-compute routing data ahead of when call-setup requests arrive). As a routing protocol, its value in creating large-scale connectionless networks has already been observed with the success of the Internet. Admittedly, being a link-state protocol, it is only used intra-domain, that is, within the network of an organization, referred to as an autonomous system (AS). Even within this intra-domain context, it organizes the AS as a two-layer hierarchy, meaning that the AS is partitioned into self-contained areas interconnected by a backbone area. In conjunction with the distance-vector based inter-domain routing protocol, Border Gateway Protocol (BGP), we have a highly decentralized automated mechanism to spread routing information, which was critical to the scaling of the Internet.


Finally, an RSVP-TE signaling engine at a switch manages the bandwidth of all the interfaces on the switch, and programs the data-plane switch hardware to enable it to forward demultiplexed incoming user bits or packets as and when they arrive. Given that dynamic bandwidth sharing in CO networks is controlled by the signaling engine, the call-handling performance of this engine is critical to the scaling of CO networks. The faster the response times of signaling engines, the lower the cost to an application to release and reacquire bandwidth as and when needed. This allows applications to hold circuits only for the duration of their communication bursts, which, in turn, improves link utilization. The need for high call-handling performance from signaling engines can be met with a completely automated and distributed bandwidth-management implementation. This will allow for both temporal and spatial scalability (i.e., shorter call-holding times and networks with large numbers of switches and hosts).

An RSVP-TE engine implemented in a control card at a switch executes three steps when it receives a connection-setup Path message (i.e., a request for bandwidth), as shown in Fig. 2.1.
Figure 2.1: Distributed call-setup process progressing hop-by-hop (BW: bandwidth; D: destination address)

1. Route computation: Based on the destination address to which the connection is requested (D, in the example shown in Fig. 2.1), the RSVP-TE engine determines the next-hop switch toward which to route the connection, or a subset of switches on the end-to-end path within its area of its domain. Constrained Shortest Path First (CSPF) algorithms can only be executed intra-area because of the intra-area scope of bandwidth-related parameters in OSPF-TE messages.

2. Bandwidth and label management: If the switch is in a position to only compute the next-hop switch in the route-computation phase, then it needs to check if there is sufficient bandwidth on a link connected to the next-hop switch. If it performs CSPF to determine a part of the end-to-end route (i.e., the subset of switches on the path within its area of its domain), then this step of bandwidth management is integrated with the partial route computation. But at subsequent switches within the area, this step is required to check if there is sufficient bandwidth available on the link to the next hop indicated in the partial source route passed within the Path signaling message (see Fig. 2.1 for how Path messages travel hop-by-hop). This is because local conditions can change between the last routing-protocol update, which provided the data used in the CSPF computation, and the arrival of the call being set up. Typical implementations use a call-blocking approach, where calls are simply rejected if sufficient bandwidth is not available. Label management is the selection of labels to be used on incoming and outgoing switch interfaces. In the data plane, labels can be either explicit (e.g., labels used within packet headers in VC networks) or implicit (e.g., time slots, wavelengths, or interface identifiers in TDM, WDM, and SDM networks). In the control plane, labels are explicit for both types of switches, with the labels identifying the time slots, wavelengths, and interface identifiers to be used for the connection across a circuit switch. These labels are used in the next step.

3. Switch fabric configuration: This step is needed to configure the switch fabric to forward user data as and when they arrive. This function maps incoming labels associated with input interfaces to outgoing labels on appropriate outgoing interfaces. In packet switches, there is an additional step to program the scheduler to enable it to serve packets arriving on the VC being set up at the requested bandwidth level.


We do not show in Fig. 2.1 the rest of the call-setup procedure: the continuation of the Path message propagation hop-by-hop, and the Resv message returning in the opposite direction, which implicitly confirms successful connection setup. Detailed procedures are also defined in RSVP-TE for call-setup failure.

As mentioned in step 2, the bandwidth-management procedure implemented in most GMPLS switches is based on call blocking. In other words, if the requested bandwidth is not available when a call arrives, the call request is rejected. There is support for preemption, but if no existing call is preemptable (because of priority levels), then the call is blocked. The counterpart call-queuing model, though analyzed in textbooks [44], is seldom implemented. This is because a call traversing multiple links requires a simultaneous allocation of bandwidth on all these links. A distributed call-queuing model requires a call (an RSVP-TE Path message) to wait in a queue until resources become available at the first switch, and then to join a queue at the next switch in a hop-by-hop manner, as shown in Fig. 2.1. Resources allocated to a call at upstream switches will lie unused while the Path messages are queued at downstream switches. Parallelizing this wait time by simultaneously queuing the call at multiple switches will decrease the wasted bandwidth, but not eliminate it. Therefore, call queuing is seldom implemented.

The RSVP-TE and OSPF-TE control-plane protocols do not support advance reservations of bandwidth. For example, there are no objects defined in RSVP-TE to specify a future start time in a Path message. Nor are there parameters defined in OSPF-TE to report future loading conditions in the TE link-state advertisements. Hence, these GMPLS control-plane protocols only support immediate-request, or on-demand, calls.
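To make the call-blocking behavior described above concrete, the following minimal Python sketch models the bandwidth-management step at a single switch. It is an illustration only, not RSVP-TE code from any GMPLS implementation, and the class and method names are invented for this example: a request either reserves capacity on the link toward the next hop, or it is rejected outright, with no queuing and no advance reservation.

```python
class Link:
    """Bandwidth bookkeeping for one TE link of a call-blocking switch.
    Illustrative model only, not RSVP-TE code."""

    def __init__(self, capacity_mbps):
        self.capacity = capacity_mbps
        self.reserved = 0.0

    def admit(self, request_mbps):
        """Admit the call if enough bandwidth remains, else block it."""
        if self.reserved + request_mbps <= self.capacity:
            self.reserved += request_mbps
            return True          # proceed: pick labels, configure the fabric
        return False             # call blocked: no queuing, no book-ahead

    def release(self, request_mbps):
        self.reserved = max(0.0, self.reserved - request_mbps)


# Example: a 10 Gb/s link carved into 1 Gb/s circuits blocks the 11th call.
if __name__ == "__main__":
    link = Link(capacity_mbps=10_000)
    results = [link.admit(1_000) for _ in range(11)]
    print(results.count(True), "admitted,", results.count(False), "blocked")
```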

2.1.2 Existing Switches, Gateways, and Networks


The most common network switches today are Ethernet switches, IP routers, and SONET/SDH switches. The first two are primarily connectionless packet switches; however, Ethernet switches have VLAN capabilities with limited Quality of Service (QoS) support. A VLAN is constructed by programming the switch to include two or more ports. It can be tagged or untagged. In tagged mode, all Ethernet frames are tagged with a VLAN header that includes a VLAN ID. Frames tagged with the same VLAN ID are treated in the same manner; that is, they are forwarded to all the ports belonging to that VLAN. An untagged VLAN with two ports is essentially an SDM circuit because all Ethernet frames arriving on either port are sent exclusively to the other port. No frames arriving on other ports are forwarded to ports in an untagged VLAN. Ethernet switches available from Extreme Networks, Dell, Cisco, Intel, Foundry, and Force10, just to name a few vendors, have these capabilities. Thus, the data-plane capabilities required to create circuits or VCs through Ethernet switches are now available. However, control-plane software used to set up and release circuits dynamically is not implemented within these switches. The DRAGON project has developed a software module called the Virtual Label Switch Router (VLSR), which implements the RSVP-TE and OSPF-TE protocols. It runs on an external Linux host connected to the Ethernet switch [46] and manages the bandwidth of the switch. It issues Simple Network Management Protocol (SNMP) [7] commands to create the VLANs for admitted connections. With this external software, the Ethernet switches become fully equipped CO switches.

IP routers are equipped with MPLS engines and RSVP-TE signaling software for dynamic control of MPLS VCs. Both Cisco and Juniper routers support MPLS. SONET/SDH and WDM switches are circuit switches in which time slots and wavelengths, respectively, are mapped from incoming to outgoing interfaces. Some of these switches now support RSVP-TE and OSPF-TE control-plane implementations. For example, Sycamore SONET switches implement these protocols. Examples of WDM switches that implement GMPLS control-plane protocols include Movaz and Calient WDM equipment.

In addition to supporting pure CO-switching functionality, some of this equipment can be used as gateways to interconnect different types of networks. Before describing the gateway functionality of these pieces of equipment, we establish some terminology. We define the term network to consist of switches and endpoints (data-sourcing and -sinking entities) interconnected by shared communication links, on which the sharing (multiplexing) mechanism is the same on all links. Further, we define the term switch as an entity in which all links (interfaces) support the same (single) form of multiplexing (referred to as switching capability [45]). For example, a SONET switch is one in which all interfaces carry TDM signals formatted according to the SONET multiplexing standards, and a SONET network is one in which all the switches are SONET switches. Typical endpoints in a SONET network are IP routers with SONET line cards; these nodes are endpoints in the SONET network as they source and sink data carried on the SONET network.

We use the term internetwork to denote an interconnection of networks (referred to as multi-region networks [45]). Entities (nodes) that interconnect networks necessarily need the ability to support interfaces with different types of multiplexing capabilities, minimally two. We use the term gateways to refer to such nodes. An IP router is a gateway in the connectionless Internet, with different line cards implementing the protocols of the networks to which they are connected. The gateway functionality is achieved by the IP implementation within the router examining IP datagram headers to determine how to route a packet from an incoming network to an appropriate outgoing network. In contrast, gateways in a CO internetwork move data from one network to another using circuit or VC techniques. For example, Ethernet cards in a Sycamore SN16000 implement the Generic Framing Procedure (GFP) Ethernet-to-SONET encapsulation to map all frames received on any of its Ethernet ports into a port on a SONET line card, which connects this gateway node to a SONET network. In this scenario, the circuit is a simple SDM circuit. We thus refer to these gateways as circuit or VC gateways to contrast them with packet-based IP routers. An example of a VC gateway is a Cisco GSR 12008, which supports line cards that can be programmed to map all frames arriving on a specific VLAN into an MPLS tunnel set up on one of its other ports. It thus interconnects a VLAN-based CO network to an MPLS-based CO network. While the data-plane capabilities for extracting data from one type of multiplexed connection and sending it on to a different type of multiplexed connection are available, the control-plane capabilities for controlling such circuits or VCs are not yet standardized, and hence, not implemented.

Finally, as for current CO network deployments, SONET/SDH and WDM networks are already in widespread deployment. However, the dynamic bandwidth-provisioning capability supported by the GMPLS control-plane protocols, while available on some deployed switches, is not yet made available to users. Similarly, the Abilene backbone of Internet2 and DOE's ESnet have routers with built-in MPLS and RSVP-TE capabilities. There are ongoing research projects [22, 24] to enable the use of dynamically requested VCs through these networks, including CHEETAH [13], a SONET-based network, and DRAGON [46], a WDM-based network. Both CHEETAH and DRAGON are call-blocking and immediate-request GMPLS networks.

2.2 CHEETAH Network

Our research group has deployed the CHEETAH network as part of an NSF-sponsored project that aims to provide high-speed, end-to-end connectivity on a call-by-call basis. In this section, we review the CHEETAH concept and the current experimental testbed. We also describe the end-host software needed in CHEETAH-connected computers.

2.2.1 CHEETAH Concept and Network


CHEETAH is a networking solution that provides end-host applications access to end-to-end CO services, while preserving the connectionless services already available to them via the Internet. In other words, CHEETAH is designed as an add-on service to existing Internet connectivity, and further, it leverages the services of the latter. As shown in Fig. 2.2, end hosts are equipped with two Ethernet Network Interface Cards (NICs). The primary NICs (NIC I) in the end hosts are connected to the public Internet through the usual LAN Ethernet switches or IP routers, while the secondary NICs (NIC II) are connected to Ethernet ports on Ethernet-to-SONET circuit gateways.

Figure 2.2: CHEETAH concept

Ethernet-to-SONET circuit gateways, in turn, are connected to wide-area SONET circuit-switched networks, in which both circuit gateways and pure SONET switches are equipped with GMPLS protocols to support call-by-call dynamic bandwidth sharing. End-to-end CHEETAH circuits (shown as the dashed line in Fig. 2.2) are set up dynamically between end hosts, with RSVP-TE signaling messages being processed at each intermediate gateway or switch in a hop-by-hop manner.

The add-on design of the CHEETAH network brings two benefits:

1. Connectivity to the Internet allows a CHEETAH end host to communicate with other non-CHEETAH hosts on the Internet while it communicates with another CHEETAH end host through a dedicated CHEETAH circuit.

2. Applications can selectively choose to request CHEETAH circuits only when the Internet path is estimated to provide a lower service quality than the CHEETAH circuit, and can further fall back to the Internet path if the CHEETAH circuit-setup attempt fails due to an unavailability of circuit resources on the CHEETAH network.

Currently, the CHEETAH network consists of three Ethernet-to-SONET circuit gateways, which are Sycamore SN16000 switches, deployed at MCNC in Research Triangle Park (RTP), NC, Southern Crossroads (SOX) and Southern Light Rail (SLR) in Atlanta, GA, and Oak Ridge National Laboratory (ORNL) in Oak Ridge, TN. The testbed layout is shown in Fig. 2.3. Hosts, running Linux, are connected via Gigabit Ethernet (GbE) NICs to the SN16000 switches. The circuits, set up and released dynamically, consist of Ethernet segments from the hosts to the switches mapped to Ethernet-over-SONET segments between the switches. The GbE signal is mapped to a virtually concatenated SONET signal of 21 OC-1s to create an end-to-end 1 Gb/s dedicated circuit.

Figure 2.3: CHEETAH experimental testbed

2.2.2 CHEETAH End-Host Software


We have developed a software package for Linux hosts, called the CHEETAH end-host software, to enable the automatic use of CHEETAH circuits. Wherever possible, our goal is to integrate libraries of this CHEETAH end-host software into application software modules to make CHEETAH services transparent to human users. The CHEETAH end-host software architecture is shown in Fig. 2.4.

Figure 2.4: CHEETAH end-host software

The Optical Connectivity Service (OCS) client module is used to determine whether the correspondent end host (called party) is on the CHEETAH network. It does this by sending a TXT query to a Domain Name Server (DNS). The TXT resource record is a generic type supported by DNS that allows users to store arbitrary data about hosts. The TXT data we store for a CHEETAH end host consist of an indication that it is a CHEETAH end host, along with the IP and MAC addresses of the host's secondary NIC.

The routing decision (RD) module answers queries from applications as to whether to attempt a circuit setup. It makes these decisions by using collected measurements about the two paths, the Internet path and the CHEETAH path, along with the size of the file to be transferred.

The RSVP-TE client module is used to initiate the setup and release of CHEETAH circuits [59]. Parameters provided to this module include the secondary NIC IP address of the destination to which a circuit is being requested and the desired bandwidth. The Sycamore switches in the CHEETAH network receive these RSVP-TE messages, process them, and set up circuits if the requested bandwidth is available to the specified destination. It is a distributed switch-by-switch signaling procedure.

The Circuit-TCP (C-TCP) module is the transport protocol that we have developed for CHEETAH circuits [33]. Given that the bandwidth of a dedicated circuit is known before a file transfer starts, any change in the sending rate will either cause the circuit to remain idle or cause the receiver buffer to fill up. Since neither option is desirable, we essentially removed the congestion-control algorithms of TCP, which were designed to keep adjusting the sending rate based on IP network conditions, to create our C-TCP module. This disabling of congestion control is selectively done only by TCP connections traversing the secondary NIC, which is used for CHEETAH circuits. TCP connections traversing the primary NIC connected to the Internet continue using the standard TCP code.

Corresponding to each CHEETAH software module is a library providing application programming interfaces (APIs) to invoke the services of that module. These libraries are expected to be linked into applications using the CHEETAH software and network.
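As an illustration of the kind of TXT lookup the OCS client performs, the following Python sketch queries DNS for a host's TXT record and parses it. It assumes the third-party dnspython package, and the record format shown ("cheetah=1 ip=... mac=...") is hypothetical; the actual OCS client and its record encoding are part of the CHEETAH end-host software and are not reproduced here.

```python
# Minimal sketch of an OCS-style lookup: query DNS TXT records to decide
# whether a correspondent host is CHEETAH-capable. Hypothetical record format.
import dns.resolver  # dnspython, assumed to be installed

def ocs_lookup(hostname):
    try:
        answers = dns.resolver.resolve(hostname, "TXT")
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return None  # no TXT record: treat the host as non-CHEETAH
    for rdata in answers:
        txt = b"".join(rdata.strings).decode()
        fields = dict(kv.split("=", 1) for kv in txt.split() if "=" in kv)
        if fields.get("cheetah") == "1":
            # Return the secondary-NIC addresses advertised for this host.
            return {"ip": fields.get("ip"), "mac": fields.get("mac")}
    return None

if __name__ == "__main__":
    print(ocs_lookup("zelda1.example.edu"))  # hypothetical host name
```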

Chapter 3
ANALYTICAL MODELS OF GMPLS NETWORKS

In Chapter 2, we reasoned that GMPLS networks are call-blocking networks that only support immediate-request calls. An important question is what applications, if any, are suitable for GMPLS networks. This chapter addresses this problem. First, we present bandwidth-sharing models for two types of applications: ones in which the per-circuit bandwidth and mean call-holding time are independent, and ones in which they are dependent (file transfers). Then, we provide numerical results for both models. Finally, we conclude that GMPLS networks are well suited for applications in which the required per-circuit bandwidth is on the order of one-hundredth of the shared link capacity, for both types of applications.

3.1 Bandwidth Sharing Model

The switch model used in our analysis is illustrated in Fig. 3.1, in which calls originating from hosts on the N links (e.g., the N Ethernet links connecting hosts to Ethernet interfaces on a gateway) share the link capacity C on link L (e.g., the SONET/SDH/WDM/MPLS link out of a gateway).

Figure 3.1: Call-based sharing model for any single link of a switch

We assume that call-setup requests arrive according to a Poisson process with rate λ, since many call-arrival processes observable in practice can be modeled as Poisson processes [44]. Further, we assume that call-holding times follow arbitrary distributions, with the mean call-holding time denoted as 1/μ.

To understand the types of applications that can be supported on GMPLS circuit-switched networks, we make the simplifying assumption that all calls are of the same type; that is, they need the same amount of bandwidth. This allows us to treat link L as a link of m circuits, where each circuit is of capacity C/m. We ask two questions about the suitability of applications for GMPLS networks:

1. Are applications that require high-bandwidth circuits more or less desirable than applications that require low-bandwidth circuits? (In this chapter, we only use the word circuits, but the same model and analysis hold for virtual circuits as well.)

2. Are applications that generate calls with long mean holding times more or less desirable than calls with short mean holding times?

The first question is related to m, the number of circuits. The larger the per-circuit bandwidth, the smaller the m for a given link capacity C. The second question is related to the mean call-holding time, 1/μ. For applications such as remote visualization and video conferencing, the mean holding time is independent of the per-circuit bandwidth. On the other hand, for file transfers, commonly identified as an application suitable for high-speed circuits [57], m and 1/μ are related: the larger the per-circuit bandwidth (the smaller the m), the lower the mean call-holding time, 1/μ. We describe models for these two cases in the following subsections.

3.1.1 Model for Applications in which Call-Holding Time is Independent of Per-Circuit Bandwidth

Given our assumptions, we can model link L as an M/G/m/m system [44]. The call-blocking probability in this model is given by the well-known Erlang-B formula:

P_b = \frac{\rho^m / m!}{\sum_{i=0}^{m} \rho^i / i!} \qquad (3.1)

where ρ, the offered traffic load, is given by ρ = λ/μ. Although this is a time-tested model for telephony traffic, we found it useful for our current problem of identifying applications suited to GMPLS networks.

Assume that the number of calls per second arriving on each of the N ports that are destined for link L is λ′. Thus, from Fig. 3.1, the aggregate call-arrival rate for link L, λ, is given by:

\lambda = N \lambda' \qquad (3.2)

The utilization of link L, U, is given by:

U = \frac{\rho (1 - P_b)}{m} \qquad (3.3)
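The following short Python sketch (not part of the thesis software) evaluates the Erlang-B formula (3.1), using its standard recursive form to avoid factorial overflow, together with the utilization expression (3.3). It can be used to reproduce the kind of numerical results reported in Section 3.2; the example parameters are chosen only for illustration.

```python
def erlang_b(rho, m):
    """Call-blocking probability P_b from the Erlang-B formula (3.1).
    Uses the standard recursion to avoid computing large factorials."""
    b = 1.0
    for i in range(1, m + 1):
        b = (rho * b) / (i + rho * b)
    return b

def utilization(rho, m):
    """Link utilization U from (3.3): carried load over the number of circuits."""
    return rho * (1.0 - erlang_b(rho, m)) / m

# Example: a 10 Gb/s link divided into m = 10 circuits of 1 Gb/s each,
# offered an illustrative load of rho = 8 Erlangs.
if __name__ == "__main__":
    rho, m = 8.0, 10
    print(f"P_b = {erlang_b(rho, m):.4f}, U = {utilization(rho, m):.4f}")
```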

3.1.2 Model for Applications in which Call-Holding Time is Dependent on Per-Circuit Bandwidth

File-transfer applications belong in this category. Given that the GMPLS switch operates in a call-blocking mode even when used for this category of applications, equations (3.1)-(3.3) apply here as well. If file sizes are too small, the overhead incurred in call-setup delay will significantly reduce link utilization (since call-setup delays could exceed file-transfer delays). Therefore, Veeraraghavan's team [57] proposed using an RD module at end hosts to decide, based on the file size and other metrics, whether to request a circuit for a particular file transfer or whether to simply use the Internet connectivity.

Fig. 3.2 illustrates a model for the file-transfer application. We use a settable parameter, the crossover file size τ, to model the behavior of the RD module, wherein files larger than τ are routed to the CO network.

Figure 3.2: A bandwidth sharing model for file transfers

We assume that file sizes are distributed according to the Pareto distribution with the probability density function:

f(x) = \frac{\alpha k^{\alpha}}{x^{\alpha+1}}, \quad x \ge k \qquad (3.4)

where α is the shape parameter (the larger the α, the higher the probability of small file sizes), and k is the scale parameter, denoting the minimum file size. Crovella [14] characterized web file sizes as following this distribution and suggested α in the range from 1.0 to 1.3 and a value for k of 1000 bytes.

Given that only files larger than τ are routed to the CO network, using (3.4), we derive the mean file size, E[X | X ≥ τ], as

E[X \mid X \ge \tau] = \frac{\alpha \tau}{\alpha - 1} \qquad (3.5)

We then estimate the mean call-holding time, 1/μ, as

\frac{1}{\mu} = T_{prop} + E[T_{emission}] \qquad (3.6)

where T_prop is the one-way propagation delay, and

E[T_{emission}] = \frac{E[X \mid X \ge \tau]}{C/m} = \frac{\alpha \tau}{\alpha - 1} \cdot \frac{m}{C} \qquad (3.7)

By neglecting T_prop, we can approximate:

\frac{1}{\mu m} = \frac{\alpha \tau}{(\alpha - 1) C} \qquad (3.8)

capturing the inter-dependence of m and 1/μ. We justify neglecting T_prop as follows: E[T_emission] should be larger than T_prop because the latter is incurred as part of the call-setup delay, and to maintain high link utilization, the mean call-setup delay should be much smaller than E[T_emission], which means that T_prop is much smaller than E[T_emission].

From Fig. 3.2, we can derive the call-arrival rate at link L as:

\lambda = N \lambda' = N \lambda_0 P(X \ge \tau) = N \lambda_0 \left(\frac{k}{\tau}\right)^{\alpha} \qquad (3.9)

where λ₀ is the per-host arrival rate of file-transfer requests of all sizes. Combining (3.9) with the mean holding time from (3.8), we get

\rho = \frac{\lambda}{\mu} = N \lambda_0 \, \frac{\alpha}{\alpha - 1} \, \frac{k^{\alpha}}{\tau^{\alpha - 1}} \, \frac{m}{C} \qquad (3.10)
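A minimal Python sketch of the file-transfer model follows; it simply evaluates (3.8)-(3.10) together with the Erlang-B formula (3.1). The parameter values in the example are those used later in the text (α = 1.1, k = 1.25 MB, m = 100 on a 10 Gb/s link, an aggregate request rate of 100 calls/s, and a crossover file size of 8 MB); treating 1 MB as 8 × 10⁶ bits is an assumption made only for this example.

```python
def erlang_b(rho, m):
    """Erlang-B blocking probability, eq. (3.1), in recursive form."""
    b = 1.0
    for i in range(1, m + 1):
        b = rho * b / (i + rho * b)
    return b

def file_transfer_load(N_lambda0, tau, alpha, k, m, C):
    """Offered load rho for the file-transfer model of Section 3.1.2.
    N_lambda0: aggregate arrival rate of file requests of all sizes (calls/s);
    tau, k: crossover and minimum file sizes in bits; C: link capacity in bit/s."""
    lam = N_lambda0 * (k / tau) ** alpha              # eq. (3.9)
    holding = alpha * tau * m / ((alpha - 1.0) * C)   # eq. (3.8), T_prop neglected
    return lam * holding                              # rho = lambda / mu

if __name__ == "__main__":
    MB = 8e6  # 1 MB taken as 8e6 bits for this example
    rho = file_transfer_load(N_lambda0=100.0, tau=8 * MB, alpha=1.1,
                             k=1.25 * MB, m=100, C=10e9)
    print(f"rho = {rho:.1f} Erlangs, P_b = {erlang_b(rho, 100):.4f}")
```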

3.2 Numerical Results

3.2.1 Applications in which Call-Holding Time is Independent of Per-Circuit Bandwidth


Assume that the link capacity C = 10 Gb/s. This is a reasonable value if the switch is a SONET or MPLS switch. For WDM switches, if the number of wavelengths on link L is 100, then a more reasonable value for C would be 1 Tb/s, because each wavelength is typically engineered to support 10 Gb/s. We will consider this number later in this chapter. For now, we consider C = 10 Gb/s. We study the effect of changing m from 1 to 1000; in other words, the per-circuit bandwidth varies inversely, from 10 Gb/s down to 10 Mb/s. We obtain numerical results corresponding to four different fixed values of U: 40%, 60%, 80%, and 90%.

Since we have two equations, (3.1) and (3.3), if we fix two parameters, U and m, then the other two variables, ρ and Pb, become fixed as well. We use the following iterative algorithm to obtain these values. First, we observe that for a given m, U increases as ρ increases. We also conducted experiments to confirm this observation. Then, we start by assigning ρ = m temporarily, and compute the corresponding Pb and U. If the current U is larger than the given U, meaning that ρ is too large, we decrease ρ by δ = 0.001 until the corresponding U in the current iteration is smaller than the given U; otherwise, we increase ρ by δ until the corresponding U in the current iteration is larger than the given U. Next, we compare the current U and its neighbor from the previous iteration to pick the one closest to the given U for this m. Finally, we compute the corresponding Pb. Fig. 3.3 plots Pb vs. m.
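A sketch of this search in Python is shown below. It relies on the same monotonicity observation (U increases with ρ for a given m) but uses bisection instead of the fixed 0.001 step described above; the bisection is an implementation choice of this example, not of the thesis.

```python
def erlang_b(rho, m):
    """Erlang-B blocking probability, eq. (3.1), in recursive form."""
    b = 1.0
    for i in range(1, m + 1):
        b = rho * b / (i + rho * b)
    return b

def utilization(rho, m):
    """Link utilization U from eq. (3.3)."""
    return rho * (1.0 - erlang_b(rho, m)) / m

def rho_for_target_U(U_target, m, tol=1e-6):
    """Offered load rho that meets a target utilization for m circuits,
    found by bisection (U is increasing in rho for a given m)."""
    lo, hi = 0.0, float(m)
    while utilization(hi, m) < U_target:   # bracket the target
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if utilization(mid, m) < U_target else (lo, mid)
    return 0.5 * (lo + hi)

# Example: operating points for U = 80% (cf. Figs. 3.3 and 3.4).
if __name__ == "__main__":
    for m in (10, 100, 1000):
        rho = rho_for_target_U(0.80, m)
        print(f"m={m:5d}  rho={rho:8.2f}  P_b={erlang_b(rho, m):.4f}")
```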

Figure 3.3: Plots of Pb vs. m for U = 40%, 60%, 80%, and 90%. (a) m ∈ [1, 100]; (b) m ∈ [101, 1000].

From Fig. 3.3a, we see that at small values of m, it is hard to achieve high utilization combined with low call-blocking probability. Consider m = 10, which corresponds to a per-circuit allocation of 1 Gb/s per call (e.g., for HDTV applications). To run the link at an 80% utilization level, the corresponding call-blocking probability will be a high 23.62%. In Fig. 3.3b, we show the effect of large m, at which values both high utilization and low call-blocking probability are achievable.

The effect of traffic load is not obvious from Fig. 3.3. Therefore, we plot the traffic load ρ vs. m and ρ/m vs. m in Fig. 3.4.

Figure 3.4: Plots of ρ vs. m and ρ/m vs. m

From Fig. 3.4a, we see that ρ should be engineered to be high


when m is high. We also see that, as m increases, Pb decreases and ρ/m approaches U, according to (3.3). For example, when U = 60%, ρ/m approaches 0.6, reaching this value when m = 80. Thus, ρ is typically close to, and less than, m when Pb is low (close to 0) and U is high (close to 1). For example, at a fixed value of U = 80%, when m = 100, ρ = 80.35 and Pb = 0.4%, and when m = 1000, ρ = 800 and Pb ≈ 0.

From the two graphs (Figs. 3.3 and 3.4) we see that if we want to operate the link at a given value of call-blocking probability and a given value of utilization, the number of circuits, m, and the traffic load, ρ, become fixed. An alternative starting point is that a given application has a fixed capacity requirement, which means that m is fixed. If we further assume that λ′, the call-arrival rate per port, and the mean call-holding time, 1/μ, are intrinsic to the application, then we can only adjust the aggregate traffic load by engineering N to achieve a given call-blocking probability or utilization. But these graphs show us that once m is set, if m is small, we are highly limited in our ability to achieve both high utilization and low call-blocking probability.

Having understood the influences of all the important variables in this model, ρ, m, Pb, and U, let us now consider three applications. The first application is a high-bandwidth application (m = 10), the second a low-bandwidth application (m = 1000), and finally an intermediate-level bandwidth application (m = 100).

High-bandwidth applications: When m = 10, that is, when the application requires a per-circuit bandwidth of 1 Gb/s, we can achieve a target 80% utilization only by operating the link at a high call-blocking probability of 23.62%. Such a high call-blocking probability could be unacceptable to users. We conclude that applications requiring a high per-circuit capacity relative to the shared link capacity are unsuitable for the immediate-request, call-blocking mode of bandwidth sharing offered by GMPLS networks in situations where high utilization and low call-blocking probability are important. Since, as discussed in Section 2.1.1, call queuing is not an option, it appears that we need a book-ahead mechanism for such applications.

We then ask whether the above answer is dependent on the mean call-holding time. In other words, when m is small, do we require a book-ahead mechanism only if the mean call-holding time is large, or do we need such a mechanism even if the mean call-holding time is small? For example,


in a doctor's office, where there are three to four doctors per office (m is 3 or 4), since the mean holding times (appointment lengths) are fairly high, on the order of 20-30 minutes, we use a book-ahead mechanism. If the mean holding time is on the order of 1-2 minutes (e.g., at a bank teller), could an immediate-request approach work? The answer is that it would if there were space to wait. In other words, if the queuing system has a buffer in which to wait, high-bandwidth calls that have short mean holding times could be handled without a reservation system. Unfortunately, as explained in Section 2.1.1, queuing models are not suitable for calls. Therefore, for applications that require high bandwidth (i.e., m is small), irrespective of the mean call-holding time, our conclusion of needing a book-ahead mechanism holds.

Low-bandwidth applications: At the other extreme, consider large values of m, say m = 500 to m = 1000. For example, in a video-telephony application with motion-JPEG cameras operating at 25 frames/sec (motion-JPEG used instead of MPEG to meet the stringent delay requirements of telephony), we could allocate 10 Mb/s on an MPLS-shared 10 Gb/s link, in which case m = 1000. At these high values of m, a call-blocking probability of almost 0 and utilization levels close to 1 are achievable, as seen in Fig. 3.3b; however, the required traffic load is high (close to m), as noted in our analysis of Fig. 3.4. Whether and how such traffic loads can be engineered depends upon the second important factor, the mean call-holding time. At a traffic load ρ = 500, if the mean call-holding time is small (say 3 minutes for a video-telephony call, which is the number typically quoted as the mean duration of telephony calls), the aggregate call-arrival rate, λ, needs to be about 2.8 calls/sec. Say on average each end host makes 1 call every two hours, which means λ′ in (3.2) is about 0.5 calls/hour. This means that we need N to be 20,160 to obtain an aggregate ρ of 500 Erlangs. In other words, we need calls from about 20,000 end hosts to be multiplexed (perhaps through a multi-level hierarchy of switches) into the switch shown in Fig. 3.1, destined to share link L's capacity. This is a high level of aggregation, requiring switches with large numbers of ports. Since line cards (the more the ports, the more the line cards) drive up the cost of switches, our conclusion is that to achieve a high utilization with low-bandwidth applications that have short durations and low call-arrival rates, we need to equip the switch with a large number of line cards to generate sufficient traffic, which could be expensive.
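The port-count arithmetic in this example can be captured in a few lines. The Python sketch below (illustrative only) rearranges (3.2) as N = λ/λ′ with λ = ρμ; the exact figures quoted in the text (20,160 and 540) reflect the text's own rounding of the intermediate call-arrival rates, so the sketch yields numbers of the same order rather than identical values.

```python
def ports_needed(rho, mean_holding_time_s, per_port_calls_per_s):
    """Number of switch ports N needed to offer a load of rho Erlangs:
    lambda = rho * mu, then N = lambda / lambda'."""
    aggregate_rate = rho / mean_holding_time_s          # calls/s
    return aggregate_rate / per_port_calls_per_s

# rho = 500 Erlangs, one call per host every two hours (lambda' = 1/7200 calls/s):
# 3-minute calls need on the order of 20,000 ports, while 2-hour calls need
# only a few hundred ports for the same offered load.
if __name__ == "__main__":
    print(round(ports_needed(500, 3 * 60, 1 / 7200)))     # short holding time
    print(round(ports_needed(500, 2 * 3600, 1 / 7200)))   # long holding time
```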

Consider what happens if the mean call-holding time, 1/μ, is larger, say 2 hours, and the mean call-arrival rate is still low at 1 call per 2 hours. This means the number of ports, N, feeding traffic into the shared link can be 540. Building switches with this order of line cards is more feasible. We thus conclude that the immediate-request, call-blocking mode of bandwidth sharing in GMPLS networks can be used for low-bandwidth applications that have relatively long durations and low call-arrival rates. There is an upper limit on the mean call-holding time, because if it is very large, then unless the call-arrival rate is very low, ρ will become very large, causing a high call-blocking probability.

Intermediate-bandwidth applications: Finally, consider an intermediate level, where m is in the range of 100. As seen from Fig. 3.3, call-blocking probabilities are very small when m = 100, even at utilizations of 90%. Now consider the question of mean call-holding times. If we again use the video-conferencing application or eScience remote-visualization applications, where the per-circuit bandwidth is 100 Mb/s on a 10 Gb/s link (which means m = 100), and mean call-holding times are in the 2-hour range, the required aggregate call-arrival rate is 40 per hour. If each port of the switch offers a load of 1 call per 5 hours, we need N to be 200, which is an acceptable number from a switch-cost perspective. Clearly, the higher the mean holding time, the smaller the N, and hence, the more preferable the application. This result again is surprising: calls with long holding times are preferable to calls with short holding times in a call-blocking mode of operation.

In summary, applications suitable for present-day GMPLS networks are those in which the per-circuit capacity is 1/100th of the shared link capacity and that have holding times on the order of tens of minutes or higher.

3.2.2 Applications in which Call-Holding Time is Dependent on Per-Circuit Bandwidth


As described in the model in Section 3.1.2, 1/(μm) is constant if we neglect Tprop, and hence the two questions raised at the start of Section 3.1 seem to reduce to one question. But if we study the system at certain fixed values of m, say m = 10, 100, 1000 (as in Section 3.2.1), we have a new parameter, τ, the crossover file size, with which to manipulate the mean call-holding time 1/μ.


Therefore, in this section, we study the effect of τ on various metrics, such as ρ, Pb, U, and Nλ₀, which represents the total call-arrival rate for all files whose sizes are greater than k. Fig. 3.5 plots the two metrics, Pb and U, against τ for fixed values of m and Nλ₀. The influence of τ on ρ is interesting because two factors operate in opposing directions. As τ increases, at a given m, the mean call-holding time, 1/μ, increases. But from (3.9), we see that λ is proportional to τ^(−α) and hence decreases as τ increases. Since α is larger than 1, λ decreases at a rate faster than 1/μ increases. As a result, ρ decreases with increasing τ. Decreasing ρ is the reason why Pb and U drop with increasing τ.
Figure 3.5: Plots of Pb and U vs. the crossover file size for m = 10, 100, and 1000, Nλ0 = 50 and 100, α = 1.1, and k = 1.25 MB

In Fig. 3.5, we hold Nλ0 constant. But to see the effect of the crossover file size on the required call-arrival rate, we plot Nλ0 against the crossover file size for a set of given U in Fig. 3.6. From (3.10), we see that Nλ0 is proportional to a positive power of the crossover file size; therefore, Nλ0 increases as the crossover file size increases. From this set of graphs, we see that we should select a smaller crossover file size so that the required Nλ0 is not too large. If Nλ0 is large and the per-host call-arrival rate, λ0, is low, it means that we need to engineer our switches with a large number of ports. Another interesting result seen in this set of plots is that, unlike the results in Section 3.2.1, where the required traffic load increases as m is increased, here we see in Fig. 3.6 that, as m increases, the required load Nλ0 decreases.

Figure 3.6: Plot of Nλ0 vs. the crossover file size for m = 10, 100, and 1000, U = 60% and 80%, α = 1.1, and k = 1.25 MB

We further plot Fig. 3.7 to contrast the effects of m on N for non-file-transfer applications and file-transfer applications by fixing U and the per-host call-arrival rate. As shown in Fig. 3.3, the required traffic load increases as m increases. For non-file-transfer applications, since m and 1/μ are independent and 1/μ is constant, λ and N increase with the increasing load. We can also derive that the trend of N vs. m is the same as that of the traffic load vs. m (see Fig. 3.4a and Fig. 3.7a).

In other words, for m at a small value, the curve has a higher slope than that for m at a large value. In particular, for m at a high value, the curve has an approximately constant slope of Uμ/λ0, where λ0 denotes the per-host call-arrival rate (see Fig. 3.7a). But for file-transfer applications, 1/(mμ) is a constant for a fixed crossover file size, C, and α. From (3.10), we can see that the trend of N vs. m is the same as that of the traffic load divided by m, as shown in Fig. 3.4b. In particular, for large m, the curve of N vs. m is flat for a given U (see Fig. 3.7b). Thus, for file transfers, we can allocate smaller amounts of bandwidth per call, which means that m can be larger, achieving a lower Pb and a higher U without increasing N, if the user can tolerate the longer holding time.

Figure 3.7: Plots of N vs. m for U = 40%, 60%, 80%, and 90%: (a) non-file-transfer applications with a per-host call-arrival rate of 0.5 call/s and 1/μ = 0.8 s; (b) file-transfer applications with λ0 = 0.5 call/s, α = 1.1, k = 1.25 MB, and a crossover file size of 8 MB

Repeating the questions asked in Section 3.2.1, we consider whether high-bandwidth circuits can be used for file transfers. We reach the same answer as in Section 3.2.1 if m = 10. Fig. 3.5 shows that the call-blocking probability is quite high (at 10% even at a large crossover file size) when m = 10. Furthermore, Fig. 3.6 shows that a higher Nλ0 load is required to achieve a certain U when m = 10 than when m is larger. Therefore, we conclude that high-bandwidth circuits, such as m = 10, are not suitable even for the file-transfer application, unless latency requirements dictate their use. We see from Fig. 3.5 that using low-bandwidth circuits (m = 1000) does not reduce Pb or increase U significantly if appropriate values of the crossover file size are selected, although it does not increase N either (see Fig. 3.7b). Given the natural advantage of lower delay when using a lower m for file transfers, we focus the rest of our analysis on the intermediate-bandwidth m = 100 case.

Now we consider the question of what crossover file size to select when m = 100. From Fig. 3.5, we see that the crossover file size should be in the range from 6 MB to 29 MB to meet a utilization higher than 80% and a call-blocking probability lower than 5%. We observe that the crossover file size cannot be too large, because if it is, then U decreases and the required call-arrival rate, Nλ0, becomes large, as seen in Fig. 3.6. On the other hand, if it is too small, then Pb becomes too high. To achieve a low call-blocking probability and high utilization, just as we needed to choose a fairly large m (e.g., m = 100) in Section 3.2.1, here we see the need for a fairly high call-arrival rate, Nλ0 (e.g., Nλ0 = 100). At an aggregate value Nλ0 of 100 calls/sec, we also see that the crossover file size should be in the range from 6 MB to 29 MB. This means that the mean holding time is in the range of 0.5 s to 2.3 s, since the per-circuit rate is 100 Mb/s when m = 100. These mean call-holding times are significantly smaller than the numbers we considered in Section 3.2.1, where even a mean call-holding time of 3 minutes results in a need for a large number of ports. We see from Fig. 3.5 that lowering Nλ0 can lower utilization significantly. To engineer an Nλ0 rate of 100 calls/sec, if λ0 is 1 call every 10 s, we require N to be 1000. This is not a small number, and it requires a cascade of switches to build up this load. For example, if the bottleneck link is an enterprise access link, it requires multiple levels of aggregation from switches internal to the enterprise, whose links can be run at lower utilization levels, so that the aggregate traffic load on the enterprise access link is high enough to achieve a high utilization at an acceptable Pb. Next, we note that the very low mean call-holding times require high-speed signaling engines to reduce call-setup delays so that they approach round-trip propagation delays, and thus the circuit utilization is high. Our work on hardware-accelerated signaling [58] shows the feasibility of implementing an RSVP-TE subset in hardware, which reduces per-switch call-processing delays from the 100 ms range we measured on Sycamore switches to the order of microseconds. Finally, we note that, although a link capacity of 10 Gb/s is appropriate for SONET/SDH and MPLS shared links, it is low for a WDM link. If we assume that the shared link supports 100 wavelengths at a typical data rate of 10 Gb/s each, the link capacity is 1 Tb/s and the per-circuit bandwidth is 10 Gb/s. Media-immersive applications could consume such high levels of end-to-end capacity (the category of applications in which the mean call-holding time is independent of m), but for the file-transfer application, file sizes would have to increase significantly to make WDM networks with GMPLS control-plane protocols usable for file transfers.

3.3 Conclusions

In this chapter, we analyzed the call-blocking mode of operation to determine the types of applications suitable for GMPLS networks by dividing them into two categories: those for which the per-circuit capacity is independent of the holding time, and those for which these two variables are directly related, such as file transfers. We concluded the following for the first category. First, applications that require high-bandwidth circuits relative to the link capacity (e.g., where the ratio is one-tenth, say 1 Gb/s circuits on a 10 Gb/s link) are not suitable. Second, applications that require low-bandwidth circuits but have short holding times (on the order of a few minutes) require a high degree of aggregation, leading to expenses from large numbers of line cards. Ideal applications require on the order of one-hundredth the link capacity as per-circuit rates and have long holding times. For the second category of applications, we found that the first conclusion from the first category still holds; however, the second does not, because the number of line cards remains almost constant for m at a high value. In this category of applications, we also found that calls need to have very short call-holding times (on the order of seconds).

Chapter 4
WEB TRANSFER APPLICATION ON CHEETAH

In this chapter, we describe our implementation of a software package, called WebFT, as an application for CHEETAH [16]. WebFT accomplishes web transfers across CHEETAH without changing existing web-client and web-server software by integrating the CHEETAH end-host software modules into Common Gateway Interface (CGI) scripts and other external modules. The main reasons why we chose web transfers as a showcase for CHEETAH are three-fold. First, web-based applications have become ubiquitous [19] and there is significant interest in improving web performance. Although solutions such as web caching focus on the problems of overloaded web servers [9, 17], we focus on improving network performance. Second, according to the analysis of Chapter 3, the CHEETAH network can be operated at a low call-blocking probability and a high utilization if circuits are on the order of one-hundredth the shared link capacity, for example, 100 Mb/s on a 10 Gb/s link, and a circuit of 100 Mb/s is suitable for either many small web file transfers or a single bulk web transfer. Third, many new types of web-based applications, such as large-file downloads, high-quality video streaming, and remote visualization, require high-throughput, low-jitter, and deterministic data transfers. These applications need QoS-guaranteed network connectivity. The connectionless sharing mode of the current Internet is inadequate to provide such connectivity. We contend that the lack of rate-guaranteed network connectivity is hindering these web-based applications from being developed and deployed. An answer to this need lies in some of the newer networking technologies, for example, the CO networking technologies currently under development and deployment. CO networks, such as CHEETAH and DRAGON, allow for the reservation of bandwidth in the form of a dedicated circuit or VC through the network prior to data transfer. This chapter examines how we can leverage these new CO technologies to improve the performance of web applications. We first describe the WebFT software design and implementation. Then, we present our experimental results and show that WebFT can achieve low-variance, end-to-end transfer delays at different circuit rates and low transfer delays when high-speed circuits are possible.

4.1 WebFT Design

A primary goal of the WebFT software design is to provide deterministic data-transfer services to clients connected to a web server via the CHEETAH network. WebFT leverages the coexistence of two paths between a web client and a web server: one through the Internet and one through the CHEETAH network. It allows clients that have network connectivity to the circuit-switched CHEETAH network to connect to the WebFT server and download web content (e.g., large files or streamed video) through dedicated end-to-end circuits, while simultaneously providing normal web access to other, non-CHEETAH clients through the Internet. The dedicated nature of the circuits allows user data to be streamed unhindered from a web server to a web client via the CHEETAH network. This results in low-variance transfer delays. Another goal of the WebFT software design is not to impose any special requirements with regard to the operating system or the web server or client software packages executed on the client and server hosts. We leverage the CGI technology to achieve this goal [32].

4.1.1 WebFT Architecture


The WebFT architecture is shown in Fig. 4.1. On the web server side, WebFT includes two CGI scripts, download.cgi and redirection.cgi, and a process called the WebFT sender. Download.cgi is embedded into web pages as a hyperlink, with the name of the file to be served as a parameter. When the user clicks the download.cgi hyperlink on the web page through any typical web client, the web server receives an HTTP message that causes download.cgi to be initiated. Download.cgi, in turn, initiates the WebFT sender process, which communicates with the WebFT receiver process on the client host to transfer the data from the server side to the client side. By leveraging the CGI technology, we avoid requiring any software upgrades to either web servers or web browsers.

Figure 4.1: WebFT architecture

Integrated into the WebFT sender and receiver are libraries provided with the CHEETAH end-host software modules described in Section 2.2. Through interaction with the CHEETAH end-host software modules, the WebFT sender determines whether to use the Internet path or attempt to set up a CHEETAH circuit, and, if deemed appropriate, initiates the setup of a circuit. It then transfers the user data and initiates the release of the circuit. If, for some reason, the user data cannot be transferred via the CHEETAH network (e.g., the client host is not connected to CHEETAH, the file size is too small, which makes it inefficient to use a circuit, or bandwidth is not available on the CHEETAH network), the WebFT sender process exits and redirection.cgi is invoked to transfer the file via the Internet.

4.1.2 CGI Scripts

CGI defines an approach for a web server to interact with external programs, which are often referred to as CGI programs or CGI scripts. Fig. 4.2 shows the flow of events while running CGI scripts.1

1 This figure is adapted from Writing CGI Applications with Perl by Meltzer and Michalski [32].

Figure 4.2: The flow of events from running CGI scripts

The WebFT package contains two CGI scripts developed in Perl5 on the server side: download.cgi and redirection.cgi. On receiving a request from a client, the web server invokes the download.cgi script with one input parameter, the requested file name. Download.cgi obtains the client's primary IP address by querying the environment variable REMOTE_ADDR. It then calls the WebFT sender process and passes the client's primary IP address and the requested file name to it. If the WebFT sender returns an indication that it failed to transfer the file over the CHEETAH network, download.cgi calls redirection.cgi to initiate a normal download of the file via the Internet.
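To make the CGI dispatch flow concrete, a minimal sketch of a download.cgi-style dispatcher is shown below. The WebFT scripts are written in Perl5; this sketch is in Python purely for illustration, and the helper names and paths (webft_sender binary, /files/ location) are hypothetical placeholders, not the actual WebFT interfaces.

```python
#!/usr/bin/env python
# Illustrative sketch only: a download.cgi-style dispatcher. The real WebFT
# scripts are Perl5; helper names and paths below are hypothetical.
import cgi
import os
import subprocess

def start_webft_sender(client_ip, filename):
    # Hypothetical wrapper around the WebFT sender; returns True if the file
    # was delivered over a CHEETAH circuit, False otherwise.
    result = subprocess.run(["/usr/local/bin/webft_sender", client_ip, filename])
    return result.returncode == 0

def redirect_via_internet(filename):
    # Hypothetical fallback corresponding to redirection.cgi.
    print("Status: 302 Found")
    print("Location: /files/" + filename)
    print()

form = cgi.FieldStorage()
filename = form.getfirst("file", "")
client_ip = os.environ.get("REMOTE_ADDR", "")   # the client's primary IP address

if start_webft_sender(client_ip, filename):
    print("Content-Type: text/plain")
    print()
    print("Transfer completed over a CHEETAH circuit.")
else:
    redirect_via_internet(filename)             # fall back to the Internet path
```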

4.1.3 The WebFT Sender


The WebFT sender is integrated with the APIs for the four basic CHEETAH end-host software modules. Thus, it interacts with the CHEETAH software daemons, including the OCS daemon, the RD daemon, and the RSVP-TE daemon, as shown in Fig. 4.1. The flowchart for the WebFT sender is shown in Fig. 4.3. Once the sender is initiated by the download.cgi script, it calls the OCS client module to determine whether the client host is reachable via the CHEETAH network. If the answer is yes, the OCS client module returns the IP address and the MAC address of the client's secondary NIC (the one connected to the CHEETAH network). The WebFT sender then establishes a TCP connection through the host's primary NIC via the Internet to the WebFT receiver, which runs as a daemon on a well-known port on the client host. Once the TCP connection is successfully established, the receiver sends back a desired CHEETAH circuit rate (based on its receiving capability) and a C-TCP listening port number for the data transfer on the CHEETAH circuit. Then, the WebFT sender process calls the RD module (passing the client host's primary IP address, secondary IP address, desired circuit rate, and the file size as arguments) to determine whether to attempt a CHEETAH circuit setup. The RD module chooses between the two options based on the loading conditions of the two networks (the Internet and the CHEETAH circuit-switched network), the round-trip time (RTT), and the file size. If it returns a decision to attempt a CHEETAH circuit setup, the WebFT sender process calls the RSVP-TE client module (passing the client's primary and secondary IP addresses and the circuit rate), asking it to initiate circuit setup.

Figure 4.3: The flowchart for the WebFT sender


If the circuit setup is successful, the WebFT sender process calls the C-TCP send() subroutine, passing the following arguments: the circuit rate, the client's secondary IP address, the C-TCP port number on which the client is ready to accept an incoming C-TCP connection on the circuit, and the file name. The C-TCP send() subroutine opens a socket and connects to the client through the secondary NIC and the CHEETAH circuit. The file is transferred on the dedicated CHEETAH circuit at a rate equal to the circuit rate. Once the data transfer is completed, the WebFT sender process invokes the RSVP-TE client APIs to initiate the release of the CHEETAH circuit. Finally, it returns a Success indication to the download.cgi script.

If, during the above-mentioned procedure, the OCS client module determines that the client host does not have CHEETAH connectivity, or the RD module decides that it is better to use the Internet path, or the circuit setup initiated by the RSVP-TE client module fails, the WebFT sender process immediately returns a Failure indication to the download.cgi script. The download.cgi process then calls redirection.cgi to download the file via the Internet, as mentioned in Section 4.1.2.
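The sender's decision flow in Fig. 4.3 can be summarized as follows. The sketch below is illustrative only: the cheetah_* helpers are simplified stubs standing in for the OCS, RD, RSVP-TE client, and C-TCP APIs described in the text, not their actual signatures.

```python
# Illustrative control-flow sketch of the WebFT sender (Fig. 4.3). The
# cheetah_* helpers are hypothetical stubs, not the real CHEETAH APIs.

def cheetah_ocs_lookup(primary_ip):
    # Stub: would query the OCS daemon; returns the client's secondary
    # (CHEETAH-facing) IP address, or None if the host is not on CHEETAH.
    return "10.0.0.2"

def negotiate_with_receiver(primary_ip):
    # Stub: would contact the WebFT receiver daemon over the Internet path
    # and read its desired circuit rate and C-TCP data port.
    return 100_000_000, 5000            # 100 Mb/s, port 5000

def cheetah_rd_decide(primary_ip, secondary_ip, rate, file_size):
    # Stub: the RD module weighs network load, RTT, and file size.
    return file_size > 10 * 1024 * 1024

def cheetah_rsvpte_setup(primary_ip, secondary_ip, rate):
    return True                          # stub: circuit setup succeeded

def cheetah_rsvpte_release(primary_ip, secondary_ip):
    pass                                 # stub: tear the circuit down

def cheetah_ctcp_send(rate, secondary_ip, port, filename):
    pass                                 # stub: stream the file at the circuit rate

def webft_send(client_primary_ip, filename, file_size):
    secondary = cheetah_ocs_lookup(client_primary_ip)
    if secondary is None:
        return "FAILURE"                 # no CHEETAH connectivity
    rate, ctcp_port = negotiate_with_receiver(client_primary_ip)
    if not cheetah_rd_decide(client_primary_ip, secondary, rate, file_size):
        return "FAILURE"                 # RD prefers the Internet path
    if not cheetah_rsvpte_setup(client_primary_ip, secondary, rate):
        return "FAILURE"                 # circuit setup failed
    try:
        cheetah_ctcp_send(rate, secondary, ctcp_port, filename)
    finally:
        cheetah_rsvpte_release(client_primary_ip, secondary)
    return "SUCCESS"

print(webft_send("192.0.2.10", "test.rm", 1_600_000_000))
```

A Failure return corresponds to the fallback path above, in which download.cgi invokes redirection.cgi.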

4.1.4 The WebFT Receiver


To avoid manual intervention, the WebFT receiver is designed to run as a daemon on a well-known port in the background on the client host and to process incoming connection requests from the WebFT sender automatically. The WebFT receiver is completely independent of the web browser software, and therefore does not require any modification to the latter. All clients connected to the CHEETAH network are configured to run this daemon. The WebFT receiver forks a child process to handle each request for a TCP connection from the WebFT sender that arrives through the primary NIC. The forked WebFT receiver process then creates a TCP connection with the WebFT sender to accept the request and sends the latter a pre-computed desired circuit rate. The circuit rate is typically computed based on the disk-access rate of the client host, because with today's technology the disk-access rate is usually the bottleneck for file transfers. The forked WebFT receiver process also sends the C-TCP port number on which it listens for the data transfer through the secondary NIC on the CHEETAH circuit.


The WebFT receiver includes the API libraries associated with the RSVP-TE client and C-TCP modules of the CHEETAH end-host software. The RSVP-TE client module API library accepts circuit-setup requests from the CHEETAH network, and the C-TCP module API library accepts incoming C-TCP connection requests from the WebFT sender to transfer user data. After a data transfer is completed, the forked child process terminates and returns control to the parent WebFT receiver process.
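The accept-and-fork structure of the receiver is a standard Unix daemon pattern. Below is a minimal, generic sketch of such a forking control-channel server; the port numbers, advertised rate, and message format are made up for illustration, and the real receiver additionally drives the RSVP-TE and C-TCP libraries.

```python
# Minimal sketch of a forking control-channel daemon in the style of the WebFT
# receiver. Ports, rate, and message format are illustrative only.
import os
import signal
import socket

CONTROL_PORT = 4000          # hypothetical well-known control port (primary NIC)
CTCP_PORT = 5000             # hypothetical C-TCP listening port (secondary NIC)
DESIRED_RATE = 700_000_000   # pre-computed from the host's disk write rate (b/s)

def serve():
    signal.signal(signal.SIGCHLD, signal.SIG_IGN)   # reap children automatically
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", CONTROL_PORT))
    srv.listen(5)
    while True:
        conn, addr = srv.accept()
        if os.fork() == 0:                          # child handles this sender
            srv.close()
            # Advertise the desired circuit rate and the C-TCP data port.
            conn.sendall(("%d %d\n" % (DESIRED_RATE, CTCP_PORT)).encode())
            # ... the real receiver would now accept the circuit setup and the
            # incoming C-TCP connection, then write the file to disk ...
            conn.close()
            os._exit(0)                             # child exits after the transfer
        conn.close()                                # parent keeps accepting

if __name__ == "__main__":
    serve()
```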

4.2 Experimental Testbed and Results

The Linux implementation of WebFT described in the previous section has been tested on the CHEETAH experimental testbed. This section presents and discusses these results. The portion of CHEETAH relevant to our experiments is shown in Fig. 4.4. We chose two PCs, zelda3 and wukong, which are located in Atlanta, GA and RTP, NC, respectively. Zelda3 is a Dell PowerEdge 2850 with dual 2.8 GHz Xeon processors and 2 GB of memory. Wukong is a Dell PowerEdge 1850 with a 2.8 GHz Xeon processor and 1 GB of memory. Both of them have an 800 MHz front-side bus and a PERC4 RAID-0 controller with two 146 GB SCSI disks. The RTT between zelda3 and wukong is 24.7 ms for the Internet path and 8.6 ms for the CHEETAH circuit. We loaded the Apache HTTP server 2.0 on zelda3 and ran a web client on wukong.

Figure 4.4: CHEETAH testbed for WebFT

We opened the Mozilla web browser on wukong, entered the URL http://130.207.252.133/Webapplication.htm,2 and the web page downloaded from the server is shown in Fig. 4.5. After we clicked the hyperlink Download test.rm in Fig. 4.5, which was linked to http://130.207.252.133/cgi-bin/download.cgi?file=test.rm, a circuit was established at a rate of 1 Gb/s from zelda3 to wukong, illustrated by the dashed line in Fig. 4.4.

Figure 4.5: The web page to test WebFT

The file test.rm, of size 1.6 GB, was downloaded from zelda3 to wukong with a delay of about 19 s (excluding the time for circuit setup and release), at a throughput of about 680 Mb/s. The throughput was lower than the circuit rate because of the slow disk-writing rate of wukong, which was approximately 700 Mb/s. Circuit setup across the two SONET switches took approximately 170 ms and circuit release took 9 ms. Table 4.1 gives the average throughput and delay (excluding the time for circuit setup and release) to download test.rm via WebFT for lower-rate circuits. We show the results of using lower-rate circuits to make the point that, if the web server (e.g., zelda3 in our experiment) has a GbE secondary NIC and needs to support multiple simultaneous web downloads, it has to allocate smaller bandwidth levels per download. It is also worth mentioning that the delay variance is negligible because circuits provide dedicated end-to-end bandwidth and the C-TCP transport protocol maintains a fixed sending rate closely matched to the circuit rate. In contrast, the delay varies significantly on the Internet because concurrent traffic has a significant effect on any single download [57].
2 130.207.252.133 is the primary NIC IP address of zelda3.


Table 4.1: Average throughputs and delays at a variety of circuit rates

Circuit rate (Mb/s)          700     600     500     400
Average throughput (Mb/s)    602.5   515.4   412.7   337.3
Average delay (s)            21.2    25.0    31.0    37.9
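As a sanity check, the throughputs in Table 4.1 are consistent with dividing the 1.6 GB file size by the measured delays; the short sketch below reproduces the calculation (assuming 1.6 GB means 1.6 x 10^9 bytes, so small rounding differences remain).

```python
# Reproduce the Table 4.1 throughputs from the file size and measured delays.
# Assumes 1.6 GB = 1.6e9 bytes; minor rounding differences are expected.
FILE_BITS = 1.6e9 * 8                                    # test.rm in bits

delays = {700: 21.2, 600: 25.0, 500: 31.0, 400: 37.9}    # circuit rate -> delay (s)
for rate, delay in delays.items():
    print("%d Mb/s circuit: ~%.1f Mb/s achieved" % (rate, FILE_BITS / delay / 1e6))
# Prints roughly 603.8, 512.0, 412.9, and 337.7 Mb/s, matching Table 4.1.
```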

From this experiment, we conclude that, for web downloads that require deterministic characteristics (e.g., streamed data or web-based gaming applications), the guaranteed services provided by CO networks are indeed useful. Further, for large web downloads, the variability introduced by the connectionless nature of the Internet could cause significantly large delays, especially on long propagation-delay paths. Circuits are a better option for such downloads as well.

4.3 Conclusions

In this chapter, we described a new web-based file-transfer software package, called WebFT, that leverages the new CO networking technologies that are increasingly available today. Specifically, we used a wide-area experimental CO network testbed called CHEETAH, which we deployed as part of an NSF-sponsored project. We integrated CHEETAH end-host software APIs into the WebFT package to provide CHEETAH-related services transparently to users. By leveraging the CGI technology, the WebFT package is completely independent of the web server and browser software, and therefore does not require any modifications to the latter. We tested WebFT on the experimental CHEETAH testbed using the Apache HTTP web server and the Mozilla web browser (note: WebFT is also usable with other web servers and web browsers as long as CGI is supported). Our experimental results showed that WebFT can provide deterministic data services to CHEETAH clients on dedicated end-to-end circuits, because it uses a new C-TCP transport protocol that is capable of providing reliable end-to-end data transfers at the circuit rate.

Chapter 5
PARALLEL FILE TRANSFERS ON CHEETAH

5.1 Introduction

Today, scientists carry out experiments collaboratively on a global scale. These large-scale scientific efforts are popularly termed e-Science. E-Science projects share geographically distributed and heterogeneous resources, such as computational systems, scientific instruments, databases, networks, and software. In particular, they need to share large volumes of data (terabytes or petabytes or even more) amongst geographically distributed applications. For example, scientists at NCSU, who are the primary users of CHEETAH and the primary team members of the Terascale Supernova Initiative (TSI) [54], run their simulations on a Cray X1E located at ORNL. Each simulation creates a multi-TB dataset. These datasets are then downloaded from the Cray X1E to a local cluster, called orbitty, for analysis. The scientists need access to the latest dataset as soon as it is created. Currently, they use either the Logistical Runtime System (LoRS) tool [31] or bbcp [6] for these bulk file transfers and achieve throughput in the range of 200 Mb/s to 400 Mb/s. Given that no link on the network path from the Cray X1E to orbitty has a bandwidth lower than 1 Gb/s (e.g., the backbone bandwidth of Internet2 is OC192), we should be able to achieve at least 1 Gb/s throughput. In this chapter, we study the use of parallel file transfers on CHEETAH to support a broad class of e-Science projects, including TSI.

To achieve multi-Gb/s throughput, we need to analyze why current solutions are limited to hundreds of Mb/s. We have identified two factors responsible for this poor performance. First, TCP's congestion-control algorithm does not work well in networks with a high bandwidth-delay product. On detecting congestion (through a packet loss or by receiving triple duplicate acknowledgments), the TCP sender drops its sending rate immediately and then slowly increases its rate as packets get through the network successfully. This process takes time to regain the full transfer speed. Second, end hosts are themselves bottlenecks. Read-write speeds of hard disks are commonly hundreds of Mb/s, which is lower than network bandwidth (several Gb/s); therefore, hard disks create a severe bottleneck. In addition, Baker and Feng [4] pointed out another possible limiting factor, the PC I/O bus. Even without any other bottleneck, such as hard disks, a host that connects a 10 Gb/s NIC through a 133 MHz, 64-bit Peripheral Component Interconnect Extended (PCI-X) bus can only achieve a peak bandwidth of 133 MHz x 64 bits = 8.512 Gb/s.

To overcome the effects of these two factors, several solutions have been proposed. Most file-transfer programs, such as GridFTP and bbcp, allow a user to employ multiple TCP streams to mitigate the first factor. We propose the use of CO networks, such as CHEETAH, to overcome this first limitation. Specifically, we reserve bandwidth (e.g., multiple Gb/s) from end host to end host and thus avoid packet loss. To reduce the second limitation, one possible solution is to equip each end host with high-speed hardware, including high-speed CPUs, I/O buses, hard disks, and NICs. In this solution, we concentrate on making each end host faster; thus, we refer to this approach as a single-host solution. Alternatively, we can relieve the end-host bottleneck by leveraging parallelism amongst multiple end hosts, which we term a cluster solution. There are two variations of the cluster solution, based on whether the source file is located on a single-host file system or distributed in blocks across a multi-host file system, such as PVFS:

1. Non-split source file: the file is not split and is located on a file system on a single host.
2. Split source file: the file is split into multiple parts and these parts are distributed across the disks of multiple hosts.

The case of a non-split source file is more general than the case of a split source file. Thus, we term the former the general case, and the latter the special case. For the general case, we need to carry out the following steps:

1. Splitting: partition a large file located at a single host (on one or more disks) into multiple parts, and load each part onto a separate host. We refer to the number of parts as the splitting degree.
2. Transferring: transfer the parts to the receiving hosts in parallel.
3. Assembling: assemble the parts into a single large file.

For the special case, where the file is already partitioned into blocks and distributed across multiple hosts, we do not need the splitting and assembling steps. All that is required is a file-transfer tool, such as GridFTP, which supports striped transfers for files that are striped across disks on different hosts in a parallel file system. Fig. 5.1 illustrates the framework of the single-host and the general-case cluster solutions.
Figure 5.1: The single-host solution vs. the general-case cluster solution

In this chapter, we describe our design and implementation of these single-host and cluster solutions. First, we briefly review the software tools GridFTP and PVFS2, because we use these tools in our general-case cluster solution. Next, we discuss the use of the single-host and the general-case cluster solutions. Finally, we describe a specific-case solution for moving datasets in the TSI project.

5.2 Background

In this section, we briefly review the File Transfer Protocol (FTP) and then describe how GridFTP extends FTP with the new features of multi-streaming, partial file transfer, and striping. We also provide a brief overview of PVFS.

5.2.1 FTP and GridFTP


GridFTP is a data-transfer protocol proposed for fast data transfers on the Grid [1, 2]. It extends FTP [36] by adding features for partial file transfer, multi-streaming, striping, and Globus-based security. It has been implemented by the Globus Alliance as a component of the Globus Toolkit (GT) [18, 20]. In the cluster solution, we mainly use the GridFTP functionalities of third-party control, partial data transfer, multi-streaming, and, especially, striped data transfer. Before we describe GridFTP's extensions to FTP, we give an overview of FTP and focus on its feature of third-party control.1

There are two kinds of TCP connections in FTP: control connections and data connections. All FTP commands are transferred over the control connection, while user data are transferred over the data connection. The default port number of the control connection on the FTP server is 21 and that of the data connection is 20. Third-party control, provided in FTP, allows a user to transfer files between two other hosts. To implement this feature, FTP provides two commands, PASV and PORT. PASV has no argument and is an abbreviation for passive. Just as the term passive implies, PASV requests that an FTP server wait for a data connection rather than initiate one on receiving a data-transfer command. PORT has an argument of a host-port pair, with which it specifies the data port to be used in a data connection.
1 Although RFC 959 [36] specifies this feature, it does not refer to the feature as third-party control. Instead, the GridFTP specification [1] introduces the term third-party control.
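The host-port pair exchanged by PASV and PORT is encoded in FTP's 227 reply as six comma-separated decimal fields (four for the IPv4 address, two for the 16-bit port). The short sketch below, assuming a reply in the RFC 959 format, shows how a third party would turn server A's PASV reply into the argument of the PORT command it sends to server B.

```python
# Parsing the host-port pair carried by FTP's PASV reply (RFC 959) and
# re-encoding it as the argument of a PORT command, as a third party would.
import re

def parse_pasv_reply(reply):
    """Extract (host, port) from a '227 Entering Passive Mode (h1,h2,h3,h4,p1,p2)' reply."""
    fields = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply).groups()
    h1, h2, h3, h4, p1, p2 = map(int, fields)
    return "%d.%d.%d.%d" % (h1, h2, h3, h4), p1 * 256 + p2

def port_argument(host, port):
    """Encode the same host-port pair as the argument of a PORT command."""
    return "%s,%d,%d" % (host.replace(".", ","), port // 256, port % 256)

host, port = parse_pasv_reply("227 Entering Passive Mode (192,0,2,7,78,52)")
print(host, port)                       # 192.0.2.7 20020
print("PORT " + port_argument(host, port))
```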

Figure 5.2: The model and flow chart of third-party control

Fig. 5.2 shows the model and flow chart of third-party control. First, an FTP client on a third party, denoted as C, establishes control connections to two FTP servers, denoted as A and B. C forwards all FTP commands, such as user and password, between A and B via the control connections. Then, C sends a PASV command to A. On receiving PASV, A listens on a data port, which it selects to be a number distinct from the well-known port number 20, returns to C a host-port pair (host provides A's IP address and port is the one on which A listens for a connection), and waits for a data connection. Then, C sends a PORT command to B with the host-port pair as the argument. After B receives the PORT command, it initiates a data connection to A at the port on which A waits for a connection.

FTP has three transfer modes:

1. Stream mode: transmit data as a stream of bytes.
2. Block mode: transmit data as a series of data blocks. Each block is identified by a 3-byte header, which contains two fields: a 1-byte descriptor and a 2-byte length. The descriptor field indicates whether the block is a special block, for example, the last block that ends a file. The length field specifies the length of the block.
3. Compressed mode: transmit compressed data.

All these modes transfer data in sequence and do not support partial file transfer. GridFTP extends the block mode by adding an offset field to the block header to support out-of-sequence data delivery. With this extended block mode, GridFTP can do partial file transfer, which transfers portions of files rather than complete files. This extended block mode is also fundamental to the GridFTP features of multi-streaming and striping. These two features leverage parallelism to speed up file transfers. Specifically, multi-streaming supports multiple TCP streams in parallel between each pair of sending and receiving hosts. In contrast, GridFTP striped transfer stripes data across multiple sending hosts and transfers these stripes in parallel to multiple receiving hosts. Thus, GridFTP striped transfer leverages multiple-host parallelism and relieves the bottleneck caused by end-host limitations. We describe below how GridFTP implements striped transfer in detail.
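As an illustration of the extended block mode just described, the sketch below packs a block header consisting of a descriptor byte, a byte count, and an offset. The 8-byte count and offset widths follow our reading of the GridFTP specification [1]; the text above only states that an offset field is added to FTP's 3-byte block-mode header.

```python
# Packing headers for FTP block mode vs. an extended block mode with an offset.
# The 8-byte count/offset widths in the extended header are assumed from the
# GridFTP specification; the text only says that an offset field is added.
import struct

DESC_EOF = 0x40          # RFC 959 block-mode descriptor code for "end of file"

def ftp_block_header(descriptor, length):
    """Standard FTP block mode: 1-byte descriptor + 2-byte length (RFC 959)."""
    return struct.pack("!BH", descriptor, length)

def extended_block_header(descriptor, length, offset):
    """GridFTP-style extended block mode: descriptor + count + offset."""
    return struct.pack("!BQQ", descriptor, length, offset)

# A 65536-byte block that belongs at offset 3 * 65536 within the file:
hdr = extended_block_header(0, 65536, 3 * 65536)
print(len(hdr), hdr.hex())               # 17-byte header with the offset encoded
```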
Figure 5.3: The model and flow chart of GridFTP striped transfer

Fig. 5.3 shows the model of GridFTP striped transfer.2 Multiple pairs of end hosts, termed data nodes and typically located in two clusters, participate in a single data transfer that is controlled by two GridFTP servers, termed front ends, and a third party, which runs globus-url-copy (a GridFTP client tool provided by GT). Each front end acts as the single GridFTP control server on its cluster to coordinate file transfers between data nodes. Each data node moves the parts of the file assigned to it to its peer. To support striped transfer, GridFTP defines two commands, SPAS and SPOR, which extend PASV and PORT, respectively. If a front end receives a SPAS command, it requests all its data nodes to wait for data connections and returns a list of host-port pairs for these data nodes. In contrast, if a front end receives a SPOR command with a list of host-port pairs, it notifies its data nodes to initiate data connections to the hosts specified in the SPOR command's argument list.

Comparing Fig. 5.2 with Fig. 5.3, we see that the flow chart for GridFTP striped transfer is similar to that for the third-party control provided in FTP. The additional features in GridFTP striped transfer are as follows. First, it involves many data nodes. Second, it uses SPAS and SPOR instead of PASV and PORT. Third, it is required to be unidirectional, which means that SPAS is paired with a receiving front end and SPOR with a sending one; in contrast, FTP does not have any such restriction. Fourth, a front end communicates with its data nodes through an internal Interprocess Communication (IPC) protocol, which is unspecified in the GridFTP specification. Finally, although there are multiple data connections between sending and receiving data nodes, there are only two control connections, between the two front ends and the third party.

In addition, as shown in Fig. 5.3, GridFTP striped transfer requires that the end hosts in each cluster have access to the file, which means that the file needs to be managed by a parallel file system. Furthermore, the underlying parallel file system must deliver a high read-write throughput to avoid becoming a bottleneck itself. Currently, the General Parallel File System (GPFS) [21] and PVFS2 are two popular parallel file systems. We use PVFS2 in our experiments because PVFS2 is open-source software, allowing us to make any required modifications, whereas GPFS is a commercial product.
2 Unless otherwise mentioned, the number of sending hosts is equal to that of receiving hosts. Although the two numbers are not required to be equal, we make them equal to simplify our explanation.


5.2.2 PVFS2
Clemson University and Argonne National Laboratory jointly developed PVFS (or PVFS1) [12, 37], which has been released and supported under the GNU General Public License since 1998. The PVFS team aimed to design and implement a parallel I/O system that handles the performance disparity between I/O devices and processors, and addresses the scalability problem of the Network File System (NFS). NFS is a distributed file system developed by Sun Microsystems, Inc. It is a client-server application and allows a user to conveniently access files on a remote computer [48]. An NFS server stores all files in a central location, which causes a scalability problem when the number of clients exceeds the performance capacity of the machine exporting the file system. We can equip an NFS server with more memory, a faster CPU, and higher-speed NICs, but, being a central node, it can still run out of resources. As the number of client nodes increases, each client receives a smaller portion of the overall bandwidth for file I/O. Another problem is availability: if an NFS server goes down, all its client nodes have to wait until the server recovers.

Unlike NFS, which is a central data-storage system, PVFS uses storage on multiple computers to create a large, high-performance parallel file system. PVFS physically distributes a single file across multiple disks in multiple nodes. For example, it stripes a file over the local disks of multiple I/O servers in a simple round-robin style, as in RAID0. Fig. 5.4 shows the system architecture of PVFS1.3 It is still a client-server file system. Each host may play one or more of the following three roles:

1. compute nodes (CN or clients), where applications run
2. I/O nodes (ION or I/O servers), where files are stored
3. metadata server or management node (MGR), where metadata operations are handled

PVFS1 can have one and only one management node.
3 This figure is adapted from the PVFS1 user guide [37].


Figure 5.4: PVFS system architecture

A second version of PVFS, PVFS2, has several new features [38, 39]. For example, it allows for several management nodes, which eliminates the possible bottleneck caused by the single management node in PVFS1. But it uses the same principles as PVFS1 to create a parallel file system.

5.3 The Single-Host Solution

The single-host solution leverages high-speed hardware to avoid the end-host bottleneck. Specifically, we concentrate on the bottleneck created by hard-disk I/O. The other PC hardware components, such as NICs, PCI-X buses, memory buses, and CPUs, are also possible bottlenecks, but, as Hurwitz and Feng [23] pointed out, these components are not the primary bottlenecks and they are kept updated by new technologies. For example, the new PCI Express x16 implementation will achieve a peak bandwidth of 64 Gb/s [10] and thus will remove the possible bottleneck caused by the I/O bus. To relieve the disk bottleneck, we can equip sending and receiving hosts with redundant arrays of inexpensive disks (RAIDs). However, what is the peak write speed of a RAID?4 Is the hardware solution feasible, scalable, and cost-effective? In this section, we address these questions after providing a brief overview of RAID.

4 In this section, we only use write speed for our comparison because write speed is lower than read speed.


Patterson, Gibson, and Katz [35] formally defined RAID levels one through five and showed that RAID outperformed single large expensive disks by an order of magnitude in speed, reliability, scalability, and other metrics. Currently, the most commonly used RAID levels are RAID0 and RAID5. A RAID0 stripes data evenly across all member disks without any parity or redundancy. A RAID5 stripes data, including parity information, across all member disks. Assume that the number of disks is M and that each disk has an equal write speed of x. If I/O operations are ideally split into equal-sized blocks and these blocks are distributed evenly across the M disks, then these I/O operations can be carried out concurrently on all member disks. Since all M disks in a RAID0 contain data, the maximum write speed for RAID0 is M x. In contrast, for RAID5, the equivalent of one disk contains parity information, and thus the maximum speed is (M - 1) x. In practice, as the number of hard disks connected to a RAID controller increases, the write speed may not increase proportionally, because the RAID controller itself becomes the bottleneck. Currently, read-write speeds of over 1 Gb/s are achievable with RAIDs. Barclay, Chong, and Gray [5] reported that an 8-disk 3ware Escalade 8508 controller saturated at 1.8 Gb/s read and 1.6 Gb/s write. An 8-disk Areca ARC-1120 controller, configured as RAID5, was reported to saturate at 6.0 Gb/s read and 3.6 Gb/s write [53]. Therefore, the hardware solution is feasible. In light of the designs of RAID0 and RAID5, the theoretical disk utilization for RAID0 is 100%, and for RAID5 it is (M - 1)/M. Assume that each hard disk is a 146 GB SCSI disk. To accommodate 2 TB of data, we need at least (2 TB)/(146 GB), or about 15, hard disks for RAID0 and even more for RAID5. To manage an array of more than 15 hard disks, we need a high-end RAID host adapter with an I/O processor and memory to off-load the intensive RAID5 XOR parity computation. Given the trend of communication bandwidth growing from 1 Gb/s to tens of Gb/s, I/O performance is likely to lag behind network performance for the near-term future. Hence, we conclude that although the single-host solution is feasible for fast file transfers, it is neither scalable nor cost-effective.
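The idealized peak-speed expressions above can be captured in a few lines; the per-disk write speed used below is a sample value chosen only for illustration, and controller saturation, which dominates in practice, is ignored.

```python
# Idealized RAID peak write speeds from the M x and (M - 1) x expressions;
# ignores RAID-controller saturation. The per-disk speed is a sample value.
def raid0_peak_write(m, per_disk):
    return m * per_disk               # all M member disks hold data

def raid5_peak_write(m, per_disk):
    return (m - 1) * per_disk         # one disk's worth of capacity holds parity

per_disk = 0.4                        # sample single-disk write speed, Gb/s
for m in (4, 8, 15):
    print(m, raid0_peak_write(m, per_disk), raid5_peak_write(m, per_disk),
          "disk utilization: 100%% vs %.0f%%" % (100.0 * (m - 1) / m))
```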


5.4 The General-Case Cluster Solution

In this section, we describe the cluster solution for the general case of non-split source files at the sending end. First, we address the problem of determining an appropriate value for the splitting degree. Second, we discuss possible approaches to implementing the general-case cluster solution and explain why we use GridFTP and PVFS2 to implement it. We also present our specific requirements for GridFTP and PVFS2 to minimize network-and-disk contention. Then, we describe our modifications to GridFTP and PVFS2 to meet these requirements. Finally, we provide the experimental results obtained after we modified GridFTP and PVFS2.

5.4.1 The Splitting Degree


As mentioned in Section 5.1, the general-case cluster solution needs to first partition the source file. One important question is to determine an appropriate value for the splitting degree. We should select the splitting degree such that the cluster solution transfers a source file faster than an approach without splitting. Let the size of the source file be x, the splitting degree be d (d >= 1, where d = 1 means that the file is not split), and the number of pairs of sending and receiving hosts be n (see Fig. 5.1b). Assume that the 2n hosts have the same hardware and software configurations and thus have the same processing power. Let the disk I/O rate for each host be r for reading and w for writing. Let the time to split and load the file, and the time to assemble the file, be Tsplit and Tassemble, respectively. Tsplit and Tassemble are serial in nature because the splitting and assembling steps involve a single source or sink. We assume that Tsplit and Tassemble are independent of the splitting degree d. Since hosts at the sending cluster are typically co-located in one geographic location, we ignore the RTT delay for inter-host communication. Similarly, we ignore the RTT delay amongst the receiving hosts. Thus, we estimate Tsplit and Tassemble as follows:

Tsplit = Tassemble = x/r + x/w    (5.1)

Let the time to transfer the whole file from a single host at the sending site to a single host at the receiving site be Ttransfer. Assume that we evenly split the file into d parts. If d < n, it takes Ttransfer/d to transfer these parts in parallel. Otherwise, the time is Ttransfer/n, because we do not benefit by increasing d to be larger than n. Hence, we have the following condition to guide us in our selection of the splitting degree:

Tsplit + Ttransfer/min(d, n) + Tassemble < Ttransfer    (5.2)

The speedup for the general-case cluster solution is

speedup = Ttransfer / (Tsplit + Ttransfer/min(d, n) + Tassemble)    (5.3)

Combining (5.1), (5.2), and (5.3), we reason that to get the largest speedup, we should select the splitting degree such that

d = n if n > Ttransfer / (Ttransfer - 2(x/r + x/w)), and d = 1 otherwise    (5.4)

In addition, the requirement Ttransfer > 2(x/r + x/w) should be met; otherwise, the splitting and assembling operations take longer than the transferring operation. The two conditions, n > Ttransfer / (Ttransfer - 2(x/r + x/w)) and Ttransfer > 2(x/r + x/w), determine whether we should split the source file, that is, whether we should use the general-case cluster solution. If the file transfer is carried out over the Internet, Ttransfer increases significantly as the RTT increases and/or network congestion increases. Consequently, the probability of meeting these two conditions increases. In contrast, if the file is transferred over a CO network, such as CHEETAH, bandwidth is reserved for the file transfer and thus there is no congestion during the data flow. Assume that a circuit of rate b is reserved between each pair of sending and receiving hosts. Since we do not benefit by reserving a circuit faster than w, b should be no larger than w even if the maximum available bandwidth is larger than w. If b < w, Ttransfer depends on b. Hence, we estimate Ttransfer as follows:

Ttransfer = x / min(b, w)    (5.5)

Thus, to use the cluster solution, we should at least satisfy

x / min(b, w) > 2(x/r + x/w), that is, b < rw / (2(r + w))    (5.6)
However, if the circuit bandwidth is high, then the probability of meeting condition (5.6) is low or even zero. This argues against the cluster solution on CHEETAH. Note, however, that in the previous analysis we assumed that the three steps of splitting, transferring, and assembling are carried out separately. If we pipeline them, we can decrease the total delay. For example, while we split some parts and load them onto the sending hosts, we can transfer the parts that are already available to the receiving hosts without waiting for the splitting step to finish. Additionally, if we use PVFS2 to manage files and the starting point is an already-split file, the cluster solution has value even on CHEETAH.
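A short numeric sketch of (5.1)-(5.6) illustrates when splitting pays off; the disk and circuit rates below are arbitrary sample values chosen only for illustration, not measurements.

```python
# Worked example of (5.1)-(5.6). All rates are sample values in bytes/s.
x = 2e12                      # 2 TB source file
r, w = 100e6, 60e6            # single-host disk read and write rates
n = 8                         # pairs of sending/receiving hosts

def speedup(b):
    t_split = t_assemble = x / r + x / w          # (5.1), serial split/assemble
    t_transfer = x / min(b, w)                    # (5.5)
    ok = t_transfer > 2 * (x / r + x / w)         # side condition for (5.4)
    d = n if ok and n > t_transfer / (t_transfer - 2 * (x / r + x / w)) else 1
    return d, t_transfer / (t_split + t_transfer / min(d, n) + t_assemble)  # (5.3)

print(speedup(10e6))   # slow path: b < rw/(2(r+w)) ~ 18.75e6, so d = n, speedup > 1
print(speedup(40e6))   # fast circuit: condition (5.6) fails, so d = 1, speedup < 1
```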

5.4.2 Design
In this section, we propose possible approaches to implementing the three steps of the general-case cluster solution. We discuss their advantages and disadvantages and decide to use GridFTP striped transfer and PVFS2.

There are several possible approaches to splitting and assembling a file. The first approach is to use the functionalities of partial transfer and third-party control provided by some file-transfer tools, for example, GridFTP. However, there are two problems with this approach. First, disk space equal to the whole file size must be allocated on each host; thus, this implementation is not suitable for a large file that cannot even reside on a single host. Second, this approach is serial in nature and consumes considerable time, as we mentioned in Section 5.4.1; thus, the overall speedup is significantly affected even though the transferring step has a theoretical speedup of min(d, n). Alternatively, we can write a socket program to implement splitting and assembling and thus overcome the first, space-related problem of using GridFTP partial transfer. However, this approach still incurs significant overhead for splitting and assembling. The best approach is to use PVFS2 to manage files. PVFS2 provides a tool, pvfs2-cp, to transfer files between PVFS2 and other file systems, such as NFS, Linux ext2, and Linux ext3. Thus, we can use it to assemble a PVFS2 file, which is distributed across multiple I/O servers, into a non-split file stored in another file system, and vice versa. PVFS2 automatically manages partitioning. From a user's point of view, a file can be accessed as though it were stored in a single central location. Hence, we can avoid assembling if a user chooses to access the file in PVFS2. We can even avoid splitting if files are initially created in PVFS2. Thus, we choose to use PVFS2 to manage files, and we use pvfs2-cp to split or assemble a file if necessary (i.e., if a file is not originally managed by PVFS2, or if users need to access the file via a non-PVFS2 file system).

After deciding to use PVFS2 for splitting and assembling, we study the approaches to transmitting the parts of a file. The first approach is to use GridFTP partial transfer (or any file-transfer tool that provides partial transfer) to transfer partitions from one PVFS2 system to another in parallel but independently. To achieve the highest throughput, we should avoid unnecessary network-and-disk contention in each PVFS2 system by making all GridFTP servers responsible for moving only the data blocks located on their local disks. For example, we should avoid the following scenario: a GridFTP server reads a non-local data block and sends the block to its peer receiver, which then has to move the block using PVFS2 to a disk of another host. To avoid such network-and-disk contention, we should meet the following two conditions:

1. The software should know a priori how data are striped in PVFS2.
2. PVFS2 I/O servers and GridFTP servers run on the same hosts, and GridFTP servers are responsible only for their local data blocks.

Provided that the first condition holds, the second condition becomes trivial. However, PVFS2 does not provide any explicit utility to examine data distribution. Therefore, to meet the first condition, we investigated how PVFS2 works and modified the PVFS2 code. We describe our modifications to PVFS2 in Section 5.4.3. Fig. 5.5 shows a model of using GridFTP partial file transfer to implement the transferring step, where, for each data block, there is a GridFTP control connection and a GridFTP data connection responsible for transmitting the block between the two PVFS2 systems.
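To illustrate the kind of striping knowledge that the first condition calls for, the sketch below computes, for a round-robin layout like PVFS2's, which blocks of a file are local to a given I/O server; a GridFTP server co-located with that I/O server would then transfer only those blocks. The stripe size and server count are illustrative parameters, not values read from a real PVFS2 installation.

```python
# Round-robin striping arithmetic: which byte ranges are local to one I/O
# server. Stripe size and server count are illustrative parameters only.
def local_ranges(file_size, stripe_size, num_servers, server_index):
    """Byte ranges (start, end) of the blocks stored on one I/O server,
    assuming blocks are distributed round-robin starting at server 0."""
    ranges = []
    block = server_index
    while block * stripe_size < file_size:
        start = block * stripe_size
        end = min(start + stripe_size, file_size)
        ranges.append((start, end))
        block += num_servers
    return ranges

# A 1 MB file striped in 64 KB blocks over 5 servers: server 2 holds blocks 2, 7, 12.
for start, end in local_ranges(1 << 20, 64 * 1024, 5, 2):
    print(start, end)
```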

Figure 5.5: A model of using GridFTP partial file transfer to implement the transferring step

The second approach is to use GridFTP striped transfer. Similar to the first approach, to achieve the highest throughput we should also minimize network-and-disk contention in each PVFS2 system. To this end, we should meet the following two conditions in addition to the two conditions for the first approach:

1. GridFTP stripes data across data nodes in the same sequence as PVFS2 does across PVFS2 I/O servers.
2. GridFTP and PVFS2 have the same stripe size.

We can easily meet the second condition by setting the stripe-size parameters of GridFTP and PVFS2 to the same value. We address how we modified the GridFTP code to meet the first condition in Section 5.4.4. Fig. 5.6 shows the model of using GridFTP striped transfer to implement the transferring step. Unlike the first transferring approach, which is composed of many independent parallel partial transfers, this approach has only a single file transfer involving many hosts (see Section 5.2.1). As shown in Fig. 5.6, there are only two control connections, between a third party and the two front ends. In addition, for each pair of sending and receiving data nodes, there is only a single data connection.

Figure 5.6: A model of using GridFTP striped transfer to implement the transferring step

Comparing Fig. 5.5 with Fig. 5.6, we see that the approach using GridFTP striped transfer is more natural and has less overhead for establishing and releasing connections. For these reasons, we choose GridFTP striped transfer to implement the transferring step. In conclusion, we use GridFTP striped transfer and PVFS2 to implement the general-case cluster solution. For convenience, we summarize the above-described approaches in Table 5.1.

5.4.3 Implementation: Modifications to PVFS2


As mentioned in Section 5.4.2, to minimize network-and-disk contention in the general-case cluster solution, we need to know how a file is striped in PVFS2. In this subsection, we describe our modifications to PVFS2 to obtain data-distribution information.


Table 5.1: A summary of possible approaches to implement the general-case cluster solution

Splitting & assembling:
  GridFTP partial file transfer  - Cons: wastes disk space; significant overhead to split and assemble
  Socket program                 - Pros: avoids wasting disk space. Cons: significant overhead to split and assemble
  pvfs2-cp                       - Pros: avoids wasting disk space; avoids assembling or even splitting overhead

Transferring:
  GridFTP partial file transfer  - Cons: many independent transfers, which incur much overhead to set up and release connections
  GridFTP striped transfer       - Pros: a single file transfer

We installed two PVFS2 1.0.1 systems on a 22-node cluster, called sunre. Sunre1 through
sunre22 are all equipped with two Intel(R)-Xeon 2.80 GHz CPUs, and 1 GB RAM, and are con-

nected to a 24-port GbE switch. They run Redhat Linux 9 and are the clients of an NFS server, called centurion. We loaded each PVFS2 system on ve sunre hosts. For the rst PVFS2 system, we congured sunre1 through sunre5 as the I/O servers and compute nodes, and sunre1 as the only metadata server. For the second PVFS2 system, we congure sunre6 through sunre10 as the I/O servers and compute nodes, and sunre6 as the only metadata server. The conguration le for the second PVFS2 is shown in Fig. 5.7. In this subsection, we carried out the experiments in the second PVFS2 system unless otherwise mentioned. Unlike PVFS1, which provides the utility of pvstat to examine physical le-distribution parameters (e.g., the index of the starting I/O node, the number of I/O servers, and the stripe size) [43], PVFS2 1.0.1 does not provide any direct utility to inspect data distribution. We reported this problem to the pvfs2-user mailing list and were advised to use the tool pvfs2-fs-dump, which displays information about the contents of the le system.5 However, the output by pvfs2-fs-dump does not explicitly illustrate how les are striped. The output is not only hard to comprehend, but also is
5 See

http://www.beowulf-underground.org/pipermail/pvfs2-users/2005-April/000622.html.

...
<MetaHandleRanges>
    Range sunfire6 4-715827885
</MetaHandleRanges>
<DataHandleRanges>
    Range sunfire10 715827886-1431655767
    Range sunfire6 1431655768-2147483649
    Range sunfire7 2147483650-2863311531
    Range sunfire8 2863311532-3579139413
    Range sunfire9 3579139414-4294967295
</DataHandleRanges>
...

Figure 5.7: A snippet of pvfs2-fs2.conf, the PVFS2 configuration file on sunfire6

Fig. 5.8 shows a part of the output of the pvfs2-fs-dump command. For each file in PVFS2, pvfs2-fs-dump provides the handle number, the type (Metafile or Datafile), and the I/O or metadata server number.

...
File: test_500M
handle = 715827830, type = Metafile, server = 0
handle = 3579139362, type = Datafile, server = 3
handle = 4294967244, type = Datafile, server = 4
handle = 1431655716, type = Datafile, server = 0
handle = 2147483598, type = Datafile, server = 1
handle = 2863311480, type = Datafile, server = 2
File: test_2000M
handle = 715827861, type = Metafile, server = 0
handle = 2863311500, type = Datafile, server = 2
handle = 3579139382, type = Datafile, server = 3
handle = 4294967264, type = Datafile, server = 4
handle = 1431655736, type = Datafile, server = 0
handle = 2147483608, type = Datafile, server = 1
...

Figure 5.8: A part of the output of pvfs2-fs-dump

We wanted answers to the following questions. First, the I/O-server numbers and metadata-server numbers in this output are logical numbers. It is unclear how PVFS2 matches the logical server numbers to the physical servers. Second, the order of the server numbers is not deterministic; for example, the file test_500M is striped in the


order 3, 4, 0, 1, and 2, whereas the file test_2000M is striped in the order 2, 3, 4, 0, and 1. How is this order determined? Does it indicate the round-robin sequence of the I/O servers over which the files are distributed? Finally, the output of pvfs2-fs-dump does not provide any information about the data stripe size. The default stripe size is 64 KB, but can a user set the stripe size?

The first question was easy to answer. Sunfire6 is the only metadata server (see Fig. 5.7); therefore, as a metadata server, sunfire6 has the logical number 0 (see Fig. 5.8). By combining the handle numbers in Fig. 5.8 with the handle ranges for each data server in Fig. 5.7, we determined the physical servers corresponding to the logical numbers (see Table 5.2). In other words, by combining the output of the pvfs2-fs-dump command with the contents of the pvfs2-fs2.conf file, we identified the physical server corresponding to each logical I/O-node number.

Table 5.2: The logical server numbers for the physical I/O servers

Physical I/O server    Logical number
sunfire10              0
sunfire6               1
sunfire7               2
sunfire8               3
sunfire9               4
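To make this lookup concrete, the by-hand matching of Datafile handles to handle ranges can be expressed in a few lines of code. The sketch below is an illustration rather than PVFS2 code; the ranges are hard-coded from the pvfs2-fs2.conf snippet in Fig. 5.7, and the handles are those of test_500M from Fig. 5.8.

#include <stdio.h>
#include <stdint.h>

/* One DataHandleRanges entry from pvfs2-fs2.conf (see Fig. 5.7). */
struct handle_range {
    const char *host;
    uint64_t    lo, hi;
};

static const struct handle_range ranges[] = {
    { "sunfire10",  715827886ULL, 1431655767ULL },
    { "sunfire6",  1431655768ULL, 2147483649ULL },
    { "sunfire7",  2147483650ULL, 2863311531ULL },
    { "sunfire8",  2863311532ULL, 3579139413ULL },
    { "sunfire9",  3579139414ULL, 4294967295ULL },
};

/* Return the physical I/O server that owns a Datafile handle. */
static const char *server_for_handle(uint64_t handle)
{
    for (size_t i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++)
        if (handle >= ranges[i].lo && handle <= ranges[i].hi)
            return ranges[i].host;
    return "unknown";
}

int main(void)
{
    /* Datafile handles of test_500M from the pvfs2-fs-dump output (Fig. 5.8). */
    uint64_t handles[] = { 3579139362ULL, 4294967244ULL, 1431655716ULL,
                           2147483598ULL, 2863311480ULL };
    for (size_t i = 0; i < 5; i++)
        printf("handle %llu -> %s\n",
               (unsigned long long)handles[i], server_for_handle(handles[i]));
    return 0;
}

Running this sketch reproduces the mapping in Table 5.2: logical servers 3, 4, 0, 1, and 2 correspond to sunfire8, sunfire9, sunfire10, sunfire6, and sunfire7, respectively.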

To answer the other two questions, we wrote a program, called filegenerator, to create a file whose contents encode the striping information. Consider an s-KB file with the format shown in Fig. 5.9: the file consists of s blocks of 1024 B each, and the i-th block starts with the decimal number i followed by padding characters 'a' ("1a...a", "2a...a", ..., "sa...a"). We used the strace command to trace the system calls issued by the pvfs2-cp utility. We describe our trace results below.

Figure 5.9: The content of an s-KB file (s blocks of 1024 B each: "1a...a", "2a...a", ..., "sa...a")
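A generator of this format is only a few lines long. The following sketch is a reconstruction of what such a filegenerator could look like; it is not the original program, and the command-line interface shown here is assumed.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Write an s-KB file in which the i-th 1024-byte block starts with the
 * decimal number i and is padded with the character 'a' (Fig. 5.9). */
int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <output-file> <size-in-KB>\n", argv[0]);
        return 1;
    }

    FILE *fp = fopen(argv[1], "w");
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }

    long s = atol(argv[2]);
    char block[1024];

    for (long i = 1; i <= s; i++) {
        memset(block, 'a', sizeof(block));
        /* Overwrite the start of the block with its index, e.g. "385". */
        char label[32];
        int n = snprintf(label, sizeof(label), "%ld", i);
        memcpy(block, label, (size_t)n);
        fwrite(block, 1, sizeof(block), fp);
    }

    fclose(fp);
    return 0;
}

Because each block carries its own index, reading any stripe off an I/O server immediately reveals which part of the file landed there, which is exactly what the strace traces below exploit.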

First, we used filegenerator to create a 1000 MB file, called test_1000M, in the directory /tmp/ on sunfire10. Then, we issued the command strace pvfs2-cp -t /tmp/test_1000M /pvfs2/test_1000M -o testfile/pvfs2cp2 to copy the file into PVFS2 and to save the strace output into the file testfile/pvfs2cp2.

[xf4c@sunfire10 xf4c]$ more testfile/pvfs2cp2 | grep connect
...
connect(4, sa_family=AF_INET, sin_port=htons(3334), sin_addr=inet_addr("128.143.63.248"), 16) = -1 EINPROGRESS (Operation now in progress)
connect(6, sa_family=AF_INET, sin_port=htons(3334), sin_addr=inet_addr("128.143.63.216"), 16) = -1 EINPROGRESS (Operation now in progress)
connect(7, sa_family=AF_INET, sin_port=htons(3334), sin_addr=inet_addr("128.143.63.226"), 16) = -1 EINPROGRESS (Operation now in progress)
connect(8, sa_family=AF_INET, sin_port=htons(3334), sin_addr=inet_addr("128.143.63.224"), 16) = -1 EINPROGRESS (Operation now in progress)
connect(9, sa_family=AF_INET, sin_port=htons(3334), sin_addr=inet_addr("128.143.63.225"), 16) = -1 EINPROGRESS (Operation now in progress)
...

Figure 5.10: A part of the output of the command more testfile/pvfs2cp2 | grep connect
Next, we identified the file descriptors used for the I/O servers on sunfire by typing the command more testfile/pvfs2cp2 | grep connect. From Fig. 5.10 (we configured the I/O servers and the metadata server to listen on the default TCP port number 3334), we determined the file descriptors used for sunfire6 through sunfire10 by matching the IP addresses in Fig. 5.10 with the names of these machines. The results are shown in Table 5.3. Further, we used the command more testfile/pvfs2cp2 | grep writev | more to determine how the file was distributed across the I/O servers. Fig. 5.11 shows a small part of the output of this command, where we saw that the distance between neighboring blocks on the same host was 320 KB (e.g., 385-65, 321-1, etc.).

Table 5.3: The file descriptors and IP addresses for sunfire6 through sunfire10

File descriptor    IP address        Host name
4                  128.143.63.248    sunfire10
6                  128.143.63.216    sunfire6
7                  128.143.63.226    sunfire9
8                  128.143.63.224    sunfire7
9                  128.143.63.225    sunfire8



writev( 4, ..., " 65aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, " 385aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, ...
writev( 7, ..., " 1aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, " 321aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, ...
writev( 6, ..., " 129aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, " 449aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, ...
writev( 8, ..., " 193aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, " 513aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, ...
writev( 9, ..., " 257aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, " 577aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, ...

Figure 5.11: A part of the output of the command more testfile/pvfs2cp2 | grep writev | more


Since each stripe was 64 KB and there were five I/O servers, neighboring blocks on the same host were 64 × 5 = 320 KB apart. Combining Fig. 5.11 and Table 5.3, we summarize the data-distribution pattern for test_1000M in Table 5.4. Thus, test_1000M was distributed cyclically across sunfire9, sunfire10, sunfire6, sunfire7, and sunfire8. Finally, we examined the output of pvfs2-fs-dump for test_1000M, as shown in Fig. 5.12. Combining Fig. 5.12 and Table 5.2, we found that the I/O-server sequence given by pvfs2-fs-dump was also sunfire9, sunfire10, sunfire6, sunfire7, and sunfire8. Therefore, we concluded that pvfs2-fs-dump shows the round-robin sequence of the I/O servers used for file distribution. (We repeated this procedure for many files and found that the result always holds; the PVFS2 team also confirmed it.)

Table 5.4: The data-distribution pattern for /pvfs2/test_1000M

File descriptor    Host name    Starting offset for each block
4                  sunfire10    65, 385, 705, ..., 1023745
6                  sunfire6     129, 449, 769, ..., 1023809
7                  sunfire9     1, 321, 641, 961, ..., 1023681
8                  sunfire7     193, 513, 833, ..., 1023873
9                  sunfire8     257, 577, 897, ..., 1023937

...
File: test_1000M
handle = 715827870, type = Metafile, server = 0
handle = 4294967284, type = Datafile, server = 4
handle = 1431655756, type = Datafile, server = 0
handle = 2147483638, type = Datafile, server = 1
handle = 2863311520, type = Datafile, server = 2
handle = 3579139402, type = Datafile, server = 3
...

Figure 5.12: The pvfs2-fs-dump output for the test_1000M file
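The layout in Table 5.4 follows directly from round-robin striping. The short program below is an illustration (not PVFS2 code) that reproduces the starting offsets under a 64 KB stripe size and the server order observed for test_1000M, and makes explicit why consecutive blocks on one host are 64 × 5 = 320 KB apart.

#include <stdio.h>

int main(void)
{
    /* Round-robin order observed for test_1000M (Table 5.4). */
    const char *order[] = { "sunfire9", "sunfire10", "sunfire6",
                            "sunfire7", "sunfire8" };
    const int num_servers = 5;
    const int stripe_kb   = 64;      /* default PVFS2 stripe size */

    for (int s = 0; s < num_servers; s++) {
        printf("%-10s:", order[s]);
        /* The k-th stripe on server s starts at (k*num_servers + s)*stripe_kb.
         * In the 1-KB-block numbering of Fig. 5.9 this is that value + 1,
         * giving 1, 321, 641, ... for sunfire9, 65, 385, 705, ... for
         * sunfire10, and so on; consecutive stripes on one host are
         * stripe_kb * num_servers = 320 KB apart. */
        for (int k = 0; k < 4; k++)
            printf(" %d", (k * num_servers + s) * stripe_kb + 1);
        printf(" ...\n");
    }
    return 0;
}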

For the third question, on the stripe size, we first used filegenerator to create a 128 KB file, called test_128K. Then, we typed the command strace pvfs2-cp -s 131072 -t /tmp/test_128K /pvfs2/test_128K2 -o pvfs2cp, which specified the stripe size as 128 KB via the -s option. Fig. 5.13 shows a part of the strace output, where the stripe size was 64 KB instead. Thus, we concluded that in PVFS2 1.0.1, pvfs2-cp has a bug of ignoring the -s option. (We reported this problem to the pvfs2-developer mailing list and were notified that it would be fixed in the future.)

writev( 4,...," 1aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"...)
...
writev( 6,...," 65aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"...)
...

Figure 5.13: A snippet from the file pvfs2cp

To change the default stripe size, we investigated the PVFS2 1.0.1 source code. We found that the statement that specifies the default stripe size (64 KB) is located in $PVFS2dir/src/io/description/Dist-simple-stripe.c, where $PVFS2dir denotes where PVFS2 is installed, as shown below:

static PVFS_simple_stripe_params simple_stripe_params =
{
    65536 /* strip size */
};

By setting the parameter simple_stripe_params, we can change the default stripe size and thus work around the problem of pvfs2-cp ignoring the -s option. For example, we set simple_stripe_params = 1048576 and recompiled the code. Then, we used pvfs2-cp to copy test_1000M into PVFS2 and used strace to observe the system calls issued by pvfs2-cp. Fig. 5.14 shows a part of the strace output, where test_1000M was distributed across the I/O servers with a 1 MB stripe size.

writev( 4,...," 1aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"...,)
...
writev( 6,...," 1025aaaaaaaaaaaaaaaaaaaaaaaaaaaa"...,)
...
writev( 7,...," 2049aaaaaaaaaaaaaaaaaaaaaaaaaaaa"...,)
...
writev( 8,...," 3073aaaaaaaaaaaaaaaaaaaaaaaaaaaa"...,)
...
writev( 9,...," 4097aaaaaaaaaaaaaaaaaaaaaaaaaaaa"...,)
...

Figure 5.14: A part of the output of the strace command
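For reference, the corresponding edit is just the initializer value; a sketch of the modified Dist-simple-stripe.c fragment with the 1 MB value we used is:

/* local edit: raise the default stripe size from 64 KB (65536) to 1 MB */
static PVFS_simple_stripe_params simple_stripe_params =
{
    1048576 /* strip size */
};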

Finally, we addressed the problem that PVFS2 stripes files across the I/O servers in a nondeterministic sequence. We found that inside the program $PVFS2dir/src/common/misc/pint-cached-config.c there is a function, PINT_cached_config_get_next_io(), which chooses a random starting I/O server and then uses the order specified in pvfs2-fs2.conf to distribute a file, as shown in Fig. 5.15. The reason PVFS2 was designed to stripe data starting from a random I/O server is load balancing. In our general-case cluster solution, however, we need to predict how a file is striped in order to minimize network-and-disk contention. Hence, we changed the statement that picks the random starting point in Fig. 5.15, jitter = (rand() % num_io_servers);, to jitter = -1; and obtained a predictable (fixed) order of data distribution. In other words, a file is now distributed across all the I/O servers according to the logical order specified in pvfs2-fs2.conf. Thus, for the second PVFS2 system, the sequence is sunfire10, sunfire6, sunfire7, sunfire8, and sunfire9; for the first PVFS2 system, the sequence is sunfire1, sunfire2, sunfire3, sunfire4, and sunfire5. Consequently, given the stripe size, we can determine exactly how a file is striped across the I/O servers.



/* PINT_cached_config_get_next_io()
 * returns the address of a set of servers that should be used to
 * store new pieces of file data. This function is responsible for
 * evenly distributing the file data storage load to all servers.
 */
int PINT_cached_config_get_next_io(...)
{
    ...
    num_io_servers = PINT_llist_count(
        cur_config_cache->fs->data_handle_ranges);

    /* pick random starting point */
    jitter = (rand() % num_io_servers);
    while(jitter-- > -1)
    {
        cur_mapping = PINT_llist_head(cur_config_cache->data_server_cursor);
        ...
        cur_config_cache->data_server_cursor = PINT_llist_next(
            cur_config_cache->data_server_cursor);
    }

    while(num_servers)
    {
        ...
        cur_config_cache->data_server_cursor = PINT_llist_next(
            cur_config_cache->data_server_cursor);
        data_server_bmi_str = PINT_config_get_host_addr_ptr(
            config, cur_mapping->alias_mapping->host_alias);
        ...
    }
}

Figure 5.15: A snippet of the source code for PINT_cached_config_get_next_io()
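For quick reference, the change described above amounts to replacing the random starting point with a fixed one; a minimal sketch of the edit inside this function (our modification, with the original line left as a comment) is:

/* original PVFS2 1.0.1 behavior: pick a random starting I/O server
 * so that file data is load-balanced across the servers */
/* jitter = (rand() % num_io_servers); */

/* modification: always start from the first server listed in
 * pvfs2-fs2.conf, so that the striping order is predictable */
jitter = -1;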


5.4.4 Implementation: Modifications to GridFTP


GridFTP stripes data across data nodes according to a data-connection sequence, termed the stripe index, which ranges from 0 to n − 1. To meet the condition that GridFTP stripes data across data nodes in the same sequence as PVFS2 does across PVFS2 I/O servers, we first need to answer the question: what is the stripe index for each pair of sending and receiving data nodes? In other


words, in GridFTP striped transfer, how and in what order are sending data nodes matched with receiving ones? The GridFTP specification [1] does not address this question. In this section, we first investigate how sending and receiving data nodes are matched. Our experimental results show that the matching is nondeterministic and, thus, we cannot avoid network-and-disk contention unless we modify the GridFTP code. Then, we describe how we modified the GridFTP code to obtain a deterministic matching sequence between sending and receiving data nodes.

We installed the GridFTP package provided by GT3.9.5 on sunfire. This GridFTP package contains the functionality of GridFTP striped transfer. We started GridFTP servers on sunfire1 through sunfire10 such that sunfire1 and sunfire6 are front ends and the other eight hosts are data nodes. Fig. 5.16 shows the commands. With the -r option, we specified that the data nodes for sunfire1 were ordered as sunfire2 through sunfire5 and those for sunfire6 were sunfire7 through sunfire10. The -dn option means that the GridFTP server is a data node. We expected sunfire2 through sunfire5 and sunfire7 through sunfire10 to be matched according to the sequences specified in the -r options, meaning that sunfire2 would communicate with sunfire7, sunfire3 with sunfire8, and so on. However, the following results show that GridFTP striped transfer does not work in this ideal way.

[xf4c@sunfire1 xf4c]$ /home/xf4c/gt3.9.5/sbin/globus-gridftp-server -aa -p 50001
    -r sunfire2:5001, sunfire3:5001, sunfire4:5001, sunfire5:5001
[xf4c@sunfire6 etc]$ /home/xf4c/gt3.9.5/sbin/globus-gridftp-server -aa -p 50002
    -r sunfire7:5001, sunfire8:5001, sunfire9:5001, sunfire10:5001
[xf4c@sunfire2 xf4c]$ /home/xf4c/gt3.9.5/sbin/globus-gridftp-server -aa -p 50001 -dn
...
[xf4c@sunfire5 xf4c]$ /home/xf4c/gt3.9.5/sbin/globus-gridftp-server -aa -p 50001 -dn
[xf4c@sunfire7 xf4c]$ /home/xf4c/gt3.9.5/sbin/globus-gridftp-server -aa -p 50001 -dn
...
[xf4c@sunfire10 xf4c]$ /home/xf4c/gt3.9.5/sbin/globus-gridftp-server -aa -p 50001 -dn

Figure 5.16: The commands to start GridFTP servers on sunfire

We started globus-url-copy on a third party, sunfire11, to use the functionality of GridFTP striped



transfer (by turning on the -stripe option). The command is as follows:

[xf4c@sunfire11 xf4c]$ $GLOBUS_LOCATION/bin/globus-url-copy -vb -stripe
    ftp://sunfire1:50001/home/xf4c/testfile/test_1G
    ftp://sunfire6:50002/home/xf4c/testfile/test_1G1 2>dbg1.txt -dbg


We turned on the debug mode with the -dbg option so that we could obtain the details. Fig. 5.17 shows a part of the debug output. By examining the information in Fig. 5.17 and Table 5.3 on page 57, we saw that the sequence of host-port pairs returned by the SPAS command was sunfire10, sunfire9, sunfire8, sunfire7, rather than the sequence sunfire7 through sunfire10 specified by the -r option for sunfire6.

The result of SPAS:
debug: sending command: SPAS
debug: response from ftp://sunfire6:50002/home/xf4c/testfile/test_1G1:
229-Entering Striped Passive Mode.
 128,143,63,248,185,185
 128,143,63,226,186,31
 128,143,63,225,185,170
 128,143,63,224,186,15
229 End

Figure 5.17: A part of the debug output for the GridFTP striped transfer

Before the GridFTP striped transfer, we also started tcpdump [51] to capture the GridFTP traffic amongst sunfire1 through sunfire10. After the transfer finished, we used tcptrace [52] to analyze the captured traffic. Fig. 5.18 shows the tcptrace outputs for sunfire7 through sunfire10. The GridFTP data connections were between sunfire4 and sunfire10, sunfire3 and sunfire9, sunfire2 and sunfire8, and sunfire5 and sunfire7. Thus, when the sending front end, sunfire1, executed the SPOR command, it did not require its data nodes (sunfire2 through sunfire5) to establish connections sequentially to the hosts returned by the SPAS command (sunfire10, sunfire9, sunfire8, sunfire7). We repeated the experiment several times and found that neither SPAS nor SPOR follows the sequence specified by the -r option. Hence, we could not predict how data connections would be established between multiple data nodes.



[xf4c@sunfire10 tcptrace-6.6.7]$ tcptrace /tmp/sunfire10.log
280048 packets seen, 280020 TCP packets traced
elapsed wallclock time: 0:00:00.652783, 429006 pkts/sec analyzed
trace file elapsed time: 0:08:30.409906
TCP connection info:
  1: sunfire6.cs.Virginia.EDU:47763 - sunfire10.cs.Virginia.EDU:5001 (a2b) 221> 187< (complete)
  2: sunfire4.cs.Virginia.EDU:4878 - sunfire10.cs.Virginia.EDU:47545 (c2d) 186099> 93513< (complete)

[xf4c@sunfire9 tcptrace-6.6.7]$ tcptrace /tmp/sunfire9.log
278903 packets seen, 278885 TCP packets traced
elapsed wallclock time: 0:00:00.891238, 312938 pkts/sec analyzed
trace file elapsed time: 0:07:27.005080
TCP connection info:
  1: sunfire6.cs.Virginia.EDU:47764 - sunfire9.cs.Virginia.EDU:5001 (a2b) 212> 174< (complete)
  2: sunfire3.cs.Virginia.EDU:47586 - sunfire9.cs.Virginia.EDU:47647 (c2d) 185247> 93252< (complete)

[xf4c@sunfire8 tcptrace-6.6.7]$ tcptrace /tmp/sunfire8.log
279503 packets seen, 279482 TCP packets traced
elapsed wallclock time: 0:00:00.745197, 375072 pkts/sec analyzed
trace file elapsed time: 0:07:50.749054
TCP connection info:
  1: sunfire6.cs.Virginia.EDU:47765 - sunfire8.cs.Virginia.EDU:5001 (a2b) 215> 180< (complete)
  2: sunfire2.cs.Virginia.EDU:48556 - sunfire8.cs.Virginia.EDU:47530 (c2d) 185827> 93260< (complete)

[xf4c@sunfire7 tcptrace-6.6.7]$ tcptrace /tmp/sunfire7.log
275137 packets seen, 275109 TCP packets traced
elapsed wallclock time: 0:00:01.237319, 222365 pkts/sec analyzed
trace file elapsed time: 0:08:30.410378
TCP connection info:
  1: sunfire6.cs.Virginia.EDU:47766 - sunfire7.cs.Virginia.EDU:5001 (a2b) 209> 167< (complete)
  2: sunfire5.cs.Virginia.EDU:47577 - sunfire7.cs.Virginia.EDU:47631 (c2d) 182995> 91738< (complete)

Figure 5.18: The tcptrace outputs for GridFTP striped transfer before we modified the GridFTP code


These nondeterministic data connections between sending and receiving data nodes make it unsuitable to deploy the general-case cluster solution on CHEETAH, because we need to reserve bandwidth before a data transfer. Given the nondeterminism, we would need to reserve bandwidth between every possible pair of sending and receiving hosts; there are n × (n − 1) such pairs in total. We would waste, and could even run out of, bandwidth if we reserved bandwidth for all possible pairs. We could sidestep this problem by reserving bandwidth between the two cluster switches and allowing any host connected to one switch to communicate with any host connected to the other switch. However, to minimize network-and-disk contention, we have to make the data connections deterministic. We studied the GridFTP source code in GT3.9.5 and modified the implementation of the SPAS and SPOR commands. For the SPAS command, we first obtained the IP addresses of the data nodes specified in the -r option for a receiving front end. We then sorted the list of host-port pairs generated by the old SPAS command according to this IP-address order for the receiving data nodes, and let SPAS return the sorted list to the third party negotiating the GridFTP striped transfer. Thus, the argument of the SPOR command sent to the sending front end was also sorted by the order of the IP addresses of the receiving data nodes. For the SPOR command, we required the sending data nodes specified in the -r option for a sending front end to initiate data connections sequentially to the receiving data nodes specified in the argument of the SPOR command. In this way, sending and receiving data nodes are matched according to their sequences in the -r options for the sending and receiving front ends. Additionally, their data connections have ascending stripe indexes from 0 to n − 1. Hence, it is easy to make GridFTP stripe data across data nodes in the same sequence as PVFS2 does across PVFS2 I/O servers; we only need to set the -r option such that the GridFTP data nodes have the same sequence as the PVFS2 I/O servers.
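As a rough illustration of the SPAS-side change, the reordering is essentially a sort of the returned host-port list into the -r order. The sketch below is a standalone simplification with hypothetical structure and function names (it is not the actual GT3.9.5 code); the sample values are taken from Fig. 5.17 and Table 5.3.

#include <stdio.h>
#include <string.h>

/* Hypothetical record for one host-port pair returned by SPAS. */
typedef struct {
    char ip[16];   /* IP address of a receiving data node */
    int  port;     /* listening port chosen by that node  */
} hostport_t;

/* Reorder the SPAS reply so that it follows the order of the receiving
 * data nodes given in the -r option of the receiving front end.
 * r_order holds the -r IP addresses in their configured order. */
static void sort_spas_reply(hostport_t *spas, int n, char r_order[][16])
{
    hostport_t sorted[16];
    int k = 0;

    for (int i = 0; i < n; i++)          /* walk the -r order            */
        for (int j = 0; j < n; j++)      /* find the matching SPAS entry */
            if (strcmp(r_order[i], spas[j].ip) == 0)
                sorted[k++] = spas[j];

    memcpy(spas, sorted, (size_t)n * sizeof(hostport_t));
}

int main(void)
{
    /* SPAS reply in the order we actually observed (Fig. 5.17). */
    hostport_t spas[4] = {
        { "128.143.63.248", 47545 },   /* sunfire10 */
        { "128.143.63.226", 47647 },   /* sunfire9  */
        { "128.143.63.225", 47530 },   /* sunfire8  */
        { "128.143.63.224", 47631 },   /* sunfire7  */
    };
    /* -r order for the receiving front end: sunfire7 through sunfire10. */
    char r_order[4][16] = {
        "128.143.63.224", "128.143.63.225",
        "128.143.63.226", "128.143.63.248"
    };

    sort_spas_reply(spas, 4, r_order);
    for (int i = 0; i < 4; i++)
        printf("stripe index %d -> %s:%d\n", i, spas[i].ip, spas[i].port);
    return 0;
}

After this reordering, the sorted list is what the third party passes on in the SPOR argument, so the sending data nodes connect to the receivers in the -r order and the stripe indexes line up with the PVFS2 I/O-server sequence.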

5.4.5 Experimental Results


We tested the general-case cluster solution on sunfire. In this section, we present experimental results showing that network-and-disk contention is minimized after our modifications to GridFTP and PVFS2. There are two PVFS2 systems on sunfire (see Section 5.4.3 on page 53). The I/O servers for the first


PVFS2 system are ordered as sunfire1 through sunfire5. The I/O servers for the second PVFS2 system are ordered as sunfire10, then sunfire6 through sunfire9. We started GridFTP front ends on sunfire1 and sunfire10, and GridFTP data nodes on sunfire1 through sunfire10. The data nodes for sunfire1 were ordered as sunfire1 through sunfire5, and those for sunfire10 were sunfire10, then sunfire6 through sunfire9. Then, we started globus-url-copy on sunfire11 to conduct a file transfer between the two PVFS2 systems on the sunfire cluster. The command is as follows:

[xf4c@sunfire11 xf4c]$ $GLOBUS_LOCATION/bin/globus-url-copy -vb -dbg -stripe
    ftp://sunfire1:50001/pvfs2/test_1G ftp://sunfire10:50002/pvfs2/test_1G1 2>dbg1.txt

Fig. 5.19 shows the tcptrace outputs for sunfire6 through sunfire10, where we saw that connections were established between sunfire1 and sunfire10, sunfire2 and sunfire6, sunfire3 and sunfire7, sunfire4 and sunfire8, and sunfire5 and sunfire9. Hence, the data connections were established according to the sequences specified in the -r options for sunfire1 and sunfire10. Note that in Fig. 5.19, we omitted some TCP-connection information for sunfire6 through sunfire10 to save space. Those connections were not essential for the purpose of our experiment: they carried either the communication between the PVFS2 metadata server and the PVFS2 I/O servers or the communication between a GridFTP front end and its GridFTP data nodes, and they contained a comparatively small number of packets. There were no other connections amongst the PVFS2 I/O servers; in other words, each data node transferred only the data located on its local disk. Thus, we minimized network-and-disk contention. We repeated the test many times and did not find any exceptions to these results.

Since we avoided unnecessary network-and-disk contention, we expected the general-case cluster solution to have a speedup of n (n = 5 in our experiment) over a normal GridFTP transfer involving only a single source-sink pair. Surprisingly, we found that the cluster solution gained only a small speedup. The reason for the poor performance is that PVFS2 had a much lower read-write speed than NFS and Linux ext2 on sunfire. Thus, we need to continue working on PVFS2, or try other parallel file systems (e.g., GPFS), to obtain a high read-write throughput.



[xf4c@sunfire6 xf4c]$ tcptracescript sunfire6.log
181171 packets seen, 181163 TCP packets traced
...
TCP connection info:
  1: sunfire8.cs.Virginia.EDU:44786 - sunfire6.cs.Virginia.EDU:3334 (a2b) 1565> 796<
  ...
  7: sunfire6.cs.Virginia.EDU:44721 - sunfire9.cs.Virginia.EDU:3334 (m2n) 2> 1<
  8: sunfire2.cs.Virginia.EDU:58306 - sunfire6.cs.Virginia.EDU:56735 (o2p) 121641> 50070< (complete)
  9: sunfire7.cs.Virginia.EDU:44734 - sunfire6.cs.Virginia.EDU:3334 (q2r) 1571> 791<
  10: sunfire9.cs.Virginia.EDU:45156 - sunfire6.cs.Virginia.EDU:3334 (s2t) 1549> 789<

[xf4c@sunfire7 xf4c]$ tcptracescript sunfire7.log
176887 packets seen, 176879 TCP packets traced
...
  9: sunfire3.cs.Virginia.EDU:57513 - sunfire7.cs.Virginia.EDU:56871 (q2r) 121617> 52921< (complete)
...

[xf4c@sunfire8 xf4c]$ tcptracescript sunfire8.log
155197 packets seen, 155189 TCP packets traced
...
  17: sunfire4.cs.Virginia.EDU:57002 - sunfire8.cs.Virginia.EDU:56999 (ag2ah) 105821> 46770< (complete)
...

[xf4c@sunfire9 xf4c]$ tcptracescript sunfire9.log
181769 packets seen, 181760 TCP packets traced
...
  10: sunfire5.cs.Virginia.EDU:56857 - sunfire9.cs.Virginia.EDU:56905 (s2t) 123475> 55980< (complete)

[xf4c@sunfire10 xf4c]$ tcptracescript sunfire10.log
177961 packets seen, 177954 TCP packets traced
...
  7: sunfire1.cs.Virginia.EDU:44346 - sunfire10.cs.Virginia.EDU:58105 (m2n) 122541> 53132< (complete)
...

Figure 5.19: The tcptrace outputs for GridFTP striped transfer after we modified the GridFTP code


5.5 The Specific Cluster Solution for TSI


As mentioned in Section 5.1, in the TSI project, scientists at NCSU need to download multi-TB datasets from the Cray X1E at ORNL to orbitty at their local site. These datasets are stored as separate 10 GB files on the Cray disks. We have not been granted permission to access the Cray directly. The current file-transfer solutions, bbcp or LoRS, use one intermediate hop to transfer the files to a storage depot, TSILN, before moving them to orbitty. These solutions use only a single source and sink to transfer data, and achieve a throughput of 200 Mb/s to 400 Mb/s.

We can improve the throughput by using a specific cluster solution as follows. Given that the dataset is composed of many (about 200) separate files, we move these files from the Cray X1E to five machines connected to CHEETAH, called zelda1 through zelda5. Then, we transfer the files over CHEETAH circuits established between the five machines zelda1 through zelda5 and five computing nodes of orbitty. Any file-transfer tool can be used to carry out the transfers in parallel. Fig. 5.20 shows the network configuration for this approach. This solution employs pipelining of the file movement between the Cray and the zelda hosts and the file movement between the zelda and
orbitty clusters. Since we have to move about 200 files but only have five hosts at each end, parallelism is achieved at the file level rather than at the block level used in the general cluster solution with PVFS2 and GridFTP.
Figure 5.20: The specific cluster solution for TSI. (The figure shows the X1E at ORNL connected over the ORNL LAN to the five zelda hosts, zelda1 through zelda5, which connect across CHEETAH to the orbitty cluster at NCSU; orbitty comprises controller nodes rudi and orbitty, compute nodes compute0-0 through compute0-19, disk nodes, a monitoring host, and Dell 5424/5224 switches.)
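One simple way to picture the file-level parallelism is a round-robin assignment of the dataset's files to the five circuit pairs. The sketch below is purely illustrative; the host names and file count are placeholders, not the actual scripts used for the TSI transfers.

#include <stdio.h>

#define NUM_PAIRS 5   /* zelda1..zelda5 paired with five orbitty compute nodes */

int main(void)
{
    const char *senders[NUM_PAIRS]   = { "zelda1", "zelda2", "zelda3",
                                         "zelda4", "zelda5" };
    const char *receivers[NUM_PAIRS] = { "compute-0-0", "compute-0-1",
                                         "compute-0-2", "compute-0-3",
                                         "compute-0-4" };
    int num_files = 200;   /* roughly 200 separate 10 GB files in the dataset */

    /* Assign file i to circuit pair (i mod NUM_PAIRS); each pair then runs an
     * ordinary file-transfer tool over its own dedicated CHEETAH circuit. */
    for (int i = 0; i < num_files; i++)
        printf("file_%03d: %s -> %s\n",
               i, senders[i % NUM_PAIRS], receivers[i % NUM_PAIRS]);
    return 0;
}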

On a 1-Gb/s circuit between zelda5 at ORNL and compute-0-2 at NCSU, we achieved a disk-to-disk throughput of 720 Mb/s using ftp. Thus, with five pairs of parallel independent transfers, we expect an aggregate throughput of 5 × 720 Mb/s = 3.6 Gb/s.

5.6 Conclusions

In this chapter, we described single-host and cluster-based solutions to achieve throughput above 1 Gb/s over WANs. We reasoned that the hardware solution of equipping end hosts with high-speed hardware is feasible but neither scalable nor cost-effective. We then proposed a general-case cluster solution, which uses PVFS2 and GridFTP to transfer data between multiple end hosts in parallel. By requiring GridFTP servers to transfer only the data blocks located on their local disks, we minimize end-host network-and-disk contention. To achieve this, we modified the source code of PVFS2 to force a fixed data-block distribution and changed the implementation of the GridFTP SPAS and SPOR commands. Finally, we presented a solution for fast file transfers in the TSI project. By reserving bandwidth and conducting transfers in parallel between five pairs of senders and receivers, we expect to achieve an aggregate disk-to-disk throughput of 3.6 Gb/s.

Chapter 6
CONCLUSIONS AND FUTURE WORK

We summarize the thesis in this chapter. We also discuss the future work needed to advance our present research.

6.1 Conclusions

In this thesis, we studied applications for optical circuit-switched networks. In Chapter 2, we reviewed different types of GMPLS networks and reasoned that they are call-blocking networks that only support immediate-request calls. We also described CHEETAH as an example of a GMPLS network. Then, in Chapters 3 through 5, we concentrated on three topics concerning applications for GMPLS networks. First, in Chapter 3, we addressed an important question: what applications are suitable to run on GMPLS networks to achieve both high utilization and low call-blocking probability? We presented single-link bandwidth-sharing models for two categories of applications: those for which the per-circuit capacity and the holding time are independent, and those for which they are directly related (e.g., file transfers). For both categories of applications, we concluded that ideal applications for GMPLS networks require per-circuit rates on the order of one-hundredth of the link capacity. The first category of applications should have long call-holding times to keep the number of line cards small. In contrast, the second category of applications needs to have short call-holding times (on the order of seconds).


Second, based on the conclusions of Chapter 3, we believe that web file transfers can use CHEETAH efficiently. Thus, in Chapter 4, we designed and implemented a new web-based file-transfer software package, called WebFT. We integrated the CHEETAH end-host software APIs into the WebFT package to provide CHEETAH-related services transparently to users. By leveraging CGI, the WebFT package is completely independent of the web-server and browser software and therefore does not require any modifications to the latter. We also tested WebFT on CHEETAH, and our experimental results showed that WebFT can provide deterministic data services to CHEETAH clients on dedicated end-to-end circuits. Finally, in Chapter 5, we explained that TCP's congestion-control algorithm and end-host limitations make it hard to achieve a throughput above 1 Gb/s across long-RTT WANs. We then described another parallel file-transfer application to overcome these two throughput-limiting factors. We used PVFS2 and GridFTP to implement a general-case cluster solution, in which a source file is not split. We also modified the PVFS2 and GridFTP code to avoid unnecessary end-host network-and-disk contention, and thus maximized throughput. Furthermore, for the TSI project, where a source file is already split into many parts, we presented a specific cluster solution, which uses several pairs of parallel independent transfers to obtain multi-Gb/s throughput.

6.2 Future Work

We list several significant directions in which we would like to advance this study:

Analytical models of GMPLS networks: We used single-link bandwidth-sharing models to analyze the suitability of applications for GMPLS networks. We assumed that there was only a single class of applications sharing the network. We plan to extend the analytical models to multiple classes based on the multi-class call-blocking model presented by Kaufman [28]. We also plan to extend our models to multiple links, and then to network models, by referring to the work done by Ramesh et al. [40] and Li et al. [30].

Web transfer application on CHEETAH: Currently, only hosts directly connected to CHEETAH can use WebFT to improve web performance. We plan to design and implement


a web application using partial-path circuits such that non-CHEETAH hosts can also use CHEETAH. We will use the proxy software Squid [47] to break up a long-distance connectionless path into a partial circuit through CHEETAH and two low-RTT connectionless sub-paths. Using this approach, we can avoid congested connectionless links and reduce RTT, so that non-CHEETAH hosts can use CHEETAH to improve web performance. In addition, we can leverage the web-caching protocols provided by Squid to further improve web performance. We will also extend our partial-path circuit models to include other CO networks and reduce RTT on a national or even global scale.

Parallel file transfers on CHEETAH: We will test the general-case cluster solution on CHEETAH. We will continue working on PVFS2, or try GPFS, to overcome the barrier of low I/O throughput caused by end hosts. For the TSI project, if we can directly access the Cray, we will remove the intermediate step that moves data from the Cray to zelda, and apply the general-case cluster solution directly to a single-step file transfer between the Cray and orbitty.

Bibliography

[1] Allcock, W. GridFTP: Protocol extensions to FTP for the Grid. Global Grid Forum Recommendation GFD.20, Mar. 2003.

[2] Allcock, W., Bresnahan, J., Kettimuthu, R., Link, M., Dumitrescu, C., Raicu, I., and Foster, I. The Globus striped GridFTP framework and server. In Proceedings of Super Computing 2005 (Nov. 2005).

[3] Awduche, D., Berger, L., Gan, D., Li, T., Srinivasan, V., and Swallow, G. RSVP-TE: Extensions to RSVP for LSP tunnels. RFC 3209, Dec. 2001.

[4] Baker, M., and Feng, W. 10-Gigabit Ethernet helps relieve network bottlenecks for bandwidth-intensive applications. Dell Power Solutions (Mar. 2004), 113-116.

[5] Barclay, T., Chong, W., and Gray, J. A quick look at Serial ATA (SATA) disk performance. Technical Report MSR-TR-2003-70, Oct. 2003.

[6] bbcp. http://www.slac.stanford.edu/~abh/bbcp/.

[7] Bell, E., Smith, A., Langille, P., Rijhsinghani, A., and McCloghrie, K. Definitions of managed objects for bridges with traffic classes, multicast filtering and virtual LAN extensions. RFC 2674, Aug. 1999.

[8] Braden, R., Zhang, L., Berson, S., Herzog, S., and Jamin, S. Resource ReSerVation Protocol (RSVP), version 1 functional specifications. IETF RFC 2205, Sept. 1997.

[9] Breslau, L., Cao, P., Fan, L., Phillips, G., and Shenker, S. Web caching and Zipf-like distributions: Evidence and implications. In Proceedings of IEEE INFOCOM'99 (Mar. 1999).

[10] Brewer, J., and Sekel, J. PCI Express technology. Dell white paper, Feb. 2004.

[11] CANARIE's CA*net 4. http://www.canarie.ca/canet4/index.html.

[12] Carns, P. H., Ligon III, W. B., Ross, R. B., and Thakur, R. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference (Atlanta, GA, Oct. 2000), pp. 317-327.

[13] CHEETAH. http://cheetah.cs.virginia.edu.

[14] Crovella, M., and Bestavros, A. Self-similarity in World Wide Web traffic: evidence and possible causes. IEEE/ACM Transactions on Networking 5, 6 (Dec. 1997).

[15] The Energy Sciences Network (ESnet). http://www.es.net/.

[16] Fang, X., Zheng, X., and Veeraraghavan, M. Improving Web performance through new networking technologies. In IEEE ICIW'06 (Feb. 2006).

[17] Florescu, D., Valduriez, P., Yagoub, K., and Issarny, V. Caching strategies for data-intensive Web sites. In Proceedings of the International Conference on Very Large Data Bases (VLDB) (Sept. 2000).

[18] Foster, I., and Kesselman, C. A metacomputing infrastructure toolkit. IEEE Commun. Mag. 11(2) (1997), 115-128.

[19] Garzotto, F. Ubiquitous Web applications. In Proceedings of the 5th East European Conference on Advances in Databases and Information Systems (Springer-Verlag, London, Sept. 2001).

[20] The Globus Alliance. http://www.globus.org/.

[21] General Parallel File System (GPFS). http://www-1.ibm.com/servers/eserver/clusters/software/gpfs.html.

[22] Guok, C. ESnet On-demand Secure Circuits and Advance Reservation System (OSCARS). http://www.es.net/oscars/index.html.

[23] Hurwitz, J., and Feng, W. End-to-end performance of 10-Gigabit Ethernet on commodity systems. IEEE Micro 24, 1 (2004).

[24] Hwang, S.-Y., and Riddle, R. Bandwidth Reservation for User Work (BRUW), May 2003.

[25] Virtual bridged Local Area Networks, May 2003.

[26] Internet2. http://www.internet2.net.

[27] Katz, D., Kompella, K., and Yeung, D. Traffic engineering (TE) extensions to OSPF version 2. RFC 3630, Sept. 2003.

[28] Kaufman, J. S. Blocking in a shared resource environment. IEEE Transactions on Communications 29 (Oct. 1981), 1474-1481.

[29] Lang, J. Link Management Protocol (LMP). IETF RFC 4204, Oct. 2005.

[30] Li, C. Y., Wai, P. K. A., and Li, V. O. K. The decomposition of a blocking model for connection-oriented networks. IEEE/ACM Trans. Netw. 12, 3 (2004), 549-558.

[31] Logistical Runtime System (LoRS). http://loci.cs.utk.edu/lors/.

[32] Meltzer, K., and Michalski, B. Writing CGI Applications with Perl. Addison-Wesley, Reading, MA, 2001.

[33] Mudambi, P., Zheng, X., and Veeraraghavan, M. A transport protocol for dedicated end-to-end circuits. In IEEE ICC 2006 (June 2006).

[34] OMNInet. http://www.icair.org/omninet/.

[35] Patterson, D. A., Gibson, G. A., and Katz, R. H. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the International Conference on Management of Data (SIGMOD) (June 1988).

[36] Postel, J., and Reynolds, J. File Transfer Protocol (FTP). IETF RFC 959, Oct. 1985.

[37] The Parallel Virtual File System project. http://www.parl.clemson.edu/pvfs/.

[38] PVFS2 Development Team. Parallel Virtual File System, version 2 (PVFS2). http://www.pvfs.org/pvfs2/pvfs2-guide.html, Sept. 2003.

[39] Parallel Virtual File System, version 2 (PVFS2). http://www.pvfs.org/pvfs2/.

[40] Ramesh, S., Rouskas, G. N., and Perros, H. G. Computing blocking probabilities in multi-class wavelength routing networks with multicast calls. IEEE Journal on Selected Areas in Communications 20 (Jan. 2002), 89-96.

[41] Rao, N. S. V., Wing, W. R., Carter, S. M., and Wu, Q. UltraScience Net: Network testbed for large-scale science applications. IEEE Commun. Mag. 43, 11 (Nov. 2005), 12-17.

[42] Rosen, E., Viswanathan, A., and Callon, R. Multiprotocol label switching architecture. RFC 3031, Jan. 2001.

[43] Ross, R. B., Carns, P. H., Ligon III, W. B., and Latham, R. Using the Parallel Virtual File System. http://www.parl.clemson.edu/pvfs/user-guide.html, July 2002.

[44] Schwartz, M. Telecommunication networks: protocols, modeling and analysis. Addison-Wesley, Boston, MA, 1986.

[45] Shiomoto, K., Papadimitriou, D., Roux, J.-L. L., Vigoureux, M., and Brungard, D. Requirements for GMPLS-based multi-region and multi-layer networks (MRN/MLN). IETF Internet Draft, Oct. 2005.

[46] Sobieski, J., Lehman, T., and Jabbari, B. Dynamic Resource Allocation via GMPLS Optical Networks (DRAGON). http://dragon.east.isi.edu/.

[47] Squid. http://www.squid-cache.org/.

[48] Sun Microsystems Inc. NFS: Network File System protocol specification. IETF RFC 1094, Mar. 1989.

[49] SURFnet. http://www.surfnet.nl/info/en/home.jsp.

[50] Tanenbaum, A. S. Computer Networks, fourth ed. Prentice Hall PTR, Upper Saddle River, New Jersey, 2002.

[51] Tcpdump public repository. http://www.tcpdump.org.

[52] Tcptrace official homepage. http://jarok.cs.ohiou.edu/software/tcptrace/.

[53] Tekram Systems Co., Ltd. http://www.tekram.com/.

[54] TSI. http://www.phy.ornl.gov/tsi/.

[55] UKLight. http://www.uklight.ac.uk/.

[56] Veeraraghavan, M., and Karol, M. Internetworking connectionless and connection-oriented networks. IEEE Commun. Mag. (Dec. 1999), 130-138.

[57] Veeraraghavan, M., Zheng, X., Lee, H., Gardner, M., and Feng, W. CHEETAH: Circuit-switched High-speed End-to-End Transport Architecture. In Proc. of Opticomm 2003 (Dallas, TX, Oct. 2003).

[58] Wang, H., Veeraraghavan, M., Karri, R., and Li, T. Design of a high-performance RSVP-TE signaling hardware accelerator. IEEE JSAC 23, 8 (Aug. 2005), 1588-1595.

[59] Zhu, X., Zheng, X., Veeraraghavan, M., Li, Z., Song, Q., Habib, I., and Rao, N. S. V. Implementation of a GMPLS-based network with end host initiated signaling. In IEEE ICC 2006 (June 2006).
