
MYRINET

Abstract: Myrinet, ANSI/VITA 26-1998, is a high-speed local area networking system designed by Myricom to be used as an interconnect between multiple machines to form computer clusters. Myrinet has much less protocol overhead than standards such as Ethernet, and therefore provides better throughput, less interference, and lower latency while using the host CPU. Although it can be used as a traditional networking system, Myrinet is often used directly by programs that "know" about it, thereby bypassing a call into the operating system.

Keywords: Myrinet, Myricom, Ethernet, latency, throughput, switching, duplex, Gigabit Ethernet.

Introduction: Myrinet is a cost-effective, high-performance, packet-communication and switching technology that is widely used to interconnect clusters of workstations, PCs, servers, blade servers, or single-board computers. Clusters provide an economical way of achieving high performance and high availability.

Myrinet physically consists of two fibre-optic cables, upstream and downstream, connected to the host computers with a single connector. Machines are connected via low-overhead routers and switches, as opposed to connecting one machine directly to another. Myrinet includes a number of fault-tolerance features, mostly backed by the switches. These include flow control, error control, and "heartbeat" monitoring on every link. The newest, "fourth-generation" Myrinet, called Myri-10G, supports a 10 Gbit/s data rate and is inter-operable with 10 Gigabit Ethernet on PHY, the physical layer (cables, connectors, distances, signaling).

Myrinet Overview:

Myricom Myrinet is the market leader in high-performance, high-availability cluster interconnects. Myricom shipped its first Myrinet products in August 1994, and there are now thousands of Myrinet installations, ranging in size up to more than 1,000 hosts. These sites include many of the world's premier cluster-computing systems.

Characteristics that distinguish Myrinet from other networks include:

 Full-duplex 2+2 Gbit/s or 10+10 Gbit/s data-rate links, switch ports, and interface ports
 Flow control, error control, and "heartbeat" continuity monitoring on every link
 Low-latency, cut-through, crossbar switches, with monitoring for high-availability applications
 Switch networks that can interconnect hundreds of hosts with a single, full-bisection network component, can scale to tens of thousands of hosts, and can also provide alternative communication paths between hosts
 Host interfaces that execute a control program to interact directly with host processes ("OS bypass") for low-latency communication, and directly with the network to send, receive, and buffer packets

Myrinet is an American National Standard -- ANSI/VITA 26-1998. The link and routing specifications are public, published, and open.

Myrinet Components:

 Myrinet switches
 Myrinet interfaces
 Myrinet link fiber or cables
 Myrinet software support for most common hosts and operating systems

Myrinet Packets:

– Header (up to 24 bytes)
– Arbitrary-length payload
– CRC, error-control

Myrinet Switches: Myrinet switches provide the high-speed cluster message-passing network for passing messages between computer nodes and for I/O. The Myrinet switches have a few counters that can be accessed from an Ethernet connection to the switch. These counters can be accessed to monitor the health of the connections, cables, etc. The following information refers to the 16-port switches, the Clos-64 switches, and the Myrinet2000 switches.

Myrinet Switch Counters:

The Myrinet switches maintain counters for the following types of received packets between reboots or clears:

• good_high, good_low - A 64-bit counter of the number of good packets received on this port.

• bad_high, bad_low - A 64-bit counter of the number of non-zero-CRC8 packets received on this port.

• timeouts - The number of times that the switch did not see a tail for too long coming in on a port. Timeouts are on the receive side, so they indicate that a packet went too long without a tail, or that the packet was blocked for too long.

• bad_routes - The number of packets that entered this port (receive) but had a routing byte that was off either end (out of range) of the switch.

• dead_routes - The number of packets that entered this port (receive) but had a routing byte destined for a valid port that was down, disconnected, or off.

• bad_packets - Received packets with non-zero CRC bytes at the tail.

• missed_beats - A port is not connected. Connected ports receive heartbeat signals from their opposites, and when these stop arriving, the counter goes up. If you have HA features enabled, the port will also shut itself down (portDownOnMissedBeat). A missed beat is no beat in 10 µs, but this is polled for by the xbar microcontrollers around every 10 ms and updated in the monitoring card about every 100 ms.

• port_state - A mask of the following bits:

 #define SW_PORT_ON 0x01
 #define SW_PORT_BEAT 0x02
 #define SW_PORT_TRANSMIT 0x04
 #define SW_PORT_RECEIVE 0x08

• control - The control register is in two parts.

o The 0xff bits are an internal state-machine state with the following values:

case 0: /* port was in reset; initialize control register */
case 1: /* port is initialized */
case 2: /* port is off, but automatic shutdown is enabled */
case 3: /* port is on, automatic shutdown is on */
case 10: /* port was disabled by user */

o The 0xff00 bits are a bit mask of the following bits:

#define HW_CTRL_XMIT_OFF 0x80
#define HW_CTRL_RECV_OFF 0x40
#define HW_CTRL_MISS_BEAT_OFF 0x20
#define HW_CTRL_ILL_OFF 0x10

System Requirements for Switch Monitoring:

Clos-64 or 16-port switch requirements:

• The Linux kernel must support the RARP protocol. The kernel configuration defaults to "no," so a new kernel may have to be configured. RARP can be built as a module or as part of the kernel, but as a module it has to be loaded; it is not very big.

 RARP support can be confirmed by looking for /proc/net/rarp. If the file exists, then RARP is supported. If not, a module may have to be loaded, or RARP is not supported.

RARP support might be configured as a module. The command > lsmod can be used to list modules, such as rarp.o, that have been installed. If there is a file /lib/modules/[release number]/rarp.o, then that release is configured as a module.
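The packet, counter, and register encodings described above can be exercised with a short sketch. This is illustrative Python, not Myricom code: the bit values are the ones listed in the text, while the function names and the CRC-8 polynomial (0x07) are assumptions for demonstration.

```python
# Bit values taken from the switch documentation above; helper names
# and the CRC-8 polynomial are illustrative assumptions.
SW_PORT_BITS = {0x01: "ON", 0x02: "BEAT", 0x04: "TRANSMIT", 0x08: "RECEIVE"}
HW_CTRL_BITS = {0x80: "XMIT_OFF", 0x40: "RECV_OFF",
                0x20: "MISS_BEAT_OFF", 0x10: "ILL_OFF"}

def counter64(high, low):
    """Combine good_high/good_low (or bad_high/bad_low) into one value."""
    return (high << 32) | low

def decode_port_state(state):
    """Return the names of the SW_PORT_* bits set in a port_state mask."""
    return [name for bit, name in SW_PORT_BITS.items() if state & bit]

def decode_control(control):
    """Split control into its state-machine byte (0xff) and flag bits (0xff00)."""
    fsm = control & 0xFF
    flags = [name for bit, name in HW_CTRL_BITS.items() if (control >> 8) & bit]
    return fsm, flags

def crc8(data, poly=0x07):
    """Bitwise CRC-8. A packet whose trailing CRC byte is correct yields 0
    over the whole packet, which is how good vs. bad packets are counted."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc
```

For example, decode_control(0x2003) reports state 3 (port on, automatic shutdown on) with the MISS_BEAT_OFF flag set.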


RARP support can be added with the command:

> insmod /lib/modules/[release number]/rarp.o

• IP addresses must be assigned for each switch. This implies that the switches must be connected to an Ethernet hub. Each switch has its hardware MAC address printed on the switch. The IP address must be broadcast from a Linux host using the rarp command; this command informs the Myrinet switch of its IP address.

• Setting the IP address in the switch: /sbin/rarp -s [IP address] [MAC Address]

• The "ping" command can be used to verify that the Myrinet switch has been assigned its IP address.

• The Myrinet switch will store that IP address until it is powered off or rebooted with the bob_test -reboot command. If it is powered off or rebooted, the rarp command will have to be repeated. The switch will not recognize a changed IP address with a second rarp command; it will have to be powered off or rebooted to change the previous address.

• Download the corresponding 16-port or 64-port version of the Myrinet Switch Query Program, bob_test, from Myrinet.

• There are several precompiled versions of the 64-port version. There is no Alpha version, but the Makefile requires only some simple modifications, and the program then builds correctly on the Alpha platform using gcc.

Myrinet2000, or M3, Switches:

Myrinet 2000 switches are considerably different from the Clos-64 (m2) and 16-port switches. First, these switches get their IP addresses from the dhcpd daemon, not from the rarp command, so the switches' IP and MAC addresses must be in dhcpd's configuration file, /etc/dhcpd.conf, which is generated from the Cluster Tools on Cplant. Dhcpd must be restarted when the file is modified; it does not respond to a SIGHUP or notice that the file has changed. Once the m3 switch has its address, it can be accessed by HTML browsing, SNMP commands, or one of several small programs supplied by Myrinet. Note that these programs do not all run correctly, so use them with caution.

Since then, hardware upgrades and software improvements have made Myrinet the network of choice for many cluster builders, and until recently there was hardly an alternative when a fast, low-latency network was required.

Like InfiniBand, Myrinet uses cut-through routing for efficient utilisation of the network. RDMA is also used to write to and read from the remote memory of other host adaptor cards, called Lanai cards. These cards interface with the PCI-X bus of the host they are attached to. The latest Myrinet implementation uses only fibers as signal carriers. This gives high flexibility in the connections and much headroom in the speed of the signals, but the fiber cables and connectors are rather delicate, which can lead to damage when cluster nodes have to be serviced. Myrinet offers ready-made 8-256 port switches. The 8- and 16-port switches are full crossbars. In principle, all larger networks are built from these using a Clos network topology. An example for a 64-port system is shown in the Figure.
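The Clos construction can be sized with simple arithmetic. The sketch below assumes a generic two-level, full-bisection layout (half of each leaf crossbar's ports face hosts, half face spine switches); it illustrates the principle rather than Myricom's exact switch layout, and the function name is made up.

```python
def clos_size(radix, hosts):
    """(leaf, spine) crossbars of the given port count ('radix') needed
    to connect `hosts` endpoints with full bisection bandwidth."""
    down = radix // 2                     # host-facing ports per leaf switch
    leaves = -(-hosts // down)            # ceil: enough leaves for all hosts
    spines = -(-leaves * down // radix)   # enough spines for all uplinks
    return leaves, spines
```

clos_size(16, 64) returns (8, 4): eight 16-port leaves, each with 8 host ports and 8 uplinks, plus four 16-port spines, 12 crossbars in total. The Figure in the text uses a different mix of 8- and 16-port crossbars, but the scaling principle is the same.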
Figure: An 8×16 Clos network using 8- and 16-port crossbar switches to connect 64 processors.

A Clos network is another example of a logarithmic network with the maximum bisectional bandwidth of the endpoints. Note that 4 ports of the 16×16 crossbar switches are unused, but other configurations need either more switches, or more connections, or both.

Myricom provides benchmark measurements of its networks, in which the newest Myrinet2000 switches give quite good results: about 230 MB/s for a ping-pong experiment, with latencies around 7 µs for small messages using the MPI version that Myricom distributes with its networks.

Myrinet Interfaces:

Myrinet packets may be of any length, and thus can encapsulate other types of packets, including IP packets, without an adaptation layer. Each packet is identified by type, so that a Myrinet, like an Ethernet, may carry packets of many types or protocols concurrently. Thus, Myrinet supports several software interfaces.

Myricom supplies two alternative low-level message-passing systems for Myrinet: GM, which is used in most Myrinet clusters today, and a newer system called MX (Myrinet Express). Other software interfaces such as MPI, Sockets, DAPL, VI, and PVM are layered efficiently over GM or MX, and are available from Myricom and from third parties. Both GM and MX provide "Ethernet emulation," in which any protocols carried over Ethernet, typically TCP/IP and UDP/IP, can be carried over Myrinet. The GM and MX systems provide protected user-level access to the Myrinet (secure in multi-user, multiprogramming environments); reliable, ordered delivery of messages; network mapping and route computation; and other features that support robust and error-free communication. The Software and Customer Support page provides a complete outline of Myrinet software, with links to documentation, and the Performance Measurements page shows the performance achieved with the various software interfaces.

Myrinet Express (MX) Software:

Myrinet Express (MX) is a new message-passing system from Myricom that provides lower latency with middleware layers such as MPI. MX includes Ethernet emulation and support for all protocols that run over Ethernet, including TCP/IP and UDP/IP. MX was developed for the Myri-10G series of NICs and is fully compatible with MX-2G at the API and application levels; MX-2G is already available on several operating systems for the PCI-X series of Myrinet NICs. Myrinet-2000 and Myri-10G employ the same network architecture and protocols and, when used with MX software support, are fully compatible with respect to applications and application-programming interfaces.
Myrinet Components and Software:

Myricom supplies Myrinet components and software in two series: Myrinet-2000 and Myri-10G. Myrinet-2000 is a superior alternative to Gigabit Ethernet for clusters, whereas Myri-10G offers performance and cost advantages over 10-Gigabit Ethernet. Myri-10G uses the same physical layers (PHYs: cables, connectors, signaling) as 10-Gigabit Ethernet, and is highly interoperable with 10-Gigabit Ethernet. In fact, Myri-10G NICs are both 10-Gigabit Myrinet NICs and 10-Gigabit Ethernet NICs.

Myrinet-2000 and Myri-10G employ the same network architecture and protocols and, when used with MX software support, are fully compatible with respect to applications and application-programming interfaces.

Myricom supplies Myrinet software support for most common hosts and operating systems. The software is supplied "open source," and other Myrinet software is available from third parties.

You or an integrator install the NICs and software in the hosts, and connect the network with cables and switches. The software maps the network and uses whatever communication paths are available from host to host. No switch programming or routing-table configuration is necessary.

Software Stack:

Figure: the Myrinet software stack. On the host: the application; MPI, Sockets, etc.; and the kernel agent and user-level API (GM). On the NIC: the Myrinet control program.

Fabric Management System:

Myricom's Fabric Management System (FMS) provides centralized diagnostic monitoring of the Myrinet fabric from a command-line or web interface. FMS consists of one FMS server process to manage the fabric and many Fabric Management Agent (FMA) processes, one running on each Myrinet node in the fabric. It is an important diagnostic tool for verifying the health of the Myrinet hardware.

The Fabric Management System may be used with either the MX or GM low-level firmware. FMS can be used with GM-1 and supersedes Mute, the previous management software provided by Myricom.

Performance:

Myrinet is a lightweight protocol with little overhead, which allows it to operate with throughput close to the basic signaling speed of the physical layer. On the latest 2.0 Gbit/s links, Myrinet often runs at 1.98 Gbit/s of sustained throughput, considerably better than Ethernet, which varies from 0.6 to 1.9 Gbit/s depending on load. For supercomputing, however, the low latency of Myrinet is even more important than its throughput, since, according to Amdahl's law, a high-performance parallel system tends to be bottlenecked by its slowest sequential process, which in all but the most embarrassingly parallel supercomputer workloads is often the latency of message transmission across the network.

Deployment:

According to Myricom, 141 (28.2%) of the June 2005 TOP500 supercomputers used Myrinet technology. In the November 2005 TOP500, the number of supercomputers using Myrinet was down to 101 (20.2%); by November 2006 it was 79 (15.8%); and by November 2007 it was 18 (3.6%), a long way behind Gigabit Ethernet at 54% and InfiniBand at 24.2%.

Technology and Reliability:

Myrinet components are implemented with the same advanced technology -- full-custom VLSI CMOS chips -- as today's workstations, servers, PCs, and single-board computers. This use of CMOS technology is one reason why Myrinet performance has advanced, and will continue to advance, in step with advances in the hosts, without changes to the network architecture and software interfaces.

These CMOS-based Myrinet components are also extremely reliable. The MTBF of current-production Myrinet switches and NICs exceeds 5 million hours per port. Myrinet exhibits a very low bit-error rate and is highly robust with respect to host, switch, and cable faults. Myrinet can map itself continuously and use alternate routes to circumvent faults (e.g., disconnections and powering-down). The hardware computes and checks a CRC for each packet on each link. The NICs provide parity checking both in their memory and across host I/O buses.

Reality:

Myrinet is the clear market leader in high-performance, high-availability cluster interconnects. Myricom shipped its first Myrinet products in August 1994. Including the installations supplied by Myricom's OEM customers and by Myrinet resellers and integrators, there are now many thousands of Myrinet installations, ranging in size to more than 2,000 hosts and more than 4,000 processors. These sites include many of the world's premier cluster-computing systems. A total of 141 (28.2%) of the June-2005 TOP500 supercomputers use Myrinet technology.

Conclusion: Myrinet is a cost-effective, high-performance, packet-communication and switching technology that is widely used to interconnect clusters of workstations, PCs, or single-board computers. Conventional networks such as Ethernet can be used to build clusters, but do not provide the performance or features required for high-performance or high-availability clustering.
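The Performance section's Amdahl's-law argument can be made concrete with a simple ping-pong model. This is an illustrative first-order approximation (transfer time = latency + size/bandwidth), not a Myricom formula; the 7 µs and 230 MB/s figures are the benchmark numbers quoted earlier in the text.

```python
def transfer_time(size_bytes, latency_s=7e-6, bytes_per_s=230e6):
    """First-order model of one message transfer."""
    return latency_s + size_bytes / bytes_per_s

def effective_mb_per_s(size_bytes, **kw):
    """Achieved bandwidth in MB/s, which collapses for small messages."""
    return size_bytes / transfer_time(size_bytes, **kw) / 1e6
```

A 1 MB message achieves about 229.6 MB/s, essentially the full link rate, but a 1 KB message achieves under 90 MB/s, because the fixed 7 µs latency dominates. This is why low latency, more than peak throughput, determines the performance of fine-grained parallel workloads.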
