The newest addition to the family of high-availability features is nonstop forwarding (NSF) with stateful switchover (SSO). NSF with SSO is
a supervisor redundancy mechanism introduced on the Supervisor Engine 2 and the Supervisor Engine 720 in Cisco IOS Software Release
12.2(18)SXD to provide intrachassis SSO at Layer 2–4. NSF with SSO reduces the mean time to repair (MTTR) by allowing extremely fast
supervisor switchover, with packet loss on the order of 0 to 3 seconds. NSF with SSO can be deployed in the most critical parts of an enterprise or
service provider network. It is an essential feature for single points of termination in the network, and it minimizes downtime when voice over
IP (VoIP), video, and other packet loss-sensitive applications are involved.
This paper discusses the NSF and SSO supervisor redundancy operations for the Cisco Catalyst 6500 in Cisco IOS Software. It covers the NSF
with SSO platform-specific details; the NSF with SSO supported features, including Multicast Multilayer Switching (MMLS) NSF with SSO; and
the NSF with SSO performance results. Although network design is not the goal of this paper, readers should understand how to design a highly
available network with NSF and SSO. For high-availability campus network design information, in-depth information about generic NSF with SSO
operations, an NSF with SSO configuration guide, an exhaustive list of all Cisco Catalyst 6500 high-availability mechanisms, and supervisor
redundancy information about the Cisco Catalyst Operating System for the Cisco Catalyst 6500, see the “References” section of this paper.
Supervisor Redundancy
• Supervisor engine—Every Cisco Catalyst 6500 chassis can support redundant supervisors to provide for system high availability. Supervisors
operate in active and standby modes and support a variety of redundancy mechanisms for failover.
• Switch fabric—The switch fabric provides a data path for fabric-enabled line cards and increases the available system bandwidth from the shared
bus capacity of 32 Gbps to 256 Gbps for the Supervisor Engine 2 with switch fabric module 2 (SFM2) or 720 Gbps for the Supervisor Engine
720. If a switch fabric fails, the redundant switch fabric (if present) takes over.
• Power supplies—Every Cisco Catalyst 6500 chassis supports redundant power supplies so that a power supply failure does not affect operations.
• Fan trays—Each fan tray has multiple fans. The Cisco Catalyst WS-C6509-NEB-A chassis also provides optional fan-tray redundancy.
• Line-card online insertion and removal (OIR)—New modules can be added without affecting the system, and line cards can be exchanged
without losing the configuration. When a module with a local forwarding engine (also referred to as distributed forwarding card) is inserted,
the local forwarding-engine hardware tables are repopulated with the most current forwarding information.
All contents are Copyright © 1992–2005 Cisco Systems, Inc. All rights reserved. Important Notices and Privacy Statement.
Supervisor Redundancy Definitions
Supervisor redundancy on the Cisco Catalyst 6500 requires the following:
The supervisor engine that boots first becomes the active supervisor engine. The active supervisor is responsible for control-plane and forwarding
decisions. The second supervisor is the standby supervisor, which does not participate in the control or data-plane decisions. The active supervisor
synchronizes configuration and protocol state information to the standby supervisor. As a result, the standby supervisor is ready to take over the
active supervisor responsibilities if the active supervisor fails. This “take-over” process from the active supervisor to the standby supervisor is
referred to as switchover.
Only one supervisor is active at a time, and supervisor-engine redundancy does not provide supervisor-engine load balancing. However, the
interfaces on a standby supervisor engine are active when the supervisor is up and thus can be used to forward traffic in a redundant configuration.
• RPR—Route Processor Redundancy (RPR) is the first redundancy mode of operation introduced in Cisco IOS Software. In RPR mode, the startup configuration and boot
registers are synchronized between the active and standby supervisors, the standby is not fully initialized, and images between the active and
standby supervisors do not need to be the same. Upon switchover, the standby supervisor becomes active automatically, but it must complete
the boot process. In addition, all line cards are reloaded and the hardware is reprogrammed. The RPR switchover time is 2 or more minutes.
• RPR+—RPR+ is an enhancement to RPR in which the standby supervisor is completely booted and line cards do not reload upon switchover.
The running configuration is synchronized between the active and the standby supervisors. All synchronization activities inherited from RPR
are also performed. The synchronization is done before the switchover, and the information synchronized to the standby is used when the
standby becomes active to minimize the downtime. No link layer or control-plane information is synchronized between the active and the
standby supervisors. Interfaces may bounce after switchover, and the hardware contents need to be reprogrammed. The RPR+ switchover time
is 30 or more seconds.
• SRM with SSO—SSO expands the RPR+ capabilities to provide transparent failover of Layer 2 protocols when a supervisor failure occurs.
SSO is stateful for Layer 2 protocols. Policy-feature-card (PFC) and distributed-forwarding-card (DFC) hardware tables are maintained across a
switchover. This allows for transparent failover at Layer 2 and Layer 4. SSO is a requirement for SRM with SSO and NSF with SSO. SSO can
be used independently of SRM and NSF, which provide extra Layer 3 routing functions. When using SRM with SSO, the routing protocols restart
upon switchover. However, SRM with SSO uses the existing PFC and DFC Layer 3 switching information to forward traffic for a configurable
route-convergence interval while the newly active Multilayer Switch Feature Card (MSFC) builds its routing table. This minimizes downtime,
but peers still need to reconverge around the supervisor failure. The SRM-with-SSO switchover time is 0 to 3 seconds for Layer 2 unicast traffic.
• NSF with SSO—NSF works in conjunction with SSO to ensure Layer 3 integrity following a switchover. It allows a router experiencing the
failure of an active supervisor to continue forwarding data packets along known routes while the routing protocol information is recovered and
validated. This forwarding can continue to occur even though peering arrangements with neighbor routers have been lost on the restarting router.
NSF relies on the separation of the control plane and the data plane during supervisor switchover. The data plane continues to forward packets
based on pre-switchover Cisco Express Forwarding information. The control plane implements graceful restart routing protocol extensions to
signal a supervisor restart to NSF-aware neighbor routers, reform its neighbor adjacencies, and rebuild its routing protocol database following a
switchover. An NSF-capable router implements the NSF functionality and continues to forward data packets after a supervisor failure. An NSF-aware
router understands the NSF graceful restart mechanisms: it does not tear down its neighbor relationships with the NSF-capable restarting router.
MMLS NSF with SSO enables the system to maintain multicast forwarding state in the PFC3 and DFC3 hardware during a supervisor-engine
switchover, minimizing multicast service interruption. Prior to MMLS NSF with SSO, the multicast forwarding entries were not synchronized to
the standby supervisor engine. The NSF with SSO switchover time is 0 to 3 seconds for Layer 2–4 unicast or multicast traffic.
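The four redundancy modes above can be summarized as a small lookup structure. The following Python sketch is illustrative only: the class, field, and function names are hypothetical, and the switchover figures are the ones quoted in the bullets (a lower bound for RPR and RPR+, an upper bound for the SSO-based modes), not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyMode:
    name: str
    standby_fully_booted: bool     # standby completes its boot before any switchover
    line_cards_reload: bool        # line cards reload when a switchover happens
    layer2_stateful: bool          # Layer 2 protocol state survives the switchover
    layer3_graceful_restart: bool  # routing protocols use NSF graceful restart
    quoted_switchover_s: float     # figure quoted in the text, in seconds


# Values transcribed from the bullet list above.
MODES = (
    RedundancyMode("RPR", False, True, False, False, 120.0),   # "2 or more minutes"
    RedundancyMode("RPR+", True, False, False, False, 30.0),   # "30 or more seconds"
    RedundancyMode("SRM with SSO", True, False, True, False, 3.0),  # "0 to 3 seconds"
    RedundancyMode("NSF with SSO", True, False, True, True, 3.0),   # "0 to 3 seconds"
)


def candidate_modes(max_outage_s: float, need_graceful_restart: bool = False) -> list[str]:
    """Return the names of modes whose quoted switchover time fits the budget."""
    return [
        m.name
        for m in MODES
        if m.quoted_switchover_s <= max_outage_s
        and (m.layer3_graceful_restart or not need_graceful_restart)
    ]


if __name__ == "__main__":
    print(candidate_modes(5.0))
    print(candidate_modes(5.0, need_graceful_restart=True))
```

Because the RPR and RPR+ figures are lower bounds, a mode that passes this filter is only a candidate; actual switchover time depends on the platform and configuration.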
Table 1 gives the minimum software version for each redundancy mode supported on the Cisco Catalyst 6500.
Table 1. Supervisor Redundancy Mode Support
Supervisor Engine      RPR and RPR+                SRM with SSO                         NSF with SSO
Supervisor Engine 1A   12.1(13)E                   –                                    –
Supervisor Engine 2    12.1(13)E or 12.2(17d)SXB   –                                    12.2(18)SXD
Supervisor Engine 720  12.2(14)SX                  12.2(17b)SXA and 12.2(17d)SXB only   12.2(18)SXD
The default redundancy mode of operation with two Supervisor Engine 720s is SSO in Cisco IOS Software Release 12.2(17b)SXA and later
releases. The default redundancy mode of operation with two Supervisor Engine 2s is SSO in Cisco IOS Software Release 12.2(18)SXD and
later releases. In earlier Cisco IOS Software 12.2SX releases, the default redundancy mode of operation is RPR+.
SRM with SSO is enabled by default in Cisco IOS Software releases 12.2(17b)SXA and 12.2(17d)SXB. SRM with SSO was replaced with NSF
with SSO starting with Cisco IOS Software Release 12.2(18)SXD. Enabling NSF awareness and capability is routing protocol-specific. For NSF-
with-SSO configuration details, go to
http://www.cisco.com/en/US/products/hw/switches/ps708/products_configuration_guide_chapter09186a008027e4cd.html.
Software Upgrades
At the time of writing, SSO requires each supervisor engine to be running the same Cisco IOS Software release. Fast Software
Upgrade (FSU) can be used to minimize downtime associated with a planned software upgrade. With this process, the redundancy mode reverts
to RPR during the upgrade. The detailed procedure can be found at
http://www.cisco.com/en/US/products/hw/routers/ps368/products_configuration_guide_chapter09186a0080160f3b.html#wp1089399
In order to run in RPR+ or SSO redundancy mode, image versions must be the same on the redundant and active supervisors. In these redundancy
modes, the active supervisor engine checks the image version of the redundant supervisor engine when the redundant supervisor engine comes
online. If the image on the redundant supervisor engine does not match the image on the active supervisor engine, the software sets the redundancy
mode to RPR while doing a software upgrade and sets it back to SSO when the software upgrade is complete.
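The image-version check described above amounts to a simple fallback rule. The sketch below is a minimal model of that rule, not the actual Cisco IOS logic; the function name and the image strings are hypothetical placeholders.

```python
def operating_mode(configured_mode: str, active_image: str, standby_image: str) -> str:
    """Return the redundancy mode the pair can run in.

    RPR+ and SSO need identical images on both supervisors; on a mismatch
    the pair falls back to RPR, which tolerates different images.
    """
    if configured_mode in {"RPR+", "SSO"} and active_image != standby_image:
        return "RPR"
    return configured_mode


if __name__ == "__main__":
    # "image-A" and "image-B" are placeholders, not real image names.
    print(operating_mode("SSO", "image-A", "image-A"))  # SSO
    print(operating_mode("SSO", "image-A", "image-B"))  # RPR
```

Once both supervisors run the same image again, the same rule returns the configured mode, which matches the text's statement that the mode is set back to SSO when the upgrade is complete.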
Note that future Cisco In Service Software Upgrades (ISSUs) will allow software upgrades in SSO redundancy mode. NSF with SSO is the building
block for Cisco ISSUs.
SSO
SSO Operation
SSO synchronizes runtime data for Layer 2 dynamic protocols. As Layer 2 control-plane, configuration, or other network-related changes occur, the
Cisco IOS Checkpoint Facility running between the peer processes on the active and standby supervisors communicates the changes. Table 2 gives
the list of Layer 2 protocols supported with SSO. For example, the Spanning Tree Protocol database on the standby supervisor is kept up to date by
checkpointing both protocol information and port states from the active supervisor.
SSO also synchronizes the hardware forwarding tables between the active and standby supervisors. The PFC is a supervisor daughter card that
contains the application-specific integrated circuit (ASIC) responsible for hardware switching. When new hardware table entries need to be
downloaded to the PFC, entries also are downloaded to all other forwarding engines in the system. This allows the standby supervisor PFC to
contain the same forwarding information as the active PFC and the DFCs. The MAC address table, the Forwarding Information Base (FIB), the
adjacency table, the access control lists (ACLs), and the quality-of-service (QoS) hardware table contents can be used for switching decisions after
switchover.
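The two synchronization paths described above (checkpointing of Layer 2 protocol state and parallel download of hardware table entries) can be pictured with a toy model. This Python sketch is an assumption-laden illustration: the class and method names are hypothetical and do not correspond to the Cisco IOS Checkpoint Facility API, and the keys and values are made-up examples.

```python
class Supervisor:
    def __init__(self, name: str) -> None:
        self.name = name
        self.l2_state: dict = {}   # e.g. spanning-tree port states
        self.hw_tables: dict = {}  # e.g. MAC, FIB, adjacency, ACL, QoS tables


class RedundantPair:
    def __init__(self, active: Supervisor, standby: Supervisor) -> None:
        self.active = active
        self.standby = standby

    def _members(self) -> list:
        return [s for s in (self.active, self.standby) if s is not None]

    def checkpoint_l2(self, key, value) -> None:
        # A Layer 2 change on the active is also recorded on the standby.
        for sup in self._members():
            sup.l2_state[key] = value

    def program_hardware(self, table: str, key, entry) -> None:
        # New hardware entries are downloaded to every forwarding engine,
        # including the PFC on the standby supervisor.
        for sup in self._members():
            sup.hw_tables.setdefault(table, {})[key] = entry

    def switchover(self) -> Supervisor:
        if self.standby is None:
            raise RuntimeError("no standby supervisor available")
        # The standby becomes active; the failed supervisor leaves the pair.
        self.active, self.standby = self.standby, None
        return self.active


if __name__ == "__main__":
    pair = RedundantPair(Supervisor("sup-a"), Supervisor("sup-b"))
    pair.checkpoint_l2(("vlan10", "port1"), "forwarding")
    pair.program_hardware("mac", "0000.0c12.3456", "port1")
    new_active = pair.switchover()
    print(new_active.name, new_active.l2_state, new_active.hw_tables)
```

The point of the model is that the newly active supervisor already holds the same protocol state and table contents, so it can make switching decisions immediately after the switchover.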
Figure 2 depicts the supervisor switchover operation. Upon switchover, traffic can be forwarded without disruption. The numbers 1, 2, 3, and 4
represent switchover steps. These steps are described as follows.
Figure 2. Supervisor Switchover Operation
1. The system detects a software or hardware fault on the active supervisor and triggers a switchover. The fault can be detected by software
exception handlers, Generic Online Diagnostics (GOLD) background checks, keepalive failures between the route processor (RP) and the switch
processor (SP), or fabric-switching-module state changes on a Supervisor Engine 720. A switchover can also be user-initiated.
2. Line-card synchronization helps ensure that all modules in the system understand that a switchover has occurred. The standby supervisor
assumes the role of active supervisor and data is forwarded by the PFC on the newly active supervisor.
3. The switch processor and route processor on the newly active supervisor start processing protocol and data packets. SSO-aware protocols are
not affected by the switchover, and these protocols start processing updates from the network.
4. Non-SSO-aware protocols and routing protocols are initialized. SRM with SSO purges the preswitchover FIB information after a configurable
route-convergence interval, which allows for Layer 3 forwarding to continue in hardware while the routing protocols converge. Peers need to
reconverge around the failure. Static routes are maintained across a switchover because they are based on static configuration and are not
dynamic. Supported Layer 2 control protocols and Layer 4 policies derived from QoS or ACL policies are not affected by a switchover.
Packet forwarding in a Cisco router is provided by Cisco Express Forwarding, which maintains two tables: a FIB and an adjacency table. The FIB
table is a distilled version of the routing table, containing only information relevant to the forwarding process and not to particular routing protocols.
For example, a routing protocol's administrative distance is not relevant to the forwarding process. The adjacency table is a collection of next-hop
rewrite information for adjacent nodes.
During normal operation, the system collects the routes calculated by each routing protocol into a common database called the Routing Information
Base (RIB). When information for all routing protocols is present in the RIB, the RIB is scanned to determine the lowest-cost next-hop destination
for each network and subnet. At that point, routing prefix and adjacency information for lowest-cost paths are populated to the Cisco Express
Forwarding tables. As routing-protocol changes occur, the software Cisco Express Forwarding databases are checkpointed from the active
supervisor to the standby supervisor, and the Cisco Express Forwarding tables are downloaded to the hardware on all PFCs and DFCs present in
the system, including the standby PFC. This ensures forwarding-table synchronization at the software and hardware level and ensures that
postswitchover data forwarding relies on the most accurate and up-to-date forwarding-table information.
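The relationship between the RIB and the Cisco Express Forwarding tables can be sketched as follows. This is a simplified model under stated assumptions: the type and function names are hypothetical, the best route per prefix is chosen by lowest administrative distance and then lowest metric, and the addresses and rewrite strings are documentation-style examples rather than real data.

```python
from typing import NamedTuple


class Route(NamedTuple):
    prefix: str
    protocol: str
    admin_distance: int
    metric: int
    next_hop: str


def build_fib(rib: list) -> dict:
    """Keep one best route per prefix and retain only forwarding-relevant data."""
    best = {}
    for route in rib:
        current = best.get(route.prefix)
        if current is None or (route.admin_distance, route.metric) < (
            current.admin_distance,
            current.metric,
        ):
            best[route.prefix] = route
    # The FIB drops protocol and administrative distance: prefix -> next hop only.
    return {prefix: route.next_hop for prefix, route in best.items()}


def build_adjacency(fib: dict, arp_table: dict) -> dict:
    """Map each next hop in use to its rewrite information (taken from ARP here)."""
    return {nh: arp_table[nh] for nh in set(fib.values()) if nh in arp_table}


if __name__ == "__main__":
    rib = [
        Route("10.1.0.0/16", "ospf", 110, 20, "192.0.2.1"),
        Route("10.1.0.0/16", "eigrp", 90, 3072, "192.0.2.2"),
        Route("10.2.0.0/16", "ospf", 110, 30, "192.0.2.1"),
    ]
    arp = {"192.0.2.1": "0000.0c00.0001", "192.0.2.2": "0000.0c00.0002"}
    fib = build_fib(rib)
    print(fib)
    print(build_adjacency(fib, arp))
```

In the example, the EIGRP route wins for 10.1.0.0/16 because of its lower administrative distance, but the resulting FIB entry no longer records that fact, which is the "distilled" property the text describes.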
An epoch number per Cisco Express Forwarding entry is introduced in order to allow differentiation between old and new Cisco Express
Forwarding entries. This is known as FIB and adjacency database versioning. Only software Cisco Express Forwarding tables keep track of the
epoch number, and this version number does not impact the forwarding path. A “global epoch number” is incremented when a switchover occurs.
The same switchover operations as described in Figure 2 occur. However, reinitialization of the NSF-capable routing protocol does not cause route
flaps. Figure 4 describes the generic routing protocol NSF with SSO operations that take place. Figure 4 depicts an NSF-aware neighbor router and
an NSF-capable Cisco Catalyst 6500. The Cisco Catalyst 6500 newly active supervisor is represented along with NSF with SSO operation steps.
This figure does not represent the failing former active supervisor. Note that the steps applying to the supervisor switch processor (SP) and policy
feature card (PFC) apply also to the line-card (LC) processor and DFCs. Orange steps are control-plane driven, whereas blue steps are data-driven.
Figure 4 steps 1 through 12 are described as follows. All these steps occur on the “newly active” supervisor.
1. Switchover is triggered.
2. Routing-protocol processes are informed of the supervisor failover. In order to provide control- and data-plane separation, the FIB is detached
from the RIB until the routing protocol reconverges.
3. Packet forwarding continues based on last-known FIB and adjacency entries while the standby takes over.
4. The global epoch number is incremented: if the preswitchover global epoch was 0, it is incremented to 1.
6. The software adjacency table is populated with the preswitchover Address Resolution Protocol (ARP) table contents. Updated Cisco Express
Forwarding entries receive the new global epoch number. The epoch number is available only in the route processor software Cisco Express
Forwarding entries. It is not present in the hardware table. New adjacency entries are downloaded to the hardware.
8. The routing protocol-specific database synchronization occurs: routing protocol processes rebuild their database using database information
from NSF-aware neighbors.
9. When the routing databases are synchronized, distance-vector, path-vector, or shortest-path-first (SPF) algorithm computations determine the
best route for specific prefix destinations. The RIB is repopulated with new routing entries. The corresponding Cisco Express Forwarding
entries are updated.
10. As the software Cisco Express Forwarding databases are populated with updated information, updated entries receive the global epoch number
to indicate that they have been refreshed. Corresponding FIB entries and hardware entries are updated.
11. Each routing protocol notifies Cisco Express Forwarding that it has converged. After all of them have converged, the last routing protocol
flushes the stale route and adjacency information: software Cisco Express Forwarding entries with an epoch number not corresponding to the
current global epoch number are flushed. Corresponding FIB and adjacency hardware entries are also flushed.
12. The Cisco IOS Software Cisco Express Forwarding tables on the route processor and the forwarding tables on the switch processor and PFC
and DFCs are now synchronized.
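The epoch mechanism used in steps 4, 10, and 11 can be illustrated with a short Python sketch. It models only the software Cisco Express Forwarding table; the class and method names are hypothetical, and the prefixes and next hops are made-up examples.

```python
class SoftwareCefTable:
    def __init__(self) -> None:
        self.global_epoch = 0
        self.entries = {}  # prefix -> (next_hop, epoch)

    def install(self, prefix: str, next_hop: str) -> None:
        # New or refreshed entries are stamped with the current global epoch.
        self.entries[prefix] = (next_hop, self.global_epoch)

    def switchover(self) -> None:
        # Step 4: the global epoch is incremented; existing entries keep
        # their old epoch and continue to be used for forwarding.
        self.global_epoch += 1

    def stale_prefixes(self) -> list:
        return sorted(
            prefix
            for prefix, (_, epoch) in self.entries.items()
            if epoch != self.global_epoch
        )

    def flush_stale(self) -> list:
        # Step 11: after all routing protocols have converged, entries that
        # were never refreshed are removed.
        removed = self.stale_prefixes()
        for prefix in removed:
            del self.entries[prefix]
        return removed


if __name__ == "__main__":
    cef = SoftwareCefTable()
    for prefix in ("10.1.0.0/16", "10.2.0.0/16", "10.3.0.0/16"):
        cef.install(prefix, "192.0.2.1")
    cef.switchover()
    # Step 10: routes relearned from NSF-aware neighbors are refreshed.
    cef.install("10.1.0.0/16", "192.0.2.1")
    cef.install("10.2.0.0/16", "192.0.2.2")
    print(cef.flush_stale())  # ['10.3.0.0/16']
```

Between the switchover and the flush, all three prefixes remain usable for forwarding; only the prefix that no neighbor re-advertised is removed at the end.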
NSF graceful restart routing protocol extensions follow IETF drafts and RFCs. For additional NSF protocol-specific information, see the
“References” section.
Without MMLS NSF with SSO, when a switchover occurs, the multicast forwarding entries on the PFC3 and DFC3s (if present) are purged, causing service interruption for
multicast traffic. After the new active route processor comes online, it must establish Protocol Independent Multicast (PIM) neighbor relationships,
process Internet Group Management Protocol (IGMP) packets, and otherwise reconverge multicast state before it can repopulate the hardware
forwarding entries in the PFC3 and DFC3 forwarding engines.
MMLS NSF with SSO enables the system to maintain multicast forwarding state in the PFC3 and DFC3 hardware during a supervisor-engine
switchover, minimizing multicast service interruption.
In a steady state, the active supervisor engine synchronizes the standby supervisor engine with the hardware multicast forwarding entries. If a
supervisor-engine switchover occurs, the entries in the PFC3 and DFC3 hardware forwarding tables are preserved and the system continues to
forward multicast traffic using the last-known good copy of the multicast forwarding table.
When the new active route processor comes online, converges with the network, and relearns the multicast forwarding state, it repopulates the
hardware forwarding tables on the PFC3 and DFC3 using the new information.
SWITCHOVER PERFORMANCE
The NSF with SSO performance on the Cisco Catalyst 6500 was measured using the setup shown in Figure 5. This setup integrates simulation
devices to record the failover time corresponding to different modes of operation. It also includes real-life applications to make sure that these
applications are not affected by an NSF with SSO switchover. The tested applications include video and VoIP applications.
The setup consists of three Cisco Catalyst 6500 switches. All supervisors in the setup are loaded with Cisco IOS 12.2(18)SXD. The Device Under
Test (DUT) is a Cisco Catalyst 6500 with redundant Supervisor 720s switching bidirectional Layer 2 and Layer 3 traffic from the neighbor routers.
This bidirectional traffic includes simple Layer 2 and Layer 3 traffic generated from a traffic simulator at 100,000 pps, as well as voice and video
traffic from VoIP phones and a video client and server. The testing procedure consists of Layer 2 and Layer 3 tests run for each of the following
failover mechanisms: RPR+, SSO with NSF capability disabled, and NSF with SSO. Layer 3 test runs were performed with 1000 routes injected
for OSPF, EIGRP, IS-IS, and BGP. All neighbors are NSF-aware.
The failover time can be derived by comparing the number of packets sent with the number of packets received across a switchover: at a given
packet rate (100,000 pps in the test run), the failover time corresponds to (Packets Transmitted – Packets received)/Packet rate.
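The failover-time calculation above is a one-line formula. The following Python sketch applies it; the function name is hypothetical, and the packet counts in the example are invented for illustration rather than taken from the test run.

```python
def failover_time_s(packets_transmitted: int, packets_received: int, rate_pps: float) -> float:
    """Failover time = (packets transmitted - packets received) / packet rate."""
    if rate_pps <= 0:
        raise ValueError("packet rate must be positive")
    if packets_received > packets_transmitted:
        raise ValueError("received count cannot exceed transmitted count")
    return (packets_transmitted - packets_received) / rate_pps


if __name__ == "__main__":
    # Hypothetical counters: 55,000 packets lost at 100,000 pps.
    print(failover_time_s(6_000_000, 5_945_000, 100_000))  # 0.55
```

The method assumes a constant transmit rate and that all loss during the run is caused by the switchover.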
Table 6 lists failover times for different scenarios when traffic flows between two ports on a WS-X6748-GE-TX module. Overall, failover times
range from 0 to 3 seconds with NSF with SSO, depending on test conditions.
Table 6. NSF with SSO Failover Times
Failover Time                  Layer 2 Traffic  Layer 3 EIGRP    Layer 3 OSPF     Layer 3 IS-IS    Layer 3 IS-IS    Layer 3 BGP
                                                Routed Traffic   Routed Traffic   Routed Traffic   Routed Traffic   Routed Traffic
                                                                                  (Cisco method)   (IETF method)
RPR+                           62.00s           70.00s           140.00s          82.00s           82.00s           130.00s
SSO (NSF capability disabled)  0.50s            6.00s            11.00s           0.55s            0.55s            54.00s
NSF with SSO                   0.50s            0.55s            0.55s            0.55s            0.55s            0.55s
Comparison of Cisco Catalyst Operating System and Cisco IOS Software High Availability
Table 7 compares the Cisco Catalyst Operating System, hybrid, and Cisco IOS Software switchover performance numbers for equivalent features.
More information about the Cisco Catalyst Operating System supervisor redundancy mechanisms can be found at
http://www.cisco.com/warp/public/cc/pd/si/casi/ca6000/tech/hafc6_wp.pdf.
The hybrid high availability with SRM redundancy method is the hybrid equivalent to SSO. The hybrid model does not support the NSF
functionality.
Table 7 compares performance numbers for the Cisco Catalyst Operating System, hybrid, and Cisco IOS Software redundancy features.
Table 7. Cisco Catalyst Operating System and Cisco IOS Software Supervisor Redundancy Features
Statistics
The various statistics maintained by an active supervisor are not synchronized to the redundant supervisor because they may change often and the
degree of synchronization they require is substantial. A network-management system should be used to poll affected statistics regularly to maintain
accurate statistics.
SNMP
Simple Network Management Protocol (SNMP) data is synchronized between redundant supervisors when the supervisor is operating in SSO mode.
This is done to ensure that the standby and the active supervisor are indistinguishable from a network-management perspective. Some of the SNMP
objects that are synchronized include interface-related features such as ifIndex and SNMP configuration.
The Cisco High-Availability MIB, CISCO-RF-MIB, reports redundancy information to an administrator. This information includes identification of
the primary and secondary supervisors, current redundancy state, the reason for the last switchover that occurred, and when the last switchover
occurred. When a switchover occurs, the ciscoRFSwactNotif notification is used to signal a switchover.
In addition to using the Cisco High-Availability MIB, syslog messages and SNMP traps are sent to notify the administrator of any component
failure.
SNMP data synchronization is not available in the RPR and RPR+ modes of operation.
For services-module compatibility with SSO mode, check the release notes at
http://www.cisco.com/en/US/products/hw/switches/ps708/prod_release_note09186a00801c8339.html.
REFERENCES
High Availability:
High-availability technical documentation: http://www.cisco.com/en/US/products/ps6550/products_ios_technology_home.html
Stateful Switchover:
SNMP SSO: http://www.cisco.com/en/US/products/sw/iosswrel/ps1829/products_feature_guide09186a00801b3a8f.html
[OSPF Restart Signaling, OSPF Link-Local Signaling, OSPF Out-of-Band LSDB Resynchronization / IS-IS Restart / Graceful Restart Mechanism
for BGP]