
Draft Document for Review May 9, 2012 11:15 am

4722paper.fm

Redpaper
Fabric Resiliency Best Practices

Ian MacQuarrie John Juenemann Jon Tate

In this IBM Redpaper publication, we discuss best practices for deploying and using advanced Brocade Fabric OS (FOS) features to identify, monitor, and protect Fibre Channel (FC) SANs from problematic device and media behavior.

Fabric Operating System: This paper covers Brocade Fabric OS command options up to and including version 6.4.2b. If you are using a higher code level, the guidance and strategy provided in this paper still apply. However, you might find additional options that allow even greater granularity in establishing specific alerting features.

Introduction
Faulty or improperly configured devices, misbehaving hosts, and faulty or substandard FC media can significantly impact the performance of FC fabrics and the applications they support. In most real-world scenarios, these issues cannot be corrected or completely mitigated within the fabric itself. Instead, the behavior must be addressed directly. However, with the proper knowledge and capabilities, the fabric can often identify and, in some cases, mitigate or protect against the effects of these misbehaving components to provide better fabric resiliency.

This document provides a high-level description of the most commonly experienced, detrimental device and link behaviors, and explains how to use features in recent levels of Fabric OS (FOS) to protect your data center.

In FOS 6.1, Brocade introduced Port Fencing as part of the optional Fabric Watch offering. In FOS 6.3, Brocade added a new set of base features referred to as Bottleneck Detection. This was extended in FOS 6.4 with broader monitoring, improved configuration, and detection capabilities for additional types of bottlenecks.

For further details about the features described in this publication, see the following product documents, available with registration at http://my.brocade.com/wps/portal:
- Fabric OS Administrators Guide (version 6.4, 53-1001763-02)
- Fabric OS Command Reference Manual (version 6.4, 53-1001764-02)
- Fabric Watch Administrators Guide (version 6.4.1, 53-1001770-01)
- Configuring Performance Monitoring and Thresholds in Brocade Data Center Fabric Manager (DCFM), GA-BP-247-00
- Bottleneck Detection Best Practices Guide, GA-BP-383-00

It is assumed that readers of this document are familiar with the basic functionality of features such as Bottleneck Detection, Fabric Watch, and Port Fencing.

Copyright IBM Corp. 2011. All rights reserved.

ibm.com/redbooks

Fabric resiliency
Two primary aspects of fabric resiliency are captured in this document:
- Detecting abnormal behavior in external components (typically servers and hosts, storage devices, or faulty links) so that the negative impact on the fabric can be addressed.
- Providing mechanisms that protect the fabric from adverse effects caused by a faulty component. This includes one or more actions that can be invoked automatically by a switch, when faulty behavior is detected, to contain and isolate the impact of the misbehaving component in the fabric. This should be considered a temporary measure. Ultimately, the faulty or improperly configured component must be addressed to resolve the problem completely and permanently.

There are two common classes of abnormal behavior originating from fabric components:
- Misbehaving high-latency end devices (hosts or storage)
  End devices that do not respond as quickly as expected, and therefore cause the fabric to hold frames for excessive periods of time (referred to as slow draining devices), can result in application performance degradation or, in extreme cases, I/O failure. Common examples of moderate device latency include disk arrays that are overloaded and hosts that cannot process data as fast as they request it. Severe latencies are caused by badly misbehaving devices that stop receiving, accepting, or acknowledging frames for excessive periods of time. The Bottleneck Detection feature of the Brocade FOS discussed in this document is specifically designed to identify and raise alerts for high-latency or slow draining devices within the SAN.
- Faulty media (fiber optic cables and small form-factor pluggable (SFP) transceivers and optics)
  Faulty media can cause frame loss due to excessive cyclic redundancy check (CRC) errors, invalid transmission words, and other conditions. This may result in I/O failure and application performance degradation. The Fabric Watch monitoring feature of the Brocade FOS discussed in this document is designed to identify and raise alerts for these types of conditions.

Be aware that FC switches cannot correct bad node behavior or faulty media. They can only attempt to alert on it and compensate for it. Ultimately, the problems must be addressed at the host, target device, or media where the source of the error actually resides.

Maintaining an optimal FC SAN environment


Although there are many features available in FOS to assist you with monitoring, protecting, and troubleshooting fabrics, several recent enhancements have been implemented that deal exclusively with this area. This document focuses specifically on those newer features and related capabilities that help provide optimum fabric resiliency.

Most of those features and capabilities are available and supported on the majority of 4 Gbps and 8 Gbps platforms, provided that the most recent FOS releases are used. Some features may require optional licensing. This section discusses these features, minimum release levels, licensing requirements, and platform limitations.

It is strongly suggested that you review the additional documentation listed in the Introduction to understand all of the tools available for maintaining an FC SAN environment. Also, be sure to read the FOS Release Notes.

Note: To use all of the capabilities described in this document, switches need to be running FOS 6.4.0 or later.

Bottleneck Detection
Bottleneck Detection was introduced in FOS 6.3.0 with monitoring for device latency conditions, and then enhanced in FOS 6.4.0 with added support for congestion detection on both E_Ports and F_Ports. FOS 6.4 also added improved reporting options and simplified configuration capabilities. The FOS 6.3.1b release (and later) included an enhancement to the algorithm for detecting device latency that makes it more accurate. Bottleneck Detection does not require a license and is supported on both 4 Gbps and 8 Gbps platforms.

Fabric Watch and Port Fencing


Fabric Watch is an optional (licensed) feature that was enhanced in FOS 6.1.0 with the addition of Port Fencing. This capability allows a switch to monitor specific behaviors and protect a switch by blocking a port when specified thresholds have been reached.

Edge Hold Time configuration


Edge Hold Time configuration is a capability that was added in the FOS 6.3.1b release. However, it is not documented in the FOS 6.3 or FOS 6.4 Command Reference. See "Configuring Edge Hold Time" for details about its use. No license is required to configure the Edge Hold Time setting.

Device latencies
A device experiencing latencies responds more slowly than expected. The device does not return buffer credits (through R_RDY primitives) to the transmitting switch fast enough to support the offered load, even though the offered load is less than the maximum physical capacity of the link connected to the device, as shown in Figure 1.

Figure 1 Buffer backup on ingress port 6 on B1 causes latency bottleneck upstream on S1, port 3

After it exhausts all available credits, the switch port connected to the device needs to hold additional outbound frames until a buffer credit is returned by the device. When a device is not responding in a timely fashion, the transmitting switch is forced to hold frames for longer periods of time, thus resulting in high buffer occupancy. This, in turn, results in the switch lowering the rate at which it returns buffer credits to other transmitting switches. This effect propagates through switches (and potentially, multiple switches with devices attempting to send frames to devices attached to the switch with the high-latency device) and ultimately impacts the fabric. See Figure 2.

Figure 2 Latency on a switch can propagate through the fabric

Note: The impact to the fabric (and other traffic flows) varies based on the severity of the latency exhibited by the device. The longer the delay caused by the device in returning credits to the switch, the more severe the problem.

Assessing device latency severity


There are several features and capabilities you can use to assess the severity of device latency.

Moderate device latencies


Moderate device latencies are those that are not severe enough to cause frame loss. A device exhibits moderate latency if the time between successive credit returns ranges from tens of milliseconds (ms) up to 100 ms; frame loss typically occurs only above 100 ms. Moderate latency reduces the performance of traffic flows that use the fabric, so applications might slow down, but it typically does not cause frame drops or I/O failures. The higher the latency, the greater the chance that users experience degraded performance.

Severe device latencies


A device that takes longer than 100 ms between successive credit returns is exhibiting severe latency. In this state, the switch is forced to hold frames for excessively long periods of time (on the order of hundreds of milliseconds). When this time becomes greater than the established timeout threshold (500 ms by default), the switch drops the frame (per FC standards). Frame loss in switches is also known as Class 3 (C3) discards or timeouts.

Frame loss triggers the host small computer system interface (SCSI) stack to detect failures and to retry I/Os. This process can take tens of seconds (possibly as long as 30 to 60 seconds), which can cause a very noticeable application delay and potentially result in application errors.
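The boundaries described for moderate and severe latency can be summarized in a short Python sketch. The function names are hypothetical, and the 10 ms lower bound used for "tens of milliseconds" is an assumption made only for illustration; the 100 ms boundary and the 500 ms default timeout are taken from the text.

```python
# Illustrative sketch only; names are hypothetical and not part of FOS.

MODERATE_FLOOR_MS = 10.0      # assumption: "tens of milliseconds" starts near 10 ms
SEVERE_FLOOR_MS = 100.0       # from the text: longer than 100 ms is severe latency
DEFAULT_HOLD_TIME_MS = 500.0  # from the text: default timeout before a frame is dropped


def classify_credit_return_gap(gap_ms: float) -> str:
    """Return a coarse severity label for one gap between credit returns."""
    if gap_ms < 0:
        raise ValueError("gap_ms must be non-negative")
    if gap_ms > SEVERE_FLOOR_MS:
        return "severe"
    if gap_ms >= MODERATE_FLOOR_MS:
        return "moderate"
    return "normal"


def exceeds_hold_time(held_ms: float, hold_time_ms: float = DEFAULT_HOLD_TIME_MS) -> bool:
    """True if a frame held this long would be discarded (a C3 timeout)."""
    return held_ms > hold_time_ms


if __name__ == "__main__":
    for gap in (2.0, 80.4, 250.0):
        print(f"{gap:7.1f} ms -> {classify_credit_return_gap(gap)}")
```

A gap of exactly 100 ms is treated as moderate here, because the text places severe latency strictly above 100 ms.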

Dropped frames cause applications to retry I/Os


Because the effect of device latencies often spreads through the fabric, frames can be dropped due to timeouts not just on the F_Port to which the misbehaving device is connected, but also on E_Ports carrying traffic to that F_Port. Dropped frames typically cause I/O errors that result in a host retry, which can significantly decrease application performance. The impact is compounded because frame drops on the affected F_Port cause I/O failures to the misbehaving device (which would be expected), whereas frame drops on E_Ports can also cause I/O failures for unrelated traffic flows (which would not typically be expected). These unrelated flows involve other hosts that use the same inter-switch links (ISLs) as the device experiencing severe latency (see item 5 in Figure 2).

Latency detection
Using Bottleneck Detection to detect devices that exhibit latency, and Fabric Watch to detect frame timeouts, is considered best practice and therefore strongly recommended.

Using Bottleneck Detection on F_Ports


Bottleneck Detection is a comprehensive feature that can be used to detect a wide range of device latencies from mild to severe. See "Configuring Bottleneck Detection" for details about how to enable Bottleneck Detection.

After Bottleneck Detection is enabled, the switch monitors F_Ports for latency symptoms. Specifically, it looks for conditions in which the time delay between successive buffer credit returns from a device is higher than expected. When the condition is detected, Bottleneck Detection reports latency bottlenecks at F_Ports based on configurable thresholds. These reports can then be leveraged to:
- Determine the specific device port on which device latency is occurring and raise an alert
- Determine the severity and duration of the latency behavior
- Determine the actual device latency in the range of 100 microseconds to hundreds of milliseconds

Enter bottleneckmon --show to display the percent congestion on specified ports, as shown in Example 1.
Example 1 bottleneckmon --show

switch:admin> bottleneckmon --show -interval 5 -span 30 1/16
===========================================================
Thu Jun 16 13:24:11 UTC 2011
===========================================================
                                    Percentage of
From              To                affected secs
===========================================================
Jun 16 13:24:06   Jun 16 13:24:11     0.00% (no data for 3 seconds)
Jun 16 13:24:01   Jun 16 13:24:06    66.67% (no data for 2 seconds)
Jun 16 13:23:56   Jun 16 13:24:01    25.00% (no data for 1 seconds)
Jun 16 13:23:51   Jun 16 13:23:56   100.00% (no data for 1 seconds)
Jun 16 13:23:46   Jun 16 13:23:51    66.67% (no data for 2 seconds)
Jun 16 13:23:41   Jun 16 13:23:46   100.00% (no data for 2 seconds)
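When output such as Example 1 is captured from many ports, it can help to extract the interval rows programmatically. The following Python sketch parses the row layout shown in Example 1. The names `Interval` and `parse_show_output` are hypothetical, and the row format is assumed to match Example 1; output from other FOS versions may differ.

```python
import re
from typing import List, NamedTuple


class Interval(NamedTuple):
    start: str           # for example "Jun 16 13:24:06"
    end: str             # for example "Jun 16 13:24:11"
    affected_pct: float  # percentage of affected seconds in the interval


# A row holds two "Mon DD HH:MM:SS" timestamps followed by a percentage.
_STAMP = r"[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}"
_ROW = re.compile(rf"({_STAMP})\s+({_STAMP})\s+(\d+(?:\.\d+)?)%")


def parse_show_output(text: str) -> List[Interval]:
    """Extract interval rows from captured bottleneckmon --show text."""
    return [
        Interval(m.group(1), m.group(2), float(m.group(3)))
        for m in _ROW.finditer(text)
    ]


# Rows copied from Example 1.
SAMPLE = """
Thu Jun 16 13:24:11 UTC 2011
Jun 16 13:24:06 Jun 16 13:24:11 0.00% (no data for 3 seconds)
Jun 16 13:24:01 Jun 16 13:24:06 66.67% (no data for 2 seconds)
Jun 16 13:23:56 Jun 16 13:24:01 25.00% (no data for 1 seconds)
Jun 16 13:23:51 Jun 16 13:23:56 100.00% (no data for 1 seconds)
Jun 16 13:23:46 Jun 16 13:23:51 66.67% (no data for 2 seconds)
Jun 16 13:23:41 Jun 16 13:23:46 100.00% (no data for 2 seconds)
"""

if __name__ == "__main__":
    rows = parse_show_output(SAMPLE)
    worst = max(rows, key=lambda row: row.affected_pct)
    print(len(rows), "intervals; worst:", worst)
```

The date header line is skipped because it is not followed by a second timestamp and a percentage.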

Timeout notification on F_Ports


Use Fabric Watch to detect frame timeouts, that is, frames that have been dropped because of severe latency conditions. The Fabric Watch C3TX_TO area is available in FOS 6.3 for 8 Gbps ports and in FOS 6.3.1b/6.4.0 and later for 4 Gbps ports. If the number of timed-out frames on an F_Port exceeds the currently effective threshold settings, Fabric Watch can notify the user by:
- Sending an SNMP trap
- Logging a RASlog message
- Sending an email alert
- Logging a SYSlog message

Latency mitigation action


The following section describes actions you can take to mitigate latency.

Reducing timeouts on unrelated flows


FC standards dictate that frames are dropped in switches if they have been held in the switch buffers for longer than the established Hold Time, which is a value calculated from several configurable fabric parameters. Unless any of these fabric parameters (R_A_TOV, E_D_TOV, WAN_TOV, or MAX_HOPs) have been changed from their defaults, the Hold Time is calculated to be 500 ms. In most environments, fabric parameters on all switches in a fabric should match, and thus the Hold Time should be consistent throughout a fabric.

When congestion conditions cause frames to drop in the core of the fabric, the disruption is greatest wherever the most flows or traffic are concentrated. To reduce frame drops on E_Ports on core switches, the edge switches that host the end server or storage devices can be configured to have a shorter Hold Time than the core switches by using the Edge Hold Time feature (available in FOS 6.3.1b and later). This setting lowers the Hold Time on the edge of the network, which reduces the likelihood of frame loss in the core of the network, effectively mitigating the impact of the misbehaving device. For this reason, it is useful to enable the Edge Hold Time feature.

See "Configuring Edge Hold Time" for details about how to enable the Edge Hold Time feature. Enabling and configuring the Edge Hold Time is a nondisruptive operation.

Fabric configuration
Fabrics can be architected to mitigate some impacts of device latency. Isolating the device flows (host or storage pair) that exhibit high latencies, either by putting them in their own fabric or on their own blade or switch, contains the impact of the latencies to the fabric, blade or switch that contains the high-latency device flows. Features such as Integrated Routing (FC Routing) and local switching provide architectural-level solutions that limit the need for more complex monitoring and mitigation capabilities. However, using fabric design as a protection mechanism requires some knowledge of which devices are likely to exhibit latency.

Faulty media
In addition to high-latency devices causing disruptions to data centers, fabric problems are often the result of faulty media. Faulty media can include bad cables, SFPs, extension equipment, receptacles, patch panels, improper connections, and so on. Media can fault on any port type (E_Port or F_Port) and fail, often unpredictably and intermittently, which makes the fault even harder to diagnose.

Faulty media involving F_Ports results in an impact to the end device attached to the F_Port and to devices communicating with this device. Failures on E_Ports can have an even greater impact. Many flows (host and target pairs) can simultaneously traverse a single E_Port. In large fabrics, this can be hundreds or thousands of flows. In the event of a media failure involving one of these links, it is possible to disrupt some or all of the flows utilizing the path.

Severe cases of faulty media, such as a disconnected cable, can result in a complete failure of the media, which effectively brings a port offline. This is typically easy to detect and identify. When this occurs on an F_Port, the impact is specific to flows involving the F_Port. E_Ports are typically redundant, so severe failures on E_Ports typically only result in a minor drop in bandwidth because the fabric automatically utilizes redundant paths. Also, error reporting built into FOS readily identifies the failed link and port, allowing for simple corrective action and repair.

With moderate cases of faulty media, failures occur but the port can remain online or transition between online and offline. This can cause repeated errors, which can occur indefinitely or until the media fails completely. When these types of failures occur on E_Ports, the result can be devastating because there can be repeated errors that impact many flows. This can result in significant impacts to applications that last for prolonged durations.

Signatures of these types of failures include the following:
- CRC errors on frames
- Invalid Transfer Words (includes encoder out errors)
- State Changes (ports going offline or online repeatedly)
- Credit loss (complete loss of credit on a virtual channel (VC) on an E_Port prevents traffic from flowing on that VC, resulting in frame loss and I/O failures for devices utilizing the VC)

Automatically detecting and mitigating faulty media


You can automatically detect and mitigate the impact of faulty media by using Fabric Watch monitoring and quarantine, as well as the Bottleneck Detection feature, as explained here.

Fabric Watch
Enable Fabric Watch to monitor for CRC errors, Invalid Transfer Words, and State Changes. Configure it to raise an alert when a low threshold is reached and to fence (disable) the port when a high threshold is reached. See "Configuring Port Fencing" for details about how to enable and configure Fabric Watch Port Fencing.

Fabric Watch monitoring


Fabric Watch monitors can be enabled to automatically detect most of the faulty media conditions previously noted. For example, Fabric Watch can monitor CRC errors (available in FOS 6.1), Invalid Transfer Words (available in FOS 6.1), and State Changes (ports transitioning between offline and online, available in FOS 6.3). Fabric Watch generates alerts based on user-defined thresholds for these conditions.

The most common cause of credit loss is corruption of credit return messages (VC_RDY or R_RDY) due to faulty media. Credit corruption is tracked as an encoder out error, which is counted as an Invalid Transfer Word error. Monitoring and mitigating Invalid Transfer Word issues therefore protects against credit loss.

Fabric Watch quarantine


Fabric Watch also provides a mechanism that quarantines the badly behaving component with the optional action of Port Fencing. Port Fencing is available for each of the previously noted conditions. Use it to automatically protect the fabric from these error conditions. The thresholds specified in "Configuring Port Fencing" have been tested and tuned to quarantine components that are misbehaving to the point at which they are likely to cause a fabric-wide impact. These thresholds do not falsely trigger on normally behaving components.

Bottleneck Detection
The Bottleneck Detection feature can detect different types of bottlenecks in a fabric. Lost buffer credits can result in extreme congestion by slowing the aggregate throughput of a connection. Bottleneck Detection can detect ports that are blocked due to lost credits, and generate special lost credit alerts for the E_Port in this condition (available in FOS 6.3.1b and later). Bottleneck Detection can also generate alerts on upstream E_Ports that are blocked due to a downstream latency condition, such as an E_Port with lost credits or a high-latency device.

Bottleneck Detection can optionally create alerts when latencies are detected. Prior to FOS 6.4, alerts were sent to RASlog. Simple Network Management Protocol (SNMP) alerts were introduced in FOS 6.4. See "Configuring Bottleneck Detection" for useful suggestions about configuring and using this feature.

FOS 6.3.2d and later allow Bottleneck Detection alerts to be sent through SNMP by way of RASlog message AN-1003. AN-1003 itself is not an actual error condition, but rather an indication of a potential bottleneck device. The change in this 6.3 code level and later provides an easy interface for customers to trap Bottleneck Detection alerts through SNMP.

Summary of best practices


The following list summarizes how you can use features and capabilities to improve the overall resiliency of Brocade FOS-based FC fabric environments:
- Enable Fabric Watch for monitoring and alerting
- Enable Fabric Watch Port Fencing to fence ports on extreme behavior
- Enable the Edge Hold Time feature
- Enable Bottleneck Detection for congestion conditions

See "Suggested implementation" for the specific commands required to implement these features using FOS 6.3.1b and FOS 6.4.

Configuring Bottleneck Detection


This section provides examples showing you how to:
- Enable and disable Bottleneck Detection
- Display a list of ports with Bottleneck Detection enabled
- Change the Bottleneck Detection setting on a port
- Display a history of bottlenecks that occurred on a port

This section also provides an example of a bottleneck alert.

Note: The following sections are intended to be illustrative of the commands required to configure and enable FOS features. The actual commands and output will vary slightly, depending on the version of the FOS deployed. Refer to the FOS command reference manual for the FOS version in your environment.

Enabling and disabling Bottleneck Detection


When Bottleneck Detection is enabled, RASlog alerts can also be enabled so that an alert is sent when the bottleneck conditions at a port exceed a specified threshold. To enable Bottleneck Detection:
1. On the switch with target port connections, log in with administrator privileges.
2. Enter the bottleneckmon --enable command to enable Bottleneck Detection on an F_Port or FL_Port:
   bottleneckmon --enable [ -alert ] [ -thresh threshold ] [ -time window ] [ -qtime quiet_time] [slot/]portlist [[slot/]portlist]...

If the alert parameter is not specified, then alerts are not sent, but a history of bottleneck conditions for the port can be viewed. The thresh, time, and qtime parameters are also ignored if the alert parameter is not specified. Use the default values for the thresh (0.1), time (300), and qtime (300) parameters.
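If the enable command is generated by a script, for example to apply the same settings across many switches, the syntax above can be composed as in the following Python sketch. The function `build_enable_command` is hypothetical, it only builds a command string and does not connect to a switch, and it uses only the options named in the syntax line above.

```python
from typing import Optional, Sequence


def build_enable_command(
    ports: Sequence[str],
    alert: bool = False,
    thresh: Optional[float] = None,
    time_window: Optional[int] = None,
    quiet_time: Optional[int] = None,
) -> str:
    """Compose a bottleneckmon --enable command line as a string."""
    if not ports:
        raise ValueError("at least one port or port range is required")
    tuning = (thresh, time_window, quiet_time)
    if not alert and any(value is not None for value in tuning):
        # Per the text, these parameters are ignored unless -alert is given.
        raise ValueError("-thresh, -time, and -qtime have no effect without -alert")

    parts = ["bottleneckmon", "--enable"]
    if alert:
        parts.append("-alert")
    if thresh is not None:
        parts += ["-thresh", f"{thresh:g}"]
    if time_window is not None:
        parts += ["-time", str(time_window)]
    if quiet_time is not None:
        parts += ["-qtime", str(quiet_time)]
    parts.extend(ports)
    return " ".join(parts)


if __name__ == "__main__":
    print(build_enable_command(["*"], alert=True))
    print(build_enable_command(["3-7"]))
    print(build_enable_command(["4"], alert=True, thresh=0.6, time_window=420))
```

Rejecting tuning values when alerts are off is a design choice of this sketch; it surfaces a setting that the switch would otherwise silently ignore.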

Enabling Bottleneck Detection example (preferred use case)


The following example enables Bottleneck Detection on all F_Ports and FL_Ports in the switch with RASlog alerts, using default values for threshold and time. Alerts are logged when a port is experiencing a bottleneck condition for 10% of the time (default value for thresh/lthresh) over any period of 300 seconds (default value for time), with a minimum of 300 seconds (default value for qtime) between alerts.

switch:admin> bottleneckmon --enable -alert *
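To make the interaction of the three default values concrete, the following Python sketch simulates the alert rule as it is described above. It is a simplified model, not the FOS algorithm: it assumes a sliding window evaluated once per second, an alert when the affected fraction reaches the threshold, and suppression of further alerts until the quiet time has elapsed. The function name `alert_times` is hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, List, Optional


def alert_times(
    affected: Iterable[bool],
    thresh: float = 0.1,
    window: int = 300,
    qtime: int = 300,
) -> List[int]:
    """Return the seconds at which this simplified model raises an alert.

    `affected` holds one flag per second: True if the port was
    bottlenecked during that second.
    """
    recent: Deque[bool] = deque(maxlen=window)  # flags for the last `window` seconds
    affected_count = 0
    last_alert: Optional[int] = None
    alerts: List[int] = []
    for second, flag in enumerate(affected):
        if len(recent) == window:
            affected_count -= recent[0]  # oldest flag is evicted by the append below
        recent.append(bool(flag))
        affected_count += bool(flag)
        if len(recent) < window:
            continue  # assumption: wait until a full window has been observed
        if affected_count / window >= thresh:
            if last_alert is None or second - last_alert >= qtime:
                alerts.append(second)
                last_alert = second
    return alerts


if __name__ == "__main__":
    # 60 affected seconds followed by 15 quiet minutes: 20% of the first window.
    print(alert_times([True] * 60 + [False] * 900))
    # A port that is bottlenecked continuously for 15 minutes.
    print(alert_times([True] * 900))
```

With the defaults, a port that is continuously bottlenecked is reported about once every 300 seconds rather than every second, which is the purpose of the quiet time.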

Enabling Bottleneck Detection on ports 3 - 7 with default values example


The following example enables Bottleneck Detection on ports 3 through 7 using default values for threshold and time. No alerts are delivered to report bottleneck conditions, but the bottleneck history can be viewed using the CLI.

switch:admin> bottleneckmon --enable 3-7

Example: Disabling Bottleneck Detection


You can disable Bottleneck Detection by following these steps:
1. Connect to the switch to which the target port belongs, and log in as administrator.
2. Enter bottleneckmon --disable to disable Bottleneck Detection on a port.

Example: Disabling Bottleneck Detection on port 3


You can disable Bottleneck Detection on port 3 by using this command:

switch:admin> bottleneckmon --disable 3

Displaying a list of ports with Bottleneck Detection enabled


Follow these steps to display a list of ports that have Bottleneck Detection enabled:
1. Connect to the switch to which the target ports belong and log in as administrator.
2. Enter bottleneckmon --status to display a list of ports on which Bottleneck Detection is enabled, as shown in Example 2.

Note: When using Virtual Fabrics, the output displays ports that do not belong to the logical switch if the ports were moved out of the logical switch after Bottleneck Detection was enabled on them.
Example 2 Results of the bottleneckmon --status command

switch:admin> bottleneckmon --status
Port   Alerts?   Threshold   Time(s)   Quiet Time(s)
=======================================================================
3      N         --          --        --
4      Y         0.100       300       300
5      Y         0.100       300       300
6      N         --          --        --

Changing Bottleneck Detection settings on a port


The default settings for Bottleneck Detection are the preferred settings. These settings are configurable in the event that a user has specific reasons for modifying them, but in most cases, the default settings should not be changed. One reason to change the defaults is the presence of transient events that cause moderate congestion but are considered normal. Increasing the time or threshold may accommodate such events.

Using the following procedure, RASlog alerts can be enabled or disabled, along with configuration of the following settings:
- Threshold: the percentage of one-second intervals required to generate an alert
- Time: the time window in seconds in which bottleneck conditions are monitored and compared against the threshold
- Quiet Time (qtime): the minimum time in seconds between alerts

Note: Bottleneck Detection must be disabled on a port before any of the settings can be modified.

To change settings on a port:
1. Connect to the switch to which the target port belongs and log in as administrator.
2. Enter bottleneckmon --disable to disable Bottleneck Detection on the port.
3. Enter bottleneckmon --enable to enable Bottleneck Detection, specify the new threshold values, and set the alert option.

Example 3 changes the Bottleneck Detection settings on port 4. In this example, the bottleneckmon --status commands show the before and after settings.
Example 3 Before and after running the bottleneckmon --status command

switch:admin> bottleneckmon --status
Port   Alerts?   Threshold   Time(s)   Quiet Time(s)
=======================================================================
4      Y         0.800       300       300
switch:admin> bottleneckmon --disable 4
switch:admin> bottleneckmon --enable -alert -thresh 0.6 -time 420 4
switch:admin> bottleneckmon --status
Port   Alerts?   Threshold   Time(s)   Quiet Time(s)
=======================================================================
4      Y         0.600       420       300

Displaying the history of bottlenecks on a port


Use the bottleneckmon --show command to display a history of bottleneck conditions for an individual port:
1. Connect to the switch to which the target port belongs and log in as administrator.
2. Enter the bottleneckmon --show command to display a history of the bottleneck severity for a specific port.

Example 4 shows the bottleneck history for port 3 in five-second windows over a period of 30 seconds.
Example 4 Results of the bottleneckmon --show command

fcr_saturn1:root> bottleneckmon --show -interval 5 -span 30 3
=============================================================
Mon Jun 15 18:54:35 UTC 2010
=============================================================
From              To                affected secs
=============================================================
Jun 15 18:54:30   Jun 15 18:54:35    80.00%
Jun 15 18:54:25   Jun 15 18:54:30    40.00%
Jun 15 18:54:20   Jun 15 18:54:25     0.00%
Jun 15 18:54:15   Jun 15 18:54:20     0.00%
Jun 15 18:54:10   Jun 15 18:54:15    20.00%
Jun 15 18:54:05   Jun 15 18:54:10    80.00%

Bottleneck alert example


The following is an example of a Bottleneck Detection alert on an F_Port.
Example 5 Example Bottleneck Detection alert on an F_Port

2010/03/16-03:40:47, [AN-1003], 21760, FID 128, WARNING, sw0, Latency bottleneck at slot 0, port 38. 100.00 percent of last 300 seconds were affected. Avg. time b/w transmits 80407.3975 us.
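Alerts such as the one in Example 5 are often collected by a log server, where the port and latency figures need to be extracted. The following Python sketch parses the message layout shown in Example 5. The function name `parse_an1003` and the returned field names are hypothetical, and the message format is assumed to match Example 5; other FOS versions may word the message differently.

```python
import re
from typing import Dict, Optional, Union

_ALERT = re.compile(
    r"\[AN-1003\].*?Latency bottleneck at slot (?P<slot>\d+), port (?P<port>\d+)\.\s+"
    r"(?P<pct>\d+(?:\.\d+)?) percent of last (?P<window>\d+) seconds were affected\.\s+"
    r"Avg\. time b/w transmits (?P<gap_us>\d+(?:\.\d+)?) us",
    re.DOTALL,
)


def parse_an1003(message: str) -> Optional[Dict[str, Union[int, float]]]:
    """Return the fields of a latency bottleneck alert, or None if no match."""
    match = _ALERT.search(message)
    if match is None:
        return None
    return {
        "slot": int(match.group("slot")),
        "port": int(match.group("port")),
        "affected_pct": float(match.group("pct")),
        "window_s": int(match.group("window")),
        # The message reports microseconds; convert to milliseconds.
        "avg_gap_ms": float(match.group("gap_us")) / 1000.0,
    }


# Message copied from Example 5.
SAMPLE_ALERT = (
    "2010/03/16-03:40:47, [AN-1003], 21760, FID 128, WARNING, sw0, "
    "Latency bottleneck at slot 0, port 38. 100.00 percent of last 300 "
    "seconds were affected. Avg. time b/w transmits 80407.3975 us."
)

if __name__ == "__main__":
    print(parse_an1003(SAMPLE_ALERT))
```

In Example 5, the average time between transmits is about 80.4 ms. If that figure is taken as an approximation of the delay between credit returns (an assumption of this note), it falls in the moderate range described earlier, that is, tens of milliseconds up to 100 ms.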

Configuring Port Fencing


Note: The following sections are intended to be illustrative of the commands required to configure and enable FOS features. The actual command usage and output will vary slightly, depending on the version of the FOS deployed. Refer to the FOS command reference manual for the FOS version in your environment.

Use the portFencing CLI command to enable error reporting for the Fabric Watch Port Fencing feature on all ports of a specified type and to configure the ports to report errors for a specific area. Supported port types include E_Ports, F_Ports, and physical ports. A specified port type can be configured to report errors for one or more areas.

Note: Avoid Port Fencing using timeouts on E_Ports, because it can affect traffic for a large number of devices throughout the fabric.

Port Fencing monitors ports for erratic behavior and disables a port if specified error conditions are met. The portFencing CLI command enables or disables the Port Fencing feature for an area of a class. You can customize or tune the threshold of an area using the portThConfig CLI command.

Use portFencing to configure Port Fencing to detect excessive Invalid Words on all F_Ports. See the following example:
portfencing --enable fop-port -area IW

The same command can be used to configure Port Fencing on link reset:
portfencing --enable fop-port -area LR

Use portFencing to configure Port Fencing to detect excessive State Changes on E_Ports. See the following example:
portfencing --enable e-port -area ST

Note: Using the Lossless DPS feature can minimize the number of dropped frames due to E_Ports being fenced because of State Changes.

Use portThConfig to customize Port Fencing thresholds:
switch:admin> portthconfig --set port -area crc -highthreshold -value 2 -trigger above -action email
switch:admin> portthconfig --set port -area crc -highthreshold -trigger below -action email
switch:admin> portthconfig --set port -ar crc -lowthreshold -value 1 -trigger above -action email
switch:admin> portthconfig --set port -ar crc -lowthreshold -trigger below -action email

To apply the new custom settings so that they become effective: switch:admin> portthconfig --apply port -area crc -action cust -thresh_level custom To display the port threshold configuration for all port types and areas: switch:admin> portthconfig --show

Port Fencing suggested thresholds


Table 1 lists the suggested thresholds for Port Fencing for three areas.

Table 1 Suggested Port Fencing thresholds

   Area            Suggested threshold
   Link Reset      5
   State Change    7
   C3TX_TO         5

Suggested thresholds for CRC errors and Invalid Transfer Words


CRC errors and Invalid Transfer Words can occur on normal links. They have also been known to occur during certain transitions, such as server reboots. When these errors occur more frequently, they can cause a severe impact. Although most environments can tolerate infrequent CRC errors or Invalid Transfer Words, some are sensitive to even infrequent instances. The overall quality of the fabric interconnects is also a factor: in general, cleaner interconnects can have lower thresholds because they should be less likely to introduce errors on the links. Table 2 lists suggestions for establishing moderate (the preferred choice), conservative, and aggressive thresholds.
Table 2 Aggressive and conservative thresholds

   Area                     Moderate threshold   Aggressive threshold   Conservative threshold
   CRC                      Low 5, High 20       Low 1, High 2          Low 5, High 40
   Invalid Transfer Words   Low 25, High 40      Low 12, High 25        Low 25, High 80


After you select the type of thresholds for an environment, set the low threshold with an action of ALERT (RASlog, email, SNMP trap). The alert is triggered whenever the low threshold is exceeded. Set the high threshold with an action of Fence. The port is fenced (disabled) whenever the high threshold is exceeded.
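The relationship between the threshold profiles in Table 2 and the two actions can be sketched in a few lines of Python. This is an illustrative model only, not how Fabric Watch is implemented: the dictionary layout and function name are hypothetical, "exceeded" is taken to mean strictly greater than the threshold, and the error count is assumed to be the count observed in one monitoring interval (the interval itself is not modeled).

```python
# Minimal sketch: map an observed error count to the action described above.
# Values are the (low, high) pairs from Table 2; names are hypothetical.

THRESHOLDS = {
    "moderate": {"CRC": (5, 20), "ITW": (25, 40)},
    "aggressive": {"CRC": (1, 2), "ITW": (12, 25)},
    "conservative": {"CRC": (5, 40), "ITW": (25, 80)},
}


def action_for(profile, area, error_count):
    """Return 'fence', 'alert', or 'none' for a count in one interval."""
    low, high = THRESHOLDS[profile][area]
    if error_count > high:
        return "fence"  # high threshold exceeded: port would be disabled
    if error_count > low:
        return "alert"  # low threshold exceeded: RASlog, email, or SNMP trap
    return "none"


if __name__ == "__main__":
    for count in (5, 6, 21):
        print(count, action_for("moderate", "CRC", count))
```

Comparing the same count across profiles shows the trade-off: a count that only raises an alert under the conservative profile can fence the port under the aggressive one.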

Configuring Edge Hold Time


Note: The following sections are intended to be illustrative of the commands required to configure and enable FOS features. The actual command usage and output will vary slightly, depending on the version of FOS deployed. For the actual command details and output produced, see the FOS command reference manual for the version of FOS in your environment.

Users can configure Edge Hold Time by using the following commands. The switch does not need to be disabled to modify the hold time. Use the Configure Edge Hold Time option to turn this feature on or off. Output of the configure command is shown in Example 6.

Example 6 Configure command

   IBM_SAN384B_27_VF:admin> configure
   Not all options will be available on an enabled switch.
   To disable the switch, use the "switchDisable" command.
   Configure...
     Fabric parameters (yes, y, no, n): [no] yes
       Configure edge hold time (yes, y, no, n): [yes]
       Edge hold time: (100..500) [100]

The edge_hold_time value is persistently stored in the configuration file. All configuration file operations such as upload and download are supported for this feature.

Note: This setting is only available in FOS 6.3.1b and later.
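If the hold time is supplied by a script or a change record rather than typed at the prompt, it can be checked against the range that the configure dialog displays (100..500) before anyone enters it. The following Python sketch is a hypothetical pre-check; the function and constant names are not part of FOS, and the unit is assumed to be milliseconds, as in the suggested 100 ms value.

```python
# Minimal sketch: validate a proposed Edge Hold Time before it is entered.
# The range comes from the configure prompt "(100..500)"; names are hypothetical.

EDGE_HOLD_TIME_MIN_MS = 100
EDGE_HOLD_TIME_MAX_MS = 500


def validate_edge_hold_time(value_ms):
    """Return value_ms unchanged if it is an integer within the prompt range."""
    if isinstance(value_ms, bool) or not isinstance(value_ms, int):
        raise TypeError("edge hold time must be an integer number of milliseconds")
    if not EDGE_HOLD_TIME_MIN_MS <= value_ms <= EDGE_HOLD_TIME_MAX_MS:
        raise ValueError(
            f"edge hold time {value_ms} is outside "
            f"{EDGE_HOLD_TIME_MIN_MS}..{EDGE_HOLD_TIME_MAX_MS}"
        )
    return value_ms


if __name__ == "__main__":
    print(validate_edge_hold_time(100))
```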

Suggested implementation
Implement the resiliency features provided by the Brocade FOS by using a two-phase approach. In Phase 1, enable only the alerting and monitoring features using the suggested threshold values. In Phase 2, enable the mitigation (fencing) features. This approach allows you to gain experience with these features in your environment (over a 30-day period, for example), and gives you an opportunity to identify and address pre-existing conditions before you enable mitigation actions.

Note: The suggested approach and associated thresholds have been identified as appropriate for most environments. Specific environments may require alternate settings to meet specific requirements.


The resiliency features discussed in this paper were introduced and enhanced over several levels of FOS code. This section specifically refers to the capabilities provided by FOS codes 6.3 and later. FOS code 6.3 contains the majority of these new features; however, FOS 6.4 introduced additional enhancements, most notably in regard to bottleneck detection. Note: Port Fencing for Class 3 (C3) discards or timeouts is supported on 8 Gbps in FOS 6.3 and later. However, it was not supported on 4 Gbps until FOS 6.3.1b.

Phase 1
Step 1: Enable Fabric Watch alerting
Enable Fabric Watch Alerting on both F_Ports and E_Ports. Table 3 lists suggested values.
Table 3 Suggested Fabric Watch Alerting thresholds for F_Ports and E_Ports

   Condition               F_Ports            E_Ports
   Link Resets             Low 3, High 5      Low 2
   State Change            Low 3, High 7      Low 2
   CRC                     Low 5, High 20     Low 2
   Invalid Transfer Word   Low 25, High 40    Low 10
   C3TX_TO (C3 Discard)    Low 3, High 5      NA

Note: The C3 Discard Frames Threshold cannot be applied to an E_Port.

Thresholds can be defined using either the Fabric Watch GUI or the command line. Figure 3 shows the Fabric Watch Threshold Configuration tab. Example 7 shows the CLI commands used to set the low and high threshold values for F_Ports and E_Ports.

Note: Thresholds can also be defined using DCFM; however, detailing the use of DCFM is beyond the scope of this paper. For information about using DCFM to set monitoring thresholds, see the following product documents, available with registration at http://my.brocade.com/wps/portal:

   The IBM System Storage publication, Data Center Fabric Manager User Manual Supporting DCFM version 10.4.x (GC52-1304-03) at http://www-01.ibm.com/support/docview.wss?uid=ssg1S7003231

   Configuring Performance Monitoring and Thresholds in Brocade Data Center Fabric Manager (DCFM), GA-BP-247-00 at http://www.brocade.com/downloads/documents/white_papers/DCFM_PerfMon_Thresholds_GA-BP-247-00.pdf


Example 7 CLI commands used to set strongly suggested threshold values

   portthconfig --set fop-port -area LR -lowthreshold -value 3 -trigger above -action raslog
   portthconfig --set fop-port -area LR -highthreshold -value 5 -trigger above -action raslog
   portthconfig --set fop-port -area ST -lowthreshold -value 3 -trigger above -action raslog
   portthconfig --set fop-port -area ST -highthreshold -value 7 -trigger above -action raslog
   portthconfig --set fop-port -area CRC -lowthreshold -value 5 -trigger above -action raslog
   portthconfig --set fop-port -area CRC -highthreshold -value 20 -trigger above -action raslog
   portthconfig --set fop-port -area ITW -lowthreshold -value 25 -trigger above -action raslog
   portthconfig --set fop-port -area ITW -highthreshold -value 40 -trigger above -action raslog
   portthconfig --set fop-port -area C3TX_TO -lowthreshold -value 3 -trigger above -action raslog
   portthconfig --set fop-port -area C3TX_TO -highthreshold -value 5 -trigger above -action raslog
   portthconfig --set e-port -area LR -lowthreshold -value 2 -trigger above -action raslog
   portthconfig --set e-port -area ST -lowthreshold -value 2 -trigger above -action raslog
   portthconfig --set e-port -area CRC -lowthreshold -value 2 -trigger above -action raslog
   portthconfig --set e-port -area ITW -lowthreshold -value 10 -trigger above -action raslog
   portthconfig --set e-port -area C3TX_TO -lowthreshold -value 2 -trigger above -action raslog

After the custom thresholds and actions have been defined as shown in Example 7, they must then be applied. Example 8 shows the CLI commands used to apply the custom defined threshold and action for each area.
Example 8

   portthconfig --apply fop-port -area LR -action_level cust -thresh_level cust
   portthconfig --apply fop-port -area LR -action_level cust -thresh_level cust
   portthconfig --apply fop-port -area ST -action_level cust -thresh_level cust
   portthconfig --apply fop-port -area ST -action_level cust -thresh_level cust
   portthconfig --apply fop-port -area CRC -action_level cust -thresh_level cust
   portthconfig --apply fop-port -area CRC -action_level cust -thresh_level cust
   portthconfig --apply fop-port -area ITW -action_level cust -thresh_level cust
   portthconfig --apply fop-port -area ITW -action_level cust -thresh_level cust
   portthconfig --apply fop-port -area C3TX_TO -action_level cust -thresh_level cust
   portthconfig --apply fop-port -area C3TX_TO -action_level cust -thresh_level cust
   portthconfig --apply e-port -area LR -action_level cust -thresh_level cust
   portthconfig --apply e-port -area ST -action_level cust -thresh_level cust
   portthconfig --apply e-port -area CRC -action_level cust -thresh_level cust
   portthconfig --apply e-port -area ITW -action_level cust -thresh_level cust
   portthconfig --apply e-port -area C3TX_TO -action_level cust -thresh_level cust

Note: For threshold alerts to be acted upon (written to the RASlog, sent as an email, or sent as an SNMP trap), the -trigger and -action parameters must be specified. Multiple actions can be specified when separated by commas. Example:

   portthconfig --set fop-port -area LR -lowthreshold -value 3 -trigger above -action raslog,email,snmp
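Because the `portthconfig --set` commands differ only in port type, area, threshold level, value, and action list, they can be generated from the values in Table 3. The Python sketch below is one possible way to do that. The data layout and function name are hypothetical, it only produces command strings, and it follows Table 3 in omitting C3TX_TO for E_Ports (listed as NA) and in giving E_Ports a low threshold only.

```python
# Minimal sketch: generate 'portthconfig --set' strings from Table 3.
# Each entry is (low, high); None means no value is suggested in the table.

TABLE_3 = {
    "fop-port": {
        "LR": (3, 5),
        "ST": (3, 7),
        "CRC": (5, 20),
        "ITW": (25, 40),
        "C3TX_TO": (3, 5),
    },
    "e-port": {
        "LR": (2, None),
        "ST": (2, None),
        "CRC": (2, None),
        "ITW": (10, None),
    },
}


def set_commands(table, actions=("raslog",)):
    """Return one command string per suggested low or high threshold."""
    action_list = ",".join(actions)  # multiple actions are comma-separated
    commands = []
    for port_type, areas in table.items():
        for area, (low, high) in areas.items():
            for level, value in (("lowthreshold", low), ("highthreshold", high)):
                if value is None:
                    continue
                commands.append(
                    f"portthconfig --set {port_type} -area {area} -{level} "
                    f"-value {value} -trigger above -action {action_list}"
                )
    return commands


if __name__ == "__main__":
    for command in set_commands(TABLE_3, actions=("raslog", "email", "snmp")):
        print(command)
```

Keeping the thresholds in one table makes it easier to review a change (for example, moving from moderate to conservative CRC values) than editing individual command lines.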

Step 2: Enable Bottleneck Detection alerting


Use the default values that follow:

   FOS 6.3: thresh=(0.1), time=(300), qtime=(300)
   FOS 6.4: lthresh=(0.1), cthresh=(0.8), time=(300), qtime=(300)

Note: FOS 6.3 supports Bottleneck Detection on F_Ports only. FOS 6.4 supports Bottleneck Detection on F_Ports and E_Ports.

Example 9 shows the commands used to enable Bottleneck Detection alerting on FOS 6.3.1b. Be aware that on FOS 6.3.1b, ports must be enabled individually. The output from bottleneckmon --status shows ports that have been enabled.

Example 9 Enabling Bottleneck Detection alerting (for FOS 6.3.1b)

   butane:admin> bottleneckmon --enable -alert 1
   butane:admin> bottleneckmon --enable -alert 2
   butane:admin> bottleneckmon --enable -alert 3
   butane:admin> bottleneckmon --status
   Port    Alerts?    Threshold    Time (s)    Quiet Time (s)
   ======================================================================
   1       Y          0.100        300         300
   2       Y          0.100        300         300
   3       Y          0.100        300         300
   butane:admin>
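Because ports must be enabled one at a time on FOS 6.3.1b, it is easy to miss one. The following Python sketch parses captured `bottleneckmon --status` text and reports which expected ports do not yet have alerting enabled. It is an illustration only: the function names are hypothetical, and the parser assumes the five-column per-port layout shown in Example 9, which may differ in other FOS versions.

```python
# Minimal sketch: find ports that are missing from per-port status output.
# Assumes rows of the form: <port> <Y|N> <threshold> <time> <quiet time>.


def parse_port_status(output):
    """Return {port: settings} for each per-port row in the captured text."""
    ports = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows have exactly five fields and start with a port number;
        # prompts, the header, and the separator line do not match.
        if len(fields) == 5 and fields[0].isdigit():
            ports[int(fields[0])] = {
                "alerts": fields[1] == "Y",
                "threshold": float(fields[2]),
                "time_s": int(fields[3]),
                "quiet_time_s": int(fields[4]),
            }
    return ports


def ports_without_alerting(expected_ports, output):
    """Return the expected ports that are absent or have alerts disabled."""
    status = parse_port_status(output)
    return sorted(
        port
        for port in set(expected_ports)
        if port not in status or not status[port]["alerts"]
    )
```

For example, passing the expected F_Port numbers together with the captured status text returns an empty list once every port has been enabled.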


Example 10 shows the commands used to enable Bottleneck Detection alerting on FOS 6.4. Be aware that on FOS 6.4, all ports are enabled together on a logical switch basis.
Example 10 Enabling Bottleneck Detection alerting (for FOS 6.4 and later)

   max2:admin> bottleneckmon --enable -alert *
   max2:admin> bottleneckmon --status
   Bottleneck detection - Enabled
   ==============================
   Switch-wide alerting parameters:
   ============================
   Alerts                          - Yes
   Action requested                - No
   Latency threshold for alert     - 0.100
   Congestion threshold for alert  - 0.800
   Averaging time for alert        - 300 seconds
   Quiet time for alert            - 300 seconds
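After enabling alerting on FOS 6.4, the switch-wide parameters reported by `bottleneckmon --status` can be compared with the suggested defaults. The Python sketch below shows one way to do this from captured output. The function names are hypothetical, and the parser assumes the "name - value" line layout shown in Example 10; it is not based on a documented output format.

```python
# Minimal sketch: compare switch-wide status values with suggested defaults.
# Keys are the parameter labels as they appear in the Example 10 output.

SUGGESTED_DEFAULTS = {
    "Latency threshold for alert": 0.1,
    "Congestion threshold for alert": 0.8,
    "Averaging time for alert": 300.0,  # seconds
    "Quiet time for alert": 300.0,      # seconds
}


def parse_switch_wide_status(output):
    """Return {label: value text} for every 'label - value' line."""
    params = {}
    for line in output.splitlines():
        key, separator, value = line.partition(" - ")
        if separator:
            params[key.strip()] = value.strip()
    return params


def deviations_from(expected, output):
    """Return {label: (expected, found)} for values that differ or are missing."""
    params = parse_switch_wide_status(output)
    deviations = {}
    for key, wanted in expected.items():
        raw = params.get(key)
        # Values such as "300 seconds" carry a unit; keep the leading number.
        found = float(raw.split()[0]) if raw else None
        if found is None or abs(found - wanted) > 1e-9:
            deviations[key] = (wanted, found)
    return deviations
```

An empty result means the four alerting parameters match the suggested defaults; any other result lists the expected and observed values side by side.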

Step 3: Enable the Edge Hold Timer feature


Select a hold time of 100 ms. Example 11 shows the command used to enable the Edge Hold Timer with a hold time of 100 ms.
Example 11 Setting Edge Hold Timer

   max2:admin> configure
   Not all options will be available on an enabled switch.
   To disable the switch, use the "switchDisable" command.
   Configure...
     Fabric parameters (yes, y, no, n): [no] yes
       Configure edge hold time (yes, y, no, n): [no] yes
       Edge hold time: (100..500) [100]
   max2:admin>

Phase 2
Step 4: Enable Fabric Watch Port Fencing
Enable Fabric Watch Port Fencing on F_Ports only. Fencing will occur on the high threshold values defined in Phase 1. See Table 3 for information about those values.


Example 12 shows the commands used to enable port fencing on F_Ports.


Example 12 Commands used to enable port fencing on F_Ports

   portfencing --enable fop-port -area LR
   portfencing --enable fop-port -area ST
   portfencing --enable fop-port -area CRC
   portfencing --enable fop-port -area ITW
   portfencing --enable fop-port -area C3TX_TO

Note: Although the high threshold values used by port fencing can be defined using the Fabric Watch GUI, port fencing can only be enabled from the command line or through DCFM. For information about using DCFM, see the following document, available with registration at http://my.brocade.com/wps/portal:

   The IBM System Storage Data Center Fabric Manager User Manual, Supporting DCFM version 10.4.x, GC52-1304-03 at http://www-01.ibm.com/support/docview.wss?uid=ssg1S7003231

Figure 3 shows the Fabric Watch Threshold Configuration menu.

Figure 3 Fabric Watch Threshold Configuration menu

Step 5: Evaluate and tune Bottleneck Detection alerting


After you gain experience with Bottleneck Detection using the default values, it may be appropriate to change the threshold values to more aggressive settings. If the default values applied in Step 2 are not generating an excessive volume of alerts, apply a more aggressive setting to improve the effectiveness of the monitoring.


Note: The recommended threshold settings defined here were found to be appropriate for most environments. In some cases, these settings will need to be adjusted for the specific environment. If alerting is triggering on events that do not represent a problem, the values can either be returned to the default values or tuned through trial and error.

Suggested initial aggressive values are:

   FOS 6.3: thresh=(0.1), time=(120), qtime=(120)
   FOS 6.4: lthresh=(0.1), cthresh=(0.75), time=(120), qtime=(120)

Example 13 shows the commands for changing Bottleneck Detection alerting using FOS 6.3.1b.

Example 13 Changing Bottleneck Detection alerting (for FOS 6.3.1b)

   butane:admin> bottleneckmon --enable -alert -thresh 0.1 -time 120 -qtime 120 1
   butane:admin> bottleneckmon --enable -alert -thresh 0.1 -time 120 -qtime 120 2
   butane:admin> bottleneckmon --enable -alert -thresh 0.1 -time 120 -qtime 120 3
   butane:admin> bottleneckmon --status
   Port    Alerts?    Threshold    Time (s)    Quiet Time (s)
   ======================================================================
   1       Y          0.100        120         120
   2       Y          0.100        120         120
   3       Y          0.100        120         120
   butane:admin>

Example 14 shows the commands for changing Bottleneck Detection alerting using FOS 6.4.
Example 14 Changing Bottleneck Detection alerting (for FOS 6.4)

   max2:admin> bottleneckmon --config -lthresh .1 -cthresh .75 -time 120 -qtime 120 -alert
   max2:admin> bottleneckmon --status
   Bottleneck detection - Enabled
   ==============================
   Switch-wide alerting parameters:
   ============================
   Alerts                          - Yes
   Latency threshold for alert     - 0.100
   Congestion threshold for alert  - 0.750
   Averaging time for alert        - 120 seconds
   Quiet time for alert            - 120 seconds
   max2:admin>
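The tuning commands in Example 13 and Example 14 differ by FOS level: per port with `--enable` on FOS 6.3.1b, and switch-wide with `--config` on FOS 6.4. The Python sketch below captures that difference in a small command builder. The class, constants, and function are hypothetical names used for illustration, the version strings "6.3" and "6.4" are simply labels for the two styles, and the builder writes thresholds with a leading zero (0.1 rather than the .1 shown in Example 14).

```python
# Minimal sketch: build bottleneckmon tuning commands for either FOS style.
from dataclasses import dataclass


@dataclass(frozen=True)
class BottleneckSettings:
    latency_threshold: float
    congestion_threshold: float  # used by the FOS 6.4 style only
    time_s: int
    quiet_time_s: int


# Values taken from the default and suggested aggressive settings in the text.
DEFAULT = BottleneckSettings(0.1, 0.8, 300, 300)
AGGRESSIVE = BottleneckSettings(0.1, 0.75, 120, 120)


def tuning_commands(fos_style, settings, ports=()):
    """Return command strings for the per-port (6.3) or switch-wide (6.4) style."""
    if fos_style == "6.3":
        if not ports:
            raise ValueError("the per-port style needs at least one port")
        return [
            f"bottleneckmon --enable -alert -thresh {settings.latency_threshold} "
            f"-time {settings.time_s} -qtime {settings.quiet_time_s} {port}"
            for port in ports
        ]
    if fos_style == "6.4":
        return [
            f"bottleneckmon --config -lthresh {settings.latency_threshold} "
            f"-cthresh {settings.congestion_threshold} "
            f"-time {settings.time_s} -qtime {settings.quiet_time_s} -alert"
        ]
    raise ValueError(f"unsupported FOS style: {fos_style!r}")
```

Keeping DEFAULT alongside AGGRESSIVE makes it straightforward to generate the commands that return a switch to the default values if the tighter settings produce alerts for events that are not real problems.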


The team who wrote this paper


This paper was produced by a team of specialists from around the world, working at the International Technical Support Organization, San Jose Center.

Ian MacQuarrie is a Senior Technical Staff Member with the IBM Systems and Technology Group located in San Jose, California. He has 26 years of experience in enterprise storage systems in a variety of test and support roles. He is currently a member of the Systems and Technology Group (STG) Field Assist Team (FAST) supporting clients through critical account engagements, availability assessments, and technical advocacy. His areas of expertise include storage area networks (SANs), open systems storage solutions, and performance analysis. Ian co-authored a previous IBM Redbooks publication, Implementing the IBM System Storage SAN Volume Controller V6.1, SG24-7933.

John Juenemann is a Senior Technical Staff Member with the IBM Global Technology Services Delivery Storage Service Line, located in Boulder, Colorado. He has 19 years of experience in strategic outsourcing. During the last nine years, he has been focused on delivering storage solutions to some of the largest, strategically outsourced IBM accounts. His areas of expertise include open systems servers, databases, SANs, and enterprise storage solutions. John has previously been a contributor to the IBM Redbooks publication, Exploiting RS/6000 Security: Keeping it Safe at http://www.redbooks.ibm.com/abstracts/sg245521.html?Open.

Jon Tate is a Project Manager for IBM System Storage SAN Solutions at the International Technical Support Organization, San Jose Center. Before joining the ITSO in 1999, he worked in the IBM Technical Support Center, providing Levels 2 and 3 support for IBM storage products. Jon has 26 years of experience in storage software and management, services, and support, and is both an IBM Certified IT Specialist and an IBM SAN Certified Specialist. He is also the UK Chairman of the Storage Networking Industry Association.

Special thanks to Brocade for its unparalleled support of this paper in terms of equipment and support in many areas, and to the following people at Brocade: Jim Baldyga, Mansi Botadra, Silviano Gaona, Brian Steffler, Marcus Thordal, and Steven Tong.

Now you can become a published author, too!


Here's an opportunity to spotlight your skills, grow your career, and become a published author, all at the same time! Join an ITSO residency project and help write a book in your area of expertise, while honing your experience using leading-edge technologies. Your efforts will help to increase product acceptance and customer satisfaction, as you expand your network of technical contacts and relationships. Residencies run from two to six weeks in length, and you can participate either in person or as a remote resident working from your home base. Find out more about the residency program, browse the residency index, and apply online at: ibm.com/redbooks/residencies.html




Notices
This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A. The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. 
Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Copyright International Business Machines Corporation 2011. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.


This document REDP-4722-01 was created or updated on May 9, 2012. Send us your comments in one of the following ways: Use the online Contact us review Redbooks form found at: ibm.com/redbooks Send your comments in an email to: redbooks@us.ibm.com Mail your comments to: IBM Corporation, International Technical Support Organization Dept. HYTD Mail Station P099 2455 South Road Poughkeepsie, NY 12601-5400 U.S.A.


Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:

   IBM
   Redbooks
   Redpaper
   Redbooks (logo)
   RS/6000
   System Storage

The following terms are trademarks of other companies: Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Other company, product, or service names may be trademarks or service marks of others.
