
From a device perspective, we are mainly interested in device "health" data, such as overall throughput, per-(sub)interface utilization, response time, CPU load, memory consumption, errors, and so forth. Details about network element performance, such as interface
utilization and errors, are provided by the various MIBs, such as MIB-II (RFC 1213), Interfaces-Group-MIB (RFC 2863), and TCP-MIB
(RFC 2012).
Calculation of transmission efficiency is related to the number of invalid packets; it measures the error-free traffic on the network and compares the rate of erroneous packets to the rate of error-free packets. We measure only ingress transmission efficiency, because a router or switch does not send defective packets intentionally. The required parameters are provided by the Interface MIB (IF-MIB, RFC 2863):
ifInErrors "For packet-oriented interfaces, the number of inbound packets that contained errors preventing them from being
deliverable to a higher-layer protocol. For character-oriented or fixed-length interfaces, the number of inbound transmission units
that contained errors preventing them from being deliverable to a higher-layer protocol."
ifInUcastPkts "The number of packets, delivered by this sub-layer to a higher (sub-) layer, which were not addressed to a
multicast or broadcast address at this sub-layer." These are the unicast packets.
ifInNUcastPkts "The number of packets, delivered by this sub-layer to a higher (sub-)layer, which were addressed to a multicast
or broadcast address at this sub-layer." These are the nonunicast packets (the sum of multicast and broadcast traffic).
transmission efficiency [%] = ΔifInErrors * 100 / (ΔifInErrors + ΔifInUcastPkts + ΔifInNUcastPkts)
The CISCO-IF-EXTENSION-MIB adds details such as cieIfInFramingErrs (misaligned or framing errors), cieIfInOverrunErrs (the receiver
ran out of buffers), cieIfInRuntsErrs (too-small packets), and cieIfInGiantsErrs (too-large packets). These counters should be used for
in-depth error analysis; if the ifInErrors counter is high, the root cause needs to be identified.
Note
A single SNMP polling cycle of the MIB counters is useless; the delta between two polling cycles provides relevant data!
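As an illustration of this delta-based calculation, the following minimal Python sketch applies the formula above to two sample polling cycles. The counter values are invented, and the collection of the counters via SNMP is left out:

```python
# Transmission efficiency from IF-MIB counter deltas, per the formula above.
# The two sample polls are invented values, not real measurements.

def transmission_efficiency(first_poll: dict, second_poll: dict) -> float:
    """Percentage of erroneous inbound packets between two polling cycles."""
    d_errors = second_poll["ifInErrors"] - first_poll["ifInErrors"]
    d_ucast = second_poll["ifInUcastPkts"] - first_poll["ifInUcastPkts"]
    d_nucast = second_poll["ifInNUcastPkts"] - first_poll["ifInNUcastPkts"]
    total = d_errors + d_ucast + d_nucast
    return 100.0 * d_errors / total if total else 0.0

poll_1 = {"ifInErrors": 1200, "ifInUcastPkts": 8_500_000, "ifInNUcastPkts": 42_000}
poll_2 = {"ifInErrors": 1260, "ifInUcastPkts": 8_730_000, "ifInNUcastPkts": 43_100}

print(f"transmission efficiency: {transmission_efficiency(poll_1, poll_2):.4f} %")
```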
More details related to device performance can be found in the Cisco Press book Performance and Fault Management.
System and Server Performance Monitoring
Most of the recommendations described for networking devices also apply to server monitoring. The low-level and operating system functions need to be checked constantly to identify performance issues immediately. In addition to checking these details, you should also monitor the specific services running on the server. Consider a DNS service: it does not help the users to know that the server's response time to a ping request is fine if the logical service (DNS in this case) is very slow due to issues that might be caused by other applications running on the same physical hardware.
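To make the distinction concrete, the following sketch contrasts a low-level reachability check with a high-level service check for a DNS server. It is only an illustration: the host name, port, and resolved name are placeholders, the TCP connect stands in for a ping-style probe (ICMP requires raw-socket privileges), and the timed lookup uses the locally configured resolver rather than querying one specific server.

```python
import socket
import time

SERVER = "dns1.example.com"      # placeholder address of the server hosting the service
CHECK_NAME = "www.example.com"   # placeholder name to resolve through the service

def low_level_check(host: str, port: int = 53, timeout: float = 2.0) -> float | None:
    """Host reachability: time a TCP connect (stand-in for a ping probe)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None

def high_level_check(name: str) -> float | None:
    """Service check: time an actual name resolution via the local resolver."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(name, 80)
        return time.monotonic() - start
    except socket.gaierror:
        return None

print("host reachable in", low_level_check(SERVER), "s")
print("DNS lookup took   ", high_level_check(CHECK_NAME), "s")
```

A healthy result from the first check combined with a slow or failing second check is exactly the situation described above: the server is fine, the service is not.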
In the case of system and server monitoring, we make a distinction between low-level service monitoring and high-level service
monitoring:
Low-level service monitoring components:
- System: hardware and operating system (OS)
- Network card(s)
- CPU: overall and per system process
- Hard drive disks, disk clusters
- Fan(s)
- Power supply
- Temperature
- OS processes: check if running; restart if necessary
- System uptime
High-level service monitoring components:
- Application processes: check if running; restart if necessary
- Server response time per application
- Optional: Quality of service per application: monitor resources (memory, CPU, network bandwidth) per CoS definition
- Uptime per application
A practical approach is to measure the server performance with the Cisco IP SLA or Cisco NAM card for the Catalyst switch. The NAM
leverages the ART MIB and provides a useful set of performance statistics if located in the switch that connects to the server farm.
Figure 1-23 shows an ART MIB report. Chapter 5 includes details about the ART MIB.
Figure 1-23. Catalyst 6500 NAM ART Measurement
Network Performance Monitoring
Network connectivity and response time can be monitored with basic tools such as ping and traceroute or with more advanced tools such as the Ping-MIB, Cisco IP SLA, external probes, or a monitoring application running on a PC or server. When measuring network connectivity and response time, we recommend that the administrator monitor connectivity between the network devices and also to the servers. This can avoid finger-pointing between the networking and server departments.
In the context of network performance, we distinguish between the downtime measured by an application and the downtime experienced by a user. If you just monitor network and server availability and do not monitor the actual service (which runs on the physical server), you could measure 100 percent availability even if the service has an availability of no more than 90 percent. Another example is related to the measurement interval. If the performance monitoring application pings the devices only every 5 minutes, the result could be 100 percent availability even if there are short outages within the measurement interval. A user would experience these outages and argue that the measurement is inaccurate, even though in this case both parties are right. A similar situation could occur during the night: the user is probably asleep and does not care about the network, while the monitoring application still records outages. You can already imagine the challenges of service level definitions and monitoring.
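A small, purely illustrative calculation shows how polling granularity hides short outages (the outage times are invented):

```python
# Illustrative only: compare availability seen by a 5-minute poller with the
# availability a user experiences, for a day containing two short outages.

POLL_INTERVAL = 300            # seconds
DAY = 24 * 3600
# (start, end) of outages in seconds since midnight; invented sample values
outages = [(10 * 3600 + 60, 10 * 3600 + 150), (15 * 3600 + 30, 15 * 3600 + 140)]

def is_down(t: float) -> bool:
    return any(start <= t < end for start, end in outages)

polls = range(0, DAY, POLL_INTERVAL)
failed_polls = sum(1 for t in polls if is_down(t))
measured = 100.0 * (1 - failed_polls / len(polls))

downtime = sum(end - start for start, end in outages)
actual = 100.0 * (1 - downtime / DAY)

print(f"measured availability: {measured:.3f} %")   # 100.000 % in this example
print(f"actual availability:   {actual:.3f} %")     # roughly 99.8 %
```

In this example the poller never samples during an outage and reports 100 percent, while the user-experienced availability is roughly 99.8 percent.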
Availability is only one parameter in the area of network monitoring; others are also relevant:
Network response time
Utilization (device, network)
Packet loss
Network throughput/capacity
Network delay
Jitter (delay variation)
Transmission efficiency
Service Monitoring
We started the monitoring approach at the device and network level to ensure basic connectivity. Assuming that the network
connectivity and response time are well monitored now, the next step is to monitor the services in the network. This is the right time to
discuss the notion of service level management and service level agreements.
From a service perspective, here are significant parameters to monitor:
Service availability
Service delay
Packet loss
Delay variation (jitter)
Mean Opinion Score (MOS) in the case of voice
Key Performance Indicators (KPI)
Key Quality Indicators (KQI)
A good reference for service parameters such as KPI and KQI is the "SLA Management Handbook GB917" from the TeleManagement
Forum (TMF).
Service availability measurements require explicit measurement devices or applications, because a clear distinction between server and
service is necessary. Imagine a (physical) server operating without any problems, while the (logical) service running on the server has
terminated without a notification. Client-based monitoring applications can generate application-specific requests (for example, SAP
transactions) to identify if the service is operational and what the response time is.
We briefly discussed this issue in the "System and Server Performance Monitoring" section. We proposed using the Cisco NAM card in connection with the Application Response Time (ART) MIB, or Cisco IP SLA. IP SLA supports application-specific probe operations, such as DNS/DHCP
request or web server response time. In a voice over IP (VoIP) environment, IP SLA measures the delay variation (also known as jitter),
which is a very important parameter to identify voice quality. Furthermore, it measures the MOS that is essential in Internet telephony,
because it provides a numeric measure of the quality of human speech at the destination end of the circuit.
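The delay-variation calculation itself is simple enough to sketch. The samples below are invented, and the jitter is computed as the mean absolute difference between consecutive delay samples, a common simplification of the RFC 3550 interarrival-jitter estimator; deriving an actual MOS additionally requires a codec-specific impairment model (as IP SLA applies for its VoIP operations).

```python
# Illustrative delay-variation (jitter) calculation over one-way delay samples.
# The sample values are invented; a real probe such as IP SLA collects them per packet.

delays_ms = [23.1, 24.0, 22.8, 25.6, 23.3, 23.9, 31.2, 23.5]  # invented samples

def mean_jitter(delays: list[float]) -> float:
    """Mean absolute difference between consecutive delay samples (simplified estimator)."""
    diffs = [abs(b - a) for a, b in zip(delays, delays[1:])]
    return sum(diffs) / len(diffs)

print(f"average delay: {sum(delays_ms) / len(delays_ms):.1f} ms")
print(f"jitter:        {mean_jitter(delays_ms):.1f} ms")
```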
Because Chapter 3 describes the notion of a service, at this point we address this topic only briefly:
Service A generic definition by Merriam-Webster declares: "A facility supplying some public demand...." More specifically, related
to IT, we define a service as a function providing network connectivity or network functionality, such as the Network File System,
Network Information Service (NIS), Domain Name Server (DNS), DHCP, FTP, news, finger, NTP, and so on.
Service level The definition of a certain level of quality (related to specific metrics) in the network with the objective of making the
network more predictable and reliable.
Service level agreement (SLA) A contract between the service provider and the customer that describes the guaranteed
performance level of the network or service. Another way of expressing it is "An SLA is the formalization of the quality of the service
in a contract between the Customer and the Service Provider."
Service level management The continuously running cycle of measuring traffic metrics, comparing those metrics to stated goals
(such as for performance), and ensuring that the service level meets or exceeds the agreed-upon service levels.
Table 1-8 provides some generic SLA examples.
Table 1-8. Generic SLAs
Class: Premium
SLAs: Availability 99.98/99.998 percent; Latency 50 ms maximum; Packet delivery 100 percent; Jitter 2 ms maximum
Applications: Broadcast video, Traditional voice

Class: Optimized
SLAs: Availability 99.98/99.998 percent; Latency 50 ms maximum; Packet delivery 100 percent; Jitter 10 ms maximum
Applications: Compressed video, Voice over IP, Mixed application, Virtual private network

Class: Best effort
SLAs: Availability 99.98 percent; Latency 50 ms maximum; Packet delivery 99.95 percent
Applications: Internet data
Baselining
Baselining is the process of studying the network, collecting relevant information, storing it, and making the results available for later
analysis. A general baseline includes all areas of the network, such as a connectivity diagram, inventory details, device configurations,
software versions, device utilization, link bandwidth, and so on. The baselining task should be done on a regular basis, because it can be
of great assistance in troubleshooting situations as well as providing supporting analysis for network planning and enhancements. It is
also used as the starting point for threshold definitions, which can help identify current network problems and predict future bottlenecks.
In summary, the objective of baselining is to create a knowledge base of the network, and to keep it up to date!
Baselining tasks include the following:
Gather device inventory information (physical as well as logical). This can be collected via SNMP or directly from the command-line interface (CLI), for example with show version, show module, show run, show config all, and others (a collection sketch follows this list).
Gather statistics (device-, network-, and service-related) at regular intervals.
Document the physical and logical network, and create network maps.
Identify the protocols on your network, including
- Ethernet, Token Ring, ATM
- Routing (RIP, OSPF, EIGRP, BGP, and so on)
- Legacy voice encapsulated in IP (VoIP)
- IP telephony
- QoS (RSVP)
- Multicast
- MPLS/VPN
- Frame Relay
- DLSw
Identify the applications on your network, including
- Web servers
- Mainframe-based applications (IBM SNA)
- Peer-to-peer applications (Kazaa, Morpheus, Grokster, Gnutella, Skype and so on)
- Backup programs
- Instant messaging
Monitor statistics over time, and study traffic flows.
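One way to automate the CLI part of this inventory collection is sketched below. It assumes the third-party netmiko library and uses placeholder addresses, credentials, and commands; an SNMP-based collector would follow the same pattern with an SNMP library instead.

```python
# Sketch: collect CLI-based inventory for a baseline, assuming the netmiko library.
# Device addresses and credentials are placeholders.
from datetime import datetime
from pathlib import Path

from netmiko import ConnectHandler

DEVICES = ["10.0.0.1", "10.0.0.2"]                           # placeholder addresses
COMMANDS = ["show version", "show module", "show running-config"]

def collect_inventory(host: str, outdir: Path) -> None:
    conn = ConnectHandler(device_type="cisco_ios", host=host,
                          username="admin", password="secret")  # placeholders
    try:
        stamp = datetime.now().strftime("%Y%m%d-%H%M")
        for command in COMMANDS:
            output = conn.send_command(command)
            # One file per device, command, and timestamp keeps the baseline history intact.
            (outdir / f"{host}_{command.replace(' ', '_')}_{stamp}.txt").write_text(output)
    finally:
        conn.disconnect()

if __name__ == "__main__":
    outdir = Path("baseline")
    outdir.mkdir(exist_ok=True)
    for device in DEVICES:
        collect_inventory(device, outdir)
```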
From a performance baselining perspective, we are primarily interested in performance-related subtasks:
Collect network device-specific details:
- CPU utilization
- Memory details (free system memory, amount of flash memory, RAM, etc.)
- Link utilization (ingress and egress traffic)
- Traffic per class of service
- Dropped packets
- Erroneous packets
Gather server- and (optionally) client-related details:
- CPU utilization
- Memory (main memory, virtual memory)
- Disk space
- Operating system process status
- Service and application process status
Gather service-related information:
- Round-trip time
- Packet loss
- Delay variation (jitter)
- MOS (if applicable)
The collected baseline details are usually stored in a database so that relevant reports can be generated later. The next step is to define
reporting requirements. Which level of detail do you need? Which level of granularity is required? These questions can be answered by
looking at the specific types of applications that generated the traffic for the baseline. For example, you need a finer level of granularity
for troubleshooting than for trending. If capacity planning includes QoS, the relevant QoS parameters need to be collected, which might
not be required if the data is collected for computing the amount to charge back per department. Based on the demands of the
particular use case, you can define polling intervals and the granularity of the data collection. Five-minute intervals are in most cases
sufficient for baselining, so start by polling devices every 5 minutes. In a large network, this can create a nontrivial amount of overhead
traffic. You can avoid this by creating different polling groups (that is, poll core devices every 5 minutes, distribution level devices every
10 minutes, and access devices every 15 minutes, for example).
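The staggered polling groups can be expressed as a small scheduling loop; the device names are invented and poll_device is a hypothetical placeholder for the actual SNMP collection:

```python
import time

# Polling groups with different intervals (seconds), as suggested above.
GROUPS = {"core": 300, "distribution": 600, "access": 900}
DEVICES = {"core-rtr-1": "core", "dist-sw-1": "distribution", "acc-sw-17": "access"}  # invented

def poll_device(name: str) -> None:
    # Hypothetical placeholder: a real implementation would issue SNMP GETs here.
    print(f"polling {name}")

def run(duration: int = 3600, tick: int = 10) -> None:
    next_due = {name: 0.0 for name in DEVICES}
    start = time.monotonic()
    while time.monotonic() - start < duration:
        now = time.monotonic()
        for name, group in DEVICES.items():
            if now >= next_due[name]:
                poll_device(name)
                next_due[name] = now + GROUPS[group]
        time.sleep(tick)
```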
Over time, you will realize that the amount of collected data becomes huge, so you want to aggregate older data. This illustrates the compromise between data granularity and storage capacity. For example, you could combine the original 5-minute interval collection into a 30- or 60-minute interval. The informational RFC 1857 proposes guidelines for the aggregation intervals (a small aggregation sketch follows the list):
Over a 24-hour period, aggregate data to 15-minute intervals. Aggregating three 5-minute raw data samples into one 15-minute interval reduces the data to 33 percent of its original volume.
Over a 1-month period, aggregate data to 1-hour intervals. Aggregating four 15-minute data sets into a 1-hour period reduces the data to 25 percent of the 15-minute volume.
Over a 1-year period, aggregate data to 1-day intervals. Aggregating 24 1-hour data sets into one reduces the data to 4.2 percent of the hourly volume.
Comparing the 5-minute raw data collection with the 1-year aggregation, a reduction by a factor of 3 * 4 * 24 = 288 takes place; the daily values amount to only 0.35 percent of the raw samples.
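The aggregation step itself is easy to sketch: the code below averages one day of invented 5-minute samples into 15-minute, hourly, and daily buckets and prints how much of the raw volume remains.

```python
from collections import defaultdict

def aggregate(samples: list[tuple[int, float]], bucket_seconds: int) -> list[tuple[int, float]]:
    """Average (timestamp, value) samples into buckets of bucket_seconds."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]

# Invented raw data: one day of 5-minute utilization samples.
raw = [(ts, 40.0 + (ts // 300) % 25) for ts in range(0, 24 * 3600, 300)]

for label, seconds in [("15-minute", 900), ("1-hour", 3600), ("1-day", 86400)]:
    agg = aggregate(raw, seconds)
    print(f"{label:>9}: {len(raw)} samples -> {len(agg)} "
          f"({100 * len(agg) / len(raw):.2f} percent of the raw volume)")
```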
So far, you have collected performance statistics from the elements in the network and stored them in an archive or database. The next
section shows that baselining is a foundation for effective fault management.
Fault Management
In addition to the close linkage between performance and accounting, we also recognize a strong association between performance and
fault management. Figure 1-5 illustrated this, and we want to elaborate on its concepts. Remember that the objective of performance
monitoring is collecting statistics from devices, networks, and services, and displaying them in various forms to the user. Performance
management extends this approach and includes active modifications of the network to reconstitute the expected performance level.
Note that an additional step should occur between the recognition of a deviation and the remedy: notifying the fault application of the
abnormality. One could argue that a deviation is not a fault at all, but it is certainly an indicator of some form of abnormal behavior that
should be examined further. This is the reason for sending a notification toward the fault application. At a high level, we can distinguish
between two different fault events:
State change (device, network, or service failure; outage; or restart)
Performance deviation
Notifications about state changes are sent by devices proactively. As a result, the state changes at the management application (for example, on the network map or in the event list). A state change from operational to nonoperational usually indicates an outage, while the opposite indicates either a recovery from a failure or the activation of a backup procedure. Therefore, state-change events require as much attention as performance-deviation events. An example is the activation of an ISDN backup link when the primary DSL connection fails. Assuming the DSL connection is billed at a flat rate, the ISDN link is probably charged per time interval and can result in a drastically increased monthly invoice. If the invoice at the end of the month is the only indication of this situation, the fault management system is poorly designed.
Performance deviation events are much more closely linked to performance management than fault management. The challenge is to
identify a deviation from "normal," because you need some history indicators to define what "normal" is for a specific network at a
specific time. To achieve this, baselining is required, as explained in the preceding section. If the current measurement values exceed a
predefined threshold above or below the expected value, a performance event is generated.
We will now analyze the performance baseline to understand the traffic flows in the network and to define appropriate thresholds for
traffic or application guarantees. Thresholding is the process of specifying triggers on traffic patterns or situations and generating events
when such situations occur.
We define two classes of thresholds:
Discrete thresholds Boolean objects with two values (yes/no, true/false, 0/1) that define the transition from one state to another.
Examples: Link up/down, interface operational/failed, or service available/unavailable. Boolean states are either true or false and can easily be correlated; for instance:
Symptoms: A specific service is unavailable and the physical link to the server that provides this service is down.
Action: Check the link before checking the service.
Continuous thresholds Apply to continuous data sets and can optionally include time. In this case we need to define an absolute
or relative threshold and generate an event whenever this value is exceeded.
Example: The number of erroneous packets as a percentage of total traffic.
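The correlation of discrete states from the symptom/action example above can be written directly as Boolean logic; the status values here are invented:

```python
# Discrete thresholds are Boolean states that can be correlated directly.
# Invented status values; a real system would take them from polling or traps.
link_up = False            # physical link to the server
service_available = False  # high-level service check (e.g., DNS query succeeded)

if not service_available:
    if not link_up:
        print("event: service unavailable, root-cause candidate: link down (check the link first)")
    else:
        print("event: service unavailable although the link is up (check the service)")
```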
The thresholding technique can be enhanced by adding a hysteresis function to reduce the volume of generated events. In this case, only one event is sent when the value rises above the upper threshold, and no further events are sent until the value falls to the lower threshold, at which point a single clearing event is sent. This reduces the volume of events drastically without reducing the level of relevant information.
Figure 1-24 shows a response-time hysteresis function with a rising threshold of 100 ms and a falling threshold of 50 ms. In this
example, a response time between 50 and 100 ms is considered normal, and a response time above 100 ms is critical and generates an
alert. After the alert has occurred, the state remains critical until the response time drops to 50 ms. Alternatively, you could set both the
upper and lower threshold to 100 ms to get immediate notification if the response time drops below 100 ms; however, this would
remove the hysteresis function.
Figure 1-24. Defining Reaction Conditions
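A hysteresis threshold such as the one in Figure 1-24 behaves like a small state machine. The following sketch uses the 100 ms rising and 50 ms falling thresholds from the figure with invented response-time samples; only one alert and one clearing event are generated, although several samples exceed 100 ms.

```python
RISING_MS = 100.0   # alert when the response time exceeds this value
FALLING_MS = 50.0   # clear the alert only after dropping below this value

def hysteresis_events(samples_ms: list[float]) -> list[str]:
    events, critical = [], False
    for value in samples_ms:
        if not critical and value > RISING_MS:
            critical = True
            events.append(f"ALERT: response time {value:.0f} ms exceeded {RISING_MS:.0f} ms")
        elif critical and value < FALLING_MS:
            critical = False
            events.append(f"CLEAR: response time {value:.0f} ms dropped below {FALLING_MS:.0f} ms")
    return events

# Invented samples: several values above 100 ms, but only one alert/clear pair results.
print(hysteresis_events([60, 80, 120, 140, 110, 90, 70, 40, 55]))
```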
Statistical analysis can be visualized in a plot chart to identify normal and abnormal behavior. It is a good practice to start with a "soft"
or lower threshold instead of too tight a value to avoid alert storms. Thresholds can then be tightened successively over time to closely match the normal behavior and identify abnormalities.
You can define thresholds at the NMS by polling the device performance data and checking it against the thresholds.
Alternatively, you can set thresholds directly at the device level and notify the NMS application proactively if a threshold has been
exceeded. The first approach is the classic solution in which NMS frameworks such as HP OpenView, IBM Tivoli, CA Unicenter, and so on
discover and poll all devices in the network. The second approach requires more intelligence at the device level and has additional
resource requirements (memory and CPU). But it can help reduce network management traffic on the network, because only status
polling and event notification between the NMS server and the device are required. The RMON-MIB, Event-MIB, and Cisco IP SLA
(CISCO-RTTMON-MIB) can provide this functionality.
Which thresholds are relevant for your network? Almost every network administrator is searching for generic answers, only to discover
that he or she has to find an individual answer to this question. Very few generic thresholds can be applied across all networks; thus, in
most cases, defining thresholds is a task for the operator. For example, Cisco suggests that the average CPU load should not exceed 60
percent so that enough headroom is available for sudden events such as routing protocol recalculations. However, you could argue that because you paid for the full CPU capacity, you should be able to use much more of it, and therefore define a threshold of 95 percent. Best practice
suggests taking a more conservative approach to increase network availability.
Generic threshold recommendations for Cisco routers and switches are as follows (a simple check against these values is sketched after the list):
Total CPU utilization over 5 minutes: 60 percent (CISCO-PROCESS-MIB)
Free memory pool: 25 percent (CISCO-MEMORY-POOL-MIB)
Round-trip time between two devices: 150 ms (CISCO-RTTMON-MIB)
Jitter between two devices: 10 ms (CISCO-RTTMON-MIB)
DNS response time: 2 sec (CISCO-RTTMON-MIB)
DHCP request time: 10 sec (CISCO-RTTMON-MIB)
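The check sketched below expresses the recommendations above as a single table and compares it against one set of invented polled values; the actual MIB polling is left out.

```python
# Generic threshold recommendations from the list above, expressed as one table.
# The polled sample values are invented; real values would come from the cited MIBs.
THRESHOLDS = {
    "cpu_5min_percent":    ("max", 60),    # CISCO-PROCESS-MIB
    "free_memory_percent": ("min", 25),    # CISCO-MEMORY-POOL-MIB
    "rtt_ms":              ("max", 150),   # CISCO-RTTMON-MIB
    "jitter_ms":           ("max", 10),    # CISCO-RTTMON-MIB
    "dns_response_s":      ("max", 2),     # CISCO-RTTMON-MIB
    "dhcp_request_s":      ("max", 10),    # CISCO-RTTMON-MIB
}

polled = {"cpu_5min_percent": 72, "free_memory_percent": 31, "rtt_ms": 95,
          "jitter_ms": 14, "dns_response_s": 0.4, "dhcp_request_s": 3}

for metric, (kind, limit) in THRESHOLDS.items():
    value = polled[metric]
    violated = value > limit if kind == "max" else value < limit
    if violated:
        print(f"threshold exceeded: {metric} = {value} (limit {kind} {limit})")
```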
With a performance baseline in place and threshold definitions applied, we can introduce a more sophisticated feature called "deviation from normal." This function adds "intelligence" to the performance baseline collection by defining and analyzing network performance metrics over a timeline. For instance, if you identify link utilization above 90 percent as a critical situation, you can also ask at what time this is a particular problem. On a Monday morning, when every user downloads e-mail, updates virus definitions, or performs a backup, it can be acceptable to utilize the network almost completely, especially from an economic perspective. If the same situation occurs on
