
IBM WebSphere Portal software family Your world. Your way.

IBM WebSphere Portal Performance Troubleshooting Guide


Martin Presler-Marshall
IBM WPLC Performance Team
Document version 2.0 December 2008

Contents
About This Document ......... 7
1 Web Performance Basics ......... 8
1.1 What is a performance problem? ......... 8
1.2 Performance limits and bottlenecks ......... 8
1.3 The cyclical nature of performance analysis ......... 9
1.4 The importance of tools ......... 9
1.5 The importance of planning ......... 9
2 Classifying Performance Problems ......... 11
3 Collecting Information About the Problem ......... 12
3.1 What is the environment with the performance problem? ......... 12
3.2 Has the environment been configured for high performance? ......... 12
3.3 Is the problem affecting a specific set of users? ......... 12
3.4 How reproducible is the problem? ......... 13
3.5 What actions are affected? ......... 13
3.6 How is the problem observed? ......... 14
3.7 Has the problem always been present? ......... 14
3.8 What other symptoms are being seen? ......... 15
4 Resolving Common Performance Problems ......... 16
4.1 Poor response times under light load ......... 17
4.1.1 Poor response times at light load for most or all users ......... 17
4.1.1.1 Slow page retrieval at low load ......... 17
4.1.1.2 Slow page rendering ......... 18
4.1.2 Poor response times at light load for certain users ......... 19
4.2 Inability to utilize full system capacity ......... 20
4.2.1 Poor response times for specific operations ......... 20
4.2.1.1 Poor login response time ......... 21
4.2.1.2 Poor response time for certain pages ......... 21
4.2.2 Poor response times for most or all operations ......... 22
4.3 Unable to achieve required capacity ......... 25
4.3.1 Capacity limits: impact of portlets ......... 25
4.3.2 Capacity limits: impact of themes and skins ......... 26
4.3.3 Capacity limits: number of portlets per page ......... 26
4.3.4 Capacity limits: WebSphere Portal tuning ......... 27
4.4 Performance dropping under constant load ......... 29
4.5 Problems specific to load tests ......... 31
4.5.1 Problems related to automated load generation ......... 31
4.5.2 Identifying common errors in load measurements ......... 33
5 Detailed Instructions ......... 36
5.1 Network issues ......... 36
5.1.1 Verifying network connections ......... 36
5.1.2 Checking network tuning ......... 38
5.2 HTTP server performance ......... 38
5.2.1 HTTP server connection limits ......... 38
5.2.2 Monitoring HTTP Servers ......... 40
5.2.3 HTTP server processor consumption ......... 42
5.3 Database performance ......... 42
5.3.1 Database connection pool monitoring and tuning ......... 42
5.3.2 Monitoring database performance ......... 43
5.3.3 Common database performance issues ......... 45
5.3.4 Database maintenance ......... 46
5.4 Memory analysis ......... 48
5.4.1 Understanding verbose garbage collection output ......... 48
5.4.2 Detecting and resolving memory leaks ......... 58
5.5 Page loading and composition ......... 61
5.5.1 Breaking down page load times ......... 61
5.5.2 Reducing the number of elements in a page ......... 65
5.6 Application server performance analysis ......... 67
5.6.1 Performance monitoring and profiling tools ......... 67
5.6.2 Using Javacores ......... 68
5.6.3 System monitoring ......... 73
5.6.4 Using Tivoli Performance Viewer ......... 73
5.6.5 Monitoring caches in WebSphere Portal ......... 80
5.7 Load generation ......... 81
5.7.1 Excessive load on load generators ......... 81
5.7.2 Assigning user IDs to virtual users ......... 82
5.8 The server environment ......... 83
5.8.1 Locating Logs ......... 83
5.8.2 IBM Support Assistant ......... 83
6 Benchmarking Information ......... 85
6.1 Methodology ......... 85
6.1.1 Overview ......... 85
6.1.2 Measurement scenarios and practices ......... 86
6.1.3 Collecting data from measurements ......... 87
6.2 Drawing conclusions from performance measurements ......... 89
6.3 Benchmarking tools ......... 90
6.3.1 Automated load generation tools ......... 90
6.3.2 Manual load generation ......... 91
7 Conclusion ......... 92
Feedback? ......... 92
Appendix A: Credits ......... 93
Appendix B: References and Further Reading ......... 94
Appendix C: HTTP Status Codes ......... 96

Figures
Figure 1: Page complexity and capacity ......... 27
Figure 2: Throughput graph showing declining throughput ......... 29
Figure 3: Example transaction mix ......... 34
Figure 4: Sample Deployment ......... 36
Figure 5: Apache server status display ......... 40
Figure 6: Sample verbose garbage collection log entry (Java 1.4.2) ......... 49
Figure 7: Sample verbose garbage collection log entry (Java 5.0, nursery) ......... 51
Figure 8: Sample verbose garbage collection log entry (Java 5.0, tenured region) ......... 53
Figure 9: GC and Memory Visualizer: healthy heap ......... 56
Figure 10: GC and Memory Visualizer: overly full heap ......... 57
Figure 11: GC and Memory Visualizer: table view ......... 58
Figure 12: Sample verbose garbage collection log entry with compaction (Java 1.4.2) ......... 59
Figure 13: IBM Page Detailer: Chart view ......... 61
Figure 14: IBM Page Detailer: Details view ......... 62
Figure 15: IBM Page Detailer: adding a column ......... 64
Figure 16: IBM Page Detailer: content encoding ......... 64
Figure 17: IBM Page Detailer: content encoding ......... 65
Figure 18: Javacore: lock detail ......... 69
Figure 19: Thread and Monitor Dump Analyzer ......... 70
Figure 20: ThreadAnalyzer: select servlet thread pool ......... 71
Figure 21: ThreadAnalyzer: select hotspot ......... 72
Figure 22: ThreadAnalyzer: view thread call stack ......... 72
Figure 23: Integrated Solutions Console: login page ......... 74
Figure 24: Integrated Solutions Console: enabling PMI ......... 75
Figure 25: Tivoli Performance Viewer: average response times ......... 76
Figure 26: Tivoli Performance Viewer: set custom monitoring level ......... 78
Figure 27: Tivoli Performance Viewer: select session metrics ......... 79
Figure 28: Tivoli Performance Viewer: select session metrics ......... 79
Figure 29: Tivoli Performance Viewer: session metrics by webapp ......... 80
Figure 30: Response times with an overloaded generator ......... 82
Figure 31: Good and bad workload plans ......... 87

Performance Troubleshooting Guide

About This Document


This document is intended as a guide for resolving performance problems you might be encountering with WebSphere Portal version 5.1 or later. In this guide, you will find:

- Information on how to isolate the nature of the problem you are encountering
- Useful information to help you diagnose the problem
- Descriptions of common performance problems, so you can identify the specific problem you are facing, and
- Descriptions of how to resolve those problems.

This guide is intended for a technical audience: individuals who are comfortable working with WebSphere Portal and its associated software. It will not generally be useful to end users who are accessing Web sites built with WebSphere Portal.

This guide is not intended as the definitive resource on WebSphere Portal or Java tuning, performance, or troubleshooting; rather, it provides guidelines and suggestions that will aid most WebSphere Portal administrators who encounter the performance problems described here. It should be used in conjunction with the product support documentation and the other references listed in Appendix B.

This document contains screenshots and sample log output from WebSphere Portal 6.1 running on WebSphere Application Server 6.1.0.15. Using a different version of WebSphere Portal? That's not a problem. While the exact steps for some actions vary from version to version, the basic methodology described here applies to all versions of WebSphere Portal, and to products built on top of WebSphere Portal, such as Lotus Quickr.

Page 7 of 97


1 Web Performance Basics


1.1 What is a performance problem?
The term "performance problem" is very broad and can span many observed symptoms. Before discussing the types of performance problems that are frequently observed, let's first discuss the data points that are typically monitored for performance. The two main data points are throughput and response time:

- Throughput is defined as units of work per unit of time. In a Web environment the unit of work is typically the page view, so a common way to measure throughput is in page views per second.
- Response time is the time required to complete a unit of work; in a Web environment, response time is the time required to fetch a page. Typically response times are monitored separately for different types of page views, such as the site's homepage, login, page navigation, and so on.

It is important to understand the meaning of these two measurements. Response time determines how the user will perceive the site: a low response time gives the feeling of a fast site. If the response time is too high, the user will consider the site slow and may become frustrated.

For response time, lower is generally better.

Throughput influences the capacity of the site. Each user of the site will request pages at a certain rate, so the site's throughput will determine the maximum number of active users the site can support.

For throughput, higher is generally better.
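A rough worked example may help illustrate the relationship between throughput and capacity. The figures below are purely hypothetical, not measured WebSphere Portal values: if a site can sustain 100 page views per second, and each active user requests a page every 10 seconds on average, the site can support roughly 1,000 active users.

```python
# Hypothetical illustration of the throughput/capacity relationship.
# Neither number is a measured value; substitute your own figures.
peak_throughput = 100.0  # page views per second the site can sustain
think_time = 10.0        # average seconds between page requests per user

# Each active user generates (1 / think_time) page views per second,
# so the supportable number of concurrent active users is:
max_active_users = peak_throughput * think_time
print(int(max_active_users))  # 1000
```

The same arithmetic can be run in reverse during capacity planning: divide the expected active-user population by the average think time to estimate the throughput the site must sustain.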

1.2 Performance limits and bottlenecks


All systems are built from multiple components, and typically one component will limit the overall performance of the system; this is known as the limiting factor. For example, a car has a maximum safe speed which is dictated by its engine power, transmission, aerodynamics, tires, suspension, driver skill, and so on. One of these components will become the limiting factor for the entire system. So, if the tires are limiting the safe speed of the car, increasing the engine power will not let the car achieve a higher safe speed. Likewise, a WebSphere Portal system will have multiple factors which can limit the throughput and response times it can provide. It is important to know what you expect the limiting factor to be. If an unexpected factor is limiting the performance, this unexpected factor can be considered a bottleneck. Removing the bottleneck will improve the overall performance of the system.


In most cases, the processor load at the node(s) running WebSphere Portal will be the limiting factor in the system. In some cases, the limiting factor could be expected to be elsewhere. For example, for a portal site with many database-intensive portlets, you might expect the database server accessed by those portlets to be the limiting factor. In either case, however, if the limiting factor is found to be elsewhere, such as the HTTP server, then that becomes a bottleneck. Eliminating the bottleneck will allow the system to perform better.

1.3 The cyclical nature of performance analysis


Consider the car analogy mentioned above: each component may be the limiting factor for the maximum safe speed of the car. Imagine a car whose tires are the limiting factor, as they are badly worn. Replacing them with new, high-performance tires will mean the tires are no longer the limiting factor, and the maximum safe speed will probably be higher. But now some other component in the system will become the limiting factor. This is also true when trying to improve the performance of a WebSphere Portal site. Eliminating one bottleneck may improve the performance of the site, but another bottleneck is often waiting behind the first. Therefore plan on multiple analysis iterations:

1. Determine the current system capacity.
2. Find the factor which currently limits system capacity.
3. Resolve that capacity limit.
4. Repeat these steps until the goals are reached.

1.4 The importance of tools


One of us took a woodworking class where the students started with a rough block of wood and produced a wooden peppermill. One could summarize the class as "take a block of wood and cut away parts until it looks like a peppermill." Of course, this misses the main point of the class: learning which tools to use, and how to use them, to achieve that goal. Likewise, one could describe resolving performance problems as "find out where the problem is and remove it." Just as with crafting the peppermill, it is important to have tools to help with this task. It is extremely difficult to resolve performance problems on a Web site with no tools beyond a Web browser. This document will describe some tools which we have found useful, although many other helpful tools exist.

1.5 The importance of planning


It is much easier to diagnose and resolve performance problems in a test environment than in a production environment. A test environment can be much more controlled

than a production environment, which simplifies the diagnosis process. Users of the site will also be much happier if any performance problems are resolved before the site goes into production. Therefore it is important to include time for load testing and problem resolution in project schedules; otherwise you might find that you have no choice except to try to resolve performance problems in your production environment.


2 Classifying Performance Problems


Since different types of problems generally have different causes, it is important to identify what kind of performance problem you are dealing with. Most performance problems fall into one of these categories:

- Poor response time under light load. Generally, the response time of a portal site should be very good when a single user or a small number of users is accessing it. This category of performance issue is seen when response time is poor under light load.
- Good response time under light load, but unable to use full system capacity. In a well-balanced system, it should be possible to drive the CPU load to 80% or higher on the system(s) running WebSphere Portal before response times become unacceptable. Environments which cannot reach this level of CPU utilization on the WebSphere Portal nodes fall into this category.
- Using full capacity, but still unable to achieve needed throughput. As the load on a system increases, a performance bottleneck will eventually arise somewhere. This category of problem occurs when the CPU load on the system running WebSphere Portal is high, but you are unable to reach your throughput goals.
- Unstable throughput. Under constant load, the performance of the system should remain roughly constant. Some variation over time is normal, but performance should not decline steadily. A steady decline could manifest itself as falling throughput or rising response times; either way, it falls into this category.
- Measurement-introduced problems. Performance benchmarking or load testing can itself introduce problems, which are grouped into this category. These problems do not usually occur when the product is in use by actual users, but they will give a poor impression of the system's performance.

To help you classify the type of performance problem you are seeing, it is useful to gather some information about the problem being experienced.
Section 3 will help you gather this information. Section 4 will then lead you through a detailed classification of the observed problem to specific solutions, and Section 5 will explain how to implement those solutions.


3 Collecting Information About the Problem


In one sense, performance problems on Web sites are simple: if a user accesses the site, and the page takes a long time to be displayed on their screen, then it is a performance problem. While this broad description may suffice for determining user satisfaction with the site, it is not enough information to resolve a performance problem. Resolving a performance problem involves isolating the cause of the problem. An important step in isolating a performance problem is to describe the problem in some more detail. With this detail, it is then possible to classify the type of problem, which is the next step toward finding solutions. The following are questions to consider about the type of performance problem.

3.1 What is the environment with the performance problem?


Identify the environment where the performance problem is observed. Common examples include the following:

- Test lab: typically a dedicated system used for testing the site. Load is generated by testers and/or test software.
- Production site, local area network: the problem is being seen on a production site, but the users accessing the site are connected through a local area network; no wide-area network links are involved in accessing the site.
- Production site, wide-area network: the users accessing the site are remote from the server(s) running the site.

3.2 Has the environment been configured for high performance?


The default settings provided when WebSphere Portal is installed are appropriate for evaluating the product and portlet development. In order to get good performance from the system, it is necessary to tune various parts of the system. A discussion of tuning parameters which might be helpful can be found in the WebSphere Portal Tuning Guide. A link to this guide is in Appendix B, References. This guide covers the tuning used by the performance team for several measurement scenarios. Different scenarios will stress different parts of the application server, so different tuning was used in different scenarios. It's also important that the entire environment be set up for the load it's expected to handle. Tuning includes not just the application server, but other servers in the environment, as well as the underlying operating systems and the network.

3.3 Is the problem affecting a specific set of users?


Some performance problems affect all or most users of the site; other performance problems are more selective about who is affected. In cases where only a subset of users is affected, attempt to identify a common factor among those users:


- Users with limited bandwidth: some problems are seen only by users with limited network bandwidth, such as dialup Internet access, or by users accessing the site through a virtual private network (VPN) connection.
- Users with specific types of devices: the type of system the user has might affect their performance. For example, users running an older Web browser or users with slower computers might suffer performance issues that other users do not see.
- Users on mobile devices: mobile computing devices have specialized requirements, due to their limited memory, processing power, and network bandwidth. Thus they are likely to see performance problems which might not be seen by other users.
- Users at a specific site or location: in cases where users accessing the site over a wide-area network are seeing the performance problem, sometimes only users at certain locations will see it.
- Users of a specific application or applications: as WebSphere Portal aggregates multiple applications, a specific application may be performing slowly; this would cause users of those application(s) to see poor performance.

3.4 How reproducible is the problem?


It is easiest to resolve a problem which is easily reproducible. Try to identify when the problem occurs. The problem might:

- Always occur, or always occur at a certain load level.
- Usually occur at specific times of the day or days of the week.
- Have begun at a specific time. In this case, see section 3.7 for topics to consider.
- Show up on only one member of a cluster.
- Not exhibit any clear pattern.

3.5 What actions are affected?


Knowing which actions are affected by a performance problem can help identify the source of the problem. For example, if the problem is only seen on accesses to a page with a certain portlet, it is likely that portlet is contributing to the performance problem. Attempt to classify the affected actions as one of the following:

- Most or all page views
- Login only
- Navigating to a new page, while interacting with the same page shows better performance
- Specific page views other than login


- Happening only on pages showing WCM components
- Uniquely related to a virtual portal
- Occurring on pages with remote portlets

3.6 How is the problem observed?


Knowing how the problem is observed is important for validating whether the problem has been resolved. Therefore track how the problem has been observed, such as:

- Through a test tool
- Through user browsers
- Through server monitoring
- All of the above

3.7 Has the problem always been present?


A change in the environment can sometimes precipitate a performance problem. If the environment had previously been functioning within expectations, check for the following types of recent changes:

- Hardware changes: a hardware change could introduce a performance problem. Changes could include introducing a new network device, such as a router, IP sprayer, or switch, or replacing an existing device, such as replacing a failed hard drive.
- Software changes: a typical WebSphere Portal site contains a large number of software components; a change to any of these could lead to a performance problem. This includes updates, fixes, or patches recently installed to the operating system, the portal, or any supporting components, as well as updates to a custom application or part of an application. Also remember client software: were the client devices' browsers updated?
- Configuration changes: has a configuration change been made on WebSphere Portal or one of the applications running on the portal? Was a new feature activated in an existing application? Was a change made to one of the systems used by the portal, such as a database or an LDAP server?
- User changes: particularly if the performance problem is affecting only a specific set of users, look to see if changes were made to those users or their groups. Were permissions changed? Were changes made to their personalized settings?


3.8 What other symptoms are being seen?


Performance problems may be caused by a failure in one of the components used by WebSphere Portal, or by an application being used. These are areas to check:

- Page output: are pages displaying properly? Are any portlets displaying errors? If so, these could be causing performance problems and should be investigated.
- Log messages: when problems are encountered, they may be logged. Check the logs used by WebSphere Portal (if you need help finding the logs, this is discussed in chapter 5). Also check the logs for other systems used by WebSphere Portal, such as database servers, directory servers, HTTP servers, Web proxy servers, or other systems. If these logs indicate problems, those problems may need to be resolved as part of resolving a performance problem. When checking the logs, also check whether there are debugging messages in the logs. For example, a portlet may have been deployed with debug logging accidentally enabled. This can impose a significant performance penalty.
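As a quick first pass at spotting accidentally enabled debug logging, a log file can be scanned for debug-level markers. The sketch below is a minimal illustration, not a supported tool: the level names (DEBUG, TRACE, and the java.util.logging FINE/FINER/FINEST levels) and the sample log lines are assumptions; adjust the marker list and feed it lines from your own log files.

```python
# Minimal sketch: count log lines written at debug-style levels.
# The marker names below are assumptions; extend them for your loggers.
import re
from collections import Counter

DEBUG_MARKERS = re.compile(r"\b(DEBUG|TRACE|FINE|FINER|FINEST)\b")

def count_debug_lines(lines):
    """Tally occurrences of debug-style level markers in log lines."""
    counts = Counter()
    for line in lines:
        match = DEBUG_MARKERS.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Example usage against a few synthetic (made-up) log lines:
sample = [
    "[12/1/08 10:00:00] WSVR0001I: Server started",
    "[12/1/08 10:00:01] com.example.Portlet FINEST entry",
    "[12/1/08 10:00:02] com.example.Portlet DEBUG cache miss",
]
print(count_debug_lines(sample))
```

A non-empty result on a production log is worth investigating: the logger configuration may have been left at a debug level after development.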


4 Resolving Common Performance Problems


This section will ask you to look at the symptoms being experienced and guide you to one or more areas to investigate for possible solutions to the performance problem. Keep in mind that what appears to be a single performance problem may have multiple root causes, so resolving performance problems is an iterative process: measure the current performance, take action to resolve a problem, measure the result, and repeat as needed.

Classify the type of problem being experienced according to the descriptions in section 2, Classifying Performance Problems, and then refer to the corresponding section:

- Poor response times under light load: refer to section 4.1.
- Good response time under light load, but unable to use full system capacity: refer to section 4.2.
- Using full capacity, but still unable to achieve needed throughput: refer to section 4.3.
- Performance drops or is unstable under constant load: refer to section 4.4.
- Exceptions, errors, or other problems seen in a performance measurement: refer to section 4.5.


4.1 Poor response times under light load


Does the problem affect most (or all) users of the site or only a portion of them? If the problem affects most or all users, refer to section 4.1.1. If the problem affects only a portion of the users, refer to section 4.1.2.

4.1.1 Poor response times at light load for most or all users
When users are seeing poor response times even when the site is lightly loaded, the explanation is either that the content is taking too long to get to their system, or that it is taking too long to be displayed once it arrives. Use the following method to determine the cause:

1. Visit a page that shows the poor response time.
2. Use the browser to save the page to the local hard disk.
3. Open the page that was just saved to the hard disk.

If the page loads quickly from the local disk, then the problem is in retrieving the content; refer to section 4.1.1.1. If the page loads slowly even from the local disk, then the page is taking the browser too long to render; in this case, refer to section 4.1.1.2.
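To put a number on the "retrieving the content" half of this test, the base HTML document can be fetched and timed outside the browser. The sketch below is a rough illustration only: the URL shown is a placeholder, and the measurement covers only the base HTML page, not its images, stylesheets, scripts, or the browser's rendering time.

```python
# Rough sketch: time retrieval of the base HTML document only.
# Embedded elements and rendering time are NOT included.
import time
import urllib.request

def time_page_fetch(url):
    """Return (seconds elapsed, bytes received) for a single GET of url."""
    start = time.monotonic()
    with urllib.request.urlopen(url) as response:
        body = response.read()
    return time.monotonic() - start, len(body)

# Example usage (placeholder URL -- substitute a page from your portal):
# elapsed, size = time_page_fetch("http://portal.example.com/wps/portal")
# print(f"Fetched {size} bytes in {elapsed:.2f} seconds")
```

Comparing this number against the response time seen in the browser gives a first split between server/network time and client-side rendering time.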

4.1.1.1 Slow page retrieval at low load


Slow page retrieval is usually caused by one of the following:

- Pages which contain too many elements (graphics, stylesheets, etc.)
- Elements which load slowly


Unfortunately, it is not easy to see what is happening just by watching the page load in a browser. This is discussed in more detail in section 5, Breaking down page load times. In some cases, the element which is loading slowly is the HTML page, not any of the other content going into the page. This is usually caused by a slow portlet, or a complex theme. This is discussed in more detail in section 4.2.1.2, Poor response time for certain pages.

4.1.1.2 Slow page rendering


Modern personal computers have fast processors which can render even complex HTML pages quickly. However, there are still cases of pages which cannot be rendered quickly:
- Pages using slow client-side code: many pages contain active content which executes at the client, such as JavaScript or Java applets. These impose a much larger burden on the client than pages with no active content. Java applets may load slowly on systems with slow disk drives and/or insufficient free memory. If users with such systems will access your site, consider structuring the site to make the use of Java applets optional, so that they can make a trade-off of performance versus functionality. While JavaScript tends to have a lower overhead than Java applets, dynamically inserting content into HTML pages through document.write() calls can cause the page to render very slowly. Whenever possible, dynamic content creation should be handled server-side, through the use of JavaServer Pages (JSPs). JavaScript must be parsed and interpreted by the client, so very large JavaScript files (thousands of lines) can take a long time to be processed even if they do not contain complex logic. AJAX and Dojo are two client-side technologies which require specific tools and techniques for troubleshooting. There could also be errors in the JavaScript which impact page rendering on the client.
- Poor markup: browsers may render invalid or excessively complex markup slowly. It can be difficult to manually validate a markup page, so it is helpful to have tools to assist with this task. One good tool for validating HTML or XHTML pages is the W3C's Markup Validation Service, which can be found on the Web at http://validator.w3.org/. There is no simple way to determine if a page is too complex; a manual inspection of the page is required. When doing so, inspect the markup as seen at the browser, not the JSP which generates the page.

Occasionally, bugs in JSPs or portlets will result in very large or very complex pages being sent to the browser, to the surprise of the people who wrote the application.

4.1.2 Poor response times at light load for certain users


When only certain users see a performance problem, it is important to identify what those users have in common. This can lead to isolating the reason behind the performance problem. Examples include:
- Users with limited-bandwidth connections: limited bandwidth can cause pages to load slowly, particularly if the page is large or references large components (such as images). For these users, the key is to reduce the size of the page. This can include:
  - Reducing the complexity of the theme and skins used on the pages.
  - Reducing the number of portlets on the page.
  - Reducing the size of images or other referenced content on the page.
- Users with high-latency connections: high-latency connections, such as VPN connections or trans-oceanic connections, can cause users to see poor response times for pages which contain a large number of components. For high-latency connections, the key question is the number of pieces of content which must be loaded to view the page. See Breaking down page load times in section 5 for guidance on how to identify how many components are required to view the page. Then see Reducing the number of elements in a page in section 5 for ideas on how to reduce this.
- Users at a specific site or sites: in this case, the explanation is likely to be the network link used by that site or sites, or a problem caused by some networking equipment at that site. To help identify where the problem might reside, consider the following:
  - If the sites use an HTTP proxy, an improperly configured or overloaded HTTP proxy could be the cause.
  - Try accessing other Web sites while the problem is being observed; if they also show a problem, then the problem lies in getting data back from the site, not at the site itself.
  - Try transferring files (for example, using FTP or a shared filesystem). If file transfers are slow, then the network is contributing to the performance problem.
- Users of mobile devices: mobile computing devices, such as smartphones and handheld computers, present a special performance challenge. They often have low-bandwidth, high-latency network connections, so they can be very sensitive to page size and the number of elements on the page. Take care to minimize the total size of the page as well as the number of elements in the page. Their limited processor power also means that even moderate-complexity pages can take time to render. In addition, if the devices access the network through a gateway or protocol converter, verify that it is not a performance bottleneck.
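For high-latency and mobile users, counting the components a page references can be scripted rather than done by hand. The following is a rough sketch using Python's standard-library HTML parser; the tag list and the sample markup are illustrative assumptions, not a complete inventory of everything a browser fetches:

```python
from html.parser import HTMLParser

class ElementCounter(HTMLParser):
    """Counts tags that typically trigger an extra HTTP request:
    images, external scripts, stylesheets, and (i)frames."""

    def __init__(self):
        super().__init__()
        self.requests = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "iframe", "frame", "embed") and "src" in attrs:
            self.requests += 1
        elif tag == "script" and "src" in attrs:
            self.requests += 1  # inline scripts cost no extra request
        elif tag == "link" and attrs.get("rel") == "stylesheet":
            self.requests += 1

# Illustrative markup; in practice, feed in the page saved from the browser.
sample = """
<html><head>
<link rel="stylesheet" href="theme.css">
<script src="portal.js"></script>
<script>var inline = 1;</script>
</head><body>
<img src="logo.gif"><img src="banner.png">
</body></html>
"""
counter = ElementCounter()
counter.feed(sample)
print("Page needs about", 1 + counter.requests, "HTTP requests")
```

Each counted element is one more network round trip, which is what hurts most on high-latency links.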

4.2 Inability to utilize full system capacity


Many performance problems fall into this category. These problems can have many different causes, but the observed symptom is the same: response times are high with moderate CPU load on the portal server(s). This means that some resource other than the portal CPU is the limiting factor in the overall system performance; resolving the performance problem means isolating what the limiting factor is, and then removing that limitation. In trying to isolate the limiting factor, it is often useful to identify the actions with poor response times:
- Specific operations have poor response times, while others still show good response times. For example, logging in to the site may be very slow, while interacting with the site after login performs well. In this case, refer to section 4.2.1.
- Most or all operations see poor performance, even if login shows the worst response time. Refer to section 4.2.2.

4.2.1 Poor response times for specific operations


Once you have identified the operation or operations which see poor response time, you can begin to identify the bottleneck. Based on where the poor response time occurs, choose from the following options:
- Login only, or login and page navigation: users have poor response times when logging in (viewing their first authenticated page), but good response times for other activities. Read section 4.2.1.1 for possible causes and solutions.


- Specific pages, other than login, are seeing poor response times: in this case it is likely that something about those pages is responsible for the poor response time. Read section 4.2.1.2 for areas to investigate.

4.2.1.1 Poor login response time


Logging on to WebSphere Portal is a complex activity, because the portal must interact with multiple systems for the login to complete. Use the following explanations to find the cause of the poor response time:
- Slow responses from the directory server: if using a directory server to authenticate users and provide group memberships, slow responses from the directory server can cause poor login response time. Actions to take include:
  - Monitor CPU and disk activity on the LDAP server, plus response times from the LDAP server. If CPU or disk activity on that server is high, investigate tuning options provided by the LDAP server.
  - Verify that the network connection between the portal server(s) and the LDAP server is not overloaded. For assistance on doing this, see Verifying network connections in section 5.
- Slow responses from the portal database(s): WebSphere Portal will typically make several requests to its database when a user logs in, so poor performance from this database can seriously impact login performance. To determine if this is affecting your site, see Monitoring database performance in section 5.
- Slow authentication proxy: sites which use an authentication proxy may find that the authentication proxy becomes a performance bottleneck. To determine if this is the case, try to access another Web site through the authentication proxy. If the other site also shows poor response time when logging in, then the authentication proxy is likely causing a performance problem. This guide does not offer any suggestions on how to remedy performance problems caused by an authentication proxy; consult the documentation for that product for additional information.
A set of TechNotes about login and authentication issues is available at: http://www1.ibm.com/support/docview.wss?rs=688&uid=swg21243740.
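As a first-pass check of the network path between the portal and its directory or database server, raw TCP connect latency is easy to sample. The sketch below uses an in-process listener only to keep the example self-contained; the real target host and port (for example, your LDAP server on port 389) are assumptions you would substitute:

```python
import socket
import time

def connect_latency(host, port, samples=5):
    """Median time to open (and immediately close) a TCP connection,
    in seconds, over several samples."""
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=5):
            pass
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]

# Demo target: a listener in this process. In practice, point this at
# the backend, e.g. connect_latency("ldap.example.com", 389).
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(16)
port = listener.getsockname()[1]

latency = connect_latency("127.0.0.1", port)
print("Median connect latency: %.1f ms" % (latency * 1000))
listener.close()
```

This only measures the TCP handshake, not LDAP bind or query time, but a high value here already implicates the network rather than the directory server itself.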

4.2.1.2 Poor response time for certain pages


Generally, a slow page can be traced to a slow portlet. Unfortunately, determining which portlet is showing a poor response time is often not obvious just from watching the page render. The best way to track this down is with a tool which can monitor the portlet response times.


One option for monitoring portlet response times is to use Tivoli Performance Viewer, which is discussed in Using Tivoli Performance Viewer in section 5. Other monitoring tools are also available; see Performance monitoring and profiling tools for details. Once a poor-performing portlet has been located, the next question is why it is performing poorly. These are the most common reasons for a portlet to be slow:
1. The portlet is using a remote system, such as a database server, which is performing slowly. This may be a single request which takes too long to process (such as a complex database query), or it may be a large volume of requests (such as hundreds of simpler database queries). Monitoring database performance in section 5 discusses ways to analyze performance of database operations.
2. The portlet is doing a lot of work on the application server. Complex work within the portlet is easiest to debug using a code profiler; see Performance monitoring and profiling tools in section 5 for details.
One way to determine the underlying cause of a slow portlet is to capture Javacores. Look for multiple threads executing code within that portlet, and see whether most of them are executing code which runs on the WebSphere Portal application server, or whether they are waiting for a remote system. Javacores are discussed in detail in section 5, in the section titled Using Javacores.
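The Javacore inspection described above can be partly scripted: split the dump into per-thread sections and count how many stacks contain frames from the suspect portlet's package. The sketch below assumes thread sections begin with the 3XMTHREADINFO tag; the trimmed javacore excerpt is fabricated for illustration only:

```python
import re

def threads_in_package(javacore_text, package_prefix):
    """Split a javacore into thread sections and count those whose
    stack contains a frame from the given package."""
    # IBM javacores start each thread section with a 3XMTHREADINFO line.
    sections = re.split(r"(?=3XMTHREADINFO )", javacore_text)
    return sum(1 for section in sections if package_prefix in section)

# Fabricated, heavily trimmed javacore excerpt for illustration only.
javacore = """\
3XMTHREADINFO "WebContainer : 0" ...
4XESTACKTRACE at com/example/newsportlet/Feed.fetch(Feed.java:42)
4XESTACKTRACE at java/net/SocketInputStream.read(...)
3XMTHREADINFO "WebContainer : 1" ...
4XESTACKTRACE at com/example/newsportlet/Render.doView(Render.java:10)
3XMTHREADINFO "WebContainer : 2" ...
4XESTACKTRACE at com/ibm/wps/engine/Servlet.service(...)
"""
count = threads_in_package(javacore, "com/example/newsportlet")
print(count, "of the dumped threads are inside the suspect portlet")
```

If most of the matching stacks end in socket or JDBC frames, the portlet is waiting on a remote system; if they end in the portlet's own code, the work is local to the application server.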

4.2.2 Poor response times for most or all operations


There are several factors which can lead to poor response times for most or all operations while the CPU load on the portal server(s) is still moderate:
- Network bottleneck: if requests cannot get quickly to the HTTP server(s), or from the HTTP server to the portal server(s), then users will experience slow response times. See Verifying network connections in section 5 to determine if your site is experiencing a network bottleneck. Validate the network performance from a client system to the HTTP server, and from the HTTP server to the portal server. If multiple HTTP server or portal server systems are in use, make sure that you check each link in the system.
- No external Web server: the built-in HTTP server in WebSphere Application Server is adequate for very light load levels and for experimenting with the application server, but it is inadequate for moderate or higher load levels. Even if the environment has only a single application server, an external Web server is needed for the site to be able to handle significant load. For more information on tuning IBM HTTP Server, see this document: http://publib.boulder.ibm.com/httpserv/ihsdiag/ihs_performance.html. Also see HTTP server connection limits in section 5.


- HTTP server overloaded: when using an external HTTP server or servers, these can become a bottleneck. These are areas to check:
  - Look at the CPU load on the HTTP server(s). A CPU load over 80% on the HTTP server indicates that it is likely to be a bottleneck. In this case, read HTTP Server CPU Consumption in section 5 for ideas on how to alleviate this bottleneck.
  - Check network tuning on the HTTP server. Improper network tuning on the HTTP server can cause a significant bottleneck. This is discussed in section 5 under Checking network tuning.
  - Check that the HTTP server can accept sufficient connections. A detailed discussion of this is in HTTP Server Connection Limits in section 5.
  - View the HTTP server's access log and check for errors serving pages. The exact location of these log files can be found in the server's configuration documentation.
- SSL overhead: compare response times with HTTPS (SSL encryption) versus HTTP (no encryption). There is a performance overhead to using SSL, so it is not unusual to see higher response times with SSL. Exactly how much of an overhead will depend on many factors, including network latency and CPU load on the HTTP server. But as a simple rule, on a lightly-loaded HTTP server with low network latency, the SSL connect overhead should not be more than 0.30 seconds. In addition, if the browser makes multiple requests to the same server over SSL, it will try to reuse the SSL connection, in which case the SSL connect time will not be included in every request. The time for making SSL connections can be seen in several ways. For load tests using simulated users, the requests can be changed from https: URLs to http: URLs, and then the change in response time can be seen in the load test tool. Alternatively, a client-side tool can be used to gather this information; for example, IBM Page Detailer (discussed in section 5) can show this information.
- Excessive JVM garbage collection: if the JVM is spending a significant amount of time performing garbage collection, this will impact the response times seen by the users. Understanding verbose garbage collection output in section 5 contains a more in-depth discussion of garbage collection issues.
- Demand paging: this can cause a very severe performance degradation for Java applications such as WebSphere Portal. Demand paging is implemented in all modern operating systems, allowing them to work when the amount of memory in use is larger than the physical memory of the system. Unfortunately, the memory access patterns of Java applications do not mesh well with demand paging schemes. Therefore it is important to be certain that the process(es) executing WebSphere Portal JVM(s) fit entirely in the physical memory of the system: either decrease the maximum heap size for the WebSphere Portal JVM, or add additional memory to the system. A system which is having a performance problem due to excessive paging will typically show very high disk activity, but relatively low CPU utilization.
- Database connection pools: WebSphere Application Server uses pools of connections to communicate with database servers. These are used by WebSphere Portal for communicating with its databases, and portlets may make use of them as well. If those connection pools are frequently full, this can cause threads to wait, slowing down response times. Database connection pools are discussed in more detail in section 5, Database connection pool monitoring and tuning.
- Theme complexity: the site theme(s) are rendered on every page view, so a complex theme can slow down every access to the pages it is used with. To determine the complexity of the theme, build a page which contains only a single instance of the About WebSphere Portal portlet, using one of the IBM-supplied skins for that portlet. If this page still shows poor performance, the theme may be causing performance issues. An additional way to investigate the impact of the site's theme is to replace the theme with the default theme provided with WebSphere Portal. If the page performs significantly better with the default theme than with the custom theme, then you should investigate the custom theme to see why it is slowing down the pages.
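When a load test has been run both with https: and http: URLs, the SSL connect overhead can be estimated by comparing the two sets of response times. The sketch below uses invented samples, not measured data, and simply compares medians under the assumption that everything else was held equal between the two runs:

```python
import statistics

def connect_overhead(https_times, http_times):
    """Difference of median response times (seconds): a rough view of
    the SSL connect cost when all else is equal."""
    return statistics.median(https_times) - statistics.median(http_times)

# Illustrative samples exported from a load test tool (seconds).
https_samples = [0.61, 0.58, 0.64, 0.60, 0.59]
http_samples = [0.32, 0.30, 0.35, 0.31, 0.33]

overhead = connect_overhead(https_samples, http_samples)
print("Estimated SSL overhead: %.2f s" % overhead)
if overhead > 0.30:
    print("Above the ~0.30 s rule of thumb; investigate SSL configuration")
```

Remember that browsers reuse SSL connections, so the per-request overhead in production may be lower than this per-connection estimate.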


4.3 Unable to achieve required capacity


If the systems running WebSphere Portal are seeing high CPU utilization (80% or more), but without giving the necessary throughput, then it is time to look at ways to improve the efficiency of generating pages. There are many components which make up a page as delivered by WebSphere Portal. The following typically have the largest impact on the time required to generate a page:
- Portlets in use: see section 4.3.1, Capacity limits: impact of portlets.
- Theme and skins being used: see section 4.3.2, Capacity limits: impact of themes and skins.
- Number of portlets on the page: see section 4.3.3, Capacity limits: number of portlets per page.
- Proper tuning of WebSphere Portal: see section 4.3.4, Capacity limits: WebSphere Portal tuning.
Improving the efficiency of one or more of these can increase the overall capacity of the site. While considering these, make sure to focus on the most commonly used pages on the site, as these will have the biggest impact on the site's capacity.

4.3.1 Capacity limits: impact of portlets


WebSphere Portal is used to run applications: portlets. A portlet which has an excessively high CPU cost can impact the capacity of the entire site. See Performance monitoring and profiling tools in section 5 for tools to identify applications which could be optimized to improve the site's capacity.


4.3.2 Capacity limits: impact of themes and skins


The theme provides the overall visual appearance of the page. A complex theme can account for 30% of the CPU cost of generating a page. Some suggestions related to themes include:
- Consider the elements in the theme. A theme must balance visual attractiveness with functionality, but keep in mind that components added solely for appearance add to the cost of generating each page. Consider removing elements which are used only for appearance and do not add to the usability of the site. For example, some themes include a display of the date and time; but most operating systems provide a clock application to display the current date and time, so is this really needed in the portal site as well?
- Are there expensive operations in the theme? The theme will be executed on each page view, so any expensive operations within the theme code will ripple throughout the entire site.
- A skin surrounds each portlet, so the skin may be rendered multiple times per page. This means that the skin also needs to be as efficient as possible. WebSphere Portal 6.1 provides a skin called ThinSkin, which has been designed to be inexpensive to render. Consider using this skin as-is, or adapting it to your needs.
More information on creating themes and skins for WebSphere Portal can be found in the developerWorks article Developing themes and skins in IBM WebSphere Portal 6.0.X, at http://www.ibm.com/developerworks/websphere/library/techarticles/0804_barnes/0804_barnes.html.
Another option to consider for WebSphere Portal 6.1 is the PortalWeb2 theme. This theme uses JavaScript-based Client-Side Aggregation (CSA): it uses AJAX techniques with Portal Model REST services to give a more responsive user experience and reduce load on the server. More information about this theme is available in the WebSphere Portal 6.1 information center.

4.3.3 Capacity limits: number of portlets per page


The CPU time required to generate a page is greatly influenced by the number of portlets on that page, as all the portlets on the page must be re-rendered on each page view (unless their output is cached). Therefore one straightforward way to reduce the CPU cost of generating a page is to remove one or more portlets from that page. The following graph shows a theoretical example of the impact of additional portlets, and of lighter-weight versus heavier-weight portlets, on a page:


[Page Cost Illustration: a bar chart of maximum theoretical capacity per page configuration, calculated from example portlet estimated costs (higher is better). Capacity drops as portlets are added and as heavier portlets are used: from 220 for an empty page and 190 for a single simple portlet, down to 26 for the heaviest mix shown.]

Figure 1: Page complexity and capacity
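The shape of the curve in Figure 1 follows from a simple model: each page view costs a fixed base amount plus the cost of rendering each portlet, and capacity is the available CPU budget divided by that page cost. Here is a sketch of that arithmetic with invented per-portlet costs; these are not measured WebSphere Portal numbers:

```python
def pages_per_second(cpu_budget_ms, base_cost_ms, portlet_costs_ms):
    """Max theoretical page rate for one CPU: budget / total page cost."""
    page_cost = base_cost_ms + sum(portlet_costs_ms)
    return cpu_budget_ms / page_cost

# Illustrative per-render CPU costs in milliseconds: theme/engine
# overhead plus light, medium, and heavy portlets (invented values).
BASE, LIGHT, MED, HEAVY = 5.0, 2.0, 10.0, 25.0

print("Empty page:        %5.1f pages/sec" % pages_per_second(1000, BASE, []))
print("4 light portlets:  %5.1f pages/sec" % pages_per_second(1000, BASE, [LIGHT] * 4))
print("4 light + 2 heavy: %5.1f pages/sec" % pages_per_second(1000, BASE, [LIGHT] * 4 + [HEAVY] * 2))
```

The model shows why removing a single heavy portlet buys much more capacity than removing several light ones.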

Particularly for pages with many portlets on them, think about whether the page could be split up into several pages, distributing its portlets across those pages. One way to make this decision is to think about whether users will need to access all the applications on the page at the same time. If not, then it may be possible to break the page up into several pages. If those additional pages will only be used by certain sets of users, consider using either access control settings or Personalization rules to make the pages visible only to the users who need them. Generally speaking, limiting access through access control settings will give better performance, while limiting access through Personalization rules is more flexible. For more information on Personalization rules, take a look at the information center for Portal 6.1: http://www.ibm.com/developerworks/websphere/zones/portal/proddoc.html

4.3.4 Capacity limits: WebSphere Portal tuning


WebSphere Portal needs to be properly tuned to achieve high throughput. Compare the tuning done on the site with the tuning in the WebSphere Portal tuning guide. The settings in the tuning guide are not intended as the best settings for all customer sites,
so our recommendation is not to blindly apply all of the settings in that guide. Rather, for settings which are different, make sure you understand the differences and why they are different on your site. The WebSphere Portal 6.1 tuning guide has the general tuning used for the base portal benchmark scenario. It has a separate section with additional tuning settings made for the Web Content Management (WCM) benchmark scenario. If your site uses WCM, check that section of the tuning guide. The performance of the database server is very important for sites using WCM; a poor-performing database server can easily make a WCM-based site sluggish. If you are having performance problems with such a site, look into performance on the database server to see if improvements can be made in this area. Links to the WebSphere Portal 6.0 and 6.1 tuning guides are in Appendix B.


4.4 Performance dropping under constant load


When a constant load is applied to a system, performance should remain roughly constant over time. That is, the observed throughput and response times should be fairly flat as long as the load remains flat. These metrics won't remain exactly constant: in addition to the inevitable random variation in performance measurements, WebSphere Portal performance will show some variation due to the behavior of the JVM garbage collector. However, response times should not steadily increase, nor should throughput steadily drop while the load remains constant. As it is difficult to maintain a constant load in a production environment, we recommend you check for these problems in a controlled load test. The following graph shows throughput in such a test:

Figure 2: Throughput graph showing declining throughput

In this measurement, the load was increased to a selected level and then kept constant over time. In such an environment, we would expect the throughput to remain roughly constant. But in this measurement, the throughput drops slowly over time, even though the load is kept constant. It is important to note the time scale as well: this measurement covers only 1 hour, so the trend might continue, or it might be due to variation over this 1-hour period. To detect a clear trend indicating a drop in performance, we recommend measurements over an 8-hour or 24-hour period. Having described this symptom, here are some common causes for this kind of performance problem: Memory leak: a memory leak in a Java application is where more and more of the heap memory is consumed over time. This causes a performance problem as the JVM is forced to spend more time in garbage collection, leaving less time for productive work. The easiest way to identify that this is occurring is to enable
verbose garbage collection. This is explained in section 5 under Detecting memory leaks.
- Database tables which increase in size: as database tables grow, some database servers require maintenance tasks in order to operate efficiently. If this is not done, queries against the database can get slower and slower as the tables grow. Any application which adds records to a database could cause this problem. In addition, WebSphere Portal has several tables which contain one row per user who has ever logged in to the site. These tables will require maintenance as they grow; this is discussed in section 5 under Database maintenance.
- Cold-start vs. steady-state behavior: collecting and analyzing detailed data is a common part of trying to determine the cause of a performance problem. This can include snapshots of database activity, traces using an application monitoring tool, or other such data. When doing so, make sure that the data you are capturing is steady-state data, and not the result of the application's warm-up behavior. These are some examples of ways in which WebSphere Portal should be warmed up before gathering data:
  - Web applications must be initialized before their first access. Therefore you should access all portlets that will be used for data collection at least once before you begin collecting data.
  - On the first access to a JSP (or the first access after an update to the JSP), the JSP must be compiled; this can add a delay before the page is available. This delay can be eliminated by pre-compiling JSPs; WebSphere Application Server provides a JSP pre-compilation tool for this purpose. See http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/index.jsp?topic=/com.ibm.websphere.base.doc/info/aes/ae/rweb_jspant.html for more information on how to run this tool.
  - WebSphere Portal caches information such as access control permissions, resource hierarchies, and page layouts.
A good way to get an accurate picture of the behavior of the product is to access the site as two different users who are in the same group. The data collected for the second user's visit is likely to be more representative of steady-state behavior. But keep in mind that some of the internal caches in WebSphere Portal discard entries after a period of time, so make sure those two accesses are close enough together that cache expiry has not occurred. An easy way to deal with this is to make sure all of the cache lifetimes specified in CacheManagerService.properties are at least 1 hour (3600 seconds). For more information about monitoring WebSphere Portal's internal caches, see "Monitoring caches in WebSphere Portal" in section 5.
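One way to turn verbose garbage collection output into a leak indicator is to track the heap occupancy after each collection and fit a trend line; a persistently positive slope over a long, constant-load run is the classic leak signature. Here is a sketch using illustrative samples (the numbers are invented, and in practice would be parsed from the verbosegc log):

```python
def heap_growth_rate(samples):
    """Least-squares slope of (seconds, MB used after GC) pairs, in
    MB per second. A persistently positive slope over a long run under
    constant load suggests a memory leak."""
    n = len(samples)
    sx = sum(t for t, _ in samples)
    sy = sum(m for _, m in samples)
    sxx = sum(t * t for t, _ in samples)
    sxy = sum(t * m for t, m in samples)
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)

# Illustrative: heap occupancy after each GC cycle over a 1-hour run.
samples = [(0, 410), (600, 445), (1200, 480), (1800, 512),
           (2400, 548), (3000, 580), (3600, 615)]
rate = heap_growth_rate(samples)
print("Heap growing at %.1f MB/hour under constant load" % (rate * 3600))
```

As with the throughput trend itself, an 8-hour or 24-hour run gives a much more trustworthy slope than a 1-hour sample.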


4.5 Problems specific to load tests


Performance measurements or load tests are sometimes done to determine the capacity of a system, or to verify that the system can sustain a certain throughput rate over time. These tests are typically run with load generation software simulating the workload of human users. The load generation software must always implement a simplified model of how the user will interact with the system; sometimes the simplifications chosen cause a performance problem to appear that will not appear in real-world use.

4.5.1 Problems related to automated load generation


The following are problems that might occur in such a test:
- Excessive load on one or more load generators: generating requests places a load on the systems acting as load generators. While these are usually not a bottleneck, they can become overloaded if too much work is assigned to one load generator. In general, if the CPU load on a load generator is over 60%, that system has too much work. For more information on identifying these problems, see Excessive load on load generators in section 5.
- Excessive network load: most systems will report CPU load, but network load often isn't as easy to measure. As a rough estimate, you can look at the total amount of data being transferred, and compare this against the capacity of the network adapters of the systems under test. If the network utilization reaches 80% or more of the stated capacity of the adapter, it is probably becoming a bottleneck.
- Messages about user data not being updated: when a user logs in, WebSphere Portal updates some information in its database for that user. If the same user ID is used by multiple virtual users simultaneously, WebSphere Portal will detect this and not attempt to update the database; instead, it will log the following message: com.ibm.wps.engine.commands.LoginUser execute EJPSD0027E: User data for user Yin_006_992 was not updated due to multiple logins. This can cause an artificial performance bottleneck, so it should be resolved if
possible. For more information, read Assigning user IDs to virtual users in section 5.
- Warnings about not being able to set a session cookie: by default, WebSphere Portal will not create sessions for unauthenticated users. If a portlet which requires a session is placed on an unauthenticated page, the following message will be logged to SystemOut.log: [10/17/05 9:39:11:641 EDT] 66ed246e SessionContex W SESN0066E: Response is already commited to client. Session cookie cannot be set. The portlet will still get a session, but the session is 'orphaned' because the session cookie cannot be set. An orphaned session cannot be accessed by the application on subsequent requests, and will remain in memory until it expires. This causes a performance problem in two ways: logging the message introduces some overhead, and the orphan session objects consume memory. To correct this, either modify the portlet code so that it does not require a session, or remove the portlet from unauthenticated pages.
- Network variations: it is important to run load tests on an isolated network, one which is not also serving other traffic. Otherwise, variations in network utilization from the other traffic can introduce performance variations into the load tests. Often, it's not practical to run load tests on an isolated network. In that case, make sure you are monitoring the network utilization to be certain that it's not becoming a bottleneck; even load tests run at off-peak hours can be impacted by network performance if other tasks (such as backups) are occurring at that time. Information on monitoring the network is given in section 5, Network issues.
- Sudden increase in response time: as the load on the WebSphere Portal system increases, the response times should increase gradually. A sudden increase in response times usually indicates that one component of the system is no longer handling requests efficiently.
Two areas where this may be occurring are the database server and the JVM garbage collector. Detailed information is available in section 5: Monitoring database performance, for ways to determine if the database has become a bottleneck, and Understanding verbose garbage collection output, for information on the JVM garbage collector.
- Stale references to static content: many load testing scripts are recorded containing a list of static content (such as images or stylesheets) to download in addition to the main HTML page. If the site changes and some of that static content no longer exists, this will produce errors at the Web server when trying to serve this content. These errors will add overhead as well as produce large log files. This overhead is most pronounced in cases where the site is being accessed over SSL. On many HTTP servers, when an error page is returned, the HTTP connection is closed. This will require more new connections than may be
Page 32 of 97

Performance Troubleshooting Guide

typically needed, and these additional connections will require extra SSL handshakes. SSL handshakes are very CPU-intensive, driving up the CPU consumption at the HTTP server. Load variation: benchmark scenarios might not include default scheduled Portal activities such as users searching, the Portal (re-) building its search indexes through the crawlers, specific applications which are accessed using WSRP, workflow driven applications relying on the Process Server engine, rendering of WCM content locally or remotely. All of these elements ought to be taken into consideration when developing a load testing scenario. For general information on performance benchmarking, see section 6.

4.5.2 Identifying common errors in load measurements


The results from a performance measurement contain more than just response times and processor loads. The results also contain data which can help identify problems in the measurement. These are some items which we commonly check to validate a performance measurement:

Error rate: the first metric to check on any performance measurement should be the error rate. Look at the overall error rate (total failed transactions divided by total transactions), and also the error rate for each individual transaction. Compare this to the maximum acceptable error rate for the scenario.

Transaction mix: the transaction mix is the relative frequency of each transaction in the scenario, as shown in this example:

[Figure 3: Example transaction mix. A pie chart showing the relative frequency of each transaction in the scenario (Homepage, Page Navigation, Log In, Get Login Form, Read News, Application Interaction, Enter Application, List My Places, and Filler Page), with frequencies ranging from 1% to 48%.]

This should be calculated for the period of constant load in the measurement, excluding the ramp-up and ramp-down periods. Capture the transaction mix from a measurement which is known to be good. Then you can compare this against the transaction mix in subsequent measurements.

Mix of HTTP return codes: as with the transaction mix, this is the relative frequency of the various HTTP return codes seen in the measurement. This mix should change only if the behavior of the script or the site being measured changes; any other significant change may indicate a problem during the test. Typically, most of the HTTP return codes will be 200 (OK), with far fewer others, such as 300-class return codes. Therefore, a straightforward way to look at the frequency of other return codes is as a ratio of 200s to the other return codes. For example, consider a measurement with 108,651 200s and 4,502 302s. In this case, the ratio of 200s to 302s is 24.1 to 1. This ratio should not change more than +/- 5% between measurements. A quick reference for common HTTP return codes is given in Appendix C.

HTTP requests per transaction: from the number of HTTP requests and the number of transactions completed in the period of constant load, you can calculate the average number of HTTP requests issued per transaction. This number is based on the site makeup (for example, the number of images on the page). It should change only if the content of the site changes; if it changes without the site content having changed, then there may be a problem in the measurement.

Bytes per transaction: from the measured traffic rate (bytes/second) and the observed transaction rate (transactions/second), you can calculate the average total content size retrieved per transaction. This is similar to the number of HTTP requests per transaction: it should change only if the site content changes.
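The return-code ratio check described above is simple arithmetic and easy to automate. This sketch compares a known-good baseline against a later run; the baseline counts are the ones from the example above, while the "current" counts are invented for illustration:

```python
# Compare the ratio of 200s to 302s between a known-good baseline run and
# a new run, flagging drift beyond the +/-5% guideline.
def return_code_ratio(count_200, count_other):
    return count_200 / count_other

baseline = return_code_ratio(108651, 4502)   # 24.1 to 1, as in the example
current = return_code_ratio(95210, 4105)     # hypothetical follow-up run
drift = (current - baseline) / baseline

print(f"baseline {baseline:.1f}:1, current {current:.1f}:1, drift {drift:+.1%}")
if abs(drift) > 0.05:
    print("Return-code mix changed significantly; investigate the measurement")
```

The same pattern applies to any of the per-measurement ratios in this section (requests per transaction, bytes per transaction).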


5 Detailed Instructions
The sections below include more detail on specific tasks mentioned above. To help navigate this section, it is divided into subsections by general topic area.

5.1 Network issues


5.1.1 Verifying network connections
A WebSphere Portal installation can consist of many systems which cooperate to provide the site functionality. It is important that the network links between these systems do not become a bottleneck. The following diagram shows the types of systems that go into a common WebSphere Portal installation:

[Figure 4: Sample Deployment. A diagram showing the systems in a common deployment: Internet clients, an edge server, HTTP servers, WebSphere Portal servers managed by a Network Deployment Manager, a database server, a directory server (LDAP), and local clients or a load generator.]

A network bottleneck on any of these links can cause a performance problem. There are many tools which can be used to monitor network performance. If you do not have a performance monitoring solution for the network links between the systems in your WebSphere Portal configuration, it is still possible to extract this information with some simple tools:

Ping: the 'ping' command sends network packets to a peer node and reports the time taken to receive a reply. It also shows the percentage of packets lost. On a local area network, response times for ping requests should be less than 5 milliseconds. A higher response time can indicate a network bottleneck. Wide-area networks will also show higher ping times; for example, ping
requests across the continental United States are often in the 75-100 millisecond range. Knowing the ping time for your end-users is important, because it can be used to give a general guide for page response time. In general, each element on the page requires a separate request, so the page response time will be at least (number of elements) * (average ping time). More information on determining the number of elements per page can be found in the section on Breaking down page load times. In addition, dropped packets can cause a significant delay. Even a low rate of packet loss, such as 1%, can cause a significant performance penalty.

File transfer: transferring a large file between two systems can provide a simple measure of the available network bandwidth between them. This can be done with a file-transfer program such as ftp, or by copying a file over a network file system. In either case, a healthy network should allow transferring files at a rate of at least 1 megabyte per second. A lower rate probably indicates that the bandwidth on the network is exhausted.

Tracert: the tracert (trace route) command is a route-tracing utility used to determine the path that an IP packet takes to reach a destination. Tracert works by increasing the "time to live" (TTL) value of each successive packet sent. When a packet passes through a host, the host decrements the TTL value by one and forwards the packet to the next host. When a packet with a TTL of one reaches a host, the host discards the packet and sends an ICMP "time exceeded" message back to the source. Used properly, tracert can help you find points in your network that are either routed incorrectly or do not exist at all. You can find more information on this tool at http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/traceroute.htm

Network analysis: for a more in-depth investigation of network traffic, use a network analysis tool.
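The (number of elements) * (average ping time) rule of thumb above can be written out as a quick calculation; the element count and ping time here are hypothetical examples, not measurements from this guide:

```python
# Lower bound on page response time when each page element costs at least
# one network round trip, per the rule of thumb above.
def min_page_response_ms(num_elements, avg_ping_ms):
    return num_elements * avg_ping_ms

# A hypothetical page with 25 elements served over a WAN with 80 ms ping times
print(min_page_response_ms(25, 80), "ms")  # 2000 ms
```

If the measured page response time is far above this floor, the extra time is coming from the servers rather than from raw network latency.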
One such tool is Wireshark, a free tool available from http://www.wireshark.org; more information on how to use it is available at http://wiki.wireshark.org.

If you have identified a network bottleneck between two nodes, what is the next step? Unfortunately, there are many possible reasons for a network bottleneck, such as the following:

Network adapter type: the systems involved should be using high-performance network adapters. At a minimum, 100-megabit adapters should be used, but high-load sites may need more bandwidth.

Network adapter settings: make sure that the network adapters are configured to use their highest supported transfer rate. For example, a gigabit Ethernet adapter which is configured to run at 10 Mbps will not provide good performance. If the network adapter supports full-duplex operation, make sure this is being used. Also, check whether the MTU (maximum transmission unit) size has been overridden from the default value. A poorly chosen MTU value can cause inefficient use of the network, causing a bottleneck.

Network reliability: if the network is showing packet loss, then there may be problems with network adapters, cabling, switches, or routers. All of these should be checked.

Firewalls: firewalls are a crucial part of a network's security, but you must also make certain that they do not become a performance bottleneck. If there are any firewalls in the path which is exhibiting a network bottleneck, monitor them to make certain that they are not causing the bottleneck.

Network equipment: faulty equipment at the network layer can have an impact on how well your network performs. Make sure that the correct type of cables are being used for the expected throughput, and monitor network devices such as routers and switches to make sure they're not contributing to performance problems. For example, Cisco has information on monitoring their switches available on the Web at http://www.cisco.com/en/US/docs/net_mgmt/traffic_director/5.8/user/guide/switches.html

5.1.2 Checking network tuning


Even if the network infrastructure is performing well, poor network tuning can cripple application performance. The WebSphere Portal Tuning Guide lists the network tuning changes which we made to achieve good performance in our lab environment. If you are experiencing network bottlenecks, look at whether the changes made in that guide are appropriate for your environment. See the References section for links to tuning guides for WebSphere Portal 6.0 and 6.1.

5.2 HTTP server performance


5.2.1 HTTP server connection limits
Most HTTP servers have configuration settings which control how many simultaneous connections they can handle. Once this limit has been reached, additional connection requests will be queued or even rejected. When connections are queued, users see this as poor response time. When connections are rejected, clients see 'Connection refused' errors when trying to access the site. On Apache-based servers, such as IBM HTTP Server, this causes the following message to be logged to the error log:

[Thu Oct 27 10:01:04 2005] [error] server reached MaxClients setting, consider raising the MaxClients setting

For most Web servers, there are two factors which impact whether clients will experience this problem. The first is the maximum number of connections the server can accept: if the server can only accept 100 connections, then once that limit has been reached, the next connection will be queued for handling later. Modern server hardware is able to handle a large number of concurrent connections, so we recommend setting the maximum connection limit to 1,000 or more.

The second factor is the handling of keep-alive connections. To understand their impact, consider how HTTP uses keep-alive connections:

1. A client opens a connection and makes a request on that connection.
2. The server sends a reply and indicates that it is willing to keep the connection open for a subsequent request.
3. If the client has another request to make immediately (for example, for an image on that page), it will send the request. Otherwise, the client will simply hold the connection open for a period of time.

While the client is holding the connection open, that connection counts against the limit of open connections the server can handle. So while HTTP keep-alive connections offer significant performance benefits, it is important to make sure that clients are not holding idle connections for a long period of time. The time period that the server will hold the connection open waiting for the next request from the client is called the keep-alive timeout, and most HTTP servers allow configuring this. In our experience, this should be set to a relatively low value, such as 5-10 seconds.

Disabling HTTP keep-alive connections is generally not recommended. This would free up the connection immediately to process a request from a new client, but it forces clients to open a new connection for each element on the page. This increases CPU utilization at the HTTP server. It also increases the total page response time for clients.

The configuration for the values mentioned, and the way to monitor whether the connection limit has been reached, is server-specific.
This section will give instructions for IBM HTTP Server 6.1. For other HTTP servers, see the server documentation for further information.

Configuring the HTTP Server


As mentioned above, it is important to configure both the maximum number of connections the server can accept, and the keep-alive timeout. Instructions for tuning these settings can be found in the IBM WebSphere Portal Tuning Guide, under Performance Tuning the Environment.
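As a sketch of what these settings look like in an Apache-based server such as IBM HTTP Server, the httpd.conf fragment below shows the relevant directives. The directive names apply to Apache 2.0-era servers, and the values are illustrative, not recommendations for your site; consult the Tuning Guide for appropriate values.

```
# Maximum simultaneous connections the server will accept
MaxClients 1000

# Allow clients to reuse connections, but close idle
# keep-alive connections after 5 seconds
KeepAlive On
KeepAliveTimeout 5
```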


5.2.2 Monitoring HTTP Servers


Using mod_status with IBM HTTP Server
IBM HTTP Server includes a status module known as mod_status, which provides a status report that can be viewed through a Web browser. This status display is not enabled by default in IBM HTTP Server. The documentation for mod_status contains a topic, Enabling status support, which explains how to enable it; the documentation can be found on the Web at http://httpd.apache.org/docs/2.2/mod/mod_status.html. Once the server status monitor has been configured, it can be accessed through a Web browser; the default path is http://<HTTP server hostname>/server-status. The status report looks like the following:

[Figure 5: Apache server status display. A screenshot of the /server-status report, including a line reporting the number of requests currently being processed and the number of idle workers.]


To determine if requests are being queued or handled immediately at the HTTP server, look at the line in the report which shows the number of requests currently being processed and the number of idle workers. When the number of idle workers drops to 0, the server is unable to accept any additional connections. Because this report shows the status at the moment the request was handled, it is important to check it at various times while the system is under load. If the HTTP server is frequently showing a very small number of idle workers (10 or fewer), then it is likely that the HTTP server has become a bottleneck by not accepting connections quickly enough.
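One way to watch the idle-worker count over time is to poll mod_status in its machine-readable form (the ?auto query string) and extract the IdleWorkers field. The sketch below parses a captured sample of that output; the sample values and the 10-worker threshold are assumptions for illustration, and in practice the sample text would come from an HTTP GET of http://<HTTP server hostname>/server-status?auto:

```python
# Parse the machine-readable mod_status output and warn when the
# idle-worker count is low. "sample" stands in for the body returned
# by fetching /server-status?auto from the HTTP server.
sample = """Total Accesses: 1245796
Total kBytes: 22146730
BusyWorkers: 92
IdleWorkers: 8"""

status = {}
for line in sample.splitlines():
    key, _, value = line.partition(": ")
    status[key] = value

idle = int(status["IdleWorkers"])
if idle <= 10:
    print(f"Warning: only {idle} idle workers; the HTTP server may be a bottleneck")
```

Run from cron or a monitoring agent, a check like this records how often the server approaches its connection limit.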

Other HTTP Server Monitoring Tools


HTTP servers can produce activity logs which can be consumed by a variety of monitoring and reporting tools. Some examples include:

AWStats: a free, powerful tool that graphically generates advanced web, streaming, ftp, or mail server statistics. It's available at http://awstats.sourceforge.net/.

CA Wily Introscope: a component of the CA Wily Application Performance Management solution, it enables you to monitor complex web applications in production environments. Information is available at http://www.ca.com/us/application-management.aspx.

Tivoli Monitoring for Web Infrastructure: a management tool which helps ensure the optimal performance and availability of your application servers and the associated Web servers which power them; details can be found at http://www.ibm.com/software/tivoli/products/monitor-web/.

Checking HTTP Server Response Times


The tools mentioned above give ways to monitor the server and its resource consumption. But sometimes you need a simple check of whether the HTTP server is responding properly. Often, the simplest way to do this is to request a single file, such as an image or a small HTML file, from the HTTP server. This shows the best-case response time for the HTTP server, as it is about the simplest request that can be made of the server. On a low-latency network (such as a LAN), a well-performing HTTP server should be able to serve a small file in less than 10 milliseconds (0.01 seconds). Obviously, this will increase as network latency increases.

This response time can be weighed against the observed response times for more complex transactions. For example, consider a page view which consists of 10 HTTP requests: an HTML page, a stylesheet, and 8 images. If the HTTP server is giving a response time of 150 milliseconds for a small file, then those 10 requests would take a minimum of 1.5 seconds (or perhaps somewhat less if requests can be made in parallel). If this transaction shows a response time of 2.0 seconds on average, then it is clear that most of the response time comes from the HTTP server itself, and that therefore the performance of that transaction cannot be significantly improved unless the HTTP server performance is improved.
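The arithmetic in this example can be written out directly; the request count and timings are the ones used in the example above:

```python
# Estimate what share of a transaction's response time is explained by
# the HTTP server's own best-case per-request response time.
requests_per_page = 10      # HTML page + stylesheet + 8 images
small_file_ms = 150         # measured best-case time for a single small file
observed_page_ms = 2000     # average response time for the full transaction

http_floor_ms = requests_per_page * small_file_ms
share = http_floor_ms / observed_page_ms
print(f"HTTP server floor: {http_floor_ms} ms ({share:.0%} of observed time)")
# prints: HTTP server floor: 1500 ms (75% of observed time)
```

When the share approaches 100%, tuning the application will not help much; the HTTP server itself (or the network in front of it) is the limiting factor.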

5.2.3 HTTP server processor consumption


In an environment where the HTTP server is on a separate system, it is unusual for the CPU on that system to become overloaded. However, if the CPU load on that system is 80% or higher, it can be a performance bottleneck. In some cases, the only solution is to replace the system with a more powerful one. Before doing that, though, consider these items, which could ease the CPU load on that server:

What is the error rate? Error processing at the HTTP server tends to have a higher CPU cost than requests which are handled successfully. This is especially true on sites which are being accessed over SSL. A common cause of a high error rate is bad links or missing images. If these are occurring on your site, correcting the problem would reduce the error rate, and reduce the CPU load on that server. Look for these errors in the HTTP server's error log. Alternatively, consider one of the tools discussed in the section above, Monitoring HTTP Servers.

Can content be consolidated? Each component of the page which must be separately retrieved adds to the workload of the HTTP server. See Reducing the number of elements on a page for ideas.

Is request logging at the HTTP server causing a large overhead? The standard access logging provided by HTTP servers does not typically impose a very high overhead, but it can become important for high-volume sites. Look at the logs which are being generated. Is the information useful? Would it be possible to reduce the amount of logging being done without losing important information?

5.3 Database performance


5.3.1 Database connection pool monitoring and tuning
Each database connection pool has two parts: a pool of connections, and a queue of threads waiting for a connection. When a thread needs a connection and one is available from the pool, it gets that connection and uses it. If no connection is available and the pool hasn't reached its maximum size, a new connection will be made and assigned to that thread. When the pool has reached its maximum size and no connection is available, the thread is put in a queue to wait for one. But it will only wait for the maximum amount of time configured for that connection pool; then the request will time out and the message J2CA0045E will be logged, as in this example:

J2CA0045E: Connection not available while invoking method createOrWaitForConnection for resource jdbc/wpdbDS_release


Occasionally reaching the state where all the connections in the pool are in use is usually not a problem, but if it occurs frequently, or if threads are forced to wait so long that they time out, this can be a performance problem. When the connection pools are properly sized and the database server is performing well, this problem should not occur. Unfortunately, there is no one best size for a database connection pool; if there were, it wouldn't need to be configured. The best practice is to monitor the size of the connection pool and increase it if the number of connections in use is frequently reaching the maximum size of the pool. Information about the size of the pool and the number of connections in use can be monitored with a tool which reads WebSphere Performance Monitoring Infrastructure (PMI) data, such as Tivoli Performance Viewer. Likewise, if the J2CA0045E message is being logged, this usually indicates that the size of the connection pool needs to be increased. Remember that increasing the size of the connection pool allows more connections to be made to the database server; ensure that the database server can handle the total number of connections it will receive.
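The pool behavior described above (reuse a free connection, grow the pool up to its maximum, otherwise queue with a timeout) can be captured in a short model. This is an illustrative, single-threaded sketch, not WebSphere code; the class and exception names are invented for the example:

```python
# Illustrative model of the connection-pool decision sequence described
# above. Not WebSphere code, and not thread-safe; for explanation only.
import queue

class PoolTimeout(Exception):
    """Raised when no connection becomes free within the wait timeout."""

class ConnectionPool:
    def __init__(self, max_size, wait_timeout_s):
        self.max_size = max_size
        self.wait_timeout_s = wait_timeout_s
        self.created = 0
        self.free = queue.Queue()

    def acquire(self):
        try:
            return self.free.get_nowait()      # 1. reuse an idle connection
        except queue.Empty:
            pass
        if self.created < self.max_size:       # 2. grow the pool up to its maximum
            self.created += 1
            return f"conn-{self.created}"
        try:                                   # 3. pool exhausted: queue and wait
            return self.free.get(timeout=self.wait_timeout_s)
        except queue.Empty:
            raise PoolTimeout("connection not available (cf. J2CA0045E)")

    def release(self, conn):
        self.free.put(conn)
```

The timeout in step 3 is the case that surfaces as J2CA0045E in SystemOut.log: a thread waited the full configured period and no connection was released.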

5.3.2 Monitoring database performance


To achieve high system performance, it is important to get good performance from the database used by WebSphere Portal. There are two approaches to monitoring the performance of the database: it can be monitored from the portal side or from the database side. Monitoring from the portal side is done the same way regardless of which database server is being used, while the specifics of monitoring from the database side vary depending on the database server. Another big difference is the level of detail available: monitoring at the database side can provide much more detail.

Monitoring database performance from the portal side


This task is done using Tivoli Performance Viewer, which is included with WebSphere Application Server. More information on using this tool is available in the WebSphere Application Server information center, in the section Monitoring performance with Tivoli Performance Viewer (TPV); this can be found at http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/topic/com.ibm.websphere.nd.multiplatform.doc/info/ae/ae/tprf_tpvmonitor.html

Tivoli Performance Viewer will display aggregate statistics for all defined database connections, or the information can be displayed for individual datasources. We recommend viewing statistics on individual datasources, as this can show specific databases which may be causing performance problems. The following information about database performance can be found using Tivoli Performance Viewer, along with guidelines for interpreting the results.

Time spent executing database calls: the counter called JDBC Time monitors the time required for issuing database queries. When the database is performing well, this time should be under 35 milliseconds. Times in excess of this indicate
that the database is processing queries slowly; in this case, you should look into monitoring at the database side to identify why queries are being handled slowly.

Connection pool statistics: WebSphere Application Server provides pools of database connections. Several counters are useful for monitoring these connection pools:

o Avg Wait Time: this counter shows how long requesters must wait to get a connection from the connection pool. This should be 0 or very close to it; if the average is more than 5 milliseconds, this indicates a problem. In that case, look at the average time spent executing database calls. If that is also high, then investigate it further. If, however, the average time spent executing calls is still acceptable, then the database connection pool size probably needs to be increased. Information on how to do this is in the WebSphere Application Server information center.

o PrepStmt Cache Discards: WebSphere Application Server maintains a cache of prepared JDBC statements to reduce CPU load on the database server. This counter shows how many times a prepared statement was discarded from the cache because the cache was full. Compare this to the counter Num Allocates to find the percentage of requests which result in a cache discard. The normal values expected here vary depending on the database being examined, as only certain databases have an access pattern which allows for good utilization of this cache. For example, the database for the jcr domain frequently shows a very high discard rate. On the other hand, the databases for the release, community, and customization domains typically show a low discard rate, less than 25%. If the rate is higher, then you may be able to get a performance increase by increasing the size of the prepared statement cache associated with that datasource. More information on doing this is in the WebSphere Application Server information center.
Note: the prepared statement cache is kept in memory, so if your application server is seeing high memory utilization, weigh carefully the tradeoff of a higher hit rate on this cache versus the additional memory usage.
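The discard-rate comparison described above is simple arithmetic over two PMI counters; the counter values below are invented for illustration:

```python
# Percentage of statement allocations that resulted in a cache discard,
# computed from the PMI counters "PrepStmt Cache Discards" and
# "Num Allocates" discussed above.
def discard_rate(prepstmt_cache_discards, num_allocates):
    return prepstmt_cache_discards / num_allocates

rate = discard_rate(prepstmt_cache_discards=1800, num_allocates=10000)
print(f"discard rate: {rate:.0%}")  # prints: discard rate: 18%
```

At 18% this hypothetical datasource would be within the rough 25% guideline for the release, community, and customization domains; a much higher figure would suggest enlarging the prepared statement cache.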

Monitoring database performance from the database side


The database server is able to provide far more detail about the specific queries being executed, as well as potential performance problems. Unfortunately, the specific information available, and the way to gather it, varies depending on the database server. The following discusses how to monitor the performance of DB2; for other database servers, consult that server's documentation.

The DB2 Database System Monitor provides the ability to capture snapshots of the database performance. Capturing and using this information is discussed in the article Performance Monitoring from DB2 Magazine. This three-part article is available on the Web as follows:

Part 1: http://www.db2mag.com/story/showArticle.jhtml?articleID=18901270 Part 2: http://www.db2mag.com/story/showArticle.jhtml?articleID=23903573 Part 3: http://www.db2mag.com/story/showArticle.jhtml?articleID=51200282

Database snapshots are also explained in some detail in the article Measure, Improve, Repeat from DB2 Magazine, at http://www.ibmdatabasemag.com/story/showArticle.jhtml?articleID=161601940. Another resource for monitoring DB2 performance is IBM Tivoli OMEGAMON XE for Databases; more information on this product is available at http://www.ibm.com/software/tivoli/products/omegamon-xe-databases/.

Next steps
The goal of monitoring the database performance is to determine if the database server is causing a performance problem in your WebSphere Portal system. If a performance problem at the database is indicated, then continue to the next section (Common database performance issues) for ideas on how to resolve these problems.

5.3.3 Common database performance issues


A database server is a complex entity, and there are many possible reasons for a performance problem at one. This section describes some common database performance problems and their solutions, but there are many other possible causes as well. Consider the following:

Choice of database: when WebSphere Portal is installed, it uses Cloudscape for its databases. While this database is suitable for portlet development or single-user experiments with the portal, it is not appropriate for a portal which will have a high level of traffic. For portals which will have higher traffic loads, an alternate database is needed. The other databases supported by WebSphere Portal are listed in the product information center.

Database tuning: most database servers have to be tuned to achieve proper performance. The WebSphere Portal performance team has used several database servers in our performance measurements. The tuning we used on those database servers is documented in the WebSphere Portal Tuning Guide. If you are using DB2 as your database server, you can also use the DB2 Configuration Advisor for suggestions on how to tune the database.


WebSphere Portal tuning: WebSphere Portal caches some information in memory; this caching reduces the load on the database and improves overall performance. If the WebSphere Portal caches have not been tuned properly, the load on the database server will increase. A discussion of these caches is in the WebSphere Portal Tuning Guide. That guide also contains the settings used by the WebSphere Portal performance team in our performance measurements, but these might not be appropriate for all WebSphere Portal sites.

Database maintenance: some database servers require maintenance tasks to give their best performance. See the following section, Database maintenance, for additional information.

Network connections to the database server: if the network links from the WebSphere Portal server(s) to the database server are performing poorly, the database may appear to perform poorly. Use the techniques described in section 5.1, Verifying network connections, to confirm that these connections are performing well.

Indexing: the databases provided by WebSphere Portal, such as the wpsdb database, are created with the indexes needed for typical use of the product. It is unlikely that you will need to create additional indexes on these databases. However, if your applications are accessing other databases, then check whether the appropriate indexes have been created on those databases for efficient access.

It's also important that the application server has adequate connections to communicate with the database server; see Database connection pool monitoring and tuning above for more information on this topic.

5.3.4 Database maintenance


Database servers have a component known as a query optimizer, which decides how to execute the queries they are asked to perform. One piece of information the query optimizer needs to make these decisions properly is statistics about the data in the database tables. Some database servers do not automatically update these statistics as tables grow or shrink. Out-of-date statistics can cause the query optimizer to pick poor strategies for executing queries, resulting in slow response times from the database. It is therefore important to update statistics on any tables which grow or shrink significantly.

The WebSphere Portal databases contain a number of tables which may experience significant changes in size. Rather than trying to list all of the events which may cause this, and the tables that would be affected, it is best to run maintenance on all of the tables in the WebSphere Portal databases. In addition, if the site uses portlets which update other databases, those databases might also contain tables which experience significant changes in size.


The maintenance process is different on each database server. The following process is used on IBM DB2 Universal Database:

Update table statistics: table statistics are updated with the runstats command. This should be done on a weekly basis, or whenever a large change is made to the content in the portal (such as creating pages through XmlAccess). For example, the following command will update the statistics on the USER_DESC table:

runstats on table tableschema.USER_DESC with distribution and detailed indexes all

Physically reorganize tables: this will defragment the tables on disk, allowing for more efficient data access. We recommend performing this at least once a month, plus after any time a large amount of data has been loaded into the portal databases. Reorganizing the tables is performed with the reorg command. For example, to optimize the table USER_DESC, use the following command:

reorg table tableschema.USER_DESC

The name tableschema represents the schema name used for the database table being reorganized. By default, the reorg command is designed to be run while the database is offline, but an option exists for running it while the database is active. In either case, the runstats command should be run on a table after it has been physically reorganized. For additional information on these commands, see the DB2 Command Reference manual or the DB2 information center. For information on how to perform these operations with other database servers, see the documentation for those database servers.

WebSphere Portal V6.0 and later use several logical database domains, which may be represented by different schema names in a single physical database, or by multiple physical databases. In either case, make sure that the database maintenance covers all of these logical databases.

The physical organization of the tables on disk can also have an impact on the performance of queries.
If at all possible, it is highly recommended to use multiple physical drives in the database server. At a minimum, the database logs should be on a dedicated drive. The ideal situation is to have the database product code on one drive, and independent RAID arrays for data (tables and indexes) and database logs.
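Because the recommendation is to maintain all tables in the portal databases, this step is commonly scripted. The sketch below is illustrative and not from the guide: the WPSDB schema name and the table names are placeholder assumptions, and in practice the table list could come from querying SYSCAT.TABLES for the portal schemas. It simply generates the command strings in the order described above, with runstats re-run after each reorg:

```python
# Illustrative sketch (not from the guide): emit the DB2 maintenance commands
# described above for a set of tables. The schema name and table list are
# assumptions; in practice the list could come from querying SYSCAT.TABLES
# for the portal database schemas.

def maintenance_commands(schema, tables, reorg=False):
    """Return DB2 CLP commands: optional reorg, then runstats, per table."""
    commands = []
    for table in tables:
        name = f"{schema}.{table}"
        if reorg:
            commands.append(f"reorg table {name}")
        # runstats should follow a reorg, so it always comes last for a table
        commands.append(
            f"runstats on table {name} with distribution and detailed indexes all"
        )
    return commands

for cmd in maintenance_commands("WPSDB", ["USER_DESC", "PAGE_INST"], reorg=True):
    print(cmd)
```

The generated commands could then be fed to the DB2 command line processor as part of a scheduled maintenance job.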


5.4 Memory analysis


5.4.1 Understanding verbose garbage collection output
Garbage collection is the process used by Java Virtual Machines (JVMs) to find and free unused memory in their heap. Garbage collection occurs automatically and invisibly as part of the JVM's normal operation. This is a significant advantage for the Java programmer, who need not worry about explicitly managing memory allocation and deallocation. However, time spent in garbage collection is time during which the processor cannot do other work. Therefore, if a significant amount of time is spent in garbage collection, this can impact overall system performance. So, sometimes it is useful to have more information about the garbage collections being performed by the JVM.

Garbage collectors
The Java specification explains when an object is live or when it is eligible to be garbage collected, but it does not specify how or when garbage collection must be performed. This is left up to the implementer of the JVM, and therefore several different garbage collectors exist. For the IBM JVMs, there are three important garbage collectors:

- Mark-and-sweep collector: the mark-and-sweep collector is the default garbage collector in the IBM JVMs. While it runs, the JVM is stopped from doing other work.

- Concurrent mark-and-sweep collector: a variant of the mark-and-sweep collector, this garbage collector runs one phase of garbage collection (the mark phase) in parallel with the JVM doing other work. The JVM is only stopped from doing other work during the sweep phase. This results in the JVM being paused for shorter periods of time, at the expense of somewhat higher CPU overhead to handle the concurrent processing.

- Generational garbage collector: introduced in Java 5.0 (available with WebSphere Application Server 6.1), this garbage collector can offer dramatically better performance for most WebSphere Portal workloads. It divides the heap into two regions. The first is the nursery, which is garbage collected very quickly, and tends to be garbage collected more frequently. While the nursery is being garbage collected, the JVM is stopped from doing other work. The second region of the heap is the tenured region, which requires longer to garbage collect, but typically is garbage collected only infrequently. Garbage collections of the tenured region can occur concurrently with other work, and only require stopping the JVM for a short period of time at the end of the garbage collection cycle.

More details about the garbage collectors for IBM JVMs can be found in the Java diagnostics guide (see Garbage collection references, below).


Observing garbage collection


A useful way to observe garbage collection is the verbose garbage collection log. This is enabled through a setting on the Java virtual machine settings page in the WebSphere Application Server administrative console. For specific instructions, see Java virtual machine settings in the WebSphere Application Server information center. For this setting to take effect, the WebSphere Portal application server will need to be restarted.

After it has been restarted, garbage collection events will be logged to the file native_stderr.log, which is found in the log directory under the installation root directory for WebSphere Portal. For example, on AIX the default location of the file would be /usr/IBM/WebSphere/wp_profile/logs/WebSphere_Portal/native_stderr.log. Each garbage collection will produce an entry in this log file.

The format of this log is not standardized across all Java virtual machine implementations. The discussion below will focus on the format used by the IBM JVMs on AIX, Linux, and Windows. These JVMs have a format which is consistent across platforms, though it has changed between JVM releases: Java 1.4.2 JVMs use one format, and Java 5.0 JVMs use a different format. Both formats will be discussed below.

Remember that the application server has a lot of warm-up work which occurs at startup: portlets are initialized, caches fill, and so on. So it is usually a good idea to skip over the verbose garbage collection entries for the first 10 minutes or so of any load test.

Reading verbose garbage collection output: Java 1.4.2


The IBM Java 1.4.2 JVMs on AIX, Linux, and Windows use a text-based format, where each garbage collection event produces a block of lines similar to the following:
<AF[77]: Allocation Failure. need 65552 bytes, 24775 ms since last AF>
<AF[77]: managing allocation failure, action=2 (20850408/608696832)>
<GC(77): GC cycle started Thu Oct 09 12:03:29 2008
<GC(77): freed 175308440 bytes, 32% free (196158848/608696832), in 734 ms>
<GC(77): mark: 694 ms, sweep: 40 ms, compact: 0 ms>
<GC(77): refs: soft 0 (age >= 32), weak 1013, final 678, phantom 36>
<AF[77]: completed in 736 ms>
Figure 6: Sample verbose garbage collection log entry (Java 1.4.2)

This log entry contains multiple pieces of information. Keeping in mind our goal of minimizing the performance impact caused by garbage collection, here are several points of interest to look for in the log entries:

- Amount of memory being requested. This is shown as "need 65552 bytes" above; it indicates the size of the block of memory whose allocation triggered the garbage collection. Requests for larger blocks of memory are harder to handle, as the heap tends to become fragmented over time. If you frequently see requests for more than 100K bytes of memory, the application is making work very hard for the heap manager. The source of these requests should be located and, if possible, removed.

- Time to complete the allocation failure. This shows how long the garbage collection cycle took to complete; in the example above, the line "AF[77]: completed in 736 ms" shows that this garbage collection cycle took 736 milliseconds. With the mark-and-sweep collector, the entire JVM is stopped during the garbage collection cycle. Typically, garbage collections will complete in 1-2 seconds. Occasional garbage collections lasting much longer than this (10-15 seconds) do not indicate a problem, but if garbage collection usually takes 3 seconds or more, clients will experience irregular response times. The time spent in garbage collection is influenced primarily by the size of the heap. Reducing the heap size will cause garbage collections to complete more quickly, at the expense of them occurring more frequently. The other factor which can lead to very long garbage collection times is paging; this is discussed in section 4.2.2, Poor response times for most or all operations.

- Portion of heap free after garbage collection. The line "32% free" shows the amount of the heap free after garbage collection completed. This number is very important to track when looking for memory leaks; this will be discussed under Detecting memory leaks, below. The garbage collectors in the IBM Java 1.4.2 JVM will perform very poorly if the amount of free memory after garbage collection drops below 10%. If this occurs, read section 5.9, Reducing time spent in garbage collection.

- Portion of time spent in garbage collection. This is not shown directly in the log entry; it must be computed, as follows:
  - The "time since last AF" value in the garbage collection entry shows the time since the completion of the last garbage collection cycle. This is 24,775 milliseconds in the example above.
  - Add this to the time required to complete the last garbage collection cycle, and you have the spacing between these two garbage collections. This value comes from the "completed in xxx ms" line of the previous garbage collection entry. If we assume it completed in 882 milliseconds, then the spacing between these garbage collections was 24775 + 882 milliseconds, or 25,657 milliseconds.
  - Using the spacing between garbage collections, you can compute the portion of time spent in garbage collection by dividing the time to complete the garbage collection by the spacing between them. We recommend computing this over several garbage collections to get a more accurate estimate. This gives the following formula for the portion of time spent in garbage collection:

    [gc completion time / (time since last AF + previous gc completion time)]

Using the example above, and assuming that the previous garbage collection cycle took 882 milliseconds to complete, we can compute the portion of time spent in garbage collection as follows:

[736 / (24775 + 882)]

This is about 3%, which is a very healthy situation. If the portion of time spent in garbage collection is over 20%, then garbage collection is significantly reducing performance. If the portion is in the range 10-20%, then garbage collection is impacting performance, but not to such a large extent. When the portion is less than 10%, garbage collection is probably not contributing to performance problems. This calculation can be done automatically with the Garbage Collection and Memory Visualizer tool; see Tools for garbage collection logs, below, for more information.
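This calculation is mechanical enough to script. The following sketch is illustrative and not part of the guide; it applies the formula above to pairs of (completion time, time since last AF) values read from consecutive log entries:

```python
# Illustrative helper (not from the guide): compute the fraction of time spent
# in garbage collection from verbose GC log data. Each entry is a pair of
# (completion_ms, ms_since_last_AF) in log order; for each interval the
# formula is completion / (time since last AF + previous completion).

def gc_time_fraction(entries):
    gc_ms = 0.0
    elapsed_ms = 0.0
    for (prev_completion, _), (completion, since_last_af) in zip(entries, entries[1:]):
        gc_ms += completion
        elapsed_ms += since_last_af + prev_completion
    return gc_ms / elapsed_ms

# The worked example: the previous GC took 882 ms; this GC took 736 ms and
# began 24775 ms after the previous one completed.
print(f"{gc_time_fraction([(882, 0), (736, 24775)]):.1%}")
```

For the single interval above this evaluates to about 0.029, matching the roughly 3% figure computed in the text; passing more entries averages the overhead across several collections, as recommended.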

Reading verbose garbage collection output: Java 5.0 (nursery)


In Java 5.0, the IBM JVMs on AIX, Linux, and Windows switched to an XML-based format in place of the format shown above. A single garbage collection event will produce one element in the output. When using the generational garbage collector, remember that there are two types of garbage collections: garbage collections of the nursery and of the tenured region. These two types will produce different log messages, and they have different impacts on performance. For the nursery, one key question is how far apart the nursery allocation failures occur, and how long they take to complete. Secondly, the JVM may issue some warning messages while garbage collecting the nursery; if these occur, they should be investigated and resolved. An allocation failure in the nursery will produce a log message similar to this:
<af type="nursery" id="632" timestamp="May 16 15:26:28 2008" intervalms="1551.227">
  <minimum requested_bytes="56" />
  ...portion deleted...
  <nursery freebytes="218110560" totalbytes="241591808" percent="90" />
  <tenured freebytes="463223592" totalbytes="1610612736" percent="28" >
    <soa freebytes="382693160" totalbytes="1530082304" percent="25" />
    <loa freebytes="80530432" totalbytes="80530432" percent="100" />
  </tenured>
  <time totalms="64.747" />
</af>
Figure 7: Sample verbose garbage collection log entry (Java 5.0, nursery)

There's a great deal of information in this log. Some of it is important for diagnosing performance problems, while other information is detail which is not needed in most circumstances. Here are the most important items:

- Amount of memory being requested. This is shown in the <minimum> element near the top of the entry. The value in requested_bytes is the size of the object being allocated; in this example, it is 56 bytes. Requests for larger blocks of memory are harder to handle, as the heap tends to become fragmented over time. If you frequently see requests for more than 100K bytes of memory, the application is probably making work very hard for the heap manager. The source of these requests should be located and, if possible, removed.

- Time to complete the allocation failure. This gives how long the garbage collection cycle took to complete. It is shown near the bottom of the entry, in the <time> element; the value of totalms shows how long the garbage collection took, in milliseconds. In the example above, the allocation failure took 64.747 milliseconds to complete. During nursery allocation failures, the entire JVM is stopped during the garbage collection cycle. Typically, these garbage collections will complete in less than 150 milliseconds.

- Portion of heap free after garbage collection. When an allocation failure occurs in the nursery, garbage will be cleaned up in the nursery, and some objects will be moved to the tenured region. Thus it is typical to see the amount of free space in the tenured region drop after each allocation failure in the nursery. This space will only be recovered by a garbage collection of the tenured region, discussed below.

- Portion of time spent in nursery allocation failures. This is not shown directly in the log entry; it must be computed, as follows:
  - Find the time since the last nursery allocation failure. This is the intervalms value in the <af> element. In the example above, it is 1551.227 milliseconds.
  - Add this to the time required to complete the previous allocation failure (the totalms value from the previous log entry). If we assume the previous allocation failure completed in 66.113 milliseconds, then the spacing between these garbage collections (rounding to whole numbers) was 1551 + 66 milliseconds, or 1617 milliseconds.
  - Using the spacing between garbage collections, you can compute the portion of time spent in garbage collection by dividing the time to complete the garbage collection by the spacing between them. We recommend computing this over several garbage collections to get a more accurate estimate. This gives the following formula for the portion of time spent in garbage collection:

    [gc completion time / (time since last AF + previous gc completion time)]

Using the example above, where this garbage collection took approximately 65 milliseconds (64.747 ms) and the previous one took approximately 66 milliseconds, we can compute the portion of time spent in garbage collection as follows:

[65 / (1551 + 66)]


This is about 4%, which is a very healthy situation. If this value is over 10%, this indicates performance may be improved by tuning the size of the heap; see Reducing time spent in garbage collection, below.
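Because the Java 5.0 log is XML, the same calculation can be scripted with a standard XML parser. This sketch is illustrative and not from the guide; the sample entry below is abbreviated (real entries also contain <nursery> and <tenured> children), and the 66.113 ms previous completion time is taken from the worked example above:

```python
# Illustrative sketch (not from the guide): parse a Java 5.0 <af> entry and
# compute the nursery GC overhead from intervalms and totalms.
import xml.etree.ElementTree as ET

def parse_af(entry_xml):
    """Return (intervalms, totalms) from one <af> log entry."""
    af = ET.fromstring(entry_xml)
    return float(af.get("intervalms")), float(af.find("time").get("totalms"))

# Abbreviated sample entry; real entries carry more attributes and children.
sample = """<af type="nursery" id="632" intervalms="1551.227">
  <minimum requested_bytes="56" />
  <time totalms="64.747" />
</af>"""

interval_ms, total_ms = parse_af(sample)
# Assuming the previous allocation failure took 66.113 ms, as in the text:
overhead = total_ms / (interval_ms + 66.113)
print(f"{overhead:.1%}")
```

For the sample entry this evaluates to about 0.04, matching the roughly 4% figure in the text; averaging over many <af> entries gives a more stable estimate.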

Reading verbose garbage collection output: Java 5.0 (tenured region)


With the generational garbage collector, the second item to look at is the portion of time spent garbage collecting the tenured region. Garbage collection of the tenured region will show two log entries: one when the garbage collection starts, and a second when it completes. When it starts, a message similar to the following is logged:
<con event="kickoff" timestamp="Sep 25 15:08:15 2008">
  <stats tenurefreebytes="6457736" nurseryfreebytes="114299992"
         tracetarget="911297536" kickoff="114309504" tracerate="8.00" />
</con>

When it completes, a message like the following will be logged:


<con event="collection" id="36" timestamp="Sep 25 15:08:17 2008" intervalms="598764.461">
  ...portion deleted...
  <nursery freebytes="201674352" totalbytes="234677248" percent="85" />
  <tenured freebytes="386608936" totalbytes="1207959552" percent="32" >
    <soa freebytes="386608936" totalbytes="1207959552" percent="32" />
    <loa freebytes="0" totalbytes="0" percent="0" />
  </tenured>
  <time totalms="112.901" />
</con>
Figure 8: Sample verbose garbage collection log entry (Java 5.0, tenured region)

In these log messages, there are a few key items:

- Amount free in the tenured region after GC completes: look for the last <tenured> element in the second log message. This shows the free space in bytes in the freebytes value (386,608,936 bytes in this example) and the percentage free (32% in this example).

- Spacing between tenured region garbage collections: the first log message shows the time when garbage collection of the tenured region begins. By looking at the timestamps on successive entries, you can calculate the spacing between garbage collections of the tenured region.

- Duration of tenured region garbage collections: even with concurrent garbage collection, the JVM must be paused for a period of time. This is reported in the <time> element at the end of the second log message. If this is very high (more than 10 seconds), it may indicate that memory is overcommitted and the system is paging. A more important metric is the amount of time spent running the background collection. This can be calculated from the difference in the timestamps between the first and second log messages; in this case, it was approximately two seconds (from 3:08:15 PM to 3:08:17 PM).


With the spacing between tenured region garbage collections and the duration of those garbage collections, it is possible to calculate the portion of time spent garbage collecting the tenured region. In a well-behaving system, this ratio should be less than 5%; if it is higher, see Reducing time spent in garbage collection, below.
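The timestamp arithmetic above can be scripted as well. This sketch is illustrative and not from the guide; it assumes the timestamp format shown in the sample entries:

```python
# Illustrative sketch (not from the guide): elapsed seconds between two
# verbose GC timestamps, such as the kickoff and completion entries of a
# concurrent (tenured region) collection.
from datetime import datetime

TS_FORMAT = "%b %d %H:%M:%S %Y"  # matches e.g. "Sep 25 15:08:15 2008"

def seconds_between(start, end):
    """Return end - start in seconds for two log timestamps."""
    return (datetime.strptime(end, TS_FORMAT)
            - datetime.strptime(start, TS_FORMAT)).total_seconds()

# Duration of the background collection in the example above:
print(seconds_between("Sep 25 15:08:15 2008", "Sep 25 15:08:17 2008"))  # 2.0
```

The same function applied to the timestamps of successive kickoff entries gives the spacing between tenured region collections, from which the ratio discussed above follows directly.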

Reducing time spent in garbage collection


Garbage collection will occur whenever the heap does not have a block of free memory large enough to satisfy an allocation request. The following are ideas for reducing the frequency of garbage collections:

- Increase the size of the Java heap: increasing the size of the Java heap will make more memory available, which will make garbage collection occur less frequently. However, before doing this, read JVM max Heap Size Limits in the WebSphere Portal Tuning Guide for more guidance.

- Balance nursery and tenured region sizes: with the generational garbage collector, you can tune the amount of space allocated to the nursery. In our performance benchmarks, we found that a nursery size of 256MB gives a good balance between the length and frequency of nursery GCs. However, different scenarios may require a different setting. If the nursery allocation failures are completing quickly (less than 250 milliseconds) but are coming too close together, experiment with increasing the size of the nursery. On the other hand, if the nursery allocation failures often take more than 250 milliseconds, try decreasing the size of the nursery. The size of the nursery is controlled with the JVM parameter -Xmn.

- Too-frequent tenured region collections: if garbage collections of the tenured region are occurring too frequently, this will increase processor utilization on the system and impact user response times. If possible, increase the size of the Java heap (see above). Otherwise, the best alternative is to look into long-term memory usage, as the tenured region is where long-lived objects are stored. How to do this is discussed below.

- Transient memory allocation: garbage collection occurs because many objects are being allocated, so one way to reduce garbage collection is to reduce the amount of memory being allocated. This involves looking into the code of the applications (portlets) used on the site, trying to reduce the number and size of objects created on each page view.

- Long-term memory usage: long-term memory usage refers to objects which remain live for extended periods of time (minutes or more) and cannot be garbage collected. Again, this involves looking into the code of the applications used on the site. In this case, the goal should be to reduce the amount of memory which remains live for extended periods of time. A common example would be objects stored in the session: look carefully at such usage, and keep it to a minimum.


- Sessions: many portlets make use of sessions for storing information about their interaction with the user. Since sessions have a long lifetime, memory used by sessions can become a large contributor to heap usage. Ideally, portlet sessions should be less than 10K bytes in size on average; sessions larger than 100K bytes are large and can form a significant performance bottleneck. WebSphere Performance Monitoring Infrastructure (PMI) can be used to view session sizes. Multiple tools are available to read PMI data; one provided with WebSphere Application Server is Tivoli Performance Viewer.

- Caches: this is another example of long-term memory usage. Caching involves a tradeoff: memory is consumed by the cache, giving a savings because the cached data does not need to be retrieved or computed as frequently. If your application is constrained by memory consumption, then you can often get a performance improvement by reducing the sizes of caches, making more memory available for other work. Carefully measure the effects of these changes to make sure that you are getting a benefit, rather than just replacing one performance problem with another one.

- Frequent requests for large blocks of memory: allocation requests for large blocks of memory are harder for the JVM to handle. When looking at the verbose garbage collection log, look at how frequently the amount of memory being requested is greater than 100K. If allocations of this size are occurring frequently, then these allocation requests may cause a significant performance impact. In this case, analyze the application to locate and remove these allocation requests. IBM Support offers a tool which can help locate the cause of large allocation requests; see http://www-01.ibm.com/support/docview.wss?uid=swg21236523 for more information.

One other point of note about the verbose garbage collection output: there are some background actions in WebSphere Portal which need to allocate large blocks of memory (100K bytes or more). So if you see a pattern where garbage collections are spaced fairly far apart, but occasionally several garbage collections come close together, this should not be a cause for concern.
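To make the heap-tuning options above concrete, the corresponding JVM arguments on an IBM Java 5.0 JVM might look like the following. The sizes shown are purely illustrative starting points, not recommendations for any particular site:

```
-Xgcpolicy:gencon -Xmn256m -Xms1536m -Xmx1536m
```

Here -Xgcpolicy:gencon selects the generational collector, -Xmn sets the nursery size, and -Xms/-Xmx set the initial and maximum total heap size. Validate any change against the verbose garbage collection log, as described earlier in this section.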

Tools for Garbage Collection Logs


Several tools exist for reading and displaying verbose garbage collection logs.
Pattern Modeling and Analysis Tool

Pattern Modeling and Analysis Tool for IBM Java Garbage Collector (also known as PMAT) is available as a free download from IBM alphaWorks at http://www.alphaworks.ibm.com/tech/pmat.



Garbage Collection and Memory Visualizer

To use the Garbage Collection and Memory Visualizer (GCMV), you'll need IBM Support Assistant (ISA). ISA is a tool with a large number of plugins for different tasks, one of which is GCMV. For more information about ISA, see IBM Support Assistant, below.

Garbage Collection and Memory Visualizer is a powerful and flexible tool for examining verbose garbage collection logs. An overview of this tool is available in the article at http://www.ibm.com/developerworks/java/library/j-ibmtools2/, and a tutorial is available at http://publib.boulder.ibm.com/infocenter/ieduasst/v1r1m0/index.jsp?topic=/com.ibm.iea.was_v7/was/7.0/ProblemDetermination/WASv7_GCMVOverview/player.html.

When using GCMV, the usual starting point is to look at a graph of used heap vs. total heap size. The two examples below were taken from WebSphere Portal 6.1, using the generational garbage collector. The first example shows a healthy sawtooth pattern: memory usage increases as the tenured region slowly fills, and then a garbage collection of the tenured region recovers a significant amount of memory.

Figure 9: GC and Memory Visualizer: healthy heap


In contrast, this example shows a case with too many live objects in the heap: garbage collections of the tenured region come very close together, and little memory is freed by each garbage collection of the tenured region.

Figure 10: GC and Memory Visualizer: overly full heap

This problem is further illustrated by switching to the table view in GCMV. Entries in the third column show when garbage collections occur in the tenured region. Not only are they close together, but looking in the fourth column, we can see that only a small amount of memory is being freed. Therefore the tenured region quickly fills up again, and the painful cycle of excessive garbage collection continues.


Figure 11: GC and Memory Visualizer: table view


visualgc Tool

visualgc is a tool for use with Sun JVMs. It is part of the jvmstat performance tools, which are available from Sun. More information is available at http://java.sun.com/performance/jvmstat/.

Garbage Collection References


There is a great deal of information on the Web regarding Java garbage collection. The following are some sources we have found helpful:

- IBM developer kits: Diagnosis documentation, at http://www-128.ibm.com/developerworks/java/jdk/diagnosis/
- Java technology, IBM style: Garbage collection policies, a two-part article on IBM developerWorks. The first part is at http://www.ibm.com/developerworks/java/library/j-ibmjava2/index.html?S_TACT=105AGX02&S_CMP=EDU, and the second is at http://www.ibm.com/developerworks/java/library/j-ibmjava3/index.html?S_TACT=105AGX02&S_CMP=EDU.
- Tuning Garbage Collection with the 5.0 Java Virtual Machine (for the Sun JVM), at http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html.

5.4.2 Detecting and resolving memory leaks


If the amount of memory in use in the Java virtual machine steadily increases over time, this will reduce performance. The Java virtual machine will have to spend more of its time performing garbage collection, and will have less time to devote to running the application code.

To determine if this is happening in your application, enable verbose garbage collection for the WebSphere Portal application server; enabling this and understanding the results is discussed in the previous section (Understanding verbose garbage collection output). Once verbose garbage collection is enabled, restart the WebSphere Portal application server. Then, re-run the test which may be showing the memory leak. If possible, apply a constant load to the system for at least a period of several hours. The application server will log garbage collection events.

After this time period, look first at the "time since last AF" (time since last allocation failure). Once the server has reached a steady-state load, this should remain roughly constant over time. If, on the other hand, the time between allocation failures is getting shorter and shorter, this may indicate a problem. Second, look at the amount of free memory in the heap after garbage collection completes. This is the first number after the percent free; in the example in the section above, 196,158,848 bytes, or around 187 megabytes, of memory is free in the Java heap. Again, this value should remain roughly constant over time, as long as the load remains constant.

If these two values are not remaining constant, you may have a memory leak, or you may not. If you see a very steep drop in these values, there is probably a memory leak. A slower drop, however, can be due to the behavior of the garbage collector. In the IBM 1.4.2 JVMs, a normal garbage collection is unable to free certain memory even though it is not in use. All of the available memory is not freed until the heap is compacted, which does not occur on every garbage collection. You can tell when the heap is compacted by looking at the time spent in the compact phase of garbage collection. Consider this example:
<AF[1617]: Allocation Failure. need 131088 bytes, 26409 ms since last AF>
<AF[1617]: managing allocation failure, action=2 (1139981632/2147482112)>
<GC(1617): GC cycle started Thu Nov 13 19:42:08 2008
<GC(1617): freed 229932680 bytes, 63% free (1369914312/2147482112), in 11218 ms>
<GC(1617): mark: 401 ms, sweep: 54 ms, compact: 10763 ms>
<GC(1617): refs: soft 0 (age >= 32), weak 597, final 71, phantom 23>
<GC(1617): moved 10385672 objects, 614271536 bytes, reason=16, used 162272 more bytes>
<AF[1617]: completed in 11220 ms>
Figure 12: Sample verbose garbage collection log entry with compaction (Java 1.4.2)

Line 5 shows the time spent in each phase of garbage collection. If the compact time is more than 0 milliseconds, then this garbage collection cycle included compaction of the heap. A heap compaction may include the entire heap, or only a portion of it; the only way to tell is to look at several sequential GCs. If each includes a compaction, then this usually indicates that a portion of the heap is being compacted in each GC.

After the compaction is complete, the heap is at its cleanest state. If the amount of free memory at that point hasn't recovered to previous levels, then a memory leak looks more likely. If possible, let the test run long enough for several compaction cycles to complete, and check the free memory after each compaction cycle; a continued drop indicates a likely memory leak. More discussion of partial compaction is given in the article Mash that Trash, referenced in the previous section.

If you have identified a memory leak, then what is the next step? Locating the cause of a memory leak and fixing the problem can be difficult, but there are many articles on the Web relating to locating and solving memory leaks in Java applications. One article which discusses useful techniques and tools is Java memory leaks: Catch me if you can, on IBM developerWorks at http://www-128.ibm.com/developerworks/rational/library/05/0816_GuptaPalanki/.
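The "check free memory after each compaction" procedure lends itself to a quick script. This sketch is illustrative and not from the guide: it fits a least-squares line to the free-heap values recorded after successive compactions, and a clearly negative slope supports the memory-leak hypothesis while a slope near zero does not:

```python
# Illustrative sketch (not from the guide): fit a least-squares line to the
# free-heap values observed after successive heap compactions. A clearly
# negative slope (free memory dropping compaction after compaction) supports
# the memory-leak hypothesis; a slope near zero does not.

def free_heap_trend(free_bytes):
    """Least-squares slope of the series, in bytes per compaction cycle."""
    n = len(free_bytes)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(free_bytes) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, free_bytes))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Free heap after each of four compactions (values are made up):
print(free_heap_trend([800e6, 795e6, 802e6, 798e6]))  # near zero: no leak signal
print(free_heap_trend([800e6, 700e6, 610e6, 505e6]))  # strongly negative: likely leak
```

This is only a screening aid; a downward trend still needs to be confirmed against the log itself, since heap behavior during warm-up or load changes can also shift these values.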


5.5 Page loading and composition


5.5.1 Breaking down page load times
A single Web page generally consists of many elements:
- The HTML page
- Graphics
- Style sheets
- Client-side scripts (such as Javascript)

IBM Page Detailer


It can be difficult to determine how many different elements make up a single page, and even more difficult to see how long each takes to load. A tool which can help illustrate this is IBM Page Detailer (available from http://www.alphaworks.ibm.com/tech/pagedetailer). This tool monitors Web traffic on a client system. As the user interacts with a Web site, Page Detailer displays, for each page, the components fetched by the browser, and the response time for each. Page Detailer has two main views. The first is the chart view, which shows the processing of a request over time:

Figure 13: IBM Page Detailer: Chart view


The second view is the Details view, which lists all of the elements involved in a page:

Figure 14: IBM Page Detailer: Details view

This lets you extract several important pieces of information:

- Are redirects involved in fetching the page? Page Detailer marks redirections with a distinctive icon.

- How many elements are required to load the page? Each row in the Details view corresponds to a single page element, such as an HTML page, a stylesheet, or an image. Counting the rows in the Details view therefore shows how many elements the browser fetched for that page view.

- Examine element response times. IBM Page Detailer breaks down the response time for each page element into several parts, such as connection time, server response time, and so on. The individual components of each element can be seen in the Chart view by holding the mouse pointer over the appropriate section of the colored bar for that element, as shown in the first example. In the Details view, the Item Duration column shows the full time to download the element; clicking the column header sorts the elements by response time. This is useful for finding elements with particularly long download times; you can then investigate the components of the download time by looking at the Graph Item Duration column, or at the Chart view.

- Examine element sizes: the Item Size column in the Details view shows the size of the object returned by the HTTP server; clicking the column header sorts elements by size. The Item Size will show 0 bytes when the response from the server contained no content. A common explanation for this is a 304 (Not Modified) response from the HTTP server. You can see the HTTP request and response by selecting the Chart view and double-clicking an element.

From this information, consider the following:

- Redirects: a redirect requires the browser to issue a second request to fetch an element, which adds an extra network round-trip to the response time. Redirects do not need to be avoided at all costs, but they should only be used when needed.

- Number of elements: there is no rule which says how many elements is too many. As a rough guide, our experience has shown that when more than 30 elements are required to display a page, the site may experience performance problems. However, this can vary greatly depending on network conditions.

- Establishing connections: if HTTP keep-alive connections are being used properly, a new connection should not be opened for each element on the page. The time to open a new connection appears as a yellow bar in the Graph Item Duration column. If all (or almost all) elements contain a yellow bar, then HTTP keep-alive connections are not being used properly. This is often due to a configuration problem at the HTTP server, and is discussed in more detail in the section on HTTP Server Connection Limits.

- Element response times: for elements with high total response times, look at the breakdown of the response time. A high connect time may indicate that the server serving that content is overloaded or has reached its connection limits; see HTTP Server Connection Limits for more information. A high server response time (the time to the first component of the response) is often caused by an overloaded hosting server; look at the CPU utilization on the server responsible for that element. A high download time may be caused by a particularly large download, or by a network bottleneck; the item size is useful for judging which is the more likely explanation.

- Item size: this information is useful for understanding the bandwidth requirements of the site, and in particular the performance impact for users with limited bandwidth.

- Content-encoding: Page Detailer allows adding columns for additional data it has captured; one useful column is the content encoding. Items with gzip or deflate in this column have been compressed, while a blank entry indicates that no compression is being used. This is a quick way to check whether all of the compressible content in the page has been compressed. To add a column, right-click in the display area and select Add Column, as in this example:


Figure 15: IBM Page Detailer: adding a column

This can then be used to locate items which are compressed at the server:

Figure 16: IBM Page Detailer: content encoding
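The same compression check can also be scripted against data exported from the Details view. A minimal sketch, where the element list, content types, and the set of "compressible" types are illustrative assumptions rather than real Page Detailer output:

```python
# Flag compressible page elements which were served without compression.
# The element tuples below are hypothetical examples for illustration only.
COMPRESSIBLE = {"text/html", "text/css", "application/javascript"}

def uncompressed_elements(elements):
    """elements: iterable of (url, content_type, content_encoding) tuples."""
    return [url for url, ctype, encoding in elements
            if ctype in COMPRESSIBLE and encoding not in ("gzip", "deflate")]

elements = [
    ("/wps/portal", "text/html", "gzip"),
    ("/ui/v8/css/screen.css", "text/css", ""),   # compressible but not compressed
    ("/images/logo.gif", "image/gif", ""),       # GIFs are already compressed
]
print(uncompressed_elements(elements))  # → ['/ui/v8/css/screen.css']
```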

Page Detailer and SSL


Page Detailer is able to capture interactions over SSL as well as unencrypted HTTP traffic, but only when using Internet Explorer; when using other browsers, Page Detailer can only capture unencrypted traffic.


For SSL traffic, Page Detailer will display the time required to complete the secure connection separately from the base connection request. In the following example, the yellow region represents the time for the base connection to complete, and the purple region is the time for completing the secure connection:

Figure 17: IBM Page Detailer: SSL connection timing

It is important to note that the browser will try to reuse the SSL connection to retrieve subsequent page elements (such as images). However, the HTTP server must accept the request to reuse the connection. So if you see the SSL connect time included on every request (or nearly every request), it is possible that the HTTP server is not configured to reuse connections or it has a very low limit on the number of requests which can be made on each connection.

Other client-side monitoring tools


Another way to monitor client-side performance is to use a browser add-on which gathers and displays this data. Some tools we have experience with:

- Firebug: be wary of this tool. When we attempted to validate its results, we saw incorrect results both in the list of requests which were made and in the timings on those requests.

- YSlow: based on Firebug, this add-on has the same problems as Firebug.

- Load Time Analyzer: a Firefox add-on, available from https://addons.mozilla.org/en-US/firefox/addon/3371. This tool gathers data about the content being loaded in the page and gives timings and sizes.

5.5.2 Reducing the number of elements in a page


Reducing the number of elements that must be retrieved by the client to render a page provides several performance benefits, so it is worth considering how to achieve this. Some strategies which have worked successfully include:

- Consolidating files: often, a site will have several teams developing content such as client-side code (for example, Javascript) or style sheets, resulting in separate files from each team. In this case, a performance boost can be realized by merging these files: a single consolidated file loads more quickly than multiple smaller files.

- Eliminating nested inclusion: certain file types have structures that cause the client to include another URL from within the file. These cause a performance hit, as they trigger an additional request for the nested content. For example, the following code in a style sheet instructs the client to include an additional style sheet:

  @import url("http://www.example.com/ui/v8/css/screen.css");

  Client-side includes have all the disadvantages of forcing the client to load multiple files, with the added disadvantage that the client must process the base file before it even knows that the inclusion is required. This further extends the time required to load the page.

- Removing images from the page: examine the images being used in the page. Are all of them really needed? For example, many sites use images to provide spacing or solid-color backgrounds. The same effect can usually be achieved by other means, such as table markup or styles provided by cascading style sheets. Table markup, for example, adds a few extra bytes to the response, which is more efficient than adding an extra image which must be downloaded.

- Client-side caching: browsers cache content according to the HTTP caching rules. Sites can get a performance boost by specifying long cache lifetimes for static content which is not expected to change frequently, allowing browsers to cache it for that period of time. For example, most sites have a set of graphic files representing icons on their pages; usually these files change rarely, if at all. For files which will not change often, we recommend cache lifetimes of 1 week or more. The following HTTP header specifies that a file is cacheable for a period of 1 week (604,800 seconds):

  Cache-Control: public, max-age=604800

  Apache-based HTTP servers, such as IBM HTTP Server, can use mod_expires or mod_headers to add HTTP headers. Other HTTP servers use different ways to specify cache lifetimes for the content they serve; see your HTTP server documentation for details.
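For Apache-based servers such as IBM HTTP Server, a configuration along these lines can apply such lifetimes to static content. This is a sketch only: it assumes mod_expires is available, and the content types and lifetime shown are illustrative choices, not requirements.

```apache
# Illustrative sketch: cache common static types for 1 week (604,800 seconds)
LoadModule expires_module modules/mod_expires.so
<IfModule mod_expires.c>
    ExpiresActive On
    ExpiresByType image/gif  "access plus 1 week"
    ExpiresByType image/png  "access plus 1 week"
    ExpiresByType text/css   "access plus 1 week"
</IfModule>
```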

Caching and Javascript


A related topic is how to deliver Javascript. Javascript can be embedded in an HTML document, or placed in a file to be downloaded separately. Javascript which is placed in the HTML document (for example, in the theme JSPs) is transferred on every page view, so it's best to minimize this whenever possible. Our recommendation is to place Javascript which must be dynamically generated within the HTML document, and put all static Javascript in a separate Javascript file. This way the static Javascript file is cacheable, and the amount of content which must be transferred on every page view is minimized. For example, some portlets use Javascript, and use JSP tags to ensure that they have unique IDs associated with function or variable names. This Javascript needs to be dynamically generated so that the IDs can be created. However, if a library of Javascript functions is used, that library should be placed in a separate file, so that it can be cached.

Caching and cookies


When using HTTP proxy servers to cache static content in your network, make sure that you understand the interaction between cookies and content caching. When a request contains a cookie (sent in a Cookie: HTTP header), the response is not cacheable in a shared cache unless the response explicitly states that it is publicly cacheable. The HTTP server does this by including the header Cache-Control: public in the response. If this is not present, most HTTP proxies will not cache the content. This does not affect the caching of content in private caches, such as a Web browser's cache.
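For example, a response which should remain cacheable in shared proxies even though the request carried a Cookie: header might look like the following illustrative exchange; the key line is the Cache-Control header:

```
HTTP/1.1 200 OK
Content-Type: image/gif
Cache-Control: public, max-age=604800
```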

5.6 Application server performance analysis


5.6.1 Performance monitoring and profiling tools
As described in section 1.3, resolving performance problems is very difficult without proper tools. This section lists some tools which we have found useful for performance monitoring and profiling. This is not meant as an exhaustive list; there are other tools not listed here which can be used for these tasks.

Performance monitoring and profiling are two related, but separate, tasks. Performance monitoring can be done on a system running under load; it helps identify when a performance problem occurs, and can point to problematic areas. Profiling is more resource-intensive, so it often cannot be done under load. The goal of profiling is to find ways to speed up an application, so it is typically performed on a portlet or other application which is known to have a performance problem.

When monitoring a site under load, it can be useful to examine the response times of individual portlets. Often, a few portlets will display significantly higher response times than the others. If this is the case, these should be the first ones to be examined in more detail.

The following tools are useful for monitoring the performance of a WebSphere Portal system:

- IBM Tivoli Composite Application Manager for WebSphere. For more information, see http://www-306.ibm.com/software/tivoli/products/composite-application-mgr-websphere/

- Wily Portal Manager for IBM WebSphere Portal. For more information, see http://www.wilytech.com/solutions/products/PortalManager_IBM.html

Another discussion of some monitoring tools and techniques for WebSphere Portal can be found in the developerWorks article, Monitoring performance in a WebSphere Portal environment, at http://www.ibm.com/developerworks/websphere/library/techarticles/0810_bickmore/0810_bickmore.html.

The following tools are useful for profiling portlets:

- Performance Inspectors JPROF, a profiling agent that captures code execution flow, CPU instructions and cycles, response times, and live/allocated memory. See http://perfinsp.sourceforge.net/jprof.html for additional information.

- Visual Performance Analyzer (VPA), an Eclipse-based performance visualization toolkit which allows you to analyze the output from JPROF, among other things. See http://www.alphaworks.ibm.com/tech/vpa for additional information.

- IBM Rational development tools, such as IBM Rational Web Developer for WebSphere Software or IBM Rational Application Developer for WebSphere Software. More information is available at http://www-306.ibm.com/software/rational/sw-bycategory/subcategory/SW710.html

- JProbe Suite from Quest. See http://www.quest.com/jprobe/ for additional information.

5.6.2 Using Javacores


What are Javacores?
Javacores (commonly known as thread dumps) can be a useful source of data for debugging performance problems. A javacore file is a snapshot of a running Java process. Javacore files are produced for a variety of error conditions in an IBM Java virtual machine, but you can also force the JVM to produce a javacore as a problem diagnostic tool.

It is important to understand the limitations of using javacore files to debug problems. Producing a javacore is a disruptive process: the JVM is blocked while the javacore is being generated, which can take 30 seconds or more. If users are interacting with the application server at the time, their requests will not complete until the javacore is generated. In addition, a javacore is a snapshot of activity at the moment it is generated; it cannot capture historical trends. Therefore you will need multiple javacores, taken at different times, to establish whether the observed results are typical.

Collecting Javacores
An overview of javacores, including instructions on collecting them, is available on the Web at http://www-01.ibm.com/support/docview.wss?uid=swg21181068. On UNIX operating systems, collecting a javacore is usually as simple as sending a SIGQUIT signal to the java process; this can be done with the command kill -3 pid, where pid is the process ID of the java process. On Windows, the wsadmin command can be used to collect the javacore. This is discussed in steps 2-4 of http://www-01.ibm.com/support/docview.wss?rs=0&uid=swg21111364.
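Since multiple javacores taken over time are usually needed, the UNIX procedure above can be scripted. A minimal sketch, assuming an IBM JVM where SIGQUIT triggers a javacore; the kill and sleep hooks are parameters only so the sketch can be exercised without signalling a real process:

```python
import os
import signal
import time

def take_javacores(pid, count=3, interval=60, kill=os.kill, sleep=time.sleep):
    """Send SIGQUIT to the JVM 'count' times, 'interval' seconds apart.
    On an IBM JVM, each signal produces one javacore file."""
    for _ in range(count):
        kill(pid, signal.SIGQUIT)
        sleep(interval)

# Dry run: record the calls instead of signalling process 12345 for real.
sent = []
take_javacores(12345, count=2, interval=0,
               kill=lambda pid, sig: sent.append((pid, sig)),
               sleep=lambda seconds: None)
print(len(sent))  # → 2
```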

Analyzing Javacores in a text editor


As javacore files are text files, it is possible to analyze them by viewing them in a text editor. In our experience, the most useful portions of the javacore for resolving performance problems are the JVM monitor information section and the JVM thread dump. The monitor information section gives information about the locks within the JVM, and the thread dump shows the current call stack for every thread in the JVM.

To locate the monitor information section in the javacore, look for the text LOCKS subcomponent dump routine, which is at the beginning of that section. For each monitor, this section shows the name of the monitor, the thread which owns the monitor (if any), and a list of threads waiting to enter or be notified by the monitor. From a performance point of view, it's important to locate monitors which have a large number of threads waiting to enter or be notified; this indicates a point where threads are blocked. In particular, look for WebContainer threads which are waiting for a monitor, as the WebContainer threads are the ones which process Web client requests. For example, consider the following monitor:

Figure 18: Javacore: lock detail

In this case, 23 threads are waiting to enter a single monitor, which indicates that access to that monitor is probably causing a performance problem.


After looking at the monitor information, proceed to the thread dump section; look for the text THREADS subcomponent dump routine at the beginning of this section. This section contains the call stack for each thread in the JVM. Usually the most important threads to focus on are the WebContainer threads. On these threads, the key is to determine whether multiple threads are at the same point in the execution flow; this might indicate a hot spot (a bottleneck within the code). Before deciding that a specific location really is a hot spot, check multiple javacore files to see if the same region of code is frequently active. If it is, then that area of the code should be carefully examined to see if it is causing a performance problem.
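A rough version of this "same point in the execution flow" check can be scripted. The sketch below assumes the IBM javacore layout, where thread names appear on 3XMTHREADINFO lines and stack frames on 4XESTACKTRACE lines; verify these tags against your JVM's output, and note that the sample javacore text is invented for illustration:

```python
from collections import Counter

def webcontainer_top_frames(javacore_text):
    """Tally the top stack frame of each WebContainer thread in a javacore."""
    counts = Counter()
    in_webcontainer = False
    frame_taken = False
    for line in javacore_text.splitlines():
        if line.startswith("3XMTHREADINFO "):
            in_webcontainer = '"WebContainer' in line
            frame_taken = False
        elif line.startswith("4XESTACKTRACE") and in_webcontainer and not frame_taken:
            counts[line.split("at ", 1)[-1].strip()] += 1  # top frame only
            frame_taken = True
    return counts

sample = """\
3XMTHREADINFO "WebContainer : 0" (TID:0x1)
4XESTACKTRACE    at java/net/SocketInputStream.socketRead0(Native Method)
4XESTACKTRACE    at java/net/SocketInputStream.read(SocketInputStream.java:129)
3XMTHREADINFO "WebContainer : 1" (TID:0x2)
4XESTACKTRACE    at java/net/SocketInputStream.socketRead0(Native Method)
3XMTHREADINFO "GC Slave" (TID:0x3)
4XESTACKTRACE    at some/other/Frame.run(Frame.java:10)
"""
print(webcontainer_top_frames(sample).most_common(1))
```

Frames which top the tally across several javacores are candidates for closer inspection.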

Javacore tools: Thread and Monitor Dump Analyzer for Java


While it's certainly possible to examine javacore files in a text editor, other tools exist to help process them. One useful tool is the IBM Thread and Monitor Dump Analyzer for Java, a free download from IBM alphaWorks at http://www.alphaworks.ibm.com/tech/jca. This tool allows loading multiple javacore files, and can give a color-coded breakdown of the data in each javacore, as in this example:

Figure 19: Thread and Monitor Dump Analyzer



For additional details on using this tool, see the help within the application.

Javacore tools: ThreadAnalyzer


ThreadAnalyzer is another tool for examining javacores. It's a plugin for IBM Support Assistant (ISA); more information on ISA is given below. ThreadAnalyzer can be used to examine a single javacore or multiple ones. First, load the javacore(s) by selecting Thread Dumps -> Open Existing File. Then select the threads to focus on. For WebSphere Portal performance problems, the servlet thread pool (the WebContainer threads) is usually the one to focus on. Do this by selecting that group in the left pane of the application:

Figure 20: ThreadAnalyzer: select servlet thread pool

Next, look in the right pane for hotspots. In ThreadAnalyzer, threads are grouped by the top method in the call stack. The Method column gives the name of the method at the top of the call stack, and the #Same column shows the number of threads in that method. In the example below, we can see that 11 threads (22% of that thread pool) are in the method java.net.SocketInputStream.socketRead0. That's a generic method showing that the portlet is reading information over a TCP/IP socket, so we need to drill deeper to find what is reading the information.


Figure 21: ThreadAnalyzer: select hotspot

To do this, highlight the method, and the lower-left pane will show the threads which are in that method. Selecting one of the threads will show a full call stack for that thread. In this example, we can look up the call stack and see that a database query is being prepared, because the method WSJdbcConnection.prepareStatement is being called. We could then investigate further up the call stack to find out why this thread is processing a database query. The next step would be to look at the other threads in this method to see if they are doing similar actions.

Figure 22: ThreadAnalyzer: view thread call stack


One additional point: Javacores from Sun JVMs are all written to a single file, but ThreadAnalyzer expects to find each Javacore in a separate file. To use ThreadAnalyzer with Javacores from a Sun JVM, first open the native_stdout.log file in a text editor, and save each Full thread dump section to a separate file. Then these files can be opened in ThreadAnalyzer.
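The splitting described above can be automated. A minimal sketch, assuming each dump section begins with the literal text "Full thread dump" (as HotSpot JVMs print); the sample log text is invented:

```python
def split_thread_dumps(log_text):
    """Split a native_stdout.log into one string per 'Full thread dump'
    section, so each can be written to its own file for ThreadAnalyzer."""
    dumps, current = [], None
    for line in log_text.splitlines(keepends=True):
        if line.startswith("Full thread dump"):
            if current is not None:
                dumps.append("".join(current))
            current = []          # start collecting a new dump
        if current is not None:
            current.append(line)  # ignore log lines before the first dump
    if current is not None:
        dumps.append("".join(current))
    return dumps

log = ("server startup messages\n"
       'Full thread dump Java HotSpot(TM) Server VM:\n"Thread-1" ...\n'
       'Full thread dump Java HotSpot(TM) Server VM:\n"Thread-2" ...\n')
print(len(split_thread_dumps(log)))  # → 2
```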

5.6.3 System monitoring


It is important to monitor resource utilization on the systems used in a WebSphere Portal setup. This is done differently on different platforms, and much has been written on this topic. The following guides may be helpful:

- AIX: the book AIX 5L Practical Performance Tools and Tuning Guide contains a great deal of information on this topic. It is available at http://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/SG246478.html

- Linux: the book Performance Tuning for Linux Servers provides information on both performance monitoring and tuning. More information is available at http://www.phptr.com/bookstore/product.asp?isbn=013144753X&rl=1

- Solaris: information on performance monitoring in Solaris is in the book System Administration Guide: Advanced Administration, in the chapter Managing System Performance Topics. This book is available at the Sun documentation site, http://docs.sun.com/

- Windows: see Monitoring Performance on Microsoft TechNet, at http://www.microsoft.com/technet/prodtechnol/windowsserver2003/library/ServerHelp/8e9e5b72-13af-4525-9b60-52864b9d49b2.mspx

5.6.4 Using Tivoli Performance Viewer


WebSphere Application Server uses Performance Monitoring Infrastructure (PMI) counters to report performance data. A number of tools exist for reading and viewing PMI data; one of them, Tivoli Performance Viewer (TPV), is provided as part of WebSphere Application Server. In WebSphere Application Server 6.0 and later, Tivoli Performance Viewer is part of the Integrated Solutions Console used to administer the application server.

One use of TPV is to monitor the response time of portlets. While TPV does not show portlets directly, it can monitor servlets, and each portlet has a corresponding servlet. Each JSP within a portlet can also be separately monitored in this way. See the section Monitoring performance in the WebSphere Application Server information center for information on how to enable this monitoring; this section can be found in the table of contents under Monitoring > Task overviews > Monitoring performance.

The following steps describe how to investigate portlet response time. Tivoli Performance Viewer can also display information about other components of the application server; for more information, refer to the WebSphere Application Server information center section referenced above.

1. Start at the homepage for the Integrated Solutions Console, and then log in:

Figure 23: Integrated Solutions Console: login page

2. Check that PMI is enabled. Using the navigation links on the left, go to Monitoring and Tuning > Performance Monitoring Infrastructure. Verify that Enable Performance Monitoring Infrastructure (PMI) is checked:

Figure 24: Integrated Solutions Console: enabling PMI

3. Then under Performance Viewer, select Current Activity to view current information, and select the application server to monitor. In a non-clustered environment, the server will be WebSphere_Portal. This will show a tree view of modules which can be monitored. The best place to start is usually the servlet summary report. Expand Summary Reports and click on Servlets:

4. In the servlet summary report, there are several important items; here are some we have found useful:

- It's often best to sort this report by average response time, as this shows the slowest portlets first. Click the arrows beside Avg. Resp Time to sort on this column, then look for items with a high average response time. In the example below, the slowest components are index.jsp and PortletManager. The times are in milliseconds, so an average response time of 1,313 is about 1.3 seconds. Looking in the Application column, we can see that these are part of PortletManager.war, the webapp for the Manage Portlets portlet.

- Ignore entries from the application isclite#isclite.war. This is the administrative console application, so its response times are probably not important to end users.

- An important servlet to locate is portal within the application wps#wps.war. This is the portal gateway servlet; page requests go through it. The average response time for this servlet gives the overall average response time as seen at the application server. This can then be compared against the end-user response time, in particular the end-user response time for the page markup as seen in a tool like IBM Page Detailer. If Page Detailer shows high response times but you see a fast response time here, then the problem is in the network, not the application server. But if portal shows poor response times, then the problem is within the application server.

These items are illustrated by this example:

Figure 25: Tivoli Performance Viewer: average response times
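The comparison between the portal servlet time and the end-user time amounts to simple arithmetic. A sketch; the function name and the 100 ms tolerance are illustrative assumptions, not values from this guide:

```python
def locate_bottleneck(end_user_ms, portal_servlet_ms, tolerance_ms=100):
    """Compare the end-user page time (e.g. from IBM Page Detailer) with the
    'portal' servlet's average response time from Tivoli Performance Viewer."""
    if end_user_ms - portal_servlet_ms > tolerance_ms:
        return "network"            # server is fast; time is lost in transit
    return "application server"     # the server itself accounts for the time

print(locate_bottleneck(end_user_ms=2500, portal_servlet_ms=300))   # → network
print(locate_bottleneck(end_user_ms=1400, portal_servlet_ms=1350))  # → application server
```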


If you're seeing slow response times at the application server, it's a good idea to dig deeper into the PMI metrics. You can examine an individual application in more detail by expanding Performance Modules in the tree and selecting an individual application or other metric. This can help identify the cause of the slow performance. If it is still not clear what is leading to poor response times from a portlet, then a more detailed look into the portlet is required; see Performance monitoring and profiling tools for tools which can help with this.

Using Tivoli Performance Viewer to examine session sizes


Objects stored in the session remain in memory for the lifetime of the session, which is the time while the user is interacting with the site plus the session inactivity timeout. Therefore session sizes can have a significant impact on memory usage, reducing site capacity. Without tools to monitor session sizes, it can be extremely difficult to determine whether they are contributing to a performance problem. Fortunately, WebSphere PMI metrics can be used for this task. Here is the process to view this data:

1. Enable PMI, as mentioned above. However, the default PMI monitoring level will not capture session-size information. Capturing session-size information adds a performance overhead, so this should only be done when the information is needed. To capture this information, set the monitoring level to Custom, and click Custom to select the specific metrics to monitor.

Figure 26: Tivoli Performance Viewer: set custom monitoring level

2. Select Servlet Session Manager in the tree view. This will show the available metrics for the session manager. For this purpose, the two most important metrics are LiveCount (the number of sessions in the JVM) and SessionObjectSize (the estimated size of the session). Click Enable to make the changes.


Figure 27: Tivoli Performance Viewer: select session metrics

3. Access Tivoli Performance Viewer as mentioned above. To view the session data, enable the checkbox beside Servlet Session Manager and click View Module(s).

Figure 28: Tivoli Performance Viewer: view session data

4. At that level, WebSphere summarizes the data for all web applications in the system. By expanding the tree view, you can monitor session data for specific web applications; this can be used to track down portlets which have particularly large sessions.
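Once LiveCount and SessionObjectSize are being captured, a quick estimate shows how much heap the sessions hold. A sketch, treating SessionObjectSize as an average per-session size (which is itself only an estimate); the figures are illustrative:

```python
def session_heap_mb(live_count, avg_session_bytes):
    """Estimate heap held by HTTP sessions from the PMI LiveCount metric
    and an average per-session size in bytes."""
    return live_count * avg_session_bytes / (1024 * 1024)

# Illustrative figures: 5,000 live sessions averaging 50 KB each.
print(round(session_heap_mb(5000, 50 * 1024), 1))  # → 244.1
```

At that scale, sessions alone consume roughly a quarter of a 1 GB heap, which shows why trimming large sessions can matter.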

Figure 29: Tivoli Performance Viewer: session metrics by webapp

5.6.5 Monitoring caches in WebSphere Portal


Overall cache status
WebSphere Portal uses a number of caches, containing various kinds of information. Some of these caches are used in all Portal interaction scenarios, while others are only needed for certain types of use. These caches are listed, with detailed descriptions, in the WebSphere Portal Tuning Guide (see the References section below).

For WebSphere Portal to give the best possible performance, these caches need to be tuned properly. This can be a difficult task, as there are many caches to consider. One way to monitor the caches is through WebSphere PMI (Performance Monitoring Infrastructure): any tool which can read PMI data can display activity on those caches. However, not all of the caches in WebSphere Portal have PMI data associated with them, so some information will not be visible in this way; also, the PMI data does not give any tuning suggestions.

An alternative is the cache monitor portlet, which is available from WebSphere Portal support. This portlet monitors all of the caches in WebSphere Portal, and also gives tuning suggestions for caches which might need their size increased. Remember that tuning caches requires some trial and observation; a recommended methodology for this is discussed in the Methodology section of chapter 6.

Investigating an individual cache


If you want to view details about an individual cache, you can dig deeper with the IBM Extended Cache Monitor for IBM WebSphere Application Server. This tool is a free download; information is available from http://www.ibm.com/developerworks/websphere/downloads/cache_monitor.html.

5.7 Load generation


5.7.1 Excessive load on load generators
The workload on the load generators used in a benchmark environment must also be considered. If this workload is too high, then the benchmark results will reflect the performance of the load generators, not of the system under test. The simplest way to identify this kind of problem is to monitor the CPU load and memory consumption on the load generators. If the CPU load on a load generator is above 60%, it could be affecting the observed results; in this case, additional and/or more powerful load generators will be needed. Likewise, if a load generation system is paging significantly, this will impact the observed results.

It is impossible to state exactly how many virtual users can be supported by a single load generator. This varies based on many factors, such as the load generation software used, the settings used for the test, and the complexity of the test script.

If you are using multiple load generators, it's possible to check whether one of them has become overloaded and introduced an artificial bottleneck. In general, each virtual user should create requests at the same rate (assuming they are all running the same workload). This can be checked when analyzing the results: group the transaction rate by load generator, and then divide by the number of virtual users on that load generator. This gives the transaction rate per virtual user on each generator. If these rates are significantly different, then one of the generators may be overloaded. Another possible explanation is that the think time settings are not the same across all load generators. Unfortunately, if all the load generators are overloaded, this check will not reveal it.
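The per-generator check described above amounts to a small calculation. A sketch, where the generator names and figures are invented for illustration:

```python
def per_user_rates(results):
    """results: {generator: (transactions, virtual_users)} for one interval.
    Returns transactions per virtual user for each load generator."""
    return {gen: tx / users for gen, (tx, users) in results.items()}

# Generator 1 produces far fewer transactions per user: a sign of overload
# (or of different think-time settings on that generator).
rates = per_user_rates({"gen1": (9000, 500), "gen2": (1500, 50), "gen3": (1450, 50)})
print(rates)  # → {'gen1': 18.0, 'gen2': 30.0, 'gen3': 29.0}
```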

Example benchmark run with an overloaded load generator


As an example of a symptom indicating that the CPU load on the load generators is too high, consider the following response time graph from a measurement performed by the WebSphere Portal performance team:


Figure 30: Response times with an overloaded generator

In this test, a stepped workload is used: the load is increased to a certain level and run there for a period of time, then increased again and run at that level, and so on until the measurement is complete. The green line shows the response time for one of the page views in the test, and the blue line shows the number of running virtual users.

We can see that the response time increases as the load ramps up to its initial level of 500 virtual users, which seems reasonable. When the initial load level is reached, the response time levels out; again, this is as we would expect. The next two steps, which increase the load by 20%, make no noticeable impact on the response time. But when the load is increased to 650 virtual users, the response time suddenly jumps to more than twice its previous value. And while increasing the load to 700 or 750 virtual users makes little impact, when the load reaches 800 virtual users the response time again doubles. These changes in response time cannot be explained solely by the increasing load.

The explanation for this odd response time behavior lies with the load generators used for this measurement. The measurement used three load generators. Generator 1 ran the initial group of 500 virtual users. Generators 2 and 3 ran the groups of 50 virtual users which brought the load up to 600 virtual users. The group which brought the load up to 650 virtual users was run on generator 1. At that point, generator 1 was already overloaded, so adding more virtual users to it forced the response times up dramatically. Likewise, the group of virtual users which brought the load level to 800 was also run on generator 1, so again the response times took a sudden jump. Once the symptom was observed, we checked generator 1 and saw that it was at 100% CPU load, confirming that our response time data was invalid.

5.7.2 Assigning user IDs to virtual users


On most Web sites, it is unusual for the same user to be logged in simultaneously on multiple sessions. It is important that load tests do likewise. Not doing so can cause exceptions of the type mentioned in section 4.5, Problems specific to load tests. It can also cause unrealistically high CPU load and/or lock contention at the database server hosting the portal database(s). To accomplish this, create a pool of user IDs to be used in the load test. This pool of user IDs should be at least 25 times as large as the maximum number of virtual users being used. The load test scenario should be designed so that there is minimal chance for multiple virtual users to select the same user ID at the same time. For example, if the user IDs are in a sequential file, each virtual user should not start at the top of the file and sequentially select IDs; this would give the largest possible chance for virtual users to pick the same ID. Preferable approaches include having virtual users pick IDs at random from the list, or assigning a specific user ID to each virtual user for the duration of the test. All of the users in the pool of user IDs will need to be created in the directory used by WebSphere Portal. If any of the applications being used require data to be created for virtual users, this data must also be created prior to starting the load tests.
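The two preferable ID-selection approaches can be sketched as follows. This is a hypothetical illustration: the ID naming scheme, pool size, and function names are ours, and real load tools such as Rational Performance Tester or LoadRunner provide their own datapool mechanisms for the same purpose:

```python
import random

# Hypothetical pool of pre-created user IDs, sized at 25x an assumed
# maximum of 600 virtual users (per the guideline above).
user_ids = [f"testuser{n:05d}" for n in range(25 * 600)]

def fixed_assignment(vuser_number):
    """Dedicate one ID to each virtual user for the whole test:
    collisions are impossible."""
    return user_ids[vuser_number]

def random_assignment(rng=random):
    """Pick an ID at random; with a pool this large, the chance of two
    virtual users holding the same ID at once is small."""
    return rng.choice(user_ids)
```

Either approach avoids the worst case described above, where every virtual user walks the ID file from the top and repeatedly logs in with the same few IDs.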

5.8 The server environment


5.8.1 Locating Logs
With WebSphere Portal 6.1, the log files created by WebSphere Portal are now located in the log directory created for the application server profile used by WebSphere Portal. The default profile is called wp_profile, and the logs are located in the directory logs/WebSphere_Portal. For example, consider a Windows-based system where WebSphere Portal is installed in C:\IBM\WebSphere\. In this case, the default location for the server logs will be in C:\IBM\WebSphere\wp_profile\logs\WebSphere_Portal. On an AIX system where WebSphere Portal is installed in /usr/WebSphere, the default location for the server logs would be /usr/WebSphere/wp_profile/logs/WebSphere_Portal.

5.8.2 IBM Support Assistant


IBM Support Assistant (ISA) is a complimentary software support solution that helps you find answers and resolve issues with IBM software products. ISA gives you key features to streamline your support experience, including:
- A robust search feature to query IBM and non-IBM knowledge banks concurrently
- Problem diagnosis and analysis tools to identify and fix software issues with the JVM, configuration, memory, etc.
- Product information to locate APARs, fixes, and product education
- Links to easily transfer problems to IBM and open a PMR with collected data attached
- Automated, symptom-based data collection on local and remote systems

More information about IBM Support Assistant is available from http://www-01.ibm.com/software/support/isa/.


6 Benchmarking Information
6.1 Methodology
This section will draw on the experience of the WPLC performance team to help you build effective benchmark measurements.

6.1.1 Overview
An important part of dealing with performance issues is to be methodical about attempting to resolve the issue. It can be very frustrating to have a performance problem mysteriously disappear (or get worse!) and have no idea why this occurred. We have found the following approach useful to reduce those frustrations:

1. Identify the scenario(s) of interest: whether the test will be executed with an automated test tool or manually (for example, from a Web browser), it is important to know the interactions you'll make with the site. This way you can be certain that successive tests are measuring the same events.

2. Determine the criteria: for each scenario you're interested in, what are the criteria which really matter? For example, are you interested in response times, throughput, or processor consumption on a certain system?

3. Pick goals: it's important to identify your objectives so you can stay focused on meeting them. Commonly, benchmark scenarios will have one or more of the following goals:
   1. Validate capacity planning targets. This goal would be to validate a capacity target set when planning for the deployment. This target should be stated in terms of a throughput rate (total transactions per second) plus response-time goals.
   2. Find and eliminate bottlenecks. A bottleneck is one component of the system that's preventing the system from achieving higher capacity.
   3. Determine the maximum capacity of the system. In this case, there's no predetermined capacity goal; the goal is just to push the system as high as it can go.

4. Pick the environment: ideally, benchmarking should be done on the environment that will be used for production, but this often isn't possible. In that case, the benchmark environment needs to be as close as possible to an identical mirror of the production environment. Every aspect which differs will require extrapolation, which reduces the reliability of the results.

5. Gather a baseline measurement: it is very important to know the performance of each scenario before you start making changes to the system. With this information, you can quantify any improvements that are made, or any regressions that are experienced.

6. Determine the variability in the baseline: all measurements contain some amount of random variability. If tests are being run in an environment where some of the resources are shared with other uses, this variability can be significant. Repeating a measurement several times under identical circumstances can show how much variation exists. Once the amount of variation is understood, it can help correctly interpret subsequent measurements. For example, if the variation between runs is around 5%, and the result of making a change to the system is a performance improvement of 1%, then it is quite possible that the change actually made no improvement at all, and that the observed improvement comes from random variation.

7. Identify changes to try: often, there are several changes which you would like to try in an effort to resolve a performance problem. Make a note of the changes you are interested in trying, and decide what order you will try them in.

8. Measure the effect of changes: make changes, one at a time, and measure the effects of the changes. By making changes one at a time, you can determine the effect of each change; if multiple changes are made at once, it can be very difficult to know what effect each change made. While this is an ideal, you must also make concessions to reality: making changes and measuring their effects takes time. Therefore it often makes sense to make a group of related changes all at once, and measure their effects. For example, you may read the WebSphere Portal tuning guide and decide on a set of tuning changes to the HTTP server, the portal server, and the database server. This would be a lot of changes to introduce at once, so a sensible approach would be to make the tuning changes to one of the systems, such as the HTTP server, all at once, and then measure the effects. Then repeat this process for other tuning changes which are needed.
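To make step 6 concrete, run-to-run noise in the baseline can be quantified before judging any tuning change. In the sketch below, the throughput numbers and the two-standard-deviation significance threshold are illustrative assumptions, not values prescribed by this guide:

```python
import statistics

# Five baseline runs under identical conditions; the tps values are invented.
baseline_tps = [102.1, 99.8, 101.5, 98.9, 100.7]

mean = statistics.mean(baseline_tps)
# Coefficient of variation: run-to-run noise as a fraction of the mean.
cv = statistics.stdev(baseline_tps) / mean

def is_significant(new_tps, mean, cv):
    """Treat a change as real only if it is clearly outside the noise
    band (here, more than two standard deviations from the mean)."""
    return abs(new_tps - mean) / mean > 2 * cv

print(f"baseline mean {mean:.1f} tps, run-to-run variation {cv:.1%}")
# With this baseline, a 1% throughput change is inside the noise band,
# while a 5% change is not.
```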

6.1.2 Measurement scenarios and practices


A load measurement is intended to show that the system being tested can handle the load it is being given. Almost always, it's important that the system can support the load for a period of time, not just handle it for a brief moment. It's also important that the load build up over a period of time. Real-world workloads don't all start at once, and doing so in a performance measurement can cause unrealistic strain on the system. Therefore we recommend building measurement scenarios which ramp up to the target load level and hold it for a period of time, as is shown in the planned workload in the left graph, rather than spending most of the time in ramp-up and only holding the workload for a brief period of time, as is shown in the right.


Figure 31: Good and bad workload plans

Another part of the workload plan is selecting the think time to be used by the virtual users. There is an inverse relationship between think time and request rate: with the same number of virtual users, a lower think time will produce a higher request rate. When trying to achieve a target transaction rate, balance the number of virtual users against the think time. It is possible to create a high transaction rate with a small number of virtual users running with no think time (or very low think times). However, this is a very unrealistic model of user interaction, and it tends to produce an unrealistically high performance result. Ideally, think times should be in the range of 10-60 seconds; think times below 10 seconds should be used only if you are limited in the number of virtual users available.

In addition to defining the scenario to get meaningful data, it helps to have good operational practices around executing the measurements. We have found the following practices helpful in obtaining reliable, repeatable performance measurements:

1. Stop and restart WebSphere Portal before each measurement. Otherwise results can be influenced by data left in memory from previous runs.
2. Before the measurement, clean up the log files from all systems under test. This way it is clear that any entries in the logs relate to events which occurred during that test.
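The think-time trade-off can be made concrete with a little arithmetic: each virtual user completes roughly one transaction every (think time + response time) seconds, so the number of virtual users needed for a target rate is approximately the target tps times that cycle time. The numbers below are illustrative:

```python
def vusers_needed(target_tps, think_time_s, response_time_s):
    """Virtual users needed to sustain target_tps, given that each user
    completes one transaction per (think + response) seconds."""
    return round(target_tps * (think_time_s + response_time_s))

# 100 tps with a realistic 30-second think time needs a large population...
print(vusers_needed(100, 30, 0.5))  # 3050 virtual users
# ...while zero think time reaches the same rate with far fewer, but far
# less realistic, users.
print(vusers_needed(100, 0, 0.5))   # 50 virtual users
```

This is why a high transaction rate driven by a handful of zero-think-time users tends to overstate the performance of the site.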

6.1.3 Collecting data from measurements


With each measurement you make, there is a great deal of information which should be gathered. Collect the information, store it, and associate it with the measurement. We have found the following data useful in analyzing measurements:
- Date and time of the start of the test, plus duration of the test
- Test scenario and version executed
- Number of virtual users run, including ramp-up and ramp-down rates
- Response times and throughputs for all transactions of interest
- Total transactions per second
- Pass/fail counts on all transactions
- Processor load on all systems in the test environment, including load generators
- Log files from all systems under test, including verbose garbage collection logs
- Counts of responses by HTTP status code
- Network traffic (bytes/second)
- Any observed anomalies in the test

With this information, you can compare measurements against each other and easily determine how the results changed over various tests. There are also useful metrics which can be computed from the data mentioned above. These are some which we have found useful:
- Effective average think time: typically load generation scripts will contain think time, which represents periods of user inactivity. From the script, you can calculate the expected average think time. For example, if the script contains a 10-second think time after every operation, then the average think time is 10 seconds. As a way to check the correctness of the script, you can calculate the effective average think time observed in the run. If this is significantly different from the expected think time, then there may be an error in the script or the scenario settings which should be investigated. Our experience indicates that a variation of +/- 5% is possible from random variation, but a larger variation should be considered significant. The formula to calculate effective average think time is:

  effective_tt = ((num_vusers) / (total_tps)) - (average_response_time)

  Where:
  - effective_tt is the effective think time discussed above
  - total_tps is the total transactions per second, averaged over the period of constant load
  - num_vusers is the number of virtual users running in that period of constant load
  - average_response_time is the weighted average response time for all transactions
- Bytes per transaction: this metric will capture the average page size across all the pages in the scenario. This should only change when the page content changes, so it's a good way to check that the script is behaving in a similar manner across multiple measurements. A variation of +/- 5% is not uncommon on this metric, but a larger variation should be investigated. The formula for bytes per transaction is:


bytes_per_transaction = (bytes_per_second) / (total_tps)

Where:
- bytes_per_transaction is the average page size discussed above
- bytes_per_second is the total amount of data transferred per second, averaged over the period of constant load
- total_tps is the total transactions per second, averaged over the period of constant load

- Processor busy time per transaction: this metric gives the average processor time taken on a node in the system to process a transaction. This is an important metric when evaluating changes which should improve performance. The higher the busy time per transaction, the more resources those transactions are taking, so a performance improvement should reduce this value. The formula for processor busy time per transaction is:
busy_time = ((num_procs) * (proc_load) * 1000) / (total_tps)

Where:
- busy_time is the average processor busy time per transaction. This value will be in milliseconds.
- num_procs is the number of processors in the system. For systems with virtual processors, include virtual processors in this number.
- proc_load is the average processor load on that system, averaged over the period of constant load
- total_tps is the total transactions per second, averaged over the period of constant load
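The three derived metrics can be computed directly from one run's summary data. The sketch below uses invented input numbers; note that it writes the effective think time as (num_vusers / total_tps) minus the average response time, which is the dimensionally consistent form of the formula (both terms are in seconds):

```python
# Illustrative summary data for one measurement, all averaged over the
# period of constant load.
num_vusers = 500
total_tps = 120.0
avg_response_time = 1.2        # seconds, weighted across transactions
bytes_per_second = 6_000_000.0
num_procs = 4
proc_load = 0.65               # average processor utilization, 0..1

# Effective average think time (seconds): each virtual user's cycle time
# minus the time spent waiting for responses.
effective_tt = (num_vusers / total_tps) - avg_response_time

# Average page size across the scenario, in bytes.
bytes_per_transaction = bytes_per_second / total_tps

# Processor busy time per transaction, in milliseconds.
busy_time = (num_procs * proc_load * 1000) / total_tps

print(f"effective think time: {effective_tt:.2f} s")
print(f"bytes per transaction: {bytes_per_transaction:.0f}")
print(f"busy time per transaction: {busy_time:.1f} ms")
```

Comparing these three values across runs quickly shows whether the script changed (think time, page size) or whether the system really got cheaper or more expensive per transaction (busy time).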

6.2 Drawing conclusions from performance measurements


When considering the results of a performance measurement, there are two main questions to ask:
- Is this a valid measurement? Section 4, Identifying common errors in performance measurements, discusses error conditions to look for in the measurement to help understand whether the measurement is valid. If it's valid, continue and draw conclusions about the measurement; otherwise, fix the problems that were observed and repeat the measurement.
- What conclusions can be drawn about performance? The conclusions are drawn from the metrics discussed in the section above. From the point of view of the site owners and users, these are usually the most important considerations:


- How many transactions per second can be sustained? This is a key statement of capacity, as the number of users the site can support is extrapolated from this number.
- What are the response times for key transactions? Key transactions are the most important ones from the user's point of view. Exactly which are considered key depends on how the site will be used. They should be the 4-5 transactions the user executes most frequently or which make the biggest impression on the user experience. Typical examples include the site homepage, login (if appropriate), and interactions with key applications.
- What are average response times? A realistic load test may include dozens or even hundreds of different transactions. It's extremely difficult to make a judgment about that many transactions. A good estimate of the overall performance health of the site is the average response time, weighted by the transaction frequency.
- How many users can be supported? Knowing the number of users who can be supported is a key indicator of whether the site is ready for production: can it handle the expected user population? However, be careful drawing conclusions here. Virtual users are not the same as human users: one virtual user typically generates more load than one human user. Therefore it's usually best to extrapolate the number of human users from the transactions-per-second rate.

Sometimes a given test scenario is run at different load levels. Again, to compare these measurements, the best comparison is the transactions-per-second rate. However, that metric can sometimes be confusing to a non-technical audience, so the number of virtual users supported can be used as the comparison point instead; but remember that this is valid only if the virtual users are executing the same interactions at the same rate.
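As an illustration of extrapolating human users from the transaction rate: if the test sustains a given tps, and a human user is assumed to issue one transaction every so-many seconds on average, the supportable population is simply the product of the two. Both input numbers below are invented for illustration:

```python
# Sustained rate from the load test (transactions per second).
measured_tps = 120.0
# Assumed average seconds between a real user's requests; a human user
# clicks far less often than a typical virtual user.
human_cycle_s = 90.0

# Each human user contributes 1/human_cycle_s tps, so the supportable
# population is measured_tps * human_cycle_s.
supported_human_users = int(measured_tps * human_cycle_s)
print(supported_human_users)  # 10800 concurrent human users
```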

6.3 Benchmarking tools


6.3.1 Automated load generation tools
To create and execute a performance benchmark, a tool will be needed to simulate a large number of virtual users. There are a number of tools available in the marketplace for this; here are a few we have seen in use with WebSphere Portal:
- Rational Performance Tester: a multi-user load testing and performance testing tool for validating Web application scalability. More information is available from http://www-01.ibm.com/software/awdtools/tester/performance/index.html.


- HP LoadRunner: software for performance validation. Details are available from HP's website (http://www.hp.com/).
- Apache JMeter: a Java application designed to load test functional behavior and measure performance. This is part of the Apache Jakarta project, available from http://jakarta.apache.org/jmeter/.
- OpenSTA: the OpenSTA toolset is distributed testing software licensed under the GNU General Public License. More information is available from http://www.opensta.org/.

When considering load testing tools, consider the functions provided for creating load scripts and for analyzing the data. Our experience is that the commercial tools such as Rational Performance Tester and LoadRunner are superior in this area. This can make the performance testing process faster and more effective, leading to solutions more quickly.

6.3.2 Manual load generation


Sometimes using an automated load generation tool isn't practical or required. For example, it may be an administrative operation which only occurs infrequently. In that case, it wouldn't make sense to build a load script to execute it hundreds of times. For these situations, it can make sense to benchmark the operation with a Web browser. Here are a few suggestions to help get a more reliable measurement in those cases:
- Capture the response times with a browser-based tool (such as IBM Page Detailer). This will give more accurate timings than a stopwatch. It will also allow checking the response time of each component of the page.
- Account for the state of the browser cache: it's not interesting to note that page response time changed from 8 seconds to 4 seconds without knowing if the state of the browser cache changed between the two measurements. A browser-based tool won't eliminate variations due to the state of the browser cache, but it will let you look at the details of the response time, so you can see if that is the cause of differences in response time.
- Run a representative load while capturing the response time. This doesn't always mean driving the system to full capacity. If the operation will be done during a normal business day, use an automated script to drive load to a typical level, not full system utilization.
- Repeat the measurement several times to reduce random variation. As with an automated workload, random variation will always show up; try to minimize it as much as possible.


7 Conclusion
There can be many causes of performance problems for Web sites using WebSphere Portal. It is important to carefully examine the problem being experienced, and to methodically look for its underlying cause. This guide describes many common performance problems, as well as potential solutions. One additional closing thought: if you are experiencing a problem with WebSphere Portal, check the WebSphere Portal support site for possible solutions. This site can be found at http://www.ibm.com/software/genservers/portal/support/.

Feedback?
If you have feedback on the content or format of this guide, Martin Presler-Marshall can be reached by e-mail at mpresler@us.ibm.com.


Appendix A: Credits
Thanks to the following team members for contributing to this paper:
Mark Alkins
Matt Babaian
Tom Bal
John Chambers
Andy Citron
Nathan Cook
Megan Crouch
Susan Hanis
Alex Lang
Stefan Liesche
Ruthie Lyle
Klaus Nossek
Denny Pichardo
Scott Snyder
Dionne White
Don Wood
Laura Yen


Appendix B: References and Further Reading


The following are sources of additional information for further reading:
- WebSphere Portal tuning guides: separate guides are available for different WebSphere Portal versions:
  - WebSphere Portal 6.1: http://www-01.ibm.com/support/docview.wss?rs=688&uid=swg27013972
  - WebSphere Portal 6.0: http://www-01.ibm.com/support/docview.wss?uid=swg27008511

- WebSphere Portal information center: information centers for various versions of WebSphere Portal are available from http://www.ibm.com/developerworks/websphere/zones/portal/proddoc.html
- WebSphere Application Server information centers: these are found at http://www.ibm.com/software/webservers/appserv/was/library/. For WebSphere Application Server 5.1, the appropriate section is titled Tuning, and for WebSphere Application Server 6.0, the appropriate section is titled Tuning Performance.
- IBM HTTP Server information center: the information center for IBM HTTP Server is part of the information center for WebSphere Application Server (above).
- Apache HTTP Server documentation: as the IBM HTTP Server is based on the Apache HTTP Server, there is much useful information in the Apache documentation which applies to IBM HTTP Server. This documentation can be found at http://httpd.apache.org/docs/2.2/.
- DB2 information center: http://publib.boulder.ibm.com/infocenter/db2help/index.jsp
- Common malpractices whitepaper (Eleven ways to wreck a deployment): there are many ways to mis-handle deployment of an application, from performance issues to other problems. Available on the web at http://www-1.ibm.com/support/docview.wss?rs=180&uid=swg27007543.
- IBM WebSphere Portal: Performance testing and analysis: based on experience with resolving customer performance problems, this paper hits some of the most common problems and solutions. Available from IBM developerWorks at http://www.ibm.com/developerworks/websphere/library/techarticles/0807_lang/0807_lang.html.
- MustGather technotes related to WebSphere Portal: http://www-1.ibm.com/support/docview.wss?rs=688&uid=swg21236371
- Parallel portlet rendering: http://www-1.ibm.com/support/docview.wss?rs=688&context=SSHRKX&q1=Portlet+temporarily+disabled+while+rendering+in+parallel&uid=swg21177538&loc=en_US&cs=utf-8&lang=en
- Out-of-memory issues: there are several resources available. For AIX, Linux, and Windows: http://www-1.ibm.com/support/docview.wss?uid=swg21138587&loc=en_US&cs=utf-8&lang=en. For Solaris: http://www-01.ibm.com/support/docview.wss?rs=0&uid=swg21145349
- Portal self help guide: http://www.redbooks.ibm.com/redpapers/pdfs/redp4339.pdf. This IBM Redpaper covers a variety of topics related to design and deployment of a WebSphere Portal installation, including performance concerns.


Appendix C: HTTP Status Codes


The HTTP protocol defines status codes, such as the all-too-familiar 404 Not Found code. The full list of status codes can be found in the HTTP 1.1 specification, at http://www.w3.org/Protocols/rfc2616/rfc2616.html (section 10 of the specification contains the detailed definitions). Most HTTP servers include these status codes in their request and error logs. The following are the most common status codes from WebSphere Portal:

Code  Meaning
200   OK: the request was handled properly. Note: if WebSphere Portal is able to process a request, it will return a 200 code even if a problem occurs while rendering the page (such as a portlet being unavailable).
302   Found: the content can be retrieved from another URL. WebSphere Portal will typically return this code when logging in the user; when the request is sent containing the user ID and password, WebSphere Portal will return a 302 status to direct the user to their initial authenticated page.
304   Not Modified: the request was to check if the content has changed, and it has not changed.
400   Bad Request
401   Unauthorized
403   Forbidden
404   Not Found
500   Internal Server Error. When using an external HTTP server with WebSphere Portal, this status code will be given when the HTTP server is not able to get a response from the portal application server in a reasonable period of time.
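Since counts of responses by HTTP status code are among the data worth collecting for each measurement (section 6.1.3), here is a small sketch that tallies them from HTTP server access-log lines. It assumes NCSA common log format, where the status code is the second-to-last whitespace-separated field; the sample lines themselves are invented:

```python
from collections import Counter

# Invented access-log lines in NCSA common log format:
# host ident authuser [date] "request" status bytes
sample_log = [
    '9.1.2.3 - - [01/Dec/2008:10:00:00 -0500] "GET /wps/portal HTTP/1.1" 200 18432',
    '9.1.2.4 - - [01/Dec/2008:10:00:01 -0500] "POST /wps/portal/login HTTP/1.1" 302 0',
    '9.1.2.5 - - [01/Dec/2008:10:00:02 -0500] "GET /wps/portal HTTP/1.1" 500 512',
]

def count_status_codes(lines):
    """Tally responses by status code (second-to-last field per line)."""
    return Counter(line.split()[-2] for line in lines)

print(count_status_codes(sample_log))
```

A sudden rise in 500s (or in 302s outside of login traffic) between two otherwise identical runs is a quick sign that a measurement may not be valid.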


Copyright IBM Corporation 2008 IBM United States of America Produced in the United States of America All Rights Reserved The e-business logo, the eServer logo, IBM, the IBM logo, IBM Directory Server, DB2, Lotus, WebSphere, POWER4 and POWER5 are trademarks of International Business Machines Corporation in the United States, other countries or both. Lotus and Domino are trademarks of Lotus Development Corporation and/or IBM Corporation. The following are trademarks of other companies: Linux is a registered trademark of Linus Torvalds. Windows and Windows 2003 Enterprise Server are trademarks of Microsoft Corporation in the United States and/or other countries LoadRunner is a trademark of Mercury in the United States and/or other countries. Other company, product and service names may be trademarks or service marks of others. INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PAPER AS IS WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. Information in this paper as to the availability of products (including portlets) was believed accurate as of the time of publication. IBM cannot guarantee that identified products (including portlets) will continue to be made available by their suppliers. This information could include technical inaccuracies or typographical errors. Changes may be made periodically to the information herein; these changes may be incorporated in subsequent versions of the paper. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this paper at any time without notice. Any references in this document to non-IBM web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. 
The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing IBM Corporation North Castle Drive Armonk, NY, USA 10504-1785

