Prepared by neil.johnson@microsoft.com
MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation. Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, our provision of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property. The descriptions of other companies' products in this document, if any, are provided only as a convenience to you. Any such references should not be considered an endorsement or support by Microsoft. Microsoft cannot guarantee their accuracy, and the products may change over time. Also, the descriptions are intended as brief highlights to aid understanding, rather than as thorough coverage. For authoritative descriptions of these products, please consult their respective manufacturers. © 2011 Microsoft Corporation. All rights reserved. Any use or distribution of these materials without express authorization of Microsoft Corp. is strictly prohibited. Microsoft and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.
Jetstress 2013, Jetstress Field Guide, Version 2.0.0.8 Draft Prepared by neil.johnson@microsoft.com "Document1" last modified on 26 Feb. 14, Rev 2
Jetstress 2013, Field Guide, Version 2.0.0.8 Issued Prepared by Neil Johnson "Document1" last modified on 26 Feb. 14, Rev 2
Document Contributors
Name               Position                                   Section
Neil Johnson       Senior Consultant, UK MCS                  Author
Alexandre Costa    SENIOR SDET, Exchange Test                 Jetstress internals
Ross Smith IV      PRINCIPAL PROGRAM MANAGER, Exchange CXP    Configuring Jetstress
Ramon b. Infante   DIR, WW COMMUNITIES, UC                    Various
Matt Gossage       PRINCIPAL PROGRAM MANAGER LEAD             Various
Umair Ahmad        SDET II, Exchange Test                     Various
Reviewers
Name Neil Johnson Alexandre Costa Ross Smith IV Version 2.0.0.1 2.0.0.1 2.0.0.1 Position Senior Consultant II, MCS UK SENIOR SDET, Exchange Test PRINCIPAL PROGRAM MANAGER, Office 365 - CAT SVCS DIR, WW COMMUNITIES, UC PRINCIPAL PROGRAM MANAGER LEAD, Exchange PM US SDET II, Exchange Test US SENIOR PROGRAM MANAGER, Exchange PM - US PRINCIPAL TECHNICAL WRITER, Content Publishing DELIVERY ARCHITECT, US-US-MCS West SL 2 SENIOR PROGRAM MANAGER LEAD, Office 365 - CAT SVCS REGIONAL ARCHITECT, US-MCS DOD SL 2 PRINCIPAL CONSULTANT, US-MCS Civilian SL 2 Date
Umair Ahmad Nathan Muggli Scott Schnoll Boris Lokhvitsky Jeff Mealiffe
2.0.0.1 2.0.0.1
Table of Contents
1 Purpose
2 What is New in Jetstress 2013
3 Introduction to Jetstress
4 Jetstress Internals
  4.1 Main Jetstress Components
    4.1.1 Auto Tuning Component
    4.1.2 Thread Dispatcher
    4.1.3 Background Log Checksummer
    4.1.4 Offline Log and Database Checksummer
    4.1.5 Reporting and Verification
  5.2 When should I run Jetstress in my project?
  5.3 Where should I run Jetstress in my infrastructure?
  5.4 Failure Mode Testing
    5.4.1 Raid Array Testing
    5.4.2 Resilient Component Testing
    5.4.3 Example of a failed degraded mode test
  5.7 Preparing for the Jetstress test
  5.8 What happens if the test fails?
  6.2 Jetstress Version and Download
  6.3 Prerequisites
  6.4 Getting ESE Files necessary for Jetstress
    6.4.1 File locations from an installed Exchange Server
    6.4.2 File locations from the installation media
  6.5 Installation
    6.5.1 Application Installation
    6.5.2 ESE File Installation
12 Appendix C - Running a Jetstress Test with JetstressCmd.exe
13 Appendix E - Running Jetstress on a production server
14 Common Issues
  14.1 Troubleshooting Jetstress
    14.1.1 Jetstress cannot attach to or create a database
    14.1.2 Error loading Performance Monitor counters
    14.1.3 Unable to tune for the parameters
    14.1.4 Unable to mount databases due to invalid mount point configuration
    14.1.5 Jetstress testing failed. Error: System.ApplicationException: Faulty performance counter paths: \MSExchange Database(*)\*
1 Purpose
This document explains the process and requirements for validating an Exchange 2013 storage solution prior to releasing an Exchange deployment into production. It explains how Jetstress works, how to plan for and perform a Jetstress test, and how to analyse the results of the test. This document is not intended to provide Exchange storage design guidance. For guidance on Exchange 2013 server design and planning, refer to Planning and Deployment.
Important Changes Do not use Jetstress 2013 for older versions of Exchange Server. Jetstress 2013 has only been tested with Exchange Server 2013.
3 Introduction to Jetstress
Jetstress is a tool for simulating Exchange database I/O load without requiring Exchange to be installed. It is primarily used to validate physical deployments against the theoretical design targets that were derived during the design phase.
To simulate the complex Exchange database I/O pattern effectively, Jetstress makes use of the same ESE.DLL that Exchange uses in production. It is therefore vital that Jetstress use the same version of the Extensible Storage Engine (ESE) files that your Exchange infrastructure will be built with in production.
Ideally, Jetstress testing will be part of the overall project plan. The best time to schedule Jetstress testing is just before Exchange is physically installed onto the servers. Jetstress testing provides the following benefits prior to deploying live users:
- Validates that the physical deployment is capable of meeting specific performance requirements
- Validates that the storage design is capable of meeting specific performance requirements
- Finds weak components prior to deploying in production
- Proves storage and I/O stability
The most important aspect of Jetstress testing is that it allows you to see how the physically deployed storage and server infrastructure will behave once a real Exchange workload is applied. This often works out differently from expectations, especially in scenarios where shared storage infrastructure is deployed or where the storage design is complex.
Often the Jetstress test will not provide the results that were expected. Sometimes subtle configuration changes to the storage infrastructure (for example, driver or firmware updates) are enough to get the test to pass. It is important to remember that when the Jetstress test reports a failure, Jetstress has not failed; Jetstress is just reporting on the performance of your storage solution. This may seem an obvious point; however, a large number of customer escalation cases for Jetstress are not actually Jetstress cases and are instead storage performance cases.
If you need to remediate a test failure, remember that Jetstress is a dumb tool that is used worldwide by thousands of Exchange professionals and in Office 365. It is extremely unlikely that Jetstress is broken; it is far more likely that you have a design issue or misconfiguration in your storage deployment.
Fundamentally, a successful Jetstress test validates that all of the hardware and software components within the I/O stack, from the operating system down to the physical disk drive, are working to a sufficient level to meet the predicted performance required by Exchange to operate successfully.
Important: The validity of your Jetstress testing is only as good as the user profile analysis and workload prediction that was completed during the design phase of the project.
4 Jetstress Internals
4.1 Main Jetstress Components
Like Exchange, Jetstress is an ESE-based application. It runs in user memory space, makes API calls to ESE, which in turn makes calls to the Windows File system and I/O Manager to gain access to the data stored on disk. During each of these tasks Windows records performance information about the specific task and the operating system as a whole. Once the test is completed, Jetstress analyses the performance data to determine if the system meets the targets specified at the beginning of the test.
[Figure: Jetstress component diagram. Labels: Windows Operating System, Windows Performance Counters, Performance Data, Device Drivers, Hardware]
4.1.1 Auto Tuning Component
This component is responsible for auto-tuning within Jetstress. Each thread performs a set number of ESE calls, which generates a set amount of disk I/O, so raising or lowering the thread count changes the storage workload. The auto-tuning component attempts to determine the maximum thread count that the storage solution can support while remaining within the published disk latency guidelines for Exchange Server. The Jetstress test parameters for disk latency are shown in section 8.3 Interpreting Jetstress test results.
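The idea of tuning a thread count against a latency limit can be illustrated with a short sketch. This is not the algorithm Jetstress itself uses, which this guide does not describe; it is a minimal illustration that assumes latency never falls as threads are added. The function names, the 20 ms limit and the simulated storage model are all hypothetical.

```python
from typing import Callable


def find_max_thread_count(
    measure_latency_ms: Callable[[int], float],
    latency_limit_ms: float,
    max_threads: int,
) -> int:
    """Return the highest thread count whose latency stays within the limit.

    Assumes latency never decreases as the thread count grows, so a
    binary search is valid. Returns 0 if even one thread is too slow.
    """
    low, high = 0, max_threads  # low is the best acceptable count found so far
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always progresses
        if measure_latency_ms(mid) <= latency_limit_ms:
            low = mid       # mid is acceptable; try more threads
        else:
            high = mid - 1  # mid is too slow; try fewer threads
    return low


# Hypothetical storage model: 5 ms base latency plus 1.5 ms per thread.
def simulated_latency(threads: int) -> float:
    return 5.0 + 1.5 * threads


best = find_max_thread_count(simulated_latency, latency_limit_ms=20.0, max_threads=64)
print(best)
```

In a real test the latency measurement would come from a full Jetstress run at each thread count, so every probe is expensive; that is one reason a search that needs few probes is attractive.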
New: Auto-tuning has been improved in Jetstress 2013 by moving to a global thread controller. Auto-tuning may still fail; however, it should be successful in many more scenarios than in Jetstress 2010.
4.1.2 Thread Dispatcher
The thread dispatcher is responsible for managing workload within Jetstress. The main areas of interest within the thread dispatcher are as follows:
- ThreadCount: the number of transactional threads. In Exchange 2013 this is a global parameter. Prior to Exchange 2010 it was the number of threads per storage group, and in Exchange 2010 it was the number of threads per database.
- ThreadTypes: each thread chooses one type of work to do against the database, and the same thread can perform different types of work during a given run. There are four types: insert, read, update and delete (all against records in a table). The default operation mix for an Exchange 2010 simulation is 40%, 35%, 5% and 20%, respectively.
- SluggishSessions: the default is 1 for Exchange 2010. This is usually used to fine-tune the amount of work performed by a given thread. Internally, a thread sleeps for (SluggishSessions * TaskRunTime) before picking up the next task to run. For example, if SluggishSessions is 3 and an insert thread took 100ms in the last cycle, it will sleep for 300ms before moving on to the next cycle. A value of 0 means go full throttle.
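The SluggishSessions arithmetic and the operation mix described above can be sketched as follows. This is an illustration only and not Jetstress source code; the function names and the weighted random selection are assumptions made for the example, and only the percentages and the sleep formula come from the text.

```python
import random

# Default operation mix quoted above for an Exchange 2010 simulation.
OPERATION_MIX = {"insert": 40, "read": 35, "update": 5, "delete": 20}


def sleep_after_task_ms(sluggish_sessions: int, task_run_time_ms: float) -> float:
    """Sleep time before the next task: SluggishSessions * TaskRunTime."""
    return sluggish_sessions * task_run_time_ms


def pick_operation(rng: random.Random) -> str:
    """Choose the next type of work, weighted by the operation mix."""
    names = list(OPERATION_MIX)
    weights = list(OPERATION_MIX.values())
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
counts = {name: 0 for name in OPERATION_MIX}
for _ in range(10_000):
    counts[pick_operation(rng)] += 1

print(sleep_after_task_ms(3, 100))  # 300, matching the example above
print(counts)
```

With SluggishSessions set to 0 the computed sleep is always 0, which corresponds to the full-throttle case.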
4.1.3 Background Log Checksummer
This component simulates the I/O overhead of additional database copies. This copy operation has an I/O cost which increases with each additional copy.
4.1.4 Offline Log and Database Checksummer
This process checksums all database and log files at the end of a Jetstress run to ensure that all data is intact. It also provides performance data for CRC checksum speed should VSS copies require a checksum prior to backup. This process is extremely hard on storage hardware, often applying an I/O load many times greater than the workload that the actual Jetstress test applies.
Important: If you are running Jetstress on multiple servers in parallel on shared storage infrastructure, it is vital that the CRC check is not running while other servers are performing their Jetstress tests. Selecting the multi-host option during the test configuration causes the testing process to stop and wait for confirmation before beginning the CRC check, to avoid servers interfering with each other's results.
While working out the correct thread count to use, it is not necessary to let the checksum part of the test complete. To stop the checksum, you can either click Cancel, which stops the checksum part of the test but still generates the performance test report, or edit the Jetstress configuration file and change the VerifyChecksum value to false (the default is true):
<VerifyChecksum>false</VerifyChecksum>
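If many servers need the same change, the edit could be scripted. The sketch below is one possible approach using only the Python standard library. The sample document is a hypothetical, heavily simplified stand-in for a real Jetstress configuration file, whose full structure is not described in this guide; only the VerifyChecksum element name comes from the text above.

```python
import xml.etree.ElementTree as ET


def set_verify_checksum(config_xml: str, enabled: bool) -> str:
    """Return config_xml with every VerifyChecksum element set to true/false."""
    root = ET.fromstring(config_xml)
    found = False
    for element in root.iter():
        # Compare the local name so the match works with or without a namespace.
        if element.tag.rsplit("}", 1)[-1] == "VerifyChecksum":
            element.text = "true" if enabled else "false"
            found = True
    if not found:
        raise ValueError("no VerifyChecksum element found")
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily simplified configuration used only for this example.
sample = "<configuration><VerifyChecksum>true</VerifyChecksum></configuration>"
updated = set_verify_checksum(sample, enabled=False)
print(updated)
```

ElementTree does not preserve the XML declaration or comments and may rewrite namespace prefixes, so keep a backup of the original file and check that Jetstress still loads the edited copy.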
4.1.5 Reporting and Verification
At the end of a Jetstress test, the reporting and verification process compares the observed performance results against a set of acceptable values. These results are then written to an HTML file. During the test, binary performance data is written out to a BLG file.
Jetstress testing can be difficult to account for in your planning process. In particular, how much time should you allocate for testing, and at which points in the project should Jetstress testing occur? This section will try to answer some of these questions and explain the process in more detail.
5.1
The following process assumes that you are using the disk subsystem throughput test and auto-tuning, as recommended.
5.1.1
Figure 2 - High Level Test Overview shows a high-level flowchart for Jetstress testing. The process begins with a completed Mailbox Role Calculator and ends when the test has passed successfully while meeting the targets identified in the calculator.
[Figure 2 - High Level Test Overview. Flowchart labels: Complete Mailbox Role Calculator, Begin Testing, Jetstress Testing, Test Pass? (yes/no), Achieved IOPS? (yes/no), Remediation / Reconfiguration, Validation Complete]
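The two decisions in the high-level flow can be written as a single check. This is a sketch under the assumption that a "no" at either decision leads to remediation followed by a retest; the function name is hypothetical, and the 900 IOPS value is invented for the example.

```python
def next_step(test_passed: bool, achieved_iops: float, target_iops: float) -> str:
    """Map one test outcome to the next step in the high-level flow."""
    if test_passed and achieved_iops >= target_iops:
        return "Validation Complete"
    return "Remediation / Reconfiguration"


print(next_step(True, 1256, 1200))   # passed and met the calculator target
print(next_step(True, 900, 1200))    # passed, but below the calculator target
print(next_step(False, 1256, 1200))  # failed the test itself
```

The point of the second case is that a test which passes on latency alone does not complete validation; it must also achieve the IOPS target from the Mailbox Role Calculator.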
5.1.2
[Figure: detailed test flowchart. Labels: Testing Begins, Test Initialisation, Test Pass? (yes/no), IOPS Sufficient? (yes/no), IOPS Exceeded? (yes/no), Test Results, Testing Ends]
5.2 When should I run Jetstress in my project?
So, why would you run Jetstress during the planning/design phase of a project? The simple answer is that with today's powerful hardware, Exchange design teams must use standard chunks of hardware to create their design. Rather than attempting to guess the I/O limits of the hardware, it is preferable to perform some Jetstress tests on it to determine the maximum storage I/O capacity of the system. This allows the design team to specify the bill of materials much more precisely, thereby saving money and reducing risk.
However, if you have already proven the solution in the lab, why test again at build time? This is a common question. Many projects only schedule sufficient time for testing a single server and its storage solution, in the belief that they only need to validate the design. The problem with this approach is that it assumes a zero error rate in the build-out. What happens if someone forgets a part of the build on one server, or deploys a different device driver from the one used in the lab? What happens if a faulty piece of hardware has been deployed?
Jetstress testing at build time is a great way to validate that the physically deployed hardware and software are capable of providing the required I/O performance for Exchange. It is also a way to identify failing components such as disk drives; it is much less stressful to identify a weak batch of disks during a Jetstress test than on a Monday morning after a large user migration!
If the project plan will allow it, build in sufficient time to test each server and storage chassis that will be deployed before migrating user mailboxes to it. Remember that Jetstress can be fully automated, so with a little planning it can be left to run overnight and may not add any significant overhead to the project.
5.3 Where should I run Jetstress in my infrastructure?
Each database copy must be designed to provide sufficient I/O to support the copy if it were to become active. Therefore, by testing each database LUN in parallel, we are validating that the storage solution is able to meet the design requirements. We are also validating that any pieces of shared infrastructure are able to meet the demand of the entire solution, rather than simply testing each server individually.
Note: Where there is no shared infrastructure and all storage is directly attached, servers may be tested individually. However, to be a valid test, the test must be configured to include any active, replica or lagged LUNs that could come online at the same time.
5.4 Failure Mode Testing
From a service availability perspective, it is important to validate that your storage can provide sufficient performance in all common failure conditions. Due to this, it is recommended to run the Jetstress test while the array is operating in the following conditions.
Array Condition   Test importance                          Description
Optimal           Recommended for all deployments          All disk spindles operating normally
Degraded          Recommended for all deployments          Single spindle removed from the array
Rebuilding        Recommended if array has hot spare [1]   Failed spindle replaced and array controller is rebuilding the array
Ideally, the Jetstress test should still pass during a degraded mode test. If the test fails, refer to this post to analyse the failure severity.
5.4.2 Resilient Component Testing
Any aspect of the storage solution that has been designed to be resilient should also be tested in a failed state to determine the impact. For example, if there are multiple paths between the host and the storage controller, the Jetstress test should still pass if one is disabled. There are too many possible types of resilient components to list here; however, the general spirit of this test is to evaluate potential sources of failure within your storage solution and ensure that Jetstress still passes if they enter a degraded state.
[1] If your array does not contain a hot spare, you can choose to perform array rebuilds out of hours so that the end user impact is minimized; however, your data loss exposure is increased. If you plan on performing array rebuilds during working hours, it is recommended to perform a Jetstress test run while the array is rebuilding, even if you do not have a hot spare configured.
5.4.3 Example of a failed degraded mode test
This example shows an unacceptable test result. I have chosen to show an unacceptable result since a good test is just a flat line and that is not particularly interesting. In this instance, the storage was based on RAID 6 technology. The Jetstress test was configured to run at 1256 IOPS (the Mailbox Role Calculator predicted 1200 IOPS). Approximately halfway through the test, a hard disk drive was (carefully) removed from the array and the spare began rebuilding. The test data shows that the average read I/O latency (Exchange Database ==> Instances\I/O Database Reads (Attached) /average Latency) increased from 11ms to 400ms+, with latency spikes of 3000-4000ms on the affected LUN. This situation took 18 hours to return to normal after the failure. This represented a clear failure of the degraded mode test.
Important: Common failure modes such as a disk rebuild should not materially affect the test results.
Note: For more information on recommended RAID configurations for Exchange Server, refer to the following article about understanding storage configuration for Exchange Server 2013: http://technet.microsoft.com/en-us/library/ee832792.aspx
5.5
5.5.1
The approach and testing process do not change. The aim of the test is to validate that the storage presented to the virtual guest can provide sufficient performance to meet the predicted requirements from the Mailbox Role Calculator. All performance counters and recommended values remain the same for a virtual guest as for a physical server, and the recommendations for testing against RAID arrays and in failure modes still apply. However, there are things that we may need to consider during our Jetstress testing.
1. Is the virtual host operating at a normal working load during our test? If the host has capacity for 10 virtual machines and we are testing with a single virtual machine running, then there is the possibility that we will experience performance problems once the host is fully loaded.
2. Does the host server have any high availability technology that we need to test in degraded mode? This could include things like multiple paths to the storage or network, or maybe even a hypervisor HA solution. Additionally, the host may be the failover location for other guests, meaning that workload may increase dramatically in a failure scenario.
3. Follow the current recommended practices from both Microsoft and your hypervisor vendor. Yes, I know this is obvious, but it still amazes me how many problems are resolved by following the recommended guidance!
Guidance: The spirit of the test is to ensure that the system can meet its predicted workload during normal working conditions and during any common failure modes that the system has been designed to survive.
For more information about virtualizing Exchange Server:
- Announcing Enhanced Hardware Virtualization Support for Exchange 2010 (this applies equally to Exchange Server 2013): http://blogs.technet.com/b/exchange/archive/2011/05/16/announcing-enhancedhardware-virtualization-support-for-exchange-2010.aspx
- Demystifying Exchange 2010 SP1 Virtualization (this applies equally to Exchange Server 2013): http://blogs.technet.com/b/exchange/archive/2011/10/11/demystifying-exchange-2010sp1-virtualization.aspx
- Best Practices for Virtualizing Exchange Server 2010 with Windows Server 2008 R2 Hyper V (this applies equally to Exchange Server 2013): http://www.microsoft.com/download/en/details.aspx?id=2428
5.6
5.6.1 Initialisation
This phase includes installation, prerequisites and initial database creation. Of these tasks, the initial database creation will take the longest. Database creation time varies between hardware deployments; however, expect around 24 hours for 10TB of data per server (~7GB/minute). If you are using direct attached storage and initialise multiple servers in parallel, these predictions apply to each server. If you are using shared storage, initialisation may take considerably longer.
DATA (TB)   TIME (Hours)   TIME (Days)
1TB         2.4            0.1
2TB         4.8            0.2
5TB         12.0           0.5
10TB        24.1           1.0
50TB        120.3          5.0
100TB       240.6          10.0
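For data sizes not listed in the table, the same estimate can be computed from the approximate rate quoted above. This is a rough planning sketch; it assumes 1024GB per TB and a flat 7GB/minute, so its results land within about 1-2% of the table, which appears to be based on a slightly higher unrounded rate.

```python
GB_PER_TB = 1024  # assumption: binary terabytes


def estimate_initialisation_hours(data_tb: float, gb_per_minute: float = 7.0) -> float:
    """Estimate database creation time in hours for one server."""
    minutes = data_tb * GB_PER_TB / gb_per_minute
    return minutes / 60


for size_tb in (1, 2, 5, 10, 50, 100):
    hours = estimate_initialisation_hours(size_tb)
    print(f"{size_tb:>4} TB  {hours:7.1f} hours  {hours / 24:5.1f} days")
```

If an early initialisation run on your own hardware gives a measured rate, pass it as `gb_per_minute` to get a more realistic figure for the remaining servers.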
5.6.2 Testing
The actual testing phase will vary depending on the complexity and maturity of the design. If your design is based on complex, cutting-edge storage technology, it is highly likely that you will need to allocate more time for testing. If your design is based on common direct attached components, the testing phase is likely to be quite short. For simple direct attached solutions, allow between 2 and 5 days; for complex SAN solutions, try to allocate up to 10 working days. If you are working in a complex enterprise with large-scale, complex storage infrastructure, budget between 4 and 6 weeks for Jetstress testing. Troubleshooting storage performance issues can often be very time-consuming.
5.6.3 Clean-up
Before the server can be put into production, it is necessary to remove the Jetstress application and the test databases that were created. The recommended procedure is as follows:
- Uninstall Jetstress and reboot
- Copy the Jetstress data to a safe location
- Delete the Jetstress installation folder
- Remove all test databases
Depending on complexity, allow between 1 and 2 hours per Exchange server that needs to have Jetstress uninstalled. Tip: If you have a complex deployment, you can use the scripts embedded here:
JetstressScripts.zip
The scripts parse your JetstressConfig.XML file and remove all database and log folders defined in the test. The scripts take two input parameters:
- [XMLFile]: path to the JetstressConfig.XML file. Defaults to C:\Program Files\Exchange Jetstress\JetstressConfig.xml if no other value is specified.
- [Prompt]: $true or $false. The default is $true; specify $false to use the scripts as part of an automated process.
Note that these scripts are unsupported and you use them entirely at your own risk. They are provided here for convenience only.
5.7 Preparing for the Jetstress test
5.8 What happens if the test fails?
6 Installing Jetstress
6.1 Documentation
The document that you are currently reading is the main source of information for Jetstress 2013. If you are validating Exchange Server 2003, 2007 or 2010, refer to the Jetstress Field Guide for Jetstress 2010.
6.2 Jetstress Version and Download
Note: Although there is a 32-bit build of Exchange 2007, it is not recommended or supported to use these ESE files to run a Jetstress test. This is due to the requirement for a 64-bit address space to simulate a realistic Exchange I/O pattern. Jetstress 2013 will not allow you to use an XML configuration file from an older version of Jetstress. Always ensure that you use the same version of Jetstress to initialise the databases and to perform the testing.
Refer to Appendix D Exchange 2003 for information on configuring Jetstress 14.01.225.x for Exchange 2003.
6.3 Prerequisites
- .NET Framework 4.5 or higher
- A copy of your 64-bit production ESE files [3]
  o ese.dll
  o eseperf.dll
  o eseperf.hxx
  o eseperf.ini
  o eseperf.xml
It is important that the version of ESE that is used for the test is the same version that will be used in production.
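A quick presence check for these files can save a failed start. The sketch below only confirms that the listed files exist in a folder; it does not check that their version matches production, which still has to be verified separately. The function name is hypothetical, and the folder shown is the default installation path named elsewhere in this guide.

```python
from pathlib import Path

# File names from the prerequisites list above.
REQUIRED_ESE_FILES = ("ese.dll", "eseperf.dll", "eseperf.hxx", "eseperf.ini", "eseperf.xml")


def missing_ese_files(folder: str) -> list[str]:
    """Return the required ESE files that are not present in folder."""
    base = Path(folder)
    return [name for name in REQUIRED_ESE_FILES if not (base / name).is_file()]


# Default installation folder; adjust if Jetstress was installed elsewhere.
missing = missing_ese_files(r"C:\Program Files\Exchange Jetstress")
print("Missing ESE files:", ", ".join(missing) if missing else "none")
```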
[3] See section 6.4 Getting ESE Files necessary for Jetstress for the locations of these files.
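6.4 Getting ESE Files necessary for Jetstress
6.4.1 File locations from an installed Exchange Server
6.4.2 File locations from the installation media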
Caution: Remember to use the same version of ESE files in your Jetstress tests that you will use in production.
6.5 Installation
Before starting this section, it is recommended that you confirm all prerequisites have been met and that Exchange Server is not installed on any server being used for Jetstress testing.
6.5.1 Application Installation
1.
2.
3. Leave the installation options as default unless you have a good reason to change them.
Note: All performance data and HTML reports will be stored in the installation folder, so if your system drive is short of space, select an alternative folder.
4. This is the last chance to stop the installation. Click Next to install.
5.
6.5.2
1. Copy the ESE prerequisite files into the Jetstress installation folder. By default this is c:\Program Files\Exchange Jetstress
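The copy step can be scripted so that a missing file is noticed before Jetstress is started. The following is a minimal Python sketch, not part of Jetstress: the function name `copy_ese_files` is hypothetical, and the source folder you pass in is whatever location holds your production ESE files.

```python
import shutil
from pathlib import Path

# The five ESE files listed in the prerequisites section.
ESE_FILES = ["ese.dll", "eseperf.dll", "eseperf.hxx", "eseperf.ini", "eseperf.xml"]


def copy_ese_files(source_dir, install_dir):
    """Copy the ESE prerequisite files; return the names that were not found."""
    source, target = Path(source_dir), Path(install_dir)
    missing = []
    for name in ESE_FILES:
        src = source / name
        if src.is_file():
            # copy2 preserves timestamps, which helps when checking file versions later
            shutil.copy2(src, target / name)
        else:
            missing.append(name)
    return missing
```

An empty list returned from `copy_ese_files` means all five files were copied; any names returned should be located before continuing. The script needs the same administrator rights as a manual copy into the installation folder.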
2. Start Exchange Jetstress 2013.
Note: Jetstress requires local Administrator access. If User Account Control is enabled, ensure that you start the JetstressWin.EXE process as an administrator.
3.
4.
Jetstress will attempt to use the ESE files that were copied over in step 1. The first time that this occurs, Jetstress must be restarted. Verify in the output on this screen that the ESE version is correct and that the last line of the status output states that Jetstress must be restarted. Close Jetstress.
This is the end of the Jetstress installation.
7 Configuring Jetstress
For the purposes of this document, we will be configuring a disk subsystem throughput test. The goal of this test is to identify the peak working IOPS value that the storage subsystem can sustain while remaining within the disk latency targets established by the Exchange Product Group.
7.1
7.1.2
Helps you determine whether your storage system meets or exceeds the planned Exchange mailbox profile. In the Exchange mailbox profile test scenario, you can specify the number of mailbox users, IOPS per mailbox and quota size to simulate the profiled Exchange mailbox load. This test type can be useful if your storage has been specifically designed to operate only at a specific disk capacity.
Note: Even if this test type is used, it is still recommended to complete the disk subsystem throughput test to determine the maximum working load of the storage solution at full capacity.
It is not recommended to design Exchange storage performance based on less than 80% utilisation capacity.
7.2 Initial configuration
1. Open Exchange Jetstress 2013.
2.
3. Check that the status text does not ask for a restart and that the last two lines state that the ESE engine and performance libraries were detected.
4. Since this is the first time we are configuring a test, we will accept the defaults and click Next. This will create a new configuration file called JetstressConfig.xml in the default installation directory. If you already have an XML file, select that instead.
5. Select Test disk subsystem throughput and click Next.
6. Ensure that Suppress tuning and use thread count is unchecked. This is a change from Jetstress 2010, where auto-tuning would rarely work. Auto-tuning should work in most scenarios with Jetstress 2013. If auto-tuning fails, revert to manual thread configuration as per Appendix A Configuring Thread Count.
You should always test with 100% database capacity and 100% target IOPS throughput. However, if the storage presented to your servers is greatly oversized, you can control the Jetstress test database sizes by reducing the storage capacity percentage. Most validation tests should leave both values at 100.
7. Configure the test for performance. If you are testing a shared storage platform, enable the multi-host checkbox. Ensure that Run background database maintenance is checked. Enable Continue the test run despite encountering errors. If any errors are detected during the test, they will be reported in a new table to highlight disk errors.
8. Enter the folder for storing the test results and set the correct duration for Jetstress. A minimum of one successful 2 hour test and a separate 24 hour test is required for deployment validation.
Note: While auto-tuning or configuring thread count, you can set a test shorter than 2 hours by typing directly into the window:
• 0.75 = 45m
• 0.50 = 30m
• 0.25 = 15m
Recommendation: Use 0.50 (30 minute) test runs to set thread count for SAN storage.
9. Configure the test to represent the production deployment. Number of databases should be the total on this server including all database copies: active, passive and lagged. Number of copies per database represents the total number of copies that will exist for each unique database. This value simply simulates some LOG I/O reads to account for the log shipping between active and passive databases; it does NOT actually copy logs between servers.
For example, if your 6 server DAG contained 30 databases, with 1 active copy, 2 passive HA copies and 1 lagged copy per database (or 120 database copies spread across 6 servers, with each server hosting 20 copies), you would set the number of databases to 20 and the number of copies per database to 4.
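The arithmetic in the DAG example can be checked with a short Python sketch. The function name is hypothetical, and the sketch assumes that database copies are spread evenly across the servers, as they are in the example.

```python
def jetstress_database_settings(unique_databases, copies_per_database, servers):
    """Return (number of databases per server, number of copies per database)."""
    total_copies = unique_databases * copies_per_database
    if total_copies % servers:
        # Assumption: an uneven layout needs to be worked out per server by hand.
        raise ValueError("copies do not divide evenly across the servers")
    return total_copies // servers, copies_per_database


# 30 databases, each with 1 active + 2 passive HA + 1 lagged copy, on 6 servers
print(jetstress_database_settings(30, 1 + 2 + 1, 6))  # (20, 4)
```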
10. Configure the database and log file paths appropriately. Scroll to the bottom of this page to find the Next link.
Note: Refer to the Distribution tab of the Mailbox Role Calculator to understand how your databases should be configured.
11. If this is the first time the test has been run, select Create new databases; otherwise select Attach existing databases.
12. Verify that the paths are as expected and click Prepare test.
13. This will begin database initialisation. The time this takes will vary, but plan on 24 hours for every 10TB of data to be initialised. The amount of data should equate to 80% of the available storage. Refer to section 4.6.1 Initialisation for further information on database sizes and creation time.
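As a rough planning aid, the rule of thumb above (24 hours per 10TB, with 80% of the available storage being initialised) can be expressed as a small Python sketch. The function names and the 25TB figure are illustrative assumptions; actual initialisation time depends on the storage.

```python
def data_to_initialise_tb(available_storage_tb, fill_fraction=0.8):
    """Amount of test data created when 80% of the available storage is used."""
    return available_storage_tb * fill_fraction


def estimate_initialisation_hours(data_tb, hours_per_10tb=24.0):
    """Planning estimate only: 24 hours for every 10TB to be initialised."""
    return data_tb / 10.0 * hours_per_10tb


# Hypothetical server with 25TB of available storage
data_tb = data_to_initialise_tb(25)
print(data_tb, estimate_initialisation_hours(data_tb))
```

For 25TB of available storage this gives roughly 20TB of test data and an initialisation window of about two days.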
14. Once the test has been initialised, click Execute Test.
15. Once the test has completed, close Jetstress and copy the Jetstress report and performance data somewhere for analysis. Each performance test will generate the following files:
• Performance_<date>.XML
• Performance_<date>.HTML
• Performance_<date>.BLG
• DBChecksum_<date>.XML
• DBChecksum_<date>.HTML
• DBChecksum_<date>.BLG
• XMLConfig_<date>.XML
Ensure that you make a copy of all of these files. Note: In addition you may also wish to make a copy of the *.EVT files which contain event log data taken during the test.
Table 9 - Jetstress initial configuration
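Copying the result files in step 15 can be automated. The following Python sketch assumes the files follow the naming patterns listed in step 15 and that the *.EVT files sit in the same results folder; the function name and both folder arguments are hypothetical.

```python
import shutil
from pathlib import Path

# File name patterns taken from step 15, plus the optional event log files.
PATTERNS = [
    "Performance_*.XML", "Performance_*.HTML", "Performance_*.BLG",
    "DBChecksum_*.XML", "DBChecksum_*.HTML", "DBChecksum_*.BLG",
    "XMLConfig_*.XML", "*.EVT",
]


def collect_results(results_dir, archive_dir):
    """Copy every Jetstress output file into archive_dir; return the copied names."""
    results, archive = Path(results_dir), Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in PATTERNS:
        for path in sorted(results.glob(pattern)):
            shutil.copy2(path, archive / path.name)
            copied.append(path.name)
    return copied
```

Using a separate archive folder per test run keeps the 2 hour and 24 hour results apart for later comparison.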
This section will explain what output files will be created after the test and what is in each one.
File: Performance_<date>.BLG
Content: Binary performance data captured during the performance test.
Purpose: To provide detailed data for analysis. Open this file in perfmon and examine the counters manually to understand reasons for failure.

File: Performance_<date>.XML
Content: XML Report for the performance test.
Purpose: Provides the status report data in XML format.

File: Performance_<date>.HTML
Content: HTML Report for the performance test.
Purpose: Provides an easy to read status report for the test.

File: DBChecksum_<date>.BLG
Content: Binary performance data captured during the checksum test.
Purpose: Provides binary performance data gathered during the CRC checksum of the database. Useful if the checksum fails or takes a long time to complete.

File: DBChecksum_<date>.XML
Content: XML Report for the checksum test.
Purpose: Provides status report data in XML format.

File: DBChecksum_<date>.HTML
Content: HTML Report for the checksum test.
Purpose: Provides an easy to read status report for the checksum test.

File: XMLConfig_<date>.XML
Content: XML Configuration File.
Purpose: Provides a backup of the Jetstress Configuration file used for the test.
This section will walk through a very simple sample report, and explain where the key values are stored and how to interpret the data.
9.1
Make a note of the following value: Total Database Required IOPS / Server
9.2
9.2.1 Test Summary
This section is a basic summary of the test, when it started, finished and which versions of operating system and ESE were used. The most important part of this section is the overall test result, pass or fail.
9.2.2
This section shows some more detailed parameters regarding the test. A Test disk subsystem throughput report will always show 100% for Capacity Percentage and Throughput Percentage.
In this example, 4 x 25GB databases were created on a 126GB LUN. Jetstress created a total of 101GB (109154926592 bytes) of data for testing, which is 80% of the available space. This is normal behaviour; by default, in performance mode Jetstress will use 80% of the disk capacity to allow room for growth during the test process.
The most important value in this section is the Achieved Transactional I/O per Second. In this example the test validated that the storage can provide 231 transactional I/O per second. This represents random database IOPS.
Note: To validate that the test has met the design requirements, compare the Achieved Transactional I/O per Second from your Jetstress report to the Total Database Required IOPS / Server value from the Mailbox Role Calculator, recorded in section 9.1 Target design values.
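The capacity figures in this example can be reproduced with a few lines of Python. This is a sketch for checking a report by hand; it assumes that the report's GB values are binary gigabytes (1024^3 bytes), which is the interpretation that matches the 101GB figure.

```python
GB = 1024 ** 3  # assumption: the report uses binary gigabytes


def bytes_to_gb(size_bytes):
    return size_bytes / GB


def fraction_of_lun_used(size_bytes, lun_gb):
    return bytes_to_gb(size_bytes) / lun_gb


created = 109154926592  # total test data in bytes, from the example report
print(int(bytes_to_gb(created)))                   # 101
print(round(fraction_of_lun_used(created, 126), 2))
```

The second value comes out at roughly 0.8, in line with Jetstress using 80% of the disk capacity in performance mode.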
9.2.3
This section displays some system values that Jetstress used for this test. The important values for analysis here are the thread count and number of copies per database.
9.2.4 Database Configuration
This section lists the paths for each database and log combination. In this example, 4 x 25GB databases were configured on a single LUN. Check that all of the test databases are listed here and the path names are correct.
9.2.5
This section of the report displays the Transactional I/O values that were achieved for each database. Transactional I/O does not include I/O for Background Database Maintenance (BDM). BDM I/O is mostly sequential, so it is not usually considered during the design phase.
Information: If you sum the values highlighted in the red box, the result should add up to the Achieved Transactional I/O per Second reported in the Database Sizing and Throughput table. In this example, 33.859 + 24.069 + 33.87 + 23.491 + 33.978 + 24.186 + 34.043 + 23.807 = ~231 IOPS.
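The summation can be checked quickly in Python. The values are the ones quoted in the example above; the variable names are illustrative only.

```python
# Transactional I/O values quoted from the example report
transactional_io = [33.859, 24.069, 33.87, 23.491, 33.978, 24.186, 34.043, 23.807]

achieved = sum(transactional_io)
print(round(achieved))  # 231
```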
9.2.6
This section displays the I/O that was used to perform Background Database Maintenance only. The sum of values in the red box shows the total amount of I/O used for BDM operations. These are sequential operations and we do not usually need to account for them in our design. However, take the advice of your storage vendor on this aspect: some storage platforms do not handle sequential I/O as well as others and may require some additional design work to help them deal with BDM more gracefully.
9.2.7
This section displays the I/O overhead for LOG file replication. In this example there were two replica copies (replicas=2); this is shown by a non-zero count for I/O Log Reads/sec. If this value is greater than zero, it confirms that database replication is being simulated.
Note: For those that noticed, I finally provided a report that shows log I/O. I know, the little things count.
9.2.8
This table shows all I/O that was recorded during the test (transactional I/O plus BDM I/O plus LOG I/O). The summation of I/O values from areas highlighted in red in this table should agree (roughly) with those observed at the storage subsystem. In this case, the summation suggests that the storage subsystem had to deal with 349 IOPS. However, roughly one third of those IOPS (349 - 231 = 118) were sequential and so were not accounted for during the design process, since sequential I/O is very easy on most disk subsystems.
The following chart shows the observed IOPS from the Windows host during the Jetstress test. This counter includes all system IOPS as well as the test IOPS; however, there should be a strong correlation between the IOPS observed on the Windows host and at the storage subsystem. In the event of contradiction between observed IOPS at the Windows host and those at the storage controller, the Windows host values take precedence from a Jetstress validation perspective.
It is important to differentiate between sequential IOPS and transactional (random) IOPS when validating your storage. We are only interested in transactional IOPS when we are Jetstress testing. BDM and LOG I/O are sequential in nature, so we ignore them from a performance planning perspective for Exchange Server. Storage teams are often confused by the results of a Jetstress test because the Achieved Transactional I/O per Second value is much lower than the observations they make at the storage system.
Note: It is an invalid approach to sum the values displayed in the Total I/O Performance table and compare them to the Total Database Required IOPS / Server predicted by the Mailbox Role Calculator. The only value from the Jetstress report that is required for validation is Achieved Transactional I/O per Second. All other values are for support and curiosity only!
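The split between transactional and sequential I/O can be worked through in Python. This is an illustrative sketch using the two totals from the example report (349 total IOPS and 231 transactional IOPS); the function name is hypothetical.

```python
def sequential_share(total_iops, transactional_iops):
    """Return (sequential IOPS, sequential fraction of the total)."""
    sequential = total_iops - transactional_iops
    return sequential, sequential / total_iops


sequential, fraction = sequential_share(349, 231)
print(sequential)           # 118
print(round(fraction, 2))   # 0.34
```

Only the transactional figure is compared with the design target; the sequential remainder explains why the storage team sees a higher number at the array.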
9.2.9
This section of the report shows the observed system performance during the test. This section is most often used for troubleshooting. The most important thing to note from this section is that the CPU load from Jetstress is usually minimal. Jetstress has been optimized to evaluate the storage subsystem and not the host performance itself.
Error Code, Error Name
-1022
-1018
-1019
-1118
-1021 JET_errDiskReadVerificationFailure

Filesystem Corruptions
-533 JET_errCheckpointCorrupt
-528 JET_errMissingLogFile
-501 JET_errLogFileCorrupt
-1023 JET_errInvalidPath
-1024 JET_errInvalidSystemPath
-1025 JET_errInvalidLogDirectory
-1032 JET_errFileAccessDenied
-1812 JET_errFileInvalidType
-1852 JET_errLogCorrupted
-1305 JET_errObjectNotFound

Lost Flush
-1119
-566 JET_errDbTimeTooOld
-567 JET_errDbTimeTooNew
Information: Some failure events are more important than others. Lost Flush events signal that significant data corruption has occurred and that something is very wrong with your storage (under no circumstances should you entertain putting a system into production that is experiencing ANY lost flush events during a test). However, some other I/O failures are relatively normal. For example, in a JBOD environment we may see -1021 (JET_errDiskReadVerificationFailure). Although this signifies that the data we read was not the same as the data we originally wrote (checksum failed), Exchange will try to deal with this scenario via Page Patching in normal operation, and so it is not of critical importance.
For a full list of JET/ESE event types, see the article Extensible Storage Engine Error Codes.
What is a Lost Flush?
A lost flush occurs if we issued a write operation to the disk and the OS reported the operation as having successfully completed, but it didn't actually get physically committed to the non-volatile storage. The two main reasons for this to happen are:
1. A bug somewhere in the storage stack.
2. Power loss on storage with write-cache enabled. In this case, the operation is committed to the volatile cache of the disk or controller, but if the hardware loses power, the data never actually makes it to the non-volatile storage, even though the application was told that it did. This is the reason why we only run with write-cache enabled on the storage if there's a battery backing up the cache, so that if it loses power, the controller makes sure to flush the uncommitted cache to the disk.
A lost flush is a very insidious type of storage failure for a database engine because the consequences can range from none (if we are very lucky) to nasty and potentially undetectable logical database corruption (more likely). Undetected lost flushes on the active copy may show up as a JET_errDbTimeTooNew (-567) replication error on the passive copy. Undetected lost flushes on the passive copy may show up as a JET_errDbTimeTooOld (-566) replication error on the passive copy.
ESE has implemented lost flush detection, based on a flush map. Basically, every time we issue a write on a page, we flip a bit on the actual page and also store that bit in a flush map in memory. If we read the page again off the disk, we check the bit against the in-memory flush map and if they don't match, it means the flush was lost.
Important: The bottom line for lost flushes is that you should NEVER put a system into production that has recorded lost flushes during the Jetstress test. You must be 100% certain that you have resolved the underlying problem and have at least one good 24 hour test that has no lost flushes recorded before accepting the solution into production.
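The flush map idea can be illustrated with a toy model in Python. This is a teaching sketch only, not ESE's implementation: the `ToyDisk` and `ToyEngine` classes, the single flush bit per page and the `drop_next_write` switch are all assumptions made for the example.

```python
class LostFlushError(Exception):
    pass


class ToyDisk:
    """Toy disk that can report success for a write it silently drops."""

    def __init__(self):
        self.pages = {}
        self.drop_next_write = False

    def write(self, page_no, data, flush_bit):
        if self.drop_next_write:
            self.drop_next_write = False
            return  # success is reported, but nothing was committed
        self.pages[page_no] = (data, flush_bit)

    def read(self, page_no):
        return self.pages[page_no]


class ToyEngine:
    def __init__(self, disk):
        self.disk = disk
        self.flush_map = {}  # in-memory record of the bit last written per page

    def write_page(self, page_no, data):
        bit = 1 - self.flush_map.get(page_no, 0)  # flip the bit on every write
        self.disk.write(page_no, data, bit)
        self.flush_map[page_no] = bit

    def read_page(self, page_no):
        data, bit = self.disk.read(page_no)
        if bit != self.flush_map[page_no]:
            raise LostFlushError("lost flush detected on page %d" % page_no)
        return data


disk = ToyDisk()
engine = ToyEngine(disk)
engine.write_page(1, "version 1")
disk.drop_next_write = True        # simulate the storage losing the next write
engine.write_page(1, "version 2")
try:
    engine.read_page(1)
except LostFlushError as err:
    print(err)  # lost flush detected on page 1
```

With a single bit, two consecutive lost writes to the same page would cancel out in this toy model, which is one reason it should not be read as a description of how ESE stores its flush state.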
9.3
Stress Test Lenient mode (> 6 hour test)
• Average Database Read Latency: 20ms
• Average Log File Write Latency: 10ms
• Max Database Read Latency: 200ms
• Max Log File Write Latency: 200ms
9.4 Test evaluation
Evaluate the following criteria for each test run. The first criterion is validated against the design target and must be checked manually; Jetstress does not validate this value. The second and third are checked against pre-defined latency targets for Exchange; if these values are not within tolerance, Jetstress will report the test as failed.
1. DB IOPS Target: Is the Achieved Transactional I/O per Second in the test report higher than the Total Database Required IOPS / Server predicted in the Mailbox Role Calculator?
2. Is the I/O Database Reads Average Latency in the test report <20ms?
3. Is the I/O Log Writes Average Latency in the test report <10ms?

Action to take for each outcome:
• All criteria PASS: Test successful.
• The test is failing to meet the IOPS target, but the latency values are good: Increase the thread count by 1 and re-test. Use sluggishsessions to fine-tune if necessary.
• At least one database has recorded latency over threshold: If the latency values are very close to limits, increase sluggish sessions by 1. If both target IOPS and latency values are much higher, decrease the thread count.
• Achieved IOPS is below the design target AND the test latency values are above limits: The storage solution is unable to meet the requirements. At this stage, it is necessary to re-evaluate the storage design and begin troubleshooting the physical deployment to determine the correct remediation.
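The evaluation logic above can be captured in a short Python sketch, which may help when reviewing many test runs. The function name, the return strings and the sample latency figures are illustrative assumptions; the thresholds (20ms database read, 10ms log write) are the ones given in the criteria.

```python
SUCCESS = "Test successful"
MORE_THREADS = "Increase the thread count by 1 and re-test"
LESS_LOAD = "Latency over threshold: increase sluggish sessions or decrease the thread count"
REDESIGN = "Storage cannot meet the requirements: re-evaluate the storage design"


def evaluate_test(achieved_iops, required_iops, db_read_latency_ms, log_write_latency_ms):
    """Map the three pass/fail criteria onto the recommended action."""
    iops_ok = achieved_iops > required_iops
    latency_ok = db_read_latency_ms < 20 and log_write_latency_ms < 10
    if iops_ok and latency_ok:
        return SUCCESS
    if latency_ok:
        return MORE_THREADS
    if iops_ok:
        return LESS_LOAD
    return REDESIGN


# Hypothetical run: 231 achieved against a target of 200, with healthy latencies
print(evaluate_test(231, 200, 14.2, 3.1))  # Test successful
```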
Notes:
• Try auto-tuning with Jetstress 2013.
• If in doubt, start with thread=1 and work up until the test fails.
• If the thread count predicted is less than 1, it may be necessary to modify the sluggishsessions value afterwards.
• The exact quantity of IOPS generated per thread will change as the storage system workload changes. As the storage system gets closer to its performance limit, the IOPS per thread value will reduce.
• Jetstress was designed to produce approximately 60 IOPS per thread at 20ms disk latency.
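A first guess at the thread count can be derived from the 60 IOPS per thread figure. The Python sketch below is a starting point under stated assumptions, not a Jetstress formula: the function names are hypothetical, the target IOPS must be expressed at the same scope as the thread count setting, and the result still has to be confirmed by test runs.

```python
IOPS_PER_THREAD = 60  # approximate design figure at 20ms disk latency


def predicted_thread_count(target_iops):
    return target_iops / IOPS_PER_THREAD


def starting_point(target_iops):
    """Return (threads to start with, whether sluggishsessions may need tuning)."""
    predicted = predicted_thread_count(target_iops)
    # Never start below one thread; a prediction under 1 hints at sluggishsessions tuning.
    return max(1, int(predicted)), predicted < 1


print(starting_point(231))  # (3, False)
print(starting_point(30))   # (1, True)
```

Rounding down keeps the first run on the conservative side, in line with the advice to start low and work up.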
Config Generate
/c JetstressConfig.xml /g
TimeOut
/TimeOut 2H0M0S
Output
/output c:\output
DBPath
/dbpath m:\sg1\mdb /dbpath n:\sg2\mdb /log x:\sg1\log y:\sg2\log /pctcapacity 100 /throughput 100 /threads
DoNotRunDBMPerformance
RunDBMPerformance
New
• Open existing databases
• Restore backup database
• Run soft recovery test
• Run streaming backup test
• Run transaction performance test
• Run database checksums
VerifyCheckSum
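When scripting command line runs, it can be handy to convert between a duration in hours (as typed into the GUI, for example 0.50) and a TimeOut string. The Python sketch below assumes the TimeOut value always follows the <hours>H<minutes>M<seconds>S shape seen in the single 2H0M0S example above; that generalisation and both function names are assumptions.

```python
import re
from datetime import timedelta

_TIMEOUT = re.compile(r"^(\d+)H(\d+)M(\d+)S$", re.IGNORECASE)


def parse_timeout(text):
    """Parse a string such as '2H0M0S' into a timedelta."""
    match = _TIMEOUT.match(text)
    if not match:
        raise ValueError("unexpected TimeOut format: %r" % text)
    hours, minutes, seconds = (int(part) for part in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_timeout(hours):
    """Format a duration in hours (for example 0.5) as an H/M/S string."""
    total_seconds = round(hours * 3600)
    h, remainder = divmod(total_seconds, 3600)
    m, s = divmod(remainder, 60)
    return "%dH%dM%dS" % (h, m, s)


print(format_timeout(2))        # 2H0M0S
print(format_timeout(0.5))      # 0H30M0S
print(parse_timeout("2H0M0S"))  # 2:00:00
```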
Remember: This is not supported or recommended; only follow this as a matter of last resort or under the instruction of Microsoft Support/Microsoft Consulting Services.
14 Common Issues
14.1 Troubleshooting Jetstress
While using Jetstress, you may encounter some known issues. This section provides possible causes and the recommended solutions.
Event log error that may display: Error -1032
Possible cause: Permissions are insufficient to access the .edb file or the log files.
Solution: Verify that permissions are sufficient for the account under which Jetstress is running. Jetstress requires read/write permission to the directories it is using.

Event log error that may display: Error -550 (0)
Possible cause: The last time Jetstress was run, it was ended uncleanly. This caused the log files to become unsynchronized with the database.
Solution: Delete the Jetstress database (*.edb), log files (*.log), and check file (*.chk), and re-create the Jetstress database. You can also use Eseutil.exe with the /r switch to resynchronize the logs and database.

Event log error that may display: Error -1022
Possible cause: The failure is caused by circular logging by Jetstress.
Solution: Check the log drive for the log file name that is identified in the event log. Delete that log file and all the log files that have a higher number in the file name. Then, run Eseutil.exe /r to recover Jetstress.edb. When the database is in a good state, delete all the log files in the log directory, and rerun Jetstress.
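For teams that triage many test runs, the three event log errors above can be kept in a small lookup table. The Python sketch below simply restates the causes and solutions from this section in abbreviated form; the dictionary and function names are hypothetical.

```python
# Error code -> (possible cause, solution), abbreviated from the entries above
KNOWN_ISSUES = {
    -1032: ("Insufficient permissions on the .edb file or the log files",
            "Grant the Jetstress account read/write permission to its directories"),
    -550: ("Jetstress ended uncleanly; logs are unsynchronized with the database",
           "Delete the *.edb, *.log and *.chk files and re-create the database, or run Eseutil.exe /r"),
    -1022: ("Circular logging by Jetstress",
            "Delete the named log file and all higher-numbered logs, then run Eseutil.exe /r"),
}


def advise(error_code):
    """Return a one-line hint for a known error code, or a fallback message."""
    if error_code not in KNOWN_ISSUES:
        return "Error %d: not covered in this section" % error_code
    cause, solution = KNOWN_ISSUES[error_code]
    return "Error %d: %s. %s." % (error_code, cause, solution)


print(advise(-1032))
```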
Solution: The mount path folder could be listed as <DIR> for a number of reasons:
1. Verify the LUN is present and in good health.
2. Use the storage system array management software to verify that the LUN has an assigned logical drive.
3. Using the Disk Management MMC, re-assign the LUN to the correct mount-point.
14.1.5 Jetstress testing failed. Error: System.ApplicationException: Faulty performance counter paths: \MSExchange Database(*)\*
Jetstress version 658.004 has an incompatibility with ESE version 620 (CU1) and above if you try to run a test with more than 38 databases configured. If you experience this issue, either use the RTM version of ESE (516.26) or use a version of ESE later than 726, which will be released with CU2. Additionally, a fixed version of Jetstress will be released (726) that will work with all versions of ESE after 516.26 (Exchange 2013 ESE).