
Hands-On Lab

Lab Manual
Building the High Performance Linpack (HPL) Tool
as a Windows Native Application

Authors: David Wang, Phil Pennington


Solutions Architect

Please do not remove this manual from the lab

Information in this document is subject to change without notice. The example companies, organizations,
products, people, and events depicted herein are fictitious. No association with any real company, organization,
product, person or event is intended or should be inferred. Complying with all applicable copyright laws is the
responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced,
stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical,
photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft
Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights
covering subject matter in this document. Except as expressly provided in any written license agreement from
Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights,
or other intellectual property.

© 2005 Microsoft Corporation. All rights reserved.

Microsoft, MS-DOS, MS, Windows, Windows NT, MSDN, Active Directory, BizTalk, SQL Server, SharePoint,
Outlook, PowerPoint, FrontPage, Visual Basic, Visual C++, Visual J++, Visual InterDev, Visual SourceSafe,
Visual C#, Visual J#, and Visual Studio are either registered trademarks or trademarks of Microsoft Corporation
in the U.S.A. and/or other countries.

Other product and company names herein may be the trademarks of their respective owners.

Contents

LAB: BUILDING THE HIGH PERFORMANCE LINPACK (HPL) TOOL AS A WINDOWS NATIVE APPLICATION
Lab Objective
Software Requirements
Exercise 1 – Building the HPL tool using Visual Studio 2005
Task 1 – Download the HPL source code from netlib.org
Task 2 – Download Math Kernel Libraries from either AMD or Intel
HPL Tuning
Description of the HPL.dat File
Guidelines

Lab: Building the High Performance Linpack (HPL) Tool as a
Windows Native Application
The Linpack tool is commonly used to benchmark cluster performance. This lab will walk you
through building the Linux source code on the Windows platform.

Lab Objective
Estimated time to complete this lab: 60 minutes

The objectives of this lab are to become familiar with the Microsoft developer tools and environment
used to build Linux applications on Windows. The exercise will also produce a Linpack benchmark
tool from the open HPL source.

Software Requirements
• Visual Studio 2005
• Platform SDK
• Internet connection to download source code from the web
• CCS SDK

Exercise 1 – Building the HPL tool using Visual Studio 2005


In this exercise, you will learn how to use Visual Studio to build the Linux source code into a Windows
application.

Task 1 – Download HPL Source code from netlib.org


You can download the source code from the following web site:

http://www.netlib.org/benchmark/hpl/

Download and extract the source code to

c:\workshop\hpl

Task 2 – Download Math Kernel Libraries from either AMD or Intel

• Download the Intel MKL libraries for Windows x64 from Intel:

https://registrationcenter.intel.com/EvalCenter/DownloadCounter.aspx?File=1031&EmailID=dwang%40microsoft.com&Sequence=450520

• Or download the ACML libraries from AMD:

http://developer.amd.com/acml.aspx?session_id=CAHJH$51498@HIHFDAHE#a

Download and unzip all the source code to the c:\Workshop\IMP folder. Note: if you are not able to
read and unzip .GZ files, you can go to winzip.com to download the latest unzip application to
uncompress the source files.

Download the Windows x64 version compiled with PGI:

acml-3-5-0-win64.exe (28 MB) – AMD Core Math Library (ACML) for Windows® 64-bit, built with PGI.

Make sure that you install it to the root of the C drive.
This will result in a folder named c:\acml3.5.0.

Task 3 – Download and install the CCP SDK

Download the CCP SDK from the link below:
http://www.microsoft.com/downloads/details.aspx?FamilyID=D8462378-2F68-409D-9CB3-02312BC23BFD&displaylang=en

Make sure that you install it onto the C drive and choose the default settings.
This will result in a directory structure like the following:
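As a rough sketch, the default SDK installation typically produces a layout like this (the exact folder names are an assumption and may differ by SDK version):

C:\Program Files\Microsoft Compute Cluster Pack SDK\
    Include\
    Lib\AMD64\
    Lib\i386\

The Include and Lib folders are what the HPL build needs in order to locate the MS-MPI header (mpi.h) and import library (msmpi.lib).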

Task 4 – Build HPL by using the makefile provided

We have provided a makefile that will be used to build the package.
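As a minimal sketch of the build step — assuming the makefile is named Makefile and sits in c:\workshop\hpl (both assumptions; substitute the actual file name provided with the lab) — open a Visual Studio 2005 x64 command prompt and run:

cd c:\workshop\hpl
nmake /f Makefile

The x64 command prompt matters: it puts the 64-bit compiler and linker on the path so the build can link against the x64 MKL or ACML libraries downloaded in Task 2.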

Task 5 – Running HPL for benchmarking


Copy all .dat, .exe and .dll files to a directory on a head node. Let's assume it is \\headnode\scratch.
Modify the HPL.dat file to suit the target configuration (see more at
http://www.netlib.org/benchmark/hpl/tuning.html ):

• Set the value of Ns to represent the problem size. The actual value will depend on the total memory
available across all nodes. A value of 12,000 seems to work well for 2GB systems, and 26,000 for 4 nodes
with 2GB each (see the illustrative HPL.dat excerpt after the job command below). Note that there is a
non-linear dependency between the problem size and the amount of memory, so you should experiment
with this value. You can try setting it lower to get a quicker run and then raise it to get more accurate
GFlop numbers. If it becomes too large, the problem will not fit in physical memory and the systems will
start swapping to disk, drastically reducing performance. You can run multiple problem sizes in a single
run by putting several numbers separated by spaces on the same line; in that case, set the value on the
previous line to the number of problem sizes you want to explore.

• Set the value of Ps to the number of processors (cores) you have on each node and the value of Qs to
the number of nodes. Note that Hyper-Threading should be disabled in the BIOS. Again, you can explore
different combinations of P x Q; set the value of the previous line to the number of combinations you're
working with.

• Run the following command to schedule the job. N should be the total number of processors you want to
use and should be equal to or greater than P * Q in the HPL.dat file.

• job submit /numprocessors:N /stdout:\\headnode\scratch\hpl.out mpiexec -wdir \\headnode\scratch xhpl.exe
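As a concrete illustration, the relevant HPL.dat entries for 4 nodes with 2 cores and 2GB of memory each might read as follows (the values come from the guidance above; NB=128 is an illustrative block size, not a tuned one):

1            # of problems sizes (N)
26000        Ns
1            # of NBs
128          NBs
1            # of process grids (P x Q)
2            Ps
4            Qs

A rough way to derive Ns: the matrix holds N^2 double-precision values of 8 bytes each, so targeting about 80% of total memory gives N ≈ sqrt(0.8 × memory_in_bytes / 8). For 4 × 2GB that is sqrt(0.8 × 8589934592 / 8) ≈ 29,000, consistent with the 26,000 suggested above once you leave headroom for the OS.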

• After the job starts, you should see HPL.out created in \\headnode\scratch. This file will contain the
output generated by the running Linpack. With large N values it takes a while for the result to appear;
for a value of 26,000, for example, it can be close to 10 minutes. When running on a single processor
(which you should do the first time around to set a baseline), the resulting GFlop value should be
roughly equal to or greater than the clock speed of a processor (3GHz -> 3GFlops). When scaling across
the compute nodes you will see a roughly linear drop-off in total performance from the theoretical
max (N procs * GFlops per proc). (Things get non-linear at larger cluster sizes; a full discussion is
beyond this quick note.) This is due to the interconnect latency. For very rough calibration: on a GigE
interconnect, you should see 85% or so of the max value when running across 2 nodes with 2 procs
each. With InfiniBand you should do better.
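To make those numbers concrete: 2 nodes × 2 procs × 3 GFlops per proc gives a theoretical max of 12 GFlops, so at 85% you would expect on the order of 0.85 × 12 ≈ 10 GFlops over GigE.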

- It is helpful to create a Perfmon window that shows all of the compute nodes at once. Go to Start /
Administrative Tools / Performance. Right-click on the main window and select Properties. In the
General tab, change View to Histogram. Go to the Data tab and add a % Processor Time counter for
each of the compute nodes by specifying their names in the dialog.

Monitor job completion using the Compute Cluster Job Manager or the "job list" command.
When the job completes, results can be found in \\headnode\scratch\hpl.out.
Alternatively, you can configure the job as follows; note the working directory where it will be run:
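As a sketch of such a configuration — the /workdir switch is an assumption about your CCS version, so check job submit /? if it is rejected — the working directory can be set on the job itself instead of through mpiexec -wdir:

• job submit /numprocessors:N /workdir:\\headnode\scratch /stdout:hpl.out mpiexec xhpl.exe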

It should show results something like this:

============================================================================
HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
============================================================================

An explanation of the input/output parameters follows:


T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N : 29 30 34 35
NB : 1 2 3 4
PMAP : Row-major process mapping
P : 2 1 4
Q : 2 4 1
PFACT : Left Crout Right
NBMIN : 2 4
NDIV : 2
RFACT : Left Crout Right
BCAST : 1ring
DEPTH : 0
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words

----------------------------------------------------------------------------

- The matrix A is randomly generated for each test.


- The following scaled residual checks will be computed:
1) ||Ax-b||_oo / ( eps * ||A||_1 * N )
2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-016
- Computational tests pass if scaled residuals are less than 16.0

============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00L2L2 29 1 2 2 0.16 1.120e-004
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N )= 0.0674622 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0519667 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0174238 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00L2L4 29 1 2 2 0.01 3.028e-003

Task 6 – Experimenting and tuning the HPL tool


Run HPL several times after modifying the HPL.dat file and examine the results.
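For example, to compare several block sizes in a single run, lines 7-8 of HPL.dat could read as follows (the values are illustrative; see the tuning appendix for how these lines are interpreted):

4            # of NBs
64 96 128 160 NBs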
Task 7 – Using Excel as a front end (optional)

Start Excel
Select Tools / Macro / Security from the menu
Set Security Level to Medium
Restart Excel
Select File / Open and open linpack.xls
Click Enable Macros in the Security Warning
Under Cluster Parameters set 'cluster' and 'work dir'
Under Linpack Parameters set N, NB, P and Q
Click the Run button to schedule a job on the cluster

Click the Query button to periodically poll for results until the status changes to Successful
Click the Results button to parse the HPL.out file and plot the results
(The Execute button runs xhpl.exe as a local process, not on a cluster)

Select Tools / Macro / Visual Basic Editor


The VB code that integrates with Compute Cluster is under Modules / Module1
Note that Cluster Parameters are mapped into an XSD under Data / XML / XML Source
All of the values are then used for job submission in RunJob() through
ActiveWorkbook.XmlMaps(1)

Appendix

HPL Tuning
After having built the executable hpl/bin/<arch>/xhpl, one may want to modify the input data file HPL.dat. This file should reside in the
same directory as the executable hpl/bin/<arch>/xhpl. An example HPL.dat file is provided by default. This file contains information about
the problem sizes, machine configuration, and algorithm features to be used by the executable. It is 31 lines long. All the selected
parameters will be printed in the output generated by the executable.

We first describe the meaning of each line of this input file below. A few useful experimental guidelines for setting up the file are given
at the end of this page.

Description of the HPL.dat File

Line 1: (unused) Typically one would use this line for its own good. For example, it could be used to summarize the content of the input
file. By default this line reads:
HPL Linpack benchmark input file

Line 2: (unused) same as line 1. By default this line reads:


Innovative Computing Laboratory, University of Tennessee

Line 3: the user can choose where the output should be redirected to. In the case of a file, a name is necessary, and this is the line where
one wants to specify it. Only the first name on this line is significant. By default, the line reads:
HPL.out output file name (if any)
This means that if one chooses to redirect the output to a file, the file will be called "HPL.out". The rest of the line is unused; this
space can be used to put an informative comment on the meaning of this line.

Line 4: This line specifies where the output should go. The line is formatted; it must begin with a positive integer, and the rest is insignificant.
Three choices are possible for the positive integer: 6 means that the output will go to the standard output, and 7 means that the output will go
to the standard error. Any other integer means that the output should be redirected to a file, whose name has been specified in the line above.
By default this line reads:
6 device out (6=stdout,7=stderr,file)
which means that the output generated by the executable should be redirected to the standard output.

Line 5: This line specifies the number of problem sizes to be executed. This number should be less than or equal to 20. The first integer is
significant, the rest is ignored. If the line reads:
3 # of problems sizes (N)
this means that the user is willing to run 3 problem sizes that will be specified in the next line.

Line 6: This line specifies the problem sizes one wants to run. Assuming the line above started with 3, the first 3 positive integers are
significant, the rest is ignored. For example:
3000 6000 10000 Ns
means that one wants xhpl to run 3 (specified in line 5) problem sizes, namely 3000, 6000 and 10000.

Line 7: This line specifies the number of block sizes to be run. This number should be less than or equal to 20. The first integer is
significant, the rest is ignored. If the line reads:
5 # of NBs
this means that the user is willing to use 5 block sizes that will be specified in the next line.

Line 8: This line specifies the block sizes one wants to run. Assuming the line above started with 5, the first 5 positive integers are
significant, the rest is ignored. For example:
80 100 120 140 160 NBs
means that one wants xhpl to use 5 (specified in line 7) block sizes, namely 80, 100, 120, 140 and 160.

Line 9: This line specifies how the MPI processes should be mapped onto the nodes of your platform. There are currently two possible
mappings, namely row- and column-major. This feature is mainly useful when these nodes are themselves multi-processor computers. A
row-major mapping is recommended.
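In the stock HPL.dat this line reads:
0            PMAP process mapping (0=Row-,1=Column-major)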

Line 10: This line specifies the number of process grids to be run. This number should be less than or equal to 20. The first integer is
significant, the rest is ignored. If the line reads:
2 # of process grids (P x Q)
this means that you are willing to try 2 process grid sizes that will be specified in the next line.

Lines 11-12: These two lines specify the number of process rows and columns of each grid you want to run on. Assuming the line above
(10) started with 2, the first 2 positive integers of those two lines are significant, the rest is ignored. For example:
1 2 Ps
6 8 Qs
means that one wants to run xhpl on 2 process grids (line 10), namely 1-by-6 and 2-by-8. Note: in this example, it is required to start
xhpl on at least 16 nodes (the max of Pi-by-Qi). The runs on the two grids will be consecutive. If one started xhpl on more than 16 nodes,
say 52, only 6 would be used for the first grid (1x6) and then 16 (2x8) would be used for the second grid. The fact that you started the MPI
job on 52 nodes will not make HPL use all of them; in this example, only 16 would be used. If one wants to run xhpl with 52 processes,
one needs to specify a grid of 52 processes; for example, the following lines would do the job:
4 2 Ps
13 8 Qs

Line 13: This line specifies the threshold to which the residuals should be compared. The residuals should be of order 1, but are in
practice slightly less than this, typically 0.001. This line is made of a real number; the rest is not significant. For example:
16.0 threshold
In practice, a value of 16.0 will cover most cases. For various reasons, it is possible that some of the residuals become slightly larger, say
for example 35.6. xhpl will flag those runs as failed, but they can be considered correct. A run should be considered failed if the
residual is a few orders of magnitude bigger than 1, for example 10^6 or more. Note: if one were to specify a threshold of 0.0, all tests
would be flagged as failed, even though the answer is likely to be correct. It is allowed to specify a negative value for this threshold, in
which case the checks will be bypassed, no matter what the threshold value is. This feature allows one to save time
when performing a lot of experiments, for instance during the tuning phase. Example:
-16.0 threshold

The remaining lines specify algorithmic features. xhpl will run all possible combinations of those for each problem size, block
size, and process grid combination. This is handy when one looks for an "optimal" set of parameters. To understand this a little better,
let us first say a few words about the algorithm implemented in HPL. Basically, this is a right-looking version with row-partial pivoting.
The panel factorization is matrix-matrix operation based and recursive, dividing the panel into NDIV subpanels at each step. This part
of the panel factorization is denoted below by "recursive panel fact. (RFACT)". The recursion stops when the current panel is made of
less than or equal to NBMIN columns. At that point, xhpl uses a matrix-vector operation based factorization denoted below by "PFACTs".
Classic recursion would then use NDIV=2, NBMIN=1. There are essentially 3 numerically equivalent LU factorization algorithm variants
(left-looking, Crout, and right-looking). In HPL, one can choose any of those for the RFACT, as well as the PFACT. The following lines of
HPL.dat allow you to set those parameters.

Lines 14-21: (Example 1)


3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
4 # of recursive stopping criterium
1 2 4 8 NBMINs (>= 1)
3 # of panels in recursion
2 3 4 NDIVs
3 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
This example would try all variants of PFACT, 4 values for NBMIN, namely 1, 2, 4 and 8, 3 values for NDIV namely 2, 3 and 4, and all
variants for RFACT.

Lines 14-21: (Example 2)


2 # of panel fact
2 0 PFACTs (0=left, 1=Crout, 2=Right)
2 # of recursive stopping criterium
4 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)
This example would try 2 variants of PFACT namely right looking and left looking, 2 values for NBMIN, namely 4 and 8, 1 value for NDIV
namely 2, and one variant for RFACT.

In the main loop of the algorithm, the current panel of columns is broadcast in process rows using a virtual ring topology. HPL offers
various choices, and one most likely wants to use the increasing ring modified, encoded as 1; 3 and 4 are also good choices.

Lines 22-23: (Example 1)


1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
This will cause HPL to broadcast the current panel using the increasing ring modified topology.

Lines 22-23: (Example 2)


2 # of broadcast
0 4 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
This will cause HPL to broadcast the current panel using the increasing ring virtual topology and the long message algorithm.

Lines 24-25 specify the look-ahead depth used by HPL. A depth of 0 means that the next panel is factorized after the update by
the current panel is completely finished. A depth of 1 means that the next panel is immediately factorized after being updated; the
update by the current panel is then finished. A depth of k means that the k next panels are factorized immediately after being updated;
the update by the current panel is then finished. It turns out that a depth of 1 seems to give the best results, but may need a large
problem size before one can see the performance gain. So use 1 if you do not know better; otherwise you may want to try 0. Look-ahead
depths of 3 and larger will probably not give you better results.

Lines 24-25: (Example 1):


1 # of lookahead depth
1 DEPTHs (>=0)
This will cause HPL to use a look-ahead of depth 1.

Lines 24-25: (Example 2):

2 # of lookahead depth
0 1 DEPTHs (>=0)
This will cause HPL to use a look-ahead of depths 0 and 1.

Lines 26-27 specify the swapping algorithm used by HPL for all tests. There are currently two swapping algorithms available, one
based on "binary exchange" and the other on a "spread-roll" procedure (also called "long" below). For large problem sizes, the latter
is likely to be more efficient. The user can also choose to mix both variants, that is, "binary exchange" for a number of columns
less than a threshold value, and then the "spread-roll" algorithm. This threshold value is specified on Line 27.

Lines 26-27: (Example 1):


1 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
This will cause HPL to use the "long" or "spread-roll" swapping algorithm. Note that a threshold is specified in that example but not used
by HPL.

Lines 26-27: (Example 2):


2 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
This will cause HPL to use the "long" or "spread-roll" swapping algorithm as soon as there are more than 60 columns in the row panel.
Otherwise, the "binary exchange" algorithm will be used instead.

Line 28 specifies whether the upper triangle of the panel of columns should be stored in transposed or non-transposed form.
Example:
0 L1 in (0=transposed,1=no-transposed) form

Line 29 specifies whether the panel of rows U should be stored in transposed or non-transposed form. Example:
0 U in (0=transposed,1=no-transposed) form

Line 30 enables / disables the equilibration phase. This option will not be used unless you selected 1 or 2 in Line 26. Example:
1 Equilibration (0=no,1=yes)

Line 31 specifies the alignment in memory for the memory space allocated by HPL. On modern machines, one probably wants to
use 4, 8 or 16. This may result in a tiny amount of wasted memory. Example:
8 memory alignment in double (> 0)

Guidelines

1. Figure out a good block size for the matrix multiply routine. The best method is to try a few out. If you happen to know the
block size used by the matrix-matrix multiply routine, a small multiple of that block size will do fine. This particular topic is
discussed in the FAQs section.

2. The process mapping should not matter if the nodes of your platform are single-processor computers. If these nodes are
multi-processors, a row-major mapping is recommended.

3. HPL likes "square" or slightly flat process grids. Unless you are using a very small process grid, stay away from the 1-by-Q and
P-by-1 process grids. This particular topic is also discussed in the FAQs section.

4. Panel factorization parameters: a good start is the following for lines 14-21:

1 # of panel fact
1 PFACTs (0=left, 1=Crout, 2=Right)
2 # of recursive stopping criterium
4 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)

5. Broadcast parameters: at this time it is far from obvious what the best setting is, so I would probably try them all. If I had to
guess, I would probably start with the following for lines 22-23:

2 # of broadcast
1 3 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)

The best broadcast depends on your problem size and hardware performance. My take is that 4 or 5 may be competitive for
machines featuring very fast nodes relative to the network.

6. Look-ahead depth: as mentioned above, 0 or 1 are likely to be the best choices. This also depends on the problem size and
machine configuration, so I would try "no look-ahead (0)" and "look-ahead of depth 1 (1)". That is, for lines 24-25:

2 # of lookahead depth
0 1 DEPTHs (>=0)

7. Swapping: one can select only one of the three algorithms in the input file. Theoretically, mix (2) should win, however long (1)
might just be good enough. The difference should be small between those two, assuming a swapping threshold of the order of
the block size (NB) selected. If this threshold is very large, HPL will use bin_exch (0) most of the time, and if it is very small (< NB),
long (1) will always be used. In short, and assuming the block size (NB) used is say 60, I would choose for lines 26-27:

2 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold

I would also try the long variant. For a very small number of processes in every column of the process grid (say < 4), very little
performance difference should be observable.

8. Local storage: I do not think Line 28 matters; pick 0 if in doubt. Line 29 is more important. It controls how the panel of rows
should be stored. No doubt 0 is better. The caveat is that in that case the matrix-multiply function is called with ( Notrans,
Trans, ... ), that is, C := C - A B^T. Unless the computational kernel you are using has a very poor (with respect to performance)
implementation of that case, and is much more efficient with ( Notrans, Notrans, ... ), just pick 0 as well. So, my choice:

0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form

9. Equilibration: it is hard to tell whether equilibration should always be performed or not. Not knowing much about the random
matrix generated, and because the overhead is so small compared to the possible gain, I turn it on all the time.

1 Equilibration (0=no,1=yes)

10. For alignment, 4 should be plenty, but just to be safe, one may want to pick 8 instead.

8 memory alignment in double (> 0)

