The information contained in this document represents the current view of Microsoft Corporation and
BULL S.A.S. on the issues discussed as of the date of publication. Because Microsoft and BULL S.A.S.
must respond to changing market conditions, it should not be interpreted to be a commitment on the part
of Microsoft or BULL S.A.S., and Microsoft and BULL S.A.S. cannot guarantee the accuracy of any
information presented after the date of publication.
This White Paper is for informational purposes only. MICROSOFT and BULL S.A.S. MAKE NO
WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS
DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights
under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval
system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or
otherwise), or for any purpose, without the express written permission of Microsoft Corporation and BULL
S.A.S.
Microsoft and BULL S.A.S. may have patents, patent applications, trademarks, copyrights, or other
intellectual property rights covering subject matter in this document. Except as expressly provided in any
written license agreement from Microsoft or BULL S.A.S., as applicable, the furnishing of this document
does not give you any license to these patents, trademarks, copyrights, or other intellectual property.
© 2008, 2009 Microsoft Corporation and BULL S.A.S. All rights reserved.
Microsoft, Hyper-V, Windows, Windows Server, and the Windows logo are trademarks of the Microsoft
group of companies.
PBS GridWorks®, GridWorks™, PBS Professional®, PBS™ and Portable Batch System® are trademarks
of Altair Engineering, Inc.
The names of actual companies and products mentioned herein may be the trademarks of their
respective owners.
A Hybrid OS Cluster Solution: Dual-Boot and Virtualization with Windows HPC Server 2008 and Linux Bull Advanced Server for Xeon
Abstract
The choice of an operating system (OS) for a high performance computing (HPC) cluster is a critical
decision for IT departments. The goal of this paper is to show that simple techniques are available today
to optimize the return on investment by making that choice unnecessary, and keeping the HPC
infrastructure versatile and flexible. This paper introduces Hybrid Operating System Clusters (HOSC).
An HOSC is an HPC cluster that can run several OS’s simultaneously. This paper addresses the situation
where two OS’s are running simultaneously: Linux Bull Advanced Server for Xeon and Microsoft®
Windows® HPC Server 2008. However, most of the information presented in this paper applies, with
slight adaptations, to 3 or more simultaneous OS’s, possibly from other OS distributions. This document
gives general concepts as well as detailed setup information. First, the technologies necessary to design
an HOSC are defined (dual-boot, virtualization, PXE, resource manager and job scheduler). Second,
different approaches to HOSC architectures are analyzed and technical recommendations are given, with
a focus on computing performance and management flexibility. The recommendations are then
implemented to determine the best technical choices for designing an HOSC prototype. The installation
setup of the prototype and the configuration steps are explained. A meta-scheduler based on Altair PBS
Professional is implemented. Finally, basic HOSC administrator operations are listed and ideas for future
work are proposed.
ABSTRACT
1 INTRODUCTION
2 CONCEPTS AND PRODUCTS
3 APPROACHES AND RECOMMENDATIONS
5 SETUP OF THE HOSC PROTOTYPE
6.7 CHECK NODE STATUS WITH THE META-SCHEDULER
7 CONCLUSION AND PERSPECTIVES
APPENDIX A: ACRONYMS
APPENDIX B: BIBLIOGRAPHY AND RELATED LINKS
APPENDIX C: MASTER BOOT RECORD DETAILS
C.1 MBR STRUCTURE
C.2 SAVE AND RESTORE MBR
APPENDIX D: FILES USED IN EXAMPLES
D.1 WINDOWS HPC SERVER 2008 FILES
D.1.1 Files used for compute node deployment
D.1.2 Script for IPoIB setup
D.1.3 Scripts used for OS switch
D.2 XBAS FILES
D.2.1 Kickstart and PXE files
D.2.2 DHCP configuration
D.2.3 Scripts used for OS switch
D.2.4 Network interface bridge configuration
D.2.5 Network hosts
D.2.6 IB network interface configuration
D.2.7 ssh host configuration
D.3 META-SCHEDULER SETUP FILES
D.3.1 PBS Professional configuration files on XBAS
D.3.2 PBS Professional configuration files on HPCS
D.3.3 OS load balancing files
APPENDIX E: HARDWARE AND SOFTWARE USED FOR THE EXAMPLES
E.1 HARDWARE
E.2 SOFTWARE
APPENDIX F: ABOUT ALTAIR AND PBS GRIDWORKS
APPENDIX G: ABOUT MICROSOFT AND WINDOWS HPC SERVER 2008
G.1 ABOUT MICROSOFT
G.2 ABOUT WINDOWS HPC SERVER 2008
APPENDIX H: ABOUT BULL S.A.S.
1 Introduction
The choice of the right operating system (OS) for a high performance computing (HPC) cluster can be a
very difficult decision for IT departments, and it usually has a big impact on the Total Cost of Ownership
(TCO) of the cluster. Parameters like multiple user needs, application environment requirements and
security policies add to the complex human factors involved in training, maintenance and support
planning, all of which put the final return on investment (ROI) of the whole HPC infrastructure at risk. The
goal of this paper is to show that simple techniques are available today to make that choice unnecessary,
and to keep your HPC infrastructure versatile and flexible.
In this white paper we will study how to provide the best flexibility for running several OS’s on an HPC
cluster. There are two main types of approaches to providing this service depending on whether a single
operating system is selected each time the whole cluster is booted, or whether several operating systems
are run simultaneously on the cluster. The most common approach of the first type is called the dual-boot
cluster (described in [1] and [2]). For the second type of approach, we introduce the concept of a Hybrid
Operating System Cluster (HOSC): a cluster with some computing nodes running one OS type while the
remaining nodes run another OS type. Several approaches to both types are studied in this document in
order to determine their properties (requirements, limits, feasibility, and usefulness) with a clear focus on
computing performance and management flexibility.
The study is limited to 2 operating systems: Linux Bull Advanced Server for Xeon 5v1.1 and Microsoft
Windows HPC Server 2008 (referred to as XBAS and HPCS, respectively, in this paper). To optimize the
interoperability between the two OS worlds, we use the Subsystem for UNIX-based Applications (SUA) for
Windows. The description of the methodologies is kept as general as possible in order to apply to other
OS distributions, but examples are given exclusively in the XBAS/HPCS context. The concepts developed
in this document could apply to 3 or more simultaneous OS’s with slight adaptations; however, this is out
of the scope of this paper.
We introduce a meta-scheduler that provides a single submission point for both Linux and Windows. It
selects the cluster nodes with the OS type required by submitted jobs. The OS type of compute nodes
can be switched automatically and safely without administrator intervention. This optimizes computational
workloads by adapting the distribution of OS types among the compute nodes.
A technical proof of concept is given by designing, installing and running an HOSC prototype. This
prototype can provide computing power under both XBAS and HPCS simultaneously. It has two virtual
management nodes (aka head nodes) on a single server and the choice of the OS distribution among the
compute nodes can be done dynamically. We have chosen Altair PBS Professional software to
demonstrate a meta-scheduler implementation. This project is the result of the collaborative work of
Microsoft and Bull.
Chapter 2 defines the main technologies used in an HOSC: the Master Boot Record (MBR), the dual-boot
method, virtualization, the Pre-boot eXecution Environment (PXE), and resource manager and job
scheduler tools. If you are already familiar with these concepts, you may want to skip this chapter and go
directly to Chapter 3, which analyzes different approaches to HOSC architectures and gives technical
recommendations for their design. The recommendations are implemented in Chapter 4 in order to
determine the best technical choices for building an HOSC prototype. The installation setup of the
prototype and the configuration steps are explained in Chapter 5. Appendix D shows the files that were
used during this step. Finally, basic HOSC administrator operations are listed in Chapter 6 and ideas for
future work are proposed in Chapter 7, which concludes this paper.
This document is intended for computer scientists who are familiar with HPC cluster administration.
All acronyms used in this paper are listed in Appendix A. Complementary information can be found in the
documents and web pages listed in Appendix B.
2 Concepts and products
We assume that the readers may not be familiar with every concept discussed in the remaining chapters
in both Linux and Windows environments. Therefore, this chapter introduces the technologies (Master
Boot Record, Dual-boot, virtualization and Pre-boot eXecution Environment) and products (Linux Bull
Advanced Server, Windows HPC Server 2008 and PBS Professional) mentioned in this document.
If you are already familiar with these concepts or are more interested in general Hybrid OS Cluster
(HOSC) considerations, you may want to skip this chapter and go directly to Chapter 3.
2.1 Master Boot Record (MBR)
The MBR includes the partition table of the 4 primary partitions and a bootstrap code that can start the
OS or load and run the boot loader code (see the complete MBR structure in Table 3 of Appendix C.1).
A partition is encoded as a 16-byte structure with size, location and characteristic fields. The first 1-byte
field of the partition structure is called the boot flag.
The Windows MBR starts the OS installed on the active partition. The active partition is the first primary
partition that has its boot flag enabled. You can select an OS by activating the partition where it is
installed. The tools diskpart.exe (on Windows) and fdisk (on Linux) can be used to change partition
activation. Appendix D.1.3 and Appendix D.2.3 give examples of commands that enable/disable the boot flag.
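As an illustration of the partition-entry layout described above, the following sketch sets the boot flag byte directly in an MBR image with dd. It works on a plain image file so it is self-contained; on a real node this manipulation is normally left to diskpart.exe or fdisk, and the image name is an assumption:

```shell
# Sketch: set the boot flag of primary partition N in an MBR image.
# The 4 primary partition entries are 16 bytes each, starting at
# offset 446; the first byte of each entry is the boot flag
# (0x80 = active, 0x00 = inactive).
IMG=disk.img
N=1                                   # partition number (1-4)
OFFSET=$((446 + (N - 1) * 16))

# Create a dummy 512-byte "MBR" so the sketch is self-contained;
# on a real node IMG would be the boot device (e.g. /dev/sda).
dd if=/dev/zero of="$IMG" bs=512 count=1 status=none

# Enable the boot flag of partition N (octal 0200 = 0x80).
printf '\200' | dd of="$IMG" bs=1 seek="$OFFSET" count=1 conv=notrunc status=none
```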
The Linux MBR can run a boot loader (e.g., GRUB or LILO). You can then select an OS interactively from
its user interface at the console. If no choice is made at the console, the OS selection is taken from the
boot loader configuration file, which you can edit in advance of a reboot (e.g., grub.conf for the
GRUB boot loader). If necessary, the Linux boot loader configuration file (which resides in a Linux
partition) can be replaced from a Windows command line with the dd.exe tool.
Appendix C.2 explains how to save and restore the MBR of a device. It is very important to understand
how the MBR works in order to properly configure dual-boot systems.
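As a sketch of the save and restore operations detailed in Appendix C.2, run here against a plain image file so it is self-contained (on a real node DISK would be the boot device, e.g. /dev/sda, and root privileges would be required):

```shell
DISK=disk.img
dd if=/dev/zero of="$DISK" bs=1M count=1 status=none   # dummy "disk"

# Save the full 512-byte MBR: 446 bytes of bootstrap code, the
# 64-byte partition table (4 x 16-byte entries) and the 2-byte
# 0x55AA signature.
dd if="$DISK" of=mbr.backup bs=512 count=1 status=none

# Restore only the bootstrap code (first 446 bytes), leaving the
# partition table currently on the disk untouched.
dd if=mbr.backup of="$DISK" bs=446 count=1 conv=notrunc status=none
```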
2.2 Dual-boot
Dual-booting is an easy way to have several operating systems (OS) on a node. When an OS is run, it
has no interaction with the other OS installed so the native performance of the node is not affected by the
use of the dual-boot feature. The only limitation is that these OS’s cannot be run simultaneously.
When setting up a dual-boot node, the main points to consider are:
• The choice of the MBR (and the choice of the boot loader, if applicable)
• The disk partition restrictions. For example, Windows must have a system partition on at least
one primary partition of the first device
• The compatibility with Logical Volume Managers (LVM). For example, the RHEL5.1 LVM creates
by default a logical volume spanning the entire first device, which makes it impossible to install a
second OS on this device.
When booting a computer, the dual-boot feature gives the ability to choose which OS to start among the
multiple OS’s installed on that computer. At boot time, the way you can select the OS of a node depends
on the installed MBR. A dual-boot method that relies on the Linux MBR and GRUB is described in [1].
Another dual-boot method, which exploits the properties of active partitions, is described in [2] and [3].
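For the GRUB-based method, the boot loader configuration of a dual-boot node might look like the following sketch (partition layout and kernel version are assumptions for illustration):

```
# Hypothetical grub.conf for a node dual-booting XBAS and HPCS.
default=0        # boot the first entry (XBAS) unless edited before reboot
timeout=5
title XBAS (Linux)
    root (hd0,1)                     # second partition: Linux /boot
    kernel /vmlinuz-2.6.18-53.el5 ro root=LABEL=/
    initrd /initrd-2.6.18-53.el5.img
title Windows HPC Server 2008
    rootnoverify (hd0,0)             # first primary partition: Windows
    chainloader +1                   # hand over to the Windows boot sector
```

Editing the `default` line before a reboot is what allows the OS selection without any console interaction.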
2.3 Virtualization
The virtualization technique is used to hide the physical characteristics of computers and only present a
logical abstraction of these characteristics. Virtual Machines (VM) can be created by the virtualization
software: each VM has virtual resources (CPUs, memory, devices, network interfaces, etc.) whose
characteristics (quantity, size, etc.) are independent from those available on the physical server. The OS
installed in a VM is called a guest OS: the guest OS can only access the virtual resources available in its
VM. Several VMs can be created and run on one physical node. These VMs appear as physical
machines to the applications, the users and the other nodes (physical or virtual).
In the context of an HOSC, virtualization is interesting for two main reasons:
1. It makes it possible to install several management nodes (MN) on a single physical server. This
is an important point for installing several OS’s on a cluster without increasing its cost with an
additional physical MN server.
2. It provides a fast and rather easy way to switch from one OS to another: by starting a VM that
runs one OS while suspending another VM that runs another OS.
A hypervisor is a software layer that runs directly on the hardware, at a higher privilege level than the
OS’s. The virtualization software runs in a privileged partition (domain 0, or dom0), from where it controls
how the hypervisor allocates resources to the virtual machines. The other domains, where the VMs run,
are called unprivileged domains and are denoted domU. A hypervisor normally enforces scheduling
policies and memory boundaries. In some Linux implementations it also provides access to hardware
devices via its own drivers; on Windows, it does not.
Virtualization solutions fall into two main categories:
• Host-based (like VMware): the virtualization software is installed on a physical server running a
classical OS called the host OS.
• Hypervisor-based (like Windows Server® 2008 Hyper-V™ and Xen): in this case, the hypervisor
runs at a lower level than the OS. The “host OS” becomes just another VM that is automatically
started at boot time. Such a virtualization architecture is shown in Figure 1.
Figure 1 - Overview of hypervisor-based virtualization architecture
“Full virtualization” is an approach that requires no modification of the hosted operating system,
providing the illusion of a complete system of real hardware devices. Such Hardware Virtual Machines
(HVM) require hardware support, provided for example by Intel® Virtualization Technology (VT) and
AMD-V technology; recent Intel® Xeon® processors support full virtualization thanks to Intel® VT.
“Para-virtualization”, by contrast, is an approach that requires modifications to the operating system in
order for it to run in a VM. Windows is only supported on fully-virtualized VMs, not on para-virtualized VMs.
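Since Windows is only supported on fully-virtualized VMs, a Xen-hosted Windows guest must be declared as an HVM guest. As a purely illustrative sketch, a minimal Xen HVM guest configuration might look as follows (all names, paths and sizes are hypothetical):

```
# Hypothetical Xen HVM guest configuration (e.g. /etc/xen/win-hn.cfg).
# builder = "hvm" requests a fully-virtualized (HVM) guest, which
# needs Intel VT or AMD-V support on the processor.
kernel  = "/usr/lib/xen/boot/hvmloader"   # HVM firmware loader
builder = "hvm"
name    = "win-headnode"
memory  = 2048                             # MB of RAM for the guest
vcpus   = 2
disk    = ["phy:/dev/vg0/winhn,hda,w"]     # backing block device
vif     = ["type=ioemu, bridge=xenbr0"]    # emulated NIC on bridge xenbr0
boot    = "c"                              # boot from the virtual hard disk
```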
Noteworthy virtualization products include:
• Xen [6]: free software for Linux, included in the RHEL5 distribution, which allows a maximum of
8 virtual CPUs per virtual machine (VM). Oracle VM and Sun xVM VirtualBox are commercial
implementations.
• VMware [7]: commercial software for Linux and Windows which allows a maximum of 4 virtual
CPUs per VM.
• Hyper-V [8]: a solution provided by Microsoft which only works on Windows Server 2008 and
allows only 1 virtual CPU per VM for non-Windows VM.
• PowerVM [9] (formerly Advanced POWER Virtualization): an IBM solution for UNIX and Linux on
most processor architectures that does not support Windows as a guest OS.
• Virtuozzo [10]: a Parallels, Inc. solution designed to deliver near-native physical performance. It
only supports VMs that run the same OS as the host OS (i.e., Linux VMs on Linux hosts and
Windows VMs on Windows hosts).
• OpenVZ [11]: an operating-system-level virtualization technology licensed under GPL version 2. It
is the basis of Virtuozzo [10]. It requires both the host and guest OS to be Linux, possibly of
different distributions. It has a low performance penalty compared to a standalone server.
2.4 PXE
The Pre-boot eXecution Environment (PXE) is an environment to boot computers using a network
interface, independently of available data storage devices or installed OS. The end goal is to allow a client
to boot from the network and receive a network boot program (NBP) from a network boot server. The
PXE boot process involves three steps:
1. Obtain an IP address to gain network connectivity: when a PXE-enabled boot is initiated, the
PXE-based ROM requests an IP address from a Dynamic Host Configuration Protocol (DHCP)
server using the normal DHCP discovery process (see the detailed process in Figure 2). It will
receive from the DHCP server an IP address lease, information about the correct boot server and
information about the correct boot file.
2. Discover a network boot server: with the information from the DHCP server the client
establishes a connection to the PXE servers (TFTP, WDS, NFS, CIFS, etc.).
3. Download the NBP file from the network boot server and execute it: the client uses Trivial
File Transfer Protocol (TFTP) to download the NBP. Examples of NBP are: pxelinux.0 for
Linux and WdsNbp.com for Windows Server.
When booting a compute node with PXE, the goal can be either to install or run it with an image deployed
through the network, or just to run it with an OS installed on its local disk. In the latter case, the PXE
server simply answers the compute node’s requests by indicating that it must boot from the next boot
device listed in its BIOS.
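The first step above can be illustrated with a dhcpd.conf fragment of the kind used on the management node (the addresses match the example cluster used later in this paper; the exact file layout is an assumption):

```
# Sketch of an ISC dhcpd.conf host entry that answers a PXE request:
# the client gets its IP address, the boot (TFTP) server and the NBP name.
host xbas1 {
    hardware ethernet 00:30:19:D6:77:8A;   # MAC address of the compute node
    fixed-address 192.168.0.2;             # IP address leased to the node
    next-server 192.168.0.1;               # network boot (TFTP) server
    filename "pxelinux.0";                 # NBP for a Linux PXE client
    option host-name "xbas1";
}
```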
2.5 Job schedulers and resource managers in a HPC cluster
In an HPC cluster, a resource manager (aka Distributed Resource Management System (DRMS) or
Distributed Resource Manager (DRM)) gathers information about all cluster resources that can be used
by application jobs. Its main goal is to give accurate resource information about the cluster usage to a job
scheduler.
A job scheduler (aka batch scheduler or batch system) is in charge of unattended background executions.
It provides a user interface for submitting, monitoring and terminating jobs. It is usually responsible for the
optimization of job placement on the cluster nodes. For that purpose it deals with resource information,
administrator rules and user rules: job priority, job dependencies, resource and time limits, reservation,
specific resource requirements, parallel job management, process binding, etc. With time, job schedulers
and resource managers have evolved in such a way that they are now usually integrated under a single
product name. Noteworthy products include:
• Torque [13]: an open source job scheduler based on the original PBS project. It can be used as a
resource manager by other schedulers (e.g., Moab workload manager).
• SLURM (Simple Linux Utility for Resource Management) [14]: freeware and open source
• LSF (Load Sharing Facility) [15]: supported by Platform for Linux/Unix and Windows
• OAR [17]: freeware and open source for Linux, AIX and SunOS/Solaris
• Microsoft Windows HPC Server 2008 job scheduler: included in the Microsoft HPC pack [5]
2.6 Meta-Scheduler
According to Wikipedia [18], “Meta-scheduling or Super scheduling is a computer software technique of
optimizing computational workloads by combining an organization's multiple Distributed Resource
Managers into a single aggregated view, allowing batch jobs to be directed to the best location for
execution”. In this paper, we consider that a meta-scheduler is able to submit jobs on cluster nodes with
heterogeneous OS types and that it can automatically switch the OS type of these nodes when necessary
(to optimize computational workloads). Here is a partial list of meta-schedulers currently available:
• Moab Grid Suite and Maui Cluster scheduler [19]: supported by Cluster Resources, Inc.
• CSF (Community Scheduler Framework) [21]: an open source framework (an add-on to the
Globus Toolkit v.3) for implementing a grid meta-scheduler, developed by Platform Computing
Recent job schedulers can sometimes be adapted and configured to behave as “simple” meta-schedulers.
2.7 Bull Advanced Server for Xeon
2.7.1 Description
Bull Advanced Server for Xeon (XBAS) is a robust and efficient Linux solution that delivers total cluster
management. It addresses each step of the cluster lifecycle with a centralized administration interface:
installation, fast and reliable software deployments, topology-aware monitoring and fault handling (to
dramatically lower time-to-repair), cluster optimization and expansion. Integrated, tested and supported
by Bull [4], XBAS federates the very best of Open Source components, complemented by leading
software packages from well-known Independent Software Vendors, and gives them a consistent view of
the whole HPC cluster through a common cluster database: the clusterdb. XBAS is fully compatible with
standard Red Hat Enterprise Linux (RHEL). The latest Bull Advanced Server for Xeon 5 release (v3.1) is
based on RHEL5.3 (see note 1).
BIOS settings must be set so that XBAS compute nodes boot on the network with PXE by default. The
PXE files stored on the management node indicate whether a given compute node should be installed
(i.e., its DEFAULT label is ks) or whether it is ready to run (i.e., its DEFAULT label is local_primary).
In the first case, a new OS image should be deployed (see note 2). During the PXE boot process, the
operations to be executed on the compute node are described in the kickstart file. Tools based on PXE
are provided by XBAS to simplify the installation of compute nodes. The preparenfs tool writes the
configuration files with the information given by the administrator and with that found in the clusterdb. The
generated configuration files are: the PXE files (e.g., /tftpboot/C0A80002), the DHCP configuration file
(/etc/dhcpd.conf), the kickstart file (e.g., /release/ks/kickstart) and the NFS export file
(/etc/exportfs). No user interface access (remote or local) to the compute node is required during its
installation with the preparenfs tool. Figure 3 shows the sequence of interactions between a new XBAS
compute node being installed and the servers running on the management node (DHCP, TFTP and NFS).
On small clusters, the preparenfs tool can be used to install every CN. On large clusters, the ksis tool can
be used to optimize the total deployment time of the cluster by cloning the first CN installed with preparenfs.
In the second case, the CN is already installed and just needs to boot from its local disk. Figure 4 shows
the normal boot scheme of an XBAS compute node.
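For illustration, the two DEFAULT labels mentioned above could correspond to PXE files such as the following sketch (pxelinux syntax; kernel names and paths are assumptions):

```
# Hypothetical PXE file /tftpboot/C0A80002
# (C0A80002 is 192.168.0.2 written in hexadecimal).

# Installation case: the "ks" label boots an installer kernel that
# applies the kickstart file over NFS.
DEFAULT ks
LABEL ks
    KERNEL vmlinuz
    APPEND initrd=initrd.img ks=nfs:192.168.0.1:/release/ks/kickstart

# Normal case: the "local_primary" label chains to the local disk.
# DEFAULT local_primary
# LABEL local_primary
#     LOCALBOOT 0
```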
Note 1: The Bull Advanced Server for Xeon 5 release used to illustrate the examples in this paper is v1.1,
based on RHEL5.1, because it was the latest release available when we built the first prototypes in May 2008.
Note 2: In this document, we define the “deployment of an OS” as the installation of a given OS on several
nodes from a management node. A more restrictive definition that only applies to the duplication of OS
images on the nodes is often used in the technical literature.
Figure 3 - Installation of an XBAS compute node via PXE: the node (MAC address 00:30:19:D6:77:8A),
set to boot first on the network, queries the DHCP server, which returns its IP address (fixed-address
192.168.0.2), the boot server (next-server 192.168.0.1) and the NBP name (filename "pxelinux.0"); the
node then downloads pxelinux.0 through TFTP and installs RHEL5.1 through NFS (exports
/release/RHEL5.1 and /release/XHPC declared in /etc/exportfs), following the kickstart file.
2.8 Windows HPC Server 2008
2.8.1 Description
Microsoft Windows HPC Server 2008 (HPCS), the successor to Windows Computer Cluster Server
(WCCS) 2003, is based on the Windows Server 2008 operating system and is designed to increase
productivity, scalability and manageability. This new name reflects Microsoft HPC’s readiness to tackle
the most challenging HPC workloads [5]. HPCS includes key features, such as new high-speed
networking, highly efficient and scalable cluster management tools, advanced failover capabilities, a
service-oriented architecture (SOA) job scheduler, and support for partners’ clustered file systems. HPCS
gives access to an HPC platform that is easy to deploy, operate, and integrate with existing enterprise
infrastructures.
During the first installation step, Windows Preinstallation Environment (WinPE) is the boot operating
system. It is a lightweight version of Windows Server 2008 that is used for the deployment of servers. It is
intended as a 32-bit or 64-bit replacement for MS-DOS during the installation phase of Windows, and can
be booted via PXE, CD-ROM, USB flash drive or hard disk.
BIOS settings should be set so that HPCS compute nodes boot on network with PXE (we assume that a
private network exists and that CNs send PXE requests there first). From the head node point of view, a
compute node must be deployed if it doesn’t have any entry into the Active Directory (AD), or if the cluster
administrator has explicitly specified that it must be re-imaged. When a compute node with no OS boots,
it first sends a DHCP request in order to get an IP address, a valid network boot server and the name of a
network boot program (NBP). When the DHCP server has answered, the CN downloads the NBP called
WdsNbp.com from the WDS server. The purpose is to detect the architecture and to wait for other
downloads from the WDS server.
Then, on the HPCS administration console of the head node, the new compute node appears as “pending
approval”. The installation starts once the administrator assigns a deployment template to it. A WinPE
image is sent to and booted on the compute node; files are transferred in order to prepare the Windows
Server 2008 installation, and an unattended installation of Windows Server 2008 is performed. Finally, the
compute node is joined to the domain and the cluster. Figure 5 shows the details of the PXE boot
operations executed during the installation procedure.
If the CN has already been installed, the AD already contains the corresponding computer object, so the
WDS server sends it an NBP called abortpxe.com, which boots the server from the next boot item
in the BIOS without waiting for a timeout. Figure 6 shows the PXE boot operations executed in this case.
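The head node's decision logic can be summarized with the following sketch (a simplified illustration, not Microsoft's implementation; the function name is our own, while the NBP names follow the description above):

```python
def choose_nbp(node_in_ad: bool, reimage_requested: bool) -> str:
    """Pick the network boot program (NBP) the WDS server returns to a
    PXE-booting compute node, per the deployment rules described above."""
    if not node_in_ad or reimage_requested:
        # Unknown or explicitly re-imaged node: start the WinPE-based deployment.
        return "WdsNbp.com"
    # Already-installed node: skip PXE and boot from the next BIOS boot item.
    return "abortpxe.com"

print(choose_nbp(node_in_ad=False, reimage_requested=False))  # → WdsNbp.com
print(choose_nbp(node_in_ad=True, reimage_requested=False))   # → abortpxe.com
```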
A Hybrid OS Cluster Solution: Dual-Boot and Virtualization with Windows HPC Server 2008 and Linux Bull Advanced Server for Xeon
16
Figure 5 - PXE boot operations during the installation of a compute node: power on with network first in
the BIOS boot order, DHCP request, TFTP download of WdsNbp.com from the WDS server, approval of
the CN and template assignment (creation of a WDS account in the AD), TFTP download of pxeboot.com
(or .n12), then WinPE boot with BOOT.WIM, disk partitioning with diskpart.txt, and unattended installation
driven by unattend.xml

Figure 6 - PXE boot operations for an already-installed compute node: the WDS server finds the existing
AD account, sends abortpxe.com via TFTP, and the node boots Windows Server 2008 from its local disk
2.9 PBS Professional
This section presents PBS Professional, the job scheduler that we used as the meta-scheduler for building
the HOSC prototype described in Chapter 5. PBS Professional is part of the PBS GridWorks software
suite. It is the professional version of the Portable Batch System (PBS), a flexible workload management
system, originally developed to manage aerospace computing resources at NASA. PBS Professional has
since become the leader in supercomputer workload management and the de facto standard on Linux
clusters. A few of the more important features of PBS Professional 10 are listed below:
• Enterprise-wide Resource Sharing provides transparent job scheduling on any PBS system by
any authorized user. Jobs can be submitted from any client system both local and remote.
• Multiple User Interfaces provide a traditional command line and a graphical user interface for
submitting batch and interactive jobs; querying job, queue, and system status; and monitoring jobs.
• Job Accounting offers detailed logs of system activities for charge-back or usage analysis per
user, per group, per project, and per compute host.
• Parallel Job Support works with parallel programming libraries such as MPI. Applications can be
scheduled to run within a single multi-processor computer or across multiple systems.
• Job-Interdependency enables the user to define a wide range of interdependencies between jobs.
• Automatic Load-Leveling provides numerous ways to distribute the workload across a cluster of
machines, based on hardware configuration, resource availability, keyboard activity, and local
scheduling policy.
• Common User Environment offers users a common view of the job submission, job querying,
system status, and job tracking over all systems.
• Cross-System Scheduling ensures that jobs do not have to be targeted to a specific computer
system. Users may submit their job, and have it run on the first available system that meets their
resource requirements.
• Job Priority allows users to specify the priority of their jobs.
• Username Mapping provides support for mapping user account names on one system to the
appropriate name on remote server systems. This allows PBS Professional to fully function in
environments where users do not have a consistent username across all hosts.
• Broad Platform Availability is achieved through support of Windows and every major version of
UNIX and Linux, from workstations and servers to supercomputers.
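As an illustration of batch submission, a minimal PBS Professional job script might look like the following sketch (the job name, resource selection, and application name are assumptions made for this example; `qsub job.sh` would submit it):

```
#!/bin/sh
#PBS -N hello_mpi
#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -j oe
# Run from the directory the job was submitted from
cd "$PBS_O_WORKDIR"
mpirun ./hello_mpi
```

The `#PBS` lines are scheduler directives read by qsub; the rest is an ordinary shell script executed on the first allocated node.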
3 Approaches and recommendations
In this chapter, we explain the different approaches to offering several OS's on a cluster. The
approaches discussed in Sections 3.1 and 3.2 are summarized in Table 1.
• Re-installing the selected OS on the cluster when necessary. Since this process can be long, it is
not realistic for frequent changes. This is noted as approach 1 in Table 1.
• Deploying a new OS image on the whole cluster depending on the OS choice. The deployment
can be done on local disks or in memory with diskless compute nodes. It is difficult to deal with
the OS change on the management node in such an environment: either the management node
is dual-booted (this is approach 7 in Table 1), or an additional server is required to distribute the
OS image of the MN. This can be interesting in some specific cases: on HPC clusters with
diskless CN when the OS switches are rare, for example. Otherwise, this approach is not very
convenient. The deployment technique can be used in a more appropriate manner for clusters
with 2 simultaneous OS’s (i.e., 2 MNs); this will be shown in the next Section with approaches 3
and 11.
• Dual-booting the selected OS from dual-boot disks. Dual-booting the whole cluster
(management and compute nodes) is a good and very practical solution that was introduced
in [1] and [2]. This approach, noted 6 in Table 1, is the easiest way to install and manage a
cluster with several OS's, but it only applies to small clusters with few users when no flexibility
is required. If only the MNs are on a dual-boot server while the CNs are installed with a single OS
(half of the CNs having one OS while the others have the other), the solution makes no sense
because only half of the cluster can be used at a time (this is approach 5). If the MNs are on a
dual-boot server while the CNs are installed in VMs (2 VMs being installed on each compute
server), the solution makes no real sense either because the added value of using VMs (quick OS
switching, for instance) is cancelled by the need to reboot the MN server (this is approach 8).
Whatever the OS switch method, a complete cluster reboot is needed at each change. This implies
cluster unavailability during reboots, a need for OS usage schedules, and potential conflicts between user
needs; hence a real lack of flexibility.
In Table 1, approaches 1, 5, 6, 7, and 8 define clusters that can run 2 OS's, but not simultaneously. Even
if such clusters do not fit the Hybrid Operating System Cluster (HOSC) definition given in Chapter 1,
they can be considered a simplified form of the concept.
Table 1 - The 12 approaches for providing 2 OS's on a cluster. Rows give the configuration of the
2 management nodes (MN); columns give the configuration of the 2 compute nodes (CN): 1 OS per
server (2 servers), dual-boot (1 server), OS image deployment (1 server), or virtualization (2 CNs
simultaneously on 1 server).

MNs with 1 OS per server (2 servers):
• CNs with 1 OS per server (approach 1): starting point: 2 half-size independent clusters with
2 OS's, or 1 full-size single-OS cluster re-installed with the other OS when needed
• Dual-boot CNs (approach 2): good HOSC solution for large clusters with OS flexibility
• CN OS image deployment (approach 3): an HOSC solution that can be interesting for large
clusters with diskless CNs or when the OS switches are rare
• Virtualized CNs (approach 4): an HOSC solution with potential performance issues on
compute nodes and extra cost for a second MN server

MNs on a dual-boot server (1 server):
• CNs with 1 OS per server (approach 5): this "single OS at a time" solution makes absolutely
no sense since only half of the CNs can be used at a time
• Dual-boot CNs (approach 6): good classical dual-boot cluster solution
• CN OS image deployment (approach 7): a "single OS at a time" solution that can only be
interesting for diskless CNs
• Virtualized CNs (approach 8): having virtual CNs makes no real sense since the MN must be
rebooted to switch the OS

MNs virtualized on 1 server (2 MNs simultaneously, without additional hardware cost):
• CNs with 1 OS per server (approach 9): 2 half-size independent clusters with a single MN
server: a bad HOSC solution with no flexibility and very little cost saving
• Dual-boot CNs (approach 10): good HOSC solution for medium-sized clusters with an OS
flexibility requirement
• CN OS image deployment (approach 11): an HOSC solution that can be interesting for small
clusters with diskless CNs
• Virtualized CNs (approach 12): every node is virtual: the most flexible HOSC solution, but
with too many performance uncertainties at the moment
3.2 Two simultaneous operating systems
The idea is to provide, with a single cluster, the capability to have several OS’s running simultaneously on
an HPC cluster. This is what we defined as a Hybrid Operating System Cluster (HOSC) in Chapter 1.
Each compute node (CN) does not need to run every OS simultaneously. A single OS can run on a given
CN while another OS runs on other CNs at the same time. The CNs can be dual-boot servers, diskless
servers, or virtual machines (VM). The cluster is managed from separate management nodes (MN) with
different OS’s. MN can be installed on several physical servers or on several VMs running on a single
server. In Table 1, approaches 2, 3, 4, 9, 10, 11 and 12 are HOSC.
HPC users may consider HPC clusters with two simultaneous OS’s rather than a single OS at a time for
four main reasons:
1. To improve resource utilization and adapt the workload dynamically by easily changing the ratio
of OS’s (e.g., Windows vs. Linux compute nodes) in a cluster for different kinds of usage.
2. To be able to migrate smoothly from one OS to the other, giving time to port applications and train
users.
3. Simply to be able to try a new OS without stopping the already installed one (i.e., install an HPCS
cluster at low cost on an existing Bull Linux cluster, or install a Bull Linux cluster at low cost on an
existing HPCS cluster).
4. To integrate specific OS environments (e.g., with legacy OS’s and applications) in a global IT
infrastructure.
The simplest approach for running 2 OS's on a cluster is to install each OS on half (or at least a part) of
the cluster when it is built. This approach is equivalent to building 2 single-OS clusters, so it cannot be
classified as a cluster with 2 simultaneous OS's. Moreover, this solution is expensive with its 2 physical
MN servers, and it is absolutely not flexible since the OS distribution (i.e., the OS allocation to nodes) is
fixed in advance. This approach is similar to approach 1, already discussed in the previous section.
An alternative to this first approach is to use a single physical server with 2 virtual machines for installing
the 2 MNs. In this case there is no additional hardware cost but there is still no flexibility for the choice of
the OS distribution on the CNs since this distribution is done when the cluster is built. This approach is
noted 9.
On clusters with dual-boot CNs the OS distribution can be dynamically adapted to the user and
application needs. The OS of a CN can be changed just by rebooting the CN aided by a few simple dual-
boot operations (this will be demonstrated in Sections 6.3 and 6.4). With such dual-boot CNs, the 2 MNs
can be on a single server with 2 VMs: this approach, noted 10, is very flexible and requires no additional
hardware cost. It is a good HOSC solution, especially for medium-sized clusters.
With dual-boot CNs, the 2 MNs can also be installed on 2 physical servers instead of 2 VMs: this
approach, noted 2, can only be justified on large clusters because of the extra cost due to a second
physical MN.
A new OS image can be (re-)deployed on a CN on request. This technique allows changing the OS
distribution on CNs on a cluster quite easily. However, this is mainly interesting for clusters with diskless
CNs because re-deploying an OS image for each OS switch is slower and consumes more network
bandwidth than the other techniques discussed in this paper (dual-boot or virtualization). This can also be
interesting if the OS type of CNs is not switched too frequently. The MNs can then be installed in 2
different ways: either the MNs are installed on 2 physical servers (this is approach 3 that is interesting for
large clusters with diskless CNs or when the OS type of CNs is rarely switched) or they are installed on
2 VMs (this is approach 11 that is interesting for small and medium size diskless clusters).
The last technique for installing 2 CNs on a single server is to use virtual machines (VM). In this case,
every VM can be up and running simultaneously or only a single VM may run on each compute server
while the others are suspended. The switch from an OS to another can then be done very quickly. Using
several virtual CNs of the same server simultaneously is not recommended since the total performance of
the VMs is bounded by the native performance of the physical server and so no benefit can be expected
from such a configuration. Installing CNs on VMs makes it easier and quicker to switch from one OS to
another compared to a dual-boot installation but performance of the CNs may be decreased by the
computing overhead due to the virtualization software layer. Section 3.5 briefly presents articles that
analyze the performance impact of virtualization for HPC. Once again, the 2 MNs can be installed on 2
physical servers (this is approach 4 for large clusters), or they can be installed on 2 VMs (this is
approach 12 for small and medium-sized clusters). This latter approach is 100% virtual with only virtual
nodes. This is the most flexible solution, and very promising for the future; however it is too early to use it
now because of performance uncertainties.
For the approaches with 2 virtual nodes (2 CNs or 2 MNs) on a server, the host OS can be Linux or
Windows, and any virtualization software could be used. The 6 approaches using VMs thus have dozens
of possible virtualization implementations.
The key points to check for choosing the right virtualization environment are listed here by order of
importance:
2. Virtual resource limitations (maximum number of virtual CPUs, maximum number of network
interfaces, virtual/physical CPU binding features, etc.)
3. Impact on performance (CPU cycles, memory access latency and bandwidth, I/Os, MPI
optimizations)
Also, for the approaches with 2 virtual nodes (2 CNs or 2 MNs) on a server, the 2 nodes can be
configured on 2 VMs or one can be a VM while the other is just installed on the server host OS. When
upgrading an existing HPC cluster from a classical single OS configuration to an HOSC configuration, it
might look interesting at first glance to configure an MN (or a CN) on the host OS. For example, one virtual
machine could be created on an existing management node and the second management node could be
installed on this VM. Even if this configuration looks quick and easy to set up, it should never be
used. Indeed, running any application or using resources of the host OS is not a recommended
virtualization practice. It creates an asymmetrical situation between applications running on the host
OS and those running on the VM, which may lead to load-balancing issues and resource access failures.
On an HOSC with dual-boot CNs, re-deployed CNs or virtual CNs, the OS distribution can be changed
dynamically without disturbing the other nodes. This could even be done automatically by a resource
manager in a unified batch environment 3 .
The dual-boot technique limits the number of OS's installed on a server because only 4 primary partitions
can be declared in the MBR. So, on an HOSC, if more OS's are necessary and no primary partition is
available anymore, the best solution is to install virtual CNs and to run them one at a time on each CN
while the others are suspended (depending on the selected OS for that CN). The MNs should be installed
on VMs as much as possible (as in approach 12), but several physical servers can be necessary (as in
approach 4). This can happen in the case of large clusters, for which the cost of an additional server is
negligible. It can also happen in order to keep a good level of performance when many OS's are
installed on the HOSC and thus many MNs are needed.
3 The batch solution is not investigated in this study but could be considered in the future.
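The 4-primary-partition limit comes from the fixed layout of the MBR partition table and can be checked programmatically. The following sketch (illustrative only, not part of the HOSC tooling) counts the used entries in a raw MBR:

```python
def count_primary_partitions(mbr: bytes) -> int:
    """Count the used entries in the 4-slot MBR partition table.
    The table starts at byte offset 446; each entry is 16 bytes; an
    entry whose partition-type byte (offset 4 within the entry) is
    zero is unused."""
    assert len(mbr) >= 512 and mbr[510:512] == b"\x55\xaa", "not a valid MBR"
    used = 0
    for i in range(4):  # the MBR has room for exactly 4 primary partitions
        entry = mbr[446 + 16 * i : 446 + 16 * (i + 1)]
        if entry[4] != 0:  # partition-type byte (e.g., 0x83 Linux, 0x07 NTFS)
            used += 1
    return used

# Toy MBR with 2 primary partitions declared (types 0x83 and 0x07):
mbr = bytearray(512)
mbr[510:512] = b"\x55\xaa"   # boot signature
mbr[446 + 4] = 0x83          # partition 1: Linux
mbr[446 + 16 + 4] = 0x07     # partition 2: NTFS
print(count_primary_partitions(bytes(mbr)))  # → 2
```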
3.3.3 I/O nodes
I/O nodes are in charge of Input/Output requests for the file systems.
For I/O-intensive applications, an I/O node is necessary to reduce the MN load. This is especially true
when the MNs are installed on virtual machines (VM): when a virtual MN handles heavy I/O requests, it
can dramatically impact the I/O performance of the second virtual MN.
If an I/O node is aimed at serving nodes with different OS's, then it must have at least one network
interface for each OS subnet (i.e., a subnet that is declared for every node that runs with the same OS).
Sections 4.4 and 4.5 show an example of OS subnets.
An I/O node could be installed with Linux or Windows for configuring an NFS server; NFS clients and
servers are supported on both OS's. But the Lustre file system (delivered by Bull with XBAS) is not
available for Windows clusters, so Lustre can only be served from Linux I/O nodes (for Linux
CN usage only 4). Other commercial cluster/parallel file systems are available for both Linux and
Windows (e.g., CXFS).
The I/O node can serve one file system shared by both OS nodes or two independent file systems (one
for each OS subnet). In the case of 2 independent file systems, 1 or 2 I/O nodes can be used.
3.3.4 Login nodes
Login nodes could run a Windows or Linux OS and they can be installed on dual-boot servers, virtual
machines or independent servers. A login node is usually only connected to other nodes running the
same OS as its own.
For the HPCS cluster, the use of a login node is not mandatory, as a job can be submitted from any
Windows client with the Microsoft HPC Pack installed (with the scheduler graphical interface or the
command line) by using an account in the cluster domain. A login node can be used to provide a gateway
into the cluster domain.
4 Lustre and GPFS™ clients for Windows are announced to be available soon.
3.4 Management services
From the infrastructure configuration point of view, we should study the potential interactions between
services that can be delivered from each MN (e.g., DHCP, TFTP, NTP, etc.). The goal is to avoid any
conflict between MN services while cluster operations or computations are done simultaneously on both
OS’s. This is especially complex during the compute node boot phase since the PXE procedure requires
DHCP and TFTP access from its very early start time. A practical case with XBAS and HPCS is shown in
Section 4.4.
• an NTP server (for the virtualization software and for MPI application synchronization)
One of these articles compares virtualization technologies for HPC (see [25]). It systematically
evaluates VMware Server, Xen, and OpenVZ for computationally intensive HPC applications using
standard scientific benchmarks. It examines the suitability of full virtualization, para-virtualization,
and operating-system-level virtualization in terms of network utilization, SMP performance, file system
performance, and MPI scalability. The analysis shows that none matches the performance of the base
system perfectly: OpenVZ demonstrates low overhead and high performance; Xen demonstrates
excellent network bandwidth but its exceptionally high latency hinders its scalability; and VMware
Server, while demonstrating reasonable CPU-bound performance, is similarly unable to cope with the
NPB MPI-based benchmark.
Another article evaluates the performance impact of Xen on MPI and process execution for HPC systems
(see [26]). It investigates subsystem and overall performance using a wide range of benchmarks and
applications. It compares the performance of a para-virtualized kernel against three Linux operating
systems and concludes that, in general, the Xen para-virtualization system poses no statistically
significant overhead compared with the other OS configurations.
3.6 Meta-scheduler for HOSC
3.6.1 Goals
The goal of a meta-scheduler used for an HOSC can be:
• Purely performance oriented: the most efficient OS is automatically chosen for a given run
(based on backlog, statistics, knowledge database, input data size, application binary, etc.)
• OS compatibility driven: if an application is only available for a given OS then this OS must be
used!
• High availability oriented: a few nodes with each OS are kept available all the time in case of
requests that must be treated extremely quickly or in case of failure of running nodes.
• Energy saving driven: the optimal number of nodes with each OS is booted while the others
are shut down (depending on the number of jobs in the queue, the profile of active users, job
history, backlog, timetable, external temperature, etc.)
• Re-deploy the right OS and boot compute nodes (on diskless servers for example)
• Unplanned (dynamic): the meta-scheduler estimates dynamically the optimal size of node
partitions with each OS type (depending on job priority, queue, backlog, etc.), then it grows and
shrinks these partitions accordingly by switching OS type on compute nodes. This is usually
called “just in time provisioning”.
• Planned (dynamic): the administrators plan the OS distribution based on times, dates, team
budgets, project schedules, vacation schedules, etc. The sizes of the node partitions with each OS
type are fixed for given periods of time. This is usually called "calendar provisioning".
• Static: the sizes of the node partitions with each OS type are fixed once and for all, and the
meta-scheduler cannot switch OS types. This is the simplest and least efficient case.
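"Just in time provisioning" can be pictured with a toy heuristic like the one below (an illustrative sketch only, not a real meta-scheduler policy; the OS-type names and the proportional rule are our own assumptions): size each OS partition in proportion to its demand.

```python
def rebalance(queued: dict, running: dict, total_nodes: int) -> dict:
    """Toy 'just in time provisioning': split the compute nodes between
    the two OS types in proportion to their demand (queued + running jobs).
    Real policies would also weigh priorities, backlog, reboot cost, etc."""
    demand = {os_type: queued.get(os_type, 0) + running.get(os_type, 0)
              for os_type in ("xbas", "hpcs")}
    total_demand = sum(demand.values())
    if total_demand == 0:
        # No demand: keep an even split so both OS's stay available.
        return {"xbas": total_nodes // 2, "hpcs": total_nodes - total_nodes // 2}
    xbas = round(total_nodes * demand["xbas"] / total_demand)
    return {"xbas": xbas, "hpcs": total_nodes - xbas}

# 6 Linux jobs and 2 Windows jobs queued on a 4-CN cluster:
print(rebalance({"xbas": 6, "hpcs": 2}, {}, 4))  # → {'xbas': 3, 'hpcs': 1}
```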
4 Technical choices for designing an HOSC prototype
We want to build a flexible medium-sized HOSC with XBAS and HPCS. We only have a small
5-server cluster to achieve this goal, but it will be sufficient to simulate the usage of a medium-sized
cluster. We start from this cluster, which has InfiniBand and Gigabit Ethernet networks. The complete
description of the
hardware is given in Appendix E. We have discussed the possible approaches in the previous chapter.
Let us now see what choice should be made in the particular case of this 5-server cluster. In the
remainder of the document, this cluster is named the HOSC prototype.
When the OS type of CNs is switched manually, we decided to allow the OS type switch commands to be
sent only from the MN that runs the same OS as the target CN. In other words, the HPCS MN can "give up"
one of its CNs to the XBAS cluster and the XBAS MN can "give up" one of its CNs to the HPCS cluster, but
no MN can "take" a CN from the cluster with a different OS. This rule was chosen to minimize the risk of
switching the OS of a CN while it is used for computation with its current OS configuration. When the OS
type of CNs is switched automatically by a meta-scheduler, OS type switch commands are sent from the
meta-scheduler server. To help switching OS on CNs from the MNs, simple scripts were written. They are
listed in Appendices D.1.3 and D.2.3, and an example of their use is shown in Sections 6.3 and 6.4.
Depending on the OS type that is booted, the node has a different hostname and IP address. This
information is sent by a DHCP server whose configuration is updated at each OS switch request, as
explained in the next section.
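The per-node DHCP update can be pictured with the following sketch (hostnames, IP addresses, server addresses, and file names are illustrative assumptions; the actual switch_dhcp_host script is listed in Appendix D.2.3):

```python
# Hypothetical per-OS PXE parameters for one compute server (illustrative).
PXE = {
    "xbas": {"host": "xbas1", "ip": "192.168.0.1",
             "next_server": "192.168.0.100", "filename": "pxelinux.0"},
    "hpcs": {"host": "hpcs1", "ip": "192.168.1.1",
             "next_server": "192.168.0.200",
             "filename": "boot\\x64\\WdsNbp.com"},
}

def dhcp_host_section(mac: str, target_os: str) -> str:
    """Render the DHCP host section for the requested OS type."""
    p = PXE[target_os]
    return (f'host {p["host"]} {{\n'
            f'  hardware ethernet {mac};\n'
            f'  fixed-address {p["ip"]};\n'
            f'  next-server {p["next_server"]};\n'
            f'  filename "{p["filename"]}";\n'
            f'}}\n')

print(dhcp_host_section("00:30:19:D6:77:8A", "xbas"))
```

Switching a node's OS then amounts to replacing its host section with the other variant and restarting the DHCP service.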
Figure 7 - Management node virtualization architecture
• The DHCP service is the critical part, as it is the single point of entry when a compute node boots. In
our prototype it runs on the XBAS management node. The DHCP configuration file contains a
section for each node with its characteristics (hostname, MAC and IP addresses) and the PXE
information. Depending on the administrator's needs, this section can be changed for deploying
and booting XBAS or HPCS on a compute node (see an example of dhcp.conf file changes in
Appendix D.2.2).
• WDS and/or TFTP server: each management node has its own server because the installation
procedures are different. A booting compute node is directed to the correct server by the DHCP
server.
• Directory Service is provided by Active Directory (AD) for HPCS and by LDAP for XBAS. Our
prototype will not offer a unified solution, but since synchronization mechanisms between AD and
LDAP exist, a unified solution could be investigated.
• DNS: this service can be provided by the XBAS management node or the HPCS head node. The
DNS should be set as dynamic in order to simplify integration with the AD. In our prototype, we
set up a DNS server on the HPCS head node for the Windows nodes, and we use /etc/hosts
files for name resolution on XBAS nodes.
Recommendations given in Section 3.4 can be applied to our prototype by configuring the services as
shown in Figure 8 and in Table 2.
The netmask is set to 255.255.0.0 because it must provide connectivity between Xen domain 0 and each
DomU virtual machine.
Figures 9 and 10 describe respectively XBAS and HPCS compute node deployment steps, while
Figures 11 and 12 describe respectively XBAS and HPCS compute node normal boot steps on our HOSC
prototype. They show how the PXE operations detailed in Figures 3, 4, 5 and 6 of Chapter 2 are
consistently adapted in our heterogeneous OS environment with a unique DHCP server on the XBAS MN
and a Windows MBR on the CNs.
Figure 9 - Deployment of an XBAS compute node on our HOSC prototype
Figure 11 - Boot of an XBAS compute node on our HOSC prototype
4.5 HOSC prototype architecture
The application network (i.e., InfiniBand network) should not be on the same subnet as the private
network (i.e., gigabit network): we chose 172.16.0.[1-5] and 172.16.1.[1-5] IP address ranges for the
application network address assignment.
The complete cluster architecture that results from the decisions taken in the previous sections is shown
in Figure 13 below:
Figure 13 - HOSC prototype architecture: the management server runs RHEL5.1 with Xen (Domain0) and
hosts the XBAS0 and HPCS0 management node VMs; the public Gigabit network (intranet/internet) uses
addresses 129.183.251.53 (eth1), 129.183.251.40 and 129.183.251.41 (xenbr1); the dual-boot compute
nodes (e.g., HPCS2/XBAS2, HPCS3/XBAS3) boot one OS or the other and are connected to the Gigabit
private network, the InfiniBand application network (IB switch management at 192.168.0.220), and the
public network
If for some reason the IB interface cannot be configured on the HN, you should set up a loopback
network interface instead and configure it with the IPoIB IP address (e.g., 172.16.1.1 in Figure 13). If for
some reason the IB interface cannot be configured on the MN, its setup can be skipped since it is not
mandatory to connect the IB interface on the MN.
In the next chapter we will show how to install and configure the HOSC prototype with this architecture.
4.6 Meta-scheduler architecture
Without a meta-scheduler, users need to connect to the required cluster management node in order to
submit their jobs. In this case, each cluster has its own management node with its own scheduler (as shown
on the left side of Figure 14). By using a meta-scheduler, we offer a single point of entry to use the power
of the HOSC, whatever the OS type required by the job (as shown on the right side of Figure 14).
Figure 14 - HOSC meta-scheduler architecture (in order to have a simpler scheme, the HOSC is
represented as two independent clusters: one with each OS type)
On the meta-scheduler, we create two job queues: one for the XBAS cluster and another one for the
HPCS cluster. According to the user's request, the job is then automatically redirected to the correct
cluster. The meta-scheduler also manages the switch from one OS type to the other according to the
clusters' workload.
We chose PBS Professional as the meta-scheduler for our prototype because of the experience we
already have with it on Linux and Windows platforms. The PBS server should be installed on a node that
is accessible from every other node of the HOSC. We chose to install it on the XBAS management node.
PBS MOM (Machine Oriented Mini-server) is installed on all compute nodes (HPCS and XBAS) so that
they can be controlled by the PBS server.
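The two queues described above could be created with PBS Professional's qmgr utility along these lines (the queue names and attribute values are assumptions for this sketch, not the exact prototype configuration):

```
qmgr -c "create queue xbas_q queue_type=execution,enabled=true,started=true"
qmgr -c "create queue hpcs_q queue_type=execution,enabled=true,started=true"
qmgr -c "set server default_queue = xbas_q"
```

Users would then direct a job to the desired OS type by submitting it to the corresponding queue (e.g., `qsub -q hpcs_q job.sh`).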
In the next chapter we will show how to install and configure this meta-scheduler on our HOSC prototype.
5 Setup of the HOSC prototype
This chapter describes the general setup of the Hybrid Operating System Cluster (HOSC) defined in the
previous chapter. The initial idea was to install Windows HPC Server 2008 on an existing Linux cluster
without affecting the existing Linux installation. However, in our case it appeared that the
installation procedure requires reinstalling the management node with the 2 virtual machines, so the
installation procedure is given for an HOSC installation done from scratch.
Check that Virtualization Technology (VT) is enabled in the BIOS settings of the server.
Install Linux RHEL5.1 from the DVD on the management server and select "virtualization" when optional
packages are proposed. SELinux must be disabled. Erase all existing partitions and design your partition
table so that enough free space is available in a volume group for creating logical volumes (LV). LVs are
virtual partitions used for the installation of virtual machines (VM), each VM being installed on one LV.
Volume groups and logical volumes are managed by the Logical Volume Manager (LVM). The advised
size of an LV is 30-50 GB: leave at least 100 GB of free space on the management server for the creation
of the 2 LVs.
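Assuming the free space is in a volume group named vg00 (an illustrative name; check yours with vgdisplay), the 2 LVs could be created as follows:

```
# check the free space available in the volume group
vgdisplay vg00
# create one logical volume per management node VM (sizes per the advice above)
lvcreate -L 40G -n xbas0_lv vg00
lvcreate -L 40G -n hpcs0_lv vg00
```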
It is advisable to install an up-to-date gigabit driver. One is included on the XBAS 5v1.1 XHPC DVD.
2 virtual machines are needed to install the XBAS management node and the HPCS head node. Create
these 2 Xen virtual machines on the management server. The use of HN-Master is optional and all
operations done in the Hypernova environment could also be done with Xen commands in a basic Xen
environment. For the use of HN-Master, httpd service must be started (type “chkconfig --level 35
httpd on” to start it automatically at boot time).
Figure 15 - HN-Master user interface
Create 2 network interface bridges, xenbr0 and xenbr1, so that each VM can have 2 virtual network
interfaces (one on the private network and one on the public network). Detailed instructions for
configuring 2 network interface bridges are shown in Appendix D.2.4.
5 In case of problems when installing the OS's (e.g., #IRQ disabled while files are copied), select only
1 virtual CPU for the VM during the OS installation step.
5.1.3 Installation of XBAS management node on a VM
Install XBAS on the first virtual machine. If applicable, reuse the clusterdb and the network configuration of the initial management node. Update the clusterdb with the new management node MAC addresses (the xenbr0 and xenbr1 MAC addresses of the VM). Follow the instructions given in the BAS for Xeon installation and configuration guide [22], and choose the following options for the MN setup:
[xbas0:root] cd /release/XBAS5V1.1
[xbas0:root] ./install -func MNGT IO LOGIN -prod RHEL XHPC XIB
Update the clusterdb with the new management node MAC-address (see [22] and [23] for details).
3. enable remote desktop (this is recommended for a remote administration of the cluster)
4. set “Internet Time Synchronization” so that the time is the same on the HN and the MN
5. install the Active Directory (AD) Domain Services and create a new domain for your cluster with
the wizard (dcpromo.exe), or configure the access to your existing AD on your local network
5.1.6 Preparation for XBAS deployment on compute nodes
Check that there is enough space on the first device of the compute nodes for creating an additional
primary partition (e.g., on /dev/sda). If not, make some space by reducing the existing partitions or by
redeploying XBAS compute nodes with the right partitioning (using the preparenfs command and a
dedicated kickstart file). Edit the kickstart file accordingly to an HOSC compatible disk partition. For
example, /boot on /dev/sda1, / on /dev/sda2 and SWAP on /dev/sda3. An example is given in
Appendix D.2.1.
Create a /opt/hosc directory and export it with NFS. Then mount it on every node of the cluster and
install the HOSC files listed in Appendix D.2.3 in it:
• switch_dhcp_host
• activate_partition_HPCS.sh
• fdisk_commands.txt
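A minimal sketch of that export/mount setup is shown below; the export options are an assumption and should be adjusted to your site policy:

```
# on the XBAS management node (xbas0):
mkdir -p /opt/hosc
echo "/opt/hosc *(rw,sync,no_root_squash)" >> /etc/exports
exportfs -a
# on each compute node:
mkdir -p /opt/hosc
mount -t nfs xbas0:/opt/hosc /opt/hosc
```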
c. Firewall is “off” on private network (this is for the compute nodes only because the firewall
needs to be “on” for the head node)
3. Configure the naming of the nodes (this step is mandatory even if it is not useful in our case: the
new node names will be imported from an XML file that we will create later). You can specify:
HPCS%1%
4. Create a deployment template with an operating system image of “Windows Server 2008”
Bring the HN online in the management console: click on “Bring Online” in the “Node Management”
window of the “HPC Cluster Manager” MMC.
Add a recent Gigabit network adapter driver to the OS image that will be deployed: click on “Manage drivers” and add the drivers for Intel PRO/1000 version 13.1.2 or higher (PROVISTAX64_v13_1_2.exe can be downloaded from the Intel web site).
Add a recent IB driver (see [27]) that supports Network Direct (ND). Then edit the compute node template and add a “post install command” task that configures the IPoIB IP address and registers ND on the compute nodes. The IPoIB configuration can be done by the script setIPoIB.vbs provided in Appendix D.1.2.
The ND registration is done by the command:
C:\> ndinstall -i
Two files used by the installation template must be edited in order to keep existing XBAS partitions
untouched on compute nodes while deploying HPCS. For example, choose the fourth partition
(/dev/sda4) for the HPCS deployment (see Appendix D.1.1 for more details):
• unattend.xml
• diskpart.txt
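For illustration only, a diskpart.txt fragment for such a first deployment on the fourth partition might take the following shape (the actual file is given in Appendix D.1.1; the exact commands depend on your initial disk layout, so treat this as a hypothetical sketch):

```
rem create and format only the HPCS partition, leaving sda1-sda3 untouched
select disk 0
create partition primary
format fs=ntfs quick
```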
Create a shared C:\hosc directory and install the HOSC files listed in Appendix D.1.3 in it:
• activate_partition_XBAS.bat
• diskpart_commands.txt
• from_HPCS_to_XBAS.bat
6 Thanks a lot to Christian Terboven (research associate in the HPC group of the Center for Computing and Communication at RWTH Aachen University) for his helpful contribution to this configuration phase.
5.2 Deployment of the operating systems on the compute nodes
The order in which the OS’s are deployed is not important, but it must be the same on every compute node. The order should thus be decided before starting any node installation or deployment. The installation scripts (such as diskpart.txt for HPCS or kickstart.<identifier> for XBAS) must be edited according to the chosen order. In this example, we chose to deploy XBAS first. The partition table we plan to create is:
/dev/sda1: Linux /boot
/dev/sda2: Linux /
/dev/sda3: Linux swap
/dev/sda4: Windows NTFS partition
First, check that the BIOS settings of all CNs are configured for PXE boot (and not local hard disk boot).
They should boot on the eth0 Gigabit Ethernet (GE) card. For example, the following settings are correct:
Boot order
1 - USB key
2 - USB disk
3 - GE card
4 - SATA disk
Once generated, the kickstart file needs a few modifications in order to fulfill the HOSC disk partition
requirements: see an example of these modifications in Appendix D.2.1.
When the modifications are done, boot the compute nodes: the PXE mechanism will then install XBAS on them with the information stored in the kickstart file. Figure 16 shows the console of a CN while it is PXE booting for its XBAS deployment.
Figure 16 - XBAS compute node console while the node starts to PXE boot
It is possible to install every CN with the preparenfs tool, or to install a single CN with the preparenfs tool and then duplicate it on every other CN server with the help of the ksis deployment tool. However, the use of ksis is only possible if XBAS is the first installed OS, since ksis overwrites all existing partitions. It is therefore advisable to only use the preparenfs tool for CN installation on an HOSC.
Check that the /etc/hosts file is consistent on XBAS CNs (see Appendix D.2.5). Configure the IB
interface on each node by editing file ifcfg-ib0 (see Appendix D.2.6) and enable the IB interface by
starting the openibd service:
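The elided commands would typically be the standard RHEL5 service commands, for example:

```
chkconfig openibd on
service openibd start
```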
In order to be able to boot Linux with the Windows MBR (after having installed HPCS on the CNs), install
the GRUB boot loader on the first sector of the /boot partition by typing on each CN:
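With the GRUB legacy shell shipped with RHEL5, installing the boot loader on the first sector of the /boot partition (assumed here to be /dev/sda1, i.e., (hd0,0)) can be sketched as follows; adapt the device coordinates to your actual layout:

```
grub
grub> root (hd0,0)
grub> setup (hd0,0)
grub> quit
```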
The last step is to edit all PXE files in the /tftboot directory and set both the TIMEOUT and PROMPT variables to 0 so that the compute nodes boot more quickly.
name next-server and server-name). The DHCP configuration file changes can be done by using the
switch_dhcp_host script (see Appendix D.2.3) for each compute node. Once the changes are done in
the file, the dhcpd service must be restarted in order to take changes into account. For example, type:
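A hedged sketch of those steps on the XBAS management node (the invocation syntax of switch_dhcp_host is an assumption; see Appendix D.2.3 for the actual script):

```
/opt/hosc/switch_dhcp_host xbas1   # hypothetical invocation, once per compute node
service dhcpd restart
```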
Now prepare the deployment of the nodes for the HPCS management console: get the MAC address of
all new compute nodes and create an XML file with the MAC address, compute node name and domain
name of each node. An example of such an XML file (my_cluster_nodes.xml) is given in
Appendix D.1.1. Import this XML file from the administrative console (see Figure 17) and assign a
deployment “compute node template” to the nodes.
Boot the compute nodes. Figure 18 shows the console of a CN while it is PXE booting for its HPCS
deployment with a DHCP server on the XBAS management node (192.168.0.1) and a WDS server on the
HPCS head node (192.168.1.1).
Intel(R) Boot Agent GE v1.2.36
Copyright (C) 1997-2005, Intel Corporation
Downloaded WDSNBP...
Architecture: x64
Contacting Server: 192.168.1.1 ............
Figure 18 - HPCS compute node console while the node starts to PXE boot
The nodes will appear with the “provisioning” state in the management console as shown in Figure 19.
After a while the compute node console shows that the installation is complete as in Figure 20.
At the end of the deployment, the compute node state is “offline” in the management console. The last
step is to click on “Bring online” in order to change the state to “online”. The HPCS compute nodes can
now be used.
5.3 Linux-Windows interoperability environment
In order to enhance the interoperability between the two management nodes, we set up a Unix/Linux
environment on the HPCS head node using the Subsystem for Unix-based Applications (SUA). We also
install SUA supplementary tools such as openssh that can be useful for HOSC administration tasks (e.g.,
ssh can be used to execute commands from a management node to the other in a safe manner).
The installation of SUA is not mandatory for setting up an HOSC, and many of these tools can also be found from other sources, but it is a rather easy and elegant way to obtain a homogeneous HOSC environment: firstly, it provides a lot of Unix tools on Windows systems, and secondly, it provides a framework for porting and running Linux applications in a Windows environment.
Other tools, such as proprietary compilers, can also be installed in the SUA environment.
User home directories should at least be shared on all compute nodes running the same OS: for
example, an NFS exported directory /home_nfs/test_user/ on XBAS CNs and a shared CIFS
directory C:\Users\test_user\ on HPCS CNs for user test_user.
It is also possible (and even recommended) to have a unique home directory for both OS’s by configuring Samba [36] on the XBAS nodes.
5.5 Configuration of ssh
[xbas0:root] cd /root/.ssh
[xbas0:root] cp id_rsa.pub authorized_keys
[xbas0:root] scp id_rsa id_rsa.pub authorized_keys root@hpcs0:.ssh/
[xbas0:root] scp id_rsa id_rsa.pub authorized_keys root@xbas1:.ssh/
Enter the root password when requested (it will not be requested again afterwards).
For copying the RSA key on the HPCS CNs see Section 5.5.4.
By default, the first time a node connects to a new host, ssh checks whether the host’s “server” RSA public key (stored in /etc/ssh/) is already known and asks the user to validate the authenticity of this new host. In order to avoid typing the “yes” answer for each node of the cluster, different ssh configurations are possible:
• The easiest, but less secure, solution is to disable the host key checking in file
/etc/ssh/ssh_config by setting: StrictHostKeyChecking no
• Another way is to merge the RSA public key of all nodes in a file that is copied on each node: the
/etc/ssh/ssh_known_hosts file. A trick is to duplicate the same server private key (stored
in file /etc/ssh/ssh_host_rsa_key) and thus the same public key (stored in file
/etc/ssh/ssh_host_rsa_key.pub) on every node. The generation of the
ssh_known_hosts file is then easier since each node has the same public key. An example of
such an ssh_known_hosts file is given in Appendix D.2.7.
5.5.3 Installation of freeSSHd on HPCS compute nodes
If you want to use PBS Professional and the OS balancing feature that was developed for our HOSC prototype, an ssh server daemon is required on each compute node. The sshd daemon is already installed by default on the XBAS CNs, and it should be installed on the HPCS CNs: we chose the freeSSHd [34] freeware. This software can be downloaded from [34] and its installation is straightforward: execute freeSSHd.exe, keep all default values proposed during the installation process, and accept to “run FreeSSHd as a system service”.
Then finish the setup by copying the RSA key file /root/.ssh/id_rsa.pub from the XBAS MN to file
C:\Program Files (x86)\freeSSHd\ssh_authorized_keys\root on the HPCS CNs. Edit
this file (C:\Program Files (x86)\freeSSHd\ssh_authorized_keys\root) and remove the
@xbas0 string at the end of the file: it should end with the string root instead of root@xbas0.
7 Thanks a lot to Laurent Aumis (SEMEA GridWorks Technical Manager at ALTAIR France) for his valuable help and expertise in setting up this PBS Professional configuration.
5.6.1 PBS Professional Server setup
Install PBS server on the XBAS MN: during the installation process, select “PBS Installation: 1. Server,
execution and commands” (see [31] for detailed instructions). By default, the MOM (Machine Oriented
Mini-server) is installed with the server. Since the MN should not be used as a compute node, stop PBS
with “/etc/init.d/pbs stop”, disable the MOM by setting PBS_START_MOM=0 in file
/etc/pbs.conf (see Appendix D.3.1) and restart PBS with “/etc/init.d/pbs start”.
If you want to use the same UID/GID on Windows and Linux nodes without unified UID management, you need to set the flag flatuid=true with the qmgr tool. The UID/GID of the PBS server will then be used. Type:
[xbas0:root] qmgr
Qmgr: set server flatuid=True
Qmgr: exit
It would also be possible to use $usecp in PBS to move files around instead of scp. Samba [36] could
be configured on Linux systems to allow the HPCS compute nodes to drop files directly to Linux servers.
1. select setup type “Execution” (only) on CNs and “Commands” (only) on the HN
2. enter pbsadmin user password (as defined on the PBS server: on XBAS MN in our case)
3. enter PBS server hostname (xbas0 in our case)
4. keep all other default values that are proposed by the PBS installer
5. reboot the node
Here is a summary of the PBS Professional configuration on our HOSC prototype. The following is a
selection of the most representative information reported by the PBS queue manager (qmgr):
Qmgr: print server
# Create and define queue windowsq
create queue windowsq
set queue windowsq queue_type = Execution
set queue windowsq default_chunk.arch = windows
set queue windowsq enabled = True
set queue windowsq started = True
# Create and define queue linuxq
create queue linuxq
set queue linuxq queue_type = Execution
set queue linuxq default_chunk.arch = linux
set queue linuxq enabled = True
set queue linuxq started = True
# Set server attributes.
set server scheduling = True
set server default_queue = workq
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server default_chunk.ncpus = 1
set server scheduler_iteration = 60
set server flatuid = True
set server resv_enable = True
set server node_fail_requeue = 310
set server max_array_size = 10000
set server pbs_license_min = 0
set server pbs_license_max = 2147483647
set server pbs_license_linger_time = 3600
set server license_count = "Avail_Global:0 Avail_Local:1024 Used:0 High_Use:8"
set server eligible_time_enable = False
Qmgr: print node xbas1
# Create and define node xbas1
create node xbas1
set node xbas1 state = free
set node xbas1 resources_available.arch = linux
set node xbas1 resources_available.host = xbas1
set node xbas1 resources_available.mem = 16440160kb
set node xbas1 resources_available.ncpus = 4
set node xbas1 resources_available.vnode = xbas1
set node xbas1 resv_enable = True
set node xbas1 sharing = default_shared
Qmgr: print node hpcs2
# Create and define node hpcs2
create node hpcs2
set node hpcs2 state = free
set node hpcs2 resources_available.arch = windows
set node hpcs2 resources_available.host = hpcs2
set node hpcs2 resources_available.mem = 16775252kb
set node hpcs2 resources_available.ncpus = 4
set node hpcs2 resources_available.vnode = hpcs2
set node hpcs2 resv_enable = True
set node hpcs2 sharing = default_shared
5.7.1 Just in time provisioning setup
This paragraph describes the implementation of a simple example of “just in time” provisioning (see
Section 3.6.3). We developed a Perl script (see pbs_hosc_os_balancing.pl in Appendix D.3.2)
that gets PBS server information about queues, jobs and nodes for both OS’s (e.g., number of free
nodes, number of nodes requested by jobs in queues, number of nodes requested by the smallest job).
Based on this information, the script checks a simple rule that defines the cases when the OS type of
CNs should be switched. If the rule is “true” then the script selects free CNs and switches their OS type.
In our example, we defined a conservative rule (i.e., the number of automatic OS switches is kept low):
“Let us define η as the smallest number of nodes requested by a queued job for a given OS type A. Let
us define α (respectively β) as the number of free nodes with the OS type A (respectively B). If η>α (i.e.,
there are not enough free nodes to run the submitted job with OS type A) and if β≥η-α (at least η-α
nodes are free with the OS type B) then the OS type of η-α nodes should be switched from B to A”.
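The arithmetic behind this rule is compact; the sketch below (plain shell, with hypothetical values η=4, α=1, β=3, not taken from the prototype) shows the test the script applies:

```shell
# eta:   smallest number of nodes requested by a queued job for OS type A
# alpha: number of free nodes with OS type A
# beta:  number of free nodes with OS type B
eta=4; alpha=1; beta=3
if [ "$eta" -gt "$alpha" ] && [ "$beta" -ge "$((eta - alpha))" ]; then
  echo "switch $((eta - alpha)) node(s) from OS type B to A"
fi
```

With these example values, the rule fires and requests switching 3 nodes from B to A.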
The script is run periodically based on the schedule defined by the crontab of the PBS server host. The
administrator can also switch more OS’s manually if necessary at any time (see Sections 6.3 and 6.4).
The crontab setup can be done by editing the following lines with the crontab command 8 :
[xbas0:root] crontab -e
# run HOSC Operating System balancing script every 10 minutes (noted */10)
*/10 * * * * /opt/hosc/pbs_hosc_os_balancing.pl
The OS distribution balancing is then controlled by this cron job. Instead of running the
pbs_hosc_os_balancing.pl script as a cron job, it would also be possible to call it as an external
scheduling resource sensor (see [31] for information about PBS Professional scheduling resources), or to
call it with PBS Professional hooks (see [31]). For developing complex OS balancing rules, the Perl script
could be replaced by a C program (for details about PBS Professional API see [33]).
This simple script could be further developed in order to be more reliable. For example:
• check that the script is only run once at a time (by setting a lock file for example),
• allow switching the OS type of more than η-α nodes at once if the number of free nodes and the number of queued jobs are high (this can happen when many small jobs are submitted),
• impose a delay between two possible switches of OS type on each compute node.
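The first item (single-instance protection via a lock file) can be sketched with flock from util-linux; the lock file path below is an assumption:

```shell
#!/bin/sh
LOCKFILE=/var/lock/hosc_os_balancing.lock   # hypothetical path
exec 9>"$LOCKFILE"
if ! flock -n 9; then
  # another balancing pass is still running: exit silently
  exit 0
fi
echo "balancing pass started"
# ... the actual balancing logic would run here, lock held until exit ...
```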
8 “crontab -e” opens the /var/spool/cron/root file in a vi mode and restarts the cron service automatically.
6 Administration of the HOSC prototype
For HPCS, this means that the basic services and connectivity tests should be run first, followed by the
automated diagnosis tests from the “cluster management” MMC.
For XBAS, the sanity checks can be done with basic Linux commands (ping, pdsh, etc.) and monitoring
tools like Nagios (see [23] and [24] for details).
The HPCS head node can send a reboot command to its HPCS compute nodes only (soft reboot) with
“clusrun”. For example:
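A hedged example (the node name and shutdown options are illustrative):

```
C:\> clusrun /nodes:hpcs2 shutdown /r /t 0
```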
Use “clusrun /all” for rebooting all HPCS compute nodes (the head node should not be declared as
a compute node; otherwise this command would reboot it too).
The XBAS management node can send a reboot command to its XBAS compute nodes only (soft reboot)
with pdsh. For example:
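A hedged example (node list shown for illustration):

```
pdsh -w xbas[1-2] reboot
```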
The XBAS management node can also reboot any compute node (HPCS or XBAS) with the NovaScale
control “nsctrl” command (hard reboot). For example:
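A hedged example; the "reset" action name is an assumption about the nsctrl command syntax:

```
nsctrl reset xbas1
```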
The compute node is then automatically rebooted with the HPCS OS type.
6.4 Switch a compute node OS type from HPCS to XBAS
Then take the node offline in the MMC and type the from_HPCS_to_XBAS.bat command in a
“command prompt” window of the HPCS head node. See Appendix D.1.3 for information on this
command implementation. For example, if you want to switch the OS of node hpcs2, type:
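The elided command presumably takes the node name as an argument; a hedged sketch, using the shared directory created earlier:

```
C:\> C:\hosc\from_HPCS_to_XBAS.bat hpcs2
```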
The compute node is then automatically rebooted with the XBAS OS type.
This script was mainly implemented to be used with a meta-scheduler, since it is not recommended to switch the OS type of an HPCS CN by sending a command from the XBAS MN (see Section 4.3).
6.5 Re-deploy an OS
The goal is to be able to re-deploy an OS on an HOSC without impacting the other OS that is already
installed. Do not forget to save your MBR since it can be overwritten during the installation phase (see
Appendix C.2).
For re-deploying XBAS compute nodes, the ksis tool cannot be used (it would erase the existing Windows partitions); the preparenfs command is the only tool that can be used. The partition declarations done in the kickstart file should then be edited in order to reuse the existing partitions, and not to remove them or create new ones. The modifications are slightly different from those done for the first install. If the existing partitions are those created with the kickstart file shown as an example in Appendix D.2.1:
Then the new kickstart file used for re-deploying a XBAS compute node should include the lines below:
/release/ks/kickstart.<identifier>
…
part /boot --fstype="ext3" --onpart sda1
part / --fstype="ext3" --onpart sda2
part swap --noformat --onpart sda3
…
In the PXE file stored on the MN (e.g., /tftboot/C0A80002 for node xbas1), the DEFAULT label should be set back to ks instead of local_primary. The CN can then be rebooted to start the re-deployment process.
For re-deploying Windows HPC Server 2008 compute nodes, check that the partition number in the unattend.xml file is consistent with the existing partition table and edit it if necessary (in our example: <PartitionID>4</PartitionID>). Edit the diskpart.txt file so that it only re-formats the NTFS Windows partition without cleaning or removing the existing partitions (see Appendix D.1.1). Manually update/delete the previous computer and hostname declarations in the Active Directory before re-deploying the nodes, and then run the compute node deployment template as for the first install.
my_job_Win.sub
#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -q windowsq
C:\Users\test_user\my_windows_application
my_job_Lx.sub
#!/bin/bash
#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -q linuxq
/home/test_user/my_linux_application
Whatever the OS type the application should run on, the scripts can be submitted from any Windows or
Linux computer with the same qsub command. The only requirement is that the computer needs to have
credentials to connect with the PBS Professional server.
The command lines can be typed from a Windows system:
C:\> qsub my_job_Win.sub
C:\> qsub my_job_Lx.sub
You can check the PBS queue status with the qstat command. Here is the example of an output:
[xbas0:root] qstat -n
xbas0:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
129.xbas0 thomas windowsq my_job_Win 3316 2 8 -- -- R 03:26
hpcs3/0*4+hpcs4/0*4
130.xbas0 laurent linuxq my_job_Lx. 21743 2 8 -- -- R 01:23
xbas1/0*4+xbas2/0*4
131.xbas0 patrice linuxq my_job_Lx. -- 2 8 -- -- Q --
--
132.xbas0 patrice linuxq my_job_Lx. -- 1 4 -- -- Q --
--
133.xbas0 laurent windowsq my_job_Win -- 2 8 -- -- Q --
--
134.xbas0 thomas windowsq my_job_Win -- 1 4 -- -- Q --
--
135.xbas0 thomas windowsq my_job_Win -- 1 1 -- -- Q --
--
136.xbas0 patrice linuxq my_job_Lx. -- 1 1 -- -- Q --
--
Figure 21 - PBS monitor with all 4 compute nodes busy (2 with XBAS and 2 with HPCS)
Figure 22 - PBS monitor with 1 busy HPCS compute node and 3 free XBAS compute nodes
7 Conclusion and perspectives
We studied 12 different approaches to HPC clusters that can run 2 OS’s. We particularly focused on those able to run the 2 OS’s simultaneously, which we named Hybrid Operating System Clusters (HOSC). The 12 approaches have dozens of possible implementations, among which the most common alternatives were discussed, resulting in technical recommendations for designing an HOSC.
This collaborative work between Microsoft and Bull gave the opportunity to build an HOSC prototype that
provides computing power under Linux Bull Advanced Server for Xeon and Windows HPC Server 2008
simultaneously. The prototype has 2 virtual management nodes installed on 2 Xen virtual machines run
on a single host server with RHEL5.1, and 4 dual-boot compute nodes that boot with the Windows master
boot record. The methodology to dynamically switch the OS type easily on some compute nodes without
disturbing the other compute nodes was provided.
A meta-scheduler based on Altair PBS Professional was implemented. It provides a single submission
point for both Linux and Windows and it adapts automatically (with some simple rules given as example)
the distribution of OS types among the compute nodes to the user needs (i.e., the pool of submitted jobs).
This successful project could be continued with the aim of improving the current HOSC prototype
features. Ideas of possible improvements are to
• develop a unique monitoring tool for both OS compute nodes (e.g., based on Ganglia [35]);
• work on interoperability between PBS and HPCS job scheduler (e.g., by using the tools of OGF,
the Open Grid Forum [37]).
We could also work on security aspects that were intentionally overlooked during this first study. More
intensive and exhaustive performance tests with virtual machines (e.g., InfiniBand ConnectX virtualization
feature, virtual processor binding, etc.) could also be done. Finally, a third OS could be installed on our
HOSC prototype to validate the general nature of the method exposed.
More generally, the framework presented in this paper should be considered as a building block for more specific implementations. Various requirements of real applications, environments or workloads could lead to noticeably different or more sophisticated developments. We hope that this initial building block will help those who will add subsequent layers, and we are eager to hear about successful production environments designed from it 9.
9 Do not hesitate to send the authors your comments about this paper and your HOSC experiments: patrice.calegari@bull.net and thomas.varlet@microsoft.com.
Appendix A: Acronyms
MBR Master Boot Record
MMC Microsoft Management Console
MN Management Node (Bull)
MOM Machine Oriented Mini-server (Altair)
MPI Message Passing Interface
MULTICS Multiplexed Information and Computing Service
NBP Network Boot Program
ND Network Direct (Microsoft)
NFS Network File System
NPB NASA advanced supercomputing (NAS) Parallel Benchmarks
NTFS New Technology File System (Windows)
OGF Open Grid Forum
OS Operating System
PBS Portable Batch System
PXE Pre-boot eXecution Environment
RHEL RedHat Enterprise Linux
ROI Return On Investment
RSA Rivest, Shamir, and Adelman
SDK Software Development Kit
SGE Sun Grid Engine
SLURM Simple Linux Utility for Resource Management
SSH Secure SHell
SUA Subsystem for Unix-based Applications
TCO Total Cost of Ownership
TCP Transmission Control Protocol
TFTP Trivial File Transfer Protocol
UDP User Datagram Protocol
UID User IDentifier
UNIX This is a pun on MULTICS (not an acronym!)
VM Virtual Machine
VNC Virtual Network Computing
VT Virtual Technology (Intel®)
WCCS Windows Compute Cluster Server
WDS Windows Deployment Service
WIM Windows IMage (Microsoft)
WinPE Windows Preinstallation Environment (Microsoft)
XBAS Bull Advanced Server for Xeon
XML eXtensible Markup Language
Appendix B: Bibliography and related links
[1] “Dual Boot: Windows Compute Cluster Server 2003 and Linux - Setup and Configuration
Guide”, July 2007. This white paper describes the installation and configuration of an HPC cluster
for a dual-boot of Windows Compute Cluster Server 2003 (WCCS) and Linux OpenSuSE.
http://www.microsoft.com/downloads/details.aspx?FamilyID=1457BC0A-EAFF-4303-99ED-B199AB1C0857&displaylang=en
[2] “Dual Boot: Windows Compute Cluster Server and Rocks Cluster Distribution - Setup and
Configuration Guide”, Jason Bucholtz, HPC Practice Lead, X-ISS, Michael Zebrowski, HPC
Analyst, X-ISS, 2007. This white paper describes the installation and configuration of an HPC
cluster for a dual-boot of WCCS 2003 and Rocks Cluster Distribution (formerly called NPACI
Rocks). http://www.microsoft.com/downloads/details.aspx?FamilyID=e73a468e-2dbf-4782-8faa-aaa20acb63f8&DisplayLang=en
[3] “Dual-boot Linux and HPC Server 2008” on G. Marchetti blog:
http://blogs.technet.com/gmarchetti/archive/2007/12/11/dual-boot-linux-and-hpc-server-2008.aspx
[4] BULL S.A.S. HPC solutions: http://www.bull.com/hpc
[5] Windows HPC Server: http://www.microsoft.com/hpc and http://www.windowshpc.net
[6] Xen: http://xen.xensource.com
[7] VMware: http://www.vmware.com
[8] Hyper-V: http://www.microsoft.com/windowsserver2008/en/us/virtualization-consolidation.aspx
[9] PowerVM: http://www-03.ibm.com/systems/power/software/virtualization/index.html
[10] Virtuozzo: http://www.parallels.com/en/products/virtuozzo/
[11] OpenVZ: http://openvz.org
[12] PBS Professional: http://www.pbsgridworks.com/ and http://www.altair.com/
[13] Torque: http://www.clusterresources.com/pages/products/torque-resource-manager.php
[14] SLURM: https://computing.llnl.gov/linux/slurm/
[15] LSF: http://www.platform.com/Products/platform-lsf
[16] SGE: http://gridengine.sunsource.net
[17] OAR: http://oar.imag.fr/index.html
[18] Wikipedia: http://www.wikipedia.org
[19] Moab and Maui: http://www.clusterresources.com
[20] GridWay: http://www.gridway.org
[21] Community Scheduler Framework: http://sourceforge.net/projects/gcsf
[22] “BAS5 for Xeon - Installation & Configuration Guide”, Ref: 86 A2 87EW00, April 2008
[23] “BAS5 for Xeon - Administrator’s guide”, Ref: 86 A2 88EW, April 2008
[24] “BAS5 for Xeon - User’s guide”, Ref: 86 A2 89EW, April 2008
[25] "A Comparison of Virtualization Technologies for HPC", John Paul Walters, Vipin Chaudhary,
Minsuk Cha, Salvatore Guercio Jr., Steve Gallo, In Proceedings of the 22nd International
Conference on Advanced Information Networking and Applications (AINA 2008), pp. 861-868, 2008
DOI= http://doi.ieeecomputersociety.org/10.1109/AINA.2008.45
[26] “Evaluating the Performance Impact of Xen on MPI and Process Execution For HPC
Systems”, Youseff, L., Wolski, R., Gorda, B., and Krintz, C. In Proceedings of the 2nd international
Workshop on Virtualization Technology in Distributed Computing, Virtualization Technology in
Distributed Computing, IEEE Computer Society, 2006, DOI= http://dx.doi.org/10.1109/VTDC.2006.4
[27] Mellanox Basic InfiniBand Software Stack for Windows HPC Server 2008 including
NetworkDirect support http://www.mellanox.com/products/MLNX_WinOF.php
[28] Utilities and SDK for Subsystem for UNIX-based Applications (SUA) in Microsoft Windows Vista
RTM/Windows Vista SP1 and Windows Server 2008 RTM:
http://www.microsoft.com/downloads/details.aspx?familyid=93ff2201-325e-487f-a398-efde5758c47f&displaylang=en
[29] Interops Systems: http://www.interopsystems.com
[30] SUA Community: http://www.suacommunity.com
[31] PBS Professional 10.0 Administrator’s Guide, 610 pages, GridWorks, Altair, 2009
[32] PBS Professional 10.0 User’s Guide, 304 pages, GridWorks, Altair, 2009
[33] PBS Professional 10.0 External Reference Specification, GridWorks, Altair, 2009
[34] freeSSHd and freeFTPd: http://www.freesshd.com
[35] Ganglia: http://ganglia.info
[36] Samba: http://www.samba.org
[37] Open Grid Forum: http://www.ogf.org
[38] Top500 supercomputing site: http://www.top500.org
Additional resources:
http://www.bull.com/techtrends
http://www.microsoft.com/downloads
http://technet.microsoft.com/en-us/library/cc700329(WS.10).aspx
Appendix C: Master boot record details

Address (Hex)   Address (Dec)   Description                            Size in bytes
0000            0               Code area                              ≤ 446
01B8            440             Optional disk signature                4
01BC            444             Usually null: 0x0000                   2
01BE            446             Table of primary partitions            64
                                (four 16-byte partition structures)
01FE            510             MBR signature: 0xAA55                  2
                                (55h at 01FE, AAh at 01FF)

MBR total size: 446 + 64 + 2 = 512

Table 3 - Structure of a Master Boot Record
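The layout in Table 3 can be checked programmatically. The following Python sketch (an illustrative helper, not part of the HOSC scripts) parses a 512-byte MBR image and validates the 0xAA55 signature:

```python
import struct

def parse_mbr(mbr):
    """Parse a 512-byte MBR image according to the layout of Table 3."""
    if len(mbr) != 512:
        raise ValueError("an MBR is exactly 512 bytes")
    # Bytes 510-511 must hold the signature 0xAA55 (55h then AAh on disk).
    if mbr[510:512] != b'\x55\xaa':
        raise ValueError("missing MBR signature 0xAA55")
    # Optional 4-byte disk signature at offset 0x01B8 (440), little-endian.
    disk_signature = struct.unpack_from('<I', mbr, 440)[0]
    # Four 16-byte primary partition entries starting at offset 0x01BE (446).
    partitions = [mbr[446 + 16 * i: 446 + 16 * (i + 1)] for i in range(4)]
    return {'code_area': mbr[:440],
            'disk_signature': disk_signature,
            'partitions': partitions}

# Example: a minimal valid MBR image (all zeros except the signature).
image = bytearray(512)
image[510:512] = b'\x55\xaa'
info = parse_mbr(bytes(image))
print(hex(info['disk_signature']))
```

Such a check can be useful before restoring a saved MBR file, to make sure the file really contains a boot sector.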
To save the MBR, copy the first sector of the system disk to a file. On Linux, type, for example (assuming the system disk is /dev/sda; adjust the device name to your configuration):

dd if=/dev/sda of=/tmp/mbr.save bs=512 count=1

If you want to restore the MBR, replace the first sector with the saved file. On Linux, type:

dd if=/tmp/mbr.save of=/dev/sda bs=512 count=1

On Windows Server 2008, the MBR can be restored even if it was not previously saved. From the repair command prompt, type:

bootrec /fixmbr
Appendix D: Files used in examples
Here are the files (scripts, configuration files, etc.) written or modified to build the HOSC prototype and to validate the information given in this document.
Default file:
…
<InstallTo>
  <DiskID>0</DiskID>
  <PartitionID>1</PartitionID>
</InstallTo>
…

Modified file (if XBAS uses the first 3 partitions, then Windows can be installed on the 4th partition):
…
<InstallTo>
  <DiskID>0</DiskID>
  <PartitionID>4</PartitionID>
</InstallTo>
…
select disk 0
clean
create partition primary
assign letter=c
format FS=NTFS LABEL="Node" QUICK OVERRIDE
active
exit

Note: the “clean” instruction removes all existing partitions. It must be deleted to preserve existing partitions.
my_cluster_nodes.xml
<?xml version="1.0" encoding="utf-8"?>
<Nodes xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns="http://schemas.microsoft.com/HpcNodeConfigurationFile/2007/12">
<Node Name="hpcs1" Domain="WINISV">
<MacAddress>003048334cf6</MacAddress>
</Node>
<Node Name="hpcs2" Domain="WINISV">
<MacAddress>003048334d04</MacAddress>
</Node>
<Node Name="hpcs3" Domain="WINISV">
<MacAddress>003048334d3c</MacAddress>
</Node>
<Node Name="hpcs4" Domain="WINISV">
<MacAddress>003048347990</MacAddress>
</Node>
</Nodes>
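Before importing such a node configuration file, it can be sanity-checked with a few lines of Python. The helper below is illustrative (not part of the HOSC scripts); note that the elements live in the HpcNodeConfigurationFile namespace, so the tag names must be qualified:

```python
import xml.etree.ElementTree as ET

NS = {'hpc': 'http://schemas.microsoft.com/HpcNodeConfigurationFile/2007/12'}

def list_nodes(xml_text):
    """Return (node name, MAC address) pairs from an HPCS node configuration file."""
    root = ET.fromstring(xml_text)
    return [(n.get('Name'), n.find('hpc:MacAddress', NS).text)
            for n in root.findall('hpc:Node', NS)]

XML = '''<?xml version="1.0" encoding="utf-8"?>
<Nodes xmlns="http://schemas.microsoft.com/HpcNodeConfigurationFile/2007/12">
  <Node Name="hpcs1" Domain="WINISV"><MacAddress>003048334cf6</MacAddress></Node>
  <Node Name="hpcs2" Domain="WINISV"><MacAddress>003048334d04</MacAddress></Node>
</Nodes>'''

for name, mac in list_nodes(XML):
    print(name, mac)
```

A quick check like this catches typos in MAC addresses or node names before they reach the head node.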
Next
End Function
'---------------------------------------------------------------------
Function setIPoIB(IPAddress)
PartialIP=Split(ipaddress,".")
strIPAddress = Array("10.1.0." & PartialIP(3))
strSubnetMask = Array("255.255.255.0")
strGatewayMetric = Array(1)
WScript.Echo "IB: " & strIPAddress(0)
strComputer = "."
Set objWMIService = GetObject("winmgmts:" _
& "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
Set colNetAdapters = objWMIService.ExecQuery _
("select * from win32_networkadapterconfiguration " _
& "where IPEnabled=true and description like 'Mellanox%'")
For Each objNetAdapter in colNetAdapters
errEnable = objNetAdapter.EnableStatic(strIPAddress, strSubnetMask)
If errEnable = 0 Then
SetIPoIB="The IP address on Infiniband has been changed"
Else
SetIPoIB="The IP address on IB could not be changed. Error: " & errEnable
End If
Next
End Function
C:\hosc\activate_partition_XBAS.bat
@echo off
rem the argument is the head node hostname for shared file system mount. For example: \\HPCS0
echo ... Partitioning disk...
diskpart.exe /s %1\hosc\diskpart_commands.txt
echo ... Shutting down node %COMPUTERNAME% ...
shutdown /r /f /t 20 /d p:2:4
C:\hosc\diskpart_commands.txt
select disk 0
select partition 1
active
C:\hosc\from_HPCS_to_XBAS.bat
@echo off
rem the argument is the node hostname. For example: hpcs1
echo Check that file dhcpd.conf is updated on the XBAS management node !
if NOT "%1"=="" clusrun /nodes:%1 %LOGONSERVER%\hosc\activate_partition_XBAS.bat %LOGONSERVER%
if "%1"=="" echo "usage: from_HPCS_to_XBAS.bat <hpcs_hostname>"
D.2 XBAS files
…
part / --asprimary --fstype="ext3" --ondisk=sda --size=10000
part /usr --asprimary --fstype="ext3" --ondisk=sda --size=10000
part /opt --fstype="ext3" --ondisk=sda --size=10000
part /tmp --fstype="ext3" --ondisk=sda --size=10000
part swap --fstype="swap" --ondisk=sda --size=16000
part /var --fstype="ext3" --grow --ondisk=sda --size=10000
…
…
part /boot --asprimary --fstype="ext3" --ondisk=sda --size=100
part / --asprimary --fstype="ext3" --ondisk=sda --size=50000
part swap --fstype="swap" --ondisk=sda --size=16000
…
Here is an example of a PXE file generated by preparenfs for node xbas1. Before deployment, the DEFAULT label is set to ks; after deployment, it is automatically set to local_primary.
LABEL local_primary
KERNEL chain.c32
APPEND hd0
LABEL ks
KERNEL RHEL5.1/vmlinuz
APPEND console=tty0 console=ttyS1,115200 ksdevice=eth0 lang=en ip=dhcp
ks=nfs:192.168.0.99:/release/ks/kickstart.22038 initrd=RHEL5.1/initrd.img driverload=igb
LABEL rescue
KERNEL RHEL5.1/vmlinuz
APPEND console=ttyS1,115200 ksdevice=eth0 lang=en ip=dhcp
method=nfs:192.168.0.99:/release/RHEL5.1 initrd=RHEL5.1/initrd.img rescue driverload=igb
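The automatic change of the DEFAULT label after deployment amounts to rewriting a single line of the PXE file. A minimal Python sketch of that step (a hypothetical helper, not part of the Bull tools) could be:

```python
def set_default_label(config_text, label):
    """Rewrite the DEFAULT line of a pxelinux configuration file."""
    lines = []
    replaced = False
    for line in config_text.splitlines():
        words = line.split()
        if words and words[0].upper() == 'DEFAULT':
            lines.append('DEFAULT ' + label)   # replace the existing DEFAULT line
            replaced = True
        else:
            lines.append(line)
    if not replaced:
        lines.insert(0, 'DEFAULT ' + label)    # no DEFAULT line yet: add one
    return '\n'.join(lines) + '\n'

cfg = 'TIMEOUT 10\nDEFAULT ks\nPROMPT 1\n'
print(set_default_label(cfg, 'local_primary'))
```

Switching the label back to ks before the next deployment is the same operation in reverse.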
/tftpboot/C0A80002 (head of the file after compute node deployment)
# GENERATED BY PREPARENFS SCRIPT
TIMEOUT 10
DEFAULT local_primary
PROMPT 1
The remainder of the file is unchanged. Set TIMEOUT and PROMPT to 0 to make the nodes boot more quickly.
/etc/dhcpd.conf
The NBP file path must be written with doubled backslashes (\\) so that it is correctly interpreted during the PXE boot.
/opt/hosc/switch_dhcp_host
#!/usr/bin/python -t
import os, os.path, sys
############## Cluster characteristics must be written here ################
xbas_hostname_base='xbas'
hpcs_hostname_base='hpcs'
field_dict = {hpcs_hostname_base:{'filename':'"Boot\\\\x64\\\\WdsNbp.com";\n',
'fixed-address':'192.168.1.',
'next-server':'192.168.1.1;\n',
'server-name':'"192.168.1.1";\n'},
xbas_hostname_base:{'filename':'"pxelinux.0";\n',
'fixed-address':'192.168.0.',
'next-server':'192.168.0.1;\n',
'server-name':'"192.168.0.1";\n'}}
A Hybrid OS Cluster Solution: Dual-Boot and Virtualization with Windows HPC Server 2008 and Linux Bull Advanced Server for Xeon
65
if (len(sys.argv) != 2):
print ('usage: switch_dhcp_host <current compute node hostname>')
sys.exit(1)
elif (len(str(sys.argv[1]))>1) and (str(sys.argv[1])[-2:].isdigit()):
node_base = str(sys.argv[1])[:-2]
node_rank = str(sys.argv[1])[-2:]
else:
node_base = str(sys.argv[1])[:-1]
node_rank = str(sys.argv[1])[-1:]
if (node_base == xbas_hostname_base ):
old_hostname= xbas_hostname_base + node_rank
new_hostname=hpcs_hostname_base + node_rank
new_node_base = hpcs_hostname_base
elif (node_base == hpcs_hostname_base):
old_hostname=hpcs_hostname_base + node_rank
new_hostname= xbas_hostname_base + node_rank
new_node_base = xbas_hostname_base
else:
print ('unknown hostname: ' + sys.argv[1])
sys.exit(1)
file_name = '/etc/dhcpd.conf'
if not os.path.isfile(file_name):
print file_name + ' does not exist!'
sys.exit(1)
status = 'File ' + file_name + ' was not modified'
file_name_save = file_name + '.save'
file_name_temp = file_name + '.temp'
old_file = open(file_name,'r')
new_file = open(file_name_temp,'w')
S = old_file.readline()
while S:
if (S[0:11] == 'next-server'): S = old_file.readline() # Removes global next-server line
if (S.find('host ' + old_hostname) != -1):
while (S.find('hardware ethernet') == -1):
S = old_file.readline() # Skips old host section lines
hardware_ethernet=S.split()[2] # Gets host Mac address
while (S.find('}') == -1):
S = old_file.readline() # Skips old host section lines
# Writes new host section lines:
new_file.write(' host ' + new_hostname + ' {\n')
new_file.write(' filename ' + field_dict[new_node_base]['filename'])
new_file.write(' fixed-address ' + field_dict[new_node_base]['fixed-address']
+ str(int(node_rank)+1) + ';\n')
new_file.write(' hardware ethernet ' + hardware_ethernet + '\n')
new_file.write(' option host-name ' + '"' + new_hostname + '";\n')
new_file.write(' next-server ' + field_dict[new_node_base]['next-server'])
new_file.write(' server-name ' + field_dict[new_node_base]['server-name'])
if (new_node_base == hpcs_hostname_base):
new_file.write('option domain-name-servers '+field_dict[new_node_base]['next-server'])
new_file.write(' }\n')
status = 'File ' + file_name + ' is updated with host ' + new_hostname
else: new_file.write(S) # Copies the line from the original file without modifications
S = old_file.readline()
# End while loop
old_file.close()
new_file.close()
if os.path.isfile(file_name_save): os.remove(file_name_save)
os.rename(file_name,file_name_save)
os.rename(file_name_temp,file_name)
print status
print ('Do not forget to validate changes by typing: service dhcpd restart')
sys.exit(0)
# End of switch_dhcp_host script
/opt/hosc/activate_partition_HPCS.sh
#!/bin/sh
#the argument is the node hostname. For example: xbas1
ssh $1 fdisk /dev/sda < /opt/hosc/fdisk_commands.txt
/opt/hosc/fdisk_commands.txt
a
4
a
1
w
q
/opt/hosc/from_XBAS_to_HPCS.sh
#!/bin/sh
#the argument is the node hostname. For example: xbas1
/opt/hosc/switch_dhcp_host $1
/sbin/service dhcpd restart
/opt/hosc/activate_partition_HPCS.sh $1
ssh $1 shutdown -r -t 20 now
/opt/hosc/from_HPCS_to_XBAS.sh
#!/bin/sh
#this script requires a ssh server daemon to be installed on the HPCS compute nodes
#the argument is the compute node hostname. For example: hpcs1
#HPCS head node hostname is hard coded in this script as: hpcs0
/opt/hosc/switch_dhcp_host $1
/sbin/service dhcpd restart
ssh $1 -l root cmd /c \\\\hpcs0\\hosc\\activate_partition_XBAS.bat \\\\hpcs0
/etc/xen/xen-config.sxp
Then create file:
/etc/xen/scripts/my-network-bridges
#!/bin/bash
XENDIR="/etc/xen/scripts"
$XENDIR/network-bridge "$@" netdev=eth0 bridge=xenbr0 vifnum=0
$XENDIR/network-bridge "$@" netdev=eth1 bridge=xenbr1 vifnum=1
/etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.0.1 xbas0
192.168.0.2 xbas1
192.168.0.3 xbas2
192.168.0.4 xbas3
192.168.0.5 xbas4
172.16.0.1 xbas0-ic0
172.16.0.2 xbas1-ic0
172.16.0.3 xbas2-ic0
172.16.0.4 xbas3-ic0
172.16.0.5 xbas4-ic0
/etc/sysconfig/network-scripts/ifcfg-ib0
DEVICE=ib0
ONBOOT=yes
BOOTPROTO=static
NETWORK=192.168.220.0
IPADDR=192.168.220.2
D.3 Meta-scheduler setup files
/etc/pbs.conf
PBS_EXEC=/opt/pbs/default
PBS_HOME=/var/spool/PBS
PBS_START_SERVER=1
PBS_START_MOM=0
PBS_START_SCHED=1
PBS_SERVER=xbas0
PBS_SCP=/usr/bin/scp
Here is an example of a PBS Professional configuration file for the PBS MOM on the XBAS CNs:
/etc/pbs.conf
PBS_EXEC=/opt/pbs/default
PBS_HOME=/var/spool/PBS
PBS_START_SERVER=0
PBS_START_MOM=1
PBS_START_SCHED=0
PBS_SERVER=xbas0
PBS_SCP=/usr/bin/scp
C:\Windows\System32\drivers\etc\lmhosts
192.168.0.1 xbas0 #PBS server for HOSC
“Let us define η as the smallest number of nodes requested by a queued job for a given OS type A. Let
us define α (respectively β) as the number of free nodes with the OS type A (respectively B). If η>α (i.e.,
there are not enough free nodes to run the submitted job with OS type A) and if β≥η-α (at least η-α
nodes are free with the OS type B) then the OS type of η-α nodes should be switched from B to A”.
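The rule quoted above translates directly into a few lines of code. Here is a minimal Python sketch of the decision (the Perl script below implements the same logic inside PBS):

```python
def nodes_to_switch(eta, alpha, beta):
    """Number of nodes whose OS should be switched from type B to type A.

    eta:   smallest number of nodes requested by a queued job for OS type A
    alpha: number of free nodes with OS type A
    beta:  number of free nodes with OS type B
    Returns eta - alpha if a switch is both needed and possible, else 0.
    """
    if eta > alpha and beta >= eta - alpha:
        return eta - alpha
    return 0

print(nodes_to_switch(4, 1, 3))  # 3: switch three B nodes to A
print(nodes_to_switch(4, 1, 2))  # 0: not enough free B nodes
print(nodes_to_switch(2, 3, 5))  # 0: the job can already run on A
```

The same function is applied symmetrically for jobs queued on the other OS type.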
/opt/hosc/pbs_hosc_os_balancing.pl
#!/usr/bin/perl
#use strict;
#Gets information with qstat about the number of nodes requested by queued jobs
$command_qstat = "/usr/pbs/bin/qstat -a |";
open (PBSC, $command_qstat ) or die "Failed to run command: $command_qstat";
@cmd_output = <PBSC>;
close (PBSC);
$nb_windows_nodes_of_smallest_job = 1e09;
$nb_linux_nodes_of_smallest_job = 1e09;
foreach $line (@cmd_output) {
if ((split(' ', $line))[9] =~ "Q") {
$nb_nodes = (split(' ', $line))[5];
if ($line =~ "windowsq") {
$nb_windows_nodes_queued += $nb_nodes;
if ($nb_nodes < $nb_windows_nodes_of_smallest_job) {
$nb_windows_nodes_of_smallest_job = $nb_nodes;
}
} elsif ($line =~ "linuxq") {
$nb_linux_nodes_queued += $nb_nodes;
if ($nb_nodes < $nb_linux_nodes_of_smallest_job) {
$nb_linux_nodes_of_smallest_job = $nb_nodes;
}
}
}
}
#STDOUT is redirected to a LOG file
open LOG, ">>/tmp/pbs_hosc_log.txt";
select LOG;
#Compute the number of possible requested nodes whose OS type should be switched
$requested_windows_nodes = $nb_windows_nodes_of_smallest_job - scalar @free_windows_nodes;
$requested_linux_nodes = $nb_linux_nodes_of_smallest_job - scalar @free_linux_nodes;
The above script is run every 10 minutes, as defined by the crontab file:
/var/spool/cron/root
# run HOSC Operating System balancing script every 10 minutes (noted */10)
*/10 * * * * /opt/hosc/pbs_hosc_os_balancing.pl
Appendix E: Hardware and software used for the examples
Here are the details of the hardware and software configuration used to illustrate the examples in this document. This configuration was used to build the HOSC prototype and to validate the information given here. Any Bull NovaScale or bullx cluster running Linux Bull Advanced Server for Xeon and Windows HPC Server 2008 could be used in the same manner.
E.1 Hardware
• 1 Bull NovaScale R460 server
E.2 Software
• Windows
o Windows HPC Server 2008: Windows Server 2008 Standard and the Microsoft HPC
Pack
o Intel® network adapter driver for Windows Vista and Server 2008 x64 v13.1.2
o Mellanox InfiniBand Software Stack for Windows HPC Server 2008 v1.4.1
o Microsoft Utilities and SDK for UNIX-based Applications AMD64 (v. 10.0.6030.0) and
Interops Systems “Power User” add-on bundle (v. 6.0)
o freeSSHd 1.2.1
• Linux
o Bull Advanced Server for Xeon 5v1.1: Red Hat Enterprise Linux 5.1 including Xen
3.0.3 with Bull XHPC and XIB packs (optional: Bull Hypernova 1.1.B2)
Appendix F: About Altair and PBS GridWorks
Appendix G: About Microsoft and Windows HPC Server 2008
More information and resources for Windows HPC Server 2008 are available at:
Appendix H: About BULL S.A.S.
Bull is one of the leading European IT companies, and has become an indisputable player in the
High-Performance Computing field in Europe, with exceptional growth over the past four years, major
contracts, numerous records broken, and significant investments in R&D.
In June 2009, Bull confirmed its commitment to supercomputing, with the launch of its bullx range: the first
European-designed supercomputers to be totally dedicated to Extreme Computing. Designed by Bull’s team of
specialists working in close collaboration with major customers, bullx embodies the company’s strategy to
become one of the three worldwide leaders in Extreme Computing, and number one in Europe. The bullx
supercomputers benefit from the know-how and skills of Europe’s largest center of expertise dedicated to
Extreme Computing. Delivering anything from a few teraflops to several petaflops of computing power, they are
easy to implement by everyone from a small R&D office to a world-class data center.
Bull has now won worldwide recognition thanks to several TOP500-class systems (see [38]). Bull has gathered significant momentum in HPC in recent years, with over 120 customers in 15 countries across three continents. The spread of countries and industry sectors covered, as well as the sheer diversity of solutions that Bull has sold, illustrates the reputation that the company now enjoys. Its installations range from the first major supercomputer installed at the CEA to the numerous supercomputers delivered to higher education establishments in Brazil, France, Spain, Germany and the United Kingdom, such as the two large clusters acquired by the Jülich Research Center, which deliver a combined peak performance of more than 300 teraflops. In industry, prestigious customers including Alcan, Pininfarina, Dassault-Aviation and Alenia have chosen Bull solutions, and Miracle Machines in Singapore implemented a Bull supercomputer that will be used to study and help predict tsunamis.
Alongside this commercial success, the breaking of a number of world records highlights Bull's expertise in the
design and integration of the most advanced technologies. Bull systems have achieved some major
performance records, particularly for ultra-large file systems, image searches in very large-scale databases (the
engines of future research), and the search for new prime numbers. These systems have also been used to
carry out the most extensive simulation ever of the formation of the structures of the Universe.
To prepare the systems of the future, Bull is a founder or member of several important consortia, including Parma, which forms part of ITEA2 and brings together a large number of European research centers to develop the next generation of parallel systems. Finally, Bull is a founder member of the POPS consortium, under the auspices of the SYSTEM@TIC competitiveness cluster based in the Ile-de-France region, which is developing tomorrow's petascale systems.
Bull and the French Atomic Energy Authority (CEA) are currently collaborating to design and build Tera 100,
the future petascale supercomputer to support the French nuclear simulation program.
For more information, visit http://www.bull.com/hpc