The information contained in this document represents the current view of Microsoft Corporation and
BULL S.A.S. on the issues discussed as of the date of publication. Because Microsoft and BULL S.A.S.
must respond to changing market conditions, it should not be interpreted to be a commitment on the part
of Microsoft or BULL S.A.S., and Microsoft and BULL S.A.S. cannot guarantee the accuracy of any
information presented after the date of publication.
This White Paper is for informational purposes only. MICROSOFT and BULL S.A.S. MAKE NO
WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS
DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights
under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval
system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or
otherwise), or for any purpose, without the express written permission of Microsoft Corporation and BULL
S.A.S.
Microsoft and BULL S.A.S. may have patents, patent applications, trademarks, copyrights, or other
intellectual property rights covering subject matter in this document. Except as expressly provided in any
written license agreement from Microsoft or BULL S.A.S., as applicable, the furnishing of this document
does not give you any license to these patents, trademarks, copyrights, or other intellectual property.
© 2008, 2009 Microsoft Corporation and BULL S.A.S. All rights reserved.
Microsoft, Hyper-V, Windows, Windows Server, and the Windows logo are trademarks of the Microsoft
group of companies.
PBS GridWorks®, GridWorks™, PBS Professional®, PBS™ and Portable Batch System® are trademarks
of Altair Engineering, Inc.
The names of actual companies and products mentioned herein may be the trademarks of their
respective owners.
A Hybrid OS Cluster Solution: Dual-Boot and Virtualization with Windows HPC Server 2008 and Linux Bull Advanced Server for Xeon
Abstract
The choice of an operating system (OS) for a high performance computing (HPC) cluster is a critical
decision for IT departments. The goal of this paper is to show that simple techniques are available today
to optimize the return on investment by making that choice unnecessary, and keeping the HPC
infrastructure versatile and flexible. This paper introduces Hybrid Operating System Clusters (HOSC).
An HOSC is an HPC cluster that can run several OS’s simultaneously. This paper addresses the situation
where two OS’s are running simultaneously: Linux Bull Advanced Server for Xeon and Microsoft®
Windows® HPC Server 2008. However, most of the information presented in this paper applies, with
slight adaptations, to 3 or more simultaneous OS’s, possibly from other OS distributions. This document
gives general concepts as well as detailed setup information. First, the technologies necessary to design
an HOSC are defined (dual-boot, virtualization, PXE, resource manager and job scheduler). Second,
different approaches to HOSC architectures are analyzed and technical recommendations are given, with
a focus on computing performance and management flexibility. The recommendations are then
implemented to determine the best technical choices for designing an HOSC prototype. The installation
setup of the prototype and the configuration steps are explained. A meta-scheduler based on Altair PBS
Professional is implemented. Finally, basic HOSC administrator operations are listed and ideas for future
work are proposed.
ABSTRACT
1 INTRODUCTION
2 CONCEPTS AND PRODUCTS
3 APPROACHES AND RECOMMENDATIONS
5 SETUP OF THE HOSC PROTOTYPE
6.7 CHECK NODE STATUS WITH THE META-SCHEDULER
7 CONCLUSION AND PERSPECTIVES
APPENDIX A: ACRONYMS
APPENDIX B: BIBLIOGRAPHY AND RELATED LINKS
APPENDIX C: MASTER BOOT RECORD DETAILS
C.1 MBR STRUCTURE
C.2 SAVE AND RESTORE MBR
APPENDIX D: FILES USED IN EXAMPLES
D.1 WINDOWS HPC SERVER 2008 FILES
D.1.1 Files used for compute node deployment
D.1.2 Script for IPoIB setup
D.1.3 Scripts used for OS switch
D.2 XBAS FILES
D.2.1 Kickstart and PXE files
D.2.2 DHCP configuration
D.2.3 Scripts used for OS switch
D.2.4 Network interface bridge configuration
D.2.5 Network hosts
D.2.6 IB network interface configuration
D.2.7 ssh host configuration
D.3 META-SCHEDULER SETUP FILES
D.3.1 PBS Professional configuration files on XBAS
D.3.2 PBS Professional configuration files on HPCS
D.3.3 OS load balancing files
APPENDIX E: HARDWARE AND SOFTWARE USED FOR THE EXAMPLES
E.1 HARDWARE
E.2 SOFTWARE
APPENDIX F: ABOUT ALTAIR AND PBS GRIDWORKS
APPENDIX G: ABOUT MICROSOFT AND WINDOWS HPC SERVER 2008
G.1 ABOUT MICROSOFT
G.2 ABOUT WINDOWS HPC SERVER 2008
APPENDIX H: ABOUT BULL S.A.S.
1 Introduction
The choice of the right operating system (OS) for a high performance computing (HPC) cluster can be a
very difficult decision for IT departments, and it usually has a big impact on the Total Cost of Ownership
(TCO) of the cluster. Parameters like multiple user needs, application environment requirements and
security policies add to the complex human factors involved in training, maintenance and support
planning, all of which put the final return on investment (ROI) of the whole HPC infrastructure at risk. The
goal of this paper is to show that simple techniques are available today to make that choice unnecessary,
and to keep your HPC infrastructure versatile and flexible.
In this white paper we will study how to provide the best flexibility for running several OS’s on an HPC
cluster. There are two main types of approaches to providing this service depending on whether a single
operating system is selected each time the whole cluster is booted, or whether several operating systems
are run simultaneously on the cluster. The most common approach of the first type is called the dual-boot
cluster (described in [1] and [2]). For the second type of approach, we introduce the concept of a Hybrid
Operating System Cluster (HOSC): a cluster with some computing nodes running one OS type while the
remaining nodes run another OS type. Several approaches to both types are studied in this document in
order to determine their properties (requirements, limits, feasibility, and usefulness) with a clear focus on
computing performance and management flexibility.
The study is limited to 2 operating systems: Linux Bull Advanced Server for Xeon 5v1.1 and Microsoft
Windows HPC Server 2008 (referred to as XBAS and HPCS, respectively, in this paper). To optimize the
interoperability between the two OS worlds, we use the Subsystem for UNIX-based Applications (SUA) for
Windows. The description of the methodologies is kept as general as possible in order to apply to other
OS distributions, but examples are given exclusively in the XBAS/HPCS context. The concepts developed
in this document could apply to 3 or more simultaneous OS’s with slight adaptations; however, this is out
of the scope of this paper.
We introduce a meta-scheduler that provides a single submission point for both Linux and Windows. It
selects the cluster nodes with the OS type required by submitted jobs. The OS type of compute nodes
can be switched automatically and safely without administrator intervention. This optimizes computational
workloads by adapting the distribution of OS types among the compute nodes.
A technical proof of concept is given by designing, installing and running an HOSC prototype. This
prototype can provide computing power under both XBAS and HPCS simultaneously. It has two virtual
management nodes (aka head nodes) on a single server and the choice of the OS distribution among the
compute nodes can be done dynamically. We have chosen Altair PBS Professional software to
demonstrate a meta-scheduler implementation. This project is the result of the collaborative work of
Microsoft and Bull.
Chapter 2 defines the main technologies used in an HOSC: the Master Boot Record (MBR), the dual-boot
method, virtualization, the Pre-boot eXecution Environment (PXE), and resource manager and job
scheduler tools. If you are already familiar with these concepts, you may want to skip this chapter and go
directly to Chapter 3, which analyzes different approaches to HOSC architectures and gives technical
recommendations for their design. The recommendations are implemented in Chapter 4 in order to
determine the best technical choices for building an HOSC prototype. The installation setup of the
prototype and the configuration steps are explained in Chapter 5. Appendix D shows the files that were
used during this step. Finally, basic HOSC administrator operations are listed in Chapter 6 and ideas for
future work are proposed in Chapter 7, which concludes this paper.
This document is intended for computer scientists who are familiar with HPC cluster administration.
All acronyms used in this paper are listed in Appendix A. Complementary information can be found in the
documents and web pages listed in Appendix B.
2 Concepts and products
We assume that the readers may not be familiar with every concept discussed in the remaining chapters
in both Linux and Windows environments. Therefore, this chapter introduces the technologies (Master
Boot Record, Dual-boot, virtualization and Pre-boot eXecution Environment) and products (Linux Bull
Advanced Server, Windows HPC Server 2008 and PBS Professional) mentioned in this document.
If you are already familiar with these concepts or are more interested in general Hybrid OS Cluster
(HOSC) considerations, you may want to skip this chapter and go directly to Chapter 3.
2.1 Master Boot Record (MBR)
The MBR includes the partition table of the 4 primary partitions and a bootstrap code that can start the
OS or load and run the boot loader code (see the complete MBR structure in Table 3 of Appendix C.1).
A partition is encoded as a 16-byte structure with size, location and characteristic fields. The first 1-byte
field of the partition structure is called the boot flag.
The Windows MBR starts the OS installed on the active partition. The active partition is the first primary
partition that has its boot flag enabled. You can select an OS by activating the partition where it is
installed. The tools diskpart.exe (on Windows) and fdisk (on Linux) can be used to change partition
activation. Appendix D.1.3 and Appendix D.2.3 give examples of commands that enable/disable the boot flag.
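As an illustration of the partition-entry layout described above, the following sketch sets the boot flag byte directly in an MBR image with dd. It works on a plain image file so it is self-contained; on a real node this manipulation is normally left to diskpart.exe or fdisk, and the image name is an assumption:

```shell
# Sketch: set the boot flag of primary partition N in an MBR image.
# The 4 primary partition entries are 16 bytes each, starting at
# offset 446; the first byte of each entry is the boot flag
# (0x80 = active, 0x00 = inactive).
IMG=disk.img
N=1                                   # partition number (1-4)
OFFSET=$((446 + (N - 1) * 16))

# Create a dummy 512-byte "MBR" so the sketch is self-contained;
# on a real node IMG would be the boot device (e.g. /dev/sda).
dd if=/dev/zero of="$IMG" bs=512 count=1 status=none

# Enable the boot flag of partition N (octal 0200 = 0x80).
printf '\200' | dd of="$IMG" bs=1 seek="$OFFSET" count=1 conv=notrunc status=none
```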
The Linux MBR can run a boot loader (e.g., GRUB or LILO). You can then select an OS interactively from
its user interface at the console. If no choice is made at the console, the OS selection is taken from the
boot loader configuration file, which you can edit in advance of a reboot (e.g., grub.conf for the
GRUB boot loader). If necessary, the Linux boot loader configuration file (which resides in a Linux
partition) can be replaced from a Windows command line with the dd.exe tool.
Appendix C.2 explains how to save and restore the MBR of a device. It is very important to understand
how the MBR works in order to properly configure dual-boot systems.
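As a sketch of the save and restore operations detailed in Appendix C.2, run here against a plain image file so it is self-contained (on a real node DISK would be the boot device, e.g. /dev/sda, and root privileges would be required):

```shell
DISK=disk.img
dd if=/dev/zero of="$DISK" bs=1M count=1 status=none   # dummy "disk"

# Save the full 512-byte MBR: 446 bytes of bootstrap code, the
# 64-byte partition table (4 x 16-byte entries) and the 2-byte
# 0x55AA signature.
dd if="$DISK" of=mbr.backup bs=512 count=1 status=none

# Restore only the bootstrap code (first 446 bytes), leaving the
# partition table currently on the disk untouched.
dd if=mbr.backup of="$DISK" bs=446 count=1 conv=notrunc status=none
```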
2.2 Dual-boot
Dual-booting is an easy way to have several operating systems (OS) on a node. When an OS is run, it
has no interaction with the other OS installed so the native performance of the node is not affected by the
use of the dual-boot feature. The only limitation is that these OS’s cannot be run simultaneously.
When setting up a dual-boot node, the main points to consider are:
• The choice of the MBR (and the choice of the boot loader, if applicable)
• The disk partition restrictions. For example, Windows must have a system partition on at least
one primary partition of the first device
• The compatibility with Logical Volume Managers (LVM). For example, the RHEL5.1 LVM creates
by default a logical volume spanning the entire first device, which makes it impossible to install a
second OS on this device.
When booting a computer, the dual-boot feature gives the ability to choose which OS to start among the
multiple OS’s installed on that computer. At boot time, the way you can select the OS of a node depends
on the installed MBR. A dual-boot method that relies on the Linux MBR and GRUB is described in [1].
Another dual-boot method, which exploits the properties of active partitions, is described in [2] and [3].
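For the GRUB-based method, the boot loader configuration of a dual-boot node might look like the following sketch (partition layout and kernel version are assumptions for illustration):

```
# Hypothetical grub.conf for a node dual-booting XBAS and HPCS.
default=0        # boot the first entry (XBAS) unless edited before reboot
timeout=5
title XBAS (Linux)
    root (hd0,1)                     # second partition: Linux /boot
    kernel /vmlinuz-2.6.18-53.el5 ro root=LABEL=/
    initrd /initrd-2.6.18-53.el5.img
title Windows HPC Server 2008
    rootnoverify (hd0,0)             # first primary partition: Windows
    chainloader +1                   # hand over to the Windows boot sector
```

Editing the `default` line before a reboot is what allows the OS selection without any console interaction.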
2.3 Virtualization
The virtualization technique is used to hide the physical characteristics of computers and only present a
logical abstraction of these characteristics. Virtual Machines (VM) can be created by the virtualization
software: each VM has virtual resources (CPUs, memory, devices, network interfaces, etc.) whose
characteristics (quantity, size, etc.) are independent from those available on the physical server. The OS
installed in a VM is called a guest OS: the guest OS can only access the virtual resources available in its
VM. Several VMs can be created and run on one physical node. These VMs appear as physical
machines to the applications, the users and the other nodes (physical or virtual).
In the context of an HOSC, virtualization is interesting for two main reasons:
1. It makes it possible to install several management nodes (MN) on a single physical server. This
is an important point for installing several OS’s on a cluster without increasing its cost with an
additional physical MN server.
2. It provides a fast and rather easy way to switch from one OS to another: by starting a VM that
runs one OS while suspending another VM that runs another OS.
A hypervisor is a software layer that runs directly on the hardware, at a higher privilege level than the
OS’s. The virtualization software runs in a privileged partition (domain 0, or dom0), from where it controls
how the hypervisor allocates resources to the virtual machines. The other domains, where the VMs run,
are called unprivileged domains and are denoted domU. A hypervisor normally enforces scheduling
policies and memory boundaries. In some Linux implementations it also provides access to hardware
devices via its own drivers; on Windows, it does not.
Virtualization solutions fall into two main categories:
• Host-based (like VMware): the virtualization software is installed on a physical server running a
classical OS called the host OS.
• Hypervisor-based (like Windows Server® 2008 Hyper-V™ and Xen): in this case, the hypervisor
runs at a lower level than the OS. The “host OS” becomes just another VM that is automatically
started at boot time. Such a virtualization architecture is shown in Figure 1.
Figure 1 - Overview of hypervisor-based virtualization architecture
“Full virtualization” is an approach that requires no modification of the hosted operating system,
providing the illusion of a complete system of real hardware devices. Such Hardware Virtual Machines
(HVM) require hardware support, provided for example by Intel® Virtualization Technology (VT) and
AMD-V technology; recent Intel® Xeon® processors support full virtualization thanks to Intel® VT.
“Para-virtualization”, by contrast, is an approach that requires modifications to the operating system in
order for it to run in a VM. Windows is only supported on fully-virtualized VMs, not on para-virtualized VMs.
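Since Windows is only supported on fully-virtualized VMs, a Xen-hosted Windows guest must be declared as an HVM guest. As a purely illustrative sketch, a minimal Xen HVM guest configuration might look as follows (all names, paths and sizes are hypothetical):

```
# Hypothetical Xen HVM guest configuration (e.g. /etc/xen/win-hn.cfg).
# builder = "hvm" requests a fully-virtualized (HVM) guest, which
# needs Intel VT or AMD-V support on the processor.
kernel  = "/usr/lib/xen/boot/hvmloader"   # HVM firmware loader
builder = "hvm"
name    = "win-headnode"
memory  = 2048                             # MB of RAM for the guest
vcpus   = 2
disk    = ["phy:/dev/vg0/winhn,hda,w"]     # backing block device
vif     = ["type=ioemu, bridge=xenbr0"]    # emulated NIC on bridge xenbr0
boot    = "c"                              # boot from the virtual hard disk
```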
Noteworthy virtualization products include:
• Xen [6]: free software for Linux, included in the RHEL5 distribution, which allows a maximum of
8 virtual CPUs per virtual machine (VM). Oracle VM and Sun xVM VirtualBox are commercial
implementations.
• VMware [7]: commercial software for Linux and Windows which allows a maximum of 4 virtual
CPUs per VM.
• Hyper-V [8]: a solution provided by Microsoft which only works on Windows Server 2008 and
allows only 1 virtual CPU per VM for non-Windows VM.
• PowerVM [9] (formerly Advanced POWER Virtualization): an IBM solution for UNIX and Linux on
most processor architectures that does not support Windows as a guest OS.
• Virtuozzo [10]: a Parallels, Inc. solution designed to deliver near-native physical performance. It
only supports VMs that run the same OS as the host OS (i.e., Linux VMs on Linux hosts and
Windows VMs on Windows hosts).
• OpenVZ [11]: an operating-system-level virtualization technology licensed under GPL version 2. It
is the basis of Virtuozzo [10]. It requires both the host and guest OS to be Linux, possibly of
different distributions. It has a low performance penalty compared to a standalone server.
2.4 PXE
The Pre-boot eXecution Environment (PXE) is an environment to boot computers using a network
interface, independently of available data storage devices or installed OS. The end goal is to allow a client
to boot from the network and receive a network boot program (NBP) from a network boot server. The
PXE boot process involves three steps:
1. Obtain an IP address to gain network connectivity: when a PXE-enabled boot is initiated, the
PXE-based ROM requests an IP address from a Dynamic Host Configuration Protocol (DHCP)
server using the normal DHCP discovery process (see the detailed process in Figure 2). It will
receive from the DHCP server an IP address lease, information about the correct boot server and
information about the correct boot file.
2. Discover a network boot server: with the information from the DHCP server the client
establishes a connection to the PXE servers (TFTP, WDS, NFS, CIFS, etc.).
3. Download the NBP file from the network boot server and execute it: the client uses Trivial
File Transfer Protocol (TFTP) to download the NBP. Examples of NBP are: pxelinux.0 for
Linux and WdsNbp.com for Windows Server.
When booting a compute node with PXE, the goal can be either to install or run it with an image deployed
through the network, or just to run it with an OS installed on its local disk. In the latter case, the PXE
server simply answers the compute node’s requests by indicating that it must boot from the next boot
device listed in its BIOS.
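The first step above can be illustrated with a dhcpd.conf fragment of the kind used on the management node (the addresses match the example cluster used later in this paper; the exact file layout is an assumption):

```
# Sketch of an ISC dhcpd.conf host entry that answers a PXE request:
# the client gets its IP address, the boot (TFTP) server and the NBP name.
host xbas1 {
    hardware ethernet 00:30:19:D6:77:8A;   # MAC address of the compute node
    fixed-address 192.168.0.2;             # IP address leased to the node
    next-server 192.168.0.1;               # network boot (TFTP) server
    filename "pxelinux.0";                 # NBP for a Linux PXE client
    option host-name "xbas1";
}
```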
2.5 Job schedulers and resource managers in a HPC cluster
In an HPC cluster, a resource manager (aka Distributed Resource Management System (DRMS) or
Distributed Resource Manager (DRM)) gathers information about all cluster resources that can be used
by application jobs. Its main goal is to give accurate resource information about the cluster usage to a job
scheduler.
A job scheduler (aka batch scheduler or batch system) is in charge of unattended background executions.
It provides a user interface for submitting, monitoring and terminating jobs. It is usually responsible for the
optimization of job placement on the cluster nodes. For that purpose it deals with resource information,
administrator rules and user rules: job priority, job dependencies, resource and time limits, reservation,
specific resource requirements, parallel job management, process binding, etc. With time, job schedulers
and resource managers have evolved in such a way that they are now usually integrated under a single
product name. Noteworthy products include:
• Torque [13]: an open source job scheduler based on the original PBS project. It can be used as a
resource manager by other schedulers (e.g., Moab workload manager).
• SLURM (Simple Linux Utility for Resource Management) [14]: freeware and open source
• LSF (Load Sharing Facility) [15]: supported by Platform for Linux/Unix and Windows
• OAR [17]: freeware and open source for Linux, AIX and SunOS/Solaris
• Microsoft Windows HPC Server 2008 job scheduler: included in the Microsoft HPC pack [5]
2.6 Meta-Scheduler
According to Wikipedia [18], “Meta-scheduling or Super scheduling is a computer software technique of
optimizing computational workloads by combining an organization's multiple Distributed Resource
Managers into a single aggregated view, allowing batch jobs to be directed to the best location for
execution”. In this paper, we consider that a meta-scheduler is able to submit jobs on cluster nodes with
heterogeneous OS types and that it can automatically switch the OS type of these nodes when necessary
(to optimize computational workloads). Here is a partial list of meta-schedulers currently available:
• Moab Grid Suite and Maui Cluster scheduler [19]: supported by Cluster Resources, Inc.
• CSF (Community Scheduler Framework) [21]: an open source framework (an add-on to the
Globus Toolkit v.3) for implementing a grid meta-scheduler, developed by Platform Computing
Recent job schedulers can sometimes be adapted and configured to behave as “simple” meta-schedulers.
2.7 Bull Advanced Server for Xeon
2.7.1 Description
Bull Advanced Server for Xeon (XBAS) is a robust and efficient Linux solution that delivers total cluster
management. It addresses each step of the cluster lifecycle with a centralized administration interface:
installation, fast and reliable software deployments, topology-aware monitoring and fault handling (to
dramatically lower time-to-repair), cluster optimization and expansion. Integrated, tested and supported
by Bull [4], XBAS federates the very best of Open Source components, complemented by leading
software packages from well-known Independent Software Vendors, and gives them a consistent view of
the whole HPC cluster through a common cluster database: the clusterdb. XBAS is fully compatible with
standard Red Hat Enterprise Linux (RHEL). The latest Bull Advanced Server for Xeon 5 release (v3.1) is
based on RHEL5.3 (see note 1).
BIOS settings must be set so that XBAS compute nodes boot on the network with PXE by default. The
PXE files stored on the management node indicate whether a given compute node should be installed
(i.e., its DEFAULT label is ks) or whether it is ready to run (i.e., its DEFAULT label is local_primary).
In the first case, a new OS image should be deployed (see note 2). During the PXE boot process, the
operations to be executed on the compute node are described in the kickstart file. Tools based on PXE
are provided by XBAS to simplify the installation of compute nodes. The preparenfs tool writes the
configuration files with the information given by the administrator and with that found in the clusterdb. The
generated configuration files are: the PXE files (e.g., /tftpboot/C0A80002), the DHCP configuration file
(/etc/dhcpd.conf), the kickstart file (e.g., /release/ks/kickstart) and the NFS export file
(/etc/exportfs). No user interface access (remote or local) to the compute node is required during its
installation with the preparenfs tool. Figure 3 shows the sequence of interactions between a new XBAS
compute node being installed and the servers running on the management node (DHCP, TFTP and NFS).
On small clusters, the preparenfs tool can be used to install every CN. On large clusters, the ksis tool can
be used to optimize the total deployment time of the cluster by cloning the first CN installed with preparenfs.
In the second case, the CN is already installed and just needs to boot from its local disk. Figure 4 shows
the normal boot scheme of an XBAS compute node.
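For illustration, the two DEFAULT labels mentioned above could correspond to PXE files such as the following sketch (pxelinux syntax; kernel names and paths are assumptions):

```
# Hypothetical PXE file /tftpboot/C0A80002
# (C0A80002 is 192.168.0.2 written in hexadecimal).

# Installation case: the "ks" label boots an installer kernel that
# applies the kickstart file over NFS.
DEFAULT ks
LABEL ks
    KERNEL vmlinuz
    APPEND initrd=initrd.img ks=nfs:192.168.0.1:/release/ks/kickstart

# Normal case: the "local_primary" label chains to the local disk.
# DEFAULT local_primary
# LABEL local_primary
#     LOCALBOOT 0
```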
Note 1: The Bull Advanced Server for Xeon 5 release used to illustrate the examples in this paper is v1.1,
based on RHEL5.1, because it was the latest release available when we built the first prototypes in May 2008.
Note 2: In this document, we define the “deployment of an OS” as the installation of a given OS on several
nodes from a management node. A more restrictive definition that only applies to the duplication of OS
images on the nodes is often used in the technical literature.
Figure 3 - Installation of an XBAS compute node via PXE: the node (MAC address 00:30:19:D6:77:8A),
set to boot first on the network, queries the DHCP server, which returns its IP address (fixed-address
192.168.0.2), the boot server (next-server 192.168.0.1) and the NBP name (filename "pxelinux.0"); the
node then downloads pxelinux.0 through TFTP and installs RHEL5.1 through NFS (exports
/release/RHEL5.1 and /release/XHPC declared in /etc/exportfs), following the kickstart file.
2.8 Windows HPC Server 2008
2.8.1 Description
Microsoft Windows HPC Server 2008 (HPCS), the successor to Windows Computer Cluster Server
(WCCS) 2003, is based on the Windows Server 2008 operating system and is designed to increase
productivity, scalability and manageability. This new name reflects Microsoft HPC’s readiness to tackle
the most challenging HPC workloads [5]. HPCS includes key features, such as new high-speed
networking, highly efficient and scalable cluster management tools, advanced failover capabilities, a
service-oriented architecture (SOA) job scheduler, and support for partners’ clustered file systems. HPCS
gives access to an HPC platform that is easy to deploy, operate, and integrate with existing enterprise
infrastructures.
During the first installation step, Windows Preinstallation Environment (WinPE) is the boot operating
system. It is a lightweight version of Windows Server 2008 that is used for the deployment of servers. It is
intended as a 32-bit or 64-bit replacement for MS-DOS during the installation phase of Windows, and can
be booted via PXE, CD-ROM, USB flash drive or hard disk.
BIOS settings should be set so that HPCS compute nodes boot on network with PXE (we assume that a
private network exists and that CNs send PXE requests there first). From the head node point of view, a
compute node must be deployed if it doesn’t have any entry into the Active Directory (AD), or if the cluster
administrator has explicitly specified that it must be re-imaged. When a compute node with no OS boots,
it first sends a DHCP request in order to get an IP address, a valid network boot server and the name of a
network boot program (NBP). When the DHCP server has answered, the CN downloads the NBP called
WdsNbp.com from the WDS server. The purpose is to detect the architecture and to wait for other
downloads from the WDS server.
Then, on the HPCS administration console of the head node, the new compute node appears as “pending
approval”. The installation starts once the administrator assigns a deployment template to it. A WinPE
image is sent to and booted on the compute node; files are transferred in order to prepare the Windows
Server 2008 installation, and an unattended installation of Windows Server 2008 is performed. Finally, the
compute node is joined to the domain and the cluster. Figure 5 shows the details of the PXE boot
operations executed during the installation procedure.
If the CN has already been installed, the AD already contains the corresponding computer object, so the
WDS server sends it an NBP called abortpxe.com, which boots the server from the next boot item
in the BIOS without waiting for a timeout. Figure 6 shows the PXE boot operations executed in this case.
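The head node's decision logic can be summarized with the following sketch (a simplified illustration, not Microsoft's implementation; the function name is our own, while the NBP names follow the description above):

```python
def choose_nbp(node_in_ad: bool, reimage_requested: bool) -> str:
    """Pick the network boot program (NBP) the WDS server returns to a
    PXE-booting compute node, per the deployment rules described above."""
    if not node_in_ad or reimage_requested:
        # Unknown or explicitly re-imaged node: start the WinPE-based deployment.
        return "WdsNbp.com"
    # Already-installed node: skip PXE and boot from the next BIOS boot item.
    return "abortpxe.com"

print(choose_nbp(node_in_ad=False, reimage_requested=False))  # → WdsNbp.com
print(choose_nbp(node_in_ad=True, reimage_requested=False))   # → abortpxe.com
```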
A Hybrid OS Cluster Solution: Dual-Boot and Virtualization with Windows HPC Server 2008 and Linux Bull Advanced Server for Xeon
16
Figure 5 - PXE boot operations during the installation of a compute node: power on with network first in
the BIOS boot order, DHCP request, TFTP download of WdsNbp.com from the WDS server, approval of
the CN and template assignment (creation of a WDS account in the AD), TFTP download of pxeboot.com
(or .n12), then WinPE boot with BOOT.WIM, disk partitioning with diskpart.txt, and unattended installation
driven by unattend.xml

Figure 6 - PXE boot operations for an already-installed compute node: the WDS server finds the existing
AD account, sends abortpxe.com via TFTP, and the node boots Windows Server 2008 from its local disk
2.9 PBS Professional
This section presents PBS Professional, the job scheduler that we used as the meta-scheduler for building
the HOSC prototype described in Chapter 5. PBS Professional is part of the PBS GridWorks software
suite. It is the professional version of the Portable Batch System (PBS), a flexible workload management
system, originally developed to manage aerospace computing resources at NASA. PBS Professional has
since become the leader in supercomputer workload management and the de facto standard on Linux
clusters. A few of the more important features of PBS Professional 10 are listed below:
• Enterprise-wide Resource Sharing provides transparent job scheduling on any PBS system by
any authorized user. Jobs can be submitted from any client system both local and remote.
• Multiple User Interfaces provide a traditional command line and a graphical user interface for
submitting batch and interactive jobs; querying job, queue, and system status; and monitoring jobs.
• Job Accounting offers detailed logs of system activities for charge-back or usage analysis per
user, per group, per project, and per compute host.
• Parallel Job Support works with parallel programming libraries such as MPI. Applications can be
scheduled to run within a single multi-processor computer or across multiple systems.
• Job-Interdependency enables the user to define a wide range of interdependencies between jobs.
• Automatic Load-Leveling provides numerous ways to distribute the workload across a cluster of
machines, based on hardware configuration, resource availability, keyboard activity, and local
scheduling policy.
• Common User Environment offers users a common view of the job submission, job querying,
system status, and job tracking over all systems.
• Cross-System Scheduling ensures that jobs do not have to be targeted to a specific computer
system. Users may submit their job, and have it run on the first available system that meets their
resource requirements.
• Job Priority allows users to specify the priority of their jobs.
• Username Mapping provides support for mapping user account names on one system to the
appropriate name on remote server systems. This allows PBS Professional to fully function in
environments where users do not have a consistent username across all hosts.
• Broad Platform Availability is achieved through support of Windows and every major version of
UNIX and Linux, from workstations and servers to supercomputers.
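As an illustration of batch submission, a minimal PBS Professional job script might look like the following sketch (the job name, resource selection, and application name are assumptions made for this example; `qsub job.sh` would submit it):

```
#!/bin/sh
#PBS -N hello_mpi
#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -j oe
# Run from the directory the job was submitted from
cd "$PBS_O_WORKDIR"
mpirun ./hello_mpi
```

The `#PBS` lines are scheduler directives read by qsub; the rest is an ordinary shell script executed on the first allocated node.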
3 Approaches and recommendations
In this chapter, we explain the different approaches to offering several OS's on a cluster. The
approaches discussed in Sections 3.1 and 3.2 are summarized in Table 1.
• Re-installing the selected OS on the cluster when necessary. Since this process can be long, it is
not realistic for frequent changes. This is noted as approach 1 in Table 1.
• Deploying a new OS image on the whole cluster depending on the OS choice. The deployment
can be done on local disks or in memory with diskless compute nodes. It is difficult to deal with
the OS change on the management node in such an environment: either the management node
is dual-booted (this is approach 7 in Table 1), or an additional server is required to distribute the
OS image of the MN. This can be interesting in some specific cases: on HPC clusters with
diskless CN when the OS switches are rare, for example. Otherwise, this approach is not very
convenient. The deployment technique can be used in a more appropriate manner for clusters
with 2 simultaneous OS’s (i.e., 2 MNs); this will be shown in the next Section with approaches 3
and 11.
• Dual-booting the selected OS from dual-boot disks. Dual-booting the whole cluster
(management and compute nodes) is a good and very practical solution that was introduced
in [1] and [2]. This approach, noted 6 in Table 1, is the easiest way to install and manage a
cluster with several OS's, but it only applies to small clusters with few users when no flexibility
is required. If only the MNs are on a dual-boot server while the CNs are installed with a single OS
(half of the CNs having one OS while the others have the other), the solution makes no sense
because only half of the cluster can be used at a time (this is approach 5). If the MNs are on a
dual-boot server while the CNs are installed in VMs (2 VMs being installed on each compute
server), the solution makes no real sense either because the added value of using VMs (quick OS
switching, for instance) is cancelled by the need to reboot the MN server (this is approach 8).
Whatever the OS switch method, a complete cluster reboot is needed at each change. This implies
cluster unavailability during reboots, a need for OS usage schedules, and potential conflicts between user
needs; hence a real lack of flexibility.
In Table 1, approaches 1, 5, 6, 7, and 8 define clusters that can run 2 OS's, but not simultaneously. Even
if such clusters do not fit the Hybrid Operating System Cluster (HOSC) definition given in Chapter 1,
they can be considered a simplified form of the concept.
Table 1 - The 12 approaches for providing 2 OS's on a cluster. Rows give the configuration of the
2 management nodes (MN); columns give the configuration of the 2 compute nodes (CN): 1 OS per
server (2 servers), dual-boot (1 server), OS image deployment (1 server), or virtualization (2 CNs
simultaneously on 1 server).

MNs with 1 OS per server (2 servers):
• CNs with 1 OS per server (approach 1): starting point: 2 half-size independent clusters with
2 OS's, or 1 full-size single-OS cluster re-installed with the other OS when needed
• Dual-boot CNs (approach 2): good HOSC solution for large clusters with OS flexibility
• CN OS image deployment (approach 3): an HOSC solution that can be interesting for large
clusters with diskless CNs or when the OS switches are rare
• Virtualized CNs (approach 4): an HOSC solution with potential performance issues on
compute nodes and extra cost for a second MN server

MNs on a dual-boot server (1 server):
• CNs with 1 OS per server (approach 5): this "single OS at a time" solution makes absolutely
no sense since only half of the CNs can be used at a time
• Dual-boot CNs (approach 6): good classical dual-boot cluster solution
• CN OS image deployment (approach 7): a "single OS at a time" solution that can only be
interesting for diskless CNs
• Virtualized CNs (approach 8): having virtual CNs makes no real sense since the MN must be
rebooted to switch the OS

MNs virtualized on 1 server (2 MNs simultaneously, without additional hardware cost):
• CNs with 1 OS per server (approach 9): 2 half-size independent clusters with a single MN
server: a bad HOSC solution with no flexibility and very little cost saving
• Dual-boot CNs (approach 10): good HOSC solution for medium-sized clusters with an OS
flexibility requirement
• CN OS image deployment (approach 11): an HOSC solution that can be interesting for small
clusters with diskless CNs
• Virtualized CNs (approach 12): every node is virtual: the most flexible HOSC solution, but
with too many performance uncertainties at the moment
3.2 Two simultaneous operating systems
The idea is to provide, with a single cluster, the capability to have several OS’s running simultaneously on
an HPC cluster. This is what we defined as a Hybrid Operating System Cluster (HOSC) in Chapter 1.
Each compute node (CN) does not need to run every OS simultaneously. A single OS can run on a given
CN while another OS runs on other CNs at the same time. The CNs can be dual-boot servers, diskless
servers, or virtual machines (VM). The cluster is managed from separate management nodes (MN) with
different OS’s. MN can be installed on several physical servers or on several VMs running on a single
server. In Table 1, approaches 2, 3, 4, 9, 10, 11 and 12 are HOSC.
HPC users may consider HPC clusters with two simultaneous OS’s rather than a single OS at a time for
four main reasons:
1. To improve resource utilization and adapt the workload dynamically by easily changing the ratio
of OS’s (e.g., Windows vs. Linux compute nodes) in a cluster for different kinds of usage.
2. To be able to migrate smoothly from one OS to the other, giving time to port applications and train
users.
3. Simply to be able to try a new OS without stopping the already installed one (i.e., install an HPCS
cluster at low cost on an existing Bull Linux cluster, or install a Bull Linux cluster at low cost on an
existing HPCS cluster).
4. To integrate specific OS environments (e.g., with legacy OS’s and applications) in a global IT
infrastructure.
The simplest approach for running 2 OS's on a cluster is to install each OS on half (or at least a part) of
the cluster when it is built. This approach is equivalent to building 2 single-OS clusters, so it cannot be
classified as a cluster with 2 simultaneous OS's. Moreover, this solution is expensive with its 2 physical
MN servers, and it is absolutely not flexible since the OS distribution (i.e., the OS allocation to nodes) is
fixed in advance. This approach is similar to approach 1, already discussed in the previous section.
An alternative to this first approach is to use a single physical server with 2 virtual machines for installing
the 2 MNs. In this case there is no additional hardware cost but there is still no flexibility for the choice of
the OS distribution on the CNs since this distribution is done when the cluster is built. This approach is
noted 9.
On clusters with dual-boot CNs the OS distribution can be dynamically adapted to the user and
application needs. The OS of a CN can be changed just by rebooting the CN aided by a few simple dual-
boot operations (this will be demonstrated in Sections 6.3 and 6.4). With such dual-boot CNs, the 2 MNs
can be on a single server with 2 VMs: this approach, noted 10, is very flexible and requires no additional
hardware cost. It is a good HOSC solution, especially for medium-sized clusters.
With dual-boot CNs, the 2 MNs can also be installed on 2 physical servers instead of 2 VMs: this
approach, noted 2, can only be justified on large clusters because of the extra cost due to a second
physical MN.
A new OS image can be (re-)deployed on a CN on request. This technique allows changing the OS
distribution on CNs on a cluster quite easily. However, this is mainly interesting for clusters with diskless
CNs because re-deploying an OS image for each OS switch is slower and consumes more network
bandwidth than the other techniques discussed in this paper (dual-boot or virtualization). This can also be
interesting if the OS type of CNs is not switched too frequently. The MNs can then be installed in 2
different ways: either the MNs are installed on 2 physical servers (this is approach 3 that is interesting for
large clusters with diskless CNs or when the OS type of CNs is rarely switched) or they are installed on
2 VMs (this is approach 11 that is interesting for small and medium size diskless clusters).
The last technique for installing 2 CNs on a single server is to use virtual machines (VM). In this case,
every VM can be up and running simultaneously or only a single VM may run on each compute server
while the others are suspended. The switch from an OS to another can then be done very quickly. Using
several virtual CNs of the same server simultaneously is not recommended since the total performance of
the VMs is bounded by the native performance of the physical server and so no benefit can be expected
from such a configuration. Installing CNs on VMs makes it easier and quicker to switch from one OS to
another compared to a dual-boot installation but performance of the CNs may be decreased by the
computing overhead due to the virtualization software layer. Section 3.5 briefly presents articles that
analyze the performance impact of virtualization for HPC. Once again, the 2 MNs can be installed on 2
physical servers (this is approach 4 for large clusters), or they can be installed on 2 VMs (this is
approach 12 for small and medium-sized clusters). This latter approach is 100% virtual with only virtual
nodes. This is the most flexible solution, and very promising for the future; however it is too early to use it
now because of performance uncertainties.
For the approaches with 2 virtual nodes (2 CNs or 2 MNs) on a server, the host OS can be Linux or
Windows, and any virtualization software could be used. The 6 approaches using VMs thus have dozens
of possible virtualization implementations.
The key points to check for choosing the right virtualization environment are listed here by order of
importance:
2. Virtual resource limitations (maximum number of virtual CPUs, maximum number of network
interfaces, virtual/physical CPU binding features, etc.)
3. Impact on performance (CPU cycles, memory access latency and bandwidth, I/Os, MPI
optimizations)
Also, for the approaches with 2 virtual nodes (2 CNs or 2 MNs) on a server, the 2 nodes can be
configured on 2 VMs or one can be a VM while the other is just installed on the server host OS. When
upgrading an existing HPC cluster from a classical single OS configuration to an HOSC configuration, it
might look interesting at first glance to configure an MN (or a CN) on the host OS. For example, one virtual
machine could be created on an existing management node and the second management node could be
installed on this VM. Even if this configuration looks quick and easy to set up, it should never be
used. Indeed, running any application or using resources of the host OS is not a recommended
virtualization practice. It creates an asymmetrical situation between applications running on the host
OS and those running on the VM, which may lead to load-balancing issues and resource access failures.
On an HOSC with dual-boot CNs, re-deployed CNs or virtual CNs, the OS distribution can be changed
dynamically without disturbing the other nodes. This could even be done automatically by a resource
manager in a unified batch environment 3 .
The dual-boot technique limits the number of OS's installed on a server because only 4 primary partitions
can be declared in the MBR. So, on an HOSC, if more OS's are necessary and no primary partition is
available anymore, the best solution is to install virtual CNs and to run them one at a time on each CN
while the others are suspended (depending on the selected OS for that CN). The MNs should be installed
on VMs as much as possible (as in approach 12), but several physical servers can be necessary (as in
approach 4). This can happen in the case of large clusters, for which the cost of an additional server is
negligible. It can also happen in order to keep a good level of performance when many OS's are
installed on the HOSC and thus many MNs are needed.
3 The batch solution is not investigated in this study but could be considered in the future.
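The 4-primary-partition limit comes from the fixed layout of the MBR partition table and can be checked programmatically. The following sketch (illustrative only, not part of the HOSC tooling) counts the used entries in a raw MBR:

```python
def count_primary_partitions(mbr: bytes) -> int:
    """Count the used entries in the 4-slot MBR partition table.
    The table starts at byte offset 446; each entry is 16 bytes; an
    entry whose partition-type byte (offset 4 within the entry) is
    zero is unused."""
    assert len(mbr) >= 512 and mbr[510:512] == b"\x55\xaa", "not a valid MBR"
    used = 0
    for i in range(4):  # the MBR has room for exactly 4 primary partitions
        entry = mbr[446 + 16 * i : 446 + 16 * (i + 1)]
        if entry[4] != 0:  # partition-type byte (e.g., 0x83 Linux, 0x07 NTFS)
            used += 1
    return used

# Toy MBR with 2 primary partitions declared (types 0x83 and 0x07):
mbr = bytearray(512)
mbr[510:512] = b"\x55\xaa"   # boot signature
mbr[446 + 4] = 0x83          # partition 1: Linux
mbr[446 + 16 + 4] = 0x07     # partition 2: NTFS
print(count_primary_partitions(bytes(mbr)))  # → 2
```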
3.3.3 I/O nodes
I/O nodes are in charge of Input/Output requests for the file systems.
For I/O-intensive applications, an I/O node is necessary to reduce the MN load. This is especially true
when the MNs are installed on virtual machines (VM): when a virtual MN handles heavy I/O requests, it
can dramatically impact the I/O performance of the second virtual MN.
If an I/O node is aimed at serving nodes with different OS's, then it must have at least one network
interface for each OS subnet (i.e., a subnet that is declared for every node that runs with the same OS).
Sections 4.4 and 4.5 show an example of OS subnets.
An I/O node could be installed with Linux or Windows for configuring an NFS server; NFS clients and
servers are supported on both OS's. But the Lustre file system (delivered by Bull with XBAS) is not
available for Windows clusters, so Lustre can only be served from Linux I/O nodes (for Linux
CN usage only 4). Other commercial cluster/parallel file systems are available for both Linux and
Windows (e.g., CXFS).
The I/O node can serve one file system shared by both OS nodes or two independent file systems (one
for each OS subnet). In the case of 2 independent file systems, 1 or 2 I/O nodes can be used.
3.3.4 Login nodes
Login nodes could run a Windows or Linux OS and they can be installed on dual-boot servers, virtual
machines or independent servers. A login node is usually only connected to other nodes running the
same OS as its own.
For the HPCS cluster, the use of a login node is not mandatory, as a job can be submitted from any
Windows client with the Microsoft HPC Pack installed (with the scheduler graphical interface or the
command line) by using an account in the cluster domain. A login node can be used to provide a gateway
into the cluster domain.
4 Lustre and GPFS™ clients for Windows are announced to be available soon.
3.4 Management services
From the infrastructure configuration point of view, we should study the potential interactions between
services that can be delivered from each MN (e.g., DHCP, TFTP, NTP, etc.). The goal is to avoid any
conflict between MN services while cluster operations or computations are done simultaneously on both
OS’s. This is especially complex during the compute node boot phase since the PXE procedure requires
DHCP and TFTP access from its very early start time. A practical case with XBAS and HPCS is shown in
Section 4.4.
• an NTP server (for the virtualization software and for MPI application synchronization)
One of these articles compares virtualization technologies for HPC (see [25]). It systematically
evaluates VMware Server, Xen, and OpenVZ for computationally intensive HPC applications using
standard scientific benchmarks. It examines the suitability of full virtualization, para-virtualization,
and operating-system-level virtualization in terms of network utilization, SMP performance, file system
performance, and MPI scalability. The analysis shows that none matches the performance of the base
system perfectly: OpenVZ demonstrates low overhead and high performance; Xen demonstrates
excellent network bandwidth but its exceptionally high latency hinders its scalability; and VMware
Server, while demonstrating reasonable CPU-bound performance, is similarly unable to cope with the
NPB MPI-based benchmark.
Another article evaluates the performance impact of Xen on MPI and process execution for HPC systems
(see [26]). It investigates subsystem and overall performance using a wide range of benchmarks and
applications. It compares the performance of a para-virtualized kernel against three Linux operating
systems and concludes that, in general, the Xen para-virtualization system poses no statistically
significant overhead compared with the other OS configurations.
3.6 Meta-scheduler for HOSC
3.6.1 Goals
The goal of a meta-scheduler used for an HOSC can be:
• Purely performance oriented: the most efficient OS is automatically chosen for a given run
(based on backlog, statistics, knowledge database, input data size, application binary, etc.)
• OS compatibility driven: if an application is only available for a given OS then this OS must be
used!
• High availability oriented: a few nodes with each OS are kept available all the time in case of
requests that must be treated extremely quickly or in case of failure of running nodes.
• Energy saving driven: the optimal number of nodes with each OS is booted while the others
are shut down (depending on the number of jobs in the queue, the profile of active users, job
history, backlog, timetable, external temperature, etc.)
• Re-deploy the right OS and boot compute nodes (on diskless servers for example)
• Unplanned (dynamic): the meta-scheduler estimates dynamically the optimal size of node
partitions with each OS type (depending on job priority, queue, backlog, etc.), then it grows and
shrinks these partitions accordingly by switching OS type on compute nodes. This is usually
called “just in time provisioning”.
• Planned (dynamic): the administrators plan the OS distribution based on times, dates, team
budgets, project schedules, vacation schedules, etc. The sizes of the node partitions with each OS
type are fixed for given periods of time. This is usually called "calendar provisioning".
• Static: the sizes of the node partitions with each OS type are fixed once and for all, and the
meta-scheduler cannot switch OS types. This is the simplest and least efficient case.
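"Just in time provisioning" can be pictured with a toy heuristic like the one below (an illustrative sketch only, not a real meta-scheduler policy; the OS-type names and the proportional rule are our own assumptions): size each OS partition in proportion to its demand.

```python
def rebalance(queued: dict, running: dict, total_nodes: int) -> dict:
    """Toy 'just in time provisioning': split the compute nodes between
    the two OS types in proportion to their demand (queued + running jobs).
    Real policies would also weigh priorities, backlog, reboot cost, etc."""
    demand = {os_type: queued.get(os_type, 0) + running.get(os_type, 0)
              for os_type in ("xbas", "hpcs")}
    total_demand = sum(demand.values())
    if total_demand == 0:
        # No demand: keep an even split so both OS's stay available.
        return {"xbas": total_nodes // 2, "hpcs": total_nodes - total_nodes // 2}
    xbas = round(total_nodes * demand["xbas"] / total_demand)
    return {"xbas": xbas, "hpcs": total_nodes - xbas}

# 6 Linux jobs and 2 Windows jobs queued on a 4-CN cluster:
print(rebalance({"xbas": 6, "hpcs": 2}, {}, 4))  # → {'xbas': 3, 'hpcs': 1}
```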
4 Technical choices for designing an HOSC prototype
We want to build a flexible medium-sized HOSC with XBAS and HPCS. We only have a small
5-server cluster to achieve this goal, but it will be sufficient to simulate the usage of a medium-sized
cluster. We start from this cluster, which has InfiniBand and Gigabit Ethernet networks. The complete
description of the
hardware is given in Appendix E. We have discussed the possible approaches in the previous chapter.
Let us now see what choice should be made in the particular case of this 5-server cluster. In the
remainder of the document, this cluster is named the HOSC prototype.
When the OS type of CNs is switched manually, we decided to allow the OS type switch commands to be
sent only from the MN that runs the same OS as the target CN. In other words, the HPCS MN can "give up"
one of its CNs to the XBAS cluster and the XBAS MN can "give up" one of its CNs to the HPCS cluster, but
no MN can "take" a CN from the cluster with a different OS. This rule was chosen to minimize the risk of
switching the OS of a CN while it is used for computation with its current OS configuration. When the OS
type of CNs is switched automatically by a meta-scheduler, OS type switch commands are sent from the
meta-scheduler server. To help switching OS on CNs from the MNs, simple scripts were written. They are
listed in Appendices D.1.3 and D.2.3, and an example of their use is shown in Sections 6.3 and 6.4.
Depending on the OS type that is booted, the node has a different hostname and IP address. This
information is sent by a DHCP server whose configuration is updated at each OS switch request, as
explained in the next section.
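The per-node DHCP update can be pictured with the following sketch (hostnames, IP addresses, server addresses, and file names are illustrative assumptions; the actual switch_dhcp_host script is listed in Appendix D.2.3):

```python
# Hypothetical per-OS PXE parameters for one compute server (illustrative).
PXE = {
    "xbas": {"host": "xbas1", "ip": "192.168.0.1",
             "next_server": "192.168.0.100", "filename": "pxelinux.0"},
    "hpcs": {"host": "hpcs1", "ip": "192.168.1.1",
             "next_server": "192.168.0.200",
             "filename": "boot\\x64\\WdsNbp.com"},
}

def dhcp_host_section(mac: str, target_os: str) -> str:
    """Render the DHCP host section for the requested OS type."""
    p = PXE[target_os]
    return (f'host {p["host"]} {{\n'
            f'  hardware ethernet {mac};\n'
            f'  fixed-address {p["ip"]};\n'
            f'  next-server {p["next_server"]};\n'
            f'  filename "{p["filename"]}";\n'
            f'}}\n')

print(dhcp_host_section("00:30:19:D6:77:8A", "xbas"))
```

Switching a node's OS then amounts to replacing its host section with the other variant and restarting the DHCP service.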
Figure 7 - Management node virtualization architecture
• The DHCP service is the critical part, as it is the single point of entry when a compute node boots. In
our prototype it runs on the XBAS management node. The DHCP configuration file contains a
section for each node with its characteristics (hostname, MAC and IP addresses) and the PXE
information. Depending on the administrator's needs, this section can be changed for deploying
and booting XBAS or HPCS on a compute node (see an example of dhcp.conf file changes in
Appendix D.2.2).
• WDS and/or TFTP server: each management node has its own server because the installation
procedures are different. A booting compute node is directed to the correct server by the DHCP
server.
• Directory Service is provided by Active Directory (AD) for HPCS and by LDAP for XBAS. Our
prototype will not offer a unified solution, but since synchronization mechanisms between AD and
LDAP exist, a unified solution could be investigated.
• DNS: this service can be provided by the XBAS management node or the HPCS head node. The
DNS should be set as dynamic in order to simplify integration with the AD. In our prototype, we
set up a DNS server on the HPCS head node for the Windows nodes, and we use /etc/hosts
files for name resolution on XBAS nodes.
Recommendations given in Section 3.4 can be applied to our prototype by configuring the services as
shown in Figure 8 and in Table 2.
The netmask is set to 255.255.0.0 because it must provide connectivity between Xen domain 0 and each
DomU virtual machine.
Figures 9 and 10 describe respectively XBAS and HPCS compute node deployment steps, while
Figures 11 and 12 describe respectively XBAS and HPCS compute node normal boot steps on our HOSC
prototype. They show how the PXE operations detailed in Figures 3, 4, 5 and 6 of Chapter 2 are
consistently adapted in our heterogeneous OS environment with a unique DHCP server on the XBAS MN
and a Windows MBR on the CNs.
Figure 9 - Deployment of an XBAS compute node on our HOSC prototype
Figure 11 - Boot of an XBAS compute node on our HOSC prototype
4.5 HOSC prototype architecture
The application network (i.e., InfiniBand network) should not be on the same subnet as the private
network (i.e., gigabit network): we chose 172.16.0.[1-5] and 172.16.1.[1-5] IP address ranges for the
application network address assignment.
The complete cluster architecture that results from the decisions taken in the previous sections is shown
in Figure 13 below:
Figure 13 - HOSC prototype architecture: the management server runs RHEL5.1 with Xen (Domain0) and
hosts the XBAS0 and HPCS0 management node VMs; the public Gigabit network (intranet/internet) uses
addresses 129.183.251.53 (eth1), 129.183.251.40 and 129.183.251.41 (xenbr1); the dual-boot compute
nodes (e.g., HPCS2/XBAS2, HPCS3/XBAS3) boot one OS or the other and are connected to the Gigabit
private network, the InfiniBand application network (IB switch management at 192.168.0.220), and the
public network
If for some reason the IB interface cannot be configured on the HN, you should set up a loopback
network interface instead and configure it with the IPoIB IP address (e.g., 172.16.1.1 in Figure 13). If for
some reason the IB interface cannot be configured on the MN, its setup can be skipped since it is not
mandatory to connect the IB interface on the MN.
In the next chapter we will show how to install and configure the HOSC prototype with this architecture.
4.6 Meta-scheduler architecture
Without a meta-scheduler, users need to connect to the required cluster management node in order to
submit their jobs. In this case, each cluster has its own management node with its own scheduler (as shown
on the left side of Figure 14). By using a meta-scheduler, we offer a single point of entry to use the power
of the HOSC, whatever the OS type required by the job (as shown on the right side of Figure 14).
Figure 14 - HOSC meta-scheduler architecture (in order to have a simpler scheme, the HOSC is
represented as two independent clusters: one with each OS type)
On the meta-scheduler, we create two job queues: one for the XBAS cluster and another one for the
HPCS cluster. According to the user's request, the job is then automatically redirected to the correct
cluster. The meta-scheduler also manages the switch from one OS type to the other according to the
clusters' workload.
We chose PBS Professional as the meta-scheduler for our prototype because of the experience we
already have with it on Linux and Windows platforms. The PBS server should be installed on a node that
is accessible from every other node of the HOSC. We chose to install it on the XBAS management node.
PBS MOM (Machine Oriented Mini-server) is installed on all compute nodes (HPCS and XBAS) so that
they can be controlled by the PBS server.
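The two queues described above could be created with PBS Professional's qmgr utility along these lines (the queue names and attribute values are assumptions for this sketch, not the exact prototype configuration):

```
qmgr -c "create queue xbas_q queue_type=execution,enabled=true,started=true"
qmgr -c "create queue hpcs_q queue_type=execution,enabled=true,started=true"
qmgr -c "set server default_queue = xbas_q"
```

Users would then direct a job to the desired OS type by submitting it to the corresponding queue (e.g., `qsub -q hpcs_q job.sh`).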
In the next chapter we will show how to install and configure this meta-scheduler on our HOSC prototype.
5 Setup of the HOSC prototype
This chapter describes the general setup of the Hybrid Operating System Cluster (HOSC) defined in the
previous chapter. The initial idea was to install Windows HPC Server 2008 on an existing Linux cluster
without affecting the existing Linux installation. However, in our case it appeared that the
installation procedure requires reinstalling the management node with the 2 virtual machines, so the
installation procedure is given for an HOSC installation done from scratch.
Check that Virtualization Technology (VT) is enabled in the BIOS settings of the server.
Install Linux RHEL5.1 from the DVD on the management server and select "virtualization" when optional
packages are proposed. SELinux must be disabled. Erase all existing partitions and design your partition
table so that enough free space is available in a volume group for creating logical volumes (LV). LVs are
virtual partitions used for the installation of virtual machines (VM), each VM being installed on one LV.
Volume groups and logical volumes are managed by the Logical Volume Manager (LVM). The advised
size of an LV is 30-50 GB: leave at least 100 GB of free space on the management server for the creation
of the 2 LVs.
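Assuming the free space is in a volume group named vg00 (an illustrative name; check yours with vgdisplay), the 2 LVs could be created as follows:

```
# check the free space available in the volume group
vgdisplay vg00
# create one logical volume per management node VM (sizes per the advice above)
lvcreate -L 40G -n xbas0_lv vg00
lvcreate -L 40G -n hpcs0_lv vg00
```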
It is advisable to install an up-to-date gigabit driver. One is included on the XBAS 5v1.1 XHPC DVD.
2 virtual machines are needed to install the XBAS management node and the HPCS head node. Create
these 2 Xen virtual machines on the management server. The use of HN-Master is optional and all
operations done in the Hypernova environment could also be done with Xen commands in a basic Xen
environment. For the use of HN-Master, httpd service must be started (type “chkconfig --level 35
httpd on” to start it automatically at boot time).
Figure 15 - HN-Master user interface
Create 2 network interface bridges, xenbr0 and xenbr1, so that each VM can have 2 virtual network
interfaces (one on the private network and one on the public network). Detailed instructions for
configuring 2 network interface bridges are shown in Appendix D.2.4.
5 In case of problems when installing the OS's (e.g., #IRQ disabled while files are copied), select only
1 virtual CPU for the VM during the OS installation step.
5.1.3 Installation of XBAS management node on a VM
Install XBAS on the first virtual machine. If applicable, reuse the clusterdb and the network configuration of the initial management node. Update the clusterdb with the new management node MAC addresses (the xenbr0 and xenbr1 MAC addresses of the VM). Follow the instructions given in the BAS for Xeon installation and configuration guide [22], and choose the following options for the MN setup:
[xbas0:root] cd /release/XBAS5V1.1
[xbas0:root] ./install -func MNGT IO LOGIN -prod RHEL XHPC XIB
Update the clusterdb with the new management node MAC-address (see [22] and [23] for details).
3. enable remote desktop (this is recommended for a remote administration of the cluster)
4. set “Internet Time Synchronization” so that the time is the same on the HN and the MN
5. install the Active Directory (AD) Domain Services and create a new domain for your cluster with
the wizard (dcpromo.exe), or configure the access to your existing AD on your local network
5.1.6 Preparation for XBAS deployment on compute nodes
Check that there is enough space on the first device of the compute nodes for creating an additional
primary partition (e.g., on /dev/sda). If not, make some space by reducing the existing partitions or by
redeploying XBAS compute nodes with the right partitioning (using the preparenfs command and a
dedicated kickstart file). Edit the kickstart file accordingly to an HOSC compatible disk partition. For
example, /boot on /dev/sda1, / on /dev/sda2 and SWAP on /dev/sda3. An example is given in
Appendix D.2.1.
Create a /opt/hosc directory and export it with NFS. Then mount it on every node of the cluster and
install the HOSC files listed in Appendix D.2.3 in it:
• switch_dhcp_host
• activate_partition_HPCS.sh
• fdisk_commands.txt
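A minimal sketch of that export/mount setup is shown below; the export options are an assumption and should be adjusted to your site policy:

```
# on the XBAS management node (xbas0):
mkdir -p /opt/hosc
echo "/opt/hosc *(rw,sync,no_root_squash)" >> /etc/exports
exportfs -a
# on each compute node:
mkdir -p /opt/hosc
mount -t nfs xbas0:/opt/hosc /opt/hosc
```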
c. Firewall is “off” on private network (this is for the compute nodes only because the firewall
needs to be “on” for the head node)
3. Configure the naming of the nodes (this step is mandatory even if it is not useful in our case: the
new node names will be imported from an XML file that we will create later). You can specify:
HPCS%1%
4. Create a deployment template with an operating system image of “Windows Server 2008”
Bring the HN online in the management console: click on “Bring Online” in the “Node Management”
window of the “HPC Cluster Manager” MMC.
Add a recent Gigabit network adapter driver to the OS image that will be deployed: click on “Manage drivers” and add the drivers for Intel PRO/1000 version 13.1.2 or higher (PROVISTAX64_v13_1_2.exe can be downloaded from the Intel web site).
Add a recent IB driver (see [27]) that supports Network Direct (ND). Then edit the compute node template and add a “post install command” task that configures the IPoIB IP address and registers ND on the compute nodes. The IPoIB configuration can be done by the script setIPoIB.vbs provided in Appendix D.1.2.
The ND registration is done by the command:
C:\> ndinstall -i
Two files used by the installation template must be edited in order to keep existing XBAS partitions
untouched on compute nodes while deploying HPCS. For example, choose the fourth partition
(/dev/sda4) for the HPCS deployment (see Appendix D.1.1 for more details):
• unattend.xml
• diskpart.txt
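For illustration only, a diskpart.txt fragment for such a first deployment on the fourth partition might take the following shape (the actual file is given in Appendix D.1.1; the exact commands depend on your initial disk layout, so treat this as a hypothetical sketch):

```
rem create and format only the HPCS partition, leaving sda1-sda3 untouched
select disk 0
create partition primary
format fs=ntfs quick
```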
Create a shared C:\hosc directory and install the HOSC files listed in Appendix D.1.3 in it:
• activate_partition_XBAS.bat
• diskpart_commands.txt
• from_HPCS_to_XBAS.bat
6 Thanks a lot to Christian Terboven (research associate in the HPC group of the Center for Computing and Communication at RWTH Aachen University) for his helpful contribution to this configuration phase.
5.2 Deployment of the operating systems on the compute nodes
The order in which the OS’s are deployed is not important, but it must be the same on every compute node. The order should thus be decided before starting any node installation or deployment. The installation scripts (such as diskpart.txt for HPCS or kickstart.<identifier> for XBAS) must be edited according to the chosen order. In this example, we chose to deploy XBAS first. The partition table we plan to create is:
/dev/sda1: Linux /boot
/dev/sda2: Linux /
/dev/sda3: Linux swap
/dev/sda4: Windows NTFS partition
First, check that the BIOS settings of all CNs are configured for PXE boot (and not local hard disk boot).
They should boot on the eth0 Gigabit Ethernet (GE) card. For example, the following settings are correct:
Boot order
1 - USB key
2 - USB disk
3 - GE card
4 - SATA disk
Once generated, the kickstart file needs a few modifications in order to fulfill the HOSC disk partition
requirements: see an example of these modifications in Appendix D.2.1.
When the modifications are done, boot the compute nodes: the PXE mechanism will then install XBAS on them with the information stored in the kickstart file. Figure 16 shows the console of a CN while it is PXE booting for its XBAS deployment.
Figure 16 - XBAS compute node console while the node starts to PXE boot
It is possible to install every CN with the preparenfs tool, or to install a single CN with the preparenfs tool and then duplicate it on every other CN server with the help of the ksis deployment tool. However, the use of ksis is only possible if XBAS is the first installed OS, since ksis overwrites all existing partitions. It is therefore advisable to only use the preparenfs tool for CN installation on an HOSC.
Check that the /etc/hosts file is consistent on XBAS CNs (see Appendix D.2.5). Configure the IB
interface on each node by editing file ifcfg-ib0 (see Appendix D.2.6) and enable the IB interface by
starting the openibd service:
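The elided commands would typically be the standard RHEL5 service commands, for example:

```
chkconfig openibd on
service openibd start
```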
In order to be able to boot Linux with the Windows MBR (after having installed HPCS on the CNs), install
the GRUB boot loader on the first sector of the /boot partition by typing on each CN:
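With the GRUB legacy shell shipped with RHEL5, installing the boot loader on the first sector of the /boot partition (assumed here to be /dev/sda1, i.e., (hd0,0)) can be sketched as follows; adapt the device coordinates to your actual layout:

```
grub
grub> root (hd0,0)
grub> setup (hd0,0)
grub> quit
```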
The last step is to edit all PXE files in the /tftboot directory and set both the TIMEOUT and PROMPT variables to 0 so that the compute nodes boot more quickly.
name next-server and server-name). The DHCP configuration file changes can be done by using the
switch_dhcp_host script (see Appendix D.2.3) for each compute node. Once the changes are done in
the file, the dhcpd service must be restarted in order to take changes into account. For example, type:
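A hedged sketch of those steps on the XBAS management node (the invocation syntax of switch_dhcp_host is an assumption; see Appendix D.2.3 for the actual script):

```
/opt/hosc/switch_dhcp_host xbas1   # hypothetical invocation, once per compute node
service dhcpd restart
```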
Now prepare the deployment of the nodes for the HPCS management console: get the MAC address of
all new compute nodes and create an XML file with the MAC address, compute node name and domain
name of each node. An example of such an XML file (my_cluster_nodes.xml) is given in
Appendix D.1.1. Import this XML file from the administrative console (see Figure 17) and assign a
deployment “compute node template” to the nodes.
Boot the compute nodes. Figure 18 shows the console of a CN while it is PXE booting for its HPCS
deployment with a DHCP server on the XBAS management node (192.168.0.1) and a WDS server on the
HPCS head node (192.168.1.1).
Intel(R) Boot Agent GE v1.2.36
Copyright (C) 1997-2005, Intel Corporation
Downloaded WDSNBP...
Architecture: x64
Contacting Server: 192.168.1.1 ............
Figure 18 - HPCS compute node console while the node starts to PXE boot
The nodes will appear with the “provisioning” state in the management console as shown in Figure 19.
After a while the compute node console shows that the installation is complete as in Figure 20.
At the end of the deployment, the compute node state is “offline” in the management console. The last
step is to click on “Bring online” in order to change the state to “online”. The HPCS compute nodes can
now be used.
5.3 Linux-Windows interoperability environment
In order to enhance the interoperability between the two management nodes, we set up a Unix/Linux
environment on the HPCS head node using the Subsystem for Unix-based Applications (SUA). We also
install SUA supplementary tools such as openssh that can be useful for HOSC administration tasks (e.g.,
ssh can be used to execute commands from a management node to the other in a safe manner).
The installation of SUA is not mandatory for setting up an HOSC, and many of these tools can also be found from other sources, but it is a rather easy and elegant way to obtain a homogeneous HOSC environment: firstly, it provides a lot of Unix tools on Windows systems, and secondly, it provides a framework for porting and running Linux applications in a Windows environment.
Other tools, such as proprietary compilers, can also be installed in the SUA environment.
User home directories should at least be shared on all compute nodes running the same OS: for
example, an NFS exported directory /home_nfs/test_user/ on XBAS CNs and a shared CIFS
directory C:\Users\test_user\ on HPCS CNs for user test_user.
It is also possible (and even recommended) to have a unique home directory for both OS’s by configuring Samba [36] on the XBAS nodes.
5.5 Configuration of ssh
[xbas0:root] cd /root/.ssh
[xbas0:root] cp id_rsa.pub authorized_keys
[xbas0:root] scp id_rsa id_rsa.pub authorized_keys root@hpcs0:.ssh/
[xbas0:root] scp id_rsa id_rsa.pub authorized_keys root@xbas1:.ssh/
Enter the root password when requested (it will not be requested again afterwards).
For copying the RSA key on the HPCS CNs see Section 5.5.4.
By default, the first time a node connects to a new host, ssh checks whether the host’s “server” RSA public key (stored in /etc/ssh/) is already known and asks the user to validate the authenticity of this new host. In order to avoid typing the “yes” answer for each node of the cluster, different ssh configurations are possible:
• The easiest, but less secure, solution is to disable the host key checking in file
/etc/ssh/ssh_config by setting: StrictHostKeyChecking no
• Another way is to merge the RSA public key of all nodes in a file that is copied on each node: the
/etc/ssh/ssh_known_hosts file. A trick is to duplicate the same server private key (stored
in file /etc/ssh/ssh_host_rsa_key) and thus the same public key (stored in file
/etc/ssh/ssh_host_rsa_key.pub) on every node. The generation of the
ssh_known_hosts file is then easier since each node has the same public key. An example of
such an ssh_known_hosts file is given in Appendix D.2.7.
5.5.3 Installation of freeSSHd on HPCS compute nodes
If you want to use PBS Professional and the OS balancing feature that was developed for our HOSC prototype, an ssh server daemon is required on each compute node. The sshd daemon is already installed by default on the XBAS CNs, and it should be installed on the HPCS CNs: we chose the freeSSHd [34] freeware. This software can be downloaded from [34] and its installation is straightforward: execute freeSSHd.exe, keep all default values proposed during the installation process, and accept to “run FreeSSHd as a system service”.
Then finish the setup by copying the RSA key file /root/.ssh/id_rsa.pub from the XBAS MN to file
C:\Program Files (x86)\freeSSHd\ssh_authorized_keys\root on the HPCS CNs. Edit
this file (C:\Program Files (x86)\freeSSHd\ssh_authorized_keys\root) and remove the
@xbas0 string at the end of the file: it should end with the string root instead of root@xbas0.
7 Thanks a lot to Laurent Aumis (SEMEA GridWorks Technical Manager at ALTAIR France) for his valuable help and expertise in setting up this PBS Professional configuration.
5.6.1 PBS Professional Server setup
Install PBS server on the XBAS MN: during the installation process, select “PBS Installation: 1. Server,
execution and commands” (see [31] for detailed instructions). By default, the MOM (Machine Oriented
Mini-server) is installed with the server. Since the MN should not be used as a compute node, stop PBS
with “/etc/init.d/pbs stop”, disable the MOM by setting PBS_START_MOM=0 in file
/etc/pbs.conf (see Appendix D.3.1) and restart PBS with “/etc/init.d/pbs start”.
If you want to use the same UID/GID on Windows and Linux nodes without unified UID management, you need to set the flag flatuid=true with the qmgr tool. The UID/GID of the PBS server will then be used. Type:
[xbas0:root] qmgr
Qmgr: set server flatuid=True
Qmgr: exit
It would also be possible to use $usecp in PBS to move files around instead of scp. Samba [36] could
be configured on Linux systems to allow the HPCS compute nodes to drop files directly to Linux servers.
1. select setup type “Execution” (only) on CNs and “Commands” (only) on the HN
2. enter pbsadmin user password (as defined on the PBS server: on XBAS MN in our case)
3. enter PBS server hostname (xbas0 in our case)
4. keep all other default values that are proposed by the PBS installer
5. reboot the node
Here is a summary of the PBS Professional configuration on our HOSC prototype. The following is a
selection of the most representative information reported by the PBS queue manager (qmgr):
Qmgr: print server
# Create and define queue windowsq
create queue windowsq
set queue windowsq queue_type = Execution
set queue windowsq default_chunk.arch = windows
set queue windowsq enabled = True
set queue windowsq started = True
# Create and define queue linuxq
create queue linuxq
set queue linuxq queue_type = Execution
set queue linuxq default_chunk.arch = linux
set queue linuxq enabled = True
set queue linuxq started = True
# Set server attributes.
set server scheduling = True
set server default_queue = workq
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server default_chunk.ncpus = 1
set server scheduler_iteration = 60
set server flatuid = True
set server resv_enable = True
set server node_fail_requeue = 310
set server max_array_size = 10000
set server pbs_license_min = 0
set server pbs_license_max = 2147483647
set server pbs_license_linger_time = 3600
set server license_count = "Avail_Global:0 Avail_Local:1024 Used:0 High_Use:8"
set server eligible_time_enable = False
Qmgr: print node xbas1
# Create and define node xbas1
create node xbas1
set node xbas1 state = free
set node xbas1 resources_available.arch = linux
set node xbas1 resources_available.host = xbas1
set node xbas1 resources_available.mem = 16440160kb
set node xbas1 resources_available.ncpus = 4
set node xbas1 resources_available.vnode = xbas1
set node xbas1 resv_enable = True
set node xbas1 sharing = default_shared
Qmgr: print node hpcs2
# Create and define node hpcs2
create node hpcs2
set node hpcs2 state = free
set node hpcs2 resources_available.arch = windows
set node hpcs2 resources_available.host = hpcs2
set node hpcs2 resources_available.mem = 16775252kb
set node hpcs2 resources_available.ncpus = 4
set node hpcs2 resources_available.vnode = hpcs2
set node hpcs2 resv_enable = True
set node hpcs2 sharing = default_shared
5.7.1 Just in time provisioning setup
This paragraph describes the implementation of a simple example of “just in time” provisioning (see
Section 3.6.3). We developed a Perl script (see pbs_hosc_os_balancing.pl in Appendix D.3.2)
that gets PBS server information about queues, jobs and nodes for both OS’s (e.g., number of free
nodes, number of nodes requested by jobs in queues, number of nodes requested by the smallest job).
Based on this information, the script checks a simple rule that defines the cases when the OS type of
CNs should be switched. If the rule is “true” then the script selects free CNs and switches their OS type.
In our example, we defined a conservative rule (i.e., the number of automatic OS switches is kept low):
“Let us define η as the smallest number of nodes requested by a queued job for a given OS type A. Let
us define α (respectively β) as the number of free nodes with the OS type A (respectively B). If η>α (i.e.,
there are not enough free nodes to run the submitted job with OS type A) and if β≥η-α (at least η-α
nodes are free with the OS type B) then the OS type of η-α nodes should be switched from B to A”.
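The arithmetic behind this rule is compact; the sketch below (plain shell, with hypothetical values η=4, α=1, β=3, not taken from the prototype) shows the test the script applies:

```shell
# eta:   smallest number of nodes requested by a queued job for OS type A
# alpha: number of free nodes with OS type A
# beta:  number of free nodes with OS type B
eta=4; alpha=1; beta=3
if [ "$eta" -gt "$alpha" ] && [ "$beta" -ge "$((eta - alpha))" ]; then
  echo "switch $((eta - alpha)) node(s) from OS type B to A"
fi
```

With these example values, the rule fires and requests switching 3 nodes from B to A.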
The script is run periodically based on the schedule defined by the crontab of the PBS server host. The
administrator can also switch more OS’s manually if necessary at any time (see Sections 6.3 and 6.4).
The crontab setup can be done by editing the following lines with the crontab command 8 :
[xbas0:root] crontab -e
# run HOSC Operating System balancing script every 10 minutes (noted */10)
*/10 * * * * /opt/hosc/pbs_hosc_os_balancing.pl
The OS distribution balancing is then controlled by this cron job. Instead of running the
pbs_hosc_os_balancing.pl script as a cron job, it would also be possible to call it as an external
scheduling resource sensor (see [31] for information about PBS Professional scheduling resources), or to
call it with PBS Professional hooks (see [31]). For developing complex OS balancing rules, the Perl script
could be replaced by a C program (for details about PBS Professional API see [33]).
This simple script could be further developed in order to be more reliable. For example:
• check that the script is only run once at a time (by setting a lock file for example),
• allow switching the OS type of more than η-α nodes at once if the number of free nodes and the number of queued jobs are high (this can happen when many small jobs are submitted),
• impose a delay between two possible switches of OS type on each compute node.
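The first item (single-instance protection via a lock file) can be sketched with flock from util-linux; the lock file path below is an assumption:

```shell
#!/bin/sh
LOCKFILE=/var/lock/hosc_os_balancing.lock   # hypothetical path
exec 9>"$LOCKFILE"
if ! flock -n 9; then
  # another balancing pass is still running: exit silently
  exit 0
fi
echo "balancing pass started"
# ... the actual balancing logic would run here, lock held until exit ...
```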
8 “crontab -e” opens the /var/spool/cron/root file in a vi mode and restarts the cron service automatically.
6 Administration of the HOSC prototype
For HPCS, this means that the basic services and connectivity tests should be run first, followed by the
automated diagnosis tests from the “cluster management” MMC.
For XBAS, the sanity checks can be done with basic Linux commands (ping, pdsh, etc.) and monitoring
tools like Nagios (see [23] and [24] for details).
The HPCS head node can send a reboot command to its HPCS compute nodes only (soft reboot) with
“clusrun”. For example:
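A hedged example (the node name and shutdown options are illustrative):

```
C:\> clusrun /nodes:hpcs2 shutdown /r /t 0
```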
Use “clusrun /all” for rebooting all HPCS compute nodes (the head node should not be declared as
a compute node; otherwise this command would reboot it too).
The XBAS management node can send a reboot command to its XBAS compute nodes only (soft reboot)
with pdsh. For example:
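A hedged example (node list shown for illustration):

```
pdsh -w xbas[1-2] reboot
```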
The XBAS management node can also reboot any compute node (HPCS or XBAS) with the NovaScale
control “nsctrl” command (hard reboot). For example:
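A hedged example; the "reset" action name is an assumption about the nsctrl command syntax:

```
nsctrl reset xbas1
```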
The compute node is then automatically rebooted with the HPCS OS type.
6.4 Switch a compute node OS type from HPCS to XBAS
Then take the node offline in the MMC and type the from_HPCS_to_XBAS.bat command in a
“command prompt” window of the HPCS head node. See Appendix D.1.3 for information on this
command implementation. For example, if you want to switch the OS of node hpcs2, type:
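The elided command presumably takes the node name as an argument; a hedged sketch, using the shared directory created earlier:

```
C:\> C:\hosc\from_HPCS_to_XBAS.bat hpcs2
```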
The compute node is then automatically rebooted with the XBAS OS type.
This script was mainly implemented to be used with a meta-scheduler, since it is not recommended to switch the OS type of an HPCS CN by sending a command from the XBAS MN (see Section 4.3).
6.5 Re-deploy an OS
The goal is to be able to re-deploy an OS on an HOSC without impacting the other OS that is already
installed. Do not forget to save your MBR since it can be overwritten during the installation phase (see
Appendix C.2).
For re-deploying XBAS compute nodes, the ksis tool cannot be used (it would erase the existing Windows partitions); the preparenfs command is the only tool that can be used. The partition declarations done in the kickstart file should then be edited in order to reuse the existing partitions, and not to remove them or create new ones. The modifications are slightly different from those done for the first install. If the existing partitions are those created with the kickstart file shown as an example in Appendix D.2.1:
Then the new kickstart file used for re-deploying a XBAS compute node should include the lines below:
/release/ks/kickstart.<identifier>
…
part /boot --fstype="ext3" --onpart sda1
part / --fstype="ext3" --onpart sda2
part swap --noformat --onpart sda3
…
In the PXE file stored on the MN (e.g., /tftboot/C0A80002 for node xbas1), the DEFAULT label should be set back to ks instead of local_primary. The CN can then be rebooted to start the re-deployment process.
For re-deploying Windows HPC Server 2008 compute nodes, check that the partition number in the unattend.xml file is consistent with the existing partition table and edit it if necessary (in our example: <PartitionID>4</PartitionID>). Edit the diskpart.txt file so that it only re-formats the NTFS Windows partition without cleaning or removing the existing partitions (see Appendix D.1.1). Manually update/delete the previous computer and hostname declarations in the Active Directory before re-deploying the nodes, and then run the compute node deployment template as for the first install.
my_job_Win.sub
#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -q windowsq
C:\Users\test_user\my_windows_application
my_job_Lx.sub
#!/bin/bash
#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -q linuxq
/home/test_user/my_linux_application
Whatever the OS type the application should run on, the scripts can be submitted from any Windows or
Linux computer with the same qsub command. The only requirement is that the computer needs to have
credentials to connect with the PBS Professional server.
The command lines can be typed from a Windows system:
C:\> qsub my_job_Win.sub
C:\> qsub my_job_Lx.sub
You can check the PBS queue status with the qstat command. Here is the example of an output:
[xbas0:root] qstat -n
xbas0:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
129.xbas0 thomas windowsq my_job_Win 3316 2 8 -- -- R 03:26
hpcs3/0*4+hpcs4/0*4
130.xbas0 laurent linuxq my_job_Lx. 21743 2 8 -- -- R 01:23
xbas1/0*4+xbas2/0*4
131.xbas0 patrice linuxq my_job_Lx. -- 2 8 -- -- Q --
--
132.xbas0 patrice linuxq my_job_Lx. -- 1 4 -- -- Q --
--
133.xbas0 laurent windowsq my_job_Win -- 2 8 -- -- Q --
--
134.xbas0 thomas windowsq my_job_Win -- 1 4 -- -- Q --
--
135.xbas0 thomas windowsq my_job_Win -- 1 1 -- -- Q --
--
136.xbas0 patrice linuxq my_job_Lx. -- 1 1 -- -- Q --
--
Figure 21 - PBS monitor with all 4 compute nodes busy (2 with XBAS and 2 with HPCS)
Figure 22 - PBS monitor with 1 busy HPCS compute node and 3 free XBAS compute nodes
7 Conclusion and perspectives
We studied 12 different approaches to HPC clusters that can run 2 OS’s. We particularly focused on those able to run the 2 OS’s simultaneously, which we named Hybrid Operating System Clusters (HOSC). The 12 approaches have dozens of possible implementations, among which the most common alternatives were discussed, resulting in technical recommendations for designing an HOSC.
This collaborative work between Microsoft and Bull gave the opportunity to build an HOSC prototype that
provides computing power under Linux Bull Advanced Server for Xeon and Windows HPC Server 2008
simultaneously. The prototype has 2 virtual management nodes installed on 2 Xen virtual machines run
on a single host server with RHEL5.1, and 4 dual-boot compute nodes that boot with the Windows master
boot record. The methodology to dynamically switch the OS type easily on some compute nodes without
disturbing the other compute nodes was provided.
A meta-scheduler based on Altair PBS Professional was implemented. It provides a single submission
point for both Linux and Windows and it adapts automatically (with some simple rules given as example)
the distribution of OS types among the compute nodes to the user needs (i.e., the pool of submitted jobs).
This successful project could be continued with the aim of improving the current HOSC prototype
features. Ideas of possible improvements are to
• develop a unique monitoring tool for both OS compute nodes (e.g., based on Ganglia [35]);
• work on interoperability between PBS and HPCS job scheduler (e.g., by using the tools of OGF,
the Open Grid Forum [37]).
We could also work on security aspects that were intentionally overlooked during this first study. More
intensive and exhaustive performance tests with virtual machines (e.g., InfiniBand ConnectX virtualization
feature, virtual processor binding, etc.) could also be done. Finally, a third OS could be installed on our
HOSC prototype to validate the general nature of the method exposed.
More generally, the framework presented in this paper should be considered as a building block for more specific implementations. Various requirements of real applications, environments or workloads could lead to noticeably different or more sophisticated developments. We hope that this initial building block will help those who will add subsequent layers, and we are eager to hear about successful production environments designed from it 9.
9 Do not hesitate to send the authors your comments about this paper and your HOSC experiments: patrice.calegari@bull.net and thomas.varlet@microsoft.com.
Appendix A: Acronyms
MBR Master Boot Record
MMC Microsoft Management Console
MN Management Node (Bull)
MOM Machine Oriented Mini-server (Altair)
MPI Message Passing Interface
MULTICS Multiplexed Information and Computing Service
NBP Network Boot Program
ND Network Direct (Microsoft)
NFS Network File System
NPB NASA advanced supercomputing (NAS) Parallel Benchmarks
NTFS New Technology File System (Windows)
OGF Open Grid Forum
OS Operating System
PBS Portable Batch System
PXE Pre-boot eXecution Environment
RHEL RedHat Enterprise Linux
ROI Return On Investment
RSA Rivest, Shamir, and Adelman
SDK Software Development Kit
SGE Sun Grid Engine
SLURM Simple Linux Utility for Resource Management
SSH Secure SHell
SUA Subsystem for Unix-based Applications
TCO Total Cost of Ownership
TCP Transmission Control Protocol
TFTP Trivial File Transfer Protocol
UDP User Datagram Protocol
UID User IDentifier
UNIX This is a pun on MULTICS (not an acronym!)
VM Virtual Machine
VNC Virtual Network Computing
VT Virtual Technology (Intel®)
WCCS Windows Compute Cluster Server
WDS Windows Deployment Service
WIM Windows IMage (Microsoft)
WinPE Windows Preinstallation Environment (Microsoft)
XBAS Bull Advanced Server for Xeon
XML eXtensible Markup Language
Appendix B: Bibliography and related links
[1] “Dual Boot: Windows Compute Cluster Server 2003 and Linux - Setup and Configuration
Guide”, July 2007. This white paper describes the installation and configuration of an HPC cluster
for a dual-boot of Windows Compute Cluster Server 2003 (WCCS) and Linux OpenSuSE.
http://www.microsoft.com/downloads/details.aspx?FamilyID=1457BC0A-EAFF-4303-99ED-B199AB1C0857&displaylang=en
[2] “Dual Boot: Windows Compute Cluster Server and Rocks Cluster Distribution - Setup and
Configuration Guide”, Jason Bucholtz, HPC Practice Lead, X-ISS, Michael Zebrowski, HPC
Analyst, X-ISS, 2007. This white paper describes the installation and configuration of an HPC
cluster for a dual-boot of WCCS 2003 and Rocks Cluster Distribution (formerly called NPACI
Rocks). http://www.microsoft.com/downloads/details.aspx?FamilyID=e73a468e-2dbf-4782-8faa-aaa20acb63f8&DisplayLang=en
[3] “Dual-boot Linux and HPC Server 2008” on G. Marchetti blog:
http://blogs.technet.com/gmarchetti/archive/2007/12/11/dual-boot-linux-and-hpc-server-2008.aspx
[4] BULL S.A.S. HPC solutions: http://www.bull.com/hpc
[5] Windows HPC Server: http://www.microsoft.com/hpc and http://www.windowshpc.net
[6] Xen: http://xen.xensource.com
[7] VMware: http://www.vmware.com
[8] Hyper-V: http://www.microsoft.com/windowsserver2008/en/us/virtualization-consolidation.aspx
[9] PowerVM: http://www-03.ibm.com/systems/power/software/virtualization/index.html
[10] Virtuozzo: http://www.parallels.com/en/products/virtuozzo/
[11] OpenVZ: http://openvz.org
[12] PBS Professional: http://www.pbsgridworks.com/ and http://www.altair.com/
[13] Torque: http://www.clusterresources.com/pages/products/torque-resource-manager.php
[14] SLURM: https://computing.llnl.gov/linux/slurm/
[15] LSF: http://www.platform.com/Products/platform-lsf
[16] SGE: http://gridengine.sunsource.net
[17] OAR: http://oar.imag.fr/index.html
[18] Wikipedia: http://www.wikipedia.org
[19] Moab and Maui: http://www.clusterresources.com
[20] GridWay: http://www.gridway.org
[21] Community Scheduler Framework: http://sourceforge.net/projects/gcsf
[22] “BAS5 for Xeon - Installation & Configuration Guide”, Ref: 86 A2 87EW00, April 2008
[23] “BAS5 for Xeon - Administrator’s guide”, Ref: 86 A2 88EW, April 2008
[24] “BAS5 for Xeon - User’s guide”, Ref: 86 A2 89EW, April 2008
[25] "A Comparison of Virtualization Technologies for HPC", John Paul Walters, Vipin Chaudhary,
Minsuk Cha, Salvatore Guercio Jr., Steve Gallo, In Proceedings of the 22nd International
Conference on Advanced Information Networking and Applications (AINA 2008), pp. 861-868, 2008
DOI= http://doi.ieeecomputersociety.org/10.1109/AINA.2008.45
[26] “Evaluating the Performance Impact of Xen on MPI and Process Execution For HPC
Systems”, Youseff, L., Wolski, R., Gorda, B., and Krintz, C. In Proceedings of the 2nd international
Workshop on Virtualization Technology in Distributed Computing, Virtualization Technology in
Distributed Computing, IEEE Computer Society, 2006, DOI= http://dx.doi.org/10.1109/VTDC.2006.4
[27] Mellanox Basic InfiniBand Software Stack for Windows HPC Server 2008 including
NetworkDirect support http://www.mellanox.com/products/MLNX_WinOF.php
[28] Utilities and SDK for Subsystem for UNIX-based Applications (SUA) in Microsoft Windows Vista
RTM/Windows Vista SP1 and Windows Server 2008 RTM:
http://www.microsoft.com/downloads/details.aspx?familyid=93ff2201-325e-487f-a398-efde5758c47f&displaylang=en
[29] Interops Systems: http://www.interopsystems.com
[30] SUA Community: http://www.suacommunity.com
[31] PBS Professional 10.0 Administrator’s Guide, 610 pages, GridWorks, Altair, 2009
[32] PBS Professional 10.0 User’s Guide, 304 pages, GridWorks, Altair, 2009
[33] PBS Professional 10.0 External Reference Specification, GridWorks, Altair, 2009
[34] freeSSHd and freeFTPd: http://www.freesshd.com
[35] Ganglia: http://ganglia.info
[36] Samba: http://www.samba.org
[37] Open Grid Forum: http://www.ogf.org
[38] Top500 supercomputing site: http://www.top500.org
Additional resources:
http://www.bull.com/techtrends
http://www.microsoft.com/downloads
http://technet.microsoft.com/en-us/library/cc700329(WS.10).aspx
Appendix C: Master boot record details

Address (Hex)   Address (Dec)   Description                            Size in bytes
0000            0               Code area                              ≤ 446
01B8            440             Optional disk signature                4
01BC            444             Usually null: 0x0000                   2
01BE            446             Table of primary partitions            64
                                (four 16-byte partition structures)
01FE            510             MBR signature: 0xAA55                  2
                                (55h at 01FE, AAh at 01FF)

MBR total size: 446 + 64 + 2 = 512

Table 3 - Structure of a Master Boot Record
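The layout in Table 3 can be checked programmatically. The following Python sketch (an illustrative helper, not part of the HOSC scripts) parses a 512-byte MBR image and validates the 0xAA55 signature:

```python
import struct

def parse_mbr(mbr):
    """Parse a 512-byte MBR image according to the layout of Table 3."""
    if len(mbr) != 512:
        raise ValueError("an MBR is exactly 512 bytes")
    # Bytes 510-511 must hold the signature 0xAA55 (55h then AAh on disk).
    if mbr[510:512] != b'\x55\xaa':
        raise ValueError("missing MBR signature 0xAA55")
    # Optional 4-byte disk signature at offset 0x01B8 (440), little-endian.
    disk_signature = struct.unpack_from('<I', mbr, 440)[0]
    # Four 16-byte primary partition entries starting at offset 0x01BE (446).
    partitions = [mbr[446 + 16 * i: 446 + 16 * (i + 1)] for i in range(4)]
    return {'code_area': mbr[:440],
            'disk_signature': disk_signature,
            'partitions': partitions}

# Example: a minimal valid MBR image (all zeros except the signature).
image = bytearray(512)
image[510:512] = b'\x55\xaa'
info = parse_mbr(bytes(image))
print(hex(info['disk_signature']))
```

Such a check can be useful before restoring a saved MBR file, to make sure the file really contains a boot sector.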
To save the MBR, copy the first sector of the system disk to a file. On Linux, type, for example (assuming the system disk is /dev/sda; adjust the device name to your configuration):

dd if=/dev/sda of=/tmp/mbr.save bs=512 count=1

If you want to restore the MBR, replace the first sector with the saved file. On Linux, type:

dd if=/tmp/mbr.save of=/dev/sda bs=512 count=1

On Windows Server 2008, the MBR can be restored even if it was not previously saved. From the repair command prompt, type:

bootrec /fixmbr
Appendix D: Files used in examples
Here are the files (scripts, configuration files, etc.) written or modified to build the HOSC prototype and to validate the information given in this document.
Default file:
…
<InstallTo>
  <DiskID>0</DiskID>
  <PartitionID>1</PartitionID>
</InstallTo>
…

Modified file (if XBAS uses the first 3 partitions, then Windows can be installed on the 4th partition):
…
<InstallTo>
  <DiskID>0</DiskID>
  <PartitionID>4</PartitionID>
</InstallTo>
…
select disk 0
clean
create partition primary
assign letter=c
format FS=NTFS LABEL="Node" QUICK OVERRIDE
active
exit

Note: the “clean” instruction removes all existing partitions. It must be deleted to preserve existing partitions.
my_cluster_nodes.xml
<?xml version="1.0" encoding="utf-8"?>
<Nodes xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns="http://schemas.microsoft.com/HpcNodeConfigurationFile/2007/12">
<Node Name="hpcs1" Domain="WINISV">
<MacAddress>003048334cf6</MacAddress>
</Node>
<Node Name="hpcs2" Domain="WINISV">
<MacAddress>003048334d04</MacAddress>
</Node>
<Node Name="hpcs3" Domain="WINISV">
<MacAddress>003048334d3c</MacAddress>
</Node>
<Node Name="hpcs4" Domain="WINISV">
<MacAddress>003048347990</MacAddress>
</Node>
</Nodes>
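Before importing such a node configuration file, it can be sanity-checked with a few lines of Python. The helper below is illustrative (not part of the HOSC scripts); note that the elements live in the HpcNodeConfigurationFile namespace, so the tag names must be qualified:

```python
import xml.etree.ElementTree as ET

NS = {'hpc': 'http://schemas.microsoft.com/HpcNodeConfigurationFile/2007/12'}

def list_nodes(xml_text):
    """Return (node name, MAC address) pairs from an HPCS node configuration file."""
    root = ET.fromstring(xml_text)
    return [(n.get('Name'), n.find('hpc:MacAddress', NS).text)
            for n in root.findall('hpc:Node', NS)]

XML = '''<?xml version="1.0" encoding="utf-8"?>
<Nodes xmlns="http://schemas.microsoft.com/HpcNodeConfigurationFile/2007/12">
  <Node Name="hpcs1" Domain="WINISV"><MacAddress>003048334cf6</MacAddress></Node>
  <Node Name="hpcs2" Domain="WINISV"><MacAddress>003048334d04</MacAddress></Node>
</Nodes>'''

for name, mac in list_nodes(XML):
    print(name, mac)
```

A quick check like this catches typos in MAC addresses or node names before they reach the head node.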
Next
End Function
'---------------------------------------------------------------------
Function setIPoIB(IPAddress)
PartialIP=Split(ipaddress,".")
strIPAddress = Array("10.1.0." & PartialIP(3))
strSubnetMask = Array("255.255.255.0")
strGatewayMetric = Array(1)
WScript.Echo "IB: " & strIPAddress(0)
strComputer = "."
Set objWMIService = GetObject("winmgmts:" _
& "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
Set colNetAdapters = objWMIService.ExecQuery _
("select * from win32_networkadapterconfiguration " _
& "where IPEnabled=true and description like 'Mellanox%'")
For Each objNetAdapter in colNetAdapters
errEnable = objNetAdapter.EnableStatic(strIPAddress, strSubnetMask)
If errEnable = 0 Then
SetIPoIB="The IP address on Infiniband has been changed"
Else
SetIPoIB="The IP address on IB could not be changed. Error: " & errEnable
End If
Next
End Function
C:\hosc\activate_partition_XBAS.bat
@echo off
rem the argument is the head node hostname for shared file system mount. For example: \\HPCS0
echo ... Partitioning disk...
diskpart.exe /s %1\hosc\diskpart_commands.txt
echo ... Shutting down node %COMPUTERNAME% ...
shutdown /r /f /t 20 /d p:2:4
C:\hosc\diskpart_commands.txt
select disk 0
select partition 1
active
C:\hosc\from_HPCS_to_XBAS.bat
@echo off
rem the argument is the node hostname. For example: hpcs1
echo Check that file dhcpd.conf is updated on the XBAS management node !
if NOT "%1"=="" clusrun /nodes:%1 %LOGONSERVER%\hosc\activate_partition_XBAS.bat %LOGONSERVER%
if "%1"=="" echo "usage: from_HPCS_to_XBAS.bat <hpcs_hostname>"
D.2 XBAS files
…
part / --asprimary --fstype="ext3" --ondisk=sda --size=10000
part /usr --asprimary --fstype="ext3" --ondisk=sda --size=10000
part /opt --fstype="ext3" --ondisk=sda --size=10000
part /tmp --fstype="ext3" --ondisk=sda --size=10000
part swap --fstype="swap" --ondisk=sda --size=16000
part /var --fstype="ext3" --grow --ondisk=sda --size=10000
…
…
part /boot --asprimary --fstype="ext3" --ondisk=sda --size=100
part / --asprimary --fstype="ext3" --ondisk=sda --size=50000
part swap --fstype="swap" --ondisk=sda --size=16000
…
Here is an example of a PXE file generated by preparenfs for node xbas1. Before deployment, the DEFAULT label is set to ks; after deployment, it is automatically set to local_primary.
LABEL local_primary
KERNEL chain.c32
APPEND hd0
LABEL ks
KERNEL RHEL5.1/vmlinuz
APPEND console=tty0 console=ttyS1,115200 ksdevice=eth0 lang=en ip=dhcp
ks=nfs:192.168.0.99:/release/ks/kickstart.22038 initrd=RHEL5.1/initrd.img driverload=igb
LABEL rescue
KERNEL RHEL5.1/vmlinuz
APPEND console=ttyS1,115200 ksdevice=eth0 lang=en ip=dhcp
method=nfs:192.168.0.99:/release/RHEL5.1 initrd=RHEL5.1/initrd.img rescue driverload=igb
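The automatic change of the DEFAULT label after deployment amounts to rewriting a single line of the PXE file. A minimal Python sketch of that step (a hypothetical helper, not part of the Bull tools) could be:

```python
def set_default_label(config_text, label):
    """Rewrite the DEFAULT line of a pxelinux configuration file."""
    lines = []
    replaced = False
    for line in config_text.splitlines():
        words = line.split()
        if words and words[0].upper() == 'DEFAULT':
            lines.append('DEFAULT ' + label)   # replace the existing DEFAULT line
            replaced = True
        else:
            lines.append(line)
    if not replaced:
        lines.insert(0, 'DEFAULT ' + label)    # no DEFAULT line yet: add one
    return '\n'.join(lines) + '\n'

cfg = 'TIMEOUT 10\nDEFAULT ks\nPROMPT 1\n'
print(set_default_label(cfg, 'local_primary'))
```

Switching the label back to ks before the next deployment is the same operation in reverse.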
/tftpboot/C0A80002 (head of the file after compute node deployment)
# GENERATED BY PREPARENFS SCRIPT
TIMEOUT 10
DEFAULT local_primary
PROMPT 1
The remainder of the file is unchanged. Set TIMEOUT and PROMPT to 0 to make the nodes boot more quickly.
/etc/dhcpd.conf
The NBP file path must be written with doubled backslashes (\\) so that it is correctly interpreted during the PXE boot.
/opt/hosc/switch_dhcp_host
#!/usr/bin/python -t
import os, os.path, sys
############## Cluster characteristics must be written here ################
xbas_hostname_base='xbas'
hpcs_hostname_base='hpcs'
field_dict = {hpcs_hostname_base:{'filename':'"Boot\\\\x64\\\\WdsNbp.com";\n',
'fixed-address':'192.168.1.',
'next-server':'192.168.1.1;\n',
'server-name':'"192.168.1.1";\n'},
xbas_hostname_base:{'filename':'"pxelinux.0";\n',
'fixed-address':'192.168.0.',
'next-server':'192.168.0.1;\n',
'server-name':'"192.168.0.1";\n'}}
A Hybrid OS Cluster Solution: Dual-Boot and Virtualization with Windows HPC Server 2008 and Linux Bull Advanced Server for Xeon
65
if (len(sys.argv) != 2):
print ('usage: switch_dhcp_host <current compute node hostname>')
sys.exit(1)
elif (len(str(sys.argv[1]))>1) and (str(sys.argv[1])[-2:].isdigit()):
node_base = str(sys.argv[1])[:-2]
node_rank = str(sys.argv[1])[-2:]
else:
node_base = str(sys.argv[1])[:-1]
node_rank = str(sys.argv[1])[-1:]
if (node_base == xbas_hostname_base ):
old_hostname= xbas_hostname_base + node_rank
new_hostname=hpcs_hostname_base + node_rank
new_node_base = hpcs_hostname_base
elif (node_base == hpcs_hostname_base):
old_hostname=hpcs_hostname_base + node_rank
new_hostname= xbas_hostname_base + node_rank
new_node_base = xbas_hostname_base
else:
print ('unknown hostname: ' + sys.argv[1])
sys.exit(1)
file_name = '/etc/dhcpd.conf'
if not os.path.isfile(file_name):
print file_name + ' does not exist!'
sys.exit(1)
status = 'File ' + file_name + ' was not modified'
file_name_save = file_name + '.save'
file_name_temp = file_name + '.temp'
old_file = open(file_name,'r')
new_file = open(file_name_temp,'w')
S = old_file.readline()
while S:
if (S[0:11] == 'next-server'): S = old_file.readline() # Removes global next-server line
if (S.find('host ' + old_hostname) != -1):
while (S.find('hardware ethernet') == -1):
S = old_file.readline() # Skips old host section lines
hardware_ethernet=S.split()[2] # Gets host Mac address
while (S.find('}') == -1):
S = old_file.readline() # Skips old host section lines
# Writes new host section lines:
new_file.write(' host ' + new_hostname + ' {\n')
new_file.write(' filename ' + field_dict[new_node_base]['filename'])
new_file.write(' fixed-address ' + field_dict[new_node_base]['fixed-address']
+ str(int(node_rank)+1) + ';\n')
new_file.write(' hardware ethernet ' + hardware_ethernet + '\n')
new_file.write(' option host-name ' + '"' + new_hostname + '";\n')
new_file.write(' next-server ' + field_dict[new_node_base]['next-server'])
new_file.write(' server-name ' + field_dict[new_node_base]['server-name'])
if (new_node_base == hpcs_hostname_base):
new_file.write('option domain-name-servers '+field_dict[new_node_base]['next-server'])
new_file.write(' }\n')
status = 'File ' + file_name + ' is updated with host ' + new_hostname
else: new_file.write(S) # Copies the line from the original file without modifications
S = old_file.readline()
# End while loop
old_file.close()
new_file.close()
if os.path.isfile(file_name_save): os.remove(file_name_save)
os.rename(file_name,file_name_save)
os.rename(file_name_temp,file_name)
print status
print ('Do not forget to validate changes by typing: service dhcpd restart')
sys.exit(0)
# End of switch_dhcp_host script
/opt/hosc/activate_partition_HPCS.sh
#!/bin/sh
#the argument is the node hostname. For example: xbas1
ssh $1 fdisk /dev/sda < /opt/hosc/fdisk_commands.txt
/opt/hosc/fdisk_commands.txt
a
4
a
1
w
q
/opt/hosc/from_XBAS_to_HPCS.sh
#!/bin/sh
#the argument is the node hostname. For example: xbas1
/opt/hosc/switch_dhcp_host $1
/sbin/service dhcpd restart
/opt/hosc/activate_partition_HPCS.sh $1
ssh $1 shutdown -r -t 20 now
/opt/hosc/from_HPCS_to_XBAS.sh
#!/bin/sh
#this script requires a ssh server daemon to be installed on the HPCS compute nodes
#the argument is the compute node hostname. For example: hpcs1
#HPCS head node hostname is hard coded in this script as: hpcs0
/opt/hosc/switch_dhcp_host $1
/sbin/service dhcpd restart
ssh $1 -l root cmd /c \\\\hpcs0\\hosc\\activate_partition_XBAS.bat \\\\hpcs0
/etc/xen/xen-config.sxp
Then create file:
/etc/xen/scripts/my-network-bridges
#!/bin/bash
XENDIR="/etc/xen/scripts"
$XENDIR/network-bridge "$@" netdev=eth0 bridge=xenbr0 vifnum=0
$XENDIR/network-bridge "$@" netdev=eth1 bridge=xenbr1 vifnum=1
/etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.0.1 xbas0
192.168.0.2 xbas1
192.168.0.3 xbas2
192.168.0.4 xbas3
192.168.0.5 xbas4
172.16.0.1 xbas0-ic0
172.16.0.2 xbas1-ic0
172.16.0.3 xbas2-ic0
172.16.0.4 xbas3-ic0
172.16.0.5 xbas4-ic0
/etc/sysconfig/network-scripts/ifcfg-ib0
DEVICE=ib0
ONBOOT=yes
BOOTPROTO=static
NETWORK=192.168.220.0
IPADDR=192.168.220.2
D.3 Meta-scheduler setup files
/etc/pbs.conf
PBS_EXEC=/opt/pbs/default
PBS_HOME=/var/spool/PBS
PBS_START_SERVER=1
PBS_START_MOM=0
PBS_START_SCHED=1
PBS_SERVER=xbas0
PBS_SCP=/usr/bin/scp
Here is an example of a PBS Professional configuration file for the PBS MOM on the XBAS CNs:
/etc/pbs.conf
PBS_EXEC=/opt/pbs/default
PBS_HOME=/var/spool/PBS
PBS_START_SERVER=0
PBS_START_MOM=1
PBS_START_SCHED=0
PBS_SERVER=xbas0
PBS_SCP=/usr/bin/scp
C:\Windows\System32\drivers\etc\lmhosts
192.168.0.1 xbas0 #PBS server for HOSC
“Let us define η as the smallest number of nodes requested by a queued job for a given OS type A. Let
us define α (respectively β) as the number of free nodes with the OS type A (respectively B). If η>α (i.e.,
there are not enough free nodes to run the submitted job with OS type A) and if β≥η-α (at least η-α
nodes are free with the OS type B) then the OS type of η-α nodes should be switched from B to A”.
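The rule quoted above translates directly into a few lines of code. Here is a minimal Python sketch of the decision (the Perl script below implements the same logic inside PBS):

```python
def nodes_to_switch(eta, alpha, beta):
    """Number of nodes whose OS should be switched from type B to type A.

    eta:   smallest number of nodes requested by a queued job for OS type A
    alpha: number of free nodes with OS type A
    beta:  number of free nodes with OS type B
    Returns eta - alpha if a switch is both needed and possible, else 0.
    """
    if eta > alpha and beta >= eta - alpha:
        return eta - alpha
    return 0

print(nodes_to_switch(4, 1, 3))  # 3: switch three B nodes to A
print(nodes_to_switch(4, 1, 2))  # 0: not enough free B nodes
print(nodes_to_switch(2, 3, 5))  # 0: the job can already run on A
```

The same function is applied symmetrically for jobs queued on the other OS type.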
/opt/hosc/pbs_hosc_os_balancing.pl
#!/usr/bin/perl
#use strict;
#Gets information with qstat about the number of nodes requested by queued jobs
$command_qstat = "/usr/pbs/bin/qstat -a |";
open (PBSC, $command_qstat ) or die "Failed to run command: $command_qstat";
@cmd_output = <PBSC>;
close (PBSC);
$nb_windows_nodes_of_smallest_job = 1e09;
$nb_linux_nodes_of_smallest_job = 1e09;
foreach $line (@cmd_output) {
if ((split(' ', $line))[9] =~ "Q") {
$nb_nodes = (split(' ', $line))[5];
if ($line =~ "windowsq") {
$nb_windows_nodes_queued += $nb_nodes;
if ($nb_nodes < $nb_windows_nodes_of_smallest_job) {
$nb_windows_nodes_of_smallest_job = $nb_nodes;
}
} elsif ($line =~ "linuxq") {
$nb_linux_nodes_queued += $nb_nodes;
if ($nb_nodes < $nb_linux_nodes_of_smallest_job) {
$nb_linux_nodes_of_smallest_job = $nb_nodes;
}
}
}
}
#STDOUT is redirected to a LOG file
open LOG, ">>/tmp/pbs_hosc_log.txt";
select LOG;
#Compute the number of possible requested nodes whose OS type should be switched
$requested_windows_nodes = $nb_windows_nodes_of_smallest_job - scalar @free_windows_nodes;
$requested_linux_nodes = $nb_linux_nodes_of_smallest_job - scalar @free_linux_nodes;
The above script is run every 10 minutes, as defined by the crontab file:
/var/spool/cron/root
# run HOSC Operating System balancing script every 10 minutes (noted */10)
*/10 * * * * /opt/hosc/pbs_hosc_os_balancing.pl
Appendix E: Hardware and software used for the examples
Here are the details of the hardware and software configuration used to illustrate the examples in this document. This configuration was used to build the HOSC prototype and to validate the information given here. Any Bull NovaScale or bullx cluster running Linux Bull Advanced Server for Xeon and Windows HPC Server 2008 could be used in the same manner.
E.1 Hardware
• 1 Bull NovaScale R460 server
E.2 Software
• Windows
o Windows HPC Server 2008: Windows Server 2008 Standard and the Microsoft HPC
Pack
o Intel® network adapter driver for Windows Vista and Server 2008 x64 v13.1.2
o Mellanox InfiniBand Software Stack for Windows HPC Server 2008 v1.4.1
o Microsoft Utilities and SDK for UNIX-based Applications AMD64 (v. 10.0.6030.0) and
Interops Systems “Power User” add-on bundle (v. 6.0)
o freeSSHd 1.2.1
• Linux
o Bull Advanced Server for Xeon 5v1.1: Red Hat Enterprise Linux 5.1 including Xen
3.0.3 with Bull XHPC and XIB packs (optional: Bull Hypernova 1.1.B2)
Appendix F: About Altair and PBS GridWorks
Appendix G: About Microsoft and Windows HPC Server 2008
More information and resources for Windows HPC Server 2008 are available at:
Appendix H: About BULL S.A.S.
Bull is one of the leading European IT companies, and has become an indisputable player in the
High-Performance Computing field in Europe, with exceptional growth over the past four years, major
contracts, numerous records broken, and significant investments in R&D.
In June 2009, Bull confirmed its commitment to supercomputing, with the launch of its bullx range: the first
European-designed supercomputers to be totally dedicated to Extreme Computing. Designed by Bull’s team of
specialists working in close collaboration with major customers, bullx embodies the company’s strategy to
become one of the three worldwide leaders in Extreme Computing, and number one in Europe. The bullx
supercomputers benefit from the know-how and skills of Europe’s largest center of expertise dedicated to
Extreme Computing. Delivering anything from a few teraflops to several petaflops of computing power, they are
easy to implement by everyone from a small R&D office to a world-class data center.
Bull has now won worldwide recognition thanks to several TOP500-class systems (see [38]). Bull has gathered significant momentum in HPC in recent years, with over 120 customers in 15 countries across three continents. The spread of countries and industry sectors covered, as well as the sheer diversity of solutions that Bull has sold, illustrates the reputation that the company now enjoys. Its installations range from the first major supercomputer installed at the CEA to the numerous supercomputers delivered to higher education establishments in Brazil, France, Spain, Germany and the United Kingdom, such as the two large clusters acquired by the Jülich Research Center, which deliver a combined peak performance of more than 300 teraflops. In industry, prestigious customers including Alcan, Pininfarina, Dassault-Aviation and Alenia have chosen Bull solutions, and Miracle Machines in Singapore implemented a Bull supercomputer that will be used to study and help predict tsunamis.
Alongside this commercial success, the breaking of a number of world records highlights Bull's expertise in the
design and integration of the most advanced technologies. Bull systems have achieved some major
performance records, particularly for ultra-large file systems, image searches in very large-scale databases (the
engines of future research), and the search for new prime numbers. These systems have also been used to
carry out the most extensive simulation ever of the formation of the structures of the Universe.
To prepare the systems of the future, Bull is a founder or member of several important consortia, including Parma, which forms part of ITEA2 and brings together a large number of European research centers to develop the next generation of parallel systems. Finally, Bull is a founder member of the POPS consortium, under the auspices of the SYSTEM@TIC competitiveness cluster based in the Ile-de-France region, which is developing tomorrow's petascale systems.
Bull and the French Atomic Energy Authority (CEA) are currently collaborating to design and build Tera 100,
the future petascale supercomputer to support the French nuclear simulation program.
For more information, visit http://www.bull.com/hpc