
World Applied Programming, Vol (1), No (5), December 2011, pp. 309-321.
ISSN: 2222-2510
2011 WAP journal. www.waprogramming.com

Dynamic Hierarchical Model for Fault Tolerant Grid Computing
Mohammed REBBAH *
Computer Science Department, University of Mascara,
LRGB Laboratory, EDTEC Group
Mascara, Algeria
Rebbah_med@yahoo.fr

Yahya SLIMANI
Computer Science Department, University of El Manar,
Tunis, Tunisia
yahya.slimani@fst.rnu.tn

Abdelkader BENYETTOU
University of Sciences and Technology of Oran Mohammed BOUDIAF,
Oran, Algeria
a_benyettou@yahoo.fr

Lionel BRUNIE
University of Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205, F-69621,
Lyon, France
lionel.brunie@insa-lyon.fr

Abstract: Our contribution in this paper is twofold. First, we propose a dynamic hierarchical model for the grid, which models the grid as a dynamic n-ary tree composed of a root, a set of intermediate levels that depends on the number of available resources, and a lowest level containing the resources in charge of executing jobs. Second, we support our model with a fault tolerance mechanism based on distribution and swapping techniques. The distribution technique, adopted to tolerate faults at the intermediate levels, keeps jobs in their leaves and reconnects the children of failed nodes to the siblings of their parents without any replication. The implementation of our model over Globus Toolkit 4 extends its functionality to tolerate faults.
Key words: Grid computing, Fault tolerance, Dynamic hierarchical model, Distribution, Swapping, Globus Toolkit.
I. INTRODUCTION
Grid computing was first introduced by Foster and Kesselman [1] to define a distributed computing infrastructure for advanced science and engineering. A grid is a collection of distributed computing resources, available over a local or wide area network, that appears to an end user or application as one large virtual computing system. The main goal of this infrastructure is to provide shared heterogeneous services and resources, accessible by users and applications, to solve computationally intensive problems and to access large storage spaces.
Most grid models are static: the grid is formed, in general, by a set of clusters connected through a local area network (LAN) or a wide area network (WAN), where every cluster contains a set of local sites managing a set of nodes in charge of executing the users' jobs. This organization is often called the G/S/M model [27, 28]. Although these models have yielded encouraging results, they are limited in some areas:
1. Mismanagement of grid resources: this mismanagement goes in both directions. When the number of grid resources shrinks, it is pointless to structure them into many levels; conversely, when the number of resources grows, the hierarchy typically requires more levels.
2. Mismanagement of dynamicity: one of the fundamental characteristics of the grid is the dynamicity of its resources. This dynamicity cannot be handled by a static hierarchical model.
In this paper, we propose a dynamic hierarchical model, which models the grid as a dynamic n-ary tree composed of a root, a set of intermediate levels that depends on the number of available resources, and a lowest level containing the resources in charge of executing jobs. The dynamic nature of the proposed model is related to the number of available resources in the grid and to their distribution in the tree.
The large computing potential of computational grids is often hampered by their susceptibility to failures, which include process failures, machine crashes and network failures. In grid computing, fault management is a very important and difficult problem for grid application developers. The failure of resources fatally affects job execution; therefore, fault tolerance functionality is essential in grid computing [2]. A computational grid consists of large sets of diverse, geographically distributed resources that are grouped into virtual computers for executing specific applications. As the number of grid system components increases, the probability of failure is higher than in traditional parallel computing [1].
In this paper, we present two fault tolerance mechanisms based on the distribution and swapping techniques. We have implemented our model as a grid service (Dynamic Hierarchical Model for Fault Tolerant Grid; DHM-FTGrid) over Globus Toolkit 4. The proposed fault tolerance mechanism treats crash faults and network failures; it is based on error recovery by the distribution and swapping techniques.
The rest of this paper is organized as follows: Section 2 gives an overview of fault tolerance in grid computing. Related work is discussed in Section 3. Section 4 defines our proposed model, its various actors and the family concept defined between the levels of the model. Section 5 describes the types of faults treated and the proposed fault tolerance mechanism. Section 6 presents the architecture of the DHM-FTGrid grid service developed over Globus GT4 [29] to validate our proposed model, and experimental results are presented in Section 7. Conclusion and future work are presented in Section 8.
II. FAULT TOLERANCE IN GRID COMPUTING
Failure in large-scale grid systems is, and will remain, a fact of life. Hosts, networks, disks and applications frequently fail, restart, disappear and otherwise behave unexpectedly. Support for the development of fault-tolerant applications has been identified as one of the major technical challenges to address for the successful deployment of computational grids [3, 4, 5]. Three fault tolerance techniques for grid computing have been of particular importance: (i) checkpointing, i.e. periodically saving the state of a process running on a computational resource so that, in the event of failure, it can be migrated to an operational resource [6, 7]; (ii) replication, i.e. maintaining a sufficient number of replicas, or copies, of a process executing in parallel with identical state but on different resources, so that at least one replica is guaranteed to finish correctly [8, 9, 10]; and (iii) rescheduling, i.e. in the event of failure, finding different resources that can accept and run the failed tasks. The replication of data is an important aspect of providing fault tolerance in data grids [11, 12, 13]. Several approaches for the implementation of fault tolerance in message-passing applications exist. MPICH-GF [14] is a checkpointing system based on MPICH-G2 [15], a Grid-enabled version of MPICH. It handles checkpointing, error detection and process restart in a manner transparent to the user [16]. Pawel Garbacki et al. address the problem of making parallel Java applications based on Remote Method Invocation (RMI) fault tolerant in a way transparent to the programmer [17]. Azzedin and Maheswaran [18] suggested integrating the trust concept into grid resource management. Abawajy [19] presented a Distributed Fault-Tolerant Scheduling (DFTS) policy to provide fault tolerance for job execution in a grid environment. Song et al. [20] developed a security-binding scheme through site reputation assessment and trust integration across grid sites. Congfeng Jiang et al. [21] proposed a Fuzzy-logic based Self-Adaptive job Replication Scheduling (FSARS) algorithm to handle the fuzziness or uncertainty of the job replication number, which is highly related to trust factors behind grid sites and user jobs.
III. RELATED WORK
In centralized systems, decisions are made by a central controller, which maintains all information about the applications and keeps track of all available resources in the system. Centralized systems are simple to implement, easy to deploy, and present few management hassles. However, they do not scale with the number of grid resources. In hierarchical systems, there is a central manager and multiple lower-level managers. The central manager is responsible for handling the complete execution of an application and for assigning its individual parts to the lower-level managers, whereas each lower-level manager is responsible for mapping these parts onto grid resources. The main advantage of a hierarchical architecture is that different management policies can be deployed at the central manager and at the lower levels. However, the failure of the central manager results in the failure of the entire system. Ranganathan and Foster [22, 23] describe and evaluate various replication strategies for hierarchical data grids. These strategies are defined depending on when, where and how replicas are created and destroyed. They compare six different replication strategies: No Replication, Best Client, Cascading, Plain Caching, Caching plus Cascading and Fast Spread. One of the enhancements is support for the hierarchical desktop grid concept described by Kacsuk et al. [24], which allows a set of projects to be connected to form a directed acyclic graph where work is distributed among the edges of this graph. The hierarchy is handled with the help of a modified BOINC client application, the Hierarchy Client, which always runs beside a child project; its only task is to connect to the parent desktop grid, report itself as a powerful client consisting of a given number of processors, and inject the fetched workunits into the local desktop grid's database. Generally, a project acting as a parent does not have to be aware of the hierarchy; it only sees the child desktop grid as one powerful client. Marosi et al. [25] show how to implement automatic application deployment in hierarchical desktop grid systems, so that administrators of lower-level desktop grids do not have to deal with deploying applications of higher-level parent desktop grids. Farkas et al. [26] describe an important property of scheduling algorithms for hierarchical desktop grid systems, namely that each child desktop grid runs an instance of one of the scheduling algorithms. The task of the scheduling algorithm is not to send workunits to attached clients, but to determine a number of CPU cores reflecting the performance of the given desktop grid, as reported by the Hierarchy Client. When a child desktop grid connects to its parent, it represents itself as a powerful client consisting of that many cores, so it will process at most that many workunits originating from its parent in parallel.

All these works fail to take into account the dynamic nature of grid resources, which can appear and disappear unpredictably, and the heterogeneity of these resources. This dynamic nature and heterogeneity impose further challenges on seamless collaboration. In this paper, we propose a dynamic hierarchical fault tolerance model based on distribution and swapping techniques.
IV. PROPOSED MODEL
A. Grid model
The model supposes that the grid (see Figure 1) is a finite set of G clusters Ck interconnected by gates gtk, k ∈ {0, ..., G-1}, where each cluster contains one or more sites Sjk interconnected by switches SWjk, and every site contains some Processor Elements PEijk and some Storage Elements SEijk, interconnected by a local area network.

Figure 1: Grid topology
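As a reading aid, the topology above can be sketched as a small data model. The class names and fields below mirror the paper's notation (clusters Ck with gates gtk, sites Sjk, processor and storage elements); everything else is an illustrative assumption, not part of the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Site:                   # S_jk, interconnected by a switch SW_jk
    processor_elements: list  # PE_ijk identifiers
    storage_elements: list    # SE_ijk identifiers

@dataclass
class Cluster:                # C_k, reached through its gate gt_k
    gate: str
    sites: list = field(default_factory=list)

@dataclass
class Grid:                   # a finite set of G clusters
    clusters: list = field(default_factory=list)

    def resource_count(self) -> int:
        # Processor elements are the resources loaded to execute jobs.
        return sum(len(s.processor_elements)
                   for c in self.clusters for s in c.sites)

# A minimal one-cluster, one-site grid with two processor elements.
grid = Grid([Cluster("gt0", [Site(["PE000", "PE100"], ["SE000"])])])
```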

B. Dynamic Hierarchical Model
The hierarchical architecture is frequently used for designing complex systems. It organizes the system in hierarchical levels. We propose to structure the resources of the grid in a dynamic n-ary tree composed of n levels, described as follows (see Figure 2):
Leaf level: every node at this level is a leaf with the following functions:
- Execution of jobs,
- Sending the states of the jobs to the superior level (the parent).
Intermediate levels: nodes at these levels have the following functions:
- Detection and fault tolerance of nodes at the lower levels (the children),
- Updating the status of the children's jobs,
- Sending the states of their children to the parent.
Root level: this level corresponds to the root of the tree. It consists of a single node, associated with the entire grid, called the manager of the grid. Its role covers:
- Detection and fault tolerance of its children,
- Updating the status of its children's jobs.
The user submits a job to the root, which distributes it fairly among its children. This process spreads through all the intermediate levels down to the leaves, which execute the jobs. The results are then transmitted upward from parent to parent until the root, which delivers them to the end user.
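The top-down spreading of jobs described above can be sketched as a recursive fair split. The dictionary-based tree and the round-robin split are illustrative assumptions of this sketch, not the paper's implementation.

```python
# Each node splits its share of jobs fairly among its children until the
# leaves, which hold the jobs they will execute.

def distribute(node, jobs):
    """Return {leaf_id: [jobs]} by splitting jobs fairly at every level."""
    if not node["children"]:          # a leaf executes the jobs it receives
        return {node["id"]: jobs}
    placement = {}
    k = len(node["children"])
    for i, child in enumerate(node["children"]):
        share = jobs[i::k]            # round-robin gives a fair split
        placement.update(distribute(child, share))
    return placement

# A toy tree: root R, two intermediate nodes, three leaves.
root = {"id": "R", "children": [
    {"id": "N1", "children": [{"id": "L1", "children": []},
                              {"id": "L2", "children": []}]},
    {"id": "N2", "children": [{"id": "L3", "children": []}]},
]}
placement = distribute(root, list(range(6)))
```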

Figure 2: Dynamic Hierarchical Model of a grid. The root level holds the root Rn,m and the Duplicate Root DRn-1,m,1; below them, the intermediate levels contain nodes Ni,j,k, down to the leaf level of nodes N0,j,k.

Where:
R: Root.
DR: Duplicate Root.
N: Node.
Ni,j,k: identifies each node by its level i, the number j of its parent, and its own number k.
In order to build a balanced initial tree, depending on the number of children per node and on the number of levels, we define the function A(L, F), which returns the number of nodes required to build the requested tree:

A(L, F) = (F^(L+1) - 1) / (F - 1), where F > 1

Where:
F: number of children of every node.
L: number of levels.
Example: to build a balanced tree of 3 levels and 2 children for each parent, we need A(3, 2) = (2^4 - 1) / (2 - 1) = 15 nodes.
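A quick numeric check of A(L, F), written as the geometric sum it represents (integer arithmetic, F > 1):

```python
def required_nodes(levels: int, fanout: int) -> int:
    """A(L, F) = (F**(L+1) - 1) // (F - 1): number of nodes in a balanced
    tree of L levels below the root where every parent has F children."""
    assert fanout > 1, "the formula requires F > 1"
    return (fanout ** (levels + 1) - 1) // (fanout - 1)
```

The paper's example, 3 levels with 2 children per parent, gives 15 nodes; likewise 2 levels with 3 children per parent gives the 13 nodes of the 9/3 model used later in the experiments.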

312

Mohammed REBBAH et al., World Applied Programming (WAP), Vol (1), No (5), December 2011.

C. Family concept
Because of the dynamicity of nodes in the tree, we define the notion of family: each node of the tree has a family, possibly consisting of a parent, siblings and children. Family members vary according to the position of the node in the tree (see Figure 3):
- Leaves have siblings and a parent.
- Intermediate nodes have all family members.
- The root has only children.
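The family of a node can be derived from a child-to-parent map. This helper is an illustrative sketch (in the actual service, the family is stored in the db_family table described in Section VI):

```python
def family(tree, node):
    """tree maps child -> parent; returns the node's parent, siblings
    (same parent) and children, any of which may be empty."""
    parent = tree.get(node)
    siblings = [n for n, p in tree.items() if p == parent and n != node]
    children = [n for n, p in tree.items() if p == node]
    return {"parent": parent, "siblings": siblings, "children": children}

# A tiny tree: root R with children A and B; A has leaves X and Y.
tree = {"A": "R", "B": "R", "X": "A", "Y": "A"}
```

Consistent with the text: leaves have a parent and siblings, an intermediate node has all three, and the root has only children.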

Figure 3: Family of the node Ni,k,L: its parent Ni+1,j,k, its siblings Ni,k,p and its children Ni-1,L,q.

V. FAULT TOLERANCE MODEL
A. Types of faults
Our system is able to detect and tolerate crash faults and disconnection faults.
Crash fault: an entity stops abruptly and is no longer accessible. A crash fault may occur at a leaf, at an intermediate node, or at the root.
- Leaf level: when a leaf is affected by a crash fault, the execution of its jobs stops; no job can be submitted to this leaf and it can no longer send messages to its parent.
- Intermediate level: when a node at this level suffers a crash fault, its fault tolerance manager stops; it can no longer send messages to its parent or receive status messages from its children.
- Root: when the root experiences a crash fault, all the information of the grid is lost.
Disconnection fault: a failure of the communication medium. This type of fault occurs when there is an error in the management of the communication between the different elements of the grid, for example a fault in the DNS manager, a connection (wiring) problem, or a problem in the system files.
B. Fault Detection
Fault detection is crucial for providing a scalable, dependable and highly available grid computing environment. We use heartbeat messages at the intermediate nodes and at the root; detecting a fault is the responsibility of the parent. Each parent periodically sends a heartbeat message to all its children. When it receives no reply from one of its children, it waits for a certain time interval; if it then receives the child's state by another path (through another node of the tree), it is a disconnection fault, otherwise it is a crash fault.
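The detection rule above can be condensed into a small decision function. Timeouts and message plumbing are abstracted away, and all names are assumptions of this sketch:

```python
def classify_fault(direct_reply: bool, state_via_other_path: bool) -> str:
    """Parent-side classification after pinging a child:
    no direct reply but news of the child through another node of the
    tree means a disconnection fault; no news at all means a crash."""
    if direct_reply:
        return "alive"
    # No reply within the timeout: disambiguate using indirect information.
    return "disconnection" if state_via_other_path else "crash"
```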
C. Fault tolerance
The fault tolerance mechanism follows the hierarchical model: each parent tolerates the faults of its children. We propose in this model two fault tolerance techniques, based on distribution and on swapping. When a parent detects a child's crash fault, it counts the child's siblings; if the failed child is an only child, the parent uses the swapping method, otherwise it uses the distribution method.
1) Distribution method
Upon the detection of a crash fault in the tree, its tolerance is the responsibility of the failed node's parent. It consists in distributing the children of the failed node over its siblings: the parent counts the children of the failed node and reconnects them fairly to its siblings; after being repaired, the failed node becomes a leaf of the tree (see Figure 4). The distribution of nodes proceeds from left to right.
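An illustrative sketch of the distribution method: the orphaned children of a failed node are reconnected, left to right, to the siblings of the failed node, with no replication. Function and variable names are assumptions:

```python
def redistribute(orphans, siblings):
    """Return {sibling: [adopted children]} with a fair left-to-right
    split of the failed node's children over its surviving siblings."""
    assignment = {s: [] for s in siblings}
    for i, child in enumerate(orphans):
        assignment[siblings[i % len(siblings)]].append(child)
    return assignment

# Figure 4's case: three orphaned children, two surviving siblings.
adoption = redistribute(["c1", "c2", "c3"], ["s1", "s2"])
```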

Figure 4: Distribution method. The failed node has three children and two siblings; after the distribution and the repair of the faulty node, its children are reconnected fairly (from left to right) to its siblings, and the repaired node becomes a leaf.

2) Swapping method
This method consists in replacing a failed node by a leaf of the tree; we choose the least loaded substitute, to minimize the number of jobs to tolerate, and after the repair of the failed node, it becomes a leaf. We use this method in two cases:
Swapping a node of the intermediate level: when a parent detects a child's crash fault, it seeks the least loaded leaf to replace the failed node, and the leaf's jobs are tolerated on its siblings. If some jobs still cannot be placed, the parent transmits them to the next level (see Figure 5).
Swapping the root with the duplicate root: if the root fails, we lose all the jobs submitted to the grid. For this reason, we added to the model a Duplicate Root (DR), which is a child of the root. The DR is assigned to detect crash faults of the root and to tolerate them by the swapping method; the root periodically updates the DR. When the root breaks down, the DR takes its place and is itself replaced by a leaf of the tree (we choose the least loaded leaf and tolerate its jobs); the root becomes a leaf after repair.
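A hedged sketch of the substitute selection: the least loaded leaf (fewest queued jobs) replaces the failed node, so that as few jobs as possible must be tolerated elsewhere. The nb_job field follows the paper's db_children table; the rest is assumed:

```python
def choose_substitute(leaves):
    """Pick the least loaded leaf (fewest queued jobs) as the substitute."""
    return min(leaves, key=lambda leaf: leaf["nb_job"])

leaves = [{"id": "N0_j_1", "nb_job": 1},
          {"id": "N0_j_2", "nb_job": 4},
          {"id": "N0_j_3", "nb_job": 2}]
```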
D. Disconnection fault tolerance
Disconnection faults are tolerated from children to parent: when a node loses the connection with its parent, it consults its siblings to find a path through one of them. Once this path is found, the node transmits its status to its parent through it; if no path can be found, the parent is considered to be in crash fault.
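The child-side reaction to a disconnection can be sketched as follows; the function names and the reachability predicate are illustrative assumptions:

```python
def report_via_siblings(siblings, can_reach_parent):
    """Ask each sibling in turn to relay this node's status to the parent.
    Returns ("relayed", sibling) on success; if no sibling can reach the
    parent, the parent is declared crashed."""
    for sibling in siblings:
        if can_reach_parent(sibling):      # a relay path exists
            return ("relayed", sibling)
    return ("parent_crash", None)
```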


E. Tree Restructuring
Changes in the structure of the tree, mostly due to faults tolerated by the distribution technique and to the insertion of new leaves, make the tree unbalanced compared to its initial state, where the children are distributed equitably at all levels. We therefore support our model with a tree restructuring method, managed by the root: the user defines the number of children for every parent (F) and, according to the number of available nodes in the tree (Y), we calculate the number of levels by the formula L = floor(log_F(Y(F - 1) + 1)) - 1 (this formula is deduced from the function A(L, F)). The tree restructuring is costly; it is initiated only when necessary.
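The level-count formula inverts A(L, F). The loop below computes the same value with integer arithmetic, avoiding floating-point logarithms (an implementation choice of this sketch, not of the paper):

```python
def levels_for(available_nodes: int, fanout: int) -> int:
    """Largest L such that A(L, fanout) <= available_nodes, i.e.
    L = floor(log_F(Y*(F - 1) + 1)) - 1, computed without floats."""
    level, total, layer = 0, 1, 1          # start with the root alone
    while total + layer * fanout <= available_nodes:
        layer *= fanout                    # width of the next level down
        total += layer
        level += 1
    return level
```

With 15 available nodes and fanout 2 the tree has 3 levels below the root; with 13 nodes and fanout 3 it has 2, matching the two models used in Section VII.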

Figure 5: Swapping a node in the intermediate level. Assuming that N0,j,1 is the least loaded leaf, it replaces the failed node and its jobs (Job1 to Job4) are tolerated on its sibling leaves.

VI. IMPLEMENTATION
A. DHM-FTGrid service architecture
We have integrated our grid service DHM-FTGrid into Globus GT4. DHM-FTGrid is composed of 4 basic modules (see Figure 6):
1. DBTFjob: the database, which contains the tables db_children, db_family and db_jobs.
2. Job Manager: it handles the job from its submission on the grid until its assignment to a leaf; once the job is finished, it transmits the results. It is responsible for updating the DBTFjob tables.
3. Fault Detector: each manager detects a fault of its children through periodic job status messages; in the absence of a status message, the manager sends a ping to the child, and if it receives no response, a fault is detected on this node and its status is transmitted to the Fault Manager.
4. Fault Manager: it is responsible for tolerating the failed node by applying the algorithms explained below; it works in conjunction with the Job Manager, which redistributes the jobs of failed nodes locally or at a higher level.
Our model is composed of a set of hierarchical levels. The leaf level (L0) executes jobs and has no fault tolerance task; fault tolerance is installed differently at the first level (L1), at the intermediate levels (Li) and at the root.
Figure 6: DHM-FTGrid architecture. The user submits a job to the Job Manager (which submits and monitors jobs) and receives the results; the Job Manager works in conjunction with the Fault Manager, the Fault Detector and the DBTFjob database.

B. Level L1
Data structure: We use at every level of the tree three tables, db_children, db_family and db_jobs, defined as follows:
Table db_children is composed of the attributes id_n0, ip_n0, nb_job, size_list and size_list_free:
- id_n0: node identifier.
- ip_n0: IP address of the node.
- nb_job: number of jobs.
- size_list: the maximum size of the queue.
- size_list_free: free size of the queue.
Table db_family has the attributes id_node, ip_node and type:
- id_node: node identifier.
- ip_node: IP address of the node.
- type: the type of the node, either parent or sibling.
Table db_jobs is composed of the attributes id_n0, id_job, job, job_state, duplique and tolerated:
- id_job: job identifier.
- job: contains the parameters executable, argument, stdout and stderr.
- job_state: the status of the job, either Active (in execution), Failed (down), Pending or Done.
- duplique: whether the job duplication is passive or active.
- tolerated: indicates whether the job is tolerated or not.
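For illustration only, the three tables can be expressed as SQLite DDL. The column names follow the text above; the SQL types, the CHECK constraints and the choice of SQLite are assumptions of this sketch, not the paper's schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE db_children (
    id_n0 TEXT PRIMARY KEY, ip_n0 TEXT,
    nb_job INTEGER, size_list INTEGER, size_list_free INTEGER);
CREATE TABLE db_family (
    id_node TEXT PRIMARY KEY, ip_node TEXT,
    type TEXT CHECK (type IN ('parent', 'sibling')));
CREATE TABLE db_jobs (
    id_n0 TEXT, id_job TEXT PRIMARY KEY, job TEXT,
    job_state TEXT CHECK (job_state IN ('Active', 'Failed', 'Pending', 'Done')),
    duplique TEXT, tolerated INTEGER);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```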
When a node at level L1 detects a fault of one of its children, it launches the fault tolerance process in two phases: first, it selects the child's jobs from the table db_jobs and runs a local fault tolerance by distributing these jobs over its children according to their loads; then, if some jobs still cannot be placed, it transmits them to the next level (L2).
C. Intermediate levels (Li)
Data structures: Each of the tables db_children and db_jobs is an aggregation of the data stored in the tables of level (Li-1), with an added attribute indicating the ID of each node of level (Li-1).
Fault tolerance: When the fault tolerance manager at level (Li) detects a fault of one of its children, the fault tolerance technique proceeds as follows (see Algorithm 1):
1. The manager counts the siblings of the failed node.
2. If this number is greater than or equal to 2, it uses the distribution method (see Algorithm 2).
3. Otherwise, it uses the swapping method (see Algorithm 3).
4. It sends the state to its parent.
Algorithm 1: TF_Ni(db_children, db_jobs, ID_Ni)
Begin
  db_children.first();
  While not db_children.eof() do
    ID = db_children.ID_Ni-1;
    If (ping(ID_Ni, ID) == false) then
      wait_result();
      If (wait_result() == false) then
        K = search_nbfailed_sibling(db_children);
        If (K >= 2) then
          distribute_failed_children(ID, K, db_children, db_jobs);
        Else
          swap_failed(ID, db_children);
        End if
      End if
    Else
      If completed_job(ID, List_job) then
        update db_jobs set job_state = "Done" where id_job in List_job[id_job];
      End if
      If completed_job(id_ni+1, List_job) then
        update db_jobs set job_state = "Done" where id_job in List_job[id_job];
      End if
      send_state_job_Done(ID);
    End if
    db_children.next();
    send_state_Ni_to_Ni+1();
  End while
End.
Algorithm 2: distribute_failed_children(ID, K, db_children, db_jobs)
Begin
  failed_node = [select * from db_children where id_ni-1 == ID];
  F = [select id_ni-1 from db_children where id_ni-1 != ID];
  L = failed_node.count;
  N = L div K;
  M = L mod K;
  failed_node.first();
  If N != 0 then
    F.first();
    For (s = 1; s <= F.length; s++) do
      For (r = 1; r <= N; r++) do
        failed_node[id_ni-1] = F[s];
        update db_children set id_ni-1 = failed_node[id_ni-1] where id_n0 == failed_node[id_n0];
        update db_jobs set id_ni-1 = failed_node[id_ni-1] where id_n0 == failed_node[id_n0];
        failed_node.next();
      End for
    End for
  End if
  For (r = 1; r <= M; r++) do
    failed_node[id_ni-1] = F[r];
    update db_children set id_ni-1 = failed_node[id_ni-1] where id_n0 == failed_node[id_n0];
    update db_jobs set id_ni-1 = failed_node[id_ni-1] where id_n0 == failed_node[id_n0];
    failed_node.next();
  End for
  Send_state_Ni_to_Ni+1();
End.
Algorithm 3: swap_failed(ID, db_children)
Begin
  failed_children = [select * from db_children where id_ni-1 == ID];
  J = [select id_n0 from failed_children where nb_job in (select min(nb_job) from failed_children)];
  update db_children set id_ni-1 = J where id_ni-1 == ID;
  update db_children set id_fils = ID where id_fils == J;
  Toler_job(J, nb_job, db_jobs, ID, id_ni);
End.

D. Root Level
The data structures used in the root are the same as those used at level (Li), except that the table db_family does not exist, and the fault tolerance technique remains the same as at the intermediate levels, except that the root does not have a higher level.
E. Duplicate Root
The DR is responsible for detecting root faults. In the case of a disconnection fault, the DR receives the root's updates through one of its siblings; if the root fault is a crash, the DR replaces the root, selects a leaf to replace itself as DR, and tolerates the jobs of this leaf.
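A sketch of the duplicate-root swap just described; the state dictionary and all names are assumptions of this illustration. On a root crash, the DR becomes the new root, the least loaded leaf is promoted to be the new DR, and that leaf's jobs must be tolerated elsewhere:

```python
def promote_duplicate_root(state):
    """Swap the DR in for a crashed root and refill the DR slot with the
    least loaded leaf, returning the jobs that must be re-placed."""
    leaves = sorted(state["leaves"], key=lambda leaf: leaf["nb_job"])
    new_dr = leaves[0]                 # least loaded leaf replaces the DR
    return {"root": state["dr"],       # the DR takes over as root
            "dr": new_dr["id"],
            "tolerated_jobs": new_dr["jobs"]}

state = {"dr": "DR",
         "leaves": [{"id": "L1", "nb_job": 2, "jobs": ["j1", "j2"]},
                    {"id": "L2", "nb_job": 1, "jobs": ["j3"]}]}
result = promote_duplicate_root(state)
```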
VII. EXPERIMENTATION
The aim of the experiments conducted on this model is to show the contribution rate of each level of the tree in the fault tolerance process and to identify the most used fault tolerance technique (distribution or swapping). To evaluate the performance of our model, we deployed DHM-FTGrid over Globus GT4 on Pentium 4 machines (2.8 GHz, 160 GB disk, 1 GB RAM). Our system operates under a hierarchical architecture using two tree models:
1. The 8/4/2 model uses 15 nodes, distributed as follows: 8 leaves, 4 nodes in L1, 2 in L2 and a root.
2. The 9/3 model uses 13 nodes, distributed as follows: 9 leaves, 3 nodes in L1 and a root.
A. Model 8/4/2
The first series of experiments uses the 8/4/2 architecture, where we increased the number of jobs from 5 to 60; each node has a queue of 5 jobs. We noted the following conclusions:
1. The levels of tolerance are related to the number of jobs in the grid (see Figure 7).
2. The tolerance techniques (distribution and swapping) are related to the number of levels and children.
3. In our experiments, the swapping method was used most often because of the reduced number of children (see Figure 8).
4. The model keeps failed jobs waiting at the root when the number of jobs submitted to the grid exceeds the size of the grid by 250%.

Figure 7. Levels of tolerance


Figure 8. Fault tolerance techniques

B. Model 9/3
The second series of experiments uses the 9/3 architecture, where we increased the number of jobs from 5 to 50; each node has a queue of 5 jobs. We noted the following conclusions:
1. The levels of tolerance are related to the number of jobs in the grid (see Figure 10).
2. The tolerance techniques (distribution and swapping) are related to the number of levels and children (see Figure 9).
3. In our experiments, the distribution method was used most often because of the higher number of children.
4. The model keeps failed jobs waiting at the root when the number of jobs submitted to the grid exceeds the size of the grid by 133%.

Figure 9. Fault tolerance techniques


Figure 10. Levels of tolerance

VIII. CONCLUSION
In this paper, we have proposed a fault tolerance model adapted to grid computing that takes into account the dynamic nature of the resources, the scalability and the heterogeneity of the grid. This model is completely independent of any physical architecture. We model the grid as a dynamic virtual tree composed of a root for the grid, a set of intermediate levels and a leaf level designated to run jobs. We presented two fault tolerance mechanisms based on distribution and swapping, and we implemented DHM-FTGrid, a fault tolerance grid service, over Globus GT4. We observe that the dynamicity of the tree structure responds appropriately to the dynamic nature of grid resources. The distribution technique, adopted to tolerate faults at the intermediate levels, keeps jobs in their leaves and reconnects the children of failed nodes to the siblings of their parents without any replication.
REFERENCES
[1] Foster, I., Kesselman, C. (eds.): The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, San Francisco, Calif., 1999, 677 pages.
[2] Lee, H.-M., Chung, K.-S., Jin, S., Lee, D.-W., Lee, W.-G., Jung, S., Yu, H.-C.: A fault tolerance service for QoS in grid computing. In: LNCS, pp. 286-296, Springer-Verlag, 2003.
[3] Garg, R., Singh, A. K.: Fault tolerance in grid computing: state of the art and open issues. International Journal of Computer Science & Engineering Survey (IJCSES), Vol. 2, No. 1: 88-97, Feb 2011.
[4] Siva Sathya, S., Syam Babu, K.: Survey of fault tolerant techniques for grid. Computer Science Review, Vol. 4, No. 2: 101-120, 2010.
[5] Dabrowski, C.: Reliability in grid computing systems. Concurrency and Computation: Practice and Experience, Vol. 21, No. 8: 927-959, DOI: 10.1002/cpe.1410, 2009.
[6] Jin, H., Shi, X., Qiang, W., Zou, D.: DRIC: Dependable Grid Computing Framework. IEICE Transactions on Information and Systems, Vol. E89-D, No. 2: 612-623, February 2006. doi:10.1093/ietisy/e89-d.2.612.
[7] Jafar, S., Krings, A., Gautier, T.: Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing. IEEE Transactions on Dependable and Secure Computing, Vol. 6, No. 1: 32-44, January-March 2009.
[8] Lac, C., Ramanathan, S.: A Resilient Telco Grid Middleware. In: Proceedings of the 11th IEEE Symposium on Computers and Communications (ISCC'06), IEEE Computer Society Press, Los Alamitos, CA, pp. 306-311, June 2006.
[9] Jiang, C., Xu, X., Wan, J.: Replication Based Job Scheduling in Grids with Security Assurance. In: Proceedings of the Third International Symposium on Electronic Commerce and Security Workshops (ISECS 10), Guangzhou, P. R. China, pp. 156-159, July 29-31, 2010.
[10] Sangho, Y., Derrick, K., Bongjae, K., Geunyoung, P., Yookun, C.: Using Replication and Checkpointing for Reliable Task Management in Computational Grids. In: International Conference on High Performance Computing and Simulation (HPCS), pp. 125-131, France, 2010.
[11] Huedo, E., Montero, R., Llorente, I.: Evaluating the reliability of computational grids from the end user's point of view. Journal of Systems Architecture, 52 (12): 727-736, 2006.
[12] Olteanu, A., Pop, F., Dobre, C., Cristea, C.: Re-scheduling and error recovering algorithm for distributed environments. U.P.B. Scientific Bulletin, Series C, Vol. 73, Iss. 1: 27-38, 2011.
[13] Leyli, M. K., Maryam, E. F., Ali, G.: Reliable Job Scheduler using RFOH in Grid Computing. Journal of Emerging Trends in Computing and Information Sciences, Vol. 1, No. 1: 43-48, July 2010.
[14] Woo, N., Jung, H., Yeom, H. Y., Park, T., Park, H.: MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes. IEICE Transactions on Information and Systems, 87(7): 1820-1828, 2004.
[15] Karonis, N. T., Toonen, B., Foster, I.: MPICH-G2: A Grid-enabled implementation of the Message Passing Interface. Journal of Parallel and Distributed Computing, 63(5): 551-563, May 2003.
[16] Díaz, D., Pardo, X. C., Martín, M. J., González, P.: Application-Level Fault-Tolerance Solutions for Grid Computing. In: Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'08), IEEE Computer Society, Washington, USA, pp. 554-559, 2008.
[17] Garbacki, P., Biskupski, B., Bal, H. E.: Transparent Fault Tolerance for Grid Applications. In: Proceedings of the European Grid Conference (EGC 2005), Amsterdam, The Netherlands, pp. 671-680, 2005.
[18] Azzedin, F., Maheswaran, M.: Integrating trust into grid resource management systems. In: Proceedings of the International Conference on Parallel Processing (ICPP'02), IEEE Computer Society Press, Los Alamitos, pp. 47-54, 2002.
[19] Abawajy, J.: Fault-Tolerant Scheduling Policy for Grid Computing Systems. In: Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS'04), Santa Fe, New Mexico, pp. 238-244, 2004.
[20] Song, S., Hwang, K., Kwok, Y.: Trusted grid computing with security binding and trust integration. Journal of Grid Computing, 3: 53-73, 2005.
[21] Jiang, C., Wang, C., Liu, X., Zhao, Y.: A Fuzzy Logic Approach for Secure and Fault Tolerant Grid Job Scheduling. In: Autonomic and Trusted Computing, 4th International Conference (ATC 2007), Hong Kong, China, Lecture Notes in Computer Science, Volume 4610, Springer, pp. 549-558, July 11-13, 2007.
[22] Ranganathan, K., Foster, I. T.: Identifying Dynamic Replication Strategies for a High-Performance Data Grid. In: Proceedings of GRID 2001, pp. 75-86, 2001.
[23] Ranganathan, K., Foster, I.: Design and evaluation of dynamic replication strategies for a high performance data grid. In: International Conference on Computing in High Energy and Nuclear Physics, 2001.
[24] Kacsuk, P., Podhorszki, N., Kiss, T.: Scalable desktop grid system. In: High Performance Computing for Computational Science (VECPAR 2006), Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, pp. 27-38, 2007.
[25] Marosi, A., Gombas, G., Balaton, Z.: Secure application deployment in the hierarchical local desktop grid. In: Proceedings of the 6th Austrian-Hungarian Workshop on Distributed and Parallel Systems (DAPSYS 2006), pp. 145-154.
[26] Farkas, Z., Marosi, A. C., Kacsuk, P.: Job scheduling in hierarchical desktop grids. In: Remote Instrumentation and Virtual Laboratories, Springer US, pp. 79-97, 2010.
[27] Rebbah, M., Mokhtari, C., Khaldi, M., Bourasi, M. F., Smail, O.: Hierarchical model for fault tolerant grid computing over Globus Toolkit. In: International Congress on Models, Optimization and Security of Systems (ICMOSS 2010), Tiaret, May 2010.
[28] Yagoubi, B., Slimani, Y.: Task load balancing strategy for grid computing. Journal of Computer Science, 3 (3): 186-194, ISSN 1546-9239, 2007.
[29] Globus Toolkit, http://www.globus.org [Oct. 20, 2011].

