Keywords: Cloud Computing, Quality of Service, Dynamic Replication, Availability, Fault Tolerance, Latency, Semi-passive
Corresponding Author:
Akanksha Chandola,
Department of Computer Applications,
Kanya Gurukul Campus, Gurukul Kangri Vishwavidyalaya, Dehradun, 248001, India.
Email: chandola.akanksha@gmail.com
1. INTRODUCTION
Cloud computing offers a virtualized environment for dynamic computing over the Internet. Virtualization has changed the trend of computing, converging it into a virtual world through the latest technologies and firmware. We propose this fault tolerance system for consumer-centric, service-oriented cloud computing, to provide Quality of Service (QoS) for client-oriented applications.
Customer requirements in the cloud environment change with high frequency, so the system should adapt to these dynamic requirements and deliver higher QoS. A prerequisite for such systems is a dynamic replication strategy, in which replicas are created, deleted and managed automatically and the strategy can accommodate changes in user behavior [1].
Self-healing is a proactive method of tolerating failures in distributed systems through redundancy. Redundant application sets are formulated and kept on standby, ready to swap in and replace any failed instance. When a user requests a file, a major part of the bandwidth may be consumed in transferring the file from the server side to the client side, and the latency grows with the size of the files involved. Reducing access latency and bandwidth consumption is the main aim of replication [1]. With replication one can deliver a flexible, reliable and cost-efficient solution for virtual machines, covering disaster recovery and protection of data in the environment.
Virtualization is a foundational aspect of cloud as well as high-performance computing and helps deliver services on a pay-per-use basis. Whenever a VM is replicated, all the jobs residing inside it are replicated as well, providing two levels of reliability: one at the machine level and one at the processing level.
In this work we contribute a fault tolerance model using a replication strategy. DRHA-FT is analyzed against several other FT policies over properties such as availability, reliability, throughput and latency, and the analysis shows that our policy works quite efficiently and could be deployed for vital application computing in the cloud.
Journal homepage: http://iaesjournal.com/online/index.php/IJ-CLOSER
ISSN: 2089-3337
Our research work comprises the following contributions:
We propose a policy for proactive fault tolerance using a replication strategy. We select a proactive model with a message-passing approach and a replica manager that backs up jobs by cloning VMs.
We incorporate residency constraints for VM replicas, i.e., rules for where a VM replica should reside so that it can react to failure situations while balancing the availability and latency metrics.
We simulate our algorithm and generate results over variable failure rates.
The overall design of the fault tolerance mechanism is defined as:
<ft_unit> DRHA-FT = (DRHA, P, A), where P states all properties of the ft_unit and A is the set of its attributes.
<ft_unit> DRHA-FT = (DRHA, P = {throughput = 99.99%, availability = 0.9833, average latency = low}, A = {mechanism = semi_passive_replication, fault_model = crash fault, number_of_replica = 2k (initially two, and k dynamically generated/assigned)})
2. BACKGROUND
2.1. Weibull Failure Model
Failure events in our simulation are modeled with the Weibull distribution, whose probability density function is

f(T) = (β/η)(T/η)^(β-1) e^(-(T/η)^β), f(T) ≥ 0, T ≥ 0, β > 0, η > 0 (2)

where β is the shape parameter and η is the scale parameter.
Weibull distributions with shape parameter β < 1 have a failure rate inversely proportional to time, also known as infantile or early-life failures. When β is close to or equal to 1, a fairly constant failure rate is observed, which indicates useful life or random failures. For β > 1 the failure rate increases with time, which is known as wear-out failure [12]. The other parameter, the scale parameter η, stretches and skews the distribution range. Varying the values of β and η generated enough deviation in the input failure rate to test the reliability of our strategy.
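For instance, Weibull failure times driving such a simulation can be sampled as follows (a minimal sketch using NumPy; the parameter values and function name are illustrative, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def weibull_failure_times(beta, eta, n):
    """Sample n failure times from a Weibull distribution with
    shape parameter beta and scale parameter eta."""
    # numpy's weibull() samples with scale 1; multiply by eta to rescale.
    return eta * rng.weibull(beta, size=n)

# Illustrative sweep similar to the variation described in the text.
for beta in (0.5, 1.0, 2.0):
    times = weibull_failure_times(beta, eta=1.0, n=10000)
    print(f"beta={beta}: mean failure time = {times.mean():.3f}")
```

For β = 1 the distribution reduces to the exponential (constant failure rate), matching the "useful life" regime described above.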
2.2. Failure Tolerance Concept and System Design
Failures in a computing environment can lead to serious consequences, so with the emergence of the cloud many efforts have been made to tolerate these failures in a reactive, proactive or hybrid manner. In proactive FT, preventive measures are taken into account to avoid failure experiences so as to keep an application alive. In contrast, reactive FT tries, at the time of a failure, to recover the running application from the experienced failure [2]. Proactive FT involves the overhead of failure prediction; although there is no provision for checkpoint/restart, migration overhead, false alarms, and the availability and reallocation of resources add more to it. The failure prediction module plays a vital role in the proactive FT model; however, its accuracy is not part of our research work, and we have included probability-based data in our simulation.
Engelmann et al. [2] provided a foundation for proactive FT in HPC by identifying its architecture and classification. The presented architecture depends overall on a feedback-loop control mechanism, in which system and application health are monitored and, to avoid imminent application failure, preventive steps are taken by relocating the application from an unhealthy node to a healthy one. Thanda et al. [3] proved, in research work on cluster computing, that message logging is one of the most efficient ways to achieve fault tolerance, which also gives the Message Passing Interface an upper edge for deployment in high-performance computing environments.
2.3. Replication Technique and Brief Survey
Data redundancy is implemented in a system to provide availability, reliability, accessibility and QoS for critical application processes. Replication, as the name suggests, is all about making critical components or services redundant in order to achieve fault tolerance. While using a replication strategy, consistency between the primary components and their replicas is ensured using suitable protocols.
2.3.1 Types of Replication Techniques
Replication techniques are broadly classified into active and passive replication.
In the active replication technique, every replica behaves in the same manner: any request received is passed on to all replicas, and they work independently to complete it. Because all replicas process the same requests and hold the same system state, this is also known as the state-machine approach. This approach provides low response time, as processing continues even after the failure of a single replica; however, its resource consumption is high.
In the passive replication technique, one of the processing units is the primary replica, which is entitled to receive requests, process them and respond. During execution of a request, the other backup replicas interact with the primary replica for system state updates. In case of failure of the primary replica, one of the backup replicas takes over control. This approach reduces resource overhead but comes with a higher response time in case of failures. Passive replication has two variations: warm and cold passive replication. In warm passive replication, the other replicas are periodically updated with checkpoints of the state of the primary replica. In cold passive replication, no backup replica is launched until the primary is detected to have failed [7]. Apart from the above traditional types, these were further modified into semi-active and semi-passive replication.
The semi-active replication technique implements a nondeterministic computation model with the concept of a leader and its followers. In normal circumstances only the leader component provides output messages, although all replicas process the request. In the case of nondeterministic processing it is the leader that computes and passes the processing information on to the followers. This approach ensures a fast reaction in case of a crash fault, while not incurring a high cost in the event of an incorrect failure suspicion.
2.3.2 Replication-based Fault Tolerance
To obtain higher reliability and availability, the Byzantine Fault Tolerance (BFT) protocol presented in [8] is a powerful approach that works over an active replication strategy; however, it is too expensive for practical usage. It uses the 3f+1 replica concept (1 primary and the remainder backups) so as to tolerate f Byzantine faults.
However, the authors of [8] implemented the concept of leader-follower one-backup replication in the Low Latency Fault Tolerance (LLFT) system for distributed applications within a LAN. Using a message protocol it provides strongly consistent group membership, and under failure conditions it shows low-latency reconfiguration and recovery. It provides transparency for applications undergoing crash or timing faults; however, it cannot handle Byzantine faults.
In research on large-scale graph processing, Peng et al. [8] provided the concept of Imitator, a replication-based fault tolerance mechanism offering low overhead and fast crash recovery. The key concept of Imitator is to extend the replication mechanism with additional mirrors: replicas having direct interaction with the ongoing process, i.e., the master. In case of failure, the complete state of the master can be reconstructed from the states of its mirrors. The distribution of the mirrors contributes directly to the scalability of recovery, since the mirrors play a key role in reconstructing the master's state.
Our proposal is inspired by these replication-based fault tolerance mechanisms: the concept of two backups is implemented, with the initial state comprising a trio of master, mirror and replica, where the master is an active virtual machine/process, the mirror is a replica having direct interaction with the master, and the replica is another backup that stays idle until the master fails.
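The promotion chain of this trio can be sketched in a few lines of Python (a minimal illustration under our own naming, not the paper's implementation; checkpointing is simplified to a dictionary copy):

```python
from dataclasses import dataclass, field

@dataclass
class VM:
    name: str
    state: dict = field(default_factory=dict)

@dataclass
class FailoverChain:
    """Master-mirror-replica trio: the mirror tracks the master's state;
    the replica stays idle until a failure promotes the chain."""
    master: VM
    mirror: VM
    replica: VM
    _count: int = 3  # VMs created so far (VM1..VM3)

    def sync_mirror(self):
        # The mirror periodically receives a checkpoint of the master's state.
        self.mirror.state = dict(self.master.state)

    def on_master_failure(self):
        # Mirror (holding the last checkpoint) becomes master, the idle
        # replica becomes the mirror, and a fresh replica VM is generated.
        self.master = self.mirror
        self.mirror = self.replica
        self._count += 1
        self.replica = VM(f"VM{self._count}")
        return self.master

chain = FailoverChain(VM("VM1", {"job": "J1"}), VM("VM2"), VM("VM3"))
chain.sync_mirror()
new_master = chain.on_master_failure()
print(new_master.name, chain.mirror.name, chain.replica.name)  # VM2 VM3 VM4
```

This mirrors the scenario of Figure 2: each failure shifts the chain by one position and dynamically generates a new backup VM.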
2.4 Managing Replicas
A fault tolerance mechanism should ensure that, under faulty scenarios, the availability of job replicas remains high, since the mechanism is meant to tolerate failures proactively. Deploying replicas to specific locations therefore requires careful thought, as every scenario carries its own weight. Availability in cluster computing is defined at five levels of granularity: from level 5 down to level 1, the two replicas are made to reside in different data centers, different rooms, different racks, different servers, and on the same server, respectively. In broader terms, replicas can be placed on multiple machines in the same cluster, in multiple clusters within a data center, or across multiple data centers, in ascending order of failure independence and latency and descending order of bandwidth usage. Ravi et al. [5] found the availability of a replica to be highest when the replicas are located in different data centers.
Comparing semi-active and semi-passive replication, the former seems better in the context of availability, but on the other hand its resource overhead is much higher. Similarly, for faults that can spread, the replicas need to be distributed such that they are not affected together, keeping job availability optimal. We assume that a cluster comprises a single host, whereas in an actual computing environment a cluster can have any number of hosts, as a host represents a physical machine and a cluster is a logical combination of more than one machine. We opted for the latter replication strategy, semi-passive, making one replica reside on the same host in the same data center and the second replica reside on a different host, again in the same data center, thereby balancing the fault dependency, availability and latency factors.
Figure 2. (a) Initial setup with VM1 as the active VM, and VM2 and VM3 as its primary and secondary replicas, residing in the same and a different host respectively. (b) After the first failure, VM1 stops working, VM2 takes over as the master/active VM, and VM3 becomes the mirror, i.e., its replica. (c) The next failure leads to the generation of a new VM, say VM4, which acts as the mirror, i.e., the replica of VM3, now the master.
3.3 Placement Policy for Higher Availability
Initially a VM has two replicas: one residing on the same host, and a second located on a different host, again within the same data center. Moving within the data center improves availability but, on the other hand, increases the latency of triggering the replica and of creating a new replica in case of failure of the former. When a new replica is generated dynamically, it is also made to follow the above residency rule, locating it on another available host once two VM instances have failed on the same machine.
To improve the residency policy of VMs and their replicas, a constraint is added while triggering and initializing the mirror VMs: only two VM instances for the same computing scenario are allowed to be allocated on the same host. More precisely, if VM1 fails, its replica, which is placed on the same host, takes over. At the next failure, since the host is saturated with the processing of the same jobs assigned to these VMs, we switch to the next host, and the new VM executes on a different host.
Data published in [9] computes the overall availability of a replication strategy with respect to different placement schemes. The availability of a replica is highest when it resides in a different data center, moderate within the same group of machines, and lowest inside the same host. To balance latency against availability, we adopted the above constraint for the placement and regeneration of replicas.
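The two-instances-per-host constraint can be expressed as a small host-selection routine (an illustrative sketch with hypothetical names, not the WorkflowSim implementation):

```python
def place_replica(vm_group, host_of, hosts, max_per_host=2):
    """Pick a host for a new replica of vm_group, allowing at most
    max_per_host instances of the same computing scenario per host.
    host_of maps already-placed VM names to host names (illustrative)."""
    load = {}
    for vm, host in host_of.items():
        if vm.startswith(vm_group):
            load[host] = load.get(host, 0) + 1
    for host in hosts:
        if load.get(host, 0) < max_per_host:
            return host
    return None  # no host satisfies the constraint

hosts = ["host1", "host2"]
placed = {"jobA-vm1": "host1", "jobA-vm2": "host1"}  # host1 is saturated
print(place_replica("jobA", placed, hosts))  # host2
```

Keeping all candidate hosts inside the same data center, as the text describes, keeps the latency penalty of a failover bounded while still gaining host-level failure independence.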
// Initially all jobs are allocated to the first VM, whose status is already set to master VM
Input: hostList, vmList    Output: allocation of VMs
Integer i, j = 1
allocatedHost ← NULL
for i = 1 to vmList.size()
    if (i == 1) then
        vm_master ← vm_i            // first VM as master VM
        vm_mirror ← vm_(i+1)        // second VM as mirror VM
    else
        vm_rep ← vm_i               // next VM as replica VM, subject to dynamic
                                    // generation in case the number of failures exceeds one
    if host_j has enough resources for the VMs then
        host_j ← allocate vm_i and vm_(i+1)
    if (host_j == NULL) then
        move to the next host in hostList
return allocation
The function UpdateVmReplica updates the mirror VM status on successful completion of jobs at the master VM. InitilizeVmReplica is devised to stop the current master VM on encountering a failure and to initiate the mirror as the master and the replica as the new mirror VM. It also selects a new VM as the new replica, as per the residency policy, and returns the new master and mirror.
Algorithm 3: executejob (job, vm)
Input: job, vm    Output: job and VM status
bindjobtovm (job_i, vm_master)      // as soon as the job is bound to the VM it starts execution
if job execution fails
    vm_master.status ← fail         // set VM status to fail
    job_i.status ← fail             // set job status to fail
else
    vm_master.status ← success      // set VM status to success
    job_i.status ← success          // set job status to success
    buffer: vm_buf ← vm_master image    // save the VM instance to the buffer
return job and VM status
Algorithm 4: UpdateVmReplica (vm_mirror, buffer)
// update the mirror on successful completion of a job at the master VM
Input: buffer, vm    Output: VmUpdateAck
boolean VmUpdateAck ← false
if buffer != NULL
    vm_mirror ← buffer              // copy the buffer data to the mirror VM
    VmUpdateAck ← true
else
    msg ← "Buffer empty, no data for further update"
return VmUpdateAck
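Algorithms 3 and 4 can be exercised together in a small executable sketch (dictionary-based stand-ins for VMs and the buffer; all names are illustrative, not the simulator's API):

```python
def execute_job(job, vm_master, buffer):
    """Algorithm 3 sketch: run the job on the master VM; on success,
    checkpoint the master's image into the buffer."""
    try:
        job["result"] = job["work"]()          # bind the job to the VM and execute
        vm_master["status"] = job["status"] = "success"
        buffer["image"] = dict(vm_master)      # save the VM instance to the buffer
    except Exception:
        vm_master["status"] = job["status"] = "fail"
    return job["status"], vm_master["status"]

def update_vm_replica(vm_mirror, buffer):
    """Algorithm 4 sketch: copy the buffered master state to the mirror."""
    if buffer.get("image") is not None:
        vm_mirror.update(buffer["image"])      # save the buffer data to the mirror
        return True
    return False                               # buffer empty, nothing to update

master, mirror, buf = {"name": "VM1"}, {"name": "VM2"}, {}
execute_job({"work": lambda: 42, "status": None}, master, buf)
print(update_vm_replica(mirror, buf))  # True
```

Note that the buffer is written only on success, so the mirror never receives the state of a failed execution, matching the update path described above.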
InitilizeVmReplica (vm_master, vm_mirror, job)
Input: vm_master, vm_mirror, job    Output: vm_master, vm_mirror pair
ht_rep ← vm_rep.gethost()
rank ← vm_rep.getRank()             // get the rank of the replica VM
vm_rep ← vm_buf
vm_master ← vm_mirror               // the mirror VM takes over as master VM
vm_mirror ← vm_rep                  // the replica VM takes over as mirror VM
if rank % 2 != 0                    // rank, which is now the rank of the mirror, is odd
    new_vm ← get_vm(ht_rep)
    vm_rep ← new_vm                 // create the new replica on the same host
else                                // both slots on host ht_rep are already used
    new_vm ← get_vm(ht_rep++)       // get a new VM from the next host
    vm_rep ← new_vm                 // invoke this VM as the replica VM
return vm_master, vm_mirror
3.5 Low Latency
With the proliferation of on-demand computation in the cloud, speedy processing and minimized latency are becoming critical concerns for cloud providers, both to deliver improved QoS and to enable future expansion. Latency is quite complex to compute, as it is highly dependent on different infrastructural layers: Internet traffic in the distributed computing environment, abstraction at the virtualization level, priorities over SLA and QoS levels, and the high level of abstraction that hides from clients the locations of the data centers where their jobs are actually executed. All these factors directly influence the latency rate and complicate its exact computation.
The end-to-end latency Tlat, as specified in [6], is the sum of the processing time at the client Tclient, the processing time at the server Tserver, the delay at the client end Tdelayc, the delay at the server end Tdelays at the time of failure, and the overall message transmission time Tmsg:

Tlat = Tclient + Tserver + Tdelayc + Tdelays + Tmsg    (3)
At the server, a buffer is maintained as an intermediate entity to facilitate communication and periodic revision between the master and mirror VMs. Taking the faulty scenario into account, latency increases as master and mirror replacement takes place; the resulting latency adds a communication overhead to Tdelays and Tdelayc, redefining them. With f denoting the number of failures, the transmission delay at the server side for each transaction is represented as:

Tdelays = k [Ttobuffer + Tfrombuffer] + (k + f) [Ttrigmirror + Tcomtoclient]    (4)

Tdelayc = (1 + f) [Tcomtoserver]    (5)

where Ttrigmirror is the time required to trigger the mirror VM and pass execution control to it, and Tcomtoclient is the time to establish communication between the client and the new master VM. As there is a two-way communication link between client and server, for every failure a delay is also incurred at the client end in reconnecting to the server.
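Equations (3)-(5) can be evaluated numerically; the sketch below uses made-up unit time costs purely for illustration (f is the number of failures, as above):

```python
def end_to_end_latency(k, failures, t_client, t_server, t_msg,
                       t_tobuffer, t_frombuffer, t_trigmirror,
                       t_comtoclient, t_comtoserver):
    """Evaluate Tlat = Tclient + Tserver + Tdelayc + Tdelays + Tmsg, with the
    server/client delays redefined for k buffer transactions and a given
    number of failures. All time costs here are illustrative constants."""
    t_delays = k * (t_tobuffer + t_frombuffer) + \
               (k + failures) * (t_trigmirror + t_comtoclient)   # eq. (4)
    t_delayc = (1 + failures) * t_comtoserver                    # eq. (5)
    return t_client + t_server + t_delayc + t_delays + t_msg     # eq. (3)

# Failure-free vs one-failure scenario with the same unit costs.
costs = dict(t_client=1, t_server=2, t_msg=1, t_tobuffer=0.1, t_frombuffer=0.1,
             t_trigmirror=0.5, t_comtoclient=0.2, t_comtoserver=0.3)
print(end_to_end_latency(k=10, failures=0, **costs))
print(end_to_end_latency(k=10, failures=1, **costs))
```

Each additional failure adds one mirror-trigger plus one client/server reconnection cost, which is exactly the (k + f) and (1 + f) scaling in equations (4) and (5).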
In this era of fast and reliable communication, where data is transmitted at close to the speed of light, the overall raw transmission time is almost negligible. When we transmit data between two VMs residing in the same host or in different hosts, the transmission delays are analogous, making the choice imperceptible. Latency is directly proportional to the hop count in some cases, and the geographical locations of the data centers also have some significance. However, as we have not specified the zonal distribution or locations of the various data centers, we avoid a detailed discussion of this point. For our proposed policy we restrict the choice of VMs to the same data center, thereby imparting lower latency and delivering lesser communication delay and higher QoS at the client end.
4. SIMULATION AND DISCUSSION
Our experiment is simulated using WorkflowSim, an open-source toolkit that extends CloudSim. One hundred heterogeneous jobs are submitted for execution to the VM; as jobs execute successfully the VM replica is updated, whereas in case of failure the VM replica is triggered for further execution and the master VM is put into sleep mode.
By varying the Weibull parameters, the failure rate is oscillated across simulation runs, and the outputs are recorded for further analysis of throughput, availability, latency and power consumption.
4.1 Throughput
Job completion against the failure distribution is plotted in Figures 3(a), (b) and (c), where the number of VMs generated dynamically varies with the failure rate, as does the number of jobs assigned to the master VM. The Weibull distribution is the core concept behind failure generation; the shape and scale parameters are varied over 0.5, 1.0 and 2.0 (for both parameters), and we can see that for all running instances our proposed algorithm, DRHA-FT, which deploys the 2k-backup generation policy, comes out with one hundred percent throughput.
Figure 3 (a), (b) and (c). Throughput measurement for different failure rates, generated by variation of the Weibull parameters.
4.2 Optimized Availability Factor
A fault-tolerant mechanism has to be considered over several attributes in order to prove it better or optimal, and some of these attributes are interdependent, trading off against the others. Such is the behavior of latency and availability: the farther we move from the current working node in selecting its backup replica for higher availability, the more we trade off the latency factor. Balancing the two, we selected only two instances of a VM to run on the same host, and on the third failure we jump to the next host, again running only two instances on the selected host.
Availability values (normalized to 1) for replication techniques in different deployment scenarios are taken from [11]. Availability for the first failure is 0.9826, as the replica is stored in the same cluster, and for the next failure this factor rises to 0.9840, as the mirror replica is located in a different cluster. In this way the factor keeps alternating as the location of the mirror replica changes from the same cluster to a different one. As already mentioned, we define a cluster as comprising one host rather than a group of hosts. We restricted our placement to the same data center, as moving between data centers takes latency to another, higher level. As we worked with dual-core elements, the number of processors on a single host is only two; the same could be implemented for quad-core or octa-core processors as well.
The DRHA-FT policy is compared with a traditional policy, here named the Random policy, in which n VMs (here n is set to 10) work on two hosts; once all the VMs of the first host, i.e., n/2 VMs, are saturated, VMs from the next host are instantiated for execution.
Figure 4. Comparative analysis of the DRHA-FT residency policy with the traditional Random policy for availability of the system at times of failure.
Thus we can attain an optimal availability of 0.9833 with the residency policy in our DRHA-FT algorithm, and linear analysis shows that, with the passage of time and an increasing number of failures, the average availability is monotonically increasing. With this residency policy we raised the availability factor without compromising the latency rate, thereby holding the two in equilibrium.
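The 0.9833 figure is consistent with averaging the two alternating per-failure availabilities reported above (the per-failure values come from [11]; the averaging routine itself is our illustration):

```python
def average_availability(per_failure, n_failures):
    """Mean availability over n_failures when the placement policy cycles
    through the per-failure availability values (e.g. same-cluster 0.9826,
    then different-cluster 0.9840)."""
    seq = [per_failure[i % len(per_failure)] for i in range(n_failures)]
    return sum(seq) / len(seq)

print(round(average_availability([0.9826, 0.9840], 2), 4))  # 0.9833
```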
5. CONCLUSION
Fault tolerance mechanisms are always subject to evolution and development because of their dependence on numerous factors: QoS, SLA and energy consumption at the client end, and cost efficiency, robustness, scalability, etc., at the service provider's end. In the near future the same mechanism could be slotted in for hosts in clusters of varying size and scrutinized over more metrics such as reliability and power consumption. The computation solution can also be orchestrated with checkpoint updates at the task level. To meet changing requisitions and diversified environments, a fault tolerance mechanism should be scalable and robust, so we could also investigate modified policies over these important attributes. This proposed work has not envisaged the failure of a data center, which, however, can be handled using live migration of the current working VM to another healthy data center.
ACKNOWLEDGEMENTS
I would like to acknowledge my mentor and fellow researchers for their input in this research work.
REFERENCES
[1] B. Schroeder and G. A. Gibson, "A large-scale study of failures in high-performance computing systems," IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 4, pp. 337-350, Oct. 2010.
[2] K. Ranganathan and I. Foster, "Identifying dynamic replication strategies for a high-performance data grid," Lecture Notes in Computer Science, pp. 75-86, Jan. 2001.
[3] C. Engelmann, G. R. Vallee, T. Naughton, and S. L. Scott, "Proactive fault tolerance using preemptive migration," in Proc. 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing, 2009.
[4] K. V. Vishwanath and N. Nagappan, "Characterizing cloud computing hardware reliability," in Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10), 2010.
[5] R. Jhawar and V. Piuri, "Fault tolerance and resilience in cloud computing environments," in Computer and Information Security Handbook, pp. 125-141, 2013.
[6] W. Zhao, L. E. Moser, and P. M. Melliar-Smith, "End-to-end latency of a fault-tolerant CORBA infrastructure," Performance Evaluation, vol. 63, no. 4-5, pp. 341-363, May 2006.
[7] X. Defago, A. Schiper, and N. Sergent, "Semi-passive replication," in Proceedings of the Seventeenth IEEE Symposium on Reliable Distributed Systems, 1998.
[8] M. Castro and B. Liskov, "Practical Byzantine fault tolerance and proactive recovery," ACM Transactions on Computer Systems, vol. 20, no. 4, pp. 398-461, Nov. 2002.
[9] W. E. Smith, K. S. Trivedi, L. A. Tomek, and J. Ackaret, "Availability analysis of blade server systems," IBM Systems Journal, vol. 47, no. 4, pp. 621-640, 2008.
[10] R. Jhawar and V. Piuri, "Fault tolerance management in IaaS clouds," in Proc. 2012 IEEE First AESS European Conference on Satellite Telecommunications (ESTEL), Oct. 2012.
[11] D. S. Kim, F. Machida, and K. S. Trivedi, "Availability modeling and analysis of a virtualized system," in Proc. 15th IEEE Pacific Rim International Symposium on Dependable Computing, Nov. 2009.
[12] M. N. Sharif and M. N. Islam, "The Weibull distribution as a general model for forecasting technological change," Technological Forecasting and Social Change, vol. 18, no. 3, pp. 247-256, Nov. 1980.
[13] X. Defago, K. R. Mazouni, and A. Schiper, "Highly available trading system: Experiments with CORBA," in Middleware '98, pp. 91-104, 1998.
[14] A. Duminuco, E. Biersack, and T. En-Najjary, "Proactive replication in distributed storage systems using machine availability estimation," in Proceedings of the 2007 ACM CoNEXT Conference, 2007.
BIOGRAPHY OF AUTHORS
Akanksha Chandola is an Assistant Professor of Computer Applications at Amrapali Institute of Management and Computer Application, affiliated to Uttarakhand Technical University. She received her Masters in Computer Application from Hemwati Nandan Garhwal University (now a Central University), Srinagar Garhwal. Her current research interests include computer graphics, artificial neural networks, geographical information systems and algorithms. She has about 10 years of experience as faculty, including industrial exposure, and has several publications in national and international conferences.
Dr. (Prof.) Nipur Singh is a Professor at Gurukul Kangri Vishwavidyalaya, Haridwar, and currently Head of the Department of Computer Science at Kanya Gurukul Campus, Dehradun. She has about three decades of working experience as an academician. Her major research interests include wireless computing, distributed networking, interconnection networks, ad hoc networks, mobile agents and cloud computing. She has several publications in national and international conferences and journals, and books published in her name.
Ajay Rawat has eleven years of experience in academics and industry in India and abroad. He received the M.S. degree in Software from BITS Pilani, India, and is presently pursuing a Ph.D. in the Department of Computer Science & Engineering, Uttarakhand Technical University, Uttarakhand, India. He has worked with the Department of Computer Application of Graphic Era University, Dehradun, India, in the capacity of Assistant Professor, and as a software developer at NIIT Technologies, New Delhi, India. His areas of interest include cloud computing, fault tolerance, algorithms, etc. He has published various papers in national and international journals. He is certified in OpenStack Cloud.