2011 Seventh International Conference on Computational Intelligence and Security
A Markov Chain-based Availability Model of Virtual Cluster Nodes
Jianhua Che
Information & Network Security Key Lab. of State Grid State Grid Electric Power Research Institute Nanjing, China chejianhua@zju.edu.cn
Weimin Lin
Information & Network Security Key Lab. of State Grid State Grid Electric Power Research Institute Nanjing, China linweimin@sgepri.sgcc.com.cn
Tao Zhang
Information & Network Security Key Lab. of State Grid State Grid Electric Power Research Institute Nanjing, China zhangtao@sgepri.sgcc.com.cn
Houwei Xi
Information & Network Security Key Lab. of State Grid State Grid Electric Power Research Institute Nanjing, China xihouwei@sgepri.sgcc.com.cn
Abstract: Benefiting from virtualization technology, virtual cluster systems possess many advantages over traditional cluster systems. However, efficient methods for analyzing the availability of virtual cluster systems are still lacking, and the availability analysis of a virtual cluster node is the basis for analyzing the availability of the whole virtual cluster system. In this paper, we summarize a typical architecture paradigm of virtual cluster nodes by studying the overall architecture of virtual cluster systems and the deployment style of virtual cluster nodes, i.e., two active virtual cluster nodes built on one physical machine and their standby virtual cluster nodes built on another physical machine. We give its state transition diagram by analyzing the complete lifecycle of a virtual cluster node and the transition conditions between node states, and design a Markov Chain-based availability model for this typical architecture paradigm. The model characterizes the lifecycle states and state transitions of a virtual cluster node and provides an efficient method for understanding the availability level of each virtual cluster node in a complicated virtual cluster system or cloud data center. Finally, the practicability of the proposed model is demonstrated by numerical simulation experiments.
Keywords: virtual cluster node; availability modeling; Markov Chain; virtualization
I. INTRODUCTION
The resurgence of the virtual machine (VM) [13] provides a highly efficient solution for many IT service demands and important business applications, e.g., cloud computing [10] and the Internet Data Center (IDC) [14]. With its widespread application, the virtual machine has been introduced into the traditional cluster system, which gave birth to the virtual cluster system. A Virtual Cluster System (VCS) is a cluster system that installs cluster nodes in virtual machines and manages them with virtualization technology [6]. Compared to a traditional cluster system, a virtual cluster system offers higher resource utilization, lower standby cost, simpler management and a higher level of availability [11]. Along with these advantages, however, a virtual cluster system has several defects: e.g., the virtual machine monitor (VMM) [9] introduces some unsteady factors, and a virtual cluster node can become a Single Point of Failure (SPOF) [2]. How to evaluate the availability of virtual cluster systems is therefore becoming the focus of numerous researchers. Furthermore, the availability of the virtual cluster node is the basis for evaluating the availability of the virtual cluster system. As a virtual machine has many special features compared with a physical machine, the availability models of traditional cluster nodes are not suited to the availability analysis of virtual cluster nodes. Therefore, it is necessary to model the availability of virtual cluster nodes. In this paper, we propose a Markov Chain-based model for analyzing the availability level of virtual cluster nodes, and validate its practicability with numerical simulation experiments. Specifically, the contributions of this paper are as follows:
First, we summarize a typical architecture paradigm of virtual cluster nodes by studying the overall architecture of virtual cluster systems and the deployment style of virtual cluster nodes. Second, we describe the state transition diagram of this typical kind of virtual cluster node by analyzing its complete lifecycle states and the transition conditions between them. Third, we design an availability model based on Markov chain theory to analyze the availability level of virtual cluster nodes in a virtual cluster system or cloud data center.

The rest of this paper is organized as follows. Section II reviews related work. Section III introduces the proposed Markov Chain-based availability model for one typical paradigm of virtual cluster nodes. Section IV validates the practicability of the proposed model with several numerical simulation experiments. Finally, Section V concludes with a discussion.
II. RELATED WORK
Although the availability evaluation of traditional computer systems has been extensively studied, efficient methods for evaluating the availability of virtual cluster systems are still lacking. Johnson and Malek [7] gave an early survey of the availability analysis models and evaluation tools for traditional cluster systems, and introduced the availability model elements (including fault rate, recovery time and service cost) and analysis models (including fault tree, reliability diagram, Markov chain and stochastic Petri net). After introducing some basic concepts of availability, Alan Wood [1] explored the use of Markov models in availability analysis. Regarding the lifecycle states and availability level of virtual machines, Farr et al. [4] extended the virtual machine state types in the DMTF specification with two kinds of states: Active and Inactive. Herein, the Active state includes Operational and Gemini, and the Inactive state includes Planned and Unplanned. Hence, there are six kinds of virtual machine states: Latent, Defined, Active, Inactive, Paused and Suspend. Le et al. [8] studied fault injection in virtual machine systems and the application of virtual machines to fault injection in traditional computer systems. Qin and Xie [12] analyzed the mutual impact between the scheduling of multiple applications and availability in a heterogeneous system, and modeled all nodes of a heterogeneous system according to the computing power and availability data of every node. Brendan Cully et al. [3] presented a solution for building a general high-availability service framework with virtual machines and developed a prototype system, Remus. Werner Fischer and Christoph Mitasch [5] summarized the availability problems of virtual machine systems and gave a node architecture scheme that can increase the availability of a virtual cluster system. Thandar Thein and Jong Sou Park et al. [14] optimized the rejuvenation process and enhanced the fault tolerance of computer systems with virtualization and software rejuvenation, designed a framework that can increase the survivability of a distributed system, and clarified the relation between the availability of a virtual machine system and the number of backup virtual machines. Building on that work, Thandar Thein and Jong Sou Park et al. evaluated the availability of virtual cluster systems using software rejuvenation [17], gave a formal description of a multiple-virtual-machine system with a state transition diagram, and verified their work with numerical simulation experiments [16]. The work in this paper is based on Thandar Thein's work, but differs from it in model semantics and parameter definitions.

978-0-7695-4584-4/11 $26.00 © 2011 IEEE. DOI 10.1109/CIS.2011.118
III. MARKOV CHAIN-BASED AVAILABILITY MODEL OF VIRTUAL CLUSTER NODES
Availability may be analyzed with many models, for example, fault trees, reliability block diagrams, Markov chains and stochastic Petri nets. The Markov chain was first proposed by the Russian mathematician Andrey Markov in 1907, and is often used to model the availability of fault-tolerant computer systems, dynamically redundant computer systems, and computer systems with sequence-dependent faults and recovery. In the Markov chain model, the stochastic process of a running virtual cluster node is represented by a series of state transitions. Herein, each state of a virtual cluster node is a vertex of a state diagram, each transition between states is an edge, and the conditional probability of the transition is the weight of that edge. The conditional probability of a virtual cluster node moving from one state to the next is determined by the current state alone, and is independent of the earlier states. In addition, the main solution methods for Markov chain models include the steady-state solution, the transient solution and the decomposition method.
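To make the vertex/edge/weight description concrete, the following minimal sketch iterates a two-state discrete-time chain until its distribution stops changing, which approximates the steady-state solution. The states `Up` and `Down` and all probabilities are hypothetical values chosen for illustration; they are not taken from the paper.

```python
from typing import List

# Hypothetical two-state chain (illustrative values only, not from the paper).
STATES: List[str] = ["Up", "Down"]

# P[i][j] is the conditional probability of moving from state i to state j
# in one step; it depends only on the current state i (Markov property).
P: List[List[float]] = [
    [0.99, 0.01],  # Up   -> Up, Down
    [0.50, 0.50],  # Down -> Up, Down
]


def step(dist: List[float], matrix: List[List[float]]) -> List[float]:
    """Advance a state distribution by one transition (a transient step)."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def steady_state(matrix: List[List[float]], tol: float = 1e-12,
                 max_iter: int = 100_000) -> List[float]:
    """Approximate the steady-state distribution by repeated stepping."""
    n = len(matrix)
    dist = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = step(dist, matrix)
        if max(abs(a - b) for a, b in zip(nxt, dist)) < tol:
            return nxt
        dist = nxt
    return dist


pi = steady_state(P)
for name, prob in zip(STATES, pi):
    print(f"{name}: {prob:.6f}")
```

Calling `step` once from a known starting distribution gives a transient solution, while `steady_state` gives the long-run share of time spent in each state.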
A. One Typical Paradigm Of Virtual Cluster Nodes
In a traditional cluster system, all nodes are built on physical machines, and almost every active node has a corresponding standby node to improve its availability. In a virtual cluster system, however, many or even all nodes, including active nodes and standby nodes, may be built in virtual machines. One typical paradigm of virtual cluster nodes (we call it 2VNs/2PMs) is that two active nodes are built in two virtual machines residing on one physical machine, and their standby nodes are built in two other virtual machines residing on another physical machine, as shown in figure 1.
Figure 1. One typical fundamental paradigm of virtual cluster nodes
This paradigm not only increases the resource utilization of virtual cluster nodes, but also reduces their standby cost. At the same time, it relieves the problem of Single Point of Failure (SPOF) to a certain extent. Hence, this paradigm is widely used.
B. Basic Concept and Prerequisite Condition
During the complete lifecycle of a virtual cluster node, there are usually five main states: Normal, Unsteady, Rejuvenation, Switchover and Failure. Herein, Normal means that the node is in its normal working stage; Unsteady means that the node is in an abnormal and unsteady stage in which its service is still available but with decreased performance; Rejuvenation means that the node is in transition from the Unsteady state back to the Normal state; Switchover means that a node in the Unsteady state is switching to its standby node because of unrecovered faults; and Failure means that the node has stopped working because of a failure. A node in the Normal state may go into the Unsteady state after running for some time. A node in the Unsteady state has three possible next states: Rejuvenation, Switchover and Failure. A recoverable node goes into the Rejuvenation state, whereas an unrecoverable node goes into the Switchover state, and then into the Failure state after its runtime context and application workload have been migrated to the standby virtual cluster node by live migration. In addition, a node that has no time to migrate its runtime context and application workload because of a sudden, unexpected fault goes directly into the Failure state. A node in the Rejuvenation state returns to the Normal state through software rejuvenation. A node in the Switchover state eventually fails, and its service is then provided by its standby virtual cluster node, which has the same states and state transitions. The state transition diagram of virtual cluster nodes is shown in figure 2.
Figure 2. The state transition of virtual cluster nodes
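The lifecycle described above can be written down as a small directed graph. The sketch below is one possible encoding, not the authors' implementation: the `Failure` to `Normal` edge is taken from the parameter table (a failed node is eventually repaired), and the helper names `is_valid_lifecycle` and `reachable` are hypothetical.

```python
from typing import Dict, List, Set

# Allowed transitions between the five lifecycle states of one node.
# Normal -> Failure is omitted because the model treats it as negligible.
TRANSITIONS: Dict[str, List[str]] = {
    "Normal": ["Unsteady"],
    "Unsteady": ["Rejuvenation", "Switchover", "Failure"],
    "Rejuvenation": ["Normal"],
    "Switchover": ["Failure"],
    "Failure": ["Normal"],  # repair edge, assumed from the parameter table
}


def is_valid_lifecycle(path: List[str]) -> bool:
    """Return True if every consecutive pair of states is an allowed edge."""
    return all(b in TRANSITIONS[a] for a, b in zip(path, path[1:]))


def reachable(start: str) -> Set[str]:
    """Return every state reachable from `start` (including itself)."""
    seen = {start}
    stack = [start]
    while stack:
        state = stack.pop()
        for nxt in TRANSITIONS[state]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


print(is_valid_lifecycle(["Normal", "Unsteady", "Rejuvenation", "Normal"]))
print(is_valid_lifecycle(["Normal", "Failure"]))
print(sorted(reachable("Normal")))
```

Because every state is reachable from every other state in this graph, the chain is irreducible, which is what allows a unique steady-state distribution to exist.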
Based on the above analysis, we define the model parameters as shown in Table 1. According to these definitions, the rejuvenation time of a virtual cluster node in the Rejuvenation state is $1/\gamma$, and the switchover time of a virtual cluster node in the Switchover state to migrate to its standby node is $1/\delta$. The model parameters are subject to the following prerequisite conditions and hypotheses:
TABLE I. THE DEFINITION OF MARKOV CHAIN-BASED AVAILABILITY MODEL PARAMETERS

| Parameter | The definition of model parameters |
|---|---|
| $\pi_N$ | The time ratio of a VCN staying in Normal state |
| $\pi_U$ | The time ratio of a VCN staying in Unsteady state |
| $\pi_R$ | The time ratio of a VCN staying in Rejuvenation state |
| $\pi_S$ | The time ratio of a VCN staying in Switchover state |
| $\pi_F$ | The time ratio of a VCN staying in Failure state |
| $\eta$ | The frequency of a VCN changing from Normal to Unsteady state |
| $\varepsilon$ | The probability of a VCN changing from Unsteady to Normal state |
| $\gamma$ | The probability of a VCN changing from Rejuvenation to Normal state |
| $\tau$ | The probability of a VCN migrating from active to standby node |
| $1/\delta$ | The time a VCN takes to migrate from Switchover state to its standby node ($\delta$ is the corresponding frequency) |
| $\lambda$ | The probability of a VCN changing from Unsteady to Failure state |
| $\mu$ | The frequency of a VCN changing from Failure to Normal state |
• The $\eta$, $\varepsilon$, $\gamma$, $\tau$, $\delta$, $\lambda$ and $\mu$ of all virtual cluster nodes are the same and steady;
• Compared to the other probabilities, the probability of a virtual cluster node transiting directly from the Normal state to the Failure state can be neglected;
• A virtual cluster node can still provide continuous service during the rejuvenation process.
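Under these assumptions, the single-node chain of figure 2 can be solved numerically. The sketch below builds a generator matrix for the five states and solves the steady-state equations with plain Gaussian elimination. It is an illustration only: all seven parameters are treated as transition rates in the same time unit (per hour), the numeric values are hypothetical, and the function names are not from the paper.

```python
from typing import Dict, List

STATES = ["N", "U", "R", "S", "F"]  # Normal, Unsteady, Rejuvenation, Switchover, Failure


def generator(eta: float, eps: float, gamma: float, tau: float,
              delta: float, lam: float, mu: float) -> List[List[float]]:
    """Build the generator matrix Q; q[i][j] is the rate from state i to j."""
    idx = {s: i for i, s in enumerate(STATES)}
    n = len(STATES)
    q = [[0.0] * n for _ in range(n)]
    edges = [("N", "U", eta), ("U", "R", eps), ("U", "S", tau), ("U", "F", lam),
             ("R", "N", gamma), ("S", "F", delta), ("F", "N", mu)]
    for src, dst, rate in edges:
        q[idx[src]][idx[dst]] += rate
        q[idx[src]][idx[src]] -= rate  # each row of Q sums to zero
    return q


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def steady_state(q: List[List[float]]) -> Dict[str, float]:
    """Solve pi Q = 0 together with sum(pi) = 1."""
    n = len(q)
    a = [[q[i][j] for i in range(n)] for j in range(n)]  # transpose of Q
    a[-1] = [1.0] * n  # replace one balance equation by the normalization
    b = [0.0] * (n - 1) + [1.0]
    return dict(zip(STATES, solve(a, b)))


# Hypothetical per-hour rates, for illustration only.
params = dict(eta=0.003, eps=0.75, gamma=6.0, tau=0.24,
              delta=600.0, lam=0.0002, mu=0.003)
pi = steady_state(generator(**params))
availability = 1.0 - pi["S"] - pi["F"]  # unavailable only in Switchover and Failure
print({s: round(p, 6) for s, p in pi.items()}, round(availability, 6))
```

The mean time spent in the Rejuvenation and Switchover states is `1 / gamma` and `1 / delta` respectively, matching the definitions above.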
C. State Transition Diagram and Analysis Model
For two physical machines that each host two virtual cluster nodes, one physical machine is the standby machine of the other. When an active virtual cluster node hosted by the active physical machine fails, its runtime context and application workload are migrated to the standby virtual cluster node hosted by the standby physical machine. When all virtual cluster nodes hosted by a physical machine fail, the physical machine fails.

The state transition diagram of the 2VNs/2PMs paradigm is shown in figure 3. Herein, N denotes the Normal state, U denotes the Unsteady state, R denotes the Rejuvenation state, S denotes the Switchover state, and F denotes the Failure state. In addition, the suffixes 1 and 2 denote the number of virtual cluster nodes hosted by the same physical machine, the suffix A means that the virtual cluster node is an active one, and the suffix S means that it is a standby one.
Figure 3. The state transition diagram of the 2VNs/2PMs paradigm
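According to the hypotheses in the previous section, the balance equations of the state transition diagram are as follows:

$$\eta\,\pi_{N_{1A}} = (\lambda + \varepsilon + \tau)\,\pi_{U_{1A}} \tag{1}$$

$$\varepsilon\,\pi_{U_{1A}} = \gamma\,\pi_{R_{1A}} \tag{2}$$

$$\eta\,\pi_{N_{1S}} = \gamma\,\pi_{R_{1S}} + \delta\,\pi_{S_{1A}} + \mu\,\pi_{F_S} \tag{3}$$

$$\eta\,\pi_{N_{1S}} = (\lambda + \varepsilon + \tau)\,\pi_{U_{1S}} \tag{4}$$

$$\varepsilon\,\pi_{U_{1S}} = \gamma\,\pi_{R_{1S}} \tag{5}$$

$$\tau\,\pi_{U_{1S}} = \delta\,\pi_{S_{1S}} \tag{6}$$

$$\eta\,\pi_{N_{2A}} = \lambda\,\pi_{U_{1A}} + \gamma\,\pi_{R_{2A}} + \delta\,\pi_{S_{2S}} \tag{7}$$

$$\eta\,\pi_{N_{2A}} = (\lambda + \varepsilon + \tau)\,\pi_{U_{2A}} \tag{8}$$

$$\varepsilon\,\pi_{U_{2A}} = \gamma\,\pi_{R_{2A}} \tag{9}$$

$$\tau\,\pi_{U_{2A}} = \delta\,\pi_{S_{2A}} \tag{10}$$

$$\eta\,\pi_{N_{2S}} = \lambda\,\pi_{U_{1S}} + \gamma\,\pi_{R_{2S}} + \delta\,\pi_{S_{2A}} \tag{11}$$

$$\eta\,\pi_{N_{2S}} = (\lambda + \varepsilon + \tau)\,\pi_{U_{2S}} \tag{12}$$

$$\varepsilon\,\pi_{U_{2S}} = \gamma\,\pi_{R_{2S}} \tag{13}$$

$$\tau\,\pi_{U_{2S}} = \delta\,\pi_{S_{2S}} \tag{14}$$

$$\lambda\,\pi_{U_{2A}} = \mu\,\pi_{F_A} \tag{15}$$

$$\lambda\,\pi_{U_{2S}} = \mu\,\pi_{F_S} \tag{16}$$

By summing all state probabilities, the conservation equation of the state transition diagram is as follows:

$$\sum_{i=1}^{2}\pi_{N_{iA}} + \sum_{i=1}^{2}\pi_{U_{iA}} + \sum_{i=1}^{2}\pi_{R_{iA}} + \sum_{i=1}^{2}\pi_{S_{iA}} + \sum_{i=1}^{2}\pi_{N_{iS}} + \sum_{i=1}^{2}\pi_{U_{iS}} + \sum_{i=1}^{2}\pi_{R_{iS}} + \sum_{i=1}^{2}\pi_{S_{iS}} + \pi_{F_A} + \pi_{F_S} = 1 \tag{17}$$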
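As a worked step (a sketch derived only from the balance equations above, not a result stated by the authors), the Normal, Rejuvenation and Switchover probabilities of the two-node states can be eliminated so that only Unsteady-state probabilities remain. Equating the two expressions for $\eta\,\pi_{N_{2A}}$ and substituting $\gamma\,\pi_{R_{2A}} = \varepsilon\,\pi_{U_{2A}}$ and $\delta\,\pi_{S_{2S}} = \tau\,\pi_{U_{2S}}$ gives:

$$(\lambda + \varepsilon + \tau)\,\pi_{U_{2A}} = \lambda\,\pi_{U_{1A}} + \varepsilon\,\pi_{U_{2A}} + \tau\,\pi_{U_{2S}} \;\Longrightarrow\; (\lambda + \tau)\,\pi_{U_{2A}} = \lambda\,\pi_{U_{1A}} + \tau\,\pi_{U_{2S}}$$

The same substitution applied to the two expressions for $\eta\,\pi_{N_{2S}}$, with $\gamma\,\pi_{R_{2S}} = \varepsilon\,\pi_{U_{2S}}$ and $\delta\,\pi_{S_{2A}} = \tau\,\pi_{U_{2A}}$, gives:

$$(\lambda + \tau)\,\pi_{U_{2S}} = \lambda\,\pi_{U_{1S}} + \tau\,\pi_{U_{2A}}$$

Eliminating $\pi_{U_{2S}}$ between these two relations, and assuming $\lambda > 0$, leaves a single relation among three Unsteady-state probabilities:

$$(\lambda + 2\tau)\,\pi_{U_{2A}} = (\lambda + \tau)\,\pi_{U_{1A}} + \tau\,\pi_{U_{1S}}$$

This is why the remaining probabilities can all be expressed as multiples of one reference probability such as $\pi_{U_{1S}}$.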
Combining the above balance equations with the conservation equation and solving them yields the following expressions for the state probabilities:

$$\pi_{U_{1A}} = \frac{\lambda + 2\tau + 2}{2(\lambda + \tau)}\,\pi_{U_{1S}} \tag{18}$$

$$\pi_{U_{2A}} = \frac{\lambda + 4\tau + 2}{2(\lambda + 2\tau)}\,\pi_{U_{1S}} \tag{19}$$

$$\pi_{U_{2S}} = \frac{2\lambda(\lambda + 2\tau) + \tau(\lambda + 4\tau + 2)}{2(\lambda + \tau)(\lambda + 2\tau)}\,\pi_{U_{1S}} \tag{20}$$
Furthermore, we can obtain the closed-form expression of the virtual cluster node availability model for the 2VNs/2PMs paradigm as follows:
π
U
1
S
=
λ
+
4
τ
+
2
ε
+
τ
+
λ
+
λ
+
ε
+
τ
λ
+
τ
γ
ε
2
μ
η
(1 +
)
− 1
21
As a virtual cluster node in the 2VNs/2PMs paradigm is unavailable in the Switchover and Failure states, the steady-state availability of a virtual cluster node in the 2VNs/2PMs paradigm is:

$$A = \lim_{t \to \infty} A(t) = 1 - (\pi_{S_{1A}} + \pi_{S_{1S}} + \pi_{S_{2A}} + \pi_{S_{2S}} + \pi_{F_A} + \pi_{F_S})$$

So the downtime in a given time interval $L$ is:

$$DT(L) = (\pi_{S_{1A}} + \pi_{S_{1S}} + \pi_{S_{2A}} + \pi_{S_{2S}} + \pi_{F_A} + \pi_{F_S}) \times L$$

And the cost of downtime is:

$$C(L) = (\pi_{S_{1A}} \times C_{S_{1A}} + \pi_{S_{1S}} \times C_{S_{1S}} + \pi_{S_{2A}} \times C_{S_{2A}} + \pi_{S_{2S}} \times C_{S_{2S}} + \pi_{F_A} \times C_{F_A} + \pi_{F_S} \times C_{F_S}) \times L$$
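The three formulas above translate directly into a few lines of code. The sketch below is illustrative only: the state probabilities and unit costs are hypothetical numbers, and it assumes that each cost coefficient is a cost per unit of time spent in the corresponding state, which the text does not state explicitly.

```python
from typing import Dict, Tuple

# The six states in which the node is unavailable (Switchover and Failure).
DOWN_STATES: Tuple[str, ...] = ("S1A", "S1S", "S2A", "S2S", "FA", "FS")


def availability(pi: Dict[str, float]) -> float:
    """Steady-state availability: one minus the total down-state probability."""
    return 1.0 - sum(pi[s] for s in DOWN_STATES)


def downtime(pi: Dict[str, float], interval: float) -> float:
    """Expected downtime DT(L) over an interval of length L."""
    return sum(pi[s] for s in DOWN_STATES) * interval


def downtime_cost(pi: Dict[str, float], unit_cost: Dict[str, float],
                  interval: float) -> float:
    """Expected downtime cost C(L), weighting each state by its unit cost."""
    return sum(pi[s] * unit_cost[s] for s in DOWN_STATES) * interval


# Hypothetical steady-state probabilities of the down states.
pi_down = {"S1A": 2e-5, "S1S": 2e-5, "S2A": 1e-5, "S2S": 1e-5,
           "FA": 1e-4, "FS": 1e-4}
# Hypothetical cost per hour spent in each down state.
cost_per_hour = {"S1A": 100.0, "S1S": 100.0, "S2A": 100.0, "S2S": 100.0,
                 "FA": 500.0, "FS": 500.0}
HOURS_PER_YEAR = 8760.0

print(f"A = {availability(pi_down):.5f}")
print(f"DT = {downtime(pi_down, HOURS_PER_YEAR):.4f} hours/year")
print(f"C = {downtime_cost(pi_down, cost_per_hour, HOURS_PER_YEAR):.2f}")
```

Keeping the down states in one tuple makes it easy to change the definition of "unavailable", for example to also count the Rejuvenation state if service were interrupted during rejuvenation.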
IV. ANALYSIS AND VALIDATION OF THE PROPOSED MODEL
Because precise values of the seven model parameters are hard to measure, we choose six tuples of average parameter values out of 10,00 real data tuples, as shown in Table 2, to represent six different availability levels, and analyze the availability of the 2VNs/2PMs paradigm using the proposed availability model. All real data comes from one of our web servers. In the experiments, every model parameter has a default value: $\eta$ = 2 times/month, $\varepsilon$ = 75%, $\gamma$ = 6 times/hour, $\tau$ = 24%, $1/\delta$ = 6 seconds, $\lambda$ = 2 times/year and $\mu$ = 2 times/month. When one model parameter is varied over its six values in Table 2, the other model parameters are kept at their default values.
TABLE II. SIX TUPLES OF AVERAGE MODEL PARAMETER VALUES IN THE EXPERIMENTS

| Model parameter | The 1st tuple | The 2nd tuple | The 3rd tuple | The 4th tuple | The 5th tuple | The 6th tuple |
|---|---|---|---|---|---|---|
| $\eta$ (times/month) | 1 | 2 | 3 | 4 | 5 | 6 |
| $\varepsilon$ | 60% | 65% | 70% | 75% | 80% | 85% |
| $\gamma$ (times/hour) | 6 | 10 | 12 | 15 | 20 | 30 |
| $\tau$ | 30% | 25% | 20% | 15% | 10% | 5% |
| $1/\delta$ (seconds) | 4 | 6 | 10 | 30 | 60 | 300 |
| $\lambda$ (times/year) | 1 | 2 | 3 | 4 | 5 | 6 |
| $\mu$ (times/month) | 1 | 2 | 4 | 8 | 30 | 60 |
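The experiment design described above (vary one parameter over its six values while the others stay at their defaults) can be sketched as a small generator. The dictionary keys are hypothetical names for the model symbols; pairing each symbol with a row of Table II is an assumption based on the units and default values given in the text. Values are kept in the units of the table, with percentages written as fractions.

```python
from typing import Dict, Iterator, List, Tuple

# Default values of the seven parameters, in the units used by Table II.
DEFAULTS: Dict[str, float] = {
    "eta": 2, "epsilon": 0.75, "gamma": 6, "tau": 0.24,
    "inv_delta": 6, "lambda": 2, "mu": 2,
}

# Six sweep values per parameter, copied from Table II.
SWEEPS: Dict[str, List[float]] = {
    "eta": [1, 2, 3, 4, 5, 6],                           # times/month
    "epsilon": [0.60, 0.65, 0.70, 0.75, 0.80, 0.85],     # probability
    "gamma": [6, 10, 12, 15, 20, 30],                    # times/hour
    "tau": [0.30, 0.25, 0.20, 0.15, 0.10, 0.05],         # probability
    "inv_delta": [4, 6, 10, 30, 60, 300],                # seconds
    "lambda": [1, 2, 3, 4, 5, 6],                        # times/year
    "mu": [1, 2, 4, 8, 30, 60],                          # times/month
}


def one_at_a_time(defaults: Dict[str, float],
                  sweeps: Dict[str, List[float]]
                  ) -> Iterator[Tuple[str, Dict[str, float]]]:
    """Yield (varied parameter, full configuration) for every experiment."""
    for name, values in sweeps.items():
        for value in values:
            config = dict(defaults)  # copy so the defaults stay untouched
            config[name] = value
            yield name, config


experiments = list(one_at_a_time(DEFAULTS, SWEEPS))
print(len(experiments), "configurations")
print(experiments[0])
```

Each configuration would then be converted to a common time unit and passed to the availability model, which is outside the scope of this sketch.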
The relation between the seven model parameters and the availability of the 2VNs/2PMs paradigm is illustrated in figures 4 and 5. Figure 4(a) shows that the availability of the 2VNs/2PMs paradigm increases as $\varepsilon$ increases, and that the availability of the 1VN/1PM paradigm (used as a benchmark) increases even more, because a larger $\varepsilon$ means that, in the same interval, the probability of a virtual cluster node entering the Rejuvenation state increases and the probability of it entering the Failure state decreases. As the 1VN/1PM paradigm has no standby virtual machine to switch to on failure, its availability is strongly influenced by the failure rate ($\pi_F$), but not by the model parameters $\tau$ and $1/\delta$.
Figure 4. The relation between the model parameters and the availability of virtual cluster node
The availability of the 2VNs/2PMs paradigm degrades as the parameter varied in figure 4(b) increases, but to a lesser extent than that of the 1VN/1PM paradigm, owing to its higher rejuvenation rate. The availability of the 2VNs/2PMs paradigm also degrades as the probability of virtual cluster nodes entering the Switchover state increases, as shown in figure 4(c). This is mainly because the switchover between an active virtual cluster node and its standby virtual cluster node makes the node unavailable and thus degrades its availability. From figure 4(d), we can see that the availability of virtual cluster nodes degrades as the switchover time (i.e., the downtime in the live migration process) increases.
Figure 5. The relation between the model parameters and the availability of virtual cluster node
According to figure 5(e), the availability of the 2VNs/2PMs paradigm degrades as $\lambda$ increases, and the availability of the 1VN/1PM paradigm degrades to a greater degree. As shown in figure 5(f), the availability of the 2VNs/2PMs paradigm degrades as the parameter varied in that panel increases, but it is influenced to a lesser extent. From figure 5(g), we can see that the availability of the 2VNs/2PMs paradigm increases as $\mu$ increases, and that the availability of the 1VN/1PM paradigm increases to a greater degree.
V. CONCLUSION AND FUTURE WORK
The availability evaluation of virtual cluster systems is an important issue in their promotion and application, and the availability evaluation of the virtual cluster node is the basis for the availability evaluation of the virtual cluster system. This paper summarized one typical architecture paradigm of virtual cluster nodes by analyzing the overall architecture of virtual cluster systems and the deployment style of virtual cluster nodes, gave the state transition diagram of this typical paradigm by studying the complete lifecycle of virtual cluster nodes and the transition conditions between node states, and proposed an availability model based on Markov chain theory to support the availability analysis of virtual cluster nodes and virtual cluster systems. Finally, numerical simulation experiments demonstrated the practicability of the proposed availability model. In the future, we will study the relation between the availability of a virtual cluster node and that of the virtual cluster system.
ACKNOWLEDGMENT
This work is supported by the State Key Development Program for Basic Research of China ("973 project", No.2007CB310900) and the 2010 Annual Funding Project of Baoding Association of society and Science (No.20100309). The authors want to thank Prof. Qinming He of Zhejiang University for his helpful advice.
REFERENCES
[1] A. Wood. Availability modeling: understanding Markov models to calculate system reliability. Circuits & Devices, pp. 22-27, 1994.
[2] M. Aldinucci, M. Danelutto, M. Torquati, F. Polzella, G. Spinatelli, M. Vanneschi, A. Gervaso, M. Cacitti, and P. Zuccato. VirtuaLinux: virtualized high-density clusters with no single point of failure. Proc. of the Int. Conference ParCo2007, Vol. 38, pp. 355-362, 2007.
[3] B. Cully, G. Lefebvre, D. Meyer, M. Feeley, N. Hutchinson, and A. Warfield. Remus: High Availability via Asynchronous Virtual Machine Replication. Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, San Francisco, California, pp. 161-174, 2008.
[4] E. Farr, R. Harper, L. Spainhower, and J. Xenidis. A Case for High Availability in a Virtualized Environment (HAVEN). Proceedings of the 2008 Third International Conference on Availability, Reliability and Security, pp. 675-682, 2008.
[5] W. Fischer and C. Mitasch. High availability clustering of virtual machines - possibilities and pitfalls. Paper for the talk at the 12th Linuxtag, Wiesbaden/Germany, May 3rd-6th, 2006.
[6] I. Foster, T. Freeman, K. Keahey, D. Scheftner, B. Sotomayor, and X. Zhang. Virtual clusters for grid communities. Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid, pp. 513-520, 2006.
[7] A.M. Johnson Jr. and M. Malek. Survey of software tools for evaluating reliability, availability, and serviceability. ACM Computing Surveys (CSUR), 20(4): 227-269, 1988.
[8] M. Le, A. Gallagher, and Y. Tamir. Challenges and Opportunities with Fault Injection in Virtualized Systems. First International Workshop on Virtualization Performance: Analysis, Characterization, and Tools, Austin, Texas, April 2008.
[9] M. Rosenblum and T. Garfinkel. Virtual Machine Monitors: Current Technology and Future Trends. IEEE Computer, 38(5): 39-47, 2005.
[10] M.A. Vouk. Cloud computing - Issues, research and implementations. Journal of Computing and Information Technology, 16(4): 235-246, 2008.
[11] H. Nishimura, N. Maruyama, and S. Matsuoka. Virtual clusters on the fly - fast, scalable, and flexible installation. Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid, Rio de Janeiro, Brazil, pp. 549-556, 2007.
[12] X. Qin and T. Xie. An availability-aware task scheduling strategy for heterogeneous systems. IEEE Transactions on Computers, 57(2): 188-199, 2008.
[13] M. Steinder, I. Whalley, and D. Chess. Server virtualization in autonomic management of heterogeneous workloads. ACM SIGOPS Operating Systems Review, Vol. 42, No. 1: 94-95, 2008.
[14] T. Thein, M. Pokharel, S.D. Chi, and J.S. Park. A Recovery Model for Survivable Distributed Systems through the Use of Virtualization. The Fourth International Conference on Networked Computing and Advanced Information Management (NCM'08), Gyeongju, Korea, September 2-4, 2008.
[15] T. Thein and J.S. Park. Availability analysis of application servers using software rejuvenation and virtualization. Journal of Computer Science and Technology, 24(2): 339-346, Mar. 2009.
[16] T. Thein, J.S. Park, and S.D. Chi. Availability Modeling and Analysis on Virtualized Clustering with Rejuvenation. International Journal of Computer Science and Network Security, Vol. 8, No. 9, September 2008.
[17] T. Thein, S.D. Chi, and J.S. Park. Improving Fault Tolerance by Virtualization and Software Rejuvenation. Proceedings of the 2008 Second Asia International Conference on Modeling & Simulation (AMS), pp. 855-860, 2008.