
Proceedings of International Conference on Computing Sciences

WILKES100 ICCS 2013


ISBN: 978-93-5107-172-3
Issues and challenges of mobile distributed systems for designing
an efficient recovery protocol (MDS-ERP)
Ravneet Kaur¹*, Tejinder Thind², Rachit Garg³
¹ Student, Dept. of Computer Science & Engg., Lovely Professional University, Phagwara, Punjab, India
² Research Scholar, Guru Kashi University, Bathinda, Punjab, India
³ Associate Professor, School of Computer Applications, Lovely Professional University, Phagwara, Punjab, India
Abstract
A distributed system is a collection of independent computers that appears to its users as a single coherent
system. It is mainly used to solve problems that cannot be solved by an individual system. A Mobile Distributed
System (MDS) combines mobility with a distributed system: it contains Mobile Hosts (MHs) that maintain their
network connection while moving from one place to another. Such a system raises issues such as limited battery
power, lack of stable storage and mobility. Failure is one of the major issues faced by an MDS, and the
probability of failure rises as the number of components in the system increases. Checkpointing techniques are
used to recover from such failures. This paper presents how faults occur and which type of recovery technique
suits which type of fault.
2013 Elsevier Science. All rights reserved.
Keywords: Mobile Computing; Distributed Systems; Wireless Communication; Mobile Distributed Systems; Issues and challenges;
Checkpointing.
1. Introduction
Due to advancements in computation and communication technologies, computing can be seen everywhere.
The combination of distributed systems and mobile communication is the main reason behind these
advancements. Distributed applications are nowadays used in almost every application area, such as business,
education and production. Advances in mobile computing and wireless networks have enabled remote monitoring,
e-commerce, disaster and emergency operations in remote locations, and personal communication through the
wireless internet. These advances allow mobile users to use computers with wireless interfaces that support
networked communication.
2. Related Terms
2.1. Mobile Computing
Mobile computing is basically the combination of communication, mobility and portability. It is a form of
interaction between humans and computers or other computing devices, including handheld devices, in which
the computer is expected to be moved during normal usage. In other words, the user is able to use the computer
and access the network even while moving from one place to another. Palmtops, notebook computers and cell
phones are examples of portable and handheld devices.
*
Corresponding author. avneet
425 Elsevier Publications, 2013
Ravneet Kaur, Tejinder Thind and Rachit Garg
2.2. Wireless Communication
Mobile computing requires wireless network access. Wireless networks communicate through radio waves or
pulsing infrared light. Wireless communication is linked to the wired network infrastructure by stationary
transceivers. The area covered by a transceiver's signal is known as a cell. Cell sizes vary widely: an infrared
transceiver covers a meeting room, a cellular phone transceiver has a range of a few miles, and a satellite
transceiver covers an area more than 400 miles in diameter.
Fig. 1- Wireless Communication
2.3. Distributed System
A distributed system is a collection of processes that communicate with each other by exchanging messages.
It consists of independent computers connected through a network that enables them to coordinate their
activities and share the resources of the system. The collection appears to its users as a single coherent
system: users perceive a single integrated computing facility. A key property of distributed systems is that
the differences between the various computers, and the ways in which they communicate with each other, are
mostly hidden from the users. Examples of distributed systems include the World Wide Web, networks of branch
office computers and networks of embedded systems. The main goals of a distributed system are:
Making Resources Accessible
Distribution Transparency
Openness
Scalability
3. Mobile Distributed System
Mobile Distributed Systems are the combination of mobile systems and distributed systems. When some of the
processes in a distributed system run on mobile hosts, the system becomes a Mobile Distributed System (MDS).
An MDS contains moving nodes called Mobile Hosts (MHs). Mobile hosts maintain their network connection even
while moving from one place to another. A Mobile Support Station (MSS) is another kind of node, with which a
mobile host can communicate, and vice versa, if and only if the mobile host is within the service area around
the MSS. A service area is the geographic area around the MSS in which the signal strength is sufficient for
mobile hosts to communicate with the MSS.
Fig. 2- Mobile Distributed System
4. Issues and Challenges of Mobile Distributed System
4.1. Designing distributed algorithms for mobile computing environments
A distributed algorithm is an algorithm designed to run on computer hardware constructed from interconnected
processors. Telecommunications, real-time process control and distributed information processing are
application areas of distributed computing that use distributed algorithms. A number of distributed algorithms
are available today, but they were designed assuming that hosts are static, that is, stationary with negligible
mobility. Mobile computing is now spreading very fast, and in mobile computing environments, where Mobile
Distributed Systems play an important role, the geographical distribution of mobile hosts keeps changing due to
frequent mobility. Distributed algorithms designed for static hosts are therefore insufficient for the mobile
computing environment. To achieve better communication, coordination and synchronization, the available
distributed algorithms need to be modified. Designing distributed algorithms for mobile computing environments
is thus a major challenge.
4.2. Designing networking protocols for mobile distributed systems
Communication between hosts on a network is governed by protocols. A protocol is basically a set of rules.
In a mobile distributed environment, host mobility is supported by Mobile IP (MIP), an internet protocol.
With MIP, mobile hosts stay connected to the internet regardless of their location and without changing their
IP addresses, so the mobility of hosts becomes transparent to applications and higher-level network protocols.
Networking protocols need to handle issues related to:
Type of communication
Access method used in network
Network topologies
Speed of data transfer
4.3. Middleware support for mobile distributed computing
A temporary loss of network connectivity occurs when mobile devices or hosts change their position. Frequent
and unpredictable changes in the environment, such as variability in network bandwidth and in the remote
availability of resources, are the main reason for this loss. Other contributing factors include slow processor
speed, limited device memory and low battery power. With these problems in mind, mobile distributed system
applications need further research on middleware that provides better coordination and communication between
distributed components.
4.4. Mobile codes and agents
Mobile agents are self-contained and recognizable computer programs. They learn, collect information and
deliver requests. Mobile agents also raise some issues. The major one is the requirement for a software
infrastructure that provides them with security and data protection. The purpose of mobile agents is to
implement distributed, information-oriented applications efficiently, not to improve system performance or to
enable new applications. Another issue is that a mobile agent system needs to be fault tolerant, so that agents
carry on their work independently of host failures [2].
4.5. Applications and services based on location
Speeding up the decision-making process by providing only relevant information to users.
Minimizing the tiresome data entry process.
Sharing location-based information, such as people's reviews and photos taken by the user.
Combining services that are provided by different service providers.
4.6. Mobile cluster computing
Mobile cluster computing (MCC) is a new form of cluster computing, in which two or more computers are
connected with each other in such a way that they work and behave like a single computer. In MCC,
mobile clusters work together with a set of mobile nodes to perform a particular task.
Flexibility in terms of mobility and extensibility.
Robust mobile network for handling high performance computing requests
Cluster security.
Parallel processing.
Load balancing.
Fault tolerance.
4.7. Fault tolerance and recovery in mobile distributed systems
Distributed computing and cluster computing are cost-effective and scalable, and are able to meet the demands
of high-performance computing, which is why they are widely used nowadays. As the number of components
increases, the probability of failure also rises. Fault tolerance techniques deal with these failures. Fault
tolerance is the ability of a system to respond gracefully to an unexpected hardware or software failure: when
a component fails, a backup component or procedure immediately takes its place with no loss of service or
data [2]. In mobile distributed systems, fault tolerance can be provided in software, in hardware or by a
combination of both. In a software implementation, the operating system provides an interface that allows the
programmer to checkpoint critical data at predetermined points within a transaction. In hardware, fault
tolerance is achieved by duplicating each hardware component. To provide fault tolerance, it is necessary to
understand the nature of the faults that arise as the number of components increases. There are mainly two
kinds of faults:
Permanent faults: These faults are caused by permanent damage to one or more components. They can be
rectified by repair or replacement of components.
Transient faults: These faults are caused by changes in environmental conditions. They remain for a short
duration and are difficult to detect and deal with. Many techniques, protocols and algorithms exist to recover
from these faults.
Fault tolerance can be attained through redundancy. Redundancy is of two types:
Temporal redundancy: It is also known as checkpoint-restart. When a fault occurs, the application is
restarted from an earlier checkpoint, also called a recovery point. The drawbacks are that some processing is
lost and that applications may miss strict timing targets.
Spatial redundancy: Many copies of a single application run concurrently on different processors. The main
advantage over temporal redundancy is that applications can meet strict timing targets. However, this
technique is very costly and requires extra hardware.
5. Checkpointing
Checkpointing is the method of saving status information so that, when a fault occurs, the state can be
restored via a rollback. Fault tolerance based on checkpointing takes periodic snapshots of the state of a
process and saves them in a safe place. A snapshot of the local state of a process is known as a checkpoint.
When a failure occurs, the process state is recovered from the checkpoint. Checkpointing is used to avoid
losing all the processing done before the occurrence of a fault: it saves the state of a program on a reliable
storage medium. In case of a fault, checkpointing enables the execution of the program to be resumed from a
previous consistent state rather than from the beginning. This reduces the amount of useful processing lost
due to the fault.
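The save-and-resume cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a local file written with `pickle` stands in for the reliable storage medium, and the names `save_checkpoint`, `load_checkpoint` and `sum_of_squares`, as well as the `crash_at` parameter used to simulate a fault, are hypothetical and not part of any protocol discussed in this paper.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write the state to a temporary file, then atomically rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def sum_of_squares(n, path, checkpoint_every=100, crash_at=None):
    """Sum i*i for i < n, resuming from the last checkpoint if one exists."""
    state = load_checkpoint(path, {"i": 0, "total": 0})
    while state["i"] < n:
        if crash_at is not None and state["i"] == crash_at:
            raise RuntimeError("simulated fault")  # assumption: fault injection
        state["total"] += state["i"] ** 2
        state["i"] += 1
        if state["i"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["total"]
```

If a run is interrupted at step 250 with a checkpoint every 100 steps, the next call resumes from step 200 and repeats only 50 steps instead of restarting from zero.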
Fig. 3- Consistent and Inconsistent Global States
Fig. 3 illustrates checkpointing. A global state consists of the local states of all processes together with
the messages that have been sent but not yet received. A message whose sending is recorded but whose receipt is
not is known as a lost or in-transit message. An orphan message is one whose receipt is recorded but whose
sending is not. In Fig. 3, the set {C10, C20, C30, C40} is the initial global state; it is consistent because
it contains no orphan message. The initial global state is always consistent. The global state {C11, C21, C31,
C41} is also consistent, as it does not contain any orphan message either. The global state {C12, C22, C32,
C42} is inconsistent because it contains the orphan message m6.
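The orphan and in-transit definitions can be checked mechanically. The sketch below assumes a simple encoding that is not taken from the paper: each message records the index of the latest checkpoint its sender had taken before sending (`sent_after`) and the index of the latest checkpoint its receiver had taken before receiving (`received_after`), with 0 denoting the initial checkpoint. The messages in the usage note are hypothetical and are not the messages of Fig. 3.

```python
from collections import namedtuple

# sent_after / received_after: index of the latest checkpoint the process had
# taken before the send / receive event (0 is the initial checkpoint).
Message = namedtuple("Message", "name sender receiver sent_after received_after")


def classify(messages, cut):
    """Return (orphans, in_transit) for a global checkpoint.

    `cut` maps each process to the index of its checkpoint in the global state.
    """
    orphans, in_transit = [], []
    for m in messages:
        send_recorded = m.sent_after < cut[m.sender]
        receive_recorded = m.received_after < cut[m.receiver]
        if receive_recorded and not send_recorded:
            orphans.append(m.name)      # received in the cut, never sent in it
        elif send_recorded and not receive_recorded:
            in_transit.append(m.name)   # sent in the cut, not yet received
    return orphans, in_transit


def is_consistent(messages, cut):
    """A global state is consistent when it contains no orphan message."""
    orphans, _ = classify(messages, cut)
    return not orphans
```

For example, with `m1 = Message("m1", "P1", "P2", 0, 0)` and `m2 = Message("m2", "P2", "P1", 1, 0)`, the cut `{"P1": 1, "P2": 1}` is inconsistent because `m2` is an orphan, while the initial cut `{"P1": 0, "P2": 0}` records no events at all and is therefore consistent.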
5.1. Uncoordinated Checkpointing
Uncoordinated checkpointing is also known as independent checkpointing. Processes do not coordinate their
checkpointing activities with each other; every process records its local checkpoints independently. Each
process decides for itself when to take a checkpoint, so it may do so whenever it is most convenient. This
eliminates coordination overhead during failure-free execution. After a failure, a consistent global checkpoint
is established by tracking the dependencies between checkpoints. Recovery may require cascaded rollbacks that
can lead all the way back to the initial state, which is known as the domino effect. Each process has to keep
multiple checkpoints, and checkpoints that are no longer needed are reclaimed by a garbage collection
algorithm. A process may also take useless checkpoints, which will never be part of a consistent global state.
Useless checkpoints incur overhead without advancing the recovery line.
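The cascaded rollback can be illustrated with a small rollback-propagation sketch. It assumes that every process restarts from its latest checkpoint and that the dependency information is available as a list of messages, each tagged with the checkpoint index preceding its send and its receive. The function `recovery_line` and this encoding are illustrative assumptions, not a published algorithm.

```python
from collections import namedtuple

# sent_after / received_after: index of the latest checkpoint taken by the
# sender / receiver before the corresponding event (0 = initial checkpoint).
Message = namedtuple("Message", "sender receiver sent_after received_after")


def recovery_line(latest, messages):
    """Roll back from the latest checkpoints until no orphan message remains."""
    cut = dict(latest)
    changed = True
    while changed:
        changed = False
        for m in messages:
            send_recorded = m.sent_after < cut[m.sender]
            receive_recorded = m.received_after < cut[m.receiver]
            if receive_recorded and not send_recorded:
                # Orphan: undo the receive by falling back to the receiver's
                # last checkpoint taken before the message arrived.
                cut[m.receiver] = m.received_after
                changed = True
    return cut


# Hypothetical exchange in which every checkpoint turns out to be useless.
domino = [
    Message("Q", "P", 2, 1),
    Message("P", "Q", 1, 1),
    Message("Q", "P", 1, 0),
    Message("P", "Q", 0, 0),
]
```

With the `domino` messages, `recovery_line({"P": 2, "Q": 2}, domino)` rolls both processes back to checkpoint 0: each rollback turns another message into an orphan, which is exactly the domino effect. Each step strictly lowers one checkpoint index, so the loop always terminates.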
5.2. Coordinated Checkpointing
Coordinated checkpointing is also known as synchronous checkpointing. Processes take checkpoints in such a
manner that the resulting global state is consistent. The protocol basically follows a two-phase commit
structure: tentative checkpoints are taken in the first phase and made permanent in the second. The main
advantage is that each process needs to store only one permanent checkpoint and at most one tentative
checkpoint. In case of a fault, processes roll back to the last checkpointed state. A tentative checkpoint can
be undone or changed into a permanent checkpoint, but a permanent checkpoint cannot be undone. Computation
required to reach the checkpointed state is not repeated. A coordinator takes a checkpoint and broadcasts a
request message to all other processes, asking them to take a checkpoint. When a process receives the request,
it stops its execution, takes a tentative checkpoint and sends an acknowledgement back to the coordinator.
After receiving acknowledgements from all processes, the coordinator broadcasts a commit message, which
completes the two-phase checkpointing protocol. On receiving the commit message, a process converts its
tentative checkpoint into a permanent one and discards its old permanent checkpoint. The process is then free
to resume execution and exchange messages with other processes.
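The two-phase exchange can be sketched as a single-threaded simulation in which method calls stand in for the request, acknowledgement and commit messages. The class and function names are hypothetical. The abort path and the `reachable` parameter are assumptions added for illustration; the text above only says that a tentative checkpoint can be undone.

```python
import copy


class Process:
    """Holds volatile state, one permanent and at most one tentative checkpoint."""

    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.permanent = copy.deepcopy(state)  # initial checkpoint
        self.tentative = None
        self.blocked = False

    def on_request(self):
        """Phase 1: stop executing, take a tentative checkpoint, acknowledge."""
        self.blocked = True
        self.tentative = copy.deepcopy(self.state)
        return True

    def on_commit(self):
        """Phase 2: the tentative checkpoint replaces the old permanent one."""
        self.permanent, self.tentative = self.tentative, None
        self.blocked = False

    def on_abort(self):
        """Undo the tentative checkpoint; the old permanent one stays valid."""
        self.tentative = None
        self.blocked = False

    def rollback(self):
        """Restore the state saved in the last permanent checkpoint."""
        self.state = copy.deepcopy(self.permanent)


def coordinated_checkpoint(coordinator, others, reachable=lambda p: True):
    """Run one checkpointing round; return True if it committed."""
    everyone = [coordinator] + list(others)
    coordinator.on_request()
    # Assumption: an unreachable process simply never acknowledges.
    acks = [p.on_request() for p in others if reachable(p)]
    if len(acks) == len(others) and all(acks):
        for p in everyone:
            p.on_commit()
        return True
    for p in everyone:
        p.on_abort()
    return False
```

A round that misses an acknowledgement leaves every process with its previous permanent checkpoint, so a later `rollback()` still restores a consistent global state.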
5.3. Quasi-Synchronous or Communication Induced Checkpointing
Communication-induced checkpointing avoids the domino effect without requiring all checkpoints to be
coordinated. Processes take two kinds of checkpoints, local and forced [4]. Local checkpoints can be taken
independently. Forced checkpoints are taken to guarantee the progress of the recovery line and to minimize
useless checkpoints. These protocols do not exchange any special coordination messages to determine when
forced checkpoints should be taken [4]. Instead, protocol-specific information, generally a checkpoint
sequence number, is piggybacked on each application message, and the receiver uses it to decide whether to
take a forced checkpoint.
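One common way to use piggybacked sequence numbers is an index-based rule: a process takes a forced checkpoint before delivering any message whose sequence number is larger than its own. The sketch below illustrates that rule only; it is an assumption chosen for illustration rather than a protocol specified in this paper, and the class `CicProcess` and its fields are hypothetical.

```python
class CicProcess:
    """Process that piggybacks its checkpoint sequence number on every message."""

    def __init__(self, name):
        self.name = name
        self.sn = 0                  # checkpoint sequence number
        self.delivered = []          # application state: payloads seen so far
        self.checkpoints = [(0, "initial", [])]

    def _save(self, kind):
        # Each checkpoint stores (sequence number, kind, snapshot of the state).
        self.checkpoints.append((self.sn, kind, list(self.delivered)))

    def local_checkpoint(self):
        """Independent checkpoint, taken whenever the process chooses."""
        self.sn += 1
        self._save("local")

    def send(self, payload):
        """Attach the current sequence number to the outgoing message."""
        return (self.sn, payload)

    def receive(self, message):
        sn, payload = message
        if sn > self.sn:
            # Forced checkpoint, taken BEFORE delivery so that the checkpoint
            # does not record the receipt of a message from a later interval.
            self.sn = sn
            self._save("forced")
        self.delivered.append(payload)
```

Under this rule, checkpoints that carry the same sequence number on different processes cannot contain an orphan message between them, and no coordination message is ever exchanged.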
5.4. Message Logging Based Checkpointing Protocols
Tolerating process crash failures is the main feature of message-logging-based checkpointing protocols. All
inter-process communication takes place through messages. Each message received by a process is saved in a
message log on stable storage. No coordination is needed between the checkpointing of different processes. The
execution of each process is assumed to be deterministic between received messages, and all processes are
assumed to execute on fail-stop processors. When a process crashes, a new process is created in its place. The
new process is given the appropriate recorded local state, and the logged messages are then replayed to it in
the order in which the original process received them. All message-logging protocols require that, once a
crashed process recovers, its state is consistent with the states of the other processes. This consistency
requirement is usually expressed in terms of orphan processes, which are surviving processes whose states are
inconsistent with the recovered states of crashed processes. Thus, message-logging protocols guarantee that
upon recovery no process is an orphan [4]. This can be done either by avoiding the creation of orphans during
execution or by taking appropriate actions during recovery to eliminate all orphans. There is a receiver-based
message logging protocol in MIP for mobile hosts, mobile support stations and home agents, which assures
independent recovery.
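The log-and-replay idea can be sketched as follows. The sketch assumes a plain dictionary that outlives the process as a stand-in for stable storage, and a deliberately order-sensitive state update to show why messages have to be replayed in their original order. `LoggedProcess` and its methods are hypothetical names.

```python
class LoggedProcess:
    """Receiver-based logging: each message is logged before it is applied."""

    def __init__(self, storage):
        # `storage` stands in for stable storage and survives a crash.
        self.storage = storage
        storage.setdefault("checkpoint", 0)
        storage.setdefault("log", [])
        self.state = storage["checkpoint"]

    def _apply(self, value):
        # Deterministic and order-sensitive state transition.
        self.state = self.state * 2 + value

    def receive(self, value):
        self.storage["log"].append(value)  # log first, then deliver
        self._apply(value)

    def checkpoint(self):
        """Save the local state; messages logged before it are no longer needed."""
        self.storage["checkpoint"] = self.state
        self.storage["log"] = []

    @classmethod
    def recover(cls, storage):
        """Create a replacement process and replay the log in receive order."""
        replacement = cls(storage)
        for value in storage["log"]:
            replacement._apply(value)
        return replacement
```

Because execution is deterministic between received messages, the replacement process created by `LoggedProcess.recover(storage)` reaches exactly the state the crashed process had after its last logged message, without involving any other process.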
6. Aspects of checkpointing
6.1. Frequency of Checkpointing
A checkpointing algorithm executes in parallel with the basic computation, so the overhead introduced by
checkpointing should be minimized. Checkpointing should enable a user to recover quickly and not lose
substantial computation in case of an error, which calls for frequent checkpointing and consequently a large
overhead. The number of checkpoints initiated should be such that the cost of information loss due to failure
is small and the overhead due to checkpointing is not significant. Both depend on the failure probability and
the importance of the computation. For example, in a transaction processing system where every transaction is
important and information loss is not permitted, a checkpoint may be taken after every transaction, which
increases the checkpointing overhead significantly.
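The trade-off between checkpointing overhead and lost computation can be made concrete with a first-order cost model. The model below is an assumption brought in for illustration and does not come from this paper: with checkpoint cost C, checkpoint interval T and mean time between failures M, the fraction of time wasted is roughly C/T (time spent saving state) plus T/(2M) (expected rework after a failure), and this sum is smallest at T = sqrt(2CM), a result usually attributed to Young.

```python
import math


def overhead_fraction(interval, checkpoint_cost, mtbf):
    """Approximate fraction of time lost to checkpointing plus rework."""
    saving = checkpoint_cost / interval   # cost of taking checkpoints
    rework = interval / (2.0 * mtbf)      # on average half an interval is redone
    return saving + rework


def young_interval(checkpoint_cost, mtbf):
    """First-order estimate of the interval that minimizes the overhead."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)
```

For a checkpoint that takes 2 seconds and one failure per hour on average, `young_interval(2.0, 3600.0)` gives an interval of about 120 seconds. Checkpointing twice as often or half as often both increase `overhead_fraction`, which reflects the balance described above.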
6.2. Contents of a Checkpoint
The state of a process has to be saved in stable storage so that the process can be restarted in case of an
error. The state, or context, includes the code, stack and data segments along with the environment and the
register contents. The environment holds information about the files currently in use and their file pointers.
In message-passing systems, the environment also includes the messages that have been sent but not yet
received. The information that is necessary to resume a computation after it is pre-empted is called the
context of that computation [4].
6.3. Overheads of Checkpointing Algorithm
During a failure-free run, every global checkpoint incurs coordination overhead and context-saving overhead
in a multiprocessor system. In parallel or distributed systems, coordination among processes is needed to
obtain a consistent global state. This coordination uses special messages and information piggybacked on
regular messages. The bookkeeping operations necessary to maintain coordination also contribute to the
coordination overhead. The context-saving overhead is the time taken to save the global context of a
computation. If stable storage is not available at every node of a multiprocessor system, the context is
transferred over the network, and the network transmission delay is also included in the overhead [4].
6.4. Application of Checkpointing
Checkpointing is used to recover from failures as well as to debug distributed programs and to migrate
processes in multiprocessor systems. When debugging distributed programs, the state changes of a process
during execution are monitored at various time instants. To balance the load in a distributed system,
processes are moved from heavily loaded processors to lightly loaded processors. Checkpointing a process can
provide the information necessary to move it from one processor to another. With checkpointing, an arbitrary
temporal section of a program's runtime can be extracted for exhaustive analysis without the need to restart
the program from the beginning [4].
7. Conclusion
Providing fault tolerance to a system requires an efficient algorithm or protocol that maintains the system
and eliminates the faults that occur. Checkpointing protocols are useful for fault tolerance and recovery in
mobile distributed systems. In this paper we have presented some checkpointing protocols that are effective
for recovering from faults.
References
[1] Tejinder Thind, Rachit Garg, Uminder Kaur and Dinesh Kumar. Paradigms in Fault Tolerant Checkpointing Protocols in Distributed
Mobile Systems, IEEE-2012.
[2] P.K. Suri and Meenu Satiza. An Efficient Checkpointing Protocol for Mobile Distributed Systems, IJLRST-2012.
[3] Rachit Garg and Praveen Kumar. A Review of Fault Tolerance Checkpointing Protocols for Mobile Distributed Systems, IEEE-2010.
[4] Rachit Garg and Parveen Kumar. A Review of Fault Tolerant Checkpointing Protocol for Mobile Computing Systems, IJCE-2010.
[5] Parveen Kumar, Richa Setiya and Poonam Gahlan. Checkpointing Algorithms for Distributed Systems, 2009.
[6] Adnan Agbaria, William H. Sanders. Distributed Snapshots for Mobile Computing Systems, IEEE Intl. Conf. PERCOM04, pp. 1-10,
2004.
[7] Adnan Agbaria, William H. Sanders. Distributed Snapshots for Mobile Computing Systems, Proceedings of the Second IEEE Annual
Conference on Pervasive Computing and Communications (Percom04), pp. 1-10, 2004.
[8] Lalit Kumar, Manoj Mishra and Ramesh Chander Joshi. Low Overhead Optimal Checkpointing for Mobile Distributed Systems,
IEEE-2003.
[9] S. Kalaiselvi, V. Rajaraman. A Survey of Checkpointing Algorithms for Parallel and Distributed Systems, vol. 25, Part 5, October
2000, pp. 489-510.
[10] Cao G. and Singhal M. On Coordinated Checkpointing in Distributed System, IEEE Transactions on Parallel and Distributed Systems,
vol. 9, no. 12, pp. 1213-1225, Dec 1998.
[11] L. Alvisi, B. Hoppe, K. Marzullo. Nonblocking and Orphan-Free Message Logging Protocol, Proc. of 23rd Fault Tolerant Computing
Symp., pp. 145-154, June 1993.
[12] R. Koo and S. Toueg. Checkpointing and Roll-Back Recovery for Distributed Systems, IEEE Trans. on Software Engineering, Vol. 13,
no. 1, pp. 23-31, January 1987.
[13] Tejinder Thind and Rachit Garg. Mobile Distributed System: Concept, Issues and Challenges.