FRANCOIS BESNARD
Master Thesis written at the Royal Institute of Technology, KTH School of Electrical Engineering, June 2007
Supervisors: Assistant Professor Lina Bertling (KTH), Professor Michael Patriksson (Chalmers, Applied Mathematics), Dr. Erik Dotzauer (Fortum)
Examiner: Assistant Professor Lina Bertling
XR-EE-ETK 2007:008
Abstract
With the restructuring and deregulation of the power system, market and competition rules have been introduced among power system companies. The generating companies, as well as the transmission and distribution system operators, aim to minimize their costs. Maintenance can be a significant part of the total costs. The pressure to reduce the maintenance budget leads to a need for efficient maintenance. This work focuses on an optimization methodology that could be useful for optimizing maintenance. The method, stochastic dynamic programming, is interesting because it can explicitly integrate the stochastic behavior of functional failures. Different models based on stochastic dynamic programming are reviewed, together with the possible optimization methods to solve them. The relevance of the models in the context of maintenance optimization is discussed. An example of a multi-component replacement application is proposed to illustrate the theory.

Keywords: Maintenance Optimization, Dynamic Programming, Markov Decision Process, Power Production
Acknowledgements
First of all, I would like to thank my supervisors, who each in their own way supported me in this work: Ass. Prof. Lina Bertling for her encouragement, constructive remarks and for giving me the opportunity of working on this project; Dr. Erik Dotzauer for many valuable inputs, discussions and comments; and Prof. Michael Patriksson for his help with mathematical writing. Special greetings to all my friends and companions of study all over the world. Finally, my heart turns to my parents and my love for their endless encouragement and support in my studies and life. Stockholm, June 2007
Abbreviations
ADP: Approximate Dynamic Programming
CBM: Condition Based Maintenance
CM: Corrective Maintenance
DP: Dynamic Programming
IHSDP: Infinite Horizon Stochastic Dynamic Programming
LP: Linear Programming
MDP: Markov Decision Process
PI: Policy Iteration
PM: Preventive Maintenance
RCAM: Reliability Centered Asset Maintenance
RCM: Reliability Centered Maintenance
SDP: Stochastic Dynamic Programming
SMDP: Semi-Markov Decision Process
TBM: Time Based Maintenance
VI: Value Iteration
Notations
Numbers
  M   Number of iterations for the evaluation step of modified policy iteration
  N   Number of stages

Constants
  α   Discount factor

Variables
  i   State at the current stage
  j   State at the next stage
  k   Stage
  m   Number of iterations left for the evaluation step of modified policy iteration
  q   Iteration number for the policy iteration algorithm
  u   Decision variable

State and Control Spaces
  μk      Function mapping the states to a decision at stage k
  μk*(i)  Optimal decision at stage k for state i
  μ       Decision policy for stationary systems
  μ*      Optimal decision policy for stationary systems
  π       Policy
  π*      Optimal policy
  Uk      Decision action at stage k
  Uk*(i)  Optimal decision action at stage k for state i
  Xk      State at stage k

Dynamic and Cost Functions
  Ck(i, u)             Cost function
  Ck(i, u, j)          Cost function
  Cij(u) = C(i, u, j)  Cost function if the system is stationary
  CN(i)                Terminal cost for state i
  fk(i, u)             Dynamic function
  fk(i, u, ω)          Stochastic dynamic function
  Jk(i)                Optimal cost-to-go from stage k to N starting from state i
  ωk(i, u)             Probabilistic function of a disturbance
  Pk(j, u, i)          Transition probability function
  P(j, u, i)           Transition probability function for stationary systems
  V(Xk)                Cost-to-go resulting from a trajectory starting from state Xk

Sets
  Uk(i)  Set of admissible decisions for state i at stage k
  Xk     State space at stage k
Contents
1 Introduction
  1.1 Background
  1.2 Objective
  1.3 Approach
  1.4 Outline
2 Maintenance
  2.1 Types of Maintenance
  2.2 Maintenance Optimization Models
3 Introduction to the Power System
  3.1 Power System Presentation
  3.2 Costs
  3.3 Main Constraints
4 Introduction to Dynamic Programming
  4.1 Introduction
  4.2 Deterministic Dynamic Programming
5 Finite Horizon Models
  5.1 Problem Formulation
  5.2 Optimality Equation
  5.3 Value Iteration Method
  5.4 The Curse of Dimensionality
  5.5 Ideas for a Maintenance Optimization Model
6 Infinite Horizon Models - Markov Decision Processes
  6.1 Problem Formulation
  6.2 Optimality Equations
  6.3 Value Iteration
  6.4 The Policy Iteration Algorithm
  6.5 Modified Policy Iteration
  6.6 Average Cost-to-go Problems
  6.7 Linear Programming
  6.8 Efficiency of the Algorithms
  6.9 Semi-Markov Decision Process
7 Approximate Methods for Markov Decision Process - Reinforcement Learning
  7.1 Introduction
  7.2 Direct Learning
  7.3 Indirect Learning
  7.4 Supervised Learning
8 Review of Models for Maintenance Optimization
  8.1 Finite Horizon Dynamic Programming
  8.2 Infinite Horizon Stochastic Models
  8.3 Reinforcement Learning
  8.4 Conclusions
9 A Proposed Finite Horizon Replacement Model
  9.1 One-Component Model
  9.2 Multi-Component Model
  9.3 Possible Extensions
10 Conclusions and Future Work
A Solution of the Shortest Path Example
Reference List
Chapter 1
Introduction
1.1 Background
With the restructuring and deregulation of modern power systems, market and competition rules have been introduced among power system companies. The generating companies, as well as the transmission and distribution system operators, aim to minimize their costs. Maintenance costs can be a significant part of the total costs. The pressure to reduce the maintenance budget leads to a need for efficient maintenance. Maintenance can be divided into Corrective Maintenance (CM) and Preventive Maintenance (PM) (see Chapter 2.1). CM means that an asset is maintained once an unscheduled functional failure occurs. CM can imply high costs for unsupplied energy, interruption, possible deterioration of the system, human risks or environmental consequences, etc. PM is employed to reduce the risk of unexpected failure. Time Based Maintenance (TBM) is used for the most critical components and Condition Based Maintenance (CBM) for the components that are worthwhile and not too expensive to monitor. These maintenance actions have a cost for unsupplied energy, inspection, repair, replacement, etc. Efficient maintenance should balance corrective and preventive maintenance to minimize the total costs of maintenance. The probability of a functional failure of a component is stochastic. The probability depends on the state of the component, resulting from the history of the component (age, intensity of use, external stress (such as weather), maintenance actions, human
errors and construction errors). Stochastic Dynamic Programming (SDP) models are optimization models that explicitly integrate stochastic behaviors. This feature makes the models interesting and was the starting idea of this work.
1.2 Objective
The main objective of this work is to investigate the use of stochastic dynamic programming models for maintenance optimization and identify possible future applications in power systems.
1.3 Approach
The first task was to understand the different dynamic programming approaches. A first distinction was made between finite horizon and infinite horizon approaches. The different techniques that can be used for solving a model based on dynamic programming were investigated. For infinite horizon models, approximate dynamic programming was studied. These types of methods are related to the field of reinforcement learning. Some SDP models found in the literature were reviewed. Conclusions were drawn about the applicability of each approach to maintenance optimization problems. Moreover, future avenues for research were identified. A finite horizon replacement model was developed to illustrate the possible use of SDP for power system maintenance.
1.4 Outline
Chapter 2 gives an overview of the maintenance field. The most important methods and some optimization models are reviewed. Chapter 3 briefly discusses power systems. Some costs and constraints for optimization models are proposed. Chapters 4-7 focus on different Dynamic Programming (DP) approaches and the algorithms to solve them. The assumptions of the models and their practical limitations are discussed. The basics of DP models are presented through deterministic models in Chapter 4. Chapters 5 and 6 focus on Stochastic Dynamic Programming methods, respectively for finite and infinite horizons. Chapter 7 is an introduction to Approximate Dynamic Programming (ADP), also known as Reinforcement Learning (RL), which is an approach to solving infinite horizon Dynamic Programming problems using approximate methods. Chapter 8 gives a review of some maintenance optimization models based on dynamic programming. Conclusions are drawn about the possible use of the different approaches in maintenance optimization. Chapter 9 is an example of how finite horizon dynamic programming can be used for maintenance optimization. Chapter 10 summarizes the conclusions of the work and discusses possible avenues for research.
Chapter 2
Maintenance
The context of maintenance optimization is briefly described in this chapter. Different types of maintenance are defined in Section 2.1. Some maintenance optimization models are reviewed in Section 2.2.
2.1 Types of Maintenance
Maintenance is a combination of all technical, administrative and managerial actions during the life cycle of an item intended to retain it in, or restore it to, a state in which it can perform the required functions [1]. Figure 2.1 shows a general picture of the different types of maintenance.
Corrective Maintenance (CM) is carried out after fault recognition and is intended to put an item into a state in which it can perform a required function [1]. It is typically performed when there is no way, or it is not worthwhile, to detect or prevent a failure. Preventive maintenance aims at undertaking maintenance actions on a component before it fails, e.g. to avoid high replacement costs, unsupplied power delivery and possible damage to the surroundings of the component. One can distinguish between two kinds of preventive maintenance: 1. Time Based Maintenance (TBM) is preventive maintenance carried out in accordance with established intervals of time or number of units of use, but without previous condition investigation [1]. TBM is used for failures that are age-related and for which the probability of failure over time can be established.
[Figure 2.1: Types of maintenance. Maintenance is divided into Preventive Maintenance and Corrective Maintenance; preventive maintenance can be Continuous, Scheduled or Inspection Based.]
2. Condition Based Maintenance (CBM) is preventive maintenance based on performance and/or parameter monitoring and the subsequent actions [1]. CBM corresponds to all the maintenance methods that use diagnostics or inspections to decide on the maintenance actions. Diagnostic methods include the use of human senses (noise, visual, etc.), measurements or tests. They can be undertaken continuously or during scheduled or requested inspections. CBM is often used for non-age-related failures.
2.2 Maintenance Optimization Models
Unexpected failures of a component in a system can lead to expensive Corrective Maintenance. Preventive Maintenance approaches can be used to avoid CM. If preventive maintenance is done too frequently, however, it can also result in very high costs. The aim of maintenance optimization could be to balance corrective and preventive maintenance to minimize, for example, the total cost of maintenance. Numerous maintenance optimization models have been proposed in the literature and interesting reviews have been published. Wang [43] gives an interesting picture of maintenance policy optimization and its influence factors. Cho et al. [15], Dekker et al. [16] and Nicolai et al. [31] focus mainly on multi-component problems. In this section, the most common classes of models are described and some references are given. This short review is based on Chapter 8 of [4].
2.2.1 Age Replacement Models
Under an age replacement policy, a component is replaced at failure or at the end of a specified interval, whichever occurs first [17]. This policy makes sense if a preventive replacement is less expensive than a corrective replacement and the failure rate increases with time. Barlow et al. [7] describe a basic age replacement model. A model including discounting has been proposed in [17]. In this model, the loss value of a replaced component decreases with its age. A model with minimal repair is discussed in [6]: if the component fails, it can be repaired to the same condition as before the failure occurred. An age/block replacement model with failures resulting from shocks is described in [38]. The shocks follow a non-homogeneous Poisson process (a Poisson process with a rate that is not stationary). Two types of failures can result from the shocks: minor failures removed by minor repair, and major failures removed by replacement.
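The trade-off behind age replacement can be made concrete with a small numerical sketch. The example below is not from the thesis: it assumes a Weibull lifetime with increasing failure rate and illustrative costs, and searches for the replacement age T minimizing the classical long-run cost rate g(T) = (Cp·R(T) + Cf·(1-R(T))) / E[min(lifetime, T)]:

```python
import math

# Hypothetical parameters (not from the thesis): Weibull lifetime with
# shape beta > 1 (increasing failure rate) and scale eta; preventive
# replacement cost Cp cheaper than corrective cost Cf.
BETA, ETA = 2.5, 10.0   # Weibull shape and scale
CP, CF = 1.0, 5.0       # preventive / corrective replacement costs

def survival(t):
    """Weibull survival function R(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / ETA) ** BETA))

def cost_rate(T, steps=2000):
    """Long-run expected cost per unit time under age-replacement age T:
    g(T) = (Cp*R(T) + Cf*(1 - R(T))) / E[min(lifetime, T)]."""
    dt = T / steps
    # E[min(X, T)] = integral of R(t) over [0, T] (trapezoidal rule)
    expected_cycle = dt * (0.5 * (survival(0.0) + survival(T))
                           + sum(survival(k * dt) for k in range(1, steps)))
    return (CP * survival(T) + CF * (1.0 - survival(T))) / expected_cycle

# Grid search for the replacement age minimizing the cost rate
ages = [0.5 + 0.1 * k for k in range(200)]
T_star = min(ages, key=cost_rate)
print(f"best replacement age ~ {T_star:.1f}, cost rate ~ {cost_rate(T_star):.3f}")
```

With an increasing failure rate and Cf > Cp, the cost rate has an interior minimum: replacing too early wastes useful life, replacing too late incurs costly failures.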
2.2.2 Block Replacement Models
In block replacement policies, the components of a system are replaced at failure or at fixed times kT (k = 1, 2, ...), whichever occurs first. Barlow et al. [7] describe a basic block replacement model. To avoid that a component that has just been replaced is replaced again, a modified block replacement model is proposed in [10]: a component is not replaced at a scheduled replacement time if its age is less than T. This model was modified in [11] to capture that the operational cost of a unit is higher when it becomes older. Moreover, the model of [10] is extended in [5] to allow multi-component systems with any discrete lifetime distribution.
2.2.3 Condition Based Maintenance Models
CBM is being introduced in many systems to avoid unnecessary maintenance and prevent incipient failures. In wind turbines, condition monitoring is being introduced for components like the gearbox, blades, etc. [32]. One problem prior to the optimization is to identify the relevant variables and their relation to failure modes and probabilities. CBM optimization models focus on different questions related to inspected/monitored components. One question concerns the optimal limits for the monitored variables above which it is necessary to perform maintenance. The optimal wear-limit for preventive replacement of a component is derived in [34]. The model is extended in [35] to include different monitoring variables. For components subject to inspection, at each decision epoch one must decide whether maintenance should be performed and when the next inspection should occur. In [2], the inspections occur at fixed times and the decision on preventive replacement of the component depends on its condition at inspection. In [9], a Semi-Markov Decision Process (SMDP, see Chapter 6) is proposed to optimize, at each inspection, the maintenance decision and the time to the next inspection. An age replacement policy model that takes into account the information from condition monitoring devices is proposed in [25]. A proportional hazards model is used to model the effect of the monitored variables. The assumption of a proportional hazards model is that the hazard function is the product of two functions, one depending on the time and one on the parameters (monitored variables).
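The proportional hazards assumption described above can be written out explicitly; in the sketch below h_0 is the baseline hazard, z(t) the vector of monitored variables and \beta a weight vector (the symbols are illustrative, not taken from [25]):

```latex
h\bigl(t, z(t)\bigr) = h_0(t)\, \exp\!\bigl(\beta^\top z(t)\bigr)
```

The first factor depends only on time (e.g. a Weibull baseline), while the second scales the hazard up or down according to the monitored condition.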
2.2.4 Opportunistic Maintenance Models
Opportunistic maintenance considers unexpected opportunities for performing preventive maintenance. With the failure of a component, it is possible to perform PM on other components. This could be interesting for offshore wind farms, for example: the trip to the wind farm, by boat or helicopter, is necessary and can be very expensive, so by grouping maintenance actions money can be saved. Haurie et al. [19] focus on a group preventive replacement policy for m identical components that are in the same condition. Both discrete and continuous time are considered and a dynamic programming equation is derived. The model is extended in [26] to m non-identical components. A rolling horizon dynamic programming algorithm is proposed in [45] to take short-term information into account. It can be combined with many maintenance optimization models.
2.2.5 Other Models and Classification Criteria
Other models integrate the possibility of a limited number of spare parts or a possible choice between different spare parts. E.g. cannibalization models allow the re-use of some components or subcomponents of a system. Other criteria can be used to classify maintenance optimization models. The number of components in consideration is important; e.g. multi-component models are more interesting for power systems. The time horizon considered in the model is important. Many articles consider an infinite time horizon. More focus should be placed on finite horizons since they are more practical. Another characteristic of a model is the time representation, i.e. whether discrete or continuous time is considered. One distinction can be made between models with deterministic and stochastic lifetimes of components. Among stochastic approaches, it can be interesting to consider which kinds of lifetime distributions can be used. The method used for solving the problem has an influence on the solution. A model that cannot be solved is of no interest. For some models, exact solutions are possible. For complex models, it is either necessary to simplify the model or to use heuristic methods to find approximate solutions.
Chapter 3

Introduction to the Power System
3.1 Power System Presentation
Power systems are very complex. They are composed of thousands of components linked through a complex mesh of lines and cables that have limited capacities. With the deregulation of power systems, the generation, distribution and transmission systems are separated. Even considered independently, each part of the power system is complex, with many components and subcomponents.
3.1.1 Main Parts of the Power System
A simple description of the power system includes the following main parts: 1. Generation: These are the generation units that produce the power, e.g. hydro-power units, nuclear power plants, wind farms, etc. The total power consumed is always equal to the power generated. 2. Transmission: The transmission system is composed of high voltage and high power lines. This part of the system is in general meshed. The transmission system connects distribution systems with generation units.
3. Distribution: The distribution system is at a voltage level below transmission and is connected to the customers. It connects the transmission system with the consumers. Distribution systems are in general operated radially (one connection point to the transmission system). 4. Consumption: The consumers can be divided into different categories: industry, commercial, household, office, agriculture, etc. The costs of interruption are in general different for the different categories of consumer, and these costs also depend on the time of outage. The trade of electricity between producers and consumers is made through different specific markets around the world. The rules and organization are different for each market place. The bids of electricity trades are declared in advance to the system operator. This is necessary to check that the power system can withstand the operational conditions. The power system is controlled in real time, both automatically (automatic control and protection devices) and manually (with the help of the system operator, who coordinates the necessary actions to avoid dangerous situations). Each component of the system influences the others. If a component has a functional failure, it can induce failures of other components. Cascading failures can have drastic consequences, such as blackouts.
3.1.2 Maintenance in Power Systems
The objective is to find the right way to do maintenance. Corrective Maintenance and Preventive Maintenance should be balanced for each component of a system and the optimal PM approaches should be determined. Reliability Centered Maintenance (RCM) is being introduced in power companies (see [47] for an example in hydropower). RCM is a structured approach to finding a balance between corrective and preventive maintenance. Research on Reliability Centered Asset Maintenance (RCAM), a quantitative approach to RCM, is being carried out in the RCAM group at the KTH School of Electrical Engineering. Bertling et al. [12] define the approach and its different steps in detail. An important step is the maintenance optimization. In Hilber et al. [20], a method based on a monetary importance index is proposed to define the importance of individual components in a network. Ongoing research focuses, for example, on wind power (see [39], [32]). Research on power generation is typically focused on predictive maintenance using condition monitoring systems (see for example [18] or [44]). The problem of maintenance for transmission and distribution systems has received more attention since the deregulation of the electricity market (see for example [12], [27] for distribution systems and [22], [30] for transmission systems). The emergence of new condition monitoring systems is changing the approach to maintenance in power systems. There is a need for new models and methods to optimize the use of condition monitoring systems.
3.2 Costs
Possible costs/incomes related to maintenance in power systems have been identified (non-exhaustively) as follows:

- Manpower cost: Cost for the maintenance team that performs the maintenance actions.
- Spare part cost: The cost of a new component is an important part of the maintenance cost.
- Maintenance equipment cost: Special equipment may be needed for undertaking the maintenance; a helicopter can sometimes be necessary for the maintenance of some parts of an offshore wind turbine.
- Energy production: The electricity produced is sold to consumers on the electricity market. The price of electricity can fluctuate. At the same time, the power produced by a generating unit can fluctuate depending on factors like the weather (for renewable energy). The condition of the unit can also influence its efficiency.
- Unserved energy/interruption cost: If there is an agreement to produce/deliver energy to a consumer at some specific time, unserved energy must be paid for. The cost depends on the contract, and the cost per unit time depends on the duration of the failure.
- Inspection/monitoring cost: Inspection or monitoring systems have a cost that must be considered. The cost can be an initial investment (for continuous monitoring systems) or discrete costs (each time an inspection, measurement or test is done on an asset).
3.3 Main Constraints
Possible constraints for the maintenance of power systems have been identified as follows:
- Manpower: The size and availability of the maintenance staff is limited.
- Maintenance equipment: The equipment needed for undertaking the maintenance must be available.
- Weather: The weather can force certain maintenance actions to be postponed; e.g. in very windy conditions it is not possible to carry out maintenance on offshore wind farms.
- Availability of spare parts: If the needed spare parts are not available, maintenance cannot be done. It can also happen that a spare part is available but far away from the location where it is needed; the transportation takes both money and time.
- Maintenance contracts: Power companies can subscribe to maintenance services from the manufacturer of a system. This is a typical option for wind turbines [33]. The time span of a contract can be a constraint for an optimization model.
- Availability of condition monitoring information: If condition monitoring systems are installed on a system, the information gathered by the monitoring devices is not always available to non-manufacturer companies. The availability of monitoring information has an important impact on the possible inputs for an optimization model.
- Statistical data: Available monitoring information has value only if conclusions about the deterioration or failure state of a system can be drawn from it. Statistical data are necessary to create a probabilistic model.
Chapter 4

Introduction to Dynamic Programming
4.1 Introduction
Dynamic Programming deals with multi-stage, or sequential, decision problems. At each decision epoch, the decision maker (also called agent or controller in different contexts) observes the state of a system. (It is assumed in this thesis that the system is perfectly observable.) An action is decided based on this state. This action will result in an immediate cost (or reward) and influence the evolution of the system. The aim of DP is to minimize (or maximize) the cumulative cost (respectively income) resulting from a sequence of decisions. In the following, important ideas concerning Dynamic Programming are discussed.
4.1.1 Principle of Optimality
Dynamic programming is a way of decomposing a large problem into subproblems. It can be applied to any problem that satisfies the principle of optimality:
"An optimal policy has the property that, whatever the initial state and optimal first decision may be, the remaining decisions constitute an optimal policy with regard to the state resulting from the first decision." [8]
The solutions of the subproblems are themselves solutions of the general problem. The principle implies that at each stage the decisions are based only on the current state of the system. The previous decisions should have no influence on the actual evolution of the system and the possible actions. Basically, in maintenance problems, this would mean that maintenance actions only have an effect on the state of the system directly after their accomplishment; they do not influence the deterioration process after they have been completed.
4.1.2 Deterministic and Stochastic Systems
A system is said to be deterministic if the state at the next epoch depends only on the current state and the action taken. If a system is subject to probabilistic events, it will evolve according to a probability distribution depending on the current state and the action chosen. The system is then referred to as probabilistic or stochastic. Functional failures are in general represented as stochastic events. In consequence, stochastic maintenance optimization models are of interest.
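Such stochastic behavior is commonly modeled with a Markov chain over deterioration states. The sketch below is illustrative only (the states and probabilities are invented, not from the thesis): it simulates a component that degrades from good to degraded to failed when no maintenance action is taken:

```python
import random

# Illustrative deterioration chain: states 0 = good, 1 = degraded, 2 = failed.
# Each row gives the probability distribution over next states.
P_NO_ACTION = [
    [0.90, 0.08, 0.02],   # from good
    [0.00, 0.85, 0.15],   # from degraded
    [0.00, 0.00, 1.00],   # failed is absorbing without repair
]

def step(state, matrix, rng):
    """Sample the next state from the row of the transition matrix."""
    r, cum = rng.random(), 0.0
    for nxt, p in enumerate(matrix[state]):
        cum += p
        if r < cum:
            return nxt
    return len(matrix[state]) - 1

rng = random.Random(0)
trajectory = [0]                       # start in the good state
for _ in range(30):
    trajectory.append(step(trajectory[-1], P_NO_ACTION, rng))
print(trajectory)
```

In an SDP model, each maintenance action would correspond to its own transition matrix (e.g. a replacement resetting the component to the good state), and the optimization chooses among them at each stage.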
4.1.3 Time Horizon
The time horizon of a model is the time "window" considered for the optimization. One distinguishes between finite and infinite time horizons. Chapter 5 focuses on finite horizon stochastic dynamic programming. In the context of maintenance, the objective would be, for example, to minimize the maintenance costs during the time horizon considered. Chapters 6 and 7 focus on models that assume an infinite time horizon. This assumption implies that the system is stationary, i.e. that it evolves in the same manner at all times. Moreover, an infinite horizon optimization implicitly assumes that the system is used for an infinite time. This can be a good approximation if the lifetime of the system is indeed very long.
4.1.4 Decision Time
In this thesis, we focus mainly on Stochastic Dynamic Programming (SDP) with discrete sets of decision epochs (Chapters 5 to 7). Decisions are made at each decision epoch. The time is divided into stages, or periods, between these epochs. It is clear that the time interval between two stages will have an influence on the result. Short intervals are more realistic and precise, but the models can become heavy if the time horizon is large. In practice, long intervals can be used for long-term planning, while short-term planning considers shorter intervals. A continuum of decision epochs implies that decisions can be made either continuously, at some points decided by the decision maker, or when an event occurs. The two last possibilities will be briefly investigated in Chapter 6. Continuous decisions refer to optimal control theory and will not be discussed here.
4.1.5 Exact and Approximate Solution Methods
Dynamic Programming suffers from a complexity problem, the curse of dimensionality (discussed in Section 4.2). Methods for solving dynamic programming models exactly exist and are presented in Chapters 5 and 6. However, large models are intractable with these methods. Chapter 7 provides an introduction to the field of Reinforcement Learning (RL), which focuses on approximations of DP solutions. Approximate algorithms are obtained by combining DP and supervised learning algorithms. RL is also known as neuro-dynamic programming when DP is combined with neural networks [13].
4.2 Deterministic Dynamic Programming
This section introduces the basics of deterministic Dynamic Programming. The optimality equation is presented, together with the value iteration algorithm to solve it. The section is illustrated with a classical example of a simple shortest path problem.
4.2.1 Problem Formulation
The three main parts of a DP model are its state and decision spaces, its dynamic and cost functions, and its objective function. The finite horizon model considers a system that evolves for N stages.

State and Decision Spaces

At each stage k, the system is in a state Xk = i that belongs to a state space Xk. Depending on the state of the system, the decision maker decides on an action u = Uk ∈ Uk(i).

Dynamic and Cost Functions

As a result of this action, the system state at the next stage will be Xk+1 = fk(i, u). Moreover, the action has a cost that the decision maker has to pay, Ck(i, u). A possible terminal cost CN(XN) is associated with the terminal state (the state at stage N).

Objective Function

The objective is to determine the sequence of decisions that will minimize the cumulative cost (also called the cost-to-go function) subject to the dynamics of the system:
J0(X0) = min_{U0, ..., UN-1} [ Σ_{k=0}^{N-1} Ck(Xk, Uk) + CN(XN) ]
where
  N         Number of stages
  k         Stage
  i         State at the current stage
  j         State at the next stage
  Xk        State at stage k
  Uk        Decision action at stage k
  Ck(i, u)  Cost function
  CN(i)     Terminal cost for state i
  fk(i, u)  Dynamic function
  Jk(i)     Optimal cost-to-go starting from state i
4.2.2 Optimality Equation
The optimality equation (also known as Bellman's equation) follows directly from the principle of optimality. It states that the optimal cost-to-go function starting from stage k can be derived with the following formula:
Jk(i) = min_{u ∈ Uk(i)} { Ck(i, u) + Jk+1(fk(i, u)) }        (4.1)
The algorithm goes backwards, starting from the last stage. It stops when k=0.
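The backward recursion of equation (4.1) can be sketched as follows; the function names and the toy problem are illustrative, not from the thesis:

```python
def backward_dp(N, states, actions, f, C, C_terminal):
    """Backward value iteration for a deterministic finite-horizon problem.
    states[k]    : list of states at stage k (k = 0..N)
    actions(k,i) : admissible decisions U_k(i)
    f(k, i, u)   : dynamic function, giving the next state
    C(k, i, u)   : stage cost; C_terminal(i): terminal cost
    Returns the cost-to-go table J and an optimal decision table mu."""
    J = {(N, i): C_terminal(i) for i in states[N]}
    mu = {}
    for k in range(N - 1, -1, -1):          # backwards; stops when k = 0
        for i in states[k]:
            best_u, best_cost = None, float("inf")
            for u in actions(k, i):
                # Equation (4.1): stage cost plus cost-to-go of next state
                cost = C(k, i, u) + J[(k + 1, f(k, i, u))]
                if cost < best_cost:
                    best_u, best_cost = u, cost
            J[(k, i)], mu[(k, i)] = best_cost, best_u
    return J, mu

# Toy usage (hypothetical data): 2 stages, states {0, 1},
# moving to state u costs |i - u| + u, no terminal cost.
states = [[0], [0, 1], [0, 1]]
J, mu = backward_dp(2, states,
                    lambda k, i: [0, 1],        # admissible decisions
                    lambda k, i, u: u,          # dynamic function f_k(i,u) = u
                    lambda k, i, u: abs(i - u) + u,
                    lambda i: 0.0)
print(J[(0, 0)], mu[(0, 0)])
```

The tables J and mu together represent the optimal policy: at stage k in state i, take decision mu[(k, i)] at a total remaining cost of J[(k, i)].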
4.2.3 A Shortest Path Example
Deterministic dynamic programming can be used to solve simple shortest path problems with a small state space. An example is used to illustrate the formulation and the value iteration algorithm. The following shortest path problem is considered:
[Figure: Shortest path example. Node A (Stage 0); nodes B, C, D (Stage 1); nodes E, F, G (Stage 2); nodes H, I, J (Stage 3); node K (Stage 4). Each arc is labeled with its cost.]
The aim of the problem is to determine the shortest way to reach node K starting from node A. A cost (corresponding to a distance) is associated with each arc. A first way to solve the problem would be to calculate the cost of all possible paths. For example, the path A-B-F-J-K has a cost of 2+6+2+7=17. The shortest path would then be the one with the lowest cost. Dynamic programming provides a more efficient way to solve the problem. Instead of calculating all path costs, the problem is divided into subproblems that are solved recursively to determine the shortest path from each possible node to the terminal node K.
4.2.3.1
Problem Formulation
The problem is divided into five stages: N = 5, k ∈ {0, 1, 2, 3, 4}.

State Space
The state space is defined for each stage:
X_0 = {A} = {0},  X_1 = {B, C, D} = {0, 1, 2},  X_2 = {E, F, G} = {0, 1, 2},
X_3 = {H, I, J} = {0, 1, 2},  X_4 = {K} = {0}
Each node of the problem is defined by a state X_k. For example, X_2 = 1 corresponds to node F. In this problem, the state space is defined by one variable. It is also possible to have a multi-variable state space, for which X_k would be a vector.

Decision Space
The set of possible decisions must be defined for each state at each stage. In the example, the choice is "which way should I take from this node to go to the next stage?". The following notation is used, for k = 1, 2, 3:

U_k(i) = {0, 1}    for i = 0
         {0, 1, 2} for i = 1
         {1, 2}    for i = 2

For example, U_1(0) = U_1(B) = {0, 1}, with u_1(0) = 0 for the transition B → E or u_1(0) = 1 for the transition B → F. Another example is U_1(2) = U_1(D) = {1, 2}, with u_1(2) = 1 for the transition D → F or u_1(2) = 2 for the transition D → G.

A sequence π = {μ_0, μ_1, ..., μ_{N−1}}, where μ_k(i) is a function mapping the state i at stage k to an admissible control for this state, is called a policy. The value iteration algorithm determines the optimal policy of the problem, π* = {μ*_0, μ*_1, ..., μ*_{N−1}}.

Dynamic and Cost Functions
The dynamic function of the example is simple thanks to the notation used: f_k(i, u) = u. The transition costs are defined as the distance from one state to the resulting state of the decision. For example, C_1(0, 0) = C(B → E) = 4. The cost function is defined in the same way for the other stages and states.

Objective Function
J_0(0) = min_{U_k ∈ U_k(X_k)} { Σ_{k=0}^{3} C_k(X_k, U_k) + C_4(X_4) }
4.2.3.2
Solution
The value iteration algorithm is used to solve the problem. The algorithm is initiated at the last stage and then iterated backwards until the initial state is reached. The optimal decision sequence is then obtained forwards, by applying the optimal decisions determined by the DP algorithm along the sequence of states that is visited. The solutions of the algorithm are given in Appendix A.

The optimal cost-to-go is J_0(0) = 8. It corresponds to the path A → D → G → I → K. The optimal policy of the problem is π* = {μ*_0, μ*_1, μ*_2, μ*_3} with μ*_k(i) = u*_k(i) (for example μ*_1(1) = 2, μ*_1(2) = 2).
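As a sketch, the backward recursion can be implemented directly. Note that most of the arc costs below are hypothetical: the figure did not survive extraction, so only the arcs mentioned in the text (A-B-F-J-K with costs 2, 6, 2, 7, and B → E with cost 4) are taken from it, and the remaining costs are invented so that the stated optimum J_0(0) = 8 along A-D-G-I-K is reproduced.

```python
# Backward value iteration for the deterministic shortest-path example.
# Most arc costs are HYPOTHETICAL (see the note above): only A-B, B-E, B-F,
# F-J and J-K are taken from the text.
cost = {
    ('A', 'B'): 2, ('A', 'C'): 4, ('A', 'D'): 3,
    ('B', 'E'): 4, ('B', 'F'): 6, ('C', 'E'): 2, ('C', 'F'): 1,
    ('C', 'G'): 3, ('D', 'F'): 5, ('D', 'G'): 2,
    ('E', 'H'): 2, ('E', 'I'): 5, ('F', 'H'): 7, ('F', 'I'): 3,
    ('F', 'J'): 2, ('G', 'I'): 1, ('G', 'J'): 2,
    ('H', 'K'): 4, ('I', 'K'): 2, ('J', 'K'): 7,
}
stages = [['A'], ['B', 'C', 'D'], ['E', 'F', 'G'], ['H', 'I', 'J'], ['K']]

J = {'K': 0.0}                          # terminal cost C_N = 0
policy = {}
for k in range(len(stages) - 2, -1, -1):    # backwards from the last stage
    for i in stages[k]:
        # admissible decisions = successors of node i at the next stage
        succ = [j for j in stages[k + 1] if (i, j) in cost]
        u = min(succ, key=lambda j: cost[i, j] + J[j])
        J[i] = cost[i, u] + J[u]        # optimality equation (4.1)
        policy[i] = u

# Recover the optimal path forwards from A.
path, node = ['A'], 'A'
while node != 'K':
    node = policy[node]
    path.append(node)
print(J['A'], path)                     # 8.0 ['A', 'D', 'G', 'I', 'K']
```

With these costs the brute-force path A-B-F-J-K indeed costs 17, while the recursion finds the optimum 8 without enumerating all paths.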
Chapter 5
Finite Horizon Stochastic Dynamic Programming
5.1
Problem Formulation
Stochastic dynamic programming can be used to model systems whose dynamics are probabilistic (or subject to disturbances). The state of the system at the next stage is not deterministic as in Chapter 4: it depends on the current state and decision, but also on a stochastic variable that describes the disturbance, i.e. the stochastic behavior of the system. A stochastic dynamic programming model can be formulated as below:

State Space
A variable k ∈ {0, ..., N} represents the different stages of the problem. In general it corresponds to a time variable. The state of the system is characterized by a variable i = X_k. The possible states are represented by a set of admissible states that can depend on k, X_k ∈ X_k.

Decision Space
At each decision epoch, the decision maker must choose an action u = U_k among a set of admissible actions. This set can depend on the state of the system and on the stage, u ∈ U_k(i).

Dynamic of the System and Transition Probabilities
In contrast to the deterministic case, the state transition depends not only on the control used but also on a disturbance ω_k = ω_k(i, u):

X_{k+1} = f_k(X_k, U_k, ω_k),  k = 0, 1, ..., N − 1

The effect of the disturbance can be expressed with transition probabilities. The transition probabilities define the probability that the state of the system at stage k+1 is j, given that the state and control at stage k are i and u. These probabilities can also depend on the stage:

P_k(j, u, i) = P(X_{k+1} = j | X_k = i, U_k = u)

If the system is stationary (time-invariant), the dynamic function f does not depend on time and the notation for the probability function can be simplified:

P(j, u, i) = P(X_{k+1} = j | X_k = i, U_k = u)

In this case, one refers to a Markov decision process. If a control u is fixed for each possible state of the model, then the transition probabilities can be represented by a Markov model (see Chapter 9 for an example).

Cost Function
A cost is associated with each possible transition (i, j) and action u. The costs can also depend on the stage:

C_k(j, u, i) = C_k(X_{k+1} = j, U_k = u, X_k = i)

If the transition (i, j) occurs at stage k when the decision is u, then the cost C_k(j, u, i) is incurred. If the cost function is stationary, the notation is simplified to C(j, u, i). A terminal cost C_N(i) can be used to penalize deviations from a desired terminal state.

Objective Function
The objective is to determine the sequence of decisions that minimizes the expected cumulative cost (cost-to-go function) J*(X_0), where X_0 is the initial state of the system:

J*(X_0) = min_{U_k ∈ U_k(X_k)} E{ C_N(X_N) + Σ_{k=0}^{N−1} C_k(X_{k+1}, U_k, X_k) }
Notation:
N        Number of stages
k        Stage
i        State at the current stage
j        State at the next stage
X_k      State at stage k
U_k      Decision action at stage k
ω_k      Probabilistic function of the disturbance
C_k      Cost function
C_N(i)   Terminal cost for state i
f_k      Dynamic function
J_k(i)   Optimal cost-to-go starting from state i
5.2
Optimality Equation
J_k(i) = min_{u ∈ U_k(i)} E{ C_k(X_{k+1}, u, i) + J_{k+1}(X_{k+1}) }    (5.1)

This equation defines a condition for the cost-to-go function of a state i at stage k to be optimal. The equation can be rewritten using the transition probabilities:
J_k(i) = min_{u ∈ U_k(i)} Σ_{j ∈ X_{k+1}} P_k(j, u, i) [ C_k(j, u, i) + J_{k+1}(j) ]    (5.2)
Notation:
X_k           State space at stage k
U_k(i)        Decision space at stage k for state i
P_k(j, u, i)  Transition probability function
5.3
Value Iteration
The Value Iteration (VI) algorithm for SDP problems is directly based on equation 5.2. The algorithm starts from the last stage and, by backward recursion, determines at each stage the optimal decision for each state of the system.
J_N(i) = C_N(i),  ∀ i ∈ X_N    (initialisation)

While k ≥ 0 do

J_k(i) = min_{u ∈ U_k(i)} Σ_{j ∈ X_{k+1}} P_k(j, u, i) [ C_k(j, u, i) + J_{k+1}(j) ],  ∀ i ∈ X_k

μ*_k(i) = argmin_{u ∈ U_k(i)} Σ_{j ∈ X_{k+1}} P_k(j, u, i) [ C_k(j, u, i) + J_{k+1}(j) ],  ∀ i ∈ X_k

k := k − 1
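As a minimal sketch, the backward recursion can be implemented for a small hypothetical MDP. The two states can be read as "working" and "failed" and the two actions as "do nothing" and "repair"; all transition probabilities and costs below are invented for illustration.

```python
# Finite-horizon stochastic value iteration on a HYPOTHETICAL 2-state,
# 2-action stationary MDP (all numbers invented for illustration).
N = 3                                    # number of stages
states, actions = [0, 1], [0, 1]
# P[u][i][j] = P(X_{k+1} = j | X_k = i, U_k = u)
P = {0: [[0.9, 0.1], [0.0, 1.0]],        # action 0: "do nothing"
     1: [[1.0, 0.0], [1.0, 0.0]]}        # action 1: "repair" -> state 0
# C[u][i] = expected one-stage cost of action u in state i
C = {0: [0.0, 10.0], 1: [5.0, 15.0]}

J = [0.0, 0.0]                           # terminal costs C_N(i) = 0
policy = []
for k in range(N - 1, -1, -1):           # backwards: k = N-1, ..., 0
    Jk, muk = [], []
    for i in states:
        # expected cost of each action, per equation (5.2)
        costs = [C[u][i] + sum(P[u][i][j] * J[j] for j in states)
                 for u in actions]
        Jk.append(min(costs))
        muk.append(costs.index(min(costs)))
    J, policy = Jk, [muk] + policy       # policy[k][i] = mu*_k(i)
print(J, policy)
```

Running the recursion shows the typical structure of the solution: at early stages it is worth repairing the failed state (action 1), while at the last stage no repair pays off anymore.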
5.4
The Curse of Dimensionality
Consider a finite horizon stochastic dynamic problem with N stages, N_X state variables where the size of the set for each state variable is S, and N_U control variables where the size of the set for each control variable is A. The time complexity of the algorithm is O(N S^{2N_X} A^{N_U}). The complexity of the problem thus increases exponentially with the size of the problem (the number of state or decision variables). This characteristic of SDP is called the curse of dimensionality.
5.5
State Variables for Maintenance Models
In this section, possible state variables for maintenance models based on SDP are discussed.
5.5.1
Age and Deterioration States
The failure probability of components is often modelled as a function of time, so a possible state variable for a component is its age. To be precise, the age of the component should be discretized according to the stage duration. If the lifetime of a component is very long, this can lead to a very large state space. The time horizon can be taken into account to reduce the number of states: if a state variable cannot reach certain states during the planned horizon, these states can be neglected. If a component, subcomponent or part of a system can be inspected or monitored, different levels of deterioration can be used as a state variable. In practice, age and deterioration state variables could be used in a complementary way. Of course, maintenance states should be considered in both cases. It could also be possible to have different types of failure states, such as major and minor failures. Minor failures could be cleared by repair, while after a major failure the component should be replaced.
5.5.2
Forecasts
Measurements or forecasts can sometimes provide estimates of the disturbances a system is or can be subject to. The reliability of the forecasts should be carefully considered. Deterministic information could be used to adapt the finite horizon model to its horizon of validity. It would also be possible to generate different scenarios from forecasts, solve the problem for the different scenarios, and draw conclusions from the different solutions. Another way of using forecasting models is to include them in the maintenance problem formulation by adding a specific variable. This reduces the uncertainties, but in return increases the complexity. The model proposed in Chapter 9 gives an example of how to integrate a forecasting model in an electricity scenario. Another factor that could be interesting to forecast is the load. Indeed, the generation must always be in balance with the consumption, and when consumption is low some generating units are stopped. This time can be used for the maintenance of the power plant. Weather forecasting could also be interesting in some cases. For example, the power generated by wind farms depends on the wind strength, and maintenance actions on offshore wind farms are possible only in case of good weather. For these two reasons, wind forecasting could be interesting for optimizing maintenance actions at offshore wind farms.
5.5.3
Time Lags
An important assumption of a DP model is that the dynamics of the system depend only on the current state (and possibly on time, if the system dynamics are not stationary). This memoryless condition (the Markov property) is very strong and unrealistic in some cases. It is sometimes possible, if the system dynamics depend on a few preceding states, to overcome this assumption: variables are added to the DP model to keep the preceding states in memory. The computational price is once again very high. For example, in the context of maintenance, it would be interesting to know the deterioration level of an asset at the preceding stage, since it would give information about the dynamics of the deterioration process.
Chapter 6
Infinite Horizon Stochastic Dynamic Programming
6.1
Problem Formulation
The state space, decision space, probability function and cost function of IHSDP are defined in a similar way as for FHSDP in the stationary case. The aim of IHSDP is to minimize the cumulative cost of a system over an infinite number of stages. This sum is called the cost-to-go function. An interesting feature of IHSDP models is that the solution of the problem is a stationary policy: the solution has the form π = {μ, μ, ...}, where μ is a function mapping the state space to the control space. For each i ∈ X, μ(i) is an admissible control for the state i, μ(i) ∈ U(i). The objective is to find the optimal μ*, the policy that minimizes the cost-to-go function. To be able to compare different policies, it is necessary that the infinite sum of costs converges. Three types of models can be considered: stochastic shortest path problems, discounted problems and average cost per stage problems.

Stochastic shortest path models
Stochastic shortest path dynamic programming models have a terminal state (a cost-free termination state) that is unavoidable. When this state is reached, the system remains in it and no further costs are paid:

J*(X_0) = min_μ E{ lim_{N→∞} Σ_{k=0}^{N−1} C(X_{k+1}, μ(X_k), X_k) }

subject to

X_{k+1} = f(X_k, μ(X_k), ω(X_k, μ(X_k))),  k = 0, 1, ...

Notation:
μ        Decision policy
J*(i)    Optimal cost-to-go function for state i
Discounted problems
Discounted IHSDP models have a cost function that is discounted by a factor α, where α is a discount factor (0 < α < 1). The cost at stage k has the form α^k C_ij(u). As C_ij(u) is bounded, the infinite sum converges (a decreasing geometric progression):

J*(X_0) = min_μ E{ lim_{N→∞} Σ_{k=0}^{N−1} α^k C(X_{k+1}, μ(X_k), X_k) }
Average cost per stage problems
Infinite horizon problems can sometimes not be modelled with a cost-free termination state or with discounting. To make the cost-to-go finite, the problem can then be modelled as an average cost per stage problem, where the aim is to minimize

J* = min_μ E{ lim_{N→∞} (1/N) Σ_{k=0}^{N−1} C(X_{k+1}, μ(X_k), X_k) }
6.2
Optimality Equations
The optimality equations are formulated using the probability function P(j, u, i). The stationary policy solution of an IHSDP shortest path problem is the solution of Bellman's equation (another name for the optimality equation; Bellman is the mathematician at the origin of DP theory). The cost-to-go function of a stationary policy μ satisfies

J_μ(i) = Σ_{j ∈ X} P(j, μ(i), i) [ C(j, μ(i), i) + J_μ(j) ],  ∀ i ∈ X

and the optimal cost-to-go function satisfies

J*(i) = min_{u ∈ U(i)} Σ_{j ∈ X} P(j, u, i) [ C(j, u, i) + J*(j) ],  ∀ i ∈ X

Notation:
J_μ(i)   Cost-to-go function of policy μ starting from state i
J*(i)    Optimal cost-to-go function for state i
The optimality equation for average cost-to-go IHSDP problems is discussed in Section 6.6.
6.3
Value Iteration
To solve the optimality equations, a first idea would be to use the value iteration algorithm presented in Chapter 5. Intuitively, the algorithm should converge to the optimal policy, and it can indeed be shown that it converges to the optimal solution. If the model is discounted, the method can be fast: the time complexity is polynomial in the size of the state space, the size of the control space, and 1/(1 − α). For non-discounted models, the theoretical number of iterations needed is infinite, and a stopping criterion must be determined. An alternative to this method is the Policy Iteration (PI) algorithm, which terminates after a finite number of iterations.
6.4
Policy Iteration
Given a policy μ, the first step of the algorithm evaluates the policy by calculating the expected cost-to-go function resulting from this policy. The next step of the algorithm improves the policy based on this expected cost-to-go function. This two-step procedure is applied iteratively; the process stops when a policy is a solution of its own improvement. The algorithm starts with an initial policy μ_0 and can then be described by the following steps:

Step 1. Policy Evaluation
J_{μ_q}(i) is calculated as the solution of the linear system

J_{μ_q}(i) = Σ_{j ∈ X} P(j, μ_q(i), i) [ C(j, μ_q(i), i) + J_{μ_q}(j) ],  ∀ i ∈ X

This is the expected cost-to-go function of the system using the policy μ_q.

Step 2. Policy Improvement
A new policy is obtained using one step of the value iteration algorithm:

μ_{q+1}(i) = argmin_{u ∈ U(i)} Σ_{j ∈ X} P(j, u, i) [ C(j, u, i) + J_{μ_q}(j) ],  ∀ i ∈ X

If μ_{q+1} = μ_q, the algorithm stops; otherwise, go back to the policy evaluation step. At each iteration the algorithm improves the policy, so if the initial policy μ_0 is already good, the algorithm converges fast to the optimal solution.
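A minimal sketch of the two steps on a hypothetical discounted 2-state, 2-action MDP (all numbers invented): the evaluation step solves the small linear system with a basic Gaussian elimination, and the improvement step is one value iteration sweep.

```python
# Policy iteration on a HYPOTHETICAL discounted 2-state, 2-action MDP
# (numbers invented; alpha is the discount factor).
alpha = 0.9
states, actions = [0, 1], [0, 1]
P = {0: [[0.9, 0.1], [0.0, 1.0]],        # P[u][i][j]
     1: [[1.0, 0.0], [1.0, 0.0]]}
C = {0: [0.0, 10.0], 1: [5.0, 15.0]}     # C[u][i]

def solve(A, b):
    """Gaussian elimination for the small policy-evaluation system."""
    n = len(b)
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (b[r] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

mu = [0, 0]                              # initial policy mu_0
while True:
    # Step 1: policy evaluation, J = C_mu + alpha * P_mu * J
    A = [[(1.0 if i == j else 0.0) - alpha * P[mu[i]][i][j]
          for j in states] for i in states]
    b = [C[mu[i]][i] for i in states]
    J = solve(A, b)
    # Step 2: policy improvement (one value iteration sweep)
    new = [min(actions, key=lambda u: C[u][i]
               + alpha * sum(P[u][i][j] * J[j] for j in states))
           for i in states]
    if new == mu:
        break                            # the policy improves itself: optimal
    mu = new
print(mu, J)
```

On these invented numbers, the algorithm terminates after two improvement steps with the policy "wait in state 0, repair in state 1".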
6.5
Modified Policy Iteration
If the number of states is large, solving the linear system of the policy evaluation step can be computationally intensive. An alternative is to use, at each policy evaluation step, the value iteration algorithm for a finite number of iterations M to estimate the value function of the policy μ. The algorithm is initialized with a value function J^M(i) that must be chosen higher than the true value J_μ(i).

While m ≥ 0 do

J^m(i) = Σ_{j ∈ X} P(j, μ(i), i) [ C(j, μ(i), i) + J^{m+1}(j) ],  ∀ i ∈ X

m := m − 1

where m is the number of iterations left for the evaluation step of modified policy iteration.
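The evaluation step can be sketched as M value iteration sweeps under the fixed policy, instead of an exact linear solve (hypothetical discounted 2-state MDP; all numbers invented):

```python
# Modified policy evaluation: M sweeps of value iteration under a FIXED
# policy, instead of solving the linear system exactly (HYPOTHETICAL
# discounted 2-state MDP; numbers invented).
alpha, M = 0.9, 100
states = [0, 1]
P = {0: [[0.9, 0.1], [0.0, 1.0]],        # P[u][i][j]
     1: [[1.0, 0.0], [1.0, 0.0]]}
C = {0: [0.0, 10.0], 1: [5.0, 15.0]}     # C[u][i]
mu = [0, 1]                              # the fixed policy to evaluate

J = [200.0, 200.0]                       # initial guess above the true values
for _ in range(M):                       # m = M-1, ..., 0
    J = [C[mu[i]][i]
         + alpha * sum(P[mu[i]][i][j] * J[j] for j in states)
         for i in states]
print(J)                                 # approaches the exact evaluation
```

Each sweep contracts the error by the factor α, so after M sweeps the estimate is within α^M of the initial error; for α = 0.9 and M = 100 it is practically indistinguishable from the exact linear-system solution.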
6.6
Average Cost-to-go Problems
The methods presented in Sections 6.3-6.5 cannot be applied directly to average cost problems. Average cost-to-go problems are more complicated and impose conditions on the Markov decision process for the convergence of the algorithms. An average cost-to-go problem can be reformulated as an equivalent shortest path problem if the model of the Markov decision process is proved to be unichain (that is, all stationary policies generate Markov chains that consist of a single ergodic class and possibly some transient states; see [36] for details). Given a stationary policy μ and an arbitrary state X̄ ∈ X, there is a unique scalar λ_μ and vector h_μ such that h_μ(X̄) = 0 and

λ_μ + h_μ(i) = Σ_{j ∈ X} P(j, μ(i), i) [ C(j, μ(i), i) + h_μ(j) ],  ∀ i ∈ X

λ_μ is the average cost-to-go of the stationary policy μ; it is the same for all starting states. The optimal average cost λ* and the optimal policy satisfy the Bellman equation

λ* + h*(i) = min_{u ∈ U(i)} Σ_{j ∈ X} P(j, u, i) [ C(j, u, i) + h*(j) ],  ∀ i ∈ X

with

μ*(i) = argmin_{u ∈ U(i)} Σ_{j ∈ X} P(j, u, i) [ C(j, u, i) + h*(j) ],  ∀ i ∈ X
6.6.1
Relative Value Iteration
The value iteration method can be adapted to average cost-to-go problems; the resulting method is called relative value iteration. X̄ is an arbitrary reference state and h_0(i) is chosen arbitrarily. At each iteration,

h_{k+1}(i) = min_{u ∈ U(i)} Σ_{j ∈ X} P(j, u, i) [ C(j, u, i) + h_k(j) ]
           − min_{u ∈ U(X̄)} Σ_{j ∈ X} P(j, u, X̄) [ C(j, u, X̄) + h_k(j) ],  ∀ i ∈ X
The sequence h_k converges if the Markov decision process is unichain, and the algorithm converges to the optimal policy. The number of iterations needed is, in theory, infinite.
6.6.2
Policy Iteration
The problem can also be solved using the policy iteration algorithm.

Initialisation
The reference state X̄ and the initial policy μ_0 can be chosen arbitrarily.

Step 1. Policy Evaluation
Solve the system of equations h_q(X̄) = 0 and

λ_q + h_q(i) = Σ_{j ∈ X} P(j, μ_q(i), i) [ C(j, μ_q(i), i) + h_q(j) ],  ∀ i ∈ X

Step 2. Policy Improvement

μ_{q+1}(i) = argmin_{u ∈ U(i)} Σ_{j ∈ X} P(j, u, i) [ C(j, u, i) + h_q(j) ],  ∀ i ∈ X

If μ_{q+1} = μ_q, λ_{q+1} = λ_q and h_{q+1}(i) = h_q(i), the algorithm stops; otherwise set q := q + 1 and go back to Step 1.
6.7
Linear Programming
The three types of IHSDP models can be reformulated to be solved with linear programming (LP) methods. The motivation for this approach is that a linear programming model can include constraints that are not possible to include in a classical MDP model. However, the model becomes less intuitive than with the other methods. Moreover, LP can only be used for smaller state spaces than the value iteration and policy iteration methods.
For the discounted case, the equivalent linear program is

maximize   Σ_{i ∈ X} J(i)
subject to J(i) ≤ Σ_{j ∈ X} P(j, u, i) [ C(j, u, i) + α J(j) ],  ∀ i ∈ X, ∀ u ∈ U(i)
At present linear programming has not proven to be an ecient method for solving large discounted MDPs; however, innovations in LP algorithms in the past decade might change this [36].
6.8
Comparison of the Methods
For details about the complexity of the algorithms, [28] and [29] are recommended. Let n and m denote the number of states and actions. A DP method takes a number of computational operations that is less than some polynomial function of n and m; a DP method is thus guaranteed to find an optimal policy in polynomial time, even though the total number of (deterministic) policies is m^n [41]. Linear programming methods, on the other hand, become impractical at a much smaller number of states than DP methods do [41]. Since the policy iteration algorithm improves the policy at each iteration, it converges quite fast if the initial policy μ_0 is already good. There is strong empirical evidence in favor of PI over VI and LP for solving Markov decision processes [28].
6.9
Semi-Markov Decision Processes
Until now, the decision epochs were predetermined at discrete time points (periodic in the case of infinite horizon problems). However, for some applications the decision times can be random. For example, the next decision time can be decided by the decision maker depending on the current state of the system, or a decision epoch occurs each time the state of the system changes. This kind of problem refers to Semi-Markov Decision Processes (SMDP). SMDP generalize MDP by "1) allowing, or requiring, the decision maker to choose actions whenever the system state changes; 2) modeling the system evolution in continuous time; and 3) allowing the time spent in a particular state to follow an arbitrary probability distribution." [36] The time horizon is considered infinite, and actions are not taken continuously (that kind of problem refers to optimal control theory). SMDP are more complicated than MDP and are not part of this thesis. Puterman [36] explains how one can transform an SMDP model into a model solvable with the methods presented previously in this chapter. SMDP could be interesting in maintenance optimization, since they allow a choice of inspection interval for each state of the system. However, due to the complexity of the models, only small state spaces are tractable.
Chapter 7
Approximate Dynamic Programming
7.1
Introduction
The problem with the methods presented in the previous chapters is that the models are intractable for large state spaces. In this chapter, methods that overcome this problem by approximation are presented. They make use of supervised learning techniques. Supervised learning is a field that investigates the creation of functions from training data (input-output pairs) in order to predict the output for any kind of possible input data. Many approaches are possible, such as artificial neural networks, decision tree learning and Bayesian statistics.

One of the first reinforcement learning approaches used artificial neural network methods as the supervised learning technique. This approach was also called neuro-dynamic programming (see [13]). Reinforcement learning methods refer to systems that "learn how to make good decisions by observing their own behavior, and use built-in mechanisms for improving their actions through a reinforcement mechanism" [13]. The roots of the algorithms proposed in RL are the methods of Chapter 6. The system is assumed to be stationary and to be a Markov decision process. However, RL does not require that an explicit model of the system exists; the methods can even be applied in parallel with learning the environment (the MDP of the system). This can be a practical advantage, since a laborious model does not need to be built first. The state and decision spaces are assumed known.

The methods work on observed trajectory samples of the form (X_k, X_{k+1}, U_k, C_k). The samples can be used to learn directly the cost-to-go function of a given policy, or the Q-factors of a problem, without estimating the transition probabilities of the model. Section 7.2 deals with this type of learning, called direct learning. This approach is useful for large state spaces. If a model of the system exists, the method can be used with samples from Monte Carlo simulations. In the case of a real-time application, it is possible to combine the learning of the transition and cost functions with direct learning methods, to take advantage of all the experience obtained. This approach is called indirect learning (or model-based methods) and is discussed briefly in Section 7.3. The RL methods that extend the direct learning methods of Section 7.2 by using supervised learning techniques to approximate the cost-to-go function over the whole state space are presented in Section 7.4.
7.2
Direct Learning
The aim of reinforcement learning is to infer good decisions based on samples of the performance of the system, provided by simulation or real-life experience. A sample has the form (X_k, X_{k+1}, U_k, C_k), where X_{k+1} is the observed state after choosing the control U_k in state X_k, and C_k = C(X_{k+1}, U_k, X_k) is the cost resulting from this transition. If a model of the system exists, the samples can be generated by Monte Carlo simulation according to the transition probabilities P(j, u, i) and the cost function C(j, u, i).
7.2.1
Temporal Differences
Temporal differences (TD) is a method for estimating the cost-to-go function of a policy using samples resulting from the use of this policy. The method is used in the first step of the policy iteration method discussed in Chapter 6, and it can be seen in a similar way as modified policy iteration: the cost-to-go function is estimated using the costs resulting from the simulation. Note that from each state visited, the remaining trajectory starting from this state can be used as a sample for the cost-to-go function. TD is presented here in the context of stochastic shortest path problems, which means that there is a terminal state and that every simulation terminates in finite time. The method can also be adapted to discounted problems or average cost-to-go problems.

Policy evaluation by simulation
Assume that a trajectory (X_0, ..., X_N) has been generated according to the policy μ and that the sequence of transition costs C(X_k, X_{k+1}) = C(X_{k+1}, μ(X_k), X_k) has been observed. The cost-to-go resulting from the trajectory, starting from state X_k, is

V(X_k) = Σ_{n=k}^{N−1} C(X_n, X_{n+1})

If a certain number of trajectories has been generated and the state i has been visited K times in these trajectories, J_μ(i) can be estimated by

J(i) = (1/K) Σ_{m=1}^{K} V(i, m)

where V(i, m) is the observed cost-to-go from state i in trajectory m.
A recursive form of the method can be formulated:

J(i) := J(i) + γ [ V(i, m) − J(i) ],  with γ = 1/m

where m is the number of the trajectory. From a trajectory point of view,

J(X_k) := J(X_k) + γ_{X_k} [ V(X_k) − J(X_k) ]

with γ_{X_k} = 1/m, where m is the number of times X_k has already been visited by trajectories.
With the preceding algorithm, V(X_k) is calculated from the whole trajectory and can only be used once the trajectory is finished. However, the method can be reformulated by exploiting the relation V(X_k) = C(X_k, X_{k+1}) + V(X_{k+1}). At each transition of the trajectory, the cost-to-go function of the states already visited is updated. Assume that the l-th transition has just been generated. Then J(X_k) is updated for all the states that have been visited previously during the trajectory:

J(X_k) := J(X_k) + γ_{X_k} [ C(X_l, X_{l+1}) + J(X_{l+1}) − J(X_l) ],  k = 0, ..., l

TD(λ)
A generalization of the preceding algorithm is TD(λ), where a constant λ < 1 is introduced:

J(X_k) := J(X_k) + γ_{X_k} λ^{l−k} [ C(X_l, X_{l+1}) + J(X_{l+1}) − J(X_l) ],  k = 0, ..., l
Note that TD(1) is the same as policy evaluation by simulation. Another special case is λ = 0. The TD(0) algorithm is

J(X_k) := J(X_k) + γ_{X_k} [ C(X_k, X_{k+1}) + J(X_{k+1}) − J(X_k) ]

Q-factors
Once J_{μ_k}(i) has been estimated using the TD algorithm, it is possible to make a policy improvement by evaluating the Q-factors, defined by

Q_k(i, u) = Σ_{j ∈ X} P(j, u, i) [ C(j, u, i) + J_{μ_k}(j) ]

Note that P(j, u, i) and C(j, u, i) must be known. The improved policy is

μ_{k+1}(i) = argmin_{u ∈ U(i)} Q_k(i, u)

This is in fact an approximate version of the policy iteration algorithm, since J_{μ_k} and Q_k have been estimated from the samples.
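A sketch of the TD(0) update with decreasing step sizes γ_{X_k} = 1/m, on a hypothetical stochastic shortest path problem whose true cost-to-go values are easy to check by hand (all numbers invented):

```python
import random
# TD(0) evaluation of a fixed policy on a HYPOTHETICAL stochastic shortest
# path problem (invented numbers).  States: 0, 1 and terminal 'T'.
# Under the policy: from 0, pay 2 and go to 1 w.p. 0.5, else pay 4 and
# terminate; from 1, pay 3 and terminate.  True values: J(1)=3, J(0)=4.5.
random.seed(0)

J = {0: 0.0, 1: 0.0, 'T': 0.0}
visits = {0: 0, 1: 0}

def step(i):
    """Sample (next state, transition cost) under the fixed policy."""
    if i == 0:
        return (1, 2.0) if random.random() < 0.5 else ('T', 4.0)
    return ('T', 3.0)

for _ in range(5000):                    # simulated trajectories
    x = 0
    while x != 'T':
        nxt, c = step(x)
        visits[x] += 1
        gamma = 1.0 / visits[x]          # decreasing step size 1/m
        J[x] += gamma * (c + J[nxt] - J[x])   # TD(0) update
        x = nxt
print(J[0], J[1])                        # close to 4.5 and 3.0
```

With the step size 1/m, each estimate is a running average of the observed one-step targets, so the estimates converge to the true cost-to-go values.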
7.2.2
Q-learning
Q-learning is similar to a value iteration method based on simulation. The method estimates the Q-factors directly, without the multiple policy evaluations of the TD method. The optimal Q-factors are defined by

Q*(i, u) = Σ_{j ∈ X} P(j, u, i) [ C(j, u, i) + J*(j) ]    (7.1)

Since the optimal cost-to-go satisfies

J*(j) = min_{u' ∈ U(j)} Q*(j, u')    (7.2)

the optimality equation can be written in terms of Q-factors:

Q*(i, u) = Σ_{j ∈ X} P(j, u, i) [ C(j, u, i) + min_{u' ∈ U(j)} Q*(j, u') ]    (7.3)

Q*(i, u) is the unique solution of this equation. The Q-learning algorithm is based on (7.3). Q(i, u) can be initialized arbitrarily. For each sample (X_k, X_{k+1}, U_k, C_k), with the control chosen greedily,

U_k = argmin_{u ∈ U(X_k)} Q(X_k, u)

the Q-factors are updated by

Q(X_k, U_k) := Q(X_k, U_k) + γ [ C_k + min_{u ∈ U(X_{k+1})} Q(X_{k+1}, u) − Q(X_k, U_k) ]
The exploration/exploitation trade-off
Convergence of the algorithm to the optimal solution would require that all pairs (i, u) are tried infinitely often, which is not realistic. In practice, a trade-off must be made between phases of exploitation, in which a base policy (also called the greedy policy) is evaluated (similar to the idea of TD(0)), and phases of exploration, during which new controls are tried and a new greedy policy is determined.
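A sketch of Q-learning with an ε-greedy exploration/exploitation trade-off, on a hypothetical discounted 2-state maintenance-flavored problem (all probabilities and costs invented): in state 0, "wait" costs nothing but risks drifting to the degraded state 1, which is expensive to stay in; "repair" resets the system to state 0.

```python
import random
# Q-learning with epsilon-greedy exploration on a HYPOTHETICAL discounted
# 2-state, 2-action problem (numbers invented).  Action 0 = "wait",
# action 1 = "repair".
random.seed(1)
alpha_d = 0.9                            # discount factor
P = {0: [[0.9, 0.1], [0.0, 1.0]],        # P[u][i][j]
     1: [[1.0, 0.0], [1.0, 0.0]]}
C = {0: [0.0, 10.0], 1: [5.0, 15.0]}     # C[u][i]

Q = {(i, u): 0.0 for i in (0, 1) for u in (0, 1)}
visits = {k: 0 for k in Q}
x, eps = 0, 0.2
for _ in range(50000):
    # epsilon-greedy: explore w.p. eps, otherwise exploit the greedy action
    u = (random.choice((0, 1)) if random.random() < eps
         else min((0, 1), key=lambda a: Q[x, a]))
    nxt = random.choices((0, 1), weights=P[u][x])[0]   # simulate transition
    visits[x, u] += 1
    gamma = 1.0 / visits[x, u]           # decreasing step size
    target = C[u][x] + alpha_d * min(Q[nxt, 0], Q[nxt, 1])
    Q[x, u] += gamma * (target - Q[x, u])
    x = nxt
policy = {i: min((0, 1), key=lambda a: Q[i, a]) for i in (0, 1)}
print(policy)
```

No transition probabilities are used in the update itself; the model above is only needed to simulate the samples, which is exactly the point of the method. The learned greedy policy is "wait while working, repair when degraded".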
7.3
Indirect Learning
On-line applications can take advantage of the experience gained from real-time use by:
- using the direct learning approach presented in the preceding section on each "sample" of experience;
- building the model of the transition probabilities and cost function on-line, and then using this model for off-line training of the system through simulation with direct learning.
7.4
Supervised Learning
With the methods presented in the preceding sections, the cost-to-go or Q-functions were represented in tabular form. These approaches are suitable for moderate-size problems; for large state and control spaces, they would be too computationally intensive. To overcome this problem, approximation methods can be used to approximate the cost-to-go or Q-functions over the whole state and control space. As an example, consider a cost-to-go function J_μ(i). It is replaced by a suitable approximation J̃(i, r), where r is a vector that has to be optimized based on the available samples of J_μ. In the table representation investigated previously, J_μ(i) was stored for every value of i; with an approximation structure, only the vector r is stored.

Function approximators must be able to generalize well, over the state space, the information gained from the samples. In other words, the error between the true function and the approximated one, J_μ(i) − J̃(i, r), should be minimized. There are many possible methods for function approximation; this field is related to supervised learning. Possible methods are, for example, artificial neural networks, kernel-based methods, tree-based methods and Bayesian statistics. A general approach to a supervised learning problem can be:
- Determine an adequate structure for the approximated function and a corresponding supervised learning method.
- Determine the input features of the function, that is, the important inputs that characterize the state of the system. The features are generally based on experience or insight about the problem.
- Decide on a training algorithm.
- Gather a training set.
- Train the function with the training set. The function can then be validated using a subset of the training set.
- Evaluate the performance of the approximated function using a test set.

An important difference between classical supervised learning and the learning performed in reinforcement learning is that no true training set exists: the training sets are obtained either by simulation or from real-time samples, which is already an approximation of the real function.
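As an illustration of the parametric idea, the sketch below fits a linear-in-parameters approximation J̃(i, r) = r_0 + r_1·i to noisy samples of a pretend cost-to-go function by least squares; only the two parameters are stored, not a table of values (the "true" function and the noise level are invented):

```python
import random
# Approximating a cost-to-go function J(i) by J_hat(i, r) = r0 + r1 * i,
# fitted by least squares on noisy samples (all numbers HYPOTHETICAL).
random.seed(2)

def true_J(i):                 # pretend this is the unknown cost-to-go
    return 2.0 + 0.5 * i

# training set: (state, noisy sample of the cost-to-go from that state)
data = [(i, true_J(i) + random.gauss(0, 0.1)) for i in range(100)]

# closed-form least squares for the two parameters r = (r0, r1)
n = len(data)
sx = sum(i for i, _ in data); sy = sum(v for _, v in data)
sxx = sum(i * i for i, _ in data); sxy = sum(i * v for i, v in data)
r1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
r0 = (sy - r1 * sx) / n
print(r0, r1)                  # close to the true parameters (2.0, 0.5)
```

The same principle carries over to richer structures (neural networks, kernels, trees): the samples never give the true function, only noisy estimates of it, and the parameter vector r is all that is stored.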
Chapter 8
Review of Maintenance Optimization Models
8.1
Finite Horizon Models
8.1.1
Deterministic Models
Dekker et al. [46] propose a rolling horizon approach for short-term scheduling and grouping of maintenance activities. Each individual maintenance activity is first planned based on an infinite horizon optimization. The short-term planning uses these maintenance activities as inputs. Penalties are defined for deviations from the original time of maintenance for each activity, and the set of maintenance activities is then optimized jointly using finite horizon dynamic programming.
8.1.2
Stochastic Models
In [37], an SDP model is proposed to solve a finite horizon generating-unit maintenance scheduling problem. The system considered is composed of n generating units. The possible states for each unit are the number of remaining stages of maintenance, and possible failure of a unit not in maintenance during the stage. The failure rates are assumed constant, but different before and after maintenance. Unserved energy and unserved reserve costs are considered in the cost function. One interesting feature of the model is that the time to achieve maintenance is considered stochastic. Another is that the maintenance crew is assumed limited, so maintenance can be done on only one generating unit at a time. The model is illustrated with a three-unit example with 4, 5 and 6 possible states for the different units. A 52-week horizon is considered, with stages of one week.
8.2
Infinite Horizon Models
8.2.1
Markov Decision Process Models
In [14], an infinite horizon SDP model is considered for optimizing the maintenance of a single-component system. The system can be in different deterioration states, maintenance states or a failure state. Two kinds of failures are considered, random failures and deterioration failures, each modelled by a failure state with a different time to repair. The time to deterioration failure is represented by an Erlangian distribution. The preventive maintenance is considered imperfect. If the system fails, the component is replaced. An average cost-to-go approach is used to evaluate the policy. First, a Markov process of the system is investigated to determine the optimal mean time to preventive maintenance. A Markov decision process model is then built using the state probabilities and the calculated optimal mean time to preventive maintenance. The MDP is solved using the policy iteration algorithm; the model is proved to be unichain before the algorithm is applied. An illustrative example is given, considering 3 deterioration states, one preventive maintenance state for each deterioration state, and one failure state. Jayakumar et al. [21] propose a similar MDP model. Major and minor maintenance are possible, and for each possible maintenance action the deterioration level after the maintenance is stochastic, which is more realistic. The model is solved using the linear programming method.
8.2.2
Semi-Markov Decision Process Models
Many condition-based maintenance models based on SMDP have been proposed in recent years. Amari et al. [3] present a general framework for solving condition-based maintenance problems using SMDP. The interest of the model is that for each possible deterioration state, the possible maintenance decisions are minor maintenance and major maintenance (replacement), but also the choice of the next inspection time. A hypothetical example is given: the model consists of 5 deterioration states and 1 failure state, and 20 possible values for the inspection time are considered. The model of [14] is extended to an SMDP in [42]. The inspection time is calculated prior to the optimization using a semi-Markov process. The SMDP model is said to be superior because it includes the state sojourn time. The model is illustrated with an example based on a 230 kV air blast circuit breaker.
8.3 Reinforcement Learning
Kalles et al. [24] propose the use of RL for preventive maintenance of power plants. The article aims at motivating the use of RL for monitoring and maintenance of power plants. The main advantages given are the automatic learning capabilities of RL. The problem of time-lag (the time between an action and its effect) is highlighted. Penalties are defined by deviations from normal operation of the system. The proposed approach should first be used in parallel with the existing expert systems, so that the RL algorithm learns the environment; it could then be applied in practice. One important condition for good learning of the environment is that the algorithm has been trained in all situations, especially critical ones.
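No implementation is given in [24], but the learning scheme it argues for can be illustrated with tabular Q-learning, the classical model-free RL algorithm [41]. The sketch below is a generic illustration, not the authors' system: the agent needs only a simulator of the plant, not its transition model, and the two-state deterioration simulator used as an example is hypothetical.

```python
import random

def q_learning(n_states, n_actions, step, episodes=2000, horizon=50,
               alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning for cost minimization.
    `step(state, action) -> (cost, next_state)` is a simulator of the plant;
    the agent learns from samples only, no explicit model is required."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            # epsilon-greedy exploration
            if rng.random() < eps:
                a = rng.randrange(n_actions)
            else:
                a = min(range(n_actions), key=lambda u: Q[s][u])
            cost, s2 = step(s, a)
            # temporal-difference update toward the one-step target
            Q[s][a] += alpha * (cost + gamma * min(Q[s2]) - Q[s][a])
            s = s2
    return Q

# Hypothetical simulator: state 0 = working, state 1 = failed;
# action 0 = operate, action 1 = replace (cost 5, back to new).
sim_rng = random.Random(1)
def step(s, a):
    if a == 1:
        return 5.0, 0
    if s == 0:
        return 0.0, (1 if sim_rng.random() < 0.2 else 0)
    return 10.0, 1          # failed and left alone: high penalty per stage

Q = q_learning(2, 2, step)
```

The learned greedy policy replaces the component once it has failed, which matches the intuition of the maintenance problem.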
8.4 Conclusions
An important assumption of all the models is the loss of memory (Markovian models). The assumption is related to the principle of optimality. It means that the transition probabilities of the models can depend only on the current state of the system, independently of its history.

The finite horizon approach is adapted to short-term optimization. From the literature review, this approach can be applied to maintenance scheduling. I believe that the approach is interesting because it can integrate opportunistic maintenance. Chapter 9 gives an example of this type of model. A limitation is the consequence of the curse of dimensionality. The complexity of the model increases exponentially with the number of states. In consequence, the number of components of a finite horizon SDP model cannot be too high if the model is to remain tractable.

Several Markov decision process and semi-Markov decision process models have been proposed for solving condition-based maintenance problems. The models consider an average cost-to-go criterion, which is realistic. SMDP have the advantage of being able to optimize the time to the next inspection depending on the state; SMDP are also more complex. The models found in the literature consider only single components with only one state variable. MDP could be very useful for scheduled CBM and SMDP for inspection-based CBM. However, for continuous-time monitoring, it would be recommended to use approximate methods.

Approximate dynamic programming (reinforcement learning) has many advantages. The methods do not require that a model of the system exists. They learn from samples and could be used to adapt to a system. Moreover, they can handle large state spaces in comparison with MDP. In my opinion, reinforcement learning could be used for continuous-time monitoring of systems with multi-state monitoring. The article [24] also proposed this approach for condition monitoring of power plants. However, no implementation of the idea has been found in the literature. A practical disadvantage of this approach is that the learning process is time consuming. It can (and should) be done off-line, or based on a model that already exists but is too large to be solved with classical methods. A technical difficulty is the choice of an adequate supervised learning structure.

Table 8.1 shows a summary of the models and most important methods.

Table 8.1: Summary of models and methods
Finite horizon SDP. Possible application: short-term maintenance optimization / scheduling. Disadvantage: limited state space (number of components).

Infinite horizon MDP (average cost-to-go approach). Possible application: optimization of scheduled maintenance. Methods: VI and PI (can converge fast for a high discount factor; PI is faster in general), LP (possible additional constraints). Disadvantage: state space limited.

SMDP (average cost-to-go approach). Possible application: optimization of inspection-based maintenance. Methods: classical MDP methods. Advantage: can optimize the inspection interval. Disadvantage: more complex.

ADP (reinforcement learning). Possible application: same as MDP, for larger systems. Advantages: can work without an explicit model; can handle large state spaces.
Chapter 9

9.1 One-Component Model

9.1.1 Idea of the Model
In this chapter, an age replacement model based on finite horizon dynamic programming is proposed. The model is first described for one component, for an easier understanding of its principle.

The price of electricity was considered an important factor that could influence the maintenance decision. Indeed, if the electricity price is high, it can be profitable to operate the system and wait for lower prices. If a high electricity price is expected in the near future, it could be interesting to do maintenance immediately, in order to be operational later and avoid maintenance during a profitable period. This idea was considered for the model. The electricity price was included as a state variable. The variable represents different electricity scenarios, for example high, medium and low prices. For each scenario, the electricity price varies with a period of one year. There can be transitions from one scenario to another depending on the period of the year.

In the Scandinavian countries, a large part of the electricity is based on hydropower. The electricity price is consequently highly influenced by the weather. If the weather is warm and dry, the hydro storage will be low and the electricity price for the rest of the year may be high. Conversely, a cold and rainy season may result in a low electricity price for the rest of the year. This observation could be used to assume the electricity scenario to be transient during the summer and stable during the rest of the year, typically interpreted as a dry year or a wet year. This assumption could be used as a basis for modelling the transitions of the electricity state.
9.1.2 Notation
Numbers
NE: number of electricity scenarios
N^W: number of working states for the component
N^PM: number of preventive maintenance states for the component
N^CM: number of corrective maintenance states for the component

Costs
CE(s, k): electricity cost at stage k for electricity state s
C^I: cost per stage for interruption
C^PM: cost per stage of preventive maintenance
C^CM: cost per stage of corrective maintenance
C^N(i): terminal cost if the component is in state i

Variables
i1: component state at the current stage
i2: electricity state at the current stage
j1: possible component state for the next stage
j2: possible electricity state for the next stage
x1_k: component state at stage k
x2_k: electricity state at stage k
λ(t): failure rate of the component at age t
λ(Wi): failure rate of the component in state Wi

Spaces
Ω_x1: component state space
Ω_x2: electricity state space
U(i): decision space for state i

State notations
W: working state
PM: preventive maintenance state
CM: corrective maintenance state
9.1.3 Assumptions
The time span of the problem is T. It is divided into N stages of length Ts, such that T = N·Ts. The maintenance decisions are made sequentially at each stage k = 0, 1, ..., N−1.

The failure rate of the component over time is assumed perfectly known. This function is denoted λ(t).

If the component fails during stage k, corrective maintenance is undertaken for N^CM stages with a cost of C^CM per stage. It is possible at each stage to decide to replace the component to prevent corrective maintenance. The time of preventive replacement is N^PM stages with a cost of C^PM per stage. If the system is not working, a cost for interruption C^I per stage is considered.

The average production of the generating unit is G kW. This means that if the unit is not in preventive maintenance or failure, G·Ts kWh are produced during the stage (Ts in hours).

NE possible electricity price scenarios are considered. The prices are assumed fixed during a stage (equal to the price at the beginning of the stage). For scenario s, the electricity price per kWh is denoted CE(s, k), k = 0, 1, ..., N−1. It is possible that the electricity price switches from one scenario to another during the time span. The probability of transition at each stage is assumed known.

A terminal cost (for stage N) can be used to penalize the terminal stage condition. The manpower is assumed unlimited. Spare parts are not considered.
9.1.4 Model Description

9.1.4.1 State Space
The state vector Xk is composed of two state variables: x1_k for the state of the component (its age) and x2_k for the electricity scenario, i.e. NX = 2. The state of the system is thus represented by a vector as in (9.1):

Xk = (x1_k, x2_k),  x1_k ∈ Ω_x1, x2_k ∈ Ω_x2   (9.1)
Ω_x1 is the set of possible states for the component and Ω_x2 the set of possible electricity scenarios.

Component state
The status of the component (its age) at each stage is represented by one state variable x1_k. There are three types of possible states for the variable: normal states (W), when the component is working; corrective maintenance (CM) states, if the component is in maintenance due to failure; and preventive maintenance (PM) states. The meaning of a state is that the component has been in the corresponding condition during the last stage. For example, if the component is in a state PM, it means that during the last stage it has undergone preventive maintenance. The numbers of CM and PM states for the component correspond respectively to N^CM and N^PM.

To limit the size of the state space, it is necessary to limit the number of W states. It can be assumed that when λ(t) reaches a fixed limit λmax = λ(Tmax), preventive maintenance is always made. Another possibility is to assume that λ(t) stays constant once age Tmax is reached; in this case, Tmax can for example correspond to the time such that λ(t) > 50% for t > Tmax. This second approach was implemented. The corresponding number of W states is N^W = Tmax/Ts, or the closest integer, in both cases.
Figure 9.1: Example of Markov Decision Process for one component with N CM = 3, N P M = 2, N W = 4. Solid line: u=0, Dashed Line: u=1
Figure 9.1 shows an example of the graphical representation of the MDP model for one component. In this example, Ω_x1 = {W0, ..., W4, PM1, CM1, CM2}. The state W0 is used to represent a new component; PM2 and CM3 are both represented by this state. More generally, Ω_x1 = {W0, ..., W_{N^W}, PM1, ..., PM_{N^PM−1}, CM1, ..., CM_{N^CM−1}}.
Electricity scenario state
Electricity scenarios are associated with one state variable x2_k. There are NE possible states for this variable, each state corresponding to one possible electricity scenario: x2_k ∈ Ω_x2 = {S1, ..., S_NE}. The electricity price of scenario S at stage k is given by the electricity price function CE(S, k). Figure 9.2 shows an example for three possible scenarios, corresponding to high, medium and low electricity prices (respectively dry, normal and wet year). The weather during the season influences the water reserves in a country such as Sweden. Hydropower provides a large part of the electricity generation in Sweden; moreover, it is a cheap source of energy. In consequence, if the water reserves are low, more expensive sources of energy are needed and the electricity price is higher.
Figure 9.2: Example of electricity price functions CE(S, k) for three scenarios
9.1.4.2 Decision Space
At each stage, the decision maker can decide, if the component is not in maintenance, whether or not to do preventive maintenance, depending on the state X of the system:

Uk = 0: no preventive maintenance
Uk = 1: preventive maintenance

The decision space depends only on the component state i1:

U(i) = {0, 1} if i1 ∈ {W1, ..., W_{N^W}}, and U(i) = {0} otherwise.
9.1.4.3 Transition Probabilities
The two state variables are independent. Moreover, only the electricity state transitions depend on the stage. Consequently,

P(Xk+1 = j | Uk = u, Xk = i)
= P(x1_{k+1} = j1, x2_{k+1} = j2 | uk = u, x1_k = i1, x2_k = i2)
= P(x1_{k+1} = j1 | uk = u, x1_k = i1) · P(x2_{k+1} = j2 | x2_k = i2)
= P(j1, u, i1) · Pk(j2, i2)

Component state transition probabilities
At each stage k, if the state of the component is Wq, the failure rate is assumed constant during the stage and equal to λ(Wq) = λ(q·Ts). The transition probabilities for the component state are stationary and can be represented as a Markov decision process, as in the example in Figure 9.1. Table 9.1 summarizes the transition probabilities that are not equal to zero. Note that if N^PM = 1 or N^CM = 1, then PM1, respectively CM1, corresponds to W0.

Electricity state
The transition probabilities of the electricity state, Pk(j2, i2), are not stationary; they can change from stage to stage. Tables 9.2 and 9.3 give an example of transition probabilities for the electricity scenarios on a 12-stage horizon. In this example, Pk(j2, i2) can take three different values, defined by the transition matrices P1_E, P2_E and P3_E; i2 is represented by the rows of the matrices and j2 by the columns.
Table 9.1: Transition probabilities

i1 | u | j1 | P(j1, u, i1)
Wq, q ∈ {0, ..., N^W−1} | 0 | Wq+1 | 1 − Ts·λ(Wq)
Wq, q ∈ {0, ..., N^W−1} | 0 | CM1 | Ts·λ(Wq)
W_{N^W} | 0 | W_{N^W} | 1 − Ts·λ(W_{N^W})
W_{N^W} | 0 | CM1 | Ts·λ(W_{N^W})
Wq, q ∈ {0, ..., N^W} | 1 | PM1 | 1
PMq, q ∈ {1, ..., N^PM−2} | - | PMq+1 | 1
PM_{N^PM−1} | - | W0 | 1
CMq, q ∈ {1, ..., N^CM−2} | - | CMq+1 | 1
CM_{N^CM−1} | - | W0 | 1
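The stationary component-state transition matrices of Table 9.1 can be assembled mechanically from the discretized failure rate. The sketch below is a minimal construction under the assumptions of Section 9.1.3; the failure-rate values in the example are hypothetical.

```python
import numpy as np

def component_transition_matrices(lam, Ts, N_PM, N_CM):
    """Transition matrices of the component state chain (Table 9.1).
    State order: W0..W_NW, PM1..PM_{NPM-1}, CM1..CM_{NCM-1}.
    lam[q] is the failure rate lambda(W_q); Ts is the stage length."""
    N_W = len(lam) - 1
    n = (N_W + 1) + (N_PM - 1) + (N_CM - 1)
    W = lambda q: q
    PM = lambda q: W(0) if q >= N_PM else N_W + q             # PM_NPM folds into W0
    CM = lambda q: W(0) if q >= N_CM else N_W + N_PM - 1 + q  # CM_NCM folds into W0
    P0 = np.zeros((n, n))
    for q in range(N_W + 1):
        p_fail = Ts * lam[q]                        # failure probability in one stage
        P0[W(q), W(min(q + 1, N_W))] = 1 - p_fail   # ageing saturates at W_NW
        P0[W(q), CM(1)] = p_fail
    for q in range(1, N_PM):                        # deterministic maintenance progression
        P0[PM(q), PM(q + 1)] = 1.0
    for q in range(1, N_CM):
        P0[CM(q), CM(q + 1)] = 1.0
    P1 = P0.copy()                                  # u = 1: replacement is started
    for q in range(N_W + 1):
        P1[W(q)] = 0.0
        P1[W(q), PM(1)] = 1.0
    return P0, P1

# Hypothetical example: N_W = 4, N_PM = 2, N_CM = 3, as in Figure 9.1.
lam = [0.01, 0.02, 0.03, 0.04, 0.05]
P0, P1 = component_transition_matrices(lam, Ts=1.0, N_PM=2, N_CM=3)
```

Both matrices are row-stochastic by construction, which is a useful sanity check before plugging them into the optimization.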
Table 9.2: Example of transition matrices for the electricity scenarios

P1_E = [1 0 0; 0 1 0; 0 0 1]
P2_E = [1/3 1/3 1/3; 1/3 1/3 1/3; 1/3 1/3 1/3]
P3_E = [0.6 0.2 0.2; 0.2 0.6 0.2; 0.2 0.2 0.6]

Table 9.3: Example of transition matrix schedule over a 12-stage horizon

Stage k:     0     1     2     3     4     5     6     7     8     9     10    11    ...
Pk(j2, i2):  P1_E  P1_E  P1_E  P3_E  P3_E  P2_E  P2_E  P2_E  P3_E  P1_E  P1_E  P1_E  ...

9.1.4.4 Cost Function
The costs associated with the possible transitions can be of different kinds:

Reward for electricity generation: G·Ts·CE(i2, k) (depends on the electricity scenario state i2 and the stage k).
Cost for maintenance: C^CM or C^PM.
Cost for interruption: C^I.

Moreover, a terminal cost, denoted C^N, could be used to penalize deviations from a required state at the end of the time horizon. A possible terminal cost C^N(i) is defined for each possible terminal state i of the component. This option and its consequences were not studied in this work.

The transition costs are summarized in Table 9.4. Notice that i2 is a state variable.
Table 9.4: Transition costs

i1 | u | j1 | Ck(j, u, i)
Wq, q ∈ {0, ..., N^W−1} | 0 | Wq+1 | −G·Ts·CE(i2, k)
Wq, q ∈ {0, ..., N^W−1} | 0 | CM1 | C^I + C^CM
W_{N^W} | 0 | W_{N^W} | −G·Ts·CE(i2, k)
W_{N^W} | 0 | CM1 | C^I + C^CM
Wq | 1 | PM1 | C^I + C^PM
PMq, q ∈ {1, ..., N^PM−2} | - | PMq+1 | C^I + C^PM
PM_{N^PM−1} | - | W0 | C^I + C^PM
CMq, q ∈ {1, ..., N^CM−2} | - | CMq+1 | C^I + C^CM
CM_{N^CM−1} | - | W0 | C^I + C^CM
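With the transition probabilities of Table 9.1 and the costs of Table 9.4, the finite horizon model is solved by backward induction over the N stages. The sketch below is a simplified version that keeps only the component state (a fixed electricity price is folded into the cost matrices, rather than carried as a second state variable); the three-state matrices in the example are hypothetical.

```python
import numpy as np

def finite_horizon_dp(P, C, C_N, N):
    """Backward induction: V_k(i) = min_u sum_j P[u][i,j] * (C[u][i,j] + V_{k+1}(j)).
    P[u]: transition matrix for decision u; C[u]: transition cost matrix;
    C_N: terminal cost vector; N: number of stages."""
    V = np.asarray(C_N, dtype=float)
    policy = []
    for k in range(N - 1, -1, -1):
        # expected cost-to-go of each decision in each state
        Q = np.stack([(P[u] * (C[u] + V)).sum(axis=1) for u in range(len(P))])
        policy.append(Q.argmin(axis=0))
        V = Q.min(axis=0)
    policy.reverse()                  # policy[k][i] = optimal decision at stage k
    return V, np.array(policy)

# Hypothetical 3-state component: W0 (new), W1 (aged), CM (under repair).
P0 = np.array([[0.0, 0.9, 0.1], [0.0, 0.8, 0.2], [1.0, 0.0, 0.0]])
P1 = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
C0 = np.array([[-1.0, -1.0, 10.0], [-1.0, -1.0, 10.0], [6.0, 6.0, 6.0]])
C1 = np.array([[3.0, 3.0, 3.0], [3.0, 3.0, 3.0], [6.0, 6.0, 6.0]])
V, policy = finite_horizon_dp([P0, P1], [C0, C1], [0.0, 0.0, 0.0], N=10)
```

The returned policy is stage-dependent, which is exactly what a finite horizon model provides over its infinite horizon counterparts.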
9.2 Multi-Component Model
In this section, the model presented in Section 9.1 is extended to multi-component systems.
9.2.1 Idea of the Model
The motivation for a multi-component model is to consider possible opportunistic maintenance. It is sometimes possible to do maintenance on different parts of the system at opportune times. For example, if the system fails, it could be profitable to do maintenance on some components of the system that are still working but should be maintained soon. This could be very interesting if the interruption cost is high, or if the cost of the structure needed for the maintenance is very high. In wind power, for example, a helicopter or a boat can be necessary for certain maintenance actions. Their rental price can be very high, and it could be profitable to group the maintenance of different wind turbines at the same time.
9.2.2 Notation

Numbers
NC: number of components
N^W_c: number of working states for component c
N^PM_c: number of preventive maintenance states for component c
N^CM_c: number of corrective maintenance states for component c

Costs
C^PM_c: cost per stage of preventive maintenance for component c
C^CM_c: cost per stage of corrective maintenance for component c
C^N_c(i): terminal cost if component c is in state i

Variables
i^c, c ∈ {1, ..., NC}: state of component c at the current stage
i^{NC+1}: electricity state at the current stage
j^c, c ∈ {1, ..., NC}: state of component c for the next stage
j^{NC+1}: electricity state for the next stage
u^c, c ∈ {1, ..., NC}: decision variable for component c

State and control spaces
x^c_k: state of component c at stage k
x^{NC+1}_k: electricity state at stage k
u^c_k: maintenance decision for component c at stage k
Ω^c_x: state space for component c
Ω_x2: electricity state space
U^c(i^c): decision space for component c in state i^c
9.2.3 Assumptions
The system is composed of NC components in series. If one component fails, the whole system fails.

The failure rate of each component over time is assumed perfectly known. This function is denoted λc(t) for component c ∈ {1, ..., NC}.

If component c fails during stage k, corrective maintenance is undertaken for N^CM_c stages with a cost of C^CM_c per stage. It is possible at each stage to decide to replace a component to prevent corrective maintenance. The time of preventive replacement for component c is N^PM_c stages with a cost of C^PM_c per stage.

An interruption cost C^I is considered, whatever maintenance is done on the system.

The average production of the generating unit is G kW. If none of the components of the unit is in preventive maintenance or failure, G·Ts kWh are produced during the stage (Ts in hours).

A terminal cost C^N_c can be used to penalize the terminal stage condition for component c.
9.2.4 Model Description

9.2.4.1 State Space
The state vector is composed of one state variable per component and one for the electricity scenario:

Xk = (x^1_k, ..., x^{NC}_k, x^{NC+1}_k)   (9.2)

x^c_k, c ∈ {1, ..., NC}, represents the state of component c, and x^{NC+1}_k represents the electricity state.

Component state space
The numbers of CM and PM states for component c correspond respectively to N^CM_c and N^PM_c. The number of W states for each component c, N^W_c, is determined in the same way as for one component. The state space for component c is denoted Ω^c_x:

x^c_k ∈ Ω^c_x = {W0, ..., W_{N^W_c}, PM1, ..., PM_{N^PM_c−1}, CM1, ..., CM_{N^CM_c−1}}

Electricity state space
Same as in Section 9.1.
9.2.4.2 Decision Space
At each stage, the decision maker must decide, for each component that is not in maintenance, whether to do preventive maintenance or do nothing, depending on the state of the system.
u^c_k = 0: no preventive maintenance on component c
u^c_k = 1: preventive maintenance on component c

The decision variables constitute a decision vector:

Uk = (u^1_k, u^2_k, ..., u^{NC}_k)   (9.3)

The decision space for each decision variable is defined, for all c ∈ {1, ..., NC}, by

U^c(i^c) = {0, 1} if i^c ∈ {W1, ..., W_{N^W_c}}, and U^c(i^c) = {0} otherwise.
9.2.4.3 Transition Probabilities
The component state variables x^c are independent of the electricity state x^{NC+1}. Consequently,

P(Xk+1 = j | Uk = U, Xk = i)
= P((j^1, ..., j^{NC}) | (u^1, ..., u^{NC}), (i^1, ..., i^{NC})) · Pk(j^{NC+1}, i^{NC+1})   (9.4)

The transition probabilities of the electricity state, Pk(j^{NC+1}, i^{NC+1}), are similar to those of the one-component model. They can be defined at each stage k by a transition matrix, as in the example of Section 9.1.

Component state transitions
The state variables x^c are not independent of each other. Indeed, if one component fails or is in maintenance, the other components are not ageing, since the system is not working. In consequence, different cases must be considered.

Case 1. If all the components are working and no maintenance is done, the transition probability of the whole system is the product of the transition probabilities of each component considered independently. If i^c ∈ {W1, ..., W_{N^W_c}} for all c ∈ {1, ..., NC}:

P((j^1, ..., j^{NC}), 0, (i^1, ..., i^{NC})) = ∏_{c=1}^{NC} P(j^c, 0, i^c)

Case 2. If the system is stopped during the stage (failure or maintenance of at least one component), the components do not age and the maintenance states progress deterministically:

P((j^1, ..., j^{NC}), (u^1, ..., u^{NC}), (i^1, ..., i^{NC})) = ∏_{c=1}^{NC} P^c

with P^c = 1 if the transition of component c is the deterministic one (a working component keeps its state, a component sent to preventive maintenance moves to PM1, and a component in maintenance moves to its next maintenance state or back to W0, as in Table 9.1), and P^c = 0 otherwise.
9.2.4.4 Cost Function
As for the transition probabilities, there are two cases:

Case 1. If all the components are working, no maintenance is decided and no failure happens, a reward for the electricity produced is obtained. If i^c, j^c ∈ {W1, ..., W_{N^W_c}} for all c ∈ {1, ..., NC}:

C((j^1, ..., j^{NC}), 0, (i^1, ..., i^{NC})) = −G·Ts·CE(i^{NC+1}, k)

Case 2. When the system is in maintenance or fails during the stage, an interruption cost C^I is considered, as well as the sum of the costs of all the maintenance actions:

C((j^1, ..., j^{NC}), (u^1, ..., u^{NC}), (i^1, ..., i^{NC})) = C^I + ∑_{c=1}^{NC} C^c

with C^c = C^PM_c if i^c ∈ {PM1, ..., PM_{N^PM_c−1}} or j^c = PM1, C^c = C^CM_c if i^c ∈ {CM1, ..., CM_{N^CM_c−1}} or j^c = CM1, and C^c = 0 otherwise.
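Case 1 above is a product of independent component transitions, which corresponds to a Kronecker product of the per-component matrices. A minimal sketch, assuming component transition matrices such as those of Table 9.1 (the 2×2 matrices in the example are hypothetical):

```python
import numpy as np

def case1_probability(P_comp, i, j):
    """Case 1: all components working, u = 0. P_comp[c] is the (u = 0)
    transition matrix of component c; i, j are tuples of state indices."""
    p = 1.0
    for c, Pc in enumerate(P_comp):
        p *= Pc[i[c], j[c]]
    return p

def case1_joint_matrix(P_comp):
    """Joint transition matrix over the product state space, obtained as a
    Kronecker product (valid only while the components evolve independently)."""
    M = np.array([[1.0]])
    for Pc in P_comp:
        M = np.kron(M, Pc)
    return M

# Hypothetical two-component example (state 0 = working, state 1 = failed).
Pa = np.array([[0.9, 0.1], [0.0, 1.0]])
Pb = np.array([[0.8, 0.2], [0.0, 1.0]])
M = case1_joint_matrix([Pa, Pb])
```

The joint state space grows as the product of the component state spaces, which is the curse of dimensionality discussed in the conclusions.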
9.3 Possible Extensions
The model could be extended in several directions. The following list summarizes some ideas that could have an impact on the model:

Manpower. It would be interesting to limit the number of maintenance actions that can be carried out at the same time. A solution would be to consider a global decision space, rather than an individual decision space for each component state variable.

Other types of maintenance actions. In the model, replacement is the only possible maintenance action. In reality, there are many possible maintenance actions, such as minor repair, major repair, etc. They could be modelled by adding possible maintenance decisions to the model.

Stochastic time to repair. A stochastic repair time could be modelled by adding transition probabilities for the maintenance states.

Deterioration states. If monitoring or inspection of some components is possible, deterioration state variables could be included in the model.

Other forecasting states. It could be interesting to add other forecasting state information, such as weather and/or load states.
Chapter 10
The main limitation of dynamic programming is related to the curse of dimensionality. The time complexity increases exponentially with the number of state variables in the model. With the recent advances in ADP methods, this limitation could be overcome. No application of ADP was found in the literature. The methods have mainly been applied to optimal control until now, but there are new opportunities for applying them to new fields such as maintenance optimization. The condition-based maintenance models proposed using MDP or SMDP may, for example, be generalized to multi-variable models where different parameters of a system are monitored.

In the power industry, maintenance contracts for a finite time are common. In this perspective, maintenance optimization should focus on finite horizon models. However, few finite horizon models are proposed in the literature. Two ways of using dynamic programming for finite horizon problems are possible: either directly with a finite horizon model, or with a discounted infinite horizon model, which is an approximation of a finite horizon model and must be stationary over time.

An idea could be to extend the finite horizon model proposed in this thesis. Markov decision processes and reinforcement learning could be applied to single-component monitoring (with possible monitoring of multiple parameters), while the finite horizon approach could use the results from the single-component models to optimize the maintenance of a complete system. The component in the finite horizon model could be simplified to a small number of possible deterioration/age states to limit the complexity of the model.
Appendix A
Reference List
[1] Maintenance terminology. Svensk Standard SS-EN 13306, SIS, 2001.
[2] Mohamed A-H. Inspection, maintenance and replacement models. Computers & Operations Research, 22(4):435-441, 1995.
[3] S.V. Amari and L.H. Pham. Cost-effective condition-based maintenance using Markov decision processes. In Reliability and Maintainability Symposium, 2006 (RAMS '06), pages 464-469, 2006.
[4] N. Andréasson. Optimisation of opportunistic replacement activities in deterministic and stochastic multi-component systems. Technical report, Chalmers and Göteborg University, 2004. Licentiate thesis.
[5] Y.W. Archibald and R. Dekker. Modified block-replacement for multiple-component systems. IEEE Transactions on Reliability, 45(1):75-83, 1996.
[6] I. Bagai and K. Jain. Improvement, deterioration, and optimal replacement under age-replacement with minimal repair. IEEE Transactions on Reliability, 43(1):156-162, 1994.
[7] R.E. Barlow and F. Proschan. Mathematical Theory of Reliability. Wiley, 1965.
[8] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, 1957.
[9] C. Berenguer, C. Chu, and A. Grall. Inspection and maintenance planning: an application of semi-Markov decision processes. Journal of Intelligent Manufacturing, 8(5):467-476, 1997.
[10] M. Berg and B. Epstein. A modified block replacement policy. Naval Research Logistics Quarterly, 23:15-24, 1976.
[11] M. Berg and B. Epstein. A note on a modified block replacement policy for units with increasing marginal running costs. Naval Research Logistics Quarterly, 26:157-179, 1979.
[12] L. Bertling, R. Allan, and R. Eriksson. A reliability-centered asset maintenance method for assessing the impact of maintenance in power distribution systems. IEEE Transactions on Power Systems, 20(1):75-82, 2005.
[13] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[14] G.K. Chan and S. Asgarpoor. Optimum maintenance policy with Markov processes. Electric Power Systems Research, 76(6-7):452-456, 2006.
[15] D.I. Cho and M. Parlar. A survey of maintenance models for multi-unit systems. European Journal of Operational Research, 51(1):1-23, 1991.
[16] R. Dekker, R.E. Wildeman, and F.A. van der Duyn Schouten. A review of multi-component maintenance models with economic dependence. Mathematical Methods of Operations Research (ZOR), 45(3):411-435, 1997.
[17] B. Fox. Age replacement with discounting. Operations Research, 14(3):533-537, 1966.
[18] C. Fu, L. Ye, Y. Liu, R. Yu, B. Iung, Y. Cheng, and Y. Zeng. Predictive maintenance in intelligent-control-maintenance-management system for hydroelectric generating unit. IEEE Transactions on Energy Conversion, 19(1):179-186, 2004.
[19] A. Haurie and P. L'Ecuyer. A stochastic control approach to group preventive replacement in a multicomponent system. IEEE Transactions on Automatic Control, 27(2):387-393, 1982.
[20] P. Hilber and L. Bertling. Monetary importance of component reliability in electrical networks for maintenance optimization. In Probabilistic Methods Applied to Power Systems, 2004 International Conference on, pages 150-155, September 2004.
[21] A. Jayakumar and S. Asgarpoor. Maintenance optimization of equipment by linear programming. In Probabilistic Methods Applied to Power Systems, 2004 International Conference on, pages 145-149, 2004.
[22] Y. Jiang, Z. Zhong, J. McCalley, and T.V. Voorhis. Risk-based maintenance optimization for transmission equipment. In Proc. of the 12th Annual Substations Equipment Diagnostics Conference, 2004.
[23] L.P. Kaelbling, M.L. Littman, and A.P. Moore. Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.
[24] D. Kalles, A. Stathaki, and R.E. King. Intelligent monitoring and maintenance of power plants. In Workshop on Machine Learning Applications in the Electric Power Industry, Chania, Greece, 1999.
[25] D. Kumar and U. Westberg. Maintenance scheduling under age replacement policy using proportional hazards model and TTT-plotting. European Journal of Operational Research, 99(3):507-515, 1997.
[26] P. L'Ecuyer and A. Haurie. Preventive replacement for multicomponent systems: an opportunistic discrete time dynamic programming model. IEEE Transactions on Automatic Control, 32:117-118, 1983.
[27] M. Lehtonen. On the optimal strategies of condition monitoring and maintenance allocation in distribution systems. In Probabilistic Methods Applied to Power Systems, 2006 (PMAPS 2006), International Conference on, pages 1-5, 2006.
[28] M.L. Littman. Algorithms for Sequential Decision Making. PhD thesis, Brown University, 1996.
[29] Y. Mansour and S. Singh. On the complexity of policy iteration. In Uncertainty in Artificial Intelligence, 1999.
[30] M.K.C. Marwali and S.M. Shahidehpour. Short-term transmission line maintenance scheduling in a deregulated system. In Power Industry Computer Applications, 1999 (PICA '99), Proceedings of the 21st IEEE International Conference, pages 31-37, 1999.
[31] R.P. Nicolai and R. Dekker. Optimal maintenance of multi-component systems: a review. 2006.
[32] J. Nilsson and L. Bertling. Maintenance management of wind power systems using condition monitoring systems: life cycle cost analysis for two case studies. IEEE Transactions on Energy Conversion, 22(1):223-229, 2007.
[33] J. Nilsson. Maintenance management of wind power systems: cost effect analysis of condition monitoring systems. Master's thesis, Royal Institute of Technology (KTH), April 2006.
[34] K.S. Park. Optimal wear-limit replacement with wear-dependent failures. IEEE Transactions on Reliability, 37(3):293-294, 1988.
[35] K.S. Park. Condition-based predictive maintenance by multiple logistic function. IEEE Transactions on Reliability, 42(4):556-560, 1993.
[36] M.L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 1994.
[37] A. Rajabi-Ghahnavie and M. Fotuhi-Firuzabad. Application of Markov decision process in generating units maintenance scheduling. In Probabilistic Methods Applied to Power Systems, 2006 (PMAPS 2006), International Conference on, pages 1-6, 2006.
[38] A. Rangan, D. Thyagarajan, and Sarada. Optimal replacement of systems subject to shocks and random threshold failure. International Journal of Quality & Reliability Management, 23:1176-1191, 2006.
[39] J. Ribrant and L.M. Bertling. Survey of failures in wind power systems with focus on Swedish wind power plants during 1997-2005. IEEE Transactions on Energy Conversion, 22(1):167-173, 2007.
[40] J. Si. Handbook of Learning and Approximate Dynamic Programming. Wiley-IEEE, 2004.
[41] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[42] C.L. Tomasevicz and S. Asgarpoor. Optimum maintenance policy using semi-Markov decision processes. In Power Symposium, 2006 (NAPS 2006), 38th North American, pages 23-28, 2006.
[43] H. Wang. A survey of maintenance policies of deteriorating systems. European Journal of Operational Research, 139(3):469-489, 2002.
[44] L. Wang, J. Chu, W. Mao, and Y. Fu. Advanced maintenance strategy for power plants: introducing intelligent maintenance system. In Intelligent Control and Automation, 2006 (WCICA 2006), The Sixth World Congress on, volume 2, 2006.
[45] R. Wildeman, R. Dekker, and A. Smit. A dynamic policy for grouping maintenance activities. European Journal of Operational Research.
[46] R.E. Wildeman, R. Dekker, and A. Smit. A Dynamic Policy for Grouping Maintenance Activities. Econometric Institute, 1995.
[47] O. Wilhelmsson. Evaluation of the introduction of RCM for hydro power generators at Vattenfall Vattenkraft. Master's thesis, Royal Institute of Technology (KTH), May 2005.