by
Cem Ünsal
Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State
University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
in
ELECTRICAL ENGINEERING
by
Cem Ünsal
(ABSTRACT)
One of today's most serious social, economic, and environmental problems is traffic
congestion. In addition to the financial cost of the problem, the number of traffic related injuries
and casualties is very high. A recently considered approach to increase safety while reducing
congestion and improving driving conditions is Automated Highway Systems (AHS). The AHS
will evolve from the present highway system to an intelligent vehicle/highway system that will
incorporate communication, vehicle control and traffic management techniques to provide safe,
fast and more efficient surface transportation. A key factor in AHS deployment is intelligent
vehicle control. While the technology to safely maneuver the vehicles exists, the problem of
making intelligent decisions to improve a single vehicle's travel time and safety while optimizing
the overall traffic flow is still a stumbling block.
The design of the adaptive vehicle path planner based on local information is then carried
over to the interaction of multiple intelligent vehicles. By analyzing situations consisting of
conflicting desired vehicle paths, we extend our design with additional decision structures. The
analysis of the situations and the design of the additional structures are made possible by the study
of the interacting reward-penalty mechanisms in individual vehicles. The definition of the physical
environment of a vehicle as a series of discrete state transitions associated with a stationary
automata environment is the key to this analysis and to the design of the intelligent vehicle path
controller.
This work was supported in part by the Center for Transportation Research and the Virginia
DOT under the Smart Road project, by the General Motors ITS Fellowship program, and by the Naval
Research Laboratory under grant no. N000114-93-1-G022.
To all the noble men and the knights fighting the demons
of a binary world in their quest for artificial intelligence.
Acknowledgments
It has been a long ride since I received my M.S. degree. Many friends, local, cyber and dear, many
engineering students, EE, ME, or CE, many professors, and others have seen me working through
my dissertation. All of them, directly or indirectly, touched what is written in these pages.
First of all, I have to express my gratitude to my advisors: Dr. John S. Bay, most
probably the best advisor this campus has ever seen and a legend in thesis/dissertation editing, and
Dr. Pushkin Kachroo, percussionist extraordinaire (Isn't that what P.E. stands for?). They were
both genuinely understanding, and very supportive of me. I learned a great deal from them about
control engineering (among other fields) and about academic life. Without their guidance, their
advice, and their help, this dissertation would not have been completed.
Other members of my dissertation committee, Dr. William T. Baumann, Dr. Hugh F.
VanLandingham, and Dr. Joseph A. Ball, who were also my professors at some point during my
five and a half years at Tech, were kind enough to listen to me babbling about an intelligent
controller.
On multiple occasions, I had the chance of talking about my research (and my life) to the
other graduate students in the Machine Intelligence Laboratory. Doug, John, Mamun, and Ferat
were some of the faces I got used to seeing almost every day. Also, Paul and Kevin forced me
every Wednesday afternoon to have my breakfast at Arnold's. I do have a feeling that I will miss
that very much.
During my assistantship at the Center for Transportation Research, I had the privilege of
working with fellow students, researchers, and the center's staff. Lisa, Terry and Christina were there
to answer my unnecessary questions. Kaan Özbay was helpful in my starting to work at CTR, and
I am grateful for his confidence in me. My only regret about CTR is not having been able to know
Ashwin Amanna better; he is a true gentleman.
Local Turks, Simge, Erkan, Bahadır, Levent, Oya, Bahar, Kutsal, and others, suffered
indirectly from my research efforts. Sorry guys!
My parents, Aydın and Ayhan, watched me from a distance while I worked toward my
degree, missing most of the last three years of my life. The completion of this dissertation will
mean "seeing more of me" to them, I hope. I believe that my sister Gün will not make the
mistakes I have made in becoming an electrical engineer. Watching her catching up with me
makes me confused and very proud.
Probably, the people who heard the most about my dissertation and graduate life are the
members of KM (Kelek Muhabbet). This e-mail distribution list came to life sometime in 1992 and
grew to a much larger and diverse group. Some of the members are my close friends from the
early years of my academic venture, some of them are people whom I met on the Net, some of
them I have not seen yet. They also watched me getting tired of this lovely town of Blacksburg;
some even visited me here more than once. I could not, cannot and do not want to imagine going
a week without receiving e-mail from KM.
I am also grateful to Berrak Pınar for helping me tremendously during the last two hectic
weeks of my dissertation work.
Finally, to the person who borrowed the Proceedings of the 1995 Intelligent Vehicles
Symposium and has refused to return it to the library since July 1996: You have been acknowledged!
This work is dedicated to all scientists and engineers who work on artificial intelligence,
and know exactly what I went through.
Table of Contents
Abstract ii
Acknowledgments v
Table of Contents vii
List of Figures x
List of Tables xiv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 ITS and AHS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Intelligent Vehicle Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Different Approaches to Intelligent Vehicle Control . . . . . . . . . 4
1.2.2 Learning Automata as an Intelligent Controller . . . . . . . . . . . . . 6
1.3 Scope and Structure of Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.4.1 Norms of Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.2 Variable Structure Automata . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.5 Reinforcement Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5.1 Linear Reinforcement Schemes . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5.2 Nonlinear Learning Algorithms: Absolutely Expedient Schemes 39
3.6 Extensions of the Basic Automata-Environment Model . . . . . . . . . . . . . . 41
3.6.1 S-Model Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.6.2 Nonstationary Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6.3 Multi-Teacher Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.6.4 Interconnected Automata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.2.1 Function H and the Conditions for Absolute Expediency . . . . . . 111
6.2.2 Comparison of NLH with General Absolutely Expedient Scheme 114
8 Conclusion 139
8.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
8.1.1 AHS as a Social Issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
8.1.2 Technical Feasibility of AHS . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.1.3 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.2 Results and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.3 Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Bibliography 149
Vita 185
List of Figures
Figure 5.6 Headway distance and speed of an autonomous vehicle following another
slow-moving vehicle: sensor parameters are d1 = 12m, d2 = 15m, and
fsr = 16m. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Figure 5.7 Positions of five automated vehicles: gray colored vehicle attempts to
shift to the right lane by slowing down. . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Figure 5.8 Positions of five automated vehicles: gray colored vehicle attempts to
shift to the right lane by slowing down. . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Figure 5.9 Headway distances in the platoon; lane change occurs at t ≈ 8.5 sec. . . . . 71
Figure 5.10 Speed and space-time trajectory of five automated vehicles. . . . . . . . . . . . 72
Figure 5.11 The lane module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Figure 5.12 Pinch condition: two vehicles decide to shift to the same lane
simultaneously. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Figure 5.13 Memory vector/buffer: If an action fills half of the buffer, it is signaled. 74
Figure 5.14 Initial positions and trajectories of two vehicles. . . . . . . . . . . . . . . . . . . . . 75
Figure 5.15 Lane and speed of two vehicles from t = 0 to t = 4sec. . . . . . . . . . . . . . . . 75
Figure 5.16 The output of the pinch module and the actions in the memory buffer
versus time for vehicle 1 and 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Figure 5.17 Initial and final situation, speed, and lane positions of three automated
vehicles traveling on a 3-lane 500-meter circular highway [mpeg]. . . . . . . 78
Figure 5.18 Snapshots, speed and lane positions of two automated vehicles traveling
on a 3-lane 500-meter circular highway. . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Figure 5.19 The new definition of the mapping F. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Figure 5.20 Automata in a multi-teacher environment with new definitions of mapping
functions and flag structures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Figure 5.21 Average platoon speeds for simulations 1-4. . . . . . . . . . . . . . . . . . . . . . . . 85
Figure 5.22 The distance between vehicles for simulations 1-4. . . . . . . . . . . . . . . . . . . 86
Figure 5.23 Initial and final positions, and speeds of automated vehicles in four
ten-vehicle platoons [mpeg]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Figure 5.24 Simulations 5 and 6: average speed for a platoon of ten automated
vehicles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Figure 5.25 Simulations 5 and 6: distances between vehicles for a platoon of ten
automated vehicles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Figure 5.26 Initial and final positions, and speeds of automated vehicles in four
ten-vehicle platoons [mpeg]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Figure 5.27 Simulations 7 and 8: mpeg movies of two simulations of 10-vehicle
platoons with lead vehicle decelerating to 75mph [mpeg]. . . . . . . . . . . . . 90
Figure 6.1 Number of steps needed for popt to reach 0.995 for different values of
learning parameters a and b. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Figure 6.2 For some parameter values, convergence is not obtained for all 500 runs. . 97
Figure 6.3 Number of steps needed for popt to reach 0.995 for different values of
learning parameters a and b. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Figure 6.4 Number of steps needed for popt to reach 0.995 for different values of
learning parameters a and b. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Figure 6.5 The region for asymptotic convergence to pure optimal strategy. . . . . . . . 106
Figure 6.6 Probabilities and cumulative average of the probabilities of the non-
optimal actions for a 3-action automaton with LR-P (a = 0.2, b = 0.1). . . . . 106
Figure 6.7 Sketch of the function H. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Figure 7.1 Longitudinal automaton determines the lateral automaton's environment. . 118
Figure 7.2 Multiple vehicles interacting through the physical environment. . . . . . . . . 121
Figure 7.3 Situations for two interacting vehicles and combined environment
responses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Figure 7.4 Changes in the physical (left) and automata environments (right): vehicle
1 shifts to the middle lane. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Figure 7.5 Possible physical environment states for 2 vehicles in 3-lane highway. . . . 126
Figure 7.6 State transition diagram for two vehicles on a 3-lane highway; reflexive
transitions are not shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Figure 7.7 Scenario 1: Two vehicles with conflicting desired paths. . . . . . . . . . . . . . . 129
Figure 7.8 Possible chains for Scenario 1 (reflexive transitions are not shown; two
chains are distinguished with dashed and solid lines). . . . . . . . . . . . . . . . . 130
Figure 7.9 Possible penalty-reward structures to force physical environment to
switch to states B1 or C1 from current state A1. . . . . . . . . . . . . . . . . . . . 131
Figure 7.10 Scenario 2: Three vehicles with conflicting desired paths. . . . . . . . . . . . . . 132
Figure 7.11 A possible chain for Scenario 2: lane flag forces vehicles 1 and 2 to slow
down. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Figure 7.12 Three-vehicle transition diagram can be written as three separate two-
vehicle transition diagrams using the definitions in Figure 7.8. . . . . . . . . . 133
Figure 7.13 Three-vehicle transition diagram is equivalent to two separate two-vehicle
diagrams for the example in Section 5.3.1. . . . . . . . . . . . . . . . . . . . . . . . . 134
Figure 7.14 Scenario 3: Four vehicles with conflicting desired speeds. . . . . . . . . . . . . 135
Figure 7.15 Two possible solutions to a situation with 4 vehicles. . . . . . . . . . . . . . . . . 135
Figure 7.16 Speeds and positions of vehicles 1 and 2 for scenario 3 [mpeg]. . . . . . . . . 137
Figure B.1 Probabilities of five actions in the LI-P scheme; only c1 = 0. . . . . . . . . . . . . 169
Figure C.1 Structure of the simulation program. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Figure C.2 The simulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Figure C.3 Graphic User Interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Figure C.4 GUI for data visualization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Figure C.5 Trajectory command for relative position plots. . . . . . . . . . . . . . . . . . . 176
Figure C.6 Scenario window: Clickable buttons initialize several different scenarios. . 176
Figure C.7 Timed snapshots of a simulation run. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Figure D.1 All possible immediate neighborhood situations for two vehicles. . . . . . . . 180
Figure D.2 Combined states for two vehicles: states not shown are identical to those
given here. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
Figure D.3 Further simplified states. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Figure D.4 All possible immediate neighborhood situations for three vehicles in a
three-lane highway. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
Figure D.5 Combined states for three vehicles: states not shown are identical to those
indicated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Figure D.6 Possible states for four vehicles: actually a two-vehicle situation. . . . . . 184
List of Tables
Chapter 1
Introduction
1.1 Motivation
One of today's most serious social, economic, and environmental problems in the world is
traffic congestion. In the United States alone, the problem is costing over 100 billion dollars each
year. In addition, more than 40,000 persons are killed and another three million are injured (an
additional cost of $137 billion per year). This is compounded by the fact that it is becoming
increasingly difficult, for both financial and environmental reasons, to continue to build additional
highway capacity [Hayes94]. Therefore, the solution to the problem must lie in other
approaches, one of which is to make better use of the existing highway infrastructure. The
current infrastructure has begun to plateau in its ability to meet the operational requirements of
highway transportation. Consequently, congestion and safety risks are increasing. Previous
solutions of building or widening roads are no longer feasible; the average cost of a mile of
urban freeway is $38 million. Furthermore, the tens of thousands of lives lost per year are a steep
price to pay for mobility [Hayes94].
New approaches discussed in the next section have the potential of increasing the safety,
reducing the congestion, and improving the driving conditions. Studies conducted at the beginning
of the intelligent transportation era show that it is possible to cut accidents by 18%, gas
emissions by 15%, and fuel consumption by 12% by employing new approaches. It is claimed
that these improvements will result in a $10 return to the economy for each dollar invested in the
new transportation systems [Card93].
Considering the potential of the application of intelligent systems to the surface
transportation problem, the US Department of Transportation deems the investigation of new
technologies as crucial. Most of the smart highway technologies were first suggested in the
1960s, but the idea did not gain momentum until the Intermodal Surface Transportation Efficiency
Act (ISTEA) was passed by Congress in 1991. The technology needed to create an intelligent
transportation system is already available, although still expensive for full implementation.
Today's vehicles have at least four or five microprocessors monitoring and controlling such
things as ignition spark, fuel and emission controls, automatic transmission, cruise control, air
bags, and anti-lock brakes. Further development is underway on computerized controls for active
suspensions, traction control, all-wheel drive, and four-wheel steering. Few people realize, in
fact, that today's car has four times the computing power of the first Apollo moon rocket
[Card93].
- Significant reduction in the number of annual fatalities and injuries due to accidents,
- Improvement in the safety of commercial vehicles and hazardous material movement,
- Significant reduction in the costs associated with congestion,
- Increase in the volume of people and goods that can be moved on existing highways,
- Improvement of travel time predictability, and reduction in the level of stress associated
with travel,
- Reduction in harmful vehicle emissions, and in surface transport energy consumption per
vehicle-mile and per passenger-mile traveled.
While congestion is the primary problem in urban areas, rural areas suffer from a higher
traffic fatality rate due to road conditions and higher rates of speed. Increased margins of safety
provided by ITS technologies might reduce the number of accidents in both urban and rural areas.
Most importantly, communication capabilities could improve emergency response times, and
improve levels of rural public transportation services.
The established process of delivering the ITS program has four major components
covering the following issues:
The DOT has established a set of discrete milestones to be accomplished by the end of
1997. Some of these milestones related to the work presented here are system
architecture, communication frequencies, collision avoidance, and automated highway systems.
A major long-term element of DOT's research and development is AHS. The deployment
of early phases of AHS will begin in the next ten years, but will continue beyond that period.
Current AHS program activities address the imminent user services involving safety advisory and
driver assistance products, e.g., in-vehicle warnings of impending danger, and emergency control
intervention. The ITS architecture must be structured to provide a sensible evolution to
accommodate these types of user services, as well as the eventual realization of total automation.
The 1997 demonstration requested by Congress is the AHS program's first major
milestone, and will assess proof of concept feasibility for the AHS. Preparation for
demonstration will include the identification and evaluation of feasible alternative conceptual
approaches. The demonstration will feature many feasible concepts. Subsystems and control,
including lateral and longitudinal control, transition between manual and automated control,
maintenance of position in the roadway traffic flow, lane changing, and various malfunction
management capabilities will be demonstrated, all in a collision-free automated driving
environment.
The demonstration results, if successful, will lead to selection of the AHS conceptual
approach, documentation of the approach in performance specifications and standards, and
establishment of a partnership with industry to design, develop, and operationally test an AHS
with public participation. Underpinning the development of the first AHS demonstration is an
understanding of the human factor issues related to AHS driver operations. The DOT has begun a
research project to develop comprehensive driver performance guidelines for AHS. Beyond 1997,
deployment of a nationally compatible system will become the primary concern of DOT. It is
expected that the work will shift towards supporting deployment of ITS user services, and
continuing progress on long-term ITS goals, especially the AHS. Details of the AHS concept and
its major factors are described in Chapter 2.
there are multiple fast-moving vehicles, making the right decision to avoid collisions and optimize
the vehicle path is difficult. Several different approaches are currently being considered in
AHS research. These, and our position relative to current research, are discussed next.
1.2 Context
1.2.1 Different Approaches to Intelligent Vehicle Control
There are two main approaches to the implementation of a fully automated highway system:
hierarchical structure and autonomous vehicle approaches. The first approach, defined in
[Varaiya93], centers around the notion of platooning (See Section 2.3). In this approach, there
are different layers of control hierarchy, each responsible for a different task required to
implement an automated highway system. The layers of the architecture, starting at the top, are:
network, link, planning, and regulation. The network layer assigns a route to each vehicle as it
enters the system. The link layer assigns each vehicle a path which balances traffic for all lanes
and assigns a target speed for each section of highway. The planning layer creates a plan which
attains the desired path. The regulation layer controls the vehicle trajectory so that it conforms to
this plan.
The tasks assigned to each layer differ from each other in three dimensions: time scale,
information span, and spatial impact of decisions. As we go up in the hierarchy, the frequency of
decisions decreases, more aggregated information is used, and the decisions affect vehicles over
larger distances. For example, the planning layer of the hierarchy makes decisions to create a plan
that evolves every minute by coordinating with neighboring vehicles. Path decisions affect only
the neighboring vehicles.
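As a rough sketch of this top-down flow, the four layers can be mocked as a pipeline. The class and method names, and all returned values below, are hypothetical illustrations, not the actual architecture of [Varaiya93]:

```python
# Illustrative sketch of the four-layer control hierarchy described
# above (layer names from [Varaiya93]); classes, method names, and
# returned values are all hypothetical.

class NetworkLayer:
    def route(self, vehicle_id):
        # assign a route to each vehicle as it enters the system
        return ["segment-A", "segment-B"]

class LinkLayer:
    def path_and_speed(self, route):
        # balance traffic across lanes and set a target speed
        return {"lane": 2, "target_mph": 65}

class PlanningLayer:
    def plan(self, path):
        # create a maneuver plan which attains the desired path
        return ["merge", "cruise"]

class RegulationLayer:
    def regulate(self, plan):
        # control the vehicle trajectory so it conforms to the plan
        return "tracking " + plan[0]

# Decisions flow top-down through the hierarchy.
net, link, planner, reg = (NetworkLayer(), LinkLayer(),
                           PlanningLayer(), RegulationLayer())
command = reg.regulate(planner.plan(link.path_and_speed(net.route("v1"))))
```

Each call site corresponds to one layer handing its decision down to the next; in the real architecture these decisions occur at very different time scales and information spans.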
This approach assumes an information flow from the higher layers of the hierarchy and to/from other
vehicles for making decisions about the optimal path. The information available to an automated
vehicle has global characteristics since it is relayed from a higher layer of the hierarchy.
A simpler approach, for reasons of implementation, is the autonomous vehicle approach,
where each vehicle is treated as an autonomous agent using local sensors (and possibly limited
communications with neighboring vehicles). It does not use the concept of closely separated
vehicles; it aims to solve the problem in a shorter time interval, and to reach a fully automated
highway system by an evolutionary implementation, i.e., by slowly increasing the capabilities
of autonomous vehicles over time.
In the first case, the control commands are relayed by higher layers in the hierarchy, and
the vehicles do not need to make intelligent decisions for path planning; the hierarchical structure
controls the vehicles. In the latter approach however, the vehicles need to make decisions based
on local information. Our learning planner can be visualized as part of the second approach, or a
backup system for the first one.
The first attempt to solve the problem of real-time decision making in an automated
highway environment dates back to 1992. Mourou and Fade described a planning method
applicable to agents with perception and decision-making capabilities and the ability to
communicate with other agents [Mourou92]. The agents, in this example, are assumed to
transmit their actions to their neighbors. This work emphasized the possibility of achieving a
more flexible, fast, and reactive system by multi-agent planning methods and execution
monitoring. However, this approach requires constant and complete data transfer between agents.
Recent research on intelligent vehicles includes an adaptive intelligent vehicle module used
in a simulation and design system for tactical driving algorithms [Sukthankar96a, 96b]. The
approach divides the driving task into three levels: strategic (route planning, goal determination),
tactical (maneuver selection for short-term goals) and operational (translating maneuvers into
control operations). Intelligent modules are designed to answer the need for real-time tactical level
decisions. A reasoning system called SAPIENT combines the high-level goals with low-level
sensor constraints to control vehicles. The system uses a number of modules whose outputs are
combined using a voting scheme. The Population-Based Incremental Learning (PBIL) method is
used to adjust the large number of parameters defining the behavior of the modules. This
approach is basically a method for finding suitable parameters for the modules that are used to
fire lateral and longitudinal actions. The learning algorithm for the parameters is a combination of
evolutionary optimization and hill climbing, and it is very similar to learning automata
reinforcement schemes except for the use of a mutation factor for the probabilities of actions at
the end of each iteration.
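The PBIL loop described above (sample a population from a probability vector, pull the vector toward the best sample, then perturb it with a mutation factor) can be sketched minimally as follows. The function name, binary encoding, and parameter values are assumptions for illustration, not the design tool of [Sukthankar96a, 96b]:

```python
import random

def pbil(fitness, n_bits, pop=20, lr=0.1, mut=0.02, steps=200, seed=3):
    """Minimal Population-Based Incremental Learning sketch:
    binary encoding, `fitness` is maximized."""
    rng = random.Random(seed)
    p = [0.5] * n_bits                    # one probability per bit
    best, best_f = None, float("-inf")
    for _ in range(steps):
        # sample a population from the current probability vector
        samples = [[1 if rng.random() < pi else 0 for pi in p]
                   for _ in range(pop)]
        top = max(samples, key=fitness)
        f_top = fitness(top)
        if f_top > best_f:
            best, best_f = top, f_top
        # pull probabilities toward the best sample (hill-climbing step)
        p = [pi + lr * (b - pi) for pi, b in zip(p, top)]
        # mutation factor applied to the probabilities themselves
        p = [min(0.999, max(0.001, pi + rng.uniform(-mut, mut)))
             for pi in p]
    return best, best_f
```

For example, `pbil(sum, 8)` maximizes the number of ones in an 8-bit string; the probability-vector update is what makes the method resemble a learning automata reinforcement scheme, with the mutation step as the distinguishing addition.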
The learning automata controller we describe here can be visualized as a combination of
intelligent modules at the tactical level. However, our approach differs from the above-mentioned
research in the use of the learning algorithms. Instead of learning the parameters affecting the
firing of actions on repeated runs, learning automata learn which action to fire based on the local
sensor information. In other words, the learning is not at the design phase, but at the run phase.
The parameters defined in [Sukthankar96a] correspond to learning (Chapters 3, 4, and 6) and
sensor (Chapters 4 and 5) parameters of our controller, and can be used to adjust those. Learning
and sensor parameters define the capabilities of an automated vehicle, and can also be used to
model different driver behaviors. For example, large learning parameters decrease the decision
time, and consequently result in a more agile vehicle path.
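A minimal sketch of such run-phase learning follows: an automaton picks a lateral action from its probability vector, a toy sensor "teacher" returns a penalty flag, and a linear reward-penalty update is applied. The update rule is the standard linear scheme, but the action names, sensor logic, and step sizes are assumptions, not the controller developed in later chapters:

```python
import random

ACTIONS = ["stay", "shift_left", "shift_right"]   # hypothetical actions

class LateralAutomaton:
    """Variable-structure automaton with a linear reward-penalty update."""
    def __init__(self, a=0.2, b=0.1, seed=5):
        self.p = [1.0 / len(ACTIONS)] * len(ACTIONS)
        self.a, self.b = a, b                     # reward/penalty steps
        self.rng = random.Random(seed)

    def choose(self):
        r, acc = self.rng.random(), 0.0
        for i, pi in enumerate(self.p):
            acc += pi
            if r <= acc:
                return i
        return len(self.p) - 1

    def update(self, i, penalty):
        n = len(self.p)
        if penalty:   # move probability away from the penalized action
            self.p = [(1 - self.b) * pj if j == i
                      else self.b / (n - 1) + (1 - self.b) * pj
                      for j, pj in enumerate(self.p)]
        else:         # reward: move probability toward action i
            self.p = [pj + self.a * (1 - pj) if j == i
                      else (1 - self.a) * pj
                      for j, pj in enumerate(self.p)]

def sensor_penalty(action_index):
    # toy environment: the right lane is blocked, so shifting right
    # is always penalized; everything else is rewarded
    return action_index == ACTIONS.index("shift_right")

auto = LateralAutomaton()
for _ in range(300):
    i = auto.choose()
    auto.update(i, sensor_penalty(i))
```

After a few hundred steps the probability of the penalized action shrinks toward zero: the learning happens at run time, from the environment's responses, rather than at design time.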
Another approach to intelligent vehicle control for autonomous navigation uses a
decision-theoretic architecture with probabilistic networks [Forbes95]. The problem is modeled
as a partially observable Markov decision process, and the optimal action is a function of the
current belief state described as the joint distribution over all possible actual states of the world.
However, this work currently considers only combinations of dynamic decision networks with a
decision tree, i.e., if-then rules where each predicate is actually a complex set of probability
thresholds on specific variables, and it is similar to the previous work in the sense of firing
actions. Similarly, Niehaus and Stengel defined a rule-based navigation system that uses a worst-
case decision-making (WCDM) approach. Again, a stochastic model of the traffic situation based
on sensor measurements is assumed [Niehaus94].
This application of learning automata is one of the very few applications to real-time
learning problems [Narendra89, Najim91b]. The difficulty in modeling the environment (highway,
the vehicles, and the drivers) makes learning automata a likely candidate for such an application,
since a learning system is guaranteed to adapt to existing conditions provided that the teachers
are designed carefully.
The overall design structure, the effects of several parameters on the decision, and several other
issues related to the intelligent path planner conclude Chapter 5.
In Chapter 6, we introduce reinforcement schemes (learning algorithms) used for our
application of learning automata. The schemes are compared according to their desired update
characteristics. A new nonlinear scheme and a linear scheme not considered in previous
research efforts are found to be more advantageous than others. For the linear scheme, a
proof of optimality for a specific case in our application to intelligent vehicle control is given.
Instead of the general linear differential equation approach [Narendra89], stability theorems for
discrete-time systems are used for this purpose. The behavior of the new nonlinear reinforcement
scheme is also discussed in this chapter. It is found that the algorithm is absolutely expedient.
These two new schemes are the results of our attempt to find fast converging reinforcement
schemes.
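An experiment of the kind summarized here (counting the steps for the optimal action's probability to reach 0.995) can be sketched as follows, assuming a stationary P-model environment with penalty probabilities c. Setting b = 0 yields an LR-I scheme and b > 0 an LR-P scheme; all numerical values are assumptions for illustration:

```python
import random

def steps_to_converge(c, a, b, target=0.995, max_steps=50000, seed=1):
    """Steps until the optimal action's probability reaches `target`
    under a linear scheme in a stationary P-model environment with
    penalty probabilities `c`; returns None if never reached."""
    rng = random.Random(seed)
    n = len(c)
    p = [1.0 / n] * n
    opt = min(range(n), key=lambda i: c[i])   # least-penalized action
    for step in range(1, max_steps + 1):
        # sample an action from the probability vector
        r, acc, i = rng.random(), 0.0, n - 1
        for j, pj in enumerate(p):
            acc += pj
            if r <= acc:
                i = j
                break
        if rng.random() < c[i]:               # environment penalty
            p = [(1 - b) * pj if j == i
                 else b / (n - 1) + (1 - b) * pj
                 for j, pj in enumerate(p)]
        else:                                 # environment reward
            p = [pj + a * (1 - pj) if j == i else (1 - a) * pj
                 for j, pj in enumerate(p)]
        if p[opt] >= target:
            return step
    return None
```

Averaging the returned step counts over many runs, for grids of a and b values, produces convergence-speed comparisons of the kind plotted in Figures 6.1 through 6.4; note that since LR-I is only epsilon-optimal, individual runs may fail to converge, as Figure 6.2 illustrates.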
Chapter 7 introduces the concept of interacting automata and the treatment of multiple
automata as game playing agents. The concept of the automata games is then extended to
multiple automata in multiple vehicles. An approach similar to automata games [Narendra89] is
taken for interacting vehicles as well. By defining the physical environment of a vehicle as a
switching environment, it is possible to treat a highway situation as state transitions in a Markov
chain. By associating an automata environment with each stationary physical environment, it is
possible to analyze and design physical environment changes as state transitions based on the
reward-penalty structure of the automata environment¹. This approach led to the design of the flag
structure explained in detail in Chapter 5.
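The game-playing view can be illustrated with two lane-choosing automata whose combined environment response rewards only non-conflicting choices. The lane count, step size, and iteration count below are assumptions for illustration, not the scenarios of Chapter 7:

```python
import random

# Two automata interacting through a combined environment response:
# a reward is issued only when the chosen lanes do not conflict.
# An LR-I update (no change on penalty) is used.

rng = random.Random(7)
a = 0.15                                        # reward step size
lanes = 3
p = [[1.0 / lanes] * lanes for _ in range(2)]   # one vector per vehicle

def pick(probs):
    r, acc = rng.random(), 0.0
    for i, pi in enumerate(probs):
        acc += pi
        if r <= acc:
            return i
    return len(probs) - 1

for _ in range(5000):
    choice = [pick(p[0]), pick(p[1])]
    if choice[0] != choice[1]:          # no conflict: both are rewarded
        for v in range(2):
            i = choice[v]
            p[v] = [pj + a * (1 - pj) if j == i else (1 - a) * pj
                    for j, pj in enumerate(p[v])]
    # on a conflict, LR-I applies no update (inaction)

final = [max(range(lanes), key=lambda i: p[v][i]) for v in range(2)]
```

With high probability the two automata settle on distinct lanes, i.e., the interacting reward mechanisms drive the joint physical environment into a conflict-free state, which is the effect the flag structure is designed to enforce.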
A brief discussion of technical feasibility and user acceptance of AHS, the conclusions
drawn from the work presented here, and our recommendations for future research as well as the
discussion of the problems we encountered in this research are listed in Chapter 8.
Notations and definitions of the variables, list of abbreviations and acronyms, additional
proof of convergence for a simple linear reinforcement scheme, a brief description of the
simulation program and diagrams of the states for highway scenarios mentioned in Chapter 7
are all given in the Appendices.
1 The distinction between automata and physical environments is discussed in Chapter 4.
Chapter 2
Automated Highway Systems
The consensus in the AHS community is that AHS will evolve over a series of smaller
steps in technology. The final step of full automation will not be a leap, but a logical consequence
of previous development and deployment efforts. Each step in the technology will have its own
benefits and be self-sustaining. Vehicle and infrastructure evolutions will be synchronous
[James94]. We will briefly mention the steps of this evolution here before introducing the AHS
program and discussing automatic vehicle control technologies in detail.
When cruise control was first developed, there was much concern over the safety and
user acceptance of the new system; however, it has become widely accepted and used. In the near
future, obstacle and headway warning and Automatic Vehicle Identification (AVI) will be added
to modern cruise control and existing communications infrastructure. The success of AHS
depends on linking the power of cellular communications and the emerging range of high-
performance computers to the ongoing vehicle based developments. Ideally, the highway system
can be divided into a number of cells which contain local radio receivers or beacons that will be
linked together through a fiber-optic network. Vehicles will also be equipped with a transceiver
unit carrying several user services. The first applications of this technology are Automatic Vehicle Identification (AVI) and Electronic Toll Collection (ETC). Obstacle and headway warning is the next step in AHS development in vehicles. Vehicle on-board radar (VORAD) systems have already been in use in many commercial vehicles for the last two years. An important issue in warning systems is the capability of the sensor modules. Differentiating between a large
Cem Ünsal Chapter 2. Automated Highway Systems 10
vehicle and a small animal may not be possible using a simple system. A consequent application
of the headway warning system is the automatic headway control. Adaptive cruise control
systems are currently designed by many automobile manufacturers.
The market introduction of the first vehicle with adaptive cruise control is expected in
1997. This will enable drivers to hold their desired speed as well as the desired headway distance. Although the drivers defined as "creepers" will be cut off by more aggressive drivers ("hunters"), the ability to set the desired headway may be desirable to many users. Also, issues such as sensor types, curve handling, merging vehicles, changing lanes, and integration of steering and braking all have to be addressed to obtain a complete system design.
Applications in advanced traffic management, traveler information and public
transportation systems (ATMS, ATIS, APTS) will require more sophisticated vehicle location
capabilities. In addition, the number of uses for vehicle-to-roadside communications will
eventually increase. MAYDAY services, fleet tracking and automatic vehicle location (AVL)
applications will use radio-location beacons as well as more sophisticated transceivers. As a
result of AVL and AVI, processing real-time information on vehicle locations will be possible.
Although the number of vehicles equipped with AVI/AVL technologies will initially be small,
traffic management centers can effectively use a small percentage of vehicles as probes.
Roadside-to-vehicle and vehicle-to-vehicle communications are also important for the
future of AHS. Automatic braking systems may be activated by decelerating vehicles in front, or
by the infrastructure sending a deceleration request to the headway control system. The vehicle
must be very sure about the imminent danger, and knowledge of following vehicles and their
speeds is an important factor to be considered. Inter-vehicle communications and rear sensing
both would help in automatic braking.
Evolution of the AHS system will continue with lane departure warning. It will be the
first system to control lateral movement of vehicles. The lane holding feature will consequently
be added to the adaptive cruise control, shortly after the lane departure warning feature.
Automatic lane holding will provide a "hands off/feet off" driving situation where the driver is
still responsible for all command decisions in the vehicle and must be aware at all times of his
surroundings. If the infrastructure knows the location of each vehicle, possesses the information
about its current path, and is communicating with the vehicle, then the lateral control can be
coordinated from the infrastructure.
Further advances in technology will take control of the vehicle away from the driver. In order to gain any additional benefit in safety and efficiency1, the driver must be removed as the primary source of command and control. Of course, this change requires that the automated system perform better than a good driver, i.e., make no more than 1 critical error in 10^8 seconds. This step will be the natural consequence of the previous progress. Obviously, not all
vehicles will be equipped with this technology right away. Automated and manually driven
vehicles have to coexist for some time.
1 With the driver in control, highway capacity can only be increased to 15% above the maximum value of 40 vehicles/lane/min. Furthermore, 90% of highway accidents are the result of incorrect driver decisions.
A vehicle that can predict the actions of neighboring vehicles is an important step for
safer highway transportation. Locating the position of all the vehicles in close proximity to the
automated vehicle with high accuracy is essential. This can be accomplished through multi-sensor
systems for adjacent vehicles and possibly inter-vehicle communications to give an idea of what
to expect beyond adjacent vehicles. Alternatively, the roadside control may have knowledge of
the positions of the vehicles relative to fixed reference points. This knowledge is obtained by
either vehicle based or roadside based detection, and/or by communicating with the vehicle.
This technology requires extreme accuracy in vehicle location at all times. If the system is
infrastructure-based, the infrastructure needs to know the locations of the non-automated
vehicles for safe and efficient implementation. Vehicle location information must be updated at least 100 times per second with an accuracy better than 10 cm for the desired level of safety [James94]. Automated vehicle control (AVC) systems are expected to boost the capacity by
50% even for mixed vehicle traffic. Once the system has knowledge of the surrounding
environment to all extents, it can make decisions on merging and passing in addition to the
headway control and lane keeping performed under driver control. Full system optimization and
higher efficiencies can then be obtained as the percentage of automated vehicles on the road
increases.
Highways have many characteristics that simplify the problem of automation, such as uninterrupted traffic flow and controlled access. Therefore, automation on arterials will lag
significantly behind automated highways. However, many safety measures can be taken on
arterials using the equipment already designed for the highway. For example, the problem of
intersection collision can be reduced by activating the onboard warning systems and automatic
braking systems with electronic signal lights in addition to the normal traffic signal. If the
intersection detects a potential for a collision it can notify equipped vehicles. Problems to be
encountered during AHS deployment on arterials include integration of cyclists and motorcyclists
to the AVC system, and the effects of pedestrian and animal traffic.
The final step in the AHS future is fully automatic control, wherein the driver will have
no control over the vehicle. All trip decisions will be automatically made using AVL and ATIS
information. The driver may be able to include additional criteria for route selection. Once the trip
decision is made, the infrastructure, utilizing AVC, will guide the vehicle while constantly
updating the routing strategy based on the current information obtained through Advanced
Traffic Management System (ATMS).
at [NASHC96]. The information about the past and future of the AHS program can also be found
at [AHS96].
Outside of the US, similar AHS programs are being conducted in Japan [ITSJ96], and in
Europe [Braess95]. Almost all automotive companies in Japan have automated vehicles in the
design and/or testing phase [Construction96]. Similar activities are reported in the
PROMETHEUS program in Europe [Rault95].
steering command, and is designed by another NASHC participant. In order to locate the road
ahead, the rapidly adapting lateral position handler (RALPH) uses a template-based matching
technique to find parallel image features such as lane markings or tire and oil markings. During the
experiment called "No Hands Across America," the test bed vehicle equipped with the RALPH system drove 98% of the 2850-mile journey autonomously. An average speed of 63 mph in conditions that included bright sunlight, dusk, rain and nighttime, and a maximum stretch of 69 miles of autonomous driving are reported [Pomerlau96]. A third application for lateral control consists of a vision-based system with a neural network learning from a driver. Performance levels comparable to the human driver are reported in [Moon96]. Özgüner et al. reported successful results for lane following using a vision sensor [Özgüner96].
the distance between vehicles to 0.20m will change this number to 130 vehicles/lane/min, although
it may not be feasible for a heterogeneous platoon of vehicles. For intra-platoon distances
discussed here, it is impossible for a driver, who has a reaction delay of 0.25-1.20 sec, to
guarantee adequate safety.
According to Varaiya, the tasks for an automated vehicle-highway system to accomplish
can be achieved by a four-layer hierarchical control architecture [Varaiya93]. The layers of the
architecture, from the top, are network, link, planning, and regulation. The network layer assigns
a route to each vehicle as it enters the system. The link layer assigns each vehicle a path which
balances traffic for all lanes, and assigns a target speed for each section of highway. This layer
may also assign platoon size. The planning layer creates a plan which approximates the desired
path. The regulation layer controls the vehicle trajectory so that it conforms to this plan. Below
the regulation layer, a physical layer that provides sensor data and responds to actuator signals,
is assumed.
Different choices of partition of control authority between the infrastructure and the
vehicle are compatible with this architecture. The regulation and planning layer controllers are on board the vehicles, while the link and network layers are on the roadside. It is important to find a design
that combines the advantages of a fully centralized control system and an architecture based on
autonomous vehicles and local sensor information.
In this thesis, we introduce an intelligent controller which can be seen as the planning
layer of an autonomous vehicle. The planning layer, as defined in [Varaiya93], has three tasks:
In our application, where the scenario includes non-automated vehicles and minimal
communications, the coordination of maneuvers between planning layers does not exist. The
communication between planning and regulation layers is fairly simple: the planning layer sends a
command, and the regulation layer returns a reply once it successfully completes the command.
A richer interface may be required: the planning layer could pass multiple parameters to the
regulation layer, which could then return parameters indicating the success or errors and
exceptions. The theory of control of such a system is not yet developed. There is a need for
research as to how the regulation layer should switch from one control law to another. Lasky and Ravani state that this represents an open research issue whose solution may be vital to the implementation of a full AHS [Lasky93].
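The simple command-reply hand-shake between the planning and regulation layers described above can be sketched as a minimal interface. All class names, the command set, and the reply fields below are hypothetical illustrations, not the dissertation's actual interface; a real regulation layer would switch among control laws rather than simply acknowledge.

```python
from dataclasses import dataclass
from enum import Enum

class Command(Enum):
    """Hypothetical planning-layer command set."""
    KEEP_LANE = 0
    SHIFT_LEFT = 1
    SHIFT_RIGHT = 2
    SPEED_UP = 3
    SLOW_DOWN = 4

@dataclass
class Reply:
    done: bool        # command completed successfully
    error: str = ""   # a richer interface could report errors and exceptions

class RegulationLayer:
    """Stand-in for the on-board trajectory controller."""
    def execute(self, cmd: Command) -> Reply:
        # A real controller would select and run a control law here;
        # this sketch simply acknowledges the command.
        return Reply(done=True)

class PlanningLayer:
    """Sends one command at a time and waits for the reply."""
    def __init__(self, regulation: RegulationLayer):
        self.regulation = regulation

    def request(self, cmd: Command) -> bool:
        return self.regulation.execute(cmd).done

planner = PlanningLayer(RegulationLayer())
assert planner.request(Command.SHIFT_LEFT)
```

The richer interface mentioned above would replace the single `done` flag with multiple parameters passed in each direction.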
The approach above is one of the many different approaches to AHS3. Some of these
concepts are based on cooperative architectures [McKendree96], maximum adaptability
3 We emphasized this architecture since it is used in DYNAVIMTS, the Virginia Tech Center for Transportation Research's simulation program (see Section 2.6).
which traffic and hazard information detected by one group of vehicles can be communicated to
other vehicles. There are a variety of methods for roadside-vehicle communications; most of them
are discussed in [Field92]. Recent research on communications includes interference studies for
vehicle transceivers [Gavan96], design of communication protocols [Godbole96], WaveLAN
radio [Chen96], and network architectures and protocols for vehicle-to-vehicle communications
[Bolla96, Fuji96].
fully automated) on driver behaviors while entering the automated lane are tested using a driving
simulator [Bloomfield96b]. Levitan presented a comprehensive report on human factors
considerations related to the design of AHS. The report is intended as a guideline for effective,
efficient, safe and user-acceptable AHS design [Levitan96].
Figure 2.1. Four-block evaluation and experimentation framework for AHS (with feedforward and feedback loops).
These four blocks can be considered as the building blocks of a comprehensive testing and
evaluation methodology for AHS. The input can be a hypothesis, a model, or technologies. The
evaluation and testing procedure defined by this methodology is not seen as a single feedthrough
four-step process, but as having some feedback and feedforward loops depending on the results
obtained at each block. These loops represent the changes made to the hypothesis, model or the
technological concept. Hardware tests are important since they provide the means to validate
computer results or to modify them in the case of discrepancies, due to unmodeled or
inadequately modeled dynamics. Without hardware testing, it would be foolhardy to jump into
actual implementation. For instance, the FLASH Laboratory could be used to improve the
computer simulation via scale model tests before starting the tests with full scale vehicles.
Although testing with real-sized vehicles will provide the most accurate results, it has
significant cost, safety and liability considerations. In order to test highway traffic and similar
situations, there is a need for a large number of automated and non-automated vehicles. This is
very expensive, and due to a testing situation involving humans, the costs can become prohibitive
and the situations dangerous, which could lead to insurance and safety problems. Also, full-size
testing of infrastructure based systems which require complex communication scenarios may be
expensive due to (roadside and in-vehicle) installation costs. The four-block structure is designed
to overcome these problems. A comparison of the design/development stages for different
characteristics is given in Table 2.1.
laboratory and its vehicles may serve as a benchmark through which various configurations,
architectures, and technologies can be compared. Evaluation of alternative AHS system concepts,
selection of a system approach based on analysis, test and evaluation, and demonstration of
prototype configurations on a smaller and cheaper scale are the foci of the laboratory.
Furthermore, the laboratory can be used to obtain preliminary data for the operation evaluation
(third) phase of the AHS program, and be a test bed for several issues which cannot be tested on
selected locations because of considerations of insurance and safety.
Currently, the laboratory is equipped with multiple small-scale vehicles with different types of tires and shock mechanisms. The vehicles have infrared and magnetic sensors and a wireless video camera for lateral displacement measurements, ultrasonic sensors for headway sensing, and encoders for speed measurements. Sensing and control are provided by a GCB-11 microcontroller. Wireless radio modems provide full-duplex serial communications (Figure 2.2). The remote control station includes IBM PC computers equipped with a frame grabber DSP board, a micro-kernel operating system, and a data acquisition board for lateral feedback control via image processing, as well as a steering wheel, control pedal and monitor for manual control (Figure 2.3).
Single- and double-lane small scale modular tracks with inclines and multiple radius left and right
turns are designed using EPDM rubber membrane (Figure 2.4). The laboratory also has an
experimental testing platform designed for traction control and vehicle plant modeling tests
[Kachroo96, Schlegel96].
Figure 2.4. Test tracks and the control station in the FLASH Laboratory.
Simulator: C++, CSim (Unix)
Animator: C, OpenGL, Matlab (Windows NT, Unix)
Communicator: C, X-Motif, Matlab, ArcInfo (Unix, Windows NT)
Depending upon the spatial dimensions of the problem at hand, macroscopic and
microscopic simulation models will be used. For example, DYNAVIMTS is currently capable of
studying the effects of vehicle dynamics on the stability of platoons. Incorporation of different brake technologies is also underway. DYNAVIMTS has a range of vehicle dynamics and control
models that can be used for microscopic simulation studies. On the other hand, DYNAVIMTS is
also designed to study the effects of highway automation on the overall traffic flow in the traffic
network.
The simulation software is developed using C++ with CSIM libraries [CSIM96], and Matlab
[Matlab96]. As we move toward the macroscopic models, the design of the package will be
completely object-oriented, and multi-processor machines or parallel processing with multiple
machines will become necessary. A detailed description of the DYNAVIMTS package can be
found in [Nagarajan96].
the detailed information about the control parameters and sensor data. The animation submodule
is developed in C using OpenGL libraries on an NT workstation. The simulation data is currently
transferred via a pre-formatted ASCII data file. This module also contains the graphical data for
the planned Smart Road project. A detailed description of the package, its file structure and
graphic user interface is given in [Nagarajan96].
The control structure we introduce in Chapter 4 is incorporated in the DYNAVIMTS
simulation package as part of the planning layer simulator. Since the link layer development in
DYNAVIMTS is not complete, the planning layer simulator on Matlab assumes that the link
layer data is available (if at all necessary), and the resulting data is transferred to the animator
submodule in the form of an ASCII file. At the end of each simulation run, the user is also able to
use the GUI to plot different parameters and values using Matlab. The connection between the
planning and regulation layers does not yet exist as is the case in similar simulation packages. A
snapshot of the planning layer GUI is given in Appendix C.
Figure 2.7. Components of the DYNAVIMTS simulation package (GUI, simulation, 3D animation with bird's eye and driver's views, zooming, panning, fast forward, rewind, pause and resume, and Matlab statistics plots).
Chapter 3
Stochastic Learning Automata
In classical control theory, the control of a process is based on complete knowledge of the
process/system. The mathematical model is assumed to be known, and the inputs to the process
are deterministic functions of time. Later developments in control theory considered the
uncertainties present in the system. Stochastic control theory assumes that some of the
characteristics of the uncertainties are known. However, all those assumptions on uncertainties and/or input functions may be insufficient to successfully control the system if it changes. It is
then necessary to observe the process in operation and obtain further knowledge of the system,
i.e., additional information must be acquired on-line since a priori assumptions are not sufficient.
One approach is to view these as problems in learning.
Rule-based systems, although performing well on many control problems, have the
disadvantage of requiring modifications, even for a minor change in the problem space.
Furthermore, rule-based approaches, especially expert systems, cannot handle unanticipated
situations. The idea behind designing a learning system is to guarantee robust behavior without
the complete knowledge, if any, of the system/environment to be controlled. A crucial advantage
of reinforcement learning compared to other learning approaches is that it requires no information
about the environment except for the reinforcement signal [Narendra89, Marsh93].
A reinforcement learning system is slower than other approaches for most applications
since every action needs to be tested a number of times for a satisfactory performance. Either the
learning process must be much faster than the environment changes (as is the case in our
Cem Ünsal Chapter 3. Stochastic Learning Automata 30
application), or the reinforcement learning must be combined with an adaptive forward model
that anticipates the changes in the environment [Peng93].
Learning1 is defined as any permanent change in behavior as a result of past experience,
and a learning system should therefore have the ability to improve its behavior with time, toward
a final goal. In a purely mathematical context, the goal of a learning system is the optimization of
a functional not known explicitly [Narendra74].
In the 1960s, Y. Z. Tsypkin [Tsypkin71] introduced a method to reduce the problem to
the determination of an optimal set of parameters and then apply stochastic hill climbing
techniques. M.L. Tsetlin and colleagues [Tsetlin73] started the work on learning automata during
the same period. An alternative approach to applying stochastic hill-climbing techniques,
introduced by Narendra and Viswanathan [Narendra72], is to regard the problem as one of finding
an optimal action out of a set of allowable actions and to achieve this using stochastic automata.
The difference between the two approaches is that the former updates the parameter space at
each iteration while the latter updates the probability space.
The stochastic automaton attempts a solution of the problem without any information on
the optimal action (initially, equal probabilities are attached to all the actions). One action is
selected at random, the response from the environment is observed, action probabilities are
updated based on that response, and the procedure is repeated. A stochastic automaton acting as
described to improve its performance is called a learning automaton.
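The select-observe-update cycle just described can be sketched in a few lines. The penalty probabilities below are made up, and the simple reward-only proportional update is an illustrative placeholder for the reinforcement schemes introduced in the following sections.

```python
import random

random.seed(1)
c = [0.7, 0.2, 0.5]   # hypothetical penalty probabilities, one per action
p = [1 / 3] * 3       # equal probabilities attached to all actions initially
a = 0.05              # learning parameter

for n in range(2000):
    i = random.choices(range(3), weights=p)[0]     # select an action at random
    beta = 1 if random.random() < c[i] else 0      # observe the response
    if beta == 0:                                  # update only on reward
        p = [pj + a * (1 - pj) if j == i else (1 - a) * pj
             for j, pj in enumerate(p)]

print(p)   # probability mass shifts toward actions with low penalty probability
```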
1 Webster defines learning as "to gain knowledge or understanding of a skill by study, instruction, or experience."
The most recent (and second most comprehensive) book since the Narendra-Thathachar
collaboration on learning automata theory and applications was published by Najim and Poznyak in
1994 [Najim94]. This book also includes several applications and examples of learning automata.
Until recently, the applications of learning automata to control problems were rare.
Consequently, successful updating algorithms and theorems for advanced automata applications
were not readily available. Norio Baba's work [Baba85] on learning behaviors of stochastic
automata under a nonstationary multi-teacher environment, and general linear reinforcement
schemes [Bush58] are taken as a starting point for the research presented in this dissertation.
Recent applications of learning automata to real life problems include control of
absorption columns [Najim91], bioreactors [Gilbert92], control of manufacturing plants
[Sequeria91], pattern recognition [Oommen94a], graph partitioning [Oommen94b], active vehicle
suspension [Marsh93, 95], path planning for manipulators [Naruse93], distributed fuzzy logic
processor training [Ikonen97], and path planning [Tsoularis93] and action selection [Aoki95] for
autonomous mobile robots.
Recent theoretical results on learning algorithms and techniques can be found in
[Najim91b], [Najim94], [Sastry93], [Papadimitriou94], [Rajaraman96], [Najim96], and
[Poznyak96].
take on one of the values β1 and β2. In the simplest case, the values βi are chosen to be 0 and 1, where 1 is associated with a failure/penalty response. The elements of c are defined as:

Prob{β(n) = 1 | α(n) = αi} = ci   (i = 1, 2, ..., r)   (3.1)
Therefore ci is the probability that the action i will result in a penalty input from the
environment. When the penalty probabilities ci are constant, the environment is called a
stationary environment.
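Definition (3.1) says that each action αi independently draws a penalty with fixed probability ci. A quick Monte Carlo check, with made-up ci values, shows the empirical penalty frequency matching ci in a stationary environment:

```python
import random

random.seed(0)
c = [0.8, 0.3]     # hypothetical penalty probabilities c_i
trials = 10000

freq = []
for ci in c:
    # count penalty responses (beta = 1) for this action over many trials
    penalties = sum(1 for _ in range(trials) if random.random() < ci)
    freq.append(penalties / trials)

print(freq)   # empirical frequencies close to [0.8, 0.3]
```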
There are several models defined by the response set of the environment. Models in
which the input from the environment can take only one of two values, 0 or 1, are referred to as
P-models3. In this simplest case, the response value of 1 corresponds to an unfavorable (failure, penalty) response, while an output of 0 means the action is favorable. A further generalization of the environment allows finite response sets with more than two elements that may take a finite number of values in an interval [a, b]. Such models are called Q-models. When the input from the environment is a continuous random variable with possible values in an interval [a, b], the model is called an S-model.
[Figure: The automaton (state φ, transition function F, output function G) in a feedback loop with the environment (penalty probabilities c); the automaton's output is the environment's input, and the environment's response is the automaton's input.]
3 We will follow the notation used in [Narendra89].
• F(φ, β): a function that maps the current state and input into the next state. F can be deterministic or stochastic:

φ(n + 1) = F[φ(n), β(n)]   (3.3)

• H(φ, β): a function that maps the current state and input into the current output. If the current output depends only on the current state, the automaton is referred to as a state-output automaton. In this case, the function H(φ, β) is replaced by an output function G(φ), which can be either deterministic or stochastic:

α(n) = G[φ(n)]   (3.4)
For our applications, we choose the function G(φ) as the identity function, i.e., the states of the automata are also the actions.
fij^β = Pr{φ(n + 1) = φj | φ(n) = φi, β(n) = β}   (i, j = 1, 2, ..., s;  β = β1, β2, ..., βm)   (3.5)

Since the fij^β are probabilities, they lie in the closed interval [a, b]5; and to conserve probability measure we must have:

Σ_{j=1}^{s} fij^β = 1 for each β and i.   (3.7)
Example
States: φ1, φ2   Inputs: β1, β2
Transition matrices:

F(β1) = | 0.50  0.50 |
        | 0.25  0.75 |

F(β2) = | 0.10  0.90 |
        | 0.75  0.25 |
4 In Chapter 4, where we define our initial algorithm, the probabilities fij will be the same for all inputs β and all states φi given a state φj. Furthermore, the output mapping G is chosen as the identity mapping.
5 [a, b] = [0, 1] in most cases.
Transition graphs:
Under input β1: φ1 → φ1 with probability 0.50, φ1 → φ2 with 0.50; φ2 → φ1 with 0.25, φ2 → φ2 with 0.75.
Under input β2: φ1 → φ1 with probability 0.10, φ1 → φ2 with 0.90; φ2 → φ1 with 0.75, φ2 → φ2 with 0.25.
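The example's transition matrices can be exercised directly. The sketch below simulates the fixed-structure automaton under a constant input; state indices 0 and 1 stand for φ1 and φ2, and the long-run state frequencies approach the stationary distribution of the corresponding Markov chain.

```python
import random

random.seed(0)
# Transition matrices from the example; rows are the current state,
# columns the next state.
F = {
    "beta1": [[0.50, 0.50], [0.25, 0.75]],
    "beta2": [[0.10, 0.90], [0.75, 0.25]],
}

def step(state, beta):
    """Draw the next state according to F(beta)."""
    return random.choices([0, 1], weights=F[beta][state])[0]

# Long-run state frequencies under the constant input beta1.
state, visits = 0, [0, 0]
for _ in range(100000):
    state = step(state, "beta1")
    visits[state] += 1
print([v / 100000 for v in visits])   # approaches the stationary distribution [1/3, 2/3]
```

Solving π = πF(β1) by hand gives π = [1/3, 2/3], which the simulated frequencies confirm.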
In the previous example, the conditional probabilities fij were assumed to be constant, i.e., independent of the time step n and of the input sequence. Such a stochastic automaton is referred to as a fixed-structure automaton. As we will discuss later, it is useful to update the transition probabilities at each step n on the basis of the environment response at that instant. Such an automaton is called a variable-structure automaton.
Furthermore, in the case of a variable-structure automaton, the above definitions of the transition functions F and G are not used explicitly. Instead of transition matrices, a vector of action probabilities p(n) is defined to describe the reinforcement schemes we introduce in the next sections.
If the variable-structure automaton is a state-output automaton, and the probabilities fij of transition from one state to another do not depend on the current state or the environment input, then the relation between the action probability vector and the transition matrices is as follows:
Since the automaton is a state-output automaton, we omit the transition matrix G, and
only consider the transition matrix F.
The transition does not depend on the environment response, therefore there is only
one state transition matrix F where the elements are fij. For example, we must have:
F(β1) = F(β2) ≡ F = | f11  f12 |   (3.8)
                    | f21  f22 |
Furthermore, the probability of being in one state (or generating one action) does not
depend on the initial/previous state of the automaton. Therefore, the transition matrix reduces to:
F = [ f11 = f21 ≡ f1    f12 = f22 ≡ f2 ] ≡ p   (3.9)

where the vector p consists of the probabilities pi of the automaton being in the i-th state (or choosing the i-th output/action).
To judge the performance of the automaton, we need to set up quantitative norms of behavior.
The quantitative basis for assessing the learning behavior is quite complex, even in the simplest
P-model and stationary random environments. To introduce the definitions for norms of
behavior, we will consider this simplest case. Further definitions for other models and non-
stationary environments will be given whenever necessary.
The average penalty for a given action probability vector p(n) is defined as:

M(n) = Σ_{i=1}^{r} ci pi(n)

Optimality implies that the action αl associated with the minimum penalty probability cl is chosen asymptotically with probability one. In spite of the efforts of many researchers, a general algorithm which ensures optimality has not been found [Baba83, Kushner72, Narendra89]. It may not be possible to achieve optimality in every given situation. In this case, a suboptimal behavior is defined.
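These norms can be made concrete with a small numerical sketch. For hypothetical penalty probabilities, the pure-chance automaton's average penalty M0 and the average penalty of an automaton that always picks the best action bracket the achievable behavior:

```python
c = [0.7, 0.2, 0.5]   # hypothetical penalty probabilities
r = len(c)

def M(p):
    """Average penalty M(n) = sum_i c_i * p_i(n)."""
    return sum(ci * pi for ci, pi in zip(c, p))

M0 = M([1 / r] * r)   # pure-chance automaton: all actions equally likely
print(M0)             # (0.7 + 0.2 + 0.5) / 3, approximately 0.4667
print(M([0, 1, 0]))   # optimal behavior: equals min(c) = 0.2
```

A learning automaton is doing useful work whenever its asymptotic average penalty falls below M0.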
Some automata satisfy the conditions stated in the definitions for specified initial
conditions and for certain sets of penalty probabilities. However, we may be interested in
automata that exhibit a desired behavior in arbitrary environments and for arbitrary initial
conditions. These requirements are partially met by an absolutely expedient automaton: an automaton is said to be absolutely expedient if

E[M(n + 1) | p(n)] < M(n)

for all n, all pi(n) in the open interval (0, 1), and all possible values of the penalty probabilities ci6.
6 We will discuss the feasibility of the case later.
corresponds to one action, i.e., the automaton is a state-output automaton. The simulation
examples in later chapters also use the same identity state-output mapping.
In general terms, a reinforcement scheme can be represented as follows:
p(n + 1) = T1[p(n), α(n), β(n)]   (3.19a)

or

fij(n + 1) = T2[fij(n), β(n), φ(n + 1), φ(n)]   (3.19b)

where T1 and T2 are mappings. Again, α is the action, β is the input from the environment, and φ is the state of the automaton. From now on, we will use the first mathematical description (3.19a) for reinforcement schemes. If p(n+1) is a linear function of p(n), the reinforcement scheme is said to be linear; otherwise it is termed nonlinear.
If α(n) = αi,

when β = 0:
  pj(n + 1) = pj(n) - gj(p(n))   for all j ≠ i
  pi(n + 1) = pi(n) + Σ_{k≠i} gk(p(n))   (3.20)

when β = 1:
  pj(n + 1) = pj(n) + hj(p(n))   for all j ≠ i
  pi(n + 1) = pi(n) - Σ_{k≠i} hk(p(n))

where gk and hk (k = 1, 2, ..., r) are continuous, nonnegative functions with the following assumptions:

  0 < gk(p(n)) < pk(n)
  0 < Σ_{k≠i} [pk(n) + hk(p(n))] < 1   (3.21)

for all i = 1, 2, ..., r and all probabilities pk in the open interval (0, 1). The two constraints on the update functions guarantee that the sum of all action probabilities is 1 at every time step.
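As a sanity check, the general update (3.20) can be coded with particular choices of gk and hk. The choices below, gk(p) = a·pk and hk(p) = b·(1/(r-1) - pk), are illustrative values that reduce the scheme to the linear update of the next section; the assertion confirms that probability measure is conserved at every step.

```python
def general_update(p, i, beta, g, h):
    """One step of scheme (3.20): on reward (beta = 0) the chosen action's
    probability grows at the expense of the others; on penalty it shrinks."""
    r = len(p)
    if beta == 0:
        return [p[j] + sum(g(p, k) for k in range(r) if k != i)
                if j == i else p[j] - g(p, j) for j in range(r)]
    return [p[j] - sum(h(p, k) for k in range(r) if k != i)
            if j == i else p[j] + h(p, j) for j in range(r)]

# g_k(p) = a*p_k and h_k(p) = b*(1/(r-1) - p_k) recover the linear
# reward-penalty scheme of the next section (a, b chosen arbitrarily here).
a = b = 0.1
g = lambda p, k: a * p[k]
h = lambda p, k: b * (1 / (len(p) - 1) - p[k])

p = [0.40, 0.35, 0.25]
for beta in (0, 1, 0, 0, 1):
    p = general_update(p, 0, beta, g, h)
    assert abs(sum(p) - 1.0) < 1e-12   # probability measure is conserved
print(p)
```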
examples in the next section). It is also possible to divide all possible algorithms for stochastic
learning automata into the following two classes [Najim94]:
nonprojectional algorithms, where individual probabilities are updated based on their previous
values:
p_i(n+1) = p_i(n) + γ k_i( p_i(n), α(n), β(n) )   (3.22)
where γ is a small real number and the k_i are the mapping functions for the update term.
projectional algorithms, where the probabilities are updated by a function which maps the
probability vector to a specific value for each element:
p_i(n+1) = k_i( p(n), α(n), β(n) )   (3.23)
where the functions k_i are the mapping functions.
The main difference between these two subclasses is that nonprojectional algorithms can
only be used with a binary environment response (i.e., in a P-model environment). Projectional
algorithms may be used in all environment types, but are more complex; furthermore, their
implementation is computationally more taxing.
Early studies of reinforcement schemes were centered around linear schemes for reasons
of analytical simplicity. The need for more complex and efficient reinforcement schemes
eventually led researchers to nonlinear (and hybrid) schemes. We will first introduce several
well-known linear schemes, and then close this section with absolutely expedient nonlinear
schemes.
If α(n) = α_i,

when β(n) = 0:
   p_j(n+1) = (1 − a) p_j(n)   for all j ≠ i
   p_i(n+1) = p_i(n) + a [1 − p_i(n)]
                                                (3.25)
when β(n) = 1:
   p_j(n+1) = b/(r−1) + (1 − b) p_j(n)   for all j ≠ i
   p_i(n+1) = (1 − b) p_i(n)
As seen from the definition, the parameter a is associated with reward response, and the
parameter b with penalty response. If the learning parameters a and b are equal, the scheme is
called the linear reward-penalty scheme LR-P [Bush58]. In this case, the update rate of the
probability vector is the same at every time step, regardless of the environment response. This
scheme is the earliest scheme considered in mathematical psychology.
For the linear reward-penalty scheme LR-P, E[p(n)], the expected value of the probability
vector at time step n, can be easily evaluated, and, by analyzing the eigenvalues of the resulting
difference equation, it can be shown that the asymptotic solution of the set of difference equations
enables us to conclude [Narendra89]:

lim_{n→∞} E[M(n)] = r / ( Σ_{k=1}^{r} 1/c_k ) < (1/r) Σ_{k=1}^{r} c_k = M_o   (3.26)
Therefore, from Definition 3.1, the multi-action automaton using the LR-P scheme is expedient for
all initial action probabilities and in all stationary random environments.
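Equation 3.26 states that the asymptotic average penalty of LR-P is the harmonic mean of the penalty probabilities, which lies below the pure-chance (arithmetic) mean M_o whenever the c_k differ. A small sketch of this comparison (the function names are ours):

```python
def lrp_limit_penalty(c):
    """Asymptotic expected penalty of L_R-P (Eq. 3.26): the harmonic
    mean of the penalty probabilities c_k."""
    return len(c) / sum(1.0 / ck for ck in c)

def pure_chance_penalty(c):
    """M_o: average penalty of a pure-chance automaton (arithmetic mean)."""
    return sum(c) / len(c)
```

For example, with c = [0.2, 0.5, 0.8] the limit is about 0.364 versus M_o = 0.5: the scheme is expedient, but not optimal, since the best single action has penalty probability 0.2.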
Expediency is a relatively weak condition on the learning behavior of a variable-structure
automaton. An expedient automaton will do better than a pure chance automaton, but it is not
guaranteed to reach the optimal solution. In order to obtain a better learning mechanism, the
parameters of the linear reinforcement scheme are changed as follows: if the learning parameter b
is set to 0, then the scheme is named the linear reward-inaction scheme LR-I . This means that the
action probabilities are updated in the case of a reward response from the environment, but no
penalties are assessed.
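The LR-I update is simply the b = 0 special case of Equation 3.25. The following sketch simulates it in a toy stationary P-model environment; the penalty probabilities, seed, and parameter values are hypothetical, chosen only for illustration.

```python
import random

def lri_step(p, i, beta, a):
    """Linear reward-inaction: the vector changes only on a reward (beta = 0)."""
    if beta != 0:
        return list(p)
    return [p[i] + a * (1 - p[i]) if j == i else (1 - a) * p[j]
            for j in range(len(p))]

# Toy stationary environment: action 1 has the lowest penalty probability.
c = [0.7, 0.2, 0.6]
random.seed(1)
p = [1 / 3] * 3
for _ in range(5000):
    i = random.choices(range(3), weights=p)[0]   # draw an action from p
    beta = 1 if random.random() < c[i] else 0    # environment response
    p = lri_step(p, i, beta, a=0.01)
```

In typical runs the probability of action 1 grows toward 1; the smaller a is, the closer the probability of converging to the best action gets to unity, at the cost of slower learning.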
For this scheme, it is possible to show that p_l(n), the probability of the action l with
minimum penalty probability c_l, monotonically approaches 1. By making the parameter a
arbitrarily small, we have Pr{ lim_{n→∞} p_l(n) = 1 } as close to unity as desired. This makes the learning
automaton ε-optimal in all stationary random environments. Nonlinear reinforcement schemes, in
contrast, generally hold only under additional conditions on the
penalty probabilities. For example, the following update functions are defined by previous
researchers for nonlinear reinforcement schemes⁷:
g_j(p(n)) = h_j(p(n)) = ( a/(r−1) ) p_i(n) (1 − p_i(n))
This scheme is absolutely expedient for restricted initial conditions and for restricted
environments [Shapiro69].
g_j(p(n)) = p_j(n) φ(p(n))

h_j(p(n)) = p_i(n) φ(p_i(n)) / (r − 1)

This scheme is absolutely expedient provided that certain conditions on the penalty
probabilities are satisfied (e.g., c_l < 1/m and c_j > 1/m for j ≠ l) [Chand68].
g_j(p(n)) = a Ψ( p_1(n), 1 − p_1(n) ) p_j(n) (1 − p_j(n))

h_j(p(n)) = b Ψ( p_1(n), 1 − p_1(n) ) p_j(n) (1 − p_j(n))

where Ψ(p_1, 1 − p_1) = Ψ(1 − p_1, p_1) is a nonlinear function which can be suitably selected, and
the parameters a and b must be chosen properly to satisfy the conditions in Equation
3.21. Sufficient conditions for ensuring optimality are given in [Vor65], but they are valid only
for two-action automata.
The general solution for absolutely expedient schemes was found in the early 1970s by
Lakshmivarahan and Thathachar [Lakshmivarahan73]. Absolutely expedient learning schemes are
presently the only class of schemes for which necessary and sufficient conditions of design are
available. This class of schemes can be considered as the generalization of the LR-I scheme.
Consider the general reinforcement scheme in Equation 3.20. A learning automaton using this
scheme is absolutely expedient if and only if the functions gk (p) and hk (p) satisfy the following
conditions:
g_1(p)/p_1 = g_2(p)/p_2 = ... = g_r(p)/p_r = λ(p)
                                                (3.27)
h_1(p)/p_1 = h_2(p)/p_2 = ... = h_r(p)/p_r = μ(p)
where λ(p) and μ(p) are arbitrary functions satisfying⁸:
⁷ Again, α(n) = α_i is assumed. We give only the updating functions for action probabilities other than the current
action; the function for the current action can be obtained by using the fact that the sum of probabilities is equal to 1.
⁸ The reason for the conditions on the update functions will be explained in detail in Chapter 6.
0 < λ(p) < 1
                                                (3.28)
0 < μ(p) < min_{j=1,...,r} [ p_j / (1 − p_j) ]
Detailed proof of this theorem is given in both [Baba84] and [Narendra89]. A similar proof (for a
new nonlinear absolutely expedient scheme) is given in Chapter 6.
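A convenient way to build a scheme satisfying Equation 3.27 is to set g_k(p) = λ(p) p_k and h_k(p) = μ(p) p_k, which makes the ratios in Equation 3.27 hold identically. The sketch below is ours (function names and the particular λ, μ choices are assumptions); the example μ respects the bound in Equation 3.28.

```python
def absolutely_expedient_step(p, i, beta, lam, mu):
    """General absolutely expedient update with g_k(p) = lam(p) * p_k and
    h_k(p) = mu(p) * p_k (Eq. 3.27); beta = 0 is reward, 1 is penalty."""
    r = len(p)
    q = list(p)
    if beta == 0:
        for j in range(r):
            if j != i:
                q[j] = p[j] - lam(p) * p[j]
        q[i] = p[i] + lam(p) * (1 - p[i])       # mass conserved
    else:
        for j in range(r):
            if j != i:
                q[j] = p[j] + mu(p) * p[j]
        q[i] = p[i] - mu(p) * (1 - p[i])        # mass conserved
    return q

# Example choices satisfying Eq. 3.28:
lam = lambda p: 0.1                                   # constant, 0 < lam < 1
mu = lambda p: 0.5 * min(pj / (1 - pj) for pj in p)   # stays below the minimum
```

Note that with a constant λ(p) = a and μ = 0 this reduces to the LR-I update, consistent with viewing absolutely expedient schemes as a generalization of LR-I.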
where:

s_k = E[ β(n) | α(n) = α_k ]   (3.31)

⁹ Note that this normalization procedure requires the knowledge of a and b.
In the S-model, the s_k play the same role as the penalty probabilities in the P-model, and are
called penalty strengths. Again, for a pure-chance automaton with all the action probabilities
equal, the average penalty M_o is given by:

M_o = (1/r) Σ_{i=1}^{r} s_i   (3.32)
The definitions of expediency, optimality, ε-optimality, and absolute expediency are the
same, except that the c_i's are replaced by the s_i's. Now, consider the general reinforcement
scheme (Equation 3.20). The direct generalization of this scheme to the S-model gives:
p_i(n+1) = p_i(n) − (1 − β(n)) g_i(p(n)) + β(n) h_i(p(n))                    if α(n) ≠ α_i
                                                                              (3.33)
p_i(n+1) = p_i(n) + (1 − β(n)) Σ_{j≠i} g_j(p(n)) − β(n) Σ_{j≠i} h_j(p(n))    if α(n) = α_i
The previous algorithm (Equation 3.20) can easily be obtained from the above definition
by substituting β(n) = 0 and 1. Again, the functions g_i and h_i can be associated with reward and
penalty, respectively. The value 1 − β(n) indicates how far β(n) is from the
maximum possible value of unity, and is largest when β(n) is 0. The functions g_i and h_i have to
satisfy the conditions in Equation 3.21 for p_i(n) to remain in the interval [0,1] at each step. The
non-negativity condition on g_i and h_i maintains the reward-penalty character of these functions,
though it is not strictly necessary.
For example, the LR-P reinforcement scheme (Equation 3.25) can be rewritten for the S-model
(SLR-P) as:

p_i(n+1) = p_i(n) − [1 − β(n)] a p_i(n) + β(n) [ a/(r−1) − a p_i(n) ]     if α(n) ≠ α_i
                                                                           (3.35)
p_i(n+1) = p_i(n) + [1 − β(n)] a [1 − p_i(n)] − β(n) a p_i(n)             if α(n) = α_i
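A direct transcription of the SLR-P update in Equation 3.35 follows; the function name is our choice, and beta is the continuous S-model response in [0, 1].

```python
def slrp_step(p, chosen, beta, a):
    """S-model linear reward-penalty (Eq. 3.35): beta = 0 is full reward,
    beta = 1 is full penalty; intermediate values blend the two updates."""
    r = len(p)
    q = []
    for i in range(r):
        if i == chosen:
            q.append(p[i] + (1 - beta) * a * (1 - p[i]) - beta * a * p[i])
        else:
            q.append(p[i] - (1 - beta) * a * p[i]
                     + beta * (a / (r - 1) - a * p[i]))
    return q
```

Setting beta to 0 or 1 reproduces the P-model LR-P updates of Equation 3.25 with a = b, and the probability vector remains normalized for any beta in [0, 1].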
In a nonstationary environment, the characteristics of the environment change with time. The performance of a learning automaton should be judged in such a context. If a
learning automaton with a fixed strategy is used in a nonstationary environment, it may become
less expedient, and even non-expedient. The learning scheme must have sufficient flexibility to
track the better actions. The aim in these cases is not to evolve to a single action which is
optimal, but to choose the actions to minimize the expected penalty. As in adaptive control
theory, the practical justification for the study of stationary environments is based on the
assumption that if the convergence of a scheme is sufficiently fast, then acceptable performance
can be achieved in slowly changing environments.
An environment is nonstationary if the penalty probabilities ci (or si) vary with time. The
success of the learning procedure then depends on the environment changes as well as the prior
information we have about the environment. From an analytic standpoint, the simplest case is the
one in which penalty probability vector c(n) can have a finite number of values. We can visualize
this situation as an automaton operating in a finite set of stationary environments, and assign a
different automaton Ai to each environment Ei. If each automaton uses an absolutely expedient
scheme, then -optimality can be achieved asymptotically for the overall system. However, this
method requires that the system be aware of the environmental changes in order to assign the
correct automaton, and the number of environments must not be large. Furthermore, since the
automaton Ai is updated only when the environment is Ei, updates may occur so
infrequently that the corresponding action probabilities do not converge [Narendra89]. In our
application, a similar approach is employed as explained in Chapter 4.
There are two classes of environments that have been analytically investigated:
Markovian switching environments, and state-dependent nonstationary environments. If we
assume that the environments E i are states of a Markov chain described by a state transition
matrix, then it is possible to view the variable-structure automaton in the Markovian switching
environment as a homogeneous Markov chain. Therefore, analytical investigations are possible.
Secondly, there are four different situations wherein the states of the nonstationary environment
vary with the step n, and the analytical tools developed earlier are adequate to obtain a fairly
complete description of the learning process. These are:
The fundamental concepts such as expediency, optimality, and absolute expediency have
to be re-examined for nonstationary environments, since, for example, the optimal action can
change with time. Here, we again consider the P-model for purposes of stating the definitions.
The average penalty M(n) becomes:
M(n) = Σ_{i=1}^{r} c_i(n) p_i(n)   (3.37)
c_ij = Pr{ β_i(n) = 1 | α(n) = α_j }. The action set of the automaton is of course the same for all
teachers/environments. Baba discussed the problem of a variable-structure automaton operating
in a multi-teacher environment [Baba85]. Conditions for absolute expediency are given in his
works (see also Chapter 6 for new nonlinear reinforcement schemes).
Some difficulty may arise while formulating a mathematical model of the learning
automaton in a multi-teacher environment. Since we have multiple responses, the task of
interpreting the output vector is important. Are the outputs from different teachers to be
summed after normalization? Can we introduce weight factors associated with specific teachers?
If the environment is of a P-model, how should we combine the outputs? In the simplest case,
where all teachers agree on the ordering of the actions, i.e., one action is optimal for all
environments, the updating schemes are easily defined. However, this will not be the case in our
application.
The elements of the environment output vector must be combined in some fashion to
form the input to the automaton. One possibility is to use an AND-gate as described in
[Thathachar77]. Another method is taking the average, or a weighted average, of all the responses
[Narendra89, Baba85]. This is valid for Q- or S-model environments. Our application uses logic
gates (OR) and additional if-then conditions as described in Chapters 4 and 5.
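The two simple combination rules mentioned above can be sketched as follows. The function names are ours, and the AND-gate reading (penalize only when every teacher penalizes) is one plausible interpretation of the cited approach, with 1 denoting a penalty.

```python
def and_gate(outputs):
    """P-model combination: penalize (1) only if all teachers penalize."""
    return 1 if all(outputs) else 0

def weighted_average(outputs, weights):
    """Q/S-model combination: weighted average of teacher responses in [0, 1]."""
    return sum(w * o for w, o in zip(weights, outputs)) / sum(weights)
```

The AND-gate is the most forgiving binary combination; an OR-gate, used later in this work, is the most conservative, penalizing an action as soon as any one teacher objects.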
As stated in Section 2.2, vehicle control is one of the most important issues in ITS, and especially
in AHS. Designing a system that can safely control a vehicle's actions while contributing to the
optimal solution of the congestion problem is difficult. Besides the control problems at the
regulation layer, there is a need for extensive research on the planning layer of the control
structure described by Varaiya [Varaiya93], or at the tactical level of the driving task
[Sukthankar96c]. When the design of a vehicle capable of carrying out tasks such as vehicle
following at high speeds, automatic lane tracking, and lane changing is complete, we must also
have a control/decision structure that can intelligently make decisions in order to operate the
vehicle in a safe way.
The task of creating intelligent systems that we can rely on naturally brings the idea of
artificial intelligence to mind. The ITS community is well aware of the fact that the implementation
of such a complex (and maybe global) system requires investigation of several different methods
and the simultaneous applications of many. Several emerging methods, such as cellular automata,
self-organization, neural networks, fuzzy logic, and hybrid systems applications are being
mentioned [TRB95, Godbole95, Ho96, Deshpande96]. Initial research on intelligent vehicle
control indicates that a planning-regulation system that can guarantee optimal operation with a
sound theoretical background has not yet been developed, and it may be vital to the
implementation of an automated highway system [Varaiya93, Lasky93].
In this chapter, we introduce a decision/control method for intelligent vehicles.
Considering the complexity of an automated highway system or an intelligent vehicle-highway
system, classical control methods are found to be insufficient to provide a fully automated,
collision-free environment [Varaiya93]. Although we may not solve the whole problem using a
single method, we attempt to find a way to make intelligent decisions here. Our approach to the
problem of vehicle control makes use of the Learning Automata techniques described in Chapter
3. The learning algorithms used in this application are introduced and discussed separately in
Chapter 6 for reasons of clarity. We visualize the planning layer (See Section 2.3) of an intelligent
vehicle as an automaton (or automata group) in a nonstationary environment¹. The aim here is to
design an automata system that can learn the best possible action (or action pairs: one for lateral,
one for longitudinal) based on the data received from on-board sensors, and possibly some form
of vehicle-to-vehicle and roadside-to-vehicle communications.

¹ The interpretation of the term environment is twofold, as we explain later.

Cem Ünsal Chapter 4. Application of Learning Automata to Intelligent Vehicle Control 47
The significance of this system is that the controller we define will be useful as a backup
or primary system for controlling the path of a vehicle in the case of communication loss
with the higher layers of a full AHS hierarchy, as well as during the transition from fully
automated to manual control. Since the implementation of a fully automated highway system is
not imminent, we concentrate on autonomous vehicles throughout this work.
An autonomous vehicle must be able to sense the environment around itself. In the
simplest case, it is to be equipped with at least one sensor looking in the direction of possible
vehicle moves. Furthermore, an autonomous vehicle must also have the knowledge of the rate of
its own displacement. Therefore, we assume that there are four different sensors on board the
vehicle. These are the headway sensor, two side sensors, and a speed sensor. The headway
sensor is a distance measuring device which returns the headway distance to the object in front of
the vehicle. An implementation of such a device is a laser radar. Side sensors are assumed to be
able to detect the presence of a vehicle traveling in the immediately adjacent lane. Their outputs
are binary. Infrared or sonar detectors are currently used for this type of sensor. The speed sensor
is simply an encoder returning the current wheel speed of the vehicle.
Each sensor is connected to its associated module. Sensor modules evaluate the sensor
signals in the light of the current automata actions, and send a response to the automata (Figure
4.1). We visualize each sensor module as a teacher in a nonstationary automata environment (or a
multi-environment system). The detailed descriptions of these sensor modules are given in the
next section.
Our basic model for planning and coordination of lane changing and speed control is
shown in Figure 4.1. The response of the environment is a combination of the outputs of all four
teacher blocks. The details of the mapping F are given in Section 4.4. The mapping F from sensor
module outputs to the input of the automata can be a binary function (for a P-model
environment), a linear combination of four teacher outputs, or a more complex function as is
the case for this application. An alternative and possibly more ideal model would use a linear
combination of teacher outputs with adjustable weight factors (e.g., S-model environment).
Figure 4.1. The basic planning/coordination model: sensor outputs feed the planning layer; the chosen actions (e.g., A1 A2 A2 A3 A1 A2) are buffered, and an action is passed on for execution only if the buffered actions are the same.
When an action is carried out, the action probabilities in the controlling automaton are
initialized to 1/r, where r is the number of actions. Although not necessary, this initialization
enables the learning automaton to adapt to a new situation faster. One of the reinforcement
schemes described in Chapter 6 is designed to speed up the adaptation process when the
probability of the best/desired action is close to zero. When this new nonlinear scheme and/or very
fast update rates are used, probability vector initialization is not necessary. When an action is
executed, the memory vectors are initialized also; all locations are filled with the idle actions SiL
or SM. A minimum processing speed of 25 iterations per second and a maximum of 200 are
assumed. These values relate only to the computations; the sensor data feeds may have a
different (and constant) rate. The upper limit of 200Hz is due to the communication requirements
considered in the current AHS research [Lasky93].
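The memory (regulation) vector described above can be sketched as a fixed-size buffer that releases an action only when its contents agree. This is our reading of the "IF ACTIONS ARE THE SAME" rule in Figure 4.1; the class name and the exact agreement test are assumptions.

```python
from collections import deque

class RegulationBuffer:
    """Sketch of the memory vector: initialized with the idle action
    (SiL or SM); a buffered action is executed only when the whole
    buffer agrees, otherwise the idle action is kept."""

    def __init__(self, size, idle):
        self.idle = idle
        self.buf = deque([idle] * size, maxlen=size)

    def push(self, action):
        self.buf.append(action)
        first = self.buf[0]
        if all(a == first for a in self.buf):
            return first          # consistent decision: execute it
        return self.idle          # otherwise stay idle
```

With a buffer size equal to the iteration rate (e.g., 25 entries at 25 iterations per second), an executed action can change at most about once per second, matching the regulation behavior described for the simulations.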
The discussion of nonstationary environments in Chapter 3 is based on the changing
penalty probabilities of actions. In this application, the penalty responses of the learning
automata are functions of the status of the physical environment (e.g., a decreasing headway
distance will result in a penalty response from the front sensor module if the chosen action is
ACC). The realization of a deterministic mathematical model of this physical environment may be
impossible, if not extremely difficult. Therefore, simulation is the only way of demonstrating the
expediency and/or absolute optimality for a nonstationary automata environment resulting from a
changing physical environment.
As seen in Table 4.2, a penalty response from the left sensor module is received only when
the action is SL and there is a vehicle in the left sensor's range, or the vehicle is already traveling
in the leftmost lane. Similarly, the action SR is penalized if there is a vehicle in the right
lane. All other situations result in a reward response from this sensor module. The longitudinal
automaton does not use the side sensors for reinforcement. Initial simulations define the range of the
side sensors using two parameters sr1 and sr2 as shown in Figure 4.3. It is assumed that the side
sensors' range covers only the adjacent lane, and the sensor is mounted at the middle of the
vehicle's side.
The outputs of the side sensors are assumed to be binary, indicating the existence of a
vehicle in the adjacent lane. However, the side sensor module may also use the distance of the
detected vehicle from the sensor source for more intelligent decisions. This type of design may
result in a more expensive implementation, as it requires distance measurement and possibly
additional sensors.
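As a sketch, the left sensor teacher described above reduces to a few lines; the function name, argument names, and action strings are our notational choices (Table 4.2 itself is not reproduced here).

```python
def left_sensor_module(action, vehicle_on_left, in_leftmost_lane):
    """Left sensor teacher: penalize (1) a shift-to-left-lane decision when
    the left lane is occupied or does not exist; reward (0) otherwise."""
    if action == "SL" and (vehicle_on_left or in_leftmost_lane):
        return 1
    return 0
```

The right sensor module is symmetric, penalizing SR when the right lane is occupied or the vehicle is already in the rightmost lane.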
Figure 4.3. Side sensor ranges sr1 and sr2, and front sensor range fsr, for an autonomous vehicle V1 approaching a slower vehicle V2 (V1 > V2), with V3 in the adjacent lane.
² The actions are shift-to-left-lane (SL), shift-to-right-lane (SR), stay-in-lane (SiL), accelerate (ACC), decelerate
(DEC) and keep-same-speed (SM).
Assuming that the front sensor is capable of detecting the headway distance, we define the
headway module as shown in Table 4.3. If the sensor sees a vehicle at a close distance (< fsr), a
penalty response is sent to the automaton for actions stay-in-lane, accelerate, and keep-same-
speed. All other actions (shifting lane or decelerating) may serve to avoid a collision, and
therefore, are encouraged. Then, an implemented headway module's task is simply to compare the
time-of-flight of the echo to the predefined time interval corresponding to distance fsr. Also note
that although we show the limiting value fsr as the sensor range in Figure 4.3, the actual sensor
range and this value do not have to be equal. The output of this module affects both automata as
seen in Table 4.3.
The speed sensor of the autonomous vehicle is assumed to be an encoder connected to the
wheel shaft. The output of the encoder can be used to detect the wheel speed, which is
approximately equal to the vehicle speed (or equal when cruising). The speed module's task is simply
to compare the actual speed to the desired speed and, based on the action chosen, send a feedback
signal to the longitudinal automaton. The responses depending on the speed and the longitudinal actions
are given in Table 4.4.
                     Sensor Status
Actions   Vehicle in range       No vehicle in range
          (headway < fsr)        (headway ≥ fsr)
SiL              1                       0
SL               0                       0
SR               0                       0
SM               1                       0
ACC              1                       0
DEC              0*                      0

Table 4.3. Output of the Front Sensor block.
                       Sensor Status
Actions   dev < -pdif   -pdif < dev < pdif   dev > pdif
SiL            #                #                 #
SL             #                #                 #
SR             #                #                 #
SM             1                0                 1
ACC            0                0                 1
DEC            1                0                 0

dev: actual speed - desired speed
pdif: permitted difference between actual and desired speeds

Table 4.4. Output of the Speed Sensor Module.
When the actual vehicle speed differs from the desired speed by a large amount, then only
the action decreasing the speed deviation receives a reward; others are penalized. This forces the
vehicle to slow down or speed up in order to match the desired speed. The value of the parameter
pdif (permitted difference between actual and desired speeds) is predefined.
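Table 4.4 translates directly into a small decision function; the names below are ours.

```python
def speed_module(action, actual, desired, pdif):
    """Speed sensor teacher (Table 4.4): dev = actual - desired speed,
    pdif = permitted deviation; returns 1 (penalty) or 0 (reward).
    Lateral actions are not judged by this module."""
    dev = actual - desired
    if action == "SM":
        return 1 if abs(dev) > pdif else 0      # penalize keeping a bad speed
    if action == "ACC":
        return 1 if dev > pdif else 0           # penalize speeding up when fast
    if action == "DEC":
        return 1 if dev < -pdif else 0          # penalize slowing when slow
    return 0                                    # SiL, SL, SR: output unused
```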
The values of the sensor limits sr1, sr2, fsr, and pdif define the capabilities of the sensors
(in the case of the front and side sensors) as well as the behavior of the vehicle. Sensitivity to the
headway distance and/or to speed fluctuations is given by these predefined parameters, which are key
to the behavior of the autonomous vehicle. For example, a vehicle with a shorter side sensor range
will tend to jump in front of vehicles in the next lane more often than a vehicle equipped with
more capable sensors.
Now that we have defined the sensor module outputs, the problem is to intelligently
employ these signals for automata reinforcement. It is possible to treat all sensor modules as
separate teachers with conflicting feedback responses for the automata. For example, consider the
situation given in Figure 4.3. The longitudinal action will receive a penalty from the front sensor
module due to the existence of a vehicle in front. If the actual speed of this vehicle is less than the
desired speed, the speed module will try to force the vehicle to increase its speed. These two
outputs conflict, and it is obvious that one must have priority over the other: no matter what its
current speed is, a vehicle must slow down in order to avoid a collision if it senses another vehicle
occupying the immediate space in front of itself. The next section describes our approach to the
problem.
                          Modules
Actions   Headway   Left   Right   Speed
SiL          ?       0       0       -
SL           0       ?       0       -
SR           0       0       ?       -
SM           ?       -       -       ?
ACC          ?       -       -       ?
DEC       0 or 0*    -       -       ?

?  means reward (0) or penalty (1) response is possible.
-  means that the output is not used; can be treated as 0.
0* has a higher priority (see discussion below).

Table 4.5. Action - Sensor Module matrix.
The probability of the lateral action SiL depends mainly on the output of the headway
sensor. When this action is chosen, the probability is decreased when the headway module
indicates a vehicle in the front sensor range. The side sensors always send a reward response to this
action no matter what the situation is. Therefore, we can obtain a combined environment response
for this action by simply OR-ing the binary module outputs.
The same method is also valid for the two other lateral actions. Probabilities of actions SL
and SR depend on the output of the left and right sensor modules, respectively. A decision to shift
lane is penalized if the associated sensor indicates the presence of a vehicle in the adjacent lane.
Therefore, by using a simple OR function, we can combine all three sensor module outputs to
obtain a meaningful teacher response for the lateral automaton.
For the longitudinal actions, although there are only two sensor modules whose feedback
is considered for reinforcement, the decision is more complicated. Longitudinal actions SM and
ACC may receive a penalty or reward from the front sensor depending on the headway distance.
The output of the speed module depends on the actual speed of the vehicle, and is used to force
the vehicle to match the desired speed. Action DEC is considered as good all the time by the
headway sensor module. However, the reward response in the case where there is a vehicle in the
front sensor range (indicated by 0*) is different from the normal reward response (indicated by 0):
this reward response must override a possible penalty from other modules. For the safety of the
vehicle, the output of this sensor must have a higher priority than that of the speed module.
Therefore, a simple OR-gate is not sufficient; additional Boolean functions must be used. Possible
action-sensor output scenarios are shown in Table 4.6.
Again, OR-ing the headway and speed module outputs will give the combined
output in Table 4.6, except in the case where the action is DEC and the headway module indicates
the presence of a vehicle in front. Therefore, the output of an OR-gate must be double-checked
for this condition. Since we are using OR gates, we can define don't care conditions (-) as
reward responses. Thus, it is possible to OR the outputs of all sensors. Then, the function F
mapping individual sensor outputs to a combined feedback signal is defined as in Figure 4.4.
Figure 4.4. The mapping F: the headway, left, right, and speed sensor module outputs are OR-ed, and the result is double-checked against the condition "action = DEC?" before being applied to the longitudinal (LNG) and lateral (LAT) automata.
Therefore, by adding one conditional rule to the OR-ed feedback signal, we obtain a
combined environment response to be applied to the automata. This single response is used to
update the action probabilities in both automata using the reinforcement schemes described in the
next section.
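The complete mapping F, OR-ing all module outputs and then applying the single conditional rule for DEC, can be sketched as follows (function and argument names are ours; 1 denotes a penalty).

```python
def F(action, headway_out, left_out, right_out, speed_out, vehicle_in_front):
    """Combined environment response (Figure 4.4): OR the four module
    outputs, but let the headway module's 0* reward for DEC override
    any penalty when a vehicle occupies the front sensor range."""
    if action == "DEC" and vehicle_in_front:
        return 0        # 0*: decelerating near an obstacle is always rewarded
    return 1 if (headway_out or left_out or right_out or speed_out) else 0
```

This single response then updates the action probabilities in both automata, as described in the text.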
The first algorithm is a well-known linear reinforcement scheme, one of the first LA
algorithms [Bush58]. The second algorithm is given by Baba [Baba83, Baba85], and is also valid
for S-model environments. The last two reinforcement schemes are extensions of the first two,
and have desirable characteristics for our application. They are the direct results of our attempts
to create reinforcement schemes with desirable characteristics suitable for this study of learning
automata applications to intelligent vehicle control. The detailed description of the algorithms can
be found in Chapter 6. Here, we will briefly mention their characteristics and relative properties.
Table 4.7 gives the general description of the above mentioned schemes.
The second algorithm is slightly different from the first one, and has not previously been studied in
detail. However, its convergence characteristics are better when the same learning
parameter a associated with reward is used. Furthermore, LR-P is proven to guarantee
convergence to an equilibrium point³ in N-person games. This is a property we would like to use
for analyzing the behavior of multiple autonomous vehicles. The last algorithm is an extension of
the third one, again resulting in a faster convergence in probability updates. These two schemes
are also proven to be absolutely expedient, meaning that the decision tends to a better one at
every time step.
³ Or Nash equilibrium, where no player has a positive reason to change her strategy.
At t = 0, the autonomous vehicle, traveling in lane 3, approaches the group of vehicles cruising at a
fixed speed of 80kmh (Figure 4.5a; the vehicles are traveling from left to right). It immediately
changes lane to avoid a collision. It detects another slow moving vehicle in lane 2 at
approximately t = 10sec and slows down to keep a 15-meter headway distance (see Figure 4.6a).
At approximately t = 14sec, it shifts to lane 1. Immediately after the lane change, the speed is
adjusted to the desired value of 85kmh. At t = 34sec and t = 45sec, two shift-to-left-lane actions are
carried out, again due to the presence of slow moving vehicles in front (Figures 4.5d-e). During
these maneuvers, the vehicle speed is not decreased, because the probability of the lateral action
SL reached 1 much faster than that of the longitudinal action DEC. The reason for this behavior is
the fact that the lateral action SR receives penalty at all times due to the presence of another
vehicle in the right lane, or due to the fact that the vehicle is in the rightmost lane.
After two shifts to the right lane (Figures 4.5f-h), the vehicle finds itself again in lane 1,
traveling behind another vehicle. The speed is adjusted to match the speed of the vehicle in front
(Figure 4.6). At this point, the only lateral action receiving reward is SiL, and this action is fired
repeatedly. The longitudinal automaton, on the other hand, will choose the action necessary to keep the
headway distance over 15m. Whenever the headway distance is larger than this predefined limit,
the vehicle will attempt to increase its speed to 85kmh. It cannot move out of the pocket
created by two slow moving vehicles.
Figure 4.5. Snapshots of the simulated highway (lanes 1-3), with the vehicles traveling from left to right: (a) t=0s, (b) t=3s, (c) t=14s, (d) t=33s, (e) t=45s, (f) t=61s, (g) t=74s, (h) t=90s.
Figure 4.6. Automated vehicle's (a) lane position and (b) speed (see also Figure 4.5).
The second and third simulation results show the behavior of a vehicle following a slowly
moving vehicle. Both simulations have been started with an initial headway distance of 30m
(Figure 4.7). Front sensor range fsr is set to 15m. The memory vector size is the same as the
processing speed (25), and this results in speed changes at most once every second. The only
difference between the two simulations is the desired speed: 81kmh in the second, and 85kmh in
the third simulation. The headway distance and speed of the automated vehicles are shown in
Figures 4.8 and 4.9.
Figure 4.7. Initial configuration: the autonomous vehicle starts 30m behind a slower vehicle (t=0s).
Figure 4.8. (a) Headway distance and (b) speed of an autonomous vehicle following another
slowly moving vehicle: the desired speed of the autonomous vehicle is 81kmh while the vehicle
in front cruises at 80kmh.
Figure 4.9. (a) Headway distance and (b) speed of an autonomous vehicle following another
slowly moving vehicle: the desired speed of the autonomous vehicle is 85kmh while the vehicle
in front cruises at 80kmh.
As seen in the above figures, both vehicles are able to avoid a collision⁴; the headway
distance never becomes too small. However, the response of the automated vehicle is oscillatory. The
longitudinal automaton sends the action ACC whenever the headway distance is larger than 15m

⁴ This is, of course, possible provided that the iteration rate is fast enough or the memory vector is relatively short.
to reach the desired speed of 85kmh. The action DEC is the choice when the distance is less than
the predefined limit. Also, the oscillation amplitude is higher when the relative speed is higher.
Although the oscillations are smaller for the desired speed setting closer to the speed of the
vehicle in front, the behavior is still oscillatory, as seen in Figure 4.8.
Although not shown here, the behavior of the vehicles changes with different parameters.
The choice of sensor limits and learning coefficients, as well as the iteration speed and the size of
the regulation buffer, are important factors.
Another important issue is the fact that the automated vehicles in simulations 2 and 3
chose to adjust their speed (back and forth) to keep a safe distance instead of shifting to lane 1 or
lane 3 to evade the slow-moving vehicle. Since the only lateral action receiving a penalty from the
front sensor module is SiL, the probabilities of the two other actions SL and SR both reach 0.5. This
results in a memory buffer filled with (approximately) equal numbers of two different actions;
the vehicle cannot decide which action to take.
Figure 4.10 shows the initial and final positions, numbers and speeds of ten autonomous
vehicles. The vehicles are traveling from left to right on a 500m circular highway. The final speeds
shown in Figure 4.10(f) are equal to the desired speeds set initially. Simulation parameters are again
given in Table 4.8.
[Figure 4.10 panels (a-g): initial (t = 1s) and final (t = 243s) snapshots, vehicle numbers and speeds; mpeg movie ch4m1.mpg (1.230Mb, every frame represents one fifth of a second).]
Figure 4.10. Initial and final positions (a, d), numbers (b, e), and speeds (c, f) of ten automated
vehicles traveling on a 3-lane 500-meter circular highway. The mpeg movie of the simulation is
accessible via icon (g).
As seen in the figure above and the mpeg movie, vehicles with higher speeds slowly
move ahead of the others. All vehicles are able to avoid collisions, either by slowing down or by
shifting lanes. In the end, all vehicles have reached their desired speeds, since each was able to
shift to an open lane sooner or later during the simulation. However, the paths taken to overtake
slower vehicles are not the shortest and/or quickest paths.
With current computing technology, the iteration speed is not a major concern. Learning parameters must be
chosen large enough to guarantee fast learning for the desired iteration speed, while avoiding
values larger than 0.3 for linear schemes and 0.1 for nonlinear schemes, in order to prevent
the probability of a non-optimal action from being pushed to 1 by an unnecessarily large update. The length
of the memory vector is another factor in firing actions. This buffer is definitely needed for
lateral actions, in order to avoid continuous lane changes, while its size can be decreased down to
1 for longitudinal actions provided that the speed changes are continuous.
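The linear reward-penalty update discussed above can be sketched as follows; the function and parameter names are ours, and the rates a and b stand in for the linear-scheme coefficients kept under the 0.3 bound noted above. This is a sketch of the standard L_R-P scheme, not a transcription of the simulator code:

```python
def update_probabilities(p, chosen, beta, a=0.2, b=0.2):
    """Linear reward-penalty (L_R-P) update of an action probability vector.

    p      : list of action probabilities (sums to 1)
    chosen : index of the action just performed
    beta   : environment response, 0 = reward, 1 = penalty
    a, b   : learning rates; large values can push a non-optimal
             action's probability toward 1 before the teachers
             can correct it, hence the bounds mentioned above
    """
    n = len(p)
    q = p[:]
    if beta == 0:  # reward: move probability mass toward the chosen action
        for i in range(n):
            if i == chosen:
                q[i] = p[i] + a * (1.0 - p[i])
            else:
                q[i] = (1.0 - a) * p[i]
    else:          # penalty: move probability mass away from the chosen action
        for i in range(n):
            if i == chosen:
                q[i] = (1.0 - b) * p[i]
            else:
                q[i] = b / (n - 1) + (1.0 - b) * p[i]
    return q
```

Both branches preserve the total probability mass, so the vector remains a valid distribution after every iteration.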
Although there are no collisions with carefully selected parameters to obtain the necessary
convergence speed, the path and/or behavior of the vehicles are far from optimal. The
autonomous vehicle in simulation number 1 did not possess the information to help it avoid a
pocket created by two slow moving vehicles. The decisions are based on local data, and therefore
cannot guarantee global solutions for the vehicle, or the traffic as a whole for that matter.
Conflicting decisions are expected.
The observed oscillations in the headway distance are due to the discrete-time control of
the speed and the defined front sensor limit. For a fast processing speed and a more realistic
speed controller design, the problem may be relatively insignificant. However, with the current
definition of the headway module, the vehicles will keep decelerating even when the headway
distance is improving.
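A toy discrete-time loop illustrates why a single front-sensor threshold produces this oscillation. All numbers are illustrative, and the fixed 1 m/s-per-second speed step is an assumption standing in for the regulation layer, not the dissertation's actual controller:

```python
# Toy headway loop: a follower adjusts its speed by +-1 m/s once per
# second against a single 15 m threshold, mimicking the Chapter 4 rule
# (ACC above the limit, DEC below it). Speeds are converted from kmh.
def follow(threshold=15.0, lead_speed=80 / 3.6, start_speed=85 / 3.6,
           start_gap=30.0, steps=60):
    gap, v = start_gap, start_speed
    gaps = []
    for _ in range(steps):
        v += 1.0 if gap >= threshold else -1.0   # bang-bang speed action
        gap += (lead_speed - v) * 1.0            # 1 s time step
        gaps.append(gap)
    return gaps

gaps = follow()
# the gap keeps crossing the threshold instead of settling on it
crossings = sum(1 for a, b in zip(gaps, gaps[1:])
                if (a - 15.0) * (b - 15.0) < 0)
```

The run produces a sustained limit cycle around the threshold rather than convergence, which is the qualitative behavior reported for simulations 2 and 3.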
Furthermore, unnecessary lane changes can be observed in simulation number 4. This is
again due to the fact that the vehicles do not have an evaluation method for the desired lane when
such a decision is needed. This fact is also the cause of the behavior seen in simulations 2 and 3.
When lateral actions SL and SR are both good, the vehicle is unable to change lanes using the
current method of firing actions when the memory vector/buffer is full (see Section 4.1).
Initial simulations show promising results, but there is a need for more elaborate sensor
definitions, and for more complex decision rules. The next chapter will address these upgrades,
and other related issues.
It is interesting to note that, again, a task performed easily by human beings proves to be
very complicated for a machine. Even a simple decision of shifting lanes in order to avoid being
stuck in a pocket of slow-moving vehicles (simulation 1, Figure 4.5h) requires complicated
sensory hardware and, possibly, a complex communication mechanism. However, a human driver
can simply look farther ahead toward the adjacent lanes, and decide to slow down in order to find
an opening. Research currently conducted on AHS aims to solve such problems with extensive
vehicle-to-vehicle and/or vehicle-to-roadside communications [Construction96, Lygeros94],
multiple sensors and sensor fusion [Agogino96], and a hierarchical control structure [Varaiya91,
Varaiya93, Lygeros95, Lygeros96]. When the hierarchical system and all vehicles are
connected, they will collectively reach a solution. However, such an architecture consequently
requires an expensive implementation. The problem of making decisions based on local
information would exist even in a full AHS environment unless the system is fully hierarchical.
Chapter 5
More Intelligent Vehicles: Extending the Capabilities
The intelligent controller defined in the previous chapter, although capable of controlling an
automated vehicle's path using local information, is far from optimal in the sense that the
resulting safe path is not always the shortest or fastest path, nor a good choice for improving
the overall throughput of a segment of the automated highway. Also, although not seen in the
simulation examples, a pinch maneuver condition, where two vehicles change lanes to occupy the same
spot on the highway (see Section 5.2.2), is possible when only the immediate neighborhoods of
the vehicles are considered. In this chapter, we extend our assumptions on the capabilities of
the intelligent vehicles. These changes can be grouped into three different classes: extending the
capabilities of the sensors, creating more complete (and complex) decision rules, i.e., extending
the capabilities of the teachers, and adding more sensor/teacher modules. All these changes are
described in the following sections.
Extending the sensor capabilities requires more complex and expensive hardware
implementations, and extending the capabilities of the teachers leads to more complex evaluation
rules for the teacher modules, as we describe in the following sections. These include new definitions
of the front and side sensors, more rules for the headway and side sensor modules, as well as interaction
of the lateral and longitudinal automata. Two new sensor modules are defined in this chapter, namely
the pinch and lane sensor-module pairs. The assumptions and physical implementations of these
sensors, as well as the flag structure used in the decision modules, are described.
In Table 5.1, cases where the output of the sensor has an overriding characteristic are indicated by asterisks.
There are now three different range values used in the feedback decision. The value fsr defines the
sensor range, and can be visualized as the limit of the region of interest. Any
measurement equal to or greater than this value results in a reward response for all longitudinal and
lateral actions. As seen in Figure 5.1, two additional parameters d1 and d2 with d1 < d2 < fsr are
used to define a new buffer region where the sensor module's response also depends on the rate
of change of the headway distance.
                                       Sensor Status
 Current   Region A     Region B (d1 < dist < d2)    Region C (d2 < dist < fsr)   No vehicle in range
 Action    (dist < d1)  approaching  NOT approaching  approaching  NOT approaching  (dist >= fsr)
 SiL           1            1*            0               1*            0               0
 SL            0**          0             0               0             0               0
 SR            0**          0             0               0             0               0
 SM            1            1             0               0             0               0
 ACC           1            1             1               1             0               0
 DEC           0            0*            0               0             0               0
Asterisks are used to distinguish the cases where additional functions need to be used (see Section 5.3.3
for details).
Table 5.1. Output of the Front Sensor block (also see Figure 5.1).
The lateral action SiL is discouraged when the vehicle in front is dangerously close; the
parameter d1 defines this safety region. The same action also receives a penalty when the headway
distance is larger than d1 but less than the safe distance defined by the parameter fsr. However, for
the region between these two values, the action is penalized only when the rate of change of the
headway distance is negative, i.e., when the headway distance decreases. All other lateral actions
receive reward from the front sensor module.
The longitudinal action ACC is discouraged when a vehicle enters the front sensor range and
is approaching. If the headway distance is under d2, this action is penalized regardless of the rate
of change of the measured distance: a vehicle should not increase speed while sensing another
vehicle in proximity. The differentiation of the feedback response for the range [d2 fsr] enables the
automated vehicle to catch up with the vehicle in front. If the headway distance is too close, or
if the vehicle in front is approaching, the action SM also receives a penalty.
Note that if we choose the parameters as d1 = d2 = fsr, the reward-penalty structure of
Table 5.1 becomes the one previously given in Table 4.4. By defining these new regions, where the
feedback response is different for different longitudinal actions, we hope to obtain a smoother
response, and thus remove the oscillations in the headway distance while following a slowly
moving vehicle (see examples in Figures 4.8 and 4.9). The reward response of the front sensor
module for the longitudinal action ACC in region C facilitates speed recovery.
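The reward-penalty structure of Table 5.1 (omitting the starred override cases of Section 5.3.3) can be sketched as a small lookup. The function name, the 0/1 encoding (0 = reward, 1 = penalty) and the default ranges are our assumptions, not part of the original design:

```python
PENALTY, REWARD = 1, 0

def front_sensor_feedback(dist, approaching, d1=10.0, d2=15.0, fsr=20.0):
    """Front sensor module feedback per Table 5.1 (starred cases omitted).
    Returns a dict mapping each action to 0 (reward) or 1 (penalty)."""
    out = {a: REWARD for a in ("SiL", "SL", "SR", "SM", "ACC", "DEC")}
    if dist >= fsr:                     # no vehicle in range: all rewarded
        return out
    if dist < d1:                       # region A: dangerously close
        out.update(SiL=PENALTY, SM=PENALTY, ACC=PENALTY)
    elif dist < d2:                     # region B: never accelerate here,
        out["ACC"] = PENALTY            # other responses are rate-dependent
        if approaching:
            out.update(SiL=PENALTY, SM=PENALTY)
    else:                               # region C: allows catching up
        if approaching:
            out.update(SiL=PENALTY, ACC=PENALTY)
    return out
```

Setting d1 = d2 = fsr collapses the three regions into region A plus "out of range", recovering the single-threshold behavior of Chapter 4.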
Cem Ünsal Chapter 5. More Intelligent Vehicles: Extending the Capabilities 64
[Figure 5.1. Sensor regions: front regions A, B and C bounded by d1, d2 and fsr; side regions D and E.]
Again, consider the situation given in Figure 4.7 where the automated vehicle follows a
slow-moving vehicle. The initial headway distance is still 30m, and the sensor parameters are chosen
as shown in Table 5.2. The automated vehicle has a desired (and initial) speed of 85kmh while the
other cruises at 80kmh. All other parameters are set to the same values as in the second and third
examples in Chapter 4 (Table 4.8). The regulation layer changes velocity after receiving the same
action 25 times consecutively, and the processing speed is 25 iterations per second.
The vehicle speeds and the headway distances for the five simulations with the above
parameters are given in Figures 5.2-5.6. Only the sensor range definitions are changed, to show the
effect of various settings. As seen in Figures 5.2 and 5.3, it is possible to prevent the oscillations
in the headway distance. Usually, the headway distance stabilizes at a value between d1 and d2
when the automated vehicle approaches a slow-moving vehicle. As seen in Figure 5.2, the
regulation layer changed the speed 5 times to match the speed of the vehicle in front, in
approximately 6 to 7 seconds. Again, note that a maximum of one change per second was
possible in these simulations.
Figure 5.2. Headway distance and speed of an autonomous vehicle following another
slow-moving vehicle: sensor parameters are d1 = 10m, d2 = 20m, and fsr = 30m.
Figure 5.3. Headway distance and speed of an autonomous vehicle following another
slow-moving vehicle: sensor parameters are d1 = 10m, d2 = 15m, and fsr = 20m.
Figure 5.4 shows another vehicle with much shorter sensor ranges than the previous ones.
Again starting from the same initial conditions, the vehicle is able to match the speed of the vehicle in
front and keep a safe distance. Since the parameter fsr is much shorter than in the previous
simulation runs, the vehicle had less time to adjust its speed, starting at only 18m. As a result of
very rapid changes (only the action DEC is permitted after t = 10sec since the headway is less than d2
and the vehicle is approaching), the headway decreased under d1 = 10m and the vehicle speed dropped
below 80kmh. For approximately 6 seconds, the automated vehicle cruised at 79kmh in order to
permit the headway to increase to a safe distance. At t ≈ 30sec, right after the headway distance
becomes equal to d2, the speed is matched. In this case, the region [d2 fsr] is where the steady-
state value is obtained. After t = 40sec, the vehicle ahead changes its speed to 83kmh, then back
to 80kmh. As seen in Figure 5.4, this speed change is quickly matched, and the headway distance
reaches its steady state a few seconds after the vehicle in front returns to its cruising speed of
80kmh (at t ≈ 60sec).
Figure 5.4. Headway distance and speed of an autonomous vehicle following another
slow-moving vehicle: sensor parameters are d1 = 12m, d2 = 15m, and fsr = 18m.
When we attempt to decrease the sensor regions further, to [d1 d2] = [14 15] and
[d2 fsr] = [15 16], oscillations result, as seen in Figure 5.5. The regions where different actions
are encouraged based on the distance and its rate of change are so small that, before an
intelligent decision can be made, the value of the headway distance moves out of the buffer
region. Again, it is important to note that an ACC or DEC action is fired every second in the best
case. For a much shorter memory buffer as well as a faster iteration speed, the response will be
quicker, and these sensor values may then be feasible for a non-oscillatory response.
Figure 5.5. Headway distance and speed of an autonomous vehicle following another
slow-moving vehicle: sensor parameters are d1 = 14m, d2 = 15m, and fsr = 16m.
Keeping the same processing speed and memory vector, we try to solve the problem by
enlarging the lower sensor region, i.e., by decreasing the value of d1 to 12m. The results are
shown in Figure 5.6. The response is not as damped as in simulations 1 and 2, but a steady-state
value is reached. Again, a change in the speed of the vehicle in front (at t ≈ 64sec) is quickly
matched with a minimal deviation in the headway distance.
Figure 5.6. Headway distance and speed of an autonomous vehicle following another
slow-moving vehicle: sensor parameters are d1 = 12m, d2 = 15m, and fsr = 16m.
From the simulation results above, we see that there is a trade-off between the time to
reach a steady-state value for the headway distance and the permitted region for the steady-state
values. For faster processing speeds (which decrease the time to reach a decision in the planning
layer) and a shorter memory buffer (which results in a shorter decision time in the regulation layer),
the regions may be smaller, and the problem therefore less significant.
The definition of the front sensor module in Section 4.2 is the limiting case for the
extended version given here. As d1 and fsr approach d2, the definition given here becomes similar
to the one given in Chapter 4, and oscillatory responses are observed.
The idea of having multiple parameters defining the regions [d1 d2 ] and [d2 fsr] is similar
to the idea of defining a boundary layer around a switching surface for sliding mode controllers in
order to stop chattering [Slotine91]. In sliding mode control, additional integral terms are added
to the driving function for the predefined boundary layer. Here, we force the speed to change
according to the value and the rate of change of the headway distance. Since distance is the integral of
the velocity, the resulting effect is similar. Of course, the control methodology is much more
discrete, and also less structured, than in a sliding mode controller. In the design methodology of
the sliding mode controllers, either the width of the boundary layer or the function parameters are
predefined and the rest adjusted accordingly [Kachroo96c]. Similarly, in our case, the choice of
processing parameters and the sensor regions are sequential, in the sense that we either define the
sensor regions first, and then choose the learning parameters, or vice versa.
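The boundary-layer remedy that this design parallels can be stated compactly; the symbols below (s, Φ, k) are the standard ones from the sliding mode control literature [Slotine91], not quantities defined in this chapter:

```latex
u \;=\; -\,k\,\operatorname{sat}\!\left(\frac{s}{\Phi}\right),
\qquad
\operatorname{sat}(x) \;=\;
\begin{cases}
x, & |x| \le 1,\\
\operatorname{sgn}(x), & |x| > 1.
\end{cases}
```

Inside the layer |s| ≤ Φ the control varies smoothly instead of switching; the buffer regions [d1 d2] and [d2 fsr] play an analogous role here, grading the response by the rate of change of the headway distance rather than switching it outright.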
The region [d1 d2 ] is effective when the automated vehicle has to slow down; [d2 fsr] is
mainly important in the case of acceleration while closely following another vehicle. This second
region also improves the vehicle's braking time in the previous case.
Also note that the length of the memory vector is 25, i.e., at the very best, only one
velocity adjustment per second is permitted. More continuous regulation layer modeling is of
course possible, but not considered here. Since the controlled variable is the distance, not the
speed, and a more realistic model would also include delays, we did not investigate continuous
models. The aim here is to show the importance of sensor range definitions, and to introduce an
intelligent vehicle control method based on learning.
Assuming that there is a sensing structure, possibly an array of sensors, that is able to
give information about the relative speed of the neighboring vehicles, the side sensor's detection
area is divided into two sections as seen in Figure 5.1. An immediate neighborhood is now
defined: in this region (D), the output of the sensor module depends not only on the detection of
vehicles, but also on the measurement of their relative longitudinal movements. The penalty-reward
structure for the side sensor modules is given in Table 5.3.
Longitudinal actions are not shown in Table 5.3 since they are not affected by the side
sensor modules. The regions D and E are defined as shown in Figure 5.1; the parameters sr1 and
sr2 are used to define the detection range of the sensors (or sensor arrays), and an additional parameter
differentiates the region D from the region E. If a vehicle is detected in the region E, and it is not
(longitudinally) approaching the automated vehicle, the penalty response to the lane-changing
actions is suppressed. This distinction gives more flexibility to the automated vehicle, as we
describe with the following simulation examples.
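A minimal sketch of the side sensor response described above. The `ratio` argument stands in for the unnamed parameter separating regions D and E, and the interface is our assumption:

```python
def side_sensor_feedback(dist, approaching, sr=12.0, ratio=0.66):
    """Side sensor feedback for the lane-change action toward that side.
    Returns 1 (penalty) or 0 (reward); `ratio * sr` splits the immediate
    neighborhood D from the outer region E."""
    if dist is None or dist >= sr:      # nothing detected within range
        return 0
    if dist < ratio * sr:               # region D: never cut in this close
        return 1
    return 1 if approaching else 0      # region E: rate-dependent response
```

With the values used later in this section (sr = 12m, ratio = 0.66), a neighbor detected between roughly 8m and 12m penalizes the lane change only while it is closing in.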
Figure 5.7 shows five automated vehicles with sensor ranges sr1 = sr2 = 12m, d1 = 8m,
d2 = 11m, fsr = 15m (the side sensor module of Chapter 4 is used in this simulation). The vehicle
in lane 3 wants to merge into the platoon; however, the distance between the first two vehicles in
the platoon is only 20m. It slows down in order to change lanes (the behavior forcing this change
is given in Section 5.3.1). The right side sensor module always returns a penalty response for the
action SR, and consequently the vehicle is unable to change lanes. It slowly moves to the back
of the platoon.
[Figure 5.7. Simulation snapshots at t = 0, 3, 6, 9, 15 and 19 seconds (panels a-f).]
However, if we employ the extended side sensor module for the same scenario, the vehicle
is able to change lanes. As seen in Figure 5.8, the vehicle in lane 3 slows down to change lanes,
and shifts lanes at t ≈ 8.5sec.
In this simulation, the side sensor ranges are sr1 = sr2 = 12m, with the additional
region-splitting parameter set to 0.66, i.e., if the detected range is between 8m and 12m and the
neighboring vehicle is not approaching, the action SR becomes acceptable. Using this decision
structure, the gray-colored vehicle shifts to lane 2, and adjusts its speed accordingly. Figure 5.9
shows the headway distances for the platoon. Before the lane change, the headway distances are
steady, with minor deviations due to the random speed changes (see Figure 5.10a). At t ≈ 8.5sec,
the lane shift occurs, and the vehicle entering the platoon is cruising at a lower speed (see
Figure 5.10). Consequently, the distance between the second and third vehicles decreases; the
vehicle which is now third in the platoon detects a sudden change in headway distance. As a
result, this and other trailing vehicles slow down to adjust their headway distances. The effect can
be seen in Figures 5.9 and 5.10b: a ripple moves backward through the chain of vehicles. All
vehicles are able to keep their headway distances within the acceptable range while returning to
their desired speeds of 80kmh at t = 30sec. The steady-state response is due to the new definition
of the front sensor module given in the previous section.
[Figure 5.8. Simulation snapshots at t = 0, 6, 8.5, 8.8, 12, 16, 24 and 32 seconds (panels a-h).]
Figure 5.9. Headway distances in the platoon; the lane change occurs at t ≈ 8.5sec.
Figure 5.10. (a) Speed and (b) space-time trajectory of five automated vehicles
(Space-time trajectory for vehicle 2 is plotted after the lane change).
In the simulation example given here, the sensor ranges are chosen to illustrate the
advantage of the new definition and the new front sensors ability to stabilize the headway
distance. However, with such small (side sensor) range definitions, the lane shift creates ripples
in the otherwise stable platoon. The main reason for defining multiple regions for the side sensors
is actually to avoid sudden jumps of one vehicle in front of another. The sensor ranges must be
carefully chosen, possibly using adaptive techniques mentioned in Chapter 1.
to vehicles via broadcast communications, as envisioned by the PATH program (e.g., [Hedrick96]
and [Lygeros96]).
Once an automated vehicle knows its (lateral) position, and assuming that it can obtain
information on the best choice of lane for its desired speed and path, the decision for lateral
movement is relatively straightforward. Although current practical implementations do not yet
consider such computationally complicated1 maneuvers, research on the control of lane-changing
maneuvers is slowly emerging [OBrien96, Wonshik95, Hessburg95]. The difficult part in path
planning is to find the optimal lane position for a specific vehicle.
For our purposes we will assume that an automated vehicle can sense its present lane, and
that it has some idea about where it should be. Based on these two values, the action that leads to
necessary lane shifts is encouraged by the teacher module shown in Figure 5.11. The flag
structure described in Section 5.3.1 is used to force necessary longitudinal actions if the vehicle
has trouble shifting into the desired lane.
1 Computational difficulty here describes the complexity of a good decision-making process for lane maneuvers;
the low-level control of a lane maneuver is possible for most of the current test vehicles.
In the simulations, we use the memory vector to check for other vehicles' intention to shift
lanes. If the number of memory locations containing SR or SL is more than half the size of the
vector for a vehicle, it is assumed that the vehicle is likely to shift lanes (Figure 5.13). Since there
are three lateral actions, a 50% distribution in the memory vector most likely indicates that an
action will soon be fired. If such an intention signal is received from a neighboring vehicle, the
pinch module sends a penalty response to the lateral automaton for the action that may cause a
problem. In a sense, the pinch module in an automated vehicle is driven by the memory vectors of
neighboring vehicles. In a real implementation, this corresponds to a signaling vehicle which may
be detected by a vision system, vehicle-to-vehicle communications indicating intended actions, or
a roadside-to-vehicle communication relaying the positions of the vehicles. The last possibility is a
global solution to the problem.
Figure 5.13. Memory vector/buffer: If an action fills half of the buffer, it is signaled.
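The two roles of the memory vector, firing an action when the buffer is completely filled with it and signaling a lane-change intention when SR or SL passes the halfway mark, can be sketched as follows; the function name and return convention are ours:

```python
from collections import Counter

BUFFER_SIZE = 25  # matches the 25 iterations/second processing speed

def buffer_status(buffer):
    """Inspect the lateral memory vector: an action is *fired* when it
    fills the whole buffer; a lane-change *intention* is signaled to
    neighboring vehicles when SR or SL occupies more than half of it."""
    counts = Counter(buffer)
    fired = next((a for a, c in counts.items() if c == len(buffer)), None)
    intention = next((a for a in ("SL", "SR")
                      if counts[a] > len(buffer) // 2), None)
    return fired, intention
```

In the pinch scenario below, it is the `intention` signal of one vehicle that drives the pinch module of its neighbor.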
Figure 5.14 shows two vehicles cruising at the same speed. Initially, vehicle 1 is in the
rightmost lane, and vehicle 2 is in the leftmost lane. However, vehicle 1 needs to shift to the leftmost
lane, while the other wants to change to the rightmost lane. Figure 5.15 shows the lane and speed
of the two vehicles versus time. Assuming that their initial positions in Figure 5.14 correspond to
time t = 0, both vehicles are unable to change lanes until t ≈ 3sec, when vehicle 1 shifts to the
middle lane.
The reason for the delay is the pinch module in both vehicles. Without a detection method
for this condition, the vehicles would change to the middle lane at approximately the same time,
and the result would be a collision. As seen in Figure 5.16, both vehicles' lateral memory vectors are
initially filled with the idle action SiL. Since both vehicles are forced to choose the necessary lateral
actions by the lane module, the percentages of the actions SL and SR in the memory buffers (of
vehicles 1 and 2, respectively) increase. The percentage of the action SL in the memory buffer of
vehicle 1 reaches 50% slightly faster than that of the action SR in vehicle 2 (Figure 5.16). As a
result, the lateral action SR of vehicle 2 starts receiving penalties earlier than that of vehicle 1, and
the probability of this action decreases. The result is also a decrease in the percentage of the
action SR in the memory buffer. The percentage of vehicle 1's SL action also drops (until n = 40,
or t = 1.6sec), but never under 50%. Once the intention signal from vehicle 2 is stopped, the action
SL fills the memory buffer, and is fired at n = 71 (t ≈ 2.84sec). The memory vector is then reset
to all-SiL.
Figure 5.15. Lane (a) and speed (b) of two vehicles from t = 0 to t = 4sec.
This new module enforces safe lane transitions: either one of the vehicles temporarily
suppresses its desire to shift lanes (i.e., a penalty response from the pinch module overrides all other
reward responses from the lane and side sensor modules by ORing), or both vehicles are hindered
from changing lanes to achieve their desired paths.
As seen in Figure 5.14, the two vehicles won't be able to exchange lanes as long as they
cruise at the same speed. A bottleneck might also occur in a pinch situation where
both vehicles have similar action ratios in their memory buffers. To overcome this problem, more
complex decision rules need to be defined. These are explained in the next section.
Figure 5.16. The output of the pinch module and the actions in the memory
buffer versus time for vehicle 1 (top) and vehicle 2 (bottom).
[Figure panels: initial (t = 0s) and final (t = 60s) snapshots of vehicles 1-3, vehicle speeds (kmh) and vehicle lanes versus time; mpeg movie ch5m1.mpg (0.273Mb; every frame represents one fifth of a second).]
Figure 5.17. Initial (a) and final (b) situation, speed (c) and lane positions (d) of three
automated vehicles traveling on a 3-lane 500-meter circular highway. The mpeg movie
of the simulation is accessible via icon (e).
and match the speed of vehicle 2, but it is unable to travel at its desired speed of 80mph (Figure
5.18(e)). The predefined time interval to match the desired speed is 10sec in this case. The count
starts at t = 5.32sec (the '+' mark in Figure 5.18(f)), and the time expires at t = 15.32sec ('S' in
Figure 5.18(f)) while the automated vehicle is keeping a safe distance from the vehicle in front.
Once the flag is set, the action SiL, which is normally encouraged to avoid unnecessary lane
changes, starts receiving a penalty. The probability of one of the actions SR and SL, in this case
SR2, approaches one, and the action SR is fired at t = 18.96sec. The flag is reset when the vehicle
completes the lane change maneuver at t = 19.52sec ('o' in Figure 5.18(f)). Immediately after the
lane shift, the vehicle increases its speed to match the desired value.
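The timer-driven flag behavior described above can be sketched as follows; the class and argument names are ours, and only the set/reset logic is modeled:

```python
class SpeedFlag:
    """Sketch of the flag structure of Section 5.3.1: if the vehicle stays
    below its desired speed for `timeout` seconds, the flag is set and the
    idle action SiL starts receiving penalties, forcing a lane change;
    completing the lane change resets the flag."""
    def __init__(self, timeout=10.0):
        self.timeout = timeout
        self.started = None     # time at which the speed deficit began
        self.set = False

    def update(self, t, speed, desired, lane_change_done=False):
        if lane_change_done:                     # maneuver completed: reset
            self.started, self.set = None, False
        elif speed < desired:                    # still below desired speed
            if self.started is None:
                self.started = t                 # start (or restart) count
            elif t - self.started >= self.timeout:
                self.set = True                  # time expired: set flag
        else:                                    # desired speed reached
            self.started = None
        return self.set                          # True => penalize SiL
```

In the example above the deficit begins at t = 5.32sec, the flag is set 10 seconds later, and it is cleared when the lane change completes at t = 19.52sec.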
[Figure 5.18 panels (a-d): snapshots at t = 0, 19, 19.6 and 35 seconds; panels (e, f): speed and lane versus time.]
Figure 5.18. Snapshots (a-d), speed (e) and lane positions (f) of two automated vehicles traveling
on a 3-lane 500-meter circular highway.
2 Using the additional rule defined in Section 5.3.3.
Although the structure of the front sensor module is changed, the mapping that combines the teacher
outputs for longitudinal actions does not change; the rules are exactly the same as in the previous
case, given in Table 4.7 and Figure 4.4.
For the lateral actions, the changes in the front sensor and the addition of the lane and pinch
modules affect the mapping. The new function generating the automaton input is more complex
than a single OR gate. For the lateral action SiL, the combined output is again an OR-ed
combination of the sensor outputs, except for one case. When the lane module sends a reward signal to the
lateral automaton but the headway module indicates an approaching vehicle, the action SiL must
be penalized, as is the case with a simple OR function. However, in order to avoid unnecessary
lane changes, the penalty response of the headway module is suppressed whenever the
longitudinal action is DEC, for headway distances that are not dangerously close (Table 5.4).
                         Module Output
Action: SiL (the Right/Left and Pinch module outputs are always 0)
   Lane   Headway   Combined
    0        0         0
    0        1         1
    0        1*        1 or 0*
    1        0         1
    1        1         1
    1        1*        1
Actions: SR or SL
   Right/Left   Lane   Headway   Pinch   Combined
       0          0       0        0        0
       0          0       0        1        1
       0          0       0**      0        0
       0          0       0**      1        1
       0          1       0        0        1
       0          1       0        1        1
       0          1       0**      0        0†
       0          1       0**      1        1
       1          0       0        0        1
       1          0       0        1        1
       1          0       0**      0        1
       1          0       0**      1        1
       1          1       0        0        1
       1          1       0        1        1
       1          1       0**      0        1
       1          1       0**      1        1
* Combined output depends on the longitudinal action (see text and Figure 5.19).
† All combined outputs for SR and SL are OR-ed side, lane and pinch module outputs except
this one, if 0 and 0** are assumed to be the same binary value, FALSE.
Table 5.4. Combined feedback responses for the lateral actions.
For the actions SR and SL, an additional if-then condition again needs to be added for one
case. Consider a situation where the side sensor and pinch modules indicate a reward for the lane-
shifting actions SR and SL. The lane module, on the other hand, favors the lateral action SiL because
the vehicle is currently cruising in its desired lane. However, there is another vehicle in very close
proximity, and the front sensor module sends a reward signal to indicate the need for
a lane change (if the headway distance is under the predefined limit d1, we assume that the action
DEC has failed to avoid a collision). This reward signal must override the lane module's penalty
signal (indicated with 0** in Table 5.4). The resulting mappings from multiple teacher outputs to
a single automaton input are given in Figure 5.19.
In addition to the rules described above, one more subroutine is defined to check the memory buffer interface to the regulation layer. Currently, if an automated vehicle senses a decreasing headway and nothing on the sides, with no useful information from the lane module, it cannot decide whether to shift left or right. The reason is the lack of global information: using only its local sensing devices, the vehicle cannot distinguish between the left and right lanes, and the probabilities of both actions approach 0.5. To overcome this problem, a symmetry-breaking rule which randomly chooses one of the actions is introduced. The overall structure of the intelligent planner is given in Section 5.4.1.
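The symmetry-breaking subroutine can be sketched as follows. This is a minimal illustration; the function and flag names are ours, and the surrounding planner logic is reduced to the single decision described above:

```python
import random

def lateral_tiebreak(headway_decreasing, left_clear, right_clear, lane_hint):
    """Pick a lane-shift direction when local sensing cannot decide.

    lane_hint is 'left', 'right', or None (no useful lane-module output).
    Returns 'SL' (shift left), 'SR' (shift right), or None (no shift needed).
    """
    if not headway_decreasing:
        return None                      # keep the current lane
    if lane_hint == "left" and left_clear:
        return "SL"
    if lane_hint == "right" and right_clear:
        return "SR"
    # Both sides clear and no lane preference: the probabilities of SL
    # and SR would both settle near 0.5, so break the symmetry at random.
    if left_clear and right_clear:
        return random.choice(["SL", "SR"])
    if left_clear:
        return "SL"
    if right_clear:
        return "SR"
    return None
```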
[Figure 5.19. The flag structure of the intelligent planner: the headway, lane/speed, left/right and pinch detection modules produce binary teacher outputs (current and desired speed and lane as inputs), which are combined by if-then rules and OR gates, passed through the memory buffer, and fed to the longitudinal and lateral automata; the fired actions are sent to the regulation layer.]
Although not important for our design purposes, we chose to disable speed changes during lane shifting. Once the lane change is completed, the probability update for longitudinal actions continues. The longitudinal vehicle dynamics are modeled as simply as possible; however, we assert that realistic vehicle dynamics will not affect the overall system behavior since the regulation and physical layers will react to fired actions fast enough. Although not instantaneous, speed changes of 1 m/s are not hard to obtain. For lateral movements, we assumed that an automated vehicle completes a single lane change so that the maximum lateral acceleration is less than or equal to 0.5 m/s².
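Under this acceleration bound, a rough lower bound on the lane-change duration follows from a bang-bang lateral acceleration profile (full acceleration toward the target lane for the first half of the maneuver, full deceleration for the second half). The 3.6 m lane width below is our assumption for illustration; the thesis does not fix a value here:

```python
import math

def min_lane_change_time(lane_width_m=3.6, a_max=0.5):
    """Minimum lane-change duration with a bang-bang lateral acceleration
    profile: +a_max for the first half of the maneuver, -a_max for the
    second, ending with zero lateral speed.

    The first half covers lane_width/2 in time t/2:
        w/2 = a_max * (t/2)**2 / 2   =>   t = 2 * sqrt(w / a_max)
    """
    return 2.0 * math.sqrt(lane_width_m / a_max)

t = min_lane_change_time()
print(f"{t:.2f} s")   # about 5.37 s for a 3.6 m lane at 0.5 m/s^2
```

This suggests lane changes take several seconds at this acceleration limit, which is consistent with disabling longitudinal updates for the duration of the maneuver.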
The automata use the algorithms given in Chapter 6 as reinforcement schemes. The
learning parameters are adjusted to give fast enough learning while keeping them relatively low to
avoid instantaneous learning.
Figures 5.21 and 5.22 show the average platoon speed and the distances between vehicles for four platoons of ten vehicles, each using a different set of parameters. Positions and speeds of the vehicles for t = 0 and t = 29.6 sec are given in Figure 5.23. All the simulations (#1-4) are run separately, and the data are then combined for illustration. Each platoon corresponds to a simulation run. Figure 5.23 also includes the mpeg movie of the four runs. The initial positions and speeds of the vehicles are the same for all runs, as is the processing speed of 25 Hz. The desired speed for all vehicles is 80 km/h.
As seen in Figure 5.23, the positions of the vehicles at t = 29 sec and their speeds are different for each simulation run. The values plotted in Figures 5.21, 5.22, and 5.23 are not averaged over a large number of runs with the same parameters; rather, the result for each set of parameters shows the characteristics of that particular choice of parameters.
Simulations 1 and 2 (Figures 5.21a-b and 5.22a-b) illustrate the effect of the length of the memory buffer on the vehicle-following behavior of an automated vehicle. For the same initial separation and velocity, the velocities of the vehicles reach a steady state much faster when the memory buffer is shorter. The separation between vehicles is more uniform with a larger memory buffer. This is due to the fact that the system with the shorter buffer reaches the permitted region [12 m, 18 m] for the headway distance faster. Once the desired region is reached, the reaction to the change in the feedback response is also faster. As a result, the separation between vehicles is more uniform for a large memory buffer, but the response is not stable; the opposite is true for a shorter memory vector.
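One plausible reading of the memory buffer's role is a FIFO of recent binary feedback flags whose aggregate drives the automaton update, so that a shorter buffer reacts faster and a longer one smooths transients. The thesis does not spell out the aggregation in this section, so the majority vote below is our assumption:

```python
from collections import deque

class FeedbackBuffer:
    """Memory buffer between the environment modules and an automaton.

    Feedback flags (0 = reward, 1 = penalty) accumulate in a FIFO buffer;
    the automaton is updated with the majority response once the buffer
    is full.  A short buffer reacts quickly to changes in the feedback;
    a long buffer smooths out transient responses.
    """
    def __init__(self, length):
        self.buf = deque(maxlen=length)

    def push(self, beta):
        self.buf.append(beta)

    def ready(self):
        return len(self.buf) == self.buf.maxlen

    def response(self):
        # majority vote over the buffered responses
        return 1 if sum(self.buf) * 2 > len(self.buf) else 0
```

With this reading, a buffer of length 9 fires updates faster than one of length 25, matching the qualitative behavior reported for the platoon simulations.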
Changing the sensor range while keeping other parameters constant also affects the behavior of the vehicles in the platoon, as expected. Simulations 1 and 3 differ only in the choice of the values for d1 and fsr. As seen in Figures 5.21(a, c) and 5.22(a, c), by increasing the region around the switching value d2, it is possible to obtain steady-state speeds 15 sec after the start of the run. The average speed of the platoon is still under the desired value of 80 km/h since the headway distance is still too small for the action ACC to fire for some of the vehicles. Eventually, all vehicles will reach their desired speed provided that the headway distance is above the safe value. An example of this behavior is seen in Figures 5.21c and 5.22c: as soon as the headway distance grows larger than d2 = 15 m, at t ≈ 29 sec, one of the vehicles increases its speed, and the effect on the average platoon speed is seen. Furthermore, it is important to note that the initial vehicle separations are already in the defined sensor range. As seen in Figure 5.23, the overall behavior of the platoon in simulation 3 is closer to the one in simulation 1 than in simulation 2.
Simulation 4 is given to illustrate the importance of the choice of learning parameters. In terms of uniformity of the headway distances, the result of this simulation is the worst, which is of course due to the relatively small learning parameters. Although the length of the memory vector is the same as in simulation 2, the responses to the headway module are slow, and therefore the headway and speed variations are large.
Figure 5.21. Average platoon speeds for simulations 1-4.
[Panel parameters: (a) d1=12, d2=15, fsr=16; (b) d1=12, d2=15, fsr=16; (c) d1=10, d2=15, fsr=20; (d) d1=12, d2=15, fsr=16.]
Figure 5.22. The distance between vehicles for simulations 1-4.
[Figure 5.23 content. (a) Positions at t = 0 s. (b) Initial speeds, identical for all four platoons:
80 78 78 80 80 82 80 82 82 80
(c) Final positions. (d) Final speeds of the four platoons (simulations 1-4):
79 80 80 79 79 79 79 79 79 80
79 79 79 79 79 79 79 79 80 80
79 79 79 79 79 79 80 80 80 80
79 79 80 80 79 80 80 80 80 79
(e) mpeg movie: ch5m2.mpg (0.237 Mb; every frame represents one fifth of a second).]
Figure 5.23. Initial and final positions (a,c) and speeds (b,d) of automated vehicles in four
ten-vehicle platoons. The mpeg movie of the simulation is accessible via icon (e).
Figures 5.24-5.26 show the results of a second batch of simulations (numbers 5 and 6). In both simulations, the initial conditions are the same; the desired speed for all vehicles is still 80 km/h, except for the leading vehicle, which desires to slow down to 77 km/h. In this scenario, the effect of the learning parameters is obvious: a fast-learning vehicle will react faster to the speed changes (and therefore to the headway changes). Also, a very short sensor region may create a problem because it directly affects the reaction time. However, the effect of the length of the memory buffer is not as obvious. In simulations 5 and 6 (Figures 5.24 and 5.25), the only difference between the two platoons is the size of the memory vector. There are no apparent differences in the behavior of the platoon. Again, the separations are distributed over a larger region for a smaller memory vector, but the difference is minor.
Figure 5.24. Simulations (a) 5 and (b) 6: average speed for a platoon of ten automated vehicles.
[Panel parameters for both panels: d1=10, d2=15, fsr=18.]
Figure 5.25. Simulations (a) 5 and (b) 6: distances between vehicles for a platoon of ten
automated vehicles.
[Figure 5.26 content. (a) Positions at t = 0 s. (b) Initial speeds, identical for both platoons: 80 78 78 80 80 82 80 82 82 80. (c) Final positions. (d) Final speeds (simulations 5 and 6): 76 76 76 76 76 76 77 77 77 78 and 76 76 76 76 76 77 77 76 77 77. (e) mpeg movie: ch5m3.mpg (0.345 Mb; every frame represents one fifth of a second).]
Figure 5.26. Initial and final positions (a,c) and speeds (b,d) of automated vehicles in two ten-vehicle platoons (simulations 5 and 6). The mpeg movie of the simulation is accessible via icon (e).
When the speed change of the leading vehicle is much larger, the effect of the memory vector on platoon behavior can be seen much more clearly. The two mpeg movies in Figure 5.27 show two unstable platoons. When the lead vehicle slows down to 75 km/h, the following vehicles are not able to keep their safe distances. Using the same parameters as before, a smaller memory vector length, resulting in faster firing of actions, gives a better result. With a memory vector of length 25, the third (or sometimes fourth) vehicle crashes into the vehicle in front, while with a smaller length of 9, platoon stability is kept longer, and the problem occurs at the end of the platoon (seventh, eighth, or ninth vehicle). Since the speed (and headway) deviations increase as we move back in the platoon, an increase in the capability of the automated vehicles avoids an early problem.
The simulations given here are run using the general linear reward-penalty learning
scheme. Examples of the nonlinear learning scheme and its comparison to linear schemes are
given in Appendix C.
ch5m4.mpg
(0.065Mb; every frame represents one fifth of a second)
(a)
ch5m5.mpg
(0.100Mb; every frame represents one fifth of a second)
(b)
Figure 5.27. Simulations 7 and 8: mpeg movies of two simulations of 10-vehicle platoons with the lead vehicle decelerating to 75 km/h. The length of the memory vector is (a) 25 and (b) 9; all other parameters are the same as in simulation 6.
every agent on the highway. The autonomous vehicle approach, on the other hand, suffers from the lack of global information. While designing the intelligent controller, we assumed local sensing and minimal or no communications.

Solving the path planning problem locally may be relatively inexpensive and simple due to individual design goals and lack of communications, but the price to be paid may be the optimality of such a distributed system. Consider the situation given in Section 5.3.1 (Figure 5.17). For an autonomous vehicle to detect that the current solution is only local and does not satisfy the global goals (in the sense that the vehicle is capable of avoiding collision, but is not able to reach its desired speed), information about the relative positions (and possibly speeds) of other vehicles needs to be known. This information can be provided by the link layer in a hierarchical system [Varaiya93], or may be extracted using visual cues in the autonomous vehicle approach [Özgüner96, Pomerleau96, Weber96]. Again, the information content will be lower in the latter case. The need for more global information, rather than only local sensors, is crucial and apparent, considering the difficulties encountered during this research.
Chapter 6
New Reinforcement Schemes for Stochastic Learning Automata

- the nature of the mapping from the previous probability vector to the new probability vector, e.g.: (i) linear, nonlinear or hybrid, or (ii) projectional, nonprojectional¹;
- the behavior of the learning automaton using the scheme, e.g., optimal or expedient.
Our efforts in designing an intelligent path controller for an autonomous vehicle using stochastic learning automata led us to employ four different reinforcement schemes.

The first two algorithms are widely known [Narendra89, Baba85]. Their convergence properties were investigated, and several applications using these schemes were previously presented [Chand69, Baba80]. The third algorithm is less popular, since the behavior of an automaton using this scheme has not been analytically proven [Narendra89]. A special case of this algorithm, L_R-εP, is obtained by adding a small penalty term (0 < b << a < 1) to the L_R-I
1
See Section 3.5 for the definition of these terms.
Cem Ünsal Chapter 6. New Reinforcement Schemes for Stochastic Learning Automata 93
scheme. However, the proof of ε-optimality for the general r-action case is again not complete [Narendra89]. The general linear reward-penalty scheme, L_R-P, is linear and nonprojectional. In Section 6.1.1, we will prove that this scheme is optimal in the case where there is a pure optimal action.
The last scheme is nonlinear, projectional and is an extension of the general absolutely
expedient schemes [Narendra89, Baba85]. It differs from previous schemes by the definition of
an additional reward function. We will also show in this chapter that this scheme is absolutely
expedient in stationary environments.
The last two schemes are used in this work for their improved behavior in specific cases
frequently encountered in our application to intelligent vehicle control. Both schemes are found
to be convergent to the optimal solution faster than the first two schemes. They are the direct
results of our attempts to create reinforcement schemes with desirable characteristics suitable for
this study of learning automata applications to intelligent vehicle control.
In the following section, we will compare the linear reinforcement scheme L_R-P to previously existing linear learning algorithms. The nonlinear reinforcement scheme NLH is also compared to the general nonlinear scheme in Section 6.2. Since these two new schemes are found to be useful, we have investigated their convergence properties. Using the nonlinear stability theorems, we give the proof of optimality of the linear scheme L_R-P for a specific case. Furthermore, the conditions of absolute expediency are checked for the new nonlinear scheme NLH, and the scheme is proven to be absolutely expedient.
For $\beta(n) = 1$ (penalty):

$$
\begin{aligned}
p_i(n+1) &= (1-b)\,p_i(n) \\
p_j(n+1) &= \frac{b}{r-1} + (1-b)\,p_j(n), \qquad j \neq i
\end{aligned}
\tag{6.1}
$$
where r is the number of actions. Parameters a and b are associated with the reward and penalty updates, respectively. For b = 0, the algorithm is called the linear reward-inaction scheme, and is known to be ε-optimal [Narendra89]. The specific case where a = b is generally known as the linear reward-penalty scheme, and is expedient in stationary environments [Bush58, Narendra89].
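Equation 6.1 gives the penalty half of the scheme; together with the usual reward half (p_i ← p_i + a(1 − p_i), p_j ← (1 − a)p_j), which the derivation of Equation 6.2 also uses, the update can be sketched directly. The convention below is β = 0 for reward and β = 1 for penalty, as in the text:

```python
def lrp_update(p, i, beta, a, b):
    """General linear reward-penalty update (Equation 6.1 and its reward half).

    p    : list of action probabilities (sums to 1)
    i    : index of the action just performed
    beta : environment response, 0 = reward, 1 = penalty
    a, b : reward and penalty learning parameters
    b = 0 gives L_R-I; a = b gives the classical linear reward-penalty scheme.
    """
    r = len(p)
    q = p[:]
    if beta == 0:                       # reward
        for j in range(r):
            q[j] = (1 - a) * p[j]
        q[i] = p[i] + a * (1 - p[i])
    else:                               # penalty
        for j in range(r):
            q[j] = b / (r - 1) + (1 - b) * p[j]
        q[i] = (1 - b) * p[i]
    return q
```

Both branches preserve the probability simplex: the updated components still sum to 1.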
² Except the case where the learning parameter b is much smaller than the parameter a, i.e., the L_R-εP scheme.
Figure 6.1. Number of steps needed for popt to reach 0.995 for different values of learning
parameters a and b: (top) maximum, average and minimum number of steps; (bottom) average and
minimum number of steps (over 500 runs; 3 actions; initial probabilities same; perfect case; note
the axis shift and scaling in the second plot).
Table 6.1. Behavior of the automata under general LR P scheme: 500 runs;
maximum step =1500; 3 actions with c = [0 1 1] (This data is plotted in Figure 6.1).
³ The mark * indicates the linear reward-penalty scheme L_R=P (a = b).
In some simulation runs, the probability popt did not reach 0.995. In this case, the statistical values are calculated by discarding the data for these runs, and therefore do not represent the actual behavior of the algorithm. The fact that the algorithm did not converge in 1500 steps does not necessarily mean that it would not converge if the limit were increased. Similarly, the fact that we have convergence in all 500 runs does not guarantee convergence at all times.

Figure 6.2 shows the results of the simulations for the parameter ranges 0 ≤ b ≤ 0.9 and 0.1 ≤ a ≤ 0.9. For each parameter pair, 500 simulations are completed. The number of simulations which did not converge in 1500 steps is indicated with the marker 'o'. The marker 'x' indicates that the probability of the pure optimal action converged to 0.995 in all 500 simulation runs. We know that for a = 0, convergence is not obtained no matter what the value of parameter b is (Appendix B). As seen from the plot, if the value of the parameter b is much larger than the parameter a for small values of a, convergence again is not guaranteed. The exact limit of the non-convergence region is not known, since the number of simulations is 500 and the limit of time steps is 1500. Furthermore, this region is different for different numbers of actions.
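The experiment behind Figure 6.2 can be reproduced in outline as follows. This is a minimal sketch, not the thesis's simulation code; the random seed and the particular (a, b) pair are ours:

```python
import random

def run_until_converged(c, a, b, max_steps=1500, threshold=0.995):
    """Simulate one automaton in a stationary environment with penalty
    probabilities c under the general linear reward-penalty scheme.
    Returns the step at which the optimal action's probability reaches
    threshold, or None if it never does within max_steps."""
    r = len(c)
    p = [1.0 / r] * r                    # equal initial probabilities
    opt = min(range(r), key=lambda k: c[k])
    for n in range(1, max_steps + 1):
        i = random.choices(range(r), weights=p)[0]
        beta = 1 if random.random() < c[i] else 0
        if beta == 0:                    # reward
            p = [(1 - a) * pj for pj in p]
            p[i] += a                    # p_i = p_i + a*(1 - p_i)
        else:                            # penalty
            p = [b / (r - 1) + (1 - b) * pj for pj in p]
            p[i] -= b / (r - 1)          # p_i = (1 - b) * p_i
        if p[opt] >= threshold:
            return n
    return None

# Count non-converging runs out of 500 for one (a, b) pair,
# with the perfect environment c = [0 1 1] used in Table 6.1
random.seed(0)
fails = sum(run_until_converged([0, 1, 1], a=0.2, b=0.1) is None
            for _ in range(500))
print(fails, "non-converging runs out of 500")
```

Sweeping (a, b) over a grid and plotting `fails` per pair reproduces the shape of the non-convergence region qualitatively.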
[Figure 6.2 plot: markers over the (a, b) grid; the non-converging run counts at small a include 65 (b = 0.9), 48 (b = 0.8), 47 (b = 0.7), 17 (b = 0.6), and 8 (b = 0.5).]
Figure 6.2. For some parameter values, convergence is not obtained for all 500 runs
(Number of non-converging runs are shown next to marker o).
Table 6.2 gives the maximum, minimum and average number of steps for the optimal action's probability to converge to 0.995, for a smaller range of learning parameters. As seen from the table (and Figure 6.1), a slight change in the value of the parameter affects the speed of convergence to the pure optimal strategy. Especially for smaller values of parameter a, the maximum, and consequently the average, number of steps decreases significantly with a decreasing value of parameter b. For example, the maximum number of steps decreases from 105 to 48 when we switch from the L_R=P scheme to the L_R-P scheme by decreasing the value of the parameter b from 0.2 to 0 while a = 0.2.
Changing the initial values of the action probabilities does not drastically change the behavior of the automaton, except to increase the number of time steps for convergence to the pure optimal strategy. Figure 6.3 shows the maximum, average and minimum number of steps for convergence with an initial probability vector of [0.05 0.4975 0.4975]. Besides the increase in the number of steps needed for convergence, the comparison of Figures 6.1 and 6.3 shows that the effects of the learning parameters are comparable.
Similarly, increasing the number of actions also increases the number of time steps needed for convergence; the effect of the parameters is generally the same. As seen from Tables 6.2 and 6.3, the minimum number of steps for convergence is not affected significantly, but the maximum number of steps, and therefore the average, increases with an increasing number of actions.

Furthermore, if the probabilities of penalty for all non-optimal actions (cj) are less than 1, the convergence is again slower, as expected (Figure 6.4). The change in the convergence rate is due to the decrease in the number of correct updates on the action probability vector. The effects of the parameters are again similar to previous cases. The effect of the parameter b on the number of non-converging runs and the average number of steps for convergence is significant when the learning parameter a is small.
Figure 6.3. Number of steps needed for popt to reach 0.995 for different values of learning parameters a and b: maximum, average and minimum number of steps (over 500 runs; 3 actions; perfect case; p(0) = [0.05 0.4975 0.4975]).
Figure 6.4. Number of steps needed for popt to reach 0.995 for different values of learning parameters a and b: maximum, average and minimum number of steps (500 runs; 3 actions; ideal case: c = [0 0.8 0.5]; initial probabilities are the same).
result in a complete proof; the establishment of ε-optimality of the L_R-P scheme in the general r-action case is incomplete [Narendra89].

Here, we will consider all values of the parameters a and b and use the general stability theorem on nonlinear discrete-time systems [Kalman62] to prove convergence for a specific region, without limiting the parameter b to small values. However, there is a price to pay: the use of the stability theorem limits this proof to the specific case where there is an optimal action⁴.
To prove the convergence to the pure optimal action, we start with the conditional expectation of the action probabilities. From the definition of the L_R-P scheme (Equation 6.1), we may write the conditional expectation for the probability of an action at the next step as:

$$
\begin{aligned}
E\big[p_i(n+1)\mid p_i(n)\big]
&= \sum_{k=1}^{r} E\big[p_i(n+1)\mid p_i(n),\ \alpha(n)=\alpha_k\big]\,p_k(n) \\
&= \sum_{k=1}^{r} c_k\, E\big[p_i(n+1)\mid p_i(n),\ \alpha(n)=\alpha_k,\ \beta(n)=1\big]\,p_k(n) \\
&\quad + \sum_{k=1}^{r} (1-c_k)\, E\big[p_i(n+1)\mid p_i(n),\ \alpha(n)=\alpha_k,\ \beta(n)=0\big]\,p_k(n) \\
&= c_i\,p_i(n)(1-b)\,p_i(n) + \sum_{k\neq i} c_k\,p_k(n)\left[\frac{b}{r-1} + (1-b)\,p_i(n)\right] \\
&\quad + (1-c_i)\,p_i(n)\big[p_i(n) + a(1-p_i(n))\big] + \sum_{k\neq i}(1-c_k)\,p_k(n)(1-a)\,p_i(n)
\end{aligned}
\tag{6.2}
$$
We can write this equality as:

$$
\begin{aligned}
E\big[p_i(n+1)\mid p_i(n)\big]
&= p_i(n)(1-b)\sum_{k=1}^{r} c_k\,p_k(n)
+ \frac{b}{r-1}\sum_{k\neq i} c_k\,p_k(n)
+ (1-a)\,p_i(n)\sum_{k=1}^{r} p_k(n) \\
&\quad - (1-a)\,p_i(n)\sum_{k=1}^{r} c_k\,p_k(n)
+ a\,p_i(n)(1-c_i)
\end{aligned}
\tag{6.3}
$$
Rearranging terms and using the fact that $\sum_{k=1}^{r} p_k(n) = 1$, we get:

$$
\begin{aligned}
E\big[p_i(n+1)\mid p_i(n)\big]
&= \big[(1-b)-(1-a)\big]\,p_i(n)\sum_{k=1}^{r} p_k(n)\,c_k
+ \frac{b}{r-1}\sum_{k\neq i} p_k(n)\,c_k
+ (1-a)\,p_i(n) + a\,p_i(n)(1-c_i) \\
&= (a-b)\,p_i(n)\sum_{k=1}^{r} p_k(n)\,c_k
+ \frac{b}{r-1}\sum_{k\neq i} p_k(n)\,c_k
+ (1-a\,c_i)\,p_i(n)
\end{aligned}
\tag{6.4}
$$
⁴ The case where there is an optimal action is the most common case in stochastic automata applications. The convergence of the probability vector to the pure optimal strategy in this case has been proven for almost all linear and nonlinear reinforcement schemes [Najim94, Narendra89].
For a = b, this corresponds to the difference equation for the L_R=P scheme [Narendra89]. Now, taking the expected value of both sides, we obtain:

$$
E\big[p_i(n+1)\big] = (a-b)\sum_{k=1}^{r} c_k\,E\big[p_i(n)\,p_k(n)\big]
+ \frac{b}{r-1}\sum_{k\neq i} c_k\,E\big[p_k(n)\big]
+ (1-a\,c_i)\,E\big[p_i(n)\big]
\tag{6.5}
$$
At this point, we will state two theorems for the stability of nonlinear discrete-time
systems which will be used in our proof.
Then the equilibrium state x = 0 is asymptotically stable in the large, and V(x) is a Lyapunov function. Condition (ii) may be replaced with:

(ii') ΔV(x) ≤ 0 for all x, and ΔV(x) does not vanish identically for any solution sequence {x(n)} satisfying Equation (6.6).
$$
\bar p_i(n+1) = \sum_{k=1}^{r} \lambda_{ik}\,\bar p_k(n)
+ \sum_{k=1}^{r} \gamma_{ik}\left(\bar p_i(n)\,\bar p_k(n) + \sigma^2_{ik}(n)\right)
\tag{6.9}
$$

where:

$$
\begin{aligned}
\bar p_i(n) &\equiv E\big[p_i(n)\big], \qquad i = 1,\dots,r \\
\sigma^2_{ij}(n) &\equiv \mathrm{Cov}\big(p_i(n),\,p_j(n)\big) \\
\gamma_{ik} &\equiv (a-b)\,c_k, \qquad k = 1,\dots,r \\
\lambda_{ii} &\equiv 1 - a\,c_i \\
\lambda_{ij} &\equiv \frac{b}{r-1}\,c_j, \qquad j \neq i
\end{aligned}
\tag{6.10}^5
$$
We first give the proof for a three-action automaton for the sake of clarity. For a three-action automaton with an optimal action (c₁ = 0), Equation 6.9 (or Equation 6.5) can be written as:

$$
\begin{aligned}
\bar p_1(n+1) &= \lambda_{11}\bar p_1(n) + \lambda_{12}\bar p_2(n) + \lambda_{13}\bar p_3(n) \\
&\quad + \gamma_{11}\bar p_1(n)^2 + \gamma_{11}\sigma^2_{11}(n) + \gamma_{12}\bar p_1(n)\bar p_2(n) + \gamma_{12}\sigma^2_{12}(n) + \gamma_{13}\bar p_1(n)\bar p_3(n) + \gamma_{13}\sigma^2_{13}(n) \\
\bar p_2(n+1) &= \lambda_{21}\bar p_1(n) + \lambda_{22}\bar p_2(n) + \lambda_{23}\bar p_3(n) \\
&\quad + \gamma_{21}\bar p_2(n)\bar p_1(n) + \gamma_{21}\sigma^2_{21}(n) + \gamma_{22}\bar p_2(n)^2 + \gamma_{22}\sigma^2_{22}(n) + \gamma_{23}\bar p_2(n)\bar p_3(n) + \gamma_{23}\sigma^2_{23}(n) \\
\bar p_3(n+1) &= \lambda_{31}\bar p_1(n) + \lambda_{32}\bar p_2(n) + \lambda_{33}\bar p_3(n) \\
&\quad + \gamma_{31}\bar p_3(n)\bar p_1(n) + \gamma_{31}\sigma^2_{31}(n) + \gamma_{32}\bar p_3(n)\bar p_2(n) + \gamma_{32}\sigma^2_{32}(n) + \gamma_{33}\bar p_3(n)^2 + \gamma_{33}\sigma^2_{33}(n)
\end{aligned}
\tag{6.11}
$$
where, for j = 2, 3 and k = 2, 3:

$$
\begin{aligned}
\gamma_{j1} &= (a-b)\cdot 0 = 0, & \gamma_{jk} &= (a-b)\,c_k \\
\lambda_{j1} &= \frac{b}{3-1}\cdot 0 = 0, & \lambda_{jj} &= 1 - a\,c_j, \qquad \lambda_{jk} = \frac{b}{3-1}\,c_k \ (k \neq j)
\end{aligned}
\tag{6.12}
$$
Again, σ²ᵢᵢ(n) are variances and σ²ᵢⱼ(n) are covariances of the action probabilities. Since the first action has zero penalty probability, the difference equations for the last two action probabilities do not include the probability of the first action. Therefore, the difference equations for the probabilities of the last two actions can be written as:

$$
\begin{aligned}
\bar p_2(n+1) &= (1-a\,c_2)\,\bar p_2(n) + \frac{b}{2}\,c_3\,\bar p_3(n) \\
&\quad + (a-b)c_2\,\bar p_2(n)^2 + (a-b)c_2\,\sigma^2_{22}(n) + (a-b)c_3\,\bar p_2(n)\bar p_3(n) + (a-b)c_3\,\sigma^2_{23}(n) \\
\bar p_3(n+1) &= \frac{b}{2}\,c_2\,\bar p_2(n) + (1-a\,c_3)\,\bar p_3(n) \\
&\quad + (a-b)c_2\,\bar p_3(n)\bar p_2(n) + (a-b)c_2\,\sigma^2_{32}(n) + (a-b)c_3\,\bar p_3(n)^2 + (a-b)c_3\,\sigma^2_{33}(n)
\end{aligned}
\tag{6.13}
$$
5
We omit the operator E[.] for simplicity.
Let us define a function f of the expected values of the last two probabilities as⁶:

$$
f\!\begin{pmatrix}\bar p_2\\ \bar p_3\end{pmatrix} =
\begin{pmatrix}
(1-a\,c_2)\,\bar p_2 + \frac{b}{2}c_3\,\bar p_3 + (a-b)c_2\big(\bar p_2^2 + \sigma^2_{22}\big) + (a-b)c_3\big(\bar p_2\bar p_3 + \sigma^2_{23}\big) \\[4pt]
\frac{b}{2}c_2\,\bar p_2 + (1-a\,c_3)\,\bar p_3 + (a-b)c_2\big(\bar p_3\bar p_2 + \sigma^2_{32}\big) + (a-b)c_3\big(\bar p_3^2 + \sigma^2_{33}\big)
\end{pmatrix}
\tag{6.14}
$$
If we can prove that this function is a contraction using the 1-norm⁷, then we can state, using Theorems 3.1 and 3.2, that the expected values of the probabilities of the last two actions asymptotically converge to 0. Thus, the probability p₁ of the optimal action asymptotically approaches 1.
The contraction condition is:

$$
\begin{aligned}
\left\| f\!\begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\end{pmatrix} \right\|_1
&= \Big| (1-a\,c_2)\bar p_2 + \tfrac{b}{2}c_3\bar p_3 + (a-b)\big(c_2\bar p_2^2 + c_2\sigma^2_{22} + c_3\bar p_2\bar p_3 + c_3\sigma^2_{23}\big) \Big| \\
&\quad + \Big| (1-a\,c_3)\bar p_3 + \tfrac{b}{2}c_2\bar p_2 + (a-b)\big(c_3\bar p_3^2 + c_3\sigma^2_{33} + c_2\bar p_2\bar p_3 + c_2\sigma^2_{32}\big) \Big| \\
&\overset{?}{<}\ \bar p_2(n) + \bar p_3(n) = \left\| \begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\end{pmatrix} \right\|_1
\end{aligned}
\tag{6.15}
$$
Since a, b, cᵢ and p̄ᵢ are all parameters/variables in [0, 1], and the last terms inside the absolute values are always positive⁸ for all n, we can write:

$$
\begin{aligned}
\left\| f\!\begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\end{pmatrix} \right\|_1
&\le (1-a\,c_2)\bar p_2 + \tfrac{b}{2}c_3\bar p_3 + |a-b|\big(c_2\bar p_2^2 + c_2\sigma^2_{22} + c_3\bar p_2\bar p_3 + c_3\sigma^2_{23}\big) \\
&\quad + (1-a\,c_3)\bar p_3 + \tfrac{b}{2}c_2\bar p_2 + |a-b|\big(c_3\bar p_3^2 + c_3\sigma^2_{33} + c_2\bar p_2\bar p_3 + c_2\sigma^2_{32}\big)
\end{aligned}
\tag{6.16}
$$
Rearranging the terms, we have:

$$
\begin{aligned}
\left\| f\!\begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\end{pmatrix} \right\|_1
&\le (1-a\,c_2)\bar p_2 + \tfrac{b}{2}c_2\bar p_2 + |a-b|\,c_2\big(\bar p_2^2 + \sigma^2_{22} + \bar p_2\bar p_3 + \sigma^2_{23}\big) \\
&\quad + (1-a\,c_3)\bar p_3 + \tfrac{b}{2}c_3\bar p_3 + |a-b|\,c_3\big(\bar p_3^2 + \sigma^2_{33} + \bar p_2\bar p_3 + \sigma^2_{23}\big)
\end{aligned}
\tag{6.17}
$$
⁶ Omitting the argument (n) for simplicity.
⁷ Denoted here as ‖·‖₁.
⁸ All the terms of the sum are positive: the first term includes the square of the expected value and the second term the variance, which are always positive; the third and fourth terms can be combined as cᵢ(p̄ᵢp̄ⱼ + σ²ᵢⱼ) = cᵢE[pᵢpⱼ]. The expected value of the product of any two action probabilities is always positive since 0 < pᵢ < 1.
Replacing b/2 by b without invalidating the inequality, and using the identity $E[xy] = E[x]\,E[y] + \sigma^2_{xy}$, we obtain:

$$
\begin{aligned}
\left\| f\!\begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\end{pmatrix} \right\|_1
&< (1-a\,c_2)\bar p_2 + b\,c_2\bar p_2 + |a-b|\,c_2\big(E[p_2^2] + E[p_2 p_3]\big) \\
&\quad + (1-a\,c_3)\bar p_3 + b\,c_3\bar p_3 + |a-b|\,c_3\big(E[p_3^2] + E[p_2 p_3]\big)
\end{aligned}
\tag{6.18}
$$

We know that $E[p_i^2] + E[p_i p_j] = E\big[p_i^2 + p_i p_j\big] = E\big[p_i(p_i + p_j)\big] < E[p_i]$, since the sum of all probabilities is equal to 1. Therefore:

$$
\left\| f\!\begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\end{pmatrix} \right\|_1
< (1-a\,c_2)\bar p_2 + b\,c_2\bar p_2 + |a-b|\,c_2\bar p_2
+ (1-a\,c_3)\bar p_3 + b\,c_3\bar p_3 + |a-b|\,c_3\bar p_3
\tag{6.19}
$$
For a > b, the coefficient of each term is $(1-a\,c_i) + b\,c_i + (a-b)\,c_i = 1$, so:

$$
\left\| f\!\begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\end{pmatrix} \right\|_1
< \bar p_2(n) + \bar p_3(n)
= \left\| \begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\end{pmatrix} \right\|_1
\tag{6.20}
$$
Therefore, for the region a > b (Figure 6.5), the expected values of the probabilities of the sub-optimal actions all converge to zero. Since the sum of all probabilities is equal to 1, the expected value of the optimal action's probability must converge to 1. Figure 6.6a shows the changes in the probabilities of the suboptimal actions for a linear reward-penalty scheme with a = 0.2 and b = 0.1 and initial probabilities p(0) = [0.05 0.50 0.45]. As seen from the plot, the sample-path probabilities p2 and p3 do not themselves converge asymptotically to 0. In 83 steps, the probability p1 of the optimal action reaches 0.999. The cumulative averages

$$
\bar p_i(n) = \frac{1}{n+1}\sum_{k=0}^{n} p_i(k), \qquad n = 0, 1, \dots, 83,\ \ i = 2, 3
$$

are plotted in Figure 6.6b.
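The cumulative average above is a simple running mean over the recorded probability vectors; a minimal sketch:

```python
def cumulative_averages(history):
    """history: list of probability vectors p(0), p(1), ...
    Returns the running means (1/(n+1)) * sum_{k<=n} p(k) for each n."""
    out, acc = [], [0.0] * len(history[0])
    for n, p in enumerate(history, start=1):
        acc = [s + x for s, x in zip(acc, p)]   # component-wise partial sum
        out.append([s / n for s in acc])        # divide by the count so far
    return out
```

Plotting the last elements of this sequence against each other reproduces the smoothed trajectory of Figure 6.6b from the noisy sample path of Figure 6.6a.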
[Figure 6.5 plot: the stable region in the (a, b) unit square, on the a > b side of the diagonal.]
Figure 6.5. The region for asymptotic convergence to the pure optimal strategy.
Figure 6.6. Probabilities (a) and cumulative averages (b) of the probabilities of the non-optimal actions for a 3-action automaton with L_R-P (a = 0.2, b = 0.1).
For r = N actions with $\bar{\mathbf p}(n) = \big[\bar p_1(n)\ \bar p_2(n)\ \dots\ \bar p_r(n)\big]^T$, the contraction condition can be written as:

$$
\left\| f\!\begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\\ \vdots\\ \bar p_N(n)\end{pmatrix} \right\|_1
< \sum_{i=2}^{N} \bar p_i(n)
= \left\| \begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\\ \vdots\\ \bar p_N(n)\end{pmatrix} \right\|_1
\tag{6.21}
$$
$$
\begin{aligned}
\big\| f(\bar{\mathbf p}(n)) \big\|_1
&= \sum_{i=2}^{N} \left| (1-a\,c_i)\,\bar p_i + \frac{b}{N-1}\sum_{\substack{j=2\\ j\neq i}}^{N} c_j\,\bar p_j + (a-b)\sum_{j=2}^{N} c_j\big(\bar p_i\bar p_j + \sigma^2_{ij}\big) \right| \\
&\le \sum_{i=2}^{N} (1-a\,c_i)\,\bar p_i + \frac{b}{N-1}\sum_{i=2}^{N}\sum_{\substack{j=2\\ j\neq i}}^{N} c_j\,\bar p_j + \left| (a-b)\sum_{i=2}^{N}\sum_{j=2}^{N} c_j\big(\bar p_i\bar p_j + \sigma^2_{ij}\big) \right|
\end{aligned}
\tag{6.22}
$$
Again, (1 − a cᵢ), p̄ᵢ(n), cᵢ and b are all positive, and the sum of all (p̄ᵢp̄ⱼ + σ²ᵢⱼ) is positive for all time steps. Therefore:

$$
\big\| f(\bar{\mathbf p}(n)) \big\|_1
\le \sum_{i=2}^{N} (1-a\,c_i)\,\bar p_i
+ \frac{b}{N-1}\sum_{i=2}^{N}\sum_{\substack{j=2\\ j\neq i}}^{N} c_j\,\bar p_j
+ |a-b|\sum_{i=2}^{N}\sum_{j=2}^{N} c_j\big(\bar p_i\bar p_j + \sigma^2_{ij}\big)
\tag{6.23}
$$
More explicitly:

$$
\begin{aligned}
\big\| f(\bar{\mathbf p}(n)) \big\|_1 \le\
&(1-a\,c_2)\,\bar p_2 + (1-a\,c_3)\,\bar p_3 + \dots + (1-a\,c_N)\,\bar p_N \\
&+ \frac{b}{N-1}\Big[\big(c_3\bar p_3 + c_4\bar p_4 + \dots + c_N\bar p_N\big)
+ \big(c_2\bar p_2 + c_4\bar p_4 + \dots + c_N\bar p_N\big) + \dots \\
&\qquad\qquad + \big(c_2\bar p_2 + c_3\bar p_3 + \dots + c_{N-1}\bar p_{N-1}\big)\Big] \\
&+ (a-b)\Big[c_2\big(\bar p_2^2 + \sigma^2_{22} + \bar p_2\bar p_3 + \sigma^2_{23} + \dots + \bar p_2\bar p_N + \sigma^2_{2N}\big) \\
&\qquad\qquad + c_3\big(\bar p_2\bar p_3 + \sigma^2_{32} + \bar p_3^2 + \sigma^2_{33} + \dots + \bar p_3\bar p_N + \sigma^2_{3N}\big) + \dots \\
&\qquad\qquad + c_N\big(\bar p_2\bar p_N + \sigma^2_{N2} + \bar p_3\bar p_N + \sigma^2_{N3} + \dots + \bar p_N^2 + \sigma^2_{NN}\big)\Big]
\end{aligned}
\tag{6.24}
$$
Rearranging, and using the identity $E[xy] = E[x]\,E[y] + \sigma^2_{xy}$, we have⁹:

$$
\begin{aligned}
\big\| f(\bar{\mathbf p}(n)) \big\|_1 \le\
&(1-a\,c_2)\,\bar p_2 + \dots + (1-a\,c_N)\,\bar p_N
+ \frac{b}{N-1}\Big[(N-2)\,c_2\,\bar p_2 + \dots + (N-2)\,c_N\,\bar p_N\Big] \\
&+ (a-b)\Big[c_2\big(E[p_2^2(n)] + E[p_2(n)p_3(n)] + \dots + E[p_2(n)p_N(n)]\big) \\
&\qquad\qquad + c_3\big(E[p_2(n)p_3(n)] + E[p_3^2(n)] + \dots + E[p_3(n)p_N(n)]\big) + \dots \\
&\qquad\qquad + c_N\big(E[p_2(n)p_N(n)] + E[p_3(n)p_N(n)] + \dots + E[p_N^2(n)]\big)\Big]
\end{aligned}
\tag{6.25}
$$

Since $\sum_{j=2}^{N} E\big[p_i(n)\,p_j(n)\big] = E\Big[p_i(n)\sum_{j=2}^{N} p_j(n)\Big] < E\big[p_i(n)\big]\cdot 1$, we may write:
$$
\begin{aligned}
\big\| f(\bar{\mathbf p}(n)) \big\|_1 <\
&(1-a\,c_2)\,\bar p_2 + \dots + (1-a\,c_N)\,\bar p_N
+ \frac{(N-2)\,b}{N-1}\big[c_2\,\bar p_2 + \dots + c_N\,\bar p_N\big] \\
&+ (a-b)\big[c_2\,\bar p_2 + \dots + c_N\,\bar p_N\big]
\end{aligned}
\tag{6.26}
$$

Combining the terms above and bounding the coefficient $\frac{N-2}{N-1} < 1$ by 1, we have:

$$
\big\| f(\bar{\mathbf p}(n)) \big\|_1
< \sum_{i=2}^{N} \big[(1-a\,c_i) + b\,c_i + (a-b)\,c_i\big]\,\bar p_i
= \sum_{i=2}^{N} \bar p_i
= \big\| \bar{\mathbf p}(n) \big\|_1
\tag{6.27}
$$
We have therefore proven that, for the region a > b (Figure 6.5), the expected values of the probabilities of the sub-optimal actions all converge to zero. This, in turn, implies that the probability of the optimal action goes to 1, i.e., the pure optimal strategy is obtained. The conditions for convergence are that the learning parameter associated with reward be greater than the learning parameter associated with penalty, and that there be one optimal action with c = 0. In short, we have the following:

Theorem 3.3: An automaton using the general linear reward-penalty scheme with unequal learning parameters in a stationary environment reaches the pure optimal strategy if a > b and there is an optimal action with c_opt = 0.

And, since the probability vector converges to the unit vector corresponding to the optimal strategy, the expected value of the average penalty converges to zero. Therefore, such an automaton is expedient and optimal (see the definitions in Chapter 3).
Although L_R-P is not absolutely expedient like the nonlinear schemes given in Chapter 3 and Section 6.2, it is optimal for a specific but widely encountered environment condition. It still remains to extend the proof, if at all possible, to the region {(a, b) ∈ [0, 1] × [0, 1] : a < b} (Figure 6.5). Also, the proof could be extended to all possible environments by considering values c_opt > 0. However, in this case, previous linearization approaches [Narendra89] and the contraction theorems 3.1 and 3.2 cannot be used.
10 This is because the penalty probabilities $c_i$ are independent of the action probabilities $p_i$.
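The scheme proved optimal above can be illustrated in simulation. The following is a minimal sketch of the standard general linear reward-penalty updates with unequal learning parameters a (reward) and b (penalty), following the usual formulation in [Narendra89]; the penalty probabilities `c`, the parameter values, and the function name `lrp_step` are illustrative assumptions, not taken from the text.

```python
import random

def lrp_step(p, a, b, c):
    """One iteration of the general linear reward-penalty (L_R-P) scheme
    with unequal learning parameters a (reward) and b (penalty)."""
    r = len(p)
    # Choose an action according to the current probability vector.
    i = random.choices(range(r), weights=p)[0]
    # Stationary environment: action i is penalized with probability c[i].
    penalized = random.random() < c[i]
    q = p[:]
    if not penalized:  # reward: move probability mass toward action i
        for j in range(r):
            q[j] = (1 - a) * p[j]
        q[i] = p[i] + a * (1 - p[i])
    else:              # penalty: move probability mass away from action i
        for j in range(r):
            q[j] = b / (r - 1) + (1 - b) * p[j]
        q[i] = (1 - b) * p[i]
    return q

random.seed(0)
# One optimal action with c = 0, and a > b, as Theorem 3.3 requires.
c = [0.0, 0.4, 0.7]
p = [1/3, 1/3, 1/3]
for n in range(3000):
    p = lrp_step(p, a=0.05, b=0.01, c=c)
print(round(p[0], 3))  # the probability of the optimal action approaches 1
```

With a > b and c_opt = 0, the probability of the optimal action drifts toward 1 regardless of the starting point; with a < b this drift argument breaks down, which is why the proof above is restricted to that region.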
Cem Ünsal Chapter 6. New Reinforcement Schemes for Stochastic Learning Automata 110
synthesis approach toward reinforcement schemes, and the first result was the absolutely
expedient nonlinear scheme of Lakshmivarahan and Thathachar [Lakshmivarahan73].
Definitions of several nonlinear reinforcement schemes were given in Section 3.5.2. In the
next section, we will introduce a new nonlinear scheme that works better than the previously
defined schemes [Baba83, Baba85] for several applications where fast convergence under certain
conditions is necessary.
if $\alpha(n) = \alpha_i$,

$$p_i(n+1) = p_i(n) - \left(1 - \frac{\beta_{i1} + \cdots + \beta_{iN_r}}{N_r}\right)\sum_{\substack{j=1 \\ j \neq i}}^{N} \phi_j(\mathbf{p}(n)) + \frac{\beta_{i1} + \cdots + \beta_{iN_r}}{N_r}\sum_{\substack{j=1 \\ j \neq i}}^{N} \psi_j(\mathbf{p}(n))$$

$$p_j(n+1) = p_j(n) + \left(1 - \frac{\beta_{i1} + \cdots + \beta_{iN_r}}{N_r}\right)\phi_j(\mathbf{p}(n)) - \frac{\beta_{i1} + \cdots + \beta_{iN_r}}{N_r}\,\psi_j(\mathbf{p}(n)) \quad \text{for all } j \neq i \qquad (6.28)$$
where the functions $\phi_i$ and $\psi_i$ satisfy the following conditions:

$$\frac{\phi_1(\mathbf{p}(n))}{p_1(n)} = \cdots = \frac{\phi_r(\mathbf{p}(n))}{p_r(n)} = \lambda(\mathbf{p}(n))$$

$$\frac{\psi_1(\mathbf{p}(n))}{p_1(n)} = \cdots = \frac{\psi_r(\mathbf{p}(n))}{p_r(n)} = \mu(\mathbf{p}(n))$$

$$p_j(n) + \phi_j(\mathbf{p}(n)) > 0 \qquad (6.29)$$

$$p_i(n) + \sum_{\substack{j=1 \\ j \neq i}}^{r} \psi_j(\mathbf{p}(n)) > 0$$
Parameter $\theta$ is an arbitrarily small positive real number. Also note that the function H includes $p_i$, which is the action probability corresponding to the current action.
[Figure: the function H(p) plotted against p; the vertical axis is marked 0 and 1/k, and the horizontal axis is marked 0.5 and 1.]
We now show that the defined algorithm (Equations 6.31-6.34) satisfies all the conditions in Equations 6.29. From Equation 6.32, we have:

$$\frac{\phi_k(\mathbf{P}(n))}{p_k(n)} = \frac{-k\,H(n)\,p_k(n)}{p_k(n)} = -k\,H(n) = \lambda(\mathbf{P}(n)) \qquad (6.35)$$

and:

$$\frac{\psi_k(\mathbf{P}(n))}{p_k(n)} = \frac{-\theta\,p_k(n)}{p_k(n)} = -\theta = \mu(\mathbf{P}(n)) \qquad (6.36)$$
That is, our definition is consistent with the first two conditions of Equation 6.29. There are three remaining necessary and sufficient conditions for absolute expediency. Using Equations 6.31 and 6.32, the rest of the conditions on $\phi_i$ and $\psi_i$ translate to the following:

Conditions (a) and (b) are associated with the reward updates while condition (c) is associated with the penalty updates. These conditions guarantee that the probabilities stay in the range (0, 1) at all times (with the assumption that none of the probabilities is initially 0 or 1). Conditions (a) and (b) can be shown to be satisfied using the fact that the sum of all probabilities is 1:

$$\text{(a)} \quad p_i + kH(1 - p_i) = p_i + kH\sum_{\substack{j=1 \\ j \neq i}}^{r} p_j < p_i + \sum_{\substack{j=1 \\ j \neq i}}^{r} p_j = 1 \quad \text{since } 0 < kH < 1 \qquad (6.38)$$
$H = 0 \Rightarrow p_i(n) > 0$. This is true for all values of the action probabilities $p_i$.

$$\mu(\mathbf{p}(n)) = -\theta < 0$$
$$\lambda(\mathbf{p}(n)) = -k\,H \le 0 \qquad (6.42)$$
$$\lambda(\mathbf{p}(n)) + \mu(\mathbf{p}(n)) = -k\,H - \theta < 0$$

because $0 < \theta < 1$, $0 < k < 1$, and $0 \le H < 1$. Since the conditions above are sufficient for absolute expediency, and we know that the nonlinear reinforcement scheme NLH satisfies all the conditions listed in the definition of a candidate nonlinear reinforcement scheme, we state that the algorithm given in Equations 6.31-6.34 is absolutely expedient in a stationary environment.
The data shown in Table 6.4 are the results of two different initial conditions: (a) all probabilities are initially the same, and only one action receives reward, and (b) the only action receiving a reward¹² from the environment has a very small probability value. The difference in convergence rate is more distinct in the situation where the probability of the optimal action is initially very close to 0¹³. As seen in Table 6.4, when p_opt = 0.005, the number of iteration steps to reach p_opt = 0.995 is reduced drastically for relatively large values of the learning parameter.

12 I.e., the optimal action.
13 This situation occurs frequently in our application to automated highway systems; for example, while the probability of the lateral action "shift left" is converging to 1, a vehicle may enter the left sensor range. In this case, we need a strong penalty update to decrease the probability of this action, while encouraging the action "stay in lane".
Especially when both learning parameters are large (see the shaded area in Table 6.4), the difference between the average number of steps for the two schemes is threefold or more. In order to have a fast update of the probability vector, the function H is set to the highest possible value (see Equation 6.34) satisfying the conditions for $\phi_i$ in Equation 6.29.
Chapter 7
Interacting Vehicles: Rules of the Game
In previous chapters, we introduced an intelligent control method for autonomous navigation and
path planning. The decision system mainly uses local information, and as a result, the actions
learned by the intelligent controller are not globally optimal; the vehicles can survive, but may not
be able to reach some of their goals. To overcome this problem, we visualize the interaction
between vehicles as sequences of games played between pairs of automata. Every game
corresponds to a state of the physical environment (the highway and the vehicles) as described
in Section 7.3. By evaluating these games, it is possible to design new decision rules, and analyze
the interactions by predicting the behavior and the position of each vehicle. Some of the
additional modules in Chapter 5 are actually the result of the approach described in detail here.
probability vector of actions defines the mixed strategy of a player/automaton; each element of
the probability vector corresponds to a specific action, or pure strategy.
Standard game definitions such as equilibrium point(s), dominant strategy, and pareto
optimality, are defined similarly. In our application, we are interested in equilibrium points where
no player has a positive reason to change its strategy, assuming that none of the other players is
going to change strategies. Such equilibrium points are also called Nash equilibria. All N-player
games with finite mixed strategy sets have at least one equilibrium. In a two-person zero-sum
game, there may be more than one equilibrium point with the same payoff. This is not
necessarily true for two-person non-zero sum games.
All descriptions above are defined on a game matrix. A game matrix is used to show the
payoff structure of a game. For example, for a two-player game where each player has two
actions (pure strategies), the game matrix D can be shown as:
$$D = \begin{pmatrix} (d_{11}^1, d_{11}^2) & (d_{12}^1, d_{12}^2) \\ (d_{21}^1, d_{21}^2) & (d_{22}^1, d_{22}^2) \end{pmatrix}$$

where $d_{ij}^k$ is the payoff¹ to player $k$ when the players play pure strategies $i$ and $j$. For a two-player zero-sum game, the equality $d_{ij}^1 = -d_{ij}^2$ holds. Several learning schemes were designed for zero-sum games; the solution to the problem (of obtaining the pareto optimal² or equilibrium point outcome) is given in [Lakshmivarahan82]. It is known that the linear reward-inaction $L_{R-I}$ and the $\epsilon$-optimal linear reward-penalty $L_{R-\epsilon P}$ (with $b \ll a$) schemes guarantee that the automata reach the best possible solution, i.e., the expected penalty reaches its minimum value. The results are also valid for identical payoff games [Lakshmivarahan81], and may be extended to N-player games [Narendra89].
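The convergence behavior described above can be illustrated with a small simulation of two automata playing a game under the $L_{R-I}$ scheme. The 2×2 matrix of penalty probabilities for player 1 (player 2's penalty taken as the complement, the usual zero-sum convention for automata games), the parameter values, and the helper name `lri_update` are illustrative assumptions; the matrix is chosen to have a pure saddle point.

```python
import random

def lri_update(p, i, a, rewarded):
    """Linear reward-inaction (L_R-I): update only on reward."""
    if rewarded:
        p = [(1 - a) * x for x in p]
        p[i] += a
    return p

random.seed(1)
# Penalty probabilities for player 1; player 2's are the complement.
# Illustrative matrix with a pure saddle point at (row 0, column 1).
D = [[0.2, 0.6],
     [0.8, 0.9]]
p1, p2 = [0.5, 0.5], [0.5, 0.5]
for n in range(20000):
    i = random.choices([0, 1], weights=p1)[0]
    j = random.choices([0, 1], weights=p2)[0]
    penalty1 = random.random() < D[i][j]
    p1 = lri_update(p1, i, 0.01, rewarded=not penalty1)
    p2 = lri_update(p2, j, 0.01, rewarded=penalty1)  # complementary response
print(round(p1[0], 2), round(p2[1], 2))  # both approach 1 at the saddle point
```

With a small learning parameter, the mixed strategies of both automata converge (with probability close to 1) to the pure strategies of the saddle point, as the cited results for zero-sum automata games predict.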
For non-zero sum games, however, unique rational solutions are difficult to define. While
considering non-zero sum automata games, it is not possible to evaluate the performance of the
automata using game theoretic arguments. The role of the learning schemes must be reevaluated
in such situations [Narendra89]. Unfortunately, relatively little work exists in this area. Here,
we will attempt to provide some insight on automata interaction of several vehicles by defining
games of automata and vehicles. The results obtained for non-zero sum games can be extended to
N-player non-zero sum games to show that the mixed strategies will converge to the equilibrium
point(s). In the case of multiple equilibrium points, the convergence depends on the initial
conditions of the probability vectors.
It is also important to note that, in an automata game, the players are not aware of the mixed strategies used by the other player(s), nor of their previous actions. In fact, the players do not even have knowledge of the distribution of the random payoff structure as a function of the strategy combinations. The interaction medium between the automata is the payoff function.
1 This is still the probability of receiving penalty, and therefore a player wants to minimize its payoff.
2 In an N-person game, an outcome is said to be pareto optimal if there is no other outcome in which all players simultaneously do better, i.e., receive less penalty from the environment.
Cem Ünsal Chapter 7. Interacting Vehicles: Rules of the Game 118
[Figure: interconnected automata in a vehicle. The lateral automaton A_LAT interacts with the lateral environments E¹_LAT and E²_LAT, and the longitudinal automaton A_LNG with the longitudinal environment E_LNG.]
We know that the lateral automaton A_LAT can operate in two stationary environments³ E^i_LAT (i = 1, 2). The difference between these two automaton environments is the response of the headway module (see Section 5.3.3). In some situations, the choice of the longitudinal action affects the response of the lateral environment. All other environment changes are due to changes in the physical environment, and we visualize those changes as the switching environments described in Section 7.3. Due to the interconnections, one would expect the longitudinal automaton A_LNG to

3 The word environment here describes the automata environment, not the physical one.
converge to its best action using absolutely expedient schemes. Lateral automaton ALAT in turn
would converge to the best action in the environment determined by ALNG.
Let us assume that an automated vehicle finds itself in the rightmost lane of a two-lane
highway, say by merging from an exit. Consider the situation in the first few seconds where the
automata environment stays stationary. With a relatively fast update rate, this assumption is
possible. The lateral action SR will receive a penalty until the vehicle shifts lane. Assuming the
vehicle is in its desired lane and speed range, the environment response for the longitudinal and
lateral actions depends on the output of the headway and left side sensor modules as given in
Table 7.1. Let us also assume that the probability that a sensed vehicle is moving away from the
autonomous vehicle is 1/3, and we know the probability of a vehicle being in a particular sensor
range4 as given in Table 7.2.
Range     A     B     C     D     E     none
ACC       1     1     1/3   0     0     0
DEC       0     0     0     0     0     0
SM        1     1/3   0     0     0     0
SL        0     0     0     1     1/3   0
SR        1     1     1     1     1     1
SiL⁵      1     1/3   1/3   0     0     0
(DEC)     0     0     1/3   0     0     0

Table 7.1. Assumed probabilities of penalty for each action based on the front and side sensors (see Figure 5.1 for range definitions).
Range        A     B     C     D     E     none
Probability  1/10  1/10  1/10  1/10  1/10  1/2

Table 7.2. Probability of sensing a vehicle in the sensor range.
Based on the numbers given in Tables 7.1 and 7.2, the probabilities of receiving a penalty for both automata can be calculated. For example, for the longitudinal action ACC, the payoff is

$$1\cdot\tfrac{1}{10} + 1\cdot\tfrac{1}{10} + \tfrac{1}{3}\cdot\tfrac{1}{10} + 0\cdot\tfrac{1}{10} + 0\cdot\tfrac{1}{10} + 0\cdot\tfrac{1}{2} = \tfrac{7}{30}.$$

Therefore, the game matrix for the longitudinal and lateral automata is:
4 Note that this is not required for the actual application; it is just an assumption to illustrate the effect of a changing environment for interconnected automata.
5 The second row indicates the probability of penalty when the longitudinal action is DEC.
         SL             SR           SiL
ACC   (7/30, 4/30)   (7/30, 1)    (7/30, 5/30)
DEC   (0, 4/30)      (0, 1)       (0, 1/30)
SM    (4/30, 4/30)   (4/30, 1)    (4/30, 5/30)
Entries in the first and third rows correspond to the first environment for the lateral automaton, while the entries of the second row correspond to the second environment. The difference between the two environments is due to the longitudinal action DEC: when the longitudinal automaton chooses DEC, the lateral environment is switched to the second environment, which changes the response to the lateral action SiL. If the vehicle is slowing down, lane shifts are discouraged (Figure 5.19). If the automata were not connected, absolutely expedient algorithms would be expected to converge to actions DEC and SL (without the connection, the lateral action SL is optimal since the penalty response from the front sensor is not suppressed). When the automata are connected as described, the optimal actions are DEC and SiL. Based on the probabilities given in Tables 7.1 and 7.2, this solution is pareto optimal and an equilibrium point for this game. In a situation where the probability of sensing a vehicle in region C is much larger than the others, the payoff of the lateral action SL is always less than that of SiL, and the action pair (DEC, SL) becomes the pareto optimal solution. The two-automata game will converge to this new optimal solution in this case.
Note that the situation above is very specific when we consider all sensors: the vehicles
must be cruising at their desired speed range and lane, the pinch sensor must not send a penalty
response, etc. For all other situations, the two automata may be considered unconnected. Using
the algorithms given in Chapter 6, the automata will converge to their best (optimal) actions
separately. The interaction between automata is via the physical environment, and for the
duration of a specific game we consider it to be stationary, resulting in a stationary automata
environment. Of course, the solution of such a disjoint game will be an equilibrium point (and a
pareto optimal solution) due to the convergence characteristics of the reinforcement schemes.
[Figure: two vehicles coupled through the physical environment; in each vehicle, the automata A_LNG and A_LAT interact with their environments E_LNG and E_LAT through the regulation layer.]
It is important to realize that lateral or longitudinal actions need not be fired for the automata environment to switch from one state to another. For example, if two vehicles cruising at their desired speed are in adjacent lanes, and if their idle actions SiL and SM are optimal, the physical environment may change due to the speed difference between the vehicles. The moment one of the vehicles clears the other's side sensor detection area, the lateral automata environment for both vehicles changes. Similarly, when idle actions are fired, the physical and automata environments may not change. The interactions between the actions and the physical environment, and between the physical and automata environments, are fairly complicated. In the next
section, we will introduce a representation scheme that will facilitate analysis of the changes in
the physical environment in relation to the automata environment.
[Figure 7.3, panels (a) and (b): two physical situations for interacting vehicles 1 and 2; panels (c) and (d): the corresponding 6 × 6 grids of combined environment responses over the actions ACC, DEC, SM, SiL, SL, SR of both vehicles.]
Figure 7.3. Situations for two interacting vehicles (a, b), and combined environment responses
(c, d) for situations (a) and (b) respectively (shaded triangles indicate a penalty; upper triangles
are associated with vehicle 2; optimal action pairs are indicated with black borders).
Consider a situation with three vehicles as shown in Figure 7.4a. Vehicles 1 and 2 are autonomous; vehicle 3 is not automated and can neither sense nor communicate. It is just an obstacle as far as the intelligent vehicles are concerned. Vehicle velocities are given as V1 = V3 > V2. Vehicle 2 has no lane preference, while vehicle 1 wants to shift to the middle lane. However, vehicle 1 cannot shift immediately to the middle lane since vehicle 2 is in the side sensor range (Figure 7.4a); the automata environment for this situation is given in Figure 7.4d. (The actions SM and SiL are combined as a single action IDLE. If a lateral action other than SiL is chosen, the row/column for the combined action IDLE refers to the lateral idle action, and vice versa. If both SiL and SM are chosen, the table shows the OR-ed response.)

Due to velocity differences, vehicle 2 drifts away from vehicle 1's sensor range (Figure 7.4b), and the automata environment switches (Figure 7.4e). In the meantime, the idle actions are fired repeatedly. With the new environment, the number of possible actions for vehicles 1 and 2 increases, and the lateral action SL becomes the optimal solution for vehicle 1. As a result, vehicle 1 changes lanes (Figure 7.4c), which in turn causes another automata environment change (Figure 7.4f).
Using the same reasoning, we can establish which automata environment corresponds to which physical situation-vehicle condition pairs. Since the automata in each vehicle use optimal or absolutely expedient algorithms, convergence to the optimal solution is guaranteed for all these
situations, provided that the automata have enough time to learn. Using this method, it is then
possible to predict how the vehicle will react to a specific physical situation. This will enable us
to define highway scenarios as described in the next section, and find solutions for intelligent
path planning.
[Figure 7.4, panels (a)-(c): three successive physical configurations of vehicles 1, 2 and 3, with vehicle 1 at its desired speed and desired lane = middle; panels (d)-(f): the corresponding automata environment grids over the actions ACC, DEC, Idle, SL, SR of vehicles 1 and 2.]
Figure 7.4. Changes in the physical (left) and automata environments (right): vehicle 1 shifts to
the middle lane.
for both hierarchical control and autonomous vehicle approaches. However, the second approach is a decentralized control method in nature, and finding a solution may be much more difficult than finding a globally optimal path strategy with a hierarchical architecture that has all the information about the highway situation.

This problem with the autonomous vehicle approach has not yet been answered by previous research efforts, while the hierarchical control structure inherently possesses the methodology to solve the problem. We visualize a possible situation with multiple interacting vehicles as a sequence of environment states. For every state of the physical environment, which includes the positions of the vehicles and the current parameters defining their behavior, a corresponding automata environment can be defined. The automata environment is analyzed to predict possible physical environment changes. These changes will be illustrated as state transitions. State diagrams formed using possible environment state transitions can then be used for analysis as well as design purposes.
Consider two vehicles sharing a 3-lane highway. The possible physical situations are given in Figure 7.5. Besides the relative lateral positions, we assume that only three possibilities exist for relative longitudinal positions. The distinguishing factor between these positions is whether a vehicle is in the side sensor range or not. Also note that for each state given in Figure 7.5, there is a reciprocal state with switched vehicle positions, denoted by an asterisk (e.g., B1*); therefore, the total number of states is 24. Figure 7.5 is basically a list of situations that are interesting for the analysis of vehicle interactions; there may be more situations. Appendix D.1 shows the relative positions of the two vehicles in a larger physical environment. Situations where the vehicles fall outside of the sensor range are combined into a single state, and further simplification gives the twelve situations given in this section.
Note that for any physical environment state in Figure 7.5, there may be multiple corresponding automata environments due to several factors such as desired speed and desired lane, as indicated in Figures 7.3 and 7.4.
To analyze the behavior of autonomous vehicles and the conflicts resulting from their
interactions, we define highway scenarios. A scenario is a specific situation with physical
locations of vehicles, their sensor outputs, and internal parameters such as desired lanes and
desired speeds. Once we know the automata environment at the beginning of a scenario, we can
predict the (state) changes in the physical environment. Then, all possible changes are combined
to form a state diagram showing the progress of the physical environment. The transitions from
one state are the direct results of the automata environment given by the matrices such as those in
Table 7.4.
To illustrate the idea, let us consider situation A1 in Figure 7.5. Assume the probabilities of possible vehicle actions are equal. For example, vehicle 1 may shift right, move ahead of vehicle 2, slow down, or do nothing while vehicle 2 keeps its speed and lane the same. These vehicle actions will cause transitions from A1 to A3, B1, C1 and A1 respectively. All possible transitions between defined states are shown in Figure 7.6. The links indicated with * show a transition to a reciprocal state, e.g., D1 → B1*, B2 → D2* or B3* → D1*.
[Figure 7.5: sketches of the twelve states A1-A3, B1-B3, C1-C3, D1-D3 showing the relative positions of vehicles 1 and 2.]
Figure 7.5. Possible physical environment states for 2 vehicles in 3-lane highway.
[Figure 7.6: state transition diagram over the states A1-A3, B1-B3, C1-C3, D1-D3; links marked with * lead to reciprocal states.]
Assuming equal probability for all vehicle actions in state A1, we can define a transition
matrix as follows. At state A1, vehicle 1 can shift right, move ahead, stay back or keep its speed
and lane. Similarly, vehicle 2 can shift left, move ahead, stay back or keep its speed and lane. We
assume that vehicles do not take actions simultaneously, i.e., only one vehicle can fire an action at
a given time. Thus, there are eight possible transitions with equal probability (Table 7.3).
Cem nsal Chapter 7. Interacting Vehicles: Rules of the Game 127
Actions                                     Transition to
Vehicle 2 idle, vehicle 1 moves ahead       B1
Vehicle 2 idle, vehicle 1 slips back        C1
Vehicle 2 idle, vehicle 1 idle              A1
Vehicle 2 idle, vehicle 1 shifts right      A3
Vehicle 1 idle, vehicle 2 moves ahead       C1
Vehicle 1 idle, vehicle 2 slips back        B1
Vehicle 1 idle, vehicle 2 idle              A1
Vehicle 1 idle, vehicle 2 shifts left       A2

Table 7.3. Action-transition pairs for state A1.
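The transition probabilities for state A1 follow from counting the eight equally likely action pairs in Table 7.3; a minimal sketch (the variable names are ours):

```python
from collections import Counter
from fractions import Fraction

# Resulting states for the eight equally likely action pairs of Table 7.3,
# assuming only one vehicle fires an action at a given time.
transitions = ['B1', 'C1', 'A1', 'A3',   # vehicle 2 idle, vehicle 1 acts
               'C1', 'B1', 'A1', 'A2']   # vehicle 1 idle, vehicle 2 acts

counts = Counter(transitions)
probs = {s: Fraction(c, len(transitions)) for s, c in counts.items()}
print(probs)  # B1, C1, A1 each with probability 1/4; A2, A3 each with 1/8
```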
Therefore, the probabilities of transition from state A1 to states B1, C1, A2 and A3 are
0.250, 0.250, 0.125, and 0.125 respectively. The environment stays at state A1 with probability
of 0.250. All other transitions for state A1 are not possible. Similarly, all probabilities of
transition for other states can be written, and the transition matrix T becomes:
[Transition matrix T (25 × 25) over the states A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3, the reciprocal states A1*, ..., D3*, and the collision state K. For example, its first row, for state A1, is (1/4, 1/8, 1/8, 1/4, 0, 0, 1/4, 0, ..., 0), and its last row, for state K, is all zeros except for a 1 on the diagonal.]
Cem nsal Chapter 7. Interacting Vehicles: Rules of the Game 128
Each row of the matrix T shows the probabilities of transition from a state, and sums to 1. Such a matrix is called a stochastic matrix. As seen above, it also includes the reciprocal states obtained by exchanging the locations of the vehicles in the states shown in Figure 7.5. The last row and column of the matrix correspond to a collision state, denoted by K.
It is also possible to consider simultaneous vehicle actions. For example, if vehicle 1 shifts to the middle lane while vehicle 2 moves ahead in state A1, a transition to state C3 will occur. The probability of simultaneous actions is relatively small; moreover, even if this is the case, the transition may be thought of as a sequence of two separate transitions from state A1 to C1 or A3, and then to state C3. Thus, we assume that only one vehicle may take an action at a time.
For the scenario above, we may assume that the probabilities of transition are constant for all steps. If the probabilities of finding the environment in a given state are known, the row vector consisting of these probabilities can then be multiplied by the transition matrix to evaluate the probability distribution over the states at the next time step. Note that the use of the transition matrix implies that the probability of the environment being in a state at time n depends only on the probability distribution at the previous step n-1. Therefore, the state transitions of the physical environment given above constitute a Markov chain.
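The propagation s(n+1) = s(n)·T can be sketched in a few lines. The 3-state matrix below is a made-up stand-in for the full 25-state matrix T, chosen only to show how probability mass accumulates in an absorbing state like K; the numbers are illustrative assumptions.

```python
def step(s, T):
    """One step of the Markov chain: s(n+1) = s(n) * T (row vector times matrix)."""
    n = len(T)
    return [sum(s[i] * T[i][j] for i in range(n)) for j in range(n)]

# Illustrative 3-state chain: two transient states and an absorbing state K.
T = [[0.50, 0.45, 0.05],
     [0.45, 0.50, 0.05],
     [0.00, 0.00, 1.00]]   # row K: the chain never leaves the collision state

s = [1.0, 0.0, 0.0]        # start in the first state with certainty
for n in range(200):
    s = step(s, T)
print(round(s[2], 3))      # → 1.0: mass in the absorbing state approaches 1
```

Because every transient state leaks a fixed fraction of its mass into K and K never returns any, the probability of being in K grows monotonically toward 1, which is exactly the behavior observed for the collision state below.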
Let us assume that the environment is at state A1 initially (n = 0). The row vector defining the probability distribution over the states is then:

s(0) = [1 0 0 ⋯ 0]  (1 × 25)

The probabilities of being at a specific state are then calculated by:

                 A1     A2     A3     B1    B2  B3   C1    C2  C3  ⋯  K
s(0)·T = s(1) = [0.250  0.125  0.125  0.250  0   0  0.250  0   0  ⋯  0]
For consecutive steps, the probability of being at state K increases. At n = 20, we have:

s(20) = [23 15 15 23 18 18 23 18 18 11 7 7 7 6 6 7 8 8 7 8 8 10 7 7 716] · 10⁻³

The state K for this Markov chain is an absorbing state⁶. We then conclude that two vehicles randomly choosing one of the possible actions on a three-lane highway will collide sooner or later, if, of course, our probabilistic model is correct.
6 A subset C of the state space of a Markov process is said to be closed if for every state $s_i \in C$ and $s_j \notin C$, the transition probability $t_{ij}$ is zero. If the subset C consists of a single state, that state is called an absorbing state. Only state K has that property in our example.
basic sensors and modules, there are no decisions/actions that would lead to a goal state (or
states). Starting at state A1, possible transitions are to states A1, A2, and A3. For transitions to
states A2 or A3, one of the vehicles must fill its memory vector with a lane shifting action much
faster than the other. If one of the pinch modules detects the other vehicle signaling for lane shift,
the state A1 will be permanent, and idle action SM will fire repeatedly.
If the vehicles were to change their speed, the transitions shown in Table 7.4 are possible
with the current internal parameters. Figure 7.8 shows the corresponding state diagram where
arrows indicate the direction of transition. Two possible chains with the solution states as
absorbing states B1* and C1* are distinguished with solid and dashed lines.
From    To
Chain 1
A1      A1, A2, A3
A2      A2
A3      A3
Chain 2
B1      B1, B2, B3
B2      B2, D1*, D2*
B3      B3, D1*, D3*
D1*     D1*, C2*, C3*
D2*     D2*, C2*
D3*     D3*, C3*
C2*     C1*, C2*
C3*     C1*, C3*
C1*     C1*
Chain 3
C1      C1, C2, C3
C2      C2, D1, D2
C3      C3, D1, D3
D1      D1, B2*, B3*
D2      D2, B2*
D3      D3, B3*
B2*     B1*, B2*
B3*     B1*, B3*
B1*     B1*

Table 7.4. Possible transitions for Scenario 1.
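The claims made below about chain 2 (that C1* is absorbing and reachable from every state in the chain) can be checked mechanically from Table 7.4 with a breadth-first search; the dictionary encoding and the helper name `reachable` are ours.

```python
from collections import deque

# Possible transitions for chain 2 (Table 7.4).
chain2 = {'B1': ['B1', 'B2', 'B3'], 'B2': ['B2', 'D1*', 'D2*'],
          'B3': ['B3', 'D1*', 'D3*'], 'D1*': ['D1*', 'C2*', 'C3*'],
          'D2*': ['D2*', 'C2*'], 'D3*': ['D3*', 'C3*'],
          'C2*': ['C1*', 'C2*'], 'C3*': ['C1*', 'C3*'], 'C1*': ['C1*']}

def reachable(graph, start):
    """All states reachable from `start` by following the listed transitions."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# C1* is absorbing: its only transition is to itself ...
assert reachable(chain2, 'C1*') == {'C1*'}
# ... and it is reachable from every state in chain 2.
print(all('C1*' in reachable(chain2, s) for s in chain2))  # → True
```

The same check applied to chain 3 confirms B1* as its absorbing goal state.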
[Figure 7.8: state diagram of the transitions listed in Table 7.4.]
Figure 7.8. Possible chains for Scenario 1 (reflexive transitions are not shown; two
chains are distinguished with dashed and solid lines).
From Table 7.4 and Figure 7.8, it is obvious that if the initial state is A1, the vehicles will not be able to reach their desired lanes under the current circumstances. The final situation is then A1, A2 or A3. The states C1* and B1* represent the goal states for this situation and are reachable from all other states in their respective chains. Furthermore, these two states are absorbing states for chains 2 and 3 respectively. Given the vehicles' desire to shift lanes, the physical environment is guaranteed to switch to state C1* or B1* starting at any state in chain 2 or 3. Therefore, all we need to do is to force the physical environment to switch from A1, A2 or A3 to any of the states in chain 2 or 3. Possible transitions are from Ai to Bi or Ci where i = 1, 2, 3. All these transitions require a speed change for at least one of the vehicles.
Consider the situation A1 with the automata environment given in Figure 7.9a. If the
penalty-reward structure can be changed to one of the matrices shown in Figures 7.9b-7.9e, or
any suitable combination of longitudinal actions, the physical environment will then switch to
state B1 or C1 depending on the chosen action(s). If at least one of the vehicles can be forced to
change its speed by changing its automata environment, the physical environment will switch
after some time, leading to a goal state. The same reasoning is true for states A2 and A3.
[Figure 7.9, panels (a)-(e): penalty-reward grids over the actions ACC, DEC, Idle, SL, SR of vehicles 1 and 2.]
Figure 7.9. Possible penalty-reward structures (b-e) to force physical environment to switch
to states B1 or C1 from current state A1.
Therefore, in order to introduce a change to the automata environment, the flag structure given in Section 5.3.1 is designed. If a vehicle cannot shift to its desired lane in a predefined time interval, a flag is set. This flag changes the vehicle's desired speed, usually to a value smaller than the current speed. The learning automata environment then changes as shown in Figure 7.9b or 7.9c; the transition in the physical environment state comes later, attaching the states Ai to chain 2 or 3 (Figure 7.8). Consecutive state transitions are automatic for this situation.
All possible environment states for three vehicles on a three-lane highway are given in Appendix D.2. Again, assuming that only one vehicle can take an action at a time, the state transitions leading to a solution would be similar to the one given in Figure 7.11. This transition diagram only shows a solution where vehicles 1 and 2 slow down. A few other solutions are also possible if different speed adjustments are considered.
[Figure 7.11: chain through the states C2, C3, D5, K2, J3, F2, C6, A5.]
Figure 7.11. A possible chain for Scenario 2: lane flag forces vehicles 1 and 2 to slow down.
All the transitions except the first one are automatic under the current circumstances. For the first transition, on the other hand, the lane flag needs to be set in at least one vehicle (if it is vehicle 3), breaking the symmetry. The problem and the solution for this case are similar to the two-vehicle situation given in the previous section. This is not a coincidence; it is due to the superposition of two two-vehicle situations.
The term superposition indicates that a three-vehicle situation given here can be treated
as three separate two-vehicle interactions. In terms of the two-vehicle states, the state transition
diagram above can be written as separate transition diagrams using the two-vehicle situations
previously described, as shown in Figure 7.12.
[Figure 7.12: the three-vehicle chain of Scenario 2 (vehicles 1, 2 and 3): C2, C3, D5, K2, J3, F2, C6, A5, aligned with the two-vehicle chains of Scenario 1a (vehicles 1 and 2): A2, A2, A1, A3, A3, A3, A1, A1; Scenario 1b (vehicles 1 and 3): A1, C1, C1, C3, D1, B2*, C2, D2; and Scenario 1c (vehicles 2 and 3): A3, C3, D3, D3, B3*, B1*.]
As seen from Figure 7.12, the three-vehicle situation is nothing more than three asynchronous Markov chains representing two-vehicle situations. In Figure 7.12, the corresponding two-vehicle states are aligned with the three-vehicle environment states. Transitions that need to be forced by the lane flag are shown in gray, and they are (and must be) between corresponding states in both the three-vehicle and two-vehicle transition diagrams. The two-vehicle scenario including vehicles 1 and 2 is automatic, i.e., there are no conflicts. The other two scenarios both have a synchronous forced transition.
Similarly, for the three-vehicle scenario of Section 5.3.1, it is possible to view the
situation as three two-vehicle transition diagrams. Figure 7.13 shows the corresponding two- and
three-vehicle chains. Again, the transition breaking the symmetry occurs between the
corresponding states of the three- and two-vehicle scenarios. When the three-vehicle situation is
forced from state K2 to state L2, the two-vehicle scenario of vehicles 1 and 2 is forced from state A3
(corresponding to state K2) to state B3* (corresponding to state L2). The two-vehicle scenario
including vehicles 1 and 3 is automatic, and the one between vehicles 2 and 3 is not significant, i.e.,
there are no conflicts, nor transitions. The situation between vehicles 1 and 2 is the defining scenario.
[Figure 7.13: the three-vehicle chain of Scenario 2 (vehicles 1, 2 and 3: K2, L2, J5, D1) aligned
with the two-vehicle chains of Scenario 1a (vehicles 1 and 2: A3, B3*, D1, C2), Scenario 1b
(vehicles 1 and 3: D3, D3, C3, C1), and Scenario 1c (vehicles 2 and 3: C3, C3, C3, C3).]
Analysis of the example situations given in this section indicates that it is possible to
define complex situations of multiple interacting vehicles as a group of many (conflicting and
non-conflicting) two-vehicle situations. A complex scenario is nothing more than a superposition
of multiple two-vehicle scenarios. When the state transitions of significant two-vehicle scenarios
corresponding to the transition diagram of the N-vehicle (N > 2) scenario are studied, it is seen
that the key transition that breaks the symmetry in the N-vehicle situation is synchronous with at
least one of the two-vehicle symmetry-breaking transitions.
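The pairwise decomposition described above can be sketched in a few lines. The projection maps below are read off the aligned diagrams of Figure 7.12 and should be taken as illustrative assumptions rather than the exact tables of Appendix D:

```python
# Sketch: projecting a three-vehicle state chain onto its two-vehicle
# sub-scenarios. The state labels follow Figures 7.11-7.12; the projection
# dictionaries are assumptions for illustration only.

three_vehicle_chain = ["C2", "C3", "D5", "K2", "J3", "F2"]

# Hypothetical projections of each three-vehicle state onto the two-vehicle
# states of Scenarios 1a (vehicles 1 and 2) and 1b (vehicles 1 and 3).
projections = {
    "vehicles 1 and 2": {"C2": "A2", "C3": "A2", "D5": "A1",
                         "K2": "A3", "J3": "A3", "F2": "A3"},
    "vehicles 1 and 3": {"C2": "C1", "C3": "C1", "D5": "C3",
                         "K2": "D1", "J3": "B2*", "F2": "B2*"},
}

def pairwise_transitions(chain, projection):
    """Project the N-vehicle chain and keep only actual state changes."""
    projected = [projection[state] for state in chain]
    return [(a, b) for a, b in zip(projected, projected[1:]) if a != b]

for pair, projection in projections.items():
    print(pair, "->", pairwise_transitions(three_vehicle_chain, projection))
```

Each two-vehicle chain recovered this way is exactly the asynchronous Markov chain discussed in the text; transitions that project onto a self-loop are invisible to that vehicle pair.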
to slow down to avoid collisions, the fact that they do not have a lane preference will prevent
them from shifting to the middle lane.
For our analysis of the situation, we only consider vehicles 1 and 2 while defining the
environment states since interactions of vehicles 3 and 4 with others are unaffected by any
possible actions vehicles 1 and 2 may take. Therefore, by analyzing the interactions between the
first two vehicles, we should be able to find a solution to this conflict. The states in which we are
interested are listed in Appendix D.3. Initially the environment is in state B2. Since vehicles 1
and 2 are traveling at the same speed and faster than vehicles 3 and 4, the transition to state D1 is
automatic (Figure 7.15).
[Figure 7.15: transition diagram with states B2, D1, C1, F2 and G2; the transition B2 → D1 is
automatic, and the transitions from D1 to C1 and F2 must be forced.]
At state D1, the two transitions that will solve the conflict, but need to be forced, are the
transitions from state D1 to states C1 and F2. From these states, the chain will move to goal
states C3, E4, or G3 (see Appendix D.3). Then, vehicles 1 and 2 will increase their speed and
pass the other vehicles. Since the desired lanes for vehicles 1 and 2 are not set, another method needs
to be found to force the environment state to switch to one of the chains leading to a goal state.
The solution to the problem is what we defined as the speed flag in Section 5.3.2. Under current
circumstances, vehicles 1 and 2 both fire their idle actions repetitively at state D1. With the
addition of the speed flag, both vehicles decide to shift to another lane after a predefined time
interval. Since shifting to the middle lane becomes the only possible lateral action, one of the
states C1 or F2 is reached.
If both vehicles start filling their memory vectors with conflicting actions, then the pinch
modules will start returning a penalty for these actions. At this point one of the vehicles will
decide against shifting to the middle lane, and the other will be the first to shift as described in
Section 5.2.2, therefore changing the environment state to either C1 or F2. Figure 7.16 includes the
mpeg movies of two separate runs for this highway scenario where vehicles 1 and 2 mutually
find a solution through their pinch modules. The time interval for the speed flag is chosen as 6
sec. In the first simulation (Figures 7.16a, c, e), vehicle 1 shifts to the middle lane immediately
after the speed flag is set, returning a penalty response to lateral action SiL. In the second
example (Figures 7.16b, d, f), although the flag is again set after 6 seconds, vehicles 1 and 2 need
some time to negotiate the shift via their pinch modules. Since the last deviation from the
desired speed of 85 km/h is approximately the same for both vehicles (Figure 7.16b), they attempt
to shift to the middle lane at approximately the same time, leading to a pinch condition.
Therefore, reaching the solution takes slightly more time in the second simulation.
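The speed-flag mechanism can be sketched as a simple timer. The class below is a hypothetical illustration; only the 6-second interval and the idea of setting the flag after a sustained deviation from the desired speed come from the text:

```python
# Sketch of the speed-flag mechanism of Section 5.3.2: if a vehicle has been
# held below its desired speed for longer than a fixed interval, the flag is
# set and a lane shift becomes the preferred action. The class, its parameter
# names, and the tolerance value are illustrative assumptions.

class SpeedFlag:
    def __init__(self, interval_sec=6.0, tolerance_kmh=1.0):
        self.interval = interval_sec    # time below desired speed before the flag sets
        self.tolerance = tolerance_kmh  # deadband around the desired speed
        self.time_below = 0.0
        self.is_set = False

    def update(self, current_speed, desired_speed, dt):
        """Advance the timer by dt seconds; set the flag when it expires."""
        if desired_speed - current_speed > self.tolerance:
            self.time_below += dt
        else:
            self.time_below = 0.0
            self.is_set = False
        if self.time_below >= self.interval:
            self.is_set = True
        return self.is_set

flag = SpeedFlag()
# Vehicle stuck at 78 km/h behind slower traffic while desiring 85 km/h:
for _ in range(70):                 # 7 seconds of simulation at 0.1 s steps
    flag.update(78.0, 85.0, dt=0.1)
print(flag.is_set)                  # flag set after the 6-second interval
```

Once the flag is set, shifting to the middle lane becomes the only lateral action considered, which is what drives the environment from state D1 toward C1 or F2 in the scenario above.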
[Figure 7.16 panels: (a, b) vehicle speeds (72-88 km/h) versus time; (c, d) vehicle lanes versus
time; (e, f) mpeg movies ch7m1.mpg (0.065 Mb) and ch7m2.mpg (0.177 Mb), every frame
representing one fifth of a second.]
Figure 7.16. Speeds (a, b) and positions (c, d) of vehicles 1 and 2 for scenario 3:
(left) vehicle 1 shifts first; (right) vehicle 2 shifts first. Mpeg movies of the
simulations are accessible via icons (e, f).
Figure 7.15 shows only two of the possible transitions from state D1; there are two other
possibilities: switching to state B3 or to state C4. These two possibilities become a solution to
the problem at hand when both vehicles cannot shift lanes because of their pinch modules'
respective penalty outputs. The probability of occurrence for this situation is very small;
however, it is not zero. If this is the case, the vehicles will not be able to change to the middle
lane, now defined as the desired lane by the speed flag intervention. Then, the lane flag will be set
(see Section 5.3.1) after a time period, forcing the vehicles to slow down. Since the desired
speeds set by the lane flag are different, the symmetry will be broken, and one of the goal states
will be reached via state B3 or C4.
7.5 Discussion
This chapter introduced the treatment of path decisions for autonomous vehicles as interacting
automata, and then at a higher level, as interacting vehicles. In each intelligent vehicle, lateral and
longitudinal automata create an interacting system which can be visualized as an automata game.
In our application, this game is usually disjoint, i.e., the mixed strategies of the automata do not
depend on the others' strategies. Therefore, the optimal action pair is reached. For the case where the
longitudinal automaton's actions affect the lateral automaton's environment, the interacting
system again reaches the equilibrium point, which happens to be the optimal solution for a given
environment.
The visualization of multiple automata interactions as games is based on the assumption
that the physical (and therefore, the automata) environment is stationary. Assuming that the
physical environment does not change for a relatively long period of time, and provided that the
learning rates for all automata are fast enough, it is possible to assign a stationary automata
environment to any physical environment consisting of locations and internal parameters of
vehicles in which we are interested for a given situation. Changes of the physical environment can
then be treated as a Markov chain whose transitions are direct results of the automata
environment evaluated for each state of the physical environment. Situations involving multiple
vehicles can be analyzed by investigating the transition diagrams resulting from a specific
situation, as described in Section 7.4.
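As a sketch of this viewpoint, the transition diagram of a physical environment can be encoded as a small graph whose edges are either automatic or flag-forced. The state names echo the scenario of Section 7.4.3, but the particular diagram and function below are illustrative assumptions:

```python
# Sketch: a physical-environment transition diagram treated as a small
# Markov-chain-like structure. "auto" edges follow from the automata
# environment alone; "forced" edges require a flag to be set. The diagram
# below is an illustrative assumption loosely based on Figure 7.15.

transitions = {
    "B2": [("D1", "auto")],
    "D1": [("C1", "forced"), ("F2", "forced")],
    "C1": [("C3", "auto")],
    "F2": [("G3", "auto")],
}
goal_states = {"C3", "G3"}

def walk(start, flag_set, choose_forced=0):
    """Follow the diagram from `start`; forced edges are usable only when
    the flag is set, and `choose_forced` picks among multiple options."""
    state = start
    path = [state]
    while state not in goal_states:
        options = transitions.get(state, [])
        usable = [s for s, kind in options if kind == "auto" or flag_set]
        if not usable:
            break                      # stuck: waiting for a flag to be set
        state = usable[min(choose_forced, len(usable) - 1)]
        path.append(state)
    return path

print(walk("B2", flag_set=False))   # stalls at D1 without the speed flag
print(walk("B2", flag_set=True))    # reaches a goal state once the flag is set
```

The same bookkeeping works for any decision mechanism, provided the automatic and forced transitions can be read off the decision procedure for each physical state.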
Despite the fact that we used automata environments to evaluate a physical
environment's state transitions, the method can also be used for other decision mechanisms. As
long as we can define a formal way of describing the decision/control procedure resulting from a
specific environment condition, the transition diagrams for physical environment states can be
created.
The treatment of the highway scenarios is based on 3x3 or 3x4 location matrices. Larger
matrices can also be used; however, those may be redundant for our analysis of the
situations given in Section 7.4. For a larger number of vehicles or different sensor definitions,
larger location matrices may be needed, but again, it is possible to represent multi-vehicle
scenarios as simultaneous two-vehicle interactions as described in Section 7.4.2.
Chapter 8
Conclusions
This chapter opens with a brief discussion of the automated highway system as a possible
solution to current traffic problems and a short evaluation of the reinforcement schemes as an
automata learning process. The results and contributions of this dissertation are listed in the
second section. The chapter concludes with the recommendations for future research.
8.1 Discussion
The automated highway system envisioned by researchers and institutions around the world
suggests a complex structure that requires extensive research efforts. Among many issues, the
realization of AHS depends mainly upon the realization of intelligent vehicle control methods.
No matter what design approach is taken for vehicle control, there are two important issues that
will haunt the researchers. The first problem facing AHS is a social issue: user
acceptance. Without user acceptance, the aim of automating the highway system is bound to
fail. We will briefly discuss the possible future of the AHS deployment in the light of the current
influence of the technology on society. The second problem is the technical feasibility of the
designed systems. While designing the intelligent path planner described in this work, the
transition from abstraction to implementation was one of our concerns. We will therefore briefly
discuss the technical issues that hinder the realization of an automated highway system.
We designed our intelligent vehicle controller using an AI technique called the learning
automata. The adaptation of the automata to existing conditions is provided by the reinforcement
schemes mentioned in previous chapters. We will thus conclude our discussion with a few ideas
on reinforcement schemes.
1. The author of this dissertation has traveled more than 40,000 miles on US interstate highways while working on
the design of intelligent vehicle and highway systems, and on evaluation tools for intelligent transportation systems.
Cem Ünsal Chapter 8. Conclusions 140
communication aspects of AHS (and ITS) is the information content of such a system. Most of
the projects for full automation of the highway system envision total knowledge about the
highway and the vehicles.
Present AHS and ITS applications such as traffic monitoring, electronic toll collection and
license plate recognition are already employing identification/classification techniques that can be
used to generate information about the drivers/owners of the vehicles. Human factors research on
AHS systems shows that drivers would not want the infrastructure to know everything about
their actions, locations, and daily routines. This problem is called the "big brother" syndrome.
The objections to usage of this type of data include fear that the security of the data will be
compromised, and overall discomfort with the modern trend toward routine collection of personal
data. Initial system analyses indicated that check-in procedures envisioned for AHS
implementation may include processing of financial, medical, or driving records to verify
potential users' qualifications [Delco94].
The information content of the data collected for AHS purposes is very important for the
user acceptance of the automated highway system. While the collection of driver data may be
perceived as an invasion of privacy, some of the problems may be solved by choosing the least
invasive AHS implementation with minimal data storage. Such issues are important because they
may influence the design of the communication and processing functions of the (longitudinal)
control system for a hierarchical infrastructure [Delco94]. For the autonomous vehicle approach
that will most probably include some form of visual control system, similar problems exist.
With the current technological progress and its applications to daily life, the invasion of
privacy is becoming an important issue. A recent example is the implementation of a global
computer network, a.k.a. Internet or World Wide Web. More and more individuals are starting to
use the network in their daily routine, and in return, they are providing the WWW server, the
Internet service provider or the routing system with some form of information. This could be
login data, an e-mail message, or a server log of the visited pages with the address of the client
machine. In most cases, it is possible for a computer user to monitor the activities of another
from his console. A similar and more common example for today is the use of credit cards. Every
time a charge card is used, the corresponding computer system records the time, date, location
and type of the transaction which can later be sold to third parties. Yet, many are comfortable
with the accessibility that a charge card or an online computer system provides (to all interested
parties).
The use of AHS or ITS systems such as automatic toll collection or automatic vehicle
identification (or ATIS, ATMS, and AVL systems in the near future) is no different from the
technologies mentioned above. A trade-off between comfort and privacy exists, and will exist in
the future. The evolution of the AHS will stop at a level where the majority of the users are no
longer willing to trade their privacy for the additional comfort the system may bring. As AHS
becomes more visible, its complexity may lead to misunderstandings and misrepresentations that
(Footnote 1, continued) The reflections in this section are the results of his AHS experience as a researcher and his
driving experience as a user of the present highway system.
may jeopardize public acceptance [Batelle94]. Considering the recent mistakes made with the
latest technological revolution, the Internet, it is obvious that a balanced and accurate picture of
AHS must be presented to the public.
Recent investigations addressing the user acceptance of AHS, and legal and technical
privacy issues in AHS technologies can be found in [Stevenson95], [Wright95], [Garfinkel96],
[McGurrin96], and [Wigan96].
headway control, and proximity sensors for lane departure warning are currently used in research
vehicles [Özgüner96, Pomerlau96, Tan96, Weber96, Yanagisawa92, Construction96].
The research on sensing technologies is gaining momentum due to the ongoing research on
AHS, and due to the fact that most of these technologies find quick commercial applications. The
research on sensing technologies for AHS was initially driven by the robotics field; during the last
few years more research has been carried out specifically for intelligent vehicle applications. The
sensor capabilities for autonomous navigation will increase with the investigation of new
alternatives and the progress of the previous technologies.
Communication with the roadside and other vehicles is another important factor for AHS
implementation. Again, a reliable communication system is vital for a hierarchical control
approach. A hierarchical control structure requires three types of communications:
communications for control, maneuvers, and navigation information. For the first type, high
bandwidth is required. Furthermore, the loss of data must be minimized to maintain stability of
platoons, and to assure collision-free lane changes [Foreman95]. Communicated data must be
trustworthy. A real-life analogy to the problem of the reliability of received information is the
driver who waits at an intersection until another car approaching from the left actually makes its
right turn, even though its right-turn signal is flashing. This type of behavior is clever
on the part of a human driver and useful in avoiding collisions, but it is against everything
AHS tries to achieve. An automated system repeatedly checking every bit of information against
errors cannot be successful in enhancing the traffic flow. On a greater scale, a hierarchical
automated highway system must optimize traffic flow by coordinating vehicle actions in an
intelligent way. The need to increase traffic flow theoretically increases the risk of accidents. The
reason for this is a decreasing possibility of recovering from a wrong decision due to a smaller
reaction window.
Commercial communication systems for AHS and ITS include new radar detectors with
road hazard and emergency vehicle signals, traffic information systems, fleet-management
systems, in-vehicle mobile offices and on-line services, MAYDAY devices and emergency
services, and electronic toll collection [ITSW96, TrafT96]. On the other hand, vehicles currently
used for research purposes are equipped with 2.5GHz band radio transceivers, vehicle-to-vehicle
radio modem communications [Construction96, PATH96]. Ultra wide band communication is
also considered as an alternative [James96]. The network architecture for vehicle-to-vehicle
communications is currently being defined [Bolla96, Fuji96], and related problems such as
interference and scattering are being investigated [Gavan96, Zoratti95].
Progress in the electronics field, especially in wireless communications, will positively
affect the deployment of AHS2. AHS is actually a combination of many engineering fields. For
example, the design of ITS/AHS technologies is carried out by government organizations
in conjunction with the Institute of Electrical and Electronics Engineers (IEEE), the Society of
2. AHS will use emerging technologies as soon as they become available during its evolution to fully automated
highways. For example, drivers with cellular phones are already acting as probes in traffic via emergency phone
lines. Such an information source was not available ten years ago.
Automotive Engineers (SAE), the Institute of Transportation Engineers (ITE), and the American
Society for Testing and Materials (ASTM) [ITSst96]. The National AHS Consortium consists of
many aerospace and automotive companies as well as civil, mechanical, electrical engineering, and
computer science departments of several educational institutions. AHS is a multi-disciplinary
research field. In addition to the technologies mentioned above, its realization depends on many
engineering applications such as systems and control, fiber optics, automotive design, data
mining, data scheduling, and human factors. It is slowly creating its own niche in many related
fields of engineering, boosting research efforts and resources.
Automotive companies are hoping to make AHS technologies such as cruise control and
lane departure warnings commercially available in the next five years. Lane tracking and vision
enhancement applications are expected to be available in ten years [Cogan96]. It is interesting to
note that the current AHS research has already found answers to the problems related to these
applications. The transition from research to commercial availability is only a matter of cost
reduction. Considering the rate of technological progress in our century, the future of the AHS
technologies looks bright.
Returning to our assumptions on the vehicle capabilities, it is fair to say that the
assumptions made on sensing and communications for autonomous vehicles are feasible, but not
widely available.
structured. It will not be easily predicted, nor analyzed. In the case of mixed traffic (human
drivers and automated vehicles coexisting on the same highway), the problem grows enormously.
As a result, an automated vehicle must be able to deal effectively with a large number of possible
traffic situations.
The last statement above emphasizes the importance of learning and adaptation. What is
needed is a system that can handle situations unforeseen by the designers. For this type of
capability, the choice of artificial intelligence technologies is obvious. While expert systems and
knowledge-based planning [Forbes95] are effective, learning methods are capable of discovering
new situations and optimal responses using simulated environments [Sukthankar96].
One of the most important characteristics of reinforcement learning is the choice
between actions that are known to return positive responses and actions with unknown
performance. For example, a learning automaton starting with a priori action probability values
needs time to evaluate the actions: it learns from its mistakes.
This process can be examined from two different points of view [Ashby60]. It can be
thought of simply as an attempt to find the best solution. In case of failure, the attempt is not
rewarded (or is penalized). This view of the learning process is not very attractive. On the other
hand, the trial and error method is invaluable in gathering information necessary to achieve
optimal action. Choosing non-optimal actions to gain information improves the long-term
performance while resulting in temporary penalty responses from the environment. Learning
automata are among the successful examples of such methods.
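As a concrete (and purely illustrative) example of this trade-off, the classical linear reward-penalty (L_R-P) scheme updates the action probability vector after each environment response; the environment below is a hypothetical stand-in, not one of the schemes designed in this dissertation:

```python
# A minimal linear reward-penalty (L_R-P) learning automaton, included to
# make the exploration/exploitation trade-off concrete. The penalty
# probabilities and step sizes are illustrative assumptions.

import random

def lrp_update(p, chosen, beta, a=0.1, b=0.05):
    """Update action probabilities p after action `chosen` drew response
    `beta` (0 = reward, 1 = penalty); a/b are reward/penalty step sizes."""
    r = len(p)
    q = p[:]
    for j in range(r):
        if beta == 0:                 # reward: move probability toward chosen
            q[j] = p[j] + a * (1 - p[j]) if j == chosen else (1 - a) * p[j]
        else:                         # penalty: move probability away from it
            q[j] = (1 - b) * p[j] if j == chosen else b / (r - 1) + (1 - b) * p[j]
    return q

random.seed(1)
penalty_prob = [0.7, 0.2, 0.9]        # hypothetical teacher: action 1 is best
p = [1 / 3] * 3                       # a priori probabilities: pure exploration
for _ in range(2000):
    action = random.choices(range(3), weights=p)[0]
    beta = 1 if random.random() < penalty_prob[action] else 0
    p = lrp_update(p, action, beta)
print(max(range(3), key=p.__getitem__))   # index of the currently favored action
```

Early on the automaton samples all actions, accepting penalties to gather information; as the probability vector sharpens, exploration gives way to exploitation of the action with the lowest penalty probability. Both update branches preserve the property that the probabilities sum to one.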
Another important aspect of reinforcement learning is that the actions performed by
an automaton influence the environment response that the automaton will receive in the future.
Rather than receiving an independently chosen set of action-response pairs, the agent has some
form of control over what response it will receive and complete control over what actions will be
generated in response. In addition to making it difficult to make distributional statements about
the inputs to the agent, this degree of control makes it possible for what seem like small
experiments to cause the agent (automaton) to discover an entirely new part of its environment
[Kaelbling94].
The reinforcement scheme is the heart of the learning process, and its characteristics define
the behavior of the automata. However, besides the reinforcement scheme, the definition of
the reward-penalty structure (i.e., the teacher) is vital in the success of the learning automata
application. The automata cannot be expected to learn the best actions with a defective
teacher structure. The significance of the reward-penalty structure is emphasized in
environments with multiple teachers. We have spent a significant amount of time designing a
function that intelligently combines multiple environment responses.
The scarcity of applications using continuous environment responses is understandable. The
difficulty of combining such responses, as well as of finding a meaningful representation of
several factors with the same measure, hinders the use of S-model environments. There are no
known applications that use this model.
The use of learning algorithms in our application is based on the assumption that the physical
environment around an autonomous vehicle can be associated with a stationary automata
environment (or multiple switching automata environments). We assume that the automata
adaptation to the environment is faster than the changes in the physical environment.
Although current computational technology is sufficient to support such an assumption, we
defined two new reinforcement schemes for our purposes. These new algorithms have
convergence rates up to four times faster than those of previously designed counterparts,
especially in situations frequently occurring in AHS applications.
One of these new reinforcement schemes has been previously known. However, its
convergence characteristics had not been proven due to the difficulty of analytically modeling
the learning process. By using nonlinear methods and a special condition resulting from our
design, we were able to prove the optimality of the algorithm.
The second algorithm is also the result of our attempts to improve the convergence
characteristics of the learning automaton. It is an extension of a previously designed nonlinear
reinforcement scheme, but performs better. This algorithm is proven to be absolutely
expedient.
Our work on the design and analysis of an intelligent vehicle control using learning
automata enables us to conclude the following:
The intelligent vehicle path controller consisting of two stochastic learning automata is
capable of adapting to its dynamic environment by choosing the best (lateral and longitudinal)
actions to be sent to the lower level of control.
This non-model-based approach would be especially useful in situations wherein
complete knowledge about the overall traffic flow is not provided by higher levels of the
control hierarchy (if such levels exist at all). Instead of trying to foresee all possible traffic
situations, we take the approach of defining a mechanism that can make intelligent decisions
based on local sensor information, keeping in mind the fact that the initial phases of the AHS
will include non-automated vehicles as well as intelligent vehicles capable of communicating
with others.
The method is capable of capturing the overall dynamics of the system that include the
vehicle, the driver and the roadway. Definitions of the learning and sensor parameters
determine the behavior of each vehicle.
Initial results indicated that the capabilities of the vehicle sensors must be extended. As presented
in Chapter 5, these extensions are not too demanding, although some may be relatively expensive
to realize. A simple rate-feedback method for the headway distance measurements can
overcome the oscillatory behavior in longitudinal control.
Simulations of intelligent vehicles also indicated the need for additional information sources
besides the local sensors. No matter what control structure is used for AHS deployment,
some form of communication must exist between vehicles. Although visual cues can be
used to coordinate lateral actions, the lane changing capabilities of the intelligent vehicles as
well as the safety of the actions increase with local vehicle-to-vehicle communications. We
have found that, in order to avoid pinch situations, vehicles may coordinate their lane
changing actions by simply sending an intention signal to neighboring vehicles.
Our attempt to design an intelligent path controller inadvertently extended, to some degree, to
other levels of vehicle control. For example, we have found that if a higher level of the
control/decision mechanism provides desired lane information, many local conflicts (whose
solutions are not globally optimal) may be resolved to optimize overall traffic flow. Furthermore,
the capabilities of the lower layers of control are important when designing the
intelligent path controller. Although we did not consider feedback from the control layer that
actually carries out the vehicle actions, the design parameters of the lower layer affect the
design parameters of the path planning layer. All levels of control are interconnected, and no
level can be treated as a free-running entity.
There is a trade-off between what the automated vehicle can accomplish and how simple the
sensing/information system is. The more global the information content of the decision
mechanism is, the more the vehicle can accomplish autonomously. It is our belief that neither
a fully decentralized control method, nor a completely hierarchical system can solve the
problem completely. While a decentralized system will require some form of local
communications, a hierarchical system will suffer from being unable to foresee all conflicts
between vehicle paths.
Our approach to the vehicle path control problem can be viewed as an autonomous vehicle
approach. Automated vehicles are making their own decisions on which action to take by
using information from their own sensors and limited communication systems. Although each
vehicle is capable of avoiding collisions, the combined actions of multiple vehicles are not
always optimal. Previous research efforts in autonomous vehicles discussed in Section 1.2.1
have not considered multiple vehicle interactions. In this work, we also attempted to find a
structured methodology for the interactions of multiple autonomous vehicles.
The method of evaluating possible environment state transitions based on associated
automata environments enabled us to define additional decision mechanisms we called flags.
Speed and lane flags are used to solve the conflict situations arising from the multiple teacher
responses and vehicle interactions.
Although our method of evaluating the physical environment's state changes is based on the
learning automata environment, similar methods can also be used with other decision
mechanisms. By formal descriptions of the decision/control procedure, transition diagrams
similar to those given in Chapter 7 can be created to analyze the highway situations.
8.3 Recommendations
AHS research, especially high-level vehicle path control, is a relatively new research area, and
many questions remain to be answered. There are multiple possibilities for
extending the work discussed in this dissertation. Here, we will emphasize some of these
possibilities that seem relatively important for realization of intelligent vehicle control. Our
recommendations for future research efforts on the subject are the following:
Simulation:
Simulation is an indispensable way of testing the effectiveness of vehicle control methods
since the system becomes highly complex due to the presence of a large number of vehicles.
Furthermore, the ideas investigated in this work cannot be tested with real vehicles, partly
due to the reasons listed in Section 2.5, but mainly because of the futuristic character of the
problem. The overall system is neither continuous nor discrete-time; the combination of
vehicle dynamics and the computational modules forms a hybrid system whose design and
analysis is inherently difficult. For the work described here, we simplified the vehicle
dynamics drastically since the main purpose of this research was the study of learning
automata techniques for vehicle control. More realistic vehicle dynamics must be
incorporated with the computational system, and the study of the interaction between the
discrete and continuous parts of the overall system needs to be carried out for a precise
control system simulation.
Sensor modeling:
For the purpose of studying the learning automata as an intelligent vehicle controller, we
assumed almost perfect sensor models. Our sensor module subroutines in the simulation
incorporate a small percentage of measurement error. Beyond the addition of the measurement
noise, we assumed that the sensors defined in this work can be implemented using the current
technology. More realistic models of the sensors based on existing radar, laser radar, infrared
and sonar techniques may be created to further the simulation's ability to imitate real-life
situations. The issues of sensor degradation and back-up methods also have to be addressed
for fault-tolerant autonomous navigation applications.
High level vehicle path control:
Our study of the vehicle path planner indicated that this level of vehicle control cannot be
designed or analyzed without considering the lower and higher levels of the control methodology.
Especially for the hierarchical control structure, feedback from the regulation layer and the
Cem Ünsal Chapter 8. Conclusions 148
information sent from the link layer of the hierarchy must be taken into account. This study
of the planning layer did not consider a possible "unable to comply" signal from the low-level
control modules. Furthermore, we assumed that global information such as the desired lane
and speed is available. Previous and current research on AHS has not yet answered such
issues. Lane and speed assignments by a higher layer in the control hierarchy may prove to be
crucial for obtaining optimal traffic flow.
Vehicle interactions:
The problem of coordination between autonomous vehicles is inherently addressed by a
centralized control approach. However, decentralized AHS methods such as autonomous
vehicle approach [Bayouth96] or cooperative approach [McKendree96] must still find an
answer to this problem. In the case of mixed traffic, which may be a possibility in the future of
AHS, the problem is far more serious. Analysis and design methods described in this work
address the vehicle interaction by simplifying multiple vehicle scenarios (situations) into
multiple two-vehicle interactions by using additional simplifications on the relative position
of the vehicles (the matrices in Section 7.4 and Appendix D). This is possible because of our
definition of sensors and the reward-penalty structure used for the learning automata. The
idea of analyzing multiple vehicle interactions can be applied to other vehicle control
methodologies, provided that the states of the physical environment and the decision
environment resulting from them are carefully defined. More detailed and complex
definitions for highway scenarios may be necessary.
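The pairwise reduction can be sketched as follows: each neighbor is evaluated independently as a two-vehicle interaction, and the environment response is the logical OR of the pairwise results (a Python sketch; the penalty condition is a hypothetical stand-in for the sensor/reward structure of Section 7.4, not the actual matrices):

```python
def pair_penalty(ego: dict, other: dict, close_gap: float = 10.0) -> bool:
    """Two-vehicle interaction: penalize if the other vehicle is in the
    same lane within a (hypothetical) unsafe gap ahead of the ego vehicle."""
    gap = other["x"] - ego["x"]
    return other["lane"] == ego["lane"] and 0.0 < gap < close_gap

def environment_response(ego: dict, neighbors: list) -> int:
    """beta = 1 (penalty) if ANY pairwise interaction is unsafe, else 0:
    the multi-vehicle situation collapses into two-vehicle checks."""
    return int(any(pair_penalty(ego, o) for o in neighbors))
```

Because the response depends only on whether some pairwise check fails, a scenario with many vehicles never has to be enumerated as a whole, which is the simplification exploited in Chapter 7.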
Learning automata:
The idea behind the adaptive vehicle controller presented here is very simple: two automata
synchronously adapt to their (teacher) environment. During our study, we have observed
that the automata interactions and multiple teacher learning processes in multi-teacher
environments are relatively untouched areas of research. The intelligent controller may be
extended to multiple automata associated with a single teacher (i.e., sensor module). In this
case, the analysis of the interactions between automata will be more difficult. However,
translating the problem of combining the environment response to the problem of combining
the multiple actions may have its advantages. The application of the S-model environment
then may prove to be relatively easier, bringing the advantage of continuously mapping the
sensed physical environment to the automata environment. Yet, the idea of combining automata
actions has not been investigated previously.
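For reference, the building block these extensions start from, a single variable-structure automaton with the general linear reward-penalty update of Section 6.1, can be sketched as follows (a Python sketch; the learning parameters a and b are illustrative):

```python
def lrp_update(p: list, i: int, beta: int,
               a: float = 0.1, b: float = 0.1) -> list:
    """General linear reward-penalty (L_R-P) update for chosen action i.
    beta = 0 is a reward, beta = 1 a penalty; the probability vector
    stays normalized in either case."""
    r = len(p)
    if beta == 0:
        # reward: move probability toward the chosen action
        return [p[j] + a * (1.0 - p[j]) if j == i else (1.0 - a) * p[j]
                for j in range(r)]
    # penalty: move probability away from the chosen action
    return [(1.0 - b) * p[j] if j == i else b / (r - 1) + (1.0 - b) * p[j]
            for j in range(r)]
```

Extending this to multiple automata per teacher amounts to deciding how several such updates, or several environment responses, are combined, which is exactly the open question raised above.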
Bibliography
[Agogino96] Agogino, A., K. Goebel, and S. Alag, Intelligent Sensor Validation and Fusion,
Intellimotion, vol. 5, no. 2, pp. 6-7, 1996.
[Aoki95] Aoki, T., T. Suzuki, and S. Okuma, Acquisition of Optimal Action Selection in
Autonomous Mobile Robot Using Learning Automata (Experimental Evaluation), Proceedings
of the IEEE Conference on Fuzzy Logic and Neural Networks/Evolutionary Computation, pp. 56-
63, Nagoya, Japan, Nov. 1995.
[Ashby60] Ashby, W. R., Design for a Brain: The Origin of Adaptive Behaviour, New York,
NY: John Wiley and Sons, 1960.
[Asher96] Asher, H. J., and B. A. Galler, Collision Warning Using Neighboring Vehicle
Information, Proceedings of the 6th Annual Meeting of ITS America, vol. 2, pp. 674-684, 1996.
[Atkinson65] Atkinson, R.C., G.H. Bower and E.J. Crothers, An Introduction to Mathematical
Learning Theory, New York: Wiley, 1965.
[Baba83] Baba, N., The Absolutely Expedient Nonlinear Reinforcement Schemes under the
Unknown Multi-Teacher Environment, IEEE Transactions on Systems, Man and Cybernetics,
vol. SMC-13, pp. 100-108, 1983.
[Baba85] Baba, N., New Topics in Learning Automata Theory and Applications, Number 71 in
Lecture Notes in Control and Information Sciences, Berlin: Springer-Verlag, 1985.
[Bloomfield96b] Bloomfield, J. R., Human Factors Aspects of Transferring Control from the
Driver to the Automated Highway System with Varying Degrees of Automation, Technical
Report FHWA-RD-95-108, Federal Highway Administration, Turner-Fairbank Highway
Research Center, McLean VA, 1996.
[Bolla96] Bolla, R., F. Davoli, and C. Nobile, A Mobile Network Architecture for Vehicle-to-
Vehicle Communications, Proceedings of the 3rd World Congress in ITS, Orlando, FL, October
1996.
[Bush58] Bush, R. R., and F. Mosteller, Stochastic Models for Learning, New York: Wiley,
1958.
[Card93] Statement of Andrew H. Card, President and Chief Executive Officer of American
Automobile Manufacturers Association, November 10th, 1993 Hearing before the Subcommittee
on Investigations and Oversight of the Committee on Science, Space and Technology, U.S. House
of Representatives, 103rd Congress, First Session, pp. 108-109. U.S. Gov. Printing Office,
Washington 1994. (Doc. Y4.sci2 no.103/75)
[Chand68] Chandrasekharan, B., and D.W.C. Shen, On Expediency and Convergence in Variable
Structure Stochastic Automata, IEEE Transactions on System Science and Cybernetics, vol.
SSC-5, pp. 145-149, 1968.
[Chand69] Chandrasekharan, B., and D.W.C. Shen, Stochastic Automata Games, IEEE
Transactions on System Science and Cybernetics, vol. SSC-5, pp. 145-149, 1969.
[Chee95] Chee, W., Lane Change Maneuvers for Automated Highway Systems, Research
Report, University of California, Berkeley, 1995.
[Chen96] Chen, C., B. Foreman, A Discussion of the WaveLAN Radio as Relevant to AVCS,
Technical Note 96-1, Institute of Transportation Studies, University of California, Berkeley,
1996.
[Cogan96] Cogan, R., Wheels in Motion, ITS World, Technology and Applications for
Intelligent Transportation Systems, pp. 18-22, July/August 1996, ISSN 1086-2145.
[Cremer94] Cremer, J., J. Kearney, Y. Papelis, and R. Romano, The Software Architecture for
Scenario Control in the Iowa Driving Simulator, Proceedings of the 4th Computer Generated
Forces and Behavioral Representation, May 1994.
[CSIM96] Mesquite CSIM 18- A Development Toolkit for Simulation and Modeling,
Mesquite Software, Inc., http://www.mesquite.com/csim18.htm (Jan 2, 1997).
[Deshpande96] Deshpande, A., Design and Evaluation Tools for Automated Highway
Systems, in Hybrid Systems III : Verification and Control. New Brunswick: Springer-Verlag, pp.
138-148, 1996.
[Field92] Intelligent Vehicle Highway systems: Review of Field Trials, Organization for Economic
Cooperation and Development, 1992.
[Forbes95] Forbes, J., T. Huang, K. Kanazawa, and S. Russell, The BATmobile: Towards a
Bayesian Automated Taxi, Proceedings of the 14th International Joint Conference on Artificial
Intelligence, Montreal, Canada, 1995.
[Frank89] Frank, A. A., S. J. Liu, and S. C. Liang, Longitudinal Control Concepts for Automated
Automobiles and Trucks Operating on a Cooperative Highway, in Vehicle/Highway Automation:
Technology and Policy Issues, Society of Automotive Engineers, pp. 61-68, 1989.
[Fu65a] Fu, K.S., and G.J. McMurtry, A Study of Stochastic Automata as Models of
Adaptive and Learning Controllers, Technical Report TR-EE 65-8, Purdue University,
Lafayette, IN, 1965.
[Fu65b] Fu, K.S., and G.J. McMurtry, An Application of Stochastic Automata to the
Synthesis of Learning Systems, Technical Report TR-EE 65-17, Purdue University, Lafayette,
IN., 1965.
[Fu67] Fu, K.S., Stochastic Automata as Models of Learning Systems, in Computer and
Information Sciences II, J.T. Lou, Editor, New York: Academic, 1967.
[Fu69a] Fu, K.S., and T.J. Li, Formulation of Learning Automata and Automata Games,
Information Science, vol. 1, no. 3, pp. 237-256, 1969.
[Fu69b] Fu, K.S., and T.J. Li, On Stochastic Automata and Languages, Information Science,
vol. 1, no. 4, pp. 403-419, 1969.
[Fu71] Fu, K.S., Stochastic Automata, Stochastic Languages and Pattern Recognition, Journal
of Cybernetics, vol. 1, no. 3, pp. 31-48, 1971.
[Fuji96] Fuji, H., O. Hayashi, and Y. Hirao, Experimental Research on Protocol of Inter-Vehicle
Communications for Multiple Vehicles, Proceedings of the 3rd World Congress in ITS, Orlando,
FL, October 1996.
[Garfinkel96] Garfinkel, S. L., Why Driver Privacy Must Be a Part of ITS, Converging
Infrastructures: Intelligent Transportation and the National Information Infrastructure,
Cambridge, MA: MIT Press, pp. 324-340, 1996.
[Gavan96] Gavan, J., and F. Handler, Analysis, Computation and Mitigation of the Interference
to a Remote Receiver from Two Collocated Vehicular Transceivers, IEEE Transactions on
Vehicular Technology, vol. 45, no. 3, pp. 431-442, 1996.
[Gilbert92] Gilbert, V., J. Thibault, and K. Najim, Learning Automata for the Control and
Optimization of a Continuous Stirred Tank Fermenter, IFAC Symposium on Adaptive systems in
Control and Signal Processing, Grenoble, France, July 1992.
[Godbole95] Godbole, D., J. Lygeros, and S. Sastry, Hierarchical Hybrid Control : a Case
Study, in Hybrid systems II, Berlin: Springer-Verlag, pp. 166-190, 1995.
[Godbole96] Godbole, D., Towards a Fault Tolerant AHS Design. Part II, Design and
Verification of Communication Protocols, Technical Report UCB-ITS-PRR-96-15, PATH
Program, Institute of Transportation Studies, University of California, Berkeley, 1996.
[Hedrick96] Hedrick, J. K., J.C. Gerdes, D. B. Maciuca, D. Swaroop, and V. Garg, Longitudinal
Control Development for IVHS Fully Automated and Semi-Automated Systems: Phase II,
Technical Report UCB-ITS-PRR-96-01, PATH Program, Institute of Transportation Studies,
University of California, Berkeley, 1996.
[Hessburg95] Hessburg, T., and M. Tomizuka, Fuzzy Logic Control for Lane Change
Maneuvers in Lateral Vehicle Guidance, Technical Report UCB-ITS-PWP-95-13, PATH
Program, Institute of Transportation Studies, University of California, Berkeley, 1995.
[Ho96] Ho, F.-S., and P. Ioannou, Traffic Flow Modeling and Control using Artificial Neural
Networks, IEEE Control Systems, vol. 16, no. 5, pp. 1626, October 1996.
[Ikonen97] Ikonen, E., and K. Najim, Use of Learning Automata in Distributed Fuzzy Logic
Processor Training, IEE Proceedings of Control Theory and Applications, in press, 1997.
[ITSJ96] ITS Handbook in Japan, Highway Industry Development Organization, Tokyo, Japan.
[ITSW96] ITS World, Technology and Applications for Intelligent Transportation Systems, issues
July-August, September-October 1996, Advanstar Communications, Eugene, OR, ISSN 1086-2145.
[James96] James, R.D., Widening the Net, COMTrans (Communication Networks for ITS),
supplement to Traffic Technology International, pp. 60-64, August 1996.
[Kachroo95b] Kachroo, P., Sliding Mode Full-state Feedback Longitudinal Control of Vehicles
in an Automated Highway System (AHS), Mobile Robots X, SPIE, Bellingham WA, pp. 260-
268, 1995.
[Kachroo95c] Kachroo, P., and M. Tomizuka, Vehicle Control for Automated Highway
Systems for Improved Lateral Maneuverability, Proceedings of the IEEE International
Conference on Systems, Man and Cybernetics, Vancouver, B.C., vol. 1, pp. 777-782, 1995.
[Kachroo96] Kachroo, P., A. Walimbe, and B. Benham, Longitudinal and Lateral Vehicle
Control Setup in the Flexible Low-Cost Automated Scaled Highway (FLASH) Laboratory, in
the Proceedings of ITS America, Sixth Annual Meeting, 1996.
[Kalman62] Kalman, R. E., and J. E. Bertram, Control System Analysis and Design via the
Second Method of Lyapunov: II. Discrete-Time Systems, ASME Journal of Basic Engineering,
ser. D, 82, pp. 394-400, 1962.
[Lasky93] Lasky, T. L., and B. Ravani, A Review of Research Related to Automated Highway
System (AHS), Interim Report for Federal Highway Administration, Contract no. DTFH61-93-
C-00189, University of California, Davis, October 25, 1993.
[Lee95] Lee, H., D.W. Love, and M. Tomizuka, Longitudinal Maneuvering Control for
Automated Highway Systems, Proceedings of the American Control Conference, Seattle, Wash.,
vol. 1, pp. 150-154, 1995.
[Levitan95] Levitan, L., Preliminary Human Factors Guidelines for Automated Highway
System Designers Volume I, Guidelines for AHS Designers, Technical Report FHWA-RD-94-
116, Federal Highway Administration, McLean VA, 1995.
[Lin96] Lin, C.-F., and A. G. Ulsoy, Time to Lane Crossing Calculation and Characterization of
its Associated Uncertainty, ITS Journal, vol. 3, no. 2, pp. 85-98, 1996.
[Liu89] Liu, S. M., and A. A. Frank, On Lateral Control of Highway Vehicles Guided by a
Forward Looking Sensor, Proceedings of the 1st International Conference on Applications of
Advanced Technologies in Transportation Engineering, San Diego, pp. 119-124, 1989.
[Lubin92] Lubin, J.M., E. C. Huber, S.A. Gilbert, and A. L. Kornhauser, Analysis of a Neural
Network Lateral Controller for an Autonomous Road Vehicle, in IVHS Issues and Technology,
Society of Automotive Engineers, pp. 23-44, 1992.
[Lygeros94] Lygeros, J., and D. N. Godbole, An Interface Between Continuous and Discrete-
event Controllers for Vehicle Automation, Proceedings of American Control Conference,
Baltimore, MD, vol. 1, pp. 801-805, 1994.
[Lygeros96] Lygeros, J., D. N. Godbole, and M. Broucke, Towards a Fault Tolerant AHS
Design. Part I, Extended Architecture, Technical Report UCB-ITS-PRR-96-14, PATH Program,
Institute of Transportation Studies, University of California, Berkeley, 1996.
[McGurrin96] McGurrin, M. F., The Privacy Enhancement Protocol (PEP), 3rd World
Congress in ITS, Orlando, FL, October 1996.
[Malik95] Malik, J., D. Koller, and T. Luong, A Machine Vision Based System for Guiding
Lane-change Maneuvers, Technical Report UCB-ITS-PRR-95-34, PATH Program, Institute of
Transportation Studies, University of California, Berkeley, 1995.
[Marsh93] Marsh, C., T. J. Gordon, and Q. H. Wu, Stochastic Optimal Control of Active
Vehicle Suspensions using Learning Automata, Proceedings I. Mech. Eng. Part I, Journal of
Systems and Control Engineering, vol. 207, pp.143-152, 1993.
[Marsh95] Marsh, C., and T. J. Gordon, The Application of Learning Automata to Controller
Design in Slow-Active Automobile Suspensions, International Journal for Vehicle Mechanics
and Mobility, vol. 24, no. 8, pp. 597-616, 1995.
[Moon96] Moon, S.-H., C.-W. Kim, and M.-H. Han, Navigation Control for an Autonomous
Road Vehicle Using Neural Network, Proceedings of the 3rd World Congress in ITS, Orlando,
FL, October 1996.
[Mourou92] Mourou, P., and B. Fade, Multi-agent Planning and Execution Monitoring:
Application to Highway Traffic, Proceedings of the AAAI Spring 92 Symposium, pp. 107-
112, 1992.
[Najim77] Najim, K., and Y.M. El Fattah, Use of Learning Automaton in Static Control of a
Phosphate Drying Furnace, Proceedings of the 5th IFAC/IFIP International Conference on
Digital Computer Applications to process Control, The Hague, Netherlands, June 1977.
[Najim91b] Najim, K., and G. Oppenheim, Learning Systems: Theory and Applications, IEE
Proceedings of Computer and Digital Techniques, vol. 138, no. 4, pp. 183-192, 1991.
[Najim94] Najim, K., and A. S. Poznyak, editors, Learning Automata: Theory and Application,
Tarrytown, NY: Elsevier Science Ltd., 1994.
[Najim96] Najim, K., and A. S. Poznyak, Multimodal Searching Technique Based on Learning
Automata with Continuous Input and Changing Number of Actions, IEEE Transactions on
Systems, Man and Cybernetics, Part B, vol. 26, no. 4, pp.666-673, 1996.
[Narendra74] Narendra, K. S., and M. A. L. Thathachar, Learning Automata: A Survey, IEEE
Transactions in Systems, Man and Cybernetics, vol. SMC-4, no. 4, July 1974.
[Naruse93] Naruse, K., and Y. Kakazu, Strategy Acquisition of Path Planning of Redundant
Manipulator using Learning Automata, IEEE International Workshop on Neuro-Fuzzy Control,
pp.154-159, 1993.
[O'Brien96] O'Brien, R. T., P. A. Iglesias, and T. J. Urban, Vehicle Lateral Control for
Automated Highway Systems, IEEE Transactions on Control Systems Technology, vol. 4, no. 3,
pp. 266-273, 1996.
[Oommen94a] Oommen, B. J., and E. V. de St. Croix, String Taxonomy using Learning
Automata, Technical Report TR-234, School of Computer Science, Carleton University, Ottawa,
Canada, March 1994.
[Oommen94b] Oommen, B. J., and E. V. de St. Croix, Graph Partitioning using Learning
Automata, Technical Report TR-250, School of Computer Science, Carleton University, Ottawa,
Canada, July 1994.
[Özgüner95] Özgüner, Ü., Combined Longitudinal and Lateral Controller Design for a Vehicle
with Radar Sensors, IVHS and Advanced Transportation Systems, Society of Automotive
Engineers, Warrendale PA, pp. 51-57, 1995.
[PATH] Joh, J., et al. (PATH Design Team), California Partners for Advanced Transit and
Highways, http://www-path.eecs.berkeley.edu/ (Dec. 6, 1996).
[PathDb96] Morris, S., and S. Petrites, PATH Database, California Partners for Advanced
Transit and Highways, http://sunsite.berkeley.edu/~path/search.html (Dec. 6, 1996).
[Peng93] Peng, J., and R. J. Williams, Efficient Learning and Planning within the Dyna
Framework, Adaptive Behavior, vol. 1, no. 4, pp. 437-454, 1993.
[Pomerleau96] Pomerleau, D., and T. Jochem, Rapidly Adapting Machine Vision for Automated
Vehicle Steering, IEEE Expert, vol. 11, no. 2, pp. 19-27, 1996.
[Poznyak96] Poznyak, A., K. Najim, and E. Ikonen, Adaptive Selection of the Optimal Model
Order of Linear Regression Models using Learning Automata, International Journal of Systems
Science, vol. 27, no. 2, pp. 151-159, 1996.
[Puri95] Puri, A., and P. Varaiya, Driving Safely in Smart Cars, Proceedings of the American
Control Conference, Seattle, WA, vol. 5, pp. 3597-3599, 1995.
[Rajaraman96] Rajaraman, K., and P. S. Sastry, Finite Time Analysis of the Pursuit Algorithm
for Learning Automata, IEEE Transactions on Systems, Man and Cybernetics, Part B, vol. 26,
no. 4, pp.590-598, 1996.
[Rault95] Rault, A., and M. Muffat, Short-term and Mid-term Applications in the Automotive
Sector : Prometheus Activities of Peugeot SA and Renault, Presentations of the Joint California
PATH and France Workshop, Richmond, CA, 1995.
[Raytheon94] Knowledge Based Systems and Learning Methods for AHS, Technical Report
FHWA-RD-95-045 by Raytheon, AHS Precursor Systems Analyses, Compendium of Research
Summaries, compiled by Information Dynamics, Inc., and Viggen Corporation, McLean, VA,
February 1994.
[Schlegel96] Schlegel, N., and P. Kachroo, Telerobotic Operation Combined with Automated
Vehicle Control, Proceedings of the SPIE-Photonics East 96, Boston, MA, November 1996.
[Schuster96] Schuster, S., and C. Jacoby, An AHS Concept Based on Maximum Adaptability,
Proceedings of the 3rd World Congress in ITS, Orlando, FL, October 1996.
[Shapiro69] Shapiro I.J., and K.S. Narendra, Use of Stochastic Automata for Parameter Self-
Optimization with Multi-Modal Performance Criteria, IEEE Transactions on System Science
and Cybernetics, vol. SSC-5, pp. 352-360, 1969.
[Shiller95] Shiller, Z., and S. Sundar, Emergency Maneuvers for AHS Vehicles, Systems and
Issues in ITS, Society of Automotive Engineers, Warrendale PA, pp. 1-11, 1995.
[Singer95] Singer, R. M., K. C. Gross, and S. Wegerich, A Fault Tolerant Sensory Diagnostic
System for Intelligent Vehicle Application, Intelligent Vehicles '95 Symposium, Detroit, MI, pp.
176-182, 1995.
[Slotine91] Slotine, J.-J. E., and Li, W., Applied Nonlinear Control, Englewood Cliffs, NJ:
Prentice-Hall, 1991.
[Smart94] Smart Road Development Effort Report, Center for Transportation Research, Virginia
Tech, 1994.
[Spooner95] Spooner, J.T., and K. M. Passino, Fault Tolerant Longitudinal and Lateral Control
for Automated Highway Systems, Proceedings of the American Control Conference, Seattle,
Wash., vol. 1, pp. 663-667, 1995.
[Sukthankar96b] Sukthankar, R., D. Pomerleau, and C. Thorpe, SHIVA: Simulated Highways for
Intelligent Vehicle Algorithms, Proceedings of the IEEE Intelligent Vehicles Symposium, Detroit,
MI, pp. 332-337, 1995.
[Tan96] Tan, H.-S., and S. Patwardhan, Magnetic sensors for automatic steering control,
Intellimotion, vol. 5, no. 2, pp. 10-11, 1996.
[TrafT96] Traffic Technology International, issues April, May 1996, UK & International Press,
Surrey, UK, ISSN 1356-9252.
[TRB95] January 25, 1995 meeting of the Committee on AI, during the 74th Annual Meeting of
Transportation Research Board, Washington, D. C., 1995.
[Tsetlin73] Tsetlin, M.L., Automaton Theory and Modeling of Biological Systems, vol. 102 in
Mathematics in Science and Engineering, New York: Academic Press, 1973.
[Tsypkin71] Tsypkin, Ya. Z., Adaptation and Learning in Automatic Systems, New York:
Academic, 1971.
[Ünsal95] Ünsal, C., J. S. Bay and P. Kachroo, Intelligent Control of Vehicles: Preliminary
Results on the Application of Learning Automata Techniques to Automated Highway System,
Proceedings of Tools with Artificial Intelligence Symposium, Washington DC, November 1995.
[Ünsal96] Ünsal, C., P. Kachroo, and J. S. Bay, Multiple Stochastic Learning Automata for
Vehicle Path Control in an Automated Highway System, in revision for IEEE Transactions on
Systems, Man, and Cybernetics, first submission May 1996.
[Varaiya91] Varaiya, P., and S.E. Shladover, Sketch of an IVHS System Architecture, Vehicle
Navigation and Information Systems Conference Proceedings, Society of Automotive Engineers,
pp. 909-922, 1991.
[Varaiya93] Varaiya, P., Smart Cars on Smart Roads: Problems of Control, IEEE Transactions
on Automatic Control, vol. 38., no. 2, pp. 195-207, Feb. 1993.
[Varshavski63] Varshavski, V.I., and I.P. Vorontsova, On the Behavior of Stochastic Automata
with Variable Structure, Automat. Telemekh., vol. 24, pp. 253-360, 1963.
[Viswanathan72] Viswanathan R., and K.S. Narendra, Comparison of Expedient and Optimal
Reinforcement Schemes for Learning Systems, Journal of Cybernetics, vol. 2, pp. 21-37, 1972.
[Weber96] Weber J, and M. Atkin, Further Results on the Use of Binocular Vision for Highway
Driving, Proceedings of the 1996 SPIE Conference on Intelligent Systems and Controls, vol.
2902, Nov. 1996.
[Wigan96] Wigan, M., Problem of Success: Privacy, Property, and Transactions, Converging
Infrastructures: Intelligent Transportation and the National Information Infrastructure,
Cambridge, MA: MIT Press, pp. 341-354, 1996.
[Wright95] Wright, T., Eyes on the Road : Privacy and ITS, Traffic Technology International,
pp. 88-93, 1995.
[Xu96] Xu, T., DYNAVIS, a Dynamic Visualization Environment for Automatic Vehicle
Control Systems, Intellimotion, vol. 5, no. 1, pp. 2, 10, 1996.
[Yanakiev96] Yanakiev, D., and I. Kanellakopoulos, Speed Tracking and Vehicle Follower
Control Design for Heavy-duty Vehicles, Vehicle system dynamics, vol. 25, no. 4, pp. 251-276,
1996.
[Yang96] Yang, Y.-T., and B.H. Tongue, A New Control Approach for Platoon Operations
during Vehicle Exit/Entry, Vehicle System Dynamics, vol. 25, no. 4, pp. 305-319, 1996.
[Yu95] Yu, J., and A. Sideris, Control of Combined Vehicle Motion: a Gain-scheduling
Approach, Proceedings of the 5th Annual Meeting of ITS America, vol. 1, pp. 109-115, 1995.
[Zoratti95] Zoratti, P., Millimeter Wave Scattering Characteristics and Radar Cross Section
Measurements of Common Roadway Objects, Collision Avoidance and Automated Traffic
Management Systems, SPIE, Bellingham, WA, pp. 169-179, 1995.
Appendix A. Glossary
The linear inaction-penalty scheme $L_{I-P}$ can be derived by a modification of the general linear
reward-penalty scheme $L_{R-P}$ (see Section 6.1). This reinforcement scheme, as its name suggests,
does not update the action probabilities when the environment response is affirmative. If $\alpha(n) = \alpha_i$:

$$\beta(n) = 0:\quad p_i(n+1) = p_i(n), \qquad p_j(n+1) = p_j(n) \quad \forall j \neq i$$
$$\beta(n) = 1:\quad p_i(n+1) = (1-b)\,p_i(n), \qquad p_j(n+1) = \frac{b}{r-1} + (1-b)\,p_j(n) \quad \forall j \neq i \tag{B.1}$$

From the definition above, the expected value of the probability of an action at the next
step can be written as:

$$E\big[p_i(n+1) \mid p_i(n)\big] = \sum_{k=1}^{r} E\big[p_i(n+1) \mid p_i(n),\, \alpha(n)=\alpha_k\big]\, p_k(n)$$
$$= \sum_{k=1}^{r} c_k\, E\big[p_i(n+1) \mid p_i(n),\, \alpha(n)=\alpha_k,\, \beta(n)=1\big]\, p_k(n) + \sum_{k=1}^{r} (1-c_k)\, E\big[p_i(n+1) \mid p_i(n),\, \alpha(n)=\alpha_k,\, \beta(n)=0\big]\, p_k(n)$$
$$= c_i\,(1-b)\,p_i^2(n) + \sum_{k \neq i} c_k\,p_k(n)\left[\frac{b}{r-1} + (1-b)\,p_i(n)\right] + (1-c_i)\,p_i^2(n) + \sum_{k \neq i} (1-c_k)\,p_k(n)\,p_i(n) \tag{B.2}$$

We will again consider the ideal case where the probability of penalty for the optimal action $\alpha$ is
equal to zero, i.e., $c_\alpha = 0$ and $0 < c_j < 1$ for $j \neq \alpha$. Then, for $i = \alpha$, we have:

$$E\big[p_\alpha(n+1) \mid p_\alpha(n)\big] = 0 + \sum_{k \neq \alpha} c_k\,p_k(n)\left[\frac{b}{r-1} + (1-b)\,p_\alpha(n)\right] + p_\alpha^2(n) + \sum_{k \neq \alpha} (1-c_k)\,p_k(n)\,p_\alpha(n) \tag{B.3}$$

More explicitly:

$$E\big[p_\alpha(n+1) \mid p_\alpha(n)\big] = \frac{b}{r-1}\sum_{k \neq \alpha} c_k\,p_k(n) + p_\alpha(n)\sum_{k \neq \alpha} c_k\,p_k(n) - b\,p_\alpha(n)\sum_{k \neq \alpha} c_k\,p_k(n) + p_\alpha^2(n) + \sum_{k \neq \alpha} p_k(n)\,p_\alpha(n) - \sum_{k \neq \alpha} c_k\,p_k(n)\,p_\alpha(n)$$
$$= \frac{b}{r-1}\sum_{k \neq \alpha} c_k\,p_k(n) - b\,p_\alpha(n)\sum_{k \neq \alpha} c_k\,p_k(n) + p_\alpha^2(n) + \sum_{k \neq \alpha} p_k(n)\,p_\alpha(n) \tag{B.4}$$

Now, taking the expectation of both sides, and using the fact that $\sum_{i=1}^{r} p_i(n) = 1$, we obtain:

$$E\big[p_\alpha(n+1)\big] = \frac{b}{r-1}\sum_{k \neq \alpha} c_k\,E\big[p_k(n)\big] - b\,E\!\left[p_\alpha(n)\sum_{k \neq \alpha} c_k\,p_k(n)\right] + E\big[p_\alpha^2(n)\big] + E\big[(1 - p_\alpha(n))\,p_\alpha(n)\big] \tag{B.5}$$

The simulations of the linear inaction-penalty scheme showed that the probability of the
optimal action converges to a steady-state value, as seen in Figure B.1. For such a steady-state
result, we can assume that, as $n \to \infty$:

$$E\big[p_\alpha(n)\big] = \text{constant} \equiv \bar{p}, \qquad \sigma^2_{p_\alpha(n)} = 0, \qquad E\big[p_\alpha(n)\,p_j(n)\big] = E\big[p_\alpha(n)\big]\,E\big[p_j(n)\big] \tag{B.6}$$

The conditions above state that, as we approach the steady-state value for the probability of
the optimal action, the expected value approaches the same value, the variance of the probability
is negligibly small, and the probabilities of all other actions are independent of the optimal action's
probability, although the sum of the probabilities must be equal to 1. (Since the probability of the
optimal action is a constant value, this value and the other probabilities can be treated as independent
variables.) Using these facts, the identity $E[x^2] = E[x]^2 + \sigma_x^2$, and renaming
$E\!\left[\sum_{k \neq \alpha} c_k\,p_k(n)\right] \equiv A$, we can simplify the equality in Equation B.5 as follows:

$$E\big[p_\alpha(n+1)\big] = \frac{b}{r-1}\sum_{k \neq \alpha} c_k\,E\big[p_k(n)\big] - b\,E\big[p_\alpha(n)\big]\,E\!\left[\sum_{k \neq \alpha} c_k\,p_k(n)\right] + E\big[p_\alpha(n)\big]^2 + E\big[p_\alpha(n)\big] - E\big[p_\alpha(n)\big]^2 \tag{B.7}$$

Or:

$$\bar{p} = \frac{b}{r-1}A - b\,\bar{p}\,A + \bar{p}^2 + \bar{p} - \bar{p}^2 \;\Longrightarrow\; 0 = \frac{b}{r-1}A - b\,\bar{p}\,A \;\Longrightarrow\; \bar{p} = \frac{bA}{bA\,(r-1)} = \frac{1}{r-1} \tag{B.8}$$

where $r$ is the number of actions. This proves that a learning automaton using the linear inaction-
penalty scheme in a stationary environment where there is one (and only one) optimal action is
not $\epsilon$-optimal (except when $r = 2$). Furthermore, the probability of the optimal action converges
to $1/(r-1)$. Figure B.1 shows the probabilities of five actions where the optimal action's
probability converges to $1/(5-1) = 0.25$.
[Plot: five action probabilities, vertical axis 0.12 to 0.24, versus iterations, 0 to 150.]
Figure B.1. Probabilities of five actions in the $L_{I-P}$ scheme; only $c_1 = 0$.
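The limiting value $1/(r-1)$ can be checked numerically with a minimal simulation of the $L_{I-P}$ update (a Python sketch; the penalty probabilities of the non-optimal actions and the step size b are illustrative):

```python
import random

def simulate_lip(r: int = 5, b: float = 0.1, steps: int = 3000,
                 seed: int = 1) -> list:
    """L_I-P automaton in a stationary environment with one optimal
    action (c[0] = 0); the other penalty probabilities are illustrative."""
    rng = random.Random(seed)
    c = [0.0] + [0.3] * (r - 1)          # penalty probabilities
    p = [1.0 / r] * r                    # uniform initial probabilities
    for _ in range(steps):
        i = rng.choices(range(r), weights=p)[0]
        if rng.random() < c[i]:          # beta = 1: penalty update
            p = [(1 - b) * p[j] if j == i
                 else b / (r - 1) + (1 - b) * p[j] for j in range(r)]
        # beta = 0: inaction, probabilities are left unchanged
    return p

p = simulate_lip()
# p[0] settles near 1/(r-1) = 0.25 rather than converging to 1
```

Each penalty applied to a non-optimal action maps the optimal action's probability through $p \mapsto b/(r-1) + (1-b)p$, whose fixed point is $1/(r-1)$, which is why the convergence in Figure B.1 is rapid.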
Appendix C. Simulation
Unix: mfiles.tar.Z
Windows: mfiles.zip
Table C1. Description of the subroutines and functions for multiple lane, multiple automata
intelligent vehicle simulation.
Simulation m-files: mlma.m, m_setgui.m, m_init.m, m_reinit.m, plot_h2.m, m_sandl.m,
move2.m, m_plap.m, reg2.m, m_traj.m, plan2.m, m_mvec.m, m_movie.m.

[Figures C.4-C.6: annotated screenshots of the simulation GUI. The main window controls
include the learning algorithm menu, extended side sensor switch, learning parameters, front
sensor ranges, length of memory vectors, processing speed (1/sec), index of the current vehicle,
speed and lane parameters and on/off switches for the current vehicle, speed increment,
permitted speed variation, data file I/O, display update rate, switch for saving flow data,
highway window size adjustment, and final time (sec). The plot GUI selects memory vector and
pinch module outputs, speed and lane positions, timed snapshots, and relative positions of
vehicles during a run, with the indices of the first and last vehicles to be plotted. The
snapshot figure shows the reference vehicle over the time interval t = 1 sec to t = 16 sec and
at t = [0 10 20 28 30 40 45].]
The tables given in this section show the possible positions of vehicles in a highway for the
scenarios described in Chapter 7. They indicate relative positions of multiple vehicles with
respect to each other's sensor ranges. Each row in a matrix corresponds to a lane; each square
illustrates a road section which falls into the side sensor range of an automated vehicle. A dark
square indicates the presence of a vehicle. Not all possibilities are considered; instead, only the
situations that are of interest for a specific scenario are listed. Similar situations are then
combined into a single state and simplified if necessary. Two situations are said to be similar if
the sensor module outputs and/or possible actions are the same for both. The three scenarios of
Chapter 7 are:
Figure D.1. All possible immediate neighborhood situations for two vehicles.
Figure D.2. Combined states for two vehicles: states not shown
are identical to those given here.
E: A2, A3, A5, A6
F: B7, C2
G: C3, C6, C7, D5
H: E7
I: F2, F3, F6, C2
J: G5
K: J3, J4
L: K2, L1
Figure D.5. Combined states for three vehicles: states not shown are
identical to those indicated.
Vita
Cem Ünsal was born on November 4, 1967 in Ankara, Turkey. He is an alumnus (118) of
Galatasaray Lisesi (Lycée de Galatasaray), Istanbul. He received a B.Sc. degree with honors in
electrical and electronics engineering from Boğaziçi University, Istanbul, in 1991, and an M.S.
degree in electrical engineering from Virginia Polytechnic Institute and State University (Virginia
Tech), in 1993. His M.S. thesis "Self-Organization in Large Populations of Mobile Robots"
received the Paul E. Torgersen Graduate Student Research Excellence Award (3rd place) in 1994.
From 1993 to 1996, he was a Ph.D. candidate in the Bradley Department of Electrical
Engineering, and research assistant at the Center for Transportation Research at Virginia Tech.
His research interests include automated highway systems, automated vehicle control systems,
artificial intelligence, learning automata, sliding mode control and nonlinear observers.
He will continue his research career at the Robotics Institute at Carnegie Mellon
University, Pittsburgh, PA. Mr. Ünsal has been a student member of AMS, IEEE, ITS America
and TRB during his graduate studies. He is currently a member of IEEE and ITSA.
29 January 1997