by
Cem Ünsal
Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State
University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
in
ELECTRICAL ENGINEERING
by
Cem Ünsal
(ABSTRACT)
One of today's most serious social, economic, and environmental problems is traffic
congestion. In addition to the financial cost of the problem, the number of traffic related injuries
and casualties is very high. A recently considered approach to increase safety while reducing
congestion and improving driving conditions is Automated Highway Systems (AHS). The AHS
will evolve from the present highway system to an intelligent vehicle/highway system that will
incorporate communication, vehicle control and traffic management techniques to provide safe,
fast and more efficient surface transportation. A key factor in AHS deployment is intelligent
vehicle control. While the technology to safely maneuver the vehicles exists, the problem of
making intelligent decisions to improve a single vehicle's travel time and safety while optimizing
the overall traffic flow is still a stumbling block.
The design of the adaptive vehicle path planner based on local information is then carried
over to the interaction of multiple intelligent vehicles. By analyzing situations consisting of
conflicting desired vehicle paths, we extend our design with additional decision structures. The
analysis of the situations and the design of the additional structures are made possible by the study
of the interacting reward-penalty mechanisms in individual vehicles. The definition of the physical
environment of a vehicle as a series of discrete state transitions associated with a stationary
automata environment is the key to this analysis and to the design of the intelligent vehicle path
controller.
This work was supported in part by the Center for Transportation Research and the Virginia
DOT under the Smart Road project, by the General Motors ITS Fellowship program, and by the Naval
Research Laboratory under grant no. N000114-93-1-G022.
To all the noble men and the knights fighting the demons
of a binary world in their quest for artificial intelligence.
Acknowledgments
It has been a long ride since I received my M.S. degree. Many friends, local, cyber and dear, many
engineering students, EE, ME, or CE, many professors, and others have seen me working through
my dissertation. All of them, directly or indirectly, touched what is written in these pages.
First of all, I have to express my gratitude to my advisors: Dr. John S. Bay, most
probably the best advisor this campus has ever seen and a legend in thesis/dissertation editing, and
Dr. Pushkin Kachroo, percussionist extraordinaire (Isn't that what P.E. stands for?). They were
both genuinely understanding, and very supportive of me. I learned a great deal from them about
control engineering (among other fields) and about academic life. Without their guidance, their
advice, and their help, this dissertation would not have been completed.
Other members of my dissertation committee, Dr. William T. Baumann, Dr. Hugh F.
VanLandingham, and Dr. Joseph A. Ball, who were also my professors at some point during my
five and a half years at Tech, were kind enough to listen to me babbling about an intelligent
controller.
On multiple occasions, I had the chance of talking about my research (and my life) to the
other graduate students in the Machine Intelligence Laboratory. Doug, John, Mamun, and Ferat
were some of the faces I got used to seeing almost every day. Also, Paul and Kevin forced me
every Wednesday afternoon to have my breakfast at Arnold's. I do have a feeling that I will miss
that very much.
During my assistantship at the Center for Transportation Research, I had the privilege of
working with fellow students, researchers, and the center's staff. Lisa, Terry and Christina were there
to answer my unnecessary questions. Kaan Özbay was helpful in my starting to work at CTR, and
I am grateful for his confidence in me. My only regret about CTR is not having been able to know
Ashwin Amanna better; he is a true gentleman.
Local Turks, Simge, Erkan, Bahadır, Levent, Oya, Bahar, Kutsal, and others, suffered
indirectly from my research efforts. Sorry guys!
My parents, Aydın and Ayhan, watched me from a distance while I worked toward my
degree, missing most of the last three years of my life. The completion of this dissertation will
mean "seeing more of me" to them, I hope. I believe that my sister Gün will not make the
mistakes I have made in becoming an electrical engineer. Watching her catching up with me
makes me confused and very proud.
Probably, the people who heard the most about my dissertation and graduate life are the
members of KM (Kelek Muhabbet). This e-mail distribution list came to life sometime in 1992 and
grew to a much larger and diverse group. Some of the members are my close friends from the
early years of my academic venture, some of them are people whom I met on the Net, some of
them I have not seen yet. They also watched me getting tired of this lovely town of Blacksburg;
some even visited me here more than once. I could not, cannot and do not want to imagine going
a week without receiving e-mail from KM.
I am also grateful to Berrak Pınar for helping me tremendously during the last two hectic
weeks of my dissertation work.
Finally, to the person who borrowed the Proceedings of the 1995 Intelligent Vehicles
Symposium and has refused to return it to the library since July 1996: You have been acknowledged!
This work is dedicated to all scientists and engineers who work on artificial intelligence,
and know exactly what I went through.
Table of Contents
Abstract ii
Acknowledgments v
Table of Contents vii
List of Figures x
List of Tables xiv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 ITS and AHS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Intelligent Vehicle Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Different Approaches to Intelligent Vehicle Control . . . . . . . . . 4
1.2.2 Learning Automata as an Intelligent Controller . . . . . . . . . . . . . 6
1.3 Scope and Structure of Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.4.1 Norms of Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.2 Variable Structure Automata . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.5 Reinforcement Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5.1 Linear Reinforcement Schemes . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5.2 Nonlinear Learning Algorithms: Absolutely Expedient Schemes 39
3.6 Extensions of the Basic Automata-Environment Model . . . . . . . . . . . . . . 41
3.6.1 S-Model Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.6.2 Nonstationary Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6.3 Multi-Teacher Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.6.4 Interconnected Automata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.2.1 Function H and the Conditions for Absolute Expediency . . . . . . 111
6.2.2 Comparison of NLH with General Absolutely Expedient Scheme 114
8 Conclusion 139
8.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
8.1.1 AHS as a Social Issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
8.1.2 Technical Feasibility of AHS . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.1.3 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.2 Results and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.3 Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Bibliography 149
Vita 185
List of Figures
Figure 5.6 Headway distance and speed of an autonomous vehicle following another
slow-moving vehicle: sensor parameters are d1 = 12m, d2 = 15m, and
fsr = 16m. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Figure 5.7 Positions of five automated vehicles: gray colored vehicle attempts to
shift to the right lane by slowing down. . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Figure 5.8 Positions of five automated vehicles: gray colored vehicle attempts to
shift to the right lane by slowing down. . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Figure 5.9 Headway distances in the platoon; lane change occurs at t ≈ 8.5 sec. . . . . 71
Figure 5.10 Speed and space-time trajectory of five automated vehicles. . . . . . . . . . . . 72
Figure 5.11 The lane module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Figure 5.12 Pinch condition: two vehicles decide to shift to the same lane
simultaneously. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Figure 5.13 Memory vector/buffer: If an action fills half of the buffer, it is signaled. 74
Figure 5.14 Initial positions and trajectories of two vehicles. . . . . . . . . . . . . . . . . . . . . 75
Figure 5.15 Lane and speed of two vehicles from t = 0 to t = 4sec. . . . . . . . . . . . . . . . 75
Figure 5.16 The output of the pinch module and the actions in the memory buffer
versus time for vehicle 1 and 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Figure 5.17 Initial and final situation, speed, and lane positions of three automated
vehicles traveling on a 3-lane 500-meter circular highway [mpeg]. . . . . . . 78
Figure 5.18 Snapshots, speed and lane positions of two automated vehicles traveling
on a 3-lane 500-meter circular highway. . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Figure 5.19 The new definition of the mapping F. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Figure 5.20 Automata in a multi-teacher environment with new definitions of mapping
functions and flag structures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Figure 5.21 Average platoon speeds for simulations 1-4. . . . . . . . . . . . . . . . . . . . . . . . 85
Figure 5.22 The distance between vehicles for simulations 1-4. . . . . . . . . . . . . . . . . . . 86
Figure 5.23 Initial and final positions, and speeds of automated vehicles in four
ten-vehicle platoons [mpeg]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Figure 5.24 Simulations 5 and 6: average speed for a platoon of ten automated
vehicles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Figure 5.25 Simulations 5 and 6: distances between vehicles for a platoon of ten
automated vehicles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Figure 5.26 Initial and final positions, and speeds of automated vehicles in four
ten-vehicle platoons [mpeg]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Figure 5.27 Simulations 7 and 8: mpeg movies of two simulations of 10-vehicle
platoons with lead vehicle decelerating to 75mph [mpeg]. . . . . . . . . . . . . 90
Figure 6.1 Number of steps needed for popt to reach 0.995 for different values of
learning parameters a and b. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Figure 6.2 For some parameter values, convergence is not obtained for all 500 runs. . 97
Figure 6.3 Number of steps needed for popt to reach 0.995 for different values of
learning parameters a and b. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Figure 6.4 Number of steps needed for popt to reach 0.995 for different values of
learning parameters a and b. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Figure 6.5 The region for asymptotic convergence to pure optimal strategy. . . . . . . . 106
Figure 6.6 Probabilities and cumulative average of the probabilities of the non-
optimal actions for a 3-action automaton with LR-P (a = 0.2, b = 0.1). . . . . 106
Figure 6.7 Sketch of the function H. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Figure 7.1 Longitudinal automaton determines the lateral automaton's environment. . 118
Figure 7.2 Multiple vehicles interacting through the physical environment. . . . . . . . . 121
Figure 7.3 Situations for two interacting vehicles and combined environment
responses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Figure 7.4 Changes in the physical (left) and automata environments (right): vehicle
1 shifts to the middle lane. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Figure 7.5 Possible physical environment states for 2 vehicles in 3-lane highway. . . . 126
Figure 7.6 State transition diagram for two vehicles on a 3-lane highway; reflexive
transitions are not shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Figure 7.7 Scenario 1: Two vehicles with conflicting desired paths. . . . . . . . . . . . . . . 129
Figure 7.8 Possible chains for Scenario 1 (reflexive transitions are not shown; two
chains are distinguished with dashed and solid lines). . . . . . . . . . . . . . . . . 130
Figure 7.9 Possible penalty-reward structures to force physical environment to
switch to states B1 or C1 from current state A1. . . . . . . . . . . . . . . . . . . . 131
Figure 7.10 Scenario 2: Three vehicles with conflicting desired paths. . . . . . . . . . . . . . 132
Figure 7.11 A possible chain for Scenario 2: lane flag forces vehicles 1 and 2 to slow
down. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Figure 7.12 Three-vehicle transition diagram can be written as three separate two-
vehicle transition diagrams using the definitions in Figure 7.8. . . . . . . . . . 133
Figure 7.13 Three-vehicle transition diagram is equivalent to two separate two-vehicle
diagrams for the example in Section 5.3.1. . . . . . . . . . . . . . . . . . . . . . . . . 134
Figure 7.14 Scenario 3: Four vehicles with conflicting desired speeds. . . . . . . . . . . . . 135
Figure 7.15 Two possible solutions to a situation with 4 vehicles. . . . . . . . . . . . . . . . . 135
Figure 7.16 Speeds and positions of vehicles 1 and 2 for scenario 3 [mpeg]. . . . . . . . . 137
Figure B.1 Probabilities of five actions in the LI-P scheme; only c1 = 0. . . . . . . . . . . . . 169
Figure C.1 Structure of the simulation program. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Figure C.2 The simulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Figure C.3 Graphic User Interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Figure C.4 GUI for data visualization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Figure C.5 Trajectory command for relative position plots. . . . . . . . . . . . . . . . . . . 176
Figure C.6 Scenario window: Clickable buttons initialize several different scenarios. . 176
Figure C.7 Timed snapshots of a simulation run. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Figure D.1 All possible immediate neighborhood situations for two vehicles. . . . . . . . 180
Figure D.2 Combined states for two vehicles: states not shown are identical to those
given here. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
Figure D.3 Further simplified states. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Figure D.4 All possible immediate neighborhood situations for three vehicles in a
three-lane highway. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
Figure D.5 Combined states for three vehicles: states not shown are identical to those
indicated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Figure D.6 Possible states for four vehicles: actually a two-vehicle situation. . . . . . 184
List of Tables
Chapter 1
Introduction
1.1 Motivation
One of today's most serious social, economic, and environmental problems in the world is
traffic congestion. In the United States alone, the problem is costing over 100 billion dollars each
year. In addition, more than 40,000 persons are killed and another three million are injured (an
additional cost of $137 billion per year). This is compounded by the fact that it is becoming
increasingly difficult, for both financial and environmental reasons, to continue to build additional
highway capacity [Hayes94]. Therefore, the solution to the problem must lie in other
approaches, one of which is to make better use of the existing highway infrastructure. The
current infrastructure has begun to plateau in its ability to meet the operational requirements of
highway transportation. Consequently, congestion and safety risks are increasing. Previous
solutions of building or widening roads are no longer feasible; the average cost of a mile of
urban freeway is $38 million. Furthermore, the tens of thousands of lives lost per year are a steep
price to pay for mobility [Hayes94].
New approaches discussed in the next section have the potential of increasing the safety,
reducing the congestion, and improving the driving conditions. Studies conducted at the beginning
of the intelligent transportation era show that it is possible to cut accidents by 18%, gas
emissions by 15%, and fuel consumption by 12% by employing new approaches. It is claimed
that these improvements will result in a $10 return to the economy for each dollar invested in the
new transportation systems [Card93].
Considering the potential of the application of intelligent systems to the surface
transportation problem, the US Department of Transportation deems the investigation of new
technologies as crucial. Most of the smart highway technologies were first suggested in the
1960s, but the idea did not gain momentum until the Intermodal Surface Transportation Efficiency
Act (ISTEA) was passed by Congress in 1991. The technology needed to create an intelligent
transportation system is already available, although still expensive for full implementation.
Today's vehicles have at least four or five microprocessors monitoring and controlling such
things as ignition spark, fuel and emission controls, automatic transmission, cruise control, air
bags, and anti-lock brakes. Further development is underway on computerized controls for active
suspensions, traction control, all-wheel drive, and four-wheel steering. Few people realize, in
fact, that today's car has four times the computing power of the first Apollo moon rocket
[Card93].
- Significant reduction in the number of annual fatalities and injuries due to accidents,
- Improvement in the safety of commercial vehicles and hazardous material movement,
- Significant reduction in the costs associated with congestion,
- Increase in the volume of people and goods that can be moved on existing highways,
- Improvement of travel time predictability, and reduction in the level of stress associated
with travel,
- Reduction in harmful vehicle emissions, and in surface transport energy consumption per
vehicle-mile and per passenger-mile traveled.
While congestion is the primary problem in urban areas, rural areas suffer from a higher
traffic fatality rate due to road conditions and higher rates of speed. Increased margins of safety
provided by ITS technologies might reduce the number of accidents in both urban and rural areas.
Most importantly, communication capabilities could improve emergency response times, and
improve levels of rural public transportation services.
The established process of delivering the ITS program has four major components
covering the following issues:
The DOT has established a set of discrete milestones to be accomplished by the end of
1997. Some of these milestones related to the work presented here are system
architecture, communication frequencies, collision avoidance, and automated highway systems.
A major long-term element of DOT's research and development is AHS. The deployment
of early phases of AHS will begin in the next ten years, but will continue beyond that period.
Current AHS program activities address the imminent user services involving safety advisory and
driver assistance products, e.g., in-vehicle warnings of impending danger, and emergency control
intervention. The ITS architecture must be structured to provide a sensible evolution to
accommodate these types of user services, as well as the eventual realization of total automation.
The 1997 demonstration requested by Congress is the AHS program's first major
milestone, and will assess proof of concept feasibility for the AHS. Preparation for
demonstration will include the identification and evaluation of feasible alternative conceptual
approaches. The demonstration will feature many feasible concepts. Subsystems and control,
including lateral and longitudinal control, transition between manual and automated control,
maintenance of position in the roadway traffic flow, lane changing, and various malfunction
management capabilities will be demonstrated, all in a collision-free automated driving
environment.
The demonstration results, if successful, will lead to selection of the AHS conceptual
approach, documentation of the approach in performance specifications and standards, and
establishment of a partnership with industry to design, develop, and operationally test an AHS
with public participation. Underpinning the development of the first AHS demonstration is an
understanding of the human factor issues related to AHS driver operations. The DOT has begun a
research project to develop comprehensive driver performance guidelines for AHS. Beyond 1997,
deployment of a nationally compatible system will become the primary concern of DOT. It is
expected that the work will shift towards supporting deployment of ITS user services, and
continuing progress on long-term ITS goals, especially the AHS. Details of the AHS concept and
its major factors are described in Chapter 2.
there are multiple fast-moving vehicles, making the right decision to avoid collisions and optimize
the vehicle path is difficult. Several different approaches are currently being considered in
AHS research. These, and our position relative to current research, are discussed next.
1.2 Context
1.2.1 Different Approaches to Intelligent Vehicle Control
There are two main approaches to the implementation of a fully automated highway system:
hierarchical structure and autonomous vehicle approaches. The first approach, defined in
[Varaiya93], centers around the notion of platooning (See Section 2.3). In this approach, there
are different layers of control hierarchy, each responsible for a different task required to
implement an automated highway system. The layers of the architecture, starting at the top, are:
network, link, planning, and regulation. The network layer assigns a route to each vehicle as it
enters the system. The link layer assigns each vehicle a path which balances traffic for all lanes
and assigns a target speed for each section of highway. The planning layer creates a plan which
attains the desired path. The regulation layer controls the vehicle trajectory so that it conforms to
this plan.
The tasks assigned to each layer differ from each other in three dimensions: time scale,
information span, and spatial impact of decisions. As we go up in the hierarchy, the frequency of
decisions decreases, more aggregated information is used, and the decisions affect vehicles over
larger distances. For example, the planning layer of the hierarchy makes decisions to create a plan
that evolves every minute by coordinating with neighboring vehicles. Path decisions affect only
the neighboring vehicles.
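As a rough sketch of this top-down flow, the four layers can be mocked as a pipeline. The class and method names, and all returned values below, are hypothetical illustrations, not the actual architecture of [Varaiya93]:

```python
# Illustrative sketch of the four-layer control hierarchy described
# above (layer names from [Varaiya93]); classes, method names, and
# returned values are all hypothetical.

class NetworkLayer:
    def route(self, vehicle_id):
        # assign a route to each vehicle as it enters the system
        return ["segment-A", "segment-B"]

class LinkLayer:
    def path_and_speed(self, route):
        # balance traffic across lanes and set a target speed
        return {"lane": 2, "target_mph": 65}

class PlanningLayer:
    def plan(self, path):
        # create a maneuver plan which attains the desired path
        return ["merge", "cruise"]

class RegulationLayer:
    def regulate(self, plan):
        # control the vehicle trajectory so it conforms to the plan
        return "tracking " + plan[0]

# Decisions flow top-down through the hierarchy.
net, link, planner, reg = (NetworkLayer(), LinkLayer(),
                           PlanningLayer(), RegulationLayer())
command = reg.regulate(planner.plan(link.path_and_speed(net.route("v1"))))
```

Each call site corresponds to one layer handing its decision down to the next; in the real architecture these decisions occur at very different time scales and information spans.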
This approach assumes an information flow from the higher layers of the hierarchy and to/from other
vehicles for making decisions about the optimal path. The information available to an automated
vehicle has global characteristics since it is relayed from a higher layer of the hierarchy.
A simpler approach, for reasons of implementation, is the autonomous vehicle approach,
where each vehicle is treated as an autonomous agent using local sensors (and possibly limited
communications with neighboring vehicles). It does not use the concept of closely separated
vehicles; it aims to solve the problem in a shorter time interval, and to reach a fully automated
highway system by an evolutionary implementation, i.e., by slowly increasing the capabilities
of autonomous vehicles over time.
In the first case, the control commands are relayed by higher layers in the hierarchy, and
the vehicles do not need to make intelligent decisions for path planning; the hierarchical structure
controls the vehicles. In the latter approach however, the vehicles need to make decisions based
on local information. Our learning planner can be visualized as part of the second approach, or a
backup system for the first one.
The first attempt to solve the problem of real-time decision making in an automated
highway environment dates back to 1992. Mourou and Fade described a planning method
applicable to agents with perception and decision-making capabilities and the ability to
communicate with other agents [Mourou92]. The agents, in this example, are assumed to
transmit their actions to their neighbors. This work emphasized the possibility of achieving a
more flexible, fast, and reactive system by multi-agent planning methods and execution
monitoring. However, this approach requires constant and complete data transfer between agents.
Recent research on intelligent vehicles includes an adaptive intelligent vehicle module used
in a simulation and design system for tactical driving algorithms [Sukthankar96a, 96b]. The
approach divides the driving task into three levels: strategic (route planning, goal determination),
tactical (maneuver selection for short-term goals) and operational (translating maneuvers into
control operations). Intelligent modules are designed to answer the need for real-time tactical level
decisions. A reasoning system called SAPIENT combines the high-level goals with low-level
sensor constraints to control vehicles. The system uses a number of modules whose outputs are
combined using a voting scheme. The Population-Based Incremental Learning (PBIL) method is
used to adjust the large number of parameters defining the behavior of the modules. This
approach is basically a method for finding suitable parameters for the modules that are used to
fire lateral and longitudinal actions. The learning algorithm for the parameters is a combination of
evolutionary optimization and hill climbing, and it is very similar to learning automata
reinforcement schemes except for the use of a mutation factor for the probabilities of actions at
the end of each iteration.
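The PBIL loop described above (sample a population from a probability vector, pull the vector toward the best sample, then perturb it with a mutation factor) can be sketched minimally as follows. The function name, binary encoding, and parameter values are assumptions for illustration, not the design tool of [Sukthankar96a, 96b]:

```python
import random

def pbil(fitness, n_bits, pop=20, lr=0.1, mut=0.02, steps=200, seed=3):
    """Minimal Population-Based Incremental Learning sketch:
    binary encoding, `fitness` is maximized."""
    rng = random.Random(seed)
    p = [0.5] * n_bits                    # one probability per bit
    best, best_f = None, float("-inf")
    for _ in range(steps):
        # sample a population from the current probability vector
        samples = [[1 if rng.random() < pi else 0 for pi in p]
                   for _ in range(pop)]
        top = max(samples, key=fitness)
        f_top = fitness(top)
        if f_top > best_f:
            best, best_f = top, f_top
        # pull probabilities toward the best sample (hill-climbing step)
        p = [pi + lr * (b - pi) for pi, b in zip(p, top)]
        # mutation factor applied to the probabilities themselves
        p = [min(0.999, max(0.001, pi + rng.uniform(-mut, mut)))
             for pi in p]
    return best, best_f
```

For example, `pbil(sum, 8)` maximizes the number of ones in an 8-bit string; the probability-vector update is what makes the method resemble a learning automata reinforcement scheme, with the mutation step as the distinguishing addition.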
The learning automata controller we describe here can be visualized as a combination of
intelligent modules at the tactical level. However, our approach differs from the above-mentioned
research in the use of the learning algorithms. Instead of learning the parameters affecting the
firing of actions on repeated runs, learning automata learn which action to fire based on the local
sensor information. In other words, the learning is not at the design phase, but at the run phase.
The parameters defined in [Sukthankar96a] correspond to learning (Chapters 3, 4, and 6) and
sensor (Chapters 4 and 5) parameters of our controller, and can be used to adjust those. Learning
and sensor parameters define the capabilities of an automated vehicle, and can also be used to
model different driver behaviors. For example, large learning parameters decrease the decision
time, and consequently result in a more agile vehicle path.
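A minimal sketch of such run-phase learning follows: an automaton picks a lateral action from its probability vector, a toy sensor "teacher" returns a penalty flag, and a linear reward-penalty update is applied. The update rule is the standard linear scheme, but the action names, sensor logic, and step sizes are assumptions, not the controller developed in later chapters:

```python
import random

ACTIONS = ["stay", "shift_left", "shift_right"]   # hypothetical actions

class LateralAutomaton:
    """Variable-structure automaton with a linear reward-penalty update."""
    def __init__(self, a=0.2, b=0.1, seed=5):
        self.p = [1.0 / len(ACTIONS)] * len(ACTIONS)
        self.a, self.b = a, b                     # reward/penalty steps
        self.rng = random.Random(seed)

    def choose(self):
        r, acc = self.rng.random(), 0.0
        for i, pi in enumerate(self.p):
            acc += pi
            if r <= acc:
                return i
        return len(self.p) - 1

    def update(self, i, penalty):
        n = len(self.p)
        if penalty:   # move probability away from the penalized action
            self.p = [(1 - self.b) * pj if j == i
                      else self.b / (n - 1) + (1 - self.b) * pj
                      for j, pj in enumerate(self.p)]
        else:         # reward: move probability toward action i
            self.p = [pj + self.a * (1 - pj) if j == i
                      else (1 - self.a) * pj
                      for j, pj in enumerate(self.p)]

def sensor_penalty(action_index):
    # toy environment: the right lane is blocked, so shifting right
    # is always penalized; everything else is rewarded
    return action_index == ACTIONS.index("shift_right")

auto = LateralAutomaton()
for _ in range(300):
    i = auto.choose()
    auto.update(i, sensor_penalty(i))
```

After a few hundred steps the probability of the penalized action shrinks toward zero: the learning happens at run time, from the environment's responses, rather than at design time.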
Another approach to intelligent vehicle control for autonomous navigation uses a
decision-theoretic architecture with probabilistic networks [Forbes95]. The problem is modeled
as a partially observable Markov decision process, and the optimal action is a function of the
current belief state described as the joint distribution over all possible actual states of the world.
However, this work currently considers only combinations of dynamic decision networks with a
decision tree, i.e., if-then rules where each predicate is actually a complex set of probability
thresholds on specific variables, and it is similar to the previous work in the sense of firing
actions. Similarly, Niehaus and Stengel defined a rule-based navigation system that uses a worst-
case decision-making (WCDM) approach. Again, a stochastic model of the traffic situation based
on sensor measurements is assumed [Niehaus94].
This application of learning automata is one of the very few applications to real-time
learning problems [Narendra89, Najim91b]. The difficulty in modeling the environment (highway,
the vehicles, and the drivers) makes learning automata a likely candidate for such an application,
since a learning system is guaranteed to adapt to existing conditions provided that the teachers
are designed carefully.
The overall design structure, the effects of several parameters on the decision, and several other
issues related to the intelligent path planner conclude Chapter 5.
In Chapter 6, we introduce reinforcement schemes (learning algorithms) used for our
application of learning automata. The schemes are compared according to their desired update
characteristics. A new nonlinear scheme and a linear scheme not considered in previous
research efforts are found to be more advantageous than others. For the linear scheme, a
proof of optimality for a specific case in our application to intelligent vehicle control is given.
Instead of the general linear differential equation approach [Narendra89], stability theorems for
discrete-time systems are used for this purpose. The behavior of the new nonlinear reinforcement
scheme is also discussed in this chapter. It is found that the algorithm is absolutely expedient.
These two new schemes are the results of our attempt to find fast converging reinforcement
schemes.
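An experiment of the kind summarized here (counting the steps for the optimal action's probability to reach 0.995) can be sketched as follows, assuming a stationary P-model environment with penalty probabilities c. Setting b = 0 yields an LR-I scheme and b > 0 an LR-P scheme; all numerical values are assumptions for illustration:

```python
import random

def steps_to_converge(c, a, b, target=0.995, max_steps=50000, seed=1):
    """Steps until the optimal action's probability reaches `target`
    under a linear scheme in a stationary P-model environment with
    penalty probabilities `c`; returns None if never reached."""
    rng = random.Random(seed)
    n = len(c)
    p = [1.0 / n] * n
    opt = min(range(n), key=lambda i: c[i])   # least-penalized action
    for step in range(1, max_steps + 1):
        # sample an action from the probability vector
        r, acc, i = rng.random(), 0.0, n - 1
        for j, pj in enumerate(p):
            acc += pj
            if r <= acc:
                i = j
                break
        if rng.random() < c[i]:               # environment penalty
            p = [(1 - b) * pj if j == i
                 else b / (n - 1) + (1 - b) * pj
                 for j, pj in enumerate(p)]
        else:                                 # environment reward
            p = [pj + a * (1 - pj) if j == i else (1 - a) * pj
                 for j, pj in enumerate(p)]
        if p[opt] >= target:
            return step
    return None
```

Averaging the returned step counts over many runs, for grids of a and b values, produces convergence-speed comparisons of the kind plotted in Figures 6.1 through 6.4; note that since LR-I is only epsilon-optimal, individual runs may fail to converge, as Figure 6.2 illustrates.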
Chapter 7 introduces the concept of interacting automata and the treatment of multiple
automata as game playing agents. The concept of the automata games is then extended to
multiple automata in multiple vehicles. An approach similar to automata games [Narendra89] is
taken for interacting vehicles as well. By defining the physical environment of a vehicle as a
switching environment, it is possible to treat a highway situation as state transitions in a Markov
chain. By associating an automata environment with each stationary physical environment, it is
possible to analyze and design physical environment changes as state transitions based on the
reward-penalty structure of the automata environment¹. This approach led to the design of the flag
structure explained in detail in Chapter 5.
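The game-playing view can be illustrated with two lane-choosing automata whose combined environment response rewards only non-conflicting choices. The lane count, step size, and iteration count below are assumptions for illustration, not the scenarios of Chapter 7:

```python
import random

# Two automata interacting through a combined environment response:
# a reward is issued only when the chosen lanes do not conflict.
# An LR-I update (no change on penalty) is used.

rng = random.Random(7)
a = 0.15                                        # reward step size
lanes = 3
p = [[1.0 / lanes] * lanes for _ in range(2)]   # one vector per vehicle

def pick(probs):
    r, acc = rng.random(), 0.0
    for i, pi in enumerate(probs):
        acc += pi
        if r <= acc:
            return i
    return len(probs) - 1

for _ in range(5000):
    choice = [pick(p[0]), pick(p[1])]
    if choice[0] != choice[1]:          # no conflict: both are rewarded
        for v in range(2):
            i = choice[v]
            p[v] = [pj + a * (1 - pj) if j == i else (1 - a) * pj
                    for j, pj in enumerate(p[v])]
    # on a conflict, LR-I applies no update (inaction)

final = [max(range(lanes), key=lambda i: p[v][i]) for v in range(2)]
```

With high probability the two automata settle on distinct lanes, i.e., the interacting reward mechanisms drive the joint physical environment into a conflict-free state, which is the effect the flag structure is designed to enforce.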
A brief discussion of technical feasibility and user acceptance of AHS, the conclusions
drawn from the work presented here, and our recommendations for future research as well as the
discussion of the problems we encountered in this research are listed in Chapter 8.
Notations and definitions of the variables, list of abbreviations and acronyms, additional
proof of convergence for a simple linear reinforcement scheme, a brief description of the
simulation program and diagrams of the states for highway scenarios mentioned in Chapter 7
are all given in the Appendices.
1 The distinction between automata and physical environments is discussed in Chapter 4.
Chapter 2
Automated Highway Systems
The consensus in the AHS community is that AHS will evolve over a series of smaller
steps in technology. The final step of full automation will not be a leap, but a logical consequence
of previous development and deployment efforts. Each step in the technology will have its own
benefits and be self-sustaining. Vehicle and infrastructure evolutions will be synchronous
[James94]. We will briefly mention the steps of this evolution here before introducing the AHS
program and discussing automatic vehicle control technologies in detail.
When cruise control was first developed, there was much concern over the safety and
user acceptance of the new system; however, it has become widely accepted and used. In the near
future, obstacle and headway warning and Automatic Vehicle Identification (AVI) will be added
to modern cruise control and existing communications infrastructure. The success of AHS
depends on linking the power of cellular communications and the emerging range of high-
performance computers to the ongoing vehicle based developments. Ideally, the highway system
can be divided into a number of cells which contain local radio receivers or beacons that will be
linked together through a fiber-optic network. Vehicles will also be equipped with a transceiver
unit carrying several user services. The first applications of this technology are Automatic Vehicle Identification (AVI) and Electronic Toll Collection (ETC). Obstacle and headway warning is the next step in AHS development in vehicles. Vehicle on-board radar (VORAD) systems have already been in use in many commercial vehicles for the last two years. An important issue in warning systems is the capability of the sensor modules. Differentiating between a large
Cem Ünsal Chapter 2. Automated Highway Systems 10
vehicle and a small animal may not be possible using a simple system. A consequent application
of the headway warning system is the automatic headway control. Adaptive cruise control
systems are currently designed by many automobile manufacturers.
The market introduction of the first vehicle with adaptive cruise control is expected in
1997. This will enable drivers to hold their desired speed as well as the desired headway distance. Although the drivers defined as "creepers" will be cut off by more aggressive drivers ("hunters"), the ability to set the desired headway may be desirable to many users. Also, issues such as sensor types, curve handling, merging vehicles, changing lanes, and integration of steering and braking all have to be addressed to obtain a complete system design.
Applications in advanced traffic management, traveler information and public
transportation systems (ATMS, ATIS, APTS) will require more sophisticated vehicle location
capabilities. In addition, the number of uses for vehicle-to-roadside communications will
eventually increase. MAYDAY services, fleet tracking and automatic vehicle location (AVL)
applications will use radio-location beacons as well as more sophisticated transceivers. As a
result of AVL and AVI, processing real-time information on vehicle locations will be possible.
Although the number of vehicles equipped with AVI/AVL technologies will initially be small,
traffic management centers can effectively use a small percentage of vehicles as probes.
Roadside-to-vehicle and vehicle-to-vehicle communications are also important for the
future of AHS. Automatic braking systems may be activated by decelerating vehicles in front, or
by the infrastructure sending a deceleration request to the headway control system. The vehicle
must be very sure about the imminent danger, and knowledge of following vehicles and their
speeds is an important factor to be considered. Inter-vehicle communications and rear sensing
both would help in automatic braking.
Evolution of the AHS system will continue with lane departure warning. It will be the
first system to control lateral movement of vehicles. The lane holding feature will consequently
be added to the adaptive cruise control, shortly after the lane departure warning feature.
Automatic lane holding will provide a "hands off/feet off" driving situation where the driver is
still responsible for all command decisions in the vehicle and must be aware at all times of his
surroundings. If the infrastructure knows the location of each vehicle, possesses the information
about its current path, and is communicating with the vehicle, then the lateral control can be
coordinated from the infrastructure.
Further advances in technology will take control of the vehicle away from the driver. In order to gain any additional benefit in safety and efficiency1, the driver must be removed as the primary source of command and control. Of course, this change requires that the automated system perform better than a good driver, i.e., make no more than 1 critical error in 10^8 seconds. This step will be the natural consequence of the previous progress. Obviously, not all
vehicles will be equipped with this technology right away. Automated and manually driven
vehicles have to coexist for some time.
1 With the driver in control, highway capacity can only be increased to 15% above the maximum value of 40 vehicles/lane/min. Furthermore, 90% of highway accidents are the result of incorrect driver decisions.
A vehicle that can predict the actions of neighboring vehicles is an important step for
safer highway transportation. Locating the position of all the vehicles in close proximity to the
automated vehicle with high accuracy is essential. This can be accomplished through multi-sensor
systems for adjacent vehicles and possibly inter-vehicle communications to give an idea of what
to expect beyond adjacent vehicles. Alternatively, the roadside control may have knowledge of
the positions of the vehicles relative to fixed reference points. This knowledge is obtained by
either vehicle based or roadside based detection, and/or by communicating with the vehicle.
This technology requires extreme accuracy in vehicle location at all times. If the system is
infrastructure-based, the infrastructure needs to know the locations of the non-automated
vehicles for safe and efficient implementation. Vehicle location information must be updated at least 100 times per second with an accuracy better than 10 cm for the desired level of safety [James94]. Automated vehicle control (AVC) systems are expected to boost the capacity by
50% even for mixed vehicle traffic. Once the system has knowledge of the surrounding
environment to all extents, it can make decisions on merging and passing in addition to the
headway control and lane keeping performed under driver control. Full system optimization and
higher efficiencies can then be obtained as the percentage of automated vehicles on the road
increases.
Highways have many characteristics that simplify the problem of automation, such as uninterrupted traffic flow and controlled access. Therefore, automation on arterials will lag
significantly behind automated highways. However, many safety measures can be taken on
arterials using the equipment already designed for the highway. For example, the problem of
intersection collision can be reduced by activating the onboard warning systems and automatic
braking systems with electronic signal lights in addition to the normal traffic signal. If the
intersection detects a potential for a collision it can notify equipped vehicles. Problems to be
encountered during AHS deployment on arterials include integration of cyclists and motorcyclists
to the AVC system, and the effects of pedestrian and animal traffic.
The final step in the AHS future is fully automatic control, wherein the driver will have
no control over the vehicle. All trip decisions will be automatically made using AVL and ATIS
information. The driver may be able to include additional criteria for route selection. Once the trip
decision is made, the infrastructure, utilizing AVC, will guide the vehicle while constantly
updating the routing strategy based on the current information obtained through Advanced
Traffic Management System (ATMS).
at [NASHC96]. The information about the past and future of the AHS program can also be found
at [AHS96].
Outside of the US, similar AHS programs are being conducted in Japan [ITSJ96], and in
Europe [Braess95]. Almost all automotive companies in Japan have automated vehicles in the
design and/or testing phase [Construction96]. Similar activities are reported in the
PROMETHEUS program in Europe [Rault95].
steering command, and is designed by another NASHC participant. In order to locate the road
ahead, the rapidly adapting lateral position handler (RALPH) uses a template-based matching
technique to find parallel image features such as lane markings or tire and oil markings. During the
experiment called "No Hands Across America," the test bed vehicle equipped with the RALPH system drove 98% of the 2850-mile journey autonomously. An average speed of 63 mph in conditions that included bright sunlight, dusk, rain and nighttime, and a maximum stretch of 69 miles of autonomous driving are reported [Pomerlau96]. A third application for lateral control consists of a vision-based system with a neural network learning from a driver. Performance levels comparable to the human driver are reported in [Moon96]. Özgüner et al. reported successful results for lane following using a vision sensor [Özgüner96].
the distance between vehicles to 0.20m will change this number to 130 vehicles/lane/min, although
it may not be feasible for a heterogeneous platoon of vehicles. For intra-platoon distances
discussed here, it is impossible for a driver, who has a reaction delay of 0.25-1.20 sec, to
guarantee adequate safety.
According to Varaiya, the tasks for an automated vehicle-highway system to accomplish
can be achieved by a four-layer hierarchical control architecture [Varaiya93]. The layers of the
architecture, from the top, are network, link, planning, and regulation. The network layer assigns
a route to each vehicle as it enters the system. The link layer assigns each vehicle a path which
balances traffic for all lanes, and assigns a target speed for each section of highway. This layer
may also assign platoon size. The planning layer creates a plan which approximates the desired
path. The regulation layer controls the vehicle trajectory so that it conforms to this plan. Below
the regulation layer, a physical layer that provides sensor data and responds to actuator signals,
is assumed.
Different choices of partition of control authority between the infrastructure and the
vehicle are compatible with this architecture. The regulation and planning layer controllers are on board the vehicles, while the link and network layers are on the roadside. It is important to find a design
that combines the advantages of a fully centralized control system and an architecture based on
autonomous vehicles and local sensor information.
In this thesis, we introduce an intelligent controller which can be seen as the planning
layer of an autonomous vehicle. The planning layer, as defined in [Varaiya93], has three tasks:
In our application, where the scenario includes non-automated vehicles and minimal
communications, the coordination of maneuvers between planning layers does not exist. The
communication between planning and regulation layers is fairly simple: the planning layer sends a
command, and the regulation layer returns a reply once it successfully completes the command.
A richer interface may be required: the planning layer could pass multiple parameters to the
regulation layer, which could then return parameters indicating the success or errors and
exceptions. The theory of control of such a system is not yet developed. There is a need for
research as to how the regulation layer should switch from one control law to another. Lasky and Ravani state that this represents an open research issue whose solution may be vital to the implementation of a full AHS [Lasky93].
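The simple command-reply hand-shake between the planning and regulation layers described above can be sketched as a minimal interface. All class names, the command set, and the reply fields below are hypothetical illustrations, not the dissertation's actual interface; a real regulation layer would switch among control laws rather than simply acknowledge.

```python
from dataclasses import dataclass
from enum import Enum

class Command(Enum):
    """Hypothetical planning-layer command set."""
    KEEP_LANE = 0
    SHIFT_LEFT = 1
    SHIFT_RIGHT = 2
    SPEED_UP = 3
    SLOW_DOWN = 4

@dataclass
class Reply:
    done: bool        # command completed successfully
    error: str = ""   # a richer interface could report errors and exceptions

class RegulationLayer:
    """Stand-in for the on-board trajectory controller."""
    def execute(self, cmd: Command) -> Reply:
        # A real controller would select and run a control law here;
        # this sketch simply acknowledges the command.
        return Reply(done=True)

class PlanningLayer:
    """Sends one command at a time and waits for the reply."""
    def __init__(self, regulation: RegulationLayer):
        self.regulation = regulation

    def request(self, cmd: Command) -> bool:
        return self.regulation.execute(cmd).done

planner = PlanningLayer(RegulationLayer())
assert planner.request(Command.SHIFT_LEFT)
```

The richer interface mentioned above would replace the single `done` flag with multiple parameters passed in each direction.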
The approach above is one of the many different approaches to AHS3. Some of these
concepts are based on cooperative architectures [McKendree96], maximum adaptability
3 We emphasized this architecture since it is used in DYNAVIMTS, the Virginia Tech Center for Transportation Research's simulation program (see Section 2.6).
which traffic and hazard information detected by one group of vehicles can be communicated to
other vehicles. There are a variety of methods for roadside-vehicle communications; most of them
are discussed in [Field92]. Recent research on communications includes interference studies for
vehicle transceivers [Gavan96], design of communication protocols [Godbole96], WaveLAN
radio [Chen96], and network architectures and protocols for vehicle-to-vehicle communications
[Bolla96, Fuji96].
fully automated) on driver behaviors while entering the automated lane are tested using a driving
simulator [Bloomfield96b]. Levitan presented a comprehensive report on human factors
considerations related to the design of AHS. The report is intended as a guideline for effective,
efficient, safe and user-acceptable AHS design [Levitan96].
Figure 2.1. Four-block evaluation and experimentation framework for AHS (with feedforward and feedback loops).
These four blocks can be considered as the building blocks of a comprehensive testing and
evaluation methodology for AHS. The input can be a hypothesis, a model, or technologies. The
evaluation and testing procedure defined by this methodology is not seen as a single feedthrough
four-step process, but as having some feedback and feedforward loops depending on the results
obtained at each block. These loops represent the changes made to the hypothesis, model or the
technological concept. Hardware tests are important since they provide the means to validate
computer results or to modify them in the case of discrepancies, due to unmodeled or
inadequately modeled dynamics. Without hardware testing, it would be foolhardy to jump into
actual implementation. For instance, the FLASH Laboratory could be used to improve the
computer simulation via scale model tests before starting the tests with full scale vehicles.
Although testing with real-sized vehicles will provide the most accurate results, it has
significant cost, safety and liability considerations. In order to test highway traffic and similar
situations, there is a need for a large number of automated and non-automated vehicles. This is
very expensive, and due to a testing situation involving humans, the costs can become prohibitive
and the situations dangerous, which could lead to insurance and safety problems. Also, full-size
testing of infrastructure based systems which require complex communication scenarios may be
expensive due to (roadside and in-vehicle) installation costs. The four-block structure is designed
to overcome these problems. A comparison of the design/development stages for different
characteristics is given in Table 2.1.
laboratory and its vehicles may serve as a benchmark through which various configurations,
architectures, and technologies can be compared. Evaluation of alternative AHS system concepts,
selection of a system approach based on analysis, test and evaluation, and demonstration of
prototype configurations on a smaller and cheaper scale are the foci of the laboratory.
Furthermore, the laboratory can be used to obtain preliminary data for the operation evaluation
(third) phase of the AHS program, and be a test bed for several issues which cannot be tested on
selected locations because of considerations of insurance and safety.
Currently, the laboratory is equipped with multiple small-scale vehicles with different types of tires and shock mechanisms. The vehicles have infrared and magnetic sensors and a wireless video camera for lateral displacement measurements, ultrasonic sensors for headway sensing, and encoders for speed measurements. Sensing and control are provided by a GCB-11 microcontroller. Wireless radio modems provide full-duplex serial communications (Figure 2.2). The remote control station includes IBM PC computers equipped with a frame grabber DSP board, a micro-kernel operating system, and a data acquisition board for lateral feedback control via image processing, as well as a steering wheel, control pedal and monitor for manual control (Figure 2.3).
Single- and double-lane small scale modular tracks with inclines and multiple radius left and right
turns are designed using EPDM rubber membrane (Figure 2.4). The laboratory also has an
experimental testing platform designed for traction control and vehicle plant modeling tests
[Kachroo96, Schlegel96].
Figure 2.4. Test tracks and the control station in the FLASH Laboratory.
Simulator: C++, CSim (Unix)
Animator: C, OpenGL, Matlab (Windows NT, Unix)
Communicator: C, X-Motif, Matlab, ArcInfo (Unix, Windows NT)
Depending upon the spatial dimensions of the problem at hand, macroscopic and
microscopic simulation models will be used. For example, DYNAVIMTS is currently capable of
studying the effects of vehicle dynamics on the stability of platoons. Incorporation of different brake technologies is also underway. DYNAVIMTS has a range of vehicle dynamics and control
models that can be used for microscopic simulation studies. On the other hand, DYNAVIMTS is
also designed to study the effects of highway automation on the overall traffic flow in the traffic
network.
The simulation software is developed using C++ with CSIM libraries [CSIM96], and Matlab
[Matlab96]. As we move toward the macroscopic models, the design of the package will be
completely object-oriented, and multi-processor machines or parallel processing with multiple
machines will become necessary. A detailed description of the DYNAVIMTS package can be
found in [Nagarajan96].
the detailed information about the control parameters and sensor data. The animation submodule
is developed in C using OpenGL libraries on an NT workstation. The simulation data is currently
transferred via a pre-formatted ASCII data file. This module also contains the graphical data for
the planned Smart Road project. A detailed description of the package, its file structure and
graphic user interface is given in [Nagarajan96].
The control structure we introduce in Chapter 4 is incorporated in the DYNAVIMTS
simulation package as part of the planning layer simulator. Since the link layer development in
DYNAVIMTS is not complete, the planning layer simulator on Matlab assumes that the link
layer data is available (if at all necessary), and the resulting data is transferred to the animator
submodule in the form of an ASCII file. At the end of each simulation run, the user is also able to
use the GUI to plot different parameters and values using Matlab. The connection between the
planning and regulation layers does not yet exist as is the case in similar simulation packages. A
snapshot of the planning layer GUI is given in Appendix C.
Figure 2.7. Components of the DYNAVIMTS simulation package (GUI, simulation, 3D animation with bird's eye and driver's views, zooming, panning, fast forward, rewind, pause and resume, and Matlab statistics plots).
Chapter 3
Stochastic Learning Automata
In classical control theory, the control of a process is based on complete knowledge of the
process/system. The mathematical model is assumed to be known, and the inputs to the process
are deterministic functions of time. Later developments in control theory considered the
uncertainties present in the system. Stochastic control theory assumes that some of the
characteristics of the uncertainties are known. However, all those assumptions on uncertainties and/or input functions may be insufficient to successfully control the system if it changes. It is
then necessary to observe the process in operation and obtain further knowledge of the system,
i.e., additional information must be acquired on-line since a priori assumptions are not sufficient.
One approach is to view these as problems in learning.
Rule-based systems, although performing well on many control problems, have the
disadvantage of requiring modifications, even for a minor change in the problem space.
Furthermore, rule-based approaches, especially expert systems, cannot handle unanticipated
situations. The idea behind designing a learning system is to guarantee robust behavior without
the complete knowledge, if any, of the system/environment to be controlled. A crucial advantage
of reinforcement learning compared to other learning approaches is that it requires no information
about the environment except for the reinforcement signal [Narendra89, Marsh93].
A reinforcement learning system is slower than other approaches for most applications
since every action needs to be tested a number of times for a satisfactory performance. Either the
learning process must be much faster than the environment changes (as is the case in our
Cem Ünsal Chapter 3. Stochastic Learning Automata 30
application), or the reinforcement learning must be combined with an adaptive forward model
that anticipates the changes in the environment [Peng93].
Learning1 is defined as any permanent change in behavior as a result of past experience,
and a learning system should therefore have the ability to improve its behavior with time, toward
a final goal. In a purely mathematical context, the goal of a learning system is the optimization of
a functional not known explicitly [Narendra74].
In the 1960s, Y. Z. Tsypkin [Tsypkin71] introduced a method to reduce the problem to
the determination of an optimal set of parameters and then apply stochastic hill climbing
techniques. M.L. Tsetlin and colleagues [Tsetlin73] started the work on learning automata during
the same period. An alternative approach to applying stochastic hill-climbing techniques,
introduced by Narendra and Viswanathan [Narendra72], is to regard the problem as one of finding
an optimal action out of a set of allowable actions and to achieve this using stochastic automata.
The difference between the two approaches is that the former updates the parameter space at
each iteration while the latter updates the probability space.
The stochastic automaton attempts a solution of the problem without any information on
the optimal action (initially, equal probabilities are attached to all the actions). One action is
selected at random, the response from the environment is observed, action probabilities are
updated based on that response, and the procedure is repeated. A stochastic automaton acting as
described to improve its performance is called a learning automaton.
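The select-observe-update cycle just described can be sketched in a few lines. The penalty probabilities below are made up, and the simple reward-only proportional update is an illustrative placeholder for the reinforcement schemes introduced in the following sections.

```python
import random

random.seed(1)
c = [0.7, 0.2, 0.5]   # hypothetical penalty probabilities, one per action
p = [1 / 3] * 3       # equal probabilities attached to all actions initially
a = 0.05              # learning parameter

for n in range(2000):
    i = random.choices(range(3), weights=p)[0]     # select an action at random
    beta = 1 if random.random() < c[i] else 0      # observe the response
    if beta == 0:                                  # update only on reward
        p = [pj + a * (1 - pj) if j == i else (1 - a) * pj
             for j, pj in enumerate(p)]

print(p)   # probability mass shifts toward actions with low penalty probability
```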
1 Webster defines learning as "to gain knowledge or understanding of a skill by study, instruction, or experience."
The most recent (and second most comprehensive) book since the Narendra-Thathachar
collaboration on learning automata theory and applications was published by Najim and Poznyak in
1994 [Najim94]. This book also includes several applications and examples of learning automata.
Until recently, the applications of learning automata to control problems were rare.
Consequently, successful updating algorithms and theorems for advanced automata applications
were not readily available. Norio Baba's work [Baba85] on learning behaviors of stochastic
automata under a nonstationary multi-teacher environment, and general linear reinforcement
schemes [Bush58] are taken as a starting point for the research presented in this dissertation.
Recent applications of learning automata to real life problems include control of
absorption columns [Najim91], bioreactors [Gilbert92], control of manufacturing plants
[Sequeria91], pattern recognition [Oommen94a], graph partitioning [Oommen94b], active vehicle
suspension [Marsh93, 95], path planning for manipulators [Naruse93], distributed fuzzy logic
processor training [Ikonen97], and path planning [Tsoularis93] and action selection [Aoki95] for
autonomous mobile robots.
Recent theoretical results on learning algorithms and techniques can be found in
[Najim91b], [Najim94], [Sastry93], [Papadimitriou94], [Rajaraman96], [Najim96], and
[Poznyak96].
take on one of the values β1 and β2. In the simplest case, the values βi are chosen to be 0 and 1, where 1 is associated with a failure/penalty response. The elements of c are defined as:

Prob{β(n) = 1 | α(n) = αi} = ci   (i = 1, 2, ..., r)   (3.1)
Therefore ci is the probability that the action i will result in a penalty input from the
environment. When the penalty probabilities ci are constant, the environment is called a
stationary environment.
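Definition (3.1) says that each action αi independently draws a penalty with fixed probability ci. A quick Monte Carlo check, with made-up ci values, shows the empirical penalty frequency matching ci in a stationary environment:

```python
import random

random.seed(0)
c = [0.8, 0.3]     # hypothetical penalty probabilities c_i
trials = 10000

freq = []
for ci in c:
    # count penalty responses (beta = 1) for this action over many trials
    penalties = sum(1 for _ in range(trials) if random.random() < ci)
    freq.append(penalties / trials)

print(freq)   # empirical frequencies close to [0.8, 0.3]
```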
There are several models defined by the response set of the environment. Models in
which the input from the environment can take only one of two values, 0 or 1, are referred to as
P-models3. In this simplest case, the response value of 1 corresponds to an unfavorable (failure, penalty) response, while an output of 0 means the action is favorable. A further generalization of the environment allows finite response sets with more than two elements that may take a finite number of values in an interval [a, b]. Such models are called Q-models. When the input from the environment is a continuous random variable with possible values in an interval [a, b], the model is called an S-model.
[Figure: The automaton (state φ, transition function F, output function G) in a feedback loop with the environment (penalty probabilities c); the automaton's output is the environment's input, and the environment's response is the automaton's input.]
3 We will follow the notation used in [Narendra89].
• F(φ, β): a function that maps the current state and input into the next state. F can be deterministic or stochastic:

φ(n + 1) = F[φ(n), β(n)]   (3.3)

• H(φ, β): a function that maps the current state and input into the current output. If the current output depends only on the current state, the automaton is referred to as a state-output automaton. In this case, the function H(φ, β) is replaced by an output function G(φ), which can be either deterministic or stochastic:

α(n) = G[φ(n)]   (3.4)
For our applications, we choose the function G(φ) as the identity function, i.e., the states of the automata are also the actions.
fij^β = Pr{φ(n + 1) = φj | φ(n) = φi, β(n) = β}   (i, j = 1, 2, ..., s;  β = β1, β2, ..., βm)   (3.5)

Since the fij^β are probabilities, they lie in the closed interval [a, b]5; and to conserve probability measure we must have:

Σ_{j=1}^{s} fij^β = 1 for each β and i.   (3.7)
Example
States: φ1, φ2   Inputs: β1, β2
Transition matrices:

F(β1) = | 0.50  0.50 |
        | 0.25  0.75 |

F(β2) = | 0.10  0.90 |
        | 0.75  0.25 |
4 In Chapter 4, where we define our initial algorithm, the probabilities fij will be the same for all inputs β and all states φi given a state φj. Furthermore, the output mapping G is chosen as the identity mapping.
5 [a, b] = [0, 1] in most cases.
Transition graphs:
Under input β1: φ1 → φ1 with probability 0.50, φ1 → φ2 with 0.50; φ2 → φ1 with 0.25, φ2 → φ2 with 0.75.
Under input β2: φ1 → φ1 with probability 0.10, φ1 → φ2 with 0.90; φ2 → φ1 with 0.75, φ2 → φ2 with 0.25.
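The example's transition matrices can be exercised directly. The sketch below simulates the fixed-structure automaton under a constant input; state indices 0 and 1 stand for φ1 and φ2, and the long-run state frequencies approach the stationary distribution of the corresponding Markov chain.

```python
import random

random.seed(0)
# Transition matrices from the example; rows are the current state,
# columns the next state.
F = {
    "beta1": [[0.50, 0.50], [0.25, 0.75]],
    "beta2": [[0.10, 0.90], [0.75, 0.25]],
}

def step(state, beta):
    """Draw the next state according to F(beta)."""
    return random.choices([0, 1], weights=F[beta][state])[0]

# Long-run state frequencies under the constant input beta1.
state, visits = 0, [0, 0]
for _ in range(100000):
    state = step(state, "beta1")
    visits[state] += 1
print([v / 100000 for v in visits])   # approaches the stationary distribution [1/3, 2/3]
```

Solving π = πF(β1) by hand gives π = [1/3, 2/3], which the simulated frequencies confirm.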
In the previous example, the conditional probabilities fij were assumed to be constant, i.e., independent of the time step n and of the input sequence. Such a stochastic automaton is referred to as a fixed-structure automaton. As we will discuss later, it is useful to update the transition probabilities at each step n on the basis of the environment response at that instant. Such an automaton is called a variable-structure automaton.
Furthermore, in the case of a variable-structure automaton, the above definitions of the transition functions F and G are not used explicitly. Instead of transition matrices, a vector of action probabilities p(n) is defined to describe the reinforcement schemes we introduce in the next sections.
If the variable-structure automaton is a state-output automaton, and the probabilities fij of transition from one state to another do not depend on the current state or the environment input, then the relation between the action probability vector and the transition matrices is as follows:
Since the automaton is a state-output automaton, we omit the transition matrix G, and
only consider the transition matrix F.
The transition does not depend on the environment response, therefore there is only
one state transition matrix F where the elements are fij. For example, we must have:
F(β1) = F(β2) ≡ F = | f11  f12 |   (3.8)
                    | f21  f22 |
Furthermore, the probability of being in one state (or generating one action) does not
depend on the initial/previous state of the automaton. Therefore, the transition matrix reduces to:
F = [ f11 = f21 ≡ f1    f12 = f22 ≡ f2 ] ≡ p   (3.9)

where the vector p consists of the probabilities pi of the automaton being in the i-th state (or choosing the i-th output/action).
To judge the performance of the automaton, we need to set up quantitative norms of behavior.
The quantitative basis for assessing the learning behavior is quite complex, even in the simplest
P-model and stationary random environments. To introduce the definitions for norms of
behavior, we will consider this simplest case. Further definitions for other models and non-
stationary environments will be given whenever necessary.
The average penalty for a given action probability vector p(n) is defined as:

M(n) = Σ_{i=1}^{r} ci pi(n)

Optimality implies that the action αl associated with the minimum penalty probability cl is chosen asymptotically with probability one. In spite of the efforts of many researchers, a general algorithm which ensures optimality has not been found [Baba83, Kushner72, Narendra89]. It may not be possible to achieve optimality in every given situation. In this case, a suboptimal behavior is defined.
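These norms can be made concrete with a small numerical sketch. For hypothetical penalty probabilities, the pure-chance automaton's average penalty M0 and the average penalty of an automaton that always picks the best action bracket the achievable behavior:

```python
c = [0.7, 0.2, 0.5]   # hypothetical penalty probabilities
r = len(c)

def M(p):
    """Average penalty M(n) = sum_i c_i * p_i(n)."""
    return sum(ci * pi for ci, pi in zip(c, p))

M0 = M([1 / r] * r)   # pure-chance automaton: all actions equally likely
print(M0)             # (0.7 + 0.2 + 0.5) / 3, approximately 0.4667
print(M([0, 1, 0]))   # optimal behavior: equals min(c) = 0.2
```

A learning automaton is doing useful work whenever its asymptotic average penalty falls below M0.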
Some automata satisfy the conditions stated in the definitions for specified initial
conditions and for certain sets of penalty probabilities. However, we may be interested in
automata that exhibit a desired behavior in arbitrary environments and for arbitrary initial
conditions. These requirements are partially met by an absolutely expedient automaton: an automaton is said to be absolutely expedient if

E[M(n + 1) | p(n)] < M(n)

for all n, all pi(n) in the open interval (0, 1), and all possible values of the penalty probabilities ci6.
6 We will discuss the feasibility of the case later.
corresponds to one action, i.e., the automaton is a state-output automaton. The simulation
examples in later chapters also use the same identity state-output mapping.
In general terms, a reinforcement scheme can be represented as follows:
p(n + 1) = T1[p(n), α(n), β(n)]   (3.19a)

or

fij(n + 1) = T2[fij(n), β(n), φ(n + 1), φ(n)]   (3.19b)

where T1 and T2 are mappings. Again, α is the action, β is the input from the environment, and φ is the state of the automaton. From now on, we will use the first mathematical description (3.19a) for reinforcement schemes. If p(n+1) is a linear function of p(n), the reinforcement scheme is said to be linear; otherwise it is termed nonlinear.
If α(n) = αi,

when β = 0:
  pj(n + 1) = pj(n) - gj(p(n))   for all j ≠ i
  pi(n + 1) = pi(n) + Σ_{k≠i} gk(p(n))   (3.20)

when β = 1:
  pj(n + 1) = pj(n) + hj(p(n))   for all j ≠ i
  pi(n + 1) = pi(n) - Σ_{k≠i} hk(p(n))

where gk and hk (k = 1, 2, ..., r) are continuous, nonnegative functions with the following assumptions:

  0 < gk(p(n)) < pk(n)
  0 < Σ_{k≠i} [pk(n) + hk(p(n))] < 1   (3.21)

for all i = 1, 2, ..., r and all probabilities pk in the open interval (0, 1). The two constraints on the update functions guarantee that the sum of all action probabilities is 1 at every time step.
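As a sanity check, the general update (3.20) can be coded with particular choices of gk and hk. The choices below, gk(p) = a·pk and hk(p) = b·(1/(r-1) - pk), are illustrative values that reduce the scheme to the linear update of the next section; the assertion confirms that probability measure is conserved at every step.

```python
def general_update(p, i, beta, g, h):
    """One step of scheme (3.20): on reward (beta = 0) the chosen action's
    probability grows at the expense of the others; on penalty it shrinks."""
    r = len(p)
    if beta == 0:
        return [p[j] + sum(g(p, k) for k in range(r) if k != i)
                if j == i else p[j] - g(p, j) for j in range(r)]
    return [p[j] - sum(h(p, k) for k in range(r) if k != i)
            if j == i else p[j] + h(p, j) for j in range(r)]

# g_k(p) = a*p_k and h_k(p) = b*(1/(r-1) - p_k) recover the linear
# reward-penalty scheme of the next section (a, b chosen arbitrarily here).
a = b = 0.1
g = lambda p, k: a * p[k]
h = lambda p, k: b * (1 / (len(p) - 1) - p[k])

p = [0.40, 0.35, 0.25]
for beta in (0, 1, 0, 0, 1):
    p = general_update(p, 0, beta, g, h)
    assert abs(sum(p) - 1.0) < 1e-12   # probability measure is conserved
print(p)
```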
examples in the next section). It is also possible to divide all possible algorithms for stochastic
learning automata into the following two classes [Najim94]:
nonprojectional algorithms, where individual probabilities are updated based on their previous
values:
p_i(n+1) = p_i(n) + γ k_i( p_i(n), α(n), β(n) )   (3.22)
where γ is a small real number and the k_i are the mapping functions for the update term.
projectional algorithms, where the probabilities are updated by a function which maps the
probability vector to a specific value for each element:
p_i(n+1) = k_i( p(n), α(n), β(n) )   (3.23)
where the functions k_i are the mapping functions.
The main difference between these two subclasses is that nonprojectional algorithms can
only be used with a binary environment response (i.e., in a P-model environment). Projectional
algorithms may be used in all environment types, but are more complex; furthermore, their
implementation is computationally more taxing.
Early studies of reinforcement schemes were centered around linear schemes for reasons
of analytical simplicity. The need for more complex and efficient reinforcement schemes
eventually led researchers to nonlinear (and hybrid) schemes. We will first introduce several
well-known linear schemes, and then close this section with absolutely expedient nonlinear
schemes.
If α(n) = α_i,

when β(n) = 0:
   p_j(n+1) = (1 − a) p_j(n)   for all j ≠ i
   p_i(n+1) = p_i(n) + a [1 − p_i(n)]
                                                (3.25)
when β(n) = 1:
   p_j(n+1) = b/(r−1) + (1 − b) p_j(n)   for all j ≠ i
   p_i(n+1) = (1 − b) p_i(n)
As seen from the definition, the parameter a is associated with reward response, and the
parameter b with penalty response. If the learning parameters a and b are equal, the scheme is
called the linear reward-penalty scheme LR-P [Bush58]. In this case, the update rate of the
probability vector is the same at every time step, regardless of the environment response. This
scheme is the earliest scheme considered in mathematical psychology.
For the linear reward-penalty scheme LR-P, E[p(n)], the expected value of the probability
vector at time step n, can be easily evaluated, and, by analyzing the eigenvalues of the resulting
difference equation, it can be shown that the asymptotic solution of the set of difference equations
enables us to conclude [Narendra89]:

lim_{n→∞} E[M(n)] = r / ( Σ_{k=1}^{r} 1/c_k ) < (1/r) Σ_{k=1}^{r} c_k = M_o   (3.26)
Therefore, from Definition 3.1, the multi-action automaton using the LR-P scheme is expedient for
all initial action probabilities and in all stationary random environments.
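Equation 3.26 states that the asymptotic average penalty of LR-P is the harmonic mean of the penalty probabilities, which lies below the pure-chance (arithmetic) mean M_o whenever the c_k differ. A small sketch of this comparison (the function names are ours):

```python
def lrp_limit_penalty(c):
    """Asymptotic expected penalty of L_R-P (Eq. 3.26): the harmonic
    mean of the penalty probabilities c_k."""
    return len(c) / sum(1.0 / ck for ck in c)

def pure_chance_penalty(c):
    """M_o: average penalty of a pure-chance automaton (arithmetic mean)."""
    return sum(c) / len(c)
```

For example, with c = [0.2, 0.5, 0.8] the limit is about 0.364 versus M_o = 0.5: the scheme is expedient, but not optimal, since the best single action has penalty probability 0.2.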
Expediency is a relatively weak condition on the learning behavior of a variable-structure
automaton. An expedient automaton will do better than a pure chance automaton, but it is not
guaranteed to reach the optimal solution. In order to obtain a better learning mechanism, the
parameters of the linear reinforcement scheme are changed as follows: if the learning parameter b
is set to 0, then the scheme is named the linear reward-inaction scheme LR-I . This means that the
action probabilities are updated in the case of a reward response from the environment, but no
penalties are assessed.
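The LR-I update is simply the b = 0 special case of Equation 3.25. The following sketch simulates it in a toy stationary P-model environment; the penalty probabilities, seed, and parameter values are hypothetical, chosen only for illustration.

```python
import random

def lri_step(p, i, beta, a):
    """Linear reward-inaction: the vector changes only on a reward (beta = 0)."""
    if beta != 0:
        return list(p)
    return [p[i] + a * (1 - p[i]) if j == i else (1 - a) * p[j]
            for j in range(len(p))]

# Toy stationary environment: action 1 has the lowest penalty probability.
c = [0.7, 0.2, 0.6]
random.seed(1)
p = [1 / 3] * 3
for _ in range(5000):
    i = random.choices(range(3), weights=p)[0]   # draw an action from p
    beta = 1 if random.random() < c[i] else 0    # environment response
    p = lri_step(p, i, beta, a=0.01)
```

In typical runs the probability of action 1 grows toward 1; the smaller a is, the closer the probability of converging to the best action gets to unity, at the cost of slower learning.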
For this scheme, it is possible to show that p_l(n), the probability of the action l with
minimum penalty probability c_l, monotonically approaches 1. By making the parameter a
arbitrarily small, we have Pr{ lim_{n→∞} p_l(n) = 1 } as close to unity as desired. This makes the learning
automaton ε-optimal in all stationary random environments. Nonlinear reinforcement schemes, in
contrast, generally hold only under additional conditions on the
penalty probabilities. For example, the following update functions are defined by previous
researchers for nonlinear reinforcement schemes⁷:
g_j(p(n)) = h_j(p(n)) = ( a/(r−1) ) p_i(n) (1 − p_i(n))
This scheme is absolutely expedient for restricted initial conditions and for restricted
environments [Shapiro69].
g_j(p(n)) = p_j(n) φ(p(n))

h_j(p(n)) = p_i(n) φ(p_i(n)) / (r − 1)

This scheme is absolutely expedient provided that certain conditions on the penalty
probabilities are satisfied (e.g., c_l < 1/m and c_j > 1/m for j ≠ l) [Chand68].
g_j(p(n)) = a Ψ( p_1(n), 1 − p_1(n) ) p_j(n) (1 − p_j(n))

h_j(p(n)) = b Ψ( p_1(n), 1 − p_1(n) ) p_j(n) (1 − p_j(n))

where Ψ(p_1, 1 − p_1) = Ψ(1 − p_1, p_1) is a nonlinear function which can be suitably selected, and
the parameters a and b must be chosen properly to satisfy the conditions in Equation
3.21. Sufficient conditions for ensuring optimality are given in [Vor65], but they are valid only
for two-action automata.
The general solution for absolutely expedient schemes was found in the early 1970s by
Lakshmivarahan and Thathachar [Lakshmivarahan73]. Absolutely expedient learning schemes are
presently the only class of schemes for which necessary and sufficient conditions of design are
available. This class of schemes can be considered as the generalization of the LR-I scheme.
Consider the general reinforcement scheme in Equation 3.20. A learning automaton using this
scheme is absolutely expedient if and only if the functions gk (p) and hk (p) satisfy the following
conditions:
g_1(p)/p_1 = g_2(p)/p_2 = ... = g_r(p)/p_r = λ(p)
                                                (3.27)
h_1(p)/p_1 = h_2(p)/p_2 = ... = h_r(p)/p_r = μ(p)
where λ(p) and μ(p) are arbitrary functions satisfying⁸:
⁷ Again, α(n) = α_i is assumed. We give only the updating functions for action probabilities other than the current
action; the function for the current action can be obtained by using the fact that the sum of probabilities is equal to 1.
⁸ The reason for the conditions on the update functions will be explained in detail in Chapter 6.
0 < λ(p) < 1
                                                (3.28)
0 < μ(p) < min_{j=1,...,r} [ p_j / (1 − p_j) ]
Detailed proof of this theorem is given in both [Baba84] and [Narendra89]. A similar proof (for a
new nonlinear absolutely expedient scheme) is given in Chapter 6.
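A convenient way to build a scheme satisfying Equation 3.27 is to set g_k(p) = λ(p) p_k and h_k(p) = μ(p) p_k, which makes the ratios in Equation 3.27 hold identically. The sketch below is ours (function names and the particular λ, μ choices are assumptions); the example μ respects the bound in Equation 3.28.

```python
def absolutely_expedient_step(p, i, beta, lam, mu):
    """General absolutely expedient update with g_k(p) = lam(p) * p_k and
    h_k(p) = mu(p) * p_k (Eq. 3.27); beta = 0 is reward, 1 is penalty."""
    r = len(p)
    q = list(p)
    if beta == 0:
        for j in range(r):
            if j != i:
                q[j] = p[j] - lam(p) * p[j]
        q[i] = p[i] + lam(p) * (1 - p[i])       # mass conserved
    else:
        for j in range(r):
            if j != i:
                q[j] = p[j] + mu(p) * p[j]
        q[i] = p[i] - mu(p) * (1 - p[i])        # mass conserved
    return q

# Example choices satisfying Eq. 3.28:
lam = lambda p: 0.1                                   # constant, 0 < lam < 1
mu = lambda p: 0.5 * min(pj / (1 - pj) for pj in p)   # stays below the minimum
```

Note that with a constant λ(p) = a and μ = 0 this reduces to the LR-I update, consistent with viewing absolutely expedient schemes as a generalization of LR-I.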
where:

s_k = E[ β(n) | α(n) = α_k ]   (3.31)

⁹ Note that this normalization procedure requires the knowledge of a and b.
In the S-model, the s_k play the same role as the penalty probabilities in the P-model, and are
called penalty strengths. Again, for a pure-chance automaton with all the action probabilities
equal, the average penalty M_o is given by:

M_o = (1/r) Σ_{i=1}^{r} s_i   (3.32)
The definitions of expediency, optimality, ε-optimality, and absolute expediency are the
same, except that the c_i's are replaced by the s_i's. Now, consider the general reinforcement
scheme (Equation 3.20). The direct generalization of this scheme to the S-model gives:
p_i(n+1) = p_i(n) − (1 − β(n)) g_i(p(n)) + β(n) h_i(p(n))                    if α(n) ≠ α_i
                                                                              (3.33)
p_i(n+1) = p_i(n) + (1 − β(n)) Σ_{j≠i} g_j(p(n)) − β(n) Σ_{j≠i} h_j(p(n))    if α(n) = α_i
The previous algorithm (Equation 3.20) can easily be obtained from the above definition
by substituting β(n) = 0 and 1. Again, the functions g_i and h_i can be associated with reward and
penalty, respectively. The value 1 − β(n) indicates how far β(n) is from the
maximum possible value of unity, and is largest when β(n) is 0. The functions g_i and h_i have to
satisfy the conditions in Equation 3.21 for p_i(n) to remain in the interval [0,1] at each step. The
non-negativity condition on g_i and h_i maintains the reward-penalty character of these functions,
though it is not strictly necessary.
For example, the LR-P reinforcement scheme (Equation 3.25) can be rewritten for the S-model
(SLR-P) as:

p_i(n+1) = p_i(n) − [1 − β(n)] a p_i(n) + β(n) [ a/(r−1) − a p_i(n) ]     if α(n) ≠ α_i
                                                                           (3.35)
p_i(n+1) = p_i(n) + [1 − β(n)] a [1 − p_i(n)] − β(n) a p_i(n)             if α(n) = α_i
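A direct transcription of the SLR-P update in Equation 3.35 follows; the function name is our choice, and beta is the continuous S-model response in [0, 1].

```python
def slrp_step(p, chosen, beta, a):
    """S-model linear reward-penalty (Eq. 3.35): beta = 0 is full reward,
    beta = 1 is full penalty; intermediate values blend the two updates."""
    r = len(p)
    q = []
    for i in range(r):
        if i == chosen:
            q.append(p[i] + (1 - beta) * a * (1 - p[i]) - beta * a * p[i])
        else:
            q.append(p[i] - (1 - beta) * a * p[i]
                     + beta * (a / (r - 1) - a * p[i]))
    return q
```

Setting beta to 0 or 1 reproduces the P-model LR-P updates of Equation 3.25 with a = b, and the probability vector remains normalized for any beta in [0, 1].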
In a nonstationary environment, the characteristics of the environment change with time. The performance of a learning automaton should be judged in such a context. If a
learning automaton with a fixed strategy is used in a nonstationary environment, it may become
less expedient, and even non-expedient. The learning scheme must have sufficient flexibility to
track the better actions. The aim in these cases is not to evolve to a single action which is
optimal, but to choose the actions to minimize the expected penalty. As in adaptive control
theory, the practical justification for the study of stationary environments is based on the
assumption that if the convergence of a scheme is sufficiently fast, then acceptable performance
can be achieved in slowly changing environments.
An environment is nonstationary if the penalty probabilities ci (or si) vary with time. The
success of the learning procedure then depends on the environment changes as well as the prior
information we have about the environment. From an analytic standpoint, the simplest case is the
one in which penalty probability vector c(n) can have a finite number of values. We can visualize
this situation as an automaton operating in a finite set of stationary environments, and assign a
different automaton Ai to each environment Ei. If each automaton uses an absolutely expedient
scheme, then -optimality can be achieved asymptotically for the overall system. However, this
method requires that the system be aware of the environmental changes in order to assign the
correct automaton, and the number of environments must not be large. Furthermore, since the
automaton Ai is updated only when the environment is Ei, updates may occur so
infrequently that the corresponding action probabilities do not converge [Narendra89]. In our
application, a similar approach is employed as explained in Chapter 4.
There are two classes of environments that have been analytically investigated:
Markovian switching environments, and state-dependent nonstationary environments. If we
assume that the environments E i are states of a Markov chain described by a state transition
matrix, then it is possible to view the variable-structure automaton in the Markovian switching
environment as a homogeneous Markov chain. Therefore, analytical investigations are possible.
Secondly, there are four different situations wherein the states of the nonstationary environment
vary with the step n, and the analytical tools developed earlier are adequate to obtain a fairly
complete description of the learning process. These are:
The fundamental concepts such as expediency, optimality, and absolute expediency have
to be re-examined for nonstationary environments, since, for example, the optimal action can
change with time. Here, we again consider the P-model for purposes of stating the definitions.
The average penalty M(n) becomes:
M(n) = Σ_{i=1}^{r} c_i(n) p_i(n)   (3.37)
c_ij = Pr{ β_i(n) = 1 | α(n) = α_j }. The action set of the automaton is of course the same for all
teachers/environments. Baba discussed the problem of a variable-structure automaton operating
in a multi-teacher environment [Baba85]. Conditions for absolute expediency are given in his
works (see also Chapter 6 for new nonlinear reinforcement schemes).
Some difficulty may arise while formulating a mathematical model of the learning
automaton in a multi-teacher environment. Since we have multiple responses, the task of
interpreting the output vector is important. Are the outputs from different teachers to be
summed after normalization? Can we introduce weight factors associated with specific teachers?
If the environment is of a P-model, how should we combine the outputs? In the simplest case,
where all teachers agree on the ordering of the actions, i.e., one action is optimal for all
environments, the updating schemes are easily defined. However, this will not be the case in our
application.
The elements of the environment output vector must be combined in some fashion to
form the input to the automaton. One possibility is to use an AND-gate as described in
[Thathachar77]. Another method is taking the average, or a weighted average, of all the responses
[Narendra89, Baba85]. This is valid for Q- or S-model environments. Our application uses logic
gates (OR) and additional if-then conditions as described in Chapters 4 and 5.
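The two simple combination rules mentioned above can be sketched as follows. The function names are ours, and the AND-gate reading (penalize only when every teacher penalizes) is one plausible interpretation of the cited approach, with 1 denoting a penalty.

```python
def and_gate(outputs):
    """P-model combination: penalize (1) only if all teachers penalize."""
    return 1 if all(outputs) else 0

def weighted_average(outputs, weights):
    """Q/S-model combination: weighted average of teacher responses in [0, 1]."""
    return sum(w * o for w, o in zip(weights, outputs)) / sum(weights)
```

The AND-gate is the most forgiving binary combination; an OR-gate, used later in this work, is the most conservative, penalizing an action as soon as any one teacher objects.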
As stated in Section 2.2, vehicle control is one of the most important issues in ITS, and especially
in AHS. Designing a system that can safely control a vehicle's actions while contributing to the
optimal solution of the congestion problem is difficult. Besides the control problems at the
regulation layer, there is a need for extensive research on the planning layer of the control
structure described by Varaiya [Varaiya93], or at the tactical level of the driving task
[Sukthankar96c]. When the design of a vehicle capable of carrying out tasks such as vehicle
following at high speeds, automatic lane tracking, and lane changing is complete, we must also
have a control/decision structure that can intelligently make decisions in order to operate the
vehicle in a safe way.
The task of creating intelligent systems that we can rely on naturally brings the idea of
artificial intelligence to mind. The ITS community is well aware of the fact that the implementation
of such a complex (and maybe global) system requires investigation of several different methods
and the simultaneous applications of many. Several emerging methods, such as cellular automata,
self-organization, neural networks, fuzzy logic, and hybrid systems applications are being
mentioned [TRB95, Godbole95, Ho96, Deshpande96]. Initial research on intelligent vehicle
control indicates that a planning-regulation system that can guarantee optimal operation with a
sound theoretical background has not yet been developed, and it may be vital to the
implementation of an automated highway system [Varaiya93, Lasky93].
In this chapter, we introduce a decision/control method for intelligent vehicles.
Considering the complexity of an automated highway system or an intelligent vehicle-highway
system, classical control methods are found to be insufficient to provide a fully automated,
collision-free environment [Varaiya93]. Although we may not solve the whole problem using a
single method, we attempt to find a way to make intelligent decisions here. Our approach to the
problem of vehicle control makes use of the Learning Automata techniques described in Chapter
3. The learning algorithms used in this application are introduced and discussed separately in
Chapter 6 for reasons of clarity. We visualize the planning layer (See Section 2.3) of an intelligent
vehicle as an automaton (or automata group) in a nonstationary environment¹. The aim here is to
design an automata system that can learn the best possible action (or action pairs: one for lateral,
one for longitudinal) based on the data received from on-board sensors, and possibly some form
of vehicle-to-vehicle and roadside-to-vehicle communications.

¹ The interpretation of the term environment is twofold, as we explain later.

Cem Ünsal Chapter 4. Application of Learning Automata to Intelligent Vehicle Control 47
The significance of this system is that the controller we define will be useful as a backup
or primary system for controlling the path of a vehicle in the case of communication loss
with the higher layers of a full AHS hierarchy, as well as during the transition from fully
automated to manual control. Since the implementation of a fully automated highway system is
not imminent, we concentrate on autonomous vehicles throughout this work.
An autonomous vehicle must be able to sense the environment around itself. In the
simplest case, it is to be equipped with at least one sensor looking in the direction of possible
vehicle moves. Furthermore, an autonomous vehicle must also have the knowledge of the rate of
its own displacement. Therefore, we assume that there are four different sensors on board the
vehicle. These are the headway sensor, two side sensors, and a speed sensor. The headway
sensor is a distance measuring device which returns the headway distance to the object in front of
the vehicle. An implementation of such a device is a laser radar. Side sensors are assumed to be
able to detect the presence of a vehicle traveling in the immediately adjacent lane. Their outputs
are binary. Infrared or sonar detectors are currently used for this type of sensor. The speed sensor
is simply an encoder returning the current wheel speed of the vehicle.
Each sensor is connected to its associated module. Sensor modules evaluate the sensor
signals in the light of the current automata actions, and send a response to the automata (Figure
4.1). We visualize each sensor module as a teacher in a nonstationary automata environment (or a
multi-environment system). The detailed descriptions of these sensor modules are given in the
next section.
Our basic model for planning and coordination of lane changing and speed control is
shown in Figure 4.1. The response of the environment is a combination of the outputs of all four
teacher blocks. The details of the mapping F are given in Section 4.4. The mapping F from sensor
module outputs to the input of the automata can be a binary function (for a P-model
environment), a linear combination of four teacher outputs, or a more complex function as is
the case for this application. An alternative and possibly more ideal model would use a linear
combination of teacher outputs with adjustable weight factors (e.g., S-model environment).
Figure 4.1. The basic planning/coordination model: sensor outputs feed the planning layer; the chosen actions (e.g., A1 A2 A2 A3 A1 A2) are buffered, and an action is passed on for execution only if the buffered actions are the same.
When an action is carried out, the action probabilities in the controlling automaton are
initialized to 1/r, where r is the number of actions. Although not necessary, this initialization
enables the learning automaton to adapt to a new situation faster. One of the reinforcement
schemes described in Chapter 6 is designed to speed up the adaptation process when the
probability of the best/desired action is close to zero. When this new nonlinear scheme and/or very
fast update rates are used, probability vector initialization is not necessary. When an action is
executed, the memory vectors are initialized also; all locations are filled with the idle actions SiL
or SM. A minimum processing speed of 25 iterations per second and a maximum of 200 are
assumed. These values relate only to the computations; the sensor data feeds may have a
different (and constant) rate. The upper limit of 200Hz is due to the communication requirements
considered in the current AHS research [Lasky93].
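The memory (regulation) vector described above can be sketched as a fixed-size buffer that releases an action only when its contents agree. This is our reading of the "IF ACTIONS ARE THE SAME" rule in Figure 4.1; the class name and the exact agreement test are assumptions.

```python
from collections import deque

class RegulationBuffer:
    """Sketch of the memory vector: initialized with the idle action
    (SiL or SM); a buffered action is executed only when the whole
    buffer agrees, otherwise the idle action is kept."""

    def __init__(self, size, idle):
        self.idle = idle
        self.buf = deque([idle] * size, maxlen=size)

    def push(self, action):
        self.buf.append(action)
        first = self.buf[0]
        if all(a == first for a in self.buf):
            return first          # consistent decision: execute it
        return self.idle          # otherwise stay idle
```

With a buffer size equal to the iteration rate (e.g., 25 entries at 25 iterations per second), an executed action can change at most about once per second, matching the regulation behavior described for the simulations.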
The discussion of nonstationary environments in Chapter 3 is based on the changing
penalty probabilities of actions. In this application, the penalty responses of the learning
automata are functions of the status of the physical environment (e.g., a decreasing headway
distance will result in a penalty response from the front sensor module if the chosen action is
ACC). The realization of a deterministic mathematical model of this physical environment may be
impossible, if not extremely difficult. Therefore, simulation is the only way of demonstrating the
expediency and/or absolute optimality for a nonstationary automata environment resulting from a
changing physical environment.
As seen in Table 4.2, a penalty response from the left sensor module is received only when
the action is SL and there is a vehicle in the left sensor's range, or the vehicle is already traveling
in the leftmost lane. Similarly, the action SR is penalized if there is a vehicle in the right
lane. All other situations result in a reward response from this sensor module. The longitudinal
automaton does not use the side sensors for reinforcement. Initial simulations define the range of the
side sensors using two parameters sr1 and sr2 as shown in Figure 4.3. It is assumed that the side
sensors' range covers only the adjacent lane, and the sensor is mounted at the middle of the
vehicle's side.
The outputs of the side sensors are assumed to be binary, indicating the existence of a
vehicle in the adjacent lane. However, the side sensor module may also use the distance of the
detected vehicle from the sensor source for more intelligent decisions. This type of design may
result in a more expensive implementation, as it requires distance measurement and possibly
additional sensors.
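As a sketch, the left sensor teacher described above reduces to a few lines; the function name, argument names, and action strings are our notational choices (Table 4.2 itself is not reproduced here).

```python
def left_sensor_module(action, vehicle_on_left, in_leftmost_lane):
    """Left sensor teacher: penalize (1) a shift-to-left-lane decision when
    the left lane is occupied or does not exist; reward (0) otherwise."""
    if action == "SL" and (vehicle_on_left or in_leftmost_lane):
        return 1
    return 0
```

The right sensor module is symmetric, penalizing SR when the right lane is occupied or the vehicle is already in the rightmost lane.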
Figure 4.3. Side sensor ranges sr1 and sr2, and front sensor range fsr, for an autonomous vehicle V1 approaching a slower vehicle V2 (V1 > V2), with V3 in the adjacent lane.
² The actions are shift-to-left-lane (SL), shift-to-right-lane (SR), stay-in-lane (SiL), accelerate (ACC), decelerate
(DEC) and keep-same-speed (SM).
Assuming that the front sensor is capable of detecting the headway distance, we define the
headway module as shown in Table 4.3. If the sensor sees a vehicle at a close distance (< fsr), a
penalty response is sent to the automaton for actions stay-in-lane, accelerate, and keep-same-
speed. All other actions (shifting lane or decelerating) may serve to avoid a collision, and
therefore, are encouraged. Then, an implemented headway module's task is simply to compare the
time-of-flight of the echo to the predefined time interval corresponding to distance fsr. Also note
that although we show the limiting value fsr as the sensor range in Figure 4.3, the actual sensor
range and this value do not have to be equal. The output of this module affects both automata as
seen in Table 4.3.
The speed sensor of the autonomous vehicle is assumed to be an encoder connected to the
wheel shaft. The output of the encoder can be used to detect the wheel speed, which is
approximately equal to the vehicle speed (or equal when cruising). The speed module's task is simply
to compare the actual speed to the desired speed and, based on the action chosen, send a feedback
signal to the longitudinal automaton. The responses depending on the speed and the longitudinal actions
are given in Table 4.4.
                     Sensor Status
Actions   Vehicle in range       No vehicle in range
          (headway < fsr)        (headway ≥ fsr)
SiL              1                       0
SL               0                       0
SR               0                       0
SM               1                       0
ACC              1                       0
DEC              0*                      0

Table 4.3. Output of the Front Sensor block.
                       Sensor Status
Actions   dev < -pdif   -pdif < dev < pdif   dev > pdif
SiL            #                #                 #
SL             #                #                 #
SR             #                #                 #
SM             1                0                 1
ACC            0                0                 1
DEC            1                0                 0

dev: actual speed - desired speed
pdif: permitted difference between actual and desired speeds

Table 4.4. Output of the Speed Sensor Module.
When the actual vehicle speed differs from the desired speed by a large amount, then only
the action decreasing the speed deviation receives a reward; others are penalized. This forces the
vehicle to slow down or speed up in order to match the desired speed. The value of the parameter
pdif (permitted difference between actual and desired speeds) is predefined.
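Table 4.4 translates directly into a small decision function; the names below are ours.

```python
def speed_module(action, actual, desired, pdif):
    """Speed sensor teacher (Table 4.4): dev = actual - desired speed,
    pdif = permitted deviation; returns 1 (penalty) or 0 (reward).
    Lateral actions are not judged by this module."""
    dev = actual - desired
    if action == "SM":
        return 1 if abs(dev) > pdif else 0      # penalize keeping a bad speed
    if action == "ACC":
        return 1 if dev > pdif else 0           # penalize speeding up when fast
    if action == "DEC":
        return 1 if dev < -pdif else 0          # penalize slowing when slow
    return 0                                    # SiL, SL, SR: output unused
```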
The values of the sensor limits sr1, sr2, fsr, and pdif define the capabilities of the sensors
(in the case of the front and side sensors) as well as the behavior of the vehicle. Sensitivity to the
headway distance and/or to speed fluctuations is given by these predefined parameters, which are key
to the behavior of the autonomous vehicle. For example, a vehicle with a shorter side sensor range
will tend to jump in front of vehicles in the next lane more often than a vehicle equipped with
more capable sensors.
Now that we have defined the sensor module outputs, the problem is to intelligently
employ these signals for automata reinforcement. It is possible to treat all sensor modules as
separate teachers with conflicting feedback responses for the automata. For example, consider the
situation given in Figure 4.3. The longitudinal action will receive a penalty from the front sensor
module due to the existence of a vehicle in front. If the actual speed of this vehicle is less than the
desired speed, the speed module will try to force the vehicle to increase its speed. These two
outputs conflict, and it is obvious that one must have priority over the other: no matter what its
current speed is, a vehicle must slow down in order to avoid a collision if it senses another vehicle
occupying the immediate space in front of itself. The next section describes our approach to the
problem.
                          Modules
Actions   Headway   Left   Right   Speed
SiL          ?       0       0       -
SL           0       ?       0       -
SR           0       0       ?       -
SM           ?       -       -       ?
ACC          ?       -       -       ?
DEC       0 or 0*    -       -       ?

?  means reward (0) or penalty (1) response is possible.
-  means that the output is not used; can be treated as 0.
0* has a higher priority (see discussion below).

Table 4.5. Action - Sensor Module matrix.
The probability of the lateral action SiL depends mainly on the output of the headway
sensor. When this action is chosen, the probability is decreased when the headway module
indicates a vehicle in the front sensor range. The side sensors always send a reward response to this
action no matter what the situation is. Therefore, we can obtain a combined environment response
for this action by simply OR-ing the binary module outputs.
The same method is also valid for the two other lateral actions. Probabilities of actions SL
and SR depend on the output of the left and right sensor modules, respectively. A decision to shift
lane is penalized if the associated sensor indicates the presence of a vehicle in the adjacent lane.
Therefore, by using a simple OR function, we can combine all three sensor module outputs to
obtain a meaningful teacher response for the lateral automaton.
For the longitudinal actions, although there are only two sensor modules whose feedback
is considered for reinforcement, the decision is more complicated. Longitudinal actions SM and
ACC may receive a penalty or reward from the front sensor depending on the headway distance.
The output of the speed module depends on the actual speed of the vehicle, and is used to force
the vehicle to match the desired speed. Action DEC is considered as good all the time by the
headway sensor module. However, the reward response in the case where there is a vehicle in the
front sensor range (indicated by 0*) is different from the normal reward response (indicated by 0):
this reward response must override a possible penalty from other modules. For the safety of the
vehicle, the output of this sensor must have a higher priority than that of the speed module.
Therefore, a simple OR-gate is not sufficient; additional Boolean functions must be used. Possible
action-sensor output scenarios are shown in Table 4.6.
Again, OR-ing the headway and speed module outputs will give the combined
output in Table 4.6, except in the case where the action is DEC and the headway module indicates
the presence of a vehicle in front. Therefore, the output of an OR-gate must be double-checked
for this condition. Since we are using OR gates, we can define don't care conditions (-) as
reward responses. Thus, it is possible to OR the outputs of all sensors. Then, the function F
mapping individual sensor outputs to a combined feedback signal is defined as in Figure 4.4.
Figure 4.4. The mapping F: the headway, left, right, and speed sensor module outputs are OR-ed, and the result is double-checked against the condition "action = DEC?" before being applied to the longitudinal (LNG) and lateral (LAT) automata.
Therefore, by adding one conditional rule to the OR-ed feedback signal, we obtain a
combined environment response to be applied to the automata. This single response is used to
update the action probabilities in both automata using the reinforcement schemes described in the
next section.
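The complete mapping F, OR-ing all module outputs and then applying the single conditional rule for DEC, can be sketched as follows (function and argument names are ours; 1 denotes a penalty).

```python
def F(action, headway_out, left_out, right_out, speed_out, vehicle_in_front):
    """Combined environment response (Figure 4.4): OR the four module
    outputs, but let the headway module's 0* reward for DEC override
    any penalty when a vehicle occupies the front sensor range."""
    if action == "DEC" and vehicle_in_front:
        return 0        # 0*: decelerating near an obstacle is always rewarded
    return 1 if (headway_out or left_out or right_out or speed_out) else 0
```

This single response then updates the action probabilities in both automata, as described in the text.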
The first algorithm is a well-known linear reinforcement scheme, one of the first LA
algorithms [Bush58]. The second algorithm is given by Baba [Baba83, Baba85], and is also valid
for S-model environments. The last two reinforcement schemes are extensions of the first two,
and have desirable characteristics for our application. They are the direct results of our attempts
to create reinforcement schemes with desirable characteristics suitable for this study of learning
automata applications to intelligent vehicle control. The detailed description of the algorithms can
be found in Chapter 6. Here, we will briefly mention their characteristics and relative properties.
Table 4.7 gives the general description of the above mentioned schemes.
The second algorithm is slightly different from the first one, and has not previously been studied in
detail. However, its convergence characteristics are better when the same learning
parameter a associated with reward is used. Furthermore, LR-P is proven to guarantee
convergence to an equilibrium point³ in N-person games. This is a property we would like to use
for analyzing the behavior of multiple autonomous vehicles. The last algorithm is an extension of
the third one, again resulting in a faster convergence in probability updates. These two schemes
are also proven to be absolutely expedient, meaning that the decision tends to a better one at
every time step.
³ Or Nash equilibrium, where no player has a positive reason to change her strategy.
At t = 0, the autonomous vehicle, traveling in lane 3, approaches the group of vehicles cruising at a
fixed speed of 80kmh (Figure 4.5a; the vehicles are traveling from left to right). It immediately
changes lane to avoid a collision. It detects another slow moving vehicle in lane 2 at
approximately t = 10sec and slows down to keep a 15-meter headway distance (see Figure 4.6a).
At approximately t = 14sec, it shifts to lane 1. Immediately after the lane change, the speed is
adjusted to the desired value of 85kmh. At t = 34sec and t = 45sec, two shift-to-left-lane actions are
carried out, again due to the presence of slow moving vehicles in front (Figures 4.5d-e). During
these maneuvers, the vehicle speed is not decreased, because the probability of the lateral action
SL reached 1 much faster than that of the longitudinal action DEC. The reason for this behavior is
the fact that the lateral action SR receives penalty at all times due to the presence of another
vehicle in the right lane, or due to the fact that the vehicle is in the rightmost lane.
After two shifts to the right lane (Figures 4.5f-h), the vehicle finds itself again in lane 1,
traveling behind another vehicle. The speed is adjusted to match the speed of the vehicle in front
(Figure 4.6). At this point, the only lateral action receiving reward is SiL, and this action is fired
repeatedly. The longitudinal automaton, on the other hand, will choose the action necessary to keep the
headway distance over 15m. Whenever the headway distance is larger than this predefined limit,
the vehicle will attempt to increase its speed to 85kmh. It cannot move out of the pocket
created by two slow moving vehicles.
Figure 4.5. Snapshots of the simulated highway (lanes 1-3), with the vehicles traveling from left to right: (a) t=0s, (b) t=3s, (c) t=14s, (d) t=33s, (e) t=45s, (f) t=61s, (g) t=74s, (h) t=90s.
Figure 4.6. Automated vehicle's (a) lane position and (b) speed (see also Figure 4.5).
The second and third simulation results show the behavior of a vehicle following a slowly
moving vehicle. Both simulations have been started with an initial headway distance of 30m
(Figure 4.7). Front sensor range fsr is set to 15m. The memory vector size is the same as the
processing speed (25), and this results in speed changes at most once every second. The only
difference between the two simulations is the desired speed: 81kmh in the second, and 85kmh in
the third simulation. The headway distance and speed of the automated vehicles are shown in
Figures 4.8 and 4.9.
Figure 4.7. Initial configuration: the autonomous vehicle starts 30m behind a slower vehicle (t=0s).
Figure 4.8. (a) Headway distance and (b) speed of an autonomous vehicle following another
slowly moving vehicle: the desired speed of the autonomous vehicle is 81kmh while the vehicle
in front cruises at 80kmh.
Figure 4.9. (a) Headway distance and (b) speed of an autonomous vehicle following another
slowly moving vehicle: the desired speed of the autonomous vehicle is 85kmh while the vehicle
in front cruises at 80kmh.
As seen in the above figures, both vehicles are able to avoid a collision⁴; the headway
distance never becomes too small. However, the response of the automated vehicle is oscillatory. The
longitudinal automaton sends the action ACC whenever the headway distance is larger than 15m

⁴ This is, of course, possible provided that the iteration rate is fast enough or the memory vector is relatively short.
to reach the desired speed of 85kmh. The action DEC is the choice when the distance is less than
the predefined limit. Also, the oscillation amplitude is higher when the relative speed is higher.
Although the oscillations are smaller for the desired speed setting closer to the speed of the
vehicle in front, the behavior is still oscillatory, as seen in Figure 4.8.
Although not shown here, the behavior of the vehicles changes with different parameters.
The choice of sensor limits and learning coefficients, as well as the iteration speed and the size of
the regulation buffer, are important factors.
Another important issue is the fact that the automated vehicles in simulations 2 and 3
chose to adjust their speed (back and forth) to keep a safe distance instead of shifting to lane 1 or
lane 3 to evade the slow-moving vehicle. Since the only lateral action receiving a penalty from the
front sensor module is SiL, the probabilities of the two other actions SL and SR both reach 0.5. This
results in a memory buffer filled with (approximately) equal numbers of two different actions;
the vehicle cannot decide which action to take.
Figure 4.10 shows the initial and final positions, numbers and speeds of ten autonomous
vehicles. The vehicles are traveling from left to right on a 500m circular highway. The final speeds
shown in Figure 4.10(f) are equal to the desired speeds set initially. Simulation parameters are again
given in Table 4.8.
[Figure 4.10 panels (a-g): initial (t = 1s) and final (t = 243s) snapshots, vehicle numbers and speeds; mpeg movie ch4m1.mpg (1.230Mb, every frame represents one fifth of a second).]
Figure 4.10. Initial and final positions (a, d), numbers (b, e), and speeds (c, f) of ten automated
vehicles traveling on a 3-lane 500-meter circular highway. The mpeg movie of the simulation is
accessible via icon (g).
As seen in the figure above and the mpeg movie, vehicles with higher speeds slowly
move ahead of the others. All vehicles are able to avoid collisions, either by slowing down or by
shifting lanes. In the end, all vehicles have reached their desired speeds, since each was able to
shift to an open lane sooner or later during the simulation. However, the paths taken to overtake
slower vehicles are not the shortest and/or quickest paths.
With current computing technology, the iteration speed is not a major concern. Learning parameters must be
chosen large enough to guarantee fast learning for the desired iteration speed, while avoiding
values larger than 0.3 for linear schemes and 0.1 for nonlinear schemes, in order to prevent
the probability of a non-optimal action from being pushed to 1 by an unnecessarily large update. The length
of the memory vector is another factor in firing actions. This buffer is definitely needed for
lateral actions, in order to avoid continuous lane changes, while its size can be decreased down to
1 for longitudinal actions provided that the speed changes are continuous.
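The linear reward-penalty update discussed above can be sketched as follows; the function and parameter names are ours, and the rates a and b stand in for the linear-scheme coefficients kept under the 0.3 bound noted above. This is a sketch of the standard L_R-P scheme, not a transcription of the simulator code:

```python
def update_probabilities(p, chosen, beta, a=0.2, b=0.2):
    """Linear reward-penalty (L_R-P) update of an action probability vector.

    p      : list of action probabilities (sums to 1)
    chosen : index of the action just performed
    beta   : environment response, 0 = reward, 1 = penalty
    a, b   : learning rates; large values can push a non-optimal
             action's probability toward 1 before the teachers
             can correct it, hence the bounds mentioned above
    """
    n = len(p)
    q = p[:]
    if beta == 0:  # reward: move probability mass toward the chosen action
        for i in range(n):
            if i == chosen:
                q[i] = p[i] + a * (1.0 - p[i])
            else:
                q[i] = (1.0 - a) * p[i]
    else:          # penalty: move probability mass away from the chosen action
        for i in range(n):
            if i == chosen:
                q[i] = (1.0 - b) * p[i]
            else:
                q[i] = b / (n - 1) + (1.0 - b) * p[i]
    return q
```

Both branches preserve the total probability mass, so the vector remains a valid distribution after every iteration.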
Although there are no collisions with carefully selected parameters to obtain the necessary
convergence speed, the path and/or behavior of the vehicles are far from optimal. The
autonomous vehicle in simulation number 1 did not possess the information to help it avoid a
pocket created by two slow moving vehicles. The decisions are based on local data, and therefore
cannot guarantee global solutions for the vehicle, or the traffic as a whole for that matter.
Conflicting decisions are expected.
The observed oscillations in the headway distance are due to the discrete-time control of
the speed and the defined front sensor limit. For a fast processing speed and a more realistic
speed controller design, the problem may be relatively insignificant. However, with the current
definition of the headway module, the vehicles will keep decelerating even when the headway
distance is improving.
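A toy discrete-time loop illustrates why a single front-sensor threshold produces this oscillation. All numbers are illustrative, and the fixed 1 m/s-per-second speed step is an assumption standing in for the regulation layer, not the dissertation's actual controller:

```python
# Toy headway loop: a follower adjusts its speed by +-1 m/s once per
# second against a single 15 m threshold, mimicking the Chapter 4 rule
# (ACC above the limit, DEC below it). Speeds are converted from kmh.
def follow(threshold=15.0, lead_speed=80 / 3.6, start_speed=85 / 3.6,
           start_gap=30.0, steps=60):
    gap, v = start_gap, start_speed
    gaps = []
    for _ in range(steps):
        v += 1.0 if gap >= threshold else -1.0   # bang-bang speed action
        gap += (lead_speed - v) * 1.0            # 1 s time step
        gaps.append(gap)
    return gaps

gaps = follow()
# the gap keeps crossing the threshold instead of settling on it
crossings = sum(1 for a, b in zip(gaps, gaps[1:])
                if (a - 15.0) * (b - 15.0) < 0)
```

The run produces a sustained limit cycle around the threshold rather than convergence, which is the qualitative behavior reported for simulations 2 and 3.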
Furthermore, unnecessary lane changes can be observed in simulation number 4. This is
again due to the fact that the vehicles do not have an evaluation method for the desired lane when
such a decision is needed. This fact is also the cause of the behavior seen in simulations 2 and 3.
When lateral actions SL and SR are both good, the vehicle is unable to change lanes using the
current method of firing actions when the memory vector/buffer is full (see Section 4.1).
Initial simulations show promising results, but there is a need for more elaborate sensor
definitions, and for more complex decision rules. The next chapter will address these upgrades,
and other related issues.
It is interesting to note that, again, a task performed easily by human beings proves to be
very complicated for a machine. Even a simple decision of shifting lanes in order to avoid being
stuck in a pocket of slow-moving vehicles (simulation 1, Figure 4.5h) requires complicated
sensory hardware and, possibly, a complex communication mechanism. However, a human driver
can simply look farther ahead toward the adjacent lanes, and decide to slow down in order to find
an opening. Research currently conducted on AHS aims to solve such problems with extensive
vehicle-to-vehicle and/or vehicle-to-roadside communications [Construction96, Lygeros94],
multiple sensors and sensor fusion [Agogino96], and a hierarchical control structure [Varaiya91,
Varaiya93, Lygeros95, Lygeros96]. When the hierarchical system and all vehicles are
connected, they will collectively reach a solution. However, such an architecture consequently
requires an expensive implementation. The problem of making decisions based on local
information would exist even in a full AHS environment unless the system is fully hierarchical.
Chapter 5
More Intelligent Vehicles: Extending the Capabilities
The intelligent controller defined in the previous chapter, although capable of controlling an
automated vehicle's path using local information, is far from optimal in the sense that the
resulting safe path is not always the shortest or fastest path, nor a good choice for improving
the overall throughput of a segment of the automated highway. Also, although not seen in the
simulation examples, a pinch maneuver condition, where two vehicles change lanes to occupy the same
spot on the highway (see Section 5.2.2), is possible when only the immediate neighborhoods of
the vehicles are considered. In this chapter, we extend our assumptions on the capabilities of
the intelligent vehicles. These changes can be grouped into three different classes: extending the
capabilities of the sensors, creating more complete (and complex) decision rules, i.e., extending
the capabilities of the teachers, and adding more sensor/teacher modules. All these changes are
described in the following sections.
Extending the sensor capabilities requires more complex and expensive hardware
implementations, and extending the capabilities of the teachers leads to more complex evaluation
rules for the teacher modules, as we describe in the following sections. These include new definitions
of the front and side sensors, more rules for the headway and side sensor modules, as well as interaction
of the lateral and longitudinal automata. Two new sensor modules are defined in this chapter, namely
the pinch and lane sensor-module pairs. The assumptions and physical implementations of these
sensors, as well as the flag structure used in the decision modules, are described.
In Table 5.1, cases where the output of the sensor has an overriding characteristic are indicated by asterisks.
There are now three different range values used in the feedback decision. The value fsr defines the
sensor range, and can be visualized as the limit of the region of interest. Any
measurement equal to or greater than this value results in a reward response for all longitudinal and
lateral actions. As seen in Figure 5.1, two additional parameters d1 and d2 with d1 < d2 < fsr are
used to define a new buffer region where the sensor module's response also depends on the rate
of change of the headway distance.
                                       Sensor Status
 Current   Region A     Region B (d1 < dist < d2)    Region C (d2 < dist < fsr)   No vehicle in range
 Action    (dist < d1)  approaching  NOT approaching  approaching  NOT approaching  (dist >= fsr)
 SiL           1            1*            0               1*            0               0
 SL            0**          0             0               0             0               0
 SR            0**          0             0               0             0               0
 SM            1            1             0               0             0               0
 ACC           1            1             1               1             0               0
 DEC           0            0*            0               0             0               0
Asterisks are used to distinguish the cases where additional functions need to be used (see Section 5.3.3
for details).
Table 5.1. Output of the Front Sensor block (also see Figure 5.1).
The lateral action SiL is discouraged when the vehicle in front is dangerously close; the
parameter d1 defines this safety region. The same action also receives a penalty when the headway
distance is larger than d1 but less than the safe distance defined by the parameter fsr. However, for
the region between these two values, the action is penalized only when the rate of change of the
headway distance is negative, i.e., when the headway distance decreases. All other lateral actions
receive reward from the front sensor module.
The longitudinal action ACC is discouraged when a vehicle enters the front sensor range and
is approaching. If the headway distance is under d2, this action is penalized regardless of the rate
of change of the measured distance: a vehicle should not increase speed while sensing another
vehicle in proximity. The differentiation of the feedback response for the range [d2 fsr] enables the
automated vehicle to catch up with the vehicle in front. If the headway distance is too close, or
if the vehicle in front is approaching, the action SM also receives a penalty.
Note that if we choose the parameters as d1 = d2 = fsr, the reward-penalty structure of
Table 5.1 becomes the one previously given in Table 4.4. By defining these new regions, where the
feedback response is different for different longitudinal actions, we hope to obtain a smoother
response, and thus remove the oscillations in the headway distance while following a slowly
moving vehicle (see examples in Figures 4.8 and 4.9). The reward response of the front sensor
module for the longitudinal action ACC in region C facilitates speed recovery.
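The reward-penalty structure of Table 5.1 (omitting the starred override cases of Section 5.3.3) can be sketched as a small lookup. The function name, the 0/1 encoding (0 = reward, 1 = penalty) and the default ranges are our assumptions, not part of the original design:

```python
PENALTY, REWARD = 1, 0

def front_sensor_feedback(dist, approaching, d1=10.0, d2=15.0, fsr=20.0):
    """Front sensor module feedback per Table 5.1 (starred cases omitted).
    Returns a dict mapping each action to 0 (reward) or 1 (penalty)."""
    out = {a: REWARD for a in ("SiL", "SL", "SR", "SM", "ACC", "DEC")}
    if dist >= fsr:                     # no vehicle in range: all rewarded
        return out
    if dist < d1:                       # region A: dangerously close
        out.update(SiL=PENALTY, SM=PENALTY, ACC=PENALTY)
    elif dist < d2:                     # region B: never accelerate here,
        out["ACC"] = PENALTY            # other responses are rate-dependent
        if approaching:
            out.update(SiL=PENALTY, SM=PENALTY)
    else:                               # region C: allows catching up
        if approaching:
            out.update(SiL=PENALTY, ACC=PENALTY)
    return out
```

Setting d1 = d2 = fsr collapses the three regions into region A plus "out of range", recovering the single-threshold behavior of Chapter 4.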
Cem Ünsal Chapter 5. More Intelligent Vehicles: Extending the Capabilities 64
[Figure 5.1. Sensor regions: front regions A, B and C bounded by d1, d2 and fsr; side regions D and E.]
Again, consider the situation given in Figure 4.7 where the automated vehicle follows a
slow-moving vehicle. The initial headway distance is still 30m, and the sensor parameters are chosen
as shown in Table 5.2. The automated vehicle has a desired (and initial) speed of 85kmh while the
other cruises at 80kmh. All other parameters are set to the same values as in the second and third
examples in Chapter 4 (Table 4.8). The regulation layer changes velocity after receiving the same
action 25 times consecutively, and the processing speed is 25 iterations per second.
The vehicle speeds and the headway distances for the five simulations with the above
parameters are given in Figures 5.2-5.6. Only the sensor range definitions are changed, to show the
effect of various settings. As seen in Figures 5.2 and 5.3, it is possible to prevent the oscillations
in the headway distance. Usually, the headway distance stabilizes at a value between d1 and d2
when the automated vehicle approaches a slow-moving vehicle. As seen in Figure 5.2, the
regulation layer changed the speed 5 times to match the speed of the vehicle in front, in
approximately 6 to 7 seconds. Again, note that a maximum of one change per second was
possible in these simulations.
Figure 5.2. Headway distance and speed of an autonomous vehicle following another
slow-moving vehicle: sensor parameters are d1 = 10m, d2 = 20m, and fsr = 30m.
Figure 5.3. Headway distance and speed of an autonomous vehicle following another
slow-moving vehicle: sensor parameters are d1 = 10m, d2 = 15m, and fsr = 20m.
Figure 5.4 shows another vehicle with much shorter sensor ranges than the previous ones.
Again starting from the same initial conditions, the vehicle is able to match the speed of the vehicle in
front and keep a safe distance. Since the parameter fsr is much shorter than in the previous
simulation runs, the vehicle had less time to adjust its speed, starting at only 18m. As a result of
very rapid changes (only the action DEC is permitted after t = 10sec since the headway is less than d2
and the vehicle is approaching), the headway decreased under d1 = 10m and the vehicle speed dropped
below 80kmh. For approximately 6 seconds, the automated vehicle cruised at 79kmh in order to
permit the headway to increase to a safe distance. At t ≈ 30sec, right after the headway distance
becomes equal to d2, the speed is matched. In this case, the region [d2 fsr] is where the steady-
state value is obtained. After t = 40sec, the vehicle ahead changes its speed to 83kmh, then back
to 80kmh. As seen in Figure 5.4, this speed change is quickly matched, and the headway distance
reaches its steady state a few seconds after the vehicle in front returns to its cruising speed of
80kmh (at t ≈ 60sec).
Figure 5.4. Headway distance and speed of an autonomous vehicle following another
slow-moving vehicle: sensor parameters are d1 = 12m, d2 = 15m, and fsr = 18m.
When we attempt to decrease the sensor regions further, to [d1 d2] = [14 15] and
[d2 fsr] = [15 16], oscillations result, as seen in Figure 5.5. The regions where different actions
are encouraged based on the distance and its rate of change are so small that, before an
intelligent decision can be made, the value of the headway distance moves out of the buffer
region. Again, it is important to note that an ACC or DEC action is fired every second in the best
case. For a much shorter memory buffer as well as a faster iteration speed, the response will be
quicker, and these sensor values may then be feasible for a non-oscillatory response.
Figure 5.5. Headway distance and speed of an autonomous vehicle following another
slow-moving vehicle: sensor parameters are d1 = 14m, d2 = 15m, and fsr = 16m.
Keeping the same processing speed and memory vector, we try to solve the problem by
enlarging the lower sensor region, i.e., by decreasing the value of d1 to 12m. The results are
shown in Figure 5.6. The response is not as damped as in simulations 1 and 2, but a steady-state
value is reached. Again, a change in the speed of the vehicle in front (at t ≈ 64sec) is quickly
matched with a minimal deviation in the headway distance.
Figure 5.6. Headway distance and speed of an autonomous vehicle following another
slow-moving vehicle: sensor parameters are d1 = 12m, d2 = 15m, and fsr = 16m.
From the simulation results above, we see that there is a trade-off between the time to
reach a steady-state value for the headway distance and the permitted region for the steady-state
values. For faster processing speeds (which decrease the time to reach a decision in the planning
layer) and a shorter memory buffer (which results in a shorter decision time in the regulation layer),
the regions may be smaller, and the problem therefore less significant.
The definition of the front sensor module in Section 4.2 is the limiting case for the
extended version given here. As d1 and fsr approach d2, the definition given here becomes similar
to the one given in Chapter 4, and oscillatory responses are observed.
The idea of having multiple parameters defining the regions [d1 d2 ] and [d2 fsr] is similar
to the idea of defining a boundary layer around a switching surface for sliding mode controllers in
order to stop chattering [Slotine91]. In sliding mode control, additional integral terms are added
to the driving function for the predefined boundary layer. Here, we force the speed to change
according to the value and the rate of change of the headway distance. Since distance is the integral of
the velocity, the resulting effect is similar. Of course, the control methodology is much more
discrete, and also less structured, than in a sliding mode controller. In the design methodology of
the sliding mode controllers, either the width of the boundary layer or the function parameters are
predefined and the rest adjusted accordingly [Kachroo96c]. Similarly, in our case, the choice of
processing parameters and the sensor regions are sequential, in the sense that we either define the
sensor regions first, and then choose the learning parameters, or vice versa.
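The boundary-layer remedy that this design parallels can be stated compactly; the symbols below (s, Φ, k) are the standard ones from the sliding mode control literature [Slotine91], not quantities defined in this chapter:

```latex
u \;=\; -\,k\,\operatorname{sat}\!\left(\frac{s}{\Phi}\right),
\qquad
\operatorname{sat}(x) \;=\;
\begin{cases}
x, & |x| \le 1,\\
\operatorname{sgn}(x), & |x| > 1.
\end{cases}
```

Inside the layer |s| ≤ Φ the control varies smoothly instead of switching; the buffer regions [d1 d2] and [d2 fsr] play an analogous role here, grading the response by the rate of change of the headway distance rather than switching it outright.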
The region [d1 d2 ] is effective when the automated vehicle has to slow down; [d2 fsr] is
mainly important in the case of acceleration while closely following another vehicle. This second
region also improves the vehicle's braking time in the previous case.
Also note that the length of the memory vector is 25, i.e., at the very best, only one
velocity adjustment per second is permitted. More continuous regulation layer modeling is of
course possible, but not considered here. Since the controlled variable is the distance, not the
speed, and a more realistic model would also include delays, we did not investigate continuous
models. The aim here is to show the importance of sensor range definitions, and to introduce an
intelligent vehicle control method based on learning.
Assuming that there is a sensing structure, possibly an array of sensors, that is able to
give information about the relative speed of the neighboring vehicles, the side sensor's detection
area is divided into two sections as seen in Figure 5.1. An immediate neighborhood is now
defined: in this region (D), the output of the sensor module depends not only on the detection of
vehicles, but also on the measurement of their relative longitudinal movements. The penalty-reward
structure for the side sensor modules is given in Table 5.3.
Longitudinal actions are not shown in Table 5.3 since they are not affected by the side
sensor modules. The regions D and E are defined as shown in Figure 5.1; the parameters sr1 and
sr2 are used to define the detection range of the sensors (or sensor arrays), and an additional parameter
differentiates the region D from the region E. If a vehicle is detected in the region E, and it is not
(longitudinally) approaching the automated vehicle, the penalty response to the lane-changing
actions is suppressed. This distinction gives more flexibility to the automated vehicle, as we
describe with the following simulation examples.
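A minimal sketch of the side sensor response described above. The `ratio` argument stands in for the unnamed parameter separating regions D and E, and the interface is our assumption:

```python
def side_sensor_feedback(dist, approaching, sr=12.0, ratio=0.66):
    """Side sensor feedback for the lane-change action toward that side.
    Returns 1 (penalty) or 0 (reward); `ratio * sr` splits the immediate
    neighborhood D from the outer region E."""
    if dist is None or dist >= sr:      # nothing detected within range
        return 0
    if dist < ratio * sr:               # region D: never cut in this close
        return 1
    return 1 if approaching else 0      # region E: rate-dependent response
```

With the values used later in this section (sr = 12m, ratio = 0.66), a neighbor detected between roughly 8m and 12m penalizes the lane change only while it is closing in.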
Figure 5.7 shows five automated vehicles with sensor ranges sr1 = sr2 = 12m, d1 = 8m,
d2 = 11m, fsr = 15m (the side sensor module of Chapter 4 is used in this simulation). The vehicle
in lane 3 wants to merge into the platoon; however, the distance between the first two vehicles in
the platoon is only 20m. It slows down in order to change lanes (the behavior forcing this change
is given in Section 5.3.1). The right side sensor module always returns a penalty response for the
action SR, and consequently the vehicle is unable to change lanes. It slowly moves to the back
of the platoon.
[Figure 5.7. Simulation snapshots at t = 0, 3, 6, 9, 15 and 19 seconds (panels a-f).]
However, if we employ the extended side sensor module for the same scenario, the vehicle
is able to change lanes. As seen in Figure 5.8, the vehicle in lane 3 slows down to change lanes,
and shifts lanes at t ≈ 8.5sec.
In this simulation, the side sensor ranges are sr1 = sr2 = 12m, with the additional
region-splitting parameter set to 0.66, i.e., if the detected range is between 8m and 12m and the
neighboring vehicle is not approaching, the action SR becomes acceptable. Using this decision
structure, the gray-colored vehicle shifts to lane 2, and adjusts its speed accordingly. Figure 5.9
shows the headway distances for the platoon. Before the lane change, the headway distances are
steady, with minor deviations due to the random speed changes (see Figure 5.10a). At t ≈ 8.5sec,
the lane shift occurs, and the vehicle entering the platoon is cruising at a lower speed (see
Figure 5.10). Consequently, the distance between the second and third vehicles decreases; the
vehicle which is now third in the platoon detects a sudden change in headway distance. As a
result, this and other trailing vehicles slow down to adjust their headway distances. The effect can
be seen in Figures 5.9 and 5.10b: a ripple moves backward through the chain of vehicles. All
vehicles are able to keep their headway distances within the acceptable range while returning to
their desired speeds of 80kmh at t = 30sec. The steady-state response is due to the new definition
of the front sensor module given in the previous section.
[Figure 5.8. Simulation snapshots at t = 0, 6, 8.5, 8.8, 12, 16, 24 and 32 seconds (panels a-h).]
Figure 5.9. Headway distances in the platoon; the lane change occurs at t ≈ 8.5sec.
Figure 5.10. (a) Speed and (b) space-time trajectory of five automated vehicles
(Space-time trajectory for vehicle 2 is plotted after the lane change).
In the simulation example given here, the sensor ranges are chosen to illustrate the
advantage of the new definition and the new front sensors ability to stabilize the headway
distance. However, with such small (side sensor) range definitions, the lane shift creates ripples
in the otherwise stable platoon. The main reason for defining multiple regions for the side sensors
is actually to avoid sudden jumps of one vehicle in front of another. The sensor ranges must be
carefully chosen, possibly using adaptive techniques mentioned in Chapter 1.
to vehicles via broadcast communications, as envisioned by the PATH program (e.g., [Hedrick96]
and [Lygeros96]).
Once an automated vehicle knows its (lateral) position, and assuming that it can obtain
information on the best choice of lane for its desired speed and path, the decision for lateral
movement is relatively straightforward. Although current practical implementations do not yet
consider such computationally complicated1 maneuvers, research on the control of lane-changing
maneuvers is slowly emerging [OBrien96, Wonshik95, Hessburg95]. The difficult part in path
planning is to find the optimal lane position for a specific vehicle.
For our purposes we will assume that an automated vehicle can sense its present lane, and
that it has some idea about where it should be. Based on these two values, the action that leads to
necessary lane shifts is encouraged by the teacher module shown in Figure 5.11. The flag
structure described in Section 5.3.1 is used to force necessary longitudinal actions if the vehicle
has trouble shifting into the desired lane.
1 Computational difficulty here describes the complexity of a good decision-making process for lane maneuvers;
the low-level control of a lane maneuver is possible for most of the current test vehicles.
In the simulations, we use the memory vector to check for other vehicles' intention to shift
lanes. If the number of memory locations containing SR or SL is more than half the size of the
vector for a vehicle, it is assumed that the vehicle is likely to shift lanes (Figure 5.13). Since there
are three lateral actions, a 50% distribution in the memory vector most likely indicates that an
action will soon be fired. If such an intention signal is received from a neighboring vehicle, the
pinch module sends a penalty response to the lateral automaton for the action that may cause a
problem. In a sense, the pinch module in an automated vehicle is driven by the memory vectors of
neighboring vehicles. In a real implementation, this corresponds to a signaling vehicle which may
be detected by a vision system, vehicle-to-vehicle communications indicating intended actions, or
a roadside-to-vehicle communication relaying the positions of the vehicles. The last possibility is a
global solution to the problem.
Figure 5.13. Memory vector/buffer: If an action fills half of the buffer, it is signaled.
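The two roles of the memory vector, firing an action when the buffer is completely filled with it and signaling a lane-change intention when SR or SL passes the halfway mark, can be sketched as follows; the function name and return convention are ours:

```python
from collections import Counter

BUFFER_SIZE = 25  # matches the 25 iterations/second processing speed

def buffer_status(buffer):
    """Inspect the lateral memory vector: an action is *fired* when it
    fills the whole buffer; a lane-change *intention* is signaled to
    neighboring vehicles when SR or SL occupies more than half of it."""
    counts = Counter(buffer)
    fired = next((a for a, c in counts.items() if c == len(buffer)), None)
    intention = next((a for a in ("SL", "SR")
                      if counts[a] > len(buffer) // 2), None)
    return fired, intention
```

In the pinch scenario below, it is the `intention` signal of one vehicle that drives the pinch module of its neighbor.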
Figure 5.14 shows two vehicles cruising at the same speed. Initially, vehicle 1 is in the
rightmost lane, and vehicle 2 is in the leftmost lane. However, vehicle 1 needs to shift to the leftmost
lane, while the other wants to change to the rightmost lane. Figure 5.15 shows the lane and speed
of the two vehicles versus time. Assuming that their initial positions in Figure 5.14 correspond to
time t = 0, both vehicles are unable to change lanes until t ≈ 3sec, when vehicle 1 shifts to the
middle lane.
The reason for the delay is the pinch module in both vehicles. Without a detection method
for this condition, the vehicles would change to the middle lane at approximately the same time,
and the result would be a collision. As seen in Figure 5.16, both vehicles' lateral memory vectors are
initially filled with the idle action SiL. Since both vehicles are forced to choose the necessary lateral
actions by the lane module, the percentages of the actions SL and SR in the memory buffers (of
vehicles 1 and 2, respectively) increase. The percentage of the action SL in the memory buffer of
vehicle 1 reaches 50% slightly faster than that of the action SR in vehicle 2 (Figure 5.16). As a
result, the lateral action SR of vehicle 2 starts receiving penalties earlier than that of vehicle 1, and
the probability of this action decreases. The result is also a decrease in the percentage of the
action SR in the memory buffer. The percentage of vehicle 1's SL action also drops (until n = 40,
or t = 1.6sec), but never under 50%. Once the intention signal from vehicle 2 is stopped, the action
SL fills the memory buffer, and is fired at n = 71 (t ≈ 2.84sec). The memory vector is then reset
to all-SiL.
Figure 5.15. Lane (a) and speed (b) of two vehicles from t = 0 to t = 4sec.
This new module enforces safe lane transitions: either one of the vehicles temporarily
suppresses its desire to shift lanes (i.e., a penalty response from the pinch module overrides all other
reward responses from the lane and side sensor modules by ORing), or both vehicles are hindered
from changing lanes to achieve their desired paths.
As seen in Figure 5.14, the two vehicles won't be able to exchange lanes as long as they
cruise at the same speed. A bottleneck might also occur in a pinch situation where
both vehicles have similar action ratios in their memory buffers. To overcome this problem, more
complex decision rules need to be defined. These are explained in the next section.
Figure 5.16. The output of the pinch module and the actions in the memory
buffer versus time for vehicle 1 (top) and vehicle 2 (bottom).
[Figure panels: initial (t = 0s) and final (t = 60s) snapshots of vehicles 1-3, vehicle speeds (kmh) and vehicle lanes versus time; mpeg movie ch5m1.mpg (0.273Mb; every frame represents one fifth of a second).]
Figure 5.17. Initial (a) and final (b) situation, speed (c) and lane positions (d) of three
automated vehicles traveling on a 3-lane 500-meter circular highway. The mpeg movie
of the simulation is accessible via icon (e).
and match the speed of vehicle 2, but it is unable to travel at its desired speed of 80mph (Figure
5.18(e)). The predefined time interval to match the desired speed is 10sec in this case. The count
starts at t = 5.32sec (the '+' mark in Figure 5.18(f)), and the time expires at t = 15.32sec ('S' in
Figure 5.18(f)) while the automated vehicle is keeping a safe distance from the vehicle in front.
Once the flag is set, the action SiL, which is normally encouraged to avoid unnecessary lane
changes, starts receiving a penalty. The probability of one of the actions SR and SL, in this case
SR2, approaches one, and the action SR is fired at t = 18.96sec. The flag is reset when the vehicle
completes the lane change maneuver at t = 19.52sec ('o' in Figure 5.18(f)). Immediately after the
lane shift, the vehicle increases its speed to match the desired value.
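The timer-driven flag behavior described above can be sketched as follows; the class and argument names are ours, and only the set/reset logic is modeled:

```python
class SpeedFlag:
    """Sketch of the flag structure of Section 5.3.1: if the vehicle stays
    below its desired speed for `timeout` seconds, the flag is set and the
    idle action SiL starts receiving penalties, forcing a lane change;
    completing the lane change resets the flag."""
    def __init__(self, timeout=10.0):
        self.timeout = timeout
        self.started = None     # time at which the speed deficit began
        self.set = False

    def update(self, t, speed, desired, lane_change_done=False):
        if lane_change_done:                     # maneuver completed: reset
            self.started, self.set = None, False
        elif speed < desired:                    # still below desired speed
            if self.started is None:
                self.started = t                 # start (or restart) count
            elif t - self.started >= self.timeout:
                self.set = True                  # time expired: set flag
        else:                                    # desired speed reached
            self.started = None
        return self.set                          # True => penalize SiL
```

In the example above the deficit begins at t = 5.32sec, the flag is set 10 seconds later, and it is cleared when the lane change completes at t = 19.52sec.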
[Figure 5.18 panels (a-d): snapshots at t = 0, 19, 19.6 and 35 seconds; panels (e, f): speed and lane versus time.]
Figure 5.18. Snapshots (a-d), speed (e) and lane positions (f) of two automated vehicles traveling
on a 3-lane 500-meter circular highway.
2 Using the additional rule defined in Section 5.3.3.
Although the structure of the front sensor module is changed, the mapping that combines the teacher
outputs for longitudinal actions does not change; the rules are exactly the same as in the previous
case, given in Table 4.7 and Figure 4.4.
For the lateral actions, the changes in the front sensor and the addition of the lane and pinch
modules affect the mapping. The new function generating the automaton input is more complex
than a single OR gate. For the lateral action SiL, the combined output is again an OR-ed
combination of the sensor outputs, except for one case. When the lane module sends a reward signal to the
lateral automaton but the headway module indicates an approaching vehicle, the action SiL must
be penalized, as is the case with a simple OR function. However, in order to avoid unnecessary
lane changes, the penalty response of the headway module is suppressed whenever the
longitudinal action is DEC, for headway distances that are not dangerously close (Table 5.4).
                         Module Output
Action: SiL (the Right/Left and Pinch module outputs are always 0)
   Lane   Headway   Combined
    0        0         0
    0        1         1
    0        1*        1 or 0*
    1        0         1
    1        1         1
    1        1*        1
Actions: SR or SL
   Right/Left   Lane   Headway   Pinch   Combined
       0          0       0        0        0
       0          0       0        1        1
       0          0       0**      0        0
       0          0       0**      1        1
       0          1       0        0        1
       0          1       0        1        1
       0          1       0**      0        0†
       0          1       0**      1        1
       1          0       0        0        1
       1          0       0        1        1
       1          0       0**      0        1
       1          0       0**      1        1
       1          1       0        0        1
       1          1       0        1        1
       1          1       0**      0        1
       1          1       0**      1        1
* Combined output depends on the longitudinal action (see text and Figure 5.19).
† All combined outputs for SR and SL are OR-ed side, lane and pinch module outputs except
this one, if 0 and 0** are assumed to be the same binary value, FALSE.
Table 5.4. Combined feedback responses for the lateral actions.
For the actions SR and SL, an additional if-then condition again needs to be added for one
case. Consider a situation where the side sensor and pinch modules indicate a reward for the lane-
shifting actions SR and SL. The lane module, on the other hand, favors the lateral action SiL because
the vehicle is currently cruising in its desired lane. However, there is another vehicle in very close
proximity, and the front sensor module sends a reward signal to indicate the need for
a lane change (if the headway distance is under the predefined limit d1, we assume that the action
DEC has failed to avoid a collision). This reward signal must override the lane module's penalty
signal (indicated with 0** in Table 5.4). The resulting mappings from multiple teacher outputs to
a single automaton input are given in Figure 5.19.
In addition to the rules described above, one more subroutine is defined to check the memory buffer interface to the regulation layer. Currently, if an automated vehicle senses a decreasing headway and nothing on the sides, with no useful information from the lane module, it cannot decide whether to shift left or right. The reason is the lack of global information: using only its local sensing devices, the vehicle cannot distinguish between the left and right lanes, and the probabilities of both actions approach 0.5. To overcome this problem, a symmetry-breaking rule which randomly chooses one of the actions is introduced. The overall structure of the intelligent planner is given in Section 5.4.1.
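The symmetry-breaking subroutine can be sketched as follows. This is a minimal illustration; the function and flag names are ours, and the surrounding planner logic is reduced to the single decision described above:

```python
import random

def lateral_tiebreak(headway_decreasing, left_clear, right_clear, lane_hint):
    """Pick a lane-shift direction when local sensing cannot decide.

    lane_hint is 'left', 'right', or None (no useful lane-module output).
    Returns 'SL' (shift left), 'SR' (shift right), or None (no shift needed).
    """
    if not headway_decreasing:
        return None                      # keep the current lane
    if lane_hint == "left" and left_clear:
        return "SL"
    if lane_hint == "right" and right_clear:
        return "SR"
    # Both sides clear and no lane preference: the probabilities of SL
    # and SR would both settle near 0.5, so break the symmetry at random.
    if left_clear and right_clear:
        return random.choice(["SL", "SR"])
    if left_clear:
        return "SL"
    if right_clear:
        return "SR"
    return None
```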
[Figure 5.19. The flag structure of the intelligent planner: the headway, lane/speed, left/right and pinch detection modules produce binary teacher outputs (current and desired speed and lane as inputs), which are combined by if-then rules and OR gates, passed through the memory buffer, and fed to the longitudinal and lateral automata; the fired actions are sent to the regulation layer.]
Although not important for our design purposes, we chose to disable speed changes during lane shifting. Once the lane change is completed, the probability update for longitudinal actions continues. The longitudinal vehicle dynamics are modeled as simply as possible; however, we assert that realistic vehicle dynamics will not affect the overall system behavior since the regulation and physical layers will react to fired actions fast enough. Although not instantaneous, speed changes of 1 m/s are not hard to obtain. For lateral movements, we assumed that an automated vehicle completes a single lane change so that the maximum lateral acceleration is less than or equal to 0.5 m/s².
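Under this acceleration bound, a rough lower bound on the lane-change duration follows from a bang-bang lateral acceleration profile (full acceleration toward the target lane for the first half of the maneuver, full deceleration for the second half). The 3.6 m lane width below is our assumption for illustration; the thesis does not fix a value here:

```python
import math

def min_lane_change_time(lane_width_m=3.6, a_max=0.5):
    """Minimum lane-change duration with a bang-bang lateral acceleration
    profile: +a_max for the first half of the maneuver, -a_max for the
    second, ending with zero lateral speed.

    The first half covers lane_width/2 in time t/2:
        w/2 = a_max * (t/2)**2 / 2   =>   t = 2 * sqrt(w / a_max)
    """
    return 2.0 * math.sqrt(lane_width_m / a_max)

t = min_lane_change_time()
print(f"{t:.2f} s")   # about 5.37 s for a 3.6 m lane at 0.5 m/s^2
```

This suggests lane changes take several seconds at this acceleration limit, which is consistent with disabling longitudinal updates for the duration of the maneuver.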
The automata use the algorithms given in Chapter 6 as reinforcement schemes. The
learning parameters are adjusted to give fast enough learning while keeping them relatively low to
avoid instantaneous learning.
Figures 5.21 and 5.22 show the average platoon speed and the distances between vehicles for four platoons of ten vehicles, each using a different set of parameters. Positions and speeds of the vehicles for t = 0 and t = 29.6 sec are given in Figure 5.23. All the simulations (#1-4) are run separately, and the data are then combined for illustration. Each platoon corresponds to a simulation run. Figure 5.23 also includes the mpeg movie of the four runs. The initial positions and speeds of the vehicles are the same for all runs, as is the processing speed of 25 Hz. The desired speed for all vehicles is 80 km/h.
As seen in Figure 5.23, the positions of the vehicles at t = 29 sec and their speeds are different for each simulation run. The values plotted in Figures 5.21, 5.22, and 5.23 are not averaged over a large number of runs with the same parameters; rather, the result for each set of parameters shows the characteristics of that particular choice of parameters.
Simulations 1 and 2 (Figures 5.21a-b and 5.22a-b) illustrate the effect of the length of the memory buffer on the vehicle-following behavior of an automated vehicle. For the same initial separation and velocity, the velocities of the vehicles reach a steady state much faster when the memory buffer is shorter. The separation between vehicles is more uniform with a larger memory buffer. This is due to the fact that the system with the shorter buffer reaches the permitted region [12 m, 18 m] for the headway distance faster. Once the desired region is reached, the reaction to the change in the feedback response is also faster. As a result, the separation between vehicles is more uniform for a large memory buffer, but the response is not stable; the opposite is true for a shorter memory vector.
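One plausible reading of the memory buffer's role is a FIFO of recent binary feedback flags whose aggregate drives the automaton update, so that a shorter buffer reacts faster and a longer one smooths transients. The thesis does not spell out the aggregation in this section, so the majority vote below is our assumption:

```python
from collections import deque

class FeedbackBuffer:
    """Memory buffer between the environment modules and an automaton.

    Feedback flags (0 = reward, 1 = penalty) accumulate in a FIFO buffer;
    the automaton is updated with the majority response once the buffer
    is full.  A short buffer reacts quickly to changes in the feedback;
    a long buffer smooths out transient responses.
    """
    def __init__(self, length):
        self.buf = deque(maxlen=length)

    def push(self, beta):
        self.buf.append(beta)

    def ready(self):
        return len(self.buf) == self.buf.maxlen

    def response(self):
        # majority vote over the buffered responses
        return 1 if sum(self.buf) * 2 > len(self.buf) else 0
```

With this reading, a buffer of length 9 fires updates faster than one of length 25, matching the qualitative behavior reported for the platoon simulations.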
Changing the sensor range while keeping other parameters constant also affects the behavior of the vehicles in the platoon, as expected. Simulations 1 and 3 differ only in the choice of the values for d1 and fsr. As seen in Figures 5.21(a, c) and 5.22(a, c), by increasing the region around the switching value d2, it is possible to obtain steady-state speeds 15 sec after the start of the run. The average speed of the platoon is still under the desired value of 80 km/h since the headway distance is still too small for the action ACC to fire for some of the vehicles. Eventually, all vehicles will reach their desired speed provided that the headway distance is above the safe value. An example of this behavior is seen in Figures 5.21c and 5.22c: as soon as the headway distance grows larger than d2 = 15 m, at t ≈ 29 sec, one of the vehicles increases its speed, and the effect on the average platoon speed is seen. Furthermore, it is important to note that the initial vehicle separations are already in the defined sensor range. As seen in Figure 5.23, the overall behavior of the platoon in simulation 3 is closer to the one in simulation 1 than in simulation 2.
Simulation 4 is given to illustrate the importance of the choice of learning parameters. In terms of uniformity of the headway distances, the result of this simulation is the worst, which is of course due to the relatively small learning parameters. Although the length of the memory vector is the same as in simulation 2, the responses to the headway module are slow, and therefore the headway and speed variations are large.
Figure 5.21. Average platoon speeds for simulations 1-4.
[Panel parameters: (a) d1=12, d2=15, fsr=16; (b) d1=12, d2=15, fsr=16; (c) d1=10, d2=15, fsr=20; (d) d1=12, d2=15, fsr=16.]
Figure 5.22. The distance between vehicles for simulations 1-4.
[Figure 5.23 content. (a) Positions at t = 0 s. (b) Initial speeds, identical for all four platoons:
80 78 78 80 80 82 80 82 82 80
(c) Final positions. (d) Final speeds of the four platoons (simulations 1-4):
79 80 80 79 79 79 79 79 79 80
79 79 79 79 79 79 79 79 80 80
79 79 79 79 79 79 80 80 80 80
79 79 80 80 79 80 80 80 80 79
(e) mpeg movie: ch5m2.mpg (0.237 Mb; every frame represents one fifth of a second).]
Figure 5.23. Initial and final positions (a,c) and speeds (b,d) of automated vehicles in four
ten-vehicle platoons. The mpeg movie of the simulation is accessible via icon (e).
Figures 5.24-5.26 show the results of a second batch of simulations (numbers 5 and 6). In both simulations, the initial conditions are the same; the desired speed for all vehicles is still 80 km/h, except for the leading vehicle, which desires to slow down to 77 km/h. In this scenario, the effect of the learning parameters is obvious: a fast-learning vehicle will react faster to the speed changes (and therefore to the headway changes). Also, a very short sensor region may create a problem because it directly affects the reaction time. However, the effect of the length of the memory buffer is not as obvious. In simulations 5 and 6 (Figures 5.24 and 5.25), the only difference between the two platoons is the size of the memory vector. There are no apparent differences in the behavior of the platoon. Again, the separations are distributed over a larger region for a smaller memory vector, but the difference is minor.
Figure 5.24. Simulations (a) 5 and (b) 6: average speed for a platoon of ten automated vehicles.
[Panel parameters for both panels: d1=10, d2=15, fsr=18.]
Figure 5.25. Simulations (a) 5 and (b) 6: distances between vehicles for a platoon of ten
automated vehicles.
[Figure 5.26 content. (a) Positions at t = 0 s. (b) Initial speeds, identical for both platoons: 80 78 78 80 80 82 80 82 82 80. (c) Final positions. (d) Final speeds (simulations 5 and 6): 76 76 76 76 76 76 77 77 77 78 and 76 76 76 76 76 77 77 76 77 77. (e) mpeg movie: ch5m3.mpg (0.345 Mb; every frame represents one fifth of a second).]
Figure 5.26. Initial and final positions (a,c) and speeds (b,d) of automated vehicles in two ten-vehicle platoons (simulations 5 and 6). The mpeg movie of the simulation is accessible via icon (e).
When the speed change of the leading vehicle is much larger, the effect of the memory vector on platoon behavior can be seen much more clearly. The two mpeg movies in Figure 5.27 show two unstable platoons. When the lead vehicle slows down to 75 km/h, the following vehicles are not able to keep their safe distances. Using the same parameters as before, a smaller memory vector length, resulting in faster firing of actions, gives a better result. With a memory vector of length 25, the third (or sometimes fourth) vehicle crashes into the vehicle in front, while with a smaller length of 9, platoon stability is kept longer, and the problem occurs at the end of the platoon (seventh, eighth, or ninth vehicle). Since the speed (and headway) deviations increase as we move back in the platoon, an increase in the capability of the automated vehicles avoids an early problem.
The simulations given here are run using the general linear reward-penalty learning
scheme. Examples of the nonlinear learning scheme and its comparison to linear schemes are
given in Appendix C.
ch5m4.mpg
(0.065Mb; every frame represents one fifth of a second)
(a)
ch5m5.mpg
(0.100Mb; every frame represents one fifth of a second)
(b)
Figure 5.27. Simulations 7 and 8: mpeg movies of two simulations of 10-vehicle platoons with the lead vehicle decelerating to 75 km/h. The length of the memory vector is (a) 25 and (b) 9; all other parameters are the same as in simulation 6.
every agent on the highway. The autonomous vehicle approach, on the other hand, suffers from the lack of global information. While designing the intelligent controller, we assumed local sensing and minimal or no communications.

Solving the path planning problem locally may be relatively inexpensive and simple due to individual design goals and lack of communications, but the price to be paid may be the optimality of such a distributed system. Consider the situation given in Section 5.3.1 (Figure 5.17). For an autonomous vehicle to detect that the current solution is only local and does not satisfy the global goals (in the sense that the vehicle is capable of avoiding collision, but is not able to reach its desired speed), information about the relative positions (and possibly speeds) of other vehicles needs to be known. This information can be provided by the link layer in a hierarchical system [Varaiya93], or may be extracted using visual cues in the autonomous vehicle approach [Özgüner96, Pomerleau96, Weber96]. Again, the information content will be lower in the latter case. The need for more global information, rather than only local sensors, is crucial and apparent, considering the difficulties encountered during this research.
Chapter 6
New Reinforcement Schemes for Stochastic Learning Automata

- the nature of the mapping from the previous probability vector to the new probability vector, e.g.: (i) linear, nonlinear or hybrid, or (ii) projectional, nonprojectional¹;
- the behavior of the learning automaton using the scheme, e.g., optimal or expedient.
Our efforts in designing an intelligent path controller for an autonomous vehicle using stochastic learning automata led us to employ four different reinforcement schemes.

The first two algorithms are widely known [Narendra89, Baba85]. Their convergence properties were investigated, and several applications using these schemes were previously presented [Chand69, Baba80]. The third algorithm is less popular, since the behavior of an automaton using this scheme has not been analytically proven [Narendra89]. A special case of this algorithm, L_R-εP, is obtained by adding a small penalty term (0 < b << a < 1) to the L_R-I
1
See Section 3.5 for the definition of these terms.
Cem Ünsal Chapter 6. New Reinforcement Schemes for Stochastic Learning Automata 93
scheme. However, the proof of ε-optimality for the general r-action case is again not complete [Narendra89]. The general linear reward-penalty scheme, L_R-P, is linear and nonprojectional. In Section 6.1.1, we will prove that this scheme is optimal in the case where there is a pure optimal action.
The last scheme is nonlinear, projectional and is an extension of the general absolutely
expedient schemes [Narendra89, Baba85]. It differs from previous schemes by the definition of
an additional reward function. We will also show in this chapter that this scheme is absolutely
expedient in stationary environments.
The last two schemes are used in this work for their improved behavior in specific cases
frequently encountered in our application to intelligent vehicle control. Both schemes are found
to be convergent to the optimal solution faster than the first two schemes. They are the direct
results of our attempts to create reinforcement schemes with desirable characteristics suitable for
this study of learning automata applications to intelligent vehicle control.
In the following section, we will compare the linear reinforcement scheme L_R-P to previously existing linear learning algorithms. The nonlinear reinforcement scheme NLH is also compared to the general nonlinear scheme in Section 6.2. Since these two new schemes are found to be useful, we have investigated their convergence properties. Using the nonlinear stability theorems, we give the proof of optimality of the linear scheme L_R-P for a specific case. Furthermore, the conditions of absolute expediency are checked for the new nonlinear scheme NLH, and the scheme is proven to be absolutely expedient.
For $\beta(n) = 1$ (penalty):

$$
\begin{aligned}
p_i(n+1) &= (1-b)\,p_i(n) \\
p_j(n+1) &= \frac{b}{r-1} + (1-b)\,p_j(n), \qquad j \neq i
\end{aligned}
\tag{6.1}
$$
where r is the number of actions. Parameters a and b are associated with the reward and penalty updates, respectively. For b = 0, the algorithm is called the linear reward-inaction scheme, and is known to be ε-optimal [Narendra89]. The specific case where a = b is generally known as the linear reward-penalty scheme, and is expedient in stationary environments [Bush58, Narendra89].
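Equation 6.1 gives the penalty half of the scheme; together with the usual reward half (p_i ← p_i + a(1 − p_i), p_j ← (1 − a)p_j), which the derivation of Equation 6.2 also uses, the update can be sketched directly. The convention below is β = 0 for reward and β = 1 for penalty, as in the text:

```python
def lrp_update(p, i, beta, a, b):
    """General linear reward-penalty update (Equation 6.1 and its reward half).

    p    : list of action probabilities (sums to 1)
    i    : index of the action just performed
    beta : environment response, 0 = reward, 1 = penalty
    a, b : reward and penalty learning parameters
    b = 0 gives L_R-I; a = b gives the classical linear reward-penalty scheme.
    """
    r = len(p)
    q = p[:]
    if beta == 0:                       # reward
        for j in range(r):
            q[j] = (1 - a) * p[j]
        q[i] = p[i] + a * (1 - p[i])
    else:                               # penalty
        for j in range(r):
            q[j] = b / (r - 1) + (1 - b) * p[j]
        q[i] = (1 - b) * p[i]
    return q
```

Both branches preserve the probability simplex: the updated components still sum to 1.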
² Except the case where the learning parameter b is much smaller than the parameter a, i.e., the L_R-εP scheme.
Figure 6.1. Number of steps needed for popt to reach 0.995 for different values of learning
parameters a and b: (top) maximum, average and minimum number of steps; (bottom) average and
minimum number of steps (over 500 runs; 3 actions; initial probabilities same; perfect case; note
the axis shift and scaling in the second plot).
Table 6.1. Behavior of the automata under general LR P scheme: 500 runs;
maximum step =1500; 3 actions with c = [0 1 1] (This data is plotted in Figure 6.1).
³ The mark * indicates the linear reward-penalty scheme L_R=P (a = b).
In some simulation runs, the probability popt did not reach 0.995. In this case, the statistical values are calculated by discarding the data for these runs, and therefore do not represent the actual behavior of the algorithm. The fact that the algorithm did not converge in 1500 steps does not necessarily mean that it would not converge if the limit were increased. Similarly, the fact that we have convergence in all 500 runs does not guarantee convergence at all times.

Figure 6.2 shows the results of the simulations for the parameter ranges 0 ≤ b ≤ 0.9 and 0.1 ≤ a ≤ 0.9. For each parameter pair, 500 simulations are completed. The number of simulations which did not converge in 1500 steps is indicated with the marker 'o'. The marker 'x' indicates that the probability of the pure optimal action converged to 0.995 in all 500 simulation runs. We know that for a = 0, convergence is not obtained no matter what the value of parameter b is (Appendix B). As seen from the plot, if the value of the parameter b is much larger than the parameter a for small values of a, convergence again is not guaranteed. The exact limit of the non-convergence region is not known, since the number of simulations is 500 and the limit of time steps is 1500. Furthermore, this region is different for different numbers of actions.
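The experiment behind Figure 6.2 can be reproduced in outline as follows. This is a minimal sketch, not the thesis's simulation code; the random seed and the particular (a, b) pair are ours:

```python
import random

def run_until_converged(c, a, b, max_steps=1500, threshold=0.995):
    """Simulate one automaton in a stationary environment with penalty
    probabilities c under the general linear reward-penalty scheme.
    Returns the step at which the optimal action's probability reaches
    threshold, or None if it never does within max_steps."""
    r = len(c)
    p = [1.0 / r] * r                    # equal initial probabilities
    opt = min(range(r), key=lambda k: c[k])
    for n in range(1, max_steps + 1):
        i = random.choices(range(r), weights=p)[0]
        beta = 1 if random.random() < c[i] else 0
        if beta == 0:                    # reward
            p = [(1 - a) * pj for pj in p]
            p[i] += a                    # p_i = p_i + a*(1 - p_i)
        else:                            # penalty
            p = [b / (r - 1) + (1 - b) * pj for pj in p]
            p[i] -= b / (r - 1)          # p_i = (1 - b) * p_i
        if p[opt] >= threshold:
            return n
    return None

# Count non-converging runs out of 500 for one (a, b) pair,
# with the perfect environment c = [0 1 1] used in Table 6.1
random.seed(0)
fails = sum(run_until_converged([0, 1, 1], a=0.2, b=0.1) is None
            for _ in range(500))
print(fails, "non-converging runs out of 500")
```

Sweeping (a, b) over a grid and plotting `fails` per pair reproduces the shape of the non-convergence region qualitatively.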
[Figure 6.2 plot: markers over the (a, b) grid; the non-converging run counts at small a include 65 (b = 0.9), 48 (b = 0.8), 47 (b = 0.7), 17 (b = 0.6), and 8 (b = 0.5).]
Figure 6.2. For some parameter values, convergence is not obtained for all 500 runs
(Number of non-converging runs are shown next to marker o).
Table 6.2 gives the maximum, minimum and average number of steps for the optimal action's probability to converge to 0.995, for a smaller range of learning parameters. As seen from the table (and Figure 6.1), a slight change in the value of the parameter affects the speed of convergence to the pure optimal strategy. Especially for smaller values of parameter a, the maximum, and consequently the average, number of steps decreases significantly with a decreasing value of parameter b. For example, the maximum number of steps decreases from 105 to 48 when we switch from the L_R=P scheme to the L_R-P scheme by decreasing the value of the parameter b from 0.2 to 0 while a = 0.2.
Changing the initial values of the action probabilities does not drastically change the behavior of the automaton, except to increase the number of time steps for convergence to the pure optimal strategy. Figure 6.3 shows the maximum, average and minimum number of steps for convergence with an initial probability vector of [0.05 0.4975 0.4975]. Besides the increase in the number of steps needed for convergence, the comparison of Figures 6.1 and 6.3 shows that the effects of the learning parameters are comparable.
Similarly, increasing the number of actions also increases the number of time steps needed for convergence; the effect of the parameters is generally the same. As seen from Tables 6.2 and 6.3, the minimum number of steps for convergence is not affected significantly, but the maximum number of steps, and therefore the average, increases with an increasing number of actions.

Furthermore, if the probabilities of penalty for all non-optimal actions (cj) are less than 1, the convergence is again slower, as expected (Figure 6.4). The change in the convergence rate is due to the decrease in the number of correct updates on the action probability vector. The effects of the parameters are again similar to previous cases. The effect of the parameter b on the number of non-converging runs and the average number of steps for convergence is significant when the learning parameter a is small.
Figure 6.3. Number of steps needed for popt to reach 0.995 for different values of learning parameters a and b: maximum, average and minimum number of steps (over 500 runs; 3 actions; perfect case; p(0) = [0.05 0.4975 0.4975]).
Figure 6.4. Number of steps needed for popt to reach 0.995 for different values of learning parameters a and b: maximum, average and minimum number of steps (500 runs; 3 actions; ideal case: c = [0 0.8 0.5]; initial probabilities are the same).
result in a complete proof; the establishment of ε-optimality of the L_R-P scheme in the general r-action case is incomplete [Narendra89].

Here, we will consider all values of the parameters a and b and use the general stability theorem on nonlinear discrete-time systems [Kalman62] to prove convergence for a specific region, without limiting the parameter b to small values. However, there is a price to pay: the use of the stability theorem limits this proof to the specific case where there is an optimal action⁴.
To prove the convergence to the pure optimal action, we start with the conditional expectation of the action probabilities. From the definition of the L_R-P scheme (Equation 6.1), we may write the conditional expectation for the probability of an action at the next step as:

$$
\begin{aligned}
E\big[p_i(n+1)\mid p_i(n)\big]
&= \sum_{k=1}^{r} E\big[p_i(n+1)\mid p_i(n),\ \alpha(n)=\alpha_k\big]\,p_k(n) \\
&= \sum_{k=1}^{r} c_k\, E\big[p_i(n+1)\mid p_i(n),\ \alpha(n)=\alpha_k,\ \beta(n)=1\big]\,p_k(n) \\
&\quad + \sum_{k=1}^{r} (1-c_k)\, E\big[p_i(n+1)\mid p_i(n),\ \alpha(n)=\alpha_k,\ \beta(n)=0\big]\,p_k(n) \\
&= c_i\,p_i(n)(1-b)\,p_i(n) + \sum_{k\neq i} c_k\,p_k(n)\left[\frac{b}{r-1} + (1-b)\,p_i(n)\right] \\
&\quad + (1-c_i)\,p_i(n)\big[p_i(n) + a(1-p_i(n))\big] + \sum_{k\neq i}(1-c_k)\,p_k(n)(1-a)\,p_i(n)
\end{aligned}
\tag{6.2}
$$
We can write this equality as:

$$
\begin{aligned}
E\big[p_i(n+1)\mid p_i(n)\big]
&= p_i(n)(1-b)\sum_{k=1}^{r} c_k\,p_k(n)
+ \frac{b}{r-1}\sum_{k\neq i} c_k\,p_k(n)
+ (1-a)\,p_i(n)\sum_{k=1}^{r} p_k(n) \\
&\quad - (1-a)\,p_i(n)\sum_{k=1}^{r} c_k\,p_k(n)
+ a\,p_i(n)(1-c_i)
\end{aligned}
\tag{6.3}
$$
Rearranging terms and using the fact that $\sum_{k=1}^{r} p_k(n) = 1$, we get:

$$
\begin{aligned}
E\big[p_i(n+1)\mid p_i(n)\big]
&= \big[(1-b)-(1-a)\big]\,p_i(n)\sum_{k=1}^{r} p_k(n)\,c_k
+ \frac{b}{r-1}\sum_{k\neq i} p_k(n)\,c_k
+ (1-a)\,p_i(n) + a\,p_i(n)(1-c_i) \\
&= (a-b)\,p_i(n)\sum_{k=1}^{r} p_k(n)\,c_k
+ \frac{b}{r-1}\sum_{k\neq i} p_k(n)\,c_k
+ (1-a\,c_i)\,p_i(n)
\end{aligned}
\tag{6.4}
$$
⁴ The case where there is an optimal action is the most common case in stochastic automata applications. The convergence of the probability vector to the pure optimal strategy in this case has been proven for almost all linear and nonlinear reinforcement schemes [Najim94, Narendra89].
For a = b, this corresponds to the difference equation for the L_R=P scheme [Narendra89]. Now, taking the expected value of both sides, we obtain:

$$
E\big[p_i(n+1)\big] = (a-b)\sum_{k=1}^{r} c_k\,E\big[p_i(n)\,p_k(n)\big]
+ \frac{b}{r-1}\sum_{k\neq i} c_k\,E\big[p_k(n)\big]
+ (1-a\,c_i)\,E\big[p_i(n)\big]
\tag{6.5}
$$
At this point, we will state two theorems for the stability of nonlinear discrete-time
systems which will be used in our proof.
Then the equilibrium state x = 0 is asymptotically stable in the large, and V(x) is a Lyapunov function. Condition (ii) may be replaced with:

(ii') ΔV(x) ≤ 0 for all x, and ΔV(x) does not vanish identically for any solution sequence {x(n)} satisfying Equation (6.6).
$$
\bar p_i(n+1) = \sum_{k=1}^{r} \lambda_{ik}\,\bar p_k(n)
+ \sum_{k=1}^{r} \gamma_{ik}\left(\bar p_i(n)\,\bar p_k(n) + \sigma^2_{ik}(n)\right)
\tag{6.9}
$$

where:

$$
\begin{aligned}
\bar p_i(n) &\equiv E\big[p_i(n)\big], \qquad i = 1,\dots,r \\
\sigma^2_{ij}(n) &\equiv \mathrm{Cov}\big(p_i(n),\,p_j(n)\big) \\
\gamma_{ik} &\equiv (a-b)\,c_k, \qquad k = 1,\dots,r \\
\lambda_{ii} &\equiv 1 - a\,c_i \\
\lambda_{ij} &\equiv \frac{b}{r-1}\,c_j, \qquad j \neq i
\end{aligned}
\tag{6.10}^5
$$
We first give the proof for a three-action automaton for the sake of clarity. For a three-action automaton with an optimal action (c₁ = 0), Equation 6.9 (or Equation 6.5) can be written as:

$$
\begin{aligned}
\bar p_1(n+1) &= \lambda_{11}\bar p_1(n) + \lambda_{12}\bar p_2(n) + \lambda_{13}\bar p_3(n) \\
&\quad + \gamma_{11}\bar p_1(n)^2 + \gamma_{11}\sigma^2_{11}(n) + \gamma_{12}\bar p_1(n)\bar p_2(n) + \gamma_{12}\sigma^2_{12}(n) + \gamma_{13}\bar p_1(n)\bar p_3(n) + \gamma_{13}\sigma^2_{13}(n) \\
\bar p_2(n+1) &= \lambda_{21}\bar p_1(n) + \lambda_{22}\bar p_2(n) + \lambda_{23}\bar p_3(n) \\
&\quad + \gamma_{21}\bar p_2(n)\bar p_1(n) + \gamma_{21}\sigma^2_{21}(n) + \gamma_{22}\bar p_2(n)^2 + \gamma_{22}\sigma^2_{22}(n) + \gamma_{23}\bar p_2(n)\bar p_3(n) + \gamma_{23}\sigma^2_{23}(n) \\
\bar p_3(n+1) &= \lambda_{31}\bar p_1(n) + \lambda_{32}\bar p_2(n) + \lambda_{33}\bar p_3(n) \\
&\quad + \gamma_{31}\bar p_3(n)\bar p_1(n) + \gamma_{31}\sigma^2_{31}(n) + \gamma_{32}\bar p_3(n)\bar p_2(n) + \gamma_{32}\sigma^2_{32}(n) + \gamma_{33}\bar p_3(n)^2 + \gamma_{33}\sigma^2_{33}(n)
\end{aligned}
\tag{6.11}
$$
where, for j = 2, 3 and k = 2, 3:

$$
\begin{aligned}
\gamma_{j1} &= (a-b)\cdot 0 = 0, & \gamma_{jk} &= (a-b)\,c_k \\
\lambda_{j1} &= \frac{b}{3-1}\cdot 0 = 0, & \lambda_{jj} &= 1 - a\,c_j, \qquad \lambda_{jk} = \frac{b}{3-1}\,c_k \ (k \neq j)
\end{aligned}
\tag{6.12}
$$
Again, σ²ᵢᵢ(n) are variances and σ²ᵢⱼ(n) are covariances of the action probabilities. Since the first action has zero penalty probability, the difference equations for the last two action probabilities do not include the probability of the first action. Therefore, the difference equations for the probabilities of the last two actions can be written as:

$$
\begin{aligned}
\bar p_2(n+1) &= (1-a\,c_2)\,\bar p_2(n) + \frac{b}{2}\,c_3\,\bar p_3(n) \\
&\quad + (a-b)c_2\,\bar p_2(n)^2 + (a-b)c_2\,\sigma^2_{22}(n) + (a-b)c_3\,\bar p_2(n)\bar p_3(n) + (a-b)c_3\,\sigma^2_{23}(n) \\
\bar p_3(n+1) &= \frac{b}{2}\,c_2\,\bar p_2(n) + (1-a\,c_3)\,\bar p_3(n) \\
&\quad + (a-b)c_2\,\bar p_3(n)\bar p_2(n) + (a-b)c_2\,\sigma^2_{32}(n) + (a-b)c_3\,\bar p_3(n)^2 + (a-b)c_3\,\sigma^2_{33}(n)
\end{aligned}
\tag{6.13}
$$
5
We omit the operator E[.] for simplicity.
Let us define a function f of the expected values of the last two probabilities as⁶:

$$
f\!\begin{pmatrix}\bar p_2\\ \bar p_3\end{pmatrix} =
\begin{pmatrix}
(1-a\,c_2)\,\bar p_2 + \frac{b}{2}c_3\,\bar p_3 + (a-b)c_2\big(\bar p_2^2 + \sigma^2_{22}\big) + (a-b)c_3\big(\bar p_2\bar p_3 + \sigma^2_{23}\big) \\[4pt]
\frac{b}{2}c_2\,\bar p_2 + (1-a\,c_3)\,\bar p_3 + (a-b)c_2\big(\bar p_3\bar p_2 + \sigma^2_{32}\big) + (a-b)c_3\big(\bar p_3^2 + \sigma^2_{33}\big)
\end{pmatrix}
\tag{6.14}
$$
If we can prove that this function is a contraction using the 1-norm⁷, then we can state, using Theorems 3.1 and 3.2, that the expected values of the probabilities of the last two actions asymptotically converge to 0. Thus, the probability p₁ of the optimal action asymptotically approaches 1.
The contraction condition is:

$$
\begin{aligned}
\left\| f\!\begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\end{pmatrix} \right\|_1
&= \Big| (1-a\,c_2)\bar p_2 + \tfrac{b}{2}c_3\bar p_3 + (a-b)\big(c_2\bar p_2^2 + c_2\sigma^2_{22} + c_3\bar p_2\bar p_3 + c_3\sigma^2_{23}\big) \Big| \\
&\quad + \Big| (1-a\,c_3)\bar p_3 + \tfrac{b}{2}c_2\bar p_2 + (a-b)\big(c_3\bar p_3^2 + c_3\sigma^2_{33} + c_2\bar p_2\bar p_3 + c_2\sigma^2_{32}\big) \Big| \\
&\overset{?}{<}\ \bar p_2(n) + \bar p_3(n) = \left\| \begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\end{pmatrix} \right\|_1
\end{aligned}
\tag{6.15}
$$
Since a, b, cᵢ and p̄ᵢ are all parameters/variables in [0, 1], and the last terms inside the absolute values are always positive⁸ for all n, we can write:

$$
\begin{aligned}
\left\| f\!\begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\end{pmatrix} \right\|_1
&\le (1-a\,c_2)\bar p_2 + \tfrac{b}{2}c_3\bar p_3 + |a-b|\big(c_2\bar p_2^2 + c_2\sigma^2_{22} + c_3\bar p_2\bar p_3 + c_3\sigma^2_{23}\big) \\
&\quad + (1-a\,c_3)\bar p_3 + \tfrac{b}{2}c_2\bar p_2 + |a-b|\big(c_3\bar p_3^2 + c_3\sigma^2_{33} + c_2\bar p_2\bar p_3 + c_2\sigma^2_{32}\big)
\end{aligned}
\tag{6.16}
$$
Rearranging the terms, we have:

$$
\begin{aligned}
\left\| f\!\begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\end{pmatrix} \right\|_1
&\le (1-a\,c_2)\bar p_2 + \tfrac{b}{2}c_2\bar p_2 + |a-b|\,c_2\big(\bar p_2^2 + \sigma^2_{22} + \bar p_2\bar p_3 + \sigma^2_{23}\big) \\
&\quad + (1-a\,c_3)\bar p_3 + \tfrac{b}{2}c_3\bar p_3 + |a-b|\,c_3\big(\bar p_3^2 + \sigma^2_{33} + \bar p_2\bar p_3 + \sigma^2_{23}\big)
\end{aligned}
\tag{6.17}
$$
⁶ Omitting the argument (n) for simplicity.
⁷ Denoted here as ‖·‖₁.
⁸ All the terms of the sum are positive: the first term includes the square of the expected value and the second term the variance, which are always positive; the third and fourth terms can be combined as cᵢ(p̄ᵢp̄ⱼ + σ²ᵢⱼ) = cᵢE[pᵢpⱼ]. The expected value of the product of any two action probabilities is always positive since 0 < pᵢ < 1.
Replacing b/2 by b without invalidating the inequality, and using the identity $E[xy] = E[x]\,E[y] + \sigma^2_{xy}$, we obtain:

$$
\begin{aligned}
\left\| f\!\begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\end{pmatrix} \right\|_1
&< (1-a\,c_2)\bar p_2 + b\,c_2\bar p_2 + |a-b|\,c_2\big(E[p_2^2] + E[p_2 p_3]\big) \\
&\quad + (1-a\,c_3)\bar p_3 + b\,c_3\bar p_3 + |a-b|\,c_3\big(E[p_3^2] + E[p_2 p_3]\big)
\end{aligned}
\tag{6.18}
$$

We know that $E[p_i^2] + E[p_i p_j] = E\big[p_i^2 + p_i p_j\big] = E\big[p_i(p_i + p_j)\big] < E[p_i]$, since the sum of all probabilities is equal to 1. Therefore:

$$
\left\| f\!\begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\end{pmatrix} \right\|_1
< (1-a\,c_2)\bar p_2 + b\,c_2\bar p_2 + |a-b|\,c_2\bar p_2
+ (1-a\,c_3)\bar p_3 + b\,c_3\bar p_3 + |a-b|\,c_3\bar p_3
\tag{6.19}
$$
For a > b, the coefficient of each term is $(1-a\,c_i) + b\,c_i + (a-b)\,c_i = 1$, so:

$$
\left\| f\!\begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\end{pmatrix} \right\|_1
< \bar p_2(n) + \bar p_3(n)
= \left\| \begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\end{pmatrix} \right\|_1
\tag{6.20}
$$
Therefore, for the region a > b (Figure 6.5), the expected values of the probabilities of the sub-optimal actions all converge to zero. Since the sum of all probabilities is equal to 1, the expected value of the optimal action's probability must converge to 1. Figure 6.6a shows the changes in the probabilities of the suboptimal actions for a linear reward-penalty scheme with a = 0.2 and b = 0.1 and initial probabilities p(0) = [0.05 0.50 0.45]. As seen from the plot, the sample-path probabilities p2 and p3 do not themselves converge asymptotically to 0. In 83 steps, the probability p1 of the optimal action reaches 0.999. The cumulative averages

$$
\bar p_i(n) = \frac{1}{n+1}\sum_{k=0}^{n} p_i(k), \qquad n = 0, 1, \dots, 83,\ \ i = 2, 3
$$

are plotted in Figure 6.6b.
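The cumulative average above is a simple running mean over the recorded probability vectors; a minimal sketch:

```python
def cumulative_averages(history):
    """history: list of probability vectors p(0), p(1), ...
    Returns the running means (1/(n+1)) * sum_{k<=n} p(k) for each n."""
    out, acc = [], [0.0] * len(history[0])
    for n, p in enumerate(history, start=1):
        acc = [s + x for s, x in zip(acc, p)]   # component-wise partial sum
        out.append([s / n for s in acc])        # divide by the count so far
    return out
```

Plotting the last elements of this sequence against each other reproduces the smoothed trajectory of Figure 6.6b from the noisy sample path of Figure 6.6a.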
[Figure 6.5 plot: the stable region in the (a, b) unit square, on the a > b side of the diagonal.]
Figure 6.5. The region for asymptotic convergence to the pure optimal strategy.
Figure 6.6. Probabilities (a) and cumulative averages (b) of the probabilities of the non-optimal actions for a 3-action automaton with L_R-P (a = 0.2, b = 0.1).
For r = N actions with $\bar{\mathbf p}(n) = \big[\bar p_1(n)\ \bar p_2(n)\ \dots\ \bar p_r(n)\big]^T$, the contraction condition can be written as:

$$
\left\| f\!\begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\\ \vdots\\ \bar p_N(n)\end{pmatrix} \right\|_1
< \sum_{i=2}^{N} \bar p_i(n)
= \left\| \begin{pmatrix}\bar p_2(n)\\ \bar p_3(n)\\ \vdots\\ \bar p_N(n)\end{pmatrix} \right\|_1
\tag{6.21}
$$
$$
\begin{aligned}
\big\| f(\bar{\mathbf p}(n)) \big\|_1
&= \sum_{i=2}^{N} \left| (1-a\,c_i)\,\bar p_i + \frac{b}{N-1}\sum_{\substack{j=2\\ j\neq i}}^{N} c_j\,\bar p_j + (a-b)\sum_{j=2}^{N} c_j\big(\bar p_i\bar p_j + \sigma^2_{ij}\big) \right| \\
&\le \sum_{i=2}^{N} (1-a\,c_i)\,\bar p_i + \frac{b}{N-1}\sum_{i=2}^{N}\sum_{\substack{j=2\\ j\neq i}}^{N} c_j\,\bar p_j + \left| (a-b)\sum_{i=2}^{N}\sum_{j=2}^{N} c_j\big(\bar p_i\bar p_j + \sigma^2_{ij}\big) \right|
\end{aligned}
\tag{6.22}
$$
Again, (1 − a cᵢ), p̄ᵢ(n), cᵢ and b are all positive, and the sum of all (p̄ᵢp̄ⱼ + σ²ᵢⱼ) is positive for all time steps. Therefore:

$$
\big\| f(\bar{\mathbf p}(n)) \big\|_1
\le \sum_{i=2}^{N} (1-a\,c_i)\,\bar p_i
+ \frac{b}{N-1}\sum_{i=2}^{N}\sum_{\substack{j=2\\ j\neq i}}^{N} c_j\,\bar p_j
+ |a-b|\sum_{i=2}^{N}\sum_{j=2}^{N} c_j\big(\bar p_i\bar p_j + \sigma^2_{ij}\big)
\tag{6.23}
$$
More explicitly:

$$
\begin{aligned}
\big\| f(\bar{\mathbf p}(n)) \big\|_1 \le\
&(1-a\,c_2)\,\bar p_2 + (1-a\,c_3)\,\bar p_3 + \dots + (1-a\,c_N)\,\bar p_N \\
&+ \frac{b}{N-1}\Big[\big(c_3\bar p_3 + c_4\bar p_4 + \dots + c_N\bar p_N\big)
+ \big(c_2\bar p_2 + c_4\bar p_4 + \dots + c_N\bar p_N\big) + \dots \\
&\qquad\qquad + \big(c_2\bar p_2 + c_3\bar p_3 + \dots + c_{N-1}\bar p_{N-1}\big)\Big] \\
&+ (a-b)\Big[c_2\big(\bar p_2^2 + \sigma^2_{22} + \bar p_2\bar p_3 + \sigma^2_{23} + \dots + \bar p_2\bar p_N + \sigma^2_{2N}\big) \\
&\qquad\qquad + c_3\big(\bar p_2\bar p_3 + \sigma^2_{32} + \bar p_3^2 + \sigma^2_{33} + \dots + \bar p_3\bar p_N + \sigma^2_{3N}\big) + \dots \\
&\qquad\qquad + c_N\big(\bar p_2\bar p_N + \sigma^2_{N2} + \bar p_3\bar p_N + \sigma^2_{N3} + \dots + \bar p_N^2 + \sigma^2_{NN}\big)\Big]
\end{aligned}
\tag{6.24}
$$
Rearranging, and using the identity $E[xy] = E[x]\,E[y] + \sigma^2_{xy}$, we have⁹:

$$
\begin{aligned}
\big\| f(\bar{\mathbf p}(n)) \big\|_1 \le\
&(1-a\,c_2)\,\bar p_2 + \dots + (1-a\,c_N)\,\bar p_N
+ \frac{b}{N-1}\Big[(N-2)\,c_2\,\bar p_2 + \dots + (N-2)\,c_N\,\bar p_N\Big] \\
&+ (a-b)\Big[c_2\big(E[p_2^2(n)] + E[p_2(n)p_3(n)] + \dots + E[p_2(n)p_N(n)]\big) \\
&\qquad\qquad + c_3\big(E[p_2(n)p_3(n)] + E[p_3^2(n)] + \dots + E[p_3(n)p_N(n)]\big) + \dots \\
&\qquad\qquad + c_N\big(E[p_2(n)p_N(n)] + E[p_3(n)p_N(n)] + \dots + E[p_N^2(n)]\big)\Big]
\end{aligned}
\tag{6.25}
$$

Since $\sum_{j=2}^{N} E\big[p_i(n)\,p_j(n)\big] = E\Big[p_i(n)\sum_{j=2}^{N} p_j(n)\Big] < E\big[p_i(n)\big]\cdot 1$, we may write:
$$
\begin{aligned}
\big\| f(\bar{\mathbf p}(n)) \big\|_1 <\
&(1-a\,c_2)\,\bar p_2 + \dots + (1-a\,c_N)\,\bar p_N
+ \frac{(N-2)\,b}{N-1}\big[c_2\,\bar p_2 + \dots + c_N\,\bar p_N\big] \\
&+ (a-b)\big[c_2\,\bar p_2 + \dots + c_N\,\bar p_N\big]
\end{aligned}
\tag{6.26}
$$

Combining the terms above and bounding the coefficient $\frac{N-2}{N-1} < 1$ by 1, we have:

$$
\big\| f(\bar{\mathbf p}(n)) \big\|_1
< \sum_{i=2}^{N} \big[(1-a\,c_i) + b\,c_i + (a-b)\,c_i\big]\,\bar p_i
= \sum_{i=2}^{N} \bar p_i
= \big\| \bar{\mathbf p}(n) \big\|_1
\tag{6.27}
$$
We have therefore proven that, for the region a > b (Figure 6.5), the expected values of the probabilities of the sub-optimal actions all converge to zero. This, in turn, implies that the probability of the optimal action goes to 1, i.e., the pure optimal strategy is obtained. The conditions for convergence are that the learning parameter associated with reward be greater than the learning parameter associated with penalty, and that there be one optimal action with c = 0. In short, we have the following:

Theorem 3.3: An automaton using the general linear reward-penalty scheme with unequal learning parameters in a stationary environment reaches the pure optimal strategy if a > b and there is an optimal action with c_opt = 0.

And, since the probability vector converges to the unit vector corresponding to the optimal strategy, the expected value of the average penalty converges to zero. Therefore, such an automaton is expedient and optimal (see the definitions in Chapter 3).
Although L_R-P is not absolutely expedient like the nonlinear schemes given in Chapter 3 and Section 6.2, it is optimal for a specific but widely encountered environment condition. It still remains to extend the proof, if at all possible, to the region {(a, b) ∈ [0, 1] × [0, 1] : a < b} (Figure 6.5). Also, the proof could be extended to all possible environments by considering values c_opt > 0. However, in this case, previous linearization approaches [Narendra89] and the contraction theorems 3.1 and 3.2 cannot be used.
10 This is because the penalty probabilities $c_i$ are independent of the action probabilities $p_i$.
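The scheme proved optimal above can be illustrated in simulation. The following is a minimal sketch of the standard general linear reward-penalty updates with unequal learning parameters a (reward) and b (penalty), following the usual formulation in [Narendra89]; the penalty probabilities `c`, the parameter values, and the function name `lrp_step` are illustrative assumptions, not taken from the text.

```python
import random

def lrp_step(p, a, b, c):
    """One iteration of the general linear reward-penalty (L_R-P) scheme
    with unequal learning parameters a (reward) and b (penalty)."""
    r = len(p)
    # Choose an action according to the current probability vector.
    i = random.choices(range(r), weights=p)[0]
    # Stationary environment: action i is penalized with probability c[i].
    penalized = random.random() < c[i]
    q = p[:]
    if not penalized:  # reward: move probability mass toward action i
        for j in range(r):
            q[j] = (1 - a) * p[j]
        q[i] = p[i] + a * (1 - p[i])
    else:              # penalty: move probability mass away from action i
        for j in range(r):
            q[j] = b / (r - 1) + (1 - b) * p[j]
        q[i] = (1 - b) * p[i]
    return q

random.seed(0)
# One optimal action with c = 0, and a > b, as Theorem 3.3 requires.
c = [0.0, 0.4, 0.7]
p = [1/3, 1/3, 1/3]
for n in range(3000):
    p = lrp_step(p, a=0.05, b=0.01, c=c)
print(round(p[0], 3))  # the probability of the optimal action approaches 1
```

With a > b and c_opt = 0, the probability of the optimal action drifts toward 1 regardless of the starting point; with a < b this drift argument breaks down, which is why the proof above is restricted to that region.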
Cem Ünsal Chapter 6. New Reinforcement Schemes for Stochastic Learning Automata 110
synthesis approach toward reinforcement schemes, and the first result was the absolutely
expedient nonlinear scheme of Lakshmivarahan and Thathachar [Lakshmivarahan73].
Definitions of several nonlinear reinforcement schemes were given in Section 3.5.2. In the
next section, we will introduce a new nonlinear scheme that works better than the previously
defined schemes [Baba83, Baba85] for several applications where fast convergence under certain
conditions is necessary.
if $\alpha(n) = \alpha_i$,

$$p_i(n+1) = p_i(n) - \left(1 - \frac{\beta_{i1} + \cdots + \beta_{iN_r}}{N_r}\right)\sum_{\substack{j=1 \\ j \neq i}}^{N} \phi_j(\mathbf{p}(n)) + \frac{\beta_{i1} + \cdots + \beta_{iN_r}}{N_r}\sum_{\substack{j=1 \\ j \neq i}}^{N} \psi_j(\mathbf{p}(n))$$

$$p_j(n+1) = p_j(n) + \left(1 - \frac{\beta_{i1} + \cdots + \beta_{iN_r}}{N_r}\right)\phi_j(\mathbf{p}(n)) - \frac{\beta_{i1} + \cdots + \beta_{iN_r}}{N_r}\,\psi_j(\mathbf{p}(n)) \quad \text{for all } j \neq i \qquad (6.28)$$
where the functions $\phi_i$ and $\psi_i$ satisfy the following conditions:

$$\frac{\phi_1(\mathbf{p}(n))}{p_1(n)} = \cdots = \frac{\phi_r(\mathbf{p}(n))}{p_r(n)} = \lambda(\mathbf{p}(n))$$

$$\frac{\psi_1(\mathbf{p}(n))}{p_1(n)} = \cdots = \frac{\psi_r(\mathbf{p}(n))}{p_r(n)} = \mu(\mathbf{p}(n))$$

$$p_j(n) + \phi_j(\mathbf{p}(n)) > 0 \qquad (6.29)$$

$$p_i(n) + \sum_{\substack{j=1 \\ j \neq i}}^{r} \psi_j(\mathbf{p}(n)) > 0$$
Parameter $\theta$ is an arbitrarily small positive real number. Also note that the function H includes $p_i$, which is the action probability corresponding to the current action.
[Figure: the function H(p) plotted against p; the vertical axis is marked 0 and 1/k, and the horizontal axis is marked 0.5 and 1.]
We now show that the defined algorithm (Equations 6.31-6.34) satisfies all the conditions in Equations 6.29. From Equation 6.32, we have:

$$\frac{\phi_k(\mathbf{P}(n))}{p_k(n)} = \frac{-k\,H(n)\,p_k(n)}{p_k(n)} = -k\,H(n) = \lambda(\mathbf{P}(n)) \qquad (6.35)$$

and:

$$\frac{\psi_k(\mathbf{P}(n))}{p_k(n)} = \frac{-\theta\,p_k(n)}{p_k(n)} = -\theta = \mu(\mathbf{P}(n)) \qquad (6.36)$$
That is, our definition is consistent with the first two conditions of Equation 6.29. There are three remaining necessary and sufficient conditions for absolute expediency. Using Equations 6.31 and 6.32, the rest of the conditions on $\phi_i$ and $\psi_i$ translate to the following:

Conditions (a) and (b) are associated with the reward updates while condition (c) is associated with the penalty updates. These conditions guarantee that the probabilities stay in the range (0, 1) at all times (with the assumption that none of the probabilities is initially 0 or 1). Conditions (a) and (b) can be shown to be satisfied using the fact that the sum of all probabilities is 1:

$$\text{(a)} \quad p_i + kH(1 - p_i) = p_i + kH\sum_{\substack{j=1 \\ j \neq i}}^{r} p_j < p_i + \sum_{\substack{j=1 \\ j \neq i}}^{r} p_j = 1 \quad \text{since } 0 < kH < 1 \qquad (6.38)$$
$H = 0 \Rightarrow p_i(n) > 0$. This is true for all values of the action probabilities $p_i$.

$$\mu(\mathbf{p}(n)) = -\theta < 0$$
$$\lambda(\mathbf{p}(n)) = -k\,H \le 0 \qquad (6.42)$$
$$\lambda(\mathbf{p}(n)) + \mu(\mathbf{p}(n)) = -k\,H - \theta < 0$$

because $0 < \theta < 1$, $0 < k < 1$, and $0 \le H < 1$. Since the conditions above are sufficient for absolute expediency, and we know that the nonlinear reinforcement scheme NLH satisfies all the conditions listed in the definition of a candidate nonlinear reinforcement scheme, we state that the algorithm given in Equations 6.31-6.34 is absolutely expedient in a stationary environment.
The data shown in Table 6.4 are the results of two different initial conditions: (a) all probabilities are initially the same, and only one action receives reward, and (b) the only action receiving a reward¹² from the environment has a very small probability value. The difference in convergence rate is more distinct in the situation where the probability of the optimal action is initially very close to 0¹³. As seen in Table 6.4, when p_opt = 0.005, the number of iteration steps to reach p_opt = 0.995 is reduced drastically for relatively large values of the learning parameter.

12 I.e., the optimal action.
13 This situation occurs frequently in our application to automated highway systems; for example, while the probability of the lateral action "shift left" is converging to 1, a vehicle may enter the left sensor range. In this case, we need a strong penalty update to decrease the probability of this action, while encouraging the action "stay in lane".
Especially when both learning parameters are large (see the shaded area in Table 6.4), the difference between the average number of steps for the two schemes is threefold or more. In order to have a fast update of the probability vector, the function H is set to the highest possible value (see Equation 6.34) satisfying the conditions for $\phi_i$ in Equation 6.29.
Chapter 7
Interacting Vehicles: Rules of the Game
In previous chapters, we introduced an intelligent control method for autonomous navigation and
path planning. The decision system mainly uses local information, and as a result, the actions
learned by the intelligent controller are not globally optimal; the vehicles can survive, but may not
be able to reach some of their goals. To overcome this problem, we visualize the interaction
between vehicles as sequences of games played between pairs of automata. Every game
corresponds to a state of the physical environment (the highway and the vehicles) as described
in Section 7.3. By evaluating these games, it is possible to design new decision rules, and analyze
the interactions by predicting the behavior and the position of each vehicle. Some of the
additional modules in Chapter 5 are actually the result of the approach described in detail here.
probability vector of actions defines the mixed strategy of a player/automaton; each element of
the probability vector corresponds to a specific action, or pure strategy.
Standard game definitions such as equilibrium point(s), dominant strategy, and pareto
optimality, are defined similarly. In our application, we are interested in equilibrium points where
no player has a positive reason to change its strategy, assuming that none of the other players is
going to change strategies. Such equilibrium points are also called Nash equilibria. All N-player
games with finite mixed strategy sets have at least one equilibrium. In a two-person zero-sum
game, there may be more than one equilibrium point with the same payoff. This is not
necessarily true for two-person non-zero sum games.
All descriptions above are defined on a game matrix. A game matrix is used to show the
payoff structure of a game. For example, for a two-player game where each player has two
actions (pure strategies), the game matrix D can be shown as:
$$D = \begin{pmatrix} (d_{11}^1, d_{11}^2) & (d_{12}^1, d_{12}^2) \\ (d_{21}^1, d_{21}^2) & (d_{22}^1, d_{22}^2) \end{pmatrix}$$

where $d_{ij}^k$ is the payoff¹ to player $k$ when the players play pure strategies $i$ and $j$. For a two-player zero-sum game, the equality $d_{ij}^1 = -d_{ij}^2$ holds. Several learning schemes were designed for zero-sum games; the solution to the problem (of obtaining the pareto optimal² or equilibrium point outcome) is given in [Lakshmivarahan82]. It is known that the linear reward-inaction $L_{R-I}$ and the $\epsilon$-optimal linear reward-penalty $L_{R-\epsilon P}$ (with $b \ll a$) schemes guarantee that the automata reach the best possible solution, i.e., the expected penalty reaches its minimum value. The results are also valid for identical payoff games [Lakshmivarahan81], and may be extended to N-player games [Narendra89].
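The convergence behavior described above can be illustrated with a small simulation of two automata playing a game under the $L_{R-I}$ scheme. The 2×2 matrix of penalty probabilities for player 1 (player 2's penalty taken as the complement, the usual zero-sum convention for automata games), the parameter values, and the helper name `lri_update` are illustrative assumptions; the matrix is chosen to have a pure saddle point.

```python
import random

def lri_update(p, i, a, rewarded):
    """Linear reward-inaction (L_R-I): update only on reward."""
    if rewarded:
        p = [(1 - a) * x for x in p]
        p[i] += a
    return p

random.seed(1)
# Penalty probabilities for player 1; player 2's are the complement.
# Illustrative matrix with a pure saddle point at (row 0, column 1).
D = [[0.2, 0.6],
     [0.8, 0.9]]
p1, p2 = [0.5, 0.5], [0.5, 0.5]
for n in range(20000):
    i = random.choices([0, 1], weights=p1)[0]
    j = random.choices([0, 1], weights=p2)[0]
    penalty1 = random.random() < D[i][j]
    p1 = lri_update(p1, i, 0.01, rewarded=not penalty1)
    p2 = lri_update(p2, j, 0.01, rewarded=penalty1)  # complementary response
print(round(p1[0], 2), round(p2[1], 2))  # both approach 1 at the saddle point
```

With a small learning parameter, the mixed strategies of both automata converge (with probability close to 1) to the pure strategies of the saddle point, as the cited results for zero-sum automata games predict.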
For non-zero sum games, however, unique rational solutions are difficult to define. While
considering non-zero sum automata games, it is not possible to evaluate the performance of the
automata using game theoretic arguments. The role of the learning schemes must be reevaluated
in such situations [Narendra89]. Unfortunately, relatively little work exists in this area. Here,
we will attempt to provide some insight on automata interaction of several vehicles by defining
games of automata and vehicles. The results obtained for non-zero sum games can be extended to
N-player non-zero sum games to show that the mixed strategies will converge to the equilibrium
point(s). In the case of multiple equilibrium points, the convergence depends on the initial
conditions of the probability vectors.
It is also important to note that, in an automata game, the players are not aware of the mixed strategies used by the other player(s), nor of their previous actions. In fact, the players do not even have knowledge of the distribution of the random payoff structure as a function of the strategy combinations. The interaction medium between the automata is the payoff function.
1 This is still the probability of receiving penalty, and therefore a player wants to minimize its payoff.
2 In an N-person game, an outcome is said to be pareto optimal if there is no other outcome in which all players simultaneously do better, i.e., receive less penalty from the environment.
Cem Ünsal Chapter 7. Interacting Vehicles: Rules of the Game 118
[Figure: interconnected automata in a vehicle. The lateral automaton A_LAT interacts with the lateral environments E¹_LAT and E²_LAT, and the longitudinal automaton A_LNG with the longitudinal environment E_LNG.]
We know that the lateral automaton A_LAT can operate in two stationary environments³ E^i_LAT (i = 1, 2). The difference between these two automaton environments is the response of the headway module (see Section 5.3.3). In some situations, the choice of the longitudinal action affects the response of the lateral environment. All other environment changes are due to changes in the physical environment, and we visualize those changes as the switching environments described in Section 7.3. Due to the interconnections, one would expect the longitudinal automaton A_LNG to

3 The word environment here describes the automata environment, not the physical one.
converge to its best action using absolutely expedient schemes. Lateral automaton ALAT in turn
would converge to the best action in the environment determined by ALNG.
Let us assume that an automated vehicle finds itself in the rightmost lane of a two-lane
highway, say by merging from an exit. Consider the situation in the first few seconds where the
automata environment stays stationary. With a relatively fast update rate, this assumption is
possible. The lateral action SR will receive a penalty until the vehicle shifts lane. Assuming the
vehicle is in its desired lane and speed range, the environment response for the longitudinal and
lateral actions depends on the output of the headway and left side sensor modules as given in
Table 7.1. Let us also assume that the probability that a sensed vehicle is moving away from the
autonomous vehicle is 1/3, and we know the probability of a vehicle being in a particular sensor
range4 as given in Table 7.2.
Range     A     B     C     D     E     none
ACC       1     1     1/3   0     0     0
DEC       0     0     0     0     0     0
SM        1     1/3   0     0     0     0
SL        0     0     0     1     1/3   0
SR        1     1     1     1     1     1
SiL⁵      1     1/3   1/3   0     0     0
(DEC)     0     0     1/3   0     0     0

Table 7.1. Assumed probabilities of penalty for each action based on the front and side sensors (see Figure 5.1 for range definitions).
Range        A     B     C     D     E     none
Probability  1/10  1/10  1/10  1/10  1/10  1/2

Table 7.2. Probability of sensing a vehicle in the sensor range.
Based on the numbers given in Tables 7.1 and 7.2, the probabilities of receiving a penalty for both automata can be calculated. For example, for the longitudinal action ACC, the payoff is

$$1\cdot\tfrac{1}{10} + 1\cdot\tfrac{1}{10} + \tfrac{1}{3}\cdot\tfrac{1}{10} + 0\cdot\tfrac{1}{10} + 0\cdot\tfrac{1}{10} + 0\cdot\tfrac{1}{2} = \tfrac{7}{30}.$$

Therefore, the game matrix for the longitudinal and lateral automata is:
4 Note that this is not required for the actual application; it is just an assumption to illustrate the effect of a changing environment for interconnected automata.
5 The second row indicates the probability of penalty when the longitudinal action is DEC.
         SL             SR           SiL
ACC   (7/30, 4/30)   (7/30, 1)    (7/30, 5/30)
DEC   (0, 4/30)      (0, 1)       (0, 1/30)
SM    (4/30, 4/30)   (4/30, 1)    (4/30, 5/30)
Entries in the first and third rows correspond to the first environment for the lateral automaton, while the entries of the second row correspond to the second environment. The difference between the two environments is due to the longitudinal action DEC: when the longitudinal automaton chooses DEC, the lateral environment is switched to the second environment, which changes the response to the lateral action SiL. If the vehicle is slowing down, lane shifts are discouraged (Figure 5.19). If the automata were not connected, absolutely expedient algorithms would be expected to converge to actions DEC and SL (without the connection, the lateral action SL is optimal since the penalty response from the front sensor is not suppressed). When the automata are connected as described, the optimal actions are DEC and SiL. Based on the probabilities given in Tables 7.1 and 7.2, this solution is pareto optimal and an equilibrium point for this game. In a situation where the probability of sensing a vehicle in region C is much larger than the others, the payoff of the lateral action SL is always less than that of SiL, and the action pair (DEC, SL) becomes the pareto optimal solution. The two-automata game will converge to this new optimal solution in this case.
Note that the situation above is very specific when we consider all sensors: the vehicles
must be cruising at their desired speed range and lane, the pinch sensor must not send a penalty
response, etc. For all other situations, the two automata may be considered unconnected. Using
the algorithms given in Chapter 6, the automata will converge to their best (optimal) actions
separately. The interaction between automata is via the physical environment, and for the
duration of a specific game we consider it to be stationary, resulting in a stationary automata
environment. Of course, the solution of such a disjoint game will be an equilibrium point (and a
pareto optimal solution) due to the convergence characteristics of the reinforcement schemes.
[Figure: two vehicles coupled through the physical environment; in each vehicle, the automata A_LNG and A_LAT interact with their environments E_LNG and E_LAT through the regulation layer.]
It is important to realize that lateral or longitudinal actions need not be fired for the automata environment to switch from one state to another. For example, if two vehicles cruising at their desired speed are in adjacent lanes, and if their idle actions SiL and SM are optimal, the physical environment may change due to the speed difference between the vehicles. The moment one of the vehicles clears the other's side sensor detection area, the lateral automata environment for both vehicles changes. Similarly, when idle actions are fired, the physical and automata environments may not change. The interactions between the actions and the physical environment, and between the physical and automata environments, are fairly complicated. In the next
section, we will introduce a representation scheme that will facilitate analysis of the changes in
the physical environment in relation to the automata environment.
[Figure 7.3, panels (a) and (b): two physical situations for interacting vehicles 1 and 2; panels (c) and (d): the corresponding 6 × 6 grids of combined environment responses over the actions ACC, DEC, SM, SiL, SL, SR of both vehicles.]
Figure 7.3. Situations for two interacting vehicles (a, b), and combined environment responses
(c, d) for situations (a) and (b) respectively (shaded triangles indicate a penalty; upper triangles
are associated with vehicle 2; optimal action pairs are indicated with black borders).
Consider a situation with three vehicles as shown in Figure 7.4a. Vehicles 1 and 2 are autonomous; vehicle 3 is not automated and can neither sense nor communicate. It is just an obstacle as far as the intelligent vehicles are concerned. Vehicle velocities are given as V1 = V3 > V2. Vehicle 2 has no lane preference, while vehicle 1 wants to shift to the middle lane. However, vehicle 1 cannot shift immediately to the middle lane since vehicle 2 is in the side sensor range (Figure 7.4a); the automata environment for this situation is given in Figure 7.4d. (The actions SM and SiL are combined as a single action IDLE. If a lateral action other than SiL is chosen, the row/column for the combined action IDLE refers to the lateral idle action, and vice versa. If both SiL and SM are chosen, the table shows the OR-ed response.)

Due to velocity differences, vehicle 2 drifts away from vehicle 1's sensor range (Figure 7.4b), and the automata environment switches (Figure 7.4e). In the meantime, the idle actions are fired repeatedly. With the new environment, the number of possible actions for vehicles 1 and 2 increases, and the lateral action SL becomes the optimal solution for vehicle 1. As a result, vehicle 1 changes lanes (Figure 7.4c), which in turn causes another automata environment change (Figure 7.4f).
Using the same reasoning, we can establish which automata environment corresponds to which physical situation-vehicle condition pairs. Since the automata in each vehicle use optimal or absolutely expedient algorithms, convergence to the optimal solution is guaranteed for all these
situations, provided that the automata have enough time to learn. Using this method, it is then
possible to predict how the vehicle will react to a specific physical situation. This will enable us
to define highway scenarios as described in the next section, and find solutions for intelligent
path planning.
[Figure 7.4, panels (a)-(c): three successive physical configurations of vehicles 1, 2 and 3, with vehicle 1 at its desired speed and desired lane = middle; panels (d)-(f): the corresponding automata environment grids over the actions ACC, DEC, Idle, SL, SR of vehicles 1 and 2.]
Figure 7.4. Changes in the physical (left) and automata environments (right): vehicle 1 shifts to
the middle lane.
for both hierarchical control and autonomous vehicle approaches. However, the second approach is a decentralized control method in nature, and finding a solution may be much more difficult than finding a globally optimal path strategy with a hierarchical architecture that has all the information about the highway situation.

This problem with the autonomous vehicle approach has not yet been answered by previous research efforts, while the hierarchical control structure inherently possesses the methodology to solve the problem. We visualize a possible situation with multiple interacting vehicles as a sequence of environment states. For every state of the physical environment, which includes the positions of the vehicles and the current parameters defining their behavior, a corresponding automata environment can be defined. The automata environment is analyzed to predict possible physical environment changes. These changes will be illustrated as state transitions. State diagrams formed using possible environment state transitions can then be used for analysis as well as design purposes.
Consider two vehicles sharing a 3-lane highway. The possible physical situations are given in Figure 7.5. Besides the relative lateral positions, we assume that only three possibilities exist for relative longitudinal positions. The distinguishing factor between these positions is whether a vehicle is in the side sensor range or not. Also note that for each state given in Figure 7.5, there is a reciprocal state with switched vehicle positions, denoted by an asterisk (e.g., B1*); therefore, the total number of states is 24. Figure 7.5 is basically a list of situations that are interesting for the analysis of vehicle interactions; there may be more situations. Appendix D.1 shows the relative positions of the two vehicles in a larger physical environment. Situations where the vehicles fall outside of the sensor range are combined into a single state, and further simplification gives the twelve situations given in this section.
Note that for any physical environment state in Figure 7.5, there may be multiple corresponding automata environments due to several factors such as desired speed and desired lane, as indicated in Figures 7.3 and 7.4.
To analyze the behavior of autonomous vehicles and the conflicts resulting from their
interactions, we define highway scenarios. A scenario is a specific situation with physical
locations of vehicles, their sensor outputs, and internal parameters such as desired lanes and
desired speeds. Once we know the automata environment at the beginning of a scenario, we can
predict the (state) changes in the physical environment. Then, all possible changes are combined
to form a state diagram showing the progress of the physical environment. The transitions from
one state are the direct results of the automata environment given by the matrices such as those in
Table 7.4.
To illustrate the idea, let us consider situation A1 in Figure 7.5. Assume the probabilities of possible vehicle actions are equal. For example, vehicle 1 may shift right, move ahead of vehicle 2, slow down, or do nothing while vehicle 2 keeps its speed and lane the same. These vehicle actions will cause transitions from A1 to A3, B1, C1 and A1 respectively. All possible transitions between defined states are shown in Figure 7.6. The links indicated with * show a transition to a reciprocal state, e.g., D1 → B1*, B2 → D2* or B3* → D1*.
[Figure 7.5: sketches of the twelve states A1-A3, B1-B3, C1-C3, D1-D3 showing the relative positions of vehicles 1 and 2.]
Figure 7.5. Possible physical environment states for 2 vehicles in 3-lane highway.
[Figure 7.6: state transition diagram over the states A1-A3, B1-B3, C1-C3, D1-D3; links marked with * lead to reciprocal states.]
Assuming equal probability for all vehicle actions in state A1, we can define a transition
matrix as follows. At state A1, vehicle 1 can shift right, move ahead, stay back or keep its speed
and lane. Similarly, vehicle 2 can shift left, move ahead, stay back or keep its speed and lane. We
assume that vehicles do not take actions simultaneously, i.e., only one vehicle can fire an action at
a given time. Thus, there are eight possible transitions with equal probability (Table 7.3).
Cem nsal Chapter 7. Interacting Vehicles: Rules of the Game 127
Actions                                     Transition to
Vehicle 2 idle, vehicle 1 moves ahead       B1
Vehicle 2 idle, vehicle 1 slips back        C1
Vehicle 2 idle, vehicle 1 idle              A1
Vehicle 2 idle, vehicle 1 shifts right      A3
Vehicle 1 idle, vehicle 2 moves ahead       C1
Vehicle 1 idle, vehicle 2 slips back        B1
Vehicle 1 idle, vehicle 2 idle              A1
Vehicle 1 idle, vehicle 2 shifts left       A2

Table 7.3. Action-transition pairs for state A1.
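The transition probabilities for state A1 follow from counting the eight equally likely action pairs in Table 7.3; a minimal sketch (the variable names are ours):

```python
from collections import Counter
from fractions import Fraction

# Resulting states for the eight equally likely action pairs of Table 7.3,
# assuming only one vehicle fires an action at a given time.
transitions = ['B1', 'C1', 'A1', 'A3',   # vehicle 2 idle, vehicle 1 acts
               'C1', 'B1', 'A1', 'A2']   # vehicle 1 idle, vehicle 2 acts

counts = Counter(transitions)
probs = {s: Fraction(c, len(transitions)) for s, c in counts.items()}
print(probs)  # B1, C1, A1 each with probability 1/4; A2, A3 each with 1/8
```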
Therefore, the probabilities of transition from state A1 to states B1, C1, A2 and A3 are
0.250, 0.250, 0.125, and 0.125 respectively. The environment stays at state A1 with probability
of 0.250. All other transitions for state A1 are not possible. Similarly, all probabilities of
transition for other states can be written, and the transition matrix T becomes:
[Transition matrix T (25 × 25) over the states A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3, the reciprocal states A1*, ..., D3*, and the collision state K. For example, its first row, for state A1, is (1/4, 1/8, 1/8, 1/4, 0, 0, 1/4, 0, ..., 0), and its last row, for state K, is all zeros except for a 1 on the diagonal.]
Cem nsal Chapter 7. Interacting Vehicles: Rules of the Game 128
Each row of the matrix T shows the probabilities of transition from a state, and sums to 1. Such a matrix is called a stochastic matrix. As seen above, it also includes the reciprocal states obtained by exchanging the locations of the vehicles in the states shown in Figure 7.5. The last row and column of the matrix correspond to a collision state, denoted by K.
It is also possible to consider simultaneous vehicle actions. For example, if vehicle 1 shifts to the middle lane while vehicle 2 moves ahead in state A1, a transition to state C3 will occur. The probability of simultaneous actions is relatively small; moreover, even if this is the case, the transition may be thought of as a sequence of two separate transitions from state A1 to C1 or A3, and then to state C3. Thus, we assume that only one vehicle may take an action at a time.
For the scenario above, we may assume that the probabilities of transition are constant for all steps. If the probabilities of finding the environment in a given state are known, the row vector consisting of these probabilities can then be multiplied by the transition matrix to evaluate the probability distribution over the states at the next time step. Note that the use of the transition matrix implies that the probability of the environment being in a state at time n depends only on the probability distribution at the previous step n-1. Therefore, the state transitions of the physical environment given above constitute a Markov chain.
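The propagation s(n+1) = s(n)·T can be sketched in a few lines. The 3-state matrix below is a made-up stand-in for the full 25-state matrix T, chosen only to show how probability mass accumulates in an absorbing state like K; the numbers are illustrative assumptions.

```python
def step(s, T):
    """One step of the Markov chain: s(n+1) = s(n) * T (row vector times matrix)."""
    n = len(T)
    return [sum(s[i] * T[i][j] for i in range(n)) for j in range(n)]

# Illustrative 3-state chain: two transient states and an absorbing state K.
T = [[0.50, 0.45, 0.05],
     [0.45, 0.50, 0.05],
     [0.00, 0.00, 1.00]]   # row K: the chain never leaves the collision state

s = [1.0, 0.0, 0.0]        # start in the first state with certainty
for n in range(200):
    s = step(s, T)
print(round(s[2], 3))      # → 1.0: mass in the absorbing state approaches 1
```

Because every transient state leaks a fixed fraction of its mass into K and K never returns any, the probability of being in K grows monotonically toward 1, which is exactly the behavior observed for the collision state below.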
Let us assume that the environment is at state A1 initially (n = 0). The row vector defining the probability distribution over the states is then:

s(0) = [1 0 0 ⋯ 0]  (1 × 25)

The probabilities of being at a specific state are then calculated by:

                 A1     A2     A3     B1    B2  B3   C1    C2  C3  ⋯  K
s(0)·T = s(1) = [0.250  0.125  0.125  0.250  0   0  0.250  0   0  ⋯  0]
For consecutive steps, the probability of being at state K increases. At n = 20, we have:

s(20) = [23 15 15 23 18 18 23 18 18 11 7 7 7 6 6 7 8 8 7 8 8 10 7 7 716] · 10⁻³

The state K for this Markov chain is an absorbing state⁶. We then conclude that two vehicles randomly choosing one of the possible actions on a three-lane highway will collide sooner or later, if, of course, our probabilistic model is correct.
6 A subset C of the state space of a Markov process is said to be closed if for every state $s_i \in C$ and $s_j \notin C$, the transition probability $t_{ij}$ is zero. If the subset C consists of a single state, that state is called an absorbing state. Only state K has that property in our example.
basic sensors and modules, there are no decisions/actions that would lead to a goal state (or
states). Starting at state A1, possible transitions are to states A1, A2, and A3. For transitions to
states A2 or A3, one of the vehicles must fill its memory vector with a lane shifting action much
faster than the other. If one of the pinch modules detects the other vehicle signaling for lane shift,
the state A1 will be permanent, and idle action SM will fire repeatedly.
If the vehicles were to change their speed, the transitions shown in Table 7.4 are possible
with the current internal parameters. Figure 7.8 shows the corresponding state diagram where
arrows indicate the direction of transition. Two possible chains with the solution states as
absorbing states B1* and C1* are distinguished with solid and dashed lines.
From    To
Chain 1
A1      A1, A2, A3
A2      A2
A3      A3
Chain 2
B1      B1, B2, B3
B2      B2, D1*, D2*
B3      B3, D1*, D3*
D1*     D1*, C2*, C3*
D2*     D2*, C2*
D3*     D3*, C3*
C2*     C1*, C2*
C3*     C1*, C3*
C1*     C1*
Chain 3
C1      C1, C2, C3
C2      C2, D1, D2
C3      C3, D1, D3
D1      D1, B2*, B3*
D2      D2, B2*
D3      D3, B3*
B2*     B1*, B2*
B3*     B1*, B3*
B1*     B1*

Table 7.4. Possible transitions for Scenario 1.
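The claims made below about chain 2 (that C1* is absorbing and reachable from every state in the chain) can be checked mechanically from Table 7.4 with a breadth-first search; the dictionary encoding and the helper name `reachable` are ours.

```python
from collections import deque

# Possible transitions for chain 2 (Table 7.4).
chain2 = {'B1': ['B1', 'B2', 'B3'], 'B2': ['B2', 'D1*', 'D2*'],
          'B3': ['B3', 'D1*', 'D3*'], 'D1*': ['D1*', 'C2*', 'C3*'],
          'D2*': ['D2*', 'C2*'], 'D3*': ['D3*', 'C3*'],
          'C2*': ['C1*', 'C2*'], 'C3*': ['C1*', 'C3*'], 'C1*': ['C1*']}

def reachable(graph, start):
    """All states reachable from `start` by following the listed transitions."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# C1* is absorbing: its only transition is to itself ...
assert reachable(chain2, 'C1*') == {'C1*'}
# ... and it is reachable from every state in chain 2.
print(all('C1*' in reachable(chain2, s) for s in chain2))  # → True
```

The same check applied to chain 3 confirms B1* as its absorbing goal state.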
[Figure 7.8: state diagram of the transitions listed in Table 7.4.]
Figure 7.8. Possible chains for Scenario 1 (reflexive transitions are not shown; two
chains are distinguished with dashed and solid lines).
From Table 7.4 and Figure 7.8, it is obvious that if the initial state is A1, the vehicles will not be able to reach their desired lanes under the current circumstances. The final situation is then A1, A2 or A3. The states C1* and B1* represent the goal states for this situation and are reachable from all other states in their respective chains. Furthermore, these two states are absorbing states for chains 2 and 3 respectively. Given the vehicles' desire to shift lanes, the physical environment is guaranteed to switch to state C1* or B1* starting at any state in chain 2 or 3. Therefore, all we need to do is to force the physical environment to switch from A1, A2 or A3 to any of the states in chain 2 or 3. Possible transitions are from Ai to Bi or Ci where i = 1, 2, 3. All these transitions require a speed change for at least one of the vehicles.
Consider the situation A1 with the automata environment given in Figure 7.9a. If the
penalty-reward structure can be changed to one of the matrices shown in Figures 7.9b-7.9e, or
any suitable combination of longitudinal actions, the physical environment will then switch to
state B1 or C1 depending on the chosen action(s). If at least one of the vehicles can be forced to
change its speed by changing its automata environment, the physical environment will switch
after some time, leading to a goal state. The same reasoning is true for states A2 and A3.
[Figure 7.9, panels (a)-(e): penalty-reward grids over the actions ACC, DEC, Idle, SL, SR of vehicles 1 and 2.]
Figure 7.9. Possible penalty-reward structures (b-e) to force physical environment to switch
to states B1 or C1 from current state A1.
Therefore, in order to introduce a change to the automata environment, the flag structure given in Section 5.3.1 is designed. If a vehicle cannot shift to its desired lane in a predefined time interval, a flag is set. This flag changes the vehicle's desired speed, usually to a value smaller than the current speed. The learning automata environment then changes as shown in Figure 7.9b or 7.9c; the transition in the physical environment state comes later, attaching the states Ai to chain 2 or 3 (Figure 7.8). Consecutive state transitions are automatic for this situation.
All possible environment states for three vehicles on a three-lane highway are given in Appendix D.2. Again, assuming that only one vehicle can take an action at a time, the state transitions leading to a solution would be similar to the one given in Figure 7.11. This transition diagram only shows a solution where vehicles 1 and 2 slow down. A few other solutions are also possible if different speed adjustments are considered.
[Figure 7.11: chain through the states C2, C3, D5, K2, J3, F2, C6, A5.]
Figure 7.11. A possible chain for Scenario 2: lane flag forces vehicles 1 and 2 to slow down.
All the transitions except the first one are automatic under the current circumstances. For the first transition, on the other hand, the lane flag needs to be set in at least one vehicle (if it is vehicle 3), breaking the symmetry. The problem and the solution for this case are similar to the two-vehicle situation given in the previous section. This is not a coincidence; it is due to the superposition of two two-vehicle situations.
The term superposition indicates that a three-vehicle situation given here can be treated
as three separate two-vehicle interactions. In terms of the two-vehicle states, the state transition
diagram above can be written as separate transition diagrams using the two-vehicle situations
previously described, as shown in Figure 7.12.
[Figure 7.12: the three-vehicle chain of Scenario 2 (vehicles 1, 2 and 3): C2, C3, D5, K2, J3, F2, C6, A5, aligned with the two-vehicle chains of Scenario 1a (vehicles 1 and 2): A2, A2, A1, A3, A3, A3, A1, A1; Scenario 1b (vehicles 1 and 3): A1, C1, C1, C3, D1, B2*, C2, D2; and Scenario 1c (vehicles 2 and 3): A3, C3, D3, D3, B3*, B1*.]
As seen from Figure 7.12, the three-vehicle situation is nothing more than three asynchronous Markov chains representing two-vehicle situations. In Figure 7.12, the corresponding two-vehicle states are aligned with the three-vehicle environment states. Transitions that need to be forced by the lane flag are shown in gray, and they are (and must be) between corresponding states in both the three-vehicle and two-vehicle transition diagrams. The two-vehicle scenario including vehicles 1 and 2 is automatic, i.e., there are no conflicts. The other two scenarios both have a synchronous forced transition.
Similarly, for the three-vehicle scenario of Section 5.3.1, it is possible to view the
situation as three two-vehicle transition diagrams. Figure 7.13 shows the corresponding two- and
three-vehicle chains. Again, the transition breaking the symmetry occurs between the
corresponding states of the three- and two-vehicle scenarios. When the three-vehicle situation is
forced from state K2 to state L2, the two-vehicle scenario of vehicles 1 and 2 is forced from state A3
(corresponding to state K2) to state B3* (corresponding to state L2). The two-vehicle scenario
including vehicles 1 and 3 is automatic, and the one between vehicles 2 and 3 is not significant, i.e.,
there are no conflicts, nor transitions. The situation between vehicles 1 and 2 is the defining scenario.
[Figure 7.13: the three-vehicle chain of Scenario 2 (vehicles 1, 2 and 3: K2, L2, J5, D1) aligned
with the two-vehicle chains of Scenario 1a (vehicles 1 and 2: A3, B3*, D1, C2), Scenario 1b
(vehicles 1 and 3: D3, D3, C3, C1), and Scenario 1c (vehicles 2 and 3: C3, C3, C3, C3).]
Analysis of the example situations given in this section indicates that it is possible to
define complex situations of multiple interacting vehicles as a group of many (conflicting and
non-conflicting) two-vehicle situations. A complex scenario is nothing more than a superposition
of multiple two-vehicle scenarios. When the state transitions of significant two-vehicle scenarios
corresponding to the transition diagram of the N-vehicle (N > 2) scenario are studied, it is seen
that the key transition that breaks the symmetry in the N-vehicle situation is synchronous with at
least one of the two-vehicle symmetry-breaking transitions.
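The pairwise decomposition described above can be sketched in a few lines. The projection maps below are read off the aligned diagrams of Figure 7.12 and should be taken as illustrative assumptions rather than the exact tables of Appendix D:

```python
# Sketch: projecting a three-vehicle state chain onto its two-vehicle
# sub-scenarios. The state labels follow Figures 7.11-7.12; the projection
# dictionaries are assumptions for illustration only.

three_vehicle_chain = ["C2", "C3", "D5", "K2", "J3", "F2"]

# Hypothetical projections of each three-vehicle state onto the two-vehicle
# states of Scenarios 1a (vehicles 1 and 2) and 1b (vehicles 1 and 3).
projections = {
    "vehicles 1 and 2": {"C2": "A2", "C3": "A2", "D5": "A1",
                         "K2": "A3", "J3": "A3", "F2": "A3"},
    "vehicles 1 and 3": {"C2": "C1", "C3": "C1", "D5": "C3",
                         "K2": "D1", "J3": "B2*", "F2": "B2*"},
}

def pairwise_transitions(chain, projection):
    """Project the N-vehicle chain and keep only actual state changes."""
    projected = [projection[state] for state in chain]
    return [(a, b) for a, b in zip(projected, projected[1:]) if a != b]

for pair, projection in projections.items():
    print(pair, "->", pairwise_transitions(three_vehicle_chain, projection))
```

Each two-vehicle chain recovered this way is exactly the asynchronous Markov chain discussed in the text; transitions that project onto a self-loop are invisible to that vehicle pair.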
to slow down to avoid collisions, the fact that they do not have a lane preference will prevent
them from shifting to the middle lane.
For our analysis of the situation, we only consider vehicles 1 and 2 while defining the
environment states since interactions of vehicles 3 and 4 with others are unaffected by any
possible actions vehicles 1 and 2 may take. Therefore, by analyzing the interactions between the
first two vehicles, we should be able to find a solution to this conflict. The states in which we are
interested are listed in Appendix D.3. Initially the environment is in state B2. Since vehicles 1
and 2 are traveling at the same speed and faster than vehicles 3 and 4, the transition to state D1 is
automatic (Figure 7.15).
[Figure 7.15: transition diagram with states B2, D1, C1, F2 and G2; the transition B2 → D1 is
automatic, and the transitions from D1 to C1 and F2 must be forced.]
At state D1, the two transitions that will solve the conflict, but need to be forced, are the
transitions from state D1 to states C1 and F2. From these states, the chain will move to goal
states C3, E4, or G3 (see Appendix D.3). Then, vehicles 1 and 2 will increase their speed and
pass the other vehicles. Since the desired lanes for vehicles 1 and 2 are not set, another method needs
to be found to force the environment state to switch to one of the chains leading to a goal state.
The solution to the problem is what we defined as the speed flag in Section 5.3.2. Under current
circumstances, vehicles 1 and 2 both fire their idle actions repetitively at state D1. With the
addition of the speed flag, both vehicles decide to shift to another lane after a predefined time
interval. Since shifting to the middle lane becomes the only possible lateral action, one of the
states C1 or F2 is reached.
If both vehicles start filling their memory vectors with conflicting actions, then the pinch
modules will start returning a penalty for these actions. At this point one of the vehicles will
decide against shifting to the middle lane, and the other will be the first to shift as described in
Section 5.2.2, therefore changing the environment state to either C1 or F2. Figure 7.16 includes the
mpeg movies of two separate runs for this highway scenario where vehicles 1 and 2 mutually
find a solution through their pinch modules. The time interval for the speed flag is chosen as 6
sec. In the first simulation (Figures 7.16a, c, e), vehicle 1 shifts to the middle lane immediately
after the speed flag is set, returning a penalty response to lateral action SiL. In the second
example (Figures 7.16b, d, f), although the flag is again set after 6 seconds, vehicles 1 and 2 need
some time to negotiate the shift via their pinch modules. Since the last deviation from the
desired speed of 85 km/h is approximately the same for both vehicles (Figure 7.16b), they attempt
to shift to the middle lane at approximately the same time, leading to a pinch condition.
Therefore, reaching the solution takes slightly more time in the second simulation.
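The speed-flag mechanism can be sketched as a simple timer. The class below is a hypothetical illustration; only the 6-second interval and the idea of setting the flag after a sustained deviation from the desired speed come from the text:

```python
# Sketch of the speed-flag mechanism of Section 5.3.2: if a vehicle has been
# held below its desired speed for longer than a fixed interval, the flag is
# set and a lane shift becomes the preferred action. The class, its parameter
# names, and the tolerance value are illustrative assumptions.

class SpeedFlag:
    def __init__(self, interval_sec=6.0, tolerance_kmh=1.0):
        self.interval = interval_sec    # time below desired speed before the flag sets
        self.tolerance = tolerance_kmh  # deadband around the desired speed
        self.time_below = 0.0
        self.is_set = False

    def update(self, current_speed, desired_speed, dt):
        """Advance the timer by dt seconds; set the flag when it expires."""
        if desired_speed - current_speed > self.tolerance:
            self.time_below += dt
        else:
            self.time_below = 0.0
            self.is_set = False
        if self.time_below >= self.interval:
            self.is_set = True
        return self.is_set

flag = SpeedFlag()
# Vehicle stuck at 78 km/h behind slower traffic while desiring 85 km/h:
for _ in range(70):                 # 7 seconds of simulation at 0.1 s steps
    flag.update(78.0, 85.0, dt=0.1)
print(flag.is_set)                  # flag set after the 6-second interval
```

Once the flag is set, shifting to the middle lane becomes the only lateral action considered, which is what drives the environment from state D1 toward C1 or F2 in the scenario above.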
[Figure 7.16 panels: (a, b) vehicle speeds (72-88 km/h) versus time; (c, d) vehicle lanes versus
time; (e, f) mpeg movies ch7m1.mpg (0.065 Mb) and ch7m2.mpg (0.177 Mb), every frame
representing one fifth of a second.]
Figure 7.16. Speeds (a, b) and positions (c, d) of vehicles 1 and 2 for scenario 3:
(left) vehicle 1 shifts first; (right) vehicle 2 shifts first. Mpeg movies of the
simulations are accessible via icons (e, f).
Figure 7.15 shows only two of the possible transitions from state D1; there are two other
possibilities: switching to state B3 or to state C4. These two possibilities become a solution to
the problem at hand when both vehicles cannot shift lanes because of their pinch modules'
respective penalty outputs. The probability of occurrence for this situation is very small;
however, it is not zero. If this is the case, the vehicles will not be able to change to the middle
lane, now defined as the desired lane by the speed flag intervention. Then, the lane flag will be set
(see Section 5.3.1) after a time period, forcing the vehicles to slow down. Since the desired
speeds set by the lane flag are different, the symmetry will be broken, and one of the goal states
will be reached via state B3 or C4.
7.5 Discussion
This chapter introduced the treatment of path decisions for autonomous vehicles as interacting
automata, and then at a higher level, as interacting vehicles. In each intelligent vehicle, lateral and
longitudinal automata create an interacting system which can be visualized as an automata game.
In our application, this game is usually disjoint, i.e., the mixed strategies of the automata do not
depend on the others' strategies. Therefore, the optimal action pair is reached. For the case where the
longitudinal automaton's actions affect the lateral automaton's environment, the interacting
system again reaches the equilibrium point, which happens to be the optimal solution for a given
environment.
The visualization of multiple automata interactions as games is based on the assumption
that the physical (and therefore, the automata) environment is stationary. Assuming that the
physical environment does not change for a relatively long period of time, and provided that the
learning rates for all automata are fast enough, it is possible to assign a stationary automata
environment to any physical environment consisting of locations and internal parameters of
vehicles in which we are interested for a given situation. Changes of the physical environment can
then be treated as a Markov chain whose transitions are direct results of the automata
environment evaluated for each state of the physical environment. Situations involving multiple
vehicles can be analyzed by investigating the transition diagrams resulting from a specific
situation, as described in Section 7.4.
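As a sketch of this viewpoint, the transition diagram of a physical environment can be encoded as a small graph whose edges are either automatic or flag-forced. The state names echo the scenario of Section 7.4.3, but the particular diagram and function below are illustrative assumptions:

```python
# Sketch: a physical-environment transition diagram treated as a small
# Markov-chain-like structure. "auto" edges follow from the automata
# environment alone; "forced" edges require a flag to be set. The diagram
# below is an illustrative assumption loosely based on Figure 7.15.

transitions = {
    "B2": [("D1", "auto")],
    "D1": [("C1", "forced"), ("F2", "forced")],
    "C1": [("C3", "auto")],
    "F2": [("G3", "auto")],
}
goal_states = {"C3", "G3"}

def walk(start, flag_set, choose_forced=0):
    """Follow the diagram from `start`; forced edges are usable only when
    the flag is set, and `choose_forced` picks among multiple options."""
    state = start
    path = [state]
    while state not in goal_states:
        options = transitions.get(state, [])
        usable = [s for s, kind in options if kind == "auto" or flag_set]
        if not usable:
            break                      # stuck: waiting for a flag to be set
        state = usable[min(choose_forced, len(usable) - 1)]
        path.append(state)
    return path

print(walk("B2", flag_set=False))   # stalls at D1 without the speed flag
print(walk("B2", flag_set=True))    # reaches a goal state once the flag is set
```

The same bookkeeping works for any decision mechanism, provided the automatic and forced transitions can be read off the decision procedure for each physical state.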
Despite the fact that we used automata environments to evaluate a physical
environment's state transitions, the method can also be used for other decision mechanisms. As
long as we can define a formal way of describing the decision/control procedure resulting from a
specific environment condition, the transition diagrams for physical environment states can be
created.
The treatment of the highway scenarios is based on 3x3 or 3x4 location matrices. Larger
matrices can also be used; however, those may be redundant for our analysis of the
situations given in Section 7.4. For a larger number of vehicles or different sensor definitions,
larger location matrices may be needed, but again, it is possible to represent multi-vehicle
scenarios as simultaneous two-vehicle interactions as described in Section 7.4.2.
Chapter 8
Conclusions
This chapter opens with a brief discussion of the automated highway system as a possible
solution to current traffic problems and a short evaluation of the reinforcement schemes as an
automata learning process. The results and contributions of this dissertation are listed in the
second section. The chapter concludes with the recommendations for future research.
8.1 Discussion
The automated highway system envisioned by researchers and institutions around the world
suggests a complex structure that requires extensive research efforts. Among many issues, the
realization of AHS depends mainly upon the realization of intelligent vehicle control methods.
No matter what design approach is taken for vehicle control, there are two important issues that
will haunt the researchers. The first problem facing AHS is a social issue: user
acceptance. Without user acceptance, the aim of automating the highway system is bound to
fail. We will briefly discuss the possible future of the AHS deployment in the light of the current
influence of the technology on society. The second problem is the technical feasibility of the
designed systems. While designing the intelligent path planner described in this work, the
transition from abstraction to implementation was one of our concerns. We will therefore briefly
discuss the technical issues that hinder the realization of an automated highway system.
We designed our intelligent vehicle controller using an AI technique called the learning
automata. The adaptation of the automata to existing conditions is provided by the reinforcement
schemes mentioned in previous chapters. We will thus conclude our discussion with a few ideas
on reinforcement schemes.
1. The author of this dissertation has traveled more than 40,000 miles on US interstate highways while working on
the design of intelligent vehicle and highway systems, and on evaluation tools for intelligent transportation systems.
Cem Ünsal Chapter 8. Conclusions 140
communication aspects of AHS (and ITS) is the information content of such a system. Most of
the projects for full automation of the highway system envision total knowledge about the
highway and the vehicles.
Present AHS and ITS applications such as traffic monitoring, electronic toll collection and
license plate recognition are already employing identification/classification techniques that can be
used to generate information about the drivers/owners of the vehicles. Human factors research on
AHS systems shows that drivers would not want the infrastructure to know everything about
their actions, locations, and daily routines. This problem is called the "big brother" syndrome.
The objections to usage of this type of data include fear that the security of the data will be
compromised, and overall discomfort with the modern trend toward routine collection of personal
data. Initial system analyses indicated that check-in procedures envisioned for AHS
implementation may include processing of financial, medical, or driving records to verify
potential users' qualifications [Delco94].
The information content of the data collected for AHS purposes is very important for the
user acceptance of the automated highway system. While the collection of driver data may be
perceived as an invasion of privacy, some of the problems may be solved by choosing the least
invasive AHS implementation with minimal data storage. Such issues are important because they
may influence the design of the communication and processing functions of the (longitudinal)
control system for a hierarchical infrastructure [Delco94]. For the autonomous vehicle approach
that will most probably include some form of visual control system, similar problems exist.
With the current technological progress and its applications to daily life, the invasion of
privacy is becoming an important issue. A recent example is the implementation of a global
computer network, a.k.a. Internet or World Wide Web. More and more individuals are starting to
use the network in their daily routine, and in return, they are providing the WWW server, the
Internet service provider or the routing system with some form of information. This could be
login data, an e-mail message, or a server log of the visited pages with the address of the client
machine. In most cases, it is possible for a computer user to monitor the activities of another
from his console. A similar and more common example for today is the use of credit cards. Every
time a charge card is used, the corresponding computer system records the time, date, location
and type of the transaction which can later be sold to third parties. Yet, many are comfortable
with the accessibility that a charge card or an online computer system provides (to all interested
parties).
The use of AHS or ITS systems such as automatic toll collection or automatic vehicle
identification (or ATIS, ATMS, and AVL systems in the near future) is no different from the
technologies mentioned above. A trade-off between comfort and privacy exists, and will exist in
the future. The evolution of the AHS will stop at a level where the majority of the users are no
longer willing to trade their privacy for the additional comfort the system may bring. As AHS
becomes more visible, its complexity may lead to misunderstandings and misrepresentations that
(Footnote 1, continued) The reflections in this section are the results of his AHS experience as a researcher and his
driving experience as a user of the present highway system.
may jeopardize public acceptance [Batelle94]. Considering the recent mistakes made with the
latest technological revolution, the Internet, it is obvious that a balanced and accurate picture of
AHS must be presented to the public.
Recent investigations addressing the user acceptance of AHS, and legal and technical
privacy issues in AHS technologies can be found in [Stevenson95], [Wright95], [Garfinkel96],
[McGurrin96], and [Wigan96].
headway control, and proximity sensors for lane departure warning are currently used in research
vehicles [Özgüner96, Pomerlau96, Tan96, Weber96, Yanagisawa92, Construction96].
The research on sensing technologies is gaining momentum due to the ongoing research on
AHS, and due to the fact that most of these technologies find quick commercial applications. The
research on sensing technologies for AHS was initially driven by the robotics field; during the last
few years more research has been carried out specifically for intelligent vehicle applications. The
sensor capabilities for autonomous navigation will increase with the investigation of new
alternatives and the progress of the previous technologies.
Communication with the roadside and other vehicles is another important factor for AHS
implementation. Again, a reliable communication system is vital for a hierarchical control
approach. A hierarchical control structure requires three types of communications:
communications for control, maneuvers, and navigation information. For the first type, high
bandwidth is required. Furthermore, the loss of data must be minimized to maintain stability of
platoons, and to assure collision-free lane changes [Foreman95]. Communicated data must be
trustworthy. A real-life analogy to the problem of the reliability of received information is the
driver who waits at an intersection until another car approaching from the left actually makes its
right turn, even though its right-turn signal is flashing. This type of behavior is clever
on the part of a human driver and useful in avoiding collisions, but it is against everything
AHS tries to achieve. An automated system repeatedly checking every bit of information against
errors cannot be successful in enhancing the traffic flow. On a greater scale, a hierarchical
automated highway system must optimize traffic flow by coordinating vehicle actions in an
intelligent way. The need to increase traffic flow theoretically increases the risk of accidents. The
reason for this is a decreasing possibility of recovering from a wrong decision due to a smaller
reaction window.
Commercial communication systems for AHS and ITS include new radar detectors with
road hazard and emergency vehicle signals, traffic information systems, fleet-management
systems, in-vehicle mobile offices and on-line services, MAYDAY devices and emergency
services, and electronic toll collection [ITSW96, TrafT96]. On the other hand, vehicles currently
used for research purposes are equipped with 2.5GHz band radio transceivers, vehicle-to-vehicle
radio modem communications [Construction96, PATH96]. Ultra wide band communication is
also considered as an alternative [James96]. The network architecture for vehicle-to-vehicle
communications is currently being defined [Bolla96, Fuji96], and related problems such as
interference and scattering are being investigated [Gavan96, Zoratti95].
Progress in the electronics field, especially in wireless communications, will positively
affect the deployment of AHS2. AHS is actually a combination of many engineering fields. For
example, the design of ITS/AHS technologies is carried out by government organizations
in conjunction with the Institute of Electrical and Electronics Engineers (IEEE), the Society of
2. AHS will use emerging technologies as soon as they become available during its evolution to fully automated
highways. For example, drivers with cellular phones are already acting as probes in traffic via emergency phone
lines. Such an information source was not available ten years ago.
Automotive Engineers (SAE), the Institute of Transportation Engineers (ITE), and the American
Society for Testing and Materials (ASTM) [ITSst96]. The National AHS Consortium consists of
many aerospace and automotive companies as well as civil, mechanical, electrical engineering, and
computer science departments of several educational institutions. AHS is a multi-disciplinary
research field. In addition to the technologies mentioned above, its realization depends on many
engineering applications such as systems and control, fiber optics, automotive design, data
mining, data scheduling, and human factors. It is slowly creating its own niche in many related
fields of engineering, boosting research efforts and resources.
Automotive companies are hoping to make AHS technologies such as cruise control and
lane departure warnings commercially available in the next five years. Lane tracking and vision
enhancement applications are expected to be available in ten years [Cogan96]. It is interesting to
note that the current AHS research has already found answers to the problems related to these
applications. The transition from research to commercial availability is only a matter of cost
reduction. Considering the rate of technological progress in our century, the future of the AHS
technologies looks bright.
Returning to our assumptions on the vehicle capabilities, it is fair to say that the
assumptions made on sensing and communications for autonomous vehicles are feasible, but not
widely available.
structured. It will not be easily predicted, nor analyzed. In the case of mixed traffic (human
drivers and automated vehicles coexisting on the same highway), the problem grows enormously.
As a result, an automated vehicle must be able to deal effectively with a large number of possible
traffic situations.
The last statement above emphasizes the importance of learning and adaptation. What is
needed is a system that can handle situations unforeseen by the designers. For this type of
capability, the choice of artificial intelligence technologies is obvious. While expert systems and
knowledge-based planning [Forbes95] are effective, learning methods are capable of discovering
new situations and optimal responses using simulated environments [Sukthankar96].
One of the most important characteristics of reinforcement learning is the choice
between actions that are known to return positive responses and actions with unknown
performance. For example, a learning automaton starting with a priori action probability values
needs time to evaluate the actions: it learns from its mistakes.
This process can be examined from two different points of view [Ashby60]. It can be
thought of simply as an attempt to find the best solution. In case of failure, the attempt is not
rewarded (or is penalized). This view of the learning process is not very attractive. On the other
hand, the trial and error method is invaluable in gathering information necessary to achieve
optimal action. Choosing non-optimal actions to gain information improves the long-term
performance while resulting in temporary penalty responses from the environment. Learning
automata are among the successful examples of such methods.
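As a concrete (and purely illustrative) example of this trade-off, the classical linear reward-penalty (L_R-P) scheme updates the action probability vector after each environment response; the environment below is a hypothetical stand-in, not one of the schemes designed in this dissertation:

```python
# A minimal linear reward-penalty (L_R-P) learning automaton, included to
# make the exploration/exploitation trade-off concrete. The penalty
# probabilities and step sizes are illustrative assumptions.

import random

def lrp_update(p, chosen, beta, a=0.1, b=0.05):
    """Update action probabilities p after action `chosen` drew response
    `beta` (0 = reward, 1 = penalty); a/b are reward/penalty step sizes."""
    r = len(p)
    q = p[:]
    for j in range(r):
        if beta == 0:                 # reward: move probability toward chosen
            q[j] = p[j] + a * (1 - p[j]) if j == chosen else (1 - a) * p[j]
        else:                         # penalty: move probability away from it
            q[j] = (1 - b) * p[j] if j == chosen else b / (r - 1) + (1 - b) * p[j]
    return q

random.seed(1)
penalty_prob = [0.7, 0.2, 0.9]        # hypothetical teacher: action 1 is best
p = [1 / 3] * 3                       # a priori probabilities: pure exploration
for _ in range(2000):
    action = random.choices(range(3), weights=p)[0]
    beta = 1 if random.random() < penalty_prob[action] else 0
    p = lrp_update(p, action, beta)
print(max(range(3), key=p.__getitem__))   # index of the currently favored action
```

Early on the automaton samples all actions, accepting penalties to gather information; as the probability vector sharpens, exploration gives way to exploitation of the action with the lowest penalty probability. Both update branches preserve the property that the probabilities sum to one.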
Another important aspect of reinforcement learning is that the actions performed by
an automaton influence the environment response that the automaton will receive in the future.
Rather than receiving an independently chosen set of action-response pairs, the agent has some
form of control over what response it will receive and complete control over what actions will be
generated in response. In addition to making it difficult to make distributional statements about
the inputs to the agent, this degree of control makes it possible for what seem like small
experiments to cause the agent (automaton) to discover an entirely new part of its environment
[Kaelbling94].
The reinforcement scheme is the heart of the learning process, and its characteristics define
the behavior of the automata. However, besides the reinforcement scheme, the definition of
the reward-penalty structure (i.e., the teacher) is vital in the success of the learning automata
application. The automata cannot be expected to learn the best actions with a defective
teacher structure. The significance of the reward-penalty structure is emphasized in
environments with multiple teachers. We have spent a significant amount of time designing a
function that intelligently combines multiple environment responses.
The scarcity of applications using continuous environment responses is understandable. The
difficulty of combining such responses, as well as of finding a meaningful representation of
several factors with the same measure, hinders the use of S-model environments. There are no
known applications that use this model.
The use of learning algorithms in our application is based on the assumption that the physical
environment around an autonomous vehicle can be associated with a stationary automata
environment (or multiple switching automata environments). We assume that the automata
adaptation to the environment is faster than the changes in the physical environment.
Although current computational technology is sufficient to support such an assumption, we
defined two new reinforcement schemes for our purposes. These new algorithms have
convergence rates up to four times faster than those of previously designed counterparts,
especially in situations frequently occurring in AHS applications.
One of these new reinforcement schemes has been previously known. However, its
convergence characteristics had not been proven due to the difficulty of analytically modeling
the learning process. By using nonlinear methods and a special condition resulting from our
design, we were able to prove the optimality of the algorithm.
The second algorithm is also the result of our attempts to improve the convergence
characteristics of the learning automaton. It is an extension of a previously designed nonlinear
reinforcement scheme, but performs better. This algorithm is proven to be absolutely
expedient.
Our work on the design and analysis of an intelligent vehicle control using learning
automata enables us to conclude the following:
The intelligent vehicle path controller consisting of two stochastic learning automata is
capable of adapting to its dynamic environment by choosing the best (lateral and longitudinal)
actions to be sent to the lower level of control.
This non-model-based approach would be especially useful in situations wherein
complete knowledge about the overall traffic flow is not provided by higher levels of the
control hierarchy (if such levels exist at all). Instead of trying to foresee all possible traffic
situations, we take the approach of defining a mechanism that can make intelligent decisions
based on local sensor information, keeping in mind the fact that the initial phases of the AHS
will include non-automated vehicles as well as intelligent vehicles capable of communicating
with others.
The method is capable of capturing the overall dynamics of the system that include the
vehicle, the driver and the roadway. Definitions of the learning and sensor parameters
determine the behavior of each vehicle.
Initial results indicated that the capabilities of the vehicle sensors must be extended. As presented
in Chapter 5, these extensions are not too demanding, although some may be relatively expensive
to realize. A simple rate-feedback method for the headway distance measurements can
overcome the oscillatory behavior in longitudinal control.
Simulations of intelligent vehicles also indicated the need for additional information sources
besides the local sensors. No matter what control structure is used for AHS deployment,
some form of communication must exist between vehicles. Although visual cues can be
used to coordinate lateral actions, the lane changing capabilities of the intelligent vehicles as
well as the safety of the actions increase with local vehicle-to-vehicle communications. We
have found that, in order to avoid pinch situations, vehicles may coordinate their lane
changing actions by simply sending an intention signal to neighboring vehicles.
Our attempt to design an intelligent path controller inadvertently extended, to some degree, to
other levels of vehicle control. For example, we have found that if a higher level of the
control/decision mechanism provides desired lane information, many local conflicts (whose
solutions are not globally optimal) may be resolved to optimize overall traffic flow. Furthermore,
the capabilities of the lower layers of control are important when designing the
intelligent path controller. Although we did not consider feedback from the control layer that
actually carries out the vehicle actions, the design parameters of the lower layer affect the
design parameters of the path planning layer. All levels of control are interconnected, and no
level can be treated as a free-running entity.
There is a trade-off between what the automated vehicle can accomplish and how simple the
sensing/information system is. The more global the information content of the decision
mechanism is, the more the vehicle can accomplish autonomously. It is our belief that neither
a fully decentralized control method, nor a completely hierarchical system can solve the
problem completely. While a decentralized system will require some form of local
communications, a hierarchical system will suffer from being unable to foresee all conflicts
between vehicle paths.
Our approach to the vehicle path control problem can be viewed as an autonomous vehicle
approach. Automated vehicles are making their own decisions on which action to take by
using information from their own sensors and limited communication systems. Although each
vehicle is capable of avoiding collisions, the combined actions of multiple vehicles are not
always optimal. Previous research efforts in autonomous vehicles discussed in Section 1.2.1
have not considered multiple vehicle interactions. In this work, we also attempted to find a
structured methodology for the interactions of multiple autonomous vehicles.
The method of evaluating possible environment state transitions based on associated
automata environments enabled us to define additional decision mechanisms we called flags.
Speed and lane flags are used to solve the conflict situations arising from the multiple teacher
responses and vehicle interactions.
Although our method of evaluating the physical environment's state changes is based on the
learning automata environment, similar methods can also be used with other decision
mechanisms. By formal descriptions of the decision/control procedure, transition diagrams
similar to those given in Chapter 7 can be created to analyze the highway situations.
8.3 Recommendations
AHS research, especially high-level vehicle path control, is a relatively new research area, and
many questions remain to be answered. There are multiple possibilities for
extending the work discussed in this dissertation. Here, we will emphasize some of these
possibilities that seem relatively important for realization of intelligent vehicle control. Our
recommendations for future research efforts on the subject are the following:
Simulation:
Simulation is an indispensable way of testing the effectiveness of vehicle control methods
since the system becomes highly complex due to the presence of a large number of vehicles.
Furthermore, the ideas investigated in this work cannot be tested with real vehicles, partly
due to the reasons listed in Section 2.5, but mainly because of the futuristic character of the
problem. The overall system is neither continuous nor discrete-time; the combination of
vehicle dynamics and the computational modules forms a hybrid system whose design and
analysis is inherently difficult. For the work described here, we simplified the vehicle
dynamics drastically since the main purpose of this research was the study of learning
automata techniques for vehicle control. More realistic vehicle dynamics must be
incorporated with the computational system, and the study of the interaction between the
discrete and continuous parts of the overall system needs to be carried out for a precise
control system simulation.
Sensor modeling:
For the purpose of studying the learning automata as an intelligent vehicle controller, we
assumed almost perfect sensor models. Our sensor module subroutines in the simulation
incorporate a small percentage of measurement error. Beyond the addition of the measurement
noise, we assumed that the sensors defined in this work can be implemented using the current
technology. More realistic models of the sensors based on existing radar, laser radar, infrared
and sonar techniques may be created to further the simulation's ability to imitate real-life
situations. The issues of sensor degradation and back-up methods also have to be addressed
for fault-tolerant autonomous navigation applications.
High level vehicle path control:
Our study of the vehicle path planner indicated that this level of vehicle control cannot be
designed or analyzed without considering the lower and higher levels of the control methodology.
Especially for the hierarchical control structure, feedback from the regulation layer and the
Cem Ünsal Chapter 8. Conclusions 148
information sent from the link layer of the hierarchy must be taken into account. This study
of the planning layer did not consider a possible "unable to comply" signal from the low-level
control modules. Furthermore, we assumed that global information such as the desired lane
and speed is available. Previous and current research on AHS has not yet answered such
issues. Lane and speed assignments by a higher layer in the control hierarchy may prove to be
crucial for obtaining optimal traffic flow.
Vehicle interactions:
The problem of coordination between autonomous vehicles is inherently addressed by a
centralized control approach. However, decentralized AHS methods such as autonomous
vehicle approach [Bayouth96] or cooperative approach [McKendree96] must still find an
answer to this problem. In the case of mixed traffic, which may be a possibility in the future of
AHS, the problem is far more serious. Analysis and design methods described in this work
address the vehicle interaction by simplifying multiple vehicle scenarios (situations) into
multiple two-vehicle interactions by using additional simplifications on the relative position
of the vehicles (the matrices in Section 7.4 and Appendix D). This is possible because of our
definition of sensors and the reward-penalty structure used for the learning automata. The
idea of analyzing multiple vehicle interactions can be applied to other vehicle control
methodologies, provided that the states of the physical environment and the decision
environment resulting from them are carefully defined. More detailed and complex
definitions for highway scenarios may be necessary.
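The pairwise reduction can be sketched as follows: each neighbor is evaluated independently as a two-vehicle interaction, and the environment response is the logical OR of the pairwise results (a Python sketch; the penalty condition is a hypothetical stand-in for the sensor/reward structure of Section 7.4, not the actual matrices):

```python
def pair_penalty(ego: dict, other: dict, close_gap: float = 10.0) -> bool:
    """Two-vehicle interaction: penalize if the other vehicle is in the
    same lane within a (hypothetical) unsafe gap ahead of the ego vehicle."""
    gap = other["x"] - ego["x"]
    return other["lane"] == ego["lane"] and 0.0 < gap < close_gap

def environment_response(ego: dict, neighbors: list) -> int:
    """beta = 1 (penalty) if ANY pairwise interaction is unsafe, else 0:
    the multi-vehicle situation collapses into two-vehicle checks."""
    return int(any(pair_penalty(ego, o) for o in neighbors))
```

Because the response depends only on whether some pairwise check fails, a scenario with many vehicles never has to be enumerated as a whole, which is the simplification exploited in Chapter 7.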
Learning automata:
The idea behind the adaptive vehicle controller presented here is very simple: two automata
synchronously adapt to their (teacher) environment. During our study, we have observed
that the automata interactions and multiple teacher learning processes in multi-teacher
environments are relatively untouched areas of research. The intelligent controller may be
extended to multiple automata associated with a single teacher (i.e., sensor module). In this
case, the analysis of the interactions between automata will be more difficult. However,
translating the problem of combining the environment response to the problem of combining
the multiple actions may have its advantages. The application of the S-model environment
then may prove to be relatively easier, bringing the advantage of continuously mapping the
sensed physical environment to the automata environment. Yet, the idea of combining automata
actions has not been investigated previously.
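For reference, the building block these extensions start from, a single variable-structure automaton with the general linear reward-penalty update of Section 6.1, can be sketched as follows (a Python sketch; the learning parameters a and b are illustrative):

```python
def lrp_update(p: list, i: int, beta: int,
               a: float = 0.1, b: float = 0.1) -> list:
    """General linear reward-penalty (L_R-P) update for chosen action i.
    beta = 0 is a reward, beta = 1 a penalty; the probability vector
    stays normalized in either case."""
    r = len(p)
    if beta == 0:
        # reward: move probability toward the chosen action
        return [p[j] + a * (1.0 - p[j]) if j == i else (1.0 - a) * p[j]
                for j in range(r)]
    # penalty: move probability away from the chosen action
    return [(1.0 - b) * p[j] if j == i else b / (r - 1) + (1.0 - b) * p[j]
            for j in range(r)]
```

Extending this to multiple automata per teacher amounts to deciding how several such updates, or several environment responses, are combined, which is exactly the open question raised above.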
Bibliography
[Agogino96] Agogino, A., K. Goebel, and S. Alag, Intelligent Sensor Validation and Fusion,
Intellimotion, vol. 5, no. 2, pp. 6-7, 1996.
[Aoki95] Aoki, T., T. Suzuki, and S. Okuma, Acquisition of Optimal Action Selection in
Autonomous Mobile Robot Using Learning Automata (Experimental Evaluation), Proceedings
of the IEEE Conference on Fuzzy Logic and Neural Networks/Evolutionary Computation, pp. 56-
63, Nagoya, Japan, Nov. 1995.
[Ashby60] Ashby, W. R., Design for a Brain: The Origin of Adaptive Behaviour, New York,
NY: John Wiley and Sons, 1960.
[Asher96] Asher, H. J., and B. A. Galler, Collision Warning Using Neighboring Vehicle
Information, Proceedings of the 6th Annual Meeting of ITS America, vol. 2, pp. 674-684, 1996.
[Atkinson65] Atkinson, R.C., G.H. Bower and E.J. Crothers, An Introduction to Mathematical
Learning Theory, New York: Wiley, 1965.
[Baba83] Baba, N., The Absolutely Expedient Nonlinear Reinforcement Schemes under the
Unknown Multi-Teacher Environment, IEEE Transactions on Systems, Man and Cybernetics,
vol. SMC-13, pp. 100-108, 1983.
[Baba85] Baba, N., New Topics in Learning Automata Theory and Applications, Number 71 in
Lecture Notes in Control and Information Sciences, Berlin: Springer-Verlag, 1985.
[Bloomfield96b] Bloomfield, J. R., Human Factors Aspects of Transferring Control from the
Driver to the Automated Highway System with Varying Degrees of Automation, Technical
Report FHWA-RD-95-108, Federal Highway Administration, Turner-Fairbank Highway
Research Center, McLean VA, 1996.
[Bolla96] Bolla, R., F. Davoli, and C. Nobile, A Mobile Network Architecture for Vehicle-to-
Vehicle Communications, Proceedings of the 3rd World Congress in ITS, Orlando, FL, October
1996.
[Bush58] Bush, R. R., and F. Mosteller, Stochastic Models for Learning, New York: Wiley,
1958.
[Card93] Statement of Andrew H. Card, President and Chief Executive Officer of American
Automobile Manufacturers Association, November 10th, 1993 Hearing before the Subcommittee
on Investigations and Oversight of the Committee on Science, Space and Technology, U.S. House
of Representatives, 103rd Congress, First Session, pp. 108-109. U.S. Gov. Printing Office,
Washington 1994. (Doc. Y4.sci2 no.103/75)
[Chand68] Chandrasekharan, B., and D.W.C. Shen, On Expediency and Convergence in Variable
Structure Stochastic Automata, IEEE Transactions on System Science and Cybernetics, vol.
SSC-5, pp. 145-149, 1968.
[Chand69] Chandrasekharan, B., and D.W.C. Shen, Stochastic Automata Games, IEEE
Transactions on System Science and Cybernetics, vol. SSC-5, pp. 145-149, 1969.
[Chee95] Chee, W., Lane Change Maneuvers for Automated Highway Systems, Research
Report, University of California, Berkeley, 1995.
[Chen96] Chen, C., B. Foreman, A Discussion of the WaveLAN Radio as Relevant to AVCS,
Technical Note 96-1, Institute of Transportation Studies, University of California, Berkeley,
1996.
[Cogan96] Cogan, R., Wheels in Motion, ITS World, Technology and Applications for
Intelligent Transportation Systems, pp. 18-22, July/August 1996, ISSN 1086-2145.
[Cremer94] Cremer, J., J. Kearney, Y. Papelis, and R. Romano, The Software Architecture for
Scenario Control in the Iowa Driving Simulator, Proceedings of the 4th Computer Generated
Forces and Behavioral Representation, May 1994.
[CSIM96] Mesquite CSIM 18- A Development Toolkit for Simulation and Modeling,
Mesquite Software, Inc., http://www.mesquite.com/csim18.htm (Jan 2, 1997).
[Deshpande96] Deshpande, A., Design and Evaluation Tools for Automated Highway
Systems, in Hybrid Systems III : Verification and Control. New Brunswick: Springer-Verlag, pp.
138-148, 1996.
[Field92] Intelligent Vehicle Highway systems: Review of Field Trials, Organization for Economic
Cooperation and Development, 1992.
[Forbes95] Forbes, J., T. Huang, K. Kanazawa, and S. Russell, The BATmobile: Towards a
Bayesian Automated Taxi, Proceedings of the 14th International Joint Conference on Artificial
Intelligence, Montreal, Canada, 1995.
[Frank89] Frank, A. A., S. J. Liu, and S. C. Liang, Longitudinal Control Concepts for Automated
Automobiles and Trucks Operating on a Cooperative Highway, in Vehicle/Highway Automation:
Technology and Policy Issues, Society of Automotive Engineers, pp. 61-68, 1989.
[Fu65a] Fu, K.S., and G.J. McMurtry, A Study of Stochastic Automata as Models of
Adaptive and Learning Controllers, Technical Report TR-EE 65-8, Purdue University,
Lafayette, IN, 1965.
[Fu65b] Fu, K.S., and G.J. McMurtry, An Application of Stochastic Automata to the
Synthesis of Learning Systems, Technical Report TR-EE 65-17, Purdue University, Lafayette,
IN., 1965.
[Fu67] Fu, K.S., Stochastic Automata as Models of Learning Systems, in Computer and
Information Sciences II, J.T. Lou, Editor, New York: Academic, 1967.
[Fu69a] Fu, K.S., and T.J. Li, Formulation of Learning Automata and Automata Games,
Information Science, vol. 1, no. 3, pp. 237-256, 1969.
[Fu69b] Fu, K.S., and T.J. Li, On Stochastic Automata and Languages, Information Science,
vol. 1, no. 4, pp. 403-419, 1969.
[Fu71] Fu, K.S., Stochastic Automata, Stochastic Languages and Pattern Recognition, Journal
of Cybernetics, vol. 1, no. 3, pp. 31-48, 1971.
[Fuji96] Fuji, H., O. Hayashi, and Y. Hirao, Experimental Research on Protocol of Inter-Vehicle
Communications for Multiple Vehicles, Proceedings of the 3rd World Congress in ITS, Orlando,
FL, October 1996.
[Garfinkel96] Garfinkel, S. L., Why Driver Privacy Must Be a Part of ITS, Converging
Infrastructures: Intelligent Transportation and the National Information Infrastructure,
Cambridge, MA: MIT Press, pp. 324-340, 1996.
[Gavan96] Gavan, J., and F. Handler, Analysis, Computation and Mitigation of the Interference
to a Remote Receiver from Two Collocated Vehicular Transceivers, IEEE Transactions on
Vehicular Technology, vol. 45, no. 3, pp. 431-442, 1996.
[Gilbert92] Gilbert, V., J. Thibault, and K. Najim, Learning Automata for the Control and
Optimization of a Continuous Stirred Tank Fermenter, IFAC Symposium on Adaptive systems in
Control and Signal Processing, Grenoble, France, July 1992.
[Godbole95] Godbole, D., J. Lygeros, and S. Sastry, Hierarchical Hybrid Control : a Case
Study, in Hybrid systems II, Berlin: Springer-Verlag, pp. 166-190, 1995.
[Godbole96] Godbole, D., Towards a Fault Tolerant AHS Design. Part II, Design and
Verification of Communication Protocols, Technical Report UCB-ITS-PRR-96-15, PATH
Program, Institute of Transportation Studies, University of California, Berkeley, 1996.
[Hedrick96] Hedrick, J. K., J.C. Gerdes, D. B. Maciuca, D. Swaroop, and V. Garg, Longitudinal
Control Development for IVHS Fully Automated and Semi-Automated Systems: Phase II,
Technical Report UCB-ITS-PRR-96-01, PATH Program, Institute of Transportation Studies,
University of California, Berkeley, 1996.
[Hessburg95] Hessburg, T., and M. Tomizuka, Fuzzy Logic Control for Lane Change
Maneuvers in Lateral Vehicle Guidance, Technical Report UCB-ITS-PWP-95-13, PATH
Program, Institute of Transportation Studies, University of California, Berkeley, 1995.
[Ho96] Ho, F.-S., and P. Ioannou, Traffic Flow Modeling and Control using Artificial Neural
Networks, IEEE Control Systems, vol. 16, no. 5, pp. 1626, October 1996.
[Ikonen97] Ikonen, E., and K. Najim, Use of Learning Automata in Distributed Fuzzy Logic
Processor Training, IEE Proceedings of Control Theory and Applications, in press, 1997.
[ITSJ96] ITS Handbook in Japan, Highway Industry Development Organization, Tokyo, Japan.
[ITSW96] ITS World, Technology and Applications for Intelligent Transportation Systems, issues
July-August, September-October 1996, Advanstar Communications, Eugene, OR, ISSN 1086-2145.
[James96] James, R.D., Widening the Net, COMTrans (Communication Networks for ITS),
supplement to Traffic Technology International, pp. 60-64, August 1996.
[Kachroo95b] Kachroo, P., Sliding Mode Full-state Feedback Longitudinal Control of Vehicles
in an Automated Highway System (AHS), Mobile Robots X, SPIE, Bellingham WA, pp. 260-
268, 1995.
[Kachroo95c] Kachroo, P., and M. Tomizuka, Vehicle Control for Automated Highway
Systems for Improved Lateral Maneuverability, Proceedings of the IEEE International
Conference on Systems, Man and Cybernetics, Vancouver, B.C., vol. 1, pp. 777-782, 1995.
[Kachroo96] Kachroo, P., A. Walimbe, and B. Benham, Longitudinal and Lateral Vehicle
Control Setup in the Flexible Low-Cost Automated Scaled Highway (FLASH) Laboratory, in
the Proceedings of ITS America, Sixth Annual Meeting, 1996.
[Kalman62] Kalman, R. E., and J. E. Bertram, Control System Analysis and Design via the
Second Method of Lyapunov: II. Discrete-Time Systems, ASME Journal of Basic Engineering,
ser. D, 82, pp. 394-400, 1962.
[Lasky93] Lasky, T. L., and B. Ravani, A Review of Research Related to Automated Highway
System (AHS), Interim Report for Federal Highway Administration, Contract no. DTFH61-93-
C-00189, University of California, Davis, October 25, 1993.
[Lee95] Lee, H., D.W. Love, and M. Tomizuka, Longitudinal Maneuvering Control for
Automated Highway Systems, Proceedings of the American Control Conference, Seattle, Wash.,
vol. 1, pp. 150-154, 1995.
[Levitan95] Levitan, L., Preliminary Human Factors Guidelines for Automated Highway
System Designers Volume I, Guidelines for AHS Designers, Technical Report FHWA-RD-94-
116, Federal Highway Administration, McLean VA, 1995.
[Lin96] Lin, C.-F., and A. G. Ulsoy, Time to Lane Crossing Calculation and Characterization of
its Associated Uncertainty, ITS Journal, vol. 3, no. 2, pp. 85-98, 1996.
[Liu89] Liu, S. M., and A. A. Frank, On Lateral Control of Highway Vehicles Guided by a
Forward Looking Sensor, Proceedings of the 1st International Conference on Applications of
Advanced Technologies in Transportation Engineering, San Diego, pp. 119-124, 1989.
[Lubin92] Lubin, J.M., E. C. Huber, S.A. Gilbert, and A. L. Kornhauser, Analysis of a Neural
Network Lateral Controller for an Autonomous Road Vehicle, in IVHS Issues and Technology,
Society of Automotive Engineers, pp. 23-44, 1992.
[Lygeros94] Lygeros, J., and D. N. Godbole, An Interface Between Continuous and Discrete-
event Controllers for Vehicle Automation, Proceedings of American Control Conference,
Baltimore, MD, vol. 1, pp. 801-805, 1994.
[Lygeros96] Lygeros, J., D. N. Godbole, and M. Broucke, Towards a Fault Tolerant AHS
Design. Part I, Extended Architecture, Technical Report UCB-ITS-PRR-96-14, PATH Program,
Institute of Transportation Studies, University of California, Berkeley, 1996.
[McGurrin96] McGurrin, M. F., The Privacy Enhancement Protocol (PEP), 3rd World
Congress in ITS, Orlando, FL, October 1996.
[Malik95] Malik, J., D. Koller, and T. Luong, A Machine Vision Based System for Guiding
Lane-change Maneuvers, Technical Report UCB-ITS-PRR-95-34, PATH Program, Institute of
Transportation Studies, University of California, Berkeley, 1995.
[Marsh93] Marsh, C., T. J. Gordon, and Q. H. Wu, Stochastic Optimal Control of Active
Vehicle Suspensions using Learning Automata, Proceedings I. Mech. Eng. Part I, Journal of
Systems and Control Engineering, vol. 207, pp.143-152, 1993.
[Marsh95] Marsh, C., and T. J. Gordon, The Application of Learning Automata to Controller
Design in Slow-Active Automobile Suspensions, International Journal for Vehicle Mechanics
and Mobility, vol. 24, no. 8, pp. 597-616, 1995.
[Moon96] Moon, S.-H., C.-W. Kim, and M.-H. Han, Navigation Control for an Autonomous
Road Vehicle Using Neural Network, Proceedings of the 3rd World Congress in ITS, Orlando,
FL, October 1996.
[Mourou92] Mourou, P., and B. Fade, Multi-agent Planning and Execution Monitoring:
Application to Highway Traffic, Proceedings of the AAAI Spring 92 Symposium, pp. 107-
112, 1992.
[Najim77] Najim, K., and Y.M. El Fattah, Use of Learning Automaton in Static Control of a
Phosphate Drying Furnace, Proceedings of the 5th IFAC/IFIP International Conference on
Digital Computer Applications to process Control, The Hague, Netherlands, June 1977.
[Najim91b] Najim, K., and G. Oppenheim, Learning Systems: Theory and Applications, IEE
Proceedings of Computer and Digital Techniques, vol. 138, no. 4, pp. 183-192, 1991.
[Najim94] Najim, K., and A. S. Poznyak, editors, Learning Automata: Theory and Application,
Tarrytown, NY: Elsevier Science Ltd., 1994.
[Najim96] Najim, K., and A. S. Poznyak, Multimodal Searching Technique Based on Learning
Automata with Continuous Input and Changing Number of Actions, IEEE Transactions on
Systems, Man and Cybernetics, Part B, vol. 26, no. 4, pp.666-673, 1996.
[Narendra74] Narendra, K. S., and M. A. L. Thathachar, Learning Automata: A Survey, IEEE
Transactions in Systems, Man and Cybernetics, vol. SMC-4, no. 4, July 1974.
[Naruse93] Naruse, K., and Y. Kakazu, Strategy Acquisition of Path Planning of Redundant
Manipulator using Learning Automata, IEEE International Workshop on Neuro-Fuzzy Control,
pp.154-159, 1993.
[O'Brien96] O'Brien, R. T., P. A. Iglesias, and T. J. Urban, Vehicle Lateral Control for
Automated Highway Systems, IEEE Transactions on Control Systems Technology, vol. 4, no. 3,
pp. 266-273, 1996.
[Oommen94a] Oommen, B. J., and E. V. de St. Croix, String Taxonomy using Learning
Automata, Technical Report TR-234, School of Computer Science, Carleton University, Ottawa,
Canada, March 1994.
[Oommen94b] Oommen, B. J., and E. V. de St. Croix, Graph Partitioning using Learning
Automata, Technical Report TR-250, School of Computer Science, Carleton University, Ottawa,
Canada, July 1994.
[Özgüner95] Özgüner, Ü., Combined Longitudinal and Lateral Controller Design for a Vehicle
with Radar Sensors, IVHS and Advanced Transportation Systems, Society of Automotive
Engineers, Warrendale PA, pp. 51-57, 1995.
[PATH] Joh, J., et al. (PATH Design Team), California Partners for Advanced Transit and
Highways, http://www-path.eecs.berkeley.edu/ (Dec. 6, 1996).
[PathDb96] Morris, S., and S. Petrites, PATH Database, California Partners for Advanced
Transit and Highways, http://sunsite.berkeley.edu/~path/search.html (Dec. 6, 1996).
[Peng93] Peng, J., and R. J. Williams, Efficient Learning and Planning within the Dyna
Framework, Adaptive Behavior, vol. 1, no. 4, pp. 437-454, 1993.
[Pomerleau96] Pomerleau, D., and T. Jochem, Rapidly Adapting Machine Vision for Automated
Vehicle Steering, IEEE Expert, vol. 11, no. 2, pp. 19-27, 1996.
[Poznyak96] Poznyak, A., K. Najim, and E. Ikonen, Adaptive Selection of the Optimal Model
Order of Linear Regression Models using Learning Automata, International Journal of Systems
Science, vol. 27, no. 2, pp. 151-159, 1996.
[Puri95] Puri, A., and P. Varaiya, Driving Safely in Smart Cars, Proceedings of the American
Control Conference, Seattle, WA, vol. 5, pp. 3597-3599, 1995.
[Rajaraman96] Rajaraman, K., and P. S. Sastry, Finite Time Analysis of the Pursuit Algorithm
for Learning Automata, IEEE Transactions on Systems, Man and Cybernetics, Part B, vol. 26,
no. 4, pp.590-598, 1996.
[Rault95] Rault, A., and M. Muffat, Short-term and Mid-term Applications in the Automotive
Sector : Prometheus Activities of Peugeot SA and Renault, Presentations of the Joint California
PATH and France Workshop, Richmond, CA, 1995.
[Raytheon94] Knowledge Based Systems and Learning Methods for AHS, Technical Report
FHWA-RD-95-045 by Raytheon, AHS Precursor Systems Analyses, Compendium of Research
Summaries, compiled by Information Dynamics, Inc., and Viggen Corporation, McLean, VA,
February 1994.
[Schlegel96] Schlegel, N., and P. Kachroo, Telerobotic Operation Combined with Automated
Vehicle Control, Proceedings of the SPIE-Photonics East 96, Boston, MA, November 1996.
[Schuster96] Schuster, S., and C. Jacoby, An AHS Concept Based on Maximum Adaptability,
Proceedings of the 3rd World Congress in ITS, Orlando, FL, October 1996.
[Shapiro69] Shapiro I.J., and K.S. Narendra, Use of Stochastic Automata for Parameter Self-
Optimization with Multi-Modal Performance Criteria, IEEE Transactions on System Science
and Cybernetics, vol. SSC-5, pp. 352-360, 1969.
[Shiller95] Shiller, Z., and S. Sundar, Emergency Maneuvers for AHS Vehicles, Systems and
Issues in ITS, Society of Automotive Engineers, Warrendale PA, pp. 1-11, 1995.
[Singer95] Singer, R. M., K. C. Gross, and S. Wegerich, A Fault Tolerant Sensory Diagnostic
System for Intelligent Vehicle Application, Intelligent Vehicles '95 Symposium, Detroit, MI, pp.
176-182, 1995.
[Slotine91] Slotine, J.-J. E., and Li, W., Applied Nonlinear Control, Englewood Cliffs, NJ:
Prentice-Hall, 1991.
[Smart94] Smart Road Development Effort Report, Center for Transportation Research, Virginia
Tech, 1994.
[Spooner95] Spooner, J.T., and K. M. Passino, Fault Tolerant Longitudinal and Lateral Control
for Automated Highway Systems, Proceedings of the American Control Conference, Seattle,
Wash., vol. 1, pp. 663-667, 1995.
[Sukthankar96b] Sukthankar, R., D. Pomerleau, and C. Thorpe, SHIVA: Simulated Highways for
Intelligent Vehicle Algorithms, Proceedings of the IEEE Intelligent Vehicles Symposium, Detroit,
MI, pp. 332-337, 1995.
[Tan96] Tan, H.-S., and S. Patwardhan, Magnetic sensors for automatic steering control,
Intellimotion, vol. 5, no. 2, pp. 10-11, 1996.
[TrafT96] Traffic Technology International, issues April, May 1996, UK & International Press,
Surrey, UK, ISSN 1356-9252.
[TRB95] January 25, 1995 meeting of the Committee on AI, during the 74th Annual Meeting of
Transportation Research Board, Washington, D. C., 1995.
[Tsetlin73] Tsetlin, M.L., Automaton Theory and Modeling of Biological Systems, vol. 102 in
Mathematics in Science and Engineering, New York: Academic Press, 1973.
[Tsypkin71] Tsypkin, Ya. Z., Adaptation and Learning in Automatic Systems, New York:
Academic, 1971.
[Ünsal95] Ünsal, C., J. S. Bay and P. Kachroo, Intelligent Control of Vehicles: Preliminary
Results on the Application of Learning Automata Techniques to Automated Highway System,
Proceedings of Tools with Artificial Intelligence Symposium, Washington DC, November 1995.
[Ünsal96] Ünsal, C., P. Kachroo, and J. S. Bay, Multiple Stochastic Learning Automata for
Vehicle Path Control in an Automated Highway System, in revision for IEEE Transactions on
Systems, Man, and Cybernetics, first submission May 1996.
[Varaiya91] Varaiya, P., and S.E. Shladover, Sketch of an IVHS System Architecture, Vehicle
Navigation and Information Systems Conference Proceedings, Society of Automotive Engineers,
pp. 909-922, 1991.
[Varaiya93] Varaiya, P., Smart Cars on Smart Roads: Problems of Control, IEEE Transactions
on Automatic Control, vol. 38., no. 2, pp. 195-207, Feb. 1993.
[Varshavski63] Varshavski, V.I., and I.P. Vorontsova, On the Behavior of Stochastic Automata
with Variable Structure, Automat. Telemekh., vol. 24, pp. 253-360, 1963.
[Viswanathan72] Viswanathan R., and K.S. Narendra, Comparison of Expedient and Optimal
Reinforcement Schemes for Learning Systems, Journal of Cybernetics, vol. 2, pp. 21-37, 1972.
[Weber96] Weber J, and M. Atkin, Further Results on the Use of Binocular Vision for Highway
Driving, Proceedings of the 1996 SPIE Conference on Intelligent Systems and Controls, vol.
2902, Nov. 1996.
[Wigan96] Wigan, M., Problem of Success: Privacy, Property, and Transactions, Converging
Infrastructures: Intelligent Transportation and the National Information Infrastructure,
Cambridge, MA: MIT Press, pp. 341-354, 1996.
[Wright95] Wright, T., Eyes on the Road : Privacy and ITS, Traffic Technology International,
pp. 88-93, 1995.
[Xu96] Xu, T., DYNAVIS, a Dynamic Visualization Environment for Automatic Vehicle
Control Systems, Intellimotion, vol. 5, no. 1, pp. 2, 10, 1996.
[Yanakiev96] Yanakiev, D., and I. Kanellakopoulos, Speed Tracking and Vehicle Follower
Control Design for Heavy-duty Vehicles, Vehicle system dynamics, vol. 25, no. 4, pp. 251-276,
1996.
[Yang96] Yang, Y.-T., and B.H. Tongue, A New Control Approach for Platoon Operations
during Vehicle Exit/Entry, Vehicle System Dynamics, vol. 25, no. 4, pp. 305-319, 1996.
[Yu95] Yu, J., and A. Sideris, Control of Combined Vehicle Motion: a Gain-scheduling
Approach, Proceedings of the 5th Annual Meeting of ITS America, vol. 1, pp. 109-115, 1995.
[Zoratti95] Zoratti, P., Millimeter Wave Scattering Characteristics and Radar Cross Section
Measurements of Common Roadway Objects, Collision Avoidance and Automated Traffic
Management Systems, SPIE, Bellingham, WA, pp. 169-179, 1995.
Appendix A. Glossary
The linear inaction-penalty scheme $L_{I-P}$ can be derived by a modification of the general linear
reward-penalty scheme $L_{R-P}$ (see Section 6.1). This reinforcement scheme, as its name suggests,
does not update the action probabilities when the environment response is affirmative. If $\alpha(n) = \alpha_i$:

$$\beta(n) = 0:\quad p_i(n+1) = p_i(n), \qquad p_j(n+1) = p_j(n) \quad \forall j \neq i$$
$$\beta(n) = 1:\quad p_i(n+1) = (1-b)\,p_i(n), \qquad p_j(n+1) = \frac{b}{r-1} + (1-b)\,p_j(n) \quad \forall j \neq i \tag{B.1}$$

From the definition above, the expected value of the probability of an action at the next
step can be written as:

$$E\big[p_i(n+1) \mid p_i(n)\big] = \sum_{k=1}^{r} E\big[p_i(n+1) \mid p_i(n),\, \alpha(n)=\alpha_k\big]\, p_k(n)$$
$$= \sum_{k=1}^{r} c_k\, E\big[p_i(n+1) \mid p_i(n),\, \alpha(n)=\alpha_k,\, \beta(n)=1\big]\, p_k(n) + \sum_{k=1}^{r} (1-c_k)\, E\big[p_i(n+1) \mid p_i(n),\, \alpha(n)=\alpha_k,\, \beta(n)=0\big]\, p_k(n)$$
$$= c_i\,(1-b)\,p_i^2(n) + \sum_{k \neq i} c_k\,p_k(n)\left[\frac{b}{r-1} + (1-b)\,p_i(n)\right] + (1-c_i)\,p_i^2(n) + \sum_{k \neq i} (1-c_k)\,p_k(n)\,p_i(n) \tag{B.2}$$

We will again consider the ideal case where the probability of penalty for the optimal action $\alpha$ is
equal to zero, i.e., $c_\alpha = 0$ and $0 < c_j < 1$ for $j \neq \alpha$. Then, for $i = \alpha$, we have:

$$E\big[p_\alpha(n+1) \mid p_\alpha(n)\big] = 0 + \sum_{k \neq \alpha} c_k\,p_k(n)\left[\frac{b}{r-1} + (1-b)\,p_\alpha(n)\right] + p_\alpha^2(n) + \sum_{k \neq \alpha} (1-c_k)\,p_k(n)\,p_\alpha(n) \tag{B.3}$$

More explicitly:

$$E\big[p_\alpha(n+1) \mid p_\alpha(n)\big] = \frac{b}{r-1}\sum_{k \neq \alpha} c_k\,p_k(n) + p_\alpha(n)\sum_{k \neq \alpha} c_k\,p_k(n) - b\,p_\alpha(n)\sum_{k \neq \alpha} c_k\,p_k(n) + p_\alpha^2(n) + \sum_{k \neq \alpha} p_k(n)\,p_\alpha(n) - \sum_{k \neq \alpha} c_k\,p_k(n)\,p_\alpha(n)$$
$$= \frac{b}{r-1}\sum_{k \neq \alpha} c_k\,p_k(n) - b\,p_\alpha(n)\sum_{k \neq \alpha} c_k\,p_k(n) + p_\alpha^2(n) + \sum_{k \neq \alpha} p_k(n)\,p_\alpha(n) \tag{B.4}$$

Now, taking the expectation of both sides, and using the fact that $\sum_{i=1}^{r} p_i(n) = 1$, we obtain:

$$E\big[p_\alpha(n+1)\big] = \frac{b}{r-1}\sum_{k \neq \alpha} c_k\,E\big[p_k(n)\big] - b\,E\!\left[p_\alpha(n)\sum_{k \neq \alpha} c_k\,p_k(n)\right] + E\big[p_\alpha^2(n)\big] + E\big[(1 - p_\alpha(n))\,p_\alpha(n)\big] \tag{B.5}$$

The simulations of the linear inaction-penalty scheme showed that the probability of the
optimal action converges to a steady-state value, as seen in Figure B.1. For such a steady-state
result, we can assume that, as $n \to \infty$:

$$E\big[p_\alpha(n)\big] = \text{constant} \equiv \bar{p}, \qquad \sigma^2_{p_\alpha(n)} = 0, \qquad E\big[p_\alpha(n)\,p_j(n)\big] = E\big[p_\alpha(n)\big]\,E\big[p_j(n)\big] \tag{B.6}$$

The conditions above state that, as we approach the steady-state value for the probability of
the optimal action, the expected value approaches the same value, the variance of the probability
is negligibly small, and the probabilities of all other actions are independent of the optimal action's
probability, although the sum of the probabilities must be equal to 1. (Since the probability of the
optimal action is a constant value, this value and the other probabilities can be treated as independent
variables.) Using these facts, the identity $E[x^2] = E[x]^2 + \sigma_x^2$, and renaming
$E\!\left[\sum_{k \neq \alpha} c_k\,p_k(n)\right] \equiv A$, we can simplify the equality in Equation B.5 as follows:

$$E\big[p_\alpha(n+1)\big] = \frac{b}{r-1}\sum_{k \neq \alpha} c_k\,E\big[p_k(n)\big] - b\,E\big[p_\alpha(n)\big]\,E\!\left[\sum_{k \neq \alpha} c_k\,p_k(n)\right] + E\big[p_\alpha(n)\big]^2 + E\big[p_\alpha(n)\big] - E\big[p_\alpha(n)\big]^2 \tag{B.7}$$

Or:

$$\bar{p} = \frac{b}{r-1}A - b\,\bar{p}\,A + \bar{p}^2 + \bar{p} - \bar{p}^2 \;\Longrightarrow\; 0 = \frac{b}{r-1}A - b\,\bar{p}\,A \;\Longrightarrow\; \bar{p} = \frac{bA}{bA\,(r-1)} = \frac{1}{r-1} \tag{B.8}$$

where $r$ is the number of actions. This proves that a learning automaton using the linear inaction-
penalty scheme in a stationary environment where there is one (and only one) optimal action is
not $\epsilon$-optimal (except when $r = 2$). Furthermore, the probability of the optimal action converges
to $1/(r-1)$. Figure B.1 shows the probabilities of five actions where the optimal action's
probability converges to $1/(5-1) = 0.25$.
[Plot: five action probabilities, vertical axis 0.12 to 0.24, versus iterations, 0 to 150.]
Figure B.1. Probabilities of five actions in the $L_{I-P}$ scheme; only $c_1 = 0$.
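The limiting value $1/(r-1)$ can be checked numerically with a minimal simulation of the $L_{I-P}$ update (a Python sketch; the penalty probabilities of the non-optimal actions and the step size b are illustrative):

```python
import random

def simulate_lip(r: int = 5, b: float = 0.1, steps: int = 3000,
                 seed: int = 1) -> list:
    """L_I-P automaton in a stationary environment with one optimal
    action (c[0] = 0); the other penalty probabilities are illustrative."""
    rng = random.Random(seed)
    c = [0.0] + [0.3] * (r - 1)          # penalty probabilities
    p = [1.0 / r] * r                    # uniform initial probabilities
    for _ in range(steps):
        i = rng.choices(range(r), weights=p)[0]
        if rng.random() < c[i]:          # beta = 1: penalty update
            p = [(1 - b) * p[j] if j == i
                 else b / (r - 1) + (1 - b) * p[j] for j in range(r)]
        # beta = 0: inaction, probabilities are left unchanged
    return p

p = simulate_lip()
# p[0] settles near 1/(r-1) = 0.25 rather than converging to 1
```

Each penalty applied to a non-optimal action maps the optimal action's probability through $p \mapsto b/(r-1) + (1-b)p$, whose fixed point is $1/(r-1)$, which is why the convergence in Figure B.1 is rapid.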
Appendix C. Simulation
Unix: mfiles.tar.Z
Windows: mfiles.zip
Table C1. Description of the subroutines and functions for multiple lane, multiple automata
intelligent vehicle simulation.
Simulation m-files: mlma.m, m_setgui.m, m_init.m, m_reinit.m, plot_h2.m, m_sandl.m,
move2.m, m_plap.m, reg2.m, m_traj.m, plan2.m, m_mvec.m, m_movie.m.

[Figures C.4-C.6: annotated screenshots of the simulation GUI. The main window controls
include the learning algorithm menu, extended side sensor switch, learning parameters, front
sensor ranges, length of memory vectors, processing speed (1/sec), index of the current vehicle,
speed and lane parameters and on/off switches for the current vehicle, speed increment,
permitted speed variation, data file I/O, display update rate, switch for saving flow data,
highway window size adjustment, and final time (sec). The plot GUI selects memory vector and
pinch module outputs, speed and lane positions, timed snapshots, and relative positions of
vehicles during a run, with the indices of the first and last vehicles to be plotted. The
snapshot figure shows the reference vehicle over the time interval t = 1 sec to t = 16 sec and
at t = [0 10 20 28 30 40 45].]
The tables given in this section show the possible positions of vehicles in a highway for the
scenarios described in Chapter 7. They indicate relative positions of multiple vehicles with
respect to each other's sensor ranges. Each row in a matrix corresponds to a lane; each square
illustrates a road section which falls into the side sensor range of an automated vehicle. A dark
square indicates the presence of a vehicle. Not all possibilities are considered; instead, only the
situations that are of interest for a specific scenario are listed. Similar situations are then
combined into a single state and simplified if necessary. Two situations are said to be similar if
the sensor module outputs and/or possible actions are the same for both. The three scenarios of
Chapter 7 are:
Figure D.1. All possible immediate neighborhood situations for two vehicles.
Figure D.2. Combined states for two vehicles: states not shown
are identical to those given here.
E: A2, A3, A5, A6
F: B7, C2
G: C3, C6, C7, D5
H: E7
I: F2, F3, F6, C2
J: G5
K: J3, J4
L: K2, L1
Figure D.5. Combined states for three vehicles: states not shown are
identical to those indicated.
Vita
Cem Ünsal was born on November 4, 1967 in Ankara, Turkey. He is an alumnus (118) of
Galatasaray Lisesi (Lycée de Galatasaray), Istanbul. He received a B.Sc. degree with honors in
electrical and electronics engineering from Boğaziçi University, Istanbul, in 1991, and an M.S.
degree in electrical engineering from Virginia Polytechnic Institute and State University (Virginia
Tech), in 1993. His M.S. thesis "Self-Organization in Large Populations of Mobile Robots"
received the Paul E. Torgersen Graduate Student Research Excellence Award (3rd place) in 1994.
From 1993 to 1996, he was a Ph.D. candidate in the Bradley Department of Electrical
Engineering, and research assistant at the Center for Transportation Research at Virginia Tech.
His research interests include automated highway systems, automated vehicle control systems,
artificial intelligence, learning automata, sliding mode control and nonlinear observers.
He will continue his research career at the Robotics Institute at Carnegie Mellon
University, Pittsburgh, PA. Mr. Ünsal has been a student member of AMS, IEEE, ITS America
and TRB during his graduate studies. He is currently a member of IEEE and ITSA.
29 January 1997