(CSC309)
Lecturer:
Dr. Oyelade, O. J.
AIM AND OBJECTIVES
The course is all about the theories that enable computation, and
computation is all about modeling, designing, and programming the
computer system to simulate our model.
In this course, we will be concerned with the languages, or in other words the
formal languages, that make computation with the computer possible.
The course content, therefore, includes:
1.0 Introduction
1.1 Alphabet and Strings
1.2 Languages
1.3 Language operation
2.0 Grammars
2.1 Definition
2.2 Regular Grammar
2.3 Regular expression
2.4 Relationship between regular grammar and regular expression
2.5 Types of Grammar (Chomsky hierarchy)
3.0 Finite Automata
3.1 Deterministic and Non-deterministic finite automata
3.2 Conversion of automata to certain types of grammars and back again,
using non-deterministic automata
3.3 Conversion of non-deterministic finite automata to deterministic finite
automata
3.4 Regular expressions and their relationship to finite automata
4.0 Pushdown automata and context-free grammars
4.1 Deterministic and non-deterministic pushdown automata
4.2 Context-free grammars
4.3 Useless production and emptiness test
4.4 Ambiguity
4.5 Context-free grammars for pushdown automata and vice-versa
5.0 Properties of Context-free languages
5.1 Pumping lemma
5.2 Closure properties
5.3 Existence of non-context-free languages
5.4 Turing languages
5.5 Decidability and Undecidability
Oyelade O.J.
Recommended Texts
1. Lawson, M.V. Finite Automata. Chapman and Hall/CRC, 2004.
2. Brookshear, J.G. Theory of Computation: Formal Languages, Automata, and Complexity. The Benjamin/Cummings Publishing Company, Inc., 1989.
3. Carroll, J. and Long, D. Theory of Finite Automata (with an Introduction to Formal Languages). Prentice Hall, 2004.
1.0 Introduction
The set of all strings over the alphabet Σ is denoted by Σ* and the empty string is
denoted by ε.
The set of all strings except the empty one is denoted by Σ+.
Given two strings x, y ∈ Σ*, a new string xy can be formed, called the
concatenation of x and y, by adjoining the symbols in y to those in x.
For example, if Σ = {0,1}, then both 0101 and 101010 are strings over Σ, and
their concatenation is the string 0101101010.
If x, y ∈ Σ* then |xy| = |x| + |y|, i.e. when two strings are concatenated, the
length of the result is the sum of the lengths of the two strings.
Note: for the empty string ε, if x ∈ Σ*, then εx = x = xε.
The order in which strings are concatenated is important. For example, suppose
Σ = {a, b}, u = ab and v = ba; then uv = abba and vu = baab, so uv ≠ vu. Therefore,
the order matters, just as it does in spelling.
Associativity also holds for concatenation. For example, given three strings x, y,
and z, there are two ways to concatenate them in this order:
- we can concatenate x and y first to obtain xy and then concatenate xy
with z to obtain xyz, or
- we can concatenate y and z first to obtain yz and then concatenate x with
yz to obtain xyz.
That is, (xy)z = x(yz).
The usual laws of indices hold. For example, if x = ba, then x^2 = (ba)^2 =
baba. If m, n ≥ 0 then x^m x^n = x^(m+n).
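The properties of concatenation stated above can be checked on small cases. The following is a minimal sketch that models strings over an alphabet as Python str values; the names EPSILON and power are illustrative choices, not part of the course notation.

```python
# Strings over an alphabet, modelled as Python str values.
EPSILON = ""  # the empty string

u, v = "ab", "ba"

# The length of a concatenation is the sum of the lengths.
assert len(u + v) == len(u) + len(v)

# The empty string is the identity for concatenation.
assert EPSILON + u == u == u + EPSILON

# Concatenation is not commutative ...
uv, vu = u + v, v + u  # "abba" and "baab"

# ... but it is associative.
x, y, z = "0", "10", "110"
assert (x + y) + z == x + (y + z)


def power(s: str, n: int) -> str:
    """Return s concatenated with itself n times (s^0 is the empty string)."""
    return s * n
```

Here power("ba", 2) plays the role of (ba)^2, and the law of indices corresponds to power(x, m) + power(x, n) == power(x, m + n).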
1.1.1 Prefix, Suffix, proper factor and substring of a string
Let x, y, z ∈ Σ*. If u = xyz then y is called a factor of u, x is called a prefix of u,
and z is called a suffix of u. The factor y is called a proper factor of u if at least one of x and z
is not the empty string. Likewise, the prefix x (or suffix z) is proper if x ≠ u (or z ≠ u).
The string u is a substring of the string v if u = a1...an, where each ai ∈ Σ, and there are strings
x0, ..., xn such that v = x0 a1 x1 ... xn-1 an xn. Thus the symbols of u must occur in v in the
same order, but not necessarily next to each other. Let x ∈ Σ*. We call a representation x =
u1...un, where each ui ∈ Σ*, a factorization of x.
For example, consider the string u = abab over the alphabet {a,b}.
- The prefixes of u are ε, a, ab, aba, abab.
- The suffixes of u are ε, b, ab, bab, abab.
- The factors of u are ε, a, b, ab, ba, aba, bab, abab.
- Examples of substrings of u are aa, bb, abb (none of which is a factor of u).
- u = ab.ab is a factorization of u.
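The example above can be reproduced mechanically. This is a small sketch under the definitions given in this section; the function names are illustrative, and is_substring follows the definition used here, in which the symbols need not be adjacent.

```python
def prefixes(u: str) -> list[str]:
    """All prefixes of u, shortest first (the empty string included)."""
    return [u[:i] for i in range(len(u) + 1)]


def suffixes(u: str) -> list[str]:
    """All suffixes of u, shortest first (the empty string included)."""
    return [u[i:] for i in range(len(u), -1, -1)]


def factors(u: str) -> set[str]:
    """All contiguous pieces y with u = x y z."""
    return {u[i:j] for i in range(len(u) + 1) for j in range(i, len(u) + 1)}


def is_substring(u: str, v: str) -> bool:
    """True if the symbols of u occur in v in order, not necessarily adjacent."""
    remaining = iter(v)
    # "ch in remaining" consumes the iterator up to the first match.
    return all(ch in remaining for ch in u)
```

For u = abab, prefixes, suffixes and factors return exactly the lists given above, and is_substring("abb", "abab") holds although abb is not a factor.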
1.1.2 The tree order on Σ*
This is the standard way of listing strings over an alphabet: for x, y ∈ Σ*, we define x
< y iff either |x| < |y|, or |x| = |y| and the string x occurs to the left of the string y in the tree over Σ*.
For example, let Σ = {0,1} with 0 < 1. The strings of length 2 in the tree over Σ* are, from left to right:
00, 01, 10, 11
so the tree order begins ε, 0, 1, 00, 01, 10, 11, ...
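As an illustration, the tree order can be generated by listing strings by length and, within one length, in the order induced by the alphabet. The sketch below assumes the alphabet is passed as a string whose characters are already in increasing order; tree_order and precedes are illustrative names.

```python
from itertools import count, islice, product


def tree_order(alphabet: str):
    """Yield all strings over the alphabet: shorter first, then left to right."""
    for n in count(0):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def precedes(x: str, y: str, alphabet: str) -> bool:
    """True if x comes strictly before y in the tree order."""
    rank = {a: i for i, a in enumerate(alphabet)}
    return (len(x), [rank[c] for c in x]) < (len(y), [rank[c] for c in y])


# The first seven strings over {0, 1}, with 0 < 1.
first = list(islice(tree_order("01"), 7))
```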
1.2 Languages
Consider a simple graph G on four vertices with the adjacency matrix
0 1 0 0
1 0 1 0
0 1 0 1
0 0 1 0
This adjacency matrix can be used to construct a binary string by adjoining (or
concatenating) the rows of the matrix.
Therefore, the graph G is represented by:
0100.1010.0101.0010 = 0100101001010010
code(G) = 0100101001010010
Therefore, every simple graph can be encoded by a string over the alphabet
Σ = {0,1}.
Let L = {x ∈ {0, 1}* : x = code(G) where G is connected}.
This is the language that corresponds to the decision problem: is the simple
graph G connected? The answer is yes iff code(G) ∈ L.
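The correspondence between the decision problem and the language L can be sketched in code. The function names below are illustrative, and decode assumes its argument really is the row-by-row code of a square 0/1 matrix; it does not check that the matrix is symmetric with a zero diagonal, as a simple graph requires.

```python
from math import isqrt


def code(matrix: list[list[int]]) -> str:
    """Concatenate the rows of an adjacency matrix into one binary string."""
    return "".join(str(bit) for row in matrix for bit in row)


def decode(x: str) -> list[list[int]]:
    """Rebuild the n-by-n matrix from a string of length n*n."""
    n = isqrt(len(x))
    if n * n != len(x):
        raise ValueError("length is not a perfect square")
    return [[int(x[i * n + j]) for j in range(n)] for i in range(n)]


def is_connected(matrix: list[list[int]]) -> bool:
    """Depth-first search from vertex 0; connected iff every vertex is reached."""
    n = len(matrix)
    if n == 0:
        return True  # convention chosen for this sketch
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(n):
            if matrix[i][j] and j not in seen:
                seen.add(j)
                stack.append(j)
    return len(seen) == n


def in_L(x: str) -> bool:
    """Membership test for L: decode the string and answer the graph question."""
    return is_connected(decode(x))


G = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
```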
1.3 Language operations
Suppose X is any set; then P(X) is the set of all subsets of X (called the power
set of X).
Let Σ be an alphabet. A language over Σ is any subset of Σ*; this implies that
the set of all languages over Σ is P(Σ*). If L and M are languages over Σ, so
are L ∩ M, L ∪ M and L\M (relative complement). If L is a language over Σ,
then L′ = Σ*\L is a language called the complement of L.
The operations of intersection, union, and complementation are called Boolean
operations (from set theory).
Note that x ∈ L ∪ M means x ∈ L or x ∈ M or both.
1.3.1 Product of languages
Note: We can use the Boolean operations, the product, and the Kleene star to
describe languages.
For example, let L = {a, b}* \ {a, b}*{aa, bb}{a, b}*. This consists of all strings
over the alphabet {a,b} that do not contain a doubled symbol.
Therefore, the string ababab ∈ L, but the string abaaba ∉ L.
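The example above can be explored on a finite slice of {a, b}*. The sketch below assumes the usual definition of the product of two languages (every string of the first followed by every string of the second) and cuts everything off at a hypothetical bound MAX_LEN, since the full languages are infinite.

```python
from itertools import product as cartesian


def all_strings(alphabet: str, max_len: int) -> set[str]:
    """Every string over the alphabet of length at most max_len."""
    return {"".join(t) for n in range(max_len + 1)
            for t in cartesian(alphabet, repeat=n)}


def concat(L: set[str], M: set[str]) -> set[str]:
    """Product of two languages: each x in L followed by each y in M."""
    return {x + y for x in L for y in M}


MAX_LEN = 6
SIGMA_STAR = all_strings("ab", MAX_LEN)  # finite slice of {a, b}*

# {a,b}*{aa,bb}{a,b}*, restricted to strings of length at most MAX_LEN.
with_double = {w for w in concat(concat(SIGMA_STAR, {"aa", "bb"}), SIGMA_STAR)
               if len(w) <= MAX_LEN}

# Relative complement: strings with no doubled symbol.
L = SIGMA_STAR - with_double
```

Within this slice L contains the empty string and, for each length from 1 to 6, the two alternating strings of that length.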
Some examples of languages over the alphabet Σ = {a, b} are:
1. Σ* can be written as (a + b)*, i.e. Σ* = {a, b}* = ({a} + {b})* = (a + b)*.
2. The language (a + b)^3 consists of all 8 strings of length 3 over Σ. This is
because (a + b)^3 means (a + b)(a + b)(a + b). A string x belongs to this
language if we can write it as x = a1a2a3 where a1, a2, a3 ∈ {a, b}.
3. The language aab(a + b)* consists of all strings that begin with the string aab,
but the language (a + b)*aab consists of all strings that end in the string aab.
The language (a + b)*aab(a + b)* consists of all strings that contain the string
aab as a factor.
4. The language (a + b)*a(a + b)*a(a + b)*b(a + b)* consists of all strings that
contain the string aab as a substring.
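The languages listed above can be tried out with Python's re module. This is only an illustrative sketch: Python writes union as | rather than +, and the pattern names are invented for this example.

```python
import re
from itertools import product


def strings(alphabet: str, n: int) -> list[str]:
    """All strings of length exactly n over the alphabet."""
    return ["".join(t) for t in product(alphabet, repeat=n)]


length_three = re.compile(r"(a|b){3}")          # (a + b)^3
begins_with_aab = re.compile(r"aab(a|b)*")      # aab(a + b)*
ends_with_aab = re.compile(r"(a|b)*aab")        # (a + b)*aab
has_factor_aab = re.compile(r"(a|b)*aab(a|b)*")
has_substring_aab = re.compile(r"(a|b)*a(a|b)*a(a|b)*b(a|b)*")


def member(pattern: re.Pattern, w: str) -> bool:
    """True if the whole string w belongs to the language of the pattern."""
    return pattern.fullmatch(w) is not None
```

For instance, abab contains aab as a substring (its first a, second a and last b) but not as a factor, so it belongs to the language in item 4 and not to the last language in item 3.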
GRAMMARS
Definition
Automata are devices for recognizing strings, while grammars are devices for
generating the strings belonging to a language.
Let us consider the following fragment of English
Sentence denoted by S
Noun-phrase denoted by NP
Verb-phrase denoted by VP
Noun denoted by N
Definite article denoted by T
Verb denoted by V
There are rules that tell us how the grammatical categories are related to each
other. We shall use → to indicate how a grammatical category on the left can be
constructed from grammatical categories on the right. That is:
1. S → NP.VP
2. NP → T.N
3. VP → V.NP
Also, let us include amongst these rules the specific English words that belong to
those grammatical categories consisting only of words; the symbol | means "or":
4. T → the
5. N → girl | boy | ball | bat | frisbee
6. V → hit | took | threw | lost
The alphabet of terminal symbols is Σ = {the, girl, boy, ball, bat, frisbee, hit, took, threw, lost}.
The starting point is always the symbol S.
S ⇒ NP.VP
NP.VP ⇒ T.N.VP
T.N.VP ⇒ T.N.V.NP
T.N.V.NP ⇒ T.N.V.T.N
T.N.V.T.N ⇒ the N.V.T.N
the N.V.T.N ⇒ the boy V.T.N
the boy V.T.N ⇒ the boy threw T.N
the boy threw T.N ⇒ the boy threw the N
the boy threw the N ⇒ the boy threw the ball
We can see that the string "the boy threw the ball" belongs to the language
generated by the grammar.
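Because this grammar has no recursive rule, the language it generates is finite and can be listed exhaustively. The following sketch stores the six rules in a dictionary (RULES and expand are illustrative names) and expands every category in all possible ways.

```python
from itertools import product

# Rules 1 to 6 of the grammar; each right side is a list of symbols.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["T", "N"]],
    "VP": [["V", "NP"]],
    "T": [["the"]],
    "N": [["girl"], ["boy"], ["ball"], ["bat"], ["frisbee"]],
    "V": [["hit"], ["took"], ["threw"], ["lost"]],
}


def expand(symbol: str):
    """Yield every list of words derivable from the given symbol."""
    if symbol not in RULES:  # a terminal word stands for itself
        yield [symbol]
        return
    for right_side in RULES[symbol]:
        # Combine every expansion of each symbol on the right side.
        for parts in product(*(list(expand(s)) for s in right_side)):
            yield [word for part in parts for word in part]


SENTENCES = {" ".join(words) for words in expand("S")}
```

Every sentence has the shape "the N V the N", so there are 5 x 4 x 5 = 100 of them. This approach would not terminate for a grammar with recursive rules.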
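Therefore, a grammar G can be defined as a quadruple or 4-tuple G = {VT, VN,
P, S} where:
- VT is a finite set of terminals;
- VN is a finite set of nonterminals;
- P is a finite set of productions (rewrite rules);
- S ∈ VN is the start symbol.
For example, consider the grammar with the productions
S → XSZ (1)
S → Y (2)
Y → yY | ε (3,4)
X → x (5)
Z → z (6)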
With start symbol S, a derivation showing how that grammar can generate
the string xyz is as follows:
S ⇒ XSZ
⇒ xSZ
⇒ xYZ
⇒ xyYZ
⇒ xyZ
⇒ xyz
If the terminals of a grammar G are symbols in an alphabet Σ, we say that G is a
grammar over Σ. In such a case the strings generated by G are actually strings in Σ*.
Regular Grammars
A regular grammar is a grammar whose production rules (rewrite rules) conform to
the following restrictions:
- The left side of any production rule in a regular grammar must consist
of a single nonterminal.
- The right side must be a single terminal followed by a single nonterminal,
a single terminal, or the empty string.
e.g. S → xX
X → yY
Y → xX | ε
This is a regular grammar that generates strings consisting of one or more copies of
the pattern xy.
Note: Any rule of the form N → x in a regular grammar could be replaced by the
pair of rules
N → xX
X → ε
where X is a nonterminal that does not appear
elsewhere in the grammar, without altering the set of strings that could be
generated by the grammar.
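The claim that the example grammar generates one or more copies of xy can be checked for short strings. The sketch below is a simple breadth-first search over sentential forms; it assumes that nonterminals are exactly the keys of RULES and that, as in any regular grammar of this form, a sentential form contains at most one nonterminal, in last position.

```python
import re
from collections import deque

# S -> xX, X -> yY, Y -> xX | empty string
RULES = {"S": ["xX"], "X": ["yY"], "Y": ["xX", ""]}


def generate(max_len: int) -> set[str]:
    """Terminal strings with at most max_len symbols derivable from S."""
    results = set()
    queue = deque(["S"])
    while queue:
        form = queue.popleft()
        if form and form[-1] in RULES:
            head, nonterminal = form[:-1], form[-1]
            for right_side in RULES[nonterminal]:
                new_form = head + right_side
                terminals = sum(1 for c in new_form if c not in RULES)
                if terminals <= max_len:  # prune forms that are already too long
                    queue.append(new_form)
        else:
            results.add(form)  # no nonterminal left: a generated string
    return results
```

Every string returned by generate matches the pattern (xy)+ when tested with re.fullmatch.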
Regular expression
Regular expressions are used to define patterns of characters. They are just a form of
notation, used for describing sets of words.
For any given set of characters Σ, a regular expression over Σ is defined by:
- The empty string ε is a regular expression.
- Each member of Σ is a regular expression. For instance, if we write a as
a regular expression, this means "take the letter a from the input".
- If p and q are regular expressions, then so is p | q.
- If p and q are regular expressions, then so is p.q
- If p is a regular expression, then so is p*. The Kleene closure of a regular
expression, denoted by *, indicates zero or more occurrences of that
expression. Thus p* is the (infinite) set {ε, p, pp, ppp, ...} and means
"take zero or more p from the input".
The Chomsky hierarchy
Based on pioneering work by a linguist (Chomsky, 1959), computer scientists now
recognize four classes of grammar. The classification depends on the format of the
productions, and may be summarized as follows:
Type 0 Grammars (Unrestricted)
An unrestricted grammar is one in which there are virtually no restrictions on the
form of any of the productions, which have the general form
α → β with α ∈ (N ∪ T)* N (N ∪ T)*, β ∈ (N ∪ T)*
(thus the only restriction is that there must be at least one non-terminal symbol on
the left side of each production). The other types of grammars are more restricted;
to qualify as being of type 0 rather than one of these more restricted types it is
necessary for the grammar to contain at least one production α → β with |α| > |β|.
Type 1 Grammars (Context-sensitive)
In a type 1 or context-sensitive grammar, the left side of a production may not be
longer than its right side, so productions have the general form
α → β with |α| ≤ |β|, α ∈ (N ∪ T)* N (N ∪ T)*, β ∈ (N ∪ T)+
Strictly, it follows that the null string would not be allowed as a right side of any
production. However, this is sometimes overlooked, as ε-productions are often
needed to terminate recursive definitions. Indeed, the exact definition of
"context-sensitive" differs from author to author. In another definition, productions are
required to be limited to the form
αAβ → αγβ with α, β ∈ (N ∪ T)*, A ∈ N+, γ ∈ (N ∪ T)+
(It can be shown that the two definitions are equivalent.) Here we can see the
meaning of context-sensitive more clearly: A may be replaced by γ when A is
found in the context of (that is, surrounded by) α and β.
A much quoted simple example of such a grammar is as follows:
G = { N , T , S , P }
N = { A , B , C }
T = { a , b , c }
S = A
P =
A → aABC | abC (1, 2)
CB → BC (3)
bB → bb (4)
bC → bc (5)
cC → cc (6)
Let us derive a sentence using this grammar. A is the start symbol: let us choose to
apply production (1)
A ⇒ aABC
and then in this new string choose another production for A, namely (2), to derive
A ⇒* aabCBC
(where ⇒* means "derives in zero or more steps") and follow this by the use of (3).
(We could also have chosen (5) at this point.)
A ⇒* aabBCC
We follow this by using (4) to derive
A ⇒* aabbCC
followed by the use of (5) to get
A ⇒* aabbcC
followed finally by the use of (6) to give
A ⇒* aabbcc
However, with this grammar it is possible to derive a sentential form to which no
further productions can be applied. For example, after deriving the sentential form
aabCBC
if we were to apply (5) instead of (3) we would obtain
aabcBC
but no further production can be applied to this string. The consequence of such a
failure to obtain a terminal string is simply that we must try other possibilities until
we find those that yield terminal strings. The consequences for the reverse
problem, namely parsing, are that we may have to resort to considerable
backtracking to decide whether a string is a sentence in the language.
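The derivation and the dead end described above can be replayed step by step. In this sketch each production is a pair of strings, and apply_production (an illustrative helper, not part of the grammar) rewrites the leftmost occurrence of the left side.

```python
# Productions (1) to (6) of the example grammar.
PRODUCTIONS = {
    1: ("A", "aABC"),
    2: ("A", "abC"),
    3: ("CB", "BC"),
    4: ("bB", "bb"),
    5: ("bC", "bc"),
    6: ("cC", "cc"),
}


def apply_production(form: str, number: int) -> str:
    """Rewrite the leftmost occurrence of the production's left side."""
    left, right = PRODUCTIONS[number]
    if left not in form:
        raise ValueError(f"production ({number}) does not apply to {form}")
    return form.replace(left, right, 1)


def derive(numbers: list[int], start: str = "A") -> list[str]:
    """Return the sentential forms obtained by applying the productions in turn."""
    forms = [start]
    for n in numbers:
        forms.append(apply_production(forms[-1], n))
    return forms


def applicable(form: str) -> list[int]:
    """Numbers of the productions whose left side occurs in the form."""
    return [n for n, (left, _) in PRODUCTIONS.items() if left in form]
```

Calling derive([1, 2, 3, 4, 5, 6]) reproduces the derivation of aabbcc, while applicable("aabcBC") returns an empty list, which is the dead end discussed above.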
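Type 2 Grammars (Context-free)
A more restricted subset of context-sensitive grammars yields the type 2 or
context-free grammars. A grammar is context-free if the left side of every
production consists of a single non-terminal, and the right side consists of a
non-empty sequence of terminals and non-terminals, so that productions have the form
α → β with |α| ≤ |β|, α ∈ N, β ∈ (N ∪ T)+
that is
A → β with A ∈ N, β ∈ (N ∪ T)+
Type 3 Grammars (Regular)
Imposing still further constraints on productions yields the type 3 or regular
grammars. Such a grammar is right-linear if every production has the form
A → a or A → aB with a ∈ T, A, B ∈ N
and left-linear if every production has the form
A → a or A → Ba with a ∈ T, A, B ∈ N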
FINITE AUTOMATA
FINITE
The word finite is self-explanatory. It means having an end or limit, that is, having a
bounded number of elements.
AUTOMATA
These are devices for recognizing strings.
Therefore, a finite automaton is a mathematical model of a computing device that has discrete
inputs and outputs as well as a finite set of internal states. Finite automata are a formal way of
describing certain simple but highly useful languages called regular languages. A finite automaton
can also be viewed as a graph with a finite number of nodes called states; such automata are also
known as finite state machines.
A finite automaton is a device that has a processing unit with limited memory capacity and
no auxiliary memory. It receives input on a special input tape and reads it one character
at a time. Reading a character results in changing the state of the automaton and moving the
head one position to the right; the set of states is its finite control. The automaton has no
means to deliver output; however, some states can be designated as favorable. Thus, even
though an automaton does not produce any physical output, it can still be used as a recognition
device. We sometimes call these favorable states accepting states. Note that a start state could
also be a final or accepting state.
[Figure: a transition diagram showing a start state and an accepting state]
The language of a finite automaton is the set of strings that label paths that go from the
start state to an accepting state.
Formally, there are two types of finite automata:
- deterministic finite automata, and
- non-deterministic finite automata.
DETERMINISTIC FINITE AUTOMATA (DFA)
In the theory of computation, a deterministic finite automaton (DFA), also known as a
deterministic finite state machine, is a finite state machine where for each pair of state and
input symbol there is one and only one transition to the next state. DFAs recognize the set of
regular languages and no other languages.
The term deterministic is used because every move of the automaton is completely
determined by the input and its current state, and no choice is allowed. A DFA takes in a
string of input symbols. For each input symbol it makes a transition to the state given by
the transition function. When the last input symbol has been received, it either accepts or rejects the
string depending on whether the DFA is in an accepting state or a non-accepting state.
A deterministic finite automaton (DFA) is denoted by a
quintuple (5-tuple) A = (Q, Σ, δ, I, F)
where
- Q is a finite set of states;
- Σ is a finite input alphabet;
- δ is a transition function from Q × Σ to Q, written δ: Q × Σ → Q;
- I ∈ Q is the initial state of the automaton;
- F ⊆ Q is the set of favorable (final) states.
The transition function δ takes a state q in Q and a symbol a in Σ and returns a
new state r, which is the state the automaton should make a transition to upon reading the input
symbol a in the state q. I is simply the initial state, and F is a subset of Q called the set of final, or
accepting, states.
There are two ways of providing the five pieces of information needed to specify an automaton:
Transition diagrams and Transition tables.
The main part of the finite automaton is a function that defines allowable transitions for all
current states and all input symbols.
A transition diagram is a diagrammatic representation of an automaton and of how it responds
to its inputs. It is a collection of circles (the states), which are labeled for reference purposes
and are connected to each other by arrows known as arcs. Each arc is labeled with a symbol that
might occur in the input string being analyzed.
The initial state is marked by an inward arrow (→) and the accepting states by double circles.
A transition table is a tabular representation of a transition diagram. The rows of the table are
labeled by the states and the columns by the input letters. The initial state is marked by
an inward arrow (→) and the accepting states by an outward arrow (←). In the case where a state
is both the initial state and a final state, it is marked by a double-headed arrow (↔).
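For example, consider the DFA M with states S, input alphabet Σ, start state s and set of
accepting states A given by
S = {S1, S2},
Σ = {0, 1},
s = S1,
A = {S1}, and
T defined by the following state transition table:
        0     1
S1      S2    S1
S2      S1    S2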
Simply put, the state S1 represents that there has been an even number of 0s in the input so far,
while S2 signifies an odd number. A 1 in the input does not change the state of the automaton.
When the input ends, the state will show whether the input contained an even number of 0s or
not.
The language of M is the regular language of all strings over {0, 1} that contain an even
number of 0s; it is given, for example, by the regular expression 1*(01*01*)*.
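The automaton M described above is small enough to simulate directly. This sketch stores the transition function T as a dictionary keyed by (state, symbol) pairs; the Python names are illustrative.

```python
# Transition function T of the DFA M: a 0 toggles the state, a 1 leaves it unchanged.
TRANSITIONS = {
    ("S1", "0"): "S2", ("S1", "1"): "S1",
    ("S2", "0"): "S1", ("S2", "1"): "S2",
}
START = "S1"
ACCEPTING = {"S1"}


def accepts(word: str) -> bool:
    """Run M on the word and report whether it ends in an accepting state."""
    state = START
    for symbol in word:
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING
```

The state after reading the input records only the parity of the number of 0s seen so far, which is why two states suffice.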
The standard model used is the finite-state automaton. We can convert any definition involving
regular expressions into an implementable finite automaton in two steps:
Regular expression → NFA → DFA
The purpose of an NFA is to model the process of reading in characters until we have formed
one of the words that we are looking for. A non-deterministic automaton is exactly like a
deterministic automaton except that we allow multiple initial states and we impose no restrictions on
transitions as long as they are labeled by symbols in the input alphabet. Non-deterministic
automata are regarded as tools helpful in designing deterministic automata rather than as real-life
machines.
Therefore, the formal definition of an NFA is given below:
An NFA is a quintuple A = (Q, Σ, δ, I, F) where Q, Σ and F are as for a DFA, I ⊆ Q is the set of
initial states, and the transition function δ: Q × Σ → P(Q) returns a (possibly empty) set of
possible next states.
Example 2
Create an NFA that accepts the string aabba.
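One possible answer to this exercise, given here as a sketch rather than as the intended solution, uses six states in a row, where state i means that the first i symbols of aabba have been read. Missing transitions are simply absent, which is allowed in an NFA.

```python
WORD = "aabba"

# State i --WORD[i]--> state i + 1; every other transition is undefined.
DELTA = {(i, symbol): {i + 1} for i, symbol in enumerate(WORD)}
INITIAL = {0}
FINAL = {len(WORD)}


def nfa_accepts(word: str) -> bool:
    """Track the set of states the NFA could be in after each symbol."""
    current = set(INITIAL)
    for symbol in word:
        current = {t for s in current for t in DELTA.get((s, symbol), set())}
    return bool(current & FINAL)
```

The simulation keeps a set of current states, so the same function works unchanged for NFAs in which a state has several arcs with the same label.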
Pushdown automata are finite automata endowed with access to an unbounded stack, and linear
bounded automata are Turing machines
whose read/write head is not allowed to leave the region of tape where the input is given.
Since there are relatively strong limitations on what a finite automaton can do, one may
well ask whether there is a way to characterize all languages that can actually be recognized by finite
automata. This set of languages is known as the set of regular languages and can be specified by
regular expressions. Regular languages form a very restricted class of languages, in
keeping with the relative weakness of finite automata. Since pushdown automata are more
powerful than finite automata, they can recognize a more general class of languages known as
context-free languages, whereas linear bounded automata can recognize an even more general
class: the context-sensitive languages. Finally, Turing machines, the most powerful machines, can
recognize any language that can be recognized by an algorithmic procedure. Such a language is
called recursively enumerable. This hierarchy of machines and languages is intimately related to
the theory of grammars and is called the Chomsky hierarchy after the linguist Noam Chomsky.
Regular expressions
Regular expressions are an algebraic equivalent of finite automata. They are used in many
places as a language for describing simple patterns in text.
Let A = {a1, a2, a3} be an alphabet. A regular expression over A is a sequence of symbols
formed by repeated applications of the following rules:
(R1) ∅ is a regular expression.
(R2) ε is a regular expression.
(R3) a1, a2, a3 are regular expressions.
(R4) R1 ∪ R2 is a regular expression if R1, R2 are regular expressions.
(R5) R1R2 is a regular expression if R1, R2 are regular expressions.
(R6) (R1)* is a regular expression if R1 is a regular expression.
(R7) Every regular expression arises by a finite number of applications of the rules (R1) to (R6).
Examples
L(001) = {001}
L(ab*a) = {aa, aba, abba, abbba, ...}
L(a*b*) = {ε, a, b, abb, aab, aaabbb, aaaaaabb, aaaaa, bbbbbb, ...}
L(a*b*a*) = {ε, a, b, ab, aa, ba, aaab, abbba, baaaaa, ...}
L((a ∪ b)*) = {ε, a, b, aa, bb, aaaa, bbbb, ...}
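The sample members listed in these examples can be confirmed with Python's re module, which writes union as |. The dictionary EXAMPLES below simply restates the examples; re.fullmatch checks that the whole string, not just a part of it, matches the expression.

```python
import re

# Each regular expression with the sample members listed for it.
EXAMPLES = {
    r"001": ["001"],
    r"ab*a": ["aa", "aba", "abba", "abbba"],
    r"a*b*": ["", "a", "b", "abb", "aab", "aaabbb", "aaaaaabb", "aaaaa", "bbbbbb"],
    r"a*b*a*": ["", "a", "b", "ab", "aa", "ba", "aaab", "abbba", "baaaaa"],
    r"(a|b)*": ["", "a", "b", "aa", "bb", "aaaa", "bbbb"],
}


def all_members_match() -> bool:
    """True if every listed string belongs to the language of its expression."""
    return all(re.fullmatch(pattern, word) is not None
               for pattern, words in EXAMPLES.items()
               for word in words)
```

Strings outside a language are rejected in the same way; for example ba does not match a*b*, because every a must come before every b.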
The languages accepted by DFA, NFA, and NFA-ε, or expressed by RE, are called the regular
languages.
A regular language over an alphabet Σ is one that can be obtained from basic languages
using the operations of union, concatenation and Kleene star. A regular language can therefore be
described by an explicit formula. It is common to simplify the formula slightly by leaving out
the brackets {} or replacing them with parentheses, and by replacing ∪ (i.e. union) by +; the
result is called a regular expression.
Here are several examples of regular languages over the alphabet {0, 1}, along with the
corresponding regular expressions.
    Language                          Regular expression
1.  {ε}                               ε
2.  {0}                               0
3.  {001} (i.e. {0}{0}{1})            001
4.  {0,1} (i.e. {0} ∪ {1})            0+1
5.  {0,10} (i.e. {0} ∪ {10})          0+10
6.  {1,ε}{001}                        (1+ε)001
7.  {110}*{0,1}                       (110)*(0+1)
8.  {1}*{10}                          1*10
9.  {10,111,11010}*                   (10+111+11010)*
10. {0,10}*({11}* ∪ {001,ε})          (0+10)*((11)*+001+ε)
1. The NFA representing the empty string is:
[Figure: NFA for ε]
2. If the regular expression is just a character, e.g. a, then the corresponding NFA is:
[Figure: NFA for a]
3. The union operator is represented by a choice of transitions from a node; thus a|b can be
represented as:
[Figure: NFA for a|b]
4. Concatenation simply involves connecting one NFA to the other; e.g. ab is:
[Figure: NFA for ab]
5. The Kleene closure must allow for taking zero or more instances of the letter from the
input; thus a* looks like:
[Figure: NFA for a*]
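The five cases above can be sketched as functions that build NFA fragments. The code follows the usual Thompson-style construction with ε-arcs, which is an assumption here since the original diagrams may glue the pieces together differently; all names are illustrative, and each sub-expression must be built afresh rather than reused.

```python
from itertools import count

_ids = count()  # source of fresh state numbers
EPS = None      # label used for epsilon arcs


class Fragment:
    """An NFA piece with one start state, one accept state and a list of arcs."""
    def __init__(self, start, accept, arcs):
        self.start, self.accept, self.arcs = start, accept, arcs


def empty():
    s, t = next(_ids), next(_ids)
    return Fragment(s, t, [(s, EPS, t)])


def symbol(a):
    s, t = next(_ids), next(_ids)
    return Fragment(s, t, [(s, a, t)])


def union(f, g):
    s, t = next(_ids), next(_ids)
    arcs = f.arcs + g.arcs + [(s, EPS, f.start), (s, EPS, g.start),
                              (f.accept, EPS, t), (g.accept, EPS, t)]
    return Fragment(s, t, arcs)


def concat(f, g):
    return Fragment(f.start, g.accept, f.arcs + g.arcs + [(f.accept, EPS, g.start)])


def star(f):
    s, t = next(_ids), next(_ids)
    arcs = f.arcs + [(s, EPS, f.start), (f.accept, EPS, t),
                     (s, EPS, t), (f.accept, EPS, f.start)]
    return Fragment(s, t, arcs)


def closure(states, arcs):
    """All states reachable from the given ones using epsilon arcs only."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for p, label, r in arcs:
            if p == q and label is EPS and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def accepts(fragment, word):
    current = closure({fragment.start}, fragment.arcs)
    for ch in word:
        moved = {r for p, label, r in fragment.arcs if p in current and label == ch}
        current = closure(moved, fragment.arcs)
    return fragment.accept in current
```

For example, star(union(symbol("a"), symbol("b"))) builds an NFA for (a|b)*.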
Summary
Regular Expressions: Let Σ be an alphabet. A regular expression is constructed from the
symbols ∅, ε and a, where a ∈ Σ, together with the symbols +, ., * and left and right brackets,
according to the following rules: ∅, ε and a are regular expressions, and if s and t are regular
expressions so are (s+t), (s.t), (s*).
Regular Languages: Every regular expression r describes a language L(r). A language is regular
if it can be described by a regular expression.
The ε-automata
Every non-deterministic automaton with ε-transitions can be converted into a non-deterministic
automaton without ε-transitions that recognizes the same language.
NFA with ε-transitions
In an NFA with ε-transitions, ε may appear as a label on arcs as if it were an input symbol, but
when an arc labeled ε is traversed no input is consumed.
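One standard way to carry out the conversion mentioned above is sketched below; it is not necessarily the construction used in the lectures. Each state q inherits every labeled arc that leaves a state reachable from q by ε-arcs alone, and q becomes final if a final state is reachable in that way. Arcs are triples (from, label, to), and the empty Python string stands for ε, which is an encoding chosen for this sketch.

```python
EPS = ""  # label standing for an epsilon arc


def epsilon_closure(state, arcs):
    """States reachable from the given state using epsilon arcs only."""
    seen, stack = {state}, [state]
    while stack:
        q = stack.pop()
        for p, label, r in arcs:
            if p == q and label == EPS and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def remove_epsilon(states, arcs, initial, finals):
    """Return (arcs, initial, finals) of an equivalent NFA without epsilon arcs."""
    new_arcs, new_finals = set(), set()
    for q in states:
        reach = epsilon_closure(q, arcs)
        if reach & finals:
            new_finals.add(q)
        for p, label, r in arcs:
            if p in reach and label != EPS:
                new_arcs.add((q, label, r))
    return new_arcs, set(initial), new_finals


def accepts(arcs, initial, finals, word):
    """Simulate an NFA that has no epsilon arcs."""
    current = set(initial)
    for ch in word:
        current = {r for p, label, r in arcs if p in current and label == ch}
    return bool(current & finals)


# An NFA with one epsilon arc that recognizes a*b*.
STATES = {0, 1}
ARCS = {(0, "a", 0), (0, EPS, 1), (1, "b", 1)}
NEW_ARCS, NEW_INITIAL, NEW_FINALS = remove_epsilon(STATES, ARCS, {0}, {1})
```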
NON-DETERMINISTIC FINITE AUTOMATA VERSUS DETERMINISTIC FINITE
AUTOMATA
Recognition with a deterministic finite state automaton requires less memory than with a
non-deterministic one. Since no alternatives have to be stored, the search can
consume constant memory; that is, it never has to use more memory than that used to traverse
the first arc.
A deterministic finite state automaton is likely (on average) to be faster than a
non-deterministic finite state automaton because there are fewer arcs that can be traversed for any
given input.
Finite state automata that are non-deterministic in recognition have two or more arcs
with the same label emanating from a given state. In a deterministic automaton, no two arcs
from one state can have the same label. This means that when
recognizing some input, the finite state machine can only go along one route through the
network. Any non-deterministic finite state automaton
can be converted into a corresponding deterministic finite state automaton.
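The conversion from a non-deterministic to a deterministic automaton is usually done with the subset construction, sketched below under the assumption that the NFA has no ε-arcs. Each state of the resulting DFA is a set of NFA states; the function names and the example NFA (strings over {a, b} that end in ab) are illustrative.

```python
def determinize(arcs, initial, finals, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start = frozenset(initial)
    table = {}
    todo = [start]
    while todo:
        subset = todo.pop()
        if subset in table:
            continue
        table[subset] = {}
        for a in alphabet:
            # All NFA states reachable from the subset by one arc labeled a.
            target = frozenset(r for p, label, r in arcs
                               if p in subset and label == a)
            table[subset][a] = target
            todo.append(target)
    accepting = {s for s in table if s & finals}
    return table, start, accepting


def dfa_accepts(table, start, accepting, word):
    """Follow the single possible route through the DFA."""
    state = start
    for ch in word:
        state = table[state][ch]
    return state in accepting


# NFA with two arcs labeled a leaving state 0.
NFA_ARCS = {(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "b", 2)}
TABLE, START, ACCEPTING = determinize(NFA_ARCS, {0}, {2}, "ab")
```

In TABLE every state has exactly one arc per input symbol, so recognition never has to remember alternatives.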
Automata theory: formal languages and formal grammars
Chomsky hierarchy   Grammars                     Languages                    Minimal automaton
Type-0              Unrestricted                 Recursively enumerable       Turing machine
Type-1              Context-sensitive            Context-sensitive            Linear-bounded
Type-2              Context-free                 Context-free                 Nondeterministic pushdown
n/a                 Deterministic context-free   Deterministic context-free   Deterministic pushdown
Type-3              Regular                      Regular                      Finite
Each category of languages or grammars is a proper subset of the category directly above it.