
Formal Languages, Automata and Grammar
SAURABH SINGH

Language
Language is the ability to acquire and use complex systems of communication, particularly the human ability to do so, and a language is any specific example of such a system.
A language is the system of words or signs that people use to express thoughts and feelings to each other, or any one of the systems of human language that are used and understood by a particular group of people.
The scientific study of language is called linguistics.

Linguistics
In linguistics, formal languages are used for the scientific study of

human language.
Linguists privilege a generative approach: they are interested in defining a (finite) set of grammar rules from which any reasonable sentence in the language can be constructed.
A grammar does not describe the meaning of the sentences or what can be done with them in whatever context, but only their form.

Chomsky Hierarchy
Noam Chomsky (b. 1928) is an American linguist, philosopher, cognitive scientist, historian, and activist.


In Syntactic Structures (1957), Chomsky models knowledge of language
using a formal grammar, by claiming that formal grammars can explain
the ability of a hearer/speaker to produce and interpret an infinite
number of sentences with a limited set of grammatical rules and a finite
set of terms.
The human brain contains a limited set of rules for organizing language,
known as Universal Grammar. The basic rules of grammar are hard-wired
into the brain, and manifest themselves without being taught.

Chomsky Hierarchy
Chomsky proposed a hierarchy that partitions formal grammars into

classes with increasing expressive power, i.e. each successive class can
generate a broader set of formal languages than the one before.
Interestingly, modelling some aspects of human language requires a
more complex formal grammar (as measured by the Chomsky
hierarchy) than modelling others.
For example, while a regular language is powerful enough to model English morphology (symbols, words), it is not powerful enough to model English syntax.

Linguistics in Computer Science


In computer science, formal languages are used for the precise

definition of programming languages and, therefore, in the


development of compilers.
A compiler is a computer program (or set of programs) that transforms
source code written in a programming language (the source language)
into another computer language (the target language).
Computer scientists privilege a recognition approach based on abstract machines (automata) that take a sentence as input and decide whether it belongs to the reference language.

Automata and Grammar


Which class of formal languages is recognized by a given type of automaton?
There is an equivalence between the Chomsky hierarchy and the
different kinds of automata. Thus, theorems about formal languages
can be dealt with as either grammars or automata.

Describing formal languages: generative approach


Generative approach

A language is the set of strings generated by a grammar.

Generation process

Start from the start symbol.
Expand with rewrite rules.
Stop when a word of the language is generated.

Pros and Cons

The generative approach is appealing to humans.


Grammars are formal, informative, compact, finite descriptions for possibly infinite languages,
but are clearly inefficient if implemented naively.

Describing formal languages: recognition approach


Recognition approach

A language is the set of strings accepted by an automaton.

Recognition process

Start in the initial state.
Transition to other states as guided by the string's symbols.
Continue until the whole string is read and an accept/reject state is reached.

Pros and Cons

The recognition approach is appealing to machines.


Automata are formal, compact, low-level machines that can be implemented easily and efficiently, but can be hard for humans to understand.

Formal languages: definition and basic notions


Formal language

A formal language is a set of words, that is, finite strings of symbols taken from the alphabet over which the language is defined.

Alphabet: a finite, non-empty set of symbols.

Example

Σ1 = { 0, 1 }
Σ2 = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
Σ3 = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F }
Σ4 = { a, b, c, …, z }

Notation

a, b, c, . . . denote symbols

Formal languages: definition and basic notions


String (or word) over an alphabet Σ: a finite sequence of symbols in Σ.
Example

1010 is a string over Σ1
123 is a string over Σ2
hello is a string over Σ4

Notation

ε is the empty string
v, w, x, y, z, . . . denote strings
|w| is the length of w (the number of symbols in w).

Example

|a| = 1
|125| = 3
|ε| = 0

Formal languages: definition and basic notions


k-th power of an alphabet Σ:

Σ^k = { a1…ak | a1, …, ak ∈ Σ }

Example

Σ^0 = {ε} for any Σ
Σ1^1 = { 0, 1 }
Σ1^2 = { 00, 01, 10, 11 }

Kleene closures of an alphabet Σ:

Σ* = Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ …
Σ+ = Σ^1 ∪ Σ^2 ∪ …
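As an illustration (not part of the original slides), the k-th power of a small alphabet can be enumerated directly, and Σ* can be approximated up to a length bound; a minimal Python sketch:

```python
import itertools

def power(alphabet, k):
    """The k-th power of an alphabet: all strings of length exactly k."""
    return {"".join(p) for p in itertools.product(sorted(alphabet), repeat=k)}

def kleene_star_upto(alphabet, max_len):
    """Finite approximation of Sigma*: all strings of length <= max_len."""
    out = set()
    for k in range(max_len + 1):
        out |= power(alphabet, k)
    return out

sigma1 = {"0", "1"}
print(power(sigma1, 0))  # {''} -- only the empty string
print(power(sigma1, 2))  # {'00', '01', '10', '11'}
```

Note that Σ^0 = {ε} falls out of the code naturally: a product with `repeat=0` yields one empty tuple.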

Formal languages: definition and basic notions


String operations:

vw is the concatenation of v and w
v is a substring of w iff xvy = w for some strings x, y
v is a prefix of w iff vy = w for some string y
v is a suffix of w iff xv = w for some string x

Example

εw = wε = w
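These operations map directly onto Python's string primitives; a small sketch:

```python
v, w = "he", "hello"

# concatenation: vw
assert v + "llo" == w

# v is a prefix of w  iff  vy = w for some y
assert w.startswith(v)

# v is a suffix of w  iff  xv = w for some x
assert w.endswith("llo")

# v is a substring of w  iff  xvy = w for some x, y
assert "ell" in w

# the empty string is the identity for concatenation
assert "" + w == w + "" == w

# |w| is the length of w
assert len(w) == 5 and len("") == 0
```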

Formal languages: definition and basic notions


Formal language: mathematical definition

A language over a given alphabet Σ is any subset of Σ*.

Example

English, Chinese, . . .
C, Pascal, Java, HTML, . . .
the set of binary numbers whose value is prime: { 10, 11, 101, 111, 1011, … }
∅ (the empty language)
{ε}

Operations on languages
Let L1 and L2 be languages over the alphabets Σ1 and Σ2, respectively. Then:

L1 ∪ L2 = { w | w ∈ L1 ∨ w ∈ L2 }  (union)
¬L1 = { w ∈ Σ1* | w ∉ L1 }  (complement)
L1L2 = { w1w2 | w1 ∈ L1 ∧ w2 ∈ L2 }  (concatenation)
L1* = {ε} ∪ L1 ∪ L1^2 ∪ …  (Kleene star)
L1 ∩ L2 = { w | w ∈ L1 ∧ w ∈ L2 }  (intersection)
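For finite languages these operations can be computed directly with Python sets; a minimal sketch (the bounded star is only an approximation, since L* is generally infinite):

```python
def concat(L1, L2):
    """Concatenation L1L2 = { w1w2 | w1 in L1 and w2 in L2 }."""
    return {w1 + w2 for w1 in L1 for w2 in L2}

def star_upto(L, k):
    """Finite approximation of L*: at most k concatenations of words from L."""
    out, cur = {""}, {""}
    for _ in range(k):
        cur = concat(cur, L)
        out |= cur
    return out

L1, L2 = {"0", "01"}, {"1"}
assert L1 | L2 == {"0", "01", "1"}                  # union
assert L1 & L2 == set()                             # intersection
assert concat(L1, L2) == {"01", "011"}              # concatenation
assert star_upto({"ab"}, 2) == {"", "ab", "abab"}   # bounded Kleene star
```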

Grammar
A grammar is a tuple G = (V, T, S, P) where

V is a finite, non-empty set of symbols called variables (or non-terminals, or syntactic categories)
T is an alphabet of symbols called terminals
S ∈ V is the start (or initial) symbol of the grammar
P is a finite set of productions α → β where α ∈ (V ∪ T)+ and β ∈ (V ∪ T)*

In the example "I eat apple":

V = {Sentence, Subject, Verb, Object}, T = {I, You, Eat, Buy, Pen, Apple}, S = Sentence, and
P = {Sentence → Subject Verb Object, Subject → I | You, Verb → Eat | Buy, Object → Pen | Apple}.
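The toy grammar above generates exactly the sentences obtained by choosing one alternative each for Subject, Verb, and Object; a sketch enumerating them:

```python
from itertools import product

subjects = ["I", "You"]
verbs = ["Eat", "Buy"]
objects = ["Pen", "Apple"]

# Sentence -> Subject Verb Object, expanding every combination of alternatives
sentences = {" ".join(t) for t in product(subjects, verbs, objects)}

print(len(sentences))               # 8 sentences in total (2 * 2 * 2)
print("I Eat Apple" in sentences)   # True
```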

The Chomsky hierarchy: summary

Level | Language type     | Grammar rules            | Accepting machines
------|-------------------|--------------------------|------------------------------------------
3     | Regular           | X → ε, X → Y, X → aY     | NFAs (or DFAs)
2     | Context-free      | X → α                    | Nondeterministic pushdown automata
1     | Context-sensitive | α → β with |α| ≤ |β|     | Nondeterministic linear bounded automata
0     | Unrestricted      | α → β (unrestricted)     | Turing machines

(Type-3) Regular Grammar

Type-3 grammars generate regular languages. Type-3 grammars must have a single non-terminal on the left-hand side and a right-hand side consisting of a single terminal or a single terminal followed by a single non-terminal.
The productions must be of the form X → a or X → aY,
where X, Y ∈ N (non-terminals)
and a ∈ T (a terminal).
The rule S → ε is allowed if S does not appear on the right side of any rule.
Example

S → aB
B → bB
B → ε
What language does this define? ab*
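A breadth-first expansion of the rewrite rules confirms the answer ab*; a sketch (upper-case letters are treated as non-terminals, and B → ε is encoded as the empty string):

```python
from collections import deque

rules = {"S": ["aB"], "B": ["bB", ""]}

def generate(max_len):
    """All terminal strings of length <= max_len derivable from S."""
    words, queue = set(), deque(["S"])
    while queue:
        form = queue.popleft()
        nt = next((i for i, c in enumerate(form) if c.isupper()), None)
        if nt is None:                  # no non-terminal left: a finished word
            if len(form) <= max_len:
                words.add(form)
            continue
        if len(form) > max_len + 1:     # prune over-long sentential forms
            continue
        for rhs in rules[form[nt]]:
            queue.append(form[:nt] + rhs + form[nt + 1:])
    return words

print(generate(3))  # {'a', 'ab', 'abb'} -- exactly ab* up to length 3
```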

Finite Automata
A finite automaton is a 5-tuple M = (Q, Σ, δ, q0, F)

Q is the finite set of states
Σ is the alphabet
δ : Q × Σ → Q is the transition function
q0 ∈ Q is the start state
F ⊆ Q is the set of accept states

L(M) = the language of machine M = the set of all strings machine M accepts

Finite Automata
M = (Q, Σ, δ, q0, F) where

Q = {q0, q1, q2, q3}
Σ = {0, 1}
δ : Q × Σ → Q is the transition function
q0 ∈ Q is the start state
F = {q1, q2} ⊆ Q is the set of accept states

[State diagram: four states q0 to q3 with transitions labelled 0 and 1; q1 and q2 are the accept states.]

Build an automaton that accepts all and only those strings that contain 001.

[State diagram: state q (nothing matched), q0 (seen 0), q00 (seen 00), and q001 (seen 001); q001 is accepting and absorbs both 0 and 1.]
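The "contains 001" automaton can be simulated with a transition table; a sketch using the state names from the diagram:

```python
# Each state records how much of the pattern 001 has been seen so far.
delta = {
    ("q", "0"): "q0",      ("q", "1"): "q",
    ("q0", "0"): "q00",    ("q0", "1"): "q",
    ("q00", "0"): "q00",   ("q00", "1"): "q001",
    ("q001", "0"): "q001", ("q001", "1"): "q001",  # absorbing accept state
}

def accepts(s):
    state = "q"
    for c in s:
        state = delta[(state, c)]
    return state == "q001"

print(accepts("0010"))  # True -- contains 001
print(accepts("0101"))  # False
```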

Limits of Regular languages and finite automata

What types of languages can't FAs accept? In other words, what limits are there on the complexity of regular languages?
FAs lack memory, so one part of a regular language cannot depend on another part.

(Type-2) Context-free Grammars

Type-2 grammars generate context-free languages.
The productions must be of the form A → α, where A ∈ N (a non-terminal) and α ∈ (T ∪ N)* (a string of terminals and non-terminals).
The languages generated by these grammars are recognized by a nondeterministic pushdown automaton.
Example

S → X
X → ab | aXb
L = { a^n b^n | n ≥ 1 }

Whereas in regular grammars, non-terminals were restricted as to where they could appear in a rule, here the right-hand side α is unrestricted. The name "context-free" reflects that a rule A → α may be applied wherever A occurs, regardless of its surrounding context.
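A derivation in this grammar applies X → aXb repeatedly and finishes with X → ab; a sketch printing the sentential forms for a^n b^n:

```python
def derivation(n):
    """Sentential forms in the derivation of a^n b^n (n >= 1)."""
    forms = ["S", "X"]                            # S -> X
    current = "X"
    for _ in range(n - 1):
        current = current.replace("X", "aXb")     # X -> aXb
        forms.append(current)
    forms.append(current.replace("X", "ab"))      # X -> ab
    return forms

print(derivation(3))
# ['S', 'X', 'aXb', 'aaXbb', 'aaabbb']
```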

Pushdown Automata
Pushdown automata extend FAs in one very important way. We are
now given a stack on which we can store information. This works like a
standard LIFO stack, where information gets pushed onto the top and
popped off the top.
This means that we can now choose transitions based not just on the input, but also on what's on the top of the stack.
We also now have transition actions available to us. We can either
push a specific element to the top of the stack, or pop the top element
off the stack.

Pushdown Automaton (PDA)

A PDA has a finite control and a single unbounded stack; it models a finite program plus one unbounded stack of bounded registers. The machine below accepts

L = { a^n b^n # : n ≥ 1 }

with transitions written as "input, stack-top/string pushed" (# is an end-of-input marker, $ the bottom-of-stack marker, and the top of the stack is written leftmost):

a, $/A$   (first a: push an A on top of $)
a, A/AA   (each further a: push another A)
b, A/ε    (each b: pop one A)
#, $/ε    (on #, accept if only $ remains)

Pushdown Automaton (PDA)

Run on input aaabbb#: the three a's push A A A onto the stack above $, the three b's pop them again, and # then pops the remaining $, so the run is accepting.
Pushdown Automaton (PDA)

Run on input aaabbbb#: after the three A's have been popped, the fourth b finds only $ on top of the stack and no transition applies, so the run is rejecting.

Pushdown Automaton (PDA)

Run on input aaabb#: when # is read, an A is still on top of the stack and no transition applies, so the run is rejecting.
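The runs above can be reproduced with a small PDA simulator (deterministic for this language); a sketch where the states p, q and the transition table follow the "a, $/A$"-style labels from the slides:

```python
# (state, input symbol, stack top) -> (new state, string pushed in its place)
delta = {
    ("p", "a", "$"): ("p", "A$"),   # first a: push A above the bottom marker
    ("p", "a", "A"): ("p", "AA"),   # further a's: push another A
    ("p", "b", "A"): ("q", ""),     # first b: pop an A, switch state
    ("q", "b", "A"): ("q", ""),     # further b's: pop one A each
    ("q", "#", "$"): ("acc", ""),   # end marker: accept iff only $ was left
}

def pda_accepts(s):
    state, stack = "p", ["$"]       # top of stack is the end of the list
    for c in s:
        if not stack:
            return False
        key = (state, c, stack.pop())
        if key not in delta:
            return False            # machine is stuck: reject
        state, push = delta[key]
        stack.extend(reversed(push))
    return state == "acc" and not stack

print(pda_accepts("aaabbb#"))   # True  (the accepting run)
print(pda_accepts("aaabbbb#"))  # False (extra b)
print(pda_accepts("aaabb#"))    # False (an A is left on the stack)
```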

(Type-1) Context-Sensitive Grammar
Type-1 grammars generate context-sensitive languages. The productions must be of the form

αAβ → αγβ

where A ∈ N (a non-terminal) and α, β, γ ∈ (T ∪ N)* (strings of terminals and non-terminals).
The strings α and β may be empty, but γ must be non-empty.
The rule S → ε is allowed if S does not appear on the right side of any rule. The languages generated by these grammars are recognized by a linear bounded automaton.
Example

S → abc | aAbc
Ab → bA
Ac → Bbcc
bB → Bb
aB → aa | aaA

L = { a^n b^n c^n : n ≥ 1 }

Alternate definition:

P = { α → β : |α| ≤ |β| }

The accepting machine's memory is bounded by (a linear function of) the length of the input.
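Membership in { a^n b^n c^n : n ≥ 1 } is easy to check directly by counting, even though no pushdown automaton can recognize this language; a sketch:

```python
def in_anbncn(s):
    """Check membership in { a^n b^n c^n : n >= 1 } by counting."""
    n, r = divmod(len(s), 3)
    return r == 0 and n >= 1 and s == "a" * n + "b" * n + "c" * n

print(in_anbncn("aabbcc"))  # True
print(in_anbncn("aabcbc"))  # False
```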

(Type-0) Unrestricted Grammar

Type-0 grammars generate recursively enumerable languages. The productions have no restrictions; Type-0 is the class of all phrase structure grammars and includes all formal grammars.
They generate the languages that are recognized by a Turing machine.
The productions are of the form α → β, where α is a string of terminals and non-terminals with at least one non-terminal and cannot be null, and β is any string of terminals and non-terminals.
Example
S → ACaB
Bc → acB
CB → DB
aD → Db

Turing Machines
Recall that NFAs are essentially memory-less, whilst NPDAs are equipped with memory in the form of a stack.
To find the right kinds of machines for the top two Chomsky levels, we need to allow more general manipulation of memory.
A Turing machine essentially consists of a finite-state control unit, equipped with a memory tape, infinite in both directions. Each cell on the tape contains a symbol drawn from a finite alphabet Γ.

Turing Machines cont.

At each step, the behaviour of the machine can depend on the current state of the control unit and the tape symbol at the current read position.
Depending on these, the machine may then overwrite the current tape symbol with a new symbol, shift the tape left or right by one cell, and jump to a new control state.
This happens repeatedly until (let's say) the control unit enters some final state.

Turing Machines cont.

To use a Turing machine T as an acceptor for a language over Σ, set up the tape with the test string s ∈ Σ* written left-to-right starting at the read position, and with blank symbols everywhere else.
Then let the machine run (maybe overwriting s), and if it enters the final state, declare that the original string s is accepted.
The language accepted by T (written L(T)) consists of all strings s that are accepted in this way.
Theorem: A set L ⊆ Σ* is generated by some unrestricted (Type 0) grammar if and only if L = L(T) for some Turing machine T. So both Type 0 grammars and Turing machines lead to the same class of recursively enumerable languages.

Turing Machines cont.

A Turing machine T consists of:
A set Q of control states
An initial state i ∈ Q
A final (accepting) state f ∈ Q
A tape alphabet Γ
An input alphabet Σ ⊆ Γ
A blank symbol ⊔ ∈ Γ − Σ
A transition function δ : Q × Γ → Q × Γ × {L, R}.
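As an illustration of this definition (a hypothetical machine, not from the slides), here is a sketch of a Turing machine that accepts { a^n b^n : n ≥ 0 } by repeatedly crossing off a matching a and b, something no finite automaton can do:

```python
BLANK = "_"

# delta: (state, tape symbol) -> (new state, symbol written, head move)
delta = {
    ("q0", "a"): ("q1", "X", +1),     # cross off the leftmost a
    ("q0", "Y"): ("q3", "Y", +1),     # all a's crossed: check only Y's remain
    ("q0", BLANK): ("acc", BLANK, +1),
    ("q1", "a"): ("q1", "a", +1),     # scan right over a's and Y's
    ("q1", "Y"): ("q1", "Y", +1),
    ("q1", "b"): ("q2", "Y", -1),     # cross off the matching b
    ("q2", "a"): ("q2", "a", -1),     # scan back left
    ("q2", "Y"): ("q2", "Y", -1),
    ("q2", "X"): ("q0", "X", +1),
    ("q3", "Y"): ("q3", "Y", +1),
    ("q3", BLANK): ("acc", BLANK, +1),
}

def tm_accepts(s):
    tape = {i: c for i, c in enumerate(s)}   # sparse tape, blank elsewhere
    state, head = "q0", 0
    while state != "acc":
        key = (state, tape.get(head, BLANK))
        if key not in delta:
            return False                      # no applicable transition: reject
        state, sym, move = delta[key]
        tape[head] = sym
        head += move
    return True

print(tm_accepts("aabb"))  # True
print(tm_accepts("aab"))   # False
```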

Linear bounded automata

Suppose we modify our model to allow just a finite tape, initially containing just the test string s with end markers on either side.
The machine therefore has just a finite amount of memory, determined by the length of the input string. We call this a linear bounded automaton.
(LBAs are sometimes defined as having tape length bounded by a constant multiple of the length of the input string; this makes no difference in principle.)
Theorem: A language L is context-sensitive if and only if L = L(T) for some non-deterministic linear bounded automaton T.

The Chomsky hierarchy: summary

Level | Language type     | Grammar rules            | Accepting machines
------|-------------------|--------------------------|------------------------------------------
3     | Regular           | X → ε, X → Y, X → aY     | NFAs (or DFAs)
2     | Context-free      | X → α                    | Nondeterministic pushdown automata
1     | Context-sensitive | α → β with |α| ≤ |β|     | Nondeterministic linear bounded automata
0     | Unrestricted      | α → β (unrestricted)     | Turing machines