
Formal Languages and Chomsky hierarchy

SAURABH SINGH
SCIENTIFIC ANALYSIS GROUP
DEFENCE R&D ORGANISATION, DELHI

saurabhsingh@sag.drdo.in
1st March 2016
Language

Language is the ability to acquire and use complex systems of communication, particularly the human ability to do so, and a language is any specific example of such a system. (Wikipedia)
The system of words or signs that people use to express thoughts and feelings to each other, or any one of the systems of human language that are used and understood by a particular group of people.
The scientific study of language is called linguistics.
Linguistics

In linguistics, formal languages are used for the scientific study of human language.
Linguists favour a generative approach: they are interested in defining a (finite) set of rules, the grammar, from which any reasonable sentence in the language can be constructed.
A grammar does not describe the meaning of the sentences or what can be done with them in whatever context; it describes only their form.
Chomsky

Noam Chomsky (b. 1928) is an American linguist, philosopher, cognitive scientist, historian, and activist.
In Syntactic Structures (1957), Chomsky models knowledge of language using a formal grammar, claiming that formal grammars can explain the ability of a hearer/speaker to produce and interpret an infinite number of sentences with a limited set of grammatical rules and a finite set of terms.
The human brain contains a limited set of rules for organizing language, known as Universal Grammar. The basic rules of grammar are hard-wired into the brain and manifest themselves without being taught.
Chomsky Hierarchy

Chomsky proposed a hierarchy that partitions formal grammars into classes of increasing expressive power: each successive class can generate a broader set of formal languages than the one before.
Interestingly, modelling some aspects of human language requires a more complex formal grammar (as measured by the Chomsky hierarchy) than modelling others.
For example, while a regular language is powerful enough to model English morphology (symbols, words), it is not powerful enough to model English syntax.
Linguistics in Computer Science

In computer science, formal languages are used for the precise definition of programming languages and, therefore, in the development of compilers.
A compiler is a computer program (or set of programs) that transforms source code written in a programming language (the source language) into another computer language (the target language).
Computer scientists favour a recognition approach based on abstract machines (automata) that take a sentence as input and decide whether it belongs to the reference language.
Compiler Structure
Compilation Phases
Automata and Grammar

Which class of formal languages is recognized by a given type of automaton?
There is an equivalence between the Chomsky hierarchy and the different kinds of automata. Thus, theorems about formal languages can be stated in terms of either grammars or automata.
Describing formal languages: generative approach

Generative approach
A language is the set of strings generated by a grammar.

Generation process
Start from the start symbol.
Expand with rewrite rules.
Stop when a word of the language is generated.

Pros and Cons
The generative approach is appealing to humans.
Grammars are formal, informative, compact, finite descriptions of possibly infinite languages, but they are clearly inefficient if implemented naively.
Describing formal languages: recognition approach

Recognition approach
A language is the set of strings accepted by an automaton.

Recognition process
Start in the initial state.
Transition to other states, guided by the symbols of the string,
until the whole string is read and an accept/reject state is reached.

Pros and Cons
The recognition approach is appealing to machines.
Automata are formal, compact, low-level machines that can be implemented easily and efficiently, but they can be hard for humans to understand.
Formal languages: definition and basic notions

Formal language
A set of words, that is, finite strings of symbols taken from the alphabet over which the language is defined.
Alphabet: a finite, non-empty set of symbols.

Example
Σ1 = { 0, 1 }
Σ2 = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
Σ3 = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F }
Σ4 = { a, b, c, …, z }

Notation
a, b, c, … denote symbols
Formal languages: definition and basic notions

String (or word) over an alphabet Σ: a finite sequence of symbols in Σ.

Example
1010 is a string over Σ1
123 is a string over Σ2
hello is a string over Σ4

Notation
ε is the empty string
v, w, x, y, z, … denote strings
|w| is the length of w (the number of symbols in w).

Example
|a| = 1
|125| = 3
|ε| = 0
Formal languages: definition and basic notions

k-th power of an alphabet Σ:

Σ^k =def { a1 … ak | a1, …, ak ∈ Σ }

Example
Σ^0 = {ε} for any Σ
Σ1^1 = {0, 1}
Σ1^2 = {00, 01, 10, 11}

Kleene closures of an alphabet Σ:

Σ* = Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ … = ⋃_{k=0}^∞ Σ^k
Σ+ = Σ* \ {ε} = ⋃_{k=1}^∞ Σ^k
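The definitions above can be checked mechanically. A minimal Python sketch (the helper names `power` and `kleene_star` are illustrative, not from the slides) enumerates Σ^k and a length-bounded prefix of Σ*:

```python
from itertools import product

def power(alphabet, k):
    """Sigma^k: all strings of length k over the alphabet."""
    return {"".join(p) for p in product(alphabet, repeat=k)}

def kleene_star(alphabet, max_len):
    """Finite prefix of Sigma*: the union of Sigma^0 .. Sigma^max_len."""
    result = set()
    for k in range(max_len + 1):
        result |= power(alphabet, k)
    return result

sigma1 = {"0", "1"}
print(power(sigma1, 2))        # {'00', '01', '10', '11'}
print(kleene_star(sigma1, 2))  # adds '' (epsilon), '0', '1' as well
```

Note that `power(sigma1, 0)` correctly yields `{""}`, matching Σ^0 = {ε}.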
Formal languages: definition and basic notions

String operations:
vw is the concatenation of v and w
v is a substring of w iff there exist x, y such that xvy = w
v is a prefix of w iff there exists y such that vy = w
v is a suffix of w iff there exists x such that xv = w

Example
w = εw = wε
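These operations map directly onto built-in string predicates; a small sketch (the helper names are assumptions for illustration):

```python
def is_substring(v, w):
    return v in w           # some x, y exist with xvy == w

def is_prefix(v, w):
    return w.startswith(v)  # some y exists with vy == w

def is_suffix(v, w):
    return w.endswith(v)    # some x exists with xv == w

w = "hello"
assert is_prefix("he", w) and is_suffix("lo", w) and is_substring("ell", w)
assert "" + w == w + "" == w  # epsilon is the identity for concatenation
```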
Formal languages: definition and basic notions

Formal language: mathematical definition
A language over a given alphabet Σ is any subset of Σ*.

Example
English, Chinese, …
C, Pascal, Java, HTML, …
the set of binary numbers whose value is prime:
{10, 11, 101, 111, 1011, …}
∅ (the empty language)
{ε}
Formal languages: definition and basic notions

Operations on languages
Let L1 and L2 be languages over the alphabets Σ1 and Σ2, respectively. Then:

L1 ∪ L2 = { w | w ∈ L1 ∨ w ∈ L2 }
L̄1 = { w ∈ Σ1* | w ∉ L1 }
L1L2 = { w1w2 | w1 ∈ L1 ∧ w2 ∈ L2 }
L1* = {ε} ∪ L1 ∪ L1² ∪ L1³ ∪ …
L1 ∩ L2 = { w | w ∈ L1 ∧ w ∈ L2 }
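Since a language is just a set of strings, these operations are ordinary set operations. A short illustration, using two hypothetical example languages (not from the slides):

```python
# Hypothetical example languages over {a, b}
L1 = {"a", "ab"}
L2 = {"b", "ba"}

union = L1 | L2                               # L1 ∪ L2
intersection = L1 & L2                        # L1 ∩ L2
concat = {w1 + w2 for w1 in L1 for w2 in L2}  # L1L2

print(sorted(concat))  # ['ab', 'aba', 'abb', 'abba']
```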
A grammar is a tuple G = (V, T, S, P) where
V is a finite, non-empty set of symbols called variables (or non-terminals or syntactic categories)
T is an alphabet of symbols called terminals
S ∈ V is the start (or initial) symbol of the grammar
P is a finite set of productions α → β where α ∈ (V ∪ T)+ and β ∈ (V ∪ T)*

In the example "I eat apple":
V = {Sentence, Subject, Verb, Object}, T = {I, You, Eat, Buy, Pen, Apple}, S = Sentence, and
P = { Sentence → Subject Verb Object, Subject → I | You, Verb → Eat | Buy, Object → Pen | Apple }.
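The derivation process for this toy grammar can be sketched by exhaustively expanding non-terminals. A minimal Python sketch (the `expand` helper is an illustrative assumption, not the author's code):

```python
from itertools import product

# Productions of the example grammar; keys are the non-terminals
P = {
    "Sentence": [["Subject", "Verb", "Object"]],
    "Subject": [["I"], ["You"]],
    "Verb": [["Eat"], ["Buy"]],
    "Object": [["Pen"], ["Apple"]],
}

def expand(symbol):
    """All terminal strings derivable from a symbol (grammar must be finite)."""
    if symbol not in P:                # terminal symbol: yields itself
        return [[symbol]]
    results = []
    for rhs in P[symbol]:
        # combine every expansion of every symbol on the right-hand side
        for combo in product(*(expand(s) for s in rhs)):
            results.append([t for part in combo for t in part])
    return results

sentences = [" ".join(s) for s in expand("Sentence")]
print(len(sentences))  # 8 sentences, e.g. 'I Eat Apple'
```

This only terminates because the example grammar has no recursion; it illustrates generation, not a general algorithm.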
Grammar

A grammar is a finite object that can describe an infinite language.


Notation
A, B, C, … ∈ V represent non-terminal symbols
a, b, c, … ∈ T represent terminal symbols
X, Y, Z, … ∈ V ∪ T represent generic symbols
u, v, w, x, … ∈ T* are strings over T
α, β, γ, δ, … ∈ (V ∪ T)* are strings over V ∪ T


The Chomsky hierarchy: summary

Level  Language type      Grammar rules           Accepting machines
3      Regular            X → ε, X → a, X → aY    NFAs (or DFAs)
2      Context-free       X → α                   Nondeterministic pushdown automata
1      Context-sensitive  α → β with |α| ≤ |β|    Nondeterministic linear bounded automata
0      Unrestricted       α → β (unrestricted)    Turing machines
Type-3 Grammar

Type-3 grammars generate regular languages. Type-3 grammars must have a single non-terminal on the left-hand side and a right-hand side consisting of a single terminal or a single terminal followed by a single non-terminal.
The productions must be of the form X → a or X → aY
where X, Y ∈ N (non-terminals)
and a ∈ T (a terminal).
The rule S → ε is allowed if S does not appear on the right side of any rule.

Example
S → aB
B → bB
B → ε
What language does this define? ab*
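As a sanity check, membership in ab* can be decided directly; a minimal sketch (the function name is an assumption):

```python
def in_ab_star(w):
    """Recognize L = a b*, generated by S -> aB, B -> bB, B -> eps."""
    if not w or w[0] != "a":
        return False
    return all(c == "b" for c in w[1:])

assert in_ab_star("a") and in_ab_star("abbb")
assert not in_ab_star("ba") and not in_ab_star("")
```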
Finite Automata

A finite automaton is a 5-tuple M = (Q, Σ, δ, q0, F)
Q is the set of states
Σ is the alphabet
δ : Q × Σ → Q is the transition function
q0 ∈ Q is the start state
F ⊆ Q is the set of accept states

L(M) = the language of machine M = the set of all strings machine M accepts
Finite Automata

M = (Q, Σ, δ, q0, F) where
Q = {q0, q1, q2, q3}
Σ = {0, 1}
δ : Q × Σ → Q is the transition function, given by the state diagram (figure omitted)
q0 ∈ Q is the start state
F = {q1, q2} ⊆ Q is the set of accept states
Build an automaton that accepts all and only those strings that contain 001.

The states record the longest prefix of 001 matched so far: q (nothing matched), q0 (seen 0), q00 (seen 00), and q001 (seen 001; accepting). On 1, q stays in q; on 0, q moves to q0 and q0 moves to q00; on 1, q0 falls back to q; from q00, 0 stays in q00 and 1 reaches q001; q001 loops on both 0 and 1.
Limits of Regular languages and finite automata

What types of languages can't FAs accept? In other words, what limits are there on the complexity of regular languages?
FAs lack memory, so one part of a regular language cannot depend on another part.
(Type-2) Context-free grammars

Type-2 grammars generate context-free languages.
The productions must be of the form A → α, where A ∈ N (a non-terminal) and α ∈ (T ∪ N)* (a string of terminals and non-terminals).
The languages generated by these grammars are recognized by a non-deterministic pushdown automaton.

Example
S → AB
A → a
B → a
X → abc
X → ε

Whereas in regular grammars non-terminals were restricted as to where they could appear in the rules, now they can appear anywhere. Hence the term context-free.
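To see concretely what the extra power buys, here is a recognizer sketch for the classic context-free language {a^n b^n}, generated by S → aSb | ε, which no finite automaton can accept. (This example grammar and the code are illustrative additions, not from the slides.)

```python
def in_anbn(w):
    """Recognize {a^n b^n : n >= 0}, generated by S -> aSb | eps."""
    if w == "":
        return True        # the production S -> eps
    # the production S -> aSb: strip a matching outer a...b pair
    return w[0] == "a" and w[-1] == "b" and in_anbn(w[1:-1])

assert in_anbn("aabb") and in_anbn("")
assert not in_anbn("aab") and not in_anbn("abab")
```

The recursion mirrors the grammar directly: each call consumes one a and one b, which is exactly the matched-count dependency a finite automaton cannot track.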
Pushdown Automata
Pushdown automata extend FAs in one very important way: we are now given a stack on which we can store information. This works like a standard LIFO stack, where information gets pushed onto the top and popped off the top.
This means that we can now choose transitions based not just on the input, but also on what is on top of the stack.
We also have transition actions available to us: we can either push a specific element onto the top of the stack, or pop the top element off the stack.
Pushdown Automaton (PDA)
A PDA has a finite control and a single unbounded stack: it models a finite program plus one unbounded stack of bounded registers. (Source: Theory of Computation, NTUEE.)

Example: a PDA for L = { a^n b^n # : n ≥ 1 }, with transitions written as input, top-of-stack / replacement:
a, $/A$   (push an A over the bottom marker on the first a)
a, A/AA   (push one more A for each further a)
b, A/ε    (pop an A for each b)
#, $/ε    (accept the # only once all A's have been popped)

The stack initially contains only the bottom marker $.
Example run: accepting aaabbb#
On input aaabbb#, the stack evolves as
$ → A$ → AA$ → AAA$ → AA$ → A$ → $ → (empty),
pushing one A per a, popping one A per b, and finally popping $ on #: the input is accepted.
Example run: rejecting aaabbbb#
On input aaabbbb#, after three a's and three b's the stack holds only $, and no transition reads a b with $ on top: the input is rejected.
Example run: rejecting aaabb#
On input aaabb#, after the two b's one A remains on the stack, and the transition for # requires $ on top: the input is rejected.
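The runs above can be reproduced by simulating this particular PDA directly. The sketch below is an assumption for illustration (a hand-coded deterministic simulation of this one machine, not a general PDA simulator):

```python
def pda_accepts(w):
    """Deterministic PDA for {a^n b^n # : n >= 1}.

    Transitions (input, top-of-stack / replacement):
      a, $/A$   a, A/AA   b, A/eps   #, $/eps
    """
    stack = ["$"]
    phase = "a"                   # reading a's, then b's, then '#'
    for c in w:
        if c == "a" and phase == "a":
            stack.append("A")     # a, $/A$ or a, A/AA
        elif c == "b" and phase != "#" and stack[-1] == "A":
            phase = "b"
            stack.pop()           # b, A/eps
        elif c == "#" and phase == "b" and stack[-1] == "$":
            stack.pop()           # #, $/eps
            phase = "#"
        else:
            return False          # no transition applies: reject
    return phase == "#" and not stack

assert pda_accepts("aaabbb#")
assert not pda_accepts("aaabbbb#")
assert not pda_accepts("aaabb#")
```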
Limits on PDAs and CFGs
Adding memory is nice, but there are still significant limits on what a PDA can accomplish.
Can a PDA be constructed that can do arithmetic?
How, or why not?
(Type-1) Context-Sensitive Grammar

Type-1 grammars generate context-sensitive languages. The productions must be of the form
αAβ → αγβ
where A ∈ N (a non-terminal) and α, β, γ ∈ (T ∪ N)* (strings of terminals and non-terminals).
The strings α and β may be empty, but γ must be non-empty.
The rule S → ε is allowed if S does not appear on the right side of any rule. The languages generated by these grammars are recognized by a linear bounded automaton, whose memory is bounded by the length of the input.

Example
AB → AbBc
A → bcA
B → b

Alternate definition:
P = { α → β : |α| ≤ |β| }
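The noncontracting condition |α| ≤ |β| of the alternate definition is easy to check mechanically; a sketch (symbols written as single characters, helper name assumed):

```python
def is_context_sensitive(productions):
    """Check the noncontracting condition |alpha| <= |beta| for each rule."""
    return all(len(lhs) <= len(rhs) for lhs, rhs in productions)

# The example rules from the slide, with one character per symbol
rules = [("AB", "AbBc"), ("A", "bcA"), ("B", "b")]
assert is_context_sensitive(rules)
assert not is_context_sensitive([("AB", "a")])  # contracting: not allowed
```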
(Type-0) Unrestricted Grammar

Type-0 grammars generate recursively enumerable languages. The productions have no restrictions: they form arbitrary phrase structure grammars, and they include all formal grammars.
They generate the languages that are recognized by a Turing machine.
The productions are of the form α → β, where α is a string of terminals and non-terminals with at least one non-terminal, and α cannot be null; β is a string of terminals and non-terminals.

Example
S → ACaB
Bc → acB
CB → DB
aD → Db
Turing Machines

Recall that NFAs are essentially memory-less, whilst NPDAs are equipped with memory in the form of a stack.
To find the right kinds of machines for the top two Chomsky levels, we need to allow more general manipulation of memory.
A Turing machine essentially consists of a finite-state control unit, equipped with a memory tape, infinite in both directions. Each cell on the tape contains a symbol drawn from a finite alphabet Γ.
Turing Machines cont.

At each step, the behaviour of the machine can depend on the current state of the control unit and the tape symbol at the current read position.
Depending on these, the machine may then overwrite the current tape symbol with a new symbol, shift the tape left or right by one cell, and jump to a new control state.
This happens repeatedly until (let's say) the control unit enters some final state.
Turing Machines cont.

To use a Turing machine T as an acceptor for a language over Σ, assume Σ ⊆ Γ, and set up the tape with the test string s written left-to-right starting at the read position, and with blank symbols everywhere else.
Then let the machine run (maybe overwriting s), and if it enters the final state, declare that the original string s is accepted.
The language accepted by T (written L(T)) consists of all strings s that are accepted in this way.

Theorem: A set L is generated by some unrestricted (Type 0) grammar if and only if L = L(T) for some Turing machine T. So both Type 0 grammars and Turing machines lead to the same class of recursively enumerable languages.
Turing Machines cont.

A Turing machine T consists of:
A set Q of control states
An initial state i ∈ Q
A final (accepting) state f ∈ Q
A tape alphabet Γ
An input alphabet Σ ⊆ Γ
A blank symbol ⊔ ∈ Γ \ Σ
A transition function δ : Q × Γ → Q × Γ × {L, R}.

Linear bounded automata

Suppose we modify our model to allow just a finite tape, initially containing just the test string s with endmarkers on either side.
The machine therefore has just a finite amount of memory, determined by the length of the input string. We call this a linear bounded automaton.
(LBAs are sometimes defined as having tape length bounded by a constant multiple of the length of the input string; this doesn't make any difference in principle.)

Theorem: A language L is context-sensitive if and only if L = L(T) for some non-deterministic linear bounded automaton T.
The Chomsky hierarchy: summary

Level  Language type      Grammar rules           Accepting machines
3      Regular            X → ε, X → a, X → aY    NFAs (or DFAs)
2      Context-free       X → α                   Nondeterministic pushdown automata
1      Context-sensitive  α → β with |α| ≤ |β|    Nondeterministic linear bounded automata
0      Unrestricted       α → β (unrestricted)    Turing machines
Thanks a lot for your attention, and questions, please!
