
Principles of Programming Languages

a 14-week course

Javier Gonzalez-Sanchez Maria-Elena Chavez-Echeagaray July, 2013

Authors
Javier Gonzalez-Sanchez is pursuing his Ph.D. degree in Computer Science at Arizona State University. His primary research interests lie in developing and advancing development approaches for affect-aware and affect-driven self-adaptive systems, as well as prototyping mobile learning and augmented reality applications. Prior to joining Arizona State University, he was a teaching professor at Tecnológico de Monterrey (the largest private university in Mexico), where he taught diverse undergraduate courses on software architecture, software engineering, web development, and programming. He was also an adjunct professor at Universidad de Guadalajara (the second largest public university in Mexico) in the Master's in Applied Computing program and in the Master's in Information Technologies program. Within the business field, Javier has tutored companies on the CMMi Technical Solution process area, software architecture, software design patterns, and UML. He worked as a software engineer managing the development of web-based applications delivered to clients in Mexico, Brazil, and Honduras. Lastly, he has participated as CTO in two startup companies.

Maria-Elena Chavez-Echeagaray is pursuing her Ph.D. degree in Computer Science at Arizona State University. Her work includes areas such as Affective Computing, Intelligent Tutoring Systems, and Educational Technology. Prior to joining Arizona State University, she was a teaching professor at Tecnológico de Monterrey (the largest private university in Mexico), where she taught diverse undergraduate courses on programming languages, computer architecture, and networking. She worked as a freelance software engineer developing web-based applications for clients in Mexico and the United States.

Table of Contents
AUTHORS
PREFACE

WEEK 01
1. OVERVIEW
1.1. WHAT IS A PROGRAMMING LANGUAGE?
1.2. LANGUAGE LEVELS
1.3. COMPILERS AND TRANSLATORS
1.4. LANGUAGE PARADIGMS
1.5. PROJECT OF THE COURSE
2. LEXICAL ANALYSIS
2.1. COMPILING
2.2. LEXICAL PATTERNS
2.3. REGULAR EXPRESSIONS
2.4. DETERMINISTIC FINITE AUTOMATON

WEEK 02
REVIEW OF THE LEXICAL ANALYZER
2.5. PROGRAMMING A LEXER
3. SYNTAX ANALYSIS
3.1. TERMINAL AND NON-TERMINAL ELEMENTS
3.2. DERIVATION TREE

WEEK 03
REVIEWING FOR PROGRAMMING A LEXER
3.3. BNF
3.4. SYNTAX DIAGRAMS

WEEK 04
REVIEW OF GRAMMAR REPRESENTATIONS
3.5. TYPES OF GRAMMARS
3.6. PROGRAMMING A PARSER
3.7. DEFINING A LANGUAGE

WEEK 05
REVIEW FOR PROGRAMMING A LEXER
REVIEWING THE SYNTAX RULES FOR OUR LANGUAGE

WEEK 06
REVIEWING OUR LANGUAGE
3.8. FIRST AND FOLLOW SETS

WEEK 07
REVIEWING OUR LANGUAGE: MODIFYING OUR LEXER
REVIEWING OUR LANGUAGE: IMPLEMENTING THE PARSER
REVIEWING OUR LANGUAGE: CREATING THE DERIVATION TREE

WEEK 08
4. SEMANTIC ANALYZER
4.1. SYMBOL TABLE
4.2. VARIABLES ARE PREVIOUSLY DECLARED
4.3. TYPE MATCHING AND THE CUBE OF TYPES

WEEK 09
EXAM REVIEW: LEXICAL AND SYNTAX ANALYSIS
EXAM REVIEW: SEMANTIC ANALYSIS

WEEK 10
5. CODE GENERATION AND COMPILING
5.1. COMPILERS
5.2. INSTRUCTION AND VARIABLES HANDLING

WEEK 11
5.3. COMPILING
6. MEMORY
6.1. RUNTIME MEMORY
6.2. POINTERS IN C/C++

WEEK 12
7. FUNCTIONAL PROGRAMMING LANGUAGE: LISP
7.1. ATOMIC EXPRESSIONS
7.2. EVERYTHING IS A LIST
7.3. PREDICATES
7.4. ERRORS
7.5. RECURSION
7.6. CONTROL STRUCTURES

WEEK 13
7.7. BLOCKS
7.8. VARIABLES
7.9. LOOPS
7.10. LOCAL VARIABLES
7.11. WRITING FUNCTIONS
7.12. DATA STRUCTURES
7.13. EXAMPLE
7.14. EXERCISES

WEEK 14
8. LOGICAL PROGRAMMING LANGUAGES: PROLOG
8.1. FACTS AND RULES
8.2. QUERIES
8.3. DEFINING RULES
8.4. EXERCISES

Preface
This text has been written for students, professors, and practitioners in the area of computer science who want to satisfy two natural curiosities of any programmer: how does a programming language work? How does a computer understand a program? Satisfying these curiosities requires learning fundamental computer science topics. The content of this text draws on diverse techniques, ideas, and concepts learned in computer science courses, such as data structures, mathematical computing theory, and system analysis and design methodologies.

This text aims to provide a guide for the development of a compiler for a new programming language. It is the product of the experience of teaching the programming languages course, and it is structured in weeks, where each week covers a specific set of related topics, as follows:

WEEK 01. Introduction
This week we review basic terminology and give an overview of compilers, translators, interpreters, and code generation. These concepts will be applied in the following sections of the text. This week also presents an introduction to lexical analysis, defining, besides the needed terminology and concepts, the mathematical principles used to recognize lexical patterns (words): deterministic finite automata and regular expressions.

WEEK 02. Lexical analysis
Once deterministic finite automata and regular expressions are defined, the development of the lexical analyzer (Lexer) can start. During this week, we explain how to implement a Lexer following a state machine approach.

WEEK 03. Languages, grammars, and syntax diagrams
At this point, the lexical analysis has been completed. Therefore, it is possible to identify whether words are valid or not, what to do with them, and what corrections can be made automatically. The next step is to build sentences with those words. The structure of a sentence in a language is formally defined by the grammar of that language. Key topics for this week include grammar classification, BNF notation, syntax diagrams, and derivation trees.

WEEK 04. Syntax analysis
This week we learn more about grammar classification: how to represent the grammar of a language using BNF notation, how to map that BNF notation into syntax diagrams, and how to transform syntax diagrams into code to implement a Parser (a syntax analyzer). The week finishes with a deeper review of the programming language to be developed, whose Parser will be predictive, recursive, and descendent.

WEEK 05. Parser implementation
This week we focus on the implementation of the software components for the Parser and its integration with the Lexer.

WEEK 06. Parser validation and FIRST and FOLLOW sets
The syntax rules of our Parser will be validated by defining the FIRST and FOLLOW sets for our grammar rules. This week, we cover how FIRST and FOLLOW sets are calculated.

WEEK 07. Testing Lexer and Parser together
During this week, our Lexer and Parser are tested to validate their correct functionality. We also review how to implement a new feature that allows our Parser to generate derivation tree graphs, which facilitates reviewing that the syntax rules are correctly implemented.

WEEK 08. Symbol table and semantic analysis
So far, we have syntactically correct sentences formed by valid words. What remains is to assure that this valid structure has a valid meaning. This is the task performed by the semantic analysis, which includes validation of uniqueness of identifiers, type checking, parameters, return values, logic conditions in control structures, and array index validation.

WEEK 09. Semantic analyzer implementation
This week includes a review of the lexical and syntax analyzer implementations and how to integrate them with the semantic analyzer. Details of the semantic analyzer implementation are reviewed, and we establish the intrinsic link between the semantic analyzer and Parser components.

WEEK 10. Code generation
The analysis part (lexical, syntax, and semantic) is now complete. This week, we review the synthesis part of the process: code generation. The theoretical concepts related to code generation are covered, including an overview of the compilation process and the use of virtual machines.

WEEK 11. Memory management
The topic for this week is memory handling for variables and functions; this includes a review of concepts such as allocation, pointers, and recursion.

WEEK 12. Programming language paradigms
In the prior weeks we built a programming language. This language was an imperative, or procedural, language. However, there are other paradigms of programming languages, such as functional and logical programming. This week covers an overview of the functional and logical paradigms, and an introduction to a functional programming language: LISP.

WEEK 13. Functional programming: LISP
This week continues the introduction to programming in LISP, presenting hands-on examples and exercises.

WEEK 14. Logical programming: PROLOG
Now it is the turn of the logical programming paradigm. This week presents the fundamentals of the logical programming language PROLOG, with hands-on exercises.

WEEK 01
1. OVERVIEW
This section presents an overview of the whole course, providing general concepts and examples.

1.1. What is a programming language?
A language is a way to communicate. A language is composed of words/pieces and rules; those rules tell us how to combine the words/pieces together. A programming language is a way to communicate with the computer; so, a programming language also has words and rules. Examples of different languages: Java, C/C++, Perl, Scheme, LISP, and PROLOG.

1.2. Language levels
There are high-level and low-level languages. The closer a language is to natural (human) language, the higher its level; the closer a language is to machine language (binary, 0s and 1s), the lower its level. High-level languages are supposed to be easier for humans, while low-level languages are supposed to be easier for the computer. An example of a low-level language is assembly language. Examples of high-level languages are Java and C/C++.

In order for a program to be executed by a computer, the program must be in machine language. So, what if the program is written in a high-level language? There is the need to convert/transform this program into machine language. That is what a compiler does. The goal of the class is to define a new language (words and rules) and implement a compiler for that language.

1.3. Compilers and Translators
When we compile/translate a program from one (high-level) language to another (low-level) language, the first step is to understand the original program. Understanding the program includes understanding all the words in the program, being sure that they are correct, and checking that they have the correct meaning (according to the purpose of the program). In other words, it is important to know the lexical, syntax, and semantic details. Below is a brief explanation of the purpose of each of these parts in a compiler:

Lexical analysis (Lexer). The words are correct.
Syntax analysis (Parser). The words are connected in a correct way, i.e., the words follow the rules of the language.
Semantic analysis. The meaning of the sentence is correct.


Example 1. Consider the following code in Java


int {x} = 5;

At the lexical level, all the words are correct. At the syntax (Parser) level there is an error: according to the grammar of the language, the { and } are misplaced.

Example 2. Consider the following code in Java:

int x = hello;

At the lexical and syntax levels this code is correct. However, at the semantic level it is not, because an integer variable is being assigned a String value.

Example 3. These are examples of lexical-level errors.
int @x = 7;      -- a variable cannot start with @
int x = 77_;     -- a number cannot end with _
float x = 7.7.5; -- the format of the number is not correct

Example 4.
main ( ) { x = 5; }

There are 9 words in the code. Lexically, there is no problem. In terms of syntax (grammar), there is no problem either: all the words are in the correct order and follow the rules for combining words. In terms of semantics (meaning), there is a mistake: it is not possible to know the meaning of the code because it is incomplete; there is no defined type for the variable x.
main ( ) { foo (1,2); x = 5; }

Imagine we have the following declaration of the foo method: foo (int x) { }. The problem with this piece of code is at the semantic level: according to the definition of the foo method, it is expected to receive only one parameter, not two.


Once you understand the original program (text), then it is possible to translate it. The most complex part is understanding the code.

1.4. Language Paradigms
What makes a programming language different from another are the rules that define the language. For example, in the case of an object-oriented language, the main difference is that it has classes to define and handle objects. Classes are a particular kind of rule for object-oriented programming languages that differentiates them from other types of languages.

1.5. Project of the course
For the course, we will be developing an imperative programming language (these are the simplest ones). In order to implement the compiler for this programming language, we are going to use an object-oriented language.

2. LEXICAL ANALYSIS
This section covers the elements considered for the lexical analysis.

2.1. Compiling
The following table shows examples of different errors; an x marks the level at which the error on each line is detected. For the definition of lexical and syntax errors, each line was considered independently, while for the semantic errors all the lines were considered as a single piece of code. Notice that when we are compiling code, once an error is detected, there is no need to keep reviewing for further errors.

Expression                                              Lexical  Syntax  Semantic
int x = 5;
float y = hello;                                                         x
String @z = 9.5;                                        x
int x = cse340;                                                          x
if (x>14) while (5==5) if (int a) a = 1; x = x;                  x
y = 13.45.0;                                            x
int me = 9999900000111122211111122223443483045830948;                   ?
while {x != 9} ( );                                              x
int {x} = 10;                                                    x

A compiler is composed of four steps:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Code generation

2.2. Lexical Patterns
It has been defined that a language is composed of words and rules, where the rules should be understood as the grammar that defines the language.
Alphabet: the set of characters valid in a particular language.
String: a combination of characters.
Pattern: a rule to form a word.
Word: a string that follows a valid pattern in a particular language.
Token: words have a type or category (a particular pattern defines a category); these categories are named tokens, such as String ("hello"), integer number (5557), float number (5.7), ID (x), character ('x'), operators (= + - * /), and delimiters ([ ] ( ) { }).

Example 1. This line has 4 words; we know that there are 4 words because they follow patterns, and thus it is possible to categorize each word by type.
x = 5;

x - ID
= - operator
5 - integer
; - delimiter

So, in order to define the tokens in a line, it is important to know the patterns and to be able to identify them. There are two ways to express patterns:
(text-based) Regular expressions
(visual-based) Deterministic finite automata


2.3. Regular Expressions
To create a regular expression we need an alphabet and then a way to combine the characters of the alphabet (operations). There are four operations to know when working with regular expressions:
Union or alternation (|)
Concatenation
Zero or more (*, the Kleene operator)
One or more (+)

Example 1. Imagine that we are defining the patterns for a language:
P1 = {5}
P2 = {0-9} = {0, 1, ..., 9}
Using these patterns we can review the following words and try to find their type:
5 - this is a token that follows pattern P1
8 - this is a token that follows pattern P2
80 - according to the previous patterns (P1 and P2), it is not a token

Now, if we modify P2 and add P3 as follows:
P1 = {5}
P2 = {0, 1, ..., 9}+
P3 = {a, b, c, d}+
then:
5 still follows P1
8 still follows P2
80 is now a token that follows pattern P2
ab is a token that follows pattern P3
abba is a token that follows pattern P3
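As a side note, these kinds of patterns can be checked directly in Java with the java.util.regex package. Below is a minimal sketch (the class name and the test words are our own; the pattern strings are P2 and P3 written in regular expression notation):

import java.util.regex.Pattern;

public class PatternDemo {
    public static void main(String[] args) {
        Pattern p2 = Pattern.compile("[0-9]+");  // P2: one or more digits
        Pattern p3 = Pattern.compile("[abcd]+"); // P3: one or more of a, b, c, d

        System.out.println(p2.matcher("80").matches());   // true: 80 follows P2
        System.out.println(p3.matcher("abba").matches()); // true: abba follows P3
        System.out.println(p3.matcher("80").matches());   // false: 80 does not follow P3
    }
}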

2.4. Deterministic Finite Automaton
Patterns can be expressed using regular expressions or using a Deterministic Finite Automaton (DFA). The P1 pattern above would look as follows if expressed using a DFA. Remember that in a DFA the Start label shows the starting point, a double-lined circle represents a final state, and the values on the arrows represent the input needed to move from one state to another. So for P1 = {5} the DFA will be:


When using a DFA to evaluate whether a word follows a valid pattern, after all the characters in the word are processed you need to end in a final state (double-circled state). Now, suppose we want to define the pattern for integer numbers. The regular expression is: 0 | (1-9)(0-9)* And the DFA will be:
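In text form, this DFA can be sketched as a list of transitions (the state names S0, S1, and S2 are labels of our own; S0 is the start state, and S1 and S2 are final states):

Start: S0
S0 --0--> S1      (final state; accepts the single digit 0)
S0 --1..9--> S2   (final state)
S2 --0..9--> S2   (final state; accepts multi-digit integers)

Any input for which no arrow exists, e.g., a digit after a leading 0, leaves the automaton without a valid move, so the word is rejected.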


WEEK 02
REVIEW OF THE LEXICAL ANALYZER
Languages have rules at three levels:
Lexical
Syntax
Semantic

Lexical analysis is about words. To review the lexical level there are:
inputs: an alphabet
rules: patterns
outputs: tokens

In order to express or recognize patterns there are two options:
Regular expressions (text-based representation)
DFA (visual-based representation)

Review of Regular Expressions
To create a regular expression there are operands (characters) and operators. The operators are:
. (dot): wildcard character. Since the dot is used as an operator, if we want to include the dot as a literal character we need the escape character \.
|: OR
?: zero or one
*: zero or more
+: one or more
[ ]: indicates a set
( ): indicates a group (a different kind of set)

Examples. R1 = [a b c]+
With this regular expression it is possible to create words such as a, abc, and abaaab. The set contained inside the brackets has three elements. Because this regular expression has the + operator (one or more), it cannot produce an empty string.


2.5. Programming a Lexer Having the following DFA, how can we implement the Lexer?

For the implementation of this DFA it is possible to use conditions such as if-else. The problem with using conditionals is the complexity of the program: the more states in the DFA, the more complex the conditions, which makes the code fault-prone. An alternative is to use a state machine. Using a state machine allows us to transform the DFA into a table and reduces the complexity of the code.


        input 0    input 1
S0      S0         S1
S1      S2         S0
S2      S1         S2

To transform the DFA into a state machine (table), you use all the states as the rows of the table, and you use the different input values that change from one state to another as the columns. In the DFA above there are 3 states, so the table has three rows. To move from one state to another the inputs are 0s and 1s, so the table has two columns. Each value inside the table represents the resulting state if I start in the state represented by the row and receive the input represented by the column; e.g., if I am in state S2 (last row) and get the input 0, I will move to state S1. The following is the pseudo-code to implement the lexical analyzer.
// step 1
state = S0;
// step 2
while (hasLetters()) {
    l = readLetter();
    state = calculateState(state, l);
}
// step 3
if (state == S0)
    print("token correct");
else
    print("error");
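To make the state machine approach concrete, here is a minimal runnable Java sketch of the pseudo-code above, using the transition table for the 0/1 DFA (the class name, the sample word, and the array encoding are our own choices):

public class StateMachineDemo {
    // Rows are the states S0, S1, S2; columns are the inputs '0' and '1'.
    static final int[][] TABLE = {
        { 0, 1 },  // from S0: input 0 -> S0, input 1 -> S1
        { 2, 0 },  // from S1: input 0 -> S2, input 1 -> S0
        { 1, 2 }   // from S2: input 0 -> S1, input 1 -> S2
    };

    public static void main(String[] args) {
        String word = "0110";   // the word to evaluate
        int state = 0;          // step 1: start at S0
        for (char c : word.toCharArray()) {  // step 2: consume each input
            state = TABLE[state][c - '0'];   // one table lookup replaces an if-else chain
        }
        // step 3: accept only if we ended in the final state S0
        System.out.println(state == 0 ? "token correct" : "error");
    }
}

Note how adding states (e.g., for hexadecimal or binary numbers) only grows the table; the loop itself does not change.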

3. SYNTAX ANALYSIS
At the syntactical level, a language has rules, namely the grammar, that define the way to combine the words of the language. In order to specify the grammar of a language there are two options: a text-based one, the Backus-Naur Form (BNF), and a visual-based one, syntax diagrams.

3.1. Terminal and non-terminal elements
When defining a grammar there are terminal elements, which we call words, and non-terminal elements.

For the lexical analysis the input was the alphabet and the output was tokens, and the rules dictated how to move from an alphabet to tokens. For the syntactical analysis the input is words, and the output is the classification of those words as terminal or non-terminal elements.

Example 1. Think for a moment about the English language. The following expressions represent non-terminal and terminal elements:
<verbs> = run, walk, sing, write
<subject> = I, you, Helen, Javier
In these expressions, verbs and subject are the non-terminal elements, while run, walk, I, and you are terminal elements. A non-terminal element can be expressed by a set of terminal elements, but also by a set of non-terminal elements. For instance, the following is a non-terminal expressed in terms of other non-terminal elements:
<sentence> = <subject><verb>
Non-terminal elements are always written between < and >, or some literature uses UPPERCASE to represent them. So, the following expression is equivalent to the previous one:
SENTENCE = SUBJECT VERB
Given the sentence Helen walk, is this sentence correct? Yes, because it follows the rule SENTENCE = SUBJECT VERB, where the SUBJECT is Helen and the VERB is walk.

3.2. Derivation Tree
Example 2. Given the following grammar:
<E> = <E> + <E>
<E> = <E> / <E>
<E> = integer
In this grammar there are three terminal elements (+, /, and integer) and only one non-terminal element (<E>).


If we have the expression 5 + 1, is it possible to know whether it is correct according to the above grammar? Well, the first thing is to know which of the rules in the grammar is the first one. Let's say that the first rule in the grammar is the first one in the list; so, to analyze the expression 5 + 1 we do the following derivation:

<E> → <E> + <E> → 5 + <E> → 5 + 1

This can be seen as a tree: this is a derivation tree. Because it is possible to find a tree that matches the expression, we can say that the expression is correct. Now, if we have the expression 5 + 1 / / 3, when reviewing the expression we get to the point of reviewing the two / operators; it is expected that after a / there is an <E>, and in this case what follows is / 3, which is not a valid <E>. So, at this point it is possible to say that the expression is not correct.

Programming languages are not ambiguous, and this is because any sentence has one and only one tree associated with it. If we have the expression 5 / 3 + 1, how can it be evaluated? For this expression, there are two options to create the tree. The first one considers that 5 / 3 go together and then combines the result with + 1:

<E> → <E> + <E> → (<E> / <E>) + <E> → (5 / 3) + 1

But it is also possible to consider 3 + 1 together first and then include the 5 /:

<E> → <E> / <E> → <E> / (<E> + <E>) → 5 / (3 + 1)

Notice that BOTH processes are correct; this means that there are two different trees that represent the same expression.

This is not correct; a grammar should not allow one expression to have more than one tree representing it. So, this grammar is not correct: it allows ambiguity. This grammar needs to be fixed.

Example 3. Consider the following grammar for if statements. Is this grammar correct (in terms of ambiguity)?

stmt → if_stmt
stmt → other
if_stmt → if expr then stmt
if_stmt → if expr then stmt else stmt

Consider the sentence if expr then if expr then other else other. One tree attaches the else to the inner if; but the same sentence can be expressed using a different tree, attaching the else to the outer if. So, again, this grammar is not correct in terms of ambiguity.


Now the problem is how to implement this table in code. There are different ways to implement it; one of them (used by the professor) is a hash table where, given the current state and an input, the hash table returns the next state, or an error if that is the case. One possible way to implement this is:
0. Create the table (using a data structure).
1. Define which state is the starting state.
2. Use a loop to go through the table and obtain the next state given the current state and an input (a value to change from one state to another).
If we want to add the representation of hexadecimal, octal, or binary numbers, what we need to do is add more rows and columns to the table, but the code does not change. So, adding new states generates more complexity in the table, but not in the code. In order to add these new patterns we need to know how those numbers are represented. For example:
1. Binary numbers can be represented by a string that starts with the prefix 0b and is followed by a sequence of 0s and 1s, e.g., 0b0110.
2. Hexadecimal numbers can be represented by a string that starts with the prefix 0x followed by a sequence of hexadecimal symbols [0-9, A-F], e.g., 0xFF.
3. An integer is a sequence of [0-9]. In some languages, such as Java or C/C++, integers cannot start with a 0, e.g., 758.
4. Octal numbers are represented with a sequence of digits [0-7]. In some languages, such as Java or C/C++, this sequence always starts with 0, e.g., 075.
Note that in our project a number that starts with a 0 can be recognized as an integer or a float (if it has the decimal point).

3.3. BNF
Once the lexical analysis is done, what we have is a bunch of tokens, and we know for sure that all those tokens are valid. The next step is to review the syntax, i.e., that the words follow the rules for combining them. These rules form the grammar of the language. Grammars can be expressed using BNF, which is the text-based form to express a grammar. BNF uses two kinds of elements: terminal and non-terminal elements. When an expression is evaluated to verify whether it follows a grammar expressed in BNF, we create a derivation tree. This tree shows the path followed to cover the expression using the rules of the grammar. One goal is to have non-ambiguous grammars; a grammar is ambiguous when it is possible to create more than one derivation tree to represent an expression.

So, how do we avoid ambiguity? Here are some rules:
1. We need to define a symbol that will be used as the starting symbol.
2. We need to define which is the first rule of the grammar. By default, the first rule of the grammar is the first one in the list, and we should use the leftmost derivation principle.
Given the following grammar:
<E> → <E> + <E>
<E> → <E> * <E>
<E> → (<E>)
<E> → Integer
According to the previous rules, the first rule in the grammar is the first one in the list; thus, <E> is the starting symbol. Besides that, we need to follow the leftmost derivation. Given that, let us evaluate the expression 10 + 20 * 30.

So we took the first rule, <E> → <E> + <E>, and started from the left side; that is how we identify that we need an Integer and that the 10 fits that rule. Then we know that we need a plus symbol and another <E>. Now, we know that an <E> can be <E> * <E>, so that is the rule we use; and again, each of those <E>s can be an Integer. Let's see another example: 10 + 20 + 30. In this case we will have problems with ambiguity, because even though we start from the left side there is an option to find at least two different derivation trees. We start with the first rule, <E> → <E> + <E>; however, when it comes time to expand the leftmost <E>, there is the option to use the rule <E> → Integer or the rule <E> → <E> + <E>.


This problem is caused because we have several productions for the same element <E>. So, the solution is to have the start symbol, e.g., <E>, define the first rule in terms of other elements, and do the same for the rest of the rules. For example, the previous grammar could be expressed as follows:
<E> → <M> + <M>
<M> → <P> * <P>
<P> → (<N>) | <N>
<N> → Integer
In this grammar <E> is the starting element and it is defined in terms of M, M in terms of P, P in terms of N, and N in terms of a terminal element. With this grammar there is no option to have more than one derivation tree, i.e., it is not ambiguous. Note that by doing this, we are also solving problems with the precedence of the + and * operations. However, this grammar has a problem: all expressions need to have an addition, due to the first rule <E> → <M> + <M>. To fix this problem we need to modify the grammar to be the following:
NOTE 1: The notation used here is different from the one reviewed in class, to match it with the textbook. NOTE 2: The grammar here was updated (and it is different from the one reviewed in class) to make it simpler. It is the same one that appears in the slides.

<E> → <M> {+ <M>}
<M> → <P> {* <P>}
<P> → (<N>) | <N>
<N> → Integer
The start symbol is still <E> and the first rule is still the first one on the list. However, by adding the { and } symbols we are saying that <E> can be an <M> followed by zero or more + <M> terms. Something similar happens with the rule for <M>. This new notation is named Extended BNF, or EBNF, where the symbols { and } represent a repetition of elements (0 or more) and the symbols [ and ] represent an optional element (0 or 1).


Now if we have the expression 10 + 20 + 30.

In the graph above, the first rule to be followed is <E> → <M> + <M>; then, for the leftmost <M>, the rule to be followed is <M> → <P>, then <P> → <N>, and <N> → Integer. Then we go back to the first rule that is incomplete, in this case the rightmost <M>, and go down to complete it. We go back again to find out that there is another + symbol, so we need an extra <M>, and again we complete it.

Now, consider the following expression: 5 * (1+7) + 6 * 5

Notice that when we get to the point where we use the rule <P> → (<N>) we get stuck, because <N> can only be an Integer, and we have a whole expression there: 1 + 7. So, we need to modify the grammar again, to be the following:
<E> → <M> {+ <M>}
<M> → <P> {* <P>}
<P> → (<E>) | Integer
So, going back to the expression 5 * (1+7) + 6 * 5:


We can say that <P> → (<E>) and that <E> → <M> + <M>. When we finish with that expression we go back; we finish the (<E>) expression, so we go back; the <P> * <P> is complete, so we go back all the way to <E> and notice that we have another + symbol, so we use the {+ <M>} section of the rule and complete it.

One more example; consider the following expression: 5 * 2 + 10. The first rule is <E>; the next rule will be <M> (alone), because the next operator is a *, so we cannot apply <M> + <M> yet. Now, <M> can be <P> * <P>. So we go left and check the first <P>; that <P> can be an Integer, and finally a 5. Now we have to go back to the first level we left incomplete, which is <P> * <P>, because we need to complete the second <P>; that <P> is an Integer, and finally a 2. Now we need to go back up to an incomplete level, which is <E> at the top, and we can now say that <E> can be <M> + <M>.


One more example, consider the following expression: 5 * (2 + 10)

Homework. Do the derivation trees for the following two expressions. (5 * 2) + 10 5 * 2 * 10


3.4. Syntax Diagrams
Just as for lexical analysis there is a way to express rules using text (regular expressions) or a visual diagram (DFA), for syntax we also have these two options: we can use BNF or EBNF (text-based), or we can use syntax diagrams (visual-based). The previous grammar can be expressed with the following diagrams:
NOTE: The grammar here was updated (and it is different from the one reviewed in class) to make it simpler. It is the same one that appears in the slides.

<E> → <M> {+ <M>}
<M> → <P> {* <P>}
<P> → (<E>) | Integer

Notice that terminal elements are symbolized with circles, while non-terminals are represented with rectangles. The arrowheads represent the flow in the diagram, and or-expressions are represented by diverse paths coming out of the root of each diagram. Now, what if we want to define a new grammar that is able to recognize logical, relational, and arithmetic operations, and use it to evaluate the following expression?
10 + 20 > 15 & -10 != 1 | 20 / 10 + 1 > 5
The first step is to define the precedence of all these operations: logical operations are the ones with the lowest precedence, then come the relational ones, and then the arithmetic ones, which have the highest precedence. Notice that this precedence is defined by how one rule calls (uses) a subsequent rule.


The diagrams below represent the grammar, where <E> is the starting symbol. Notice that, according to the diagrams, <E> is expressed in terms of <A>, and that the precedence of the operations is defined by how one rule calls (uses) a subsequent rule.

Homework. Generate the BNF or EBNF for these diagrams.


WEEK 04
Review of Grammar Representations
Grammars can be expressed using BNF or EBNF, which are text-based representations, or they can be represented using syntax diagrams. When creating a grammar, it is important to keep in mind that grammars should not be ambiguous. Remember that, when creating the diagrams, the operation with the highest precedence should be the one at the end of the sequence of diagrams. When we have the syntax diagrams it is possible to obtain the BNF (Backus-Naur Form) or the EBNF (Extended BNF) from them. Given the following diagrams, we can obtain the BNF and EBNF expressions.

Let's try to create the BNF and EBNF for the D rule.


BNF:
D → F | F / F | F / F / F | F * F | F * F * F | F * F / F | F * F * F / F

EBNF:
D → F ( [/ F] | [* F] )

NOTE: The EBNF notation is different from the notation in the textbook. Using the notation of the textbook, the rule would be D → F { [/ F] | [* F] }, where [ ] means 0 or 1 occurrences of an element and { } means 0 or more repetitions.

Exercise. The previous grammar was created based on the rules discussed in class, and in order to be able to parse the following expression: 10 + 20 > 15 & -10 != 1 | 20 / 10 + 1 > 5. However, what if we change the expression to be: 10 + 20 > 15 & (-10 != 1 | 20 / 10 + 1 > 5)? In order to be able to recognize the parentheses we need to modify the grammar to add the parenthesis rule. The parenthesis rule, which is an association rule, has the highest precedence; thus it should be the last rule in the sequence of diagrams. So, we need to add a rule or modify the last rule. We can modify rule G as follows:

If we want to add more operations, such as subtraction (-) or multiplication (*), or more logical operations, such as equality (==), what it takes is to add more options to the current rules. If a group of operations has the same precedence, they go in the same rule. For example, subtraction has the same precedence as addition, so they both go in the same rule; multiplication goes in the same rule as division; and the equal operation (==) goes in the same rule as the not-equal operation (!=).

3.5. Types of grammars
There are four types of grammars. All the grammars that we have covered so far are classified as context-free grammars. The reason why we chose to use a context-free grammar is implementation: implementing the Parser for a context-free grammar is easier compared with more powerful grammars, although it is not the easiest (the easiest are regular grammars).


The types of grammars are the following:
Recursively enumerable
Context-sensitive
Context-free
Regular

The grammars are subsets of each other, as shown in the figure below. So, regular grammars are a subset of context-free grammars, these are a subset of context-sensitive grammars, and those are a subset of recursively enumerable grammars.

3.5.1. Context-free grammars
Context-free grammars are defined by rules of the following form:
A → S, where A ∈ N and S ∈ (N ∪ T)* − N
Here N represents the set of non-terminal elements and T represents the set of terminal elements. In this type of grammar the left side of a rule can only be a single non-terminal element, with no terminal elements (A ∈ N), and the right side can be a combination of terminal and non-terminal elements, or empty (S ∈ (N ∪ T)*), but it cannot be a single non-terminal by itself (hence the − N).
Given the four rules below, only the first one is a context-free rule, because it is the only one with a lone non-terminal on the left side, <E>. The rest of them have additional elements on the left side:
<E> → <A> + <A>
(<E>) → (<A> + <A>)
<S> <B> → <C>
@<X> → <A><B> <C>
The second condition for a context-free grammar is that the right side of a rule cannot be only a non-terminal element by itself, i.e., it has to have terminal elements or a combination of non-terminal elements. For example:


<E> → <A> + <A> is a context-free rule. Another example is the grammar below. This is not a context-free grammar, because the rule for <E> has on its right side only one element, and it is a non-terminal element:
<E> → <X>
One more example: let's analyze the following grammar.
<S> → a <A><D>
<A> → a | b
<D> → integer
In this grammar all the rules have only one element on the left side and it is a non-terminal, so all the rules follow the first condition for context-free grammars. Then, if we review the <S> rule, it has on the right side a combination of terminals and non-terminals, so it follows the second condition. The <A> rule has terminal elements, which is valid too. Finally, rule <D> has only one element on the right side, but it is a terminal, so it is also valid. Thus it can be said that this is a context-free grammar.

3.5.2. Regular grammars
Regular grammars are a subset of context-free grammars, which means that ALL regular grammars are context-free grammars, but not vice versa. Regular grammars are defined as follows:
A → a | aB, where A, B ∈ N and a ∈ T
So, the rule says that the left side should be one element and it should be a non-terminal (A ∈ N), and the right side can be either a terminal by itself or a combination of one terminal and one non-terminal, in that order (a | aB, B ∈ N, a ∈ T), i.e., the right side can be one or two elements in length.
<E> → (<A> + <A>)
This is a context-free rule, but it is not a regular one, because the right side is longer than two elements.
Given the following grammars, define their classification.


P1 {
  <s> → <A>
  <s> → <A><A><B>
  <A> → aa
  <A>a → <A><B>a
  <A><B> → <A><B><B>
  <B>b → <A><B>b
  <B> → b
}

P2 {
  <s> → b<s>
  <s> → a<A>
  <s> → b
  <A> → a<s>
  <A> → b<A>
  <A> → a
}

P3 {
  <s> → <B><A><B>
  <s> → <A><B><A>
  <A> → <A><B>
  <A> → a<A>
  <A> → ab
  <B> → <B><A>
  <B> → b
}

P4 {
  <s> → <A><B>
  <A> → a<A>
  <A> → a
  <B> → <B>b
  <B> → b
  <A><B> → <B><A>
}

It can quickly be determined that P1 and P4 are neither regular nor context-free grammars, because the left side of some of their rules has more than one element. This leaves only P2 and P3.
P2 is a regular grammar: all the rules have only one element on the left side and it is a non-terminal. On the right side, all the rules have at most two elements; those with one element always have a terminal, and those with two elements have a terminal first and a non-terminal second.
P3 is a context-free grammar but not a regular grammar: the left side of every rule has only one element and it is a non-terminal; however, the right side of some rules has more than two elements.

NOTE: While classifying a grammar it is important to remember to classify it in the lowest class possible, i.e., if it is said that a grammar is context-free, it means that it is not a regular grammar; if it is said that a grammar is context-sensitive, it means that it is not a context-free or a regular grammar.
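To make these checks concrete, here is a minimal Java sketch that tests a single production against the two definitions above (encoding non-terminals as uppercase letters and terminals as lowercase letters is a simplification of our own):

public class GrammarClassifier {
    static boolean isNonTerminal(char c) { return Character.isUpperCase(c); }

    // Regular, as defined above: the left side is one non-terminal; the right side
    // is one terminal, or one terminal followed by one non-terminal.
    static boolean isRegularRule(String lhs, String rhs) {
        if (lhs.length() != 1 || !isNonTerminal(lhs.charAt(0))) return false;
        if (rhs.length() == 1) return !isNonTerminal(rhs.charAt(0));
        return rhs.length() == 2
                && !isNonTerminal(rhs.charAt(0))
                && isNonTerminal(rhs.charAt(1));
    }

    // Context-free, as defined above: the left side is one non-terminal and the
    // right side is not a single lone non-terminal.
    static boolean isContextFreeRule(String lhs, String rhs) {
        if (lhs.length() != 1 || !isNonTerminal(lhs.charAt(0))) return false;
        return !(rhs.length() == 1 && isNonTerminal(rhs.charAt(0)));
    }

    public static void main(String[] args) {
        System.out.println(isRegularRule("S", "bS"));      // true: P2's <s> -> b<s>
        System.out.println(isRegularRule("A", "AB"));      // false: P3's <A> -> <A><B>
        System.out.println(isContextFreeRule("A", "AB"));  // true
        System.out.println(isContextFreeRule("AB", "BA")); // false: P4's <A><B> -> <B><A>
    }
}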

3.5.3. Context-sensitive grammars
αAβ → αδβ, where α, β ∈ (N ∪ T)*, A ∈ N, and δ ∈ (N ∪ T)* − {ε}
The rule defines that the left side can have more than one element, either terminal or non-terminal, but at least one of them has to be a non-terminal.


So a context-sensitive grammar can have on the left side expressions such as the following:
-<A>, which is a terminal - and a non-terminal <A>.
(<A>), which is a terminal (, then a non-terminal <A>, and then a terminal ).
(<B><A>;, which can be seen as a terminal ( followed by a non-terminal <B> followed by a combination of terminal and non-terminal, <A>;. Another way to see it is as a combination of a terminal and a non-terminal, (<B>, then a non-terminal <A>, and a terminal ; at the end.
In all these cases, there is at least one non-terminal, and before and after it there can be a combination of terminals and non-terminals. Now, for the right side of the rule: it can have a combination of terminals and non-terminals at the beginning and at the end, where the right side should have the same starting combination and the same ending combination as the left side (and these combinations can be empty). Then, in the middle of the right side, there should be a combination of terminals and non-terminals, δ ∈ (N ∪ T)*, but never the empty element (ε).
Reviewing the previous grammars, which of them are context-sensitive? Because P2 and P3 are regular and context-free respectively, they are also context-sensitive. However, remember that when classifying a grammar we look for the lowest level possible. Now, what happens with P1 and P4?
P1 is a context-sensitive grammar. Rules 1-3 and 7 have only one element on the left side, a non-terminal, so we can say there is an empty combination before it, then the non-terminal, and then another empty combination. So the right side of these rules should have an empty combination at the beginning and at the end, and in the middle a combination of terminals and non-terminals, which is correct. Let's review rules 4-6. Rule 4 has on the left side a non-terminal <A> with an empty combination before it and a terminal a after it; on the right side we have an empty combination, then a combination of terminals and non-terminals, <A><B>, and then the terminal a, which matches the left side. Something similar happens with rule 6. For rule 5, it is possible to say that it has an empty combination first, then the non-terminal <A>, followed by the non-terminal <B>. On the right side it has the empty combination, followed by the combination <A><B>, and at the end the non-terminal <B>, which matches the rule for this type of grammar.
P4 is NOT context-sensitive, due to the last rule in the grammar: <A><B> → <B><A>. In this rule, on the left side we need to pick the mandatory non-terminal element; we can say it is <A> or <B>.


If we decide it is <A>, then on the left side we have an empty combination, <A>, and then <B> at the end. So, when we review the right side, we find the empty combination at the beginning; then we could take <B><A> as the middle part, which would be correct, but then the ending combination would be empty instead of <B>, which is not correct. Something similar happens if we decide that the mandatory non-terminal on the left side is <B>. So, if P4 is not regular, context-free, or context-sensitive, then what is it? It is a recursively enumerable grammar.

NOTE: In the programming assignment we are using a context-free grammar. Context-free grammars have more power than regular grammars, and they are easier to implement than context-sensitive grammars.

3.6. Programming a Parser
We chose to use a context-free grammar, which allows us to create a predictive, recursive, descendent Parser. Given the following grammar, we want to implement a Parser that helps us review whether an expression follows the rules of this grammar.

For the implementation:
Each rule is going to be a method.
Each non-terminal (box) is a call to a method.
Each terminal (circle) will be either an if condition or a while loop. It is also with terminal elements that we have the opportunity to identify errors.

For the Parser implementation we need the tokens. These tokens are the ones provided by the Lexer (lexical analyzer). The result of the syntax analysis is to know whether there is an error in any expression; this means being able to recognize when an expression does not follow the rules of the grammar. We are no longer checking whether the words are valid; we did that in the lexical analysis.

The Parser is going to say whether the tokens are in the correct form and whether the derivation tree can be processed. If the derivation tree cannot be processed in full, this means that there is an error. Besides the methods for each rule, we have a parser method. This method calls the first rule of the grammar.
public static void parser() { add(); }

So, in this case we will have four methods: add, multiplication, negative, and term.
public static void add() {
    multiplication();
    while (tokens[currentToken].equals("+")) {
        currentToken++;
        multiplication();
    }
}

public static void multiplication() {
    negative();
    while (tokens[currentToken].equals("*")) {
        currentToken++;
        negative();
    }
}

public static void negative() {
    if (tokens[currentToken].equals("-")) {
        currentToken++;
    }
    term();
}

As mentioned above, each rule will be a method. Inside the method, each box is a call to a method and each circle is a loop or an "if" condition. In the case of add and multiplication, the circle represents a loop; in the case of negative, it is an "if" condition. Notice that this follows the diagram of the grammar, where the add and multiplication rules show a path that returns to the initial point (a loop), and where negative shows two options (conditions) to follow at the beginning. For the term method:
public static void term() {
    if (tokens[currentToken].equals("(")) {
        currentToken++;
        add();
        if (tokens[currentToken].equals(")")) {
            currentToken++;
        } else {
            errors++;
            System.out.println("missing parenthesis");
        }
    } else if (tokens[currentToken].equals("INTEGER")) {
        currentToken++;
    } else {
        errors++;
        System.out.println("missing integer");
    }
}

Notice that when you have rules with options, such as term, it is possible to identify errors; also, when you have terminal elements at the end of an expression, it is possible to identify errors there. If we want to add more operations, such as division, we add it in the multiplication rule, because both operations have the same precedence. In terms of code, this means adding an OR condition so that the loop accepts either the * token or the / token.
public static void multiplication() {
    negative();
    while (tokens[currentToken].equals("*") || tokens[currentToken].equals("/")) {
        currentToken++;
        negative();
    }
}
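A minimal way to exercise these methods is sketched below. The tokens array (holding token names as the Lexer would emit them for the input -5 * (8 + 2)), the currentToken index, and the errors counter are assumptions about the surrounding class, which the text does not show in full:

public class ParserDemo {
    // Hypothetical supporting state; only the rule methods appear in the text above.
    static String[] tokens = { "-", "INTEGER", "*", "(", "INTEGER", "+", "INTEGER", ")", "$" };
    static int currentToken = 0;
    static int errors = 0;

    public static void main(String[] args) {
        // The parser(), add(), multiplication(), negative(), and term() methods
        // shown above go inside this same class.
        parser();
        System.out.println(errors == 0 ? "syntax correct" : errors + " syntax error(s)");
    }
}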

3.7. Defining a language
The first thing that we need is to define the grammar for our programming language. We can start by reviewing the sections of a program. A program has declarations of variables and declarations of methods, something such as the following:
int x = 5; int y() { int z = 1; }

So, a grammar that defines this could be the following:


And the rule for the declaration of variables could be as follows:

And the rule for the declaration of methods could be as follows:

These two rules have a problem: how do we decide which path of the program rule to choose, dvar or dmethod? As these rules are defined, we would need to look ahead several tokens to know whether the expression is the declaration of a variable (dvar) or the declaration of a method (dmethod). Having to look ahead several tokens is not a good idea, so we need to change the grammar. With the new grammar below, it is possible to decide which path to follow by looking only one token ahead.


WEEK 05
Review for Programming a Lexer
The following list shows some special cases that are important to keep in mind while programming a Lexer:
"hola"+hola23 - four tokens: "hola" STRING, + OPERATOR, hola ID, and 23 ERROR.
'\3' - one token, and it is a CHAR.
'33' - one token, and it is an ERROR (a CHAR can contain only one character).
"\"" - one token, and it is a STRING; the backslash escapes the second quote, so it is not considered the closing of the string.
"\" - one token, and it is an ERROR, because the closing quote of the STRING is missing.
0x23.23 - one token; the dot is NOT a delimiter, thus this is parsed as a single float-like number and it is an ERROR.
B-) - three tokens: B ID, - OPERATOR, ) DELIMITER.


Reviewing the Syntax Rules for our language
Last week we defined the following rule for a program. However, this rule has a problem: it does not allow having multiple variables or multiple methods.

To fix the rule we need to add a loop.

However, there is still one more thing to review: how do we differentiate global variables from local variables? We need to specify whether the rule above refers to global or local variables; here, we are talking about global ones.


So, the diagram for Program will be:

Now, we need the rule for dmethod:

So, notice that this rule specifies that the method can have zero parameters but always has the parentheses. The method has to have a start and an end, specified by the curly brackets. It is possible that the method does not contain anything, or that it contains one or more lines; the two branches after the open curly bracket and the loop specify this. Now, there are two rules that need to be defined: parameter and line.

The loop indicates that we can have a list of types and identifiers separated by commas. Notice that the option of NOT having parameters is covered in the rule for dmethod, not here. Now, for line: what can we have inside a method? We can have declarations of variables, assignments of values to variables, calls to methods, loops (while), conditions (if-else), and the method could return a value.


So the diagram for line will be as follows:

Now there are a lot of rules to define. Let's start with the return rule:

Notice that at this point, the syntax analysis, we are not reviewing whether the type of the method matches the type of the expression to be returned. This is something that will be covered in the semantic analysis. Now, for assignment and call function, the rules will look like:

Note that both assign and call_function use exprlog, which stands for logical expressions. Remember that, because logical expressions are the ones with the lowest precedence, they are the first in the parsing process.



So the grammar of our programming language has 20 rules (diagrams). These rules can be translated into 20 methods for our syntax analyzer. Homework. Start thinking about and working on the implementation of the methods for the Parser.


WEEK 06
Reviewing our language The following table shows all the characteristics and requirements of our language.

Now we have all the rules of our language (the 20 rules that we reviewed last lecture) and need to transform these rules into the Parser implementation. When transforming these rules into an implementation, there are two things to consider: (1) how to express the existence of diverse paths (conditions) and, more importantly, how to define which path to follow; and (2) how to express the existence of loops and, more importantly, how to define how long a loop lasts, i.e., the condition to stop the loop. In order to define the conditions that select the path to follow and that decide whether to continue a loop or not, we use what are called the FIRST and FOLLOW sets, respectively. These sets give the tokens that define those conditions.


3.8. FIRST and FOLLOW sets
For instance, in order to implement the following rule we need the conditions that define which path to follow at the beginning, <A> or <B>; then we need the conditions that define whether there is a line or not; and then the condition to know that there are multiple lines. To define all this we need to define the FIRST and FOLLOW sets.

3.8.1. FIRST Set
The FIRST set is formed by the first token or tokens (if there are different paths/options at the beginning of the rule) that are valid in a rule. For the following rule, the FIRST set would be A and B. However, A and B are non-terminals, so what we need to do is look into those rules and get their FIRST sets; these will be the FIRST set for the rule. From this rule we can also define what the FOLLOW set for the non-terminal (rule) line would be. As can be seen, after the line rule finishes there are two options: either there is a closing curly bracket (}) or we return and have another line. However, line is a non-terminal element, so it cannot be part of a set; in this case we take the FIRST set of line, and it becomes part of the FOLLOW set of line, too. The sketch after this paragraph shows how these sets drive the Parser's conditions.
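As a quick illustration of how these sets turn into code, the sketch below drives a loop with a hypothetical FIRST(line) set; any token outside it, such as the closing curly bracket from FOLLOW(line), stops the loop (the set contents and token names here are illustrative assumptions, not our language's final sets):

import java.util.List;
import java.util.Set;

public class FirstFollowDemo {
    // Hypothetical FIRST set of the line rule: tokens that can start a line.
    static final Set<String> FIRST_LINE = Set.of("type", "id", "while", "if", "return");

    public static void main(String[] args) {
        // Token stream for a method body; "}" is in FOLLOW(line) and stops the loop.
        List<String> tokens = List.of("type", "id", "return", "}");
        int i = 0;
        while (FIRST_LINE.contains(tokens.get(i))) { // the loop condition comes from FIRST(line)
            System.out.println("parse one line starting with: " + tokens.get(i));
            i++; // a real parser would consume the whole line here
        }
        System.out.println("stopped at: " + tokens.get(i) + ", which is in FOLLOW(line)");
    }
}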

Now, for the rules dvar_global and dmethod in our language, we can identify the FIRST sets. For dvar_global, the FIRST set would be {;, ,}. Notice that both elements are terminals.

For the dmethod rule, the FIRST set would be formed by the parenthesis: FIRST(dmethod) = {(}, while FIRST(parameter) = {type}. Notice that the FIRST sets contain only terminal elements.


3.8.2. FOLLOW Sets
Now, what would be the FOLLOW set for parameter? Again, the FOLLOW set is formed by the tokens that can appear after a rule finishes. So, in this case, what follows parameter is the terminal parenthesis: FOLLOW(parameter) = {)}. In general terms, it is possible to say that the elements of the FIRST set are defined in the rule itself and are the first terminals that can be found in the rule. For the FOLLOW set, the elements are not in the rule itself; rather, they are in the rules that call the rule in question, and they are the token(s) that appear after the rule is completed.
A good practice when creating a rule is that the FIRST set and the FOLLOW set do not overlap, i.e., the intersection of these two sets should be empty. For our language, in order to validate that the rules we have are the best rules, we need to review that all the rules follow the good practice mentioned above.
Example. Given the following grammar:
<E> -> <A> + <A> | <A>
<A> -> <B> * <B> | <B>
<B> -> - <C> | <C>
<C> -> a
We want to review whether this grammar follows the best practice. So, we start by defining the FIRST set for each rule, i.e., for every non-terminal in the grammar (<E>, <A>, <B>, and <C>). For <C>, the FIRST set is as simple as the terminal a: FIRST(C) = {a}. Now, if we review <B>, there are two options to start <B>: the terminal - and the non-terminal <C>; but <C> can only be the terminal a. So, FIRST(B) = {-, a}.


Following the same process, what happens with <A>? There are two options for <A> and both of them start with the non-terminal <B>, and FIRST(B) = {-, a}; so for <A> the FIRST set is the same as for <B>: FIRST(A) = {-, a}. Now, for <E>, there are also two options and both of them start with <A>; so for <E> the FIRST set is the same as for <A>: FIRST(E) = {-, a}. Notice that in order to calculate the FIRST set for each rule, i.e., the list of tokens that can appear at the beginning of each rule, we started with the simplest rule, in this case <C>. Later we will review what happens with some of the rules in our language, e.g., dmethod and parameter.
Example. Given the following grammar:
S → ABC
S → F
A → EFd
A → a
B → aBb
B → ε
C → cC
C → d
E → eE
E → F
F → Ff
F → ε
where uppercase letters are non-terminals, lowercase letters are terminals, and ε represents the empty string. The following table represents the evolution of how the FIRST sets are calculated; each column, left to right, represents a new step in the process.

rule    FIRST set - evolution (left to right)
S       {}    {a}       {a, ε}    {a, ε, e, f}    {a, ε, e, f, d}
A       {}    {a}       {a, e}    {a, e, f, d}
B       {}    {a}       {a, ε}
C       {}    {c, d}
E       {}    {e}       {e, ε}    {e, ε, f}
F       {}    {ε}       {ε, f}


So the first column shows that each FIRST set starts as the empty set; the empty set is the minimum set that a FIRST set can have. Then, in a second iteration, we review the easiest rules, in this case the ones that have a terminal element at the beginning, such as F, C, B, and A. For the third column/iteration we go from the bottom to the top to complete the rules, and the same happens for the fourth and fifth columns. Notice that in the fourth iteration, for the rule <A>, we do not add ε to the FIRST set even though ε is part of the FIRST set of F. Why? Because ε could get into FIRST(A) only if the whole right side EFd could derive the empty string, and the terminal d can never be empty.
Homework. Define the FIRST set for ALL the rules in our language.
Now, what happens with the FOLLOW set? The FOLLOW set is formed by the tokens that can appear after the rule finishes. Which token is guaranteed to appear in the FOLLOW set of the start symbol of any grammar? The end-of-file token (eof, also written $).
Example. Given the following grammar:
<E> -> <A> + <A> | <A>
<A> -> <B> * <B> | <B>
<B> -> - <C> | <C>
<C> -> a
What are the FOLLOW sets for each rule? To define the FOLLOW sets we go top to bottom.
FOLLOW(E) = {$}. What follows when rule <E> finishes? The eof.
FOLLOW(A) = {+, $}. What follows when rule <A> finishes? If we observe rule <E>, we can define what follows when the rule finishes: there are two options, either a plus or the end of the file.
FOLLOW(B) = {*, +, $}. What follows when rule <B> finishes? The obvious answers are either the eof or an asterisk. However, if we observe, <A> can be just a <B>; in that sense, after finishing a <B> it is also possible to find the two options that follow <A>: the plus sign or the eof (rule <E>).
FOLLOW(C) = {*, +, $}.
Example. Given the following grammar, define the FOLLOW sets for all the rules:
S → ABC
S → F
A → EFd
A → a
B → aBb
B → ε
C → cC
C → d
E → eE
E → F
F → Ff
F → ε


The table below shows the process to find the FOLLOW sets for all the rules. Remember that the FOLLOW set of the start symbol contains at least the eof.
For S, no rule follows it, so FOLLOW(S) = {eof}.
For rule A: after A finishes, in rule S → ABC it is possible to see that what follows is rule B. As B is a rule (non-terminal), we need to review the first tokens of B to know what can follow A: FIRST(B) = {a, ε}. Now, as one of the elements of FIRST(B) is ε, if we observe the rule S → ABC, this means that if B is empty, what follows A will be C. As C is a rule (non-terminal), we then review FIRST(C) = {c, d} and add those elements to FOLLOW(A). Thus, finally, FOLLOW(A) = {a, c, d}. Notice that the empty element is not in the set.
Now for rule B: from the rule S → ABC, it is possible to say that after B comes C, so the elements of FIRST(C) will be part of FOLLOW(B): {c, d}. Now, from rule B → aBb, it is possible to say that after B we can find the terminal b, so this element is added to the set. Thus, FOLLOW(B) = {c, d, b}.
For rule C: from rule S → ABC it is possible to say that after C comes the eof, and from rule C → cC it is the same. So, FOLLOW(C) = {eof}.
For rule E: from rule A → EFd, it is possible to see that after E comes F. As F is a non-terminal, we review the elements of FIRST(F) = {ε, f}. The ε is not included in FOLLOW sets, so we have {f}. Then, in the same rule A → EFd, if F is empty, then after E it is possible to have d. So, we have {f, d}.
Finally, for rule F: from rule S → F, what follows F is the eof. From rule A → EFd, what follows F is d. From rule E → F, what follows F is the eof. From rule F → Ff, what follows F is f. Thus, FOLLOW(F) = {eof, d, f}.

rule    FOLLOW set - evolution (left to right)
S       {eof}
A       {a}      {a, c, d}
B       {c, d}   {c, d, b}
C       {eof}
E       {f}      {f, d}
F       {eof}    {eof, d}    {eof, d, f}


For the Program rule in our language:

FIRST (program) = {type, ε}
FOLLOW (program) = {$}

For dvar_global:

FIRST (dvar_global) = {;, ,}
FOLLOW (dvar_global) = {$, type}

Homework. Calculate the FOLLOW set for ALL the rules in the grammar for our language.

3.8.3. Example of FIRST and FOLLOW Sets

Consider the grammar:

S --> ABC    (1)
S --> F      (2)
A --> EFd    (3)
A --> a      (4)
B --> aBb    (5)
B --> ε      (6)
C --> cC     (7)
C --> d      (8)
E --> eE     (9)
E --> F      (10)
F --> Ff     (11)
F --> ε      (12)


a) Calculate the FIRST sets for all grammar symbols.

The rules we use to calculate the FIRST sets are the following, assuming the FIRST sets are initially empty:

I.   FIRST(X) = {X} if X is a terminal.
II.  FIRST(ε) = {ε}. Note that this is not covered by the first rule, because ε is not a terminal.
III. If A --> Xα, add FIRST(X) - {ε} to FIRST(A).
IV.  If A --> A1A2A3...AiAi+1...Ak and ε ∈ FIRST(A1) and ε ∈ FIRST(A2) and ... and ε ∈ FIRST(Ai), then add FIRST(Ai+1) - {ε} to FIRST(A).
V.   If A --> A1A2A3...Ak and ε ∈ FIRST(A1) and ε ∈ FIRST(A2) and ... and ε ∈ FIRST(Ak), then add ε to FIRST(A).

In what follows, we will be referring to the grammar rules and the FIRST set rules. For example, "apply III to (2)" means: apply FIRST set rule III to grammar rule (2).

Using rule I, we get
FIRST(a) = {a}
FIRST(b) = {b}
FIRST(c) = {c}
FIRST(d) = {d}
FIRST(e) = {e}
FIRST(f) = {f}

Using rule II, we get
FIRST(ε) = {ε}

Then we calculate all other FIRST sets.
rule (1): no change in FIRST(S) because FIRST(A) = ∅
rule (2): no change in FIRST(S) because FIRST(F) = ∅
rule (3): no change in FIRST(A) because FIRST(E) = ∅
Apply III to (4): FIRST(A) = {a}
Apply III to (5): FIRST(B) = {a}
Apply V to (6): FIRST(B) = {a, ε}
Apply III to (7): FIRST(C) = {c}
Apply III to (8): FIRST(C) = {c, d}
Apply III to (9): FIRST(E) = {e}
rule (10): no change in FIRST(E)
rule (11): no change in FIRST(F)
Apply V to (12): FIRST(F) = {ε}
Apply III to (1): FIRST(S) = {a}

Apply V to (2): FIRST(S) = {a, ε}
Apply III to (3): FIRST(A) = {a, e}
rule (4): no change
rule (5): no change
rule (6): no change
rule (7): no change
rule (8): no change
rule (9): no change
Apply III to (10): FIRST(E) = {e, ε}
Apply IV to (11): FIRST(F) = {ε, f}
rule (12): no change
Apply III to (1): FIRST(S) = {a, ε, e}
Apply III to (2): FIRST(S) = {a, ε, e, f}
Apply IV to (3): FIRST(A) = {a, e, f} (because ε ∈ FIRST(E), add FIRST(F) - {ε} to FIRST(A))
Apply IV to (3): FIRST(A) = {a, e, f, d} (because ε ∈ FIRST(E) and ε ∈ FIRST(F), add FIRST(d) - {ε} to FIRST(A))
rule (4): no change
rule (5): no change
rule (6): no change
rule (7): no change
rule (8): no change
rule (9): no change
Apply III to (10): FIRST(E) = {e, ε, f}
rule (11): no change
rule (12): no change
Apply III to (1): FIRST(S) = {a, ε, e, f, d}
rule (2): no change
rules (3)-(12): no change
A final pass over rules (1)-(12) produces no change, so we are done.

The final FIRST sets are:
FIRST(S) = {a, ε, e, f, d}
FIRST(A) = {a, e, f, d}
FIRST(B) = {a, ε}
FIRST(C) = {c, d}
FIRST(E) = {e, ε, f}
FIRST(F) = {ε, f}

b) Calculate the FOLLOW sets for all non-terminals.


FOLLOW sets are calculated only for non-terminals. We have the following rules for adding terminals or eof to the FOLLOW sets (remember that ε is never added to a FOLLOW set):

I.   eof ∈ FOLLOW(S)
II.  If A --> αB, add FOLLOW(A) to FOLLOW(B)
III. If A --> αBA1A2...Ak and ε ∈ FIRST(A1) and ε ∈ FIRST(A2) and ... and ε ∈ FIRST(Ak), then add FOLLOW(A) to FOLLOW(B)
IV.  If A --> αBA1A2...Ak, add FIRST(A1) - {ε} to FOLLOW(B)
V.   If A --> αBA1A2...AiAi+1...Ak and ε ∈ FIRST(A1) and ε ∈ FIRST(A2) and ... and ε ∈ FIRST(Ai), then add FIRST(Ai+1) - {ε} to FOLLOW(B)

In the remainder of this section, roman numerals refer to the rules for FOLLOW sets. We start by applying rule I to S and obtain
FOLLOW(S) = {eof}
Now we can start applying the remaining rules:
Apply II to (1): FOLLOW(C) = {eof}
Apply IV to (1) and add FIRST(C) - {ε} to FOLLOW(B): FOLLOW(B) = {c, d}
Apply IV to (1) and add FIRST(B) - {ε} to FOLLOW(A): FOLLOW(A) = {a}
Apply V to (1) and add FIRST(C) - {ε} to FOLLOW(A): FOLLOW(A) = {a, c, d}
Note that rule III does not apply to (1) because ε ∉ FIRST(C)
Apply II to (2) and add FOLLOW(S) to FOLLOW(F): FOLLOW(F) = {eof}
Apply IV to (3) and add FIRST(d) - {ε} to FOLLOW(F): FOLLOW(F) = {eof, d}
Apply IV to (3) and add FIRST(F) - {ε} to FOLLOW(E): FOLLOW(E) = {f}
Apply V to (3) and add FIRST(d) - {ε} to FOLLOW(E): FOLLOW(E) = {f, d}
No (roman numeral) rules apply to (4)
Apply IV to (5) and add FIRST(b) - {ε} to FOLLOW(B): FOLLOW(B) = {c, d, b}
No other (roman numeral) rules apply to (5)
No (roman numeral) rules apply to (6)
Applying rule II to (7) does not change FOLLOW(C)
No other (roman numeral) rules apply to (7). You should convince yourself this is the case!
No (roman numeral) rules apply to (8)


Applying rule II to (9) does not change FOLLOW(E)
No other (roman numeral) rules apply to (9)
Apply II to (10) and add FOLLOW(E) to FOLLOW(F): FOLLOW(F) = {eof, f, d}
Apply IV to (11) and add FIRST(f) - {ε} to FOLLOW(F): FOLLOW(F) = {eof, f, d}. No change.
No (roman numeral) rules apply to (12)

At this point we have:
FOLLOW(S) = {eof}
FOLLOW(A) = {a, c, d}
FOLLOW(B) = {c, d, b}
FOLLOW(C) = {eof}
FOLLOW(E) = {f, d}
FOLLOW(F) = {eof, f, d}

We do another round:
rule (1): no change
rule (2): no change
rule (3): no change (not affected by FOLLOW sets)
rule (4): no change
rule (5): no change
rule (6): no change
rule (7): no change
rule (8): no change
rule (9): no change. The most up-to-date FOLLOW(E) is already added to FOLLOW(F)
rule (10): no change
rule (11): no change
rule (12): no change

So, the FOLLOW sets are:
FOLLOW(S) = {eof}
FOLLOW(A) = {a, c, d}
FOLLOW(B) = {c, d, b}
FOLLOW(C) = {eof}
FOLLOW(E) = {f, d}
FOLLOW(F) = {eof, f, d}
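The iterative process used above is mechanical enough to automate. Below is a minimal Java sketch of the FIRST-set part following rules I-V; the grammar representation (a production as a String array whose first element is the left-hand side) is an illustrative assumption, not part of the course project.

import java.util.*;

public class FirstSets {
    static final String EPSILON = "ε";   // marks an empty right-hand side

    // productions: each String[] is {A, X1, X2, ..., Xk}
    public static Map<String, Set<String>> first(List<String[]> productions,
                                                 Set<String> nonTerminals) {
        Map<String, Set<String>> first = new HashMap<>();
        for (String nt : nonTerminals) first.put(nt, new HashSet<>());
        boolean changed = true;
        while (changed) {                      // repeat until a full pass adds nothing
            changed = false;
            for (String[] p : productions) {
                Set<String> lhs = first.get(p[0]);
                boolean allNullable = true;    // could X1..Xi all derive ε so far?
                for (int i = 1; i < p.length && allNullable; i++) {
                    // rules I and II: a terminal's FIRST set is the terminal itself
                    Set<String> fx = nonTerminals.contains(p[i])
                            ? first.get(p[i])
                            : Collections.singleton(p[i]);
                    for (String s : fx)        // rules III and IV: add FIRST(Xi) - {ε}
                        if (!s.equals(EPSILON) && lhs.add(s)) changed = true;
                    allNullable = fx.contains(EPSILON);
                }
                if (allNullable && lhs.add(EPSILON)) changed = true;  // rule V
            }
        }
        return first;
    }
}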


WEEK 07
Reviewing our Language
Modifying our Lexer

In order to implement the Parser we need to make a change in our Lexer: it needs to generate a table (list) with all the tokens identified. This list of tokens will be the input for the Parser. For each token, the list should have its category, the token itself, and the line where the token appears in the input file, e.g.,
IDENTIFIER i 1

In order to implement this, we will need a LinkedList of Token objects; thus, a Token class should be created. The structure of our compiler so far will have three classes (.java files):
Lexer.java, which contains all the methods for the lexical analyzer to recognize all the tokens in the input file, and which contains the main method.
Token.java, which defines the token object and all its methods.
Parser.java, which contains one method for each of the grammar rules in our language.
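A minimal sketch of what Token.java could look like; the field and method names are illustrative, not prescribed by the course:

public class Token {
    private final String category;  // e.g., "IDENTIFIER", "INTEGER", "DELIMITER"
    private final String lexeme;    // the token itself, e.g., "i"
    private final int line;         // line of the input file where it appears

    public Token(String category, String lexeme, int line) {
        this.category = category;
        this.lexeme = lexeme;
        this.line = line;
    }

    public String getCategory() { return category; }
    public String getLexeme()   { return lexeme; }
    public int    getLine()     { return line; }

    public String toString()    { return category + " " + lexeme + " " + line; }
}

With this class, the Lexer can fill a LinkedList<Token> and hand it over to the Parser.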

Reviewing our Language
Implementing the Parser

In order to implement (code) the Parser for our language we first need to validate all our grammar rules. To validate the rules, we need to review the FIRST and FOLLOW sets. There are two conditions to review (a worked check follows below):
Whenever there is the option of following several different paths (alternatives), the intersection of the FIRST sets of all the possible paths should be the empty set.
Whenever there is a loop, the intersection of the FOLLOW set and the FIRST set of that rule should be the empty set.
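As a quick worked check of the second condition, take the loop in dvar_global, using the sets computed earlier:

FIRST(dvar_global) ∩ FOLLOW(dvar_global) = {;, ,} ∩ {$, type} = ∅

Since the intersection is empty, the parser can always decide, looking at a single token, whether to stay inside the rule (on ; or ,) or to leave it (on $ or type).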

Translating a syntax diagram into code is possible by following these tips:
Whenever there is a terminal (circle), it is translated as an if-else condition.
Whenever there are diverse paths to follow, they are translated as if conditions.
Whenever there is a loop in the diagram, it is translated as a do-while structure.

For example, for the first rules of our language, we will have the following pseudo-code.


public void program () {
    do {
        if (tokenList.get(currentToken) == type) {
            currentToken++;
        } else {
            print("Error: type expected");
        }
        if (tokenList.get(currentToken) == identifier) {
            currentToken++;
        } else {
            print("Error: identifier expected");
        }
        if (tokenList.get(currentToken) == '(') {
            // '(' is the first element on dmethod
            dmethod();
        } else {
            dvarglobal();
        }
    } while (!eof);
}

public void dvarglobal () {
    do {
        if (tokenList.get(currentToken) == ',') {
            currentToken++;
        } else {
            print("Error: missing comma");
        }
        if (tokenList.get(currentToken) == identifier) {
            currentToken++;
        } else {
            print("Error: missing identifier");
        }
    } while (tokenList.get(currentToken) == ',');
    if (tokenList.get(currentToken) == ';') {
        currentToken++;
    } else {
        // print("Error: missing semicolon");
        // This is what is expected on your Assignment 2:
        // skip ahead to recover from the error
        while (tokenList.get(currentToken) != ';' &&
               tokenList.get(currentToken) != '\n') {
            currentToken++;
        }
    }
}


public void dvarlocal () {
    boolean foo = false;
    do {
        foo = false;
        if (tokenList.get(currentToken) == '=') {
            currentToken++;
            exprlog();
        }
        if (tokenList.get(currentToken) == ',' &&
            tokenList.get(currentToken + 1) == identifier) {
            currentToken = currentToken + 2;
            foo = true;
        }
    } while (foo);
    if (tokenList.get(currentToken) == ';') {
        currentToken++;
    } else {
        print("Error: missing semicolon");
    }
}


Reviewing our Language
Creating the derivation tree

Once we have verified that an input file follows the defined grammar, it is possible to generate the derivation tree. As part of Assignment #2, we are going to generate the derivation tree of the input file. In order to (graphically) create the derivation tree, we are going to use a .dot file that can be read by the GraphViz application; so, what we need to do is to create the .dot file. GraphViz is a free graph visualization tool; it is possible to download it from: http://www.graphviz.org/Download..php

The format for the .dot file is the following:
digraph derivationTree {
    A [label = "Hello"]
    A -> {B}
    A -> {C}
    B [label = "World"]
    C [label = "Bye"]
    1 [label = "One"]
    2 [label = "Two"]
    3 [label = "Three"]
    4 [label = "Four"]
    1 -> {2,3}
    2 -> {4}
}

The first and last lines of the .dot file are mandatory. The name of the digraph could be anything; in this case it is derivationTree. Nodes in the tree can be identified with letters, numbers, or any word; in this case some of them are identified with the letters A, B, and C, and others with the numbers 1, 2, 3, and 4. To set the label we want to appear in a node, the notation is: nodeID [label = "<label>"], e.g.,
A [label = "Hello"]
1 [label = "One"]

To indicate a connection of a node with one or more other nodes the notation is: node -> { list of nodes } e.g.,
A -> {C}
1 -> {2,3}

So the previous .dot file generates the following tree:

The challenge then is to create the .dot file while doing the parsing: finding the correct places to create the nodes (nodeIDs), labels, and connections, and to write them into the .dot file.


So, inside each of the methods that represent a rule in our programming language, it is necessary to add the code that writes into the .dot file. Every time we enter a new rule, we need to create a node (nodeID) and define its label. And every time that a method is called, it means a connection in the tree.
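A sketch of how one rule method could be instrumented; nodeCounter and dotWriter are hypothetical helpers (a counter for fresh node IDs and a PrintWriter opened on the .dot file), and the label text is up to you:

private int nodeCounter = 0;
private java.io.PrintWriter dotWriter;   // opened on the .dot file elsewhere

public int dvarglobal(int parent) {
    int me = nodeCounter++;                          // fresh nodeID for this rule
    dotWriter.println(me + " [label = \"dvar_global\"]");
    dotWriter.println(parent + " -> {" + me + "}");  // connect to the calling rule
    // ... the existing parsing code goes here; every nested rule call
    // receives `me` as its parent, and every matched terminal gets its
    // own labeled leaf node in the same way ...
    return me;
}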


WEEK 08
4. SEMANTIC ANALYZER
Semantics is about the meaning, i.e., the interpretation of the expression and its context. To do the semantic analysis of the code we need to check for six things:
1. Review that a variable is unique and that, before it is used, the variable was declared.
2. Review that the value assigned to a variable corresponds to the type of the variable.
3. Review that the indexes in an array are integers.
4. Review that all the conditions in if and while structures are boolean.
5. Review that the return value of a method corresponds to the type of the method.
6. Review that the parameters in the call of a method correspond in number and type to the declaration of the method.

So the first thing that we need to do is to store and identify the variables. For example, consider the following code:
int x;
x = 5;
int x (int y) { }
int x () { }

So, in the code above, we need to distinguish between the variable and the methods named x, and also consider that it is possible to have two methods with the same name, as the ones above, as long as they have different parameters (different in type and/or number). This is called overloading. So the above is correct; however, if we were to add the following method:
void x (){ }

This should be recognized as an error. Remember that a method's signature consists of its name and its parameters (not its return type), and this void x() has the same name and parameters as int x(). If we have the following code:
int a;
a = 5;


int x (int y) {
    int a;
}
int x () {
    int a;
}

As can be seen, there is a variable a that is declared globally but also declared inside each of the methods. This is correct, because each variable has its own scope: the first declaration of a is in the global scope, while the other two a's are inside a method, which makes them different variables. Another thing to review, besides the signature and the scope, is the assignment and the types.
int a;
a = 5;

This is correct because 5 is an integer value.


char a;
a = 5;

This is also correct: if a char variable is assigned an integer number, it is taken as the numerical value of the character, e.g., if the value is 65 the letter would be A.
float y=5;

For some languages the above could be correct, because it is possible to store an integer in a float. However, it might not be valid in other languages.
boolean z = 1;

Again, this example might depend on the language considered; some languages treat the number one as a true value. As can be seen, each language has its own semantic rules. Now, consider the following code and let's review the assignment of values.
float a = 0;
int x = (2 + a) / 7;

When a variable is assigned a whole expression, as x is above, it is important to evaluate the expression to see if it matches the type of the variable.


For instance, for the expression above we can create the derivation tree, and then review the resultant type of each section of the expression and the final type of the whole assignment, to see whether the assignment is valid or not.

So, given the derivation tree, we need to know if adding an integer (2) to a float (a) is valid. For this example, if we assume it is valid and that it returns a float, we can proceed to the division of that float by an integer (7), which will return a float. Then we can review the assignment of that float result to the variable on the left side, checking whether the types match. What happens with the expressions inside if and while structures? It is expected that the resultant type of the expression is a boolean value.
if (x < 5) {}
if (a) {}
if (5+5) {}
if (5/6) {}

For the expressions above, it is necessary to review the resultant type of each expression. x < 5 returns a boolean value, so it is correct. For the last three expressions, it depends on the language. For instance, in Java these expressions are incorrect because their resultant types are numerical values, not boolean values. However, in other languages such as C/C++ they would be correct, because the language considers anything different from 0 as true. Another example is arrays. The rule for arrays is that the expression for the index of an array should be an integer value. So, it is possible to have the following expressions:
int x;
float a = 0.0;
int arr[];
arr[5] = 0;
arr[x+1] = 5;
arr[a] = 5;


So in the case of arr[5] it is easy to know that it is correct: 5 is an integer. For the expression x+1, as x is an integer, the result of the expression is an integer. However, in the case of arr[a], a is a float, so this should be considered an error. Now, for the return value of a method, it is important to review that the expression in the return statement matches the type of the method. For instance, in the following example, the expression in the return statement is 5, which is an integer and matches the type of the method, which is also integer.
int foo () {
    float a;
    return 5;
}

For the following example, the expression (5*7)/a would be a float value, and the type of the method is an int, so this is incorrect.
int foo () {
    float a;
    return (5*7)/a;
}

Last but not least, it is important to review that the parameters used when a method is called correspond in type and number with the parameters defined for that method. For instance, in the following example:
int foo () {
    float a;
    return a;
}
void foo (char a) { }
void main () {
    int j = foo();
    foo(a);
    foo(7 - 1 * 2 % 3);
}

The first call to the foo method is correct: the foo method returns an int value and this is assigned to an int variable, j. The second call to foo is also correct: there is another method, whose signature is different from the first one, that expects a char as parameter and does not return anything. For the last call to foo, we would need to verify that the whole expression 7 - 1 * 2 % 3 is a char. Why a char? Because there is no variable assigned to that call, so the call can only match the overload that takes a char parameter.


Now for the following cases, following the rules in Java, which cases are correct and which are not?

Case 1.
int i;
char j;
int m;
void method(int n, char c) {
    int n;
    short l;
    i = j;
    i = m;
}

The expression i = j could be incorrect. It depends on whether the language allows chars and integers to work together, i.e., whether a char value can be assigned to an integer variable.

Case 2.
int i, j;
void method() {
    int i = 5;
    int j = i + i;
    int i = i + i;
}

The problem here is that the variable i is defined two times inside the method; both definitions are in the same scope, so this is incorrect, even though there is also a variable named i in the global scope (that one does not conflict).

Case 3.
int i, m, k;
boolean j;
void main() {
    if (i > 5) { ++i; }
    while (i + 1) { ++i; }
    do { ++i; } while (i);
    for (i = 0; m; ++i) { k++; }
}

In this case, if we follow the rules in Java, the problem is in the expressions while (i + 1) and while (i), because i is an integer and the condition must be boolean; the same applies to the for condition m.

Case 4.
int a;
int b;
int c, d;
char c1, c2;
int test1(int x, int y) {
    return x + y;
}
void main() {
    int i;
    i = a++;
    i = test1(a, b);
    i = test1(c1, c2);
    i = test1(a, c1);
}


If we follow the rules in Java, the problem here is in the calls to the method test1. In the second call of the method, c1 and c2 are character values and the method is expecting integer values. In the third call, the problem is with c1: again, it is a character value where an integer is expected.

Case 5.
int i, m;
boolean j;
public void main() {
    int m;
    int a[];
    a = new int[j];
}

Here the problem is that we are using a boolean variable as the size of an array. Again, some languages might consider this valid, some others not; that is why it is important to define the semantic rules for your language.

Case 6.
int i;
void main(int m) {
    i++;
    return i;
}

Here the problem is with the return expression. The method is not expected to return anything (its type is void); however, there is a return expression with an integer value.

To implement the semantic analyzer it is necessary to have a good understanding of the syntax rules, because the code for the semantic analyzer will be added on top of the methods created for the syntax rules. Besides that, it is necessary to create a Symbol Table containing all the variables and methods defined in a program, and a cube of types that defines which operations are valid for each type of data and the type of the result of each operation.

4.1. Symbol table

In order to implement the semantic analyzer we need to create a table, named the Symbol Table. This table stores the name or signature of each variable as well as its type and scope. So, every time a variable is declared it is stored in the symbol table, and every time a variable is used it is important to check that the variable was already declared. Consider this code:
int i;
char j;
int m;
void method(int n, char c) {
    int n;
    short l;
    i = j;
    i = m;
}

int i, j;
void method() {
    int i = 5;
    int j = i + i;
    int i = i + i;
}

int i, m, k;
boolean j;
void method(int i) {
    if (i > 5) { ++i; }
    while (i + 1) { ++i; }
    do { ++i; } while (i);
    for (i = 0; m; ++i) { k++; }
}

While creating the table of symbols, one row is created for each variable. So, for instance, for the variable i there will be the rows:

i    global    int
i    method    int

meaning that the variable i is declared as a global int variable but also inside the method called method as an int value.
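A minimal sketch of how such a row could be represented in Java (all names are illustrative):

public class Symbol {
    String name;   // e.g., "i"
    String scope;  // e.g., "global" or a method signature such as "method"
    String type;   // e.g., "int"

    Symbol(String name, String scope, String type) {
        this.name = name;
        this.scope = scope;
        this.type = type;
    }
}

The symbol table itself can then simply be a LinkedList<Symbol>, one element per row. Now consider a larger example: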
int a, b;
char c, d;
float e, f;
void foo1(int a) {
    // float a = 1;
    float a = a;
}
void foo2(char b) {
    int a = c + d;
}
int foo3() {
    int i = a + b;
}

The symbol table for this code will be the following:

a          global      int
a          foo1_int    int
a          foo1_int    float
a          foo2_char   int
b          global      int
b          foo2_char   char
c          global      char
d          global      char
e          global      float
f          global      float
foo1_int   function    void
foo2_char  function    void
foo3_      function    int
i          foo3_       int

Note that for the methods not only the name is stored but the whole signature, which consists of the name and the types of the parameters. In the case of variable a there is a problem (marked in red): there are two a's inside the same scope foo1_int.

Now, in terms of usage of the variables, what happens when we need to verify that a variable is defined before it is used? We are going to use the symbol table to review that. In general terms, when using a variable, local variables have priority over global variables.

In order to create the symbol table it is possible to modify the methods of the Parser (program, dvar_global, dmethod, and parameter) to store the variables and methods while the Parser recognizes a valid variable, method, or parameter. The red balloons (in the slides) indicate the places in the methods where it is needed to add some code; some of that code will add the variables into the Symbol Table.

So, when a program starts, it is expected to have some global variables and/or some methods. Starting with the program rule: each time a type is recognized it can be stored, and each time an identifier is recognized it can be stored. Now, there are two options: either it is a global variable or it is a method. If it is a global variable, it could be only one or a list of them, so every time the rule completes the loop with a new identifier, this identifier can also be stored using the last recognized type. When the semicolon is detected, the list of variables ends and it is a syntactically correct list of global variables; that is the point where those variables should be added to the Symbol Table.


Now, if it is a method, every time the type of a parameter is identified it should be stored in order to build the method signature; but it is only after all parameters are detected (after the closing parenthesis) that it is possible to say they are all syntactically correct, and that is when we should store the method signature and all its parameters in the Symbol Table.
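A sketch of the insertion step for global variables, assuming the Symbol class sketched above plus two illustrative helpers: lastType (the most recently recognized type) and pendingIdentifiers (the identifiers collected while looping):

// Inside dvar_global, once the semicolon confirms the list is valid:
for (String id : pendingIdentifiers) {
    if (lookup(id, "global") != null) {
        print("Error: variable " + id + " already declared");
    } else {
        symbolTable.add(new Symbol(id, "global", lastType));
    }
}
pendingIdentifiers.clear();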

4.2. Variables are previously declared

Now, if we need to review that an identifier was defined before it is used, or that a method exists before it is called, we need to add some lines of code in the term and call_function rules.
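A lookup sketch, using the rule stated above that local variables have priority over global ones (the method name and the use of null are illustrative choices):

// Returns the symbol for `name`, preferring the current scope over "global".
Symbol lookup(String name, String currentScope) {
    for (Symbol s : symbolTable)
        if (s.name.equals(name) && s.scope.equals(currentScope)) return s;
    for (Symbol s : symbolTable)
        if (s.name.equals(name) && s.scope.equals("global")) return s;
    return null;   // not declared: the caller reports a semantic error
}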

4.3. Type matching and the cube of types


...it is possible to evaluate if the type of the expression, calculated with the derivation tree and the cube, matches the type of the identifier.

Homework 2. Review your Parser to assure you understand all the rules for your grammar.
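The cube of types can be stored as a three-dimensional lookup table indexed by the operator and the two operand types. A minimal sketch in Java, with made-up type and operator codes; which combinations are valid is for you to decide for your language:

// cube[operator][leftType][rightType] = result type, or ERROR if invalid.
static final int INT = 0, FLOAT = 1, CHAR = 2, BOOLEAN = 3, ERROR = -1;
static final int PLUS = 0, TIMES = 1, LESS = 2;    // one index per operator

static final int[][][] cube = new int[3][4][4];
static {
    for (int[][] op : cube)
        for (int[] row : op)
            java.util.Arrays.fill(row, ERROR);     // everything invalid by default
    cube[PLUS][INT][INT]   = INT;      // int + int -> int
    cube[PLUS][INT][FLOAT] = FLOAT;    // int + float -> float (if your language allows it)
    cube[PLUS][FLOAT][INT] = FLOAT;
    cube[TIMES][INT][INT]  = INT;
    cube[LESS][INT][INT]   = BOOLEAN;  // comparisons yield boolean
}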


WEEK 09
EXAM REVIEW: LEXICAL AND SYNTAX ANALYSIS
Consider the following code:
int a;
int c (int b) {
    return b * 3 * 2 * 1;
}
void main () {
    a = 1;
    boolean a = c(14) / 2 > 1;
}

1. Obtain the list of Tokens.
2. Build the derivation tree.

The list of Tokens will be as follows:
int        type
a          id
;          delimiter
int        type
c          id
(          delimiter
int        type
b          id
)          delimiter
{          delimiter
return     keyword
b          id
*          operator
3          integer
*          operator
2          integer
*          operator
1          integer
;          delimiter
}          delimiter
void       type
main       id
(          delimiter
)          delimiter
{          delimiter
a          id
=          operator
1          integer
;          delimiter
boolean    type
a          id
=          operator
c          id
(          delimiter
14         integer
)          delimiter
/          operator
2          integer
>          operator
1          integer
;          delimiter
}          delimiter

The derivation tree will look as follows (consider that not all the terminal elements are shown in the tree):
NOTE: There is an error in the call_function rule: a semicolon (;) is misplaced and cannot be in that rule. Please check it and fix it.



EXAM REVIEW: SEMANTIC ANALYSIS


Now, considering the same code above, we can obtain the Symbol table for it.
int a;
int c (int b) {
    return b * 3 * 2 * 1;
}
void main () {
    a = 1;
    boolean a = c(14) / 2 > 1;
}

Now the Symbol table will look as follows:


a      global     int
a      main       boolean
c_int  function   int
b      c_int      int
main   function   void

So, at this moment, we have the tokens, their types, and also the symbol table. Remember that, in order to do the semantic analysis, we need to check for six things:
1. Review that a variable is unique and that, before it is used, the variable was declared.
2. Review that the value assigned to a variable corresponds to the type of the variable.
3. Review that the indexes in an array are integers.
4. Review that all the conditions in if and while structures are boolean.
5. Review that the return value of a method corresponds to the type of the method.
6. Review that the parameters in the call of a method correspond in number and type to the declaration of the method.

Remember that we are not handling arrays, so we only have to check for 5 things. In terms of programming, we are going to use a stack as the data structure to implement the semantic analyzer. One important thing to consider is that at this point we have all the tokens, all the types, we know that all the expressions are syntactically correct, and we also have the cube of types. Using the stack it is possible to review the result type of any expression.

Let's consider that we are analyzing the line b * 3 * 2 * 1. When we are creating the derivation tree and we are dealing with an expression, getting to a term means that we have an operand. So, looking at the derivation tree, we


get into a term when we process the token b. So, we are going to store the type of the token in the stack:

int   (b)

Now, when we find the token *, we need to store it in order to check that the operation we are about to process returns a valid type. It is important to remember that * is a binary operator: it needs two operands. (When we read a unary operator, the way to proceed is different, because the operation is processed with a single operand.) Now we get again to a term, this time with the integer 3, so we store its type in the stack:

int   (3)
int   (b)

Now, when do we process the operands in the stack? Every time we get to the end point of the product rule, we process the types on the stack using the stored operator. In our example, we have two integers in the stack and we have stored the operator *. So we need to review in the cube of types whether the operator * has a valid result type for two integers. In this case it does, and the resulting type is an integer. We then pull out the two processed types and push in the result. Now the stack looks like this:

int   (b * 3)

We follow the same process for the rest of the expression b * 3 * 2 * 1. At the end, we will have an int type in the stack, which means that the result type of the whole expression is int. At this point, it is possible to review whether the expression matches, in this case, the return type of the method. So, in terms of implementation, when we get to the end of the return rule we can check for type matching between the type of the method and the type of the expression in the return statement.
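Putting the pieces together, a sketch of the reduction performed each time we reach the end point of the product rule; typeStack, the cube, and print are the illustrative helpers from the previous sections:

java.util.Deque<Integer> typeStack = new java.util.ArrayDeque<>();

// Called at the end point of the product rule, after both operand types were
// pushed and `operator` was remembered when its token was read.
void reduceBinary(int operator, int line) {
    int right = typeStack.pop();
    int left  = typeStack.pop();
    int result = cube[operator][left][right];   // look the operation up in the cube
    if (result == ERROR) {
        print("Type error in line " + line);
    }
    typeStack.push(result);   // the result type becomes the new operand
}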


The same works when we are reviewing an assignment statement: when we finish processing the expression on the right side of the assignment, we can review that its result type matches the type of the left side.

Let's review what happens, in terms of matching types, when we call a method, as in the sub-expression c(14). We are calling c(14), so we need to know if the type of the parameter used in the call matches the type of the declared parameter. To do this, every time we get to the point in the call_function rule where we find the closing parenthesis, we can review that the type of the explog matches the type defined for the method. Again, the type of the explog will be the type at the top of the stack, and the type of the defined parameter will be in the symbol table.

Now, in order to review the line with the declaration of the local variable that also has an initial assignment:
boolean a = c(14) / 2 > 1;

we need to know the types of both the left and right side of the = symbol. In this particular case the left side is an identifier which, through the table of symbols, we find out is a boolean. We need to obtain the type of the right side. We start with the expression and, using the stack, we can obtain its result type. First we get the call to the function c(14); from the prior explanation we know that the parameter type matches and that the resultant type is int, so in the stack we will have an int:

int   (c(14))

Now, the next token is /, so we store that operator. The next token is 2, which is an integer, so when we get to the term rule we store its type in the stack. The stack looks like:

int   (2)
int   (c(14))

Now, as we pass for the second time through the product rule, we need to process the operator, / in this case. We need to review in the cube whether the operator / has a valid result type for two integers; in this case let's assume that the resultant type is an int. So we pull out the two integers and we push in the resultant type:

int   (c(14) / 2)

Now, the next operator is >, again a binary operator. So we read the next token, which is 1:

int   (1)
int   (c(14) / 2)

Now, when we pass the second time through the product rule, we process the operator. If we look into the cube for the valid types for the operator >, we will find that the resultant type for two integers is a boolean.

boolean   (c(14) / 2 > 1)

So, when in the rule dvarlocal we get to the point where we find the semicolon, we can review that, if there was an initial assignment for a variable, the types match. In this particular case, we know that the local variable a is a boolean (looking at the symbol table), and the expression is also a boolean.
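A sketch of that final check at the semicolon of dvarlocal, under the same assumptions (hadInitialAssignment, varName, currentScope, and typeCode are illustrative names):

// At the semicolon of dvarlocal, if the variable had an initial assignment:
if (hadInitialAssignment) {
    int rhs = typeStack.pop();                     // result type of the expression
    Symbol var = lookup(varName, currentScope);    // symbol table lookup from 4.2
    if (var == null || typeCode(var.type) != rhs) {
        print("Type mismatch in the declaration of " + varName);
    }
}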


WEEK 10
5. CODE GENERATION AND COMPILING
So far what we have done is review that a piece of code follows the lexical, syntax, and semantic rules of a particular language. The next step in our compiler will be to translate the high-level language to a low-level language, i.e., to generate the code and compile it.

5.1. Compilers

Consider the following code:


int a (int b, int c, int d, string s) {
    if (s == "s") {
        return b * c + d;
    }
}
void main () {
    print( a(4, 5, 6, "s") );
}

So, if we translate the prior code to a low-level language it would be something like the following:


sto 0, s
sto 0, d
sto 0, c
sto 0, b
lod s, 0
lit s, 0
opr 0, 14
jmc f, _E1
lod b, 0
lod c, 0
opr 0, 4
lod d, 0
opr 0, 2
sto 0, a
opr 0, 1
lod _E2, 0
lit 4, 0
lit 5, 0
lit 6, 0
lit s, 0
cal a, 1    ; name and address
lod a, 0
opr 0, 21
opr 0, 1

5.2. Instruction and variables handling

If we could review the content of an .exe file or a Java bytecode file, we could see (figuratively) that the file has a section that corresponds to the variable declarations and a section with the actual code. That means that in order to execute a piece of code we need to define all the elements (variables in the code) and their types, in order to reserve the memory needed to load/store the variables. The good news is that all that information is already in the symbol table. In other words, when a piece of code is going to be executed, two things are loaded in the memory of the computer: the symbol table (stored in the stack/data segment) and the instructions (instruction segment). Also part of the information stored in the data segment is a number that indicates which line of the low-level code is the one where the main method of the program starts. When a piece of code is loaded and executed, data flows from the (execution) memory to the memory inside the CPU, the registers, and vice versa. The low-level code uses only operations that the CPU is able to understand (e.g., assembly commands/instructions).


5.2.1. Set of instructions

In this example, the instructions available in the low-level language are the following:

LIT   Moves a value (e.g., 5) to a register in the CPU.
LOD   Moves the value in a variable (e.g., a) to a register in the CPU.
STO   Stores the value of a register to a variable (symbol table).
OPR   Indicates to execute an operation. Each operation has a number as an ID.
JMC   Indicates a conditional jump.
JMP   Indicates a jump to a particular line.
CAL   Indicates the call to a method. The parameter is the line where the method starts.
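As a small illustration, a statement such as a = b + 5 could be translated with these instructions (assuming, as in the listing earlier, that operation number 2 is addition):

lod b, 0    ; move the value of variable b to a register
lit 5, 0    ; move the literal value 5 to a register
opr 0, 2    ; execute operation 2 (addition)
sto 0, a    ; store the result in variable a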

5.2.2. Registers

The CPU has its own memory, called registers. These registers store the values that are going to be used when an operation is executed. That means that the values from memory must be loaded into the registers of the CPU. If the result of an instruction needs to be stored in a variable, then the value of the register is loaded into a variable, a space in memory. There is a special register in the CPU, the PC (Program Counter) register, which is a counter that indicates which line of the code is being executed. Every time an instruction is


executed, the value in this register is incremented by one. The program finishes when the OPR -1, 0 instruction is found in the code; this operation indicates the end of the program.

5.2.3. Stack

The variables of a program are loaded in the memory stack. This space of memory is reserved as the variables are defined. Every time a variable is assigned a new value, that value is stored in the memory stack; every time the value of a variable is printed, the value on the stack is taken and printed. Consider the following code:
void method (int a) { print (a); if (a < 10) method(a+1); else return; }

As it can be seen this is a recursive method. In this case, every time the method is called the value of a will change. In terms of memory, this means that every time that the method is called the value in a is going to change. However, it is not overwritten, a new space in memory is created, but it is not created in the memory but in the heap.

5.2.4. Heap Beside the stack, there is also a heap. This heap is also part of the memory. The heap is used to store complex variables, such as arrays, objects; i.e., those types that have several basic types inside of them. The stack stores then the addresses of where the set of values that compound that complex type stars in the heap.

For example, the following code is in C/C++:


int x; //the value of x will be stored in the stack.


int *a;    // the stack will have the address where the value of a
           // is stored in the heap

int []b;   // the stack will have the address where the first element of
           // the array is; the index of the array is used to
           // increment the address of the first element that many
           // places ahead in the heap

int **c;   // the stack stores an address; at this address is
           // stored another address, which is the address where the
           // value is stored

In the case of having an object in Java, the stack stores the address where the first element (attribute/variable) of the object starts.

5.2.5. Virtual machine

Some languages, such as Java, have what is called a virtual machine (VM). A VM is a kind of translator (not a compiler). What this translator does is take the compiled code and translate it into low-level code specific to the OS and the architecture of the computer where the VM is running. The VM is also able to perform memory handling and has garbage collection features.


WEEK 11
5.3. Compiling

We have been talking about compilers; in fact, we are working on the development of a compiler. One of the tasks that the compiler does, as the final step, is to take the code and translate it into low-level code, making it readable and understandable for the CPU. Some languages, such as Java, have what is called a Virtual Machine. This VM is an interpreter, and it is the one in charge of making this translation of the code into low-level code.

6. MEMORY
While creating a program there are different ways to manage the memory.

6.1. Runtime Memory

One important thing to consider when running/executing a program is the use of memory, i.e., how much memory the program needs in order to execute. The amount of memory depends directly on the number of variables, the types of the variables, and the number of instructions.

6.1.1. Stack

When a program is executed, a certain amount of memory will be used. Some memory will be used to store the symbol table; part of what is stored is also the value of each variable at every moment during the execution of the program. All this is stored in the stack. However, how is this space in memory used? How are the values stored? Consider the following piece of code: if the first call of the method is method(1), what would be the output in the console?
public class test {
    public static void method (int a) {
        System.out.print(a + ",");
        if (a < 10) method(a + 1);
        else return;
        System.out.print(a + ",");
    }
    public static void main(String[] args) {
        method(1);
    }
}


In this piece of code, the method method is a recursive method, i.e., it calls itself. Looking into the code, the value of a, the parameter, is printed twice: before the if, where the method is called recursively, and after the if. The output of this code would be:

1,2,3,4,5,6,7,8,9,10,9,8,7,6,5,4,3,2,1,

How is the second, descending sequence generated? It is generated by the print instruction after the if, but how? Every time the method is called, the value of a is printed and the method method is called again, leaving a print statement pending. When a gets to the value 10, the recursive calls stop, but there are 9 pending prints that are now going to be executed. The interesting part is that, whenever each of these pending calls executes, it remembers the value the variable a had in that call: the values are stored in memory, so whenever the pending line is executed, the value of a can be retrieved.

6.1.2. Memory Space

When we define a variable in a program, what we are doing is requesting a space in the stack of the memory. In that space we store the value of the variable, and we refer to that value using the name of the variable. There are different ways to refer to that space in memory; each programming language provides different ways and approaches to do this. One of these approaches is the one used in C/C++, where it is possible to use pointers. A pointer is a reference (address) to a variable/value. The space needed to store the address is reserved from the stack; however, the space of memory where each pointer points is reserved from the heap. Consider these lines of code:
int a;

This line means that a space of memory (in the stack) is requested to store an integer. a will be the name we use to refer to this space of memory, and this space of memory has an address, e.g., 0xFF.
int *b;

This means that b is a pointer to an integer: b will store an address (this value is in the stack), an address that refers to an integer value, i.e., a space in the memory (in the heap) that is able to store/allocate an integer. So, using the example above, b can store the address of a variable, e.g., with the assignment b = &a. What is the meaning of the ampersand? While the asterisk means that the variable is a pointer, the ampersand means that we want to use the address (instead of the value) of a variable. So the assignment above is saying that b is going to store the address where the variable a is stored, in this example, 0xFF.
int []b;

An array is a kind of pointer. In this case, b is a pointer to the first element of the array. Again, the address (the pointer) is stored in the stack, while the actual values are stored in the heap.
int **c;

In this case, we have a pointer to a pointer; this means that we have a variable that stores an address that points to another address.

6.2. Pointers in C/C++

In C/C++, when we want to create a pointer we ask for the space of memory using instructions such as new/malloc. In the same way, when we don't need those pointers any more, it is important to let the program know, in order to clean/free the memory. What happens if we don't clean the memory? We will have spaces in memory, addresses, that are storing garbage. Some languages, such as Java, have what is called a Garbage Collector. The garbage collector is part of the Virtual Machine. The VM is kind of a nanny: it is in charge of freeing those spaces in memory that are not being used any more. But how does it know that the memory is not used any more? The key is to review whether there is any pointer to that specific space of memory. Java does not have pointers; however, every time we define a variable we are referring, somehow, to a space in memory. Every time we define a new variable, we are accessing a new available space in memory, i.e., no other program is using that space. However, when we allocate this space of memory we need to be sure that the space is clean, i.e., it is a good idea to initialize the variable. Some languages, such as Java, provide clean spaces (via the VM); however, languages such as C/C++ do not guarantee that the space is clean, and this can produce some awkward behavior in the code. Consider the following piece of code:
char ch = 'c';
char *chptr = &ch;

ch is a variable, and chptr is a pointer. In this case, we can say that chptr is pointing to the address of ch.
int i = 20;
int *intptr = &i;


again, i is a variable, and intptr is a pointer to an integer.


char *ptr = "I am a string";

In this case, ptr is a pointer to a char. Remember, a pointer stores an address (and it is in the stack). Here we have a pointer to the first letter of the string, and we can access the whole string (which is in the heap) by incrementing the pointer. So, a string is nothing else than a pointer to a sequence of characters, where the variable stores the address of the first character.

6.2.1. Alias

When defining a pointer, we need to reserve/allocate the space of memory; this space will be ad hoc to the type of pointer we want to define.
int *x, *y;
x = (int *) malloc(sizeof(int));
*x = 1;
y = x;    // alias
*y = 2;

In the code above we have two pointers, x and y, and we are reserving memory for x. Then, with the instruction *x = 1, we are saying that we want to store the integer value 1 at the address stored in x. So x points to an address in the heap, and now this space in the heap will have a 1 in it. We say y = x to have a second pointer to the same space of memory in the heap. So, when we have the line *y = 2, we are saying that the address pointed to by y should store a 2. Notice that this overwrites the number 1 stored before; this happens because y is pointing to the same space in memory as x. It is said that y is an alias of x. Another example:
int x[] = {1, 2, 3};
int *y = x;
x[0] = 4;
printf("%d\n", y[0]);
printf("%d\n", *y);
printf("%d\n", *(y+0));

x is a pointer; as we mentioned before, arrays are pointers, where the variable x (in the stack) stores the address of the first element of the array (in the heap). y is another pointer, which points to the same address as x.


If we want to store a value in the array, we need to specify at which address (space) we want to store that value; that is specified using an index. The line x[0] = 4; says that we want to store the value 4 at the first address of the array. The last three lines of the code generate output in the console. In this case, the values printed will all be 4: first y[0], then the content (*) of the first element of the array y, and then again the content (*) of the first element (+0) of the array y.

6.2.2. Dangling

Now, consider this code:
int *x, *y;
x = (int *) malloc(sizeof(int));
*x = 1;
y = x;    // alias
free(x);
printf("%d\n", *y);    // illegal

We have two pointers, x and y, but we only request a space for x. When we say y = x, y points to the same space of memory as x. When we free x, that space of memory is released; y still holds the old address, but it is not able to access it anymore, because the space was freed and y was only an alias: when x is freed, y is lost too. This is called dangling. We can review a new example:
int * dangle () {
    int x;
    return &x;
}
void main() {
    int *p = dangle();
    printf("%d\n", *p);    // illegal
}

This is another case of dangling. Why? Because the method dangle is returning the address of the local variable x, and as soon as the program returns to the main method that space will be freed, so the reference will be lost.

6.2.3. Garbage

Now, one more example:
int *x;
x = (int *) malloc(sizeof(int));


x = 0; // What happens in the memory?

In this case the problem is that this piece of code is generating garbage. Why? Because we never free that space of memory. Even though we are saying x = 0, this doesn't free the pointer; we just make it lose the reference to the space in memory it was pointing to, but the space in memory is still marked as used, and now it is not accessible through any reference. In the following code,
void p() {
    int *x;
    x = (int *) malloc(sizeof(int));
    *x = 2;
    // What happens in the memory?
    // missing free ?!
}

In this case, again, we are generating garbage. When the method finishes, the pointer x will no longer be used; however, the space in memory will remain marked as used. This causes that space in memory to remain unavailable for any other program.

6.2.4. Exercise

Consider the code below:
int array[] = {45, 67, 89};
int *array_ptr;
array_ptr = array;
printf("First element: %i\n", *(array_ptr++));
printf("Second element: %i\n", *(array_ptr++));
printf("Third element: %i\n", *array_ptr);

In this case we have two pointers, array and array_ptr. Remember that the postfix ++ means that first we use the value and then we do the increment. So, the first line prints the first element of the array and then the pointer is incremented one position; thus, the second line prints the second element of the array and then, again, the pointer is incremented one position. Thus, the third line prints the third element of the array, and the pointer remains pointing to the third element. The variables above could also be defined as follows:


int array[] = {45, 67, 89};
int **array_ptr;
array_ptr = &array;

The difference is that array_ptr is a pointer to a pointer. So, when initializing array_ptr, it is possible to assign the address of another pointer.


WEEK 12
7. FUNCTIONAL PROGRAMMING LANGUAGE: LISP
So far we have reviewed imperative or procedural languages. Now we are going to review a functional programming language, LISP. Functional programming languages are called this because everything is expressed as a mathematical function. LISP has its own virtual machine / interpreter, which takes care of the memory, so we don't need to worry about memory management.

7.1. Atomic expressions

One of the elements that differentiates one language from another is the lexical rules, i.e., the tokens that are recognized as valid. In LISP, these tokens are referred to as atomic expressions. Here is a list of atomic expressions valid in LISP:
-3
2.43
1233423039234123234113232340129234923412312302349234102392344123   [LISP handles very long numbers]
#C(3.2 2)        [the complex number 3.2 + 2i]
2/3              [the fraction 2/3, NOT "divide 2 by 3"]
-3.2e25          [the number -3.2 x 10^25]
#\g
#\{
#\Space
#\Newline
#\\              [the character '\']
#\tab #\newline #\space #\backspace #\escape
"Hello, World!"
"It's a glorious day."
"He said \"No Way!\" and then he left."
"I think I need a backslash here: \\ Ah, that was better."
t                [t symbolizes TRUE]
nil              [nil symbolizes FALSE]


7.2. Everything is a list

Now, in terms of syntax, in LISP everything is a LIST. Both data and commands are represented with lists, and all lists are enclosed in parentheses. In LISP there are no operators, only functions; the function is the first element of the list. Examples:
(+ 3 2 7 9)                  [add 3+2+7+9 and return the result]
21
(* 4 2.3)                    [multiply 4 by 2.3 and return the result]
9.2
(subseq "Hello, World" 2 9)  [return the substring]
"llo, Wo"

Some functions, such as addition (+) and multiplication (*), do not have a fixed number of parameters. Others, such as subseq (for strings), have a fixed number of arguments plus optional arguments. This can be seen as overloading a function: it has the same name but with a different signature.

7.3. Predicates

Functions that return a boolean (true or false) value are called predicates. Relational and logical operators are represented with predicates.
[5]> (= 4 3)         [is 4 == 3?]
NIL
(< 3 9)              [is 3 < 9?]
T
(numberp "hello")    [is "hello" a number?]
NIL
(oddp 9)             [is 9 an odd number?]
T

7.4. Errors

Every time there is an error in the code, LISP will stop; in order to continue, the error has to be fixed.

7.5. Recursion

In LISP it is possible to have functions inside functions, i.e., nested calls. Lists (parentheses) are used to specify that a new function is called.


Example:
(+ 33 (* 2.3 4) 9)

There are two lists, meaning there are two functions, one inside the other. In this case, the second list (the multiplication) becomes the second parameter of the first list (the addition).

7.6. Control Structures

LISP has a kind of IF-ELSE structure. The name of the function is if, and the structure of the list is as follows:

(if (condition) (function when the condition is true) (function when the condition is false))

Example:
(if (<= 3 2) (* 3 9) (+ 4 2 3))

In this case, this means that if 3 is less than or equal to 2, the function returns (3*9); if not, it returns (4 + 2 + 3). In this case, the result will be 9. It is also possible to have nested structures:
(if (= 2 2) (if (> 3 2) 4 6) 9)

where the result would be 4.


WEEK 13
7.7. Blocks

So far, all the expressions that we have reviewed have only ONE line. What if we want to have several lines, similar to the case where an if-else has several lines and we use curly brackets to represent a block? In LISP it is possible to define several lines, a block, by using the special form progn, which takes the form (progn expr1 expr2 expr3 ...). For example,
(if (> 3 2)
    (progn (print "hello")
           (print "yo")
           (print "whassup?")
           (+ 2 2))
    (+ 1 2 3))

7.8. Variables

By its nature, LISP is not a language that cares that much about variables. However, it is possible to define variables using a macro, the macro setf.
(setf x 0)    ; this expression is similar to having x = 0;

(setf x (* 3 2))
6
x
6
(setf y (+ x 3))
9

7.9. Loops

It is also possible to have loops. LISP defines a loop by using the iterator dotimes, which takes the following form:
(dotimes (var high-val optional-return-val) expr1 expr2 )


Example:
(dotimes (x 4) (print "hello"))

This will print four lines with the word hello.


(dotimes (x 5) (print x) )

This will print the numbers from 0 to 4 (dotimes starts counting at 0).


(setf bag 2)              ; notice that we do not have data types
(dotimes (x 3)
  (setf bag (* bag bag)))

For this code, the first time through the loop x=0 and bag=2, so after the first iteration bag has the value 4; the second time x=1 and bag=4, so after the iteration bag = 16; the third time x=2 and bag=16, so after the iteration bag = 256. The code in Java for the example above would be:
int bag = 2;
for (int x = 0; x < 3; x++)
    bag = bag * bag;

7.10. Local Variables

When using the clause (setf x 4), we are defining global variables. So, what happens if we want to create local variables? In order to create local variables we use the let macro.
(defglobal x 20)     ; x set globally
(setf x 4)
(let ((x 3))         ; x declared local
  (print x)
  (setf x 9)         ; the local x is set
  (print x)
  (print "hello"))   ; Why does "hello" print twice?

The output for this block will be:


3          ; the result of (print x)
9          ; the result of (print x) after we do the second assignation
"hello"
"hello"


Notice that the word hello is printed twice. Why? The first hello is the result of the line (print "hello"). But remember that LISP always prints the result of the expression; in the case of a set of expressions, the result of the whole set is the value of the last expression. In the example above, the last expression, (print "hello"), returns "hello", so the result of the whole set is also "hello". If after the let we had the line (print x), the output would be 4, the value of the global x. The let macro allows defining many local variables, e.g.,
(let ((x 0) (y 1) (z 3)))

Notice that each variable is defined inside parentheses. Let's review another example:
(let ((x 3) (y (+ 4 9))) (* x y) )

The result of this let will be 39. Let's review a new example:
(let ((x 3))
  (print x)
  (let (x)               ; note that this x does not have a value
    (print x)
    (let ((x "hello"))
      (print x))
    (print x))
  (print x)
  (print "end"))

The output for this code would be:


3
NIL
"hello"
NIL
3
"end"
"end"


7.11. Writing functions

In LISP it is also possible to create our own functions, using the defun macro. The form of this macro is:
(defun function-name (parameters) action)

Let's review some examples.

Example 1: If we define the following function:


(defun do-hello-world ( ) "Hello World)

Notice that the function does not have parameters. The system will print the name of the function in uppercase:
DO-HELLO-WORLD

Now, if we type the name of the function, notice that it should be in parentheses:
(do-hello-world)

The result will be:

"Hello, World!

Example 2: If we define the following function:


(defun add-four (x) (+ x 4))

Note that the function has one parameter, x, and the action is to perform the addition of the parameter plus 4. The system will print the name of the function in uppercase:
ADD-FOUR

Now, if we type the name of the function, notice that it should be in parentheses and that we need to provide the parameter:
(add-four 7)

The result will be:


11

Example 3:
We can have more than one parameter:

(defun hypotenuse (length width)
  (sqrt (+ (* length length) (* width width))))
HYPOTENUSE

(hypotenuse 7 9)
11.4017


Example 4: We can create the following function:


(defun first-n-chars (string n reverse-it)
  (if reverse-it
      (subseq (reverse string) 0 n)
      (subseq string 0 n)))
FIRST-N-CHARS

(first-n-chars "hello world" 5 nil)
"hello"
(first-n-chars "hello world" 5 t)
"dlrow"
(first-n-chars "hello world" 5 2013)
"dlrow"    ; any non-nil value counts as true

Exercise 1: Write the corresponding code in LISP for the following code in Java:
public static double factorial (double n) {
    double sum = 1;
    for (double i = 0; i < n; i++)
        sum = sum * (1 + i);
    return sum;
}
public static void main(String[] args) {
    System.out.println(factorial(1000) + "");
}

The name of the function is factorial. sum will be a local variable; the for loop becomes a dotimes, where i is the index of the loop; inside the loop we have the operation sum = sum * (1 + i). The function returns the factorial.

The translation for this method in LISP would be as follows:


(defun factorial (n)
  (let ((sum 1))
    (dotimes (i n)
      (setf sum (* sum (+ i 1))))
    sum))


Exercise 2: Translate the following recursive method in Java to LISP.


public static double factorial (double n) {
    if (n <= 1) return 1;
    else return n * factorial(n - 1);
}

The translation would be:


(defun factorial (n)
  (if (<= n 0)
      1
      (* n (factorial (- n 1)))))

7.12. Data Structures

Remember that everything in LISP is a list. How can we define a list?
(quote (1 hello 2 word))

This will set a list with four elements: 1, hello, 2, and word. Having (quote (...)) is similar to having '(...), so the previous list can also be written as '(1 hello 2 word). We can also have lists with lists in them, such as (1 2 (hello 3) 3 (bye) 3).

There are functions to retrieve values from a list. There is the function car, or first, which returns the first element of the list (it could be an element or a list). Then there is the function cdr, or rest, which returns a list with all but the first element. Let's see some examples.

Having the list ((A B) C (D)): this list has 3 elements, a list with A and B, then a single value C, and then a list with the element D. If we want the (car (car L))
;(first (first L))

So, in order to solve this, we can do it step by step.

(first (first '((A B) C (D))))

The inner first will return the list (A B), so now we have this:
(first '(A B))

And the result will be A.

If we want the (cdr (car L))    ;(rest (first L))

First we do (first L), which returns the list (A B). Then we need (cdr '(A B)), which is a list with B in it: (B).
(car (cdr L)) ;(first (rest L))

So first we do (cdr L); this will return the list (C (D)). Then we need the first of (C (D)), which will return the element C. It is important to mention that we cannot get the cdr (rest) of a single element.
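To make these steps concrete, here is how the three calls behave in a REPL session with L bound to the list above (the session is ours):

(setf L '((A B) C (D)))
(car (car L))   ; A
(cdr (car L))   ; (B)
(car (cdr L))   ; C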
Now consider the following list:

(A (B () (C () ())) (D (E () ()) ()))

(car (car L)) ;(first (first L))

We need the first of the list, which will be the element A. If we then ask for the first element of that single element, this will be an ERROR.
(cdr (car L)) ;(rest (first L))

First we will get the first of the list, which will be the element A. If we ask for the rest (a list) of a single element, this will be an ERROR.
(car (cdr L)) ;(first (rest L))

First we need the remainder of the list, which will be the list ((B () (C () ())) (D (E () ()) ())); then we need the first of that list, which will be the list (B () (C () ())).

A good question for the final exam: if we have the list (A) and we ask for the cdr of that list, what would be the answer?
a) NIL
b) Empty list ()
c) ERROR

Homework: Review the last five slides of the slide file. They contain interesting examples and the instructions to download LISP and PROLOG.


7.13. Example
Here we are defining a function, my-reverse, which has one parameter, list.
(defun my-reverse (list)
  (let (new-list)   ; empty list
    (dolist (x list)
      (setf new-list (cons x new-list)))
    new-list))
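Here is the function in action before we walk through it (the call is ours):

(my-reverse '(1 2 3))
(3 2 1)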

The loop inside the function will take each of the values in the list and concatenate (cons) them onto an empty list (new-list). The elements end up reversed because cons concatenates the element x at the beginning of the list. What is the purpose of the line new-list? Being the last expression in the function, it is there to return the complete list as the result of the function.

7.14. Exercises
a) Given the list ((((a)))). It is important to understand that this is a set of nested lists. If we want to recover the element a we will need this expression:

(first (first (first (first L))))

What if we add one more first to the expression above? The four calls already return the atom a, and asking for the first of a single element is an ERROR.

b) Write a method that prints the even numbers between two given values a and b, given as parameters. In Java this will look like this:
void method(int a, int b) {
    for (int i = a; i < b; i++) {
        if (i % 2 == 0) {
            System.out.println(i);
        }
    }
}

HOMEWORK. Write this function in LISP; one possible translation is sketched below for reference. Also, use the exercises in the slide titled Exam as preparation for your final exam.
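A possible translation, to check your answer against (the sketch and the name print-evens are ours, not part of the course material):

(defun print-evens (a b)
  (do ((i a (+ i 1)))      ; i starts at a and increases by 1
      ((>= i b))           ; stop when i reaches b
    (when (= (mod i 2) 0)  ; test whether i is even
      (print i))))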


Question. Give 3 features of LISP that make it better than iterative programming languages:
a) It has arbitrarily large numbers, which allow us to handle values such as the factorial of 1000.
b) There are no declared data types and no type casting, and there is a fraction type.
c) Memory handling is easier; LISP takes care of it. There are no pointers.


WEEK 14
8. LOGICAL PROGRAMMING LANGUAGES: PROLOG
PROLOG is an example of a logical programming language. Logical programming is different from imperative and functional programming. When talking about logical programming we can think of a deductive database (DB) where we can make queries to retrieve information from that DB. Logical programming is closer to a database manager (e.g., SQL) than to a functional or iterative programming language (e.g., Java).

8.1. Facts and Rules
The kind of data that is stored can be either facts or rules. In PROLOG all the information (data) is recognizable because it is written in lowercase. Variables are written using uppercase (at least the first letter of the name). Example:
instructor(john, cs365).
instructor(mary, cs311).
instructor(paul, cs446).
enrolled(joseph, cs311).
enrolled(joseph, cs365).
enrolled(joseph, cs446).
enrolled(danielle, cs365).
enrolled(danielle, cs446).

8.2. Queries
What happens if we insert this query?


?- instructor(X, Y).

It will return all the pairs in the facts for instructor. What happens if we insert this query?

?- instructor(paul, X).

It will return X = cs446.

8.3. Defining rules
Now, what if we want to add some rules? Rules define new predicates using facts. Using the example above, we can provide the following rules:
teaches(Professor, Student) :- instructor(Professor, Course),
                               enrolled(Student, Course).

classmates(Student1, Student2) :- enrolled(Student1, Course),
                                  enrolled(Student2, Course).

Note that the notation to define a rule is:


Fact :- Fact, Fact, ...

Where :- is kind of an if, and the commas on the right side of the :- symbol represent an and. So the first rule above, teaches, could be read as follows: Professor teaches Student IF Professor is instructor of the Course AND the Student is enrolled in the Course. Now, if we use the following query:
?- teaches(john, WHO).

WHO is a variable, and it will take each value that makes this goal true. So, considering the facts above, the system will return the following values:
WHO = joseph WHO = danielle

Considering the facts defined above, what will be the answer for this query:
?- classmates(javier,paul).

The system will check the provided facts. However, there is no fact with information about javier, so the result will be false. Now, if we ask the following query:
?- classmates(joseph, danielle).

The system will check all the enrolled facts for both joseph and danielle:
enrolled(joseph, C).
enrolled(danielle, C).

And if there is a C that is equal for both of them, then the rule classmates will be true. In this case there are two values of C that make this rule true, so the query succeeds; however, since C does not appear in the head of the rule, its values are not reported. To see them, we can create a more complex rule:
classmates(Student1, Student2, Course) :- enrolled(Student1, Course),
                                          enrolled(Student2, Course).

In this case, we can create the following query:

?- classmates(joseph, danielle, C).

The system will return all the values of C where both joseph and danielle are enrolled. Now, if we ask:

?- classmates(joseph, danielle, cs446).

The system will return true, because both joseph and danielle are enrolled in cs446.

8.4. Exercises
Consider the following facts and rules:

Facts:

mother_child(trude, sally).
father_child(tom, sally).
father_child(tom, erica).
father_child(mike, tom).

Rules:

sibling(X, Y) :- parent_child(Z, X), parent_child(Z, Y).
parent_child(X, Y) :- father_child(X, Y).
parent_child(X, Y) :- mother_child(X, Y).

What does it mean to have the same rule defined twice, e.g., parent_child(X,Y)? It means an OR. Why does the query ?- sibling(sally, erica). return a TRUE value? It is true because there are two facts, father_child(tom,sally) and father_child(tom,erica), so the sibling rule is satisfied. What if we want to add a grand_father rule?
grand_father(X, Y) :- father_child(X, Z), parent_child(Z, Y).

Now, what if we ask the query:

?- grand_father(mike, erica).

The system will check whether facts exist where:

father_child(mike, Z) AND parent_child(Z, erica)

In this case the answer for the first goal is Z = tom. For the second goal, tom is the parent of two children, sally and erica. As erica is one of them, the condition father_child(mike, tom) AND parent_child(tom, erica) is true.
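Putting it together, this is the session one would expect in a Prolog interpreter with the facts and rules above loaded (the session itself is ours):

?- grand_father(mike, erica).
true.
?- grand_father(mike, X).
X = sally ;
X = erica.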


Suppose we define the rule final_grade_cse340(STUDENT, GRADE) :- exam_grade(STUDENT, GRADE). and we have the following fact in the database:

exam_grade(helen, pass).

?- final_grade_cse340(helen, X).
X = pass

?- final_grade_cse340(helen, pass).
true

Now if we change the rule:


final_grade_cse340(STUDENT, GRADE) :- exam_grade(STUDENT, GRADE),
                                      project_grade(STUDENT).

If we ask final_grade_cse340(helen, pass), the result will be false, because we do not have any information about project_grade for helen. If we add the following fact:
project_grade(helen).

Then the result will be true, because both conditions will be true.
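As a final check, the complete knowledge base and query behave as follows (assembled by us from the pieces above):

exam_grade(helen, pass).
project_grade(helen).
final_grade_cse340(STUDENT, GRADE) :- exam_grade(STUDENT, GRADE),
                                      project_grade(STUDENT).

?- final_grade_cse340(helen, X).
X = pass.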


July, 2013

Javier Gonzalez-Sanchez javiergs@acm.org Maria-Elena Chavez-Echeagaray helenchavez@acm.org

