Reg Ex

Exercises
1. Write a regular expression for each of the following sets of binary strings. Use only the basic operations.
   a. 0 or 11 or 101
   b. only 0s
   Answers: 0 | 11 | 101, 0*
2. Write a regular expression for each of the following sets of binary strings. Use only the basic operations.
   a. all binary strings
   b. all binary strings except empty string
   c. begins with 1, ends with 1
   d. ends with 00
   e. contains at least three 1s
   Answers: (0|1)*, (0|1)(0|1)*, 1 | 1(0|1)*1, (0|1)*00, (0|1)*1(0|1)*1(0|1)*1(0|1)* or 0*10*10*1(0|1)*.
2. Write a regular expression to describe inputs over the alphabet {a, b, c} that are in sorted order.
   Answer: a*b*c*.
3. Write a regular expression for each of the following sets of binary strings. Use only the basic operations.
   a. contains at least three consecutive 1s
   b. contains the substring 110
   c. contains the substring 1101100
   d. doesn't contain the substring 110
   Answers: (0|1)*111(0|1)*, (0|1)*110(0|1)*, (0|1)*1101100(0|1)*, (0|10)*1*. The last one is by far the trickiest.
2. Write a regular expression for binary strings with at least two 0s but not consecutive 0s.
3. Write a regular expression for each of the following sets of binary strings. Use only the basic operations.
   a. has at least 3 characters, and the third character is 0
   b. number of 0s is a multiple of 3
   c. starts and ends with the same character
   d. odd length
   e. starts with 0 and has odd length, or starts with 1 and has even length
   f. length is at least 1 and at most 3
   Answers: (0|1)(0|1)0(0|1)*, 1* | (1*01*01*01*)*, 1(0|1)*1 | 0(0|1)*0 | 0 | 1, (0|1)((0|1)(0|1))*, 0((0|1)(0|1))* | 1(0|1)((0|1)(0|1))*, (0|1) | (0|1)(0|1) | (0|1)(0|1)(0|1).
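Answers like these can be checked mechanically. The following is a minimal sketch, not part of the original exercise set: it compares a regular expression against a plain Java predicate on every binary string up to a given length. The class name BinaryRegexCheck and the helper agrees are hypothetical names chosen for illustration, and the spaces used for readability in the answers above are removed because java.util.regex would treat them as literal characters.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

class BinaryRegexCheck {
    // Hypothetical helper: returns true if the regex and the reference predicate
    // agree on every binary string of length 0..maxLen.
    static boolean agrees(String regex, Predicate<String> spec, int maxLen) {
        Pattern p = Pattern.compile(regex);
        for (int len = 0; len <= maxLen; len++) {
            for (int bits = 0; bits < (1 << len); bits++) {
                // Build the binary string of the given length for this bit pattern.
                StringBuilder sb = new StringBuilder();
                for (int i = len - 1; i >= 0; i--) sb.append((bits >> i) & 1);
                String s = sb.toString();
                // matches() requires the whole string to match, as in the exercises.
                if (p.matcher(s).matches() != spec.test(s)) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Doesn't contain the substring 110.
        System.out.println(agrees("(0|10)*1*", s -> !s.contains("110"), 10));
        // Number of 0s is a multiple of 3.
        System.out.println(agrees("1*|(1*01*01*01*)*",
                s -> s.chars().filter(c -> c == '0').count() % 3 == 0, 10));
        // Odd length.
        System.out.println(agrees("(0|1)((0|1)(0|1))*", s -> s.length() % 2 == 1, 10));
    }
}
```

Agreement up to a bounded length is evidence, not a proof, but a wrong answer usually fails on a very short string.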
2. For each of the following, indicate how many bit strings of length exactly 1000 are matched by the regular expression: 0(0 | 1)*1, 0*101*, (1 | 01)*.
3. Write a regular expression that matches all strings over the alphabet {a, b, c} that:
   a. start and end with a
   b. contain at most one a
   c. contain at least two a's
   d. contain an even number of a's
   e. have an even total number of a's and b's
2. Find long words whose letters are in alphabetical order, e.g., almost and beefily. Answer: use the regular expression '^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$'.
3. Write a Java regular expression to match phone numbers, with or without area codes. The phone numbers should be of the form (609) 555-1234 or 555-1234.
4. Find all English words that end with nym.
5. Find all English words that contain the trigraph bze. Answer: subzero.
6. Find all English words that start with g, contain the trigraph pev, and end with e. Answer: grapevine.
7. Find all English words that contain the trigraph spb and have at least two r's.
8. Find the longest English word that can be written with the top row of a standard keyboard. Answer: proprietorier.
9. Find all words that contain the four letters a, s, d, and f, not necessarily in that order. Solution: cat words.txt | grep a | grep s | grep d | grep f.
10. Given a string of A, C, T, G, and X, find a string where X matches any single character, e.g., CATGG is contained in ACTGGGXXAXGGTTT.
11. Write a Java regular expression, for use with Validate.java, that validates Social Security numbers of the form 123-45-6789. Hint: use \d to represent any digit. Answer: [0-9]{3}-[0-9]{2}-[0-9]{4}.
12. Modify the previous exercise to make the - optional, so that 123456789 is considered a legal input.
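As a small illustration of the Social Security number answer, here is a sketch in the spirit of Validate.java. The interface of Validate.java is not shown in this document, so the class SsnCheck and the method isValid are hypothetical stand-ins.

```java
import java.util.regex.Pattern;

class SsnCheck {
    // The answer given above; "\\d{3}-\\d{2}-\\d{4}" is an equivalent form using \d.
    static final Pattern SSN = Pattern.compile("[0-9]{3}-[0-9]{2}-[0-9]{4}");

    // Hypothetical helper: true only if the whole input has the form 123-45-6789.
    static boolean isValid(String input) {
        return SSN.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("123-45-6789"));
        System.out.println(isValid("123456789"));    // hyphens are still required here
        System.out.println(isValid("123-45-67890")); // too many digits
    }
}
```

Because matches() anchors the pattern to the whole input, no explicit ^ or $ is needed.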
13. Write a Java regular expression to match all strings that contain exactly five vowels and the vowels are in alphabetical order. Answer: [^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*
14. Write a Java regular expression to match valid Windows XP file names. Such a file name consists of any sequence of characters other than / \ : * ? " < > | Additionally, it cannot begin with a space or period.
15. Write a Java regular expression to match valid OS X file names. Such a file name consists of any sequence of characters other than a colon. Additionally, it cannot begin with a period.
16. Given a string s that represents the name of an IP address in dotted quad notation, break it up into its constituent pieces, e.g., 255.125.33.222. Make sure that the four fields are numeric.
17. Write a Java regular expression to describe all dates of the form Month DD, YYYY where Month consists of any string of upper or lower case letters, the date is 1 or 2 digits, and the year is exactly 4 digits. The comma and spaces are required.
18. Write a Java regular expression to describe valid IP addresses of the form a.b.c.d where each letter can represent 1, 2, or 3 digits, and the periods are required. Yes: 196.26.155.241.
19. Write a Java regular expression to match license plates that start with 4 digits and end with two uppercase letters.
20. Write a regular expression to extract the coding sequence from a DNA string. It starts with the ATG codon and ends with a stop codon (TAA, TAG, or TGA).
21. Write a regular expression to check for the sequence rGATCy: that is, does it start with A or G, then GATC, and then T or C.
22. Write a regular expression to check whether a sequence contains two or more repeats of the GATA tetranucleotide.
23. Modify Validate.java to make the searches case insensitive. Hint: use the (?i) embedded flag.
24. Write a Java regular expression to match various spellings of Libyan dictator Moammar Gadhafi's last name using the following template: (i) starts with K, G, Q, (ii) optionally followed by H, (iii) followed by AD, (iv) optionally followed by D, (v) optionally followed by H, (vi) optionally followed by AF, (vii) optionally followed by F, (viii) ends with I.
25. Write a Java program that reads in an expression like (K|G|Q)[H]AD[D][H]AF[F]I and prints out all matching strings. Here the notation [x] means 0 or 1 copy of the letter x.
26. Why doesn't s.replaceAll("A", "B"); replace all occurrences of the letter A with B in the string s? Answer: Use s = s.replaceAll("A", "B"); instead. The method replaceAll returns the resulting string, but does not change s itself. Strings are immutable.
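The point about immutability can be seen in a few lines. This is a minimal sketch; the class name ReplaceAllDemo, the method demo, and the sample string are illustrative choices, not taken from the exercise.

```java
class ReplaceAllDemo {
    // Returns the string after the discarded call and after the assigned call.
    static String[] demo() {
        String s = "ABRACADABRA";
        s.replaceAll("A", "B");      // result is discarded; s still refers to the old string
        String afterDiscardedCall = s;
        s = s.replaceAll("A", "B");  // keep the returned string
        return new String[] { afterDiscardedCall, s };
    }

    public static void main(String[] args) {
        String[] r = demo();
        System.out.println(r[0]);    // ABRACADABRA
        System.out.println(r[1]);    // BBRBCBDBBRB
    }
}
```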
27. Write a program Clean.java that reads in text from standard input and prints it back out, removing any trailing whitespace on a line and replacing all tabs with 4 spaces. Hint: use replaceAll() and the regular expression \s for whitespace.
28. Write a regular expression to match all of the text between the text a href=" and the next ". Answer: href=\"(.*?)\". The ? makes the .* reluctant instead of greedy. In Java, use Pattern.compile("href=\\\"(.*?)\\\"", Pattern.CASE_INSENSITIVE) to escape the backslash characters.
29. Use regular expressions to extract all of the text between the tags <title> and </title>. The (?i) is another way to make the match case insensitive. The $2 refers to the second captured subsequence, i.e., the stuff between the title tags.
        String pattern = "(?i)(<title.*?>)(.+?)(</title>)";
        String updated = s.replaceAll(pattern, "$2");
30. Write a regular expression to match all of the text between <TD ...> and </TD> tags. Answer: <TD[^>]*>([^<]*)</TD>
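The three HTML patterns above can be tried on a small sample. The following is a sketch only: the class HtmlExtract, the helper groups, and the sample markup are made up for illustration, and regular expressions are not a general-purpose HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlExtract {
    // Hypothetical helper: collects capture group 1 of every match, ignoring case.
    static List<String> groups(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        while (m.find()) out.add(m.group(1));
        return out;
    }

    // Applies the title pattern from the exercise; $2 is the text between the tags.
    static String title(String s) {
        return s.replaceAll("(?i)(<title.*?>)(.+?)(</title>)", "$2");
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html\">A</a> <a HREF=\"b.html\">B</a>"
                    + "<table><tr><TD align=\"left\">x</TD><TD>y</TD></tr></table>";
        // Reluctant .*? stops at the first closing quote instead of the last one.
        System.out.println(groups("href=\\\"(.*?)\\\"", html));
        System.out.println(groups("<TD[^>]*>([^<]*)</TD>", html));
        System.out.println(title("<TITLE>Reg Ex</TITLE>"));
    }
}
```

With a greedy .* in place of .*?, the href pattern would run from the first opening quote to the last quote in the input, which is why the reluctant form is used.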
Regular Expression to DFA

Aim: Regular Expression to DFA (to be taken from the compiler point of view)

Objective: To understand the role of regular expressions and finite automata in applications such as compilers.

Theory: Regular expressions are used to specify regular languages, and finite automata are used to recognize them. Many computer applications, such as compilers, operating system utilities, and text editors, make use of regular languages. In these applications, regular expressions specify the language and finite automata recognize it.

A compiler is a program that converts a source program written in a high-level language into an equivalent machine-language program. While doing so, it detects and reports errors. This process is quite complex and is divided into a number of phases such as Lexical Analysis, Syntax and Semantic Analysis, Intermediate Code Generation, Code Generation and Code Optimization. The lexical analysis phase reads the source program character by character and groups these characters into tokens, which are passed to the next phase, parsing (syntax and semantic analysis). After syntax and semantic analysis, intermediate code is generated, followed by actual code generation.

The lexical analyzer recognizes tokens in a series of characters. A C program consists of tokens such as identifiers, integers, floating point numbers, punctuation symbols, relational, logical and arithmetic operators, keywords, and comments (to be removed). To identify these tokens, the lexical analyzer needs a specification of each of them. The set of words belonging to a particular token type is a regular language, so each token type can be specified using a regular expression. For example, consider the token Identifier. In most programming languages, an identifier is a word that begins with a letter (capital or small) followed by zero or more letters or digits (0..9). This can be defined by the regular expression
    (letter) . (letter | digit)*
where letter = A|B|C|...|Z|a|b|c|...|z and digit = 0|1|2|...|9.

One can specify all token types using regular expressions. These regular expressions are then converted to DFAs in the form of a DFA transition table. The lexical analyzer reads a character from the source program and, based on the current state and the symbol read, makes a transition to some other state. When it reaches a final state of the DFA, it groups the characters read so far and outputs the token found.

Formal definition of Regular expression
The class of regular expressions over Σ is defined recursively as follows:
1. The letters ∅ and ε are regular expressions over Σ.
2. Every letter a ∈ Σ is a regular expression over Σ.
3. If R1 and R2 are regular expressions over Σ, then so are (R1|R2), (R1.R2) and (R1)*, where
   | indicates alternative or parallel paths
   . indicates concatenation
   * indicates closure
4. The regular expressions are only those that are obtained using rules (1) to (3).

Formal definition of DFA:
A finite automaton is formally denoted by a tuple (Q, Σ, δ, q0, F) where
   Q   finite set of states
   Σ   finite input alphabet
   q0  initial state of the FA, q0 ∈ Q
   F   set of final states, F ⊆ Q
   δ   transition function (also called the state function) mapping Q × Σ to Q, i.e. δ: Q × Σ → Q
An FA is called deterministic (DFA) if, from every vertex of its transition graph, each input symbol leads to exactly one next state.

A DFA can be constructed directly from an augmented regular expression (r)#. We begin by constructing a syntax tree T for (r)# and then compute four functions: Nullable, Firstpos, Lastpos and Followpos. The functions Nullable, Firstpos and Lastpos are defined on the nodes of the syntax tree and are used to compute Followpos, which is defined on the set of positions. We can short-circuit the construction of an NFA by building the DFA whose states correspond to sets of positions in the tree. Positions, in particular, encode the information regarding when one position can follow another. Each symbol in an input string to a DFA can be matched by certain positions. An input symbol c can only be matched by positions at which there is a c, but not every position with a c can necessarily match a particular occurrence of c in the input stream.

Algorithm
The steps in the algorithm are:
1. Accept the given regular expression with # as the end-of-expression character.
2. Convert the regular expression to its equivalent postfix form manually. (Students need not write the code for converting infix to postfix; they can directly accept the postfix form of the infix expression.)
3. Construct a syntax tree from the postfix expression obtained in step 2.
4. Assign positions to leaf nodes.
5. Compute the following functions: Nullable, Firstpos, Lastpos, Followpos.

Computation of Nullable: A * node is always nullable, and a leaf labeled ε is nullable. A leaf labeled with a position is not nullable. An | node is nullable if either of its children is nullable, and a . node is nullable only if both of its children are nullable.
Firstpos (first position): At each node n of the syntax tree of a regular expression, we define a function Firstpos(n) that gives the set of positions that can match the first symbol of a string generated by the subexpression rooted at n.

Lastpos (last position): At each node n of the syntax tree of a regular expression, we define a function Lastpos(n) that gives the set of positions that can match the last symbol of a string generated by the subexpression rooted at n.

To compute Firstpos and Lastpos, we need to know which nodes are the roots of subexpressions that generate languages that include the empty string. Such nodes are nullable. We define Nullable(n) to be true if node n is nullable, false otherwise.

Computation of Followpos: Followpos(i) tells us what positions can follow position i in the syntax tree. It can be computed as follows.
1. If n is a . (cat) node with left child C1 and right child C2, and i is a position in Lastpos(C1), then all positions in Firstpos(C2) are in Followpos(i).
2. If n is a * (closure) node and i is a position in Lastpos(n), then all positions in Firstpos(n) are in Followpos(i).

6. Construct the DFA from Followpos.

Note: Step 5 can be done during construction of the tree. Since the tree is built from the bottom up, the information for a subtree is already available when the computations at its root are to be done, so no separate traversal is needed.

Data Structures:
Node structure for parse tree
{
    Leftchild and Rightchild : pointers to the node structure
    Nullable : Boolean type
    Data : character type
    Firstpos and Lastpos : sets of integers
    Pos : integer (this may or may not be part of the tree node)
}
Stack: A stack is required to build the tree. It can be implemented either as a linked list (preferable) or as an array. The item pushed onto or popped from the stack is a pointer to a parse tree node, not just a single character.

Computation of Nullable, Firstpos and Lastpos:
   n is a leaf labeled ε
      Nullable(n) = true
      Firstpos(n) = ∅
      Lastpos(n)  = ∅
   n is a leaf labeled with position i
      Nullable(n) = false
      Firstpos(n) = {i}
      Lastpos(n)  = {i}
   n is an | node with children C1 and C2
      Nullable(n) = Nullable(C1) or Nullable(C2)
      Firstpos(n) = Firstpos(C1) U Firstpos(C2)
      Lastpos(n)  = Lastpos(C1) U Lastpos(C2)
   n is a . node with children C1 and C2
      Nullable(n) = Nullable(C1) and Nullable(C2)
      Firstpos(n) = if Nullable(C1) then Firstpos(C1) U Firstpos(C2) else Firstpos(C1)
      Lastpos(n)  = if Nullable(C2) then Lastpos(C1) U Lastpos(C2) else Lastpos(C2)
   n is a * node with child C1
      Nullable(n) = true
      Firstpos(n) = Firstpos(C1)
      Lastpos(n)  = Lastpos(C1)

Algorithm for construction of DFA transition table
1. Initially, the only unmarked state in Dstates is Firstpos(root), where root is the root of the syntax tree.
2. While there is an unmarked state T in Dstates do
   Begin
      Mark T
      For each input symbol a do
      Begin
         Let U be the set of positions that are in Followpos(P) for some P in T, such that the symbol at position P is a.
         If U is not empty and is not in Dstates then add U as an unmarked state to Dstates
         Dtran[T, a] = U
      End
   End
A state of the resulting DFA is a final state if it contains the position of the end marker #.
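
The whole procedure can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not a reference implementation: it accepts the postfix form of the augmented expression (as step 2 allows), treats |, . and * as operators and every other character as a leaf symbol, assumes there are no ε leaves, and assumes # is the last leaf. The class name RegexToDfa and its fields are hypothetical. As the Note suggests, Nullable, Firstpos, Lastpos and Followpos are computed while the postfix string is processed with a stack, so child pointers are not stored.

```java
import java.util.*;

class RegexToDfa {
    // Per-node information kept on the stack while the postfix string is processed.
    static class Node {
        boolean nullable;
        Set<Integer> firstpos = new TreeSet<>(), lastpos = new TreeSet<>();
    }

    final List<Character> symbolAt = new ArrayList<>();           // symbol at position p is symbolAt.get(p - 1)
    final Map<Integer, Set<Integer>> followpos = new HashMap<>();
    final List<Set<Integer>> dstates = new ArrayList<>();         // DFA states as sets of positions
    final List<Map<Character, Integer>> dtran = new ArrayList<>(); // dtran.get(T).get(a) = index of U

    RegexToDfa(String postfix) {
        Deque<Node> stack = new ArrayDeque<>();
        for (char c : postfix.toCharArray()) {
            Node n = new Node();
            if (c == '*') {
                Node c1 = stack.pop();
                n.nullable = true;
                n.firstpos.addAll(c1.firstpos);
                n.lastpos.addAll(c1.lastpos);
                // Followpos rule 2: closure node.
                for (int i : n.lastpos) followpos.get(i).addAll(n.firstpos);
            } else if (c == '|') {
                Node c2 = stack.pop(), c1 = stack.pop();
                n.nullable = c1.nullable || c2.nullable;
                n.firstpos.addAll(c1.firstpos); n.firstpos.addAll(c2.firstpos);
                n.lastpos.addAll(c1.lastpos);   n.lastpos.addAll(c2.lastpos);
            } else if (c == '.') {
                Node c2 = stack.pop(), c1 = stack.pop();
                n.nullable = c1.nullable && c2.nullable;
                n.firstpos.addAll(c1.firstpos);
                if (c1.nullable) n.firstpos.addAll(c2.firstpos);
                n.lastpos.addAll(c2.lastpos);
                if (c2.nullable) n.lastpos.addAll(c1.lastpos);
                // Followpos rule 1: cat node.
                for (int i : c1.lastpos) followpos.get(i).addAll(c2.firstpos);
            } else {
                // Leaf: assign the next position (positions start at 1).
                symbolAt.add(c);
                int pos = symbolAt.size();
                n.firstpos.add(pos);
                n.lastpos.add(pos);
                followpos.put(pos, new TreeSet<>());
            }
            stack.push(n);
        }
        Node root = stack.pop();

        // Build Dstates and Dtran; a state counts as "marked" once index t has passed it.
        dstates.add(root.firstpos);
        dtran.add(new TreeMap<>());
        for (int t = 0; t < dstates.size(); t++) {
            Map<Character, Set<Integer>> moves = new TreeMap<>();
            for (int p : dstates.get(t)) {
                char a = symbolAt.get(p - 1);
                if (a == '#') continue;   // the end marker is not an input symbol
                moves.computeIfAbsent(a, k -> new TreeSet<>()).addAll(followpos.get(p));
            }
            for (Map.Entry<Character, Set<Integer>> e : moves.entrySet()) {
                if (e.getValue().isEmpty()) continue;
                int u = dstates.indexOf(e.getValue());
                if (u < 0) {
                    dstates.add(e.getValue());
                    dtran.add(new TreeMap<>());
                    u = dstates.size() - 1;
                }
                dtran.get(t).put(e.getKey(), u);
            }
        }
    }

    // Runs the DFA; a state is final if it contains the position of '#'.
    boolean accepts(String input) {
        int state = 0, endPos = symbolAt.size();
        for (char c : input.toCharArray()) {
            Integer next = dtran.get(state).get(c);
            if (next == null) return false;
            state = next;
        }
        return dstates.get(state).contains(endPos);
    }

    public static void main(String[] args) {
        // (a|b)*abb# written in postfix form.
        RegexToDfa dfa = new RegexToDfa("ab|*a.b.b.#.");
        System.out.println(dfa.dstates);
        System.out.println(dfa.accepts("aabb") + " " + dfa.accepts("abba"));
    }
}
```

Using . as the explicit concatenation operator means the input alphabet in this sketch cannot itself contain ., | or *; a fuller version would need an escape convention.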
