
Modularization of Text-to-Model Mapping Specifications

A Feasibility Study Using Scannerless Parsing


Diploma Thesis at the Institute for Program Structures and Data Organization, Chair for Software Design and Quality, Prof. Dr. Ralf H. Reussner, Fakultät für Informatik, Karlsruhe Institute of Technology, by cand. inform. Martin Küster

Advisors: Prof. Dr. Ralf H. Reussner, Dipl.-Inform. (FH) Thomas Goldschmidt

Date of Registration: 2009-04-20
Date of Submission: 2009-11-02



I declare that I have developed and written the enclosed Diploma Thesis completely by myself, and have not used sources or means without declaration in the text.

Karlsruhe, 2009-11-02

Abstract
Domain-specific languages (DSLs) are developed for a specific concern and limited by nature. Ideally, DSLs and the tools generated for them can be easily combined for reuse. When using a textual concrete syntax for DSLs, the editing framework must be aware of the language composition. This involves modularizing the mapping between abstract and concrete syntax and combining the lexical and syntactic analyzers of both language toolkits. We present the generic generation of a scannerless parser that avoids lexical conflicts, especially keyword pollution. For that purpose, the existing textual modeling framework FURCAS employs the Rats! parser generator to create domain parsers. This serves to evaluate the feasibility of migrating to scannerless parsing in order to facilitate flexible language composites.

Zusammenfassung
Domänenspezifische Sprachen (DSLs) werden für einen speziellen Anwendungsbereich entwickelt und sind daher von Natur aus beschränkt. Idealerweise können DSLs und die dafür entwickelten Werkzeuge einfach kombiniert und wiederverwendet werden. Beim Einsatz textueller Syntaxen für DSLs muss das Editor-Framework das Komposit der neuen Sprachen unterstützen. Dies erfordert Möglichkeiten zur Modularisierung von Text-zu-Modell-Abbildungsbeschreibungen sowie das Zusammenführen der beiden Komponenten zur lexikalischen und syntaktischen Analyse. Wir stellen die generische Erzeugung eines scannerfreien Parsers vor, der lexikalische Konflikte, insbesondere durch importierte Schlüsselwörter, vermeidet. Zu diesem Zweck verwendet das bestehende textuelle Editor-Framework FURCAS den Rats!-Parsergenerator, um Domänenparser zu erstellen. Dies dient einer Machbarkeitsanalyse der Migration zu scannerfreiem Parsen, um flexible Sprachkompositionen zu ermöglichen.

Contents

1 Introduction
  1.1 Textual Modeling
  1.2 Setting
  1.3 The Language Composition Vision
  1.4 Goal
  1.5 Outline of the Thesis

2 Analysis
  2.1 Problem Statement
    2.1.1 Keyword Pollution and Scannerless Parsing
    2.1.2 Parsing Techniques and Grammar Classes
  2.2 TCS Mapping Language
    2.2.1 An Introductory Example
    2.2.2 tcs.ConcreteSyntax
    2.2.3 tcs.Template
    2.2.4 tcs.ClassTemplate
    2.2.5 tcs.Sequence and tcs.SequenceElement
    2.2.6 tcs.Property
    2.2.7 tcs.Operators and OperatorTemplate
    2.2.8 tcs.FunctionTemplate
    2.2.9 tcs.EnumerationTemplate
  2.3 Terminology
  2.4 Related Work

3 Design
  3.1 Overview
    3.1.1 Technological Overview
    3.1.2 Higher-Order Transformation Approach
    3.1.3 Bootstrapping the TCS
  3.2 TCS Modifications for Composition
  3.3 TCS-to-Grammar Transformation
    3.3.1 Concrete Syntax to Grammar
    3.3.2 Class Templates to Productions
    3.3.3 Operator Templates to Productions
    3.3.4 FunctionTemplate to Production
    3.3.5 EnumerationTemplate to Production
    3.3.6 tcs.Sequence to xtc.Sequence
    3.3.7 Keywords and Symbols to Productions

4 Implementation
  4.1 Handler-Based Transformation
  4.2 Packrat Parser Specifics
    4.2.1 Memoization
    4.2.2 Parser Optimizations
    4.2.3 Actions and Bindings
    4.2.4 Parameterized Rules
  4.3 Lightweight Nested Transactions
  4.4 Ambiguities
    4.4.1 Greedy Parse - Shift/Reduce Conflicts
    4.4.2 Ordering of Choices
    4.4.3 A Heuristic for Shadowed Alternatives
  4.5 Tokenization and Scannerless Parsing
    4.5.1 White Space Definition
    4.5.2 Assignment of Token Types
    4.5.3 Tokenizing via Lightweight Transactions
  4.6 Challenges
    4.6.1 Incrementality
    4.6.2 Error Handling
    4.6.3 Error Recovery
    4.6.4 From Embedding to Composition

5 Summary and Conclusions

Bibliography

List of Figures
1.1 Components of a CTS framework
2.1 Model of a compiler front end
2.2 Overview of Rats! module modification syntax
2.3 MOF diagram of TCS element ConcreteSyntax
2.4 MOF diagram of TCS element Templates
2.5 MOF diagram of TCS element ClassTemplate
2.6 MOF diagram of TCS element Sequence
2.7 MOF diagram of TCS element Expression
2.8 MOF diagram of TCS element SequenceElement
2.9 MOF diagram of TCS element Property
2.10 MOF diagram of TCS element OperatorList
2.11 MOF diagram of TCS element FunctionTemplate
2.12 MOF diagram of TCS element EnumerationTemplate
3.1 Architecture Overview
3.2 TCS to grammar transformation
3.3 Using the legacy TCS parser to create a TCS instance
3.4 MOF diagram of tcs.ConcreteSyntax after adding import
4.1 Handler-based transformation of TCS instances
4.2 Rats! parser optimizations
4.3 MOF class diagram of xtc.parser.InjectorState
4.4 Sample metamodel illustrating operatored expressions
4.5 MOF extract: namespaces and classifiers
4.6 Parse tree for input PrimitiveTypes::String
4.7 DFA recognizing keyword followed by identifier

Listings
2.1 Identifier-Keyword conflict
2.2 TCS introductory example (from TCS.tcs)
3.1 Extract of the modified concrete syntax specification for TCS
4.1 Memoization example - TCS mapping snippet
4.2 Memoization example - simplified generated grammar
4.3 Memoization example - simplified parser code
4.4 TCS.tcs: concrete syntax for classifiers and namespaces
4.5 TCS.tcs: conditionals and expressions

1. Introduction
Model-driven engineering (MDE) and model-driven software development (MDSD) regard models as first-class entities. Instead of merely describing or documenting a process, models serve as the central entity driving the process. They are used in every stage of a development process - not only the design phase - and may range from high-level to very specific. Model-to-model transformations, optionally augmented with handwritten code, are used to specify the relation between models on different levels of abstraction. Domain-specific languages (DSLs) follow the paradigm that code is a model, too. In a circumscribed and well-understood application domain they are used to describe entities, relationships and behavior. The fundamental difference to general-purpose languages is that they do not strive for universality. This implies that, if used in another setting, the languages describing the entities are combined rather than extended. This avoids the evolution of DSLs into large, monolithic languages, which would nullify the advantage of domain-specificity in the first place. Tools are crucial for high productivity in model-centric development and for a higher acceptance of model-driven processes. This conclusion is justified by the revolutionary increase in productivity caused by the availability of integrated development environments (IDEs) for programming languages. MoNeT, short for modeling needs tools, is a project at SAP AG that aims to provide better tools for modeling. The presented work was devised within this project as a cooperation of Forschungszentrum Informatik (FZI), Karlsruhe, and SAP.

1.1 Textual Modeling

Domain-specific languages are defined by an abstract syntax and one (or many) concrete syntaxes. Similar to abstract syntax trees of general-purpose languages (GPLs), the abstract syntax is a representation of a language artifact on a level that is independent of the artifact's textual or graphical appearance. A concrete syntax must therefore be given to allow for creation and editing of domain-specific code.


Graphical editors have been around for more than a decade, mostly promoted by the general-purpose modeling language UML1 and its various diagram types. One major advantage of graphical concrete syntaxes for DSLs is their well-defined interface for operations creating, changing or deleting model elements. This makes local changes to models possible, which is not trivial in textual editors as it requires incrementality of the editor framework. However, textual syntaxes bring several advantages. The position paper [GKR+07] points out ten items, summarized in the following:

Information content: Graphical representations of large, complex models turn out to exceed what can be grasped by a developer.

Speed of creation: Especially for experienced users, a graphical editor that requires numerous mouse clicks impedes rapid creation and evolution of models.

Integration of languages: Conditions and actions attached to graphical representations are textual already, but often badly integrated. Complete textual representations are considered more productive.

Speed and quality of formatting: Formatting algorithms for graphical models are doubted to be as effective as textual formatters, because good graphical layout cannot be guaranteed automatically without taking the model semantics into account.

Platform and tool independence: Text can be edited with any editor. If one refrains from convenient additional functionality (syntax highlighting, code completion etc.), this can be done without a special tool.

Version control: Text can be shared in repositories easily, since this task is well understood and supported by all versioning systems. Methods to compare, replace or merge text are much simpler than comparing graphical representations.

Editors (almost) for free: Features like syntax highlighting and code completion need to be put on top of existing textual editing environments.

Outlines and graphical overviews: Textual syntaxes can be used to create outlines in a generic way, allowing to take advantage of a graphical representation as a view on the textual syntax.

Parsers, prettyprinters, code generators and translators are rather easily developed: These components can be derived generically from the concrete syntax. (Note: this advantage is specific to the MontiCore environment, in which abstract syntax definitions are derived from the grammar definition.)

Composition of modeling languages: (Note: shared symbol tables and attribute grammars are claimed to promote language composition. We doubt that this approach prevents the general problem of keyword pollution, as detailed in Sec. 2.1.1.)
1 Unified Modeling Language, specified by the Object Management Group. See http://www.omg.org/spec/UML/2.0/


1.2 Setting
Figure 1.1: Components of a concrete textual syntax framework (from [GBU08])

Fig. 1.1 depicts the general design of a concrete textual syntax framework, which is basically common to all frameworks that support textual editing of domain-specific languages.

Grammar

Grammars specify what well-formed textual artifacts look like. In all practically relevant cases, these grammars are context-free in the language-theoretic sense, allowing them to be specified in Extended Backus-Naur Form (EBNF). Code (as a textual artifact) must obey the syntactic rules declared in the grammar. Some language-theoretic consequences are discussed in Sec. 2.1.2. There is a second notion of grammars in this context: a grammar serves as input for a parser generator. Language-independence of the CTS framework can only be achieved by generating the domain parser automatically from the abstract and concrete syntax definition. Details of the parser-specific grammar format employed here can be found in Sec. 3.3.

Metamodel (= Abstract Syntax Model)

The abstract syntax is defined as a metamodel, i.e. described with elements from a meta-metamodel (such as MOF, Ecore, KM3). Its elements are similar to possible nodes of an abstract syntax tree (AST) known from compiler construction. Unless they carry semantic information, specific details of the expected concrete syntax should not be included in the metamodel. The resulting gap between what is textually present and how it should be represented in the abstract syntax must be covered by the mapping definition.

Mapping Definition

The mapping defines a bridge between abstract and concrete textual syntax. This bridge may be specified as part of the framework (advancing rapid development of


a new language) or as a separate artifact (enhancing flexibility and expressiveness of the targeted languages). Bridging concrete and abstract syntax can be tackled from two ends:

EBNF-like mappings are very close to the expected concrete syntax. They define rules that can easily be transformed into a format acceptable for a parser generator.

Template-style mappings define the expected concrete syntax for each metamodel element. Prettyprinting an instance of the abstract model is easier with this approach. In [JBK06], details of a representative of this approach are presented. A modified version of the syntax definition language from [JBK06] is employed in the CTS framework discussed here. We therefore highlight its syntax and semantics in Sec. 2.2.

Both approaches come with drawbacks. The higher abstraction incorporated in the syntax metamodel creates a gap to the concrete textual syntax defined by a grammar. Depending on the choice of one or the other approach, the syntax editing framework must close this gap without outside information in order to be both flexible and expressive.

Lexer/Parser

One of the most central tools generated by the CTS framework is the parser for textual artifacts (code). It is triggered by the textual editor upon changes and updates the syntax model accordingly. The editor is responsible for all user interaction and must hide the model-based nature of the syntax from the programmer. As all generated tools are language-dependent, they need to be re-generated when the metamodel or the syntax mapping is changed.

1.3 The Language Composition Vision

Flexibly composing and embedding languages has been longed for for quite some time. Recently this topic has earned new attention in the context of domain-specific languages. DSLs are mostly used as silo development units, independently from other domains. Composability of DSLs is a big challenge, currently investigated by several research teams. In the domain of modeling, a prime example of language composition is embedding constructs from the Object Constraint Language OCL2 into some domain-specific language D. The problem of composition is three-fold:

- Integrating the type system and structure of the two languages (affecting the abstract syntax)
- Combining the concrete syntax (affecting the domain lexer and parser)
2 see http://www.omg.org/spec/OCL/2.0/


- Reconciling the two textual editors (with respect to syntax highlighting, code completion, ...)

The domain-specific paradigm implies that all artifacts and tools for both languages are present, i.e. we have a lexer, a parser, an editor, an abstract syntax model and a mapping for both languages. Composing the two now implies that, first, the abstract syntax for D has to somehow reference the OCL statement element. Second, the concrete syntax needs to know where the textual representation of the OCL query is expected. If we consider analyzing a textual artifact conforming to D with the OCL embedding, we need to answer a number of fundamental questions:

Question 1: How do we handle lexical conflicts? A lexical conflict arises during the lexical analysis phase, before parsing an input. Let's say we have a keyword from OCL that matches an identifier name in the code (such as context). We cannot tell whether this is acceptable and may be handled later in the parsing phase, or whether it must be prohibited. This is due to the fact that lexers are finite automata that do not take the context of a construct into account. The latter solution (prohibiting) would imply reserving all keywords from all imported languages (plus their closure); the outcome of a parse would then depend on which import statements were made, even if no constructs from the imported language are used. An alternative way to tackle this is to tell the lexical analyzer about the imported keywords. The automaton can then assign an indefinite type instead of either keyword or identifier. This has an effect on the parsing phase, as indefinite tokens must be accepted wherever either a keyword or an identifier is expected (see the sketch at the end of this section).

Question 2: Can we keep the toolkits separate with slight modifications, or do we have to create a compound parser/lexer pair? Although re-generating all tools with the modified concrete and abstract syntax definitions seems tempting, it violates a fundamental paradigm in software engineering: don't repeat yourself (DRY). For each combination of two languages we would get a compound toolkit, which is clearly undesirable. So let's assume we can access the correct methods from the other parser to start analyzing a construct accordingly. Then the next question arises, considering scopes.

Question 3: How can we make the two analyzers know each other's variables? Here we have the first issue that points to the fact that composing is more than embedding. Say we declared and initialized a reference to a model element outside the OCL query. This reference should be available inside the query to allow for realistic and complex queries. Given that the parser toolkits need to be kept separate, we see that an interface-like mechanism is needed that consolidates the constructs in question.

The work at hand does not claim to answer all questions involving composition of domain-specific languages. This is why the section is entitled language composition vision. Instead, the specific subquestions arising from lexical conflicts are the focus.
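To make the indefinite token type from Question 1 concrete, consider the following minimal Java sketch (all names are invented for illustration; this is not FURCAS code). A lexer that knows the keyword sets of both the host and the imported language defers the keyword-versus-identifier decision to the parser by emitting an indefinite token type:

import java.util.Set;

enum TokenType { KEYWORD, IDENTIFIER, INDEFINITE }

final class Token {
    final TokenType type;
    final String lexeme;
    Token(TokenType type, String lexeme) { this.type = type; this.lexeme = lexeme; }
}

final class ComposingLexer {
    private final Set<String> hostKeywords;     // keywords of the host language D
    private final Set<String> importedKeywords; // keywords of the imported language, e.g. OCL

    ComposingLexer(Set<String> hostKeywords, Set<String> importedKeywords) {
        this.hostKeywords = hostKeywords;
        this.importedKeywords = importedKeywords;
    }

    // Classify a lexeme that matches the identifier rule.
    Token classify(String lexeme) {
        if (hostKeywords.contains(lexeme)) {
            return new Token(TokenType.KEYWORD, lexeme);
        }
        if (importedKeywords.contains(lexeme)) {
            // Could be a keyword of the imported language or a plain
            // identifier; only the parser's context can decide.
            return new Token(TokenType.INDEFINITE, lexeme);
        }
        return new Token(TokenType.IDENTIFIER, lexeme);
    }
}

The parser must then treat an INDEFINITE token as acceptable in both keyword and identifier positions, which is exactly the coupling between lexing and parsing that the scannerless approach discussed in Ch. 2 removes altogether.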


1.4 Goal

We aim to design and implement (or modify) a transformation framework that targets language composites. By using a parser generator that works without a prior lexical analysis phase, we tackle the issue of identifiers conflicting with keywords from an imported language. This involves the following sub-items:

- Review available parser generator technologies with respect to expressiveness and support of scannerless parsing.
- Select a scannerless parser generator meeting the requirements given by the concrete textual editing framework.
- Highlight the mapping definition's specifics and point out critical points of the transformation to a grammar.
- Compare the grammar formats (existing and new parser generator) and devise a mechanism to auto-generate a grammar from a metamodel and the specified syntax mapping.
- Implement and test the transformation and validate the output.
- Bootstrap the DSL describing syntax mappings.
- Implement a mechanism substituting the lexical analysis phase, i.e. emulating a tokenizer.
- Discuss requirements for an editor integration.
- Point out consequences for incremental parsing and error reporting.
- Estimate the feasibility of a complete migration to scannerless parsing in the CTS framework.

1.5 Outline of the Thesis

The work is structured as follows. Ch. 2 investigates problems with the existing textual editing framework and states the detailed problem, the involved techniques, and the languages. Ch. 3 presents, on an implementation-independent level, the transformation of a textual syntax specification to the selected grammar format. Ch. 4 gives details about the implementation of the transformation; crucial aspects arising from the migration to a scannerless parsing technique are discussed, and an algorithm substituting the lexer phase by token creation at parse time is presented. Ch. 5 gives a brief summary and an outlook on the feasibility of scannerless parsing in the context of textual modeling.

2. Analysis
Modularizing domain-specific languages must be tackled from both the abstract and the concrete syntax definition. Support for modularization on the abstract syntax level is fairly straightforward: metamodeling the abstract syntax allows using namespaces and key attributes (Ecore) or unique IDs (MOF) to identify referenced elements from more than one language specification, even if there are name clashes. The same is not true for specifications of concrete syntax, i.e. definitions of the mapping between abstract and concrete syntax. One key concept of textual concrete syntaxes is to hide things like namespaces and model element resolution from the programmer. However, this implies that the editor framework has to take care of the disambiguation of name clashes and related issues. This chapter explores requirements and questions that arise from composing languages on the level of textual concrete syntax.

2.1 Problem Statement

The existing technology supporting textual editors (FURCAS) suffers from the problem that it only works on monolithic language definitions: a strict one-to-one relation between the abstract and concrete syntax definition on the one hand and a parser for the textual artifacts on the other hand. Reuse of already specified languages and of generated components is only possible by copying and pasting the concrete syntax definition into the new language and re-generating, leading to undesirable effects including duplication of code, redundancy, and large, complex specifications, with double maintenance as a known consequence. Supporting modularity on the textual syntax level requires small changes to the DSL that specifies the mapping between concrete and abstract syntax (TCS), but has more impact on the generation of a parser for the combined language, because the existing parsers cannot be glued together easily.

2.1.1 Keyword Pollution and Scannerless Parsing

Traditional compiler front ends can be modeled as depicted in Fig. 2.1.


Figure 2.1: A model of a compiler front end (from [ASU86])

The first step, called the lexical analysis, scanning, or tokenizing phase, is accountable for separating the input character stream into meaningful units, so-called lexemes. These lexemes are assigned abstract token types, such as identifier, relational operator, or keyword, as defined by the grammar specification. Lexical rules for identifiers are usually given by regular expressions (for example: a character followed by an arbitrary sequence of characters and digits). When the lexical rules for identifiers are applied to keywords, the keywords meet the criterion as well. That is why the keywords declared in the language are stored in a table, and this table is looked up before deciding whether a lexeme is a keyword or an identifier. This can be implemented easily and is totally sufficient for non-composed languages, as long as the set of reserved words is stable.

When putting two language definitions together, the problem arises that an identifier might match the keyword rule of the imported or embedding language. Consider the well-known SQL language standard, in which over 200 words are reserved1. Chances are high that the embedding language comes into conflict with one of these keywords. The conflict can be inherent (identical keywords in both languages with different semantics) or can depend on the source input (identifiers used in the source code identical to some keyword). This phenomenon is referred to as keyword pollution.

Instead of determining the token type of a lexeme in a stage prior to syntactic analysis (parsing), information from the syntactic analysis can be used to tell whether a lexeme is an identifier or a keyword. Consider the compound statement in Listing 2.1. Here, we have a conflict between the boolean identifier select and the reserved SQL word. It is obvious that the context in which the lexeme select occurs helps to distinguish the two cases. Assume that the only location where an SQL construct is expected is the argument list of the constructor of SQLStatement. Then there is no doubt that the first and second occurrence (lines 1 and 2) must be an identifier and the third occurrence (line 3) must be a keyword.

1 boolean select = getStatus();
2 if (select) {
3     Statement s = new SQLStatement("select from table 1;");
4 }

Listing 2.1: Identifier-Keyword conflict
1 Depending on the SQL version; it is 295 for SQL:1999 as specified by ISO/IEC 9075, see http://www.iso.org


Lexing without syntactic information cannot discriminate between the two. This leads to the conclusion that the assignment of token types to lexemes must be postponed to the syntactic analysis. Integrating lexing with parsing is usually called scannerless or token-free parsing. The term was coined by Salomon and Cormack in [SC89]. Based on this work, scannerless parsing has received considerable attention over the past years ([Vis97], [For02], [Gri06], [KKV08]). The work at hand seeks to take advantage of the scannerless technique, too. Summing up, the first requirement that a parser generator used for language composition must meet is scannerlessness, in contrast to the ANTLR generator employed up to now. While the absence of a separate lexing phase is a more technical question, which should not affect the range of languages covered by the framework, the accepted grammar classes are affected by the parsing technology. These are highlighted in the following.
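The following minimal Java sketch (invented names, not actual framework code) illustrates the scannerless idea on the example from Listing 2.1: token types are decided by the production that consumes the characters, so select is recognized as a keyword only where an SQL construct is expected, and matched by the identifier rule everywhere else:

// A minimal scannerless recursive-descent sketch: productions operate
// directly on the character stream, so the "token type" of a lexeme
// follows from the syntactic context in which it is consumed.
final class ScannerlessSketch {
    private final String input;
    private int pos;

    ScannerlessSketch(String input) { this.input = input; }

    // Matches a literal word directly on the character stream.
    private boolean word(String w) {
        if (input.startsWith(w, pos)) { pos += w.length(); return true; }
        return false;
    }

    // Identifier rule: applied wherever the host language expects a name;
    // here "select" is happily consumed as an identifier.
    String identifier() {
        int start = pos;
        while (pos < input.length() && Character.isLetterOrDigit(input.charAt(pos))) pos++;
        return pos > start ? input.substring(start, pos) : null;
    }

    // SQL statement rule: only inside this production is "select"
    // interpreted as a keyword.
    boolean sqlStatement() {
        int mark = pos;                 // remember position for backtracking
        if (word("select") && restOfQuery()) return true;
        pos = mark;                     // undo partial consumption on failure
        return false;
    }

    private boolean restOfQuery() { return true; } // parse the remainder here
}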

2.1.2 Parsing Techniques and Grammar Classes

Almost all constructs from programming languages can be described by context-free grammars (CFGs). The context-free language class is thus theoretically best suited. However, due to practical considerations, proper subsets such as LL(k) or LR(k) have been much more relevant. Theoretical results [RS83] show that parsing arbitrary context-free sentences with well-known algorithms such as the Cocke-Younger-Kasami algorithm or Earley's algorithm takes O(n^2) space and O(n^3) time for an input string of size n. Non-tabular, backtracking algorithms may even take exponential time and linear space. These complexities are unacceptable for practical settings and call for more efficient solutions. The two equally relevant grammar classes are those accepted by top-down parsers (LL) and bottom-up parsers (LR) and their variants, both with fixed lookahead k. They are not discussed here in detail; the reader is referred to [AU72].

Generalized LR Parsing

Despite the unpleasant result that the full class of context-free languages is hard to parse, CFGs are specifically interesting for the composition of languages: in contrast to other grammar classes, CFGs are closed under composition [Bra08]. That is why much effort has been put into generalized LR parsers that fork on non-determinism in the parse table and construct parse forests instead of parse trees, leading to acceptance of all CFGs ([Vis97], based on [Tom87]).

Parsing Expression Grammars

A completely different approach to the problem of ambiguities in context-free language constructs was more recently proposed by Ford [For04]. The PEG formalism is similar to the Extended Backus-Naur Form, but it adds prioritized choice to avoid ambiguities. Parsing expressions and PEGs are defined as follows (from [For04]):

Definition 2.1 (Parsing Expression Grammar). Let G be a parsing expression grammar G = (V_N, V_T, P, S) with nonterminals V_N, terminals V_T such that V_T ∩ V_N = ∅, productions P of the form A ← e with nonterminal A ∈ V_N and a parsing expression e, and a designated start symbol S ∈ V_N.



Definition 2.2 (Parsing Expression). The empty string ε, every terminal a ∈ V_T, and every nonterminal A ∈ V_N are parsing expressions. Let e, e1, e2 be parsing expressions. Then a sequence e1 e2, a prioritized choice e1/e2, the Kleene closure e*, and the not-predicate !e are parsing expressions, too.

There is a strong connection between PEGs and backtracking recursive-descent parsers: it is very straightforward to write such parsers for PEGs. With prioritized choices there is no need to construct parse forests, because the alternatives can be tried in order until a matching one is found. As pointed out before, the backtracking nature can exhibit exponential time complexity. This is remedied by memoization, which guarantees linear parsing time.

Packrat Parsing

The parsing technique described in [For02] avoids exponential parsing time due to backtracking by saving intermediate parsing results as they are computed, so that no part of the input is parsed more than once, trading memory consumption for performance. Additionally, the packrat parsing technique exhibits some properties that are useful for the discussed area of application:

Unlimited lookahead: Packrat parsers have no lookahead restriction. Unlike LL(k) or LR(k) parsers, which take into account the following k tokens for transitions or reductions, packrat parsers have no fixed lookahead. Their backtracking nature (in contrast to prediction) allows them to recognize a broader class of languages.

Scannerlessness: In predictive parsers, tokens are needed in order to predict the next action; looking at the next few characters only is generally not sufficient to decide what to do. That is why predictive parsers always rely on a separate lexical analysis phase for token creation. Packrat parsers can use their unlimited lookahead to scan tokens of arbitrary length, enabling the integration of lexical and syntactic analysis. Implementations of packrat parsers are thus usually scannerless by design.

Composability: Predictive parsers are not suited for composition. Consider an evolving grammar to which an alternative is added containing a new arbitrary nonterminal in the middle. Whenever there is another alternative with the same prefix, a predictive parser with limited lookahead will fail to predict the correct alternative, because the nonterminal might be nested and thus of arbitrary length. In contrast, packrat parsers are able to look beyond the nonterminal and eventually decide whether the alternative fits. Hence, composition of language constructs can be facilitated with a packrat parser.
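The memoization underlying packrat parsing can be illustrated with a small hand-written Java sketch (invented names; Rats!-generated parsers are far more elaborate). Each (rule, position) pair is evaluated at most once; every later attempt is answered from the memo table, which bounds backtracking to linear time:

import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

final class PackratSketch {
    // Result of a rule at a position: FAIL, or the position after the match.
    private static final int FAIL = -1;

    private final String input;
    private final Map<Long, Integer> memo = new HashMap<>();

    PackratSketch(String input) { this.input = input; }

    private static long key(int ruleId, int pos) {
        return ((long) ruleId << 32) | pos;
    }

    // Memoizing wrapper: each rule is evaluated at most once per position.
    private int apply(int ruleId, int pos, IntUnaryOperator rule) {
        long k = key(ruleId, pos);
        Integer cached = memo.get(k);
        if (cached != null) return cached;
        int result = rule.applyAsInt(pos);
        memo.put(k, result);
        return result;
    }

    // Example rule with prioritized choice: Sum <- Digit "+" Sum / Digit
    int sum(int pos) {
        return apply(0, pos, p -> {
            int afterDigit = digit(p);
            if (afterDigit != FAIL && afterDigit < input.length()
                    && input.charAt(afterDigit) == '+') {
                int afterSum = sum(afterDigit + 1);
                if (afterSum != FAIL) return afterSum;   // first alternative
            }
            return digit(p);                             // fallback alternative
        });
    }

    int digit(int pos) {
        return apply(1, pos, p ->
            (p < input.length() && Character.isDigit(input.charAt(p))) ? p + 1 : FAIL);
    }
}

For instance, parsing "1+2+3" with sum(0) succeeds with end position 5; when the first alternative falls back to the second, the repeated digit attempt at the same position is served from the memo table instead of being re-parsed.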

#   Syntax
1   Type Nonterminal += <Name 1> e / <Name 2> ... ;
2   Type Nonterminal += <Name 1> ... / <Name 2> e ;
3   Type Nonterminal -= <Name> ;
4   Type Nonterminal := ... / <Name> e ;
5   Type Nonterminal := e ;
6   Attributes Type Nonterminal := ... ;


Figure 2.2: Overview of Rats! module modification syntax. The different modifications (1) add a new alternative before an existing one, (2) add a new alternative after an existing one, (3) remove an alternative, (4) override an alternative with a new expression, (5) override a production with a new expression, and (6) override a production's attributes, respectively. (from [Gri06])

The current implementation of the framework uses ANTLR to produce domain parsers. ANTLR is a recursive-descent, predicated LL(*) parser generator. This implies that it works top-down and accepts a proper subset of the class of context-free languages. Although it features backtracking and memoization, too, the deterministic finite automaton (DFA) it employs for arbitrary lookahead is not as powerful as a (deterministic) pushdown automaton. Furthermore, it relies on tokens as lexical units, which prohibits composability.

Rats! is considered the most appropriate parser generator [Gri06]. It produces packrat parsers, is freely available with sources, and is written in Java. Apart from that, its most prominent feature is the support for modular syntax definitions. Possible rule modifications are listed in Fig. 2.2. In addition, modules can be parameterized. This is particularly helpful when imported syntax elements change: the host language automatically picks up the changes from the module passed as argument.

2.2 TCS Mapping Language

A language for specifying mappings between abstract and concrete textual syntax was proposed by Jouault et al. [JBK06]. The main idea is to provide a simple yet concise template-based language for bidirectional specifications between abstract and concrete syntax. Bidirectional here means that a TCS artifact can be read in two directions:

From model to text: Given a model instance conforming to an abstract syntax model and the TCS mapping, determine a textual representation capturing all model elements, attributes and associations as defined by the mapping. This direction is sometimes referred to as prettyprinting the model.

From text to model: In order to edit text from which a model can be created unambiguously, the mapping must specify the other direction, too. Parsing in combination with model injection is the process transforming text into a model. Therefore, TCS needs constructs that guide a parser when recognizing textual constructs representing model elements.


For illustration, consider an arithmetic expression like 3 + 4 * 9. An abstract syntax tree for it has an addition element at the root node, with an integer literal (3) on the left and a multiplicative element on the right containing two leaves (4 and 9). Emitting text for this abstract representation is straightforward: without knowing arithmetic rules, an emitter may roll out the abstract syntax and emit text for each element, from left to right. One might even add parentheses surrounding the multiplicative expression to emphasize the precedence (known from the tree structure of the AST). Coming from the textual concrete syntax, however, the TCS mapping needs to declare, first, the precedence of operators and, second, how compound expressions are represented as models. This is why the mapping contains a list of operators and templates for these operatored constructs. With this information it must be possible to derive a parser for the domain of arithmetic expressions from the abstract and concrete syntax definition.
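To make the example concrete, a hypothetical rendering of such an abstract syntax as Java classes could look as follows (all names invented for this sketch; the property names leftside and rightside anticipate the operator templates of Sec. 2.2.7). The tree for 3 + 4 * 9 then encodes the precedence structurally:

// Hypothetical abstract syntax for arithmetic expressions (illustration only).
abstract class Expression { }

final class IntLiteral extends Expression {
    final int value;
    IntLiteral(int value) { this.value = value; }
}

final class PlusExp extends Expression {
    final Expression leftside, rightside;
    PlusExp(Expression l, Expression r) { leftside = l; rightside = r; }
}

final class MultExp extends Expression {
    final Expression leftside, rightside;
    MultExp(Expression l, Expression r) { leftside = l; rightside = r; }
}

class AstDemo {
    public static void main(String[] args) {
        // 3 + 4 * 9: the multiplication is nested under the addition,
        // so precedence is explicit in the tree, not in the text.
        Expression e = new PlusExp(
                new IntLiteral(3),
                new MultExp(new IntLiteral(4), new IntLiteral(9)));
        System.out.println(e);
    }
}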

The remainder of this section deals with the different TCS constructs (each depicted as a MOF diagram of its meta-elements). The focus is on the language constructs that are most relevant for the automatic creation of parsers from a TCS syntax specification and the respective abstract language syntax.

2.2.1 An Introductory Example

To illustrate the fundamental idea of concrete-to-abstract syntax mappings, the following excerpt from TCS.tcs is given:

syntax TCS {

    primitiveTemplate stringSymbol for PrimitiveTypes::String using STRING:
        value = "unescapeString(%token%)";

    template TCS::ConcreteSyntax main context
        : "syntax" name (isDefined(k) ? "(" "k" "=" k ")") "{" [
            templates
            (isDefined(keywords) ? "keywords" "{" [ keywords ] "}")
            (isDefined(symbols) ? "symbols" "{" [ symbols ] "}")
            operatorLists
            tokens
            (isDefined(lexer) ? "lexer" "=" lexer{as = stringSymbol} ";")
        ] {nbNL = 2} "}"
        ;

    ...

}

Listing 2.2: TCS introductory example (from TCS.tcs)

Without going into detail, a few things should be noticed about Listing 2.2:



- Mappings define concrete syntax per meta class (two in the example: PrimitiveTypes::String and tcs.ConcreteSyntax) in a template manner.
- Syntax provided for a meta class can depend on its attributes and references or not (class template vs. primitive template).
- Within a class template, a meta class's attributes can be referenced regardless of their multiplicity. Following the template style, concrete textual syntax for the respective elements is expected (or emitted) where the attribute is referenced (name, k, templates, keywords etc.).
- Additional formatting information may be given (square brackets and the option nbNL = 2) to specify the exact output.

2.2.2 tcs.ConcreteSyntax
Figure 2.3: MOF diagram of TCS element ConcreteSyntax

The main meta element of a TCS mapping is tcs.ConcreteSyntax, holding, as a composition, templates, keyword and symbol definitions, token specifications, and operator lists. The attributes lexer and k exist specifically for use with the ANTLR parser generator: they can hold the maximum lookahead (k) and a string of lexer code that can override the default lexer provided by ANTLR. Hence, they can be ignored in the context of the work at hand.

2.2.3 tcs.Template

The abstract type Template is the backbone of a syntax definition. For meta classes from the abstract syntax (metamodel), concrete textual syntax can be specified, usually with one template per meta element.2 As a QualifiedNamedElement, every template references a Classifier from the M3 metamodel, i.e. an element of the abstract syntax. This is the element whose concrete textual syntax is specified by the respective template. The various subtypes of tcs.Template each serve a special purpose:
2 Template modes are intentionally omitted for simplicity.



Figure 2.4: MOF diagram of TCS element Templates

Primitive templates define syntax for simple, lexical constructs.

Class templates are the most important type; they specify concrete syntax for complex classifiers.

Operator templates are defined for model elements representing compound expressions connected by one or more designated operators. Typically they are used with operatored expressions.

Enumeration templates refer to classifiers of an enumeration type and may specify the appearance of their literals.

Function templates serve to factor out syntax that appears more than once.

2.2.4 tcs.ClassTemplate

The central element to specify concrete syntax for models is ClassTemplate. It offers a vast amount of features; most of the changes in the SAP-specific version target the class template.

Class templates define the textual representation of a classifier. The right-hand side sequence of a class template specifies what is used to produce text from a given instance of the meta class, or how the parser should interpret text in order to create such an instance. This sequence consists of textual elements independent from the model (string literals, for example) and references to model properties (for details see 2.2.6).

Figure 2.5: MOF diagram of TCS element ClassTemplate

For the generation of a parser from TCS definitions, three attributes are specifically interesting3:

main: When a template is defined as main, the parser will start parsing a given syntax with this rule. In the introductory example (2.2) the class template for tcs.ConcreteSyntax is tagged as main, implying that textual artifacts conforming to the specified language start with syntax and end with a closing curly brace.

abstract: The prevalent use cases for abstract templates are, first, model elements inheriting from others and, second, operatored templates. It is possible to state for an abstract model element that its syntax is specified through its subtypes. With the operatored option the user can specify that an abstract model element consists of various subtypes connected by operators (of different priorities). In the context of parser generation for domain-specific languages this is the toughest case and needs special consideration.

referenceOnly: When a complex model element is referenced but should never be created, a referenceOnly template can be provided for it.

The right-hand side of a class template is a Sequence consisting of 0..* SequenceElements, the abstract supertype of everything representing a contribution to the syntax or specifying detailed semantics for the creation of models (Fig. 2.6 and 2.8). References to a model's attributes are established simply by stating the element's name within a sequence. These properties can be complemented by various options (subtypes of PropertyArg in Fig. 2.9). Following the template paradigm, the right template for the referenced model element is looked up, and the syntax specified by that class template is inserted, as sketched below. In the opposite direction (parsing), the attributes are set according to what is textually present.
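The template lookup just described can be pictured with a small Java sketch (invented names; the actual framework is model-driven and far richer): a registry maps meta classes to templates, and prettyprinting an element dispatches on its meta class, recursing for every referenced property:

import java.util.Map;

// Minimal sketch of the template paradigm during prettyprinting.
interface Template {
    String print(Object modelElement, TemplateRegistry registry);
}

final class TemplateRegistry {
    private final Map<Class<?>, Template> templates;

    TemplateRegistry(Map<Class<?>, Template> templates) { this.templates = templates; }

    // Look up the template registered for the element's meta class and
    // let it render the element; templates call back in for properties.
    String print(Object element) {
        Template t = templates.get(element.getClass());
        if (t == null) {
            throw new IllegalStateException("no template for " + element.getClass());
        }
        return t.print(element, this);
    }
}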
3 The template attributes multi, context, addToContext, and deep are not discussed here. Discussion of nonPrimary is postponed to Sec. 2.2.7.



2.2.5 tcs.Sequence and tcs.SequenceElement


Figure 2.6: MOF diagram of TCS element Sequence

As right-hand sides of class templates, Sequences represent the mapping from one meta class to text and vice versa. Their elements (of abstract type SequenceElement) can be of structuring kind, such as Block, Function or CustomSeparator; of model-related type, such as Property and InjectorActionsBlock; related to choice, such as Alternative or ConditionalElement; or purely syntactical, such as LiteralRef. See Fig. 2.8 for a MOF diagram of all elements inheriting from SequenceElement.

Figure 2.7: MOF diagram of TCS element Expression


ConditionalElement

Depending on a condition of type Expression (depicted in Fig. 2.7), different sequences (a then-sequence and an optional else-sequence) can be stated in a ConditionalElement of the following form:

condition ? thenSequence : elseSequence


This is especially useful for sequences that are displayed depending on whether a reference is set, tested with isDefined.
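As a tiny illustration (hypothetical Java, not part of TCS or FURCAS), a prettyprinter realizes the conditional from the introductory example, (isDefined(k) ? "(" "k" "=" k ")"), by emitting the then-sequence only when the reference is set:

// Hypothetical prettyprinter fragment for a ConditionalElement without
// an else-sequence: nothing is emitted when the condition fails.
final class ConditionalElementPrinter {
    String printK(Integer k) {
        return (k != null) ? "(k = " + k + ")" : "";
    }
}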
Figure 2.8: MOF diagram of TCS element SequenceElement

Alternative

Although alternatives are usually incorporated in the abstract syntax via inheritance, TCS allows alternative textual syntax within a sequence for multivalued references. For this purpose, alternatives reference 0..* sequences of the specialized type SequenceInAlternative, allowing nested sequences.

2.2.6 tcs.Property

Model attributes of primitive type, such as string or integer, do not require special consideration when printing or parsing. The more critical structural features are references to other model elements. The following questions arise when referencing model elements in textual syntax:

- How can the referenced model element be identified?
- How (and where) should model elements be created if the reference cannot be resolved?


Figure 2.9: MOF diagram of TCS element Property

Identification of referenced elements

The property options refersTo, lookIn and query serve to identify the model element that is being referenced through the syntactic construct. In the easiest case, this can be a uniquely identifying attribute (name, for example). To specify the scope of the lookup more explicitly (expanding or restricting it), property arguments with the lookIn clause can be used. Common usage is to specify #all to leave the current context, or a path expression starting from the current meta element. The QueryPArg is an extension to the original TCS: it allows using OCL or MQL4 statements to identify model elements that could not be referenced otherwise (by a path expression or by attribute lookup).

Creation of referenced elements

Usually, referenced elements should be created by the model injector attached to the parser; after all, this is the central idea of textual modeling. In some specific cases this behavior needs to be overridden, however. By using the property argument autoCreate, the mapping designer can specify whether the referenced model element should be created never, always or ifMissing (default), as sketched below.
4 MQL: MOIN Query Language, a query language for models syntactically similar to SQL
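A possible reading of the autoCreate semantics in a model injector is sketched below in Java (invented interfaces and names; actual resolution in the framework additionally involves scopes, lookIn arguments, and queries):

enum AutoCreateKind { ALWAYS, IF_MISSING, NEVER }

interface ModelScope {
    Object lookup(String identifyingValue);   // e.g. match by the "name" attribute
    Object create(String identifyingValue);   // create a new element in this scope
}

final class ReferenceResolver {
    Object resolve(String value, ModelScope scope, AutoCreateKind autoCreate) {
        if (autoCreate == AutoCreateKind.ALWAYS) {
            return scope.create(value);
        }
        Object found = scope.lookup(value);
        if (found != null) return found;
        if (autoCreate == AutoCreateKind.IF_MISSING) {
            return scope.create(value);        // the default behavior
        }
        throw new IllegalStateException("unresolved reference: " + value);
    }
}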


2.2.7 tcs.Operators and OperatorTemplate

A central characteristic of abstract language specifications is the absence of a detailed description of how the language constructs are represented. TCS provides a valuable feature to close the gap between the abstract language specification (metamodel) and the textual concrete syntax by means of Operators and OperatorTemplates. The basic complication arises from the fact that the mapping is not supposed to specify explicit constructs akin to grammar rules. The textual concrete syntax should instead be defined on a higher level, by stating operators and priorities and the respective class and operator templates. Thus, the framework generating a parser and model creator from the metamodel and the language mapping must close this gap. Details are discussed in Ch. 3 and 4.

OperatorList

The different priority levels associated with an expression are specified in an operator list. In addition, arity and associativity can be declared for each operator.

Figure 2.10: MOF diagram of TCS element OperatorList

These operators can be used in combination with abstract operatored class templates. When an operator list is stated in the header of an abstract operatored class template, this means that whenever the abstract type is referenced, the concrete subclasses must be processed respecting the given priorities of the operator list. In an arithmetic expressions example, the abstract classifier Expression might be given an abstract operatored template with an operator list consisting of two priorities: the multiplicative operators on level 0 (highest) and the additive ones on level 1. Now, whenever an arithmetic expression is expected, two things are actually processed:

- the most elementary constituents of arithmetic expressions (number literals, most likely), also called primaries;
- the various combinations of primaries by means of operators, typically binary expressions of multiplicative or additive type. The detailed definition of such compounds is only possible with OperatorTemplates (see below).

OperatorTemplate

Adding an operator list to an abstract template merely specifies in which order the sub-expressions must be parsed. For a mapping of these constructs to the abstract syntax, however, OperatorTemplates are needed. They define how elements from the abstract syntax are put together. Consider the following operator template specification:

operatorTemplate PlusExp(operators = opPlus, source = leftside, storeRightTo = rightside);

This implies that in the abstract syntax all occurrences of the operator opPlus within an operatored expression are represented as a model element of type PlusExp, with the reference leftside set to what is parsed left of the operator opPlus (and rightside respectively).
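The following Java sketch (invented names; not generated code) shows how a parser could realize such an operator list with two priority levels, building a PlusExp for every occurrence of opPlus, with leftside and rightside filled exactly as the operator template above prescribes:

import java.util.List;

// Invented AST types; leftside/rightside mirror the operator template.
final class PlusExpNode {
    final Object leftside, rightside;
    PlusExpNode(Object l, Object r) { leftside = l; rightside = r; }
}

final class MultExpNode {
    final Object leftside, rightside;
    MultExpNode(Object l, Object r) { leftside = l; rightside = r; }
}

final class OperatoredExprParser {
    private final List<String> tokens;
    private int pos;

    OperatoredExprParser(List<String> tokens) { this.tokens = tokens; }

    // Priority level 1: left-associative opPlus builds a PlusExpNode.
    Object parseAdditive() {
        Object left = parseMultiplicative();
        while (tryConsume("+")) {
            left = new PlusExpNode(left, parseMultiplicative());
        }
        return left;
    }

    // Priority level 0 (binds tighter): multiplicative operators.
    Object parseMultiplicative() {
        Object left = parsePrimary();
        while (tryConsume("*")) {
            left = new MultExpNode(left, parsePrimary());
        }
        return left;
    }

    // Primaries: the most elementary constituents, here integer literals.
    Object parsePrimary() { return Integer.valueOf(tokens.get(pos++)); }

    private boolean tryConsume(String symbol) {
        if (pos < tokens.size() && tokens.get(pos).equals(symbol)) { pos++; return true; }
        return false;
    }
}

Parsing the token sequence 3 + 4 * 9 this way yields PlusExpNode(3, MultExpNode(4, 9)), i.e. exactly the tree discussed at the beginning of Sec. 2.2.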

2.2.8 tcs.FunctionTemplate

Similar to programming languages, in TCS constructs common to more than one model element can be factored out using functions (expressed by function templates). FunctionTemplates are parameterized with a model element type; their sequence (right-hand side) can access all properties defined in that element. Thus, functions defined that way can be used (called) within the sequence of any template specifying the concrete syntax for a subclass of the parameter type.
Figure 2.11: MOF diagram of TCS element FunctionTemplate

2.2.9 tcs.EnumerationTemplate

For enumeration types, TCS provides a means to specify concrete syntax both easily and flexibly. In the basic case, EnumerationTemplates rely on the enumeration's literals and create syntax automatically. In a more complex case, 0..* mappings for (some or all) literals of an enumeration can be specified, where the right-hand side is an arbitrary sequence element as introduced in Sec. 2.2.5.


Figure 2.12: MOF diagram of TCS element EnumerationTemplate

2.3 Terminology

For the sake of clarity, some of the most important terms are listed below with their synonyms. Although every care was taken to use the terms consistently, this section seeks to prevent misunderstandings.

Syntax Definition: Syntax definition or syntax specification refers to the definition of the concrete textual syntax of a DSL. Although a DSL consists of abstract and concrete syntaxes, the term syntax is usually used for the concrete syntax, while metamodel refers to the abstract syntax.

Mapping: Throughout the work at hand, the term mapping refers to concrete-to-abstract syntax mappings specified in TCS as defined in [JBK06].

Domain Parser: In the context of textual concrete syntax frameworks, the term domain parser refers to the parser generated from a mapping and a metamodel in order to parse code written in the domain-specific language.

Injector: Injector denotes the framework component manipulating models. This component is driven by the editor and the domain parser.


2.4 Related Work

A number of textual concrete syntax frameworks have been proposed recently. A comprehensive study analyzing the various frameworks can be found in [GBU08]. This publication investigates and classifies the CTS frameworks based on an exhaustive set of framework features. For the discussion of related work, modularity concepts and support for language composition are most interesting. Xtext [BCE+] provides grammar mixins, allowing the import of rules from another grammar. To our knowledge its parser generator is scannerless, too5. However, lexical conflicts are not discussed in detail; keyword-keyword conflicts are avoided by favoring new keywords over imported ones. The Stratego/XT toolkit [BV04] uses a scannerless generalized LR parsing technique and comprehensive disambiguation features developed within the ASF+SDF Meta-Environment [vdBvDH+01]. The general difference is that it focuses on the concrete syntax, while for the work at hand a central premise is that languages are modeled with an abstract syntax defined by a metamodel.

5 See a blog entry by the project lead of the Xtext framework: http://blog.efftinge.de/2009/01/xtext-new-parser-backend.html

3. Design
As outlined in Ch. 2, a full-fledged textual editing framework (or concrete textual syntax framework) produces a parser that accepts artifacts written in a domain-specific language and the textual editor that interacts with this parser. Usually, there will also be a lexer for lexical analysis beforehand. The central idea of avoiding the explicit creation of tokens and the association of token types, such as keyword or identifier, with a given token entails the need to completely rework the mechanisms that produce the parser on the one hand, and to integrate with the existing techniques that update or manipulate the model on the other.

Conceptually, there is a significant amount of parallels between the existing approach using the traditional two-stage parser generator ANTLR and the one using Rats! proposed here. These parallels can be exploited by reusing code from the transformation of a TCS instance into a valid grammar file. However, two fundamental differences impose careful reconsideration of some central parts of the transformation code. First, the way Rats!-generated parsers work (backtracking, recursive-descent with an ordered set of alternatives) affects grammar code that deals with model injection (updating or creating model instances according to textual input), since no expansion of a nonterminal rule is guaranteed to succeed. Thus, actions that create model elements (or proxies thereof) must potentially be undone when a rule fails. The incrementality of the resulting parsers does not allow applying a multi-pass strategy. Second, the scannerless parsing technique affects all code relying on tokens as the most basic syntactic elements. Error logging and reporting makes extensive use of tokens and is not likely to be easy to implement in a scannerless environment. To our knowledge, this is a known and unsolved problem of scannerless parsing in general.

This chapter outlines the design for the two major parts of the work: transforming a TCS instance into a Rats! grammar and integrating the generated parser into the existing textual editor framework. The discussion of design decisions resulting from more technical barriers, such as the inability of Rats! to accept parameterized rules or its performance optimizations, is left to the implementation Ch. 4.


3.1 Overview

Before stating the detailed design decisions that were made for a migration to a scannerless parser generator, the following section gives a brief overview of both the technological and the conceptual issues that limit the solution space.

3.1.1 Technological Overview

Figure 3.1 gives a compact overview of the components of the textual editing framework. Central to the process of developing a new domain-specific language are two documents (shaded in dark gray):

- The mapping definition specified in TCS, which is needed in order to define the textual representation of a given model and vice versa. Usually, it is edited with the same framework that is employed to edit language instances (code).
- A language instance, i.e. an instance of the language specified by the abstract syntax given as syntax model, with the concrete textual representation to be derived from the mapping definition.

Figure 3.1: Architecture Overview. For a legend of the FMC diagrams see Fig. 1.1

Given a TCS mapping, the framework needs to be capable of parsing the definition (TCS parser) and of producing all components that are necessary for the analysis of code written in that designated language (DSL parser and editor, shaded in light gray). While the TCS parser must be shipped with the editing framework1, language-specific parts are subject to change and are dynamically updated (upon a save action on the mapping, for instance). All parsers included in the editing framework communicate with an observer responsible for updates and/or creation of model elements via requests stated in an efficient model query language. Changes in a textual editor opened for a language instance are reflected in the underlying abstract code model when the code is re-parsed incrementally. Although editing code written in a DSL works just like editing code in Java, for example, the textual representation is never stored as such. Instead, a decorated domain model (an instance of the abstract syntax model) is stored. Details advocating this approach can be found in [GBU09].

3.1.2 Higher-Order Transformation Approach

Migrating from one parser-generating technology (ANTLR) to another (Rats!) first and foremost concerns the way parsers for DSLs are constructed. Irrespective of the fact that the generated parsers must fit into the FURCAS framework, which heavily relies on tokens as lexical constructs and on Java methods as units of syntactic constructs, a valid parser must be produced from a TCS instance and the respective abstract syntax modeling the language. As depicted in Fig. 3.2, the parser generator processes a grammar file which is produced by the TCS-to-grammar transformation.

Figure 3.2: TCS to grammar transformation

The grammar can be regarded as a condensed description of the concrete and abstract syntax of a DSL. From a language-theory viewpoint, for a domain-specific language L_D and its grammar G the following must hold: L(G) ⊇ L_D, i.e. every sentence w ∈ L_D can be generated by a series of derivation steps from the start symbol S following the grammar: ∀w ∈ L_D : w ∈ {w′ : S ⇒*_G w′}.2

1 Note that TCS itself can be treated as a domain-specific language (with abstract and concrete syntax specified in just the same way). Using the editing framework with this language specification results in a bootstrapped version of the TCS parser that is part of the framework. This parser/editor pair can be used as a proof of concept, but is discussed later.
2 The exact specification of the vocabulary is omitted here, since this would usually be the set of tokens. Obviously, the declaration of tokens is not implicit in a scannerless environment. As can be seen later, explicit creation of tokens can be emulated, however, where the requirement makes sense.


The transformation described in the following is of higher order: first, a grammar can be viewed as a way to transform textual input into an abstract syntax. Second, it is the output of a transformation itself. This suggests the classification as a higher-order transformation.

3.1.3 Bootstrapping the TCS

The aforementioned transformation needs a TCS instance (an instance of the TCS abstract syntax represented by its metamodel) as input, which is generally not present. Arguing that the TCS model can be generated using the concrete and abstract language specification leads to a vicious circle: it assumes the existence of a fully functional CTS framework with which it is generated, since the TCS parser is part of the framework (see Fig. 3.1).3

The good news is that there is a solution that does not beg the question: the TCS model could be created using a model editor from outside the CTS framework. Graphical tree-based editors with appropriate property sheets can serve the purpose. Especially references between model elements can be quite hard to set and need special attention to be correct.

One could question the entire higher-order transformation approach and start by writing a TCS grammar by hand. Although this solution seems tempting, the problems arising are more subtle and can only be understood knowing details of the targeted environment. Parts of the framework that are to be reused include the parsing observer that observes the process of parsing any language artifact. It is responsible for delegating actions for model creation or update. These actions need to be in the right position in the parser code to work properly. Adding such action code to a hand-written grammar is not only tedious but also error-prone.

The special task of the work at hand is to study how an existing CTS framework can be modified to allow language specifications to be composed. This offers the chance to not only reuse significant parts of the framework but also use them in order to create the new one. Creation of a TCS instance modeling the TCS language is such a point: feeding the TCS syntax mapping and the metamodel into the environment leads to the desired model. Leaving aside technical issues such as serializing the model appropriately, this carries out the task in an elegant way. This solution is favored; the most important components and artifacts of the bootstrap are depicted in Fig. 3.3, with the desired result shaded in light gray.

3 A classic petitio principii, in which the proposition is assumed to be true as part of the premise.

3.2 TCS Modifications for Composition

As part of the design towards language composition, the TCS metamodel and concrete syntax need slight modifications. This highlights the fact that TCS is a domain-specific language, too. Its domain is the specification of concrete syntaxes for models. As with all DSLs, changes to the language must apply to the abstract syntax and all of its concrete syntaxes.

Figure 3.3: Using the legacy TCS parser to create a TCS instance

Modification of Abstract Syntax
On the abstract syntax level, composition can be supported by adding a loop imports to ConcreteSyntax, allowing to reference (import) concrete textual syntaxes specified in different TCS specifications. The modified TCS excerpt is depicted in Fig. 3.4. The imports form a tree with a designated root representing the concrete syntax of the main model element. Within this document, constructs can appear representing imported syntax.

Figure 3.4: MOF diagram of tcs.ConcreteSyntax after adding import

There are two possible designs for compositional syntax mappings:

- Import of a whole abstract-to-concrete syntax mapping, including all syntax definitions defined in the mapping. This resembles the import declaration import mypackage.* in Java that imports all classes from a package.
- Import of a subset of the constructs defined in a mapping: this requires checks that this subset is closed as defined in Sec. 3.3 (well-formedness).

The first choice is preferred, since a goal of composing languages with the described framework is to reuse syntax specifications already written. This should not require white-box knowledge of their internal constructs (templates etc.). From a language designer's viewpoint, it should be possible to simply use an existing language construct. Since the relationship between the different templates of a syntax definition is implicitly defined by the referenced metamodel, it is essential to provide a white-box import mechanism. As opposed to importing Java classes, the interface of the imported definitions is not always known in advance. The parsers resulting from each imported mapping are merely supposed to be plugged in. This refers to composition on the tool level. The work at hand focuses on the combination of lexical and syntactic analyzers and considers tool composition as a succeeding task. We therefore investigate whether such functionality can be consolidated with a scannerless parsing technique of the underlying domain parsers.

Modification of Concrete Syntax
The common syntax for TCS mappings must provide a way to specify import declarations textually. A simple construct referencing imports before the actual syntax definition can be added as shown in Listing 3.1. Note that the additional template for ConcreteSyntax is needed in order to specify an import just by stating its (qualified) name.

    template TCS::ConcreteSyntax main context
        : imports "syntax" name (isDefined(k) ? "(" "k" "=" k ")") "{" [
            templates
            (isDefined(keywords) ? "keywords" "{" [ keywords ] "}")
            (isDefined(symbols) ? "symbols" "{" [ symbols ] "}")
            operatorLists
            tokens
            (isDefined(lexer) ? "lexer" "=" lexer{as = stringSymbol} ";")
          ] {nbNL = 2} "}"
        ;

    template TCS::ImportDeclaration
        : "import" concreteSyntax{refersTo = name}
        ;

Listing 3.1: Extract of the modified concrete syntax specification for TCS

The combination of the modified abstract syntax (TCS metamodel) and concrete syntax (TCS mapping for TCS) can be processed by the bootstrapped TCS parser as outlined in Sec. 3.1.2. The result will be a TCS parser recognizing syntax that contains compositional elements (import statements). However, the crucial part of composing languages is creating parsers for the composed languages. The following section discusses the transformation from a TCS instance to a grammar, which is the basis for compositional parsers.


3.3 TCS-to-Grammar Transformation

For correctness of the transformation of a TCS instance into a parser, a formal specification of both the source and target languages involved is of great use, irrespective of its actual implementation as a model-to-model transformation (operational or relational) or written in a general-purpose programming language. In the following, particular elements of the source metamodel (TCS) are contrasted with their transformed result (grammar element). For better readability, the textual representation of grammar elements is chosen.

The transformation task can be stated formally as follows: let T be a TCS instance conforming to the metamodel M_TCS described in Sec. 2.2, specifying the concrete textual syntax of a DSL L with abstract syntax given by the metamodel M_DSL. Let G be the output grammar. Then G is expected to meet the following conditions:

Well-formedness
G must be a valid Rats! grammar. The code generating facilities of Rats! must accept the grammar if the mapping is a closed specification of a concrete-to-abstract syntax mapping. Closed here means that templates are specified for all elements of M_DSL that are (directly or indirectly) referenced from the main model element. This is a weaker assumption than restricting oneself to complete syntax mappings. Complete would mean that concrete syntaxes need to be specified for all model elements. This relaxation is especially useful for large metamodels where only parts are to be edited by means of textual syntax. Generally, a separate step testing the validity of the input model is desirable. Validation of T should thus be performed before the start of the transformation to guarantee proper output.

Syntactic correctness of action code
If G contains action code, which is true for all non-trivial cases of T, this code must fit syntactically into the generated parser code. Although this might sound like a trivial requirement, snippets of code containing variable declarations and compound statements are likely to produce syntactically incorrect code. The correctness criterion must hold for all closed instances T.

Adaptability
Without knowing details of M_DSL, an editor must be able to use the parser generated from G. Especially partial re-parsing of textual input needs to be supported by directly invoking parser methods. For this purpose, a nomenclature for metamodel elements is needed, stating an injective map nom : M_DSL × modes → D, with D denoting the domain of legal names for Java identifiers and modes being the set of template modes used. With this naming convention, lookup and reflective invocation of methods can be performed. A designated method name for parsing the main element of a syntax must be associated with the rule originating from the TCS template with isMain set to true.
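To illustrate, a minimal sketch of such a naming function is given below. The concrete scheme is an assumption chosen to match the rule names appearing in the listings of Ch. 4; it is not necessarily the convention implemented in FURCAS.

    // Hypothetical sketch of the injective map nom : M_DSL x modes -> D.
    // The scheme (lowercased qualified name, mode appended) is an assumption.
    static String nom(String qualifiedMetaClassName, String mode) {
        String base = qualifiedMetaClassName.toLowerCase().replace("::", "_");
        return (mode == null || mode.isEmpty()) ? base : base + "_" + mode;
    }

    // nom("TCS::EnumLiteralMapping", null)  yields "tcs_enumliteralmapping" (cf. Listing 4.2)
    // nom("TCS::ClassTemplate", "short")    yields "tcs_classtemplate_short"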


3.3.1 Concrete Syntax to Grammar

As explained in Ch. 2, a promise of the Rats! parser generator framework is its support for the modularization of syntax specifications via modules that can be imported, extended and reused. For the composition of concrete-to-abstract syntax mappings, this is considered a valuable feature. Consequently, the transformation is designed to produce one module per concrete syntax definition. The whole tree of definitions importing each other is captured by a Rats! grammar.

Input 3.3.1.1 (ConcreteSyntax). C with syntax imports I, set of template declarations T, set of keywords K, set of symbols S, and list of operators O. For future ease of reference, the elements of T can be partitioned according to their types. With T_C denoting ClassTemplates, T_P denoting PrimitiveTemplates, T_O denoting OperatorTemplates, T_E denoting EnumerationTemplates and T_F denoting FunctionTemplates:

    T = ⋃_{i ∈ {C,P,O,E,F}} T_i

Output 3.3.1.1 (Grammar). G with modules for all syntax imports i ∈ I and ModuleModifications if the imports are not disjoint with respect to their sets of nonterminals. If no imports are present (leaves in the import tree), there is a one-to-one relation between ConcreteSyntax and Module without modifications. Those modules contain productions resulting from the transformation of templates T (see 3.3.2), operators O (see 3.3.3), and keywords K and symbols S (see 3.3.7). Additionally, for model injection and observation of the parsing process, the modules' options need to be set to stateful. The detailed semantics of this keyword is left to Sec. 4.3.
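As an illustration, a module generated for a composed syntax might start as follows. The module and class names are purely illustrative, and the exact spelling of the option is an assumption based on Rats!' grammar syntax.

    module dsl.Statements;

    import dsl.Expressions;          // module generated for an imported syntax

    option stateful(InjectorState);  // make the parsing process observable (see Sec. 4.3)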

3.3.2 Class Templates to Productions

A ClassTemplate states the expected textual representation of a model element. This may be a combination of lexical parts, such as references to symbols and keywords, and parts that depend on the element's structural features. Each metamodel element that is supposed to be edited with concrete textual syntax needs a corresponding specification by one (or more) class templates. As can be seen in Fig. 2.5, many different cases need to be considered when transforming class templates (according to the assignments of the various attributes). This part is central to the TCS-to-grammar transformation. Special cases to be considered include:

- abstract or abstract operatored, with or without syntactic contribution
- main
- template modes allowing different syntax definitions for one meta class
- referenceOnly templates
- context tags attached to templates

Common to all class templates is the fact that their sequences need to appear on the right-hand side of a production, so that a context-free parser generated from the grammar can expand the respective nonterminal to all syntax representing the model instance.

Input 3.3.2.1 (ClassTemplate non-abstract). t ∈ T_C with reference to Sequence s. Let t be non-abstract.

Output 3.3.2.1 (Production). p results, with its nonterminal name set to the value of the nomenclature map nom applied to the meta type for which t was specified and t.mode, and with pre- and post-actions A_pre and A_post. p's declared return type is Object to allow any model element to be created, and p is a stateful production to allow micro-transactions to observe the parsing process. Visibility of p is public only if t.isMain equals true; otherwise it is private. Thus, only the top-level language constructs can be parsed from outside. The right-hand side of p will be the result of the transformation applied to all elements of s's SequenceElements.

Output 3.3.2.2 (Action). A_pre creating a model element proxy with the referenced meta type. This proxy will store all attribute values until it is passed to the model query engine in the post-action. A_pre must pass the context information, i.e. the boolean flag t.isAddToContext and its optional context tags, to the proxy for the resolution of references. If t is a referenceOnly template, a dedicated reference proxy will be created within A_pre that never leads to model creation.

Output 3.3.2.3 (Action). A_post setting the final return value. This will always be the result of the delayed model creation or resolution (which can only be completed after parsing the entire template sequence).

For abstract class templates, the transformation is slightly more complicated. This is due to the fact that in TCS it is possible to carry the notion of choice, expressed


by the is-a relationship in the abstract syntax via inheritance, over to the textual syntax definition. Adding the keyword abstract to a class template is enough to specify that a template inherits the textual representation of a super type. Furthermore, abstract class templates play a part in the context of operatored expressions. Adding the operatored keyword and an operator list to an abstract class template defines that the concrete textual syntax of an abstract model element is the combination of its sub-elements using the operators with the specified priorities and associativities. For the generation of a parser, however, the notion of choice (abstraction) and priorities (operators) needs to be encoded into the grammar via alternatives. The following two input/output relations are for abstract [operatored] class templates:

Input 3.3.2.2 (ClassTemplate abstract). t_a ∈ T_C with reference to a (possibly or likely empty) Sequence s_a. Let t_a now be abstract.

Output 3.3.2.4 (Production). p_a's head is very similar to the one in 3.3.2.1. But A_pre and A_post are much simpler (see below), and the production need not be stateful, because model creation is done in the actual concrete subtemplates. If s_a is the empty sequence, the right-hand side of p_a is a set A of alternatives referencing subtemplates (with t ▷ L denoting "t specifies syntax for L"):

    A = {t ∈ T : ∃M, M′ ∈ M_DSL : M′ extends M ∧ t_a ▷ M ∧ t ▷ M′}

Output 3.3.2.5 (Production abstract contents). If s_a is not empty, there is an additional alternative in p_a for the abstract contents. This is a separate production of type non-abstract as detailed in 3.3.2.1.

Output 3.3.2.6 (Actions). A_pre is empty for abstract templates. A_post is only responsible for assigning the return value, which is the result of one of the alternatives in the sequence.

A prominent feature of TCS is its uncomplicated specification of operatored expressions. This is done via abstract operatored class templates that have a corresponding abstract syntax element and a list of operators with associated priorities. To recognize expressions of the operatored fashion, there is a need for a sequence of rules parsing structures from different priority levels, plus the rule responsible for parsing the actual concrete syntax for subtypes of the abstract syntax element (called the primary rule in the following).


Consider arithmetic expressions consisting of positive integers connected by the operators + and *, with multiplication having a higher priority than addition. The concrete syntax for such a language fragment might be expressed by an abstract template for Expression pointing to the two-element priority list containing * on level 0 and + on level 1.4 The transformation to a grammar will result in four productions (a sketch in grammar notation is given at the end of this subsection):

- an entry rule for the abstract syntax element Expression following 3.3.2.2
- a primary rule for concrete subtypes, parsing integer literals in the example
- a rule for additive structures (priority 1)
- a rule for multiplicative structures (priority 0)

These will be executed in the order entry, priority 1, priority 0, primary. Structurally, the priority-i rules are defined only by the operator list. The actual right-hand side, however, is determined by the existence of an OperatorTemplate for the respective operator. Only this operator template specifies which abstract syntax element corresponding to the combination of elements with this operator must be created or modified. In the above example, there might be a generic BinaryExp model element referencing a left- and a right-hand side. A detailed discussion of the transformation of operator templates is postponed to Sec. 3.3.3.

Input 3.3.2.3 (ClassTemplate abstract operatored). Let t_a ∈ T_C now be abstract and operatored with operator list o_a containing m priority levels prio_0 through prio_{m−1}.

Output 3.3.2.7 (Production abstract operatored). p_a as entry rule: not stateful, only delegating to the rule with the lowest priority, prio_{m−1}. If t_a has an additional non-empty sequence, 3.3.2.5 applies.

Output 3.3.2.8 (Production priority k). For each priority level k = 0 … m−1, a production p_prio_k is produced, parsing syntax corresponding to priority level prio_k. See 3.3.3 for details of the right-hand side of these productions.

Output 3.3.2.9 (Production primary). p_prim having alternatives A referencing template rules for subtypes of M_a that are not marked with the keyword nonprimary. So, similar to 3.3.2.4, A is:

    A = {t ∈ T_C ∪ T_P ∪ T_E : M_a, M ∈ M_DSL : M extends M_a ∧ t_a ▷ M_a ∧ t ▷ M ∧ ¬(t ∈ T_C ∧ t.nonPrimary)}

4 Priorities will be numbered from 0 (highest) ascending to m−1 (lowest). Note that this implies that the highest operator has the lowest index.


Output 3.3.2.10 (Actions). Pre- and post-actions of the entry rule and of the primary rule are identical to the abstract case without the operatored keyword: 3.3.2.6. Actions for the priority-k case are discussed in 3.3.3.
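To make this structure concrete, the following hedged sketch shows the shape the four productions could take for the arithmetic example. The injector actions are abbreviated to comments, binaryexp stands for the production generated from the operator template (see Sec. 3.3.3), and all rule names follow the illustrative nomenclature, not necessarily the one produced by the prototype.

    Object expression         = expression_prio_1 ;       // entry rule (3.3.2.7)
    Object expression_prio_1  = expression_prio_0
                                ( "+" /* A_psh */ binaryexp )* ;  // priority 1 (+)
    Object expression_prio_0  = expression_primary
                                ( "*" /* A_psh */ binaryexp )* ;  // priority 0 (*)
    Object expression_primary = intliteral ;              // primary rule (3.3.2.9)

Reading the calls top-down reproduces the order entry, priority 1, priority 0, primary described above.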

3.3.3 Operator Templates to Productions

While the structure of the rules created from operatored templates can be inferred from an abstract operatored class template and the referenced list of operators, an OperatorTemplate is needed in order to specify how the constructs expressed with these operators are represented in abstract syntax. In the above arithmetic example (in 3.3.2), an abstract syntax element BinaryExp was suggested, referencing two (abstract) expressions as left- and right-hand side. This illustrates the complication that arises when translating the abstract syntax to rules: while it is perfectly allowed to represent binary expressions of various types by one abstract model element (e.g. varying only in an operator attribute), the parser needs a hierarchical structure to set the references correctly. For that, not only the precedence of operators (via priorities) but also arity and associativity affect the rule creation:

Arity
The arity of an operator is its number of arguments. Usually only unary (n=1) and binary (n=2) operators are present. Ternary (n=3) expressions or operations of higher arity generally need two different operators, such as the ternary expression for an if-then-else statement in Java syntax:

    T value; if (exprIF) value = expr1; else value = expr2;

written with the operators ? and : as a single statement:

    T value = exprIF ? expr1 : expr2;

Associativity
The associativity of binary operators states how parentheses can be rearranged, i.e. whether op(op(x, y), z) = op(x, op(y, z)) for all x, y, z. Non-associative operators must be bracketed or specified as left- or right-associative:

- Left-associativity: x op y op z := (x op y) op z
- Right-associativity: x op y op z := x op (y op z)

Subtraction is a well-known example of a left-associative operation, i.e. 8 − 2 − 2 = (8 − 2) − 2 = 4. In contrast, exponentiation is right-associative: a^b^c = a^(b^c). Consider the case of only one binary left-associative operator ∘ connecting variables denoted by letters. The abstract operatored expression then calls the level-0 rule (rule priority 0), which is responsible for parsing all elements on that priority level. An input such as a ∘ b ∘ c must be parsed by three rules:


- rule priority 0 for expressions on level 0 (which can be the entire input in the example)
- rule primary for the literals (not discussed in further detail)
- rule binary as the operator rule for the connection of expressions with ∘

Associativity comes into play when connecting the different rules. In both cases, rule binary is responsible for creating the model element that associates the left and right element of the binary operation. In the left-associative case, this must result in an element having a as left and b as right side, and another element having this compound as left and c as right side. Consequently, rule binary must parse only the next construct on the same priority level and set it as the right side. This yields the parse (a ∘ b) ∘ c. In the right-associative case, rule binary gets a as the left side, and its right side can be an arbitrary expression (another binary b ∘ c in the example). This yields the parse a ∘ (b ∘ c).

Prefix vs. Infix vs. Postfix
The usual notation for binary operations is infix, i.e. operand1 followed by the operator followed by operand2. In TCS, there is a way to specify an operator as postfix, leading to notations of the form operand1 followed by operand2 followed by the operator. While infix is most common for binary operators, unary operators are commonly denoted both prefix and postfix. Consider the two variants of the increment operator, ++i and i++, in Java. Transforming a unary prefix operator and the according operator template to a grammar leads to some interesting aspects. Usually, the rule for the binary operator template is called when the parser reaches the operator (after parsing the first operand), and the left-hand side is passed to the operator template, which can create or update the according model element. Unary prefix operators must be parsed, however, before the operand is parsed. These observations can be stated more formally in the following relation. Note that the complication of transforming operator templates arises from the need to bring information from different parts of the syntax definition together in one rule.

Input 3.3.3.1 (OperatorTemplate). t_o that associates a classifier with a set of operators O′ from the universe of declared operators O, i.e. O′ ⊆ O. Let t_o have a sequence s_o and point to structural features of metamodel type M_L (as left-hand side) and M_R (optionally, as right-hand side) with M_L, M_R ∈ M_DSL.

Output 3.3.3.1 (Production priority k). p_prio_k on level k results from the operators on priority level k in an operator list. Its right-hand side is a call to the next-lower priority rule p_prio_{k−1} (or p_prim if k = 0) and a repeated element with j alternatives, j being the number of operators on priority level k.


The alternatives each consist of the operator literal, followed by an action A_psh, followed by a call to the rule representing an operator template t_o for it (optionally followed by a call to a rule t_ass parsing the right side of the operation). Depending on the associativity of the operator, t_ass is either p_prio_k (right-associative) or p_prio_{k−1} (p_prim, resp.) (left-associative).

Output 3.3.3.2 (Actions PSH, POP). A_psh and A_pop push and pop model references onto / from a stack to be processed by the operator template rule.

Output 3.3.3.3 (Production). A production for operator template t_o. Its right-hand side is the transformation of sequence s_o, with model-creating pre- and post-actions according to 3.3.2.2 and 3.3.2.3 setting the references to the left operand (via A_pop) and optionally to the right operand.

So the idea of the operator templates is to collect all information that is needed in order to create the model element representing the expression. By passing references to model elements on the left side of an operator to the operator template rule (which has the information where to store the left and right side), the model element's properties are collected piece by piece. For left-recursive constructs, specifying operator templates is the only way to implement them in a fashion that can be accepted by LL-type parser generators. This is why they are usually employed for more than only typical operatored expressions. This may give a hint of the importance of this part of the transformation.

3.3.4 FunctionTemplate to Production

Function templates provide concrete syntax for parts of model elements. The idea is to specify concrete syntax for features of a model element that are common to many (or all) of its subtypes. By simply calling the function, the function sequence is executed as if it were pasted into the class template's sequence. For the transformation, this makes it easy to specify the desired input-output relation.

Input 3.3.4.1 (FunctionTemplate). Let t_f be the function template to transform, with sequence s_f.

Output 3.3.4.1 (Production). The result is p_f, returning Object, with a name identical to t_f.functionName. Its right-hand side is the result of the transformation applied to s_f. Details are discussed in 3.3.6. Note: if accessed from a stack, the appropriate model proxy need not be passed through a rule parameter.


3.3.5 EnumerationTemplate to Production

Enumeration templates can be translated to Rats! grammars very easily. An enumeration's literals are alternatives and represent only string literals. While the automatic mode gets the string literals directly from the enum literals, the language designer can also specify the string literals representing the enum literals explicitly. This leads to the following relation:

Input 3.3.5.1 (EnumerationTemplate). Let t_e be the enumeration template to transform. Automatically or manually, a set S_L of string literals can be derived from t_e.

Output 3.3.5.1 (Production). The result is p_e, returning Object, with a name derived from t_e by means of the nomenclature nom. Its right-hand side is a set of alternatives (= ordered choice), one for each element s_l ∈ S_L, of the form (A_enter, s_l, A_exit, A_post), with injector actions A_enter and A_exit as in 3.3.6.1 and A_post returning the enum literal with the specified string representation.
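As a hedged sketch, an enumeration template for a hypothetical VisibilityKind enumeration could yield a production of the following shape (the enter/exit observer actions are abbreviated to comments, and all names are illustrative):

    Object visibilitykind =
        /* A_enter */ "public"  /* A_exit */ { yyValue = VisibilityKind.PUBLIC;  }
      / /* A_enter */ "private" /* A_exit */ { yyValue = VisibilityKind.PRIVATE; }
      ;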

3.3.6 tcs.Sequence to xtc.Sequence

A more intuitive part of the transformation concerns the TCS element Sequence. Everything that is on the right-hand side of a template is a sequence consisting of 0..* SequenceElements. As can be seen in Fig. 2.8, concrete instances of SequenceElement can be of various types. Some of them are more related to the structure of the concrete syntax, and their grammar representation is relatively straightforward (Block or LiteralRef). Others require inspection of the underlying metamodel M_DSL (Property and FunctionCall) or special actions for the parsing observer (InjectorActionsBlock or Alternative). A general pattern of sequences is nesting: Fig. 2.6 shows that three sequence element subtypes can be (or have) sequences themselves. For example, a sequence can have 0..n blocks, which themselves contain one or more sequences. This leads to a cycle in the diagram, representing the nested structure.

Input 3.3.6.1 (SequenceElement abstract). Let s be a sequence element of a type not further specified, appearing within a template t.

Output 3.3.6.1 (xtc.parser.Sequence). Regardless of whether s is atomic or nested, an xtc.Sequence s_xtc results with s_xtc = (A_enter, s′, A_exit). The actions A_enter and A_exit are notifications to the parsing observer; s′ is the transformed result of the concrete subtype of SequenceElement (see below).


If s is an instance of Alternative, Block or ConditionalElement, there is a nested sequence to be transformed, too (conditionals have two references, which can, however, be regarded as two alternatives, being a nested structure again).

Input 3.3.6.2 (SequenceElement, nested). Let s be of a nested element type and s_n its nested sequence.

Output 3.3.6.2 (xtc.parser.Sequence). The result of the transformation τ of a nested sequence element is a sequence s_xtc containing all nested elements concatenated (denoted by ⊕):

    τ(s) = s_xtc = ⊕_{s′ ∈ s_n} τ(s′)

Additionally, Blocks need parentheses surrounding the sequence and (optionally) line breaks. Alternatives require separation of their elements by / for an ordered choice. Conditionals always represent optional elements, producing two alternatives (one of which is empty).

While these definitions are recursive, atomic sequence elements such as Property, LiteralRef, FunctionCall and InjectorActionsBlock can be transformed directly.

Property
Probably the most important sequence element in TCS is Property. It refers to a structural feature of a model element. If a property appears as part of a sequence, this means that the feature's textual representation is expected in the syntax at that location. From its meta type and the syntax lookup, the transformation can infer the rule to be called and whether the part is required, repeatable or optional. The general pattern for the transformed result (textual notation for clarity) is always

    (temp_i:template_rule_name { setProperty(...) })*    or
    (temp_i:template_rule_name { setRef(...) })*

However, property arguments impede an entirely straightforward solution that just calls the rule representing the appropriate template. Most nontrivial situations arising from additional property arguments need handling by the model injecting facility. But some are related to the syntax and parser generation and are hence discussed in the following:

- AsPArg: instead of the template inferred from the meta lookup, the specified primitive template must be called.
- ModePArg: modes added to a property enforce the execution of a specific variant of a class template. The transformation needs to call the appropriate rule. Therefore, the required nomenclature defined in 3.3 takes a mode as second parameter to uniquely identify the rule.


- SeparatorPArg: if a separator is specified for multivalued properties, it has to be added to the sequence, which is a repeatable element.
- ForcedLowerPArg and ForcedUpperPArg: multiplicities of features that are overridden in the syntax specification are supposed to ensure a minimum (maximum) number of elements. Since this cannot be mapped elegantly5 to a grammar supporting only repeated (*) and optional (?) elements, the TCS feature is ignored.

Input 3.3.6.3 (Property). Given a property prop with property arguments prop.mode and prop.as referencing primitive template t_P ∈ T_P. Let M_P be the type of the structural feature and prop.sep be the optional separator argument.

Output 3.3.6.3 (Quantification). A quantified element q_prop which, in turn, contains two required elements: a Binding b_prop and an injector action A_inj. Optionally (depending on prop.sep), there is a third element: a Terminal containing the separator. b_prop is bound to the semantic value of a nonterminal resulting from the syntax lookup of prop's template, uniquely identified by nom(M_P, prop.mode), or which is simply t_P. A_inj is not detailed here, since it would require a discussion of all other property arguments such as createAs, refersTo, query etc.
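For illustration, a multivalued property with a comma separator could be transformed into a repetition of the following shape; the rule name dsl_parameter and the property name "parameters" are hypothetical:

    ( temp_1:dsl_parameter { setRef(proxy, "parameters", temp_1); }
      ( "," temp_2:dsl_parameter { setRef(proxy, "parameters", temp_2); } )* )?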

InjectorActionsBlock
As depicted in Fig. 2.8, an InjectorActionsBlock contains 0..* PropertyInits. Transforming them does not concern the syntax analyzer, as it affects only how model elements are created or updated. Thus, details of the TCS injection mechanisms are not discussed. Code generated from the transformation of PropertyInits can be inserted similarly as in the ANTLR version (in curly braces). There are, however, technical consequences that arise from the backtracking nature of all Rats!-generated parsers. These are covered in the implementation Ch. 4.

FunctionCall
A function call always references its function template. Given a correct syntax definition, the grammar must already contain a transformed result of that template. Calling this function works by merely inserting the nonterminal associated with the transformed template. When implemented with a model proxy stack, no passing of arguments is needed.
5 Apart from unfolding the forcedLower argument n to a sequence of exactly n elements plus a repetition.

LiteralRef

String literals appearing as part of a sequence can be trivially transformed to grammar elements. Since they represent keywords in the language definition, their transformation is detailed in 3.3.7.

3.3.7 Keywords and Symbols to Productions

Keywords and symbols are explicit specifications of what is to be considered an immutable unit of lexical syntax. The difference between keyword and symbol is blurred by the fact that for a literal reference consisting of more than one character it is hard to guess whether the quoted literal is considered a keyword or a symbol.6 The design for the transformation of keywords and symbols to grammar rules needs to take into account that Rats! does not produce tokens, which are definitely very useful for error reporting and editor features such as code completion. Usually, token information in TCS is specified by the following elements:

- literal references appearing as sequence elements are automatically considered keywords
- symbols and keywords may be specified additionally in the according TCS section
- lexer code can be provided in the lexer section, overriding the default lexer implementation of ANTLR

Building tokens on the fly, i.e. during the parsing process, requires a dedicated rule for each different token that can be observed via the micro-transactions detailed in Sec. 4.3. Formally stated, we need the following relation:

Input 3.3.7.1 (Keywords and Symbols). The set of keywords K can be separated into sets of explicitly defined (K_def) and referenced (K_ref) keywords: K = K_def ∪ K_ref. These sets need not be disjoint. The same is true for symbols (let S, S_def and S_ref be the according symbol sets).

Output 3.3.7.1 (Literal Productions). For each k ∈ K and for each s ∈ S, a production p results, returning a String representing the literal. The production must be stateful for transactional handling and transient to suppress memoization. An init-action is needed in order to communicate to the parsing state observer that committing this transaction leads to the creation of a new token.

Note: details of both production attributes (stateful and transient) and their effect on the transformation are described in the implementation Ch. 4.
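A hedged sketch of such a literal production for the keyword import is given below; the init-action body is a placeholder for the actual observer call.

    transient stateful String kw_import =
        { /* init-action: announce to the observer that a keyword token starts */ }
        "import"
        { yyValue = "import"; }
      ;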
6 A possible solution for this is to assume that symbols never consist of alphanumeric characters, which is true for most languages.

4. Implementation
To evaluate the feasibility of language composition, the technical issues arising from employing a scannerless parser generator need to be discussed. For this purpose, a prototypical implementation of the transformation exhibited in Ch. 3 was developed. The existing textual editing framework was modified in order to use the Rats! parser generator instead of ANTLR. When migrating to the scannerless parser technology, the following specific questions have been of interest:

- Given the differing paradigm, is it possible to implement a transformation from TCS to Rats! grammar such that for all possible mapping specifications and abstract syntax models a domain parser results?
- How can pre- and post-actions like the ones specified in Sec. 3.3 be added as pure Java code to permit injections (model creations) during the parse?
- Does the backtracking nature of every Rats!-generated parser pose a problem when creating model elements or tokens?
- How does migrating to a token-free environment affect error reporting and error recovery?
- What are the obstacles when striving for incrementality (with respect to both lexing and parsing, since they are to be integrated)?
- What technical requirements are placed on the generated parsers considering the integration into the textual editing framework?

Most of the above questions could be dealt with when implementing the prototype. The bottom line is that with features provided by Rats!, such as nested transactions (see Sec. 4.3), and a combination of a unique naming function for variable bindings (see Sec. 4.2.3) with a heuristic (see Sec. 4.4.3) ordering the alternatives appearing on a right-hand side, the designed transformation can be realized.

A remaining unresolved issue is the question of how an incremental version of the parser can be integrated into the editing framework. It is doubted that small modifications to the generated tokenizing parser are sufficient to support incrementality


with respect to both parsing and lexing. We consider this a separate topic which goes beyond the scope of this work. Requirements for a solution are discussed in Sec. 4.6.

4.1 Handler-Based Transformation

As pointed out in the design chapter (Sec. 3.1.2), the transformation from TCS to Rats! could be implemented by a model-to-model transformation or coded in Java. The prototypical implementation uses specialized handlers for the most important Rats! metamodel elements to create a Rats! grammar instance.

Figure 4.1: Handler-based transformation of TCS instances

Implementing the transformation with special-purpose classes for the main model elements was chosen for the following reasons:

- Testing: essential for the correctness of the whole transformation is the correctness of each transformed element. JUnit tests can be derived from the different handlers and the expected results.
- M2M engine: using a model-to-model transformation brings forth another external library with potentially low-performing transformation engines and additional dependencies. The author's experience with relational QVT, for example, suggests that in complex transformations a clean and concise transformation specification is hard to develop (see [Küs08]).
- Concept reuse: from the ANTLR-based framework, a significant amount of code is similar to the code needed for the transformation to Rats!.

4.2 Packrat Parser Specifics

Some details of the code generator provided by the Rats! parser generator framework need to be considered in order to produce correct domain parsers from the generated grammar. The most important and specific detail of the code generation is the fact that all generated parsers use memoization. All created parsers are backtracking, which usually implies exponential time complexity. The idea of


packrat parsers is to store intermediate parsing results, called memoization, in order to guarantee linear-time complexity at the cost of additional space consumption. Details affecting the implementation are discussed in Sec. 4.2.1. Another critical point during code generation may be the various parser optimizations that Rats! employs. As opposed to memoization, these can be deactivated, however. Additionally, they are mostly related to literal and transient productions. See Sec. 4.2.2. Action code must be added directly to the output grammar for model-creating actions and calls to the parsing observer. The prototypical implementation must make assumptions about how action code is inserted into the generated parser code. Issues resulting from those assumptions are detailed in Sec. 4.2.3.

4.2.1 Memoization

Common to all backtracking parsers with unlimited lookahead is an exponential worst-case complexity. Obviously, this is unacceptable and must be taken care of. Packrat parsers use memoization to avoid re-parsing input that has already been processed. With an additional storage overhead for the intermediate results, packrat parsers exhibit linear time complexity and are thus of practical relevance and applicable to larger inputs, too.

Technically, memoization is implemented by a lookup table (referred to as the memoization table). For dedicated nonterminals, the table stores the result obtained from invoking the nonterminal's method at a specified index, i.e. such a table is a map of the form Index → Result. If the result is present at the specified index, it is retrieved from the table. If none is present, the method to parse the construct is invoked and the table is filled.

As shown in detail in Sec. 4.3 and 4.5.3, transactional handling of methods is used for model injection and token creation. In a way, memoization violates the transactional contract that the start, commit and abort methods are invoked whenever a nonterminal is parsed. This is illustrated in Listings 4.2 and 4.3, which show the general pattern of memoized nonterminal productions. Consider the following TCS extract for EnumLiteralMappings (chosen because of their syntactical simplicity):

    template TCS::EnumLiteralMapping
        : literal "=" element
        ;

Listing 4.1: Memoization example - TCS mapping snippet

It states that an EnumLiteralMapping is represented in concrete syntax as a pair of the EnumLiteral literal and the associated SequenceElement element, separated by an equals symbol. In the produced Rats! grammar, one rule (tcs_enumliteralmapping) is created as follows:

    stateful Object tcs_enumliteralmapping =
        /* pre-action (create model proxy) */
        temp_1:tcs_enumliteralval {set(proxy, "literal", temp_1);} Spacing EQ Spacing
        temp_2:tcs_sequenceelement {set(proxy, "element", temp_2);} Spacing
        { yyValue = commitCreation(proxy, null, false); }
      ;

Listing 4.2: Memoization example - simplified generated grammar

The nonterminal tcs_enumliteralmapping appears repeatedly throughout the whole syntax specification, which causes the code generator to create two methods: a virtual rule tcs_enumliteralmapping and the actual parsing rule tcs_enumliteralmapping$1. When performing the first parse of a nonterminal tcs_enumliteralmapping at a given index, the state-modifying methods start(), and eventually commit() or abort(), are invoked following the transactional paradigm. If the same nonterminal is parsed again (because an alternative in a calling rule aborted), the parsed result is retrieved directly from the memoization table (yyColumn.chunk1.ftcs_enumliteralmapping), bypassing the transactional methods.

    private Result ptcs_enumliteralmapping(final int yyStart) throws IOException {
        TCSColumn yyColumn = (TCSColumn)column(yyStart);
        if (null == yyColumn.chunk1) yyColumn.chunk1 = new Chunk1();
        if (null == yyColumn.chunk1.ftcs_enumliteralmapping)
            yyColumn.chunk1.ftcs_enumliteralmapping
                = ptcs_enumliteralmapping$1(yyStart);
        return yyColumn.chunk1.ftcs_enumliteralmapping;
    }

    private Result ptcs_enumliteralmapping$1(final int yyStart) throws IOException {
        Result yyResult;
        Object yyValue;

        yyState.start();

        // PRE ACTION create model proxy etc.

        yyResult = ptcs_enumliteralval(yyStart);
        if (yyResult.hasValue()) {
            Object temp_1 = yyResult.semanticValue();
            set(proxy, "literal", temp_1);

            yyResult = pSpacing(yyResult.index);
            if (yyResult.hasValue()) {
                yyResult = pEQ(yyResult.index);
                if (yyResult.hasValue()) {
                    yyResult = pSpacing(yyResult.index);
                    if (yyResult.hasValue()) {
                        yyResult = ptcs_sequenceelement(yyResult.index);
                        if (yyResult.hasValue()) {
                            Object temp_2 = yyResult.semanticValue();
                            set(proxy, "element", temp_2);

                            yyResult = pSpacing(yyResult.index);
                            if (yyResult.hasValue()) {
                                yyValue = commitCreation(proxy, null, false);
                                yyState.commit();
                                return yyResult.createValue(yyValue);
                            }
                        }
                    }
                }
            }
        }
        yyState.abort();
        // (creation of the parse error result omitted in this simplified excerpt)
    }

Listing 4.3: Memoization example - simplified parser code

For that reason, tokenization in the presence of memoization is more complex, and special post-processing must be carried out. Fortunately, for the model injection code this behavior is acceptable due to the following observation:

Lemma 4.1 (Memoization). Model elements are created correctly even in the presence of memoization.

Without a formal proof, the lemma is motivated as follows: let prod be the stateful, memoized template production. If prod is called only once, there is no difference to a non-memoized parse. So the first call of prod can have succeeded (implying a proxy object in the memoization table) or failed (implying a null object in the table). By context-freeness and the observation that references are set outside any transactional operations, we can conclude that in either case retrieving the object from the memoization table will lead to the same result as parsing the syntax again.

4.2.2 Parser Optimizations

The Rats! parser generator includes a number of optimizations. Figure 4.2 lists the available options. Most of them are tuned to increase throughput; the most important are Chunks, Transient, Repeated and GNodes. Reduced heap utilization is a second major advantage, achieved mainly by the options Chunks and Transient. For the generation of domain parsers with high performance, optimizations play a significant role. Still, for a deterministic result of the transformation, some of the optimizations can cause problems. For the prototypical implementation discussed here, we chose to switch off all optimizations in order to guarantee that code fragments inserted into the grammar code appear at the right spot within the generated parser code. Especially the factorization of common prefixes can lead to syntax errors in the generated code and was therefore deactivated. See Sec. 4.2.3 for typical errors resulting from action code.

Name          Description
Chunks        Organize memoized fields into chunks.
Grammar       Fold duplicate productions and eliminate dead productions.
Terminals     Optimize recognition of terminals, incl. using switch statements.
Cost          Perform cost-based inlining.
Transient     Do not memoize transient productions.
Nontransient  Automatically recognize productions as transient.
Repeated      Do not desugar transient repetitions.
Left          Implement direct left-recursions as repetitions, not recursions.
Optional      Do not desugar options.
Choices1      Inline transient void and text-only productions into choices.
Choices2      Inline productions that are marked inline into choices.
Errors        Avoid creating parse errors for embedded expressions.
Select        Avoid accessor for tracking most specific parse error.
Values        Avoid creating duplicate semantic values.
Matches       Avoid accessor for string matches.
Prefixes      Fold common prefixes.
GNodes        Specialize generic nodes with a small number of children.

Figure 4.2: Rats! parser optimizations (from [Gri06])

4.2.3 Actions and Bindings

Rats! offers a convenient way to put action code into a grammar. The injected code can be arbitrary Java code snippets and is not subject to any syntactic restrictions. That is why the transformation engine needs to take care that the combined parser code is valid Java code. Especially two items are critical:

- Duplicate local variables: bindings to grammar elements, i.e. the semantic value of an executed parser method assigned to local variables, may result in duplicate declarations. This is due to the fact that when stating a binding like temp:seqElem1, Java code is generated that assigns the return value of the method pSeqElem1 to a newly declared variable temp. In a TCS sequence, numerous sequence elements are parsed one after another, leading to duplicate local variables when the identifier name is not unique. Therefore, the implementation always creates bindings subscripted with ascending integers (per sequence).
- Undeclared variables: rules created from TCS operator templates are usually responsible for parsing the operator symbol, which is, however, associated with the calling rule created from an abstract operatored class template (see 3.3.3 for details). Passing the operator symbol back to the class template rule can be done by accessing a global variable opSymbol. But there is no guarantee that this variable lies within the same scope as the code where the symbol is bound. Memoized chunks and factorized method parts may complicate the situation.


4.2.4 Parameterized Rules

Context-freeness of the languages discussed here usually implies that a produced Rats! rule is also context-free in the sense that it does not require information from outside the scope of the rule. This is not completely true in the case of OperatorTemplates. The general pattern for operator templates is that the model is built up from model elements that are created before or after parsing the syntax for the operator. By passing the result of the previous sequence elements (primaries, to be specific) to the operator rule, it is possible to build the model correctly, respecting both arity and associativity. In contrast to ANTLR, however, Rats! does not allow parameterized rules. So the implementation uses a stack holding the model element that would usually be passed in. As outlined in Sec. 3.3.3, the push and pop operations surround the call to the respective operator rule, allowing to gradually build the model for the compound expression.
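A minimal sketch of such a stack is given below. The class name matches the paramStack type shown in Fig. 4.3, but the interface shown here is an assumption for illustration, not the actual FURCAS implementation.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Sketch of a stack substituting for the rule parameters Rats! lacks.
    class OperatorTemplateParameterStack {
        private final Deque<Object> proxies = new ArrayDeque<Object>();

        // A_psh: executed after the left operand (and the operator literal) is parsed
        void push(Object leftOperandProxy) { proxies.push(leftOperandProxy); }

        // A_pop: executed inside the operator template rule to fetch the left operand
        Object pop() { return proxies.pop(); }
    }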

4.3 Lightweight Nested Transactions

Productions of a Rats! grammar can be tagged with the attribute stateful, which allows specifying functionality associated with the success or failure of a rule. The operations performed upon success or failure of a production can be nested in the sense that a transaction can start other transactions before finally ending (successfully or unsuccessfully). In combination with the code that is injected by the engine transforming each TCS element into a grammar item, transactional operations are crucial to the functioning of the outlined framework. Typically, starting a transaction may lead to the construction of model element proxies, while successful completion can imply their resolution to actual model elements. Another example of behavior associated with the completion of rules is the creation of tokens (discussed in detail in Sec. 4.5.3).

The semantics of stateful Rats! productions is as follows. Consider the following rule:

    stateful Object C = A / B / "text";

The nonterminal C can be expanded to A, to B, or to the string literal. The keyword stateful indicates that the specified implementation of the State interface (as depicted in Fig. 4.3) observes the parsing process. That is:

- start() is called before C is expanded.
- commit() is called after the first successful parse of a C alternative. Here the nesting of transactions comes into play: before a commit operation is called, one of the alternatives has to be completed successfully.
- abort() is called when all alternatives of C have failed.

Figure 4.3: MOF class diagram of xtc.parser.InjectorState and the surrounding parser classes (ParserBase; State with start(), abort(), commit(), reset(s : String); RatsObservablePatchedParser; IParsingObserver; RatsObservableInjectingParser with pmain(yyStart : Integer); IModelInjector; InjectorState with the fields tokenLookup, currentToken, tokens, tokenStack, isTemplateRule, isTokenizing, proxyStack and the operations sanityCheck(), newToken(), getCurrentProxy(), getCurrentMetaType(), createOrResolve(); and the final DSLParser with its paramStack : OperatorTemplateParameterStack)

An important fact is that the transactional methods are parameterless. So the State interface observing the parse does not know which rule starts, commits or aborts. In two cases the generated parsers are transactional: rules creating proxies (i.e. rules generated from tcs.ClassTemplate) need to be stateful for the correct creation and deletion of proxies and the resolution of references. The other type of rules that are transactional by default are lexical (= tokenizing) rules. When a lexical rule aborts because a token cannot be recognized completely, all tokens created by its subrules must be rolled back. This can only be achieved in the abort method. The implemented InjectorState (Fig. 4.3) therefore has additional flags (isTemplateRule, isTokenizing) indicating whether a rule produces model proxies and/or tokens. These values can only be set correctly outside the transactional methods; corresponding actions are inserted at the beginning of rules generated from templates or literals. Summing up, the state-modifying transactions provided by the Rats! parser-generator facility are essential for the implementation of a model-injecting domain parser. Still, some customization needed to be implemented to allow for handling different types of transactional rules.
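The following is a minimal sketch, assuming the parameterless callbacks described above, of how an InjectorState-like observer can keep a stack of token lists consistent across nested transactions; it is a simplification of the implemented class and anticipates Algorithm 4.1 in Sec. 4.5.3.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    // Simplified sketch of a transaction observer for tokenizing rules.
    // Rats! invokes start()/commit()/abort() without parameters, so the
    // observer tracks nesting itself via a stack of token lists.
    final class InjectorStateSketch {
        static final class Token { /* (tt, s, e, val) as in Def. 4.4 */ }

        private final Deque<List<Token>> tokenStack = new ArrayDeque<>();
        private List<Token> tokens = new ArrayList<>();

        void start() {                   // a stateful rule begins
            tokenStack.push(tokens);
            tokens = new ArrayList<>();
        }

        void commit() {                  // the rule succeeded: keep its tokens
            List<Token> outer = tokenStack.pop();
            outer.addAll(tokens);
            tokens = outer;
        }

        void abort() {                   // the rule failed: roll its tokens back
            tokens = tokenStack.pop();
        }

        void newToken(Token t) {         // invoked from lexical rules only
            tokens.add(t);
        }
    }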

4.4 Ambiguities


Every grammar containing ambiguous constructs suffers from the problem that no parser generated from it will be able to decide which is the right derivation for a given input. A prominent characteristic of all parsers relying on parsing expression grammars (PEGs) is that the alternatives are ordered (see 4.4.2 for implementational consequences) and that they hence avoid ambiguities. This sounds intriguing, but the result is that the author of a PEG has to consider the ordering of alternatives carefully to define the correct parse. Ambiguity can be illustrated with the well-known dangling-else problem.


Without any sentinel identifying the beginning and the end of a statement (usually curly braces), nested if-then-else statements of the following form are ambiguous:

    if Exp1 then if Exp2 then Stmt2 else Stmt2

It is unclear whether the else-clause belongs to the first or to the second if-statement, resulting in two different parse trees. Grammars for languages like C or Java arbitrarily decree that the else clause always belongs to the innermost if-statement. This behavior is captured in the grammar by distinguishing open and closed statements; this precedence resolves the ambiguity. In Rats!, the dangling-else problem cannot be solved more precisely. However, the ordering of alternatives makes it easier to define precedence rules within a statement. With the same precedence of innermost else clauses, a statement of the form

    if Exp1 then if Exp2 then if Exp3 then Stmt3 else Stmt3

cannot be nested as follows without explicit braces:

    if Exp1 then if Exp2 then if Exp3 then Stmt3 else Stmt2

As can be seen from the above examples, the author of a Rats! grammar needs to take care of precedence in order to obtain the desired parse. Since in our case the grammar is generated from a mapping definition, the transformation is in charge of establishing the correct order of alternatives (see 4.4.3) or providing other means to avoid wrong derivations (see 4.4.1).

4.4.1 Greedy Parse - Shift/Reduce Conflicts

Processing the TCS.tcs revealed a fundamental problem that arises in situations related to abstract operatored templates and their transformation to a grammar. Usually, operatored templates are employed when a concrete model element type needs to be created as a compound expression that can be of any subtype of a generic expression type, allowing for operatored expressions - which would otherwise lead to a non-LL left recursion in the produced rules. Consider Fig. 4.4 as an illustration. The generic binary expression type refers to a left- and a right-hand side of the abstract type Expression. However, BinaryExp is not abstract itself. Assume we have an operator list with only one priority level containing only one binary left-associative operator op. A mapping with an abstract operatored class template for Expression and an OperatorTemplate for BinaryExp will result (according to the transformations stated in 3.3.2.3 and 3.3.3) in a grammar with a priority-0 rule

    priority_0 = primary_expression ( op binary_exp primary_expression )*

and associated actions pushing the result of the primary rule onto the stack (for processing by binary_exp, setting it as left-hand side of BinaryExp) and an after action setting the result of the last primary expression as its right-hand side (omitted here for better readability). Parsing an input such as 1 op 2 op 3 will yield op(op(1, 2), 3) as desired.

Figure 4.4: Sample metamodel illustrating operatored expressions (the abstract Expression with +opLeft/+opRight references, IntegerLit with value : Integer, and BinaryExp with opName : String)

As pointed out in the beginning, the syntax mapping for the TCS language itself contains more sophisticated constructs that lead to rules in accordance with the specified transformation, but for which valid inputs are not parsable using the simple operatored constructs.

    template Model::Namespace
        abstract operatored(DBLCOLON);

    template Model::Classifier referenceOnly
        : (isDefined(container) ? container :: name : name)
        ;

    template Model::GeneralizableElement referenceOnly
        : name
        ;

    operatorTemplate Model::ModelElement(operators=opDlColon, source = container) referenceOnly
        : name
        ;

Listing 4.4: TCS.tcs: concrete syntax for classifiers and namespaces

For references to model elements, TCS contains a referenceOnly template for Classifiers from the M3 model. These can (and most often will) be fully qualified. The abstract syntax is partially depicted in Fig. 4.5. All model elements are abstract. A Classifier has an optional container of type Namespace and a name attribute. In concrete syntax, this containment relationship is expressed by a series of package or class names with double colons separating the name spaces from each other. This gives rise to the syntax listed in Listing 4.4. The conditional element in Classifier constitutes the critical element here. In the direction from abstract to concrete syntax its meaning is perfectly clear: display the container (if present), then a double colon, then the name.



Figure 4.5: MOF extract: namespaces and classifiers (a ModelElement has a name : String and an optional +container of type Namespace; a Namespace contains 0..n ordered ModelElements via +containedElement; GeneralizableElement and Classifier specialize ModelElement)

However, the other direction is less intuitive. Parsing concrete syntax means answering the question how the textual parts should be represented as model elements. Therefore, conditional elements are represented as an alternative having the two choices stated in the then- and in the else-clause. A sample input PrimitiveTypes::String should be parsed to a tree as depicted in Fig. 4.6. However, after parsing the first identifier the parser tries to repeat the element on the right. The input can be matched (for one repetition), so rule model_namespace succeeds. But then no input is left for the following double colon and identifier in model_classifier, and the rule throws a parse error. The discussed issue is an instance of a typical shift/reduce conflict known from LR parsers. However, it cannot be easily detected in a scannerless environment.

Disambiguation with syntactic predicates

Comparison with an ANTLR-generated parser showed that the ANTLR version parses the input correctly because of automatically added syntactic predicates within the repeated element:

    dblcolon_priority_0_primary = model_namespace ( "::" model_modelelement &"::" )*
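To see what the predicate buys, here is a small self-contained Java rendering (hypothetical and simplified to plain strings) of the repeated element with the &"::" lookahead: the loop only consumes a segment if another double colon follows, leaving the final "::" name for the classifier rule.

    // Sketch: the and-predicate &"::" as a non-consuming lookahead check.
    final class PredicateSketch {
        static int skipIdentifier(String in, int p) {
            while (p < in.length() && Character.isLetterOrDigit(in.charAt(p))) p++;
            return p;
        }

        // parses ident ( "::" ident &"::" )* and returns the position reached,
        // leaving the last "::" ident for the calling classifier rule
        static int parseNamespacePrefix(String in, int p) {
            p = skipIdentifier(in, p);
            while (in.startsWith("::", p)
                    && in.startsWith("::", skipIdentifier(in, p + 2))) { // &"::"
                p = skipIdentifier(in, p + 2);
            }
            return p;
        }

        public static void main(String[] args) {
            String s = "PrimitiveTypes::String";
            System.out.println(s.substring(0, parseNamespacePrefix(s, 0)));
            // prints "PrimitiveTypes": the namespace rule now stops in time
        }
    }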


Finding the correct spot for placing a syntactic predicate is not a trivial task. In the above example it is imperative to put the syntactic predicate into the repeated element. Otherwise, the rule always succeeds with too many repetitions and the calling rule fails. This is why the prototypical implementation does not contain any automated injection of predicates but relies on patches to the resulting grammar instead.

4.4.2 Ordering of Choices

In Rats! grammars, the alternatives on the right-hand side of a production are prioritized, i.e. they represent an ordered choice.


Figure 4.6: Parse tree for input PrimitiveTypes::String

This is intended to avoid ambiguities of the grammar. While ordered choice does avoid situations in which the parser is uncertain which nonterminal to expand, it does not automatically guarantee the correct (= intended) parse. This is due to the fact that the grammar can contain productions with shadowing alternatives.

Definition 4.1 (Shadowing alternatives). Let Ai and Aj be two alternatives of an ordered choice. Then Ai shadows Aj (Ai ≻ Aj) iff Aj succeeds on some input ω ∈ Σ+ and Ai succeeds on a prefix ω′ of ω.

If Aj is the correct alternative for some input ω but Ai (shadowing Aj) has a higher priority, then Ai will always be selected, resulting in a parse error in the calling rule if ω′ is a proper prefix of ω (since there will be some input left), or in an incorrect parse if ω′ = ω. Note that two specific situations can be identified as resulting in shadowing alternatives:

- alternatives for identifiers usually shadow keyword alternatives;
- the alternative with an empty sequence (ε-alternative) shadows all possible alternatives. This situation can be detected by the parser generator, however, and will lead to an error while processing the grammar.


This leads to the conclusion that shadowing alternatives should be avoided in order to guarantee the correct parse. Applying the Rats!-generated parser to some valid TCS inputs revealed that such situations actually need special handling. For that, consider the elements ConditionalElement, Expression, PropertyReference and related ones from the TCS metamodel (found in Fig. 2.7 on page 16). The common concrete syntax is specified by the following mapping:

    template TCS::ConditionalElement
        : ( condition ? thenSequence (isDefined(elseSequence) ? : elseSequence) )
        ;

    template TCS::Expression abstract;

    template TCS::BooleanPropertyExp
        : propertyReference
        ;

    template TCS::IsDefinedExp
        : isDefined ( propertyReference )
        ;

    template TCS::PropertyReference
        : (isDefined(strucfeature)
            ? strucfeature{refersTo=name, query=OCL:let ..., as = identifierOrKeyword}
            : > name{as = identifierOrKeyword})
        ;

Listing 4.5: TCS.tcs: conditionals and expressions

The rule generated from the abstract template for Expression contains alternatives for all subclasses, especially for the listed IsDefinedExp and BooleanPropertyExp. The AsPArg in PropertyReference's template leads to a call to the lexical rule Identifier for all property references (model-creating actions omitted). Thus, since the string literal isDefined can be parsed as an identifier, we have

    tcs_booleanpropertyexp ≻ tcs_isdefinedexp

leading to a parse error on each occurrence of an isDefined expression.

4.4.3 A Heuristic for Shadowed Alternatives

The typical identifier-vs.-keyword conflict in alternatives produced from a TCS mapping (as illustrated in 4.4.2) can be fixed by revising the ordering of alternatives in a rule. The heuristic takes the following into account:

- Two alternatives starting with different string literals cannot shadow each other.
- A lexical rule alternative can shadow an alternative starting with any nonempty fixed string literal, but not vice versa.

- Two alternatives starting with a lexical rule may be ordered such that longer alternatives precede shorter ones, i.e. in a setting with the rules¹

      A = B / C
      B = D
      C = D "(" E ")"
      D = [a-zA-Z][a-zA-Z0-9]*

  rule B shadows C, and consequently the ordering of A's alternatives must be: C preceding B. Another way to solve the issue of shadowing alternatives is via syntactic predicates, similar to the ones inserted as described in Paragraph 4.4.1. Especially in situations where nonterminal A is expanded to a B in almost all cases, the ordering C preceding B would be a performance drawback. But since Rats! factors out the common prefix D of B and C and memoizes intermediate results, this argument can be disregarded.
- The ε-alternative (= empty right-hand side) must be the last of all choices.
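A sketch of how such a reordering could be realized as a comparator over classified alternatives; the Kind classification and the minLength estimate are illustrative assumptions, not the FURCAS implementation.

    import java.util.Comparator;
    import java.util.List;

    // Illustrative reordering heuristic: the epsilon alternative sorts last,
    // alternatives starting with a fixed literal precede those starting with
    // a lexical rule, and longer lexical alternatives precede shorter ones.
    final class AlternativeOrdering {
        enum Kind { LITERAL, LEXICAL_RULE, EPSILON }

        record Alternative(Kind firstElement, int minLength) { }

        static final Comparator<Alternative> HEURISTIC = Comparator
            .comparing((Alternative a) -> a.firstElement() == Kind.EPSILON)
            .thenComparing((Alternative a) -> a.firstElement() == Kind.LEXICAL_RULE)
            .thenComparing(Comparator.comparingInt(Alternative::minLength).reversed());

        static void reorder(List<Alternative> alternatives) {
            alternatives.sort(HEURISTIC);  // C (longer) now precedes B (shorter)
        }
    }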

4.5 Tokenization and Scannerless Parsing

As pointed out several times before, Rats! does entirely without tokens. No lexical analysis is performed prior to syntactic analysis, and parsing is performed on a character-by-character basis. While this has been the central argument for choosing a scannerless technique in the first place, some significant drawbacks result from the absence of tokens, including:

- difficult error reporting, based on expected characters only;
- limited chances for applying well-studied error recovery techniques such as panic mode recovery, since they are based on designated synchronizing tokens (see for example [ASU86]).

That is why the implementation outlined here supports token creation as part of the parsing process. It integrates lexing with parsing by emitting tokens during syntactic analysis - just as a DFA would do in a prior lexical analysis step. However, in contrast to a classical scanner², the tokenizing scannerless parser can decide which token type to assign based on contextual information. Traditional lexers are usually defined as a set of rules associating regular expressions with token types. These can be used to create a combined DFA with the designated characters transitioning, finally, to an accepting or error state. Token types are very specific to grammar definitions. But common to most lexers is the notion of characters which separate one token from another. Sec. 4.5.1 details how these can be implemented. The process of creating tokens on the fly, both conceptually and technically, is orthogonal to the question what a token is. The solution presented here makes use of Rats! micro transactions to ensure the correctness of tokenization even in the presence of backtracking. Details are discussed in Sec. 4.5.3.
¹ This could be rewritten as A = B ( "(" E ")" )?. However, as the grammar is auto-generated, such beautifications are (in general) not possible.
² The terms scanner, tokenizer and lexical analyzer are used as synonyms throughout this work.


4.5.1 White Space Definition

In traditional compiler construction the definition of white space characters plays a secondary role. In most cases the default definition of what is considered a blank character will suffice to separate tokens. The scannerless paradigm demands special treatment of blank characters, however, as they may affect the parse. Parser generator frameworks provide means to specify lexical rules separately from syntactic rules. In ANTLR, for example, lexer rules may be composed from regular expressions; these rules can be identified by a nonterminal starting with a capital letter. In the Rats! scannerless environment we need to distinguish two types of blank characters: blanks and required blanks.

Definition 4.2 (Blank Character). A character (or sequence of characters) is considered blank if adding it between two tokens of the input does not change the semantics.

Indentation and comments are typical blank sequences, since adding them to a piece of syntax does not change the semantics of the code. In contrast, not all blank character symbols are dispensable. Some are needed in order to separate two tokens of variable length, e.g. given by a regular expression. This gives rise to the second definition.

Definition 4.3 (Required Blank). A blank character is required if removing it changes the type of the created tokens.

Example 4.1. Consider the code snippet keyword someIdentifier. If the blank is omitted, the lexer cannot recognize the keyword and will lex the construct as an identifier. Thus, this occurrence of a blank character is required.

The above distinction between blanks and required blanks is necessary due to some specifics of the Rats! parser generator. While parser generators with a separate lexing phase will produce only one token from a keyword and a following identifier without a separating blank, Rats! parses the syntax character-wise and reaches an accepting state when a literal is recognized. This leads to the conclusion that the phrase discussed in Example 4.1 needs an additional separator recognizing at least one blank character. Fig. 4.7 shows the automaton representing the lexer with forced recognition of a white space between keyword and identifier.

Figure 4.7: DFA recognizing a keyword followed by an identifier (states 1-4, with transitions ws, key, ws, id forcing a white space between keyword and identifier)

In the generated Rats! grammar, blanks and required blanks are recognized by two different default rules:

- Spacing for blanks, depending on the rules WS and COMMENT, which represent a single white space character and a comment, respectively;
- FSpacing for required blanks, analogously; its multiplicity is + instead of *.

The rules WS and COMMENT are subject to change by the author of the TCS mapping. For example, the syntax for comments can be specified in the respective section of the TCS document:

    token COMMENT : endOfLine(start = "--");

If no user-defined mappings for comments and white spaces are found, the framework creates default rules.
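As a self-contained illustration of Definition 4.3 (not part of the framework), the following longest-match lexer shows why the blank in Example 4.1 is required: without it, keyword and identifier fuse into a single lexeme.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Demo: a classical longest-match lexer over identifier-like lexemes.
    final class RequiredBlankDemo {
        static List<String> lex(String input) {
            List<String> lexemes = new ArrayList<>();
            Matcher m = Pattern.compile("[A-Za-z][A-Za-z0-9]*").matcher(input);
            while (m.find()) lexemes.add(m.group());
            return lexemes;
        }

        public static void main(String[] args) {
            System.out.println(lex("keyword someIdentifier")); // [keyword, someIdentifier]
            System.out.println(lex("keywordsomeIdentifier"));  // [keywordsomeIdentifier]
        }
    }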

4.5.2 Assignment of Token Types

For several reasons (including more convenient error reporting and recovery), tokens are helpful in the process of parsing a textual artifact. That is why the outlined implementation employs a strategy to create tokens during syntactic analysis. Since the parsing process is backtracking, all created tokens are subject to deletion upon abort of a rule. This can be handled by having the transactional actions start(), commit() and abort() execute any token creation or deletion.

Definition 4.4 (Token). A token is a 4-tuple T = (tt, s, e, val) where tt is the token type, s and e are start and end indices, and val is an (optional) token value.

If not otherwise stated, tokens refer to the definition of what a token is. In contrast, occurrences of tokens in a textual artifact are referred to as lexemes. The central task of tokenization is thus assigning tokens to lexemes and setting the token attributes, which is usually done in the scanner phase. The main benefit of deferring this assignment to the syntactic analysis phase - integrating the lexer with the parser - is as follows: when creating token instances, the parser can use information from the parse, i.e. the expected type of construct.
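Definition 4.4 translates directly into a small value type; the following Java record is an illustrative rendering of it (the actual implementation may differ):

    // A token as a 4-tuple (tt, s, e, val) according to Definition 4.4.
    // The value is optional; keyword tokens, for instance, carry none.
    record Token(int type, int start, int end, String value) {

        Token(int type, int start, int end) {  // convenience for value-less tokens
            this(type, start, end, null);
        }

        int length() { return end - start; }
    }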

4.5.3 Tokenizing via Lightweight Transactions

Technically, the implementation of an integrated lexer/parser architecture must take into account aborts of rules while token instances are created, because all Rats!-generated parsers are backtracking and employ memoization for better performance. The micro transactions discussed in Sec. 4.3 can observe the parsing process and emit or retract the appropriate tokens.

Algorithm 4.1 (Tokenization via micro transactions).

    curTok := (0, int_max); stack := [ ]; tokens := [ ];
    begin
      while !EOF do
        if stateful then start();
        success := call appropriateParseRule(yyStart);
        if lexical then call newToken(yyStart, ttype);
        if stateful then
          if success then call commit();
          else call abort();
      od
    where
      proc start()
        stack.push(tokens); tokens := [ ];
      .
      proc commit()
        added := stack.peek(); added.addAll(tokens);
        stack.pop(); tokens := added;
      .
      proc abort()
        tokens := stack.pop();
      .
      proc newToken(start, type)
        curTok := new Token(start, yyCount, type);
        tokens := tokens · curTok;
      .

This is realized via the algorithm listed in Algorithm 4.1. The central observations are:

- Only lexical rules emit tokens.
- For each nested transaction a list of tokens is established, holding all tokens created after the start of the transaction. A stack contains elements for each transaction level.
- After the last commit operation, tokens contains all tokens created so far, i.e. all tokens when the end of file is reached with a valid parse.

Memoization

A critical point when intercepting parsing methods is memoization: for better performance, Rats! memoizes intermediate results. Instead of calling a method again whose result was memoized, Rats! simply accesses the memoization table and returns the semantic value associated with the parse. In a way, this violates the general assumption that the execution of all stateful rules can be intercepted by the transactional methods. For the process of tokenization, this inconvenient fact requires some additional storage of tokens created in a transaction that will be aborted. A java.util.HashMap mapping indices to tokens is used to guarantee that no gaps exist in the output list of tokens. After finishing the parse of the entire artifact, a sanity check is carried out on tokens according to Algorithm 4.2.


With the presented algorithms, tokens can be created effectively on the fly. Generally, this applies to parsing of a complete syntax only. However, the algorithms can be modified towards incrementality of the generated parsers, too. Such changes include adjustments to the token indices under consideration as well as a method to locally update the set of newly created tokens without invalidating all tokens created so far.

Algorithm 4.2 (Sanity check for emitted tokens).

    ret_tokens := [ ]; idx := -1; hasAdded := false;
    do forall tok ∈ tokens
      if !hasAdded ∧ idx ≥ tok.lower then
        tok := tokens.next(); hasAdded := false;
      if tok.lower = idx + 1 then
        idx := tok.upper;
        ret_tokens := ret_tokens · tok;
        continue;
      else
        hasAdded := true;
        tok′ := lookup(idx + 1);
        idx := tok′.upper;
        ret_tokens := ret_tokens · tok′;
        continue;
    od
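A simplified Java rendering of the same gap-filling idea, mirroring the Token record sketched in Sec. 4.5.2; it assumes tokenLookup holds an entry for every gap, which the memoization bookkeeping described above is meant to guarantee.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Sketch of the sanity check: walk the emitted tokens in order and fill
    // index gaps (left by memoized rules that were not re-executed) from the
    // lookup map populated during parsing.
    final class TokenSanityCheck {
        record Token(int type, int start, int end, String value) { }

        static List<Token> sanityCheck(List<Token> tokens, Map<Integer, Token> tokenLookup) {
            List<Token> result = new ArrayList<>();
            int idx = -1;                        // last character index covered
            for (Token tok : tokens) {
                while (tok.start() > idx + 1) {  // gap before this token
                    Token missing = tokenLookup.get(idx + 1);  // assumed present
                    result.add(missing);
                    idx = missing.end();
                }
                result.add(tok);
                idx = tok.end();
            }
            return result;
        }
    }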

4.6 Challenges

As a prototypical implementation, the proposed grammar generation neither claims to be stable nor complete. Open issues remaining include an automatic injection of syntactic predicates instead of the grammar patches detailed in 4.4.1. As part of an editing framework, the generated model-injecting parsers need to be integrated into the editor environment. For integration into a textual editing framework, parsers generated for domain-specific languages need to support three critical features: incrementality, error handling and error recovery. These are discussed in the following.

4.6.1 Incrementality

As a highly user-interactive process, textual editing must keep the latency resulting from parsing, model resolution etc. as low as possible. Completely parsing a large document with a fair amount of model updates or creations still needs seconds to finish. This is unacceptable for an interactive environment.


For that reason, only incremental changes to the textual artifact must be re-lexed and re-parsed. The top-down recursive descent parsing technique Rats! relies on is considered to support incremental re-parsing of text regions, similar to parsers generated by the ANTLR framework. Lexing - or tokenization, because it is performed here during syntactic analysis - is regarded as a more complex task. The incremental lexing algorithm currently implemented (similar to [CW01]) is not applicable to the scannerless parsing techniques that Rats! parsers apply. The strict separation of the two components requires a full redesign of token handling. When tokens are emitted on the fly, a set-based or list-based storage mechanism for tokens may be more appropriate than the typically used stream-based data structure. The TextBlock-based approach presented in [GBU09] must be adapted to the new tokenizer. Some parts of the incremental lexical analyzer are no longer necessary with the scannerless approach. This includes the calculation of lookback-indexes from lookahead-values originating from the deterministic finite automaton (lexer). The granularity of re-parses must be considered. There is a one-to-one relationship between rules in Rats! grammars and generated parser methods. For incremental parsing, these methods are invoked with the designated partial input of the textual artifact. The textblock decorator approach was designed to minimize the work needed when some textblocks are changed. The question what the minimum impact is must be answered for the scannerless parsing technique.

4.6.2 Error Handling

Syntactic errors in the textual artifact should be reported to the user with a proper indication of the region that could not be parsed. Tokens are the syntactic units preferably proposed as expected constructs. Suggestions as to which constructs are expected are favored over listing constructs that failed to parse. Semantic errors communicated to the user must contain meaningful information about which elements could not be resolved.

4.6.3 Error Recovery

Especially for editing environments, error recovery is an important feature. During the editing of a textual artifact, the code will contain errors most of the time. Powerful development environments therefore provide recovery for the most frequent syntactic errors: duplicate tokens and missing tokens. Panic-mode error recovery (see [ASU86] for details) is a feature most likely requested by a textual modeling framework. How this can be consolidated with the scannerless approach should be investigated.

4.6.4 From Embedding to Composition

As pointed out in the introduction (Sec. 1.3), some questions must be answered when languages are composed rather than embedded. The existence of two separate language toolkits that are combined by an editor is not sufficient. A clear and flexible design for the composition of scopes and symbol tables must be devised in order to allow cross-referencing and expressive language composites. We envision an interface-based mechanism that creates a new toolkit on top of the existing languages to be composed, in order to maintain the fundamental demand for side effect-free reusability.


5. Summary and Conclusions


We investigated how scannerless parsing can solve problems involved in the composition of languages. While integrating lexical and syntactic analysis entails the fundamental advantage that lexical conflicts can be solved with the help of a construct's context, challenges for error handling result from the absence of tokens. Applying a backtracking parsing strategy instead of a predicting one has shown to be manageable in this context.

Feasibility

By designing and implementing the transformation from a concrete-to-abstract syntax mapping and the specified language metamodel, we showed that model-injecting parsers can be automatically derived and lead to valid domain parsers, notwithstanding details involving scopes and repetitions. The feasibility of error reporting and recovery techniques is most likely a more complex matter. This is due to the fact that when tokens are created during the syntactic analysis phase, they cannot be considered for erroneous code fragments. The absence of tokens is the most critical issue when handling errors. A two-pass strategy is not expected to succeed either: creating tokens beforehand without using them for syntactic analysis is probably too expensive a task and impedes performance unnecessarily. Backtracking of the generated parsers is, in principle, not considered a critical issue for composition. We showed, by using lightweight nested transactions in the implementation (Sec. 4.3), that the parsing process can be observed sufficiently. We therefore recommend continuing the work on scannerless parsing techniques in the context of concrete textual syntax for models.


Bibliography
[ASU86] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1986.

[AU72] Alfred V. Aho and Jeffrey D. Ullman. The Theory of Parsing, Translation, and Compiling, volume I: Parsing of Series in Automatic Computation. Prentice Hall, Englewood Cliffs, New Jersey, 1972.

[BCE+] Heiko Behrens, Michael Clay, Sven Efftinge, Moritz Eysholdt, Peter Friese, Jan Köhnlein, Knut Wannheden, and Sebastian Zarnekow. Xtext User Guide, version 0.7.

[Bra08] Martin Bravenboer. Exercises in Free Syntax: Syntax Definition, Parsing, and Assimilation of Language Conglomerates. PhD thesis, Utrecht University, Utrecht, The Netherlands, January 2008.

[BV04] Martin Bravenboer and Eelco Visser. Concrete syntax for objects: domain-specific language embedding and assimilation without restrictions. In OOPSLA '04: Proceedings of the 19th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 365-383, New York, NY, USA, 2004. ACM.

[CW01] Phil Cook and Jim Welsh. Incremental parsing in language-based editors: user needs and how to meet them. Software: Practice and Experience, 31(15):1461-1486, 2001.

[For02] Bryan Ford. Packrat parsing: simple, powerful, lazy, linear time, functional pearl. In ICFP '02: Proceedings of the Seventh ACM SIGPLAN International Conference on Functional Programming, pages 36-47, New York, NY, USA, 2002. ACM.

[For04] Bryan Ford. Parsing expression grammars: a recognition-based syntactic foundation. In POPL '04: Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 111-122, New York, NY, USA, 2004. ACM.

[GBU08] Thomas Goldschmidt, Steffen Becker, and Axel Uhl. Classification of concrete textual syntax mapping approaches. In ECMDA-FA '08: Proceedings of the 4th European Conference on Model Driven Architecture, pages 169-184, Berlin, Heidelberg, 2008. Springer-Verlag.

[GBU09] Thomas Goldschmidt, Steffen Becker, and Axel Uhl. Textual views in model driven engineering. In Proceedings of the 35th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA). IEEE, 2009.

[GKR+07] Hans Grönniger, Holger Krahn, Bernhard Rumpe, Martin Schindler, and Steven Völkel. Textual modeling. In Proceedings of the 4th International Workshop on Language Engineering (ATEM 2007), 2007.

[Gol09] Thomas Goldschmidt. Towards an incremental update approach for concrete textual syntaxes for UUID-based model repositories. Pages 168-177, 2009.

[Gri06] Robert Grimm. Better extensibility through modular syntax. In PLDI '06: Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 38-51, New York, NY, USA, 2006. ACM.

[HHJ+08] Jakob Henriksson, Florian Heidenreich, Jendrik Johannes, Steffen Zschaler, and Uwe Aßmann. Extending grammars and metamodels for reuse: the Reuseware approach. IET Software, 2(3):165-184, 2008.

[JBK06] Frédéric Jouault, Jean Bézivin, and Ivan Kurtev. TCS: a DSL for the specification of textual concrete syntaxes in model engineering. In GPCE '06: Proceedings of the 5th International Conference on Generative Programming and Component Engineering, pages 249-254, New York, NY, USA, 2006. ACM.

[KKV08] Lennart C. L. Kats, Karl Trygve Kalleberg, and Eelco Visser. Generating editors for embedded languages: integrating SGLR into IMP. In A. Johnstone and J. Vinju, editors, Proceedings of the Eighth Workshop on Language Descriptions, Tools, and Applications (LDTA 2008), Budapest, Hungary, April 2008.

[KRV07] Holger Krahn, Bernhard Rumpe, and Steven Völkel. Efficient editor generation for compositional DSLs in Eclipse. In J. Sprinkle et al., editors, Proceedings of the 7th OOPSLA Workshop on Domain-Specific Modeling (DSM '07), Montréal, Canada, October 2007.

[Küs08] Martin Küster. EMF implementation of a grammar based transformation framework for source code analysis. Studienarbeit, Universität Karlsruhe, 2008.

[Li95] Warren X. Li. A simple and efficient incremental LL(1) parsing. In SOFSEM '95: Proceedings of the 22nd Seminar on Current Trends in Theory and Practice of Informatics, pages 399-404, London, UK, 1995. Springer-Verlag.

[RS83] V. J. Rayward-Smith. A First Course in Formal Language Theory. Blackwell Scientific Publications, Ltd., Oxford, UK, 1983.

[SC89] Daniel J. Salomon and Gordon V. Cormack. Corrections to the paper: Scannerless NSLR(1) parsing of programming languages. SIGPLAN Notices, 24(11):80-83, 1989.

[Shi93] John J. Shilling. Incremental LL(1) parsing in language-based editors. IEEE Transactions on Software Engineering, 19(9):935-940, 1993.

[Tom87] Masaru Tomita. An efficient augmented-context-free parsing algorithm. Computational Linguistics, 12(1-2):31-46, 1987.

[vdBSVV02] M. G. J. van den Brand, J. Scheerder, J. J. Vinju, and E. Visser. Disambiguation filters for scannerless generalized LR parsers. In Compiler Construction (CC '02), pages 143-158. Springer-Verlag, 2002.

[vdBvDH+01] Mark van den Brand, Arie van Deursen, Jan Heering, H. A. de Jong, Merijn de Jonge, Tobias Kuipers, Paul Klint, Leon Moonen, Pieter A. Olivier, Jeroen Scheerder, Jurgen J. Vinju, Eelco Visser, and Joost Visser. The ASF+SDF Meta-Environment: a component-based language development environment. In CC, pages 365-370, 2001.

[Vis97] Eelco Visser. Syntax Definition for Language Prototyping. PhD thesis, Faculteit Wiskunde, Informatica, Natuurkunde en Sterrenkunde, Universiteit van Amsterdam, 1997.

[Wag98] Tim A. Wagner. Practical Algorithms for Incremental Software Development Environments. Technical report, Berkeley, CA, USA, 1998.