
Abstract

Programming is taught at many institutions around the world, both to Computer Science
majors and to non-Computer Science students. By the end of their studies, many students
lack appreciation of the subject, largely because the introductory courses failed to cultivate
enthusiasm by starting with formal languages such as C, which the less passionate students
struggle to comprehend. In the development world there are, broadly speaking, good and bad
programmers; the distinction arises from differing tastes and styles. It is fairly difficult to
comprehend someone else's code when it is not well commented, and debugging can be a
painful task when code is not self-documenting. These reasons motivated the task of coming
up with a natural language interface for programming. The long-term goal is to simplify
software development processes and to come up with models that speed up development, so
as to minimise losses due to projects taking longer than necessary.


















Acknowledgements
Firstly I would like to thank my Supervisor, Mr K. Muzheri for his guidance, technical
support and constructive criticism throughout the project and my Co-Supervisor, Mr S.
Ngwenya for his tireless supervision and expert contribution throughout the project. Special
thanks to my family, who have provided for me throughout my entire programme both
financially and with moral support. I would like to acknowledge everyone who contributed to
this project but is not mentioned above; do not be disheartened, your contributions were
greatly appreciated. I thank you all for giving me ideas where I lacked wisdom and for your
unconditional support throughout the project and beyond.












Table of Contents
Abstract .................................................................................................................................................... i
Acknowledgements ................................................................................................................................. ii
List of Figures .......................................................................................................................................... v
List of Tables .......................................................................................................................................... vi
Chapter 1:Introduction ........................................................................................................................... 1
1.1 Introduction .................................................................................................................................. 1
1.2 Background ................................................................................................................................... 1
1.3 Aim ................................................................................................................................................ 3
1.4 Objectives...................................................................................................................................... 3
1.5 Justification ................................................................................................................................... 3
1.6 Scope ............................................................................................................................................. 4
1.7 Expected results ............................................................................................................................ 4
1.8 Project overview ........................................................................................................................... 4
1.9 Project Plan ................................................................................................................................... 5
1.10 Conclusion ................................................................................................................................... 6
Chapter 2:Literature Review ................................................................................................................... 7
2.1 Introduction .................................................................................................................................. 7
2.2 Natural Language Processing ........................................................................................................ 7
2.3 Natural Language Programming ................................................................................................. 20
2.4 Compiler Design .......................................................................................................................... 21
2.5 Conclusion ................................................................................................................................... 23
Chapter 3:Methodology ........................................................................................................................ 24
3.1 Introduction ................................................................................................................................ 24
3.2 Research Methodology ............................................................................................................... 24
3.3 Rapid Application Development ................................................................................................. 25
3.4 Agile ............................................................................................................................................ 28
3.5 Tools ............................................................................................................................................ 31
3.6 Preferred Methodology and Justification ................................................................................... 32
3.7 Conclusion ................................................................................................................................... 33
Chapter 4:System Analysis and Design ................................................................................................. 34
4.1 Introduction ................................................................................................................................ 34

4.2 Requirements Elicitation ............................................................................................................. 34
4.3 Feasibility study........................................................................................................................... 36
4.4 Requirements specification ........................................................................................................ 37
4.5 System Requirements ................................................................................................. 37
4.6 System Design ............................................................................................................. 39
4.7 Conclusion ................................................................................................................... 44
Chapter 5:Implementation and Testing ................................................................................................ 45
5.1 Introduction ................................................................................................................................ 45
5.2 Tools ............................................................................................................................................ 45
5.3 Deployment Architecture ........................................................................................................... 46
5.4 System Functionality ................................................................................................................... 46
5.5 Software Testing ......................................................................................................................... 48
5.6 Conclusion ................................................................................................................................... 49
Chapter 6:Recommendations and Conclusions .................................................................................... 50
6.1 Introduction ................................................................................................................................. 50
6.2 Classification of the System ........................................................................................................ 50
6.3 Review of the Project's Aim and Objectives .............................................................. 50
6.4 Challenges Encountered .............................................................................................................. 50
6.5 Recommendations for Future Work ............................................................................................ 51
6.6 Conclusion .................................................................................................................................. 51
References ........................................................................................................................................ 52
APPENDICES .......................................................................................................................................... 54









List of Figures
Figure 2.1: Parse Tree ........................................................................................................ 16
Figure 3.1: A generic agile development process features an initial planning stage, rapid repeats
of the iteration stage, and some form of consolidation before release .............................. 28
Figure 4.1: Suggested Interface ......................................................................................... 36
Figure 4.2: Architecture Diagram for the Natural Language C Cross Compiler ............... 40
Figure 4.3: Class Diagram for the Natural Language C Cross Compiler front-end .......... 42
Figure 4.4: Sequence Diagram for the Natural Language C Cross Compiler ................... 43
Figure 5.1: The interface ................................................................................................... 47















List of Tables

Table 1.1: Project Plan ........................................................................................................ 5

























Chapter 1
Introduction
1.1 Introduction
Natural Language Programming (NLP) is an ontology-assisted way of programming in terms
of natural language sentences, for example in English. The goal of Natural Language
Programming is to make computers easier to use and to enable people who are not
professional computer scientists to teach new behaviour to their computers.

Natural Language Programming builds up a single program or a library of routines that are
programmed through natural language sentences, using an ontology that defines the available
data structures in a high-level programming language. The smallest unit of statement in
Natural Language Programming is the sentence. Each sentence is stated in terms of concepts
from the underlying ontology. In a Natural Language Program text, each sentence
unambiguously compiles into a procedure call in the underlying high-level programming
language, such as C, C++ or Java.

The ultimate easy-to-use interface for programming would be a natural language interface:
just tell the computer what you want.

Attempts have been made in the natural language programming field to create natural language
interfaces for programming. The NLC prototype of 1979 (Liu and Lieberman, 2005) was built with
the capability of handling low-level operations such as the transformation of type declarations into
programmatic expressions. More recently, a system called METAFOR was developed, capable of
translating natural language statements into class descriptions with associated objects and methods
(Liu and Lieberman, 2005). Efforts have largely focused on experiments on the feasibility of using
natural language in programming, and less has been done to produce fully functional NLP interfaces
for programming.

1.2 Background
A natural language interface for programming should result in greater readability, as well as
making possible a more intuitive way of writing code. Code written in English is much easier
to read and understand than code in a traditional programming language. Quite often, it is a
difficult task to read another programmer's code. Even understanding one's own code can be
hard after a period of time, because without sufficient commenting one cannot tell what the
individual steps are meant to do together.
Debugging is a generic term for finding and fixing errors in a program. These errors can be
syntactic, which are normally detected by the compiler or interpreter, or logical, which cause
unwanted behaviour (Halpern, 1966). The latter can be extraordinarily difficult to find, since
it requires knowing exactly what each line in the program does. If what the programmer
believes a statement does and what it actually does diverge, there is the potential for
catastrophe.
Early work in natural language programming was deemed ambitious, targeting the generation
of complete computer programs that would compile and run. For instance, the NLC
prototype (Ballard and Biermann, 1979) aimed at creating a natural language interface for
processing data stored in arrays and matrices, with the ability to handle low-level
operations such as the transformation of numbers into type declarations, e.g. float-constant
(2.0), or turning natural language statements like "add y1 to y2" into the programmatic
expression y1 + y2.

More recently, however, researchers have started to look again at the problem of natural
language programming, but this time with more realistic expectations and with a different,
much larger pool of resources, for example broad-spectrum common sense knowledge (Singh,
2002) and a suite of significantly advanced, publicly available natural language processing
tools. For instance, (Pane et al., 2001) conducted a series of studies with non-programming
fifth-grade users and identified some of the programming models implied by the users'
natural language descriptions. In a similar vein, (Lieberman, 2005) conducted a
feasibility study and showed how a partial understanding of a text, coupled with a dialogue
with the user, can help non-expert users make their intentions more precise when designing a
computer program. Their study resulted in a system called METAFOR (Liu & Lieberman,
2005), able to translate natural language statements into class descriptions with the associated
objects and methods.

The challenge most programmers face today is making code readable for the next
programmer. Current methods of system documentation do little in terms of documenting the
code itself, which leaves comments as the only documentation most source code has. A
natural language interface uses the comments as the program statements themselves, which
results in code that is self-documenting and readable for anyone who goes through it. This
makes debugging a less daunting task, as the program logic is in plain English.

1.3 Aim

The aim of the project is to develop a natural language compiler based on the C programming
language.
1.4 Objectives
To compile natural language text as program input
To perform syntax error checks on input
To perform grammatical error checks on input
To generate equivalent object code executable on the main platform
1.5 Justification
It is notoriously difficult to construct conventional software systems systematically and
timely (Somerville, 2008), with up to 20% of industrial development projects failing. With
further study and improvements the aim is to bridge the gap between how problems are
defined and how they are solved. Problems are defined in natural language and implemented
using formal programming languages. The gap has caused delays in the delivery of software
which ultimately translates to losses in the millions in some cases.

A natural language interface gives code that is readable and easier to maintain; in essence,
self-documenting code. Consider a scenario where, as a programmer, you are tasked with
adding functionality to an application you did not write, whose code you have at hand but
which is not well documented. Such a task takes a long time. The current methods used for
documenting software projects do little in terms of documenting the code itself. Going
through another programmer's code is a difficult task, and sometimes even your own code is
after a long time. Good programmers write self-documenting code, and yet when faced with
the less preferred scenario it can take a long time to make even a simple change to a system.

1.6 Scope
The project is aimed at developing a natural language C based compiler. The natural
language used is English. The compiler extracts information about variables, operators and
information on loops from analysing the natural language program input.
It focuses on the representation of parts of the natural language, English, that can be mapped
to existing data structures as variables, structs, lists, arrays and loops. The compiler extracts
nouns, verbs and overly expressed actions; these can be mapped through the use of ontologies
to variables, iterations (loops) and statements.
The compiler is based on the data structures that are built in the C programming language. It
does not cover the graphical implementation of the C programming language.
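To make the intended mapping concrete, the sketch below pairs a few hypothetical English
statements with the C constructs they could plausibly map to. The sentences and the mapping are
illustrative assumptions only; they do not represent the compiler's actual input syntax.

#include <stdio.h>

int main(void) {
    /* "Create a list of five whole numbers."          -> array declaration      */
    int list[5] = {1, 2, 3, 4, 5};

    /* "Create a whole number called total, set to 0." -> variable + initialiser */
    int total = 0;

    /* "For every item in the list, add it to total."  -> iteration + statement  */
    for (int i = 0; i < 5; i++) {
        total = total + list[i];
    }

    /* "Display total."                                -> output statement       */
    printf("total = %d\n", total);
    return 0;
}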

1.7 Expected results
The compiler takes English natural language text as input. It performs grammatical and
syntax error checks on the natural language program and reports any errors found. If no
errors are found, it compiles the natural language program and generates the equivalent
object code, executable on the host platform, that is, Windows or Linux.
1.8 Project overview
This project is divided into six main chapters: the introductory chapter, the literature review,
the methodology, systems analysis and design, implementation and testing, and finally the
conclusions and recommendations for future research.
The first chapter gives an introduction to the project. It states the background and aim of the
project and the justification for why the research topic was selected.
The literature review gives an overview of current systems and of the fundamentals
employed in the area of this research. This chapter also develops an argument for the
relevance of this research.
The methodology chapter gives an overview of the available methodologies and justifies the
selection of the methodology that is followed. The chapter also describes the techniques used
in the development of the Natural Language C Cross Compiler.

The systems analysis and design chapter focuses on the analysis and design of the system. It
contains a detailed analysis of the functional requirements and a summary of the system in
the form of system development designs. Using the Unified Modelling Language, the
conceptual model of the system is shown, and it is on these designs that the system is
implemented.
The implementation chapter shows how the transformation from design to application is
done. Screenshots of the completed system are shown. This chapter also covers testing and
gives an overview of the problems encountered during the implementation stage, as well as
the resulting solutions.
The last chapter gives the conclusions of the research: the set objectives are measured against
the results and suggestions for further research are stated. Divergences, if any, are justified
in this chapter.

1.9 Project Plan
The project follows the plan below, with the activities and the timelines for each milestone
shown in the following table.

Activity Number    Milestone                            Timeline
1                  Analysis and design                  4 Weeks
2                  Implementation                       5 Weeks
3                  Testing and Evaluation               4 Weeks
4                  Final Demonstration and reporting    2 Weeks

Table 1.1: Project Plan







1.10 Conclusion
This chapter introduced the research project, its aim and objectives. To introduce the project,
Natural Language Processing and Natural Language Programming were discussed, and the
need for the system was argued. Also included in this chapter were the project plan and an
overview of the remaining chapters of this documentation.





















Chapter 2
Literature Review
2.1 Introduction
Natural Language Processing is an interdisciplinary research area at the border between
linguistics and artificial intelligence, aiming to develop computer programs capable of
human-like activities related to understanding or producing texts or speech in a natural
language, such as English. Modern approaches to NLP are based on machine learning, a type
of artificial intelligence that examines and uses patterns in data to improve a program's own
understanding. The most important applications of natural language processing include
information retrieval and information organisation, machine translation, and natural language
interfaces, among others.
The work of improving Natural Language Processing has been divided into tasks useful for
application development and analysis. These range from syntactic analysis, such as part-of-
speech tagging, chunking and parsing, to semantic analysis, such as semantic role labelling,
named entity extraction and anaphora resolution.
Natural Language Programming is a branch separate from Natural Language Processing but
within Artificial Intelligence. Natural Language Programming is the interpretation and
compilation of instructions communicated in natural language into object code. It depends
on advances in Natural Language Processing.

2.2 Natural Language Processing
Natural Language Processing (NLP) targets the conversion of human language into formal
representations that can be manipulated by computers. Natural Language Processing is not
often considered a goal in and of itself but rather a means of accomplishing a certain task;
for instance, there are information retrieval systems that use NLP.
Natural Language Processing seeks to accomplish human-like processing, that is, to be able
to paraphrase input text, translate the text into another language and answer questions about
the text.


2.2.1 Natural language Processing Applications
There are huge amounts of data on the Internet, and applications for processing large amounts
of text require Natural Language Processing expertise. Typical requirements include
classifying text into categories, indexing and searching large texts, automatic translation,
speech understanding (for example, understanding phone conversations), information
extraction (for example, extracting useful information from resumes), automatic
summarisation (condensing a book into a page), question answering, knowledge acquisition,
and text generation or dialogue.
Natural language processing provides both theory and implementations for a range of
applications, including information retrieval and information extraction. Information
extraction focuses on the recognition, tagging, and extraction of certain key elements of
information, for example persons, companies, locations and organisations, from large
collections of text into a structured representation. These extractions can then be utilised for
a range of applications including question answering, visualisation, and data mining.
Question answering, in contrast to information retrieval, provides the user with either the text
of the answer itself or answer-providing passages, rather than a list of potentially relevant
documents returned in response to a user's query.
At the higher, discourse level of NLP there is summarisation, which reduces a larger text into
a shorter, richly constituted, abbreviated narrative representation of the original document.
Machine Translation (MT) can be considered the oldest of all NLP applications. Various
levels of NLP have been utilised in MT systems, ranging from word-based approaches to
applications that include higher levels of analysis.

2.2.2 Computational Linguistics
A simple sentence consists of a subject followed by a predicate. A word in a sentence acts as
a part of speech (POS). For an English sentence, the parts of speech are nouns, pronouns,
adjectives, verbs, adverbs, prepositions, conjunctions, and interjections. A noun names a
thing, whereas a verb expresses an action. Adjectives and adverbs modify nouns and verbs,
respectively. Prepositions express relationships between nouns and other parts of speech.
Conjunctions join words and groups together, and interjections express strong feelings. In
spoken language, the problem of understanding speech can be divided into three areas:
acoustic-phonetic, morphological-syntactic, and semantic-pragmatic processes.
In computational linguistics the lexicon supplies paradigmatic information about words,
including part-of-speech labels, irregular plurals, and sub-categorisation information for
verbs. In the past, lexicons were quite small and were constructed largely by hand. Effective
natural language processing requires increased amounts of lexical information. A recent trend
has been the use of automatic techniques applied to large corpora for the purpose of acquiring
lexical information from text (Zernik 1991). Statistical techniques are an important aspect of
automatically mining lexical information. (Manning 1993) uses such techniques to gather
sub-categorisation information for verbs. (Brent 1993) also discovers sub-categorisation
information and, in addition, attempts to automatically discover verbs in the text. (Liu and
Soo 1993) describe a method for mining information about thematic roles. The additional
information being added to the lexicon increases its complexity. This added complexity
requires that attention be paid to the organisation of the lexicon (Zernik 1991). (McCray et al
1993) discuss the structure of a large lexicon designed and implemented to support syntactic
processing.

Automatically disambiguating part-of-speech labels in text is an important research area,
since ambiguity is particularly prevalent in the spoken language. Programs that resolve part-
of-speech labels (often called automatic taggers) are typically around 95% accurate (Bod
1998). Taggers can serve as pre-processors for syntactic parsers and contribute significantly
to efficiency. There have been two main approaches to automatic tagging: probabilistic and
rule-based. Typically, probabilistic taggers are trained on disambiguated text and vary as to
how much training text is needed and how much human effort is required in the training
process. (Schütze 1993) described a tagger that requires very little human intervention.
Further variation concerns what to do about unknown words and the ability to deal with large
numbers of tags.
One drawback of stochastic taggers is that they are very large programs requiring
considerable computational resources. (Brill 1992) describes a rule-based tagger which is as
accurate as stochastic taggers, but with a much smaller program. The program is slower than
stochastic taggers, however. Building on Brill's approach, (Roche and Schabes 1995)
propose a rule-based, finite-state tagger which is much smaller and faster than stochastic
implementations. Accuracy and other characteristics remain comparable.

A traditional approach to natural language processing takes as its basic assumption that a
system must assign a complete constituent analysis to every sentence it encounters. The
methods used to attempt this are drawn from mathematics, with context-free grammars
playing a large role in assigning syntactic constituent structure. (Partee et al 1993) provide an
accessible introduction to the theoretical constructs underlying this approach, including set
theory, logic, formal language theory, and automata theory, along with the application of
these mechanisms to the syntax and semantics of natural language. For syntax, such work
typically uses a unification-based implementation of a generalised phrase structure grammar
(Gazdar et al. 1985) and handles an impressive number of syntactic structures. In continuing research in this
tradition, context-free grammars have been extended in various ways. The mildly context
sensitive grammars, such as tree adjoining grammars, have had considerable influence on
recent work concerned with the formal aspects of parsing natural language. Several recent
papers pursue non-traditional approaches to syntactic analysis. One such technique is partial,
or underspecified, analysis. For many applications such an analysis is entirely sufficient and
can often be more reliably produced than a fully specified structure. (Chen 1994), for
example, employs statistical methods combined with a finite-state mechanism to impose an
analysis which consists only of noun phrase boundaries, without specifying their complete
internal structure or their exact place in a complete tree structure. (Agarwal and Boggess
1992) successfully rely on semantic features in a partially specified syntactic representation
for the identification of coordinate structures. In an innovative application of dependency
grammar and dynamic programming techniques, (Kurohashi and Nagao 1994) address the
problem of analysing very complicated coordinate structures in Japanese.
A recent innovation in syntactic processing has been investigation into the use of statistical
techniques. In probabilistic parsing, probabilities are extracted from a parsed corpus for the
purpose of choosing the most likely rule when more than one rule can apply during the course
of a parse (Magerman and Weir 1992). In another application of probabilistic parsing the goal
is to choose the (semantically) best analysis from a number of syntactically correct analyses
for a given input (Briscoe et al 1993).
Another application of statistical methodologies to the parsing process is grammar induction
where the rules themselves are automatically inferred from a bracketed text; however, results
in the general case are still preliminary. (Pereira and Schabes 1992) discuss inferring a
grammar from bracketed text relying heavily on statistical techniques, while (Brill 1993) uses
only modest statistics in his rule-based method.

Automatic word-sense disambiguation depends on the linguistic context encountered during
processing. (McRoy 1992) appeals to a variety of cues while parsing, including morphology,
collocations, semantic context, and discourse. Her approach is not based on statistical
methods, but rather is symbolic and knowledge intensive. Statistical methods exploit the
distributional characteristics of words in large texts and require training, which can
come from several sources, as well as human intervention. (Gale et al 1992) give an overview
of several statistical techniques they have used for word-sense disambiguation and discuss
research on evaluating results for their systems and others. They have used two training
techniques, one based on a bilingual corpus, and another on Roget's Thesaurus. (Justeson and
Katz 1995) use both rule-based and statistical methods. The attractiveness of their method is
that the rules they use provide linguistic motivation.
Formal semantics is rooted in the philosophy of language and has as its goal a complete and
rigorous description of the meaning of sentences in natural language. It concentrates on the
structural aspects of meaning. The papers in (Rosner and Johnson 1992) discuss various
aspects of the use of formal semantics in computational linguistics and focus on Montague
grammar (Montague 1974). (King 1992) provides an overview of the relation between formal
semantics and computational linguistics. Several papers in Rosner and Johnson discuss
research in the situation semantics paradigm (Barwise and Perry 1983), which has recently
had wide influence in computational linguistics, especially in discourse processing. Lexical
semantics (Cruse 1986) has recently become increasingly important in natural language
processing. This approach to semantics is concerned with psychological facts associated with
the meaning of words. (Levin 1993) analyses verb classes within this framework, while the
papers in (Levin and Pinker 1991) explore additional phenomena, including the semantics of
events and verb argument structure. Another application of lexical semantics is WordNet,
a lexical database that attempts to model cognitive processes. The articles in (Saint-
Dizier and Viegas 1995) discuss psychological and foundational issues in lexical semantics as
well as a number of aspects of using lexical semantics in computational linguistics.
Another approach to language analysis based on psychological considerations is cognitive
grammar (Langacker 1988). (Olivier and Tsujii 1994) deal with spatial prepositions in this
framework, while (Davenport and Heinze 1995) discuss more general aspects of semantic
processing based on cognitive grammar.

Discourse analysis is concerned with coherent processing of text segments larger than the
sentence and assumes that this requires something more than just the interpretation of the
individual sentences. (Grosz, Joshi and Weinstein 1995) provide a broad-based discussion of
the nature of discourse, clarifying what is involved beyond the sentence level, and how the
syntax and semantics of the sentences support the structure of the discourse. In their analysis,
discourse contains linguistic structure (syntax, semantics), focus of attention, and intentional
structure (plan of participants) and is structured into coherent segments. During discourse
processing one important task for the hearer is to identify the referents of noun phrases.
Inferencing is required for this identification. A coherent discourse lessens the amount of
inferencing required of the hearer for comprehension. Throughout a discourse the particular
way that the speaker maintains focus of attention or centring through choice of linguistic
structures for referring expressions is particularly relevant to discourse coherence.
Other work in computational approaches to discourse analysis has focused on particular
aspects of processing coherent text. (Hajicova et al 1995) distinguish topic, that is, old
information, from focus, that is, new information, within a sentence. Information of this sort is
relevant to tracking the focus of attention. (Lappin and Leass 1994) are primarily concerned
with intra-sentential anaphora resolution, which relies on syntactic cues, rather than discourse
cues. Nonetheless, they also address inter-sentential anaphora, and this relies on several
discourse cues, such as the saliency of a noun phrase, which is determined by such things as
grammatical role, frequency of mention, proximity, and how recent a sentence is. (Huls et al
1995) use a similar notion of saliency for anaphora resolution and resolve deictic expressions
with the same principles. (Passonneau and Litman 1993) study the nature of discourse
segments and the linguistic structures which cue them. (Soderland and Lehnert 1994)
investigate machine learning techniques for discovering discourse-level semantic structure.
Several recent papers investigate those aspects of discourse processing having to do with the
psychological state of the participants in a discourse, including goals, intentions, and beliefs.
(Asher and Lascarides 1994) investigate a formal model for representing the intentions of the
participants in a discourse and the interaction of such intentions with discourse structure and
semantic content. (Traum and Allen 1994) describe the idea of social obligation to shed light
on the behaviour of discourse. (Wiebe 1994) investigates psychological point of view in third
person narrative and provides an insightful algorithm for tracking this phenomenon in text.
The point of view of each sentence is either that of the narrator or any one of the characters in
the narrative.

2.2.3 Levels of knowledge in language understanding
A language understanding program must have considerable knowledge about the structure of
the language, including what the words are and how they combine into phrases and sentences.
It must also know the meanings of the words, how they contribute to the meaning of the
sentence, and the context in which they are being used. In addition, the program must have
general world knowledge and knowledge about how humans reason.
The components of the knowledge needed to understand language are as follows.
Phonological knowledge relates sounds to the words we recognise; a phoneme is the smallest
unit of sound, and phones are aggregated into words. Morphological knowledge is lexical
knowledge, which relates to word construction from basic units called morphemes. A
morpheme is the smallest unit of meaning, for example the construction of "friendly" from
"friend" and "ly". Syntactic knowledge concerns how words are organised to construct
meaningful and correct sentences. Pragmatics is the high-level knowledge about how to use
sentences in different contexts and how the context affects the meanings of the sentences.

2.2.4 Grammars and Languages
A language can be generated given its grammar G = (V, Σ, S, P), where V is the set of
variables (non-terminals), Σ is the set of terminal symbols, which appear at the end of
generation, S is the start symbol, and P is the set of production rules. The corresponding
language of G is L(G).
Consider that the various tuples are as given in Listing 2.1.












V = {S, NP, N, VP, V, Art}
Σ = {boy, icecream, dog, bite, like, ate, the, a}
P = {S   → NP VP,
     NP  → N,
     NP  → Art N,
     VP  → V NP,
     N   → boy | icecream | dog,
     V   → ate | like | bite,
     Art → the | a}
Listing 2.1: Language Generation
Listing 2.1: Language Generation
Using the above grammar we can generate the following sentences:
The dog bites boy.
Boy bites the dog.
Boy ate ice cream.
The dog bite the boy.
Listing 2.2: Second Language Generation

To generate a sentence, the rules from P are applied sequentially starting from the start
symbol. However, we note that a grammar does not guarantee the generation of meaningful
sentences; it generates only those that are structurally correct as per the rules of the grammar.
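As a concrete illustration of applying such rules mechanically, the following C sketch
(written for this discussion, not taken from any cited system) recognises exactly the sentences
derivable from the grammar in Listing 2.1, assuming lower-case, space-separated input.

#include <stdio.h>
#include <string.h>

static char *words[16];
static int nwords, pos;

static int is_noun(const char *w) {
    return !strcmp(w, "boy") || !strcmp(w, "icecream") || !strcmp(w, "dog");
}
static int is_verb(const char *w) {
    return !strcmp(w, "ate") || !strcmp(w, "like") || !strcmp(w, "bite");
}
static int is_art(const char *w) {
    return !strcmp(w, "the") || !strcmp(w, "a");
}

/* NP -> Art N | N */
static int parse_np(void) {
    if (pos < nwords && is_art(words[pos])) pos++;
    if (pos < nwords && is_noun(words[pos])) { pos++; return 1; }
    return 0;
}

/* VP -> V NP */
static int parse_vp(void) {
    if (pos < nwords && is_verb(words[pos])) { pos++; return parse_np(); }
    return 0;
}

/* S -> NP VP, and all input must be consumed */
static int parse_s(void) {
    return parse_np() && parse_vp() && pos == nwords;
}

int main(void) {
    char buf[] = "the dog bite the boy";
    for (char *t = strtok(buf, " "); t; t = strtok(NULL, " "))
        words[nwords++] = t;
    printf("%s\n", parse_s() ? "structurally correct" : "rejected");
    return 0;
}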
It is not always possible to formally characterise natural languages with a simple grammar
like the one above. Grammars are classified by the Chomsky hierarchy into types 0, 1, 2 and
3. Typical rewrite rules for a type 1 grammar are given in Listing 2.3:


S  → aS
S  → aAB
AB → BA
aA → ab
aA → aa
Listing 2.3: Third Language Generation

Where uppercase letters are non-terminals and lowercase are terminals.
The type-2 grammars are:
S → aS
S → aSb
S → aB
S → aAB
A → a
B → b
Listing 2.4: Fourth Language Generation
The type 3 grammar is the simplest, with rewrite rules such as:
S → aS
S → ε
Listing 2.5: Fifth Language Generation
The types 1, 2 and 3 are called context-sensitive, context-free, and regular grammars
respectively, and the corresponding languages take the same names. Formal languages are
mostly based on type-2 grammars, as types 0 and 1 are not well understood and are difficult
to implement.

2.2.5 Structural Representation
It is convenient to represent the sentences as tree or a graph to help expose the structure of the
constituent parts. For example, the sentence, the boy ate a ice cream can be represented as a
tree shown in Figure 2.1.








Figure 2.1: Parse Tree
For the purpose of computation a tree must also be represented as a record, a list or some
similar data structure. For example, the tree above is represented as a list:
(S (NP ((Art the)
(N boy))
(VP (V ate) (NP (Art a) (N Icecream)))))
Listing 2.6: Tree Representation as a list
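As a sketch of one such data structure, the fragment below (in C, the target language of this
project; the field names are illustrative assumptions) shows how the tree of Figure 2.1 could
be held in memory as labelled nodes with child pointers.

#include <stdio.h>

struct node {
    const char  *label;        /* constituent or word, e.g. "S", "NP", "boy" */
    struct node *children[4];  /* sub-constituents, NULL-terminated          */
};

int main(void) {
    struct node the = {"the", {0}},        boy = {"boy", {0}};
    struct node art = {"Art", {&the, 0}},  n   = {"N",   {&boy, 0}};
    struct node np  = {"NP",  {&art, &n, 0}};
    /* ... the VP subtree (V "ate", NP "a icecream") is built in the same way */
    struct node s   = {"S",   {&np, 0}};
    printf("root = %s, first child = %s\n", s.label, s.children[0]->label);
    return 0;
}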



A more extensive English grammar can be obtained with the addition of other constituents
such as prepositional phrases (PP), adjectives (ADJ), determiners (DET), adverbs (ADV),
auxiliary verbs (AUX), and many other features. The corresponding additional rewrite rules
are as follows.
PP  → Prep NP
VP  → V ADV
VP  → V PP
VP  → V NP PP
VP  → AUX V NP
Det → Art ADJ
Det → Art
Listing 2.7: Rewrite rules
2.2.6 Pattern matching
The idea here is an approach to natural language processing in which input utterances are
interpreted as a whole, rather than building up their interpretation by combining the structure
and meaning of words or other lower-level constituents. That means the interpretations are
obtained by matching patterns of words against the input utterance. For a deep level of
analysis, pattern matching requires a large number of patterns even for a restricted domain.
This problem can be ameliorated by hierarchical pattern matching, in which the input is
gradually normalised through pattern matching against sub-phrases. Another way to reduce
the number of patterns is to match against semantic primitives instead of words.
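A minimal sketch of word-level pattern matching is shown below, assuming the single
pattern "add <x> to <y>" in the spirit of the NLC example discussed in section 2.3.1; a
practical matcher would hold a large table of such patterns.

#include <stdio.h>

int main(void) {
    /* Hypothetical utterance and pattern, for illustration only. */
    const char *utterance = "add y1 to y2";
    char x[32], y[32];

    /* Match the fixed pattern "add <x> to <y>" and rewrite it as an assignment. */
    if (sscanf(utterance, "add %31s to %31s", x, y) == 2)
        printf("%s = %s + %s;\n", y, y, x);
    else
        printf("no pattern matched\n");
    return 0;
}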

2.2.7 Syntactically driven Parsing
Syntax refers to the ways that words can fit together to form higher-level units such as phrases,
clauses and sentences. Syntactically driven parsing therefore means that interpretations of larger
groups of words are built up out of the interpretations of their syntactic constituent words or
phrases. In a way this is the opposite of pattern matching, where the interpretation of the input is
done as a whole. Syntactic analyses are obtained by applying a grammar that determines what
sentences are legal in the language being parsed.

2.2.8 Semantic Grammars
Natural language analysis based on semantic grammar is somewhat similar to syntactically
driven parsing, except that in a semantic grammar the categories used are defined
semantically as well as syntactically; semantics is thus involved in the grammar itself.
Case frame instantiation is one of the major parsing techniques under active research today.
It has some very useful computational properties, such as its recursive nature and its ability
to combine bottom-up recognition of key constituents with top-down instantiation of less
structured constituents.

2.2.9 Applications of Natural Language Processing
As natural language processing technology matures it is increasingly being used to support
other computer applications. Such use naturally falls into two areas, one in which linguistic
analysis merely serves as an interface to the primary program, and another in which natural
language considerations are central to the application.
Natural language interfaces to database management systems (for example Bates 1989)
translate a user's input into a request in a formal database query language, and the program
then proceeds as it would without the use of natural language processing techniques. It is
normally the case that the domain is constrained and the language of the input consists of
comparatively short sentences with a constrained set of syntactic structures. The design of
question answering systems is similar to that of interfaces to database management systems.
One difference is that the knowledge base supporting the question answering system does not
have the structure of a database. Processing in such a system not only requires a linguistic
description of users' requests, but must also provide a representation of the encyclopaedia
itself. As with the interface to a database management system, the requests are likely to be
short and have a constrained syntactic structure. (Lauer et al 1992) provide some general
considerations concerning question answering systems and describe several applications.
In message understanding systems, a fairly complete linguistic analysis may be required, but
the messages are relatively short and the domain is often limited. (Davenport and Heinze
1995) describe such a system in a military domain.

In information filtering, text categorisation, and automatic abstracting no constraints on the
linguistic structure of the documents being processed can be assumed. One mitigating factor
is that effective processing may not require a complete analysis. For all of these applications
there are also statistically based systems based on frequency distributions of words. These
systems work fairly well, but most people feel that for further improvements, and for
extensions, some sort of understanding of the texts, such as that provided by linguistic
analysis, is required.
Information filtering and text categorisation are concerned with comparing one document to
another. In both applications, natural language processing imposes a linguistic representation
on each document being considered. In text categorisation a collection of documents is
inspected and all documents are grouped into several categories based on the characteristics
of the linguistic representations of the documents. (Blosseville et al. 1992) describe an
interesting system which combines natural language processing, statistics, and an expert
system. In information filtering, documents satisfying some criterion are singled out from a
collection. (Jacobs and Rau 1990) discuss a program which imposes a quite sophisticated
semantic representation for this purpose.
In automatic abstracting, a summary of each document is sought, rather than a classification
of a collection. The underlying technology is similar to that used for information filtering and
text categorisation: the use of some sort of linguistic representation of the documents. Of the
two major approaches, one (McKeown and Radev 1995) puts more emphasis on semantic
analysis for this representation and the other (Paice and Jones 1993), less. Information
retrieval systems typically allow a user to retrieve documents from a large bibliographic
database. During the information retrieval process a user expresses an information need
through a query. The system then attempts to match this query to those documents in the
database which satisfy the user's information need. In systems which use natural language
processing, both query and documents are transformed into some sort of a linguistic structure,
and this forms the basis of the matching. Several recent information retrieval systems employ
varying levels of linguistic representation for this purpose. (Sembok and van Rijsbergen
1990) base their experimental system on formal semantic structures, while (Myaeng et al
2004) construct lexical semantic structures for document representations. (Strzalkowski
1994) combines syntactic processing and statistical techniques to enhance the accuracy of
representation of the documents. In an innovative approach to document representation for
information retrieval, (Liddy et al 1995) use several levels of linguistic structure, including
lexical, syntactic, semantic, and discourse.

2.2.10 Natural Language Processing based systems
A number of systems currently use natural language processing for the accomplishment of a
number of targeted tasks. Some of these include systems that do text summarising, page
ranking and natural language interfaces to databases, text mining and language translation.
Currently researchers have been working on natural language programming interfaces. Some
of these systems have been targeted at learning institutions for the acquisition of a first
programming language. This is because it has been noted across tertiary institutions that the
dropout rate for courses involving programming has been significantly high, as high as 30%
(Guzdial & Soloway, 2002), also causing a lower appreciation of programming in general
among students majoring in Computer Science.
The vast amount of information on the internet and the information needed for doing day to
day tasks in a number of fields has called for systems that can search comprehensively for
information to improve productivity. This has led to data and text mining systems in fields
like medicine.

2.3 Natural Language Programming
Natural Language Programming is the interpretation and compilation of instructions
communicated in natural language into object code. It uses natural language processing
techniques for the extraction of information from natural language text input.

2.3.1 Natural Language Programming based systems
The NLC prototype (Ballard and Biermann, 1979) was one of the attempts made to come up
with a natural language programming interface. It had the capability of handling low-level
operations such as the transformation of type declarations into programmatic expressions.
The system is capable of turning statements like "add y1 to y2" into the expression y1 + y2.

More recently, in 2005, a system called METAFOR was implemented. METAFOR is capable
of translating natural language statements into class descriptions with associated objects and
methods. METAFOR interactively converts English sentences into partially specified
program code, to be used as a starting point for a more detailed program. A user study by
Henry Lieberman showed that METAFOR is capable of capturing enough programmatic
semantics to facilitate non-programming users' and beginners' conceptualisation of
programming problems.

2.4 Compiler Design
Compilers bridge source programs in high-level languages with the underlying hardware. A
compiler has four major tasks: to determine the correctness of the syntax of programs, to
generate correct and efficient object code, to perform run-time organisation, and to format
output according to assembler and/or linker conventions. A compiler consists of three main
parts: the front end, the middle end, and the back end.

The front end checks whether the program is correctly written in terms of the programming language
syntax and semantics. Here legal and illegal programs are recognised. Errors are reported, if any, in a
useful way. Type checking is also performed by collecting type information. The front end then
generates an intermediate representation, or IR, of the source code for processing by the middle end.
The middle end is where optimisation takes place. Typical transformations for optimisation are
removal of useless or unreachable code, discovery and propagation of constant values, relocation of
computation to a less frequently executed place (e.g., out of a loop), or specialisation of computation
based on the context.
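The following illustration, shown at the C source level rather than on a real IR, gives a
flavour of two of these transformations (constant propagation and unreachable-code
removal). It is a hand-worked example, not the output of any particular compiler.

/* Before optimisation: */
int before(void) {
    int a = 4;
    int b = a * 2;        /* constant propagation: a is known to be 4 */
    if (0) {              /* unreachable branch                       */
        b = b + 100;
    }
    return b;
}

/* After constant propagation and unreachable-code removal,
 * the function is equivalent to: */
int after(void) {
    return 8;
}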

The middle end generates another intermediate representation for the back end, and most
optimisation effort is focused on this part. The back end is responsible for translating the IR
from the middle end into assembly code. Target instructions are chosen for each IR
instruction. Register allocation assigns processor registers to the program variables where
possible. The back end utilises the hardware by figuring out how to keep parallel execution
units busy, filling delay slots, and so on. Although most optimisation problems are NP-hard,
heuristic techniques for them are well developed.




2.4.1 What is a compiler?
In order to reduce the complexity of designing and building computers, nearly all of them are
made to execute relatively simple commands (but to do so very quickly). A program for a
computer must be built by combining these very simple commands into a program in what is
called machine language. Since this is a tedious and error-prone process, most programming
is instead done using a high-level programming language. This language can be very
different from the machine language that the computer can execute, so some means of
bridging the gap is required. This is where the compiler comes in. A compiler translates (or
compiles) a program written in a high-level programming language, suitable for human
programmers, into the low-level machine language required by computers. During this
process, the compiler attempts to spot and report obvious programmer mistakes.

2.4.2 The phases of a compiler
A typical way to structure the writing of a compiler is to split the compilation into several
phases with well-defined interfaces (Alfred 2007). Theoretically, these phases operate in
sequence (though in practice, they are often interleaved), each phase (except the first) taking
the output from the previous phase as its input. It is common to let each phase be handled by
a separate module. Some of these modules are written by hand, while others may be
generated from specifications. Often, some of the modules can be shared between several
compilers.
In some compilers the ordering of phases may differ slightly; some phases may be combined
or split into several phases, or extra phases may be inserted between those mentioned in the
following paragraphs.
Lexical analysis is the initial part of reading and analysing the program text: The text is read
and divided into tokens, each of which corresponds to a symbol in the programming
language, for example, a variable name, keyword or number.
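The sketch below illustrates this idea in C with a deliberately simplified token set
(identifiers, numbers and single-character symbols); it is an assumed toy for illustration, not
the lexical grammar of C itself.

#include <ctype.h>
#include <stdio.h>

enum token_kind { TOK_IDENT, TOK_NUMBER, TOK_SYMBOL, TOK_END };

struct token { enum token_kind kind; char text[32]; };

/* Read the next token from the character stream, advancing the cursor. */
static struct token next_token(const char **p) {
    struct token t = { TOK_END, "" };
    while (isspace((unsigned char)**p)) (*p)++;
    if (**p == '\0') return t;

    int i = 0;
    if (isalpha((unsigned char)**p)) {            /* identifier or keyword */
        t.kind = TOK_IDENT;
        while (isalnum((unsigned char)**p) && i < 31) t.text[i++] = *(*p)++;
    } else if (isdigit((unsigned char)**p)) {     /* number literal        */
        t.kind = TOK_NUMBER;
        while (isdigit((unsigned char)**p) && i < 31) t.text[i++] = *(*p)++;
    } else {                                      /* operator, punctuation */
        t.kind = TOK_SYMBOL;
        t.text[i++] = *(*p)++;
    }
    t.text[i] = '\0';
    return t;
}

int main(void) {
    const char *src = "count = count + 1;";
    for (struct token t = next_token(&src); t.kind != TOK_END; t = next_token(&src))
        printf("token: %s\n", t.text);
    return 0;
}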
The syntax analysis phase takes the list of tokens produced by lexical analysis and arranges
these in a tree structure (called the syntax tree) that reflects the structure of the program. This
phase is often called parsing.

The type checking phase analyses the syntax tree to determine whether the program violates
certain consistency requirements, that is, whether a variable is used but not declared, or is
used in a context that does not make sense given its type, such as trying to use a Boolean
value as a function pointer.
In intermediate code generation the program is translated into a simple, machine-independent
intermediate language. In the register allocation phase the symbolic variable names used in
the intermediate code are translated to numbers, each of which corresponds to a register in
the target machine code.
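As a small worked illustration, the fragment below shows a C statement together with one
possible three-address intermediate form for it. The temporary names t1 and t2 and the :=
notation are assumptions made for the example; the exact notation varies between compilers.

/* Source statement: */
int example(int b, int c, int d) {
    int a = b + c * d;
    return a;
}

/* One possible intermediate code, with the temporaries later mapped to
 * registers during register allocation:
 *
 *     t1 := c * d
 *     t2 := b + t1
 *     a  := t2
 */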

2.5 Conclusion
This chapter gave an overview of the techniques used in Natural Language Processing and
where they are applied in real life. From the literature review, an approach to designing the
Natural Language Compiler would be to start by utilising existing tools and algorithms for
Natural Language Processing. These assist in extracting the relevant information from the
input, to be used in the subsequent stages of the overall system. A lot of work would have to
be done after the initial stages of processing: for the compiler to be fully functional, an
exhaustive number of functions have to be written in the underlying language that correspond
unambiguously to the actions to be performed on the parameters passed.










Chapter 3
Methodology
3.1 Introduction
A software development methodology is a structure imposed on the development of a
software product, or alternatively a framework used to structure, plan, and control the process
of developing an information system. It includes procedures, techniques, tools and
documentation aids which help system developers in their task of implementing a new
system. The aim of a methodology is to formalise what is being done, making it more
repeatable.
A study conducted by the Forrester research group (Hoffman T, July 2003) states that nearly
one-third of all IT projects commenced would, on average, be three months late. In many
cases the failure is the result of either not using a methodology or using the wrong
methodology. This shows the importance of a software development methodology to a
software project, as it partly determines the project's success or failure. This chapter
discusses some development methodologies in use and also highlights the methodology
adopted for this project and why it was chosen.

3.2 Research Methodology
A research methodology is a way to systematically solve a research problem. It may be
understood as the science of studying how research is done scientifically: the various steps
generally adopted by a researcher in studying a research problem, along with the logic behind
them. Research methodologies use procedures, methods and techniques that have been tested
for their validity and reliability. Some research methodologies are discussed below.

The build research methodology consists of building an artefact, either a physical or a
software system, to demonstrate that it is possible. For it to be considered research, the
construction of the artefact must be new or must include new features that have not been
demonstrated before in other artefacts.

Another research methodology is the process methodology, which is used to understand the
processes used to accomplish a given task. This methodology is mostly used in the areas of
Software Engineering and Man-Machine Interfaces, which deal with the way humans build
and use computer systems. The study of processes may also be used to understand cognition
in the field of Artificial Intelligence.

The last research methodology discussed for the project is the model methodology. It is
centred on defining an abstract model for a real system. The model is much less complex than
the system that it models, and therefore allows the researcher to better understand the system
and to use the model to perform experiments that could not be performed in the system itself
because of cost or accessibility. The model methodology is often used in combination with
other methodologies. Experiments based on a model are called simulations. When a formal
description of the model is created to verify the functionality or correctness of a system, the
task is called model checking.

3.3 Rapid Application Development
Rapid Application Development (RAD) is a software development methodology that focuses
on building applications in a very short amount of time; traditionally with compromises in
usability, features and execution speed. RAD employs joint application design (to obtain user
input), prototyping, CASE technology, application generators, and similar tools to expedite
the design process.
Rapid Application Development has four essential aspects: methodology, people,
management, and tools. If any one of these ingredients is inadequate, development will not be
high speed. Development lifecycles, which weave these ingredients together as effectively as
possible, are of the utmost importance.

3.3.1 Strengths, weaknesses, and limitations
Rapid application development promotes fast, efficient, accurate program and/or system
development and delivery. Compared to other methodologies, RAD generally improves
user/designer communication, user cooperation, and user commitment, and promotes better
documentation.
Because rapid application development adopts prototyping and joint application design, RAD
inherits their strengths and their weaknesses. More specifically, RAD is not suitable for
mathematical or computationally oriented applications. Because rapid application
development stresses speed, quality indicators such as consistency, standardization,
reusability, and reliability are easily overlooked.
Speed and quality are the primary advantages of Rapid Application Development, while
potentially reduced scalability and feature sets are the disadvantages. The primary advantage
lies in an application's increased development speed and decreased time to delivery. Projects
developed using RAD lack the scalability of a project that was designed as a full application
from the start. Rapid Application Development is not appropriate for all projects. The
methodology works best for projects where the scope is small or work can be broken down
into manageable chunks. Business objectives need to be well defined before the project can
begin, so projects that use RAD should not have a broad or poorly defined scope.

3.3.2 RAD Concepts and Phases
Rapid application development (RAD) is a system development methodology that employs
joint application design (to obtain user input), prototyping, CASE technology, application
generators, and similar tools to expedite the design process. Initially suggested by James
Martin, this methodology gained support during the 1980s because of the wide availability of
such powerful computer software as fourth-generation languages, application generators, and
CASE tools, and the need to develop information systems more quickly. The primary
objectives include high quality, fast development, and low cost.
Rapid application development focuses on four major components: tools, people,
methodology, and management. Current, powerful computing technology is essential to
support such tools as application generators, screen/form generators, report generators,
fourth-generation languages, relational or object-oriented database tools, and CASE tools.
People include users and the development team. The methodology stresses prototyping and
joint application design.
A strong management commitment is essential. Before implementing rapid application
development, the organisation should establish appropriate project management and formal
user sign-off procedures. Additionally, standards should be established for the organisation's
data resources, applications, systems, and hardware platforms.

Martin suggests four phases to implement rapid application development: requirements
planning, user design, construction, and cutover (Martin, 2005). Requirements planning is
much like traditional problem definition and systems analysis. RAD relies heavily on joint
application design (JAD) sessions to determine the new system requirements.
During the user design phase, the JAD team examines the requirements and transforms them
into logical descriptions. CASE tools are used extensively during this phase. The system
design can be planned as a series of iterative steps or allowed to evolve.
During the construction phase, a prototype is built using the software tools described earlier.
The JAD team then exercises the prototype and provides feedback that is used to refine the
prototype. The feedback and modification cycle continues until a final, acceptable version of
the system emerges. In some cases, the initial prototype consists of screens, forms, reports,
and other elements of the user interface, and the underlying logic is added to the prototype
only after the user interface is stabilised.
The cutover phase is similar to the traditional implementation phase. Key activities include
training the users, converting or installing the system, and completing the necessary
documentation. Once the prototype has been developed within its time box, the construction
team tests it using test scripts developed during the user design stage, after which both the
design team and the customer review the application. Lastly, the implementation stage, also
known as the deployment stage, consists of integrating the new system into the business. The
design team trains the system users while the users perform acceptance testing. If there was
an old system in place, the design team helps the users move from their old procedures to
new ones that involve the new system. The design team also troubleshoots after deployment,
using a test environment for testing purposes, and identifies and tracks potential
enhancements. The amount of time required to complete the implementation stage varies
with the project.
As with any project there are post-project activities, which are typically the same for most
methodologies. For RAD, final deliverables should be handed over to the client and activities
should be performed that benefit future projects. Specifically, it is best practice for a Project
Manager to review and document project metrics and to organise and store project assets
such as reusable code components, the Project Plan, the Project Management Plan (PMP) and
the Test Plan. It is also good practice to prepare a short lessons-learned document.

3.4 Agile
The focal aspects of light and agile methods are simplicity and speed. In development work,
accordingly, the development group concentrates only on the functions needed at first hand,
delivering them fast, collecting feedback and reacting to the received information. An agile
development process is one where software development is incremental, cooperative,
straightforward and adaptive. Agile methodology is based on iterative and incremental
development, where requirements and solutions evolve.
The core of agile software development methods is the use of light-but-sufficient rules of
project behaviour and the use of human and communication-oriented rules. The agile
process is both light and sufficient. Lightness is a means of remaining manoeuvrable.
Sufficiency is a means of staying in the game (Cockburn 2002).
Agile methodologies embrace iterations. Small teams work together with stakeholders to
define quick prototypes, proof of concepts, or other visual means to describe the problem to
be solved. The team defines the requirements for the iteration, develops the code, and defines
and runs integrated test scripts, and the users verify the results.


Figure 3.1: A generic agile development process features an initial planning stage, rapid repeats
of the iteration stage, and some form of consolidation before release.




3.4.1 Two agile software development methodologies
The most widely used methodologies based on the agile philosophy are Extreme
programming and Scrum. These differ in particulars but share the iterative approach
described above.

3.4.2 Extreme Programming
This methodology concentrates on the development rather than the managerial aspects of a
software project. Extreme programming was designed so that organisations would be free to
adopt all or part of the methodology. It relies on constant code improvement, user
involvement in the development team and pair programming.
XP projects start with a release planning phase, followed by several iterations, each of which
concludes with user acceptance testing. When the product has enough features to satisfy
users, the team terminates the iterations and releases the software. The life cycle of XP consists
of six phases, namely: exploration, planning, iterations to release, production, maintenance and final release.
In the exploration phase, the customers write out story cards that they wish to be included in
the first release. Each story card describes a feature to be added into the program. At the same
time the project team familiarise themselves with the tools, technology and practices they will
be using in the project. The planning phase sets the priority for the stories and an agreement
of the contents of the first small release is made. The iterations to release phase includes
several iterations of the system before the first release. The schedule set in the planning
stage is broken down to a number of iterations that will each take one to four weeks to
implement. The first iteration creates a system with the architecture of the whole system. The
production phase requires extra testing and checking of the performance of the system before
the system can be released to the customer.
To create a release plan, the team breaks up the development tasks into iterations. The release
plan defines each iteration plan, which drives the development for that iteration. At the end of
each iteration, users perform acceptance tests against the user stories. If they find bugs, fixing the
bugs becomes a step in the next iteration.
XP has rules and concepts that govern it, some of which are described below. The first of
which is integrate often; it means development teams must integrate changes into the
development baseline at least once a day. This is also known as continuous integration.
Project velocity is another governing principle which is the measure of how much work is
getting done on the project. This metric drives release planning and schedule updates. Another
governing principle is the user story, which describes a problem to be solved by
the system being built. These stories must be written by the user and should be about three
sentences long. This is one of the main objections to the XP methodology, but also one of its
greatest strengths.

3.4.3 Scrum
This methodology follows the rugby concept of scrum, which is related to scrimmage, in the
sense of a huddled mass of players engaged with each other to get a job done. Scrum for
software development came out of the rapid prototyping community because prototyping
groups wanted a methodology that would support an environment in which the requirements
were not only incomplete at the start, but also could change rapidly during development.
Unlike XP, Scrum methodology includes both managerial and development processes.
After the team completes the project scope and high-level designs, it divides the development
process into a series of short iterations called sprints. Each sprint aims to implement a fixed
number of backlog items. Before each sprint, the team members identify the backlog items
for the sprint. At the end of a sprint, the team reviews the sprint to articulate lessons learned
and check progress.
The Scrum development process concentrates on managing sprints. Before each sprint
begins, the team plans the sprint, identifying the backlog items and assigning teams to these
items. Teams develop, wrap, review, and adjust each of the backlog items. During
development, the team determines the changes necessary to implement a backlog item. The
team then writes the code, tests it, and documents the changes. During wrap, the team creates
the executable necessary to demonstrate the changes. In review, the team demonstrates the
new features, adds new backlog items, and assesses risk. Finally, the team consolidates data
from the review to update the changes as necessary.

Scrum also has some rules and concepts that govern it, some are described below. Sprint
backlog is the list of backlog items assigned to a sprint, but not yet completed. In common
practice, no sprint backlog item should take more than two days to complete. The sprint
backlog helps the team predict the level of effort required to complete a sprint. Another
concept is Product backlog. It is the complete list of requirements including bugs,
enhancement requests, and usability and performance improvements that are not currently in
the product release.

3.5 Tools
In the development of the system there are tools that we employ so as to come up with a
system of high quality. Below are some of the tools needed in the development of our
system.

3.5.1 Unified Modelling Language
Requirements for a business are best met by modelling business rules at a very high level,
where they can be easily validated with clients, and then automatically transformed to the
implementation level. The Unified Modelling Language (UML) is now widely used for both
database and software modelling. It is used as a standard language for object-oriented
analysis and design and is also used to model the Natural Language C Cross Compiler front
end. UML's object-oriented approach facilitates the transition to object-oriented code hence
its use in this project.
The design models can either be static, describing the static structure of the system in terms
of object classes and relationships, or dynamic, describing the dynamic interactions of the
objects.
UML diagrams are used to depict system requirements and functionality, and some of these
diagrams are used to show what this system does and the system goals in the design phase.
The UML diagrams used in the analysis phase are activity diagrams, sequence diagrams and
use case diagrams. In the design phase, class diagrams and Entity Relationship diagrams are
used.
Two more UML diagrams come into play during the implementation stage; these are
deployment diagrams and component diagrams. Deployment diagrams are implementation
level diagrams that show how the hardware and software elements that make up this
application are configured and set into operation. Component diagrams are also
implementation level tools that show the structure of the code be it source code files,
executable files or binary code files connected by dependencies.

3.5.2 Why UML
Although UML has no specification for the modelling of user interfaces, no way to formally
specify serialisation and object persistence, and no way to specify that an object resides on a
server process and is shared among instances of a running process, it is the chosen modelling
language for this project. This is because UML offers all the benefits of Object Orientated
development such as inheritance and polymorphism. It also helps to communicate and
explore potential designs, and to validate the architectural design of the software.
Importantly, it uses simple, intuitive notation, so non-programmers can also understand its
models.

3.5.3 Other Tools
A text editor with source code formatting is used for this project, specifically Notepad++. It
was chosen because of its simplicity and its adaptability in being usable for multiple
languages, offering a quick switch between them. This project is developed using three
languages, C, Java and some assembly language, hence the need to shift between languages
frequently during the development stage.

Other than the standard development environments, this project uses open source scanner and
parser generators, mainly LEX and YACC on Linux distributions and Flex and Bison (a
variation of YACC) on Windows.

3.6 Preferred Methodology and Justification
The chosen methodology is extreme programming, a type of agile development methodology.
It has been chosen because it concentrates on the core development rather than the managerial
aspects of a software project, and this project has a large number of programming processes
which are given more focus by this methodology. Extreme programming promotes the fast
development of a software product by dividing the whole
project into small components which are developed in iterations. By promoting fast
development with the use of iteration, it simultaneously promotes the production of a high
quality product since the modules are independently produced in the iterative process.

3.7 Conclusion
Here we discussed the different methodologies that can be used in the research and
development of the Natural Language C Cross Compiler, giving both the advantages and
disadvantages of each. The development of the Natural Language C Cross Compiler can be
modularised and it has a lot of programming processes hence the chosen methodology.
Justification for not choosing the other methodologies is also outlined in this chapter. Included
as well are the tools that are used in the development of the system.






Chapter 4
System Analysis and Design
4.1 Introduction
Systems analysis is the dissection of a system into its component pieces to study how those
pieces interact and work, with a view to changing or improving the existing system. We do a
systems analysis to subsequently perform a systems synthesis, which is the re-assembly of a
system's component pieces back into a whole and, it is hoped, improved system.
Traditionally, systems analysis is associated with application development
projects, that is, projects that produce information systems and their associated computer
applications. Systems analysis methods can be applied to projects with different goals and
scope. In addition to single information systems and computer applications, systems analysis
techniques can be applied to strategic information systems planning and to the redesign of
business processes. There are also many strategies or techniques for performing systems
analysis. They include modern structured analysis, information engineering, prototyping, and
object-oriented analysis.

4.2 Requirements Elicitation
During this phase we gathered system requirements using a number of techniques to ensure
unambiguity, completeness, consistency, correctness and verifiability of the requirements
both non-functional and functional. Methods used include interviews, questionnaires and
examination of similar existing systems. This stage is of utmost importance as the system is
built directly from these requirements.

We conducted interviews with a number of programmers from the University in an attempt to
establish the desired system functionality. The output formed the basis for the structure of the
input to the system, that is, natural language.
We had a brainstorming session with fellow students passionate about programming, including
an expert programmer currently at e-solutions private limited. These sessions were intended to
complement the interviews in trying to come up with solid system requirements that do not
detract from set standards of software development.


4.2.1 Needs Analysis
Compilers currently in existence give feedback via a character user interface or, for compilers
integrated within Integrated Development Environments (IDEs), via a graphical user
interface. These compilers give information about the errors encountered, either semantic or
syntactic, by indicating the actual location of the error by line number and, in most IDEs such
as NetBeans, by underlining it in colour.

From the interviews and brainstorming sessions we had, a point was raised that errors are
more easily seen when colours are used, as in IDEs and text editors with source code
formatting. We then decided that there is a need for the Natural Language C Cross Compiler
to have a graphical user interface with a text area for the input and a portion for displaying
results or errors.

The user interaction via the graphical user interface prompted the requirement that the editor
should support the basic operations of a text editor. These were designated to be cut, copy,
paste and opening external files. Apart from the basic text editor operations, a point raised
was that the compiler should have a menu with a compile and build function.

Because the input is natural language, a basic grammar checking facility was introduced.
This checks British English and is able to give suggestions on spelling mistakes and simple
verb-noun agreement. It helps alert the programmer to what the compiler will treat as a
symbol when the words used are not recognised by the grammar checker.
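
A minimal sketch of how such a check can be driven with the LanguageTool library is shown below; the GrammarChecker class in Appendix A wraps the same calls, while the example sentence and the wrapper class here are illustrative only:

import java.util.List;
import org.languagetool.JLanguageTool;
import org.languagetool.language.BritishEnglish;
import org.languagetool.rules.RuleMatch;

// Illustrative only: checks one line of program input against British English
// rules and prints a suggestion for each perceived error, as the editor does.
public class GrammarCheckDemo {
    public static void main(String[] args) throws Exception {
        JLanguageTool tool = new JLanguageTool(new BritishEnglish());
        tool.activateDefaultPatternRules();
        // deliberate misspelling of "numbers" to trigger a spelling suggestion
        List<RuleMatch> matches = tool.check("take two numbrs x and y, add the numbers");
        for (RuleMatch match : matches) {
            System.out.println("Possible error between characters "
                    + match.getFromPos() + " and " + match.getToPos()
                    + ": " + match.getMessage()
                    + "; suggestions: " + match.getSuggestedReplacements());
        }
    }
}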

The suggested Graphical User Interface is given in Figure 4.1.
Figure 4.1 Suggested Interface
4.3 Feasibility study
Feasibility is the state or degree of a project being easily or conveniently done. A feasibility
study is an evaluation and analysis of the potential of a proposed project, based on extensive
investigation and research (Georgakellos et al., 2009). A feasibility study is done to support
the process of decision making. This study is essential in systems development because it is
done before the system is developed, and hence the sponsors of the system, the users and the
developers can conclude from it whether the development should proceed or not. Outlined
below are the feasibility findings for the development of the Natural Language C Cross
Compiler.

4.3.1 Economic Feasibility
In the development of the Natural Language C Cross Compiler the main input is time.
Monetary input is very low and there is no financial reason why the development of this
system should not proceed. The system is therefore found to be economically feasible, and
the development proceeds.
4.3.2 Technical Feasibility
The Natural Language C Cross Compiler is developed at the National University of Science
and Technology (NUST) as part of the requirements for the fulfilment of the B.Sc. (Honours)
Degree in Computer Science. The necessary development tools, for instance the Java
compiler, are readily available free of charge, making the development process technically
feasible. The system is also developed as a research and training exercise for the university,
which further supports its technical feasibility.
4.4 Requirements specification
The requirements for the Natural Language C Cross Compiler can be divided into two
categories which are the functional requirements and the non-functional requirements.
Functional Requirements are services that the system should provide, how the system should
react to particular inputs and how the system should behave in particular situations
(Sommerville, 2010). They depend on the type of software being developed and can be sub-
divided into input, processing and output requirements. Functional Requirements can be
further subdivided into functional user requirements and functional system requirements.
Functional user requirements are a high level description of what the system should do while
functional system requirements describe the system service in detail. In order to produce
quality software in a software development project, it is essential to elicit all system
requirements and clearly understand them.
Non-functional requirements are constraints on the services or functions offered by the
system. They include timing constraints and constraints on the development process and
standards. Non-functional requirements often apply to the system as a whole and also relate
to the performance that will be required of the system and the technologies that will be used
for its development. They do not usually apply just to individual system features or services.
They are also known as quality requirements.

4.5 System Requirements
These describe the functionality or system services the system should provide, how the
system responds to certain input and how the system should behave in particular situations.
There are functional user requirements and functional system requirements. Functional user
requirements are a high level description of what the system should do, whereas functional
system requirements describe the system services in detail. Non-functional requirements
describe other characteristics of the product. There are several categories of these
requirements: constraints, external interface requirements, performance requirements and
quality attributes. To produce quality software in a software development project, it is
essential to solicit all system requirements and clearly understand them.
The non-functional requirements of the system entail that the system should cater for the
different types of users who interact with it. It should be compatible with a range of software
and hardware specifications available on entry level mainstream personal computers. The
system should be easily portable, should be available when needed, and should provide a
degree of extensibility and modifiability to enable future improvements.
Non-functional requirements include the hardware and software specifications including the
functional environment for the system, performance requirements and user interface
requirements. This system was developed and tested on an Intel x86 based machine with
2 GB of RAM and a 1.9 GHz dual-core processor, running Windows 7 with Java Runtime
Environment (JRE) 1.7. It was also tested on a Linux platform running CentOS. There is no
guarantee that it runs on later or earlier versions of Windows, as it was not tested on such.
Performance requirements describe the system properties and constraints such as reliability,
response time and size or capacity constraints on the data to be processed by the system. The
system should compile a program in a reasonable amount of time, with 30 seconds as the
upper bound for a complex program.
The user interface is a means of interaction between the user and the system. This system
should have a graphical interface with minimal buttons and menus, while providing graphical
feedback if errors are encountered during compilation.
Functional requirements are a high level description of what the system needs to do. In the
case of this system, the Natural Language C Cross Compiler compiles natural language input,
generating equivalent object code executable on the main platform. Some compilers generate
assembly language as the compiled output, and some go a step further and assemble the
assembly language into object code for a targeted platform.

4.6 System Design
System Design transforms a logical representation of what a system is required to do from
requirements specification into a physical specification. In object oriented design the
emphasis is placed on defining software objects and how they collaborate to fulfil the
requirements. In this section the author illustrates his design through the use of UML
diagrams.
4.6.1 Architectural Design
Architectural design is a graphical representation of major system components and how they
communicate with each other. Below is the design for the Natural Language C Cross
Compiler.

Figure 4.2 Architecture Diagram for the Natural Language C Cross Compiler
The compiler accepts user input as natural language text and performs basic grammar checks,
that is, simple verb-noun agreement as imposed by British English grammar. After the checks
are complete it generates user-defined intermediate code. This is a C program that is used in
the next step of parsing.

The patterns in the above diagram are a file created using a text editor. LEX reads the
patterns and generates C code for a lexical analyser or scanner. The lexical analyser matches
strings in the input, based on the defined patterns, and converts the strings to tokens. Tokens
are numerical representations of strings, and simplify processing.
When the lexical analyser finds identifiers in the input stream, it enters them into a symbol
table. The symbol table may also contain other information such as data type (integer or real)
and the location of each variable in memory. All subsequent references to identifiers refer to
the appropriate symbol table index.
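
The scanner itself is generated in C by LEX, but the symbol table idea can be sketched in a few lines of Java for illustration; the names below are ours and are not taken from the generated code:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative symbol table: each identifier found by the scanner is entered once,
// and every later reference uses its index, as described above.
public class SymbolTable {
    private final List<String> names = new ArrayList<String>();                   // index -> identifier
    private final Map<String, Integer> indices = new HashMap<String, Integer>();  // identifier -> index

    // Returns the table index of the identifier, inserting it on first sight.
    public int lookupOrInsert(String identifier) {
        Integer index = indices.get(identifier);
        if (index != null) {
            return index;
        }
        names.add(identifier);
        index = names.size() - 1;
        indices.put(identifier, index);
        return index;
    }

    // Returns the identifier stored at a given index.
    public String nameAt(int index) {
        return names.get(index);
    }
}
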
The grammar in the above diagram is a text file created with a text editor. YACC reads the
grammar and generates C code for a syntax analyser. The syntax analyser uses grammar rules
that allow it to analyse tokens from the lexical analyser and create a syntax tree. The syntax
tree imposes a hierarchical structure on the tokens. The next step, code generation, does a
depth-first walk of the syntax tree to generate code.
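
Again purely as an illustration (the real syntax tree is built by the YACC-generated C parser), a depth-first, code-generating walk over a small expression tree might look as follows, with invented instruction names:

// Illustrative expression tree node with a depth-first, post-order walk that
// emits stack-machine style pseudo-instructions; the instruction names PUSH and
// APPLY are invented for this example and do not come from the generated parser.
class ExprNode {
    final String value;          // operator symbol or operand name
    final ExprNode left, right;  // both null for a leaf

    ExprNode(String value, ExprNode left, ExprNode right) {
        this.value = value;
        this.left = left;
        this.right = right;
    }

    // Post-order (depth-first) walk: children first, then this node.
    void emit() {
        if (left == null && right == null) {
            System.out.println("PUSH " + value);
            return;
        }
        left.emit();
        right.emit();
        System.out.println("APPLY " + value);
    }
}

class EmitDemo {
    public static void main(String[] args) {
        // The tree for "x + y" emits: PUSH x, PUSH y, APPLY +
        ExprNode sum = new ExprNode("+", new ExprNode("x", null, null), new ExprNode("y", null, null));
        sum.emit();
    }
}
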
After parsing has been performed on the input, the next step is code generation, which
produces the object code that can be executed on the main platform.

4.6.2 Class Diagram for the compiler front end
A class diagram is a UML diagram used to define software objects, their attributes and their
methods. Figure 4.3 shows the class diagram for the front end of the Natural Language C
Cross Compiler.



Figure 4.3: Class Diagram for the Natural Language C Cross Compiler front-end.

4.6.3 Sequence Diagram
A sequence diagram is a dynamic model that details how operations are carried out, what
messages are sent to which object and when they are sent. This is achieved by arranging
objects horizontally at the top from left to right, with time represented vertically so that the
model is read from top to bottom. Sequence diagrams show object interactions arranged in
time sequence and depict the objects and classes involved in the scenario and the interactions
between the objects needed to carry out the functionality of the system. Figure 4.4 is the
compound sequence diagram of our system that we came up with to show the sequence of
activities.





[Sequence diagram content: lifelines for the Programmer, the Compiler and the File System; messages for input, warnings and errors, saving the intermediate, assembly and object code, issuing a run command, running the compiled program, and exit.]
Figure 4.4: Sequence Diagram for the Natural Language C Cross Compiler
In Figure 4.4, the user/programmer enters input as a natural language program. After issuing
the compile command, the system gives feedback as warnings or errors. If there are no errors,
the compiler saves the intermediate representation, that is, the object code.




4.7 Conclusion

The current chapter covered the aspects of requirements gathering and design of the Natural
Language C Cross Compiler. Blueprints of the system, from the Graphical User Interface to
the core of the Compiler, were produced using Object Oriented Analysis and a number of
methods of structured analysis that included prototyping. At this juncture there is a better
understanding of the requirements both functional and non-functional. The economic and
technical feasibility studies yielded results that favoured the continuation of the project into
the implementation phase.
















Chapter 5
Implementation and Testing
5.1 Introduction
Implementation is the translation of the design model into a functional system by coding the
various system modules that interact together to form subsystems which in turn collaborate to
form the whole system that satisfies the user requirements. We describe the tools that we used
to come up with the system and we go a step further to describe how the system is deployed
and how it undergoes testing. We conclude the chapter by explaining the
major functionalities of the system with the aid of screenshots where possible.

5.2 Tools
A text editor with source code formatting is used for this project, specifically Notepad++. It
was chosen because of its simplicity and its adaptability in being usable for multiple
languages, offering a quick switch between them. This project is developed using three
languages, C, Java and some assembly language, hence the need to shift between languages
frequently during the development stage.

Other than the standard development environments, this project uses open source scanner and
parser generators, mainly LEX and YACC on Linux distributions and Flex and Bison (a
variation of YACC) on Windows.

5.2.1 Programming Language
The Natural Language C Cross Compiler was implemented using Java for the interfaces and
the compiler front end, and C and assembly language for the compiler back end. Java was
used mainly because of the availability of an API to the Stanford parser, which offers a
Probabilistic Context Free Grammar (PCFG) model for the natural language parsing
performed before intermediate code generation.
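
A condensed sketch of how the Stanford parser is driven from Java is given below; the full Tokeniser class appears in Appendix A, the model path and options mirror that listing, and the wrapper class itself is illustrative:

import java.io.StringReader;
import java.util.List;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.Tokenizer;
import edu.stanford.nlp.trees.PennTreebankLanguagePack;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreebankLanguagePack;

// Illustrative driver: tokenises one natural language instruction with the
// englishPCFG model and prints its parse tree, as the front end does before
// intermediate code generation.
public class ParseDemo {
    public static void main(String[] args) {
        LexicalizedParser parser = LexicalizedParser.loadModel(
                "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
                "-maxLength", "80", "-retainTmpSubcategories");
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        Tokenizer<? extends HasWord> tokenizer = tlp.getTokenizerFactory()
                .getTokenizer(new StringReader(
                        "take two numbers x and y, add the numbers and display result"));
        List<? extends HasWord> sentence = tokenizer.tokenize();
        Tree parse = parser.apply(sentence);
        parse.pennPrint();  // prints the phrase-structure tree
    }
}
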
The choice of C was made mainly because the compiler is based on the structures that are in
the C language, thus it would be faster to develop the core of the compiler in C.
5.2.2 Development Environment

A text editor is software that is used by most developers for basic editing of source files.
Some text editors support source code formatting for multiple languages. Notepad++, along
with the GNU gcc 4.8.0 compiler and Java version 1.7, is used for this project. Java was
developed by Sun Microsystems and is open source.

5.3 Deployment Architecture
The compiler consists of several major components: a graphical user interface for interaction
between the user and the system, an intermediate code generator written in Java, a parser
developed using YACC specifications and C, a lexical dictionary developed using LEX, an
assembly language code generator and an assembler. The parser and the assembly language
code generator are intertwined to form one module.
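
This chapter does not prescribe how the Java front end hands its intermediate C program to the C back end; one plausible arrangement, sketched below with an invented file name and back-end command, is to write the intermediate program to disk and invoke the back-end tool chain as an external process:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical glue code: the file name "intermediate.c" and the back-end
// command "./nlcc-backend" are assumptions made for illustration, not part of
// the actual deployment described in this chapter.
public class BackendInvoker {
    public static int compileIntermediate(String intermediateCode) throws IOException, InterruptedException {
        Path source = Paths.get("intermediate.c");
        Files.write(source, intermediateCode.getBytes());

        ProcessBuilder builder = new ProcessBuilder("./nlcc-backend", source.toString());
        builder.inheritIO();          // surface back-end messages to the user
        Process process = builder.start();
        return process.waitFor();     // 0 indicates a successful compile
    }
}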

5.4 System Functionality
The Natural Language C Cross Compiler provides user interaction through a Graphical User
Interface. The system users give input to the system and receive feedback as colour coded
errors, both by highlighting where errors occur and in the form of text in the output text area,
as shown in Figure 5.1.
public void HighLt(int index, int end) {
    try {
        // mark the offending span in the editor using the shared highlighter
        hilit.addHighlight(index, end, painter);
        // move the caret to the end of the highlighted word
        jTextArea.setCaretPosition(end);
        jTextArea.setBackground(entryBg);
        index = 0; end = 0;
    } catch (BadLocationException e) {
        e.printStackTrace();
    }
}
Listing 5.1: Function for error highlighting in colour

Listing 5.1 gives the implementation of the highlighting function, which is called for each
word or token perceived as an error, from the starting point of the word marked by index to
the end marked by end.
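
For instance, the matches returned by the grammar checker of Appendix A can be fed to HighLt one match at a time; the method below is a sketch of that call pattern rather than a verbatim excerpt from the system:

// Sketch of the call pattern: highlight every grammar error reported for the
// current editor contents. checker is the GrammarChecker of Appendix A and
// HighLt is the method of Listing 5.1; both are assumed to live in the editor's
// window class alongside jTextArea.
private void highlightGrammarErrors(GrammarChecker checker) throws Exception {
    checker.getString(jTextArea.getText());        // hand the editor text to the checker
    for (RuleMatch match : checker.getErrors()) {  // one match per perceived error
        HighLt(match.getFromPos(), match.getToPos());
    }
}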

5.4.1 Graphical User Interface
The system provides a basic editor for program input, with basic functionality such as copy,
paste and cut. Using the LanguageTool library, the editor can perform British English
grammar checking and provides feedback by highlighting words with perceived errors in
green. The explanations, warnings and suggestions for a possible spelling error are all given
in the output textbox just below the editor.

Figure 5.1: The interface

5.5 Software Testing

System testing is a crucial stage in software development with the objective of verifying and
validating that the application meets the user requirements. Testing of the system is an
iterative process where each logical unit of code is tested. Upon successful testing of each
unit, the module is deployed for integration testing with other modules of the system, which
would have undergone testing as well.

5.5.1 Unit Testing
We conducted unit testing throughout the development phase. The objective of unit testing is
to test the modules of the application separately. These modules should meet a written
contract that the piece of code must satisfy. Each module was tested individually before
proceeding to the next module. We tested whether each module's code was meeting the
expected requirements, for example, whether the grammar checking module was correctly
flagging spelling errors and returning sensible suggestions.
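
As an example of the kind of check written for the grammar checking module, the test below is a sketch that assumes JUnit 4 on the classpath and the GrammarChecker of Appendix A; it asserts that a sentence containing an obvious spelling error produces at least one rule match:

import static org.junit.Assert.assertFalse;

import java.util.List;
import org.junit.Test;
import org.languagetool.rules.RuleMatch;

// Sketch of a unit test for the grammar checking module of Appendix A.
public class GrammarCheckerTest {

    @Test
    public void misspelledWordIsReported() throws Exception {
        GrammarChecker checker = new GrammarChecker();
        checker.getString("add the two numbrs and display the result");  // deliberate misspelling
        List<RuleMatch> errors = checker.getErrors();
        assertFalse("a spelling error should be reported", errors.isEmpty());
    }
}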

5.5.2 Integration testing
Integration testing is the phase of software testing in which the individual software modules
that were tested during the unit testing phase are combined and tested as a group. Integration
testing took as its input modules that had passed unit testing, grouped them into larger
aggregates, applied tests defined in an integration test plan to those aggregates, and delivered
as its output the integrated application ready for system testing. The purpose of integration
testing is to detect any inconsistencies between the software units that are integrated together.

5.5.3 System Testing
System testing is testing that we conducted on the complete, integrated Natural Language C
Cross Compiler to evaluate the system's compliance with its specified requirements. System
testing took as its input all of the integrated software components that had successfully
passed integration testing. The aim behind system testing was to detect defects both within
the separate modules and within the system as a whole. We were assisted by some friends in
testing whether the system met the initially specified requirements.

5.6 Conclusion

A detailed account of the activities that took place in the implementation phase has been
given here, including code snippets where necessary. The system was tested thoroughly, but
this does not eliminate bugs that could have been overlooked during this stage of
development.












Chapter 6
Recommendations and Conclusions
6.1 Introduction
This chapter reviews and concludes what transpired during the project development. It also
highlights the accomplishments, challenges and recommendations for future work.

6.2 Classification of the System
After analysing and implementing the system, we classified it as a natural language interface
to programming, albeit one with very limited capabilities.

6.3 Review of the Project's Aim and Objectives
During the project introduction an aim and objectives were set. The success of a project is
dependent on the ability to accomplish these. It is essential that most or all of the objectives
are met and we managed to meet the following objectives:
1. To compile natural language text as program input
2. To perform syntax error checks on input
3. To perform grammatical error checks on input
4. To generate equivalent object code executable on the main platform

6.4 Challenges Encountered
In the development of the natural language compiler, we encountered a few challenges that
hindered our progress and limited our capabilities. We had limited time in developing the
system given the initial scope that we had set for the project; the scope needed more
development time, thus we could not accomplish all our initial objectives. Apart from the
broad scope of the project, we had to start the research into the project late, as late as the end
of the first semester. This was not ideal, as the project is meant to be a course that spans two
academic semesters; in practice the project was started in the final semester. To successfully
implement the project a number of additional concepts, for example natural language
processing, had to be mastered, hence more time should be allocated to enable
adequate research. Given this fact, more time should be allocated from the initial project
proposals to the date of final presentations. In our view this can help in yielding high quality
projects and perhaps fuel enthusiasm in students to pursue new concepts, in line with the
University's goal of becoming a leading institution in research.

6.5 Recommendations for Future Work

We highly recommend that future developers of similar systems consider an in-depth
analysis of Natural Language Processing, as the programming is highly dependent on the
processing. Future researchers should also make provision for a code optimisation phase
after code generation, as this improves the efficiency of the resultant programs.

6.6 Conclusion
The opportunity we had in working on this research project has opened our minds to the
possibilities of research. It is a great endeavour that has seen us apply the concepts we learnt
during the course of the programme. Beyond the taught theories and concepts, the project
helped us embrace new technologies and see how they can be applied for the greater good.





References

Ballard, B. and Biermann, A. (1979), Programming in natural language: NLC as a prototype.
In Proceedings of the 1979 annual conference of ACM/CSC-ER
Bates, M. and Weischedel R, (1993), Challenges in natural language processing.
Cambridge University Press, Cambridge.
Bod, R. (1998). Beyond Grammar: An Experience-Based Theory of Language, Cambridge
University Press.
Dijkstra E, (1979), On the foolishness of Natural Language Programming. In Program
Construction, International Summer School
Guzdial, M. & Soloway, E. (2002) Log on education: teaching the Nintendo generation to
program. Communications of the ACM, 45(4), 17-21.
Kate R, Y Wong and R Mooney, (2005), Learning to transform natural to formal
languages. In Proceedings of the Twentieth National Conference on Artificial Intelligence
(AAAI-05) Pittsburgh
Lieberman H and H Liu, (2005), Feasibility studies for programming in natural language.
Kluwer Academic Publishers.
Liu H and H Lieberman, (2005) Metafor: Visualising stories as code. In ACM Conference
on Intelligent User Interfaces, San Diego
Liu H and H Lieberman, (2005), Programmatic semantics for natural language interfaces.
In Proceedings of the ACM Conference on Human Factors in Computing Systems, Portland.
Mark Halpern, (1966) Foundations of the case for natural-language programming. In
AFIPS '66 (Fall): Proceedings of the November 7-10, fall joint computer conference, New
York.
Miller, S, (2000), A Novel Use of Statistical Parsing to Extract Information from Text. In
Proceedings of ANLP.
Pane J Ratanamahatana, C and B Myers, (2005), Studying the language and structure in
non-programmers solutions to programming problems. International Journal of Human-
Computer Studies
Saint-Dizier, P. and Viegas E, (1995), Computational lexical semantics., Cambridge
University Press, Cambridge

Singh P, (2012) The Open Mind Common Sense project. http://www.kurzweilai.net/. 09
November 2012
Sommerville, I. (2010), Software Engineering, Addison-Wesley
Tang L and R Mooney, (2001), Using multiple clause constructors in inductive logic
programming for semantic parsing. In Proceedings of the 12th European Conference on
Machine Learning, Freiburg, Germany


















APPENDICES
APPENDIX A
public class GrammarChecker {
    BritishEnglish BritE = new BritishEnglish();
    AmericanEnglish AE = new AmericanEnglish();
    JLanguageTool lang;
    String language = "en-UK";
    List<RuleMatch> compareVal;
    String RawText = "default test string for the tokeniser"; // to avoid null pointers

    public void LoadLanguage(String language1) throws IOException {
        // string contents must be compared with equals(), not ==
        if ("en-UK".equals(language1)) {
            lang = new JLanguageTool(BritE);
            lang.activateDefaultPatternRules();
        } else if ("en-US".equals(language1)) {
            lang = new JLanguageTool(AE);
            lang.activateDefaultPatternRules();
        } else {
            lang = new JLanguageTool(BritE);   // default to British English
            lang.activateDefaultPatternRules();
        }
    }

    public String getString(String code) {
        /**
         * gets the program code from the input box
         */
        this.RawText = code;
        return this.RawText;
    }

    public void CheckText() {
        /**
         * This is the method that checks the grammar;
         * it uses the value of RawText, set to a test string by default
         */
        try {
            compareVal = lang.check(RawText);
        } catch (Exception e) {
            System.out.println("failed to check text");
            e.printStackTrace();
        }
    }

    public List<RuleMatch> getErrors() throws Exception {
        /**
         * this checks for errors in the given program
         */
        LoadLanguage(language);
        compareVal = lang.check(RawText);
        return compareVal;
    }
}
Listing A1: Grammar Checker implementation using Language Tool 2.0



class Tokeniser {
    String Grammar = "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz";
    String[] options = { "-maxLength", "80", "-retainTmpSubcategories" };
    LexicalizedParser lp = LexicalizedParser.loadModel(Grammar, options);
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    String RawInput = "take two numbers x and y, add the numbers and display result";
    Iterable<List<? extends HasWord>> input;
    Tree pas;
    Tree parse;
    Tokenizer<? extends HasWord> tokens;

    public void setTokens(String input) { // working well
        // list of the words
        tokens = tlp.getTokenizerFactory().getTokenizer(new StringReader(input));
        RawInput = input;
    }

    public Iterable<List<? extends HasWord>> tokeniseSent(Tokenizer<? extends HasWord> token) { // done
        List<? extends HasWord> sentence = token.tokenize();
        List<List<? extends HasWord>> tmp = new ArrayList<List<? extends HasWord>>();
        tmp.add(sentence);
        input = tmp;
        return tmp;
    }

    public void printTree() {
        if (input != null) {
            for (List<? extends HasWord> sentence : input) {
                Tree parse = lp.apply(sentence);
                parse.pennPrint(); // this prints the tree
                System.out.println();
            }
        } else {
            System.out.println("input is null");
        }
    }

    public void printTagged() {
        if (input != null) {
            for (List<? extends HasWord> sentence : input) {
                Tree parse = lp.apply(sentence);
                System.out.println();
                System.out.println(parse.taggedYield()); // tagged yield to be sent to file
            }
        } else {
            System.out.println("input is null");
        }
    }

    public Collection printScoredOut() { // working very well
        Collection c = null;
        for (List<? extends HasWord> sentence : input) {
            Tree parse = lp.apply(sentence);
            GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
            Collection tdl = gs.typedDependenciesCCprocessed(true);
            c = tdl;
        }
        return c; // collection of typed dependencies
    }

    public Tree parseSent(Iterable<List<? extends HasWord>> sent) {
        for (List<? extends HasWord> sentence : sent) {
            parse = lp.apply(sentence);
        }
        return parse;
    }

    public void setGrammar(String grammar, String Options[]) {
        Grammar = grammar;
        options = Options;
    }

    public String getSentence(String sent) {
        RawInput = sent;
        return RawInput;
    }

    public void setSentence() {
        RawInput = "";
    }

    public Tokenizer<? extends HasWord> getTokens() {
        return tokens;
    }
}
Listing A2: Implementation of the tokeniser using the Stanford parser
