
ISSUES IN ENGINEERING FIT-FOR-PURPOSE NATURAL LANGUAGE PROCESSING SYSTEMS

ALEXANDER CANHAM
University of the West of England

Abstract: This paper reviews the significant problems in engineering commercially viable natural language processing systems. Issues regarding inadequacies in training experience and the difficulty of maintenance are raised. In addition, meeting the functional business requirements for natural language technology is found to be difficult because the current limitations of NLP methods are not clearly understood. The paper endorses agile methods and suggests revisions to training data engineering for future commercially viable NLP systems.

Introduction

Computer systems engineering has produced systems in every domain of information technology (IT). While a variety of engineering protocols are used in software development, many suffer from a lack of scope beyond functional system requirements [Gerlach & Kuo, 1991]. This paper considers the implications of developing software systems that are intrinsically linked to complex human cognitive faculties and behaviour. Engineering a natural language processing system is an extremely complex undertaking, regarded as an AI-complete problem in artificial intelligence; the issues behind its complexity lie in questions that require further exploration in fields such as mathematics and neurobiology. While completely accurate systems are far from realised, natural language (NL) computational methods already exist in many state-of-the-art technologies with relative success. This paper evaluates the difficulties in engineering a satisfactory NLP system that can be considered fit for purpose, with particular regard to the issues involved in engineering a sufficiently wide-coverage training experience from which the software system learns. The hypothetical requirements for such a fit-for-purpose system are considered, as are the current difficulties in measuring system performance, and optimal engineering procedures for NL computation are proposed.

Fundamentals

Natural language processing has become a widely researched and exciting subfield of computer science, linguistics and information studies. The rapid growth of research into NLP is largely due to the field's significance in developing new language technologies for scientific, economic, social and cultural applications [Bird et al, 2009]. The field has long been researched within the academic community; recently, however, its theories and methods have begun to be deployed in industry, largely in connection with human-computer interaction. Many large companies are endeavouring to add automatic question-and-answer applications to their existing business information systems, as well as employing sophisticated NLP methods for sentiment analysis, mining the web to learn what is being said about their products [Kaiser, 2009].

A natural language interface accepts NL user input, allowing interaction with some system. Typical applications that currently employ NLP are information retrieval systems that return output based on language-string query statements [Daoud et al, 2011: 249], as well as machine translators, spell checkers and speech recognition software. Most of these systems are restricted in their ability to translate and comprehend grammatically and lexically non-standard language structures that contain ambiguity. This is largely because machines have no real-world knowledge with which to recognise the context that humans regularly refer to. Machine learning has become the standard method of engineering NLP systems within the last fifteen years [Navigli, 2009]. Previously, NLP systems were constructed using rule-based techniques that required the meticulous building of hand-crafted grammars, which showed little consideration for real natural language use. Early research at the IBM Watson Research Centre concluded that rule-based approaches were untenable, and statistical methodology began to be considered. Today's NLP systems require electronic corpora to train their classification methods. Electronic instances of natural language are in abundance; however, depending on the training experience, different systems require different types of data. A machine learning methodology involves supplying a system with data in order to train a classifier that has been coded to analyse certain properties, or features, of the training data, depending on the type of learning experience that has been specified. Learning experiences are partitioned into supervised, unsupervised and semi/minimally-supervised categories, depending on whether the classifier is trained from human-annotated or raw data. It is up to the developers, prior to the design stage, to specify a fine-grained set of requirements so that a satisfactory amount and variety of training data can be engineered. It should be noted that, since text can contain information at many different granularities, from simple word- or token-based representations, to rich hierarchical representations, to high-level logical representations across document collections, selecting the right level of analysis can be difficult. Nevertheless, it is well understood that in order to develop a successful wide-coverage NLP application, the system should be trained with as much data as is economically possible.

Significant Issues

It has been argued that NLP systems that are to be considered fit for purpose must be engineered to learn from a corpus sufficiently large to give wide coverage, where wide coverage is defined as the ability to understand some optimal quantity of language use. In NLP, data sparsity is the term used to describe the phenomenon of not observing enough data in a corpus to model language accurately. In other words, true observations about the distribution and pattern of language cannot be made because there is not enough data to see the true distribution [Allison, 2006]. This problem is typically exhibited by systems engineered to a supervised learning specification. Supervised methods require large databases of tagged material. For instance, word sense disambiguation systems, which attempt to correctly classify the sense of a polysemous word (one that may exhibit several distinct meanings depending on its context), require computational lexicography to engineer corpora that facilitate the automatic classification of senses from sense tags supplied in the corpus [Bates, 1993].
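To make the supervised setting concrete, the following is a minimal illustrative sketch, not a description of any published system, of training a word sense disambiguation classifier on sense-tagged data using the NLTK toolkit of Bird et al [2009] and its bundled Senseval corpus; the choice of a naive Bayes classifier, the feature window size and the 90/10 split are arbitrary assumptions made for illustration.

```python
# Minimal supervised WSD sketch using NLTK (Bird et al, 2009).
# Assumes the sense-tagged corpus has been fetched via nltk.download('senseval').
import random

import nltk
from nltk.corpus import senseval


def wsd_features(inst, window=3):
    """Bag-of-neighbouring-words features around the ambiguous target word."""
    context = [w[0] if isinstance(w, tuple) else w for w in inst.context]
    feats = {}
    for offset in range(-window, window + 1):
        i = inst.position + offset
        if offset != 0 and 0 <= i < len(context):
            feats['word_at_%+d' % offset] = context[i].lower()
    return feats


# Sense-tagged instances of the polysemous adjective "hard".
instances = senseval.instances('hard.pos')
data = [(wsd_features(inst), inst.senses[0]) for inst in instances]

random.seed(0)
random.shuffle(data)
split = int(0.9 * len(data))            # arbitrary 90/10 train/test split
train_set, test_set = data[:split], data[split:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print('Held-out accuracy: %.3f' % nltk.classify.accuracy(classifier, test_set))
```

The point of the sketch is the dependency it exposes: every labelled training pair comes from hand-annotated, sense-tagged material, which is exactly the kind of resource whose limited size is discussed next.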
Such sense-tagged corpora are currently orders of magnitude smaller than their untagged counterparts, such as the British National Corpus (BNC), which contains over one hundred million English words. The question should therefore be posed whether an NLP system engineered with supervised methods can be considered fit for purpose when it relies on training material orders of magnitude smaller than an unsupervised approach has to offer. To quantify this, many supervised systems currently train their classifiers on data compiled from the Wall Street Journal (WSJ) portion of the Penn Treebank. That corpus contains over a million words of annotated text, collected from a few thousand selected stories from 1989 WSJ material [Marcus et al, 1993]. The benefits of using such a database are obvious: the material is supplied with syntactic tags and semantic information, making it easier, and arguably more efficient, to train a classifier. However, one should note the limited coverage of the corpus. A system trained on one genre of language is likely to fail in a real-world setting because of the limited amount of linguistic variation it is able to understand; supplying a system with as much variation and data as possible appears to be intrinsically related to functional performance. By contrast, the one hundred million-word BNC general corpus is compiled from instances of language, both spoken and written, from a wide variety of genres, making for a more adequate, albeit more difficult, training experience, since the system has no supplied annotation to learn from. Generally, unsupervised methods do not attempt to apply strict rules in their language models; instead they use clustering methods to group similar elements of the training data in order to build heuristic models. Within academia, their performance is generally considered substandard when compared with supervised approaches [Mitchell, 1997]. To illustrate further, NLP is largely represented within the academic community through the SensEval/SemEval evaluation exercises organised by the ACL (Association for Computational Linguistics), which give teams an opportunity to present their NLP systems [Navigli, 2009]. Systems presented at SensEval/SemEval have produced impressive results, with supervised methods achieving average accuracy rates of 85% when attempting to disambiguate all English words in the supplied test materials, although their contribution to producing commercially viable systems is limited by the narrow selection of data that the systems are optimised for and tested against [Vlachos, 2011]. There has been much debate within the academic community on how best to overcome the data sparsity problem, with many advocating the use of the WWW as a viable means of engineering wide-coverage NLP systems. The lack of coverage and limited construction of the data sets currently used by NLP systems stem from the fact that current NLP algorithms are typically optimised, tested and compared on fairly small data sets, as noted above for SensEval/SemEval systems, even though data sets several orders of magnitude larger are available [Banko & Brill, 2001]. This seems counterproductive when considering what the WWW has to offer. The web can be employed to obtain frequencies of n-grams (sequences of n items drawn from a given text) that are unseen in a given corpus, because the massive quantity of text on the web far exceeds that of any current corpus; large corpora can also be difficult and time-consuming to query efficiently.
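As a rough illustration of how acute sparsity is even within a well-known corpus (this is not a result from any of the cited studies), the short sketch below counts the bigrams observed in one genre of the Brown corpus distributed with NLTK and measures what proportion of bigrams in a different genre were never seen at all; the choice of the 'news' and 'romance' genres is an arbitrary assumption made for the example.

```python
# Illustrative measurement of data sparsity: how many bigrams from one genre
# were never observed in another? Assumes nltk.download('brown') has been run.
from collections import Counter

from nltk.corpus import brown
from nltk.util import bigrams

# Bigram counts from the 'news' genre act as the training data.
train_counts = Counter(bigrams(w.lower() for w in brown.words(categories='news')))

# Bigrams from the 'romance' genre act as held-out, "real-world" language use.
test_bigrams = list(bigrams(w.lower() for w in brown.words(categories='romance')))

unseen = sum(1 for bg in test_bigrams if train_counts[bg] == 0)
print('%.1f%% of held-out bigrams never occur in the training genre'
      % (100.0 * unseen / len(test_bigrams)))
```

Web-scale counts are intended to shrink exactly this unseen fraction, although, as the next paragraph notes, they bring problems of their own.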
Xu and Jelinek [2004], however, note that web data should not necessarily be considered a solution to the NLP coverage issue, since web data is noisy and language models built from it still suffer from sparseness. Zhu and Rosenfeld [2001] reinforced this point when they found that, across 24 random web news sentences, 46 of the 453 tri-grams searched for were not covered by the AltaVista search engine. Nevertheless, there has been substantial work in the field showing that using the WWW as a very large corpus holds great promise. Nakov and Hearst [2005] proposed a method to resolve PP (prepositional phrase) attachment ambiguity using unsupervised algorithms that cleverly exploited the WWW, making use of its surface features and paraphrases. This was based on the assumption that phrases found on the WWW are sometimes partially disambiguated and annotated by content creators. Their system achieved an average accuracy of 83.82% when resolving PP-attachment ambiguity, a level of success comparable to the highest-achieving systems presented at SensEval conferences. In summary, it is difficult to conceive of an NLP system being fit for purpose without exploiting massive quantities of data. As alluded to when discussing system requirements, wide coverage seems critical, so particular time and consideration should be put into making it possible. It appears unwise to rely solely on annotated, tagged corpus material in the immediate future, given how crucial wide-coverage systems appear to be. Furthermore, it has been estimated that sufficient tagged corpora, in addition to being tremendously expensive to engineer, would take in the region of 29 human-years to compile [Ng, 1997]. Aside from issues relating to the theoretical principles underlying training data engineering, a significant issue for NLP applications, as for all modern software systems, is efficiency. Human users are very demanding: system response times greater than 4 seconds can render a system unacceptable [Shneiderman, 1997]. Whether NLP systems can live up to this requirement is currently unknown, since efficiency is not a top priority in research, and many questions related to user experience and software design are often considered implementation detail. Additionally, it is not yet known whether NLP systems are capable of being more efficient than technologies currently available to the user, such as keyboards and menus. Ideally, developers should weigh this consideration during the requirements phase of an NL system engineering project. For a system to be efficient it has to have satisfactory performance. However, because NLP is an emerging field whose limitations are not yet truly understood, issues such as this are unlikely to be apparent to the engineer during the requirements phase; the software development cycle is therefore perhaps stuck in a permanent state of changing requirements in order to meet a presupposed optimisation for system efficiency (see Figure 1).

Figure 1. Waterfall model problems

Requirements engineering is a very complex process [Jiang, 2008] and it is unwise to declare an end to the phase, especially for systems that are subject to potentially significant change over the course of development, for reasons such as those exemplified above. NLP systems would greatly benefit from an agile methodology because of their unpredictable nature and because NL exhibits currently unknown limitations with respect to performance and efficiency. Particularly valuable is the emphasis agile methods place on customer collaboration [Leffingwell, 2010], which is imperative for systems directly reliant on human input. The developer, through consultation with the customer, can determine the amount and variety of data needed to train the system. For example, consider a hypothetical software system whose requirement specification is to process one-word verbal commands: the depth and variety of language this system would need to learn would be far less than for a system required to comprehend whole sentences of both spoken and written language, vastly reducing, although not eliminating, frequent amendments to requirements. It is for this reason, and because of the positive cycle associated with feedback and collaboration, that an agile engineering approach would be most beneficial to NLP systems. Additionally, debugging NLP systems is very challenging [Sukkarieh & Kamal, 2009]; an agile approach to defect detection through iteration-based functional testing (see the sketch below) evens out the defects found over the course of the development process as well as reducing costs. In theory, very few defects will remain in the system at the end of development.
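As a hypothetical illustration of what iteration-based functional testing might look like for an NLP component (the module names, the classify() interface and the 0.80 accuracy threshold are invented for this sketch and are not drawn from the cited work), a small pytest-style suite could pin system behaviour between iterations:

```python
# Hypothetical iteration-based functional tests for an NLP component, written
# with pytest. The imported modules and the threshold are illustrative
# assumptions, not part of any cited system.
import pytest

from myproject.sentiment import load_model        # hypothetical project module
from myproject.data import labelled_examples      # hypothetical held-out data


@pytest.fixture(scope="module")
def model():
    return load_model()


def test_accuracy_does_not_regress(model):
    examples = list(labelled_examples())
    correct = sum(1 for text, label in examples if model.classify(text) == label)
    # Threshold agreed with the customer; checked on every iteration.
    assert correct / len(examples) >= 0.80


def test_previous_defects_stay_fixed(model):
    # Inputs that exposed defects in earlier iterations, kept as permanent checks.
    assert model.classify("I dunno, it was fine I suppose") in {"neutral", "positive"}
```

Tests of this kind are cheap to run on every iteration, which is how the agile cycle keeps defect-fixing costs low across the lifecycle, as discussed next.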


Figure 2. Typical cost of defects over the production lifecycle, by the time they are detected [1]

[1] Source: http://www.agilevoices.com/aggregator/categories/2?page=4

This agile approach rectifies issues that arise from changes in business requirements, as well as changes that may need to be made for the system to perform efficiently. It appears to be a more logical approach for NLP systems than relying on a traditional waterfall model.

System Evaluation

There is a manifest need for a good evaluation methodology for NLP systems. As documented, the high accuracy rates that systems achieve at SensEval/SemEval are misleading as to their actual usefulness outside a strict, highly controlled, scientific laboratory environment.

NLP success is perhaps difficult to measure at present because, until very recently, the field was researched solely within this academic setting. Sommerville [2001] notes how notoriously difficult it is to construct conventional software systems systematically and on time; while project failures are well documented in other domains, estimating the failure of NLP systems is difficult, arguably because the language engineer faces higher levels of complexity. Because NLP technology is now beginning to be considered for implementation in real, functional software systems, it is perhaps time to consider a standardised methodology for evaluating system performance [Jones, 1994]. However, imposing standards on a system that computes NL may prove difficult because of environmental factors. Environmental variables are associated with both inputs and outputs, separate from the actual parameters of the system, and depend on the type of language data put into the system and the language requirements of the end user. This may concern, for example, the system's recognition of monolingual variation across geographical areas, so that the output is mutually intelligible and readable to the end user. Jones [1994] identified the difficulty of determining the appropriate level of granularity for environmental variables in NLP systems, and hence the open question of whether assessment of a system's functional performance refers to its ability to meet its internal design objectives or its external functional requirements. For example, in machine translation, for one system to be judged more successful than a competing one, should performance focus on the accuracy and completeness of parses, as specified at an initial requirements engineering stage, or is output that does not confuse the reader more successful? The growing evidence of the relevance of human behavioural factors to the successful development of new products, processes and services, especially for large, complex socio-technical systems, is beginning to receive substantial research attention [Dori et al, 2011]. The premise behind this research is that the architecture, design and implementation of many systems are normally carried out through careful study and documentation of the functional, structural and procedural aspects of the system under development; throughout the system's lifetime, however, human behaviour requires changes to the system to enhance performance and to maintain some or all of its components or subsystems. NL applications are prime examples of such systems. Language variation is incredibly widespread, cross-linguistically as well as within linguistically monolingual geographic areas, so a system must account for this with sufficient training data, if available. Unfortunately, language change occurs sporadically over time, and an element of language undergoing change often occurs in parallel with its preceding form while the two compete [Croft, 2006]. Thus, many changes may need to be implemented in the system to account for this phenomenon. Figure 3 below is an example of such linguistic change: the bi-gram "larger amount" is shown to have dramatically reduced in frequency over time, while "I dunno" has increased exponentially in occurrence [Michel et al, 2010]. While the reasons for these bi-gram fluctuations are of concern to linguists rather than to the computer scientist or software engineer, it is worth noting that such change often occurs through socio-cultural influence, highlighting the unpredictable nature of human behaviour.

Figure 3. Bi-gram frequency of the "larger amount" and "I dunno" constructions between the years 1950 and 2008 [2]

[2] Data presented from the Google n-gram viewer, which displays frequency of occurrence over time in the written material that constitutes Google Books. A bigram is an n-gram of size 2, where n denotes the number of items drawn from a given sequence; here the items are two lexical tokens. URL: http://books.google.com/ngrams

It is therefore easy to conclude that NL systems will require a significant amount of maintenance owing to the nature of human behaviour and language evolution. As the bi-gram occurrences show, decades ago, training an NLP system to handle the phrase "I dunno" would have been largely redundant. Maintainability is a key feature of any successful piece of software [Sommerville, 2001]; change in the needs of users is an inevitable consequence, so it is imperative that the recurrent problems of software maintenance are not replicated. The problem with implementing changes to software is that these changes are rarely documented, and even when they are, the documentation is often inaccessible to the numerous architects and designers involved in such changes over time frames that can last many decades and often outlast the working lives of the individuals involved [Dori et al, 2011]. The question therefore returns to whether NL systems would be more successful employing satisfactory heuristic models rather than striving for absolute accuracy. A system engineered for completeness will be difficult to maintain over time and subject to regular, complex re-engineering, as well as potentially generating massive amounts of additional documentation, which in turn dramatically increases the complexity of the system. Conversely, satisfactory heuristic NLP systems will likely require less frequent re-engineering, theoretically making them more robust and arguably easier to evaluate. Nevertheless, maintainability is a critical requirement in an NLP system, and systems should be engineered with compartmentalisation in mind, so that frequent changes to the code can be made without significant difficulty. The question of whether an NLP system should sacrifice completeness and accuracy for more robust performance in some sense concerns the question of what makes satisfactory software.
Output that does not confuse the user, and that is mutually intelligible to a wide variety of different users, makes for a far more usable system. A fit-for-purpose NLP system should ideally recognise, at the initial requirements stage, that language is fluid and irregular; engineering a system to work with one specific user in mind is rather meaningless. In many ways, the fundamental difference between NLP systems and conventional software is the incompleteness property: since current language processing techniques can never hope to produce all and only correct results, the whole system design is affected by having to provide appropriate fallbacks to counter this problem.

Conclusion & Discussion

NLP has been rigorously researched in academic papers, although its commercial implementation and architectural evaluation are often neglected. It has been shown how failures in NL systems are difficult to measure, and this paper has highlighted the need to introduce a rigorous evaluation methodology, like those that exist in other engineering practices, as well as endorsing an agile systems methodology. Researchers should strive to develop NLP systems that are fit for purpose, rather than constructing prototypes to support some academic theory. It has been seen how hard it is to engineer a software system that interacts with human behaviour; fulfilling the requirements of good software appears more complex than for conventional software systems. It is perhaps time for software developers to expand their focus beyond functional requirements to include the behavioural needs of the user. In some sense, the notion of system efficiency is bound up with an NLP system exhibiting wide coverage: if the system is able to recognise many forms of language input it will undoubtedly be significantly more usable, although the effect on system efficiency is questionable. It is undeniable that much work remains to be done on the commercial viability of certain technical aspects of such systems, and academic researchers should work more closely with seasoned software engineers on the implementation detail that is so crucial to a system's success. Given the need for frequent system updates, which are crucial for NLP-based applications, the software development processes for handling a problem can be challenging. It has been seen that very frequent modifications and reprioritisations are, to a great extent, due to the nature of NL input and constant re-specification, although they may also come from changes in business requirements. Whether a system should tailor its training methods to a heuristic approach is arguably rendered moot by agile software development techniques: agile addresses issues such as those examined in this paper by allowing the development process to adapt to change more quickly and to retract or replace the latest feature-based enhancements when the need arises. Regardless of engineering approach, however, developers should take care to ensure an adequate variety of training data. While state-of-the-art NLP technologies may be some way off yet, I believe the tools are already available for successful commercial systems, as long as special care is taken in understanding the needs of a wide potential user market. The use of the WWW as a very large corpus and developments in agile methodology seem the most promising.

Word Count: 4131

References:

Allison, L. (2006) A Programming Paradigm for Machine Learning, with a Case Study of Bayesian Networks. In Proceedings of the Twenty-Ninth Australasian Computer Science Conference (ACSC 2006), Hobart, Australia, pp. 103-111.

Bird, S., Klein, E. & Loper, E. (2009) Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media.

Banko, M. & Brill, E. (2001) Scaling to Very Very Large Corpora for Natural Language Disambiguation. In Proceedings of ACL 2001.

Croft, W. (2006) Explaining Language Change: An Evolutionary Approach. Harlow, Essex: Longman.

Daoud, D. & El-Seoud, S. (2011) Human Factors Required for Building NLP Systems. In Technology for Facilitating Humanity and Combating Social Deviations: Interdisciplinary Perspectives. IGI Global.

Dori, D., Osorio, C. & Sussman, J. (2011) COIM: An Object-Process Based Method for Analyzing Architectures of Complex, Interconnected, Large-Scale Socio-technical Systems. Systems Engineering 14(4), pp. 364-382.

Kaiser, C. (2009) Combining Text and Data Mining for Gaining Valuable Knowledge from Online Reviews. IADIS International Journal on WWW/Internet 7(1), pp. 63-78.

Gerlach, J. & Kuo, F. (1991) Understanding Human-Computer Interaction for Information Systems Design. MIS Quarterly 15(4), pp. 527-549.

Leffingwell, D. (2010) Agile Software Requirements: Lean Requirements Practices for Teams, Programs, and the Enterprise. Addison-Wesley.

Jones, K. (1994) Towards Better NLP System Evaluation. In Proceedings of the Human Language Technology Workshop, San Francisco, pp. 102-107.

Michel, J.-B. et al (2010) Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331(6014), pp. 176-182.

Xu, P. & Jelinek, F. (2004) Random Forests in Language Modelling. In Proceedings of EMNLP 2004, Barcelona, Spain.

Navigli, R. (2009) Word Sense Disambiguation: A Survey. ACM Computing Surveys 41(2), pp. 1-69.

Marcus, M., Santorini, B. & Marcinkiewicz, M. (1993) Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19(2), pp. 313-330.

Mitchell, T. (1997) Machine Learning. McGraw Hill.

Ng, T. (1997) Getting Serious About Word Sense Disambiguation. In Proceedings of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?, Washington D.C., pp. 1-7.


Nakov, P. & Hearst, M. (2005) Using the Web as an Implicit Training Set: Application to Structural Ambiguity Resolution. In Proceedings of HLT/EMNLP 2005, Vancouver.

Sommerville, I. (2001) Software Engineering, 6th Edition. Addison-Wesley.

Shneiderman, B. (1997) Designing the User Interface: Strategies for Effective Human-Computer Interaction. Addison-Wesley.

Sukkarieh, J. & Kamal, J. (2009) Towards Agile and Test-Driven Development in NLP Applications. In Proceedings of the NAACL HLT Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing, Boulder, Colorado, pp. 42-44.

Vlachos, A. (2011) Evaluating Unsupervised Learning for Natural Language Processing Tasks. In Proceedings of EMNLP 2011.

Jiang, X. (2008) Modelling and Application of Requirements Engineering Process Metamodel. IEEE 21(22), pp. 998-1001.

Zhu, X. & Rosenfeld, R. (2001) Whole Sentence Exponential Language Models: A Vehicle for Linguistic-Statistical Integration. Computer Speech and Language 15(1).

