
Corpora and HLT: Current trends in corpus processing and annotation

Voula Giouli, Stelios Piperidis
Institute for Language and Speech Processing

Introduction

Over the last decades, corpora as linguistic resources have been widely used within the Language Engineering community for training, testing and benchmarking purposes. With the advent of modern technology, computer corpora may consist of millions of running words coupled with linguistic analyses, usually referred to as annotations, and these can be efficiently represented, accessed and manipulated by means of robust Language Engineering Tools. However, corpus building and maintenance has proved to be a time-consuming and costly task whose importance in systems development and evaluation has led to the notions of reusability and standardization. This section is intended to serve as an introduction to issues relevant to corpora, corpus collection and annotation standards, and their applications in HLT, the aim being to raise awareness in the newly-associated Balkan countries, Romania and Bulgaria. An overview of the topics to be discussed in this document is presented:

Definitions
Corpus Typology
Corpora - Subcorpora - Specialized corpora
Corpus Design and Collection
Corpora Use in the field of HLT
Standardization and re-usability
Corpus Annotation
Structural Annotation
Linguistic Annotation of Monolingual Corpora
Layers of linguistic annotation
Tools & Annotated Corpora
The Hellenic National Corpus (HNC)
The ILSP annotated corpus
Linguistic Annotation of Parallel Corpora

Definitions

In modern Corpus Linguistics and Computational Linguistics, the word corpus is used to refer to bodies of language pieces in either written or spoken format, basically in electronic form, gathered together in a systematized and structured way in order to be used for drawing conclusions about language usage. Therefore, a corpus should ideally represent the language variety (or varieties) it is intended for, which entails that a given corpus should be built according to explicit design criteria for a specific purpose. There is much confusion over similar - yet completely different - notions such as text archive, text library and text collection, these being distinct from a corpus in that they comprise texts which are of interest by themselves yet not structured. In particular, an electronic text archive is a repository of readable electronic texts not linked in any coordinated way, such as the Oxford Text Archive. Similarly, an electronic text library is a collection of electronic texts in standardized format with certain conventions relating to content, etc., but without rigorous selectional constraints. Therefore, a text collection is not considered to be a corpus unless it meets certain minimal criteria. And although there is no consensus yet among researchers with respect to a number of notions relevant to corpus design and building, there are certain widely accepted key issues that should be taken into consideration as minimum requirements for a text collection to be considered a corpus.

Format/Storage. A corpus should be in machine-readable (electronic) form. Currently, computer corpora may store many millions of running words that can further be analyzed by means of applying linguistic annotations to the raw text. To this extent, a corpus should preferably be in plain text, that is ASCII characters, with any mark-up clearly identified and separable from the text. Nowadays it is likely that many texts will be in SGML format, the latest trend being XML, the universal format for structured documents and data on the Web.

Structure. Textual data comprising a corpus should be structured in order for the users to be able to draw useful conclusions about the language. This means that data should be collected according to specific design criteria and the overall structure be documented in order to be of some use.

Size. A corpus should be of sufficient size for the purpose it has been constructed for. There is no general agreement as to what the size of a corpus should ideally be. In practice, however, the size of a corpus tends to reflect the ease or difficulty of acquiring the material. In turn, this factor may be loosely related to the availability of the material to the public and therefore to its relative importance as influential language, as against material which is difficult to get, perhaps because it is of small circulation.

Representativeness. A corpus should be representative of the language variety and/or sublanguage it is intended for. This holds true especially for large corpora aimed to be used as a reference of the language they are intended for [McEnery et al., 1996]; [McEnery et al., 2002]; [Sinclair, 1987]; [Zampolli, 1990]. To this extent, sampling procedures should be adopted prior to building a corpus.

Balance. A corpus should be balanced. The latter applies to general-purpose corpus building and implies that a range of different text types should be included proportionally, so that the corpus reflects, in some more-or-less principled way, their levels of use within the language community.

It is obvious that meeting the requirements mentioned above when entering the task of corpus building depends on the setting, that is, the specific purpose a corpus is being developed for. For example, large corpora used to reflect general language usage are in most cases balanced and representative of the language variety they are intended for, whereas corpora designed to be used within the discipline of Natural Language Processing for HLT tools development, testing and evaluation should meet the specifications set by the application (e.g. sublanguage or subject field, etc.).

Corpora - Subcorpora - Specialized corpora

According to the EAGLES specifications for corpus typology, a subcorpus is considered to be a part of a corpus that retains the properties and characteristics of the corpus it belongs to. Components, on the other hand, constitute corpus and subcorpus building blocks, as corpora and subcorpora may further contain components. A component is not necessarily an adequate sample of a language and in that way it is distinct from a corpus and a subcorpus. It is a collection of pieces of language that are selected and ordered according to a set of linguistic criteria that serve to characterize its linguistic homogeneity. Whereas a corpus - and, to some extent, a subcorpus - may illustrate heterogeneity, the component illustrates a particular type of language. What are called sublanguages are components in this definition, but there are other restrictions on sublanguages which will be dealt with later [EAGLES documentation EAG-TCWG-CTYP/P].

Corpus Typology

Significant progress towards the specification of corpus typology has been achieved within the framework of EU-funded programs during the last decade [EAGLES documentation EAG-TCWG-CTYP/P]. In practice, corpora are classified with respect to:

Modality: Speech corpora, in the form of audio data banks comprising .wav files optionally coupled with transcriptions, vs. written corpora, vs. multimodal corpora suitable for large-scale empirical investigation, the latter being the new trend in the discipline, as they are of crucial importance in a number of applications.

Text type: spoken (transcribed) corpora (e.g. the London-Lund corpus) vs. written corpora (e.g. the Lancaster-Oslo/Bergen corpus (LOB)) vs. mixed corpora (the British National Corpus (BNC) or the Bank of English).

Medium: this is relative to the medium of original publication or first appearance, lending itself to the categories of text classification (newswire, books, periodicals, etc.), an example being the Reuters Corpus Volume 1 (RCV1).

Language coverage: General corpora vs. sublanguage corpora. General corpora, also referred to in the literature as reference corpora, consist of general texts that do not belong to a single text type, subject field or register. These are designed to provide comprehensive information about a language and are widely used in the fields of Lexicography, Corpus Linguistics, etc. The British National Corpus and the Hellenic National Corpus are instances of large reference corpora reflecting British and Greek language usage respectively. The notions of representativeness and balance are inherent to this type of corpus. Size is also important if one wishes to draw safe conclusions based on statistics. On the other hand, sublanguage corpora, also known as special or specialized corpora, are sampled from a particular variety of a language (e.g. a particular dialect) or from a particular subject area (e.g. the financial domain), the latter being especially important for the development and evaluation of HLT applications and NLP system components.

Genre/register: corpora of literary texts vs. corpora of technical documents vs. corpora of non-fiction (e.g. news texts) vs. mixed corpora covering all genres.

Language variables: Corpora can consist of texts in one language (or language variety) only, or of texts in more than one language, and are thus classified as monolingual and multilingual respectively. Multilingual corpora are further classified into translation and parallel corpora. Translation corpora (as for example the Hansard Corpus) comprise texts in more than one language, one being the original version, with translations in at least one other language. Parallel corpora, on the other hand, are collections of texts in different languages which are translations of a common source, with no need for the original version to be included. Parallel corpora are widely used to aid computer-assisted translation; they are aligned to some extent to make them searchable within linked segments, alignment being performed at the level of paragraphs, sentences, phrases or even words. Finally, comparable corpora may fall into either category (monolingual or multilingual), a comparable corpus being a collection of similar texts in more than one language or language variety [EAGLES documentation EAG-TCWG-CTYP/P]. What the nature of such a similarity might be is not generally clear; however, most general-purpose corpora at least classify their texts according to medium, genre, topic, time of production or publication, etc. It thus seems reasonable, as well as feasible, to base the comparison of corpora on those widely applied classifications, which are to a great extent mutually independent. Comparable corpora components are usually of similar size and from similar domains. An instance of comparable corpora is the International Corpus of English (ICE), which comprises varieties of English throughout the world.

Production community: native-speaker corpora vs. learner corpora. Native-speaker corpora collect their material from the genuine communications of people. Authentic material is an added value to a corpus designed to reflect actual language usage. Learner corpora, on the other hand, comprise text collections produced by learners of a given language and are of crucial importance in drawing conclusions about learners' behavior with respect to that language and for guiding teaching and learning methodologies.

Markup: Plain (raw) vs. annotated corpora. Computer-readable corpora can consist of raw text only (i.e. plain text with no additional information); alternatively, the raw text is further coupled with additional mark-up or annotations which reflect external data in the form of identifying information (edition date, author, genre, register, etc.), text structure (i.e. formatting attributes such as page breaks, paragraphs, etc.), or linguistic information on the basis of linguistic analyses (part of speech, syntactic structure, discourse information, etc.).

Open-endedness: closed, unalterable corpora (e.g. LOB, the Brown Corpus) vs. monitor corpora (the Bank of English). Certain applications require that the corpora involved are of a steady size and constitution, thus not allowing changes to be made, in order to suffice as a common reference (closed, unalterable corpora). Monitor corpora, on the other hand, are designed in such a way that they allow constant, yet controlled, modifications to take place. A monitor corpus is constantly refreshed with new material added at a regular pace, while old material is proportionally removed and archived to be used for comparison purposes (against current and future samples), thus facilitating the tracking of language change. Corpus maintenance and management depend on the rate of flow of the material. Overall corpus size may then remain steady, and corpus constitution stays parallel to its previous states, although the balance of its components may be modified over time.

Corpora are further classified according to:

National varieties: British corpora (e.g. the Lancaster-Oslo/Bergen corpus) vs. American corpora (e.g. the Brown corpus) vs. an international corpus of English.

Historical variation: diachronic corpora (the Helsinki corpus) vs. synchronic corpora (Brown, LOB, BNC) vs. corpora which cover only one stage of language history (corpora of Old or Middle English, Shakespeare corpora).

Geographical/dialectal variation: corpora of dialect samples (e.g. Scots) vs. mixed corpora (the BNC spoken component includes samples of speakers from all over Britain).

Age: corpora of adult English vs. corpora of child English (the English components of CHILDES).

Availability: commercial vs. non-commercial research corpora, online corpora vs. corpora on ftp servers vs. corpora available on floppy disks or CD-ROMs.

Corpus classification, by criterion:

Modality: speech corpora; written (text) corpora; multimodal corpora (transcribed text, audio, visual).
Text type: spoken (transcribed) corpora (e.g. the London-Lund corpus); written corpora (e.g. the Lancaster-Oslo/Bergen corpus (LOB)); mixed corpora (the British National Corpus (BNC) or the Bank of English).
Medium: newswire, books, periodicals, etc. (e.g. the Reuters Corpus Volume 1 (RCV1)).
Language coverage: reference corpora; sublanguage or special corpora.
Genre/register: corpora of literary texts; corpora of technical documents; corpora of non-fiction (e.g. news texts); mixed corpora covering all genres.
Language variables: monolingual; multilingual (translation and parallel corpora); comparable corpora (either monolingual or multilingual).
Production community: native-speaker corpora; learner corpora.
Markup: plain (raw) corpora; annotated corpora.
Open-endedness: closed, unalterable corpora (e.g. LOB, Brown); monitor corpora (the Bank of English).
National varieties: British corpora (e.g. LOB) vs. American corpora (e.g. Brown) vs. an international corpus of English.
Historical variation: diachronic corpora (the Helsinki corpus); synchronic corpora (Brown, LOB, BNC); corpora covering only one stage of language history (corpora of Old or Middle English, Shakespeare corpora).
Dialectal variation: corpora of dialect samples (e.g. Scots) vs. mixed corpora (the BNC spoken component includes samples of speakers from all over Britain).
Age: corpora of adult English; corpora of child English (the English components of CHILDES).
Availability: commercial vs. non-commercial research corpora; online corpora vs. corpora on ftp servers vs. corpora available on floppy disks or CD-ROMs.

Corpus design and collection

As with building any other type of linguistic resource, corpus design criteria should be elaborated prior to entering into the task. Common practice and experience have shown that there are certain issues to be taken into consideration:

Purpose of corpus construction, which, in turn, determines the language or language variety to be covered, the overall desired/needed size, and the sources available.
Methodological issues: specification of the sampling procedures to be adopted; metadata specification; specification of the annotation schema.
Technical issues: format; storage; maintenance and access.
Legal/ethical issues.

These issues are further discussed below. As stated earlier, a corpus is defined as a body of language built for a specific purpose. Purpose, therefore, determines such characteristics of a corpus as its type, language or language variety, size and sources. For example, lexicographic work depends heavily on large reference corpora, from which safe conclusions about the language can be drawn. A monitor corpus can further be of added value to Lexicography, Corpus Linguistics and tasks relevant to the diachronic study of language, as new words, meanings or norms can be identified and language change can be tracked. In HLT, on the other hand, specialized corpora of relatively modest size are sufficient for guiding system development and assisting testing and evaluation.

As far as methodological issues are concerned, sampling procedures and metadata specifications are to be considered when entering corpus work. Representativeness and balance are considered minimum requirements for a text collection to count as a corpus. This holds true especially for reference corpora. To be representative, therefore, a corpus should be built up out of samples from a range of text material representing language use. There are many approaches to sampling, and there is no general agreement yet as to whether samples should be of an even size or whether more statistically sophisticated methods should be adopted [Biber, 1993]; [McEnery et al., 1996]. Random sampling techniques are standard in many areas of science and social science, and these techniques are also used in corpus building, but there are additional caveats which the corpus builder must be aware of. Prior to defining sampling procedures, however, the limits of the population to be studied should be defined clearly [Biber, 1993]. In other words, the sampling frame, that is, the entire population of texts from which the samples are taken, should be defined. This can be made feasible by using a comprehensive bibliographical index. Another approach could be to define the sampling frame as being all the books and periodicals in a particular library which refer to a particular area of interest. Stratificational sampling, on the other hand, as opposed to pure probabilistic sampling, has the advantage of determining beforehand the hierarchical structure (or strata) of the population [Biber, 1993]. This refers to defining the different genres, channels, etc. that it is made up of (newspaper reporting, fiction and poetry, legal documents, scientific documents, etc.). This approach is more representative than pure probabilistic approaches, as it allows each individual stratum to be subjected to probabilistic sampling. However, these strata (like corpus annotation) are an act of interpretation on the part of the corpus builder, and others may argue that genres are not naturally inherent within a language. Genre groupings have a lot to do with the theoretical perspective of the linguist who is carrying out the stratification [McEnery et al., 1996]. Other issues, such as the optimal length and number of samples, as well as the problems of using standard statistical equations to determine these figures, are also discussed in the literature [McEnery et al., 1996].

Metadata specification is also of crucial importance for a corpus to be of some use and involves external descriptive information and information on typographic characteristics of the text, as well as linguistic analysis of the content. These and related issues are further discussed in a separate section.

Technical issues relevant to corpus collection include data capture, format specifications and data storage, maintenance and access. Text collection for corpus building is not a trivial task and depends, first of all, on text availability. Early corpus building work had to cope with slow computers with significant storage limitations, and material had to be collected via keyboard typing, which proved to be time-consuming as well as error-prone. In recent years, the use of scanners to aid the input of certain types of texts, along with the expansion of the World Wide Web, has led to a dramatic increase in text availability. The Web has proven to be an invaluable content provider, since there exists a large and fast-growing number of web sites willing to provide content at low cost. Moreover, newsfeeds in various languages, although costly, can be purchased from newswire companies, which tend to provide a high volume of articles per day. These include not only news information but also certain metadata (summary, people, keywords) useful in various tasks. Many stories are transmitted progressively: the initial transmission contains some details, and subsequent deliveries provide new information. In some cases, this must be taken into account when formatting the articles for inclusion in the corpus.

The importance of corpora and their usefulness in a number of applications has created the need for the collection and annotation of large corpora and has given rise to discussions of issues such as efficient representation formats for language resources. To this extent, the language engineering community has long tried to create a set of standards for encoding corpora and the linguistic annotations applied to them. SGML, XML and RDF are existing formats accepted as European standards for corpus representation, XML being the most prevalent. XML is an extensible markup language (in fact a meta-language, in that it is a means to describe language) used for the description of marked-up electronic text. The XML encoding formalism allows flexibility, portability and easy interchange of linguistic resources [Lopez et al., 2000].

Storage of corpora is another important issue. Large reference corpora are stored in large databases which take advantage of the corpus encoding format and facilitate maintenance of the data and access to the linguistic content. Finally, copyright issues concerning the collection of texts, publication of data examples, sharing of a corpus with others, etc. are of relevance in the case of publicly available texts. Copyright policies differ across countries, and there is always a question of whether text is obtained for profit or for research purposes.
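As a rough illustration of the stratified sampling procedure discussed above, the following Python sketch draws a fixed quota of texts from each stratum of a hypothetical sampling frame organised by genre. The directory layout, genre names and quotas are assumptions made for the example and do not come from any of the projects cited here.

```python
import random
from pathlib import Path

# Hypothetical sampling frame: each genre (stratum) maps to a directory of
# candidate text files. Genres, paths and quotas are illustrative only.
SAMPLING_FRAME = {
    "news":    Path("frame/news"),
    "fiction": Path("frame/fiction"),
    "legal":   Path("frame/legal"),
}
QUOTAS = {"news": 50, "fiction": 30, "legal": 20}   # texts per stratum

def stratified_sample(frame, quotas, seed=0):
    """Randomly draw quotas[genre] files from each stratum of the frame."""
    rng = random.Random(seed)              # fixed seed: reproducible selection
    selection = {}
    for genre, directory in frame.items():
        candidates = sorted(directory.glob("*.txt")) if directory.is_dir() else []
        k = min(quotas.get(genre, 0), len(candidates))
        selection[genre] = rng.sample(candidates, k)
    return selection

if __name__ == "__main__":
    for genre, files in stratified_sample(SAMPLING_FRAME, QUOTAS).items():
        print(f"{genre}: {len(files)} texts selected")
```

In a real project the quotas would of course be derived from the design criteria (estimated levels of use per genre), and each stratum would be sampled from a documented sampling frame rather than an arbitrary directory.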

Corpora use in the field of HLT

Corpora are thought of as linguistic resources which have been widely used in disciplines such as Computational Linguistics, Lexicography, Language Engineering, etc. for deriving both qualitative and quantitative evidence about language. It has been extensively argued, and practice has shown, that large corpora can be used in the field of Lexicography, whether computerised or not, for demonstrating language usage. Word and phrase frequency counts are used to guide the macrostructure of lexicons; concordances provide invaluable information on the usage of words or phrases in context, and furnish lexicographers with examples of word usage to be included in reference dictionaries. Qualitative and quantitative evidence aid the testing and proving of linguistic theories; moreover, language teaching and learning, especially as far as second languages are concerned, benefit from the usage of large learner corpora, which have been extensively used for modeling learner behavior deviating from the norm and for establishing teaching methodologies.

On the other hand, there is a growing number of business activities nowadays that work systematically with human languages for translation, terminology, text recognition, extraction, etc. The fast-growing industry of Human Language Technologies is asked to meet all these needs by developing the appropriate products for a wide range of applications. Development of robust NLP tools aimed at tasks such as information extraction, summarization, question answering, etc. is at the core of Human Language Technology applications. Extensive language corpora with annotations at various levels of linguistic analysis are required for testing and evaluation purposes, in that they are used as gold standards against which the performance of either stand-alone tools or integrated systems is compared. At the same time, statistical approaches to building tools rely extensively on annotated corpora of sufficient size in order for the tools to be trained. Corpora, therefore, play a crucial role in the field of language engineering, as they are used in two different ways: for systems development and for their evaluation. More specifically, the use of large annotated corpora facilitates:

the automatic or semi-automatic development of other linguistic resources such as lexicons, grammars, name lists, etc.;
the design and development of Natural Language Processing tools;
the efficient evaluation of these Natural Language Processing tools at system and component level.

More precisely, monolingual corpora are important in tasks such as natural language parser assessment, POS-tagger training and validation, terminology extraction, automatic text summarization, etc. Multilingual or parallel corpora are widely used in Computer-Assisted Translation for translation memory creation and validation, lexicon consolidation, etc. It should be noted, however, that building specialized corpora tailored for specific applications in a certain sublanguage is an added value to the process of developing tools and other resources. Building and maintenance of annotated corpora, therefore, though a time-consuming and costly task, has proved to be prominent in the field of HLT and can lead to increased cost-effectiveness and improved accuracy rates.

Standardization and reusability

The extensive use of linguistic resources - whether lexical or corpus-based - entails the long-discussed notions of resource reusability, harmonisation and standardization. Provision of large-scale labeled language resources, such as tagged corpora or repositories of pre-classified text documents, is a crucial key to steady progress in an extremely wide spectrum of research, technological and business areas in the HLT sector. Annotating a corpus, however, is a time-consuming and costly effort, and therefore certain recommendations should be taken into account prior to entering into such an endeavor. To this end, many efforts have been made towards the reusability and harmonisation of existing corpora. Reusability, on the one hand, implies that corpora (like any other resource) should be constructed so that they can be used for a number of tasks other than the one they were initially designed for (multifunctional resources) and by a number of possible users outside the community they are intended for. Reusability of linguistic resources has been a focal point within the community for many years. Harmonization, on the other hand, of resources built by different organisations implies that they should be constructed and represented in a uniform way, following common specifications and methodologies, so that - in the case of corpora - their union comprises a comparable corpus. Harmonized corpora have been produced in the framework of the PAROLE project, in that they are produced in many languages but following common procedures and guidelines. These corpora are harmonized with respect to text representation as well as composition [PAROLE MLAP: 63-386 Project Deliverable], with the aim of being used for a wide range of applications. Standardization of corpus encoding and linguistic metadata at all levels is a prerequisite for corpus harmonization and aims at setting common guidelines and core methodologies for corpus collection, presentation and processing in order to render corpora reusable. In this section, a brief presentation of standardization efforts is attempted, whether relevant to corpus encoding (e.g. TEI and CES) or to the specification of linguistic annotations (metadata) (e.g. EAGLES, PAROLE).

Encoding and Metadata Standards

The Text Encoding Initiative (TEI). Jointly sponsored by the ACH (Association for Computers and the Humanities), the ALLC (Association for Literary and Linguistic Computing) and the ACL (Association for Computational Linguistics), the TEI was initially launched in 1987 as an international and interdisciplinary standard aimed at effectively representing any type of literary and linguistic text through the adoption of the SGML markup language, later transferred to XML.

The Corpus Encoding Standard (CES). The CES specifies a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and typographic information) as well as general architecture (so as to be maximally suited for use in a text database). It also provides encoding specifications for linguistic annotation, together with a data architecture for linguistic corpora. It is designed to be optimally suited for use in language engineering research and applications, in order to serve as a widely accepted set of encoding standards for corpus-based work in natural language processing applications. The CES is an application of SGML compliant with the specifications of the TEI Guidelines for Electronic Text Encoding and Interchange of the Text Encoding Initiative.


Multilingual Text Tools and Corpora (MULTEXT). MULTEXT is a series of projects aiming at standardizing the specifications for the encoding and processing of corpora, as well as at building corpora, tools and other linguistic resources according to the specifications set, for a variety of languages.

Expert Advisory Group on Language Engineering Standards (EAGLES) and International Standards for Language Engineering (ISLE). EAGLES is a metadata initiative whose aim is to provide public, commonly agreed standards for large-scale language resources, including text corpora, computational lexicons and speech corpora. It is also targeted at consolidating means of manipulating such knowledge, via computational linguistic formalisms, mark-up languages and various software tools, and at defining means of assessing and evaluating resources, tools and products. ISLE is the follow-up to EAGLES, aiming to integrate the former specifications for a wide range of languages and novel applications.

Open Language Archives Community (OLAC). OLAC is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources.

Browsable Corpus (BC). The Browsable Corpus concept was introduced at the Max Planck Institute (MPI) to make resource discovery easier by defining meta-descriptions for language resources. The structure of linked meta-descriptions can be browsed and searched.

Dublin Core Metadata Initiative (DCMI). The Dublin Core Metadata Initiative is an open forum engaged in the development of interoperable online metadata standards that support a broad range of purposes and business models. DCMI's activities include consensus-driven working groups, global workshops, conferences, standards liaison, and educational efforts to promote widespread acceptance of metadata standards and practices.

Other initiatives:
Codes for the Human Analysis of Transcripts (CHAT)
European Science Foundation Second Language Databank (ESFSLD)
Gesture Databank (GDB)
Multimedia Content Description Interface (MPEG-7)
Spoken Dutch Corpus (CGN, Corpus Gesproken Nederlands)
STANLEX (interdisciplinary working group established by the Danish Standard)

Resource Providers

European Language Resources Association (ELRA). ELRA is a non-profit organization in Europe aiming to make available language resources for language engineering and to evaluate language engineering technologies. In order to achieve this goal, ELRA is active in the identification, distribution, collection, validation, standardization and improvement of language resources, in promoting their production, in supporting the infrastructure to perform evaluation campaigns, and in developing a scientific field of language resources and evaluation.

Linguistic Data Consortium (LDC). The Linguistic Data Consortium supports language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards.

The (non-exhaustive) list below references corpora that have been extensively used for the development and evaluation of HLT tools:

The Brown Corpus of Standard American English was the first of the modern, computer-readable, general corpora, constructed at Brown University by W. N. Francis and H. Kucera. It consists of one million words of American English texts sampled from 15 different text categories to make the corpus a good standard reference. Although small and slightly dated, the corpus is still in use.

The British National Corpus (BNC) comprises 100 million words of British English written and spoken texts. The written part (90%) includes a wide range of text types covering newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, and school and university essays, among many other kinds of text. The spoken part (10%) includes a large amount of unscripted informal conversation, recorded by volunteers selected from different age, region and social classes in a demographically balanced way, together with spoken language collected in all kinds of different contexts, ranging from formal business or government meetings to radio shows and phone-ins. It is accompanied by appropriate corpus management tools.

The American National Corpus (ANC). This is a corpus comparable to the BNC that contains American English texts.

The Hellenic National Corpus (HNC). A description of this corpus is provided below.

The Reuters Corpus (Volume 1). It comprises ~810,000 English-language news stories from the period 20/8/96 - 19/8/97, formatted in NewsML (a dialect of XML).


References
[1] [Atkins et al., 1992] Atkins, S., J. Clear and N. Ostler, 1992. Corpus Design Criteria. In Literary and Linguistic Computing 7 (1), 1-16.
[2] [Biber, 1993] Biber, Douglas, 1993. Representativeness in corpus design. In Literary and Linguistic Computing 8, 243-57.
[3] [EAGLES documentation EAG-TCWG-CTYP/P] EAGLES, Preliminary recommendations on Corpus Typology. EAG-TCWG-CTYP/P (version of May 1996: http://www.ilc.pi.cnr.it/EAGLES96/corpustyp/corpustyp.html).
[4] [Gavriilidou et al., 1998] Gavriilidou, M., Labropoulou, P., Papakostopoulou, N., Spiliotopoulou, S., and Nassos, N., 1998. Greek corpus documentation. Technical report (ILSP: PAROLE LE2-4017/10369, WP2.9-WP-ATH-1).
[5] [Hatzigeorgiou et al., 2000] Hatzigeorgiou, N., Gavriilidou, M., Piperidis, S., Carayannis, G., Papakostopoulou, A., Spiliotopoulou, A., Vacalopoulou, A., Labropoulou, P., Mantzari, E., Papageorgiou, H., Demiros, I., 2000. Design and implementation of the online ILSP Greek Corpus. In Proceedings of the Second International Conference on Language Resources and Evaluation, Athens, Greece.
[6] [Lopez et al., 2000] Lopez, P., Romary, L., 2000. A Framework for Multilevel Linguistic Annotations. In Proceedings of the 2nd Language Resources and Evaluation Conference, Athens, Greece.
[7] [McEnery et al., 1996] McEnery, T. and Wilson, A., 1996. Corpus Linguistics. Edinburgh University Press (2nd ed., 2001).
[8] [McEnery et al., 2002] McEnery, T. and Wilson, A., 2002. Corpus Linguistics, part II. http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/contents.htm
[9] [PAROLE MLAP: 63-386 Project Deliverable] PAROLE, Design and composition of reusable harmonized written language reference corpora for European languages. Technical report (PAROLE Consortium, MLAP: 63-386, WP 4-Task 1.1, 1995).
[10] [Sinclair, 1987] Sinclair, J. M. (ed.), 1987. Looking up: An account of the COBUILD project. Collins Publishers.
[11] [Sinclair, 1991] Sinclair, J. M., 1991. Corpus, concordance and collocation. Oxford: Oxford University Press.
[12] [Sinclair, 1994] Sinclair, J. M., 1994. Trust the text. In Advances in Written Text Analysis, Coulthard, M. (ed.), London: Routledge.
[13] [Sperberg-McQueen et al., 1994] Sperberg-McQueen, C. M. and Burnard, L. (eds.), 1994. Guidelines for electronic text encoding and interchange: TEI-P3. Technical report (Chicago and Oxford: ACH-ACL-ALLC Text Encoding Initiative).
[14] [Véronis, 2000] Véronis, J. (ed.), 2000. Parallel Text Processing: Alignment and use of translation corpora. Dordrecht: Kluwer Academic Publishers. http://www.up.univ-mrs.fr/veronis/parallel-book.html
[15] [Zampolli, 1990] Zampolli, A., 1990. A survey of European corpus resources (UK SALT Club).


Corpus Annotation

To be of real use, a given corpus should further be annotated. It is common practice for many corpus projects to add annotations, that is, special mark-up relevant to both external and internal data, which facilitates access to and manipulation of the resource. Linguistic annotations, finally, are common practice for corpora built to guide the development or measure the performance of HLT tools. Two types of annotation are applied to a corpus: corpus encoding, and linguistic annotations which enrich the raw text with information resulting from linguistic analyses of the data. This section on corpus annotation is organized as follows:

Structural Annotation
Linguistic Annotation of Monolingual Corpora
Layers of linguistic annotation
Tools & Annotated Corpora
The Hellenic National Corpus (HNC)
The ILSP annotated corpus
Linguistic Annotation of Parallel Corpora

Structural Annotation

Structural annotation, also referred to as corpus encoding, comprises the mark-up of both external and internal data present in the texts of the resource. By external data we mean documentation of the corpus, that is, global information about the text, e.g. bibliographic information (author, publisher, edition, etc.) as well as information about the distribution of the electronic corpus (institution, address, etc.). Internal data are relevant to structural annotation proper, which comprises the mark-up of structural elements in the raw text material. Gross structural mark-up and sub-paragraph mark-up are distinguished. The gross structure of a text consists of elements such as chapters, sections, paragraphs, etc. At the paragraph level, titles, lists, tables, etc. are also meaningful structural units. Sub-paragraph structures include elements such as sentences, abbreviations, dates, quotations, references, etc. A Header and a Body element are also identified and encoded for every text in the corpus. The header contains the documentation information and the body contains the raw text material and the mark-up for structural information. Corpus encoding is of primary importance in the discipline of NLP, in that it is considered to maximize resource reusability. To this extent, many standardization efforts have been attempted (Text Encoding Initiative, Corpus Encoding Standard, etc.).
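To make the header/body distinction concrete, the sketch below wraps raw text in a minimal XML document with a small header and simple gross structural mark-up, using Python's standard library. The element names are illustrative assumptions and do not follow the actual TEI or CES element inventories.

```python
import xml.etree.ElementTree as ET

def encode_document(title, author, paragraphs):
    """Wrap raw text in a minimal header/body structure (illustrative tags,
    not the real TEI/CES element set)."""
    doc = ET.Element("document")

    # Header: external (bibliographic/documentation) data about the text.
    header = ET.SubElement(doc, "header")
    ET.SubElement(header, "title").text = title
    ET.SubElement(header, "author").text = author

    # Body: the raw text plus gross structural mark-up (here only <p>).
    body = ET.SubElement(doc, "body")
    for para in paragraphs:
        ET.SubElement(body, "p").text = para

    return ET.tostring(doc, encoding="unicode")

if __name__ == "__main__":
    xml = encode_document(
        title="Sample article",
        author="Unknown",
        paragraphs=["First paragraph of the text.", "Second paragraph."],
    )
    print(xml)
```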


Linguistic Annotation of Monolingual Corpora

The additional information added manually (by linguists), automatically (by relevant Natural Language Processing tools) or semi-automatically (by manually correcting the output of these tools) is referred to in the literature as (linguistic) annotation(s), linguistic metadata, or simply markup. These annotations reflect the linguistic structure of a given corpus at various levels of linguistic analysis and may be added to either textual data or recorded linguistic signals. In this document, we are primarily interested in corpora comprising textual data; therefore we will not refer to annotations added to speech corpora (i.e. phonetic transcriptions, etc.). Linguistic analysis is performed at various levels: part-of-speech tagging, syntactic analysis, functional relations marking, Named Entity identification, coreference annotation, and so on. The focus in this document is on annotations added to monolingual text corpora, on tools which have been widely used for constructing annotated linguistic databases, and on the formats commonly adopted by such tools and databases.

Levels of Linguistic Annotation

Corpora used in the field of HLT are usually annotated at various levels of linguistic analysis. The most common form of annotation added to corpora is at the level of morphosyntax, usually referred to as part-of-speech tagging. Phrase structure is another level of linguistic annotation common to many corpora; the Susanne and NEGRA corpora are examples of corpora that have been syntactically analyzed and annotated. Skeleton parses along with dependencies are included in the so-called treebanks, such as the Penn Treebank for English or the Prague Treebank for Czech. In this section, the presentation of linguistic annotations focuses on monolingual text corpus processing in the framework of information processing, extraction and retrieval. The following annotation types will be discussed with reference to relevant annotation format specifications, tools and annotated corpora:

Surface Text Analysis (tokenization and handling)
Morphosyntactic Annotation
Lemmatization
Named Entity Recognition
Surface Syntactic Analysis
Computation of Grammatical Functions in sentence parses
Semantic Annotation
Coreference Annotation
Term detection


Surface Text Analysis (Tokenization or Text Handling)

Recognizing and labeling surface phenomena in the text is a necessary prerequisite for most Natural Language Processing tasks. Therefore, tokenization, that is, basic text handling, is the first level of text analysis. This includes identifying text structure at sub-paragraph level, that is, word boundaries, sentence boundaries, dates, abbreviations, etc. Identifying word and sentence boundaries in most cases involves resolving ambiguity in punctuation use, since structurally recognizable tokens may contain ambiguous punctuation; this may be the case for numbers, alphanumeric references, dates, acronyms and abbreviations. And although many languages use spaces between words, there are others, such as Chinese and Thai, that are written with no spaces between words, and locating the word boundaries is thus a necessary preprocessing task. Agglutinative languages such as Swahili and Turkish present a different tokenization challenge, in that space-delimited words often contain multiple units, each expressing a particular grammatical meaning. A description of the MULTEXT tokenizer MtSeg along with downloadable tools is available here.
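The sketch below illustrates the kind of punctuation ambiguity involved: a naive regular-expression tokenizer with a small abbreviation list that blocks false sentence boundaries. It is not the MtSeg tool; the abbreviation list and the patterns are assumptions made for the example.

```python
import re

# Tiny, illustrative abbreviation list; a real tokenizer would use a much
# larger, language-specific resource.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def tokenize(text):
    """Split text into sentences and word tokens (naive illustration)."""
    # Candidate tokens: words (with internal periods), numbers, punctuation.
    tokens = re.findall(r"\w+(?:\.\w+)*\.?|[^\w\s]", text)
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        # A period ends a sentence unless the token is a known abbreviation.
        if tok.endswith(".") and tok.lower() not in ABBREVIATIONS:
            sentences.append(current)
            current = []
        elif tok in "!?":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

if __name__ == "__main__":
    sample = "Dr. Smith arrived at 10.30 a.m. He met Mr. Jones."
    for sent in tokenize(sample):
        print(sent)
```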

References
[1] [Armstrong, 1996] Armstrong, S., 1996. MULTEXT: Multilingual text tools and corpora. In Arbeitspapiere zum Workshop Lexikon und Text: Wiederverwendbare Methoden und Ressourcen für die linguistische Erschließung des Deutschen, Lexicographica, pages 107-119. Max Niemeyer Verlag, Tübingen.
[2] [Grover et al., 2000] Grover, C., Matheson, C., Mikheev, A., Moens, M., 2000. LT TTT - A Flexible Tokenisation Tool. In Proceedings of the 2nd Language Resources and Evaluation Conference, Athens, Greece.
[3] [Di Christo et al., 1995] Di Christo, P., Harie, S., de Loupy, C., Ide, N., and Véronis, J., 1995. Set of programs for segmentation and lexical look-up. MULTEXT LRE 62-050 project deliverable.


Morphosyntactic Annotation (Part-of-Speech Tagging) & Lemmatization

Morphosyntactic annotation is the most basic type of linguistic analysis, the aim being to assign to each lexical unit in the text a code indicating its part of speech (verb, noun, adjective, preposition, etc.) along with its morphosyntactic features (case, number, person, aspect, mood, etc.). Lemmatization is closely related to POS tagging and involves the assignment of a lemma, i.e. the base form of an inflected word, to every word token in a corpus once the morphosyntactic analysis has been performed. Part-of-speech annotation is useful because it increases the specificity of data retrieval from corpora and also forms an essential foundation for further forms of analysis (such as syntactic parsing and semantic field annotation). It has been one of the first types of annotation to be applied to corpora, usually with the aid of Natural Language Processing tools, named part-of-speech taggers, which carry out the task with a high degree of accuracy. There are several approaches to automatic part-of-speech tagging, ranging from simple rule-based tagging ([Brill, 1992]; [Karlsson et al., 1995]; [Oostdijk, 1991]) to statistically sophisticated techniques ([Church, 1988]; [Cutting et al., 1992]; [Merialdo, 1994]; [Ratnaparkhi, 1998]) and symbolic learning methods ([Brill, 1992]; [Brill, 1995]; [Daelemans et al., 1996]; [Roth et al., 1998]). Typical rule-based approaches use contextual information (in the form of contextual rules) to assign tags to unknown or ambiguous words. Morphological information is additionally used to aid the disambiguation process. Stochastic methods, on the other hand, are based on probability calculations of word and tag bindings or tag sequences, or both (tag sequence probabilities and word frequencies). Morphosyntactic annotation applied to corpora is of crucial importance in developing tools for automatic POS tagging, whether the methodology adopted is purely linguistic or not. Supervised machine learning ([Brill et al., 1993]) on the basis of large annotated corpora is used for the induction of rules and the calculation of probabilities from the data.

Examples of taggers (the list is only indicative):

Eric Brill's supervised and unsupervised POS taggers can be downloaded from his home page. A demo of Eric Brill's tagger powered by µ-TBL technology can be found here.

The EngCG-2 tagger (tool description, documentation and on-line demo can be viewed)

The CLAWS Part-of-Speech Tagger, developed at the University of Lancaster.
The Xerox part-of-speech tagger (from ftp.parc.xerox.com).

POS-tagged corpora

The Penn TreeBank has text annotated, among other levels, for parts of speech. The ILSP corpus also carries POS-tag annotation. For more details click here.
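As a minimal illustration of how an annotated corpus feeds tagger development, the sketch below estimates a most-frequent-tag lexicon from a toy set of (word, tag) pairs and then applies a single Brill-style contextual rule. The training pairs, tagset and the rule itself are invented for the example and are unrelated to the taggers and corpora listed above.

```python
from collections import Counter, defaultdict

# Toy 'annotated corpus': (word, tag) pairs. A real tagger would be trained
# on a large POS-tagged corpus such as those described in this section.
TRAINING = [
    ("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
    ("the", "DET"), ("bark", "NOUN"), ("of", "PREP"), ("the", "DET"),
    ("tree", "NOUN"), ("dogs", "NOUN"), ("bark", "VERB"),
]

def train_unigram(pairs):
    """Most-frequent-tag baseline: P(tag | word) estimated by counting."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(sentence, lexicon, default="NOUN"):
    """Assign the most frequent tag, then apply one contextual
    (Brill-style) rule: NOUN -> VERB if the previous tag is NOUN."""
    tags = [lexicon.get(w, default) for w in sentence]
    for i in range(1, len(tags)):
        if tags[i] == "NOUN" and tags[i - 1] == "NOUN":
            tags[i] = "VERB"
    return list(zip(sentence, tags))

if __name__ == "__main__":
    lexicon = train_unigram(TRAINING)
    print(tag(["the", "dogs", "bark"], lexicon))
    # -> [('the', 'DET'), ('dogs', 'NOUN'), ('bark', 'VERB')]
```

In transformation-based learning the contextual rules themselves would be learned from the annotated corpus rather than written by hand, which is the point of training on large labeled data.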

References


[1] [Brill, 1992] Brill, E., 1992. A simple rule-based part-of-speech tagger. In Proceedings of the 3rd Conference on Applied Natural Language Processing, Trento, Italy.
[2] [Brill, 1993] Brill, E., 1993. Rule-based tagger, version 1.14. Available from http://www.cs.jhu.edu/~brill
[3] [Brill et al., 1993] Brill, E. and Marcus, M., 1993. Tagging an unfamiliar text with minimal human supervision. ARPA Technical Report.
[4] [Brill, 1995] Brill, E., 1995. Unsupervised Learning of Disambiguation Rules for Part-of-Speech Tagging. In Proceedings of the 3rd Workshop on Very Large Corpora.
[5] [Cutting et al., 1992] Cutting, D., J. Kupiec, J. Pedersen and P. Sibun, 1992. A practical part-of-speech tagger. In Proceedings of the 3rd Conference on Applied Natural Language Processing.
[6] [Church, 1988] Church, K. W., 1988. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In Proceedings of the 2nd ACL Conference on Applied Natural Language Processing, Austin, Texas.
[7] [Daelemans et al., 1996] Daelemans, W., J. Zavrel, P. Berck and S. Gillis, 1996. MBT: A Memory-Based Part of Speech Tagger-Generator. In Proceedings of the 4th Workshop on Very Large Corpora, Copenhagen, Denmark.
[8] [Karlsson et al., 1995] Karlsson, F., A. Voutilainen, J. Heikkilä and A. Anttila, 1995. Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter.
[9] [Merialdo, 1994] Merialdo, B., 1994. Tagging English Text with a Probabilistic Model. Computational Linguistics, 20(2):155-171.
[10] [Oostdijk, 1991] Oostdijk, N., 1991. Corpus Linguistics and the Automatic Analysis of English. Rodopi, Amsterdam, Netherlands.
[11] [Papageorgiou et al., 2000] Papageorgiou, H., Prokopidis, P., Giouli, V., Piperidis, S., 2000. A Unified POS Tagging Architecture and its Application to Greek. In Proceedings of the 2nd Language Resources and Evaluation Conference, Athens, Greece.
[12] [Ratnaparkhi, 1998] Ratnaparkhi, A., 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. Dissertation, University of Pennsylvania.
[13] [Roth et al., 1998] Roth, D. and D. Zelenko, 1998. Part-of-Speech Tagging Using a Network of Linear Separators. In Proceedings of the 36th Annual Meeting of the ACL / COLING, Montreal, Canada.
[14] [Schütze, 1993] Schütze, Hinrich, 1993. Part-of-speech induction from scratch. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, 251-258.


Syntactic Annotation (Parsing)

After part-of-speech tagging, syntactic annotation is probably the most commonly encountered form of corpus annotation, aiming at the recognition and labeling of syntactic constituents and clauses within sentences. Syntactically annotated (also known as parsed) corpora are sometimes known as treebanks, a term that alludes to the tree diagrams or "phrase markers" used in parsing. Such visual diagrams, however, are rarely encountered in practice; rather, the same information is represented using sets of labeled brackets. Indentation of bracket-based annotations retains the properties of a tree diagram (a system used by the Penn Treebank project). For example:

[S [NP She NP] [VP sat [PP on [NP the mat NP] PP] VP] S]

Because automatic parsing has a lower success rate than part-of-speech annotation, parser output is often either post-edited by human analysts or syntactic annotation is carried out by hand (although possibly with the help of parsing software). The disadvantage of manual annotation, however, is inconsistency, especially where more than one person is involved. Structural ambiguity is another difficulty with respect to the task at hand. Guidelines, as detailed as possible, should be available, but even then ambiguities can occur where more than one interpretation is possible.

As far as tools are concerned, there are many approaches to syntactic parsing. Deterministic parsing has been employed in general-purpose systems ([Hindle, 1983a]; [Hindle, 1983b]; [Abney, 1990]) and has been targeted at processing free text by producing an analysis, partial if necessary, for each sentence. Correcting errors does not involve backtracking or unfolding the parser to an earlier state, thus avoiding speed compromises. Another parsing method, proposed in [Brill, 1993a] and [Satta et al., 1996], involves a transformational grammar, capable of parsing text into syntactic trees, which is automatically learned from a training corpus. Training starts from a naive state (i.e. all phrases are initially tagged as NPs) and the system learns a set of ordered transformations, which can be applied to reduce parsing error, by repeatedly comparing the current state to the proper phrase structure for each sentence in the training corpus. [Karlsson et al., 1995], [Voutilainen, 1993], and [Karlsson, 1990] describe a syntactic annotation algorithm implementing constraint grammar checking. According to this approach, all possible syntactic categories are assigned to each word from a lexicon, in a way similar to part-of-speech (POS) tagging with constraints. Like POS disambiguation constraints, syntactic constraints are used to discard all contextually illegitimate syntactic labels. A flat syntactic description of each sentence is given in the output.

Within these approaches, an important distinction must be made between shallow syntactic analysis (also referred to as chunking) and the labeling of recursive phrasal constituents. Shallow syntactic analysis is a less detailed approach which tends to use a less finely distinguished set of syntactic constituent types, namely chunks, and ignores the internal structure of certain constituent types. Chunks are textual units of adjacent tokens, which can be linked mutually through unambiguously identified dependency chains. Chunks are defined strictly syntactically. According to [Abney, 1996], a chunk is the non-recursive core of an intraclausal constituent, extending from the beginning of a constituent to its head (or potential governor) but not including post-head dependents. The fact that two substrings are assigned different chunks does not necessarily entail that there is no dependency relationship linking the two; it simply means that, on the basis of the available lexical knowledge, it is impossible to state unambiguously how a chunk relates to its neighbor. Full parsing, on the other hand, aims to provide as detailed an analysis of the sentence structure as possible, with the identification of recursive phrasal constituents.

Tools

Link Grammar Parser: a free syntactic parser of English, based on link grammar, an original theory of English syntax. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words. Works on a variety of platforms, including Windows.
LT CHUNK: a syntactic chunk parser from the Language Technology Group at Edinburgh.

Corpora

A well-known syntactically annotated corpus is the Penn Treebank. The NEGRA corpus is a syntactically annotated corpus for German. The ILSP corpus also carries syntactic annotation at the chunk and phrase level. Click here for more information on chunking. Recursive constituent analysis has also been performed.
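A small illustration of chunking in the sense described above: given a POS-tagged sentence, maximal non-recursive runs of determiner/adjective/noun tokens are grouped into NP chunks. The tag names and the single chunk rule are assumptions made for the example, not the grammars used by LT CHUNK or the ILSP corpus.

```python
# Toy NP chunker over a POS-tagged sentence: an NP chunk is a maximal,
# non-recursive run of DET/ADJ/NOUN tokens ending in its head noun.
NP_TAGS = {"DET", "ADJ", "NOUN"}

def np_chunks(tagged):
    """Return a flat bracketing, e.g. [NP the black cat ] sat ..."""
    out, current = [], []
    for word, tag in tagged:
        if tag in NP_TAGS:
            current.append(word)          # still inside an NP chunk
        else:
            if current:
                out.append("[NP " + " ".join(current) + " ]")
                current = []
            out.append(word)              # token outside any chunk
    if current:
        out.append("[NP " + " ".join(current) + " ]")
    return " ".join(out)

if __name__ == "__main__":
    sentence = [("the", "DET"), ("black", "ADJ"), ("cat", "NOUN"),
                ("sat", "VERB"), ("on", "PREP"),
                ("the", "DET"), ("mat", "NOUN")]
    print(np_chunks(sentence))
    # -> [NP the black cat ] sat on [NP the mat ]
```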

References
[1] [Abney, 1990] Abney, S., 1990. Rapid Incremental Parsing with Repair. In Proceedings of the 6th New OED Conference, Electronic Text Research.
[2] [Abney, 1996] Abney, S., 1996. Partial Parsing via Finite-State Cascades. In Proceedings of the Robust Parsing Workshop, ESSLLI.
[3] [Abney, 1997] Abney, S., 1997. Part of Speech Tagging and Partial Parsing. In Corpus-Based Methods in Language and Speech Processing, Steve Young and Gerrit Bloothooft (eds.), Kluwer Academic Publishers, pp. 118-136.
[4] [Brill, 1993a] Brill, E., 1993a. Transformation-Based Error-Driven Parsing. In Proceedings of the 3rd International Workshop on Parsing Technologies.
[5] [Brill, 1993b] Brill, E., 1993b. A Corpus-based Approach to Language Learning. Doctoral Dissertation, University of Pennsylvania.
[6] [Hindle, 1983a] Hindle, D., 1983a. User Manual for Fidditch. Technical Memorandum #7590-142, Naval Research Laboratory.
[7] [Hindle, 1983b] Hindle, D., 1983b. Deterministic Parsing of Syntactic Non-Fluencies. In Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics.
[8] [Karlsson et al., 1995] Karlsson, F., A. Voutilainen, J. Heikkilä, and A. Anttila (eds.), 1995. Constraint Grammar: A Language Independent System for Parsing Unrestricted Text. Mouton de Gruyter.
[9] [Karlsson, 1990] Karlsson, F., 1990. Constraint Grammar as a Framework for Parsing Running Text. In Proceedings of the 10th International Conference on Computational Linguistics.
[10] [Satta et al., 1996] Satta, G. and E. Brill, 1996. Efficient Transformation-Based Parsing. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics.
[11] [Voutilainen, 1993] Voutilainen, A., 1993. NPtool, a Detector of English Noun Phrases. In Proceedings of the Workshop on Very Large Corpora.


Functional Annotation

Functional annotation is the next step after syntactic annotation and involves the identification of functional relations among constituents, that is, subjects, direct and indirect objects, modifiers, etc. In contrast to the previous layers of information, which only apply to constituency-based annotation schemata, the layer of functional relations applies in principle to both constituency- and dependency-based annotation practices. In dependency-based models, functional specifications are expressed in terms of head-dependent relations, for example adjectives and the nouns they modify, or arguments and the verb they co-occur with. In all cases, the relation is between words only; it is indicated with an arrow pointing from a head to a dependent. In practice, functional relations are implicitly encoded through the constituent structure of a sentence. For example, the object is the noun phrase immediately dominated by the verb phrase, and the subject is the noun phrase immediately dominated by the sentence node. In current annotation practices, however, this is true only of the Lancaster/IBM annotation schema. In all remaining cases, skeletal representations are adopted which provide only flat constituent structures, and functional relations cannot be inferred on the basis of levels of embedding. Hence, in order to specify, for each relevant phrasal constituent, the function played within the sentence, flat structures need to be augmented with explicit functional annotations. The EAGLES Syntactic Annotation Group proposes Subject, Object, Indirect Object and Adjunct as a common vocabulary of basic functional relations to be used in both constituency- and dependency-based annotation practices. At the phrase level, other functions such as Head and Modifier are also recognized. The Penn Treebank also includes annotations at the level of functional relations.
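To make the head-dependent representation concrete, the sketch below encodes the functional relations of a short sentence as (dependent, relation, head) triples, using the Subject/Object/Modifier vocabulary mentioned above. The sentence and its analysis are illustrative assumptions.

```python
# Dependency-style functional annotation for "The committee approved the plan":
# each token points to its head with a labelled relation (0 = the sentence root).
tokens = ["The", "committee", "approved", "the", "plan"]

# (dependent_index, relation, head_index); indices are 1-based, 0 is the root.
relations = [
    (1, "Modifier", 2),   # "The"       modifies   "committee"
    (2, "Subject",  3),   # "committee" is the subject of "approved"
    (3, "Root",     0),   # "approved"  is the sentence root
    (4, "Modifier", 5),   # "the"       modifies   "plan"
    (5, "Object",   3),   # "plan"      is the object of "approved"
]

for dep, rel, head in relations:
    head_word = tokens[head - 1] if head else "ROOT"
    print(f"{tokens[dep - 1]:<10} --{rel}--> {head_word}")
```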


Named Entity Detection

Linguistic annotation at this level involves the recognition of names in text and their classification. Named Entity Recognition is the first task in the Information Extraction pipeline and was initially introduced by DARPA as a part of the message understanding process. The task has also been included in the Message Understanding Conferences (MUC), where extensive specifications have been provided catering for named entities of the type ENAMEX (further subcategorized as PERSON, ORGANISATION, LOCATION), TIMEX (with DATE and TIME subcategories) and NUMEX (subcategorized as MONEY and PERCENT). The Automatic Content Extraction (ACE) project has further extended this entity classification, providing a more fine-grained distinction of ENAMEX subtypes with the definition of FACILITY and GEO-POLITICAL ENTITY (GPE) types. The table below lists the entity types together with typical examples:

ENAMEX
  Person: Bush, he, the President
  Organization: Linguistic Data Consortium
  Facility: Alfredo Kraus Auditorium
  Location: the Hudson River
  Geo-Political Entities: the Cubans protested, the U.S. heartland, Iraq agreed, U.S. leader

TIMEX
  Date: last year, 1st January 2002
  Time: 14:30, an hour ago

NUMEX
  Money: 250,000 euro
  Percent: 80-90%
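As a rough illustration of how such categories are recognized in practice, the sketch below combines a small hand-made gazetteer with simple regular expressions for NUMEX- and TIMEX-style expressions; the lists, patterns and the MUC-style inline tags are illustrative assumptions, not the MUC or ACE reference implementation.

import re

# Tiny illustrative gazetteer; real systems use large name lists and context rules.
GAZETTEER = {
    "Linguistic Data Consortium": "ORGANIZATION",
    "Hudson River": "LOCATION",
    "Bush": "PERSON",
}

# Simple surface patterns for a couple of NUMEX/TIMEX expressions.
PATTERNS = [
    (re.compile(r"\b\d+(?:,\d{3})*\s*(?:euro|dollars?)\b"), "MONEY"),
    (re.compile(r"\b\d+(?:-\d+)?%"), "PERCENT"),
    (re.compile(r"\b\d{1,2}:\d{2}\b"), "TIME"),
]

def tag_entities(text):
    """Return text with MUC-style inline tags around recognized entities."""
    for name, category in GAZETTEER.items():
        text = text.replace(name, f'<ENAMEX TYPE="{category}">{name}</ENAMEX>')
    for pattern, category in PATTERNS:
        tag = "NUMEX" if category in ("MONEY", "PERCENT") else "TIMEX"
        text = pattern.sub(
            lambda m, t=tag, c=category: f'<{t} TYPE="{c}">{m.group(0)}</{t}>', text)
    return text

print(tag_entities("The Linguistic Data Consortium raised 250,000 euro, up 80-90% at 14:30."))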

References
[1] [Chinchor, 1997] Chinchor, N., 1997. MUC-7 Named Entity Task Definition, Version 3.5.
[2] ACE Entity Detection and Tracking, Phase 1. ACE Pilot Study Task Definition, v2.2.
[3] Annotating Metonyms in EDT (Camp B Style). Version 1.2. February 13, 2001.
[4] EDT Metonymy Annotation Guidelines. Version 2.2. 2001-06-05.


Sense Tagging

Sense tagging is an instance of semantic annotation, the latter being a broader term in that it also addresses other issues such as dependency, coordination, thematic role assignment, coreference, etc. In NLP, semantic annotation covers tasks such as word-sense disambiguation, summarization and information extraction. Sense tagging is an obvious next step beyond grammatical annotation and involves the assignment of semantic tags to POS-tagged words with the goal of distinguishing the lexicographic senses of the same word - a procedure also known as word sense disambiguation. Sense tagging is applied to content words only, and it is crucial in Machine Translation and in cross-language information retrieval and extraction. In most cases, sense tagging is performed semi-automatically with the aid of lexica or well-known ontologies (WordNet, VerbNet, etc.). The UCREL semantic analysis system is a software system for undertaking the automatic semantic analysis of text. The system has been designed and used across a number of research projects undertaken at Lancaster. The semantic tagset was originally loosely based on Tom McArthur's Longman Lexicon of Contemporary English. It has a multi-tier structure with 21 major discourse fields, each subdivided, and with the possibility of further fine-grained subdivision in certain cases. The full tagset is available on-line. The following is an excerpt of the semantic tagset developed at Lancaster:
B1 Anatomy and physiology
B2 Health and disease
B3 Medicines and medical treatment
B4 Cleaning and personal care
B5 Clothes and personal belongings
C1 Arts and crafts
E1 EMOTIONAL ACTIONS, STATES AND PROCESSES: General
E2 Liking
E3 Calm/Violent/Angry
E4 Happy/sad
E4.1 Happy/sad: Happy
E4.2 Happy/sad: Contentment
E5 Fear/bravery/shock
E6 Worry, concern, confident
F1 Food
F2 Drinks
F3 Cigarettes and drugs
F4 Farming & Horticulture
G1 Government, Politics and elections
G1.1 Government etc.
G1.2 Politics
G2 Crime, law and order
G2.1 Crime, law and order: Law and order
G2.2 General ethics
G3 Warfare, defence and the army; weapons
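To make the idea of assigning one of a word's lexicographic senses more concrete, here is a minimal Lesk-style sketch using WordNet glosses via NLTK; it is a toy overlap heuristic, written under the assumption that NLTK and its WordNet data are installed, and is not the UCREL tagger or any of the systems cited below.

from nltk.corpus import wordnet as wn

def simple_lesk(word, context_words):
    """Pick the WordNet sense whose gloss shares most words with the context.
    A toy version of the Lesk overlap heuristic, for illustration only."""
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word):
        gloss = set(sense.definition().lower().split())
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sentence = "I deposited the cheque at the bank on the corner".split()
sense = simple_lesk("bank", sentence)
print(sense, "-", sense.definition() if sense else "no sense found")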

Semantically annotated corpora

SemCor is a semantically tagged corpus developed at Princeton University. It is a subset of the Brown corpus and comprises more than 670,000 words from 352 text files, with all content words (nouns, adjectives, verbs and adverbs) manually annotated with WordNet senses, so that it can serve as a standard disambiguated resource for the evaluation of semantic taggers.


The Penn Treebank is another corpus that has been tagged semi-automatically for word senses, using VerbNet, an extension and revision of the standard WordNet.

References
[1] [Corrazzari et al., 2000] Corrazzari, O., Calzolari, N., and Zampolli, A., 2000. An Experiment of Lexical-Semantic Tagging of an Italian Corpus. In Proceedings of the 2nd Language Resources and Evaluation Conference, Athens, Greece.
[2] [Kilgarriff, 1998] Kilgarriff, A., 1998. Gold Standard Datasets for Evaluating Word Sense Disambiguation Programs.
[3] [Palmer et al., 2000] Palmer, M., Dang, H. T., and Rosenzweig, J., 2000. Semantic Tagging for the Penn Treebank. In Proceedings of the 2nd Language Resources and Evaluation Conference, Athens, Greece.
[4] [Wilks et al., 1997] Wilks, Y., and Stevenson, M., 1997. Sense Tagging: Semantic Tagging with a Lexicon.
[5] [Wilks et al., 1998] Wilks, Y., and Stevenson, M., 1998. Word Sense Disambiguation using Optimised Combinations of Knowledge Sources. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics (COLING-ACL 98), Montreal, Canada.
[6] [Wilson et al., 1993] Wilson, A., and Rayson, P., 1993. Automatic Content Analysis of Spoken Discourse: A Report on Work in Progress. In Souter, C. and Atwell, E. (eds.), Corpus Based Computational Linguistics. Amsterdam: Rodopi.


Coreference Annotation

Annotation at the level of coreference is considered a type of semantic analysis and is designed to assist the evaluation of coreference resolution algorithms. The overall task is a prerequisite for many applications such as information extraction and template filling, summarization, etc. The relation of coreference has been defined as holding between two noun phrases if they refer to the same entity ([Hirschman et al., 1997]), i.e., if they have the same referent. Coreference annotation was first introduced at the Message Understanding Conferences (MUC). The assumption underlying most annotation schemes for coreference is that processing text involves building a discourse model containing discourse entities, and that anaphoric relations are relations between these discourse entities. Within the MUC-7 coreference task definition, special tags are used to annotate the text spans that introduce a discourse entity, that is, spans that can subsequently be referred to by means of anaphoric expressions (referring expressions) [Hirschman, 1997]. Annotation, thus, proceeds in two steps:
Referring expressions and candidate antecedents (NPs) are detected and characterized
Anaphors are linked to their antecedents (coreference resolution)
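A minimal sketch of what the two steps produce - first a set of markables, then links from anaphors to antecedents - is given below; the data structures, offsets and IDs are purely illustrative and do not follow the exact MUC-7 SGML syntax.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Markable:
    """A text span that introduces or refers to a discourse entity."""
    id: str
    start: int          # character offset where the span starts
    end: int            # character offset where the span ends (exclusive)
    text: str
    antecedent: Optional[str] = None   # id of the markable it corefers with

text = "HSBC announced that the bank will expand. It hired 200 staff."

# Step 1: markables detected and characterized (normally by an NP/pronoun detector).
markables = [
    Markable("m1", 0, 4, "HSBC"),
    Markable("m2", 20, 28, "the bank"),
    Markable("m3", 42, 44, "It"),
]

# Step 2: anaphors linked to their antecedents (coreference resolution).
markables[1].antecedent = "m1"
markables[2].antecedent = "m1"

for m in markables:
    link = f" -> {m.antecedent}" if m.antecedent else ""
    print(f"{m.id}: '{m.text}'{link}")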

In most annotation schemas, the coreference relation is marked between elements of the following categories: nouns, noun phrases and pronouns; elements of these categories are the markables. The relation is marked, i.e., anaphors are linked to their antecedents, only between pairs of elements both of which are markables. This means that some markables that look anaphoric are not to be coded, including pronouns, demonstratives and definite NPs whose antecedent is a clause rather than a markable.

References
[1] [Dagan et al., 1990] Dagan, I., and Itai, A., 1990. Automatic Processing of Large Corpora for the Resolution of Anaphoric References. In Proceedings of the 13th International Conference on Computational Linguistics (COLING).
[2] [Grosz et al., 1986] Grosz, B. J., and Sidner, C. L., 1986. Attention, Intentions, and the Structure of Discourse. Computational Linguistics, Vol. 12, No. 3, July-September 1986, pp. 175-204.
[3] [Hirschman, 1997] Hirschman, L., 1997. MUC-7 Coreference Task Definition, Version 3.0.
[4] [Hirschman et al., 1997] Hirschman, L., Robinson, P., Burger, J., and Vilain, M., 1997. Automatic Coreference: The Role of Annotated Training Data. In Proceedings of the AAAI Spring Symposium on Applying Machine Learning to Discourse Processing.
[5] [Bagga et al., 1998] Bagga, A., Baldwin, B., and Shelton, S., 1998. Coreference and its Applications. Call for papers for the workshop associated with the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, 1999. Available from www.cs.duke.edu/~amit/acc99-wkshp.html


Term Extraction based on the use of already annotated corpora

Annotation at this level involves the detection of nominal heads, either single- or multi-word, that function as terms in a given domain or subject matter, and is targeted at aiding statistical or rule-based approaches to automatic term extraction. The latter has many applications in text indexing, information retrieval and extraction, text classification, automatic text abstracting and summarization, and parallel text alignment. POS-tagging and chunking are prerequisites for efficient term annotation. Experience has shown that manual acquisition of terminological data from texts is a very labour-intensive and error-prone task. Automatic corpus analysis supports terminology acquisition by scanning the corpus for terminologically relevant information and generating lists of term candidates, which then have to be post-edited by humans.
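The following sketch illustrates the rule-based flavour of such approaches: it scans a POS-tagged sentence for simple noun-phrase patterns (adjectives or nouns followed by a noun head) and collects them as multi-word term candidates. The pattern and the simplified tags are assumptions in the spirit of Justeson and Katz (1993), not a reimplementation of any of the systems cited below.

import re

# A POS-tagged sentence as (token, tag) pairs; tags follow a simplified scheme.
tagged = [("automatic", "ADJ"), ("term", "NOUN"), ("extraction", "NOUN"),
          ("relies", "VERB"), ("on", "ADP"), ("shallow", "ADJ"),
          ("syntactic", "ADJ"), ("analysis", "NOUN")]

def term_candidates(tagged_tokens):
    """Collect maximal (ADJ|NOUN)* NOUN sequences as multi-word term candidates."""
    # Encode the tag sequence as a string so a regular expression can scan it.
    tag_string = " ".join(tag for _, tag in tagged_tokens)
    candidates = []
    pattern = re.compile(r"(?:(?:ADJ|NOUN) )*NOUN")
    for match in pattern.finditer(tag_string):
        start = tag_string[:match.start()].count(" ")   # index of the first token
        length = match.group(0).count(" ") + 1          # number of tokens in the match
        words = [w for w, _ in tagged_tokens[start:start + length]]
        if length > 1:                                  # keep multi-word candidates only
            candidates.append(" ".join(words))
    return candidates

print(term_candidates(tagged))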

References
[1] [Bourigault, 1992] Bourigault, D., 1992. Surface Grammatical Analysis for the Extraction of Terminological Noun Phrases. In Proceedings of the 14th International Conference on Computational Linguistics.
[2] [Daille et al., 1994] Daille, B., Gaussier, E., and Lange, J. M., 1994. Towards Automatic Extraction of Monolingual and Bilingual Terminology. In Proceedings of COLING 94, pp. 515-521.
[3] [Frantzi et al., 1997] Frantzi, K. T., and Ananiadou, S., 1997. Automatic Term Recognition Using Contextual Clues. In Proceedings of Mulsaic 97, IJCAI, Japan.
[4] [Georgantopoulos et al., 1999] Georgantopoulos, B., and Piperidis, S., 1999. Eliciting Terminological Knowledge for Information Extraction Applications. In Tzafestas, S. (ed.), Advances in Intelligent Systems: Concepts, Tools and Applications, Kluwer Academic Publishers, Microprocessor-Based and Intelligent Systems Engineering Series, Vol. 21, pp.
[5] [Jacquemin et al., 1999] Jacquemin, C., and Tzoukermann, E., 1999. NLP for Term Variant Extraction: A Synergy of Morphology, Lexicon and Syntax. In Strzalkowski, T. (ed.), Natural Language Information Retrieval. Kluwer, Boston, MA.
[6] [Justeson et al., 1993] Justeson, J., and Katz, S., 1993. Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text. Technical Report RC 18906, IBM Research Division.
[7] [Voutilainen, 1993] Voutilainen, A., 1993. NPtool, a Detector of English Noun Phrases. In Proceedings of the Workshop on Very Large Corpora, Columbus, Ohio: Ohio State University, June 22, 1993.


A large monitor reference corpus: The Hellenic National Corpus

The Hellenic National Corpus (HNC) is a large monitor reference corpus designed and constructed at the Institute for Language and Speech Processing (ILSP) to reflect contemporary Greek. It is a general language corpus to the extent that it covers as many genres and topics as possible (classified by an open-ended typology schema), and it is a monitor corpus since it is constantly updated: new texts are added on a daily basis, while older ones are not discarded. Currently, the HNC contains more than 27 million running words. Text collection adheres to the following selection criteria: mode of text production, date of text production, degree of readership, and language coverage.

The corpus is structurally annotated at the extratextual and intratextual levels, according to the PAROLE Corpus Encoding Standard (CES) [PAROLE MLAP : 63-386, 1995], which follows the TEI and EAGLES guidelines [Sperberg-McQueen et al., 1994]; [EAGLES documentation EAGTCWG CTYP / P, 1996]. As regards extratextual annotation, texts are classified according to the parameters of Medium, Genre and Topic. Besides these three parameters, bibliographic information characterizing each text (author, publisher, publication date, etc.) is encoded. This type of annotation is included in each text's Header, as specified by the EAGLES and TEI standards. A relational database is the core of the HNC, containing all the textual material, the annotation marks, the identifiers and the indexes used by the application [Hatzigeorgiou et al., 2000]. The corpus can be accessed for query and search purposes (concordances, statistics over words or lemmas, etc.) over the Internet via a user-friendly interface with standard web page elements (drop-down lists, checkboxes), requiring no knowledge of or familiarity with annotation marks, special symbols or characters. Querying a subcorpus is also possible on the basis of the classification and annotation parameters accompanying each text. From the total number of features used in the annotation schema, the following have been selected as query elements: medium, genre, topic, detailed genre, detailed topic, publisher, author and date.
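To illustrate what such metadata-driven subcorpus querying amounts to in relational terms, here is a toy sketch over an invented schema; the table and column names are hypothetical and do not reflect the actual HNC database or interface.

import sqlite3

# Invented schema standing in for a corpus database; the real HNC schema differs.
connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE texts (id INTEGER PRIMARY KEY, medium TEXT, genre TEXT,
                        topic TEXT, publisher TEXT, author TEXT, pub_date TEXT);
    CREATE TABLE tokens (text_id INTEGER, position INTEGER, word TEXT, lemma TEXT);
    INSERT INTO texts VALUES (1, 'newspaper', 'article', 'economy', 'X', 'Y', '2001-05-01');
    INSERT INTO tokens VALUES (1, 1, 'τράπεζα', 'τράπεζα');
""")

# Subcorpus query: occurrences of a lemma restricted by medium and topic.
rows = connection.execute("""
    SELECT t.word, x.genre, x.pub_date
    FROM tokens AS t JOIN texts AS x ON t.text_id = x.id
    WHERE t.lemma = ? AND x.medium = ? AND x.topic = ?
""", ("τράπεζα", "newspaper", "economy")).fetchall()

print(rows)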

References
[1] [EAGLES documentation EAGTCWG CTYP / P, 1996] EAGLES, Preliminary Recommendations on Corpus Typology. EAGTCWG CTYP / P (version of May 1996, available on-line).
[2] [Gavriilidou et al., 1998] Gavriilidou, M., Labropoulou, P., Papakostopoulou, N., Spiliotopoulou, S., and Nassos, N., 1998. Greek Corpus Documentation. Technical report (ILSP: PAROLE LE2-4017/10369, WP2.9-WP-ATH-1, 1998).
[3] [Hatzigeorgiou et al., 2000] Hatzigeorgiou, N., Gavriilidou, M., Piperidis, S., Carayannis, G., Papakostopoulou, N., Spiliotopoulou, S., Vacalopoulou, A., Labropoulou, P., Mantzari, E., Papageorgiou, H., and Demiros, I., 2000. Design and Implementation of the Online ILSP Greek Corpus. In Proceedings of the Second International Conference on Language Resources and Evaluation, Athens, Greece.
[4] [PAROLE MLAP : 63-386, 1995] PAROLE, Design and Composition of Reusable Harmonized Written Language Reference Corpora for European Languages. Technical report (PAROLE Consortium, MLAP: 63-386, WP 4-Task 1.1, 1995).
[5] [Sperberg-McQueen et al., 1994] Sperberg-McQueen, C. M., and Burnard, L. (eds.), 1994. Guidelines for Electronic Text Encoding and Interchange: TEI P3. Technical report (Chicago and Oxford: ACH-ACL-ALLC Text Encoding Initiative, 1994).


The ILSP multi-layer annotated corpus

The corpus presented here was built at the Language Technology Applications Department of the Institute for Language and Speech Processing in order to be used for the development and evaluation of tools in an Information Extraction setting. The corpus comprises 870,000 words taken mainly from technical and financial texts; legislative and sports texts are also included in the collection. The corpus contains annotations applied at the levels of surface syntactic analysis, part-of-speech tagging, named entity annotation, syntactic analysis and coreference relations. Terminological data annotation has been attempted on a small fragment of the corpus. Annotation was performed semi-automatically, i.e., manual corrections were applied to the automatically annotated corpus. For each layer of linguistic analysis, annotation specifications were provided, conforming to generally accepted standards for the linguistic annotation of corpora. Inter-annotator agreement was calculated for all levels of linguistic annotation to check consistency. Linguistic annotations are presented in XML format. The figure below gives an overview of the processing pipeline.

[Figure: overview of the ILSP processing pipeline. Input text is passed to a Text Handler (surface text analysis), then to a POS Tagger & Lemmatizer (morphosyntactic annotation and lemmatization, drawing on a lexicon and context rules), then to a Name Recognizer, Shallow Parser and Sem_Processor (named entity recognition, surface syntactic analysis and functional analysis, drawing on name lists, NE rules, a grammar and subcategorization data), and finally to a Discourse Interpreter (coreference resolution and template construction, drawing on frames, a domain model and inference rules), which produces the filled template.]
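Purely as an illustration of how such a cascade can be wired together, the sketch below chains placeholder functions corresponding to the pipeline stages; the function names and the data passed between stages are hypothetical and do not reflect the actual ILSP implementation.

# Hypothetical stage functions; each one takes and enriches a shared document record.
def text_handler(doc):
    doc["tokens"] = doc["text"].split()                # stand-in for real tokenization
    return doc

def pos_tagger_lemmatizer(doc):
    doc["tags"] = [("UNK", t) for t in doc["tokens"]]  # stand-in for tagging/lemmatization
    return doc

def name_recognizer(doc):
    doc["entities"] = []                               # stand-in for named entity recognition
    return doc

def shallow_parser(doc):
    doc["chunks"] = []                                 # stand-in for chunking / functional analysis
    return doc

def discourse_interpreter(doc):
    doc["template"] = {"entities": doc["entities"]}    # stand-in for coreference + templates
    return doc

PIPELINE = [text_handler, pos_tagger_lemmatizer, name_recognizer,
            shallow_parser, discourse_interpreter]

def run(text):
    doc = {"text": text}
    for stage in PIPELINE:
        doc = stage(doc)   # each stage adds a new annotation layer
    return doc["template"]

print(run("Sample input text."))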


Surface syntactic analysis

At this level of analysis, word and sentence boundaries are detected, and punctuation marks, digits, abbreviations and simple dates are recognized. Multi-word units and compound words are merged, whereas contracted forms are split, so that the appropriate tag and lemma can be assigned to each constituent at a later stage. Annotation has been performed automatically, with a MULTEXT-like tokenizer [Di Christo et al., 1995] developed at ILSP. Following common practice, the tokenizer makes use of regular expressions coupled with precompiled lists for the Greek language and simple heuristics. This proves sufficient for recognizing sentences and words effectively, with an accuracy of up to 95%. The annotation schema at this level is in accordance with the MULTEXT specifications.
<w id=w_0001 type=TOK start=0 end=9></w>
<w id=w_0002 type=TOK start=11 end=21></w>
<w id=w_0003 type=PUNCT class=PTERM_P start=23 end=23>.</w>

word.xml < PARTIAL VIEW of the final output of the tokenizer >
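As a rough sketch of the regular-expression approach described above (not the ILSP tool itself), the following splits text into sentences and tokens and records character offsets similar to those in the sample; the abbreviation list and patterns are illustrative.

import re

ABBREVIATIONS = {"Dr.", "Mr.", "etc."}      # illustrative precompiled list

TOKEN_RE = re.compile(r"\w+|[^\w\s]")       # runs of word characters, or single punctuation marks

def tokenize(text):
    """Return (token, start, end) triples with character offsets into the input."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tokens.append((match.group(0), match.start(), match.end() - 1))
    return tokens

def split_sentences(text):
    """Naive sentence splitter: a period ends a sentence unless it closes an abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+", text):
        previous_word = text[:match.start() + 1].split()[-1]
        if previous_word not in ABBREVIATIONS:
            sentences.append(text[start:match.start() + 1])
            start = match.end()
    sentences.append(text[start:])
    return sentences

text = "Dr. Smith arrived. He spoke briefly."
for sentence in split_sentences(text):
    print(sentence, tokenize(sentence))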

POS-tagging and Lemmatization

Morphosyntactic annotation and lemmatization are performed on the tokenized output, which is channeled to the part-of-speech (POS) tagging and lemmatization stage. A version of the Brill (1993a) tagger trained on Greek text has been used, together with a PAROLE-compatible tagset which, conforming to the guidelines set up by TEI and NERC, captures the morphosyntactic particularities of the Greek language. 584 different part-of-speech tags are used. The accuracy is around 90% when all features are examined and around 96% when only POS tags are taken into account. First, the tagger assigns initial tags by looking them up in a lexicon created from the manually annotated corpus during training; a suffix lexicon is used for initially tagging unknown words. 799 contextual rules are then applied to improve the output of the initial phase. The automatic annotation was validated by two linguists with the use of a Tcl/Tk Graphical User Interface. Annotation specifications defined in the framework of previous work on corpus annotation at ILSP were thus augmented, clarified and formalized. Inter-annotator agreement was measured to allow for the identification and resolution of discrepancies. After part-of-speech tagging has taken place, lemmatization is performed, i.e., lemmas are retrieved from the ILSP Greek morphological lexicon of about 70,000 lemmas.

Markup table for the <mw> element:

  Id:    [ASCII]
  Type:  NoCm, NoPr, VbMn, AsPpSp (cfr. EAGLES tables)
  Tag:   NoCmMaSgNm, NoCmFePlAc, AjBaNeSgNm (cfr. EAGLES tables)
  Href:  <cpw>, <w>
  Lemma: [ASCII]


<?xml version="1.0" encoding="ISO-8859-7" ?> <Annotation type="morpho"> <sent id="s_38" start="7922" end="8062"> <mw id="mw_38_1" lex="O" tag="NoCmMaSgNm" lemma="O" start="0" end="1"></mw> <mw id="mw_38_2" lex="." tag="NBABBR" lemma="." start="2" end="4"></mw> <mw id="mw_38_3" lex="" tag="NoPrMaSgNm" lemma="" start="5" end="13"></mw> <mw id="mw_38_4" lex="" tag="NoPrMaSgNm" lemma="" start="14" end="21"></mw> <mw id="mw_38_5" lex="" tag="AtDfFeSgGe" lemma="" start="22" end="25"></mw> <mw id="mw_38_6" lex="HSBC" tag="RgFwOr" lemma="HSBC" start="26" end="30"></mw> <mw id="mw_38_7" lex="Group" tag="RgFwOr" lemma="Group" start="31" end="36"></mw> <mw id="mw_38_8" lex="," tag="PUNCT" lemma="," start="36" end="37"></mw> <mw id="mw_38_9" lex="" tag="VbMnIdPr03SgXxIpAvXx" lemma="" start="38" end="47"></mw> <mw id="mw_38_10" lex="" tag="CjSb" lemma="" start="53" end="56"></mw> <mw id="mw_38_11" lex="" tag="AtDfMaPlNm" lemma="" start="57" end="59"></mw> <mw id="mw_38_12" lex="" tag="AjBaMaPlNm" lemma="" start="60" end="65"></mw> <mw id="mw_38_13" lex="" tag="NoCmMaPlNm" lemma="" start="66" end="75"></mw> </sent> </Annotation> morph.xml < PARTIAL VIEW of the final output of the morphosyntactic analysis >

Named Entity Annotation

Named Entity Annotation is performed on the POS-tagged data on the basis of the MUC-7 Named Entity task definition [Chinchor, 1997], with certain adaptations and modifications for capturing the peculiarities of the Greek language, while there is currently ongoing work towards moving to an ACE-compatible schema [ACE EDT Phase 1, 2000]. Currently, ENAMEX (PERSON, ORGANISATION, LOCATION), TIMEX (DATE, TIME) and NUMEX (MONEY, PERCENT) entities are recognized and classified.
<?xml version="1.0" encoding="UTF-8"?> <Annotation type="ne"> <sent id="s_38"> <ne id="38_1" start="5" end="21" type="enamex" subtype="person"/> <ne id="38_2" start="26" end="36" type="enamex" subtype="org"/> <ne id="38_3" start="133" end="139" type="enamex" subtype="loc"/> </sent> </Annotation> ne.xml < PARTIAL VIEW of the final output of the NE annotation >

Surface Syntactic Analysis


Surface Syntactic Analysis consists in marking non-recursive phrasal categories, i.e., nominal, adjectival, adverbial, prepositional and verbal chunks. Main clauses and certain subordinate clauses are also identified. The chunking notation is based on specifications developed at ILSP so as to represent an intersection of different existing schemes.
<?xml version="1.0" encoding="ISO-8859-7" ?> <Annotation type="chunk"> <sent id="s_38" chunks="13" start="7922" end="8062"><ch id="ch_38_1" type="nc" subtype="nm" start="0" end="21"></ch> <ch id="ch_38_2" type="nc" subtype="ge" start="22" end="36"></ch> <ch id="ch_38_3" type="vg" subtype="" start="38" end="47"></ch> <ch id="ch_38_4" type="adjc" subtype="nm" start="60" end="65"></ch> <ch id="ch_38_5" type="nc" subtype="nm" start="57" end="75"></ch> <ch id="ch_38_6" type="vg" subtype="" start="76" end="84"></ch> <ch id="ch_38_7" type="adjc" subtype="ac" start="88" end="97"></ch> <ch id="ch_38_8" type="nc" subtype="ac" start="85" end="111"></ch> <ch id="ch_38_9" type="nc" subtype="ac" start="112" end="127"></ch> <ch id="ch_38_10" type="nc" subtype="ac" start="133" end="139"></ch> <ch id="ch_38_11" type="pc" subtype="" start="128" end="139"></ch> </sent> </Annotation> chunk.xml < PARTIAL VIEW of the final output of the chunking analysis >

Recursive Syntactic Constituents Labeling

The corpus has also been annotated at the level of recursive nominal elements (noun phrases). Syntactic annotation at chunk and clause level is a prerequisite for this type of annotation. Nominal chunks and their modifying constituents (nominal chunks in genitive, prepositional phrases, relative clauses) are marked, and coordinated and appositive constructions are labeled as NPs. Annotation has been applied semi-automatically, with the output of the syntactic parsing tool corrected by two linguists.

Grammatical Relations

Functional relations marking has been carried out manually to guide the development of a module that identifies, for any verb, its subject, predicative complements, direct and indirect objects, prepositional phrases and clausal arguments. Annotation at this level was carried out according to the MATE (Multilevel Annotation, Tools Engineering) project specifications.


Coreference Annotation

Following the MUC-7 coreference task definition [Hirschman, 1997], special tags are used to annotate the text spans that introduce a discourse entity, that is, spans that can subsequently be referred to by means of anaphoric expressions (referring expressions). Annotation, thus, proceeds in two steps:

Referring expressions and candidate antecedents (NPs) are detected and characterized
Anaphors are linked to their antecedents (coreference resolution)

The coreference relation is marked between elements of the following categories: nouns, noun phrases and pronouns; elements of these categories are the markables. The relation is marked only between pairs of elements both of which are markables. This means that some markables that look anaphoric are not coded, including pronouns, demonstratives and definite NPs whose antecedent is a clause rather than a markable.

<?xml version="1.0" encoding="UTF-8"?> <Annotation type="functional_verb"> <sent id="s_38"> <funct id="38_1" type="SFVb"> <head start="38" end="47" /> <dep id="38_1_1" start="0" end="36" type="subj" source="np_sg_nm" /> <dep id="38_1_2" start="53" end="139" type="comp" source="cl_o" /> </funct> <funct id="38_2" type="SFVb"> <head start="76" end="84" /> <dep id="38_2_1" start="57" end="75" type="subj" source="np_pl_nm" /> <dep id="38_2_2" start="112" end="127" type="dobj" source="np_pl_ac" /> <dep id="38_2_3" start="128" end="139" type="mod" source="pp" /> <dep id="38_2_4" start="85" end="111" type="mod" source="np_sg_ac" /> </funct> </sent> </Annotation> funct.xml < PARTIAL VIEW of the final output of the functional relations annotation >


Marker: A tool for multi-level annotation

Annotations on the ILSP corpus were performed via Marker, a Java GUI that allows annotators to have simultaneous views of all levels of previous annotation while working on a particular task. Nested structures can be easily marked and viewed, while a comment facility allows users to add notes relevant to the markup task at the given point. The tool is further equipped with comparison facilities that allow for inter-annotator agreement measurement and assessment of the relevant processing tools. The tool currently supports annotation at the levels of POS-tagging, chunking, recursive noun phrases, named entities, terms and coreference. Annotations are stored as XML files containing metadata relevant to the annotation level. Moreover, session information in the form of metadata is stored separately from the annotation data, according to the Dublin Core Metadata Initiative. This includes Annotator (the person who carries out the annotation), Subject (what the annotation is about), Language (the language of the document), Date (annotation creation date), and Resources. Marker runs on any PC or workstation equipped with a recent version of the Java 2 Runtime Environment and is available free of charge for research purposes.


References
[1] [ACE EDT Phase 1, 2000] Entity Detection and Tracking - Phase 1. ACE Pilot Study Task Definition. August 2000.
[2] [Boutsis et al., 2000] Boutsis, S., Prokopidis, P., Giouli, V., and Piperidis, S., 2000. A Robust Parser for Unrestricted Greek Text. In Proceedings of the 2nd Language Resources and Evaluation Conference, Athens, Greece.
[3] [Brill, 1997] Brill, E., 1997. A Corpus-Based Approach to Language Learning. PhD Thesis, University of Pennsylvania.
[4] [CIMWOS, 2002] CIMWOS, 2002. Combined IMage and Word Spotting. IST project. http://www.xanthi.ilsp.gr/cimwos/
[5] [Chinchor, 1997] Chinchor, N., 1997. MUC-7 Named Entity Task Definition, Version 3.5.
[6] [Demiros et al., 2000] Demiros, I., Boutsis, S., Giouli, V., Liakata, M., Papageorgiou, H., and Piperidis, S., 2000. Named Entity Recognition in Greek Texts. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, Athens, Greece.
[7] [DCMI documentation] DCMI. Dublin Core Metadata Initiative. http://www.dublincore.org/documents/
[8] [Di Christo et al., 1995] Di Christo, P., Harie, S., de Loupy, C., Ide, N., and Veronis, J., 1995. Set of Programs for Segmentation and Lexical Look-up. MULTEXT LRE 62-050 project deliverable.
[9] [Grishman, 1997] Grishman, R., 1997. TIPSTER Architecture Design Document, Version 2.3. Technical Report, DARPA.
[10] [Hirschman, 1997] Hirschman, L., 1997. MUC-7 Coreference Task Definition, Version 3.0. In Proceedings of MUC-7.
[11] [Labropoulou et al., 1996] Labropoulou, P., Mantzari, E., and Gavriilidou, M., 1996. Lexicon Morphosyntactic Specifications: Language-Specific Instantiation (Greek). PP-PAROLE, MLAP 63-386 report.
[12] [Lappin et al., 1994] Lappin, S., and Leass, H. J., 1994. An Algorithm for Pronominal Anaphora Resolution. Computational Linguistics, 20 (4).
[13] [Dybkjaer et al., 1998] Dybkjaer, L., Bernsen, N. O., Dybkjaer, H., McKelvie, D., and Mengel, A., 1998. The MATE Markup Framework. MATE Deliverable D1.2.
[14] [Van Noord et al., 1999] Van Noord, G., and Gerdemann, D., 1999. An Extensible Regular Expression Compiler for Finite-State Approaches in Natural Language Processing. WIA, Potsdam, Germany.
[15] [Papageorgiou et al., 2000] Papageorgiou, H., Prokopidis, P., Giouli, V., and Piperidis, S., 2000. A Unified POS Tagging Architecture and its Application to Greek. In Proceedings of the 2nd Language Resources and Evaluation Conference, Athens, Greece.
[16] [Papageorgiou et al., 2002] Papageorgiou, H., Prokopidis, P., Giouli, V., Demiros, I., Konstantinidis, A., and Piperidis, S., 2002. Multi-level XML-based Corpus Annotation. In Proceedings of the 3rd Language Resources and Evaluation Conference, Las Palmas, Canary Islands, Spain.


Linguistic Annotation of Parallel Corpora

For a parallel corpus to be of use, it should be aligned, alignment being the process of linking together pairs of words, phrases, terms or sentences which are translation equivalents in texts in different languages. Alignment is the form of annotation carried out on parallel corpora to facilitate the building and evaluation of translation memories used in Computer Assisted Translation. And although many parallel corpora are manually aligned, automatic alignment is at the core of parallel corpus processing, and high-accuracy aligners have been developed.

The first approaches to alignment consisted in trying to link regions of texts according to the regularity of word co-occurrences across texts: pairs of words were linked if they had similar distributions in the source and target texts. On the other hand, approaches tackling the alignment problem at phrase level, though not fully automated, involve linguistic preprocessing: a parser is used to analyze each input text at dependency rather than constituency level; semantically equivalent units are identified and linked between source and target, thus defining translation units, whether single words or phrases; and monolingual referential relationships are established to cater for deixis and coreference. In this approach, translation units are proposed to the user by the system, and the user has to confirm or correct the system's response. The system is endowed with a learning mechanism which makes it base its proposals for translation units on the whole of the text processed so far. Coupling the automatic process with dictionary definitions has also been employed. Statistical alignment, however, has attracted most attention; it is based on measuring sentence lengths in characters or in words. Pairs of sentences proposed for alignment are assigned probabilistic scores. Additionally, certain points of the texts can be anchored, thus dividing the texts into smaller sections to be aligned. Besides anchors, paragraph markers are also considered. Anchor points are specific to the texts to be aligned and usually appear in both texts. They are divided into major and minor anchors, and alignment proceeds in two steps, first aligning major anchor points and then minor anchor points.

In practice, therefore, linguistic annotation in the case of parallel corpora has the function of providing anchors for the alignment process; that is, the annotation of each individual text is the prerequisite markup according to which the units of one text (structural or linguistic) will be aligned with the units of its translation(s). The level of markup will therefore determine the level of alignment and vice versa, i.e., in order to obtain a certain level of alignment, we first need a corresponding level of annotation. In practice, alignment is performed at three levels of text structure: sentence level, phrase level, and word and term level. Specifications for the annotation of parallel corpora are provided in the framework of the PAROLE project [Gavriilidou et al., 1998].
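To give a flavour of the length-based statistical approach described above, the following sketch aligns two short "texts" by dynamic programming over sentence lengths; the cost function is a crude stand-in for the probabilistic score of Gale and Church-style aligners, and the sample sentences are invented.

def align(source_sentences, target_sentences):
    """Align sentences 1-1, 1-0 or 0-1 by minimizing length differences (in characters).
    A toy dynamic-programming aligner; real aligners use probabilistic length models."""
    n, m = len(source_sentences), len(target_sentences)
    SKIP = 50   # illustrative penalty for leaving a sentence unaligned
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:   # 1-1 alignment
                diff = abs(len(source_sentences[i - 1]) - len(target_sentences[j - 1]))
                candidates.append((cost[i - 1][j - 1] + diff, (i - 1, j - 1)))
            if i > 0:             # source sentence left unaligned
                candidates.append((cost[i - 1][j] + SKIP, (i - 1, j)))
            if j > 0:             # target sentence left unaligned
                candidates.append((cost[i][j - 1] + SKIP, (i, j - 1)))
            cost[i][j], back[i][j] = min(candidates)
    # Trace back the best path and report the 1-1 sentence pairs.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((source_sentences[i - 1], target_sentences[j - 1]))
        i, j = pi, pj
    return list(reversed(pairs))

source = ["The committee approved the budget.", "The decision was unanimous."]
target = ["Die Kommission genehmigte den Haushalt.", "Die Entscheidung war einstimmig."]
print(align(source, target))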
References
[1] [Brown et al., 1991] Brown, P., Lai, J., and Mercer, R., 1991. Aligning Sentences in Parallel Corpora. In Proceedings of ACL 1991.
[2] [Catizone et al., 1989] Catizone, R., Russell, G., and Warwick, S., 1989. Deriving Translation Data from Bilingual Texts. In Proceedings of the First Lexical Acquisition Workshop, Detroit, 1989.
[3] [Church, 1993] Church, K., 1993. Char_align: A Program for Aligning Parallel Texts at the Character Level. In Proceedings of ACL 1993.


[4] [Gavriilidou et al., 1998] Gavriilidou, M., Labropoulou, P., Papakostopoulou, N., Spiliotopoulou, S., and Nassos, N., 1998. Greek Corpus Documentation. Technical report (ILSP: PAROLE LE2-4017/10369, WP2.9-WP-ATH-1, 1998).
[5] [Gale et al., 1991] Gale, W., and Church, K., 1991. A Program for Aligning Sentences in Bilingual Corpora. In Proceedings of ACL 91.
[6] [Ide et al., 1994] Ide, N., and Veronis, J., 1994. Corpus Encoding. Draft - Work in progress. EAGLES document EAG-CSG/IR-T2.1, version of October 1994.
[7] [Johansson et al., 1993] Johansson, S., and Hofland, K., 1993. Towards an English-Norwegian Parallel Corpus. In Fries, U., Tottie, G., and Schneider, P. (eds.), Creating and Using English Language Corpora: Papers from the Fourteenth International Conference on English Language Research on Computerized Corpora, Zurich, 1993.
[8] [Kaji et al., 1992] Kaji, H., Kida, Y., and Morimoto, Y., 1992. Learning Translation Templates from Bilingual Text. In Proceedings of COLING 92.
[9] [Papageorgiou et al., 1994] Papageorgiou, H., Cranias, L., and Piperidis, S., 1994. Automatic Alignment in Parallel Corpora. In Proceedings of ACL 94.
[10] [Piperidis et al., 2000] Piperidis, S., Papageorgiou, H., and Boutsis, S., 2000. From Sentences to Words and Clauses. In Veronis, J. (ed.), Parallel Text Processing: Alignment and Use of Translation Corpora, Kluwer Academic Publishers, Text, Speech and Language Technology Series, pp. 117-138.
[11] [Sadler et al., 1990] Sadler, V., and Vendelmans, R., 1990. Pilot Implementation of a Bilingual Knowledge Bank. In Proceedings of COLING 90.
[12] [Simard et al., 1992] Simard, M., Foster, G., and Isabelle, P., 1992. Using Cognates to Align Sentences in Bilingual Corpora. In Proceedings of TMI 1992.
[13] [Sinclair, 1994] Sinclair, J., 1994. Corpus Typology. Draft - Work in progress. EAGLES document EAG-CSG/IR-T1.1, version of October 1994.
[14] [Sinclair et al., 1995] Sinclair, J., and Ball, J., 1995. Text Typology (External Criteria): Draft, version of May 1995.


APPENDIX

Links to on-line courses

Below are some links to course syllabi posted by Michael Barlow on the Corpora mailing list.

Eugene Charniak's statistical course: http://www.cs.brown.edu/courses/cs241/
Elisabeth Burr's Korpuslinguistik course: http://www.uni-duisburg.de/FB3/ROMANISTIK/PERSONAL/Burr/corpus/home.htm
Tony Berber Sardinha's Corpus Linguistics courses: 1998-1999: http://www.tonyberber.f2s.com/teaching.htm ; 2000: http://www.cursos.f2s.com/pos.htm
Mark Davies' "History of the Spanish Language" course: http://mdavies.for.ilstu.edu/HisSpan (assignments and projects: http://mdavies.for.ilstu.edu/hisspan/tareas.htm)
Chris Brew's Statistical NLP course: http://ling.ohio-state.edu/~cbrew/795M/ ; Probabilistic modeling: http://ling.ohio-state.edu/~cbrew/795V/
Javier Perez-Guerra's "English linguistics: seminar course"; the syllabus (written in Galician) is at http://www.uvigo.es/webs/h04/jperez/#caet
Bilge Say's course on Using Corpora for Language Research: http://www.ii.metu.edu.tr/nli/courses/cogs523/
Sabine Reich's Corpus course: http://www.uni-koeln.de/phil-fak/englisch/bald/outline.htm


Further Relevant Links


Linguistic Data Consortium: supports language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards. It hosts a web page on Linguistic Annotation, which describes tools and formats for creating and managing linguistic metadata on corpora. The focus is on tools which have been widely used for constructing annotated linguistic databases, and on the formats commonly adopted by such tools and databases.

Language Technology world: a comprehensive WWW information service and knowledge source on the wide range of technologies that deal with human language, provided by the National Language Technology Competence Center at DFKI.

http://devoted.to/corpora: this web page lists corpora, collections, text archives, links to courses, free concordancers and other text analysis tools, etc.

http://www.comp.lancs.ac.uk/computing/research/ucrel/corpora.html: information on corpora holdings at Lancaster. UCREL has a wide variety of machine-readable corpora held in file storage or on CD-ROM. Some corpora are held only as plain orthographic text, whilst others are held with several kinds of annotation.

http://www.comp.lancs.ac.uk/computing/research/ucrel/annotation.html: annotations developed at UCREL are presented here.

http://www.georgetown.edu/cball/corpora/tutorial.html: Tutorial on Concordances and Corpora.

Bird, S., and Liberman, M., A Formal Framework for Linguistic Annotation: http://www.ldc.upenn.edu/Papers/CIS9901_1999/revised_13Aug99.pdf

