Part of Speech Tagger 4: Post

GRAC: Grammar Checker

Maxime Biais

February 2, 2005

1 Introduction

The aim of the GRAC project is to create an open source grammar checker that is independent of the language. After a general overview of grammar checking, we explain how we intend to carry out the project. The purpose of GRAC is to provide correction based on learning, and therefore independent of any particular language.

2 About grammar checking

Grammar checking is a real challenge in natural language processing and generates a lot of enthusiasm among software publishers, because market demand is high. However, the infallible grammar checker has yet to be born.

2.1 Mistakes to be corrected

There are several types of errors in a text:

- Spelling errors correspond to the use of an unrecognized word, meaning that the word is not present in the dictionary used. We will return to the notion of a dictionary and its impact on grammar checking.
- Grammatical errors correspond to the incorrect use of a word that is present in the dictionary. This may be, among other things, a missing past participle agreement, or the use of the wrong gender or number.
- Semantic errors are errors of meaning, that is, the involuntary use of one word instead of another, or of a word out of context.

Only spelling and grammatical errors are corrected today, with varying degrees of effectiveness. Semantic errors are not detected, all the more so because they involve the notion of context. Indeed, the inappropriate use of a word may be deliberate in a specific context.

2.2 Spelling mistakes and grammar

Spell checking is tied to a dictionary of words. The quality of a spell checker depends directly on this dictionary, which must contain as many words as possible. Yet several problems remain:

- Lexical databases are incomplete, especially for technical words.
- Compound words and expressions are poorly handled, especially when there is no hyphen.
- For unknown words, the proposed corrections are too numerous and not always listed in a relevant order.

Nevertheless, spell checkers have been adopted by most word processors.

The shortcomings of grammar checkers are not primarily due to the spell checker, but to their very poor reliability. It is this poor reliability that puts many users off. Users with a strong command of grammar fear that the checker will add mistakes to their text, while users with a weak command of grammar have neither the confidence nor the means to judge the relevance of the corrections.
2.3 Dilemma

Correcting writing is a real difficulty, and one that is very hard to address. Beyond the classic errors of agreement, gender and number, one is confronted with what the author meant, and begins to touch on meaning. Depending on the context, some forms may be accepted in one place and not in another. One thinks particularly of the use of a given register (colloquial, standard, formal), but questions also arise where usage departs from the rules. For example, should the indicative or the subjunctive be used in a clause beginning with "after" (French "après que")? The Academy and purists call for the indicative, but usage has long preferred the subjunctive, and a sentence like "after he went to the factory" seems suspect.

Finally, a grammar checker is not a substitute for learning grammar, but should be a tool for continuous training in the command of the language. It must therefore provide clear and instructive messages that recall the basic rules and justify the correction of the error.

3 General operation of GRAC

The architecture of GRAC is illustrated in Figure 1. The principle of the execution phase is simple:

- The input text is passed through a lexical analyzer, which cuts it into words and sentences.
- A Part Of Speech Tagger assigns to each word of the text to be corrected a tag that corresponds to its nature. For example, in the sentence "I am a bear", the word "I" receives the tag "singular subject" and the word "am" the tag "first-person singular verb".
- After this phase, it is sufficient to verify that the sequences of tags in a sentence comply with the grammar rules that have been defined.

Before use, a learning phase must first be run, whose goal is twofold:

1. Create the knowledge base of the probabilistic Part Of Speech Tagger.
2. Create a base of grammar rules automatically derived from an error-free corpus.
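The first step of the execution phase, the lexical analyzer, can be illustrated with a minimal sketch. This is not GRAC's actual code: the function names and the regular expressions are assumptions chosen for illustration, and real text (abbreviations, numbers, quotations) would need more careful handling.

```python
import re


def split_sentences(text):
    """Cut the text into sentences after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def split_words(sentence):
    """Cut a sentence into word tokens, dropping punctuation and digits.

    Apostrophes and hyphens inside a word are kept, so that a form
    such as "l'usine" stays a single token.
    """
    return re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", sentence)


def analyze(text):
    """Return the text as a list of sentences, each a list of words."""
    return [split_words(sentence) for sentence in split_sentences(text)]


print(analyze("I am a bear. Bears like honey!"))
# → [['I', 'am', 'a', 'bear'], ['Bears', 'like', 'honey']]
```

Each inner list is one sentence, ready to be handed to the Part Of Speech Tagger.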
4 Part Of Speech Tagger: POST

The purpose of the Part Of Speech Tagger is to label each word of the input text with its grammatical characteristics. POST operates in several phases:
1. We put tags on all the words that pose no problem of ambiguity. For example, the French word for "artichokes" is a masculine plural common noun, whatever the context in which it is used.
2. The probabilistic POST is used to tag ambiguous words. It uses two knowledge bases:
   (A) A base that associates with each word the probability that it carries a given tag. For example, the word "hunting" is used as a verb with a probability of 0.7 and as a noun with a probability of 0.3. These probabilities of course depend on the training corpus used.
   (B) A base that contains all the possible combinations of three tags in a row. For example, a common noun is preceded by an article and followed by an adjective with a probability of 0.6. These probabilities are also determined from the training corpus.
   The two bases are combined to determine the most likely tag for the words that are not yet labeled.
3. Some words of the text may not be present in the first base. This is where the rule-based POST comes in: it uses language-dependent rules to determine the tag of unknown words.
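The text does not say how the two bases are combined. The sketch below assumes the simplest option: multiply the probability from base (A) by the probability from base (B) and keep the tag with the highest score, given that the neighbouring words are already tagged. The tag names, the 0.05 entry and the floor value for unseen tag triples are invented for illustration; only the "hunting" figures and the 0.6 value come from the text.

```python
# Base (A): probability of each tag for a given word (figures from the text).
LEXICAL = {"hunting": {"verb": 0.7, "noun": 0.3}}

# Base (B): probability of three tags in a row.
# The 0.6 entry comes from the text; the 0.05 entry is an invented value.
TRIGRAMS = {
    ("article", "noun", "adjective"): 0.6,
    ("article", "verb", "adjective"): 0.05,
}

UNSEEN = 1e-6  # assumed floor for tag triples absent from base (B)


def best_tag(word, prev_tag, next_tag):
    """Score each candidate tag as P(tag | word) * P(prev, tag, next)."""
    scores = {
        tag: p_lex * TRIGRAMS.get((prev_tag, tag, next_tag), UNSEEN)
        for tag, p_lex in LEXICAL[word].items()
    }
    return max(scores, key=scores.get), scores


tag, scores = best_tag("hunting", "article", "adjective")
print(tag)  # → noun
```

Although "hunting" is more often a verb in isolation, the context between an article and an adjective makes the noun reading win in this example.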
The architecture of POST is shown in Figure 2.

4.1 Rule-based POST

Sometimes a word is missing from the dictionary of tagged words, and its type must be deduced in order to improve the overall correction. Indeed, it is better to have as few words of unknown type as possible. To do this, we study the spelling of the word, particularly its ending, and try to draw general rules from it. For example, a word ending in -ism is almost certainly a masculine common noun.
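A guess based on word endings can be sketched as follows, using the English endings listed in this section. The text does not say what to do when several endings match (for example -ly and -y), so this sketch assumes that the longest matching ending wins. The table layout and the function name are illustrative, not GRAC's own.

```python
# English endings taken from the rules in this section, grouped by predicted tag.
SUFFIX_RULES = {
    "singular common noun": ["dom", "ment", "tion", "ance", "ence",
                             "er", "or", "ist", "ness", "icity"],
    "adverb": ["ly"],
    "adjective": ["ive", "ic", "al", "able", "y", "ous", "ful", "less"],
    "infinitive verb": ["ize", "ise", "ate"],
}


def guess_tag(word):
    """Return the tag of the longest matching ending, or None if none matches."""
    word = word.lower()
    best = None  # (length of the ending, tag)
    for tag, suffixes in SUFFIX_RULES.items():
        for suffix in suffixes:
            # Require at least one letter before the ending.
            if word.endswith(suffix) and len(word) > len(suffix):
                if best is None or len(suffix) > best[0]:
                    best = (len(suffix), tag)
    return best[1] if best else None


for unknown in ["kingdom", "quickly", "elasticity", "organize", "cat"]:
    print(unknown, "->", guess_tag(unknown))
```

With the longest-match assumption, "quickly" is tagged as an adverb rather than as an adjective in -y, and "elasticity" as a noun rather than as an adjective.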
This study depends heavily on the language, and generic rules do not always exist, especially in French, where words are often distinctive and not formed from a base and a generic suffix. In English, however, a few simple rules can be identified:

- If the word ends in -dom, -ment, -tion, -ance, -ence, -er, -or, -ist, -ness or -icity, we can predict a singular common noun.
- If the word ends in -ly, we can predict an adverb.
- If the word ends in -ive, -ic, -al, -able, -y, -ous, -ful or -less, we can predict an adjective.
- If the word ends in -ize, -ise or -ate, we can predict a verb in the infinitive.

This type of recognition is not as easy in French, but the following rule can be listed:

- If the word ends in -ism, we can predict a singular masculine common noun.

5 Grammar Error Detection: GED

This part is not yet implemented in GRAC. The following sections describe two ways of detecting grammatical errors.

5.1 Probabilistic GED

The probabilistic GED is similar to the probabilistic POST. The learning phase is based on a corpus free of grammatical mistakes, which we tag. We then keep the chain of each sentence, or at least of parts of sentences. A chain is a sequence of tags; such chains can also be called deduced grammar rules. Once these sequences have been learned, execution is very simple: each sentence of the text to be corrected is checked against each of the deduced grammar rules, and if a sentence fits none of them, it is very probably wrong.

5.2 Rule-based GED

The principle of execution is the same as for the probabilistic GED. The difference is that the rules are not derived from a corpus but written by hand. This time we can also define "false" rules, that is, common grammatical errors that are seen often. The advantage of this type of rule is that a correction can be proposed. For example, there will be a rule: a plural article is always followed by a plural adjective or a plural noun.
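As an illustration of a hand-written rule, here is a minimal sketch of the plural-article rule. Since the GED is not yet implemented in GRAC, everything in it is hypothetical: the tag names ("article_plural", "noun_plural", and so on), the function name and the wording of the message are assumptions made for the example.

```python
# Tags that the rule accepts right after a plural article (hypothetical names).
ALLOWED_AFTER_PLURAL_ARTICLE = {"adjective_plural", "noun_plural"}


def check_plural_article(tagged_sentence):
    """Return (position, message) pairs for each violation of the rule.

    tagged_sentence is a list of (word, tag) pairs produced by the tagger.
    """
    problems = []
    for i, (word, tag) in enumerate(tagged_sentence):
        if tag != "article_plural":
            continue
        has_next = i + 1 < len(tagged_sentence)
        next_tag = tagged_sentence[i + 1][1] if has_next else None
        if next_tag not in ALLOWED_AFTER_PLURAL_ARTICLE:
            message = (f'"{word}" is a plural article: make it singular, '
                       "or make the following adjective or noun plural.")
            problems.append((i, message))
    return problems


# Hypothetical tagged French sentences: "les chat dort" breaks the rule.
wrong = [("les", "article_plural"), ("chat", "noun_singular"), ("dort", "verb")]
right = [("les", "article_plural"), ("chats", "noun_plural"), ("dorment", "verb")]
print(check_plural_article(wrong))
print(check_plural_article(right))
```

Because the rule is written by hand, the message can name the possible fixes, which a rule deduced from a corpus cannot do by itself.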
If a sentence containing a plural article does not satisfy this rule, we can tell the user either to make the article singular, or to make the adjective or common noun that follows it plural.

6 Miscellaneous

6.1 Progress of the project

For now, only the lexical analyzer and the two types of POST have been coded. The whole GED part remains to be done. Our main problem is finding a tagged corpus, preferably in French, that is large enough (more than one million words). For the moment we are working on a simple CESANA sample corpus. It is sufficient to test POST, but it does not allow us to build a knowledge base large enough for GRAC to be usable one day. To make GRAC usable, we need to:

- Find tagged corpora and clean corpora (free of grammar and spelling mistakes) in as many languages as possible.
- Write the probabilistic grammatical error detection module and the rule-based grammatical error detection module.

6.2 Licenses

All GRAC code is under the GPL, while the documentation and technical explanations are under the FDL. Both licenses are available at: http://www.gnu.org/licenses/licenses.html

6.3 Hosting

The GRAC project is hosted on SourceForge: http://sourceforge.net/projects/grac/.

6.4 Note

This document has not been subjected to any grammatical error detector.
