

ArunaDevi, K.1, Saveeth, R.2 and KarthikSelvaKumar, K.3

1,2 Assistant Professor, Department of Computer Science Engineering, Coimbatore Institute of Technology, India

3 Junior Research Fellow, Indian School of Mines, India

Email: arunadevi.k@cit.edu.in, saveeth.r@cit.edu.in, kksk88@gmail.com

Abstract - Text classification is one of the central issues in information systems dealing with text data, because of the increasing amount of information stored in digital form. Text classification techniques have been applied to the Tamil language to extract meaningful information and knowledge from unstructured Tamil text. Because Tamil is a morphologically rich Dravidian language, classifying a Tamil document differs from classifying English text. To enhance the effectiveness of information extraction, we compare feature extraction techniques and text classifiers and suggest an efficient C-feature (compound feature). Using C-features, we can create an efficient vocabulary set for Tamil text classification.

Keywords - Tamil Text Classification, Feature Classification, Vocabulary set or bag-of-words, Text Mining, Natural Language Processing

Text mining, or knowledge discovery, is the process of finding useful information in text documents. Unlike data mining, which mainly focuses on structured information, text mining focuses on text data in unstructured form. Text mining deals with the analysis of text data with the support of machines. Research in text mining addresses the problems of text representation, text classification, text clustering, information extraction and the modeling of hidden information. Data mining methods and statistics are used to handle the specific tasks in areas such as information retrieval, natural language processing and information extraction. Information Retrieval finds the documents that contain answers to questions, that is, systems that retrieve documents based on keywords. Information Extraction is the extraction of specific information from text documents, i.e. it embraces all activities regarding document processing, such as Named Entity Recognition. Natural Language Processing (NLP) provides a significant way of understanding natural languages by the use of computers.


Tamil is one of the morphologically richest and longest surviving classical languages in the world. Tamil word order is Subject-Object-Verb (SOV). Not every Tamil sentence has to contain a subject, an object and a verb: a Tamil sentence can omit any one of the three and still be meaningful without violating the grammar. Tamil text classification is the automatic process of assigning Tamil documents to various classes or groups of documents based on their contents. The enormous amount of information stored in digital form as unstructured text cannot simply be used by computers, which normally handle text data as simple sequences of strings. Therefore, specific preprocessing methods are required in order to extract meaningful information from the documents. For Tamil documents, a few researchers have used the Vector Space Model and Artificial Neural Networks for Tamil text classification [1]. The automatic classification of Tamil text plays an important role in building a Tamil corpus. The first corpus for modern written Tamil was built at the Central Institute of Indian Languages (CIIL), Mysore, and consists of around 3.6 million words of written Tamil. The Mozhi Trust has also built a corpus of around 3 million words of written Tamil.

2. STEPS IN TEXT CLASSIFICATION

The process of automatic Tamil text classification contains three major steps, namely document representation, dimensionality reduction and choice of classifier. Tamil text classification involves document representation, stop word removal, feature extraction, vocabulary set formation and choosing a classifier for text categorization.

Document representation requires a tokenization process: a text document is split into a stream of words by removing punctuation marks, and tabs and other non-text characters are replaced by a single white space. This tokenized representation of the data is used for further processing in text classification.
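The tokenization step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `tokenize` and the choice of Unicode general categories to decide what counts as a text character are assumptions. Combining marks are kept deliberately, because Tamil vowel signs and the pulli are encoded as marks and would otherwise be stripped from their consonants.

```python
import unicodedata


def tokenize(text):
    """Split text into word tokens (hypothetical helper, not from the paper).

    Punctuation, tabs and other non-text characters are replaced by a
    space, then runs of whitespace are collapsed by str.split().
    """
    cleaned = []
    for ch in text:
        # Keep letters (L*), combining marks (M*, e.g. Tamil vowel signs
        # and the pulli) and digits (N*); everything else becomes a space.
        if unicodedata.category(ch)[0] in "LMN":
            cleaned.append(ch)
        else:
            cleaned.append(" ")
    # split() with no argument treats any run of whitespace as one separator.
    return "".join(cleaned).split()


if __name__ == "__main__":
    # Illustrative input only: "eye, pain!<TAB>place"
    print(tokenize("கண், வலி!\tஇடம்"))
```

Note that this sketch also splits hyphenated words, since the hyphen is treated as punctuation.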

Dimensionality reduction (DR) is the process of identifying key attributes and thereby eliminating redundant and irrelevant keywords or attributes from the training corpus. This process aims at reducing the complexity of text classification. A number of machine learning, knowledge engineering and probabilistic classifier-induction methods have been proposed for text classification. The most popular methods include the Naïve Bayes classifier (i.e. Bayesian probabilistic methods), regression models, example-based classification, decision trees, decision rules, the Rocchio method, support vector machines and association rule mining.

A number of feature selection methods based on entropy, statistics and optimization techniques have been proposed for text classification. Popular feature selection methods include document frequency, information gain (IG), mutual information (MI), the χ2 (CHI) statistic and term strength (TS).

Feature extraction: It is a process that tries to generate, from the original set T, a set T′ of "synthetic" terms which will maximize the effectiveness of the classifier used in text classification.
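As an illustration of one of the feature selection measures listed above, the sketch below computes the information gain of a single term using the standard definition IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)], where presence is counted per document. The function names, the toy documents and the class labels are assumptions made for the example; the paper does not give its own formula for IG.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(term, docs, labels):
    """Information gain of `term` for the class labels.

    docs   : list of token sets, one per document (hypothetical input format)
    labels : class label of each document, in the same order
    """
    all_counts = Counter(labels)
    with_term = Counter(lab for doc, lab in zip(docs, labels) if term in doc)
    # Counter subtraction keeps only positive counts.
    without_term = all_counts - with_term
    n = len(docs)
    n_with = sum(with_term.values())
    n_without = n - n_with
    conditional = ((n_with / n) * entropy(with_term.values())
                   + (n_without / n) * entropy(without_term.values()))
    return entropy(all_counts.values()) - conditional


if __name__ == "__main__":
    # Toy data: "eye" separates the two classes perfectly, "today" does not.
    docs = [{"eye", "today"}, {"eye"}, {"place"}, {"place", "today"}]
    labels = ["medicine", "medicine", "travel", "travel"]
    for term in ("eye", "today"):
        print(term, information_gain(term, docs, labels))
```

Terms can then be ranked by this score, and only the top-ranked ones kept.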

In a traditional feature extraction method, the bag-of-words contains single terms as the keywords used to represent a particular document. Here we propose a C-feature extraction method, in which the vocabulary set or bag-of-words contains compound features to represent a document.

C-features, i.e. features composed of two terms that co-occur in Tamil documents, are extracted without any restriction on distance or grouping within the document. The pair of terms identifies the document category. The following keywords are examples of C-features.

Fig. 1. C-Feature from Tamil Documents

Fig. 1 illustrates the use of the compound feature, i.e. C-feature extraction, in Tamil text classification. The first keyword example represents (eye, pain) and the second example represents (place, path). The first keyword indicates a document related to medicine, i.e. having pain in the eye, and the second keyword indicates a document related to travel, i.e. identifying a route to a particular place.
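The pairing of co-occurring terms described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `c_features` and the optional `allowed` parameter (for restricting pairs to a pre-selected set of single terms) are hypothetical, and a pair is treated as unordered, with no limit on the distance between the two terms in the document.

```python
from itertools import combinations


def c_features(tokens, allowed=None):
    """Return the set of unordered term pairs that co-occur in one document.

    tokens  : iterable of terms from a single document
    allowed : optional collection of terms eligible for pairing
              (hypothetical parameter, e.g. the top-ranked single terms)
    """
    terms = set(tokens)
    if allowed is not None:
        terms &= set(allowed)
    # Sort each pair so that (a, b) and (b, a) map to the same C-feature.
    return {tuple(sorted(pair)) for pair in combinations(terms, 2)}


if __name__ == "__main__":
    # English glosses of the Fig. 1 examples, used here for readability.
    print(c_features(["eye", "pain", "eye"]))
    print(c_features(["place", "path", "today"], allowed={"place", "path"}))
```

Because every pair of distinct terms is generated, the number of candidates grows quadratically with the number of terms, which is why restricting the input terms first is useful.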

Fig. 2. Steps in Creating Vocabulary Set from Tamil Documents
(Tamil text collection → Feature extraction, i.e. compound feature → Enhanced Tamil text collection → Tamil text classification using a classifier)

The influence of a feature f in a class c is the conditional probability of c given the occurrence of f, estimated on the training set. Let F = {f_1, f_2, ..., f_n} be the set of features associated with a collection, let C = {c_1, c_2, ..., c_m} be the set of categories or classes that occur in the document collection, and let df(f_i, c_j) be the number of training documents associated with the class c_j which contain the feature f_i. We define the influence factor as

\[
\mathrm{Inf\_fact}(f_i, c_j) = \frac{df(f_i, c_j)}{\sum_{l=1}^{m} df(f_i, c_l)} \tag{1}
\]

To obtain highly discriminative C-features, which are pairs of s-features (terms), using (1), three steps are followed:

(i) We select the best s-features that will be used to build the C-features, using information gain as the most important measure to rank the s-features.

(ii) In the selection step, we select the generated C-features that will be used to augment the documents of the training and test Tamil sets, i.e. the vocabulary set of the training and test Tamil documents. As the selection criterion, we use C-features with high dominance for a given category.

(iii) Finally, in the augmentation step, we insert only those C-features that have high dominance in the class of the training document in order to augment the training documents. A test document is augmented by inserting all high-dominance C-features that occur in the document.

The smaller the number of distinct classes in which a C-feature occurs, the higher its influence factor. The influence factor, or dominance, is important because it can be used to filter C-features that are distributed unequally across the classes, and it directly quantifies which classes should or should not have their documents augmented with a given C-feature.

The third step augments the document collection, with the aim of adding C-features that help the classifier in performing its task. We first perform the extension on the training set. The important question here is whether to add a C-feature to a document. We use the influence factor to decide whether to include a C-feature in the training set. Suppose that a C-feature f is composed of s-features from T, where T is the set of s-features. Let C = {c_1, c_2, ..., c_m} be the set of classes that occur in the collection and let Inf_fact(f, c_j) be the influence value of the C-feature f in each class c_j of C.

Finally, we can use classifiers such as the vector space model, the Naïve Bayes classifier and the k-NN classifier to classify the Tamil text documents. Using C-features, we can easily assign a Tamil test document to its predefined category by its contents.
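The influence factor and the augmentation step can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the text only says "high dominance", so the numeric `threshold` is a hypothetical parameter, as are the function names and the choice to append each selected pair to the document as a single token joined by "+". C-features are represented as tuples of two terms.

```python
from collections import defaultdict


def document_frequencies(feature_sets, labels):
    """df[f][c] = number of training documents of class c that contain f."""
    df = defaultdict(lambda: defaultdict(int))
    for feats, label in zip(feature_sets, labels):
        for f in set(feats):
            df[f][label] += 1
    return df


def influence_factor(df, feature, label):
    """df(f, c) divided by the sum of df(f, c_l) over all classes c_l."""
    per_class = df.get(feature, {})
    total = sum(per_class.values())
    return per_class.get(label, 0) / total if total else 0.0


def augment_training_doc(tokens, pairs, label, df, threshold=0.9):
    """Append pairs whose influence in the document's own class is high.

    `threshold` is an assumed cut-off; the paper does not state a value.
    """
    extra = [p for p in sorted(pairs)
             if influence_factor(df, p, label) >= threshold]
    return list(tokens) + ["+".join(p) for p in extra]


def augment_test_doc(tokens, pairs, df, threshold=0.9):
    """Append pairs that have high influence in at least one class."""
    extra = []
    for p in sorted(pairs):
        classes = df.get(p, {})
        if any(influence_factor(df, p, c) >= threshold for c in classes):
            extra.append(p)
    return list(tokens) + ["+".join(p) for p in extra]


if __name__ == "__main__":
    # Toy training data, using English glosses of the Fig. 1 examples.
    train_pairs = [{("eye", "pain")}, {("eye", "pain")},
                   {("eye", "pain"), ("path", "place")}, {("path", "place")}]
    train_labels = ["medicine", "medicine", "travel", "travel"]
    df = document_frequencies(train_pairs, train_labels)
    print(influence_factor(df, ("eye", "pain"), "medicine"))
    print(augment_training_doc(["place", "path"],
                               {("path", "place"), ("eye", "pain")},
                               "travel", df))
```

The augmented token lists can then be passed to any of the classifiers mentioned above in place of the original documents.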

In this paper, we have proposed an efficient method for extracting C-features for classifying Tamil text documents. Since more digital documents are available in English, text classification has mostly been performed for English and for some other languages. Using C-feature extraction, we can easily classify documents, because a C-feature contains a pair of terms that assigns a document to a predefined category. As future work, we plan to implement the extraction of C-features from Tamil text documents and to classify the documents using the k-NN and SVM classifiers, in order to retrieve Tamil text documents efficiently.

[1] Rajan, K., Ramalingam, V., Ganesan, M., Palanivel, S., Palaniappan, B. (2009), "Automatic Classification of Tamil documents using Vector Space Model and Artificial Neural Network", Expert Systems with Applications, Elsevier, Volume 36, Issue 8, October, DOI: 10.1016/j.eswa.2009.02.010

[2] Fábio Figueiredo, Leonardo Rocha, Thierson Couto, Thiago Salles, Marcos André G., Wagner Meira Jr. (2011), "Word co-occurrence features for text classification", Information Systems, Elsevier, Volume 36, pp. 843-858, DOI: 10.1016/j.is.2011.02.002

[3] Nawei Chen and Dorothea Blostein (2006), "A survey of document image classification: problem statement, classifier architecture and performance evaluation", Springer-Verlag, DOI: 10.1007/s10032-006-0020-2

[4] F. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys 34 (1) (2002) 1-47

[5] Gupta, Vishal and Lehal, Gurpreet Singh (2011), "Preprocessing Phase of Punjabi Language Text Summarization", Information Systems for Indian Languages, Communications in Computer and Information Science, Vol. 139, Springer-Verlag, pp. 250-253

[6] K. Chandrinos, I. Androutsopoulos, G. Paliouras, C. D. Spyropoulos, Automatic web rating: filtering obscene content on the web, in: ECDL '00: Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries, Springer-Verlag, London, UK, 2000, pp. 403-406

[7] L. Zhang, J. Zhu, T. Yao, An evaluation of statistical spam filtering techniques, ACM Transactions on Asian Language Information Processing (TALIP) 3 (4) (2004) 243-269

[8] Nidhi and Vishal Gupta (2012), "Punjabi Text Classification using Naïve Bayes, Centroid and Hybrid Approach", NLP, CRYPSIS, ICAIT, ICDIP, pp. 245-252, DOI: 10.5121/csit.2012.2421

[9] S. Dumais, J. Platt, D. Heckerman, M. Sahami, Inductive learning algorithms and representations for text categorization, in: CIKM '98: Proceedings of the 7th International Conference on Information and Knowledge Management, ACM, New York, NY, USA, 1998, pp. 148-155

[10] Gupta, Vishal and Lehal, Gurpreet S. (2009), "A Survey of Text Mining Techniques and Applications", Journal of Emerging Technologies in Web Intelligence, Vol. 1, No. 1

[11] G. Forman, Feature selection for text classification, in: Computational Methods of Feature Selection, Chapman and Hall/CRC, 2007, pp. 257-276

[12] Jingnian Chen, Houkuan Huang, Shengfeng Tian and Youli Qu (2009), "Feature selection for text classification with Naïve Bayes", Expert Systems with Applications: An International Journal, Volume 36, Issue 3, Elsevier

[13] Lam Hong Lee, Dino Isa (2010), "Automatically computed document dependent weighting factor facility for Naïve Bayes classification", Expert Systems with Applications, Elsevier, DOI: 10.1016/j.eswa.2010.05.030

[14] Kaur, Darvinder and Gupta, Vishal (2010), "A survey of Named Entity Recognition in English and other Indian Languages", IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 6