A semantic vector space and features-based approach for automatic information filtering

Expert Systems with Applications 26 (2004) 171-179
www.elsevier.com/locate/eswa
O. Nouali (a,b,*), P. Blache (b,1)
a Laboratoire de logiciels de base, CE.R.I.S.T., Rue des 3 frères Assiou, Ben Aknoun, Algiers, Algeria
b LPL, Université de Provence, 29, Av. Robert Schuman, F-13621 Aix-en-Provence, France
Abstract
With advances in communication technology, the amount of electronic information available to users keeps growing, and users face increasing difficulties in searching for and extracting relevant and useful information. There is therefore a strong demand for automatic tools that capture, filter, control and disseminate the information most likely to match a user's interests. In this paper we propose two kinds of knowledge to improve the efficiency of the information filtering process: a features-based model for representing, evaluating and classifying texts, and a semantic vector space that complements the features-based model by taking the semantic aspect into account. We use a neural network to model the user's interests (profiles) and a set of genetic algorithms for the learning process to improve filtering quality. To show the efficacy of such knowledge for the information filtering problem, we present an intelligent and dynamic email filtering tool. It assists the user in managing, selecting, classifying and discarding undesirable messages in a professional or non-professional context. Its modular structure makes it portable and easy to adapt to other filtering applications such as web browsing. We illustrate and discuss the system's performance with experimental evaluation results.
© 2003 Elsevier Ltd. All rights reserved.
Keywords: Information filtering; Neural network; Expert system; Machine learning; Email
1. Introduction
The advent of the Internet has magnified the already huge amount of information available in electronic form to users, who face increasing difficulties in searching for and extracting relevant and useful information. There is therefore a strong demand for automatic tools that capture, filter, control and disseminate the information most likely to match a user's profile or interests. Because information is represented in electronic form, the filtering process can and should be carried out automatically by the system responsible for presenting information to users. Information filtering systems have been proposed in several domains, such as Mailing Lists, Usenet News (articles), Electronic Mail, the World Wide Web, Electronic
* Corresponding author. Address: Laboratoire de logiciels de base, CE.R.I.S.T., Rue des 3 frères Assiou, Ben Aknoun, Algiers, Algeria. Tel.: 213-21-916211; fax: 213-21-912126. E-mail address: onouali@mail.cerist.dz (O. Nouali).
1 Tel.: 33-42-95-36-23; fax: 33-42-59-50-96.
0957-4174/$ - see front matter © 2003 Elsevier Ltd. All rights reserved. doi:10.1016/S0957-4174(03)00118-0
Conferences, Electronic bulletin boards, Clearing House Services, etc. BORGES is an information filtering tool for USENET news articles. TAPESTRY is an experimental system for receiving and filtering electronic documents that supports collaborative filtering: users are encouraged to annotate the documents they read (interesting, not interesting, etc.), and these annotations can then be used for filtering by other users (Goldberg, Nichols, Oki & Terry, 1992). SIFT is an information filtering tool that sorts large volumes of dynamically generated information and disseminates to the user the items likely to satisfy his or her requirements (Yan & Garcia-Molina, 1995). Such systems are limited because they rely on the occurrence of a given set of keywords to identify possibly relevant information. They involve no understanding of the input texts and are particularly weak on the semantic aspect. Current research in artificial intelligence, particularly in the field of natural language understanding, has produced technologies that, in our opinion, may help in designing intelligent information filtering systems. Such systems may determine the relevance of a document by
analyzing the relationship between the content of the document and the users' profiles. They make it possible to improve the effectiveness of filtering by unifying the semantics of documents and users' profiles. However, a pure natural language approach requires several capabilities for building and reasoning from an explicit representation of users' goals. This may result in implementation and performance complexities that are unacceptable and too costly (Manning & Schutze, 1999; Ram, 1991; Ram, 1992). We suggest that an efficient filtering system must use various strategies or approaches to retrieve, filter, or infer information. It may combine classical (statistical) methods with a subset of natural language techniques. This combination may strike a balance between the complexity of the natural language approach and the less precise results of traditional filtering approaches. In this paper, we propose an artificial intelligence approach that improves the efficiency of the information filtering process by combining statistical and natural language approaches. Section 2 gives an overview of the information filtering process and a survey of different models of textual information filtering. Section 3 presents our intelligent and dynamic email filtering tool. Section 4 presents the experiments and the results. Section 5 summarizes the conclusions.
2. Related works
Information filtering deals mainly with large amounts of incoming data. It is based on descriptions of an individual's or a group's information preferences, called profiles, which typically represent long-term user interests. In our context, the filtering process aims to select and/or eliminate information from a dynamic data stream (Belkin & Croft, 1992). The design of an effective filtering tool involves two tasks: selecting the most effective methods to match a user's interests with the available information, and deciding how a user profile should be described. In this section, we present the various ways to model users' interests and the main models used in the filtering field.
2.1. Users model
The description of users' interests (profiles) is the most crucial and difficult operation in building an information filtering system (Stadnyk & Kass, 1992; Malone, Grant, Turbak, Brobst & Cohen, 1987). Indeed, this operation addresses the following questions: What are the users' information interests? How are they identified? How are they represented? Can they be updated easily? The effectiveness of filtering is closely dependent on
the quality of the profile representation. In general, the terms contained in a document differ from those the user would use to specify his interests. Moreover, it is not easy for users to specify what those interests are, because they differ from one user to another and change constantly (Belkin & Croft, 1992; Furnas, Landauer, Gomez & Dumais, 1987; Stadnyk & Kass, 1992). To learn how to configure the user model for an information filtering system, an experimental study can be undertaken to observe how users process, evaluate and classify their relevant documents.
Keywords profile. Generally, people describe their interests simply by providing a set of keywords. This method is ambiguous, however, because a word can have several meanings and a concept can be described by several words. Checking for patterns of keywords is not enough to model interests; semantic and contextual information must also be used. For example, other sources of information could be exploited: Which articles has the user read in the past? Which organization does the user work in? Which books has the user ordered?
Documents profile. In approaches using this type of profile, the idea is generally to create a space of documents that a user has judged interesting. Each new document that is close to the documents in this space is considered relevant. A documents profile thus provides a simple and very effective representation of the user's interests. Moreover, a small number of relevant documents is also effective and much simpler to manipulate than a long list of words and/or expressions describing a person's interests.
2.2. Filtering techniques overview
Information filtering is closely related to information retrieval. Indeed, they have the same goal: both are concerned with getting information to the people who need it. The methods used are similar (Belkin & Croft, 1992); only the approach differs.
The filtering process is the dual problem of information retrieval: information retrieval is concerned with the indexing of documents, while filtering is concerned with the indexing of profiles. Several methods have been proposed in the information filtering field, such as:
The Boolean model. The Boolean model is one of the most widely used models (Yan & Garcia-Molina, 1993). It is the standard model, based on an exact match between profiles and documents. The user expresses a profile by words that must or must not appear in the documents to be received. A document is selected if it contains all the words included in the user's profile. The main advantage of this model is its simplicity, but it suffers from strong limitations: it is difficult (even impossible) to distinguish the most significant terms from the others, because all the words have the same weight and the same level of importance. Interesting documents may not be retained if they do not contain all the words describing a user's profile. In addition, the retrieved documents cannot be ranked by order of relevance.
Vector space (VS). In this model, both documents and profiles are described by words and represented as vectors in a multidimensional space (Yan & Garcia-Molina, 1994). Each word is assigned a weight, which represents its degree of importance. The degree of similarity between a document and a profile is measured by comparing their vectors using, for example, the cosine similarity measure or the inner product (Salton & Buckley, 1988). This model is more interesting because it evaluates the relevance of the responses, but it requires defining the vector spaces. Furthermore, word order is not taken into account. For example, the documents "a horse is better than a car" and "a car is better than a horse" are viewed in the same manner. In addition, this model does not represent the semantic aspect, since dependences between terms are not modelled. For example, "Take off your shoes" and "Remove the footwear" are seen as different although they express the same idea.
Latent Semantic Indexing (LSI). LSI represents documents with concepts. It requires studying the entire text to extract the useful relationships between the terms and the documents. A powerful and fully automatic statistical method, singular value decomposition (SVD), is used to calculate and simulate these relationships (Dumais, 1997; Foltz, 1990). The estimated similarity between two documents, called conceptual similarity, is computed as the cosine or scalar product of their representations in the concept space. Contrary to the traditional methods, LSI often takes into account some undesirable phenomena. Indeed, it can filter and select documents that do not share any words with the user's interests. However, it can be used to filter new information for more stable user interests.
But updating the concept space is expensive in time: it requires (1) the availability of a corpus to build the terms-profiles matrix and (2) a long time to execute the SVD method. A solution to this problem is to run this operation regularly during off-peak periods.
Connexionist approach. The purpose of this model is to imitate some functions of the human brain. Connexionist filtering is based on neural networks (Davalo & Naim, 1993; Eberts, 1991; Elligot & Sorensen, 1994). The nodes of the network represent concepts and the arrows represent relationships between concepts. The interest of this approach lies in its representation of information: not a set of distinct words, but a set of interconnected words carrying weights. The model is dynamic: it can learn and modify its behaviour progressively. After the training phase, the network can be used as a black box to process new data. In information filtering, this model represents users' interests as a network whose link weights are fixed automatically at the beginning and adjusted after several uses to approach a satisfactory behaviour. The main functions of the connexionist model are an activation/propagation process and a training operation. However, one of the main disadvantages of neural networks is their inability to explain the results they provide. In the next section, we present our approach to the information filtering problem, particularly in the electronic mail domain.
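The two limitations of the vector space model noted in Section 2.2 can be reproduced with a minimal sketch. It assumes raw term frequencies as weights and whitespace tokenization, neither of which is specified by the paper; the function names are illustrative only.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: lower-cased term -> raw frequency (assumed weighting)."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


a = term_vector("a horse is better than a car")
b = term_vector("a car is better than a horse")
c = term_vector("take off your shoes")
d = term_vector("remove the footwear")

order_blind = cosine_similarity(a, b)  # same words, different order
no_overlap = cosine_similarity(c, d)   # same idea, no shared word
```

The first pair is judged identical because word order is lost, and the second pair is judged unrelated because no term is shared, which is the gap the semantic aspect is meant to close.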
3. Intelligent and dynamic email filtering tool
Communication via electronic mail is one of the most popular services of the Internet. The number of messages users receive keeps growing, especially when group communication tools such as mailing lists, distribution lists, newsgroups and Usenet news are used. In particular, unsolicited messages, or spam, flood our mailboxes, causing frustration and wasting bandwidth and time. They also cost money in connection charges and may expose minors to unsuitable content (e.g. when advertising pornographic sites). A spam email is a message that is received without being requested and that tries to promote a product or service, to persuade the reader to join some get-rich-quick scheme, or to suggest miracle health products (especially for losing weight), travel offers at very attractive prices, etc. Future mailing systems will require more capable filters to help users select what to read and to spare them time spent processing incoming messages. Several email filtering systems are rule based (Cohen, 1996; Mackay, Malone, Crowston, Rao, Rosenblitt, & Card, 1989). They rely on human beings observing emails and writing a set of logical rules that can filter and classify them. Moreover, as the nature of emails changes over time, these rule sets must be constantly tuned and refined by the users themselves. The amount of work required to design such rules by hand (a time-consuming and often tedious process) and the success of machine learning techniques in text classification (Sebastiani, 1999) led us to use learning-based approaches (adaptive methods). These approaches consist of building automatic classifiers using machine learning methods trained on a collection of emails.
A growing number of machine learning techniques have been applied to text categorization in recent years, including multivariate regression models (Joachims, 1997; Yan & Garcia-Molina, 1994), nearest neighbour classification (Yang & Pedersen, 1997), Bayes probabilistic approaches (Lewis & Ringuette, 1994; Mc Callum & Nigam, 1998; Mc Donough & NG, 1994), decision trees (Lewis & Ringuette, 1994), neural networks (Dagan, Karov & Roth, 1997), inductive learning algorithms (Apte, Damerau & Weiss, 1994), and support vector machines (Joachims, 1999). In this section, we present our system, an intelligent and dynamic email filtering tool. As noted above, existing email filtering systems have limited efficiency and rely largely on user participation, especially for rule management. In
our system, we propose to combine several techniques in such a way as to: (1) improve the efficiency of the filtering process by taking the semantic aspect into account; (2) do the work on behalf of the user with a minimum of user involvement; (3) dynamically and autonomously improve the system's knowledge and expertise.
3.1. Architecture of the system
The design of an efficient filtering tool involves several tasks, such as: content analysis of the message header fields and of the message text; representation of messages and users' interests (concept space); conceptual similarity measure; filtering process (actions); learning process. Mainly, the architecture of our system is composed of four modules: an expert module to drive the filtering process, a neural network based model of the user's interests, a semantic vector space (SVS) to take the semantic aspect into account, and a set of genetic algorithms for the learning process (Fig. 1). During the first stage of message filtering, the full text of a message is parsed to produce a list of potential features that could serve as a basis for the filtering process. We first collect the individual words occurring in the message. Words that belong to the stop list, a list of high-frequency words with low content-discriminating power, are deleted. Then a stemming process reduces each remaining word to its word-stem form (term).
3.2. Expert module
The expert module consists of an expert system in charge of driving the filtering process in cooperation with the SVS and the features-based model, which help in processing, evaluating, filtering and classifying messages. It is composed of a set of rules (IF ⟨condition⟩ THEN ⟨action⟩). Conditions relate to the different fields of the message (From, Subject, To, etc.) and to the user's status (absent or present). A filtering rule might, for example, select only messages that contain the word "money" in the subject field. The system can perform different actions on incoming messages, such as deleting the message, forwarding it automatically to another email address, sorting and saving messages into a separate folder, presenting the messages in a certain order, etc. Furthermore, the system can explain and justify all message filtering decisions by displaying all selected rules.
3.3. Knowledge model
In order to construct the system's knowledge model, we collected a set of 1800 messages (1000 spam and 800 non-spam) and explored a set of factors that may best discriminate and classify emails. The word space for text tends to be very large. Consequently, our initial attempt to create the model consists simply in selecting the words with the highest values according to the mutual information criterion (Yang & Pedersen, 1997). The mutual information MI(X, C) between each word X and the class C is given by:
$$MI(X, C) = \log_2 \frac{p(X, C)}{p(X)\,p(C)}.$$
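The mutual information criterion used in Section 3.3 to select words can be sketched as follows. This is a minimal illustration, assuming that probabilities are estimated as relative frequencies over messages and that each message has already been reduced to a set of terms; the four-message corpus is invented for the example and is not the authors' data.

```python
import math


def mutual_information(messages, word, label):
    """MI(X, C) = log2( p(X, C) / (p(X) * p(C)) ).

    `messages` is a list of (set_of_terms, class_label) pairs; probabilities
    are estimated as simple relative frequencies over messages (an assumption).
    """
    n = len(messages)
    n_word = sum(1 for terms, _ in messages if word in terms)
    n_label = sum(1 for _, c in messages if c == label)
    n_both = sum(1 for terms, c in messages if word in terms and c == label)
    if n_both == 0:
        return float("-inf")  # the word never occurs in this class
    return math.log2((n_both / n) / ((n_word / n) * (n_label / n)))


# Hypothetical corpus, for illustration only.
corpus = [
    ({"free", "money", "fast"}, "spam"),
    ({"free", "investment"}, "spam"),
    ({"meeting", "report"}, "non-spam"),
    ({"free", "report"}, "non-spam"),
]

mi_money = mutual_information(corpus, "money", "spam")
mi_free = mutual_information(corpus, "free", "spam")

# Rank the vocabulary by MI with the spam class, highest first.
vocabulary = set().union(*(terms for terms, _ in corpus))
ranked = sorted(vocabulary,
                key=lambda w: mutual_information(corpus, w, "spam"),
                reverse=True)
```

Words that occur only in spam ("money") score higher than words spread over both classes ("free"), so keeping the top of `ranked` shrinks the word space while retaining discriminating terms.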
Some important keywords for the spam profile are: business, desires, free, fast, investment, miracle, money, quick, sex, etc. In addition to individual words, emails have many other important properties, and particular features characterize messages as spam. For example, particular phrases such as "free money", "credit card", "business opportunities", "investment opportunities", "miracle health products", etc. are indicative of spam email. Furthermore, email contains many non-textual features that provide a great deal of information for discriminating between emails. We considered particular non-textual features, such as the domain type of the sender. For example, spam email is virtually never sent from .edu domains. Another indicator (recipient) is found by examining whether the message was sent by an individual user or via a mailing list. A number of other simple distinctions, such as whether a message has attached documents (most spam email has no attachment) or when a given message was received (most spam email is sent at night), are also powerful distinguishers between spam and non-spam emails. We also considered a number of other useful distinctions, such as the percentage of non-alphanumeric characters, abbreviations, or numeric characters. For example, spam email contains a high percentage of non-alphanumeric characters. Table 1 summarizes some of the important specific features taken into account.

Fig. 1. Architecture of the system.

Table 1
Specific features
Feature
Domain type of the message sender (.com, .edu, .gov, etc.)
Header length
Type of message (html, plain text)
Message length
Unusual words
Abbreviation
Non-alphanumeric characters
Numeric characters
Language
Attached documents
Sentence length
Punctuation (!, ?)
Date
Etc.

3.4. Neural network model representation
The model of our system is represented by a neural network (Fig. 2). The nodes represent the features (words, phrases and specific features) and the arrows represent the relationships between features. Each feature is assigned a weight, which represents its degree of importance. We included the specific features in the initial word-based model by simply adding variables denoting the presence or absence of these features to the vector of each message. The network is composed of three levels (Nouali, 2002; Oubbad & Nouali, 1999):
Level 1. This is the input layer. It represents the meaning of the message, is the entry of the network, and is created dynamically.
Level 2. This is the hidden layer. Its nodes represent the system's knowledge.
Level 3. This is the last layer, called the output layer. It represents message classes (or profiles) and provides the outputs of the network.
Once a message arrives, its vector of features is calculated and presented to the network in signal form (level 1). For each term $t_i$, a tf-idf weight $q_i$ is computed and assigned to represent its degree of importance (Salton & Buckley, 1988). The signal is propagated through the network: each node receiving a sufficient signal becomes active and sends it to its neighbours. At level 3, each node receiving a sufficient signal from level 2 becomes active and constitutes the response to the incoming message. The same sigmoid function f is used to activate nodes at the different levels of the system model. The function f is defined by (Davalo & Naim, 1993; Oubbad & Nouali, 1999):
$$f(x) = \frac{e^{x} - 1}{e^{x} + 1}.$$

Fig. 2. Adapted neural network model.
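As a quick check on the activation function f of Section 3.4, the sketch below evaluates it directly and compares it with the identity f(x) = tanh(x/2), which is a standard algebraic fact rather than something stated in the paper. The name `f_stable` is illustrative.

```python
import math


def f(x):
    """Activation f(x) = (e^x - 1) / (e^x + 1); output lies in (-1, 1)."""
    return (math.exp(x) - 1.0) / (math.exp(x) + 1.0)


def f_stable(x):
    """Same function written as tanh(x / 2), which avoids overflow for large x."""
    return math.tanh(x / 2.0)


at_zero = f(0.0)             # a node with no input signal stays neutral
positive = f(2.0)            # strong positive entry -> close to +1
negative = f(-2.0)           # the function is odd: f(-x) = -f(x)
saturated = f_stable(1000.0) # math.exp(1000.0) alone would overflow
```

Because the output is signed, a feature with a negative entry can lower a profile's activation instead of merely failing to raise it.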
Fig. 3. Crossover genetic operator.
3.5. Message filtering
Our system performs two kinds of filtering: filtering with propagation and filtering without propagation.
Filtering without propagation (FWP). The message vector M (level 1), which represents an incoming signal, is propagated towards the hidden layer (level 2), where each node calculates its entry according to the formula:
$$E_T(t_i) = q_i. \qquad (2)$$
Fig. 4. Mutation genetic operator.
After this, each feature is activated with the equation:
$$S_T(t_i) = f(E_T(t_i)). \qquad (3)$$
Each activated node of layer T transmits its outgoing signal through the links $p_{ij}$ towards the output layer P (level 3). Profiles evaluate their incoming signal according to the formula (Nouali, 2002; Oubbad & Nouali, 1999):
$$E_P(P_j) = \sum_{i=1}^{T} S_T(t_i)\, p_{ij}.$$
Profiles are then activated by the equation:
$$S_P(P_j) = f(E_P(P_j)).$$
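The chain formed by equations (2) and (3) and the two profile equations can be traced with a small sketch of filtering without propagation, including the thresholding step described in this section. The dictionaries, weights and threshold value below are invented for illustration, and the function name is hypothetical; returning an empty list stands in for the "value zero" case.

```python
import math


def f(x):
    # Same activation as in the text: tanh(x / 2) equals (e^x - 1) / (e^x + 1).
    return math.tanh(x / 2.0)


def filter_without_propagation(q, p, threshold):
    """q: {term: tf-idf weight q_i}; p: {term: {profile: link weight p_ij}}.

    Returns (profile, activation) pairs above `threshold`, best first.
    """
    # Level 2: E_T(t_i) = q_i, then S_T(t_i) = f(E_T(t_i)).
    s_t = {term: f(weight) for term, weight in q.items()}
    # Level 3: E_P(P_j) = sum_i S_T(t_i) * p_ij, then S_P(P_j) = f(E_P(P_j)).
    e_p = {}
    for term, signal in s_t.items():
        for profile, link in p.get(term, {}).items():
            e_p[profile] = e_p.get(profile, 0.0) + signal * link
    s_p = {profile: f(entry) for profile, entry in e_p.items()}
    active = [(profile, s) for profile, s in s_p.items() if s > threshold]
    return sorted(active, key=lambda pair: pair[1], reverse=True)


# Hypothetical message weights and link weights.
message = {"free": 2.0, "money": 1.5, "meeting": 0.2}
links = {
    "free": {"spam": 0.9, "work": 0.1},
    "money": {"spam": 0.8},
    "meeting": {"work": 0.7},
}

selected = filter_without_propagation(message, links, threshold=0.3)
nothing = filter_without_propagation(message, links, threshold=0.9)
```

With these numbers only the "spam" profile passes the threshold; raising the threshold far enough leaves no active profile.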
Activated profiles whose output is above the threshold S are considered representative of the message and are sorted in decreasing order. The highest value is sent out; when no value is above S, the value zero is sent out.
Filtering with propagation (FOP). Propagation is done by extending incoming signals between features through links. In this type of filtering, the activation value of each node is calculated, and nodes with a sufficient signal become active and forward the signal to their neighbours. Feature nodes that receive a signal by propagation calculate their entry over the feature-to-feature links $w_{ij}$ using (Nouali, 2002; Oubbad & Nouali, 1999):
$$E_T(t_j) = \sum_{i=1}^{T} S_T(t_i)\, w_{ij}.$$
They are then activated by formula (3).

Table 2
Classification results using various feature sets
Feature set       Spam                          Non-spam
                  Precision (%)   Recall (%)    Precision (%)   Recall (%)
WF                97.8            81.8          77.2            97.1
WF + PF           99.4            86.3          82.2            99.2
WF + PF + SF      99.5            95.4          93.2            99.2

3.6. Semantic vector space
After studying different information filtering approaches and the features of electronic mail, we propose a new approach, SVS, to complement our features-based email filtering system. SVS combines two methods: VS and LSI. It addresses the limitation of the VS method concerning the semantic aspect. It improves the effectiveness of filtering by unifying the semantics of messages and users' profiles. With the SVS approach, the system can filter and select messages that do not share any words with the user's interests. As in VS, both messages and profiles are represented as term vectors. Before measuring the degree of similarity between a message and a profile, SVS uses a thesaurus to retrieve the real or equivalent meaning of terms. A thesaurus is a data structure more powerful than a traditional dictionary. It is organized into terms and different semantic relations: synonymy, antonymy, hyponymy, etc. Such semantic relations between words help to model the same idea or semantically close terms (for example: tree, plant, etc.). We use the LSI method to build the thesaurus. First, it requires studying messages to extract the semantic relationships between words. LSI is used to calculate and simulate these relationships. The result is an implicit user thesaurus that can be enriched by genetic algorithms (Davalo & Naim, 1993; Oubbad & Nouali, 1999). The steps to build the thesaurus are: (1) build a terms-messages matrix A from a set of annotated messages (spam or non-spam); (2) apply the LSI method to extract the relationships between terms: (a) A is decomposed into the product of three matrices using SVD, $A = U W V$, such that U and V are orthogonal and W is diagonal; W is reduced by ignoring the axes that correspond to the smallest singular values; (b) each term is represented in the concept space (W); (c) the similarity between two terms, called conceptual similarity, is estimated by the cosine or scalar product of their representations in the concept space.
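The way a thesaurus lets the semantic vector space of Section 3.6 match texts that share no words can be illustrated with a toy sketch. The thesaurus and stop list below are hand-written assumptions for the example; in the paper the thesaurus is derived with LSI, which this sketch does not attempt to reproduce.

```python
import math
from collections import Counter

# Toy thesaurus (hypothetical): surface term -> shared concept label.
THESAURUS = {"take": "remove", "shoes": "footwear"}
# Toy stop list (hypothetical); "off" is dropped to keep the example small.
STOP_WORDS = {"off", "the", "your"}


def plain_vector(text):
    """Term vector after stop-word removal, without any semantic mapping."""
    return Counter(t for t in text.lower().split() if t not in STOP_WORDS)


def concept_vector(text):
    """Term vector in which each term is replaced by its thesaurus concept."""
    return Counter(THESAURUS.get(t, t) for t in plain_vector(text).elements())


def cosine_similarity(u, v):
    dot = sum(weight * v[term] for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


message = "Take off your shoes"
profile = "Remove the footwear"

vs_score = cosine_similarity(plain_vector(message), plain_vector(profile))
svs_score = cosine_similarity(concept_vector(message), concept_vector(profile))
```

The plain vectors have no term in common, while the concept vectors coincide, which is the behaviour the text attributes to SVS.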
Fig. 5. Precision and recall in FOT.

3.7. Learning
Our system includes two kinds of learning.
3.7.1. Assisted learning
During this phase, the user is invited to view the messages filtered by the system and to give indications on the quality of the system's decisions. Following this advice, the feedback operation modifies the thesaurus (feedback on thesaurus) and the profiles (feedback on profile). The feedback operation modifies the weights of the features that contributed to the system's decision and creates new ones to improve filtering quality.
3.7.2. Automatic learning
Unlike feedback, the system tries, upon explicit user request, to generate other profiles from existing ones. This allows bad profiles to be eliminated and new domains that may interest the user to be explored. Automatic learning is ensured by a genetic algorithm that operates on a set of profiles, to which two operators are applied:
Crossover. This operator is the computational transposition of the natural reproduction phenomenon, which allows the inheritance of some characteristics from the parents (Oubbad & Nouali, 1999). It selects the best profiles and generates two children from each pair. Each child inherits some of its characteristics from the first parent and the rest from the second (Fig. 3).
Mutation. It consists of a random change of one or several profile characteristics (Oubbad & Nouali, 1999). The crossover operator becomes inefficient with time, because the generated children tend to resemble their parents; at that point mutation becomes essential, since it allows proposing to the user new domains that may interest him (Fig. 4).
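The two genetic operators of Section 3.7.2 can be sketched as below. The paper does not give their exact form, so this assumes a profile is a fixed-length list of feature weights, a one-point crossover, and a mutation that redraws a weight uniformly in [0, 1); all names and values are illustrative.

```python
import random


def crossover(parent_a, parent_b, rng):
    """One-point crossover (assumed variant): each child takes a prefix from
    one parent and the remaining characteristics from the other."""
    point = rng.randrange(1, len(parent_a))
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2


def mutate(profile, rng, rate=0.1):
    """Replace each weight by a fresh random value with probability `rate`."""
    return [rng.random() if rng.random() < rate else w for w in profile]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical profiles: one weight per feature.
parent_a = [0.9, 0.8, 0.1, 0.0]
parent_b = [0.2, 0.3, 0.7, 0.6]

child_1, child_2 = crossover(parent_a, parent_b, rng)
unchanged = mutate(parent_a, rng, rate=0.0)
mutant = mutate(parent_a, rng, rate=1.0)
```

Every characteristic of a child comes from one of its parents, so crossover alone cannot introduce a weight that neither parent had; mutation is what brings new values into the population.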
4. Evaluation
We report three experimental evaluations, described below, in which we measure the system's performance using recall and precision and analyze how the learning process influences them.
In our first experiment, we seek to determine the efficacy of using particular features in addition to individual words, specifically for the problem of spam email filtering. We use a corpus of 1800 emails, of which 1100 messages are pre-classified as spam and 700 as non-spam. This collection is split into a training set of 1250 messages (750 spam and 500 non-spam) and a testing set of 550 messages (350 spam and 200 non-spam). We first use just the individual words as the feature set (WF). We then augment these features with particular phrasal features (PF). Finally, we further enhance the feature set with the specific features (SF) described above. Using the training set with each such feature set, we build a neural network classifier that is then used to filter and classify the testing set as spam or non-spam. The precision and recall for both spam and non-spam email for each feature set are given in Table 2. We clearly find that incorporating additional features, especially PF and SF, gives consistently superior results to considering only WF in the messages.
The second evaluation consists of presenting to the system, in two different cases, a set of messages that are filtered over many sessions. During each session, precision and recall are measured and the system is given the user's position so that the feedback operation can be performed and its influence on these two rates observed. At the beginning of the experiment, unlike in FOT (Fig. 5), precision in the FWT case (Fig. 6) is greater than recall. In the first case, messages that are not represented by profiles are ignored. In the second case, the thesaurus contributes to filtering, so some messages are filtered even if they do not share keywords with profiles; this increases recall. The FWT model converges towards the user's interests more rapidly than the FOT model. We notice that certain features correlate with message content at a level that approaches but does not reach significance. Thus, any measure of a message text must take into account features that are always present and those that occur only occasionally.

Fig. 6. Precision and recall in FWT.

In the third evaluation (Fig. 7), automatic learning is performed after a series of assisted learning operations. After feedback over many sessions, weights are adjusted and bad features tend to disappear from the model; consequently precision and recall improve. We notice that automatic learning has a higher impact when the number of assisted learning operations varies from one session to another. The efficiency of automatic learning depends linearly on the number of assisted learning sessions.

Fig. 7. Automatic learning efficiency.
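The precision and recall rates used throughout Section 4 can be computed per class as in the following sketch. The labels and predictions are a made-up seven-message example, not the paper's test set, and the function name is illustrative.

```python
def precision_recall(true_labels, predicted_labels, positive):
    """Precision and recall for the class `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground truth and classifier decisions.
truth = ["spam", "spam", "spam", "spam", "non-spam", "non-spam", "non-spam"]
predicted = ["spam", "spam", "spam", "non-spam", "non-spam", "non-spam", "spam"]

spam_precision, spam_recall = precision_recall(truth, predicted, "spam")
ham_precision, ham_recall = precision_recall(truth, predicted, "non-spam")
```

Each spam message that is missed lowers spam recall and, at the same time, non-spam precision, which is why the two columns tend to move together when the feature set changes.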
5. Conclusion
Because filtering is closely related to information retrieval, most information filtering methods are based directly or indirectly on traditional information retrieval techniques. Research in information filtering is still an open field. One of the most promising approaches is to use advanced natural language technology such as lexical semantics, terminology, shallow parsing, etc. An efficient filtering tool must use various strategies to retrieve, filter, or infer information. Our system is an intelligent and dynamic email filtering tool. It helps the user filter messages and select what to read according to his or her interests. The system contains an expert module to drive the filtering process, a neural network to model the user's interests (profiles) and a set of genetic algorithms for the learning process to improve filtering quality. Furthermore, the system can explain and justify all message filtering decisions by displaying all selected rules.
The strong feature of our system is that it takes into account the semantic aspect. To this end, we use a thesaurus that is organized into terms and different semantic relations. Such semantic relations between words help to model the same idea or semantically close terms. With this thesaurus the system can filter and select messages that match the user's interests without matching any of the user's keywords; this improves filtering quality. The other feature of the system is its capability to learn continuously from its own experience and from user recommendations. Learning tries to extend the system's knowledge by modifying weights and by adding new features. The genetic algorithm (automatic learning) creates a new generation of profiles, which proposes to the user new domains that may be of interest; it also eliminates bad profiles that committed errors during the filtering process. Finally, our system includes advanced aspects of artificial intelligence. This qualifies it to be a real user assistant, reflecting the user's intentions during the filtering operation and having the capability of self-organization to act better.
References
Apte, C., Damerau, F., & Weiss, S. M. (1994). Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12(3), 233-251.
Belkin, N. J., & Croft, W. B. (1992). Information filtering and information retrieval: two sides of the same coin? Communications of the ACM, 35(12), 29-38.
Cohen, W. W. (1996). Learning rules that classify e-mail. In Proceedings of the 1996 AAAI spring symposium on machine learning in information access.
Dagan, I., Karov, Y., & Roth, D. (1997). Mistake-driven learning in text categorization. In Proceedings of EMNLP-97, second conference on empirical methods in natural language processing (pp. 55-63).
Davalo, E., & Naim, P. (1993). Des Réseaux de Neurones. Edition Eyrolles.
Dumais, S. T. (1997). Using LSI for information retrieval, information filtering and other things. Bellcore cognitive technology conference.
Eberts, R. (1991). Knowledge acquisition using neural networks for intelligent interface design. IEEE, ISSN # 0-7803-0233-8/91.
Elligot, M., & Sorensen, H. (1994). An evolutionary connectionist approach to personal information filtering. Irish neural network conference '94, University College Dublin, September 12-13.
Foltz, P. W. (1990). Using latent semantic indexing for information filtering. In Proceedings of the ACM conference on office information systems, ACM/SIGOIS, New York (pp. 40-47).
Furnas, G. W., Landauer, T. K., Gomez, L. M., & Dumais, S. T. (1987). The vocabulary problem in human-system communication. Communications of the ACM, 30(11), 964-971.
Goldberg, D., Nichols, D., Oki, B., & Terry, D. (1992). Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12), 61-70.
Joachims, T. (1997). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of ICML-97, 14th international conference on machine learning (pp. 143-151).
Joachims, T. (1999). Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-99, 16th European conference on machine learning (pp. 137-142).
Lewis, D. D., & Ringuette, M. (1994). Comparison of two learning algorithms for text categorization. In Proceedings of the third annual symposium on document analysis and information retrieval, SDAIR'94.
Mackay, W. E., Malone, T. W., Crowston, K., Rao, R., Rosenblitt, D., & Card, S. K. (1989). How do experienced Information Lens users use rules? In Proceedings of the ACM CHI'91 conference on human factors in computing systems, ACM/SIGCHI, New York (pp. 211-216).
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, Massachusetts: MIT Press.
McCallum, A., & Nigam, K. (1998). A comparison of event models for naive Bayes text classification. In Learning for text categorization.
McDonough, J., & Ng, K. (1994). Approaches to topic identification on the switchboard corpus. In International conference on acoustics, speech and signal processing, Yokohama, Japan (pp. 385-388).
Nouali, O. (2002). Classification automatique de messages: une approche hybride. RECITAL 2002, Nancy.
Oubbad, L., & Nouali, O. (1999). Système intelligent de filtrage du courrier électronique. Engineer thesis, INI, Algiers, Algeria.
Ram, A. (1991). Interest-based information filtering and extraction in natural language understanding systems. Bellcore workshop on high-performance information filtering, Morristown, NJ.
Ram, A. (1992). Natural language understanding for information filtering systems. Communications of the ACM, 35(12), 80-81.
Salton, G., & Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513-523.
Sebastiani, F. (1999). A tutorial on automated text categorisation. Proceedings of ASAI-99, First Argentinean Symposium on Artificial Intelligence.
Stadnyk, I., & Kass, R. (1992). Modeling users' interests in information filters. Communications of the ACM, 35(12), 49-50.
Yan, T. W., & Garcia-Molina, H. (1995). SIFT, a tool for wide-area information dissemination. In Proceedings of the USENIX technical conference (pp. 177-186).
Malone, T. W., Grant, K. R., Turbak, F. A., Brobst, S. A., & Cohen, M. D. (1987). Intelligent information sharing systems, computing practices. Communications of the ACM, 30(5), 390-402.
Yan, T. W., & Garcia-Molina, H. (1993). Index structures for selective dissemination of information under the Boolean model. Stanford, CA 94305: Department of Computer Science, Stanford University.
Yan, T. W., & Garcia-Molina, H. (1994). Index structures for information filtering under the vector space model. Stanford, CA 94305: Department of Computer Science, Stanford University.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. International conference on machine learning ICML, Nashville, TN, USA.

Fig. 1. Architecture of the system.

Fig. 2. Adapted neural network model.


Fig. 4. Mutation genetic operator.

After this, each feature will be activated with the equation: $S_T(t_i) = f(E_T(t_i))$ (3)

And will be activated by the equation: $S(P_j) = f(E_P(P_j))$


Fig. 5. Precision and recall in FOT.

Fig. 6. Precision and recall in FWT.


Fig. 7. Automatic learning efficiency.
