
Proceedings of the 2008 International Conference on Wavelet Analysis and Pattern Recognition, Hong Kong, 30-31 Aug. 2008

RESEARCH ON ILLEGAL E-MAILS RECOGNITION BASED ON VSM AND STATISTICAL DECISION TREE
KE-JIAN WANG, XIAN-ZHONG HAN, TAO GUO
School of Information Science and Technology, Agricultural University of Hebei, Baoding 071002, China
E-MAIL: wkj71@163.com, hxz0312@163.com

Abstract:
This paper introduces an algorithm based on the Vector Space Model (VSM) and a Statistical Decision Tree (SDT) to recognize illegal E-mails. The vector space model is simple and easy to operate. First, the VSM filters for specific words that are often used in illegal E-mails. Then, the SDT judges illegal E-mails by semantic analysis. After these two steps, illegal E-mails can be identified easily, and basic experiments show that the recognition rate of illegal E-mails is improved. Theoretical analysis and basic experiments show that illegal e-mails can be recognized effectively with the combined VSM and SDT algorithm.

Keywords:
Vector Space Model; Statistical Decision Tree; Illegal E-mails; Semantic Analysis

1. Introduction

Our life has changed a lot with the invention of the internet. Because of its convenience, cheapness and quickness, people all around the world intercommunicate with family and friends by E-mail every day, and lots of business affairs are also settled by E-mail. E-mails bring us many benefits. But at the same time, the internet is full of bad E-mails such as junk mails, virus mails, and anonymous mails. Those illegal mails, which not only occupy internet resources but also reduce working efficiency and waste time, have already become a form of social pollution. Especially in our country, illegal E-mails and junk E-mails cause serious problems, with two kinds of direct harm. One is that those E-mails waste a great deal of social resources. The other is that our country has begun to be regarded by other countries as a hotbed of illegal E-mails and junk mails, so that a great many IP addresses face the danger of being shut down. If this goes on for a long time, our country may become a "detached island of information". So it is imperative to identify and control bad e-mails.

VSM is one kind of effective method to filter illegal E-mails. But it can sometimes make mistakes, because keywords that often appear in illegal E-mails can also occur in normal E-mails. This kind of method is also limited because, in Chinese natural language, one word can have several different meanings and diverse words can share the same meaning. We therefore study VSM together with a statistical decision tree for semantic analysis of inhomogeneous illegal E-mails. First, we use VSM to filter E-mails; as a result, the E-mails are divided into two parts: one part is ordinary junk E-mail, and the other part contains specific information keywords. We delete the ordinary junk E-mails and analyze the semantics of the rest. In this process, we analyze the sentences that include special keywords and judge whether the E-mails should be accepted or rejected.

2. Algorithm based on the Vector Space Model

The classification method based on VSM is very simple and easy to run: for each type of illegal e-mail it generates a centre vector that stands for that type, computed as the arithmetic average of the training vectors. When a new E-mail arrives, we compute its vector and the distance between this vector and the centre vector of each type. We express distance by a similarity measure, and finally judge whether or not the e-mail belongs to the illegal e-mails. This method is easy, but the classification effect is not very satisfactory, since illegal e-mails may coexist with junk mails or be hidden among legitimate mails. Moreover, the number of illegal messages is smaller than the number of junk E-mails, so training sets are difficult to collect. Therefore, in order to improve the classification effect, the illegal e-mails are judged not only by VSM but also by linguistic analysis.

2.1. E-mail text message format

At present, the vector space model (VSM) is used to express text messages in information processing. The basic idea of the algorithm is that each text is expressed as a vector:


(W_1, W_2, W_3, ..., W_n), where W_i is the weight of the i-th characteristic item. In this paper, we regard keywords as the characteristic items and, based on experiment, keyword frequency as the weight; the TF-IDF formula is used in the calculation. TF-IDF expresses text features effectively, and a variety of TF-IDF formulas now exist. The following formula is used in this paper:

W(t, d) = \frac{tf(t, d) \cdot \log(N / n_t + 0.01)}{\sqrt{\sum_{t \in d} \left[ tf(t, d) \cdot \log(N / n_t + 0.01) \right]^2}}    (1)

where W(t, d) is the weight of word t in text d, tf(t, d) is the frequency of t in d, N is the size of the training text collection, and n_t is the number of training texts in which word t appears; the denominator normalizes the weight. From formula (1), the more documents in the collection contain a certain word, the weaker that word's ability to distinguish document categories, and the smaller W(t, d) is; on the other hand, the more often a certain word appears in a certain document, the better the word's ability to distinguish that document, and the greater W(t, d) is.

2.2. Feature extraction

The number of words in E-mails is quite large, which makes the vector space dimension of the text quite large as well. In order to improve the efficiency and accuracy of classification, vocabulary that does not strongly express the text's meaning should be removed, and the words that contribute to classifying the text, i.e. those whose proportion in a certain type is higher than in other types, should be selected. In this paper, the words are selected by the mutual information between words and types. The steps are as follows:

S1 At the beginning, the set of characteristics contains all the words in the type.
S2 For each word, compute the mutual information between the word and the type as

\log \frac{P(W \mid C_j)}{P(W)}

where

P(W \mid C_j) = \frac{1 + \sum_{i=1}^{|D|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(W_s, d_i)}

Here P(W | C_j) is the probability of W in class C_j, |D| is the number of training texts, N(W, d_i) is the frequency of W in text d_i, |V| is the total number of words, the double sum in the denominator is the total word frequency of the type, and P(W) is the probability of W over all training texts.
S3 Sort the words by mutual information value.
S4 Extract a certain number of words as the characteristic items; the best number of words is determined through experimental results.
S5 For each type, compress the vector dimension of all training texts according to the extracted characteristics, to simplify their representation.

2.3. The classification method of simple VSM

The simple VSM classification method is quite effective for text classification. The steps are as follows:
S1 Compute the centre vector for each type of illegal e-mail by calculating the simple arithmetic average of all training E-mail vectors of that type.
S2 When a new E-mail arrives, segment its words and express the text message as a feature vector.
S3 Calculate the similarity measure between the new feature vector and the centre vector of each type, as follows:

sim(d_i, d_j) = \frac{\sum_{k=1}^{m} W_{ik} \cdot W_{jk}}{\sqrt{\left( \sum_{k=1}^{m} W_{ik}^2 \right) \left( \sum_{k=1}^{m} W_{jk}^2 \right)}}    (2)

where d_i is the feature vector of the new e-mail, d_j is the centre vector of type j, m is the dimension of the feature vectors, and W_{ik} is the k-th dimension.
S4 Compare the similarity measures of the new text vector to each type's centre vector; the new e-mail belongs to the type with the largest similarity.
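The simple VSM classifier of Sections 2.1-2.3 can be sketched in Python. This is an illustrative sketch, not the authors' implementation: the function names, the toy corpus, and the class labels are invented for the example; the TF-IDF weight follows formula (1) and the similarity follows formula (2).

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, docs, vocab):
    # Formula (1): w(t,d) = tf(t,d) * log(N/n_t + 0.01), L2-normalized
    # over the words of the document (the denominator of formula (1)).
    N = len(docs)
    tf = Counter(doc_tokens)
    raw = []
    for t in vocab:
        n_t = sum(1 for d in docs if t in d)  # docs containing word t
        raw.append(tf[t] * math.log(N / n_t + 0.01) if n_t else 0.0)
    norm = math.sqrt(sum(w * w for w in raw)) or 1.0
    return [w / norm for w in raw]

def centroid(vectors):
    # Step S1 of 2.3: arithmetic average of all training vectors of a type.
    m = len(vectors)
    return [sum(v[k] for v in vectors) / m for k in range(len(vectors[0]))]

def cosine(a, b):
    # Formula (2): cosine similarity between two feature vectors.
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def classify(new_vec, centroids):
    # Step S4: the new mail belongs to the type with the largest similarity.
    return max(centroids, key=lambda c: cosine(new_vec, centroids[c]))
```

For instance, with a toy training set of two "illegal" mails mentioning bank accounts and two "normal" mails about meetings, a new mail containing "submit bank account" lands in the "illegal" type because its cosine similarity to that centre vector is largest.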


3. Statistical decision tree

A decision tree is also called a judgment tree [7]. We first construct the decision tree from an instance collection; this is a supervised learning method. During this process, we build the decision tree from the data of a training collection; if the tree cannot classify all objects correctly, we add some additional data to the training collection, and repeat this process until the relevant decision collection is formed. The decision tree is the tree-like structure of this decision collection. The final result is a tree whose leaf nodes are class names and whose middle nodes each represent a branching character, with each branch corresponding to a possible value of that character. The illegal characteristic of an E-mail can often be reflected by a single sentence, so we can analyze the single sentences that contain the keywords which always appear in illegal mails. According to the characteristics of the words, we can make the decision by a statistical method, so the SDT method is feasible here. In order to shorten processing time and protect users' privacy, we analyze the text and semantics only when a mail contains keywords that always appear in illegal mails, and even then we analyze only the single sentences containing those keywords. Only when such a single sentence carries a meaning that disobeys the law do we decide that the mail is illegal. This means we must do logical judgment within the sentence; its character is decided by the result of the logical operation among verb, noun, and adjective. So the emphasis of the analysis is to determine whether the keyword that always appears in illegal mails is the central predicate. According to the characters of subject and predicate, we can then determine the meaning of the simple sentence. This paper determines whether the keyword is the central predicate [1] by SDT. After the E-mails are filtered by the VSM method, we do further processing with the SDT.

The VSM method can only filter out the keywords that often appear in illegal mails, and determines illegal mails from their frequency. We all know that some normal mails can also contain those words. Without an appropriate threshold value for the similarity measure, the VSM method may make mistakes; even with an appropriate threshold, it may still make mistakes because it does not consider semantics. With a secondary judgment by the SDT, the correctness of judging an illegal E-mail is improved. For the secondary judgment, we first form a collection of the keywords that often appear in illegal mails; we call this collection the sensitive keyword dictionary. During the process, we do semantic analysis of a single sentence only when sensitive keywords appear in it, and then determine the meaning of the sentence from the characters of subject, predicate and object. We determine the structure of a sentence by its central predicate. During recognition of the central predicate, we introduce the SDT to determine the character of the key words, and then make the final judgment by a statistical method. According to the literature, whether a candidate word is the central predicate is determined both by its grammar character and by its context environment; we call the grammar character the static attribute and the context environment the dynamic attribute.

3.1. The model of SDT

The SDT is a kind of decision-making mechanism which, depending on a series of characters, gives a probability value P(f | h) to every possible option; h represents the series of characters and f represents the current choice. The probability value P(f | h) is determined by the preceding character question array q_1, q_2, ..., q_n, and each q_i is associated only with q_1, q_2, ..., q_{i-1}. After several questions, if the probability (the probability that the questioned candidate word is the central predicate) of the words coming from the training collection is high enough or low enough, we can finish the current question route. The condition for stopping the questions is P(f | h) > T or P(f | h) < 1 - T, where T is a threshold value. With num(f | h) the number of candidate words satisfying character f when the question array is h,

P(f \mid h) = P(Head \mid q_1, q_2, ..., q_l) = \frac{num(Head \mid q_1, q_2, ..., q_l)}{num(Head \mid q_1, q_2, ..., q_l) + num(\neg Head \mid q_1, q_2, ..., q_l)}

In an SDT, the inside nodes are in general question nodes; one question node represents a question about one kind of character, and the routes leaving the node represent the possible values of that character. The leaf nodes are choice nodes: a leaf node represents the character types of the words along the path from the root node to that leaf (the character types indicate whether the candidate word is the central predicate or not), and the choice made by the leaf node is represented in the form of a probability. Recognizing the central predicate in a given sentence means finding the leaf node that has the highest probability, among all the nodes, of being the central predicate. Judging the central predicate is the essence of analyzing the single sentences in
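The probability estimate and the stopping rule above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the training records and function names are invented, each record pairing a tuple of question answers with a flag for whether the candidate word was the central predicate (Head).

```python
def head_probability(records, answers):
    # P(f|h) = num(Head | q_1..q_l) / (num(Head | ...) + num(not-Head | ...)),
    # counted over training records whose answer prefix matches q_1..q_l.
    head = sum(1 for qs, is_head in records
               if qs[:len(answers)] == answers and is_head)
    total = sum(1 for qs, is_head in records
                if qs[:len(answers)] == answers)
    return head / total if total else 0.0

def ask_until_stop(records, question_answers, T=0.8):
    # Follow the question route q_1, q_2, ...; stop as soon as the
    # probability is high enough (> T) or low enough (< 1 - T).
    answers = ()
    p = head_probability(records, answers)
    for a in question_answers:
        if p > T or p < 1 - T:
            break
        answers = answers + (a,)
        p = head_probability(records, answers)
    return p
```

With the toy records below, the root probability (no questions asked yet) is 0.6, which is neither above T = 0.7 nor below 0.3, so the first question is asked; its answer "Y" narrows the records to all-Head cases and the route stops early with probability 1.0.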


mails. When the central predicate [3] is included in the sensitive keyword dictionary, we should analyze its subject and object to determine the character of the mail. For example, let sentence I be "you must submit detailed name and bank account message according to the above-mentioned demand please"; then Key(I) = {ask, according to, go, submit} and Verb(I) = {ask, according to, go}, which are verbs not in the sensitive keyword storage. When the size of the environment window is 4 words, the characteristic values of each candidate word are as Table 1 shows:

Table 1. The features for the candidates

candidate word | Adjective before | Verb before | Adjective after | Noun after | Tread verb after | Objective before | Objective of | Is haves
Please         | N | N | N | Y | Y | N | N | N
according to   | N | Y | N | Y | Y | Y | N | N
go             | N | Y | N | N | Y | Y | N | N
submit         | N | Y | N | Y | N | N | Y | N

Among them, "Y" shows that the attribute currently holds and "N" shows that it does not. The attribute values correspond to nodes of the SDT, from which we can get the probability of each candidate word being the predicate. Generally, if there is only one verb in the single sentence, that verb is the entry word of the predicate; but if there are several verbs in a sentence, we should choose the candidate word whose probability value is the biggest as the predicate, according to the SDT [2]. A part of an SDT is as follows:

Figure 1. A part of an SDT (question nodes in the extracted fragment include: Adjective before, Verb before, Tread verb after, Adjective after, Objective before, Noun after, Objective haves of)

Therefore, so long as we can build an SDT that recognizes the central word as predicate, we can recognize certain words as the predicate of a sentence. Recognizing the word as predicate is the key to analyzing the single sentences in mails: when the word as predicate is in the sensitive keyword storage [5], we should judge whether the predicate word has a subject and an object, and what the subject and object words are. According to those, we can determine whether a mail is an illegal one. The algorithm flow chart is shown in Figure 2.

Figure 2. Flow chart of VSM and SDT (sample training mails build the feature storage; a newly arrived mail goes through feature abstraction and the VSM filter, which outputs normal mail, illegal mail, or suspicious illegal mail; suspicious mail is judged by the SDT against the sensitive keyword and semantic collections, testing whether the sensitive keyword is the word as predicate and analyzing its subject and object, to decide finally between illegal mail and normal mail)

4. Conclusions

Because illegal E-mails are difficult to collect, we have only done some basic experiments. The experiments show that when words such as bank account, bank credit card, kill, grab, and so on appear frequently in a mail, the e-mail belongs to the illegal type, and it should be paid attention


to closely. The algorithm based on VSM and the Statistical Decision Tree (SDT) is effective. Applied research based on VSM is booming; although we have already made certain progress, we must keep trying to improve. The illegal means used in E-mails are renovated unceasingly, so we must study these recognition algorithms further.

Acknowledgements

This paper is supported by the Bureau of Science and Technology of Hebei (Code: 072135126).

References

[1] Zhensheng Luo, Changjian Sun, Cai Sun. Research of predicate recognition on pattern composition automatic analysis in the Chinese sentence. Tsing-Hua University Press, Beijing, 1995, pp. 159-164.
[2] Xiaohe Chen, Dingyi Shi. The theme of the Chinese sentence: the subject marking. Tsing-Hua University Press, Beijing, 1997, pp. 102-108.
[3] Shiwen Yu, Xuefeng Zhu, Hui Wang. Specification manual of "modern Chinese grammar information dictionary". Journal of Chinese Information Processing, 1996, Vol. 10(2), pp. 1-22.
[4] David M. Magerman. Natural Language Parsing As Statistical Pattern Recognition [dissertation]. Stanford University, United States, 1994.
[5] L. E. Baum. An Inequality and Associated Maximization Technique in Statistical Estimation of Probabilistic Functions of Markov Processes. Inequalities, 1972, 3, pp. 1-8.
[6] Shiwen Yu. Imagination on limited regular Chinese. Educational Publishing House of Shandong, Jinan, 1995, pp. 193-205.
[7] Zhifang Sui, Shiwen Yu. Knowledge obtaining and application on recognition of the predicate center word in single Chinese sentences. Journal of Beijing University (Natural Science Edition), Vol. 34(2), pp. 221-230.
[8] Guangbin Zhou. The situation of junk email of our country accelerates managing. Communication Information Newspaper, 2004, Vol. 3, pp. 24-28.
[9] David D. Lewis. Feature selection and feature extraction for text categorization. In Proceedings of the Speech and Natural Language Workshop, pp. 212-217.
[10] Yiming Yang. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1999, Vol. 1, No. 1/2, pp. 67-68.
[11] Aitao Chen. Chinese text retrieval without using a dictionary. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42-49, 1997.
[12] David D. Lewis. Feature Selection and Feature Extraction for Text Categorization. In Speech and Natural Language Workshop, New York, pp. 212-217, 1992.
[13] Alfons Juan and Hermann Ney. Reversing and Smoothing the Multinomial Naive Bayes Text Classifier. In 2nd International Workshop on Pattern Recognition in Information Systems, 2002.
[14] C. Apte, F. Damerau, and S. Weiss. Text mining with decision rules and decision trees. In Workshop on Learning from Text and the Web, Conference on Automated Learning and Discovery, Carnegie Mellon University, pp. 899-926, 1998.
