
AN OBJECT ORIENTED EMAIL CLUSTERING MODEL USING WEIGHTED SIMILARITIES BETWEEN EMAILS ATTRIBUTES

Presented by
P. Sai Kiran, J. Veeraiah Chowdary, P. Sudheer Kumar, T. Bhargav

Guided by Mr. M. M. M. Durga

ABSTRACT

Email mining is the process of discovering useful patterns in emails. Clustering techniques can be applied to email data to create groups of similar emails, which requires measuring the similarity between pairs of email objects. Standard clustering distance measures may not capture the distance between two email objects accurately. A data mining model based on weighted email attribute similarity is therefore proposed for clustering emails and discovering email groups. User-defined weights are assigned to the similarity measured between each pair of email attributes, and the weighted similarities are combined to calculate the similarity between a pair of emails.

INTRODUCTION
Email has emerged as one of the most effective and popular means of communication today. Email data is becoming the dominant form of inter- and intra-organizational written communication for many companies and government departments. Email is now as essential a part of daily life as the mobile phone.

Email as a database
Email Mining
Clustering Emails

CLUSTERING ALGORITHMS
The most widely used clustering algorithm for textual data is the K-Means algorithm. To group a set of points into K clusters, K-Means works in 4 basic steps:
1. Randomly choose K instances from the dataset and use them as the initial cluster centers.
2. Assign the remaining instances to their closest cluster center.
3. Compute a new center for each cluster.
4. If the new cluster centers are identical to the previous ones, the algorithm stops. Otherwise, repeat from step 2.
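The four steps can be sketched in a few lines of Python. This is a minimal illustration, assuming the points are numeric tuples and that closeness is measured with squared Euclidean distance; the names kmeans, sq_dist, max_iter and seed are illustrative and do not come from the presentation.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` into `k` groups; returns (centers, assignment)."""
    rng = random.Random(seed)
    # Step 1: pick K instances at random as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = []
    for _ in range(max_iter):
        # Step 2: assign every instance to its closest center.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        # Step 3: recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[c])
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = kmeans(points, k=2)
```

The max_iter cap is a safety limit added for the sketch; the steps as listed stop only when the centers are unchanged.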

EXISTING APPROACHES

Existing solutions include the following:
Automatic foldering is a more sophisticated approach based on filters that match each message against the existing mail folders.
Conversation view is an improved variation on the threaded view approach, introduced in Google's Gmail service.

DISTANCE FUNCTIONS AND SIMILARITY MEASUREMENTS

1. Dice Similarity
2. Cosine Similarity
3. TF-IDF Similarity
4. Jaccard Similarity

1. Dice Similarity

2. Cosine Similarity

3. TF-IDF Similarity

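4. Jaccard Similarity
Jaccard Sim = (X · Y) / (|X|^2 + |Y|^2 - (X · Y))
where X and Y are the term vectors of the two texts being compared and X · Y is their dot product.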
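The formulas for the Dice and cosine measures did not survive in the slide text, so the sketch below uses their standard vector forms alongside the extended Jaccard coefficient. It assumes each text is represented as a raw term-frequency vector built by whitespace tokenization; the function names are illustrative. TF-IDF similarity is left out because it needs document frequencies from a whole corpus.

```python
import math
from collections import Counter


def term_vector(text):
    """Term-frequency vector of a text (assumed whitespace tokenization)."""
    return Counter(text.lower().split())


def dot(x, y):
    """Dot product X . Y over the terms the two vectors share."""
    return sum(x[t] * y[t] for t in x if t in y)


def norm_sq(x):
    """Squared Euclidean norm |X|^2."""
    return sum(v * v for v in x.values())


def cosine_sim(x, y):
    # X . Y / (|X| |Y|)
    denom = math.sqrt(norm_sq(x) * norm_sq(y))
    return dot(x, y) / denom if denom else 0.0


def dice_sim(x, y):
    # 2 (X . Y) / (|X|^2 + |Y|^2)
    denom = norm_sq(x) + norm_sq(y)
    return 2 * dot(x, y) / denom if denom else 0.0


def jaccard_sim(x, y):
    # X . Y / (|X|^2 + |Y|^2 - X . Y)
    d = dot(x, y)
    denom = norm_sq(x) + norm_sq(y) - d
    return d / denom if denom else 0.0


a = term_vector("project meeting today")
b = term_vector("project meeting tomorrow")
scores = (cosine_sim(a, b), dice_sim(a, b), jaccard_sim(a, b))
```

All three measures return 1 for identical vectors and 0 for vectors with no terms in common; they differ in how strongly they penalize partial overlap.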

PROPOSED MODEL

The overall similarity between a pair of emails is represented by SimEmail, the weighted sum of the similarities between their attributes.

SimEmail = Wf * SimFrom + Ws * SimSub + Wc * SimContent

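Here SimFrom, SimSub and SimContent are the similarities of the sender, subject and content attributes, and Wf, Ws and Wc are the corresponding weights. The weights must sum to 1:

Wf + Ws + Wc = 1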
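The weighted sum and its constraint translate directly into code. The following is a small sketch, assuming the three attribute similarities have already been computed and lie in [0, 1]; the function name sim_email and the example weights 0.2, 0.3 and 0.5 are illustrative choices, not values from the presentation.

```python
def sim_email(sim_from, sim_sub, sim_content, w_f, w_s, w_c):
    """SimEmail = Wf * SimFrom + Ws * SimSub + Wc * SimContent."""
    # Enforce the constraint Wf + Ws + Wc = 1 (with a float tolerance).
    if abs(w_f + w_s + w_c - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w_f * sim_from + w_s * sim_sub + w_c * sim_content


# Same sender, partly similar subject, unrelated content.
score = sim_email(1.0, 0.5, 0.0, w_f=0.2, w_s=0.3, w_c=0.5)
```

Because the weights sum to 1 and each attribute similarity lies in [0, 1], SimEmail also stays in [0, 1], which keeps scores comparable across different weight settings.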

Weighted similarity between email objects

3 stages of email clustering

1. Pre-processing
2. Weighted email object similarity
3. Clustering technique
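The three stages can be strung together as in the sketch below. Everything beyond the stage names is an assumption made for illustration: the Email class, the small stop-word list, exact matching of senders, set-based Jaccard similarity for subject and content, the default weights, and the use of a simple leader-style threshold grouping in stage 3, since the slide text does not say which clustering technique was used.

```python
import re
from dataclasses import dataclass

# Assumed stop-word list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "for", "in", "on"}


@dataclass
class Email:
    sender: str
    subject: str
    content: str


def preprocess(text):
    """Stage 1: lowercase, keep alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def set_jaccard(a, b):
    """Jaccard similarity of two token sets (0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def email_similarity(e1, e2, w_f=0.2, w_s=0.3, w_c=0.5):
    """Stage 2: weighted sum of sender, subject and content similarity."""
    sim_from = 1.0 if e1.sender.lower() == e2.sender.lower() else 0.0
    sim_sub = set_jaccard(preprocess(e1.subject), preprocess(e2.subject))
    sim_content = set_jaccard(preprocess(e1.content), preprocess(e2.content))
    return w_f * sim_from + w_s * sim_sub + w_c * sim_content


def cluster_emails(emails, threshold=0.5):
    """Stage 3 (placeholder): an email joins the first cluster whose
    first member is at least `threshold` similar, else starts a new one."""
    clusters = []
    for email in emails:
        for cluster in clusters:
            if email_similarity(email, cluster[0]) >= threshold:
                cluster.append(email)
                break
        else:
            clusters.append([email])
    return clusters


emails = [
    Email("alice@example.com", "Budget report Q3", "Please review the Q3 budget report"),
    Email("alice@example.com", "Q3 budget report", "Review of the Q3 budget report attached"),
    Email("bob@example.com", "Lunch on Friday", "Shall we get lunch on Friday"),
]
groups = cluster_emails(emails)
```

Any clustering method that accepts a pairwise similarity could replace the placeholder in stage 3 without changing stages 1 and 2.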

EXPERIMENTAL ANALYSIS

CONCLUSION

This technique takes into account the distance between all of the attributes of an email. Future work could extend the approach to other email mining operations, such as thread summarization, automatic answering of emails and email classification, so that these also draw on all of the email attributes and achieve more accurate results.
