Clustering Spammers

The Community Behavior of Spammers
Fulu Li, The Media Laboratory, Massachusetts Institute of Technology, Cambridge, MA USA 02139, Email: fulu@mit.edu
Mo-Han Hsieh, Engineering Systems Division, Massachusetts Institute of Technology, Cambridge, MA USA 02139, Email: mohan76@mit.edu
Pawel Gburzynski, Dept. of Computing Sci., University of Alberta, Edmonton, AB Canada T6G 2E8, pawel@cs.ualberta.ca
Abstract
We have conducted an empirical study of the community behavior of spammers, which has revealed various clustering structures among their population. Based on those structures (discovered solely by processing sighted spam mail), we propose a family of new spam-blocking techniques exploiting group membership of perceived spam sources. In particular, we have found that if a spammer is associated with multiple groups, it has a higher probability of sending more spam in the near future. We have also observed that spam from the same community of spammers tends to arrive in bursts, and a very small fraction of spammers account for a large portion of the total spam mail. The high clustering coefficient of the community of spammers seems to reveal a hidden collaborative nature of that community.
1. Introduction
With the popularity of the Internet, email has become woven into the fabric of our society. According to IDC, the global daily email traffic reached 35 billion messages in 2005, up from 9.7 billion in 2000 [7]. However, the increase in the worldwide use of email has come with an overwhelming increase in spam. It is hard to give a precise definition of spam. In short, spam is unsolicited, unwanted bulk/commercial email that endangers the very existence of the email system with massive and uncontrollable amounts of messages [10]. Different studies have shown that spam accounts for more than 50% of all Internet email. The cost of spam consists of several components: the loss of productivity (as people have to spend time on spam), the cost of bandwidth wasted by spam, the cost of storage and network infrastructure, etc. It is no surprise that the projected worldwide cost of spam reaches almost 200 billion US dollars in 2007, with roughly 50 billion daily spam messages, according to Radicati Group.
To find better anti-spam strategies, we have to better understand the motivation as well as the modus operandi of the spammers. Most spam messages take the form of advertising or promotional materials, with roughly half being related to money (debt reduction plans, getting-rich-quick schemes, gambling opportunities), one third dealing with pornography, 10% being health-related, and the rest covering a variety of other topics [7,10]. With the staggering amount of daily spam, it is hard to imagine that all those spammers act individually and independently. In fact, it is widely believed that most spam messages are sent directly from a collection of bots [23], i.e., compromised machines controlled by a relatively narrow sub-community of spammers (typically, spammers purchase the right to use compromised machines from worm developers/attackers) [11,15]. This is because, with the proliferation of various IP-based blacklists [1,23], which in practice enforce SMTP server authentication, it is virtually impossible to use an honestly provisioned system for sending effective spam [1,22,23]. Due to the constant availability of security holes in popular systems [25], as well as the ease with which a hole, once detected, can be exploited on a massive scale [24], a professionally crafted worm (a Trojan horse) can easily infect a large collection of computers, turning them into bots, i.e., powerful spamming tools. The worm developer can then sell those bots to spammers for immediate financial profit. Some bots offer the possibility to open a SOCKS proxy on a compromised machine, which can then be used for spamming. With the help of thousands of bots, spammers can send a massive volume of mail within a short period of time [3,4,5].
The motivation of this work is to understand and analyze the community behavior of spammers by studying a large collection of spam mail. The findings from this work may help us lay the foundation for group-based anti-spam strategies. Specifically, by identifying common behavior patterns for a group of bots (i.e., a botnet), one may be able to effectively block all the spammers in the underlying group instead of blocking each of them individually. To the best of our knowledge, this is the first time that spammers have been classified and categorized by their communities, and it is also the first time that group-based anti-spam strategies have been proposed. The work in this paper builds upon our previous work in [16].
The rest of the paper is organized as follows. In Section 2, we discuss the related work on spam traffic analysis and anti-spam strategies. In Section 3, we give a comprehensive analysis of the community behavior of spammers based on a large collection of spam mail. Then, Section 4 suggests a few group-based spam categorization techniques derived from the observations made in Section 3. Finally, Section 5 presents the conclusions.
2. Related Work
In this section, we give a brief overview of the related work on spam traffic analysis and the state of the art of anti-spam technology. In general, anti-spam strategies can be classified into four major categories [6]: blacklists, filtering-based approaches, networking-based schemes, and computation-based methods.
In [2], Gomes and Cazita gave a comprehensive study on the characterization of spam traffic in terms of workload variation, density, inter-arrival time distribution, email size distribution, temporal locality, etc., compared with non-spam email. Their characterization reveals significant differences between the spam and non-spam traffic patterns. The interesting observation is that non-spam email exchange is typically driven by bilateral social relationships, while spam transmission is usually a unilateral action, based solely on the spammer's will to reach as many recipients as possible.
In [1], Jung and Sit examine the use of DNS blacklists for address-based filtering of spam. The basic idea is that once the IP address of a host (SMTP server) engaged in spam delivery is identified, it is registered in a centrally maintained database, which is made publicly available via the standard Internet DNS service. Subscribing mail recipients can query the database using DNS lookup tools and refuse to accept mail from hosts that are listed there. Their study found that around 80% of the spam sources that they were able to identify were listed in some DNS blacklist, and that different DNS blacklists tend to be well correlated.
In a filtering-based scheme [6], an incoming email message is passed through a series of policies/rules/patterns to assess its legitimacy. From this general perspective, blacklisting can be viewed as a special case of filtering, whereby the simple rule is the presence or absence of the sending SMTP server's IP address in the blacklist database. Thus, the filter is distributed: the rule is local to the email recipient, while the patterns driving the rule are provided by the blacklist maintainer. Most people would insist on putting the two concepts into different baskets, arguing that a pure filtering scheme always acts locally. Among the variants of the latter are: signature-based filtering, Bayesian filtering, rule-based filtering, challenge-response filtering, etc. We refer the reader to [6] for a survey of those techniques. SpamAssassin, one of the most popular spam filters, uses a sophisticated set of rules to identify and block spam mail.
A number of server-end techniques deployed by system administrators at the SMTP reception point have evolved to become a standard set in the anti-spam arsenal. The idea behind one such trick is to maintain a local database of trusted SMTP servers. An unknown server connecting for the first time is treated as hostile and its first attempt at delivery is rejected. The rationale is that a serious SMTP server will try again a while later, while a spambot will simply move on. On the second attempt (presented within some reasonable time interval), the connecting server is granted a temporary trusted status, which can be revoked and verified periodically in the same manner. An accompanying technique is usually to slow down the non-trusted server by reducing its TCP window to a single byte and being intentionally slow with acknowledgments. Such techniques have recently been gaining ground in the commercial world (e.g., they are among the tools promoted by Turntide Inc., which was recently acquired by Symantec Inc.).
The role of computation-based tools is mostly to verify that the message has been sent by a human being who has demonstrated a bona fide intention to contact the particular recipient. As argued in [18], this is the most relevant feature distinguishing legitimate mail from spam. Various challenge-response systems fall into this general category. Their premise is to tell human senders apart from spambots [26]. One should note the importance of such schemes for certifying the non-spam status of a message that otherwise can only be ranked and categorized with some finite accuracy (this applies to both pure filtering and blacklisting, whose verdicts can be wrong [18]). Misqualifying a legitimate message as spam is a serious problem of all filtering techniques. For example, the computation-based solution employed by Hashcash Inc. (http://www.hashcash.org) is being advertised as a way to avoid false positives. Their scheme has recently been incorporated into SpamAssassin.
The generic idea of mail channels [27] is to maintain separate email identities for different contacts. With the proper organization of those identities and the inclusion of a challenge-response mechanism [18], it can become a highly effective tool in combating spam, especially for a cognizant user who maintains a certain dose of discipline in his/her email activities.
According to [3,5], about 30,000 new machines are compromised daily and become bots. One of the most common uses of botnets is to launch massive spam attacks. Recent studies [11,15] suggest that bot rental has become the preferred and most effective spamming technique. In [14], Ramachandran and Feamster studied the network-level behavior of spammers by examining a large collection of spam mail. They found that most spam is sent from a few regions of the IP address space, and that spammers appear to use transient bots, each of which sends a few spam messages over a very short period of time. The authors of [14] suggest that filtering email based on network-level properties may prove much more effective in combating spam, because network-level properties are less variable than email content. As argued in [18], filtering based solely on email content is fundamentally flawed as an idea. This is because that content is perfectly malleable and has essentially unlimited paths for evolution in response to aggressive filtering.
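As a small illustration of the DNS blacklist lookup discussed in this section, the sketch below builds the query name for an IPv4 address and checks whether a blacklist zone answers for it. It relies on the common convention that the octets of the address are reversed and prepended to the zone name. The zone `dnsbl.example.org` used in the usage note and the function names are hypothetical; this is not the setup used in [1].

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name used to look up an IPv4 address in a blacklist zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone resolves the query name (the host is listed)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No record for this name: the address is treated as not listed.
        return False
```

For example, `dnsbl_query_name("192.0.2.99", "dnsbl.example.org")` yields the name `99.2.0.192.dnsbl.example.org`; `is_listed` performs a live DNS query and therefore needs network access.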
3. Community Behavior Analysis

In this section we analyze the community behavior of spammers through a large collection of spam mail. Section 3.1 describes the spam source data. An overview of the spam traffic is presented in Section 3.2. Section 3.3 shows the clustering structures of the spammer communities based on different grouping criteria.
3.1 Spam Data Source

Our spam data was obtained from Jaeyeon Jung and Nick Feamster at CSAIL MIT, where it was collected at a domain mail server (see footnote 1) in such a way that the IP addresses of the spam sources were recorded when the spammer tried to establish the TCP connection with the domain mail server to deliver the spam message. The IP address of the spammer recorded during the 3-way handshake should be the real IP address of the spammer in most cases, even though there are rare scenarios in which the spammer could try IP spoofing. In this context, practically the only workable case of IP spoofing involves presenting a formally unused IP address within the subnet on which the spammer's host is actually located. One can argue that in such a case the de facto compromised subnet can be viewed as a single source of spam. Another remote alternative is BGP hijacking to propagate fake route entries with unused IP addresses to the nearby ISPs [12,14]. As BGP hijacking is considerably more difficult (and less predictable) than compromising random hosts in random subnets, we can assume that the impact of such cases on our studies (if present at all) is negligible. As we argued in Section 1, spamming becomes more and more distributed, with the spammer controlling thousands of bots and having several levels of indirection to cover his/her identity. Consequently, the returns from IP spoofing (which the spammer can seldom bet on) quickly diminish to the point that this technique can be safely ignored. Besides, by blocking email from knowingly unused IP sources, one will never risk a false positive.
The spam mail data contains the full mail header information and the full mail contents, including the attachment files. The mail header information contains the real IP address of the spam source, the route information (which can be faked up to the real IP address of the actual sender), and the TCP SYN fingerprint, which can be used to identify the OS of the spam source. The data set covers one week of operation and consists of 86,819 spam messages.
3.2 Overview of the Spam Traffic

The week-load of 86,819 spam messages originated from 41,874 distinct spam hosts (i.e., IP addresses). Based on the TCP SYN fingerprint information used to identify the host's OS, 74% of the messages were sent from Windows machines, about 10% arrived from Linux hosts, about 5% originated at BSD and Solaris systems, and about 11% could not be classified because of the lack of OS data in the fingerprint. Notably, very few spam messages arrived from Macintosh machines (we could only count 5 of them), and the vast majority of all spam was sent from Windows. This is likely due to the fact that Windows machines are more vulnerable to virus attacks and are also more prone to become the victims of worms turning them into spambots. To the best of our knowledge, there has not yet been a widespread virus affecting Macintosh machines.
Figure 1 illustrates the complementary cumulative distribution function (CCDF) of the number of appearances of any specific IP address as a spam host. The x-axis shows the number of appearances of an IP address, and the y-axis represents the CCDF. Both axes are plotted in the log scale. As we can see, most of the IP addresses appeared only once or twice. The number of spam messages that a spammer sent during the week ranges from 1 to as many as 446. More precisely, 68% of the spammers sent only a single spam message, 15% of them sent two messages, and less than 2% of them sent more than 10 messages. However, those less than 2% of the spammers accounted for 20% of the total spam traffic. According to [6], over 95% of spam messages include URLs. We show the CCDF of the number of appearances of the same URL in multiple spam messages in Figure 2. Many URLs appeared only once, some appeared between 10 and 100 times, and very few URLs appeared close to 1000 times.
Figure 1: The CCDF of the number of appearances of the same IP address. (Log-log plot; x-axis: number of IP appearances; y-axis: CCDF.)
Footnote 1: The identity of that server cannot be disclosed due to privacy concerns.
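The CCDF plotted in Figures 1 and 2 can be computed from a plain list of observed sources. The following is a minimal sketch, assuming the convention that the CCDF at n is the fraction of distinct sources appearing at least n times (so that it equals 1 at n = 1); the function name and the input format are hypothetical and are not taken from our processing scripts.

```python
from collections import Counter


def ccdf_of_counts(sources):
    """Return sorted (n, P[appearances >= n]) pairs for per-source counts."""
    per_source = Counter(sources)             # source -> number of appearances
    histogram = Counter(per_source.values())  # appearances -> number of sources
    total = len(per_source)
    points = []
    remaining = total                         # sources with at least n appearances
    for n in sorted(histogram):
        points.append((n, remaining / total))
        remaining -= histogram[n]
    return points
```

Passing the list of source IP addresses, one entry per message, gives the points of a curve like Figure 1; passing the list of URLs, one entry per occurrence, gives the analogue of Figure 2.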
Figure 2: The CCDF of the number of appearances of the same URL. (Log-log plot; x-axis: number of URL appearances; y-axis: CCDF.)
3.3 Clustering Structures

In this section, we group the spammers in terms of the URLs, stock symbols, and monetary amounts that appear in most of the spam messages. Figure 3 visualizes the clustering relationships based on the URLs in the spam on day 1. If the same URL appears in a spam message sent from source A and in one sent from source B (where A and B represent the IP addresses of two spammers), then an edge is plotted to connect the nodes A and B. The clustering structure is clear. The number of members in each cluster ranges from 1 to 716. According to [3], a typical botnet consists of several hundred compromised machines, which is consistent with the sizes of some clusters visible in Figure 3. The major component with 716 spammers is further illustrated in Figure 4. An interesting observation is that the spam mail from a pivoting point often arrives earlier than that from the homogeneous points further away from the cluster's center. Another key observation is that the more groups a spammer's IP address is associated with (due to multiple distinct URLs appearing in the spam mail from this spammer), the higher the probability that spam mail from this IP address will arrive in the near future. We hypothesize that those pivoting points play an important role in the botnet.
Figure 3: The clustering structure of the spammers based on the URLs in spam messages on day 1.
Figure 4: The major component of the clustering structure from Figure 3.
As shown in Figures 5 through 10, similar clustering patterns have been observed for the remaining days (day 2 through day 7). In Figure 11, we depict the clustering structure of the spammers based on the monetary amounts in spam messages on day 1. As most of the spam is related to money, the clustering structure in Figure 11 is relevant. Normally, the products, services, commodities, stocks, etc., advertised in spam come with unit prices, so it is unlikely that the spammers would intentionally make those attributes random (e.g., to diversify the signatures of their spam). Consequently, such values are likely to be good discerning and unifying features of spam arriving from different sources. Unfortunately, compared with Figures 3 through 10, the cluster sizes in this case are relatively small, which hints that this characterization may not work well by itself. Nevertheless, it can at least be applied as a supplementary feature, e.g., together with the URL-based criteria. In Figure 12, we show the clustering structure of the spammers based on stock symbols in spam messages on day 1. Clearly, both the number of clusters and the cluster sizes are small. We conclude that stock symbols may not be a good criterion for classifying the spammers.
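The URL-based grouping rule described in this section (two sources are connected if the same URL appears in spam from both) can be sketched with a union-find structure. This is an illustrative sketch rather than the tool used to produce the figures: the input format of `(source_ip, urls)` pairs and all function names are assumptions. It returns the connected components, which are the same clusters one would obtain by drawing an edge for every pair of sources sharing a URL.

```python
from collections import defaultdict


def cluster_by_shared_url(messages):
    """Group source IPs into clusters; two IPs are linked if they share a URL.

    `messages` is an iterable of (source_ip, urls) pairs (hypothetical format).
    """
    parent = {}

    def find(node):
        # Follow parent links to the root, compressing the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_sender = {}  # url -> first source IP seen with that URL
    for ip, urls in messages:
        parent.setdefault(ip, ip)
        for url in urls:
            if url in first_sender:
                union(first_sender[url], ip)
            else:
                first_sender[url] = ip

    clusters = defaultdict(set)
    for ip in parent:
        clusters[find(ip)].add(ip)
    # Largest cluster first, mirroring the "major component" of Figure 4.
    return sorted(clusters.values(), key=len, reverse=True)
```

The same function applies unchanged to the other grouping criteria if the URL list is replaced by the monetary amounts or stock symbols extracted from each message.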
Figure 9: The clustering structure of the spammers based on the spam messages arriving on day 6.
Figure 11: The clustering structure of the spammers based on the monetary amounts on day 1.
Figure 12: The clustering structure of the spammers based on the stock symbols on day 1.
Now, let us examine the correlation coefficient of the inter-arrival time of the spam messages from the spammers belonging to the same cluster. This coefficient is given by the following formula:

$$\rho_k = \frac{\frac{1}{N-k}\sum_{i=1}^{N-k}(x_i - \bar{x})(x_{i+k} - \bar{x})}{\sigma_x^2} \qquad (1)$$

where $N$ is the number of messages, $k$ is the lag index (the separation between the inter-arrival times being considered), $x_i$ is the $i$-th inter-arrival time, $\bar{x}$ is the average inter-arrival time, $\sigma_x^2$ is the variance of the inter-arrival time, and $\rho_k$ is the correlation coefficient of the inter-arrival time with the lag index of $k$. The intuition is to show that even though some spam arrivals within the same group of spammers are far apart, i.e., their lag index is large, they may still be correlated, which would mean that the messages from the same group of spammers tend to arrive in bursts.
As shown in our previous results in [16], spam arrival within the same cluster of spammers exhibits strong long-range dependency and the bursty character of the spam arrival process. The trend is clearly visible despite the somewhat noisy character of the curve. The correlation coefficient oscillates along the trend line due to the granularity of the timestamp. This is because, within a given time grain (one second), there are often multiple spam arrivals.
Following [17], given a clustering criterion, we define the clustering coefficient of a spammer as the ratio between the actual number of edges among the neighbors of that spammer node and the number of possible edges among those neighbors. Formally, assume that node $i$ in the network has $k_i$ edges, connecting it to $k_i$ other nodes. Let $E_i$ stand for the number of edges that actually exist among these $k_i$ nodes. The total number of possible edges among these $k_i$ nodes would be $k_i(k_i - 1)/2$. The value of the clustering coefficient of node $i$ is thus given by

$$C_i = \frac{E_i}{k_i(k_i - 1)/2} \qquad (2)$$

The clustering coefficient of the whole network is the average of all the individual $C_i$'s. The motivation behind the clustering coefficient is to identify the spammers associated with multiple communities (based on the given set of clustering criteria). The idea is to be able to detect the smallest number of spammers responsible for sending the largest amount of spam. Whatever efforts we undertake to identify and block spam, those spammers should receive preferential treatment.
In Figures 13 and 14, we show the overall clustering coefficient of the community of spammers for each day in the 7-day period, and the clustering coefficient of the largest component of the community from day 1 to day 7. We also show the total number of spammers on each day and the number of spammers in the largest component for that day. As we can see from Figures 13 and 14, the overall clustering coefficient for the entire community is very high, indicating the property of small-world networks [17]. We also observe that as the community size of the spammers grows, the clustering coefficient decreases slightly. It has been reported that collaborative networks, such as movie actor networks, co-authorship networks, etc., generally tend to have higher clustering coefficients [17]. It is thus consistent to say that spammers form a collaborative network, organized in the form of botnets [3], with a high clustering coefficient. This high clustering behavior of spammers bodes well for a community-based approach to spam elimination: it is an argument that dealing with the community as a whole is likely to bring about better results than treating each spammer individually.
Figure 13: The clustering coefficient of the community of spammers from day 1 to day 7. (Plot; x-axis: the sequence of days, 1 to 7; y-axis: clustering coefficient, 0.8 to 0.96; series: overall and largest component.)
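The clustering coefficient of Equation (2) and its network-wide average can be computed as in the following sketch. The graph is assumed to be given as a dictionary mapping each node to the set of its neighbors (an undirected graph, so every edge is listed in both directions); assigning a coefficient of 0 to nodes with fewer than two neighbors is our own assumption for the sketch, since Equation (2) is undefined there.

```python
from itertools import combinations


def clustering_coefficient(adjacency, node):
    """C_i = E_i / (k_i (k_i - 1) / 2) for one node of an undirected graph."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumption: undefined case is scored as 0
    # E_i: number of edges that actually exist among the neighbors of the node.
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
    return links / (k * (k - 1) / 2)


def average_clustering(adjacency):
    """Average of the individual C_i over all nodes of the network."""
    return sum(clustering_coefficient(adjacency, n) for n in adjacency) / len(adjacency)
```

For a small graph in which A is linked to B, C, and D, and B is linked to C, node A has three neighbors with one edge among them, so its coefficient is 1/3, while B and C each have coefficient 1.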
Figure 14: The total number of spammers and the number of spammers in the largest component for each day. (Plot; x-axis: the sequence of days, 1 to 7; y-axis: the number of spammers, 100 to 1900; series: overall and largest component.)
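Returning to the lag-k correlation coefficient of Equation (1), the following sketch evaluates it directly from a list of arrival timestamps belonging to one cluster. It is a minimal illustration under two assumptions of ours: N is taken to be the number of inter-arrival times, and the variance is the population variance of those times. The function name is hypothetical.

```python
def lag_autocorrelation(arrival_times, k):
    """Lag-k correlation coefficient of inter-arrival times, as in Equation (1)."""
    # Inter-arrival times between consecutive messages (timestamps sorted ascending).
    x = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    n = len(x)
    if not 0 <= k < n:
        raise ValueError("lag k must satisfy 0 <= k < number of inter-arrival times")
    mean = sum(x) / n
    variance = sum((v - mean) ** 2 for v in x) / n
    if variance == 0:
        raise ValueError("inter-arrival times are constant; coefficient undefined")
    covariance = sum((x[i] - mean) * (x[i + k] - mean) for i in range(n - k)) / (n - k)
    return covariance / variance
```

Evaluating the function for a range of lags and plotting the result against k gives the kind of curve discussed in [16]; a coefficient that stays well above zero at large lags corresponds to the long-range dependency described above.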
4. Group-Based Anti-Spam Strategies

In this section we briefly discuss some group-based anti-spam strategies based on the results of our empirical study of the community behavior of spammers in Section 3. The group-based anti-spam framework can be used as a complementary component of an existing anti-spam system, e.g., SpamAssassin, to efficiently block spam from organized spammers. The idea is that if we can perceive some group-based behavior/patterns of spam senders based upon some common signatures in the email content and/or headers, e.g., a URL or some other criterion, we can assign a high spam score to the messages from this group. The more members in the given suspected spammer group, the higher the spam score for the messages from that group. The intuition behind this is that it is highly unlikely for a large group of legitimate senders to send email with exactly the same type of signature, e.g., the same URL. The source IP address of an incoming message is used as a unique identifier of the email sender for group-based classification. We do not use the IP address of the email sender for blacklist blocking. So, even in the rare cases of a spoofed source IP address, spoofing will have little impact on the effectiveness of our group-based anti-spam approach.
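The scoring idea above can be illustrated with a short sketch that groups senders by a shared signature and lets the score grow with the number of distinct source IP addresses in the group. The logarithmic growth, the `weight` parameter, and the `(source_ip, signature)` input format are illustrative assumptions of ours; the text only requires that larger groups receive higher scores.

```python
import math
from collections import defaultdict


def group_spam_scores(messages, weight=1.0):
    """Score each signature by the number of distinct source IPs sharing it.

    `messages` is an iterable of (source_ip, signature) pairs, where a
    signature could be a URL. The logarithmic formula is an assumption.
    """
    members = defaultdict(set)
    for ip, signature in messages:
        members[signature].add(ip)
    # A signature seen from a single source scores 0; larger groups score higher.
    return {sig: weight * math.log(len(ips)) for sig, ips in members.items()}
```

Such a score would be added to the score produced by an existing filter rather than used on its own, in line with the complementary role described above.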
5. Conclusions
In this paper we have investigated the community structures of spammers based on a collection of spam traffic sighted at a domain mail server. Our study shows that the relationships among spammers exhibit highly clustered structures under URL-based grouping. The inter-arrival time of spam from the same group of spammers exhibits long-range dependency, in the sense that the spam messages from the same group of spammers often arrive in bursts. We also observe that spammers associated with multiple groups tend to send more spam messages in the near future. Finally, the high clustering coefficient of the community of spammers reveals the collaborative nature of those spammers.
Acknowledgements
We would like to thank Jaeyeon Jung, our shepherd, for the spam sources and for her valuable suggestions, which inspired our interest in this project. We also thank Nick Feamster and Anirudh Ramachandran for the spam sources. We thank Mythili Vutukuru, Jeremy Stribling, and Prof. Hari Balakrishnan for their insightful feedback and comments, which have greatly improved the quality of this paper.
References
[1] J. Jung and E. Sit, An Empirical Study of Spam Traffic and the Use of DNS Black Lists, in ACM IMC 04, Oct. 2004.
[2] L. H. Gomes, C. Cazita, Characterizing a Spam Traffic, in ACM IMC 04, Oct. 2004.
[3] Know Your Enemy: Tracking Botnets, the Honeynet Project and Research Alliance, http://www.honeynet.org, March 2005.
[4] S. Kandula, D. Katabi, M. Jacob, A. Berger, Botz-4-Scale: Surviving Organized DDOS Attacks That Mimic Flash Crowds, in USENIX NSDI, May 2005.
[5] S. Katti, B. Krishnamurthy, D. Katabi, Collaborating Against Common Enemies, in ACM IMC 05, Oct. 2005.
[6] P. Graham, Different Methods of Stopping Spam, http://www.secinf.net/anti_spam/Stopping_Spam.html, Oct. 2003.
[7] Dealing Effectively with Spam, GFI Software, http://www.secinf.net/anti_spam/Dealing_Effectively_with_Spam.html, May 2003.
[8] Blocking over 98% of Spam using Bayesian Filtering Technology, GFI Software, http://www.secinf.net/anti_spam/Blocking_Spam_Bayesian_Filtering.html, Oct. 2003.
[9] J. Synoradzki, P. Wawrzyniak, M. Zmudzinski, Four Popular Anti-Spam Filters for Exchange Reviewed, http://www.secinf.net/anti_spam/Preventing_Spam_Antispam_Filters_MS_Exchange.html.
[10] Spamfighting Overview FAQ, http://www.secinf.net/anti_pam/Spamfighting_Overview_FAQ.html, May 2003.
[11] B. Laurie, R. Clayton, Proof-of-Work Proves Not To Work, in Workshop on Economics and Information Security, MN, May 2004.
[12] H. Balakrishnan, Computer Networks, Lecture Notes, class 6.829, MIT, Fall 2005.
[13] J. Graham-Cumming, Tricks of the Spammers Trade, in Hakin9 magazine, issue 3, 2004.
[14] A. Ramachandran, N. Feamster, Understanding the Network-Level Behavior of Spammers, in ACM SIGCOMM 2006.
[15] J. Goodman, G. Cormack, D. Heckerman, Spam and the Ongoing Battle for the Inbox, Comm. of the ACM, Feb. 2007.
[16] F. Li, M. Hsieh, An Empirical Study of Clustering Behavior of Spammers and Group-based Anti-spam Strategies, in CEAS 2006.
[17] R. Albert, A. Barabasi, Statistical Mechanics of Complex Networks, Reviews of Modern Physics, 74, 47-97 (2002).
[18] P. Gburzynski, J. Maitan, Fighting the Spam Wars: A Remailer Approach with Restrictive Aliasing, in ACM Trans. on Internet Technology, 2004.
[19] S. Biswas, R. Morris, ExOR: Opportunistic Multi-Hop Routing for Wireless Networks, in Proceedings of ACM SIGCOMM 2005.
[20] ASCII Table and Description, http://www.lookuptables.com/.
[21] M.J.B. Robshaw, MD2, MD4, MD5, SHA and Other Hash Functions, Technical Report TR-101, version 4.0, RSA Laboratories, July 1995.
[22] E. Blanzieri, A. Bryl, A Survey of Anti-Spam Techniques, Technical Report DIT-06-056, University of Trento, Department of Informatics, September 2006.
[23] A. Ramachandran, D. Dagon, N. Feamster, Can DNS-Based Blacklists Keep Up with Bots?, in CEAS 2006.
[24] L.A. Hughes, Viruses, Worms, and Trojan Horses: Serious Crimes, Nuisance, or Both, Social Science Computer Review, 25(1), 78-98 (2007).
[25] D.M. Kienzle, M.C. Elder, Recent Worms: a Survey and Trends, in Proceedings of the ACM Workshop on Rapid Malcode (2006), Washington, DC, USA, pp. 1-10.
[26] L. von Ahn, M. Blum, J. Langford, Telling humans and computers apart automatically, Communications of the ACM, 47(2), 56-60 (2004).
[27] R. Hall, How to Avoid Unwanted E-Mail, Communications of the ACM, 41(3), 88-95 (1998).
