
Authorship Analysis in Cybercrime Investigation

Rong Zheng, Yi Qin, Zan Huang, and Hsinchun Chen


Artificial Intelligence Lab Department of Management Information Systems The University of Arizona Tucson, Arizona 85721, USA {rong, yiqin, zhuang, hchen}@eller.arizona.edu

Abstract. Criminals have been using the Internet to distribute a wide range of illegal materials globally in an anonymous manner, making criminal identity tracing difficult in the cybercrime investigation process. In this study we propose to adopt the authorship analysis framework to automatically trace identities of cyber criminals through messages they post on the Internet. Under this framework, three types of message features, including style markers, structural features, and content-specific features, are extracted and inductive learning algorithms are used to build feature-based models to identify authorship of illegal messages. To evaluate the effectiveness of this framework, we conducted an experimental study on data sets of English and Chinese email and online newsgroup messages. We experimented with all three types of message features and three inductive learning algorithms. The results indicate that the proposed approach can discover real identities of authors of both English and Chinese Internet messages with relatively high accuracies.

1 Introduction
The development of networking technologies, and the Internet in particular, has created a new way to share information across time and space. While computer networks have enhanced the quality of life in many aspects, they have also opened a new venue for criminal activities. These activities have spawned the concept of cybercrime, which refers to illegal computer-mediated activities conducted through global electronic networks such as the Internet [31]. One predominant type of cybercrime is the distribution of illegal materials in cyberspace, including pirated software, child pornography, and stolen property. Cyber criminals have been using various Web-based channels, such as email, websites, Internet newsgroups, and Internet chat rooms, to distribute such materials. One common characteristic of these channels is anonymity: people usually do not need to provide real identity information, such as name, age, gender, and address, in order to participate in cyber activities. Compared to conventional crimes, cybercrime conducted through such anonymous channels poses unique challenges for law enforcement agencies in criminal identity tracing. The situation is further complicated by the sheer number of cyber users and activities, which makes manual criminal identity tracing inadequate for cybercrime investigation. Law enforcement agencies have an urgent need for approaches that automate criminal identity tracing in cyberspace and allow investigators to prioritize their tasks and focus on the major criminals.

In this paper we propose to adopt the authorship analysis framework in the context of cybercrime investigation to help law enforcement agencies deal with the identity-tracing problem. We extract three types of features identified in authorship analysis research from online illegal messages and use inductive learning techniques to build feature-based models that perform automatic message author identification. We are specifically interested in evaluating the general effectiveness of this approach and the effects of using different types of features in the cybercrime investigation context. Because of the multinational nature of cybercrime, we are also interested in evaluating the applicability of the proposed framework in a multilingual context.

The remainder of the paper is organized as follows. Section 2 surveys existing work on authorship analysis and summarizes the major types of text features and techniques. Section 3 describes our proposed cyber criminal identity-tracing framework in detail and presents the specific research questions we aim to address. Section 4 presents an experimental study that answers the research questions raised in Section 3, based on several experimental data sets. We conclude in Section 5 by summarizing our research contributions and pointing out future directions.

2 Literature Review
2.1 Authorship Analysis

Authorship analysis is the process of examining the characteristics of a piece of work in order to draw conclusions on its authorship. More specifically, the problem can be broken down into three sub-fields [35]:

- Author Identification: determines the likelihood of a particular author having written a piece of work by examining other works produced by that author.
- Author Characterization: summarizes the characteristics of an author and generates the author profile based on his/her work. Some of these characteristics include gender, educational and cultural background, and language familiarity.
- Similarity Detection: compares multiple pieces of work and determines whether or not they were produced by a single author, without actually identifying the author.
Authorship analysis has many applications. It is rooted in the author attribution problem of historical literature; the most famous example is its success in resolving the debate over Shakespeare's work [10]. Similarly, authorship analysis techniques have assisted in settling the author debates over the Federalist Papers [23] and the Unabomber Manifesto [13]. Another application domain is software forensics [14], in which the author of a malicious program is identified or characterized by analyzing its executable or source code in order to investigate the crime and prevent future attacks. Since our work is mainly concerned with text, we will not discuss software forensics in this paper. Generally, the major topics in past authorship analysis research are feature selection and the techniques used to facilitate the analysis process. In the following sub-sections we review the literature from these two perspectives.


2.2 Feature Selection

The essence of authorship analysis is the formation of a set of features, or metrics, that remain relatively constant across a large number of writings created by the same person. In other words, a set of writings from one author should exhibit greater similarity in terms of these features than a set of writings from different authors.

Initially, researchers identified authors by categorizing the different sets of words used by different authors. One example is the authorship analysis of Shakespeare's work [10]: Elliot and Valenza [10] conducted a study that compared the poems of Shakespeare with those of Edward de Vere, the leading candidate as the true author of the works credited to Shakespeare, using tests based on keyword usage. However, the effectiveness of this approach is limited by the fact that word usage is highly dependent on the text topic. For discrimination purposes we need content-free features, also called style markers. The basic idea came from Yule's work, in which features such as sentence length [39] and vocabulary richness [40] were proposed. Mosteller and Wallace [23] extracted function words (or word-based style markers) such as "while" and "upon" to clarify the disputed authorship of the Federalist Papers. Later, Burrows developed a set of more than 50 high-frequency words, which were also tested on the Federalist Papers. Tomoji [32] used a 74-word set to analyze Dickens's narrative style. Binongo and Smith [2] used the frequency of occurrence of 25 prepositions to discriminate between Oscar Wilde's plays and essays. Holmes [17] analyzed the use of "shorter" words (words of two or three letters) and "vowel words" (words beginning with a vowel). Such word-based methods can require intensive effort to select the set of words that best distinguishes a given set of authors [16]. In summary, the word-based approach is highly author and language dependent and is difficult to apply to a wide range of applications.

To avoid these problems, Baayen et al. [4] proposed the use of syntax-based features. This approach applies statistical measures and methods to the rewrite rules that appear in a syntactically annotated corpus. They demonstrated that syntax-based features can be more reliable for authorship identification than word-based features. Charniak [8] discussed statistical techniques for processing such syntactic information. Rudman [29] estimated that almost 1,000 style markers had been used in authorship analysis applications, and there is no agreement on a best set of style markers. As feature sets grew larger, conventional methods gave way to more powerful analytical methods such as machine learning.

2.3 Techniques for Authorship Analysis

In early studies, most analytical methods used in authorship analysis were statistical. The basic idea is that different authors produce different text compositions, which can be characterized by a probability distribution of word usage. More specifically, given a population of an author's texts, the identification of a new text can be treated as a statistical hypothesis test or a classification problem. Brainerd [1] used chi-squared and related distributions to perform lexical data analysis. An important statistical test was introduced in Thisted and Efron's paper [30]. Farringdon [12] first applied the CUSUM technique to authorship analysis. Francis [11] summarized the early statistical approaches used to resolve the Federalist Papers dispute. Baayen [3] presented a linguistic evaluation of diverse statistical models of word frequency. Although statistical methods achieved much success in authorship analysis, particular methods have constraints. For example, Holmes [17] found the CUSUM analysis unreliable because the stability of its characteristics over multiple texts is not warranted. Moreover, the prediction capability of statistical methods, such as attributing a new text to a certain author, is limited.

The advent of powerful computers instigated the extensive use of machine learning techniques in authorship analysis. A Bayesian model was applied by Mosteller and Wallace [24] to test the Federalist Papers. Building on their work, McCallum and Nigam [25] compared two different naïve Bayesian models for text classification. While naïve Bayesian models for text classification have structural limitations, a number of more powerful methods have also been applied to text categorization and authorship analysis. The most representative is the neural network. Tweedie et al. [33] used a standard feedforward artificial neural network, also called a multi-layer perceptron, to attribute authorship of the disputed Federalist Papers; the network they used had three hidden layers and two output layers, was trained with a conjugate gradient method, and was tested with the k-fold cross-validation approach. The result was consistent with previous work on this topic. Another neural network, the radial basis function (RBF) network, was used by Lowe and Matthews [21] to investigate the extent of Shakespeare's collaboration with his contemporary, John Fletcher, on various plays. More recently, Khmelev [19] presented a technique for authorship attribution based on a simple Markov chain, the key idea of which is to use the probabilities of subsequent letters as features. Diederich et al. [9] introduced the support vector machine (SVM) to this problem; in experiments identifying the writings of 7 target authors in a set of 2,652 newspaper articles covering three topic areas, the method detected the target authors in 60%-80% of the cases. A newer area of study is the identification of electronic message authors based on message contents. de Vel et al. [35] used SVM as a learning algorithm to classify 150 email documents from 3 authors, achieving an average accuracy of 80%. Generally speaking, machine learning methods achieve higher accuracies than statistical methods, because they can model the underlying distribution of personal word usage over a large set of features.

Based on this review, we present a taxonomy for authorship analysis research in Table 1, and Table 2 shows example studies in the field. Some general conclusions can be drawn from Table 2. First, most previous studies addressed an authorship identification problem, which initiated this research domain and kept attracting researchers' endeavors and applications of new techniques (e.g., the disputes over Shakespeare's work and the Federalist Papers). Second, style markers were the most frequently used features, because style markers are general content-free features in most types of literature.
Finally, statistical approaches were used extensively in this field, and more machine learning methods have been introduced to it recently.
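To make the style-marker idea concrete, the sketch below computes a few of the classic measures discussed above: average sentence length, vocabulary richness as a simple type-token ratio, the short-word ratio, and function-word frequencies. It illustrates the general technique rather than the feature set of any cited study, and the function-word list is a small hypothetical sample (the cited studies used curated lists of 25 to more than 100 words).

```python
import re

# A tiny hypothetical sample of function words.
FUNCTION_WORDS = ["while", "upon", "the", "of", "and", "to", "in"]

def style_markers(text: str) -> dict:
    """Compute a few classic content-free style markers for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text.lower())
    n_words = len(words) or 1  # guard against empty input
    markers = {
        "avg_sentence_length": n_words / max(len(sentences), 1),
        "vocabulary_richness": len(set(words)) / n_words,  # type-token ratio
        "short_word_ratio": sum(1 for w in words if len(w) <= 3) / n_words,
    }
    # Relative frequency of each function word (word-based style markers).
    for fw in FUNCTION_WORDS:
        markers[f"freq_{fw}"] = words.count(fw) / n_words
    return markers

print(style_markers("The cause of the dispute was unclear. Upon review, it remained so."))
```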

Table 1. Taxonomy for Authorship Analysis

Problems:
- P1 Authorship Identification: determines the likelihood of a particular author having written a piece of work by examining other works produced by the same author.
- P2 Authorship Characterization: summarizes the characteristics of an author and determines the author profile based on his/her works.
- P3 Similarity Detection: compares multiple pieces of work and determines whether or not they were produced by a single author, without actually identifying the author.

Features:
- M1 Style markers: content-free features such as frequency of function words, total number of punctuation marks, average sentence length, and vocabulary richness.
- M2 Structural features: such as use of a greeting statement, position of requoted text, and use of a farewell statement.
- M3 Content-specific features: such as frequency of keywords and special characters for special content.

Approaches:
- A1 Manual Analysis: uses manual examination and analysis of a set of works to draw conclusions about the author's characteristics, such as background, personality, and technical skill.
- A2 Statistical Analysis: uses statistical methods for calculating document statistics based on metrics, in order to analyze the characteristics of the author or to examine the similarity between various pieces of work.
- A3 Machine Learning: uses classification methods to predict the author of a piece of work based on a set of metrics.

Table 2. Previous Studies on Authorship Analysis (each study classified by the problem, feature, and approach labels of Table 1): Mosteller [23], de Vel [35], Thisted [30], Yule [39, 40], Elliot [10], Tomoji [32], Binongo [2], Baayen [4], Gray et al. [14], Bosch [5], Foster [13], Diederich [9], Brainerd [1], Farringdon [12], McCallum [25], and Khmelev [19].


3 Applying Authorship Analysis in Cybercrime Investigation


The large amount of cyberspace activity and its anonymous nature make cybercrime investigation extremely difficult. One of the major tasks in cybercrime investigation is tracing the real identity behind an illegal document. Normally the investigator tries to attribute a new illegal message to a particular criminal in order to obtain new clues. Conventional ways of dealing with this problem rely on manual work, which is largely limited by the sheer volume of messages and constantly changing author IDs. Automatic authorship analysis should therefore be highly valuable to cybercrime investigators. Figure 1 depicts the typical process of cybercrime identity tracing using the authorship analysis approach.

Fig. 1. A Framework of Cybercrime Investigation with Authorship Analysis

Assume that an investigator has a collection of illegal documents created by a particular suspected cyber criminal. In the first step, the feature extractor runs on those documents and generates a set of style features, which are used as the input to the learning engine. A feature-based model is then created as the output of the learning engine. This model can identify whether a newly found illegal document was written by that suspect under a different ID or name. This information helps the investigator focus his/her effort on a small set of illegal documents and effectively keep track of the more important cyber criminals (a minimal sketch of this pipeline appears at the end of this section).

Cyberspace texts have several characteristics that differ from those of literary works or published articles, and these make authorship analysis in cyberspace a challenge to researchers. One big problem is that cyber documents are generally short. This means that many language-based features used successfully in previous studies may not be appropriate (e.g., vocabulary richness). This may also give rise to the weak performance of some techniques, such as the Naïve Bayesian approach [35]. Also, the structure or composition style of a cyber document often differs from that of normal text documents, possibly because of the different purposes of the two kinds of writing. In other words, the style of cyber documents is less formal and the vocabulary is limited and less stable. These factors might also render previous feature selection heuristics ineffective. However, as a user spends more time in cyberspace, a more stable writing style will form. Some particular features, such as structural layout traits, unusual language usage, illegal content markers, and sub-stylistic features, may be useful in forming a suitable feature collection in the cybercrime investigation context.

Another new challenge is that cyber criminals can use any language to conduct crime. In fact, most large crime groups and terrorist organizations have international characteristics. They use the Internet to formulate plans, raise funds, spread propaganda, and communicate; for example, Osama bin Laden was known to use the Internet as a communication medium. Applying authorship analysis in a multilingual context is therefore becoming an important issue.

Our study aimed to answer the following research questions:
1. Will authorship analysis techniques be applicable to identifying authors in cyberspace?
2. What are the effects of using different types of features in identifying authors in cyberspace?
3. Will the authorship analysis framework be applicable in a multilingual context?
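The sketch below illustrates the Figure 1 pipeline end to end, wiring a feature extractor to a learning engine and using the resulting model to attribute a newly found message. It is an illustration only: the placeholder feature extractor, the invented messages and suspect labels, and the use of scikit-learn's SVC as the learning engine are all assumptions; Section 4 describes the features and algorithms actually evaluated.

```python
from sklearn.svm import SVC

def extract_features(message: str) -> list[float]:
    """Placeholder extractor; the real system computes style markers,
    structural features, and content-specific features (see Section 4)."""
    words = message.split()
    n = len(words) or 1
    return [
        len(message),                    # total characters
        n,                               # total words
        message.count("!") / n,          # exclamation rate
        sum(len(w) for w in words) / n,  # average word length
    ]

# Messages already attributed to known suspect IDs (training data).
train_messages = [
    "All CDs are original and come with documentation!",
    "Shipping is $3.00 for the first title.",
    "Classic word game, layout and photo editing tools.",
    "Best offer takes the whole lot, email me.",
]
train_authors = ["suspect_A", "suspect_A", "suspect_B", "suspect_B"]

# Learning engine: fit a feature-based model of each suspect's style.
model = SVC(kernel="linear")
model.fit([extract_features(m) for m in train_messages], train_authors)

# A newly found illegal message posted under an unknown ID.
new_message = "All titles are original CDs, shipping $3.00!"
print(model.predict([extract_features(new_message)]))  # predicted suspect
```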

4 Experiment Evaluation
To address the proposed research questions, we created a testbed and conducted several experiments, which are described in detail in this section.

4.1 Testbed

Two English data sets and one Chinese data set were collected for this study. The English data sets consist of an email message collection and an Internet newsgroup message collection; the Chinese data set consists of a Bulletin Board System (BBS) message collection.

English Email Messages. The first dataset contains 70 email messages provided by 3 students. Each student randomly selected 20-30 messages from his or her primary email account. The content of these messages covered a variety of topics, ranging from school work to research activities to personal interests. The purpose of introducing different topics was to minimize the impact of content similarity, which might otherwise inflate accuracy.

English Internet Newsgroup Messages. The second dataset contains 153 Internet newsgroup messages. Over a period of two weeks, we observed the activities of several USENET newsgroups involving computer software trading. Based on the average number of reads, posts, and unique user IDs per day, we identified the three most popular newsgroups relevant to our research. Through observation we were able to spot illegal sales of pirated software in all three newsgroups. Figure 2 is an example of such a message.

From: "The Collectaholic" <mkusz@comcast.net> Subject: Software Titles - Only $3.00 Newsgroups: misc.forsale.computers.other.software Date: 2002-10-04 12:07:22 PST All CDs are the original CDs in working condition and come with all theoriginal documentation. Shipping is $3.00 for first title and $.50 for each additional title. $1.00 Titles PC World The Best of MediaClips: sounds and graphics that can be used onmedia projects $3.00 Titles Boggle: classic word game Canon Publishing Suite: layout, drawing & photo editing tools

Fig. 2. Illegal Internet Newsgroup Message

We then identified the 9 most active users (each represented by a unique ID and email address) who frequently posted messages in these newsgroups. Messages posted by these users were carefully checked to determine whether or not they indicated illegal activities. Between 8 and 30 illegal messages per user were downloaded for use in the experiment.

Chinese BBS Messages. The Chinese BBS dataset consists of 70 messages downloaded from the most famous Chinese BBS in the US, bbs.mit.edu. These messages were randomly selected from messages posted by three authors. Tables 3, 4, and 5 summarize the composition of the three datasets.

Table 3. English Email Dataset

Author   T1   T2   T3   Number of messages
RZ        8    9    3   20
JX        2   18    8   28
YQ        3    5   14   22
Grand total number of messages: 70

T1 = number of messages about school work
T2 = number of messages about research activities
T3 = number of messages about personal interests

Table 4. English Internet Newsgroup Dataset

Author   N1   N2   N3   Number of messages
DLW       1   28    1    30
KD       10    9    1    20
dCN       3   17    0    20
DB        0   16    4    20
SW       18    0    2    20
DLB       0    6    2     8
DLM       0   17    0    17
JKYS      9    0    0     9
JZ        0    9    0     9
Grand total number of messages: 153

N1 = number of messages from misc.forsale.computers.other.software
N2 = number of messages from misc.forsale.computers.pc-specific.software
N3 = number of messages from misc.forsale.computers.mac-specific.software

Table 5. Chinese BBS Dataset

Author   Total number of messages
QQ       20
SKY      28
SEMA     22
Grand total number of messages: 70

4.2 Implementation

We describe below the implementation details of the two core components of our proposed authorship analysis framework: feature selection and inductive learning techniques.

Feature selection. Based on the review of previous studies on text and email authorship analysis, along with the specific characteristics of the messages in our datasets, we selected a large number of features that were potentially useful for identifying message authors. Three types of features were used: style markers, structural features, and content-specific features. We used the 122 function words and 48 markers suggested by de Vel [35]. Another 28 of the most common function words from the Oxford English Dictionary and 7 other markers were also included, and additional structural and content-specific features were added in our experiment, as shown in Table 6.

Table 6. Feature Selection for Authorship Analysis in Our Experiment

Additional style markers:
- Total number of words in subject
- Total number of characters in subject (S)
- Total number of upper-case characters in words in subject / S
- Total number of punctuation marks in subject / S
- Total number of whitespace characters in subject / S
- Total number of lines
- Total number of characters

Additional structural features:
- Type of signature (name, title, organization, email, URL, phone number)
- Use of special characters (e.g., --------) to separate message body and signature

Content-specific features:
- Has a price in subject
- Position of price in message body
- Has a contact email address in message body
- Has a contact URL in message body
- Has a contact phone number
- Uses a list of products
- Position of product list in message body
- Indicates product categories in list
- Format of product list
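To make Table 6 concrete, here is a sketch of how a few of these features might be computed for a newsgroup message. The regular expressions and feature definitions are illustrative assumptions, not the exact definitions used by our extractor.

```python
import re

def newsgroup_features(subject: str, body: str) -> dict:
    """Illustrative versions of a few Table 6 features."""
    s_len = len(subject) or 1
    return {
        # Additional style markers
        "subject_word_count": len(subject.split()),
        "subject_upper_ratio": sum(c.isupper() for c in subject) / s_len,
        "body_line_count": body.count("\n") + 1,
        # Structural features
        "has_separator_line": bool(re.search(r"^-{4,}\s*$", body, re.M)),
        # Content-specific features
        "price_in_subject": bool(re.search(r"\$\d+(\.\d{2})?", subject)),
        "email_in_body": bool(re.search(r"\b[\w.+-]+@[\w-]+\.\w+\b", body)),
        "url_in_body": "http://" in body or "www." in body,
    }

msg_subject = "Software Titles - Only $3.00"
msg_body = "All CDs are original.\nShipping is $3.00.\n--------\nmkusz@comcast.net"
print(newsgroup_features(msg_subject, msg_body))
```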

Techniques. We adopted a classification approach to predict the authorship of each message. Three learning algorithms (classifiers) were used in the experiments for comparison purposes: decision trees [28], backpropagation neural networks [22], and support vector machines [7].

Among the various symbolic learning algorithms developed over the past decade, ID3 and its variants have been tested extensively and shown to rival other machine learning techniques in predictive power [6]. ID3 is a decision-tree building algorithm developed by Quinlan [28]. It adopts a divide-and-conquer strategy and the entropy measure for object classification. In this experiment, we implemented an extension of the ID3 algorithm, the C4.5 algorithm, to handle attributes with continuous values.

Backpropagation neural networks have been extremely popular for their unique learning capability [38] and have been shown to perform well in different applications, such as medical applications [34]. They were introduced to authorship analysis by Kjell [20] and Tweedie [33]. We implemented a typical backpropagation neural network consisting of three layers: an input layer, a hidden layer, and an output layer [26], in which the input layer nodes are style features and the output nodes are author identities. Based on a general heuristic, the number of hidden layer nodes is typically set to (number of input nodes + number of output nodes)/2. In this study, because the number of input nodes is quite large, we modified the heuristic to (number of input nodes + number of output nodes)/10 and achieved relatively high accuracies in our experiments.

The support vector machine (SVM) is a learning machine first introduced by Vapnik [37]. It is based on the Structural Risk Minimization principle from computational learning theory. Because SVM is capable of handling millions of inputs and does not require feature selection [7], it has been used extensively in authorship analysis, which normally involves hundreds or thousands of input features [9]. For the experiment we used an SVM program written by Hsu and Lin [15] that was publicly available on the Internet.

All three algorithms have been applied to authorship analysis. In general, SVM and neural networks have performed better than decision trees [9], but most testbeds have been newspaper articles, such as the Federalist Papers. Because of the differences between online messages and formal articles, mentioned in Section 3, we still needed to test the performance of these three algorithms on our testbed.
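As an illustration only, the snippet below shows how the three classifiers might be configured with a modern library. scikit-learn's entropy-based decision tree and multi-layer perceptron approximate, but do not reproduce, the original C4.5 and backpropagation implementations, and its libsvm-based SVC stands in for Hsu and Lin's program; the hidden-layer size follows our modified (inputs + outputs)/10 heuristic.

```python
from sklearn.tree import DecisionTreeClassifier   # stand-in for C4.5
from sklearn.neural_network import MLPClassifier  # backpropagation network
from sklearn.svm import SVC                       # support vector machine

n_features = 205  # style markers + structural + content-specific features
n_authors = 9     # output nodes, one per candidate author

# Modified heuristic: (input nodes + output nodes) / 10 hidden nodes.
n_hidden = (n_features + n_authors) // 10

classifiers = {
    "C4.5": DecisionTreeClassifier(criterion="entropy"),  # entropy, as in ID3/C4.5
    "NN": MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=1000),
    "SVM": SVC(kernel="linear"),
}
```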


4.3 Experiment Design

We designed the experiment procedure as follows. Three experiments were conducted on the newsgroup dataset with one classifier at a time: first the 205 style markers were used, then 9 structural features were added in a second run, and 9 content-specific features were added in a third run. For the email dataset and the Chinese BBS dataset, two experiments were conducted with one classifier at a time: 205 style markers (67 for the Chinese BBS dataset) were first used as input to the classifiers, and 9 structural features were then added for a second run. A 30-fold cross-validation testing method was used in all experiments.

To evaluate prediction performance we use the accuracy, recall, and precision measures that have been commonly adopted in the information retrieval and authorship analysis literature [36]. Accuracy indicates the overall prediction performance of a particular classifier and is defined as in (1) for our experiments:

Accuracy = (Number of messages whose author was correctly identified) / (Total number of messages)    (1)

For a particular author, we use precision and recall to measure the effectiveness of our approach for identifying messages that were written by that author. We report the average precision and recall for all authors in a data set. The precision and recall are defined as in (2) and (3):

Precision = (Number of messages correctly assigned to the author) / (Total number of messages assigned to the author)    (2)

Recall = (Number of messages correctly assigned to the author) / (Total number of messages written by the author)    (3)
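A sketch of this evaluation protocol is shown below, assuming a feature matrix X and author labels y have already been produced by the feature extractor; scikit-learn is used as a modern stand-in for our original implementation, and the macro average reproduces our per-author averaging of precision and recall.

```python
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.svm import SVC

def evaluate(X, y):
    """30-fold cross-validated accuracy, precision, and recall."""
    # Plain (non-stratified) 30-fold split; with 70-153 messages per
    # dataset, each fold holds only a few test messages.
    folds = KFold(n_splits=30, shuffle=True, random_state=0)
    predicted = cross_val_predict(SVC(kernel="linear"), X, y, cv=folds)
    return {
        "accuracy": accuracy_score(y, predicted),                     # eq. (1)
        "precision": precision_score(y, predicted, average="macro"),  # eq. (2), averaged over authors
        "recall": recall_score(y, predicted, average="macro"),        # eq. (3), averaged over authors
    }
```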

4.4 Results & Analysis Based on the three datasets we prepared, we conducted experiments according to the design. The results are presented in Table 7, and detailed discussions are presented in this sub-section. Techniques comparison. We observed that SVM and neural networks achieved better performance than C4.5 decision tree algorithms in terms of precision, recall, and accuracies for all three datasets in our experiment. For example, using style markers on the email dataset, the C4.5, neural networks, and SVM achieved accuracies of 74.29%, 81.11% and 82.86% respectively. SVM also achieved consistently higher accuracies, precision, and recall than the neural networks. However, the performance differences between SVM and neural networks were relatively small. Our results were

Table 7. 30-Fold Testing Accuracy, Precision, and Recall (unit: percent)

                           C4.5                      Neural Network            SVM
Dataset       Measure      SM     SM+SF  SM+SF+CF    SM     SM+SF  SM+SF+CF    SM     SM+SF  SM+SF+CF
Newsgroup     Accuracy     86.28  90.20  90.85       84.31  94.77  95.42       88.24  95.42  96.08
              Precision    85.46  90.02  90.56       84.17  95.16  95.49       89.25  97.07  97.39
              Recall       85.11  88.37  88.92       80.17  91.18  92.60       85.87  94.72  95.83
Email         Accuracy     74.29  77.14  N/A         81.11  90.00  N/A         82.86  91.43  N/A
              Precision    72.23  79.03  N/A         82.67  90.10  N/A         83.92  91.23  N/A
              Recall       71.07  78.27  N/A         81.97  91.43  N/A         83.17  91.74  N/A
Chinese BBS   Accuracy     54.83  72.58  N/A         59.67  82.25  N/A         69.06  82.58  N/A
              Precision    54.73  71.83  N/A         60.50  82.40  N/A         70.45  83.92  N/A
              Recall       54.90  72.37  N/A         59.60  82.13  N/A         68.32  81.88  N/A

SM: Style Markers    SF: Structural Features    CF: Content-specific Features

Our results were generally consistent with previous studies, in that neural networks and SVM typically performed better than decision tree algorithms [9]. The good performance of SVM also conforms to its success in many other fields [18, 27].

Feature selection. As illustrated in Table 7, the authorship prediction performance varied significantly with different combinations of features. Pair-wise t-test results indicated that:

- Using style markers and structural features outperformed using style markers only: we achieved significantly higher accuracies on all three datasets (p-values all below 0.05) by adding the structural features. This might be explained by the fact that an author's consistent writing patterns show up in the structural features of his/her messages.
- Using style markers, structural features, and content-specific features did not outperform using style markers and structural features: adding content-specific features did not improve authorship prediction performance significantly (p-value of 0.3086). We think this is because authors of illegal messages typically deliver diverse content in their messages, so little additional information can be derived from the message contents to determine authorship.

In response to our second research question, we conclude that structural features help to achieve higher accuracies, while content-specific features do not improve the performance of online message authorship identification. We also observed that high accuracies were obtained using only style markers as input features for the English datasets; the accuracies ranged from 71% to 89%. This indicates that style markers carry a large amount of information about the writing styles of online messages and were surprisingly robust in predicting authorship.

Chinese dataset performance. We noticed a significant drop in prediction performance for the Chinese BBS dataset compared with the English datasets. For example, when using style markers only, C4.5 achieved average accuracies of 86.28% and 74.29% for the English newsgroup and email datasets, while for the Chinese dataset it achieved an average accuracy of only 54.83%. The reason is that only 67 Chinese style markers were used in our current experiments, significantly fewer than the 205 style markers used with the English datasets. We also observed that when structural features were added, all three algorithms achieved relatively high precision, recall, and accuracy (from 71% to 83%) on the Chinese dataset. Considering the significant language differences, our proposed approach to online message identity tracing appears promising in a multilingual context.
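The pair-wise comparisons above rest on standard paired t-tests over matched accuracy observations. A minimal sketch with SciPy follows; the three C4.5 dataset-level accuracies from Table 7 are used purely to illustrate the call, since our reported p-values come from the full set of paired runs rather than these three points.

```python
from scipy import stats

# Paired accuracies from matched runs of one classifier: style markers
# only (SM) vs. style markers plus structural features (SM+SF).
# C4.5 accuracies from Table 7 (Newsgroup, Email, Chinese BBS).
acc_sm = [0.8628, 0.7429, 0.5483]
acc_sm_sf = [0.9020, 0.7714, 0.7258]

t_stat, p_value = stats.ttest_rel(acc_sm_sf, acc_sm)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```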

5 Conclusion & Future Work


Our experiments demonstrated that, with a set of carefully selected features and an effective learning algorithm, we were able to identify the authors of Internet newsgroup and email messages with reasonably high accuracy. We achieved average prediction accuracies of 80%-90% for email messages, 90%-97% for newsgroup messages, and 70%-85% for Chinese Bulletin Board System (BBS) messages. Significant performance improvement was observed when structural features were added on top of style markers. We also observed that SVM outperformed the other two classifiers on all occasions. The experimental results indicate a promising future for applying automatic authorship analysis approaches in cybercrime investigation to address the identity-tracing problem. Using such techniques, investigators would be able to identify major cyber criminals who post illegal messages on the Internet, even though they may use different identities.

This study will be expanded in the future to include more authors and messages in order to further demonstrate the scalability and feasibility of our proposed approach, and more illegal messages will be incorporated into our testbed. The current approach will also be extended to analyze the authorship of other cybercrime-related materials, such as bomb threats, hate speech, and child pornography images. Another, more challenging, future direction is to automatically generate an optimal feature set specifically suited to a given dataset; we believe this will yield better performance across different datasets.


Acknowledgment. This project has primarily been funded by the following grants: National Science Foundation, Digital Government Program, "COPLINK Center: Information and Knowledge Management for Law Enforcement," #9983304, July 2000-June 2003; National Institute of Justice, "COPLINK: Database Integration and Access for a Law Enforcement Intranet," #97-LB-VX-K023, July 1997-January 2000. We would like to thank Robert Chang of the Taiwan National Intelligence Office for initiating this project. We would also like to thank the officers of the Tucson Police Department, Detective Tim Petersen, Sergeant Jennifer Schroeder, and Detective Daniel Casey, for their assistance with the project. Members of the Artificial Intelligence Laboratory who directly contributed to this paper are Michael Chau, Jie Xu, and Wingyan Chung.

References
1. B. Brainerd, Statistical Analysis of Lexical Data Using Chi-squared and Related Distributions, Computers and the Humanities, 9, 161-178 (1975).
2. Binongo and Smith, A Study of Oscar Wilde's Writings, Journal of Applied Statistics, 26(7), 781 (1999).
3. R. H. Baayen, Statistical Models for Word Frequency Distributions: A Linguistic Evaluation, Computers and the Humanities, 26, 347-363 (1993).
4. R. H. Baayen, H. van Halteren, and F. J. Tweedie, Outside the Cave of Shadows: Using Syntactic Annotation to Enhance Authorship Attribution, Literary and Linguistic Computing, 2, 110-120 (1996).
5. R. Bosch and J. Smith, Separating Hyperplanes and the Authorship of the Disputed Federalist Papers, American Mathematical Monthly, 105(7), 601-608 (1998).
6. H. Chen, G. Shankaranarayanan, A. Iyer, and L. She, A Machine Learning Approach to Inductive Query by Examples: An Experiment Using Relevance Feedback, ID3, Genetic Algorithms, and Simulated Annealing, Journal of the American Society for Information Science, 49(8), 693-705 (1998).
7. N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press (2000).
8. E. Charniak, Statistical Language Learning, MIT Press, Cambridge (1993).
9. J. Diederich, J. Kindermann, E. Leopold, and G. Paass, Authorship Attribution with Support Vector Machines, Applied Intelligence (2000).
10. W. Elliot and R. Valenza, Was the Earl of Oxford the True Shakespeare?, Notes and Queries, 38, 501-506 (1991).
11. I. S. Francis, An Exposition of a Statistical Approach to the Federalist Dispute, in J. Leed (Ed.), The Computer and Literary Style, pp. 38-79, Kent State University Press, Kent, Ohio (1966).
12. J. M. Farringdon, Analyzing for Authorship: A Guide to the Cusum Technique, University of Wales Press, Cardiff (1996).
13. D. Foster, Author Unknown: On the Trail of Anonymous, Henry Holt, New York (2000).
14. A. Gray, P. Sallis, and S. MacDonell, Software Forensics: Extending Authorship Analysis Techniques to Computer Programs, in Proc. 3rd Biannual Conf. Int. Assoc. of Forensic Linguists (IAFL'97), pp. 1-8 (1997).
15. C. W. Hsu and C. J. Lin, A Comparison of Methods for Multi-class Support Vector Machines, IEEE Transactions on Neural Networks, 13, 415-425 (2002).
16. D. I. Holmes and R. S. Forsyth, The Federalist Revisited: New Directions in Authorship Attribution, Literary and Linguistic Computing, 10, 111-127 (1995).
17. D. I. Holmes, The Evolution of Stylometry in Humanities, Literary and Linguistic Computing, 13(3) (1998).
18. T. Joachims, Text Categorization with Support Vector Machines, in Proceedings of the European Conference on Machine Learning (ECML) (1998).
19. D. V. Khmelev and F. J. Tweedie, Using Markov Chains for Identification of Writers, Literary and Linguistic Computing, 16(4), 299-307 (2001).
20. B. Kjell, Authorship Determination Using Letter-pair Frequency Features with Neural Network Classifiers, Literary and Linguistic Computing, 9, 119-124 (1994).
21. D. Lowe and R. Matthews, Shakespeare vs. Fletcher: A Stylometric Analysis by Radial Basis Functions, Computers and the Humanities, 29, 449-461 (1995).
22. R. P. Lippmann, An Introduction to Computing with Neural Networks, IEEE Acoustics, Speech and Signal Processing Magazine, 4(2), 4-22 (1987).
23. F. Mosteller and D. L. Wallace, Inference and Disputed Authorship: The Federalist, Addison-Wesley, Reading, Mass. (1964).
24. F. Mosteller and D. L. Wallace, Applied Bayesian and Classical Inference: The Case of the Federalist Papers, in the 2nd edition of Inference and Disputed Authorship: The Federalist, Springer-Verlag (1964).
25. A. McCallum and K. Nigam, A Comparison of Event Models for Naive Bayes Text Classification, AAAI-98 Workshop on Learning for Text Categorization (1998).
26. J. Moody and J. Utans, Architecture Selection Strategies for Neural Networks: Application to Corporate Bond Rating, in Neural Networks in the Capital Markets (1995).
27. E. Osuna, R. Freund, and F. Girosi, Training Support Vector Machines: An Application to Face Detection, Proceedings of Computer Vision and Pattern Recognition, 130-136 (1997).
28. J. R. Quinlan, Induction of Decision Trees, Machine Learning, 1(1), 81-106 (1986).
29. J. Rudman, The State of Authorship Attribution Studies: Some Problems and Solutions, Computers and the Humanities, 31, 351-365 (1998).
30. R. Thisted and B. Efron, Did Shakespeare Write a Newly Discovered Poem?, Biometrika, 74, 445-455 (1987).
31. D. Thomas and B. D. Loader, Introduction: Cyber Crime: Law Enforcement, Security and Surveillance in the Information Age, Taylor & Francis Group, New York, NY (2000).
32. T. Tomoji, Dickens's Narrative Style: A Statistical Approach to Chronological Variation, Revue Informatique et Statistique dans les Sciences Humaines (RISSH, Centre Informatique de Philosophie et Lettres, Université de Liège, Belgium), 30, 165-182 (1994).
33. F. J. Tweedie, S. Singh, and D. I. Holmes, Neural Network Applications in Stylometry: The Federalist Papers, Computers and the Humanities, 30(1), 1-10 (1996).
34. K. M. Tolle, H. Chen, and H. Chow, Estimating Drug/Plasma Concentration Levels by Applying Neural Networks to Pharmacokinetic Data Sets, Decision Support Systems, Special Issue on Decision Support for Health Care in a New Information Age, 30(2), 139-152 (2000).
35. O. de Vel, A. Anderson, M. Corney, and G. Mohay, Mining E-mail Content for Author Identification Forensics, SIGMOD Record, 30(4), 55-64 (2001).
36. O. de Vel, Mining E-mail Authorship, in Proc. Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining (KDD'2000) (2000).
37. V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York (1995).
38. B. Widrow, D. E. Rumelhart, and M. A. Lehr, Neural Networks: Applications in Industry, Business, and Science, Communications of the ACM, 37, 93-105 (1994).
39. G. U. Yule, On Sentence Length as a Statistical Characteristic of Style in Prose, Biometrika, 30 (1938).
40. G. U. Yule, The Statistical Study of Literary Vocabulary, Cambridge University Press (1944).
