
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 10, No. 3, March 2012

Data Mining Techniques: A Key for Detection of Financial Statement Fraud

Rajan Gupta, Research Scholar, Dept. of Computer Sc. & Applications, Maharshi Dayanand University, Rohtak (Haryana), India.
Nasib Singh Gill, Head, Dept. of Computer Sc. & Applications, Maharshi Dayanand University, Rohtak (Haryana), India.

ISSN 1947-5500

In recent times, much of the news from the business world has been dominated by financial statement fraud. A financial statement becomes fraudulent when management intentionally incorporates false information into it. This paper implements data mining techniques, namely CART, Naive Bayesian classifier and Genetic Programming, to identify companies that issue fraudulent financial statements. Each of these techniques is applied to a dataset from 114 companies. CART outperforms the other techniques in the detection of fraud.

1. Introduction
Financial statement fraud is a serious social and economic problem worldwide, and it is more severe in developing countries. A company listed on any stock exchange is required to publish its financial statements, such as the balance sheet, income statement, statement of retained earnings and cash flow statement, yearly and quarterly. The financial statements of a company reflect its actual financial health, and by analysing them stockholders can make an informed decision about investing in the company. An intentional distortion of information in a financial statement is termed financial statement fraud. Conventionally, auditors are responsible for identifying and detecting fraudulent financial statements, although they are only supposed to report whether a statement conforms to GAAP. With the increase in the number of high-profile fraud cases, auditors are overburdened with the additional duty of detecting fraud. Hence, various data mining techniques are being used to ease this extra pressure on auditors.

Some of the world's major fraud cases include Enron, WorldCom and Satyam. A number of Chinese companies listed on US stock exchanges have faced accusations of accounting fraud, and in June 2011 the U.S. Securities and Exchange Commission (SEC) warned investors against investing in Chinese firms listing via reverse mergers. While over 20 US-listed Chinese companies were de-listed or halted in 2011, a number of others were hit by the resignation of their auditors. Warnings of fraud in US-listed Chinese companies have grown in recent months. In January 2011, the shares of China Forestry Holdings were suspended after the auditor KPMG informed the board of directors of possible irregularities in its accounting books. On 11 April 2011, the SEC suspended trading in RINO International due to questions surrounding the accuracy and completeness of information contained in RINO's public filings, and the company's failure to report the resignation of its chairman, directors of the board and an outside lawyer and forensic accountants brought in to investigate allegations of fraud. The finger was pointed at Sino-Forest Corporation, a Toronto-listed forestry firm, on 2 June 2011, after a short seller accused the firm of inflating its assets. More recently, the unravelling of Longtop Financial Technologies Ltd highlighted the scale of the problem. The company regularly reported income that was slightly higher than executives' predictions [1]. The "cash balance" on Longtop's balance sheet was fake, a fiction created by the company's managers with bank complicity [2].

Data mining is an iterative process within which progress is defined by the discovery of knowledge. It is most useful in an exploratory analysis scenario in which there are no predetermined notions about what will constitute an interesting outcome [3]. The application of data mining techniques to the detection and identification of financial statement fraud is a fertile research area. Several law enforcement agencies and special investigative units have used data mining techniques successfully for the detection of financial fraud.

In this study, we analyse the financial statements of various organisations to detect financial statement fraud using data mining techniques. This research aims at identifying the financial ratios / items from financial statements that can help auditors assess the probability of financial fraud. Three data mining techniques, namely CART, Naive Bayesian Classifier and Genetic Programming, are tested for their applicability in the detection of fraudulent
financial statements and in differentiating between fraud and non-fraud reporting. The dataset consists of financial ratios obtained from publicly available financial statements.

The paper is organised as follows: Section 2 discusses the relevant prior research, followed by Section 3, which describes the various tricks adopted by management for falsifying financial statements. Section 4 presents the key variables and financial ratios related to the detection of financial statement fraud. Section 5 provides an insight into the data mining techniques used in this study. Section 6 analyses the results, followed by concluding remarks (Section 7).

2. Related Work
An overview of the academic literature concerning the detection of financial statement fraud is given here. A number of studies, such as those by PwC [4] and ACFE [5], report on how fraud is detected. Their findings suggest that fraud is often detected by chance or by accident. For example, the PwC report [4] reveals that 41% of fraud cases were detected by means of tip-offs or by chance.

Several groups of researchers have devoted a significant amount of effort to studying the use of data mining techniques in the detection of financial statement fraud from different perspectives. Beasley [6] used logit regression to test the prediction that the inclusion of larger proportions of outside members on the board of directors significantly reduces the likelihood of financial statement fraud, with a sample of 150 American firms. The study found that non-fraud firms have boards with significantly higher percentages of outside members than fraud firms. Green and Choi [7] presented a neural network fraud classification model employing endogenous financial data. A classification model created from the learned behaviour pattern is then applied to a test sample. Fanning and Cogger [8] also used an artificial neural network to predict management fraud.
Using publicly available predictors of fraudulent financial statements, they found a model of eight variables with a high probability of detection. Kirkos [9] carried out an in-depth examination of publicly available data from the financial statements of various firms in order to detect fraudulent financial statements (FFS) by using data mining classification methods. In that study, three data mining techniques, namely Decision Trees, Neural Networks and Bayesian Belief Networks, were tested for their applicability in management fraud detection. Spathis et al. [10] compared multicriteria decision aids with statistical techniques such as logit and discriminant analysis in detecting fraudulent financial statements. Cecchini et al. [11] developed a novel financial kernel using support vector machines for the detection of management fraud. An innovative fraud detection mechanism was developed by Huang et al. [12] on the basis of Zipf's Law. This technique reduces the burden on auditors in reviewing overwhelming volumes of data and assists them in identifying potential fraud records. Hoogs et al. [13] presented a genetic algorithm approach to detecting financial statement fraud.

Cerullo and Cerullo [14] explained the nature of fraud and financial statement fraud along with the characteristics of neural networks (NN) and their applications. They illustrated how NN packages could be utilized by various firms to predict the occurrence of fraud. Koskivaara [15] proposed NN-based support systems as a possible tool for use in auditing and demonstrated that the main application areas of NN were the detection of material errors and management fraud. Busta and Weinberg [16] used NN to distinguish between normal and manipulated financial data. They examined the digit distribution of the numbers in the underlying financial information. Koh and Low [17] constructed a decision tree to predict hidden problems in financial statements by examining the following six variables: quick assets to current liabilities, market value of equity to total assets, total liabilities to total assets, interest payments to earnings before interest and tax, net income to total assets, and retained earnings to total assets. Belinna et al. [18] examined the effectiveness of CART in identifying and detecting financial statement fraud. They concluded that CART is a very effective technique for distinguishing fraudulent financial statements from non-fraudulent ones. Further, Deshmukh and Talluru [19] demonstrated the construction of a rule-based fuzzy reasoning system to assess the risk of management fraud and proposed an early warning system by identifying 15 rules related to the probability of management fraud.

Zhou & Kapoor [20] examined the effectiveness and limitations of data mining techniques such as regression, decision trees, neural networks and Bayesian networks. They explored a self-adaptive framework based on a response surface model with domain knowledge to detect financial statement fraud. Recently, Ravisankar et al. [21] used data mining techniques such as Multilayer Feed Forward Neural Network (MLFF), Support Vector Machines (SVM), Genetic Programming (GP), Group Method of Data Handling (GMDH), Logistic Regression (LR) and Probabilistic Neural Network (PNN) to identify companies that resort to financial statement fraud. They found that PNN outperformed all the other techniques without feature selection, and that GP and PNN outperformed the others with feature selection, with marginally equal accuracies.

Summarizing the existing academic research, we conclude that the detection of financial statement fraud is an instance of a classification and
decision problem. In the present research, we apply the same idea and implement data mining classification methods to differentiate between fraudulent and non-fraudulent observations.

3. Artifice used by top level executives for fraudulent financial reporting
Financial statements are a company's basic documents to reflect its financial status [22]. A complete and thorough analysis of financial statements can help investors judge the financial status of a company. Any material misstatement in the financial statements is a major concern to investors worldwide. The techniques associated with the production of fraudulent financial statements are discussed in Schilit's book Financial Shenanigans [23]. The book reports seven common tricks: (1) Recording revenue before it is earned; (2) Creating fictitious revenue; (3) Boosting profits with non-recurring transactions; (4) Shifting current expenses to a later period; (5) Failing to record or disclose liabilities; (6) Shifting current income to a later period; and (7) Shifting future expenses to an earlier period. The first five tricks aim at boosting current-year earnings, and the last two shift current-year earnings to the future in order to create an illusion of steady income over the years.

A number of studies have been conducted to find indicators of fraudulent financial statements. One such study, conducted by C. Fei in China, found four types of companies that are more prone to financial scandals [24]:
(i) Companies with frequent capital operations and related-party transactions
(ii) Companies with high and volatile stock prices
(iii) Initial Public Offering (IPO) companies
(iv) Companies in a declining or over-competitive business environment

The management of an organisation may falsify the financial statements to achieve the following:
a. Sanction of a sizeable loan from a bank
b. Payment of lower dividends to shareholders
c. Avoidance of taxes
d. Inflated stock prices

At present there is a steady increase in the number of companies that falsify their financial statements in order to present a rosy picture of their financial status to stockholders and to make selfish gains. Hence, detection of such fraud is an additional responsibility of auditors and is the need of the hour.

4. Key financial items / ratios relevant to the detection of fraud
The values and numbers present in financial statements can be easily interpreted with the help of
different financial ratios. Financial ratios assist investors / auditors in evaluating the actual position of the company. On the basis of existing academic research and expert knowledge, we identify the following financial variables / ratios (Table 1).

(a) Z-score: Financial distress may be a motivation for management fraud [8]. To measure financial distress, the Z-score was developed by Altman [25]. It is a formula for estimating the financial status of a company and is also helpful in bankruptcy prediction. The Z-score for public companies is given by:

$$
Z = 1.2\,\frac{\text{Working capital}}{\text{Total assets}}
  + 1.4\,\frac{\text{Retained earnings}}{\text{Total assets}}
  + 3.3\,\frac{\text{Earnings before interest and tax}}{\text{Total assets}}
  + 0.6\,\frac{\text{Market value of equity}}{\text{Book value of total liabilities}}
  + 0.999\,\frac{\text{Sales}}{\text{Total assets}}
$$

(b) Debt structure: A high debt structure may be an indicator of fraudulent financial reporting, because it shifts the risk from managers to debt owners. Hence, higher levels of debt may increase the likelihood of financial statement fraud, and the financial ratios related to debt structure should be considered carefully.

(c) Continuous growth: The need for continuous growth may be another motivational factor for financial statement fraud [26]. So, the sales growth ratio should be measured as an indicator of fraudulent financial statements.

$$
\text{Sales growth} = \frac{\text{Current year's sales} - \text{Last year's sales}}{\text{Last year's sales}}
$$

(d) Other items: A company may manipulate accounts receivable, inventories and gross margin. Accounts receivable may be manipulated by recording sales before they are earned. Inventory is also prone to manipulation. Managers may manipulate inventory either by reporting inventory at a lower cost, by carrying obsolete inventory, or both. A company may also use gross margin to falsify its financial statements: it may not match its sales with the corresponding cost of goods sold, thus increasing gross margin and net income and strengthening the balance sheet [8].

(e) Other qualitative variables: Qualitative variables such as the qualifications or composition of the administrative board of a company, previous auditors, high turnover of CEO and CFO, and the size and age of a company could prove helpful in searching for indicators of cooked books.
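The two ratios defined in items (a) and (c) can be computed directly from statement items. The following is a minimal sketch, not the authors' code: the function and parameter names are hypothetical, and the fourth Z-score term is assumed to be market value of equity over book value of total liabilities, as in Altman's published formula for public companies.

```python
def altman_z_score(working_capital, retained_earnings, ebit,
                   market_value_equity, total_liabilities,
                   sales, total_assets):
    """Altman Z-score for a public company.

    Assumption: the 0.6 term uses market value of equity divided by
    book value of total liabilities (Altman's original definition).
    """
    return (1.2 * working_capital / total_assets
            + 1.4 * retained_earnings / total_assets
            + 3.3 * ebit / total_assets
            + 0.6 * market_value_equity / total_liabilities
            + 0.999 * sales / total_assets)


def sales_growth(current_year_sales, last_year_sales):
    """Relative change in sales from last year to the current year."""
    return (current_year_sales - last_year_sales) / last_year_sales


# Hypothetical figures for one company (same currency unit throughout).
z = altman_z_score(working_capital=20, retained_earnings=30, ebit=10,
                   market_value_equity=80, total_liabilities=40,
                   sales=100, total_assets=100)
growth = sales_growth(current_year_sales=120, last_year_sales=100)
```

Both values would then enter the input vector alongside the other ratios of Table 1.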


5. Research Methodology
5.1 Dataset
The dataset used in this research is obtained from 114 companies listed on different stock exchanges. Of these 114 firms, 85 have no record of fraudulent financial reporting, whereas 29 face various charges of fraudulent financial reporting. The data have been collected for all 114 companies. We reviewed the AAERs (Accounting and Auditing Enforcement Releases) published by the SEC (U.S. Securities and Exchange Commission) between 2007 and 2012 to identify companies accused of falsifying financial statements. All the firms in the sample have been checked by auditors. There was a clear indication of fraudulent financial reporting for the 29 fraud firms. Indicators of fraud include resignation of the auditors, chairman or board of directors, doubts reported by auditors, and observations by the tax authorities. The 29 fraud firms have been matched with 85 non-fraud organisations. These firms are classified as non-fraud because no published indication or proof of fraud is present. However, the absence of proof does not guarantee that these firms have not falsified their financial statements or will not do so in future. This research only assures that no fraudulent reporting has been found for these firms.

5.2 Variables
All the variables considered as candidates for the input vector have been extracted from published financial statements, such as the income statement and balance sheet. The dataset contains 52 financial ratios / items for each of the 114 companies. A list of these financial ratios / items is presented in Table 1. The selection of these financial variables is based on prior research and on financial ratios covering the liquidity, safety, profitability and efficiency of the organisations under consideration. During the preprocessing stage, each of the independent financial variables has been normalized. To further improve the reliability of the results, we perform ten-fold cross validation.

Table 1: Items / Ratios from financial statement to be used for detection of financial statement fraud

S.No.  Financial items / Ratios
1      Debt
2      Total assets
3      Gross profit
4      Net profit
5      Primary business income
6      Cash and deposits
7      Accounts receivable
8      Inventory/Primary business income
9      Inventory/Total assets
10     Gross profit/Total assets
11     Net profit/Total assets
12     Current assets/Total assets
13     Net profit/Primary business income
14     Accounts receivable/Primary business income
15     Primary business income/Total assets
16     Current assets/Current liabilities
17     Primary business income/Fixed assets
18     Cash/Total assets
19     Inventory/Current liabilities
20     Total debt/Total equity
21     Long term debt/Total assets
22     Net profit/Gross profit
23     Total debt/Total assets
24     Total assets/Capital and reserves
25     Long term debt/Total capital and reserves
26     Fixed assets/Total assets
27     Deposits and cash/Current assets
28     Capitals and reserves/Total debt
29     Accounts receivable/Total assets
30     Gross profit/Primary business profit
31     Undistributed profit/Net profit
32     Primary business profit/Primary business profit of last year
33     Primary business income/Last year's primary business income
34     Accounts receivable/Accounts receivable of last year
35     Total assets/Total assets of last year
36     Debt / Equity
37     Accounts Receivable / Sales
38     Inventory / Sales
39     Sales Gross Margin
40     Working Capital / Total Assets
41     Net Profit / Sales
42     Sales / Total Assets
43     Net income / Fixed Assets
44     Quick assets / Current Liabilities
45     Revenue / Total Assets
46     Current Liabilities / Revenue
47     Total Liability / Revenue
48     Sales Growth Ratio
49     EBIT
50     Z Score
51     Retained Earnings / Total Assets
52     EBIT / Total Assets
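The preprocessing described in Section 5.2 (normalizing each variable, then ten-fold cross validation) can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the paper does not say which normalization was applied, so min-max scaling is assumed here, the folds are plain shuffled folds without stratification, and all function names are hypothetical.

```python
import random


def min_max_normalize(column):
    """Rescale one variable to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


scaled = min_max_normalize([2.0, 4.0, 6.0])
splits = list(k_fold_indices(114, k=10))  # 114 companies, ten folds
```

Each company appears in exactly one test fold, so every observation is used once for validation and nine times for training.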

We compiled all the financial items / ratios of Table 1 and applied one-way ANOVA to the dataset to reduce dimensionality and to test whether the difference between the two classes, fraud and non-fraud, was significant for each variable. Variables with a high p-value are considered non-informative. Variables with a p-value <= 0.05 are considered informative and are tested further using data mining methods. The informative financial ratios are presented in Table 2 along with their F-values and p-values.
Table 2: Informative financial ratios / items

S.No.  Financial Ratios / Items                       P-value   F-value
1      Debt                                           0.028     1.345
2      Inventory/Primary business income              0.001     3.031
3      Inventory/Total assets                         0.046     6.099
4      Net profit/Total assets                        0.001     3.048
5      Cash/Total assets                              0.001     3.057
6      Total debt/Total assets                        0.002     3.925
7      Fixed assets/Total assets                      0.05      2.311
8      Deposits and cash/Current assets               0.001     3.044
9      Working Capital / Total Assets                 0.001     3.059
10     Sales / Total Assets                           0.002     2.22
11     Net income / Fixed Assets                      0.001     5.749
12     Revenue / Total Assets                         0.002     3.045
13     EBIT                                           0.026     2.906
14     Z score                                        0.001     2.303
15     Accounts receivable/Primary business income    0.018     2.075
16     Primary business income/Total assets           0.001     2.937
17     Primary business income/Fixed assets           0.001     2.863
18     Capitals and reserves/Total debt               0.003     12.077
19     Gross profit/Primary business profit           0.008     3.04
20     Accounts Receivable / Sales                    0.013     12.077
21     Retained earnings / Total Assets               0.001     4.297
22     EBIT / Total Assets                            0.001     3.137
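The screening step behind Table 2 computes, for each variable, a one-way ANOVA F statistic comparing the fraud and non-fraud groups. The sketch below shows that computation for a single variable; the function name and the sample values are hypothetical, and it returns only the F statistic. Obtaining the p-value additionally requires the F distribution with (k - 1, n - k) degrees of freedom, which a statistics library such as SciPy (`scipy.stats.f_oneway`) provides.

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic for a list of samples (one list per class)."""
    k = len(groups)                      # number of classes
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the class means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own class mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical values of one ratio for non-fraud and fraud firms.
non_fraud = [1.0, 2.0, 3.0]
fraud = [2.0, 3.0, 4.0]
f_value = one_way_anova_f([non_fraud, fraud])
```

A variable would be kept as informative when the p-value associated with its F statistic is at most 0.05.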

5.3 Data Mining Methods
Detection of financial statement fraud can be considered a classical classification problem. Classification involves two steps. In the first step, a model that describes a set of predetermined classes is constructed. The sample used in this process is known as the training sample. Each tuple in the training set is assumed to belong to a predefined class, as determined by the class label attribute. This supervised learning step is followed by a second step in which the model attempts to classify new objects, which form the validation sample. Data mining offers a number of classification techniques with an excellent reputation for their classification capabilities. Most of these methods are derived from artificial intelligence and statistics. Three such classification methods, namely CART, the naive Bayesian classifier and Genetic Programming, are employed in this research study.

5.3.1 CART
Classification and Regression Tree (CART) is a computerized, non-parametric data exploration and prediction technique which uses historical data with pre-assigned classes. CART is a classification method different from traditional statistical methods. It is a decision tree learning technique that produces a classification tree if the dependent variable is categorical and a regression tree otherwise. The method classifies the samples into a number of non-overlapping regions. Tree construction using the CART methodology involves three steps. The first step, also known as the growing phase, consists of constructing the maximum tree, which means that the learning sample is split up to the point where each terminal node contains observations of only one class. This step is the most time-consuming because each iteration seeks the best splitting variable. A tree constructed in this way may consist of hundreds of levels and insignificant nodes or subtrees. Therefore, in the second step, this complex tree should be pruned, using cross validation as one of the pruning algorithms. Pruning results in a right-sized tree, which is used in the third step for classifying new data.

CART as a classification method does not require variables to be selected in advance. It automatically identifies significant variables and ignores the non-significant ones. Moreover, CART is very sensitive to the training data if it contains outliers. Outliers are very prominent in financial data due to financial crises. Its non-parametric nature and its capability of handling noisy data are among the reasons for selecting CART as one of the methods used in this research.

5.3.2 Naive Bayesian Classifier
The naive Bayesian classifier is a probabilistic learning technique based on applying Bayes' theorem with the class conditional independence assumption. This strong (naive) independence assumption states that the presence or absence of an attribute of a class is not related to the presence or absence of any other attribute. Bayes' theorem calculates the posterior probability as

$$
P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}
$$

where H is a hypothesis such as that the object X belongs to class C.

If an object X belongs to one of i alternative classes, then in order to classify the object a Bayesian classifier calculates the probabilities P(C_i | X) for all the possible classes C_i and assigns the object to the class with the maximum probability P(C_i | X). The conditional distribution over the class variable C, given the feature variables F_1, ..., F_n, can be expressed as

$$
P(C \mid F_1, \ldots, F_n) = \frac{1}{Z}\, P(C) \prod_{i=1}^{n} P(F_i \mid C)
$$
where Z (the evidence) is a scaling factor dependent only on the feature variables or attributes, i.e., a constant if the values of the feature variables are known. If the assumption of class conditional independence holds true, the naive Bayesian classifier produces the best accuracy rates. This classification technique requires only a small amount of data to estimate the parameters, such as the means and variances of the variables, necessary for classification. The independence assumption greatly reduces the computational cost, as only the class distribution has to be counted. However, the assumption may not be valid in many cases, because attributes are generally dependent in nature. This naive design and simplified assumption should not be taken as a limitation, because the naive Bayes classifier works much better than expected in many complex, real-world situations.

5.3.3 Genetic Programming
Genetic programming (GP) is an evolutionary learning technique that offers great potential for classification [27]. GP follows Darwin's theory of evolution, commonly known as survival of the fittest. A randomly generated initial population of solutions reproduce with each other using various genetic operators such as reproduction, crossover and mutation. Each cycle of this evolutionary process is termed a generation.

GP is essentially considered to be a variant of genetic algorithms (GA) that uses a complex representation language to codify individuals [28]. The basic difference between GP and GA is the representation of solutions. Genetic programming follows these sequential steps for solving a problem [29]:
a) Create a random population of programs, or rules, using the symbolic expressions provided, as the initial population.
b) Evaluate each program or rule by assigning a fitness value according to a predefined fitness function that can measure the capability of the rule or program to solve the problem.
c) Use the reproduction operator to copy existing programs into the new generation.
d) Generate the new population with crossover, mutation, or other operators from a randomly chosen set of parents.
e) Repeat the second to the fourth steps for the new population until a predefined termination criterion is satisfied, or a fixed number of generations is completed.
f) The solution to the problem is the genetic program with the best fitness within all generations.

The most important operation for generating a new population in GP is crossover. Crossover is achieved by selecting two parent trees and reproducing to form two new solutions. The parent trees are selected from the initial population by a function of the fitness of the solutions. The first offspring of the crossover operation is created by deleting the crossover fragment of the first parent and then inserting the crossover fragment of the second parent. The second offspring is produced in a symmetric manner. The fitness function used to search for the most efficient computer program that can solve the given problem is given below [29].

$$
\text{Fitness} = \frac{\text{No. of samples classified correctly}}{\text{No. of samples used for training during evaluation}}
$$
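The fitness function and the crossover operation described above can be illustrated with a small sketch. It is not the Discipulus implementation: the tuple-based tree encoding, the operator set, the rule that a positive output means "fraud", and all function names are assumptions made for illustration.

```python
import operator

# Hypothetical function set for the expression trees.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def evaluate(tree, features):
    """Evaluate a tree: a number, ("x", i) for feature i, or (op, left, right)."""
    if isinstance(tree, (int, float)):
        return float(tree)
    if tree[0] == "x":
        return features[tree[1]]
    op, left, right = tree
    return OPS[op](evaluate(left, features), evaluate(right, features))


def fitness(tree, samples, labels):
    """Share of training samples classified correctly (1 = fraud if output > 0)."""
    correct = sum((1 if evaluate(tree, x) > 0 else 0) == y
                  for x, y in zip(samples, labels))
    return correct / len(samples)


def get_subtree(tree, path):
    """Follow a path of child positions (1 = left, 2 = right) into the tree."""
    for i in path:
        tree = tree[i]
    return tree


def replace_subtree(tree, path, new):
    """Return a copy of the tree with the subtree at `path` replaced by `new`."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_subtree(tree[i], path[1:], new),) + tree[i + 1:]


def crossover(parent_a, path_a, parent_b, path_b):
    """Swap the fragments at the two crossover points, giving two offspring."""
    frag_a, frag_b = get_subtree(parent_a, path_a), get_subtree(parent_b, path_b)
    return (replace_subtree(parent_a, path_a, frag_b),
            replace_subtree(parent_b, path_b, frag_a))


rule = ("sub", ("x", 0), 0.5)            # "fraud if feature 0 exceeds 0.5"
samples = [[0.9], [0.1], [0.7], [0.2]]   # hypothetical normalized ratios
labels = [1, 0, 1, 1]
score = fitness(rule, samples, labels)

a = ("add", ("x", 0), 1.0)
b = ("mul", ("x", 1), ("sub", ("x", 0), 2.0))
child_1, child_2 = crossover(a, (2,), b, (2,))
```

In a full GP run the crossover points would be chosen at random and the parents selected according to their fitness, as in steps (b) to (e) above.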

The application of GP in pattern classification offers the following advantages.
1) GP is very flexible, which means it can be adapted to the needs of each particular problem.
2) GP can be employed on the data in its original form.
3) A priori knowledge about the distribution of the data is not required, since GP is free from data distribution assumptions.
4) GP can easily express unknown relationships among the data as mathematical expressions.
5) GP can be useful in preprocessing and postprocessing along with classification in order to enhance the classifier.
6) GP can be helpful in finding the majority of the discriminating features of a class in the training stage.

6. Experimental Results and Analysis
The three data mining methods discussed above have been implemented on the dataset and compared on the basis of sensitivity and specificity. The sensitivity of a method is the ratio of the number of fraudulent organisations correctly identified as fraudulent to the total number of actual fraudulent firms, whereas specificity is the ratio of the number of non-fraud firms identified as non-fraud to the total number of actual non-fraudulent companies (Table 8).

In this study, the CART model is constructed using SIPINA Research edition software, version 32 bit. The tree given below has been built using the whole sample as the training set, with a confidence level of 0.05.

Figure 1: CART
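During the growing phase described in Section 5.3.1, each node of a tree like the one in Figure 1 is obtained by searching for the best splitting variable and threshold. The sketch below shows that search for a single variable. It assumes Gini impurity as the split criterion, which is the usual choice for CART but is not stated in the text, and the function names and sample values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a set of binary labels (1 = fraud, 0 = non-fraud)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # share of fraud cases
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best split `value <= threshold`.

    Returns (None, impurity of the unsplit node) if no split improves on it.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical values of one ratio for four firms and their class labels.
threshold, impurity = best_split([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0])
```

Repeating this search over all candidate variables and recursing on the two resulting subsets yields the maximum tree, which is then pruned.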

The confusion matrix is given below (Table 3).

Table 3: Confusion Matrix for CART

Label            NF (Non Fraud)   F (Fraud)
NF (Non Fraud)   85               0
F (Fraud)        4                25

CART correctly classifies 96% of the cases. The method correctly classifies all the non-fraud cases (100%) and misclassifies only 4 fraud cases. The classification rate for fraud cases is 86%. The tree presented here uses the deposits and cash to current assets ratio as the first splitter. This ratio indicates how well the company converts its non-liquid assets into cash. At the second level of the tree, retained earnings / total assets and fixed assets / total assets have been used as splitters. Table 4 lists all the ratios used by the tree.

Table 4

S.No.  Financial Ratios / Items
1      Net profit/Total assets
2      Fixed assets/Total assets
3      Deposits and cash/Current assets
4      Working Capital / Total Assets
5      Sales / Total Assets
6      Retained earnings / Total Assets

The second classification technique, the Naive Bayesian Classifier, has been implemented using SIPINA Research edition software, version 32 bit. The method correctly classifies 89% of the cases. The confusion matrix is given below (Table 5).

Table 5: Confusion Matrix for Naive Bayesian Classifier

Label            NF (Non Fraud)   F (Fraud)
NF (Non Fraud)   83               2
F (Fraud)        10               19

Genetic programming has been implemented using the tool Discipulus, version 5.1. The dataset has been divided into training and validation data. The training dataset is used to train the model and the validation dataset is used exclusively for validation. 80% of the whole dataset is used for training, while 20% is used for validation. Since our dependent variable (target output) is binary, we select "hits then fitness" as the fitness function. Every single run of Discipulus has been set to terminate after 50 generations with no improvement in fitness. The confusion matrix for genetic programming is given in Table 6.

Table 6: Confusion Matrix for Genetic Programming

Label            NF (Non Fraud)   F (Fraud)
NF (Non Fraud)   84               1
F (Fraud)        13               16

From Table 7 we can observe the impact of the various input parameters on the model.

Table 7: Impact of input variables (Genetic Programming)

S.No.  Variable                                       Frequency  Average Impact  Maximum Impact
1      Debt                                           0.06       0.00000         0.00000
2      Inventory/Primary business income              0.35       22.52747        53.84615
3      Inventory/Total assets                         0.35       9.70696         20.87912
4      Net profit/Total assets                        0.06       2.19780         2.19780
5      Cash/Total assets                              0.29       3.84615         5.49451
6      Total debt/Total assets                        0.12       0.00000         0.00000
7      Fixed assets/Total assets                      -          0.00000         0.00000
8      Deposits and cash/Current assets               -          6.59341         6.59341
9      Working Capital / Total Assets                 -          0.00000         0.00000
10     Sales / Total Assets                           -          0.00000         0.00000
11     Net income / Fixed Assets                      -          7.69231         9.89011
12     Revenue / Total Assets                         -          9.01099         14.28571
13     EBIT                                           -          5.49451         5.49451
14     Z score                                        -          19.78022        19.78022
15     Accounts receivable/Primary business income    -          0.54945         1.09890
16     Primary business income/Total assets           -          -               -
17     Primary business income/Fixed assets           -          -               -
18     Capitals and reserves/Total debt               -          -               -
19     Gross profit/Primary business profit           -          -               -
20     Accounts Receivable / Sales                    -          -               -
21     Retained earnings / Total Assets               -          -               -
22     EBIT / Total Assets                            -          -               -

7. Conclusion
In this study, data mining methods of good repute is implemented on dataset collected from financial statements of 114 companies for classifying organizations as fraud or non fraud. We collected and compiled 52 financial variables / ratios. Then, one way ANOVA is used for finding informative variables on the basis of p value. Then three intelligent classification methods namely CART, Nave Bayesian Classifier and Genetic Programming are applied on 22 informative ratios. In order to have better reliability of the result, ten fold cross validation has been implemented throughout the study. All the three methods have been compared on the basis of sensitivity and specificity. CART produces best sensitivity and specificity as compared with other two methods. The accuracy rate of these methods can be further enhanced by using some qualitative information such as composition of administrative board along with financial ratios used in this research.

0.00 0.41 0.29 0.06 0.06 0.29



02.74 725

03.296 70



03.29 670

08.791 21


1 2



00.00 000 05.65 149

00.000 00 09.890 11



Kantardzi3 c M. (2002), Data Mining: Concepts, Models, Methods, and Algorithms, Wiley IEEE Press. 4 PriceWaterhouse&Coopers: Economic crime: People, culture and controls. The 4th Biennial Global Economic Crime Survey (2007), available at:
5 Association of Certified Fraud Examiners: 2006 ACFE Report to the nation on Occupational fraud and abuse (2006), Technical report, Association of Certified Fraud Examiners, USA, available at:



00.00 000 02.93 040 03.29 670

00.000 00 04.395 60 05.494 51





6 Beasley, M. (1996). An empirical analysis of the relation between board of director composition and financial statement fraud. The Accounting Review, 71(4), 443-466.

Table: 8 (Performance Matrix)

S.No.  Predictor                   Sensitivity (%)  Specificity (%)
1      CART                        86.2             100
2      Naive Bayesian Classifier   65.5             97.6
3      Genetic Programming
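The sensitivity and specificity figures can be recomputed from the confusion matrices in Tables 3, 5 and 6. The Python sketch below assumes that rows are actual classes and columns are predicted classes, which is consistent with the statement that CART misclassifies 4 fraud cases. The Genetic Programming figures it prints are computed here from Table 6 rather than quoted from Table 8.

```python
def metrics(tn, fp, fn, tp):
    """Accuracy, sensitivity and specificity (in percent) from the four
    confusion-matrix cells, treating fraud as the positive class."""
    total = tn + fp + fn + tp
    return {
        "accuracy": 100.0 * (tn + tp) / total,
        "sensitivity": 100.0 * tp / (tp + fn),  # fraud cases detected
        "specificity": 100.0 * tn / (tn + fp),  # non-fraud cases cleared
    }


# Cells in the order (NF->NF, NF->F, F->NF, F->F), read from Tables 3, 5, 6.
matrices = {
    "CART": (85, 0, 4, 25),
    "Naive Bayesian Classifier": (83, 2, 10, 19),
    "Genetic Programming": (84, 1, 13, 16),
}

results = {name: metrics(*cells) for name, cells in matrices.items()}
for name, m in results.items():
    print(f"{name}: sensitivity {m['sensitivity']:.1f}%, "
          f"specificity {m['specificity']:.1f}%, accuracy {m['accuracy']:.1f}%")
```

For CART and the Naive Bayesian Classifier the computed sensitivity and specificity agree with Table 8 (86.2 / 100 and 65.5 / 97.6).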

7 Green, B. P., & Choi, J. H. (1997). Assessing the risk of management fraud through neural-network technology. Auditing: A Journal of Practice and Theory, 16(1), 14-28.

8 Fanning, K., & Cogger, K. (1998). Neural network detection of management fraud using published financial data. International Journal of Intelligent Systems in Accounting, Finance & Management, 7(1), 21-24.



9 Efstathios Kirkos, Charalambos Spathis & Yannis Manolopoulos (2007). Data mining techniques for the detection of fraudulent financial statements. Expert Systems with Applications 32 (23) (2007) 995-1003.




10 C. Spathis, M. Doumpos, C. Zopounidis, Detecting falsified financial statements: a comparative study using multicriteria analysis and multivariate statistical techniques, European Accounting Review 11 (3) (2002) 509-535.
11 M. Cecchini, H. Aytug, G.J. Koehler, and P. Pathak. Detecting Management Fraud in Public Companies. ntFraudInPublicCompanies.pdf
12 S.-M. Huang, D.C. Yen, L.-W. Yang, J.-S. Hua, An investigation of Zipf's Law for fraud detection, Decision Support Systems 46 (1) (2008) 70-83.
13 Hoogs Bethany, Thomas Kiehl, Christina Lacomb and Deniz Senturk (2007). A Genetic Algorithm Approach to Detecting Temporal Patterns Indicative of Financial Statement Fraud, Intelligent Systems in Accounting, Finance and Management 2007; 15: 41-56, John Wiley & Sons, USA, available at:

Rajan Gupta obtained a master's degree in computer applications from the Department of Computer Science & Application, Guru Jambheshwar University, Hisar, Haryana, India, and a Master of Philosophy degree in Computer Science from Madurai Kamaraj University, Madurai, India. He is currently pursuing a doctorate in Computer Science in the Department of Computer Science & Application, Maharshi Dayanand University, Rohtak, Haryana, India.

14 M.J. Cerullo, V. Cerullo, Using neural networks to predict financial reporting fraud: Part 1, Computer Fraud & Security 5 (1999) 14-17.
15 E. Koskivaara, Artificial neural networks in auditing: state of the art, The ICFAI Journal of Audit Practice 1 (4) (2004) 12-33.
16 B. Busta, R. Weinberg, Using Benford's law and neural networks as a review procedure, Managerial Auditing Journal 13 (6) (1998) 356-366.
17 H.C. Koh, C.K. Low, Going concern prediction using data mining techniques, Managerial Auditing Journal 19 (3) (2004) 462-476.
18 Belinna Bai, Jerome Yen, Xiaoguang Yang, False Financial Statements: Characteristics of China listed companies and CART Detection Approach, International Journal of Information Technology and Decision Making, Vol. 7, No. 2 (2008), 339-359.
19 A. Deshmukh, L. Talluru, A rule-based fuzzy reasoning system for assessing the risk of management fraud, International Journal of Intelligent Systems in Accounting, Finance & Management 7 (4) (1998) 223-241.
20 Wei Zhou, G. Kapoor, Detecting evolutionary financial statement fraud, Decision Support Systems 50 (2011) 570-575.
21 P. Ravisankar, V. Ravi, G. Raghava Rao, I. Bose, Detection of financial statement fraud and feature selection using data mining techniques, Decision Support Systems 50 (2011) 491-500.
22 W.H. Beaver, Financial ratios as predictors of failure, Journal of Accounting Research 4 (1966) 71-111.
23 H. M. Schilit, Financial Shenanigans (McGraw-Hill, Inc., New York, 1993).
24 C. Fei, The performances of four classes of listed companies are incredible (in Chinese), Hunan Daily (28-Sep-01).
25 E.I. Altman, Financial ratios, discriminant analysis and prediction of corporate bankruptcy, The Journal of Finance 23 (4) (1968) 589-609.
26 Stice J., Albrecht S. and Brown L., (1991), Lessons to be learned-ZZZZBEST, Regina, and Lincoln Savings, The CPA Journal, April, pp. 52-53.
27 W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone, Genetic Programming: An Introduction; On the Automatic Evolution of Computer Programs and its Applications. San Mateo, CA/Heidelberg, Germany: Morgan Kaufmann/dpunkt.verlag, 1998.
28 J. H. Holland, Adaptation in Natural and Artificial Systems. Ann Arbor, MI: Univ. of Michigan Press, 1975.
29 K.M. Faraoun, A. Boukelif, Genetic programming approach for multi-category pattern classification applied to network intrusion detection, International Journal of Computational Intelligence and Applications 6 (1) (2006) 77-99.

Dr Nasib S. Gill obtained a doctorate in computer science and carried out post-doctoral research in Computer Science at Brunel University, U.K. He is currently working as Professor and Head of the Department of Computer Science and Application, Maharshi Dayanand University, Rohtak, Haryana, India. He has more than 22 years of teaching and 20 years of research experience. His areas of interest include software metrics, component based metrics, testing, reusability, Data Mining and Data Warehousing, NLP, AOSD, and Information and Network Security.
