Johan Perols
Associate Professor of Accounting
University of San Diego
jperols@sandiego.edu
Robert Bowen
Distinguished Professor of Accounting
University of San Diego
rbowen@sandiego.edu
Carsten Zimmermann
Associate Professor of Management
University of San Diego
zimmermann@sandiego.edu
Basamba Samba
RWTH Aachen University
basamba.samba@rwth-aachen.de
April 2016
Abstract: Developing models to detect financial statement fraud involves challenges related
to (i) the rarity of fraud observations, (ii) the relative abundance of explanatory variables
identified in the prior literature, and (iii) the broad underlying definition of fraud. Following
the emerging data analytics literature, we introduce and systematically evaluate three
methods to address these challenges. Results from evaluating actual cases of financial
statement fraud suggest that two of these methods improve fraud prediction performance by
approximately ten percent relative to the best current techniques. Improved fraud prediction
can result in meaningful benefits, such as improving the ability of the SEC to detect
fraudulent filings and improving audit firms' client portfolio decisions.
Key words: Financial statement fraud, Data analytics, Fraud prediction, Risk
assessment, Data rarity, Data imbalance.
Data availability: Data are available from sources identified in the text.
percent of annual revenues specifically to financial statement fraud (ACFE 2014). Further, when
resources are misallocated because of misleading financial data, fraud can harm the efficiency of
capital, labor, and product markets. Financial statement fraud (henceforth fraud) also increases
business risk. For example, audit firms can face lawsuits, reputational costs, and loss of clients;
investors and banks are more likely to make suboptimal investment and loan decisions.
Data analytics is an important emerging field in both academic research (e.g., Agarwal and
Dhar 2014; Chen, Chiang, and Storey 2012) and in practice (e.g., Brown, Chui, and Manyika
2011; LaValle, Lesser, Shockley, Hopkins, and Kruschwitz 2013).1 In the fraud context, data
analytics can, for example, be used to create fraud prediction models that help (i) auditors
improve client portfolio management and audit planning decisions and (ii) regulators and other
oversight agencies identify firms for potential fraud investigation (e.g., SEC 2015; Walter 2013).
However, the usefulness of data analytics in fraud prediction is hindered by three challenges.
First, fraud prediction is a needle in a haystack problem. That is, the relative rarity of fraud
firms compared to non-fraud2 control firms (Bell and Carcello 2000) makes fraud prediction
difficult (Perols 2011). Second, fraud prediction is complicated by the curse of data
dimensionality (Bellman 1961). The rarity of fraud observations relative to the large number of
explanatory variables identified in the fraud literature (Whiting, Hansen, McDonald, Albrecht,
1 Data analytics refers to techniques that are grounded in data mining (e.g., decision trees, artificial neural networks, and support
vector machines) and statistics (e.g., ANOVA, regression analysis, and logistic regressions) (Chen et al. 2012). Data analytics
draws from statistics, artificial intelligence, computer science, and database research. It is related to big data in that it provides
tools that enable the analysis of large datasets. Data analytics is typically focused on prediction as opposed to explanation.
2 We use the term fraud as opposed to other terminology, such as material misstatements (Dechow, Ge, Larson, and Sloan 2011)
or misreporting. The primary difference between fraud and misstatements is that fraud is intentional while misstatements can be
either intentional or errors. Further, we use the term non-fraud firms to describe all firms for which fraud has not been detected.
This primarily includes firms that have not committed fraud, but also includes undetected cases of fraud. To the extent that
undetected fraud exists in our data, noise is introduced. This noise reduces the effectiveness of all prediction models, and
methods that address this noise might further improve fraud prediction. However, this noise is not likely to bias performance
comparisons among prediction models that use the same data.
and Albrecht 2012) can result in over-fitted prediction models that perform poorly when
predicting new observations. Third, prior research generally treats all frauds as homogeneous
events. This can make fraud prediction more difficult because prediction models have to detect
patterns that are common across different fraud types (e.g., revenue vs. expense fraud).
While prior fraud detection research enhances our general understanding of fraud indicators
and prediction methods, this research rarely addresses these problems explicitly. With a primary focus on prediction, we introduce and systematically evaluate three methods grounded in data analytics research.3
other settings characterized by data rarity, such as predicting credit card fraud (e.g., Chan and
Stolfo 1998). The first method, Multi-subset Observation Undersampling (OU), addresses the
imbalance between the low number of fraud observations relative to the number of non-fraud
observations by creating multiple subsets of the original dataset that each contain all fraud
observations and different random subsamples of non-fraud observations. The second method,
Multi-subset Variable Undersampling (VU), addresses the imbalance between the low number of
fraud observations relative to the number of explanatory variables identified in the fraud literature by creating multiple subsets that each contain all observations but different random subsamples of the explanatory variables.
The third method, VU partitioned by type of fraud (PVU), is a variation of the second method
that addresses issues associated with treating all fraud cases as homogeneous events. Rather than
randomly selecting variables, we instead use our a priori knowledge to partition the variables
into subsets based on their relation to specific types of fraud (e.g., revenue vs. expense fraud).
We use a dataset with 51 fraud firms, 15,934 non-fraud firm years, and 109 explanatory
variables from prior research. We then analyze over 10,000 prediction models to systematically
3We evaluate our results on out-of-sample data and thus perform predictive modeling. To clearly delineate our work from
explanatory models, we refer to our models as prediction models throughout the paper.
evaluate how to best implement these methods, e.g., how many data subsets to use in OU. In addition, we compare the performance of these methods to
benchmarks that represent the current standard in the literature, e.g., model 2 in Dechow et al.
(2011) and simple undersampling as used in Perols (2011). To avoid biasing the results, we
evaluate prediction performance using the prediction models' probability predictions on hold-out samples.
Results indicate that including additional data subsets (up to approximately 12 subsets)
increases OU fraud prediction performance, i.e., additional subsets after 12 do not appear to improve performance further.
While results indicate that VU also has the potential to improve fraud prediction, the
performance of this method is highly dependent on the specific variables selected in the various subsets. However, performance improves consistently when we use a priori knowledge to partition the independent variables into different subsets based on the type of fraud they are likely to predict,
e.g., revenue or expense fraud. This method, i.e., PVU, improves fraud prediction performance
by 9.6 percent relative to the best performing VU benchmark. Additional analyses also show
that performance can be further improved by combining OU and PVU, but only under certain conditions.
Our paper makes at least five important contributions. First, by introducing and
systematically evaluating three new methods and showing that two of these methods improve fraud prediction performance, we contribute to
4 We follow recent fraud data analytics research (e.g., Cecchini et al. 2010) and findings in Perols (2011) and implement all
prediction models using support vector machines. Support vector machines determine how to separate fraud firms from non-
fraud firms by finding the hyperplane that provides the maximum separation in the training data between fraud and non-fraud
firms. In additional analyses we also use logistic regression and bootstrap aggregation to examine the robustness of our results.
research that focuses on improving the performance of fraud prediction models. The
performance improvements from OU and PVU are large relative to other approaches for
improving prediction performance, e.g., (i) a 0.9 percent performance advantage in Dechow et al.
(2011) when two additional significant independent variables are added to their initial model and
(ii) a 2.2 percent improvement in Price, Sharp, and Wood (2011), when comparing Audit
Integrity's Accounting and Governance Risk measure to Dechow et al. (2011) model 2.5
Second, the finding that OU significantly improves prediction performance has important
methodological implications for research that evaluates the value of new explanatory variables.
This research can potentially benefit from applying OU to ensure that (i) results are robust across
different subsamples and (ii) new variables provide incremental predictive value to models that include previously identified variables.
Third, we show that the ability of VU to predict fraud improves consistently only when we
recognize that not all frauds are alike and subdivide the general fraud problem into types of
fraud. The importance of this approach likely extends beyond variable undersampling. For
example, future research could reorganize or design new fraud variables to predict a specific type of fraud.
Fourth, OU and PVU can be extended to address rarity and data dimensionality problems that arise in other research settings characterized by rare events.
5 Dechow et al. (2011) do not report predictive performance and the 0.9 percent difference is based on a separate analysis that we
performed using the two models in their paper (Model 1 and Model 2). This analysis uses the same procedures used in our
material misstatement analysis described in Section IV. Price et al. (2011) compare Audit Integrity's Accounting and
Governance Risk measure, which is considered the gold standard in commercial risk measures, to Dechow et al. (2011) Model 1
using material misstatement data. Based on their results, we calculate a 3.16 percent fraud prediction performance improvement
of the commercial measure to model 1. This implies a 2.24 percent improvement over Dechow et al. (2011) Model 2, which we
include as one of our benchmark models.
Finally, the introduction and evaluation of these methods makes an important contribution to
practice. Better prediction models can, for example, help the SEC and external auditors improve
their identification of potentially fraudulent accounting practices (Walter 2013; SEC 2015).
The remainder of the paper is organized as follows. Section II summarizes the fraud
literature, discusses data rarity, and describes how methods drawn from the data analytics
literature can be applied to fraud prediction. Section III describes the data, performance
measure, and experimental design. Section IV provides results, and section V concludes.
One stream of prior research identifies explanatory variables that can be used to predict fraud. This research includes testing fraud hypotheses grounded in the
earnings management and corporate governance literatures (e.g., Beasley 1996; Dechow, Sloan,
and Sweeney 1996; Summers and Sweeney 1998; Beneish 1999; Sharma 2004; Erickson
Hanlon, and Maydew 2006; Lennox and Pittman 2010; Feng, Ge, Luo, and Shevlin 2011; Perols
and Lougee 2011; Caskey and Hanlon 2012; Armstrong, Larcker, Ormazabal, and Taylor 2013;
Markelevich and Rosner 2013). This research also evaluates the significance of a variety of
other potential explanatory variables, such as red flags emphasized in auditing standards,
discretionary accruals measures, and non-financial indicators (e.g., Loebbecke, Eining, and
Willingham 1989; Beneish 1997; Lee, Ingram, and Howard 1999; Apostolou, Hassell, and
Webber 2000; Kaminski, Wetzel, and Guan 2004; Ettredge, Sun, Lee, and Anandarajan 2008;
Jones, Krishnan, and Melendrez 2008; Brazel, Jones, and Zimbelman 2009; Dechow et al. 2011).
Varian (2014) highlights the importance of the emerging field of data analytics. He suggests
that researchers using traditional econometric methods should consider adapting recent advances
from this field. A second stream of financial statement fraud prediction research follows this
suggestion and applies developments in data analytics research to improve fraud prediction.
Early research within this stream concludes that artificial neural networks perform well relative
to discriminant analysis and logistic regressions (e.g., Green and Choi 1997; Fanning and Cogger
1998; Lin, Hwang, and Becker 2003). More recent research in this stream examines additional
classification algorithms, such as support vector machines, decision trees, and adaptive learning
methods (e.g., Cecchini et al. 2010; Perols 2011; Abbasi, Albrecht, Vance, and Hansen 2012;
Gupta and Gill 2012; Whiting et al. 2012) and text mining methods (e.g., Glancy and Yadav
2011; Humpherys, Moffitt, Burns, Burgoon, and Felix 2011; Goel and Gangolly 2012; Larcker and Zakolyukina 2012).
Data rarity. Prior research applies methods that address data rarity in settings such as credit card fraud (e.g., Chan and Stolfo 1998), auto insurance fraud (Phua, Alahakoon, and Lee 2004), bankruptcy (Shin, Lee, and
Kim 2005), and financial statement fraud (Whiting et al. 2012). Classification algorithms (e.g.,
logistic regression) have inherent difficulties in processing rarity (Weiss 2004), and data rarity is
regarded as one of the primary challenges in data analytics research (Yang and Wu 2006). Data
rarity is particularly severe in financial statement fraud detection because financial statement
fraud is characterized by both (i) relative rarity (a.k.a., the needle in the haystack problem) and
(ii) absolute rarity combined with an abundance of explanatory variables proposed in the fraud literature.
The needle in a haystack problem. Relative rarity occurs when detected fraud observations
are a relatively small percentage of the majority non-fraud observations, e.g., only approximately
0.6 percent of all audited U.S. financial reports have been identified as fraudulent (Bell and
Carcello 2000). Relative rarity is a challenge since it forces classification algorithms to consider
a large number of potential patterns without having enough fraud observations to determine
which patterns are driven by noisy data. This increases the risk that identified patterns are based
on spurious relations in a particular sample, resulting in increased false positive rates for a given
false negative rate when the developed model is applied to a new sample (Weiss 2004). Further, classification algorithms can achieve high overall accuracy simply by classifying observations from the majority class correctly (e.g., Maloof 2003). To illustrate, if 99 percent of
all observations are non-fraudulent, a prediction model identifying all observations as non-
fraudulent achieves an overall accuracy of 99 percent by correctly classifying 100 percent of the non-fraud observations while failing to identify a single fraud observation.
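This accuracy paradox is easy to verify numerically. The sketch below (hypothetical counts, not drawn from the paper's sample) computes the accuracy and fraud recall of a model that labels every observation non-fraudulent:

```python
def trivial_model_performance(n_fraud, n_non_fraud):
    """Accuracy and fraud recall of a model that predicts 'non-fraud'
    for every observation."""
    accuracy = n_non_fraud / (n_fraud + n_non_fraud)  # every non-fraud case is correct
    fraud_recall = 0.0  # no fraud observation is ever flagged
    return accuracy, fraud_recall

# With 1 percent fraud, the trivial model still reaches 99 percent accuracy.
acc, recall = trivial_model_performance(n_fraud=10, n_non_fraud=990)
```

High overall accuracy is therefore uninformative in this setting, which motivates the cost-sensitive evaluation used later in the paper.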
Perols (2011) takes an initial step towards addressing the relative rarity problem in a fraud
context by examining the performance of classification algorithms after undersampling the non-
fraud observations. However, while the simple undersampling method used in Perols (2011),
i.e., a method that simply removes non-fraud observations from the sample, generates more
balanced datasets, it also discards potentially useful non-fraud observations. We, therefore,
introduce a more sophisticated undersampling method that does not discard non-fraud observations, Multi-subset Observation Undersampling (OU), which builds on Chan and Stolfo (1998), to address relative rarity. OU uses multiple data subsets, where each
subset contains all fraud observations but different subsamples of non-fraud observations. We
specifically select OU because prior research shows that it performs well in other settings
constrained by relative rarity, such as predicting credit card fraud (e.g., Chan and Stolfo 1998).
OU is also effective compared to (i) other undersampling and oversampling methods (Nguyen,
Cooper, and Kamei 2012) and (ii) various types of bootstrap aggregation, boosting, and hybrid
ensemble data rarity methods used in the data analytics literature (Galar, Fernández,
Barrenechea, Bustince, and Herrera 2012). OU is conjectured to improve performance (e.g.,
Nguyen et al. 2012) not only because it improves the balance between minority (fraud) and
majority (non-fraud) observations, but also because it more efficiently incorporates potentially
useful majority cases. By creating multiple prediction models that are based on different non-
overlapping subsets of majority observations, each prediction model is likely to differ somewhat
from the other prediction models. Importantly, patterns that are predictive of fraud are likely to
be present in multiple subsets. However, spurious patterns that exist by random chance in
individual subsets are unlikely to also exist in other subsets. By using a combination of these
models rather than a model built using a single data set, potentially important patterns are more
likely to be identified and estimated accurately (assuming that each model has a slightly different
estimate of the pattern). Additionally, when individual models are combined, spurious patterns
are likely to be discarded (or given less weight). This decreases the risk of overfitting, i.e., that
the prediction model has good in-sample performance but does not generalize to new
observations.
When applied in the fraud setting, OU first preprocesses the model building data by dividing
the data into multiple subsets, where each subset includes all fraud observations and a random
sample of non-fraud observations selected without replacement (Figure 1). Thus, all fraud
observations are included in all subsets while each non-fraud observation is part of at most one
subset. Each subset is then used in combination with a classification algorithm to build a fraud
prediction model.6 To perform fraud prediction, each prediction model is then applied to out-of-
sample data. For each observation in the out-of-sample data, the resulting model predictions are
averaged into an overall fraud probability prediction for the observation.7 For example, if OU is
implemented with 12 subsets, the method first creates 12 subsets as described above. Each
subset is then used to build a prediction model, for a total of 12 prediction models. The
prediction models are then applied to out-of-sample data, resulting in 12 fraud probability
predictions for each observation in the out-of-sample data. The probability predictions for each
observation are then combined by taking the average of the 12 probability predictions. Section III provides additional implementation details.
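As a concrete sketch, the OU preprocessing and prediction-combination steps described above might look as follows in Python (the function names and the use of plain observation lists are illustrative assumptions, not the authors' implementation):

```python
import random

def make_ou_subsets(fraud, non_fraud, n_subsets, fraud_ratio=0.20, seed=0):
    """Create OU subsets: each subset keeps every fraud observation plus a
    disjoint random sample of non-fraud observations, sized so that fraud
    observations make up `fraud_ratio` of the subset."""
    rng = random.Random(seed)
    pool = list(non_fraud)
    rng.shuffle(pool)  # random sampling without replacement across subsets
    # non-fraud observations needed per subset for the target fraud ratio
    n_non_fraud = round(len(fraud) * (1 - fraud_ratio) / fraud_ratio)
    subsets = []
    for i in range(n_subsets):
        sample = pool[i * n_non_fraud:(i + 1) * n_non_fraud]
        subsets.append(list(fraud) + sample)  # all fraud + disjoint non-fraud slice
    return subsets

def combine_predictions(probabilities):
    """Average the per-model fraud probability predictions for one observation."""
    return sum(probabilities) / len(probabilities)
```

One prediction model would then be fit on each subset, and `combine_predictions` applied to the models' out-of-sample probability outputs for each observation.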
The curse of data dimensionality problem. According to the curse of data dimensionality,
data requirements increase exponentially with the number of explanatory variables in the dataset
(Bellman 1961).8 This is a potential problem in fraud prediction because the number of known
fraud cases is small relative to the extensive number of independent variables identified in prior
fraud research. Hence, only a small number of fraud observations are available to identify
patterns among the large number of independent variables and fraud. This may result in over-
fitted prediction models that perform poorly when predicting new observations.
By using stepwise backward variable selection to develop their model, Dechow et al. (2011) partially address the problem of data dimensionality in the fraud
context. However, while stepwise backward variable selection is designed to retain explanatory
7 This method has been found to perform well compared to more complex combiner methods (Duin and Tax 2000). Other
combiner methods, such as a Dempster-Shafer Fusion method, may be able to further improve the effectiveness of our proposed
methods; we encourage future research to examine this and other methods in more detail.
8 More specifically, when the number of explanatory variables increases, data used to fit models are spread across an increasingly
large feature space that grows exponentially with each additional explanatory variable, e.g., with one explanatory variable the
feature space is a line, with two variables the feature space is a plane, with three variables the feature space is a three-dimensional
space, etc. For example, with a dataset containing 50 fraud and 50 non-fraud observations and only one continuous explanatory
variable, the 100 observations are positioned on a line. If another variable is added, these same 100 observations are spread
across a two dimensional space. If a third variable is added, the 100 observations are spread within a three dimensional space.
For every variable that is added, the observations cover a smaller portion of the feature space. Thus, to cover a given percentage
of the feature space, the number of required observations would have to increase exponentially with the number of variables.
variables with the highest significance levels, it may discard potentially useful variables. We
build on Dechow et al. (2011) and introduce a new method that attempts to address the curse of data dimensionality.
To reduce the imbalance between minority fraud observations and the number of variables
identified in the literature to predict fraud, we design a new data rarity method, Multi-subset
Variable Undersampling (VU).9 VU randomly splits the set of explanatory variables without
replacement into different subsets (Figure 2). Each subset contains the same observations, but
different non-overlapping sets of explanatory variables. As with OU, each subset is then used in
combination with a classification algorithm to build a fraud prediction model that is applied to
out-of-sample data. For each observation in the out-of-sample data, the resulting model
predictions are then combined into an overall fraud probability prediction for the observation.
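A minimal sketch of the VU variable-splitting step (the function name and round-robin assignment after shuffling are our illustrative choices):

```python
import random

def make_vu_subsets(variables, n_subsets, seed=0):
    """Randomly split the explanatory variables, without replacement, into
    `n_subsets` non-overlapping subsets of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(variables)
    rng.shuffle(shuffled)
    # deal the shuffled variables round-robin so subset sizes differ by at most one
    return [shuffled[i::n_subsets] for i in range(n_subsets)]
```

Each returned subset would be paired with the full set of observations to build one prediction model, with the models' probability outputs averaged as in OU.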
The broad fraud definition problem. Many financial statement fraud variables used in the literature are inherently related to a specific type
of fraud. For example, abnormal revenue growth is a potential measure of revenue fraud while
9 In an attempt to further mitigate problems associated with having a small number of fraud observations to learn from, we
examine the usefulness of an observation oversampling method named SMOTE in fraud prediction. SMOTE was developed by
Chawla, Bowyer, Hall, and Kegelmeyer (2002) and performs well across multiple classification problems (e.g., Chawla et al.
2002; He and Garcia 2009). We perform two experiments to investigate (i) the number of fraud observations to use when
creating a new synthetic fraud observation and (ii) the oversampling ratio to use, which determines how many additional
synthetic fraud observations are generated. In the first experiment, untabulated results indicate that SMOTE only performs
significantly better than the benchmark (simple oversampling, i.e., duplication of fraud observations in the training data) in one
out of 27 comparisons. In the second experiment, we again fail to find a significant performance advantage for SMOTE relative
to simple oversampling. Finally, we implement SMOTE after partitioning the data on fraud types and find that this
implementation does not statistically differ from the original implementation of SMOTE. Based on the above results, we cannot
recommend SMOTE to address data rarity in the fraud context.
an abnormally low amount of allowance for doubtful accounts is a potential measure of expense
fraud. Although these variables may provide useful information about a specific type of fraud,
they are less likely to detect multiple types of fraud.10 When different fraud types are combined
into a binary classification problem, variables that are helpful when detecting a specific type of
fraud may be discarded if they do not do well in predicting fraud in general. For example, a
variable that provides a good signal about expense fraud but provides no useful information
about other types of fraud will only provide value when classifying expense fraud cases, which
in our sample is only about ten percent of the fraud cases. Additionally, by combining different
fraud types into a binary classification problem, the classification algorithms focus on finding
patterns common to all fraud types. Given heterogeneity among different fraud types, such common patterns may be weak or difficult to detect.
To reduce the potential negative effects associated with combining different fraud types into a single binary classification problem, we introduce a variation of VU that partitions the variables based on different fraud types (PVU).11 When implementing PVU, we place all variables that appear
to predict a specific fraud type into a separate variable subset. Variables that can be used to
predict multiple fraud types are placed in multiple subsets. This creates four subsets of variables
relating to revenue, expenses, assets, and liabilities (each subset is also restricted to fraud
observations that represent the associated fraud type). We also include three additional variable
subsets, because some fraud variables measure general attributes of fraud, such as incentives,
opportunities, or the aggregate effect of fraud. The first of these subsets includes all variables
10 Since accounting information is recorded using a double entry system, specific variables may capture the effect of multiple
fraud types.
11 Additionally, the use of multiple VU variable subsets that focus on different fraud types increases the likelihood that different
prediction models capture different fraud patterns, which improves diversity among the prediction models. Prediction model
diversity is important for performance when combining multiple models (Kittler et al. 1998). We do not modify OU based on
different fraud types because OU only undersamples the non-fraud data and does not preprocess the fraud data.
not categorized as a specific fraud type variable. The second subset includes the variables used
in Dechow et al. (2011). These variables are included for their utility in binary fraud prediction.
The third subset includes all variables and is created to allow the classifiers to find patterns
among both fraud type specific and non-fraud type specific variables.
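Under our reading, the PVU partitioning can be sketched as follows (the mapping from variable names to fraud types is hypothetical; the paper builds it from a priori knowledge of each variable):

```python
def make_pvu_subsets(variable_types, dechow_variables):
    """Partition variables by the fraud type(s) they are believed to predict.
    `variable_types` maps each variable name to a set of fraud types
    (an empty set marks a general, uncategorized variable). Variables
    related to several fraud types are placed in several subsets."""
    fraud_types = ("revenue", "expense", "asset", "liability")
    subsets = {t: [v for v, types in variable_types.items() if t in types]
               for t in fraud_types}
    # three additional subsets for general fraud attributes
    subsets["general"] = [v for v, types in variable_types.items() if not types]
    subsets["dechow"] = list(dechow_variables)  # Dechow et al. (2011) variables
    subsets["all"] = list(variable_types)       # one subset with every variable
    return subsets
```

Here a variable such as abnormal revenue growth would map to `{"revenue"}`, while an accrual-based variable affected by several fraud types could map to `{"revenue", "expense"}` and therefore appear in both subsets.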
Data
Our fraud sample consists of the 51 fraud firms12 identified in Perols (2011). We only include one firm year for each fraud observation that corresponds to the first
year that the Accounting and Auditing Enforcement Release (AAER) alleges that fraud was
committed. We do not include previous years as the fraud may have predated the reported first
fraud year. We do not include multiple fraud years for each fraud firm to prevent a single fraud
firm from being included in both the model building dataset and the out-of-sample model
evaluation dataset.
Perols (2011) identifies fraud firms in SEC investigations reported in AAERs between 1998
and 2005 that explicitly reference Section 10(b) Rule 10b-5 (Beasley 1996) or contain
descriptions of fraud. This fraud firm dataset excludes: financial firms; firms without the first
fraud year specified in the SEC release; non-annual financial statement fraud; foreign firms;
firms filing a 10-KSB or undergoing an IPO; and firms with missing Compustat (financial statement data), Compact D/SEC
(executive and director names, titles and company holdings), or I/B/E/S data (one-year-ahead
analyst earnings per share forecasts and actual earnings per share) in relevant years.13 Randomly
12 This sample size of 51 fraud firms is comparable to other fraud studies (e.g., Beasley 1996, Erickson et al. 2006; Brazel et al.
2009). Other research (e.g., Dechow et al. 2011) uses AAERs to create samples focused on material misstatements. Material
misstatement data include firms with AAERs that explicitly allege fraud as well as other firms that describe a material
misstatement without explicitly alleging fraud. While such samples are larger, they do not necessarily focus on fraud.
13 Since we add additional variables to the Perols (2011) dataset, some of the variables have missing values. Missing values are
replaced by global means/modes. The effect of this is a reduction in the utility of variables that have many missing values.
selected Compustat non-fraud firms (excluding observations following the applicable criteria
specified for fraud firms above) are added to the fraud firm dataset to create a sample with 0.3
percent fraud firms, which allows us to examine the robustness of the results around best
estimates of prior fraud probability, i.e., 0.6 percent (Bell and Carcello 2000), in the population
of interest. We include explanatory variables (summarized in Appendix A) that have been used
in recent literature to predict fraud or material misstatements (Cecchini et al. 2010; Dechow et al.
2011; Perols 2011). More specifically, we include all variables from Perols (2011) and all
variables from the final Dechow et al. (2011) model that can be calculated using Compustat data.
Following and extending Cecchini et al. (2010), we also include 48 variables measuring levels
and changes in levels, percentage changes in levels, and abnormal percentage changes of financial statement account balances.
Experimental Design
Overview of the experiments. As summarized in Table 1, we perform multiple experiments
to (i) determine how to best implement OU and VU (e.g., how many subsets to use) and (ii)
evaluate their relative performance compared to various benchmarks. The primary objective in
these experiments is to detect trends that indicate how to implement the methods in future
research. By detecting clear trends between the number of subsets and predictive ability rather
than selecting implementations that happen to be the most predictive, we reduce the risk that we
recommend implementations that perform well on our test data, but do not generalize well.
In experiment 1, we use OU to create observation subsets that contain all fraud observations
and random samples of non-fraud observations that yield 20 percent fraud observations per
subset. In an evaluation of simple undersampling ratios, Perols (2011) finds that this ratio
provides relatively good performance. We then evaluate how many observation subsets to
include when implementing OU. In experiment 2a, we use VU to randomly divide the variables
used in prior fraud prediction research into 20 subsets. We then assess how many variable
subsets to include when implementing VU. In experiment 2b, we examine whether the number
of variables included in each subset affects performance by dividing the total number of
variables into subsets as follows: one subset with all variables, two subsets each with one-half of
the variables, four subsets each with one-quarter, six subsets each with one-sixth, eight subsets
each with one-eighth, etc. We then evaluate how many variables per subset to include when implementing VU. We then evaluate PVU, in which the independent variables are grouped together based on their relation to specific types of fraud.
Because simple undersampling is the primary undersampling method previously applied in the fraud detection literature to reduce the imbalance between the number of fraud versus non-
fraud observations, we use simple undersampling as a benchmark (Perols 2011) for OU.14 This
benchmark randomly removes non-fraud observations from the sample to generate a more
balanced model-building sample. OU and the OU benchmarks use all variables (as independent
variable reduction is examined in the VU analysis) and are implemented using support vector
machines, following recent fraud data analytics research (e.g., Cecchini et al. 2010; Perols 2011). VU is a new variable reduction method that has the potential to improve the performance over currently used variable selection
methods. As a baseline we include a benchmark (the Dechow benchmark) that uses the
independent variables from model 2 in Dechow et al. (2011). We also use (i) a benchmark that
randomly selects variables and (ii) a benchmark that includes all variables (the all variables
14We also used no undersampling as an additional benchmark. However, because simple undersampling performs better than
no undersampling by 7.3 percent, we adopted simple undersampling as the benchmark.
benchmark) where data dimensionality is not reduced. The benchmark that randomly selects
variables performs better than both the Dechow benchmark and the all variables benchmark.15
Thus, we report our VU (and PVU) results using the benchmark that randomly selects variables.
VU, PVU, and their benchmarks use all observations (observation undersampling is examined in the OU analysis).
Out-of-sample performance measures are preferable to in-sample performance measures because they provide a more realistic measure of prediction
performance than measures commonly used in economics (Varian 2014: 7), and cross-
validation is particularly useful. We use stratified 10-fold cross-validation, where 10 folds (i.e.,
subsamples of observations) are generated using random sampling without replacement. The 10
folds rotate between being used for training and testing the prediction models. In each rotation,
nine folds are used for training (i.e., model building) and one fold is used for testing (i.e., model
evaluation). For example, in the first round, folds one through nine are used for training and
fold 10 is used for testing; in round two, folds one through eight and fold 10 are used for
training, and fold nine is used for testing. By using stratified cross-validation, we ensure that
the ratio of fraud to non-fraud observations is kept consistent across the training and test sets in
the different rounds. With a total of 51 fraud firms in the sample, 45 or 46 fraud firms are used
for model building and five or six fraud firms are used for model evaluation in each cross-
validation round. In our experiments, the OU and VU methods are only applied to training data.
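The stratified split described above can be sketched in a few lines of Python (an illustrative reimplementation; the function name and the round-robin assignment of shuffled indices are our assumptions, not the authors' code):

```python
import random

def stratified_kfold(labels, k=10, seed=0):
    """Split observation indices into k folds while preserving the
    fraud/non-fraud ratio. `labels` holds 1 (fraud) / 0 (non-fraud)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)  # random sampling without replacement
        for j, i in enumerate(idx):
            folds[j % k].append(i)  # deal indices round-robin across folds
    return folds

# Example: 51 fraud and 449 non-fraud observations, as in the sample
labels = [1] * 51 + [0] * 449
folds = stratified_kfold(labels, k=10)
# Each fold then contains 5 or 6 fraud firms, so each training set
# retains 45 or 46 fraud firms.
fraud_per_fold = [sum(labels[i] for i in f) for f in folds]
```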
Prediction performance metric. Following prior financial statement fraud research (e.g.,
Beneish 1997; Kwon, Pastena, and Park 2000; Lin et al. 2003), we use expected cost of
misclassification (ECM) as the preferred performance metric. ECM allows the researcher to
15 The Dechow benchmark performed 0.02 percent better than the all variables benchmark, and the random variable selection
benchmark performed 3.87 percent better than the Dechow benchmark.
vary two important parameters in evaluating the prediction models' performance on out-of-sample data: (i) estimated percentage of fraud firms in the population of interest and (ii)
estimated ratio of the cost of a false negative to the cost of a false positive in the population of
interest. Including both parameters is important in settings such as fraud prediction that are
characterized by relative rarity and uneven misclassification costs. Given specific classification results, ECM is calculated as:
ECM = CFN × P(Fraud) × (nFN / nP) + CFP × P(Non-Fraud) × (nFP / nN),
where CFP and CFN are estimates of the cost of false positive and false negative classifications,
respectively, deflated by the lower of CFP or CFN; P(Fraud) and P(Non-Fraud) are estimates of
prior probability of fraud and non-fraud, respectively; nFP and nFN are the number of false
positive and false negative classifications, respectively, on the cross-validation test data;16 and nP
and nN are the number of fraud and non-fraud observations, respectively, in the cross-validation
test data. Bayley and Taylor (2007) estimate that actual cost ratios (FN to FP cost) average
between 20:1 and 40:1, while Bell and Carcello (2000) estimate that approximately 0.6 percent
of all firm years represent detected fraud. Thus, in experiments that compare model prediction performance at best estimates of prior fraud probability and cost ratios, we calculate ECM at a cost ratio of 30:1 and a prior fraud probability of 0.6 percent (together with the prediction models' actual false positive and false negative rates). The goal of the prediction models is to minimize ECM.
16 Following prior research (e.g., Beneish 1997; Feroz, Kwon, Pastena, and Park 2000; Lin et al. 2003), nFP and nFN are obtained
using optimal fraud classification thresholds (e.g., probability cutoffs for classifying an observation as fraud or non-fraud) for
each combination of prior fraud probability and cost ratio. These optima are established by examining ECM scores using all
unique fraud probability predictions as potential thresholds.
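Under these definitions, ECM can be computed as in the following Python sketch (variable names and the example counts are ours; the deflation convention assumes false negatives are the costlier error, as in the text):

```python
def expected_cost_of_misclassification(n_fp, n_fn, n_pos, n_neg,
                                       p_fraud=0.006, cost_ratio=30.0):
    """ECM as defined in the text: both costs are deflated by the smaller
    cost, so CFP = 1 and CFN = cost_ratio when missing a fraud firm is
    the costlier error. Illustrative only, not the authors' code."""
    c_fn = cost_ratio  # cost of a false negative (missed fraud firm)
    c_fp = 1.0         # cost of a false positive (wrongly flagged firm)
    p_non_fraud = 1.0 - p_fraud
    return (c_fn * p_fraud * (n_fn / n_pos)
            + c_fp * p_non_fraud * (n_fp / n_neg))

# A model that misses 2 of 5 fraud firms and flags 40 of 495 non-fraud
# firms in a test fold:
ecm = expected_cost_of_misclassification(n_fp=40, n_fn=2, n_pos=5, n_neg=495)
```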
The experimental procedure is as follows:
1) The full sample is first separated into model-building data (a.k.a., training data) and model-evaluation data (a.k.a., test data) using 10-fold cross-validation.
2) For each cross-validation round and OU implementation, the OU method is applied to the
training data (but not the test data, which is left intact) to partition the training data into OU
subsets. For example, in the first cross-validation round when evaluating the OU
implementation with 12 subsets, the OU method creates 12 subsets of the first training set.
3) A classification algorithm is used with each OU training subset generated in step 2 to build
one prediction model for each OU subset. For example, in OU with 12 subsets, a total of 12
prediction models are generated.
4) The test set, which was not modified using the OU method, is applied to each of the
prediction models generated in step 3.
5) For each observation in the test set, the probability predictions from each prediction model
are averaged. After combining the probability predictions, each observation in the test set
has a single probability prediction representing the average prediction of all the prediction
models developed in step 3.
6) The probability predictions along with the class labels (e.g., fraud or non-fraud) are used to
calculate ECM scores. When calculating ECM scores, optimal fraud classification thresholds
(cutoffs) are first determined for each combination of prior fraud probability and cost ratio
by examining ECM scores at different classification threshold levels (Beneish 1997).
Optimal thresholds are then used to calculate ECM scores for each combination of prior
fraud probability and cost ratio for that specific test dataset.
7) Steps two through six are repeated for each OU implementation (e.g., OU with two subsets, OU with three subsets, etc.) within each cross-validation round.
8) After completing all ten rounds, each OU implementation has ten ECM scores (one for each
test set) for each prior fraud probability and cost ratio combination. Averages of the ten ECM
scores are then used to examine prediction performance of different OU implementations and
against the benchmarks at different prior fraud probability and cost ratio levels.
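The core of this procedure (steps 2 through 5, for a single cross-validation round) can be sketched as follows. The helper names and the trivial placeholder learner are our assumptions; the paper builds each model with a support vector machine:

```python
import random

def ou_predict(train, test, n_subsets=12, fit=None, seed=0):
    """Multi-subset observation undersampling (OU), steps 2-5 above.
    `train`/`test` are lists of (features, label) pairs with label 1 for
    fraud; `fit` builds a model from a training list and returns a
    probability-scoring function."""
    if fit is None:
        def fit(data):  # placeholder learner: predicts the fraud base rate
            rate = sum(y for _, y in data) / len(data)
            return lambda x: rate
    rng = random.Random(seed)
    fraud = [d for d in train if d[1] == 1]
    nonfraud = [d for d in train if d[1] == 0]
    rng.shuffle(nonfraud)
    size = len(nonfraud) // n_subsets
    models = []
    for k in range(n_subsets):
        # Step 2: each subset pairs ALL fraud firms with a non-overlapping
        # slice of the non-fraud firms.
        subset = fraud + nonfraud[k * size:(k + 1) * size]
        models.append(fit(subset))  # Step 3: one model per subset
    # Steps 4-5: score the untouched test set with every model and
    # average the probability predictions per observation.
    return [sum(m(x) for m in models) / len(models) for x, _ in test]

train = [((i,), 1) for i in range(45)] + [((i,), 0) for i in range(444)]
test = [((0,), 1), ((1,), 0)]
preds = ou_predict(train, test, n_subsets=12)
```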
IV. RESULTS
Main Results
Figures 4-6 summarize the performance results of different OU and VU implementations.
For each implementation, the results represent the average expected cost of misclassification
(ECM) from ten test folds. ECM is reported at the best estimates of (i) prior fraud probability,
i.e., 0.6 percent, and (ii) false negative to false positive cost ratios, i.e., 30:1. The results are
presented as the percentage difference in ECM between each OU and VU implementation and
their respective benchmarks.17 Given that each figure is plotted using a single benchmark that is
held constant across different implementations, we first use the figures to look for clear trends
that indicate how to implement OU and VU, respectively. We then compare the performance of the best implementations to their respective benchmarks. Figure 4 (with details in Table 2) presents the performance results of OU relative to the best performing OU
benchmark (i.e., simple undersampling) as the number of subsets in OU increases. Our results
indicate that the benefit provided by OU initially increases as additional subsets are used but eventually plateaus. Figure 4 also includes the corresponding results from two sensitivity analyses, i.e., the
experiments in which the subsets are selected in a different order and the random selection of
non-fraud cases is repeated. The results across all three versions of experiment 1 are similar in
that each shows a performance benefit from using OU that initially increases in the number of
OU subsets, but starts to plateau after about 10 subsets. These results indicate that the marginal
17 Reported p-values are based on pairwise t-tests using the average and standard deviation of ECM scores across the ten test
folds and are one-tailed unless otherwise noted. Assumptions related to normality and independent observations are unlikely to
be satisfied, and p-values are only included as an indication of the relation between the magnitude and the variance of the
difference between each implementation and the respective benchmarks.
performance benefit from adding subsets declines as new subsets become less and less likely to add new information.
Taken together, these experiments indicate that OU provides performance benefits and that
the number of subsets to include in OU is relatively consistent in the fraud setting. In an attempt
to balance performance benefits (we want to include enough subsets to make sure that we have
reached the performance plateau) with analysis costs (given that we have reached the plateau, we
want to keep the number of subsets low since adding additional subsets increases processing costs), we select 12 subsets, i.e., OU(12). This configuration lowers the expected cost of misclassification in the primary analysis by 10.8 percent relative to simple undersampling.18
For VU, we examine two versions. The dashed line shows the results when the number of variables in each subset remains constant per experimental round (Experiment 2a). The round dotted line shows the results when all variables are included and divided evenly across the subsets in each round (Experiment 2b).
18 OU, which uses all variables and under-sampled non-fraud firm observations (across multiple subsets), appears to improve
performance in two ways. First, simple undersampling improves performance over no undersampling by 7.3 percent. Second,
OU(12) further improves the performance over simple undersampling by another 10.8 percent. This indicates that OU improves
performance relative to the benchmarks that use all observations (i) because it undersamples observations, but more importantly
(ii) because of the way it undersamples these observations. That is, it creates multiple subsets including non-overlapping non-
fraud observations. This suggests that OU creates diverse models using different subsets. To better understand the source of this
diversity, i.e., if using different observations in the subsets allows OU to obtain more robust parameter estimates of a subset of
important variables or if different variables are emphasized in the different models, we perform an additional comparison. This
supplemental analysis indicates that OU(12) with all variables (as implemented in the paper) performs 7.0 percent better than
OU(12) with only the Dechow variables, which in turn performs 11.1 percent better than the Dechow benchmark that uses all
observations. The improvement in the Dechow benchmark when combined with OU(12) suggests that some performance benefit
is obtained by OU(12) creating more robust parameter estimates. The additional performance benefit of OU(12) with all
variables over OU(12) with only the Dechow variables (together with results in footnote 19), indicates that different models at
least partially rely on different variables. OU thus appears to improve performance by generating more robust parameter
estimates and by emphasizing different variables in different models.
<Insert Figure 5 Here>
When the number of variables is kept constant in each subset (the dashed line), the performance benefit initially increases with additional subsets, plateauing at 11 to 18 subsets, and then decreasing at 19 subsets. However, even at the plateau (VU with 11 to 18
subsets), the performance difference between VU and the benchmark only approaches statistical
significance (p = 0.125 on average). In addition, the jagged line indicates that VU is sensitive to the specific variables randomly assigned to each subset.
When all available variables are divided into the selected subsets (the round dotted line), VU
does not provide a performance benefit relative to the random variable selection benchmark.
Consistent with the results from the analysis where the number of variables is kept constant in
each subset, these results indicate that the performance of VU is dependent on the specific
variables included in each subset. This second VU experiment also emphasizes the importance of how variables are assigned to subsets.
The VU results discussed above suggest that a more deliberate partitioning of variables may be
important. We earlier argued that fraud consists of multiple types (e.g., revenue vs. expense
fraud) and that it might be beneficial to partition the explanatory variables with this in mind. Our
results for PVU support this conjecture. More specifically, in untabulated results, PVU lowers
the expected cost of misclassification by 9.6 percent (p = 0.019) relative to the best performing
VU benchmark.19
19 To better understand why PVU (and VU) improves performance over the benchmarks, we first note that the small performance
difference (0.02 percent) between the all variables benchmark (that uses all observations and all variables) and the Dechow
benchmark (that uses all observations and a subset of variables as selected in Dechow et al. 2011) suggests that performance does
not improve by simply adding more variables. Given that VU (as well as PVU that performs even better) improves performance
relative to the all variables benchmark by 7.2 percent, it appears that the segmentation of the variables rather than the inclusion of
additional variables contributes to the performance improvement. Additionally, because PVU performs 6.3 percent better than
VU, it appears that how the variables are segmented matters.
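The partitioning idea behind PVU can be illustrated as follows. The variable-to-fraud-type groupings shown here are hypothetical placeholders, not the paper's actual mapping:

```python
# Partitioned Variable Undersampling (PVU): group explanatory variables
# by the type of fraud they are expected to predict, rather than randomly.
# These groupings are illustrative assumptions only.
variable_partition = {
    "revenue_fraud": ["receivables_growth", "revenue_growth",
                      "deferred_revenue_change"],
    "expense_fraud": ["abnormal_accruals", "soft_assets",
                      "expense_to_sales_change"],
    "liability_fraud": ["leverage_change", "operating_lease_activity"],
}

def pvu_subsets(dataset, partition):
    """Project a list of {variable: value} rows onto each fraud-type
    subset; one prediction model is then built per subset and their
    probability predictions averaged, as with OU."""
    return {
        fraud_type: [{v: row[v] for v in variables} for row in dataset]
        for fraud_type, variables in partition.items()
    }

row = {v: 0.0 for vs in variable_partition.values() for v in vs}
subsets = pvu_subsets([row], variable_partition)
```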
Additional Analyses
Further validation using misstatement data. We use the observations in a material
misstatement dataset that is an expanded version (additional years) of the data used in Dechow et
al. (2011) to perform three additional analyses. This dataset is available from the Center for
Financial Reporting and Management at the University of California, Berkeley and includes the
fraud firms used in our primary dataset as well as additional material misstatement firms reported
in AAERs by the SEC.20 Unless otherwise noted, the prediction models are implemented using
the same variables as in the main experiments (e.g., OU is implemented using all variables) and
we use the Dechow benchmark given that these data are based on Dechow et al. (2011). To
evaluate predictive performance, we again use 10-fold cross validation. Further, due to a lack of
good estimates of prior probabilities and cost ratios for material misstatements, we use a
performance metric known as area under the Receiver Operating Curve (ROC) or simply AUC.21
The first analysis provides further validation of out-of-sample prediction performance of the
proposed methods and compares OU and PVU to the Dechow benchmark when using the
20 We exclude firms from the finance industry and, following Dechow et al. (2011), add all Compustat non-fraud firms in the
same year and industry as the fraud firms. We do, however, only include the first fraud year, i.e., we do not include multiple
years for each fraud firm, due to the potential bias introduced when including fraud firm years. We also follow the procedure
used in Dechow et al. (2011) to eliminate observations with missing values in one or more of the variables included in the
Dechow benchmark. We use mean replacement to handle missing values in the remaining variables. We also perform the
analyses reported in this section after eliminating all observations with one or more missing values. Before performing this
elimination, we remove six variables with over 25 percent missing values: abnormal change in order backlog, allowance for
doubtful accounts, allowance for doubtful accounts to accounts receivable, allowance for doubtful accounts to net sales, expected
return on pension plan assets, and change in expected return on pension plan assets.
21 While ECM is a preferred performance metric when prior probabilities and cost ratios are known, AUC is preferred over other
performance measures in settings with unknown error costs and prior probabilities (Provost, Fawcett, and Kohavi 1998). AUC
has become the de facto standard performance measure in machine learning research and has also been used in accounting
research (e.g., Larcker and Zakolyukina 2012). A single ROC curve is generated for each predicted evaluation dataset by
changing the classification threshold and then plotting the true positive rate (positive cases classified correctly to all positive
cases) to the false positive rate (negative cases classified incorrectly to all negative cases). ROC curves thus depict the trade-off
between classifying additional positive cases correctly and the cost of classifying additional negative cases incorrectly, as the
classification threshold decreases. Alternatively, they also show how well the prediction model performs in ranking the
evaluation dataset observations. The area under the ROC curve (AUC) provides a numeric value of this trade-off and represents
the probability that a randomly selected positive instance is ranked higher than a randomly selected negative instance. An AUC
of 0.5 is equivalent to a random rank order while an AUC of 1 is perfect ranking of the evaluation cases.
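The rank interpretation of AUC described in the footnote can be computed directly (a pairwise O(n·m) illustration for clarity, not the authors' implementation):

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a randomly chosen positive (fraud)
    case receives a higher score than a randomly chosen negative case;
    ties count one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# A model that scores fraud firms above most non-fraud firms:
a = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])
```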
observations in the material misstatement data. This analysis also provides insight into the
usefulness of the proposed methods in a slightly different setting (material misstatements vs.
fraud). Results in Table 3 suggest that OU and PVU (panel A) continue to improve performance over the Dechow benchmark when using material misstatement data, in the case of OU by 16.9 percent (p = 0.004).
The second analysis examines the sensitivity of the results to the classification algorithm
used. This analysis evaluates the performance of the methods when combined with logistic
regression and bootstrap aggregation instead of support vector machines (that is used in all other
analyses). Results in Table 3 (panel B) show that the performance of OU and PVU are
consistent (OU more so than PVU) across the different classification algorithms. The magnitude of the improvement, however, varies with the classification algorithm used. More importantly, OU and PVU perform significantly better than the Dechow
benchmark across all of the different classification algorithms. OU improves the performance
over the benchmark by 3.6 (p = 0.004) and 50.5 (p = 0.003) percent when logistic regression and
bootstrap aggregation are used, respectively.22 Similarly, PVU improves the performance over
the benchmark by 7.8 (p < 0.001) and 36.0 (p < 0.001) percent when logistic regression and
bootstrap aggregation are used, respectively. These results suggest that the performance benefits
from using OU and PVU are robust to the specific classification algorithm used.23
22 The difference between OU and the Dechow benchmark when using logistic regression does not appear to be as strong as that
suggested in the main experiment using fraud data. In the main experiment, we used the same classification algorithm (support
vector machines) for all methods and benchmarks to maintain internal validity and to avoid making the experiments overly
complex. To evaluate the effect of a potential bias against the Dechow benchmark associated with this decision, we examine the
performance of the Dechow benchmark in the main experiment with logistic regression instead of support vector machines. The
results indicate an insignificant difference between the two implementations (p = 0.984, two tailed), and this result is robust
across different prior probability levels and cost ratios. Thus, the decision to use support vector machines for all methods and
benchmarks does not appear to have biased the results against the Dechow benchmark.
23 These results are also robust to an additional analysis using a sample that excludes all variables with over 25 percent missing
values and all observations with one or more missing values in remaining variables. We also performed some limited analysis
using boosting, and OU and PVU continue to outperform the Dechow benchmark by 44.8 (p < 0.001) and 5.1 (p = 0.044)
percent, respectively. However, the performance of both PVU and the Dechow benchmark fell considerably (while the
<Insert Table 3 Here>
The third analysis provides insight into (i) the usefulness of OU when used in combination
with a different set of independent variables (based on the financial kernel of Cecchini et al.
2010) and (ii) whether OU provides incremental predictive power when used in combination
with this kernel. Cecchini et al. (2010) based their financial kernel on 23 financial statement
variables commonly used to construct independent variables for fraud prediction models. The
financial kernel divides each of the 23 original variables by each other both in the current year
and in the prior year and calculates changes in the ratios. Both current and lagged ratios as well
as their changes are then used to construct a dataset with 1,518 independent variables.
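The kernel construction described above can be sketched as follows (a reconstruction from the description in the text, not Cecchini et al.'s code; variable names are placeholders):

```python
from itertools import permutations

def financial_kernel_features(current, prior):
    """Build the financial-kernel feature set described in the text:
    every ordered ratio of the base variables in the current year, the
    same ratio in the prior year, and the change between them. With 23
    base variables this yields 23 * 22 = 506 ratios per group, or
    3 * 506 = 1,518 features."""
    feats = {}
    for a, b in permutations(current, 2):  # ordered pairs, a != b
        cur = current[a] / current[b]
        lag = prior[a] / prior[b]
        feats[f"{a}/{b}|t"] = cur        # current-year ratio
        feats[f"{a}/{b}|t-1"] = lag      # prior-year ratio
        feats[f"{a}/{b}|chg"] = cur - lag  # change in the ratio
    return feats

# With 23 base variables (placeholder names), 1,518 features result:
cur = {f"v{i}": float(i + 1) for i in range(23)}
pri = {f"v{i}": float(i + 2) for i in range(23)}
n_features = len(financial_kernel_features(cur, pri))
```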
We use the same initial set of observations used in the previous analysis and recreate the
financial kernel following Cecchini et al. (2010). We also follow their procedures and exclude
all observations with missing values. We then compare OU implemented with the variables in
the financial kernel to the Cecchini benchmark, which uses the financial kernel but does not use OU. We do not attempt to implement PVU, as it is not clear how we would separate the 1,518 variables into
different fraud types. Results in Table 3 (panel C) indicate that OU (AUC = 0.67) outperforms
the financial kernel (AUC = 0.59) in misstatement prediction by 14.2 percent (p = 0.004).24
Combining the Methods. We analyze whether various combinations of OU, PVU, and
SMOTE (see footnote 9) provide additional performance benefits compared to OU(12), the best
performing individual method. Figure 6 plots the performance difference of various method
combinations compared to OU(12) at different cost ratios. The selection of the specific
performance of OU only fell slightly) when using boosting. Similarly, we performed some limited experiments using Bayesian
learning, but the performance of all three methods fell drastically. Thus, boosting and Bayesian learning do not appear to be
viable options, and we do not tabulate these results.
24 When including fraud firm years, OU performs 5.7 percent (p < 0.001) better than the Cecchini benchmark and both
approaches have high AUC values (AUC = 0.863 and AUC = 0.816, respectively).
configurations used in these combinations is based on their general performance in the previous
experiments. The combinations are generated by creating prediction models using OU and PVU separately and then averaging the predictions from the OU and PVU prediction models.25
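The combination step can be sketched as follows (an illustration with made-up probabilities):

```python
def combine_predictions(*model_predictions):
    """Average the probability predictions of separately built prediction
    models (e.g., the OU(12) ensemble and the PVU ensemble), observation
    by observation."""
    n_models = len(model_predictions)
    return [sum(p) / n_models for p in zip(*model_predictions)]

# Hypothetical per-observation fraud probabilities from each method:
ou_preds = [0.80, 0.10, 0.30]
pvu_preds = [0.60, 0.20, 0.50]
combined = combine_predictions(ou_preds, pvu_preds)
```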
In untabulated results, the three-method combination does not perform significantly different
than OU(12) (p = 0.465) at best estimates of prior fraud probability and cost ratios. Similarly,
the two-method combination of OU(12) and PVU also does not perform significantly different
than OU(12) (p = 0.421) at best estimates of prior fraud probability and cost ratios. Thus, in
typical fraud prediction research settings, we recommend using OU(12). However, the two- and
three-method combinations provide performance benefits over OU(12) at higher cost ratios and
higher prior fraud probability levels (see Figure 6). Given that the combination of OU(12) and
PVU either performs significantly better or not significantly different than OU(12), we
recommend using this combination of the two methods if maximizing predictive ability is more
important than minimizing implementation costs. For example, when the SEC uses a prediction
model to help decide which firms to investigate for potential fraud, the additional
implementation costs associated with using the combination is likely to be small relative to the
costs of misclassifying a non-fraud firm and using resources to investigate the firm (and even
more so relative to misclassifying a fraud firm and not detecting the fraud).
Using OU in hypothesis testing research. Prior research often seeks to identify new explanatory variables to improve fraud prediction. Traditionally, this research uses
the entire sample (i.e., all observations) or a single matched sample to evaluate the significance
of one or more independent variables that are hypothesized to be associated with the dependent
25 SMOTE is incorporated by oversampling the data used by OU and PVU. We also first create the OU subsets and then apply
SMOTE and PVU to these subsets, but this more integrated and complex combination does not improve performance further.
variable. However, the predictive performance benefits of OU reported earlier suggest that
classification algorithms (e.g., logistic regression) recognize different fraud patterns when
trained on different subsets of non-fraud firms. Thus, when evaluating explanatory variables in
hypothesis testing research, it may be important to consider the robustness of results across different subsamples. The following example uses data from the additional analyses that examine misstatement data. In this example
we examine the significance of Sales to Employees given a set of control variables selected
based on prior research (the control variables in this example were selected using step-wise
backward feature selection). Traditionally, the hypothesis would be tested using all observations
in the sample, i.e., the full sample. The results for the full sample in Table 4 indicate that the
hypothesis is supported (p = 0.0116). However, the OU subsample analyses indicate that this
result might not be robust. For example, the average p-value of all Sales to Employees estimates
across the 12 models obtained using different sub-samples is p = 0.180, and the p-value is above conventional significance levels in several of the 12 subsamples.
Results in Table 4 suggest that OU yields similar results to the traditional hypothesis testing
analysis, i.e., the most significant variables in the traditional approach tend to be the most
significant in the OU analysis. However, the OU results are generally more conservative. For
example, in only two cases are the median p-values from the OU results numerically smaller
(more significant) than the corresponding parametric result. For 12 of 17 variables, the median p-
values are numerically larger (less significant) than their parametric counterparts. Thus, we
encourage future research to consider applying OU as a robustness check for hypothesis testing.26
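Such a robustness check can be sketched as follows. The orchestration is ours; `fit_pvalue` stands in for whatever estimator (e.g., a logistic regression supplied by the researcher) returns the p-value of the variable of interest:

```python
import random
import statistics

def ou_robustness_check(fraud, nonfraud, fit_pvalue, n_subsets=12, seed=0):
    """Re-estimate a hypothesis-testing model on each OU subsample
    (all fraud firms plus a non-overlapping slice of non-fraud firms)
    and summarize the p-value of the variable of interest."""
    rng = random.Random(seed)
    nonfraud = list(nonfraud)
    rng.shuffle(nonfraud)
    size = len(nonfraud) // n_subsets
    pvals = [fit_pvalue(list(fraud) + nonfraud[k * size:(k + 1) * size])
             for k in range(n_subsets)]
    return {"mean_p": statistics.mean(pvals),
            "median_p": statistics.median(pvals),
            "n_above_0_05": sum(p > 0.05 for p in pvals)}

# Placeholder estimator: pretend larger samples give smaller p-values.
summary = ou_robustness_check(
    fraud=range(50), nonfraud=range(600),
    fit_pvalue=lambda sample: 1.0 / len(sample))
```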
V. CONCLUSION
The accounting literature investigates a wide range of explanatory variables and various
classification algorithms that contribute to more accurate prediction of fraud and material
misstatements. However, the rarity of fraud data, the relative abundance of variables identified in
prior literature, and the broad definition of fraud create challenges in specifying effective
prediction models.
Research in the emerging field of data analytics has been applied successfully in other
settings constrained by data rarity, such as predicting credit card fraud (Chan and Stolfo 1998).
We, therefore, follow the call of Varian (2014) to apply recent advances in data analytics in other
settings and investigate the ability of methods drawn from data analytics to improve fraud prediction. The first method, Multi-subset Observation Undersampling (OU), applies repeated undersampling of non-fraud observations to establish a more effective balance with scarce fraud
observations. When used with 12 subsamples, this method improves fraud prediction by
lowering the expected cost of misclassification by more than ten percent relative to the best
performing benchmark. This method is also both efficient and relatively easy to implement.
The second method, Multi-subset Variable Undersampling (VU), undersamples explanatory variables to put them more in balance with scarce fraud observations.
26 In untabulated results, we repeat the analysis using bootstrapping. More specifically, the full sample is used to generate 1,000
bootstrap subsamples (each sample contained observations selected randomly with replacement). Each bootstrap subsample is
then used to fit a logistic regression model from which 2.5 and 97.5 percentiles of independent variable coefficient estimates are
obtained. The bootstrapping results are similar to the OU results in that they are also generally more conservative.
VU can improve prediction performance by randomly partitioning explanatory variables into different subsets. However, it does not do so reliably. When we instead implement Multi-subset Variable Undersampling by partitioning variables into subsets based on the type of fraud they are likely to predict (PVU), the expected cost of misclassification is reduced by 9.6 percent relative to the best performing VU benchmark.
Our research makes multiple contributions to the prior literature. First, we identify and
directly address financial statement fraud data rarity problems by systematically evaluating
multiple methods that we believe are new to the accounting literature. Based on our
experiments, we conclude that OU and PVU each produce economically and statistically
significant reductions in the expected cost of misclassification of about ten percent.27 This
compares to, for example, a 0.9 percent prediction performance advantage when, following
Dechow et al. (2011), two additional significant independent variables are added to their initial
model. The introduction and evaluation of these methods directly contributes to research that
focuses on improving fraud prediction. Beneish (1997) and Dechow et al. (2011), among others,
create fraud prediction models that can be used to indicate the likelihood that a company has
committed financial statement fraud. Our methods can be used to improve the quality of such
fraud predictions. We also directly extend research that examines the usefulness of data
analytics methods in fraud prediction (e.g., Cecchini et al. 2010; Perols 2011; Larcker and Zakolyukina 2012).
27 We specifically recommend the use of OU(12) at times in combination with PVU. The choice between using OU by itself or
in combination with PVU depends on the cost ratio and the prior fraud probability assumed by the specific entity that is trying to
predict fraud (see Figure 6).
28 Future research that tries to improve fraud prediction using data analytics methods can examine other problems related to
rarity, such as (i) noisy data that potentially have more significant negative effects on rare cases (Weiss 2004), and (ii) mislabeled
non-fraud firms, i.e., firms that are labeled non-fraud but have actually committed fraud. We performed a limited analysis of one
potential approach. We (1) manipulated the training data in each cross-validation round by using OU to generate fraud
probability predictions for all the observations in the training data and then removed all non-fraud firms with high fraud
probability predictions (we tried five different thresholds: 0.9, 0.8, 0.7, 0.6, and 0.5) from the training data; (2) used the modified
training data from step 1 as input into OU; and (3) compared the results from step 2 to the original OU implementation.
Untabulated results did not show any significant performance improvements over the original OU implementation. When
compared to the original implementation, the average change in AUC across the ten test folds was -0.08% (p = 0.809; two-tailed),
Second, by showing that performance benefits can be gained by (i) addressing data rarity
problems in fraud detection and (ii) partitioning financial statement fraud into different fraud
types, our results provide an indication of the potential benefits that may result from addressing
similar problems in other settings. For example, bankruptcy, financial statement restatements, material weaknesses in internal control over financial reporting, and audit qualifications are also relatively rare events that could benefit from these methods.
Third, our research has implications for research that focuses on designing new explanatory
variables and developing parsimonious prediction models (e.g., Dechow et al. 2011; and
Markelevich and Rosner 2013). Our findings suggest that classification algorithms recognize
different fraud patterns when trained on different subsets of non-fraud firms. Thus, even if an
explanatory variable is deemed significant in one subsample, it is valuable to show that it is also significant across other subsamples. OU is related in spirit to the robustness measure proposed by Athey and Imbens (2015) that creates subsamples based on values of the independent variables in the model. While we perform additional analyses that suggest that OU
(i) performs better than bootstrapping in predictive modeling and (ii) can be used to evaluate the robustness of hypothesis testing results, more research is needed before making definitive recommendations about which method(s) to use for hypothesis testing.29 Further, research that concludes that a new explanatory variable provides incremental predictive power should
0.08% (p = 0.360; one-tailed), 0.12% (p = 0.337; one-tailed), 0.31% (p = 0.182; one-tailed), and 0.24% (p = 0.228; one-tailed)
when using thresholds of 0.5, 0.6, 0.7, 0.8, and 0.9, respectively. Future research is also needed to more directly address the
challenges associated with biases introduced by only having few fraud observations in absolute terms, while at the same time
having a potentially large number of undetected fraud cases. For example, to assess the impact of a potential overreliance on a
small sample of fraud firms and to attempt to improve out-of-sample predictive performance, future research could use random
subsamples rather than all fraud firms in each OU subset.
29 Future research can also examine the use of OU in conjunction with propensity score matching. For example, can OU be used
to generate more robust propensity scores? Alternatively, by applying OU, generating propensity
scores and matched samples within each OU subset, and then evaluating differences between the
samples, can OU be used to evaluate the robustness of propensity score matching results?
consider showing that the variable provides incremental predictive value to models implemented using OU.30
Fourth, we also make a contribution by following the call to consider different types of fraud
(Brazel et al. 2009). We partition financial statement fraud into types and show that this
reframing improves the performance of VU in fraud prediction. The importance of this finding
may extend beyond VU. Research that examines predictors of fraud could, similar to Brazel et
al. (2009), design new explanatory variables to detect a specific type of fraud instead of fraud in
general. For example, fraud research could potentially develop variables that predict different
fraud types using different types of analyst forecasts (e.g., revenue vs. earnings) or different
types of debt covenants (e.g., leverage vs. interest coverage). To illustrate, an independent
variable that indicates whether a firm uses a leverage (interest coverage) debt covenant can in
turn be used in a model that predicts liabilities (expense) fraud. Such reframing could
contribute to a better theoretical understanding of fraud and a more precise evaluation of
explanatory variables.
Finally, we believe that regulators and practitioners can potentially benefit from our findings.
Regulators, such as the SEC, are investing resources in developing better fraud risk models
(Walter 2013; SEC 2015). Our findings may enhance their ability to identify firms that have
committed fraud. This is important because, due to resource constraints, the SEC has to focus
its investigative resources on a subset of filings, and improved fraud
prediction models can be cost effective in identifying potential fraud firms. The negative effects
30 Please refer to www.fraudpredictionmodels.com/ou for further details on OU in general and more specific guidance on how to
use OU to evaluate (1) the robustness of independent variable hypothesis testing results and (2) the incremental predictive
performance of new independent variables. The hypothesis testing example includes further details on the analysis performed in
Table 4 of this paper and also includes mock data. The predictive performance example explains how to use OU in combination
with out-of-sample testing and includes mock data and SAS code.
of fraud on investors, customers, and lenders can also potentially be reduced. For example, auditors can use our
methods to potentially improve fraud risk assessment models that, in turn, can improve audit
client portfolio management and audit planning decisions. Given the significant costs and
widespread effects of financial statement fraud, improvements in fraud prediction models can result in meaningful benefits.
REFERENCES
Abbasi, A., C. Albrecht, A. Vance, and J. Hansen. 2012. MetaFraud: A Meta-Learning
Framework for Detecting Financial Fraud. MIS Quarterly. 36(4): 1293-1327.
Agarwal, R., and V. Dhar. 2014. Editorial - Big Data, Data Science, and Analytics: The
Opportunity and Challenge for IS Research. Information Systems Research. 25(3): 443-448.
Apostolou, B., J. Hassell, and S. Webber. 2000. Forensic Expert Classification of Management
Fraud Risk Factors. Journal of Forensic Accounting. 1(2): 181-192.
Armstrong, C. S., D. F. Larcker, G. Ormazabal, and D. J. Taylor. 2013. The relation between
equity incentives and misreporting: the role of risk-taking incentives. Journal of Financial
Economics. 109(2): 327-350.
Association of Certified Fraud Examiners. 2014. Report to the Nations on Occupational Fraud
and Abuse. Austin, TX.
Athey, S., and G. Imbens. 2015. A Measure of Robustness to Misspecification. American
Economic Review. 105(5): 476-80.
Bayley, L., and S. Taylor. 2007. Identifying earnings management: A financial statement
analysis (red flag) approach. Proceedings of the American Accounting Association Annual
Meeting, Chicago, IL.
Beasley, M. 1996. An Empirical Analysis of the Relation between the Board of Director
Composition and Financial Statement Fraud. The Accounting Review. 71(4): 443-465.
Bell, T., and J. Carcello. 2000. A Decision Aid for Assessing the Likelihood of Fraudulent
Financial Reporting. Auditing: A Journal of Practice & Theory. 19(1): 169-184.
Bellman, R. 1961. Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton, NJ.
Beneish, M. 1997. Detecting GAAP Violation: Implications for Assessing Earnings Management
among Firms with Extreme Financial Performance. Journal of Accounting and Public Policy.
16(3): 271-309.
Beneish, M. 1999. Incentives and Penalties Related to Earnings Overstatements That Violate
GAAP. The Accounting Review. 74(4): 425-457.
Brazel, J. F., K. L. Jones, and M. F. Zimbelman. 2009. Using nonfinancial measures to assess
fraud risk. Journal of Accounting Research. 47(5): 1135-1166.
Breiman, L. 1996. Bagging predictors. Machine learning. 24(2): 123-140.
Brown, B., M. Chui, and J. Manyika. 2011. Are you ready for the era of big data? McKinsey
Quarterly. 4: 24-35.
Caskey, J., and M. Hanlon. 2013. Dividend Policy at Firms Accused of Accounting Fraud.
Contemporary Accounting Research. 30(2): 818-850.
Cecchini, M., G. Koehler, H. Aytug, and P. Pathak. 2010. Detecting Management Fraud in
Public Companies. Management Science. 56(7): 1146-1160.
Chan, P., and S. Stolfo. 1998. Toward Scalable Learning with Non-uniform Class and Cost
Distributions: A Case Study in Credit Card Fraud Detection. Proceedings of the Fourth
International Conference on Knowledge Discovery and Data Mining, New York, NY.
Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. 2002. SMOTE: Synthetic
Minority Oversampling Technique. Journal of Artificial Intelligence Research. 16: 321-357.
Chen, H., R. H. Chiang, and V. C. Storey. 2012. Business Intelligence and Analytics: From Big
Data to Big Impact. MIS Quarterly. 36(4): 1165-1188.
Dechow, P. M., R. G. Sloan, and A. P. Sweeney. 1996. Causes and consequences of earnings
manipulation: An analysis of firms subject to enforcement actions by the sec. Contemporary
Accounting Research. 13(1): 1-36.
Dechow, P. M., W. Ge, C. R. Larson, and R. G. Sloan. 2011. Predicting Material Accounting
Misstatements. Contemporary Accounting Research. 28(1): 17-82.
Duin, P. W. R., and M. J. D. Tax. 2000. Experiments with Classifier Combining Rules.
Proceedings of the International Workshop on Multiple Classifier Systems 2000.
Erickson, M., M. Hanlon, and E. L. Maydew. 2006. Is There a Link between Executive Equity
Incentives and Accounting Fraud? Journal of Accounting Research. 44(1): 113-143.
Ettredge, M. L., L. Sun, P. Lee, and A. A. Anandarajan. 2008. Is earnings fraud associated with
high deferred tax and/or book minus tax levels? Auditing: A Journal of Practice & Theory.
27(1): 1-33.
Fanning, K., and K. Cogger. 1998. Neural network detection of management fraud using
published financial data. International Journal of Intelligent Systems in Accounting, Finance
and Management. 7(1): 21-41.
Feng, M., W. Ge, S. Luo, and T. Shevlin. 2011. Why do CFOs become involved in material
accounting manipulations? Journal of Accounting and Economics. 51(1): 21-36.
Feroz, E., T. Kwon, V. Pastena, and K. Park. 2000. The Efficacy of Red-Flags in Predicting the
SEC's Targets: An Artificial Neural Networks Approach. International Journal of Intelligent
Systems in Accounting, Finance & Management. 9(3): 145-157.
Galar, M., A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera. 2012. A review on
ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based
approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and
Reviews. 42(4): 463-484.
Glancy, F. H., and S. B. Yadav. 2011. A computational model for financial reporting fraud
detection. Decision Support Systems. 50(3): 595-601.
Goel, S., and J. Gangolly. 2012. Beyond The Numbers: Mining The Annual Reports For Hidden
Cues Indicative Of Financial Statement Fraud. Intelligent Systems in Accounting, Finance
and Management. 19(2): 75-89.
Green, B. P., and J. H. Choi. 1997. Assessing the Risk of Management Fraud Through Neural
Network Technology. Auditing: A Journal of Practice & Theory. 16(1): 14-28.
Gupta, R., and N. S. Gill. 2012. A Solution for Preventing Fraudulent Financial Reporting using
Descriptive Data Mining Techniques. International Journal of Computer Applications. 58(1):
22-28.
He, H., and E. A. Garcia. 2009. Learning from imbalanced data. IEEE Transactions on
Knowledge and Data Engineering. 21(9): 1263-1284.
Humpherys, S. L., K. C. Moffitt, M. B. Burns, J. K. Burgoon, and W. F. Felix. 2011.
Identification of fraudulent financial statements using linguistic credibility analysis. Decision
Support Systems. 50(3), 585-594.
Jones, K. L., G. V. Krishnan, and K. D. Melendrez. 2008. Do Models of Discretionary Accruals
Detect Actual Cases of Fraudulent and Restated Earnings? An Empirical Analysis.
Contemporary Accounting Research. 25(2): 499-531.
Kaminski, K., S. Wetzel, and L. Guan. 2004. Can Financial Ratios Detect Fraudulent Financial
Reporting. Managerial Auditing Journal. 19(1): 15-28.
Kittler, J., M. Hatef, R.P.W. Duin, and J. Matas. 1998. On Combining Classifiers. IEEE
Transactions on Pattern Analysis and Machine Intelligence. 20(3): 226-239.
Larcker, D. F., and A. A. Zakolyukina. 2012. Detecting deceptive discussions in conference
calls. Journal of Accounting Research. 50(2): 495-540.
LaValle, S., E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz. 2013. Big data, analytics
and the path from insights to value. MIT Sloan Management Review. 52(2): 21-32.
Lee, T. A., R. W. Ingram, and T. P. Howard. 1999. The Difference between Earnings and
Operating Cash Flow as an Indicator of Financial Reporting Fraud. Contemporary
Accounting Research. 16(4): 749-786.
Lennox, C., and J. A. Pittman. 2010. Big Five Audits and Accounting Fraud. Contemporary
Accounting Research, 27(1): 209-247.
Lin, J., M. Hwang, and J. Becker. 2003. A Fuzzy Neural Network for Assessing the Risk of
Fraudulent Financial Reporting. Managerial Auditing Journal. 18(8): 657-665.
Loebbecke, J. K., M. M. Eining, and J. J. Willingham. 1989. Auditors' experience with material
irregularities: Frequency, nature, and detectability. Auditing: A Journal of Practice and
Theory. 9(1): 1-28.
Maloof, M. 2003. Learning When Data Sets are Imbalanced and When Costs are Unequal and
Unknown. Proceedings of the Twentieth International Conference on Machine Learning,
Washington, DC.
Markelevich, A., and R. L. Rosner. 2013. Auditor Fees and Fraud Firms. Contemporary
Accounting Research. 30(4), 1590-1625.
Nguyen, H. M., E. W. Cooper, and K. Kamei. 2012. A comparative study on sampling
techniques for handling class imbalance in streaming data. Soft Computing and Intelligent
Systems. 1762-1767.
Perols, J. 2011. Financial statement fraud detection: An analysis of statistical and machine
learning algorithms. Auditing: A Journal of Practice & Theory. 30(2): 19-50.
Perols, J. L., and B. A. Lougee. 2011. The relation between earnings management and financial
statement fraud. Advances in Accounting. 27(1): 39-53.
Phua, C., D. Alahakoon, and V. Lee. 2004. Minority Report in Fraud Detection: Classification of
Skewed Data. SIGKDD Explorations. 6(1): 50-59.
Price III, R. A., N. Y. Sharp, and D. A. Wood. 2011. Detecting and predicting accounting
irregularities: A comparison of commercial and academic risk measures. Accounting
Horizons. 25(4): 755-780.
Provost, F. J., T. Fawcett, and R. Kohavi. 1998. The case against accuracy estimation for
comparing induction algorithms. Proceedings of the Fifteenth International Conference on
Machine Learning, Madison, WI. 98: 445-453.
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc.,
San Francisco, CA, USA.
SEC. 2015. Examination Priorities for 2015. Retrieved from
http://www.sec.gov/about/offices/ocie/national-examination-program-priorities-2015.pdf.
Sharma, V. 2004. Board of Director Characteristics, Institutional Ownership, and Fraud:
Evidence from Australia, Auditing: A Journal of Practice & Theory. 23(2): 105-117.
Shin, K. S., T. Lee, and H. J. Kim. 2005. An Application of Support Vector Machines in
Bankruptcy Prediction Models. Expert Systems with Application. 28: 127-135.
Summers, S. L., and J. T. Sweeney. 1998. Fraudulently Misstated Financial Statements and
Insider Trading: An Empirical Analysis. The Accounting Review. 73(1): 131-146.
Varian, H. R. 2014. Big data: New tricks for econometrics. The Journal of Economic
Perspectives. 28(2): 3-27.
Walter, E. 2013. Harnessing Tomorrow's Technology for Today's Investors and Markets. Speech
presented at American University School of Law, Washington, D.C. (February).
Weiss, G. 2004. Mining with Rarity: A Unifying Framework. ACM SIGKDD Explorations
Newsletter. 6(1): 7-19.
Whiting, D. G., J. V. Hansen, J. B. McDonald, C. Albrecht, and W. S. Albrecht. 2012. Machine
Learning Methods For Detecting Patterns Of Management Fraud. Computational
Intelligence. 28(4): 505-527.
Yang, Q., and X. Wu. 2006. 10 challenging problems in data mining research. International
Journal of Information Technology & Decision Making. 5(4): 597-604.
APPENDIX A: Definitions of explanatory variablesa
Total discretionary accrual: RSST Accruals_t-1 + RSST Accruals_t-2 + RSST Accruals_t-3, where
Value of issued securities to market value: IF CSHI > 0 THEN CSHI*PRCC_F/(CSHO*PRCC_F) ELSE IF (CSHO - CSHO_t-1) > 0 THEN ((CSHO - CSHO_t-1)*PRCC_F)/(CSHO*PRCC_F) ELSE 0
Whether accounts receivable > 1.1 of last year's: IF (RECT/RECT_t-1) > 1.1 THEN 1 ELSE 0
Whether firm was listed on AMEX: IF EXCHG = 5, 15, 16, 17, 18 THEN 1 ELSE 0
Whether gross margin percent > 1.1 of last year's: IF ((SALE - COGS)/SALE) / ((SALE_t-1 - COGS_t-1)/SALE_t-1) > 1.1 THEN 1 ELSE 0
Whether LIFO: IF INVVAL = 2 THEN 1 ELSE 0
Whether new securities were issued: IF (CSHO - CSHO_t-1) > 0 OR CSHI > 0 THEN 1 ELSE 0
Whether SIC code larger (smaller) than 2999 (4000): IF 2999 < SIC < 4000 THEN 1 ELSE 0
Change in return on assets: NI/AT - NI_t-1/AT_t-1
% Change in return on assets: (NI/AT - NI_t-1/AT_t-1) / (NI_t-1/AT_t-1)
Abnormal % change in return on assets: (NI/AT - NI_t-1/AT_t-1) / (NI_t-1/AT_t-1) - INDUSTRY((NI/AT - NI_t-1/AT_t-1) / (NI_t-1/AT_t-1))
Return on equity: NI/CEQ
Change in return on equity: NI/CEQ - NI_t-1/CEQ_t-1
% Change in return on equity: (NI/CEQ - NI_t-1/CEQ_t-1) / (NI_t-1/CEQ_t-1)
Abnormal % change in return on equity: (NI/CEQ - NI_t-1/CEQ_t-1) / (NI_t-1/CEQ_t-1) - INDUSTRY((NI/CEQ - NI_t-1/CEQ_t-1) / (NI_t-1/CEQ_t-1))
Return on sales: NI/SALE
Change in return on sales: NI/SALE - NI_t-1/SALE_t-1
% Change in return on sales: (NI/SALE - NI_t-1/SALE_t-1) / (NI_t-1/SALE_t-1)
Abnormal % change in return on sales: (NI/SALE - NI_t-1/SALE_t-1) / (NI_t-1/SALE_t-1) - INDUSTRY((NI/SALE - NI_t-1/SALE_t-1) / (NI_t-1/SALE_t-1))
Accounts payable to inventory: AP/INVT
Change in accounts payable to inventory: AP/INVT - AP_t-1/INVT_t-1
% Change in accounts payable to inventory: (AP/INVT - AP_t-1/INVT_t-1) / (AP_t-1/INVT_t-1)
Abnormal % change in accounts payable to inventory: (AP/INVT - AP_t-1/INVT_t-1) / (AP_t-1/INVT_t-1) - INDUSTRY((AP/INVT - AP_t-1/INVT_t-1) / (AP_t-1/INVT_t-1))
Liabilities: LT
Change in liabilities: LT - LT_t-1
% Change in liabilities: (LT - LT_t-1) / LT_t-1
Abnormal % change in liabilities: (LT - LT_t-1) / LT_t-1 - INDUSTRY((LT - LT_t-1) / LT_t-1)
Liabilities to interest expenses: LT/XINT
Change in liabilities to interest expenses: LT/XINT - LT_t-1/XINT_t-1
% Change in liabilities to interest expenses: (LT/XINT - LT_t-1/XINT_t-1) / (LT_t-1/XINT_t-1)
Abnormal % change in liabilities to interest expenses: (LT/XINT - LT_t-1/XINT_t-1) / (LT_t-1/XINT_t-1) - INDUSTRY((LT/XINT - LT_t-1/XINT_t-1) / (LT_t-1/XINT_t-1))
Assets: AT
Change in assets: AT - AT_t-1
% Change in assets: (AT - AT_t-1) / AT_t-1
Abnormal % change in assets: (AT - AT_t-1) / AT_t-1 - INDUSTRY((AT - AT_t-1) / AT_t-1)
Assets to liabilities: AT/LT
Change in assets to liabilities: AT/LT - AT_t-1/LT_t-1
% Change in assets to liabilities: (AT/LT - AT_t-1/LT_t-1) / (AT_t-1/LT_t-1)
Abnormal % change in assets to liabilities: (AT/LT - AT_t-1/LT_t-1) / (AT_t-1/LT_t-1) - INDUSTRY((AT/LT - AT_t-1/LT_t-1) / (AT_t-1/LT_t-1))
Expenses: XOPR
Change in expenses: XOPR - XOPR_t-1
% Change in expenses: (XOPR - XOPR_t-1) / XOPR_t-1
Abnormal % change in expenses: (XOPR - XOPR_t-1) / XOPR_t-1 - INDUSTRY((XOPR - XOPR_t-1) / XOPR_t-1)
Notes:
a The explanatory variables included represent a relatively comprehensive set of variables based on recent fraud and
material misstatement literature (Cecchini et al. 2010; Dechow et al. 2011; Perols 2011). We include all variables
from Perols (2011) and all variables from the final Dechow et al. (2011) model that can be calculated using
Compustat data. Dechow et al. (2011) perform step-wise backward feature selection to derive more parsimonious
material misstatement models. We use their second model, which is the most complete model in their study that
only relies on Compustat data (they also include a model that requires market related data). This study predicts
material misstatements using the following variables: RSST accruals, change in receivables, change in inventory,
soft assets, percentage change in cash sales, change in return on assets, actual issuance of securities, abnormal
change in employees, and existence of operating leases. The model in Cecchini et al. (2010) includes a total of
1,518 explanatory variables derived using 23 financial statement items. These items are divided by each other both
in the current year and in the prior year and used to calculate changes in the ratios. Both current and lagged ratios as
well as their changes are then used to construct a dataset with 1,518 independent variables. Rather than including all
1,518 variables in our study, we follow and extend the approach used in Cecchini et al. (2010) by including 48
variables measuring levels and changes in levels, percentage change in levels, and abnormal percentage change of
commonly manipulated financial statement items and ratios. We examine a model with all 1,518 variables from
Cecchini et al. (2010) in an additional analysis.
b ACT is Current Assets - Total; AT is Assets - Total; AU is Auditor; CAPX is Capital Expenditures; CEQ is
Common/Ordinary Equity - Total; CHE is Cash and Short-Term
Investments; COGS is Cost of Goods Sold; CSHI is Common Shares Issued; CSHO is Common Shares
Outstanding; DLC is Debt in Current Liabilities - Total; DLTIS is Long-Term Debt Issuance; DLTT is Long-Term
Debt - Total; DP is Depreciation and Amortization; EMP is Employees; EXCHG is Stock Exchange; FINCF is
Financing Activities Net Cash Flow; IB is Income Before Extraordinary Items; INVT is Inventories - Total;
INVVAL is Inventory Valuation Method; IVAO is Investment and Advances - Other; IVST is Short-Term
Investments - Total; LCT is Current Liabilities - Total; LT is Liabilities - Total; MRC1 is Rental Commitments -
Minimum 1st Year; MRC2 is Rental Commitments - Minimum 2nd Year; MRC3 is Rental Commitments -
Minimum 3rd Year; MRC4 is Rental Commitments - Minimum 4th Year; MRC5 is Rental Commitments -
Minimum 5th Year; NI is Net Income (Loss); OANCF is Operating Activities Net Cash Flow; OB is Order
Backlog; PPEGT is Property Plant and Equipment - Total (Gross); PPENT is Property Plant and Equipment - Total
(Net); PPROR is Pension Plans - Anticipated Long-Term Rate of Return on Plan Assets; PRCC_F is Price Close -
Annual - Fiscal Year; PSTK is Preferred/Preference Stock (Capital) - Total; RE is Retained Earnings; RECD is
Receivables - Estimated Doubtful; RECT is Receivables - Total; SALE is Sales/Turnover (Net); SIC is SIC Code;
SSTK is Sale of Common and Preferred Stock; TXDI is Income Taxes - Deferred; TXP is Income Taxes Payable;
TXT is Income Taxes - Total; WCAP is Working Capital (Balance Sheet); XINT is Interest and Related Expense -
Total; and XOPR is Operating Expense. We also included controls for year and industry (two-digit SIC code).
c Similar variable used in both Dechow et al. (2011) and Perols (2011).
d Variable construction based on the Financial Kernel in Cecchini et al. (2010).
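Most variables in this appendix follow a common level / change / % change / abnormal % change pattern. That pattern can be expressed as one generic helper. The sketch below is purely illustrative: the function name `ratio_metrics` and its scalar inputs are assumptions, the paper computes these variables from Compustat items, and `industry_pct_change` stands in for the industry-level INDUSTRY(.) term.

```python
def ratio_metrics(num, den, num_prev, den_prev, industry_pct_change):
    """Level, change, % change, and abnormal % change of a ratio,
    e.g., return on assets with num=NI, den=AT (current and prior year)."""
    level = num / den
    prev = num_prev / den_prev
    change = level - prev                        # e.g., NI/AT - NI_t-1/AT_t-1
    pct_change = change / prev                   # scaled by the prior-year level
    abnormal = pct_change - industry_pct_change  # firm % change minus industry % change
    return level, change, pct_change, abnormal
```

For example, with net income rising from 10 to 12 on constant assets of 100 and an industry % change of 0.05, the % change in return on assets is 0.20 and the abnormal % change is 0.15.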
Figure 1 Multi-subset Observation Undersampling (OU)
Notes:
Column 1 represents the raw data with the fraud observations stacked on top and non-fraud cases below. Column 1
also shows that model building and out-of-sample data are kept separated. Column 2 shows the data subsets that are
created based on the OU method. All fraud data are used in each subset while the non-fraud data are under-sampled
to address data rarity within each subset. Cumulatively across all subsets, all of the non-fraud data can be used, but
a single non-fraud observation is only used in one subset. In column 3, a classification algorithm is used to build
one prediction model per subset with the goal of accurately classifying firms into fraud or non-fraud cases. Each
model is then applied out-of-sample and generates a fraud probability prediction for each observation in the out-of-
sample data. In column 4, for each out-of-sample observation, the individual fraud prediction probabilities are then
combined to arrive at an overall combined fraud probability prediction for each observation.
More formally, let M = {f1, f2, f3, …, fk} be a set of k fraud observations f and let C = {c1, c2, c3, …, cn} be a set of n non-
fraud observations c, where M is the minority class, i.e., k < n. Note that the union of M and C, i.e., M ∪ C, forms a
set that contains k fraud and n non-fraud observations. To achieve a more balanced dataset, d non-fraud
observations c are removed from the non-fraud set C, where 0 < d ≤ n - k. However, instead of deleting these
removed non-fraud observations, OU segments the non-fraud observations into n / (n - d) or fewer subsets Ui that
each contains n - d different non-fraud observations c, i.e., C = {U1, U2, U3, …, Un/(n-d)}. Note that all subsets Ui
contain mutually exclusive (disjoint) sets of non-fraud observations, Ui ∩ Uj = ∅ for i ≠ j. OU then combines all
fraud observations, i.e., the entire set M, with each Ui to create subsets Wi. OU thus creates up to n / (n - d) subsets
Wi that contain all fraud observations f and n - d unique non-fraud observations c. Each subset Wi is then used to
build a prediction model that is used to predict out-of-sample observations. In our experiments, OU is only used on
the model building data and the model evaluation data is left intact. Finally, for each out-of-sample observation, the
different prediction models' probability predictions are averaged into an overall probability prediction for each
observation.
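The subset construction and probability averaging described above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the authors' implementation: `ou_subsets` and `combined_probability` are hypothetical helper names, a fixed seed stands in for the random sampling, and the per-subset classifiers (support vector machines in the paper) are abstracted as callables that return a fraud probability.

```python
import random

def ou_subsets(fraud, nonfraud, subset_size, max_subsets=None, seed=0):
    """Pair ALL fraud observations with disjoint, randomly drawn subsets
    of non-fraud observations (one OU subset per prediction model)."""
    rng = random.Random(seed)                 # fixed seed for reproducibility
    pool = list(nonfraud)
    rng.shuffle(pool)
    subsets = []
    while len(pool) >= subset_size:
        chunk, pool = pool[:subset_size], pool[subset_size:]
        subsets.append(list(fraud) + chunk)   # every subset reuses all fraud cases
        if max_subsets is not None and len(subsets) == max_subsets:
            break
    return subsets

def combined_probability(models, observation):
    """Average the per-subset models' probability predictions."""
    return sum(model(observation) for model in models) / len(models)
```

For example, with 3 fraud and 10 non-fraud observations and a subset size of 5, `ou_subsets` yields two training sets of 8 observations each whose non-fraud portions do not overlap.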
Figure 2 Multi-subset Variable Undersampling (VU)
Notes:
Column 1 represents the raw data that include all explanatory variables used to predict fraud. These explanatory
variables are partitioned into different subsets represented by the vertical lines. Each subset contains a subset of the
explanatory variables and all of the observations. Column 1 also shows that model building and out-of-sample data
are kept separated. In column 2, a classification algorithm is used to build one prediction model per variable subset
with the goal of classifying firms into fraud vs. non-fraud cases. Each prediction model is then applied out-of-
sample to generate a fraud probability prediction for each observation in the out-of-sample data. In column 3, for
each out-of-sample observation, the fraud prediction probabilities from the different prediction models are combined
to arrive at an overall combined fraud prediction probability for each observation.
More formally, let W denote a dataset with m variables x, i.e., W = {x1, x2, x3, …, xm}. VU reduces data dimensionality
by randomly dividing the variables in W into q subsets X, where each X contains m/q variables, i.e., the following
variable subsets are created by VU: X1 = {x1, x2, x3, …, xm/q}, X2 = {xm/q+1, xm/q+2, xm/q+3, …, x2m/q}, X3 = {x2m/q+1, x2m/q+2,
x2m/q+3, …, x3m/q}, …, Xq = {xm-m/q+1, xm-m/q+2, xm-m/q+3, …, xm}. The subsets X are then used to build q prediction models.
The prediction models are then (i) used to predict out-of-sample observations and (ii) for each out-of-sample
observation, the prediction models' probability predictions are combined into an overall prediction for each
observation by taking an average of the individual probability predictions.
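The random variable partitioning can be sketched analogously to OU. This is an illustrative sketch only: the helper name `vu_subsets` is an assumption, and q is assumed to divide the number of variables evenly, as in the constant-subset-size version of the experiment.

```python
import random

def vu_subsets(variables, q, seed=0):
    """Randomly divide the m explanatory variables into q disjoint groups
    of m/q variables each; one prediction model is then built per group,
    and the models' probability predictions are averaged as in OU."""
    names = list(variables)
    random.Random(seed).shuffle(names)   # random assignment of variables to groups
    size = len(names) // q
    return [names[i * size:(i + 1) * size] for i in range(q)]
```

For example, 12 variables with q = 3 produce three disjoint groups of four variables whose union is the full variable set.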
Figure 3 Experimental Procedures: Multi-subset Observation Undersampling (OU) Example
[Flowchart summary.] Starting from the raw data, 10-fold cross-validation is performed. Within each round n, for
each OU implementation l = {1, 2, 3, …, 20}: (i) l OU subsets are created from the round n training data; (ii) one
prediction model is built per subset, yielding l prediction models; (iii) the l models are applied to the round n test
data, producing l probability predictions per observation; and (iv) for each round n test data observation, the l
probability predictions are averaged into a combined prediction. The loop repeats until l = 20 and then advances to
the next fold until n = 10.
Figure 4 Multi-subset Observation Undersampling (OU) with Different Numbers of Subsets -
Percentage Performance Improvement Relative to Benchmark
[Line graph: ECM % improvement (y-axis, 0 to 15 percent) against the number of OU subsets (x-axis, 1 to 20),
with three series: original order, new order, and new subsets.]
Notes:
ECM is calculated using a 0.6 percent fraud probability and a 30:1 false negative to false positive cost ratio.
As discussed in the text, three versions of the experiment were conducted. Original order refers to the main
OU experiment; new order refers to the analysis in which the OU subsets are selected in a different order; and
new subsets refers to the analysis in which the random sampling of non-fraud cases is repeated using a different
random draw.
The benchmark is simple undersampling (Perols 2011), which randomly removes non-fraud observations from
the sample to generate a more balanced training sample. This benchmark performs better than a benchmark that
includes all fraud and non-fraud observations. OU and the OU benchmarks use all variables (independent
variable reduction is examined in the VU analysis) and are implemented using support vector machines.
Figure 5 Multi-subset Variable Undersampling (VU) with Different Numbers of Subsets of
Explanatory Variables - Percentage Performance Improvement Relative to Benchmark
[Line graph: ECM % improvement (y-axis, -6 to 8 percent) against the number of VU subsets (x-axis, 2 to 20),
with two series: constant number of variables in each subset and all variables in each round.]
Notes:
ECM is calculated using a 0.6 percent fraud probability and a 30:1 false negative to false positive cost ratio.
As discussed in the text, two versions of the experiment were conducted. The constant number of variables in
each subset experiment (the dashed line) uses subsets that contain five or six variables in each subset; the all
variables in each round experiment (the round dotted line) uses all variables in each experimental round by
randomly dividing all 109 variables into different subsets (consequently, as the number of subsets increases, the
number of variables in each subset decreases).
The all variables in each round experiment only manipulates the number of VU subsets in even increments.
The benchmark contains six randomly selected variables (from the 109 variables described in Appendix A) and
is equivalent to the VU implementation with only one subset. This benchmark performed better than
benchmarks implemented using (i) all the variables in the dataset and (ii) the variables selected in Dechow et al.
(2011), i.e., RSST accruals, change in receivables, change in inventory, soft assets, percentage change in cash
sales, change in return on assets, actual issuance of securities, abnormal change in employees, and existence of
operating leases. VU and the VU benchmarks use all fraud and non-fraud observations (observation
undersampling is examined in the OU analysis) and are implemented using support vector machines.
Figure 6 Performance of combinations of OU, PVU, and SMOTE
Percentage Performance Improvement Relative to OU(12)
[Line graph: ECM % difference relative to OU(12) (y-axis, -12 to 8 percent) against the false negative to false
positive cost ratio (x-axis, 1:1 to 100:1), with four series: PVU + OU(12) + SMOTE(600), PVU + OU(12), PVU,
and SMOTE(600).]
Notes:
OU is Multi-subset Observation Undersampling. OU(12) represents the best performing individual OU
implementation.
PVU is Multi-subset Variable Undersampling partitioned on fraud type.
SMOTE(600) is Multi-subset Observation Oversampling with an oversampling ratio of 600 percent. This represents
the best performing SMOTE implementation.
ECM is calculated assuming an evaluation fraud probability of 0.6 percent.
TABLE 1
Summary of Experiments
Notes:
a Since we introduce OU to the fraud detection literature to reduce the imbalance between the number of fraud and the number of non-fraud observations, we use
simple undersampling as a benchmark (Perols 2011) when evaluating the performance of OU. This benchmark randomly removes non-fraud observations from
the sample to generate a more balanced training sample. We also use no undersampling as an additional benchmark. However, simple undersampling performs
on average 7.3 percent better than no undersampling, and we consequently report only simple undersampling. OU and the OU benchmarks use all variables (as
data dimensionality reduction is examined in the VU analysis). VU is introduced as a data dimensionality reduction method that is argued to improve the
performance over currently used variable selection methods. As a baseline, we use a benchmark that was created using the variables included in Dechow et al.
(2011) model 2 (the Dechow benchmark): RSST accruals, change in receivables, change in inventory, soft assets, percentage change in cash sales, change in
return on assets, actual issuance of securities, abnormal change in employees, and existence of operating leases. This model compares different fraud detection
variables with the objective of creating a parsimonious fraud prediction model. We also use (i) a benchmark that randomly selects variables and (ii) a benchmark
that includes all variables (the all variables benchmark), i.e., where data dimensionality is not reduced. The benchmark that randomly selects variables performs
better than the Dechow benchmark and the all variables benchmark. More specifically, VU with 12 variable subsets performs on average 7.2 percent better than
both the all variables benchmark and the benchmark based on Dechow et al. (2011). Thus, we report our results using the benchmark that randomly selects
variables. VU and the VU benchmarks use all fraud and non-fraud observations. Following recent fraud prediction research (e.g., Cecchini et al. 2010) and
findings in Perols (2011), all prediction models are implemented using support vector machines. Sensitivity analyses are used to examine other classification
algorithms.
b Perols (2011) finds that a simple undersampling ratio of 20 percent provides relatively good performance compared to other undersampling ratios.
c More specifically, we first create one subset and examine the performance of OU with this single subset. We then create a second subset and use this subset
along with the previously created subset to evaluate the performance of OU with two subsets. Note that while it is possible to derive a total of 41 subsets
following Chan and Stolfo's (1998) approach, the addition of another OU subset is only valuable if the additional subset contains new information. We expect
that the marginal benefit of adding an additional subset decreases as the total number of subsets in OU increases. Additionally, for each subset that is added,
another prediction model has to be built, used for prediction, and combined with the other prediction models predictions. Thus, there is a computational cost
associated with increasing the number of subsets used. Based on this and the results that indicate that the performance benefit tapers off around 12 subsets, we
do not extend the experiment beyond 20 subsets.
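As an illustration of the multi-subset OU procedure described above, the following Python sketch builds the undersampled subsets (each retaining all fraud observations plus a fresh random draw of non-fraud observations at a 20 percent fraud ratio) and combines per-subset model scores by averaging. The function names and the averaging rule are our assumptions for illustration; the paper's prediction models are support vector machines, which we omit here.

```python
import random

def multi_subset_ou(fraud, nonfraud, n_subsets, ratio=0.2, seed=0):
    """Build OU subsets: each keeps every fraud observation and adds a
    random sample of non-fraud observations so frauds make up `ratio`
    of the subset (Perols 2011 reports that a 20 percent ratio works
    relatively well). One prediction model is then trained per subset."""
    rng = random.Random(seed)
    n_nonfraud = round(len(fraud) * (1 - ratio) / ratio)
    subsets = []
    for _ in range(n_subsets):
        sample = rng.sample(nonfraud, min(n_nonfraud, len(nonfraud)))
        subsets.append(fraud + sample)
    return subsets

def combine_predictions(model_scores):
    """Combine the per-subset models' fraud scores by simple averaging
    (one plausible combination rule; the paper does not pin one down here)."""
    return [sum(scores) / len(scores) for scores in zip(*model_scores)]
```

With 10 fraud and 200 non-fraud observations and `ratio=0.2`, each subset contains the 10 frauds plus 40 sampled non-frauds, so adding subsets reuses all frauds while covering more of the non-fraud data.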
TABLE 2
Multi-subset Observation Undersampling (OU)
Performancea - Increasing the Number of Subsets

Number of                 Percentage Difference
OU Subsets      ECM       to Benchmark             p-valueb
Benchmarkc      0.160
 2              0.156      2.3%                    0.146
 3              0.151      5.4%                    0.015
 4              0.151      5.4%                    0.036
 5              0.148      7.3%                    0.031
 6              0.149      6.7%                    0.039
 7              0.148      7.4%                    0.012
 8              0.146      8.9%                    0.005
 9              0.145      9.3%                    0.005
10              0.143     10.8%                    0.003
11              0.142     11.1%                    0.003
- 48 -
TABLE 3
Prediction Performancea,b of OU and PVU on a
Material Misstatements Hold-Out Sample
Notes:
a
Prediction performance is evaluated using 10-fold cross-validation in which separate datasets are used for model
building vs. model evaluation. Performance is area under the ROC curve (AUC). AUC provides a numeric
value of how well the prediction model ranks the observations in the test sets and represents the probability that a
randomly selected positive (misstatement) instance is ranked higher than a randomly selected negative (non-
misstatement) instance. An AUC of 0.5 is equivalent to a random rank order while an AUC of 1 is perfect
ranking of the evaluation cases.
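The rank-probability definition of AUC in note a can be computed directly: count, over all positive-negative pairs, how often the positive observation receives the higher score. The sketch below is illustrative only; the function name is ours, and ties are scored as one half, a common convention.

```python
def auc_rank_probability(scores_pos, scores_neg):
    """AUC as the probability that a randomly selected positive
    (misstatement) observation is ranked above a randomly selected
    negative (non-misstatement) one; ties count as one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

A model that scores every misstatement above every non-misstatement yields 1.0; identical scores for both classes yield 0.5, the random-ranking baseline described in the note.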
b
The results in Panel A compare the performance of OU and VU to the Dechow benchmark using material
misstatement data (all methods and benchmarks are implemented using support vector machines; Panel B reports
results when other classification algorithms are used). This comparison provides further validation of the results
reported earlier on fraud data and provides insight into the usefulness of the proposed methods in a slightly
different setting. The results in Panel B examine the sensitivity of the proposed methods to the use of other
classification algorithms, i.e., logistic regression and bootstrap aggregation. Please see footnotes 4, 22, and 26 in
the text for details about support vector machines and bootstrap aggregation. The results in Panel C compare the
performance of the financial kernel from Cecchini et al. (2010) with and without OU (both implementations use
support vector machines). This analysis provides insight into (i) the usefulness of OU when used in combination
with a different set of independent variables (created using the financial kernel of Cecchini et al. (2010)) and (ii)
whether OU provides incremental predictive power when used in combination with the financial kernel.
c
In Panels A and B, given the source, i.e., Dechow et al. (2011), and the nature of the material misstatement data,
we use the Dechow et al. (2011) benchmark in these comparisons. This benchmark is based on model 2 from
Dechow et al. (2011): material misstatement = RSST accruals + change in receivables + change in inventory +
soft assets + percentage change in cash sales + change in return on assets + actual issuance of securities +
abnormal change in employees + existence of operating leases. The independent variables in this model were
selected using a material misstatement sample that is similar to the sample used in this experiment. Because the
entire sample was used when selecting these variables, it is possible that the benchmark performance represents an overfitted model. In this experiment, OU uses all 107 variables, but undersamples the non-fraud
observations using the OU method. PVU uses all data, but partitions the original 107 variables based on fraud
types.
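Note c describes PVU as partitioning the 107 original variables by fraud type, with one model trained per partition on all observations. A minimal sketch of that partitioning step, assuming a hypothetical variable-to-fraud-type mapping (the variable names and type labels below are ours, not the paper's):

```python
def partition_variables(variables, fraud_type_map):
    """Split the full variable set into subsets by the fraud type each
    variable is thought to capture; variables without a mapping fall
    into a catch-all group. One model would then be trained per group."""
    groups = {}
    for v in variables:
        groups.setdefault(fraud_type_map.get(v, "other"), []).append(v)
    return list(groups.values())
```

This is also why note d reports that PVU is not implemented for the financial kernel: its 1,518 machine-generated ratio variables lack an obvious mapping to fraud types, so no such partition can be constructed.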
d
The financial kernel consists of 1,518 independent variables representing current and lagged ratios and changes
in the ratios of 23 financial statement variables commonly used to construct independent variables in fraud
research. In this experiment, OU is implemented using the same 1,518 independent variables and support vector
machines. PVU is not implemented in this experiment, as it is not clear how to partition the 1,518 independent
variables into different fraud categories.
e
p-values are one-tailed based on pairwise t-tests using the average and standard deviation of ECM scores across
the ten test folds. Assumptions related to normality and independent observations are unlikely to be satisfied and
p-values are only included as an indication of the relation between the magnitude and the variance of the
difference between each implementation and the benchmark.
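The fold-level tests in note e can be sketched as a paired t statistic over the per-fold ECM scores of a method and the benchmark; a one-tailed p-value would then be read from a t distribution with k - 1 degrees of freedom. This is our reading of the note, not the paper's code, and, as the note cautions, the normality and independence assumptions are unlikely to hold.

```python
import math

def paired_t_statistic(ecm_method, ecm_benchmark):
    """Paired t statistic across k cross-validation folds. Positive
    differences (benchmark minus method) favor the method, since
    lower ECM is better. Returns (t, degrees of freedom)."""
    diffs = [b - m for m, b in zip(ecm_method, ecm_benchmark)]
    k = len(diffs)
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)
    return mean / math.sqrt(var / k), k - 1
```

A large positive t with small fold-to-fold variance corresponds to the small p-values reported in Table 2; a noisy difference across folds would inflate the denominator and weaken the evidence.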
TABLE 4
Hypothesis Testing: Results on Full Sample Logistic Regressions versus 12 OU Subsamples Logistic Regressions
Note: Average estimates, standard deviation estimates, and average p-values are based on estimates and p-values from the 12 OU subsample logistic regression
results. P-values less than 0.0001 were converted to 0.0001 before taking the average.