
Binary Classification Performance Measures/Metrics:

A Comprehensive Visualized Roadmap to Gain New Insights

Gürol Canbek
METU and HAVELSAN, Ankara, Turkey
gcanbek@havelsan.com.tr

Seref Sagiroglu
Gazi University, Ankara, Turkey
ss@gazi.edu.tr

Tugba Taskaya Temizel
METU, Ankara, Turkey
ttemizel@metu.edu.tr

Nazife Baykal
METU, Ankara, Turkey
baykal@metu.edu.tr

Abstract: Binary classification is one of the most frequent studies in applied machine learning problems in various domains, from medicine to biology to meteorology to malware analysis. Many researchers use some performance metrics in their classification studies to report their success. However, the literature has shown widespread confusion about the terminology and ignorance of the fundamental aspects behind metrics. This paper clarifies the confusing terminology, suggests formal rules to distinguish between measures and metrics for the first time, and proposes a new comprehensive visualized roadmap in a leveled structure for 22 measures and 22 metrics for exploring binary classification performance. Additionally, we introduce novel concepts such as canonical notation, duality, and complementation for measures/metrics, and suggest two new canonical base measures that simplify equations. It is expected that the study will guide other studies toward a standardized approach to performance metrics for machine learning based solutions.

Keywords: binary classification; classification performance; metrics; measures; machine learning; visualization; ontology

I. INTRODUCTION

Machine learning classification performance, an important subject in several domains, states how well a classifier that implements a specific machine learning algorithm or model distinguishes between classes. The most basic and studied classification type is binary classification or two-class classification, which separates a given input into two opposite classes such as 'presence' vs. 'absence' of a disease or a condition, 'respond' vs. 'no respond' for a treatment [1], 'spam' vs. 'non-spam' for an e-mail, and 'malign' vs. 'benign' for software.

Stating or comparing classification performance with only the 4 base measures is neither suitable nor understandable. Therefore, several metrics have been proposed for evaluating classification performance. Area Under (ROC) Curve (AUC), which has its origins in signal detection theory in the 1970s, is considered one of the best metrics to state the performance [2], but there are other combined metrics that are useful for indicating the successful and unsuccessful aspects of a classification model. Some represent performance from a specific point of view while ignoring the others. Many researchers who design a classification model report a narrow set of metrics, causing misperceptions.

In this study, we approach performance metrics from a holistic perspective covering the wide range of the subject. This is important for emerging domains such as malware classification or other new machine learning classification applications that focus on implementation details and are acquainted with only a few, potentially misleading, metrics such as Accuracy (ACC), True Positive Rate (TPR) or F-measures to claim their success. Researchers who want to improve their machine learning algorithms on different domain problems and compare their test results with others have difficulty understanding performance metrics and selecting the most proper ones from the wide set of possibilities. For this reason, an originally developed, visually enhanced performance metrics roadmap is designed as a chart based on the confusion matrix to help these researchers.

The proposed comprehensive roadmap shows the complete set of primary metrics: not only the common ones such as ACC, TPR, True Negative Rate (TNR), False Positive Rate (FPR), Positive Predictive Value (PPV), False Negative Rate (FNR), and F1 score, but also others such as Prevalence, (Label) Bias, INFORM (informedness), MARK (markedness), BACC (balanced accuracy, also known as strength), Gm (G-mean), Cohen's Kappa (CK), and Matthews Correlation Coefficient (MCC).

The roadmap is domain independent and useful in all data mining, machine learning and statistics studies. We intend this study also to serve as a reference covering all the primary measures and metrics, with their equations specifically arranged in the binary classification context. We also reviewed the names of metrics used interchangeably in academic and online resources and included them here in order to suit different naming conventions. The corresponding terminology in other domains such as meteorology, medicine, or statistics is provided to show the synonyms of the measures and metrics.

978-1-5386-0930-9/17/$31.00 ©2017 IEEE

(UBMK'17) 2nd International Conference on Computer Science and Engineering


The aim of our study is to review, summarize and clarify the large number of primary binary classification performance measures and metrics from a structural point of view so that the individual measures/metrics and their dependencies and relations become visible and can be understood easily in a natural way. The study makes the following contributions:

• Reviewed and clarified performance measure/metric terminology,
• Stated the difference between measures and metrics and other confused terms such as scores, indicators, criteria, factors or indices,
• Introduced canonical forms for performance equations,
• Proposed formal rules to decide whether an equation is a measure or a metric,
• Provided a systematic approach in which the performance measures and metrics are defined within a leveled structure showing dependencies and relations,
• Discovered duality and complementation in measures and metrics and presented notations for expressing them,
• Suggested the diagonal and off-diagonal totals as new additional measures: True Classification (TC) and False Classification (FC),
• Provided equations to calculate measures and metrics in binary classification notation as a reference study,
• Transformed the related equations into a new simplified canonical form for easy interpretation, and
• Designed a roadmap visualizing 22 measures and 22 metrics in all-in-one style so that the newly designed geometry and leveling approach provide a better understanding of them with their relations and dependencies.

The rest of the paper is organized as follows. Section II surveys the literature on performance metrics. This section is short due to the lack of space; nevertheless, it provides many resources examining different aspects of performance metrics. Section III provides a clear distinction among the confused terms 'metrics', 'measures', and 'indicators'. Section IV presents and describes our comprehensive visualized roadmap for binary classification performance metrics and describes the base and 1st level measures. It also describes the two new measures proposed by us. Section V introduces our new canonical form approach to measure and metric equations and our suggestion of formal rules to define a given equation as a measure or a metric. Section VI reintroduces duality and complementation in measures and metrics with a proposed notation to state them. Section VII describes the 2nd level measures. Section VIII describes the base and conventional 1st level metrics with column/row geometry, examines Accuracy as the most used performance metric in studies, and addresses the complicated usage of the Accuracy metric. Section IX briefly covers the 1st and 2nd level metrics that are rarely used in binary classification studies in machine learning. The final section summarizes the goals and outlines the contributions.

II. LITERATURE REVIEW

Performance measures and metrics are historically based on binary similarity, distance or association measures and metrics, and several studies in several domains examine them [3]–[6]. There are hundreds of measures and metrics defined in the literature [7]. However, many of them are duplicates, including the ones we provide in this study [7], [8]. Relatively recent works have concentrated on performance measures and metrics and their properties related to classification (skewness effect [9], accuracy paradox [10], multi-class [11], chance correction [12], usage cost [13], constraints [14], chronology [7], equation patterns [3], requirements and recommendations [15]). Lavesson and Davidsson examine the conventional performance metrics from the perspective of quality attributes and metrics in the software engineering domain (robustness and complexity along with performance, which includes accuracy, space, and time sub-attributes) [16]. Similar to performance metrics, the combination of binary association measures is practically limitless. Paradowski proposed generalized measures/metrics based on coefficients [6], whereas Koyejo et al. introduced a generalized performance metric [17].

III. THE CLARIFICATION OF TERMINOLOGY

Classification performance is calculated, presented and compared by 'performance metrics', which are also called 'performance measures', 'evaluation measures', and 'prediction scores', in supervised learning where a training set is provided with known inputs along with the correct labels (outcomes, i.e. the corresponding class for given inputs). Classification performance is named differently in other domains, such as 'diagnostic (test) accuracy' in medicine [18] or 'skill score' or 'forecast skill' in meteorology (forecast vs. observation) [19]. 'Categorization' is commonly used in philosophy and statistics instead of 'classification' [2].

Although 'metrics', 'scores', 'measures', 'indicators', 'criteria', 'factors' or 'indices' seem the same and are used interchangeably to state classification performance, there are slight but important differences. Surprisingly, the confusion over such terms is widespread in the literature. Even studies related to classification performance use incorrect terminology and, to the best of our knowledge, this study is the first one that clarifies the terminology semantically, in this section, and formally, as explained in Section V.

As specifically examined from a general perspective by Texel [20], measures, at the bottom of the pyramid hierarchy of concepts, are numerical values with little or no context; metrics, which are above measures, are a collection of measures in context; and indicators, at the top, are the comparison of metrics to a baseline. 'Score' has a similar meaning to 'measure'. Therefore, the correct usage in classification is 'performance metrics'.

Another terminological clarification that we address is about the four types of direct results of binary classification, namely (the number of) True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN), which are explained in Section IV.A. These values, which are displayed in the 2 rows by 2 columns of the confusion matrix or contingency table for binary classification, should be stated as classification 'measures', not 'metrics', etc. Performance metrics are calculated based on these base measures, and some other measures are derived from those base measures. With these terminology clarifications (i.e. measure vs. metric) in mind, one can see that many past and even current classification studies in the literature use incorrect terminology and intermingle the terms.
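As a minimal sketch of the four base measures discussed above, the following Python snippet counts TP, FP, FN, and TN from two label sequences. The function name `base_measures`, the `BaseMeasures` container, the example labels, and the encoding of the positive class as 1 are assumptions made for illustration; they do not come from the paper.

```python
from collections import namedtuple

# Hypothetical container for the four base measures (names follow the text).
BaseMeasures = namedtuple("BaseMeasures", ["TP", "FP", "FN", "TN"])


def base_measures(actual, predicted, positive=1):
    """Count the four base measures from actual and predicted labels.

    Assumption: any label not equal to `positive` is treated as negative.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, really positive
        elif p == positive:
            fp += 1  # predicted positive, really negative
        elif a == positive:
            fn += 1  # predicted negative, really positive
        else:
            tn += 1  # predicted negative, really negative
    return BaseMeasures(tp, fp, fn, tn)


# Illustrative (made-up) labels: 1 = positive class, 0 = negative class.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
bm = base_measures(actual, predicted)
print(bm)
```

Everything that the later sections call a measure or a metric can be derived from these four counts alone, which is why the text treats them as the base of the hierarchy.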



IV. THE PROPOSED COMPREHENSIVE VISUALIZED ROADMAP FOR BINARY CLASSIFICATION PERFORMANCE

We have designed Figure 1 to provide a visualized roadmap for binary classification performance measures and metrics in all-in-one style. The 22 measures and 22 metrics that are described in the next sections are shown in one chart, like the periodic table of elements. Gray colored cells correspond to measures and orange colored ones correspond to metrics; they are grouped in a leveled structure and positioned according to similarities and row/column geometries. The metrics and measures on which each equation depends are also shown in the roadmap. Measures and metrics are built upon a total of 4 and 3 levels, respectively. The roadmap provides a neat distinction between measures and metrics and describes measures, metrics, their similarities, dependencies, and types via geometrical and visualization techniques.

The roadmap also presents useful information as explained in the legend. The names of measures and metrics that have no upper limit are written in bold. The numbering for measures is written in italic. The measures and metrics to the left of the confusion matrix square are row type (depending solely upon base measures, Sample Size (Sn), and OP or ON), whereas the ones above the confusion matrix square are column type (the same as row type, but OP or ON are replaced by the number of condition Positive (P) and Negative (N)). Row types are related to testing or prediction whereas column types are related to reality, as described further below. The roadmap with these details is worthy of the 'all-in-one' attribution. Note that the AUC metric is out of scope in this study.

In this section and the next sections, we provide all the performance measure and metric equations in binary classification specific terms and give equivalent or derived canonical equations to assist in interpreting them easily. The measures that represent all combinations of a classification are stated by combinations of two letter groups: 'T: True or matches' vs. 'F: False or non-matches' for classification correctness, with 'P: Positive' vs. 'N: Negative' for representation of the two classes. They are presented in a 2x2 confusion matrix or contingency table.

A. Confusion Matrix, The Four Base Performance Measures

As stated in Section III, we call the four direct outputs of classification performance (TP, FP, FN, TN) base measures in a binary classification with a supervised learning approach. True classification results or prediction/reality matches (TP and TN) are located on the diagonal, whereas False classification results, non-matches or errors (FP and FN) are off-diagonal in the confusion matrix, as seen in Figure 1.

In critical engineering and medical practices, type II errors, False Negatives, could be more serious or worse than type I errors, False Positives. But the proper approach depends on the domain and its specific application. For example, in malware analysis, it could be better to mistakenly label benign software as malign than to miss malign software by incorrectly labeling it as benign (labeled malware is prioritized, and an expert could then go through further manual malware analysis to eliminate false positives [21]). From a legal or social perspective, however, the opposite is likely to be valid to ensure the presumption of innocence. An anti-malware product should be designed or configured to decrease the False Positives (or False Alarms) to avoid annoying interruptions due to excessive malware warnings. In criminal justice, by contrast, precautionary logic focuses more on underestimates (False Negatives) than on overestimates (False Positives) [22].

B. The Proposed Performance Measures (TC, FC) and The 1st Level Measures (P, N, OP, ON, Sn)

The 1st level performance measures are one level above the base performance measures. The condition Positive (P) and condition Negative (N) measures, which are the column totals (also known as marginal totals in probability theory) of the confusion matrix, represent the real or actual values of the two classes (i.e. the real labels). These measures correspond to the reality, observation, or ground truth. The OP and ON measures, which are the row totals (also marginal totals) of the confusion matrix, represent the predicted values (test or classification results) of the two classes (i.e. the given labels). These measures correspond to the prediction or estimate (classification output).

When we examined many measures and metrics, we saw that some of them have specific elements that are the totals of the diagonal base measures (TP and TN) and of the off-diagonal ones (FP and FN). For the first time, we have introduced and named those totals True Classification (TC) and False Classification (FC), respectively. Substituting those totals significantly simplifies the metric equations and their interpretation. The Sn measure, which is the sum of the four base performance measures, can be stated as P plus N, as OP plus ON, or as TC plus FC.

Before going into the details of the performance measures and metrics based on the measures explained above, two attributes of them, namely the duality and complement concepts, are introduced and described. The literature has not addressed metric duality and complements sufficiently. We also introduce the canonical transformation of the equations and propose formal rules to define classification performance measures and metrics. In the abundance of metrics, these attributes that we present in this study make measures and metrics more comprehensible.

V. A NEW FORMAL VIEW: CANONICAL FORMS AND DEFINITIONS OF MEASURES AND METRICS

In this study, we suggest canonical forms of measure and metric equations that are defined as follows:

Definition: Canonical Forms
The forms where equations are stated by the 11 base measures and 1st level measures (i.e. TP, FP, FN, TN, P, N, OP, ON, TC, FC, and Sn).

We have clarified the difference between metrics and measures semantically in Section III. Determining whether a given equation is a performance measure or a metric has also not been studied before. Because binary classification performance is related to the TP, FP, FN, and TN base measures, metrics should depend on at least one of them. The following proposed rules are valid for measures. Otherwise, the given canonical form is a metric.
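To make the canonical form concrete, the following sketch derives the seven remaining canonical measures from the four base measures and checks that Sn can be reached in the three ways stated in Section IV.B. The layout (columns for reality, rows for prediction) follows the text; the function name `canonical_measures` and the example counts are hypothetical.

```python
def canonical_measures(tp, fp, fn, tn):
    """Return the 11 canonical measures as a dict, given the 4 base measures.

    Assumed layout (from the text): columns hold the real classes,
    rows hold the predicted classes.
    """
    m = {"TP": tp, "FP": fp, "FN": fn, "TN": tn}
    m["P"] = tp + fn    # condition Positive: column total
    m["N"] = fp + tn    # condition Negative: column total
    m["OP"] = tp + fp   # predicted positive: row total
    m["ON"] = fn + tn   # predicted negative: row total
    m["TC"] = tp + tn   # True Classification: diagonal total
    m["FC"] = fp + fn   # False Classification: off-diagonal total
    m["Sn"] = tp + fp + fn + tn  # sample size
    return m


# Made-up counts, for illustration only.
m = canonical_measures(tp=40, fp=10, fn=5, tn=45)

# Sn can be stated by P + N, by OP + ON, and by TC + FC.
assert m["Sn"] == m["P"] + m["N"] == m["OP"] + m["ON"] == m["TC"] + m["FC"]
print(m)
```

With these totals available, a ratio such as "correct classifications over sample size" can be written directly as TC / Sn, which is the kind of simplification the canonical form is meant to offer.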



Fig. 1. Visualized roadmap for binary classification performance measures/metrics (22 measures and 22 metrics, providing names, abbreviations, alternative names, levels, dependencies, row/column geometries, equations with new canonical forms, and special notes such as ranges other than (0, 1), duals, complements, etc.). See the legend for details. Visit http://bitly.com/metricsroadmap for future updates.

Definition: Measure
• The canonical form includes only P, N, OP, ON, or Sn
• Otherwise, the possible values have no lower (−∞) and/or upper limit (+∞)

VI. THE DUALITY AND COMPLEMENTATION REVEALED IN CLASSIFICATION PERFORMANCE MEASURES/METRICS

Duality is basically related to the transformation of one concept into another concept in a bilateral approach. We introduce the duality concept in classification performance metrics in this study. The proper transformation for performance metric duality lies in the column-row approach on the confusion matrix that we described above. The row versus column transformation corresponds to prediction versus reality. Knowing this transformation, we can become aware of duality formally in every metric or measure.

We propose to adapt the duality notation used for vector spaces, which states the dual of a vector space V with an asterisk superscript (*) as V*. In this notation, we can express the duality among performance measures and metrics. For example, the following are the dualities for the 1st level measures defined in the previous subsections:

\[ P^{*} = OP, \quad N^{*} = ON, \quad OP^{*} = P, \quad ON^{*} = N \]

The symmetry or involution is always valid for the duality of performance measures and metrics. Duality is especially important for transforming a mapping known in one concept to the same mapping in the dual concept. For example, a mapping that exists in one definition of a metric could be transformed to, or sought in, the corresponding dual metric.
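One way to read the row/column transformation is as a transposition of the confusion matrix: swapping the roles of prediction and reality exchanges FP with FN and leaves TP and TN in place. The sketch below illustrates duality under that assumption; the helper names `first_level` and `dual` are hypothetical, the counts are made up, and TPR and PPV are taken in their usual forms TP / P and TP / OP.

```python
def first_level(tp, fp, fn, tn):
    """1st level measures: column totals (P, N) and row totals (OP, ON)."""
    return {"P": tp + fn, "N": fp + tn, "OP": tp + fp, "ON": fn + tn}


def dual(tp, fp, fn, tn):
    """Swap prediction and reality, i.e. transpose the 2x2 confusion matrix.

    Assumption for illustration: duality is modeled as exchanging FP and FN.
    """
    return tp, fn, fp, tn


base = (40, 10, 5, 45)  # TP, FP, FN, TN (made-up counts)
m = first_level(*base)
m_dual = first_level(*dual(*base))

# The column totals of the dual are the row totals of the original, and back.
assert m_dual["P"] == m["OP"] and m_dual["N"] == m["ON"]
assert m_dual["OP"] == m["P"] and m_dual["ON"] == m["N"]

# Involution: applying the transformation twice gives back the original.
assert dual(*dual(*base)) == base

# A column-type metric maps to a row-type one: TP / P becomes TP / OP.
tp = base[0]
tpr = tp / m["P"]
ppv = tp / m["OP"]
assert tp / m_dual["P"] == ppv
print(tpr, ppv)
```

The last assertion shows the practical use the text describes: a relation known for a reality-based metric can be looked for in its prediction-based dual by applying the same swap.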



Because all the performance measures and metrics are for binary classes and are normalized, mostly in the range (0, 1) and rarely in (-1, 1), the ratios can be complemented. Therefore, all the measures and metrics have complements. The complements can be written in this adapted notation:

[Equation not recoverable from the source: complement notation]

In contrast with duality, having both a measure/metric and its complement does not contribute any extra information. A complement could be used for simplifying equations or for switching the primary point of view to another one, such as switching from a positive condition based view (e.g. TPR or PPV) to a negative one (e.g. FNR or FDR) or focusing on errors (i.e. Misclassification Rate, MCR) instead of correctness (i.e. ACC), as illustrated in Figure 1.

VII. THE 2ND LEVEL MEASURES

Prevalence is the ratio of the Positive condition size (P) to the total number of conditions (P + N) or sample size (Sn). Prevalence (PREV), which is called 'Pretest Probability of Admission' in medicine, is related to the reality. (Label) Bias (BIAS) is the ratio of the Outcome Positive size (OP) to the total test size (OP + ON) or sample size (Sn). Bias is related to the prediction (classification output). It is also known as 'Detection Prevalence' or 'Warning Rate'. Prevalence and bias are dual measures. (Class) skew, which is the ratio of the majority class to the minority class (usually N to P), is an important measure along with Prevalence in classification [9].

VIII. BASE AND CONVENTIONAL 1ST LEVEL METRICS

The base metrics are calculated within the measures in confusion matrix columns (vertically) or rows (horizontally), as seen in Figure 1. The column type base metrics (TPR, TNR, FPR, FNR) are the rates of each base measure to the corresponding condition (Positive or Negative). The row type base metrics are the rates of each base measure to the corresponding test output. The first level metrics are the most preferred and known metrics for expressing binary classification performance as a single value.

A. Accuracy (ACC)

Accuracy is the ratio of the total number of correct classifications to the sample size. It is a diagonal metric covering neither rows nor columns of the confusion matrix. It was defined as a similarity measure called the 'matching coefficient' between two individuals characterized by a number of binary attributes by Sokal and Michener in 1958 [23]. Accuracy is the most reported and perhaps the most abused metric in binary classification performance reports, as described below.

B. Minimum Expected Performance and Other Measures

Although the ultimate goal of a classification is achieving the highest accuracy possible, another performance criterion may be disregarded: what is the minimum performance that should be expected of a binary classification? Two measures answer this question, namely Null Error Rate (NER) and No Information Rate (NIR). NER specifies the minimum successful classification rate if we do not have a classification model and always label a given instance as Negative. Any classification should perform better than this limit. NIR, which is an improved version of NER, specifies the minimum performance by counting the larger of the Positive and Negative conditions. Eventually, the following conditions should be achieved at least for a better classification:

\[ ACC > NIR \geq NER \]

Note that a classification may perform close to the NER and NIR measures; this case is called the 'Accuracy Paradox' [10], [24]. Therefore, providing Accuracy as a single performance metric is not sufficient.

CKc, the Kappa Chance coefficient, which is also called 'Random Agreement', consists of column and row totals. Going further in the geometry as seen in Figure 1, the likelihood ratios (LRP, LRN) are row based measures that do not have an upper limit (i.e. they are not in the range (0, 1) as other metrics are). Odds Ratio (OR) is a 3rd level measure. It is the ratio of the Positive Likelihood Ratio to the Negative Likelihood Ratio, which are 2nd level measures. Positives have OR times the odds of being positive compared to Negatives [25]. Discriminant Power (DP) and OR differ from all the other metrics and measures in being unbounded: DP lies in the (−∞, +∞) range, whereas OR, being a ratio of non-negative quantities, lies in [0, +∞).

IX. BEYOND THE CONTROVERSIAL ACCURACY METRIC

A. Informedness, Markedness

Informedness and Markedness are dual metrics; the former is about reality whereas the latter is about prediction. Informedness represents the consolidated true classification capability per condition, so that the performance is about reality. Informedness is equal to the Peirce Skill Score (e.g. in weather forecasting [19]). Markedness, also known as 'Difference in Proportions' (PPV − FOR), represents the consolidated true classification capability per test outcome, so that the performance is about prediction. Markedness is equal to the Clayton Skill Score [19].

B. More TPR-TNR Combinations

BACC and Gm are alternatives to ACC. They are means of the correct classification rates; the former is the arithmetic mean and the latter is the geometric mean of TPR and TNR. BACC is also known as 'Strength'. Gm should not be confused with the Fowlkes–Mallows index, which is the geometric mean of TPR and PPV.

C. F-measures (F1, F0.5, F2, Fβ), Cohen's Kappa (CK), Matthews Correlation Coefficient (MCC)

F-measures are metrics covering two base metrics, one from a column (TPR) and one from a row (PPV). F-measures, which are harmonic means of those two metrics, are parametric metrics that define the weights on TPR and PPV. They are insensitive to TN [26]. F1 is the harmonic mean of precision and recall with equal weights, whereas F0.5 puts more emphasis on PPV and F2 puts more emphasis on TPR. CK, which is a bidirectional metric in the (-1, 1) range, is related to Accuracy [9] (it is equal to the Heidke Skill Score [19]). MCC is also bidirectional in (-1, 1). F-measures, CK, and MCC are the unconventional metrics that we suggest researchers take into account in addition to Accuracy and other metrics.

X. DISCUSSION AND CONCLUSION

Performance metrics are critical instruments for assessing and expressing the success of a binary classification study. Although many metrics are available in the literature, only a few



performance metrics are reported in the conducted studies. As explained in 'Accuracy Paradox', these accustomed metrics have some complications. Using a proper and sufficient number of metrics while comparing different binary classification approaches leads to a more objective assessment.

We have seen that there are many studies in the literature in several domains (botany, meteorology, chemistry, biology, medicine, economics, malware analysis, etc.) in which different metrics and measures are proposed. Because of the domain independence of performance measures, those studies may provide researchers with new alternative resources to explore in other domains, so that knowledge transfer would be possible.

We also have seen that there are many different notation adaptations in the related studies and established different notation conventions for the same metrics in different domains (e.g. recall vs. sensitivity for TPR). Nevertheless, we suggested our specific naming and notation, which still reflects the majority of common conventions in the machine learning classification literature. We preferred abbreviations of metrics that are more explicit for binary classification.

In this study, we have uncovered the semantic and formal distinction between performance measure and metric. Although classification performance measures/metrics are indispensable instruments of classification experiments, the confusing terminology that is widespread even in academic studies has not been clarified before. We have suggested formal rules to determine whether a given equation is a measure or a metric, as well as establishing the terminology. We have provided 44 measures and metrics with their equations and introduced new concepts for handling them, such as canonical forms to use in definitions and relating different measures/metrics by duality/complementation.

In addition, we have designed a comprehensive, visualized, row/column structured, leveled, and all-in-one style roadmap, shown in Figure 1, that could be considered the periodic table of elements of binary classification performance in machine learning. Our new basing and leveling approach in the layout contributes to ordering measures/metrics logically and providing a more comprehensive style. It is expected that the domain-independent roadmap gives researchers a more standard, complete and easier way to understand the performance measures/metrics and their relations and dependencies.

ACKNOWLEDGMENT

G.C. thanks HAVELSAN for supporting this study.

REFERENCES

[1] A. Shaar, T. Abdessalem, and O. Segard, "Pessimistic Uplift Modeling," in 22nd SIGKDD Conference on Knowledge Discovery and Data Mining (ACM SIGKDD), 2016.
[2] C. Sammut and G. I. Webb, Eds., Encyclopedia of Machine Learning. New York: Springer, 2011.
[3] M. J. Warrens, "Similarity Coefficients for Binary Data," Leiden University, 2008.
[4] M. R. Anderberg, "Measures of Association among Variables," in Cluster Analysis for Applications: Probability and Mathematical Statistics: A Series of Monographs and Textbooks, Academic Press, 1973, pp. 82–92.
[5] M. M. Deza and E. Deza, Encyclopedia of Distances. Springer, 2009.
[6] M. Paradowski, "On the Order Equivalence Relation of Binary Association Measures," Int. J. Appl. Math. Comput. Sci., vol. 25, no. 3, pp. 645–657, 2015.
[7] C. Seung-Seok, C. Sung-Hyuk, and C. C. Tappert, "A Survey of Binary Similarity and Distance Measures," J. Syst. Cybern. Informatics, vol. 8, no. 1, pp. 43–48, 2010.
[8] Z. Hubálek, "Coefficients of Association and Similarity, Based on Binary (Presence-Absence) Data: an Evaluation," Biol. Rev., vol. 57, no. 4, pp. 669–689, 1982.
[9] S. Straube and M. M. Krell, "How to evaluate an agent's behavior to infrequent events? Reliable performance estimation insensitive to class distribution," Frontiers in Computational Neuroscience, vol. 8, no. April, pp. 1–6, 2014.
[10] F. J. Valverde-Albacete and C. Peláez-Moreno, "100% classification accuracy considered harmful: The normalized information transfer factor explains the accuracy paradox," PLoS One, vol. 9, no. 1, 2014.
[11] M. Sokolova and G. Lapalme, "A systematic analysis of performance measures for classification tasks," Inf. Process. Manag., vol. 45, no. 4, pp. 427–437, 2009.
[12] V. Labatut and H. Cherifi, "Evaluation of Performance Measures for Classifiers Comparison," Ubiquitous Comput. Commun. J., vol. 6, pp. 21–34, 2011.
[13] B.-G. Hu and W.-M. Dong, "A study on cost behaviors of binary classification measures in class-imbalanced problems," Comput. Res. Repos., vol. abs/1403.7, 2014.
[14] A. Forbes, "Classification-algorithm evaluation: five performance measures based on confusion matrices," J. Clin. Monit. Comput., vol. 11, no. 3, pp. 189–206, 1995.
[15] R. E. Tulloss, "Assessment of Similarity Indices for Undesirable Properties and a new Tripartite Similarity Index Based on Cost Functions," in Mycology in Sustainable Development: Expanding Concepts, Vanishing Borders, 1997, pp. 122–143.
[16] N. Lavesson and P. Davidsson, "Analysis of Multi-Criteria Metrics for Classifier and Algorithm Evaluation," in Proceedings of the 24th Annual Workshop of the Swedish Artificial Intelligence Society, 2007, pp. 11–22.
[17] O. O. Koyejo, N. Natarajan, P. K. Ravikumar, and I. S. Dhillon, "Consistent Binary Classification with Generalized Performance Metrics," Adv. Neural Inf. Process. Syst. 27 Annu. Conf. Neural Inf. Process. Syst. 2014, December 8-13 2014, Montr. Quebec, Canada, pp. 2744–2752, 2014.
[18] K. J. van Stralen, V. S. Stel, J. B. Reitsma, F. W. Dekker, C. Zoccali, and K. J. Jager, "Diagnostic methods I: sensitivity, specificity, and other measures of accuracy," Kidney Int., vol. 75, no. 12, pp. 1257–1263, 2009.
[19] D. S. Wilks, Statistical methods in the atmospheric sciences, 2nd ed., vol. 59. Elsevier, 2006.
[20] P. P. Texel, "Measure, metric, and indicator: An object-oriented approach for consistent terminology," in Proceedings of IEEE Southeastcon, 2013.
[21] G. McWilliams, S. Sezer, and S. Y. Yerima, "Analysis of Bayesian classification-based approaches for Android malware detection," IET Inf. Secur., vol. 8, no. 1, pp. 25–36, 2014.
[22] H. M. Lomell, "Punishing the uncommitted crime: Prevention, pre-emption, precaution and the transformation of criminal law," in Justice and Security in the 21st Century: Risks, Rights and the Rule of Law, 1st ed., B. Hudson and S. Ugelvik, Eds. Abingdon, Oxon, United Kingdom: Routledge, 2012.
[23] D. W. Goodall, "The distribution of the matching coefficient," Biometrics, vol. 23, no. 4, pp. 647–656, 1967.
[24] T. Bruckhaus, "The business impact of predictive analytics," in Knowledge Discovery and Data Mining: Challenges and Realities, X. Zhu and I. Davidson, Eds. Information Science Reference, 2007, pp. 114–138.
[25] M. Szumilas, "Explaining odds ratios," J. Can. Acad. Child Adolesc. Psychiatry, vol. 19, no. 3, pp. 227–229, 2010.
[26] D. M. W. Powers, "What the F-measure doesn't measure: Features, Flaws, Fallacies and Fixes," KIT-14-001, 2015.
