TO THE NIGERIAN FOOTBALL LEAGUE

Uduak Akpan1, Mary I. Akinyemi2

Department of Mathematics/Statistics, University of Calabar, CRS, Nigeria1

Department of Mathematics, University of Lagos, Lagos, Nigeria2

Abstract

Outlier analysis is an exciting aspect of science: finding something totally new, unique and unexpected can lead to a significant scientific discovery or make one's career. This research work addresses the problem of detecting these unusual instances (outliers) in data, instances that are either erroneous or present special/unique cases in the dataset and that can be interesting for gaining new insights into the observed domain.

In professional sports, especially in Nigeria, the decision to recruit players/athletes into the national team is made purely on instinct. Instinct is important, but it may not be enough to make good decisions consistently. We demonstrate that outlier detection analysis can supplement instinct with evidence rooted in data by recognizing players that stand out or have exceptional skills. Hence, an unsupervised ensemble-based outlier detection method is constructed by unifying the outputs of three outlier detection methods, Local Outlier Factor (LOF), Angle-Based Outlier Degree (ABOD) and Subspace Outlier Degree (SOD), via regularization and Gaussian scaling. We also present a heuristic framework for prediction and quantitative performance evaluation of the ensemble. The ensemble was applied to Nigerian football players' performance statistics, and the detected outlier instances were qualitatively evaluated by a sports analyst, confirming the usefulness of the proposed framework in identifying unexpected instances as well as unusual special cases.



1 INTRODUCTION

In many data analysis tasks, a large number of variables are recorded or sampled. One of the first steps towards obtaining a coherent analysis is the detection of outlying observations. Although outliers are often considered an error or noise, they may carry important information. Outliers can arise for several reasons: the measurement may have been incorrectly observed or recorded, or the observed datum may come from a different population than the normal situation, in which case it is correctly measured but represents a rare/special event. Outlier detection methods aim to automatically identify such valuable or disturbing observations in collections of data.

The oldest methods for outlier detection are rooted in probabilistic and statistical models and date back to the nineteenth century (Edgeworth, 1887). The most basic form of outlier detection is extreme univariate analysis. In such cases, it is desirable to determine data values at the tails of univariate distributions, along with a corresponding level of statistical significance. In general, outlier detection techniques can be classified into three main categories, namely supervised, unsupervised and semi-supervised techniques, based on whether or not the response variable is labelled. These categories can be further subdivided into statistical approaches, proximity-based approaches, clustering-based approaches, classification approaches and ensemble-based approaches.

Statistical outlier detection techniques are based on the assumption that inlier data instances occur in the high-probability regions of a stochastic model, while outliers occur in its low-probability regions; examples include the Grubbs test (Grubbs, 1969; Anscombe & Guttman, 1960) and kernel density estimation (Desforges, Jacob, & Cooper, 1998). Classification-based outlier detection techniques operate in a two-phase fashion: the training phase learns a classifier using the available labelled training data, and the testing phase classifies a test instance as an outlier or inlier using that classifier. Any base classification method can be used, provided it is able to output some indication of its confidence in the predictions (Aggarwal, 2013). Proximity-based approaches assume that the proximity of an outlier object to its nearest neighbours deviates significantly from the proximity of most other objects in the data set to theirs; they include the LoOP (Local Outlier Probability) method (Kriegel, Kroger, Schubert, & Zimek, 2009) and distance-based outlier detection (Aggarwal & Yu, 2001; Breunig, Kriegel, Ng, & Sander, 2000; Zhang, Hutter, & Jin, n.d.). Clustering-based approaches detect outliers by examining the relationship between objects and clusters: intuitively, an outlier is an object that belongs to a small and remote cluster, or does not belong to any cluster. Examples of this approach include SNN clustering (Ertoz, Steinbach, & Kumar, 2003) and K-means clustering (Smith, Bivens, Embrechts, Palagiri, & Szymanski, 2002).

The main idea behind the ensemble methodology is to weigh several individual data analysis techniques and combine them in order to obtain a technique that improves significantly on the base methods (Polikar, 2009). The history of ensemble methods dates back to as early as 1977 with Tukey's twicing (Tukey, 1977). Work on ensembles for outlier detection exists but is scattered in the literature and, in comparison to other problems such as classification, not as well formalized. Outlier ensemble methods can be categorized into sequential ensembles, independent ensembles, model-centered ensembles and data-centered ensembles.

In sequential ensembles, one or more outlier detection methods are applied sequentially to either all or portions of the data; thus, depending upon the approach, either the data set or the method may change across the sequential executions (Aggarwal, 2013). A classic example is cluster-based outlier analysis, where more robust clusters are constructed in later stages (Barbara, Li, Couto, Lin, & Jajodia, 2003). In independent ensembles, different instantiations of the method or different portions of the data are used for outlier analysis. Alternatively, the same method may be applied, but with a different initialization, parameter set or, in the case of a randomized algorithm, random seed. The results can then be combined to obtain a more robust outlier score. For example, the methods in Lazarevic & Kumar, 2005 and Liu, Ting, & Zhou, 2008 sample subspaces from the underlying data and determine outliers from each of these executions independently. In data-centered ensembles, different parts, samples or functions of the data are explored in order to perform the analysis. The core idea is that each part of the data provides a specific kind of insight, and by using an ensemble over different portions of the data, it is possible to obtain and combine different insights. One of the earliest data-centered ensembles was discussed in Lazarevic & Kumar, 2005. Model-centered ensembles attempt to combine the outlier scores from different models built on the same data set. The major challenge here is that the scores from different models are often not directly comparable to one another; for example, the outlier score from a k-nearest neighbour approach is very different from the outlier score provided by an angle-based detection model, which causes issues when combining the scores. It is therefore critical to convert the different outlier scores into normalized values which are directly comparable, and preferably interpretable, such as a probability (Zimek, Campello, & Sander, 2013). The broad concept of decision trees can also be extended to outlier analysis by examining those paths with unusually short length, since outlier regions tend to get isolated rather quickly. An ensemble of such trees is referred to as an isolation forest (Liu et al., 2008) and has been used effectively for making robust predictions about outliers.
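
As a brief illustration of the short-path idea (scikit-learn is an assumption of this write-up, not a tool named by the study), an isolation forest can be exercised in a few lines:

```python
# Minimal isolation-forest sketch using scikit-learn on synthetic data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 inliers around the origin plus 5 planted outliers far away
X = np.vstack([rng.normal(0, 1, (200, 5)), rng.normal(8, 1, (5, 5))])

iso = IsolationForest(n_estimators=100, random_state=0).fit(X)
# score_samples returns higher values for "more normal" points, so negate it
# to obtain an outlier score: shorter average path length => larger score.
outlier_score = -iso.score_samples(X)
print(np.argsort(outlier_score)[-5:])  # indices of the most isolated points
```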

2 METHODS

2.1 Framework for Outlier Ensemble Detection

The goal is to develop a framework that enables the detection of outlying data instances, evaluates the performance of the resulting model in collaboration with a domain expert, and builds a predictor for the classes. The design of this framework is as follows (step 1 is sketched in code after this list):

1. Apply baseline outlier detection methods to the data; each returns a set of suspicious instances with their corresponding outlier scores.

2. Build the ensemble by transforming, unifying and combining the outlier scores from the different methods.

3. Present the results from the ensemble to one or more domain experts; the domain expert inspects the detected outlier instances and decides whether they are interesting outliers which lead to new insights in domain understanding, erroneous instances which should be removed from the data, false alarms (regular instances), and/or instances with minor corrected errors to be reintroduced into the dataset.

4. Label the data using the results from step 3 and build a classifier, assessing the performance of the ensemble by its ability to improve the classifier's performance compared to classifiers built using the individual outlier methods that constitute the ensemble.
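
As a concrete, hedged illustration of step 1, the sketch below assumes the open-source PyOD library for the three base detectors; the paper does not name its software, and the detector parameters shown are placeholders rather than the settings reported in section 4.1:

```python
# Step 1 of the framework: collect raw outlier scores from the three base
# detectors, here via the PyOD library (an assumed implementation choice).
import numpy as np
from pyod.models.lof import LOF
from pyod.models.abod import ABOD
from pyod.models.sod import SOD

def base_outlier_scores(X: np.ndarray) -> dict:
    detectors = {
        "LOF": LOF(n_neighbors=30),
        "ABOD": ABOD(n_neighbors=10),
        "SOD": SOD(n_neighbors=20, ref_set=10, alpha=0.8),
    }
    scores = {}
    for name, det in detectors.items():
        det.fit(X)
        # NOTE: PyOD standardizes score orientation (larger = more outlying),
        # which differs from the raw ABOD convention assumed in section 2.6.
        scores[name] = det.decision_scores_
    return scores
```

Step 2, the unification and averaging of these raw scores, is sketched at the end of section 2.6.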

2.2 Random Forest

Random forests (Breiman, 2001) are a non-parametric method for classification and regression. A random forest consists of a collection, or ensemble, of simple decision trees, each capable of producing a response when presented with a set of predictor values. The forest grows each tree on an independent bootstrap sample from the training data, but when building these trees, each time a split is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors, typically with m ≪ p. The best split on the selected m variables is then found and the trees are grown to maximum depth. At each bootstrap iteration, the predicted class of an observation is calculated by a majority vote of the data not in the bootstrap sample (what Breiman calls out-of-bag, or OOB, observations) using the tree grown with the bootstrap sample.
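
For illustration (scikit-learn is again an assumption of this write-up), the OOB majority vote described above is exposed directly by its random forest implementation:

```python
# Random forest with out-of-bag (OOB) evaluation, via scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the study's real data frame is described in section 3.
X, y = make_classification(n_samples=314, n_features=7, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",   # m ~ sqrt(p) split candidates, mirroring m << p
    oob_score=True,        # majority vote over the out-of-bag observations
    random_state=0,
)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)
```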

2.3 Local Outlier Factor (LOF)

Breunig et al., 2000, proposed this density-based approach for finding outliers. For each point in the given dataset, its local outlier factor (LOF) is evaluated; a point whose LOF is large is declared an outlier. The method is as follows. For a given data point X, let D_k(X) be the distance to its k-nearest neighbour (the distance measure being Euclidean) and let L_k(X) be the set of points within the k-nearest-neighbour distance of X. Then the reachability distance R_k(X, Y) of object X with respect to Y is defined as

$$R_k(X, Y) = \max\{\mathrm{Dist}(X, Y),\ D_k(Y)\} \qquad (2.3.1)$$

The average reachability distance AR_k(X) of data point X with respect to its neighbourhood L_k(X) is defined as

$$AR_k(X) = \mathrm{MEAN}_{Y \in L_k(X)}\ R_k(X, Y) \qquad (2.3.2)$$

and the local outlier factor is

$$LOF_k(X) = \mathrm{MEAN}_{Y \in L_k(X)}\ \frac{AR_k(X)}{AR_k(Y)} \qquad (2.3.3)$$
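
A direct, unoptimized O(n²) numpy transcription of equations (2.3.1)-(2.3.3) might look like the following; this is a sketch for exposition, not the authors' code:

```python
# LOF computed exactly as in equations (2.3.1)-(2.3.3); O(n^2), for exposition.
import numpy as np
from scipy.spatial.distance import cdist

def lof_scores(X: np.ndarray, k: int = 5) -> np.ndarray:
    D = cdist(X, X)                          # pairwise Euclidean distances
    n = len(X)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]   # L_k(X): k nearest neighbours
    Dk = D[np.arange(n), nn[:, -1]]          # D_k(X): distance to k-th neighbour
    # AR_k(X): mean reachability distance, R_k(X, Y) = max(Dist(X, Y), D_k(Y))
    AR = np.array([np.mean(np.maximum(D[i, nn[i]], Dk[nn[i]])) for i in range(n)])
    # LOF_k(X): mean of AR_k(X) / AR_k(Y) over the neighbourhood (eq. 2.3.3)
    return np.array([np.mean(AR[i] / AR[nn[i]]) for i in range(n)])
```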

2.4 Angle-Based Outlier Degree (ABOD)

Kriegel, Schubert, & Zimek, 2008, proposed this method. The idea in angle-based methods is that data points at the boundary of the data are likely to enclose the entire data within a smaller angle, whereas points in the interior are likely to have data around them at many different angles. Consider three data points X, Y and Z. The angle between the vectors Y − X and Z − X will not vary much across different choices of Y and Z when X is an outlier. Furthermore, the angle is inversely weighted by the distance between the points (Aggarwal, 2013). The corresponding angle (weighted cosine) is defined as follows:

$$\mathrm{WCos}(Y - X, Z - X) = \frac{\langle (Y - X),\ (Z - X) \rangle}{\|Y - X\|_2^2\ \|Z - X\|_2^2} \qquad (2.4.1)$$

where $\|\cdot\|_2$ represents the L2-norm and $\langle\cdot\rangle$ the scalar product. The angle-based outlier degree (ABOD) of the data point X is then defined as

$$\mathrm{ABOD}(X) = \mathrm{Var}_{Y,Z}\ \mathrm{WCos}(Y - X, Z - X) \qquad (2.4.2)$$

Data points which are outliers will have a smaller spectrum of angles and will therefore have lower values of the angle-based outlier degree.
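
Again as a sketch, the naive O(n³) form of equations (2.4.1)-(2.4.2) translates to the code below (the study itself used a sampled variant, see section 4.1):

```python
# Naive ABOD following equations (2.4.1)-(2.4.2); O(n^3), exposition only.
import numpy as np
from itertools import combinations

def abod_scores(X: np.ndarray) -> np.ndarray:
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        wcos = []
        for j, l in combinations([m for m in range(n) if m != i], 2):
            a, b = X[j] - X[i], X[l] - X[i]
            # eq. (2.4.1): scalar product weighted by the squared L2 norms
            wcos.append(np.dot(a, b) / (np.dot(a, a) * np.dot(b, b)))
        scores[i] = np.var(wcos)  # eq. (2.4.2): low variance => likely outlier
    return scores
```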

2.5 Subspace Outlier Degree (SOD)

This is a method for finding outliers in lower-dimensional projections of the data, proposed in Kriegel, Schubert, & Zimek, 2009. In this approach, a local analysis is performed for each data point. For each data point X, a set of reference points S(X) is determined, representing the proximity of the current data point. Once this reference set S(X) has been determined, the relevant subspace for S(X) is determined as the set Q(X) of dimensions in which the variance is small. The Euclidean distance of X to the mean of the reference set S(X) is then computed in the subspace defined by Q(X); this distance is denoted G(X). The subspace outlier degree SOD(X) of a data point is defined by normalizing this distance G(X) by the number of dimensions in Q(X) (Aggarwal, 2013):

$$\mathrm{SOD}(X) = \frac{G(X)}{|Q(X)|} \qquad (2.5.1)$$
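
A compact sketch of equation (2.5.1) follows. Two simplifying assumptions are made here that the text above does not state: the reference set S(X) is taken as the Euclidean nearest neighbours (the original paper derives it from shared-neighbour similarity), and "small variance" is taken as variance below alpha times the mean per-dimension variance:

```python
# Sketch of SOD (eq. 2.5.1) under the two assumptions noted in the lead-in.
import numpy as np
from scipy.spatial.distance import cdist

def sod_scores(X: np.ndarray, ref_size: int = 10, alpha: float = 0.8) -> np.ndarray:
    D = cdist(X, X)
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        S = X[np.argsort(D[i])[1:ref_size + 1]]  # reference set S(X)
        var = S.var(axis=0)
        Q = var < alpha * var.mean()             # Q(X): low-variance dimensions
        if not Q.any():                          # no relevant subspace found
            scores[i] = 0.0
            continue
        G = np.linalg.norm(X[i, Q] - S.mean(axis=0)[Q])  # distance in subspace
        scores[i] = G / Q.sum()                  # SOD(X) = G(X) / |Q(X)|
    return scores
```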

2.6 Score Unification

The score unification method used in this work was originally proposed by Kriegel, Schubert, and Zimek in 2011, with small modifications to accommodate ABOD scores that are equal to zero. This unification method consists of two steps: regularization and scaling.

Regularization transforms the original outlier scores to the interval [0, ∞), where the smaller the transformed score, the more likely the object is to be an inlier.

For LOF the transformation is done as follows. Let the expected inlier value be E (for LOF, E = 1 (Breunig et al., 2000)) and let the outlier score for an object o be S(o); the regularized score is then given by

$$\mathrm{Reg}_S(o) = \max\{0,\ S(o) - E\} \qquad (2.6.1)$$

For ABOD, logarithmic inversion is used, since its raw scores have low contrast and the logarithmic inversion enhances the contrast between inliers and outliers. Let the outlier score for an object o be S(o) and the maximum possible (or observed) score be S_max; the regularized score is then given by

$$\mathrm{Reg}_S(o) = -\log\left(\frac{S(o) + \varepsilon}{S_{\max} + \varepsilon}\right) \qquad (2.6.2)$$

where 0 < ε ≪ 1; in this work ε was set to 1e-10.

For SOD the scores are already in the range [0, ∞), so no transformation is needed.

Scaling is then performed to transform the scores to the range [0, 1]. Given the mean μ_S and the standard deviation σ_S of the set of regularized scores S(o), this method uses the normal CDF, Φ, and the Gaussian error function, erf(), to transform each outlier score into a probability value:

$$\mathrm{Norm}_S(o) = \max\left\{0,\ \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right\} \qquad (2.6.3)$$

where erf(x) = 2Φ(x√2) − 1. The scores from the different methods are then combined using the averaging function (mean).
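
Because equations (2.6.1)-(2.6.3) fully specify the unification, step 2 of the framework reduces to a few lines of numpy/scipy. The only assumption in this sketch is that the raw ABOD scores follow the original orientation (smaller = more outlying); if they come from a library that re-orients scores, the regularization must be adapted accordingly:

```python
# Regularization and Gaussian scaling per equations (2.6.1)-(2.6.3),
# followed by the averaging combination.
import numpy as np
from scipy.special import erf

EPS = 1e-10  # the epsilon used in this work

def regularize_lof(s):
    return np.maximum(0.0, s - 1.0)          # eq. (2.6.1), expected inlier E = 1

def regularize_abod(s):
    # eq. (2.6.2): logarithmic inversion; assumes raw scores where smaller
    # means more outlying, as in the original ABOD paper.
    return -np.log((s + EPS) / (s.max() + EPS))

def gaussian_scale(s):
    mu, sigma = s.mean(), s.std()
    return np.maximum(0.0, erf((s - mu) / (sigma * np.sqrt(2))))  # eq. (2.6.3)

def ensemble_score(lof_raw, abod_raw, sod_raw):
    unified = np.vstack([
        gaussian_scale(regularize_lof(lof_raw)),
        gaussian_scale(regularize_abod(abod_raw)),
        gaussian_scale(sod_raw),             # SOD needs no regularization
    ])
    return unified.mean(axis=0)              # combine by averaging
```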

2.7 Performance Evaluation

We used standard classification performance metrics, namely sensitivity, specificity, accuracy, and the area under the Receiver Operating Characteristic (ROC) curve (AUC), to evaluate performance.
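
As a sketch (scikit-learn is assumed here; the paper does not name its tooling), these four measures can be computed from a binary labelling and a continuous score as follows:

```python
# Sensitivity, specificity, accuracy and AUC for a binary outlier labelling.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

def evaluate(y_true, y_score, threshold=0.1):  # 0.1, the threshold of section 4.1
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "accuracy": accuracy_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_score),
    }
```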

3 DATA

The Nigerian football players' statistics data from 1997 to July 2015 used for this research were retrieved from the online database of the Association of Football Statisticians (AFS) (AFS, 2015). The instances are records of 314 players, consisting of the following variables: name of player, number of appearances, number of substitutions, number of goals scored, number of penalties scored, position of the player on the field (goalkeeper, defender, midfielder, forward), and the number of red and yellow cards received.

Table 1: Summary of player statistics by position on the field.

                                      Goalkeepers  Forwards  Midfielders  Defenders
Count                                 17           100       100          97
Max. number of appearances/player     98           60        67           95
Max. number of goals scored/player    0            21        12           5
No. of players who scored goals       0            62        34           18
No. of players who scored penalties   0            10        8            0


Table 1 shows separate summaries for each position on the field. There are 17 goalkeepers, 97 defenders, 100 midfielders and 100 forwards. The player with the greatest number of appearances is a goalkeeper. It was further observed that forwards scored the most goals, with players scoring up to 21 goals, and that neither goalkeepers nor defenders scored a penalty.

[Figures 1 and 2: yellow and red cards received, by position on the field.]

Furthermore, it was observed that defenders received the most yellow/red cards, which is to be expected because a defender is an outfield player whose primary role is to prevent the opposing team from scoring. Goalkeepers received the lowest number of yellow cards and received no red card (see Figures 1 and 2).

4 RESULTS

4.1 OUTLIER DETECTION ANALYSIS VIA THE ENSEMBLE

The methods LOF, ABOD and SOD were first applied to the data, and the ENSEMBLE was then built by combining the outlier scores from the individual methods using the approaches described in section 2. The parameter settings for LOF (k = 30:50, i.e., k ranging from 30 to 50) and SOD (alpha = 0.8) are as suggested by the respective authors; for ABOD, the percentage of the data used when calculating the angle variance was set to 0.5 due to computational time constraints. A threshold of 0.1 on the unified score is used to declare points outliers.


4.2 EXPERT INTERPRETATION OF RESULTS OBTAINED BY THE ENSEMBLE

The sports analyst/journalist was shown the outliers detected by the ensemble in order to confirm the meaningfulness of the outliers found. In the qualitative evaluation he used all the available retrospective data for these players to identify why they were detected as special cases. The analyst agreed with about 98% of the results from the ENSEMBLE; in particular, he identified two players as false positives, and these were therefore not labelled as outliers in the further analyses.

Using the outliers identified, a new data frame is created that includes a new variable, named Label, indicating whether a data instance is an outlier or an inlier. A random forest is applied to perform the classification on this updated data frame, with optimization carried out to achieve the best performance measures on the out-of-bag samples. The results of the analysis are shown in Table 2.

Table 2: Performance (%) of the random forest classifier under labels from each method.

Method      AUC    Sensitivity   Specificity   Accuracy
LOF         97.4   70.8          98.6          91.5
ABOD        100    100           97.3          97.9
SOD         97.3   85            96            93.6
ISO         99.9   100           98.6          98.9
ENSEMBLE    99.9   90.9          100           97.9

5.1 Summary

The main aim of this work was to develop a framework for unsupervised ensemble-based outlier detection, together with a heuristic approach for prediction and quantitative performance evaluation of the ensemble. The technique was applied to real-life data, affirming the practical applicability of the developed framework.


5.2 Empirical Findings

The experimental results presented in this paper show that outlier ensemble analysis can identify meaningful outliers in data, which can contribute immensely to the process of selecting players in professional sports. The value of outlier ensembles can clearly be seen from the results. The analysis of the Nigerian football players' statistics data showed that ISO (the isolation forest) performed best, with very high values on all performance measures; the ENSEMBLE also performed well, making correct predictions about 97% of the time.

There are two directions for future work. The first is how to describe or explain why the identified outliers are exceptional and which variables make them so; this is particularly important for high-dimensional datasets, because an outlier may be outlying only on some, but not all, dimensions. The second is the development of a user-guided outlier detection approach in which the user can choose which methods, from a collection of both supervised and unsupervised methods, to include in the ensemble construction stage.

References

AFS. (2015). Nigeria national football team statistics and records: all-time record. Association of Football Statisticians, http://www.11v11.com/about-the-association-of-football-statisticians-afs-/.

Aggarwal, C. (2013). Outlier analysis. Springer.

Aggarwal, C., & Yu, P. (2001). Outlier detection for high dimensional data. In Proceedings of the ACM SIGMOD Conference on Management of Data (pp. 37-46).

Anscombe, F., & Guttman, I. (1960). Rejection of outliers. Technometrics, 2(2), 123-147.

Barbara, D., Li, Y., Couto, J., Lin, J.-L., & Jajodia, S. (2003). Bootstrapping a data mining intrusion detection system. In Symposium on Applied Computing.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

Breunig, M., Kriegel, H.-P., Ng, R., & Sander, J. (2000). LOF: Identifying density-based local outliers. In Proceedings of the ACM SIGMOD Conference on Management of Data (pp. 93-104). Dallas, TX.

Desforges, M., Jacob, P., & Cooper, J. (1998). Applications of probability density estimation to the detection of abnormal conditions in engineering. In Proceedings of the Institute of Mechanical Engineers (pp. 687-703).

Edgeworth, F. Y. (1887). On discordant observations. Philosophical Magazine, 25(5), 364-375.

Ertoz, L., Steinbach, M., & Kumar, V. (2003). Finding topics in collections of documents: A shared nearest neighbour approach. In Clustering and Information Retrieval (pp. 83-104).

Grubbs, F. (1969). Procedures for detecting outlying observations in samples. Technometrics, 11(1), 1-21.

Kriegel, H.-P., Kroger, P., Schubert, E., & Zimek, A. (2009). LoOP: Local outlier probabilities. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM) (pp. 1649-1652). Hong Kong, China.

Kriegel, H.-P., Schubert, E., & Zimek, A. (2008). ABOD: Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD) (pp. 444-452). Las Vegas, NV.

Kriegel, H.-P., Schubert, E., & Zimek, A. (2009). Outlier detection in axis-parallel subspaces of high dimensional data. In Proceedings of the 13th PAKDD Conference. Bangkok, Thailand.

Kriegel, H.-P., Schubert, E., & Zimek, A. (2011). Interpreting and unifying outlier scores. In Proceedings of the 11th SIAM International Conference on Data Mining (SDM) (pp. 13-24). Mesa, AZ.

Lazarevic, A., & Kumar, V. (2005). Feature bagging for outlier detection. In ACM KDD Conference.

Liu, F., Ting, K., & Zhou, Z.-H. (2008). Isolation forest. In ICDM Conference.

Polikar, R. (2009). Ensemble learning. Scholarpedia, 4, 2776.

Smith, R., Bivens, A., Embrechts, M., Palagiri, C., & Szymanski, B. (2002). Clustering approaches for anomaly based intrusion detection. In Proceedings of Intelligent Engineering Systems through Artificial Neural Networks (pp. 579-584). ASME.

Tukey, J. (1977). Exploratory data analysis. Addison-Wesley.

Zhang, K., Hutter, M., & Jin, H. (n.d.). A new local distance based outlier detection approach for scattered real world data. In Proceedings of the 13th PAKDD Conference.

Zimek, A., Campello, R. J. G. B., & Sander, J. (2013). Ensembles for unsupervised outlier detection: Challenges and research questions. SIGKDD Explorations, 15(1).

