Anomaly Detection
Overview
Data Reduction
PCA
SIT742 | Modern Data Science (G. Li)
4/18/19
Clustering Algorithm
[Figure: unsupervised learning with feedback. Labels: Data; algorithms; Discovered Patterns/Clusters; Clustered Instances; Performance.]
A B C D E F G H
D = [AC + AE + AG + BC + BE + BG + DC + DE + DG + FC + FE + FG + HC + HE + HG]/15 = 40/15 = 2.67

A B C D E F G H
D = [AE + AF + AG + AH + BE + BF + BG + BH + CE + CF + CG + CH + DE + DF + DG + DH]/16 = 68/16 = 4.25
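The two calculations above average every pairwise distance between the members of two clusters (commonly called average linkage). A minimal Python sketch of that computation follows. The coordinates behind the slide's figure are not recoverable, so the evenly spaced 1-D positions for A to H are an assumption, and the resulting values need not match the slide's totals. The names `average_linkage` and `position` are illustrative.

```python
from itertools import product


def average_linkage(cluster_a, cluster_b, position):
    """Mean of all pairwise distances between members of two clusters."""
    pairs = list(product(cluster_a, cluster_b))
    total = sum(abs(position[a] - position[b]) for a, b in pairs)
    return total / len(pairs)


# Assumption: points A..H sit at 1, 2, ..., 8 on a line (hypothetical layout).
position = {name: float(i) for i, name in enumerate("ABCDEFGH", start=1)}

# First grouping: {A, B, D, F, H} versus {C, E, G} gives 5 x 3 = 15 pairs.
d_first = average_linkage("ABDFH", "CEG", position)

# Second grouping: {A, B, C, D} versus {E, F, G, H} gives 4 x 4 = 16 pairs.
d_second = average_linkage("ABCD", "EFGH", position)

print(round(d_first, 2), round(d_second, 2))
```

The number of terms in each bracket is the product of the two cluster sizes, which is why the divisors are 15 and 16.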
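\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]
The i-th instance: row \( (x_{i1}, \ldots, x_{if}, \ldots, x_{ip}) \)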
• Min-max
• Z-score
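The two normalization schemes listed above can be sketched per feature (one column of the data matrix) as follows. This is a minimal illustration using only the standard library. The `ages` column is hypothetical data, the function names are illustrative, and the use of the population standard deviation (`pstdev`) rather than the sample one is an assumption, since the slide does not say which is intended.

```python
from statistics import mean, pstdev


def min_max(column):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def z_score(column):
    """Centre values on the mean and scale by the population standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:  # constant feature: avoid division by zero
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


ages = [20, 30, 40, 50, 60]  # hypothetical feature column
scaled = min_max(ages)
standardized = z_score(ages)
print(scaled)
print([round(z, 3) for z in standardized])
```

Min-max keeps every value inside [0, 1] but is sensitive to extreme values, while z-scores are unbounded and express each value in standard deviations from the mean.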
\[
J(c_1, \ldots, c_k) = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2
\]
Can we do better?
Intrusion Detection
• Intrusion Detection:
– Process of monitoring the events occurring in a computer system or network and analyzing them for intrusions
– Intrusions are defined as attempts to bypass the security mechanisms of a computer or network
• Challenges
– Traditional signature-based intrusion detection systems are based on signatures of known attacks and cannot detect emerging cyber threats
– Substantial latency in deployment of newly created signatures across the computer system
• Anomaly detection can alleviate these limitations

Fraud Detection
• Fraud detection refers to detection of criminal activities occurring in commercial organizations
– Malicious users might be the actual customers of the organization or might be posing as a customer (also known as identity theft).
• Types of fraud
– Credit card fraud
– Insurance claim fraud
– Mobile / cell phone fraud
– Insider trading
• Challenges
– Fast and accurate real-time detection
– Misclassification cost is very high
varying density
https://quantdare.com/isolation-forest-algorithm/
• Introduction
• Why Data Reduction?

• An object is an outlier if (1) it does not belong to any cluster, (2) there is a large distance between the object and its closest cluster, or (3) it belongs to a small or sparse cluster
• Case I: Does not belong to any cluster
– Identify animals not part of a flock: Using a density-based
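Case I, an object that belongs to no cluster, can be sketched with a density-based criterion. The slide text is cut off after "density-based", so using DBSCAN's notion of noise here is an assumption: a point is noise when it is not a core point (fewer than `min_pts` points within radius `eps`, counting itself) and does not lie within `eps` of any core point. The function name, the coordinates, and the parameter values are all hypothetical.

```python
from math import dist


def dbscan_noise(points, eps, min_pts):
    """Return indices of points that DBSCAN would label as noise."""
    n = len(points)
    # Each neighbourhood includes the point itself, as in the usual definition.
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Core points have at least min_pts points in their eps-neighbourhood.
    core = {i for i in range(n) if len(neighbours[i]) >= min_pts}
    # Noise: neither a core point nor within eps of one (so not a border point).
    return [
        i for i in range(n)
        if i not in core and not any(j in core for j in neighbours[i])
    ]


# Hypothetical positions: a tight flock plus two stray animals.
flock = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.2, 0.4)]
strays = [(5.0, 5.0), (-4.0, 3.0)]
animals = flock + strays

outliers = dbscan_noise(animals, eps=0.5, min_pts=3)
print(outliers)
```

This sketch only identifies the noise points and does not assign cluster labels. The result depends on `eps` and `min_pts`, which would need tuning for real data.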