

Lecture Notes on Modern Data Science
Session 09: Data Analytics - Unsupervised Learning

Gang Li
School of Information Technology
Deakin University, VIC 3125, Australia

Unit Learning Outcomes

• ULO2: Describe advanced constituents and the underlying theoretical foundation of data science.
  – Unsupervised Learning
  – Dimensionality Reduction
• ULO5: Collect, model and conduct inferential as well as predictive tasks from data.
  – KMeans
  – DBScan
  – Anomaly Detection
  – PCA

Road map

• Data Analytics - Unsupervised Learning
  – Clustering
    • Distance Measures
    • KMeans
    • DBScan
  – Association Rule Mining
  – Anomaly Detection
  – Data Reduction
    • Overview
    • PCA

Clustering

• What is clustering?
• What is a good clustering?

Pattern Discovery from Data

• Why do we tend to see patterns in the clouds (a puppy?)
  – Our mind prefers patterns.
  – Our brains do 'clustering' unconsciously.
  – In fact, we are 'encoded' to see patterns in everything: shopping, traffic, what to eat, what to wear, etc.
• How do we 'teach' a computer to do this?
  – This is a form of learning, also called unsupervised learning.
  – It is a key component of exploratory data analysis.
  – Example applications: gene expression detection, community detection.

Clustering

• Unsupervised learning
  – Aka: clustering, exploratory data analysis, pattern discovery, learning without a teacher, etc.
  – The machine is not given training data; it has to automatically figure out and induce patterns from the data!

What Is Clustering?

• Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects.
• Schematic (slide figure): data flows into an unsupervised learning algorithm, which outputs discovered patterns/clusters (the clustered instances), whose performance is then assessed and fed back to the algorithm.

What Is Good Clustering?

• A 'good clustering' has the following properties:
  – Items in the same cluster tend to be close to each other (small intra-cluster distance).
  – Items in different clusters tend to be far from each other (large inter-cluster distance).
• There is usually no 'correct' clustering, but it is often not hard to come up with a metric
  – an easily calculated value that can be used to give a score to any clustering.
• There are many such metrics, e.g.
  – S = the mean distance between pairs of items in the same cluster
  – D = the mean distance between pairs of items in different clusters
  – Measure of cluster quality: D/S; the higher the better (a small sketch follows below).
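To make the D/S metric concrete, here is a small Python sketch (not from the unit materials) that scores a candidate clustering of 2D points; the points and labels are made-up illustrative values.

```python
import numpy as np
from itertools import combinations

def cluster_quality(points, labels):
    """Return (S, D, D/S): S = mean intra-cluster pairwise distance,
    D = mean inter-cluster pairwise distance."""
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        dist = np.linalg.norm(points[i] - points[j])
        (intra if labels[i] == labels[j] else inter).append(dist)
    S, D = np.mean(intra), np.mean(inter)
    return S, D, D / S

# Made-up 2D points forming two visually separated groups
points = np.array([[1.0, 1.0], [1.5, 1.2], [0.8, 0.9], [1.2, 1.4],
                   [5.0, 5.2], [5.5, 4.8], [4.9, 5.5], [5.3, 5.1]])
good = [0, 0, 0, 0, 1, 1, 1, 1]   # a sensible clustering
bad  = [0, 1, 0, 1, 0, 1, 0, 1]   # a poor clustering

for name, lab in [("good", good), ("bad", bad)]:
    S, D, q = cluster_quality(points, lab)
    print(f"{name}: S={S:.2f}  D={D:.2f}  D/S={q:.2f}")
```

The good clustering yields a much larger D/S ratio than the bad one, matching the "higher is better" rule of thumb above.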

What Is Good Clustering?

• Worked example 1 (points A-H in the slide figure, clusters {A, B, D, F, H} and {C, E, G}):
  – S = [AB + AD + AF + AH + BD + BF + BH + DF + DH + FH + CE + CG + EG] / 13 = 44/13 = 3.38
  – D = [AC + AE + AG + BC + BE + BG + DC + DE + DG + FC + FE + FG + HC + HE + HG] / 15 = 40/15 = 2.67
  – Cluster Quality = D/S ≈ 0.79
• Worked example 2 (clusters {A, B, C, D} and {E, F, G, H}):
  – S = [AB + AC + AD + BC + BD + CD + EF + EG + EH + FG + FH + GH] / 12 = 20/12 = 1.67
  – D = [AE + AF + AG + AH + BE + BF + BG + BH + CE + CF + CG + CH + DE + DF + DG + DH] / 16 = 68/16 = 4.25
  – Cluster Quality = D/S ≈ 2.54

What Is Good Clustering?

• Worked example 3 (clusters {A, B}, {C, D} and {E, F, G, H}):
  – S = [AB + CD + EF + EG + EH + FG + FH + GH] / 8 = 12/8 = 1.5
  – D = [AC + AD + AE + AF + AG + AH + BC + BD + BE + BF + BG + BH + CE + CF + CG + CH + DE + DF + DG + DH] / 20 = 72/20 = 3.6
  – Cluster Quality = D/S = 2.40

What Is Good Clustering?

• Some important notes
  – Clustering algorithms (whether or not they work with cluster quality metrics) always use some kind of distance or similarity measure.
    • The result of the clustering process will depend on the chosen distance measure.
  – The choice of algorithm, and/or distance measure, will depend on the kind of cluster shapes you might expect in the data.
  – The D/S measure for cluster quality will not work well in lots of cases.
    • Think about why.

What Is Good Clustering?

• Sometimes groups are difficult to spot, even in 2D (see the slide figures).

What Is Good Clustering?

• A good clustering will produce high quality clusters with
  – High intra-class similarity
  – Low inter-class similarity
• The quality of a clustering depends on
  – The similarity measure used
  – The clustering algorithm
• The goodness of the clusters ultimately depends on the opinion of the user.

Clustering Algorithms

• Partition-based Clustering
• Hierarchical Agglomerative Clustering

Distance Measures

Data Matrix

• For memory-based clustering
  – Also called object-by-variable structure
  – Represents n objects with p variables
  – A relational table, where row i is the i-th instance and column f is the f-th attribute:

$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}$$

Dissimilarity Matrix

• For memory-based clustering
  – Also called object-by-object structure
  – Stores the proximities of all pairs of objects
  – d(i, j): dissimilarity between objects i and j
    • Nonnegative; a value close to 0 means the objects are similar
  – e.g., d(3, 2) is the dissimilarity between the 2nd and the 3rd instance:

$$\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}$$

Distance Measures

• Dissimilarity, or similarity, is often defined with the help of a distance measure.
• A distance measure should satisfy the following requirements:
  – d(i,j) ≥ 0: the distance is nonnegative
  – d(i,i) = 0: the distance of an object to itself is 0
  – d(i,j) = d(j,i): the distance is symmetric
  – d(i,j) ≤ d(i,h) + d(h,j): triangular inequality
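As a small illustration (not part of the slides), a dissimilarity matrix can be built directly from a data matrix with SciPy; the tiny data matrix below is made up.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# A tiny data matrix: n = 4 objects, p = 2 variables (made-up values)
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [4.0, 5.0],
              [0.0, 1.0]])

# pdist returns the condensed (upper-triangle) pairwise distances d(i, j);
# squareform expands them into the full symmetric n x n dissimilarity
# matrix with zeros on the diagonal.
dissimilarity = squareform(pdist(X, metric="euclidean"))
print(np.round(dissimilarity, 2))
```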

Distance Measures

• Suppose two instances Xi and Xj both have n attributes:
  Xi = (xi1, …, xin) and Xj = (xj1, …, xjn)
• Minkowski distance:
  $$d(i,j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{in} - x_{jn}|^q \right)^{1/q}$$
  – if q = 2, it becomes the Euclidean distance
  – if q = 1, it becomes the Manhattan distance (also known as the city-block distance)

Interval-valued Variables

• Continuous measurements on a roughly linear scale
  – Weight, height, latitude and longitude coordinates, temperature, etc.
• Effect of measurement units on this kind of attribute
  – Smaller unit → larger variable range → larger effect on the result
  – Remedy: standardization + background knowledge

Interval-valued Variables

• Euclidean distance, or other cases of the Minkowski distance, can be applied (cosine distance is another common choice, shown in the slide figure).
• Before applying a distance measure, the attributes should be normalized
  – so that attributes with larger ranges will not out-weigh attributes with smaller ranges
  – How to normalize?
    • Min-max normalization
    • Z-score standardization
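The following short sketch (illustrative values, not from the unit) applies min-max and z-score normalisation and then computes Minkowski distances with q = 1 and q = 2; note how the raw Euclidean distance is dominated by the attribute with the larger range.

```python
import numpy as np

def minkowski(xi, xj, q):
    """Minkowski distance between two attribute vectors."""
    return np.sum(np.abs(xi - xj) ** q) ** (1.0 / q)

# Made-up data: weight in kg and income in dollars (very different ranges)
X = np.array([[62.0, 35000.0],
              [80.0, 42000.0],
              [70.0, 90000.0]])

# Min-max normalisation: rescale each attribute to [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score standardisation: zero mean, unit standard deviation per attribute
X_z = (X - X.mean(axis=0)) / X.std(axis=0)

print("raw Euclidean (q=2)      :", minkowski(X[0], X[1], q=2))
print("min-max Euclidean (q=2)  :", minkowski(X_minmax[0], X_minmax[1], q=2))
print("z-score Manhattan (q=1)  :", minkowski(X_z[0], X_z[1], q=1))
```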

KMeans Algorithm

• K-Means
• K-Medoids
• Other Algorithms

Partitioning Methods

• Given a data set, a partitioning method constructs k (k < n) partitions of the data, where each partition represents a cluster.
• Two schemes:
  – each cluster is represented by the mean value of the data in the cluster
  – each cluster is represented by one of the objects near the centre of the cluster
• Representative algorithms:
  – K-means, K-medoids, K-modes, PAM, EM
  – CLARA, CLARANS

Meet the KMeans

• The most popular clustering algorithm on the planet!
  – Simple and fast!
• Independently discovered several times:
  – Steinhaus (1955), Lloyd (1957), Ball and Hall (1965) and MacQueen (1967).
• It still remains the algorithm of choice for today's analysis tasks.

• Objective function:
  – Data $x_1, x_2, \ldots, x_N$, and assume $K$ clusters
  – Define indicator variables:
    $$r_{nk} = \begin{cases} 1 & \text{if } x_n \text{ belongs to cluster } k \\ 0 & \text{otherwise} \end{cases}$$
  – Objective function for KMeans:
    $$J(\mu_1, \ldots, \mu_K) = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2$$

How KMeans Works

• KMeans Algorithm (each cluster is represented by its centroid: initially one of the objects, thereafter the mean of the cluster's members):
  – Step 1: randomly select k objects as the initial centroids of the clusters.
  – Step 2: assign each remaining object to the cluster whose centroid is nearest to the object.
  – Step 3: recalculate the centroids as the means of the observations in the new clusters; if any centroid changed, go to Step 2, otherwise exit.
• A schematic illustration of the K-means algorithm for two-dimensional data clustering is shown in the slide figure (Ref: https://arxiv.org/abs/1611.01849).

Issues with KMeans

• We need to supply the number of clusters beforehand.
  – In practice, this can be difficult if we don't know in advance how many clusters we are looking for.
  – Tip: always visualise the data first, e.g., with a scatter plot.
    • However, this can be difficult if the data is high-dimensional (e.g., when dealing with text).
    • Other, more advanced methods exist: Principal Component Analysis (PCA), Multidimensional Scaling.
• KMeans is sensitive to initialization.
  – Lots of research has tried to fix this problem.
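A minimal sketch of the three steps above, using scikit-learn's KMeans on synthetic blobs; the choice n_clusters=3 is assumed purely for illustration, and n_init restarts from several random initialisations as one practical answer to the sensitivity issue.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic 2D data with 3 well-separated groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# n_init runs KMeans from several random initialisations and keeps the best run
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

print("centroids:\n", np.round(km.cluster_centers_, 2))
print("objective J (inertia):", round(km.inertia_, 2))  # sum of squared distances to centroids
```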

Meet DBSCAN

• Clusters are dense regions of instances that are separated by low-density regions.
• Strengths
  – Finds arbitrarily shaped clusters
  – Filters out outliers
  – A non-parametric method (no need to specify the number of clusters)
• Weaknesses
  – Cannot find all clusters when the clusters have different densities
  – Limited to low-dimensional data sets
  – Run time grows quadratically w.r.t. the number of instances

How DBSCAN Works

• DBSCAN estimates the density of a point as the number of points from the dataset that lie in its ε-neighbourhood.
• A core point is a point with estimated density above or equal to a user-specified threshold MinPts.
• All points in each core point's ε-neighbourhood are linked to the core point, and the points which are directly or transitively linked are grouped into the same cluster.
• Classification of points (slide figure, with MinPts = 3): red = core point, yellow = border point, blue = noise point.
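A small scikit-learn sketch of DBSCAN on moon-shaped data that KMeans would struggle with; eps (the ε-neighbourhood radius) and min_samples (playing the role of MinPts) are illustrative values only.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two interleaving half-moons: arbitrarily shaped clusters plus a little noise
X, _ = make_moons(n_samples=300, noise=0.07, random_state=42)

# eps is the neighbourhood radius; min_samples is the density threshold (MinPts);
# points labelled -1 are treated as noise/outliers.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters found:", n_clusters)
print("noise points  :", int(np.sum(db.labels_ == -1)))
```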

Other Clustering Algorithms

• Flat clustering
  – KMeans
  – DBSCAN
  – Affinity Propagation
  – Spectral Clustering, etc.
• Hierarchical clustering
  – Agglomerative Clustering
  – Advanced hierarchical/nested clustering algorithms
• New clustering algorithms are being invented every day!

Market Basket Analysis

• People go shopping every day, so
  – transaction data are accumulated day by day, month by month and year by year,
  – and eventually a huge amount of data is collected and becomes available.
• The market analyser then wants to know the customers' consumption preferences.

Market Basket Analysis

• Analyze tables of transactions.
• Which items are frequently purchased together by customers?

  Customer | Basket
  C1       | Chips, Salsa, Cookies, Crackers, Coke, Beer
  C2       | Lettuce, Spinach, Oranges, Celery, Apples, Grapes
  C3       | Chips, Salsa, Frozen Pizza, Frozen Cake
  C4       | Lettuce, Spinach, Milk, Butter

• Did customers who bought Chips also buy Salsa?
    Chips ⇒ Salsa
• Did customers prefer buying Lettuce together with Spinach?
    Lettuce ⇒ Spinach

Market Basket Analysis

• In general, the data consists of
  – TID: a transaction ID
  – Basket: the subset of items in that transaction

Association Rules

• Association rules express relationships between items, e.g.
    cereal, milk ⇒ fruit
  – "People who bought cereal and milk also bought fruit."
  – Stores might want to offer specials on milk and cereal to get people to buy more fruit.

Basic Concepts

• Set of items: I = {i1, i2, ..., im}
• Transaction: T ⊆ I
• Association rule: A ⇒ B, where A ⊂ I, B ⊂ I, A ∩ B = ∅
• D: the set of transactions (i.e., our data)

Measuring Interesting Rules

• Support
  – The ratio of the number of transactions containing both A and B to the total number of transactions:

$$\mathrm{support}(A \Rightarrow B) = P(A \wedge B) = \frac{\#\,\text{of transactions containing both } A \text{ and } B}{\#\,\text{of transactions}}$$

Measuring Interesting Rules

• Confidence
  – The ratio of the number of transactions containing both A and B to the number of transactions containing A:

$$\mathrm{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{\#\,\text{of transactions containing both } A \text{ and } B}{\#\,\text{of transactions containing } A}$$

Market Basket Analysis

• What is I?
• What is T for customer C2?
• What is support(Chips ⇒ Salsa)?
• What is confidence(Chips ⇒ Salsa)?

  Customer | Basket
  C1       | Chips, Salsa, Cookies, Crackers, Coke, Beer
  C2       | Lettuce, Spinach, Oranges, Celery, Apples, Grapes
  C3       | Chips, Salsa, Frozen Pizza, Frozen Cake
  C4       | Lettuce, Spinach, Milk, Butter

Market Basket Analysis

• What is I?
  – I = {Chips, Salsa, Cookies, Crackers, Coke, Beer, Lettuce, Spinach, Oranges, Celery, Apples, Grapes, Frozen Pizza, Frozen Cake, Milk, Butter}
• What is T for customer C2?
  – T = {Lettuce, Spinach, Oranges, Celery, Apples, Grapes}
• Chips and Salsa occur together in 2 of the 4 baskets, and Chips occurs in 2 baskets, so support(Chips ⇒ Salsa) = 2/4 = 0.5 and confidence(Chips ⇒ Salsa) = 2/2 = 1.0.

Measuring Interesting Rules

• Rules are mined based on two metrics:
  – minimum support
    • how frequently an association rule appears in the transactions
  – minimum confidence
    • how frequently the left-hand side of a rule implies the right-hand side
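The answers above can be checked with a few lines of Python over the four baskets (an illustrative sketch, not part of the slides).

```python
baskets = [
    {"Chips", "Salsa", "Cookies", "Crackers", "Coke", "Beer"},        # C1
    {"Lettuce", "Spinach", "Oranges", "Celery", "Apples", "Grapes"},  # C2
    {"Chips", "Salsa", "Frozen Pizza", "Frozen Cake"},                # C3
    {"Lettuce", "Spinach", "Milk", "Butter"},                         # C4
]

def support(A, B):
    """Fraction of baskets containing every item in A union B."""
    return sum((A | B) <= t for t in baskets) / len(baskets)

def confidence(A, B):
    """Fraction of baskets containing A that also contain B."""
    return sum((A | B) <= t for t in baskets) / sum(A <= t for t in baskets)

print("support(Chips => Salsa)    =", support({"Chips"}, {"Salsa"}))     # 0.5
print("confidence(Chips => Salsa) =", confidence({"Chips"}, {"Salsa"}))  # 1.0
```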

Frequent Itemsets

• itemset
  – any set of items
• k-itemset
  – an itemset containing k items
• frequent itemset
  – an itemset that satisfies a minimum support level
• If I contains m items, how many itemsets are there?

Strong Association Rules

• Given an itemset, it is easy to generate association rules.
  – Given the itemset {Chips, Salsa}:
    • ⇒ Chips, Salsa
    • Chips ⇒ Salsa
    • Salsa ⇒ Chips
    • Chips, Salsa ⇒
• Strong rules are the interesting ones.
  – Generally defined as those rules satisfying minimum support and minimum confidence.

Association Rule Mining

• Two basic steps:
  1. Find all frequent itemsets
     • i.e., those satisfying minimum support
  2. Find all strong association rules
     • Generate association rules from the frequent itemsets
     • Keep the rules satisfying minimum confidence

Generating Frequent Itemsets: Naïve Algorithm

Input:  Database of transactions D, minimum support
Output: Frequent itemsets

n = |D|
for each subset s of I do
    l = 0
    for each transaction T in D do
        if s is a subset of T then
            l = l + 1
    if minimum support <= l/n then
        add s to the frequent itemsets
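The naïve algorithm translates almost line-for-line into Python. The sketch below (illustrative only) uses a cut-down version of the earlier grocery baskets and an assumed minimum support of 0.5; the nested enumeration over every subset of I is exactly what makes the approach exponential.

```python
from itertools import combinations

baskets = [
    {"Chips", "Salsa", "Cookies"},
    {"Lettuce", "Spinach", "Oranges"},
    {"Chips", "Salsa", "Frozen Pizza"},
    {"Lettuce", "Spinach", "Milk"},
]
min_support = 0.5
I = set().union(*baskets)      # the full item set
n = len(baskets)

frequent = []
# Enumerate every non-empty subset s of I -- 2^m - 1 of them, hence exponential growth
for k in range(1, len(I) + 1):
    for s in combinations(sorted(I), k):
        count = sum(set(s) <= t for t in baskets)   # scan all n transactions
        if count / n >= min_support:
            frequent.append((s, count))

for itemset, count in frequent:
    print(itemset, count)
```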

Generating Frequent Itemsets: Naïve Algorithm

• Analysis of the naïve algorithm
  – 2^m subsets of I
  – Scan all n transactions for each subset
  – O(2^m · n) tests of whether s is a subset of T
• Growth is exponential in the number of items!
• Can we do better?

Generating Frequent Itemsets: Apriori Algorithm

• A key property of frequent itemsets
  – If A is not a frequent itemset, then any superset of A is not a frequent itemset.
  – If A is a frequent itemset, then any subset of A is also a frequent itemset.

Generating Frequent Itemsets: Apriori Algorithm

• Idea:
  – Build candidate k-itemsets from frequent (k−1)-itemsets.
• Approach:
  – Find all frequent 1-itemsets.
  – Extend frequent (k−1)-itemsets to candidate k-itemsets.
  – Prune candidate itemsets that do not meet the minimum support.

Input:  Database D, minimum support min_sup
Output: Frequent itemsets L

Begin
    L1 = {frequent 1-itemsets}              // scan the database
    for (k = 2; L(k-1) is not empty; k++)
    {
        Ck = generate k-itemset candidates from L(k-1)
        for each transaction t in D
        {
            Ct = subset(Ck, t)              // the candidates that are subsets of t
            for each candidate c in Ct
                c.count++
        }
        Lk = {c in Ck | c.count >= min_sup}
    }
    L = union of all Lk
End

Apriori Algorithm: Example

• Database D (with an assumed minimum support of 2 transactions, i.e., 50%):

  TID | Items
  100 | A, C, D
  200 | B, C, E
  300 | A, B, C, E
  400 | B, E

• Scan D for candidate 1-itemsets: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
  – Remove {D} (support below minimum); L1 = { {A}:2, {B}:3, {C}:3, {E}:3 }
• Candidate 2-itemsets from L1: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
  – Scan D: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
  – Remove {A,B} and {A,E}; L2 = { {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2 }
• Candidate 3-itemsets: {A,B,C}, {A,C,E}, {A,B,E}, {B,C,E}
  – Scan D: {A,B,C}:1, {A,C,E}:1, {A,B,E}:1, {B,C,E}:2
  – Remove all but {B,C,E}; L3 = { {B,C,E}:2 }

• Possible association rules that can be derived from {B, C, E}:
  – B, C → E    sup = 2/4    con = 2/2 = 1.0
  – B, E → C    sup = 2/4    con = 2/3 ≈ 0.67
  – C, E → B    sup = 2/4    con = 2/2 = 1.0
  – B → C, E    sup = 2/4    con = 2/3 ≈ 0.67
  – C → B, E    sup = 2/4    con = 2/3 ≈ 0.67
  – E → B, C    sup = 2/4    con = 2/3 ≈ 0.67
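The worked example can be reproduced with a compact, deliberately simplified Apriori sketch in Python (not from the unit materials); candidate generation here already uses the join and prune steps described on the following slides, so the infrequent 3-itemsets are pruned before counting.

```python
from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_count = 2                     # minimum support of 2 transactions (50%)

def item_counts(candidates):
    """Count how many transactions contain each candidate itemset."""
    return {c: sum(c <= t for t in transactions) for c in candidates}

# L1: frequent 1-itemsets
items = sorted(set().union(*transactions))
L = [c for c, n in item_counts([frozenset([i]) for i in items]).items() if n >= min_count]
all_frequent = item_counts(L)

k = 2
while L:
    # Join step: unions of frequent (k-1)-itemsets that yield a k-itemset
    candidates = {a | b for a in L for b in L if len(a | b) == k}
    # Prune step: every (k-1)-subset of a candidate must itself be frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in set(L) for s in combinations(c, k - 1))}
    counts = item_counts(candidates)
    L = [c for c, n in counts.items() if n >= min_count]
    all_frequent.update({c: n for c, n in counts.items() if n >= min_count})
    k += 1

for itemset, n in sorted(all_frequent.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), n)     # ends with ['B', 'C', 'E'] 2, matching L3 above
```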

Generating Frequent Itemsets

• Basic Apriori reduces the number of itemsets considered.
  – It also reduces the number of scans of D.
• How do we generate candidates?
  – Naïve approach
    • For each item i not in a frequent (k−1)-itemset, add i to the itemset to create a k-itemset candidate.
    • This might generate duplicates, so remove them.
• Improved candidate generation
  – Join together frequent (k−1)-itemsets.
  – If two itemsets have (k−2) items in common, create a k-itemset candidate by adding the two differing items to the (k−2) common items.
  – Example (k = 4): {Lettuce, Spinach, Milk} joined with {Lettuce, Spinach, Chips} gives {Lettuce, Spinach, Chips, Milk}.

Generating Frequent Itemsets

• Pruning
  – If a candidate contains a subset that is infrequent, then the candidate itself is infrequent.
  – All (k−1)-subsets of a candidate must be frequent.
  – This is the Apriori property!

Anomaly/Outlier Detection

• Outlier: a data object that deviates significantly from the normal objects, as if it were generated by a different mechanism.
• Outliers are different from noise data.
  – Noise is random error or variance in a measured variable.
  – Noise should be removed before outlier detection.
• Outlier detection vs. novelty detection: at an early stage a novel pattern is treated as an outlier, but it is later merged into the model of normal behaviour.
• Applications:
  – Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection.

Intrusion Detection

• Intrusion detection is the process of monitoring the events occurring in a computer system or network and analyzing them for intrusions.
  – Intrusions are defined as attempts to bypass the security mechanisms of a computer or network.
• Challenges
  – Traditional signature-based intrusion detection systems are based on signatures of known attacks and cannot detect emerging cyber threats.
  – There is substantial latency in deploying newly created signatures across the computer system.
• Anomaly detection can alleviate these limitations.

Fraud Detection

• Fraud detection refers to the detection of criminal activities occurring in commercial organizations.
  – Malicious users might be actual customers of the organization or might be posing as a customer (also known as identity theft).
• Types of fraud
  – Credit card fraud
  – Insurance claim fraud
  – Mobile / cell phone fraud
  – Insider trading
• Challenges
  – Fast and accurate real-time detection
  – The misclassification cost is very high

Healthcare Informatics

• Detect anomalous patient records
  – These may indicate disease outbreaks, instrumentation errors, etc.
• Key Challenges
  – Only normal labels are available
  – The misclassification cost is very high
  – Data can be complex

Industrial Damage Detection

• Industrial damage detection refers to the detection of different faults and failures in complex industrial systems, structural damage, intrusions in electronic security systems, suspicious events in video surveillance, abnormal energy consumption, etc.
  – Example: Aircraft Safety
    • Anomalous aircraft (engine) / fleet usage
    • Anomalies in engine combustion data
    • Total aircraft health and usage management
• Key Challenges
  – Data is extremely large, noisy and unlabelled
  – Most applications exhibit temporal behaviour
  – Detecting anomalous events typically requires immediate intervention
• See also: http://seninp.github.io/software.html

Supervised Techniques

• Advantages:
  – Models that can be easily understood
  – High accuracy in detecting many kinds of known anomalies
  – Normal behaviour can be accurately learned
• Drawbacks:
  – Require labels for the normal and/or anomaly class
  – Cannot detect unknown and emerging anomalies
  – Possible high false alarm rate: previously unseen (yet legitimate) data records may be recognized as anomalies

Unsupervised Techniques

• Assume the normal objects are somewhat "clustered" into multiple groups, each having some distinct features.
• An outlier is expected to be far away from any group of normal objects.
• Newer methods tackle outliers directly.
• Weakness: cannot detect collective outliers effectively.
  – Normal objects may not share any strong patterns, but the collective outliers may share high similarity in a small area.
• Many clustering methods can be adapted for unsupervised outlier detection.
  – Find clusters first, then outliers: the objects not belonging to any cluster.
  – Problem 1: hard to distinguish noise from outliers.
  – Problem 2: costly, since clustering comes first, yet there are far fewer outliers than normal objects.

Statistical Approaches

• Assume a parametric model describing the distribution of the data (e.g., a normal distribution).
• Apply a statistical test that depends on
  – the data distribution
  – the parameters of the distribution (e.g., mean, variance)
  – the number of expected outliers (confidence limit)
• In many cases, the data distribution/model may not be known.

Proximity-based Approaches

• Intuition: objects that are far away from the others are outliers.
• Assumption of the proximity-based approach: the proximity of an outlier deviates significantly from that of most of the others in the data set.
• Three types of proximity-based outlier detection methods:
  – Distance-based outlier detection: an object o is an outlier if its neighbourhood does not have enough other points.
  – Density-based outlier detection: an object o is an outlier if its density is relatively much lower than that of its neighbours.
  – Isolation-based outlier detection: an object o is an outlier if it is likely to be isolated from other points.

Distance-based Outlier Detection

• Nearest Neighbour (kNN) approach
  – For each data point, compute the distance d_k to its k-th nearest neighbour.
  – Sort all data points according to the distance d_k.
  – Outliers are the points that have the largest distance d_k and are therefore located in the sparsest neighbourhoods.
  – Usually, data points whose distance d_k is higher than a threshold are identified as outliers.
  – Not suitable for datasets that have modes with varying density.

Isolation-based Outlier Detection

• Isolation Forest
  – Randomly select a feature and randomly split all instances into two non-empty subsets.
  – Repeat the random split procedure on each subset.
  – An instance which is likely to be isolated (i.e., left alone in a subset) earlier than the others is considered an outlier.
  – See: https://quantdare.com/isolation-forest-algorithm/
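Both ideas are available off the shelf; the sketch below (with illustrative parameter choices and synthetic data) scores points by the distance to their k-th nearest neighbour and, separately, flags anomalies with scikit-learn's IsolationForest.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))     # dense "normal" region
outliers = rng.uniform(low=-6, high=6, size=(5, 2))        # a few scattered points
X = np.vstack([normal, outliers])

# Distance-based: distance to the k-th nearest neighbour as an outlier score
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbour
dists, _ = nn.kneighbors(X)
knn_score = dists[:, -1]                          # distance to the k-th true neighbour
print("top-5 distance-based outliers:", np.argsort(knn_score)[-5:])

# Isolation-based: IsolationForest isolates anomalies with fewer random splits
iso = IsolationForest(n_estimators=100, contamination=0.025, random_state=0).fit(X)
labels = iso.predict(X)                           # -1 marks predicted outliers
print("isolation forest outliers    :", np.where(labels == -1)[0])
```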

Clustering-based Approaches

• An object is an outlier if (1) it does not belong to any cluster, (2) there is a large distance between the object and its closest cluster, or (3) it belongs to a small or sparse cluster.
• Case 1: the object does not belong to any cluster
  – e.g., identify animals not part of a flock, using a density-based clustering method such as DBSCAN.
• Case 2: the object is far from its closest cluster (a sketch follows below)
  – Using k-means, partition the data points into clusters.
  – For each object o, assign an outlier score based on its distance from its closest centre.
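Case 2 can be sketched in a few lines (illustrative data and an assumed k): cluster with KMeans and use each object's distance to its nearest centroid as the outlier score.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=1)
X = np.vstack([X, [[10.0, -10.0]]])               # append one obvious outlier

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Outlier score = distance from each object to its closest cluster centre
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
print("most outlying objects:", np.argsort(dists)[-3:])   # the appended point ranks highest
```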


Data Reduction

• Introduction
• Why Data Reduction?

Data Reduction

• Data reduction is the elimination or de-emphasis of certain data records and attributes.
  – Data reduction may involve choosing a subset of attributes.
    • Dimensionality reduction is often used to reduce the number of dimensions to two or three.
    • Alternatively, pairs of attributes can be considered.
  – Data reduction may also involve choosing a subset of records.
    • A region of the screen can only show so many points.
    • We can sample, but we want to preserve points in sparse areas.

Principal Component Analysis

• Why PCA?
• Example
• PCA Formulation

Why PCA?

• Raw data can be high dimensional but redundant.
  – If we knew what to measure, or how to represent our measurements, we might find simple relationships.
  – But often we measure redundant signals
    • e.g., US vs EU shoe sizes.
  – Or we represent the data via the method by which it was gathered
    • e.g., the pixel representation of human face data.
• Issues
  – Data might be redundant.
  – Data might be naively represented in its raw form.
• Goal
  – Find a better representation of the data
    • to visualize it and discover hidden patterns
    • as preprocessing for a supervised task, such as attribute hashing

Example: Shoe Sizes

• We take noisy measurements of shoe size on both the US and the EU scale.
  – Modulo noise, we expect perfect correlation.
• How can we do better and find a simple, compact representation?
  – Pick a direction and project onto this direction.

Example: Shoe Sizes

• PCA
  – Try to find a direction that minimizes the distances between the original data points and their projections.
• Contrast with linear regression:
  – Linear regression: predict y from x; evaluate the accuracy of the predictions (represented by the blue line) by the vertical distances between the points and the line.
  – PCA: reconstruct the 2D data via 2D data with a single degree of freedom; evaluate the reconstructions (represented by the blue line) by the Euclidean distances between the points and the line.
• Goal:
  – Find a better data representation.
• Another perspective
  – To identify patterns we want to study variation across observations, so we can look for a compact representation that captures that variation.
  – PCA finds the directions of maximal variance.

PCA Formulation

• PCA: find a lower-dimensional representation of the data
  – $X \in \mathbb{R}^{n \times d}$: n data records, each with d attributes
    • $x_j^{(i)}$: the j-th attribute of the i-th data record
    • $\mu_j$: the mean of the j-th attribute
  – $P \in \mathbb{R}^{d \times k}$: a $d \times k$ matrix
    • its columns are the k principal components
  – $Z = XP \in \mathbb{R}^{n \times k}$: an $n \times k$ matrix
    • the reduced representation
    • the PCA 'scores'

PCA Formulation (k = 1)

• Goal: PCA finds the direction of maximal variance
  – $X \in \mathbb{R}^{n \times d}$: n (centred) data records, each with d attributes
  – $P \in \mathbb{R}^{d \times 1}$: a $d \times 1$ matrix (a single direction)
  – $Z = XP \in \mathbb{R}^{n \times 1}$: an $n \times 1$ matrix (the projected data)
• Variance of the projection:
  $$\sigma_Z^2 = \frac{1}{n} \sum_{i=1}^{n} \left(z^{(i)}\right)^2 = \frac{1}{n} \lVert Z \rVert^2 = \frac{1}{n} (XP)^\top (XP) = P^\top \left(\tfrac{1}{n} X^\top X\right) P = P^\top C_X P$$
  where $C_X$ is the covariance matrix of the (centred) data.
• Find $P$ to maximize $\sigma_Z^2$, subject to $\lVert P \rVert_2 = 1$
  – Optimization: $\max_P \; P^\top C_X P$, subject to $\lVert P \rVert_2 = 1$

PCA Formulation (k = 1)

• Goal: $\max_P \; P^\top C_X P$, subject to $\lVert P \rVert_2 = 1$
• Solution:
  – Lagrangian: maximize $P^\top C_X P - \lambda (P^\top P - 1)$
  – Differentiating with respect to $P$: $2 C_X P - 2 \lambda P = 0$
  – Hence $C_X P = \lambda P$
    • The best $P$ is the eigenvector of $C_X$ with the largest eigenvalue $\lambda$.
  – A similar argument can be used for k > 1.

Choosing k

• How should we pick the dimension of the new representation?
  – Visualization:
    • Pick the top 2 or 3 dimensions for plotting purposes.
  – Other analyses:
    • Capture 'most' of the variance in the data.
    • Recall that the eigenvalues are the variances in the directions specified by the eigenvectors, and that the eigenvalues are sorted in decreasing order.
    • Choose a k such that the fraction of retained variance is at least 95%:
      $$\frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{d} \lambda_j} \geq 0.95$$
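The whole derivation fits in a few lines of NumPy (a sketch on synthetic, purely illustrative data): centre the data, form the covariance matrix, take its eigenvectors, and keep enough of them to retain at least 95% of the variance.

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic "shoe size"-style data: 3 noisy attributes driven by 1 latent factor
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, 2 * latent, -latent]) + 0.05 * rng.normal(size=(200, 3))

# Centre the data and form the covariance matrix C_X
Xc = X - X.mean(axis=0)
C = (Xc.T @ Xc) / Xc.shape[0]

# Eigen-decomposition; eigh returns eigenvalues in ascending order, so reverse them
eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Choose k so the retained-variance fraction is at least 95%
retained = np.cumsum(eigvals) / np.sum(eigvals)
k = int(np.searchsorted(retained, 0.95) + 1)
P = eigvecs[:, :k]          # principal components (columns of P)
Z = Xc @ P                  # PCA scores, the reduced representation

print("retained variance fractions:", np.round(retained, 4))
print("chosen k:", k, "  Z shape:", Z.shape)
```

Because one latent factor drives all three attributes here, the first eigenvalue dominates and k = 1 already retains well over 95% of the variance.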

Questions?

This Session's Readings

• Principal Component Analysis
  – https://en.wikipedia.org/wiki/Principal_component_analysis
  – A Tutorial on Principal Component Analysis
  – https://www.youtube.com/watch?v=q5w8FyF3m2o
• Random Projection
  – Random projection in dimensionality reduction: Applications to image and text data