

Lecture Notes on Modern Data Science
Session 09: Data Analytics - Unsupervised Learning

Gang Li
School of Information Technology
Deakin University, VIC 3125, Australia

Unit Learning Outcomes

• ULO2: Describe advanced constituents and the underlying theoretical foundation of data science.
  – Unsupervised Learning
  – Dimensionality Reduction
• ULO5: Collect, model and conduct inferential as well as predictive tasks from data.
  – KMeans
  – DBScan
  – Anomaly Detection
  – PCA

Road map

• Data Analytics - Unsupervised Learning
  – Clustering
    • Distance Measures
    • KMeans
    • DBScan
  – Association Rule Mining
  – Anomaly Detection
  – Data Reduction
    • Overview
    • PCA

Clustering

• What is clustering?
• What is a good clustering?

Pattern Discovery from Data

• Why do we tend to see patterns in the clouds (a puppy?)
  – Our mind prefers patterns.
  – Our brains do 'clustering' unconsciously.
  – In fact, we are 'encoded' to see patterns in everything: shopping, traffic, what to eat, what to wear, etc.
• How do we 'teach' a computer to do this?
  – This is a form of learning, also called unsupervised learning.
  – It is a key component of exploratory data analysis.
  – Example applications: gene expression detection, community detection.

Clustering

• Unsupervised learning
  – Aka: clustering, exploratory data analysis, pattern discovery, learning without a teacher, etc.
  – The machine is not given training data; it has to automatically figure out and induce patterns from the data!

What Is Clustering?

• Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects.
• Schematic (slide figure): data flows into an unsupervised learning algorithm, which outputs discovered patterns/clusters (the clustered instances), whose performance is then assessed and fed back to the algorithm.

What Is Good Clustering?

• A 'good clustering' has the following properties:
  – Items in the same cluster tend to be close to each other (small intra-cluster distance).
  – Items in different clusters tend to be far from each other (large inter-cluster distance).
• There is usually no 'correct' clustering, but it is often not hard to come up with a metric
  – an easily calculated value that can be used to give a score to any clustering.
• There are many such metrics, e.g.
  – S = the mean distance between pairs of items in the same cluster
  – D = the mean distance between pairs of items in different clusters
  – Measure of cluster quality: D/S; the higher the better (a small sketch follows below).
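To make the D/S metric concrete, here is a small Python sketch (not from the unit materials) that scores a candidate clustering of 2D points; the points and labels are made-up illustrative values.

```python
import numpy as np
from itertools import combinations

def cluster_quality(points, labels):
    """Return (S, D, D/S): S = mean intra-cluster pairwise distance,
    D = mean inter-cluster pairwise distance."""
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        dist = np.linalg.norm(points[i] - points[j])
        (intra if labels[i] == labels[j] else inter).append(dist)
    S, D = np.mean(intra), np.mean(inter)
    return S, D, D / S

# Made-up 2D points forming two visually separated groups
points = np.array([[1.0, 1.0], [1.5, 1.2], [0.8, 0.9], [1.2, 1.4],
                   [5.0, 5.2], [5.5, 4.8], [4.9, 5.5], [5.3, 5.1]])
good = [0, 0, 0, 0, 1, 1, 1, 1]   # a sensible clustering
bad  = [0, 1, 0, 1, 0, 1, 0, 1]   # a poor clustering

for name, lab in [("good", good), ("bad", bad)]:
    S, D, q = cluster_quality(points, lab)
    print(f"{name}: S={S:.2f}  D={D:.2f}  D/S={q:.2f}")
```

The good clustering yields a much larger D/S ratio than the bad one, matching the "higher is better" rule of thumb above.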

What Is Good Clustering?

• Worked example 1 (points A-H in the slide figure, clusters {A, B, D, F, H} and {C, E, G}):
  – S = [AB + AD + AF + AH + BD + BF + BH + DF + DH + FH + CE + CG + EG] / 13 = 44/13 = 3.38
  – D = [AC + AE + AG + BC + BE + BG + DC + DE + DG + FC + FE + FG + HC + HE + HG] / 15 = 40/15 = 2.67
  – Cluster Quality = D/S ≈ 0.79
• Worked example 2 (clusters {A, B, C, D} and {E, F, G, H}):
  – S = [AB + AC + AD + BC + BD + CD + EF + EG + EH + FG + FH + GH] / 12 = 20/12 = 1.67
  – D = [AE + AF + AG + AH + BE + BF + BG + BH + CE + CF + CG + CH + DE + DF + DG + DH] / 16 = 68/16 = 4.25
  – Cluster Quality = D/S ≈ 2.54

What Is Good Clustering?

• Worked example 3 (clusters {A, B}, {C, D} and {E, F, G, H}):
  – S = [AB + CD + EF + EG + EH + FG + FH + GH] / 8 = 12/8 = 1.5
  – D = [AC + AD + AE + AF + AG + AH + BC + BD + BE + BF + BG + BH + CE + CF + CG + CH + DE + DF + DG + DH] / 20 = 72/20 = 3.6
  – Cluster Quality = D/S = 2.40

What Is Good Clustering?

• Some important notes
  – Clustering algorithms (whether or not they work with cluster quality metrics) always use some kind of distance or similarity measure.
    • The result of the clustering process will depend on the chosen distance measure.
  – The choice of algorithm, and/or distance measure, will depend on the kind of cluster shapes you might expect in the data.
  – The D/S measure for cluster quality will not work well in lots of cases.
    • Think about why.

What Is Good Clustering?

• Sometimes groups are difficult to spot, even in 2D (see the slide figures).

What Is Good Clustering?

• A good clustering will produce high quality clusters with
  – High intra-class similarity
  – Low inter-class similarity
• The quality of a clustering depends on
  – The similarity measure used
  – The clustering algorithm
• The goodness of the clusters ultimately depends on the opinion of the user.

Clustering Algorithms

• Partition-based Clustering
• Hierarchical Agglomerative Clustering

Distance Measures

Data Matrix

• For memory-based clustering
  – Also called object-by-variable structure
  – Represents n objects with p variables
  – A relational table, where row i is the i-th instance and column f is the f-th attribute:

$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}$$

Dissimilarity Matrix

• For memory-based clustering
  – Also called object-by-object structure
  – Stores the proximities of all pairs of objects
  – d(i, j): dissimilarity between objects i and j
    • Nonnegative; a value close to 0 means the objects are similar
  – e.g., d(3, 2) is the dissimilarity between the 2nd and the 3rd instance:

$$\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}$$

Distance Measures

• Dissimilarity, or similarity, is often defined with the help of a distance measure.
• A distance measure should satisfy the following requirements:
  – d(i,j) ≥ 0: the distance is nonnegative
  – d(i,i) = 0: the distance of an object to itself is 0
  – d(i,j) = d(j,i): the distance is symmetric
  – d(i,j) ≤ d(i,h) + d(h,j): triangular inequality
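As a small illustration (not part of the slides), a dissimilarity matrix can be built directly from a data matrix with SciPy; the tiny data matrix below is made up.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# A tiny data matrix: n = 4 objects, p = 2 variables (made-up values)
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [4.0, 5.0],
              [0.0, 1.0]])

# pdist returns the condensed (upper-triangle) pairwise distances d(i, j);
# squareform expands them into the full symmetric n x n dissimilarity
# matrix with zeros on the diagonal.
dissimilarity = squareform(pdist(X, metric="euclidean"))
print(np.round(dissimilarity, 2))
```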

Distance Measures

• Suppose two instances Xi and Xj both have n attributes:
  Xi = (xi1, …, xin) and Xj = (xj1, …, xjn)
• Minkowski distance:
  $$d(i,j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{in} - x_{jn}|^q \right)^{1/q}$$
  – if q = 2, it becomes the Euclidean distance
  – if q = 1, it becomes the Manhattan distance (also known as the city-block distance)

Interval-valued Variables

• Continuous measurements on a roughly linear scale
  – Weight, height, latitude and longitude coordinates, temperature, etc.
• Effect of measurement units on this kind of attribute
  – Smaller unit → larger variable range → larger effect on the result
  – Remedy: standardization + background knowledge

Interval-valued Variables

• Euclidean distance, or other cases of the Minkowski distance, can be applied (cosine distance is another common choice, shown in the slide figure).
• Before applying a distance measure, the attributes should be normalized
  – so that attributes with larger ranges will not out-weigh attributes with smaller ranges
  – How to normalize?
    • Min-max normalization
    • Z-score standardization
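The following short sketch (illustrative values, not from the unit) applies min-max and z-score normalisation and then computes Minkowski distances with q = 1 and q = 2; note how the raw Euclidean distance is dominated by the attribute with the larger range.

```python
import numpy as np

def minkowski(xi, xj, q):
    """Minkowski distance between two attribute vectors."""
    return np.sum(np.abs(xi - xj) ** q) ** (1.0 / q)

# Made-up data: weight in kg and income in dollars (very different ranges)
X = np.array([[62.0, 35000.0],
              [80.0, 42000.0],
              [70.0, 90000.0]])

# Min-max normalisation: rescale each attribute to [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score standardisation: zero mean, unit standard deviation per attribute
X_z = (X - X.mean(axis=0)) / X.std(axis=0)

print("raw Euclidean (q=2)      :", minkowski(X[0], X[1], q=2))
print("min-max Euclidean (q=2)  :", minkowski(X_minmax[0], X_minmax[1], q=2))
print("z-score Manhattan (q=1)  :", minkowski(X_z[0], X_z[1], q=1))
```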

KMeans Algorithm

• K-Means
• K-Medoids
• Other Algorithms

Partitioning Methods

• Given a data set, a partitioning method constructs k (k < n) partitions of the data, where each partition represents a cluster.
• Two schemes:
  – each cluster is represented by the mean value of the data in the cluster
  – each cluster is represented by one of the objects near the centre of the cluster
• Representative algorithms:
  – K-means, K-medoids, K-modes, PAM, EM
  – CLARA, CLARANS

Meet the KMeans

• The most popular clustering algorithm on the planet!
  – Simple and fast!
• Independently discovered several times:
  – Steinhaus (1955), Lloyd (1957), Ball and Hall (1965) and MacQueen (1967).
• It still remains the algorithm of choice for today's analysis tasks.

• Objective function:
  – Data $x_1, x_2, \ldots, x_N$, and assume $K$ clusters
  – Define indicator variables:
    $$r_{nk} = \begin{cases} 1 & \text{if } x_n \text{ belongs to cluster } k \\ 0 & \text{otherwise} \end{cases}$$
  – Objective function for KMeans:
    $$J(\mu_1, \ldots, \mu_K) = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2$$

How KMeans Works

• KMeans Algorithm (each cluster is represented by its centroid: initially one of the objects, thereafter the mean of the cluster's members):
  – Step 1: randomly select k objects as the initial centroids of the clusters.
  – Step 2: assign each remaining object to the cluster whose centroid is nearest to the object.
  – Step 3: recalculate the centroids as the means of the observations in the new clusters; if any centroid changed, go to Step 2, otherwise exit.
• A schematic illustration of the K-means algorithm for two-dimensional data clustering is shown in the slide figure (Ref: https://arxiv.org/abs/1611.01849).

Issues with KMeans

• We need to supply the number of clusters beforehand.
  – In practice, this can be difficult if we don't know in advance how many clusters we are looking for.
  – Tip: always visualise the data first, e.g., with a scatter plot.
    • However, this can be difficult if the data is high-dimensional (e.g., when dealing with text).
    • Other, more advanced methods exist: Principal Component Analysis (PCA), Multidimensional Scaling.
• KMeans is sensitive to initialization.
  – Lots of research has tried to fix this problem.
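A minimal sketch of the three steps above, using scikit-learn's KMeans on synthetic blobs; the choice n_clusters=3 is assumed purely for illustration, and n_init restarts from several random initialisations as one practical answer to the sensitivity issue.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic 2D data with 3 well-separated groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# n_init runs KMeans from several random initialisations and keeps the best run
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

print("centroids:\n", np.round(km.cluster_centers_, 2))
print("objective J (inertia):", round(km.inertia_, 2))  # sum of squared distances to centroids
```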

Meet DBSCAN

• Clusters are dense regions of instances that are separated by low-density regions.
• Strengths
  – Finds arbitrarily shaped clusters
  – Filters out outliers
  – A non-parametric method (no need to specify the number of clusters)
• Weaknesses
  – Cannot find all clusters when the clusters have different densities
  – Limited to low-dimensional data sets
  – Run time grows quadratically w.r.t. the number of instances

How DBSCAN Works

• DBSCAN estimates the density of a point as the number of points from the dataset that lie in its ε-neighbourhood.
• A core point is a point with estimated density above or equal to a user-specified threshold MinPts.
• All points in each core point's ε-neighbourhood are linked to the core point, and the points which are directly or transitively linked are grouped into the same cluster.
• Classification of points (slide figure, with MinPts = 3): red = core point, yellow = border point, blue = noise point.
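A small scikit-learn sketch of DBSCAN on moon-shaped data that KMeans would struggle with; eps (the ε-neighbourhood radius) and min_samples (playing the role of MinPts) are illustrative values only.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two interleaving half-moons: arbitrarily shaped clusters plus a little noise
X, _ = make_moons(n_samples=300, noise=0.07, random_state=42)

# eps is the neighbourhood radius; min_samples is the density threshold (MinPts);
# points labelled -1 are treated as noise/outliers.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters found:", n_clusters)
print("noise points  :", int(np.sum(db.labels_ == -1)))
```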

Other Clustering Algorithms

• Flat clustering
  – KMeans
  – DBSCAN
  – Affinity Propagation
  – Spectral Clustering, etc.
• Hierarchical clustering
  – Agglomerative Clustering
  – Advanced hierarchical/nested clustering algorithms
• New clustering algorithms are being invented every day!

Market Basket Analysis

• People go shopping every day, so
  – transaction data are accumulated day by day, month by month and year by year,
  – and eventually a huge amount of data is collected and becomes available.
• The market analyser then wants to know the customers' consumption preferences.

Market Basket Analysis

• Analyze tables of transactions.
• Which items are frequently purchased together by customers?

  Customer | Basket
  C1       | Chips, Salsa, Cookies, Crackers, Coke, Beer
  C2       | Lettuce, Spinach, Oranges, Celery, Apples, Grapes
  C3       | Chips, Salsa, Frozen Pizza, Frozen Cake
  C4       | Lettuce, Spinach, Milk, Butter

• Did customers who bought Chips also buy Salsa?
    Chips ⇒ Salsa
• Did customers prefer buying Lettuce together with Spinach?
    Lettuce ⇒ Spinach

Market Basket Analysis

• In general, the data consists of
  – TID: a transaction ID
  – Basket: the subset of items in that transaction

Association Rules

• Association rules express relationships between items, e.g.
    cereal, milk ⇒ fruit
  – "People who bought cereal and milk also bought fruit."
  – Stores might want to offer specials on milk and cereal to get people to buy more fruit.

Basic Concepts

• Set of items: I = {i1, i2, ..., im}
• Transaction: T ⊆ I
• Association rule: A ⇒ B, where A ⊂ I, B ⊂ I, A ∩ B = ∅
• D: the set of transactions (i.e., our data)

Measuring Interesting Rules

• Support
  – The ratio of the number of transactions containing both A and B to the total number of transactions:

$$\mathrm{support}(A \Rightarrow B) = P(A \wedge B) = \frac{\#\,\text{of transactions containing both } A \text{ and } B}{\#\,\text{of transactions}}$$

Measuring Interesting Rules

• Confidence
  – The ratio of the number of transactions containing both A and B to the number of transactions containing A:

$$\mathrm{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{\#\,\text{of transactions containing both } A \text{ and } B}{\#\,\text{of transactions containing } A}$$

Market Basket Analysis

• What is I?
• What is T for customer C2?
• What is support(Chips ⇒ Salsa)?
• What is confidence(Chips ⇒ Salsa)?

  Customer | Basket
  C1       | Chips, Salsa, Cookies, Crackers, Coke, Beer
  C2       | Lettuce, Spinach, Oranges, Celery, Apples, Grapes
  C3       | Chips, Salsa, Frozen Pizza, Frozen Cake
  C4       | Lettuce, Spinach, Milk, Butter

Market Basket Analysis

• What is I?
  – I = {Chips, Salsa, Cookies, Crackers, Coke, Beer, Lettuce, Spinach, Oranges, Celery, Apples, Grapes, Frozen Pizza, Frozen Cake, Milk, Butter}
• What is T for customer C2?
  – T = {Lettuce, Spinach, Oranges, Celery, Apples, Grapes}
• Chips and Salsa occur together in 2 of the 4 baskets, and Chips occurs in 2 baskets, so support(Chips ⇒ Salsa) = 2/4 = 0.5 and confidence(Chips ⇒ Salsa) = 2/2 = 1.0.

Measuring Interesting Rules

• Rules are mined based on two metrics:
  – minimum support
    • how frequently an association rule appears in the transactions
  – minimum confidence
    • how frequently the left-hand side of a rule implies the right-hand side
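The answers above can be checked with a few lines of Python over the four baskets (an illustrative sketch, not part of the slides).

```python
baskets = [
    {"Chips", "Salsa", "Cookies", "Crackers", "Coke", "Beer"},        # C1
    {"Lettuce", "Spinach", "Oranges", "Celery", "Apples", "Grapes"},  # C2
    {"Chips", "Salsa", "Frozen Pizza", "Frozen Cake"},                # C3
    {"Lettuce", "Spinach", "Milk", "Butter"},                         # C4
]

def support(A, B):
    """Fraction of baskets containing every item in A union B."""
    return sum((A | B) <= t for t in baskets) / len(baskets)

def confidence(A, B):
    """Fraction of baskets containing A that also contain B."""
    return sum((A | B) <= t for t in baskets) / sum(A <= t for t in baskets)

print("support(Chips => Salsa)    =", support({"Chips"}, {"Salsa"}))     # 0.5
print("confidence(Chips => Salsa) =", confidence({"Chips"}, {"Salsa"}))  # 1.0
```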

Frequent Itemsets

• itemset
  – any set of items
• k-itemset
  – an itemset containing k items
• frequent itemset
  – an itemset that satisfies a minimum support level
• If I contains m items, how many itemsets are there?

Strong Association Rules

• Given an itemset, it is easy to generate association rules.
  – Given the itemset {Chips, Salsa}:
    • ⇒ Chips, Salsa
    • Chips ⇒ Salsa
    • Salsa ⇒ Chips
    • Chips, Salsa ⇒
• Strong rules are the interesting ones.
  – Generally defined as those rules satisfying minimum support and minimum confidence.

Association Rule Mining

• Two basic steps:
  1. Find all frequent itemsets
     • i.e., those satisfying minimum support
  2. Find all strong association rules
     • Generate association rules from the frequent itemsets
     • Keep the rules satisfying minimum confidence

Generating Frequent Itemsets: Naïve Algorithm

Input:  Database of transactions D, minimum support
Output: Frequent itemsets

n = |D|
for each subset s of I do
    l = 0
    for each transaction T in D do
        if s is a subset of T then
            l = l + 1
    if minimum support <= l/n then
        add s to the frequent itemsets
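The naïve algorithm translates almost line-for-line into Python. The sketch below (illustrative only) uses a cut-down version of the earlier grocery baskets and an assumed minimum support of 0.5; the nested enumeration over every subset of I is exactly what makes the approach exponential.

```python
from itertools import combinations

baskets = [
    {"Chips", "Salsa", "Cookies"},
    {"Lettuce", "Spinach", "Oranges"},
    {"Chips", "Salsa", "Frozen Pizza"},
    {"Lettuce", "Spinach", "Milk"},
]
min_support = 0.5
I = set().union(*baskets)      # the full item set
n = len(baskets)

frequent = []
# Enumerate every non-empty subset s of I -- 2^m - 1 of them, hence exponential growth
for k in range(1, len(I) + 1):
    for s in combinations(sorted(I), k):
        count = sum(set(s) <= t for t in baskets)   # scan all n transactions
        if count / n >= min_support:
            frequent.append((s, count))

for itemset, count in frequent:
    print(itemset, count)
```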

Generating Frequent Itemsets: Naïve Algorithm

• Analysis of the naïve algorithm
  – 2^m subsets of I
  – Scan all n transactions for each subset
  – O(2^m · n) tests of whether s is a subset of T
• Growth is exponential in the number of items!
• Can we do better?

Generating Frequent Itemsets: Apriori Algorithm

• A key property of frequent itemsets
  – If A is not a frequent itemset, then any superset of A is not a frequent itemset.
  – If A is a frequent itemset, then any subset of A is also a frequent itemset.

Generating Frequent Itemsets: Apriori Algorithm

• Idea:
  – Build candidate k-itemsets from frequent (k−1)-itemsets.
• Approach:
  – Find all frequent 1-itemsets.
  – Extend frequent (k−1)-itemsets to candidate k-itemsets.
  – Prune candidate itemsets that do not meet the minimum support.

Input:  Database D, minimum support min_sup
Output: Frequent itemsets L

Begin
    L1 = {frequent 1-itemsets}              // scan the database
    for (k = 2; L(k-1) is not empty; k++)
    {
        Ck = generate k-itemset candidates from L(k-1)
        for each transaction t in D
        {
            Ct = subset(Ck, t)              // the candidates that are subsets of t
            for each candidate c in Ct
                c.count++
        }
        Lk = {c in Ck | c.count >= min_sup}
    }
    L = union of all Lk
End

Apriori Algorithm: Example

• Database D (with an assumed minimum support of 2 transactions, i.e., 50%):

  TID | Items
  100 | A, C, D
  200 | B, C, E
  300 | A, B, C, E
  400 | B, E

• Scan D for candidate 1-itemsets: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
  – Remove {D} (support below minimum); L1 = { {A}:2, {B}:3, {C}:3, {E}:3 }
• Candidate 2-itemsets from L1: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
  – Scan D: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
  – Remove {A,B} and {A,E}; L2 = { {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2 }
• Candidate 3-itemsets: {A,B,C}, {A,C,E}, {A,B,E}, {B,C,E}
  – Scan D: {A,B,C}:1, {A,C,E}:1, {A,B,E}:1, {B,C,E}:2
  – Remove all but {B,C,E}; L3 = { {B,C,E}:2 }

• Possible association rules that can be derived from {B, C, E}:
  – B, C → E    sup = 2/4    con = 2/2 = 1.0
  – B, E → C    sup = 2/4    con = 2/3 ≈ 0.67
  – C, E → B    sup = 2/4    con = 2/2 = 1.0
  – B → C, E    sup = 2/4    con = 2/3 ≈ 0.67
  – C → B, E    sup = 2/4    con = 2/3 ≈ 0.67
  – E → B, C    sup = 2/4    con = 2/3 ≈ 0.67
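The worked example can be reproduced with a compact, deliberately simplified Apriori sketch in Python (not from the unit materials); candidate generation here already uses the join and prune steps described on the following slides, so the infrequent 3-itemsets are pruned before counting.

```python
from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_count = 2                     # minimum support of 2 transactions (50%)

def item_counts(candidates):
    """Count how many transactions contain each candidate itemset."""
    return {c: sum(c <= t for t in transactions) for c in candidates}

# L1: frequent 1-itemsets
items = sorted(set().union(*transactions))
L = [c for c, n in item_counts([frozenset([i]) for i in items]).items() if n >= min_count]
all_frequent = item_counts(L)

k = 2
while L:
    # Join step: unions of frequent (k-1)-itemsets that yield a k-itemset
    candidates = {a | b for a in L for b in L if len(a | b) == k}
    # Prune step: every (k-1)-subset of a candidate must itself be frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in set(L) for s in combinations(c, k - 1))}
    counts = item_counts(candidates)
    L = [c for c, n in counts.items() if n >= min_count]
    all_frequent.update({c: n for c, n in counts.items() if n >= min_count})
    k += 1

for itemset, n in sorted(all_frequent.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), n)     # ends with ['B', 'C', 'E'] 2, matching L3 above
```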

Generating Frequent Itemsets

• Basic Apriori reduces the number of itemsets considered.
  – It also reduces the number of scans of D.
• How do we generate candidates?
  – Naïve approach
    • For each item i not in a frequent (k−1)-itemset, add i to the itemset to create a k-itemset candidate.
    • This might generate duplicates, so remove them.
• Improved candidate generation
  – Join together frequent (k−1)-itemsets.
  – If two itemsets have (k−2) items in common, create a k-itemset candidate by adding the two differing items to the (k−2) common items.
  – Example (k = 4): {Lettuce, Spinach, Milk} joined with {Lettuce, Spinach, Chips} gives {Lettuce, Spinach, Chips, Milk}.

Generating Frequent Itemsets

• Pruning
  – If a candidate contains a subset that is infrequent, then the candidate itself is infrequent.
  – All (k−1)-subsets of a candidate must be frequent.
  – This is the Apriori property!

Anomaly/Outlier Detection

• Outlier: a data object that deviates significantly from the normal objects, as if it were generated by a different mechanism.
• Outliers are different from noise data.
  – Noise is random error or variance in a measured variable.
  – Noise should be removed before outlier detection.
• Outlier detection vs. novelty detection: at an early stage a novel pattern is treated as an outlier, but it is later merged into the model of normal behaviour.
• Applications:
  – Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection.

Intrusion Detection

• Intrusion detection is the process of monitoring the events occurring in a computer system or network and analyzing them for intrusions.
  – Intrusions are defined as attempts to bypass the security mechanisms of a computer or network.
• Challenges
  – Traditional signature-based intrusion detection systems are based on signatures of known attacks and cannot detect emerging cyber threats.
  – There is substantial latency in deploying newly created signatures across the computer system.
• Anomaly detection can alleviate these limitations.

Fraud Detection

• Fraud detection refers to the detection of criminal activities occurring in commercial organizations.
  – Malicious users might be actual customers of the organization or might be posing as a customer (also known as identity theft).
• Types of fraud
  – Credit card fraud
  – Insurance claim fraud
  – Mobile / cell phone fraud
  – Insider trading
• Challenges
  – Fast and accurate real-time detection
  – The misclassification cost is very high

Healthcare Informatics

• Detect anomalous patient records
  – These may indicate disease outbreaks, instrumentation errors, etc.
• Key Challenges
  – Only normal labels are available
  – The misclassification cost is very high
  – Data can be complex

Industrial Damage Detection

• Industrial damage detection refers to the detection of different faults and failures in complex industrial systems, structural damage, intrusions in electronic security systems, suspicious events in video surveillance, abnormal energy consumption, etc.
  – Example: Aircraft Safety
    • Anomalous aircraft (engine) / fleet usage
    • Anomalies in engine combustion data
    • Total aircraft health and usage management
• Key Challenges
  – Data is extremely large, noisy and unlabelled
  – Most applications exhibit temporal behaviour
  – Detecting anomalous events typically requires immediate intervention
• See also: http://seninp.github.io/software.html

Supervised Techniques

• Advantages:
  – Models that can be easily understood
  – High accuracy in detecting many kinds of known anomalies
  – Normal behaviour can be accurately learned
• Drawbacks:
  – Require labels for the normal and/or anomaly class
  – Cannot detect unknown and emerging anomalies
  – Possible high false alarm rate: previously unseen (yet legitimate) data records may be recognized as anomalies

Unsupervised Techniques

• Assume the normal objects are somewhat "clustered" into multiple groups, each having some distinct features.
• An outlier is expected to be far away from any group of normal objects.
• Newer methods tackle outliers directly.
• Weakness: cannot detect collective outliers effectively.
  – Normal objects may not share any strong patterns, but the collective outliers may share high similarity in a small area.
• Many clustering methods can be adapted for unsupervised outlier detection.
  – Find clusters first, then outliers: the objects not belonging to any cluster.
  – Problem 1: hard to distinguish noise from outliers.
  – Problem 2: costly, since clustering comes first, yet there are far fewer outliers than normal objects.

Statistical Approaches

• Assume a parametric model describing the distribution of the data (e.g., a normal distribution).
• Apply a statistical test that depends on
  – the data distribution
  – the parameters of the distribution (e.g., mean, variance)
  – the number of expected outliers (confidence limit)
• In many cases, the data distribution/model may not be known.

Proximity-based Approaches

• Intuition: objects that are far away from the others are outliers.
• Assumption of the proximity-based approach: the proximity of an outlier deviates significantly from that of most of the others in the data set.
• Three types of proximity-based outlier detection methods:
  – Distance-based outlier detection: an object o is an outlier if its neighbourhood does not have enough other points.
  – Density-based outlier detection: an object o is an outlier if its density is relatively much lower than that of its neighbours.
  – Isolation-based outlier detection: an object o is an outlier if it is likely to be isolated from other points.

Distance-based Outlier Detection

• Nearest Neighbour (kNN) approach
  – For each data point, compute the distance d_k to its k-th nearest neighbour.
  – Sort all data points according to the distance d_k.
  – Outliers are the points that have the largest distance d_k and are therefore located in the sparsest neighbourhoods.
  – Usually, data points whose distance d_k is higher than a threshold are identified as outliers.
  – Not suitable for datasets that have modes with varying density.

Isolation-based Outlier Detection

• Isolation Forest
  – Randomly select a feature and randomly split all instances into two non-empty subsets.
  – Repeat the random split procedure on each subset.
  – An instance which is likely to be isolated (i.e., left alone in a subset) earlier than the others is considered an outlier.
  – See: https://quantdare.com/isolation-forest-algorithm/
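Both ideas are available off the shelf; the sketch below (with illustrative parameter choices and synthetic data) scores points by the distance to their k-th nearest neighbour and, separately, flags anomalies with scikit-learn's IsolationForest.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))     # dense "normal" region
outliers = rng.uniform(low=-6, high=6, size=(5, 2))        # a few scattered points
X = np.vstack([normal, outliers])

# Distance-based: distance to the k-th nearest neighbour as an outlier score
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbour
dists, _ = nn.kneighbors(X)
knn_score = dists[:, -1]                          # distance to the k-th true neighbour
print("top-5 distance-based outliers:", np.argsort(knn_score)[-5:])

# Isolation-based: IsolationForest isolates anomalies with fewer random splits
iso = IsolationForest(n_estimators=100, contamination=0.025, random_state=0).fit(X)
labels = iso.predict(X)                           # -1 marks predicted outliers
print("isolation forest outliers    :", np.where(labels == -1)[0])
```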

Clustering-based Approaches

• An object is an outlier if (1) it does not belong to any cluster, (2) there is a large distance between the object and its closest cluster, or (3) it belongs to a small or sparse cluster.
• Case 1: the object does not belong to any cluster
  – e.g., identify animals not part of a flock, using a density-based clustering method such as DBSCAN.
• Case 2: the object is far from its closest cluster (a sketch follows below)
  – Using k-means, partition the data points into clusters.
  – For each object o, assign an outlier score based on its distance from its closest centre.
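Case 2 can be sketched in a few lines (illustrative data and an assumed k): cluster with KMeans and use each object's distance to its nearest centroid as the outlier score.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=1)
X = np.vstack([X, [[10.0, -10.0]]])               # append one obvious outlier

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Outlier score = distance from each object to its closest cluster centre
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
print("most outlying objects:", np.argsort(dists)[-3:])   # the appended point ranks highest
```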


Data Reduction

• Introduction
• Why Data Reduction?

Data Reduction

• Data reduction is the elimination or de-emphasis of certain data records and attributes.
  – Data reduction may involve choosing a subset of attributes.
    • Dimensionality reduction is often used to reduce the number of dimensions to two or three.
    • Alternatively, pairs of attributes can be considered.
  – Data reduction may also involve choosing a subset of records.
    • A region of the screen can only show so many points.
    • We can sample, but we want to preserve points in sparse areas.

Principal Component Analysis

• Why PCA?
• Example
• PCA Formulation

Why PCA?

• Raw data can be high dimensional but redundant.
  – If we knew what to measure, or how to represent our measurements, we might find simple relationships.
  – But often we measure redundant signals
    • e.g., US vs EU shoe sizes.
  – Or we represent the data via the method by which it was gathered
    • e.g., the pixel representation of human face data.
• Issues
  – Data might be redundant.
  – Data might be naively represented in its raw form.
• Goal
  – Find a better representation of the data
    • to visualize it and discover hidden patterns
    • as preprocessing for a supervised task, such as attribute hashing

Example: Shoe Sizes

• We take noisy measurements of shoe size on both the US and the EU scale.
  – Modulo noise, we expect perfect correlation.
• How can we do better and find a simple, compact representation?
  – Pick a direction and project onto this direction.

Example: Shoe Sizes

• PCA
  – Try to find a direction that minimizes the distances between the original data points and their projections.
• Contrast with linear regression:
  – Linear regression: predict y from x; evaluate the accuracy of the predictions (represented by the blue line) by the vertical distances between the points and the line.
  – PCA: reconstruct the 2D data via 2D data with a single degree of freedom; evaluate the reconstructions (represented by the blue line) by the Euclidean distances between the points and the line.
• Goal:
  – Find a better data representation.
• Another perspective
  – To identify patterns we want to study variation across observations, so we can look for a compact representation that captures that variation.
  – PCA finds the directions of maximal variance.

PCA Formulation

• PCA: find a lower-dimensional representation of the data
  – $X \in \mathbb{R}^{n \times d}$: n data records, each with d attributes
    • $x_j^{(i)}$: the j-th attribute of the i-th data record
    • $\mu_j$: the mean of the j-th attribute
  – $P \in \mathbb{R}^{d \times k}$: a $d \times k$ matrix
    • its columns are the k principal components
  – $Z = XP \in \mathbb{R}^{n \times k}$: an $n \times k$ matrix
    • the reduced representation
    • the PCA 'scores'

PCA Formulation (k = 1)

• Goal: PCA finds the direction of maximal variance
  – $X \in \mathbb{R}^{n \times d}$: n (centred) data records, each with d attributes
  – $P \in \mathbb{R}^{d \times 1}$: a $d \times 1$ matrix (a single direction)
  – $Z = XP \in \mathbb{R}^{n \times 1}$: an $n \times 1$ matrix (the projected data)
• Variance of the projection:
  $$\sigma_Z^2 = \frac{1}{n} \sum_{i=1}^{n} \left(z^{(i)}\right)^2 = \frac{1}{n} \lVert Z \rVert^2 = \frac{1}{n} (XP)^\top (XP) = P^\top \left(\tfrac{1}{n} X^\top X\right) P = P^\top C_X P$$
  where $C_X$ is the covariance matrix of the (centred) data.
• Find $P$ to maximize $\sigma_Z^2$, subject to $\lVert P \rVert_2 = 1$
  – Optimization: $\max_P \; P^\top C_X P$, subject to $\lVert P \rVert_2 = 1$

PCA Formulation (k = 1)

• Goal: $\max_P \; P^\top C_X P$, subject to $\lVert P \rVert_2 = 1$
• Solution:
  – Lagrangian: maximize $P^\top C_X P - \lambda (P^\top P - 1)$
  – Differentiating with respect to $P$: $2 C_X P - 2 \lambda P = 0$
  – Hence $C_X P = \lambda P$
    • The best $P$ is the eigenvector of $C_X$ with the largest eigenvalue $\lambda$.
  – A similar argument can be used for k > 1.

Choosing k

• How should we pick the dimension of the new representation?
  – Visualization:
    • Pick the top 2 or 3 dimensions for plotting purposes.
  – Other analyses:
    • Capture 'most' of the variance in the data.
    • Recall that the eigenvalues are the variances in the directions specified by the eigenvectors, and that the eigenvalues are sorted in decreasing order.
    • Choose a k such that the fraction of retained variance is at least 95%:
      $$\frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{d} \lambda_j} \geq 0.95$$
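The whole derivation fits in a few lines of NumPy (a sketch on synthetic, purely illustrative data): centre the data, form the covariance matrix, take its eigenvectors, and keep enough of them to retain at least 95% of the variance.

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic "shoe size"-style data: 3 noisy attributes driven by 1 latent factor
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, 2 * latent, -latent]) + 0.05 * rng.normal(size=(200, 3))

# Centre the data and form the covariance matrix C_X
Xc = X - X.mean(axis=0)
C = (Xc.T @ Xc) / Xc.shape[0]

# Eigen-decomposition; eigh returns eigenvalues in ascending order, so reverse them
eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Choose k so the retained-variance fraction is at least 95%
retained = np.cumsum(eigvals) / np.sum(eigvals)
k = int(np.searchsorted(retained, 0.95) + 1)
P = eigvecs[:, :k]          # principal components (columns of P)
Z = Xc @ P                  # PCA scores, the reduced representation

print("retained variance fractions:", np.round(retained, 4))
print("chosen k:", k, "  Z shape:", Z.shape)
```

Because one latent factor drives all three attributes here, the first eigenvalue dominates and k = 1 already retains well over 95% of the variance.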

Questions?

This Session's Readings

• Principal Component Analysis
  – https://en.wikipedia.org/wiki/Principal_component_analysis
  – A Tutorial on Principal Component Analysis
  – https://www.youtube.com/watch?v=q5w8FyF3m2o
• Random Projection
  – Random projection in dimensionality reduction: Applications to image and text data